2011 11th IEEE International Conference on Data Mining

Twin Gaussian Processes for Binary Classification

Jianjun He∗, Hong Gu∗ and Shaorui Jiang∗†
∗ Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, China
† Department of Electrical and Computer Engineering, University of Florida, USA
E-mail: [email protected], [email protected], jshr8787@ufl.edu

Abstract—Gaussian process classifiers (GPCs) have recently attracted more and more attention from the machine learning community. However, because the posterior needs to be approximated by a tractable Gaussian distribution, they usually suffer from a high computational cost which is prohibitive for practical applications. In this paper, we present a new Gaussian process model, termed twin Gaussian processes, for binary classification. The basic idea is to make predictions based on two latent functions with Gaussian process priors, each of which is close to one of the two classes and as far as possible from the other. Compared with published GPCs, the proposed algorithm allows for explicit inference by analytical methods, thereby avoiding the high computational cost caused by approximating the posterior with a Gaussian distribution. Experimental results on several benchmark data sets show that the proposed algorithm is effective and achieves performance superior to the published algorithms.

Keywords: Twin Gaussian processes; Binary classification; Bayesian methods; Kernel machine

I. INTRODUCTION

Because of their extensive applications in various fields, such as mobile robots [1], [2], [3], medical diagnosis [4], [5], bioinformatics [6], [7], and software engineering [8], the research and development of machine learning technologies have attracted more and more attention from academia and industry. Owing to desirable properties such as a natural Bayesian interpretation, an explicit probabilistic formulation, and the ability to infer model parameters, Gaussian process (GP) models [9] have in recent years become important tools for many machine learning tasks, including regression [10], classification [11], [12], multi-task learning [13], relational learning [14], semi-supervised learning [15], and so on. Different from other kernel machines such as the support vector machine (SVM) [16], [17], the basic idea of GP models is to define a Gaussian prior distribution over a latent function and then carry out inference directly in the space of functions. For regression problems, a closed form of the inference can be obtained by analytical methods. However, for classification tasks, because the posterior is non-Gaussian, which usually arises from the non-Gaussian likelihood, exact inference is analytically intractable. A common approach to overcome this hurdle is to approximate

the posterior by a tractable Gaussian distribution. To improve the performance of the resulting classification algorithms, a number of Gaussian approximations of the posterior have been developed from different points of view. For example, in [11] the Laplace approximation (LA) of the posterior is obtained by a second-order Taylor expansion of the logarithm of the posterior around its maximum; Minka [18] proposed an algorithm called expectation propagation (EP) which finds a Gaussian approximation of the posterior based on approximate marginal moments; other algorithms such as variational bounding (VB) [19] and Kullback-Leibler (KL) divergence minimization [20] are also used to find approximations suited to different practical applications. Although these approaches achieve state-of-the-art performance in some applications, they suffer from a high computational cost which is prohibitive for large data sets. Taking LA and EP as examples, for a training set containing $n$ samples the computational complexity of training the models is $O(ln^3)$, where $l$ is the number of iterations of the numerical algorithm used to solve the model.

In this paper, a new Gaussian process model termed twin Gaussian processes (TGPs) is proposed for binary classification. The basic idea is to make predictions based on two latent functions with Gaussian process priors, each of which is close to one of the two classes and as far as possible from the other. Compared with traditional GP classification models, the TGPs model allows for explicit inference by analytical methods, thereby avoiding the high computational cost caused by approximating the posterior with a Gaussian distribution. Moreover, in the prediction stage, the computational complexity of the TGPs model is mainly dominated by inverting the kernel matrix of each class rather than that of the whole training set, so its computational cost is still greatly reduced compared with traditional GP models.

The paper is organized as follows. Section II presents the basic TGPs model. More details about the implementation of TGPs are given in Section III. In Section IV, we test the proposed algorithm on several benchmark data sets. Section V concludes the paper.

II. TWIN GAUSSIAN PROCESSES MODEL

Let $S=\{(x_1,y_1),(x_2,y_2),\cdots,(x_n,y_n)\}$ be the training set containing $n$ samples, where $x_i$ denotes the $i$-th training sample and $y_i\in\{+1,-1\}$ is the corresponding class label of $x_i$. The task of binary classification is to output the correct label for a new sample $x_*$. The basic idea of traditional Gaussian process models for binary classification is to determine a latent function $f(x)$ by placing a Gaussian process prior on $f(x)$ and defining a likelihood $p(y\mid f(x))$ such that the samples of each class are assigned to one of the two disjoint half-spaces $\{x\mid f(x)>0\}$ and $\{x\mid f(x)<0\}$. To define a likelihood $p(y\mid f(x))$ that reflects this idea, $f(x)$ has to be mapped into the unit interval by a sigmoid function $\mathrm{sig}:\mathbb{R}\to[0,1]$, such as the logistic function $\sigma(t)=\frac{1}{1+e^{-t}}$, i.e., $p(y=1\mid f(x))=\frac{1}{1+e^{-f(x)}}$. However, because the sigmoid likelihood is non-Gaussian, it causes considerable trouble when solving the model. In this section, we establish a Gaussian process model for binary classification in a different way, so that an explicit inference can be deduced. The main idea is to find two latent functions with Gaussian process priors, one for each class, such that each function is close to one of the two classes and is as far as possible from the other. A new sample is then classified by assigning it to the class of the function closest to it. This idea is somewhat inspired by the MPSVMs [21] and the TSVMs [22], which belong to a new kind of SVM, but it is entirely different, because the purpose of those algorithms is to find two planes in the input or feature space based on margin methods in order to improve the performance of SVM. The details of the proposed model are presented as follows.

A. Gaussian Prior

Let $f_+$ and $f_-$ be the latent functions defined on the sample space for the positive class ($+1$) and the negative class ($-1$), respectively. Similar to traditional Gaussian process models, we place zero-mean Gaussian process priors on them, i.e.,

$$f_+(x)\sim\mathcal{GP}(0,k(x,x')),\qquad f_-(x)\sim\mathcal{GP}(0,k(x,x'))\tag{1}$$

Here, $k(x,x')$ is a covariance function of the samples $x$ and $x'$, which expresses general properties of the functions $f_+$ and $f_-$ such as their smoothness and scale. It is usually a function of several hyperparameters which control these properties; a more detailed introduction to covariance functions can be found in [9]. In this paper, the following covariance function is used:

$$k(x,x')=\theta_1\,e^{-\frac{\|x-x'\|^2}{\theta_2}}\tag{2}$$
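As a concrete illustration of (2), the short sketch below (our own, non-authoritative example; the function name and toy data are not from the paper) computes the covariance matrix $K$ of a training set and the row vector $K_*$ for a test input.

```python
import numpy as np

def sq_exp_cov(X1, X2, theta1=1.0, theta2=1.0):
    """Covariance function (2): k(x, x') = theta1 * exp(-||x - x'||^2 / theta2)."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    d2 = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return theta1 * np.exp(-np.maximum(d2, 0.0) / theta2)

# Toy usage: K is the n x n covariance matrix of the training inputs in D,
# K_star is the 1 x n cross-covariance between a test input x_* and D.
X = np.random.default_rng(0).normal(size=(6, 3))
x_star = X[:1] + 0.1
K = sq_exp_cov(X, X)
K_star = sq_exp_cov(x_star, X)
```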

The joint prior distributions of the values of $f_+$ and $f_-$ on $D=\{x_1,x_2,\cdots,x_n\}$ are therefore

$$p(F_+\mid D,\theta)=\mathcal{N}(F_+\mid 0,K),\qquad p(F_-\mid D,\theta)=\mathcal{N}(F_-\mid 0,K)\tag{3}$$

and the joint prior distributions $p(f_{+,*},F_+\mid D,x_*,\theta)$ and $p(f_{-,*},F_-\mid D,x_*,\theta)$ are

$$p(f_{+,*},F_+\mid D,x_*,\theta)=\mathcal{N}\!\left(\begin{bmatrix}f_{+,*}\\F_+\end{bmatrix}\middle|\,0,\begin{bmatrix}k_{**}&K_*\\K_*^T&K\end{bmatrix}\right),\qquad p(f_{-,*},F_-\mid D,x_*,\theta)=\mathcal{N}\!\left(\begin{bmatrix}f_{-,*}\\F_-\end{bmatrix}\middle|\,0,\begin{bmatrix}k_{**}&K_*\\K_*^T&K\end{bmatrix}\right)\tag{4}$$

Here, $f_{+,*}=f_+(x_*)$, $F_+=[f_{+,1},f_{+,2},\cdots,f_{+,n}]^T$ with $f_{+,i}=f_+(x_i)$; $f_{-,*}=f_-(x_*)$, $F_-=[f_{-,1},f_{-,2},\cdots,f_{-,n}]^T$ with $f_{-,i}=f_-(x_i)$; $k_{**}=k(x_*,x_*)$; $K$ is the covariance matrix of the sample set $D$; $K_*$ is a row vector whose $i$-th element is $k(x_*,x_i)$, $i=1,2,\cdots,n$; and $\theta$ denotes the hyperparameters of the covariance function. The conditional priors (5) and (6) can be deduced analytically from the joint priors (4):

$$p(f_{+,*}\mid F_+,D,x_*,\theta)=\mathcal{N}\big(f_{+,*}\mid K_*K^{-1}F_+,\;k_{**}-K_*K^{-1}K_*^T\big)\triangleq\mathcal{N}(f_{+,*}\mid m_{+,*},A_{+,*})\tag{5}$$

$$p(f_{-,*}\mid F_-,D,x_*,\theta)=\mathcal{N}\big(f_{-,*}\mid K_*K^{-1}F_-,\;k_{**}-K_*K^{-1}K_*^T\big)\triangleq\mathcal{N}(f_{-,*}\mid m_{-,*},A_{-,*})\tag{6}$$
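To make (5) and (6) concrete, here is a minimal sketch (our own helper, not code from the paper) of the conditional mean $m_{+,*}=K_*K^{-1}F_+$ and variance $A_{+,*}=k_{**}-K_*K^{-1}K_*^T$; the negative-class case is identical with $F_-$ in place of $F_+$.

```python
import numpy as np

def conditional_prior(K, K_star, k_star_star, F):
    """Mean and variance of the conditional prior (5)/(6):
    m_* = K_* K^{-1} F,  A_* = k_** - K_* K^{-1} K_*^T."""
    # Use linear solves instead of forming K^{-1} explicitly; in practice a small
    # jitter is often added to the diagonal of K for numerical stability.
    m_star = (K_star @ np.linalg.solve(K, F)).item()
    A_star = (k_star_star - K_star @ np.linalg.solve(K, K_star.T)).item()
    return m_star, A_star
```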

B. Likelihood

As described at the beginning of Section II, the model aims at generating a function $f_+$ ($f_-$) that is close to the positive (negative) class and is as far as possible from the negative (positive) class. For this purpose, we can define the likelihoods as follows:

$$p(y=+1\mid f_+(x))=e^{-\frac{f_+^2(x)}{2}},\qquad p(y=-1\mid f_+(x))=1-e^{-\frac{f_+^2(x)}{2}},$$
$$p(y=-1\mid f_-(x))=e^{-\frac{f_-^2(x)}{2}},\qquad p(y=+1\mid f_-(x))=1-e^{-\frac{f_-^2(x)}{2}}\tag{7}$$

Thus, the joint likelihoods $p(Y\mid F_+)$ and $p(Y\mid F_-)$, i.e., the joint probabilities of observing the class labels of the sample set $D$ given the values of the latent function $f_+$ or $f_-$, can be written as

$$p(Y\mid F_+)=\prod_{i=1}^n p(y_i\mid f_{+,i})=\prod_{i\in S^+}e^{-\frac{f_{+,i}^2}{2}}\prod_{i\in S^-}\Big(1-e^{-\frac{f_{+,i}^2}{2}}\Big),$$
$$p(Y\mid F_-)=\prod_{i=1}^n p(y_i\mid f_{-,i})=\prod_{i\in S^-}e^{-\frac{f_{-,i}^2}{2}}\prod_{i\in S^+}\Big(1-e^{-\frac{f_{-,i}^2}{2}}\Big)\tag{8}$$

where $Y=[y_1,y_2,\cdots,y_n]^T$, and $S^+=\{i\mid y_i=+1,\,i=1,\cdots,n\}$ and $S^-=\{i\mid y_i=-1,\,i=1,\cdots,n\}$ denote the index sets of the positive samples and the negative samples, respectively. Note that although the likelihoods in (8) are non-Gaussian, they can be represented as sums of Gaussian functions; this property greatly facilitates the computation of the posterior.
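A small sketch (ours, for illustration) of the joint likelihood (8) for the $f_+$ process in log form; the $f_-$ case is symmetric with the roles of the two classes swapped.

```python
import numpy as np

def log_likelihood_plus(F_plus, y):
    """log p(Y | F_+) from (8): exp(-f^2/2) on positive samples,
    1 - exp(-f^2/2) on negative samples."""
    f = np.asarray(F_plus, dtype=float)
    y = np.asarray(y)
    pos = np.sum(-0.5 * f[y == +1] ** 2)
    neg = np.sum(np.log1p(-np.exp(-0.5 * f[y == -1] ** 2)))
    return pos + neg
```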

C. Posterior

Equipped with the likelihoods defined in the previous subsection, the marginal likelihoods of the hyperparameters $\theta$ with respect to $F_+$ and $F_-$ can be written as

$$p_+(Y\mid D,\theta)=\int p(Y\mid F_+)\,p(F_+\mid D,\theta)\,dF_+=\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\,|K|^{-1/2}\,\big|K^{-1}+\Lambda_{S^+}+\Lambda_\alpha\big|^{-1/2}=\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\left|E+\begin{bmatrix}K_{S^+,S^+}&K_{S^+,\alpha}\\K_{\alpha,S^+}&K_{\alpha,\alpha}\end{bmatrix}\right|^{-1/2}\tag{9}$$

and

$$p_-(Y\mid D,\theta)=\sum_{\alpha\subset S^+}(-1)^{|\alpha|}\left|E+\begin{bmatrix}K_{S^-,S^-}&K_{S^-,\alpha}\\K_{\alpha,S^-}&K_{\alpha,\alpha}\end{bmatrix}\right|^{-1/2}\tag{10}$$

where $E$ is a unit matrix; $|\cdot|$ denotes the cardinality of a set or the determinant of a matrix; $\alpha$ denotes any subset of $S^+$ or $S^-$; $\Lambda_\alpha$ denotes the diagonal matrix of order $n$ whose $i$-th diagonal element is 1 if $i\in\alpha$ and 0 otherwise; $K_{S^+,\alpha}$ is the submatrix of $K$ whose rows and columns are indexed by $S^+$ and $\alpha$, respectively, i.e., the covariance matrix between the samples $\{x_i\mid i\in S^+\}$ and the samples $\{x_i\mid i\in\alpha\}$; the meanings of the other symbols, such as $\Lambda_{S^+}$, $K_{S^+,S^+}$ and $K_{\alpha,\alpha}$, are analogous. Thus, by Bayes' rule, closed forms of the posterior distributions $p(F_+\mid D,Y,\theta)$ and $p(F_-\mid D,Y,\theta)$ can be deduced:

$$p(F_+\mid D,Y,\theta)=\frac{p(Y\mid F_+)\,p(F_+\mid D,\theta)}{\int p(Y\mid F_+)\,p(F_+\mid D,\theta)\,dF_+}=\frac{1}{p_+(Y\mid D,\theta)}\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\,|2\pi K|^{-\frac{1}{2}}\,e^{-\frac{1}{2}F_+^T\left(K^{-1}+\Lambda_{S^+}+\Lambda_\alpha\right)F_+}\tag{11}$$

and

$$p(F_-\mid D,Y,\theta)=\frac{1}{p_-(Y\mid D,\theta)}\sum_{\alpha\subset S^+}(-1)^{|\alpha|}\,|2\pi K|^{-\frac{1}{2}}\,e^{-\frac{1}{2}F_-^T\left(K^{-1}+\Lambda_{S^-}+\Lambda_\alpha\right)F_-}\tag{12}$$
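To make the structure of (9)-(12) concrete, the following brute-force sketch (ours, for tiny problems only) evaluates the marginal likelihood $p_+(Y\mid D,\theta)$ in (9) by enumerating the subsets $\alpha\subset S^-$; the cost grows as $2^{|S^-|}$, which is exactly what Section III works around. $p_-(Y\mid D,\theta)$ is obtained by exchanging the roles of $S^+$ and $S^-$.

```python
import numpy as np
from itertools import combinations

def marginal_likelihood_plus(K, S_plus, S_minus):
    """Brute-force evaluation of p_+(Y|D,theta) in (9):
    sum over subsets alpha of S_minus of (-1)^|alpha| * |E + K_{B,B}|^{-1/2},
    where B indexes the rows/columns S_plus U alpha of K."""
    total = 0.0
    for r in range(len(S_minus) + 1):
        for alpha in combinations(S_minus, r):
            idx = list(S_plus) + list(alpha)
            M = np.eye(len(idx)) + K[np.ix_(idx, idx)]
            total += (-1.0) ** r / np.sqrt(np.linalg.det(M))
    return total
```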

D. Prediction

To predict the probability that $x_*$ is a positive sample, we first need to compute the posterior distribution of $f_{+,*}$. This may be done by marginalizing the conditional prior (5):

$$\begin{aligned}p(f_{+,*}\mid D,Y,x_*,\theta)&=\int p(f_{+,*}\mid F_+,D,x_*,\theta)\,p(F_+\mid D,Y,\theta)\,dF_+\\&=\frac{1}{p_+(Y\mid D,\theta)}\int\frac{e^{-\frac{1}{2}(f_{+,*}-m_{+,*})^TA_{+,*}^{-1}(f_{+,*}-m_{+,*})}}{|2\pi A_{+,*}|^{\frac{1}{2}}}\sum_{\alpha\subset S^-}\frac{(-1)^{|\alpha|}}{|2\pi K|^{\frac{1}{2}}}e^{-\frac{1}{2}F_+^T(K^{-1}+\Lambda_{S^+}+\Lambda_\alpha)F_+}\,dF_+\\&=\frac{1}{p_+(Y\mid D,\theta)}\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\frac{|2\pi B_{+,*}^{-1}|^{\frac{1}{2}}}{|2\pi A_{+,*}|^{\frac{1}{2}}|2\pi K|^{\frac{1}{2}}}\,e^{-\frac{1}{2}f_{+,*}A_{+,*}^{-1}\left(A_{+,*}-K_*K^{-1}B_{+,*}^{-1}K^{-1}K_*^T\right)A_{+,*}^{-1}f_{+,*}}\end{aligned}\tag{13}$$

where $B_{+,*}=K^{-1}K_*^TA_{+,*}^{-1}K_*K^{-1}+K^{-1}+\Lambda_{S^+}+\Lambda_\alpha$. Thus, the probability that $+1$ is the correct class label of the sample $x_*$ is

$$p(y=+1\mid D,Y,x_*,\theta)=\int e^{-\frac{f_{+,*}^2}{2}}\,p(f_{+,*}\mid D,Y,x_*,\theta)\,df_{+,*}=\frac{1}{p_+(Y\mid D,\theta)}\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\frac{|2\pi B_{+,*}^{-1}|^{\frac{1}{2}}}{|2\pi A_{+,*}|^{\frac{1}{2}}|2\pi K|^{\frac{1}{2}}}\,\Big|2\pi\big(A_{+,*}^{-1}-A_{+,*}^{-1}K_*K^{-1}B_{+,*}^{-1}K^{-1}K_*^TA_{+,*}^{-1}+1\big)^{-1}\Big|^{\frac{1}{2}}\tag{14}$$

By applying the Woodbury matrix identity repeatedly, (14) can be written as

$$p(y=+1\mid D,Y,x_*,\theta)=\frac{\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\big|E+K_{\alpha,\alpha}-K_{\alpha,S_*^+}(E+K_{S_*^+,S_*^+})^{-1}K_{S_*^+,\alpha}\big|^{-\frac{1}{2}}\,\big|E+K_{S_*^+,S_*^+}\big|^{-\frac{1}{2}}}{\sum_{\alpha\subset S^-}(-1)^{|\alpha|}\big|E+K_{\alpha,\alpha}-K_{\alpha,S^+}(E+K_{S^+,S^+})^{-1}K_{S^+,\alpha}\big|^{-\frac{1}{2}}\,\big|E+K_{S^+,S^+}\big|^{-\frac{1}{2}}}\tag{15}$$

Here, $E$ denotes the unit matrix of the appropriate order; $S_*^+=S^+\cup\{*\}$; $K_{S_*^+,\alpha}$ is the covariance matrix between the samples $\{x_i\mid i\in S_*^+\}$ and the samples $\{x_i\mid i\in\alpha\}$; $K_{S_*^+,S_*^+}$ and $K_{\alpha,S_*^+}$ are defined analogously. By adopting the posterior distribution of $f_{-,*}$, the probability that $x_*$ is a negative sample can be obtained in the same way as $p(y=+1\mid D,Y,x_*,\theta)$:

$$p(y=-1\mid D,Y,x_*,\theta)=\int e^{-\frac{f_{-,*}^2}{2}}\,p(f_{-,*}\mid D,Y,x_*,\theta)\,df_{-,*}=\frac{\sum_{\alpha\subset S^+}(-1)^{|\alpha|}\big|E+K_{\alpha,\alpha}-K_{\alpha,S_*^-}(E+K_{S_*^-,S_*^-})^{-1}K_{S_*^-,\alpha}\big|^{-\frac{1}{2}}\,\big|E+K_{S_*^-,S_*^-}\big|^{-\frac{1}{2}}}{\sum_{\alpha\subset S^+}(-1)^{|\alpha|}\big|E+K_{\alpha,\alpha}-K_{\alpha,S^-}(E+K_{S^-,S^-})^{-1}K_{S^-,\alpha}\big|^{-\frac{1}{2}}\,\big|E+K_{S^-,S^-}\big|^{-\frac{1}{2}}}\tag{16}$$

where $S_*^-=S^-\cup\{*\}$. Finally, the label $y_*$ of the new sample $x_*$ can be predicted as

$$y_*=\arg\max\big\{p(y=+1\mid D,Y,x_*,\theta),\;p(y=-1\mid D,Y,x_*,\theta)\big\}\tag{17}$$
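The following sketch (ours, not the authors' code) applies the exact decision rule (15)-(17) to one test point by evaluating the subset sums directly on the covariance matrix of the training inputs augmented with $x_*$; this is the form of (15)/(16) before the Woodbury rearrangement. It is exponential in the size of the opposite class and only meant to make the formulas concrete; Section III derives the practical approximation.

```python
import numpy as np
from itertools import combinations

def signed_subset_sum(K_aug, base, pool):
    """sum over subsets alpha of `pool` of (-1)^|alpha| * |E + K_{B,B}|^{-1/2},
    with B = base U alpha indexing rows/columns of K_aug."""
    total = 0.0
    for r in range(len(pool) + 1):
        for alpha in combinations(pool, r):
            idx = list(base) + list(alpha)
            M = np.eye(len(idx)) + K_aug[np.ix_(idx, idx)]
            total += (-1.0) ** r / np.sqrt(np.linalg.det(M))
    return total

def tgp_predict_exact(K_aug, S_plus, S_minus, star):
    """Exact TGPs prediction (15)-(17); K_aug is the covariance matrix of the
    training inputs plus the test input, whose row/column index is `star`."""
    p_plus = (signed_subset_sum(K_aug, S_plus + [star], S_minus)
              / signed_subset_sum(K_aug, S_plus, S_minus))        # (15)
    p_minus = (signed_subset_sum(K_aug, S_minus + [star], S_plus)
               / signed_subset_sum(K_aug, S_minus, S_plus))       # (16)
    return +1 if p_plus >= p_minus else -1                        # (17)
```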

Until now, the whole TGPs model has been presented. Different from the previous GP models, the TGPs model yields an explicit inference directly by analytical methods and avoids the high computational cost of approximating the posterior. This advantage mainly comes from the likelihoods defined in (7), which in turn follow from the idea of defining one latent function for each class. In the next section, more details about the implementation of TGPs are given.

III. THE IMPLEMENTATION OF TGPS

Taking $p(y=+1\mid D,Y,x_*,\theta)$ as an example, although its explicit expression (15) was obtained in the previous section without any approximation, it is infeasible to compute it directly from (15) because this requires $2^{(|S^-|+1)}+2$ determinant evaluations. In this section, we first simplify formulas (15) and (16) and then present the detailed flow of the TGPs algorithm.

The main difficulty in computing (15) and (16) can be attributed to the computation of the following quantity:

$$g(H)=\sum_{\alpha\subset\langle m\rangle}(-1)^{|\alpha|}\,|E+H_{\alpha,\alpha}|^{-1/2}\tag{18}$$

where $\langle m\rangle=\{1,\cdots,m\}$, $H$ is a positive-semidefinite matrix of order $m$, and $H_{\alpha,\alpha}$ is the principal submatrix of $H$ whose rows and columns are indexed by $\alpha$. From the properties of $H$ we can conclude that $|E+H_{\alpha,\alpha}|\le|E+H_{\beta,\beta}|\le|E+H|$ for any $\alpha\subset\beta\subset\langle m\rangle$. If a linear function $ax+b$ is used to approximate $x^{-1/2}$, formula (18) can be written as

$$g(H)=\lambda^{-m}\Big(\sum_{\alpha\subset\langle m\rangle}(-1)^{|\alpha|}\lambda^{m-|\alpha|}\Big|\tfrac{E+H_{\alpha,\alpha}}{\lambda^2}\Big|^{-1/2}\Big)\approx\lambda^{-m}\Big(\sum_{\alpha\subset\langle m\rangle}(-1)^{|\alpha|}\lambda^{m-|\alpha|}\Big(a\Big|\tfrac{E+H_{\alpha,\alpha}}{\lambda^2}\Big|+b\Big)\Big)=-(a+b)+a\Big|E-\tfrac{E+H}{\lambda^3}\Big|+b\Big(\tfrac{\lambda-1}{\lambda}\Big)^m\tag{19}$$

where the role of the parameter $\lambda$ is to group the values of the determinants $\{|\frac{E+H_{\alpha,\alpha}}{\lambda^2}|\}$ into as small a range as possible, so that $ax+b$ approximates $x^{-1/2}$ well at all the points $\{x_\alpha=|\frac{E+H_{\alpha,\alpha}}{\lambda^2}|\}$. In this paper, $\lambda$ is set to $\big(\frac{|E+H|}{\mathrm{tr}(E+H)/m}\big)^{\frac{1}{2m-2}}$, which is the solution of the equation $\frac{1}{m}\sum_{i=1}^m\frac{(E+H)_{i,i}}{\lambda^2}=\big|\frac{E+H}{\lambda^2}\big|$; $a$ and $b$ are the coefficients of the polynomial $f(x)=ax+b$ that fits the data $\{(x_i,x_i^{-1/2})\mid x_i=\frac{1+H_{i,i}}{\lambda^2},\,i=1,\cdots,m\}$ in a least-squares sense. Although the approximation (19) of $g(H)$ may not be the best one, the experimental results in the next section show that the TGPs model obtained in this way is already superior to the traditional GP models.
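A sketch (ours) of the approximation (19). In our reading of (18)-(19), the brute-force signed sum runs over the non-empty principal submatrices, which is the part that the closed form $-(a+b)+a|E-\frac{E+H}{\lambda^3}|+b(\frac{\lambda-1}{\lambda})^m$ approximates under the linear fit; the empty subset contributes exactly 1 and can be handled separately.

```python
import numpy as np
from itertools import combinations

def g_nonempty(H):
    """Signed sum over the non-empty principal submatrices of H:
    sum_{alpha != {}} (-1)^|alpha| |E + H_{alpha,alpha}|^{-1/2}."""
    m = H.shape[0]
    total = 0.0
    for r in range(1, m + 1):
        for alpha in combinations(range(m), r):
            M = np.eye(r) + H[np.ix_(alpha, alpha)]
            total += (-1.0) ** r / np.sqrt(np.linalg.det(M))
    return total

def g_approx(H):
    """Closed-form approximation (19): -(a+b) + a|E-(E+H)/lam^3| + b((lam-1)/lam)^m."""
    m = H.shape[0]
    EH = np.eye(m) + H
    lam = (np.linalg.det(EH) / (np.trace(EH) / m)) ** (1.0 / (2 * m - 2))
    # Least-squares fit of f(x) = a*x + b to (x_i, x_i^{-1/2}) with x_i = (1 + H_ii)/lam^2.
    x = (1.0 + np.diag(H)) / lam ** 2
    a, b = np.linalg.lstsq(np.column_stack([x, np.ones_like(x)]),
                           x ** -0.5, rcond=None)[0]
    return (-(a + b)
            + a * np.linalg.det(np.eye(m) - EH / lam ** 3)
            + b * ((lam - 1.0) / lam) ** m)

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
H = B @ B.T / 6.0                     # small positive-semidefinite test matrix
print(g_nonempty(H), g_approx(H))     # compare the brute-force sum with the approximation
```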

Under the approximation (19), formula (15) can be simplified as

$$p(y=+1\mid D,Y,x_*,\theta)=\frac{-(a_{+,*}+b_{+,*})+a_{+,*}\Big|E-\frac{E+H_{+,*}}{\lambda_{+,*}^3}\Big|+b_{+,*}\Big(\frac{\lambda_{+,*}-1}{\lambda_{+,*}}\Big)^{|S^-|}}{-(a_++b_+)+a_+\Big|E-\frac{E+H_+}{\lambda_+^3}\Big|+b_+\Big(\frac{\lambda_+-1}{\lambda_+}\Big)^{|S^-|}}\cdot\frac{\big|E+K_{S_*^+,S_*^+}\big|^{-\frac{1}{2}}}{\big|E+K_{S^+,S^+}\big|^{-\frac{1}{2}}}\tag{20}$$

where $H_{+,*}=K_{S^-,S^-}-K_{S^-,S_*^+}(E+K_{S_*^+,S_*^+})^{-1}K_{S_*^+,S^-}$ and $H_+=K_{S^-,S^-}-K_{S^-,S^+}(E+K_{S^+,S^+})^{-1}K_{S^+,S^-}$. From the block structure of $E+K_{S_*^+,S_*^+}$ we can obtain the following relationship between $H_{+,*}$ and $H_+$:

$$H_{+,*}=H_+-P_{+,*}^T\,C_{+,*}^{-1}\,P_{+,*}\tag{21}$$

where $P_{+,*}=K_{*,S^+}(E+K_{S^+,S^+})^{-1}K_{S^+,S^-}-K_{*,S^-}$ is a row vector and $C_{+,*}=1+k_{**}-K_{*,S^+}(E+K_{S^+,S^+})^{-1}K_{S^+,*}$. Thus, formula (20) can be written as

$$p(y=+1\mid D,Y,x_*,\theta)=\frac{a_{+,*}\Big(\big|\bar{H}_{+,*}\big|\big(1+\frac{C_{+,*}^{-1}}{\lambda_{+,*}^3}P_{+,*}\bar{H}_{+,*}^{-1}P_{+,*}^T\big)-1\Big)+b_{+,*}\Big(\big(\frac{\lambda_{+,*}-1}{\lambda_{+,*}}\big)^{|S^-|}-1\Big)}{\Big(a_+\big(\big|E-\frac{E+H_+}{\lambda_+^3}\big|-1\big)+b_+\big(\big(\frac{\lambda_+-1}{\lambda_+}\big)^{|S^-|}-1\big)\Big)\,C_{+,*}^{\frac{1}{2}}}\tag{22}$$

where $\bar{H}_{+,*}=E-\frac{E+H_+}{\lambda_{+,*}^3}$. In a similar way, formula (16) can be simplified as

$$p(y=-1\mid D,Y,x_*,\theta)=\frac{a_{-,*}\Big(\big|\bar{H}_{-,*}\big|\big(1+\frac{C_{-,*}^{-1}}{\lambda_{-,*}^3}P_{-,*}\bar{H}_{-,*}^{-1}P_{-,*}^T\big)-1\Big)+b_{-,*}\Big(\big(\frac{\lambda_{-,*}-1}{\lambda_{-,*}}\big)^{|S^+|}-1\Big)}{\Big(a_-\big(\big|E-\frac{E+H_-}{\lambda_-^3}\big|-1\big)+b_-\big(\big(\frac{\lambda_--1}{\lambda_-}\big)^{|S^+|}-1\big)\Big)\,C_{-,*}^{\frac{1}{2}}}\tag{23}$$

where $H_-=K_{S^+,S^+}-K_{S^+,S^-}(E+K_{S^-,S^-})^{-1}K_{S^-,S^+}$, $H_{-,*}=K_{S^+,S^+}-K_{S^+,S_*^-}(E+K_{S_*^-,S_*^-})^{-1}K_{S_*^-,S^+}$, $P_{-,*}=K_{*,S^-}(E+K_{S^-,S^-})^{-1}K_{S^-,S^+}-K_{*,S^+}$, $C_{-,*}=1+k_{**}-K_{*,S^-}(E+K_{S^-,S^-})^{-1}K_{S^-,*}$, and $\bar{H}_{-,*}=E-\frac{E+H_-}{\lambda_{-,*}^3}$.
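A small numerical check (ours, with toy data and our own helper names) of the relationship (21): $H_{+,*}$ computed from its definition agrees with the rank-one update of $H_+$, which is what lets the test stage reuse the quantities cached at training time.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))                  # toy training inputs
x_star = rng.normal(size=(1, 3))             # toy test input
S_plus, S_minus = [0, 1, 2, 3], [4, 5, 6, 7]

def k(A, B, t1=1.0, t2=2.0):                 # covariance function (2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return t1 * np.exp(-d2 / t2)

K = k(X, X)
k_s = k(x_star, X)                           # K_* as a 1 x n row vector
k_ss = k(x_star, x_star).item()

Ip = np.linalg.inv(np.eye(len(S_plus)) + K[np.ix_(S_plus, S_plus)])
Xp = np.vstack([X[S_plus], x_star])          # inputs indexed by S_*^+ = S^+ U {*}
H_plus = (K[np.ix_(S_minus, S_minus)]
          - K[np.ix_(S_minus, S_plus)] @ Ip @ K[np.ix_(S_plus, S_minus)])
H_plus_star = (K[np.ix_(S_minus, S_minus)]
               - k(X[S_minus], Xp)
               @ np.linalg.inv(np.eye(len(S_plus) + 1) + k(Xp, Xp))
               @ k(Xp, X[S_minus]))

P = k_s[:, S_plus] @ Ip @ K[np.ix_(S_plus, S_minus)] - k_s[:, S_minus]  # P_{+,*}
C = 1.0 + k_ss - (k_s[:, S_plus] @ Ip @ k_s[:, S_plus].T).item()        # C_{+,*}
print(np.allclose(H_plus_star, H_plus - P.T @ P / C))                    # relation (21)
```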

The flow of the TGPs algorithm is presented as follows.

Training
Input: $D$, $Y$, $\theta$
1. Compute the covariance matrix $K$ of the sample set $D$;
2. Matrix inversions: $I_+=(E+K_{S^+,S^+})^{-1}$, $I_-=(E+K_{S^-,S^-})^{-1}$;
3. Compute the matrices $H_+$ and $H_-$; eigendecompositions: $E+H_+=U_+V_+U_+^T$, $E+H_-=U_-V_-U_-^T$;
4. $\lambda_+=\big(\frac{|V_+|}{\mathrm{tr}(V_+)/|S^-|}\big)^{\frac{1}{2|S^-|-2}}$, $\lambda_-=\big(\frac{|V_-|}{\mathrm{tr}(V_-)/|S^+|}\big)^{\frac{1}{2|S^+|-2}}$; find $a_+,b_+,a_-,b_-$, the coefficients of the polynomial $f(x)=ax+b$ that fits the data $\{(x_i,x_i^{-1/2})\mid x_i=\frac{1+(H_+)_{i,i}}{\lambda_+^2},\,i=1,\cdots,|S^-|\}$ and $\{(x_i,x_i^{-1/2})\mid x_i=\frac{1+(H_-)_{i,i}}{\lambda_-^2},\,i=1,\cdots,|S^+|\}$ in a least-squares sense, respectively;
Output: $K$, $I_+$, $I_-$, $H_+$, $H_-$, $U_+$, $V_+$, $U_-$, $V_-$, $\lambda_+$, $\lambda_-$, $a_+$, $b_+$, $a_-$, $b_-$.

Testing
Input: $x_*$, $D$, $Y$, $\theta$, $K$, $I_+$, $I_-$, $H_+$, $H_-$, $U_+$, $V_+$, $U_-$, $V_-$, $\lambda_+$, $\lambda_-$, $a_+$, $b_+$, $a_-$, $b_-$
1. Compute $K_*$, $P_{+,*}$, $P_{-,*}$, $C_{+,*}$ and $C_{-,*}$;
2. $\lambda_{+,*}=\big(\frac{|E+H_{+,*}|}{\mathrm{tr}(E+H_{+,*})/|S^-|}\big)^{\frac{1}{2|S^-|-2}}$, where $\mathrm{tr}(E+H_{+,*})=\mathrm{tr}(E+H_+)-C_{+,*}^{-1}P_{+,*}P_{+,*}^T$ and $|E+H_{+,*}|=\big(1-C_{+,*}^{-1}P_{+,*}U_+V_+^{-1}U_+^TP_{+,*}^T\big)|V_+|$; find $a_{+,*}$, $b_{+,*}$ by fitting $f(x)=ax+b$ to the data $\{(x_i,x_i^{-1/2})\mid x_i=\frac{1+(H_+)_{i,i}-C_{+,*}^{-1}(P_{+,*})_i^2}{\lambda_{+,*}^2},\,i=1,\cdots,|S^-|\}$;
3. Compute $p(y=+1\mid D,Y,x_*,\theta)$ by using (22), where $|\bar{H}_{+,*}|=\big|E-\frac{V_+}{\lambda_{+,*}^3}\big|$, $\bar{H}_{+,*}^{-1}=U_+\big(E-\frac{V_+}{\lambda_{+,*}^3}\big)^{-1}U_+^T$, and $\big|E-\frac{E+H_+}{\lambda_+^3}\big|=\big|E-\frac{V_+}{\lambda_+^3}\big|$;

4. Compute $\lambda_{-,*}$, $a_{-,*}$ and $b_{-,*}$ in a similar way to step 2;
5. Compute $p(y=-1\mid D,Y,x_*,\theta)$ by using (23) in a similar way to step 3;
6. Predict the label of $x_*$ with (17).
Note: $(H_+)_{i,i}$ denotes the $i$-th diagonal element of $H_+$ and $(P_{+,*})_i$ is the $i$-th element of the vector $P_{+,*}$.
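A compact sketch (ours, not the authors' code) of the training stage of the flow above: it caches $K$, $I_\pm$, $H_\pm$, the eigendecompositions, $\lambda_\pm$ and the linear-fit coefficients $a_\pm,b_\pm$; the test stage then evaluates (22) and (23) from these cached quantities.

```python
import numpy as np

def fit_inv_sqrt(x):
    """Least-squares coefficients (a, b) of a*x + b fitting x^{-1/2}, as used in (19)."""
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, x ** -0.5, rcond=None)[0]

def tgp_train(K, S_plus, S_minus):
    """Training steps 1-4 of the TGPs flow; returns the cached quantities."""
    cache = {}
    for sign, own, other in [("+", S_plus, S_minus), ("-", S_minus, S_plus)]:
        # Step 2: I = (E + K_{own,own})^{-1}
        I_own = np.linalg.inv(np.eye(len(own)) + K[np.ix_(own, own)])
        # Step 3: H = K_{other,other} - K_{other,own} I K_{own,other}; E + H = U V U^T
        H = (K[np.ix_(other, other)]
             - K[np.ix_(other, own)] @ I_own @ K[np.ix_(own, other)])
        v, U = np.linalg.eigh(np.eye(len(other)) + H)
        # Step 4: lambda and the least-squares fit on the scaled diagonal of H
        m = len(other)
        lam = (np.prod(v) / (np.sum(v) / m)) ** (1.0 / (2 * m - 2))
        a, b = fit_inv_sqrt((1.0 + np.diag(H)) / lam ** 2)
        cache[sign] = dict(I=I_own, H=H, U=U, V=v, lam=lam, a=a, b=b)
    return cache
```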

It can be seen that the computational complexity of training the model is mainly dominated by the matrix inversions and eigendecompositions in steps 2 and 3, which take about $O(m^3)$ operations, while the computational complexity of the testing stage is $O(m^2)$ operations, where $m=\max\{|S^+|,|S^-|\}$. Thus, the computational cost of TGPs is still greatly reduced compared with traditional GP models such as [11] and [18].

IV. EXPERIMENTS

In this section, we test the proposed algorithm on eight benchmark data sets available at the UCI machine learning repository [23]. Further details of these data sets are provided in Table I. To evaluate the efficiency of the proposed algorithm, it is compared with three published algorithms: LA [11], EP [18] and VB [19]. For all the algorithms, the covariance function (2) is used in the experiments and the hyperparameters are selected by 5-fold cross-validation. Ten runs of 10-fold cross-validation are performed on each data set to measure the performance of the compared algorithms. The average accuracy (%) with standard deviation of each algorithm is presented in Table II. It is obvious that the proposed algorithm achieves better performance than the other three algorithms on most of the benchmark data sets. In order to give a clearer view of the relative performance on each data set, a partial order "≻" is defined, where P ≻ Q denotes that algorithm P is statistically better than algorithm Q based on a left-tailed paired t-test at the 5% significance level. As shown in Table III, the proposed algorithm consistently outperforms all the other algorithms on the two physical data sets and is inferior to the other algorithms on Heart-C; all the algorithms achieve the same performance on four data sets, namely Australian, Heart-S, Hepatitis and WDBC. For an overall performance assessment, a score is assigned to each algorithm: algorithm P is rewarded with a positive score +1 and Q is penalized with a negative score −1 if P ≻ Q on a certain data set. As a result, a total order ">" is defined on these algorithms based on their accumulated scores. It can be seen from the last row of Table III that the proposed algorithm achieves the best performance, the EP algorithm takes the second place, and the LA and VB algorithms are in general not competitive.

Table I. The data sets used in the experiments

Data sets                                    | Domains   | # Attributes | # Positive samples | # Negative samples
Australian Credit Approval (Australian)      | Financial | 14           | 383                | 307
Heart Disease-Cleveland (Heart-C)            | Life      | 13           | 164                | 139
Heart Disease-Statlog (Heart-S)              | Life      | 13           | 150                | 120
Hepatitis                                    | Life      | 19           | 32                 | 123
Ionosphere                                   | Physical  | 34           | 126                | 225
Connectionist Bench-Sonar (Sonar)            | Physical  | 60           | 111                | 97
Wisconsin Diagnostic Breast Cancer (WDBC)    | Life      | 30           | 212                | 357
Wisconsin Prognostic Breast Cancer (WPBC)    | Life      | 33           | 47                 | 151

Table II. The experimental results (mean ± std. accuracy, %) of each compared algorithm on the benchmark data sets

Data sets   | The proposed algorithm | LA           | EP           | VB
Australian  | 86.35 ± 0.19           | 86.35 ± 0.26 | 86.29 ± 0.35 | 86.32 ± 0.30
Heart-C     | 81.45 ± 0.54           | 83.37 ± 0.38 | 83.30 ± 0.50 | 83.30 ± 0.50
Heart-S     | 82.89 ± 0.41           | 82.89 ± 0.31 | 82.89 ± 0.31 | 82.96 ± 0.26
Hepatitis   | 85.03 ± 0.84           | 84.13 ± 1.68 | 84.52 ± 0.79 | 84.39 ± 0.84
Ionosphere  | 93.45 ± 0.35           | 90.48 ± 0.59 | 91.28 ± 0.52 | 90.83 ± 0.62
Sonar       | 90.77 ± 0.63           | 89.23 ± 0.73 | 89.13 ± 0.73 | 88.94 ± 1.12
WDBC        | 97.43 ± 0.01           | 97.50 ± 0.19 | 97.53 ± 0.12 | 97.43 ± 0.16
WPBC        | 81.11 ± 0.28           | 80.51 ± 0.68 | 80.91 ± 0.42 | 80.61 ± 0.28

Table III. The relative performance between the compared algorithms (T: the proposed algorithm, L: LA, E: EP, V: VB)

Data sets   | Relative performance
Australian  | N/A
Heart-C     | L≻T, E≻T, V≻T
Heart-S     | N/A
Hepatitis   | N/A
Ionosphere  | T≻L, T≻E, T≻V, E≻L
Sonar       | T≻L, T≻E, T≻V
WDBC        | N/A
WPBC        | T≻V
Total order | The proposed algorithm (4) > EP (0) > LA (−2) = VB (−2)

V. CONCLUSION

Considering that previous Gaussian process classifiers usually suffer from the high computational cost caused by approximating the posterior with a Gaussian distribution, a novel Gaussian process model is proposed for binary classification in this paper. The basic idea is to find two latent functions with Gaussian process priors, one for each class, such that each function is close to one of the two classes and is as far as possible from the other. A new sample is classified according to the function closest to it. Compared with traditional GP models, this not only allows for explicit inference but also greatly reduces the computational cost. We hope that these contributions can encourage researchers to further explore and develop highly flexible Gaussian process classifiers.

ACKNOWLEDGMENT

This work was supported by the Liaoning Provincial Natural Science Foundation of China (20102025), the China Earthquake Research Funds (200808075) and the National Natural Science Foundation of China (61174027).

REFERENCES

[1] Z. Wang, J. He, and H. Gu, "Forward kinematics analysis of a six-degree-of-freedom Stewart platform based on independent component analysis and Nelder-Mead algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 41, no. 3, pp. 589-597, 2011.

[2] Z. Wang and H. Gu, "A bristle-based pipeline robot for ill-constraint pipes," IEEE/ASME Transactions on Mechatronics, vol. 13, no. 3, pp. 383-392, 2008.

[3] Z. Wang and H. Gu, "A review of locomotion mechanisms of urban search and rescue robot," Industrial Robot: An International Journal, vol. 34, no. 5, pp. 400-411, 2007.

[4] I. Kononenko, "Machine learning for medical diagnosis: history, state of the art and perspective," Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89-109, 2001.

[5] Z. Wang and H. Gu, "A review of telemedicine in China," Journal of Telemedicine and Telecare, vol. 15, no. 1, pp. 23-27, 2009.

[6] J. Tian, H. Gu, W. Liu, and C. Gao, "Robust prediction of protein subcellular localization combining PCA and WSVMs," Computers in Biology and Medicine, vol. 41, no. 8, pp. 648-652, 2011.

[7] J. Ma and H. Gu, "A novel method for predicting protein subcellular localization based on pseudo amino acid composition," BMB Reports, vol. 43, no. 10, pp. 670-676, 2010.

[8] D. Zhang and J. J. P. Tsai, "Machine learning and software engineering," Software Quality Journal, vol. 11, no. 2, pp. 87-119, 2003.

[9] C. Rasmussen and K. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.

[10] A. Ranganathan, M. Yang, and J. Ho, "Online sparse Gaussian process regression and its applications," IEEE Transactions on Image Processing, vol. 20, no. 2, pp. 391-404, 2011.

[11] C. Williams and D. Barber, "Bayesian classification with Gaussian processes," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 12, pp. 1342-1351, 1998.

[12] H. Kim and Z. Ghahramani, "Bayesian Gaussian process classification with the EM-EP algorithm," IEEE Trans. Pattern Anal. Machine Intell., vol. 28, no. 12, pp. 1948-1959, 2006.

[13] E. Bonilla, K. Chai, and C. Williams, "Multi-task Gaussian process prediction," in Advances in Neural Information Processing Systems, vol. 20. The MIT Press, 2008.

[14] Z. Xu, K. Kersting, and V. Tresp, "Multi-relational learning with Gaussian processes," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009.

[15] N. Lawrence and M. Jordan, "Semi-supervised learning via Gaussian processes," in Advances in Neural Information Processing Systems, vol. 17. The MIT Press, 2005.

[16] J. Tian, H. Gu, C. Gao, and J. Lian, "Local density one-class support vector machines for anomaly detection," Nonlinear Dynamics, vol. 64, no. 1-2, pp. 127-130, 2011.

[17] J. Tian and H. Gu, "Anomaly detection combining one-class SVMs and particle swarm optimization algorithms," Nonlinear Dynamics, vol. 61, no. 1-2, pp. 303-310, 2010.

[18] T. Minka, "Expectation propagation for approximate Bayesian inference," in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001.

[19] M. Gibbs and D. MacKay, "Variational Gaussian process classifiers," IEEE Transactions on Neural Networks, vol. 11, no. 6, pp. 1458-1464, 2000.

[20] M. Opper and C. Archambeau, "The variational Gaussian approximation revisited," Neural Computation, vol. 21, no. 3, pp. 786-792, 2009.

[21] O. Mangasarian and E. Wild, "Multisurface proximal support vector machine classification via generalized eigenvalues," IEEE Trans. Pattern Anal. Machine Intell., vol. 28, no. 1, pp. 69-74, 2006.

[22] Jayadeva, R. Khemchandani, and S. Chandra, "Twin support vector machines for pattern classification," IEEE Trans. Pattern Anal. Machine Intell., vol. 29, no. 5, pp. 905-910, 2007.

[23] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
