Bioinformatics Advance Access published November 14, 2007

vbmp: Variational Bayesian Multinomial Probit Regression for multi-class classification in R

Nicola Lama a,∗ and Mark Girolami b

a Medical Statistics Unit, Department of Medicine and Public Health, Second University of Napoli, Via Luciano Armanni 5, 80138 Napoli, Italy
b Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, Scotland, UK

Associate Editor: Dr. Trey Ideker

1 INTRODUCTION

Classification algorithms based on Gaussian Processes (GPs) have become very popular since the influential papers by Neal (1998) and Williams and Barber (1998), which motivated the development of posterior approximations that are computationally appealing alternatives to the Markov Chain Monte Carlo approach. As such, GP-based models have been widely adopted in both regression and classification applications, often being considered as alternatives, within the Bayesian inference framework (MacKay, 2003), to kernel machines. Closed-form inference of the classification model is, however, an intractable problem, and different approximation methods have been proposed in the literature (Bishop and Tipping, 2000; Tipping, 2001; Minka, 2001). Recently, Girolami and Rogers (2006) showed that exact Bayesian inference is possible in binary and multi-class GP classification by augmenting a GP classifier with Gaussian latent variables and a probit likelihood function, and provided a variational approximation. The present implementation, the vbmp package, provides a multinomial probit regression model with GP priors, adopting the Variational Bayesian (VB) approach to obtain an estimate of the required posterior distribution over the parameters. Example results are presented on a synthetic dataset and on microarray data from the breast cancer study by Kote-Jarai et al. (2006).

∗To whom correspondence should be addressed.

2 IMPLEMENTATION

The vbmp package implements training and testing of the multi-class classifier in the function of the same name for the R software environment (R Development Core Team, 2005). Several types of kernel functions are available: Gaussian, Cauchy, Laplace, polynomial, homogeneous polynomial, thin-plate spline, linear spline and inner-product kernels. An arbitrary polynomial order can be specified by appending the order number to the polynomial kernel identifier. The scale parameters of the covariance function can either be provided as an argument or be inferred by the method during the training process; in the latter case the method employs importance sampling to obtain posterior mean estimates of the parameters. The location and scale parameters of the Gamma prior over the covariance parameters default to 1e-6, but can be made more informative when appropriate. The package provides accessor methods to retrieve the main properties of the object returned by the vbmp function. In particular, these methods return estimates of the class-specific probability values, the lower bound on the marginal likelihood of the regression model, the predictive likelihood and the out-of-sample prediction error. Model convergence can be assessed during training by enabling the monitoring of evolution graphs of these quantities at each iteration.
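As an illustration, a typical call has the following shape. This is a minimal sketch, not code from the paper: argument and accessor names follow the package manual and should be checked against the installed version, and X.train, t.train, X.test and t.test are hypothetical data objects (covariate matrices and numeric class labels).

```r
library(vbmp)  # Bioconductor package

## Train and test the multi-class GP classifier in one call
res <- vbmp(X.train, t.train, X.test, t.test,
            theta = rep(1, ncol(X.train)),         # initial covariance (scale) parameters
            control = list(sKernelType = "gauss",  # Gaussian kernel
                           bThetaEstimate = TRUE,  # infer scale parameters during training
                           bMonitor = TRUE,        # plot convergence diagnostics
                           maxIts = 50))

## Accessor methods on the returned object
predClass(res)   # predicted class labels for the test set
predError(res)   # out-of-sample prediction error
predLik(res)     # predictive likelihood
lowerBound(res)  # lower bound on the marginal likelihood
covParams(res)   # posterior mean estimates of the covariance parameters
```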

3 EXAMPLES

Some illustrative experiments are provided to demonstrate the potential of the vbmp package. The R code for these examples is available on the package manual pages and in the vignette, an interactive document containing code snippets giving a more task-oriented description of package functionality.

3.1 Synthetic multi-class dataset example

Similarly to Girolami and Rogers (2006), a sample of five hundred 2-D data points (x1, x2) was drawn from three non-linearly separable classes: t1 with 0.1 < x1^2 + x2^2 < 0.5 and t2 with 0.6 < x1^2 + x2^2 < 1, both sampled uniformly, and t3 with x1^2 + x2^2 < 0.1, where (x1, x2) is bivariate Gaussian with mean zero and covariance 0.01 times the identity matrix. This dataset takes the form of two annular rings and one zero-centered Gaussian. An additional eight non-informative variables drawn from a standard normal distribution were appended to the dataset. A second sample of the same size was drawn from the same distribution for testing purposes.
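The generating process above can be sketched in base R as follows. This is an illustrative reconstruction, not the package's own code; equal class proportions are assumed, since the paper does not state the split.

```r
set.seed(1)

## Uniform sample of n points in the annulus lo < x1^2 + x2^2 < hi
ring <- function(n, lo, hi) {
  r <- sqrt(runif(n, lo, hi))   # squared radius uniform in (lo, hi)
  a <- runif(n, 0, 2 * pi)      # angle uniform on the circle
  cbind(r * cos(a), r * sin(a))
}

gen.sample <- function(n) {
  k <- n %/% 3
  X <- rbind(ring(k, 0.1, 0.5),                        # class t1: inner ring
             ring(k, 0.6, 1.0),                        # class t2: outer ring
             matrix(rnorm(2 * (n - 2 * k), sd = 0.1),  # class t3: N(0, 0.01 I)
                    ncol = 2))
  X <- cbind(X, matrix(rnorm(n * 8), n, 8))  # eight non-informative covariates
  list(X = X, t = rep(1:3, c(k, k, n - 2 * k)))
}

train <- gen.sample(500)
test  <- gen.sample(500)
```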

© The Author (2007). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]


ABSTRACT
Summary: vbmp is an R package for Gaussian Process classification of data over multiple classes. It features multinomial probit regression with Gaussian Process priors and estimates class membership posterior probabilities employing fast variational approximations to the full posterior. The software also incorporates feature weighting by means of Automatic Relevance Determination. Being equipped with only one main function and reasonable default values for optional parameters, vbmp combines flexibility with ease of use, as is demonstrated on a breast cancer microarray study.
Availability: The R library vbmp implementing this method is part of Bioconductor and can be downloaded from http://bioconductor.org/packages/2.1/bioc/html/vbmp.html
Contact: [email protected], [email protected]
Supplementary information: Supplementary data are available at http://www.dcs.gla.ac.uk/~girolami

[Figure 1 appears here: four diagnostic panels plotted against iteration (0-50): the covariance parameter posterior mean values (log(theta)), the lower bound on the marginal likelihood, the predictive likelihood (PL) and the out-of-sample percent prediction correct (100 − testErr).]

A Gaussian kernel was adopted, with scale parameters inferred by the method using vague hyperpriors (length-scale hyperparameters set to 10^-6). Figure 1 shows the results and convergence status plotted by the vbmp method at each iteration. The graphs highlight the predictive evolution of the method, which achieves a 0.022 error rate on the test dataset after 50 iterations. Figure 1 also demonstrates the Automatic Relevance Determination (ARD) process (Neal, 1998), which forces the two informative covariates to small scale parameters while penalizing the eight noisy input variables.

3.2 Breast cancer dataset example

In this example, the data come from the study by Kote-Jarai et al. (2006), in which the differential gene expression changes following radiation-induced DNA damage in healthy cells from BRCA1/BRCA2 mutation carriers were compared with controls using high-density microarray technology. The dataset consists of 8,080 cDNA clones of fibroblast cultures from 10 control samples, 10 BRCA1 and 10 BRCA2 mutation carriers. The code available for this example reproduces the leave-one-out cross-validation (LOOCV) prediction performance obtained with the vbmp method on this dataset. Using an inner-product (linear) covariance kernel and 25 training iterations, the vbmp multi-class classifier achieved 100% LOOCV accuracy, outperforming the SVM classifier adopted by Kote-Jarai et al. (2006) to distinguish the different types of samples studied in this experiment.
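The LOOCV procedure can be sketched as follows. This is illustrative only: expr and labels are hypothetical objects holding the expression matrix (one sample per row) and the numeric class labels, and function names follow the package manual rather than the exact vignette code.

```r
library(vbmp)  # Bioconductor package

n <- nrow(expr)
pred <- integer(n)
for (i in seq_len(n)) {
  ## Hold out sample i, train on the rest, predict the held-out sample
  fit <- vbmp(expr[-i, ], labels[-i],
              expr[i, , drop = FALSE], labels[i],
              theta = rep(1, ncol(expr)),
              control = list(sKernelType = "iprod",  # linear (inner-product) kernel
                             maxIts = 25))
  pred[i] <- predClass(fit)
}
mean(pred == labels)  # LOOCV accuracy
```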

4 DISCUSSION

The vbmp package implements a VB approach to flexibly model multi-class datasets. This non-parametric approach is developed within a probabilistic framework for Bayesian inference which yields efficient sparse approximations by optimizing a strict lower bound on the marginal likelihood of a multinomial probit regression model. Compared with the multinomial logit approach, this method is appealing since it provides a means of developing a Gibbs sampler and subsequent computationally efficient approximations for the GP random variables. Girolami and Rogers (2006) showed how the multi-class probit GP could be made sparse through Informative Vector Machine updates (Lawrence et al., 2005). To our knowledge, the only other multinomial probit regression R tool available is MNP (Kosuke and van Dyk, 2005), which performs model fitting by resorting to Markov chain Monte Carlo. Girolami and Zhong (2007) compared the VB approximation implemented in this package to the Expectation Propagation (EP) approximation (Minka, 2001) and showed that both approaches performed as well as a Gibbs sampler and consistently outperformed the Laplace approximation. This software also implements feature weighting by exploiting the ARD approach to emphasize the most relevant input variables while reducing the impact of those that do not contribute significantly. The package does not implement any procedure for extensive ad-hoc tuning of the solution (e.g. as for SVM classifiers), since this is not needed by the method. It is noteworthy that the vbmp method attaches confidence measures to each class assignment, which could be very useful in clinical practice applications.

Fig. 1. Monitor evolution diagnostics for the estimates of: posterior means of the Gaussian kernel scale parameters (top left), lower bound on the marginal likelihood (top right), predictive likelihood (bottom left) and out-of-sample accuracy (0/1-error loss, bottom right) on the synthetic dataset of Section 3.1.

ACKNOWLEDGEMENT
NL acknowledges research fellowship support from Associazione Italiana per la Ricerca sul Cancro (AIRC). M.G. is an EPSRC Advanced Research Fellow (EP/E052029/1).
Conflict of Interest: none declared.

REFERENCES
Bishop,C.M. and Tipping,M.E. (2000) Variational relevance vector machines. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 46-53.
Girolami,M. and Rogers,S. (2006) Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18, 1790-1817.
Girolami,M. and Zhong,M. (2007) Data integration for classification problems employing Gaussian process priors. In Advances in Neural Information Processing Systems 19, eds B. Scholkopf, J.C. Platt and T. Hofmann, MIT Press, Cambridge.
Kosuke,I. and van Dyk,D.A. (2005) MNP: R package for fitting the multinomial probit models. Journal of Statistical Software, 14(3), 1-32.
Kote-Jarai,Z., Matthews,L., Osorio,A., Shanley,S., Giddings,I., Moseews,F., Locke,I., Evans,G., Girolami,M., Williams,R. and Campbell,C. (2006) Accurate prediction of BRCA1 and BRCA2 heterozygous genotype using expression profiling after induced DNA damage. Clinical Cancer Research, 12(13), 3896-3901.
Lawrence,N.D., Platt,J.C. and Jordan,M.I. (2005) Extensions of the informative vector machine. In J. Winkler, N.D. Lawrence and M. Niranjan, editors, Deterministic and Statistical Methods in Machine Learning, Springer-Verlag.
MacKay,D. (2003) Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Minka,T. (2001) A family of algorithms for approximate Bayesian inference. Doctoral dissertation, MIT.
Neal,R. (1998) Regression and classification using Gaussian process priors. In A.P. Dawid, M. Bernardo, J.O. Berger and A.F.M. Smith, editors, Bayesian Statistics 6, 475-501, Oxford University Press.
R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Tipping,M. (2001) Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211-244.
Williams,C.K.I. and Barber,D. (1998) Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342-1352.