Bayesian Multinomial Classification Method using Kernels

March 7, 2007

Abstract

A Bayesian approach to multi-category classification based on reproducing kernel Hilbert spaces (RKHS) is proposed. The likelihood function is taken to be multinomial logistic and is modeled through a latent variable. Individual variance hyperparameters are placed on the weight coefficients, which is known to promote sparseness in the MAP estimates. The hierarchical model is treated with a fully Bayesian inference procedure, and a Gibbs sampler is implemented to obtain the posterior distributions of the parameters. Generally, kernel classifiers are aimed at high dimensional data, but preprocessing steps are often taken in order to simplify the complexity of the fitted model. The Bayesian RKHS classifier is able to achieve good classification results without dimension reduction, considerably reducing the manual pre-processing that is usually required.

1 Introduction

In a multi-category classification problem, the training data are available in the form of observations x ∈ R^J and their corresponding classes. The aim is to infer a function from the data that partitions R^J into regions corresponding to the different classes. One of the simplest approaches to this problem is logistic regression. In binary classification, the function is modeled as an inverse logit of a weighted linear combination of the predictors, with the weights estimated from the training data. Basis expansion modeling extends this by replacing the predictors x with some nonlinear transformation in the form of basis functions. This paper presents a classification method based on the theory of reproducing kernel Hilbert spaces (RKHS) (Aronszajn, 1950; Parzen, 1970), where the basis functions are taken to be reproducing kernel functions. It is a fully Bayesian hierarchical model; prior distributions are specified for the parameters and an MCMC algorithm is used to obtain samples from the posterior. Hence, it is possible to obtain estimates of uncertainty for the parameters. The MCMC sampling approach reduces the need to calibrate the model parameters before running the algorithm in order to achieve a reasonable classification. The algorithm obtains samples from the posterior distribution of the model parameters, which can be used to compute samples from the distributions of the class probabilities. In the examples given in this paper only the average class probabilities are reported, but the posterior distributions give a more complete picture of the classification and can be useful in identifying borderline observations.

The logistic classifier is modeled through a latent variable, which is treated as random and given a Gaussian prior. The use of a latent variable in the Bayesian treatment of binary and multinomial classification simplifies the computation and was pioneered by Albert and Chib (1993) and Holmes and Held (2005). This formulation allows the likelihood function to be replaced by alternatives, and different classes of reproducing kernels are also possible. This paper largely follows the model construction of Mallick et al. (2005), but extends their method to multi-category classification by using a multinomial logistic likelihood. As well as the logistic likelihood, Mallick et al. (2005) use a stochastic version of the support vector machine (SVM) likelihood, to which a multi-category extension exists (Zhang and Jordan, 2006).

SVMs are a related concept; see Vapnik (1998), Cristianini and Shawe-Taylor (2000) and Scholkopf and Smola (2002). There the likelihood function is modeled through a hinge loss function. SVMs are a sparse deterministic method, as they reduce the set of weight coefficients to a small subset of support vectors. This sparseness property enables SVMs to generalize well, which makes them a popular classification method. However, only point estimates are obtained (with many weight parameters β set to 0), and no measures of uncertainty are provided by the method, which has motivated Bayesian treatments; see Sollich (2002), Tipping (2000, 2001) and Mallick et al. (2005). The SVM is a methodology aimed at binary classification, but papers have recently emerged aiming to extend it to multi-class problems; a review of such methods can be found in Hsu and Lin (2002). Noteworthy is the multi-category support vector machine (MSVM) of Lee et al. (2004), which Zhang and Jordan (2006) extend to a hierarchical Bayesian model.

The model presented here uses the prior structure of Automatic Relevance Determination (ARD) (MacKay, 1996; Neal, 1996), originally developed in the Bayesian treatment of neural networks. It places Gaussian zero-mean priors on the weight parameters β. The key feature of the mechanism is that an individual variance hyperparameter is given to each β. This prior construction underlies most of the work on Bayesian kernel classifiers, notably the Relevance Vector Machine (RVM) (Tipping, 2000, 2001). In a similar approach, Figueiredo (2002) uses a probit likelihood for binary classification and regression. The hierarchical formulation with independent exponential priors on the variances of the βs is equivalent to placing double exponential (Laplacian) priors on the β_j, which promote sparseness (Figueiredo, 2002; Bishop and Tipping, 2000).


This is analogous to the Bayesian formulation of the Least Absolute Shrinkage and Selection Operator (lasso) (Tibshirani, 1996), where the MAP estimates of the βs are encouraged to be either large or exactly zero. Figueiredo (2002) notes that the Jeffreys independence prior has the effect of shrinking some coefficients to zero. Other related approaches include Sollich (2002), who developed a Bayesian treatment of the SVM, and Chakraborty et al. (2005), who developed a fully Bayesian SVM for regression.

The method is general and can be used for a variety of classification problems. Here it is tested on two examples: a simple benchmark data set and a microarray tumour data set. Kernel classifier methods are aimed at data with high dimensional feature spaces (large J), such as the microarray example in this paper. One advantage of the method presented here is that it does not require preprocessing in order to reduce the size of the feature space. Depending on the application, it can also be sensible for all features to be used for classification rather than a smaller collection of features chosen by an external, possibly unsatisfactory, selection criterion. The current formulation of the model allows feature 'weights' to be estimated from the data, hence allowing for feature selection within the algorithm. At present the model is highly parameterized, but the framework is open to sparsity-inducing adjustments; one example would be to include prior information about the features via the prior specification of the weight parameters.

Section 2 describes the model and Section 3 describes the MCMC algorithm used to obtain the posterior distributions of the parameters. The model is tested on two data sets in Section 4. Some final remarks are made in Section 5.

2 Model

2.1 Multinomial Logistic Likelihood

The training data are n samples (x_1, y_1), ..., (x_n, y_n), where the predictors x_i = (x_{i1}, ..., x_{iJ}) are real-valued J-dimensional vectors of feature values and y_i = (y_{i1}, ..., y_{iK}) are K-dimensional categorical response variables with y_{ik} = 1 if x_i belongs to class k and 0 otherwise. A standard approach to this classification problem is the multinomial logistic regression model

\[
P(y_k = 1 \mid x) = \frac{\exp(z_k)}{\sum_{l=1}^{K} \exp(z_l)}, \qquad (1)
\]

where z_k = x^T β_k for a set of regression parameters β_k = [β_{1k}, β_{2k}, ..., β_{Jk}] corresponding to class k. In order to make the model of full rank, y_{iK} is removed from the model for all i; in this parameterization β_K = 0. Note that this model, by construction, makes the assumption that the logit probabilities are linear in the predictors. An alternative approach is to replace x with a transformation, and subsequently use a linear model in the new space of input features. The kernel classifier described in this paper is one such approach.
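For concreteness, the class probabilities in (1) can be computed directly, as in the following sketch (our own illustrative code, not part of the paper; the names and toy data are assumptions), with β_K fixed to zero for identifiability:

```python
import numpy as np

def class_probabilities(x, beta):
    """Multinomial logistic probabilities as in equation (1).

    x    : (J,) feature vector
    beta : (J, K) matrix of regression coefficients; the last column
           is fixed to zero for identifiability (beta_K = 0).
    """
    z = x @ beta                  # linear predictors z_k = x^T beta_k
    z = z - z.max()               # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy usage: 3 classes, 4 features, beta_K = 0 in the last column.
rng = np.random.default_rng(0)
beta = np.column_stack([rng.normal(size=(4, 2)), np.zeros(4)])
print(class_probabilities(rng.normal(size=4), beta))
```

Subtracting the maximum of the linear predictors before exponentiating leaves the probabilities unchanged and avoids numerical overflow.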

2.2 Reproducing Kernel Hilbert Spaces

Let x_{ij} be a measurement of the j-th feature for the i-th sample. The dependence of y on x is through a linear combination of kernel functions parameterized by θ. Following (1), the logistic likelihood for the training data is

\[
p(\mathbf{y} \mid \mathbf{z}) = \prod_{i=1}^{n} \prod_{k=1}^{K} p(y_{ik} = 1 \mid z_{ik})^{y_{ik}},
\]

with

\[
p(y_{ik} = 1 \mid z_{ik}) = \frac{\exp(z_{ik})}{\sum_{l=1}^{K} \exp(z_{il})},
\]

as defined in equation (1); however, the z_{ik} are now linear combinations of the kernel functions:

\[
z_{ik}(x_i, \beta_k, \theta) = \sum_{l=1}^{n} \beta_{kl} K(x_i, x_l \mid \theta) = \mathbf{K}_i^T \beta_k, \qquad i = 1, \ldots, n.
\]

In order to improve the mixing and convergence of the MCMC algorithm used to implement inference for the model, the functions z_{ik} are re-defined as Gaussian random variables with means K_i^T β_k and variance σ² (Mallick et al., 2005; Holmes and Held, 2005; Denison et al., 2002). The Bayesian treatment of the RKHS approach places a normal prior on the β parameters, so a conjugate definition of the predictors z_i allows the joint conditional density of the βs to be derived analytically, and as a result the β parameters can be updated simultaneously. In this work, the Gaussian kernel was used:

\[
K(x_i, x_l \mid \theta) = \exp\left(-\sum_{j=1}^{J} \theta_j (x_{ij} - x_{lj})^2\right), \qquad i, l = 1, \ldots, n.
\]

Note that in most kernel methods only a single θ parameter is used, sometimes referred to as the bandwidth parameter.
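A minimal sketch of the weighted Gaussian kernel above (our own code, not from the paper; names are illustrative); setting all θ_j to a common value recovers the single-bandwidth special case:

```python
import numpy as np

def gaussian_kernel(X, theta):
    """K(x_i, x_l | theta) = exp(-sum_j theta_j (x_ij - x_lj)^2).

    X     : (n, J) matrix of predictors
    theta : (J,) vector of non-negative feature weights
    """
    # Squared differences for every pair (i, l) and feature j: shape (n, n, J)
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-diff2 @ theta)

# Example: 5 samples with 3 features; equal weights give a single-bandwidth kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
K = gaussian_kernel(X, theta=np.full(3, 0.5))
print(K.shape, np.allclose(K, K.T))   # (5, 5) True
```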

2.3 Prior Specification

In a Bayesian inference approach, priors are assigned to the parameters β, z, σ², τ and θ.


The prior model is specified as:

\begin{align*}
z_{ik} &\sim N(\mathbf{K}_i^T \beta_k, \sigma^2), \\
\beta_k &\sim MVN(\mathbf{0}, \sigma^2 \mathbf{T}_k^{-1}), \\
\sigma^2 &\sim IG(\gamma_1, \gamma_2), \\
\tau_{ik} &\sim G(\gamma_3, \gamma_4), \\
\theta_j &\sim U(0, a_u),
\end{align*}

where T_k is a diagonal matrix with entries τ_{1k}, ..., τ_{nk}, G denotes a gamma prior, IG an inverse gamma, MVN is a multivariate normal of dimension n, and U is the uniform probability density function over the interval (0, a_u). The directed acyclic graph representation of the model is given in Figure 1.

[Figure 1 appears here: a directed acyclic graph with nodes γ1, γ2, γ3, γ4, au, τ, σ², θ, β, z, K, x and y.]

Figure 1: The directed acyclic graph of the model.
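To illustrate the sparseness-promoting effect of this ARD-type hierarchy, the short sketch below (our own code, with arbitrary illustrative hyperparameter values) draws weights by first sampling individual precisions τ; marginally the draws are heavy-tailed, with many values near zero and a few large ones.

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma3, gamma4, sigma2 = 1000, 1.0, 1.0, 1.0   # illustrative values only

# Hierarchical draw: tau_i ~ Gamma(gamma3, rate=gamma4), beta_i ~ N(0, sigma2 / tau_i)
tau = rng.gamma(shape=gamma3, scale=1.0 / gamma4, size=n)
beta = rng.normal(loc=0.0, scale=np.sqrt(sigma2 / tau))

# Compared with a plain Gaussian of the same typical scale, these draws are
# heavy-tailed: most values are near zero while a few are very large.
print(np.quantile(np.abs(beta), [0.5, 0.99]))
```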

3 Inference

A Metropolis-within-Gibbs algorithm was used for sampling from the posterior. The output of the MCMC is a set of samples (β^(m), θ^(m), z^(m), σ²^(m), τ^(m)), for m = 1, ..., M iterations, from the joint posterior:

\begin{align*}
p(\beta, \theta, \mathbf{z}, \tau, \sigma^2 \mid \mathbf{y})
&\propto p(\mathbf{y} \mid \mathbf{z}, \beta, \theta, \tau, \sigma^2)\, p(\mathbf{z} \mid \beta, \theta, \sigma^2)\, p(\beta \mid \tau)\, p(\theta)\, p(\tau)\, p(\sigma^2) \\
&\propto \prod_{i=1}^{n} p(\mathbf{y}_i \mid \mathbf{z}_i) \\
&\quad \times \frac{\exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{k=1}^{K-1} (z_{ik} - \mathbf{K}_i^T \beta_k)^2\right)}{(\sigma^2)^{n(K-1)/2}} \\
&\quad \times \frac{\exp\!\left(-\frac{1}{2\sigma^2} \sum_{k=1}^{K-1} \beta_k^T \mathbf{T}_k \beta_k\right)}{(\sigma^2)^{n(K-1)/2} \prod_{k=1}^{K-1} |\mathbf{T}_k|^{-1/2}} \\
&\quad \times \exp(-\gamma_2 / \sigma^2)\, (\sigma^2)^{-\gamma_1 - 1} \\
&\quad \times \prod_{i=1}^{n} \prod_{k=1}^{K-1} \exp(-\gamma_4 \tau_{ik})\, (\tau_{ik})^{\gamma_3 - 1}.
\end{align*}

The full conditional distributions that were sampled from can be found in Appendix A. The MCMC algorithm iterates through block updates of the parameters, starting with z. Each z_i = [z_{i1}, ..., z_{i(K-1)}] is proposed conditionally on the rest of the parameters, including the matrix z without its i-th row. In the second step, the parameters θ are proposed jointly. The proposal densities for both the z_i and θ are random walks, and both are sampled using a Metropolis step within the Gibbs algorithm, as sketched below. Subsequently, the parameters β, σ² and τ are block updated directly from their full conditional distributions via Gibbs steps.
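The random-walk Metropolis step used for the z_i and θ blocks can be sketched generically as follows (our own illustrative code; the log full conditionals themselves are those given in Appendix A):

```python
import numpy as np

def rw_metropolis_step(current, log_conditional, step_sd, rng):
    """One random-walk Metropolis update of a parameter block.

    current         : current value of the block (scalar or 1-D array)
    log_conditional : function returning the log full conditional density
    step_sd         : standard deviation of the Gaussian random-walk proposal
    """
    proposal = current + rng.normal(scale=step_sd, size=np.shape(current))
    log_ratio = log_conditional(proposal) - log_conditional(current)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True     # accept
    return current, False         # reject

# Toy usage: sample from a standard normal target.
rng = np.random.default_rng(3)
x, draws = 0.0, []
for _ in range(5000):
    x, _ = rw_metropolis_step(x, lambda v: -0.5 * v**2, step_sd=1.0, rng=rng)
    draws.append(x)
print(np.mean(draws), np.std(draws))   # roughly 0 and 1
```

For θ, a proposal falling outside the support (0, a_u) of the uniform prior would simply be rejected.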

The model can also be formulated so that the kernel contains a single bandwidth parameter θ:

\[
K(x_i, x_l \mid \theta) = \exp\left(-\theta \sum_{j=1}^{J} (x_{ij} - x_{lj})^2\right), \qquad i, l = 1, \ldots, n.
\]

The uniform prior is placed on θ as in Section 2.3 and the conditional distribution p(θ|z) remains unchanged. The proposal density was also taken to be a random walk.

A new observation x* is classified into class k* = arg max_k p(k|x*, x, y). This is given by the usual Monte Carlo integration approximation:

\[
p(k \mid x^*, \mathbf{x}, \mathbf{y}) \approx \frac{1}{M - \text{burn-in} + 1} \sum_{m=\text{burn-in}}^{M} \frac{\exp\!\left(\mathbf{K}_*^{(m)T} \beta_k^{(m)}\right)}{1 + \sum_{l=1}^{K-1} \exp\!\left(\mathbf{K}_*^{(m)T} \beta_l^{(m)}\right)},
\]

where K_{*i}^(m) = K(x*, x_i | θ^(m)) and β_K^(m) = 0.
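Given stored post-burn-in draws of θ and β, this Monte Carlo average can be computed as in the following sketch (our own code; array shapes and names are assumptions, with β_K = 0 handled by appending a zero to the linear predictors):

```python
import numpy as np

def predict_class(x_new, X_train, theta_draws, beta_draws):
    """Average the class probabilities over post-burn-in posterior draws.

    x_new       : (J,) new observation
    X_train     : (n, J) training predictors
    theta_draws : (M, J) posterior draws of the kernel weights
    beta_draws  : (M, n, K-1) posterior draws of the weights (beta_K = 0)
    """
    M, n, Km1 = beta_draws.shape
    probs = np.zeros(Km1 + 1)
    for theta, beta in zip(theta_draws, beta_draws):
        k_star = np.exp(-((x_new - X_train) ** 2) @ theta)   # K(x*, x_i | theta)
        z = np.append(k_star @ beta, 0.0)                     # z_K = 0
        p = np.exp(z - z.max())
        probs += p / p.sum()
    probs /= M
    return probs.argmax(), probs

# Toy usage with simulated "posterior" draws.
rng = np.random.default_rng(4)
X, M, K = rng.normal(size=(6, 3)), 50, 3
cls, p = predict_class(rng.normal(size=3), X,
                       rng.uniform(0, 1, size=(M, 3)),
                       rng.normal(size=(M, 6, K - 1)))
print(cls, p)
```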

4 Examples

4.1 Wine recognition data

The data are the results of a chemical analysis aimed at classifying wines of three different origins. The wines were grown in the same region but come from different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines, and 178 samples were tested. The data were divided into a training (labelled) sample and a testing (unlabelled) sample with almost equal numbers of observations; see Table 1. The complete data set and its kernel can be seen in Figure 2. These plots indicate that the groups are fairly separable and a classification algorithm should be able to achieve a good test error rate.

Class      Training   Testing
class 1       30         29
class 2       36         35
class 3       24         24

Table 1: Number of wine samples of each class for the training and testing sets.

True class   Classified as   Test error rate by groups
class 1      class 2         0.0034
class 2      class 1         0.0091
class 2      class 3         0.0045
class 3      class 2         0.0023

Table 2: Average misclassification rates for the types of wine, presented as a proportion of the testing data over ten random splits. The total average test error rate was 0.019.

The MCMC algorithm was run for 100,000 iterations, with the first 20,000 iterations discarded. This was repeated over ten random splits of the data into training and testing sets, and the results were averaged over the ten splits. The average test error rate was 0.019 with standard deviation 0.007, which compares favourably with the results reported in Lee et al. (2004) and Zhang and Jordan (2006). Both papers report a slightly smaller test error rate of 0.0169, but both use a leave-one-out method to evaluate classification results. This method is computationally more intensive and is not well suited to MCMC implementations; rather, it is more commonly utilized for model selection. As it uses more data in the training set than in the test set, it will give an approximately unbiased estimate of the prediction error, but at the cost of high variance (Hastie et al., 2001).


[Figure 2 appears here: left panel, "wine data coloured by groups", plots the normalised response for all observations against the 13 features; right panel shows an image of the kernel matrix of the data.]

Figure 2: Individual observations in the wine data are plotted and coloured by group. The graphs show that there is a systematic difference between the three groups in different parts of the feature space. The kernel of the data is also plotted. The classification groupings can be seen in the image of the kernel matrix: the classification labels are 1-59 for group 1, 60-130 for group 2 and 131-178 for group 3.

Table 2 shows that no misclassification occurs between classes 1 and 3. The graphs in Figure 3 indicate that classes 1 and 3 might be the easiest to distinguish in this feature space, and it is encouraging to see this reflected in the misclassification rates.

The trace plots of the θ parameters for a run of the algorithm for each of the ten random splits are given in Figure 4. These graphs indicate that, due to the over-parameterization of the model, the posterior distributions of the θs are flat, as the data do not contain enough information about the parameters. In kernel classifier methods all of the data is summarized within the kernel, so a proper parameterization is paramount to extracting from the kernel the information useful for discriminating between the classes. In the MCMC setting one would expect the θ parameters to converge to a posterior distribution, but in this model convergence has not been reached. Nevertheless, the sampler has "converged" to a good classification algorithm. Consequently, the results are extremely sensitive to the choice of the starting value of θ as well as to the standard deviation of the random walk proposal. In each run, the algorithm was tested for four combinations of these two parameters and the best combination was chosen.

The algorithm was also run with a single bandwidth parameter θ. The trace plots of one run of the algorithm for each of the ten random splits are given in Figure 5. These graphs show better convergence, as expected. The average misclassification rate for this formulation was 0.026 with standard deviation 0.012. The breakdown of the misclassifications is similar to the multiple θ model; see Table 3.

[Figure 3 appears here: three panels of pairwise scatter plots, "gp1 vs gp2", "gp2 vs gp3" and "gp1 vs gp3".]

Figure 3: The model should distinguish most easily between groups 1 and 3.

In spite of the issues surrounding over-parameterization, defining the kernel with multiple θs allows a wider class of kernels to be explored. It also opens up the model to sparsity-inducing adjustments, which is further considered in Section 5.

True class   Classified as   Test error rate by groups
class 1      class 2         0.0045
class 2      class 1         0.0091
class 2      class 3         0.0068
class 3      class 2         0.0057

Table 3: Average misclassification rates for the types of wine, presented as a proportion of the testing data set. The total average test error rate was 0.026 for the model with a single parameter θ.

4.2 Microarray data

Khan et al. (2001) describe gene expression profile data consisting of eighty-three mRNA microarray slides, divided into a training and a testing data set; see Table 4. Each microarray slide corresponds to an individual suffering from one of four tumour types (EWS, BLC, NB and RMS). A total of 2308 gene profiles are reported for each slide. The original data set contains five additional test slides that do not correspond to any of the four types of tumours; these were removed from the analysis. This corresponds to a 4-category classification problem with a large number of features (J = 2308) and a small number of observations (n = 83). The data set is available from http://research.nhgri.nih.gov/microarray/Supplement/.

[Figure 4 appears here: ten panels of MCMC trace plots.]

Figure 4: The trace plots of the θ parameters for a run of the algorithm for each of the ten random splits for the wine data. The algorithm was run for 100,000 iterations; only every tenth iteration is plotted here.

The aim of the analysis is to classify the slides into one of four tumour types on the basis of the gene profiles. The image of the kernel matrix (with θ_j set to 1 for all j) of the training data set is given in Figure 6. A simple criterion of Dudoit et al. (2002) gives the marginal relevance measure of feature x_f in class separation:

\[
r_f = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n} I(y_{ik} = 1)\left(\bar{x}_{.f}^{(k)} - \bar{x}_{.f}\right)^2}{\sum_{k=1}^{K} \sum_{i=1}^{n} I(y_{ik} = 1)\left(x_{if} - \bar{x}_{.f}^{(k)}\right)^2},
\]

where \bar{x}_{.f} is the overall average of feature f and \bar{x}_{.f}^{(k)} is the average of feature f in group k. The relevance criterion for each feature was used as the starting value of the parameter θ_f. The kernel with these weights is also given in Figure 6; class separation is clearer in the image of the matrix at these values of θ. The algorithm was run with these starting values for θ and with all of the starting values equal. For the same split of the data into training and testing sets as used by Khan et al. (2001) and Lee et al. (2004), the algorithm classifies all of the observations correctly, with all of the features utilized in the construction of the kernel matrix. In addition, the algorithm was run for ten random splits of the data with approximately equal sample sizes in the training and testing data sets; see Table 5. The average misclassification error was 0.06 with standard deviation 0.04.
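The marginal relevance criterion used above for the starting values can be computed as in this sketch (our own code, with illustrative toy data), which returns one weight r_f per feature:

```python
import numpy as np

def marginal_relevance(X, y):
    """Dudoit et al. (2002) between/within sum-of-squares ratio per feature.

    X : (n, J) feature matrix
    y : (n,) integer class labels
    """
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        between += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return between / within

# Toy usage: feature 0 separates the two groups, feature 1 is noise.
rng = np.random.default_rng(5)
y = np.repeat([0, 1], 25)
X = np.column_stack([y + 0.3 * rng.normal(size=50), rng.normal(size=50)])
print(marginal_relevance(X, y))   # first value much larger than the second
```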

[Figure 5 appears here: ten panels of MCMC trace plots.]

Figure 5: The trace plots of the single bandwidth parameter θ for a run of the algorithm for each of the ten random splits for the wine data. The algorithm was run for 100,000 iterations; only every tenth iteration is plotted.

5 Discussion

The kernel classifier presented here is a fully Bayesian, genuine multi-category extension of the Bayesian binary kernel classifier. It achieves good results on the benchmark wine data and on the high dimensional microarray data set. The classifier does not require pre-processing steps to reduce the dimension of the feature space; note that the tumour classification problem contained 2308 features for only 83 samples. Kernel methods function through the matrix K, whose dimension depends only on the number of observations n. If they work well, they condense the information in the large variable space down to this smaller dimension.

Class    Training   Testing
EWS         23          6
BLC          8          3
NB          12          6
RMS         20          5

Table 4: Number of mRNA slides of each tumour class in the training and testing sets.

[Figure 6 appears here: two images of the 63 x 63 kernel matrix of the microarray training set.]

Figure 6: The kernel matrix of the microarray training set before and after weighting the features by their marginal relevance measure. The sections of the matrix correspond to: 1-23 EWS, 24-31 BLC, 32-43 NB, and 44-63 RMS.

Class    Training   Testing
EWS         15         14
BLC          6          5
NB           9          9
RMS         13         12

Table 5: Number of mRNA slides of each tumour class in the training and testing sets for each of the ten random splits of the data.

Therefore, kernel methods are ideally suited to problems with a large feature space (J) and small n. However, they are highly parameterized, and this is seen in the poor MCMC mixing for the θs, even though the classification is good. There are two aspects to inducing sparseness in this model: reduction of the feature space, i.e. setting θ_j = 0, and reduction of the basis functions, i.e. setting β_i = 0. A model sparse in the basis functions is easier to interpret and will generalize better. In the Bayesian framework, sparseness in the MAP estimates can be achieved through a careful selection of prior distributions, such as the ARD hierarchy used here. It is also useful to identify the subset of the features that contains the information about the group discrimination; for example, one would not expect all of the gene profiles on a microarray slide to be useful in differentiating between the tumour types. An obvious way to reduce the number of parameters in the model and improve convergence is to constrain all of the θ_j to be equal, as seen in Section 4.1. The drawback of this parameterization is that all of the features contribute equally to the kernel, and if the subset of useful features is small, their influence on the kernel will be lost.

In the current parameterization of the model, the parameters θ serve as weights whose purpose is to emphasize the important features. Shrinking θ_j to 0 removes feature j from the data. The problem is made slightly more complex because the algorithm is optimizing over the best linear combination of the features, not just the best subset. The focus of future work is to use decision theory within this model to find the optimal subset of features.

6 Acknowledgements

This work has been made possible by the Network of Excellence MUSCLE, contract number FP6-507752 (www.muscle-noe.org), funded by the European Union.

A Appendix

The conditional distributions for each parameter in the model are given by:

\begin{align*}
p(\tau \mid \beta) &\propto p(\beta \mid \tau)\, p(\tau) \\
&\propto \frac{\exp\!\left(-\frac{1}{2\sigma^2} \sum_{k=1}^{K-1} \beta_k^T \mathbf{T}_k \beta_k\right)}{(\sigma^2)^{n(K-1)/2} \prod_{k=1}^{K-1} |\mathbf{T}_k|^{-1/2}} \prod_{i=1}^{n} \prod_{k=1}^{K-1} \exp(-\gamma_4 \tau_{ik})\, (\tau_{ik})^{\gamma_3 - 1} \\
&\propto \prod_{i=1}^{n} \prod_{k=1}^{K-1} G\!\left(\gamma_3 + \frac{1}{2},\; \gamma_4 + \frac{\beta_{ik}^2}{2\sigma^2}\right),
\end{align*}

\begin{align*}
p(\beta, \sigma^2 \mid \mathbf{z}, \theta, \tau) &\propto p(\mathbf{z} \mid \beta, \theta, \sigma^2)\, p(\beta \mid \tau)\, p(\sigma^2) \\
&\propto (\sigma^2)^{-(n(K-1) + \gamma_1 + 1)} \prod_{k=1}^{K-1} \exp\!\left(-\frac{1}{2\sigma^2} (\beta_k - \mathbf{m}_k)^T \mathbf{V}_k^{-1} (\beta_k - \mathbf{m}_k) - \frac{\tilde{\gamma}_2}{\sigma^2}\right) \\
&\propto IG(\gamma_1 + n(K-1),\, \tilde{\gamma}_2) \prod_{k=1}^{K-1} MVN_{(n)}(\mathbf{m}_k,\, \sigma^2 \mathbf{V}_k),
\end{align*}

where m_k = V_k K^T z_k, V_k = (K^T K + T_k)^{-1} and \tilde{\gamma}_2 = \gamma_2 + \frac{1}{2} \sum_{k=1}^{K-1} \left(z_k^T z_k - m_k^T V_k^{-1} m_k\right),

\begin{align*}
p(\mathbf{z}_i \mid \mathbf{z}_{-i}, \beta, \theta, \tau, \sigma^2) &\propto p(\mathbf{y}_i \mid \mathbf{z}_i) \exp\!\left(-\frac{1}{2\sigma^2} \sum_{k=1}^{K-1} (z_{ik} - \mathbf{K}_i^T \beta_k)^2\right) \\
&\propto \exp\!\left[\sum_{k=1}^{K-1} y_{ik} z_{ik} - \log \sum_{k=1}^{K} \exp(z_{ik}) - \frac{1}{2\sigma^2} \sum_{k=1}^{K-1} (z_{ik} - \mathbf{K}_i^T \beta_k)^2\right].
\end{align*}

The conditional distribution of the θ parameter is p(θ|z) ∝ p(z|θ)p(θ). Since the prior on θ is uniform:

\begin{align*}
p(\theta \mid \mathbf{z}) &\propto p(\mathbf{z} \mid \theta) \\
&\propto \int\!\!\int p(\mathbf{z} \mid \beta, \theta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 \\
&\propto \int\!\!\int (\sigma^2)^{-(n(K-1) + \gamma_1 + 1)} \prod_{k=1}^{K-1} \exp\!\left(-\frac{1}{2\sigma^2} (\beta_k - \mathbf{m}_k)^T \mathbf{V}_k^{-1} (\beta_k - \mathbf{m}_k) - \frac{\tilde{\gamma}_2}{\sigma^2}\right) d\beta\, d\sigma^2 \\
&\propto \tilde{\gamma}_2^{-(\gamma_1 + n(K-1))} \prod_{k=1}^{K-1} |\mathbf{V}_k|^{1/2}.
\end{align*}
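As an illustration of the conjugate block update, the sketch below (our own code; the kernel matrix, latent variables, precisions and σ² are taken as given) draws β_k from its multivariate normal full conditional with m_k = V_k K^T z_k and V_k = (K^T K + T_k)^{-1}:

```python
import numpy as np

def draw_beta_k(K, z_k, tau_k, sigma2, rng):
    """Draw beta_k ~ MVN(m_k, sigma2 * V_k) with V_k = (K^T K + T_k)^{-1}."""
    T_k = np.diag(tau_k)
    V_k = np.linalg.inv(K.T @ K + T_k)
    V_k = 0.5 * (V_k + V_k.T)          # symmetrize to guard against round-off
    m_k = V_k @ K.T @ z_k
    return rng.multivariate_normal(m_k, sigma2 * V_k)

# Toy usage with an arbitrary kernel matrix and latent variables.
rng = np.random.default_rng(6)
n = 5
K = np.exp(-np.square(rng.normal(size=(n, 1)) - rng.normal(size=(1, n))))
beta_k = draw_beta_k(K, rng.normal(size=n), np.ones(n), 1.0, rng)
print(beta_k)
```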

References

Albert, J. and S. Chib (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88(422), 669-679.

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68, 337-404.

Bishop, C. M. and M. E. Tipping (2000). Variational relevance vector machines. In UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, pp. 46-53. Morgan Kaufmann Publishers Inc.

Chakraborty, S., M. Ghosh, and B. Mallick (2005). Bayesian nonlinear regression for large p small n problems. Technical report, Department of Statistics, University of Florida.

Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.

Denison, D., C. Holmes, B. Mallick, and A. Smith (2002). Bayesian Methods for Nonlinear Classification and Regression. Chichester: John Wiley and Sons.

Dudoit, S., J. Fridlyand, and T. P. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97(457), 77-87.

Figueiredo, M. (2002). Adaptive sparseness using Jeffreys prior. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, Cambridge, MA, pp. 697-704. MIT Press.

Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning. Springer.

Holmes, C. and L. Held (2005). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis 1, 145-168.

Hsu, C.-W. and C.-J. Lin (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415-425.

Khan, J., J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer (2001, June). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7(6), 673-679.

Lee, Y., Y. Lin, and G. Wahba (2004). Multicategory Support Vector Machines: Theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67-81.

MacKay, D. (1996). Bayesian non-linear modelling for the 1993 energy prediction competition.

Mallick, B. K., D. Ghosh, and M. Ghosh (2005). Bayesian classification of tumors using gene expression data. J. Royal Statistical Soc. B 67, 219-234.

Neal, R. M. (1996). Bayesian Learning for Neural Networks. New York: Springer Verlag.

Parzen, E. (1970). Statistical inference on time series by RKHS methods. In R. Pyke (Ed.), Proceedings 12th Biennial Seminar, Montreal, pp. 1-37. Canadian Mathematical Congress.

Scholkopf, B. and A. Smola (2002). Learning with Kernels: Support Vector Machines, Reproducing Kernel Hilbert Spaces, Regularization, Optimization and Beyond. Cambridge, MA: MIT Press.

Sollich, P. (2002). Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning 46(1-3), 21-52.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B 58, 267-288.

Tipping, M. E. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen, and K. R. Müller (Eds.), Advances in Neural Information Processing Systems, Volume 12, pp. 652-658. MIT Press.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211-244.

Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley-Interscience.

Zhang, Z. and M. I. Jordan (2006). Bayesian multicategory support vector machines. In Uncertainty in Artificial Intelligence (UAI), Proceedings of the Twenty-Second Conference.
