Discriminative Training of Subspace Gaussian Mixture Model for Pattern Classification

Xiao-Hua Liu and Cheng-Lin Liu

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences
95 Zhongguancun East Road, Beijing 100190, P.R. China
{xhliu,liucl}@nlpr.ia.ac.cn

Abstract. The Gaussian mixture model (GMM) has been widely used in pattern recognition problems for clustering and probability density estimation. For pattern classification, however, the GMM must address two issues: model structure in high-dimensional space and discriminative training for optimizing the decision boundary. In this paper, we propose a classification method using a subspace GMM density model and discriminative training. During discriminative training under the minimum classification error (MCE) criterion, both the GMM parameters and the subspace parameters are optimized discriminatively. Our experimental results on the MNIST handwritten digit data and UCI datasets demonstrate the superior classification performance of the proposed method.

Keywords: Subspace GMM, EM algorithm, Discriminative training, MCE.

1 Introduction

The Gaussian mixture model (GMM) is widely used in pattern recognition problems for clustering, probability density estimation and classification. Many methods have been proposed for GMM parameter estimation and model selection (e.g., [1][2]). Despite the capability of the GMM to approximate arbitrary distributions, precise density estimation requires a large number of training samples, especially in high-dimensional space (say, dimensionality over 20). Researchers have proposed structure-constrained GMMs for high-dimensional data, such as diagonal covariance, tied covariance, semi-tied covariance [3], and GMMs in subspaces [4-5].

On the other hand, the GMM is a generative model, with parameters estimated for each class independently by maximum likelihood (ML) using the expectation-maximization (EM) algorithm [1]. Without considering decision boundaries in training, the obtained models do not necessarily give high classification accuracy. Discriminative training methods, typically using the maximum mutual information (MMI) [6] or minimum classification error (MCE) criterion [7], have been proposed to optimize the GMM parameters for classification. For sequence recognition problems like speech recognition, the GMM is commonly used to model the states of hidden Markov models (HMMs). Discriminative training has been applied to GMMs with diagonal covariance [8] and semi-tied covariance [9].


A GMM with semi-tied covariance is also called an orthogonal GMM [10], where an orthogonal transformation matrix is shared by multiple diagonal Gaussian components. With diagonal covariance, the discriminative training criteria MMI and MCE can be easily optimized by gradient descent. Batch-mode updating algorithms are also available for MMI (the extended Baum-Welch algorithm [11]) and MCE training [12], and they can learn full covariance matrices.

Dimensionality reduction can help improve the classification performance of GMMs in high-dimensional space. The subspace GMM of Moghaddam and Pentland [4] formulates the density function as the combination of a GMM in the principal feature subspace and a spherical Gaussian in the complement subspace. It utilizes the flexibility of density approximation of the GMM, exploits the information in the complement subspace, and has shown efficiency in classification [13].

In this paper, we propose a classification method using a subspace GMM density model and discriminative training based on MCE. Discriminative training of subspace GMMs has not been reported before, and in discriminative training of semi-tied GMMs (orthogonal GMMs), the transformation matrix was not trained discriminatively [9][10][14]. A previous method, the discriminative learning quadratic discriminant function (DLQDF) [15], optimizes the Gaussian parameters and subspace parameters discriminatively for a single Gaussian. Our method optimizes both the GMM parameters and the subspace parameters discriminatively, and is suitable for classification in cases of non-Gaussian distributions. As to the training criterion, we choose MCE because it is flexible for various discriminant functions and easy to implement by gradient descent. The MMI criterion could be used instead, and we would expect similar performance.

We have evaluated the proposed method on a number of high-dimensional datasets, including the MNIST handwritten digit data [16] and several datasets from the UCI Machine Learning Repository [17]. Our experimental results show that discriminative training of subspace GMMs yields significantly higher classification accuracies than ML estimation, and that jointly training the subspace parameters with the GMM parameters outperforms training the GMM parameters only. Our results are also superior to a recent discriminative training method for orthogonal GMMs [14].

The rest of this paper is organized as follows. Section 2 briefly describes the subspace Gaussian mixture density model. Section 3 describes the discriminative training method. Section 4 presents our experimental results and Section 5 offers concluding remarks.

2 Subspace Gaussian Mixture Model

Denote a pattern as a point (vector) $\mathbf{x}$ in a $d$-dimensional feature space. The subspace GMM density function of [4] combines a GMM in the principal subspace and a spherical Gaussian in the complement subspace:

$$ p(\mathbf{x} \mid \omega) = p_F(\mathbf{x} \mid \omega)\, p_{\bar{F}}(\mathbf{x} \mid \omega), \qquad (1) $$

where $p_F(\mathbf{x} \mid \omega)$ is the GMM in the principal subspace spanned by the $k$ principal eigenvectors $A = [\phi_1, \ldots, \phi_k]$ of the $\mathbf{x}$ space centered at $\mathbf{m}_x$:


$$ p_F(\mathbf{x} \mid \omega) = \sum_{j=1}^{M} \pi_j\, p_j(\mathbf{y} \mid \theta_j), \qquad (2) $$

where $\mathbf{y} = A^T(\mathbf{x} - \mathbf{m}_x)$ and $p_j(\mathbf{y} \mid \theta_j)$ is a component Gaussian with parameters $\theta_j = \{\mu_j, \Sigma_j\}$ in the $k$-dimensional subspace. $p_{\bar{F}}(\mathbf{x} \mid \omega)$ is a spherical Gaussian in the $(d-k)$-dimensional complement subspace:

$$ p_{\bar{F}}(\mathbf{x} \mid \omega) = \frac{1}{(2\pi\rho)^{(d-k)/2}} \exp\!\left( -\frac{\varepsilon^2(\mathbf{x})}{2\rho} \right), \qquad (3) $$

where $\varepsilon^2(\mathbf{x}) = \|\mathbf{x} - \mathbf{m}_x\|^2 - \|\mathbf{y}\|^2$ and $\rho = \frac{1}{d-k} \sum_{l=k+1}^{d} \lambda_l$ is the average of the eigenvalues in the complement subspace [4].

We use the ML estimate of the subspace GMMs (one for each class) as the initialization of discriminative training [5]. The principal subspace of each class is spanned by the eigenvectors corresponding to the $k$ largest eigenvalues of the class covariance matrix, and the GMM parameters in the principal subspace are estimated by the EM algorithm. Since the classification performance is rather sensitive to the value of $\rho$, we select a common value of $\rho$ by cross-validation for high classification accuracy [5].
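For concreteness, the following is a minimal NumPy/SciPy sketch of how the class-conditional log-density of Eqs. (1)-(3) could be evaluated once the per-class parameters have been estimated as above. The function and variable names (A, m_x, pi, mu, Sigma, rho) are illustrative, not taken from the original paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def subspace_gmm_log_density(x, A, m_x, pi, mu, Sigma, rho):
    """Log of Eq. (1) for one class: GMM in the k-dim principal subspace
    (Eq. 2) times a spherical Gaussian in the (d-k)-dim complement (Eq. 3)."""
    d, k = A.shape                      # A holds the k principal eigenvectors as columns
    xc = x - m_x
    y = A.T @ xc                        # projection y = A^T (x - m_x)
    # Eq. (2): mixture density in the principal subspace
    p_F = sum(pi[j] * multivariate_normal.pdf(y, mean=mu[j], cov=Sigma[j])
              for j in range(len(pi)))
    # Eq. (3): spherical Gaussian on the reconstruction residual
    eps2 = xc @ xc - y @ y              # ||x - m_x||^2 - ||y||^2
    log_p_comp = -0.5 * ((d - k) * np.log(2.0 * np.pi * rho) + eps2 / rho)
    return np.log(p_F) + log_p_comp

# With equal priors, classification picks the class maximizing the log-density:
# scores = [subspace_gmm_log_density(x, *params[c]) for c in range(n_classes)]
# label = int(np.argmax(scores))
```

In practice the mixture term is better accumulated in the log domain (e.g., with scipy.special.logsumexp over the component log-densities) to avoid underflow in high-dimensional feature spaces.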

3 Discriminative Training of Subspace GMM

The subspace GMM density model has been shown to be effective for pattern classification, but its parameters are estimated by generative learning: the training data of each class is used to estimate the class parameters independently. By discriminative training to optimize the decision boundary, the classification performance can be further improved.

3.1 MCE Training

In the MCE criterion, the misclassification loss of each training sample is approximated from the difference of discriminant functions between the genuine class and the competing classes. The empirical loss is minimized by stochastic gradient descent to optimize the classifier parameters. For classifiers based on probability density models, and assuming equal prior probabilities, the discriminant function is $g_i(\mathbf{x}) = \log p(\mathbf{x} \mid \omega_i)$. In MCE training [7], the misclassification measure of a sample $\mathbf{x}$ from labeled class $\omega_c$ is

$$ h_c(\mathbf{x}) = -g_c(\mathbf{x}) + g_r(\mathbf{x}), \qquad (4) $$

where $g_r(\mathbf{x}) = \max_{i \neq c} g_i(\mathbf{x})$ is the discriminant function of the most competing class (rival class). It is obvious that $h_c(\mathbf{x}) > 0$ indicates misclassification while $h_c(\mathbf{x}) < 0$ indicates correct classification.
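As an illustration only (this excerpt ends before the paper's own loss definition), the sketch below follows the standard MCE recipe of [7]: the misclassification measure of Eq. (4) is passed through a sigmoid to obtain a smooth per-sample loss, whose gradient can then be propagated to the genuine and rival class parameters by stochastic gradient descent. The sigmoid sharpness xi and the function names are assumptions.

```python
import numpy as np

def mce_sample_loss(x, c, log_density_fns, xi=1.0):
    """Smoothed MCE loss for one labeled sample (x, c).

    log_density_fns[i](x) returns the discriminant g_i(x) = log p(x | omega_i);
    xi is an assumed smoothness parameter of the sigmoid loss.
    """
    g = np.array([f(x) for f in log_density_fns])
    rivals = np.delete(g, c)
    r = np.argmax(rivals)                    # most competing (rival) class
    h = -g[c] + rivals[r]                    # Eq. (4): h > 0 means x is misclassified
    loss = 1.0 / (1.0 + np.exp(-xi * h))     # smooth approximation of the 0-1 loss
    # Stochastic gradient descent then uses the chain rule
    #   d(loss)/d(theta) = xi * loss * (1 - loss) * (d g_r/d theta - d g_c/d theta),
    # so only the genuine class and the rival class are updated for this sample.
    return loss, h
```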