Distribution-based Bayesian Minimum Expected Risk for Discriminant Analysis

Santosh Srivastava
Department of Applied Mathematics, University of Washington, Seattle, WA 98195, USA
Email: [email protected]

Maya R. Gupta
Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA
Email: [email protected]

Abstract— This paper considers a distribution-based Bayesian estimation for classification by quadratic discriminant analysis, instead of the standard parameter-based Bayesian estimation. This approach also yields closed-form solutions, but removes the parameter-based restriction of requiring more training samples than feature dimensions. We investigate how to define a prior so that it has an adaptively regularizing effect: yielding robust estimation when the number of training samples is small compared to the number of feature dimensions, but converging as the number of data points grows large. Comparative performance on a suite of simulations shows that the distribution-based Bayesian discriminant analysis is advantageous in terms of average error.

I. INTRODUCTION

A popular classification method models the likelihood of each class as a Gaussian distribution and uses the consequent posterior distributions to estimate the class. For such methods, Bayesian estimation of the posterior minimizes the expected misclassification cost. In this paper we consider two issues in using Bayesian estimation effectively for quadratic discriminant analysis.

First, we consider directly integrating out the uncertainty over the domain of Gaussian probability density functions (pdfs), as opposed to the standard approach of integrating out the uncertainty over the domain of the parameters. The proposed distribution-based Bayesian discriminant analysis removes the parameter-based Bayesian restriction of requiring more training samples than feature dimensions. Comparative performance on the Friedman suite of simulations shows that the distribution-based Bayesian discriminant analysis is also advantageous in terms of average error.

The second issue considered here is the choice of prior. Ideally, a prior should have an adaptively regularizing effect, yielding robust estimation when the number of training samples is small compared to the number of feature dimensions (and hence the number of parameters), but also converging as the number of data points grows large. In practice, there can be more informative features d than labeled training samples n. This situation has previously been addressed through regularization (such as regularizing quadratic discriminant analysis with linear discriminant analysis [3]). This paper takes first steps towards showing that the prior itself can act as an efficient regularizing force.

The paper uses the following notation:

X_i ∈ R^d   = ith random training sample
G           = {1, 2, ..., G}, set of class labels
Y_i ∈ G     = class label corresponding to X_i
X ∈ R^d     = random test sample
Y ∈ G       = class label corresponding to test point X
n           = number of training samples
n_h         = number of training samples of class h
I(A)        = indicator function over the event A
X̄_h         = (1/n_h) Σ_{i=1}^n X_i I(Y_i = h), the sample mean of class h
S_h         = Σ_{i=1}^n (X_i − X̄_h)(X_i − X̄_h)^T I(Y_i = h), the scatter matrix of class h
|B|         = determinant of matrix B

II. RELATED WORK

Discriminant analysis using Bayesian estimation was first proposed by Geisser [4] and Keehn [5]. Geisser's work used a noninformative prior distribution to calculate the posterior odds that a test sample belongs to a particular class. Keehn's work assumed a Wishart prior distribution for the covariance matrix. Brown et al. [6] use conjugate priors and propose a hierarchical approach that compromises between the two extremes of linear and quadratic Bayes discriminant analysis, similar to Friedman's regularized discriminant analysis [3]. Raudys and Jain note that the Geisser and Keehn Bayesian discriminant analysis may be inefficient when the class sample sizes differ [7]. In all of this prior work, the mean µ and covariance Σ are treated as random variables and integrated with respect to a uniform measure over the domain of the parameters (i.e., Lebesgue measure), which we will refer to as parameter-based Bayesian discriminant analysis. In this paper we consider a distribution-based Bayesian discriminant analysis, in which the uncertainty is modeled over the set of Gaussian pdfs and the Bayesian estimation is carried out over that domain.

III. DISTRIBUTION-BASED BAYESIAN DISCRIMINANT ANALYSIS

Suppose one is given an iid training set T = {(X_i, Y_i), i = 1, 2, ..., n} and a test sample X. Let C(g, h) be the cost of choosing class label Ŷ = g when the truth was Y = h, and let P(Y = h) be the prior probability of class h. The ideal class estimate for X would be

Y^* \triangleq \arg\min_g \sum_{h=1}^{G} C(g,h)\, p(X \mid Y = h)\, P(Y = h),    (1)

but this cannot be computed because p(X | Y = h) is unknown. Define the distribution-based Bayesian discriminant analysis class estimate to be

\hat{Y} \triangleq \arg\min_g E_{N}\Big[\sum_{h=1}^{G} C(g,h)\, N_h(X)\, P(Y = h)\Big],    (2)

where N = (N_1, N_2, ..., N_G) are iid random Gaussian pdfs such that N_h(X) is a random variable that models the unknown (and hence uncertain) p(X | Y = h). Thus (2) minimizes the expected misclassification cost with respect to the uncertainty of N given the training data T. To be consistently Bayesian, we also model the uncertainty of the prior P(Y = h) as a random variable Θ and take the expectation over Θ; this yields an estimate of the class prior P̂(Y = h) = (n_h + 1)/(n + G) (which is the Bayesian minimum expected risk estimate for the multinomial, also called the Laplace correction [8]). Exchanging the sum and the expectation, (2) can be rewritten as

\hat{Y} \triangleq \arg\min_g \sum_{h=1}^{G} C(g,h)\, E_{N_h}[N_h(X)]\, \hat{P}(Y = h).    (3)
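As a concrete illustration of the class-prior estimate above, the following sketch (our own illustration in Python, not code from the paper; the function name is ours) computes P̂(Y = h) = (n_h + 1)/(n + G) from a vector of training labels:

    import numpy as np

    def laplace_class_priors(y, num_classes):
        # Bayesian minimum expected risk (Laplace-corrected) estimate of P(Y = h):
        # (n_h + 1) / (n + G) for each class h in {0, ..., num_classes - 1}.
        y = np.asarray(y)
        counts = np.bincount(y, minlength=num_classes)   # n_h for each class
        return (counts + 1.0) / (len(y) + num_classes)

    # Example: 5 training labels from class 0 and 3 from class 1 give priors (0.6, 0.4).
    print(laplace_class_priors([0, 0, 0, 0, 0, 1, 1, 1], num_classes=2))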

Next, we discuss how to evaluate E_{N_h}[N_h(X)], which requires integration over the domain of Gaussian pdfs.

A. Statistical Models and Measure

Consider the family M of d-dimensional multivariate Gaussian probability distributions. Let each element of M be a probability distribution N : R^d → R that is parameterized by the real-valued variables (µ, Σ) in some open set in R^d ⊗ R^{d(d+1)/2}, that is, M = {N(·; µ, Σ)}. Thus the set M defines a (d^2 + 3d)/2-dimensional statistical model [2, pp. 25–28]. Let the differential element over the set M be defined by the Riemannian metric [1], [2],

dM = |I(\mu, \Sigma)|^{1/2}\, d\mu\, d\Sigma,    (4)

where

I(\mu, \Sigma) = -E_X\big[H\big(\log N(X; (\mu, \Sigma))\big)\big],    (5)

and H is the Hessian operator with respect to the parameters µ and Σ. The term I is also known as the Fisher information matrix. Then,

dM = \frac{d\mu}{|\Sigma|^{1/2}} \cdot \frac{d\Sigma}{|\Sigma|^{(d+1)/2}} = \frac{d\mu\, d\Sigma}{|\Sigma|^{(d+2)/2}}.    (6)

Let N_h be a possible realization of the Gaussian pdf N_h. Then, using the measure defined in (6),

E_{N_h}[N_h(X)] = \int_{M} N_h(X)\, f(N_h)\, dM,    (7)

where f(N_h) is the posterior probability of N_h given the training samples T,

f(N_h) = \frac{\ell(N_h, T)\, p(N_h)}{\alpha_h},    (8)

where α_h is a normalization constant, p(N_h) is a prior (defined later in Section III-B), and ℓ is the likelihood of the data T,

\ell(N_h, T) = \frac{\exp\!\big[-\tfrac{1}{2}\,\mathrm{tr}(\Sigma_h^{-1} S_h) - \tfrac{n_h}{2}\,\mathrm{tr}\big(\Sigma_h^{-1}(\mu_h - \bar{X}_h)(\mu_h - \bar{X}_h)^T\big)\big]}{(2\pi)^{n_h d/2}\, |\Sigma_h|^{n_h/2}}.    (9)
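For reference, the per-class quantities that enter the likelihood (9), namely n_h, X̄_h, and the scatter matrix S_h, can be computed as in the following sketch (our own illustration; the function name is ours, and classes are assumed to be encoded as integers):

    import numpy as np

    def class_sufficient_statistics(X, y, h):
        # Returns n_h, the sample mean X̄_h, and the scatter matrix S_h for class h,
        # where S_h sums the outer products of the centered samples of class h.
        X, y = np.asarray(X), np.asarray(y)
        X_h = X[y == h]                      # (n_h, d) samples of class h
        n_h = X_h.shape[0]
        mean_h = X_h.mean(axis=0)            # sample mean of class h
        centered = X_h - mean_h
        S_h = centered.T @ centered          # sum_i (X_i - X̄_h)(X_i - X̄_h)^T I(Y_i = h)
        return n_h, mean_h, S_h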

B. Priors

To solve the classification problem given in (2), one still requires a prior p(N_h). We suggest that the prior should:
• lead to a closed-form solution,
• add sensible bias to the classification when the number of training samples n is small compared to the number of feature dimensions d, and
• allow the estimation to converge to the true generating class-conditional normals as n → ∞.

To meet these goals, we use as a prior

p(N_h) = p(\mu_h)\, p(\Sigma_h) = \frac{|B_h|\, \exp[-\mathrm{tr}(\Sigma_h^{-1} B_h)]}{|\Sigma_h|^{(d+3)/2}},    (10)

where B_h is a positive definite matrix (further specified in (11)). The prior (10) is equivalent to a noninformative prior for the mean µ, and a modified inverted Wishart prior with d + 3 degrees of freedom over Σ. The number of degrees of freedom was chosen to be d + 3 so that in one dimension this is equivalent to the well-known inverted gamma pdf.

The positive definite matrix B_h determines the location of the maximum of the prior probability distribution. To be responsive to the scale of the data samples {X_i}, we set B_h as a function of the maximum likelihood estimate of the covariance of the data for class h, Σ̂_{(ML,h)}. The maximum likelihood estimate of the covariance may be close to singular, so instead of using it directly, we use its trace:

B_h = \frac{\mathrm{tr}\,\hat{\Sigma}_{(ML,h)}}{d}\, I,    (11)

where I is the d-dimensional identity matrix. For similar reasons, Friedman proposed regularizing covariance estimates by a linear combination of Σ̂_{(ML,h)} and B_h [3].
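A minimal sketch of the choice (11), assuming the maximum likelihood covariance estimate for class h is its scatter matrix divided by n_h (the function name is ours):

    import numpy as np

    def wishart_seed_matrix(S_h, n_h, d):
        # B_h = (trace of the ML covariance estimate / d) * I, as in (11).
        sigma_ml = S_h / n_h                 # maximum likelihood covariance estimate
        return (np.trace(sigma_ml) / d) * np.eye(d)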

C. Solutions

Theorem: Under the assumption of the prior given by (10), the solution of (3) is to classify test point X as class label

\hat{Y} \triangleq \arg\min_g \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{d/2}\, \Gamma\!\big(\tfrac{n_h+d+4}{2}\big)\, \big|\tfrac{S_h}{2} + B_h\big|^{(n_h+d+3)/2}}{(n_h+1)^{d/2}\, \Gamma\!\big(\tfrac{n_h+4}{2}\big)\, |A_h|^{(n_h+d+4)/2}}\; \hat{P}(Y = h),    (12)

where

A_h = \frac{S_h}{2} + \frac{n_h\,(X - \bar{X}_h)(X - \bar{X}_h)^T}{2(n_h+1)} + B_h.    (13)
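The rule (12)–(13) might be evaluated numerically as in the following sketch (our own illustration, not the authors' code; function names are ours). Working with logarithms of the gamma functions and determinants avoids overflow for moderate d and n_h, and any factor shared by all classes, such as the (2π)^{d/2} in (16) below, is irrelevant to the arg min and is dropped.

    import numpy as np
    from scipy.special import gammaln

    def log_class_score(x, n_h, mean_h, S_h, B_h, log_prior_h):
        # Log of the class-h term of (12): log E[N_h(x)] + log P_hat(Y = h),
        # up to an additive constant shared by all classes.
        d = x.shape[0]
        diff = x - mean_h
        A_h = S_h / 2.0 + (n_h / (2.0 * (n_h + 1.0))) * np.outer(diff, diff) + B_h   # eq. (13)
        _, logdet_SB = np.linalg.slogdet(S_h / 2.0 + B_h)
        _, logdet_A = np.linalg.slogdet(A_h)
        return ((d / 2.0) * np.log(n_h)
                + gammaln((n_h + d + 4.0) / 2.0)
                + ((n_h + d + 3.0) / 2.0) * logdet_SB
                - (d / 2.0) * np.log(n_h + 1.0)
                - gammaln((n_h + 4.0) / 2.0)
                - ((n_h + d + 4.0) / 2.0) * logdet_A
                + log_prior_h)

For 0–1 misclassification cost, (12) reduces to choosing the class with the largest score; for a general cost matrix C, one would exponentiate the scores (after subtracting their maximum for numerical stability) and take arg min_g Σ_h C(g, h) exp(score_h).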

Proof: To solve (3), first we simplify (7). The posterior (8) requires calculation of the normalization constant α_h:

\alpha_h = \frac{|B_h|}{(2\pi)^{n_h d/2}} \left(\frac{2\pi}{n_h}\right)^{d/2} \frac{\Gamma_d\!\big(\tfrac{n_h+d+3}{2}\big)}{\big|\tfrac{S_h}{2} + B_h\big|^{(n_h+d+3)/2}}.    (14)

Then

E_{N_h}[N_h(X)]
  = \frac{1}{\alpha_h} \int_{\Sigma_h}\!\int_{\mu_h} \frac{\exp\!\big[-\tfrac{1}{2}\,\mathrm{tr}\big(\Sigma_h^{-1}(X-\mu_h)(X-\mu_h)^T\big)\big]}{(2\pi)^{d/2}\,|\Sigma_h|^{1/2}}
    \cdot \frac{|B_h|\, \exp\!\big[-\mathrm{tr}\big(\Sigma_h^{-1}(\tfrac{S_h}{2} + B_h)\big)\big]}{(2\pi)^{n_h d/2}\,|\Sigma_h|^{(n_h+d+3)/2}}
    \cdot \exp\!\Big[-\tfrac{n_h}{2}\,\mathrm{tr}\big(\Sigma_h^{-1}(\mu_h - \bar{X}_h)(\mu_h - \bar{X}_h)^T\big)\Big]\, \frac{d\mu_h\, d\Sigma_h}{|\Sigma_h|^{(d+2)/2}}

(integrating and using (13), this can be written as)

  = \frac{|B_h|\, \Gamma_d\!\big(\tfrac{n_h+d+4}{2}\big)}{\alpha_h\, (2\pi)^{d n_h/2}\, (n_h+1)^{d/2}\, |A_h|^{(n_h+d+4)/2}}

(plugging in the value of α_h from (14))

  = \frac{n_h^{d/2}\, \Gamma_d\!\big(\tfrac{n_h+d+4}{2}\big)\, \big|\tfrac{S_h}{2} + B_h\big|^{(n_h+d+3)/2}}{(2\pi)^{d/2}\, (n_h+1)^{d/2}\, \Gamma_d\!\big(\tfrac{n_h+d+3}{2}\big)\, |A_h|^{(n_h+d+4)/2}}.    (15)

Simplifying the multivariate gamma function Γ_d,

E_{N_h}[N_h(X)] = \frac{1}{(2\pi)^{d/2}} \cdot \frac{n_h^{d/2}\, \Gamma\!\big(\tfrac{n_h+d+4}{2}\big)\, \big|\tfrac{S_h}{2} + B_h\big|^{(n_h+d+3)/2}}{(n_h+1)^{d/2}\, \Gamma\!\big(\tfrac{n_h+4}{2}\big)\, |A_h|^{(n_h+d+4)/2}},    (16)

where A_h is given by equation (13). Substituting (16) into equation (3) proves the theorem. ∎

Notably, the solution (16) is valid for any n_h > 0 and any feature space dimension d.

Corollary 1: The parameter-based Bayesian discriminant analysis solution using the prior given in (10) is to classify test point X as class label

\hat{Y} \triangleq \arg\min_g \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{d/2}\, \Gamma\!\big(\tfrac{n_h+2}{2}\big)\, \big|\tfrac{S_h}{2} + B_h\big|^{(n_h+1)/2}}{(n_h+1)^{d/2}\, \Gamma\!\big(\tfrac{n_h+2-d}{2}\big)\, |A_h|^{(n_h+2)/2}}\; \hat{P}(Y = h).    (17)
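A sketch of the corresponding class-h log-score for (17) (again our own illustration) differs from the one given after the Theorem only in the gamma arguments and determinant exponents; it also makes the sample-size restriction explicit, since Γ((n_h + 2 − d)/2) is not defined for non-positive arguments.

    import numpy as np
    from scipy.special import gammaln

    def log_class_score_parameter_based(x, n_h, mean_h, S_h, B_h, log_prior_h):
        # Log of the class-h term of (17), up to an additive constant shared by all classes.
        d = x.shape[0]
        if n_h + 2 - d <= 0:
            raise ValueError("The parameter-based solution (17) requires n_h > d - 2.")
        diff = x - mean_h
        A_h = S_h / 2.0 + (n_h / (2.0 * (n_h + 1.0))) * np.outer(diff, diff) + B_h   # eq. (13)
        _, logdet_SB = np.linalg.slogdet(S_h / 2.0 + B_h)
        _, logdet_A = np.linalg.slogdet(A_h)
        return ((d / 2.0) * np.log(n_h)
                + gammaln((n_h + 2.0) / 2.0)
                + ((n_h + 1.0) / 2.0) * logdet_SB
                - (d / 2.0) * np.log(n_h + 1.0)
                - gammaln((n_h + 2.0 - d) / 2.0)
                - ((n_h + 2.0) / 2.0) * logdet_A
                + log_prior_h)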

The proof is omitted due to lack of space; it is similar to the proof of the Theorem. Notably, the parameter-based Bayesian discriminant solution (17) will not hold if n_h ≤ d − 2.

Corollary 2: The distribution-based Bayesian discriminant analysis solution using the noninformative prior

p(N_h) = p(\mu_h)\, p(\Sigma_h) = \frac{1}{|\Sigma_h|^{(d+1)/2}},    (18)

is to classify test point X as class label

\hat{Y} \triangleq \arg\min_g \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{d/2}\, \Gamma\!\big(\tfrac{n_h+d+2}{2}\big)\, |S_h|^{(n_h+d+1)/2}}{(n_h+1)^{d/2}\, \Gamma\!\big(\tfrac{n_h+2}{2}\big)\, |S_h + T_h|^{(n_h+d+2)/2}}\; \hat{P}(Y = h),    (19)

where

T_h = \frac{n_h\,(X - \bar{X}_h)(X - \bar{X}_h)^T}{n_h + 1}.
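Similarly, a sketch of the class-h log-score of (19) under the noninformative prior (18) (our own illustration; the function name is ours):

    import numpy as np
    from scipy.special import gammaln

    def log_class_score_noninformative(x, n_h, mean_h, S_h, log_prior_h):
        # Log of the class-h term of (19), up to an additive constant shared by all classes.
        # This sketch assumes S_h is nonsingular (n_h > d); otherwise the determinants vanish.
        d = x.shape[0]
        diff = x - mean_h
        T_h = (n_h / (n_h + 1.0)) * np.outer(diff, diff)
        _, logdet_S = np.linalg.slogdet(S_h)
        _, logdet_ST = np.linalg.slogdet(S_h + T_h)
        return ((d / 2.0) * np.log(n_h)
                + gammaln((n_h + d + 2.0) / 2.0)
                + ((n_h + d + 1.0) / 2.0) * logdet_S
                - (d / 2.0) * np.log(n_h + 1.0)
                - gammaln((n_h + 2.0) / 2.0)
                - ((n_h + d + 2.0) / 2.0) * logdet_ST
                + log_prior_h)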

The proof is omitted due to lack of space; it is similar to the proof of the Theorem. Again, this distribution-based Bayesian discriminant solution (19) will hold for any n_h > 0 and any d.

In the next section, we will also compare to a parameter-based Bayesian discriminant analysis given by Geisser [4] that uses the noninformative prior over Σ and µ.

IV. SIMULATION

The performance of the various estimators was compared using simulations similar to those proposed by Friedman to evaluate regularized discriminant analysis [3]. The comparison is between parameter-based Bayesian estimation, distribution-based Bayesian estimation, quadratic discriminant analysis, and nearest-means classification. Further, for the Bayesian estimates, the noninformative prior was compared to the modified Wishart prior for the covariance (the noninformative prior was used throughout for the mean). The class label is randomly drawn to be class 1 (Y = 1) with probability one half, and class 2 (Y = 2) with probability one half.

1) Equal Spherical Covariance Matrices: Each class conditional distribution was normal with identity covariance matrix I. The mean of the first class µ_1 was the origin. Each component of the mean µ_2 of the second class was 3. Results are shown in Figure 1.

2) Unequal Spherical Covariance Matrices: Conditioned on class 1, the distribution was normal with identity covariance matrix I and mean at the origin. Conditioned on class 2, the distribution was normal with covariance matrix 2I and each component of the mean was 3. Results are shown in Figure 2 (top).

3) Equal Highly Ellipsoidal Covariance Matrices: The covariance matrices of the two class distributions were the same, and highly ellipsoidal. The eigenvalues of the common covariance matrix were given by

e_i = \Big[\frac{9(i-1)}{d-1} + 1\Big]^2, \quad 1 \le i \le d,    (20)

so the ratio of the largest to smallest eigenvalue is 100. A first case was that the class mean differences were concentrated in a low-variance subspace. The mean of class 1 was located at the origin and the ith component of the mean of class 2 was given by

\mu_{2i} = 2.5\,\sqrt{\frac{e_i}{d}}\;\frac{d-i}{d/2 - 1}, \quad 1 \le i \le d.
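A sketch of how the equal-covariance ellipsoidal cases above might be generated (our own reconstruction of the setup; the function name is ours, and d > 2 is assumed):

    import numpy as np

    def equal_ellipsoidal_case(d, low_variance_means=True):
        # Eigenvalues (20) and the class-2 mean for the equal-covariance ellipsoidal cases.
        i = np.arange(1, d + 1)
        e = (9.0 * (i - 1) / (d - 1) + 1.0) ** 2      # eq. (20); largest/smallest ratio is 100
        cov = np.diag(e)                              # common covariance of both classes
        direction = (d - i) if low_variance_means else (i - 1)
        mu2 = 2.5 * np.sqrt(e / d) * direction / (d / 2.0 - 1.0)
        return cov, np.zeros(d), mu2                  # class-1 mean is the origin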

[Figure 1: average misclassification cost versus dimension (0 to 50). Legend (common to all figures): Distribution-based with Wishart Prior; Parameter-based with Wishart Prior; Distribution-based with Noninformative Prior; Parameter-based with Noninformative Prior; Quadratic Discriminant Analysis; Nearest-means.]

Fig. 1. Results of Equal Spherical Covariance Matrix simulation.

Results are shown in Figure 2 (bottom). A second case was that the class mean differences were concentrated in a high-variance subspace. The mean of class 1 was again located at the origin and the ith component of the mean of class 2 was given by

\mu_{2i} = 2.5\,\sqrt{\frac{e_i}{d}}\;\frac{i-1}{d/2 - 1}, \quad 1 \le i \le d.

Results are shown in Figure 3.

4) Unequal Highly Ellipsoidal Covariance Matrices: The covariance matrices were highly ellipsoidal and different for each class. The eigenvalues of the class 1 covariance were given by equation (20) and those of class 2 were given by

e_{2i} = \Big[\frac{9(d-i)}{d-1} + 1\Big]^2, \quad 1 \le i \le d.

A first case was that the class means were identical.


Fig. 2. Results of Unequal Spherical Covariance Matrix simulation (top) and Equal Highly Ellipsoidal Covariance Matrix with Low-Variance Subspace Mean Differences simulation (bottom).

A second case was that the class means were different: the mean of class 1 was located at the origin and the ith component of the mean of class 2 was given by \mu_{2i} = 14/\sqrt{d}. Results are shown in Figure 4 (top) and Figure 4 (bottom), respectively.

5) Experimental procedure: For each of the above described choices of class conditional covariance matrix and mean, the figures show the average misclassification costs from 1000 replications of the following procedure. First, n = 40 training sample pairs were drawn iid. Each classifier used the training samples to estimate its parameters. For all the classifiers, the prior probability of each of the two classes was estimated based on the number of observations from each class using Bayesian minimum expected risk estimation. Then, 100 test samples were drawn iid and classified by each estimator.
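The experimental procedure can be summarized by the following sketch (our own paraphrase; draw_sample and the fit/predict classifier interface are placeholders for the samplers and classifiers described above):

    import numpy as np

    def run_replication(draw_sample, classifiers, n_train=40, n_test=100, rng=None):
        # One replication: draw 40 iid training pairs, fit each classifier, and
        # report its average 0-1 misclassification cost on 100 iid test pairs.
        rng = np.random.default_rng() if rng is None else rng
        y_train = rng.integers(0, 2, size=n_train)    # each class has prior probability 1/2
        X_train = np.stack([draw_sample(y, rng) for y in y_train])
        y_test = rng.integers(0, 2, size=n_test)
        X_test = np.stack([draw_sample(y, rng) for y in y_test])
        costs = {}
        for name, clf in classifiers.items():
            clf.fit(X_train, y_train)
            costs[name] = np.mean(clf.predict(X_test) != y_test)
        return costs

    # Averaging run_replication over 1000 replications gives one point of each curve.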


Fig. 3. Results of Equal Highly Ellipsoidal Covariance Matrix with High Variance Subspace Mean Differences simulation.

V. CONCLUSION

The distribution-based Bayesian discriminant analysis is seen to perform better in almost all cases of the simulations. In particular, using the adaptive modified Wishart prior led to significantly better performance in some cases. We hypothesize that this choice of prior has a regularizing effect, and that using a well-designed adaptive prior could be an effective regularization strategy for discriminant analysis without the need for cross-validation to find regularization parameters, as in regularized discriminant analysis [3].

ACKNOWLEDGMENT

This work was funded in part by the Office of Naval Research, Code 321, Grant # N00014-05-1-0843. We thank Béla Frigyik and Richard Olshen for helpful discussions.


Fig. 4. Results of Unequal Highly Ellipsoidal Covariance Matrix with Same Mean simulation (top) and Unequal Highly Ellipsoidal Covariance Matrix with Different Mean simulation (bottom).

REFERENCES

[1] R. E. Kass, "The Geometry of Asymptotic Inference," Statistical Science, vol. 4, no. 3, pp. 188–234, 1989.
[2] S. Amari and H. Nagaoka, Methods of Information Geometry, Oxford University Press, 2000.
[3] J. H. Friedman, "Regularized Discriminant Analysis," Journal of the American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.
[4] S. Geisser, "Posterior Odds for Multivariate Normal Distributions," Journal of the Royal Statistical Society, Series B (Methodological), vol. 26, pp. 69–76, 1964.
[5] D. G. Keehn, "A Note on Learning for Gaussian Properties," IEEE Transactions on Information Theory, vol. 11, pp. 126–132, 1965.
[6] P. J. Brown, T. Fearn, and M. S. Haque, "Discrimination With Many Variables," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1320–1329, 1999.
[7] S. J. Raudys and A. K. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252–264, 1991.
[8] E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, 2003.
