Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging

Dirk Ormoneit
Institut für Informatik (H2), TU München, 80290 München, Germany

Volker Tresp
Siemens AG, Central Research, 81730 München, Germany
Tresp@zfe.siemens.de

July 1, 1995

Abstract

We compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method consists of defining a Bayesian prior distribution on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posteriori parameter probability, in contrast to the usual EM rules for Gaussian mixtures which maximize the likelihood function. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's "bagging", which has recently been found to produce impressive results for classification networks. To our knowledge this is the first time that ensemble averaging is applied to improve density estimation.

1 Introduction

Gaussian mixture models have recently attracted major attention in the neural network community. Important examples of their application include learning from patterns with missing inputs ([TAN94], [GJ94]), network structuring with rule-based knowledge ([VTA92]), the representation of inverse problems ([Gha94], [Bis94]), discriminant analysis ([HT94], [KL95]), the combination of estimators ([TT95]), active learning ([CGJ94]) and probability density estimation ([NHFO94]).* The appeal of Gaussian mixture models is based to a high degree on the applicability of the EM (Expectation Maximization) learning algorithm, which may be implemented as a fast neural network learning rule ([Now91], [Orm93]). Severe problems arise, however, due to singularities and local minima in the likelihood error surface. Particularly in high-dimensional settings, these frequently cause the computed density estimates to possess only limited generalization capabilities in terms of predicting the densities of unknown data points. The solution proposed in this paper is regularization. We compare two regularization methods. The first consists of defining a Bayesian prior on the parameters. By using conjugate priors we can derive EM learning rules for finding the MAP (maximum a posteriori probability) parameter estimate. The second approach consists of averaging the outputs of ensembles of Gaussian mixture density estimators trained on identical or resampled data sets. The latter is a form of "bagging", which was introduced by Breiman ([Bre94]) and which has recently been found to produce impressive results for classification networks. By using the regularized density estimators in a Bayes classifier ([VTA92], [HT94], [KL95]), we demonstrate that both methods lead to density estimates which are superior to the unregularized Gaussian mixture estimate.

* The research of the first author is made possible by a scholarship from the Deutsche Forschungsgemeinschaft (DFG) within the framework of the Graduiertenkolleg at the Department of Computer Science of the Technical University of Munich.

2 Gaussian Mixtures and the EM Algorithm

Consider the problem of estimating the probability density of a continuous random vector $x \in \mathbb{R}^d$ based on a set $\mathbf{x} = \{x^k \mid 1 \le k \le m\}$ of i.i.d. realizations of $x$. As a density model we choose the class of Gaussian mixtures
\[
p(x|\theta) = \sum_{i=1}^{n} \kappa_i \, p(x \mid i, \mu_i, \Sigma_i),
\]
where the restrictions $\kappa_i \ge 0$ and $\sum_{i=1}^{n} \kappa_i = 1$ apply. $\theta$ denotes the parameter vector $(\kappa_i, \mu_i, \Sigma_i)_{i=1}^{n}$. The $p(x \mid i, \mu_i, \Sigma_i)$ are multivariate normal densities:
\[
p(x \mid i, \mu_i, \Sigma_i) = (2\pi)^{-d/2} \, |\Sigma_i|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) \right).
\]
The Gaussian mixture model is well suited to approximate continuous probability densities. Based on the model and given the data $\mathbf{x}$, we may formulate the log-likelihood as

\[
l(\theta) = \log \left[ \prod_{k=1}^{m} p(x^k|\theta) \right] = \sum_{k=1}^{m} \log \sum_{i=1}^{n} \kappa_i \, p(x^k \mid i, \mu_i, \Sigma_i).
\]

Maximum likelihood parameter estimates $\hat\theta$ may efficiently be computed with the EM (Expectation Maximization) algorithm ([RW84], [DLR89]). It consists of the iterative application of the following two steps:

1. In the E-step, based on the current parameter estimates, the posterior probability that unit $i$ is responsible for the generation of pattern $x^k$ is estimated as
\[
h_i^k = \frac{\kappa_i \, p(x^k \mid i, \mu_i, \Sigma_i)}{\sum_{i'=1}^{n} \kappa_{i'} \, p(x^k \mid i', \mu_{i'}, \Sigma_{i'})}. \tag{1}
\]

2. In the M-step, we obtain new parameter estimates (denoted by the prime) using
\[
\kappa_i' = \frac{1}{m} \sum_{k=1}^{m} h_i^k \tag{2}
\]
\[
\mu_i' = \frac{\sum_{k=1}^{m} h_i^k x^k}{\sum_{l=1}^{m} h_i^l} \tag{3}
\]
\[
\Sigma_i' = \frac{\sum_{k=1}^{m} h_i^k (x^k - \mu_i')(x^k - \mu_i')^t}{\sum_{l=1}^{m} h_i^l}. \tag{4}
\]

Note that $\kappa_i'$ is a scalar, whereas $\mu_i'$ denotes a $d$-dimensional vector and $\Sigma_i'$ is a $d \times d$ matrix. It is well known that training neural networks as predictors using the maximum likelihood parameter estimate leads to overfitting. The problem of overfitting is even more severe in density estimation due to singularities in the log-likelihood error surface. Obviously, the data likelihood is maximized in a trivial way if we concentrate all the probability mass on one or several samples of the training set. In a Gaussian mixture this is precisely the case if the center of a unit coincides with one of the data points and the unit's standard deviation approaches zero. Figure 1 compares the true and the estimated probability density in a toy problem. As may be seen, the contraction of the Gaussians results in (possibly infinitely) high peaks in the Gaussian mixture density estimate. A simple way to achieve numerical stability is to artificially enforce a lower bound on the units' standard deviations. This is a very crude form of regularization, however, and usually results in low generalization capabilities. The problem becomes even more severe in high-dimensional spaces, because the average distance between the data points increases. To yield reasonable approximations, we will apply two methods of regularization, which are discussed in the following two sections.
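For concreteness, the unregularized updates (1)-(4) can be written in a few lines of NumPy. The following minimal sketch uses illustrative function and variable names (it is not the implementation used in the experiments) and deliberately includes no numerical safeguards, so the degenerate solutions described above can occur in practice:

```python
import numpy as np

def em_step(X, kappa, mu, Sigma):
    """One EM iteration for a Gaussian mixture, following (1)-(4).

    X: (m, d) data, kappa: (n,) mixture weights, mu: (n, d), Sigma: (n, d, d).
    """
    m, d = X.shape
    n = kappa.shape[0]

    # E-step (1): responsibilities h[k, i]
    h = np.empty((m, n))
    for i in range(n):
        diff = X - mu[i]                                  # (m, d)
        inv = np.linalg.inv(Sigma[i])
        quad = np.einsum('kd,de,ke->k', diff, inv, diff)  # Mahalanobis distances
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma[i]) ** -0.5
        h[:, i] = kappa[i] * norm * np.exp(-0.5 * quad)
    h /= h.sum(axis=1, keepdims=True)

    # M-step (2)-(4)
    Nk = h.sum(axis=0)                                    # effective counts per unit
    kappa_new = Nk / m
    mu_new = (h.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for i in range(n):
        diff = X - mu_new[i]
        Sigma_new[i] = (h[:, i, None] * diff).T @ diff / Nk[i]
    return kappa_new, mu_new, Sigma_new
```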


Figure 1: Comparison of true density and unregularized density estimation.

3 Bayesian Regularization

In this section we will propose a Bayesian prior distribution on the Gaussian mixture parameter space which leads to a numerically stable version of the EM algorithm. First consider the case in which we know which data point was generated by which Gaussian unit; we will call this the complete-data case. We now select prior distributions on the parameters which are conjugate priors assuming complete data. Selecting the conjugate prior has a

number of advantages. In particular, for complete data, we obtain analytic solutions for the posterior density and the predictive density. In our case, we do not know which data point was generated by which Gaussian, and in this sense we are faced with a missing-data problem. For incomplete data it is not possible to obtain closed-form solutions for the posterior distributions, but we can derive mathematically convenient EM update rules to obtain the MAP parameter estimates. A conjugate prior for a single multivariate normal density is the product of a normal density $N(\mu_i \mid \hat\mu, \eta^{-1}\Sigma_i)$ and a Wishart density $Wi(\Sigma_i^{-1} \mid \alpha, \beta)$ ([Bun94]). A proper conjugate prior for the mixture weightings $\kappa = (\kappa_1, \ldots, \kappa_n)$ is a Dirichlet density $D(\kappa|\gamma)$. For a definition of these densities, see the appendix. Our goal is to find the MAP parameter estimate, that is, the parameters which maximize the log-posterior

\[
l_p(\theta) = \sum_{k=1}^{m} \log \sum_{i=1}^{n} \kappa_i \, p(x^k \mid i, \mu_i, \Sigma_i) + \log D(\kappa|\gamma) + \sum_{i=1}^{n} \left[ \log N(\mu_i \mid \hat\mu, \eta^{-1}\Sigma_i) + \log Wi(\Sigma_i^{-1} \mid \alpha, \beta) \right].
\]

As in the unregularized case, we may use the EM algorithm to find a local maximum of $l_p(\theta)$ (for a derivation, see the appendix). The E-step is identical to (1). The M-step becomes
\[
\kappa_i' = \frac{\sum_{k=1}^{m} h_i^k + \gamma_i - 1}{m + \sum_{i=1}^{n} \gamma_i - n} \tag{5}
\]
\[
\mu_i' = \frac{\sum_{k=1}^{m} h_i^k x^k + \eta \hat\mu}{\sum_{l=1}^{m} h_i^l + \eta} \tag{6}
\]
\[
\Sigma_i' = \frac{\sum_{k=1}^{m} h_i^k (x^k - \mu_i')(x^k - \mu_i')^t + \eta (\mu_i' - \hat\mu)(\mu_i' - \hat\mu)^t + 2\beta}{\sum_{l=1}^{m} h_i^l + 1 + 2\left(\alpha - (d+1)/2\right)}. \tag{7}
\]

As is typical for conjugate priors, the prior knowledge corresponds to a set of artificial training data, which is also reflected in the EM update equations. In (7) the additional terms introduce a lower bound for the elements of $\Sigma_i'$, since the numerator is (elementwise) greater than or equal to $2\beta$ and the denominator has an upper limit. In our experiments we focus on a prior on the variances, which is implemented by $\beta \neq 0$, where $0$ denotes the $d \times d$ zero matrix. All other parameters we set to "neutral" values:
\[
\gamma_i = 1 \;\; \forall i: 1 \le i \le n, \qquad \alpha = (d+1)/2, \qquad \eta = 0, \qquad \beta = \vartheta I_d.
\]
$I_d$ is the $d \times d$ identity matrix. Note that the scalar $\vartheta$ introduces a bias which favors large variances. The effect of various choices of the regularization parameter $\vartheta$ on the density estimate is illustrated in figure 2. If $\vartheta$ is chosen too small, overfitting still occurs; if it is chosen too large, on the other hand, the model is not powerful enough to recognize the underlying structure. Typically, $\vartheta$ is not known a priori. The simplest procedure consists of using the $\vartheta$ which leads to the best performance on a validation set, analogous to the determination of the optimal weight decay parameter in neural network training.

"cir_g0_1.dat"

"cir_g0_05.dat"

2 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

1.5 1 0.5 0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 0

 = 0:01

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

 = 0:05 "cir_g0_1.dat"

0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Figure 2: Density estimates for various regularization parameters ($\vartheta = 0.01$, $0.05$, and $0.1$).

This procedure obviously requires the availability of a sufficiently large set of sample data. This is not a requirement for the ensemble-averaging method which will be proposed in the following section. The main advantage of the Bayesian method, however, is that, in contrast to averaging, only negligible additional computations are required compared with standard EM.
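For illustration, the penalized M-step (5)-(7) requires only a minor modification of the standard update. The following NumPy sketch assumes the parameter choices of this section ($\gamma_i = 1$, $\eta = 0$, $\alpha = (d+1)/2$, $\beta = \vartheta I_d$); the function name and interface are illustrative only:

```python
import numpy as np

def map_m_step(X, h, vartheta):
    """Penalized M-step (5)-(7) with gamma_i = 1, eta = 0, alpha = (d+1)/2, beta = vartheta * I_d.

    X: (m, d) data, h: (m, n) responsibilities from the E-step (1), vartheta: scalar prior strength.
    """
    m, d = X.shape
    Nk = h.sum(axis=0)                         # sum_k h_i^k, shape (n,)
    kappa = Nk / m                             # (5) reduces to the usual update for gamma_i = 1
    mu = (h.T @ X) / Nk[:, None]               # (6) reduces to (3) for eta = 0
    Sigma = np.empty((h.shape[1], d, d))
    for i in range(h.shape[1]):
        diff = X - mu[i]
        scatter = (h[:, i, None] * diff).T @ diff
        # (7): the 2*vartheta*I_d term keeps the diagonal bounded away from zero,
        # and the +1 in the denominator caps the influence of units with few responsibilities.
        Sigma[i] = (scatter + 2.0 * vartheta * np.eye(d)) / (Nk[i] + 1.0)
    return kappa, mu, Sigma
```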

4 Averaging Gaussian Mixtures

In this section we average the predictions of several Gaussian mixtures to yield an improved probability density estimate. Averaging over neural network ensembles has successfully been applied to regression and classification tasks ([PC93], [Bre94]). Perrone showed that the squared error of the average is always smaller than (or equal to) the average of the squared errors. In our experiments we tried different averaging variants. First, one may simply train all networks on the complete set of training data. In this case, the only source of disagreement between the individual predictions consists in the different local solutions found by the error minimization procedure due to different starting points. Disagreement is essential to obtain an improvement by averaging, however, so this procedure only seems advantageous in cases where the relation between training data and weights is highly non-deterministic, in the sense that, in training, different solutions are found from different random starting points. A straightforward way to increase the disagreement is to train each network on a resampled version of the original data set. If we resample the data without replacement, we obviously obtain a data set of reduced size, in our experiments 70% of the original. Resampling with replacement corresponds to drawing random samples from the empirical distribution induced by the training data and represents the procedure commonly applied in bootstrapping ([ET94]). The averaging of neural network predictions based on resampling with replacement has recently been popularized under the name "bagging" by Breiman ([Bre94]), who obtained dramatically improved results in several classification tasks. He also notes, however, that an actual improvement of the prediction can only result if the estimation procedure is relatively unstable. As discussed, this is particularly the case for Gaussian mixture training, and we therefore expect bagging to be well suited for our task. The quality of a predictor based on common linear regression, for example, cannot be improved with the bagging approach. There is a cross-over point after which the "contamination" of the data due to the resampling outweighs the increased stability of the prediction.
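The following sketch illustrates the bagging variant (resampling with replacement). It reuses the em_step sketch from section 2; all names, the number of models, and the initialization are illustrative choices rather than the exact setup used in the experiments, and numerical safeguards are again omitted:

```python
import numpy as np

def mixture_pdf(x, kappa, mu, Sigma):
    """Density of a Gaussian mixture evaluated at the points x, shape (m, d)."""
    m, d = x.shape
    p = np.zeros(m)
    for k_i, mu_i, S_i in zip(kappa, mu, Sigma):
        diff = x - mu_i
        quad = np.einsum('kd,de,ke->k', diff, np.linalg.inv(S_i), diff)
        p += k_i * (2 * np.pi) ** (-d / 2) * np.linalg.det(S_i) ** -0.5 * np.exp(-0.5 * quad)
    return p

def bagged_density(X, n_models=10, n_units=40, n_iter=50, seed=0):
    """Train Gaussian mixtures on bootstrap resamples and average the resulting densities."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    models = []
    for _ in range(n_models):
        Xb = X[rng.integers(0, m, size=m)]                    # resample with replacement ("bagging")
        kappa = np.full(n_units, 1.0 / n_units)
        mu = Xb[rng.choice(m, size=n_units, replace=False)]   # centers at resampled data points (n_units <= m)
        Sigma = np.repeat(np.cov(Xb.T)[None], n_units, axis=0)
        for _ in range(n_iter):
            kappa, mu, Sigma = em_step(Xb, kappa, mu, Sigma)
        models.append((kappa, mu, Sigma))
    # the ensemble estimate is the plain average of the individual mixture densities
    return lambda x: np.mean([mixture_pdf(x, *p) for p in models], axis=0)
```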

5 Experiments and Results

To assess the practical advantage resulting from the regularization, we used the density estimates to construct classifiers and compared the resulting prediction accuracies on a toy problem and a real-world problem. The reason is that the generalization error of density estimates in terms of the (negative) likelihood of the generalization data is rather unintuitive, whereas performance on a classification problem provides a good impression of the degree of improvement. Assume we have a set of $N$ labeled data $z = \{(x^k, l^k) \mid k = 1, \ldots, N\}$, where $l^k \in \{1, \ldots, C\}$ denotes the class label of each input $x^k$. A classifier of new inputs $x$ is obtained by choosing the class $l$ with the maximum posterior class probability $p(l|x)$. The posterior probabilities may be derived from the class-conditional data likelihood $p(x|l)$ via Bayes' theorem:
\[
p(l|x) = \frac{p(x|l)\,p(l)}{p(x)} \propto p(x|l)\,p(l).
\]
The resulting partitions of the input space are optimal for the true $p(l|x)$. A viable way to approximate the posterior $p(l|x)$ is to estimate $p(x|l)$ and $p(l)$ from the sample data. The training of the classifier is not discriminative, i.e. not directed at optimizing the decision borders between class partitions. While this may be a disadvantage for some tasks, one gains flexibility and modularity in application by not having to specify in advance which classes are to be distinguished.
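A sketch of the resulting classifier construction (one density estimate per class, priors taken from the class frequencies; the interface is illustrative, and any regularized or averaged estimator from the previous sections can be plugged in as the density estimator):

```python
import numpy as np

def fit_bayes_classifier(X, labels, estimate_density):
    """Fit one density estimate per class and classify via p(l|x) ~ p(x|l) p(l).

    estimate_density(X_l) must return a callable p(x), e.g. the bagged Gaussian
    mixture estimator sketched in section 4.
    """
    classes = np.unique(labels)
    priors = {l: np.mean(labels == l) for l in classes}                   # p(l) from class frequencies
    densities = {l: estimate_density(X[labels == l]) for l in classes}    # class-conditional p(x|l)

    def predict(Xnew):
        scores = np.stack([priors[l] * densities[l](Xnew) for l in classes], axis=1)
        return classes[np.argmax(scores, axis=1)]
    return predict
```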

5.1 Toy problem

In the toy classification problem we try to discriminate the two classes of circularly arranged data shown in figure 3. We generated 500 data points for each class and subdivided them into three sets of 100/200/200 data points. The first of these was used for training, the second as a validation set to determine the optimal regularization parameter, and the third to test the generalization performance. As a network architecture we chose a Gaussian mixture with 40 hidden units.

Algorithm                   Training A   Training B   Validation A   Validation B   Accuracy
unregularized               1.16*        1.08*        1.92           2.22           75.8 %
Bayes (ϑ = 0.05)            1.27         1.29         1.83           1.90           81.5 %
Bayes (ϑ = 0.10)            1.62         1.71         1.72*          1.77*          82.8 %*
Bayes (ϑ = 0.20)            2.22         2.20         2.22           2.23           64.5 %
averaging (local minima)    -            -            -              -              76.3 %
averaging (70 % subset)     -            -            -              -              77.5 %
averaging (bagging)         -            -            -              -              79.0 %

Table 1: Performances in the toy classification problem.

Table 1 summarizes the results of the various density estimation methods.

Figure 3: Toy classification task.

The performances on the training and validation sets are measured in terms of the negative data log-likelihood; as usual, smaller values indicate better performance. We report separate results for classes A and B, since the densities of the two classes were estimated separately. The final row shows the prediction accuracy in terms of the percentage of correctly classified data in the test set. The best results in each category are marked with an asterisk. In line with our expectations, both regularization methods outperform the plain Gaussian mixture approach in terms of correct classification. The performance of the Bayesian regularization is, however, very sensitive to the appropriate choice of the regularization parameter $\vartheta$. Optimality of $\vartheta$ with respect to the density prediction on the validation set and optimality with respect to the prediction accuracy on the test set coincide. The bagging approach is the best of the averaging methods, though its performance is still inferior to the Bayesian approach.


5.2 BUPA liver disorder classification

As a second task we applied our methods to a real-world decision problem from the medical domain. The problem is to detect liver disorders which might arise from excessive alcohol consumption. The available information consists of five blood tests as well as a measure of the patients' daily alcohol consumption. We subdivided the 345 available samples into a training set of 200 and a test set of 145 samples. Due to the relatively small amount of data, we did not try to determine the optimal regularization parameter using a validation process and instead report results on the test set for different parameter values ϑ.

Algorithm                   Accuracy
unregularized               64.8 %
Bayes (ϑ = 0.05)            65.5 %
Bayes (ϑ = 0.10)            66.9 %
Bayes (ϑ = 0.20)            61.4 %
averaging (local minima)    65.5 %
averaging (70 % subset)     72.4 %
averaging (bagging)         71.0 %

Table 2: Performances in the liver disorder classification problem.

The results of our experiments are shown in table 2. Again, both regularization methods led to an improvement in prediction accuracy. In contrast to the toy problem, the averaged predictor was superior to the Bayesian approach here. Note that the resampling led to an improvement of more than five percentage points compared to unresampled averaging.

6 Conclusion

We proposed a Bayesian and an averaging approach to regularize Gaussian mixture density estimates. Both led to considerably improved generalization capabilities of density-based predictors in a toy and a real-world classification task. Interestingly, neither of the methods outperformed the other in both tasks. This might be explained by the fact that Gaussian mixture density estimates are particularly unstable in high-dimensional spaces with relatively few data; the benefit of averaging might thus be greater in this case. Averaging proved to be particularly effective if applied in connection with resampling of the training data, a result which corresponds to similar observations in regression and classification tasks. What we did not discuss so far is that averaging also entails considerable computational effort, which is not the case for Bayesian regularization. The advantage of averaging, however, is that one does not have to worry about the determination of hyperparameters. If needed, the optimal Bayesian regularization parameter $\vartheta$ might also be determined using appropriate Bayesian methods ([Mac91]).

7 Acknowledgements

The first author would like to thank Prof. Wilfried Brauer, TU München, for his constant support. The liver disorder data were gathered by BUPA Medical Research and commendably made available through the UCI Machine Learning Database.

Appendix: Log-posterior maximization

In a Bayesian framework, instead of maximizing the mere data likelihood we maximize the log-posterior, given an appropriate prior distribution on the weight space. This is equivalent to maximizing $l(\theta; \mathbf{x}) = \log p(\theta; \mathbf{x}) = \log p(\mathbf{x}|\theta) + \log p(\theta)$ with respect to the parameters. As mentioned in section 3, we choose the prior from the conjugate prior distribution family of the normal distribution to yield a tractable posterior distribution. We obtain a product of a Dirichlet, a normal and a Wishart distribution:
\[
p(\theta) = p(\kappa, \mu, \Sigma \mid \gamma, \hat\mu, \eta, \alpha, \beta) = D(\kappa|\gamma) \prod_{i=1}^{n} N(\mu_i \mid \hat\mu, \eta^{-1}\Sigma_i) \, Wi(\Sigma_i^{-1} \mid \alpha, \beta),
\]
where
\[
D(\kappa|\gamma) = b \prod_{i=1}^{n} \kappa_i^{\gamma_i - 1}, \quad \text{with } \kappa_i \ge 0 \text{ and } \sum_{i=1}^{n} \kappa_i = 1,
\]
\[
N(\mu_i \mid \hat\mu, \eta^{-1}\Sigma_i) = (2\pi)^{-d/2} \, |\eta^{-1}\Sigma_i|^{-1/2} \exp\!\left( -\tfrac{\eta}{2} (\mu_i - \hat\mu)^t \Sigma_i^{-1} (\mu_i - \hat\mu) \right),
\]
\[
Wi(\Sigma_i^{-1} \mid \alpha, \beta) = c \, |\Sigma_i^{-1}|^{\alpha - (d+1)/2} \exp\!\left( -\mathrm{tr}(\beta \Sigma_i^{-1}) \right).
\]
$b$ and $c$ are normalizing factors depending on $\gamma$, $\alpha$, and $\beta$. $\mathrm{tr}(\cdot)$ is the trace operator of a matrix. As previously mentioned, we apply the EM algorithm to maximize the incomplete-data log-posterior. Following the common procedure, we determine the expected complete-data log-posterior in the E-step (the unprimed parameters are the current estimates, the primed parameters are the new estimates):
\[
Q(\theta'|\theta) = \sum_{k=1}^{m} \sum_{i=1}^{n} h_i^k \left[ \log \kappa_i' + \log p(x^k \mid i, \mu_i', \Sigma_i') \right] + \log D(\kappa'|\gamma) + \sum_{i=1}^{n} \left[ \log N(\mu_i' \mid \hat\mu, \eta^{-1}\Sigma_i') + \log Wi(\Sigma_i'^{-1} \mid \alpha, \beta) \right].
\]

$h_i^k$ is the posterior probability that $x^k$ was generated by Gaussian $i$ and is computed as
\[
h_i^k = \frac{\kappa_i \, p(x^k \mid i, \mu_i, \Sigma_i)}{\sum_{i'=1}^{n} \kappa_{i'} \, p(x^k \mid i', \mu_{i'}, \Sigma_{i'})}.
\]
In the M-step, we maximize $Q(\theta'|\theta)$ with respect to $\theta'$. This task may be decomposed into the following two optimization problems:
\[
\kappa' = \arg\max_{\dot\kappa} \left[ \sum_{k=1}^{m} \sum_{i=1}^{n} h_i^k \log \dot\kappa_i + \log D(\dot\kappa|\gamma) \right] \tag{8}
\]
\[
(\mu_i', \Sigma_i') = \arg\max_{\dot\mu_i, \dot\Sigma_i} \left[ \sum_{k=1}^{m} \sum_{i=1}^{n} h_i^k \log p(x^k \mid i, \dot\mu_i, \dot\Sigma_i) + \sum_{i=1}^{n} \left[ \log N(\dot\mu_i \mid \hat\mu, \eta^{-1}\dot\Sigma_i) + \log Wi(\dot\Sigma_i^{-1} \mid \alpha, \beta) \right] \right]. \tag{9}
\]

Both may be solved analytically by setting the derivatives to zero. In equation (8), we have to take into account the constraint $\sum_i \dot\kappa_i = 1$, which can be enforced using Lagrange multipliers. This yields the Lagrange function
\[
L(\kappa', \lambda) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{m} h_i^k \right] \log \kappa_i' + \log D(\kappa'|\gamma) - \lambda \left[ \sum_{i=1}^{n} \kappa_i' - 1 \right].
\]

We thus have to solve
\[
\frac{\partial L}{\partial \kappa_i'} = \sum_{k=1}^{m} h_i^k \, \kappa_i'^{-1} + \underbrace{(\gamma_i - 1)\,\kappa_i'^{-1}}_{\partial \log D(\kappa'|\gamma) / \partial \kappa_i'} - \lambda = 0
\]
\[
\frac{\partial L}{\partial \lambda} = -\left[ \sum_{i=1}^{n} \kappa_i' - 1 \right] = 0.
\]
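Explicitly, the first condition gives $\kappa_i' = \left(\sum_{k=1}^{m} h_i^k + \gamma_i - 1\right)/\lambda$; summing over $i$ and using $\sum_{i=1}^{n} h_i^k = 1$ for every $k$ together with the second condition fixes the multiplier:
\[
\lambda = \sum_{i=1}^{n} \sum_{k=1}^{m} h_i^k + \sum_{i=1}^{n} \gamma_i - n = m + \sum_{i=1}^{n} \gamma_i - n.
\]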

Substituting back yields (5). To solve (9), we first have to compute the derivatives of the involved terms with respect to the elements of $\mu_i'$ and $\Sigma_i'^{-1}$. We denote the corresponding gradients as $\nabla_{\mu_i'}$ and $\nabla_{\Sigma_i'^{-1}}$. Setting them to zero yields
\[
\sum_{k=1}^{m} h_i^k \, \Sigma_i'^{-1} (x^k - \mu_i') \;\underbrace{-\; \eta\, \Sigma_i'^{-1} (\mu_i' - \hat\mu)}_{\nabla_{\mu_i'} \log N(\mu_i' \mid \hat\mu, \eta^{-1}\Sigma_i')} \;=\; 0
\]
\[
\frac{1}{2} \sum_{k=1}^{m} h_i^k \left[ \Sigma_i' - (x^k - \mu_i')(x^k - \mu_i')^t \right]
\;+\; \underbrace{\tfrac{1}{2}\left[ \Sigma_i' - \eta\,(\mu_i' - \hat\mu)(\mu_i' - \hat\mu)^t \right]}_{\nabla_{\Sigma_i'^{-1}} \log N(\mu_i' \mid \hat\mu, \eta^{-1}\Sigma_i')}
\;+\; \underbrace{\left( \alpha - (d+1)/2 \right)\Sigma_i' - \beta}_{\nabla_{\Sigma_i'^{-1}} \log Wi(\Sigma_i'^{-1} \mid \alpha, \beta)}
\;=\; 0.
\]

To determine the gradients of $\log N(\mu_i' \mid \hat\mu, \eta^{-1}\Sigma_i')$ and $\log Wi(\Sigma_i'^{-1} \mid \alpha, \beta)$, one has to compute the derivatives of several matrix operators, one of which is the logarithm of a determinant. It is helpful to know that
\[
\frac{\partial \log |A|}{\partial a_{ij}} = \left( A^{-1} \right)_{ji} \qquad \text{for } A \in \mathbb{R}^{d \times d}, \; 1 \le i \le d, \; 1 \le j \le d.
\]
After some algebra one obtains the update equations (6) and (7) of section 3.
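This identity is easy to verify numerically; a small self-contained check with arbitrary test values (the covariance-like matrix and tolerances are illustrative):

```python
import numpy as np

# Finite-difference check of d(log|A|)/d(a_ij) = (A^{-1})_ji on a symmetric positive definite A,
# the case relevant here since A plays the role of a precision matrix Sigma^{-1}.
rng = np.random.default_rng(0)
d = 4
B = rng.normal(size=(d, d))
A = B @ B.T + np.eye(d)                 # symmetric positive definite, so log|A| is well defined
eps = 1e-6
grad = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        grad[i, j] = (np.log(np.linalg.det(Ap)) - np.log(np.linalg.det(Am))) / (2 * eps)
print(np.max(np.abs(grad - np.linalg.inv(A).T)))   # small (~1e-8); note inv(A).T == inv(A) for symmetric A
```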

References

[Bis94] C. M. Bishop. Mixture density networks. Technical report, Aston University, 1994.

[Bre94] L. Breiman. Bagging predictors. Technical report, UC Berkeley, 1994.

[Bun94] W. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225, 1994.

[CGJ94] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. In Advances in Neural Information Processing Systems 6. MIT Press, 1994.

[DLR89] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 1989.

[ET94] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1994.

[Gha94] Z. Ghahramani. Solving inverse problems using an EM approach to density estimation. In Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, 1994.

[GJ94] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6. MIT Press, 1994.

[HT94] T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Technical report, AT&T Bell Labs and University of Toronto, 1994.

[KL95] N. Kambhatla and T. K. Leen. Classifying with Gaussian mixtures and clusters. In Advances in Neural Information Processing Systems 7. MIT Press, 1995.

[Mac91] D. J. C. MacKay. Bayesian Modelling and Neural Networks. PhD thesis, California Institute of Technology, Pasadena, 1991.

[NHFO94] R. Neuneier, F. Hergert, W. Finnoff, and D. Ormoneit. Estimation of conditional densities: A comparison of neural network approaches. Proceedings of the International Conference on Artificial Neural Networks, 1:689-692, 1994.

[Now91] S. J. Nowlan. Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1991.

[Orm93] D. Ormoneit. Estimation of probability densities using neural networks. Master's thesis, Technische Universität München, 1993.

[PC93] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing. Chapman & Hall, 1993.

[RW84] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 1984.

[TAN94] V. Tresp, S. Ahmad, and R. Neuneier. Training neural networks with deficient data. In Advances in Neural Information Processing Systems 6. MIT Press, 1994.

[TT95] V. Tresp and M. Taniguchi. Combining estimators using non-constant weighting functions. In Advances in Neural Information Processing Systems 7. MIT Press, 1995.

[VTA92] V. Tresp, J. Hollatz, and S. Ahmad. Network structuring and training using rule-based knowledge. Technical report, Siemens AG, Central Research and Development, Munich, 1992.