Neural Processing Letters 9: 63–76, 1999. © 1999 Kluwer Academic Publishers. Printed in the Netherlands.


Mixture Density Estimation Based on Maximum Likelihood and Sequential Test Statistics

N.A. VLASSIS*, G. PAPAKONSTANTINOU and P. TSANAKAS
Department of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus, 15773 Athens, Greece. http://www.dsdab.ece.ntua.gr

Abstract. We address the problem of estimating an unknown probability density function from a sequence of input samples. We approximate the input density with a weighted mixture of a finite number of Gaussian kernels whose parameters and weights we estimate iteratively from the input samples using the Maximum Likelihood (ML) procedure. In order to decide on the correct total number of kernels we employ simple statistical tests involving the mean, variance, and the kurtosis, or fourth moment, of a particular kernel. We demonstrate the validity of our method in handling both pattern classification (stationary) and time series (nonstationary) problems.

Key words: Gaussian mixtures, PNN, semi-parametric estimation, number of mixing components, test statistics, stationary distributions, nonstationary distributions

1. Introduction

In this paper we study the problem of estimating the probability density function, or density, of a random variable X based on a sequence of observations, or samples, xi of X. Knowing the exact form of a random variable's density can be helpful in designing optimal classification rules as in statistical discriminant analysis [1], performing probabilistic clustering of the variable's domain [2], or predicting the evolution of a time series [3]. For the input density we choose a particular parametric model, namely, the finite Gaussian mixture model [4]. This approximates the unknown density p(x) by a finite weighted sum of normal (Gaussian) densities, or kernels, as

p(x) = \sum_{j=1}^{K} \pi_j f_j(x),    (1)

where fj(x) is in the univariate case the normal density N(mj, sj) with mean mj and variance sj². In the above formula the πj denote the mixing weights and K is the number of kernels.

* New affiliation since September 1998: University of Amsterdam, Dept. of Computer Science, Kruislaan 403, 1098 SJ Amsterdam, E-mail: [email protected]


We estimate both the mixing weights and the parameters, e.g., mean and variance, of each kernel from the training set iteratively using the Maximum Likelihood (ML) procedure. We assume that the input samples xi are obtained sequentially, one at a time, rather than as a batch. This is typical for neural network algorithms and may prove helpful in several pattern recognition and signal detection applications [5].

The difficulty in using such mixture models for estimating an unknown density lies mainly in the fact that the total number K of kernels is not known in advance. Thus, although ML techniques have been successfully applied in the estimation of the weights and the parameters of the kernels [6], they cannot be used directly for finding the total number of kernels, since they would then impose a separate kernel for each input sample, resulting in over-fitting. On the other hand, most of the precise mathematical techniques for estimating the number of kernels in mixtures (see [4, ch. 5]) usually assume a batch training set and thus cannot be used in our framework.

In the following we propose a solution to the problem of estimating the unknown number of kernel densities when the ML procedure for model fitting is used. In parallel to applying the ML technique for the estimation of the mixing weights and kernel parameters, we apply at each iteration step simple statistical tests involving the mean, the variance, and the kurtosis, or fourth moment, of the kernels to decide when a kernel should split in two or when two kernels should join in one, and a simple test involving the mixing weight of a kernel to decide upon its removal. This work is a continuation of [7]. We demonstrate herein the validity of our method in examples of unknown density estimation, pattern classification, and time series problems, and discuss its usefulness in other problems.

2. Related Work

Recent research in neural networks has proposed several models for mixture density estimation that follow the above definitions. The Probabilistic Neural Network (PNN) [8] is a nonparametric kernel model that stores for each input sample a separate Gaussian kernel function with constant mean and variance, and can be regarded as the neural network realization of the method of Parzen [9]. The mixing weights are assumed equal among all kernels and equal to the reciprocal of the total number of input samples. The constraints the original PNN model imposed on the parameters of the kernels and the mixing weights were relaxed in subsequent works [10, 11, 12, 13]. In [11] different mixing weights are used, while in [10, 12] the kernels are represented by multivariate Gaussian functions whose parameters are estimated by the Maximum Likelihood technique, similarly to our approach here.

However, most of the above approaches assume that the total number K of kernels is known a priori, and it turns out that the automatic estimation of K is a difficult problem [4, 1]. A neural network method for deciding on the normality of a particular kernel, based on Pearson's χ² test statistic, is proposed in [13]. There, each kernel is divided into equiprobable regions and the number of input samples falling in each region is counted. From these counts an appropriate test statistic is formed that follows the χ² distribution; failure of the test implies nonnormality of the underlying kernel, which is then split. However, it is difficult to combine that test with sequential estimation procedures, i.e., when the parameters of the underlying distribution are continuously modified.

For the same problem of estimating the total number of kernels, mathematical methods that assume the training set is given in batch form have been developed, based on the likelihood ratio test statistic [14], the method of moments [15], and graphical methods or tests for unimodality [4]. In most of these cases, however, it is difficult to find the asymptotic distribution of the respective test statistic because the regularity conditions of the underlying tests are not satisfied, in which case empirical Monte Carlo bootstrapping techniques are the only solution.

3. The mixture model

We say that a random variable X has a finite mixture distribution when it can be represented by a probability density function in the form of (1). In the following we will assume that the kernels are univariate Gaussian functions and generalize to the multivariate case in the next section. Under this framework the unknown density p(x) of the one-dimensional random variable X can be written as

p(x) = \pi_1 f_1(x; \mu_1, \sigma_1) + \cdots + \pi_K f_K(x; \mu_K, \sigma_K),    (2)

where fj(x; µj, σj) denotes the normal density N(µj, σj),

f_j(x; \mu_j, \sigma_j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left[ \frac{-(x - \mu_j)^2}{2\sigma_j^2} \right],    (3)

parametrized on the mean µj and the variance σj², and K is the total number of kernels. In order for p(x) to be a correct probability density function with integral 1 over the input space, the additional constraints

\sum_{j=1}^{K} \pi_j = 1, \qquad \pi_j \geq 0    (4)

must hold. We further assume that a sequence of independent observations or samples x1, ..., xn, ... of the random variable X is received, one at a time, from which we must estimate the unknown density of X. In other words, we must fit the above mixture parametric model to the input data. Such an estimation procedure requires evaluating from the training sequence the total number K of kernels, together with the 3K unknown parameters µj, σj, and πj.
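To make the model concrete, the following minimal Python/NumPy sketch evaluates the mixture density of (2)–(4) at a point. The function names and the example parameter values are ours, chosen only for illustration, and are not part of the original method.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Normal density N(mu, sigma) of equation (3).
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, pi, mu, sigma):
    # Weighted sum of K Gaussian kernels, equations (1)-(2).
    # The weights pi must be nonnegative and sum to one, equation (4).
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

# Example: a two-kernel mixture evaluated at x = 0.5.
pi = [0.3, 0.7]
mu = [0.0, 2.0]
sigma = [1.0, 0.5]
print(mixture_pdf(0.5, pi, mu, sigma))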

3.1. THE MAXIMUM LIKELIHOOD PROCEDURE

The joint density p(x1, ..., xn; ψ) = p(x1; ψ) ··· p(xn; ψ) of a sequence of n independent samples xi of the random variable X, regarded as a function of the unknown parameter vector ψ, with ψ = [µ1, σ1, π1, ..., µK, σK, πK], is called the likelihood function of the samples. The vector that maximizes the likelihood function, or equivalently its logarithm

L(x_1, \ldots, x_n; \psi) = \ln p(x_1, \ldots, x_n; \psi) = \sum_{i=1}^{n} \ln p(x_i; \psi),    (5)

is called the Maximum Likelihood (ML) estimate of ψ. Under reasonable assumptions it can be shown [6] that a unique ML estimate of the parameter vector ψ exists, which at least locally maximizes the likelihood function and is the root of the equation

\nabla_{\psi} L(x_1, \ldots, x_n; \psi) = 0,    (6)

which for a parameter θj, with θj = µj or θj = σj², and using (1) and (5) reads

\frac{\partial L(x_1, \ldots, x_n; \psi)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial \ln p(x_i; \psi)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{1}{p(x_i; \psi)} \frac{\partial p(x_i; \psi)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\pi_j}{p(x_i; \psi)} \frac{\partial f_j(x_i; \theta_j)}{\partial \theta_j} = 0.    (7)
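For concreteness, the log-likelihood (5) that these estimating equations maximize can be computed from a batch of samples as in the short sketch below; it reuses the hypothetical mixture_pdf helper from the earlier sketch.

import numpy as np

def log_likelihood(samples, pi, mu, sigma):
    # Log-likelihood of equation (5): sum of log mixture densities over the samples.
    return sum(np.log(mixture_pdf(x, pi, mu, sigma)) for x in samples)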

In this context we ignore singularity problems, i.e., we assume that the log-likelihood function is bounded above. If singularities become a problem, appropriate constraints must be placed on the parameters σj as in [4], of the form min(σi/σj) ≥ c > 0, i ≠ j, with c constant. Using the continuous version of Bayes' theorem [4] we estimate the probability that a particular input sample xi is drawn from a kernel j as

P\{j \mid x_i\} = \frac{\pi_j f_j(x_i; \mu_j, \sigma_j)}{p(x_i; \psi)},    (8)

with the mixing weights πj regarded as prior probabilities on the kernels, or simply priors, and thus a further simplification of (7) is possible as

\sum_{i=1}^{n} \frac{P\{j \mid x_i\}}{f_j(x_i; \theta_j)} \frac{\partial f_j(x_i; \theta_j)}{\partial \theta_j} = \sum_{i=1}^{n} P\{j \mid x_i\} \frac{\partial \ln f_j(x_i; \theta_j)}{\partial \theta_j} = 0,    (9)

which differs from (7) by the P{j | xi} term and can be regarded as a separate, weighted maximum likelihood estimation on each kernel. Inserting the formula for the Gaussian (3) yields the ML estimates of the mean µj and variance σj² of a kernel j as

\mu_j = \frac{\sum_{i=1}^{n} P\{j \mid x_i\}\, x_i}{\sum_{i=1}^{n} P\{j \mid x_i\}},    (10)

\sigma_j^2 = \frac{\sum_{i=1}^{n} P\{j \mid x_i\}\, (x_i - \mu_j)^2}{\sum_{i=1}^{n} P\{j \mid x_i\}}.    (11)
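A minimal batch sketch of these weighted estimates follows: the posteriors P{j | xi} of equation (8) are computed for all samples, and the means and variances are then re-estimated as in (10) and (11). The array layout and function names are ours, and the hypothetical gaussian_pdf helper from the first sketch is assumed.

import numpy as np

def responsibilities(samples, pi, mu, sigma):
    # P{j | x_i} of equation (8): one row per sample, one column per kernel.
    resp = np.array([[p * gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)]
                     for x in samples])
    return resp / resp.sum(axis=1, keepdims=True)

def reestimate_mean_variance(samples, resp):
    # Weighted ML estimates of equations (10) and (11).
    x = np.asarray(samples, dtype=float)
    weights = resp.sum(axis=0)                                    # sum_i P{j | x_i}
    mu = (resp * x[:, None]).sum(axis=0) / weights                # equation (10)
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / weights   # equation (11)
    return mu, var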

In order to find the ML estimates for the priors πj we introduce a Lagrange multiplier λ for the condition \sum_{j=1}^{K} \pi_j = 1 and find the roots of the first partial derivative with respect to πj of the quantity

\sum_{i=1}^{n} \ln p(x_i; \psi) - \lambda \left( \sum_{j=1}^{K} \pi_j - 1 \right),    (12)

which after some derivations gives the ML estimate for the priors

\pi_j = n^{-1} \sum_{i=1}^{n} P\{j \mid x_i\}.    (13)
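Continuing the batch sketch above, the prior estimate (13) is simply the average responsibility of each kernel over the samples; the helper name is ours and resp is the array returned by the hypothetical responsibilities function.

def reestimate_priors(resp):
    # Equation (13): pi_j is the average responsibility of kernel j;
    # the result is nonnegative and sums to one, satisfying (4).
    return resp.mean(axis=0)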

Substituting the sum in the denominators of (10) and (11) from (13), after some derivations, one can easily arrive at iterative formulas for the quantities µj, σj², and πj of a kernel j as

\mu_j := \mu_j + (n \pi_j)^{-1} P\{j \mid x\} (x - \mu_j),    (14)

\sigma_j^2 := \sigma_j^2 + (n \pi_j)^{-1} P\{j \mid x\} [(x - \mu_j)^2 - \sigma_j^2],    (15)

\pi_j := \pi_j + n^{-1} (P\{j \mid x\} - \pi_j).    (16)

It is not difficult to verify that the new πj satisfy \sum_{j=1}^{K} \pi_j = 1. Interestingly, exactly the same formulas are obtained in [4] using the method of scores. We should note nevertheless that there is an inherent bias in the estimation of the above quantities, since at each step the variance depends on the current value of the mean. However, simulations showed that in practical cases and for relatively large n, e.g., n ≥ 100, this bias can be ignored. In the following we assume that the input sequence of random samples xi is infinite and that the number n in (14)–(16) is constant. This can be viewed as a moving average procedure over previous inputs. In this case we do not ask for stochastic convergence in the regular sense, but rather seek a way to model potentially nonstationary input distributions.
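A minimal sketch of one sequential step (14)–(16) follows, assuming the hypothetical gaussian_pdf helper from the first sketch. Whether the old or the freshly updated mean enters (15) is exactly the source of the small bias noted above; the updated mean is used here.

import numpy as np

def sequential_update(x, pi, mu, var, n):
    # One on-line step of equations (14)-(16), applied to all kernels at once.
    # pi, mu, var are NumPy arrays of length K; n is the constant window length.
    sigma = np.sqrt(var)
    post = pi * np.array([gaussian_pdf(x, m, s) for m, s in zip(mu, sigma)])
    post = post / post.sum()                                      # P{j | x}, equation (8)
    mu_new = mu + post * (x - mu) / (n * pi)                      # equation (14)
    var_new = var + post * ((x - mu_new) ** 2 - var) / (n * pi)   # equation (15)
    pi_new = pi + (post - pi) / n                                 # equation (16)
    return pi_new, mu_new, var_new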

3.2. TESTING FOR THE NUMBER OF KERNELS

Perhaps the most difficult problem in the theory of mixture density models is the proper estimation of the number of kernels when a method like ML is used [4].


Figure 1. Kernel splitting: the input samples (vertical bars) are assumed to follow a two-kernel distribution but the ML procedure tries to fit the data with a single kernel.

We propose here a method that seeks on-line for the correct number of kernels, based on simple test statistics for testing the hypothesis of single normality against a two-kernel alternative.

Splitting a kernel

We first look for a test statistic to decide when a kernel should split in two. Consider the simple case of Figure 1 where the input samples (vertical bars) are assumed to follow a two-kernel mixture distribution, but the algorithm tries to fit the data with a single kernel (Gaussian curve). A statistical test is needed to check the hypothesis that the input samples follow a single Gaussian with µ and σ against the alternative that they follow a mixture of two kernels, in which case the single kernel should split in two. Since from the ML estimates (14) and (15) of the first two moments of the underlying kernel one cannot decide on the normality of the kernel, it is reasonable to base the hypothesis on tests involving higher moments. We form a simple sequential test statistic based on a weighted formula of the kurtosis, or fourth moment, of a kernel j as

k_j := k_j + (n \pi_j)^{-1} P\{j \mid x\} \left[ \left( \frac{x - \mu_j}{\sigma_j} \right)^4 - k_j - 3 \right],    (17)

with µj and σj the current ML estimates for the parameters of the kernel. On the hypothesis that the xi follow a normal distribution N(µj, σj) it follows [17] that the random variable

q = k_j \sqrt{n \pi_j / 96}    (18)

approximately follows the normal distribution N(0, 1). Since N(0, 1) is symmetrical about zero we accept the hypothesis that the kernel j is N(µj, σj) if (cf. [16, ch. 9–3]) |q|
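The sequential kurtosis update (17) and the statistic (18) can be sketched as follows. The function names are ours, and the acceptance threshold z_crit is a placeholder of our own (the exact cutoff cited from [16] is cut off in the text above); the usual two-sided N(0, 1) critical value at the 5% level is shown only as an example.

import numpy as np

def kurtosis_update(k_j, post_j, x, mu_j, sigma_j, pi_j, n):
    # Sequential weighted kurtosis statistic of equation (17);
    # post_j is P{j | x} from equation (8), k_j starts at 0 for a new kernel.
    z4 = ((x - mu_j) / sigma_j) ** 4
    return k_j + post_j * (z4 - k_j - 3.0) / (n * pi_j)

def split_suggested(k_j, pi_j, n, z_crit=1.96):
    # q of equation (18) is approximately N(0, 1) if kernel j is truly Gaussian.
    # z_crit is a hypothetical cutoff, not the value given in the paper.
    q = k_j * np.sqrt(n * pi_j / 96.0)
    return abs(q) >= z_crit   # True suggests splitting kernel j in two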