A GENERALIZED MULTIVARIATE LOGISTIC MODEL AND EM ALGORITHM BASED ON THE NORMAL VARIANCE MEAN MIXTURE REPRESENTATION

Jason A. Palmer¹, Ken Kreutz-Delgado², and Scott Makeig¹

¹Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego
²Department of Electrical and Computer Engineering, University of California San Diego

This research was partially funded by the Swartz Foundation.

ABSTRACT

We present an EM algorithm for Maximum Likelihood estimation of the location, scale, skew, and shape parameters of the z distribution, also known as the generalized logistic function (type IV). We use the Barndorff-Nielsen, Kent, and Sørensen representation of the z distribution as a Gaussian location-scale mixture to derive an EM algorithm for estimating the location, scale, skew, and shape parameters. We use a variational bound on the likelihood function to determine a monotonically converging closed form update for the skew (or drift) parameter. The algorithm also extends naturally to multivariate GLSM estimation using the Kolmogorov-Smirnov mixing density in odd dimensions.

Index Terms— Generalized Logistic, Gaussian Location-Scale Mixtures, Multivariate Logistic, Quasiparametric density estimation

1. INTRODUCTION

Fisher's z distribution has the form,

$$p(s) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{e^{\alpha s}}{(1 + e^{s})^{\alpha+\beta}}, \qquad \alpha, \beta > 0$$ (1)

with the location-scale family defined by $z(s; \mu, \sigma, \alpha, \beta) = \sigma^{-1} p(\sigma^{-1}(s - \mu))$. This density arises from the so-called z transformation $\log(X/(1-X))$ when $X \sim B(\alpha, \beta)$ [1]. It also arises from the F distribution as the density of $\log X$ for $X \sim F(\alpha/2, \beta/2)$, and as the distribution of the difference of log Gamma random variables, $\log X_1 - \log X_2$, with $X_1, X_2$ independent Gamma distributed [2]. The z distribution has log-linear tails, with the slopes of the left and right tails proportional to $\alpha$ and $-\beta$ respectively. The shape varies from Laplacian (double-exponential) to Gaussian with the magnitude of $(\alpha, \beta)$. It was first described by Fisher [1, 3], and was studied extensively by Prentice [2, 4]. It is also known as the generalized logistic function (Type IV) [5, p. 142], and the exponential generalized beta distribution of the second kind (EGB2) [6]. The z distribution is one of the few closed form density models that allow control of skew (another being the Generalized Hyperbolic density [7], which involves the Bessel K function), and is thus very useful in parametric modeling.

The z distribution was shown to be a Normal Variance Mean Mixture by Barndorff-Nielsen et al. [7]. That is, a z distributed random variable $S$ can be represented in the generative form,

$$S = \xi^{1/2} Z + \xi\theta$$ (2)

where $Z$ is standard Normal, $\xi$ is a non-negative random variable independent of $Z$, and $\theta \in \mathbb{R}$ is a skew or drift parameter. This representation raises the possibility of using an EM algorithm as with (symmetric) GSMs, as first suggested by Dempster, Laird, and Rubin [8]. Prentice [4] considered a standard Newton-Raphson type method for Maximum Likelihood parameter estimation, but notes that the method suffers from complications involving the shape parameters. The EM algorithm has the virtue of being monotonically convergent and involving only first derivatives, as well as extending naturally to higher dimensions.

In §2 we consider the mixing density associated with the z distribution, and the conditions under which the model is extensible to higher dimensions, i.e. for Gaussian $z \sim N(0, \Sigma)$ we consider the random vector,

$$s = \xi^{1/2} z + \xi\theta + \mu$$ (3)

In §3 we derive the complete log likelihood associated with the EM algorithm, along with the posterior mixing variable expectations. In §4 we derive the parameter updates, first considering the straightforward location and scale updates for $\mu$ and $\Sigma$, and then deriving a variational bound on the $\theta$ objective leading to a closed form (constrained) maximum. In §5 we derive the updates arising when this model is embedded in a finite mixture model EM framework.

2. MIXING DENSITY AND MULTIVARIATE DEPENDENT MODELS

Although the mixing density can only be expressed in series form, we do not need the density itself for the EM algorithm. Let $f(\xi; \delta)$ be the mixing density of the symmetric Generalized Logistic density with shape parameter $\alpha = \beta = \delta$. Then for $s = \xi^{1/2} z + \mu$, with $z \sim N(0, \sigma^2)$,

$$p_1(s; \delta, \mu, \sigma) = \int_0^\infty \frac{1}{\sigma\sqrt{2\pi\xi}} \exp\left(-\frac{1}{2}\frac{(s-\mu)^2}{\sigma^2 \xi}\right) f(\xi; \delta)\, d\xi = \frac{\Gamma(2\delta)}{\Gamma(\delta)^2}\, \frac{\sigma^{-1}}{\left(2\cosh\left(\tfrac{1}{2}\sigma^{-1}(s-\mu)\right)\right)^{2\delta}}$$ (4)
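As a quick numerical sanity check of (4), the closed form can be evaluated in log space and integrated; the following is a minimal Python sketch (an illustration by the editor, not part of the paper; function and variable names are ours):

```python
# Sketch: evaluate the symmetric density (4) in log space and check
# that it integrates to 1. Assumes NumPy/SciPy; delta = alpha = beta.
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def z_density_symmetric(s, delta, mu=0.0, sigma=1.0):
    y = (s - mu) / sigma
    logp = (gammaln(2 * delta) - 2 * gammaln(delta) - np.log(sigma)
            - 2 * delta * (np.log(2.0) + np.log(np.cosh(0.5 * y))))
    return np.exp(logp)

total, _ = quad(z_density_symmetric, -50, 50, args=(1.5,))
print(total)  # ~1.0
```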

Now, the moment generating function $E\{\exp(\xi t)\} \triangleq \varphi(t)$ of the mixing density is given by,

$$\varphi_1(t; \delta) = \frac{\Gamma\left(\delta + \sqrt{2t}\right)\Gamma\left(\delta - \sqrt{2t}\right)}{\Gamma(\delta)^2}, \qquad t < \tfrac{1}{2}\delta^2$$

We can thus construct a modified form of the mixing density for any $\gamma < \tfrac{1}{2}\delta^2$,

$$f(\xi; \delta, \gamma) = \frac{\exp(\gamma\xi)}{\varphi(\gamma; \delta)}\, f(\xi; \delta)$$

In particular we shall take $\gamma = \tfrac{1}{2}\|\theta\|^2_{\Sigma^{-1}}$ in the sequel. The moment generating function of this modified mixing density is then,

$$\varphi(t; \delta, \theta, \Sigma) = \frac{\Gamma\left(\delta + \sqrt{\|\theta\|^2_{\Sigma^{-1}} + 2t}\right)\Gamma\left(\delta - \sqrt{\|\theta\|^2_{\Sigma^{-1}} + 2t}\right)}{\Gamma\left(\delta + \|\theta\|_{\Sigma^{-1}}\right)\Gamma\left(\delta - \|\theta\|_{\Sigma^{-1}}\right)}$$ (5)

where $0 \le \|\theta\|_{\Sigma^{-1}} < \delta$.

Using this moment generating function to determine the first moment, we find,

$$E\{\xi; \delta, \theta, \Sigma\} = \frac{\Psi\left(\delta + \|\theta\|_{\Sigma^{-1}}\right) - \Psi\left(\delta - \|\theta\|_{\Sigma^{-1}}\right)}{\|\theta\|_{\Sigma^{-1}}}$$ (6)

As $\|\theta\|_{\Sigma^{-1}} \to 0$, this tends to $2\Psi'(\delta)$. Similarly we find that the variance,

$$\mathrm{Var}(\xi; \delta, \theta, \Sigma) = \frac{\Psi'\left(\delta - \|\theta\|_{\Sigma^{-1}}\right) + \Psi'\left(\delta + \|\theta\|_{\Sigma^{-1}}\right) - E\{\xi; \delta, \theta, \Sigma\}}{\|\theta\|^2_{\Sigma^{-1}}}$$

which tends to $\tfrac{2}{3}\Psi'''(\delta)$ as $\|\theta\|_{\Sigma^{-1}} \to 0$. The mean and variance can be used to normalize the mixing scale and variance. Now if we consider the density of the random vector defined in (3), we have $p(s) = \int N(s; \mu + \xi\theta, \xi\Sigma)\, f(\xi; \delta, \gamma)\, d\xi$, i.e.,

$$p(s) = |\det\Sigma|^{-1/2} \exp\left(\theta^T\Sigma^{-1}(s - \mu)\right) \int_0^\infty (2\pi\xi)^{-n/2} \exp\left(-\tfrac{1}{2}\xi^{-1}\|s - \mu\|^2_{\Sigma^{-1}} - \tfrac{1}{2}\xi\|\theta\|^2_{\Sigma^{-1}}\right) f(\xi; \delta, \gamma)\, d\xi$$

With $\gamma = \tfrac{1}{2}\|\theta\|^2_{\Sigma^{-1}}$, the $\xi$ term in the exponent cancels and we are left with,

$$p(s; \delta, \mu, \theta, \Sigma) = |\det\Sigma|^{-1/2}\, \frac{\exp\left(\theta^T\Sigma^{-1}(s - \mu)\right)}{\varphi_1\left(\tfrac{1}{2}\|\theta\|^2_{\Sigma^{-1}}; \delta\right)} \int_0^\infty (2\pi\xi)^{-n/2} \exp\left(-\tfrac{1}{2}\xi^{-1}\|s - \mu\|^2_{\Sigma^{-1}}\right) f(\xi; \delta)\, d\xi$$

The integral is seen to have a similar form to that in (4) except for the higher degree of the factor $\xi^{-n/2}$. If we note however the effect of repeated applications of the operator $-\frac{1}{y}\frac{d}{dy}$ under the integral in (4), we see that in the case of $n$ odd, $p_n(s)$ can be written,

$$p_n(s; \delta, \mu, \theta, \Sigma) = \frac{|\det\Sigma|^{-1/2} \exp\left(\theta^T\Sigma^{-1}(s - \mu)\right)}{B\left(\delta - \|\theta\|_{\Sigma^{-1}},\, \delta + \|\theta\|_{\Sigma^{-1}}\right)} \left[\left(-\frac{1}{2\pi y}\frac{d}{dy}\right)^{\!\frac{n-1}{2}} \left(2\cosh\left(\tfrac{1}{2}y\right)\right)^{-2\delta}\right]_{y = \|s - \mu\|_{\Sigma^{-1}}}$$

Define,

$$C_k(y; \delta) \triangleq \left(-\frac{1}{y}\frac{d}{dy}\right)^{\!k} \left(2\cosh\left(\tfrac{1}{2}y\right)\right)^{-2\delta}$$ (7)

Then for $n = 2k+1$, $k = 1, 2, \ldots$,

$$p_n(s; \delta, \mu, \theta, \Sigma) = \frac{C_k\left(\|s - \mu\|_{\Sigma^{-1}}; \delta\right) \exp\left(\theta^T\Sigma^{-1}(s - \mu)\right)}{(2\pi)^k (\det\Sigma)^{1/2}\, B\left(\delta - \|\theta\|_{\Sigma^{-1}},\, \delta + \|\theta\|_{\Sigma^{-1}}\right)}$$

For the densities then, we have for $n = 1$,

$$p_1(s; \delta, \mu, \theta, \sigma) = \frac{\sigma^{-1} \exp\left(\theta\sigma^{-1}(s - \mu)\right) \Gamma(2\delta)}{\Gamma\left(\delta - \sigma^{-1}\theta\right)\Gamma\left(\delta + \sigma^{-1}\theta\right) \left(2\cosh\left(\tfrac{1}{2}\sigma^{-1}(s - \mu)\right)\right)^{2\delta}}$$

Writing out the result in the one, three, and five dimensional cases, we have $C_k(y; \delta) = \left(2\cosh(\tfrac{1}{2}y)\right)^{-2\delta} P_k(y; \delta)$, where,

$$P_0(y) = 1$$

$$P_1(y; \delta) = \delta\,\frac{\tanh(\tfrac{1}{2}y)}{y}$$

$$P_2(y; \delta) = \delta\,\frac{\tanh(\tfrac{1}{2}y)}{y}\left(\delta\,\frac{\cosh(y) - 1}{y\sinh(y)} + \frac{1 - y/\sinh(y)}{y^2}\right)$$
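To make the closed forms concrete, here is a hypothetical Python sketch of the odd-dimension density, implementing $P_0$ and $P_1$ (so $n = 1, 3$); $P_2$ for $n = 5$ could be added from the list above. The function names are ours and the code assumes $0 \le \|\theta\|_{\Sigma^{-1}} < \delta$:

```python
# Sketch of p_n(s; delta, mu, theta, Sigma) for odd n = 2k+1, per (7)
# and the formula above; k = 0, 1 implemented.
import numpy as np
from scipy.special import gammaln

def P(k, y, delta):
    if k == 0:
        return np.ones_like(y)
    if k == 1:
        return delta * np.tanh(0.5 * y) / y
    raise NotImplementedError("add P_2 from the list above for n = 5")

def p_odd(s, delta, mu, theta, Sigma):
    n = len(s)
    k = (n - 1) // 2
    Sinv = np.linalg.inv(Sigma)
    d = s - mu
    y = np.sqrt(d @ Sinv @ d)
    t = np.sqrt(theta @ Sinv @ theta)        # requires t < delta
    logB = gammaln(delta - t) + gammaln(delta + t) - gammaln(2 * delta)
    logC = (np.log(P(k, y, delta))
            - 2 * delta * (np.log(2.0) + np.log(np.cosh(0.5 * y))))
    return np.exp(theta @ Sinv @ d + logC - logB
                  - k * np.log(2 * np.pi)
                  - 0.5 * np.linalg.slogdet(Sigma)[1])
```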

3. COMPLETE LOG LIKELIHOOD AND POSTERIOR MOMENTS

Define the sample mean $m \triangleq N^{-1}\sum_{t=1}^N s_t$. The complete log likelihood $\log \prod_t p(s_t|\xi_t)p(\xi_t)$ of $N$ i.i.d. samples, scaled by $N^{-1}$, is,

$$-\tfrac{1}{2}\log\det\Sigma + \theta^T\Sigma^{-1}(m - \mu) - \log\varphi_1\left(\tfrac{1}{2}\|\theta\|^2_{\Sigma^{-1}}; \delta\right) - N^{-1}\sum_{t=1}^N \tfrac{1}{2}\,\xi_t^{-1}\|s_t - \mu\|^2_{\Sigma^{-1}}$$ (8)

Note that we only need to compute the posterior expectation of $\xi_t^{-1}$. Using the differentiation trick again, with $V \triangleq -\frac{1}{y}\frac{d}{dy}$, we have,

$$E\{\xi_t^{-1}|s_t\} = \frac{\int \xi_t^{-1}\, p(s_t|\xi_t)p(\xi_t)\, d\xi_t}{p(s_t)} = \left[\frac{V^{k+1} p_1(y; \delta)}{V^{k} p_1(y; \delta)}\right]_{y = \|s_t - \mu\|_{\Sigma^{-1}}} = \frac{P_{k+1}\left(\|s_t - \mu\|_{\Sigma^{-1}}; \delta\right)}{P_k\left(\|s_t - \mu\|_{\Sigma^{-1}}; \delta\right)}$$ (9)

In the one dimensional case, we have,

$$E\{\xi_t^{-1}|s_t; \delta, \mu, \sigma\} = \left[\delta\,\frac{\tanh(\tfrac{1}{2}y)}{y}\right]_{y = \sigma^{-1}(s_t - \mu)}$$

and in $\mathbb{R}^3$, we have,

$$E\{\xi_t^{-1}|s_t; \delta, \mu, \Sigma\} = \left[\delta\,\frac{\cosh(y) - 1}{y\sinh(y)} + \frac{1 - y/\sinh(y)}{y^2}\right]_{y = \|s_t - \mu\|_{\Sigma^{-1}}}$$

In this formulation, the expected value of $\xi_t^{-1}$ given $s_t$ does not depend on $\theta$ in any dimension $n$. Let $\nu_t \triangleq E\{\xi_t^{-1}|s_t\}$. Then the complete log likelihood (8), scaled by $N^{-1}$, can be written,

$$-\tfrac{1}{2}\log|\Sigma| + \theta^T\Sigma^{-1}(m - \mu) - \log\varphi_1\left(\tfrac{1}{2}\|\theta\|^2_{\Sigma^{-1}}; \delta\right) - N^{-1}\sum_{t=1}^N \tfrac{1}{2}\,\nu_t\|s_t - \mu\|^2_{\Sigma^{-1}}$$ (10)
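Since the posterior weights depend only on the Mahalanobis radius $y_t$, the E-step vectorizes directly. A sketch (ours, not the paper's code) for $n = 1$ and $n = 3$, guarding the removable singularity at $y = 0$:

```python
# Sketch of the E-step weights nu_t = E{xi_t^-1 | s_t} from (9).
import numpy as np

def nu_1d(y, delta):
    # delta * tanh(y/2) / y, with limit delta/2 as y -> 0
    y = np.asarray(y, dtype=float)
    out = np.full_like(y, 0.5 * delta)
    m = np.abs(y) > 1e-8
    out[m] = delta * np.tanh(0.5 * y[m]) / y[m]
    return out

def nu_3d(y, delta):
    # delta*(cosh y - 1)/(y sinh y) + (1 - y/sinh y)/y^2,
    # with limit delta/2 + 1/6 as y -> 0
    y = np.asarray(y, dtype=float)
    out = np.full_like(y, 0.5 * delta + 1.0 / 6.0)
    m = np.abs(y) > 1e-6
    ym = y[m]
    out[m] = (delta * (np.cosh(ym) - 1.0) / (ym * np.sinh(ym))
              + (1.0 - ym / np.sinh(ym)) / ym ** 2)
    return out
```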

4. PARAMETER UPDATES


We first consider the updates for $\mu$ and $\Sigma$ assuming $\theta$ is given. We then derive an update for $\theta$ given $\mu$ and $\Sigma$. Finally we derive a variational update for the shape parameter $\delta$, which monotonically increases the log likelihood as part of the EM algorithm.

4.1. Location update

Setting the gradient of (10) with respect to $\mu$ equal to zero, we find that the optimal $\mu$ satisfies,

$$\frac{1}{N}\sum_{t=1}^N \nu_t (s_t - \mu) = \theta$$

Defining $\hat{\nu} \triangleq N^{-1}\sum_{t=1}^N \nu_t$ and $m_\nu \triangleq (N\hat{\nu})^{-1}\sum_{t=1}^N \nu_t s_t$, we have,

$$\mu = m_\nu - \hat{\nu}^{-1}\theta$$ (11)
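In code, (11) is a weighted mean shifted by the drift; a minimal sketch under our own naming conventions (not from the paper):

```python
# Sketch of the location update (11).
import numpy as np

def update_mu(S, nu, theta):
    """S: (N, n) samples; nu: (N,) weights nu_t; theta: (n,) drift."""
    nu_hat = nu.mean()
    m_nu = (nu[:, None] * S).sum(axis=0) / (len(S) * nu_hat)
    return m_nu - theta / nu_hat
```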

4.2. Drift update

We would like to maximize the posterior likelihood (10) over $\theta$ by solving,

$$\max_\theta\; (m - \mu)^T\Sigma^{-1}\theta - \log\left(\Gamma\left(\delta + \|\theta\|_{\Sigma^{-1}}\right)\Gamma\left(\delta - \|\theta\|_{\Sigma^{-1}}\right)\right)$$

The function $h(t) = \log(\Gamma(\delta + t)\Gamma(\delta - t))$ is increasing and unbounded above on the interval $[0, \delta)$. While the convexity of this function ensures a unique solution, the solution can only be formulated in terms of inverse Digamma (Psi) functions, which are not readily computable. We can however bound the cost function variationally in terms of a similar function that is tractable, to derive a monotonic coordinate ascent algorithm. Specifically, we have,

$$\log\left(\Gamma(\delta + t)\Gamma(\delta - t)\right) = \inf_v\; v\,\log\frac{1}{\delta^2 - t^2} - h^\ast(v)$$

for a certain conjugate function $h^\ast$ which it is not necessary to compute, and the optimal $v$ is given by,

$$v = \frac{\Psi(\delta + t) - \Psi(\delta - t)}{2t\,(\delta^2 - t^2)^{-1}} = \tfrac{1}{2}\hat{\xi}\left(\delta^2 - t^2\right)$$

where $t = \|\theta\|_{\Sigma^{-1}}$ and $\hat{\xi}$ is the posterior mean (6) evaluated at the current parameters. We thus have the surrogate problem,

$$\max_\theta\; (m - \mu)^T\Sigma^{-1}\theta + v\,\log\left(\delta^2 - \|\theta\|^2_{\Sigma^{-1}}\right)$$ (12)

This solution may be derived in closed form,

$$\theta = \left(\frac{\sqrt{v^2 + \delta^2\|m - \mu\|^2_{\Sigma^{-1}}}\; - v}{\|m - \mu\|_{\Sigma^{-1}}}\right)\frac{m - \mu}{\|m - \mu\|_{\Sigma^{-1}}}$$ (13)
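A sketch of how the drift update might be implemented: compute $\hat{\xi}$ from (6) at the current $\theta$, form $v$, and apply (13). SciPy's `psi` is the digamma $\Psi$ and `polygamma(1, .)` the trigamma $\Psi'$; the guards and names are our assumptions, not the paper's:

```python
# Sketch of the drift update (12)-(13).
import numpy as np
from scipy.special import psi, polygamma

def update_theta(m, mu, Sigma_inv, theta, delta):
    t = np.sqrt(theta @ Sigma_inv @ theta)        # current ||theta||, < delta
    if t < 1e-12:
        xi_hat = 2.0 * polygamma(1, delta)        # limit of (6) as t -> 0
    else:
        xi_hat = (psi(delta + t) - psi(delta - t)) / t
    v = 0.5 * xi_hat * (delta ** 2 - t ** 2)      # optimal variational v
    d = m - mu
    r2 = d @ Sigma_inv @ d
    if r2 < 1e-24:
        return np.zeros_like(d)
    r = np.sqrt(r2)
    s = (np.sqrt(v ** 2 + delta ** 2 * r2) - v) / r   # new ||theta||
    return (s / r) * d                            # eq. (13)
```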

4.3. Structure matrix update

We use the same variational representation used to derive the drift update. Let us define $\Sigma_\nu \triangleq (N\hat{\nu})^{-1}\sum_{t=1}^N \nu_t (s_t - \mu)(s_t - \mu)^T$. At a stationary point then, $\Sigma$ satisfies,

$$\Sigma = \hat{\nu}\Sigma_\nu - (m - \mu)\theta^T - \theta(m - \mu)^T + \frac{2v\,\theta\theta^T}{\delta^2 - \|\theta\|^2_{\Sigma^{-1}}}$$ (14)

This equation can be solved for $t \triangleq \|\theta\|^2_{\Sigma^{-1}}$ by using the Woodbury identity three times to invert (14). Writing $A \triangleq \hat{\nu}\Sigma_\nu - (m - \mu)\theta^T - \theta(m - \mu)^T$, we have,

$$\Sigma^{-1} = A^{-1} - \frac{2v\,A^{-1}\theta\theta^T A^{-1}}{\delta^2 - t + 2v\,\theta^T A^{-1}\theta}$$ (15)

This leads to the update,

$$t = \frac{\delta^2 + (2v+1)t_0 - \sqrt{\left(\delta^2 + (2v+1)t_0\right)^2 - 4\delta^2 t_0}}{2}$$ (16)

where we define $d \triangleq m - \mu$ and,

$$a \triangleq \theta^T(\hat{\nu}\Sigma_\nu)^{-1}\theta, \qquad b \triangleq \theta^T(\hat{\nu}\Sigma_\nu)^{-1}d, \qquad c \triangleq d^T(\hat{\nu}\Sigma_\nu)^{-1}d$$

and,

$$t_0 \triangleq \theta^T A^{-1}\theta = \frac{a}{(1 - b)^2 - ac}$$ (17)

With $t$ in hand, $\Sigma$ is then given explicitly by (14).
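A hypothetical sketch of the structure matrix update: form the scalars of (17), solve the quadratic (16) for $t$, and substitute into (14). It assumes the weighted scatter $\hat{\nu}\Sigma_\nu$ is invertible and $(1-b)^2 > ac$; the names are ours:

```python
# Sketch of the structure matrix update (14)-(17).
import numpy as np

def update_Sigma(Sigma_nu, nu_hat, m, mu, theta, delta, v):
    B = nu_hat * Sigma_nu                 # weighted scatter, assumed invertible
    Binv = np.linalg.inv(B)
    d = m - mu
    a = theta @ Binv @ theta
    b = theta @ Binv @ d
    c = d @ Binv @ d
    t0 = a / ((1.0 - b) ** 2 - a * c)                        # eq. (17)
    q = delta ** 2 + (2.0 * v + 1.0) * t0
    t = 0.5 * (q - np.sqrt(q ** 2 - 4.0 * delta ** 2 * t0))  # eq. (16)
    A = B - np.outer(d, theta) - np.outer(theta, d)
    return A + (2.0 * v / (delta ** 2 - t)) * np.outer(theta, theta)  # eq. (14)
```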

4.4. Shape parameter update

Note that $\log p(s; \delta)$ has the general form,

$$G(\delta) = F(\delta) - 2\delta\left\langle \log 2\cosh\left(\tfrac{1}{2}\|s_t - \mu\|_{\Sigma^{-1}}\right)\right\rangle_N + \mathrm{const.}$$

where $\langle\cdot\rangle_N$ denotes the average over the $N$ samples and $F_k(\delta) = \langle\log P_k\rangle_N - \log\varphi_1\left(\tfrac{1}{2}\|\theta\|^2_{\Sigma^{-1}}; \delta\right)$, i.e.,

$$F_k(\delta) = \log \frac{\Gamma(2\delta)\,\exp\left\langle \log P_k(y_t; \delta)\right\rangle_N}{\Gamma\left(\delta + \|\theta\|_{\Sigma^{-1}}\right)\Gamma\left(\delta - \|\theta\|_{\Sigma^{-1}}\right)}, \qquad \delta > \|\theta\|_{\Sigma^{-1}}$$

with $y_t = \|s_t - \mu\|_{\Sigma^{-1}}$. We note that $F$ is increasing and concave for $\delta > \|\theta\|_{\Sigma^{-1}}$, but $F$ is also convex with respect to the logarithm $\log(\delta^2 - \|\theta\|^2_{\Sigma^{-1}})$. Thus,

$$F_k(\delta) = \sup_u\; u\,\log\left(\delta^2 - \|\theta\|^2_{\Sigma^{-1}}\right) - H(u)$$

for a certain relative conjugate function $H(u)$, where the unique optimal $u$ is given by,

$$u = (2\delta)^{-1}\left(\delta^2 - \|\theta\|^2_{\Sigma^{-1}}\right) F_k'(\delta)$$ (18)

where $F_0'(\delta) = 2\Psi(2\delta) - \Psi(\delta + \|\theta\|_{\Sigma^{-1}}) - \Psi(\delta - \|\theta\|_{\Sigma^{-1}})$ and $F_1'(\delta) = F_0'(\delta) + \delta^{-1}$. If we define,

$$\hat{\eta} = 2\log 2 + N^{-1}\sum_{t=1}^N 2\log\cosh\left(\tfrac{1}{2}\|s_t - \mu\|_{\Sigma^{-1}}\right)$$ (19)

then we have the surrogate problem for $\delta$,

$$\max_{\delta > \|\theta\|_{\Sigma^{-1}}}\; u\,\log\left(\delta^2 - \|\theta\|^2_{\Sigma^{-1}}\right) - \hat{\eta}\,\delta$$

which has the optimum,

$$\delta = \frac{u + \sqrt{u^2 + \hat{\eta}^2\|\theta\|^2_{\Sigma^{-1}}}}{\hat{\eta}}$$ (20)
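A sketch of the shape update for the $k = 0$ (one dimensional) case; for $n = 3$, $F_1'$ adds $\delta^{-1}$ to the derivative. We iterate the surrogate optimum (20) a few times, although a single step already increases the likelihood; the iteration count and names are our choices, not the paper's:

```python
# Sketch of the shape parameter update (18)-(20), k = 0 case.
import numpy as np
from scipy.special import psi

def update_delta(delta, theta_norm, y, n_iter=5):
    """y: array of ||s_t - mu||_{Sigma^-1}; requires delta > theta_norm."""
    eta = 2.0 * np.log(2.0) + 2.0 * np.mean(np.log(np.cosh(0.5 * y)))  # eq. (19)
    for _ in range(n_iter):
        dF = (2.0 * psi(2.0 * delta)
              - psi(delta + theta_norm) - psi(delta - theta_norm))
        u = (delta ** 2 - theta_norm ** 2) * dF / (2.0 * delta)        # eq. (18)
        delta = (u + np.sqrt(u ** 2 + eta ** 2 * theta_norm ** 2)) / eta  # (20)
    return delta
```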

5. FINITE MIXTURE MODEL

We can readily formulate a finite mixture model EM algorithm to estimate the model,

$$p\left(s; \{\alpha_j, \delta_j, \mu_j, \theta_j, \Sigma_j\}_{j=1}^M\right) = \sum_{j=1}^M \alpha_j\, p\left(s; \delta_j, \mu_j, \theta_j, \Sigma_j\right)$$

Define the hidden model index for time $t$, $j_t \in \{1, \ldots, M\}$, and the hidden model indicator random variables,

$$e_{jt} = \begin{cases} 1, & j_t = j \\ 0, & j_t \neq j \end{cases}$$

so that the mixture random variable representation has the form,

$$s_t = \sum_{j=1}^M e_{jt}\left(\mu_j + \xi_{jt}\theta_j + \xi_{jt}^{1/2} z_{jt}\right)$$ (21)

where $z_{jt} \sim N(0, \Sigma_j)$ and $\xi_{jt} \sim f\left(\xi; \delta_j, \tfrac{1}{2}\|\theta_j\|^2\right)$, independent, for $t = 1, \ldots, N$, with $0 \le \|\theta_j\| < \delta_j$ for $j = 1, \ldots, M$, $\sum_{j=1}^M \alpha_j(\mu_j + \hat{\xi}_j\theta_j) = 0$, and $\hat{\xi}_j$ defined by (6). Define the posterior expectations $\hat{e}_{jt} = E\{e_{jt}\,|\,s_t\}$. Then we have the standard mixture model updates,

$$\hat{e}_{jt} = \frac{\alpha_j^\ell\, p(s_t; \delta_j, \mu_j, \theta_j, \Sigma_j)}{\sum_{j'=1}^M \alpha_{j'}^\ell\, p(s_t; \delta_{j'}, \mu_{j'}, \theta_{j'}, \Sigma_{j'})}$$ (22)

and $\alpha_j^{\ell+1} = \frac{1}{N}\sum_{t=1}^N \hat{e}_{jt}$, where we use $\ell$ to indicate the iteration number.
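The responsibility computation (22) is the standard mixture E-step; a minimal log-sum-exp sketch (assuming per-sample, per-component log densities are available, e.g. from a function such as `p_odd` above; names ours):

```python
# Sketch of the responsibility update (22) and mixture weight update.
import numpy as np

def responsibilities(logp, alpha):
    """logp: (N, M) log densities log p(s_t; component j); alpha: (M,)."""
    a = logp + np.log(alpha)[None, :]
    a -= a.max(axis=1, keepdims=True)          # log-sum-exp stabilization
    e = np.exp(a)
    e_hat = e / e.sum(axis=1, keepdims=True)   # eq. (22)
    alpha_new = e_hat.mean(axis=0)             # alpha_j^{l+1}
    return e_hat, alpha_new
```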

Now defining $\nu_{jt} \triangleq E\{\xi_{jt}^{-1}|s_t, j_t = j\}$ and $m_j \triangleq (N\alpha_j^{\ell+1})^{-1}\sum_{t=1}^N \hat{e}_{jt}\, s_t$, the complete log likelihood is,

$$\sum_{j=1}^M \alpha_j^{\ell+1}\left[\theta_j^T\Sigma_j^{-1}(m_j - \mu_j) - \log\varphi_1\left(\tfrac{1}{2}\|\theta_j\|^2_{\Sigma_j^{-1}}; \delta_j\right) - \tfrac{1}{2}\log|\Sigma_j| - (N\alpha_j^{\ell+1})^{-1}\sum_{t=1}^N \tfrac{1}{2}\,\hat{e}_{jt}\,\nu_{jt}\,\|s_t - \mu_j\|^2_{\Sigma_j^{-1}}\right]$$

Then for the location parameter updates, we have,

$$\mu_j = m_{j\nu} - \hat{\nu}_j^{-1}\theta_j$$ (23)

where $\hat{\nu}_j$ and $m_{j\nu}$ are the $\hat{e}_{jt}$-weighted analogues of $\hat{\nu}$ and $m_\nu$ in §4.1.

For the drift parameter updates, define $d_j \triangleq m_j - \mu_j$. With the variational parameter estimates $v_j = \tfrac{1}{2}\hat{\xi}_j\left(\delta_j^2 - \|\theta_j\|^2_{\Sigma_j^{-1}}\right)$, we have,

$$\theta_j = \left(\frac{\sqrt{v_j^2 + \delta_j^2\|d_j\|^2_{\Sigma_j^{-1}}}\; - v_j}{\|d_j\|^2_{\Sigma_j^{-1}}}\right) d_j$$ (24)


For the structure matrix update, let $y_{jt} \triangleq s_t - \mu_j$, and define,

$$\Sigma_{j\nu} \triangleq (N\alpha_j^{\ell+1}\hat{\nu}_j)^{-1}\sum_{t=1}^N \hat{e}_{jt}\,\nu_{jt}\, y_{jt} y_{jt}^T$$ (25)

then $\Sigma_j$ is updated as in (16). And finally, for the shape parameters, we have,

$$\delta_j = \frac{u_j + \sqrt{u_j^2 + \hat{\eta}_j^2\|\theta_j\|^2}}{\hat{\eta}_j}$$ (26)

where $u_j = (2\delta_j)^{-1}\left(\delta_j^2 - \|\theta_j\|^2\right)F'(\delta_j)$ and $\hat{\eta}_j = 2\log 2 + 2N^{-1}\sum_{t=1}^N \log\cosh\left(\tfrac{1}{2}\|y_{jt}\|_{\Sigma_j^{-1}}\right)$.

6. EXPERIMENTS

Monotonic convergence of the algorithm, without the need to set or tune step sizes, has been verified. Figure 1 shows a case study application to the eyeblink muscle signal component typically seen in EEG recordings. In Figure 2 we plot the Fisher's z mixture fits, showing the log histogram, the model, and the mixture components constituting each model. In Figure 3 we show the determination of model order using the Generalized Likelihood Ratio Test approach, where twice the change in log likelihood is distributed $\chi^2(k)$, with $k$ the difference in the number of degrees of freedom.

Fig. 1. Topographic map and plot of eyeblink electric potential. Spikes correspond to blinks.

Fig. 2. Fisher's z mixture model fits of the eyeblink source distribution with M = 2, 3, 5, 7.

Fig. 3. Plots of twice the log likelihood versus model degrees of freedom. Red lines mark the boundary of the region of models having a significant likelihood increase under the Generalized Likelihood Ratio Test.

7. CONCLUSION

We have derived a multivariate density model generalizing the non-symmetric Generalized Logistic, or Fisher's z distribution, based on the GLSM model of Barndorff-Nielsen et al. [7]. We derived an EM algorithm to update the location, scale, and drift parameters using novel variational representations, and we used the explicit likelihood formula to derive an EM algorithm to fit a finite mixture model.

REFERENCES

[1] R. A. Fisher, "On the 'probable error' of a coefficient of correlation deduced from a small sample," Metron, vol. 1, pp. 3–32, 1921; reprinted in Collected Papers of R. A. Fisher, The University of Adelaide, 1971.

[2] R. L. Prentice, "Discrimination among some parametric models," Biometrika, vol. 62, no. 3, pp. 607–614, 1975.

[3] R. A. Fisher, "The mathematical distributions used in the common tests of significance," Econometrica, vol. 3, pp. 353–363, 1935; reprinted in Contributions to Mathematical Statistics by R. A. Fisher, New York: Wiley, 1950.

[4] R. L. Prentice, "A generalization of the probit and logit methods for dose response curves," Biometrics, vol. 32, no. 4, pp. 761–768, 1976.

[5] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Volume 2, John Wiley & Sons, Inc.: New York, 1995.

[6] J. B. McDonald, "Parametric models for partially adaptive estimation with skewed and leptokurtic residuals," Economics Letters, vol. 37, pp. 273–278, 1991.

[7] O. Barndorff-Nielsen, J. Kent, and M. Sørensen, "Normal variance-mean mixtures and z distributions," International Statistical Review, vol. 50, pp. 145–159, 1982.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Iteratively reweighted least squares for linear regression when errors are Normal/Independent distributed," in Multivariate Analysis V, P. R. Krishnaiah, Ed., North Holland Publishing Company, 1980, pp. 35–57.
