Stat Comput (2012) 22:287–299 DOI 10.1007/s11222-010-9225-9
Maximum likelihood inference for mixtures of skew Student-t-normal distributions through practical EM-type algorithms

Hsiu J. Ho · Saumyadipta Pyne · Tsung I. Lin
Received: 11 January 2010 / Accepted: 27 December 2010 / Published online: 19 January 2011 © Springer Science+Business Media, LLC 2011
Abstract This paper deals with the problem of maximum likelihood estimation for a mixture of skew Student-t-normal distributions, which is a novel model-based tool for clustering heterogeneous (multiple-group) data in the presence of skewed and heavy-tailed outcomes. We present two analytically simple EM-type algorithms for iteratively computing the maximum likelihood estimates. The observed information matrix is derived for obtaining the asymptotic standard errors of the parameter estimates. A small simulation study is conducted to demonstrate the superiority of the skew Student-t-normal distribution over the skew t distribution. The proposed methodology is particularly useful for analyzing multimodal asymmetric data as produced by major biotechnological platforms such as flow cytometry. We provide such an application with the help of an illustrative example.

Keywords ECM algorithm · ECME algorithm · Flow cytometry · Outliers · ST mixtures · STN mixtures
H.J. Ho
Department of Statistics, Tunghai University, Taichung 407, Taiwan

S. Pyne
Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA

T.I. Lin (corresponding author)
Department of Applied Mathematics and Institute of Statistics, National Chung Hsing University, Taichung 404, Taiwan
e-mail: [email protected]

T.I. Lin
Department of Public Health, China Medical University, Taichung 404, Taiwan
1 Introduction

Finite mixture models, represented as convex linear combinations of component density functions mixed in varying proportions, have been a major analytical tool in applications arising naturally in a number of scientific areas, including density estimation, supervised classification, unsupervised clustering, data mining, image analysis, pattern recognition and machine learning (see, e.g., Titterington et al. 1985; McLachlan and Basford 1988; McLachlan and Peel 2000; Bishop 2006; Frühwirth-Schnatter 2006). In most of these applications, the component densities are assumed to be Gaussian because of the Gaussian distribution's wide applicability and desirable properties. However, in many practical situations the measured component densities may exhibit highly asymmetric behavior and heavier-than-normal tails, subsequently yielding incorrect clustering identifications. To cope with such obstacles, several authors have in recent years worked on mixture models with asymmetric component densities. Lin et al. (2007a) proposed a novel mixture model using the skew t (ST) distribution of Azzalini and Capitanio (2003), called the STMIX model, which accommodates both skewness and thick tails for making more robust inferences. More recently, Pyne et al. (2009) and Lin (2010) extended the STMIX model to multivariate cases using two variants of the multivariate skew t distribution proposed by Azzalini and Capitanio (2003) and Sahu et al. (2003), respectively. Karlis and Santourian (2009) considered the use of finite mixtures of the normal inverse Gaussian distribution and its multivariate extensions. Frühwirth-Schnatter and Pyne (2010) investigated the Bayesian analysis of univariate and multivariate skew normal and skew t mixtures. Lately, Gómez et al. (2007) introduced the skew Student-t-normal (STN) distribution and claimed that it is a good alternative for modelling heavy-tailed data with strong degrees
of asymmetry. They also showed that it has a wider range of skewness than the skew-normal (SN) distribution (Azzalini 2005) or the family of distributions introduced in Nadarajah and Kotz (2003). Cabral et al. (2008) adopted a Bayesian approach to modelling mixtures of STN distributions, called the STNMIX model henceforth, via the implementation of Markov chain Monte Carlo algorithms. Like the STMIX model, the STNMIX model includes normal mixtures (NMIX), t mixtures (TMIX) and the skew normal mixtures (SNMIX) introduced by Lin et al. (2007b) as special cases.

The main objective of this paper is to offer computationally feasible EM-type algorithms for calculating the maximum likelihood (ML) estimates of the parameters of STNMIX models. Moreover, we provide a simple way of obtaining the standard errors of the estimates by inverting the observed information matrix. It is worth noting that the STN and ST densities have the same set of parameters, which play the same roles. We conduct a simulation study to compare the algorithmic flexibility of these two families of distributions in terms of computation time and converged log-likelihood values.

The outline of the paper is as follows. In Sect. 2, we establish notation and outline some main results. Section 3 discusses the specification of the STNMIX model and presents two EM-type algorithms for obtaining the ML estimates of the parameters. The proposed method is illustrated in Sect. 4 with a set of flow cytometry data. In Sect. 5, we undertake a simulation study to compare the fitting performance and computational cost of the ST and STN models. Some concluding remarks are given in Sect. 6.
2 Some characterizations of the STN distribution

To simplify notation, we let $\phi(\cdot)$ and $\Phi(\cdot)$ denote the probability density function (pdf) and the cumulative distribution function (cdf) of the standard normal distribution, respectively. Let

$$t(x \mid \xi, \sigma^2, \nu) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu}\,\sigma}\Big(1 + \frac{(x-\xi)^2}{\nu\sigma^2}\Big)^{-(\nu+1)/2}$$

denote the pdf of the t distribution with location $\xi$, scale $\sigma^2$ and degrees of freedom (df) $\nu$, writing $t(x \mid \nu)$ for the case $\xi = 0$ and $\sigma = 1$; and let $\Gamma(\alpha, \beta)$ denote the gamma distribution with density $g(x \mid \alpha, \beta) \propto x^{\alpha-1}\exp\{-\beta x\}$. We start by defining the STN distribution and then study some further properties. As introduced by Gómez et al. (2007), a random variable Y is said to follow the STN distribution with location parameter $\xi \in \mathbb{R}$, scale parameter $\sigma^2 \in (0, \infty)$, skewness parameter $\lambda \in \mathbb{R}$ and df $\nu \in (0, \infty)$ if it has the density

$$\psi(y) = 2\, t(y \mid \xi, \sigma^2, \nu)\, \Phi\Big(\lambda\,\frac{y-\xi}{\sigma}\Big). \qquad (1)$$

We shall write $Y \sim \mathrm{STN}(\xi, \sigma^2, \lambda, \nu)$ if Y has the density (1). The density (1) includes the t distribution ($\lambda = 0$) and the truncated t distribution ($|\lambda| \to \infty$) as two special cases. In addition, as $\nu \to \infty$, (1) becomes the skew normal density, the normal density ($\lambda = 0$) or the truncated normal density ($|\lambda| \to \infty$). Following Cabral et al. (2008), the STN distribution has a convenient stochastic representation

$$Y = \xi + \sigma\Big(\frac{\lambda|\zeta_1|}{\sqrt{\tau(\tau+\lambda^2)}} + \frac{\zeta_2}{\sqrt{\tau+\lambda^2}}\Big), \qquad (2)$$

where $\zeta_1$ and $\zeta_2$ are two independent N(0, 1) random variables and $\tau \sim \Gamma(\nu/2, \nu/2)$. Setting $\gamma = \tau^{-1/2}(\tau+\lambda^2)^{1/2}|\zeta_1|$, a further hierarchical representation of the STN distribution can be written as

$$Y \mid (\gamma, \tau) \sim N\Big(\xi + \frac{\sigma\lambda\gamma}{\tau+\lambda^2},\, \frac{\sigma^2}{\tau+\lambda^2}\Big),$$
$$\gamma \mid \tau \sim \mathrm{TN}\Big(0,\, \frac{\tau+\lambda^2}{\tau};\, (0, \infty)\Big), \qquad (3)$$
$$\tau \sim \Gamma(\nu/2, \nu/2),$$

where $\mathrm{TN}(\mu, \sigma^2; (a, b))$ represents the truncated normal distribution for $N(\mu, \sigma^2)$ lying within the truncation interval $(a, b)$. From (3), the joint pdf of Y, $\gamma$ and $\tau$ is given by

$$f(y, \gamma, \tau) = \frac{1}{\pi\sigma}\,\frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\,\tau^{(\nu+1)/2-1}\, I(\gamma > 0)\, \exp\Big(-\frac{\nu\tau}{2}\Big) \times \exp\Big\{-\frac{1}{2}\Big(\gamma - \frac{\lambda(y-\xi)}{\sigma}\Big)^2 - \frac{1}{2}\,\frac{\tau(y-\xi)^2}{\sigma^2}\Big\}. \qquad (4)$$

Integrating out $\gamma$ in (4), we get

$$f(y, \tau) = \Big(\frac{2}{\pi\sigma^2}\Big)^{1/2}\frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\,\tau^{(\nu+1)/2-1}\exp\Big(-\frac{(\nu+u^2)\tau}{2}\Big)\,\Phi(\lambda u), \qquad (5)$$

where $u = (y-\xi)/\sigma$. Dividing (4) by (5) gives

$$f(\gamma \mid y, \tau) = \frac{I(\gamma > 0)}{\sqrt{2\pi}\,\Phi(\lambda u)}\exp\Big\{-\frac{(\gamma - \lambda u)^2}{2}\Big\} = f(\gamma \mid y), \qquad (6)$$

implying that $\gamma$ and $\tau$ are conditionally independent given $Y = y$. It follows from (6) that the conditional distribution of $\gamma$ given $Y = y$ is

$$\gamma \mid Y = y \sim \mathrm{TN}(\lambda u, 1; (0, \infty)).$$

Moreover, dividing (5) by (1) yields

$$f(\tau \mid y) = \frac{1}{\Gamma((\nu+1)/2)}\Big(\frac{\nu+u^2}{2}\Big)^{(\nu+1)/2}\tau^{(\nu+1)/2-1}\exp\Big\{-\frac{(\nu+u^2)\tau}{2}\Big\}. \qquad (7)$$

It follows from (7) that the conditional distribution of $\tau$ given $Y = y$ is

$$\tau \mid Y = y \sim \Gamma\Big(\frac{\nu+1}{2},\, \frac{\nu+u^2}{2}\Big).$$

As can be seen from (2), an alternative hierarchical representation of the STN distribution is

$$Y \mid \tau \sim \mathrm{SN}(\xi, \tau^{-1}\sigma^2, \tau^{-1/2}\lambda), \qquad \tau \sim \Gamma(\nu/2, \nu/2). \qquad (8)$$

Making use of Lemma 1 in Lin et al. (2007b), which gives a simple way of iteratively obtaining high-order moments of the SN distribution, we obtain

$$E(Y) = \xi + \sqrt{\frac{\nu}{\pi}}\,\sigma a_\nu \eta_{11}, \quad \text{if } \nu > 1, \qquad (9)$$
$$\mathrm{var}(Y) = \sigma^2\Big(\frac{\nu}{\nu-2} - \frac{\nu}{\pi}\, a_\nu^2 \eta_{11}^2\Big), \quad \text{if } \nu > 2, \qquad (10)$$
$$\gamma_Y = \frac{a_\nu\big[2a_\nu^2\eta_{11}^3 - \frac{3\pi}{\nu-2}\eta_{11} + \pi\big(\frac{\eta_{13}}{\nu} + \frac{2\eta_{31}}{\nu-3}\big)\big]}{\big[\frac{\pi}{\nu-2} - a_\nu^2\eta_{11}^2\big]^{3/2}}, \quad \text{if } \nu > 3, \qquad (11)$$
$$\kappa_Y = -3 + \frac{2\pi\big[\frac{3\pi(\nu-3)}{(\nu-4)(\nu-2)^2} - 2a_\nu^2\eta_{11}\big(\frac{\eta_{13}}{\nu} + \frac{2\eta_{31}}{\nu-3}\big)\big]}{\big[\frac{\pi}{\nu-2} - a_\nu^2\eta_{11}^2\big]^2}, \quad \text{if } \nu > 4, \qquad (12)$$

where $\gamma_Y$ and $\kappa_Y$ are the skewness and kurtosis coefficients, $a_\nu = \Gamma(\frac{\nu-1}{2})/\Gamma(\frac{\nu}{2})$ and

$$\eta_{st} = \int_0^\infty \frac{\lambda}{(\tau+\lambda^2)^{t/2}}\, g\Big(\tau \,\Big|\, \frac{\nu-s}{2}, \frac{\nu}{2}\Big)\, d\tau \quad (\nu > s),$$

which can be easily evaluated using the numerical integration routine 'integrate' built into R (R Development Core Team 2008). As $\nu \to \infty$, (9)–(12) reduce to

$$E(Y) = \xi + \sqrt{\frac{2}{\pi}}\,\sigma\delta, \qquad \mathrm{var}(Y) = \sigma^2\Big(1 - \frac{2}{\pi}\delta^2\Big),$$
$$\gamma_Y = \frac{\sqrt{2}(4-\pi)\delta^3}{[\pi - 2\delta^2]^{3/2}}, \qquad \kappa_Y = 3 + \frac{8(\pi-3)\delta^4}{[\pi - 2\delta^2]^2},$$

where $\delta = \lambda/\sqrt{1+\lambda^2}$. Table 1 compares the ranges of $\gamma_Y$ and $\kappa_Y$ between the ST and STN distributions for different values of $\nu$. It is clearly seen that the ranges of the skewness coefficient of the STN distribution are identical to those of the ST distribution, but the kurtosis coefficient of the STN takes values in a wider range.
Table 1 Ranges of the skewness and kurtosis coefficients for different values of ν for the STN and ST distributions

      STN                                              ST
ν     Skewness (lower, upper)  Kurtosis (lower, upper)  Skewness (lower, upper)  Kurtosis (lower, upper)
5     (−2.5496, 2.5496)        (8.9245, 23.1085)        (−2.5496, 2.5496)        (9.0000, 23.1085)
6     (−2.0518, 2.0518)        (5.8782, 12.6735)        (−2.0518, 2.0518)        (6.0000, 12.6735)
7     (−1.7977, 1.7977)        (4.8720, 9.4612)         (−1.7977, 1.7977)        (5.0000, 9.4612)
8     (−1.6430, 1.6430)        (4.3773, 7.9363)         (−1.6430, 1.6430)        (4.5000, 7.9363)
9     (−1.5386, 1.5386)        (4.0858, 7.0543)         (−1.5386, 1.5386)        (4.2000, 7.0543)
10    (−1.4634, 1.4634)        (3.8946, 6.4821)         (−1.4634, 1.4634)        (4.0000, 6.4821)
11    (−1.4065, 1.4065)        (3.7601, 6.0821)         (−1.4065, 1.4065)        (3.8571, 6.0821)
12    (−1.3620, 1.3620)        (3.6605, 5.7871)         (−1.3620, 1.3620)        (3.7500, 5.7871)
13    (−1.3263, 1.3263)        (3.5839, 5.5609)         (−1.3263, 1.3263)        (3.6667, 5.5609)
14    (−1.2969, 1.2969)        (3.5232, 5.3820)         (−1.2969, 1.2969)        (3.6000, 5.3820)
15    (−1.2723, 1.2723)        (3.4740, 5.2370)         (−1.2723, 1.2723)        (3.5455, 5.2370)
16    (−1.2514, 1.2514)        (3.4334, 5.1173)         (−1.2514, 1.2514)        (3.5000, 5.1173)
17    (−1.2335, 1.2335)        (3.3992, 5.0167)         (−1.2335, 1.2335)        (3.4615, 5.0167)
18    (−1.2179, 1.2179)        (3.3701, 4.9310)         (−1.2179, 1.2179)        (3.4286, 4.9310)
19    (−1.2042, 1.2042)        (3.3450, 4.8572)         (−1.2042, 1.2042)        (3.4000, 4.8572)
20    (−1.1921, 1.1921)        (3.3231, 4.7929)         (−1.1921, 1.1921)        (3.3750, 4.7929)
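To make (9)–(12) concrete, the short R sketch below evaluates η_st with the 'integrate' routine mentioned above and then computes the STN skewness and kurtosis coefficients. The function names (a_nu, eta, stn_skewness, stn_kurtosis) are ours, and the snippet is an illustration of the formulas as reconstructed here, not code from the paper.

# a_nu = Gamma((nu-1)/2) / Gamma(nu/2), computed on the log scale for stability
a_nu <- function(nu) exp(lgamma((nu - 1) / 2) - lgamma(nu / 2))

# eta_st = int_0^Inf lambda / (tau + lambda^2)^(t/2) g(tau | (nu-s)/2, nu/2) dtau, nu > s
eta <- function(s, t, lambda, nu)
  integrate(function(tau) lambda / (tau + lambda^2)^(t / 2) *
              dgamma(tau, shape = (nu - s) / 2, rate = nu / 2),
            lower = 0, upper = Inf)$value

stn_skewness <- function(lambda, nu) {        # eq. (11), requires nu > 3
  a <- a_nu(nu); e11 <- eta(1, 1, lambda, nu)
  e13 <- eta(1, 3, lambda, nu); e31 <- eta(3, 1, lambda, nu)
  num <- 2 * a^2 * e11^3 - 3 * pi * e11 / (nu - 2) +
    pi * (e13 / nu + 2 * e31 / (nu - 3))
  a * num / (pi / (nu - 2) - a^2 * e11^2)^(3 / 2)
}

stn_kurtosis <- function(lambda, nu) {        # eq. (12), requires nu > 4
  a <- a_nu(nu); e11 <- eta(1, 1, lambda, nu)
  e13 <- eta(1, 3, lambda, nu); e31 <- eta(3, 1, lambda, nu)
  num <- 3 * pi * (nu - 3) / ((nu - 4) * (nu - 2)^2) -
    2 * a^2 * e11 * (e13 / nu + 2 * e31 / (nu - 3))
  -3 + 2 * pi * num / (pi / (nu - 2) - a^2 * e11^2)^2
}

stn_kurtosis(0, 5)   # lambda = 0 recovers the t kurtosis 3 + 6/(nu - 4) = 9

As a quick sanity check, setting λ = 0 gives zero skewness and the kurtosis of the t distribution, in line with the special cases of (1).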
Fig. 1 A comparison of skewness and kurtosis contours between the STN and ST distributions across different combinations of ν and λ
Figure 1 depicts how γY and κY change with different combinations of ν and λ. From this figure, the contour patterns of the two distributions are generally similar except in the region where λ is close to zero. In the non-mixture case, the figure gives helpful guidance on choosing suitable initial values of λ and ν for ML fitting based on the values of the sample skewness and kurtosis.

Although the STN distribution is quite analogous to the ST, there are essential differences between the two proposals. We summarize the differences below and also clarify why we prefer the STN model from a computational perspective.

(a) The main difference between the STN and ST distributions lies in the expression of the skewness function. The skewness function of the STN is the cdf of the standard normal distribution and depends on λ only, while that of the ST is a cdf of the Student's t distribution and involves both λ and ν. Computation of the STN density is therefore much simpler and faster than that of the ST.

(b) The Fisher information Iλν for the STN distribution is zero, indicating that the estimates of λ and ν are asymptotically uncorrelated. This is not the case for the ST distribution; specifically, the skewness effects of the ST distribution can be partly explained by its df and vice versa.

(c) The ECM algorithm for the STN distribution is analytically simple and easy to implement, unlike the ST distribution, for which the E-step involves numerical integration that can be computationally prohibitive for large samples.
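As a small illustration of point (a), the STN density (1) takes only a couple of lines of R to evaluate. The function below is our own sketch built directly from (1); the name dstn and the default arguments are ours.

dstn <- function(y, xi = 0, sigma2 = 1, lambda = 0, nu = 30) {
  sigma <- sqrt(sigma2)
  u <- (y - xi) / sigma
  # psi(y) = 2 t(y | xi, sigma^2, nu) Phi(lambda * u), cf. (1)
  2 * dt(u, df = nu) / sigma * pnorm(lambda * u)
}

No evaluation of a Student's t cdf with df ν + 1 is needed here, which is precisely the computational advantage over the ST density.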
3 The skew Student-t-normal mixture model

3.1 Model formulation

Consider n independent random variables $Y_1, \ldots, Y_n$ arising from a mixture of STN distributions. The pdf of a g-component STNMIX model is

$$f(y_j \mid \Theta) = \sum_{i=1}^{g} w_i\, \psi(y_j \mid \xi_i, \sigma_i^2, \lambda_i, \nu_i), \qquad (13)$$

where the $w_i$'s are mixing proportions, constrained to be positive with $\sum_{i=1}^{g} w_i = 1$, $\psi(y_j \mid \xi_i, \sigma_i^2, \lambda_i, \nu_i)$ is the STN density defined in (1), and $\Theta = (w_1, \ldots, w_{g-1}, \theta_1, \ldots, \theta_g)$ represents all unknown parameters, with the component vector $\theta_i$ consisting of $(\xi_i, \sigma_i^2, \lambda_i, \nu_i)$. To pose this mixture model as an incomplete-data problem, it is conceivable to introduce allocation variables $Z_j = (Z_{1j}, \ldots, Z_{gj})^T$, $j = 1, \ldots, n$, whose values are binary:

$$Z_{ij} = \begin{cases} 1 & \text{if } Y_j \text{ belongs to group } i, \\ 0 & \text{otherwise,} \end{cases}$$
and satisfying $\sum_{i=1}^{g} Z_{ij} = 1$. This implies that $Z_j$ follows a multinomial distribution with one trial and cell probabilities $w_1, \ldots, w_g$, denoted by $Z_j \sim M(1; w_1, \ldots, w_g)$. It follows from (3) that a hierarchical formulation of (13) can be represented by

$$Y_j \mid (\gamma_j, \tau_j, Z_{ij}=1) \sim N\Big(\xi_i + \frac{\sigma_i\lambda_i\gamma_j}{\tau_j+\lambda_i^2},\, \frac{\sigma_i^2}{\tau_j+\lambda_i^2}\Big),$$
$$\gamma_j \mid (\tau_j, Z_{ij}=1) \sim \mathrm{TN}\Big(0,\, \frac{\tau_j+\lambda_i^2}{\tau_j};\, (0, \infty)\Big), \qquad (14)$$
$$\tau_j \mid (Z_{ij}=1) \sim \Gamma(\nu_i/2, \nu_i/2),$$
$$Z_j \sim M(1; w_1, \ldots, w_g).$$

As an immediate consequence, we establish the following proposition, which is crucial for evaluating some conditional expectations in the proposed EM-type algorithms.

Proposition 1 Given the hierarchical representation (14), we have the following (the symbol "$\mid \cdots$" denotes conditioning on $Z_{ij}=1$ and $Y_j = y_j$):

(a) The conditional expectations of $\gamma_j^k$ are given by
$$E(\gamma_j^k \mid \cdots) = \begin{cases} \lambda_i u_{ij} + \dfrac{\phi(\lambda_i u_{ij})}{\Phi(\lambda_i u_{ij})}, & \text{if } k = 1; \\[1ex] (k-1)\,E(\gamma_j^{k-2} \mid \cdots) + \lambda_i u_{ij}\, E(\gamma_j^{k-1} \mid \cdots), & \text{for } k \ge 2. \end{cases}$$

(b) Some specific conditional expectations related to functions of $\tau_j$ are
$$E(\tau_j^k \mid \cdots) = \frac{2^k\,\Gamma((\nu_i+1)/2 + k)}{(\nu_i + u_{ij}^2)^k\,\Gamma((\nu_i+1)/2)},$$
$$E(\log\tau_j \mid \cdots) = \mathrm{DG}\Big(\frac{\nu_i+1}{2}\Big) - \log\Big(\frac{\nu_i + u_{ij}^2}{2}\Big),$$
where $u_{ij} = (y_j - \xi_i)/\sigma_i$ and $\mathrm{DG}(x) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function.

(c) The conditional expectation of $Z_{ij}$ given $Y_j = y_j$ is
$$E(Z_{ij} \mid Y_j = y_j) = \Pr(Z_{ij} = 1 \mid Y_j = y_j) = \frac{w_i\,\psi(y_j \mid \xi_i, \sigma_i^2, \lambda_i, \nu_i)}{f(y_j \mid \Theta)}.$$

Proof The proof is straightforward and hence is omitted.

3.2 Computational aspects

In this subsection, we describe in detail how to exploit two extensions of the EM algorithm (Dempster et al. 1977), the Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin 1993) and the Expectation Conditional Maximization Either (ECME) algorithm (Liu and Rubin 1994), for ML estimation of the STNMIX model. The basic idea of ECM is that the maximization (M) step of EM is replaced by several computationally simple conditional maximization (CM) steps, while ECME is a simple modification of ECM that replaces some of the CM-steps with steps that maximize the constrained actual log-likelihood function and has been shown to have a faster convergence rate. A key feature of these two EM-type algorithms is that they preserve the stability of the EM algorithm through their monotone convergence.

For notational convenience, let $y = (y_1, \ldots, y_n)$, $\gamma = (\gamma_1, \ldots, \gamma_n)$, $\tau = (\tau_1, \ldots, \tau_n)$ and $Z = (Z_1, \ldots, Z_n)$. According to (14), the complete-data log-likelihood function of $\Theta$ given $(\gamma, \tau, Z, y)$, aside from additive constants, is

$$\ell_c(\Theta \mid y, \gamma, \tau, Z) = \sum_{j=1}^{n}\sum_{i=1}^{g} Z_{ij}\Big[\log w_i - \frac{1}{2}\log\sigma_i^2 - \frac{1}{2}\Big(\frac{\tau_j(y_j-\xi_i)^2}{\sigma_i^2} + \big(\gamma_j - \beta_i(y_j-\xi_i)\big)^2\Big) + \frac{\nu_i}{2}\log\frac{\nu_i}{2} - \log\Gamma\Big(\frac{\nu_i}{2}\Big) + \frac{\nu_i}{2}(\log\tau_j - \tau_j)\Big], \qquad (15)$$

where $\beta_i = \lambda_i/\sigma_i$ is a reparameterized parameter. Hence, the expected value of the complete-data log-likelihood (15) evaluated with $\Theta = \hat\Theta^{(h)}$, which we shall denote the Q-function, is

$$Q(\Theta \mid \hat\Theta^{(h)}) = E\big(\ell_c(\Theta \mid y, \gamma, \tau, Z) \,\big|\, y, \hat\Theta^{(h)}\big). \qquad (16)$$

To evaluate the Q-function, the necessary conditional expectations include $\hat z_{ij}^{(h)} = E(Z_{ij} \mid y_j, \hat\Theta^{(h)})$, $\hat\tau_{ij}^{(h)} = E(\tau_j \mid y_j, Z_{ij}=1, \hat\Theta^{(h)})$, $\hat\kappa_{ij}^{(h)} = E(\log\tau_j \mid y_j, Z_{ij}=1, \hat\Theta^{(h)})$, $\hat\gamma_{1ij}^{(h)} = E(\gamma_j \mid y_j, Z_{ij}=1, \hat\Theta^{(h)})$ and $\hat\gamma_{2ij}^{(h)} = E(\gamma_j^2 \mid y_j, Z_{ij}=1, \hat\Theta^{(h)})$. By Proposition 1, we have

$$\hat z_{ij}^{(h)} = \frac{\hat w_i^{(h)}\,\psi(y_j \mid \hat\theta_i^{(h)})}{f(y_j \mid \hat\Theta^{(h)})}, \qquad \hat\tau_{ij}^{(h)} = \frac{\hat\nu_i^{(h)} + 1}{\hat\nu_i^{(h)} + \hat u_{ij}^{2(h)}},$$
$$\hat\kappa_{ij}^{(h)} = \mathrm{DG}\Big(\frac{\hat\nu_i^{(h)} + 1}{2}\Big) - \log\Big(\frac{\hat\nu_i^{(h)} + \hat u_{ij}^{2(h)}}{2}\Big), \qquad (17)$$

and

$$\hat\gamma_{1ij}^{(h)} = \hat\lambda_i^{(h)}\hat u_{ij}^{(h)} + \frac{\phi(\hat\lambda_i^{(h)}\hat u_{ij}^{(h)})}{\Phi(\hat\lambda_i^{(h)}\hat u_{ij}^{(h)})}, \qquad \hat\gamma_{2ij}^{(h)} = 1 + \hat\lambda_i^{(h)}\hat u_{ij}^{(h)}\hat\gamma_{1ij}^{(h)},$$

where $\hat u_{ij}^{(h)} = (y_j - \hat\xi_i^{(h)})/\hat\sigma_i^{(h)}$. Therefore, the Q-function (16) can be written as

$$Q(\Theta \mid \hat\Theta^{(h)}) = \sum_{j=1}^{n}\sum_{i=1}^{g} \hat z_{ij}^{(h)}\Big[\log w_i - \frac{1}{2}\log\sigma_i^2 - \frac{1}{2\sigma_i^2}\hat\tau_{ij}^{(h)}(y_j-\xi_i)^2 - \frac{1}{2}\big(\hat\gamma_{2ij}^{(h)} - 2\beta_i\hat\gamma_{1ij}^{(h)}(y_j-\xi_i) + \beta_i^2(y_j-\xi_i)^2\big) + \frac{\nu_i}{2}\log\frac{\nu_i}{2} - \log\Gamma\Big(\frac{\nu_i}{2}\Big) + \frac{\nu_i}{2}\big(\hat\kappa_{ij}^{(h)} - \hat\tau_{ij}^{(h)}\big)\Big]. \qquad (18)$$

In summary, the implementation of the ECM algorithm proceeds as follows:

E-step: Given $\Theta = \hat\Theta^{(h)}$, compute $\hat z_{ij}^{(h)}$, $\hat\tau_{ij}^{(h)}$, $\hat\kappa_{ij}^{(h)}$, $\hat\gamma_{1ij}^{(h)}$ and $\hat\gamma_{2ij}^{(h)}$, for $i = 1, \ldots, g$ and $j = 1, \ldots, n$.

CM-step 1: Calculate $\hat w_i^{(h+1)} = \hat n_i^{(h)}/n$, where $\hat n_i^{(h)} = \sum_{j=1}^{n}\hat z_{ij}^{(h)}$.

CM-step 2: Update $\hat\xi_i^{(h)}$ by maximizing (18) over $\xi_i$, which leads to
$$\hat\xi_i^{(h+1)} = \frac{\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat\tau_{ij}^{(h)}y_j + \hat\lambda_i^{2(h)}\sum_{j=1}^{n}\hat z_{ij}^{(h)}y_j - \hat\sigma_i^{(h)}\hat\lambda_i^{(h)}\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat\gamma_{1ij}^{(h)}}{\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat\tau_{ij}^{(h)} + \hat\lambda_i^{2(h)}\hat n_i^{(h)}}.$$

CM-step 3: Fix $\xi_i = \hat\xi_i^{(h+1)}$ and update $\hat\sigma_i^{2(h)}$ by maximizing (18) over $\sigma_i^2$, which gives
$$\hat\sigma_i^{2(h+1)} = \frac{1}{\hat n_i^{(h)}}\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat\tau_{ij}^{(h)}\big(y_j - \hat\xi_i^{(h+1)}\big)^2.$$

CM-step 4: Fix $\xi_i = \hat\xi_i^{(h+1)}$ and $\sigma_i^2 = \hat\sigma_i^{2(h+1)}$, and update $\hat\beta_i^{(h)}$ by maximizing (18) over $\beta_i$, which leads to
$$\hat\beta_i^{(h+1)} = \frac{\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat\gamma_{1ij}^{(h)}(y_j - \hat\xi_i^{(h+1)})}{\sum_{j=1}^{n}\hat z_{ij}^{(h)}(y_j - \hat\xi_i^{(h+1)})^2}.$$
Accordingly, for updating $\hat\lambda_i^{(h)}$, we calculate
$$\hat\lambda_i^{(h+1)} = \hat\sigma_i^{(h+1)}\hat\beta_i^{(h+1)} = \frac{\hat\sigma_i^{(h+1)}\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat\gamma_{1ij}^{(h)}\hat u_{ij}^{(h+1)}}{\sum_{j=1}^{n}\hat z_{ij}^{(h)}\hat u_{ij}^{2(h+1)}},$$
where $\hat u_{ij}^{(h+1)} = (y_j - \hat\xi_i^{(h+1)})/\hat\sigma_i^{(h+1)}$.

CM-step 5: Calculate $\hat\nu_i^{(h+1)}$ by maximizing (18) over $\nu_i$, which is equivalent to solving the root of the following equation:
$$\log\frac{\nu_i}{2} + 1 - \mathrm{DG}\Big(\frac{\nu_i}{2}\Big) + \frac{1}{\hat n_i^{(h)}}\sum_{j=1}^{n}\hat z_{ij}^{(h)}\big(\hat\kappa_{ij}^{(h)} - \hat\tau_{ij}^{(h)}\big) = 0.$$

Note that CM-step 5 requires a one-dimensional search for the root of $\nu_i$, which can easily be carried out with the 'uniroot' function built into R. In the call to 'uniroot', one can optionally specify an appropriately short interval to reduce the searching time, for example lower=1 and upper=100. As pointed out by Liu and Rubin (1994), the one-dimensional search involved in CM-step 5 can be very slow in certain situations. To circumvent this limitation, one may use the more efficient ECME algorithm, in which some CM-steps of the ECM algorithm are replaced by steps that maximize a restricted actual log-likelihood function, called 'CML-steps', without sacrificing simplicity. If the dfs are assumed to be identical, say $\nu_1 = \cdots = \nu_g = \nu$, the above CM-step 5 is suggested to switch to the following CML-step:

CML-step 5: Update $\hat\nu^{(h)}$ by
$$\hat\nu^{(h+1)} = \arg\max_{\nu}\; \sum_{j=1}^{n}\log\Big(\sum_{i=1}^{g}\hat w_i^{(h+1)}\,\psi\big(y_j \mid \hat\xi_i^{(h+1)}, \hat\sigma_i^{2(h+1)}, \hat\lambda_i^{(h+1)}, \nu\big)\Big).$$

The E-step and CM/CML-steps are alternated repeatedly until a suitable convergence rule is satisfied, e.g., the Aitken acceleration-based stopping criterion $|\ell^{(h+1)} - \ell_\infty^{(h+1)}| < \epsilon$, where $\ell^{(h+1)}$ is the observed log-likelihood evaluated at $\hat\Theta^{(h+1)}$, $\ell_\infty^{(h+1)}$ is the asymptotic estimate of the log-likelihood at iteration $h+1$ (McLachlan and Krishnan 2008, Chap. 4.9) and $\epsilon$ is the desired tolerance. For the numerical analyses in Sect. 4 and Sect. 5, a default value of $\epsilon = 10^{-6}$ was used to terminate the iterations. Note that the above procedure is also applicable to the simple case ($g = 1$) by treating $Z_{ij} = 1$.
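To fix ideas, the following R sketch runs the above E-step and CM-steps for the single-component case (g = 1, so that all ẑij = 1 and CM-step 1 is vacuous). It is a minimal illustration using our own function names, with a simplified tolerance rule on the observed log-likelihood in place of the Aitken criterion; it is not the authors' implementation.

stn_ecm1 <- function(y, xi, sigma, lambda, nu, max_iter = 500, tol = 1e-6) {
  n <- length(y); ll_old <- -Inf
  for (h in seq_len(max_iter)) {
    ## E-step: conditional expectations in (17), with z_ij = 1 since g = 1
    u   <- (y - xi) / sigma
    tau <- (nu + 1) / (nu + u^2)
    kap <- digamma((nu + 1) / 2) - log((nu + u^2) / 2)
    g1  <- lambda * u + dnorm(lambda * u) / pnorm(lambda * u)
    ## CM-steps 2-4: closed-form updates of xi, sigma^2 and beta = lambda/sigma
    xi     <- (sum(tau * y) + lambda^2 * sum(y) - sigma * lambda * sum(g1)) /
              (sum(tau) + n * lambda^2)
    sigma  <- sqrt(sum(tau * (y - xi)^2) / n)
    beta   <- sum(g1 * (y - xi)) / sum((y - xi)^2)
    lambda <- sigma * beta
    ## CM-step 5: one-dimensional root search, over the interval suggested above
    nu <- uniroot(function(v) log(v / 2) + 1 - digamma(v / 2) + mean(kap - tau),
                  lower = 1, upper = 100)$root
    ## observed log-likelihood under (1); simple tolerance rule instead of Aitken
    ll <- sum(log(2 * dt((y - xi) / sigma, df = nu) / sigma *
                    pnorm(lambda * (y - xi) / sigma)))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(xi = xi, sigma2 = sigma^2, lambda = lambda, nu = nu, loglik = ll)
}

The mixture case only adds the weighting by ẑij from Proposition 1(c) in each sum; every CM-step remains in closed form except the df update.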
3.3 Notes on implementation

We address some practical issues raised in employing the above procedures. As with other iterative optimization algorithms, one needs good starting values of $\Theta$ to achieve convergence swiftly. A simple way of automatically generating a selection of initial values is formally described below.

1. Perform the K-means algorithm initialized with a set of randomly chosen cluster centers.
2. Initialize the zero-one membership indicators $\hat Z_j^{(0)} = \{\hat z_{ij}^{(0)}\}_{i=1}^{g}$ according to the K-means clustering results.
3. Initialize the mixing proportions, component locations and component scale variances as follows:
$$\hat w_i^{(0)} = \frac{\sum_{j=1}^{n}\hat z_{ij}^{(0)}}{n}, \qquad \hat\xi_i^{(0)} = \frac{\sum_{j=1}^{n}\hat z_{ij}^{(0)}y_j}{\sum_{j=1}^{n}\hat z_{ij}^{(0)}} \qquad \text{and} \qquad \hat\sigma_i^{2(0)} = \frac{\sum_{j=1}^{n}\hat z_{ij}^{(0)}(y_j - \hat\xi_i^{(0)})^2}{\sum_{j=1}^{n}\hat z_{ij}^{(0)}}.$$
4. Initialize $\hat\lambda_1^{(0)} = \cdots = \hat\lambda_g^{(0)} = 0$ and use a relatively small initial value for the $\nu_i$'s, say $\hat\nu_i^{(0)} = 10$ for all $i$ (a code sketch of this scheme is given at the end of this subsection).
The EM-type algorithms do not directly provide the asymptotic covariance matrix of the estimates. An information-based method is considered for evaluating the standard error estimates; the details are sketched in the Appendix. In practice, the log-likelihood function tends to have multiple modes, so the algorithm needs to be initialized with a variety of starting values. Since the K-means method does not guarantee a unique clustering, one can perform the K-means clustering with various randomly chosen cluster centers. The global optimum can then be determined directly by comparing their mode masses and the associated log-likelihood values. We adopt the Bayesian information criterion (BIC; Schwarz 1978) for selecting the number of components in mixture models. Here, BIC takes the form $\mathrm{BIC} = m\log n - 2\ell_{\max}$, where $\ell_{\max}$ is the maximized log-likelihood and m is the number of free parameters in the model. Accordingly, models with small BIC scores are preferred. Under certain regularity conditions, Keribin (2000) presented a theoretical justification for the consistency of BIC in determining the optimal number of components of a mixture model. Fraley and Raftery (1998, 2002) and, more recently, McNicholas and Murphy (2008) have shown the effectiveness of BIC in selecting the number of components for Gaussian mixture models.
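A compact R rendering of initialization steps 1–4 and of the BIC score is sketched below; the helper names are ours and the snippet merely illustrates the scheme described above.

init_stnmix <- function(y, g) {
  km <- kmeans(y, centers = g)                 # step 1: K-means from random centers
  z  <- outer(km$cluster, seq_len(g), `==`)    # step 2: zero-one memberships z_ij
  w  <- colMeans(z)                            # step 3: mixing proportions
  xi <- as.numeric(tapply(y, km$cluster, mean))                        # locations
  s2 <- as.numeric(tapply(y, km$cluster, function(x) mean((x - mean(x))^2)))  # scales
  list(w = w, xi = xi, sigma2 = s2,
       lambda = rep(0, g), nu = rep(10, g))    # step 4: lambda = 0, nu = 10
}

bic <- function(loglik, m, n) m * log(n) - 2 * loglik   # smaller values are preferred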
4 An illustration

Flow cytometry is a popular biotechnological platform used in both clinical and research applications for rapid single-cell investigation of surface and intracellular markers. It is a powerful laser-based technique for scanning, profiling and sorting microscopic particles flowing in a stream of water. Typically, it uses multiple fluorescent dye-conjugated antibodies to study, in parallel channels, the expression of different proteins on the surface of, or within, each cell of a given sample, in the form of corresponding fluorescence intensities. Flow cytometry is extensively used in a large number of biomedical applications, such as molecular and cellular biology (e.g., to measure DNA content), hematology (e.g., to test leukemia samples) and immunology (e.g., to conduct CD4 tests for HIV/AIDS). Very recently, it was shown by Pyne et al. (2009) and Frühwirth-Schnatter and Pyne (2010) that flow cytometric data are ideally suited for multimodal asymmetric finite mixture modelling. These studies showed how to computationally mimic the biologist's view of a flow cytometric sample, consisting of a mixture of non-Gaussian cell populations, with mathematical models defined by finite mixtures of skew normal and skew t distributions. Thus each distinct cell population can be modelled by its corresponding distribution and proportion (or size). Importantly, such a finite mixture model approach can automate the traditional flow cytometric practice of identifying cell populations in a sample manually, which is often labor-intensive, subjective and non-reproducible. In this regard, the same studies noted that the skewed nature of flow cytometric cell populations makes accurate model selection problematic. In the present study, we tackle this challenge with our new STNMIX model, which outperformed the NMIX, TMIX, SNMIX and STMIX models in terms of precision of flow cytometric data modelling, and by performing accurate model selection with the help of greedy learning.

Flow cytometry data are generally stored in the binary Flow Cytometry Standard (.fcs) format. Glynn (2006) developed the FCSExtract utility to convert .fcs data to ASCII text format and also provided a working data set 'CC4_067_BM.fcs' consisting of ten attributes measured on 5,634 cells. To illustrate our univariate mixture modelling approach, we analyze the data from the channel PE.Cy5, corresponding to the attribute measured with the fluorescent dye Cy5. As specified by the scale transformation (y − a)/(b − a) in the BioConductor package flowCore (Hahne et al. 2009), we preprocessed the data by dividing each observation by the scale factor b of 1,000, using the shift factor a of 0. We consider model selection by fitting a series of mixture models, namely the NMIX, SNMIX, TMIX, STMIX and STNMIX models, to this data set with g = 2–7 components.
Table 2 ML estimation results for fitting different mixture models to the PE.Cy5 data. Results for the TMIX, STMIX and STNMIX models with unequal degrees of freedom are shown in parentheses. The smallest BIC value within each family is marked with '†'; the best chosen model is indicated by '∗'

Model    g    ℓmax                     m        BIC
NMIX     2    −4458.171                 5       8959.524
         3    −4080.938                 8       8230.969
         4    −4030.673                11       8156.347
         5    −4014.003                14       8148.918†
         6    −4010.034                17       8166.890
         7    −4009.344                20       8191.419
TMIX     2    −4280.360 (−4270.741)     6 (7)   8612.539 (8601.937)
         3    −4056.998 (−4037.870)     9 (11)  8191.726 (8170.743)
         4    −4031.355 (−4026.758)    12 (15)  8166.348 (8183.065)
         5    −4014.841 (−4013.507)    15 (19)  8159.231† (8191.108)
         6    −4010.268 (−4009.619)    18 (23)  8175.995 (8217.879)
         7    −4004.222 (−4003.601)    21 (27)  8189.813 (8240.390)
SNMIX    2    −4242.546                 7       8545.549
         3    −4041.599                11       8178.199
         4    −4014.494                15       8158.536†
         5    −4010.717                19       8185.528
         6    −4007.368                23       8213.377
         7    −4005.381                27       8243.950
STMIX    2    −4111.852 (−4102.415)     8 (9)   8292.797 (8282.559)
         3    −4021.244 (−4017.311)    12 (14)  8146.127† (8155.534)
         4    −4011.890 (−4010.503)    16 (19)  8161.965 (8185.101)
         5    −4010.282 (−4007.208)    20 (24)  8193.295 (8221.693)
         6    −4006.730 (−4006.422)    24 (29)  8220.738 (8263.305)
         7    −4002.771 (−3998.183)    28 (34)  8247.366 (8290.009)
STNMIX   2    −4103.167 (−4101.475)     8 (9)   8275.426 (8280.678)
         3∗   −4015.449 (−4014.084)    12 (14)  8134.536† (8149.081)
         4    −4010.478 (−4009.734)    16 (19)  8159.142 (8183.564)
         5    −4006.375 (−4005.542)    20 (24)  8185.482 (8218.362)
         6    −4005.687 (−4005.013)    24 (29)  8218.651 (8260.487)
         7    −4004.679 (−4002.594)    28 (34)  8251.182 (8298.831)
Consequently, the latter three models are fitted under scenarios with both equal and unequal dfs. We ran the ECME algorithms as described in Sect. 3 and also in Lin et al. (2007a, 2007b). For each run, thirty different initializations based on the K-means clustering technique were used. The algorithm was terminated based on the stopping criterion described in Sect. 3.2. When g is large, the algorithm may stop somewhat early, and thus some small perturbations of the converged log-likelihoods may exist. The values of the log-likelihood maxima and the numbers of parameters, together with the BIC values, are listed in Table 2. Note that a smaller value of BIC is associated with a better fitted model. It is clearly seen that the 3-component STNMIX model with equal dfs has the best fit, followed by the 3-component STMIX model with equal dfs. Both models effectively capture the actual number of underlying cell populations, while the NMIX, TMIX and SNMIX models need more than three components to capture the skewness as well as the heavy tails in the data.
Table 3 compares the ML estimates, along with the associated standard errors, of the best-fitting STNMIX model with the corresponding values for the other four competing 3-component mixture models. The estimated dfs of the STMIX and STNMIX models are considerably less than 10 for all components, signifying the validity of using heavy-tailed component densities. The estimated skewness parameters in the STMIX model are all positive and somewhat significant. As for the STNMIX model, two of the components seem to be only weakly skewed, as the standard errors of λ̂2 and λ̂3 are relatively large. One possible explanation of this phenomenon is that in the STMIX model the fat-tailed behavior is partly characterized by the skewness parameters, yielding larger estimates for them. To verify this further, we carried out the Kolmogorov-Smirnov (K-S) goodness-of-fit test to compare the quality of fit among the three skew modelling options. The resulting K-S distances are 0.010, 0.007 and 0.006 for the SNMIX, STMIX and STNMIX models, respectively.
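The K-S distances quoted above can, in principle, be reproduced by comparing the empirical cdf with the fitted mixture cdf. The sketch below is our own illustration (helper names ours), with the mixture cdf obtained by numerically integrating the STN density of each component.

dstn <- function(y, xi, sigma2, lambda, nu) {
  s <- sqrt(sigma2); u <- (y - xi) / s
  2 * dt(u, df = nu) / s * pnorm(lambda * u)
}

ks_stnmix <- function(y, w, xi, sigma2, lambda, nu) {
  # fitted mixture cdf at a point q, via numerical integration of each component
  Fmix <- function(q) sum(w * mapply(function(x0, s2, l, v)
    integrate(dstn, -Inf, q, xi = x0, sigma2 = s2, lambda = l, nu = v)$value,
    xi, sigma2, lambda, nu))
  ys <- sort(y); n <- length(ys)
  Ffit <- vapply(ys, Fmix, numeric(1))
  # K-S statistic: max |F_n - F_fit| over the sample
  max(abs(seq_len(n) / n - Ffit), abs((seq_len(n) - 1) / n - Ffit))
}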
Table 3 Summary results from fitting various three-component mixture models to the PE.Cy5 data (Est = estimate, Sd = standard error)

Parameter  NMIX            TMIX            SNMIX           STMIX           STNMIX
           Est    Sd       Est     Sd      Est    Sd       Est    Sd       Est    Sd
w1         0.240  0.007    0.252   0.007   0.287  0.009    0.297  0.010    0.301  0.013
w2         0.403  0.013    0.396   0.018   0.340  0.020    0.307  0.039    0.304  0.091
w3         0.357  –        0.352   –       0.373  –        0.396  –        0.395  –
ξ1         0.349  0.005    0.357   0.005   0.177  0.012    0.180  0.011    0.193  0.017
ξ2         1.504  0.025    1.520   0.025   1.067  0.030    1.172  0.057    1.274  0.140
ξ3         1.845  0.005    1.854   0.006   1.753  0.028    1.788  0.032    1.804  0.053
σ1         0.150  0.004    0.147   0.005   0.290  0.020    0.281  0.021    0.261  0.029
σ2         0.549  0.012    0.463   0.018   0.709  0.020    0.510  0.066    0.389  0.128
σ3         0.153  0.005    0.144   0.006   0.185  0.018    0.163  0.013    0.151  0.011
λ1         –      –        –       –       2.659  0.410    2.732  0.429    2.156  0.506
λ2         –      –        –       –       3.062  0.722    1.914  0.686    0.969  0.934
λ3         –      –        –       –       0.950  0.377    0.637  0.385    0.477  0.615
ν          –      –        10.325  2.004   –      –        6.763  2.377    4.451  2.030
Fig. 2 Histogram of the PE.Cy5 data and three fitted STNMIX densities (g = 1–3). The dashed lines indicate the true grouping of fitted STN densities
Given that a smaller K-S distance corresponds to a better resemblance between the experimental data and the fitted distribution, we conclude that the most precise modelling of the count and asymmetry of this flow cytometric dataset is achieved by the STNMIX model.

Following a strategy similar to Vlassis and Likas (2002), we devised a greedy EM algorithm for learning STN mixtures, which is quite effective in selecting an appropriate number of components and overcoming the local-maxima problem. The detailed implementation is sketched in a longer version of this paper, which is available from the authors. The greedy EM procedure determines that g = 3 is the favored choice, which is consistent with the number selected by the information-based criterion in Table 2. Figure 2 shows
the fitted densities for g = 1–3 components. Based on this graphical visualization, the fitted STNMIX density appears to adapt to the shape of the histogram very well.
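For readers wishing to reproduce a plot in the spirit of Fig. 2, the sketch below overlays the 3-component STNMIX density with equal dfs, using the point estimates from Table 3. The vector y is assumed to hold the preprocessed PE.Cy5 intensities (not reproduced here), and we treat the tabulated σ values as standard deviations; both are our assumptions.

dstn <- function(x, xi, sigma2, lambda, nu) {
  s <- sqrt(sigma2); u <- (x - xi) / s
  2 * dt(u, df = nu) / s * pnorm(lambda * u)
}
w  <- c(0.301, 0.304, 0.395)           # STNMIX point estimates from Table 3
xi <- c(0.193, 1.274, 1.804)
s2 <- c(0.261, 0.389, 0.151)^2         # assuming the tabulated sigma are sd's
la <- c(2.156, 0.969, 0.477)
nu <- 4.451
hist(y, breaks = 60, freq = FALSE, xlab = "PE.Cy5", main = "")
curve(w[1] * dstn(x, xi[1], s2[1], la[1], nu) +
      w[2] * dstn(x, xi[2], s2[2], la[2], nu) +
      w[3] * dstn(x, xi[3], s2[3], la[3], nu),
      add = TRUE, lwd = 2)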
5 Simulation study

In this section, we undertake a small simulation study to compare the fitting performance and the required computational burden of the existing ST model with those of the proposed STN model. Lin et al. (2007a) developed an efficient ECME algorithm for the ML fitting of ST distributions. A comparison of some characterizations of the two models is given in Table 4. Notice that both of them reduce to the SN model as ν → ∞. As shown by Lin et al. (2007a, 2007b), the SN distribution has a convenient hierarchical representation:

$$Y \mid \gamma \sim N\big(\xi + \delta\gamma,\, (1-\delta^2)\sigma^2\big), \qquad \gamma \sim \mathrm{TN}(0, \sigma^2; (0, \infty)), \qquad (19)$$
where $\delta = \lambda/\sqrt{1+\lambda^2}$. To make a fair comparison, we generated synthetic data from the SN distribution, with presumed parameters ξ = 1, σ = 2 and λ = 3 so as to produce highly skewed observations. In the simulation, the sample sizes n considered were 250, 500, 1,000, 5,000 and 10,000. To further create data with fat tails, we added 2% extreme values, generated from a uniform distribution over the interval [10, 20], to each simulated data set. The simulation sample sizes thus become 255, 510, 1,020, 5,100 and 10,200 after appending the 2% artificial outliers. Simulations were run with a total of 500 replications for each sample size. To conduct the experimental studies, each simulated data set was fitted via the ECME algorithm under the ST and STN scenarios with the same initial values, namely ξ̂(0) = 1, σ̂(0) = 2, λ̂(0) = 3 and ν̂(0) = 4. In each trial, we recorded the respective converged log-likelihoods, denoted by ℓ̂ST and ℓ̂STN, and the CPU times consumed (in seconds), denoted by CTST and CTSTN. All computations were carried out with R version 2.9.2 in a 32-bit Windows environment on a desktop PC with a 3.00 GHz Intel Core(TM)2 Duo processor and 4.0 GB RAM. It is noted that the numbers of required iterations are not directly comparable because each iteration of the ECME algorithm involves different numbers of inner iterations for computing the update of ν. Table 5 presents the average log-likelihood values and the CPU times for the various sample sizes. We found that both models produce comparable average log-likelihood values. In particular, the STN model performs significantly better when the sample size becomes large (n > 1,000). Moreover, the average CPU time in the STN scenario is substantially reduced. Note that the above simulation study was also carried out for various other sets of parameters, all of which gave similar results.
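The synthetic data just described are easy to generate from the hierarchical representation (19); the following sketch (our own code and function name) draws one SN sample and appends the 2% uniform outliers.

rsn <- function(n, xi, sigma, lambda) {
  delta <- lambda / sqrt(1 + lambda^2)
  g <- abs(rnorm(n, 0, sigma))                    # gamma ~ TN(0, sigma^2; (0, Inf))
  rnorm(n, mean = xi + delta * g,
        sd = sqrt(1 - delta^2) * sigma)           # Y | gamma as in (19)
}
set.seed(1)
n <- 250
y <- c(rsn(n, xi = 1, sigma = 2, lambda = 3),
       runif(ceiling(0.02 * n), min = 10, max = 20))  # 2% artificial outliers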
Table 4 Comparison of some characterizations between the STN and ST distributions

STN(ξ, σ², λ, ν):
- Stochastic representation: $Y = \xi + \sigma\Big(\dfrac{\lambda|V|}{\sqrt{\tau(\tau+\lambda^2)}} + \dfrac{U}{\sqrt{\tau+\lambda^2}}\Big)$
- Hierarchical representation: $Y \mid \tau \sim \mathrm{SN}(\xi, \sigma^2/\tau, \lambda/\sqrt{\tau})$
- Density: $\dfrac{2}{\sigma}\, t(u \mid \nu)\, \Phi(\lambda u)$
- Mean: $\xi + \sqrt{\nu/\pi}\,\sigma a_\nu \eta_{11}$
- Variance: $\sigma^2\Big(\dfrac{\nu}{\nu-2} - \dfrac{\nu}{\pi} a_\nu^2 \eta_{11}^2\Big)$
- Skewness: $a_\nu\Big[2a_\nu^2\eta_{11}^3 + \dfrac{2\pi}{\nu-3}\eta_{31} + \dfrac{\pi}{\nu}\eta_{13} - \dfrac{3\pi}{\nu-2}\eta_{11}\Big]\Big/\Big[\dfrac{\pi}{\nu-2} - a_\nu^2\eta_{11}^2\Big]^{3/2}$
- Kurtosis: $-3 + 2\pi\Big[\dfrac{3\pi(\nu-3)}{(\nu-4)(\nu-2)^2} - 2a_\nu^2\eta_{11}\Big(\dfrac{\eta_{13}}{\nu} + \dfrac{2\eta_{31}}{\nu-3}\Big)\Big]\Big/\Big[\dfrac{\pi}{\nu-2} - a_\nu^2\eta_{11}^2\Big]^2$

ST(ξ, σ², λ, ν):
- Stochastic representation: $Y = \xi + \dfrac{\sigma}{\sqrt{\tau}}\Big(\dfrac{\lambda|V|}{\sqrt{1+\lambda^2}} + \dfrac{U}{\sqrt{1+\lambda^2}}\Big)$
- Hierarchical representation: $Y \mid \tau \sim \mathrm{SN}(\xi, \sigma^2/\tau, \lambda)$
- Density: $\dfrac{2}{\sigma}\, t(u \mid \nu)\, T\Big(\lambda u\sqrt{\dfrac{\nu+1}{\nu+u^2}} \,\Big|\, \nu+1\Big)$
- Mean: $\xi + \sqrt{\nu/\pi}\,\sigma a_\nu \delta$
- Variance: $\sigma^2\Big(\dfrac{\nu}{\nu-2} - \dfrac{\nu}{\pi} a_\nu^2 \delta^2\Big)$
- Skewness: $a_\nu\delta\Big[2a_\nu^2\delta^2 - \dfrac{\pi\delta^2}{\nu-3} + \dfrac{3\pi}{(\nu-3)(\nu-2)}\Big]\Big/\Big[\dfrac{\pi}{\nu-2} - a_\nu^2\delta^2\Big]^{3/2}$
- Kurtosis: $-3 + 2\pi\Big[\dfrac{3\pi(\nu-3)}{(\nu-4)(\nu-2)^2} - \dfrac{2a_\nu^2}{\nu-3}\delta^2(3-\delta^2)\Big]\Big/\Big[\dfrac{\pi}{\nu-2} - a_\nu^2\delta^2\Big]^2$

Note: U and V are independent standard normal random variables; τ ∼ Γ(ν/2, ν/2) with mean 1; T(· | ν) is the cdf of the Student's t distribution with df ν; $a_\nu = \Gamma(\frac{\nu-1}{2})/\Gamma(\frac{\nu}{2})$; u = (y − ξ)/σ; $\delta = \lambda/\sqrt{1+\lambda^2}$; and $\eta_{st} = \int_0^\infty \frac{\lambda}{(\tau+\lambda^2)^{t/2}}\, g(\tau \mid \frac{\nu-s}{2}, \frac{\nu}{2})\, d\tau$.
Table 5 Comparison of average log-likelihood ℓ(θ̂) and CPU time (in seconds; CT) for the ST and STN models under various sample sizes

        n = 255             n = 510             n = 1020             n = 5100             n = 10200
Model   ℓ(θ̂)      CT       ℓ(θ̂)      CT       ℓ(θ̂)       CT       ℓ(θ̂)       CT       ℓ(θ̂)        CT
STN     −454.283  0.370    −909.864  0.681    −1820.673  1.363    −9107.609    9.141   −18218.840  17.551
ST      −456.262  0.937    −913.956  1.880    −1828.824  3.999    −9149.175   22.719   −18302.190  49.548
Fig. 3 Improvement of converged log-likelihoods and relative computational cost of the STN model over the ST counterpart for various sample sizes. (a) Box plots of scaled log-likelihood improvements; (b) relative improvement percentages in CPU time
As recommended by an anonymous referee, an interesting comparison can be made by generating data from the ST and STN models in turn and examining how often we can recognize the true model. To conduct this experimental study, we generated 500 samples of sizes n = 100, 300, 500 and 1,000 from the ST and STN distributions with the same parameter setting. The presumed values of the location, scale variance and skewness parameters are ξ = 0, σ² = 1 and λ = 4, respectively. For the dfs used in the study, we take a low value (ν = 3), yielding a heavy-tailed distribution, and a high value (ν = 50), approaching the SN distribution. For model comparison, each simulated data set was fitted twice, under the ST and STN scenarios, starting with ten different initializations to avoid getting stuck in local traps. In each trial, we compared the performance of the two models according to the best log-likelihoods obtained from each of the ten fits. Note that since the two competing models have the same number of parameters, the log-likelihood can be regarded as a reasonable model selection criterion. A simulated data set (ignoring the known true model) is then assigned to whichever model has the larger log-likelihood. Table 6 presents the proportions of selecting the true model based on 500 replications. The table shows that the STN model is more recognizable than the ST model in all cases.
Table 6 Proportion of selecting the true model (%)

Model   ν     n = 100   n = 300   n = 500   n = 1000
STN     3     78.4      81.2      87.2      92.6
ST      3     38.2      72.2      75.2      90.0
STN     50    63.6      58.6      58.4      56.6
ST      50    37.2      42.6      45.4      46.4
Interestingly, when ν is small, the probability of selecting the true model increases steadily with the sample size n. In contrast, the two models are not easy to distinguish for large ν.
6 Conclusion

We have proposed a new family of mixture models based on STN distributions, called the STNMIX model, which accommodates multimodality, asymmetry and heavy tails jointly and offers greater flexibility than its STMIX counterpart. We have described a four-level hierarchical formulation for the STNMIX model and presented
efficient EM-type algorithms for parameter estimation in a complete-data framework. Experimental results show that the STNMIX model performs better than the other competitors. So far, the present application is limited to data with univariate outcomes. There is a growing literature on multivariate mixtures using non-elliptically contoured distributions, such as the recent proposals of Lin (2009, 2010), Wang et al. (2009) and Karlis and Santourian (2009). The methodology, as well as the EM-type algorithms, can be extended to a multivariate version of the STNMIX model; this will be reported in a follow-up paper.
Acknowledgements The authors would like to express their deepest gratitude to the Chief Editor, the Associate Editor and two anonymous referees for their valuable comments and suggestions that greatly improved this paper. This research was supported by the National Science Council of Taiwan (Grant No. NSC97-2118-M-005-001-MY2).

Appendix: Estimation of standard errors

We follow the information-based method exploited by Basford et al. (1997) to calculate the asymptotic covariance matrix of the ML estimates. The empirical information matrix is defined as

$$I_e(\Theta \mid y) = \sum_{j=1}^{n} s(y_j \mid \Theta)\, s^{T}(y_j \mid \Theta) - n^{-1} S(y \mid \Theta)\, S^{T}(y \mid \Theta), \qquad (20)$$

where $S(y \mid \Theta) = \sum_{j=1}^{n} s(y_j \mid \Theta)$. Following Louis (1982), the individual score can be determined as

$$s(y_j \mid \Theta) = \frac{\partial \log f(y_j \mid \Theta)}{\partial \Theta} = E\Big(\frac{\partial \ell_{cj}(\Theta \mid y_j, Z_j, \gamma_j, \tau_j)}{\partial \Theta} \,\Big|\, y_j, \Theta\Big),$$

where $\ell_{cj}(\Theta \mid y_j, Z_j, \gamma_j, \tau_j)$ is the complete-data log-likelihood formed from the single observation $y_j$. Substituting the ML estimates $\hat\Theta$ into $\Theta$, (20) reduces to

$$I_e(\hat\Theta \mid y) = \sum_{j=1}^{n} \hat s_j \hat s_j^{T}, \qquad (21)$$

where $\hat s_j$ is an individual score vector containing the elements $(\hat s_{j,w_1}, \ldots, \hat s_{j,w_{g-1}}, \hat s_{j,\xi_1}, \ldots, \hat s_{j,\xi_g}, \hat s_{j,\sigma_1}, \ldots, \hat s_{j,\sigma_g}, \hat s_{j,\lambda_1}, \ldots, \hat s_{j,\lambda_g}, \hat s_{j,\nu_1}, \ldots, \hat s_{j,\nu_g})^{T}$. Explicit expressions for the elements of $\hat s_j$ are

$$\hat s_{j,w_r} = \frac{\hat z_{rj}}{\hat w_r} - \frac{\hat z_{gj}}{\hat w_g} \quad (r = 1, \ldots, g-1),$$
$$\hat s_{j,\xi_i} = \frac{\hat z_{ij}}{\hat\sigma_i}\big((\hat\tau_{ij} + \hat\lambda_i^2)\hat u_{ij} - \hat\lambda_i \hat\gamma_{1ij}\big),$$
$$\hat s_{j,\sigma_i} = \frac{\hat z_{ij}}{\hat\sigma_i}\big((\hat\tau_{ij} + \hat\lambda_i^2)\hat u_{ij}^2 - \hat\lambda_i \hat\gamma_{1ij}\hat u_{ij} - 1\big),$$
$$\hat s_{j,\lambda_i} = \hat z_{ij}\,\hat u_{ij}(\hat\gamma_{1ij} - \hat\lambda_i \hat u_{ij}),$$
$$\hat s_{j,\nu_i} = \frac{\hat z_{ij}}{2}\Big(\log\frac{\hat\nu_i}{2} + 1 - \mathrm{DG}\Big(\frac{\hat\nu_i}{2}\Big) + \hat\kappa_{ij} - \hat\tau_{ij}\Big),$$

where $\hat z_{ij}$, $\hat u_{ij}$, $\hat\gamma_{1ij}$, $\hat\kappa_{ij}$ and $\hat\tau_{ij}$ are $\hat z_{ij}^{(h)}$, $\hat u_{ij}^{(h)}$, $\hat\gamma_{1ij}^{(h)}$, $\hat\kappa_{ij}^{(h)}$ and $\hat\tau_{ij}^{(h)}$ in (17) evaluated at $\hat\Theta$, respectively. In the case of $\nu_1 = \cdots = \nu_g = \nu$, this leads to $\hat s_{j,\nu} = \sum_{i=1}^{g} \hat s_{j,\nu_i}$. Standard errors of $\hat\Theta$ are extracted from the square roots of the diagonal elements of the inverse of (21).
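As an illustration of how (21) is used in practice, the sketch below assembles the score vectors for a fitted single STN component (g = 1, so ẑij = 1 and the w_r elements drop out) and returns the asymptotic standard errors. The function name is ours, and the code simply transcribes the expressions above.

stn_se <- function(y, xi, sigma, lambda, nu) {
  u   <- (y - xi) / sigma
  tau <- (nu + 1) / (nu + u^2)                         # E(tau_j | y_j) at the MLE
  kap <- digamma((nu + 1) / 2) - log((nu + u^2) / 2)   # E(log tau_j | y_j)
  g1  <- lambda * u + dnorm(lambda * u) / pnorm(lambda * u)  # E(gamma_j | y_j)
  S <- cbind(xi     = ((tau + lambda^2) * u - lambda * g1) / sigma,
             sigma  = ((tau + lambda^2) * u^2 - lambda * g1 * u - 1) / sigma,
             lambda = u * (g1 - lambda * u),
             nu     = (log(nu / 2) + 1 - digamma(nu / 2) + kap - tau) / 2)
  I_e <- crossprod(S)          # empirical information (21): sum_j s_j s_j^T
  sqrt(diag(solve(I_e)))       # standard errors from the inverse information
}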
References

Azzalini, A.: The skew-normal distribution and related multivariate families (with discussion). Scand. J. Stat. 32, 159–188 (2005)
Azzalini, A., Capitanio, A.: Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. J. R. Stat. Soc. B 65, 367–389 (2003)
Barndorff-Nielsen, O.E.: Normal inverse Gaussian distributions and stochastic volatility modelling. Scand. J. Stat. 24, 1–13 (1997)
Basford, K.E., Greenway, D.R., McLachlan, G.J., Peel, D.: Standard errors of fitted means under normal mixture. Comput. Stat. 12, 1–17 (1997)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Singapore (2006)
Cabral, C.R.B., Bolfarine, H., Pereira, J.R.G.: Bayesian density estimation using skew Student-t-normal mixtures. Comput. Stat. Data Anal. 52, 5075–5090 (2008)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–38 (1977)
Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41, 578–588 (1998)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–612 (2002)
Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, New York (2006)
Frühwirth-Schnatter, S., Pyne, S.: Bayesian inference for finite mixtures of univariate and multivariate skew normal and skew-t distributions. Biostatistics 11, 317–336 (2010)
Glynn, E.F.: FCSExtract Utility. Stowers Institute for Medical Research. Available online at: http://research.stowers-institute.org/efg/ScientificSoftware/Utility/FCSExtract/ (2006)
Gómez, H.W., Venegas, O., Bolfarine, H.: Skew-symmetric distributions generated by the distribution function of the normal distribution. Environmetrics 18, 395–407 (2007)
Hahne, F., LeMeur, N., Brinkman, R.R., Ellis, B., Haaland, P., Sarkar, D., Spidlen, J., Strain, E., Gentleman, R.: flowCore: a Bioconductor package for high throughput flow cytometry. BMC Bioinform. 10, 106 (2009)
Karlis, D., Santourian, A.: Model-based clustering with non-elliptically contoured distributions. Stat. Comput. 19, 73–83 (2009)
Keribin, C.: Consistent estimation of the order of mixture models. Sankhyā 62, 49–66 (2000)
Li, J.Q., Barron, A.R.: Mixture density estimation. In: Advances in Neural Information Processing Systems 12. MIT Press, Cambridge (2000)
Lin, T.I.: Maximum likelihood estimation for multivariate skew normal mixture models. J. Multivar. Anal. 100, 257–265 (2009)
Lin, T.I.: Robust mixture modeling using multivariate skew t distributions. Stat. Comput. 20, 343–356 (2010)
Lin, T.I., Lee, J.C., Hsieh, W.J.: Robust mixture modeling using the skew t distribution. Stat. Comput. 17, 81–92 (2007a)
Lin, T.I., Lee, J.C., Yen, S.Y.: Finite mixture modelling using the skew normal distribution. Stat. Sin. 17, 909–927 (2007b)
Liu, C.H., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633–648 (1994)
Louis, T.A.: Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. B 44, 226–233 (1982)
McLachlan, G.J., Basford, K.E.: Mixture Models: Inference and Application to Clustering. Dekker, New York (1988)
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, New York (2008)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18, 285–296 (2008)
Meinicke, P., Brodag, T., Fricke, W.F., Waack, S.: P-value based visualization of codon usage data. Algorithms Mol. Biol. 1, 10 (2006)
Meng, X.L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–278 (1993)
Nadarajah, S., Kotz, S.: Skewed distributions generated by the normal kernel. Stat. Probab. Lett. 65, 269–277 (2003)
Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.I., Maier, L., Baecher-Allan, C., McLachlan, G.J., Tamayo, P., Hafler, D.A., De Jager, P.L., Mesirov, J.P.: Automated high-dimensional flow cytometric data analysis. Proc. Natl. Acad. Sci. USA 106, 8519–8524 (2009)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2008)
Sahu, S.K., Dey, D.K., Branco, M.D.: A new class of multivariate skew distributions with application to Bayesian regression models. Can. J. Stat. 31, 129–150 (2003)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Titterington, D.M., Smith, A.F.M., Makov, U.E.: Statistical Analysis of Finite Mixture Distributions. Wiley, New York (1985)
Vlassis, N., Likas, A.: A greedy EM algorithm for Gaussian mixture learning. Neural Process. Lett. 15, 77–87 (2002)
Wang, K., Ng, S.K., McLachlan, G.J.: Multivariate skew t mixture models: applications to fluorescence-activated cell sorting data. In: Proceedings of DICTA 2009, Conference of Digital Image Computing: Techniques and Applications, Melbourne, pp. 526–531. IEEE Computer Society, Los Alamitos (2009)