Stat Comput (2012) 22:287–299 DOI 10.1007/s11222-010-9225-9

Maximum likelihood inference for mixtures of skew Student-t-normal distributions through practical EM-type algorithms

Hsiu J. Ho · Saumyadipta Pyne · Tsung I. Lin

Received: 11 January 2010 / Accepted: 27 December 2010 / Published online: 19 January 2011 © Springer Science+Business Media, LLC 2011

Abstract This paper deals with the problem of maximum likelihood estimation for a mixture of skew Student-t-normal distributions, a novel model-based tool for clustering heterogeneous (multiple-group) data in the presence of skewed and heavy-tailed outcomes. We present two analytically simple EM-type algorithms for iteratively computing the maximum likelihood estimates. The observed information matrix is derived for obtaining the asymptotic standard errors of the parameter estimates. A small simulation study is conducted to demonstrate the superiority of the skew Student-t-normal distribution over the skew t distribution. The proposed methodology is particularly useful for analyzing multimodal asymmetric data as produced by major biotechnological platforms like flow cytometry. We provide such an application with the help of an illustrative example.

Keywords ECM algorithm · ECME algorithm · Flow cytometry · Outliers · ST mixtures · STN mixtures

H.J. Ho
Department of Statistics, Tunghai University, Taichung 407, Taiwan

S. Pyne
Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA

T.I. Lin (corresponding author)
Department of Applied Mathematics and Institute of Statistics, National Chung Hsing University, Taichung 404, Taiwan
e-mail: [email protected]

T.I. Lin
Department of Public Health, China Medical University, Taichung 404, Taiwan

1 Introduction

Finite mixture models, represented as a convex combination of component density functions mixed in varying proportions, have been a major analytical tool in applications arising naturally in a number of scientific areas, including density estimation, supervised classification, unsupervised clustering, data mining, image analysis, pattern recognition and machine learning (see, e.g., Titterington et al. 1985; McLachlan and Basford 1988; McLachlan and Peel 2000; Bishop 2006; Frühwirth-Schnatter 2006). In most of these applications, the component densities are assumed to be Gaussian because of their wide applicability and desirable properties. However, in many practical situations the measured outcomes may exhibit highly asymmetric behavior and heavier-than-normal tails, subsequently yielding wrong cluster identifications. To cope with such obstacles, several authors have in recent years worked on mixture models with asymmetric component densities. Lin et al. (2007a) proposed a novel mixture model using the skew t (ST) distribution of Azzalini and Capitanio (2003), called the STMIX model, which accommodates both skewness and thick tails for making more robust inferences. More recently, Pyne et al. (2009) and Lin (2010) extended the STMIX model to multivariate cases using two variants of the multivariate skew t distribution, proposed by Azzalini and Capitanio (2003) and Sahu et al. (2003), respectively. Karlis and Santourian (2009) considered the use of finite mixtures of the normal inverse Gaussian distribution and its multivariate extensions. Frühwirth-Schnatter and Pyne (2010) investigated the Bayesian analysis of univariate and multivariate skew normal and skew t mixtures. Lately, Gómez et al. (2007) introduced the skew Student-t-normal (STN) distribution and claimed that it is a good alternative for modelling heavy-tailed data with strong degrees


of asymmetry. They also showed that it has a wider range of skewness than the skew-normal (SN) distribution (Azzalini 2005) or the family of distributions introduced by Nadarajah and Kotz (2003). Cabral et al. (2008) adopted a Bayesian approach to modelling mixtures of STN distributions, henceforth called the STNMIX model, via the implementation of Markov chain Monte Carlo algorithms. Like the STMIX model, the STNMIX model includes normal mixtures (NMIX), t mixtures (TMIX) and the skew normal mixtures (SNMIX) introduced by Lin et al. (2007b) as special cases.

The main objective of this paper is to offer computationally feasible EM-type algorithms for calculating the maximum likelihood (ML) estimates of the parameters of STNMIX models. Moreover, we provide a simple way of obtaining the standard errors of the estimates by inverting the observed information matrix. It is worth noting that the STN and ST densities have the same set of parameters, playing the same roles. We conduct a simulation study to compare the algorithmic flexibility of the two families of distributions in terms of computation time and converged log-likelihood values.

The outline of the paper is as follows. In Sect. 2, we establish notation and outline some main results. Section 3 discusses the specification of the STNMIX model and presents two EM-type algorithms for obtaining the ML estimates of the parameters. The proposed method is illustrated in Sect. 4 with a set of flow cytometry data. In Sect. 5, we undertake a simulation study to compare the fitting performance and computational cost of the ST and STN models. Some concluding remarks are given in Sect. 6.

2 Some characterizations of the STN distribution

To simplify notation, we let φ(·) and Φ(·) denote the probability density function (pdf) and the cumulative distribution function (cdf) of the standard normal distribution, respectively. Let

  t(x | ξ, σ², ν) = [Γ((ν + 1)/2) / (Γ(ν/2) √(πνσ²))] (1 + (x − ξ)²/(νσ²))^(−(ν+1)/2)

denote the pdf of the t distribution with location ξ, scale σ² and degrees of freedom (df) ν, and write t(x | ν) for the case ξ = 0 and σ = 1. Further, let Γ(α, β) denote the gamma distribution with density g(x | α, β) ∝ x^(α−1) exp{−βx}.

We start by defining the STN distribution and then study some further properties. As introduced by Gómez et al. (2007), a random variable Y is said to follow the STN distribution with location parameter ξ ∈ R, scale parameter σ² ∈ (0, ∞), skewness parameter λ ∈ R and df ν ∈ (0, ∞) if it has the density

  ψ(y) = 2 t(y | ξ, σ², ν) Φ(λ(y − ξ)/σ).   (1)

We shall write Y ∼ STN(ξ, σ², λ, ν) if Y has the density (1). The density (1) includes the t distribution (λ = 0) and the truncated t distribution (|λ| → ∞) as two special cases. In addition, as ν → ∞, (1) becomes the skew normal density, the normal density (λ = 0) or the truncated normal density (|λ| → ∞).

Following Cabral et al. (2008), the STN distribution has a convenient stochastic representation

  Y = ξ + σ ( λ|ζ₁|/√(τ(τ + λ²)) + ζ₂/√(τ + λ²) ),   (2)

where ζ₁ and ζ₂ are two independent N(0, 1) random variables and τ ∼ Γ(ν/2, ν/2). Writing γ = √(τ⁻¹(τ + λ²)) |ζ₁|, a further hierarchical representation of the STN distribution is

  Y | (γ, τ) ∼ N( ξ + σλγ/(τ + λ²), σ²/(τ + λ²) ),
  γ | τ ∼ TN( 0, (τ + λ²)/τ; (0, ∞) ),   (3)
  τ ∼ Γ(ν/2, ν/2),

where TN(μ, σ²; (a, b)) represents the N(μ, σ²) distribution truncated to the interval (a, b). From (3), the joint pdf of Y, γ and τ is given by

  f(y, γ, τ) = (1/(πσ)) [(ν/2)^(ν/2)/Γ(ν/2)] τ^((ν+1)/2 − 1) I(γ > 0)
               × exp{ −ντ/2 − τ(y − ξ)²/(2σ²) − (1/2)(γ − λ(y − ξ)/σ)² }.   (4)

Integrating out γ in (4), we get

  f(y, τ) = (2/(πσ²))^(1/2) [(ν/2)^(ν/2)/Γ(ν/2)] τ^((ν+1)/2 − 1) exp{ −(ν + u²)τ/2 } Φ(λu),   (5)

where u = (y − ξ)/σ. Dividing (4) by (5) gives

  f(γ | y, τ) = [I(γ > 0)/(√(2π) Φ(λu))] exp{ −(γ − λu)²/2 } = f(γ | y),   (6)

implying that γ and τ are conditionally independent given Y = y. It follows from (6) that the conditional distribution of γ given Y = y is

  γ | Y = y ∼ TN( λu, 1; (0, ∞) ).

Moreover, dividing (5) by (1) yields

  f(τ | y) = [1/Γ((ν + 1)/2)] ((ν + u²)/2)^((ν+1)/2) τ^((ν+1)/2 − 1) exp{ −(ν + u²)τ/2 }.   (7)

It follows from (7) that the conditional distribution of τ given Y = y is

  τ | Y = y ∼ Γ( (ν + 1)/2, (ν + u²)/2 ).

As can be seen from (2), an alternative hierarchical representation of the STN distribution is

  Y | τ ∼ SN(ξ, τ⁻¹σ², τ^(−1/2)λ),   τ ∼ Γ(ν/2, ν/2).   (8)

Making use of Lemma 1 in Lin et al. (2007b), which gives a simple way of iteratively obtaining high-order moments of the SN distribution, we obtain

  E(Y) = ξ + √(ν/π) σ aν η₁₁,   if ν > 1,   (9)

  var(Y) = σ² [ ν/(ν − 2) − (ν/π) aν² η₁₁² ],   if ν > 2,   (10)

  γ_Y = aν [ 2aν²η₁₁³ − 3πη₁₁/(ν − 2) + π(η₁₃/ν + 2η₃₁/(ν − 3)) ] / [ π/(ν − 2) − aν²η₁₁² ]^(3/2),   if ν > 3,   (11)

  κ_Y = −3 + 2π [ 3π(ν − 3)/((ν − 4)(ν − 2)²) − 2aν²η₁₁(η₁₃/ν + 2η₃₁/(ν − 3)) ] / [ π/(ν − 2) − aν²η₁₁² ]²,   if ν > 4,   (12)

where γ_Y and κ_Y are the skewness and kurtosis coefficients, aν = Γ((ν − 1)/2)/Γ(ν/2) and

  ηst = ∫₀^∞ [λ/(τ + λ²)^(t/2)] g(τ | (ν − s)/2, ν/2) dτ   (ν > s),

which can be easily evaluated with the numerical integration routine 'integrate' built into R (R Development Core Team 2008). As ν → ∞, (9)–(12) reduce to

  E(Y) = ξ + √(2/π) σδ,   var(Y) = σ² (1 − (2/π)δ²),
  γ_Y = √2 (4 − π) δ³/[π − 2δ²]^(3/2),   κ_Y = 3 + 8(π − 3)δ⁴/[π − 2δ²]²,

where δ = λ/√(1 + λ²). Table 1 compares the ranges of γ_Y and κ_Y between the ST and STN distributions for different values of ν. It is clearly seen that the range of the skewness coefficient of the STN distribution is identical to that of the ST distribution, but the kurtosis coefficient of the STN takes values in a wider range.
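Since (9)–(12) involve only the one-dimensional integrals ηst, they are straightforward to evaluate numerically. The following minimal R sketch (the function name stn_moments and its interface are ours, not from the paper) computes ηst with 'integrate' and returns the four moment measures:

    # Moment measures (9)-(12) of Y ~ STN(xi, sigma2, lambda, nu); eta(s, t)
    # evaluates eta_st numerically, with dgamma(., shape, rate) as g(. | a, b)
    stn_moments <- function(xi, sigma2, lambda, nu) {
      sigma <- sqrt(sigma2)
      a_nu  <- exp(lgamma((nu - 1) / 2) - lgamma(nu / 2))
      eta <- function(s, t) {
        integrand <- function(tau)
          lambda * (tau + lambda^2)^(-t / 2) *
            dgamma(tau, shape = (nu - s) / 2, rate = nu / 2)
        integrate(integrand, lower = 0, upper = Inf)$value
      }
      e11 <- eta(1, 1); e13 <- eta(1, 3); e31 <- eta(3, 1)
      D <- pi / (nu - 2) - a_nu^2 * e11^2
      c(mean     = xi + sqrt(nu / pi) * sigma * a_nu * e11,
        variance = sigma2 * (nu / (nu - 2) - (nu / pi) * a_nu^2 * e11^2),
        skewness = a_nu * (2 * a_nu^2 * e11^3 - 3 * pi * e11 / (nu - 2) +
                     pi * (e13 / nu + 2 * e31 / (nu - 3))) / D^(3 / 2),
        kurtosis = -3 + 2 * pi * (3 * pi * (nu - 3) / ((nu - 4) * (nu - 2)^2) -
                     2 * a_nu^2 * e11 * (e13 / nu + 2 * e31 / (nu - 3))) / D^2)
    }

For example, stn_moments(0, 1, 1e6, 5) yields skewness ≈ 2.55 and kurtosis ≈ 23.11, in agreement with the ν = 5 bounds reported in Table 1 below.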

Table 1 Ranges of the skewness and kurtosis coefficients for different values of ν for the STN and ST distributions, respectively

  ν    STN skewness           STN kurtosis           ST skewness            ST kurtosis
       lower      upper       lower      upper       lower      upper       lower      upper
  5    −2.5496    2.5496      8.9245    23.1085     −2.5496    2.5496      9.0000    23.1085
  6    −2.0518    2.0518      5.8782    12.6735     −2.0518    2.0518      6.0000    12.6735
  7    −1.7977    1.7977      4.8720     9.4612     −1.7977    1.7977      5.0000     9.4612
  8    −1.6430    1.6430      4.3773     7.9363     −1.6430    1.6430      4.5000     7.9363
  9    −1.5386    1.5386      4.0858     7.0543     −1.5386    1.5386      4.2000     7.0543
  10   −1.4634    1.4634      3.8946     6.4821     −1.4634    1.4634      4.0000     6.4821
  11   −1.4065    1.4065      3.7601     6.0821     −1.4065    1.4065      3.8571     6.0821
  12   −1.3620    1.3620      3.6605     5.7871     −1.3620    1.3620      3.7500     5.7871
  13   −1.3263    1.3263      3.5839     5.5609     −1.3263    1.3263      3.6667     5.5609
  14   −1.2969    1.2969      3.5232     5.3820     −1.2969    1.2969      3.6000     5.3820
  15   −1.2723    1.2723      3.4740     5.2370     −1.2723    1.2723      3.5455     5.2370
  16   −1.2514    1.2514      3.4334     5.1173     −1.2514    1.2514      3.5000     5.1173
  17   −1.2335    1.2335      3.3992     5.0167     −1.2335    1.2335      3.4615     5.0167
  18   −1.2179    1.2179      3.3701     4.9310     −1.2179    1.2179      3.4286     4.9310
  19   −1.2042    1.2042      3.3450     4.8572     −1.2042    1.2042      3.4000     4.8572
  20   −1.1921    1.1921      3.3231     4.7929     −1.1921    1.1921      3.3750     4.7929


Fig. 1 A comparison of skewness and kurtosis contours between the STN and ST distributions across different combinations of ν and λ

Figure 1 depicts how γ_Y and κ_Y change with different combinations of ν and λ. The contour patterns of the two distributions are generally similar, except in the area where λ is close to zero. In the non-mixture case, the figure gives helpful guidance on choosing suitable initial values of λ and ν for ML fitting based on the sample skewness and kurtosis.

Although the STN is quite analogous to the ST, there are essential differences between the two proposals. We summarize them below and clarify why we prefer the STN model from a computational perspective.

(a) The main difference between the STN and ST lies in the skewness function. The skewness function of the STN is the cdf of the standard normal distribution and depends on λ only, while that of the ST is a cdf of the Student's t distribution and depends on both λ and ν. Computation of the STN density is therefore much simpler and faster than that of the ST; see the sketch after this list.
(b) The Fisher information I_λν for the STN distribution is zero, indicating that the estimates of λ and ν are asymptotically uncorrelated. This is not the case for the ST distribution; specifically, the skewness effects of the ST distribution can be partly explained by its df and vice versa.
(c) The ECM algorithm for the STN distribution is analytically simple and easy to implement, unlike the ST distribution, for which the E-step involves numerical integration that can be computationally prohibitive for large samples.
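To make point (a) concrete, here is a minimal R sketch (the function names dstn and dst are ours) of the two densities, the STN form being density (1) and the ST form the Azzalini–Capitanio density quoted later in Table 4; the STN evaluates one standard normal cdf, whereas the ST evaluates a Student's t cdf whose argument also involves ν:

    # STN density (1): 2 t(u | nu) Phi(lambda * u) / sigma, with u = (y - xi)/sigma
    dstn <- function(y, xi, sigma2, lambda, nu) {
      sigma <- sqrt(sigma2); u <- (y - xi) / sigma
      2 * dt(u, df = nu) * pnorm(lambda * u) / sigma
    }

    # Skew t density, for comparison: the skewing factor is a t cdf depending on nu
    dst <- function(y, xi, sigma2, lambda, nu) {
      sigma <- sqrt(sigma2); u <- (y - xi) / sigma
      2 * dt(u, df = nu) *
        pt(lambda * u * sqrt((nu + 1) / (nu + u^2)), df = nu + 1) / sigma
    }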

3 The skew Student-t-normal mixture model

3.1 Model formulation

Consider n independent random variables Y₁, ..., Y_n taken from a mixture of STN distributions. The pdf of a g-component STNMIX model is

  f(y_j | Θ) = Σ_{i=1}^g w_i ψ(y_j | ξ_i, σ_i², λ_i, ν_i),   (13)

where the w_i are mixing proportions, constrained to be positive and to satisfy Σ_{i=1}^g w_i = 1, ψ(y_j | ξ_i, σ_i², λ_i, ν_i) is the STN density defined in (1), and Θ = (w₁, ..., w_{g−1}, θ₁, ..., θ_g) represents all unknown parameters, with component vector θ_i = (ξ_i, σ_i², λ_i, ν_i).

To pose this mixture model as an incomplete-data problem, we introduce allocation variables Z_j = (Z_{1j}, ..., Z_{gj})ᵀ, j = 1, ..., n, whose entries are binary, with Z_ij = 1 if Y_j belongs to group i and Z_ij = 0 otherwise, satisfying Σ_{i=1}^g Z_ij = 1. This implies that Z_j is a multinomial random vector with one trial and cell probabilities w₁, ..., w_g, denoted by Z_j ∼ M(1; w₁, ..., w_g). It follows from (3) that a hierarchical formulation of (13) can be represented by

  Y_j | (γ_j, τ_j, Z_ij = 1) ∼ N( ξ_i + σ_iλ_iγ_j/(τ_j + λ_i²), σ_i²/(τ_j + λ_i²) ),
  γ_j | (τ_j, Z_ij = 1) ∼ TN( 0, (τ_j + λ_i²)/τ_j; (0, ∞) ),   (14)
  τ_j | (Z_ij = 1) ∼ Γ(ν_i/2, ν_i/2),
  Z_j ∼ M(1; w₁, ..., w_g).

As an immediate consequence, we establish the following proposition, which is crucial for evaluating some conditional expectations in the proposed EM-type algorithms.

Proposition 1 Given the hierarchical representation (14), we have the following (the symbol "| ···" denotes conditioning on Z_ij = 1 and Y_j = y_j):

(a) The conditional expectations of the γ_j^k are given by

  E(γ_j | ···) = λ_i u_ij + φ(λ_i u_ij)/Φ(λ_i u_ij),
  E(γ_j^k | ···) = (k − 1) E(γ_j^(k−2) | ···) + λ_i u_ij E(γ_j^(k−1) | ···)   for k ≥ 2.

(b) Some specific conditional expectations related to functions of τ_j are

  E(τ_j^k | ···) = 2^k Γ((ν_i + 1)/2 + k) / [ (ν_i + u_ij²)^k Γ((ν_i + 1)/2) ],
  E(log τ_j | ···) = DG((ν_i + 1)/2) − log( (ν_i + u_ij²)/2 ),

where u_ij = (y_j − ξ_i)/σ_i and DG(x) = (d/dx) log Γ(x) is the digamma function.

(c) The conditional expectation of Z_ij given Y_j = y_j is

  E(Z_ij | Y_j = y_j) = Pr(Z_ij = 1 | Y_j = y_j) = w_i ψ(y_j | ξ_i, σ_i², λ_i, ν_i) / f(y_j | Θ).

Proof The proof is straightforward and hence is omitted.

3.2 Computational aspects

In this subsection, we describe in detail how to exploit two extensions of the EM algorithm (Dempster et al. 1977), the Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin 1993) and the Expectation Conditional Maximization Either (ECME) algorithm (Liu and Rubin 1994), for ML estimation of the STNMIX model. The basic idea of ECM is that the maximization (M) step of EM is replaced by several computationally simple conditional maximization (CM) steps, while ECME is a simple modification of ECM in which some CM-steps maximize the constrained actual log-likelihood function; it has been shown to have a faster convergence rate. A key feature of these two EM-type algorithms is that they preserve the stability of the EM algorithm through their monotone convergence.

For notational convenience, let y = (y₁, ..., y_n), γ = (γ₁, ..., γ_n), τ = (τ₁, ..., τ_n) and Z = (Z₁, ..., Z_n). According to (14), the complete data log-likelihood function of Θ given (γ, τ, Z, y), aside from additive constants, is

  ℓ_c(Θ | y, γ, τ, Z) = Σ_{j=1}^n Σ_{i=1}^g Z_ij [ log w_i − (1/2) log σ_i²
      − (1/2)( τ_j (y_j − ξ_i)²/σ_i² + (γ_j − β_i(y_j − ξ_i))² )
      + (ν_i/2) log(ν_i/2) − log Γ(ν_i/2) + (ν_i/2)(log τ_j − τ_j) ],   (15)

where β_i = λ_i/σ_i is a reparameterized parameter. Hence, the expected value of the complete data log-likelihood (15) evaluated with Θ = Θ̂^(h), which we shall call the Q-function, is

  Q(Θ | Θ̂^(h)) = E[ ℓ_c(Θ | y, γ, τ, Z) | y, Θ̂^(h) ].   (16)

To evaluate the Q-function, the necessary conditional expectations include ẑ_ij^(h) = E(Z_ij | y_j, Θ̂^(h)), τ̂_ij^(h) = E(τ_j | y_j, Z_ij = 1, Θ̂^(h)), κ̂_ij^(h) = E(log τ_j | y_j, Z_ij = 1, Θ̂^(h)), γ̂_1ij^(h) = E(γ_j | y_j, Z_ij = 1, Θ̂^(h)) and γ̂_2ij^(h) = E(γ_j² | y_j, Z_ij = 1, Θ̂^(h)). By Proposition 1, we have

  ẑ_ij^(h) = ŵ_i^(h) ψ(y_j | θ̂_i^(h)) / f(y_j | Θ̂^(h)),
  τ̂_ij^(h) = (ν̂_i^(h) + 1) / (ν̂_i^(h) + (û_ij^(h))²),
  κ̂_ij^(h) = DG( (ν̂_i^(h) + 1)/2 ) − log( (ν̂_i^(h) + (û_ij^(h))²)/2 ),   (17)
  γ̂_1ij^(h) = λ̂_i^(h) û_ij^(h) + φ(λ̂_i^(h) û_ij^(h)) / Φ(λ̂_i^(h) û_ij^(h)),
  γ̂_2ij^(h) = 1 + λ̂_i^(h) û_ij^(h) γ̂_1ij^(h),

where û_ij^(h) = (y_j − ξ̂_i^(h)) / σ̂_i^(h). Therefore, the Q-function (16) can be written as

  Q(Θ | Θ̂^(h)) = Σ_{j=1}^n Σ_{i=1}^g ẑ_ij^(h) [ log w_i − (1/2) log σ_i²
      − (1/(2σ_i²)) τ̂_ij^(h) (y_j − ξ_i)²
      − (1/2)( γ̂_2ij^(h) − 2β_i γ̂_1ij^(h) (y_j − ξ_i) + β_i²(y_j − ξ_i)² )
      + (ν_i/2) log(ν_i/2) − log Γ(ν_i/2) + (ν_i/2)( κ̂_ij^(h) − τ̂_ij^(h) ) ].   (18)

In summary, the implementation of the ECM algorithm proceeds as follows:

E-step: Given Θ = Θ̂^(h), compute ẑ_ij^(h), τ̂_ij^(h), κ̂_ij^(h), γ̂_1ij^(h) and γ̂_2ij^(h) for i = 1, ..., g and j = 1, ..., n.

CM-step 1: Calculate ŵ_i^(h+1) = n̂_i^(h)/n, where n̂_i^(h) = Σ_{j=1}^n ẑ_ij^(h).

CM-step 2: Update ξ̂_i^(h) by maximizing (18) over ξ_i, which leads to

  ξ̂_i^(h+1) = [ Σ_{j=1}^n ẑ_ij^(h) τ̂_ij^(h) y_j + (λ̂_i^(h))² Σ_{j=1}^n ẑ_ij^(h) y_j − σ̂_i^(h) λ̂_i^(h) Σ_{j=1}^n ẑ_ij^(h) γ̂_1ij^(h) ]
             / [ Σ_{j=1}^n ẑ_ij^(h) τ̂_ij^(h) + (λ̂_i^(h))² n̂_i^(h) ].

CM-step 3: Fix ξ_i = ξ̂_i^(h+1) and update (σ̂_i^(h))² by maximizing (18) over σ_i², which gives

  (σ̂_i^(h+1))² = (1/n̂_i^(h)) Σ_{j=1}^n ẑ_ij^(h) τ̂_ij^(h) (y_j − ξ̂_i^(h+1))².

CM-step 4: Fix ξ_i = ξ̂_i^(h+1) and σ_i² = (σ̂_i^(h+1))², and update β̂_i^(h) by maximizing (18) over β_i, which leads to

  β̂_i^(h+1) = Σ_{j=1}^n ẑ_ij^(h) γ̂_1ij^(h) (y_j − ξ̂_i^(h+1)) / Σ_{j=1}^n ẑ_ij^(h) (y_j − ξ̂_i^(h+1))².

Accordingly, the update of λ̂_i^(h) is

  λ̂_i^(h+1) = σ̂_i^(h+1) β̂_i^(h+1) = Σ_{j=1}^n ẑ_ij^(h) γ̂_1ij^(h) û_ij^(h+1) / Σ_{j=1}^n ẑ_ij^(h) (û_ij^(h+1))²,

where û_ij^(h+1) = (y_j − ξ̂_i^(h+1)) / σ̂_i^(h+1).

CM-step 5: Update ν̂_i^(h) by maximizing (18) over ν_i, which is equivalent to finding the root of

  log(ν_i/2) + 1 − DG(ν_i/2) + (1/n̂_i^(h)) Σ_{j=1}^n ẑ_ij^(h) ( κ̂_ij^(h) − τ̂_ij^(h) ) = 0.

Note that CM-step 5 requires a one-dimensional search for the root in ν_i, which can be easily carried out with the 'uniroot' function built into R. In the call to 'uniroot', one can optionally specify an appropriately short interval to reduce the search time, for example lower=1 and upper=100.

As pointed out by Liu and Rubin (1994), the one-dimensional search involved in CM-step 5 can be very slow in certain situations. To circumvent this limitation, one may use the more efficient ECME algorithm, in which some CM-steps of the ECM algorithm are replaced by steps that maximize a restricted actual log-likelihood function, called 'CML-steps', without sacrificing simplicity. If the dfs are assumed to be identical, say ν₁ = ··· = ν_g = ν, we suggest replacing CM-step 5 with the following CML-step:

CML-step 5: Update ν̂^(h) by

  ν̂^(h+1) = argmax_ν Σ_{j=1}^n log( Σ_{i=1}^g ŵ_i^(h+1) ψ(y_j | ξ̂_i^(h+1), (σ̂_i^(h+1))², λ̂_i^(h+1), ν) ).
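As an illustration of CM-step 5, the following minimal R sketch (the wrapper cm_step5_nu and its argument names are ours) performs the one-dimensional root search with 'uniroot':

    # Root of the CM-step 5 equation for component i, given the E-step quantities
    # zhat = zhat_ij, kappahat = kappahat_ij, tauhat = tauhat_ij (vectors over j)
    cm_step5_nu <- function(zhat, kappahat, tauhat, lower = 1, upper = 100) {
      lhs <- function(nu)
        log(nu / 2) + 1 - digamma(nu / 2) +
          sum(zhat * (kappahat - tauhat)) / sum(zhat)
      uniroot(lhs, lower = lower, upper = upper)$root
    }

The short interval (1, 100) mirrors the suggestion above; if the left-hand side does not change sign on this interval, 'uniroot' signals an error and the interval has to be widened.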

The E-step and CM/CML-steps are alternated repeatedly until a suitable convergence rule is satisfied, e.g., the Aitken acceleration-based stopping criterion |ℓ^(h+1) − ℓ_∞^(h+1)| < ε, where ℓ^(h+1) is the observed log-likelihood evaluated at Θ̂^(h+1), ℓ_∞^(h+1) is the asymptotic estimate of the log-likelihood at iteration h + 1 (McLachlan and Krishnan 2008, Chap. 4.9) and ε is the desired tolerance. For the numerical analyses in Sect. 4 and Sect. 5, a default value of ε = 10⁻⁶ was used to terminate the iterations. Note that the above procedure is also applicable in the simple case (g = 1) by treating Z_ij = 1.
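For completeness, a minimal R sketch of this stopping rule (the function name is ours; loglik is assumed to hold the observed log-likelihoods of three successive iterations), following the Aitken acceleration formula in McLachlan and Krishnan (2008), is:

    # Aitken-acceleration stopping rule: TRUE when |l^(h+1) - l_inf^(h+1)| < eps,
    # where loglik = c(l^(h-1), l^(h), l^(h+1))
    aitken_converged <- function(loglik, eps = 1e-6) {
      c_h   <- (loglik[3] - loglik[2]) / (loglik[2] - loglik[1])  # Aitken ratio
      l_inf <- loglik[2] + (loglik[3] - loglik[2]) / (1 - c_h)    # asymptotic estimate
      abs(loglik[3] - l_inf) < eps
    }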


3.3 Notes on implementation

We address some practical issues raised in employing the above procedures. As with other iterative optimization algorithms, one needs good starting values of Θ to achieve convergence swiftly. A simple way of automatically generating a selection of initial values is described below.

1. Perform the K-means algorithm initialized with randomly chosen cluster centers.
2. Initialize the zero-one membership indicators Ẑ_j^(0) = {ẑ_ij^(0)}_{i=1}^g according to the K-means clustering results.
3. Initialize the mixing proportions, component locations and component scale variances as follows:

  ŵ_i^(0) = Σ_{j=1}^n ẑ_ij^(0) / n,
  ξ̂_i^(0) = Σ_{j=1}^n ẑ_ij^(0) y_j / Σ_{j=1}^n ẑ_ij^(0),
  (σ̂_i^(0))² = Σ_{j=1}^n ẑ_ij^(0) (y_j − ξ̂_i^(0))² / Σ_{j=1}^n ẑ_ij^(0).

4. Initialize λ̂_1^(0) = ··· = λ̂_g^(0) = 0 and use a relatively small initial value for the ν_i, say ν̂_i^(0) = 10 for all i.

The EM-type algorithms do not directly provide the asymptotic covariance matrix of the estimates. An information-based method is considered for evaluating the standard error estimates; the details are sketched in the Appendix.

In practice, the log-likelihood function may have multiple modes, so the algorithm needs to be initialized from a variety of starting values. Since the K-means method does not guarantee a unique clustering, one can perform the K-means clustering with various randomly chosen cluster centers. The global optimum can then be determined directly by comparing the associated converged log-likelihood values.

We adopt the Bayesian information criterion (BIC; Schwarz 1978) for selecting the number of components in mixture models. Herein, the form of BIC is

  BIC = m log n − 2ℓ_max,

where ℓ_max is the maximized log-likelihood and m is the number of free parameters in the model. Accordingly, models with small BIC scores are preferred. Under certain regularity conditions, Keribin (2000) presented a theoretical justification for the consistency of BIC in determining the optimal number of components of a mixture model. Fraley and Raftery (1998, 2002) and, more recently, McNicholas and Murphy (2008) have shown the effectiveness of BIC in selecting the number of components for Gaussian mixture models.

4 An illustration

Flow cytometry is a popular biotechnological platform used in both clinical and research applications for rapid single-cell investigation of surface and intracellular markers. It is a powerful laser-based technique for scanning, profiling and sorting microscopic particles flowing in a stream of water. Typically, it uses multiple fluorescent-dye-conjugated antibodies to study, in parallel channels, the expression of different proteins on the surface of, or within, each cell of a given sample, in the form of corresponding fluorescent intensities. Flow cytometry is extensively used in a large number of biomedical applications such as molecular and cellular biology (e.g., to measure DNA content), hematology (e.g., to test leukemia samples) and immunology (e.g., to conduct CD4 tests for HIV/AIDS).

Very recently, it was shown by Pyne et al. (2009) and Frühwirth-Schnatter and Pyne (2010) that flow cytometric data are ideally suited for multimodal asymmetric finite mixture modelling. These studies demonstrated how to computationally mimic the biologist's view of a flow cytometric sample, consisting of a mixture of non-Gaussian cell populations, with mathematical models defined by finite mixtures of skew normal and skew t distributions. Each distinct cell population can thus be modelled by its corresponding distribution and proportion (or size). Importantly, such a finite mixture approach can automate the traditional practice of identifying cell populations in a sample manually, which is often labor-intensive, subjective and non-reproducible. In this regard, the same studies noted that the skewed nature of flow cytometric cell populations makes accurate model selection problematic. In the present study, we tackle this challenge with our new STNMIX model, which outperformed the NMIX, TMIX, SNMIX and STMIX models in terms of precision of flow cytometric data modelling, and by performing accurate model selection with the help of greedy learning.

Flow cytometry data are generally stored in the binary Flow Cytometry Standard (.fcs) format. Glynn (2006) developed the FCSExtract utility to convert .fcs data to ASCII text format and also provided a working data set 'CC4_067_BM.fcs' consisting of ten attributes measured on 5,634 cells. To illustrate our univariate mixture modelling approach, we analyze the data from the channel PE.Cy5, corresponding to the attribute measured with the fluorescent dye Cy5. As specified by the scale transformation (y − a)/(b − a) in the BioConductor package flowCore (Hahne et al. 2009), we preprocessed the data by dividing each observation by the scale factor b of 1,000 and using the shift factor a of 0. We consider model selection by fitting a series of mixture models, namely the NMIX, SNMIX, TMIX, STMIX and STNMIX models, to this data set with g = 2–7 components.
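As an illustration of this workflow, here is a minimal R sketch; pe_cy5_raw stands for the extracted PE.Cy5 measurements, the K-means initialization implements steps 1–4 of Sect. 3.3, and fit_stnmix_ecm is a hypothetical placeholder for an ECM/ECME routine implementing Sect. 3.2:

    # Preprocessing as described above: shift factor a = 0, scale factor b = 1000
    y <- pe_cy5_raw / 1000                             # pe_cy5_raw: assumed input vector

    # Steps 1-4 of Sect. 3.3: K-means-based starting values for a g-component STNMIX
    init_stnmix <- function(y, g) {
      km <- kmeans(y, centers = g)                     # step 1
      z  <- outer(km$cluster, seq_len(g), "==") * 1    # step 2: zero-one indicators
      list(w      = colMeans(z),                       # step 3: mixing proportions
           xi     = as.vector(km$centers),             # step 3: component locations
           sigma2 = sapply(seq_len(g), function(i)     # step 3: component scales
                      mean((y[km$cluster == i] - km$centers[i])^2)),
           lambda = rep(0, g),                         # step 4
           nu     = rep(10, g))                        # step 4
    }

    # BIC scan over g = 2, ..., 7 (fit_stnmix_ecm is assumed, not shown here):
    # bic <- sapply(2:7, function(g) {
    #   fit <- fit_stnmix_ecm(y, init_stnmix(y, g))
    #   fit$m * log(length(y)) - 2 * fit$logLik        # BIC = m log n - 2 l_max
    # })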


Table 2 ML estimation results for fitting different mixture models to the PE.Cy5 data. Results for the TMIX, STMIX and STNMIX models with unequal degrees of freedom are shown in parentheses. The smallest BIC value within each family is marked with '†'; the best chosen model is indicated by '∗'

  Model    g    ℓ_max                      m          BIC
  NMIX     2    −4458.171                  5          8959.524
  NMIX     3    −4080.938                  8          8230.969
  NMIX     4    −4030.673                  11         8156.347
  NMIX     5    −4014.003                  14         8148.918†
  NMIX     6    −4010.034                  17         8166.890
  NMIX     7    −4009.344                  20         8191.419
  TMIX     2    −4280.360 (−4270.741)      6 (7)      8612.539 (8601.937)
  TMIX     3    −4056.998 (−4037.870)      9 (11)     8191.726 (8170.743)
  TMIX     4    −4031.355 (−4026.758)      12 (15)    8166.348 (8183.065)
  TMIX     5    −4014.841 (−4013.507)      15 (19)    8159.231† (8191.108)
  TMIX     6    −4010.268 (−4009.619)      18 (23)    8175.995 (8217.879)
  TMIX     7    −4004.222 (−4003.601)      21 (27)    8189.813 (8240.390)
  SNMIX    2    −4242.546                  7          8545.549
  SNMIX    3    −4041.599                  11         8178.199
  SNMIX    4    −4014.494                  15         8158.536†
  SNMIX    5    −4010.717                  19         8185.528
  SNMIX    6    −4007.368                  23         8213.377
  SNMIX    7    −4005.381                  27         8243.950
  STMIX    2    −4103.167 (−4101.475)      8 (9)      8275.426 (8280.678)
  STMIX    3    −4021.244 (−4017.311)      12 (14)    8146.127† (8155.534)
  STMIX    4    −4011.890 (−4010.503)      16 (19)    8161.965 (8185.101)
  STMIX    5    −4010.282 (−4007.208)      20 (24)    8193.295 (8221.693)
  STMIX    6    −4006.730 (−4006.422)      24 (29)    8220.738 (8263.305)
  STMIX    7    −4002.771 (−3998.183)      28 (34)    8247.366 (8290.009)
  STNMIX   2    −4111.852 (−4102.415)      8 (9)      8292.797 (8282.559)
  STNMIX   3∗   −4015.449 (−4014.084)      12 (14)    8134.536† (8149.081)
  STNMIX   4    −4010.478 (−4009.734)      16 (19)    8159.142 (8183.564)
  STNMIX   5    −4006.375 (−4005.542)      20 (24)    8185.482 (8218.362)
  STNMIX   6    −4005.687 (−4005.013)      24 (29)    8218.651 (8260.487)
  STNMIX   7    −4004.679 (−4002.594)      28 (34)    8251.182 (8298.831)

Consequently, the latter three models are fitted under scenarios with both equal and unequal dfs. We ran the ECME algorithms as described in Sect. 3 and in Lin et al. (2007a, 2007b). For each run, thirty different initializations based on the K-means clustering technique were used, and the algorithm was terminated by the stopping criterion described in Sect. 3.2. When g is large, the algorithm may stop somewhat early, so small perturbations in the converged log-likelihoods may remain. The log-likelihood maxima and the numbers of parameters, together with the BIC values, are listed in Table 2. Note that a smaller value of BIC indicates a better fitted model. It is clearly seen that the 3-component STNMIX model with equal dfs has the best fit, followed by the 3-component STMIX model with equal dfs. Both models effectively capture the actual number of underlying cell populations, while the NMIX, TMIX and SNMIX models need more than three components to capture the skewness as well as the heavy tails in the data.

Table 3 compares the ML estimates, along with the associated standard errors, for the best-fitting 3-component STNMIX model with the corresponding values for the other four competing 3-component mixture models. The estimated dfs of the STMIX and STNMIX models are considerably less than 10 for all components, signifying the validity of using heavy-tailed component densities. The estimated skewness parameters in the STMIX model are all positive and somewhat significant. As for the STNMIX model, two of the components seem to be only weakly skewed, as the standard errors of λ̂₂ and λ̂₃ are relatively large. One possible explanation of this phenomenon is that in the STMIX model the fat-tailed behavior is partly characterized by the skewness parameters, yielding larger estimates for them.

To verify this further, we carried out the Kolmogorov–Smirnov (K-S) goodness-of-fit test to compare the quality of fit among the three skew modelling options. The resulting K-S distances are 0.010, 0.007 and 0.006 for the SNMIX, STMIX and STNMIX models, respectively.
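The K-S distance used here is the usual sup-norm distance between the empirical cdf and the fitted mixture cdf. A minimal R sketch (Fmix is assumed to be the vectorized cdf of the fitted mixture, obtainable by numerically integrating the fitted density):

    # Kolmogorov-Smirnov distance between the data y and a fitted cdf Fmix
    ks_distance <- function(y, Fmix) {
      ys <- sort(y); n <- length(ys); Fy <- Fmix(ys)
      max(pmax(abs(seq_len(n) / n - Fy), abs((seq_len(n) - 1) / n - Fy)))
    }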


Table 3 Summary results from fitting various three-component mixture models to the PE.Cy5 data

  Parameter   NMIX             TMIX             SNMIX            STMIX            STNMIX
              Est      Sd      Est      Sd      Est      Sd      Est      Sd      Est      Sd
  w1          0.240    0.007   0.252    0.007   0.287    0.009   0.297    0.010   0.301    0.013
  w2          0.403    0.013   0.396    0.018   0.340    0.020   0.307    0.039   0.304    0.091
  w3          0.357    –       0.352    –       0.373    –       0.396    –       0.395    –
  ξ1          0.349    0.005   0.357    0.005   0.177    0.012   0.180    0.011   0.193    0.017
  ξ2          1.504    0.025   1.520    0.025   1.067    0.030   1.172    0.057   1.274    0.140
  ξ3          1.845    0.005   1.854    0.006   1.753    0.028   1.788    0.032   1.804    0.053
  σ1          0.150    0.004   0.147    0.005   0.290    0.020   0.281    0.021   0.261    0.029
  σ2          0.549    0.012   0.463    0.018   0.709    0.020   0.510    0.066   0.389    0.128
  σ3          0.153    0.005   0.144    0.006   0.185    0.018   0.163    0.013   0.151    0.011
  λ1          –        –       –        –       2.659    0.410   2.732    0.429   2.156    0.506
  λ2          –        –       –        –       3.062    0.722   1.914    0.686   0.969    0.934
  λ3          –        –       –        –       0.950    0.377   0.637    0.385   0.477    0.615
  ν           –        –       10.325   2.004   –        –       6.763    2.377   4.451    2.030

Fig. 2 Histogram of the PE.Cy5 data and three fitted STNMIX densities (g = 1–3). The dashed lines indicate the true grouping of fitted STN densities

Given that a smaller K-S distance corresponds to a better resemblance between the experimental data and the fitted distribution, we conclude that the most precise modelling of the count and asymmetry for this flow cytometric data set is achieved by the STNMIX model.

Following a strategy similar to Vlassis and Likas (2002), we devised a greedy EM algorithm for learning STN mixtures, which is quite effective in selecting an appropriate number of components and overcoming the local-maxima problem. The detailed implementation is sketched in a longer version of this paper, which is available from the authors. The greedy EM procedure determines that g = 3 is the favored choice, consistent with the selection by the information-based criterion in Table 2. Figure 2 shows


the fitted densities for g = 1–3 components. Based on the graphical visualization, it appears that the fitted STNMIX density captures the shape of the histogram very well.

5 Simulation study

In this section, we undertake a small simulation study to compare the fitting performance and the computational burden of the existing ST model with those of the proposed STN model. Lin et al. (2007a) developed an efficient ECME algorithm for the ML fitting of ST distributions. A comparison of some characterizations of the two models is given in Table 4. Notice that both reduce to the SN model as ν → ∞. As shown by Lin et al. (2007a, 2007b), the SN distribution has a convenient hierarchical representation:

  Y | γ ∼ N( ξ + δγ, (1 − δ²)σ² ),   γ ∼ TN(0, σ²; (0, ∞)),   (19)

where δ = λ/√(1 + λ²). To make a fair comparison, we generated synthetic data from the SN distribution, with presumed parameters ξ = 1, σ = 2 and λ = 3, producing highly skewed observations. In the simulation, the values of n considered were 250, 500, 1,000, 5,000 and 10,000. To further create data with fat tails, we added 2% extreme values, generated from a uniform distribution over the interval [10, 20], to each simulated data set, so the simulation sample sizes become 255, 510, 1,020, 5,100 and 10,200 after appending the artificial outliers. Simulations were run with a total of 500 replications for each sample size.

For the experimental study, each simulated data set was fitted via the ECME algorithm under the ST and STN scenarios with the same initial value, namely ξ̂^(0) = 1, σ̂^(0) = 2, λ̂^(0) = 3 and ν̂^(0) = 4. In each trial, we recorded the converged log-likelihoods, denoted by ℓ̂_ST and ℓ̂_STN, and the consumed CPU times (in seconds), denoted by CT_ST and CT_STN. All computations were carried out with R 2.9.2 in a win32 environment on a desktop PC with a 3.00 GHz Intel Core(TM)2 Duo CPU and 4.0 GB RAM. Note that the numbers of iterations are not directly comparable, because each iteration of the ECME algorithm involves different numbers of inner iterations for computing the update of ν.

Table 5 presents the average log-likelihood values and the CPU times for the various sample sizes. We found that both models produce comparable average log-likelihood values; in particular, the STN model performs significantly better when the sample size becomes large (n > 1,000). Moreover, the average CPU time in the STN scenario is substantially reduced. The above simulation study was also carried out for various other sets of parameters, all of which gave similar results.
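A minimal R sketch of this data-generating scheme (the function names are ours), using representation (19) plus the 2% uniform outliers:

    # SN(xi, sigma^2, lambda) draws via the hierarchical representation (19)
    rsn <- function(n, xi, sigma, lambda) {
      delta <- lambda / sqrt(1 + lambda^2)
      gam   <- abs(rnorm(n, mean = 0, sd = sigma))   # gamma ~ TN(0, sigma^2; (0, Inf))
      xi + delta * gam + sqrt(1 - delta^2) * sigma * rnorm(n)
    }

    # One replication: n clean SN(1, 4, 3) observations plus 2% outliers on [10, 20]
    make_sim_data <- function(n, xi = 1, sigma = 2, lambda = 3)
      c(rsn(n, xi, sigma, lambda), runif(round(0.02 * n), min = 10, max = 20))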

Table 4 Comparison of some characterizations of the STN and ST distributions

  STN(ξ, σ², λ, ν)
    Stochastic representation:   Y = ξ + σ ( λ|V|/√(τ(τ + λ²)) + U/√(τ + λ²) )
    Hierarchical representation: Y | τ ∼ SN(ξ, σ²/τ, λ/√τ)
    Density:   (2/σ) t(u | ν) Φ(λu)
    Mean:      ξ + √(ν/π) σ aν η₁₁
    Variance:  σ² [ ν/(ν − 2) − (ν/π) aν² η₁₁² ]
    Skewness:  aν [ 2aν²η₁₁³ − 3πη₁₁/(ν − 2) + π(η₁₃/ν + 2η₃₁/(ν − 3)) ] / [ π/(ν − 2) − aν²η₁₁² ]^(3/2)
    Kurtosis:  −3 + 2π [ 3π(ν − 3)/((ν − 4)(ν − 2)²) − 2aν²η₁₁(η₁₃/ν + 2η₃₁/(ν − 3)) ] / [ π/(ν − 2) − aν²η₁₁² ]²

  ST(ξ, σ², λ, ν)
    Stochastic representation:   Y = ξ + (σ/√τ) ( λ|V|/√(1 + λ²) + U/√(1 + λ²) )
    Hierarchical representation: Y | τ ∼ SN(ξ, σ²/τ, λ)
    Density:   (2/σ) t(u | ν) T( λu √((ν + 1)/(ν + u²)) | ν + 1 )
    Mean:      ξ + √(ν/π) σ aν δ
    Variance:  σ² [ ν/(ν − 2) − (ν/π) aν² δ² ]
    Skewness:  aν δ [ 2aν²δ² − πδ²/(ν − 3) + 3π/((ν − 3)(ν − 2)) ] / [ π/(ν − 2) − aν²δ² ]^(3/2)
    Kurtosis:  −3 + 2π [ 3π(ν − 3)/((ν − 4)(ν − 2)²) − (2aν²/(ν − 3)) δ²(3 − δ²) ] / [ π/(ν − 2) − aν²δ² ]²

  U and V are independent standard normal random variables, τ ∼ Γ(ν/2, ν/2) with mean 1, T(· | ν) is the cdf of the Student's t distribution with df ν, aν = Γ((ν − 1)/2)/Γ(ν/2), u = (y − ξ)/σ, δ = λ/√(1 + λ²), and ηst = ∫₀^∞ [λ/(τ + λ²)^(t/2)] g(τ | (ν − s)/2, ν/2) dτ


Table 5 Comparison of average log-likelihood ℓ(θ) and CPU time (CT, in seconds) for the ST and STN models under various sample sizes

  Model   n = 255             n = 510             n = 1020             n = 5100             n = 10200
          ℓ(θ)       CT       ℓ(θ)       CT       ℓ(θ)        CT       ℓ(θ)        CT       ℓ(θ)         CT
  STN     −454.283   0.370    −909.864   0.681    −1820.673   1.363    −9107.609   9.141    −18218.840   17.551
  ST      −456.262   0.937    −913.956   1.880    −1828.824   3.999    −9149.175   22.719   −18302.190   49.548

Fig. 3 Improvement of converged log-likelihoods and relative computational cost of the STN model over the ST counterpart for various sample sizes. (a) Box plots of the scaled log-likelihood improvements, (b) relative improvement percentages in CPU time

As recommended by an anonymous referee, an interesting comparison can be made by generating data from the ST and STN models in turn and examining how often we can recognize the true model. To conduct this experimental study, we generated 500 samples of sizes n = 100, 300, 500 and 1,000 from the ST and STN distributions with the same parameter setting. The presumed values for the location, scale variance and skewness parameters are ξ = 0, σ² = 1 and λ = 4, respectively. For the dfs, we took a low value (ν = 3), yielding a heavy-tailed distribution, and a high value (ν = 50), approaching the SN distribution. For model comparison, each simulated data set was fitted twice, under the ST and STN scenarios, starting from ten different initializations to avoid getting stuck in local traps. In each trial, we compared the performance of the two models according to the best log-likelihoods obtained from the ten fits. Since the two competing models have the same number of parameters, the log-likelihood can be regarded as a reasonable model selection criterion; a simulated data set (ignoring the known true model) is then assigned to whichever model has the larger log-likelihood. Table 6 presents the proportions of selecting the true model based on the 500 replications. The table shows that the STN model is more recognizable than the ST model in all cases.

Table 6 Proportion of selecting the true model (%)

  True model   ν    n = 100   n = 300   n = 500   n = 1000
  STN          3    78.4      81.2      87.2      92.6
  ST           3    38.2      72.2      75.2      90.0
  STN          50   63.6      58.6      58.4      56.6
  ST           50   37.2      42.6      45.4      46.4

Interestingly, when ν is small, the probability of selecting the true model increases steadily with the sample size n. In contrast, the two models are not easy to distinguish when ν is large.

6 Conclusion

We have proposed a new family of mixture models based on STN distributions, called the STNMIX model, which accommodates multimodality, asymmetry and heavy tails jointly and offers greater flexibility than the STMIX counterpart. We have described a four-level hierarchical formulation for the STNMIX model and presented


efficient EM-type algorithms for parameter estimation within a complete-data framework. Experimental results show that the STNMIX model performs better than the other competitors. So far, the present application is limited to data with univariate outcomes. There is a growing literature on multivariate mixtures of non-elliptically contoured distributions, such as the recent proposals of Lin (2009, 2010), Wang et al. (2009) and Karlis and Santourian (2009). The methodology as well as the EM-type algorithms can be extended to a multivariate version of the STNMIX model and will be reported in a follow-up paper.

Acknowledgements The authors would like to express their deepest gratitude to the Chief Editor, the Associate Editor and two anonymous referees for their valuable comments and suggestions that greatly improved this paper. This research was supported by the National Science Council of Taiwan (Grant No. NSC97-2118-M-005-001-MY2).

Appendix: Estimation of standard errors

We follow the information-based method exploited by Basford et al. (1997) to calculate the asymptotic covariance matrix of the ML estimates. The empirical information matrix is defined as

  I_e(Θ | y) = Σ_{j=1}^n s(y_j | Θ) sᵀ(y_j | Θ) − n⁻¹ S(y | Θ) Sᵀ(y | Θ),   (20)

where S(y | Θ) = Σ_{j=1}^n s(y_j | Θ). Following Louis (1982), the individual score can be determined as

  s(y_j | Θ) = ∂ log f(y_j | Θ)/∂Θ = E( ∂ℓ_cj(Θ | y_j, Z_j, γ_j, τ_j)/∂Θ | y_j, Θ ),

where ℓ_cj(Θ | y_j, Z_j, γ_j, τ_j) is the complete data log-likelihood formed from the single observation y_j. Substituting the ML estimates Θ̂ into Θ, (20) reduces to

  I_e(Θ̂ | y) = Σ_{j=1}^n ŝ_j ŝ_jᵀ,   (21)

where ŝ_j is an individual score vector containing the elements of (ŝ_{j,w₁}, ..., ŝ_{j,w_{g−1}}, ŝ_{j,ξ₁}, ..., ŝ_{j,ξ_g}, ŝ_{j,σ₁}, ..., ŝ_{j,σ_g}, ŝ_{j,λ₁}, ..., ŝ_{j,λ_g}, ŝ_{j,ν₁}, ..., ŝ_{j,ν_g})ᵀ. Explicit expressions for the elements of ŝ_j are

  ŝ_{j,w_r} = ẑ_rj/ŵ_r − ẑ_gj/ŵ_g   (r = 1, ..., g − 1),
  ŝ_{j,ξ_i} = (ẑ_ij/σ̂_i) [ (τ̂_ij + λ̂_i²) û_ij − λ̂_i γ̂_1ij ],
  ŝ_{j,σ_i} = (ẑ_ij/σ̂_i) [ (τ̂_ij + λ̂_i²) û_ij² − λ̂_i γ̂_1ij û_ij − 1 ],
  ŝ_{j,λ_i} = ẑ_ij û_ij ( γ̂_1ij − λ̂_i û_ij ),
  ŝ_{j,ν_i} = (ẑ_ij/2) [ log(ν̂_i/2) + 1 − DG(ν̂_i/2) + κ̂_ij − τ̂_ij ],

where ẑ_ij, û_ij, γ̂_1ij, κ̂_ij and τ̂_ij are the corresponding quantities of (17) evaluated at Θ̂. In the case ν₁ = ··· = ν_g = ν, this leads to ŝ_{j,ν} = Σ_{i=1}^g ŝ_{j,ν_i}. Standard errors of Θ̂ are extracted as the square roots of the diagonal elements of the inverse of (21).
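In matrix form, (21) and the resulting standard errors take one line of R each; a minimal sketch (assuming S is the n × p matrix whose j-th row is the score vector ŝ_jᵀ evaluated at the MLE):

    # Empirical information matrix (21) and standard errors of the ML estimates
    se_from_scores <- function(S) {
      I_e <- crossprod(S)         # t(S) %*% S = sum_j s_j s_j^T
      sqrt(diag(solve(I_e)))      # square roots of the diagonal of the inverse
    }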

References

Azzalini, A.: The skew-normal distribution and related multivariate families (with discussion). Scand. J. Stat. 32, 159–188 (2005)
Azzalini, A., Capitanio, A.: Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. J. R. Stat. Soc. B 65, 367–389 (2003)
Barndorff-Nielsen, O.E.: Normal inverse Gaussian distributions and stochastic volatility modelling. Scand. J. Stat. 24, 1–13 (1997)
Basford, K.E., Greenway, D.R., McLachlan, G.J., Peel, D.: Standard errors of fitted means under normal mixture. Comput. Stat. 12, 1–17 (1997)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Singapore (2006)
Cabral, C.R.B., Bolfarine, H., Pereira, J.R.G.: Bayesian density estimation using skew Student-t-normal mixtures. Comput. Stat. Data Anal. 52, 5075–5090 (2008)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–38 (1977)
Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41, 578–588 (1998)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, New York (2006)
Frühwirth-Schnatter, S., Pyne, S.: Bayesian inference for finite mixtures of univariate and multivariate skew normal and skew-t distributions. Biostatistics 11, 317–336 (2010)
Glynn, E.F.: FCSExtract Utility. Stowers Institute for Medical Research. Available online at: http://research.stowers-institute.org/efg/ScientificSoftware/Utility/FCSExtract/ (2006)
Gómez, H.W., Venegas, O., Bolfarine, H.: Skew-symmetric distributions generated by the distribution function of the normal distribution. Environmetrics 18, 395–407 (2007)
Hahne, F., LeMeur, N., Brinkman, R.R., Ellis, B., Haaland, P., Sarkar, D., Spidlen, J., Strain, E., Gentleman, R.: flowCore: a Bioconductor package for high throughput flow cytometry. BMC Bioinform. 10, 106 (2009)

Karlis, D., Santourian, A.: Model-based clustering with non-elliptically contoured distributions. Stat. Comput. 19, 73–83 (2009)
Keribin, C.: Consistent estimation of the order of mixture models. Sankhyā 62, 49–66 (2000)
Li, J.Q., Barron, A.R.: Mixture density estimation. In: Advances in Neural Information Processing Systems 12. MIT Press, Cambridge (2000)
Lin, T.I.: Maximum likelihood estimation for multivariate skew normal mixture models. J. Multivar. Anal. 100, 257–265 (2009)
Lin, T.I.: Robust mixture modeling using multivariate skew t distributions. Stat. Comput. 20, 343–356 (2010)
Lin, T.I., Lee, J.C., Hsieh, W.J.: Robust mixture modeling using the skew t distribution. Stat. Comput. 17, 81–92 (2007a)
Lin, T.I., Lee, J.C., Yen, S.Y.: Finite mixture modelling using the skew normal distribution. Stat. Sin. 17, 909–927 (2007b)
Liu, C.H., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633–648 (1994)
Louis, T.A.: Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. B 44, 226–233 (1982)
McLachlan, G.J., Basford, K.E.: Mixture Models: Inference and Application to Clustering. Dekker, New York (1988)
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, New York (2008)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18, 285–296 (2008)

Meinicke, P., Brodag, T., Fricke, W.F., Waack, S.: P-value based visualization of codon usage data. Algorithms Mol. Biol. 1, 10 (2006)
Meng, X.L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–278 (1993)
Nadarajah, S., Kotz, S.: Skewed distributions generated by the normal kernel. Stat. Probab. Lett. 65, 269–277 (2003)
Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.I., Maier, L., Baecher-Allan, C., McLachlan, G.J., Tamayo, P., Hafler, D.A., De Jager, P.L., Mesirov, J.P.: Automated high-dimensional flow cytometric data analysis. Proc. Natl. Acad. Sci. USA 106, 8519–8524 (2009)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2008)
Sahu, S.K., Dey, D.K., Branco, M.D.: A new class of multivariate skew distributions with application to Bayesian regression models. Can. J. Stat. 31, 129–150 (2003)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Titterington, D.M., Smith, A.F.M., Makov, U.E.: Statistical Analysis of Finite Mixture Distributions. Wiley, New York (1985)
Vlassis, N., Likas, A.: A greedy EM algorithm for Gaussian mixture learning. Neural Process. Lett. 15, 77–87 (2002)
Wang, K., Ng, S.K., McLachlan, G.J.: Multivariate skew t mixture models: applications to fluorescence-activated cell sorting data. In: Proceedings of DICTA 2009, Conference of Digital Image Computing: Techniques and Applications, Melbourne, pp. 526–531. IEEE Computer Society, Los Alamitos (2009)