Journal of Classification 29:363-401 (2012) DOI: 10.1007/s00357-012-9114-3

Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions

Salvatore Ingrassia, Università di Catania, Italy

Simona C. Minotti, Università di Milano-Bicocca, Italy

Giorgio Vittadini, Università di Milano-Bicocca, Italy

Abstract: Cluster-weighted modeling (CWM) is a mixture approach to modeling the joint probability of data coming from a heterogeneous population. Under Gaussian assumptions, we investigate statistical properties of CWM from both a theoretical and a numerical point of view; in particular, we show that Gaussian CWM includes mixtures of distributions and mixtures of regressions as special cases. Further, we introduce CWM based on Student-t distributions, which provides a more robust fit for groups of observations with longer than normal tails or noise data. Theoretical results are illustrated using some empirical studies, considering both simulated and real data. Some generalizations of such models are also outlined.

Keywords: Cluster-weighted modeling; Mixture models; Model-based clustering.

The authors sincerely thank the referees for their interesting comments and valuable suggestions. We also thank Antonio Punzo for helpful discussions. Authors' Addresses: S. Ingrassia, Dipartimento di Economia e Impresa, Università di Catania, Corso Italia 55, Catania, Italy, e-mail: [email protected]; S.C. Minotti, Dipartimento di Statistica, Università di Milano-Bicocca, Via Bicocca degli Arcimboldi 8 - 20126 Milano, Italy, e-mail: [email protected]; G. Vittadini, Dipartimento di Metodi Quantitativi per l'Economia e le Scienze Aziendali, Università di Milano-Bicocca, Via Bicocca degli Arcimboldi 8 - 20126 Milano, Italy, e-mail: [email protected]. Published online 23 August 2012

1. Introduction

Finite mixture models provide a flexible approach to the statistical modeling of a wide variety of random phenomena characterized by unobserved heterogeneity. In these models, dating back to the work of Newcomb (1886) and Pearson (1894), the observations in a sample are assumed to arise from unobserved groups in the population; the purpose is to identify the groups and estimate the parameters of the conditional-group density functions. If no exogenous variables explain the means and the variances of each component, we refer to unconditional mixture models, i.e., the so-called finite mixtures of distributions (FMD), developed for both normal and non-normal components (e.g. Everitt and Hand 1981; Titterington, Smith and Makov 1985; McLachlan and Basford 1988; McLachlan and Peel 2000; Frühwirth-Schnatter 2005). Otherwise, we refer to conditional mixture models, i.e., finite mixtures of regression models (FMR) and finite mixtures of generalized linear models (FMGLM) (e.g. DeSarbo and Cron 1988; Jansen 1993; Wedel and DeSarbo 1995; McLachlan and Peel 2000; Frühwirth-Schnatter 2005). These models are also known as mixture-of-experts models in machine learning (Jordan and Jacobs 1994; Peng, Jacobs, and Tanner 1996; Ng and McLachlan 2007, 2008), switching regression models in econometrics (Quandt 1972), latent class regression models in marketing (DeSarbo and Cron 1988; Wedel and Kamakura 2000), and mixed models in biology (Wang, Puterman, Cockburn and Le 1996). An extension of FMR is given by the so-called finite mixtures of regression models with concomitant variables (FMRC) (Dayton and Macready 1988; Wedel 2002), where the weights of the mixture functionally depend on a set of concomitant variables, which may be different from the explanatory variables and are usually modeled by a multinomial logistic distribution.

The present paper focuses on a different mixture approach to modeling the joint probability of a response variable and a set of explanatory variables.
The original formulation, proposed by Gershenfeld (1997) under Gaussian and linear assumptions and called cluster-weighted modeling (CWM), was developed in the context of media technology to build a digital violin with traditional inputs and realistic sound (Gershenfeld, Schöner and Metois 1999; Gershenfeld 1999; Schöner 2000; Schöner and Gershenfeld 2001). Wedel (2002) refers to such a model as the saturated mixture regression model. Moreover, Wedel and DeSarbo (2002) propose tests of dependencies that lead to a set of nested models for market segment derivation and profiling. However, this overview leaves out general distributional assumptions and is based only on independence properties, without analyses in terms of probability density functions and posterior probabilities.


With respect to the existing literature, this paper proposes CWM in a statistical setting and shows that it is a general and flexible family of mixture models. In particular, under Gaussian assumptions, we specify probability distribution assumptions and related statistical properties, and demonstrate links with traditional mixture models in terms of probability density functions and posterior probabilities. Apart from the usual Gaussian assumptions, we also propose cluster-weighted modeling with Student-t distributions to provide more robust data fitting. Finally, theoretical results are illustrated by means of examples based on both simulated and real datasets.

The remainder of the paper is organized as follows. In Section 2, we reformulate CWM from a statistical point of view in a wide framework that covers both direct and indirect applications in the spirit of Titterington et al. (1985). In Section 3, we investigate some relationships between linear Gaussian CWM and some traditional Gaussian-based mixture models. In particular, we prove that linear Gaussian CWM leads to the same family of probability distributions generated by finite mixtures of Gaussian distributions (FMG) and, under suitable assumptions, includes FMR and FMRC as special cases. Moreover, we introduce a more general version of Gaussian CWM. In Section 4, we introduce CWM based on another type of elliptical distribution: the Student-t distribution. Data modeling according to the Student-t distribution has been proposed to provide more robust fitting for groups of observations with longer than normal tails or noise data (e.g. Zellner 1976; Lange, Little and Taylor 1989; Bernardo and Girón 1992; McLachlan and Peel 2000; Peel and McLachlan 2000; Nadarajah and Kotz 2005; Andrews and McNicholas 2011; Baek and McLachlan 2011).
Here, we prove that linear Student-t CWM defines a wide family of probability distributions which, under suitable assumptions, strictly includes finite mixtures of Student-t distributions (FMT). In Section 5, the theoretical results given in the paper are illustrated on the basis of both simulated and real datasets. Finally, in Section 6, we suggest new perspectives for further research. In the Appendix, a geometrical analysis of the decision surfaces generated by CWM is reported.

2. Cluster-Weighted Modeling

In the original formulation, CWM was introduced under Gaussian and linear assumptions (Gershenfeld 1997); here, we present CWM in a quite general setting. Let $(\mathbf{X}, Y)$ be a pair consisting of a random vector $\mathbf{X}$ and a random variable $Y$ defined on $\Omega$ with joint probability distribution $p(\mathbf{x}, y)$, where $\mathbf{X}$ is a $d$-dimensional input vector with values in some space $\mathcal{X} \subseteq \mathbb{R}^d$ and $Y$ is a response variable with values in $\mathcal{Y} \subseteq \mathbb{R}$. Thus, $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^{d+1}$. Suppose that $\Omega$ can be partitioned into $G$ disjoint groups, say $\Omega_1, \ldots, \Omega_G$, that is, $\Omega = \Omega_1 \cup \cdots \cup \Omega_G$. CWM decomposes the joint probability $p(\mathbf{x}, y)$ as follows:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G p(y|\mathbf{x}, \Omega_g)\, p(\mathbf{x}|\Omega_g)\, \pi_g, \qquad (1)$$

where $p(y|\mathbf{x}, \Omega_g)$ is the conditional density of the response variable $Y$ given the predictor vector $\mathbf{x}$ and $\Omega_g$, $p(\mathbf{x}|\Omega_g)$ is the probability density of $\mathbf{x}$ given $\Omega_g$, $\pi_g = p(\Omega_g)$ is the mixing weight of $\Omega_g$ ($\pi_g > 0$ and $\sum_{g=1}^G \pi_g = 1$), $g = 1, \ldots, G$, and $\boldsymbol{\theta}$ denotes the set of all parameters of the model. Hence, the joint density of $(\mathbf{X}, Y)$ can be viewed as a mixture of local models $p(y|\mathbf{x}, \Omega_g)$ weighted (in a broader sense) on both local densities $p(\mathbf{x}|\Omega_g)$ and mixing weights $\pi_g$. In the spirit of Titterington et al. (1985, p. 2), we can distinguish three types of application for CWM in (1):

1. Direct application of type A. We assume that each group $\Omega_g$ is characterized by an input-output relation that can be written as $Y|\mathbf{x} = \mu(\mathbf{x}; \boldsymbol{\beta}_g) + \varepsilon_g$, where $\varepsilon_g$ is a random variable with zero mean and finite variance $\sigma_{\varepsilon,g}^2$, and $\boldsymbol{\beta}_g$ denotes the set of parameters of the function $\mu(\cdot)$, $g = 1, \ldots, G$.

2. Direct application of type B. We assume that a random vector $\mathbf{Z}$ is defined on $\Omega = \Omega_1 \cup \cdots \cup \Omega_G$ with values in $\mathbb{R}^{d+1}$ and each $\mathbf{z} \in \mathcal{Z} \subseteq \mathbb{R}^{d+1}$ belongs to one of these groups. Further, the vector $\mathbf{z}$ is partitioned as $\mathbf{z} = (\mathbf{x}', y)'$ and we assume that within each group we can write $p(\mathbf{z}; \Omega_g) = p((\mathbf{x}', y)'; \Omega_g) = p(y|\mathbf{x}, \Omega_g)\, p(\mathbf{x}|\Omega_g)$. In other words, CWM in (1) is another form of the density of FMD, given by:

$$p(\mathbf{z}; \boldsymbol{\theta}) = \sum_{g=1}^G p(\mathbf{z}|\Omega_g)\, \pi_g = \sum_{g=1}^G p(y|\mathbf{x}, \Omega_g)\, p(\mathbf{x}|\Omega_g)\, \pi_g. \qquad (2)$$

3. Indirect application. In this case, CWM in (1) is simply used as a mathematical tool for density estimation.

Throughout this paper, we concentrate on direct applications, which essentially have classification purposes. In this case, the posterior probability $p(\Omega_g|\mathbf{x}, y)$ that unit $(\mathbf{x}, y)$ belongs to the $g$-th group ($g = 1, \ldots, G$) is given by:

$$p(\Omega_g|\mathbf{x}, y) = \frac{p(\mathbf{x}, y, \Omega_g)}{p(\mathbf{x}, y)} = \frac{p(y|\mathbf{x}, \Omega_g)\, p(\mathbf{x}|\Omega_g)\, \pi_g}{\sum_{j=1}^G p(y|\mathbf{x}, \Omega_j)\, p(\mathbf{x}|\Omega_j)\, \pi_j}, \quad g = 1, \ldots, G, \qquad (3)$$

that is, the classification of each unit depends on both marginal and conditional densities. Because $p(\mathbf{x}|\Omega_g)\, \pi_g = p(\Omega_g|\mathbf{x})\, p(\mathbf{x})$, from (3) we get:


$$p(\Omega_g|\mathbf{x}, y) = \frac{p(y|\mathbf{x}, \Omega_g)\, p(\Omega_g|\mathbf{x})\, p(\mathbf{x})}{\sum_{j=1}^G p(y|\mathbf{x}, \Omega_j)\, p(\Omega_j|\mathbf{x})\, p(\mathbf{x})} = \frac{p(y|\mathbf{x}, \Omega_g)\, p(\Omega_g|\mathbf{x})}{\sum_{j=1}^G p(y|\mathbf{x}, \Omega_j)\, p(\Omega_j|\mathbf{x})}, \qquad (4)$$

with

$$p(\Omega_g|\mathbf{x}) = \frac{p(\mathbf{x}|\Omega_g)\, \pi_g}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\Omega_g)\, \pi_g}{\sum_{j=1}^G p(\mathbf{x}|\Omega_j)\, \pi_j},$$

where we set $p(\mathbf{x}) = \sum_{j=1}^G p(\mathbf{x}|\Omega_j)\, \pi_j$.

2.1 The Basic Model: Linear Gaussian CWM

In the traditional framework, both marginal and conditional densities are assumed to be Gaussian, with $\mathbf{X}|\Omega_g \sim N_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ and $Y|\mathbf{x}, \Omega_g \sim N(\mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma_{\varepsilon,g}^2)$, so that we shall write $p(\mathbf{x}|\Omega_g) = \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ and $p(y|\mathbf{x}, \Omega_g) = \phi(y; \mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma_{\varepsilon,g}^2)$, $g = 1, \ldots, G$, where the conditional densities are based on linear mappings, so that $\mu(\mathbf{x}; \boldsymbol{\beta}_g) = \mathbf{b}_g'\mathbf{x} + b_{g0}$ for some $\boldsymbol{\beta}_g = (\mathbf{b}_g', b_{g0})'$, with $\mathbf{b}_g \in \mathbb{R}^d$ and $b_{g0} \in \mathbb{R}$. Thus, we get:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g, \qquad (5)$$

with $\phi(\cdot)$ denoting the probability density of the Gaussian distribution. The approach in (5) will be referred to as linear Gaussian CWM. In particular, the posterior probability in (3) specializes as:

$$p(\Omega_g|\mathbf{x}, y) = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g}{\sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\, \pi_j}, \quad g = 1, \ldots, G, \qquad (6)$$

and the decision surfaces that separate the groups belong to the family of quadrics (see the Appendix for a geometrical analysis).

3. Linear Gaussian CWM and Relationships with Traditional Mixture Models

In this section, we investigate the relationships between linear Gaussian CWM and traditional Gaussian-based mixture models, considering both probability density functions and posterior probabilities. In particular, we prove that, under suitable assumptions, linear Gaussian CWM in (5) leads to the same posterior probabilities as such mixture models. In this sense, we say that CWM contains FMG, FMR, and FMRC. Moreover, linear Gaussian CWM leads to the same family of probability distributions generated by FMG.
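As an illustration, the density (5) and the posterior (6) are cheap to evaluate directly. The following sketch, for $d = 1$ and $G = 2$, uses purely illustrative parameter values of our own; it is not the paper's fitting procedure, only a direct evaluation of the two formulas.

```python
# Minimal sketch of the linear Gaussian CWM density (5) and posterior (6)
# for d = 1. All parameter values below are illustrative, not from the paper.
import math

def phi(u, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (u - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def cwm_density(x, y, params):
    """p(x, y; theta) as in (5); params is a list of dicts, one per group."""
    return sum(p["pi"] * phi(y, p["b1"] * x + p["b0"], p["s2_eps"])
               * phi(x, p["mu"], p["s2_x"]) for p in params)

def cwm_posterior(x, y, params):
    """p(Omega_g | x, y) as in (6)."""
    terms = [p["pi"] * phi(y, p["b1"] * x + p["b0"], p["s2_eps"])
             * phi(x, p["mu"], p["s2_x"]) for p in params]
    tot = sum(terms)
    return [t / tot for t in terms]

# Two illustrative groups with different regression lines and X-distributions.
params = [
    {"pi": 0.4, "mu": -2.0, "s2_x": 1.0, "b0": 1.0, "b1": 2.0, "s2_eps": 0.5},
    {"pi": 0.6, "mu": 3.0, "s2_x": 1.0, "b0": -1.0, "b1": -1.0, "s2_eps": 0.5},
]

den = cwm_density(-2.0, -3.0, params)          # joint density at a sample point
post = cwm_posterior(-2.0, -3.0, params)       # a point on group 1's line
```

A point lying near group 1's mean and on group 1's regression line receives almost all of its posterior weight from group 1, since both factors in (6) favor that group.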


3.1 Finite Mixtures of Gaussian Distributions

Let $\mathbf{Z}$ be a random vector defined on $\Omega = \Omega_1 \cup \cdots \cup \Omega_G$ with joint probability distribution $p(\mathbf{z})$, where $\mathbf{Z}$ assumes values in some space $\mathcal{Z} \subseteq \mathbb{R}^{d+1}$. Assume that the density $p(\mathbf{z})$ of $\mathbf{Z}$ has the form of a FMD, i.e., $p(\mathbf{z}) = \sum_{g=1}^G p(\mathbf{z}|\Omega_g)\, \pi_g$, where $p(\mathbf{z}|\Omega_g)$ is the probability density of $\mathbf{Z}|\Omega_g$ and $\pi_g = p(\Omega_g)$ is the mixing weight of group $\Omega_g$, $g = 1, \ldots, G$. Finally, denote by $\boldsymbol{\mu}_g^{(z)}$ and $\boldsymbol{\Sigma}_g^{(z)}$ the mean vector and the covariance matrix of $\mathbf{Z}|\Omega_g$, respectively.

Now let us set $\mathbf{Z} = (\mathbf{X}', Y)'$, where $\mathbf{X}$ is a random vector with values in $\mathbb{R}^d$ and $Y$ is a random variable. Thus, we can write

$$\boldsymbol{\mu}_g^{(z)} = \begin{pmatrix} \boldsymbol{\mu}_g^{(x)} \\ \mu_g^{(y)} \end{pmatrix} \quad \text{and} \quad \boldsymbol{\Sigma}_g^{(z)} = \begin{pmatrix} \boldsymbol{\Sigma}_g^{(xx)} & \boldsymbol{\Sigma}_g^{(xy)} \\ \boldsymbol{\Sigma}_g^{(yx)} & \sigma_g^{2(y)} \end{pmatrix}. \qquad (7)$$

Further, the posterior probability of the $g$-th group is given by:

$$p(\Omega_g|\mathbf{z}) = \frac{p(\mathbf{z}|\Omega_g)\, \pi_g}{\sum_{j=1}^G p(\mathbf{z}|\Omega_j)\, \pi_j}, \quad g = 1, \ldots, G. \qquad (8)$$

We have the following result:

Proposition 1. Let $\mathbf{Z}$ be a random vector defined on $\Omega = \Omega_1 \cup \cdots \cup \Omega_G$ with values in $\mathbb{R}^{d+1}$, and assume that $\mathbf{Z}|\Omega_g \sim N_{d+1}(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ ($g = 1, \ldots, G$), so that the density $p(\mathbf{z})$ of $\mathbf{Z}$ is a FMG:

$$p(\mathbf{z}) = \sum_{g=1}^G \phi_{d+1}(\mathbf{z}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g. \qquad (9)$$

Then, $p(\mathbf{z})$ can be written in the form (5), that is, as a linear Gaussian CWM.

Proof. Let us set $\mathbf{Z} = (\mathbf{X}', Y)'$, where $\mathbf{X}$ is a $d$-dimensional random vector and $Y$ is a random variable. According to well-known results of multivariate statistics (e.g. Mardia, Kent and Bibby 1979), from (9) we get

$$p(\mathbf{z}) = \sum_{g=1}^G \phi_{d+1}(\mathbf{z}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g = \sum_{g=1}^G \phi_d(\mathbf{x}; \boldsymbol{\mu}_g^{(x)}, \boldsymbol{\Sigma}_g^{(xx)})\, \phi(y; \mu_g^{(y|x)}, \sigma_g^{2(y|x)})\, \pi_g,$$

where $\mu_g^{(y|x)} = \mu_g^{(y)} + \boldsymbol{\Sigma}_g^{(yx)} \boldsymbol{\Sigma}_g^{(xx)-1} (\mathbf{x} - \boldsymbol{\mu}_g^{(x)})$ and $\sigma_g^{2(y|x)} = \sigma_g^{2(y)} - \boldsymbol{\Sigma}_g^{(yx)} \boldsymbol{\Sigma}_g^{(xx)-1} \boldsymbol{\Sigma}_g^{(xy)}$. If we set $\mathbf{b}_g = \boldsymbol{\Sigma}_g^{(yx)} \boldsymbol{\Sigma}_g^{(xx)-1}$, $b_{g0} = \mu_g^{(y)} - \boldsymbol{\Sigma}_g^{(yx)} \boldsymbol{\Sigma}_g^{(xx)-1} \boldsymbol{\mu}_g^{(x)}$, and $\sigma_{\varepsilon,g}^2 = \sigma_g^{2(y|x)}$, then (9) can be written in the form of (5). □

Figure 1. Two examples of Gaussian CWM densities based on quadratic (above) and cubic (below) mappings.

Using similar arguments, FMG and linear Gaussian CWM can be shown to lead to the same distribution of posterior probabilities; thus, CWM contains FMG. We remark that the equivalence between FMG and CWM holds only for linear mappings $\mu(\mathbf{x}; \boldsymbol{\beta}_g) = \mathbf{b}_g'\mathbf{x} + b_{g0}$ ($g = 1, \ldots, G$), while, more generally, Gaussian CWM

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G \phi(y; \mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g \qquad (10)$$

includes a quite wide family of models. In Figure 1, we plot two examples of the density (10) with quadratic and cubic functions $\mu(\mathbf{x}; \boldsymbol{\beta}_g)$ (with $G = 2$ groups).


3.2 Finite Mixtures of Regression Models

Secondly, let us consider FMR (DeSarbo and Cron 1988; McLachlan and Peel 2000; Frühwirth-Schnatter 2005):

$$f(y|\mathbf{x}; \boldsymbol{\psi}) = \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \pi_g, \qquad (11)$$

where the vector $\boldsymbol{\psi}$ denotes the overall parameters of the model. The posterior probability $p(\Omega_g|\mathbf{x}, y)$ of the $g$-th group ($g = 1, \ldots, G$) for FMR is:

$$p(\Omega_g|\mathbf{x}, y) = \frac{f(y|\mathbf{x}; \boldsymbol{\psi}, \Omega_g)}{f(y|\mathbf{x}; \boldsymbol{\psi})} = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \pi_g}{\sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, \pi_j}, \qquad (12)$$

that is, the classification of each observation depends on the local model and the mixing weight. We have the following result:

Proposition 2. Let us consider the linear Gaussian CWM in (5), with $\mathbf{X}|\Omega_g \sim N_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ for $g = 1, \ldots, G$. If the probability density of $\mathbf{X}|\Omega_g$ does not depend on the group $g$, i.e., $\phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ for every $g = 1, \ldots, G$, then it follows that

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, f(y|\mathbf{x}; \boldsymbol{\psi}),$$

where $f(y|\mathbf{x}; \boldsymbol{\psi})$ is the FMR model in (11).

Proof. Assume that $\phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$, $g = 1, \ldots, G$; then (5) yields:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, \pi_g = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \pi_g = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, f(y|\mathbf{x}; \boldsymbol{\psi}),$$

where $f(y|\mathbf{x}; \boldsymbol{\psi})$ is the FMR model in (11). □


The second result of this section shows that, under the same hypothesis, CWM contains FMR.

Corollary 3. If the probability density of $\mathbf{X}|\Omega_g \sim N_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ in (5) does not depend on the $g$-th group, i.e., $\phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ for every $g = 1, \ldots, G$, then the posterior probability in (6) coincides with (12).

Proof. Assume that $\phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$, $g = 1, \ldots, G$; then from (6) we get

$$p(\Omega_g|\mathbf{x}, y) = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, \pi_g}{\phi_d(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) \sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, \pi_j} = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \pi_g}{\sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, \pi_j}$$

for $g = 1, \ldots, G$, which coincides with (12). □

3.3 Finite Mixtures of Regression Models with Concomitant Variables

FMRC (Dayton and Macready 1988; Wedel 2002) are an extension of FMR:

$$f^*(y|\mathbf{x}; \boldsymbol{\psi}^*) = \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, p(\Omega_g|\mathbf{x}, \boldsymbol{\xi}), \qquad (13)$$

where the mixing weight $p(\Omega_g|\mathbf{x}, \boldsymbol{\xi})$ is a function depending on $\mathbf{x}$ through some parameters $\boldsymbol{\xi}$, and $\boldsymbol{\psi}^*$ is the augmented set of all parameters of the model. The probability $p(\Omega_g|\mathbf{x}, \boldsymbol{\xi})$ is usually modeled by a multinomial logistic distribution with the first component as baseline, that is:

$$p(\Omega_g|\mathbf{x}, \boldsymbol{\xi}) = \frac{\exp(\mathbf{w}_g'\mathbf{x} + w_{g0})}{\sum_{j=1}^G \exp(\mathbf{w}_j'\mathbf{x} + w_{j0})}. \qquad (14)$$

Equation (14) is satisfied if the local densities $p(\mathbf{x}|\Omega_g)$, $g = 1, \ldots, G$, are assumed to be multivariate Gaussian with the same covariance matrices (e.g. Anderson 1972). The posterior probability $p(\Omega_g|\mathbf{x}, y)$ of the $g$-th group ($g = 1, \ldots, G$) for FMRC is:


$$p(\Omega_g|\mathbf{x}, y) = \frac{f^*(y|\mathbf{x}; \boldsymbol{\psi}^*, \Omega_g)}{f^*(y|\mathbf{x}; \boldsymbol{\psi}^*)} = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, p(\Omega_g|\mathbf{x}, \boldsymbol{\xi})}{\sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, p(\Omega_j|\mathbf{x}, \boldsymbol{\xi})}. \qquad (15)$$

Under suitable assumptions, linear Gaussian CWM leads to the same estimates of $\mathbf{b}_g$, $b_{g0}$ ($g = 1, \ldots, G$) as in (13).

Proposition 4. Let us consider the linear Gaussian CWM in (5), with $\mathbf{X}|\Omega_g \sim N_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, $g = 1, \ldots, G$. If $\boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$ and $\pi_g = \pi = 1/G$ for every $g = 1, \ldots, G$, then it follows that

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = p(\mathbf{x})\, f^*(y|\mathbf{x}; \boldsymbol{\psi}^*),$$

where $f^*(y|\mathbf{x}; \boldsymbol{\psi}^*)$ is the FMRC model in (13) based on the multinomial logistic in (14) and $p(\mathbf{x}) = \sum_{g=1}^G p(\mathbf{x}|\Omega_g)\, \pi_g$.

Proof. Assume $\boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$ and $\pi_g = \pi = 1/G$ for every $g = 1, \ldots, G$; then the density in (5) yields:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma})\, \pi = p(\mathbf{x}) \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \frac{\phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma})\, \pi}{p(\mathbf{x})}$$
$$= p(\mathbf{x}) \sum_{g=1}^G \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \frac{\exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_g)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_g)\right\}}{\sum_{j=1}^G \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_j)\right\}},$$

where

$$\frac{\exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_g)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_g)\right\}}{\sum_{j=1}^G \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_j)\right\}} = \frac{1}{1 + \sum_{j \neq g} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_j) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_g)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_g)\right\}}$$
$$= \frac{1}{1 + \sum_{j \neq g} \exp\left\{(\boldsymbol{\mu}_j - \boldsymbol{\mu}_g)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_j + \boldsymbol{\mu}_g)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_j - \boldsymbol{\mu}_g)\right\}}, \qquad (16)$$

and we recognize that (16) can be written in the form (14) for suitable constants $\mathbf{w}_g$, $w_{g0}$ ($g = 1, \ldots, G$). This completes the proof. □

Based on similar arguments, we can immediately prove that, under the same hypotheses, CWM contains FMRC.


Table 1. Relationships between linear Gaussian CWM and traditional Gaussian mixtures.

model   p(x|Ω_g)    p(y|x, Ω_g)   parameterization of π_g   assumptions
FMG     Gaussian    Gaussian      none                      none
FMR     none        Gaussian      none                      (μ_g, Σ_g) = (μ, Σ), g = 1, ..., G
FMRC    none        Gaussian      logistic                  Σ_g = Σ and π_g = π, g = 1, ..., G

Corollary 5. Let us consider the linear Gaussian CWM in (5). If $\boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$ and $\pi_g = \pi = 1/G$ for every $g = 1, \ldots, G$, then the posterior probability in (6) coincides with (15).

Proof. First, based on (14), let us rewrite (15) as

$$p(\Omega_g|\mathbf{x}, y) = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \exp(\mathbf{w}_g'\mathbf{x} + w_{g0})}{\sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, \exp(\mathbf{w}_j'\mathbf{x} + w_{j0})}. \qquad (17)$$

Assume $\boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$ and $\pi_g = \pi = 1/G$ for every $g = 1, \ldots, G$. Thus, through (4), equation (6) reduces to

$$p(\Omega_g|\mathbf{x}, y) = \frac{\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma})}{\sum_{j=1}^G \phi(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma})},$$

and after some algebra we obtain a quantity of the form (17). This completes the proof. □

As for the relation between FMRC and linear Gaussian CWM, consider that the joint density $p(\mathbf{x}, \Omega_g)$ can be written in either form:

$$p(\mathbf{x}, \Omega_g) = p(\mathbf{x}|\Omega_g)\, p(\Omega_g) \quad \text{or} \quad p(\mathbf{x}, \Omega_g) = p(\Omega_g|\mathbf{x})\, p(\mathbf{x}), \qquad (18)$$

where the quantity $p(\mathbf{x}|\Omega_g)$ is involved in CWM (left-hand side), while FMRC contains the conditional probability $p(\Omega_g|\mathbf{x})$ (right-hand side). In other words, CWM is an $\Omega_g$-to-$\mathbf{x}$ model, while FMRC is an $\mathbf{x}$-to-$\Omega_g$ model. According to Jordan (1995), in the framework of neural networks they are called the generative direction model and the diagnostic direction model, respectively.

The results of this section are summarized in Table 1, which reports the relationships between linear Gaussian CWM and traditional Gaussian mixture models. Finally, we remark that if the conditional distributions $p(y|\mathbf{x}, \Omega_g) = \phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2)$ ($g = 1, \ldots, G$) do not depend on the group $g$, that is,


$\phi(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2) = \phi(y; \mathbf{b}'\mathbf{x} + b_0, \sigma_\varepsilon^2)$ for $g = 1, \ldots, G$, then (5) specializes as:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G \phi(y; \mathbf{b}'\mathbf{x} + b_0, \sigma_\varepsilon^2)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g = \phi(y; \mathbf{b}'\mathbf{x} + b_0, \sigma_\varepsilon^2) \sum_{g=1}^G \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g, \qquad (19)$$

and this implies, from (11) and (13), that FMR and FMRC reduce to a single straight line, $f(y|\mathbf{x}; \boldsymbol{\psi}) = f^*(y|\mathbf{x}; \boldsymbol{\psi}^*) = \phi(y; \mathbf{b}'\mathbf{x} + b_0, \sigma_\varepsilon^2)$, because $\sum_{g=1}^G \pi_g = \sum_{g=1}^G p(\Omega_g|\mathbf{x}, \boldsymbol{\xi}) = 1$. The results of this section will be illustrated from a numerical point of view in Section 5.1.

4. Student-t CWM

In this section, we introduce CWM based on another important type of elliptical distribution: the Student-t distribution. Data modeling according to the Student-t distribution has been proposed to provide more robust fitting for groups of observations with longer than normal tails or noise data (e.g. Zellner 1976; Lange et al. 1989; Bernardo and Girón 1992; McLachlan and Peel 1998, 2000; Peel and McLachlan 2000; Nadarajah and Kotz 2005; Andrews and McNicholas 2011; Baek and McLachlan 2011). Recent applications also include the analysis of orthodontic data via linear effect models (Pinheiro, Liu and Wu 2001), marketing data analysis (Andrews, Ansari and Currim 2002), and asset pricing (Kan and Zhou 2006). In particular, and differently from the Gaussian case, we prove that linear Student-t CWM defines a wide family of probability distributions which, under suitable assumptions, strictly includes FMT as a special case.

To begin with, we recall that a $q$-variate random vector $\mathbf{Z}$ has a multivariate $t$ distribution with degrees of freedom $\nu \in (0, \infty)$, location parameter $\boldsymbol{\mu} \in \mathbb{R}^q$, and $q \times q$ positive definite inner product matrix $\boldsymbol{\Sigma}$ if it has density

$$p(\mathbf{z}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma((\nu + q)/2)\, \nu^{\nu/2}}{\Gamma(\nu/2)\, |\pi\boldsymbol{\Sigma}|^{1/2}\, [\nu + \delta(\mathbf{z}; \boldsymbol{\mu}, \boldsymbol{\Sigma})]^{(\nu + q)/2}}, \qquad (20)$$

where $\delta(\mathbf{z}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (\mathbf{z} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{z} - \boldsymbol{\mu})$ denotes the squared Mahalanobis distance between $\mathbf{z}$ and $\boldsymbol{\mu}$ with respect to the matrix $\boldsymbol{\Sigma}$, and $\Gamma(\cdot)$ is the Gamma function.
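The density (20) is straightforward to implement directly; the following sketch codes the bivariate case ($q = 2$) and checks numerically that it integrates to one. Parameter values are our own illustrative choices.

```python
# Direct implementation of the multivariate t density (20) for q = 2,
# with a crude midpoint-rule check that it integrates to one.
# Parameter values are illustrative.
import math

def mvt2(z, mu, S, nu):
    """Bivariate Student-t density in the parameterization of (20)."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    dx, dy = z[0] - mu[0], z[1] - mu[1]
    # Squared Mahalanobis distance via the explicit 2x2 inverse.
    delta = (S[1][1] * dx * dx - 2 * S[0][1] * dx * dy + S[0][0] * dy * dy) / det
    q = 2
    num = math.gamma((nu + q) / 2) * nu ** (nu / 2)
    den = (math.gamma(nu / 2) * math.pi ** (q / 2) * math.sqrt(det)
           * (nu + delta) ** ((nu + q) / 2))       # |pi*Sigma|^(1/2) = pi^(q/2) |Sigma|^(1/2)
    return num / den

mu, S, nu = (0.5, -1.0), [[1.0, 0.3], [0.3, 2.0]], 5.0

# Midpoint-rule integration over a box capturing almost all the mass.
h, total = 0.25, 0.0
n = int(50 / h)
for i in range(n):
    for j in range(n):
        z = (-25 + (i + 0.5) * h, -25 + (j + 0.5) * h)
        total += mvt2(z, mu, S, nu) * h * h
```

The heavy $t$ tails decay like $\delta^{-(\nu+q)/2}$, so for $\nu = 5$ the mass outside the integration box is negligible and the numerical integral is close to one.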


In this case, we write $\mathbf{Z} \sim t_q(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)$; then $E(\mathbf{Z}) = \boldsymbol{\mu}$ (for $\nu > 1$) and $\mathrm{Cov}(\mathbf{Z}) = \nu\boldsymbol{\Sigma}/(\nu - 2)$ (for $\nu > 2$). If $U$ is a random variable, independent of $\mathbf{Z}$, such that $\nu U$ has a chi-squared distribution with $\nu$ degrees of freedom, that is $\nu U \sim \chi^2_\nu$, then it is well known that $\mathbf{Z}|(U = u) \sim N_q(\boldsymbol{\mu}, \boldsymbol{\Sigma}/u)$.

Throughout this section, we assume that $\mathbf{X}|\Omega_g$ has a multivariate $t$ distribution with location parameter $\boldsymbol{\mu}_g$, inner product matrix $\boldsymbol{\Sigma}_g$, and degrees of freedom $\nu_g$, that is, $\mathbf{X}|\Omega_g \sim t_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g)$, and that $Y|\mathbf{x}, \Omega_g$ has a $t$ distribution with location parameter $\mu(\mathbf{x}; \boldsymbol{\beta}_g)$, scale parameter $\sigma_{\varepsilon,g}^2$, and degrees of freedom $\zeta_g$, that is, $Y|\mathbf{x}, \Omega_g \sim t(\mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma_{\varepsilon,g}^2, \zeta_g)$, $g = 1, \ldots, G$. Thus, (1) specializes as:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G t(y; \mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma_{\varepsilon,g}^2, \zeta_g)\, t_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g)\, \pi_g, \qquad (21)$$

and this model will be referred to as $t$-CWM. The special case in which $\mu(\mathbf{x}; \boldsymbol{\beta}_g)$ is a linear mapping will be called linear $t$-CWM:

$$p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^G t(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2, \zeta_g)\, t_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g)\, \pi_g, \qquad (22)$$

where, according to (20), for $g = 1, \ldots, G$ we have

$$t(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2, \zeta_g) = \frac{\Gamma((\zeta_g + 1)/2)\, \zeta_g^{\zeta_g/2}}{\Gamma(\zeta_g/2)\, \sqrt{\pi\sigma_{\varepsilon,g}^2}\, \{\zeta_g + [y - (\mathbf{b}_g'\mathbf{x} + b_{g0})]^2/\sigma_{\varepsilon,g}^2\}^{(\zeta_g + 1)/2}},$$

$$t_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g) = \frac{\Gamma((\nu_g + d)/2)\, \nu_g^{\nu_g/2}}{\Gamma(\nu_g/2)\, |\pi\boldsymbol{\Sigma}_g|^{1/2}\, \{\nu_g + \delta(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\}^{(\nu_g + d)/2}}.$$

Moreover, the posterior probability in (3) specializes as:

$$p(\Omega_g|\mathbf{x}, y) = \frac{t(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2, \zeta_g)\, t_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g)\, \pi_g}{\sum_{j=1}^G t(y; \mathbf{b}_j'\mathbf{x} + b_{j0}, \sigma_{\varepsilon,j}^2, \zeta_j)\, t_d(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j, \nu_j)\, \pi_j}, \quad g = 1, \ldots, G,$$

and the decision surfaces that separate the groups are elliptical (see the Appendix for a geometrical analysis). The result below implies that, differently from the Gaussian case, linear $t$-CWM defines a larger family of probability distributions than FMT; in fact, the family of distributions generated by linear $t$-CWM strictly includes the family of distributions generated by FMT.


Proposition 6. Let $\mathbf{Z}$ be a random vector defined on $\Omega = \Omega_1 \cup \cdots \cup \Omega_G$ with values in $\mathbb{R}^{d+1}$ and set $\mathbf{Z} = (\mathbf{X}', Y)'$, where $\mathbf{X}$ is a $d$-dimensional input vector and $Y$ is a random variable defined on $\Omega$. Assume that the density of $\mathbf{Z} = (\mathbf{X}', Y)'$ can be written in the form of a linear $t$-CWM (22), where $\mathbf{X}|\Omega_g \sim t_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g)$ and $Y|\mathbf{x}, \Omega_g \sim t(\mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2, \zeta_g)$, $g = 1, \ldots, G$. If $\zeta_g = \nu_g + d$ and $\sigma_{\varepsilon,g}^{2*} = \sigma_{\varepsilon,g}^2 [\nu_g + \delta(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)]/(\nu_g + d)$, then the linear $t$-CWM (22) coincides with FMT for suitable parameters $\mathbf{b}_g$, $b_{g0}$, and $\sigma_{\varepsilon,g}^2$, $g = 1, \ldots, G$.

Proof. Let $\mathbf{Z}$ be a $q$-variate random vector having the multivariate $t$ distribution (20) with degrees of freedom $\nu \in (0, \infty)$, location parameter $\boldsymbol{\mu}$, and positive definite inner product matrix $\boldsymbol{\Sigma}$. If $\mathbf{Z}$ is partitioned as $\mathbf{Z} = (\mathbf{Z}_1', \mathbf{Z}_2')'$, where $\mathbf{Z}_1$ takes values in $\mathbb{R}^{q_1}$ and $\mathbf{Z}_2$ in $\mathbb{R}^{q_2} = \mathbb{R}^{q - q_1}$, then $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ can be partitioned accordingly as

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}.$$

Hence, based on properties of the multivariate $t$ distribution (e.g. Dickey 1967; Liu and Rubin 1995), it can be proven that:

$$\mathbf{Z}_1 \sim t_{q_1}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}, \nu) \quad \text{and} \quad \mathbf{Z}_2|\mathbf{z}_1 \sim t_{q_2}(\boldsymbol{\mu}_{2|1}, \boldsymbol{\Sigma}^*_{2|1}, \nu + q_1), \qquad (23)$$

where

$$\boldsymbol{\mu}_{2|1} = \boldsymbol{\mu}_{2|1}(\mathbf{z}_1) = \boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{z}_1 - \boldsymbol{\mu}_1), \quad \boldsymbol{\Sigma}^*_{2|1} = \boldsymbol{\Sigma}^*_{2|1}(\mathbf{z}_1) = \frac{\nu + \delta(\mathbf{z}_1; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})}{\nu + q_1}\, \boldsymbol{\Sigma}_{2|1}, \qquad (24)$$

with $\boldsymbol{\Sigma}_{2|1} = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}$ and $\delta(\mathbf{z}_1; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}) = (\mathbf{z}_1 - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{z}_1 - \boldsymbol{\mu}_1)$. In particular, if we set $\mathbf{Z} = (\mathbf{X}', Y)'$, then (22) coincides with FMT if $\zeta_g = \nu_g + d$ and $\sigma_{\varepsilon,g}^{2*} = \sigma_{\varepsilon,g}^2[\nu_g + \delta(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)]/(\nu_g + d)$. □
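The marginal-conditional factorization (23)-(24) can be checked numerically in the bivariate case ($q_1 = q_2 = 1$): the joint $t$ density equals the marginal $t$ in the first coordinate times a conditional $t$ with $\nu + 1$ degrees of freedom and suitably inflated scale. Parameter values below are our own illustrative choices.

```python
# Numerical check of the factorization (23)-(24) for q = 2 (q1 = q2 = 1):
# joint bivariate t = marginal t in x  *  conditional t in y with nu + 1
# degrees of freedom and scale inflated by (nu + delta)/(nu + 1).
# Parameter values are illustrative.
import math

def t1(u, mu, s2, nu):
    """Univariate t density in the parameterization of (20)."""
    return (math.gamma((nu + 1) / 2) * nu ** (nu / 2)
            / (math.gamma(nu / 2) * math.sqrt(math.pi * s2)
               * (nu + (u - mu) ** 2 / s2) ** ((nu + 1) / 2)))

def t2(x, y, mu, S, nu):
    """Bivariate t density in the parameterization of (20)."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    dx, dy = x - mu[0], y - mu[1]
    delta = (S[1][1] * dx * dx - 2 * S[0][1] * dx * dy + S[0][0] * dy * dy) / det
    return (math.gamma((nu + 2) / 2) * nu ** (nu / 2)
            / (math.gamma(nu / 2) * math.pi * math.sqrt(det)
               * (nu + delta) ** ((nu + 2) / 2)))

mu, S, nu = (1.0, -0.5), [[2.0, 0.6], [0.6, 1.5]], 4.0
x, y = 0.2, 0.9

# Conditional parameters from (24).
delta1 = (x - mu[0]) ** 2 / S[0][0]
mu_cond = mu[1] + S[1][0] / S[0][0] * (x - mu[0])
s2_cond = (S[1][1] - S[1][0] * S[0][1] / S[0][0]) * (nu + delta1) / (nu + 1)

lhs = t2(x, y, mu, S, nu)
rhs = t1(x, mu[0], S[0][0], nu) * t1(y, mu_cond, s2_cond, nu + 1)
```

The equality holds at every point; note that, unlike the Gaussian case, the conditional scale depends on $\mathbf{z}_1$ through $\delta(\mathbf{z}_1; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$, which is precisely why FMT is only a restricted special case of linear $t$-CWM.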

Thus, the linear $t$-CWM in (22) defines a wide family of densities, which strictly includes FMT as a special case but is able to model more general cases. Analogously to the Gaussian case, the linear $t$-CWM in (22) also includes finite mixtures of regression models with Student-$t$ errors (FMR-$t$):

$$f(y|\mathbf{x}; \boldsymbol{\psi}) = \sum_{g=1}^G t(y; \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma_{\varepsilon,g}^2, \zeta_g)\, \pi_g, \qquad (25)$$

where the vector $\boldsymbol{\psi}$ denotes the overall parameters of the model.


Moreover, because the Gaussian distribution can be regarded as the limit of the Student-t distribution as the number of degrees of freedom tends to infinity, linear t-CWM contains FMRC as a limiting special case. The relationships among linear t-CWM, FMT, FMR-t, and FMRC are summarized in Table 2. However, the analysis of finite mixtures of regressions under Student-t distributional assumptions, which (as far as the authors know) has not yet been proposed in the literature, provides ideas for further research. A numerical analysis comparing the classifications obtained by FMT and linear t-CWM will be presented in Section 5.2, and a procedure for robust clustering of noise data will be proposed in Section 5.3. Finally, in Table 3 we summarize all the models discussed in the paper, showing that linear CWM based on elliptical distributions is a general and flexible family of mixture models which includes well-known models as special cases. We remark that if the degrees of freedom become large, linear Gaussian CWM can be seen as a limiting special case of linear t-CWM.

5. Empirical Studies

The statistical models introduced in the previous sections have been evaluated through many empirical studies based on both simulated and real data. The CWM parameters have been estimated by means of the EM algorithm according to the maximum likelihood approach, and the routines have been implemented in R with different initialization strategies in order to avoid local optima. The number of clusters has been assumed as given, for two reasons. First, the goal of this section is to point out some differences among the models discussed in the paper for fixed numbers of clusters, as is done in many similar studies in the literature. Second, many criteria for estimating the number of clusters have been proposed in the area of model-based clustering (see e.g. Fonseca 2008 for a comprehensive review); however, their application to CWM (which combines both conditional and marginal distributions) needs to be investigated carefully, in order to avoid overly simplistic proposals.

This section is organized as follows. In Subsection 5.1, we compare linear Gaussian CWM, FMR, and FMRC; in Subsection 5.2, we compare linear t-CWM and FMT; then, extending the proposals by Peel and McLachlan (2000) and Greselin and Ingrassia (2010) for FMT, in Subsection 5.3 we propose a procedure for robust clustering of noise data and present some numerical studies.

5.1 Comparisons Among Linear Gaussian CWM, FMR, and FMRC

Table 2. Relationships between linear t-CWM and Student-t based mixtures.

model   p(x|Ω_g)    p(y|x, Ω_g)   parameterization of π_g   assumptions
FMT     Student-t   Student-t     none                      ζ_g = ν_g + d, σ²*_ε,g = σ²_ε,g[ν_g + δ(x; μ_g, Σ_g)]/(ν_g + d)
FMR-t   none        Student-t     none                      (μ_g, Σ_g, ν_g) = (μ, Σ, ν), g = 1, ..., G
FMRC    none        Student-t     logistic                  ν_g → ∞, Σ_g = Σ, π_g = π, g = 1, ..., G

Table 3. Overview of models included in linear CWM based on elliptical distributions.

model   p(x|Ω_g)              p(y|x, Ω_g)                       parameterization of π_g   assumptions
t-CWM   t_d(μ_g, Σ_g, ν_g)    t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   none                      none
G-CWM   t_d(μ_g, Σ_g, ν_g)    t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   none                      ν_g → ∞, ζ_g → ∞, g = 1, ..., G
FMG     t_d(μ_g, Σ_g, ν_g)    t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   none                      ν_g → ∞, ζ_g → ∞, g = 1, ..., G
FMT     t_d(μ_g, Σ_g, ν_g)    t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   none                      ζ_g = ν_g + d, σ²*_ε,g = σ²_ε,g[ν_g + δ(x; μ_g, Σ_g)]/(ν_g + d)
FMR-t   none                  t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   none                      (μ_g, Σ_g, ν_g) = (μ, Σ, ν), g = 1, ..., G
FMR     none                  t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   none                      ζ_g → ∞, ν_g → ∞, (μ_g, Σ_g) = (μ, Σ), g = 1, ..., G
FMRC    none                  t(y; b_g'x + b_g0, σ²_ε,g, ζ_g)   logistic                  ν_g → ∞, Σ_g = Σ, π_g = π, g = 1, ..., G

To begin with, we present some simulation studies to empirically verify the theoretical results of Section 3, that is, that linear Gaussian CWM contains FMR and FMRC as special cases (in particular, data modeling via FMRC has been carried out by means of the Flexmix R package; see Leisch 2004). To this aim, we considered some cases concerning the classification of units $(x, y) \in \mathbb{R}^2$. The data were obtained as follows: first, we generated samples $\mathbf{x}_1, \ldots, \mathbf{x}_G$ from the random variable $X$ according to $G$ Gaussian distributions with parameters $(\mu_g, \sigma_g^2)$; each sample $\mathbf{x}_g$ has size $N_g$ ($g = 1, \ldots, G$). Subsequently, from $\mathbf{x}_g = (x_{g1}, \ldots, x_{gN_g})'$ we obtained the vector $\mathbf{y}_g = (y_{g1}, \ldots, y_{gN_g})'$ considering realizations of the random variables $Y_{gn} = b_{g1}x_{gn} + b_{g0} + \varepsilon_g$, for $g = 1, \ldots, G$ and $n = 1, \ldots, N_g$, where $b_{g1}, b_{g0} \in \mathbb{R}$ and $\varepsilon_g \sim N(0, \sigma_{\varepsilon,g}^2)$.
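As stated at the beginning of this section, the CWM parameters are estimated by maximum likelihood via the EM algorithm. A minimal EM sketch for the linear Gaussian CWM with $d = 1$, on synthetic data of the kind just described, might look as follows; the data-generating values, the single median-split initialization, and all function names are our own illustrative choices (the paper's R routines use several initialization strategies to avoid local optima).

```python
# Minimal EM sketch for linear Gaussian CWM (d = 1, G = 2) on synthetic data
# generated as described above. A single crude initialization is used here.
import math, random

random.seed(1)

# Synthetic data: two groups with their own X-distribution and line (ours).
data = []
for mu, b0, b1 in [(-5.0, 1.0, 2.0), (5.0, -1.0, -2.0)]:
    for _ in range(150):
        x = random.gauss(mu, 1.5)
        data.append((x, b1 * x + b0 + random.gauss(0.0, 1.0)))

def phi(u, mean, var):
    return math.exp(-0.5 * (u - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

G = 2
# Crude initialization: split the data at the median of x.
xs = sorted(x for x, _ in data)
med = xs[len(xs) // 2]
theta = []
for g in range(G):
    sub = [(x, y) for x, y in data if (x < med) == (g == 0)]
    mx = sum(x for x, _ in sub) / len(sub)
    my = sum(y for _, y in sub) / len(sub)
    theta.append({"pi": 0.5, "mu": mx, "s2x": 1.0, "b1": 0.0, "b0": my, "s2e": 1.0})

for _ in range(50):
    # E-step: posterior weights as in (6).
    W = []
    for x, y in data:
        t = [p["pi"] * phi(y, p["b1"] * x + p["b0"], p["s2e"])
             * phi(x, p["mu"], p["s2x"]) for p in theta]
        s = sum(t)
        W.append([v / s for v in t])
    # M-step: weighted moments for p(x|g), weighted least squares for p(y|x,g).
    for g, p in enumerate(theta):
        w = [W[n][g] for n in range(len(data))]
        sw = sum(w)
        p["pi"] = sw / len(data)
        p["mu"] = sum(wi * x for wi, (x, _) in zip(w, data)) / sw
        p["s2x"] = sum(wi * (x - p["mu"]) ** 2 for wi, (x, _) in zip(w, data)) / sw
        my = sum(wi * y for wi, (_, y) in zip(w, data)) / sw
        sxy = sum(wi * (x - p["mu"]) * (y - my) for wi, (x, y) in zip(w, data)) / sw
        p["b1"] = sxy / p["s2x"]
        p["b0"] = my - p["b1"] * p["mu"]
        p["s2e"] = sum(wi * (y - p["b1"] * x - p["b0"]) ** 2
                       for wi, (x, y) in zip(w, data)) / sw
```

With groups this well separated, the fitted locations converge close to the generating values of ±5; real applications would add a log-likelihood stopping rule and multiple restarts.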

G Ng g=1

and

n=1 (zng

¯)(zng − ¯z) , −z

with z = (x, y) and z¯g = (¯ xg , y¯g ) being the vector mean of the g-th group. 2. An index of weighted model fitting (IWF) E defined as: ⎛ ⎡ ⎛ ⎞⎤2 ⎞1/2 N G   1 ⎣y n − ⎝ E =⎝ μ(xn ; β g )p(Ωg |xn , yn )⎠⎦ ⎠ . (27) N n=1

g=1

3. The misclassification rate η , which is the percentage of units classified in the wrong class. Example 1 Let us consider an example with G = 2 groups in which the densities of X|Ωg are not the same. Data were generated according to the following parameters:

          N_g   µ_g   σ_g   b_g0   b_g1   σ_{ε,g}
Group 1   100    10    2      2      6      2
Group 2   200   −10    2      4     −6      2

and are shown in Figure 2.
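As a concrete illustration, the generating scheme of Example 1 and the Wilks' lambda criterion (26) can be sketched in Python (an illustrative sketch with hypothetical variable names, not the authors' code; Λ is computed here on the true grouping):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 1 parameters: (N_g, mu_g, sigma_g, b_g0, b_g1, sigma_eps_g)
params = [(100, 10.0, 2.0, 2.0, 6.0, 2.0),
          (200, -10.0, 2.0, 4.0, -6.0, 2.0)]

X, Y, labels = [], [], []
for g, (N, mu, sigma, b0, b1, s_eps) in enumerate(params):
    x = rng.normal(mu, sigma, N)               # X | Omega_g ~ N(mu_g, sigma_g^2)
    y = b0 + b1 * x + rng.normal(0, s_eps, N)  # Y | x, Omega_g ~ N(b_g0 + b_g1 x, sigma_eps^2)
    X.append(x); Y.append(y); labels.append(np.full(N, g))
Z = np.column_stack([np.concatenate(X), np.concatenate(Y)])
labels = np.concatenate(labels)

def wilks_lambda(Z, labels):
    """Lambda = det(W) / det(T), with W the within-class and T the total scatter matrix."""
    T = (Z - Z.mean(0)).T @ (Z - Z.mean(0))
    W = sum((Z[labels == g] - Z[labels == g].mean(0)).T
            @ (Z[labels == g] - Z[labels == g].mean(0))
            for g in np.unique(labels))
    return np.linalg.det(W) / np.linalg.det(T)

lam = wilks_lambda(Z, labels)  # values near 0 indicate well-separated groups
print(lam)
```

With these parameters the groups are well separated along x, so Λ comes out close to the value 0.0396 reported below for the CWM classification.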

Figure 2. Example 1. True distribution, data classification, and fitted lines according to CWM and FMR. The symbols + and ◦ denote data classified into groups 1 and 2, respectively.

The results obtained with CWM and FMR are summarized in the following table:

          Λ        E        η
CWM    0.0396    2.003    0.00%
FMR    0.2306    7.013    5.33%

The analysis shows that CWM leads to well-separated groups (Λ = 0.0396), in which all units have been classified properly (η = 0.00%) and the local models p(y|x, Ω_g) exhibit a good fit to the data (E = 2.003). By contrast, FMR leads to partially overlapping groups (Λ = 0.2306), a poor data fit (E = 7.013), and a higher misclassification rate (η = 5.33%). Thus, CWM clearly outperforms FMR. The classification of data according to CWM and FMR is shown in Figure 2. FMRC leads to essentially the same classification as CWM.

Example 2. Let us consider another example in which the densities of X|Ω_g are not the same, but with G = 3 groups. Data were generated according to the following parameters:

          N_g   µ_g   σ_g   b_g0   b_g1   σ_{ε,g}
Group 1   100     5    1     40      6      2
Group 2   200    10    2     40   −1.5      1
Group 3   150    20    3    150      7      2

and are shown in Figure 3. The results obtained with CWM and FMR are summarized in the following table:

          Λ        E        η
CWM    0.0498    1.678    0.00%
FMR    0.0909    1.647    8.67%

The analysis shows that CWM leads to well-separated groups (Λ = 0.0498), in which all units have been classified properly (η = 0.00%) and the local models p(y|x, Ω_g) exhibit a good fit to the data (E = 1.678). FMR yields a slightly greater value of Wilks' lambda (Λ = 0.0909) and essentially the same fit to the data (E = 1.647), but a higher misclassification rate (η = 8.67%). The classification of data according to CWM and FMR is shown in Figure 3; FMRC achieves the same classification as CWM.

Example 3. Let us consider another example with G = 2 groups, in which the densities of X|Ω_g are not the same but the groups partially overlap. Data were generated according to the following parameters:

Figure 3. Example 2. True distribution, data classification, and fitted lines according to CWM and FMR. The symbols +, ◦ and  denote data classified into groups 1, 2 and 3, respectively.

          N_g   µ_g   σ_g   b_g0   b_g1   σ_{ε,g}
Group 1   150    −2    6      4    0.5      1
Group 2   150     0    2      4    1.0      3

and are shown in Figure 4. The results obtained with the three models are summarized in the following table:

           Λ        E        η
CWM     0.9354    2.86    22.67%
FMR     0.9680    2.34    25.00%
FMRC    0.7102    2.34    30.00%

The analysis shows that CWM and FMR lead to comparable results in terms of the indices Λ, E, and η; however, the confusion matrices given in Table 4 indicate that FMR leads to a classification biased toward group 2. FMRC attains better values of Λ and E but a worse value of η, as it is not able to recognize overlapping groups. Figure 4 highlights the different classifications obtained with CWM, FMR, and FMRC.

Example 4 (Plasma Concentration of Beta-Carotene). The last example concerns a case study based on real data on beta-carotene plasma levels, described in Nierenberg, Stukel, Baron, Dain, and Greenberg (1989) and available in the R package CAMAN. The data were recently modeled in Schlattmann (2009) using FMR, and here we compare that approach with CWM. Our analysis concentrates on the relationship between the beta-carotene plasma level and the amount of beta-carotene in the diet, using a subset of N = 144 male non-smokers. Moreover, following the scheme in Schlattmann (2009), we considered G = 4 groups.

First, we remark that the Bayesian information criterion (BIC) assumes a slightly greater value for CWM (−4298.6) than for FMR (−4324.5) (see also Table 5); we can compare the two values because FMR can be regarded as a model nested within CWM (see Section 3.2). Moreover, CWM leads to better-separated groups (Λ = 0.0934) than FMR (Λ = 0.2655); in contrast, the IWF index for CWM is slightly worse (E = 70.3925) than for FMR (E = 54.1186). However, considering the range of beta-carotene plasma levels, this difference is not relevant, and the fit may be considered good in both cases. The classification according to CWM and FMR is shown in Figure 5. Finally, the parameter estimates are reported in Table 6, where the standard errors of the estimates are given in parentheses.

FMR classifies individuals along four straight lines over the dietary beta-carotene axis (see Figure 5). The effect of dietary beta-carotene is

Figure 4. Example 3. True distribution, data classification, and fitted lines according to CWM and FMR. The symbols +, ◦, and  denote data classified into groups 1, 2 and 3, respectively.

rather small in the first subpopulation (b11 = 0.0032), slightly greater in the second and fourth subpopulations (b21 = 0.0301 and b41 = 0.0419), but much larger in the third subpopulation (b31 = 0.2572). CWM produces a different classification, which takes into account the distribution of dietary beta-carotene (see Figure 5); in particular, a subpopulation with a negative relationship between beta-carotene plasma level and dietary beta-carotene (b21 = −0.0393) is identified. To complete the data analysis, first we remark that the histogram of data concerning the amount of beta-carotene in the diet shows that the population is heterogeneous with respect to the independent variable (see Figure 6). This heterogeneity can be captured by CWM (which models the joint distribution) but not by FMR (which models only the conditional distribution). Second, we point out that group 2 in CWM exhibits a negative slope

Table 4. Example 3. Confusion matrices and misclassification rates according to CWM, FMR, and FMRC.

CWM (η = 22.67%):

           classified
true       1      2
1        114     36
2         32    118
total    146    154

FMR (η = 25%):

           classified
true       1      2
1         91     59
2         16    134
total    107    193

FMRC (η = 30%):

           classified
true       1      2
1        109     41
2         49    101
total    158    142
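The misclassification rates in Table 4 follow directly from the off-diagonal entries of each confusion matrix; a standalone check:

```python
def misclassification_rate(cm):
    """Fraction of off-diagonal units in a confusion matrix (rows: true, cols: classified)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return (total - correct) / total

# Table 4, CWM and FMR confusion matrices.
eta_cwm = misclassification_rate([[114, 36], [32, 118]])
eta_fmr = misclassification_rate([[91, 59], [16, 134]])
print(round(100 * eta_cwm, 2), round(100 * eta_fmr, 2))
```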

Table 5. Betaplasma data. Values of BIC, Λ and E in CWM and FMR.

          CWM        FMR
BIC    −4298.6    −4324.5
Λ       0.0934     0.2655
E       70.38      54.11

Table 6. Betaplasma data. Mixing weights and parameter estimates (with standard errors) of CWM and FMR.

group   parameter        CWM                      FMR
1       π1             0.4036                   0.3917
        b10          103.0155 (2.2870)         89.9892 (1.0669)
        b11            0.0087 (0.0011)          0.0032 (0.0004)
2       π2             0.2789                   0.3852
        b20          436.1387 (5.1796)        125.4027 (1.4870)
        b21           −0.0393 (0.0014)          0.0301 (0.0006)
3       π3             0.0283                   0.1016
        b30          486.4574 (25.6135)       −11.9203 (9.5508)
        b31            0.1590 (0.0058)          0.2572 (0.0033)
4       π4             0.2893                   0.1215
        b40          112.1419 (5.1502)        249.8431 (1.6601)
        b41            0.0524 (0.0051)          0.0419 (0.0006)

Figure 5. Betaplasma data. Classification and fitted lines according to CWM and FMR.

Figure 6. Betaplasma data. Histogram of the amount of beta-carotene in the diet.

(b21 = −0.0393), while FMR leads to a model with all positive slopes. The identification of a subpopulation that reacts negatively to dietary beta-carotene seems to confirm recent findings about the adverse effect of antioxidant intake on the incidence of lung cancer (see Schlattmann 2009, p. 11). This might be a starting point for further investigations in the biomedical area.

5.2 A Comparison Between Linear t-CWM and FMT in Robust Clustering

In this section, we present results concerning robust classification via linear t-CWM and FMT, based on a real data set about rock crabs of the genus Leptograpsus studied by Campbell and Mahon (1974) (available at http://www.stats.ox.ac.uk/pub/PRNN/). Attention is focused on a sample of n = 100 blue crabs, with n1 = 50 males and n2 = 50 females (see Figure 7). Each specimen has five measurements (expressed in mm): width of the frontal lip (FL), rear width (RW), length along the midline (CL), maximum width of the carapace (CW), and body depth (BD). According to the classes of application of CWM introduced in Section 2, this case study concerns a direct application of type B; in particular, in (22) the variable CL has been selected as the Y-variable. We compare linear t-CWM and FMT after introducing some outliers into the original dataset; this was done by adding various constant values to the second variate of the 25th point, according to the procedure proposed in McLachlan and Peel (2000) and Peel and McLachlan (2000).

In Table 7, we report the overall misclassification rate of linear t-CWM and FMT for each perturbed version of the original dataset. We read Table 7 beginning from the central row, where the original dataset is considered (the added constant is zero); in this case, linear t-CWM outperforms FMT. Scanning the subsequent rows of the table, where the constant values increase progressively from 5 to 20 mm, the error rate for linear t-CWM remains almost unchanged, while for FMT it slowly increases (reaching 20%). Finally, we remark that the misclassification error rates obtained via linear t-CWM are equivalent to those obtained in Greselin and Ingrassia (2010) using suitable constraints on the eigenvalues of the covariance matrices of the two groups.

5.3 A Procedure for Robust Clustering of Noise Data

By extending the proposals by Peel and McLachlan (2000) and Greselin and Ingrassia (2010) for FMT, in this section we propose a procedure for robust clustering of noise data and present some numerical studies.

Figure 7. Scatterplot matrix of the crab dataset.

Table 7. Crab dataset. Comparison of error rates when fitting linear t-CWM and FMT with the modification of a single point.

Constant   linear t-CWM   FMT
  −15          13%        19%
  −10          13%        19%
   −5          13%        20%
    0          12%        18%
    5          12%        20%
   10          11%        20%
   15          12%        20%
   20          12%        20%

The procedure consists of three steps. First, we identify a subset O of units that are marked as outliers (i.e., noise data). Second, we model the reduced dataset D' = D \ O using CWM and estimate the parameters. Finally, based on this estimate, we classify the whole dataset D into G groups plus a group of noise data.

The first step can be performed following different strategies. Once the estimates of the parameters of the g-th group (g = 1, ..., G) have been obtained, consider the squared Mahalanobis distance between each unit and the g-th local estimate. In the framework of robust clustering via FMT, Peel and McLachlan (2000) proposed an approach based on maximum likelihood; in particular, an observation x_n is treated as an outlier (and thus classified as noise data) if

Σ_{g=1}^{G} ẑ_gn δ(x_n; µ̂_g, Σ̂_g) > χ²_{1−α}(q) ,

where ẑ_gn = 1 if unit x_n is classified in the g-th group according to the maximum posterior probability and 0 otherwise, δ(x_n; µ̂_g, Σ̂_g) = (x_n − µ̂_g)' Σ̂_g^{−1} (x_n − µ̂_g), and χ²_{1−α}(ν) denotes the quantile of order 1 − α of the chi-squared distribution with ν degrees of freedom. More recent approaches are based on the forward search (Riani, Cerioli, Atkinson, Perrotta, and Torti 2008; Riani, Atkinson, and Cerioli 2009), on maximum likelihood estimation with a trimmed sample (Cuesta-Albertos, Matrán, and Mayo-Iscar 2008; Gallegos and Ritter 2009a, 2009b), and on multivariate outlier tests based on the minimum covariance determinant estimator (Cerioli 2010). For the scope of the present paper, we followed Peel and McLachlan's strategy using a Student-t CWM; CWM of noise data according to other strategies provides ideas for further research.

As far as the second step is concerned, once the reduced dataset D' = D \ O has been obtained, the parameters are estimated according to either Gaussian or Student-t CWM. In the following, these strategies will be referred to as Student-Gaussian CWM (tG-CWM) and Student-Student CWM (tt-CWM), respectively. Finally, the data are classified into G + 1 groups. A similar strategy has also been considered in Greselin and Ingrassia (2010).

In the following, we present the results of some numerical studies.
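The outlier-flagging rule of the first step can be sketched as follows (a minimal sketch with hypothetical group estimates; the assignment ẑ is approximated here by the closest-group rule, and the chi-squared quantile is hard-coded to keep the sketch dependency-free):

```python
import numpy as np

# Hypothetical local estimates for G = 2 groups in d = 2 dimensions.
mu_hat = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
Sigma_hat = [np.eye(2), np.eye(2)]

# 0.975 quantile of the chi-squared distribution with q = 2 degrees of freedom,
# i.e. -2 * ln(0.025) for the 2-df case.
CHI2_975_DF2 = 7.378

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance delta(x; mu, Sigma)."""
    diff = x - mu
    return float(diff @ np.linalg.inv(Sigma) @ diff)

def is_outlier(x, mu_hat, Sigma_hat, threshold=CHI2_975_DF2):
    """Flag x as noise if its distance to the closest local estimate exceeds the quantile."""
    # Closest-group assignment stands in for the maximum-posterior z_hat of the paper.
    d_min = min(mahalanobis_sq(x, m, S) for m, S in zip(mu_hat, Sigma_hat))
    return d_min > threshold

print(is_outlier(np.array([0.5, -0.5]), mu_hat, Sigma_hat))   # point near group 1
print(is_outlier(np.array([5.0, -20.0]), mu_hat, Sigma_hat))  # point far from both groups
```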

Example 5. The first simulated dataset concerns a sample of 300 units generated according to (5) with G = 3, d = 1, π_1 = π_2 = π_3 = 1/3. The parameters are listed in the following table for two different values σ = σ_ε = 2 and σ = σ_ε = 4:

          N_g   µ_g   σ_g   b_g0   b_g1   σ_{ε,g}
Group 1   100     5    σ      40      6     σ
Group 2   100    10    σ      40   −1.5     σ
Group 3   100    20    σ     150     −7     σ

Figure 8. Example 5. a) data with σ = 2; b) data with σ = 4 (circles represent noise).

The sample data {(x_n, y_n)}_{n=1,...,300} were obtained as follows: first, we generated the samples x_1, ..., x_G according to the G = 3 Gaussian distributions with parameters (µ_g, σ_g), g = 1, ..., G. Next, for each x_g we generated the values y_g according to a Gaussian distribution with mean b_g0 + b_g1 x and variance σ_ε². Then, the above dataset was augmented by a sample of 50 points generated from a uniform distribution on the rectangle [−5, 30] × [−50, 130] to simulate noise. Thus, the whole dataset D contains N = 350 units (see Figure 8).

The results are summarized in Table 8. In practice, tG-CWM and tt-CWM have almost the same performance. In the case σ = 2, tt-CWM slightly outperforms tG-CWM (misclassification rates η = 5.71% and 6.00%, respectively); moreover, tt-CWM recognizes a larger number of outliers than tG-CWM. The reverse occurs in the case σ = 4, where we observe η = 4.29% for tG-CWM and η = 5.71% for tt-CWM. We remark that the smallest misclassification error η corresponds to the model with the smallest mean squared error E.

Table 8. Example 5. Confusion matrices, mean squared error, and misclassification rate for Student-Gaussian CWM and Student-Student CWM.

a) Student-Gaussian CWM

case σ = 2 (E = 7.97, η = 6.00%):

             estimated
true        1     2     3   outlier
1          98     0     0      2
2           0    97     0      3
3           0     0   100      0
outlier     1     0    15     34

case σ = 4 (E = 4.34, η = 4.29%):

             estimated
true        1     2     3   outlier
1          99     0     0      1
2           0    98     1      1
3           0     1    99      0
outlier     5     2     4     39

b) Student-Student CWM

case σ = 2 (E = 2.97, η = 5.71%):

             estimated
true        1     2     3   outlier
1          94     0     0      6
2           0    92     0      8
3           0     0    99      1
outlier     1     0     4     45

case σ = 4 (E = 5.8, η = 5.71%):

             estimated
true        1     2     3   outlier
1          98     0     0      2
2           0    93     4      3
3           0     0   100      0
outlier     4     0     7     39

Example 6. The second simulated example concerns a dataset of size 150 generated according to model (5) with G = 3, d = 1, π_1 = π_2 = π_3 = 1/3. The parameters are listed in the following table for two different values σ = σ_ε = 2 and σ = σ_ε = 4:

          N_g   µ_g   σ_g   b_g0   b_g1   σ_{ε,g}
Group 1    50     5    σ       2      6     σ
Group 2    50    10    σ       2   −1.5     σ
Group 3    50    40    σ       2     −7     σ

i.e., the data are divided into G = 3 groups along one straight line. Next, we added to the previous data a sample of 25 points generated from a uniform distribution on the rectangle [−5, 30] × [−50, 130] to simulate noise. Thus, D contains N = 175 units (see Figure 9). The results are summarized in Table 9. In the case σ = 2, tG-CWM slightly outperforms tt-CWM (misclassification rates η = 4.00% and 5.14%, respectively); in the case

Figure 9. Example 6. a) data with σ = 2; b) data with σ = 4 (circles represent noise).

when σ = 4, tG-CWM essentially identifies only two groups (and thus η = 40%), while tt-CWM recognizes the three groups with a misclassification rate of η = 8.00%. Figure 9b) explains the reason for the large misclassification error in data fitting via tG-CWM: two clusters are very close, and tG-CWM identifies these two clusters as a whole, while tt-CWM correctly separates them. We point out that, in this case too, the smallest misclassification error η corresponds to the model with the smallest mean squared error E.

Example 7. The third example concerns a dataset of size 300 generated according to (5) with G = 2, d = 2, π_1 = π_2 = 1/2, with the following parameters for p(y|x, Ω_g): µ(x, β_1) = 6x_1 + 1.2x_2

and μ(x, β 2 ) = −1.5x1 + 3x2 ,

Table 9. Example 6. Confusion matrices, mean squared error, and misclassification rate for Student-Gaussian CWM and Student-Student CWM.

a) Student-Gaussian CWM

case σ = 2 (E = 2.29, η = 4.00%):

             estimated
true        1     2     3   outlier
1          47     0     0      3
2           0    50     0      1
3           0     0    49      1
outlier     0     2     1     22

case σ = 4 (E = 71.39, η = 40.00%):

             estimated
true        1     2     3   outlier
1           0    50     0      0
2           4    50     0      0
3           0     0    50      4
outlier    19     0     1      5

b) Student-Student CWM

case σ = 2 (E = 7.25, η = 5.14%):

             estimated
true        1     2     3   outlier
1          46     0     0      4
2           0    49     0      1
3           0     0    48      2
outlier     0     1     1     23

case σ = 4 (E = 31.96, η = 8.00%):

             estimated
true        1     2     3   outlier
1          49     0     0      1
2           4    46     0      0
3           0     0    46      4
outlier     0     0     5     20

that is, b_1 = (6, 1.2)' and b_2 = (−1.5, 3)', and the following parameters for p(x|Ω_g) = φ_2(x; µ_g, Σ_g), g = 1, 2, for two different values σ_1 = σ_2 = σ and σ_{ε,1} = σ_{ε,2} = σ_ε, i.e., σ = σ_ε = 2 and σ = σ_ε = 4:

µ_1 = (5, 20)' ,  Σ_1 = [ 4  −0.1 ; −0.1  4 ]   and   µ_2 = (2, 4)' ,  Σ_2 = [ 4  0.1 ; 0.1  4 ] .

Next, a sample of 50 points generated from a uniform distribution on the rectangle [−5, 40] × [−5, 40] × [−20, 170] was added to simulate noise; thus, the dataset D contains N = 350 units. The results are summarized in Table 10. In the case σ = 2, tG-CWM slightly outperforms tt-CWM (misclassification rates η = 2.00% and 2.29%, respectively); similar results are obtained in the case σ = 4, where η = 6.57% and 7.43%, respectively. Again, the smallest misclassification error η is attained by the model with the smallest mean squared error E.

6. Concluding Remarks

Table 10. Example 7. Confusion matrices, mean squared error, and misclassification rate for Student-Gaussian CWM and Student-Student CWM.

a) Student-Gaussian CWM

case σ = 2 (E = 2.08, η = 2.00%):

             estimated
true        1     2   outlier
1         149     0      1
2           0   144      6
outlier     0     0     50

case σ = 4 (E = 2.32, η = 6.57%):

             estimated
true        1     2   outlier
1         139     0     11
2           0   138     12
outlier     0     0     50

b) Student-Student CWM

case σ = 2 (E = 3.94, η = 2.29%):

             estimated
true        1     2   outlier
1         149     1      0
2           0   145      5
outlier     0     2     48

case σ = 4 (E = 4.64, η = 7.43%):

             estimated
true        1     2   outlier
1         140     1      9
2           0   135     15
outlier     0     1     49

Cluster-weighted modeling based on elliptical distributions was proposed as a general and flexible family of mixture models. For the Gaussian case, the relationships between CWM and some competing local statistical models (such as mixtures of distributions and mixtures of regressions) were investigated in the first part of the paper. In particular, based on analytical arguments, we showed that CWM can be regarded as a generalization of such models. Moreover, CWM was introduced in a general setting that also includes non-linear relationships; examples concerning quadratic and cubic functions were reported. In this spirit, we remark that Proposition 2 could be extended to finite mixtures of generalized linear models. Further, some numerical simulations showed that CWM provides a very flexible and powerful framework for data classification and data fitting.

In the second part of the paper, we introduced a new cluster-weighted modeling framework based on Student-t distributions, proving that linear t-CWM defines a wide family of mixture models for robust clustering. Moreover, following an approach by Peel and McLachlan (2000), a procedure for robust clustering of noise data was proposed. In this context, the recent literature on robust parameter estimation provides ideas for further research (e.g. Riani et al. 2008; Riani et al. 2009; Cuesta-Albertos et al. 2008; Gallegos and Ritter 2009a, 2009b; Cerioli 2010).

Other important issues deserve the attention of further research. The first one concerns computational aspects of parameter estimation in CWM. Parameters were estimated according to the maximum likelihood approach by means of the EM algorithm; however, the behaviour of the EM algorithm under different conditions was not studied. Numerical simulations

confirmed the findings of Faria and Soromenho (2010) on fitting mixtures of linear regressions, and we point out that algorithm initialization is quite critical. An initial guess was made according to either a preliminary clustering of the data using the k-means algorithm or a random grouping of the data, but our numerical studies pointed out that there is no optimal strategy. Finally, we remark that, to reduce such critical aspects, suitable constraints on the eigenvalues of the covariance matrices could be implemented (e.g. Ingrassia 2004; Ingrassia and Rocci 2007; Greselin and Ingrassia 2010). The second issue concerns mixtures of regression models with Student-t errors, introduced in Section 4. The third issue concerns the estimation of the number of clusters. As remarked above, many different criteria for model selection have been proposed in the model-based clustering literature (e.g. Fonseca 2008); the use of such indices for the selection of cluster-weighted models needs to be carefully analyzed and evaluated in a subsequent paper.

Appendix: Decision Surfaces of CWM

The potential of CWM as a general and flexible framework for classification purposes can also be illustrated from a geometrical point of view, by considering the decision surfaces that separate the groups. In the following, we discuss the binary case and prove that these decision surfaces belong to the family of quadrics. In the case of two groups, a decision surface is the set of (x, y) ∈ R^{d+1} such that p(Ω_0|x, y) = p(Ω_1|x, y) = 0.5. Given that p(x|Ω_g) π_g = p(Ω_g|x) p(x), we can rewrite p(Ω_1|x, y) as:

p(Ω_1|x, y) = p(y|x, Ω_1) p(Ω_1|x) / [ p(y|x, Ω_0) p(Ω_0|x) + p(y|x, Ω_1) p(Ω_1|x) ]
            = 1 / [ 1 + p(y|x, Ω_0) p(Ω_0|x) / ( p(y|x, Ω_1) p(Ω_1|x) ) ]
            = 1 / [ 1 + exp( − ln[ p(y|x, Ω_1)/p(y|x, Ω_0) ] − ln[ p(Ω_1|x)/p(Ω_0|x) ] ) ] .   (28)

Thus, p(Ω_1|x, y) = 0.5 when

ln[ p(y|x, Ω_1)/p(y|x, Ω_0) ] + ln[ p(Ω_1|x)/p(Ω_0|x) ] = 0 ,

which may be rewritten as:

ln[ p(y|x, Ω_1)/p(y|x, Ω_0) ] + ln[ p(x|Ω_1)/p(x|Ω_0) ] + ln( π_1/π_0 ) = 0 .   (29)

Figure 10. Examples of decision surfaces for linear Gaussian CWM (heteroscedastic case).

In the linear Gaussian CWM, the first and the second term in (29) are, respectively:

ln[ p(y|x, Ω_1)/p(y|x, Ω_0) ] = ln( σ_{ε,0}/σ_{ε,1} ) + (y − b_0'x − b_00)²/(2σ_{ε,0}²) − (y − b_1'x − b_10)²/(2σ_{ε,1}²)

and

ln[ p(x|Ω_1)/p(x|Ω_0) ] = (1/2) ln( |Σ_0|/|Σ_1| ) + (1/2)[ (x − µ_0)' Σ_0^{−1} (x − µ_0) − (x − µ_1)' Σ_1^{−1} (x − µ_1) ] .

Then, equation (29) is satisfied for (x, y) ∈ R^{d+1} such that:

ln( σ_{ε,0}/σ_{ε,1} ) + (y − b_0'x − b_00)²/(2σ_{ε,0}²) − (y − b_1'x − b_10)²/(2σ_{ε,1}²) + (1/2) ln( |Σ_0|/|Σ_1| )
    + (1/2)[ (x − µ_0)' Σ_0^{−1} (x − µ_0) − (x − µ_1)' Σ_1^{−1} (x − µ_1) ] + ln( π_1/π_0 ) = 0 ,   (30)

which defines quadratic surfaces, i.e., quadrics. Examples of quadrics are spheres, circular cylinders, and circular cones. In Figure 10, we give two examples of surfaces generated by (30).
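As a numerical sanity check of (28)–(30) in the simplest case d = 1 (all parameter values below are hypothetical, chosen only for this sketch): along a vertical line x = const, the posterior p(Ω_1|x, y) crosses 0.5 exactly where the left-hand side of (30) vanishes.

```python
import math

# Hypothetical parameters of a two-group linear Gaussian CWM with d = 1.
pi0, pi1 = 0.5, 0.5
mu0, s0 = -2.0, 1.0             # p(x | Omega_0) = N(mu0, s0^2)
mu1, s1 = 2.0, 1.0              # p(x | Omega_1) = N(mu1, s1^2)
b00, b01, se0 = 1.0, 2.0, 1.0   # p(y | x, Omega_0) = N(b00 + b01*x, se0^2)
b10, b11, se1 = -1.0, 0.5, 1.0  # p(y | x, Omega_1) = N(b10 + b11*x, se1^2)

def npdf(z, m, s):
    return math.exp(-((z - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def posterior1(x, y):
    """p(Omega_1 | x, y) via Bayes' theorem, as in (28)."""
    w0 = pi0 * npdf(x, mu0, s0) * npdf(y, b00 + b01 * x, se0)
    w1 = pi1 * npdf(x, mu1, s1) * npdf(y, b10 + b11 * x, se1)
    return w1 / (w0 + w1)

def lhs30(x, y):
    """Left-hand side of equation (30): the log posterior odds of group 1 versus group 0."""
    return (math.log(se0 / se1)
            + (y - b00 - b01 * x) ** 2 / (2 * se0 ** 2)
            - (y - b10 - b11 * x) ** 2 / (2 * se1 ** 2)
            + 0.5 * math.log((s0 * s0) / (s1 * s1))
            + 0.5 * (((x - mu0) / s0) ** 2 - ((x - mu1) / s1) ** 2)
            + math.log(pi1 / pi0))

# On the line x = 0, locate by bisection the y where the posterior equals 0.5,
# then verify that the same point satisfies (30).
x = 0.0
lo, hi = -10.0, 10.0  # posterior1(x, .) is decreasing in y for these parameters
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if posterior1(x, mid) > 0.5:
        lo = mid
    else:
        hi = mid
y_star = 0.5 * (lo + hi)
print(y_star, lhs30(x, y_star))
```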

Figure 11. Examples of decision surfaces for linear Gaussian CWM (homoscedastic case).

In the homoscedastic case Σ_0 = Σ_1 = Σ, it is well known that

ln[ p(x|Ω_1)/p(x|Ω_0) ] = (1/2)[ (x − µ_0)' Σ^{−1} (x − µ_0) − (x − µ_1)' Σ^{−1} (x − µ_1) ] = w'x + w_0 ,   (31)

where w = Σ^{−1}(µ_1 − µ_0) and w_0 = (1/2)(µ_0 + µ_1)' Σ^{−1} (µ_0 − µ_1). In this case, according to (31), equation (29) yields:

ln[ p(y|x, Ω_1)/p(y|x, Ω_0) ] + w'x + w_0 + ln( π_1/π_0 ) = 0

(see Figure 11).

In the linear t-CWM, the first and the second term in (29) are, respectively:

ln[ p(x|Ω_1)/p(x|Ω_0) ] = ln[ Γ((ν_1 + d)/2) Γ(ν_0/2) / ( Γ((ν_0 + d)/2) Γ(ν_1/2) ) ] + (1/2) ln( |Σ_0|/|Σ_1| )
    + ((ν_0 + d)/2) ln{ν_0 + δ(x; µ_0, Σ_0)} − ((ν_1 + d)/2) ln{ν_1 + δ(x; µ_1, Σ_1)}

and

ln[ p(y|x, Ω_1)/p(y|x, Ω_0) ] = ln[ Γ((ζ_1 + 1)/2) Γ(ζ_0/2) / ( Γ((ζ_0 + 1)/2) Γ(ζ_1/2) ) ] + ln( σ_{ε,0}/σ_{ε,1} )
    + ((ζ_0 + 1)/2) ln{ζ_0 + [(y − b_0'x − b_00)/σ_{ε,0}]²} − ((ζ_1 + 1)/2) ln{ζ_1 + [(y − b_1'x − b_10)/σ_{ε,1}]²} .

Then, equation (29) is satisfied for (x, y) ∈ R^{d+1} such that:

c(ν_0, ν_1, ζ_0, ζ_1) + ln( σ_{ε,0}/σ_{ε,1} ) + ((ζ_0 + 1)/2) ln{ζ_0 + [(y − b_0'x − b_00)/σ_{ε,0}]²}
    − ((ζ_1 + 1)/2) ln{ζ_1 + [(y − b_1'x − b_10)/σ_{ε,1}]²} + (1/2) ln( |Σ_0|/|Σ_1| )
    + ((ν_0 + d)/2) ln{ν_0 + δ(x; µ_0, Σ_0)} − ((ν_1 + d)/2) ln{ν_1 + δ(x; µ_1, Σ_1)} + ln( π_1/π_0 ) = 0 ,   (32)

where

c(ν_0, ν_1, ζ_0, ζ_1) = ln[ Γ((ζ_1 + 1)/2) Γ(ζ_0/2) / ( Γ((ζ_0 + 1)/2) Γ(ζ_1/2) ) ] + ln[ Γ((ν_1 + d)/2) Γ(ν_0/2) / ( Γ((ν_0 + d)/2) Γ(ν_1/2) ) ] .

We remark that, in this case, the decision surfaces are elliptical.

References

ANDERSON, J.A. (1972), "Separate Sample Logistic Discrimination", Biometrika, 59, 19–35.

ANDREWS, R.L., ANSARI, A., and CURRIM, I.S. (2002), "Hierarchical Bayes Versus Finite Mixture Conjoint Analysis Models: A Comparison of Fit, Prediction, and Partworth Recovery", Journal of Marketing, 39, 87–98.

ANDREWS, J.L., and McNICHOLAS, P.D. (2011), "Extending Mixtures of Multivariate t-Factor Analyzers", Statistics and Computing, 21(3), 361–373.

BAEK, J., and McLACHLAN, G.J. (2011), "Mixtures of Common t-Factor Analyzers for Clustering High-Dimensional Microarray Data", Bioinformatics, 27, 1269–1276.

BERNARDO, J.M., and GIRÓN, F.J. (1992), "Robust Sequential Prediction from Non-Random Samples: The Election Night Forecasting Case", in Bayesian Statistics 5, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, Oxford: Oxford University Press, pp. 61–77.

CAMPBELL, N.A., and MAHON, R.J. (1974), "A Multivariate Study of Variation in Two Species of Rock Crab of Genus Leptograpsus", Australian Journal of Zoology, 22, 417–455.

CERIOLI, A. (2010), "Multivariate Outlier Detection with High-Breakdown Estimators", Journal of the American Statistical Association, 105(489), 147–156.

CUESTA-ALBERTOS, J.A., MATRÁN, C., and MAYO-ISCAR, A. (2008), "Trimming and Likelihood: Robust Location and Dispersion Estimation in the Elliptical Model", The Annals of Statistics, 36(5), 2284–2318.

DAYTON, C.M., and MACREADY, G.B. (1988), "Concomitant-Variable Latent-Class Models", Journal of the American Statistical Association, 83, 173–178.

DESARBO, W.S., and CRON, W.L. (1988), "A Maximum Likelihood Methodology for Clusterwise Linear Regression", Journal of Classification, 5(2), 249–282.

DICKEY, J.T. (1967), "Matricvariate Generalizations of the Multivariate t Distribution and the Inverted Multivariate t Distribution", The Annals of Mathematical Statistics, 38, 511–518.

EVERITT, B.S., and HAND, D.J. (1981), Finite Mixture Distributions, London: Chapman & Hall.

FARIA, S., and SOROMENHO, G. (2010), "Fitting Mixtures of Linear Regressions", Journal of Statistical Computation and Simulation, 80, 201–225.

FONSECA, J.R.S. (2008), "Mixture Modeling and Information Criteria for Discovering Patterns in Continuous Data", Eighth International Conference on Hybrid Intelligent Systems, IEEE Computer Society.

FRÜHWIRTH-SCHNATTER, S. (2005), Finite Mixture and Markov Switching Models, Heidelberg: Springer.

GALLEGOS, M.T., and RITTER, G. (2009a), "Trimming Algorithms for Clustering Contaminated Grouped Data and Their Robustness", Advances in Data Analysis and Classification, 3, 135–167.

GALLEGOS, M.T., and RITTER, G. (2009b), "Trimmed ML Estimation of Contaminated Mixtures", Sankhyā, 71-A, Part 2, 164–220.

GERSHENFELD, N. (1997), "Nonlinear Inference and Cluster-Weighted Modeling", Annals of the New York Academy of Sciences, 808, 18–24.

GERSHENFELD, N. (1999), The Nature of Mathematical Modelling, Cambridge: Cambridge University Press, pp. 101–130.

GERSHENFELD, N., SCHÖNER, B., and METOIS, E. (1999), "Cluster-Weighted Modelling for Time-Series Analysis", Nature, 397, 329–332.

GRESELIN, F., and INGRASSIA, S. (2010), "Constrained Monotone EM Algorithms for Mixtures of Multivariate t Distributions", Statistics & Computing, 20, 9–22.

INGRASSIA, S. (2004), "A Likelihood-Based Constrained Algorithm for Multivariate Normal Mixture Models", Statistical Methods & Applications, 13, 151–166.

INGRASSIA, S., and ROCCI, R. (2007), "Constrained Monotone EM Algorithms for Finite Mixture of Multivariate Gaussians", Computational Statistics & Data Analysis, 51, 5339–5351.

JANSEN, R.C. (1993), "Maximum Likelihood in a Generalized Linear Finite Mixture Model by Using the EM Algorithm", Biometrics, 49, 227–231.

JORDAN, M.I. (1995), "Why the Logistic Function? A Tutorial Discussion on Probabilities and Neural Networks", MIT Computational Cognitive Science Report 9503.

JORDAN, M.I., and JACOBS, R.A. (1994), "Hierarchical Mixtures of Experts and the EM Algorithm", Neural Computation, 6, 181–224.

KAN, R., and ZHOU, G. (2006), "Modelling Non-Normality Using Multivariate t: Implications for Asset Pricing", Working paper, Washington University, St. Louis.
LANGE, K.L., LITTLE, R.J.A., and TAYLOR, J.M.G. (1989), "Robust Statistical Modeling Using the t Distribution", Journal of the American Statistical Association, 84(408), 881–896.

LEISCH, F. (2004), "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", Journal of Statistical Software, 11(8), 1–18.

LIU, C., and RUBIN, D.B. (1995), "ML Estimation of the t Distribution Using EM and Its Extensions, ECM and ECME", Statistica Sinica, 5, 19–39.

MARDIA, K.V., KENT, J.T., and BIBBY, J.M. (1979), Multivariate Analysis, London: Academic Press.

McLACHLAN, G.J., and BASFORD, K.E. (1988), Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker.

McLACHLAN, G.J., and PEEL, D. (1998), "Robust Cluster Analysis via Mixtures of Multivariate t-Distributions", in Lecture Notes in Computer Science, Vol. 1451, eds. A. Amin, D. Dori, P. Pudil, and H. Freeman, Berlin: Springer-Verlag, pp. 658–666.

McLACHLAN, G.J., and PEEL, D. (2000), Finite Mixture Models, New York: Wiley.

NADARAJAH, S., and KOTZ, S. (2005), "Mathematical Properties of the Multivariate t Distributions", Acta Applicandae Mathematicae, 89, 53–84.

NEWCOMB, S. (1886), "A Generalized Theory of the Combination of Observations so as to Obtain the Best Result", American Journal of Mathematics, 8, 343–366.

NG, S.K., and McLACHLAN, G.J. (2007), "Extension of Mixture-of-Experts Networks for Binary Classification of Hierarchical Data", Artificial Intelligence in Medicine, 41, 57–67.

NG, S.K., and McLACHLAN, G.J. (2008), "Expert Networks with Mixed Continuous and Categorical Feature Variables: A Location Modeling Approach", in Machine Learning Research Progress, eds. H. Peters and M. Vogel, New York: Hauppauge, pp. 355–368.

NIERENBERG, D.W., STUKEL, T.A., BARON, J., DAIN, B.J., and GREENBERG, R. (1989), "Determinants of Plasma Levels of Beta-Carotene and Retinol", American Journal of Epidemiology, 130(3), 511–521.

PEARSON, K. (1894), "Contributions to the Mathematical Theory of Evolution", Philosophical Transactions of the Royal Society of London A, 185, 71–110.

PEEL, D., and McLACHLAN, G.J. (2000), "Robust Mixture Modelling Using the t Distribution", Statistics & Computing, 10, 339–348.

PENG, F., JACOBS, R.A., and TANNER, M.A. (1996), "Bayesian Inference in Mixtures-of-Experts and Hierarchical Mixtures-of-Experts Models with an Application to Speech Recognition", Journal of the American Statistical Association, 91, 953–960.

PINHEIRO, J.C., LIU, C., and WU, Y.N. (2001), "Efficient Algorithms for Robust Estimation in Linear Mixed-Effects Models Using the Multivariate t Distribution", Journal of Computational and Graphical Statistics, 10, 249–276.

QUANDT, R.E. (1972), "A New Approach to Estimating Switching Regressions", Journal of the American Statistical Association, 67, 306–310.

RIANI, M., CERIOLI, A., ATKINSON, A.C., PERROTTA, D., and TORTI, F. (2008), "Fitting Mixtures of Regression Lines with the Forward Search", in Mining Massive Data Sets for Security, eds. F. Fogelman-Soulié, D. Perrotta, J. Piskorski, and R. Steinberg, Amsterdam: IOS Press, pp. 271–286.

RIANI, M., ATKINSON, A.C., and CERIOLI, A. (2009), "Finding an Unknown Number of Multivariate Outliers", Journal of the Royal Statistical Society B, 71(2), 447–466.

SCHLATTMANN, P. (2009), Medical Applications of Finite Mixture Models, Berlin-Heidelberg: Springer-Verlag.


SCHÖNER, B. (2000), Probabilistic Characterization and Synthesis of Complex Data Driven Systems, Ph.D. Thesis, MIT.

SCHÖNER, B., and GERSHENFELD, N. (2001), "Cluster Weighted Modeling: Probabilistic Time Series Prediction, Characterization, and Synthesis", in Nonlinear Dynamics and Statistics, ed. A.I. Mees, Boston: Birkhäuser, pp. 365–385.

TITTERINGTON, D.M., SMITH, A.F.M., and MAKOV, U.E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: Wiley.

WANG, P., PUTERMAN, M.L., COCKBURN, I., and LE, N. (1996), "Mixed Poisson Regression Models with Covariate Dependent Rates", Biometrics, 52, 381–400.

WEDEL, M. (2002), "Concomitant Variables in Finite Mixture Models", Statistica Neerlandica, 56(3), 362–375.

WEDEL, M., and DESARBO, W. (1995), "A Mixture Likelihood Approach for Generalized Linear Models", Journal of Classification, 12, 21–55.

WEDEL, M., and DESARBO, W. (2002), "Market Segment Derivation and Profiling via a Finite Mixture Model Framework", Marketing Letters, 13, 17–25.

WEDEL, M., and KAMAKURA, W.A. (2000), Market Segmentation: Conceptual and Methodological Foundations, Boston: Kluwer Academic Publishers.

ZELLNER, A. (1976), "Bayesian and Non-Bayesian Analysis of the Regression Model with Multivariate Student-t Error Terms", Journal of the American Statistical Association, 71, 400–405.
