Model-based clustering and classification via patterned covariance analysis

Luca Bagnato and Francesca Greselin

Università di Milano-Bicocca, Dipartimento di Metodi Quantitativi per le Scienze Economiche e Aziendali; e-mail: [email protected], [email protected]

Abstract This work deals with the classification problem in the case where groups are known and both labeled and unlabeled data are available. The classification rule is derived using Gaussian mixtures, with covariance matrices fixed according to a multiple testing procedure which allows one to choose among four alternatives: heteroscedasticity, homometroscedasticity, homotroposcedasticity, and homoscedasticity. The mixture models are then fitted using all available data (labeled and unlabeled), adopting the EM and CEM algorithms. Applications to real data are provided in order to show the classification performance of the proposed procedure.

Key words: Model-based clustering; Classification; Patterned covariance analysis; Multiple testing procedure

1 Introduction

The main purpose of discriminant (or classification) analysis is to assign an object to one of K groups G_1, ..., G_K according to a rule based on a vector of observations x = (x_1, ..., x_d) of d variables. Following the well-known Bayes rule, the object with measurement x = (x_1, ..., x_d) is assigned to the group G_k when

k = argmax_{1 ≤ j ≤ K} π_j f_j(x),    (1)

where π_j is the unconditional prior probability of observing a class j member and f_j(x) denotes the j-th group conditional density of x. Then, when π_j and f_j(x) for all j = 1, ..., K are known (or estimated), it is possible to classify an object using rule (1). The most common discriminant methods assume that the data within group j are generated from a d-variate normal with mean µ_j and covariance matrix Σ_j; the density of the data is then given by

f(x) = ∑_{j=1}^{K} π_j f_j(x),    (2)

with f_j(x) = (2π)^{−d/2} |Σ_j|^{−1/2} exp[−(1/2)(x − µ_j)′ Σ_j^{−1}(x − µ_j)]. These methods differ in how the Σ_j, j = 1, ..., K, are defined. In linear discriminant analysis (LDA) it is assumed that Σ_j = Σ for all j, while in quadratic discriminant analysis (QDA) Σ_j can vary with j. In regularized discriminant analysis (RDA, see [6]) the covariance matrices are defined so as to obtain a classifier intermediate between the linear, quadratic, and nearest-means classifiers. Unfortunately, RDA does not provide easily interpretable classification rules. Eigenvalue decomposition discriminant analysis (EDDA), proposed in [1], defines the covariance matrix Σ_j of group G_j according to its spectral decomposition. EDDA has aims similar to those of RDA, but it provides a straightforward geometric interpretation.

Traditional discriminant analysis involves samples of known origin (labeled data) and provides classification rules for samples of unknown origin (unlabeled data). These methodologies suffer whenever only a few labeled observations are available. Furthermore, unlabeled data can contain important information for defining the classification rule. In this framework we adopt the idea presented in [4] and use the multiple testing procedure in [7] to choose the structure of the covariance matrices. This allows us to obtain an improved classification rule (based on labeled and unlabeled data) while achieving parameter reduction and interpretable results at the same time.

In Section 2 we briefly present the patterned covariance analysis proposed in [7] and recall the estimation procedure of [4]. Finally, in Section 3, we show an application to real data.
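As a concrete illustration of rule (1) with the Gaussian densities above, the following minimal R sketch assigns an observation to the group maximizing log π_j + log f_j(x). It assumes the mvtnorm package and user-supplied estimates pi_hat, mu_hat, and Sigma_hat; these object names are ours, chosen for illustration.

# Minimal sketch of rule (1): assign x to the group maximizing
# log(pi_j) + log f_j(x). pi_hat (vector of priors), mu_hat (list of
# mean vectors) and Sigma_hat (list of covariance matrices) are
# assumed, illustrative estimates.
library(mvtnorm)

classify <- function(x, pi_hat, mu_hat, Sigma_hat) {
  scores <- vapply(seq_along(pi_hat), function(j)
    # working on the log scale avoids numerical underflow
    log(pi_hat[j]) + dmvnorm(x, mean = mu_hat[[j]],
                             sigma = Sigma_hat[[j]], log = TRUE),
    numeric(1))
  which.max(scores)  # the index k of rule (1)
}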

2 Patterned covariance analysis and model estimation

Let Σ_j = Γ_j Λ_j Γ_j′ be the spectral decomposition of the matrix Σ_j, j = 1, ..., K, where Λ_j is the diagonal matrix of the eigenvalues of Σ_j sorted in non-increasing order, and Γ_j is the d × d orthogonal matrix whose columns are the normalized eigenvectors of Σ_j, ordered according to their eigenvalues. In [7] the tested hypotheses, in addition to homoscedasticity (LDA) and heteroscedasticity (QDA), are the following:

1. homometroscedasticity: Σ_j = Γ_j Λ Γ_j′, j = 1, ..., K, that is, the Σ_j have the same shape and size but different orientations;
2. homotroposcedasticity: Σ_j = Γ Λ_j Γ′, j = 1, ..., K, that is, the Σ_j have the same orientation but different shapes and sizes.

Naturally, these two cases are intermediate alternatives, lying between LDA and QDA. The authors test such structures through a multiple testing inferential procedure; a toy construction of the four patterns is sketched below.
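To fix ideas, the following R sketch builds covariance matrices obeying the four patterns from the spectral decompositions of a list of sample covariance matrices. It is only a toy construction under simplifying assumptions: the common Λ is obtained by a naive unweighted average and the common Γ is crudely taken from the first group, whereas [7] derives proper constrained estimators.

# Toy construction of the four covariance patterns from the spectral
# decompositions Sigma_j = Gamma_j Lambda_j Gamma_j'. S is a list of K
# sample covariance matrices; the pooling steps are naive placeholders
# for the constrained estimators of [7].
patterned <- function(S, type = c("hetero", "homometro", "homotropo", "homo")) {
  type <- match.arg(type)
  dec <- lapply(S, eigen)                         # eigenvalues in decreasing order
  Gam <- lapply(dec, `[[`, "vectors")             # orientations Gamma_j
  Lam <- lapply(dec, function(e) diag(e$values))  # shapes and sizes Lambda_j
  K <- length(S)
  switch(type,
    hetero    = S,                                # QDA: all matrices free
    homometro = {                                 # common Lambda, own Gamma_j
      L <- Reduce(`+`, Lam) / K                   # naive average (illustrative)
      lapply(Gam, function(G) G %*% L %*% t(G))
    },
    homotropo = {                                 # common Gamma, own Lambda_j
      G <- Gam[[1]]                               # crude choice (illustrative)
      lapply(Lam, function(L) G %*% L %*% t(G))
    },
    homo      = rep(list(Reduce(`+`, S) / K), K)  # LDA: one pooled Sigma
  )
}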


Note that the alternatives considered here are similar to those used in EDDA, although this parametrization does not allow one to separate shape and size and does not distinguish cases with diagonal covariance matrices. On the other hand, homotroposcedasticity is not considered in EDDA, which deals with the more general case of common principal components. Common principal components (and hence also homotroposcedasticity) were not taken into account in [4]. Furthermore, our approach is motivated by many real applications, as presented in [7].

To estimate the model we follow the idea proposed in [4], where N labeled observations and M unlabeled observations are available and both are used in the mixture model estimation. The EM (see [5]) and CEM (see [3]) algorithms, which maximize the log-likelihood and the complete-data likelihood respectively, are used here; a sketch of the EM version is given below. The main difference of our procedure lies in how the model restrictions are chosen. While in [4] the constraint is chosen according to the highest BIC value, we adopt the multiple testing inferential procedure defined in [7], which chooses among the four alternatives described above. Thus, while in [4] the structure is chosen ex post (after model estimation), we propose to choose the structure ex ante, using only the labeled data.
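The following compact R sketch illustrates the semi-supervised EM idea of [4]: responsibilities of the labeled observations are fixed at their known 0/1 values, while the unlabeled observations receive posterior probabilities in the E-step. The function and variable names are ours, the convergence rule is a simple log-likelihood tolerance, and the patterned constraint is only indicated by a comment; this is a sketch, not the authors' implementation.

# Sketch of semi-supervised EM in the spirit of [4]. X: n x d data
# matrix; lab: integer labels in 1..K for the N labeled rows, NA for
# the M unlabeled rows.
library(mvtnorm)

semi_em <- function(X, lab, K, max_iter = 200, tol = 1e-6) {
  n <- nrow(X)
  known <- !is.na(lab)
  # Responsibilities: fixed 0/1 for labeled rows, uniform start otherwise
  z <- matrix(1 / K, n, K)
  z[known, ] <- 0
  z[cbind(which(known), lab[known])] <- 1
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # M-step: weighted proportions, means and covariances
    nk <- colSums(z)
    pi_hat <- nk / n
    mu_hat <- lapply(seq_len(K), function(j) colSums(z[, j] * X) / nk[j])
    Sigma_hat <- lapply(seq_len(K), function(j) {
      Xc <- sweep(X, 2, mu_hat[[j]])
      crossprod(sqrt(z[, j]) * Xc) / nk[j]
      # <- the patterned constraint chosen by the multiple test
      #    would be imposed on these estimates here
    })
    # E-step: update posteriors for the unlabeled rows only
    dens <- sapply(seq_len(K), function(j)
      pi_hat[j] * dmvnorm(X, mu_hat[[j]], Sigma_hat[[j]]))
    z[!known, ] <- dens[!known, , drop = FALSE] / rowSums(dens)[!known]
    # Log-likelihood: mixture term for unlabeled rows,
    # component term for labeled rows
    loglik <- sum(log(rowSums(dens[!known, , drop = FALSE]))) +
      sum(log(dens[cbind(which(known), lab[known])]))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(pi = pi_hat, mu = mu_hat, Sigma = Sigma_hat, z = z)
}

The CEM variant would replace the soft E-step with a hard assignment, setting each unlabeled row of z to 1 at its largest posterior probability.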

3 Application

We consider the crabs data set (genus Leptograpsus), studied in [2], and we follow the setting in [8], focusing on the sample of 100 blue crabs, of which 50 are males (group 1) and 50 are females (group 2). On each crab d = 2 measurements are considered: the rear width and the length along the midline of the carapace. To perform this application we use the R environment. The R code for both the multiple testing procedure and the mixture model estimation (EM and CEM) is available from the authors upon request.

The purpose here is to compare our classification rule with the one obtained using the BIC indicator for model selection. In particular, we randomly unlabel P percent of the observations from the original data set, P = 10, ..., 60. For each value of P we generate 1000 replications and calculate the average misclassification rate. It is interesting to observe (see Table 1) that our proposal gives results similar to those obtained using BIC. Model selection performed ex ante by the testing procedure appears to be more efficient, since only one model has to be estimated. Furthermore, as shown in Table 2, the multiple test still detects the real underlying structure (homometroscedasticity) even in the presence of high percentages of unlabeled data. A sketch of this experimental design is given below.
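The following hedged R sketch reproduces the structure of the experiment using the crabs data shipped with the MASS package (RW: rear width; CL: carapace length; sp == "B" selects the blue crabs). Here semi_em() refers to the illustrative sketch of Section 2, not to the authors' code, and the selection of the covariance pattern is omitted.

# Sketch of the experimental design on the crabs data of [2].
library(MASS)   # provides the crabs data

data(crabs)
blue <- subset(crabs, sp == "B")            # the 100 blue crabs
X <- as.matrix(blue[, c("RW", "CL")])       # d = 2: rear width, carapace length
y <- as.integer(blue$sex)                   # group coding follows factor levels

miss_rate <- function(P, reps = 100) {      # reps reduced from the paper's 1000
  n <- nrow(X)
  mean(replicate(reps, {
    lab <- y
    lab[sample(n, round(P / 100 * n))] <- NA   # randomly unlabel P% of rows
    fit <- semi_em(X, lab, K = 2)              # sketch from Section 2
    pred <- max.col(fit$z)                     # MAP group for every row
    mean(pred[is.na(lab)] != y[is.na(lab)])    # error on the unlabeled rows
  }))
}

# e.g. miss_rate(30) approximates the average misclassification
# rate with 30% of the observations unlabeled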

4 Conclusions and further developments

In this paper we propose a new classification rule based on normal mixture models and on a multiple testing procedure for patterned covariance matrices recently proposed in the literature. The main advantage of this method is that only one model has to be


estimated, instead of evaluating the BIC ex post for all the considered alternatives. Further investigation is needed, possibly extending the multiple test hypotheses so that the best model structure can be chosen in advance among a larger number of available patterns. Moreover, additional studies will concern its performance when the underlying assumption of multivariate normality is relaxed.

Table 1: Average misclassification rates (in percentage) for 1000 samples from the crabs data when P% of all observations are randomly unlabeled, P = 10, ..., 60. Standard deviations are given in parentheses.

Percentage of        BIC                          Multiple test
unlabeled data       EM            CEM            EM            CEM
10                   7.26 (0.081)  7.19 (0.081)   7.23 (0.081)  7.18 (0.082)
20                   6.90 (0.053)  6.91 (0.054)   6.91 (0.051)  6.92 (0.051)
30                   7.15 (0.041)  7.12 (0.040)   7.16 (0.041)  7.15 (0.040)
40                   7.39 (0.035)  7.27 (0.034)   7.38 (0.035)  7.27 (0.038)
50                   7.70 (0.033)  7.41 (0.030)   7.68 (0.032)  7.48 (0.030)
60                   7.93 (0.030)  7.43 (0.027)   7.91 (0.030)  7.45 (0.027)

Table 2: Proportion of times (over 1000 replications) that the testing procedure in [7] chooses each covariance structure on the crabs data set, when P = 10, ..., 60% of the observations are unlabeled.

Percentage of     Homo-          Homotropo-     Homometro-     Hetero-
unlabeled data    scedasticity   scedasticity   scedasticity   scedasticity
10                0.000          0.000          0.990          0.010
20                0.000          0.000          0.973          0.027
30                0.000          0.000          0.967          0.033
40                0.000          0.000          0.948          0.052
50                0.000          0.000          0.944          0.056
60                0.002          0.000          0.933          0.065

References

1. Bensmail, H., Celeux, G.: Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association 91(436) (1996)
2. Campbell, N.A., Mahon, R.J.: A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22(3), 417–425 (1974)
3. Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Institut National de Recherche en Informatique et en Automatique (1991)
4. Dean, N., Murphy, T., Downey, G.: Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C (Applied Statistics) 55(1), 1–14 (2006)
5. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38 (1977)
6. Friedman, J.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989)
7. Ingrassia, S., Greselin, F., Punzo, A.: Assessing the pattern of covariance matrices via an augmentation multiple testing procedure. Statistical Methods & Applications 20(2), 141–170 (2011)
8. McLachlan, G.J., Peel, D.: Finite Mixture Models. John Wiley & Sons (2000)
