Journal of Classification 29:50-75 (2012) DOI: 10.1007/s00357-012-9099-y

A Copula-Based Algorithm for Discovering Patterns of Dependent Observations

F. Marta L. Di Lascio, Università di Bologna, Italy
Simone Giannerini, Università di Bologna, Italy

Abstract: The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust in brief) that clusters dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach requires neither a starting classification nor an a priori choice of the number of clusters; in fact, the CoClust selects them by using a criterion based on the log-likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.

Keywords: Clustering methods; CoClust algorithm; Copula functions; Model-based clustering; Microarray data.

The authors wish to thank Estela Bee Dagum, Paola Monari and Alessandra Luati for their support. This work has been partially financed by MIUR funds. Supplementary material and the R package CoClust are available at http://www2.stat.unibo.it/giannerini/coclust. Authors' Address: Simone Giannerini and F. Marta L. Di Lascio, Dipartimento di Scienze Statistiche, Università di Bologna, via Belle Arti 41, 40126 Bologna, Italy, emails: [email protected]; [email protected]. Published online 11 January 2012


1. Introduction

Clustering methods are among the most used techniques for the analysis of multivariate data. In the literature many different clustering algorithms have been proposed, each one with its pros and cons. From the first generation of clustering techniques (e.g. K-means and hierarchical clustering) to the second one (e.g. model-based clustering and biclustering) (Moreau, De Smet, and Thijs 2002), this kind of analysis has been employed to cluster observations (to find homogeneous groups w.r.t. the observed variables) and/or to cluster variables (for dimensionality reduction, detecting multicollinearity, etc.), and it has been used in many applied fields. In the last ten years clustering techniques have also been used extensively in the analysis of microarray data. Such methods can be applied i) to the expression levels of genes, with the aim of identifying expression patterns of functionally related/co-regulated genes, or ii) to the expression profiles of a set of cells or tissue samples, with the aim of grouping samples with similar biological characteristics. Since the first application of clustering methods to microarray data (Eisen, Spellman, Brown and Botstein 1998), clustering has become one of the most used unsupervised methods in gene expression data analysis. We recall the work of Sørlie et al. (2001) on the K-means algorithm. Furthermore, a relatively recent clustering method applied to microarray data is the hybrid hierarchical clustering of Chipman and Tibshirani (2006).

Classical clustering techniques used in microarray data analysis either ignore the dependence relationship between genes or are limited to the linear dependence case. Other techniques are used to study the dependence in gene expression data, e.g. Friedman, Linial, Nachman and Pe'er (2000) use Bayesian networks, but to our knowledge the dependence relationship between the mRNA produced by different genes has not been investigated in detail in the clustering context. Copula modeling, by contrast, allows one to investigate the multivariate dependence by overcoming the limitations of Gaussian linear models. On this basis, we propose a copula-based clustering algorithm (CoClust hereafter) that inherits the good properties of copula models. We assume that there is a multivariate probability model (data generating process, DGP hereafter) that generates the clustering. In particular, we assume that each cluster is represented by a (marginal) univariate density function, while the whole clustering is modeled through a joint density function defined via copula. Hence, the main purpose of the CoClust is to identify dependent groups in such a way that the complex dependence between observations can be uncovered. Note that our perspective is different from the usual ones. Indeed, on the one hand, standard clustering techniques


group observations with high similarity scores in the same cluster, e.g. genes with similar expression profiles, and put dissimilar observations in different clusters; from this perspective the operational definition of a cluster is based on internal cohesion (homogeneity) and external isolation (separation). On the other hand, model-based clustering assumes that units in different clusters are independent while units in the same cluster are (marginally) dependent. Our way of interpreting clustering, however, is based on internal independence and external dependence: units in different clusters are dependent, while units in the same cluster are independent. Indeed, given the number of clusters K that defines the dimension of the copula model, the algorithm assigns K observations at a time to K different clusters/margins on the basis of their dependence structure. This has two important consequences: i) each cluster contains observations that can be seen as independent realizations of the same marginal random variable, and ii) observations put in different clusters K at a time gain a natural ordering given by their dependence. Hence, the algorithm can provide an interesting interpretation of each clustered K-plet of observations. These aspects make the method particularly useful in contexts where the number of observations is not large or has been reduced through a preliminary analysis. For instance, in the context of microarray data analysis, it is important to identify groups of genes whose expression level (quantified by its mRNA) is related to that of other genes, in order to uncover the interactions between the biological processes in which they are involved. In order to obtain a more effective gene clustering, pre-processing algorithms as well as preliminary analyses may be used to identify and eliminate noisy data and to select the subset of genes most useful for revealing biological relationships. In this context, we believe that the information provided by the CoClust on the dependence structure between (clusters of) genes in a microarray experiment can shed new light on their biological interactions.

The paper is structured as follows. Section 2 presents the basic theoretical tools regarding copula functions, with particular emphasis on estimation methods and the copula models used. Section 3 presents the new clustering algorithm in detail. In Section 4 the proposed algorithm is tested on simulated data for different scenarios and compared with model-based clustering (Fraley and Raftery 1998, 2000), while Section 5 presents its applications to real microarray data. Finally, Section 6 discusses and outlines conclusions and proposals for further research.

2. Copula Functions

2.1 Sklar's Theorem

The concept of ‘copula’ or ‘copula function’ originates in the context of probabilistic metric spaces through the well–known Sklar’s theorem


(Sklar 1959), which states that copulas are joint distribution functions of standard uniform random variates. In terms of distribution functions, Sklar's theorem states the following:

Theorem 1 (Sklar's theorem). Let $F$ be a $K$-dimensional joint distribution function with margins $F_1, \dots, F_k, \dots, F_K$. Then there exists a copula $C$ such that for all $x \in \bar{\mathbb{R}}^K$ (where $\bar{\mathbb{R}}$ denotes the extended real line)
$$F(x_1, \dots, x_k, \dots, x_K) = C(F_1(x_1), \dots, F_k(x_k), \dots, F_K(x_K)). \qquad (1)$$
If $F_1, \dots, F_K$ are continuous, then $C$ is unique; otherwise, $C$ is uniquely determined on $\mathrm{Ran}\,F_1 \times \dots \times \mathrm{Ran}\,F_K$. Conversely, if $C$ is a $K$-copula and $F_1, \dots, F_K$ are distribution functions, then the function $F$ defined in (1) is a $K$-dimensional joint distribution function with margins $F_1, \dots, F_K$.

For the proof see Schweizer and Sklar (1983) and Nelsen (2006). According to this theorem we can split the joint probability function into the margins and a copula, so that the latter represents the 'association' between variables. For continuous random variables, the copula density $c$ is related to the density $f$ of the distribution $F$ through the following canonical representation
$$f(x_1, \dots, x_K) = c(F_1(x_1), \dots, F_K(x_K)) \prod_{k=1}^{K} f_k(x_k), \qquad (2)$$
in which $f_k$ is the marginal probability density function of the distribution $F_k$.
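As an illustration of the canonical representation (2), the following R sketch (ours, not part of the paper's CoClust package) evaluates a trivariate joint density as the product of a copula density and the marginal densities; the use of the copula package and the particular margins are assumptions made for the example.

```r
library(copula)  # assumed available; illustrative only

## Evaluate f(x) = c(F_1(x_1), ..., F_K(x_K)) * prod_k f_k(x_k), as in (2).
joint_density <- function(x, cop, margins) {
  u  <- mapply(function(xk, m) m$cdf(xk), x, margins)   # F_k(x_k)
  fk <- mapply(function(xk, m) m$pdf(xk), x, margins)   # f_k(x_k)
  dCopula(matrix(u, nrow = 1), cop) * prod(fk)
}

cop <- frankCopula(param = 10, dim = 3)                 # copula density c
margins <- list(list(cdf = function(x) pgamma(x, 3, 4),
                     pdf = function(x) dgamma(x, 3, 4)),
                list(cdf = function(x) pbeta(x, 2, 1),
                     pdf = function(x) dbeta(x, 2, 1)),
                list(cdf = pnorm, pdf = dnorm))
joint_density(c(0.5, 0.4, 0.1), cop, margins)
```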

2.2 Copula Families

In the literature many different copula families have been proposed; for an extended review see Nelsen (2006) and Joe (1997). The most used families are the Elliptical family, which contains both the Gaussian and the t copula models, and the Archimedean family, which includes Clayton's, Frank's and Gumbel's copula models. The implementation of our algorithm allows the user to choose one of these models as input (see Supplementary Materials at http://www2.stat.unibo.it/giannerini/coclust). In particular, the Archimedean family of copula functions is useful in empirical modeling and is popular because of its ease of derivation. In this paper we focus on the following models:

1. Gaussian copula

$$C(u_1, \dots, u_K) = \Phi_K\big(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_k), \dots, \Phi^{-1}(u_K); \theta_2\big), \qquad (3)$$


where $\Phi$ is the cumulative distribution function of the standard univariate normal distribution and $\Phi_K$ is the standard $K$-variate normal distribution with correlation parameter $\theta_2$ restricted to the interval $(-1, 1)$;

2. Frank's copula
$$C(u_1, \dots, u_K) = -\frac{1}{\theta_2} \ln\left(1 + \frac{\prod_{k=1}^{K}\big(e^{-\theta_2 u_k} - 1\big)}{\big(e^{-\theta_2} - 1\big)^{K-1}}\right), \qquad (4)$$
with $\theta_2 \in (0, \infty)$ and $K \geq 3$; independence is attained as $\theta_2$ approaches zero;

3. Clayton's copula
$$C(u_1, \dots, u_k, \dots, u_K) = \left(\sum_{k=1}^{K} u_k^{-\theta_2} - K + 1\right)^{-1/\theta_2}, \qquad (5)$$

where the parameter $\theta_2$ is restricted to the region $(0, \infty)$ and, as it approaches zero, the marginal distributions become independent.

The normal or Gaussian copula is symmetric and is flexible in that it allows for equal degrees of positive and negative dependence. Frank's copula, unlike Clayton's and Gumbel's copula models, is (radially) symmetric; its tails are less heavy than those of a Gaussian copula, and it allows negative dependence only for bivariate joint distributions. Clayton's copula, on the contrary, is asymmetric: it exhibits strong left tail dependence and relatively weak right tail dependence, and it cannot account for negative dependence. The computation of Frank's and Clayton's dependence parameter from association measures, like those of Kendall and Spearman, is straightforward, and the relationship between them is one-to-one (Cherubini, Luciano and Vecchiato 2004). Moreover, all three copula models are comprehensive, so that they allow the maximum range of dependence. These reasons, together with the possibility of describing a complex multivariate dependence through a single parameter, explain the popularity of such families in the applied literature on copulas.
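As a quick illustration of these families (a sketch built on the R copula package, which is our assumption and is distinct from the authors' CoClust package), one can construct the three models, sample from them, and exploit the one-to-one link between Kendall's tau and $\theta_2$:

```r
library(copula)  # assumed available; illustrative only

gauss <- normalCopula(param = 0.7, dim = 3, dispstr = "ex")  # exchangeable
frank <- frankCopula(param = 10, dim = 3)
clay  <- claytonCopula(param = 2,  dim = 3)

u <- rCopula(500, frank)   # 500 draws on the unit cube
pairs(u, pch = ".")        # visualize the dependence between margins

## One-to-one relationship between Kendall's tau and theta_2:
tau(clay)                        # tau implied by theta_2 = 2 (here 0.5)
iTau(claytonCopula(dim = 3), 0.5)   # theta_2 recovering tau = 0.5
```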

2.3 Estimation Methods for Copula Models

In this section we briefly describe the estimation methods that can be used within our algorithm to estimate a copula model. In particular, we focus on both a parametric and a semi-parametric approach. From the canonical representation in equation (2) we can state that, in general, a statistical modeling problem for copulas can be decomposed into two steps:


i) identification of the marginal distributions and ii) selection of the appropriate copula function. Hence, we focus on sequential two-step maximum likelihood estimation methods in which the marginal parameters are estimated in the first step and are used to estimate the dependence parameter of the copula function in the second step. This approach is computationally less intensive w.r.t. other well-known estimation methods for copulas, e.g. the exact maximum likelihood method (see Cherubini, Luciano and Vecchiato 2004, p. 154), which estimates simultaneously both the parameters of the margins and those of the copula. Moreover, the two-step approach allows one to adopt a semi-parametric estimation method besides the parametric one.

Suppose we observe $n$ independent realizations from a multivariate distribution as in (1), $(X_{1i}, \dots, X_{ki}, \dots, X_{Ki})$, $i = 1, 2, \dots, n$, and suppose that the $K$ margins have cumulative distribution functions $F_k$ and density functions $f_k$ ($k = 1, 2, \dots, K$) and a copula function $C$. Let $\theta_1 = (\beta_1, \dots, \beta_k, \dots, \beta_K)$ be the vector of marginal parameters and $\theta_2$ be the vector of copula parameters. The parameter vector to be estimated is $\theta = (\theta_1, \theta_2)$. The log-likelihood function
$$l(\theta) = \sum_{i=1}^{n} \log c\big\{F_1(X_{1i}; \beta_1), \dots, F_K(X_{Ki}; \beta_K); \theta_2\big\} + \sum_{i=1}^{n} \sum_{k=1}^{K} \log f_k(X_{ki}; \beta_k)$$
is composed of two terms: the first term involves the copula density $c$ and its parameters, while the second one involves the marginal densities and their parameters. Starting from these considerations, Joe and Xu (1996) proposed a two-step estimation method called inference for the margins (IFM hereafter). A fully parametric approach for the IFM method is based on the estimation of the marginal parameters $\theta_1$ at the first step, either by
$$\hat{\theta}_1 = \arg\max_{\beta} \sum_{i=1}^{n} \sum_{k=1}^{K} \log f_k(X_{ki}; \beta),$$
where the marginal distributions share the same parameters $\beta$, or by an ML estimation for each margin,
$$\hat{\beta}_k = \arg\max_{\beta_k} \sum_{i=1}^{n} \log f_k(X_{ki}; \beta_k), \qquad (6)$$
where each marginal distribution $F_k$ has its own parameters $\beta_k$ and $\theta_1 = (\beta_1, \dots, \beta_k, \dots, \beta_K)$. In the second step, the dependence parameters $\theta_2$, given $\hat{\theta}_1$, are estimated by
$$\hat{\theta}_2 = \arg\max_{\theta_2} \sum_{i=1}^{n} \log c\big\{F_1(X_{1i}; \hat{\beta}_1), \dots, F_K(X_{Ki}; \hat{\beta}_K); \theta_2\big\}. \qquad (7)$$
Joe (1997) has proved that, under appropriate regularity conditions, the IFM estimator is asymptotically Gaussian with
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to N\big(0, \mathcal{G}^{-1}(\theta_0)\big), \qquad (8)$$
where $\mathcal{G}$ is the Godambe information matrix (Godambe 1960) and $\theta_0$ is the vector of the 'true' values.

In a two-step semi-parametric approach, by contrast, the empirical cumulative distribution functions $\hat{F}_k$, $k = 1, \dots, K$, are used to model the margins without assumptions on their parametric form, and maximum likelihood is used to estimate the copula parameters by
$$\hat{\theta}_2 = \arg\max_{\theta_2} \sum_{i=1}^{n} \log c\big\{\hat{F}_1(X_{1i}), \dots, \hat{F}_K(X_{Ki}); \theta_2\big\}. \qquad (9)$$
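A minimal sketch of this two-step semi-parametric fit, using the R copula package (an assumption of ours; the margins below echo Scenario 2 of Section 4):

```r
library(copula)  # assumed available; a sketch of the two-step fit in (9)

set.seed(1)
## Stand-in data: trivariate Frank copula with Gamma, Beta and Gaussian
## margins (the margins used in Scenario 2 of Section 4).
u <- rCopula(200, frankCopula(param = 10, dim = 3))
x <- cbind(qgamma(u[, 1], 3, 4), qbeta(u[, 2], 2, 1), qnorm(u[, 3], 7, 2))

## Step 1: model the margins by their empirical CDFs (pseudo-observations).
u_hat <- pobs(x)

## Step 2: maximize the copula log-likelihood in theta_2 alone.
fit <- fitCopula(frankCopula(dim = 3), data = u_hat, method = "mpl")
summary(fit)   # theta_2-hat with its standard error
logLik(fit)    # the quantity CoClust uses as its clustering criterion
```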

3. The CoClust Algorithm

In this section we describe the clustering algorithm based on copula functions (CoClust). CoClust assumes that the data are generated by a multivariate copula function whose arguments are the probability-integral transforms of the density functions that generate the clusters. The kind and the strength of the dependence between clusters (margins) are modeled through a copula function. The procedure we propose groups observations by using the maximized log-likelihood function of a copula fit. The parameter vector (dependence parameter plus marginal parameters) is estimated by one of the methods presented in Section 2.3. There is no need to set the exact number of clusters a priori, nor is a starting classification required, because the algorithm explores every possible initial solution and chooses the best number of clusters within a given range of possibilities. Since it is based on copulas, CoClust makes it possible i) to investigate the (complex) multivariate dependence structure of the data generating process without any assumption on the margins, and ii) to overcome some of the limitations of other well-known clustering methods like K-means and hierarchical ones (see Chapter 4 in Di Lascio 2008). CoClust gives an integrated picture of the relationships between different clusters of observations and, at the same time, it shows how the observations belonging to different clusters are associated with each other.


3.1 The CoClust Algorithm and Its Interpretation

Starting from a $G \times S$ data matrix
$$\begin{bmatrix} x_{11} & \dots & x_{1s} & \dots & x_{1S} \\ \vdots & & \vdots & & \vdots \\ x_{g1} & \dots & x_{gs} & \dots & x_{gS} \\ \vdots & & \vdots & & \vdots \\ x_{G1} & \dots & x_{Gs} & \dots & x_{GS} \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_g \\ \vdots \\ x_G \end{bmatrix}, \qquad (10)$$

where $x_g$, $g = 1, \dots, G$, is a row vector, the procedure treats each row of the matrix as a single element to be allocated to a cluster. At the first step the algorithm selects the optimal number of clusters $K$ according to a goodness-of-fit criterion of a copula function estimated on all the possible combinations of the rows of the matrix (10). The optimal combination of rows constitutes the starting point of the clustering procedure, as will be described in detail below. From the second step onwards, the algorithm allocates the remaining observations/rows to the $K$ clusters. To this end, a goodness-of-fit criterion similar to that used at the previous step is employed. Notice that the number of clusters $K$ is the dimension of the copula function; hence, the terms 'margin' and 'cluster' are used interchangeably, since a classification into $K$ clusters is obtained by means of a copula model with $K$ margins. In the following we describe the procedure in detail.

1. Choose the number of clusters $K$ and allocate an $S$-dimensional observation (one row of matrix (10)) to each cluster, that is:

1.1 For each $k = 2, 3, \dots, K_{max}$, where $K_{max} \leq G$, specified by the user, is the maximum number of clusters to be tried, estimate $C_{G,k} = \binom{G}{k}$ copula functions and obtain the maximized log-likelihood of the copula, that is
$$l_{x_{g_{11}} \dots x_{g_{k1}}}\big(\hat{\theta}_2\big) = \max_{\theta_2 \in \Theta} \sum_{s=1}^{S} \log c\big(F_{g_{11}}(X_{g_{11}s}; \hat{\beta}_{g_{11}}), \dots, F_{g_{k1}}(X_{g_{k1}s}; \hat{\beta}_{g_{k1}}); \theta_2\big) \qquad (11)$$
for all $\{g_{11}, \dots, g_{k1}\} \subseteq \{1, 2, \dots, G\}$; here, $g_{ab}$ indicates the row number of the matrix (10) and identifies the $b$-th observation of the $a$-th cluster ($a = 1, \dots, k$ and $b = 1, \dots, n_a$).

1.2 At step 1.1 the algorithm computes $\sum_{k=2}^{K_{max}} \binom{G}{k}$ likelihood functions (11). Now, the procedure selects the $K$-plet of observations/rows $(x_{g^*_{11}}, \dots, x_{g^*_{K1}})$ that maximizes the log-likelihood computed at step 1.1, that is, it selects the combination for which we have
$$\max_{\{g_{11} \dots g_{K1}\} \subseteq \{1, \dots, G\}} l_{x_{g_{11}} \dots x_{g_{K1}}}\big(\hat{\theta}_2\big). \qquad (12)$$
At this step the dimension $K$ of the copula (or number of clusters) is chosen, and $x_{g^*_{11}}$ is allocated to the first cluster, $x_{g^*_{21}}$ to the second cluster, and so on.

2. Allocate the second $K$-plet of observations/rows to the clusters among the remaining $(G - K)$ ones, that is:

2.1 Select the $K$-plet corresponding to the largest log-likelihood among those computed on the set of $\binom{G-K}{K}$ combinations of observations that have not been allocated yet; in other words, select the $K$-plet $(x_{g_{12}}, \dots, x_{g_{K2}})$ of observations/rows in (10), among all $\{g_{12}, \dots, g_{K2}\} \subseteq \{1, 2, \dots, G\} \setminus \{g^*_{11}, \dots, g^*_{K1}\}$, that maximizes the log-likelihood of the copula model.

2.2 Compute all the permutations of the selected $K$-plet $(x_{g_{12}}, \dots, x_{g_{K2}})$ in order to associate each observation to the margin that has generated it. Then, estimate a copula model on each permutation by using also the $K$-plet $(x_{g^*_{11}}, \dots, x_{g^*_{K1}})$ already clustered. Hence, at the first iteration, the algorithm works on the rows of the following matrix
$$\begin{bmatrix} x_{g^*_{11}1} & \dots & x_{g^*_{11}S} & x_{g_{12}1} & \dots & x_{g_{12}S} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{g^*_{K1}1} & \dots & x_{g^*_{K1}S} & x_{g_{K2}1} & \dots & x_{g_{K2}S} \end{bmatrix} = \begin{bmatrix} x_{g^*_{11}} & x_{g_{12}} \\ \vdots & \vdots \\ x_{g^*_{K1}} & x_{g_{K2}} \end{bmatrix}, \qquad (13)$$
where each row is obtained by merging two different $K$-plets of rows of the matrix in (10), and the algorithm computes
$$l_{x_{g^*_{11}} \dots x_{g^*_{K1}} x_{g_{12}} \dots x_{g_{K2}}}\big(\hat{\theta}_2\big) = \max_{\theta_2 \in \Theta} \sum_{s=1}^{S} \Big[ \log c\big(F_{g^*_{11}}(X_{g^*_{11}s}; \hat{\beta}_{g^*_{11}}), \dots, F_{g^*_{K1}}(X_{g^*_{K1}s}; \hat{\beta}_{g^*_{K1}}); \theta_2\big) + \log c\big(F_{g_{12}}(X_{g_{12}s}; \hat{\beta}_{g_{12}}), \dots, F_{g_{K2}}(X_{g_{K2}s}; \hat{\beta}_{g_{K2}}); \theta_2\big) \Big] \qquad (14)$$
for all $\{g_{12}, \dots, g_{K2}\} \subseteq \{1, 2, \dots, G\} \setminus (g^*_{11}, \dots, g^*_{K1})$. Notice that in the maximization of the likelihood (14), the $K$-plet $(x_{g^*_{11}}, \dots, x_{g^*_{K1}})$ is kept fixed since it has been allocated at the first step.

2.3 Select the permutation $(x_{g^*_{12}}, \dots, x_{g^*_{K2}})$ for which we have
$$\max_{\{g_{12} \dots g_{K2}\} \in \Psi[\{1, \dots, G\} \setminus (g^*_{11}, \dots, g^*_{K1})]} l_{x_{g^*_{11}} \dots x_{g^*_{K1}} x_{g_{12}} \dots x_{g_{K2}}}\big(\hat{\theta}_2\big), \qquad (15)$$
where $\Psi[A]$ represents the set of all possible permutations of the elements of the set $A$. Similarly to step 1.2, the algorithm allocates $x_{g^*_{12}}$ to the first cluster, $x_{g^*_{22}}$ to the second cluster, and so on.

3. Allocate the remaining $(G - 2K)$ observations/rows, $K$ at a time, to the clusters. At the generic $i$-th iteration:

3.1 Select the $K$-plet $(x_{g_{1i}}, \dots, x_{g_{Ki}})$ of observations/rows, where $\{g_{1i}, \dots, g_{Ki}\}$ is chosen among all the possible subsets of $R = \{1, \dots, G\} \setminus \{g^*_{11}, \dots, g^*_{K1}, \dots, g^*_{1(i-1)}, \dots, g^*_{K(i-1)}\}$, that corresponds to the largest log-likelihood among those computed on the set of $\binom{G-(i-1)K}{K}$ combinations of observations that have not been allocated yet.

3.2 Compute all the permutations of the selected $K$-plet $(x_{g_{1i}}, \dots, x_{g_{Ki}})$; estimate a copula model on each permutation by using also the $K$-plets already clustered. At the $i$-th iteration, the algorithm works on the rows of the following matrix:
$$\begin{bmatrix} x_{g^*_{11}} & x_{g^*_{12}} & \dots & x_{g^*_{1(i-1)}} & x_{g_{1i}} \\ \vdots & \vdots & & \vdots & \vdots \\ x_{g^*_{K1}} & x_{g^*_{K2}} & \dots & x_{g^*_{K(i-1)}} & x_{g_{Ki}} \end{bmatrix}, \qquad (16)$$
where each row is obtained by merging $i$ different $K$-plets of rows of the matrix in (10). Eq. (14) becomes
$$l_{x_{g^*_{11}} \dots x_{g^*_{K1}} \dots\, x_{g^*_{1(i-1)}} \dots x_{g^*_{K(i-1)}} x_{g_{1i}} \dots x_{g_{Ki}}}\big(\hat{\theta}_2\big) = \max_{\theta_2 \in \Theta} \sum_{s=1}^{S} \Big[ \sum_{j=1}^{i-1} \log c\big(F_{g^*_{1j}}(X_{g^*_{1j}s}; \hat{\beta}_{g^*_{1j}}), \dots, F_{g^*_{Kj}}(X_{g^*_{Kj}s}; \hat{\beta}_{g^*_{Kj}}); \theta_2\big) + \log c\big(F_{g_{1i}}(X_{g_{1i}s}; \hat{\beta}_{g_{1i}}), \dots, F_{g_{Ki}}(X_{g_{Ki}s}; \hat{\beta}_{g_{Ki}}); \theta_2\big) \Big] \qquad (17)$$
for all $\{g_{1i}, \dots, g_{Ki}\} \in R$.

3.3 Select the permutation $(x_{g^*_{1i}}, \dots, x_{g^*_{Ki}})$ that maximizes the likelihood in Eq. (17) with respect to the set $\Psi[R]$. Similarly to step 2.3, the algorithm allocates $x_{g^*_{1i}}$ to the first cluster, $x_{g^*_{2i}}$ to the second cluster, and so on.
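To make steps 1.1-1.2 concrete, here is a minimal R sketch of the initial search (ours, built on the copula package; it is not the authors' CoClust implementation, uses a Frank family by assumption, and ignores the later permutation steps):

```r
library(copula)  # illustrative sketch; not the authors' CoClust package

## Step 1 of CoClust: for k = 2..Kmax, fit a k-dimensional copula to every
## k-subset of rows of X and keep the subset (and hence K) with the largest
## maximized log-likelihood, cf. Eqs. (11)-(12).
select_first_kplet <- function(X, Kmax) {
  best <- list(loglik = -Inf, K = NA, rows = NULL)
  for (k in 2:Kmax) {
    for (rows in asplit(combn(nrow(X), k), 2)) {
      u <- pobs(t(X[rows, , drop = FALSE]))   # S pseudo-obs., k margins
      fit <- tryCatch(fitCopula(frankCopula(dim = k), u, method = "mpl"),
                      error = function(e) NULL)
      if (!is.null(fit) && logLik(fit) > best$loglik)
        best <- list(loglik = as.numeric(logLik(fit)),
                     K = k, rows = as.vector(rows))
    }
  }
  best
}

set.seed(1)
X <- matrix(rnorm(6 * 21), nrow = 6)   # toy G = 6, S = 21 data matrix
select_first_kplet(X, Kmax = 3)
```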

The CoClust algorithm is a procedure that first selects the number of clusters and then adds one observation at a time (that is, K observations at a time) to each cluster; each observation is an $S$-dimensional vector and its components are treated as (independent) realizations of the same random variable. The $K$ candidate observations are allocated to the $K$ clusters on the basis of the value of the maximized log-likelihood function of the copula model. Since at each step we compare non-nested models, the criterion is equivalent to the well-known Bayesian information criterion (BIC) and Akaike information criterion (AIC). This fact, together with the assumption of a multivariate probability model as the generating process of the clustering, makes CoClust similar to the model-based clustering method of Fraley and Raftery (1998). Indeed, model-based clustering (MClust hereafter) assumes that the data are generated by a finite mixture of probability distributions
$$f(x) = \sum_{k=1}^{K} \tau_k f_k(x), \qquad (18)$$
in which $f_k(x)$ is the density of an observation $x$ from the $k$-th component, which represents a different group or cluster, and $\tau_k$ is the probability that an observation belongs to the $k$-th component. In general, a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$ is assumed for each component. The parameters of the model (18) are estimated through an EM algorithm for different numbers of clusters and covariance structures and, then, the best model is selected by using the BIC.

Note that, even if CoClust and MClust have some aspects in common, they are based on a different concept of cluster. Indeed, in MClust the groups are assumed to be independent, contrary to what happens in CoClust. Theoretically, the advantages of our approach over MClust are: i) marginal distributions need not be Gaussian; ii) the multivariate model can capture asymmetric features more parsimoniously, i.e. there is no need to add extra Gaussian components in order to capture long tails; iii) similarly to what happens in time series modeling, where non-linear features can be modeled by means of linear processes (i.e. ARMA) only at the price of non-parsimonious models with a high number of parameters, in our case the complex multivariate dependence is modeled through a one- or two-parameter copula model; clearly, there is no need to estimate a whole covariance matrix or to introduce further assumptions on it in order to reduce the number of estimated parameters.
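For comparison, the MClust benchmark of (18) can be run in a few lines with the R mclust package (our sketch; the package's availability and the toy data are assumptions):

```r
library(mclust)  # assumed available; illustrates the benchmark in (18)

set.seed(1)
## Univariate Gaussian mixture data with three components.
y <- c(rnorm(100, 0, 1), rnorm(100, 5, 1), rnorm(100, 10, 1))
fit <- Mclust(y, G = 1:9)   # EM over candidate models; K chosen by BIC
summary(fit)                # selected number of components and parameters
head(fit$classification)    # hard cluster assignments
```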

3.2 A Toy Example

Suppose we have the $G \times S$ matrix in Eq. (10) with $G = 6$ and $S = 3$,
$$\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \\ x_{41} & x_{42} & x_{43} \\ x_{51} & x_{52} & x_{53} \\ x_{61} & x_{62} & x_{63} \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}, \qquad (19)$$
on which we apply the CoClust algorithm. Suppose also that at step 1.1 the procedure has selected $K = 2$ clusters and the doublet $(x_1, x_2)$, where $x_1$ comes, say, from the first margin, whereas $x_2$ comes from the second margin; in other words, the log-likelihood (11), computed by varying both $k = 2, 3, \dots, K_{max}$ and the $k$-plets of rows of the matrix (19), is maximum when $K = 2$, $g_{11} = 1$ and $g_{21} = 2$. Notice that at step 1.2 we have $C_{G,K} = \binom{G}{K} = \binom{6}{2} = 15$ possible doublets of observations/rows of matrix (19), together with their corresponding maximized log-likelihoods computed by using Eq. (11):
$$\begin{bmatrix} x_1 & x_2 & l_{x_1 x_2} \\ \dots & \dots & \dots \\ x_2 & x_6 & l_{x_2 x_6} \\ x_3 & x_4 & l_{x_3 x_4} \\ \dots & \dots & \dots \\ x_4 & x_6 & l_{x_4 x_6} \\ x_5 & x_6 & l_{x_5 x_6} \end{bmatrix}. \qquad (20)$$
At step 2.1 the algorithm selects the subset of $C_{G-K,K} = \binom{G-K}{K} = \binom{4}{2} = 6$ combinations of pairs that do not contain either $x_1$ or $x_2$, that is
$$\begin{bmatrix} x_3 & x_4 & l_{x_3 x_4} \\ \dots & \dots & \dots \\ x_4 & x_6 & l_{x_4 x_6} \\ x_5 & x_6 & l_{x_5 x_6} \end{bmatrix}, \qquad (21)$$
and selects the pair that maximizes the log-likelihood in Eq. (11), say $\{x_4, x_6\}$.

At step 2.2, in order to associate each observation to the margin that generated it, the algorithm computes the $2! = 2$ permutations of the selected


doublet $\{x_4, x_6\}$. Then, a copula model is estimated on each permutation, by using also the doublet $(x_1, x_2)$ allocated at step 1. At step 2.3, the algorithm selects the permutation for which we have (see Eq. (15))
$$\max\big\{\, l_{x_1 x_2 x_4 x_6}\big(\hat{\theta}_2\big);\; l_{x_1 x_2 x_6 x_4}\big(\hat{\theta}_2\big) \,\big\}, \qquad (22)$$
say $(x_6, x_4)$, and allocates such a doublet to the two clusters:
$$\begin{aligned} (x_1, x_6) &\to \text{Cluster 1} \\ (x_2, x_4) &\to \text{Cluster 2}. \end{aligned}$$
At step 3, we have one doublet of observations not allocated yet, namely $\{x_3, x_5\}$; hence, CoClust repeats the procedure described above and allocates it to the clusters. Supposing that the pair $(x_5, x_3)$ is selected, at the end of the procedure we obtain the following partition:
$$\begin{aligned} (x_1, x_6, x_5) &\to \text{Cluster 1} \\ (x_2, x_4, x_3) &\to \text{Cluster 2}, \end{aligned}$$
from which the 3-plet of observations $(x_1, x_6, x_5)$ has a complex dependence structure which has been captured through the copula model. The same argument holds also for $(x_2, x_4, x_3)$.

The process of creation of each $K$-plet of observations as described above provides each $K$-plet with a meaningful interpretation in terms of its dependence structure. This feature renders CoClust particularly suitable for applications with a small number of observations, typically when a subset of units has been selected as a result of a previous analysis.
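The combinatorial counts that appear in this example also drive the cost of the algorithm; a quick check in R (our illustration):

```r
## Combinations examined in the toy example (G = 6, K = 2):
choose(6, 2)           # 15 candidate doublets at step 1
choose(6 - 2, 2)       # 6 doublets at step 2.1
factorial(2)           # 2 permutations at step 2.2
## Step 1 in general: sum over k = 2..Kmax of choose(G, k) copula fits.
sum(choose(45, 2:3))   # e.g. G = 45, Kmax = 3 as in the simulations below
```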

4. Simulation Study

The purpose of this section is to evaluate the performance of CoClust and compare it with the model-based clustering proposed in Fraley and Raftery (1998). We consider four different simulation experiments in which we vary the DGP, the kind of margins and the estimation method. Table 1 summarizes the scenarios of the simulation study performed; these represent different situations that may occur in practice. Notice that in Scenarios 3 and 4 the DGP is not a copula model. With these scenarios we assess the performance of CoClust when the model is misspecified. In all the experiments the true number of clusters is set to 3, while the steps of the studies performed are as follows: i) generate a $G \times S$ data matrix from a trivariate DGP defined in Table 1 such that $G/3$ rows/observations belong to each margin; ii) apply the CoClust algorithm to the rows of such a matrix;

Table 1. Four scenarios of the simulation study performed.

Scenario   DGP                    Margins                   Approach
1          Frank copula           Gaussian                  Parametric
2          Frank copula           Gamma, Beta, Gaussian     Semi-parametric
3          Skew-Normal            Skew-Normal               Semi-parametric
4          Mixture of Gaussians   Gaussian                  Semi-parametric

iii) assess the performance of the clustering by means of three sets of measures that concern 1. the clusters' identification, 2. the dependence between clusters, and 3. the correspondence between the clusters and the margins of the true multivariate distribution. In detail, we have:

1. p.n.c.: percentage of replications in which the identified number of clusters is correct; p.w.s.: percentage of well-identified cluster sizes ($G/3$ in our case); p.c.a.: percentage of correctly allocated observations;

2. $\hat{\theta}_2^*$: average over replications of the post-clustering estimates of the dependence parameter; r.p.: rejection percentage of the null hypothesis $H_0: \theta_2 = 0$ with $\alpha = 0.05$;

3. the percentage of rejections of the null hypothesis on the kind of marginal probability model for each cluster (details in the next sections).

Note that this set of measures is computed w.r.t. the replications in which the identified number of clusters is correct. For each setting we simulate 200 replications with $G = 45$ and $S = 21$. When the DGP is not a copula model (Scenarios 3 and 4), we choose the best copula model among those presented in Section 2.2 on the basis of the BIC. Moreover, we compare CoClust with the model-based clustering (Fraley and Raftery 1998, 2000) described in Section 3.1, in its univariate version.
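As an illustration of how the allocation measure can be computed, here is a sketch of p.c.a. for a single replication (our own helper; the majority-vote matching of cluster labels to margins is an assumption we make to resolve label switching):

```r
## p.c.a.: share of rows allocated to the cluster that generated them.
## Cluster labels are matched to margins by majority vote (our assumption).
pca <- function(true, est) {
  map <- tapply(true, est, function(t) names(which.max(table(t))))
  mean(map[as.character(est)] == as.character(true))
}

pca(true = rep(1:3, each = 15),            # G = 45, G/3 rows per margin
    est  = sample(rep(1:3, each = 15)))    # a random allocation, for demo
```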

4.1 Scenario 1: Frank Copula and Gaussian Margins

In this first scenario we focus on a three-dimensional Frank copula function (Eq. (4)) and on normal margins. We perform different simulations by varying:

1. the value of the dependence parameter ($\theta_2 = 10$, mild; $\theta_2 = 21$, strong);
2. the values of the marginal parameters, $\beta_k = (\mu_k, \sigma_k)$ ($k = 1, 2, 3$).


As regards the margins, we investigate the following three cases:

1. well-separated margins:
$$\begin{cases} \mu_1 + 3\sigma_1 < \mu_2 - 3\sigma_2 \\ \mu_2 + 3\sigma_2 < \mu_3 - 3\sigma_3; \end{cases} \qquad (23)$$

2. overlapping margins: at least one of the two conditions expressed in (23) does not hold;

3. nested margins:
$$\begin{cases} \mu_1 + 3\sigma_1 \leq \mu_2 + 3\sigma_2 \leq \mu_3 + 3\sigma_3 \\ \mu_1 - 3\sigma_1 \geq \mu_2 - 3\sigma_2 \geq \mu_3 - 3\sigma_3. \end{cases}$$

We call well-separated the margins that have less than $(0.03/2)\%$ of overlapping observations; the opposite situation occurs when we have three nested margins, so that more than 99.7% of the observations of each distribution overlap. The intermediate situation is called overlapping. The performance is assessed by means of the three sets of measures introduced above; in particular, the third set of measures includes the rejection percentage of the null hypothesis on the mean, the variance and Gaussianity in each cluster, by means of Student's t test, the chi-square test and the Kolmogorov-Smirnov test, respectively, with $\alpha = 0.05$. Notice that the test on $\theta_2$ is based on the result of Joe (1997) presented in Eq. (8).

Tables 2 and 3 present the results. In particular, Table 2 reports the first two sets of measures, while Table 3 contains information about the goodness of fit of the clustering procedure. From these simulations we may argue that our clustering algorithm is able to find the correct number of clusters and the true cluster sizes in almost every situation, including the case of nested margins. Moreover, it appears to be able to identify the 'true' clusters, since the percentage of correctly allocated observations equals 100% in all the situations investigated, except for the nested case with $\theta_2 = 10$, in which this percentage is still very high (83.9%). In about 90% of the replications CoClust finds clusters of observations according to the true dispersion structure of the data, irrespective of i) the degree of overlap of the margins and ii) the level of dependence between the margins.

Now we compare the performance of CoClust with that of MClust. The results are shown in Tables 4 and 5. Clearly, the model-based clustering method is able to identify the correct number of clusters when the margins are not nested, while it fails to do so in the nested margin case: the p.n.c. decreases dramatically from 100% to less than 10% for both levels of dependence.
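For concreteness, a sketch of how one replication of this scenario can be generated in R (copula package assumed; the specific $\mu_k$, $\sigma_k$ are illustrative choices satisfying (23), and the pairing of rows into dependent triplets is our reading of the design):

```r
library(copula)  # a sketch of the Scenario 1 DGP; theta_2 and sizes from text

set.seed(123)
G <- 45; S <- 21
mu <- c(0, 10, 20); sigma <- c(1, 1, 1)  # illustrative well-separated margins
X <- matrix(NA_real_, G, S)
## Each triple of rows (g, g + G/3, g + 2G/3) is drawn jointly from the
## trivariate Frank copula, so rows in different clusters are dependent
## while rows within the same cluster are independent replicates.
for (g in 1:(G / 3)) {
  u <- rCopula(S, frankCopula(param = 10, dim = 3))  # S dependent triplets
  for (k in 1:3)
    X[g + (k - 1) * G / 3, ] <- qnorm(u[, k], mu[k], sigma[k])
}
```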

Table 2. CoClust performance: Scenario 1.

Dependence   Kind of          Clustering                     Dependence
parameter    margins          p.n.c.   p.w.s.   p.c.a.       θ̂2*     r.p.
θ2 = 21      Well-separated   95.6%    95.6%    100%         13.57    85%
             Overlapping      100%     100%     100%         11.21    91%
             Nested           100%     100%     100%         12.66    90%
θ2 = 10      Well-separated   95.5%    95.5%    100%         6.41     89.0%
             Overlapping      89.5%    89.5%    100%         5.19     82.5%
             Nested           93.9%    93.9%    83.9%        6.35     84.8%

Table 3. CoClust performance: Scenario 1. Ck, with k = 1, 2, 3, indicates the clusters. Entries are rejection percentages of the null hypotheses on the mean, the variance and the normality of each cluster.

Dependence   Kind of          Mean             Variance         Normality
parameter    margins          C1   C2   C3     C1   C2   C3     C1    C2    C3
θ2 = 21      Well-separated   6%   9%   9%     0%   0%   0%     3%    9%    3%
             Overlapping      7%   4%   4%     0%   0%   0%     4%    4%    7%
             Nested           8%   5%   5%     0%   0%   0%     15%   7%    7%
θ2 = 10      Well-separated   2%   8%   2%     0%   0%   0%     3%    11%   5%
             Overlapping      6%   6%   8%     0%   0%   0%     12%   2%    8%
             Nested           3%   5%   3%     0%   0%   0%     6%    5%    0%

Moreover, it mis-allocates about half of the observations in all the cases. As for the rejection percentage of the null hypothesis on the dependence parameter, its performance is not satisfactory, independently of the degree of overlap and the strength of dependence. In both tables, the performance of the model-based clustering worsens as the level of overlap of the margins increases.

4.2 Scenario 2: Frank Copula, Asymmetric Margins

In the second scenario the DGP is the three-dimensional Frank copula function (Eq. (4)) with $\theta_2 = 10$. Differently from Scenario 1, we change the kind of margins and the estimation method for the copula. Here, the three margins are $X_1 \sim \text{Gamma}(3, 4)$, $X_2 \sim \text{Beta}(2, 1)$ and $X_3 \sim \text{Gaussian}(7, 2)$, and we use the semi-parametric estimation method described in Section 2.3. We test the equality between the marginal probability distributions and the clusters by using the Kolmogorov-Smirnov test ($\alpha = 0.05$).

The results are summarized in Table 6. As in the previous scenario, CoClust shows a very satisfactory performance even if the margins are different and are estimated through the empirical distribution function.

Table 4. MClust performance: Scenario 1.

Dependence   Kind of          Clustering                     Dependence
parameter    margins          p.n.c.   p.w.s.   p.c.a.       θ̂2*     r.p.
θ2 = 21      Well-separated   100%     37%      49.99%       9.98     37%
             Overlapping      100%     21%      49.92%       5.65     10%
             Nested           6%       0%       32.85%       0.00     0%
θ2 = 10      Well-separated   100%     35%      49.98%       4.82     28%
             Overlapping      100%     23%      49.91%       3.10     15%
             Nested           10%      0%       32.67%       0.10     0%

Table 5. MClust performance: Scenario 1. Entries are rejection percentages of the null hypotheses on the mean, the variance and the normality of each cluster.

Dependence   Kind of          Mean               Variance         Normality
parameter    margins          C1    C2    C3     C1   C2   C3     C1    C2    C3
θ2 = 21      Well-separated   4%    3%    5%     0%   0%   0%     4%    4%    4%
             Overlapping      7%    9%    7%     0%   0%   0%     5%    6%    3%
             Nested           50%   50%   100%   0%   0%   0%     33%   33%   67%
θ2 = 10      Well-separated   6%    5%    3%     0%   0%   0%     4%    3%    2%
             Overlapping      9%    10%   8%     0%   0%   0%     4%    6%    7%
             Nested           40%   70%   90%    0%   0%   0%     20%   50%   50%

The MClust algorithm, by contrast, is not able to correctly identify the number of margins/clusters, their sizes, or their composition.

4.3 Scenario 3: Multivariate Skew-Normal

In this scenario the DGP comes from the skew-normal family of distributions (Azzalini and Dalla Valle 1996; Azzalini and Capitanio 1999). In particular, we generate data from a three-variate skew normal based on the parametrization given in Azzalini and Capitanio (1999); we set the mean vector to $(4, 6, 7)$, the 'shape' parameter to $(-1, 1, 1)$ and the off-diagonal elements of the correlation matrix to $0.7$. We adopt a semi-parametric approach to fit the multivariate distributions. The marginal distributions of the data generating process (univariate skew-normal distributions with shape parameter vector equal to $(0.26, 0.79, 0.79)$) are compared with the identified clusters by using the Kolmogorov-Smirnov test.
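A sketch of this DGP with the R sn package (our assumption; we treat the stated mean vector as the location parameter xi of the direct parametrization, which is an approximation on our part):

```r
library(sn)  # assumed available; a sketch of the Scenario 3 DGP

Omega <- matrix(0.7, 3, 3); diag(Omega) <- 1   # 0.7 off-diagonal correlations
## Location (4, 6, 7) and shape (-1, 1, 1); mapping the text's "mean vector"
## to xi is our simplifying assumption.
z <- rmsn(n = 21, xi = c(4, 6, 7), Omega = Omega, alpha = c(-1, 1, 1))
colMeans(z)  # the margins are univariate skew-normal
```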

Table 6. CoClust vs. MClust performance: Scenario 2. The last three columns report the rejection percentages of the Kolmogorov-Smirnov test comparing each cluster with the corresponding margin.

Method    p.n.c.   p.w.s.   p.c.a.    θ̂2*     r.p.    C1     C2     C3
CoClust   95%      95%      76.67%    9.7      93.7%   9.5%   5.3%   6.6%
MClust    26.2%    0%       41.28%    −0.06    2.5%    100%   100%   100%

Table 7. CoClust vs. MClust performance: Scenario 3.

Method    p.n.c.   p.w.s.   p.c.a.    θ̂2*     r.p.    C1     C2     C3
CoClust   67%      67%      95.63%    0.58     100%    0%     0%     0%
MClust    0%       0%       9.20%     0.02     7%      100%   100%   100%

The results are summarized in Table 7. Also in this case CoClust outperforms MClust, even if the DGP is not a copula model. CoClust always rejects the null hypothesis of independence and, as for the percentage of well-identified numbers of clusters and cluster sizes, its performance is quite satisfactory, since in 67% of the replications it is able to find the correct values. Furthermore, the high value of the p.c.a. and the results of the Kolmogorov-Smirnov test show that CoClust is able to identify the 'true' clusters and the correct model for the margins. The model-based clustering procedure, by contrast, does not perform well even though all the margins are nearly Gaussian.

4.4 Scenario 4: Multivariate Mixed Normal

In this scenario the DGP is a Gaussian mixture model with three components. In particular, we set the mean and the standard deviation vectors of the margins to $(5, 6, 8)$ and $(3, 1, 0.5)$, respectively. Moreover, the correlation between margins is $\text{Corr}(X_1, X_2) = \text{Corr}(X_1, X_3) = 0.7$ and $\text{Corr}(X_2, X_3) = 0.5$. As in Scenarios 2 and 3, we adopt a semi-parametric approach to fit the multivariate distributions; moreover, we compare the margins through the Kolmogorov-Smirnov test.

In this scenario the performance of CoClust is similar to that of MClust as regards the identified number of clusters and the percentage of correctly allocated observations (see the first and third columns of Table 8). Notice that here MClust performs better than in any other scenario. Moreover, when CoClust manages to find the correct number of clusters, it also builds clusters with the correct size and identifies the correct probability model for the margins.
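A simplified sketch of the correlated margins of this DGP (our illustration using MASS::mvrnorm; the full mixture construction of the paper is not reproduced here):

```r
library(MASS)  # mvrnorm; a simplified sketch of the Scenario 4 setting

mu    <- c(5, 6, 8)
sigma <- c(3, 1, 0.5)
R <- matrix(c(1,   0.7, 0.7,
              0.7, 1,   0.5,
              0.7, 0.5, 1), 3, 3)           # correlations from the text
Sigma <- diag(sigma) %*% R %*% diag(sigma)  # covariance matrix
z <- mvrnorm(n = 21, mu = mu, Sigma = Sigma)
## Each column plays the role of one (Gaussian) margin/cluster.
```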

Table 8. CoClust vs. MClust performance: Scenario 4.

Method    p.n.c.   p.w.s.   p.c.a.    θ̂2*    r.p.    C1     C2     C3
CoClust   39%      39%      54.62%    0.72    100%    0%     0%     0%
MClust    32%      0%       57.20%    0.05    11%     68%    68%    68%

5. Empirical Applications: Clustering Genes of Breast-Cancer Patients

In this section we apply the CoClust algorithm to the real microarray data set of Hedenfalk et al. (2001). Twenty-one patients were observed: seven carriers of the BRCA1 mutation, seven carriers of the BRCA2 mutation and seven patients with sporadic cases of breast cancer. For a complete description of the data set see also http://research.nhgri.nih.gov/microarray/NEJM Supplement. Gene expression ratios included in the data file were derived from the ratio of the fluorescent intensity from a tumor sample to the fluorescent intensity from a common reference sample (MCF-10A, used for all 21 microarray experiments). Therefore, a ratio may take values from 0 to infinity. We focus on the set of 51 genes whose variation in expression among all experiments best differentiated among these types of cancers (Hedenfalk et al. 2001). The data set is available at http://research.nhgri.nih.gov/microarray/NEJM Supplement/Images/GeneList51.pdf.

We split the data set into three different sets of gene expressions according to the type of cancer sample observed (BRCA1, BRCA2 and Sporadic), and we apply the CoClust algorithm to the gene expressions of these three sets. We use a Frank copula and we log-transform the data in order to achieve symmetry. Finally, we assess the biological significance of our results by considering the distributions of gene annotations across the clusters: we investigate the cellular localization and the biological processes in which the genes are involved, as provided by the UniGene human-sequence collection (available at http://ncbi.nlm.nih.gov/sites/entrez?db=unigene) and the Gene Ontology consortium (available at http://amigo.geneontology.org/cgi-bin/amigo/go.cgi).

Table 9 shows the results for the copula model estimated on each of the three data sets analyzed. The estimated margins for the three data sets are shown in Figure 1. Tables 10, 11 and 12 show a cluster in each column, whereas each row contains a K-plet of dependent genes.
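A sketch of how the post-clustering quantities in Table 9 ($\hat{\theta}_2$, its standard error, the log-likelihood) can be obtained with the R copula package; the data matrix x, the clustering vector cl and the stacking of each cluster into one margin are hypothetical illustrations, not the authors' code:

```r
library(copula)  # assumed; not the authors' CoClust package

## x: matrix of log expression ratios (rows = genes); cl: labels of K
## equal-sized clusters (both hypothetical). Rows of each cluster are
## stacked into one margin and a K-dimensional Frank copula is fitted
## semi-parametrically, as in eq. (9).
fit_clusters <- function(x, cl) {
  K <- length(unique(cl))
  m <- sapply(split(as.data.frame(x), cl),
              function(block) as.vector(t(as.matrix(block))))
  fitCopula(frankCopula(dim = K), pobs(m), method = "mpl")
}
## summary(fit_clusters(x, cl)) would report theta_2-hat, s.e. and logLik.
```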

Table 9. Dependence analysis of 51 genes observed in three different sets of cancer samples. The p-value refers to the test of H0: θ2 = 0.

Kind of Sample   K   θ̂2      se(θ̂2)   p-value    Log-Lik
BRCA1            3   3.792    0.436     < 0.001    40.386
BRCA2            5   3.792    0.368     < 0.001    69.707
Sporadic         5   4.041    0.388     < 0.001    71.476

[Figure 1 here: three density panels (a), (b), (c); vertical axis 'Density'; axis-tick residue omitted.]

Figure 1. Gaussian margins from clustering of BRCA1 (a), BRCA2 (b) and Sporadic (c) cancer samples. Each curve represents a different cluster.

As for the BRCA1 mutation cancer samples (Table 10), the CoClust algorithm reveals dependence between genes involved in similar biological processes, e.g. the defense response, with the genes Myxovirus resistance 2 and Zinc finger protein 161, both of them components of the nucleus, but also between genes involved in different biological processes, like the polyamine metabolism, the phospholipid metabolism and the negative regulation of cell proliferation (human mRNA for ornithine decarboxylase antizyme, ORF 1 and ORF 2; glutathione peroxidase 4; transducer of ERBB2, 1, respectively). Notice that selenophosphate synthetase and minichromosome maintenance deficient 7 are dependent and have similar molecular functions (e.g. ATP and nucleotide binding) but are involved in different biological processes (e.g. cell cycle and protein modification). Remarkably, the candidate gene for tumor suppression (suppression of tumorigenicity 13) is related to integrin, beta 8, which mediates cell-cell

Table 10. Clustering of 51 genes in BRCA1 mutation cancer samples. Each row (Obs) is a triplet of dependent genes: Cluster 1 | Cluster 2 | Cluster 3.

1. ests | myxovirus resistance 2 | zinc finger protein 161
2. dkfzp564m2423 protein | phosphofructokinase, platelet | phosphofructokinase, platelet
3. d123 gene product | gdp dissociation inhibitor 2 | chromobox homolog 3
4. interleukin enhancer binding factor 2, 45kD | transcription factor AP-2 gamma | kiaa0601 protein
5. forkhead box M1 | nuclease sensitive element binding protein 1 | udp-galactose transporter related
6. h. mrna for ornithine decarboxyl. antiz. | transducer of erbb2, 1 | glutathione peroxidase 4
7. integrin, beta 8 | suppression of tumorigenicity 13 | hydroxyacyl
8. ctp synthase | thyroid autoantigen 70kD | cytochrome c oxidase subunit VIc
9. selenophosphate synthetase | phytanoyl-CoA hydroxylase | minichromosome maintenance deficient 7
10. butyrate response factor 1 | very low density lipoprotein receptor | ests
11. kiaa0246 protein | low density lipoprotein-related protein 1 | arp1 homolog A
12. ests | platelet-derived growth fact. beta polyp. | ests
13. carbamoyl-phosphate synthetase 2 | ests | protein phosphatase 1
14. cyclin-dependent kinase 4 | retinoblastoma-like 2 | cold shock domain protein A
15. tumor protein p53-binding prot., 2 | guan nucleot binding protein, alpha inhibit activ polypept 3 | keratin 8
16. s-phase response | armadillo rep. gene deleted in velocard. synd. | proliferating cell nuclear antigen
17. myotubularin related protein 4 | nitrogen fixation cluster-like | apex nuclease

Table 11. Clustering of 51 genes in BRCA2 mutation cancer samples. Each row (Obs) is a 5-plet of dependent genes: Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5.

1. cold shock domain protein A | retinoblastoma-like 2 | gdp dissociation inhibitor 2 | tumor protein p53-binding prot., 2 | ests
2. nuclease sensitive el. binding prot. 1 | d123 gene product | integrin, beta 8 | ests | suppression of tumorigenicity 13
3. hydroxyacyl | ests | guan. nucleot. binding prot. 3 | low density lipoprotein-related protein 1 | myxovirus resistance 2
4. proliferating cell nuclear antigen | ests | cyclin-dependent kinase 4 | forkhead box M1 | myotubularin related protein 4
5. protein phosphatase 1 | apex nuclease | thyroid autoantigen 70kD | very low density lipoprotein receptor | interleukin enhancer binding factor 2, 45kD
6. phosphofructokinase, platelet | phosphofructokinase, platelet | minichromosome maintenance deficient 7 | kiaa0246 protein | udp-galactose transporter related
7. keratin 8 | s-phase response | transducer of erbb2, 1 | carbamoyl-phosphate synthetase 2 | zinc finger protein 161
8. kiaa0601 protein | ests | transcription factor AP-2 gamma | cytochrome c oxidase subunit VIc | armadillo rep. gene deleted in velocard. synd.
9. dkfzp564m2423 protein | chromobox homolog 3 | arp1 homolog A | h. mrna for ornithine decarboxyl. antiz. | nitrogen fixation cluster-like
10. ctp synthase | phytanoyl-CoA hydroxylase | glutathione peroxidase 4 | selenophosphate synthetase | platelet-derived growth fact. beta polyp.

Table 12. Clustering of 51 genes in Sporadic cancer samples. Each row (Obs) is a 5-plet of dependent genes: Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5.

1. d123 gene product | ctp synthase | s-phase response | selenophosphate synthetase | forkhead box M1
2. phosphofructokinase, platelet | phosphofructokinase, platelet | dkfzp564m2423 protein | cold shock domain protein A | apex nuclease
3. thyroid autoantigen 70kD | suppression of tumorigenicity 13 | integrin, beta 8 | tumor protein p53-binding prot., 2 | interleukin enhancer binding fac. 2
4. glutathione peroxidase 4 | chromobox homolog 3 | udp-galactose transporter related | kiaa0246 protein | myotubularin related protein 4
5. keratin 8 | hydroxyacyl | very low density lipoprotein receptor | arp1 homolog A | transducer of erbb2, 1
6. guan nucleot binding prot. | minichromosome maintenance deficient 7 | butyrate response factor 1 | proliferating cell nuclear antigen | cyclin-dependent kinase 4
7. ests | nitrogen fixation cluster-like | low density lipoprotein-related protein 1 | phytanoyl-CoA hydroxylase | myxovirus resistance 2
8. transcription factor AP-2 gamma | kiaa0601 protein | retinoblastoma-like 2 | ests | ests
9. nuclease sensitive el. binding prot. 1 | protein phosphatase 1 | zinc finger protein 161 | gdp dissociation inhibitor 2 | carbamoyl-phosphate synthetase 2
10. ests | h. mrna for ornithine decarboxyl. antiz. | armadillo rep. gene deleted in velocard. synd. | cytochrome c oxidase subunit VIc | platelet-derived growth fact. beta polyp.

and cell-extracellular interactions. We conclude by observing that CoClust has grouped genes allocated to the nucleus in the same cluster.

As for the BRCA2 mutation cancer samples (Table 11), we note that transducer of ERBB2, 1 and zinc finger protein 161 are dependent and involved in the negative regulation of cell proliferation and in the cellular defense response, respectively, revealing that the defense response mediated by a cell could interact with any process that stops, prevents or reduces the rate or extent of cell proliferation. Moreover, notice that the movement of substances, either into, out of, or within a cell, performed by ARP 1 homolog A, is dependent on the formation or destruction of chromatin structures, performed by Chromobox homolog 3. Also, notice that the candidate gene for tumor suppression is related to integrin, beta 8. Moreover, the second cluster is homogeneous w.r.t. the cellular components (most of its genes are located in the nucleus of a cell).

As regards the Sporadic cancer samples (Table 12), CoClust reveals a dependence between the glycolysis biological process, the transcription from the RNA polymerase II promoter and the negative regulation of transcription from the RNA polymerase II promoter (phosphofructokinase, platelet; APEX nuclease; and Cold shock domain protein A). An interesting result is that the Human mRNA for ornithine decarboxylase antizyme, ORF 1 and ORF 2, is dependent on the Armadillo repeat gene deleted in velocardiofacial syndrome; indeed, since the polyamine level increases in cancer cells, it could be important that a gene involved in the formation of adherens junction complexes (which are thought to facilitate communication between the inside and outside environments of a cell) is related to the polyamine metabolism. In addition, CoClust reveals dependence between Butyrate response factor 1, Proliferating cell nuclear antigen and Cyclin-dependent kinase 4, showing that the regulation of mRNA stability, which modulates the propensity of mRNA molecules to degradation, is related to i) the regulation of gene expression, the cell cycle and its division, and ii) DNA repair and the regulation of DNA replication. Notice that the candidate gene for tumor suppression (suppression of tumorigenicity 13) is again related to integrin, beta 8. Finally, note that the clusters are quite homogeneous w.r.t. the kind of cellular components.

By comparing the results obtained in the three different cancer samples, we observe that the number of identified clusters is the same for the BRCA2 and Sporadic cancer samples and that they share 10 couples and 1 triplet of dependent genes. The clusterings of the BRCA1 and BRCA2 mutation cancer samples, by contrast, have 5 couples of dependent genes in common, whereas the BRCA1 and Sporadic clusterings are totally different except for three cases (carbamoyl-phosphate synthetase 2 and protein phosphatase 1; dkfzp564m2423 protein, phosphofructokinase, platelet and phosphofructokinase, platelet; integrin, beta 8 and suppression of tumorigenicity 13).


In conclusion, we stress that, on the basis of our findings, the candidate gene for tumor suppression is dependent on integrin, beta 8 in all three cases.

6. Discussion and Conclusions

In this paper we have proposed a new clustering algorithm (CoClust) based on copula functions. The algorithm is free from sources of bias coming either from an initial choice of the number of clusters or from a starting classification. Since it is based on copulas, CoClust can account for complex dependence relationships between observations. The algorithm is able to group observations according to their underlying dependence structure; this is true independently of the kind and degree of overlap of the margins, the strength of the dependence, the estimation method and the data generating process used. Moreover, CoClust, tested on simulated data in different settings, is able to identify both the true number of clusters and their sizes in most situations. We have compared the performance of the proposed algorithm with that of model-based clustering, and the results show that CoClust appears better suited to cluster dependent data.

The algorithm we propose is useful whenever the multivariate dependence structure of the data generating process matters. In the microarray data analysis context, CoClust is a method for discovering gene-gene interactions with the advantage (among others) that the dependence between genes is not assumed to be linear. We have applied CoClust successfully to three real microarray data sets and, among other things, we have found known dependence relationships between the genes and the biological processes in which they are involved. In general, a cluster represents homogeneous mean and variance gene expression levels. However, nothing can be said in advance on the expression levels of different clusters. Indeed, we have shown that CoClust performs well even if the margins are nested, so that they can have the same mean levels.

The price we have to pay for such good performance and flexibility is the high computational burden. In fact, the computational complexity is governed by the first step of the procedure, where $\sum_{k=2}^{K_{max}} \binom{G}{k}$ likelihood functions are computed and maximized. Simple approximations show that $\sum_{k=2}^{K_{max}} \binom{G}{k} \leq (G+1)\big[(G+1)^{k-1} - 1\big] \approx O(G^k)$. Such complexity makes the algorithm in its present form suited to small or medium size problems. We have performed a first step towards addressing the issue by implementing a parallelized version of the algorithm, which is available in our R package (see Supplementary Material at http://www2.stat.unibo.it/giannerini/coclust). Also, the computational complexity can be reduced from $O(G^k)$ to $O((G/h)^k)$, where $h \in \mathbb{N}$ is chosen so as to keep $G/h$ acceptable.


In order to attain this aim, the algorithm should allocate $kh$ observations at a time instead of grouping $k$ observations at a time. This solution is feasible once a scheme for eliminating the influence of the initial selection of observations is provided.

Another issue we are going to investigate in future work is that of cluster sizes. In fact, the CoClust method forms clusters of equal size and leaves out up to $K - 1$ observations, where $K$ is the number of clusters. Such observations can be allocated by resorting to imputation methods. In practice, at the end of the procedure, suppose we have $K - m$ left-out observations to be allocated. As in step 3.2 of the algorithm, we compute the $K!/m!$ permutations, where each permutation is composed of $K - m$ available values and $m$ imputed values (through e.g. the EM algorithm, donor imputation methods, and so on). The subsequent steps of the algorithm do not change; once a permutation of the $K$-plet has been chosen, only the $K - m$ available observations are allocated to the clusters.

The research performed could be extended in different directions. First, it would be interesting to study in detail model selection criteria and their influence upon the clustering algorithm. Second, it might be useful to introduce a shrinkage estimation technique for copula models in order to avoid the inconvenience due to the 'small n, large p' paradigm, even if it influences the precision of the estimates only in the first steps of the algorithm. Also, it could be possible to combine the dependence between observations and the dependence between variables: this may be achieved by using a convex linear combination of copulas. Finally, another application could pertain to the use of the CoClust method for grouping summary statistics, e.g. t-statistics.

References

AZZALINI, A., and CAPITANIO, A. (1999), "Statistical Applications of the Multivariate Skew-Normal Distribution", Journal of the Royal Statistical Society, Series B, 61, 579-602.
AZZALINI, A., and DALLA VALLE, A. (1996), "The Multivariate Skew-Normal Distribution", Biometrika, 83, 715-726.
CHERUBINI, U., LUCIANO, E., and VECCHIATO, W. (2004), Copula Methods in Finance, Wiley Finance Series, Chichester: John Wiley & Sons Ltd.
CHIPMAN, H., and TIBSHIRANI, R. (2006), "Hybrid Hierarchical Clustering with Applications to Microarray Data", Biostatistics, 7(2), 286-301.
DI LASCIO, F.M.L. (2008), "Analyzing the Dependence Structure of Microarray Data: A Copula-Based Approach", PhD thesis, Dipartimento di Scienze Statistiche, Università di Bologna, Italy, http://amsdottorato.cib.unibo.it/670/.
EISEN, M.B., SPELLMAN, P.T., BROWN, P.O., and BOTSTEIN, D. (1998), "Cluster Analysis and Display of Genome-Wide Expression Patterns", Proceedings of the National Academy of Sciences, 95, 14863-14868.
FRALEY, C., and RAFTERY, A.E. (1998), "How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis", The Computer Journal, 41(8), 578-588.


FRALEY, C., and RAFTERY, A.E. (2000), "Model-Based Clustering, Discriminant Analysis and Density Estimation", Technical Report, University of Washington, Department of Statistics.
FRIEDMAN, N., LINIAL, M., NACHMAN, I., and PE'ER, D. (2000), "Using Bayesian Networks to Analyze Expression Data", Journal of Computational Biology, 7(3), 601-620.
GODAMBE, V.P. (1960), "An Optimum Property of Regular Maximum Likelihood Estimation", Annals of Mathematical Statistics, 31, 1208-1211.
HEDENFALK, I., DUGGAN, D., CHEN, Y., RADMACHER, M., BITTNER, M., SIMON, R., MELTZER, P., GUSTERSON, B., ESTELLER, M., KALLIONIEMI, O.P., WILFOND, B., BORG, A., DOUGHERTY, E., KONONEN, J., BUBENDORF, L., FEHRLE, W., PITTALUGA, S., GRUVBERGER, S., LOMAN, N., JOHANNSSON, O., OLSSON, H., and SAUTER, G. (2001), "Gene-Expression Profiles in Hereditary Breast Cancer", The New England Journal of Medicine, 344(8), 539-548.
JOE, H. (1997), Multivariate Models and Dependence Concepts, Vol. 73 of Monographs on Statistics and Applied Probability, London: Chapman & Hall.
JOE, H., and XU, J. (1996), "The Estimation Method of Inference Functions for Margins for Multivariate Models", Technical Report, University of British Columbia, Department of Statistics.
MADEIRA, S.C., and OLIVEIRA, A.L. (2004), "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE Transactions on Computational Biology and Bioinformatics, 1(1), 24-45.
MAR, J., and MCLACHLAN, G.J. (2003), "Model-Based Clustering in Gene Expression Microarrays: An Application to Breast Cancer Data", in First Asia-Pacific Bioinformatics Conference, Research and Practice in Information Technology, 19, pp. 139-144.
MOREAU, Y., DE SMET, F., and THIJS, G. (2002), "Functional Bioinformatics of Microarray Data: From Expression to Regulation", Proceedings of the IEEE, 90(11), pp. 1722-1743.
NELSEN, R.B. (2006), An Introduction to Copulas, New York: Springer.
PAN, W., LIN, J., and LE, C.T. (2002), "Model-Based Cluster Analysis of Microarray Gene-Expression Data", Genome Biology, 3(2), research0009.1-0009.8.
SCHWEIZER, B., and SKLAR, A. (1983), Probabilistic Metric Spaces, New York: North-Holland.
SKLAR, A. (1959), "Fonctions de répartition à n dimensions et leurs marges", Publications de l'Institut de Statistique de l'Université de Paris, 8, 229-231.
SØRLIE, T., PEROU, C., TIBSHIRANI, R., AAS, T., GEISLER, S., JOHNSEN, H., HASTIE, T., EISEN, M., VAN DE RIJN, M., JEFFREY, S.S., THORSEN, T., QUIST, H., MATESE, J.C., BROWN, P.O., BOTSTEIN, D., EYSTEIN LØNNING, P., and BØRRESEN-DALE, A.L. (2001), "Gene Expression Patterns of Breast Carcinomas Distinguish Tumor Subclasses with Clinical Implications", Proceedings of the National Academy of Sciences of the United States of America, 98, 10869-10874.
TAVAZOIE, S., HUGHES, J.D., CAMPBELL, M.J., CHO, R.J., and CHURCH, G.M. (1999), "Systematic Determination of Genetic Network Architecture", Nature Genetics, 22(3), 281-285.
YEUNG, K.Y., FRALEY, C., MURUA, A., RAFTERY, A.E., and RUZZO, W.L. (2001), "Model-Based Clustering and Data Transformation for Gene Expression Data", Bioinformatics, 17(10), 977-987.
