Journal of Classification 26:77-101 (2009) DOI: 10.1007/s00357-009-9026-z
Robust Double Clustering: A Method Based on Alternating Concentration Steps
Alessio Farcomeni, University of Rome "La Sapienza"

Abstract: We propose two algorithms for robust two-mode partitioning of a data matrix in the presence of outliers. First we extend the robust k-means procedure to the case of biclustering; then we slightly relax the definition of outlier and propose a more flexible and parsimonious strategy, which is however inherently less robust. We discuss the breakdown properties of the algorithms, and illustrate the methods with simulations and three real examples.

Keywords: Biclustering; Double clustering; Microarrays; Robustness; Outliers.
1. Introduction
Let X be an n by p observed data matrix. Classical (nonhierarchical) methods for cluster analysis deal with the identification of similarities among the rows of X, by claiming the existence of I [...]

Algorithm 3 (choice of the number of outliers):
while G > δ do
    if G(o1 + s1, o2) ≥ G(o1, o2 + s2) then
        Set G = G(o1 + s1, o2). Increment o1 by s1.
    else if G(o1, o2 + s2) > G(o1 + s1, o2) then
        Set G = G(o1, o2 + s2). Increment o2 by s2.
    end if
end while
Here G(o1, o2) measures the change in the centroid matrix. The choice of δ depends on the application and on the sample size. We suggest choosing smaller values of δ for large matrices, since a single outlier may not yield a big change in the centroid matrix in the presence of many observations. Unless stated otherwise, in this paper we will use δ = 0.05 and s1 = s2 = 1. Larger step sizes can be used to avoid being trapped in local maxima by the presence of grouped outliers, and to speed up the algorithm. We recommend in general using a relatively low δ, in order to possibly overestimate the number of outlying locations. A single undetected outlier can spoil the procedure, while a discarded uncontaminated entry usually yields little information loss. Such an automatic choice of the number of outliers may lead to an algorithm that fully embeds the spirit of robust statistics. If in fact there is no contamination, o1 and o2 will likely be set to zero or close to it, so that the robust algorithm will give results comparable to the classical algorithm. If there is contamination, then o1 and o2 will be greater than zero, and outliers will not lead to unreliable results. Secondly, the G statistic can also be plotted in order to evaluate the level of contamination of the data, either as a function of o1 for fixed o2 or as a function of o2 for fixed o1; alternatively, a 3D plot can be generated. A thorough investigation of the properties of the G statistic, and of the effect of the choices of s1, s2 and δ, is grounds for further work. We note that the same procedure will be used for Algorithm 1 and for Algorithm 2, even if the way they handle the outliers is different. Since each increment in o1 or o2 has a much different impact for the two clustering algorithms, we may end up with lower values of o1 and o2 when dealing with Algorithm 1 than with Algorithm 2. Furthermore, the proposed procedure for choosing the number of outliers does not allow the detection of outlying observations that are not clearly remote, like bridge points between two clusters. These bridge points may be detected afterwards by examining their distance from the assigned centroid, with an approach similar to Cho et al. (2004).
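As a rough illustration of this search, the following Python sketch implements the greedy loop above for a generic robust double k-means fit. The function fit_centroid and the specific form of G (here, the relative change of the centroid matrix when one more outlier is allowed for) are our own placeholders, since the paper does not prescribe an implementation.

import numpy as np

def choose_outliers(X, I, J, fit_centroid, delta=0.05, s1=1, s2=1, max_steps=100):
    # fit_centroid(X, I, J, o1, o2) is assumed to return the centroid matrix
    # of a robust double k-means fit with o1 row and o2 column outliers.
    o1, o2 = 0, 0
    current = fit_centroid(X, I, J, o1, o2)

    def G(a, b):
        # relative change of the centroid matrix when (o1, o2) moves to (a, b)
        candidate = fit_centroid(X, I, J, a, b)
        return np.linalg.norm(candidate - current) / max(np.linalg.norm(current), 1e-12)

    for _ in range(max_steps):
        g_row, g_col = G(o1 + s1, o2), G(o1, o2 + s2)
        if max(g_row, g_col) <= delta:
            break                 # allowing more outliers changes little: stop
        if g_row >= g_col:
            o1 += s1              # an extra row outlier changes the centroid more
        else:
            o2 += s2              # an extra column outlier changes the centroid more
        current = fit_centroid(X, I, J, o1, o2)
    return o1, o2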
5. Robustness Properties
There are many available methods to evaluate the robustness properties of a procedure (Huber 1981; Hampel, Rousseeuw, Ronchetti, and Stahel 1986). We will focus here on global measures of robustness, given by breakdown values. Hodges (1967) and Donoho and Huber (1983) define a finite-sample breakdown value as the smallest fraction of outliers that can break down the estimate in a sample. It is particularly interesting for us to define breakdown values for the setting of biclustering. Let $X_M$ be a data set in which $M$ entries out of the $np$ are replaced by arbitrary values. We define the cell breakdown point for the centroid matrix as
$$\varepsilon^{cell}_{np}(\bar{X}, X) = \frac{1}{np}\, \min\Big\{M : \sup_{X_M} \|\bar{X}(X) - \bar{X}(X_M)\| = \infty\Big\},$$
where $\bar{X}(X)$ is the centroid matrix computed on data $X$, and for ease of notation we suppress the dependence on $I$, $J$, $o_1$ and $o_2$. The above definition is nothing but the breakdown point of an estimate for location following Donoho and Huber (1983). Gallegos and Ritter (2005) consider a different definition they call individual breakdown point, in which entire rows are replaced instead of just single entries. Similarly, we consider both rows and columns: let $X_{M_1,0}$ be a data set in which $M_1$ rows are replaced by arbitrary vectors in $\mathbb{R}^p$, and $X_{0,M_2}$ a data set in which $M_2$ columns are replaced by arbitrary vectors in $\mathbb{R}^n$. We define the bi-mode breakdown point for the centroid matrix as
$$\varepsilon^{bi\text{-}mode}_{n,p}(\bar{X}, X) = \min\left( \frac{1}{n}\min\Big\{M_1 : \sup_{X_{M_1,0}} \|\bar{X}(X) - \bar{X}(X_{M_1,0})\| = \infty\Big\},\; \frac{1}{p}\min\Big\{M_2 : \sup_{X_{0,M_2}} \|\bar{X}(X) - \bar{X}(X_{0,M_2})\| = \infty\Big\} \right).$$
The bi-mode breakdown point is defined as the minimum between the individual breakdown point obtained when contaminating only the rows and the one obtained when contaminating only the columns. We now prove that this is (numerically different but) equivalent to considering data sets in which $M_1$ rows and $M_2$ columns are replaced simultaneously by arbitrary vectors, which we denote by $X_{M_1,M_2}$. This case may seem more natural, but it is less elegant since in $X_{M_1,M_2}$ there need not be $M_1 p + M_2 n$ replaced entries: certain replacements will sooner or later overlap (i.e., certain entries will be replaced twice). It is straightforward to check that for any
choice of $(M_1, M_2)$ there exist $M'$ and $M''$ such that $X_{M',0}$ and $X_{0,M''}$ can be constructed exactly equal to $X_{M_1,M_2}$. Each outlying entry can be thought of as being part of either a row or a column. For example, if the $M_2$ columns are outlying in just single entries which are not part of the $M_1$ outlying rows, one can equivalently consider $M_1 + M_2$ outlying rows with no outlying columns. At worst, if a single entire row is outlying, then this is equivalent to considering the replacement of $p$ columns. The asymptotic breakdown value (Hampel 1971) is the breakdown value of a procedure for an infinite number of observations. In our context limits can be taken with respect to $n$, to $p$, or simultaneously. The worst case scenario is given by an infinitesimal asymptotic breakdown value. We can now state our results. It is straightforward to see that in classical double k-means a single diverging entry suffices to make the centroid diverge, since classical sample means are not robust. Hence, for the classical double k-means, $\varepsilon^{cell}_{np}(\bar{X}, X) \le 1/(np)$ and $\varepsilon^{bi\text{-}mode}_{n,p}(\bar{X}, X) \le 1/\max(n, p)$. Both breakdown values are then infinitesimal. Consider now our procedures, with $o_1 > 0$ and $o_2 > 0$. Consider Algorithm 1, and suppose there are corrupted entries in $o_1 + o_2$ different rows. All these can be discarded, however maliciously they are placed, by marking $o_1$ rows and the $o_2$ columns to which the remaining entries belong. If there is only one additional corrupted entry in a row/column combination that is not marked as outlying, then at least one corrupted entry is included in the estimate of the centroid matrix, and the procedure breaks down just like classical double k-means. So for the procedure of Section 2 it must be that $\varepsilon^{cell}_{np}(\bar{X}, X) \le \frac{o_1 + o_2 + 1}{np}$ and $\varepsilon^{bi\text{-}mode}_{n,p}(\bar{X}, X) \le \frac{o_1 + o_2 + 1}{\max(n, p)}$. Consider now Algorithm 2, and suppose there are $o_2 + 1$ corrupted entries on a single row. Then at least one corrupted observation is included in the estimate of the centroid matrix. Similarly, if there are $o_1 + 1$ corrupted entries on a given column, at least one corrupted observation is included in the estimate of the centroid matrix. Hence it suffices to completely corrupt one row or one column for Algorithm 2 to break down. Hence, for the procedure of Section 3 it must be that $\varepsilon^{cell}_{np}(\bar{X}, X) \le \frac{o_1 + o_2 + 1}{np}$ and $\varepsilon^{bi\text{-}mode}_{n,p}(\bar{X}, X) \le \frac{1}{\max(n, p)}$. We underline that the actual breakdown point depends on the sample data structure. The bounds are likely to be almost achieved when the non-outlying observations are close to their centroid and far from the other centroids, so that groups are well separated. Even if they have the same upper bound for the bi-mode breakdown point, it is reasonable to expect that the true breakdown point will be higher for the procedure in Section 3 than for classical double k-means. This is
reflected in the cell breakdown point, whose upper bound is equal to the one achieved with Algorithm 1 (even if the true breakdown point will likely be lower). In conclusion, the double labeling is inherently less robust than the complete discarding performed by Algorithm 1. As a referee pointed out, an observation may be an outlier without any of its entries being overly far away from the center. These are outliers of the first kind, in which all entries cooperate towards the outlyingness of the observation. Algorithm 1 is suited for this case. Algorithm 2 may still be very useful for carefully chosen o1 and o2. Gallegos and Ritter (2005) define the universal breakdown point as the lowest breakdown point with respect to all possible data matrices X. They show that universal breakdown values are infinitesimal even for their robust procedure, and this is very likely true also for the methods in this paper. Nevertheless, if we let o1 = O(n) and o2 = O(p), the asymptotic breakdown value of Algorithms 1 and 2 could finally be strictly positive. To parallel trimming methods, one could set o1 = nα1 and o2 = pα2, for small α1 and α2.
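As a concrete illustration of the bounds (the numbers are ours, chosen to match one of the simulation settings below), take n = 200, p = 30, o1 = 5 and o2 = 1. For Algorithm 1,
$$\varepsilon^{cell}_{np} \le \frac{o_1 + o_2 + 1}{np} = \frac{7}{6000} \approx 0.0012, \qquad \varepsilon^{bi\text{-}mode}_{n,p} \le \frac{o_1 + o_2 + 1}{\max(n, p)} = \frac{7}{200} = 0.035,$$
while for Algorithm 2 the bi-mode bound drops to $1/\max(n, p) = 1/200 = 0.005$; for the classical double k-means the two bounds are $1/(np) \approx 0.00017$ and $1/200 = 0.005$ respectively.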
6. Illustration of the Methods

6.1 Simulations

In this section we compare the methods in a simulation study, reporting for each setting the average adjusted Rand index (Hubert and Arabie 1985) (m − rand), measuring concordance between the estimated and true partition, the median sum of squared errors (SSE) in recovering the centroid matrix, and the average time (in seconds) for each run of the algorithm (i.e., time to convergence from a single initial solution). We report the adjusted Rand index only for the originally uncontaminated rows and columns, thus overestimating the adjusted Rand index for the classical double k-means. The setting is as follows: we randomly partition rows and columns into I and J groups, without any control on the size of each group. We then generate the centroid matrix according to the formula $\bar{x}_{hk} = (k - 1)I + h$. This way, the minimum distance between blocks is 1. We then generate the data by simulating independent normals centered at the appropriate entry of the centroid matrix, with two levels of noise (σ = 0.1 and σ = 0.5). For each setting we perform B = 250 replications. We simulate without contamination (Table 1), entirely contaminating o1 rows and o2 columns (Table 2), and contaminating only o2 columns for o1 rows (Table 3). When we include outliers, we generate them by randomly sampling from a normal with unit variance and mean, for the i-th outlier, equal to 10·i·S, where S is sampled from a Rademacher distribution (Pr(S = 1) = Pr(S = −1) = 0.5). We note that in this way the outliers are well separated, but in certain simulation settings they share the same center of certain
true clusters, the only difference being a larger spread; this makes their identification and the recovery of the true partition more difficult. In the tables we report in parentheses the results for the classical double k-means, Algorithm 1 and Algorithm 2, with o1 and o2 fixed at their true values. Note that since we are passing the same values of o1 and o2 to both robust clustering algorithms, in one of the two cases the number of outliers is misspecified. Algorithms run until the sum of the squared differences between estimated centroids in two subsequent iterations is smaller than 0.01. The tables suggest the following comments:
• Little information loss is observed when using the robust procedures in the absence of contamination, especially with large matrices. Apart from the last three cases, in which the number of outliers is grossly overestimated, the times needed for convergence are also very close.
• When we partially contaminate some objects (Table 3), the classical algorithm breaks down for small and moderate matrix sizes and, for more than a few outliers, also for big matrices. The robust algorithms perform very well, with Algorithm 2 preferable to Algorithm 1, as expected.
• This behavior is more evident when we completely contaminate some objects (Table 2), with Algorithm 1 preferable to Algorithm 2, as could be expected. The effect of completely contaminated objects on the SSE of the classical algorithm and of Algorithm 2 is dramatic.
• Algorithm 1 is always faster to converge than Algorithm 2.
• The classical method is often much slower to converge than the robust methods in the presence of outliers, likely due to instability.
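For reference, the following Python sketch reproduces the data-generating scheme described above; the function name and the exact placement of the contaminated rows and columns are our own choices, as the paper only specifies the distributions involved.

import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p, I, J, sigma, o1=0, o2=0):
    # random row and column partitions, with no control on group sizes
    u = rng.integers(1, I + 1, size=n)
    v = rng.integers(1, J + 1, size=p)
    # centroid matrix: x_bar[h, k] = (k - 1) * I + h, for 1-based h and k
    xbar = np.array([[(k - 1) * I + h for k in range(1, J + 1)]
                     for h in range(1, I + 1)], dtype=float)
    X = xbar[u - 1][:, v - 1] + rng.normal(0.0, sigma, size=(n, p))
    # complete contamination of o1 rows and o2 columns:
    # the i-th outlier has mean 10 * i * S, with S a Rademacher sign, and unit variance
    for i in range(o1):
        S = rng.choice([-1.0, 1.0])
        X[i, :] = rng.normal(10.0 * (i + 1) * S, 1.0, size=p)
    for j in range(o2):
        S = rng.choice([-1.0, 1.0])
        X[:, j] = rng.normal(10.0 * (j + 1) * S, 1.0, size=n)
    return X, u, v

# example: the smallest setting of Table 2
X, u, v = simulate(n=50, p=10, I=2, J=2, sigma=0.1, o1=1, o2=1)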
We further perform a small investigation of the performance of Algorithm 3. We fix n = 200, p = 30, I = J = 2 and contaminate a different number of entries with the same approach as before. We use Algorithm 3 to choose the number of outliers and report the proportion of times, over B = 100 replications, that o1 and o2 are chosen so that the centroid estimator does not break down. In practice, we count a centroid estimator as not having broken down whenever the maximal absolute difference between the estimator and the true centroid is less than 2. When there is no contamination, we report the proportion of times Algorithm 3 leads to choosing o1 = o2 = 0. Results for use of Algorithm 1 and Algorithm 2, respectively, are reported in Table 4. Although a more extensive simulation study is needed, results suggest that the proposed strategy may be effective in choosing the number of outliers.
Table 1: Simulation of uncontaminated data. In parentheses, the results respectively for the classical algorithm, Algorithm 1 and Algorithm 2.

p   n     σ    I   J   o1  o2  m − rand             SSE                          time
10  50    0.1  2   2   1   1   (1.00, 0.84, 0.84)   (0.0003, 0.0003, 0.0003)     (0.31, 0.34, 0.36)
10  50    0.5  2   2   1   1   (0.98, 0.83, 0.83)   (0.0073, 0.0093, 0.0077)     (0.36, 0.41, 0.81)
30  200   0.1  4   3   1   1   (1.00, 0.95, 0.96)   (0.0002, 0.0002, 0.0002)     (5.1, 3.6, 4.5)
30  200   0.5  4   3   1   1   (1.00, 0.92, 0.94)   (0.0059, 0.0063, 0.0060)     (5.3, 4.3, 4.8)
30  200   0.1  4   3   5   1   (1.00, 0.84, 0.90)   (0.0002, 0.0003, 0.0002)     (5.1, 3.6, 4.2)
30  200   0.5  4   3   5   1   (1.00, 0.92, 0.93)   (0.0064, 0.0070, 0.0064)     (5.3, 4.7, 5.6)
80  1000  0.1  10  5   1   1   (1.00, 1.00, 0.99)   (0.0003, 0.0003, 0.0003)     (42, 48, 74)
80  1000  0.5  10  5   1   1   (1.00, 0.99, 0.99)   (0.0086, 0.0086, 0.0086)     (42, 69, 91)
80  1000  0.1  10  5   5   1   (1.00, 0.98, 0.96)   (0.0003, 0.0003, 0.0003)     (41, 81, 89)
80  1000  0.5  10  5   5   1   (1.00, 0.97, 0.96)   (0.0086, 0.0085, 0.0083)     (42, 106, 123)
80  1000  0.1  10  5   10  2   (1.00, 0.94, 0.97)   (0.0003, 0.0003, 0.0003)     (41, 80, 88)
80  1000  0.5  10  5   10  2   (1.00, 0.94, 0.97)   (0.0087, 0.0089, 0.0087)     (42, 97, 106)
Table 2: Simulation of completely contaminated data. In parentheses, the results respectively for the classical algorithm, Algorithm 1 and Algorithm 2.

p   n     σ    I   J   o1  o2  m − rand             SSE                               time
10  50    0.1  2   2   1   1   (0.73, 0.99, 0.83)   (5.69, 0.0003, 5.56)              (0.35, 0.16, 9.9)
10  50    0.5  2   2   1   1   (0.57, 0.99, 0.59)   (81.82, 0.0096, 8.90)             (0.57, 0.22, 6.72)
30  200   0.1  4   3   1   1   (1.00, 1.00, 0.98)   (3.09, 0.0004, 3.08)              (4.9, 1.8, 6.2)
30  200   0.5  4   3   1   1   (0.94, 1.00, 0.97)   (7.20, 1.3850, 7.18)              (7.9, 1.9, 12.6)
30  200   0.1  4   3   5   1   (0.71, 1.00, 0.87)   (4059.51, 0.7862, 3902.42)        (7.9, 3.5, 24)
30  200   0.5  4   3   5   1   (0.71, 0.90, 0.81)   (3929.17, 1.0319, 4413.31)        (10.3, 4.7, 38.7)
80  1000  0.1  10  5   1   1   (1.00, 1.00, 1.00)   (1912.66, 0.0003, 2046.74)        (41, 44, 91)
80  1000  0.5  10  5   1   1   (0.98, 1.00, 0.99)   (1979.31, 0.0083, 1945.99)        (42, 40, 87)
80  1000  0.1  10  5   5   1   (0.94, 1.00, 0.97)   (4166.81, 0.0003, 4231.23)        (77, 84, 108)
80  1000  0.5  10  5   5   1   (0.90, 1.00, 0.91)   (5577.60, 0.0008, 5757.07)        (100, 84, 169)
80  1000  0.1  10  5   10  2   (0.61, 1.00, 0.48)   (23820.67, 21.2953, 25213.76)     (151, 98, 214)
80  1000  0.5  10  5   10  2   (0.52, 0.90, 0.31)   (21337.11, 31.4790, 22577.36)     (245, 117, 301)
Table 3: Simulation of partially contaminated data. In parentheses, the results respectively for the classical algorithm, Algorithm 1 and Algorithm 2.

p   n     σ    I   J   o1  o2  m − rand               SSE                          time
10  50    0.1  2   2   1   1   (1.00, 1.00, 1.00)     (0.0040, 0.0004, 0.0008)     (0.27, 0.17, 0.20)
10  50    0.5  2   2   1   1   (0.99, 0.98, 0.98)     (0.0113, 0.0091, 0.0101)     (0.75, 0.48, 0.85)
30  200   0.1  4   3   1   1   (1.00, 0.99, 1.00)     (0.0007, 0.0002, 0.0003)     (2.3, 2.1, 2.2)
30  200   0.5  4   3   1   1   (0.91.00, 0.98, 0.99)  (0.0065, 0.0064, 0.0062)     (2.5, 3.0, 3.7)
30  200   0.1  4   3   5   3   (0.78, 0.79, 0.99)     (0.0236, 0.0002, 0.0002)     (3.2, 3.2, 4.2)
30  200   0.5  4   3   5   3   (0.67, 0.79, 0.94)     (0.0289, 0.0068, 0.0071)     (3.8, 6.3, 7.7)
80  1000  0.1  10  5   1   1   (1.00, 1.00, 1.00)     (0.0006, 0.0003, 0.0004)     (54, 48, 46)
80  1000  0.5  10  5   1   1   (0.99, 0.99, 0.99)     (0.0084, 0.0086, 0.0079)     (46, 69, 69)
80  1000  0.1  10  5   5   1   (1.00, 0.99, 1.00)     (0.0033, 0.0003, 0.0003)     (50, 41, 44)
80  1000  0.5  10  5   5   1   (1.00, 0.99, 0.99)     (0.0112, 0.0089, 0.0095)     (62, 40, 46)
80  1000  0.1  10  5   10  2   (0.95, 0.93, 1.00)     (0.1411, 0.0003, 0.0003)     (106, 78, 96)
80  1000  0.5  10  5   10  2   (0.89, 0.93, 0.99)     (0.1400, 0.0086, 0.0083)     (118, 79, 92)
Table 4: Proportion of times Algorithm 3 chooses o1 and o2 so that the centroid does not break down, respectively for Algorithm 1 and Algorithm 2. In case of no contamination, proportion of times o1 = o2 = 0 is chosen.

contaminated entries   σ = 0.1        σ = 0.5
0                      (0.92, 1.00)   (0.93, 0.99)
1                      (1.00, 0.93)   (1.00, 0.88)
5                      (1.00, 0.97)   (1.00, 0.91)
12                     (1.00, 0.95)   (1.00, 1.00)
15                     (1.00, 1.00)   (0.97, 0.81)
6.2 Real Data Examples

6.2.1 Macroeconomic Performance of Industrialized Countries
We first revisit the example of Vichi (2000) about the average macroeconomic performance of the G7 most industrialized countries: France (FRA), Germany (GER), Great Britain (GBR), Italy (ITA), United States of America (USA), Japan (JAP), Canada (CAN); plus Spain (SPA). The variables measured were: Gross Domestic Product index (GDP), Inflation (INF), Budget deficit/GDP (DEF), Public debt/GDP (DEB), Long term interest rate (INT), Trade balance/GDP (TRB), and unemployment rate (UNE). The measurements refer to the period 1980-1990, and are of particular interest because most of the variables were considered in the parameters of the Maastricht treaty. The 8 countries have slightly different performances. Clustering can be used to explore the information, and it is of great interest to be able to identify the outliers. In particular, the question is whether Italy can be thought of as an outlier after standardization. For these data the common choice for the number of groups is I = 3 and J = 2. With Algorithm 1 and automatic choice of the number of outliers we end up setting o1 = 1 and o2 = 0. There is one single row outlier, which is identified by the algorithm as being Italy. The row partition is: {GER, JAP}, {SPA}, {FRA, GBR, USA, CAN}; while the column partition is: {GDP, DEF, DEB, TRB}, {INF, INT, UNE}. This is well in agreement with Vichi (2000), with the only exception that Italy is excluded from the partitioning. If we use Algorithm 2, the automatic choice leads to setting o1 = 1 and o2 = 1. Italy is once again identified as being an outlier, but only for the variable DEB. For the other variables, Italy is not considered as an outlier, and the row partition is: {GER, JAP}, {ITA, SPA}, {FRA, GBR, USA, CAN}. Interestingly enough, this is exactly the same row partition obtained in Vichi (2000). The column partition is the same as before, and once again in agreement with Vichi (2000). We can conclude that in this data set there is some evidence of Italy being an outlier, possibly because of an exceptionally high public debt (with
respect to GDP), but that this outlier does not affect the classical procedure. A completely different conclusion will be drawn about the data in the next section.

6.2.2 Metallic Oxide Analysis Data
A sampling study was designed to explore the effects of process and measurement variation on properties of lots of metallic oxide. The metal content minus 80% by weight was recorded for two Types of metallic oxide raw material, in respectively 18 and 13 lots, with two samples from each lot and two randomly chosen chemists for each sample. Data come from Bennet (1954), and were analyzed with a robust mixed model approach by Fellner (1986) and by Zewotir and Galpin (2007). This is a problem of confirmatory cluster analysis, with I = 2 and J = 1 (Sample and Chemist are zero-centered random effects). The classical k-means procedure with k = 2 leads to a very bad solution, in which lots 6 and 7 of Type 2 belong to one group and all the other rows belong to another group. The same result is given by partitioning around medoids (PAM). If we increase the number of groups and do classical k-means or PAM with k = 3, there still is a badly behaved solution: one group is made of two lots (6 and 7 of Type 2), and the adjusted Rand index for the other rows is only about 0.17 in both cases. With Algorithm 3 we end up choosing o1 = 3 and o2 = 0 for use with Algorithm 1. We mark lot 17 of Type 1 and lots 6 and 7 of Type 2 as outliers. The adjusted Rand index for the remaining rows is around 0.22. Since there are at least two entirely corrupt rows, here Algorithm 2 is not appropriate. On the other hand, we can use a combination of Algorithms 1 and 2. By combining the algorithms we finally end up marking lot 17 of Type 1 together with lots 6 and 7 of Type 2 as entirely outlying (once again o1 = 3 and o2 = 0 for Algorithm 1). On the other hand, we set o1 = 1 and o2 = 1 for Algorithm 2, thereby letting lot 12 of Type 2, Sample 1, Chemist 1, have an outlying measurement. The final adjusted Rand index is 0.24. We now compare our detected outliers with the results of Fellner (1986) and Zewotir and Galpin (2007). We generally agree with the results of Fellner (1986), with the only difference that for lot 17 of Type 2 only the last four columns are marked as outlying, and that there are no outlying measurements for lot 12. This is particularly encouraging for us, since we manage to achieve a similar list of outliers without using the additional information given by a priori knowledge of the two row groups (Type), and without use of the random effects Sample and Chemist. In this situation the two row groups are not well separated, and the outliers could be masked. Zewotir and
Galpin (2007) mark the entire lot 17 of Type 1, together with lots 6 and 7 of Type 2, as outlying. They do not end up marking any column of lot 12 of Type 2, and additionally mark lots 2, 3, 10, 11 of Type 2, together with Chemist 1, Sample 2 for lot 4 of Type 2. Here the number of identified outliers is much higher. We shall finally note that, being based on mixed models, the procedures of Fellner (1986) and Zewotir and Galpin (2007) are more flexible than our approach, and can do multiple marking: for instance, both methods mark lot 6 of Type 2 as entirely outlying, and then the second measurement for Sample 2, Chemist 2 is further marked as outlying.

6.2.3 Simultaneous Clustering of Genes and Tissues
Our last example concerns the analysis of DNA microarrays. Data come from Bittner et al. (2000), who analyzed the expression levels of 3613 genes on 38 tissue biopsies: 31 cutaneous melanomas and 7 controls. Samples come from male and female patients, aged 29 to 75. Bittner et al. (2000) discuss the segmentation of the 31 tumor samples, obtaining two clusters of 12 and 19 samples, validated through multidimensional scaling and by noting different metastatic behavior between the two groups. Rocci and Vichi (2008) perform double clustering on this data set, finally obtaining 4 row groups (one containing differentially expressed genes) and 2 groups of samples. Their grouping of the samples is slightly different from the one in Bittner et al. (2000), being formed of 21 and 10 samples, the same groups obtained by Goldstein, Ghosh, and Conlon (2002) with hierarchical clustering methods. The natural choice for the number of row groups in DNA microarray analysis is I = 3, with the aim of separating down-regulated, up-regulated and not differentially expressed genes. By using the approach in Section 4, with s1 = s2 = 5, we set o1 = 60 and o2 = 10 for Algorithm 2. With this choice, we recover the 12-19 clustering of Bittner et al. (2000). Our choice is then I = 3, J = 2, o1 = 60 and o2 = 10 with Algorithm 2, finally getting 6 blocks with the associated centroid in Table 5. As hoped, we obtain two groups of differentially expressed genes, one up-regulated and one down-regulated. By looking at the centroid matrix we can in fact safely conclude that a first group of 816 genes is a candidate for up-regulation, and also contributes to the column clustering; another group of 627 genes is a candidate for down-regulation, and also contributes to the column clustering. Further, it seems that the differential expression of genes in column group 1 is more marked (higher for up-regulated, lower for down-regulated) than the differential expression for genes in column group 2, thus explaining also the sample partitioning. Interestingly enough, the remaining bulk of 2170 genes is apparently not differentially expressed and simultaneously does not contribute to the column clustering.
Table 5: Estimated centroid for Bittner et al. (2000) data, with I = 3, J = 2, o1 = 60, o2 = 10, Algorithm 2; with cardinalities in parentheses.
                      Column Group 1 (19)   Column Group 2 (12)
Row Group 1 (816)     0.83                  0.48
Row Group 2 (2170)    −0.04                 −0.01
Row Group 3 (627)     −0.96                 −0.62
Our robust approach to double clustering thus allows us to recover the meaningful Bittner et al. (2000) clustering of the samples, and furthermore to simultaneously identify groups of genes that can explain the tissue partitioning. Note that the number of genes identified by our method and by the other methods is too large for post-screening, as often happens when applying clustering methods instead of a multiple testing approach to DNA microarray data.

7. Conclusions
We provided two robust algorithms for double clustering, and an automatic strategy for choosing the number of outliers. The algorithms are seen to have good global robustness properties, and to be useful also in identifying outliers from different perspectives. Robust double clustering seems to yield clusters that are closer to the true structure than classical methods, as was seen both in the simulations and in the real applications. We focused here on hard partitions, but note that the generalization to fuzzy clustering or to coverings (overlapping clusters) is straightforward. In further work we will permit the location of the corrupt dimensions o2 to vary with the identified row, finally allowing for a different number and location of corrupt dimensions for each row outlier. A further robustification of the methods would be given by using a robust statistic, like the median, for estimation of the centroid matrix $\bar{X}$ (Kaufman and Rousseeuw 1990); even if Garcia-Escudero and Gordaliza (1999) note that simply using the median instead of the mean leads to the same global robustness properties in clustering.

Appendix A: Convergence of the Proposed Algorithms

Since (2) and (3) are bounded from below by zero, it suffices to prove that at each iteration of the corresponding algorithm the objective is monotonically nonincreasing. Consequently, the algorithms will lead to a local optimum of the corresponding objective function. Since there is a finite number of possible assignments, convergence must be achieved in a finite number of steps.
A.1 Solution of (2) by Algorithm 1

Algorithm 1 can be divided into three steps: updating of $U$, of $V$, and of $\bar{X}$.

Since for the outliers it happens that $u_{ir} = 0$ $\forall r$, for any $i \in Y^C$, we can rewrite the objective as:
$$\sum_{i=1}^{n}\sum_{j=1}^{p}\sum_{r}\sum_{c} u_{ir} v_{jc} (x_{ij} - \bar{x}_{rc})^2 = \sum_{i=1}^{n}\sum_{r} u_{ir} \sum_{j}\sum_{c} v_{jc} (x_{ij} - \bar{x}_{rc})^2 = \sum_{i=1}^{n}\sum_{r} u_{ir}\, d^2_r(i).$$
We have that $U$ is updated to $\tilde{U}$ with $\tilde{u}_{ir} = 0$ and $\tilde{u}_{i r_i} = 1$, where $r_i = \arg\min_r d^2_r(i)$. Hence, for all $i = 1, \dots, n$, $\sum_r u_{ir} d^2_r(i) \ge \sum_r \tilde{u}_{ir} d^2_r(i)$. The objective is given by the sum after discarding the $o_1$ highest elements, and hence is not increasing after this step. The same reasoning applies to the updating of $V$. Finally, $\bar{X}$ is the barycenter given by least squares, hence it is closer to the new elements in each block than the previous centroid.
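To make the row step concrete, here is a small sketch (ours, not code from the paper) of the trimmed update of U: every row is assigned to the closest row-group profile implied by the current centroid matrix and column memberships, and the o1 rows with the largest distances d^2_{r_i}(i) are flagged as outlying and excluded from the objective.

import numpy as np

def update_row_memberships(X, xbar, v, o1):
    # X: (n, p) data matrix; xbar: (I, J) centroid matrix;
    # v: (p,) column memberships coded 0..J-1; o1: number of row outliers.
    profiles = xbar[:, v]                            # (I, p) profile of each row group
    d2_all = ((X[:, None, :] - profiles[None, :, :]) ** 2).sum(axis=2)  # (n, I)
    r = d2_all.argmin(axis=1)                        # closest row group for each row
    d2 = d2_all.min(axis=1)                          # d^2_{r_i}(i)
    trimmed = np.argsort(d2)[-o1:] if o1 > 0 else np.array([], dtype=int)
    return r, trimmed, d2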
A.2 Solution of (3) by Algorithm 2

Note that the function to minimize can be rewritten using the indicators $\eta$ and $\phi$ as:
$$\left( \sum_{i \in R}\sum_{j=1}^{p}\sum_{r,c} u_{ir} v_{jc} (x_{ij} - \bar{x}_{rc})^2 + \sum_{i \in R^C}\sum_{j \in D}\sum_{r,c} u_{ir} v_{jc} (x_{ij} - \bar{x}_{rc})^2 \right)$$
$$= \sum_{i}\sum_{j}\sum_{r}\sum_{c} u_{ir} v_{jc} \left[ \eta_i (x_{ij} - \bar{x}_{rc})^2 + (1 - \eta_i)\phi_j (x_{ij} - \bar{x}_{rc})^2 \right]$$
$$= \sum_{i}\sum_{r} u_{ir} \left( \sum_{j}\sum_{c} v_{jc} (x_{ij} - \bar{x}_{rc})^2 \big(\eta_i + (1 - \eta_i)\phi_j\big) \right)$$
$$= \sum_{i}\sum_{r} u_{ir} \left( \sum_{j}\sum_{c} v_{jc} (x_{ij} - \bar{x}_{rc})^2 \max(\eta_i, \phi_j) \right) = \sum_{i}\sum_{r} u_{ir}\, d^2_r(i).$$
Now the same reasoning as in the previous subsection can be applied to see that the update of the row memberships leads to a not higher objective function.
For determination of the row outliers by the indicators $\eta$, note that the function to minimize can be rewritten once again as:
$$\sum_{i} \eta_i \sum_{j}\sum_{r}\sum_{c} u_{ir} v_{jc} (x_{ij} - \bar{x}_{rc})^2 (1 - \phi_j) + \sum_{i}\sum_{j}\sum_{r}\sum_{c} \phi_j u_{ir} v_{jc} (x_{ij} - \bar{x}_{rc})^2.$$
The second summand is constant with respect to $\eta$. The first one is equivalent to $\sum_i \eta_i \sum_j \sum_c v_{jc} (x_{ij} - \bar{x}_{r_i c})^2 (1 - \phi_j) = \sum_i \eta_i \nu_i$. Since the $\eta_i$ are updated so that the $o_1$ highest $\nu_i$ are discarded from the sum, the objective function is once again not higher after the update of the row outliers. The same reasoning can be applied to the column memberships and outliers. Finally, the centroid is once again the barycenter of the observations. The generic element $\bar{x}_{rc}$ is in fact obtained as the minimizer in $a$ of
$$f_{rc}(a) = \sum_{i}\sum_{j} u_{ir} v_{jc} \max(\eta_i, \phi_j)(x_{ij} - a)^2,$$
since the objective function can once again be rewritten as $\sum_{r}\sum_{c} f_{rc}(a)$.
References

ATKINSON, A.C., RIANI, M., and CERIOLI, A. (2004), Exploring Multivariate Data with the Forward Search, New York: Springer.
BENNET, C.A. (1954), "Effect of Measurement Error on Chemical Process Control", Industrial Quality Control 11: 17–20.
BITTNER, M., MELTZER, P., CHEN, Y., JIANG, Y., SEFTOR, E., HENDRIX, M., RADMACHER, M., SIMON, R., YAKHINI, Z., BEN-DOR, A., SAMPAS, N., DOUGHERTY, E., WANG, E., MARINCOLA, F., GOODEN, C., LUEDERS, J., GLATFELTER, A., POLLOCK, P., CARPTEN, J., GILLANDERS, E., LEJA, D., DIETRICH, K., BEAUDRY, C., BERENS, M., ALBERTS, D., and SONDAK, V. (2000), "Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling", Nature 406: 536–540.
BOCK, H.-H. (1996), "Probabilistic Models in Cluster Analysis", Computational Statistics and Data Analysis 23: 5–28.
CHO, H., DHILLON, I.S., GUAN, Y., and SRA, S. (2004), "Minimum Sum-Squared Residues Co-Clustering of Gene Expression Data", Proceedings of the Fourth SIAM International Conference on Data Mining, 114–125.
CLIMER, S., and ZHANG, W. (2006), "Rearrangement Clustering: Pitfalls, Remedies, and Applications", Journal of Machine Learning Research 7: 919–943.
CUESTA-ALBERTOS, J., GORDALIZA, A., and MATRÁN, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers", Annals of Statistics 25: 553–576.
DONOHO, D.L., and HUBER, P.J. (1983), "The Notion of Breakdown Point", in A Festschrift for Erich L. Lehmann, eds. P. Bickel, K. Doksum, and J.L. Hodges Jr., Belmont CA: Wadsworth, 157–184.
FELLNER, W.H. (1986), "Robust Estimation of Variance Components", Technometrics 28: 51–60.
FISHER, W. (1969), Clustering and Aggregation in Economics, Baltimore: Johns Hopkins.
FRALEY, C., and RAFTERY, A.E. (2002), "Model Based Clustering, Discriminant Analysis, and Density Estimation", Journal of the American Statistical Association 97: 611–631.
GALLEGOS, M.T., and RITTER, G. (2005), "A Robust Method for Cluster Analysis", Annals of Statistics 33: 347–380.
GARCIA-ESCUDERO, L.A., and GORDALIZA, A. (1999), "Robustness Properties of k Means and Trimmed k Means", Journal of the American Statistical Association 94: 956–969.
GARCIA-ESCUDERO, L.A., GORDALIZA, A., and MATRÁN, C. (2003), "Trimming Tools in Exploratory Data Analysis", Journal of Computational and Graphical Statistics 12: 434–449.
GOLDSTEIN, D., GHOSH, D., and CONLON, E. (2002), "Statistical Issues in the Clustering of Gene Expression Data", Statistica Sinica 12: 219–241.
HAMPEL, F.R. (1971), "A General Qualitative Definition of Robustness", Annals of Mathematical Statistics 42: 1887–1896.
HAMPEL, F.R., ROUSSEEUW, P.J., RONCHETTI, E., and STAHEL, W.A. (1986), Robust Statistics: The Approach Based on the Influence Function, New York: Wiley.
HARDIN, J., and ROCKE, D. (2004), "Outlier Detection in the Multiple Cluster Setting Using the Minimum Covariance Determinant Estimator", Computational Statistics and Data Analysis 44: 625–638.
HARTIGAN, J.A. (1972), "Direct Clustering of a Data Matrix", Journal of the American Statistical Association 67: 123–129.
HODGES, J.L. Jr. (1967), "Efficiency in Normal Samples and Tolerance of Extreme Values for Some Estimates of Location", in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), Berkeley CA: University of California Press, 163–186.
HUBER, P.J. (1981), Robust Statistics, New York: Wiley.
HUBERT, L., and ARABIE, P. (1985), "Comparing Partitions", Journal of Classification 2: 193–218.
KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data, New York: Wiley.
MADEIRA, S.C., and OLIVEIRA, A.L. (2004), "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 24–45.
ROCCI, R., and VICHI, M. (2008), "Two-Mode Multi-Partitioning", Computational Statistics and Data Analysis 52: 1984–2003.
ROUSSEEUW, P.J. (1984), "Least Median of Squares Regression", Journal of the American Statistical Association 79: 851–857.
ROUSSEEUW, P.J., and VAN DRIESSEN, K. (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics 41: 212–223.
ROUSSEEUW, P.J., and VAN DRIESSEN, K. (2006), "Computing LTS Regression for Large Data Sets", Data Mining and Knowledge Discovery 12: 29–45.
SCHEPERS, J., CEULEMANS, E., and VAN MECHELEN, I. (2008), "Selecting among Multi-Mode Partitioning Models of Different Complexities: A Comparison of Four Model Selection Criteria", Journal of Classification 25: 67–85.
VAN MECHELEN, I., BOCK, H.H., and DE BOECK, P. (2004), "Two-Mode Clustering Methods: A Structured Overview", Statistical Methods in Medical Research 13: 363–394.
VICHI, M. (2000), "Double k-means Clustering for Simultaneous Classification of Objects and Variables", in Advances in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, eds. S. Borra, R. Rocci, and M. Schader, Heidelberg: Springer, 43–52.
ZEWOTIR, T., and GALPIN, J.S. (2007), "A Unified Approach on Residuals, Leverages and Outliers in the Linear Mixed Model", Test 16: 58–75.