Journal of Classification 22:185-201 (2005) DOI: 10.1007/s00357-005-0013-8
A Proposal for Robust Curve Clustering
Luis Angel García-Escudero, Universidad de Valladolid, Spain
Alfonso Gordaliza, Universidad de Valladolid, Spain
Abstract: Functional data sets appear in many areas of science. Although each data point may be seen as a large finite-dimensional vector, it is preferable to think of them as functions, and many classical multivariate techniques have been generalized for this kind of data. A widely used technique for dealing with functional data is to choose a finite-dimensional basis and find the best projection of each curve onto this basis. Therefore, given a functional basis, an approach for doing curve clustering relies on applying the k-means methodology to the fitted basis coefficients corresponding to all the curves in the data set. Unfortunately, a serious drawback follows from the lack of robustness of k-means. Trimmed k-means clustering (Cuesta-Albertos, Gordaliza, and Matrán 1997) provides a robust alternative to the use of k-means and, consequently, it may be successfully used in this functional framework. The proposed approach will be exemplified by considering cubic B-spline bases, but other bases can be applied analogously depending on the application at hand. Keywords: Functional data; Clustering; k-means; Trimmed k-means; Robustness.
This research was partially supported by Ministerio de Ciencia y Tecnología and FEDER grant BFM2002-04430-C02-01 and by Consejería de Educación y Cultura de la Junta de Castilla y León grant PAPIJCL VA074/03. We are grateful to LSP Toulouse and Traffic-First Paris, both researching road traffic models, for furnishing the data processed here. The authors also thank Carlos Matrán for helpful comments, and the Editor and the anonymous referees for many suggestions that greatly improved the article. Authors' Address: Luis Angel García-Escudero and Alfonso Gordaliza, Departamento de Estadística e Investigación Operativa, Universidad de Valladolid, 47002 Valladolid, Spain, email: [email protected], [email protected]
1. Introduction
In many scientific disciplines, the observed response from experimentation may be viewed as a continuous curve rather than a scalar or a vector. The main sources of this kind of data are real-time monitoring of processes and certain types of longitudinal studies. In the terminology of Ramsay and Silverman (1997), these are called functional data sets. Although the underlying data are intrinsically made up of curves, the observed data are discretized representations of those curves, so each data curve is represented by a (large) finite-dimensional vector. However, the direct application of standard multivariate techniques to these vectors is not advisable and is sometimes impossible (see, e.g., Ramsay and Silverman 1997). Several adaptations of classical statistical techniques for dealing with this kind of data have been introduced recently.

Clustering is used to find homogeneous subgroups within a given data set, and that purpose can also be pursued when the data at hand have a functional character. Just a few examples of fields where curve clustering could be appealing are neuroscience (detecting similarities in activation between voxels from fMRI time series); genetics (clustering genes using time-expression profiles from cDNA microarray experiments); pharmaceutics (classifying drug effects according to patients' time-response after drug intake); economics (clustering assets having similar stock returns over time); computer science and data mining (obtaining similar user profiles of web browsing behavior); and engineering (analysis of electric load curves and prediction of customer characteristics). Many of these applications have led to the introduction of several curve clustering approaches (references on both applications and techniques will be given in Section 3).

Once the importance of curve clustering has been motivated by applications like those mentioned above, the need for robust clustering methods is naturally justified by the possible, and frequent, presence of outlying curves. Notice that outliers are likely to occur in these large, and sometimes fully automated, data sets. Part of the variability of the data can be removed by the regularization or smoothing procedure considered, but other sources of variability may be due to particularities of the data curves themselves. The detection of outlying curves is also useful in its own right, given the possible interest in explaining why certain curves depart from the behavior followed by the majority. Moreover, outliers can affect the clustering process, leading to very disappointing results (bad cluster allocations and even a total breakdown of the clustering process). Finally, notice that their detection is not an easy task (it is well known that outlier detection is hard in high-dimensional data sets).
In order to perform robust curve clustering, we resort to impartial trimming procedures through the use of trimmed k-means (Cuesta-Albertos, Gordaliza, and Matrán 1997). By the word "impartial" we mean that the data themselves tell us which observations should be trimmed off. Section 2 is devoted to presenting the trimmed k-means methodology, and in Section 3 we explain how this technique can be adapted to the functional clustering framework. Some simulated and real data examples are shown in Section 4 and, finally, some possible extensions and open problems are briefly outlined in Section 5.

2. Trimmed k-Means
The k-means method (e.g., McQueen 1967; Hartigan 1975) is surely the most widely used nonhierarchical clustering technique: given a p-variate data sample, $X_1, \ldots, X_n$, we search for a set of k points, $m_1, \ldots, m_k$, minimizing

$$\min_{m_1,\ldots,m_k} \frac{1}{n} \sum_{i=1}^{n} \inf_{1 \le j \le k} \|X_i - m_j\|^2. \qquad (2.1)$$
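For illustration, the following is a minimal R sketch of (2.1) using the standard kmeans function from the stats package; the simulated bivariate data are our own illustrative choice.

```r
## Two well-separated bivariate groups; kmeans minimizes (2.1) by
## alternating allocation and center-update steps.
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(X, centers = 2, nstart = 20)  # 20 random initializations
fit$centers  # the k = 2 centers m_1, m_2
fit$cluster  # allocation of each X_i to its closest center
```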
Clusters arise in a natural way by assigning each data point $X_i$ to its closest center $m_j$. Fast algorithms implementing the k-means method already exist, giving results in an acceptable time even when the dimension p is rather large.

A well-known drawback of the k-means method is its lack of robustness. It may be argued that the method inherits the non-robustness of the sample mean, as k-means generalize the sample mean to the case k > 1. Consequently, k-medoids, which use the norm instead of the squared norm for measuring discrepancies in (2.1), are also frequently applied (see, e.g., Kaufman and Rousseeuw 1990) in trying to overcome this lack of robustness. That modification provides only a timid robustification, because a single outlier still suffices to break down the clustering process (García-Escudero and Gordaliza 1999).

Cuesta-Albertos, Gordaliza, and Matrán (1997) proposed the robustification of the k-means methodology through an "impartial" trimming procedure. In that approach, given a trimming size α between 0 and 1, we search for a set of k points, $m_1, \ldots, m_k$, solving the double minimization problem

$$\widehat{W}_k(\alpha) := \min_{Y} \; \min_{m_1,\ldots,m_k} \frac{1}{[n(1-\alpha)]} \sum_{X_i \in Y} \inf_{1 \le j \le k} \|X_i - m_j\|^2, \qquad (2.2)$$

where Y ranges over the subsets of $X_1, \ldots, X_n$ containing $[n(1-\alpha)]$ data points ([·] denotes the integer part of a given value). Afterwards, each non-trimmed observation is allocated to the cluster corresponding to its closest trimmed k-mean center $m_j$.
Moreover, each trimmed observation could also be allocated to its closest trimmed k-mean center, but a "flag" announcing the existence of a "possible outlier" should be added, especially when its distance to the cluster center is large.

Notice that problem (2.2) is analogous to (2.1), but we now allow a proportion α of the observations to be discarded. Trimmed observations are self-determined by the data, as we do not impose privileged directions or zones for trimming. This kind of trimming guards against anomalous points placed in the "outer zones" and even against "bridging points" between the clusters. The approach is closely related to other robust procedures, such as the Least Trimmed Squares (LTS) and Minimum Covariance Determinant (MCD) estimators applied in regression and in multivariate estimation (see, e.g., Rousseeuw and Leroy 1987). García-Escudero and Gordaliza (1999) show that a notable robustification gain is achieved by the trimming procedure. Trimmed k-means also have good statistical properties, as they provide consistent and asymptotically normal estimators of the population trimmed k-means (Cuesta-Albertos, Gordaliza, and Matrán 1997; García-Escudero, Gordaliza, and Matrán 1999). Also, a fast algorithm for their computation (in the spirit of that of Rousseeuw and van Driessen 1999 for computing the MCD) exists (see García-Escudero, Gordaliza, and Matrán 2003).
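As a concrete illustration, here is a minimal R sketch of trimmed k-means based on random restarts and concentration steps, in the spirit of the algorithm of García-Escudero, Gordaliza, and Matrán (2003). It is not the authors' released implementation; the function name and defaults are our own.

```r
## Trimmed k-means (2.2): keep the [n(1 - alpha)] points closest to the
## current centers, update the centers on the kept points, and repeat.
trimmed_kmeans <- function(X, k, alpha, nstart = 20, iter.max = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  h <- floor(n * (1 - alpha))                 # [n(1 - alpha)] retained points
  best <- list(obj = Inf)
  for (start in seq_len(nstart)) {
    centers <- X[sample(n, k), , drop = FALSE]  # random initial centers
    for (it in seq_len(iter.max)) {
      ## squared distances from every point to every current center
      d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
      nearest <- max.col(-d2)                   # index of the closest center
      dmin <- d2[cbind(seq_len(n), nearest)]
      keep <- order(dmin)[seq_len(h)]           # impartial trimming step
      for (j in seq_len(k)) {                   # update centers on kept points
        pts <- keep[nearest[keep] == j]
        if (length(pts) > 0)
          centers[j, ] <- colMeans(X[pts, , drop = FALSE])
      }
    }
    obj <- sum(dmin[keep]) / h                  # objective value as in (2.2)
    if (obj < best$obj)
      best <- list(obj = obj, centers = centers, cluster = nearest,
                   trimmed = sort(setdiff(seq_len(n), keep)))
  }
  best
}
```

The concentration step (trim, then re-center) monotonically improves the objective for fixed k and α, so a few iterations per random start usually suffice.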
3. Trimmed k-Means and Robust Curve Clustering
In order to perform the impartial trimming methodology, we need to transform the (infinite-dimensional) data curves into (finite-dimensional) vectors. We do this by projecting each data curve onto a finite-dimensional basis; afterwards, trimmed k-means clustering is applied to the resulting basis coefficients.

There are other possibilities for performing this dimension reduction. A "naive" approach evaluates the curves on a fine grid. However, this is not very convenient, since it leads to high-dimensional vectors with heavy correlation among components, which are not appropriate for standard multivariate statistical techniques. Those high-dimensional vectors can be summarized by the first principal components, by quantiles (Jones and Rice 1992), or by principal points (Flury and Tarpey 1993). We prefer to use projections onto a suitable functional basis, as they serve to smooth the curves by removing disturbing noise. This filtering approach has been frequently used in functional data analysis.

Let us consider a p-dimensional functional basis $\{B_s(\cdot)\}_{s=1}^{p}$. Given a single data curve $\{(t_i^j, x_i(t_i^j))\}_{j=1}^{J_i}$ (the result of recording curve $x_i$ at the $J_i$ time moments $t_i^1 < t_i^2 < \cdots < t_i^{J_i}$),
we can model that curve by a simple linear least-squares fit:

$$x_i(t_i^j) = \sum_{s=1}^{p} \beta_{is} B_s(t_i^j) + \varepsilon_i, \quad j = 1, \ldots, J_i. \qquad (3.1)$$
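In practice, (3.1) is just an ordinary least-squares regression of the recorded values on the basis functions evaluated at the recording times. A minimal sketch using the standard splines library in R (the helper name fit_coefs is ours):

```r
library(splines)

## Fitted coefficients beta_i1, ..., beta_ip of one recorded curve on a
## cubic B-spline basis with the given interior knots (p = S + 4 columns).
fit_coefs <- function(times, values, knots) {
  B <- bs(times, knots = knots, degree = 3, intercept = TRUE)
  unname(coef(lm(values ~ B - 1)))  # no extra intercept: the basis has one
}
```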
Thus, for n data curves $\{x_i\}_{i=1}^{n}$, we obtain n vectors of fitted coefficients $\beta_i = (\beta_{i1}, \ldots, \beta_{ip})'$, $i = 1, 2, \ldots, n$.

Throughout this paper, we will always consider cubic B-spline bases for exemplifying the proposed procedure. Notice that, considering S interior knots $\xi_1, \xi_2, \ldots, \xi_S$, a cubic B-spline basis is made up of p = S + 4 basis elements. Other functional bases, more appropriate in specific contexts, can be used in a totally similar fashion. For instance, a Fourier basis could be appealing when the data curves exhibit a periodical nature. We can also use orthogonal polynomials (see Tarpey, Petkova, and Ogden 2003), wavelet bases that allow multiresolution analysis (in time and frequency scale), or even Zernike orthogonal bases when analyzing images on the unit sphere along time (Locantore, Marron, Simpson, Zhang, and Cohen 1999). Cubic B-spline bases have good mathematical properties and are easy to implement (see, e.g., de Boor 1978; Schumaker 1981; Eubank 1988; Green and Silverman 1994 for other statistical applications). Notice also that cubic B-spline bases have the attractive property of being locally sensitive to the data (each $B_s$ is nonzero over a span of at most five distinct knots).

Finally, for a given trimming size α, we apply trimmed k-means clustering (2.2) to the resulting coefficients, obtaining the curves' group allocations and highlighting those curves to be trimmed off because of their outlyingness. The curve clustering procedure thus benefits from the robustness properties of trimmed k-means. In the regression (3.1) we prefer the use of classical (nonrobust) least squares, instead of a robust fit, since outlying values in a curve then exert a high influence on the determination of its coefficients, resulting in the curve being detected as an outlier by the trimmed k-means procedure. A clear advantage is that the procedure provides a depth notion for curves (the higher the trimming size α needed to trim off a given data curve, the deeper that curve is placed within the sample).

An implementation of the proposed procedure in the statistical package R is available at http://www.eio.uva.es/~langel/software. The program makes use of the library splines, also available in R. Notice that the whole procedure is computationally feasible because, although we may be handling hundreds or even thousands of measurements for each curve, each data curve is summarized by a p-dimensional vector of coefficients, and this step is made through a (quick) least-squares regression (3.1). Moreover, trimmed k-means (like classical k-means) are well suited to working in moderately high dimensions, essentially because a rather parsimonious model is fitted.
This is a key aspect, because we need clustering procedures that can work well in high-dimensional spaces (notice that p = S + 4 could be rather high). Interested readers are referred to García-Escudero, Gordaliza, and Matrán (2003) for details concerning the trimmed k-means algorithm.

A theoretical justification for the use of k-means after fitting coefficients on a B-spline basis can be found in a recent paper by Abraham, Cornillon, Matzner-Løber, and Molinari (2003), where it is applied to a pH evolution problem. The technique of using k-means has also been applied by Crellin, Hastie, and Johnstone (1997) to analyze fMRI sequences. Tarpey, Petkova, and Ogden (2003) consider principal points (closely related to k-means) in a drug-response problem.

A different approach is based on considering the data curves as realizations of a random element taking values in an L2 functional space and applying k-means clustering there directly, measuring the discrepancy in (2.1) through the L2-distance between functions. Theoretical developments for this approach appear in Cuesta and Matrán (1988) (k-means) and Tarpey and Kinateder (2003) (principal points). Cuesta-Albertos and Fraiman (2005) also propose an impartial trimming technique, but working directly with the curves as functions in an L2-space. They consider the k = 1 case (impartial trimmed mean) and derive a depth concept for functional data. Previously, Fraiman and Muniz (2001) provided another curve-depth concept based on an integrated pointwise depth. In both papers, some interesting notions of robustness in functional data analysis are introduced. Other procedures for curve clustering appear in James and Sugar (2003), Cadez, Gaffney, and Smyth (2000), de Soete (1993), Heckman and Zamar (2000), and Huhtala, Kärkkäinen, and Toivonen (1999), among others.
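Putting the pieces together, here is a hedged sketch of the whole filtering-plus-clustering pipeline, assuming all n curves are recorded on a common grid; the name curve_cluster is ours.

```r
## Rows of 'curves' are the n recorded curves, all observed at 'grid';
## each curve is reduced to its p = S + 4 coefficients, and trimmed
## k-means (2.2) is then run on the coefficient vectors.
curve_cluster <- function(curves, grid, knots, k, alpha) {
  betas <- t(apply(curves, 1, function(x) fit_coefs(grid, x, knots)))
  res <- trimmed_kmeans(betas, k = k, alpha = alpha)
  res$betas <- betas  # keep coefficients for plotting center trajectories
  res
}
```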
4. Examples
This section illustrates the performance of the procedure on several examples in order to demonstrate the robustness gain provided by the proposed methodology.

4.1 Some Simulations
Figure 1 shows 22 curves: 10 of them were obtained from the function f(x) = sin(x) and 10 from g(x) = cos(x), evaluated on a grid of size 30 along the interval [0, 2π], with N(0, 0.4²)-distributed errors added. Two outlying curves centered at the line h(x) = −π + x are also drawn. The figure shows the result of applying the proposed procedure with α = 0.1 and S = 5 knots.
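A hedged sketch of how a data set like that of Figure 1 can be generated and clustered with the functions sketched above; the seed, knot placement, and noise level are illustrative.

```r
set.seed(123)
grid <- seq(0, 2 * pi, length.out = 30)
sines    <- t(replicate(10, sin(grid) + rnorm(30, sd = 0.4)))
cosines  <- t(replicate(10, cos(grid) + rnorm(30, sd = 0.4)))
outliers <- t(replicate(2, -pi + grid + rnorm(30, sd = 0.4)))
curves <- rbind(sines, cosines, outliers)
## S = 5 equally spaced interior knots on (0, 2*pi)
knots <- seq(0, 2 * pi, length.out = 7)[2:6]
res <- curve_cluster(curves, grid, knots, k = 2, alpha = 0.1)
res$trimmed  # indices of the curves flagged as possible outliers
```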
Figure 1. Robust clustering for k = 2, a trimming size α = 0.1 and considering S = 5 knots.
Figure 2. Non-robust clustering (using classical k-means) when k = 2 and S = 5 knots.
Notice that the procedure is perfectly able to detect the two cluster structures while also pointing out the two outlying curves, so these two outliers no longer influence the clustering of the non-outlying ones. Additionally, the coordinates of the trimmed k-mean centers, acting as coefficients in the B-spline basis, give the cluster center trajectories. Applying untrimmed k-means, we can see in Figure 2 that we do not recover the two cluster structures: the two main groups are joined together.

It may be argued that these two outlying curves could be detected by visual inspection, or that we should have chosen k = 3 as the number of groups. That is true, but visual inspection can be a hard task, especially when n is large, because the graph with all the curves is usually a saturated plot with so much noise that it is difficult to see anything. Moreover, it is interesting to develop "unsupervised" cluster techniques (in the sense of reducing the human presence, for instance, in routine studies on massive data sets) where we only need to say how many "main" groups the procedure must look for.
Figure 3. Robust clustering when k = 2, α = 0.1 and S = 5. Notice that an outlying curve that is not an outlier at any single coordinate is still detected.
Figure 4. Robust clustering when k = 2, α = 0.05 and S = 5 detecting an outlying curve caused by one single anomalous recording.
Notice that the data analyst may often be searching for a fixed number of groups (i.e., k is totally known in advance due to physical, economical, ... considerations) but may not be alert to the presence of spurious groups. Moreover, outlying curves may appear isolated, constituting groups by themselves, and a very high k would be needed to take care of all these "groups" (see the comments after Figure 5).

Let us now consider a set of curves obtained in a similar way as before, but replacing one of the outliers by another one centered at the function l(x) = 0. Notice that this curve is not a pointwise outlier at any coordinate; however, the procedure still detects it, as we can see in Figure 3.

Figure 4 is based on 20 observations generated as in the previous examples, but now one single bad or anomalous measurement at a single t value is artificially added. The trimming method with α = 0.05 is able to point out this outlying curve.
Figure 5. Robust clustering when k = 2, α = 0.2 and S = 5, removing 20% contamination.
Finally, in Figure 5 we consider a heavily contaminated case in which 20% of the curves are very wiggly, introducing a lot of noise into the data set. Taking an appropriate trimming size, namely α = 0.2, we see that the groups are perfectly detected and the outlying curves are removed.

We could also think that the contamination present in the data set of Figure 5 could simply be considered as a "third" cluster. However, the k-means method searches for groups that are not very differently scattered. Therefore, the (untrimmed) 3-means procedure does not create a new group with all the curves considered as outliers; it prefers to create a group with a single curve very distant from the remaining curves. Thus, a very high k is needed (k must be equal to the number of main groups plus, practically, one group for each outlying curve) in order to cope with the contamination level. Of course, considering such high k's does not seem to be a very reasonable strategy.

4.2 Traffic Data
Road traffic problems constitute an important field where functional data sets appear. For instance, monitoring segments of roads in order to obtain continuous information about the total volume of traffic, average speed, etc., is nowadays a frequent practice. The data could be used, for instance, to evaluate the capacity of roads, to forecast the time spent on car journeys, to decide the location of traffic police, and so on.

Here we present an example based on data collected on the expressways of Paris and its suburbs. Along the studied expressways, several stations containing speed detectors were placed in order to record the instantaneous average speed of the cars passing through these points. Each station takes 180 daily measurements along the time interval ranging from 5:00 to 23:00, each measurement being obtained 5 minutes after the preceding one. Here we consider data measured at a single station and corresponding to 617 consecutive days.
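For this application, the lines below are a hedged sketch of how the pipeline of Section 3 could be run; the object speed (a 617 × 180 matrix of daily curves) and the knot placement are our assumptions, not the authors' actual code.

```r
## Each row of 'speed' is one day; columns are the 180 five-minute slots.
hours <- seq(5, 23, length.out = 180)     # recording times (in hours)
knots <- seq(5, 23, length.out = 8)[2:7]  # S = 6 interior knots
res <- curve_cluster(speed, hours, knots, k = 5, alpha = 0.2)
length(res$trimmed)                       # 124 trimmed days for alpha = 0.2
table(res$cluster[-res$trimmed])          # sizes of the five clusters
```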
Figure 6. Curve clusters for the Traffic functional data resulting from the robust clustering methodology with α = 0.2 and k = 5 (S = 6 knots considered).
Plotting speed versus time for any selected day, we obtain a curve showing the behavior of the speed along that day. An important characteristic of these curves is their roughness, due to the presence of sudden changes: traffic jams arise and finish abruptly in this setting, causing vertical skips in the curves. Moreover, even when the speed seems fairly constant over a time interval, considerable noise appears due to differences among drivers and cars.

We have analyzed this data set following the robust clustering methodology presented in this paper. Finding groups of curves of similar shape can be an important task prior to other analyses, for example, forecasting the time spent on car journeys, which can be conditioned by factors such as whether it is a working day or a weekend, a sunny day or a rainy day, etc. We selected k = 5 and α = 0.2, and the results are presented in Figure 6.

At the top left corner of that figure we can see the cluster mean trajectories for each group. The sets of smoothed curves belonging to each cluster are plotted next. Cluster 1 is mostly formed by days where traffic jams occur in the morning, after which the speed remains high (around 100 km/h) and fairly constant. Cluster 2 contains the curves with the highest speeds (around 110 km/h) and minor or sporadic traffic jams in the afternoon. Clusters 3 and 4 contain curves corresponding to days with big traffic jams in the afternoon, these being earlier and longer in Cluster 3 than in Cluster 4. Finally, Cluster 5 contains curves with a fairly constant speed around 100 km/h (lower than in Cluster 2) and no significant traffic jams.
Figure 7. Curve clusters for the Traffic functional data resulting from non-robust k-means (α = 0) clustering (S = 6 knots considered).
The choice of k and α in this example was made in an interactive fashion; that is, several α's and k's were tried, taking advantage of the fact that the (non-trimmed) smoothed curves in each group can be visualized and interpreted. However, in Section 5 we present a procedure intended to help choose k and α, which could also have been applied here.

The benefit provided by the trimming procedure can be observed by comparing the above results with those obtained without trimming, shown in Figure 7. There we can see, at the top left corner, the new smoothed mean curves of every group, now showing patterns that are not as neat as before, due to the effect of atypical curves. Moreover, we can see that Cluster 5 is now mostly formed by atypical curves, and thus the curves included in Cluster 5 in the trimmed case are now included in Clusters 1 and 2; therefore, the speed in Cluster 2 is now lower than before. When obtaining typical average speeds for Clusters 1 and 3 in the untrimmed case, the rarest curves play an influential role in the determination of the mean trajectories for these two clusters, causing their mean speeds to diminish.

Figure 8 shows some randomly chosen trimmed curves (without smoothing) resulting from the trimming procedure (124 trimmed curves are obtained for α = 0.2). We can see curves corresponding to days with multiple traffic jams (in the morning and in the afternoon simultaneously, or in anomalous time periods) and days with slow traffic throughout the day or during long time periods. We can even see a data curve (the third curve in the last row) where the speed detector seems to fail, leading to incorrect constant speed measurements after the failure.
Figure 8. A sample of 12 trimmed curves (α = 0.2) for the Traffic functional data.
5. Some Further Extensions
There are other possibilities and open questions related to the proposed method. For example, it would be interesting to explore the dependence of the procedure on the number and placement of the knots used to compute the basis (Friedman and Silverman 1989). Other possible modifications arise from adding different constraints to the basis fitting process (restrictions on the extremes, periodicity, positiveness, ...) or from analyzing differences in the rates of change along curves through the clustering of the curves' functional derivatives.

The use of k-means after reducing dimension through Principal Components Analysis (PCA) is a habitual practice in multivariate data analysis. Therefore, another approach worth exploring would be to apply functional PCA (Ramsay and Silverman 1997) and then trimmed k-means to the resulting principal component scores of the curves.

An advantage provided by the impartial trimming method is that it can help to lessen certain nuisance effects, such as shift registration, when estimating the center trajectories. The use of high trimming sizes α leads the procedure to search for highly representative curves in each group (the procedure then behaves more like an estimator of the "modes").
Figure 9. Functional data set based on the function x exp(−x²), exhibiting a "shift displacement" phenomenon, and the result of applying a high trimming size α = 0.8 to this data set. Black curves on the left are the trimmed ones.
Figure 9 shows 60 curves based on the function f(x) = x exp(−x²) but considering shift displacements $\tilde{f}(x) = f(x - \Delta)$, with Δ following a N(0, 0.5²) distribution. Then, measurements at 40 values on the interval [−4.5, 4.5] are taken for every $\tilde{f}$, with a normally distributed error added. We observed that the estimate based on the 0.8-trimmed 1-mean (i.e., α = 0.8 and k = 1) is closer to the function f than the overall average of all the curves. Thus, high trimming sizes (together with an appropriate choice of k) can be useful for determining the "typical" trajectories within groups.

Another key aspect is the choice of the number of groups k and of the trimming size α. The necessity of choosing k in advance is a drawback usually attributed to the k-means methodology, and the proper determination of α relates to the estimation of the contamination level. The analysis of the so-called trimmed k-variance functionals (García-Escudero, Gordaliza, and Matrán 2003) provides guidance on choosing both k and α. These functionals map α ∈ (0, 1) onto $\widehat{W}_k(\alpha)$ (the minimum value attained in the double minimization problem (2.2) defining the trimmed k-means) for k = 1, 2, .... Roughly speaking, these functionals show how $\widehat{W}_k(\alpha)$ decreases as α increases, trimming off the least "integrated" data points within the cluster structure when considering k groups. Therefore, if a proper choice of k has been made, the decrease must be soft over most of the range of α. However, if an improper choice of k has been made, then $\widehat{W}_k(\alpha)$ suffers abrupt changes in its decreasing pattern as complete groups are allowed to be trimmed off. Moreover, an initial fast decrease in these curves (which is not corrected when considering higher k's) is surely due to the presence of outliers.
Figure 10. (a) Trimmed k-variance functionals suggesting k = 3 and α = 0.1. (b) The application of robust curve clustering for these values of k and α.
The value of α where this initial fast decrease ceases indicates the final level of trimming to choose. The changes in the rate of decrease are better found by analyzing (numerical) second derivatives: severe peaks in the numerical second derivative tell us that the choice of k is not adequate (more details can be found in García-Escudero, Gordaliza, and Matrán 2003).

Figure 10 shows three equally sized functional clusters centered at the functions f1(x) = sin(2πx), f2(x) = cos(2πx) and f3(x) = sin(2πx + π), with a 10% amount of contamination,
Figure 11. Trimmed k-variance functionals when k = 1, 2 (top) and their numerical second derivative (bottom) for the data set in Figure 5.
together with the cluster allocations obtained from the proposed procedure when choosing k = 3 and α = 0.1. These correct values of k and α could have been inferred from the trimmed k-variance functionals and their numerical second derivatives in Figure 10 (notice how the second-derivative curve for k = 3 has no peaks and how the initial fast decay ceases at about α = 0.1).

As another example, we apply this methodology for determining k and α to the data set appearing in Figure 5. The trimmed k-variance functionals and their numerical second derivatives for this set of curves are plotted in Figure 11. Notice that we can observe changes in the rate of decrease of the trimmed 1-variance, causing a severe "sign change" (at about α = 0.5) in its numerical second derivative. This suggests that k = 1 is not a good choice. That "sign change" (or peak) does not appear when k = 2, making it a good choice for k. Notice also that an initial fast decrease, ceasing at about α = 0.2, appears in the trimmed 2-variance; thus α = 0.2 could be selected as a good candidate for the trimming level.

Finally, we would also like to stress that trimmed k-means are, obviously, closely related to k-means, and it is well known that k-means are especially oriented to data samples with spherical and similarly sized clusters. Thus, it would also be interesting to extend this method to cover different dispersions or scatterings within clusters, possibly depending on the time argument. Indeed, we are currently working in this direction.
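As a closing illustration, here is a hedged sketch of how trimmed k-variance curves like those of Figures 10 and 11 can be computed from a coefficient matrix betas (assumed available from the filtering step); the grid of α's and the use of second differences as a numerical second derivative are our own choices.

```r
alphas <- seq(0.025, 0.6, by = 0.025)
## W[i, k] approximates the trimmed k-variance functional at alphas[i]
W <- sapply(1:3, function(k)
  sapply(alphas, function(a) trimmed_kmeans(betas, k, a)$obj))
matplot(alphas, W, type = "l", lty = 1:3,
        xlab = "trimming size", ylab = "trimmed k-variance")
## peaks / sign changes in the numerical second derivative flag bad k's
d2 <- apply(W, 2, diff, differences = 2)
matplot(alphas[-c(1, 2)], d2, type = "l", lty = 1:3)
```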
References

ABRAHAM, C., CORNILLON, P.A., MATZNER-LØBER, E., and MOLINARI, N. (2003), "Unsupervised Curve Clustering Using B-splines," Scandinavian Journal of Statistics, 30, 1–15.

CADEZ, I., GAFFNEY, S., and SMYTH, P. (2000), "A General Probabilistic Framework for Clustering Individuals," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 140–149.

CRELLIN, N., HASTIE, T., and JOHNSTONE, I.M. (1997), "Statistical Models for Image Sequences," in L. Billard and N.I. Fisher (Eds.), Computing Science and Statistics. Interface '96. Proceedings of the 28th Symposium on the Interface, Fairfax Station, VA: The Interface Foundation of North America, Inc.

CUESTA-ALBERTOS, J.A., and FRAIMAN, R. (2005), to appear in R. Liu, R. Serfling, and D. Souvaine (Eds.), Data Depth: Robust Multivariate Statistical Analysis, Computational Geometry and Applications, DIMACS Series, American Mathematical Society.

CUESTA-ALBERTOS, J.A., GORDALIZA, A., and MATRÁN, C. (1997), "Trimmed k-means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.

CUESTA, J.A., and MATRÁN, C. (1988), "The Strong Law of Large Numbers and Best Possible Nets of Banach Valued Random Variables," Probability Theory and Related Fields, 78, 523–534.

DE BOOR, C. (1978), A Practical Guide to Splines, New York: Springer-Verlag.

DE SOETE, G. (1993), "Hierarchical Clustering of Sampled Functions," in O. Opitz, B. Lausen, and R. Klar (Eds.), Information and Classification, Berlin: Springer-Verlag.

EUBANK, R.L. (1988), Spline Smoothing and Nonparametric Regression, New York: Marcel Dekker.

FRAIMAN, R., and MUNIZ, G. (2001), "Trimmed Means for Functional Data," TEST, 10, 419–440.

FLURY, B., and TARPEY, T. (1993), "Representing a Large Collection of Curves: A Case for Principal Points," American Statistician, 47, 305–306.

FRIEDMAN, J.H., and SILVERMAN, B.W. (1989), "Flexible Parsimonious Smoothing and Additive Modelling (With Discussion and Report)," Technometrics, 31, 1–39.

GARCÍA-ESCUDERO, L.A., and GORDALIZA, A. (1999), "Robustness Properties of k-means and Trimmed k-means," Journal of the American Statistical Association, 94, 956–969.

GARCÍA-ESCUDERO, L.A., GORDALIZA, A., and MATRÁN, C. (1999), "A Central Limit Theorem for Multivariate Generalized Trimmed k-means," The Annals of Statistics, 27, 1061–1079.

GARCÍA-ESCUDERO, L.A., GORDALIZA, A., and MATRÁN, C. (2003), "Trimming Tools in Exploratory Data Analysis," Journal of Computational and Graphical Statistics, 12, 434–449.

GREEN, P.J., and SILVERMAN, B.W. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, London: Chapman and Hall.

HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley.

HECKMAN, N., and ZAMAR, R.H. (2000), "The Shape of Regression Curves," Biometrika, 87, 135–144.

HUHTALA, I., KÄRKKÄINEN, J., and TOIVONEN, H. (1999), "Mining for Similarities in Aligned Time Series Using Wavelets," Data Mining and Knowledge Discovery: Theory, Tools, and Technology, SPIE Proceedings Series, 3695, 150–160.
JAMES, G.M., and SUGAR, C.A. (2003), "Clustering for Sparsely Sampled Functional Data," Journal of the American Statistical Association, 98, 397–408.

JONES, M.C., and RICE, J.A. (1992), "Displaying the Important Data Features of a Large Collection of Similar Curves," The American Statistician, 46, 140–145.

KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data, New York: Wiley.

LOCANTORE, N., MARRON, J.S., SIMPSON, D.G., ZHANG, J.T., and COHEN, K.L. (1999), "Robust Principal Component Analysis for Functional Data," TEST, 8, 1–73.

MCQUEEN, J. (1967), "Some Methods for Classification and Analysis of Multivariate Observations," 5th Berkeley Symposium on Mathematics, Statistics, and Probability, Vol. 1, 281–298.

RAMSAY, J., and SILVERMAN, B. (1997), Functional Data Analysis, New York: Springer.

ROUSSEEUW, P.J., and LEROY, A.M. (1987), Robust Regression and Outlier Detection, New York: Wiley-Interscience.

ROUSSEEUW, P.J., and VAN DRIESSEN, K. (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, 41, 212–223.

SCHUMAKER, L.L. (1981), Spline Functions: Basic Theory, New York: Wiley.

TARPEY, T., and KINATEDER, K.K.J. (2003), "Clustering Functional Data," Journal of Classification, 20, 93–114.

TARPEY, T., PETKOVA, E., and OGDEN, R.T. (2003), "Profiling Placebo Responders by Self-consistent Partitioning of Functional Data," Journal of the American Statistical Association, 98, 850–858.