Analytica Chimica Acta 468 (2002) 91–103
Representative subset selection
M. Daszykowski, B. Walczak 1, D.L. Massart *
ChemoAC, Pharmaceutical Institute, VUB (FABI), Laarbeeklaan 103, B-1090 Brussels, Belgium
Received 15 March 2002; received in revised form 20 June 2002; accepted 8 July 2002
* Corresponding author. Tel.: +32-2-477-4737; fax: +32-2-477-4735. E-mail address: [email protected] (D.L. Massart).
1 On leave from the Silesian University, 9 Szkolna Street, 40-006 Katowice, Poland.
Abstract
The fast development of analytical techniques makes it possible to acquire huge amounts of data. Large data sets are difficult to handle and, therefore, there is considerable interest in selecting a subset of the original data set that preserves its information content and facilitates the computations. Many subset selection methods exist, and their choice depends on the problem at hand. The two most popular groups of subset selection methods are uniform designs and cluster-based designs. The methods considered in this paper include uniform designs, such as the one proposed by Kennard and Stone and OptiSim, and cluster-based designs applying the K-means technique and density based spatial clustering of applications with noise (DBSCAN). Additionally, a new concept of subset selection with K-means is introduced. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Data mining; Subset selection; Uniform design
1. Introduction
Data mining is the process of extracting useful knowledge from a large database. A number of data mining techniques have been developed, but frequently a single data mining technique is insufficient and, instead, several techniques must be employed to support a single application. However, applying different procedures to large databases causes a computational problem. A straightforward solution is to reduce the amount of data by taking a subset of representative objects from the database. There are many subset selection methods, an overview of which can be found in [1-7]. The two main groups of approaches used for the selection of representative objects are cluster-based designs and uniform designs; they play the most important role in the arsenal of subset selection methods. The goal of cluster-based designs is to cluster the data set and then, according to the clustering results, to select the representative objects. K-means is a well-known partitioning technique and it has already found applications in cluster-based designs. With uniform designs, the representative objects are selected to cover the data space uniformly. The Kennard and Stone algorithm [8] is well known in the field of chemometrics and has already found many applications. A more recent technique for uniform design, OptiSim [9], seems to be an attractive alternative to the Kennard and Stone algorithm and is therefore worth further study.
2. Theory

2.1. Designs for uniform data space coverage

2.1.1. Kennard and Stone algorithm
The best known of this type of uniform design in chemometrics is the Kennard and Stone design [8]. It selects a subset of n-dimensional objects, 'uniformly' distributed in the experimental space. All m objects from the given data set X(m, n) are considered as candidates for the subset. The object closest to the mean of the data set can be regarded as the most representative one and it is included as the first subset object. Let us denote it as s1. The consecutive objects are selected sequentially, based on the squared distances to the objects already assigned to the subset. The squared distance from the ith to the jth object, d_ij^2, is defined as:

$$d_{ij}^2 = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = \sum_{k} (x_{ik} - x_{jk})^2 \qquad (1)$$
The second object introduced into the subset, s2, is the most distant from s1. The third one, s3, is the most distant from both s1 and s2, and so on. The main steps of the Kennard and Stone algorithm can be presented as follows: let s1, s2, ..., sk, where k < m, denote the objects that have already been assigned to the subset. We are looking for the (k + 1)th object among the remaining (m - k) candidates, namely the one that is most distant from its nearest object belonging to the subset. The effect of applying the Kennard and Stone algorithm is thus to select the central object first, then some extreme objects situated at the border of the data space, and then to fill in the data space. A description of other possible algorithms used in 'uniform designs' (i.e. maximum dissimilarity, sphere exclusion) can be found in [1,9-13]. All 'uniform design' algorithms can be presented as follows (a minimal code sketch is given after the list):
1. select the object which is closest to the data mean and add it to the subset;
2. calculate the dissimilarity between the remaining objects in the data set and the objects in the subset;
3. select the object which is the most dissimilar to the ones already included in the subset;
4. return to step 2 until the desired number of objects in the subset is reached.
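The selection loop above can be sketched in a few lines of NumPy. The sketch below illustrates the Kennard and Stone scheme as described here; it is not the authors' original code, and the function name kennard_stone and the use of squared Euclidean distances are our assumptions.

```python
import numpy as np

def kennard_stone(X, k):
    """Select k rows of X (m x n) that cover the data space uniformly,
    following the Kennard and Stone scheme described above."""
    X = np.asarray(X, dtype=float)
    # step 1: start with the object closest to the data mean
    mean = X.mean(axis=0)
    selected = [int(np.argmin(((X - mean) ** 2).sum(axis=1)))]
    # squared distance of every object to its nearest selected object
    d_min = ((X - X[selected[0]]) ** 2).sum(axis=1)
    while len(selected) < k:
        # step 3: pick the candidate whose nearest selected object is farthest away
        next_obj = int(np.argmax(d_min))
        selected.append(next_obj)
        # step 2: update the nearest-selected-object distances
        d_min = np.minimum(d_min, ((X - X[next_obj]) ** 2).sum(axis=1))
    return selected

# usage on random data standing in for PCA scores
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 2))
subset = kennard_stone(scores, 20)
```

Keeping only the running minimum distance to the growing subset avoids recomputing all pairwise distances at every step, which is what makes the sequential scheme practical.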
2.1.2. OptiSim
An effective algorithm leading to a uniform design of selected objects is OptiSim, proposed by Clark [9]. OptiSim requires input parameters, namely a threshold (ε), defined as the minimal distance between 'dissimilar' objects, the number of objects to be selected, k, and the size, S, of a data subsample, which is constructed to speed up the computations. The input parameters are defined by the user. The value of S is usually selected between 5 and 25% of the total amount of data. Four different sets are defined during the algorithm's run: the data set, a subset where the representative objects are stored, a subsample (S randomly selected objects from the original data set, each of which is at a distance of at least ε from all objects in the subset) and the recycle bin. First, the object closest to the data mean is introduced into the subset and removed from the data set. The following steps of OptiSim can be summarized as follows: a data set object is randomly selected and, if it is at least ε away from the objects of the subset, it is added to the subsample; otherwise it is removed from the data set. The random selection of subsample objects is continued until S objects (at least ε away from the objects of the subset) are found. For each subsample object, the distance to the closest object from the subset is calculated. The subsample object for which this distance is maximal is added to the subset. The remaining objects from the subsample are transferred to the recycle bin (which temporarily stores objects selected for the subsample but not for the subset). The selection of a random subsample and of objects for the subset is repeated until k representative objects are found. If the data set is empty and less than k objects are included in the subset, the objects from the recycle bin are transferred back to the data set and the procedure is repeated.
We propose a default value of ε, obtained by considering the fraction of objects to be selected (k/m) and the volume, V, of the experimental domain, assuming m objects uniformly distributed within the range of the experimental variables. The volume of an n-dimensional hypersphere of radius ε is:

$$V_{\mathrm{sphere}} = \frac{\sqrt{\pi^{n}}}{\Gamma(n/2 + 1)}\,\varepsilon^{n} \qquad (2)$$

where n denotes the dimensionality of the data set X, i.e. the number of variables, ε the radius of the hypersphere, and Γ the gamma function with the argument (n/2 + 1). The volume V of the experimental domain is computed as the product of the original variables' ranges:

$$V = \prod_{j=1}^{n} \mathrm{range}(\mathbf{x}_j) \qquad (3)$$

The gamma function generalizes the factorial expression to non-integer arguments. Equating the hypersphere volume to the fraction (k/m) of V, the radius, ε, of the n-dimensional hypersphere expected to contain k objects is thus:

$$\varepsilon = \sqrt[n]{\frac{(k/m)\,V\,\Gamma(n/2 + 1)}{\sqrt{\pi^{n}}}} \qquad (4)$$

The radius ε can also be considered as an estimate of the distance to the kth nearest neighbor. A code sketch of the selection procedure and of this default threshold is given below.
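The following sketch follows the description above rather than Clark's original implementation; the names optisim and default_epsilon, the Euclidean metric and the handling of ties are our assumptions.

```python
import numpy as np
from math import gamma, pi

def default_epsilon(X, k):
    """Default threshold of Eq. (4): radius of an n-ball expected to hold
    the fraction k/m of objects uniformly spread over the variable ranges."""
    m, n = X.shape
    V = np.prod(X.max(axis=0) - X.min(axis=0))        # Eq. (3)
    return ((k / m) * V * gamma(n / 2 + 1) / pi ** (n / 2)) ** (1.0 / n)

def optisim(X, k, S, eps, seed=0):
    """Select k representative rows of X following the OptiSim scheme above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    pool = list(range(X.shape[0]))                    # the data set
    recycle = []                                      # the recycle bin
    # first object: the one closest to the data mean
    start = int(np.argmin(((X - X.mean(axis=0)) ** 2).sum(axis=1)))
    subset = [start]
    pool.remove(start)
    while len(subset) < k:
        if not pool:                                  # data set exhausted:
            if not recycle:                           # nothing left at all
                break
            pool, recycle = recycle, []               # refill from the recycle bin
        # build a subsample of up to S objects lying at least eps from the subset
        subsample = []
        while pool and len(subsample) < S:
            i = pool.pop(rng.integers(len(pool)))
            d = np.sqrt(((X[subset] - X[i]) ** 2).sum(axis=1)).min()
            if d >= eps:
                subsample.append(i)
            # objects closer than eps are removed from the data set
        if not subsample:
            continue
        # move the subsample object most dissimilar to the subset into the subset
        d_near = [np.sqrt(((X[subset] - X[i]) ** 2).sum(axis=1)).min()
                  for i in subsample]
        best = subsample.pop(int(np.argmax(d_near)))
        subset.append(best)
        recycle.extend(subsample)                     # the rest go to the recycle bin
    return subset

# usage: 20 representatives from random 2-D data standing in for PCA scores
rng = np.random.default_rng(1)
scores = rng.normal(size=(300, 2))
eps = default_epsilon(scores, 20)
subset = optisim(scores, k=20, S=int(0.2 * len(scores)), eps=eps)
```

Because only a random subsample of size S is scored against the subset at each step, the cost per selected object is governed by S rather than by the size of the whole data set, which is the source of the speed advantage discussed in Section 4.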
2.2. Cluster-based selection
The aim of cluster-based designs is to select objects from the clusters identified by a clustering method applied to the database. There are three main classes of clustering methods: hierarchical, non-hierarchical [14-16] and density-based [17-21]. Hierarchical approaches, such as single linkage, Ward's method, etc., lead to nested clusters, i.e. small data subsets are nested within larger ones. The results of agglomerative or divisive hierarchical approaches are presented in the form of dendrograms. The main drawback of these clustering methods is that the clustering has to be redone if new samples are added to the original data set. Hierarchical clustering methods have additional limitations, such as their high storage and computational requirements. In non-hierarchical clustering, e.g. relocation methods such as K-means [14-16], the data set is partitioned into subsets, the members of which are similar to each other but differ from the members of the other subsets. Density-based methods are, for instance, ordering points to identify the clustering structure (OPTICS) [17-19], density-based clustering (DENCLUE) [19] and density based spatial clustering of applications with noise (DBSCAN) [19-21]. These approaches cluster the data according to the data density, defined as the number of objects contained in a certain volume. An overview of clustering methods can be found in [14-16,19].
The density-based techniques do not directly allow the selection of a predefined number of representative objects from each cluster found.
This is due to the fact that they give a list of objects belonging to a certain group, but not a desired number of objects from a certain cluster. With non-hierarchical methods, such as the K-means method, the desired number of clusters has to be specified, and it is set equal to the number of representative objects to be selected. From each of those clusters one can then select one object, e.g. the one closest to the cluster mean. This seems to work well when the data do not contain natural clusters. If the data do contain clusters, then the cluster-based selection with the K-means method may not be optimal, as demonstrated in the discussion part. When the data are clustered, one should first find the natural clusters (for which we propose DBSCAN), decide how many objects should come from each cluster, and then select that number of objects from the clusters applying, e.g., the K-means method.

2.2.1. Density based spatial clustering of applications with noise (DBSCAN)
DBSCAN [20,21] is a clustering technique based on the data density. To perform clustering, the ith object of the data set is assigned to one of three categories, namely:
• when the neighborhood of radius ε of the ith object contains at least p objects, the ith object is a core object (see Fig. 1a);
• when the neighborhood of radius ε of the ith object contains less than p objects but at least one core object, the ith object is a border object (see Fig. 1b);
• when the neighborhood of radius ε of the ith object contains less than p objects and no core object, the ith object is an outlier (see Fig. 1c).
The clusters are formed by at least one core object and border objects, whereas outliers do not belong to any cluster. The main advantages of the method are its ability to detect clusters of arbitrary shapes, e.g. [20,21], and its speed. DBSCAN is a so-called single-scan technique, which means that the clustering is performed during one single scan through the data set. Due to this property, the clustering is very fast, and therefore the method can be applied to large data sets. There are two input parameters which should be optimized, namely the number of objects in the neighborhood, p, and the radius of the neighborhood, ε. A default value for ε can be calculated according to Eq. (4), where k = p.
Fig. 1. Types of objects according to the neighborhood's density. For p = 3 the ith object is: (a) a core object; (b) a border object; (c) an outlier.
The DBSCAN algorithm can be presented as follows. Define the input parameters, i.e. the number of objects in the neighborhood, p, and the radius of the neighborhood, ε. As long as there are not yet processed objects:
1. select randomly a not yet processed object;
2. if the selected object is a core object, mark it as processed, create a new cluster A and assign the selected object to it;
3. find the neighbors of the selected object at a distance lower than ε, transfer them to 'seeds', assign them to cluster A and mark them as processed;
4. for each object in 'seeds', find the neighbors that are closer than ε; these objects are cores or borders of the same cluster. If they were not yet members of 'seeds', add them to 'seeds', assign them to cluster A and mark them as processed.
The remaining objects, not assigned to any cluster, are outliers. A minimal code sketch of this procedure is given below.
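The sketch below follows the cluster-growing steps above. It is an illustration only, not the single-scan implementation used by the authors: it builds a full pairwise distance matrix (O(m^2) memory), so it is suited to small data sets; the function name dbscan and the label convention (-1 for outliers) are our assumptions.

```python
import numpy as np

def dbscan(X, eps, p):
    """Minimal DBSCAN: return a cluster label per object, -1 for outliers."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    # pairwise Euclidean distances and eps-neighborhoods (including the object itself)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    neighbours = [np.where(d[i] <= eps)[0] for i in range(m)]
    core = np.array([len(nb) >= p for nb in neighbours])
    labels = np.full(m, -1)              # -1 = not assigned / outlier
    cluster = -1
    for i in range(m):
        if not core[i] or labels[i] != -1:
            continue
        cluster += 1                     # start a new cluster from an unprocessed core
        labels[i] = cluster
        seeds = list(neighbours[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster      # cores and borders join the cluster
                if core[j]:
                    seeds.extend(neighbours[j])   # only core objects expand it
    return labels
```

Objects that are never reached from a core object keep the label -1 and correspond to the outliers of Fig. 1c.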
2.2.2. K-means
K-means [14,15] is a clustering technique which divides the experimental data set, X, into k clusters, where k is given by the user. At the beginning, the objects are randomly assigned to the k clusters. During the consecutive iterations the cluster centers, x̄_j, are redistributed over the data space in such a way that the objects which are in one cluster are more similar to each other than to objects in the other clusters. The similarity between objects is measured by, for instance, the Euclidean distance. In practice this means that the sum of squared distances, E, between the center of each cluster, x̄_j, and the objects assigned to it is minimized:

$$E = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \bar{\mathbf{x}}_j\|^2 \qquad (5)$$

where C_j denotes the set of objects assigned to the jth cluster. The main steps of the K-means algorithm [22,23] can be summarized as follows:
1. select the number of clusters to be found, k;
2. assign the data objects randomly to k clusters;
3. calculate the mean, x̄_j, of each cluster;
4. reassign each object to its nearest x̄_j and then calculate E according to Eq. (5);
5. repeat steps 3 and 4 until convergence of E.
A minimal sketch of this relocation loop is given below.
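The sketch follows steps 1-5 above; the random initial assignment, the re-seeding of empty clusters and the convergence tolerance are our assumptions, and in practice a library routine (e.g. scikit-learn's KMeans) would normally be used instead.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means relocation loop following steps 1-5 above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = rng.integers(k, size=X.shape[0])       # step 2: random assignment
    E_old = np.inf
    for _ in range(n_iter):
        # step 3: cluster means (re-seed randomly if a cluster lost all objects)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else X[rng.integers(X.shape[0])] for j in range(k)])
        # step 4: reassign each object to its nearest mean and evaluate Eq. (5)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        E = d2.min(axis=1).sum()
        if abs(E_old - E) < 1e-12:                  # step 5: stop when E has converged
            break
        E_old = E
    return labels, centres, E
```

For cluster-based selection, one representative per cluster can then be taken as the object closest to its cluster mean.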
3. Data
Data set 1 consists of 160 NIR spectra, measured for 80 corn samples on two different spectrometers. The spectra were collected at 2 nm intervals within the spectral range 1100-2498 nm. The data set is available via the internet [24]. Data set 2 contains 536 NIR spectra of three creams with three different concentrations of an active drug. These spectra were measured in the range 1100-2500 nm at two different temperatures (20 and 30 °C) and with two ways of filling the cups (bad and good) [25]. Data set 3 contains 535 simulated objects in a two-dimensional data space, grouped into four clusters containing 16, 17, 225 and 277 objects, respectively. Data set 4 contains 175 NIR spectra of different excipients, measured in the spectral range 1100-2468 nm at 2 nm intervals.
Fig. 2. OptiSim run for data set 1 (S = 20% of the total amount of objects in the data), with different thresholds: (I) 0.5ε, (II) ε, where ε is calculated according to Eq. (4). Results are displayed after: (a) 2; (b) 6; (c) 15; (d) 19 iterations. The subsample objects are marked as (□), the data set as (·), and the representative objects as (■).
The spectra were SNV-transformed [26]. A detailed description of this data set can be found in [27]. Data sets 1, 2 and 4 were efficiently compressed by principal component analysis (PCA) [28]. The main feature of PCA is that it extracts a small set of new variables, the so-called principal components (PCs), which are linear combinations of the original variables and efficiently describe most of the data variance. In our study, the first two PCs of data sets 1, 2 and 4, describing 99.8, 99.4 and 92.4% of the data variance, respectively, were used for further calculations and for visualization purposes. A minimal sketch of this compression step is given below.
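The compression step can be reproduced along the following lines; column mean-centring before the decomposition and the SVD-based computation of the scores are our assumptions about the preprocessing, which the paper does not spell out.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores of the first principal components of a mean-centred data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # column mean-centring (assumed)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T             # projections on the first loadings
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return scores, explained
```

The returned explained-variance fraction allows checking that two PCs indeed capture most of the spectral variance before the selection algorithms are applied to the scores.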
4. Results and discussion

4.1. Uniform subset selection
This type of selection is important when developing a calibration or classification model. The model set should contain samples covering the whole data range and representing all sources of data variance. If experimental design is possible, a uniform design can be applied, which does not require any assumptions about the model's form. When experimental design is impossible (e.g. the samples are natural products), the model set should be selected such that maximum coverage of the data space is obtained. For classification purposes, when the dependent variable, y, is discrete (in other words, y describes the belongingness of objects to data classes), subsets should be selected from the individual classes. Usually the Kennard and Stone algorithm is applied, but OptiSim seems an attractive alternative. To demonstrate the main idea of OptiSim, it was run on data set 1, compressed to two PCs describing 99.8% of the data variance. In the consecutive subplots of Fig. 2, the objects selected (■) during the algorithm's run with different values of the subsample size, S, and different values of the threshold, ε, after 2, 6, 15 and 19 iterations are displayed (please notice the differences in axis ranges). The subsample of S objects, from which objects are selected for the subset, is presented in Fig. 2 as (□), whereas the remaining data set objects are marked as (·) and the representative objects as (■). The threshold ε should be large enough, otherwise the representative objects may not cover the data space uniformly. As presented in Fig. 2, the higher the value of ε, the faster the data set objects (·) are removed (compare Fig. 2Ic and IIc). If ε is calculated according to Eq. (4), the representative objects describe the data space well (see Fig. 2Id and IId). The distribution of the objects selected by OptiSim for data sets 1 and 2 does not differ much from the distribution achieved with the Kennard and Stone algorithm (compare Fig. 3a-d), but it depends on the number of objects in the subsample, S. This number, S, strongly influences the running time of OptiSim. The lower the value of S, the faster the algorithm is, but S should be reasonably high to ensure an optimal distribution of the selected objects in the data space.
Fig. 3. Twenty representative objects, marked as (■), selected from: (I) data set 1, (II) data set 2, according to (a) the Kennard and Stone algorithm and OptiSim with: (b) S = 10%; (c) 25%; (d) 40% of the total amount of objects in the data.
Fig. 4. Running time of the OptiSim algorithm versus: (a) the number of objects in the random subsample, S (threshold = ε); (b) the number of objects to be selected; (c) running time of the Kennard and Stone algorithm versus the number of objects to be selected.
Fig. 5. (a) 5 and (b) 25 representative objects, marked as (■), found by the K-means technique for: (I) data set 1; (II) data set 2.
The computational time of an OptiSim run to select 30 representative objects from data set 2 with different numbers of objects in the subsample, S, is presented in Fig. 4a. Fig. 4b and c present the computational time of OptiSim (S = 20% of the total amount of objects in the data set) and of the Kennard and Stone algorithm, respectively, run for data set 2 to select different numbers of objects. OptiSim is computationally very efficient compared to the Kennard and Stone algorithm (see Fig. 4b and c). It is therefore recommended for uniform design purposes when the number of objects to be selected is large and/or the data set itself is large.
4.2. Cluster-based subset selection
Selection of representative objects with cluster-based approaches ensures that the selected objects cover the data space in such a way that any object not selected is near to a selected one. The most popular method applied for this purpose is K-means. It allows the selection of k representative objects by partitioning the data into k classes, from each of which the object closest to the class mean is selected. K-means used for cluster-based design can deal properly only with data that are not clustered or with special types of clusters, i.e. round and compact ones. Then, the selected objects describe the data distribution well.
Fig. 6. (a) 4, (b) 10 and (c) 25 representative objects, marked as (■), found by the K-means technique for data set 3; (d) 10 and (e) 15 representative objects, marked as (■), found by the K-means technique for data set 4.
Fig. 7. (a) Clusters identified by DBSCAN (ε, p = 2) in data set 3; the selection of (b) 4 and (c) 19 representative objects, marked as (■), proportional to the number of objects in each cluster, with K-means; (d) clusters identified by DBSCAN (ε, p = 2) in data set 4; the selection of (e) 14 and (f) 10 representative objects, marked as (■), proportional to the number of objects in each cluster.
Examples are given in Fig. 5, where the positions of the representative objects (■) are presented for data set 1 (with no clusters) and for data set 2 (which reveals a clustering tendency), respectively. If the cost function of K-means, described by Eq. (5), is minimized for the types of data distribution given in Fig. 5, the distribution of the representative objects is similar to that obtained with uniform designs. Thus, for large data sets, OptiSim is recommended, due to its computational efficiency.
There are, however, cases in which the application of K-means for cluster-based design can lead to sub-optimal results, namely, some clusters can be neglected. To demonstrate this problem, two data sets are used, namely the simulated two-dimensional data set 3 and the real data set 4 (see Fig. 6a and b). Chemical data sets are often clustered. In many cases, distributions of arbitrary shapes are observed, such as those presented in Fig. 6a and d. When the number of representative objects to be selected is small, the K-means technique used for cluster-based design fails (compare Fig. 6a, b and d). In Fig. 6a and b the elongated clusters are partitioned into smaller ones. Two compact clusters in data set 3 (see Fig. 6a and b) are not represented at all, whereas in data set 4, two clusters were neglected (Fig. 6d). Thus, the selection of the representative objects can be regarded as not optimal. When k is large enough, all clusters are represented well (see Fig. 6c and e). A large value of k minimizes the risk of neglecting clusters, because less populated regions have more chance to be represented.
To overcome the sub-optimal selection of representative objects from clustered data sets, a simple solution is to cluster the data with DBSCAN, and then to run K-means for each class separately to select the desired number of representative objects. One advantage of this approach is that the selection is independent of the cluster shapes, because DBSCAN is able to deal with any type of data distribution. A minimal sketch of this two-step selection is given below.
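The sketch below uses scikit-learn's DBSCAN and KMeans as stand-ins for the authors' own implementations; the function name, the allocation of representatives proportionally to cluster size with at least one object per cluster, and the choice of representative as the member closest to each K-means centre are illustrative assumptions (the paper discusses both equal and proportional allocation).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def cluster_based_selection(X, n_select, eps, p=2):
    """Cluster with DBSCAN, then pick representatives per cluster with K-means."""
    X = np.asarray(X, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=p).fit_predict(X)   # -1 marks outliers
    clusters = [np.where(labels == c)[0] for c in np.unique(labels) if c != -1]
    sizes = np.array([len(c) for c in clusters])
    # allocate the requested number of objects proportionally to cluster size,
    # but take at least one object from every cluster (an illustrative choice)
    alloc = np.maximum(1, np.round(n_select * sizes / sizes.sum()).astype(int))
    selected = []
    for members, k in zip(clusters, alloc):
        k = min(int(k), len(members))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[members])
        for centre in km.cluster_centers_:
            # representative = cluster member closest to the K-means centre
            d2 = ((X[members] - centre) ** 2).sum(axis=1)
            selected.append(int(members[np.argmin(d2)]))
    return selected
```

Because the K-means step is run inside each DBSCAN cluster, no natural cluster can be missed, regardless of its shape or population.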
An example of this approach is illustrated with data sets 3 and 4 (compressed to two PCs). Both data sets were clustered with DBSCAN (p = 2, ε calculated according to Eq. (4)). As shown in Fig. 7a and d, they reveal a strong clustering tendency. DBSCAN found four clusters in data set 3, formed by 16, 17, 225 and 277 core objects, respectively. In data set 4, seven clusters were detected, containing 30, 16, 21, 15, 37, 32 and 14 objects, respectively (see Fig. 7d). Six objects, located outside the ellipses, are outliers (see Fig. 7d), whereas the remaining objects are cores. Once the clusters are detected, the user can define how many representative objects should be selected with the K-means method from each of them. There are many possibilities for how the selection can be performed. For instance, the same number of representative objects can be selected from each cluster, or the selection can be performed proportionally to the number of objects in each cluster, etc. In Fig. 7b, one representative object was selected from each cluster of data set 3, and in Fig. 7e, two representative objects were selected from each cluster of data set 4. Fig. 7c and f present the selection of representative objects proportional to the number of objects in the clusters for data sets 3 and 4, respectively. In Fig. 7c, two representative objects were selected from each of clusters 1 and 2, and eight and nine representative objects from clusters 3 and 4, respectively. In Fig. 7f, 10 representative objects were selected from data set 4, namely two objects from each of clusters 1, 5 and 6, and only one from each of the other clusters.

5. Conclusions
Uniform designs can deal with any kind of data structure. For large data sets the Kennard and Stone algorithm becomes computationally expensive. OptiSim, proposed by Clark [9], seems to be an interesting alternative to the Kennard and Stone algorithm due to its computational properties, and its results are comparable to those obtained with the Kennard and Stone algorithm. As demonstrated in our study, the K-means technique used for cluster-based design can fail. When clusters of arbitrary shapes are present in the data, the K-means technique often leads to a sub-optimal selection of representative objects, i.e. it can neglect some clusters. In order to minimize the risk of sub-optimal representative object selection with K-means, a new approach has been recommended: first cluster the data with DBSCAN, and then use K-means for each cluster separately.

References
[1] P.M. Dean, R.A. Lewis (Eds.), Molecular Diversity in Drug Design, Kluwer Academic Publishers, Dordrecht, 1999.
[2] J.B. Dunbar Jr., Cluster-based selection, Perspect. Drug Discov. Des. 7/8 (1997) 51.
[3] J.S. Mason, S.D. Pickett, Partition-based selection, Perspect. Drug Discov. Des. 7/8 (1997) 85–114.
[4] R.D. Brown, Y.C. Martin, Use of structure–activity data to compare structure-based clustering methods and descriptors for use in compound selection, J. Chem. Inf. Comput. Sci. 36 (1996) 572–584.
[5] R.E. Higgs, K.G. Bemis, I.A. Watson, J.H. Wikel, Experimental designs for selecting molecules from large chemical databases, J. Chem. Inf. Comput. Sci. 37 (1997) 861–870.
[6] D.K. Agrafiotis, Stochastic algorithms for maximizing molecular diversity, J. Chem. Inf. Comput. Sci. 37 (1997) 841–851.
[7] A.K. Ghose, V.N. Viswanadhan (Eds.), Combinatorial Library Design and Evaluation, Marcel Dekker, New York, 2001.
[8] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 11 (1969) 137–148.
[9] R.D. Clark, OptiSim: an extended dissimilarity selection method for finding diverse representative subsets, J. Chem. Inf. Comput. Sci. 37 (1997) 1181–1188.
[10] R. Taylor, Simulation analysis of experimental design strategies for screening random compounds as potential new drugs and agrochemicals, J. Chem. Inf. Comput. Sci. 35 (1995) 59–67.
[11] J.D. Holliday, S.S. Ranade, P. Willett, A fast algorithm for selecting sets of dissimilar structures from large chemical databases, QSAR 14 (1995) 501–506.
[12] M. Snarney, N.K. Terrett, P. Willett, D.J. Wilton, Comparison of algorithms for dissimilarity-based compound selection, J. Mol. Graph. Mod. 15 (1997) 372–384.
[13] B.D. Hudson, R.M. Hyde, E. Rahr, J. Wood, J. Osman, Parameter based methods for compound selection from chemical databases, Quant. Struct.-Act. Relat. 15 (1996) 285–289.
[14] D.L. Massart, L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983.
[15] W. Vogt, D. Nagel, H. Sator, Cluster Analysis in Clinical Chemistry: A Model, Wiley, New York, 1987.
[16] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data. An Introduction to Cluster Analysis, Wiley, New York, 1990.
[17] M. Ankrest, M. Breunig, H. Kriegel, J. Sander, OPTICS: Ordering Points to Identify the Clustering Structure, available from www.dbs.informatik.uni-muenchen.de/cgi-bin/papers?query=–CO.
[18] M. Daszykowski, B. Walczak, D.L. Massart, Looking for natural patterns in analytical data. 2. Tracing local densities with OPTICS, J. Chem. Inf. Comput. Sci. 42 (2002) 500–507.
[19] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, New York, 2001.
[20] M. Ester, H. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, p. 226, available from www.dbs.informatik.uni-muenchen.de/cgi-bin/papers?query=–CO.
[21] M. Daszykowski, B. Walczak, D.L. Massart, Looking for natural patterns in data. Part 1. Density-based approach, Chemom. Intell. Lab. Syst. 56 (2001) 83–92.
[22] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L. Le Cam, J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, CA, 1967, pp. 281–297.
[23] E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classification, Biometric Society Meeting, Riverside, CA, 1965; Biometrics 21 (1965) 768 (Abstract).
[24] http://www.eigenvector.com/Data/Corn/corn.mat.
[25] J. Luypaert, S. Heuerding, S. de Jong, D.L. Massart, An evaluation of direct orthogonal signal correction and other preprocessing methods for the classification of clinical study lots of a dermatological cream, J. Pharm. Biomed. Anal., submitted for publication.
[26] R.J. Barnes, M.S. Dhanoa, S.J. Lister, Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra, Appl. Spectrosc. 43 (1989) 772–777.
[27] A. Candolfi, R. De Maesschalck, D.L. Massart, P.A. Hailey, Identification of pharmaceutical excipients using NIR spectroscopy and SIMCA, J. Pharm. Biomed. Anal. 19 (1999) 923–935.
[28] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics. Part B, Elsevier, Amsterdam, 1998.