IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 14, NO. 10, OCTOBER 1992
Comments on “Parallel Algorithms for Hierarchical Clustering and Cluster Validity”
Fionn Murtagh

Manuscript received February 8, 1991; revised July 6, 1991. Recommended for acceptance by Editor-in-Chief A. K. Jain. The author is with the ESA Space Telescope - European Coordinating Facility, Garching bei München, Germany. IEEE Log Number 9201895.
Abstract-The purpose of this correspondence is to indicate that state-of-the-art hierarchical clustering algorithms have O(n²) time complexity, and should be referred to in preference to the O(n³) algorithms which were described in many texts in the 1970's. We also point out some further references on the parallelizing of hierarchic clustering algorithms.
Index Terms-Computational complexity, hierarchical clustering.
Li [6] describes parallel implementations of hierarchical clustering algorithms that achieve O(n²) computational time complexity and thereby improve on the baseline of sequential implementations. The latter are stated to be O(n³), with the exception of the single link method. It is inappropriate to use as one's baseline implementations that could only be described as 1970's vintage. Surely it should have been noted that O(n²) time implementations exist for most of the widely known hierarchical clustering methods. Average time implementations that come close to O(n) are also known. Rohlf [13] discusses an O(n log log n) expected time algorithm for the minimal spanning tree, which can subsequently be converted to a single link hierarchic clustering in O(n²) time [12]. Bentley et al. [1] discuss an O(n) expected time algorithm for the minimal spanning tree. In [8], an O(n) expected time algorithm is discussed for hierarchic clustering using the median method.

One could practically say that Sibson [14] and Defays [4] are part of the prehistory of clustering. At any rate, their O(n²) implementations of the single link method and of a (nonunique) complete link method, respectively, have been very widely cited. Drawing on initial work (e.g., [5], [11]) in the quarterly journal Les Cahiers de l'Analyse des Données (J. P. Benzécri, Ed.), Murtagh [7], [9], [10] described implementations that require O(n²) time and either O(n²) or O(n) storage for most of the widely used hierarchical clustering methods. These storage requirements refer, respectively, to whether the dissimilarities or only the initial data need to be stored. These implementations are based on the quite powerful ideas of constructing nearest neighbor chains and carrying out agglomerations whenever reciprocal nearest neighbors are encountered; a sketch of this scheme is given below.

The theoretical possibility of a hierarchical clustering criterion allowing such an agglomeration of reciprocal nearest neighbors to take place, without untoward nonlocal effects, is provided by the so-called reducibility property. This property of clustering criteria was first enunciated in [2] and is discussed in [10] and elsewhere. It asserts that a newly agglomerated pair of objects cannot be closer to any third-party object than the constituent objects had been. Whether or not this is always verified depends on the clustering criterion used.

In [3], [7], [9], and [10], one finds discussions of O(n²) time and O(n) space implementations of Ward's minimum variance (or error sum of squares) method and of the centroid and median methods. The latter two methods are termed the UPGMC and WPGMC methods by Sneath and Sokal [15].
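To make the nearest neighbor chain scheme concrete, here is a minimal sketch in Python, with illustrative names of my own; it is not code from any of the cited papers. It is specialized to Ward's criterion: only cluster centroids and sizes are stored, giving the O(n) storage case, and each chain extension is an O(n) scan, giving O(n²) time overall. Formally, a criterion d is reducible when d(i, j) ≤ min(d(i, k), d(j, k)) implies min(d(i, k), d(j, k)) ≤ d(i ∪ j, k); this is what makes it safe to agglomerate the reciprocal pair at the end of the chain.

```python
import numpy as np

def nn_chain_ward(X):
    """Agglomerate the rows of X under Ward's minimum variance criterion
    using a nearest neighbor chain. Returns a list of merges as
    (cluster_a, cluster_b, criterion_value). For simplicity, exact ties
    in the criterion are assumed not to occur."""
    n = len(X)
    size = {i: 1 for i in range(n)}                   # active clusters
    cent = {i: X[i].astype(float) for i in range(n)}  # their centroids

    def ward(a, b):
        # Increase in the error sum of squares if a and b were merged.
        d2 = float(np.sum((cent[a] - cent[b]) ** 2))
        return size[a] * size[b] / (size[a] + size[b]) * d2

    merges, chain, new = [], [], n
    while len(size) > 1:
        if not chain:
            chain.append(next(iter(size)))            # start a chain anywhere
        c = chain[-1]
        nn = min((k for k in size if k != c), key=lambda k: ward(c, k))
        if len(chain) >= 2 and nn == chain[-2]:
            # Reciprocal nearest neighbors: agglomerate. The reducibility
            # property guarantees the rest of the chain remains valid.
            chain.pop(); chain.pop()
            merges.append((c, nn, ward(c, nn)))
            s = size[c] + size[nn]
            cent[new] = (size[c] * cent[c] + size[nn] * cent[nn]) / s
            size[new] = s
            del size[c], size[nn], cent[c], cent[nn]
            new += 1
        else:
            chain.append(nn)                          # grow the chain
    return merges
```

Because Ward's criterion satisfies the reducibility property, sorting the returned merges by criterion value reproduces the hierarchy that the classical stepwise algorithm would construct.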
Now, a problem with the cluster criteria used by these latter two methods is that the reducibility property is not satisfied by them. This means that the hierarchy constructed may not be unique, as a result of inversions or reversals (nonmonotonic variation) in the clustering criterion value determined in the sequence of agglomerations; a small numerical illustration is given below.

Murtagh [9] describes O(n²) time and O(n²) space implementations for the single link method, the complete link method, and for the weighted and unweighted group average methods (WPGMA and UPGMA). This approach is quite general vis-à-vis the dissimilarity used and can also be applied to hierarchical clustering methods other than those mentioned. Day and Edelsbrunner [3] prove the exact O(n²) time complexity of the centroid and median methods using an argument related to the combinatorial problem of optimally packing hyperspheres into an m-dimensional volume. They also address the question of metrics; their results are valid for a wide class of distances, including those associated with the Minkowski metrics.

The construction and maintenance of the nearest neighbor chain, as well as the carrying out of agglomerations whenever reciprocal nearest neighbors meet, both offer possibilities for parallelization. Implementations on a SIMD machine were described by Willett [16]; further work in the area of parallel implementations of clustering algorithms is referenced in that paper. Good work [6] becomes more convincing if state-of-the-art, rather than superseded, results are addressed at all times.
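The inversions mentioned above are easy to exhibit. In the following small example, which is my own construction rather than one from the correspondence, three points form a nearly equilateral triangle, and SciPy's centroid linkage is used purely for convenience.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# A = (0, 0) and B = (2, 0) are the closest pair; C lies slightly farther
# from each of them (at distance sqrt(4.24), roughly 2.06).
X = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [1.0, 1.8]])

Z = linkage(X, method="centroid")
print(Z[:, 2])  # criterion value of each merge: [2.0, 1.8]

# A and B are agglomerated first, at centroid distance 2.0. But their
# centroid is (1, 0), which lies only 1.8 from C, so the second merge
# occurs at a *lower* criterion value than the first: an inversion
# (reversal), arising exactly because the centroid criterion violates
# the reducibility property.
```

The single link, complete link, group average, and Ward criteria all satisfy the reducibility property and therefore cannot produce such reversals.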
REFERENCES
[1] J. L. Bentley, B. W. Weide, and A. C. Yao, “Optimal expected time algorithms for closest point problems,” ACM Trans. Math. Software, vol. 6, pp. 563-580, 1980.
[2] M. Bruynooghe, “Méthodes nouvelles en classification automatique des données taxinomiques nombreuses,” Stat. Anal. Données, no. 3, pp. 24-42, 1977.
[3] W. H. E. Day and H. Edelsbrunner, “Efficient algorithms for agglomerative hierarchical clustering methods,” J. Classification, vol. 1, no. 1, pp. 7-24, 1984.
[4] D. Defays, “An efficient algorithm for a complete link method,” Comput. J., vol. 20, pp. 364-366, 1977.
[5] J. Juan, “Programme de classification hiérarchique par l’algorithme de la recherche en chaîne des voisins réciproques,” Les Cahiers de l’Analyse des Données, vol. VII, pp. 219-225, 1982.
[6] X. Li, “Parallel algorithms for hierarchical clustering and cluster validity,” IEEE Trans. Patt. Anal. Machine Intell., vol. PAMI-12, no. 11, pp. 1088-1092, 1990.
[7] F. Murtagh, “A survey of recent advances in hierarchical clustering algorithms,” Comput. J., vol. 26, pp. 354-359, 1983.
[8] F. Murtagh, “Expected-time complexity results for hierarchic clustering algorithms which use cluster centres,” Inform. Proc. Lett., vol. 16, pp. 237-241, 1983.
[9] F. Murtagh, “Complexities of hierarchic clustering algorithms: State of the art,” Comput. Stat. Quart., vol. 1, no. 2, pp. 101-113, 1984.
[10] F. Murtagh, Multidimensional Clustering Algorithms. Vienna: Physica-Verlag, 1985, COMPSTAT Lectures, vol. 4.
[11] C. de Rham, “La classification hiérarchique ascendante selon la méthode des voisins réciproques,” Les Cahiers de l’Analyse des Données, vol. V, pp. 135-144, 1980.
[12] F. J. Rohlf, “Algorithm 76: Hierarchical clustering using the minimum spanning tree,” Comput. J., vol. 16, pp. 93-95, 1973.
[13] F. J. Rohlf, “A probabilistic minimum spanning tree algorithm,” Inform. Proc. Lett., vol. 7, pp. 44-48, 1978.
[14] R. Sibson, “SLINK: An optimally efficient algorithm for the single-link cluster method,” Comput. J., vol. 16, pp. 30-34, 1973.
[15] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy. San Francisco: W. H. Freeman, 1973.
[16] P. Willett, “Efficiency of hierarchic agglomerative clustering using the ICL distributed array processor,” J. Documentation, vol. 45, pp. 1-45, 1989.