IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 2, NO. 2, APRIL 1991


Clustering on a Hypercube Multicomputer

Sanjay Ranka, Member, IEEE, and Sartaj Sahni, Fellow, IEEE

Abstract- In this paper, squared error clustering algorithms for SIMD hypercubes are presented. These algorithms are asymptotically faster than previously known algorithms and require less memory per PE. For a clustering problem with N patterns, M features per pattern, and K clusters, our algorithms complete in O(K + log NM) steps on NM processor hypercubes. This is optimal up to a constant factor. We extend these results to the case when NMK processors are available. Experimental results from an MIMD medium grain hypercube are also presented.

Index Terms- Clustering, feature vector, hypercube multicomputer, pattern recognition, MIMD, SIMD.

I. INTRODUCTION

Feature vector is a basic notion of pattern recognition. A feature vector v is a set of measurements (v_1, v_2, ..., v_M) which maps the important properties of an image into a Euclidean space of dimension M [1]. Clustering partitions a set of feature vectors into groups. It is a valuable tool in exploratory pattern analysis and helps in making hypotheses about the structure of the data. It is important in syntactic pattern recognition, image segmentation, and registration. There are many methods for clustering feature vectors [1], [3], [5], [6], [12], [13]. One popular technique is squared error clustering. Let N represent the number of patterns which are to be partitioned and let M represent the number of features per pattern. Let F[0..N-1, 0..M-1] be the feature matrix such that F[i, j] denotes the value of the jth feature in the ith pattern. Let S_0, S_1, ..., S_{K-1} be K clusters. Each pattern belongs to exactly one of the clusters. Let C[i] represent the cluster to which pattern i belongs. Thus, we can define S_k as

S_k = { i | C[i] = k }, 0 ≤ k ≤ K - 1.

Further, |S_k| is the cardinality, or size, of the partition S_k. The center of cluster k is a 1 × M vector defined as

center[k, j] = (1 / |S_k|) Σ_{i ∈ S_k} F[i, j], 0 ≤ j < M.

The squared distance d² between pattern i and cluster k is

d²[i, k] = Σ_{j=0}^{M-1} (F[i, j] - center[k, j])².

The squared error for the kth cluster is defined as

E 2 [ k ]=

d2[2,k]

05k

II
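To make these definitions concrete, here is a small sequential sketch in Python/NumPy. The paper's algorithms are for SIMD hypercubes; this code only illustrates the quantities center, d², and E², and all function names are chosen here for illustration:

```python
import numpy as np

def cluster_centers(F, C, K):
    """center[k, j] = mean of feature j over the patterns in S_k."""
    N, M = F.shape
    centers = np.zeros((K, M))
    for k in range(K):
        members = F[C == k]              # patterns i with C[i] == k, i.e., S_k
        if len(members) > 0:
            centers[k] = members.mean(axis=0)
    return centers

def squared_error(F, C, K):
    """E2[k] = sum of d2[i, k] over all patterns i in S_k."""
    centers = cluster_centers(F, C, K)
    E2 = np.zeros(K)
    for k in range(K):
        diffs = F[C == k] - centers[k]   # F[i, j] - center[k, j]
        E2[k] = (diffs ** 2).sum()       # sum over i in S_k and over j
    return E2

# Example: N = 6 patterns, M = 2 features, K = 2 clusters.
F = np.array([[0., 0.], [1., 0.], [0., 1.], [9., 9.], [10., 9.], [9., 10.]])
C = np.array([0, 0, 0, 1, 1, 1])
print(squared_error(F, C, 2))            # per-cluster squared error E2[k]
```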

[Fig. 10. O(1) memory cluster assignment.]
[Fig. 11. O(1) memory cluster assignment (continued).]
[Fig. 12. Complexity analysis of Figs. 10 and 11.]

FeatureSum(i, j) = Σ_{q ∈ S_i} F[q, j], 0 ≤ i < K, 0 ≤ j < M, and Number(i, j) = |S_i|. The algorithm to update the cluster centers is given in Fig. 13. Steps 1 and 2 are performed in K × M windows. The (i, j) PE in each such window computes the change in FeatureSum(i, j) and Number(i, j) contributed by the patterns in this window. These two steps can be restricted to PEs for which NewCluster(i, j) ≠ Cluster(i, j). In Steps 3 and 4 the topmost window accumulates the sum of these changes. Steps 5-8 update the clustering data. The complexity analysis is provided in Fig. 14. A total of O(log²K + log(N/K)) unit routes are used.

Overall Complexity: The total number of unit routes used by our algorithms for one pass of Fig. 1 is 4K + O(log²K) + O(log NMK), regardless of whether the amount of memory available per PE is O(K) or O(1). This improves on the algorithm of Li and Fang [8], which requires O(K log NM) unit routes and O(K) memory per PE.
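The incremental idea behind Fig. 13 can be illustrated sequentially: when a pattern moves to a new cluster, only the FeatureSum and Number entries of its old and new clusters change. Below is a minimal Python sketch of that delta update, assuming the aggregate definitions above; it does not model the window-based hypercube routing, and its helper names are hypothetical:

```python
import numpy as np

def update_centers(F, cluster, new_cluster, feature_sum, number):
    """Apply delta updates: only patterns with new_cluster[q] != cluster[q]
    contribute changes to FeatureSum and Number (cf. Steps 1-2 of Fig. 13)."""
    for q in np.where(new_cluster != cluster)[0]:
        old, new = cluster[q], new_cluster[q]
        feature_sum[old] -= F[q]     # remove pattern q from its old cluster's sum
        feature_sum[new] += F[q]     # add it to the new cluster's sum
        number[old] -= 1
        number[new] += 1
    cluster[:] = new_cluster         # commit the reassignment (cf. Steps 5-8)
    # center[k, j] = FeatureSum(k, j) / Number(k); guard against empty clusters
    return feature_sum / np.maximum(number, 1)[:, None]
```

Sequentially, each moved pattern costs O(M) work; the point of the paper's algorithm is to perform the corresponding accumulation across NM PEs in O(log²K + log(N/K)) unit routes.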
