(Automatic) Cluster Count Extraction from Unlabeled Data Sets
Isaac J. Sledge
ECE Dept., U. of Missouri, Columbia, MO 65211, USA
[email protected]

Jacalyn M. Huband
Computer Science Dept., U. of West Florida, Pensacola, FL 32514, USA
[email protected]

James C. Bezdek
CSSE Department, U. of Melbourne, Parkville, Victoria 3010, Australia
[email protected]

Abstract
All clustering algorithms ultimately rely on one or more human inputs, and the most important input is the number of clusters (c) to seek. There are "adaptive" methods which claim to relieve the user of this most important choice, but these methods ultimately make the choice by thresholding some value in the code. Thus, the choice of c is transferred to the equivalent choice of the hidden threshold that determines c "automatically". This work investigates a new technique for estimating the number of clusters to look for in unlabeled data using the VAT (Visual Assessment of Cluster Tendency) algorithm, coupled with a judicious combination of several common image processing techniques. Our method also depends on several parametric choices, but these parameters are fairly gentle, and have held up well in the examples we have run so far. Several numerical examples are presented to illustrate the effectiveness of the proposed method.

1. Introduction
Clustering in unlabeled data comprises three operations: (i) assessment of cluster tendency; (ii) partitioning the data into c clusters; and (iii) validation of the c clusters found in (ii). This paper concerns a method for (automatically) extracting the number of clusters to look for in unlabeled data, so it falls into the first problem category. The method begins with the creation of a visual image that portrays potential cluster structure using a previously developed algorithm (VAT), then processes this image to enhance its visual information for automated inspection, and finally extracts computationally from the enhanced image
the number of clusters that are visually apparent in the VAT image. Since this algorithm extracts the (possible) number of clusters, we call it a cluster count extraction (CCE) algorithm.

Let O = {o_1,…,o_n} denote n objects (fish, people, stocks, beers, etc.). When each object in O is represented by a (column) vector x in R^p, the set X = {x_1,…,x_n} ⊂ R^p is an object data representation of O. The kth component of the ith feature vector (x_ki) is the value of the kth feature (e.g., height, weight, length, etc.) of the ith object. Alternatively, when each pair of objects in O is represented by a relationship between them, we have relational data. The most common case of relational data is (a matrix of) dissimilarity data, say D = [D_ij], where D_ij is the pairwise dissimilarity (usually a distance) d(o_i, o_j) between objects o_i and o_j, for 1 ≤ i, j ≤ n. More generally, D can be a matrix of similarities based on a variety of measures [1-3]; or D can be a relation derived by human intervention in a process involving pairs of objects.

Let integer c ∈ {2,…,n−1}, and let X = {x_1,…,x_n} denote an object data input set. The fuzzy and crisp (that is, non-soft) c-partitions of X are sets of (cn) values {u_ik} that can be arrayed as c×n matrices U = [u_ik]. The sets of all non-degenerate (no zero rows) fuzzy and hard c-partition matrices for O are:

M_fcn = { U ∈ R^(c×n) | u_ik ∈ [0,1] ∀ i,k; Σ_{i=1}^{c} u_ik = 1 ∀ k; Σ_{k=1}^{n} u_ik > 0 ∀ i }   (1)

M_hcn = { U ∈ M_fcn | u_ik ∈ {0,1} ∀ i,k }   (2)
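The constraints in (1) and (2) are easy to verify mechanically. As an illustration (our own Python sketch, not part of the paper), the following checks whether a candidate c×n matrix lies in M_fcn or M_hcn:

```python
import numpy as np

def is_fuzzy_partition(U, tol=1e-9):
    """Check U in M_fcn: entries in [0,1], columns sum to 1, no zero rows."""
    U = np.asarray(U, dtype=float)
    in_range = np.all((U >= -tol) & (U <= 1 + tol))
    cols_sum_to_one = np.allclose(U.sum(axis=0), 1.0, atol=1e-6)
    no_zero_rows = np.all(U.sum(axis=1) > tol)
    return bool(in_range and cols_sum_to_one and no_zero_rows)

def is_hard_partition(U, tol=1e-9):
    """Check U in M_hcn: a member of M_fcn whose entries are all 0 or 1."""
    U = np.asarray(U, dtype=float)
    crisp = np.all(np.isclose(U, 0.0) | np.isclose(U, 1.0))
    return bool(is_fuzzy_partition(U, tol) and crisp)
```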
Every hard partition is fuzzy, but not conversely. The reason these matrices are called partitions follows from the interpretation of uik as the membership of xk in the ith partitioning subset
(cluster) of X. Ruspini [4] introduced M_fcn in 1969. Bezdek [5] contains more details about the algebraic and geometric nature of these partition sets. The important point is that all clustering algorithms generate solutions to the clustering problem for X, i.e., the identification of an "optimal" partition U in M_fcn that groups together object data vectors (and hence the objects they represent) which share some well-defined (mathematical) similarity. With this notation in place, we see that assessment attempts to find the best places to look – i.e., assessment of the data may suggest one, or at least a few, distinguished values of c to input to (any) clustering algorithm. It is our hope, of course, that each mathematically optimal grouping is in some sense an accurate portrayal of natural groupings in the physical process from which the object data are derived. We have developed several algorithms for performing visual assessment of cluster tendency: VAT [7, Visual Assessment of Tendency], bigVAT [8, sampled VAT for very large (VL) data], sVAT [9, scalable VAT for VL data] and coVAT [10, VAT for rectangular co-clustering]. We define "very large" (VL) data as any data set that cannot be loaded into a single processor. This definition is flexible, since it is processor dependent; but no matter what processor is used, there are nowadays VL data sets of interest that it cannot accommodate by itself. These algorithms have been tested with computer-generated and limited real-world data. One of their advantages is that they are domain independent: as long as the data are represented numerically, VAT does not care whether the objects represented are human gene sequences, measurements of iris petals, lengths of fish species, or the voting records of Congressmen. The remainder of this article is organized as follows. In Section 2 we review VAT and work related to the VAT algorithms.
Section 3 describes our approach to extracting cluster counts automatically from a VAT image. Section 4 has several numerical examples. Section 5 contains our summary, and some ideas for future research. Our MATLAB implementation of the CCE algorithm is appended for users interested in trying out this method.
2. VAT and Related Work Various formal and informal (statistically based) techniques for tendency assessment are discussed in Jain and Dubes [11] and Everitt [12], but statistical approaches to assessment are not widespread, principally because the distributional assumptions needed to implement these techniques are often unrealistic. The technique proposed here is visual.
Visual approaches for various data analysis problems have been widely studied in the last 30 years. Tukey [13] and Cleveland [14] are standard sources for many visual techniques. The general spirit of visual assessment is captured by the “graphical method of shading” described by Johnson and Wichern [15]. They give this informal description: (i) arrange the pairwise distances between points in the data into several classes of 15 or fewer, based on their magnitudes; (ii) replace all distances in each class by a common symbol with a certain shade of gray; (iii) reorganize the distance matrix so that items with common symbols appear in contiguous locations along the main diagonal (darker symbols correspond to smaller distances); (iv) identify groups of similar items by the corresponding patches of dark shadings. A more formal approach to this problem is the work of Tran-Luu [16], who proposes a re-ordering of the data into an "acceptable" block form based on optimizing several mathematical criteria of image "blockiness". The reordered matrix is imaged, and the number of clusters are deduced visually by a human observer. Tran-Luu's work is an interesting precursor to the VAT method. Tran-Luu attempts to find images of similarity data by seeking "blocky" images through formal optimization procedures. VAT, on the other hand, depends on an intuitive approach to reordering the objects prior to image display. VAT presents pair wise dissimilarity information about the set of objects O = {o1,…, on}, which are represented by either object or relational data, as an n n digital image after the objects are suitably reordered so that the image is better able to highlight potential cluster structure. The algorithm is used before clustering to provide a visual representation of the possible number of clusters that may exist in the data. It is possible to argue that VAT actually performs (quasisingle linkage) clustering during the reordering of the dissimilarity matrix [7]. 
Like single linkage, VAT uses a modification of Prim's minimal spanning tree (MST) algorithm to reorder the matrix. However, VAT does not use the MST to define partitions or assign membership values to the objects in the data; its only purpose is to provide a method for visualizing potential cluster substructure. In other words, we use VAT images to estimate c, the number of clusters to seek. There is also a lot of work in cluster visualization that deals with displaying clusters after a clustering algorithm has been applied (that is, after securing a c-partition U of the data). The earliest published reference we can find that discusses visual displays of clusters (as images) is the 1973 SHADE approach of Ling [17]. SHADE was used after application of a hierarchical clustering scheme and served as an
alternative to visual displays of hierarchically nested clusters via the standard dendrogram. Due to limited technology at that time, SHADE used only 15 halftone levels (created by over-striking standard printed characters) to approximate a digital representation of the lower triangular part of a dissimilarity matrix. Since SHADE was introduced, technology has greatly improved, and displays of similarity or dissimilarity matrices that represent clusters are now fairly common practice. Representative studies include those by Baumgartner et al. [18, 19], Strehl and Ghosh [20, 21], Dhillon et al. [22] and Hathaway and Bezdek [23]. The work by Strehl and Ghosh appears quite similar to the methods discussed here, except that fuzzy partitions are discovered, and then imaged after hardening and reordering. Our work, however, is aimed at those aspects of the clustering problem that are important prior to clustering, not after it.

The VAT algorithm displays a grayscale image I(R̃), each element of which is a scaled dissimilarity value between objects o_i and o_j. Each element on the diagonal is zero. Off the diagonal, the values range from 0 to 1. If an object is a member of a cluster, then it should be part of a submatrix of "small" values superimposed on the diagonal of the image. These submatrices appear as dark blocks along the diagonal of the VAT image.

We give a concise statement of VAT. The functions arg max/min in Steps 1/2 are set valued; when sets have multiple optimal arguments, any optimal pair can be selected. Throughout, we denote the result of applying the VAT algorithm to the square dissimilarity matrix R as R̃ = VAT(R); the displayed output is the image I(R̃).

Algorithm VAT [7]
Input: An n×n matrix of dissimilarities R = [r_ij], with r_ij = r_ji, r_ij ≥ 0, and r_ii = 0 for all 1 ≤ i, j ≤ n.
Step 1: Set I = ∅; J = {1, 2, …, n}; P = (0, 0, …, 0).
  Select (i, j) ∈ arg max_{p∈J, q∈J} {r_pq}; set P(1) = j;
  replace I ← I ∪ {j} and J ← J − {j}.
Step 2: For t = 2, …, n:
  select (i, j) ∈ arg min_{p∈I, q∈J} {r_pq}; set P(t) = j;
  replace I ← I ∪ {j} and J ← J − {j}. Next t.
Step 3: Form R̃ = [r̃_ij] = [r_P(i)P(j)] for 1 ≤ i, j ≤ n.
Output: Image I(R̃), scaled so that max corresponds to white and min to black.

Figure (1a) is a scatterplot of n = 1000 data points in R^2. Table 1 contains statistics of the distribution of the points, which are centered at cluster prototype v_i with cardinality n_i, i = 1, 2, 3, 4, 5.

Table 1. Data set X shown in Figure (1a)
  v_i            n_i
  v1 = (0,0)     200
  v2 = (8,8)     213
  v3 = (16,0)    208
  v4 = (0,16)    210
  v5 = (16,16)   169

These object data were converted to a 1000×1000 dissimilarity matrix R by computing r_ij = ||x_i − x_j|| with the Euclidean norm. The c = 5 visually apparent clusters in Figure (1a) are suggested by the 5 distinct dark diagonal blocks in Figure (1c), which is the VAT image I(R̃) of the data after VAT reordering. Compare this to view (1b), which is the image I(R) of the dissimilarities in input order. It is clear that reordering is necessary to reveal the structure of the underlying data: nothing can be inferred about structure from the image of the original matrix R.
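The three VAT steps are straightforward to render in code. The following is our own compact Python sketch (the paper's reference implementation is in MATLAB), using a brute-force search in Step 2 rather than an optimized Prim update:

```python
import numpy as np

def vat(R):
    """Reorder a square dissimilarity matrix with the VAT algorithm.
    Returns the reordered matrix R_tilde and the permutation P."""
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    # Step 1: seed with one endpoint of the largest dissimilarity
    i, j = np.unravel_index(np.argmax(R), R.shape)
    P = [j]
    I, J = {j}, set(range(n)) - {j}
    # Step 2: repeatedly append the object in J closest to the set I
    for _ in range(1, n):
        best, best_q = np.inf, None
        for p in I:
            for q in J:
                if R[p, q] < best:
                    best, best_q = R[p, q], q
        P.append(best_q)
        I.add(best_q)
        J.discard(best_q)
    P = np.array(P)
    # Step 3: permute rows and columns by P
    return R[np.ix_(P, P)], P
```

Displaying the returned matrix as a grayscale image (white = max, black = min) gives I(R̃).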
(1a) Scatterplot of input data
(1b) Unordered image I(R)
(1c) VAT-ordered image I(R̃)
Figure 1. An example of VAT that indicates c = 5 clusters in the data set X

We humans can simply count the number of dark blocks along the diagonal of a VAT image – when they possess visual clarity – to get an estimate for c. The VAT image in view (1c) comes from a nice "clean" set of compact, well-separated clusters, but as the data become more and more mixed, the VAT image will degrade considerably. However, humans can still deduce the suggested number of clusters from a VAT image in all but the most incorrigible data sets. But different humans may see different values, especially when the clusters have significant overlap or strange geometries, both of which lead to VAT images with non-distinct boundaries. That is one of the problems we hope to overcome.

We want to automate this inference procedure, so that we can extract computationally the value of c that a VAT image would suggest visually if viewed by a human. Why? There are several good reasons. First, different humans will disagree more and more as the images degrade, and we would like an objective, repeatable way to estimate c. Second, if the data are large, we may not be able to view the image even after making it because of screen resolution limitations, but we may be able to extract c from the information such an image possesses. Moreover, the sVAT procedure enables us to construct an approximation to the VAT image for VL data, and it may be the case that we need or want to extract c from this approximate image for VL data.

3. The CCE Algorithm

Can we get a computer to "look at" the image in Figure (1c), and tell us that c = 5? First, we briefly discuss three unsuccessful methods that did not work, but which led us to the CCE algorithm described in this section.

An immediate idea is to consider an edge-detection algorithm [17] to find the vertical (or horizontal) edges of the dark blocks. We can find good edge images from good VAT images – but now we have to count the number of (say) vertical edges visually, so we have simply transformed the problem of counting rectangles into one of counting edges. Humans can do either of these with equal proficiency; but we still don't have a computational estimate of c.

A more efficient technique might be based on examining selected rows in I(R̃). Each row crosses through one of the dark blocks of the VAT image. Consider the ith row of the image matrix. The intensities {I_ij} along the ith row show the dissimilarities of object o_i to the objects o_1 through o_n. If the ith object is a member of a cluster, there will be a set of "small" intensity values {I_i,i−k, I_i,i−k+1, …, I_i,i+m} that comprise the dark block. These values correspond to objects o_i−k through o_i+m. Figure 2 shows one row from the uppermost submatrix in Figure (1c).

Figure 2. One row from the image in (1c)

By identifying a set of small-intensity values adjacent to a dark set, we could deduce that this row suggests one cluster. Due to the symmetry of the image, we can eliminate rows i−k through i+m and look for more clusters. We can count the number of clusters by repeating this process until all rows have been eliminated. Drawbacks:
i) How do we threshold the row values to find a cluster boundary?
ii) How many objects are needed to form a cluster?
iii) Can we detect a cluster of one object?

A third possibility is to look at values in I(R̃) that are along or slightly off the main diagonal. Figure 3 illustrates this idea graphically. Line aa lies along the main diagonal. Line bb is displaced by 1 unit each way from line aa, and hence crosses two edges of each of the dark blocks. By analyzing the values of the n−1 (or n−2 or n−3, etc., depending on where the diagonal slice is taken) intensities, we might be able to determine the number of clusters.

Figure 3. Line aa is the main diagonal slice; line bb is the first off-diagonal slice

None of these three approaches worked. Our fourth attempt is based on the idea illustrated in Figure 4, which is the graph resulting from a somewhat more sophisticated procedure applied to the image of Figure 1(c). This graph was made by first thresholding I(R̃), and then
applying, successively, the 2D fast Fourier transform followed by window correlation in the frequency domain, back-transforming to the spatial domain, and finally, constructing the correlated first off-diagonal histogram.
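The diagonal slices aa and bb of Figure 3 correspond to ordinary matrix diagonals, so extracting them is a one-liner. A small sketch (ours, not the paper's code):

```python
import numpy as np

def diagonal_slice(img, offset=1):
    """Intensities along a slice parallel to the main diagonal.
    offset=0 is line aa (the main diagonal, n values);
    offset=1 is line bb (the first off-diagonal, n-1 values)."""
    return np.diag(np.asarray(img), k=offset)
```

The first off-diagonal slice of an n×n VAT image is the length n−1 signal that CCE's histogram step operates on.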
Figure 4. Illustration of (possible) extraction of c from the VAT image I(R̃) of Figure (1c)

The c = 5 dark blocks in I(R̃) become sets of spikes. This is, of course, yet another way to transform the visual information in Figure 1(c) into a different form, but in this form we can imagine counting c automatically, because of the clear separation between the spikes. For example, we might count the number of times that a horizontal line (y = constant) intersects the graph. This method is certainly threshold dependent, even within a single graph. Choosing y = a results in "cutting the spikes" 5 times, but choosing y = b results in only 3 cuts. Thus, there seems to be no consistent way to choose the threshold so that the count would be correct across the range of values in a single graph, much less across graphs derived from VAT images of other, less benevolent data.

We will develop a method for "counting the spikes" in graphs such as Figure 4, which relies on several standard image processing algorithms. The most relevant application of segmentation to automatic cluster detection is image thresholding, where region extraction and analysis play an important role [24]. We consider multiple aspects of the image in question, including its histogram and any 3D information that may be available, as fine discrepancies or important details may otherwise go unnoticed [25].

Figure 5 is the histogram of an image containing a pair (c = 2) of univariate Gaussian distributions. We assume the two regions were produced from an equal distribution of light colored objects on top of a dark background. A threshold point (T) can be selected so that the image, f(x,y), is subdivided into two sets of values: all data values in f(x,y) that are less than T will be zero, and all data values in f(x,y) that are greater than or equal to T will be one. By partitioning the image in this manner, all pixels with a value greater than or equal to T become object points, while all pixels with a value less than T become background points.
Figure 5. Histogram with threshold point T

Not all images will produce histograms with easily defined regions. Success depends on the cluster structure underlying the histogram. This is illustrated by the more complicated histogram shown in Figure 6.
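The single-threshold rule described above, and its multi-level extension used below, can be sketched in a few lines of Python (our own illustration; threshold values here are hypothetical):

```python
import numpy as np

def binary_threshold(f, T):
    """Single threshold: pixels >= T become object points (1),
    pixels < T become background points (0)."""
    return (np.asarray(f, dtype=float) >= T).astype(np.uint8)

def multilevel_threshold(f, thresholds):
    """Multi-level variant: values below thresholds[0] are background
    (label 0); each later interval gets the next integer label."""
    return np.digitize(np.asarray(f, dtype=float), sorted(thresholds))
```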
Figure 6. Histogram with thresholds T1 and T2

Objects in the image underlying Figure 6 cannot easily be classified by the same method used for Figure 5 if the image is to be considered as a single entity. Instead, multi-level thresholding can be used, so that all pixel values less than the first threshold point, T1, are referred to as background points, while the pixel values contained in the remaining two distributions become distinct object points. For images that contain multiple distributional components, such as the one underlying the histogram of Figure 6, region growing techniques are perhaps the best method for reliably segmenting the image. Region growing becomes increasingly useful as the data underlying the histogram of interest become increasingly complex.

Given an image f(x,y) and a subimage g(x,y), spatial correlation attempts to locate all of the regions in f(x,y) that best match g(x,y). However, given the complexity of filtering a large image in the spatial domain, correlation is best suited to the frequency domain, where the regions of similarity appear as peaks in the output and can be easily processed [26]. To begin the correlation process, both the VAT image and the corresponding detection filter are first transformed from the spatial domain to the frequency domain via the Fast Fourier Transform (FFT). Once the image is converted to the frequency domain, correlation is done by multiplying the transformed image, F(u,v), with the complex conjugate of the transformed filter, G(u,v):

C(u,v) = F(u,v) G*(u,v) .   (3)
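Equation (3) implements circular cross-correlation: back-transforming the product F·G* gives exactly the result of sliding the filter over the (periodically extended) image. A small numerical check of this identity (our sketch; the direct loop is only for verification):

```python
import numpy as np

def freq_correlate(f, g):
    """Circular cross-correlation via Eq. (3): C = F . conj(G), back-transformed."""
    C = np.fft.fft2(f) * np.conj(np.fft.fft2(g))
    return np.real(np.fft.ifft2(C))

def spatial_correlate(f, g):
    """Direct circular cross-correlation, for comparison with freq_correlate."""
    n, m = f.shape
    out = np.zeros((n, m))
    for x in range(n):
        for y in range(m):
            for u in range(n):
                for v in range(m):
                    out[x, y] += f[(x + u) % n, (y + v) % m] * g[u, v]
    return out
```

For large images the FFT route costs O(n^2 log n) instead of the O(n^4) of the direct loop, which is the speed argument made in the text.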
The motivation for operating in the frequency domain is that spatial correlation becomes pointwise multiplication there. For large images, especially those with more than a million pixels, operating in the frequency domain can cut down on the processing time required to filter an image, and allows software implementations of complex algorithms without the need for specialized DSP hardware. It is important to note, however, that some of the details in the image will be lost in the forward transformation, especially at edges and regions of harsh transitions, as an infinite number of frequencies would be needed to represent them exactly.

When clusters are represented as dark blocks of pixels along the diagonal, choosing a filter with similar characteristics should yield positive results for regions displaying similar characteristics, and negative results for all other regions. To find the largest number of similar characteristics of dark blocks along the diagonal, we apply image segmentation prior to correlation. The need for segmentation of the VAT image is demonstrated in Figure 7(a), where the areas with the highest amount of energy are concentrated along the diagonal, while areas with lower amounts of energy, corresponding to partially similar or dissimilar objects, appear outside the main diagonal. Applying correlation directly to the VAT image, without any segmentation preprocessing, will yield suboptimal results, as the variations in pixel intensity are too pronounced for the filter to find a 'good' match. Dynamically generating a subimage that matches the high-energy regions becomes a relatively simple statistical operation if the variations are equalized beforehand with binary thresholding, as shown in Figure 7(b).

7(a) Before segmentation
7(b) After segmentation
Figure 7. Variation in high energy regions

Once correlation between the segmented VAT image and the filter takes place, and the back-transform of the correlated image is computed, the off-diagonal values of the image are used to generate a histogram with some number of approximately Gaussian regions that denote the preliminary number of clusters detected. Taking the set of data at some horizontal location in the computed histogram (y = 0 for the inclusion of singletons), the cluster assessment of the VAT image can be automated, with the number of clusters returned by counting each continuous distribution. We are ready to summarize our automatic cluster count extraction algorithm.

Algorithm CCE (Cluster count extraction)
Input: n×n VAT image I(R̃), scaled so that max = white and min = black.
Step 1. Threshold I(R̃) with Otsu's algorithm.
Step 2. Generate the correlation filter of size ratio s.
Step 3. Apply the FFT to the segmented VAT image and to the filter.
Step 4. Multiply the transformed VAT image with the complex conjugate of the transformed filter.
Step 5. Compute the inverse FFT of the filtered image.
Step 6. Compute the histogram of off-diagonal pixel values of the back-transformed image.
Step 7. Cut the histogram at an arbitrary horizontal line y = b (usually b = 0), and count the number of spikes.
Output: Integer c, an estimate of the number of dark blocks along the diagonal of I(R̃).

The individual steps in our CCE algorithm are all well known, highly documented image processing procedures, so we have not given formulae or the theory of their underlying models for each of the steps. Our implementation was done entirely in MATLAB; the specifics are recorded in Appendix 1.
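Steps 1-7 can be sketched end-to-end in Python. This is our own simplification, not the authors' MATLAB code: we use a homemade exhaustive-search Otsu, a k×k box filter with k = n/s, and we count runs of the first off-diagonal above a fraction of the filter's peak response (the paper instead cuts its correlated histogram at y = b, usually b = 0):

```python
import numpy as np

def otsu_threshold(img):
    """Exhaustive-search Otsu threshold on a [0,1] grayscale image."""
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0.0, 1.0))
    total = img.size
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    mean_total = cum_mean[-1]
    best_t, best_var = 0.0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / w0
        m1 = (mean_total - cum_mean[t - 1]) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t / 256.0
    return best_t

def cce(vat_img, s=20, frac=0.75):
    """Simplified CCE pipeline on an n x n VAT image with values in [0,1]."""
    n = vat_img.shape[0]
    # Step 1: Otsu binarization; dark (small-dissimilarity) blocks become 1
    binary = (vat_img < otsu_threshold(vat_img)).astype(float)
    # Step 2: k x k box filter, k set by the size ratio s
    k = max(1, n // s)
    filt = np.zeros((n, n))
    filt[:k, :k] = 1.0
    # Steps 3-5: circular correlation via the FFT, as in Eq. (3)
    corr = np.real(np.fft.ifft2(np.fft.fft2(binary) * np.conj(np.fft.fft2(filt))))
    # Step 6: values along the first off-diagonal of the back-transform
    off_diag = np.array([corr[i, i + 1] for i in range(n - 1)])
    # Step 7 (simplified): cut at a fraction of the filter's peak response
    # and count maximal runs above the cut
    above = off_diag > frac * k * k
    return int(above[0]) + int(np.sum(above[1:] & ~above[:-1]))
```

On a synthetic block-diagonal "VAT image" with five dark blocks, this sketch recovers c = 5; the run-count cut is our stand-in for the paper's histogram cut and would need tuning on less benevolent data.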
4. Numerical Examples

We consider several artificially generated data sets, as well as a small "real world" data set. There are only two user-defined constants to choose: the filter ratio s in Step 2, and the cut threshold b used in Step 7. These parameters had the fixed values s = 20 and b = 0 in all of our examples. We will discuss possible improvements for these choices in Section 5.

Example 1. Our first example completes the discussion of the five-cluster data. We repeat the initial phases so that you don't have to page backwards to refresh your memory. The data are scatterplotted in Figures 1(a) and 8(a). The VAT image shown as Figure 1(c) is repeated here as Figure 8(b). The binary image output by Step 1 of CCE is shown as Figure 8(c); and the off-diagonal histogram of the filtered image (previously Figure 4) at the end of Step 6 of CCE is shown in view 8(d).

8(a) 2D data scatterplot
8(b) VAT image
8(c) Segmented VAT image
8(d) Off-diagonal histogram
Figure 8. CCE extraction on Figure 1 data
The clusters in this data set are very compact and nicely separated, and the VAT image makes that fact visually evident. This is also apparent in Fig. 8(c), as the dark polygons along the diagonal do not display ragged edges, unlike those for the congressional voting data of Example 2. The CCE algorithm extracts the visually correct value for the number of dark blocks in the VAT image, c = 5. We remark that it is tempting to call c the "number of clusters in the data", but CCE simply finds the number of dark blocks in the VAT image. It is our HOPE, of course, that this IS the number of clusters in the data, but we know that this will not always be the case [7-10].

Example 2. Our next example involves a data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLSummary.html. The data are generated from the Congressional Voting Records Database, and consist of the 1984 records of the 435 United States House of Representatives members on 16 key votes. We will refer to this as the HOR data. Votes were numerically encoded as 0.5 for "yea", -0.5 for "nay" and 0 for "unknown disposition", so that the voting record of each Congressman is represented as a trinary-valued object data vector in R^16. It is therefore impossible to form an opinion about cluster structure that might reside in this data set by visual examination of a scatterplot. Consequently, the VAT image in Fig. 9(a) serves as an excellent starting point for visualizing the distribution of these data, as the VAT algorithm displays object data of any dimension as a two-dimensional image. The 435×435 relational data matrix R needed as input to VAT is generated from the object data as pairwise squared Euclidean distances. The VAT image in Figure 9(a) suggests that the data distribution for the sixteen different bills approximately reflects a bi-partisan division into two fairly well-defined clusters over a variety of issues.
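The encoding and the relational conversion just described are simple to reproduce. A sketch (ours, with toy voting records rather than the real HOR data):

```python
import numpy as np

CODE = {"y": 0.5, "n": -0.5, "?": 0.0}  # yea, nay, unknown disposition

def encode_votes(records):
    """Map each 16-character voting record to a trinary-valued vector in R^16."""
    return np.array([[CODE[ch] for ch in rec] for rec in records])

def squared_euclidean_matrix(X):
    """Pairwise squared Euclidean distances: the relational input R to VAT."""
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
```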
9(a) VAT image of HOR Data
9(b) Segmented VAT image
9(c) CCE histogram
Figure 9. CCE extraction on HOR voting data

The two identified classes are Republican (54.8%) and Democrat (45.2%), and the image shows that members of these two groups share similar voting records. However, about 15% of the voters failed to vote consistently with their group. These voters cause some interaction (mixing) between the two main clusters, and so the dark blocks we expect to see are less well-defined than they were in Example 1. On closer inspection, the dark polygon in the upper portion of Figure 9(a) is better defined than the diagonal block in the lower portion. This is easy to see in Figure 9(b), and implies that the voters corresponding to the first region share more similar records (vote in a tighter block) than the voters comprising the second region. Extrapolating from the data, we surmise that there are a number of incumbents who voted outside of their party affiliation, a bias in the number of incumbents who are Democrats versus Republicans, or a mixture of both. But whatever the cause of the unequal, partially well-defined regions in the image, the HOR data provide an excellent test of the effectiveness of the CCE algorithm, the output of which can be seen in Figure 9(c). While the regions in the correlated histogram do not follow approximately Gaussian distributions, two distinct, continuous regions are still generated, denoting the existence of two densely populated regions of similar objects. Hence, CCE extracts the value c = 2 for the HOR data.

Example 3. Our last example examines the ability of CCE to extract the correct number of dark blocks from VAT images using five data sets, called X1-X5, which are samples drawn from a mixture of c = 3 bivariate normal distributions. Each data set contained n = 5000 vectors. The three clusters in these five data sets vary from well-separated (X1) to overlapped (X5).
Figure 10. CCE extraction on samples (X1-X5) from bivariate mixtures. Rows correspond to X1-X5; columns show (a) the scatterplot of Xk, (b) the segmented VAT image, and (c) the CCE histogram.
Figure 10 shows the results of all five experiments. Column (a) contains scatterplots of the five data sets, column (b) their segmented VAT images, and column (c) their CCE histograms. For each of the five data sets, the CCE algorithm was able to define three continuous, disjoint regions, no matter the relative similarities between the groups of objects. As the clusters vary from well-separated (X1) to strongly overlapped (X5), the relative size of the continuous distributions in the histograms becomes less pronounced. We have no real explanation for this phenomenon, except that the size of the distributions reflects the amount of noise between the dark diagonal regions in the VAT image. We can easily construct samples for which the three clusters are even more overlapped. However, this would serve more as a test of the limits of the VAT algorithm's ability to properly sort similar objects than as a test of the ability of CCE to determine the preliminary number of clusters defined by dark blocks in the VAT image.
5. Discussion and Conclusions

CCE has the potential to replace the need for human interpretation of the number of clusters in a data set by means of frequency domain correlation and feature recognition. To explore the effectiveness of the algorithm, a group of user-generated data sets, and a real-world data example, served as inputs to the preliminary cluster detection process. While CCE returned the desired number of clusters in our examples, it will fail when contiguous regions in the correlated histogram overlap. In this case the threshold (y = 0) should be modified, albeit with the loss of singleton detection. Given CCE's ability to successfully determine the number of clusters for roughly circular data distributions, further testing can be conducted to determine whether the proposed process can be applied to a wider breadth of cluster geometries, such as ring and line data sets. Beyond user-generated data sets, more robust tests with real-world data could be carried out in the areas of bioinformatics, machine learning and pattern recognition, all of which make use of clustering as a decision-making qualifier.

Finally, we point out that it is easy to find data sets for which the VAT image suggests the wrong number of clusters, and this problem will carry over to the present work. So the bad news is that if the VAT image is visually wrong, the CCE count will also be incorrect. But the good news is that this is a defect in VAT, not CCE, so it follows that CCE can probably be reliably used with other visual display methods (such as those in [16-23]) to automatically count the number of dark diagonal blocks in display matrices.
6. References
[1] Theodoridis, S. and Koutroumbas, K. (2003). Pattern Recognition, 2nd Ed., Elsevier Science, New York, NY.
[2] Borg, I. and Lingoes, J. (1987). Multidimensional Similarity Structure Analysis, Springer-Verlag, New York, NY.
[3] Kendall, M. and Gibbons, J. D. (1990). Rank Correlation Methods, Oxford Univ. Press, New York, NY.
[4] Ruspini, E. A. (1969). A new approach to clustering, Inf. and Control, 15, 22-32.
[5] Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.
[6] Bezdek, J. C., Keller, J. M., Krishnapuram, R. and Pal, N. R. (1999). Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, Norwell, MA.
[7] Bezdek, J. C. and Hathaway, R. J. (2002). VAT: a tool for visual assessment of (cluster) tendency, in: Proc. IJCNN 2002, IEEE Press, Piscataway, NJ, 2225-2230.
[8] Huband, J., Bezdek, J. C. and Hathaway, R. (2005). bigVAT: visual assessment of cluster tendency for large data sets, Pattern Recognition, 38(11), 1875-1886.
[9] Hathaway, R. J., Bezdek, J. C. and Huband, J. M. (2006). Scalable visual assessment of cluster tendency, Pattern Recognition, 39, 1315-1324.
[10] Bezdek, J. C., Hathaway, R. J. and Huband, J. M. (2006). Visual assessment of clustering tendency for rectangular dissimilarity matrices, IEEE Trans. Fuzzy Systems, 15(5), 890-903.
[11] Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.
[12] Everitt, B. S. (1978). Graphical Techniques for Multivariate Data, North Holland, New York, NY.
[13] Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA.
[14] Cleveland, W. S. (1993). Visualizing Data, Hobart Press, Summit, NJ.
[15] Johnson, R. A. and Wichern, D. A. (1992). Applied Multivariate Statistical Analysis, 3rd Ed., Prentice-Hall, Englewood Cliffs, NJ.
[16] Tran-Luu, T.-D. (1996). Mathematical concepts and novel heuristic methods for data clustering and visualization, PhD Thesis, U. of Maryland, College Park, MD.
[17] Ling, R. F. (1973). A computer generated aid for cluster analysis, Communications of the ACM, 16, 355-361.
[18] Baumgartner, R., Somorjai, R., Summers, R., Richter, W. and Ryner, L. (2000). Correlator beware: correlation has limited selectivity for fMRI data analysis, NeuroImage, 12, 240-243.
[19] Baumgartner, R., Somorjai, R., Summers, R. and Richter, W. (2001). Ranking fMRI time courses by minimum spanning trees: assessing coactivation in fMRI, NeuroImage, 13, 734-742.
[20] Strehl, A. and Ghosh, J. (2000). Value-based customer grouping from large retail data-sets, Proc. SPIE Conf. on Data Mining and Knowledge Discovery, 4057, SPIE Press, Bellingham, WA, 33-42.
[21] Strehl, A. and Ghosh, J. (2000). A scalable approach to balanced, high-dimensional clustering of market-baskets, Proc. HiPC 2000, LNCS 1970, Springer, NY, 525-536.
[22] Dhillon, I., Modha, D. and Spangler, W. (1998). Visualizing class structure of multidimensional data, in Proc. 30th Symp. on the Interface: Computing Science and Statistics, ed. S. Weisberg.
[23] Hathaway, R. J. and Bezdek, J. C. (2003). Visual cluster validity for prototype generator clustering models, Patt. Recog. Letters, 24, 1563-1569.
[24] Gonzalez, R. C. and Woods, R. E. (2002). Digital Image Processing, Prentice-Hall, Upper Saddle River, NJ.
[25] Justice, R. K. and Stokely, E. M. (1996). 3-D segmentation of MR brain images using seeded region growing, Proc. 18th Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society, 1083-1084.
[26] Vijaya Kumar, B. V. K., Savvides, M., Venkataramani, K. and Xie, C. (2002). Spatial frequency domain image processing for biometric recognition, Int. Conf. on Image Processing, I53-I56.

Appendix 1. MATLAB implementation of CCE

Step 1 – Threshold VAT image. First, we threshold the VAT image into foreground and background regions using Otsu's method.

Input: n×n VAT image I(R̃)
  binVAT = im2bw(vatImage, graythresh(vatImage))
Output: n×n binary VAT image

Step 2 – Generate correlation filter. The second operation dynamically generates the correlation filter. The user-defined parameter is the filter size ratio s ≥ 1. The n×n correlation filter has a neighborhood of positive values that ranges in size from 1×1 to n×n. The 1×1 case corresponds to VAT images
7. Appendix - Details of CCE This appendix contains platform specific details of our implementation. Calls are to MATLAB routines.
20
20
that are less than or equal to 20x20 in size (s=1, which will probably never happen in real data sets). The n n case corresponds to the value s = 20. 20
20
This arbitrary used-defined choice can be changed to fit the application at hand. The value 20 seemed to work well across a pretty wide variety of tests for the CCE algorithm. [However, please see more discussion of this issue in the concluding section of the text.]
˜ ) = , Inputs: n n VAT image I( R (integer) filter size ratio s 1 filter = zeros(size(vatImage,1), size(vatImage,2)); if((size(vatImage,1) < s)) filter(1,1) = 1; else for x=1:1:round(size(vatImage,1)/s) for y=1:1:round(size(vatImage,2)/s) filter(x,y) = 1; end end Output: n n correlation filter Step 3 – Convert binary VAT image and correlation filter to frequency domain. The segmented VAT image (step 1), and correlation filter (step 2) are converted to the frequency domain representations via the 2D FFT. Inputs: n n images , FreqVAT=fft2(binVAT); FreqFilter=fft2(filter, size(binVAT,1), size(binVAT, 2)); Outputs: n n frequency images , Step 4 – Perform correlation. We correlate the images from Step 3 by multiplying the (complex conjugate of the) frequency-domain filter with the frequency-domain segmented VAT image. This is
similar to convolution in the spatial domain, but is faster in the frequency domain for large images. Inputs: n n images , FreqFilter=conj(FreqFilter); FreqResult=FreqFilter.*FreqVAT Output: n n frequency image Step 5 – Back-transform the frequency-domain filtered image. The back-transform of the correlation result to the spatial domain is handled by the inverse 2D Fast Fourier Transform. The result is scaled to the range [0,255]. Input: n n frequency image result=real(ifft2(FreqResult)); Result=gscale(result); Output: n n scaled correlation Step 6 – Compute CCE histogram. The (n-1) pixel values (above or below) the main diagonal of the spatial result are collected. Any distance greater than 1 from the main diagonal results in failure, so this distance is a constant for the algorithm, not a userspecified input. These values, along with the number of values returned from the off-diagonal, are used to generate a histogram. Input: n n scaled correlation numElements = 1:1:size(Result,1)-1 offDiag = diag(Result, 1); bar(numElements, offDiag) Output: CCE Histogram, (n-1) vector The histogram should contain peaks corresponding to the dark diagonal blocks in the VAT image. Visual examination of the histogram is not needed to extract c, but provides psychological reassurance via visual confirmation. Step 6 can be omitted, but the offdiagonal values, , must be calculated. Step 7 – Extraction of c, the number of dark blocks in the VAT image. CCE calculates the number of continuous distributions found. In essence, the histogram is "cut" by a horizontal line y=b, as shown in Figure 4, and the number of distributions encountered accumulate in the counter . The value b is a user-defined parameter. The choice b=0 worked in all of our examples because the clusters in the data led to separated peaks in the CCE histogram. This may not always be the case. Please see our discussion of this issue in the concluding section of the text.
Inputs: (n-1) vector , n n scaled correlation , cut threshold b (= 0 in our examples) num_cluster = 0; previous_value = 0; for i=1:1:size(result,1)-1 if(i = 1): previous_value = offDiag(i); else: previous_value = offDiag(i-1); end if(offDiag(i) > b && previous_value == 0) num_cluster = num_cluster + 1; Output: integer = c = number of dark ˜ ). blocks in n n VAT image I( R
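For readers without MATLAB, the seven steps above can be sketched end-to-end in Python with NumPy. This is our sketch, not the original implementation: the function names (otsu_threshold, cce_count), the inline Otsu routine standing in for graythresh/im2bw, the min-max scaling standing in for gscale, and the convention that dark (low-valued) block pixels become foreground 1s are all assumptions; the filter construction, FFT correlation, off-diagonal extraction, and peak counting follow the appendix directly.

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Otsu's method (stand-in for MATLAB's graythresh): pick the gray-level
    cut that maximizes between-class variance of the histogram."""
    hist, edges = np.histogram(img.ravel(), bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist).astype(float)      # class-0 weight at each cut
    w1 = w0[-1] - w0                        # class-1 weight
    m0 = np.cumsum(hist * centers)          # class-0 cumulative mass
    with np.errstate(divide="ignore", invalid="ignore"):
        mean0 = m0 / w0
        mean1 = (m0[-1] - m0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
    return centers[np.argmax(np.nan_to_num(between))]

def cce_count(vat_image, s=20, b=0.0):
    """Steps 1-7 of CCE: count the dark diagonal blocks in an n x n VAT image."""
    n = vat_image.shape[0]

    # Step 1: threshold; dark blocks (low dissimilarity) become foreground 1s.
    bin_vat = (vat_image <= otsu_threshold(vat_image)).astype(float)

    # Step 2: filter with an (n/s) x (n/s) neighborhood of ones (1 x 1 if n < s).
    filt = np.zeros((n, n))
    k = 1 if n < s else max(1, round(n / s))
    filt[:k, :k] = 1.0

    # Steps 3-4: frequency-domain correlation (conjugate multiply of 2-D FFTs).
    freq_result = np.conj(np.fft.fft2(filt)) * np.fft.fft2(bin_vat)

    # Step 5: back-transform; the correlation of binary images is integer-valued,
    # so rounding suppresses FFT round-off.  Then scale to [0, 255].
    result = np.round(np.real(np.fft.ifft2(freq_result)))
    result = 255.0 * (result - result.min()) / max(result.max() - result.min(), 1e-12)

    # Step 6: the (n-1) values just above the main diagonal form the CCE histogram.
    off_diag = np.diag(result, k=1)

    # Step 7: cut the histogram at y = b and count the runs that rise above it.
    c, prev = 0, 0.0
    for v in off_diag:
        if v > b and prev <= b:
            c += 1
        prev = v
    return c
```

On a synthetic VAT-like image with well-separated dark diagonal blocks on a light background, cce_count recovers the block count; like CCE itself, the sketch inherits VAT's limitation that blocks which run together along the diagonal can merge into a single run above the cut.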