Finding the number of clusters in ordered dissimilarities

Soft Comput DOI 10.1007/s00500-009-0421-5

FOCUS

Isaac J. Sledge · Timothy C. Havens · Jacalyn M. Huband · James C. Bezdek · James M. Keller

© Springer-Verlag 2009

Abstract As humans, we have innate faculties that allow us to efficiently segment groups of objects. Computers, to some degree, can be programmed with similar categorical capabilities, which stem from exploratory data analysis. Among the various modes of data reasoning, clustering provides insight into the structure and relationships of input samples situated in a number of distributions. To determine these relationships, many clustering methods rely on one or more human inputs, the most important being the number of distributions, c, to seek. This work investigates a technique for estimating the number of clusters from a general type of data called relational data. Several numerical examples are presented to illustrate the effectiveness of the proposed method.

Keywords Data analysis · Pattern recognition · Clustering · Cluster tendency · Cluster count extraction

I. J. Sledge (corresponding author) · T. C. Havens · J. M. Keller
Electrical and Computer Engineering Department, University of Missouri, Columbia, MO, USA
e-mail: [email protected]
T. C. Havens e-mail: [email protected]
J. M. Keller e-mail: [email protected]

J. M. Huband · J. C. Bezdek
Computer Science Department, University of West Florida, Pensacola, FL, USA
e-mail: [email protected]
J. C. Bezdek e-mail: [email protected]

1 Introduction

Clustering in unlabeled data comprises three operations: (1) assessment of cluster tendency to find the number of clusters, c; (2) partitioning the data into c clusters; and (3) validation of the c clusters found in step 2. This paper concerns a method for automatically extracting the number of clusters to look for in unlabeled data; thus, it addresses the first problem category. The method consists of first creating an image that portrays potential cluster structure using a previously developed algorithm: visual assessment of (cluster) tendency, or VAT (Bezdek and Hathaway 2002). Subsequent data transformation and processing schemes are used to enhance the visual information for automatic estimation of the number of coherent groups. Because this algorithm extracts the count, or number, of clusters, we call it a cluster count extraction (CCE) algorithm.

Before delving into the approach, we define some common conventions and terminology that appear throughout the manuscript. Let $O = \{o_1, \ldots, o_n\}$ denote n objects, such as genes, bacteria, stocks, etc. When each object in O is represented by a column vector, $x \in \mathbb{R}^p$, the set $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^p$ is an object data representation of O. Alternatively, each pair of objects in O can be represented by a relationship between them, which is termed relational data. The most common case of relational data is a matrix of dissimilarities, say $D = [d_{i,j}]_{n \times n}$. Here, $d_{i,j}$ is the pairwise dissimilarity, such as a distance measure, between objects $o_i$ and $o_j$, for $1 \le i, j \le n$. More generally, D can be a matrix of similarities based on a variety of measures (Theodoridis and Koutroumbas 2003; Borg and Lingoes 1987; Kendall and Gibbons 1990); or D can be a relation derived by human intervention in a process involving pairs of objects.
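Converting an object data representation into the relational form D used throughout the paper takes only a few lines. The following is a minimal sketch, assuming numpy; the function name dissimilarity_matrix is ours, not the paper's.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build the n x n matrix D of pairwise Euclidean dissimilarities
    d_ij = ||x_i - x_j|| from object data X (one row per object)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences via broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1))
```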



Given an arbitrary X, suppose that each vector, $x_i \in X$, has a corresponding label, $u_{j,i}$, that describes its belongingness to c non-degenerate groups. If all of these labels satisfy the following conditions:

$$0 \le u_{j,i} \le 1, \;\; \forall j, i; \qquad \sum_{j=1}^{c} u_{j,i} = 1, \;\; \forall i; \qquad \sum_{i=1}^{n} u_{j,i} < n, \;\; \forall j; \qquad (1)$$

then we would say that the c fuzzy subsets, $\{u_i : X \to [0,1]\}$, are a fuzzy c-partition of X. For each set of values that fulfills these requirements, a $c \times n$ partition matrix, $U = [u_{j,i}]_{c \times n}$, can be created to succinctly describe the membership of every object in each group. Building upon this, the sets of all such matrices are defined as $M_{fcn} = \{U \in \mathbb{R}^{c \times n} \mid u_{j,i} \in [0,1], \forall j, i, \text{ satisfying } (1)\}$ for fuzzy c-partitions and $M_{hcn} = \{U \in \mathbb{R}^{c \times n} \mid u_{j,i} \in \{0,1\}, \forall j, i, \text{ satisfying } (1)\}$ for crisp c-partitions. $M_{hcn}$ is a subset of $M_{fcn}$: every hard c-partition of X is fuzzy, but not conversely. The reason these matrices are called partitions follows from the interpretation of $u_{j,i}$ as the membership of $x_i$ in the jth partitioning subset (cluster) of X. Bezdek (1981) contains more details about the algebraic and geometric nature of these partition sets. The important point is that all clustering algorithms generate solutions to the clustering problem for X, i.e. the identification of an "optimal" partition, $U \in M_{fcn}$, that groups together object data vectors, and hence the objects they represent. With this notation in place, we see that cluster tendency assessment attempts to find the best places to look; i.e. assessment of the data may suggest one, or at least a few, distinguished values of c to input to any clustering algorithm. It is our hope, of course, that each mathematically optimal grouping is in some sense an accurate portrayal of natural groupings in the physical process from which the object data were derived.

The remainder of this article is organized as follows. In Sect. 2, we review various visual methods for cluster tendency. Section 3 formally describes the VAT algorithm. This segues into Sect. 4, which outlines the CCE algorithm along with several initial attempts to extract c automatically from VAT images. The results for a number of numerical examples are presented in Sect. 5. Section 6 overviews ways to improve CCE, while Sect. 7 covers parameter sensitivity and cluster count correctness. Finally, Sect. 8 concludes the paper and showcases our current, and future, research in the area of relational data processing.

2 Cluster visualization methods

Various formal and informal (statistically based) techniques for tendency assessment are discussed in Jain and Dubes (1988) and Everitt (1978). However, statistical approaches to assessment are not widespread, principally because the distributional assumptions needed to implement these techniques are often unrealistic. To ameliorate this issue, visual methods for data analysis problems have been proposed since as far back as 1939 (Tryon 1939). Since that time, a plethora of image-based techniques have been widely studied, and both Tukey (1977) and Cleveland (1993) are standard sources for many of these. The general spirit of visual assessment is captured by the "graphical method of shading" described by Johnson and Wichern (1992). They give this informal description: (1) arrange the pairwise distances between points in the data into several classes of 15 or fewer, based on their magnitudes; (2) replace all distances in each class by a common symbol with a certain shade of gray; (3) reorganize the distance matrix so that items with common symbols appear in contiguous locations along the main diagonal (darker symbols correspond to smaller distances); (4) identify groups of similar items by the corresponding patches of dark shadings.

A more formal approach to this problem is the work of Tran-Luu (1996), who proposes a reordering of the data into an "acceptable" block form based on optimizing several mathematical criteria of image "blockiness." After reordering the input data matrix, an image is displayed, and the number of clusters is then deduced visually by a human observer. Tran-Luu's work is an interesting precursor to the VAT method: he attempts to find images of similarity data by seeking "blocky" images through formal optimization procedures. VAT, on the other hand, depends on an intuitive approach to reordering the objects prior to image display.

VAT presents pairwise dissimilarity information about the set of objects, $O = \{o_1, \ldots, o_n\}$, which are represented by either object or relational data, as an $n \times n$ digital image. After the objects are suitably reordered, the image is better able to highlight potential cluster structure. The algorithm is used before clustering to provide a visual representation of the possible number of clusters that may exist in the data. It is possible to argue that VAT actually performs (quasi-single linkage) clustering during the reordering of the dissimilarity matrix (Bezdek and Hathaway 2002; Havens et al. 2008b). Like single linkage, VAT uses a modification of Prim's minimal spanning tree (MST) algorithm to reorder the matrix. However, VAT does not use the MST to define partitions or assign membership values to the objects in the data; its purpose is to provide a method for visualizing potential cluster substructure (Havens et al. 2008a). In other words, we use VAT images to estimate c, the number of clusters to seek.

Beyond the original VAT algorithm, a family of related procedures has been crafted by the collaborative efforts of Bezdek et al. For very large (VL) datasets, i.e. those that cannot be loaded into a single processor, both sVAT (Hathaway et al. 2005) and bigVAT (Bezdek et al. 2005) provide mechanisms for viewing cluster structure. In a different vein, scoVAT and coVAT (Bezdek et al. 2006) accommodate the reordering process for rectangular dissimilarity matrices. While many flavors of VAT have been developed, one main theme is common: they are domain independent. This implies that, as long as the data are represented numerically, VAT does not care whether the objects are human gene sequences, measurements of iris petals, or the voting records of Congressmen.

There is also a considerable body of work in cluster visualization that deals with displaying clusters after a clustering algorithm has been applied (that is, after securing a c-partition U of the data). The earliest published reference we can find that discusses visual displays of clusters (as images) is the work of Cattell (1944). In that paper, Cattell used single linkage heuristics to reorder the elements of small, pairwise dissimilarity matrices. The resulting image, $I(D^*)$, was then hand-shaded for viewing. Similarly, Sneath (1957) reordered a dissimilarity matrix, using an algorithm that had both computer and manual components, then hand-rendered $I(D^*)$ using eight grayscale intensity levels. Subsequent increases in computing power later allowed Floodgate and Hayes (1963) to reorder D using single linkage clustering; however, given the state of the art for computer-based displays at the time, the resulting image was still manually produced. It was not until 1973, when Ling (1973) introduced the SHADE approach, that the reordering and display processes were fully automated. SHADE was used after application of a hierarchical clustering scheme and served as an alternative to visual displays of hierarchically nested clusters via the standard dendrogram. Due to the limited technology of the era, the approach used only 15-level halftones (created by overstriking standard printed characters) to approximate a digital representation of the lower triangular part of $D^*$. Since SHADE was introduced, technology has greatly improved, and displays of similarity or dissimilarity matrices that represent clusters are now fairly common practice. Representative studies include those by Baumgartner et al. (2000, 2001), Strehl and Ghosh (2000a, b), Dhillion et al. (2000), Hathaway and Bezdek (2006), and Huband and Bezdek (2008). The work by Strehl and Ghosh in this area appears quite similar to the methods discussed here, except that fuzzy partitions are discovered and then imaged after hardening and reordering. Our work, however, is aimed at those aspects of data analysis that are important prior to, and during, clustering, not after it.

3 Visual assessment of cluster tendency

The VAT algorithm displays a grayscale image $I(D^*)$, each element of which is a scaled dissimilarity value between objects $o_i$ and $o_j$. Each element on the diagonal is zero, since the dissimilarity of an object with respect to itself is zero. Off the diagonal, the values range from 0 to 1. If an object is a member of a cluster, then it should also be part of a submatrix of "small" values, whose diagonal is superimposed on the diagonal of the image matrix. These submatrices are seen as dark blocks along the diagonal of the VAT image. We next give a concise statement of VAT. In general, the functions arg min and arg max in Algorithm 1 are set valued, and when the sets contain more than one pair of optimal arguments, any optimal pair can be selected. Throughout, we will denote the result of applying the VAT algorithm to the square dissimilarity matrix D as $D^* = \mathrm{VAT}(D)$; the displayed output is the image $I(D^*)$.
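The boxed listing for Algorithm 1 is not reproduced in this text. What follows is a sketch of the VAT reordering consistent with the published description (Bezdek and Hathaway 2002): seed with an endpoint of the largest dissimilarity, then grow the ordering Prim-style. The variable names and the numpy dependency are our own; treat this as illustrative, not the authors' reference implementation.

```python
import numpy as np

def vat(D):
    """Reorder a symmetric dissimilarity matrix D with the VAT scheme.

    Returns D* and the permutation P, where D*[s, t] = D[P[s], P[t]]."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Seed with one endpoint of the largest dissimilarity in D.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    P, J = [int(i)], set(range(n)) - {int(i)}
    # Prim-like MST growth: repeatedly append the unselected object
    # that is closest to any already-selected object.
    while J:
        rest = sorted(J)
        sub = D[np.ix_(P, rest)]
        _, col = np.unravel_index(np.argmin(sub), sub.shape)
        j = rest[col]
        P.append(j)
        J.remove(j)
    P = np.asarray(P)
    return D[np.ix_(P, P)], P
```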

To illustrate the effectiveness of VAT in unearthing the potential number of clusters, we have contrived a simple example, which is shown in Fig. 1. Figure 1a is a scatter plot of n = 1,000 data points in $\mathbb{R}^2$, drawn from a mixture of normal distributions. The means, mixing proportions, numbers of samples, and covariances, respectively, used to generate the five distributions are: {(0, 0), 0.21, 200, 1.2I}, {(8, 8), 0.21, 213, 1.2I}, {(16, 0), 0.21, 208, 1.2I}, {(0, 16), 0.21, 210, 1.2I}, and {(16, 16), 0.16, 169, 1.2I}; here I is the 2 × 2 identity matrix. These object data were converted to a 1,000 × 1,000 dissimilarity matrix, D, by computing $D_{i,j} = \|x_i - x_j\|$ with the Euclidean norm. The c = 5 visually apparent clusters in Fig. 1a are suggested by the 5 distinct dark diagonal blocks in Fig. 1c, which is the VAT image, $I(D^*)$, of the data after VAT reordering. Please compare this with Fig. 1b, which is the image, $I(D)$, of the dissimilarities in input order. It is clear that reordering is necessary to reveal the structure of the underlying data, because nothing can be inferred about structure from the image of the original matrix D.
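Readers wishing to reproduce a setup like Fig. 1 could proceed along the following lines, reusing the dissimilarity_matrix and vat sketches above; the random seed and variable names are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(42)
mix = [((0, 0), 200), ((8, 8), 213), ((16, 0), 208), ((0, 16), 210), ((16, 16), 169)]
X = np.vstack([rng.multivariate_normal(m, 1.2 * np.eye(2), size=k) for m, k in mix])

D = dissimilarity_matrix(X)   # 1,000 x 1,000 Euclidean dissimilarities
D = D / D.max()               # normalize to [0, 1], as in Sect. 5
D_star, P = vat(D)            # after reordering, I(D*) should show 5 dark blocks
```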


Fig. 1 Results from the VAT reordering process, which shows c = 5 dark blocks along the main diagonal of $D^*$. Note that the non-dark-block pixel values highlight the similarity between the different distributions. For instance, the second dark block from the top left corner has elements with relatively low dissimilarity to each of the remaining data groups and corresponds to the cluster centered at (8, 8) in a. a Scatter plot of input data, b unordered dissimilarity matrix, D, c reordered dissimilarity matrix, D*

4 Dissimilarity cluster count

When viewing VAT images that possess visual clarity, we humans can easily locate the dark blocks along the main diagonal of $I(D^*)$ and also determine c. Given our capability for discerning this information, we pose the following question: can we develop a method to estimate c from $D^*$? VAT images, such as the one in Fig. 1c, are candidate cases for arguing that the number of clusters can be determined algorithmically; this stems from the "clean," well-separated, compact nature of the data. However, as the data become more and more mixed, the quality of the VAT image will worsen, which may spur different individuals to see different values. As such, we want to automate this inference procedure so that a value of c, extracted from a VAT image, roughly mirrors what a human would deduce. Why? There are several good reasons. First, different humans will disagree as the images degrade, and we would like an objective, repeatable way to estimate c. Second, if the data are large, we may not be able to view the VAT image because of screen resolution limitations, but we may be able to extract c after processing $D^*$. Moreover, the sVAT procedure enables us to construct an approximation to the VAT image for VL data, and it may be the case that we need or want to extract c from this approximate image.

4.1 Initial attempts to estimate c

We briefly discuss three unsuccessful methods, which nonetheless led us to the CCE algorithm described in this section. An immediate idea is to apply an edge detection algorithm (Canny 1986; Gonzalez and Woods 2002) to find the vertical or horizontal edges of the dark blocks. Without any precursory processing, however, the extracted boundaries will emphasize any intra-block patterns or extraneous dissimilarities, as we have shown in Fig. 2b.

Fig. 2 Example demonstrating the complexities of finding cluster structure using a standard edge detection technique. For this "clean" VAT image, four lines, two horizontal and two vertical, highlight the borders of the two blocks. However, for more complex VAT images, with ill-defined boundaries, such lines would be nonexistent. a Arbitrary, reordered dissimilarity matrix, D*. b Canny edge detection executed on D*

Furthermore, selecting the "correct" line, or possibly a series of disjoint lines, for each diagonal submatrix would be an arduous computational task. Even with some judicious preprocessing to guide the edge detection process, we would simply transform the problem of counting rectangles into one of counting edges. Humans can do either of these with equal proficiency; but we still would not have an estimate of c.

A more efficient technique might be based on examining selected rows, or columns, in $I(D^*)$. Each row crosses through one of the dark blocks of the VAT image. Consider the ith row of the image matrix. The intensities $D^*_{i,j}$ along the ith row show the dissimilarities of object $o_i$ to the objects $o_1$ through $o_n$. If the ith object is a member of a cluster, there will be a set of "small" intensity values $D^*_{i,i-k}, D^*_{i,i-k+1}, \ldots, D^*_{i,i+m}$ that comprise the dark block. These values correspond to objects $o_{i-k}$ through $o_{i+m}$. Figure 3 shows one row from the uppermost submatrix in Fig. 1c. By identifying a set of small-intensity values adjacent to a dark set, we could deduce that this row suggests one cluster. Furthermore, due to the symmetry of the image matrix, we can eliminate rows $i-k$ through $i+m$ if we want to look for more clusters. We can then count the number of clusters by repeating this process until all rows have been eliminated. However, this methodology is not without algorithmic complexities. For instance, how do we threshold $D^*_{i,1,\ldots,n}$ to find a cluster boundary, and how many objects are needed in each submatrix to form a cluster? If each row is treated as an independent entity, image thresholding algorithms (Gonzalez and Woods 2002) could be utilized to find potential regions in $D^*_{i,1,\ldots,n}$ that belong to low-dissimilarity entries. However, this raises a slew of further queries, such as: should row segmentation information from $D^*_{i,1,\ldots,n}$ be passed along to $D^*_{i+1,1,\ldots,n}$? Furthermore, can multiple rows, say $D^*_{m,\ldots,k,1,\ldots,n}$, be used to make a better decision?

Fig. 3 Dissimilarities from the first row of $D^*$ in Fig. 1c

While extracting c using a row-based approach would greatly increase the complexity of the problem, it began to open paths toward better addressing it. Instead of sampling $D^*$ in a row-iterative fashion, the dissimilarities can instead be selected from some arbitrary diagonal or sub-band. To appreciate the benefits gleaned by such a scheme, we have provided a pictorial description in Fig. 4. Viewing both histogram plots, the peaks immediately stand out from the remaining values as potential processing candidates. Provided that some peak-counting algorithm can be devised, or implemented from the literature (Atiquzzaman 1992), we would have a viable method for counting the number of clusters. But as with the previous two ideas, this one is also fraught with difficulties. If we only consider the tendency curve in Fig. 4a, there are many values that could be flagged as peaks, beyond the two prominent spikes around diagonal indices 50 and 100. We would thus need a way to select which values are flagged as new clusters. As well, sampling too far from the diagonal, or in too large a sub-band, can obfuscate smaller cluster structures.

4.2 Cluster count extraction algorithm

With each approach we proposed, it would appear that we further distanced ourselves from finding c. Indeed, each of the aforementioned techniques did not hold up well for non-trivial datasets. Nonetheless, the lessons learned from the failures influenced the formulation of an algorithm that we have found to work well for a number of datasets: cluster count extraction, or CCE (Sledge et al. 2008b). To some degree, the various problems that CCE needs to overcome have already been outlined with the previous ideas.


Fig. 4 Two examples that unearth similar information about the dissimilarity tendency of an arbitrary, c = 3 VAT image. The first image is a mock-up of how the dissimilarities are sampled, while the second is the tendency histogram, of the pixel intensities, that is produced. a Sampling scheme where dissimilarities are taken from the 10th positive, primary diagonal. b Sampling scheme where dissimilarities are aggregated from the 10th negative, primary diagonal to the 10th positive, primary diagonal and then averaged by column

Foremost, locating regions of low dissimilarity is paramount, as it will reduce the amount of additional processing. Spatial function optimization readily comes to mind as a way to facilitate this, since such a scheme would be able to directly find the boundaries of the dark blocks. However, there are many unnecessary complications associated with this procedure. Instead, a better alternative is to exploit the properties of $D^*$ and rely on image processing or, more specifically, image segmentation. Segmenting an image is analogous to computing a set of crisp partitions for an arbitrary dataset. Both processes attempt to determine the elements that belong to a specific group so that some function, or property, is either maximized or minimized. For many of these imaging techniques, the value to optimize is an entropy variable or various statistical parameters derived from the image content. Assuming that the "proper" value(s) can be determined, we would have a way to globally transform $D^*$ into a two-class, or binary, image such that the diagonal-block dissimilarities belong to one class and the non-diagonal-block dissimilarities are classified to another. This contrasts with the iterative line idea, where a potentially different threshold value would be selected for each $D^*_{i,1,\ldots,n}$.

To this end, we consider a prominent technique, Otsu's algorithm (Otsu 1979), which has been widely employed for image segmentation over the years. The method works as follows. Suppose that we have an image, $M_{n \times n}$, that has L total gray-level intensities, $[1, 2, \ldots, L]$. For each intensity level l, assume that there are $g_l$ pixels with that value, and that there are $n^2$ pixels total in M. By normalizing each of these values, $p_l = g_l / n^2$, with $p_l \ge 0$ and $\sum_{l=1}^{L} p_l = 1$, a probability distribution can be created from the image histogram. Provided that we want to segment all pixels into one of two classes, $C_1$ or $C_2$, we can find a threshold value, $t^*$, that maximizes the between-class variance. To do so, we compute the zeroth-order cumulative moment, $m_0(t) = \sum_{l=1}^{t} p_l$, the first-order cumulative moment, $m_1(t) = \sum_{l=1}^{t} l\,p_l$, and the total mean level, $\mu_L = \sum_{l=1}^{L} l\,p_l$, from the probability distribution. Sequentially searching over the intensities, an optimal threshold is thus defined as:

$$t^* = \operatorname*{argmax}_{1 \le t < L} \frac{\left(\mu_L\, m_0(t) - m_1(t)\right)^2}{m_0(t)\left(1 - m_0(t)\right)}.$$
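A direct numpy transcription of this threshold search might look as follows; the 256-bin histogram and the assumption that the image is scaled to [0, 1] are our choices, not prescribed by the paper.

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Return a threshold t* in [0, 1] maximizing the between-class
    variance (mu_L * m0(t) - m1(t))^2 / (m0(t) * (1 - m0(t)))."""
    hist, edges = np.histogram(img, bins=levels, range=(0.0, 1.0))
    p = hist / hist.sum()                  # probability distribution p_l
    l = np.arange(1, levels + 1)
    m0 = np.cumsum(p)                      # zeroth-order cumulative moment
    m1 = np.cumsum(l * p)                  # first-order cumulative moment
    mu_L = m1[-1]                          # total mean level
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_L * m0 - m1) ** 2 / (m0 * (1.0 - m0))
    t_star = int(np.nanargmax(sigma_b[:-1]))   # search over 1 <= t < L
    return edges[t_star + 1]               # bin edge on the [0, 1] scale
```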

For histograms that have easily separable classes, say intensities that form a bimodal distribution, Otsu's approach should easily find the local minimum between the two peaks. VAT image histograms, however, are rarely bimodal, and have many different modes that correspond to intra- and inter-cluster dissimilarities. In these instances, the algorithm often yields a "reasonable" $t^*$ that segments the first mode, the intra-cluster distances and possibly other low dissimilarities, from the remainder of the probability distribution. We say that the threshold is only "reasonable," since it is based solely on histogram statistics and does not include the spatial information present in $D^*$. As such, a small number of VAT images may be improperly segmented, but this issue is addressed in a later section. By utilizing this non-parametric, unsupervised method, we can now globally extract the dark blocks from the remainder of the VAT image.

The only processing that remains is to determine the number of dark blocks. Revisiting the off-diagonal sampling scheme from Fig. 4a, we could sample values from the off-diagonal of the thresholded image and count the number of contiguous spikes. But there may be instances where segmentation introduces noise points, either inside the dark blocks or offset from the diagonal. If these pixels are captured in the sampling process, they would drive up the cluster count. To preemptively reduce the number of false alarms, we draw upon a signal processing technique, cross-correlation, to filter out any noise. Given two real-valued, continuous functions, $f(x)$ and $g(x)$, their cross-correlation is defined as $(f \star g)(x) = \int f^*(y)\,g(x+y)\,dy$, for appropriate values of y, where $f^*$ denotes the complex conjugate. Provided that $f(x)$ and $g(x)$ differ only by a shift on the primary axis, the process determines the amount that $g(x)$ must be shifted to match $f(x)$; when they match, the value of $(f \star g)(x)$ is maximized. This notion extends easily to the discrete case, where F and G are no longer continuous functions but vectors, and also to the two-dimensional, discrete case, where F and G are matrices. Note, however, that for the two-dimensional, discrete case, cross-correlation is the same as convolution, $F \star G = (y \mapsto F(-y)) \circledast G$, where $\circledast$ is the convolution operator; equivalently, it is convolution with the matrix G rotated by 180°. This implies that the computation can easily, and quickly, be carried out with complex multiplication in the frequency domain.
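The correlation-convolution identity just noted can be checked numerically. A small sketch, assuming scipy is available (any FFT-based routine would serve equally well):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

F = np.random.default_rng(0).random((32, 32))
G = np.ones((5, 5))                     # an "ideal block" filter of ones

xcorr = correlate2d(F, G, mode="same")
conv = convolve2d(F, np.rot90(G, 2), mode="same")   # G rotated by 180 degrees
assert np.allclose(xcorr, conv)
```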

Suppose that we define a two-dimensional "ideal block" filter as a contiguous, finite, square matrix of ones. Since $D^*$ is a two-dimensional matrix, the two-dimensional cross-correlation approach will find the regions where the discrete filter of ones returns the worst possible match: the dark blocks, represented as zeros, along the diagonal. As a result, groups of pixels that are smaller than the size of the filter will be significantly dampened. Any white pixel intensities that reside inside a segmented diagonal block will also be smoothed out; however, cross-correlation will not fully suppress these values. While grayscale morphology could be used to remove such pixels entirely, this computation is not necessary. Instead, if we sample dissimilarities that are "close enough" to the main diagonal of $D^{*,f}$, the correlated $D^*$, where the blocks are assumed to have the highest coherence, we can avoid these gray values. Here, we define "close enough" to be the first positive, primary diagonal of $D^{*,f}$, i.e. $D^{*,f}_{i,i+1}$, $1 \le i \le n-1$. Although we could sample from an arbitrary diagonal, $D^{*,f}_{i,i+k}$, where $k \ge 1$, we would run the additional risk of missing cluster structures that contain fewer than k elements. With these sampled values, an easily parsed tendency curve, like the one in Fig. 5, can now be created. Unlike the tendency curves shown earlier in Fig. 4, this one is devoid of any potentially confusing false peaks. By specifying an arbitrary horizontal line, we can now count the number of times that line intersects the distributions to obtain an estimate for c, the total number of clusters suggested by the VAT image. Obviously, there are some lines, such as $y > 0.4$ for the curve in Fig. 5, that may not yield a good value for c. To mitigate these situations, we specify the horizontal line to be $y = 0$. In doing so, some additional information can be extracted, i.e. aligned partitions, which we expand upon elsewhere (Sledge et al. 2008a).

To conclude the development of the CCE approach, we provide a succinct definition of the necessary steps in Algorithm 2. The only parameter that needs to be passed to the algorithm is the filter size, $\alpha$, which acts as a rough indicator of the smallest number of entries that constitutes a cluster. Although adaptive methods for determining $\alpha$, such as finding the smallest number of zeros across all rows of $D^{*,s}$, can be specified, we simply set $\alpha = \lfloor 0.05\,n \rfloor$, where n is the number of entries in $D^*$. For the myriad of datasets we have experimented with, the scalar value of 0.05 has consistently produced good results; however, any integer value $1 < \alpha \le \lfloor 0.05\,n \rfloor$ should suffice.
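The boxed listing for Algorithm 2 is likewise not reproduced in this text. The following condenses the steps just described into a Python sketch, reusing the otsu_threshold helper above; the run-counting at y = 0 and the default $\alpha$ follow the text, while everything else (names, the scipy dependency) is our own scaffolding.

```python
import numpy as np
from scipy.signal import correlate2d

def cce(D_star, alpha=None):
    """Estimate c from a VAT-reordered matrix D_star scaled to [0, 1]."""
    n = D_star.shape[0]
    if alpha is None:
        alpha = max(2, int(np.floor(0.05 * n)))       # default filter size
    # 1. Threshold D* into a binary image: dark blocks -> 0, rest -> 1.
    D_seg = (D_star > otsu_threshold(D_star)).astype(float)
    # 2. Cross-correlate with an alpha x alpha block of ones; noise pixels
    #    smaller than the filter are dampened, dark blocks stay at zero.
    D_f = correlate2d(D_seg, np.ones((alpha, alpha)), mode="same")
    # 3. Sample the first off-diagonal of the filtered image and count the
    #    contiguous runs of zeros: each run is one dark block, one cluster.
    band = D_f[np.arange(n - 1), np.arange(1, n)]
    dark = band == 0
    return int(dark[0]) + int(np.sum(dark[1:] & ~dark[:-1]))
```

Under these assumptions, running cce on the D_star built from the Fig. 1 mixture sketched earlier should report c = 5.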

5 Numerical examples and results

To demonstrate the capabilities of CCE, we have conducted a number of experiments on both synthetic and real-world datasets.

Fig. 5 Results after applying the CCE method to $D^*$ from Fig. 1c. Going from left to right, the first image shows the gray-level threshold location chosen by Otsu's algorithm. The second and third images display, respectively, the effect of thresholding and correlation on $D^*$. Finally, the fourth image is the tendency curve that is generated, which shows c = 5 continuous, disjoint distributions

For the algorithm, we set the single parameter, $\alpha$, to $\alpha = \lfloor 0.05\,n \rfloor$, where n is the number of elements in the dataset, and utilized Otsu's algorithm for segmenting $D^*$, except where noted. For many of the examples, D was constructed using the Euclidean norm; for one dataset, which we point out in the text, the standardized Euclidean norm was used. In addition, the dissimilarities were normalized to the range [0, 1] via $D = D / \max_{p,q \in K} D_{p,q}$, $K = \{1, \ldots, n\}$, before reordering with VAT. Note that the normalization process is not necessary for VAT to execute properly; however, D does need to be normalized to [0, 1] for the image segmentation to work. Since CCE is a tendency assessment technique, for use before clustering, we examined each $I(D^*)$ in this section and subjectively defined a "good" value for the suggested number of clusters. In addition, we executed a number of cluster validity metrics (Bezdek and Pal 1998; Sledge et al. 2009; Wang and Zhang 2007), the results of which are outlined in Sect. 7, to verify these values.

5.1 Synthetic data experiments

In the first series of examples, we focus on the ability of CCE to extract the correct number of clusters from distributions drawn from a mixture of c = 3 bivariate normal variables. For each of the n = 5,000-point datasets, the means and covariances were iteratively varied to produce clusters that ranged from compact and well separated (X_Gauss1) to dispersed and overlapped (X_Gauss5). Figure 6a-e shows the results for the five experiments. For each of the datasets, CCE was able to define three continuous, disjoint regions, no matter the relative similarity between the groups of objects. Though we could easily construct samples for which the three clusters are even more overlapped, this would only serve to test the limits of the VAT sorting process.

An interesting phenomenon to note is the change in the cluster tendency curve as the distributions merge together. For the well-separated case, the tendency histogram contains three impulse-like spikes. However, when the data become strongly overlapped, these spikes become wider and less pronounced. The reason behind this change is rooted in the quality of the VAT image and the subsequent image processing. If the data contain a number of outlier points, or vectors with low dissimilarity to entries in multiple distributions, Prim's heuristic will tend to place these near the ends of the blocks along the diagonal. When image segmentation and cross-correlation are carried out, these dissimilarities can easily affect the number of light, [0.6, 0.9], or medium, [0.3, 0.6], gray pixel intensity values between consecutive dark blocks. As the number of outliers increases, so too will the regions of grayscale values grow in size, which ultimately leads to a wider tendency distribution. Similarly, the height of the tendency curve is influenced by the number of outliers, or possibly low inter-cluster dissimilarities, and the segmentation process. If a cluster has entries with high similarity to those in another distribution, some "bleeding" may occur during image segmentation; the thresholded VAT image in Fig. 6e is a prime example of this. Consequently, the correlated values along the diagonal will have low pixel intensity values. In certain instances, these intensity values can impact the number of clusters found by CCE. While we discuss this more thoroughly in Sect. 7, many of the adverse effects can be mitigated by selecting a filter size such that $1 < \alpha \le \lfloor 0.05\,n \rfloor$.


5.2 Public data experiments

To see how well the CCE algorithm can handle non-synthetic data, we have selected four datasets from the UCI Machine Learning Repository (Asuncion and Newman 2007) and a single dataset from the Gene Ontology Consortium (2004).

Fig. 6 Results for the varying Gaussian mixture datasets. For each experiment, the first image is a scatter plot of the dataset. The second and third images are the corresponding VAT and segmented VAT plots, respectively. The fourth image is the tendency curve generated by the CCE algorithm. To see the full image details, please zoom in. a Results for the X_Gauss1 dataset, where CCE reports that c = 3. b Results for the X_Gauss2 dataset, where CCE reports that c = 3. c Results for the X_Gauss3 dataset, where CCE reports that c = 3. d Results for the X_Gauss4 dataset, where CCE reports that c = 3. e Results for the X_Gauss5 dataset, where CCE reports that c = 3

5.2.1 X_Iris: Iris data

One of the most widely used pattern recognition datasets, the Iris data contains numerical information on different types of iris flowers. To help distinguish between the three plants, four numerical attributes, sepal length, sepal width, petal length, and petal width, were recorded by Anderson in the early twentieth century. Though there are 50 vectors in each class, for a total of n = 150 vectors, only two of the classes are linearly separable due to the chosen measurement features. The VAT image in Fig. 7a also suggests that there are two geometric clusters, as does CCE.



5.2.2 X_Vote: congressional voting data

The next UCI dataset was generated from 1984 voting records, on 16 key topics, from the United States House of Representatives. Votes were numerically encoded as 0.5 for "yea," -0.5 for "nay," and 0 for "unknown disposition," so that the voting record of each Congressman is represented as a trinary-valued object data vector in $\mathbb{R}^{16}$; thus it is impossible to form an opinion about cluster structure that might reside in the data by visual examination of a scatter plot. Consequently, the 435 × 435 (there are 435 members in the House of Representatives) VAT image in Fig. 7b serves as an excellent starting point for visualizing the distribution of these data. Viewing the image suggests that, for the 16 different bills, there is a partisan division into two fairly well-defined clusters over a variety of issues. The two identified classes are Republican (54.8%) and Democrat (45.2%), and the image shows that members of these two groups share similar voting records. However, about 15% of the voters failed to vote consistently. These voters caused some interaction (mixing) between the two main clusters, which led to ill-defined boundaries in the dark blocks. Applying the CCE process to the data, it reported that c = 2. Although this data is rather noisy, and possibly has a more ambiguous VAT image than the one in the preceding experiment, we feel that this value is correct. Foremost, we can only make out two dense, compact regions in $I(D^*)$ in Fig. 7b. As well, the noisy grayscale values that follow the second dense region correspond either to entries that belong to both distributions or to outlier points. Thus, these entries should not drive up the cluster count.

5.2.3 X_BC: Wisconsin breast cancer data

Although a staple dataset for classifiers, the original Wisconsin breast cancer data seems an unlikely choice for testing clustering algorithms. But several interesting results came about from utilizing it. To briefly comment on the nature of the data, a total of 699 clinical studies were either performed or collected by Dr. Wolberg at the University of Wisconsin Hospitals in Madison. A total of 10 numeric attributes, which lie in the integer set $\{1, \ldots, 10\}$, were collected for 683 of the 699 studies. Class labels were also provided to distinguish between the 444 benign and 239 malignant cases. Running the VAT algorithm on this data, we arrived at the $D^*$ shown in Fig. 7c. Looking at the reordered dissimilarities, one may initially believe that there was an issue during the sorting process, or that c = 3.

Since the two noisy regions, located at the beginning and end of $D^*$, have a strong similarity to each other, an observer may expect them to be adjacently located. However, there are only two prominent geometric clusters in this dataset; thus the entries in the bottom, noisy region of the VAT image are outlier points. For this example, CCE returned c = 2, which we have mixed feelings about, since it is possible to argue that there are really c = 3 clusters suggested by the VAT image.

5.2.4 X_Wine: Wine data

Another prominent dataset, the Wine data contains the results of various chemical tests carried out on wines from three Italian cultivars. To distinguish between the wines, 13 different features were measured for 178 samples. Using the VAT reordering process, and building the relational matrix with non-normalized data and the Euclidean norm, we expected to see some semblance of c = 3 clusters. However, that reordered dissimilarity matrix, which we have not shown, provided very little information about the distributions. Instead, we used the standardized Euclidean distance, which is just the Euclidean norm after each feature in X is divided by its standard deviation. After reordering, the VAT image in Fig. 7d showed data structure and allowed CCE to extract c = 3, the correct number of classes, from it.

5.2.5 D_194: GDP-194 relational data

The last public dataset we consider in this section is unique among the other data. When creating D for the earlier experiments, an underlying set of numerical object data was used to generate the dissimilarities. This data, D_194, however, was constructed using a fuzzy-measure dissimilarity relation applied to annotations of 194 human gene products, derived from the Gene Ontology Consortium (2004). To understand how the dataset was built, please consult Popescu et al. (2004) for a thorough description. Briefly, the data comprise 21, 87, and 86 gene products from the myotubularin, collagen alpha chain, and receptor precursor protein families, respectively. Viewing $I(D^*)$ in Fig. 7e, one can make out three shaded areas, which correspond, going from top left to bottom right, to the receptor precursor, collagen alpha chain, and myotubularin gene products. As mentioned by Popescu et al. (2004), the collagen alpha chain family has some definite secondary cluster structure, which should drive up the cluster count beyond c = 3. After executing CCE, we obtained the results shown in the final picture of Fig. 7e. For this example, CCE reported that c = 7, which we agree with. Although we can make out three predominant clusters in $D^*$, the low dissimilarity values in the collagen alpha chain and myotubularin gene products caused the thresholding algorithm to highlight the intra-cluster structure.


Fig. 7 Results for various publicly available datasets. The first and second images are the corresponding VAT and segmented VAT plots, respectively. The third image is the tendency curve generated by the CCE algorithm. a Results for the X_Iris dataset, where CCE reports that c = 2. b Results for the X_Vote dataset, where CCE reports that c = 2. c Results for the X_BC dataset, where CCE reports that c = 2. d Results for the X_Wine dataset, where CCE reports that c = 3. e Results for the D_194 dataset, where CCE reports that c = 7



5.3 Real world data experiments

For the last series of non-synthetic experiments, we have selected four datasets created by some of the authors or by colleagues. The data were created from research funded by the National Science Foundation, under grant IIS-0428420, and the U.S. Administration on Aging, under grant 90AM313.

5.3.1 X_Part1, X_Part2, X_Part3: wellbeing data

To further measure the efficacy of CCE, we tested the algorithm with processed sensor data from the University of Missouri Center for Eldercare and Rehabilitation Technology. The raw data, which are a culmination of values collected by motion, bed restlessness, pulse, and breath sensors, are specific to each participant in the study and provide a rough estimate of that person's well-being. Currently, 32 different features are computed from the unprocessed data, which help to locate short-term changes and long-term activity trends that correlate with different states of health. Although incremental, temporal approaches have shown promise in locating the activity clusters (Sledge and Keller 2008; Sledge et al. 2008c, d), the data are rather noisy and should help highlight areas for improving cluster count extraction. Utilizing three participants' feature data, the n = 428, 851, and 588 VAT images, shown in Fig. 8a, b, and c, respectively, were produced. Viewing these images, c = 2 immediately jumps out as the expected value for each VAT plot, and is also the value returned by CCE. However, for $D^*$ in Fig. 8a, one can make out some secondary cluster structure that possibly suggests c = 4 or c = 5. In the remaining VAT plots, any semblance of definite intra-cluster structure is not easily discerned. Having dealt with the data extensively, we find that c = 2 spatial clusters, which highlight both "normality" (the large dark region) and "abnormality" (the small dark region), is a reasonable assessment.

5.3.2 X_Fall: fall detection data

The final set of real world data we used came from the automated, video-based fall detection work of Anderson et al. (2008a, b). To correctly recognize falls, a number of transformations are performed on raw video sequences captured by two orthogonally placed cameras. Out of these various processes, silhouette extraction is used to separate the individual from the background in each image frame. Utilizing the silhouette information from both video feeds, a three-dimensional, voxel-space representation of the individual is created.

From this three-dimensional model, 12 different features, such as ground plane similarity and the mean z centroid of the model, are extracted to help determine whether the individual is upright, kneeling, on the ground, etc.

6 Improving cluster count extraction

One of the algorithmic facets that will ultimately hinder any image processing-based cluster count extraction method is the underlying block segmentation process. Though Otsu's algorithm has shown remarkable promise in segmenting a wide range of VAT images, there are instances where it will fail. To address this issue, and to provide reasoning for the inclusion of a more robust underlying thresholding algorithm, we have devised a simple, six-cluster dataset, X_Six, shown in Fig. 9a. At first sight, this seemingly innocent dataset appears far less complex than some of the others presented here. Foremost, the c = 6 dark blocks are perfectly visible, and well defined, along the main diagonal of the 300 × 300 VAT image. Feeding the matrix into Otsu's algorithm, however, introduces some complications. Due to the disproportionate number of pixels with a medium-range, [0.3, 0.6], grayscale value, the thresholding approach failed to define a "good" grayscale boundary. This failure propagated to the later stages of the CCE algorithm, which eventually returned the incorrect value of c = 3, as shown in Fig. 9b.

With such flagrant results, we turned to the literature to find a more powerful thresholding solution. Eventually, we settled on the minimization of homogeneity and uncertainty energy, or MHUE, algorithm developed by Saha and Udupa (2001). Unlike many other segmentation techniques, MHUE attempts to acquire object knowledge from the image by calculating class uncertainty and region homogeneity information. For this technique, class uncertainty can be thought of as the degree to which a pixel belongs to an arbitrary group based upon its intensity and any a priori knowledge about the intensity probability distributions of the classes. Region homogeneity, on the other hand, attempts to capture the local, spatial "affinity" of pixels. For instance, if each element in a group of pixels has a high affinity to all other elements, then those pixels should be classified into the same class. Conversely, if a pixel has a low affinity to another, then this most likely corresponds to an object edge, and the two elements should be classified differently. When both sources of information are coupled together, MHUE is able to overcome many of the issues that plague histogram- or entropy-dependent approaches. As a result, it is able to locate the complex, fuzzy boundaries that may exist in some VAT images. Most importantly, however, it enables CCE to extract the correct number of clusters from X_Six, which is evidenced by the plots in Fig. 9c.


Fig. 8 Results for various real world datasets. The first and second images are the corresponding VAT and segmented VAT plots, respectively. The third image is the tendency curve generated by the CCE algorithm. a Results for the X_Part1 dataset, where CCE reports that c = 2. b Results for the X_Part2 dataset, where CCE reports that c = 2. c Results for the X_Part3 dataset, where CCE reports that c = 2. d Results for the X_Fall dataset, where CCE reports that c = 5



Fig. 9 Example that shows motivation for more advanced image segmentation algorithms. a Scatter plot of X_Six and the corresponding I(D*). b Results for the CCE process when Otsu's algorithm is used. Although six dark blocks are clearly visible along the diagonal of I(D*), the low dissimilarities cause the segmentation algorithm to specify a non-ideal threshold value. This forces CCE to incorrectly return c = 3. c Results for the CCE process when intensity-based class uncertainty and region homogeneity are used to segment D*. Going from left to right, the first image highlights the object scale of the VAT image in a. The second image shows the results of the homogeneity calculations, which managed to capture the blocky structure present throughout the entirety of D*. It is apparent that c = 6 clusters are now identifiable by CCE

For the MHUE algorithm, we decided to display plots of the underlying object scale and region homogeneity information to help provide graphical reasoning for the segmentation results. While homogeneity knowledge helps to find regions of chaos, which indicate the edges of the objects, estimated object scale values highlight areas that contain similar pixel values. However, both attributes do not always lead to "ideal" thresholding, as we can see that there are some "bleeding" issues with the segmented dissimilarities near the second diagonal dark block.

Although we could present additional synthetic data to further reinforce the strengths of MHUE, we instead focus on its ability to locate secondary cluster structure. For this study, we reuse the X_Part1 dataset from Fig. 8a.

Out of the data we have presented here, X_Part1 was unique because it contained definite intra-cluster definition that was obscured during the segmentation process. In comparison to the earlier experiment, however, we can see in Fig. 10 that this boundary information was preserved by the more robust MHUE algorithm. This allowed CCE to specify that c = 4, which mirrors what we see in the VAT image, as opposed to the original value of c = 2. Given these improved results, why did we not utilize MHUE for each of the experiments? For many datasets, a single-valued thresholding approach, like Otsu's algorithm, can produce results that are similar to MHUE's. The thresholded images in Fig. 11 are a testament to this claim.

Fig. 10 Example showing that region homogeneity and intensity-based uncertainty allow CCE to capture intra-cluster structure. Going from left to right, the first image is the previously shown VAT image from Fig. 8a. The second and third images are the object scale and homogeneity information, respectively. The fourth plot highlights the newly uncovered segmented, secondary cluster structure. Finally, the tendency curve reports that c has increased from c = 2 to c = 4



The only case that we found where MHUE produced a different binary image is for the D_194 dataset in Fig. 11e. Instead of the c = 7 suggested in Fig. 7e, CCE reported that c = 3, which is the correct number of protein families.

7 Parameter effects and count correctness

Fig. 11 Results of Otsu's algorithm, shown in the first image, versus the MHUE technique, shown in the second image. For many of these datasets, the thresholded images are nearly identical, which can imply that MHUE is not needed in all circumstances. a Thresholding the X_Iris dataset, b thresholding the X_Vote dataset, c thresholding the X_BC dataset, d thresholding the X_Wine dataset, e thresholding the D_194 dataset

As a tendency assessment algorithm, CCE avoids some of the expensive computation required by spatial-data cluster validity schemes, which exhaustively search for the best cluster count over a range of c's. However, the count computation instead relies upon two factors: the VAT sorting process and the selection of an appropriately sized cross-correlation filter, $\alpha$. For situations where the VAT image suggests an improper number of clusters, the CCE algorithm will most likely fail. Likewise, when the ordered relational matrix is cross-correlated with a non-ideally sized filter, the CCE algorithm can produce incorrect results.

To elucidate the sensitivity of the filter parameter, we provide in Fig. 12 the cluster count results for a small subset of the datasets previously discussed. Consistent with our expectations and recommendations, a small filter size, i.e. $\alpha \le \lfloor 0.1\,n \rfloor$, produced the correct values for c in Fig. 12a and b. Beyond those dimensions, the cluster count significantly degraded. This phenomenon occurred either due to the suppression of distributions with a small number of points, as in Fig. 12b for the small, dark block at the bottom right of the segmented VAT image, or from the coalescing of multiple dark blocks, as in Fig. 12a. For filter sizes that lie between $\lfloor 0.01\,n \rfloor \le \alpha \le \lfloor 0.1\,n \rfloor$, and for VAT images with clearly defined structure, we found no variability in the returned values of c. On the other hand, the results in Fig. 12c show cluster count fluctuations that are directly correlated with small changes in the filter size.

Given that there is little agreement on c for the lower filter sizes, is this a shortcoming of the CCE algorithm? To answer this question, given that D_194 is purely relational with no spatial equivalent, we were motivated to create a family of fuzzy relational cluster validity methods (Sledge et al. 2009) based upon generalizations of Dunn's index (Bezdek and Pal 1998). Running the algorithm for the crisp case, where the partitions were generated via the relational c-means algorithm (Hathaway and Bezdek 1994), we obtained the values shown in Table 1, which show an overwhelming consensus for c = 3. From this, we can conjecture that, in certain circumstances, the underlying segmentation scheme may highlight cluster substructure and drive up c. Although the higher value of c may not be ideal, for most applications overclustering is far more desirable than underestimating the number of distributions.



Fig. 12 Effects of the filter size parameter on the number of clusters returned by the CCE algorithm. a Results of the cross-correlation process for the VAT image in Fig. 1c with filter size α = ⌊0.01n⌋, ⌊0.05n⌋, ⌊0.1n⌋, ⌊0.15n⌋, ⌊0.2n⌋, ⌊0.25n⌋, ⌊0.5n⌋, and ⌊0.75n⌋ for the images going from left to right, respectively, where n = 1,000. The returned cluster counts were c = 5, c = 5, c = 5, c = 5, c = 3, c = 2, c = 1, and c = 1, respectively. b Results of the cross-correlation process for the VAT image in Fig. 6e with the same filter sizes, where n = 5,000. The returned cluster counts were c = 3, c = 3, c = 3, c = 2, c = 2, c = 2, c = 1, and c = 1, respectively. c Results of the cross-correlation process for the VAT image in Fig. 7e with the same filter sizes, where n = 194. The returned cluster counts were c = 9, c = 7, c = 4, c = 3, c = 2, c = 2, c = 1, and c = 1, respectively

In addition, the more intra-cluster structure that is highlighted, such as in the VAT image in Fig. 7e, the more sensitive the processing will be to the filter size. We further canvass this issue in Sledge et al. (2008a).

To ensure that the correctness of the distribution counts obtained in Sects. 5 and 6 was not entirely subjective, we also employed the relational validity metrics from Sledge et al. (2009). Overall, partitions created from the c's found by CCE were of higher quality than those produced from different cluster counts; i.e. for the direct measure $V : U \in \mathbb{R}^{c \times n} \mapsto \mathbb{R}^+$, we found $V(U_{c \times n}) > V(U_{c_i \times n})$ for $c \ne c_i$. The only two discrepancies were for the D_194 and D_Part1 datasets, where CCE reported that c = 7 (or c = 3 when using MHUE) and c = 2 (or c = 4 when using MHUE), respectively, while the relational generalized Dunn's indices voiced support for c = 3 in both cases. Nonetheless, for some datasets, certain cluster validity measures may be fickle estimates of partition quality. In these instances, it may be necessary to execute multiple post-clustering metrics, over a range of c's, and intelligently combine the results to arrive at a sound conclusion.

What, then, is the difficulty of arriving at a similar verdict for CCE? With our empirical results in this section, and elsewhere, we can state, with a certain degree of certainty, that for compact, separated data it is trivial to select a small value of $\alpha$ that will lead to a correct estimate of the number of distributions.

Table 1 Relational generalized Dunn's indexes for D_194

V_{i,j}   c = 2   c = 3   c = 4   c = 5   c = 6   c = 7
V_{1,1}   0.08    0.09    0.14    0.06    0.08    0.08
V_{1,2}   0.23    0.27    0.34    0.14    0.16    0.16
V_{1,3}   0.16    0.19    0.23    0.09    0.11    0.11
V_{2,1}   1.48    1.74    1.15    1.07    1.31    1.00
V_{2,2}   4.36    4.88    2.83    2.45    2.58    1.92
V_{2,3}   3.11    3.41    1.96    1.69    1.72    1.27
V_{3,1}   0.71    0.85    0.57    0.39    0.51    0.53
V_{3,2}   1.78    2.79    1.39    0.88    1.01    1.01
V_{3,3}   1.27    1.94    0.96    0.61    0.67    0.67
V_{4,1}   0.53    0.81    0.41    0.29    0.39    0.41
V_{4,2}   1.32    2.69    1.01    0.69    0.78    0.78
V_{4,3}   0.94    1.87    0.69    0.47    0.52    0.52
V_{5,1}   1.24    1.66    0.98    0.67    0.90    0.93
V_{5,2}   3.12    5.48    2.41    1.55    1.77    1.77
V_{5,3}   2.22    3.82    1.66    1.07    1.18    1.18
V_{6,1}   0.93    1.01    0.76    0.58    0.77    0.75
V_{6,2}   2.33    3.34    1.86    1.33    1.53    1.44
V_{6,3}   1.65    2.32    1.28    0.92    1.02    0.96

Here, $V_{i,j}$ is the validity metric using the ith inter-cluster distance and the jth intra-cluster dispersion measure. The maximum value of $V_{i,j}$ in each row denotes the highest quality partition and thus an estimate of c.

‘‘good’’ estimate of c may not be as easy, since there may be underlying factors subject to the problem or domain desires. It may be necessary, then, in these situations to

Finding the number of clusters in ordered dissimilarities

consider a range of a’s, i.e. b0:01  nc  a  b0:15  nc; to assess the volatility of the returned values. However, in a general sense, we empower the user to determine what they would like to see: each choice of a is analogous to a different prescription for eyeglasses. For small values of a; i.e. 1\a  b0:05  nc; and for data where there may be a high degree of intra-clustering, such as D194 ; the cluster count should reflect the highlighted substructure. Larger values of a; i.e. b0:05  nc\a  b0:15  nc; provide a more global view of the data organization, in contrast.

8 Conclusions and further exploration

With the ability to automatically estimate the number of clusters in datasets, CCE is a promising addition to the growing number of tendency assessment algorithms; it has the potential to replace the need for human interpretation of the number of clusters. To support this claim, we conducted a number of experiments on both synthetic and real-world numerical data. In every case, the basic CCE implementation consistently produced cluster counts that matched those suggested by the VAT images. Moreover, CCE operates on a very generic type of relationship-based data, which implies that the approach can be applied in a variety of fields and to nearly any dataset.

Despite these successes, there were some flaws in the underlying image segmentation process. To address them, we explored a more advanced, spatially based thresholding algorithm, MHUE. From our observations, we conclude that this technique can overcome the failures of Otsu's algorithm. It also shows promise for locating secondary, geometric cluster structure, which may be useful for some datasets. For many VAT images, however, MHUE is excessive, since simpler thresholding approaches return similar binary images; we speculate that there are only a few instances where it is warranted. It is our hope that, by analyzing some information from $D^*$, such as the average pixel intensity, we will be able to automatically determine when MHUE, or other advanced techniques, should be used (a sketch of one possible form of such a rule follows at the end of this section).

Finally, we point out that it is easy to contrive datasets, such as hyperlinear or uniform random data, for which the VAT image suggests the wrong number of clusters. Unfortunately, this problem carries over to CCE. The good news, however, is that this is a defect in the VAT reordering process, not CCE, and other visual, block-based algorithms can be substituted. We have also explored various ways (Sledge et al. 2008a) to preemptively counteract nonlinear data and produce "correct" VAT images in these situations.
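On the last point about dispatching between thresholding schemes, a minimal Python sketch of what such an intensity-based rule might look like; both the scaling to [0, 1] and the 0.5 cutoff are assumptions made purely for illustration, not values established by our experiments.

```python
import numpy as np

def pick_thresholder(dstar: np.ndarray) -> str:
    """Illustrative dispatch between Otsu and MHUE based on the mean
    intensity of the scaled VAT image. The 0.5 cutoff, and even the
    direction of the comparison, are placeholder assumptions; tuning
    them is exactly the open question raised in the conclusions."""
    rng = float(dstar.max() - dstar.min())
    img = (dstar - dstar.min()) / rng if rng > 0 else np.zeros_like(dstar, dtype=float)
    return "MHUE" if img.mean() > 0.5 else "Otsu"
```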

Acknowledgments This work was funded by the National Science Foundation under ITR grant number IIS-0428420. The authors would also like to thank the reviewers for their insightful comments that helped to improve the quality of this manuscript.

References

Anderson DT, Luke RH, Keller JM, Skubic M (2008a) Modeling human activity from voxel person using fuzzy logic. IEEE Trans Fuzzy Syst (to appear)
Anderson DT, Luke RH, Keller JM, Skubic M, Rantz M, Aud M (2008b) Linguistic summarization of activities for fall detection using voxel person and fuzzy logic. Comp Vis Image Underst
Asuncion A, Newman DJ (2007) UCI machine learning repository. http://archive.ics.uci.edu/ml/
Atiquzzaman M (1992) Multiresolution Hough transform—an efficient method of detecting patterns in images. IEEE Trans Pattern Anal Mach Intell 14:1090–1095
Baumgartner R, Somorajai R, Summers R, Richter W, Ryner L (2000) Correlator beware: correlation has limited selectivity for fMRI data analysis. NeuroImage 12:240–243
Baumgartner R, Somorajai R, Summers R, Richter W (2001) Ranking fMRI time courses by minimum spanning trees: assessing coactivation in fMRI. NeuroImage 13:734–742
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New York
Bezdek JC, Hathaway RJ (2002) VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of the IEEE joint conference on neural networks
Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern 28:301–315
Bezdek JC, Hathaway RJ, Huband JM (2005) bigVAT: visual assessment of cluster tendency for large datasets. Pattern Recogn 38:1875–1886
Bezdek JC, Hathaway RJ, Huband JM (2006) Visual assessment of clustering tendency for rectangular dissimilarity matrices. IEEE Trans Fuzzy Syst 15:890–903
Borg I, Lingoes J (1987) Multidimensional similarity structure analysis. Springer, New York
Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8:679–698
Cattell RB (1944) A note on correlation clusters and cluster search methods. Psychometrika 9:169–184
Cleveland WS (1993) Visualizing data. Hobart Press, Summit
Dhillion I, Modha D, Spranger W (2000) Visualizing class structure of multidimensional data. In: Proceedings of the 30th symposium on the interface: computing science and statistics
Everitt BS (1978) Graphical techniques for multivariate data. Heinemann, London
Floodgate GD, Hayes PR (1963) The Adansonian taxonomy of some yellow pigmented marine bacteria. J Gen Microbiol 30:237–244
Gene Ontology Consortium (2004) The gene ontology (GO) database and informatics resource. Nucleic Acids Res 32:D258–D261
Gonzalez RC, Woods RE (2002) Digital image processing. Prentice-Hall, Upper Saddle River
Hathaway RJ, Bezdek JC (1994) NERF c-means: non-Euclidean relational fuzzy clustering. Pattern Recogn 27:429–437
Hathaway RJ, Bezdek JC (2006) Visual cluster validity for prototype generator clustering models. Pattern Recogn Lett 24:1563–1569
Hathaway RJ, Bezdek JC, Huband JM (2005) Scalable visual assessment of cluster tendency. Pattern Recogn 39:1315–1324
Havens TC, Bezdek JC, Keller JM, Popescu M (2008a) Dunn's cluster validity index as a contrast measure of VAT images. In: Proceedings of the IEEE international conference on pattern recognition
Havens TC, Bezdek JC, Keller JM, Popescu M, Huband JM (2008b) Is VAT really single linkage in disguise? Ann Math Artif Intell (in review)
Huband JM, Bezdek JC (2008) VCV—visual cluster validity. Pattern Recogn (in review)
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
Johnson RA, Wichern DA (1992) Applied multivariate statistical analysis, 3rd edn. Prentice-Hall, Englewood Cliffs
Kendall M, Gibbons JD (1990) Rank correlation methods. Oxford University Press, New York
Ling RF (1973) A computer generated aid for cluster analysis. Commun ACM 16:355–361
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9:757–763
Popescu M, Keller JM, Mitchell JA, Bezdek JC (2004) Functional summarization of gene product clusters using gene ontology similarity measures. In: Proceedings of the IEEE international conference on intelligent sensors, sensor networks and information processing
Saha PK, Udupa JK (2001) Optimum image thresholding via class uncertainty and region homogeneity. IEEE Trans Pattern Anal Mach Intell 23:689–706
Sledge IJ, Keller JM (2008) Growing neural gas for temporal clustering. In: Proceedings of the IEEE international conference on pattern recognition
Sledge IJ, Havens TC, Bezdek JC, Keller JM (2008a) Partitioning ordered dissimilarity data. IEEE Trans Knowl Data Eng (in review)
Sledge IJ, Huband JM, Bezdek JC (2008b) Automatic cluster count extraction from unlabeled datasets. In: Proceedings of the IEEE conference on fuzzy systems and knowledge discovery
Sledge IJ, Keller JM, Alexander GL (2008c) Emergent trend detection in diurnal activity. In: Proceedings of the IEEE engineering in medicine and biology conference
Sledge IJ, Keller JM, Havens TC, Alexander GL, Skubic M (2008d) Temporal activity analysis. In: Proceedings of the association for the advancement of artificial intelligence
Sledge IJ, Havens TC, Keller JM, Bezdek JC (2009) Relational generalizations of validity indexes. IEEE Trans Syst Man Cybern (in review)
Sneath P (1957) A computer approach to numerical taxonomy. J Gen Microbiol 17:201–226
Strehl A, Ghosh J (2000a) A scalable approach to balanced, high-dimensional clustering of market-baskets. In: Proceedings of the international conference on high performance computing
Strehl A, Ghosh J (2000b) Value-based customer grouping from large retail data-sets. In: Proceedings of the SPIE conference on data mining and knowledge discovery
Theodoridis S, Koutroumbas K (2003) Pattern recognition, 2nd edn. Elsevier, New York
Tran-Luu TD (1996) Mathematical concepts and novel heuristic methods for data clustering and visualization. Ph.D. thesis, University of Maryland, College Park
Tryon RC (1939) Cluster analysis. Edwards Bros., Ann Arbor
Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading
Wang W, Zhang Y (2007) On fuzzy cluster validity indices. Fuzzy Sets Syst 158:2095–2117