An Index Structure for Data Mining and Clustering

Xiong Wang (1), Jason T. L. Wang (2), King-Ip Lin (3), Dennis Shasha (4), Bruce A. Shapiro (5), Kaizhong Zhang (6)

(1) Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102 ([email protected])
(2) Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102 ([email protected])
(3) Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152 ([email protected])
(4) Courant Institute of Mathematical Sciences, New York University, New York, NY 10012 ([email protected])
(5) Laboratory of Experimental and Computational Biology, National Institutes of Health, Frederick, MD 21702 ([email protected])
(6) Department of Computer Science, The University of Western Ontario, London, Ontario, N6A 5B7, Canada ([email protected])
Abstract. In this paper we present an index structure, called MetricMap, that takes a set of objects and a distance metric and then maps those objects to a k-dimensional space in such a way that the distances among objects are approximately preserved. The index structure is a useful tool for clustering and visualization in data intensive applications, because it replaces expensive distance calculations by sum-of-square calculations. This can make clustering in large databases with expensive distance metrics practical. We compare the index structure with another data mining index structure, FastMap, recently proposed by Faloutsos and Lin, according to two criteria: relative error and clustering accuracy. For relative error, we show that (i) FastMap gives a lower relative error than MetricMap for Euclidean distances, (ii) MetricMap gives a lower relative error than FastMap for non-Euclidean distances (i.e., general distance metrics), and (iii) combining the two reduces the error yet further. A similar result is obtained when comparing the accuracy of clustering. These results hold for different data sizes. The main qualitative conclusion is that these two index structures capture complementary information about distance metrics and therefore can be used together to great benefit. The net effect is that multi-day computations can be done in minutes.

Keywords: biomedical applications, data engineering, distance metrics, knowledge discovery, visualization.

This work was supported in part by the National Science Foundation under grant numbers IRI-9531548 and IRI-9531554, and by the Natural Sciences and Engineering Research Council of Canada under grant number OGP0046373. A preliminary version of this paper was presented at the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining held in San Diego, California, in August 1999. Part of the work of this author was done while visiting the Courant Institute of Mathematical Sciences, New York University.
1 Introduction
In [6], Faloutsos and Lin proposed an index structure, called FastMap, for knowledge discovery, visualization and clustering in data intensive applications. The index structure takes a set of objects and a distance metric and maps the objects to points in a k-dimensional target space in such a way that the distances between objects are approximately preserved. One can then perform data mining and clustering operations on the k-dimensional points in the target space. Empirical studies indicated that FastMap works well for Euclidean distances [5, 6]. In a later paper, the inventors showed that a modification to FastMap could also help detect patterns using the time-warping distance (which is not even a metric, i.e., it does not satisfy the triangle inequality) [27]. In this paper we present an index structure, called MetricMap, that works in a similar way as FastMap. We conduct experiments to compare the performance of FastMap and MetricMap based on both Euclidean distance and general distance metrics. A general distance metric is a function δ that takes pairs of objects into real numbers, satisfying the following properties for any objects x, y, z: δ(x, x) = 0 and δ(x, y) > 0 if x ≠ y (nonnegative definiteness); δ(x, y) = δ(y, x) (symmetry); δ(x, y) ≤ δ(x, z) + δ(z, y) (triangle inequality). Euclidean distance satisfies these properties. On the other hand, many general distance metrics of interest are not Euclidean, e.g. the string edit distance as used in biology [17], document comparison [23] and the UNIX diff operator. Neither FastMap nor MetricMap (nor any other index structure that we know of) gives guaranteed performance for general distance metrics. For this reason, an experimental analysis is worthwhile. Section 2 surveys related work. Section 3 discusses the basic properties of FastMap and MetricMap. Section 4 compares the two index structures. Section 5 evaluates their performance in data clustering applications. Section 6 presents a further analysis of the index structures, addressing some tradeoff issues. Section 7 concludes the paper.
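As a concrete illustration of the three metric properties listed above, the following C++ fragment (a sketch of our own, not part of either index structure) checks nonnegative definiteness, symmetry and the triangle inequality for a user-supplied distance function over a small sample of objects; sampling-based checking of this kind is also how the synthetic non-Euclidean distances in Section 4.1 are filtered. The function name isMetricOnSample is ours.

    #include <cmath>
    #include <functional>
    #include <vector>

    // Checks the three metric axioms on a finite sample of objects.
    // dist is any user-supplied distance function; T is the object type.
    template <typename T>
    bool isMetricOnSample(const std::vector<T>& objs,
                          const std::function<double(const T&, const T&)>& dist) {
        const double eps = 1e-9;
        for (std::size_t i = 0; i < objs.size(); ++i) {
            if (std::fabs(dist(objs[i], objs[i])) > eps) return false;      // d(x, x) = 0
            for (std::size_t j = 0; j < objs.size(); ++j) {
                if (i != j && dist(objs[i], objs[j]) <= 0.0) return false;  // d(x, y) > 0 for x != y
                if (std::fabs(dist(objs[i], objs[j]) - dist(objs[j], objs[i])) > eps)
                    return false;                                           // symmetry
                for (std::size_t l = 0; l < objs.size(); ++l)               // triangle inequality
                    if (dist(objs[i], objs[j]) >
                        dist(objs[i], objs[l]) + dist(objs[l], objs[j]) + eps)
                        return false;
            }
        }
        return true;
    }

    int main() {
        std::vector<double> sample = {0.0, 1.5, 4.0, 9.0};
        std::function<double(const double&, const double&)> absDiff =
            [](const double& a, const double& b) { return std::fabs(a - b); };
        return isMetricOnSample(sample, absDiff) ? 0 : 1;  // |a - b| is a metric on the reals
    }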
2 Related Work
Clustering is an important operation in data mining [1, 4, 22, 25]. Clustering algorithms can be broadly classified into two categories: partitional and hierarchical [11, 12]. A partitional algorithm partitions the objects into a collection of a user-specified number of clusters. A hierarchical algorithm is an iterative process, which either merges small clusters into larger ones, starting with atomic clusters containing single objects, or divides the set of objects into subunits,
until some termination condition is met. These algorithms have been studied extensively by researchers in different communities, including statistics [7], pattern recognition [3, 11], machine learning [14], and databases. In particular, data intensive clustering algorithms include CLARANS [15], BIRCH [28], DBSCAN [4], STING [25], WaveCluster [21], CURE [10], CLIQUE [1], etc. For example, the recently published CURE algorithm [10] utilizes multiple representatives for each cluster. The representatives are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. This enables the algorithm to adjust well to clusters having non-spherical shapes and wide variances in size. CURE is designed to handle points (vectors) in k-d space only, not general distance metric spaces, and is therefore considered a vector-based clustering algorithm. It employs a combination of random sampling and partitioning to handle large datasets. The algorithm is a typical hierarchical one, which starts with each input point as a separate cluster and at each successive step merges the closest pair of clusters. By contrast, the popular K-means and K-medoid methods are partitional algorithms. These methods determine K cluster representatives and assign each object to the cluster whose representative is closest to the object, such that the sum of the squared distances between the objects and their representatives is minimized. The methods work for both Euclidean distance and general distance metrics, and are therefore considered distance-based clustering algorithms. Extensions of the partitional methods for large and spatial databases, some vector-based and some distance-based, are presented in [12, 15, 28]. In contrast to the above work, FastMap and MetricMap employ the approach of mapping objects to points in a k-dimensional (k-d) target space Rk and then clustering the points in Rk. The main benefit provided by this approach is that it saves time in distance computation. Calculating the actual distances among the objects is much more expensive than measuring the dissimilarities among the points in Rk (we use "dissimilarity", rather than "distance", throughout the paper since there may be a negative dissimilarity value between two points in the target space). This is particularly true for new, emerging applications in multimedia and scientific computing. As an example, comparing two RNA secondary structures may require a dynamic programming algorithm [19] or a genetic algorithm [18] that runs in seconds or minutes on a current workstation. The presented mapping approach is useful not only for data mining and cluster analysis, but also for visualization and retrieval in large datasets [5, 6].
3 FastMap and MetricMap: a brief comparison
Consider a set of objects D = {O0, O1, . . . , ON−1} and a distance function d where, for any two objects Oi, Oj ∈ D, d(Oi, Oj) (or di,j for short) represents the distance between Oi and Oj. The function d can be Euclidean or a general distance metric. Both FastMap and MetricMap take the set of objects and some inter-object distances, and embed the objects in a k-d space Rk (k is user-defined) such that the distances among the objects are approximately preserved. The k-d point Pi corresponding to the object Oi is called the image of Oi. The k-d space containing the images is called the target space. The differences between the two index structures lie in the algorithm they use for embedding and the target space they choose. FastMap embeds the objects in a Euclidean space, whereas MetricMap embeds them in a pseudo-Euclidean space [9, 13]. Below we describe the two index structures and their properties; proof details can be found in [6, 26].

3.1 The FastMap Algorithm
The basic idea of this algorithm is to project the objects on a line (Oa, Ob) in an n-dimensional (n-d) space Rn for some unknown n, n ≥ k. The line is formed by two pivot objects Oa, Ob, chosen as follows. First arbitrarily choose one object and let it be the second pivot object Ob. Let Oa be the object that is farthest apart from Ob. Then update Ob to be the object that is farthest apart from Oa. The two resulting objects Oa, Ob are the pivots. Consider an object Oi and the triangle formed by Oi, Oa and Ob (Figure 1). From the cosine law, one can get

    d_{b,i}^2 = d_{a,i}^2 + d_{a,b}^2 − 2 x_i d_{a,b}        (1)

Thus, the first coordinate x_i of object Oi with respect to the line (Oa, Ob) is

    x_i = (d_{a,i}^2 + d_{a,b}^2 − d_{b,i}^2) / (2 d_{a,b})        (2)
Fig. 1. Illustration of the projection method used in FastMap: the object Oi is projected onto the line (Oa, Ob) formed by the pivots, yielding the coordinate x_i.
Now we can extend the above projection method to embed objects in the target space Rk as follows. Pretending that the given objects are indeed points in Rn, we consider an (n − 1)-d hyper-plane H that is perpendicular to the line (Oa, Ob), where Oa and Ob are the two pivot objects. We then project all the objects onto this hyper-plane. Let Oi, Oj be two objects and let Oi', Oj' be their projections on the hyper-plane H. It can be shown that the dissimilarity d' between Oi', Oj' is

    (d'(Oi, Oj))^2 = (d(Oi, Oj))^2 − (x_i − x_j)^2,   i, j = 0, . . . , N − 1        (3)

Being able to compute d' allows one to project on a second line, lying on the hyper-plane H and therefore orthogonal to the first line (Oa, Ob). We repeat the steps recursively, k times, thus mapping all objects to points in Rk. The discussion thus far assumes that the objects are indeed points in Rn. If the assumption does not hold, (d(Oi, Oj))^2 − (x_i − x_j)^2 may become negative. For this case, Equation (3) is modified as follows:

    d'(Oi, Oj) = −sqrt((x_i − x_j)^2 − (d(Oi, Oj))^2)        (4)

Let Oi, Oj be two objects in D and let Pi = (x_i^1, . . . , x_i^k), Pj = (x_j^1, . . . , x_j^k) be their images in the target space Rk. The dissimilarity between Pi and Pj, denoted df(Pi, Pj), is calculated as

    df(Pi, Pj) = sqrt( Σ_{l=1}^{k} (x_i^l − x_j^l)^2 )        (5)
Note that if the objects are indeed points in Rn, n ≥ k, and the distance function d is Euclidean, then from Equation (3), FastMap guarantees a lower bound on inter-object distances. That is,

Proposition 3.1   df(Pi, Pj) ≤ d(Oi, Oj).

Let Cost_fastmap denote the total number of distance calculations required by FastMap. From Equations (2) and (3) and the way the pivot objects are chosen, we have

    Cost_fastmap = 3Nk        (6)

where N is the size of the dataset and k is the dimensionality of the target space.
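To make the projection concrete, the following C++ sketch gives our reading of the FastMap embedding loop: at each of the k levels it picks two pivots with the heuristic described above, assigns every object a coordinate with Equation (2), and evaluates residual squared distances following Equations (3) and (4). This is a sketch under the stated equations, not the authors' implementation; a production version would cache the distances to the pivot objects so that only the 3Nk distance calculations of Equation (6) are performed.

    #include <cmath>
    #include <functional>
    #include <vector>

    // Squared residual distance between objects i and j at the current level:
    // the original squared distance minus what the coordinates found so far explain
    // (Equations (3) and (4)); the value may become negative for non-Euclidean input.
    static double residual2(int i, int j, int level,
                            const std::vector<std::vector<double>>& X,
                            const std::function<double(int, int)>& dist) {
        double v = dist(i, j) * dist(i, j);
        for (int m = 0; m < level; ++m) {
            double diff = X[i][m] - X[j][m];
            v -= diff * diff;
        }
        return v;
    }

    // Minimal FastMap sketch: embeds N objects into k coordinates stored in X.
    void fastMap(int N, int k, const std::function<double(int, int)>& dist,
                 std::vector<std::vector<double>>& X) {
        X.assign(N, std::vector<double>(k, 0.0));
        for (int level = 0; level < k; ++level) {
            // Pivot heuristic: take an arbitrary object, let Oa be the farthest
            // object from it, then re-choose Ob as the object farthest from Oa.
            int ob = 0, oa = 0;
            for (int i = 0; i < N; ++i)
                if (residual2(i, ob, level, X, dist) > residual2(oa, ob, level, X, dist)) oa = i;
            for (int i = 0; i < N; ++i)
                if (residual2(i, oa, level, X, dist) > residual2(ob, oa, level, X, dist)) ob = i;
            double dab2 = residual2(oa, ob, level, X, dist);
            if (dab2 <= 0.0) break;                    // pivots coincide: stop early
            double dab = std::sqrt(dab2);
            for (int i = 0; i < N; ++i)                // Equation (2)
                X[i][level] = (residual2(oa, i, level, X, dist) + dab2
                               - residual2(ob, i, level, X, dist)) / (2.0 * dab);
        }
    }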
3.2 The MetricMap Algorithm
The algorithm works by first choosing a small sample A of 2k objects from the dataset. In choosing the sample, one can either pick it randomly or use the 2k pivot objects found by FastMap. The algorithm calculates the pairwise distances among the sampling objects and uses these distances to establish the target space Rk. The algorithm then maps all objects in the dataset to points in Rk. Specifically, assume, without loss of generality, that A = {O0, . . . , O2k−1}. We define a mapping α as follows: α : A → R^{2k−1} such that α(O0) = a_0 = (0, . . . , 0) and α(Oi) = a_i = (0, . . . , 0, 1, 0, . . . , 0) with the 1 in the ith position, 1 ≤ i ≤ 2k − 1 (see Figure 2(a)). Intuitively, we map O0 to the origin and map the other sampling objects to vectors (points)
{a_i}, 1 ≤ i ≤ 2k − 1, in R^{2k−1}, so that each of the sampling objects corresponds to a base vector in R^{2k−1}. Let

    M(ψ) = (m_{i,j})_{1≤i,j≤2k−1}        (7)

where

    m_{i,j} = (d_{i,0}^2 + d_{j,0}^2 − d_{i,j}^2) / 2,   1 ≤ i, j ≤ 2k − 1        (8)

Define the function ψ as follows: ψ : R^{2k−1} × R^{2k−1} → R such that

    ψ(x, y) = x^T M(ψ) y        (9)

where x^T is the transpose of vector x. Notice that ψ(a_i, a_j) = m_{i,j}, 1 ≤ i, j ≤ 2k − 1. The function ψ is called a symmetric bilinear form of R^{2k−1} [9]. M(ψ) is the matrix of ψ with respect to the basis {a_i}, 1 ≤ i ≤ 2k − 1. The vector space R^{2k−1} equipped with the symmetric bilinear form ψ is called a pseudo-Euclidean space. For any two points (vectors) x, y ∈ R^{2k−1}, ψ(x, y) is called the inner product of x and y. The squared distance between x and y, denoted ||x − y||^2, is defined as

    ||x − y||^2 = ψ(x − y, x − y)        (10)

This squared distance is used to measure the dissimilarity of two points in the pseudo-Euclidean space. Since the matrix M(ψ) is real symmetric, there is an orthogonal matrix Q = (q_{i,j})_{1≤i,j≤2k−1} and a diagonal matrix D = diag(λ_i)_{1≤i≤2k−1} such that

    Q^T M(ψ) Q = D        (11)

where Q^T is the transpose of Q, the λ_i are the eigenvalues of M(ψ) arranged in some order, and the columns of Q are the corresponding eigenvectors [8]. Note that if the matrix M(ψ) has negative eigenvalues, the squared distance between two points in the pseudo-Euclidean space may be negative. That is why we never speak of the "distance" between points in a pseudo-Euclidean space. Now we find a ψ-orthogonal basis of R^{2k−1}, {e_i}, 1 ≤ i ≤ 2k − 1, where

    (e_1, . . . , e_{2k−1}) = (a_1, . . . , a_{2k−1}) Q        (12)

or equivalently

    (a_1, . . . , a_{2k−1}) = (e_1, . . . , e_{2k−1}) Q^T        (13)

Each vector a_i, 1 ≤ i ≤ 2k − 1, can be represented as a vector in the space spanned by {e_i}, and the coordinate of a_j with respect to {e_i} is the jth row of Q (see Figure 2(b)). Each e_i corresponds to an eigenvector. Suppose the eigenvalues are sorted in descending order by their absolute values, followed by the zero eigenvalues. The MetricMap algorithm reduces the dimensionality of R^{2k−1} to obtain the subspace Rk by removing the k − 1 dimensions along which the eigenvalues λ_i of M(ψ) are zero or have the smallest absolute values (see Figure 2(c)). Notice that among the remaining k dimensions, some may have negative eigenvalues.
Fig. 2. Illustration of the MetricMap algorithm (k = 2): (a) mapping objects to base vectors of R^{2k−1}; (b) transforming the basis; (c) reducing dimensionality.
The algorithm then chooses k + 1 objects, called the reference objects, that span Rk. Once the target space Rk is established, the algorithm maps each object O∗ in the dataset to a point (vector) P∗ in the target space by comparing the object with the reference objects. The coordinate of P∗ is calculated through matrix multiplication, as follows. Assume, without loss of generality, that the reference objects are O0, O1, . . . , Ok. Let

    b = (n_{∗,j})_{1≤j≤k}        (14)

where

    n_{∗,j} = (d_{∗,0}^2 + d_{j,0}^2 − d_{∗,j}^2) / 2,   1 ≤ j ≤ k        (15)

Define

    sign(λ_i) = 1 if λ_i > 0;  0 if λ_i = 0;  −1 if λ_i < 0        (16)

That is, sign(λ_i) is the sign of the ith eigenvalue λ_i. Let J = diag(sign(λ_i))_{1≤i≤2k−1} and C = diag(c_i)_{1≤i≤2k−1}, where

    c_i = |λ_i| if λ_i ≠ 0, and c_i = 1 otherwise        (17)

Let J[k] be the kth leading principal submatrix of the matrix J, i.e. J[k] = diag(sign(λ_i))_{1≤i≤k}. Let C[k] be the kth leading principal submatrix of the matrix C, i.e. C[k] = diag(c_i)_{1≤i≤k}. Let Q[kk] be the kth leading principal submatrix of the orthogonal matrix Q, i.e. Q[kk] = (q_{i,j})_{1≤i,j≤k}. The coordinate of P∗ in Rk, denoted Coor(P∗), can be approximated as follows:

    Coor(P∗) ≈ J[k] C[k]^{−1/2} Q[kk]^{−1} b        (18)
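The mapping step of Equation (18) reduces to a small matrix-vector product once the target space has been built. The following C++ sketch (ours; the argument names are hypothetical, and it assumes Q[kk]^{−1} and the retained eigenvalues were precomputed when the sample was processed) maps a single object O∗ to its image.

    #include <cmath>
    #include <vector>

    // Maps one object O* to a point in R^k following Equations (14)-(18).
    // dref0 : distances d(O_j, O_0) from reference object O_j to O_0, j = 1..k
    // dstar : distances d(O*, O_j) to the reference objects, j = 0..k
    // Qinv  : the k x k matrix Q[kk]^{-1}, precomputed when the target space was built
    // lambda: the k retained eigenvalues of M(psi)
    std::vector<double> mapObject(const std::vector<double>& dref0,
                                  const std::vector<double>& dstar,
                                  const std::vector<std::vector<double>>& Qinv,
                                  const std::vector<double>& lambda) {
        int k = (int)lambda.size();
        std::vector<double> b(k), coor(k, 0.0);
        for (int j = 1; j <= k; ++j)                               // Equations (14)-(15)
            b[j - 1] = (dstar[0] * dstar[0] + dref0[j - 1] * dref0[j - 1]
                        - dstar[j] * dstar[j]) / 2.0;
        for (int i = 0; i < k; ++i) {                              // Equation (18)
            double y = 0.0;
            for (int j = 0; j < k; ++j) y += Qinv[i][j] * b[j];    // (Q[kk]^{-1} b)_i
            double c = (lambda[i] != 0.0) ? std::sqrt(std::fabs(lambda[i])) : 1.0;
            double s = (lambda[i] > 0.0) ? 1.0 : (lambda[i] < 0.0 ? -1.0 : 0.0);
            coor[i] = s * y / c;                                   // apply J[k] C[k]^{-1/2}
        }
        return coor;
    }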
Let Oi, Oj be two objects in D and let Pi = (x_i^1, . . . , x_i^k), Pj = (x_j^1, . . . , x_j^k) be their images in Rk. Let

    ∆(Pi, Pj) = Σ_{l=1}^{k} sign(λ_l) (x_i^l − x_j^l)^2        (19)

The dissimilarity between Pi and Pj, denoted dm(Pi, Pj), is approximated by

    dm(Pi, Pj) ≈ sqrt(∆(Pi, Pj)) if ∆(Pi, Pj) ≥ 0;  dm(Pi, Pj) ≈ −sqrt(−∆(Pi, Pj)) otherwise        (20)

Note that if the objects are points in Rn, n ≥ k, and the distance function d is Euclidean, then as in FastMap, MetricMap guarantees a lower bound on inter-object distances. That is,

Proposition 3.2   dm(Pi, Pj) ≤ d(Oi, Oj).
To see this, note that in Euclidean spaces the bilinear form ψ is positive definite, because for any non-zero vector x, x^T M(ψ) x is positive [16]. This implies that all the non-zero eigenvalues are positive. When projecting the points from Rn onto Rk, the images have fewer coordinates. From Equations (19) and (20), we conclude that the dissimilarity between two images is less than or equal to the distance between the corresponding objects.

Let Cost_metricmap denote the total number of distance calculations required by MetricMap. From Equations (7), (8) and (11), we see that to calculate the eigenvalues of M(ψ), one needs to calculate the pairwise distances d_{i,j}, 0 ≤ i, j ≤ 2k − 1. This requires (2k)^2 = 4k^2 distance calculations. From Equations (14), (15) and (18), we see that to embed each object O∗ in Rk, one needs to calculate the distances from O∗ to the k + 1 reference objects. Notice that if O∗ is a sampling object, its distances to the reference objects need not be recalculated, since they are part of the distances d_{i,j}, 0 ≤ i, j ≤ 2k − 1, that are already computed. In total there are N objects in the dataset, and therefore

    Cost_metricmap = 4k^2 + (N − 2k)(k + 1)        (21)

Comparing Equations (6) and (21), since N ≥ k, Cost_metricmap ≤ Cost_fastmap.
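A minimal C++ sketch of the target-space dissimilarity of Equations (19) and (20), together with the two cost formulas, may make the comparison concrete; the function names are ours.

    #include <cmath>
    #include <vector>

    // Dissimilarity between two images in the pseudo-Euclidean target space,
    // following Equations (19) and (20). 'sign' holds sign(lambda_l) for the
    // k retained eigenvalues (+1, 0 or -1).
    double pseudoEuclideanDissim(const std::vector<double>& p,
                                 const std::vector<double>& q,
                                 const std::vector<int>& sign) {
        double delta = 0.0;
        for (std::size_t l = 0; l < p.size(); ++l) {
            double diff = p[l] - q[l];
            delta += sign[l] * diff * diff;            // Equation (19)
        }
        return delta >= 0.0 ? std::sqrt(delta)         // Equation (20)
                            : -std::sqrt(-delta);
    }

    // Distance-calculation counts of the two mappers (Equations (6) and (21)).
    long costFastMap(long N, long k)   { return 3 * N * k; }
    long costMetricMap(long N, long k) { return 4 * k * k + (N - 2 * k) * (k + 1); }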
4 Precision of Embedding
We conducted a series of experiments to evaluate the precision of embedding by calculating the errors induced by the index structures. The index structures were implemented in C and C++ under the UNIX operating system running on a SPARC 20. Four sets of distances were generated: synthetic Euclidean, synthetic non-Euclidean, protein and RNA. The last three were general distance metrics, so they satisfied the triangle inequality, but were not Euclidean.

4.1 Data
In creating synthetic Euclidean distances, we generated N n-dimensional vectors. Each vector was generated by choosing n real numbers randomly and uniformly from the interval [LowBound..HighBound]. We then calculated the pairwise distances among the vectors. In creating synthetic non-Euclidean distances, we generated the pairwise distances among N objects randomly and uniformly in the interval [MinDistance..MaxDistance], keeping only those objects that satisfied the triangle inequality, as in [20]. Table 1 summarizes the parameters and base values used in the experiments. In generating protein distances, we selected a set of 230 kinase sequences obtained from the protein database at the Cold Spring Harbor Laboratory. We used the string edit distance to measure the dissimilarity of two proteins [17]. The inter-protein distances were in the interval (1..2573).
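For concreteness, a minimal C++ sketch of the synthetic Euclidean data generation described at the beginning of this subsection is given below; the function and parameter names (makeVectors, lowBound, highBound) are ours, and the choice of random number generator is an assumption.

    #include <cmath>
    #include <random>
    #include <vector>

    // Generates N synthetic n-dimensional vectors with coordinates drawn uniformly
    // from [lowBound, highBound], as used for the Euclidean test data.
    std::vector<std::vector<double>> makeVectors(int N, int n,
                                                 double lowBound, double highBound,
                                                 unsigned seed = 42) {
        std::mt19937 gen(seed);
        std::uniform_real_distribution<double> coord(lowBound, highBound);
        std::vector<std::vector<double>> v(N, std::vector<double>(n));
        for (auto& row : v) for (auto& x : row) x = coord(gen);
        return v;
    }

    // Euclidean distance between two such vectors.
    double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }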
Table 1: Parameters and base values used in the experiments for evaluating the precision of embedding.

    Parameter     Value   Description
    k             15      Dimensionality of the target space
    N             3,000   Number of objects in the dataset
    n             20      Dimensionality of synthetic vectors in Euclidean space
    LowBound      0       Smallest possible value for each coordinate of the synthetic vectors
    HighBound     100     Largest possible value for each coordinate of the synthetic vectors
    MinDistance   1       Minimum distance between objects for the synthetic non-Euclidean data
    MaxDistance   100     Maximum distance between objects for the synthetic non-Euclidean data

In generating RNA distances, we used 200 RNA secondary structures obtained from the virus database in the National Cancer Institute. The RNA secondary structures were created by first choosing two phylogenetically related mRNA sequences, rhino 14 and cox5, from GenBank [2], pertaining to the human rhinovirus and coxsackievirus. The 5' non-coding region of each sequence was folded and 100 secondary structures of that sequence were collected. The structures were then transformed into trees and their pairwise distances were calculated as described in [19, 24]. The trees had between 70 and 180 nodes. The distances for rhino 14's trees and cox5's trees were in the interval (1..75) and (1..60), respectively. The distances between rhino 14's trees and cox5's trees were in the interval (43..94). The secondary structures (trees) for each sequence roughly formed a cluster.

4.2 Experimental Results
Let Oi, Oj be two objects in D and let Pi, Pj be their images in Rk. The dissimilarity between Pi, Pj embedded by FastMap, denoted df(Pi, Pj), was as in Equation (5). The dissimilarity between Pi, Pj embedded by MetricMap, denoted dm(Pi, Pj), was as in Equation (20). To understand whether the index structures might complement each other, we considered three combinations of the index structures: AvgMap, MinMap and MaxMap, with the dissimilarities da, dn, dx defined as follows:

    da(Pi, Pj) = (df(Pi, Pj) + dm(Pi, Pj)) / 2        (22)
    dn(Pi, Pj) = min{df(Pi, Pj), dm(Pi, Pj)}          (23)
    dx(Pi, Pj) = max{df(Pi, Pj), dm(Pi, Pj)}          (24)

We collectively refer to all these index structures as mappers. In building the mappers, we used random sampling objects for MetricMap to establish the target space (cf. Section 3.2). Note that all the mappers have the same asymptotic cost O(Nk), cf. Equations (6) and (21). The measure used for evaluating the precision of embedding was the average relative error (Errr), defined as

    Errr = ( Σ_{Oi,Oj ∈ D} |d(Oi, Oj) − |ds(Pi, Pj)|| ) / ( Σ_{Oi,Oj ∈ D} d(Oi, Oj) ) × 100%        (25)

where s = f, m, a, n, x, respectively. One would like this percentage to be as low as possible: the lower Errr is, the better the corresponding mapper performs.

Figure 3 graphs Errr as a function of the dimensionality of the target space, k, for the synthetic Euclidean data. The parameters have the values shown in Table 1. We see that for all the mappers, Errr drops as k increases; Errr approaches 0 when k = 19. FastMap performs better than MetricMap, but MaxMap dominates in all situations. From Propositions 3.1 and 3.2, both FastMap and MetricMap underestimate inter-object distances, so MaxMap gives the lowest average relative error among all the mappers.
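For reference, Equations (22) through (25) translate directly into code; the following C++ sketch (function names ours) computes the combined dissimilarities and the average relative error of a mapper.

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // Dissimilarities of the combined mappers, Equations (22)-(24).
    double avgMap(double df, double dm) { return (df + dm) / 2.0; }
    double minMap(double df, double dm) { return std::min(df, dm); }
    double maxMap(double df, double dm) { return std::max(df, dm); }

    // Average relative error of a mapper over all object pairs, Equation (25).
    // d  : true inter-object distance; ds : the mapper's dissimilarity.
    double averageRelativeError(int N,
                                const std::function<double(int, int)>& d,
                                const std::function<double(int, int)>& ds) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = i + 1; j < N; ++j) {   // each unordered pair counted once
                num += std::fabs(d(i, j) - std::fabs(ds(i, j)));
                den += d(i, j);
            }
        return den > 0.0 ? 100.0 * num / den : 0.0;
    }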
Fig. 3. Average relative errors of the mappers (FastMap, MetricMap, AvgMap, MinMap, MaxMap) as a function of the dimensionality of the target space for synthetic Euclidean data.
We next examined the scalability of the results. Figure 4 compares FastMap, MetricMap and MaxMap for varying N, Figure 5 compares the three mappers for varying n, and Figure 6 plots Errr as a function of HighBound/LowBound for the three mappers. In each figure, only one parameter is varied and the other parameters have the values shown in Table 1. The LowBound in Figure 6 is fixed at 1. It can be seen that Errr depends on the dimensionality of the vectors n, but is independent of the dataset size N and of the coordinate ranges of the vectors. MaxMap consistently beats the other two mappers in all these figures.
Fig. 4. Effect of dataset size for synthetic Euclidean data (FastMap, MetricMap, MaxMap).
We then compared the relative performance of the mappers using the synthetic non-Euclidean data. Figure 7 graphs Errr as a function of the dimensionality of the target space k. The parameters have the values shown in Table 1. The figure shows that MetricMap outperforms FastMap, while AvgMap is superior to both of them. As k increases, the performance of MetricMap improves while the performance of FastMap degrades. The larger k is, the more negative dissimilarity values FastMap produces, cf. Equation (4), and consequently the more biased projections it creates. Note that MetricMap also produces negative dissimilarity values during the projection. It performs better probably because the images' coordinates are calculated by matrix multiplication through a single projection, rather than through a series of projections as in FastMap, and hence the effect of these negative dissimilarity values is reduced. The next two figures show the scalability of the results. Figure 8 compares the relative performance of FastMap, MetricMap and AvgMap for varying N, and Figure 9 plots Errr as a function of ln(MaxDistance/MinDistance) for the three mappers. The k value in both figures is fixed at 1000 and the MinDistance in Figure 9 is fixed at 10. The other parameters have the values shown in Table 1. It can be seen that Errr depends on the dataset size N, but is independent of the distance ranges. Clearly, AvgMap is the best for all the non-Euclidean data.
Fig. 5. Average relative errors of the mappers as a function of the dimensionality of vectors for synthetic Euclidean data.

Fig. 6. Effect of coordinate ranges for synthetic Euclidean data.
Both FastMap and MetricMap may overestimate or underestimate some inter-object distances. The fact that AvgMap outperforms either one individually is a good indication of the complementarity of the two index structures. The trends observed from the protein and RNA data are similar to those from the synthetic data. We omit the results for protein and only present those for RNA secondary structures (Figure 10). In sum, MaxMap is best for Euclidean data; its performance depends on the dimensionality of vectors n, but is independent of the dataset size N. AvgMap is best for non-Euclidean data; its performance depends on the dataset size.
Fig. 7. Average relative errors of the mappers as a function of the dimensionality of the target space for synthetic non-Euclidean data.

Fig. 8. Effect of dataset size for synthetic non-Euclidean data.
Both mappers' performance improves as the dimensionality of the target space k increases. For Euclidean data, MaxMap's Errr drops to 0 as k approaches n. For non-Euclidean data, AvgMap's Errr approaches 0 when k = N/2, i.e. when all the 2k = N data objects are used in the sample to establish the target space. The last set of experiments examined the feasibility of retrieval with MaxMap and AvgMap. Let ds, s = a, x, represent the dissimilarity measures for the two mappers, cf. Equations (22) and (24).
Fig. 9. Effect of distance ranges for synthetic non-Euclidean data.

Fig. 10. Average relative errors of the mappers as a function of the dimensionality of the target space for RNA secondary structures.
We randomly picked an object Oc and considered the sphere G1 with Oc as the centroid and a properly chosen ε as the radius, i.e. G1 contained all the objects O where d(O, Oc) ≤ ε. Let Pc be the image of Oc. G2 represented the sphere in the target space that contained all the images P where |ds(P, Pc)| ≤ ε. Let Oi be an object and let Pi be its image. We say Oi is a false positive if Pi ∈ G2 whereas Oi ∉ G1; Oi is a false negative if Pi ∉ G2 but Oi ∈ G1. The performance measure used was the accuracy (Accu), defined as

    Accu = ( |G1| + |G2| − (Np + Nn) ) / ( |G1| + |G2| ) × 100%        (26)

where |Gi|, i = 1, 2, was the size of Gi, Np was the number of false positives, and Nn was the number of false negatives. One would like this percentage to be as high as possible: the higher Accu is, the fewer false positives and negatives there are, and therefore the better a mapper performs.

Figure 11 illustrates MaxMap's performance for the synthetic Euclidean data and Figure 12 illustrates AvgMap's performance for the synthetic non-Euclidean data. The four curves represent four different dataset sizes (N = 1,000, 10,000, 100,000, 1,000,000, respectively, in Figure 11, and N = 2,000, 3,000, 4,000, 5,000, respectively, in Figure 12). The four points on each curve correspond to four different k values. The Accu plotted in the figures is the average value over all the N spheres, where each sphere uses a different object as the centroid. The radius of a sphere is fixed at 50, i.e. ε = 50. The X-axis shows the CPU time spent in embedding the objects. From the figures we see that as the dimensionality of the target space, k, increases, both the time and the accuracy increase. For the Euclidean data with 20-dimensional vectors, Accu approaches 100% when k = 18. For the non-Euclidean data, Accu approaches 100% when k = 1,000. These results indicate that with the two best mappers, one can conduct the range search [5, 6] on the k-dimensional points by embedding the query object in the target space and then considering the sphere with the query object as the centroid in the target space. Embedding the data objects can be performed in the off-line stage, thus reducing the search time significantly.
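The accuracy of Equation (26) can be evaluated with a single pass over the dataset; the following C++ sketch (ours) counts the false positives and false negatives for one query sphere.

    #include <cmath>
    #include <functional>

    // Range search accuracy, Equation (26). G1 holds the objects within radius
    // epsilon of the query object in the original space, G2 the images within
    // epsilon of the query image in the target space.
    double rangeSearchAccuracy(int N, int center, double epsilon,
                               const std::function<double(int, int)>& d,    // object distance
                               const std::function<double(int, int)>& ds) { // image dissimilarity
        int g1 = 0, g2 = 0, falsePos = 0, falseNeg = 0;
        for (int i = 0; i < N; ++i) {
            bool inG1 = d(i, center) <= epsilon;
            bool inG2 = std::fabs(ds(i, center)) <= epsilon;
            g1 += inG1; g2 += inG2;
            if (inG2 && !inG1) ++falsePos;
            if (!inG2 && inG1) ++falseNeg;
        }
        if (g1 + g2 == 0) return 100.0;   // both spheres empty: vacuously accurate
        return 100.0 * double(g1 + g2 - falsePos - falseNeg) / double(g1 + g2);
    }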
Fig. 11. Accuracy of MaxMap for synthetic Euclidean data (curves for N = 1,000, 10,000, 100,000, 1,000,000; points for k = 6, 10, 14, 18; X-axis: log(time) in seconds).
Fig. 12. Accuracy of AvgMap for synthetic non-Euclidean data (curves for N = 2,000, 3,000, 4,000, 5,000; points for k = 700, 800, 900, 1000; X-axis: log(time) in seconds).
5 Clustering
In this section we evaluate the accuracy of clustering in the presence of imprecise embedding. The purpose is twofold. First, this study shows the feasibility of clustering without performing expensive distance calculations. Second, through the study, one can understand how imprecision in the embedding may affect the accuracy of clustering.

5.1 Data
The data used in the experiments included the RNA secondary structures described in Section 4.1, because they roughly formed 2 clusters, each corresponding to an mRNA sequence; the RNA distance is non-Euclidean. In addition, we generated Euclidean clusters as follows. We built p = q^2 clusters as in [28]. Specifically, we generated q groups of n-dimensional vectors from an n-dimensional hypercube. The vectors were generated as described in Section 4.1. Each group had C vectors. Initially the groups (clusters) might overlap. We considered all the q groups as sitting on the same line and moved them apart along the line by adding a constant (i × c), 1 ≤ i ≤ q, to the first coordinate of all the vectors in the ith group; c was a tunable parameter. We used CURE [10] to adjust the clusters so that they were not too far apart. Specifically, c was chosen to be the minimum value by which CURE could just separate the q clusters; in our case, c = 1.15. Once the first q clusters were generated, we moved to the second line, which was parallel to the first line, and generated another q clusters along the second line. This step was repeated until all the q lines were generated, each line comprising
q clusters. Again we used CURE to adjust the distance between the lines so that they were not too far apart. Table 2 summarizes the parameters and base values used in the experiments.

Table 2: Parameters and base values used in the experiments for evaluating the accuracy of clustering Euclidean vectors.

    Parameter   Value   Description
    k           10      Dimensionality of the target space
    p           4       Number of clusters
    n           20      Dimensionality of synthetic vectors
    C           100     Number of vectors in a cluster

5.2 Experimental Results
The clustering algorithm used in our experiments was the well known average-group method [12], which works as follows. Initially, every object is a cluster. The algorithm merges the two nearest clusters to form a new cluster, until there are only K clusters left, where K is p for the Euclidean clusters and 2 for the RNA data. The distance between two clusters C1 and C2 is given as

    (1 / (|C1||C2|)) Σ_{Op ∈ C1, Oq ∈ C2} |d(Op, Oq)|        (27)

where |Ci|, i = 1, 2, is the size of cluster Ci. The algorithm requires O(N^2) distance calculations, where N is the total number of objects in the dataset. An object O is said to be mis-clustered if O is in a cluster C created by the average-group method in the original space, but its image is not in C's corresponding cluster created by the average-group method in the target space. The performance measure we used was the mis-clustering rate (Errc), defined as

    Errc = Nc / N × 100%        (28)

where Nc is the number of mis-clustered objects.
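For reference, a plain C++ sketch of the average-group method with Equation (27) follows. It is our own naive rendering: it recomputes the group-average distances at every merge, whereas an implementation that caches the pairwise dissimilarities needs only the O(N^2) distance calculations mentioned above. The dissimilarity d passed in can be either the original metric or a mapper's target-space dissimilarity, which is exactly how clustering on the images avoids the expensive distance computations.

    #include <cmath>
    #include <functional>
    #include <vector>

    // Average-group distance between two clusters, Equation (27):
    // the mean absolute dissimilarity over all cross-cluster pairs.
    static double groupAverage(const std::vector<int>& c1, const std::vector<int>& c2,
                               const std::function<double(int, int)>& d) {
        double sum = 0.0;
        for (int p : c1) for (int q : c2) sum += std::fabs(d(p, q));
        return sum / (double(c1.size()) * double(c2.size()));
    }

    // Average-group agglomerative clustering sketch: every object starts as its
    // own cluster and the two nearest clusters are merged until K remain.
    std::vector<std::vector<int>> averageGroupClustering(int N, int K,
            const std::function<double(int, int)>& d) {
        std::vector<std::vector<int>> clusters(N);
        for (int i = 0; i < N; ++i) clusters[i] = {i};
        while ((int)clusters.size() > K) {
            int bi = 0, bj = 1;
            double best = groupAverage(clusters[0], clusters[1], d);
            for (std::size_t i = 0; i < clusters.size(); ++i)
                for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                    double v = groupAverage(clusters[i], clusters[j], d);
                    if (v < best) { best = v; bi = (int)i; bj = (int)j; }
                }
            clusters[bi].insert(clusters[bi].end(), clusters[bj].begin(), clusters[bj].end());
            clusters.erase(clusters.begin() + bj);
        }
        return clusters;
    }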
Figure 13 graphs Errc as a function of the dimensionality of the target space, k, for the Euclidean clusters, and Figure 14 shows the results for the RNA data. The parameters have the values shown in Table 2. For the Euclidean data, the average-group method successfully found the 4 clusters in the dataset. For the RNA data, the average-group method missed 5 objects in the dataset (i.e., the 5 RNA secondary structures were not detected to belong to their corresponding sequence's cluster). The images of these 5 objects were also missed in the target space; they were excluded when calculating Errc. As in Section 4.2, the clustering performance improves as the dimensionality of the target space increases, because the embedding becomes more precise. Figure 13 shows that the Errc values of all the mappers approach 0 when k = 9.
Fig. 13. Mis-clustering rates of the mappers as a function of the dimensionality of the target space for synthetic Euclidean data.

Fig. 14. Mis-clustering rates of the mappers as a function of the dimensionality of the target space for RNA data.
Figure 14 shows that MetricMap outperforms FastMap; its Errc approaches 0 when k = 80. Overall, MaxMap is best for the Euclidean data and AvgMap is best for the non-Euclidean RNA data. The results indicate that with the two best mappers, one can perform clustering on the k-dimensional points. Embedding the data objects can be performed in the off-line stage, thus reducing the clustering time significantly. It is worth pointing out that one may achieve accurate clustering even with an imprecise embedding. For example, in Figure 13, the clustering accuracy
is over 90% when k = 2, though the relative errors for the k = 2 case are over 50% (cf. Figure 3). This happens because, after the embedding is performed, objects that are close to each other in the original space remain close in the target space, even though the distances are underestimated significantly. We next examined the scalability of the results using the Euclidean clusters. Figure 15 compares FastMap, MetricMap and MaxMap for varying numbers of clusters, Figure 16 compares them for varying sizes of clusters, and Figure 17 compares the mappers for varying dimensionalities of the vectors in each cluster. With higher-dimensional vectors (e.g. 60 dimensions) and more clusters, the average-group method missed several objects in the dataset. Nevertheless, MaxMap consistently gives the lowest mis-clustering rate in all the figures.
Fig. 15. Impact of the number of clusters.
To see how different clustering techniques might affect the performance, we also conducted experiments using other clustering algorithms, e.g. the single-linkage and complete-linkage methods [12]. The two methods work in a similar way as the average-group method; the differences lie in the way they calculate the distance between two clusters. In the single-linkage algorithm, the distance between two clusters C1 and C2 is given as

    min_{Op ∈ C1, Oq ∈ C2} |d(Op, Oq)|        (29)

In the complete-linkage algorithm, the distance is given as

    max_{Op ∈ C1, Oq ∈ C2} |d(Op, Oq)|        (30)
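The two linkages are drop-in replacements for the group-average distance in the clustering sketch shown earlier; a minimal rendering (ours) follows.

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // Single-linkage inter-cluster distance, Equation (29).
    double singleLinkage(const std::vector<int>& c1, const std::vector<int>& c2,
                         const std::function<double(int, int)>& d) {
        double best = std::fabs(d(c1.front(), c2.front()));
        for (int p : c1) for (int q : c2) best = std::min(best, std::fabs(d(p, q)));
        return best;
    }

    // Complete-linkage inter-cluster distance, Equation (30).
    double completeLinkage(const std::vector<int>& c1, const std::vector<int>& c2,
                           const std::function<double(int, int)>& d) {
        double best = 0.0;
        for (int p : c1) for (int q : c2) best = std::max(best, std::fabs(d(p, q)));
        return best;
    }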
Fig. 16. Impact of the size of clusters.
Fig. 17. Effect of the dimensionality of vectors in a cluster.
The results were slightly worse. The reason is that these two methods use the distance between a specific pair of objects, as opposed to the average distance between the objects in the two clusters; the error incurred in measuring the distance between that specific pair may seriously affect the clustering accuracy. Finally, we conducted experiments in which the random sampling objects used by MetricMap were replaced with the 2k pivot objects found by FastMap. The performance of MetricMap improves for the Euclidean data, but degrades for the non-Euclidean data. MaxMap and AvgMap remain the best, as in the random sampling case.
6 Discussion
Since MaxMap and AvgMap are derived from both FastMap and MetricMap, their cost is approximately the sum of the costs of FastMap and MetricMap. Figure 3 shows that when the dimensionality of the target space, k, increases, the relative errors of the mappers decrease. On the other hand, increasing k also increases the embedding cost (cf. Figure 11). One may wonder whether using MaxMap and AvgMap with a smaller k is better than using FastMap and MetricMap with a bigger k when they all have approximately the same cost. We have conducted experiments to answer this question. Figures 18 and 19 depict the running times of the mappers as a function of k for synthetic Euclidean and non-Euclidean data, respectively. The dataset size N was 5000 for the Euclidean data and 3000 for the non-Euclidean data. It can be seen from the figures that the costs of the mappers are proportional to the dimensionality of the target space k. The cost of MaxMap and AvgMap with a k-dimensional target space is approximately the same as the cost of FastMap with a 2k-dimensional target space. Comparing with Figure 3, we see that using FastMap or MetricMap with a 2k-dimensional target space yields a smaller relative error than using MaxMap with a k-dimensional target space for the synthetic Euclidean data. On the other hand, comparing with Figure 7, we see that using AvgMap with a k-dimensional target space achieves a more precise embedding than using FastMap or MetricMap with a 2k-dimensional target space for the synthetic non-Euclidean data. In general, there is a tradeoff between the embedding cost and the embedding precision. Recall that the asymptotic cost of all the mappers is O(Nk), where N is the size of the dataset. In the case of Euclidean data, k is independent of N. One can achieve a very precise embedding when k approaches the original dimensionality n of the vectors. Thus for a very large dataset of N Euclidean vectors, we can build a precise mapper (e.g. MaxMap) with a relatively low, asymptotically O(N), cost. On the other hand, for the non-Euclidean data, the precision of the embedding depends on N. In order to build a precise mapper (e.g. AvgMap), k should be close to N/2, which leads to an O(N^2) cost asymptotically. We have experimented with different distance functions in the paper. Our approach can also be applied to nominal values when a proper metric is defined for these values. Nominal values are identified by their names and do not have numeric values; the colors of eyes [12] are an example. Colors are represented by hexadecimal numbers in the sRGB (standard Red Green Blue) model as used for Web pages. sRGB is a default color space for the Internet proposed by Hewlett-Packard and Microsoft, and accepted by the W3C as a standard. Each color corresponds to six hexadecimal digits, which are decomposed into three pairs.
Fig. 18. Running times of the mappers as a function of the dimensionality of the target space for synthetic Euclidean data.
Fig. 19. Running times of the mappers as a function of the dimensionality of the target space for synthetic non-Euclidean data.
Each pair corresponds to a primary color. One can define the distance between two different colors as the sum of the differences between the corresponding components. Specifically, let c1 = x11 x12 x13 and c2 = x21 x22 x23 be two colors, where xij, i = 1, 2, j = 1, 2, 3, denotes two hexadecimal digits. We define the distance between c1 and c2, denoted d(c1, c2), as

    d(c1, c2) = Σ_{j=1}^{3} |x1j − x2j|        (31)
For example, suppose the color "blue" corresponds to 00 00 FF, the color "black" corresponds to 00 00 00, and the color "green" corresponds to 00 80 00. The distance between black eyes and blue eyes is |00 − 00| + |00 − 00| + |FF − 00| = FF in hexadecimal, or 255 in decimal. Similarly, the distance between green eyes and blue eyes is |00 − 00| + |00 − 80| + |FF − 00| = 017F in hexadecimal, or 383 in decimal. Clearly, for any three colors c1, c2 and c3, we have d(c1, c2) > 0 if c1 ≠ c2, d(c1, c1) = 0, d(c1, c2) = d(c2, c1), and d(c1, c2) ≤ d(c1, c3) + d(c3, c2). Thus d is a metric and our approach is applicable.
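The color metric of Equation (31) is easy to check in code; the following small C++ program (ours) reproduces the two example distances above.

    #include <cstdio>
    #include <cstdlib>

    // Distance between two sRGB colors, Equation (31): the sum of the absolute
    // differences of the three primary-color components.
    int colorDistance(unsigned c1, unsigned c2) {                // colors as 0xRRGGBB
        int dist = 0;
        for (int shift = 16; shift >= 0; shift -= 8) {
            int a = (c1 >> shift) & 0xFF, b = (c2 >> shift) & 0xFF;
            dist += std::abs(a - b);
        }
        return dist;
    }

    int main() {
        std::printf("%d\n", colorDistance(0x0000FF, 0x000000));  // blue vs. black: 255
        std::printf("%d\n", colorDistance(0x0000FF, 0x008000));  // blue vs. green: 383
        return 0;
    }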
7 Conclusion
In this paper we have presented an index structure, MetricMap, and compared it with the previously published index structure FastMap [6]. The two index structures take a set of N objects and a distance metric d, and embed those objects in a target space Rk, k ≤ N, in such a way that the distances among objects are approximately preserved. FastMap considers Rk to be Euclidean; MetricMap considers Rk to be pseudo-Euclidean. Both index structures perform the embedding at an asymptotic cost O(Nk). We have conducted experiments to evaluate the accuracy of the embedding and the accuracy of clustering for the two index structures. The experiments were based on synthetic data as well as protein and virus datasets obtained from the Cold Spring Harbor Laboratory and the National Cancer Institute. Our results showed that MetricMap complements FastMap. In every case, combining the two index structures performs better than using either one alone. Specifically, FastMap is more accurate than MetricMap for Euclidean distances, but taking the maximum of the distances (we use the term dissimilarities because some of these values can be negative) gives the best accuracy of all. MetricMap is more accurate than FastMap for non-Euclidean distances, but the average of the dissimilarities is best of all. Besides the 4 datasets mentioned here, we have confirmed these results on 3 other datasets taken from dictionary words and other protein sequences. The practical significance of this work is that the proper use of these index structures can reduce the computation time substantially, thus achieving high efficiency for data mining and clustering applications.
Acknowledgments

We thank the anonymous reviewers and the executive editor, Dr. Xindong Wu, for their thoughtful comments and suggestions that helped to improve the paper. We also thank Dr. Tom Marr and Wojciech Kasprzak for providing the protein and RNA data used in the experiments.
References

1. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 94-105, Seattle, WA, 1998.
2. C. Burks, M. Cassidy, M. J. Cinkosky, K. E. Cumella, P. Gilna, J. E.-D. Hayden, G. M. Keen, T. A. Kelley, M. Kelly, D. Kristofferson, J. Ryals. GenBank. Nucleic Acids Research, 19, 2221-2225, 1991.
3. R. O. Duda, P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.
4. M. Ester, H.-P. Kriegel, J. Sander, X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, Portland, Oregon, 1996.
5. C. Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Publishers, Norwell, MA, 1996.
6. C. Faloutsos, K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163-174, San Jose, CA, 1995.
7. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Inc., San Diego, CA, 1990.
8. G. H. Golub, C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 1996.
9. W. Greub. Linear Algebra. Springer-Verlag, Inc., New York, NY, 1975.
10. S. Guha, R. Rastogi, K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 73-84, Seattle, WA, 1998.
11. A. K. Jain, R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
12. L. Kaufman, P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
13. P. D. Lax. Linear Algebra. John Wiley & Sons, Inc., New York, NY, 1997.
14. R. S. Michalski, R. E. Stepp. Learning from observation: Conceptual clustering. In: R. S. Michalski, J. G. Carbonell, T. M. Mitchell (eds), Machine Learning: An Artificial Intelligence Approach, volume I, pages 331-363. Morgan Kaufmann, 1983.
15. R. T. Ng, J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 144-155, Santiago, Chile, 1994.
16. J. M. Ortega. Matrix Theory. Plenum Press, New York, 1987.
17. D. Sankoff, J. B. Kruskal (eds). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.
18. B. A. Shapiro, J. Navetta. A massively parallel genetic algorithm for RNA secondary structure prediction. Journal of Supercomputing, 8, 195-207, 1994.
19. B. A. Shapiro, K. Zhang. Comparing multiple RNA secondary structures using tree comparisons. Computer Applications in the Biosciences, 6(4), 309-318, 1990.
20. D. Shasha, T. L. Wang. New techniques for best-match retrieval. ACM Transactions on Information Systems, 8(2), 140-158, 1990.
21. G. Sheikholeslami, S. Chatterjee, A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 428-439, New York City, New York, 1998.
22. J. T. L. Wang, B. A. Shapiro, D. Shasha (eds). Pattern Discovery in Biomolecular Data: Tools, Techniques and Applications. Oxford University Press, New York, NY, 1999.
23. J. T. L. Wang, D. Shasha, G. Chang, L. Relihan, K. Zhang, G. Patel. Structural matching and discovery in document databases. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 560-563, Tucson, AZ, 1997.
24. J. T. L. Wang, K. Zhang, K. Jeong, D. Shasha. A system for approximate tree matching. IEEE Transactions on Knowledge and Data Engineering, 6(4), 559-571, 1994.
25. W. Wang, J. Yang, R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 186-195, Athens, Greece, 1997.
26. Y. Yang, K. Zhang, X. Wang, J. T. L. Wang, D. Shasha. An approximate oracle for distance in metric spaces. In: M. Farach-Colton (ed), Combinatorial Pattern Matching, pages 104-117. Lecture Notes in Computer Science, Springer-Verlag, 1998.
27. B.-K. Yi, H. V. Jagadish, C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In International Conference on Data Engineering, Orlando, Florida, 1998.
28. T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada, 1996.
Xiong Wang received his M.S. degree in computer science from Fudan University, Shanghai, People’s Republic of China, and the Ph.D. degree in computer science from the New Jersey Institute of Technology. He is currently a special lecturer in the Computer and Information Science Department at the New Jersey Institute of Technology. His research interests include scientific databases, knowledge bases, information retrieval and mining in high dimensional databases. He is a member of ACM and IEEE.
Jason Tsong-Li Wang received the B.S. degree in mathematics from National Taiwan University, Taipei, Taiwan, and the Ph.D. degree in computer science from the Courant Institute of Mathematical Sciences, New York University. Currently, he is an associate professor in the Computer and Information Science Department at the New Jersey Institute of Technology and Director of the University’s Data and Knowledge Engineering Laboratory. He is also a member of the Ph.D. in Management faculty at the Graduate School of Rutgers University, Newark. Dr. Wang’s research interests include data mining and databases, pattern analysis, computational biology, and information retrieval on the Internet. He has published 80 papers in conference proceedings, books, and journals including ACM Transactions on Database Systems, ACM Transactions on Information Systems, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Pattern Analysis and Machine Intelligence, Nucleic Acids Research, Protein Engineering, and Journal of Computational Biology. Dr. Wang is an editor and author of the book Pattern Discovery in Biomolecular Data (Oxford University Press), an associate editor of Knowledge and Information Systems (Springer-Verlag), on the Editorial Advisory Board of Information Systems (Elsevier/Pergamon Press), and the guest coeditor of the special issue on biocomputing of International Journal of Artificial Intelligence Tools (World Scientific). He is listed in Who’s Who among Asian Americans, Who’s Who in Science and Engineering and is a member of New York Academy of Sciences, ACM, IEEE, AAAI, and SIAM. Kaizhong Zhang received the M.S. degree in mathematics from Peking University, Beijing, People’s Republic of China, in 1981, and the M.S. and Ph.D. degrees in computer science from the Courant Institute of Mathematical Sciences, New York University, New York, NY, USA, in 1986 and 1989, respectively. Currently, he is an associate professor in the Department of Computer Science, University of Western Ontario, London, ON, Canada. His research interests include pattern recognition, computational biology, sequential and parallel algorithms.