A Disc-based Approach to Data Summarization and Privacy Preservation

Rong Ge, Martin Ester, Wen Jin, Zengjian Hu
Simon Fraser University
E-mail: [email protected]

Abstract

Data summarization has been recognized as a fundamental operation in database systems and data mining with important applications such as data compression and privacy preservation. While existing methods such as CF-values and Data Bubbles may perform reasonably well, they cannot provide any guarantees on the quality of their results. In this paper, we introduce a summarization approach for numerical data based on discs, formalizing the notion of quality. Our objective is to find a minimal set of discs, i.e., spheres satisfying a radius and a significance constraint, covering the given dataset. Since the proposed problem is NP-complete, we design two different approximation algorithms. These algorithms have a quality guarantee, but they do not scale well to large databases. However, the machinery from approximation algorithms allows a precise characterization of a further, heuristic algorithm. This heuristic, efficient algorithm exploits multi-dimensional index structures and can be well integrated with database systems. The experiments show that our heuristic algorithm generates summaries that outperform the state-of-the-art Data Bubbles in terms of internal measures as well as in terms of external measures when using the data summaries as input for clustering methods.

1 Introduction

With the ever-increasing amount of information stored in databases, data summarization has emerged as an independent task and plays a very important role in database and data mining applications. Data summarization techniques such as MicroClusters or Data Bubbles have recently attracted much attention in the database and data mining communities. The idea of MicroClusters [34] is to partition a dataset into a set of sphere-like "small" clusters together with useful statistics such as the number of objects, the sum of the objects' values in each dimension, the squared sum of the objects' values, etc. Using MicroClusters, a large dataset can be compressed to scale up many expensive data mining algorithms. For instance, the MicroCluster technique has been used for clustering large databases [7, 9, 8, 16, 23, 25, 31, 30, 34, 22], clustering of data streams [1, 2, 27, 14, 15] and classification of large databases and streams [3, 33]; MicroClusters can also be employed for fast density estimation [35], outlier detection [21, 28] in large databases, and so on.

The state-of-the-art methods for generating MicroClusters are (1) the CF-tree-based approach of BIRCH [34] and (2) the Data Bubble approach [8]. In the first approach, the MicroClusters are obtained from the CF-values in the leaf nodes after sequentially inserting objects into the CF-tree. The second approach refines the MicroClusters generated in (1) by reassigning every object to the MicroCluster with the nearest center. Although both approaches have proven to be effective in many applications, the state-of-the-art methods have no explicit definition of the quality of a data summarization and no guarantee for achieving high quality. CF-values limit the diameter of a MicroCluster, but the radius is defined based on the average distance to the MicroCluster center, which allows outliers to be accommodated in a MicroCluster. Thus, the actual radius of a MicroCluster can become large. Furthermore, in the existing methods there is no lower limit on the number of points represented by a MicroCluster, i.e., a MicroCluster may not be significant.

In this paper, we investigate the formal definition of data summarization quality and present algorithms for generating high-quality summarizations. We use the following two typical scenarios as motivating applications.

Scenario I: Data Compression. In data mining applications, such as cluster analysis, we often have to deal with very large datasets that do not fit into memory. By summarizing the dataset, we obtain a considerably smaller number of MicroClusters that can be stored and mined in main memory. In addition, data summarization can also be used to greatly speed up data mining algorithms even when the original dataset can be loaded into main memory, since these algorithms have superlinear runtime complexities.

Scenario II: Privacy Preservation. A large medical organization is considering releasing its personal medical data to the public, so that researchers can make use of this resource to carry out their research. Laws or other regulations


such as HIPAA (the US standard for privacy preservation of health-related data; see http://www.hhs.gov/ocr/hipaa) require a certain minimum generalization of the original, detailed attribute values. One solution is to first summarize the personal data and then release only the generated MicroClusters. The challenge is to avoid privacy breaches while maximizing the accuracy of the subsequent data analysis based on the MicroClusters.

Motivated by the above scenarios, we explore the quality of MicroClusters from both the theoretical and the practical perspective. Like the existing methods, we restrict this study to numerical data; note that in many applications the attributes are either numerical or can be mapped to a numerical domain. To distinguish our notion of MicroClusters from the literature, we call them MicroSpheres. The contributions of this paper are as follows:

1. We formally introduce the problem of finding optimal MicroSpheres and show its NP-completeness. To the best of our knowledge, this is the first attempt to theoretically investigate the quality of data summarization in the context of MicroClusters.

2. We present two approximation algorithms for generating optimal MicroSpheres, and prove that one is a PTAS (polynomial-time approximation scheme) and the other is an O(m)-approximation algorithm.

3. We develop an efficient algorithm (FITS) on top of multi-dimensional index structures to find high-quality MicroSpheres in very large secondary-storage databases.

4. We show that the FITS algorithm obtains superior results compared to the state-of-the-art methods in a thorough experimental evaluation.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 introduces our problem definition, complexity analysis and approximation algorithms. In Section 4, we present an efficient algorithm to generate high-quality MicroSpheres from very large databases. The experimental results are reported in Section 5. We conclude the paper in Section 6 with a summary and interesting directions for future research.

2 Related Work

Data summarization in terms of MicroClusters originated from BIRCH [34], where MicroClusters are the entries in the leaf nodes of a CF-tree. Since the generated MicroClusters have no significance guarantee, the authors of [8] propose (1) a nearest neighborhood search and (2) a random sampling method to generate MicroClusters. Jagadish et al. [20] introduce the notion of Fascicles, which are subsets of a relation that share similar values, with a constraint t on the width of the range of values (for numeric attributes) or the number of distinct values (for categorical attributes) in each attribute, and another constraint m on their size. Fascicles require more a priori knowledge of the datasets, and the hyper-rectangular shape of a Fascicle is different from the sphere-like shape of the MicroSphere defined in our paper. Tung et al. present the constrained clustering problem [30], which is to find k clusters of n objects such that each cluster has at least c objects and the total distance of the objects to the cluster centers is minimized; it has no diameter/radius requirement per cluster, and the number of clusters is fixed. Feder et al. [12] study the problems of finding the optimal cluster size for a fixed number of clusters and finding the optimal number of clusters for a fixed cluster size. Several PTAS algorithms are proposed based on the box decomposition method [32], but they do not consider the constraint of a minimum number of points per cluster. Hochbaum et al. work on the geometric disc covering problem [18], where a d-dimensional space of points is covered by a minimum number of discs with fixed diameter. An approximation algorithm is proposed based on the shifting strategy, which proceeds with an integer parameter l, and it is proved that the algorithm can achieve a performance ratio $\le (1 + 1/l)^d$ in $O\big(l^d (l\sqrt{d})^d (2n)^{d(l\sqrt{d})^d + 1}\big)$ steps. This work has no constraint on the minimum number of objects per disc.

The MicroClustering problem is also related to the k-anonymity model [29]: if every generalized record is indistinguishable from k − 1 other records, the modified table is k-anonymized. Different from our problem, k-anonymity is mainly targeted towards categorical data and imposes no constraint on the extension of a group. Meyerson et al. [24] proved that the optimal k-anonymization problem is NP-hard if k > 2 and the size of the domain of attribute values is greater than or equal to the number of records in the table. Based on the idea of k-anonymity, a condensation approach for data perturbation has been presented in [4], mapping the original dataset into multiple groups of a pre-defined size k. This method could combine k very close objects into a group with a small radius, so the personal information could be easily estimated.

3 Optimal MicroSpheres and Approximation Algorithms

The MicroSphere problem is motivated in Section 3.1 and formally introduced in Section 3.2. In Section 3.3, algorithms to generate such MicroSpheres are given.

3.1 Motivation

As mentioned in the introduction, MicroClusters can be applied in different scenarios, e.g., clustering very large datasets and privacy-preserving data mining.


When MicroClusters are used as a compression tool for clustering, typically a distance metric on data points is used. The question is then how to define the distance between two MicroClusters with different sizes and densities, which is crucial to the clustering quality. A size distortion occurs when the distance calculation considers only the centers of MicroClusters with different sizes/densities. Some heuristics [8] have been proposed to relieve the distortion caused by largely varying MicroCluster sizes. However, the problem is not totally resolved. To address this, we propose to fix the radius of each MicroCluster; note that the size distortion completely disappears when the radius of each MicroCluster is the same. Besides, to cope with the effect of different densities of MicroClusters, we add another constraint on their significance (minimum number of points). Finally, to achieve a high compression factor, we aim to minimize the number of MicroClusters with fixed radius.

Our model can also be applied to privacy-preserving data mining. When we make a database with sensitive data accessible to the public, the original records cannot be distributed due to privacy concerns. However, if the database is summarized into MicroClusters, we can use the MicroClusters for privacy-preserving data mining. One of the well-established approaches perturbs the individual attributes independently by adding random noise [5]. While this approach may destroy inter-attribute relationships, the MicroCluster-based approach can preserve these relationships, which are important for many data mining tasks, much better.

To prevent privacy breaches, we must lower-bound the number of points of the MicroClusters by m. Then, each record can be viewed as "anonymized" along with m − 1 other records in the same MicroCluster. Furthermore, a minimum radius constraint avoids a potential problem of the k-anonymity paradigm, which could group together k very similar, even duplicate, records, leaking their private attributes. This problem is illustrated with two different datasets in Figure 1 (a) and (c); Figure 1 (b) and (d) depict the corresponding MicroClusters with a minimum radius.

Figure 1. Comparison with k-anonymity.

We can either publish the centers and radii of the MicroClusters or randomly generate points within the MicroClusters and publish these points. In both cases, the minimum radius leads to a much wider confidence interval for estimating the sensitive private attribute values, and thus to a higher degree of privacy preservation than the k-anonymity model. The larger the radius of a MicroCluster, the larger the degree of privacy preservation, but the more accuracy is lost when using the MicroClusters for any data mining task. Thus, with the minimum radius r_min satisfied, to achieve the best accuracy of the subsequent data analysis, we ought to fix the radius of the MicroClusters to r_min. All of the above motivates our new concept of MicroClusters, called MicroSpheres, which are spheres (or discs) with a fixed radius covering a minimum number of points.

3.2 Problem Definition

Let R and Z denote the sets of real numbers and integers, respectively, and let R^+ and Z^+ be the sets of positive real numbers and positive integers. For any points p, q ∈ R^d, the distance of p and q, denoted by dist(p, q), is defined to be the Euclidean distance, i.e., $dist(p, q) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}$. We use P ⊂ R^d to represent the set of all data points. For any point c ∈ R^d and r ∈ R^+, we define the disc C = (c, r) as the set containing all data points in P whose distances to the center c are smaller than or equal to r (note that c does not have to be in P; for dimension d > 2, a disc is also called a d-sphere, and we use the two terms interchangeably). All such points are called covered by the disc C. Finally, we define the capacity of a particular disc C to be the number of points in P which are covered by C.

Before we present our problem definition, we first introduce the classical covering with discs problem [18].

Definition (Covering With Discs) Given a set of points in a d-dimensional space, find a minimally sized set of discs (or d-spheres) of radius r covering all points.

In the following, we generalize the classical disc covering problem with an additional constraint m on the significance of each disc, to obtain a new problem definition which fits our scenarios. For ease of illustration, we define the optimal MicroSpheres problem in a geometric way, as the Minimum Geometric m-Disc Cover (minGDC_m): given a set of n points in a d-dimensional space, a radius r, and a constraint m ≥ 1 representing the minimum number of points in each disc, the aim is to use the least number of discs (or d-spheres) with radius r such that each point is covered by at least one of the discs, each of which also covers at least m − 1 other points. Throughout the paper, d and m are treated as constants, and a disc covering at least m points is called an m-disc for simplicity.

Definition (Minimum Geometric m-Disc Cover) Given a set of n points P = {p_1, . . . , p_n} ⊂ R^d, m ∈ Z^+, r ∈ R^+,


find a minimum set of discs $C_1, \ldots, C_q$ with radius r such that each individual disc covers at least m points of P, and such that every point of P is covered by at least one disc.
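To make the definitions of capacity and m-disc cover concrete, here is a small Python sketch of a validity check (our own illustration, not part of the paper; all function names are ours):

```python
import math

def dist(p, q):
    """Euclidean distance between two d-dimensional points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def capacity(center, r, points):
    """Capacity of the disc (center, r): the number of data points it covers."""
    return sum(1 for p in points if dist(center, p) <= r)

def is_m_disc_cover(centers, r, m, points):
    """True iff every disc is an m-disc and every point is covered."""
    if any(capacity(c, r, points) < m for c in centers):
        return False
    return all(any(dist(c, p) <= r for c in centers) for p in points)
```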

Remark. When m = 1, Minimum Geometric m-Disc Cover specializes to the original Covering With Discs problem, also named the Minimal Geometric Disc Cover problem (minGDC). We can obtain the NP-completeness of minGDC_m by a simple Turing reduction.

3.3 Approximation Algorithms

In this subsection, we present three results. First, we show that deciding whether a set of points in a 2-dimensional space can be covered by a set of discs with capacity ≥ m is polynomial-time computable; the algorithm can easily be extended to arbitrary dimensions. Second, we illustrate how to design a (1+ε)-approximation algorithm for the Minimum Geometric m-Disc Cover problem with running time polynomial in n and 1/ε. Finally, we propose a faster greedy algorithm which can achieve an O(m)-approximation ratio in $O(mn^{d+1})$ time.

3.3.1 Polynomial Algorithm for Deciding the Existence of a Geometric m-Disc Cover

Not every dataset has a geometric m-disc cover, due to outliers, i.e., points which cannot be covered by an m-disc of the fixed radius. Hence a natural question is whether such a cover exists. In fact, the problem can be transformed to testing, for every point p, the existence of a disc which covers p and satisfies the significance constraint. For simplicity, we only show the 2-d case, as the same result holds analogously for higher dimensions. Before introducing our algorithm, we need a simple observation from [18]:

Observation 3.1 For any disc covering at least 2 points, there always exists another disc with the same radius covering the same set of points and having two points on its border.

This is clearly true, as one can always slide the disc to make it touch at least one point, and then rotate the disc until it meets another point. Furthermore, there are only two different ways to draw a circle of a given radius through two given points. Hence, we can design Algorithm 1.

Algorithm 1 Existence Checking
INPUT: A set P of n points in a 2-dimensional space, significance m and radius r.
OUTPUT: True if there is a Geometric m-Disc Cover of P, false otherwise.
1: Enumerate all pairs of points (p1, p2) from P.
2: for each pair (p1, p2) s.t. dist(p1, p2) ≤ 2r do
3:   Calculate the two discs C1, C2 of radius r having both p1 and p2 on their borders
4:   for i = 1, 2 do
5:     Find Pi, the set of points in P which are covered by Ci
6:     if |Pi| ≥ m then
7:       Mark all points in Pi as "covered"
8:     end if
9:   end for
10: end for
11: if all points in P are marked as "covered" then
12:   return TRUE
13: else
14:   return FALSE
15: end if

The running time is dominated by the loop of line 2, which takes $O(n^2)$ iterations, each requiring O(n) steps to find the covered points (line 5). So the time complexity of the above algorithm is $O(n^3)$ for the 2-d case. To show correctness: if there exists a point p which cannot be covered by any m-disc, clearly p will not be marked as "covered" and the algorithm must return false. Otherwise, for any point p, by Observation 3.1, there must exist an m-disc which covers p and has at least two points on its border. Since we enumerate

all such discs (line 1), our algorithm would certainly output true in this case. Note that for dimension d > 2, the algorithm enumerates all the discs with 2, 3, . . . , d points on their borders. The total number of such discs is $2\binom{n}{2} + \cdots + \binom{n}{d} = O(n^d)$, so the total runtime is $O(n^{d+1})$.
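The following Python sketch illustrates the 2-d existence check in the spirit of Algorithm 1 (our own code, not the paper's implementation; following Observation 3.1 it only enumerates discs through pairs of points, so, like the pseudocode, it assumes m ≥ 2):

```python
import math

def circle_centers(p1, p2, r):
    """Centers of the (at most two) radius-r circles through p1 and p2 in 2-d."""
    (x1, y1), (x2, y2) = p1, p2
    d = math.hypot(x2 - x1, y2 - y1)
    if d == 0 or d > 2 * r:
        return []
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    h = math.sqrt(max(r * r - (d / 2) ** 2, 0.0))
    ux, uy = -(y2 - y1) / d, (x2 - x1) / d   # unit vector perpendicular to p1->p2
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]

def has_m_disc_cover(points, m, r):
    """O(n^3) existence check for a Geometric m-Disc Cover (2-d case)."""
    points = [tuple(p) for p in points]   # points as hashable tuples
    covered = set()
    for i, p1 in enumerate(points):
        for p2 in points[i + 1:]:
            for c in circle_centers(p1, p2, r):
                inside = [q for q in points if math.dist(q, c) <= r + 1e-9]
                if len(inside) >= m:
                    covered.update(inside)
    return len(covered) == len(points)
```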

3.3.2 The PTAS

Next we explain how to design a PTAS algorithm for our problem. The core part is based on a unified and powerful approach named the "shifting strategy", introduced by Hochbaum and Maass [18], which has been widely used for devising PTASs for a number of NP-complete problems, such as Minimum Geometric Disc Cover [18] and Maximum Weighted Independent Set for Intersection Graphs of Geometric Discs [10]. The shifting strategy elegantly bounds the error of this divide-and-conquer approach.

Figure 2. Shifting Strategy.

The shifting strategy, as demonstrated in Figure 2, proceeds with a parameter l. It first partitions the whole space into strips of width D = 2r in one dimension, then groups consecutive strips into groups of width l · D and shifts them along this dimension, finally resulting in l different ways of partitioning the whole space into groups of l · D strips. For every dimension we apply the same partitioning procedure to divide the whole space


into independent regions (hypercubes) of the same size l · D. The authors proved in [18] that if we could obtain the optimal solution for all the hypercubes, their union would be a covering using at most $(1 + 1/l)^d$ times the global optimal number of discs. We adopt the shifting strategy to approximate our geometric m-disc covering problem; the problem then becomes finding the optimal m-disc covering for each hypercube. To begin, we prove the following lemma (see [17] for the proofs of the lemma and the theorem):

Lemma 3.2 The number of diameter-D m-discs needed to cover a d-dimensional hypercube with edge length l · D is at most $(m-1)(l\sqrt{d})^d$.

As an example, consider Figure 3(a): each 2·D × 2·D square can be covered by 12 discs with diameter D. Hence, 12(m − 1) m-discs are sufficient to cover all points belonging to that square, no matter how many points are in the square and how they are distributed.

Figure 3. Left (a): Covering cubes. Right (b): Covering circles.
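For illustration, here is a minimal sketch (our own, under the assumption that points are given as coordinate tuples) of the l shifted ways of grouping width-D strips along one dimension; applying it to every dimension yields the independent hypercubes:

```python
def shifted_partitions(points, dim, D, l):
    """Yield, for each of the l shifts, a dict mapping a block index to the
    points falling into that block of l consecutive width-D strips."""
    for shift in range(l):
        groups = {}
        for p in points:
            block = int((p[dim] + shift * D) // (l * D))
            groups.setdefault(block, []).append(p)
        yield groups
```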

Theorem 3.3 There is a PTAS for Minimum Geometric m-Disc Cover with running time $O\big((m-1)\, l^d (l\sqrt{d})^d\, n^{(m-1)(l\sqrt{d})^d + 1}\big)$ and performance ratio $\le (1 + 1/l)^d$, where m is a constant.

3.3.3 A Faster Greedy Algorithm

The PTAS algorithm can find a set of covering discs with a cardinality arbitrarily close to the optimum in polynomial time. Yet, it is interesting to ask whether we can obtain such discs in less runtime with a possibly limited sacrifice in quality. This motivates us to design a faster algorithm (called GAAM) with a guaranteed approximation ratio of O(m). Similar to the PTAS, GAAM proceeds in a divide-and-conquer fashion. It includes two phases. Phase I, inspired by the classic k-center algorithm of Hochbaum et al. [19], partitions all points into groups in such a way that covering the points of each group requires at least one distinct disc; after that, the number of groups is a lower bound for the number of discs used in the optimal solution. Phase II finds the discs covering each group by enumeration. Since at most O(m) m-discs are sufficient to cover every group, our algorithm has a bounded approximation ratio.

Theorem 3.4 Algorithm GAAM is guaranteed to find a solution at most O(m) times the optimum.

Obviously Phase II (the covering phase) dominates the total running time of our greedy algorithm. The outer loop (line 1) and the inner loop (line 2) contain $|Q| \le n$ and $(2\sqrt{d})^d$ iterations respectively. For a given point p, let $\check{n}$ be the number of points in the distance-D neighborhood of p. Line 6 takes $O((m-1)\check{n}^{d+1})$ steps to generate the m-discs covering all points in a disc, since for every point, enumerating all possible discs takes $O(\check{n}^d)$ steps, and counting the significance of every disc adds another factor of $\check{n}$. Finally, the total runtime is $O\big(m (2\sqrt{d})^d n^{d+1}\big)$.

Algorithm 2 Greedy Approximation Algorithm for MicroSphere (GAAM)
INPUT: A set P of n points, significance m and diameter D.
OUTPUT: A set of m-discs Φ covering all points of P.
Phase I:
1: Set Φ, Q to empty, t = ∞.
2: Randomly pick a point p, remove p from P and add p into Q.
3: while true do
4:   Pick a point p′ from P such that $t = \min_{q \in Q} dist(p′, q)$ is maximized
5:   if t > D then
6:     Remove p′ from P and add p′ into Q
7:   else
8:     break
9:   end if
10: end while
Phase II: (For any node q ∈ Q, let S_q be the disc with radius D centered at q, and C_q be the set of all the $(2\sqrt{d})^d$ discs with diameter D which cover S_q.)
1: for every point q ∈ Q do
2:   for every disc C ∈ C_q do
3:     if C covers ≥ m points then
4:       Add C to Φ
5:     else
6:       Cover all points in C in a greedy manner by enumerating all m-discs having 2, 3, . . . , d points on their borders
7:     end if
8:   end for
9: end for
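As a hedged sketch of Phase I (our own Python rendering of the pseudocode above; `math.dist` requires Python 3.8+):

```python
import math
import random

def gaam_phase1(points, D):
    """Phase I of GAAM: greedy farthest-point selection. The returned seeds Q
    are pairwise more than D apart."""
    P = [tuple(p) for p in points]
    Q = [P.pop(random.randrange(len(P)))]
    while P:
        # the point whose distance to its nearest seed is maximal
        p_star = max(P, key=lambda p: min(math.dist(p, q) for q in Q))
        if min(math.dist(p_star, q) for q in Q) > D:
            P.remove(p_star)
            Q.append(p_star)
        else:
            break    # all remaining points lie within D of some seed
    return Q
```

Because any disc of diameter D can contain at most one of these seeds, |Q| lower-bounds the optimal number of discs, which is the property Theorem 3.4 builds on.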

4 High-quality MicroSpheres for Large Databases

As a first study of Minimum Geometric m-Disc Cover, the PTAS algorithm and the GAAM algorithm with approximation ratio O(m) are important contributions. In particular, the GAAM algorithm has a much smaller time complexity of $O(mn^{d+1})$ compared to the PTAS. Yet, it


is still too slow for large databases. Furthermore, verifying the existence of a Minimum Geometric m-Disc Cover incurs additional costs. Thus, we study a more practical variation of the minGDC_m problem: instead of covering all points, we cover as many points as possible. Such an algorithm should not only return the centers of all MicroSpheres (discs), but also the percentage of points covered. Based on the previous GAAM algorithm, we present an algorithm called FITS (Finding hIgh-qualiTy microSpheres for large databases). The FITS algorithm exploits multi-dimensional index structures, such as the R*-tree, to efficiently support the range queries that retrieve points for MicroSpheres.

Algorithm 3 FITS
INPUT: (1) Database db, (2) significance m, (3) radius r
OUTPUT: The MicroSphere list mspheres and coverage_ratio
1: (singletons, mspheres) = Form_MicroSpheres(db, m, r)
2: if singletons.size() > 0 then
3:   mspheres = Shift_MicroSpheres(db, singletons, mspheres, r)
4: end if
5: mspheres = Prune_MicroSpheres(mspheres)
6: coverage_ratio = 1 − mspheres.get_outliers_num()/db.size()
7: return mspheres and coverage_ratio

The FITS algorithm first forms preliminary MicroSpheres covering as many data points as possible. Then it covers the remaining points by shifting existing MicroSpheres. Finally, a pruning step removes redundant MicroSpheres. Points which cannot be covered during this process are considered outliers. The pseudo-code for the entire algorithm is given as Algorithm 3; each step is explained in detail in the next three parts.

Forming MicroSpheres. As in the GAAM algorithm, the FITS algorithm chooses seeds and forms regions with radius D = 2r. Instead of having two distinct phases for choosing seeds and forming MicroSpheres, we integrate them in an incremental fashion. First, we randomly pick a seed point from the database db and retrieve all points in its D-neighborhood. Then, we insert them into a candidate queue in the order of their distances to the seed (line 9). Here, all the points that can be grouped with the seed to form a MicroSphere satisfying both the significance and radius constraints are included in the candidate queue. Instead of forming multiple MicroSpheres to cover all points in the D-neighborhood as in the GAAM algorithm, for the sake of efficiency, FITS only tries to generate one MicroSphere covering the seed and as many other points as possible. For this purpose, we utilize a geometric algorithm which finds the Smallest Enclosing Ball (SEB) for a given set of points [13]. We incrementally add one point from the front of the candidate queue into the SEB (line 13), until the radius of the SEB exceeds r. If the resulting MicroSphere satisfies

the significance constraint, the center of the MicroSphere is stored. Otherwise, the seed is temporarily marked as an outlier. We repeat this procedure by picking the next seed until every point is either picked or covered.

Algorithm 4 Form_MicroSpheres
INPUT: (1) Data Points db, (2) significance m, (3) radius r
OUTPUT: singletons and the MicroSphere list mspheres
1: if seed == NULL then
2:   p = db.fetch()
3: else
4:   p = seed
5: end if
6: if p == NULL then
7:   break
8: end if
9: candidates = db.range_query(p, 2*r)
10: while real_radius < r do
11:   q = candidates.getfront()
12:   center = SEB.center()
13:   SEB.insert(q)
14:   seed = q
15:   real_radius = SEB.radius()
16: end while
17: result = db.range_query(center, r)
18: if result.size(UNCLUSTERED) == 1 then
19:   singletons.insert(p)
20: else
21:   if result.size() ≤ m then
22:     p.clusterId = OUTLIER
23:   else
24:     clusterId++
25:     mspheres.setborder(clusterId, result.get(m, REVERSE))
26:     result.setClusterId(clusterId)
27:     mspheres.insert(clusterId, result, center)
28:   end if
29: end if
30: Return singletons and mspheres
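The forming step can be sketched in Python as follows (our illustration only; the naive centroid-based radius check is a stand-in for the exact smallest-enclosing-ball routine of [13], and all names are ours):

```python
import math

def min_enclosing_radius(pts):
    """Crude stand-in for an exact SEB computation: the radius around the
    centroid, which upper-bounds the smallest enclosing ball's radius."""
    c = [sum(x) / len(pts) for x in zip(*pts)]
    return max(math.dist(c, p) for p in pts)

def form_microsphere(seed, neighbors, r, m):
    """Greedily absorb the seed's D-neighborhood (sorted by distance) while an
    enclosing ball of radius <= r still exists; None marks a tentative outlier."""
    members = [seed]
    for q in sorted(neighbors, key=lambda p: math.dist(seed, p)):
        if min_enclosing_radius(members + [q]) <= r:
            members.append(q)
        else:
            break
    return members if len(members) >= m else None
```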

We introduce two heuristics to minimize the number of MicroSpheres. First, we take the data point at the front of the candidate queue as the next seed after forming the current MicroSphere. Figure 4(a) demonstrates the intuition behind this heuristic: in this example, after generating MicroSphere MS1, if we pick A instead of B as the next seed, it is possible to use only two MicroClusters to cover all data points. Second, we maintain a list of all data points that have not been covered by the MicroSpheres generated so far. For each of them, a new MicroSphere must be introduced if we do not move the centers of previously formed MicroSpheres. We refer to all such points as singletons. The singletons may appear due to the lack of global information when forming the preliminary MicroSpheres. However, we can shift the preliminary MicroSpheres to also cover the singletons. Figure 4(b) illustrates this idea: here, point A is a singleton, since to cover it we would have to introduce a new MicroSphere MS3, and A would be the sole data point covered only by MS3. Clearly, it is not efficient to add a new MicroSphere to cover only one extra data point. To solve this problem, we collect all those data points on the fly in the first step and cover them by shifting existing MicroSpheres during the second step.

Figure 4. Left (a): Picking a proper seed. Right (b): A singleton.

We have to make sure that after the center shifts, the originally covered data points are still covered. To achieve this goal, we keep a border list containing the last m data points inserted into a MicroSphere (line 25). The border list is used in the second step to decide whether we should move the centers.

Remark. The major secondary-storage operations in the FITS algorithm are range queries (line 9 and line 17). With the support of a multi-dimensional index structure, each range query takes only O(log n) time.

Shifting Step. This step deals with all singletons collected in the first step. Each time, we pop one element off the singletons list and retrieve all MicroSpheres which are in the D-neighborhood of the singleton. For each of the retrieved MicroSpheres, we try to shift the original center towards the singleton until it touches the disc border (line 13). Then we check whether the disc with the new center still covers all border points (line 14). If yes, the new center is adopted. If none of the MicroSpheres can be shifted to cover the singleton, it is marked as an outlier.

Pruning redundant MicroSpheres. In the pruning phase, we delete redundant MicroSpheres, i.e., those which do not contain any point that is covered only by this MicroSphere. To achieve this, we maintain an array storing, for each point, by how many MicroSpheres it is covered. Then we can simply eliminate those MicroSpheres whose data points are all covered by at least two MicroSpheres. After deleting a MicroSphere, we update the array accordingly. Finally, the updated list of MicroSpheres is returned.

Efficiency of FITS. We analyze the run-time complexity of each stage of FITS. For the first stage, generating preliminary MicroSpheres, in the worst case each point could be chosen as a seed. Hence the time complexity is O(n(log n + t)), where t is the time for generating the smallest enclosing ball on a partial candidate set, which is normally far smaller than the whole dataset. In [13], the run-time complexity of the smallest enclosing ball algorithm is not analyzed, but it runs very efficiently in practice. In the second stage, we take one singleton at a time and try to shift the close MicroSpheres towards it. In the worst case, this takes O(k · n) steps, where k is the number of MicroSpheres. Finally, the pruning phase takes O(k · n).

Algorithm 5 Shift_MicroSpheres
INPUT: (1) Data Points db, (2) singletons, (3) the MicroSphere list mspheres, (4) radius r.
OUTPUT: mspheres
1: while true do
2:   s = singletons.next()
3:   if s == NULL then
4:     break
5:   end if
6:   result = db.retrieve_nearest_clusters(s, 2*r)
7:   while true do
8:     cid = result.next()
9:     if cid == NULL then
10:      break
11:    end if
12:    center = mspheres.getCenter(cid)
13:    newCenter = CalculateCenter(center, s)
14:    if CheckBorder(newCenter, mspheres.getBorder(cid)) == true then
15:      mspheres.setCenter(cid, newCenter)
16:      mspheres.insert(cid, s)
17:      s.clusterId = cid
18:      break
19:    else
20:      s.clusterId = OUTLIER
21:    end if
22:  end while
23: end while
24: return mspheres

Hence, the total running time of FITS is O(n(log n + k + t)).
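A sketch of the pruning phase in Python (our own illustration; `mspheres` is assumed to be a list of centers and `points` a list of coordinate tuples):

```python
import math

def prune_microspheres(mspheres, points, r):
    """Drop every MicroSphere all of whose points are covered at least twice,
    maintaining a per-point coverage count as described above."""
    cover = {p: [i for i, c in enumerate(mspheres) if math.dist(c, p) <= r]
             for p in points}
    count = {p: len(ids) for p, ids in cover.items()}
    kept = set(range(len(mspheres)))
    for i in range(len(mspheres)):
        members = [p for p, ids in cover.items() if i in ids]
        if members and all(count[p] >= 2 for p in members):
            kept.discard(i)              # sphere i is redundant
            for p in members:
                count[p] -= 1            # update the coverage array
    return [mspheres[i] for i in sorted(kept)]
```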

5 Experimental Evaluation

The generated MicroSpheres can be used in data summarization or privacy-preserving applications. There are three alternatives for using MicroSpheres:
(1) Alternative 1: represent a MicroSphere by the CF-value obtained from aggregating all points located in the MicroSphere.
(2) Alternative 2: represent a MicroSphere by its center, radius and number of points contained.
(3) Alternative 3: randomly generate points within each MicroSphere.
We implemented our FITS algorithm on top of an implementation of the R*-tree, as well as the Data Bubble algorithm [8], and evaluated MicroSpheres in both motivating scenarios (data compression and privacy preservation) using internal and external quality measures. The internal measures consider direct properties of the MicroClusters, while the external measures consider the quality of clusterings based on the MicroClusters instead of the original data points. The internal measures include compression ratio and average density. The first series of experiments was performed on a 3-dimensional randomly generated synthetic dataset with 2 million records, which is large enough for the compression scenario. The results of the


internal measures are reported in Section 5.1. The external measures evaluate how the quality of the MicroClusters affects the data mining applications based on them. We chose two well-known clustering algorithms, CLARANS [26] and DBSCAN [11], and compare their outputs when taking different summarizations as inputs. The results of the external measures are presented in Section 5.2. To demonstrate the effect of using MicroSpheres to preserve both privacy and the semantics of the original data, we randomly regenerate data points inside each MicroSphere and apply OPTICS on the regenerated data. Our comparison partner is the condensation approach in [4], based on the k-anonymity model, and we compare the reachability plots generated by OPTICS for the regenerated data from both our MicroSphere approach and the condensation approach in Section 5.2.


5.1 Summarization Quality Evaluation

Here we report the test results on the internal measures.

The Compression Ratio. It is natural to measure the quality of a data summarization technique by the compression ratio, i.e., the total number of data points divided by the number of MicroClusters. This measure has already been used in the Data Bubble approach. We compare the data compression ratios obtained by our FITS algorithm and the Data Bubble algorithm for varying radius settings. For the FITS algorithm, the significance is fixed at 20. The results are reported in Figure 5, which shows that the Data Bubble algorithm tends to use fewer MicroClusters, achieving larger compression ratios. Part of the reason is that the average radius of the Data Bubbles is generally larger.

Figure 5. Compression Ratios vs. Radius Constraints.

The Average Density. The compression ratio does not take into account the accuracy (or information loss) of a data summarization. Thus, we introduce a unified way to measure the quality of the data summarization, called average density, which is defined as the average number of data points in a unit volume. For simplicity, the total volume of a MicroCluster is calculated as $r^d$. Formally,

$Avg\_Density = \frac{\sum_{C \in MicroClusters} C.size / r^d}{|MicroClusters|}$

In the above equation, MicroClusters contains all the MicroClusters, and C.size returns the number of data points in MicroCluster C. Figure 6 presents the results of the average density for both the Data Bubble algorithm and our FITS algorithm. For the radius calculation, we consider all points belonging to the Data Bubble and determine their maximum distance to the Data Bubble center.

The MicroSpheres generated by the FITS algorithm have a higher average density, since they have guaranteed radius and significance. Figure 6(a) demonstrates the impact of the radius constraint on both algorithms when the significance is set to 20 for our FITS algorithm. Furthermore, we evaluate the average density for varying compression ratio settings; we adjust the parameters of both algorithms to let them generate similar numbers of MicroClusters. The MicroSpheres tend to have a larger average density than the Data Bubbles, since the varying sizes of the Data Bubbles reduce their average density, as shown in Figure 6(b).

Figure 6. Left (a): Average Densities vs. Radius Constraints. Right (b): Average Densities vs. # of MicroSpheres.
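As a worked illustration of the average-density measure (our own sketch; the cluster sizes are made-up numbers):

```python
def avg_density(cluster_sizes, r, d):
    """Average density: the mean over all MicroClusters of C.size / r^d,
    with the volume simplified to r^d as in the text."""
    return sum(size / r ** d for size in cluster_sizes) / len(cluster_sizes)

# e.g. three MicroSpheres of radius 20 in 3 dimensions holding 25, 30 and 22 points
print(avg_density([25, 30, 22], r=20, d=3))   # ~0.0032
```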

5.2 Evaluation on Clustering Algorithms

• Alternative 1. The first alternative of using MicroSpheres is effective for k-medoid clustering algorithms, such as CLARANS, because the CF-values support the computation of the distance between MicroSpheres. To apply CLARANS on MicroClusters, we use the distance measure defined in [9] to calculate the distance between two MicroClusters. The clustering quality is measured by the total distance of the data points to their closest medoid. In this experiment, we examine two cases. One is the application of CLARANS on a similar number of Data Bubbles and MicroSpheres; we adjust the parameters of both algorithms to reach a similar compression ratio. The results for varying numbers of clusters are shown in Figure 7(a). Due to the fixed radius and the significance guarantee, the size distortion of the MicroClusters is better relieved. Therefore, even with a similar compression ratio, the MicroSpheres outperform the Data Bubbles on CLARANS in terms of the total distance. In addition, we specify the same radius constraint for Data Bubbles and MicroSpheres in the second case. Figure 7(b) depicts the results for varying cluster numbers. When the radius constraint is set for both algorithms, the Data Bubbles tend to have larger sizes, which explains why the total distance is bigger when we apply CLARANS on the Data Bubbles. Clearly, in both cases, the MicroClusters generated by our FITS algorithm lead to a smaller total distance, i.e., more compact clusters.

Figure 7. The result of CLARANS. Left (a): The cluster cost vs. k with the same compression ratio. Right (b): The cluster cost vs. k (Radius = 19.5).


• Alternative 2. The center, radius and number of points of a MicroSphere are sufficient for a density-based clustering algorithm. Therefore, we choose the second alternative for evaluating MicroClusters as input for DBSCAN. To adapt the DBSCAN algorithm to MicroClusters, we extend the neighborhood range from Eps to Eps + r, where r is the radius used for generating the MicroClusters. Thereby, we will not miss any MicroCluster even if only a small portion of it lies in an Eps-neighborhood. When we calculate the number of data points in an Eps-neighborhood, instead of counting all points of every MicroCluster located there, we count every portion of the MicroClusters: for example, if a MicroCluster has only half of its volume overlapping with the Eps-neighborhood, only half of the data points in this MicroCluster are counted. After this simple modification, DBSCAN can be applied on the summarized data; a sketch of this fractional counting follows.
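This is our own hedged sketch of the adapted density estimate; the paper counts the exact overlapping portion, whereas we use a simple linear estimate of the overlap fraction, which is clearly an approximation:

```python
import math

def fractional_neighborhood_count(q, eps, mspheres, r):
    """Estimate the number of points in the eps-neighborhood of q from
    MicroSpheres given as (center, point_count) pairs of radius r."""
    total = 0.0
    for center, n_points in mspheres:
        d = math.dist(q, center)
        if d > eps + r:
            continue                     # sphere entirely outside
        if d + r <= eps:
            total += n_points            # sphere entirely inside
        else:
            frac = (eps + r - d) / (2 * r)   # linear overlap estimate
            total += n_points * min(frac, 1.0)
    return total
```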

Figure 8. Left (a): DS1 (10671 pts). Right (b): DS2 (5000 pts).

We manually generated two 2-dimensional datasets as shown in Figure 8. For a fair comparison, we adjust the parameters of both algorithms to generate a similar number of MicroClusters. To determine the "true" clustering, we apply DBSCAN on the real dataset, to measure how much performance is lost due to summarization. To better select parameters for DBSCAN, we use the reachability plot of OPTICS [6] as a visualization tool to identify the cluster structure and decide the values of Eps and MinPts. The numbers of clusters found by the three different versions of DBSCAN with different parameters are listed in Table 1.

Table 1. DBSCAN Results

DS1                     | Real Data | Data Bubbles | MicroSpheres
Input Size              | 10671     | 619          | 614
Eps = 5, MinPts = 20    | 6         | 22           | 6
Eps = 5.5, MinPts = 20  | 6         | 7            | 6

DS2                     | Real Data | Data Bubbles | MicroSpheres
Input Size              | 5000      | 396          | 390
Eps = 4, MinPts = 20    | 3         | 4            | 3
Eps = 2, MinPts = 20    | 4         | 8            | 4
Eps = 1.2, MinPts = 20  | 5         | 11           | 5

With different settings of Eps and MinPts, DBSCAN on Data Bubbles tends to find more than the actual number of clusters, due to the varying sizes of the Data Bubbles. The reason is as follows. When Data Bubbles are formed, the threshold applies to the average distance, instead of the maximum distance from the center. In general, some dense areas absorb data points around them, so the centers of the generated Data Bubbles are dragged away from dense areas. Hence, some connected areas fall apart if only Data Bubbles are taken as inputs. The results of DBSCAN in Table 1 clearly demonstrate that our MicroSpheres more faithfully represent the underlying data distribution.

• Alternative 3. To better preserve the privacy of the underlying dataset, we can use MicroSpheres to regenerate data points randomly (Alternative 3). In this experiment we adopt the dataset DS2 used in the previous DBSCAN comparison. We compare the reachability plots after applying OPTICS on the regenerated data based on MicroSpheres and Condensed Groups [4]. Here the radius constraint r for generating our MicroSpheres is set to the average radius of all Condensed Groups. For both approaches, the significance constraints are set to 10. Our FITS algorithm generates 116 MicroSpheres, while the condensation approach produces 500 Condensed Groups. We note that FITS generates fewer MicroClusters with a larger cardinality, which leads to better privacy preservation, but a loss of accuracy compared to Condensed Groups. Figure 9 shows the OPTICS reachability plots. The reachability plot for the regenerated data based on Condensed Groups (Figure 9(b)) is very similar to the original OPTICS plot in [6]. The reachability plot for MicroSpheres shows a small loss of detail compared to Condensed Groups, but the general cluster structure is well preserved. This seems to be acceptable considering the additional privacy guarantees.
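Regenerating points for Alternative 3 amounts to uniform sampling inside a d-dimensional ball, e.g. (our own sketch, not the paper's implementation):

```python
import math
import random

def regenerate_points(center, r, n):
    """Draw n points uniformly at random from the MicroSphere (center, r):
    a random direction from Gaussians, a radius via the d-th-root CDF trick."""
    d = len(center)
    out = []
    for _ in range(n):
        g = [random.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        u = random.random() ** (1.0 / d)
        out.append(tuple(c + r * u * x / norm for c, x in zip(center, g)))
    return out
```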

"disc-output"

"output"

14

14

12

12

10

10

8

8

6

6

4

4

2

2

0

0 0

500

1000

1500

2000

2500

3000

3500

4000

4500

0

500

1000

1500

2000

2500

3000

3500

4000

Figure 9. Left (a): The reachability plot for the regenerated data based on MicroSpheres. Right (b): The reachability plot for the regenerated data based on the Condensed Groups.

6 Conclusions

In this paper, we introduced the first MicroClustering problem formalizing the notion of quality, and presented two different approximation algorithms. We also proposed an efficient heuristic algorithm (FITS) derived from the approximation algorithms and exploiting multi-dimensional index structures. Extensive experiments demonstrated that our FITS algorithm generates MicroSpheres that outperform the state-of-the-art Data Bubbles in terms of internal qualities, such as average density, as well as in terms of external quality when using the MicroClusters as input for clustering methods.

Acknowledgement. We thank Dr. Jiawei Han for his valuable suggestions on early versions of this paper.

References
[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In VLDB, 2003.
[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In VLDB, 2004.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On demand classification of data streams. In SIGKDD, 2004.
[4] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.
[5] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450, 2000.
[6] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In SIGMOD, 1999.
[7] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In SIGKDD, 1998.
[8] M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. Data Bubbles: Quality preserving performance boosting for hierarchical clustering. In SIGMOD, 2001.
[9] M. M. Breunig, H.-P. Kriegel, and J. Sander. Fast hierarchical clustering based on compressed data and OPTICS. In PKDD, 2000.


[10] T. Erlebach, K. Jansen, and E. Seidel. Polynomial-time approximation schemes for geometric graphs. In SODA, 2001.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
[12] T. Feder and D. H. Greene. Optimal algorithms for approximate clustering. In STOC, 1988.
[13] K. Fischer, B. Gärtner, and M. Kutz. Fast smallest-enclosing-ball computation in high dimensions. In ESA, 2003.
[14] V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In ICDE, 2000.
[15] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining data streams under block evolution. ACM SIGKDD Explorations Newsletter, 3(2), January 2003.
[16] V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. Clustering large datasets in arbitrary metric spaces. In ICDE, 1999.
[17] R. Ge, M. Ester, W. Jin, and Z. Hu. A disc-based approach to data summarization. Technical Report TR 2006-07, Simon Fraser University, School of Computing Science, 2006.
[18] D. S. Hochbaum and W. Maass. Approximation schemes for covering and packing problems in image processing and VLSI. Journal of the ACM, 32, January 1985.
[19] D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180–184, 1985.
[20] H. V. Jagadish, J. Madar, and R. T. Ng. Semantic compression and pattern extraction with fascicles. In VLDB, 1999.
[21] W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In SIGKDD, 2001.
[22] J. Zhou and J. Sander. Data Bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces. In VLDB, 2003.
[23] Y. Li, J. Han, and J. Yang. Clustering moving objects. In SIGKDD, 2004.
[24] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.
[25] S. Nassar, J. Sander, and C. Cheng. Incremental data summarizations for clustering large dynamically changing data sets. In SIGMOD, 2004.
[26] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In VLDB, 1994.
[27] L. O'Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha. Streaming-data algorithms for high-quality clustering. In ICDE, 2002.
[28] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE, 2003.
[29] L. A. Sweeney. Guaranteeing anonymity when sharing medical data: the Datafly system. In Journal of the American Medical Informatics Association. Hanley and Belfus, Inc., 1997.
[30] A. K. H. Tung, J. Han, R. T. Ng, and L. V. S. Lakshmanan. Constraint-based clustering in large databases. In ICDT, 2001.
[31] A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. In ICDE, 2001.
[32] P. M. Vaidya. An optimal algorithm for the all-nearest-neighbors problem. In FOCS, 1986.
[33] H. Yu, J. Yang, and J. Han. Classifying large data sets using SVMs with hierarchical clusters. In SIGKDD, 2003.
[34] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In SIGMOD, 1996.
[35] T. Zhang, R. Ramakrishnan, and M. Livny. Fast density estimation using CF-kernel for very large databases. In SIGKDD, 1999.

