Database Clustering for Mining Multi-Databases

Chengqi Zhang and Shichao Zhang
Faculty of Information Technology, University of Technology, Sydney
PO Box 123, Broadway NSW 2007, Australia

Abstract: This paper presents an efficient and effective application-independent database clustering approach for mining multi-databases. Our experimental results show that the proposed approach is effective and promising.

1 Introduction

To deal with multiple databases, Liu, Lu and Yao [1, 2] have proposed to mine them by first identifying relevant databases. They argued that, for efficiency and accuracy, the first step of multi-database mining is to identify the databases that are most likely relevant to an application. The cluster of multi-databases in [1] is constructed for a particular application; it is typically application-dependent and is referred to as database selection. However, database selection has to be carried out multiple times to identify the relevant databases for two or more real-world applications. In particular, when users need to mine their multi-databases without reference to any specific application, application-dependent techniques do not work well. This paper presents an efficient clustering approach for general multi-database mining, such that the cluster is appropriate to different applications. Section 2 illustrates the effectiveness of clustering. Section 3 presents an approach for clustering databases by similarity. Section 4 designs an algorithm for searching for a good cluster. Section 5 evaluates the proposed approach by experiments. We summarize our contributions in the last section.


2 Effectiveness

To show the effectiveness of clustering, an example is used below. Consider six databases D1, D2, ..., D6 as follows:

D1 = {(A,B,C,D); (B,C); (A,B,C); (A,C)}
D2 = {(A,B); (A,C); (A,B,C); (B,C); (A,B,D)}
D3 = {(B,C,D); (A,B,C); (B,C); (A,D)}
D4 = {(F,G,H,I,J); (E,F,H); (F,H)}
D5 = {(E,F,H,J); (F,H); (F,H,J); (E,J)}
D6 = {(F,H,I,J); (E,H,J); (E,F,H); (E,I)}

where each database has several transactions, separated by semicolons, and each transaction contains several items, separated by commas. Let the minimum support be minsupp = 0.5. We can search the local frequent itemsets in D1 as follows: A, B, C, AB, AC, BC, and ABC, where 'XY' denotes the conjunction of X and Y. Likewise, the local frequent itemsets in D2 are A, B, C, and AB; in D3 they are A, B, C, and BC; in D4 they are F, H, and FH; in D5 they are E, F, H, J, EJ, FH, FJ, and FHJ; and in D6 they are E, F, H, I, J, EH, FH, and HJ. Let us examine the existing techniques to check whether some can serve the purpose of selection, pretending no knowledge as to which database contains interesting information [1]. The first solution (the technique for mono-database mining) is to put all the data from the six databases together to create a single database TD = D1 U D2 U ... U D6, which has 24 transactions. We now search for the above (local) frequent itemsets in TD. This solution cannot find any frequent itemset when minsupp = 0.5. For the application, we would need another minimum support, for example,

minsupp = 0.125. The second solution is to use the application-dependent database selection in [1]. However, this mining task is without reference to any specific application, so that solution cannot work well either. The third solution is our database cluster, which is independent of any application. The approach classifies the databases into two classes, class1 = {D1, D2, D3} and class2 = {D4, D5, D6}, and mines patterns in class1 and class2 respectively, as shown in Tables 1 and 2.

Table 1  Local frequent itemsets in class1

  Itemset   supp
  A         9
  B         10
  C         10
  AB        6
  AC        6
  BC        8
  ABC       4

Table 2  Local frequent itemsets in class2

  Itemset   supp
  E         6
  F         8
  H         9
  I         3
  J         6
  EH        4
  EJ        3
  FH        8
  FHJ       4
  HJ        5

With minsupp = 0.5, A, B, C, and BC are the frequent itemsets in TD1, and E, F, H, J, and FH are the frequent itemsets in TD2 (the unions of the databases in class1 and class2, respectively). From the above, the technique for mono-database mining can disguise useful patterns because huge amounts of irrelevant data are included. Database selection is typically application-dependent and cannot work well for the above multi-database problem. An application-independent database cluster is therefore significantly more effective.
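The effect can be made concrete with a short Python sketch (ours, not from the paper; the frequent_itemsets helper is hypothetical and uses brute-force enumeration rather than an Apriori-style miner). It recomputes the local frequent itemsets of class1, class2, and the pooled database TD at the same relative minimum support of 0.5.

```python
from itertools import combinations

# The six example databases; each transaction is a set of items.
D = {
    "D1": [{"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","C"}],
    "D2": [{"A","B"}, {"A","C"}, {"A","B","C"}, {"B","C"}, {"A","B","D"}],
    "D3": [{"B","C","D"}, {"A","B","C"}, {"B","C"}, {"A","D"}],
    "D4": [{"F","G","H","I","J"}, {"E","F","H"}, {"F","H"}],
    "D5": [{"E","F","H","J"}, {"F","H"}, {"F","H","J"}, {"E","J"}],
    "D6": [{"F","H","I","J"}, {"E","H","J"}, {"E","F","H"}, {"E","I"}],
}

def frequent_itemsets(transactions, minsupp):
    """Brute-force enumeration of all itemsets with relative support >= minsupp."""
    items = sorted(set().union(*transactions))
    threshold = minsupp * len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= threshold:
                result["".join(cand)] = count
    return result

class1 = D["D1"] + D["D2"] + D["D3"]     # databases over items A-D
class2 = D["D4"] + D["D5"] + D["D6"]     # databases over items E-J
pooled = class1 + class2                 # the mono-database TD (24 transactions)

print(frequent_itemsets(class1, 0.5))    # A, B, C, BC survive (cf. Table 1)
print(frequent_itemsets(class2, 0.5))    # E, F, H, J, FH survive (cf. Table 2)
print(frequent_itemsets(pooled, 0.5))    # {} -- pooling hides every pattern
```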


3 Clustering Databases

We now propose a general cluster for the given multi-databases, based on the similarity between databases, in this section.

3.1 Similarity Measurement

Let D1, D2, ..., Dm be m databases from the branches of an interstate company, and let Item(Di) be the set of items in Di (i = 1, 2, ..., m). If we have no other information about the databases, the items of the databases can be used to measure the closeness of a pair of database objects. A function for the similarity between the items of two databases Di and Dj is defined as follows:

sim(Di, Dj) = |Item(Di) ∩ Item(Dj)| / |Item(Di) ∪ Item(Dj)|

where '∩' denotes set intersection, '∪' denotes set union, and |Item(Di) ∩ Item(Dj)| is the number of elements in the set Item(Di) ∩ Item(Dj). Certainly, we can also construct other similarity measures using information such as the weights of items and the amounts of items purchased by customers. Our work in this paper focuses only on how to construct measures for similarity. Using the above similarity on databases, we define the relevance of databases below.

Definition 1 A database Di is α-relevant to Dj under measure sim, written as Di ~α Dj, if sim(Di, Dj) > α, where α (> 0) is the given threshold. For example, let α = 0.35. Consider the data in Example 1: because sim(D1, D2) = 0.6 > α = 0.35, the database D1 is 0.35-relevant to D2.

Definition 2 Let D be the set of m databases D1, D2, ..., Dm. The similarity set of a database Di under measure sim, denoted Dsim(Di, sim, α), is the subset of D defined as follows:

Dsim(Di, sim, α) = {d ∈ D | d is α-relevant to Di}

Definition 3 Let D be the set of m databases D1, D2, ..., Dm. A class class^α of D under measure sim is defined as

class^α = {d ∈ D | ∀d' ∈ class^α (d is α-relevant to d')}



This definition shows that any two database objects in such a class are α-relevant to each other.
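For illustration, the similarity measure of this subsection and the α-relevance test of Definition 1 translate directly into a few lines of Python. This is our own sketch (the function names are ours), applied to the item sets that appear later in Example 1.

```python
def sim(items_i, items_j):
    """Similarity of two databases: |intersection| / |union| of their item sets."""
    return len(items_i & items_j) / len(items_i | items_j)

def is_relevant(items_i, items_j, alpha):
    """Di is alpha-relevant to Dj if sim(Di, Dj) > alpha (Definition 1)."""
    return sim(items_i, items_j) > alpha

# Item sets of Example 1 (Section 3.2).
Item = {
    "D1": {"A", "B", "C", "D", "E"},
    "D2": {"A", "B", "C"},
    "D3": {"A", "B", "D"},
    "D4": {"F", "G", "H", "I", "J"},
    "D5": {"F", "G", "H"},
    "D6": {"F", "G", "I"},
}

print(sim(Item["D1"], Item["D2"]))                 # 0.6
print(is_relevant(Item["D1"], Item["D2"], 0.35))   # True: D1 is 0.35-relevant to D2
print(sim(Item["D1"], Item["D4"]))                 # 0.0
```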

Definition 4 Let D be the set of m databases D1, D2, ..., Dm. class(D, sim, α) = {class_1^α, class_2^α, ..., class_n^α} is a cluster of D1, D2, ..., Dm under measure sim if

(1) class_1^α ∪ class_2^α ∪ ... ∪ class_n^α = D;

(2) for any two databases Di and Dj in the same class class_v^α, sim(Di, Dj) ≥ α.
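The two conditions of Definition 4 translate directly into a small check. The following sketch is ours (the is_cluster helper and the dictionary-based similarity lookup are assumptions, not the paper's code); it tests whether a candidate partition of D is a cluster under a given α, using the similarities of Table 3 below.

```python
from itertools import combinations

def is_cluster(classes, all_dbs, sim, alpha):
    """Definition 4: (1) the classes cover D; (2) within every class,
    any two databases have similarity >= alpha."""
    covers_d = set().union(*classes) == set(all_dbs)
    within_ok = all(sim(a, b) >= alpha
                    for cls in classes
                    for a, b in combinations(cls, 2))
    return covers_d and within_ok

# Example usage with the similarities of Table 3 (Section 3.2), alpha = 0.5.
table3 = {("D1","D2"): 0.6, ("D1","D3"): 0.6, ("D2","D3"): 0.5,
          ("D4","D5"): 0.6, ("D4","D6"): 0.6, ("D5","D6"): 0.5}
sim_lookup = lambda a, b: table3.get((a, b), table3.get((b, a), 0.0))

print(is_cluster([{"D1","D2","D3"}, {"D4","D5","D6"}],
                 ["D1","D2","D3","D4","D5","D6"], sim_lookup, 0.5))   # True
```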

3.2 Ideal and Goodness

We now define two evaluations: ideal and goodness.

Definition 5 Let class(D, sim, α) = {class_1^α, class_2^α, ..., class_n^α} be a cluster of m databases D1, D2, ..., Dm under measure sim. class is ideal if

(1) for any two classes class_i^α and class_j^α in class, class_i^α ∩ class_j^α = ∅ for i ≠ j;

(2) for any two databases Di in class_l^α and Dj in class_h^α with l ≠ h, sim(Di, Dj) ≤ α.

Certainly, for α = 1 and α = 0 we obtain two ideal clusters for a set of given databases, respectively. These two ideal clusters are called trivial clusters. To evaluate an ideal cluster, we define a goodness value below.

Definition 6 Let class^α be a class of the given multiple databases. The sum of the distances over pairs of databases in class^α under measure sim is defined as follows:

Value(class^α) = Σ_{Di, Dj ∈ class^α, i ≠ j} (1 − sim(Di, Dj))

We take 1 − sim(Di, Dj) as the distance between two databases Di and Dj. Below, the goodness of a cluster is defined by the distances over pairs of databases in its classes.

Definition 7 Let class(D, sim, α) = {class_1^α, class_2^α, ..., class_n^α} be a cluster of m databases D1, D2, ..., Dm under measure sim. The goodness of the cluster is defined as follows:

Goodness(class, α) = Σ_{i=1}^{n} Value(class_i^α)

As we have seen, the number |class| of elements in a cluster class is related to α. To elucidate the relationships among |class|, Goodness, and α, we use an example below.

Example 1 Consider six databases D1, D2, ..., D6, where Item(D1) = {A,B,C,D,E}, Item(D2) = {A,B,C}, Item(D3) = {A,B,D}, Item(D4) = {F,G,H,I,J}, Item(D5) = {F,G,H}, and Item(D6) = {F,G,I}. The similarity between pairs of the six database objects is illustrated in the following table.

Table 3  The similarity of the six database objects

        D1    D2    D3    D4    D5    D6
  D1    1     0.6   0.6   0     0     0
  D2    0.6   1     0.5   0     0     0
  D3    0.6   0.5   1     0     0     0
  D4    0     0     0     1     0.6   0.6
  D5    0     0     0     0.6   1     0.5
  D6    0     0     0     0.6   0.5   1

(1) When α = 0, n = 1 with class = {{D1, D2, ..., D6}}, and Goodness(class, 0) = 2(4(1 − 0.6) + 2(1 − 0.5) + 9) = 23.2.

(2) When 0 < α ≤ 0.5, n = 2 with class = {{D1, D2, D3}; {D4, D5, D6}}, and Goodness(class, α) = 4(2(1 − 0.6) + (1 − 0.5)) = 5.2.

(3) When 0.5 < α ≤ 0.6, there is no non-trivial ideal cluster. This interval is different from 0.6 < α ≤ 1.

(4) When 0.6 < α ≤ 1, n = 6 with class = {{D1}; {D2}; {D3}; {D4}; {D5}; {D6}}, and Goodness(class, α) = 6(1 − 1) = 0.

This example illustrates that we can obtain a good cluster when the distance between n and Goodness takes its smallest value for α ∈ [0, 1].
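The Goodness values above can be reproduced with a small Python sketch (our own; it sums 1 − sim over ordered pairs of distinct databases in each class, which is the reading of Definition 6 that matches the quoted numbers):

```python
from itertools import permutations

# Pairwise similarities of Table 3; unlisted pairs have similarity 0.
sim3 = {("D1","D2"): 0.6, ("D1","D3"): 0.6, ("D2","D3"): 0.5,
        ("D4","D5"): 0.6, ("D4","D6"): 0.6, ("D5","D6"): 0.5}
sim = lambda a, b: 1.0 if a == b else sim3.get((a, b), sim3.get((b, a), 0.0))

def value(cls):
    """Definition 6: sum of distances 1 - sim over ordered pairs of distinct databases."""
    return sum(1 - sim(a, b) for a, b in permutations(cls, 2))

def goodness(partition):
    """Definition 7: sum of Value over all classes of the cluster."""
    return sum(value(cls) for cls in partition)

all_in_one = [["D1","D2","D3","D4","D5","D6"]]
two_classes = [["D1","D2","D3"], ["D4","D5","D6"]]
singletons = [[d] for d in ("D1","D2","D3","D4","D5","D6")]

print(round(goodness(all_in_one), 1))    # 23.2  (alpha = 0)
print(round(goodness(two_classes), 1))   # 5.2   (0 < alpha <= 0.5)
print(round(goodness(singletons), 1))    # 0.0   (0.6 < alpha <= 1)
```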



We now define a new function for judging the goodness of a cluster below.

Definition 8 Let class(D, sim, α) = {class_1^α, class_2^α, ..., class_n^α} be a cluster of m databases D1, D2, ..., Dm under measure sim. The absolute difference between Goodness and |class| is called distance_f and is defined, for α ∈ [0, 1], as

distance_f(class, α) = |Goodness(class, α) − f(α)|

where f(α) = |class| is the number of classes obtained under α.

For the values of |class| and Goodness obtained with different α, we obtain different values of distance_f for the ideal clusters of the given multiple databases. We now illustrate the use of distance_f with an example.

Example 2 Consider the six databases in Example 1.

(1) When α = 0, class = {{D1, D2, ..., D6}}, and distance_f(class, 0) = |23.2 − 1| = 22.2.

(2) When 0 < α ≤ 0.5, class = {{D1, D2, D3}; {D4, D5, D6}}, and distance_f(class, α) = |5.2 − 2| = 3.2.

(3) When 0.5 < α ≤ 0.6, there is no non-trivial ideal cluster. This interval is different from 0.6 < α ≤ 1.

(4) When 0.6 < α ≤ 1, class = {{D1}; {D2}; {D3}; {D4}; {D5}; {D6}}, and distance_f(class, α) = |0 − 6| = 6.
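Definition 8 then selects among the candidate clusters. The sketch below (ours, reusing the Goodness and |class| values from Examples 1 and 2) computes distance_f for each α interval and picks the minimum:

```python
# (n, Goodness) for the ideal clusters found in Examples 1 and 2.
candidates = {
    "alpha = 0":         (1, 23.2),
    "0 < alpha <= 0.5":  (2, 5.2),
    "0.6 < alpha <= 1":  (6, 0.0),
}

def distance_f(n_classes, goodness):
    """Definition 8: |Goodness(class, alpha) - f(alpha)|, with f(alpha) = |class|."""
    return abs(goodness - n_classes)

for label, (n, g) in candidates.items():
    print(label, round(distance_f(n, g), 1))   # 22.2, 3.2, 6.0

best = min(candidates, key=lambda k: distance_f(*candidates[k]))
print(best)                                    # 0 < alpha <= 0.5 gives the best cluster
```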


Figure 1. The relation n = f(α).
