Dynamic Data Mining*

Vijay Raghavan and Alaaeldin Hafez1
(raghavan, ahafez)@cacs.louisiana.edu
The Center for Advanced Computer Studies
University of Louisiana at Lafayette
Lafayette, LA 70504, USA

Abstract. Business information obtained through advanced data analysis and data mining is a critical success factor for companies wishing to maximize competitive advantage. The use of traditional tools and techniques to discover knowledge is often ineffective and does not deliver the right information at the right time. Data mining should provide tactical insights to support strategic directions. In this paper, we introduce a dynamic approach that uses knowledge discovered in previous episodes. The proposed approach is shown to be effective in addressing problems related to the efficiency of handling database updates, the accuracy of data mining results, the gaining of more knowledge and interpretation of the results, and performance. Our results do not depend on the approach used to generate itemsets. In our analysis, we have used an Apriori-like approach as a local procedure to generate large itemsets. We prove that the Dynamic Data Mining algorithm is correct and complete.

Keywords: Data Mining, Dynamic Approach, Knowledge Discovery, Association Mining, Frequent Itemsets.

1 Introduction

Data mining is the process of discovering potentially valuable patterns, associations, trends, sequences and dependencies in data [1-6,12,16,19,22,23]. Key business examples include web site access analysis for improvements in e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation. Data mining techniques can discover information that many traditional business analysis and statistical techniques fail to deliver. Additionally, the application of data mining techniques further exploits the value of a data warehouse by converting expensive volumes of data into valuable assets for future tactical and strategic business development. Management information systems should provide advanced capabilities that give the user the power to ask more sophisticated and pertinent questions; such systems empower the right people by providing them with the specific information they need. Many knowledge discovery applications [8,10,11,14,15,17,18,20,21], such as on-line services and world wide web applications, require accurate mining information from data that changes on a regular basis. In such an environment, frequent or occasional updates may change the status of some rules discovered earlier. More information should be collected during the data mining process to allow users to gain more complete knowledge of the significance or the importance of the generated data mining rules. Discovering knowledge is an expensive operation [4,6,7,8,11,12,13]. It requires extensive access of secondary storage that can become a bottleneck for efficient processing. Running data mining algorithms from scratch, each time there is a change in data, is obviously not an efficient strategy.
Using previously discovered knowledge along with new data updates to maintain discovered knowledge could solve many of the problems that data mining techniques have faced, namely, database updates, accuracy of data mining results, gaining more knowledge and interpretation of the results, and performance. In this paper, we propose an approach that dynamically updates knowledge obtained from the previous data mining process. Transactions over a long duration are divided into a set of consecutive episodes. In our approach, information gained during the current episode depends on the current set of transactions and the information discovered during the last episode. Our approach discovers current data mining rules by using updates that have occurred during the current episode along with the data mining rules that were discovered in the previous episode. In section 2, a formal definition of the problem is given. The dynamic data mining approach is introduced in section 3. In section 4, the dynamic data mining approach is evaluated. The paper is summarized and concluded in section 5.

* This research was supported in part by the U.S. Department of Energy, Grant No. DE-FG02-97ER1220.
1 On leave from the Department of Computer Science and Automatic Control, Faculty of Engineering, Alexandria University, Alexandria, Egypt.

2 Problem Definition

Association mining, which discovers dependencies among values of an attribute, was introduced by Agrawal et al. [2] and has emerged as an important research area. The problem of association mining, also referred to as the market basket problem, is formally defined as follows. Let I = {i1, i2, . . . , in} be a set of items and S = {s1, s2, . . ., sm} be a set of transactions, where each transaction si ∈ S is a set of items, i.e., si ⊆ I. An association rule, denoted X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅, describes the existence of a relationship between the two itemsets X and Y. Several measures have been introduced to define the strength of the relationship between itemsets X and Y, such as SUPPORT, CONFIDENCE, and INTEREST [2,4,7,9]. The definitions of these measures, from a probabilistic viewpoint, are given below.

I. SUPPORT(X ⇒ Y) = P(X,Y), or the percentage of transactions in the database that contain both X and Y.
II. CONFIDENCE(X ⇒ Y) = P(X,Y)/P(X), or the percentage of transactions containing Y among those transactions containing X.
III. INTEREST(X ⇒ Y) = P(X,Y)/(P(X)·P(Y)), which represents a test of statistical independence.

SUPPORT for an itemset S is calculated as

SUPPORT(S) = F(S)/F

where F(S) is the number of transactions containing S, and F is the total number of transactions. For a minimum SUPPORT value MINSUP, S is a large (or frequent) itemset if SUPPORT(S) ≥ MINSUP, or F(S) ≥ F·MINSUP. Suppose we have divided the transaction set T into two subsets T1 and T2, corresponding to two consecutive time intervals, where F1 is the number of transactions in T1, F2 is the number of transactions in T2 (F = F1 + F2), F1(S) is the number of transactions containing S in T1, and F2(S) is the number of transactions containing S in T2 (F(S) = F1(S) + F2(S)). By calculating the SUPPORT of S in each of the two subsets, we get
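As a concrete illustration, the three measures can be computed directly from a toy transaction set. This is a minimal sketch; the transactions, items, and function names below are ours and are not taken from the paper:

```python
# Toy transaction set; items and transactions are illustrative only.
transactions = [
    {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"a"},
]
F = len(transactions)  # total number of transactions

def freq(itemset):
    """F(S): number of transactions that contain every item of S."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y=frozenset()):
    """SUPPORT(X => Y) = P(X, Y)."""
    return freq(set(X) | set(Y)) / F

def confidence(X, Y):
    """CONFIDENCE(X => Y) = P(X, Y) / P(X)."""
    return support(X, Y) / support(X)

def interest(X, Y):
    """INTEREST(X => Y) = P(X, Y) / (P(X) * P(Y))."""
    return support(X, Y) / (support(X) * support(Y))

print(support({"a"}, {"b"}))     # 2 of 5 transactions contain both a and b -> 0.4
print(confidence({"a"}, {"b"}))  # 0.4 / 0.8 -> 0.5
print(interest({"a"}, {"b"}))    # 0.4 / (0.8 * 0.6), slightly below 1
```

An INTEREST value below 1 here indicates that a and b co-occur slightly less often than independence would predict.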

SUPPORT1(S) = F1(S)/F1 and SUPPORT2(S) = F2(S)/F2

S is a large itemset if

(F1(S) + F2(S)) / (F1 + F2) ≥ MINSUP

or

F1(S) + F2(S) ≥ (F1 + F2)·MINSUP

In order to find out whether S is a large itemset or not, we consider four cases:

• S is a large itemset in T1 and also a large itemset in T2, i.e., F1(S) ≥ F1·MINSUP and F2(S) ≥ F2·MINSUP.
• S is a large itemset in T1 but a small itemset in T2, i.e., F1(S) ≥ F1·MINSUP and F2(S) < F2·MINSUP.
• S is a small itemset in T1 but a large itemset in T2, i.e., F1(S) < F1·MINSUP and F2(S) ≥ F2·MINSUP.
• S is a small itemset in T1 and also a small itemset in T2, i.e., F1(S) < F1·MINSUP and F2(S) < F2·MINSUP.

In the first and fourth cases, S is a large itemset and a small itemset in the transaction set T, respectively, while in the second and third cases it is not immediately clear whether S is a small or a large itemset. Formally speaking, let SUPPORT(S) = MINSUP + δ, where δ ≥ 0 if S is a large itemset and δ < 0 if S is a small itemset. The above four cases then have the following characteristics:

• δ1 ≥ 0 and δ2 ≥ 0
• δ1 ≥ 0 and δ2 < 0
• δ1 < 0 and δ2 ≥ 0
• δ1 < 0 and δ2 < 0
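The combined criterion F1(S) + F2(S) ≥ (F1 + F2)·MINSUP can be checked numerically. In the sketch below, the subset sizes and counts are illustrative (they happen to match the sizes used in the example of section 3), and the variable names are ours:

```python
MINSUP = 0.35
F1, F2 = 37, 34      # sizes of T1 and T2 (illustrative)
F1_S, F2_S = 16, 9   # occurrences of an itemset S in T1 and T2 (illustrative)

# S is large in T1 (16 >= 37*0.35 = 12.95) but small in T2 (9 < 34*0.35 = 11.9):
# this is the second of the four cases, so the combined test decides the outcome.
combined_large = F1_S + F2_S >= (F1 + F2) * MINSUP
print(combined_large)  # True: 25 >= 24.85, so S is large over T1 and T2 together
```

This shows why the second and third cases are ambiguous: the outcome depends on how far each δi is from zero, weighted by the subset sizes.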

S is a large itemset if

(F1·(MINSUP + δ1) + F2·(MINSUP + δ2)) / (F1 + F2) ≥ MINSUP

or

F1·(MINSUP + δ1) + F2·(MINSUP + δ2) ≥ MINSUP·(F1 + F2)

which can be written as

F1·δ1 + F2·δ2 ≥ 0

Generally, let the transaction set T be divided into n transaction subsets Ti, 1 ≤ i ≤ n. S is a large itemset if

Σ_{i=1}^{n} Fi·δi ≥ 0

where Fi is the number of transactions in Ti and δi = SUPPORTi(S) − MINSUP, 1 ≤ i ≤ n. Note that −MINSUP ≤ δi ≤ 1 − MINSUP, 1 ≤ i ≤ n.
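Since Fi·δi = Fi(S) − Fi·MINSUP, the criterion Σ Fi·δi ≥ 0 can be evaluated from per-subset counts alone, without computing any supports explicitly. A minimal sketch (the function name and data are ours):

```python
MINSUP = 0.35

def is_large(counts, sizes, minsup=MINSUP):
    """S is large iff sum_i Fi * delta_i >= 0, where delta_i = counts[i]/sizes[i] - minsup.
    Since Fi * delta_i = counts[i] - sizes[i] * minsup, no division is needed."""
    return sum(fs - f * minsup for fs, f in zip(counts, sizes)) >= 0

counts, sizes = [16, 9], [37, 34]
print(is_large(counts, sizes))             # True: 25 - 71*0.35 = 0.15 >= 0
print(sum(counts) / sum(sizes) >= MINSUP)  # True: the same decision via overall support
```

The two printed checks always agree, which is exactly the algebraic equivalence derived above.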

For those cases where Σ_{i=1}^{n} Fi·δi < 0, there are two options:

• discard S as a large itemset (a small itemset with no history record maintained), or
• keep it for future calculations (a small itemset with a history record maintained). In this case, we do not report it as a large itemset, but its Σ_{i=1}^{n} Fi·δi value is maintained and checked through the future intervals.

3 The Dynamic Data Mining Approach

For Σ_{i=1}^{n} Fi·δi < 0, the two options described above can be combined into a single decision rule that says: discard S if

(Σ_{i=k}^{n} Fi·(MINSUP + δi)) / (Σ_{i=k}^{n} Fi) < MINSUP/α

where 1 ≤ α < ∞ and k ≥ 1. The two extreme settings of α correspond to the two options:

α = 1: discard S from the set of large itemsets (it becomes a small itemset with no history record).
α → ∞: keep it for future calculations (it becomes a small itemset with a history record).
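Because Fi·(MINSUP + δi) = Fi·SUPPORTi(S) = Fi(S), the left-hand side of the decision rule is simply the support of S over the intervals k through n. A sketch of the rule under that reading (the function name and 0-based indexing are our assumptions):

```python
def discard(counts, sizes, k, minsup, alpha):
    """Discard S (keep no history) iff its support over intervals k..n falls
    below minsup / alpha.  counts[i] = Fi(S), sizes[i] = Fi, k is 0-based.
    Note Fi*(MINSUP + delta_i) = Fi(S), so the rule's ratio is a plain support."""
    ratio = sum(counts[k:]) / sum(sizes[k:])
    return ratio < minsup / alpha

# alpha = 1: any itemset whose support drops below MINSUP is dropped outright.
print(discard([14, 8], [37, 34], k=0, minsup=0.35, alpha=1))   # True  (22/71 < 0.35)
# alpha = 2: the same itemset is kept as a declined-large itemset (22/71 >= 0.175).
print(discard([14, 8], [37, 34], k=0, minsup=0.35, alpha=2))   # False
```

The counts here match itemset {d} after T2 in the example below, which is indeed discarded for α = 1 but kept as a declined-large itemset for α = 2.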

The value of α determines how much history information is carried. This history information, along with the calculated values of locality, can be used to
• determine the significance or the importance of the generated emerged-large itemsets,
• determine the significance or the importance of the generated declined-large itemsets, and
• generate large itemsets with smaller SUPPORT values without having to rerun the mining procedure.

The choice of the value of α is the essence of our approach. If the value of α is chosen near 1, we will have fewer declined-large itemsets and more emerged-large itemsets, and those emerged-large itemsets are more likely to occur near the latest interval episodes. If the value of α is chosen far from 1, we will have more declined-large itemsets and fewer emerged-large itemsets, and those emerged-large itemsets are more likely to be large itemsets under the Apriori-like approach. In this section, we introduce the notions of declined-large itemset, emerged-large itemset, and locality.

Definition 3.1: Let S be a large itemset (or an emerged-large itemset, see Definition 3.2) in a transaction subset Tl, l ≥ 1. S is called a declined-large itemset in transaction subset Tn, n > l, if

MINSUP > (Σ_{i=k}^{m} Fi·(MINSUP + δi)) / (Σ_{i=k}^{m} Fi) ≥ MINSUP/α

for all l < m ≤ n, where 1 ≤ k ≤ m and 1 ≤ α < ∞.

Definition 3.2: S is called an emerged-large itemset in transaction subset Tn, n > 1, if S was a small itemset in transaction subset Tn−1 and Fn·δn ≥ 0, or S was a declined-large itemset in transaction subset Tn−1, n > 1, and Σ_{i=k}^{n} Fi·δi ≥ 0, k ≥ 1.

Definition 3.3: For an itemset S and a transaction subset Tn, locality(S) is defined as the ratio of the total size of those transaction subsets where S is either a large itemset or an emerged-large itemset to the total size of all transaction subsets Ti, 1 ≤ i ≤ n:

locality(S) = (Σ_{i s.t. S is a large or emerged-large itemset in Ti} Fi) / (Σ_{i=1}^{n} Fi)
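Definition 3.3 can be sketched directly. The boolean-flag input format below is our own convention, not the paper's:

```python
def locality(flags, sizes):
    """Ratio of the total size of the subsets where S is large or emerged-large
    to the total size of all subsets seen so far.
    flags[i] is True iff S is a large or emerged-large itemset in subset Ti."""
    return sum(f for f, is_large in zip(sizes, flags) if is_large) / sum(sizes)

# Itemset {a} in the example of this section: large/emerged-large only in T3,
# with subset sizes F1 = 37, F2 = 34, F3 = 49.
print(round(locality([False, False, True], [37, 34, 49]), 2))  # 0.41
```

An itemset that has been large in every subset has locality 1, matching the remark below.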

Clearly, locality(S) = 1 for all large itemsets S. The dynamic data mining approach generates three sets of itemsets:

• large itemsets, which satisfy the rule Σ_{i=1}^{n} Fi·δi ≥ 0, where n is the number of intervals carried out by the dynamic data mining approach;
• declined-large itemsets, which were large at previous intervals and still maintain the rule MINSUP > (Σ_{i=k}^{m} Fi·(MINSUP + δi)) / (Σ_{i=k}^{m} Fi) ≥ MINSUP/α, for some value of α;
• emerged-large itemsets, which were
  - either small itemsets that at a transaction subset Tk satisfied the rule Fk·δk ≥ 0, and still satisfy the rule Σ_{i=k}^{n} Fi·δi ≥ 0,
  - or declined-large itemsets that at a transaction subset Tm satisfied the rule Σ_{i=k}^{m} Fi·δi ≥ 0, and still satisfy the rule Σ_{i=k}^{n} Fi·δi ≥ 0.

Example: Let I = {a,b,c,d,e,f,g,h} be a set of items, MINSUP = 0.35, and let T be a set of transactions divided into three consecutive subsets, where each transaction is listed with its count (number of occurrences):

T1 (F1 = 37): {a,b,g,h}:3, {b,c,d}:10, {a,c}:2, {c,g}:4, {d,e,f}:1, {e,g,h}:4, {a,b,d}:2, {b,d,f}:1, {d,f,h}:5, {c,h}:5
T2 (F2 = 34): {c,h}:12, {b,d,g}:8, {a,c}:9, {b,c}:1, {g,h}:4
T3 (F3 = 49): {a}:10, {a,b,g,h}:5, {b,c,d}:10, {a,c}:2, {c,g}:4, {d,e,f}:1, {e,g,h}:4, {a,b,d}:2, {b,d,f}:1, {d,f,h}:5, {c,h}:5

For α = 1, the large or emerged-large itemsets after each subset are:

After T1:
itemset  count  SUPPORT  status                 locality
{b}      16     0.43     large itemset          1
{c}      21     0.57     large itemset          1
{d}      14     0.38     large itemset          1
{h}      17     0.46     large itemset          1
{bd}     13     0.35     large itemset          1

After T2:
itemset  count  SUPPORT  status                 locality
{b}      25     0.35     large itemset          1
{c}      43     0.60     large itemset          1
{h}      33     0.46     large itemset          1
{ch}     12     0.35     emerged-large itemset  0.48

After T3:
itemset  count  SUPPORT  status                 locality
{a}      19     0.39     emerged-large itemset  0.41
{b}      43     0.36     large itemset          1
{c}      64     0.53     large itemset          1
{h}      52     0.43     large itemset          1

For α = 2, the itemsets maintained after each subset are:

After T1:
itemset  count  SUPPORT  status                  locality
{b}      16     0.43     large itemset           1
{c}      21     0.57     large itemset           1
{d}      14     0.38     large itemset           1
{h}      17     0.46     large itemset           1
{bd}     13     0.35     large itemset           1

After T2:
itemset  count  SUPPORT  status                  locality
{b}      25     0.35     large itemset           1
{c}      43     0.60     large itemset           1
{d}      22     0.31     declined-large itemset  0.52
{g}      12     0.35     emerged-large itemset   0.48
{h}      33     0.46     large itemset           1
{bd}     18     0.25     declined-large itemset  0.52
{ch}     12     0.35     emerged-large itemset   0.48

After T3:
itemset  count  SUPPORT  status                  locality
{a}      19     0.39     emerged-large itemset   0.41
{b}      43     0.36     large itemset           1
{c}      64     0.53     large itemset           1
{d}      36     0.30     declined-large itemset  0.31
{g}      25     0.30     declined-large itemset  0.28
{h}      52     0.43     large itemset           1
{bd}     31     0.26     declined-large itemset  0.31
{ch}     17     0.20     declined-large itemset  0.28

When applying an Apriori-like algorithm on the whole file, the resulting large itemsets are

itemset  count  SUPPORT
{b}      43     0.39
{c}      64     0.58
{h}      52     0.47

By comparing the results in the previous example, we arrive at some intuitions about the proposed approach, which can be summarized as follows:

• The set of large itemsets and emerged-large itemsets generated by our Dynamic approach is a superset of the set of large itemsets generated by the Apriori-like approach.
• If there is an itemset generated by our Dynamic approach but not generated by the Apriori-like approach as a large itemset, then this itemset must be large over the latest consecutive time intervals, i.e., an emerged-large itemset.

In Lemmas 3.1 and 3.2, we prove the above intuitions.

Lemma 3.1: For a transaction set T, the set of large itemsets and emerged-large itemsets generated by our Dynamic approach is a superset of the set of large itemsets generated by the Apriori-like approach.

Proof: Let ∪i Ti = T, 1 ≤ i ≤ n, Fi = |Ti|, and let S be a large itemset that is generated by the Apriori-like approach, i.e., Σ_{i=1}^{n} Fi·δi ≥ 0, but not by our Dynamic approach. There are two cases to consider.

Case 1 (α = 1): For a transaction subset Tk, 1 ≤ k ≤ n, S is discarded from the set of large itemsets if it becomes a small itemset, i.e., Σ_{i=m}^{k} Fi·δi < 0, 1 ≤ m ≤ k, and no history is recorded. Since no history is recorded before m, we have Σ_{i=1}^{m−1} Fi·δi < 0. That leads to Σ_{i=1}^{k} Fi·δi < 0. For k = n, we have Σ_{i=1}^{n} Fi·δi < 0, which contradicts our assumption.

Case 2 (α > 1): For a transaction subset Tk, 1 ≤ k ≤ n, S is discarded from the set of large itemsets if it becomes a small itemset, i.e., Σ_{i=m}^{k} Fi·δi < 0, 1 ≤ m ≤ k, and, depending on the value of α, its history starts to be recorded at transaction subset Tm. Since no history is recorded before m, we have Σ_{i=1}^{m−1} Fi·δi < 0. That leads to Σ_{i=1}^{k} Fi·δi < 0. For k = n, we have Σ_{i=1}^{n} Fi·δi < 0, which contradicts our assumption.

Lemma 3.2: If there is an itemset generated by our Dynamic approach but not generated by the Apriori-like approach as a large itemset, then this itemset must be large over the latest consecutive time intervals, i.e., an emerged-large itemset.

Proof: Following the proof of Lemma 3.1, the proof is straightforward.

Algorithm DynamicMining(Tn)

// f1(Tn) is the set of large and emerged-large 1-itemsets.
// f1*(Tn) is the set of declined-large 1-itemsets.
// Γ(x) is the accumulated value of Σ Fi·δi(x), where δi(x) is the δ of itemset x in Ti.
// ∆ is the accumulated value of Σ Fi.
// Cl(x) is the accumulated value of Σ Fi over those subsets where itemset x is large.

begin
  ∆ = ∆ + Fn
  f1(Tn) = { (x, Cl(x)), Cl(x) = Fn | x ∉ f1(Tn−1) ∧ x ∉ f1*(Tn−1) ∧ Fn·δn(x) ≥ 0 }
         ∪ { (x, Cl(x)), Cl(x) = Cl(x) + Fn | Γ(x) + Fn·δn(x) ≥ 0 }        // large or emerged-large itemset
  f1*(Tn) = { (x, Cl(x)) | MINSUP > (∆·MINSUP + Γ(x))/∆ ≥ MINSUP/α }       // was a large itemset
  for (k = 2; fk−1(Tn) ≠ ∅; k++) do
  begin
    Ck = AprioriGen(fk−1(Tn) ∪ fk−1*(Tn))
    forall transactions t ∈ Tn do
      forall candidates c ∈ Ck do
        if c ⊆ t then c.count++
    fk(Tn) = { (x, Cl(x)), x ∈ Ck, Cl(x) = Fn | x ∉ fk(Tn−1) ∧ x ∉ fk*(Tn−1) ∧ Fn·δn(x) ≥ 0 }
           ∪ { (x, Cl(x)), x ∈ Ck, Cl(x) = Cl(x) + Fn | Γ(x) + Fn·δn(x) ≥ 0 }
    fk*(Tn) = { (x, Cl(x)) | MINSUP > (∆·MINSUP + Γ(x))/∆ ≥ MINSUP/α }
  end
  return fk(Tn) and fk*(Tn) for all k
end

function AprioriGen(fk−1)
  insert into Ck
  select l1, l2, . . ., lk−1, ck−1
  from fk−1 l, fk−1 c
  where l1 = c1 ∧ l2 = c2 ∧ . . . ∧ lk−2 = ck−2 ∧ lk−1 < ck−1
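The AprioriGen join (candidates formed from pairs of (k−1)-itemsets agreeing on their first k−2 items, with lk−1 < ck−1) can be sketched in Python. This is our own sketch: itemsets are represented as sorted tuples, and the subset-pruning step of the classical apriori-gen is omitted for brevity:

```python
def apriori_gen(f_prev):
    """Join step of AprioriGen: combine (k-1)-itemsets (sorted tuples) that
    share their first k-2 items; l[k-1] < c[k-1] is enforced by the sort order."""
    f_prev = sorted(set(f_prev))
    candidates = []
    for i, l in enumerate(f_prev):
        for c in f_prev[i + 1:]:
            if l[:-1] != c[:-1]:  # prefixes differ: no further matches for l
                break
            candidates.append(l + (c[-1],))
    return candidates

print(apriori_gen([("a", "b"), ("a", "c"), ("b", "c")]))  # [('a', 'b', 'c')]
```

Because the input is sorted, all itemsets sharing a prefix are contiguous, so the inner loop can stop at the first prefix mismatch.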