ALPINE: Progressive Itemset Mining with Definite Guarantees

Qiong Hu (1,2), Tomasz Imielinski (1,2)

(1) Department of Computer Science, Rutgers University
(2) Center for Science of Information, Purdue University

SIAM Data Mining Conference (SDM), April 2017

Hu, Imielinski (Rutgers)

ALPINE

SDM 2017

1 / 16

Outline

1 Motivation
2 Problem Definition
3 The Proposed ALPINE Algorithm
4 Computational Experiments
5 Conclusion and Vision



Motivating Example

Experiments on synthetic datasets on aurora.cs (1)

• ntran = 1000, p = 0.2, minsup = 0.16

Table 1: Mining time increases drastically as the number of items increases.

items        Eclat
10,000       10 mins
50,000       4 hours
100,000      17 hours
500,000      16 days
1,000,000    2 months+

• ntran = 100,000, nitems = 1,000,000, maxtranlength = 50,000

Table 2: Mining time increases drastically as minsup decreases.

minsup    Eclat       FPGrowth
2567      15 hours    20 hours
2556      32 hours    43 hours
2548      2 days      3 days
2537      4 days      5 days+
2487      26 days     34 days

(1) Rutgers CS research cluster. Eclat [Zaki, 2000], FP-Growth [Han et al., 2000]


Motivation

• Should you wait? How long can you wait?
• Can the miner provide a progress indicator during this long process?
• Is the best-so-far partial answer available at any time?
• Can you choose to stop or to continue at any time?



Motivation

Key: frequent itemset mining + anytime analysis

Definition (Anytime Algorithms): Algorithms whose quality of results improves gradually as computation time increases [Zilberstein, 1996].



Problem Definition

Frequent Itemset Mining: Given a set of transactions, find all the itemsets having support ≥ minsup [Agrawal and Imielinski, 1993].

Progressive Itemset Mining: Given a set of transactions, find all the itemsets having support ≥ a certain minimum support threshold $minsup_t$ within an arbitrary time period $t$.

The goal of a progressive miner is to maximize the expected utility:

$$\max \frac{1}{n} \sum_{j=1}^{n} U(P_j)$$

where $U(P_j)$ is the utility obtained at probe $j$ and $\{P_j\}_{j=1}^{n}$ is a set of randomly selected probes in time.
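As a rough illustration of this objective, the sketch below averages the utility observed at a set of probe times. The checkpoint schedules, the linear form of `utility`, and all function names are assumptions made for the example; the slides do not specify the form of U.

```python
import bisect
import random

def minsup_at(checkpoints, t):
    """Return the lowest minsup completed by time t, or None if none is done yet.

    `checkpoints` is a list of (finish_time, minsup) pairs sorted by time,
    with minsup decreasing (a hypothetical schedule, not taken from the slides).
    """
    times = [finish for finish, _ in checkpoints]
    idx = bisect.bisect_right(times, t) - 1
    return checkpoints[idx][1] if idx >= 0 else None

def utility(minsup, max_support):
    """Assumed utility: 0 if nothing is complete, growing towards 1 as minsup drops."""
    if minsup is None:
        return 0.0
    return 1.0 - minsup / max_support

def expected_utility(checkpoints, probes, max_support):
    """Average utility over the probes: (1/n) * sum_j U(P_j)."""
    total = sum(utility(minsup_at(checkpoints, p), max_support) for p in probes)
    return total / len(probes)

# Two hypothetical miners over 10 time units, with a maximum support of 100.
progressive = [(1, 80), (3, 60), (6, 40), (10, 20)]  # lowers minsup step by step
one_shot = [(10, 20)]                                # answers only at the very end

rng = random.Random(0)
probes = [rng.uniform(0, 10) for _ in range(1000)]
print(expected_utility(progressive, probes, 100))
print(expected_utility(one_shot, probes, 100))
```

Under these assumptions the progressive schedule scores higher, because a probe that arrives early still finds a complete answer for some higher minsup.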



Problem Definition

[Figure: utility (0 to 1.0) plotted against minsup, and the itemset miner's utility levels $U(minsup_1), U(minsup_2), \dots, U(minsup_k)$ plotted against time $t$ with probes $P_1, P_2, \dots, P_n$.]

Progressive itemset mining. A long mining task is progressively divided into $k$ sub-search spaces w.r.t. a set of decreasing minimum supports $\{minsup_i\}_{i=1}^{k}$. $\{P_j\}_{j=1}^{n}$ is a set of randomly selected probes in time.
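The idea of working through decreasing supports can be sketched with a max-heap keyed by support. This is a deliberately simplified stand-in, not the ALPINE algorithm itself: it uses a heap instead of bins and plain itemsets instead of itemset intervals, and the name `progressive_itemsets` is hypothetical.

```python
import heapq

def progressive_itemsets(transactions):
    """Yield (itemset, support) pairs in non-increasing order of support.

    Adding an item never increases support, so once an itemset with support s
    is yielded, every itemset with support greater than s was yielded earlier.
    """
    items = sorted({i for t in transactions for i in t})
    # Transaction-id set of each single item.
    tidsets = {
        i: frozenset(k for k, t in enumerate(transactions) if i in t)
        for i in items
    }
    # Heap entries are (-support, itemset, tidset); itemsets are unique,
    # so ties are always resolved before the tidset is compared.
    heap = [(-len(tids), (i,), tids) for i, tids in tidsets.items()]
    heapq.heapify(heap)
    while heap:
        neg_support, itemset, tids = heapq.heappop(heap)
        yield itemset, -neg_support
        # Extend only with items larger than the last one, so that each
        # itemset is generated exactly once.
        for j in items:
            if j > itemset[-1]:
                new_tids = tids & tidsets[j]
                if new_tids:
                    heapq.heappush(heap, (-len(new_tids), itemset + (j,), new_tids))

transactions = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
for itemset, support in progressive_itemsets(transactions):
    if support < 3:  # the user may stop at any time
        break
    print(itemset, support)
```

Stopping the loop at any point leaves a complete answer for every support strictly above the last one seen, which is the kind of checkpoint a probe can rely on.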


The ALPINE Algorithm

Automatic minsup Lowering with Progress Indicator in practically Never-Ending mining

[Figure: supports $support_1, support_2, \dots, support_k$, each with a pointer $ptr_1, ptr_2, \dots, ptr_k$ to a bin $b_1, b_2, \dots, b_k$ holding itemset intervals $(I, I^*)$.]

The itemset intervals $(I, I^*)$ of all singleton itemsets are partitioned into different bins $(b_1, \dots, b_k, \dots)$ based on their supports $(support_1 > \dots > support_k > \dots)$.
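A minimal sketch of this initial binning step follows. It assumes that $I^*$ denotes the closure of $I$, computed here as the intersection of all transactions containing $I$; the slides do not define it, and the helper names are hypothetical.

```python
from collections import defaultdict

def closure(itemset, transactions):
    """Intersection of all transactions containing `itemset` (assumed meaning of I*)."""
    covering = [frozenset(t) for t in transactions if itemset <= t]
    if not covering:
        return frozenset(itemset)
    return frozenset.intersection(*covering)

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def bin_singletons(transactions):
    """Map each support value to the intervals (I, I*) of the singletons I having it."""
    bins = defaultdict(list)
    for item in sorted({i for t in transactions for i in t}):
        single = frozenset([item])
        interval = (single, closure(single, transactions))
        bins[support(single, transactions)].append(interval)
    # Order the bins by decreasing support: support_1 > support_2 > ...
    return dict(sorted(bins.items(), reverse=True))

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
for sup, intervals in bin_singletons(transactions).items():
    print(sup, [(sorted(i), sorted(c)) for i, c in intervals])
```

In this toy data, item a lands alone in the highest-support bin, while b and c share a lower bin and each carry a closure that also contains a.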


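The ALPINE Algorithm

[Figure: bins $b_i, b_{i+1}, \dots, b_l, \dots, b_h$ indexed by supports $support_i, support_{i+1}, \dots, support_h$ through pointers $ptr_i, ptr_{i+1}, \dots, ptr_h$; the interval $(P, Q)$ in $b_i$ is extended by items $j$ and $k$ into intervals $(R, S)$ and $(U, V)$ in other bins.]

Itemset interval $(P, Q)$ is extended with every item $j$ such that $j > tail(P) \wedge j \notin Q$, giving the interval $(R, S)$ with $R = P \cup \{j\}$ and $S = R^* \cup Q$ (Lemma 3.1). The resulting intervals are progressively segregated into disjoint bins of different supports.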
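The extension rule can be sketched as follows, under two assumptions not stated on the slide: $R^*$ is the closure of $R$ (the intersection of all transactions containing $R$), and $tail(P)$ is the largest item of $P$ in the item order. The function names are hypothetical.

```python
def closure(itemset, transactions):
    """Intersection of all transactions containing `itemset` (assumed meaning of R*)."""
    covering = [frozenset(t) for t in transactions if itemset <= t]
    if not covering:
        return frozenset(itemset)
    return frozenset.intersection(*covering)

def extend_interval(P, Q, transactions):
    """Yield (support, R, S) for every item j with j > tail(P) and j not in Q."""
    items = sorted({i for t in transactions for i in t})
    tail = max(P)  # assumption: tail(P) is the largest item in P
    for j in items:
        if j > tail and j not in Q:
            R = frozenset(P) | {j}
            sup = sum(1 for t in transactions if R <= t)
            if sup == 0:
                continue  # R never occurs, so there is no bin to place it in
            S = closure(R, transactions) | Q  # S = R* union Q
            yield sup, R, S

transactions = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 4}]
P = frozenset({1})
Q = closure(P, transactions)
for sup, R, S in extend_interval(P, Q, transactions):
    print(sup, sorted(R), sorted(S))
```

Each yielded support tells which bin the new interval belongs to, so a caller could append `(R, S)` to a per-support list to mimic the segregation into bins.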


Comparison with Frequent Itemset Mining

The absolute minsups reached at each probe (probe times in hours) by the LCM and ALPINE algorithms on the T40I10D100K (left) and Kosarak (right) datasets (2)

         T40I10D100K          Kosarak
Probe    LCM      ALPINE      LCM      ALPINE
1        7314     116         10178    982
2        6390     56          9569     926
3        5855     34          9264     907
4        5317     22          8955     894
5        4873     16          8810     885
6        4499     13          8684     878
7        4168     11          8645     872
8        3882     10          8450     867
9        3575     9           8379     862
10       3313     8           8158     858

(2) http://fimi.ua.ac.be/data/


Comparison with Frequent Itemset Mining

The degree of completeness for each support at 1 hour, 3 hours, 6 hours and 10 hours on the T40I10D100K dataset.

[Figure: completeness degree (0 to 1) versus support (log scale, 10^4 down to 10^0), with one panel for LCM and one for ALPINE.]

Comparison with Frequent Itemset Mining

The computational overhead of ALPINE is minimal.

[Figure: running time (sec) versus minsup for alpine_all, lcm_all, alpine_closed and lcm_closed, in multiple panels.]

Comparison with sequential top-k mining

In the same execution time, ALPINE always obtains the complete set of itemsets above a lower minimum support than Seq-Miner [Minh et al., 2006].

Figure: Comparison with Seq-Miner (number of patterns versus time in seconds, for alpine and seq-miner, in three panels).

Conclusion

• ALPINE is, to our knowledge, the first algorithm to progressively mine itemsets and closed itemsets "support-wise".
• It guarantees that all itemsets with support exceeding the current checkpoint's support have been found before it proceeds further.
• It offers meaningful and complete intermediate results.
• Another critical advantage is that it does not require a minimum support threshold to be decided a priori.



Forever Mining

• Mining very wide data sets may be a continuous process that runs constantly.
• If something is "found", the user may be alerted.
• It offers a "progress meter" that the user may check at any time.

Our work is the first step towards this vision of mining being like time: it just passes no matter what, like a forever mining engine.



Thank you! Q&A



References

Rakesh Agrawal and Tomasz Imielinski. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington DC, USA, pages 207–216, 1993.

Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1–12, May 2000.

Quang Tran Minh, Shigeru Oyanagi, and Katsuhiro Yamazaki. Mining the k-most interesting frequent patterns sequentially. In Proceedings of the 7th International Conference on IDEAL, pages 620–628, 2006.

Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Trans. on Knowl. and Data Eng., 12(3):372–390, May 2000.

Shlomo Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17(3):73–83, 1996.

