ALPINE: Progressive Itemset Mining with Definite Guarantees
Qiong Hu (1,2), Tomasz Imielinski (1,2)
(1) Department of Computer Science, Rutgers University
(2) Center for Science of Information, Purdue University
SIAM Data Mining Conference (SDM), April 2017
Hu, Imielinski (Rutgers)
ALPINE
SDM 2017
1 / 16
Outline
1 Motivation
2 Problem Definition
3 The Proposed ALPINE Algorithm
4 Computational Experiments
5 Conclusion and Vision
Motivating Example
Experiments on synthetic datasets on aurora.cs (1)

• ntran = 1000, p = 0.2, minsup = 0.16

Table 1: Mining time increases drastically as the number of items increases.

items        Eclat
10,000       10 mins
50,000       4 hours
100,000      17 hours
500,000      16 days
1,000,000    2 months+

• ntran = 100,000, nitems = 1,000,000, maxtranlength = 50,000

Table 2: Mining time increases drastically as minsup decreases.

minsup    Eclat       FPGrowth
2567      15 hours    20 hours
2556      32 hours    43 hours
2548      2 days      3 days
2537      4 days      5 days+
2487      26 days     34 days

(1) Rutgers CS research cluster.
Eclat [Zaki, 2000], FP-Growth [Han et al., 2000]
Motivation
• Should you wait? How long can you wait?
• Can it provide some progress indicator during this long process?
• Is the best-so-far partial answer available anytime?
• Can you choose to stop or to go anytime?
Motivation
Key: frequent itemset mining + anytime analysis
Definition (Anytime Algorithms): Algorithms whose quality of results improves gradually as computation time increases [Zilberstein, 1996].
Problem Definition
Frequent Itemset Mining: Given a set of transactions, find all the itemsets having support ≥ minsup [Agrawal and Imielinski, 1993].
Progressive Itemset Mining: Given a set of transactions, find all the itemsets having support ≥ a certain minimum support threshold $minsup_t$ within an arbitrary time period t.
The goal of a progressive miner is to maximize the expected utility:
\[ \max \; \frac{1}{n} \sum_{j=1}^{n} U(P_j) \]
where $U(P_j)$ is the utility obtained at probe j and $\{P_j\}_{j=1}^{n}$ is a set of randomly selected probes in time.
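The objective above can be illustrated with a small sketch. It assumes, purely for illustration, that a miner's utility is a step function of time: each checkpoint is a hypothetical (completion time, utility) pair, and a probe receives the utility of the last checkpoint completed before it. The names `utility_at`, `expected_utility`, and all numbers are made up and do not come from the slides.

```python
import bisect

def utility_at(probe_time, checkpoints):
    """Utility available at probe_time: that of the last checkpoint
    completed at or before the probe, or 0.0 if none has completed."""
    times = [t for t, _ in checkpoints]
    idx = bisect.bisect_right(times, probe_time)
    return checkpoints[idx - 1][1] if idx > 0 else 0.0

def expected_utility(probes, checkpoints):
    """Average of U(P_j) over the n probes, i.e. (1/n) * sum_j U(P_j)."""
    return sum(utility_at(p, checkpoints) for p in probes) / len(probes)

# Hypothetical schedules (time in hours, utility in [0, 1]).
# A progressive miner finishes several intermediate support levels;
# a one-shot miner only reports once the whole task is done.
progressive = [(1.0, 0.2), (3.0, 0.5), (6.0, 0.8), (10.0, 1.0)]
one_shot = [(10.0, 1.0)]

# Hypothetical probe times; in the problem definition they are random.
probes = [0.5, 2.0, 4.0, 7.0, 10.0]

print(expected_utility(probes, progressive))
print(expected_utility(probes, one_shot))
```

On these made-up numbers the progressive schedule averages about 0.5 and the one-shot schedule about 0.2, which is the kind of gap the expected-utility objective is meant to reward.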
Problem Definition
[Figure: utility of an itemset miner (vertical axis, 0 to 1.0) against time t, with utility levels U(minsup_1), U(minsup_2), ..., U(minsup_k) and probes P_1, P_2, ..., P_n marked on the time axis.]
Progressive itemset mining. A long mining task is progressively divided into k sub-search spaces w.r.t. a set of decreasing minimum supports $\{minsup_i\}_{i=1}^{k}$. $\{P_j\}_{j=1}^{n}$ is a set of randomly selected probes in time.
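The idea of splitting one long task into sub-search spaces by decreasing minimum support can be sketched with a deliberately naive baseline. This is not the ALPINE algorithm: it simply re-mines from scratch at each level with a brute-force enumerator and reports only the itemsets that are new at that level. The function names and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, minsup):
    """Brute-force frequent itemset enumeration (toy data only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= minsup:
                result[candidate] = s
                found = True
        if not found:
            break  # support is anti-monotone: no larger itemset qualifies
    return result

def progressive_mine(transactions, minsups):
    """Yield (minsup, newly found itemsets) for decreasing minsup levels.
    After each yield, every itemset with support >= minsup has been seen."""
    seen = {}
    for minsup in sorted(minsups, reverse=True):
        current = frequent_itemsets(transactions, minsup)
        new = {i: s for i, s in current.items() if i not in seen}
        seen.update(new)
        yield minsup, new

# Toy transactions (hypothetical).
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
steps = list(progressive_mine(transactions, [4, 2, 1]))
for minsup, new in steps:
    print(minsup, sorted("".join(sorted(i)) for i in new))
```

Each yielded level is a complete answer for its threshold, so a user probing after any level gets a usable result. The repeated re-mining is the waste that a dedicated progressive miner is designed to avoid.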
The ALPINE Algorithm
Automatic minsup Lowering with Progress Indicator in practically Never-Ending mining
[Figure: bins b_1, b_2, ..., b_k, labeled with supports support_1, support_2, ..., support_k and pointers ptr_1, ptr_2, ..., ptr_k, each holding itemset intervals (I, I*).]
All itemset intervals (I, I*) of singleton itemsets are partitioned into different bins (b_1, ..., b_k, ...) based on their supports (support_1 > ... > support_k > ...).
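The initial binning step can be sketched as follows, under two assumptions that the slide does not spell out: I* is taken to be the closure of I (the intersection of all transactions containing I), and the bins are modeled as a plain dictionary keyed by support, without the pointer structure shown in the figure. The names `closure` and `bin_singletons` are hypothetical.

```python
from collections import defaultdict

def closure(itemset, transactions):
    """Assumed meaning of I*: items shared by every transaction containing I."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return frozenset(itemset)
    return frozenset.intersection(*covering)

def bin_singletons(transactions):
    """Group the interval (I, I*) of every singleton I by its support,
    returning bins ordered from the highest support to the lowest."""
    bins = defaultdict(list)
    for item in sorted(set().union(*transactions)):
        single = frozenset([item])
        sup = sum(1 for t in transactions if single <= t)
        bins[sup].append((single, closure(single, transactions)))
    return dict(sorted(bins.items(), reverse=True))

# Toy transactions (hypothetical).
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
bins = bin_singletons(transactions)
for sup, intervals in bins.items():
    print(sup, [("".join(sorted(i)), "".join(sorted(c))) for i, c in intervals])
```

Processing the bins in this order means the highest-support intervals are handled first, which matches the ordering support_1 > ... > support_k in the caption.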
ALPINE
SDM 2017
8 / 16
The ALPINE Algorithm
[Figure: bins b_i, b_{i+1}, ..., b_l, ..., b_h, labeled with supports support_i, support_{i+1}, ..., support_h and pointers ptr_i, ptr_{i+1}, ..., ptr_h; the figure shows the intervals (P, Q), (R, S) and (U, V) and the extension items j and k.]
Itemset interval (P, Q) is extended with every item j (j > tail(P) ∧ j ∉ Q) to an interval (R, S), where R = P ∪ {j} and S = R* ∪ Q (Lemma 3.1). The resulting intervals are progressively segregated into disjoint bins of different supports.
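A minimal sketch of this extension step follows. It rests on several assumptions not stated on the slide: items are integers under their natural order, tail(P) is the largest item of P, R* is the closure of R, and extensions with zero support are skipped. Lemma 3.1 itself is not reproduced here, and `extend_interval` is a hypothetical name.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def closure(itemset, transactions):
    """Assumed meaning of R*: items shared by every transaction containing R."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else frozenset(itemset)

def extend_interval(P, Q, transactions, items):
    """Extend (P, Q) with each item j > tail(P), j not in Q, forming
    R = P ∪ {j} and S = R* ∪ Q, and group the results by support of R."""
    tail = max(P)  # assumption: tail(P) is the largest item of P
    bins = {}
    for j in sorted(items):
        if j > tail and j not in Q:
            R = P | {j}
            sup = support(R, transactions)
            if sup == 0:
                continue  # sketch-only choice: drop itemsets that never occur
            S = closure(R, transactions) | Q
            bins.setdefault(sup, []).append((R, S))
    return bins

# Toy transactions over items 1..3 (hypothetical).
transactions = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({1, 3}), frozenset({1})]
items = {1, 2, 3}

print(extend_interval(frozenset({1}), frozenset({1}), transactions, items))
print(extend_interval(frozenset({2}), frozenset({1, 2}), transactions, items))
```

Because every extension has support no greater than that of its parent, the new intervals land in the current bin or a lower-support one, which is what allows bins to be finished in decreasing order of support.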
Comparison with Frequent Itemset Mining
The absolute minsups reached at probes (in hours) by the LCM and ALPINE algorithms on the T40I10D100K (left) and Kosarak (right) datasets (2)

         T40I10D100K         Kosarak
Probe    LCM     ALPINE      LCM      ALPINE
1        7314    116         10178    982
2        6390    56          9569     926
3        5855    34          9264     907
4        5317    22          8955     894
5        4873    16          8810     885
6        4499    13          8684     878
7        4168    11          8645     872
8        3882    10          8450     867
9        3575    9           8379     862
10       3313    8           8158     858

(2) http://fimi.ua.ac.be/data/
Comparison with Frequent Itemset Mining
The degree of completeness for each support at 1 hour, 3 hours, 6 hours and 10 hours on the T40I10D100K dataset.
[Figure: two panels, LCM and ALPINE, each plotting completeness degree (0 to 1) against support (log scale, 10^4 down to 10^0).]
Comparison with Frequent Itemset Mining
The computational overhead of ALPINE is minimal.
[Figure: six panels, each plotting time (sec) against minsup for alpine_all, lcm_all, alpine_closed and lcm_closed.]
Comparison with sequential top-k mining
In the same execution time, ALPINE always obtains the complete set of itemsets above a lower minimum support than Seq-Miner [Minh et al., 2006].
[Figure: three panels, each plotting number of patterns against time (sec) for alpine and seq-miner.]
Figure: Comparison with Seq-Miner
Conclusion
• ALPINE is, to our knowledge, the first algorithm to progressively mine itemsets and closed itemsets "support-wise".
• It guarantees that all itemsets with support exceeding the current checkpoint's support have been found before it proceeds further.
• It can offer meaningful and complete intermediate results.
• Another critical advantage is that it does not require a minimum support threshold to be decided a priori.
Forever Mining
Mining very wide data sets may be a continuous process that runs constantly.
• If something is "found", the user may be alerted.
• It offers a "progress meter" that the user may check at any time.
Our work is the first step towards this vision of mining being like time: it just passes no matter what, like a forever mining engine.
Thank you! Q&A
References
Rakesh Agrawal and Tomasz Imielinski. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington DC, USA, pages 207–216, 1993.
Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1–12, May 2000.
Quang Tran Minh, Shigeru Oyanagi, and Katsuhiro Yamazaki. Mining the k-most interesting frequent patterns sequentially. In Proceedings of the 7th International Conference on IDEAL, pages 620–628, 2006.
Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Trans. on Knowl. and Data Eng., 12(3):372–390, May 2000.
Shlomo Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17(3):73–83, 1996.