Mining Sequential Rules Based on Prefix-Tree

Thien-Trang Van1, Bay Vo1, and Bac Le2

1 Faculty of Information Technology, Ho Chi Minh City University of Technology, Vietnam
2 Faculty of Information Technology, University of Science, Ho Chi Minh, Vietnam
{vtttrang,vdbay}@hcmhutech.edu.vn, [email protected]

Abstract. We consider the problem of discovering sequential rules between frequent sequences in sequence databases. A sequential rule expresses a relationship between two series of events that happen one after another. Like sequential pattern mining, sequential rule mining has broad applications, such as the analysis of customer purchases, web logs, DNA sequences, and so on. In this paper, we propose two algorithms for mining sequential rules, MSR_ImpFull and MSR_PreTree. MSR_ImpFull is an improved version of Full (David Lo et al., 2009), and MSR_PreTree is a new algorithm that generates rules from frequent sequences stored in a prefix-tree structure. Both mine the complete set of rules but greatly reduce the number of passes over the set of frequent sequences, which reduces the runtime. Experimental results show that the proposed algorithms outperform the previous method on all kinds of databases.

Keywords: frequent sequence, prefix-tree, sequential rule, sequence database.

N.T. Nguyen et al. (Eds.): New Challenges for Intelligent Information, SCI 351, pp. 147–156. springerlink.com © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Mining sequential patterns from a sequence database was first introduced by Agrawal and Srikant in [1] and has been widely studied since [2-6, 10]. Sequential pattern mining finds all frequent sequences that satisfy a user-specified threshold called the minimum support (minSup). Sequential rule mining looks for relationships between occurrences of sequential events. A sequential rule is an expression of the form X→Y: if X occurs in a sequence of the database, then Y also occurs in that sequence, following X, with high confidence. Many kinds of rules have been studied for sequence databases, such as recurrent rules [8], sequential classification rules [14], sequential rules [7, 9], and so on. In this paper, we focus on sequential rule mining.

Within sequential rule mining, existing work addresses non-redundant sequential rules, but no real research has addressed mining the full set of sequential rules. If we focus on interestingness measures such as lift [12] or conviction [11], then the non-redundant rule approach [9] cannot be used. For this reason, we address the problem of generating the full set of sequential rules efficiently. Based on the description in [7], the authors of [9] generalized an algorithm for mining sequential rules, called Full. The key feature of this algorithm is that it requires multiple passes over the full set of frequent sequences. In this paper we present two algorithms, MSR_ImpFull and MSR_PreTree. The former is an improved version of Full. The latter is a new algorithm that efficiently mines all sequential rules based on a prefix-tree.

The rest of the paper is organized as follows. Section 2 presents related work, and Section 3 introduces the basic concepts related to sequences and some definitions used throughout the paper. In Section 4, we present the prefix-tree. The two proposed algorithms are presented in Section 5, and experimental results are reported in Section 6. We summarize our study and discuss some future work in Section 7.
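To make the notion of a sequential rule X→Y from the introduction concrete, the following minimal Python sketch computes the support of a sequence and the confidence of a rule on a toy database. It is an illustration only, not either of the proposed algorithms. It assumes the usual containment test (each itemset of the pattern is a subset of a distinct itemset of the data sequence, in order) and takes the confidence of X→Y to be sup(X followed by Y) / sup(X). The names `contains`, `support`, `confidence`, and `toy_db` are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
Sequence = List[Itemset]


def contains(seq: Sequence, pattern: Sequence) -> bool:
    """Return True if `pattern` is a subsequence of `seq`.

    Assumed definition: the itemsets of `pattern` can be matched, in order,
    to distinct itemsets of `seq` such that each is a subset of its match.
    Greedy leftmost matching is sufficient for this test.
    """
    pos = 0
    for itemset in pattern:
        # Advance to the first remaining itemset of seq that covers `itemset`.
        while pos < len(seq) and not itemset <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(db: List[Sequence], pattern: Sequence) -> int:
    """Number of sequences in `db` that contain `pattern`."""
    return sum(1 for seq in db if contains(seq, pattern))


def confidence(db: List[Sequence], pre: Sequence, post: Sequence) -> float:
    """Assumed confidence of the rule pre -> post: sup(pre ++ post) / sup(pre)."""
    sup_pre = support(db, pre)
    if sup_pre == 0:
        return 0.0
    return support(db, pre + post) / sup_pre


# Toy database of four sequences (hypothetical data).
toy_db: List[Sequence] = [
    [frozenset("a"), frozenset("b"), frozenset("c")],
    [frozenset("a"), frozenset("c")],
    [frozenset("ab"), frozenset("c")],
    [frozenset("b"), frozenset("a")],
]

X = [frozenset("a")]
Y = [frozenset("c")]
print(support(toy_db, X))        # 4: every sequence contains <a>
print(confidence(toy_db, X, Y))  # 0.75: <a c> occurs in three of the four
```

In this toy database the rule 〈a〉→〈c〉 would be reported for any minimum confidence threshold up to 0.75. A rule miner in the spirit of Full would evaluate such a ratio for every way of splitting a frequent sequence into a prefix and a postfix.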

2 Related Work

The study in [7] proposed generating the full set of sequential rules from the full set of frequent sequences and then removing some redundant rules in an additional post-mining phase. The authors of [9] investigated several rule sets based on compositions of different types of sequence sets, such as generators, projected-database generators, closed sequences, and projected-database closed sequences. They then proposed a compressed set of non-redundant rules generated from two types of sequence sets, LS-Closed and CS-Closed. The premise of a rule is a sequence in the LS-Closed set and the consequence is a sequence in the CS-Closed set. The authors proved that this compressed set of non-redundant rules is complete and tight, and they proposed an algorithm for mining it. For comparison, based on the description in [7], they also generalized the Full algorithm for mining the full set of sequential rules. In this paper, we are interested in mining the full set of sequential rules because non-redundant sequential rule mining considers only confidence. If alternative measures are used, that approach is not appropriate, as mentioned in Section 1.

3 Preliminary Concepts

Let I be a set of distinct items. An itemset is a subset of items (without loss of generality, we assume that the items of an itemset are sorted in lexicographic order). A sequence s = 〈s1 s2 … sn〉 is an ordered list of itemsets. The size of a sequence is the number of itemsets in the sequence. The length of a sequence is the number of items in the sequence. A sequence with length k is called a k-sequence. A sequence β = 〈b1 b2 … bm〉 is called a subsequence of another sequence α = 〈a1 a2 … an〉 if there exist integers 1 ≤ i1 < i2
