A Comparative Study of Spam and PrefixSpan Sequential Pattern Mining Algorithm for Protein Sequences Rashmi V. Mane Department of Technology, Shivaji University, Kolhapur, Maharashtra, India
[email protected]
Abstract. Sequential pattern mining is an efficient technique for discovering recurring structures or patterns from very large datasets. Many algorithms have been proposed for such mining. Broadly, these algorithms are classified into two categories: the pattern-growth approach and the Apriori-based candidate-generation approach. By introducing constraints such as a user-defined threshold, user-specified data, or a minimum gap or time, the algorithms perform better. In this paper we use a dataset of protein sequences and compare PrefixSpan, from the pattern-growth approach, with SPAM, an Apriori-based algorithm. The comparative study is carried out with respect to the space and time consumption of the algorithms. The study shows that SPAM with constraints outperforms PrefixSpan for very large datasets, but for smaller data PrefixSpan works better than SPAM. Keywords: Data Mining, Sequential Pattern Mining, SPAM, PrefixSpan.
1
Introduction
Discovery of sequential patterns from large datasets was first introduced by Agrawal [1]. Sequential pattern mining is one of the important fields in data mining because of its variety of applications in web access pattern analysis, market basket analysis, fault detection in networks, DNA sequences, etc. It plays a vital role in many different areas. Many algorithms have been proposed for sequential pattern mining. These mining methods are broadly classified into two approaches: the Apriori-based candidate generation method [2] and the pattern growth method [3]. The Apriori-based approach uses the Apriori principle presented with association mining [4], which states that no super sequence of a non-frequent pattern can be frequent. Such candidate generation and testing is carried out by GSP [2], SPADE [5], and SPAM [6]. These algorithms perform well but scan the database frequently and require a large search space. To increase the performance of these algorithms, constraint-driven discovery can be carried out. With the constraint-driven approach, the system concentrates only on user-specific or user-interested patterns, or on user-specified constraints such as
S. Unnikrishnan, S. Surve, and D. Bhoir (Eds.): ICAC3 2013, CCIS 361, pp. 147–155, 2013. © Springer-Verlag Berlin Heidelberg 2013
minimum support, minimum gap, or time interval. Together with regular expressions, these constraints were proposed in SPIRIT [7]. The pattern growth approach is used by the algorithms PrefixSpan [3] and FreeSpan [8]. These algorithms examine only prefix subsequences, and the corresponding postfix subsequences are projected into projected databases. They work faster than the Apriori-based GSP [2] on smaller databases without using any constraint. Sequential pattern mining algorithms efficiently discover the recurring structures present in protein sequences [11], [12]. In this paper we use large sets of protein sequences as the input to the algorithms. These protein databases are readily available from the Protein Data Bank [9]. The comparative study on these large protein sequences shows that the Apriori-based SPAM algorithm works better than the pattern growth algorithm PrefixSpan.
2
Related Work
For mining sequential patterns in very large databases, an algorithm called SPAM was proposed, using a bitmap representation. It uses a depth-first search strategy with an effective pruning mechanism. The algorithm uses a bitmap representation of the database for efficient support counting. It first generates a lexicographic sequence tree and traverses the tree in depth-first manner. To improve performance it uses two pruning techniques, S-step and I-step pruning, which are also Apriori-based. SPAM was compared with SPADE and PrefixSpan on datasets produced by the IBM AssocGen program [1]. The comparison shows that SPAM performs well on large datasets due to its bitmap representation of data for efficient counting; SPAM outperforms SPADE by a factor of 2.4 on small datasets. With respect to space, SPAM is less space-efficient than SPADE [6]. When SPAM is used with constraints such as gap and regular expression constraints, its runtime performance increases, and it also achieves a reduction in output size by adding different levels of constraints. Experimental study shows that at a high minimum support, greater than 30%, there is not much difference in runtime performance, but with a low minimum support, runtime performance improves. By introducing a gap constraint into SPAM, the output space can be reduced: by increasing mingap from 0 to 5, the output size is significantly reduced. This shows that adding constraints like gaps and regular expressions increases the efficiency of the SPAM sequential pattern mining algorithm [10].
3
Problem Definition
A sequence is an ordered list of itemsets, where an itemset is a non-empty, unordered set of items. For a given set of sequences, the task is to find the set of frequently occurring subsequences. A subsequence is said to be frequent if its support value is greater than or equal to the minimum support (threshold) value. The support of a pattern is the number of the given sequences in which that pattern occurs.
For example:

Table 1. Protein Sequences

Sequence No.  Sequence
1             MKKV
2             KVM
3             MKKVM
If the minimum support value is 50%, then those subsequences with support greater than or equal to minsup are called frequent subsequences. So, in the given sequences, MK, MKK, MKKV, KV, KKV, VM and KVM, among others, are frequent subsequences.
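The notions of subsequence and support can be illustrated with a short sketch. This is an illustrative example, not code from the paper, assuming each sequence is a string of single-item events:

```python
# Subsequence test and support counting for the sequences in Table 1.
# A subsequence may skip items but must preserve their order.

def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of sequences in the database containing `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

database = ["MKKV", "KVM", "MKKVM"]  # Table 1

minsup = 0.5
for pattern in ["MK", "MKKV", "KV", "KKV", "VM", "KVM"]:
    if support(pattern, database) >= minsup:
        print(pattern, round(support(pattern, database), 2))
```

With minsup = 50%, every pattern in the list above is printed: KV occurs in all three sequences, the rest in two of the three.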
4
SPAM Algorithm
SPAM is an Apriori-based sequential pattern mining algorithm. It assumes that the entire database used by the algorithm fits completely into memory. For the given sequences it generates a lexicographic tree [10]. The root of the tree always starts with the empty sequence, and its child nodes are formed as sequence-extended or itemset-extended sequences. For sequences containing the items M, K and V, generation of the lexicographic tree starts with the root node {}.
Fig. 1. Lexicographic tree for dataset in table1
At level 1 the items M, K and V are considered separately. Consider the item M first: for candidate generation, the sequence-extension step (S-step) generates the sequences ({M}{M}), ({M}{K}) and ({M}{V}), while the itemset-extension step (I-step) generates the itemset-extended sequences ({M,K}) and ({M,V}) at level 2. Similarly, with the different items, level 3 generates sequences of length 3. Once the whole tree is generated, to discover the user-specified set of subsequences the algorithm traverses it in depth-first search order. At each node, the support of the sequence is tested. Nodes whose support value is greater than or equal to the minimum support are stored, and the DFS repeats recursively on those nodes; otherwise the node is discarded, by the Apriori principle. For support counting the algorithm uses a vertical bitmap structure. For Table 1, the bitmap representation of the dataset is given below: since the maximum sequence length is 5, each vertical bitmap consists of 5 bits, and since there are 3 sequences in Table 1, there are 3 such bitmap slots.
Fig. 2. Vertical bitmap representation for items in dataset
For the bitmap transformation in the S-step, every bit up to and including the first 1 is set to zero, and every bit after that index position is set to one.
Fig. 3. Vertical bitmap representation for S-step with no gap constraint
For the bitmap representation of the I-step, the newly added item's bitmap is logically ANDed with the bitmap of the sequence generated from the S-step.
Fig. 4. Vertical bitmap representation for I-step
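The S-step and I-step bitmap operations can be sketched in a few lines. This is an illustrative example, not the paper's implementation, assuming single-item events and 5-bit vertical bitmaps per sequence as in Fig. 2:

```python
# SPAM-style vertical bitmap operations on a single sequence's bitmap,
# represented as a list of 0/1 bits (length 5, the longest sequence in Table 1).

def s_step_transform(bitmap):
    """S-step: zero out bits up to and including the first 1,
    then set every later position to 1 (the next item may appear anywhere after)."""
    out = [0] * len(bitmap)
    if 1 in bitmap:
        first = bitmap.index(1)
        for i in range(first + 1, len(bitmap)):
            out[i] = 1
    return out

def bit_and(a, b):
    """Bitwise AND of two vertical bitmaps (used by both S- and I-extension)."""
    return [x & y for x, y in zip(a, b)]

# Sequence 1 = MKKV (padded to length 5): per-item position bitmaps.
bm_M = [1, 0, 0, 0, 0]
bm_K = [0, 1, 1, 0, 0]

# S-extension of <M> by K: transform M's bitmap, then AND with K's bitmap.
print(bit_and(s_step_transform(bm_M), bm_K))  # -> [0, 1, 1, 0, 0]

# I-extension of ({M}) by K (same position): AND without transformation.
print(bit_and(bm_M, bm_K))  # -> [0, 0, 0, 0, 0]: M and K never co-occur here
```

A nonzero result bitmap means the extended sequence occurs in that sequence, so support counting reduces to counting sequences with at least one set bit.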
To improve the performance of the algorithm, pruning techniques are used with the S-extension and I-extension of a node. The S-step pruning technique prunes S-step children by applying the Apriori principle: for example, given the sequences ({M}{M}), ({M}{K}) and ({K}{V}), if ({M}{K}) is not frequent then ({M}{M}{K}), ({M}{V}{K}), ({M},{M,K}) and ({M},{V,K}) can be ignored. Similarly, the I-step pruning technique prunes I-step children by applying the Apriori principle to itemsets: given the itemset sequences ({M,K}) and ({M,V}), if ({M,V}) is not frequent then ({M,K,V}) is also not frequent. Constraints can be added as a minimum and maximum gap between two items. With mingap and maxgap constraints, the transformation step is modified to restrict the number of positions at which the next item can appear after the first item. If the first item is {M}, the next is {K}, and the constraint is mingap=1 and maxgap=1, then {MVK}, {MKK}, etc. are some of the matching sequences for the above dataset. Regular expressions can also be used to limit the number of interesting patterns: if M+K is given, then all those sequences containing M or K are obtained.
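The gap-constrained modification of the S-step transformation might be sketched as follows. This is a hedged illustration only; the exact bitmap semantics in the cited work [10] may differ:

```python
# Gap-constrained S-step transform: instead of setting every bit after the
# first 1, only positions between mingap+1 and maxgap+1 places after it are
# allowed for the next item.

def s_step_with_gap(bitmap, mingap, maxgap):
    out = [0] * len(bitmap)
    if 1 in bitmap:
        first = bitmap.index(1)
        lo = first + 1 + mingap   # earliest allowed position for the next item
        hi = first + 1 + maxgap   # latest allowed position for the next item
        for i in range(lo, min(hi, len(bitmap) - 1) + 1):
            out[i] = 1
    return out

# With mingap=1 and maxgap=1 the next item must appear exactly one position
# beyond the adjacent slot, as in {MVK} or {MKK} for the Table 1 dataset.
print(s_step_with_gap([1, 0, 0, 0, 0], 1, 1))  # -> [0, 0, 1, 0, 0]
```

With mingap=0 and an unbounded maxgap this reduces to the unconstrained S-step transformation described earlier.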
5
PrefixSpan Algorithm
This algorithm uses the pattern growth method for mining sequential patterns. The idea is that, instead of projecting the sequence database by considering all possible occurrences of frequent subsequences, the projection is based only on frequent prefixes. Let {MKKV} be a sequence; then prefixes of the sequence are <M>, <MK> and <MKK>. If <MK> is a prefix, then <M> is also a prefix of it. The algorithm first finds all length-1 sequences, i.e. <M>, <K> and <V>, with their respective supports. Depending on the number of frequent items present in the given sequences, the complete set of sequence patterns can be partitioned into that number of prefixes. For the above example, all patterns are partitioned into three different prefixes: <M>, <K> and <V>. Let <M> be the prefix chosen; then {-KKV} and {-KKVM} are the projected postfixes. From the given postfixes, all length-2 sequence patterns having the prefix <M>, such as <MK>, can be found with their supports. Similarly, the process repeats for length-3 patterns.

Table 2. Number of Sequence Patterns from Given Prefix and Postfix

No.  Prefix  Postfix                          Sequence Pattern
1    <M>     {-KKV}, {-KKVM}                  <M>, <MK>, <MKK>, <MKV>, <MKKV>, <MV>
2    <K>     {-KV}, {-V}, {-VM}, {-KVM}       <K>, <KK>, <KKV>, <KV>, <KVM>, <KM>
3    <V>     {-M}                             <V>, <VM>
The projected database keeps shrinking at every step. However, the major cost of PrefixSpan lies in the construction of the projected databases: if the number of sequence patterns is large, the cost increases considerably, since PrefixSpan constructs a projected database for every sequence pattern.
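The prefix projection described above can be sketched as follows. This is a minimal illustration for single-item sequences with an absolute support count, not the paper's implementation:

```python
# PrefixSpan-style mining: project the database on a prefix item, then count
# frequent items in the projected postfixes to grow the pattern recursively.

def project(database, item):
    """Postfix database: the suffix after the first occurrence of `item`."""
    return [s[s.index(item) + 1:] for s in database if item in s]

def prefixspan(database, minsup, prefix=""):
    """Return all (pattern, support) pairs with support >= minsup (a count)."""
    patterns = []
    items = {c for s in database for c in s}
    for item in sorted(items):
        sup = sum(item in s for s in database)
        if sup >= minsup:
            pat = prefix + item
            patterns.append((pat, sup))
            # Recurse on the projected (postfix) database for this prefix.
            patterns += prefixspan(project(database, item), minsup, pat)
    return patterns

database = ["MKKV", "KVM", "MKKVM"]  # Table 1
for pat, sup in prefixspan(database, minsup=2):
    print(pat, sup)
```

For minsup = 2 (i.e. 50% of three sequences, rounded up), the patterns with prefix <M> are exactly those in row 1 of Table 2.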
6
Experimental Results
To test the efficiency of the sequential pattern mining algorithms SPAM and PrefixSpan, we have taken protein sequences as the input dataset. The implementation is carried out as a Java standalone application. All experiments were performed on a 1.46 GHz Intel Pentium Dual CPU machine with 512 MB of main memory, Microsoft Windows XP, and J2SE Runtime Environment 1.5. The first experiment measures the time taken by the two algorithms with changing minimum support. The minimum support value was varied from 10 to 50 and the execution time of each algorithm was measured. Fig. 5 shows that the time required for SPAM is much less than for PrefixSpan. Here the text file contains 528 sequences.

Fig. 5. Runtime Performance with Varying Value for minsup
The second experiment checks the runtime by varying the number of sequences. Here the text file contains only 2 sequences.

Fig. 6. Runtime Performance with Varying Number of Sequences and minsup=50%
If the number of sequences is very small, PrefixSpan works better than SPAM. The third experiment measures the memory utilization of the two algorithms with changing minimum support. Again the minimum support value was varied from 10 to 50, and the memory utilization of both algorithms was measured in bytes. As the minimum support value increases, memory utilization also increases.

Fig. 7. Output Size Performance with Varying Value for minsup
The fourth experiment measures output size by keeping minimum support constant at 50% and changing the number of sequences. As the number of sequences increases, the algorithms require more memory. The major drawback of PrefixSpan is that a lot of memory is needed while constructing the postfix database for each frequently occurring prefix sequence.

Fig. 8. Output Size Performance with Varying Number of Sequences
7
Conclusion
Discovering frequently occurring sequential patterns from a large dataset is possible with two approaches of mining algorithms: Apriori-based or pattern growth. The experimental study shows that PrefixSpan performs better for a smaller number of sequences. This is because PrefixSpan does not have to spend time creating vertical bitmaps of the items within the sequences, whereas SPAM takes more time even when the number of sequences is small. SPAM outperforms PrefixSpan for larger datasets. This is because of the depth-first traversal of the tree, the vertical bitmap representation for support counting, and the pruning techniques used during candidate sequence generation. Though the memory utilization of SPAM is higher, it is more time efficient than pattern growth approaches such as PrefixSpan.
References

1. Agrawal, R., Srikant, R.: Mining Sequential Patterns. In: Yu, P.S., Chen (eds.) Eleventh International Conference on Data Engineering (ICDE 1995), Taipei, Taiwan, pp. 3–14. IEEE Computer Society Press (1995)
2. Srikant, R., Agrawal, R.: Mining Sequential Patterns: Generalizations and Performance Improvements. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 3–17. Springer, Heidelberg (1996)
3. Pei, J., Han, J., Mortazavi-Asl, B., et al.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In: ICDE 2001, Heidelberg, Germany, pp. 215–224 (2001)
4. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proceedings of the International Conference on Very Large Data Bases (VLDB 1994), Santiago, Chile, pp. 487–499 (1994)
5. Zaki, M.: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning 40, 31–60 (2000)
6. Ayres, J., Gehrke, J., Yiu, T., Flannick, J.: Sequential Pattern Mining Using a Bitmap Representation. In: Proceedings of ACM SIGKDD 2002, pp. 429–435 (2002)
7. Zaki, M.: Sequence Mining in Categorical Domains: Incorporating Constraints. In: Proceedings of CIKM 2000, pp. 422–429 (2000)
8. Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., Hsu, M.-C.: FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining. In: Proceedings of the 2000 International Conference on Knowledge Discovery and Data Mining, pp. 355–359 (2000)
9. http://www.pdb.org, http://www.ebi.ac.uk/pdbc, http://www.rcsb.org, http://www.pdbj.org
10. Ho, J., Lukov, L., Chawla, S.: Sequential Pattern Mining with Constraints on Large Protein Databases. In: ICMD (2005)
11. Tao, T., Zhai, C.X., Lu, X., Fang, H.: A Study of Statistical Methods for Function Prediction of Protein Motifs
12. Wang, M., Shang, X.-Q., Li, Z.-H.: Sequential Pattern Mining for Protein Function Prediction. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds.) ADMA 2008. LNCS (LNAI), vol. 5139, pp. 652–658. Springer, Heidelberg (2008)