ApproxMGMSP: A Scalable Method of Mining Approximate Multidimensional Sequential Patterns on Distributed System
Changhai Zhang, Kongfa Hu, Zhuxi Chen, Ling Chen
Department of Computer Science and Engineering, Yangzhou University, Yangzhou 225009, China
Yisheng Dong
Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Abstract
We present a scalable and effective algorithm called ApproxMGMSP (Approximate Mining of Global Multidimensional Sequential Patterns) for mining multidimensional sequential patterns from large databases in a distributed environment. Our method differs from previous work on mining multidimensional patterns on distributed systems. The main difference is that an approximate mining method is applied to large multidimensional sequence databases for the first time. To convert the mining of multidimensional sequential patterns into the mining of ordinary sequential patterns, the multidimensional information is embedded into the corresponding sequences. The sequences are then clustered, summarized, and analyzed on the distributed sites, and the local patterns are obtained by an effective approximate sequential pattern mining method. Finally, after all the local patterns have been collected on one site, the global multidimensional sequential patterns are mined quickly using a high-vote sequential pattern model. Both theoretical analysis and experiments indicate that this method simplifies the problem of mining multidimensional sequential patterns and avoids mining redundant information. The scalable method obtains the global sequential patterns effectively while reducing the cost of communication.
1. Introduction
Sequential pattern mining has become an essential data mining task with broad applications, including web log analysis, market and customer analysis, pattern discovery in protein sequences, and mining XML query access patterns for caching. However, mining multidimensional sequential patterns can extract more useful information than mining sequential patterns alone. At present, databases and data warehouses hold such huge amounts of data that data mining on a single PC is not very effective; in particular, a PC cannot meet the functional and performance requirements of the data processing involved. In practice, most large information systems are distributed, for example the data access of large interregional shopping markets. Distributed multidimensional pattern mining is therefore proposed to deal with this problem.
Much research on multidimensional sequential pattern mining has been carried out, including the well-known algorithms UniSeq, PSFP and HYBRID[1]. However, the overall performance of these algorithms is not high when mining global multidimensional patterns from large amounts of data scattered across a distributed environment, so the issue can only be solved by distributed or parallel data mining technology. In 2003, S.C. Zhang proposed the technique of distributed multi-database mining[2] to resolve the problem, and methods for global association rule mining[3] and exceptional sequential pattern mining[4] across different data sources were subsequently proposed. More recently, H.C. Kum proposed a method for mining global sequential patterns[5] in multi-databases.
Traditional sequential pattern mining methods find all the patterns that satisfy a user-specified minimum support threshold; well-known examples include GSP[6], PrefixSpan[7] and SPADE[8]. However, these support-based sequential pattern mining algorithms have some inherent limitations. We therefore propose a novel method of mining approximate multidimensional sequential patterns on a distributed system. Our experiments indicate that the method simplifies the process of mining multidimensional sequential patterns and handles the problem of high dimensionality effectively. The global multidimensional sequential patterns can be obtained effectively by reducing redundant information.
2. Problem formulation
Assume that there are n sites S1, S2, …, Sn in the distributed environment and that the multidimensional sequence database MSDB is partitioned over the n sites into {MSDB1, MSDB2, …, MSDBn}, respectively. Assume also that the independent computers at the sites can communicate with each other. Let MSDB = (TID, A1, …, Am, S) be the schema of a multidimensional sequence database, where TID is a primary key, A1, …, Am carry the multidimensional information, and S holds the sequences. Let * denote any value in the domain of any of A1, …, Am. A multidimensional sequence takes the form (a1, …, am, s), where ai ∈ (Ai ∪ {*}) for 1 ≤ i ≤ m and s is a sequence.
Definition 1. Given a local sequence database DBx, let dist(seqi, seqj) be the distance measure for seqi and seqj (0