An Intelligent Approach for Mining Frequent Patterns in Spatial Database System Using SQL

Animesh Tripathy, Associate Professor, School of Computer Engineering, KIIT University, INDIA
Subhalaxmi Das, Lecturer, Department of CSE, CET, BPUT, INDIA
Prashanta Kumar Patra, Professor, Department of CSE, CET, BPUT, INDIA

Abstract—Mining frequent patterns from spatial database systems has always been a challenge for researchers. The performance of SQL-based spatial data mining is known to fall behind specialized implementations because of the prohibitive cost of extracting knowledge and the lack of suitable declarative query language support. In this paper, we propose an SQL-based enhancement of an existing mining algorithm for efficiently finding frequent patterns of spatial objects occurring in space. The proposed algorithm is termed Frequent Positive Association Rule/Frequent Negative Association Rule (FPAR/FNAR) and is an improvement of the FP-growth algorithm. We further propose an enhancement of the improved algorithm: an SQL-based numerical method for generating frequent patterns, known as the Transaction Frequent Pattern (TFP) tree, which reduces the storage space of the spatial dataset and overcomes some limitations of the previous method.

Index Terms—Spatial Database, Frequent Patterns, Association Rule Mining.

I. INTRODUCTION

Nowadays, spatial data mining [1] is a well-defined domain of data mining. It can be described as the discovery of interesting, implicit and previously unknown knowledge from spatial databases. Several data mining techniques have been applied to discover knowledge from spatial databases. In particular, Association Rule Mining (ARM) discovers spatial relationships and infers valid, novel, useful and understandable patterns for the generation of rules [2][3].

Extensive efforts have been devoted to developing efficient algorithms for mining frequent patterns, and several algorithms adopting the candidate generate-and-test approach have been proposed. The Apriori algorithm [2] was the first to use the Apriori property to prune the search space. A hash-based algorithm [8] reduces the number of candidate patterns. The pattern growth approach [4][5] tries to avoid candidate patterns by constructing conditional databases for frequent patterns. The proposed algorithms differ mainly in how they represent the conditional databases; FP-growth uses a compact data structure, the FP-tree [6], which combines a prefix-tree structure with node-links. The FP-growth algorithm [6] is not efficient because each node still needs to maintain a couple of pointers, which incurs a large memory requirement, and its main memory consumption is usually hard to predict precisely.

A variant of FP-growth is the FPAR/FNAR algorithm. It shows significant performance improvements over FP-growth by constructing a compressed tree and generating a set of generalized strong association rules for valid patterns, with more valid rules generated than invalid ones [7].

The problem of finding all frequent object sets can be solved in two ways: using algorithms that employ sophisticated in-memory data structures, where the data is stored in and retrieved from flat files; or using algorithms that are based on SQL statements and extensions to query and update a database. In a spatial database, where a large number of spatial objects are stored, it is quite hard to copy the data out and run in-memory programs to find frequent object sets. One may want to avoid one-at-a-time record retrieval from the database, saving both the copying cost and the process context switching cost [7]. Therefore, we present a procedural schema for mining all frequent patterns that takes advantage of database capabilities with appropriate SQL extensions for better performance. One of the advantages of SQL-based mining algorithms is fast and easy development, since they are declaratively specified as a set of SQL queries.

In this paper, we use Structured Query Language (SQL) for frequent pattern mining. A sample dataset of spatial objects has been used to mine a frequent pattern tree using the proposed numerical method, known as TFP.

II. PROPOSED FRAMEWORK

In this section we propose a framework to mine frequent patterns of spatial objects that are situated close to each other, for a given sample space of Indian cities. The presence of each spatial object is recorded against each city as a transaction. The spatial objects in each transaction are arranged so that higher-ranked objects come first; equally ranked objects are sorted in lexicographic order. The threshold values of minimum support and confidence then determine the valid association rules. The analysis is based on frequent spatial objects from the Map Database and the correlation between those objects. The layout of the proposed framework is given in Fig. 1.

978-1-4673-0449-8/12/$31.00 ©2012 IEEE

The proposed approach can be described as a sequence of processes. First, it extracts the spatial objects and their frequencies of occurrence from the Map Database and builds the Sample Spatial Object Datasets. Next, the frequent ordered list is given a numerical representation: each object in a transaction is represented by a prime number, and each transaction is represented by the product of its primes, which compresses the transaction. The next step constructs a TFP tree from the numerical representation of each transaction. The final step finds frequent spatial patterns with their respective support counts by intersecting the numerical ordered lists.

Fig. 1. Proposed Framework (stages shown: Sample Spatial Database, Extract Frequency Count, Sample Spatial Object Datasets, Frequent Ordered List, Numerical Representation of Ordered List, TFP Tree, Frequent Spatial Patterns)

A. Analysis of Proposed Algorithm

Based on the proposed algorithm, we performed a test on a sample real-time database of 250 Indian cities, taken as a reference, to validate the proposed framework in our study of spatial database systems. TABLE I shows a sample of 7 spatial objects for 12 Indian cities. The spatial objects are Museum (A), Zoo (B), Lake (C), Monument (D), River (E), Forest (F), and Hill (G). We have assumed a minimum support count greater than 4. The analysis has the following steps, as per the framework discussed above.

Step 1: Obtain the Sample Spatial Dataset.
Step 2: Build the ordered list of objects in descending order of their frequencies.
Step 3: Map the ordered list to its numerical representation.
Step 4: Build a TFP tree using the numerical representation.
Step 5: Find frequent patterns and validate them against their respective support counts.

TABLE I SAMPLE SPATIAL DATASET

TID | Reference City | Positive Object
1 | Bhubaneswar | C,D,A,B,E,G
2 | Bangalore | B,F,A,C,D
3 | Ajmer | A,C,D,E
4 | Mumbai | A,C,D,B,E,G
5 | Chandigarh | B,F,A,C
6 | Trivandrum | C,A,B
7 | Delhi | D,B,F,E,A
8 | Ahmadabad | A,C,D,B,E
9 | Pune | A,C,D,B,E,G
10 | Mysore | C,D,B,F,G
11 | Nagpur | A,C,D,B
12 | Patna | E,A,D,B

Each transaction is scanned once to find the frequent objects. For example, TID 1 has the transaction list {C, D, A, B, E, G}. Any object whose count does not exceed the minimum support is pruned, so TID 1 can be viewed as a new transaction {A, B, C, D, E}, whose objects are arranged in descending order of their frequencies. The TFP-tree is based on properties of prime numbers and makes use of both data compression and pruning techniques to enhance efficiency. The ordered list of each transaction is mapped, using a prime-based data transformation technique, to a product value, which reduces the size of the transaction database as shown in TABLE II.
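Steps 1 and 2 can be illustrated with a minimal Python sketch. It transcribes the transactions of TABLE I and applies the "count greater than 4" rule; the function and variable names are hypothetical and not taken from the paper, whose own implementation is in SQL.

```python
from collections import Counter

# Transactions transcribed from TABLE I (TID -> objects as recorded).
TRANSACTIONS = {
    1: "CDABEG", 2: "BFACD", 3: "ACDE", 4: "ACDBEG",
    5: "BFAC", 6: "CAB", 7: "DBFEA", 8: "ACDBE",
    9: "ACDBEG", 10: "CDBFG", 11: "ACDB", 12: "EADB",
}
MIN_SUPPORT = 4  # objects are kept only if their count is greater than this


def frequent_objects(transactions, min_support):
    """Return (frequent objects by descending count, ties lexicographic; all counts)."""
    counts = Counter(obj for objs in transactions.values() for obj in objs)
    kept = [obj for obj, count in counts.items() if count > min_support]
    return sorted(kept, key=lambda obj: (-counts[obj], obj)), counts


def ordered_lists(transactions, order):
    """Drop infrequent objects and rewrite each transaction in the global order."""
    rank = {obj: i for i, obj in enumerate(order)}
    return {
        tid: sorted((obj for obj in objs if obj in rank), key=rank.get)
        for tid, objs in transactions.items()
    }


ORDER, COUNTS = frequent_objects(TRANSACTIONS, MIN_SUPPORT)
ORDERED = ordered_lists(TRANSACTIONS, ORDER)
```

With this data, Forest (F) and Hill (G) each occur four times and are pruned, and the transaction for TID 1 becomes A, B, C, D, E.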

TABLE II NUMERICAL REPRESENTATION OF SPATIAL DATASET

TID # | Reference City | Ordered Dense Object | Prime No. Representation | Product Value
1 | Bhubaneswar | A,B,C,D,E | 11,7,5,3,2 | 2310
2 | Mumbai | A,B,C,D,E | 11,7,5,3,2 | 2310
3 | Ahmadabad | A,B,C,D,E | 11,7,5,3,2 | 2310
4 | Pune | A,B,C,D,E | 11,7,5,3,2 | 2310
5 | Bangalore | A,B,C,D | 11,7,5,3 | 1155
6 | Nagpur | A,B,C,D | 11,7,5,3 | 1155
7 | Delhi | A,B,D,E | 11,7,3,2 | 462
8 | Patna | A,B,D,E | 11,7,3,2 | 462
9 | Ajmer | A,C,D,E | 11,5,3,2 | 330
10 | Chandigarh | A,B,C | 11,7,5 | 385
11 | Trivandrum | A,B,C | 11,7,5 | 385
12 | Mysore | B,C,D | 7,5,3 | 105

We also store the product value for each ordered list. Let us examine this through an example. Take entry #1 of the sample dense database, where Bhubaneswar is the reference city and the corresponding spatial objects are {A, B, C, D, E}. Prime numbers are assigned in decreasing order: [(A:11), (B:7), (C:5), (D:3), (E:2)]. The ordered list {A, B, C, D, E} is therefore mapped to {11, 7, 5, 3, 2}, and the transaction is represented by the product of these primes, i.e. 2310 = 11 × 7 × 5 × 3 × 2. The product value of each transaction is then used to construct the TFP-tree shown in Fig. 2.
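The prime-based mapping can be sketched in a few lines of Python. The `encode` and `decode` helpers are hypothetical names introduced for illustration; only the prime assignment itself comes from the text. Decoding works because each object has a distinct prime, so an object belongs to a transaction exactly when its prime divides the product.

```python
from math import prod

# Prime assignment from the text: most frequent object gets the largest prime.
PRIMES = {"A": 11, "B": 7, "C": 5, "D": 3, "E": 2}


def encode(objects):
    """Map an ordered object list to its product value."""
    return prod(PRIMES[obj] for obj in objects)


def decode(product):
    """Recover the ordered object list from a product value."""
    return [obj for obj, prime in PRIMES.items() if product % prime == 0]
```

For example, `encode("ABCDE")` gives 2310 and `decode(330)` gives A, C, D, E, matching the entry for Ajmer in TABLE II.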

A TFP-tree consists of a root node and child nodes, each holding a product value and a count. A product value is inserted either as a child of the root or as a new descendant of an existing node, with its count set to 1. If the product value is equal to that of the current node, only the count of the current node is increased by 1. For example, entries #1, #2, #3 and #4 of the spatial dataset (TABLE II) have equal product values, so the count of node 2310 becomes 4. When entry #9 is taken, its product value 330 is not equal to 2310 (although it divides 2310), so 330 is added as a new descendant of node 2310. Similarly, for every successive insertion, the product value of the new node is examined against the product values of the existing nodes. The final TFP tree for the spatial dataset is shown in Fig. 2.

Root
  2310:4
    330:1
    462:2
    1155:2
      105:1
      385:2

Fig. 2. TFP Tree
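The insertion rule can be sketched as follows. This is one possible reading of the description, not the authors' implementation: it assumes product values arrive in descending order, that a value descends into the first child it divides, and that a new child is created when no such child exists. The `Node`, `insert` and `paths` names are hypothetical.

```python
class Node:
    """A TFP-tree node holding a product value and an occurrence count."""

    def __init__(self, product):
        self.product = product
        self.count = 0
        self.children = []


def insert(root, product):
    """Insert one transaction's product value, starting from the root."""
    node = root
    while True:
        if node.product == product:
            node.count += 1  # same pattern seen again
            return node
        # Assumption: descend into the first child whose product is divisible
        # by the new value (the new pattern is a subset of that child's pattern).
        nxt = next((c for c in node.children if c.product % product == 0), None)
        if nxt is None:
            child = Node(product)
            child.count = 1
            node.children.append(child)
            return child
        node = nxt


def paths(node, prefix=()):
    """Yield (path of product values, count) for every node below `node`."""
    for child in node.children:
        path = prefix + (child.product,)
        yield path, child.count
        yield from paths(child, path)


# Product values from TABLE II, inserted in descending order.
PRODUCTS = [2310] * 4 + [1155] * 2 + [462] * 2 + [385] * 2 + [330, 105]
ROOT = Node(None)
for value in sorted(PRODUCTS, reverse=True):
    insert(ROOT, value)
TREE_PATHS = dict(paths(ROOT))
```

On the TABLE II data this reproduces the shape of Fig. 2: node 2310 has children 1155, 462 and 330, and node 1155 has children 385 and 105.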

To mine the frequent patterns from the TFP-tree, we traverse from the root along each branch of the tree. Taking each branch as a new entry in an array list, we find the set of prime numbers for the product value of each node. Then, by intersecting the prime number sets of the nodes' product values, we find the frequent patterns. For example, consider Fig. 2 and take the (2310, 330) branch as a new entry in an array list (say TFP-1), as shown in Fig. 3.

Array | Product value | Prime numbers
TFP-1 | 2310 | 11,7,5,3,2
TFP-1 | 330 | 11,5,3,2
TFP-2 | 2310 | 11,7,5,3,2
TFP-2 | 462 | 11,7,3,2
TFP-3 | 2310 | 11,7,5,3,2
TFP-3 | 1155 | 11,7,5,3
TFP-3 | 105 | 7,5,3
TFP-4 | 2310 | 11,7,5,3,2
TFP-4 | 1155 | 11,7,5,3
TFP-4 | 385 | 11,7,5

Fig. 3. TFP Array List
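The intersection step for a branch such as (2310, 330) can be sketched as below. Because every product value here is a product of distinct primes, the primes common to all nodes of a branch are exactly the prime factors of the greatest common divisor of the branch; using `gcd` is a shortcut chosen for this sketch, not something stated in the paper. Listing patterns as prefixes of length two or more is likewise an interpretation of the worked example, and all function names are hypothetical.

```python
from functools import reduce
from math import gcd

# Prime assignment used throughout the example.
PRIMES = {"A": 11, "B": 7, "C": 5, "D": 3, "E": 2}


def common_objects(branch):
    """Objects whose primes divide every product value on the branch."""
    shared = reduce(gcd, branch)
    return [obj for obj, prime in PRIMES.items() if shared % prime == 0]


def prefix_patterns(objects):
    """Patterns read off as prefixes (length >= 2) of the ordered object list."""
    return [tuple(objects[:k]) for k in range(2, len(objects) + 1)]


TFP1_OBJECTS = common_objects((2310, 330))
TFP1_PATTERNS = prefix_patterns(TFP1_OBJECTS)
```

For TFP-1 the common objects are A, C, D, E, and the prefixes are (A, C), (A, C, D) and (A, C, D, E).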

Intersecting the prime-based representations in TFP-1 gives {11, 5, 3, 2} as the common prime factors. All possible frequent patterns are therefore (11, 5), (11, 5, 3) and (11, 5, 3, 2), where each prime number represents a spatial object; the corresponding patterns {Museum, Lake}, {Museum, Lake, Monument} and {Museum, Lake, Monument, River} are correlated with each other. To generate frequent patterns we assume the lexicographic ordering of spatial objects. Similarly, for every successive insertion of a new array, its pattern is examined against the existing branch values to generate the consolidated list of frequent patterns.

III. SQL-BASED

There are a few SQL-based approaches that can be used to mine frequent patterns [8, 9]. However, all of them are Apriori-like. Another approach uses FP-growth [10], but reconstructing conditional FP tables for large datasets may pose performance issues. We therefore need to avoid the two bottlenecks of the previously mentioned implementations: candidate set generation and test, and FP table reconstruction. Mining performance can be improved if one can avoid computing each frequent object individually.

The difference between the FPAR/FNAR approach and the FP approach lies in the construction of the conditional tree. The FPAR/FNAR approach takes the full set of transactions in the spatial dense dataset, finds the patterns for each transaction, and appends them to a list that stores the unique set of all patterns existing in the transactions. Once the list is complete with all unique spatial object patterns, the frequency of occurrence of each pattern is also appended to the list. In the TFP approach, prime numbers are additionally mapped onto the existing spatial object dataset. After the prime number mapping is stored in the TFP table, the product value for each transaction is computed. This product value is then used in constructing the TFP tree.

A. Construction of FPAR/FNAR Tree

The process of constructing the table FPAR is as follows:

1. Create table ORDER-LIST, in which infrequent objects are excluded and frequent ones are sorted in descending order by frequency, i.e. the frequent 1-object sets, using an SQL query.

    create table ORDER-LIST as
    select a.tid, a.objects, s.count
    from DATASET a,
      ((select objects, count(*) count
        from DATASET a
        group by objects
        having count(*) > minimum support
        order by count(*) desc, objects) s)
    where a.objects = s.objects
    order by a.tid, s.count desc, a.objects

2. Construct the table FPAR (shown in TABLE III). The PL/SQL procedure for constructing the table FPAR is shown below.

    Input: Transaction table ORDER-LIST.
    Output: FPAR table
    Procedure:
    for objects with the identical tid in table ORDER-LIST
      insert into the table FPAR
      count=0; path=null;
      for each transaction[i]
        if (list[i]>1)
          object=list;
          insert object into table FPAR
          count++;
          path+= object;
        else
          for each transaction
            if FPAR has not an object=list[i]
              insert list[i] into table FPAR
              set count=1;
              update table FPAR path+= object;

For each transaction, the frequent object pattern of the current transaction is inserted into the tree. For each successive transaction, if an object is not found as a node in the tree, a new object node is created and assigned a frequency of one. Otherwise the frequency of the matching child node is increased by one.

    Insert into FPAR values (tid, object, count, path);

TABLE III FPAR TABLE

TID | Object | Count | Path
1 | AB | 10 | AB
1 | C | 8 | ABC
1 | D | 6 | ABCD
1 | E | 4 | ABCDE
3 | ACDE | 1 | ACDE
7 | D | 2 | ABD
7 | E | 2 | ABDE
10 | BCD | 1 | BCD

B. Mining Frequent Patterns from FPAR/FNAR

After the table FPAR has been constructed, it can be used to efficiently mine the complete set of frequent spatial patterns. For each frequent object 'i', its conditional pattern base table FPAR-PATTERN (shown in TABLE IV) is constructed. It has two attributes (path, count), where path denotes the pattern and count denotes the number of occurrences of the pattern.

    create table FPAR-PATTERN as
    select a1.path,...,ak.path, count
    from FPAR
    where object= a1.object...ak.object.

TABLE IV FPAR-PATTERN TABLE

Path | Count
ABCDE | 4
ACDE | 1
ABDE | 2

C. Construction of TFP Tree

The process of creating the table TFP is as follows:

1. Transfer transaction ORDER-DN into RESULT-PN, in which infrequent objects are excluded and frequent ones are sorted in descending order by frequency, i.e. the frequent 1-object sets, using a PL/SQL query.

    select object, count(*)
    from ORDER-DN
    group by object
    having count(*)>minsupp
    order by count(*) desc, object;

After the spatial objects are sorted in descending order, prime numbers are assigned to these objects in order to calculate the product value for each transaction, as shown in TABLE V. These product values are used later in creating the TFP tree.

    insert into RESULT-PN values (tid, object, prime, product);

TABLE V RESULT-PN TABLE

TID | Object | Prime | Product
1 | A,B,C,D,E | 11,7,5,3,2 | 2310
2 | A,B,C,D | 11,7,5,3 | 1155
3 | A,C,D,E | 11,5,3,2 | 330
4 | A,B,C,D,E | 11,7,5,3,2 | 2310
5 | A,B,C | 11,7,5 | 385
6 | A,B,C | 11,7,5 | 385
7 | A,B,D,E | 11,7,3,2 | 462
8 | A,B,C,D,E | 11,7,5,3,2 | 2310
9 | A,B,C,D,E | 11,7,5,3,2 | 2310
10 | B,C,D | 7,5,3 | 105
11 | A,B,C,D | 11,7,5,3 | 1155
12 | A,B,D,E | 11,7,3,2 | 462

2. After the RESULT-PN table has been constructed, the frequent ordered lists are arranged in descending order of their product values. The next step is to construct the TFP table (TABLE VI). The PL/SQL procedure for constructing the table TFP is shown below.

    Input: Transaction table RESULT-PN.
    Output: TFP Table
    Procedure:
    for objects with the identical tid, object, prime, product in table RESULT-PN
      insert into the table TFP
      count=1; path=null;
      for product in each transaction
        if product[i] MOD product[i+1] = 0
          insert product into table TFP
          count++;
          path+=product;
        else
          insert product into table TFP
          count++;
          path+= product;

For each transaction, the frequent object pattern of the current transaction is inserted into the tree. The tree consists of a root node and child nodes; a product value is inserted either as a child of the root or as a new descendant, with its count set to 1. If the product value is equal to that of the current node, only the count of the current node is increased by 1; otherwise a new descendant is created. Similarly, for every successive insertion, the product value of the new node is examined against the product values of the existing nodes.
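The SQL side of this construction can be tried end to end with SQLite from Python. This is a hedged sketch, not the paper's Oracle 10g code: table names use underscores (`order_list`, `result_pn`) because hyphenated names are not plain SQL identifiers, the column `cnt` replaces `count`, the data is transcribed from TABLE I, and the `first_primes` helper is hypothetical.

```python
import sqlite3
from math import prod

# One (tid, object) row per recorded object, transcribed from TABLE I.
ROWS = {
    1: "CDABEG", 2: "BFACD", 3: "ACDE", 4: "ACDBEG",
    5: "BFAC", 6: "CAB", 7: "DBFEA", 8: "ACDBE",
    9: "ACDBEG", 10: "CDBFG", 11: "ACDB", 12: "EADB",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (tid INTEGER, object TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(tid, obj) for tid, objs in ROWS.items() for obj in objs],
)

# Keep only frequent objects (count > 4) and attach their frequency.
conn.execute("""
    CREATE TABLE order_list AS
    SELECT a.tid, a.object, s.cnt
    FROM dataset a
    JOIN (SELECT object, COUNT(*) AS cnt
          FROM dataset
          GROUP BY object
          HAVING COUNT(*) > 4) s ON a.object = s.object
""")

FREQUENT = [row[0] for row in conn.execute(
    "SELECT DISTINCT object, cnt FROM order_list ORDER BY cnt DESC, object")]


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Largest prime goes to the most frequent object, as in the worked example.
PRIME_OF = dict(zip(FREQUENT, reversed(first_primes(len(FREQUENT)))))

by_tid = {}
for tid, obj in conn.execute(
        "SELECT tid, object FROM order_list ORDER BY tid, cnt DESC, object"):
    by_tid.setdefault(tid, []).append(obj)

conn.execute(
    "CREATE TABLE result_pn (tid INTEGER, object TEXT, prime TEXT, product INTEGER)")
for tid, objs in by_tid.items():
    primes = [PRIME_OF[o] for o in objs]
    conn.execute(
        "INSERT INTO result_pn VALUES (?, ?, ?, ?)",
        (tid, ",".join(objs), ",".join(map(str, primes)), prod(primes)),
    )

RESULT = conn.execute("SELECT tid, product FROM result_pn ORDER BY tid").fetchall()
```

On this data the query keeps objects A to E, and the product column agrees with TABLE V.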

    insert into TFP values (tid, object, prime, product, count, path);

TABLE VI TFP TABLE

TID | Object | Prime | Product | Count | Path
1 | A,B,C,D,E | 11,7,5,3,2 | 2310 | 4 | 2310
2 | A,B,C,D | 11,7,5,3 | 1155 | 2 | 2310,1155
12 | A,B,D,E | 11,7,3,2 | 462 | 2 | 2310,462
5 | A,B,C | 11,7,5 | 385 | 2 | 2310,1155,385
3 | A,C,D,E | 11,5,3,2 | 330 | 1 | 2310,330
10 | B,C,D | 7,5,3 | 105 | 1 | 2310,1155,105

D. Mining Frequent Patterns from TFP

After the table TFP has been constructed, it can be used to efficiently mine the complete set of frequent spatial patterns. As in the FP-growth and FPAR/FNAR approaches, for each frequent object 'i', its conditional pattern base table TFP-PATTERN is constructed. It has two attributes (path, count), where path denotes the pattern and count denotes the number of occurrences of the pattern.

    create table TFP-PATTERN as
    select a1.path,...,ak.path, count
    from TFP
    where object= a1.object...ak.object.

TABLE VII TFP PATTERN TABLE

TID | Object | Path | Count
1 | A,B,C,D,E | 2310 | 4
12 | A,B,D,E | 2310,462 | 2
3 | A,C,D,E | 2310,330 | 1

IV. PERFORMANCE EVALUATION

In order to evaluate performance, we compare our results with Apriori [9] and FP-growth [10]. The proposed algorithms, FPAR/FNAR and TFP, have also been implemented and compared. The implementation was done using Oracle 10g and the sample spatial dataset. To evaluate the performance of our proposed framework, experiments on the IBM synthetic dataset T10I4D100K from the frequent itemset mining dataset repository (http://fimi.cs.helsinki.fi/data/) and on the Mushroom dataset from the UCI Machine Learning Repository (http://kdd.ics.uci.edu/) have also been considered. All the experiments were performed on a 2.0 GHz Pentium PC with 512 MB main memory and a 40 GB hard disk, running Microsoft Windows/NT.

The execution time for searching frequent patterns is shown in Fig. 4. It shows that FPAR/FNAR and TFP require less time for mining frequent patterns than FP, as the number of nodes generated and the number of levels in the tree are much smaller, which improves the generation of frequent patterns compared to the traditional FP algorithm.

Fig. 4. Time performance of FPAR/FNAR, FP and TFP

These implementations of TFP and FPAR/FNAR therefore validate the efficiency of the proposed approach, using the designed code, compared to existing traditional approaches. We have tested various spatial datasets, with consistent results. Limited by space, only the results on some typical datasets are reported here.

V. CONCLUSION

Association rule mining implemented in SQL is a logical choice for mining frequent patterns in spatial databases. It generates frequent patterns with fewer input parameters. The evaluation of this algorithm was performed on spatial datasets. The proposed approach presents several areas for future research. The first is to improve the time complexity of the algorithm by applying heuristics. Further research can also examine whether the SQL-based implementation can be made free from any user involvement.

VI. REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1993, pp. 207-216.
[2] X. Dong, Z. Niu, X. Shi, X. Zhang, and D. Zhu, "Mining both Positive and Negative Association Rules from Frequent and Infrequent Item sets", ADMA 2007, LNAI 4632, Springer-Verlag Berlin Heidelberg, 2007, pp. 122-133.
[3] D.R. Thiruvady and G.I. Webb, "Mining negative rules in large databases using GRD", Proceedings of PAKDD 2004, pp. 161-165.
[4] A. Pietracaprina and D. Zandolin, "Mining frequent item sets using Patricia tries", Proceedings of IEEE FIMI, 2003.
[5] M. Gan, M.-Y. Zhang, and S.-W. Wang, "One Extended Form for Negative Association Rules and the Corresponding Mining Algorithm", Proceedings of the 4th International Conference on Machine Learning and Cybernetics, Vol. 3, 2005, pp. 1716-1721.
[6] Borgelt, "An Implementation of the FP-growth Algorithm", Proceedings of the 1st International Workshop on Open Source Data Mining, 2005, pp. 1-5.
[7] Animesh Tripathy, Subhalaxmi Das, and Prashanta Kumar Patra, "An Improved Design Approach in Spatial Databases Using Frequent Association Rule Mining Algorithm", IEEE 2nd International Advance Computing Conference, 2010, pp. 410-415.
[8] Ranjana Vyas, Lokesh Kumar Sharma, and U.S. Tiwary, "Exploring Spatial ARM (Association Rule Mining) for Geo Decision Support System", Journal of Computer Science, 2007, pp. 882-886.
[9] G. Booch, J. Rumbaugh, and I. Jacobson, The Unified Modeling Language: User Guide, 2nd edition, Addison-Wesley, 2005.
[10] Borgelt, "Efficient Implementations of Apriori and Eclat", IEEE, pp-90, 2003.

VII. BIOGRAPHIES

Prashanta Kumar Patra received a Bachelor of Engineering in Electronics from SVRCET (NIT), Surat, India, an M.Tech in Computer Engg. from I.I.T., Kharagpur, and a Ph.D. in Computer Science from Utkal University, India. He is presently working as Professor & Head of the Department of Computer Science & Engg., College of Engg & Tech, a constituent college of Biju Patnaik University of Technology, Orissa, India. He has published many papers in national and international journals and conferences in the areas of Soft Computing, Image Processing, Pattern Recognition and Bioinformatics, which are the subjects of his research interest.

Animesh Tripathy received a Bachelor of Engineering in Computer Engineering and a Master of Technology in Computer Science & Engineering from the University of Calcutta, and a Ph.D. in Computer Science from Utkal University, India. He is presently working as Associate Professor in the School of Computer Engineering, KIIT University, Bhubaneswar, Orissa, India. He has published some innovative research papers in international journals and conferences. His major strength lies in GIS, Image Analysis & Intelligent Database Systems.

Subhalaxmi Das received a B.Tech in Computer Science Engineering from Biju Patnaik University of Technology, Orissa, India, and an M.Tech in Computer Science & Engineering from KIIT University, India. She is currently working as a Lecturer in the Department of Computer Science & Engg., College of Engg & Tech, a constituent college of Biju Patnaik University of Technology, Orissa, India. Her special fields of interest include Spatial Data Mining.
