An Improved Itemset Generation Approach for Mining ... - IEEE Xplore

An Improved Itemset Generation Approach for Mining Medical Databases

K. Zuhtuogullari (1), N. Allahverdi (2)

(1), (2) Selcuk University Technical Education Faculty, Electronic & Computer Education Department, Turkey

(1) [email protected]  (2) [email protected]

Abstract: Finding frequent patterns plays a significant role in data mining for discovering relational patterns. In this study, an extendable and improved itemset generation approach has been developed for mining the relationships between symptoms and disorders in medical databases. The algorithm of the developed software finds the frequent illnesses and generates association rules using the Apriori algorithm. The developed software can be used on large medical and health databases to construct association rules for disorders frequently seen in patients and to determine the correlation of health disorders and symptoms observed simultaneously.

Keywords: Improved itemset generation approach, Data Mining, Artificial Intelligence

978-1-61284-922-5/11/$26.00 ©2011 IEEE

I. INTRODUCTION

Data mining and artificial intelligence techniques are very significant in medicine and knowledge discovery because they uncover hidden and valuable data. Data mining explores hidden relationships and knowledge that human beings cannot easily observe and evaluate, and it improves the quality of our lives by showing experts the hidden relationships and correlations in large databases [1; 2]. Data mining has played an important role in intelligent medical systems [3; 4]. With the constructed software, users can easily evaluate the relationships among disorders, the real causes of the disorders, and the effects of symptoms that are seen together in patients. Because the software is extendable, large databases can be supplied as its input data. In this study, relationships whose effects have not been evaluated adequately are explored, and the hidden knowledge in large medical databases is searched by finding frequent items using candidate generation. The sets of illnesses seen simultaneously in the medical databases can be reduced by using the candidate generation of the Apriori algorithm.

Apriori is a data mining algorithm for mining frequent itemsets for association rules [5; 6]. The name of the algorithm reflects the fact that it uses prior knowledge of frequent itemset properties. Apriori finds frequent patterns with an iterative approach known as a level-wise search of the database.

II. THE GENERAL STRUCTURE OF THE DEVELOPED SYSTEM

In this study, an improved itemset generation approach is developed for mining the relationships of symptoms observed together. The developed algorithm reveals these relationships by generating the itemsets and constructing association rules using the candidate generation algorithm. In addition to the classical Apriori itemset generation, the developed approach lets the user stop the algorithm at an itemset number (generation) that the user determines. Itemset generation can therefore be finalized before the last itemset is reached, which makes it possible to construct different association rules according to the user-defined itemset number parameter. In the classical approaches, the algorithm continues the itemset generation process until the last itemset is found. The developed software also supports the classical itemset and association rule generation in addition to the approach described above.

The constructed system includes the classical properties of the Apriori algorithm and explores dependencies among the illnesses and symptoms frequently observed in medical databases. The software was developed in the C#.NET programming language and uses the join step and the prune step of the Apriori algorithm. The prune step was added so that large medical databases can be scanned more effectively and quickly, which decreases the computation time.

The software is designed to adapt to large and different databases. The number of transactions can be extended by the specialists or the user, and the support count values can be changed by the user of the software. Each transaction represents the different disorders observed together in one patient of the medical database. Support counts can be changed to determine the frequency of the illnesses in the medical databases. The frequent itemsets (itemsets of illnesses, symptoms and disorders) are generated by scanning all the transactions in the medical database against the minimum support threshold, which is set by the user or by the specialists using the software. The minimum confidence value is likewise set by the user in order to construct “strong association rules” for the symptoms observed in the patients and to investigate their dependency. The confidence threshold relates to how often sets of illnesses are observed together in the medical transaction database, and the user sets it to obtain the dominant correlations among the disorders seen in the related patients. The general structure of the developed software is based on finding frequent itemsets using the candidate generation (Apriori) algorithm. The input data is read from a text file and can be extended to large databases [5; 6].
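The level-wise search with a user-defined stopping level can be illustrated with a short sketch. This is not the authors' C# implementation: it is a minimal Python illustration, and the names `apriori`, `apriori_gen` and `max_level` are hypothetical. The data are only the twelve transactions listed in Table 2.1 (not the full 25-transaction database), and the threshold `min_sup = 3` is an assumption chosen for this small excerpt.

```python
from itertools import combinations

# Rows T1..T12 of Table 2.1; columns are A1..A5 (1 = observed in the patient).
ROWS = [
    (1, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 1, 0, 0), (1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0), (0, 1, 1, 0, 0), (1, 0, 1, 0, 0), (1, 1, 1, 0, 1),
    (1, 1, 1, 0, 0), (0, 0, 0, 1, 1), (1, 1, 1, 1, 0), (1, 0, 1, 0, 0),
]
ITEMS = ("A1", "A2", "A3", "A4", "A5")
TRANSACTIONS = [frozenset(i for i, bit in zip(ITEMS, row) if bit) for row in ROWS]


def apriori_gen(prev_level, k):
    """Join frequent (k-1)-itemsets that share their first k-2 items, then prune."""
    prev = sorted(prev_level)          # itemsets are stored as sorted tuples
    prev_set = set(prev)
    candidates = []
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:   # join step
                c = a + (b[-1],)
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates


def apriori(transactions, min_sup, max_level=None):
    """Return {level: {itemset: support count}}, stopping early at max_level."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup}
    result, k = {}, 1
    while level:
        result[k] = level
        if max_level is not None and k >= max_level:
            break                      # user-defined itemset number reached
        k += 1
        cands = apriori_gen(level, k)
        counts = {c: sum(1 for t in transactions if t.issuperset(c)) for c in cands}
        level = {c: n for c, n in counts.items() if n >= min_sup}
    return result


if __name__ == "__main__":
    for k, itemsets in apriori(TRANSACTIONS, min_sup=3).items():
        print(k, itemsets)
```

With `min_sup = 3` on these twelve rows, the only frequent 3-itemset is (A1, A2, A3) with support count 3; the candidate (A2, A3, A4) is discarded in the prune step because (A3, A4) is not frequent. Calling `apriori(TRANSACTIONS, 3, max_level=2)` stops after the 2-itemsets, which mirrors the user-defined itemset number described above.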
The support count can also be set to different values so that the software can be adapted to different databases. In the Apriori algorithm, the set of itemsets containing one item is constructed first, and the medical database is scanned to determine the count of each item. The support count of an item or itemset is the number of transactions in which it occurs; the minimum support count is the threshold that an item or subset must reach to be used for constructing larger sets. Joining the sets with k elements gives the sets with k+1 elements, subject to the join property and the minimum support count threshold of the Apriori algorithm. The items that satisfy the minimum support count are collected in the set named Ln. The itemset generation algorithm iterates until the last itemset is found.

According to the pruning property of Apriori itemset generation, all nonempty subsets of a frequent itemset must also be frequent. The prune step checks whether any candidate has a subset that is not frequent and discards such candidates. The property can be summarized as follows: any itemset with k-1 elements that is not frequent cannot be a subset of a frequent k-itemset. With this property, some itemsets can be pruned without scanning all the transactions of the input database, which decreases the computation time when very large databases are used. The pruning property is used in the constructed software to increase the performance of the algorithm and to decrease the computation time [6-9].

The input database is a medical database that shows the symptoms and disorders or sicknesses observed. It consists of 25 transactions. The minimum support count can be changed by the expert, and different association rules can be generated depending on the support count threshold. The relationships among the observed diseases can be easily interpreted by the experts via the constructed software. The database is read from an extendable text file; the number of transactions is unlimited and can be specified by the user. The transactions of the input database represent the gastroenterological disorders or symptoms observed simultaneously in patients. A1, A2, A3, A4 and A5 represent the disorders Gastric Ulcer, Duodenal Ulcer, Reflux Disease, Dyspepsia and Stomach Cancer respectively. Each transaction (T1..Tn) (row) represents a patient, and each column represents one of the disorders or symptoms mentioned above. A 1 in the table indicates that disease or symptom An is observed in patient Tn, and a 0 indicates that the disease is negative or not observed in patient Tn. Some of the transactions are shown in Table 2.1.

T     A1  A2  A3  A4  A5
T1    1   1   0   0   1
T2    0   1   0   1   0
T3    0   1   1   0   0
T4    1   1   0   1   0
T5    1   0   1   0   0
T6    0   1   1   0   0
T7    1   0   1   0   0
T8    1   1   1   0   1
T9    1   1   1   0   0
T10   0   0   0   1   1
T11   1   1   1   1   0
T12   1   0   1   0   0

Table 2.1 Some of the transactions of the input medical database showing the diseases and symptoms.

Finding frequent itemsets with an iterative level-wise approach based on candidate generation is known as the Apriori algorithm; the steps of the classical Apriori algorithm are given below [10].


Input: D represents a database of transactions and min_sup represents the minimum support count threshold.

Output: L, frequent itemsets in D.

Method:

    L1 = find_frequent_1-itemsets(D);
    for (k = 2; Lk-1 ≠ ∅; k++) {
        Ck = apriori_gen(Lk-1);
        for each transaction t ∈ D {      // scan D for counts
            Ct = subset(Ck, t);           // get the subsets of t that are candidates
            for each candidate c ∈ Ct
                c.count++;
        }
        Lk = {c ∈ Ck | c.count ≥ min_sup}
    }
    return L = ∪k Lk;

    procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
        for each itemset l1 ∈ Lk-1
            for each itemset l2 ∈ Lk-1
                if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
                    c = l1 ⋈ l2;          // join step; generate candidates
                    if has_infrequent_subset(c, Lk-1) then
                        delete c;         // prune step
                    else add c to Ck;
                }
        return Ck;

    procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets);
        for each (k-1)-subset s of c
            if s ∉ Lk-1 then
                return TRUE;
        return FALSE;

II.I GENERATING ASSOCIATION RULES BY CANDIDATE GENERATION

For the test procedure, the minimum support count was set to 4. In the last set combination, the itemsets contain 3 items, as shown in Fig. 2.1. Fig. 2.1 shows the calculated results when the support count was specified as 4, and Fig. 2.2 shows the results when the minimum support count threshold was set to 5.

When the support count was set to 4, the last itemset combinations were:

A1, A2, A5 .............. with Support Count 4
A1, A2, A4 .............. with Support Count 4
A1, A2, A3 .............. with Support Count 7

A1, A2, A3, A4 and A5 represent the disorders Gastric Ulcer, Duodenal Ulcer, Reflux Disease, Dyspepsia and Stomach Cancer respectively. In the last itemset combination, the support count was calculated as 4 for each of the combinations “A1, A2 and A5” and “A1, A2 and A4”. For the “A1, A2 and A3” combination, the support count was calculated as 7. The “A1, A2, A3” combination with support count 7 represents strong relationships or triggering effects among the diseases and symptoms A1, A2 and A3, and the correlation and triggering effect of this combination was found to be stronger than that of the other combinations. The association rules are constructed from the resulting final sets. When the minimum support count was set to 5, the “A1, A2, A3” combination with support count 7 was derived as the final combination set (Fig. 2.2). In the test procedure of Fig. 2.1, the developed software calculates the last table of frequent itemsets as (A1, A2, A5), (A1, A2, A4) and (A1, A2, A3).

After the last table of frequent itemsets is found, rules are generated from the frequent itemsets by considering the minimum confidence threshold. The confidence of a rule is expressed in Equation 2.1: Confidence(A → B) is the ratio of the support count of (A ∪ B) to the support count of A, where “sc” denotes the support count.

$$\mathrm{Confidence}(A \rightarrow B) = P(B \mid A) = \frac{sc(A \cup B)}{sc(A)} \tag{2.1}$$

Strong association rules are accepted as dominant rules [10; 11]. The sets obtained in the last step of itemset generation are used for generating association rules. Rules whose confidence value is equal to or greater than the specified minimum confidence threshold are accepted as strong association rules, and rules that do not satisfy the threshold are rejected. The strong rules help the user of the constructed software to observe the symptoms or diseases that occur together, and they help the specialists to find the correlation among these symptoms and illnesses. Some diseases or symptoms may be the cause of other diseases or symptoms. The constructed software therefore has a significant role in determining the associations of the diseases by constructing association rules.
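As a worked illustration of Equation 2.1, the sketch below recomputes rule confidences and filters them against a 50% minimum confidence. It is a minimal Python sketch, not the authors' software; the function names `confidence` and `strong_rules` are hypothetical. The support counts are an assumption in one respect: they are read off the numerators and denominators of the confidence ratios that the paper reports for its 25-transaction test database, interpreted as sc(A ∪ B) and sc(A).

```python
# Support counts (sc) taken from the ratios reported in the paper,
# e.g. 7/13 is read as sc(A1, A2, A3) = 7 and sc(A1, A2) = 13.
SUPPORT = {
    frozenset({"A1"}): 18,
    frozenset({"A2"}): 17,
    frozenset({"A5"}): 8,
    frozenset({"A1", "A2"}): 13,
    frozenset({"A1", "A3"}): 12,
    frozenset({"A2", "A3"}): 9,
    frozenset({"A1", "A4"}): 5,
    frozenset({"A2", "A4"}): 6,
    frozenset({"A2", "A5"}): 5,
    frozenset({"A1", "A2", "A3"}): 7,
}


def confidence(antecedent, consequent, support):
    """Equation 2.1: sc(A ∪ B) / sc(A)."""
    a = frozenset(antecedent)
    return support[a | frozenset(consequent)] / support[a]


def strong_rules(rules, support, min_conf):
    """Keep the rules whose confidence is at least min_conf."""
    return [(a, b) for a, b in rules if confidence(a, b, support) >= min_conf]


# Candidate rules as (antecedent, consequent) pairs.
CANDIDATE_RULES = [
    (("A1", "A2"), ("A3",)),
    (("A2", "A3"), ("A1",)),
    (("A1", "A3"), ("A2",)),
    (("A1",), ("A2",)),
    (("A1",), ("A4",)),
    (("A5",), ("A2",)),
    (("A2",), ("A4",)),
]

if __name__ == "__main__":
    for a, b in CANDIDATE_RULES:
        conf = confidence(a, b, SUPPORT)
        print(" ^ ".join(a), "->", " ^ ".join(b), f"{conf:.3f}")
```

With a minimum confidence of 0.5, five of these seven candidate rules are kept; A1 → A4 (5/18) and A2 → A4 (6/17) fall below the threshold and are rejected.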


Fig. 2.1 The results calculated when the support count is determined as 4

Fig. 2.2 The results shown when the support count threshold level is determined as 5

Some of the association rules constructed from the last itemset are given below:

A1 ^ A2 → A3 has a confidence value of 7/13 = 0.538 = 53.8%
A2 ^ A3 → A1 has a confidence value of 7/9 = 0.777 = 77.7%
A1 ^ A3 → A2 has a confidence value of 7/12 = 0.583 = 58.3%

If the confidence level is specified as 50%, these rules are accepted as strong rules. The first value, for example, is the ratio of the number of patients in whom A1, A2 and A3 are observed together to the number of patients in whom A1 and A2 are observed together. In addition to the classical approaches, the constructed approach can calculate the association rules from a desired itemset number, and this feature gives the system the opportunity to generate different association rules. When the set generation part of the algorithm is finalized at itemset number 2, some of the constructed association rules are:

A1 → A2 has a confidence value of 13/18 = 0.722 = 72.2%
A1 → A4 has a confidence value of 5/18 = 0.277 = 27.7%
A5 → A2 has a confidence value of 5/8 = 0.625 = 62.5%
A2 → A4 has a confidence value of 6/17 = 0.3529 = 35.29%

III. RESULTS AND FUTURE WORK

Satisfactory results were obtained in the test procedure of the system for determining the correlation among the observed symptoms and diseases. The developed system can generate rules from a desired itemset, and this property gives the system the flexibility to generate more rules. In addition, the developed system can generate rules from the last itemset. The developed software will help in searching for correlations among the symptoms in large databases and for the triggering effects of each symptom. This study constructs association rules from both the last itemset generated and the desired itemset. The developed software generates association rules for determining the relationships among diseases observed together. The generated association rules are highly significant for the early diagnosis of correlated diseases. This study shows the correlation of some types of gastroenterological diseases, such as gastric ulcer, duodenal ulcer and stomach cancer. Some types of diseases can have triggering effects on other kinds of diseases. The gastroenterological symptoms and diseases that have a stronger effect on each other can be determined and interpreted by the constructed system, and large and extended databases can be scanned effectively with the pruning property of the developed system.

REFERENCES

[1] A. Sadanandam, M. L. Varney, R. K. Singh, “Identification of Semaphorin A Interacting Protein by Applying Apriori Knowledge and Peptide Complementarity Related to Protein Evolution and Structure Genomics”, Proteomics & Bioinformatics, Volume 6, Issues 3-4, 2008, pp. 163-174

[2] E. Lazcorreta, F. Botella, A. Fernández-Caballero, “Towards personalized recommendation by two-step modified Apriori data mining algorithm”, Expert Systems with Applications, Volume 35, Issue 3, October 2008, pp. 1422-1429

[3] C. Aflori, M. Craus, “Grid implementation of the Apriori algorithm”, Advances in Engineering Software, Volume 38, Issue 5, May 2007, pp. 295-300

[4] A. J.T. Lee, Y.H. Liu, H.Mu Tsai, H.-Hui Lin, H-W. Wu, “Mining frequent patterns in image databases with 9D-SPA representation”, Journal of Systems and Software, Volume 82, Issue 4, April 2009, pp. 603-618

[5] H. Yao, H. J. Hamilton, “Mining itemset utilities from transaction databases”, Data & Knowledge Engineering, Volume 59, Issue 3, December 2006, pp. 603-626

[6] R. Hu, “Medical Data Mining Based on Association Rules”, Computer and Information Science, Vol. 3, No. 4, 2010, pp. 104-108

[7] M. S. Tsechansky, N. Pliskin, G. Rabinowitz, A. Porath, “Mining relational patterns from multiple relational tables”, Decision Support Systems, Volume 27, Issues 1-2, November 1999, pp. 177-195

[8] F. Thabtah, P. Cowling, S. Hammoud, “Improving rule sorting, predictive accuracy and training time in associative classification”, Expert Systems with Applications, Volume 31, Issue 2, August 2006, pp. 414-426

[9] R. Agrawal, T. Imielinski, A. N. Swami, “Mining Association Rules between Sets of Items in Large Databases”, SIGMOD, June 1993, 22(2), pp. 207-216

[10] J. Han, M. Kamber, “Data Mining Concepts and Techniques”, 2nd Edition, Morgan Kaufmann Publishers, Elsevier, 2006

[11] S. Kotsiantis, D. Kanellopoulos, “Association Rules Mining: A Recent Overview”, GESTS International Transactions on Computer Science and Engineering, Vol. 32 (1), 2006, pp. 71-82
