Mining medical data to identify frequent diseases using Apriori algorithm

Proceedings of the 2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, February 21-22

M. Ilayaraja
Department of Computer Science & Engineering, Alagappa University, Karaikudi, India [email protected]

T. Meyyappan
Department of Computer Science & Engineering, Alagappa University, Karaikudi, India [email protected]

Abstract— Data mining is the process of analyzing large volumes of data from different perspectives and summarizing them into useful information. This information can be converted into knowledge about historical patterns and future trends. Data mining plays a significant role in the field of information technology. The health care industry today generates large amounts of complex data about patients, hospital resources, diseases, diagnosis methods, electronic patient records, etc. Data mining techniques are very useful for making medical decisions in curing diseases. The healthcare industry collects huge amounts of healthcare data which, unfortunately, are not “mined” to discover hidden information for effective decision making. The discovered knowledge can be used by healthcare administrators to improve the quality of service. In this paper, the authors develop a method to identify the frequency of diseases in a particular geographical area during a given time period with the aid of the association-rule-based Apriori data mining technique.

Keywords— Frequent Diseases; Data Mining; Medical Data; Association Rule; Apriori Algorithm

978-1-4673-5845-3/13/$31.00 ©2013 IEEE

I. INTRODUCTION

Data mining is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. It has gained importance with the wide use of databases and the explosive growth in their sizes. Data mining refers to extracting or “mining” knowledge from large amounts of data; it is the search for relationships and global patterns that exist in large databases but are hidden among large amounts of data. The essential step of knowledge discovery, the conversion of data into knowledge in order to aid decision making, is referred to as data mining. The knowledge discovery process consists of an iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition and knowledge presentation [1].

Data mining has great potential for exploring meaningful and hidden patterns in data sets in the medical domain. Such hidden patterns and relationships often go unexploited, and advanced data mining techniques are a remedy for this situation. Data mining functions include clustering, classification, prediction, and association. One of the most important data mining applications is the mining of association rules. Association rules, first introduced in 1993, are used to identify relationships among a set of items in databases. These relationships are not based on inherent properties of the data themselves, but rather on the co-occurrence of the data items [2]. The emphasis of this research work is the analysis of medical data. Medical profiles such as patient name, age, sex, disease name, address, time, date, etc., can be used to mine the frequent diseases of patients in different geographical areas during a given time period.

II. REVIEW OF LITERATURE

Jyoti Soni, et al. [3] provide a survey of current techniques of knowledge discovery in databases using data mining that are in use in today’s medical research, particularly in heart disease prediction. A number of experiments were conducted to compare the performance of predictive data mining techniques on the same dataset. The outcome reveals that the decision tree outperforms the other methods and that Bayesian classification sometimes has accuracy similar to that of the decision tree, whereas other predictive methods such as KNN, neural networks and classification based on clustering do not perform well. The second conclusion is that the accuracy of the decision tree and Bayesian classification improves further after applying a genetic algorithm to reduce the actual data size to the optimal subset of attributes sufficient for heart disease prediction.

The problem of identifying constrained association rules for heart disease prediction was studied by Carlos Ordonez [4]. The assessed data set encompassed medical records of people having heart disease, with attributes for risk factors, heart perfusion measurements and artery narrowing. Three constraints were introduced to decrease the number of patterns. The first requires the attributes to appear on only one side of the rule. The second segregates attributes into uninteresting groups. The third restricts the number of attributes in a rule. Experiments illustrated that the constraints reduced the number of discovered rules remarkably, besides decreasing the running time. Two groups of rules predicted the presence or absence of heart disease in four specific heart arteries.

Maria-Luiza Antonie, et al. [5] present experiments on tumour detection in digital mammography. The authors investigated the use of two data mining techniques, neural networks and association rule mining, for anomaly detection and classification. The results show that both approaches performed well, obtaining a classification accuracy of over 70% for both techniques. Moreover, the experiments demonstrate the use and effectiveness of association rule mining in image categorization.

E. Barati, et al. [6] focus mainly on the application of data mining to skin diseases. A categorization is provided based on the different data mining techniques, and the utility of the various data mining methodologies is highlighted. Generally, association mining is suitable for extracting rules; it has been used especially in cancer diagnosis. Classification is a robust method in medical mining. The authors summarize the different uses of classification in dermatology, where it is one of the most important methods for the diagnosis of erythemato-squamous diseases. Methods used in this area include neural networks, genetic algorithms and fuzzy classification. Clustering is a useful method in medical image mining. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics, and clustering has some applications in dermatology. Besides introducing different mining methods, the authors investigate some challenges that exist in mining skin data.


Abdelghani Bellaachia, et al. [7] present an analysis of the prediction of the survivability rate of breast cancer patients using data mining techniques. The data used is the SEER Public-Use Data. The pre-processed data set consists of 151,886 records, which have all the available 16 fields from the SEER database. The authors investigated three data mining techniques: Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithm. Several experiments were conducted using these algorithms. The achieved prediction performances are comparable to existing techniques; however, the C4.5 algorithm was found to perform much better than the other two techniques.

G. Subbalakshmi, et al. [8] developed a Decision Support in Heart Disease Prediction System (DSHDPS) using a data mining modelling technique, namely Naïve Bayes. Using medical profiles such as age, sex, blood pressure and blood sugar, it can predict the likelihood of patients getting heart disease. It is implemented as a web-based questionnaire application. It can serve as a training tool to train nurses and medical students to diagnose patients with heart disease.

N. Deepika, et al. [9] present an effective heart attack prediction system using PCAR, an efficient approach for mining association rules. A proficient methodology for the generation of association rules from heart disease warehouses for heart attack prediction is presented. For predicting heart attack, 30 significant attributes are listed. The authors note that continuous data can be used instead of just categorical data, and that text mining can be used to mine the vast amount of unstructured data available in healthcare databases. The experimental results illustrate the efficacy of the designed prediction system in predicting heart attack.

The above review of the literature shows improvements in the accuracy of classification and in the prediction of heart disease, skin disease and breast cancer. Based on electronic medical records, a method for identifying the frequency of various diseases in a particular geographical area during a given time period is proposed in this research work.

III. APRIORI ALGORITHM

The Apriori algorithm (Agrawal et al., 1993) is the most classical and important algorithm for mining frequent itemsets. Apriori is used to find all frequent itemsets in a given database DB. It is based on the Apriori principle: any subset of a frequent itemset must also be frequent. For example, if {X, Y} is a frequent itemset, both {X} and {Y} must be frequent itemsets. The key idea of the Apriori algorithm is to make multiple passes over the database. It employs an iterative approach known as breadth-first search (level-wise search) through the search space, where k-itemsets are used to explore (k+1)-itemsets. In the beginning, the set of frequent 1-itemsets is found. This set, which contains the single items that satisfy the support threshold, is denoted by L1. In each subsequent pass, we begin with a seed set of itemsets found to be large in the previous pass. This seed set is used for generating new potentially large itemsets, called candidate itemsets, and the actual support for these candidate itemsets is counted during the pass over the data. At the end of the pass, we determine which of the candidate itemsets are actually large (frequent), and they become the seed for the next pass. Therefore, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found [11].

IV. PROPOSED WORK

Data mining tools have been developed for effective analysis of medical information to help clinicians make better diagnoses. In this research work, data are collected from a Hospital Information System (HIS), which holds sufficient details of patients, including the patient’s name, age, disease, location, district and date from laboratories, and which keeps growing year after year. Having collected the data from the hospital information system, this research finds the frequent diseases with the help of association techniques. The frequent diseases are mined with the help of the WEKA tool applied over the training data set.

In the medical field, the hospital information system receives different kinds of medical records of diseases and of patients who hail from different areas. It is a herculean task to identify the frequent diseases and their causes from such a large data set. Patients from different locations approach different hospitals; they do not converge in the same place, and their records are maintained by the hospitals where they are treated. Collecting information about frequently occurring diseases is therefore not an easy job. Information about such diseases can be gathered through association rule mining, and the Apriori algorithm for association rules is adopted here for mining the data. Details regarding the occurrence of these diseases in a particular time period can also be mined using the Apriori algorithm.
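As a rough illustration of restricting the analysis to one area and one time period, the sketch below counts how often each disease appears in a chosen district and month and keeps those that reach a threshold. The record layout `Visit`, the function `frequent_diseases`, the district labels and the toy records are all assumptions made for this example; the paper does not specify its record format.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Visit(NamedTuple):
    """Hypothetical record layout; the field names are assumptions."""
    patient: str
    disease: str
    district: str
    month: int  # 1 = January ... 12 = December


def frequent_diseases(visits: Iterable[Visit], district: str, month: int,
                      min_count: int) -> dict:
    """Count diseases for one district and month; keep those seen >= min_count times."""
    counts = Counter(v.disease for v in visits
                     if v.district == district and v.month == month)
    return {d: n for d, n in counts.items() if n >= min_count}


# Toy records, invented for illustration only.
visits = [
    Visit("p1", "Asthma", "District A", 1),
    Visit("p2", "Asthma", "District A", 1),
    Visit("p3", "Jaundice", "District A", 1),
    Visit("p4", "Asthma", "District B", 1),
    Visit("p5", "Asthma", "District A", 2),
]
january_a = frequent_diseases(visits, "District A", 1, min_count=2)
```

This corresponds only to the first pass of Apriori (single items); combinations of diseases require the level-wise search of Section III.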


A. Algorithm

Ck : candidate medical data itemsets of size k
Lk : frequent medical data itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidate itemsets generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

B. Outcome of the Research

The proposed method is useful for identifying the frequent diseases in a large medical dataset. The outcome of this research will help practitioners in making medical decisions for frequently occurring diseases. The analysis is made on data from various geographical locations during different time periods.

V. EXPERIMENTAL RESULT

The training data set used in this research work contains 1246 patients’ electronic medical records. The data set includes each patient’s name, sex, age, disease name, address, date, etc., for the year 2012. The proposed method is implemented using the WEKA data mining tool. The results obtained are promising.
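The level-wise procedure of Section IV-A can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the WEKA implementation used in this work: the function name `apriori`, the absolute threshold `min_count` and the toy transactions are introduced only for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items and keep those meeting the threshold (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 1
    while current:
        prev = set(current)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step (Apriori principle): every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # Count step: one pass over the database for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions, invented for illustration only (one set of diseases per month).
toy = [
    {"smoking", "heart disease", "overweight"},
    {"smoking", "heart disease"},
    {"smoking", "pain"},
    {"heart disease", "overweight"},
]
result = apriori(toy, min_count=2)
```

In the toy run, the candidate {smoking, heart disease, overweight} is discarded in the prune step without being counted, because its subset {smoking, overweight} is not frequent.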

TABLE I. VARIOUS DISEASES APPEARING MONTH-WISE

[Table body not recoverable: a grid of the twelve months, January to December, against the 29 diseases, with marked cells.]

Table I shows the occurrence of various diseases over the 12 months of the year. The letter ‘t’ denotes the occurrence of a particular disease in a particular month.
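A table of this kind can be viewed as twelve transactions, one per month, each holding the set of diseases that occurred. The sketch below shows one way such a month-by-disease grid of ‘t’ marks might be built from (month, disease) pairs. The function names `month_transactions` and `render` and the toy input are assumptions for illustration; the paper does not describe how its table was prepared.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def month_transactions(records):
    """Group (month, disease) pairs into one set of diseases per month."""
    table = {m: set() for m in MONTHS}
    for month, disease in records:
        table[month].add(disease)
    return table


def render(table, diseases):
    """Return one row per month: month name, then 't' or '' for each disease."""
    return [[m] + ["t" if d in table[m] else "" for d in diseases]
            for m in MONTHS]


# Toy input, invented for illustration only.
records = [("January", "Asthma"), ("January", "Pain"),
           ("March", "Asthma"), ("January", "Asthma")]
table = month_transactions(records)
rows = render(table, ["Asthma", "Pain"])
```

Repeated occurrences within a month collapse to a single mark, so the grid records only whether a disease occurred in a month, not how many patients were affected.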

The bar graph in Fig. 1 depicts the month-wise number of diseases affecting the patients. The bar graph in Fig. 2 shows the month-wise number of patients affected by various diseases. It reveals the fact that some patients are suffering from the same disease in a particular month.

Fig. 1. Number of diseases occurring month-wise (bar graph; x-axis: Months, y-axis: Number of Diseases).

Fig. 2. Number of patients affected by diseases month-wise (bar graph; x-axis: Months, y-axis: Number of Patients).
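The WEKA listing reported in this section gives a support count for each large itemset, a minimum support of 0.35 over 12 instances, and a minimum metric of 0.9. As a hedged illustration of how such figures relate, the sketch below computes rule confidence from a few of the reported counts. Reading the "minimum metric" as confidence is an assumption (confidence is WEKA's default metric for Apriori), and the helper `confidence` is introduced only for this example; the paper itself does not list the generated rules.

```python
# Support counts copied from the WEKA listing in this section (12 instances).
N_INSTANCES = 12
MIN_SUPPORT = 0.35
MIN_METRIC = 0.9  # assumed to be a confidence threshold

support = {
    frozenset({"Heart disease"}): 7,
    frozenset({"Overweight"}): 5,
    frozenset({"smoking"}): 9,
    frozenset({"Liver Disease"}): 4,
    frozenset({"Heart disease", "Overweight"}): 5,
    frozenset({"Heart disease", "smoking"}): 6,
    frozenset({"Heart disease", "Liver Disease", "Overweight", "smoking"}): 4,
}


def confidence(antecedent, consequent):
    """confidence(A => C) = support(A and C together) / support(A)."""
    a = frozenset(antecedent)
    return support[a | frozenset(consequent)] / support[a]


# 0.35 * 12 = 4.2; the listing reports this threshold as 4 instances.
threshold = MIN_SUPPORT * N_INSTANCES

c1 = confidence({"Overweight"}, {"Heart disease"})   # 5 / 5
c2 = confidence({"Heart disease"}, {"smoking"})      # 6 / 7
c3 = confidence({"Liver Disease"},
                {"Heart disease", "Overweight", "smoking"})  # 4 / 4
```

Under this reading, a rule such as Overweight ⇒ Heart disease would pass the 0.9 threshold, while Heart disease ⇒ smoking (6/7) would not.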

Relation: Disease
Instances: 12
    January February March April May June July August September October November December
Attributes: 29
    AIDS
    Allergies
    Heart disease
    Asthma
    HIV
    human papilloma virus
    hypertension
    Impotence
    Insomnia
    Jaundice
    Kidney Disease
    Leukemia
    Liver cancer
    Liver Disease
    Lung Cancer
    Lupus
    Overweight
    Eye Disease
    Pain
    Pertussis
    Pregnancy
    Raynauds Phenomenon
    sexually transmitted diseases
    sleep disorders
    smoking
    stroke
    Thrush
    Thyroid disorders
    Whooping Cough

Apriori
=======

Minimum support: 0.35 (4 instances)
Minimum metric : 0.9
Number of cycles performed: 13

Generated sets of large itemsets:

Size of set of large itemsets L(1): 20

Large Itemsets L(1):
Allergies=t 4
Heart disease=t 7
Asthma=t 4
HIV=t 4
hypertension=t 5
Impotence=t 6
Insomnia=t 6
Jaundice=t 5
Kidney Disease=t 5
Liver Disease=t 4
Lung Cancer=t 4
Overweight=t 5
Pain=t 7
Pregnancy=t 4
Raynauds Phenomenon=t 4
sexually transmitted diseases=t 5
sleep disorders=t 5
smoking=t 9
Thrush=t 4
Whooping Cough=t 7

Size of set of large itemsets L(2): 21

Large Itemsets L(2):
Heart disease=t hypertension=t 4
Heart disease=t Insomnia=t 5
Heart disease=t Kidney Disease=t 4
Heart disease=t Liver Disease=t 4
Heart disease=t Overweight=t 5
Heart disease=t Pain=t 4
Heart disease=t smoking=t 6
Asthma=t Thrush=t 4
HIV=t smoking=t 4
hypertension=t smoking=t 4
Impotence=t Raynauds Phenomenon=t 4
Impotence=t smoking=t 4
Impotence=t Whooping Cough=t 5
Insomnia=t smoking=t 5
Liver Disease=t Overweight=t 4
Liver Disease=t smoking=t 4
Overweight=t smoking=t 4
Pain=t sexually transmitted diseases=t 5
Pain=t smoking=t 5
sexually transmitted diseases=t smoking=t 4
smoking=t Whooping Cough=t 5

Size of set of large itemsets L(3): 7

Large Itemsets L(3):
Heart disease=t Insomnia=t smoking=t 4
Heart disease=t Liver Disease=t Overweight=t 4
Heart disease=t Liver Disease=t smoking=t 4
Heart disease=t Overweight=t smoking=t 4
Impotence=t smoking=t Whooping Cough=t 4
Liver Disease=t Overweight=t smoking=t 4
Pain=t sexually transmitted diseases=t smoking=t 4

Size of set of large itemsets L(4): 1

Large Itemsets L(4):
Heart disease=t Liver Disease=t Overweight=t smoking=t 4

VI. CONCLUSION

This research work proposes an association-rule-based Apriori data mining technique that finds the frequency of diseases affecting patients. The study is made on patients from various geographical locations and at various time periods. Existing electronic medical records obtained from hospitals are used as the training data set for the study.

1246 records of patients affected by 29 different diseases during the year 2012 were analysed. The WEKA data mining tool was employed to identify the frequency of the diseases that recur in people living in various geographical locations during different time periods. The analysis revealed that 4 different diseases affected the patients frequently at various geographical locations during the year 2012.

REFERENCES

[1] Arun K. Pujari, “Data Mining Techniques”, Edition 2001.

[2] Jayanthi Ranjan, “Applications of Data Mining Techniques in Pharmaceutical Industry”, Journal of Theoretical and Applied Information Technology.

[3] Jyoti Soni et al., “Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction”, International Journal of Computer Applications (0975 – 8887), Volume 17, No. 8, March 2011.

[4] Carlos Ordonez, “Improving Heart Disease Prediction Using Constrained Association Rules”, Seminar Presentation at University of Tokyo, 2004.

[5] Maria-Luiza Antonie et al., “Application of Data Mining Techniques for Medical Image Classification”, Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD’2001), in conjunction with ACM SIGKDD conference, San Francisco, USA, August 26, 2001.

[6] E. Barati et al., “A Survey on Utilization of Data Mining Approaches for Dermatological (Skin) Diseases Prediction”, Cyber Journals: Multidisciplinary Journals in Science and Technology, Journal of Selected Areas in Health Informatics (JSHI), March Edition, 2011.

[7] Abdelghani Bellaachia and Erhan Guven, “Predicting Breast Cancer Survivability Using Data Mining Techniques”.

[8] G. Subbalakshmi et al., “Decision Support in Heart Disease Prediction System using Naive Bayes”, Indian Journal of Computer Science and Engineering (IJCSE).

[9] N. Deepika et al., “Association rule for classification of Heart-attack patients”, International Journal of Advanced Engineering Sciences and Technologies, Vol. No. 11, Issue No. 2, 253 – 257.

[10] V. V. Jaya Rama Krishniah et al., “Predicting the Heart Attack Symptoms using Biomedical Data Mining Techniques”, The International Journal of Computer Science & Applications (TIJCSA), Volume 1, No. 3, May 2012, ISSN 2278-1080.

[11] Sanjeev Rao and Priyanka Gupta, “Implementing Improved Algorithm Over APRIORI Data Mining Association Rule Algorithm”, International Journal of Computer Science and Technology (IJCST), Volume 3, Issue 1, Jan. - March 2012, ISSN 0976-8491 (Online), ISSN 2229-4333 (Print).

[12] S. Stilou, P. D. Bamidis, N. Maglaveras, C. Pappas, “Mining association rules from clinical databases: an intelligent diagnostic process in healthcare”, Stud Health Technol Inform 84: Pt 2, 1399-1403, 2001.

[13] H. Kaur and S. K. Wasan, “Empirical Study on Applications of Data Mining Techniques in Healthcare”, Journal of Computer Science 2(2), 194-200, 2006.
