Proceedings of the 2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, February 21-22
Mining Medical Data to Identify Frequent Diseases using Apriori Algorithm M. Ilayaraja
T. Meyyappan
Department of Computer Science & Engineering Alagappa University Karaikudi, India
[email protected]
Department of Computer Science & Engineering, Alagappa University Karaikudi, India
[email protected]
Abstract— The data mining is a process of analyzing a huge data from different perspectives and summarizing it into useful information. The information can be converted into knowledge about historical patterns and future trends. Data mining plays a significant role in the field of information technology. Health care industry today generates large amounts of complex data about patients, hospitals resources, diseases, diagnosis methods, electronic patients records, etc,. The data mining techniques are very useful to make medicinal decisions in curing diseases. The healthcare industry collects huge amount of healthcare data which, unfortunately, are not “mined” to discover hidden information for effective decision making. The discovered knowledge can be used by the healthcare administrators to improve the quality of service. In this paper, authors developed a method to identify frequency of diseases in particular geographical area at given time period with the aid of association rule based Apriori data mining technique. Keywords— Frequent Diseases; Data Mining; Medical Data; Association Rule; Apriori Algorithm
I. INTRODUCTION Data Mining is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable pattern in data with the wide use of databases and the explosive growth in their sizes. Data mining refers to extracting or “mining” knowledge from large amounts of data. Data mining is the search for the relationships and global patterns that exist in large databases but are hidden among large amounts of data. The essential process of Knowledge Discovery is the conversion of data into knowledge in order to aid in decision making, referred to as data mining. Knowledge Discovery process consists of an iterative sequence of data cleaning, data integration, data selection, data mining pattern recognition and knowledge presentation [1]. Data Mining has great potential for exploring the meaningful and hidden patterns in the data sets at the medical domain. Discovery of hidden patterns and relationships often goes unexploited. Advanced data mining techniques is a remedy to this situation. Data mining functions include clustering, classification, prediction, and associations. One of the most important data mining applications is that of mining association rules. Association rules, first introduced in 1993, are used to identify relationships among a set of items in
978-1-4673-5845-3/13/$31.00©2013 IEEE
databases. These relationships are not based on inherit properties of the data themselves, but rather based on cooccurrence of the data items [2]. Emphasis in this research work is analysis of medical data. Medical profiles such as patient name, age, sex, disease name, address, time, date, etc., can be used to mining the frequent disease of patients in different geographical area at given time period. II. REVIEW OF LITERATURE Jyoti Soni, et al. [3] research work, intend to provide a survey of current techniques of knowledge discovery in databases using data mining techniques that are in use in today’s medical research particularly in Heart Disease Prediction. Number of experiments have been conducted to compare the performance of predictive data mining technique on the same dataset and the outcome reveals that Decision Tree outperforms and some time Bayesian classification is having similar accuracy as of decision tree but other predictive methods like KNN, Neural Networks, Classification based on clustering are not performing well. The second conclusion is that the accuracy of the Decision Tree and Bayesian Classification further improves after applying genetic algorithm to reduce the actual data size to get the optimal subset of attribute sufficient for heart disease prediction. The problem of identifying constrained association rules for heart disease prediction was studied by Carlos Ordonez [4]. The assessed data set encompassed medical records of people having heart disease with attributes for risk factors, heart perfusion measurements and artery narrowing. Three constraints were introduced to decrease the number of patterns. First one necessitates the attributes to appear on only one side of the rule. The second one segregates attributes into uninteresting groups. The ultimate constraint restricts the number of attributes in a rule. Experiments illustrated that the constraints reduced the number of discovered rules remarkably besides decreasing the running time. Two groups of rules envisaged the presence or absence of heart disease in four specific heart arteries. Maria-Luiza Antonie, et al. [5], present some experiments for tumour detection in digital mammography. Authors investigated the use of different data mining techniques, neural networks and association rule mining, for anomaly
2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME) detection and classification. The results show that the two approaches performed well, obtaining a classification accuracy reaching over 70% percent for both techniques. Moreover, the experiments conducted demonstrate the use and effectiveness of association rule mining in image categorization. E. Barati, et al. [6], emphasize mainly on the application of data mining on skin diseases. A categorization has been provided based on the different data mining techniques. The utility of the various data mining methodologies is highlighted. Generally association mining is suitable for extracting rules. It has been used especially in cancer diagnosis. Classification is a robust method in medical mining. Authors have summarized the different uses of classification in dermatology. It is one of the most important methods for diagnosis of erythematosquamous diseases. There are different methods like Neural Networks, Genetic Algorithms and fuzzy classification in this topic. Clustering is a useful method in medical images mining. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. Clustering has some applications in dermatology. Besides introducing different mining methods, authors have investigated some challenges which exist in mining skin data.
195
The above given review of literature shows the improvement in the accuracy of classification, increase in the prediction of heart disease, skin disease and breast cancer. Based on electronic medical record, a method for identification of the frequency of various diseases in a particular geographical area at a given time period is proposed in this research work. III. APRIORI ALGORITHM
Apriori algorithm (Agrawal et al. 1993), is the most classical and important algorithm for mining frequent itemsets. Apriori is used to find all frequent itemsets in a given database DB. Based on the Apriori principle any subset of a frequent itemset must also be frequent. Example: if {XY} is a frequent itemset, both {A} and {B} must be frequent itemsets. The key idea of Apriori algorithm is to make multiple passes over the database. It employs an iterative approach known as a breadth-first search (level-wise search) through the search space, where k-itemsets are used to explore (k+1)-itemsets. In the beginning, the set of frequent 1-itemsets is found. The set of that contains one item, which satisfy the support threshold, is denoted by L1. In each subsequent pass, we begin with a seed set of itemsets found to be large in the previous pass. This seed set is used for generating new potentially large itemsets, called candidate itemsets, and count the actual Abdelghani Bellaachia, et al. [7], presents an analysis of the support for these candidate itemsets during the pass over the prediction of survivability rate of breast cancer patients using data. At the end of the pass, we determine which of the data mining techniques. The data used is the SEER Public-Use candidate itemsets are actually large (frequent), and they Data. The pre-processed data set consists of 151,886 records, become the seed for the next pass. Therefore, L1 is used to which have all the available 16 fields from the SEER database. find L2, the set of frequent 2-itemsets, which is used to find We have investigated three data mining techniques: the Naïve L3, and so on, until no more frequent k-itemsets can be found Bayes, the back-propagated neural network, and the C4.5 [11]. decision tree algorithms. Several experiments were conducted IV. PROPOSED WORK using these algorithms. The achieved prediction performances are comparable to existing techniques. However, we found out Data mining tools have been developed for effective that C4.5 algorithm has a much better performance than the analysis of medical information to help the clinician in other two techniques. making better diagnosis. In this research work, the researcher Mrs.G.Subbalakshmi (M.Tech), et al. [8], developed a can collect data from Hospital Information System (HIS) Decision Support in Heart Disease Prediction System which has the sufficient details of patient including patient’s (DSHDPS) using data mining modelling technique, namely, name, age, disease, location, district, date from laboratories Naïve Bayes. Using medical profiles such as age, sex, blood which keeps on growing year after year. Having collected the pressure and blood sugar it can predict the likelihood of data from hospital information system, this research can find patients getting a heart disease. It is implemented as web the frequent disease with the help of association techniques. based questionnaire application. It can serve a training tool to This research work helps to mine the data about the frequent train nurses and medical students to diagnose patients with diseases with the help of weka tool applied over training data set. heart disease. N.Deepika, et al. [9], presents about the various effective heart attack prediction system using PCAR: an Efficient Approach for mining Association Rules. A proficient methodology for the generation of association rules from the heart disease warehouses for heart attack prediction has been presented. For predicting heart attack, significantly 30 attributes are listed. Continuous data can also be used instead of just categorical data. Authors can also use Text Mining to mine the vast amount of unstructured data available in healthcare databases. The experimental Results have illustrated the efficacy of the designed prediction system in predicting the heart attack.
In the medical field, the hospital information system is used to receive different kinds of medical records of diseases and its patients who hail from different areas. It is a herculean task to identify the frequent diseases and its causes from the large data set. Patients from different locations approach different hospitals. They do not converge in a same place. Their records are maintained by the hospitals where they get treated. Collecting information about the frequently occurring diseases is not an easy job. The data collection regarding these sorts of diseases can be done through association rule. Apriori of the Association rule is adopted for the mining of data. Details
2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME) regarding the occurrence of these diseases in a particular time period can also be mined using apriori algorithm.
196
B. Outcome of the Research
A. Algorithm Ck: Medical data itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = Medical data generated from Lk; each transaction t in database do increment the count of all medical data in Ck+1 that are contained in t Lk+1 = medical data in Ck+1 with min_support end return k Lk;
The proposed method is useful to identify the frequent diseases in a large medical dataset. The outcome of this research will help the practitioners in making medicinal decisions for frequently occurring diseases. Analysis is made on data from various geographical locations during different time periods. V. EXPERIMENTAL RESULT Training data set used in this research work contains 1246 patients’ electronic medical records. The data set include patient’s name, sex, age, disease name, address, date, etc, in the year 2012. The proposed method is implemented using the WEKA data mining tool. The results obtained are promising.
March
t
June
t
t
t
t
September
t
October
t
t
t
t
T
t
T
t
T t
t
t
t t
t
t
t
t
Whooping Cough
t
t
t
t
t
t t
t
t t
t
t
t
t
t
t
t
t t
t
t t
t t
t
t
t t
sleep disorders
sexually transmitted diseases
Raynauds Phenomenon
Pertussis Pregnancy
Eye Disease
Overweight
Pain
t
t
t t
t
t t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t t
T
t
Lupus
Lung Cancer
Liver Disease
Liver cancer
Leukemia
t t
t
t
t
t
August
t
t
t
t
t
t t
t t
t
t t
t
t
Kidney Disease
Insomnia
t
t
t
July
t
t
t
t
May
Jaundice
Impotence
human papilloma virus
t
April
December
t
t
Thrush
t
t
Thyroid disorders
t
t
stroke
t
T
smoking
February
November
HIV
Asthma
t
Hypertension
January
Heart disease
Allergies
AIDS
TABLE I VARIOUS DISEASES APPEAR IN MONTH-WISE
t t
t
t
t
t t
t
2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME) The table I shows the occurrence of various diseases over 12 months in a year. Here the letter ‘t’ denotes the occurrence of a particular disease in a particular month.
197
the month-wise number of patients affected by various diseases. It reveals the fact that some patients are suffering from the same disease in a particular month.
The bar graph in Fig. 1. depicts the month-wise number of diseases affecting the patients. The bar graph in Fig. 2. shows 160
147
16 140
14
12 10 8
127
121 12 11
120
12
11 10
10 9 8
111
8
8
6
100
89
85
80
109
79 67
60
4
40
2
20
0
91
113
107
11 Number of Patients
Number of Diseases
14
0
Months Fig. 1 Number of diseases occurs in month-wise
Months Fig. 2 Number of patients affected from diseases in month-wise
Relation: Disease Instances: 12 January February March April May June July August September October November December Attributes: 29 AIDS Allergies Heart disease Asthma
2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME) HIV human papilloma virus hypertension Impotence Insomnia Jaundice Kidney Disease Leukemia Liver cancer Liver Disease Lung Cancer Lupus Overweight Eye Disease Pain Pertussis Pregnancy Raynauds Phenomenon sexually transmitted diseases sleep disorders smoking stroke Thrush Thyroid disorders Whooping Cough Apriori ======= Minimum support: 0.35 (4 instances) Minimum metric : 0.9 Number of cycles performed: 13 Generated sets of large itemsets: Size of set of large itemsets L(1): 20 Large Itemsets L(1): Allergies=t 4 Heart disease=t 7 Asthma=t 4 HIV=t 4 hypertension=t 5 Impotence=t 6 Insomnia=t 6 Jaundice=t 5 Kidney Disease=t 5 Liver Disease=t 4 Lung Cancer=t 4 Overweight=t 5 Pain=t 7 Pregnancy=t 4 Raynauds Phenomenon=t 4 sexually transmitted diseases=t 5
198
sleep disorders=t 5 smoking=t 9 Thrush=t 4 Whooping Cough=t 7 Size of set of large itemsets L(2): 21 Large Itemsets L(2): Heart disease=t hypertension=t 4 Heart disease=t Insomnia=t 5 Heart disease=t Kidney Disease=t 4 Heart disease=t Liver Disease=t 4 Heart disease=t Overweight=t 5 Heart disease=t Pain=t 4 Heart disease=t smoking=t 6 Asthma=t Thrush=t 4 HIV=t smoking=t 4 hypertension=t smoking=t 4 Impotence=t Raynauds Phenomenon=t 4 Impotence=t smoking=t 4 Impotence=t Whooping Cough=t 5 Insomnia=t smoking=t 5 Liver Disease=t Overweight=t 4 Liver Disease=t smoking=t 4 Overweight=t smoking=t 4 Pain=t sexually transmitted diseases=t 5 Pain=t smoking=t 5 sexually transmitted diseases=t smoking=t 4 smoking=t Whooping Cough=t 5 Size of set of large itemsets L(3): 7 Large Itemsets L(3): Heart disease=t Insomnia=t smoking=t 4 Heart disease=t Liver Disease=t Overweight=t 4 Heart disease=t Liver Disease=t smoking=t 4 Heart disease=t Overweight=t smoking=t 4 Impotence=t smoking=t Whooping Cough=t 4 Liver Disease=t Overweight=t smoking=t 4 Pain=t sexually transmitted diseases=t smoking=t 4 Size of set of large itemsets L(4): 1 Large Itemsets L(4): Heart disease=t Liver Disease=t Overweight=t smoking=t 4 VI. CONCLUSION This research work proposes a association rule based apriori data mining technique that finds the frequency of diseases affecting patients. The study is made on patients from various geographical locations and at various time periods. Existing electronic medical records obtained from hospitals are used as training data set for the study. Totally
2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME) 1216 patient records affected by 29 different diseases during the year 2012 are analysed. WEKA data mining tool is employed to identify the frequency of the diseases that are recurring in people living in various geographical locations during different time periods. The analysis revealed the fact that 4 different diseases affected the patients frequently at various geographical locations during the year 2012. REFERENCES [1]
Arun K Pujari “Data Mining Techniques”, Edition 2001.
[2]
Jayanthi Ranjan, “Applications of Data Mining Techniques in Pharmaceutical Industry”, Journal of Theoretical and Applied Information Technology.
[3]
Jyoti Soni, et al., “Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction”, International Journal of Computer Applications (0975 – 8887), Volume 17– No.8, March 2011.
[4]
Carlos Ordonez, "Improving Heart Disease Prediction Using Constrained Association Rules”, Seminar Presentation at University of Tokyo, 2004.
[5]
Maria-Luiza Antonie et al., “Application of Data Mining Techniques for Medical Image Classification”, Proceedings of the second international workshop on multimedia Data Mining (MDM/KDD’2001), in conjunction with ACM SIGKDD conference. San Francisco, USA, August 26, 2001.
[6]
E. Barati et al., “A Survey on Utilization of Data Mining Approaches for Dermatological (Skin) Diseases Prediction”, Cyber Journals:
199
Multidisciplinary Journals in Science and Technology, Journal of Selected Areas in Health Informatics (JSHI): March Edition, 2011. [7]
Abdelghani Bellaachia and Erhan Guven, “Predicting Breast Cancer Survivability Using Data Mining Techniques”
[8]
G.Subbalakshmi et al., “Decision Support in Heart Disease Prediction System using Naive Bayes”, Indian Journal of Computer Science and Engineering (IJCSE)
[9]
N.DEEPIKA et al., “Association rule for classification of Heart-attack patients”, International Journal of Advanced Engineering Sciences and Technologies, Vol No. 11, Issue No. 2, 253 – 257.
[10] V.V.Jaya Rama krishniah et al., “Predicting the Heart Attack Symptoms using Biomedical Data Mining Techniques”, The International Journal of Computer Science & Applications (TIJCSA), Volume 1, No. 3, May 2012 ISSN – 2278-1080. [11] Sanjeev Rao and Priyanka Gupta, “Implementing Improved Algorithm Over APRIORI Data Mining Association Rule Algorithm”, International Journal of Computer Science and Technology(IJCST), Volume 3, Issue1, Jan. - March 2012, ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print). [12] S Stilou, P D Bamidis, N Maglaveras, C Pappas, “Mining association rules from clinical databases: an intelligent diagnostic process in healthcare”, Stud Health Technol Inform 84: Pt 2. 1399-1403, 2001. [13] Kaur, H., Wasan, S. K.: “Empirical Study on Applications of Data Mining Techniques in Healthcare”, Journal of Computer Science 2(2), 194-200, 2006.