International Journal of Basic & Applied Sciences IJBAS-IJENS Vol:13 No:03
Association Rules of Data Mining Application for Respiratory Illness by Air Pollution Database
Carolyn Payus, Norela Sulaiman, Mazrura Shahani and Azuraliza Abu Bakar
Abstract - Exposure to air pollution has been linked to various adverse health effects. This study aims to assess the impact of air pollution on the number of hospitalizations for respiratory illness, using Kuala Lumpur as the case study. Kuala Lumpur, the capital city of Malaysia, is an urban and industrialized city in a tropical climate that often records the highest levels of severe respiratory illness due to air pollution. Air pollution triggers oxidative stress and inflammation, and it is plausible that high levels of air pollutants cause the high number of hospitalizations. In this study, an intelligent data mining approach called association rules was used because of its capability to search for interesting relationships among attributes in a large database and its ability to handle the uncertain data that often occur in real-world problems. Association rule mining is the discovery of association relationships, frequent patterns or correlations among sets of items or elements in databases. In air pollution and healthcare databases, association rules are useful because they offer the possibility of conducting intelligent diagnosis, extracting invaluable information and building important knowledge bases quickly and automatically, in order to develop effective strategies to minimize health exposure to air pollution. A total of 2102 data were obtained from the Department of Environment Malaysia and the Malaysian Ministry of Health. Six attributes were used as input and one attribute as output for the association rule mining. The data went through a pre-processing stage to meet the requirements of the modeling process. In conclusion, association rule mining gave a promising result, with more than 90% accuracy, and the rules obtained contribute to knowledge about respiratory illness.
Index Terms - Association rule, air pollution, respiratory illness, data mining

Carolyn Payus is with the Environmental Science Program, School of Science & Technology, Universiti Malaysia Sabah (UMS), 88999 Kota Kinabalu, MALAYSIA (e-mail: cpayus@gmail.com). Norela Sulaiman is with the School of Environmental and Natural Resources Science, Universiti Kebangsaan Malaysia (UKM), 43600 Bangi, Selangor, MALAYSIA (e-mail: [email protected]). Azuraliza Abu Bakar is with the School of Computer Science, Universiti Kebangsaan Malaysia (UKM), 43600 Bangi, Selangor, MALAYSIA (e-mail: [email protected]). Mazrura Shahani is with the Faculty of Health Science, Universiti Kebangsaan Malaysia (UKM), 43600 Bangi, Selangor, MALAYSIA (e-mail: [email protected]).

I. INTRODUCTION
Clean air is considered to be a basic requirement of human health and well-being; however, air pollution continues to pose
a significant threat to health worldwide (WHO 2011). The health effects of air pollution range from minor eye irritation to upper respiratory symptoms, chronic respiratory diseases, cardiovascular diseases and lung cancer, which may result in hospital admission and even death (Zheng 2011; Sousa et al. 2009; Ragas et al. 2011). The impacts of air pollution on human health can be assessed in terms of a reduction in average life expectancy, additional premature deaths, absence from work or school, hospital admissions, increased use of medication and days of restricted activity (EEA 2007; Rosa et al. 2008; Peng and Dominici 2008). In developed countries in Europe and in the United States (US), legislation and guidelines regarding the concentrations of air pollutants in ambient air have been established based on epidemiological, toxicological and clinical evidence (WHO 2006). However, in developing and newly industrialized countries such as Malaysia, studies on this matter have been neglected and started later than in Europe and the US. Moreover, legislation regarding air pollution standards in Malaysia has remained the same since 1978 (Olmo et al. 2011), thus allowing levels that have been shown to have serious effects on human health, especially on the children and elderly people exposed to them. Assessment of air pollution behavior and its impacts on health will help decision makers to better understand its effects, as well as the benefits that could be achieved through the application of control measures. The causes of respiratory illness and air pollution depend on various factors, including pollutant emissions, atmospheric chemical processes, topography, meteorological conditions and solar radiation (Seinfeld and Pandis 1998). The complex mechanism of air pollution formation and its respiratory effects makes the problem even more difficult to control.
To understand this problem, it is necessary to apply an intelligent approach that can describe the complex relationship between air pollution concentrations and the many variables that cause or hinder the respiratory effects. This complexity makes applying conventional statistical analysis to air quality and respiratory illness an inefficient task, as such analysis is mostly based on basic linear principles (Braak 1986). Although statistical methods may provide reasonable results, they are essentially incapable of capturing the complexity and non-linearity of the relationship between pollution and its adverse impacts (Chakraborty et al. 1992). Therefore, they are expected to underperform when used to model the relationship between air pollution and health effects, which is extremely non-linear. In the past few years, the collection of air quality and clinical data has generated an urgent need for new techniques and tools that can intelligently and automatically
136503-7474- IJBAS-IJENS @ June 2013 IJENS
transform the processed data into useful information and knowledge (Fayyad et al. 1996). Data mining, which is also known as knowledge discovery in databases, is a process of nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. In general, data mining is an essential process in knowledge discovery in which intelligent methods are applied in order to extract important data patterns. Data mining can be used as an intelligent diagnostic tool in healthcare. In clinical data, it is possible to extract knowledge about the elevated concentrations of air pollution that caused respiratory illness from patient measurements. In addition, in research data the extracted knowledge could be information about the concentration levels to which patients were exposed and that caused the respiratory sickness. Consequently, data mining has become an important research domain in environmental science and in healthcare. In this paper, we apply one of the association rule mining algorithms, namely the Apriori algorithm, to extract knowledge from a clinical database of respiratory patients for air pollution impact analysis.

II. METHODOLOGY
Data mining is the process of discovering interesting knowledge, such as patterns, associations, changes, anomalies and significant structures, from large amounts of data stored in databases, data warehouses, or other information repositories. Mining association rules is one of the techniques involved in this process and is the one used in this paper. Association rule mining is the discovery of association relationships or correlations among a set of items; it searches for interesting relationships among attributes in the database.
Association rules are similar to classification rules except that they can predict any attribute, not just the class, and this allows them to predict combinations of attributes. Different association rules express different regularities that underlie the dataset, and they generally predict different things. Because so many association rules can be derived from even a tiny dataset, interest is restricted to those that apply to a reasonably large number of instances and have a reasonably high accuracy on the instances to which they apply. The coverage of an association rule is the number of instances that it predicts correctly (Zhou 2008). This is often called its support. Its accuracy, often called confidence, is the number of instances that it predicts correctly, expressed as a proportion of all instances to which it applies. For example, in this research, the rule Air Pollutant (T, “LESS”) ==> Respiratory Illness (T, “LESS”) means that if the air pollutant level at time T is less than its mean, then respiratory illness at time T is also less than its mean. According to Malhotra and Venugopal (2011), the accuracy of this rule is the proportion of the days with air pollution below the mean air pollution that also have respiratory illness below the mean respiratory illness, expressed as a percentage or fraction. It is usual to specify minimum support (coverage) and confidence (accuracy) values and to seek only those rules whose support and confidence are at least equal to these specified minima. Rules that satisfy both the minimum support threshold and the minimum confidence threshold are called strong. Generally support and
confidence values are expressed between 0% and 100% rather than 0 to 1.0. There are two methods for mining the basic form of association rules, the Boolean association rules (Harms and Deogun 2004). One is a basic algorithm for finding frequent item sets, and the other is the frequent pattern growth method, which adopts a divide and conquer strategy. The Apriori algorithm (Witten and Frank 2008) for mining frequent item sets for Boolean association rules is used in the present study. The algorithm employs an iterative approach known as a level-wise search, in which k-item sets are used to explore (k+1)-item sets. In environmental studies, particularly on air pollution, association rules are useful for summarizing pollutant levels into groups (categories) and for building models for patient prediction (Wang 2005).

This study involves five major phases, namely (i) data selection; (ii) pre-processing; (iii) data mining; (iv) testing and evaluation; and (v) knowledge discovery, as shown in the framework of this study in Figure 1. Based on the framework, the pre-processing and data preparation stage is carried out in two steps: data cleaning and integration, followed by data selection and transformation. Pre-processing is done so that the rules generated at the end of the study are certain and reliable as a knowledge base. In this stage, several phases were carried out, namely data integration, data cleaning, attribute selection and data reduction. Data cleaning is required when there are incomplete attributes or missing values in the data. It involves filling in missing values, smoothing noisy data, identifying outliers and correcting data inconsistencies. Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection and the resolution of semantic heterogeneity contribute towards smooth data integration.
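The support and confidence measures and the level-wise search described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study (which ran Apriori in Weka); the function names and the four toy "days" are hypothetical, and support is expressed here as a fraction of instances rather than a count.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) divided by support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-item sets generate (k+1)-item candidates."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        # Join step: unions of two frequent k-item sets that have k+1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        prev = set(level)
        level = [c for c in candidates
                 if all(frozenset(sub) in prev for sub in combinations(c, k))
                 and support(c, transactions) >= min_support]
        k += 1
    return frequent


# Hypothetical toy data, one set of discretized attribute values per day.
days = [
    frozenset({"PM10=NORMAL", "TEMP=HIGH", "PATIENT=MODERATE"}),
    frozenset({"PM10=NORMAL", "TEMP=HIGH", "PATIENT=MODERATE"}),
    frozenset({"PM10=NORMAL", "TEMP=LOW", "PATIENT=NORMAL"}),
    frozenset({"PM10=HIGH", "TEMP=HIGH", "PATIENT=MODERATE"}),
]

frequent = apriori_frequent_itemsets(days, min_support=0.5)
rule_conf = confidence({"PM10=NORMAL", "TEMP=HIGH"}, {"PATIENT=MODERATE"}, days)
```

On this toy data the rule PM10=NORMAL, TEMP=HIGH ==> PATIENT=MODERATE applies to two days and is correct on both, so its confidence is 1.0 and its support is 0.5.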
Data transformation converts the data into forms appropriate for data mining, which depend on the mining technique. In the case of developing a knowledge-based model, the data are required to be discretized. This is because the rough classification algorithm only accepts categorical attributes. Discretization involves reducing the number of distinct values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
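As an illustration of replacing continuous values with interval labels, the following Python sketch applies equal-frequency binning, the method this study reports using for discretization. The helper name, the three labels and the sample values are assumptions made for the example; they are not the bins or cut points used by the authors.

```python
def equal_frequency_bins(values, labels):
    """Label each value so that every label covers about the same number of values.

    Values are ranked from smallest to largest and the ranks are split into
    len(labels) groups of (nearly) equal size. Tied values may fall into
    different bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n, k = len(values), len(labels)
    result = [None] * n
    for rank, idx in enumerate(order):
        result[idx] = labels[rank * k // n]
    return result


# Hypothetical daily PM10 readings (illustrative, not the study data).
pm10 = [27, 22, 21, 34, 30, 26, 43, 47, 44]
pm10_labels = equal_frequency_bins(pm10, ["LOW", "NORMAL", "HIGH"])
```

With nine readings and three labels, each label is assigned to exactly three readings; the three smallest values become LOW and the three largest become HIGH.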
Fig. 1. Methodology framework
The time series dataset, obtained from the Malaysian Ministry of Health and the Department of Environment Malaysia, consists of 1000 lines with 7 attributes: PM10, CO, SO2, NO2, O3, temperature and number of hospital admissions (respiratory illness patients). The first six attributes were used as input (predictor) attributes, while the last attribute, patients, was the target knowledge (output). Table I describes each attribute. The air quality and clinical data obtained are continuous and cover the study area, Kuala Lumpur, from 1 January 2008 to 31 December 2008.

TABLE I
ATTRIBUTES ON AIR QUALITY PARAMETERS AND PATIENTS

No  Notation  Attribute                               Data                            Measurement scale / unit
1   SITE      Station                                 Kuala Lumpur                    not relevant
2   DATE      Date                                    1 Jan 2008 to 31 December 2008  month/day/year
3   O3        Ozone                                   0.009 – 0.135
4   PM10      Particulate Matter                      21 – 84
5   CO        Carbon Monoxide                         0.198 – 1.341
6   SO2       Sulphur Dioxide                         0.00 – 0.011
7   NO2       Nitrogen Dioxide                        0.012 – 0.035
8   TEMP      Temperature                             24.6 – 29.9                     Celsius
9   PATIENTS  Number of Respiratory Illness Patients  7 – 32                          -
Table II shows the first ten rows of the datasets that were collected. All attributes in the dataset contain highly distinct values that need to be handled. Association rules by nature require the data to be modeled in discrete form. Therefore, discretization is required to transform the data into ranges of categories.
TABLE II
FIRST TEN ROWS OF THE RAW DATASETS

Date    O3     PM10  CO     SO2    NO2    Temp  Patient
1/1/08  0.036  27    0.423  0.001  0.015  27.0  20
1/2/08  0.046  22    0.396  0.002  0.015  26.9  23
1/3/08  0.040  21    0.493  0.002  0.017  26.3  24
1/4/08  0.066  34    0.815  0.002  0.019  26.3  25
1/5/08  0.052  30    0.760  0.002  0.018  25.9  26
1/6/08  0.078  27    0.461  0.002  0.020  25.8  20
1/7/08  0.081  43    1.144  0.003  0.026  25.8  23
1/8/08  0.048  47    1.284  0.004  0.029  25.0  19
1/9/08  0.079  43    1.093  0.003  0.028  25.3  20
Data discretization is particularly useful for large datasets that have many incomplete attributes or missing values. In this study, discretization was done by first performing several statistical analyses to investigate the distribution of values in each attribute. The equal frequency binning method was then used to discretize the data (Han and Kamber 2001). Table III depicts the results of data discretization for each attribute.

TABLE III
FIRST TEN ROWS OF THE DISCRETIZED DATASETS AFTER BINNING

[Table III: columns O3, PM10, CO, SO2, NO2, Temp, Patient; cell values are Low, Normal, Moderate or High.]
III. APPLICATION & RESULT
The association rule model was run with a minimum support of 0.1 and a minimum confidence of 0.1. The number of association rules generated was 42, with the highest
confidence value of 0.93. In addition, 42 large itemsets with lengths of 1 to 3 were identified for the patient forecasting model, as shown in Table IV.

TABLE IV
TOTAL OF L-ITEMSET

L-itemset Size  No. Rules L-itemset (Patients Dataset)
1               16
2               21
3               5
4               -
5               -
The rules were then grouped into four ranges of confidence level: 0.00 to 0.40 represents a weak rule; 0.41 to 0.60 a moderate rule; 0.61 to 0.80 a general or common rule; and 0.81 to 1.00 a strong rule. In this research the association rule model generated 17 rules for NORMAL hospitalized patients, 1 rule for HIGH hospitalized patients and 24 rules for MODERATE hospitalized patients. Figure 2 shows the output of the association rules obtained from the Weka 3.7 platform, with minimum support = 0.1 and minimum confidence = 0.1. After the final screening process, summarized in Appendix 1, the generated association rules indicate that PM10, CO and temperature are strongly associated with the number of patient hospitalizations. The generated association rules also show some association between HIGH patients and CO, but with a weak confidence value of 0.15. Temperature and air pollutants such as CO and PM10 are generally highly correlated in many places (Holgate et al. 1999), and they may interact significantly to affect health outcomes (Choi et al. 1997; Roberts 2004). For example, Katsouyanni et al. (1993) reported that air pollution and ambient temperature had a synergistic effect on excess mortality during the 1987 heat wave in Athens. They found that temperature significantly modified the association between exposure to SO2 and CO and total excess mortality, although the main effect of these pollutants was not statistically significant. Roberts (2004) found that temperature modified the association between PM10 and mortality. Our findings also indicate that PM10, CO and temperature are significantly associated with patient hospitalizations. These results support the hypothesis that air pollutants, along with temperature, might contribute to health outcomes.
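The four confidence ranges given at the start of this section can be written as a small lookup, sketched below in Python. The function name is hypothetical, and the handling of values that fall between the stated ranges (for example 0.405) is an assumption: each range is treated as ending at its upper bound inclusively.

```python
def confidence_band(conf):
    """Map a rule confidence in [0, 1] to the four bands described in the text."""
    if not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    if conf <= 0.40:
        return "weak"
    if conf <= 0.60:
        return "moderate"
    if conf <= 0.80:
        return "common"
    return "strong"


# Confidence values quoted in the text: the highest rule and the HIGH/CO rule.
best_rule_band = confidence_band(0.93)
high_co_band = confidence_band(0.15)
```

Under this reading, the rule with confidence 0.93 falls in the strong band and the HIGH patients and CO association with confidence 0.15 falls in the weak band, matching the description in the text.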
Exposure to air pollutants such as PM10, SO2 and CO may directly affect the airways through inhalation, including the upper airways, bronchioles and alveoli. The exposure could modulate the autonomic nervous system and might further influence the cardiovascular system (Gordon 2003; Jeffrey 1999). Some studies have shown that PM10 is associated with decreased heart rate variability (Creason et al. 2001; Gold et al. 2000). Ambient temperature changes also impose physiological and psychological stress on the body (Gordon 2003), which could aggravate pre-existing diseases. Therefore, both air pollutants and
temperature may interact synergistically to affect human morbidity, and thus mortality.

=== Run information ===

Scheme:       weka.associations.Apriori -N 1000 -T 0 -C 0.1 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -A -c -1
Relation:     PatientsData_Clean-weka.filters.unsupervised.attribute.NumericToNominal-Rfirst-last-weka.filters.unsupervised.attribute.Remove-R6-7
Instances:    366
Attributes:   7
              SO2
              NO2
              O3
              PM10
              CO
              TEMP
              PATIENTS

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.1 (37 instances)
Minimum metric : 0.1
Number of cycles performed: 18

Generated sets of large itemsets:
Size of set of large itemsets L(1): 16
Size of set of large itemsets L(2): 21
Size of set of large itemsets L(3): 5
Fig. 2. Run information of the association rules
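The figures in the run information can be checked with simple arithmetic, sketched below in Python. The sketch assumes that -U, -M and -D are the upper bound, lower bound and step (delta) of the minimum support, and that Weka starts at the upper bound and lowers the support by one step per cycle; the rounding used for the instance count is also an assumption. The variable names are illustrative.

```python
import math

n_instances = 366      # "Instances: 366"
min_support = 0.1      # -M 0.1 (assumed: lower bound of minimum support)
upper_support = 1.0    # -U 1.0 (assumed: upper bound of minimum support)
delta = 0.05           # -D 0.05 (assumed: decrease in support per cycle)

# 0.1 * 366 = 36.6, so an itemset needs at least 37 supporting instances.
min_support_count = math.ceil(min_support * n_instances)

# Lowering the support from 1.0 to 0.1 in steps of 0.05 takes 18 steps.
cycles = round((upper_support - min_support) / delta)
```

Both values agree with the reported output: a minimum support of 0.1 corresponds to 37 of the 366 instances, and 18 cycles were performed.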
The major strength of this study is that, to our knowledge, it is the first to use an intelligent diagnosis approach, association rules, to extract invaluable information and association patterns from such a database. However, this study also has one key limitation. It was carried out in a single city with a tropical climate and used data from a single year, 2008, which, although drawn from an hourly database, are not extensive. Caution is needed when interpreting any such time-series study within a single location. Therefore, we suggest that future work involve at least 3 different locations for comparison, so that the findings can be generalized, valid and consistent for other places. However, this study is a pilot study that pioneers the use of association rules for air pollution and respiratory databases, and the variation would not be significant or severe enough to affect the study (Kim et al. 2005).

IV. CONCLUSION
This paper makes a promising and valuable contribution, especially to air pollution management. It is the first attempt to use association rules to understand air pollution formation and its effect on
respiratory illness, and thus to help solve an environmental issue. The knowledge model obtained can be used as a decision support system to gain knowledge that is useful for preventing elevated exposure to the hazardous air pollutants that have the greatest impact on respiratory illness. From the association rule knowledge base, we can also identify which combinations or associations of air pollutants contribute to a higher health risk, so that more action plans can be made to resolve the problem. Association rule data mining produces knowledge that is understandable and can be interpreted easily. This is an advantage compared to other learning algorithms in conventional analysis. This study identified several important attribute combinations that have a strong influence on the respiratory illness of patients, such as PM10, CO and temperature.
REFERENCES
[1] Choi, K., Inou, S. and Shinozaki, R. 1997. Air pollution, temperature, and regional differences in lung cancer mortality in Japan. Archaeology Environmental Health, Vol. 52, pp. 160-168.
[2] Creason, J., Neas, L., Walsh, D. and Sheldon, L. 2001. Particulate matter and heart rate variability among elderly retirees. Journal of Exploratory Analytical Environmental Epidemiology, Vol. 11, pp. 116-122.
[3] EEA. 2007. Europe Environment: The 4th Assessment. European Environment Agency, Copenhagen.
[4] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. 1996. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
[5] Gold, D.R., Schwartz, J., Lovett, E., Larson, A. and Nearing, B. 2000. Ambient pollution and heart rate variability. Circulation, Vol. 101, pp. 1267-1273.
[6] Gordon, C.J. 2003. Role of environmental stress in the physiological response to chemical toxicants. Environmental Resources, Vol. 92, pp. 1-7.
[7] Harms, S.K. and Deogun, J.S. 2004. Sequential association rule mining with time lags. Journal of Intelligent Information Systems, Vol. 22, pp. 7-22.
[8] Holgate, S.T., Samet, J.M., Koren, H.S. and Maynard, R.L. 1999. Air Pollution and Health. Academic Press, London.
[9] Jeffrey, P. 1999. Effects of cigarette and air pollutants on the lower respiratory tract. Air Pollution and Health. Academic Press, Sydney.
[10] Katsouyanni, K., Pantazopoulou, A., Touloumi, G. and Asimakopoulos, D. 1993. Evidence for interaction between air pollution and high temperature in the causation of excess mortality. Archaeology Environmental Health, Vol. 48, pp. 235-242.
[11] Kim, D., Sass-Kortsak, A., Purdham, J.T. and Brook, J.R. 2005. Associations between personal exposures and fixed-site ambient measurements of fine particulate matter, nitrogen dioxide and carbon monoxide in Toronto, Canada. Journal of Exploratory Analytical Environmental Epidemiology, Vol. 12, pp. 1-12.
[12] Olmo, N.R.S., Saldiva, P.H.N., Braga, A.L.F., Lin, C.A., Santos, U.P. and Pereira, L.A.A. 2011. A review of low level air pollution and adverse effects on human health: implications for epidemiological studies and public policy. Clinical, Vol. 66, pp. 681–90.
[13] Peng, R.D. and Dominici, F. 2008. Statistical Methods for Environmental Epidemiology with R: A Case Study in Air Pollution and Health. Springer, New York, pp. 69–93.
[14] Ragas, A.M., Oldenkamp, R., Preeker, N.L., Wernicke, J. and Schlink, U. 2011. Cumulative risk assessment of chemical exposures in urban environments. Environment International, Vol. 37, pp. 872–81.
[15] Roberts, S. 2004. Interaction between particulate air pollution and temperature in air pollution mortality time series studies. Environmental Resources, Vol. 96, pp. 328-337.
[16] Rosa, A.M., Ignotti, E., Hacon, S.S. and Castro, H.A. 2008. Analysis of hospitalizations for respiratory diseases in Tangará da Serra, Brazil. International Journal of Environmental Health, Vol. 34, pp. 575–582.
[17] Sousa, S.I., Alvim-Ferraz, M.C.M., Martins, F.G. and Pereira, M.C. 2011. Spirometric tests to assess the prevalence of childhood asthma at Portuguese rural areas: Influence of exposure to high ozone levels. Environment International, Vol. 37, pp. 474–478.
[18] Wang, K.S. 2005. Mining customer value from association rules to direct marketing. Data Mining and Knowledge Discovery, Vol. 11, pp. 57-79.
[19] Witten, I.H. and Frank, E. 2008. Data Mining: Practical Machine Learning Tools and Techniques, ISBN 978-81-312-0050-6, Morgan Kaufmann Publishers.
[20] WHO. 2006. Air Quality Guidelines - Global update 2005. World Health Organization, Copenhagen, Denmark. Regional Office for Europe.
[21] WHO. 2011. World Health Statistics. World Health Organization, France Regional Office for Europe.
[22] Zheng, M. 2011. Hong Kong: Particulate air pollution and health impacts. Encyclopedia Environmental Health, pp. 56-61.
[23] Zhou, H. 2008. Time related association rules mining with attributes accumulation mechanism and its application to traffic prediction. Journal of Advanced Computational Intelligence and Intelligent Informatics, Vol. 12, pp. 467-478.
Appendix 1: Apriori association rule of patient prediction model

[Columns: ATTRIBUTE 1, ATTRIBUTE 2, ATTRIBUTE 3, PATIENT, CONF. The PATIENT and CONF columns are listed below in source order. Items appearing in the ATTRIBUTE columns: PM10=NORMAL, PM10=HIGH, CO=NORMAL, O3=NORMAL, O3=HIGH, TEMP=LOW, TEMP=NORMAL, TEMP=HIGH.]

PATIENT   CONF
MODERATE  0.93
MODERATE  0.93
MODERATE  0.88
MODERATE  0.83
MODERATE  0.8
MODERATE  0.7
MODERATE  0.7
MODERATE  0.58
MODERATE  0.57
MODERATE  0.55
NORMAL    0.55
MODERATE  0.55
NORMAL    0.54
MODERATE  0.54
NORMAL    0.54
MODERATE  0.5
MODERATE  0.49
NORMAL    0.34