Multiple Attribute Frequent Mining-Based for Dengue Outbreak
Zalizah Awang Long1, Azuraliza Abu Bakar1, Abdul Razak Hamdan1, and Mazrura Sahani2
1 Center for Artificial Intelligence Technology, Faculty of Information Science and Technology
2 Faculty of Allied Health Science
Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
Abstract. Dengue fever (DF) and dengue hemorrhagic fever (DHF) are vector-borne diseases that have been notifiable in Malaysia since 1974. Early notification is essential for control measures, as delayed notification leads to further outbreak cases. In this study we identify the attributes that can be used to determine outbreaks, rather than relying on case counts alone. The experiment is conducted using multiple attribute values based on the Apriori concept. The outcomes are promising: we identify more than one attribute showing a similar graph in vector-borne disease outbreaks. Our method also outperforms the CUSUM technique in terms of detection rate, false positive rate and overall performance. Our experiment shows that more than one attribute can be used to better detect outbreaks.
Keywords: Frequent mining, outbreak, dengue.
L. Cao, J. Zhong, and Y. Feng (Eds.): ADMA 2010, Part I, LNCS 6440, pp. 489–496, 2010. © Springer-Verlag Berlin Heidelberg 2010
1 Introduction
Dengue is one of the critical communicable diseases in Malaysia, and its outbreaks are becoming more serious. The first dengue outbreak reported in Malaysia was recorded in 1901 in Penang, more than a century ago. According to [1], dengue hemorrhagic fever (DHF) was first detected in 1962 and has gradually increased with the development of the country. Dengue fever (DF) and DHF have continuously been a public health issue in Malaysia and a growing pandemic, as reported by the World Health Organization (WHO). It is estimated that there are 50 million dengue fever cases each year, with 500,000 people with DHF requiring hospitalization. Dengue outbreaks are increasing not only in Malaysia but also in Thailand, Vietnam and Singapore. In Malaysia there are 24 dengue hotspots, most of them in the state of Selangor [2]. Surveillance for DF and DHF outbreaks in Malaysia is based on a laboratory-based surveillance system. In most cases, dengue surveillance systems are considered
passive surveillance systems, which depend on cases reported by physicians through the routine reporting system. In Malaysia, as mandated by law under Seksyen 10(2) Akta 342, all communicable diseases must be reported to the Ministry of Health [3]. To permit action on DF and DHF, or to control a dengue epidemic, a surveillance system must capture the related information, whether clinical or non-clinical. By its nature, a passive surveillance system may not be able to determine which combinations of non-clinical data are related to the generation of outbreaks, particularly for dengue. One way to improve early detection of dengue outbreaks is therefore to examine the historical data from the passive surveillance system and identify which of the collected attributes can be used to detect a dengue outbreak early. This may improve the public health surveillance system, particularly in Malaysia, and help ensure the effectiveness of the actions taken by public health officers to control such epidemics and mitigate their impact on the nation.
Early detection and prediction of potential dengue outbreaks have not been sufficiently discussed. The analysis of dengue outbreaks is based on the epidemic curve of vector-borne diseases, and considers the increasing, peak and declining phases of the curve to determine the outbreak. Most studies determine outbreaks from case counts over a period of time. In this study we therefore focus on other attributes that could be used to determine a potential dengue outbreak, predominantly in Malaysia. In this paper we apply a method called Multiple Attribute Value (MAV), which employs the Apriori concept for frequent mining. We use MAV to detect outbreaks, and we evaluate the algorithm by detection rate, false positive rate and overall performance. Our technique is able to identify combinations of attributes (k-length) that can be used to detect outbreaks.
Our evaluation is based on a real dataset.
2 Related Study
The frequent items problem is one of the most popular topics in data mining. The problem is interesting because it is simple to state and because of the value of the associations between items. Typically it is formalized as finding the items whose frequency exceeds a specified fraction of the total number of items, and generating combined (candidate) itemsets from them. The problem has many variations, and frequent values can serve real-life purposes such as outbreak detection. The search is costly because the number of candidate itemsets increases exponentially as the minimum support decreases. The passive surveillance system can be viewed as an instance of this problem, particularly in the dengue context. The vast amount of collected data can be represented as a set of transactions containing multiple attributes, which can be viewed as items that may each store more than one value. Originally, [4], [5] implemented the Apriori algorithm to mine one-dimensional Boolean association rules from transactional databases [6]. [4] was the first to introduce frequent pattern mining, for market basket analysis, in the form of association mining. It relies on the presence or absence of an item in a transaction, with the data represented in binary form. In dengue outbreak detection, however, we need to analyse data in a categorical representation, which consists of multiple values
within the attributes. We view the dataset as non-binary and not in transaction format. Based on classical Apriori, we develop a new algorithm named the Multiple Attribute Value (MAV) function to calculate the frequency of each attribute value within a database [7]. The verification of dengue outbreaks is based on the data collected for dengue cases. As reported in [8] and [9], a dengue outbreak means an incidence of notified dengue cases more than 1 SD above the average, while [10] defines an outbreak as a number of cases 2 SD above the mean baseline of non-epidemic weeks. Outbreaks are defined in many ways, depending on the disease and on the vehicle of the disease. A widely accepted definition is that of the CDC: “The occurrence of more cases of disease than is expected in a given area over a particular period of time. While epidemic often implies a large number of cases a wide geographic area. Cluster refers to an aggregate of cases in a given area over a particular period of time without regard whether number of cases is greater than expected” [11]. We use the definition of [12]: a dengue outbreak is the occurrence of more than one case in the same locality, where the dates of onset of the cases are less than 14 days apart. A number of studies have been conducted on identifying dengue outbreak cases. [12] used geospatial modeling to detect dengue outbreaks, focusing on the relation of population density and rainfall to dengue outbreaks. The dengue analysis in [9] aims at estimating the expected monthly values in each province, defining a statistically significant threshold, and then identifying periods that differ from that base value. [10] looked for correlations between different indicators to assess their usefulness in dengue outbreaks, and the results indicated that a negative malaria diagnosis is an indicator for dengue.
As proposed by [8], the use of different indicators should be investigated in different settings, particularly for dengue outbreaks. Following the recommendations of [8], our research focuses on finding a new way of defining outbreaks, with dengue as our case study. We focus on the use of data mining techniques, specifically association rule mining (ARM), to identify components of the surveillance data that could be used as indicators of dengue outbreaks. We mine frequent itemsets based on attribute values using Apriori-based concepts. Our experiment identifies the number of items to be considered in determining outbreaks, based on detection rate, false positive rate and overall performance. It also indicates that the number of cases is not critical in determining a dengue outbreak.
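To make the idea of mining frequent itemsets over attribute values concrete, the following is a minimal sketch of a level-wise (Apriori-style) search in which every item is an (attribute, value) pair. It is not the authors' MAV implementation: the function name, the attribute names and the toy records are hypothetical, and the support threshold is an absolute record count chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_attribute_sets(records, min_support, max_length=None):
    """Level-wise (Apriori-style) search over (attribute, value) items.

    records: list of dicts mapping attribute name -> categorical value.
    min_support: minimum number of records that must contain an itemset.
    Returns {frozenset of (attribute, value) pairs: support count}.
    """
    # Each record becomes a set of (attribute, value) items.
    transactions = [frozenset(r.items()) for r in records]

    # Level 1: count every single attribute value.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([item]): n for item, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current and (max_length is None or k <= max_length):
        items = {item for s in current for item in s}
        candidates = set()
        for s in current:
            for item in items - s:
                cand = s | {item}
                # A record holds one value per attribute, so skip itemsets
                # that repeat an attribute: they can never be supported.
                if len({attr for attr, _ in cand}) < k:
                    continue
                # Apriori pruning: every (k-1)-subset must be frequent.
                if all(frozenset(sub) in current for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical non-clinical records (illustration only).
records = [
    {"sex": "M", "age_group": "adult", "type": "DF"},
    {"sex": "M", "age_group": "adult", "type": "DF"},
    {"sex": "F", "age_group": "adult", "type": "DHF"},
    {"sex": "M", "age_group": "child", "type": "DF"},
]
result = frequent_attribute_sets(records, min_support=2)
longest = max(len(itemset) for itemset in result)
```

In this toy run, `longest` plays the role of the maximum k-length and `len(result)` the number of frequent items, the two quantities the experiments report per week.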
3 Experiment Setting
We run our algorithm on a real dataset of dengue cases. We compare our technique with the CUSUM technique in terms of detection rate, false positive rate and overall performance in detecting dengue outbreaks. The data were obtained from the Unit Kawalan Vektor, Pusat Kesihatan Hulu Langat. The data pre-processing is described below.
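For readers unfamiliar with the baseline, the following is a minimal sketch of a one-sided (upper) CUSUM detector on weekly case counts. The paper does not state the CUSUM configuration used in the comparison, so the reference value `k`, the decision threshold `h`, the reset after an alarm, and the toy counts are all assumptions made for illustration.

```python
def cusum_alarms(counts, baseline_mean, baseline_sd, k=0.5, h=2.0):
    """One-sided upper CUSUM on standardized weekly counts.

    counts: weekly case counts, week 1 first.
    k: reference value (allowance) in SD units; assumed, not from the paper.
    h: decision threshold in SD units; assumed, not from the paper.
    Returns the 1-based week numbers at which the statistic exceeds h.
    """
    s = 0.0
    alarms = []
    for week, x in enumerate(counts, start=1):
        z = (x - baseline_mean) / baseline_sd
        # Accumulate only the excess above the allowance k; never go below 0.
        s = max(0.0, s + z - k)
        if s > h:
            alarms.append(week)
            s = 0.0  # restart after signalling (one common convention)
    return alarms


# Hypothetical weekly counts with a two-week rise.
weekly_counts = [10, 10, 10, 20, 20, 10]
alarm_weeks = cusum_alarms(weekly_counts, baseline_mean=10, baseline_sd=5)
```

With these toy numbers the statistic reaches 1.5 in week 4 and 3.0 in week 5, so the only alarm is raised in week 5. The point of the sketch is that CUSUM reacts to case counts alone, which is the contrast the comparison is designed to test.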
3.1 Data Pre-processing
Due to the nature of the data, we had to conduct extensive data cleaning, data transformation and data reduction. In this experiment we focus on the non-clinical dataset. The dataset consists of information on year and epidemiological week (week 1 to week 52), age, sex, race, address, nature of work, type of dengue, incubation period, epidemic type, recurrent cases and death code. We focus on the demographic effect on recurrent cases and on the incubation period from onset to confirmed diagnosis. Approximately 0.14% of the data is removed in the pre-processing stage. We try to keep the analyzed dataset as close as possible to the original dengue dataset.
3.2 Multiple Attribute Frequent Mining
We introduce frequent mining analysis to retrieve normal behavior as the baseline. We implement Multiple Attribute Value (MAV), an Apriori-based algorithm, to calculate the frequent attributes within the surveillance data. Assume each attribute is an item. Let P = {P1, P2, P3, ..., Pn} denote the set of items (attributes). Each attribute Pn has multiple values, Pn = {pn1, pn2, pn3, ..., pnm}. Let transaction ti contain the items pnj, so that we can write a transaction as ti = {P1, P2, ..., Pn} and an attribute as Pn = {t1, t2, t3, ..., ti}. Following this reasoning, the frequency of each attribute value can be defined as follows, where the sum runs over the occurrences that satisfy the stated condition:

$$ P_{ij} = P_{kj}, \quad i \neq k, \qquad \mathrm{MAV}_{ij} = \sum P_{ij} \qquad (1) $$
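As a rough illustration of the frequency calculation described above, the sketch below counts, for every attribute, how often each of its values occurs across the records. It is only one reading of the definition, not the published MAV function; the function name and the toy records are hypothetical.

```python
from collections import Counter


def attribute_value_frequencies(records):
    """Count how often each value of each attribute occurs.

    records: list of dicts mapping attribute name -> categorical value.
    Returns {attribute: Counter({value: number of records with that value})}.
    """
    freq = {}
    for record in records:
        for attribute, value in record.items():
            # One counter per attribute; values are kept non-binary.
            freq.setdefault(attribute, Counter())[value] += 1
    return freq


# Hypothetical records for one week (illustration only).
week_records = [
    {"sex": "M", "race": "Malay", "type": "DF"},
    {"sex": "M", "race": "Chinese", "type": "DF"},
    {"sex": "F", "race": "Malay", "type": "DHF"},
]
freq = attribute_value_frequencies(week_records)
```

Here `freq["sex"]["M"]` is 2 and `freq["type"]["DHF"]` is 1. Values whose count passes a chosen threshold would then be combined into longer itemsets in the Apriori manner.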
We determine outbreaks according to the definition for the disease concerned; in this case, we use the dengue outbreak definition.
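The dengue outbreak definition adopted from [12] (more than one case in the same locality, with dates of onset less than 14 days apart) can be sketched as a simple check. The data layout, the function name and the example dates below are hypothetical; how localities are encoded in the actual dataset is not described in the paper.

```python
from collections import defaultdict
from datetime import date


def outbreak_localities(cases, window_days=14):
    """Flag localities that satisfy the outbreak definition.

    cases: iterable of (locality, onset_date) pairs.
    Returns the set of localities having at least two cases whose onset
    dates are fewer than window_days apart.
    """
    onsets_by_locality = defaultdict(list)
    for locality, onset in cases:
        onsets_by_locality[locality].append(onset)

    flagged = set()
    for locality, onsets in onsets_by_locality.items():
        onsets.sort()
        # After sorting, the closest pair of dates is always adjacent.
        for earlier, later in zip(onsets, onsets[1:]):
            if (later - earlier).days < window_days:
                flagged.add(locality)
                break
    return flagged


# Hypothetical cases (illustration only).
cases = [
    ("A", date(2009, 1, 1)),
    ("A", date(2009, 1, 10)),  # 9 days after the first case in A
    ("B", date(2009, 1, 1)),
    ("B", date(2009, 2, 1)),   # 31 days apart, so no outbreak in B
    ("C", date(2009, 3, 1)),   # single case
]
flagged = outbreak_localities(cases)
```

Only locality "A" is flagged in this example.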
4 Result and Discussion
The experiment has two main objectives. The first is to analyze the importance of the number of attributes (k-length) to be considered in determining an outbreak. The second is to identify the number of records to be considered in analyzing an outbreak using multiple attribute frequent mining. The discussion of the results is based on Fig. 1, Table 1 and Table 2.
Fig. 1. Number of records
In identifying dengue outbreaks, the reported cases are plotted based on the definition quoted from [12]. Fig. 1 shows the cases of dengue fever (DF) and dengue hemorrhagic fever (DHF) from epidemiological week 1 to week 52. Of the cases reported, 39.6% show an above-average number of cases. Most cases are reported towards week 44. The graph in Fig. 1 shows a few peaks throughout the 52 weeks, with a dramatic change at week 49. A number of outbreaks are detected using the number of reported cases: we identify outbreaks at week 3, weeks 5-7, week 14, week 16, week 18, weeks 20-22, week 26, weeks 28-29, weeks 31-32, weeks 34-35, week 38, week 41, weeks 43-44, weeks 46-47, week 49 and weeks 51-52. The question is which of these cases throughout the year are true outbreaks. Our technique is able to identify the true outbreaks. The performance of our technique is discussed on the basis of Table 2, while Table 1 relates the number of records and the maximum length produced to the speed and the number of frequent items detected.

Table 1. MAV results for dengue dataset

WEEK   RECORDS   MAX_LENGTH   SPEED (sec)   FREQUENT_ITEMS
34     101       5            0.125         50
13     62        5            0.203         93
12     63        5            0.125         63
52     232       4            0.234         56
49     226       4            0.187         52
21     132       4            0.11          37
19     74        4            0.11          44
The results in Table 1 were collected by running the MAV algorithm on the dengue data from week 1 to week 52. Due to limited space, only a few results are shown in Table 1 for discussion. Based on Table 1, we find that the number of records or cases does not determine the execution time. The recorded execution time for week 12 and week 34 is 0.125 sec with 63 records and 103 records respectively, while week 21 and week 19 have 132 and 74 records with the same execution time of 0.11 sec. The same pattern appears in week 13 and week 49: the execution time is 0.203 sec for 62 records and 0.187 sec for 226 records. The execution times were recorded over several runs of the experiment and varied by 0.001 second. Our experiment indicates that the fluctuation in execution time does not follow the number of records. We believe that the execution time depends on the complexity of the attribute values. Our experiment also identifies the maximum length produced by the algorithm on the dengue dataset. Most of the longest itemsets appear in the early epidemiological weeks, even though the number of records there is small compared with the end of the year. We believe this is also related to the complexity of the attribute values. We also analyze whether the number of records has any significance for generating the combinations of potential attributes used in determining the
outbreaks. We find that the number of records has no significant effect on the number of frequent items or on the recorded execution time. Week 13, with only 62 records, produces the highest number of frequent items together with the maximum length. In contrast, at week 52, which has the highest number of records (232), the number of frequent items is lower by 25%.

Table 2. Performance in detecting outbreak

                    MULTIPLE ATTRIBUTE VALUE (MAV)
MEASURE   CUSUM     MAV-1l   MAV-2l   MAV-3l   MAV-4l   MAVFI
DR        70.8%     58.8%    57.9%    65.0%    72.2%    74.1%
FPR       28.0%     28.0%    32.0%    28.0%    20.0%    28.0%
OP        67.3%     53.8%    53.8%    59.6%    63.5%    73.1%
The measures used to compare the proposed technique are given in Table 2 and are calculated from the matrix in Table 3. We compare our algorithm with CUSUM [12], [13] in detecting outbreaks. Table 2 also shows the detection rate (DR), false positive rate (FPR) and overall performance (OP) for the various lengths produced by our algorithm. In detection rate, our algorithm outperforms CUSUM with full length and with 4-length. In false positive rate, our algorithm outperforms CUSUM with 4-length, while with full length it produces the same result. In overall performance, our algorithm outperforms CUSUM with full length (73.1% against 67.3%).

Table 3. Calculation matrix for detection rate (DR), false positive rate (FPR) and overall performance (OP), adopted from [14]

                    Actual cases
System detected     Outbreak               No outbreak
Outbreak            True positive (TP)     False positive (FP)     TP+FP
No outbreak         False negative (FN)    True negative (TN)      FN+TN
                    TP+FN                  FP+TN                   TOTAL
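The three measures can be computed from the four cells of the matrix in Table 3. The paper does not spell out the formulas, so the sketch below assumes the usual definitions: DR as TP/(TP+FN), FPR as FP/(FP+TN) and OP as (TP+TN)/TOTAL. If DR was instead taken over the row total TP+FP, only that one line changes. The counts in the example are hypothetical and are not reported in the paper.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute DR, FPR and OP from the cells of the calculation matrix.

    The formulas are assumed (standard sensitivity, false positive rate
    and accuracy); the paper only refers to the matrix itself.
    """
    total = tp + fp + fn + tn
    return {
        "DR": tp / (tp + fn),    # share of actual outbreaks that were detected
        "FPR": fp / (fp + tn),   # share of non-outbreak weeks flagged in error
        "OP": (tp + tn) / total, # share of all weeks classified correctly
    }


# Hypothetical counts for a 52-week year (illustration only).
metrics = detection_metrics(tp=20, fp=7, fn=7, tn=18)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
```

For these illustrative counts, `percentages` is `{"DR": 74.1, "FPR": 28.0, "OP": 73.1}`.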
5 Conclusion and Future Work
Dengue is a mosquito-borne viral disease with a high capacity for epidemic outbreaks. Infection can be asymptomatic or can present with symptoms ranging from mild, self-limiting febrile illness to severe, life-threatening disease. Two clinical pictures are recognized: (a) dengue fever (DF) and (b) dengue hemorrhagic fever (DHF) or dengue shock syndrome (DSS). In detecting outbreaks, various techniques have been
applied, ranging from statistical methods such as the cumulative sum (CUSUM) [13], [15] to the space-time scan statistic [16], to name a few. Extensive reviews of detection techniques are given in [17], [18]. In most of the literature, the analytical algorithms that detect anomalies in order to reduce the outbreak curve are based on the number of reported cases. Our study identifies the combination of attributes to be used in determining an outbreak, focusing on vector-borne diseases. Our experiment shows that more than one attribute can be used, as reflected in the frequent items. The experiment is conducted using the Apriori concept for frequent mining. We find that using the maximum item length gives better performance in detecting outbreaks, based on the detection graph for vector-borne diseases. We also find that a high volume of records is not critical, since the complexity of the attribute values determines the potential dengue outbreaks. Our next experiment will focus on determining outbreaks using frequent mining with an outlier concept.
Acknowledgement
We would like to thank the Health District Officer, Ministry of Health (MOH), who provided the dengue database (Vekpro), and Dr. Zainuddin Mohd Ali for his information and support.
References
1. Choy, E.A., Asmahani, A., Mazrura, S.: Perubahan Iklim dan Kesihatan Manusia: Metodologi dan Senario Penyakit Bawaan Vektor (unpublished)
2. New Straits Times (NST) online, Dengue Alert, http://www.nst.com.my/Current_News/NST/articles/6dent/Article/
3. Seksyen Penyakit Berjangkit, Bahagian Kawalan Penyakit, Jabatan Kesihatan Awam, Kementerian Kesihatan Malaysia, http://www.moh.gov.my
4. Agrawal, R., et al.: Mining association rules between sets of items in large databases. J. ACM SIGMOD Record 22, 207–216 (1993)
5. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
6. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
7. Zalizah, A.L., Azuraliza, A.B., Abdul-Razak, H.: Mining Multiple Attribute Values for Frequent Itemset Generation in Non-Binary Search Space (2009)
8. Runge-Ranzinger, S., Horstick, O., Marx, M., Kroeger, A.: What does dengue disease surveillance contribute to predicting and detecting outbreaks and describing trends? J. Tropical Medicine & International Health 13, 1022–1041 (2008)
9. Barbazan, P., Yoksan, S., Gonzalez, J.P.: Dengue hemorrhagic fever epidemiology in Thailand: description and forecasting of epidemics. J. Microbes and Infection 4, 699–705 (2002)
10. Talarmin, A., Peneau, C., Dussart, P., Pfaff, F., Courcier, M., de Rocca-Serra, B., Sarthou, J.L.: Surveillance of dengue fever in French Guiana by monitoring the results of negative malaria diagnoses. J. Epidemiology and Infection 125, 189–193 (2000)
11. Excite, http://www.cdc.gov/excite/classroom/outbreak/objectives.htm
12. Seng, S.B., Chong, A.K., Moore, A.: Geostatistical modelling, analysis and mapping of epidemiology of Dengue fever in Johor State, Malaysia (2005)
13. Shmueli, G.: Current and Potential Statistical Methods for Anomaly Detection in Modern Time Series Data: The Case of Biosurveillance. Data Mining Methods for Anomaly Detection (2005)
14. German, R.R., Armstrong, G., Birkhead, G.S., Horan, J.M., Herrera, G.: Updated guidelines for evaluating public health surveillance systems. MMWR Recomm. Rep. 50, 1–35 (2001)
15. Watkins, R.E., Eagleson, S., Veenendaal, B., Wright, G., Plant, A.J.: Applying cusum-based methods for the detection of outbreaks of Ross River virus disease in Western Australia. J. BMC Medical Informatics and Decision Making 8, 37 (2008)
16. Kulldorff, M., Heffernan, R., Hartman, J., Assuncao, R., Mostashari, F.: A Space-Time Permutation Scan Statistic for Disease Outbreak Detection. J. PLoS Medicine 2, 216 (2005)
17. Buckeridge, D.L., Burkom, H., Campbell, M., Hogan, W.R., Moore, A.W.: Algorithms for rapid outbreak detection: a research synthesis. Journal of Biomedical Informatics 38, 99–113 (2005)
18. Hutwagner, L., Browne, T., Seeman, G.M., Fleischauer, A.T.: Comparing aberration detection methods with simulated data. J. Emerging Infectious Diseases 11, 314–316 (2005)