Cluster Comput DOI 10.1007/s10586-013-0308-1

Adaptive mining prediction model for content recommendation to coronary heart disease patients

Jae-Kwon Kim · Jong-Sik Lee · Dong-Kyun Park · Yong-Soo Lim · Young-Ho Lee · Eun-Young Jung

Received: 30 April 2013 / Revised: 7 August 2013 / Accepted: 28 August 2013 © Springer Science+Business Media New York 2013

Abstract This paper proposes the Fuzzy Rule-based Adaptive Coronary Heart Disease Prediction Support Model (FbACHD_PSM), which gives content recommendation to coronary heart disease patients. The proposed model uses a mining technique validated by medical experts to provide recommendations. FbACHD_PSM consists of three parts for heart disease risk prediction. First, a fuzzy membership function is constructed using medical guidelines and statistical methods. Then, a decision-tree rule induction technique creates mining-based rules that are subjected to validation by medical experts. As the rules may not be medically suitable, the experts add rules that have been verified and delete inappropriate rules. Thirdly, using fuzzy inference based on Mamdani’s method, the model predicts the risk of heart disease. Based on this, final recommendations are provided to patients regarding normal living, nutrition control, exercise, and drugs. To implement our proposed model and evaluate its performance, we use a dataset from a single tertiary hospital.

Keywords Coronary heart disease · Data mining · Fuzzy logic · Decision tree · FbACHD_PSM

J.-K. Kim · J.-S. Lee
Department of Computer Science and Engineering, Inha University, Incheon, Korea
J.-K. Kim e-mail: [email protected]
J.-S. Lee e-mail: [email protected]

D.-K. Park · Y.-S. Lim · E.-Y. Jung (B)
u-Healthcare Center, Gachon University Gil Hospital, Incheon, Korea
e-mail: [email protected]
D.-K. Park e-mail: [email protected]
Y.-S. Lim e-mail: [email protected]

Y.-H. Lee
School of Information Technology, Gachon University, Seongnam, Korea
e-mail: [email protected]

1 Introduction

Artificial intelligence and data mining techniques have recently attracted attention as a means of enhancing data processing capacity and solving complex problems using computers [1]. Expert systems, which use data mining techniques, help to deal with complex and specialized decision-making issues [2]. Clinical Decision Support Systems (CDSS), which use expert systems, are rule-based and enable a computer to process medical knowledge that aids in the provision of medical care to patients; they also support diagnosis or treatment policies to assist medical experts who personally deal with complex requirements [3]. To realize CDSS, it is essential that computers understand the rules governing massive amounts of clinical knowledge. Many challenges still exist, and studies geared towards overcoming them are continuously being conducted [4]. Recently, progress has been made by combining artificial intelligence and data mining, and this has led to an improved ability to prevent disease by relying on clinical knowledge and patients’ data [5].

Coronary heart disease currently has the highest fatality rate among non-infectious diseases, and the rate is still increasing. Its management is also costly. Amid extensive efforts to prevent heart disease, research on heart disease prediction using CDSS is ongoing [6]. Prediction of heart disease can reduce health care costs and the need for future national health promotion. CDSS provides the knowledge required to decide how to use the diagnosis or treatment policy, facilitates correct decision-making and, in combination with an existing Hospital Information System (HIS), can predict heart disease risks [7]. However, to support this decision-making, a heart disease prediction model is required. This is currently the subject of ongoing research. FRS [8] and PROCAM [9] are typical coronary heart disease guidelines. Research to determine the safest and most cost-effective evidence-based systems in line with medical guidelines is also ongoing [10]. However, current research efforts rely only on knowledge rules; consequently, their predictions carry uncertainty. To enable CDSS to accurately predict coronary heart disease, a knowledge-based mining model is required. Further, in order to reduce the uncertainty of knowledge rules, a fuzzy logic method that processes uncertain and ambiguous clinical data is needed. Because fuzzy logic-based engines require validation by medical experts, credibility is high and uncertainty is low [11]. Fuzzy logic is a precise mathematical language for expressing clinical ambiguity explicitly. Thus, it is an extremely useful method for explaining ambiguity and uncertainties [12].

This paper proposes the Fuzzy Rule-based Adaptive Coronary Heart Disease Prediction Support Model (FbACHD_PSM) for coronary heart disease patients’ content recommendation. The proposed model uses a mining technique that is validated by medical experts to produce its recommendations. Thus, it removes the uncertainties experienced when using knowledge-based rules to make coronary heart disease predictions, and can predict heart disease risks at a lower cost. The fuzzy rule-based adaptive mining model consists of three parts for heart risk prediction.
The first part is in the form of a fuzzy set that constructs a fuzzy membership function using FRS medical guidelines [8] and utilizes statistical methods. The second part uses a decision-tree rule induction technique to create mining-based rules, which are further used to generate domain knowledge and other mining-based rules that are subjected to validation by medical experts. As the rules generated based on the clinical data may not be suitable from a medical perspective, it is necessary for medical experts to add rules that they have verified, while deleting inappropriate rules. Lastly, from fuzzy inference based on Mamdani’s method, it predicts the risks of heart disease. On the basis of the results of the heart disease risk prediction, final recommendations to a patient regarding normal living, nutrition control, exercise, and drugs are made. To implement our proposed model and evaluate its performance in terms of accuracy, we use the Personal Health Record (PHR) dataset from a single tertiary hospital (G Medical Center in Korea).

2 Related research

2.1 Knowledge-based CDSS

CDSS is classified into two main categories: knowledge-based CDSS and non-knowledge-based CDSS [13]. A knowledge-based CDSS follows clinical rules in the IF-Then format, and data are generally associated with these rules. The knowledge-based CDSS, including the rules and the inference engine, shows the results of the clinical data input of patients in a simplified manner. In certain cases, by using a knowledge-based server, it has proved to be far more effective for chest pain management than other programs [14]. In practice, symptoms are unpredictable, and can be both certain and uncertain. Symptoms and phenomena are both related to the characteristics of uncertainty [15, 16]. Artificial neural network [10] and genetic algorithm techniques are typical examples of non-knowledge-based CDSS. Both techniques, which rely on data rather than knowledge, predict symptoms very accurately, but their results are hard for the user to interpret and to use in decision-making [17]. Moreover, when using a complex clinical information database, their costs are too high due to their complexity and long learning time [18]. Therefore, knowledge-based CDSS is considered more effective and so, in this paper, we propose a fuzzy logic coronary heart disease prediction support model.

2.2 Prediction model

Artificial intelligence and data mining techniques have been used to construct a number of CDSS systems, and have been proposed in research to support decision-making in heart disease prediction. We elaborate on significant research results associated with heart disease prediction below. On the basis of the Dempster-Shafer theory of evidence and fuzzy sets theory, Khatibi and Montazer [19] constructed a fuzzy-evidential hybrid inference engine that has two phases of operation. In the first phase, a fuzzy model that converts the inputted vague value into a fuzzy value is constructed.
A fuzzy inference rule is then created based on the fuzzy set, which leads to the prediction results: this is where the fuzzy rule problem is extracted. In the second phase, to resolve the problems faced in the results of the prior phase, the values obtained are regarded as basic beliefs for each rule, and a second inference is made by using the belief and plausibility functions to deploy each rule. The model uses two types of medical rule bases, and resolves uncertainty in each rule through information fusion, which results in higher accuracy.

Tsipouras et al. [20] proposed a fuzzy rule-based DSS for the diagnosis of Coronary Artery Disease (CAD). Their proposed system comprises four stages. The first stage extracts rules from the dataset using decision-tree rule induction. The second stage converts the extracted rules to fit the crisp model. The third stage enters the crisp set of rules into the fuzzy model. Finally, the fourth stage optimizes the parameters of the fuzzy model by automatically generating decisions using the dataset. As the system is automatically generated, this model can easily provide interpretations for decision-making on the standard of CAD diagnosis.

Fidele et al. [21] proposed the use of an Artificial Neural Network (ANN) to support clinical decisions in the assessment of coronary heart disease risk in patients. The proposed ANN uses two neural networks, as well as the Levenberg-Marquardt algorithm and back propagation. Their proposed ANN predicts coronary heart disease risks and appears to be effective for individual patients. Applications that use the individual-level ANN appeared to be more effective in heart disease prediction.

The ability of a fuzzy neural network model to predict the likelihood of coronary heart disease has been evaluated by Abidin et al. [22] and has been implemented using a knowledge base of individual harmful bodily habits and demographic profiles. The prediction of the fuzzy neural network was found to be more accurate than that of the logistic regression model. It made coronary heart disease predictions on the basis of BMI, systolic blood pressure, total cholesterol, and age.

Table 1 Dataset features
No. | Feature | Units | Range | Type | Mean (±sd)
1 | Sex | [1: Male, 2: Female] | 1, 2 | Categorical | 163(1), 136(2)
2 | Age | Year | 29–73 | Numeric | 61.652 (±8.384)
3 | Total Cholesterol | mg/dL | 104–357 | Numeric | 199.231 (±40.263)
4 | HDL Cholesterol | mg/dL | 25–91 | Numeric | 52.411 (±12.949)
5 | Systolic Blood Pressure | mmHg | 56–154 | Numeric | 115.197 (±11.828)
6 | Diabetes | [0: No, 1: Yes] | 0, 1 | Categorical | 258(0), 41(1)
7 | Smoking | [0: No, 1: Yes] | 0, 1 | Numeric | 231(0), 68(1)
8 | CHD Risk | Risk Score (%) | 1–31 | Numeric | 9.184 (±5.793)
9 | CHD Event | [Very Low risk, Low risk, Moderate risk, High risk] | VL, L, M, H | Categorical | VL(176), L(71), M(44), H(8)

3 The Personal Health Record dataset

3.1 Personal Health Record (PHR)

The term Personal Health Record (PHR) has been in use since 1978 [23], and is still used alongside various other expressions such as Personally Controlled Health Record, Personal Medical Record, and Electronic Health Record. PHR has been defined by many institutions. According to material published officially in 2007 by the Healthcare Information and Management System Society (HIMSS), an electronic Personal Health Record (“ePHR”) is defined as “a universally accessible, layperson comprehensible, lifelong tool for managing relevant health information, promoting health maintenance and assisting with chronic disease management via an interactive, common dataset of electronic health information and e-health tools” [24]. PHR provides consumers with medical services such as various health-related information, a means to personally control and manage their health, and a means to be actively involved in the medical delivery system and to participate in the decision-making process. Such a PHR changes the roles and relationships of each stakeholder in the traditional medical delivery system into consumer-oriented ones. For this reason, each stakeholder can acquire new modes of benefit [25]. PHR is an essential element of CDSS, and by utilizing it, this paper generates a coronary heart disease prediction model.

3.2 Dataset

We collected data on 299 persons in the PHR dataset from a single tertiary hospital (G Medical Center in Korea). The 299 persons are patients suffering from heart disease who use the PHR service at the Gil Medical Center of Gachon University. We predicted the probability of heart disease using the clinical data on these 299 patients. To generate and assess the coronary heart disease prediction model, we used seven input attributes, namely sex, age, total cholesterol, HDL (High Density Lipoprotein) cholesterol, systolic blood pressure, diabetes, and smoking. In addition, we used two output attributes, CHD risk and CHD event, for the inputted attributes, and generated results in accordance with the FRS heart study [8]. The features of the nine attributes are displayed in Table 1. We generated and experimented with the prediction model by setting the inputs and outputs as stated above. We divided the 299 clinical records into a training set (70 %, 210 subjects) and a testing set (30 %, 89 subjects).
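The 70/30 partition described above can be sketched as follows. This is only an illustration, not the authors' procedure: the random shuffling, the fixed seed, and the rounding rule (rounding the training size up, which reproduces the reported 210/89 split) are assumptions, and `split_indices` is a hypothetical helper name.

```python
import math
import random


def split_indices(n_records, train_fraction=0.7, seed=0):
    """Shuffle record indices and split them into training and testing sets.

    Assumption: the training size is rounded up, which matches the
    210/89 split reported for 299 records.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    n_train = math.ceil(train_fraction * n_records)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(299)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the partition reproducible, so the same records are used whenever the model is regenerated and re-evaluated.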

Fig. 1 FbACHD_PSM architecture

4 Design of the Adaptive Coronary Heart Disease Prediction Support Model

In this section, we discuss the design of the coronary heart disease prediction model. To explain the FbACHD_PSM proposed in this paper, the system architecture, the construction of the fuzzy membership function, the rule set, and the fuzzy inference are described sequentially below.

4.1 Architecture

The structure of the FbACHD_PSM is shown in Fig. 1. The 299 PHR datasets obtained from the tertiary hospital are used as training sets and testing sets (refer to Sect. 3). Based on the training sets, the coronary heart disease prediction model is generated, and the performance of the generated model is evaluated using the testing sets. To construct the knowledge base, the fuzzy rule base and the fuzzy membership function are required. The fuzzy rule base generates IF-Then rules using the C4.5 decision tree algorithm based on the training set, and converts them so that they can be applied to the crisp function. In addition, for rule verification, medical experts personally assess the rules and delete, revise, and add to them. For the fuzzy membership function, the binary logic form is converted into multi-valued logic by referring to the existing medical guidelines (FRS study [8]). Further, for membership function optimization, the degrees of membership are revised by referring to the data of the training set. The fuzzy rule base and the fuzzy membership function are used in the fuzzy inference engine for heart disease risk prediction support. On the basis of the final results, recommendations in the normal living, nutrition control, exercise, and drugs categories are then made to patients.

4.2 Fuzzy membership function

A fuzzy membership function can be defined as a set that simply has boundary ambiguities. The design of the fuzzy membership function is illustrated in Fig. 2.

Fig. 2 Design of fuzzy membership function

Input values are defined to form the fuzzy membership function, and the continuous variables (age, total cholesterol, HDL cholesterol, and systolic blood pressure) are expressed as fuzzy sets. Next, the fundamental triangular fuzzy function is constructed by referring to medical guidelines. To adjust the set of the membership function, a position function diagram is formed from the input training dataset. Finally, the fuzzy hedge is determined through verification by medical experts to construct the fuzzy membership function that is to be used.

As shown in Table 2, the fuzzy function is constructed from four input variables and one output variable. The fuzzy parameters are constructed as shown in the table, and the output variable, μH, produces the final heart disease risk value. The final result is inferred through rule-based utilization by combining the four fuzzy parameters with categorical data on sex, diabetes, and smoking.

Table 2 Fuzzy parameter variables
Parameter | State variable | Linguistic variable
μA (Input) | Age | Young, Less mid-aged, Mid-aged, Very mid-aged, Very less old, Less old, Old
μB (Input) | Total Cholesterol | Very Low, Low, Moderate, High, Very High
μC (Input) | HDL Cholesterol | Very Low, Low, Moderate, High, Very High
μD (Input) | Systolic Blood Pressure | Very Low, Low, Moderate, High, Very High
μH (Output) | CHD Risk | Very Low, Low, Moderate, High

First, to construct the fuzzy function, the medical guidelines from the FRS study [8] are consulted. The triangular fuzzy function generated from the medical guidelines is the most fundamental fuzzy function and contains much uncertainty. The formula that converts the data in the guidelines before the triangular fuzzy function is constructed is shown in Eq. (1).

$$
\begin{aligned}
\mathrm{Mid} &= \bigl(\min(\mathrm{Guideline}_{ij}) + \max(\mathrm{Guideline}_{ij})\bigr)/2 \\
\mathrm{Left} &= \min(\mathrm{Guideline}_{ij}) - \bigl(\mathrm{Mid} - \min(\mathrm{Guideline}_{ij})\bigr) \\
\mathrm{Right} &= \max(\mathrm{Guideline}_{ij}) + \bigl(\max(\mathrm{Guideline}_{ij}) - \mathrm{Mid}\bigr)
\end{aligned}
\tag{1}
$$

In Eq. (1), i is the name of the variable and j is the corresponding parameter. $\min(\mathrm{Guideline}_{ij})$ is the smallest number among the items that correspond to parameters in the guideline, and $\max(\mathrm{Guideline}_{ij})$ is the largest. For example, if the lowest category of total cholesterol spans 160 to 200, then $\min(\mathrm{Guideline}_{ij}) = 160$ and $\max(\mathrm{Guideline}_{ij}) = 200$. Accordingly, Mid = 180, Left = 140, and Right = 220. The equation used to find the triangular fuzzy function is

$$
\mu_i(x : \mathrm{Left}, \mathrm{Mid}, \mathrm{Right}) =
\begin{cases}
0 & (x \le \mathrm{Left}) \\
\dfrac{x - \mathrm{Left}}{\mathrm{Mid} - \mathrm{Left}} & (\mathrm{Left} \le x \le \mathrm{Mid}) \\
\dfrac{\mathrm{Right} - x}{\mathrm{Right} - \mathrm{Mid}} & (\mathrm{Mid} \le x \le \mathrm{Right}) \\
0 & (x \ge \mathrm{Right})
\end{cases}
\tag{2}
$$

Fig. 3 Design of the fuzzy membership function
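As a minimal sketch of Eqs. (1) and (2), the following Python functions derive the triangle vertices from one guideline interval and evaluate the resulting membership degree. The function names are hypothetical, and the only data used is the 160 to 200 total cholesterol example given in the text.

```python
def guideline_to_triangle(g_min, g_max):
    """Eq. (1): derive (Left, Mid, Right) from one guideline interval."""
    mid = (g_min + g_max) / 2
    left = g_min - (mid - g_min)
    right = g_max + (g_max - mid)
    return left, mid, right


def triangular_membership(x, left, mid, right):
    """Eq. (2): degree of membership of x in a triangular fuzzy set.

    Assumes left < mid < right, i.e. the guideline interval is not a
    single point.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= mid:
        return (x - left) / (mid - left)
    return (right - x) / (right - mid)


left, mid, right = guideline_to_triangle(160, 200)
print(left, mid, right)                              # 140.0 180.0 220.0
print(triangular_membership(170, left, mid, right))  # 0.75
```

Because Left and Right extend half an interval width beyond the guideline bounds, adjacent categories overlap, which is what lets a single measurement belong partially to two linguistic values.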

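To modify the position function diagram of the fuzzy function, data from the training set is calculated and adjusted. The method used to modify the position function diagram of a fuzzy function is shown in Eq. (3).

$$
\begin{aligned}
\mu_i(x'_{\mathrm{Mid}}) &= \mu_i(x : \mathrm{Mid}) + \left(\frac{\sum_{n=\mu_i(x_{\mathrm{Left}})}^{\mu_i(x_{\mathrm{Right}})} \mathrm{trainingset}_{ij}^{\,n}}{n} - \mu_i(x : \mathrm{Mid})\right)\Big/\,2 \\
\mu_i(x'_{\mathrm{Left}}) &= \mu_i(x_{\mathrm{Left}}) - \left(\frac{\sum_{n=\mu_i(x_{\mathrm{Left}})}^{\mu_i(x_{\mathrm{Mid}})} \mathrm{trainingset}_{ij}^{\,n}}{n} - \mu_i(x_{\mathrm{Left}})\right)\Big/\,2 \\
\mu_i(x'_{\mathrm{Right}}) &= \mu_i(x_{\mathrm{Right}}) + \left(\frac{\sum_{n=\mu_i(x_{\mathrm{Mid}})}^{\mu_i(x_{\mathrm{Right}})} \mathrm{trainingset}_{ij}^{\,n}}{n} - \mu_i(x_{\mathrm{Right}})\right)\Big/\,2
\end{aligned}
\tag{3}
$$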
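One possible reading of Eq. (3) is sketched below. It assumes that each summation term divided by n is the mean of the training values lying between the two named vertices, and that each vertex is then shifted by half of its gap to that mean, with the signs as printed in Eq. (3). This interpretation, the function names, and the sample training values are all assumptions made for illustration; they are not taken from the paper's data.

```python
def mean_between(values, low, high):
    """Mean of the training values in [low, high], or None if there are none."""
    selected = [v for v in values if low <= v <= high]
    return sum(selected) / len(selected) if selected else None


def adjust_triangle(left, mid, right, training_values):
    """Assumed reading of Eq. (3): move each vertex by half its gap to a local mean."""
    m_mid = mean_between(training_values, left, right)
    m_left = mean_between(training_values, left, mid)
    m_right = mean_between(training_values, mid, right)
    # A vertex is left unchanged when no training value falls in its range.
    new_mid = mid if m_mid is None else mid + (m_mid - mid) / 2
    new_left = left if m_left is None else left - (m_left - left) / 2
    new_right = right if m_right is None else right + (m_right - right) / 2
    return new_left, new_mid, new_right


# Hypothetical training values for one total cholesterol category.
sample = [150, 170, 190, 200]
print(adjust_triangle(140, 180, 220, sample))  # (130.0, 178.75, 207.5)
```

Under this reading the Mid vertex moves toward the centre of the observed data, while the Left and Right vertices are repositioned relative to the data on their own side of the triangle.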

As shown in Eq. (3) and Fig. 3, the fuzzy membership function is designed by referring to the training set for the fuzzy input data. That is, the first triangular function is generated from the initial set through the guidelines and forms F, and then, by modifying the membership function, F′ is formed. The finalized fuzzy membership function is shown in Table 3.

Table 3 The fuzzy membership function
Parameter | Range: Left | Range: Mid | Range: Right | Linguistic variable
μA (Age) | 0 | 32 | 36 | Young
μA (Age) | 30 | 37 | 42 | Less mid-aged
μA (Age) | 36 | 42 | 46 | Mid-aged
μA (Age) | 40 | 48 | 51 | Very mid-aged
μA (Age) | 46 | 51 | 56 | Very Less Old
μA (Age) | 51 | 57 | 61 | Less Old
μA (Age) | 55 | 64 | 74 | Old
μB (Total Cholesterol) | 0 | 147 | 173 | Very Low
μB (Total Cholesterol) | 128 | 182 | 210 | Low
μB (Total Cholesterol) | 171 | 217 | 248 | Moderate
μB (Total Cholesterol) | 213 | 251 | 283 | High
μB (Total Cholesterol) | 255 | 290 | 339 | Very High
μC (HDL Cholesterol) | 0 | 33 | 39 | Very Low
μC (HDL Cholesterol) | 29 | 41 | 46 | Low
μC (HDL Cholesterol) | 38 | 47 | 53 | Moderate
μC (HDL Cholesterol) | 45 | 54 | 60 | High
μC (HDL Cholesterol) | 53 | 64 | 70 | Very High
μD (Systolic Blood Pressure) | 0 | 113 | 122 | Very Low
μD (Systolic Blood Pressure) | 113 | 124 | 132 | Low
μD (Systolic Blood Pressure) | 123 | 133 | 143 | Moderate
μD (Systolic Blood Pressure) | 133 | 145 | 162 | High
μD (Systolic Blood Pressure) | 148 | 162 | 172 | Very High

4.3 Rule set

To infer the fuzzy value, it is necessary to have a set of rules. The fuzzy-based prediction method for predicting coronary heart disease requires a definition of a rule set. The method used to construct the rules in this paper is illustrated in Fig. 4. To generate the rules using the training set, a decision-tree data mining technique is used. The decision tree technique generates IF-Then rules. The C4.5 decision tree algorithm is used for rule induction [26]. The C4.5 algorithm, which is based on entropy, calculates the average amount of information and generates a tree. To generate a fuzzy-based rule, the input values must be defined as categories, so the continuous values in the training dataset are converted into categorical values. Next, a tree is generated and the resulting values of the tree are converted so that they are appropriate for the crisp model from which we conduct fuzzy inference. The generated rule base is verified by medical experts who add, modify, and delete rules.

First, continuous values are converted to categorical values in accordance with the FRS study [8]. In addition, the output is assigned to CHD risk (Very Low, Low, Moderate, or High). The formula used to convert to a categorical value is shown in Eq. (4).

Fig. 4 Design of the rule base

$$
\text{IF } \min(\mathrm{Guideline}_{ij}) \le x_i \le \max(\mathrm{Guideline}_{ij}) \text{ THEN } x_i \rightarrow a_j
\tag{4}
$$

In the equation, $\mathrm{Guideline}_{ij}$ signifies a category attached to the medical guidelines. x represents the continuous value of the training data, i represents the parameter of the input data, and j represents the linguistic variable of the input value. If the training data meets the conditions, $x_i$ is converted to $a_j$, with the continuous value converted to a categorical value.
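The categorization in Eq. (4), together with the entropy-based measure that the text attributes to C4.5, can be sketched as follows. The guideline intervals, the sample values, and the function names are hypothetical placeholders rather than values from the FRS study or the paper's dataset. Only plain information gain is shown; C4.5 itself normalizes this quantity into a gain ratio, a step omitted here.

```python
import math
from collections import Counter

# Hypothetical guideline intervals for one parameter (placeholders only).
TOTAL_CHOLESTEROL = {"L": (160, 199), "M": (200, 239), "H": (240, 279)}


def categorize(x, guideline):
    """Eq. (4): return the label a_j whose interval contains x, else None."""
    for label, (g_min, g_max) in guideline.items():
        if g_min <= x <= g_max:
            return label
    return None


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy reduction obtained by splitting labels on a categorical attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical records: total cholesterol values and their CHD event labels.
cholesterol = [170, 185, 210, 250]
risk = ["VL", "VL", "M", "H"]
categories = [categorize(x, TOTAL_CHOLESTEROL) for x in cholesterol]
print(categories)                          # ['L', 'L', 'M', 'H']
print(information_gain(categories, risk))  # 1.5
```

In this toy example every category maps to a single risk label, so the split removes all of the label entropy; on real data the gain would be compared across attributes to choose each node of the tree.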

Next, the C4.5 decision-tree algorithm is used to generate rules. The decision tree calculates the average amount of information, based on the information gain, to generate the tree structure. The tree structure is divided into right and left sides, and generates branches until a result value is obtained. Finally, the rules generated using the decision tree are modified to create a crisp model. This crisp model modifies the rule to permit fuzzy inference. The modification method is illustrated in Fig. 5 and Eq. (5) [6].

Fig. 5 Modification to crisp model from decision tree

$$
\begin{aligned}
\mathrm{Condition}(a, j)_i &= x_{\mathrm{root}}(a_{\mathrm{root}}, j_{\mathrm{root}}) \wedge x_A(a_A, j_A) \wedge \cdots \wedge x_Z(a_Z, j_Z) \\
\mathrm{Rule}(a, j)_k &= \mathrm{Condition}(a, j)_{i1} \vee \mathrm{Condition}(a, j)_{i2} \vee \cdots \vee \mathrm{Condition}(a, j)_{in}
\end{aligned}
\tag{5}
$$

$\mathrm{Condition}(a, j)_i$ is the total condition for a single rule; that is, $\mathrm{Condition}(a, j)$ includes all the conditions from the root to the lowest node. For $\mathrm{Rule}_k$, k includes all the conditions of CHD risk, which is the final output. As shown above, the decision tree was transformed to a crisp model, and the rule base was generated using the C4.5 algorithm, which produced 54 rules. Because the generated rules result from data mining rather than from actual medical knowledge, verification is required.

Table 4 Domain knowledge and mining-based rules
No. | Rule
1 | Diabetes = N and sex = Men and Age = Young Then Very_Low_risk
2 | Diabetes = N and sex = Men and Age = Less_mid_aged Then Very_Low_risk
3 | Diabetes = N and sex = Men and Age = Mid_aged Then Very_Low_risk
4 | Diabetes = N and sex = Men and Age = Very_mid_aged Then Very_Low_risk
5 | Diabetes = N and sex = Men and Age = Very_less_old Then Very_Low_risk
6 | Diabetes = N and sex = Men and Age = Less_old and Smoking = N Then Very_Low_risk
7 | Diabetes = N and sex = Men and Age = Less_old and Smoking = Y Then Moderate_risk
8 | Diabetes = N and sex = Men and Age = Old and Smoking = N and Total-co = VL Then Moderate_risk
9 | Diabetes = N and sex = Men and Age = Old and Smoking = N and Total-co = L and HDL = L and SBP = M Then Low_risk
.... | ....
28 | Diabetes = Y and Total-co = L and Age = Old and SBP = L and HDL = VH Then Low_risk
29 | Diabetes = Y and Total-co = L and Age = Old and SBP = M Then Moderate_risk
30 | Diabetes = Y and Total-co = L and Age = Old and SBP = H Then Moderate_risk
31 | Diabetes = Y and Total-co = M Then Moderate_risk
32 | Diabetes = Y and Total-co = H and Age = Less_mid_aged Then High_risk
33 | Diabetes = Y and Total-co = H and Age = Mid_aged Then High_risk
34 | Diabetes = Y and Total-co = H and Age = Very_mid_aged Then High_risk
35 | Diabetes = Y and Total-co = H and Age = Old Then High_risk
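The structure of Eq. (5) can be illustrated with a short sketch in which each root-to-leaf path is a conjunction of attribute conditions and the rule for one CHD risk level is the disjunction of all paths ending in that level. The data layout and function names are assumptions made for illustration, and the matching shown is crisp (exact category equality), not the fuzzy inference step. The three sample paths are rules 6, 7, and 31 from Table 4.

```python
from collections import defaultdict


def paths_to_rules(paths):
    """Group root-to-leaf paths by outcome.

    Each path is (conditions, outcome), where conditions is a list of
    (attribute, category) pairs read from the root down to the leaf.
    Returns {outcome: [condition lists]}, i.e. an OR over AND-ed conditions.
    """
    rules = defaultdict(list)
    for conditions, outcome in paths:
        rules[outcome].append(list(conditions))
    return dict(rules)


def rule_fires(condition_sets, record):
    """True if any condition set (an AND of equalities) matches the record."""
    return any(all(record.get(attr) == cat for attr, cat in conds)
               for conds in condition_sets)


paths = [
    ([("Diabetes", "N"), ("sex", "Men"), ("Age", "Less_old"), ("Smoking", "N")],
     "Very_Low_risk"),
    ([("Diabetes", "N"), ("sex", "Men"), ("Age", "Less_old"), ("Smoking", "Y")],
     "Moderate_risk"),
    ([("Diabetes", "Y"), ("Total-co", "M")], "Moderate_risk"),
]
rules = paths_to_rules(paths)

# Hypothetical categorized patient record.
record = {"Diabetes": "Y", "Total-co": "M", "Age": "Old"}
print(rule_fires(rules["Moderate_risk"], record))  # True
print(rule_fires(rules["Very_Low_risk"], record))  # False
```

Keeping each path as an explicit list of conditions makes it straightforward for a reviewer to delete, revise, or add individual rules, which is the kind of expert verification the text describes.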

To verify the rule base, medical experts were consulted for advice. In the process of verification, the two medical experts deleted those rules that did not agree with expert knowledge and made corrections where necessary to generate a new rule base. A total of 35 domain knowledge and mining-based rules were thus verified; a sample of these is displayed in Table 4.

4.4 Fuzzy inference

Fuzzy inference requires the fuzzy membership function and the rule base designed in the preceding sections. The fuzzy inference method is illustrated in Fig. 6. Among the PHR data, the categorical data (specifically, the diabetes and smoking data) were entered into the inference engine, and the continuous data (specifically, the age, total cholesterol, HDL, and SBP data) underwent

Fig. 6 The fuzzy inference method

Table 5 Recommendation according to CHD risk CHD Risk (%)

1–9

10–14

16–20

25