A Multi-Stage Ensemble Data Mining Model to Predict ...

A Multi-Stage Ensemble Data Mining Model to Predict Ferritin Serum Levels

Mohammad A. Abedini, Iran University of Science & Technology, Department of Industrial Engineering, Tehran, Iran
Kamran Heidari, Department of Emergency Medicine, Loghman Hospital, Shahid Beheshti University of Medical Science, Tehran, Iran
Mohammadsadegh Mobin, Afshan Roshani, Aliea Afnan, Western New England University, Department of Industrial Engineering and Engineering Management, MA, USA

Abstract

Motivation: The ferritin serum level is one of the key factors in diagnosing diseases related to Iron Deficiency Anemia (IDA), which is one of the most common types of anemia. However, the ferritin serum level is often not measured, especially in the early stages of diagnosis, and assessing it is not always feasible in clinical laboratories.

Dataset Description

The dataset was obtained at TALEGHNI Hospital, Tehran, Iran. About 300 people were selected from the hospital patient list; after initial assessments by the Department of Clinical Diagnostics, a dataset of size 164 was selected.

The Proposed Model Framework

● Correlation-based feature selection
● Search method: Best first
Findings: Input features for the regression model

Dataset Partitioning

Dataset: 164

Objectives:

Train Set: 114

In this research, we developed a multi-stage ensemble data mining model which predicts the ferritin serum level in a more efficient way.

Model Input & Output

Inputs (Features)

● CBC Test Result:

Summary of Proposed Model: The proposed model works as a Decision Support System (DSS) that takes Complete Blood Count (CBC) test results, together with demographic information about the patients, as inputs and predicts the ferritin serum level. The model consists of three stages:
1. Select important features using correlation-based feature selection.
2. Train a decision tree as the base learner and apply four ensemble regression approaches to it: Bagging, Additive regression, Rotation forest, and Random subspace.
3. Evaluate and compare these approaches using the correlation coefficient and the root mean squared error.

1. Red Blood Cells (RBC)
2. Hemoglobin (HG)
3. Hematocrit (HCT)
4. Mean Corpuscular Volume (MCV)
5. Mean Corpuscular Hemoglobin (MCH)
6. Mean Corpuscular Hemoglobin Concentration (MCHC)

● Demographic Characteristics:
7. Age
8. Sex

PROPOSED DATA MINING MODEL

Output: Ferritin Serum Level

Note: All features from the CBC test and the ferritin level are numeric.

Data Mining Software

Model Verification

2. Ensemble Learning

● Comparing ensemble models on Test Set

● Base Regression: REPTree
● Ensemble methods:
1. Bagging
2. Additive regression
3. Rotation forest
4. Random subspace
Findings: Four different trained ensemble regression models
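As a rough illustration of the bagging idea listed above, the sketch below trains several base regressors on bootstrap resamples and averages their predictions. It is not the Weka implementation used in the study: a depth-1 regression stump stands in for REPTree, and the names `fit_stump`, `fit_bagging`, and the toy data are hypothetical.

```python
import random


def fit_stump(X, y):
    """Fit a depth-1 regression tree (a stand-in for REPTree, by assumption)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # skip thresholds that leave one side empty
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:
        # No valid split (all rows identical): predict the overall mean.
        mean = sum(y) / len(y)
        return lambda row: mean
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def fit_bagging(X, y, n_models=10, seed=0):
    """Train n_models stumps on bootstrap resamples and average their outputs."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: sum(m(row) for m in models) / len(models)


# Toy one-feature data (hypothetical, not from the study).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
predict = fit_bagging(X, y)
low, high = predict([1.0]), predict([6.0])
```

Averaging over resampled models reduces the variance of an unstable base learner such as a tree, which is the usual motivation for bagging.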

● Comparing the ensemble model with other powerful data mining tools, such as:

3. Model Selection

∘ Multilayer Perceptron Neural Network,
∘ RBF Neural Networks,
∘ Support Vector Machines (SVM),
∘ Linear Regression (LR).

● Evaluation criteria:
1. Correlation Coefficient
2. Root Mean Squared (RMS) Error
Findings: Final model
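The two evaluation criteria can be computed as in the minimal sketch below, assuming the correlation coefficient is the Pearson correlation between actual and predicted values. The function names and the sample values are illustrative only.

```python
import math


def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def rms_error(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Illustrative values (not from the study): every prediction is off by 0.5.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
corr = correlation_coefficient(actual, predicted)
rmse = rms_error(actual, predicted)
```

The example shows why both criteria are useful together: a constant offset leaves the correlation at its maximum while the RMS error still reports the size of the miss.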

Feature Selection Result

Using Correlation-Based Feature Selection (CBFS) as the feature selection method and the best-first algorithm as the search option, four features were selected in the first stage:
1. HG
2. MCH
3. Age
4. Sex
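The poster does not state the scoring formula, but correlation-based feature selection is commonly described with a merit heuristic that rewards subsets whose features correlate with the target and penalizes correlation among the features themselves: merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf the mean feature-target correlation, and r_ff the mean feature-feature correlation. The sketch below assumes that formula; the function name and all correlation values are hypothetical, and the best-first search over subsets is omitted.

```python
import math


def cfs_merit(feature_target_corrs, feature_feature_corrs):
    """Merit of a feature subset under the assumed CFS heuristic.

    feature_target_corrs: one correlation per feature in the subset.
    feature_feature_corrs: one correlation per pair of features (may be empty).
    """
    k = len(feature_target_corrs)
    r_cf = sum(abs(r) for r in feature_target_corrs) / k
    if feature_feature_corrs:
        r_ff = sum(abs(r) for r in feature_feature_corrs) / len(feature_feature_corrs)
    else:
        r_ff = 0.0  # a single feature has no pairwise redundancy
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical correlations: the same two features, weakly vs. strongly redundant.
merit_low_redundancy = cfs_merit([0.6, 0.5], [0.2])
merit_high_redundancy = cfs_merit([0.6, 0.5], [0.9])
```

With the same feature-target correlations, the subset with less redundant features scores higher, which is the behavior a search such as best first would exploit when comparing candidate subsets.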

Conclusions

Ensemble Learning Result

Summary of the Results: The results show that the bagging approach outperforms the others in terms of both criteria. By conducting this case study, the proposed model has proven to be an efficient DSS in IDA diagnosis.

Considering the RMS error and the correlation coefficient together as evaluation criteria, the bagging ensemble approach gave the largest improvement over the base regression method (REPTree). Additive regression produced no improvement over REPTree on either criterion.

1. Feature Selection

Test Set: 50
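The 164 records are divided into 114 for training and 50 for testing. The poster does not say how the records were assigned, so the sketch below assumes a seeded random shuffle; `split_dataset` and the placeholder record IDs are hypothetical.

```python
import random


def split_dataset(records, n_train=114, seed=0):
    """Shuffle a copy of the records and split it into train and test parts."""
    if n_train > len(records):
        raise ValueError("n_train exceeds the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs standing in for the 164 patient records.
records = list(range(164))
train, test = split_dataset(records)
print(len(train), len(test))  # 114 50
```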

Regression

Weka is a collection of Machine Learning Algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Model Selection Results

                          REPTree   Bagging   Additive reg.   Rotation forest   Random subspace
Correlation Coefficient   0.669     0.845     0.669           0.781             0.721
RMS Error                 1.040     0.936     1.040           0.986             0.990
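The comparison can be checked directly from the reported values. The short sketch below only re-reads the table figures; the dictionary layout and variable names are illustrative.

```python
# (correlation coefficient, RMS error) as reported in the results table.
results = {
    "REPTree": (0.669, 1.040),
    "Bagging": (0.845, 0.936),
    "Additive reg.": (0.669, 1.040),
    "Rotation forest": (0.781, 0.986),
    "Random subspace": (0.721, 0.990),
}

# Higher correlation is better; lower RMS error is better.
best_by_correlation = max(results, key=lambda m: results[m][0])
best_by_rms = min(results, key=lambda m: results[m][1])

# Ensembles whose scores are identical to the base learner's.
unchanged = [m for m in results if m != "REPTree" and results[m] == results["REPTree"]]

print(best_by_correlation, best_by_rms, unchanged)
```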

● The bagging approach improved the performance of REPTree, used as a weak learner, more than the other ensemble approaches did.
● The suggested model outperformed other powerful data mining tools.
● Besides diagnostic procedures, the model could be used as a DSS to help in ferritin serum level estimation and IDA diagnosis.
