Application of Weka environment to determine factors that stand behind non-alcoholic fatty liver disease (NAFLD)

Michał M. Plutecki (1), Aldona Wierzbicka (2), Piotr Socha (3), Jan J. Mulawka (1)

(1) Warsaw University of Technology, Warsaw, Poland
(2) Department of Biochemistry and Experimental Medicine, The Children Memorial Health Institute, Warsaw, Poland
(3) Clinic of Gastroenterology, Hepatology and Immunology, The Children Memorial Health Institute, Warsaw, Poland
ABSTRACT

The paper describes an innovative approach to discovering new knowledge about non-alcoholic fatty liver disease (NAFLD). In order to determine the factors that may cause the disease, a number of classification and attribute selection algorithms were applied, and only those with the best classification results were retained. Several interesting facts associated with this poorly understood disease have been discovered. All data mining computations were performed in the Weka environment.

Keywords: knowledge discovery from medical databases, feature selection
1. INTRODUCTION

Recent years have seen rapid progress in the application of computer technologies, driven by new generations of fast computers as well as growing storage capacity. New software tools make it possible to develop database mining procedures for medicine. So far, knowledge discovery in medicine has been carried out in a rather traditional way: first, a data set for a given disease is created; next, statistical tools are applied and the final conclusions are drawn from purely statistical methods. It is hard for a human analyst to consider the whole data space at once, or even in a stepwise statistical process, so these methods are often insufficient for discovering deeper knowledge. This limitation can be overcome by applying data mining procedures. In this contribution we use the Weka environment to explore a hepatological disease, non-alcoholic fatty liver disease (NAFLD). The disease is related to insulin resistance and components of the metabolic syndrome such as obesity, combined hyperlipidemia, diabetes mellitus and high blood pressure. A large number of treatments for NAFLD have been studied; even so, not much is known about this complex disease. In this study several groups of parameters have been taken into consideration: obesity, insulin resistance, lipid disturbances and oxidative stress. The aim of the study is to use classification and attribute selection algorithms commonly applied in data mining to learn more about NAFLD. The Weka environment, which implements all required algorithms, is used to determine specific risk factors that could play an important role in the pathogenesis of NAFLD. The control group considered here consisted of both obese and healthy children.
2. DISEASE DESCRIPTION

NAFLD is a poorly understood disease which may cause an enlarged liver and scarring of the liver. People with obesity, a high cholesterol level or high blood sugar are more likely to develop it. The exact causes responsible for the development of NAFLD have not been established yet, and currently there is no effective drug treatment for the disease. Some researchers consider that the development of NAFLD is associated with other disorders such as diabetes, stroke and some heart diseases. The disease is quite silent at the early stage, although one may experience symptoms like malaise, mild abdominal pain and fatigue; the advanced stage may be more noticeable. To detect NAFLD the results of liver tests need to be examined, but even then there is no guarantee of success. Nowadays NAFLD is increasingly recognized among children, yet the history of the disease is even less well understood in this population.
Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2009, edited by Ryszard S. Romaniuk, Krzysztof S. Kulpa, Proc. of SPIE Vol. 7502, 75022L · © 2009 SPIE · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.837690 Proc. of SPIE Vol. 7502 75022L-1
3. SPREADSHEETS WITH DATA

The original spreadsheet contained data on 201 children and 142 parameters. The patients were divided into three groups, each representing one class. There were 60 children with NAFLD; the mean age of this group was 13.55 years, and these instances were assigned to the "s" class. The next group covered 57 young patients who suffered from simple obesity; these children had a BMI above the 97th centile and were assigned to the "o" class. The last group consisted of 84 healthy children with a mean age of 14.37 years; all patients in this group had normal weight and were assigned to the "k" class. Preliminary data preprocessing decreased the number of rows to 172 and the number of columns to 127. Various parameters (attributes) were measured for each patient. Besides the patient id, the class a given patient represented, the date of visit, age, sex, weight, height and other physical parameters, there are also calculated indexes such as Z-score, BMI or BMI-for-age, as well as parameters representing different skin folds. Further groups of parameters contained various blood tests, total-body bone mineral density, selected enzymes, apolipoproteins A1, B and E, fatty acids and much more: 127 parameters in total. Different medical examinations are often performed in different laboratories, so it is common for medical data sets to have a large number of missing values. In the data mining literature a common practice is to replace missing values with the mode (the most frequent value) for nominal attributes or the median for numeric ones. Nevertheless, since such imputation would distort the medical data, this practice was not followed here.
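For reference, the imputation practice mentioned above (which was deliberately not applied to the NAFLD data) can be sketched in a few lines. This is a plain-Python illustration with made-up function names, not a Weka filter; Weka's own ReplaceMissingValues filter performs a similar substitution, using the mean rather than the median for numeric attributes.

```python
from statistics import median

def impute(rows, kinds):
    """Fill missing values (None) column by column: the mode for
    nominal attributes, the median for numeric ones. `kinds` marks
    each column as "nominal" or "numeric"."""
    cols = list(zip(*rows))
    filled = []
    for col, kind in zip(cols, kinds):
        present = [v for v in col if v is not None]
        if kind == "nominal":
            fill = max(set(present), key=present.count)  # mode
        else:
            fill = median(present)                       # median
        filled.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]
```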
4. COMPUTER AIDED KNOWLEDGE DISCOVERY

In order to conduct an effective and successful knowledge discovery process, an environment with well implemented data mining algorithms and simple access to them is required. This is why the Weka environment was chosen. Thanks to its collection of visualization tools and its algorithms for data analysis and predictive modeling, Weka seems ideal for the purpose of the presented research. Moreover, this cross-platform Java application, released under the GNU General Public License, offers easy access to its wide functionality. There are a few ways to use Weka as a powerful data mining tool; in this article only two of them have been applied. First, it will be shown how to preprocess the original data (given in spreadsheet form) to enable further processing. This part of the experiment was carried out with the WEKA Explorer Tool, as it seems the best suited for this purpose in the whole WEKA workbench. Next, because of the number of prepared data sets and the number of algorithms to apply, the WEKA Experimenter Tool was used. This tool makes it possible to set up an experiment by specifying all input files and all data mining algorithms to be applied. Both standalone and remote experiments can be created; remote ones require access to a database server holding the data to process. Here a standard standalone experiment was created rather than a remote one. Once the experiment is finished it is possible to analyze the results: in the Analyze tab one selects the variables to display and specifies the quality measures to output for each combination of data set and data mining algorithm. It is also possible to output, for example, classification models as raw text files, by changing rawOutput to true and specifying the output ZIP directory. During the research a disadvantage of the WEKA Experimenter Tool was found.
There is no possibility to output a confusion matrix, which would be quite a useful feature. However, a number of other useful features are supplied. It is possible to divide the input data set into a percentage split of training and test sets right before applying a classification algorithm, or into n-fold cross-validation, where for each fold 1/n of the data set is used as the test set and (n-1)/n as the training set. In general, most of the features available in the WEKA Explorer Tool are also available in the WEKA Experimenter Tool. Both tools were used in the research: the first to prepare the data, take a closer look at the best models created by the classification algorithms and view the selected attributes; the second to set up the whole experiment and, via the result analyzer, separate out the most valuable classification models. The following sections cover the details of this process.
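The n-fold cross-validation scheme described above can be sketched as follows. This is a plain-Python illustration of the splitting logic only; Weka performs the split internally.

```python
import random

def cross_validation_folds(data, n=10, seed=1):
    """Shuffle the instances and yield (train, test) pairs: in each of
    the n rounds, 1/n of the data serves as the test set and the
    remaining (n-1)/n as the training set."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n] for i in range(n)]   # n nearly equal parts
    for k in range(n):
        test = [data[i] for i in folds[k]]
        train = [data[i] for f in range(n) if f != k for i in folds[f]]
        yield train, test
```

Every instance appears in exactly one test set, so each model is validated on data it has never seen during training.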
5. DATA PREPARATION

Before any knowledge discovery could be performed, the data set had to be prepared for further analysis. Dealing with redundancy was the first issue: all redundant columns and instances were excluded from the original data set. There are no filters for this in the Weka environment, so it had to be done with other applications. The next issue was removing anomalies from the NAFLD data set. This was done by displaying and analyzing the attribute distributions. The visualization was produced with the Visualize All button in the Preprocess tab of the WEKA Explorer Tool, but checking the range of a given attribute's distribution had to be done by an expert. Next, the diagnosis attribute was set
as the class attribute. All attributes with a ratio of missing values to all values greater than fifty percent were removed, as they could not participate in further processing. This decreased the number of attributes from 146 (in the original data set) to 99. Additionally, because the aim of the study is to compare two classes, healthy patients and those with NAFLD, the two non-NAFLD classes were merged into one. In this way the first data set, called "nafld_0", was created. From this data set five more were derived, each a subset of "nafld_0" with a unique selection of attributes. Table 1 presents the attribute selection for each data set, the number of attributes and the field of medicine associated with the selection.
Table 1: Prepared data sets with chosen attributes and the number of attributes

File name   Covered field          Selected attributes                                #
nafld_0     whole data             all attributes                                     99
nafld_A     without biochemistry   Z-score, [TCH … C22:6n-3]                          79
nafld_B     fatty acids            Z-score, [DHA/ARA … C22:6n-3]                      59
nafld_C *   oxidative stress      Z-score, GSH, GPx, witA, witE, β-carotene, TBARS    8
nafld_D *   insulin resistance    proinsulin, insulin0, glucose0, HOMA_IR             5
nafld_E     lipid disturbances    [TCH … HDL]                                         11
Using the Edit feature in the WEKA Explorer Tool it is possible to view the raw data. After viewing all data sets listed in Table 1, it turned out that two of them had to be excluded from the exploration process because of lack of data; these are marked with a star next to their names. The preprocessed data thus consisted of four data sets: "nafld_0", "nafld_A", "nafld_B" and "nafld_E". The preparation process was quite smooth and simple, but time-consuming. It revealed an important problem of inconsistency, redundancy and incompleteness that is common to medical data sets. This problem not only hinders research of this kind, but often makes it impossible, wrong or distorted.
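The two preprocessing steps applied in this section, dropping attributes with more than fifty percent missing values and merging the two non-NAFLD classes, can be sketched as follows. This is an illustrative plain-Python version, not a Weka filter; the class labels "s", "o" and "k" follow the paper, while the merged label "non-s" is our own placeholder, since the paper does not state the label used.

```python
def prepare(header, rows, class_attr="diagnosis"):
    """Drop every attribute whose missing-to-total ratio exceeds 50%,
    then merge the "o" (obese) and "k" (healthy) classes into a single
    non-NAFLD class; "s" (NAFLD) instances are left unchanged."""
    n = len(rows)
    keep = [j for j in range(len(header))
            if sum(r[j] is None for r in rows) / n <= 0.5]
    header2 = [header[j] for j in keep]
    ci = header2.index(class_attr)
    rows2 = []
    for r in rows:
        r2 = [r[j] for j in keep]
        if r2[ci] in ("o", "k"):
            r2[ci] = "non-s"        # merged non-NAFLD class (placeholder name)
        rows2.append(r2)
    return header2, rows2
```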
6. WEKA ENVIRONMENT CONTRIBUTION IN DATA EXPLORATION

Once the data sets were prepared, the next, equally important step of knowledge discovery could be performed. The setup for the research was prepared using the WEKA Experimenter Tool. In order to provide good access to and simple manipulation of the output data, the destination file format was changed to comma separated values (CSV). As the model validation method, 10-fold cross-validation was chosen. This method provides a random data split and validates a model on 10 subsets, so there was no need to change the number of repetitions from its default value of one. By giving algorithms higher priority it is possible to specify that, when the experiment starts, each algorithm should run on every given data set first. The data sets described in the previous section were added as one experiment input; the second input list consisted of all classification algorithms that could handle the NAFLD data format. As there was no need for more advanced options, the simple experiment configuration mode was used. This finished the configuration process. The experiment was started in the Run tab, where any error logs can also be inspected; errors occur in case of bad configuration or inappropriate algorithm use. The computations took about 30 minutes on a medium-class PC. Because of the validation method it was not possible to view the summaries in the Analyze tab, and other applications aimed at table data processing had to be used. Using pivot tables it was easy to extract and display the interesting data. The area under the ROC curve and the number of correctly classified instances were chosen to compare the classification models created for each data set. Based on these data Table 2 was created. The table shows the results generated for the "nafld_0" data set. Although the corresponding tables for the remaining three data sets are not included here, they will also be referred to in the further considerations.
Table 2: Correctly classified instances and ROC Area calculated for all applied algorithms
[Bar chart, scale 0 to 1, showing ROC Area and correctness for the 55 applied classifiers: meta.LogitBoost, trees.FT, meta.AdaBoostM1, meta.RotationForest, rules.NNge, functions.SMO, functions.SimpleLogistic, functions.RBFNetwork, trees.RandomForest, trees.LMT, rules.JRip, rules.DecisionTable, bayes.NaiveBayesUpdateable, bayes.NaiveBayes, bayes.BayesNet, meta.MultiBoostAB, trees.DecisionStump, rules.OneR, meta.Decorate, meta.Dagging, lazy.LWL, trees.NBTree, meta.ClassificationViaRegression, meta.RandomSubSpace, meta.Bagging, trees.REPTree, misc.VFI, meta.RandomCommittee, trees.LADTree, misc.HyperPipes, trees.J48graft, rules.Ridor, meta.FilteredClassifier, rules.PART, trees.J48, meta.OrdinalClassClassifier, meta.nestedDichotomies.ND, meta.nestedDichotomies.DNBND, meta.nestedDichotomies.CBND, meta.END, meta.AttributeSelectedClassifier, rules.ConjunctiveRule, meta.MultiClassClassifier, functions.Logistic, trees.RandomTree, rules.ZeroR, meta.Vote, meta.Stacking, meta.RacedIncrementalLogitBoost, meta.MultiScheme, meta.Grading, meta.CVParameterSelection, lazy.IBk, lazy.IB1, lazy.KStar.]
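The two quality measures used in Table 2 are straightforward to compute; Weka reports both itself, so the following plain-Python sketch for the binary case is purely illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney formulation:
    the probability that a randomly chosen positive instance is
    scored higher than a randomly chosen negative one (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, predictions):
    """Fraction of correctly classified instances."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)
```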
From Table 2 it is easy to point out the classification algorithms with the best results: those with a high number of correctly classified instances. However, it is also important to have a high value of the area under the ROC curve. In the presented table there is no doubt that the LogitBoost, FunctionalTrees (FT) and AdaBoostM1 classification algorithms have the highest scores. All of them exceeded or nearly exceeded 95% of ROC Area and misclassified no more than 10 (out of 158) instances. LogitBoost turned out to be the best, misclassifying only 5 (out of 158) instances and covering 97% of ROC Area. Considering this a good result and the breakthrough moment in these considerations, a more detailed examination of those classifiers was made. First, a closer look was taken at the LogitBoost and AdaBoostM1 classification algorithms. To analyze them the WEKA Explorer Tool with the same validation method was used. Both algorithms turned out to use the same base classifier, DecisionStump, which would explain the similarities in the built classification models; the third algorithm (FT) uses different techniques to build its model. All three algorithms indicated nearly the same set of crucial attributes. Moreover, cut-off points for those attributes were found. The set of attributes indicated by the algorithms is shown in Table 3, with the best one marked with a star.
Table 3: Attributes indicated by three classification algorithms for the "nafld_0" data set

LogitBoost*
  N6 16.5 : s
  C14:1t > 0.305 : s
  MUFA > 18.95 : s
  C8:0 2.685 : s
  FaldBrzuch > 39.7 : s
  – –
AdaBoostM1 N6
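The cut-off points reported in Table 3 are exactly what a DecisionStump base classifier searches for. Below is a minimal single-attribute sketch of that search, illustrative only: Weka's LogitBoost and AdaBoostM1 fit many such stumps on reweighted data, whereas this sketch shows just the single-split scan over candidate thresholds.

```python
def best_stump(values, labels):
    """Scan candidate thresholds on one attribute and return the
    cut-off that best separates class "s" from "non-s" on the training
    data. Candidates are midpoints between consecutive sorted values."""
    pairs = sorted(zip(values, labels))
    best = (0.0, None, None)    # (accuracy, threshold, side)
    thresholds = [(a + b) / 2 for (a, _), (b, _) in zip(pairs, pairs[1:])]
    for t in thresholds:
        for side in (">", "<="):
            # predict "s" on the chosen side of the threshold
            pred = ["s" if ((v > t) == (side == ">")) else "non-s"
                    for v, _ in pairs]
            acc = sum(p == y for p, (_, y) in zip(pred, pairs)) / len(pairs)
            if acc > best[0]:
                best = (acc, t, side)
    return best
```

For example, best_stump([1, 2, 10, 11], ["non-s", "non-s", "s", "s"]) finds the cut-off 6.0 with the rule "value > 6.0 : s", analogous in form to rules such as "MUFA > 18.95 : s" in Table 3.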