Mining Learning Experiment Database for Effective Data Characterization in Automatic Classifier Selection

William-Chandra Tjhi1, Tianyou Zhang1, Kee Khoon Lee1, Bu Sung Lee2 and Terence Hung1

1 Institute of High Performance Computing, 2 Nanyang Technological University, Singapore
{tjhiwc, zhangty, leekk, terence}@ihpc.a-star.edu.sg
[email protected]
Abstract. How can we select a good classifier for a given dataset? One way is by inferring a function that maps dataset characteristics to selected models from past learning experiments. Aiming towards a collaborative platform for machine learning experiments, the Learning Experiment Database stores and shares past classification experiments. Initial studies reported useful patterns mined from the database to support classifier selection. We extend these studies by considering more data characterization criteria and new approaches to mine the database. We leverage the database to validate the usefulness of statistical, landmarking and partial learning curve data characteristics for classifier selection; propose more accurate landmarking based on classifier clustering; and derive a decision tree to generalize classifier selection. Our findings show a 44% improvement in accuracy compared to random classifier selection. This reaffirms the potential of a collaborative platform for automatic model recommendation towards making analytics more accessible to the public.

Keywords: classifier selection, Learning Experiment Database, landmarking
1 Introduction

Classification, defined as categorization from training data, is widely adopted for predictive modeling. However, with a plethora of classifiers to choose from, deciding on a suitable classifier for a given dataset is not trivial. A method that recommends classifiers suitable for a particular dataset is therefore needed. Such a method is the focus of meta-learning for model selection [1]. The key idea is to use past learning experiments to inform model selection for the current analysis. Features of past experiments, drawn from datasets, models and parameters, are stored as training data. The performance evaluations, e.g. classification accuracy, are treated as labels. Model selection is achieved by classifying this "meta-data", producing models that map learning characteristics to a suitable learner. A key challenge in meta-learning for model selection is how to extract descriptive features for assessing experiments'
relevance to the current analysis. Existing research focuses on extracting features from datasets more than from other aspects of experiments [2]. Intuitively, past analyses on datasets similar to the query dataset can provide clues on what models should be used. Dataset properties based on statistical measures are used frequently [3]. Additionally, landmarking [4], i.e. the accuracies of some simple classifiers applied to the query dataset, has been found to be descriptive for classifier selection. Recent works that look into the characterization of models and parameters have also revealed promising results [2].

Despite this progress, studies on learning experiment characterization for model selection have been localized and limited in scope. The process of gathering past learning experiments to build training data has been done in isolation by individual research groups, mainly using an auto-experimentation approach [5]. The latter requires significant time and computing resources, limiting the number of datasets and models studied. To consolidate the collection of past learning experiments, a platform for sharing learning experiments has been set up recently: a public online database called the Learning Experiment Database (LED) [6]. LED comprises mostly classification experiments on benchmark datasets from the UCI Machine Learning repository, run using open tools like Weka [7]. The information stored includes dataset and model properties, parameter settings, and performance evaluations of experiments. Preliminary studies discuss basic analyses on LED to understand how classifiers behave under various experimental settings [6][8][9]. It was not the focus of these works, however, to provide extensive LED analysis, and thus further in-depth studies are needed.

Motivated to gain more insights from the database and to improve the accuracy of classifier selection, we extend the existing work on LED mining and analysis. We study a wider array of data characteristics, combining their statistical, landmarking and Partial Learning Curve (PLC) [10] properties. We validate their usefulness for classifier selection by leveraging information from the database. Based on the data characteristics, we formulate three new approaches to mine the database for classifier selection. The first begins by finding the query dataset's closest neighboring dataset in the database. The most accurate classifier for the closest neighbor is selected for the query dataset. This approach is simple and efficient. In the second approach we improve the selection of landmarkers. Classifiers in the database are first clustered based on their accuracies on different datasets. The clusters provide a guideline to avoid using redundant landmarkers. The third approach generalizes classifier selection by inferring a decision tree that associates suitable classifiers with relevant data characteristics. The challenge in building the decision tree is that suitable labels for training are not readily available from LED. We incorporate cluster analysis on the classification accuracy table of LED to extract class labels. The decision tree improves the efficiency and accuracy of classifier selection as only relevant data characteristics are evaluated. We validated the proposed approaches on the 87 UCI Machine Learning datasets and 28 classifiers stored in LED.

Automatic classifier selection is part of the more general study of meta-learning for model selection.
One pioneering framework in the field is the Rice framework [11], which infers the mapping of data characteristics into model selection based on past learning experiments. A recent article provides an excellent survey on some instantiations of the Rice framework [1]. Works focusing on the data characterization aspects of classifier selection include [3] on the statistical characteristics, [12][4] on
landmarking and [10] on PLC. To our knowledge, our work is among the first to attempt to gain insights by combining these three types of characterization. Beyond data characterization, there are works that include the characterization of classifiers and their parameters in classifier selection [2]. The argument is that classifier characterization results in more meaningful relationships between data and classifiers. In a way, our clustering of classifiers (to improve landmarking) and of datasets (to extract labels for training the decision tree), based on the accuracy information in LED, borders on extracting meaningful relationships between data and classifiers. However, these relationships are modeled statistically rather than semantically as in [2]. We have yet to move into the semantic characterization of learning experiments.

Our work shares some similarities with classifier-selection tools such as METAL [13] and the ParEN extension of RapidMiner [12]. In general, our reliance on insights mined from LED distinguishes our work from these tools. For example, the new landmarkers and the decision tree (because of its class labels) do not exist independently of the clustering results derived from the database. More specifically, METAL considers only the statistical data characteristics. In addition, METAL and ParEN aim to predict the accuracies of classifiers, while our approach offers recommendations of classifiers, which facilitates users in their work in a way that predicted accuracies alone cannot.

On mining LED, there have been several works presented in [6][8][9]. These works provide initial insights into how the database can be useful for comparing classifiers, investigating parameter effects and profiling the bias-variance of classifiers. The analyses performed are limited mostly to direct database querying and a handful of classifiers and datasets. Our work can be seen as a more extensive follow-up to these works, aimed at improving the task of classifier selection.

Our results of mining LED for classifier selection using the proposed approaches are promising. Importantly, these approaches are data-driven, which enables them to adapt as new experiment data are added to the database. We expect that as the database grows, more reliable insights can be extracted by the proposed approaches.
2 Methods

Our study focuses on three types of data characteristics:
1. Statistics: the important statistical properties of a dataset as described in [3] (e.g. standard deviation and class entropy).
2. Landmarking: classification accuracies of a few efficient classifiers (called landmarkers) on a dataset. Landmarkers must be efficient to minimize the overhead of data characterization. They should comprise classifiers of different categories (e.g. trees, Bayesian, etc.) to avoid redundancy. Our choice of landmarkers follows [12]: NaiveBayes, Decision Stump (average, best and worst nodes), and 1-Nearest Neighbor (1-NN). Linear Discriminant Analysis was not chosen because it is unable to handle multi-class datasets.
3. Partial Learning Curve (PLC): a curve of landmarking scores at various sampling points of the dataset size [14]. The aim is to capture the trend of landmarking changes. In our implementation, we adopt uniform sampling at 20%, 40% and 60% of the dataset size.
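The sketch below illustrates how these three types of characteristics could be computed. It uses scikit-learn analogues of the Weka landmarkers named above (GaussianNB for NaiveBayes, a depth-1 decision tree for the Decision Stump, and 1-NN); the function names and the particular statistical features shown are illustrative assumptions, not the exact set used in the study.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Cheap landmarkers, mirroring the Weka choices in spirit only.
LANDMARKERS = {
    "naive_bayes": GaussianNB(),
    "decision_stump": DecisionTreeClassifier(max_depth=1),
    "1nn": KNeighborsClassifier(n_neighbors=1),
}

def statistical_features(X, y):
    """A small subset of the statistical meta-features described in [3]."""
    _, counts = np.unique(y, return_counts=True)
    return {
        "n_instances": X.shape[0],
        "n_attributes": X.shape[1],
        "mean_std": float(np.mean(np.std(X, axis=0))),
        "class_entropy": float(entropy(counts / counts.sum(), base=2)),
    }

def landmarking_features(X, y, cv=10):
    """Cross-validated accuracy of each landmarker on the full dataset."""
    return {name: cross_val_score(clf, X, y, cv=cv).mean()
            for name, clf in LANDMARKERS.items()}

def plc_features(X, y, fractions=(0.2, 0.4, 0.6), seed=0):
    """Partial learning curve: landmarking scores on uniform subsamples."""
    rng = np.random.RandomState(seed)
    feats = {}
    for frac in fractions:
        size = max(2, int(frac * len(y)))
        idx = rng.choice(len(y), size=size, replace=False)
        for name, score in landmarking_features(X[idx], y[idx], cv=3).items():
            feats["%s@%d%%" % (name, int(frac * 100))] = score
    return feats
```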
Figure 1 summarizes the three proposed approaches to mine LED for classifier selection. The following provides more details on these approaches (illustrative sketches of the closest-neighbor lookup and of the decision-tree construction are given after this list and after Figure 1):
1. The Closest Neighbor's Top Classifier (CNTC) approach. Given a query dataset, this approach recommends the classifier that gives the highest accuracy on the closest dataset in the database (i.e. the query dataset's Closest Neighbor's Top Classifier, or CNTC). The key is in finding the closest neighbor accurately. The closeness of two datasets is calculated using the normalized Euclidean distance, with the data characteristics as dimensions. We propose two ways to set parameters in the CNTC approach: i) CNTC-direct, where the parameter values that correspond to the highest accuracy of the CNTC on the closest neighbor are reused on the query dataset; ii) CNTC-optimum, where the parameters of the CNTC are optimized for the query dataset to reflect the true potential of the recommended classifier.
2. The Landmarking by Clustering Classifiers (LCC) approach. It is important for classifiers that serve as landmarkers to behave distinctly from one another. So far the choices of landmarkers have been made mostly based on the conventional understanding adopted by the machine-learning community [12]. We propose a data-driven approach that selects landmarkers so as to maximize their differences from one another. Classifiers in LED are first clustered based on their accuracies on the different datasets. To make the landmarkers as different as possible from one another, one landmarker is selected from each resulting cluster. Hierarchical Agglomerative Clustering (HAC) with Dynamic Tree Cut (DTC) [15] was adopted for clustering. The most efficient classifier in a cluster is chosen as its landmarker (efficiency here is the average build time recorded in LED).
3. Building a Decision Tree for Generalization of Classifier Selection (DT). This third approach mines classifier-selection rules from LED by focusing on the relevant parts of the data characteristics. The goal is to improve efficiency (fewer features) and accuracy (less noise). The rules take the form of a decision tree mined from LED based on the data characterization matrix, i.e. a matrix with datasets as rows and data characteristics as columns. To build a decision tree, class labels are needed, and these are not directly available from the database. We propose to use the accuracy profile of a dataset across all classifiers to generate class labels. We first cluster the datasets based on their accuracies across all the classifiers in the database. The different clusters reflect the different dataset groups that share similar classification accuracy profiles. HAC with DTC was again adopted for dataset clustering. To minimize the observed effect of high correlations among classifiers, Principal Component Analysis (PCA) is performed prior to clustering, and clustering is performed on the top 5 principal components (retaining 90% of the variance). The resulting clusters are used as the class labels of the data characterization matrix for training the decision tree. To associate a specific classifier with each class label, a lookup from LED is performed to find which classifier performs best on average across all the datasets in a cluster.
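A minimal sketch of the CNTC lookup follows, under the assumption that the dataset meta-features and the LED accuracy matrix have already been assembled as NumPy arrays; `meta`, `acc` and `classifier_names` are illustrative placeholders, not actual LED exports.

```python
import numpy as np

def cntc(query_features, meta, acc, classifier_names):
    """Recommend the top classifier of the query dataset's closest neighbor.

    meta : (n_datasets, n_characteristics) meta-features of datasets in LED
    acc  : (n_datasets, n_classifiers) accuracy matrix queried from LED
    """
    # Min-max normalize each characteristic so that no single dimension
    # dominates the distance ("normalized Euclidean distance" of Sec. 2).
    lo, hi = meta.min(axis=0), meta.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    meta_n = (meta - lo) / span
    query_n = (np.asarray(query_features) - lo) / span

    # Closest neighbor in the characteristics space, then its best classifier.
    closest = int(np.argmin(np.linalg.norm(meta_n - query_n, axis=1)))
    best = int(np.argmax(acc[closest]))
    return classifier_names[best], closest
```

CNTC-direct would additionally copy the neighbor's best-performing parameter setting, whereas CNTC-optimum would tune the recommended classifier on the query dataset itself.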
Fig. 1. The three approaches to mine Learning Experiment Database for classifier selection.
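The label-extraction step of the DT approach can be sketched as follows. Standard agglomerative clustering with a fixed number of clusters stands in here for HAC with Dynamic Tree Cut [15], and `meta` and `acc` are again assumed to be the data characterization and accuracy matrices; this is a simplified illustration rather than the exact pipeline used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def build_selection_tree(meta, acc, n_components=5, n_clusters=3):
    """Derive class labels from accuracy profiles, then train a selection tree.

    meta : (n_datasets, n_characteristics) data characterization matrix
    acc  : (n_datasets, n_classifiers) accuracy matrix queried from LED
    """
    # Reduce correlated classifier columns to the leading principal components
    # (the study keeps the top 5 components, covering about 90% of variance).
    reduced = PCA(n_components=n_components).fit_transform(acc)

    # Cluster datasets by accuracy profile; cluster ids become class labels.
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(reduced)

    # For each cluster, look up the classifier with the best mean accuracy.
    best_per_cluster = {c: int(np.argmax(acc[labels == c].mean(axis=0)))
                        for c in range(n_clusters)}

    # Estimate generalization with the weighted F-measure, then fit the tree.
    tree = DecisionTreeClassifier()
    score = cross_val_score(tree, meta, labels, cv=10,
                            scoring="f1_weighted").mean()
    tree.fit(meta, labels)
    return tree, best_per_cluster, score
```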
3 Results and Discussions

The tests were conducted on the 87 UCI datasets in LED. All the classifiers in the database were studied except the meta-classifiers, as we want to focus on individual classifier selection (not ensembles); this leaves 28 classifiers. All landmarking and PLC characteristics were calculated using Weka, with default parameter settings and automatic missing value replacement. The classifier accuracy measure was set to the weighted average per-class F-measure, which gives a less biased estimate of validity than prediction accuracy in the case of imbalanced class distributions [16]. The F-measure was calculated using 10-fold cross-validation.

In the CNTC-optimum approach, LED was first queried to build the accuracy matrix, i.e. a classification accuracy matrix with 87 rows (datasets) and 28 columns (classifiers). To reflect the optimum parameter setting, each element of the matrix was set to the highest accuracy recorded for the corresponding classifier on the corresponding dataset. The accuracy matrix was also used for the classifier clustering in the LCC approach and for the dataset clustering in the DT approach. Columns of the accuracy matrix containing more than 50% missing values were ignored, and the remaining missing values were replaced by the attribute mean. In the LCC approach, the CNTC-optimum and CNTC-direct approaches were repeated using the new landmarkers. The new landmarkers from the LCC approach were not used for building the decision tree in the DT approach, so as to isolate the effects of LCC and DT. In the DT approach, the classification accuracy of the decision tree was validated by 10-fold cross-validation on the 87 datasets. For every classifier selected for each test dataset, the classification accuracy was looked up from the accuracy matrix to measure the effectiveness of classifier selection.

To evaluate classifier selection performance, two baseline performances were defined. The first baseline, the random classifier, reflects the case where, for any given query dataset, users randomly select a classifier from the
pool of the 28 studied. For a given query dataset, this baseline is represented by the recorded accuracies of all the classifiers on the query dataset. In the CNTC-direct case, the accuracies are aggregated across all scenarios of parameter settings recorded in LED; in the CNTC-optimum, LCC and DT cases, only the classifier accuracies from the accuracy matrix are considered. The second baseline, the random neighbor, captures the situation where users have access to LED but not to our classifier selection. In this case, for a given query dataset, users randomly choose a neighbor from the other 86 datasets and select its top classifier in a manner similar to the CNTC approach. This baseline is stricter than the random classifier. For a given query dataset, the random neighbor baseline is represented by all the classifier accuracies recorded in the database (for the CNTC-direct approach) or all the elements of the accuracy matrix (for the CNTC-optimum, LCC and DT approaches), excluding the classifier accuracies on the query dataset.

Figure 2a summarizes, for the three proposed approaches, the percentages of datasets on which classifier selection attains accuracies higher than the baselines. From the figure, we can see that on 90.8% of the datasets CNTC-optimum scores higher than the random classifier baseline. For the stricter random neighbor baseline, CNTC-optimum scores higher than the baseline on 72.4% of the datasets. For CNTC-direct, the percentages of datasets where classifier selection scores above the random classifier and random neighbor baselines are 81.7% and 70.7% respectively. The finding that CNTC-direct performs worse than CNTC-optimum confirms the merit of fine-tuning parameters to fit the selected classifier to the query dataset.

To study how data characterization affects classifier selection, the CNTC approach was repeated using the combined and the individual types of data characteristics. Figure 2b shows the results, aggregated as the accuracy mean over the 87 datasets. We found no notable difference between the performances of the combined and the individual statistical, landmarking and PLC data characteristics. This suggests that classifier selection can be performed using just one type of data characteristic.

The new landmarkers selected by the LCC approach are weka.VFI, weka.J48, weka.IB1, weka.zeroR and weka.Ridor. Figure 2a shows the results of using these new landmarkers in the CNTC approach. Compared to the random classifier baseline, the LCC approach does not show a clear improvement. However, compared to the random neighbor baseline, the new landmarkers improve accuracies to 78.2% and 77.2% for CNTC-optimum and CNTC-direct respectively. Figure 2b compares the performances of the new landmarking and the other types of data characteristics; it shows increased mean scores of 76.8% and 69.8% for CNTC-optimum and CNTC-direct respectively. These results suggest the superiority of data-driven selection of landmarkers for classifier selection.

The initial dataset clustering in the DT approach generates three clusters of datasets that score high, medium and low classification accuracies. From LED, we found that weka.SMO, weka.MultilayerPerceptron and weka.NaiveBayesSimple are the best classifiers in the three clusters respectively. Using these classifiers as class labels, a generalizable (i.e. scoring 77% weighted average F-measure under 10-fold cross-validation) decision tree was built, as shown in Figure 3. The figure shows that the statistical (e.g.
CoefficientOfVariation) and landmarking (e.g. 1NN, bestNode) data characteristics are used in the classifier selection rules. The absence of PLC from the tree suggests its low relevance to classifier selection. From Figure 2a, we can see that the DT approach improves the rates of selecting suitable classifiers to 94.2% and 85% against the random classifier and random neighbor baselines respectively. This makes the DT approach the most effective classifier selection method in this study. It supports our argument that, by focusing on the relevant parts of the data characteristics, the DT approach gives very robust classifier selection.
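As an illustration of how such baseline comparisons can be computed from the accuracy matrix, the sketch below counts, per dataset, whether the recommended classifier's accuracy exceeds the mean of the corresponding baseline pool. Interpreting "scores higher than the baseline" as exceeding the pool mean is our assumption, and `acc` and `selected` are assumed inputs rather than data from the study.

```python
import numpy as np

def fraction_beating_baselines(acc, selected):
    """Fraction of datasets where the recommendation beats each baseline.

    acc      : (n_datasets, n_classifiers) accuracy matrix queried from LED
    selected : (n_datasets,) accuracy of the recommended classifier per dataset
    """
    n = acc.shape[0]
    beat_classifier = beat_neighbor = 0
    for i in range(n):
        # Random classifier: any classifier's accuracy on the query dataset.
        random_classifier = acc[i].mean()
        # Random neighbor: any accuracy-matrix entry outside the query's row.
        random_neighbor = np.delete(acc, i, axis=0).mean()
        beat_classifier += int(selected[i] > random_classifier)
        beat_neighbor += int(selected[i] > random_neighbor)
    return beat_classifier / n, beat_neighbor / n
```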
Fig. 2. a) Summary of how the proposed classifier selection approaches perform compared to the random classifier and random neighbor baselines; b) classifier selection performances broken down by the data characterization technique used.
Fig. 3. The decision tree for classifier selection, built using the dataset clusters as class labels and the data characteristics matrix as training examples.
4 Conclusions

We presented three approaches to mine patterns from the Learning Experiment Database to improve classifier selection based on dataset characteristics. Compared to existing works, our study explores a wider array of data characteristics and a greater variety of approaches to mine LED. By looking up from the database the top classifier of the dataset closest to the query dataset, we validated the advantage of classifier selection over random classifier selection. Through this approach, we also found that the statistical, landmarking and partial learning curve properties of datasets, individually and in combination, are effective guides for classifier selection. We showed that further improvement can be achieved by selecting landmarkers in a data-driven manner. We also showed how generalizable classifier selection can be achieved using a decision tree that maps data characteristics to suitable classifiers. The decision tree provides the most accurate classifier selection in our study; a possible reason is that, by emphasizing relevant data characteristics, the decision tree provides robust classifier selection. Our promising findings should pave the way for more advanced approaches to classifier selection that leverage LED or other learning databases. A notable direction is to extend our study beyond classification.
References
1. Giraud-Carrier, C.: Metalearning - a tutorial. In: 7th Int. Conf. on Machine Learning and Applications (2008)
2. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A data mining ontology for algorithm selection and meta-mining. In: European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 76--87 (2009)
3. Castiello, C., Castellano, G., Fanelli, A.: Meta-data: Characterization of input features for meta-learning. In: Modeling Decisions for Artificial Intelligence, pp. 457--468 (2005)
4. Pfahringer, B., Bensusan, H., Giraud-Carrier, C.: Meta-learning by landmarking various learning algorithms. In: Int. Conf. on Machine Learning, pp. 743--750 (2000)
5. Kietz, J.U., Serban, F., Bernstein, A., Fischer, S.: Data Mining Workflow Templates for Intelligent Discovery Assistance and Auto-Experimentation. In: European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases Workshop, pp. 1--12 (2010)
6. Vanschoren, J., Blockeel, H., Pfahringer, B., Holmes, G.: Experiment databases: a new way to share, organize and learn from experiments. Machine Learning, in press (2011)
7. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)
8. Vanschoren, J., Blockeel, H.: Investigating classifier learning behavior with experiment databases. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds.) Data Analysis, Machine Learning and Applications, pp. 421--428. Springer, Germany (2008)
9. Vanschoren, J., Van Assche, A., Vens, C., Blockeel, H.: Meta-learning from experiment databases: an illustration. In: Van Someren, M., Katrenko, S., Adriaans, P. (eds.) Annual Machine Learning Conference of Belgium and The Netherlands, pp. 120--127 (2007)
10. Giraud-Carrier, C., Ventura, D.: Effecting Transfer via Learning Curve Analysis. In: NIPS Workshop (2005)
11. Rice, J.R.: The Algorithm Selection Problem. Advances in Computers 15, 65--118 (1976)
12. Shafait, F., Reif, M., Kofler, C., Breuel, T.M.: Pattern Recognition Engineering. In: RapidMiner Community Meeting and Conference (2010)
13. Koepf, C., Taylor, C., Keller, J.: Meta-analysis: Data characterisation for classification and regression on a meta-level. In: European Conf. on Principles of Data Mining and Knowledge Discovery Workshop (2000)
14. Provost, F., Jensen, D., Oates, T.: Efficient progressive sampling. In: Int. Conf. on Knowledge Discovery and Data Mining, pp. 23--32 (1999)
15. Langfelder, P., Zhang, B., Horvath, S.: Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719--720 (2008)
16. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2006)