Towards Understanding Learning Behavior

Joaquin Vanschoren [email protected]
Hendrik Blockeel [email protected]
Department of Computer Science, Celestijnenlaan 200A, 3001 Heverlee, Belgium

Abstract

This paper presents ideas for learning to understand performance differences among learning algorithms. We propose a descriptive meta-learning approach, i.e. one aimed at thoroughly investigating and explaining the reasons behind the success or failure of a learning algorithm. We start from an analysis of current meta-learning issues and propose an integrated solution, based on synthetic datasets, “insightful” meta-features (about data as well as algorithms), and eventually incorporating preprocessing techniques.

1. Introduction

Data mining is the process of discovering useful knowledge in data. Key to this endeavor are data preprocessing (transforming the data into a more usable form) and algorithm selection (selecting the learning algorithms best suited for the problem at hand). For (time-)efficient learning, each learning algorithm must make some assumptions about the underlying structure of the data, and the utility of a learning algorithm largely depends on how well these assumptions match the concept hidden in the data at hand. These assumptions make up each algorithm’s machine learning bias, which can be subdivided into two parts. First, there is the representational bias: decision tree algorithms, neural networks, genetic algorithms, Bayesian networks, ... all start from a predefined data model, a structure assumed to be able to adequately represent regularities in the data. They then learn by increasingly refining instances of this model (hypotheses) to fit the data as closely as possible, i.e. by searching for the “best” hypothesis according to a computational (or procedural) bias: a set of heuristics assumed to impose an appropriate ordering on this search. If a learning algorithm cannot adequately represent the hidden concept, or favors the wrong hypotheses, learning performance will suffer.

Prior knowledge about (or experience with) the structure of the data, the problem domain or the way the data was conceived can be used to select (a class of) fitting algorithms. For example, if we know the attributes (or features) are conditionally independent, then naive Bayes algorithms (which assume this condition) may have a suitable bias. Van Someren (2001) provides an excellent overview of possible methods to use such knowledge in practice. If no prior knowledge is available about the dataset, there remains the possibility to compute some properties of the given data (e.g. the distribution of classes, the correlation of features, ...). If we can (learn to) relate these properties to the behavior of a learning method, this provides knowledge about how the learning method will behave on new learning problems. By measuring the same data properties on each future dataset, we could predict which induction methods are (more) appropriate. This approach is pursued in meta-learning.

Selecting a method for a particular problem is, however, only part of the successful application of inductive algorithms. At a more detailed level, many algorithms have parameters that need to be tuned to the dataset. Also, since each algorithm implementation makes certain assumptions about the representation of the data (e.g. all attributes should be relevant, or nominal), a dual to this problem is to transform the problem data such that a particular method can be applied efficiently.

Current meta-learning approaches do provide a framework for addressing these problems but, as we shall discuss in Section 2, their effectiveness remains limited. In Section 3, we describe the basis of an integrated solution, aiming for a deeper understanding of learning behavior. In Section 4, we then gradually extend this idea to address the current limitations one by one. Section 5 concludes.

2. Issues in meta-learning

In the current state of the art, not much is known about the conditions under which one inductive method is better than another, making it hard to make a rational choice. Practical applications therefore often rely on human expertise and costly trial-and-error, and would benefit greatly from automatic guidance (Giraud-Carrier & Keller, 2002).

Automatic guidance in model selection, model combination and data transformation requires meta-knowledge about past experience concerning the involved learning algorithms and data transformation methods. If we want to predict the suitability of a certain method, this constitutes a learning problem on the meta-level (see Figure 1). As such, we must propose relevant meta-features (e.g. class entropy), measure a large number of instances (each single instance being the result of a base-level machine learning experiment), and choose a learning model to learn from these experiments1. Whereas in base-level learning the problem features are chosen from insight into the problem domain, in meta-learning they are chosen based on experience with and insight into the workings of different learning methods. Indeed, meta-features should be relevant and able to differentiate between the behavior of different learning algorithms.

Figure 1. Base-learning versus meta-learning.

1 Note that the term “meta-learning” is also used for other approaches using some kind of meta-data. A perspective overview can be found in Vilalta and Drissi (2002b).

2.1. Characterizing datasets

A great amount of work in meta-learning has focused on describing meta-features to characterize datasets, and on using this characterization to compare different learning algorithms. A concise historical overview of such projects can be found in (Giraud-Carrier & Keller, 2002). These also include some large-scale studies, most notably the STATLOG (Michie et al., 1994) and METAL (Metal, 2001) projects, the first of which provided a set of rules describing when one algorithm was significantly better than another, while the latter features a data mining advisor generating a ranking of the best algorithms given a certain dataset.

Here, we only provide a few illustrative examples of these dataset characterizations; more complete overviews can be found in (Michie et al., 1994; Peng et al., 2002b). These include general features, like the dataset size and the number of features. Statistical ones are (averaged values of) the linear correlation between attributes and the skewness and kurtosis of attribute and class values. Very interesting are information-theoretic features like the normalized class or attribute entropy (the information value of a class or attribute), the mutual information between class and attributes (the useful information each attribute holds about the class label), and the noise-signal ratio (the amount of irrelevant information). Other measures include landmarkers (Pfahringer et al., 2000) and model-based characterizations (Peng et al., 2002a). A small computational sketch of a few of these measures is given below, at the end of Section 2.2.1.

2.2. Meta-learning limitations

Although these measures provide a reasonable characterization of the dataset, the effectiveness of current meta-learning approaches remains very limited. This can be attributed to a number of reasons, which we shall now discuss. In the next sections, we propose ways to address these issues.

2.2.1. The curse of dimensionality

As illustrated by the many measures for dataset characterization, the meta-learning space is high-dimensional. On the other hand, each instance in this space holds the results of a single experiment (on a number of algorithms) and requires a completely new dataset. Since the collection of publicly available datasets is limited, this amounts to very few samples in a very large space, providing very sparse evidence for any recommendation we deduce from it (Blockeel, 2006). Indeed, any unseen case we wish to predict will lie very far from the examples in the instance space, and may not be classified correctly. It may help to reduce dimensionality by selecting the most relevant meta-features, but still the evidence would be very weak. In order to provide truly generalisable results, we simply need more datasets to sufficiently cover the meta-learning space.
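To make the measures of Section 2.1 concrete, the following minimal sketch (our own illustration, with an invented toy dataset and helper names) shows how a few of the simpler meta-features could be computed in Python:

```python
import numpy as np

def entropy(values):
    """Shannon entropy (in bits) of a 1-D array of nominal values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def normalized_class_entropy(y):
    """Class entropy divided by its maximum, log2(#classes)."""
    k = len(np.unique(y))
    return entropy(y) / np.log2(k) if k > 1 else 0.0

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two nominal arrays."""
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])
    return entropy(x) + entropy(y) - entropy(joint)

# Toy dataset: two nominal attributes and a class label (illustrative only).
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

meta_features = {
    "n_examples": X.shape[0],
    "n_attributes": X.shape[1],
    "normalized_class_entropy": normalized_class_entropy(y),
    "mean_mutual_information": np.mean(
        [mutual_information(X[:, j], y) for j in range(X.shape[1])]),
}
print(meta_features)
```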

2.2.2. Generalising over learning methods

Although datasets are usually characterized, the learning methods are not. If one algorithm is found to work badly for certain properties of the data, this means nothing for a somewhat modified version of the algorithm (Van Someren, 2001). Empirical studies (e.g. Hoste & Daelemans, 2005) show that optimizing algorithm parameters has a significant impact on the relative performance of learning algorithms. Where previously one method proved to work better than another on a certain dataset using default parameter values, the situation can change radically if the parameters are optimized for the dataset. This can intuitively be explained by the fact that parameters alter an algorithm’s (computational) bias. For example, a decision tree algorithm may have a parameter stating the minimum leaf size. If we increase this parameter, we restrain the algorithm from growing large decision trees, which may fit the population better (if the algorithm was previously overfitting), or worse. The same goes for other modifications of an algorithm: without characterizing the properties of the algorithm, we cannot generalise over them, and thus we cannot predict how the modified version will behave.
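As a simplified illustration of how a single parameter shifts an algorithm's bias, the sketch below varies the minimum leaf size of scikit-learn's DecisionTreeClassifier on an arbitrary noisy synthetic problem; the data and parameter values are our own choices, not taken from any study cited here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic problem: larger leaves (a stronger bias) should help here,
# while very small leaves let the tree fit noise (weaker bias, more variance).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

for min_leaf in (1, 5, 25, 100):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    acc = cross_val_score(tree, X, y, cv=10).mean()
    n_leaves = tree.fit(X, y).get_n_leaves()
    print(f"min_samples_leaf={min_leaf:3d}  leaves={n_leaves:3d}  cv accuracy={acc:.3f}")
```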

2.2.3. Explaining learning behavior

Although relating characteristics of available datasets to experimental results tells us when a learning algorithm fails or works, it offers no explanation as to why it failed, or what could be done to make it work better. It also does not allow a thorough investigation of the impact of a certain dataset characteristic, or of the interaction between different meta-features. Without understanding why learning algorithms work well on some datasets but not on others, little information can be provided about how to design new learning algorithms, or how to improve existing ones.

2.2.4. Data transformation

Preprocessing techniques can be used to adapt a dataset so a learning algorithm can handle it better. Hoste and Daelemans (2005) show that the relative performance of algorithms changes drastically when preprocessing is optimized. Moreover, investigations of real-world learning problems (Van der Putten & Van Someren, 2004) show that the selection of the appropriate preprocessing techniques may well be more important than algorithm selection. To offer practical advice on which learning method to use on a given dataset, it is therefore very important to advise which preprocessing steps are useful, and how these steps may affect the performance of learning algorithms.

3. A basis for investigation

We now describe a meta-learning system aimed at efficiently performing meta-learning experiments. We wish to construct a meta-knowledge base in such a way that it can (additionally) be used to thoroughly investigate the behavior of learning algorithms. As such, we could use the (expensive) experiment results to investigate and gain insight into the limitations of each learning algorithm in terms of its properties and dataset characteristics. The goal of this approach is to yield a descriptive (rather than comparative) form of meta-learning2. Being able to investigate specific questions and diagnose bad performance can provide useful information to design new learning algorithms, or to make existing algorithms more robust.

3.1. Experiment databases

When building a database of machine learning experiments, we want to ensure that previous experiments stay usable in the future, as we add new meta-features or search for other kinds of patterns contained in experiment results. A very useful framework for this is the experiment database (Blockeel, 2006), which proposes to improve the interpretability of machine learning experiments by building a database of experiments over a wide range of datasets (as is commonplace in meta-learning), such that the results are not only generalisable, but also reproducible and reusable for further research. For example, we could investigate (by query) or discover (by data mining) a wide range of questions such as “What is the effect of parameter X on runtime?” without doing specific (additional) experiments. As it fits the meta-learning paradigm nicely, we wish to use this approach to increase the utility of meta-learning experiments. Indeed, it would allow a thorough and systematic investigation of the interactions between algorithm parameters, dataset characteristics, and performance metrics. We could study which parameter settings are most important to tune, or which dataset characteristics most affect which algorithm’s performance. This opens up possibilities to thoroughly understand learning behavior through meta-learning, and to provide insights that may prove useful for designing future learning methods. However, we will need to further adapt this approach to the meta-learning setting (for example to compare multiple algorithms), as Figure 2 (tentatively) illustrates. We will further explain all (italicized) aspects later on.
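To illustrate the kind of query-driven investigation such an experiment database would support, here is a small sketch using SQLite with a hypothetical, much-simplified schema; the table layout and the records are invented for illustration and are not the schema of Blockeel (2006):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE experiments (
    algorithm TEXT, param_x REAL, dataset TEXT,
    n_examples INTEGER, class_entropy REAL,
    accuracy REAL, runtime REAL)""")

# A few fictitious experiment records, just to make the query run.
con.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("C4.5", 2, "synth_001", 1000, 0.92, 0.81, 1.4),
     ("C4.5", 25, "synth_001", 1000, 0.92, 0.78, 0.6),
     ("C4.5", 2, "synth_002", 5000, 0.55, 0.88, 7.2),
     ("C4.5", 25, "synth_002", 5000, 0.55, 0.86, 2.9)])

# "What is the effect of parameter X on runtime?" -- answered by querying
# stored results instead of running new experiments.
for row in con.execute("""SELECT param_x, AVG(runtime), AVG(accuracy)
                          FROM experiments WHERE algorithm = 'C4.5'
                          GROUP BY param_x ORDER BY param_x"""):
    print(row)
```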

2 This notion is similar to the one used by Kalousis (2004), who uses similarities between datasets and learners to sketch maps of the dataset space.

Figure 2. Sketch of the experiment database.

Building experiment databases does impose a shift in the way we perform meta-learning experiments. Firstly, to ensure reproducibility (being able to repeat the experiment exactly), we must log all parameters that influenced the experiment (algorithm version, parameter settings, datasets). This also includes all random seeds used for randomized aspects of the implementation of learning algorithms. The benefit is that we can acquire new measurements, like new performance measures, that were ignored before. Secondly, to ensure reusability, we must store as many meta-features as possible, even if they are not needed immediately. In meta-learning, we often focus on what seem to be the most relevant meta-features for comparison alone. Here, we log all meta-features that are possibly interesting at “experiment time”, and select the relevant ones at “meta-learning time”, to find patterns or to advise learning methods.
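A minimal sketch of what logging a single experiment along these lines could look like; the record fields and values are illustrative assumptions, not a prescribed format:

```python
import json
import time

def log_experiment(algorithm, version, parameters, random_seed,
                   dataset_name, meta_features, results,
                   path="experiments.jsonl"):
    """Append one fully specified experiment to a log file.

    Everything needed to reproduce the run (version, parameters, seed)
    and everything that might be reused later (all meta-features, all
    performance measures) is stored, even if not needed immediately.
    """
    record = {
        "timestamp": time.time(),
        "algorithm": algorithm,
        "version": version,
        "parameters": parameters,
        "random_seed": random_seed,
        "dataset": dataset_name,
        "meta_features": meta_features,
        "results": results,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_experiment("C4.5", "weka-3.4", {"min_leaf": 2, "pruning": True}, 42,
               "synth_001", {"n_examples": 1000, "class_entropy": 0.92},
               {"accuracy": 0.81, "runtime": 1.4,
                "confusion": [[420, 80], [110, 390]]})
```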

4. An integrated solution

We will now extend the idea of descriptive meta-learning to address the above-mentioned limitations as follows:

1. Design meta-learning experiments (using synthetic datasets) to thoroughly explore the meta-feature space (to minimize generalization error).
2. Characterize algorithms by their properties, so we can link their performance to specific parameter settings, specific techniques used in the implementation, or more general properties of the used model.
3. Use performance measures that help to explain why an algorithm failed, i.e. which part of its bias did not match the data.
4. Measure the effect of preprocessing techniques and use this knowledge (together with the above) to advise useful preprocessing steps.

4.1. Synthetic datasets

Synthetic datasets are often used to study or affirm the behavior of an algorithm when insufficient natural data is available. As such, they seem a natural choice to augment our meta-learning experiments3, and to fill the experiment database with examples. However, we should proceed cautiously to maintain the validity of our experiments. Firstly, when using such datasets, there is a real danger of introducing bias into the datasets, which would be an unfair basis for comparing learning algorithms. Secondly, we must be able to adequately simulate the conditions of natural datasets to be able to investigate and learn the behavior of algorithms in such conditions. And thirdly, we must wisely select our experiments to cover the meta-learning space. We will now discuss how to do this.

3 Aha (1992) also proposed generating new (synthetic) datasets, to generalise existing benchmark results and to measure the impact of different characteristics. The difference with our approach is that we don’t restrict our investigation to a small “area” surrounding existing experiments.

4.1.1. Concept characterization

To build a new dataset, it is necessary to introduce some structure, some concept, into the data, after which examples are generated by (randomly) choosing new points in the instance space and labeling them according to this concept. If a fixed type of concept is used (as is often the case), the resulting datasets unfairly favor inductive methods with a closely related learning bias. To avoid biased meta-learning results, it is therefore important to characterize the concept in the instance space. To link the performance of different algorithms to different kinds of hidden concepts (an interesting investigation in itself), we should add the concept characterization to the experiment database (as part of the dataset characterization). Synthetic dataset characteristics should therefore be stored in a separate table, augmented with concept characterizations and the parameters used for generation, so the datasets can be regenerated. From the viewpoint of the dataset generator, straightforward measures include the concept model used (a decision tree, a Bayesian network, ...) and all possible characteristics of this model (e.g. the tree depth). Also, if they are introduced, measures of the kinds of relations imposed between attributes, imposed attribute value distributions, and artificial noise, missing values and irrelevant attributes could provide interesting characteristics. Very interesting are general concept characterization measures, like concept variation, the (non-)uniformity of the class-label distribution throughout the feature space (measured through the distance between two examples of a different class), and example cohesiveness, the density of the example distribution in the training set. Vilalta (1999) and Vilalta and Drissi (2002) provide exact definitions and show that these measures have great impact on algorithm performance.

4.1.2. Natural data

Another danger of using synthetic datasets is that they may not be able to approximate natural ones. Some fundamental characteristics of natural datasets, like noise, irrelevant attributes and missing values, can (and should) also be introduced into synthetic datasets. Still, equally important is the ability to introduce possibly complex relations between attributes, correlations, and different distributions in attribute or class values. It may also be interesting to focus experiments on the range of dataset characteristics actually appearing in real-world datasets. Only if we can impose such deeper characteristics of datasets, at a fine-grained level, can we approximate natural datasets. Building such a fine-grained dataset generator is very challenging, and the focus of ongoing work. In our design, each attribute can be assigned a separate linear combination of distributions over its values, and dependencies between attributes can be expressed by transition (the value depends on the value of other attributes) or correlation. The exact distributions and dependencies can be expressed in an XML template, which is interpreted by the generator. On a higher level, these templates may be generated based on a model, e.g. a Bayesian network expressing the dependencies. A thorough discussion is however beyond the scope of this paper. In any case, we should always validate the results of learning from synthetic datasets against those obtained on natural ones, and revise our experiments or dataset generator if the results contradict each other. Some authors have also proposed methods to create new datasets by changing existing natural datasets (e.g. by removing examples and attributes). However, the number of datasets we can produce this way is still very limited, and the more a dataset is altered, the less “natural” it becomes. A more fundamental reason for using synthetic datasets is that it becomes possible to control dataset properties.
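The generator itself is ongoing work; the following toy sketch only illustrates the idea of per-attribute distribution mixtures, an induced correlation, a known concept with class noise and an irrelevant attribute, and a stored concept characterization (all particular choices here are invented):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1000

# Each attribute gets its own (mixture of) value distributions.
a1 = np.where(rng.random(n) < 0.7, rng.normal(0, 1, n), rng.normal(4, 0.5, n))
a2 = rng.exponential(scale=2.0, size=n)
a3 = 0.8 * a1 + rng.normal(0, 0.3, n)        # correlated with a1
a4 = rng.normal(0, 1, n)                     # irrelevant attribute

# A known (axis-parallel) concept, plus 5% class noise.
y = ((a1 > 1.0) & (a2 < 2.5)).astype(int)
noise = rng.random(n) < 0.05
y[noise] = 1 - y[noise]

X = np.column_stack([a1, a2, a3, a4])

# Concept characterization, stored alongside the data for later analysis.
concept_description = {
    "model": "axis-parallel rule on a1, a2",
    "relevant_attributes": [0, 1],
    "correlated_pairs": [(0, 2)],
    "irrelevant_attributes": [3],
    "class_noise": 0.05,
    "generator_seed": 42,
}
```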

4.1.3. Experiment design

Controlling dataset properties allows us to efficiently design our experiments to sufficiently cover the instance space. We should use techniques from optimal experiment design (Cohn, 1994; Emery & Nenarokomov, 1998) to optimize the search over the instance space, focusing on those areas that have significant impact on algorithm performance, where much can be learned. Factorial design can be used to investigate the influence of all experimental values and interaction effects. This provides new opportunities to understand the behavior of learning algorithms. We could investigate certain hypotheses about learning algorithms, like the ones stated in Section 3.1, or focus our experiments on a subset of methods or meta-features. If we can (automatically) design experiments, we can let the computer run these experiments in the background, automatically filling up the experiment database. Also, with a database that can be accessed by different computers, we can divide the work among them (provided we can somehow compare the runtimes). Note that, in order to satisfy the reproducibility constraint, we should also store any parameters (and random seeds) used to generate the synthetic datasets in our experiment database.

4.2. Algorithm characterization

Algorithm characterization is twofold. First, we must store the parameter settings of the algorithm used. These are stored in separate tables for each algorithm in our experiment database, because each algorithm has different parameters. However, to generalise over other (more general) properties of algorithms, we also need to define (and store) further properties of the algorithm’s internal mechanism. This information is also very useful to explain why an algorithm did or did not work, as it provides causal explanations for the behavior of algorithms. Such characterizations however require a deep understanding of the involved algorithms. Some can be very general, such as the representation model used (e.g. a decision tree). Others are specific to larger groups of algorithms, some of which are defined by Van Someren (2001). These include linear separability of examples, conditional independence of attributes, the ability to model fine-grained concepts, the ability to handle local relevance of attributes (when an attribute is only relevant for some values of other attributes), and the ability to construct (weighted) attribute summations, averaging out noise from correlated attributes. For example, while decision trees and rule-based systems are good at fine-grained concepts and local relevance, naive Bayes is definitely not. Vilalta (1999) also defines some more specific properties, like the fragmentation problem (recursively partitioning the data to model specific instance regions) in C4.5 trees versus the global evaluation (using all data) in C4.5 rules, and provides experimental proof that these properties help explain inductive behavior.
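A small sketch of how such general algorithm properties could be stored next to per-run parameter settings; the particular property names and values below are our own illustrative guesses, not definitive characterizations:

```python
# General, algorithm-level properties (shared by all runs of the algorithm),
# kept separate from the per-run parameter settings.
algorithm_properties = {
    "C4.5": {
        "representation": "decision tree",
        "handles_fine_grained_concepts": True,
        "handles_local_relevance": True,
        "builds_attribute_summations": False,
        "assumes_conditional_independence": False,
    },
    "naive Bayes": {
        "representation": "conditional probability tables",
        "handles_fine_grained_concepts": False,
        "handles_local_relevance": False,
        "builds_attribute_summations": True,
        "assumes_conditional_independence": True,
    },
}

# Per-run parameter settings go in a separate, algorithm-specific record.
run_parameters = {"algorithm": "C4.5", "min_leaf": 2, "confidence_factor": 0.25}
```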

4.3. Understanding inductive performance

We have now discussed how to characterize datasets, concepts and algorithms to better understand learning behavior. What about performance evaluations? Since different users are interested in different measures, we want to include all these measures in our experiment database. Most performance evaluations can be calculated from the confusion matrix, while others (e.g. runtime) will be stored individually. Still, to thoroughly explain the behavior of an algorithm, we need more than global evaluation metrics. We need to investigate the different reasons why an algorithm fails, and choose performance measures that reflect those reasons. Therefore, we propose to add a bias-variance decomposition to our set of performance measures. Although this measure is computationally quite expensive, it does provide much insight into algorithm behavior.
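For instance, several standard measures can be derived from a stored confusion matrix without rerunning the experiment; a minimal sketch for the binary case (with made-up counts):

```python
import numpy as np

# Binary confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[420, 80],
               [110, 390]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```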

4.3.1. Bias-variance decomposition

Given a new dataset, we can judge an algorithm’s representational bias to be appropriate or inappropriate for modeling the underlying concept (e.g. a decision tree is inappropriate to model simple non-axis-parallel decision boundaries), and its computational bias to be good, too strong, or too weak. If the bias is too strong for a given dataset, the algorithm will not sufficiently approximate the underlying (more complex) concept (i.e. underfitting). If it is too weak, the bias doesn’t offer much guidance, increasing the dependence on the data, which often leads to modeling noise (i.e. overfitting). Although some kinds of algorithms are generally more strongly biased than others (e.g. naive Bayes, k-nearest neighbors and linear discriminant algorithms have a much stronger bias than decision trees or rule-based algorithms), parameters can still offer much control over the bias, as discussed before.

Statistical analysis of algorithm performance allows us to diagnose how the algorithm’s bias failed to match the structure in the data, by splitting up the expected misclassification error into a sum of three components (Wolpert & Kohavi, 1996). The (squared) bias error is the systematic error the learning algorithm is expected to make (because its bias doesn’t match the data). The variance error is a measure of how strongly the generated hypotheses vary on different samples of the same dataset. Finally, the (squared) intrinsic error is associated with the inherent uncertainty in the data. Exact definitions and measuring techniques can be found in the references above.

Dietterich (1995) investigates the relationship between machine learning bias and bias and variance error; see Table 1. An inappropriate representational learning bias results in high statistical bias, because the algorithm simply can’t model the concept. On the other hand, when the representational bias is appropriate, there exists a (statistical) bias-variance trade-off: allowing more complex models (using a weaker bias) results in low bias error, but high variance error (overfitting), while using (overly) simple models leads to high bias error, but low variance error (underfitting).

Table 1. Relationship between machine learning bias and statistical bias and variance.

Rep. Bias   Comp. Bias   Stat. Bias   Stat. Var.
appr.       too strong   high         low
appr.       ok           low          low
appr.       too weak     low          high
inappr.     too strong   high         low
inappr.     ok           high         avg
inappr.     too weak     high         high

Bias-variance analysis is a very powerful means to diagnose problems with machine learning bias (Dietterich, 1995; Van der Putten & Van Someren, 2004). In our meta-learning system, it allows us to search for more specific patterns and to improve our advice. If an algorithm’s failure on a certain type of dataset is mainly due to bias error, then its representation is probably inappropriate. This is useful information to advise another learning method or, as we will soon see, to advise a feature construction step to alter the representation of the dataset. If variance error is the main problem, we may want to try other parameter settings, a different version of the algorithm (more robust against variance), or to advise an ensemble (Dietterich, 1995) or specific preprocessing to lower the variance (e.g. combining correlated attributes to cancel out noise).
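As a rough illustration of how such a decomposition could be estimated in practice, the sketch below uses repeated subsampling and a majority-vote “main” prediction; this is a simplification in the spirit of, but not identical to, the Wolpert and Kohavi (1996) definitions, and it does not estimate the intrinsic error separately:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bias_variance_estimate(make_model, X_train, y_train, X_test, y_test,
                           n_rounds=50, seed=0):
    """Rough 0-1-loss bias/variance estimate via repeated subsampling.

    Bias: error of the 'main' (majority-vote) prediction at each test point.
    Variance: how often individual models disagree with that main prediction.
    """
    rng = np.random.default_rng(seed)
    preds = np.empty((n_rounds, len(y_test)), dtype=int)
    for r in range(n_rounds):
        idx = rng.choice(len(y_train), size=len(y_train) // 2, replace=False)
        preds[r] = make_model().fit(X_train[idx], y_train[idx]).predict(X_test)
    # Majority vote per test point = main prediction.
    main = np.array([np.bincount(col).argmax() for col in preds.T])
    bias = np.mean(main != y_test)
    variance = np.mean(preds != main)
    return bias, variance

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for leaf in (1, 50):
    b, v = bias_variance_estimate(
        lambda: DecisionTreeClassifier(min_samples_leaf=leaf), X_tr, y_tr, X_te, y_te)
    print(f"min_samples_leaf={leaf:2d}  bias={b:.3f}  variance={v:.3f}")
```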

4.4. Task reformulation

Ideally, our advice to the user should be a ranked list of machine learning “plans”, stating interesting learning algorithms combined with the preprocessing steps that may increase their performance4. To do this, we need to measure the effect each preprocessing technique has on the dataset characterizations. Experiments on these effects can be executed independently from the experiments on algorithm performance, and stored in a separate experiment database. We now have two experiment databases, giving complementary information: the database on learning algorithms tells us how algorithms perform on a wide range of different datasets, while the one on preprocessing techniques tells us how to alter any given dataset to make it more suitable for different learning methods. Just like we do with learning algorithms, we can link dataset characteristics to the effect of preprocessing techniques. Starting from a new, unknown dataset, we could characterize it, predict how the characteristics would change after applying different preprocessing techniques, and predict how learning algorithms would perform on these (hypothetical) datasets. Alternatively, we could start from a number of learning methods that seem promising, and predict which preprocessing techniques may be useful to optimize their performance, returning a ranking of machine learning plans according to their promise.

4 A somewhat similar ranking has recently been suggested by Bernstein, Provost and Hill (2005), although it uses a planning-based approach based on an ontology of constraints (e.g. “naive Bayes needs continuous data”) and simplified heuristics (e.g. the relative speeds of different learning algorithms), without any relation to the target dataset.

There are a number of probable links between dataset characteristics and the utility of certain preprocessing techniques. For example, if there are many correlations between attributes, much noise, or little mutual information between certain attributes and the class, feature selection is likely to help. However, the utility of preprocessing also strongly depends on the learning method used. Some learners need a preprocessing step to work at all (e.g. a discretization of numerical attribute values), and other learners do their own (limited) preprocessing, so a separate preprocessing step is less needed. Most likely, dataset characteristics will play different roles in the two databases. Indeed, some dataset characteristics provide information about the underlying concept (like attribute correlation, class entropy, ...), while others are related to the specific configuration of the dataset (like the number of samples, the standard deviation of attribute values, the noise-signal ratio, ...). The latter can be altered (by preprocessing) without changing the underlying concept. Castiello, Castellano and Fanelli (2005) show which of the commonly used dataset characteristics belong to which group, and propose slightly altered meta-features to make the difference more explicit. Although they propose to remove the second group altogether, we think they are very useful for measuring the effect of preprocessing (since preprocessing is meant to alter the dataset configuration), although they should be ignored when meta-learning over algorithm performance.

4.4.1. A link with bias-variance decomposition

The bias-variance decomposition also proves very useful to predict when a preprocessing step may help algorithm performance. As analysis of real-world data mining problems has revealed (Van der Putten & Van Someren, 2004), the selection of appropriate preprocessing techniques may be more important than algorithm selection, since variance is a very important source of error, causing simple but robust algorithms (like naive Bayes) to perform much better than sophisticated learners. Bias and variance error can however be decreased if the right preprocessing techniques are applied. Feature construction and transformation can be used to avoid bias error caused by inappropriate representation models, for example by removing correlations between attributes. This may help naive Bayes learners, which cannot model interactions between different features, and decision tree algorithms, which cannot model non-axis-parallel relationships. Replacing the old features by a new feature that removes these interactions decreases bias error. Feature selection is generally aimed at variance reduction: fewer parameters need to be estimated, while little relevant information is lost. Some algorithms have very strong biases making them more robust against variance (e.g. naive Bayes), others have very flexible representations, so bias error is not likely to be a problem (but variance is). It would be very interesting to use this link to investigate which algorithms will commonly benefit from which preprocessing techniques.
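To illustrate the kind of ranked “machine learning plan” envisaged at the start of this section, here is a toy sketch in which the performance predictor is a hand-written stand-in; in the proposed system it would be learned from the two experiment databases:

```python
from itertools import product

preprocessors = ["none", "feature selection", "discretization", "feature construction"]
algorithms = ["naive Bayes", "C4.5", "k-NN"]

def predicted_accuracy(preprocessing, algorithm, meta_features):
    """Stand-in for a model learned from the experiment databases: it would
    map (transformed) dataset characteristics and algorithm properties to an
    expected performance. The numbers below are arbitrary."""
    base = {"naive Bayes": 0.78, "C4.5": 0.80, "k-NN": 0.76}[algorithm]
    bonus = 0.04 if (preprocessing == "feature selection"
                     and meta_features["attribute_correlation"] > 0.6) else 0.0
    return base + bonus

meta_features = {"attribute_correlation": 0.7, "class_entropy": 0.9}
plans = sorted(product(preprocessors, algorithms),
               key=lambda p: predicted_accuracy(*p, meta_features), reverse=True)
for prep, algo in plans[:3]:
    print(f"{prep:20s} -> {algo}")
```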

5. Conclusion

We have proposed a form of descriptive meta-learning, aimed at designing meta-learning experiments not only to compare learning algorithms, but also to thoroughly investigate and explain their behavior in terms of their properties. As a platform for these experiments, we propose to use an adaptation of experiment databases. By querying or mining this database, we can (re)use these experiments for various investigations into the impact of different meta-features. To fill this database with experiment results, we are constructing a fine-grained dataset generator, allowing better coverage of the high-dimensional meta-learning space.

From an analysis of current meta-learning issues, we underline the importance of “insightful” meta-features (about data as well as algorithms) that allow us to understand, i.e. to provide causal explanations for, the behavior of different learning algorithms. These include concept characterizations, general algorithm properties, and a bias-variance decomposition for algorithm evaluation. Finally, we suggest that, by applying the same approach to investigate the effect of different preprocessing techniques, we can predict when they can remedy shortcomings of learning methods. We hope to advance toward a meta-learning approach that can explain not only when, but also why an algorithm works or fails, so we can advise on corrective measures and help design new learning algorithms.

Acknowledgements

We would like to thank Saso Dzeroski, Peter van der Putten and Carlos Soares for their helpful comments and suggestions.

References

Aha, D. (1992). Generalizing from case studies: A case study. Proc. of the Ninth Int'l Conf. on Mach. Learning (pp. 1–10).

Bernstein, A., Provost, F., & Hill, S. (2005). Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering, 17(4), 503–518.

Blockeel, H. (2006). Experiment databases: A novel methodology for experimental research. Lecture Notes in Computer Science, 3933, 72–85.

Castiello, C., Castellano, G., & Fanelli, A. M. (2005). Meta-data: Characterization of input features for meta-learning. Lecture Notes in Artificial Intelligence, 3558, 457–468.

Cohn, D. (1994). Neural network exploration using optimal experiment design. Advances in Neural Information Processing Systems (pp. 679–686).

Dietterich, T. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms (Technical Report). Department of Computer Science, Oregon State University.

Emery, A. F., & Nenarokomov, A. (1998). Optimal experiment design. Meas. Sci. Technol., 9, 864–876.

Giraud-Carrier, C., & Keller, J. (2002). Meta-learning. In J. Meij (Ed.), Dealing with the data flood. STT/Beweton, The Hague.

Hoste, V., & Daelemans, W. (2005). Comparing learning approaches to coreference resolution: There is more to it than 'bias'. Proc. of the Workshop on Meta-Learning (ICML-2005).

Kalousis, A., Gama, J., & Hilario, M. (2004). On data and algorithms: Understanding inductive performance. Machine Learning, 54, 275–312.

Metal (2001). The METAL meta-learning project. http://www.metal-kdd.org

Michie, D., Spiegelhalter, D., & Taylor, C. (1994). Methods for comparison. In Machine learning, neural and statistical classification. Ellis Horwood.

Peng, Y., Flach, P., Soares, C., & Brazdil, P. (2002a). Decision tree-based data characterization for meta-learning. ECML/PKDD 2002 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (pp. 111–122).

Peng, Y., Flach, P., Soares, C., & Brazdil, P. (2002b). Improved dataset characterisation for meta-learning. Lecture Notes in Computer Science, 2534, 141–152.

Pfahringer, B., Bensusan, H., & Giraud-Carrier, C. (2000). Meta-learning by landmarking various learning algorithms. Proc. of the Int'l Conf. on Mach. Learning (ICML-2000).

Van der Putten, P., & Van Someren, M. (2004). A bias-variance analysis of a real world learning problem: The CoIL challenge 2000. Machine Learning, 57, 177–195.

Van Someren, M. (2001). Model class selection and construction: Beyond the procrustean approach to machine learning applications. Lecture Notes in Computer Science, 2049, 196.

Vilalta, R. (1999). Understanding accuracy performance through concept characterization and algorithm analysis. Workshop on Recent Advances in Meta-Learning and Future Work (ICML-1999).

Vilalta, R., & Drissi, Y. (2002). A characterization of difficult problems in characterization. Proc. of the Int'l Conf. on Mach. Learning and Appl.

Vilalta, R., & Drissi, Y. (2002b). A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2), 77–95.

Wolpert, D., & Kohavi, R. (1996). Bias plus variance decomposition for zero-one loss functions. Proc. of the Int'l Conf. on Mach. Learning (ICML-1996).