Data transformation and model selection by experimentation and meta-learning

Pavel B. Brazdil
LIACC, FEP - University of Porto
Rua Campo Alegre, 823
4150 Porto, Portugal
Email: [email protected]

Research in the area of ML/data mining has led to a proliferation of many different algorithms. In the area of classification, for instance, Michie et al. (1994) describe about two dozen such algorithms. Previous work has shown that there is no single best algorithm suited for all tasks. It is thus necessary to have a way of selecting the most promising model type. This process is often referred to as model selection. An interesting question arises as to what kind of method, or methodology, we should adopt to do that. Previous approaches can be divided basically into two groups. The first one includes methods based on experimentation, and the second one, methods which employ meta-knowledge. Our aim here is to review both of these approaches in some detail and examine how they could be extended to also encompass the data transformation phase which often precedes learning.

1 Model Selection by Experimentation or Using Meta-Knowledge?

1.1 Model Selection by Experimentation

Model selection by experimentation works, as the name suggests, by evaluating the possible alternatives experimentally on the given problem. In the context of classification one would normally consider a set of possible classifiers and try to obtain reliable error estimates, which is usually done using cross-validation (CV) (Schaffer, 1993). This approach has a number of advantages. First, it is quite general and applicable in many different situations. The method is, as Schaffer (1993) has demonstrated, quite reliable. Given a certain confidence level, the approach does indeed identify the best possible candidate and errs as expected.

The disadvantage of this approach is that it is time consuming, since it is necessary to evaluate all algorithms, some of which can be quite slow. Various proposals have been put forward on how to speed up this process. One possibility is to pre-select some algorithms using certain criteria and then limit the experimentation to this subset. Some people have suggested that we should preferably use algorithms which behave rather differently from one another. One criterion for deciding this is to examine whether the algorithms lead to uncorrelated errors (Ali and Pazzani, 1996). Another possibility is to try to reduce the number of cycles of cross-validation without affecting the reliability of the result. Moore and Lee (1994) have proposed a technique referred to as racing, which makes it possible to terminate the evaluation of those algorithms which appear to be far behind the others. Yet another option is to exploit meta-knowledge, which will be briefly reviewed in the next section.
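To make the basic scheme concrete, the sketch below selects among a handful of classifiers by comparing their cross-validated error estimates. It is only an illustration of the procedure described above: the use of scikit-learn, the particular candidate algorithms and the built-in iris data are my choices, not the paper's.

```python
# A minimal sketch of model selection by experimentation: estimate each
# candidate's error by 10-fold cross-validation and keep the best one.
# scikit-learn, the candidate set and the iris data are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "1-NN": KNeighborsClassifier(n_neighbors=1),
}

# Cross-validated error estimate (1 - mean accuracy) for each candidate.
errors = {name: 1.0 - cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in candidates.items()}

best = min(errors, key=errors.get)
print(errors)
print("selected:", best)
```

A racing-style speed-up would simply stop scoring a candidate early once its partial estimate falls clearly behind the current best.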

1.2 Model Selection Using Meta-knowledge

Meta-knowledge allows us to capture our knowledge about which ML algorithms should perform well in which situation. This knowledge can be either theoretical or of experimental origin, or a mixture of both. The rules described by Brodley (1993), for instance, captured the knowledge of experts concerning the applicability of certain classification algorithms. The meta-knowledge of Brazdil et al. (1994) and Gama and Brazdil (1995) was of experimental origin. The objective of the meta-rules generated with the help of learning systems was to capture certain relationships between the measured dataset characteristics (such as the number of attributes, number of cases, skew, kurtosis, etc.) and the error rate. As was demonstrated by the authors, this meta-knowledge can be used to predict the errors of individual algorithms with a certain degree of success.

One advantage of this approach, when compared to model selection based on experimentation, is that it does not really need extensive experiments. This is due to the fact that meta-knowledge captures certain regularities of the situations encountered in the past. The disadvantage is that the acquired meta-knowledge need not be totally applicable to a new situation and, in consequence, this method tends to be somewhat less reliable than model selection based on experimentation.

As neither solution is ideal, this suggests that we may gain by combining the two approaches. Model selection by meta-knowledge can be used to pre-select a subset of promising algorithms, and experimentation can then be used to identify the best candidate. This method requires that we define the criteria for pre-selecting the set of candidate algorithms. A good criterion will somehow strike a balance between the reliability of the outcome and the amount of experimentation we are prepared to undertake. Pre-selecting fewer algorithms has the advantage that there is less work to be done, but, on the other hand, we may get a sub-optimal result.
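The sketch below illustrates the meta-level idea under simplifying assumptions: a few dataset characteristics are computed, and a nearest-neighbour meta-model trained on a purely fictitious table of past results predicts an algorithm's error on a new dataset without running it. The meta-features, numbers and target algorithm are placeholders, not the measures or results used by the authors.

```python
# Sketch of meta-knowledge-based prediction: map dataset characteristics to
# the expected error of an algorithm, using results gathered on past datasets.
# The meta-features, the "past results" table and the target algorithm are
# illustrative placeholders only.
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsRegressor

def meta_features(X, y):
    """A few simple dataset characteristics (illustrative subset)."""
    return np.array([
        X.shape[1],                           # number of attributes
        X.shape[0],                           # number of cases
        len(np.unique(y)),                    # number of classes
        float(np.mean(skew(X, axis=0))),      # mean skewness of the attributes
        float(np.mean(kurtosis(X, axis=0))),  # mean kurtosis of the attributes
    ])

# Fictitious meta-dataset: one row of characteristics per past dataset, plus
# the error rate measured for one algorithm (say, a decision tree) on it.
past_meta = np.array([[10, 500, 2, 0.2, 3.1],
                      [4, 150, 3, 0.9, 2.4],
                      [36, 2000, 5, 0.1, 4.0]], dtype=float)
past_errors = np.array([0.18, 0.25, 0.31])

meta_model = KNeighborsRegressor(n_neighbors=1).fit(past_meta, past_errors)

# Predict the algorithm's error on a new dataset without ever running it.
X_new, y_new = load_wine(return_X_y=True)
print("predicted error:", meta_model.predict([meta_features(X_new, y_new)])[0])
```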

2 Different Approaches to Model Selection by Meta-knowledge

There are many different ways in which we can approach the problem. Our aim in this section is to describe certain options we can take when addressing it. Basically we need to decide:

- whether the meta-knowledge should express knowledge concerning pairs of algorithms or a larger group;
- what the reference point is for the comparisons of error rates;
- whether the meta-knowledge should be easily updateable;
- whether the predictions should be qualitative (e.g. Ai is applicable) or quantitative (the error rate of Ai is E%);
- whether or not we want to condition the predictions on dataset characteristics.

Let us now analyze each of the points above in some detail.

2.1 Which is the Best Reference Point?

The first important decision is whether we should consider pairs of algorithms or generalize the study to N algorithms. The meta-rules of Aha (1992) were oriented towards pairs of algorithms (e.g. IB1, C4). The objective of the meta-rules was to define conditions under which one algorithm (e.g. IB1) achieves better results and hence is preferable to another (e.g. C4). The first major comparative study of a set of 22 classification algorithms was carried out under the StatLog project (Michie et al., 1994). The fact that a number of algorithms were analyzed together provided a reason to establish a kind of common reference point for all comparisons involving error rates. Gama and Brazdil (1995), for instance, considered three kinds of reference points in their study and evaluated them experimentally:

- the best error rate achieved by one of the algorithms,
- the mean error rate of all algorithms (or a weighted mean),
- the error rate associated with the majority class prediction.

We note that the first two reference points depend on the set of algorithms under consideration. That is, if we introduce new algorithms into the set, or if we eliminate some existing ones from consideration, we have to, at least in principle, repeat all the steps that depend on this reference point. This of course complicates the task of updating the existing meta-knowledge as soon as new algorithms become available. The third reference point does not suffer from this disadvantage: the error rate associated with the majority class depends entirely on the dataset under consideration.
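As a small illustration of the three alternatives, the sketch below computes each reference point for one dataset, given made-up error rates for a pool of algorithms, and normalizes the errors against the pool-independent majority-class reference.

```python
# Sketch of the three candidate reference points discussed above. The error
# rates and the class column are made-up values for illustration.
import numpy as np

errors = {"C4.5": 0.18, "IB1": 0.22, "discrim": 0.25}   # error per algorithm
y = np.array(["a"] * 70 + ["b"] * 30)                   # class column of the dataset

best_err = min(errors.values())                         # depends on the algorithm pool
mean_err = float(np.mean(list(errors.values())))        # also pool-dependent
_, counts = np.unique(y, return_counts=True)
majority_err = 1.0 - counts.max() / counts.sum()        # depends only on the dataset

# Normalizing against the majority-class error leaves the meta-knowledge
# unaffected when algorithms are added to or removed from the pool.
normalized = {name: err / majority_err for name, err in errors.items()}
print(best_err, mean_err, majority_err)
print(normalized)
```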

2.2 Should the Predictions of Meta-Knowledge be Qualitative or Quantitative?

Another important issue is whether we want the prediction concerning the error to be qualitative or quantitative. A qualitative prediction would simply divide the algorithms into two groups: those with low error rates, which we could identify as applicable, and the remaining ones, which include both the algorithms with unacceptably high error rates and the algorithms which failed to run. Quantitative predictions are concerned with predicting the actual error rate (or an error which has been normalized in some way). The question concerning the form of the meta-knowledge is closely related to this issue. If we are interested in obtaining only qualitative predictions, then meta-knowledge can be represented in the form of rules or cases. If we are interested in quantitative predictions, then we need to use some kind of regression model, although qualitative predictions can also be converted to quantitative ones (i.e. by associating a numeric value with each class).
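The following trivial sketch contrasts the two prediction styles, with a margin of my own choosing: a quantitative meta-model returns an error estimate, which can then be reduced to a qualitative applicable / not-applicable label relative to a reference point.

```python
# Sketch: turning a quantitative error prediction into a qualitative label.
# The margin and the numbers are arbitrary illustrative choices.
def qualitative(predicted_error, reference_error, margin=0.05):
    """Label an algorithm 'applicable' if its predicted error is close to the reference."""
    return "applicable" if predicted_error <= reference_error + margin else "not applicable"

predicted = {"C4.5": 0.17, "IB1": 0.29}  # quantitative predictions
reference = 0.20                         # e.g. best, mean or majority-class error
for algo, err in predicted.items():
    print(algo, err, qualitative(err, reference))
```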

2.3 Conclusions of a Previous Comparative Analysis

Let us review the results of the experimental analysis carried out by Gama and Brazdil (1995), who collected test results of about 20 algorithms on more than 20 datasets. Each dataset was characterized using 18 different measures (such as the number of attributes, number of cases, skew, kurtosis, etc.). The authors considered and evaluated the three reference points discussed earlier. Besides, the following forms of meta-knowledge were considered:

- rules (generated by C4.5 (Quinlan, 1993));
- instances (a version of IB1 (Aha et al., 1991));
- linear regression equations (generated by a linear discriminant procedure);
- piecewise linear regression equations (linear regression equations with restricted applicability, generated by Quinlan's (1993b) M5.1).

A separate experiment was conducted for each of the 3 reference points and each of the 4 forms of meta-knowledge. There were thus 12 separate experiments in total. In each experiment the predictive power of the meta-knowledge was evaluated using a leave-one-out method. Let us analyze one such experiment for the sake of clarity. Suppose the aim is to evaluate, for instance, the scheme involving the normalization method based on majority class prediction and meta-knowledge in the form of piecewise linear regression equations. In each step of the leave-one-out method, one dataset was set aside for evaluation. The remaining data was normalized with respect to the chosen reference point and supplied to the learning system to construct the model (i.e. piecewise linear regression equations in this case). The prediction was then de-normalized and stored together with the actual value. After all cycles of the leave-one-out method had terminated, these pairs of values were used to calculate measures characterizing the quality of the predictions, such as NMSE. So, essentially the authors evaluated the possibility of obtaining reliable predictions with the help of meta-level models.

This analysis showed that meta-level models were indeed quite useful, although some set-ups were more successful than others. Instance-based models (more exactly 3-NN) provided more reliable predictions than some of the other model types (particularly rules and linear regression equations). Piecewise linear regression equations also achieved quite good predictions overall. The best reference point was the one related to majority class prediction.

These results have a quite interesting implication. The method that provides the most reliable predictions (IBL with the majority class as the reference point) enables us to construct a system which is easily extensible. The system can easily accommodate new algorithms which can arise at any time. The new results can simply be added to the existing instances and used immediately afterwards in decision making. There is no need to carry out extensive meta-level learning, which is an advantage. This strategy was incorporated in the system Calg (Gama, 1996). The only disadvantage is that the meta-knowledge in this form does not really provide a comprehensible model.
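The sketch below mimics the shape of the evaluation loop just described: a leave-one-dataset-out cycle in which the target errors are normalized by a reference point before meta-learning, the predictions are de-normalized, and NMSE is computed at the end. The meta-data is synthetic and a 3-NN regressor stands in for the meta-learners of the original study, so this illustrates only the procedure, not its results.

```python
# Sketch of the leave-one-(dataset)-out evaluation of a meta-level model.
# Synthetic placeholder meta-data; a 3-NN regressor stands in for the
# meta-learners used in the original study.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
meta_X = rng.random((20, 5))            # 20 datasets x 5 dataset characteristics
true_err = rng.uniform(0.05, 0.4, 20)   # measured error of one algorithm per dataset
ref = rng.uniform(0.3, 0.6, 20)         # reference point, e.g. majority-class error

preds = np.empty_like(true_err)
for i in range(len(true_err)):
    train = np.arange(len(true_err)) != i                    # leave dataset i out
    model = KNeighborsRegressor(n_neighbors=3)
    model.fit(meta_X[train], true_err[train] / ref[train])   # normalized target
    preds[i] = model.predict(meta_X[i:i + 1])[0] * ref[i]    # de-normalize

# Normalized mean squared error of the meta-level predictions.
nmse = np.mean((preds - true_err) ** 2) / np.var(true_err)
print("NMSE:", nmse)
```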

3 Using Meta-Knowledge to Guide Experimentation

Let us now consider whether meta-knowledge can also be used to guide the process of experimentation itself. But would this guidance really be useful?

The answer is affirmative, if we want to avoid unnecessary work. If pre-selection is done on the basis of the performance of the individual algorithms only, we cannot guarantee that the final subset does not include algorithms which are minor variants of one another. For practical reasons it is not really worth trying them all. What kind of meta-knowledge could be useful here? One interesting and practical possibility is to use statements of the form:

pf(Ai(Di) >> Aj(Di) | Di ∈ Dataset-pool)

which enable us to describe the frequency with which algorithm Ai performs significantly better than algorithm Aj over a given pool of datasets. Here, "Ai(Di) >> Aj(Di)" is used as a shorthand for "algorithm Ai performs significantly better (at a given confidence level, say 95%) than algorithm Aj". We can use this representation to express the fact that the algorithm Ltree, for instance, leads to significantly better results than C4.5 in 10 out of 22 cases by:

pf(Ltree(Di) >> C4.5(Di) | Di ∈ UCI-datasets-of-JG) = 10/22

The algorithm Ltree is a decision tree type algorithm which can introduce new terms with the help of constructive induction (Gama, 1997). The frequency can be used to estimate the probability that one algorithm performs better than another. It can also help to resolve the problem we discussed earlier: if Aj' is a variant of Aj which does not really bring any benefits, then presumably the frequency of observing a significant improvement is zero. To express this we can use:

pf(Aj'(Di) >> Aj(Di) | Di ∈ Dataset-pool) = 0
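Such pf statements can be estimated directly from per-dataset test results. The sketch below does this for two algorithms: on each dataset a paired t-test over per-fold cross-validation errors decides whether one is significantly better at the 95% level, and pf is the fraction of datasets on which this holds. The paired t-test and the per-fold figures are my illustrative choices; they are not the Ltree or C4.5 results quoted above.

```python
# Sketch: estimating pf(Ai(D) >> Aj(D) | D in pool) from per-fold CV errors.
# "Significantly better" is judged here with a paired t-test at the 95% level;
# the per-fold error values are invented for illustration.
import numpy as np
from scipy.stats import ttest_rel

def significantly_better(errs_i, errs_j, alpha=0.05):
    """True if algorithm i's per-fold errors are significantly lower than j's."""
    _, p = ttest_rel(errs_i, errs_j)
    return p < alpha and float(np.mean(errs_i)) < float(np.mean(errs_j))

def pf(results_i, results_j, alpha=0.05):
    """Fraction of datasets on which algorithm i significantly beats algorithm j."""
    wins = sum(significantly_better(ei, ej, alpha)
               for ei, ej in zip(results_i, results_j))
    return wins / len(results_i)

# Per-fold errors of two algorithms on three datasets (purely illustrative).
algo_i = [np.array([.10, .12, .11]), np.array([.20, .22, .21]), np.array([.30, .31, .29])]
algo_j = [np.array([.16, .15, .17]), np.array([.21, .20, .23]), np.array([.36, .34, .37])]
print("pf(Ai >> Aj) =", pf(algo_i, algo_j))
```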

4 Using Meta-Knowledge to Guide Pre-Processing and Model Selection

Previous studies have shown that pre-processing, such as the elimination of irrelevant features or the discretization of numeric values, can often bring about substantial improvements. Langley and Iba (1993), for instance, have demonstrated that the performance of the IBL classifier can be substantially improved by eliminating irrelevant features. Kohavi and John (1997) have verified that similar improvements can also be obtained with the Naive Bayes and ID3 classifiers. Some classification algorithms (e.g. Naive Bayes) achieve better performance if the numeric features are discretized first (Dougherty et al., 1995).

A question arises whether the system proposed in the previous section can be extended to also cover the pre-processing stage. Our view is that this can indeed be done. Let us consider, for instance, one result presented in (Dougherty et al., 1995): at the 95% confidence level, Naive Bayes with entropy discretization is better than C4.5 on five datasets and worse on two (there were 16 datasets in total). Such results can be expressed in the form of meta-level facts like:

pf(Naive-Bayes(Entropy-discr(Di)) >> C4.5(Di) | Di ∈ UCIdata-DKS) = 5/16
pf(Naive-Bayes(Entropy-discr(Di)) > C4.5(Di) | Di ∈ UCIdata-KJ) = 4/14
pf(C4.5(Back-feature-select(Di))
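As an illustration of how such facts could be gathered in practice, the sketch below treats a (pre-processor, learner) pipeline as just another candidate in the pool and estimates its error by cross-validation. scikit-learn components serve as rough stand-ins for the algorithms named above (KBinsDiscretizer plus BernoulliNB in place of entropy-based discretization plus Naive Bayes, DecisionTreeClassifier in place of C4.5), and the dataset is likewise only illustrative.

```python
# Sketch: treating a (pre-processor, learner) pipeline as one more candidate
# in the pool and estimating its error by cross-validation. KBinsDiscretizer
# + BernoulliNB stand in for entropy-based discretization + Naive Bayes, and
# DecisionTreeClassifier stands in for C4.5; the wine data is illustrative.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

candidates = {
    "NB on discretized features": make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="onehot-dense"), BernoulliNB()),
    "C4.5-like decision tree": DecisionTreeClassifier(random_state=0),
}

# Cross-validated error of each candidate; per-dataset results like these,
# collected over a pool of datasets, are what the pf(...) facts summarize.
for name, model in candidates.items():
    err = 1.0 - cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: estimated error {err:.3f}")
```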
