Learning an Interpretable Model from an Ensemble in ILP

Anneleen Van Assche, Jan Ramon and Hendrik Blockeel

Computer Science Department, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium
Abstract. In this paper a method to learn a single interpretable model from a relational ensemble is presented. The new model is obtained by artificially generating partial data examples using the distributions implicit in the ensemble and by building a new relational model from this artificial data.
1 Introduction
Although ensemble methods have proven to be very useful in propositional learning, they have received less attention in relational learning, and in ILP in particular. One reason why the ILP community has shown less interest in ensemble methods is the significant drop in understandability of the learned hypothesis. The result of an ensemble method is indeed a large set of hypotheses, each with a (weighted) vote attached, which becomes very hard to interpret. Hoche and Wrobel [1] address this problem by using a method called constrained confidence-rated boosting (based on the work by Schapire and Singer [2]) on a fast but weak ILP learner. This method improves the understandability of the boosted learning results by restricting the kinds of rule sets allowed.

Another way to obtain interpretability is to first build an ensemble and afterwards reduce it to an interpretable model without sacrificing too much predictive performance. In propositional learning, considerable successful work has already been done in this area. Domingos [3] proposed Combined Multiple Models (CMM). CMM is based on reapplying the base learner to recover the partitioning implicit in the multiple model ensemble. This is achieved by giving the base learner a new training set, composed of a large number of examples generated and classified according to the ensemble. It is important that the artificially generated data approximate the true distribution of the examples as closely as possible. Zhou, Jiang and Chen [4] utilize neural network ensembles to generate new instances and then extract symbolic rules from those instances.

In this paper we follow Domingos' approach of generating artificial instances, classifying them with the ensemble, and learning a new (comprehensible) model from these instances (added to the original data), in an ILP setting. However, generating artificial data in a relational environment is not straightforward, as distributions over the data are far more complex than in the propositional case. We describe in the next section how we deal with this.
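To make the CMM idea concrete, the following is a minimal propositional sketch. The `ensemble`, `base_learner`, and `generate_example` objects and their fit/predict interfaces are illustrative assumptions, not part of any cited system.

```python
# Minimal propositional sketch of the CMM idea from Domingos [3].
# `ensemble`, `base_learner`, and `generate_example` are hypothetical
# placeholders, not part of any cited system.
import numpy as np

def cmm_retrain(ensemble, base_learner, X_train, y_train,
                generate_example, n_artificial=1000, rng=None):
    """Generate artificial examples, label them with the ensemble,
    and refit the base learner on the enlarged training set."""
    rng = rng or np.random.default_rng(0)
    X_art = np.array([generate_example(rng) for _ in range(n_artificial)])
    y_art = ensemble.predict(X_art)          # the ensemble supplies the labels
    X_new = np.vstack([X_train, X_art])      # keep the original data as well
    y_new = np.concatenate([y_train, y_art])
    return base_learner.fit(X_new, y_new)    # single, interpretable model
```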
2 Approximating an Ensemble by a Single Model Using Artificially Generated Data

2.1 Motivation
In [5] we describe an efficient relational ensemble learner called Forf (First Order Random Forests). Although it provides a higher predictive accuracy than a single first order decision tree, it is clearly less interpretable. In this paper we aim to learn a single interpretable model that approximates the ensemble learned by Forf. As the partitioning of the ensemble is more complex than that of a single model, the new model, learned on the data generated by the ensemble, will probably also be more complex than the base learners of the ensemble, but not necessarily so complex that it becomes incomprehensible. As a single model may not be able to capture the representative power of an ensemble, it will also likely be less accurate than the ensemble; but since it is provided with more training examples than the base learners, it should be possible to obtain a model more powerful than these. In propositional learning this has already been shown to be the case [3, 4].
2.2 Artificially Generated Relational Data
The artificially generated data should approximate the true distribution of examples as closely as possible. To ensure this, the distribution implicit in the ensemble of decision trees is used. After all, each decision tree in the forest defines a probability distribution over the regions in its partitioning: the number of examples that end up in a leaf, divided by the total number of training examples, is an estimate of the probability that an example falls in the region of the instance space covered by that leaf. Artificially generated examples should approximately have the same distribution over these regions of the instance space.

In propositional learning, new examples get values for each of the attributes according to the distribution of the existing data. In relational learning this is more complex: we do not have a fixed set of attributes that can be assigned random values. Therefore we will generate partial examples, meaning that we only assign characteristics to the examples that occur as tests in a tree of the ensemble. So although no full knowledge will be available about the artificial examples, the learning algorithm building the new model will be able to query the same knowledge that was used by the ensemble.

We proceed as follows to generate a partial example pe. First, one decision tree $T_1$ is randomly selected from the ensemble. Then a leaf is taken from this tree according to the probability distribution over the leaves, meaning a leaf $l_{T_1}$ is chosen with probability $P(l_{T_1}) = n_{l_{T_1}}/N$, with $n_{l_{T_1}}$ the number of examples that end up in that leaf and $N$ the total number of examples in the training set. The conjunction of literals along the path from the root to the selected leaf $l_{T_1}$ is added to the partial example pe. Then each leaf $L$ that conflicts with pe (i.e., no interpretation exists satisfying both the conjunctions already added to pe and the conjunction corresponding to leaf $L$) is removed from the remaining trees.¹
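As an illustration of this first step, here is a small Python sketch of choosing a leaf with probability $n_l/N$ and starting a partial example from its path conjunction. The representation of leaves as (count, path) pairs and the literals shown are illustrative assumptions only.

```python
# Sketch of sampling one leaf from a first order decision tree according to
# P(l) = n_l / N, and starting a partial example from its path conjunction.
import random

# A leaf: (number of training examples reaching it, list of path literals),
# where a literal is (test_string, truth_value). Purely illustrative.
tree1 = [
    (40, [("atom(X, A)", True), ("charge(A, C)", True)]),
    (60, [("atom(X, A)", True), ("charge(A, C)", False)]),
]

def sample_leaf(tree, rng=random):
    total = sum(count for count, _ in tree)             # N over this tree
    weights = [count / total for count, _ in tree]      # P(l) = n_l / N
    return rng.choices(tree, weights=weights, k=1)[0]

count, path = sample_leaf(tree1)
partial_example = list(path)   # conjunction of literals along the chosen path
```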
Next, another random tree $T_2$ is selected. To follow the exact distribution, the next leaf $l_{T_2}$ should be selected with probability $P(l_{T_2} \mid l_{T_1})$, but as this probability is unknown we approximate it with $n_{l_{T_2}}/n_C$, where $n_C$ is the number of examples ending up in leaves not conflicting with pe. Processing the trees in a random order when composing an example compensates somewhat for this approximation. We proceed like this until pe has been assigned one leaf from each tree. Thus, for each generated example we know to which region of the ensemble's partitioning it belongs, and the ensemble can assign a class to it. However, some tests t may occur in the ensemble for which both t and not(t) are still satisfiable given pe (these will be tests that occur along paths to leaves conflicting with pe). In that case we randomly add either t or not(t) to pe, as the outcome of these tests might be required by the algorithm run on the artificially generated examples. Once an artificial data set has been composed according to the method described above, a first order decision tree learner can be applied to this new data set combined with the original one. Because of the way the examples are generated, the new model is restricted to tests that also appeared in the ensemble.
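Putting the steps together, a minimal sketch of the generation loop could look as follows, reusing the (count, path) leaf representation from the previous sketch. The `conflicts` function is only a placeholder for the satisfiability test against the background knowledge, and the final steps of fixing undetermined tests and labelling pe with the ensemble are omitted.

```python
# Sketch of the partial-example generation loop described above.
# A forest is a list of trees; a tree is a list of (count, path) pairs.
# `conflicts` stands in for the real satisfiability check against the
# background knowledge.
import random

def conflicts(path, partial):
    """Placeholder conflict test: a leaf conflicts if it asserts the opposite
    truth value for a test already present in the partial example."""
    asserted = dict(partial)
    return any(test in asserted and asserted[test] != value
               for test, value in path)

def generate_partial_example(forest, rng=random):
    partial = []                                    # conjunction built so far
    for tree in rng.sample(forest, len(forest)):    # random tree order
        candidates = [(n, path) for n, path in tree
                      if not conflicts(path, partial)]
        if not candidates:                          # no compatible leaf left
            continue
        n_c = sum(n for n, _ in candidates)         # examples in compatible leaves
        weights = [n / n_c for n, _ in candidates]  # approximates P(l_Ti | pe)
        _, path = rng.choices(candidates, weights=weights, k=1)[0]
        for literal in path:                        # extend the partial example
            if literal not in partial:
                partial.append(literal)
    return partial
```

After this loop, any test from the ensemble that is still undetermined would be added as t or not(t) at random, and the ensemble would assign the class label, as described above.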
3 Conclusion
In this paper we propose a method to reduce an ensemble of first order trees to a single interpretable tree, by generating artificial relational data according to the distributions implicit in the ensemble and by building a new model from this data. As it is hard to capture the complete relational distribution, only partial examples are constructed, containing only information about tests occurring in trees of the ensemble. As a result, the new model is restricted to queries that were also used by the ensemble. Preliminary experiments on the Mutagenesis data set are promising, but more extensive experiments are needed to reveal how rewarding this method is in an ILP setting.
References

1. Hoche, S., Wrobel, S.: Relational learning using constrained confidence-rated boosting. In Rouveirol, C., Sebag, M., eds.: Proceedings of the Eleventh International Conference on Inductive Logic Programming. Volume 2157 of Lecture Notes in Artificial Intelligence, Springer-Verlag (2001) 51–64
2. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3) (1999) 297–336
3. Domingos, P.: Knowledge discovery via multiple models. Intelligent Data Analysis 2 (1998) 187–202
4. Zhou, Z., Jiang, Y., Chen, S.: Extracting symbolic rules from trained neural network ensembles. AI Communications 16(1) (2003) 3–15
5. Van Assche, A., Vens, C., Blockeel, H., Džeroski, S.: First order random forests: Learning relational classifiers with complex aggregates. Machine Learning (2006) To appear.
¹ In order to be able to check conflicts, the background knowledge needs to be defined adequately.