OPTIMIZING MIXTURES OF LOCAL EXPERTS IN TREE-LIKE REGRESSION MODELS

Michael Baskara L. A. SIEK

Dimitri P. SOLOMATINE

Prima Intelligence, Inc., P.O. Box 315, Surabaya, Indonesia, [email protected]

UNESCO-IHE Institute for Water Education, P.O. Box 3015, Delft, The Netherlands, [email protected] (corresponding author)

ABSTRACT
A mixture of local experts consists of a set of specialized models, each of which is responsible for a particular local region of the input space. Many algorithms in this class, for example the M5 model tree, are sub-optimal (greedy). An algorithm for building optimal mixtures of local regression experts is proposed and compared to an MLP ANN on a number of cases.

KEY WORDS
Machine learning, mixtures, local models, regression.

1. Mixtures of Local Experts

A complex machine learning problem can be solved by dividing it into a number of simpler tasks and combining the solutions of these tasks. The input space can be divided into a number of regions (subsets of data), for each of which a separate specialized model (expert, module) is built; the outputs of the experts are then combined. Such models are named committee machines, mixtures of experts, modular models, stacked models, etc. ([12], [14]).

Two criteria can be used to classify such models: how the experts are combined, and on which data they are trained. The way experts are combined falls into one of two major categories: (1) static, where the responses of the experts are combined by a mechanism that does not involve the input signal, e.g., using fixed weights; examples are ensemble averaging (where separate experts are built for the whole input space and then averaged) and boosting. (2) dynamic, where the experts are combined using weighting schemes that depend on the input vector; an example is the statistically-driven approach of Jacobs, Jordan, Nowlan and Hinton [14] (called mixture of experts). The sketch below contrasts the two schemes.
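To make the distinction concrete, the following sketch (illustrative Python of our own; the names combine_static, combine_dynamic and softmax_gate are not taken from any of the cited systems) contrasts a static combiner with fixed weights against a dynamic combiner whose weights are produced by a gating function of the input vector:

import numpy as np

def combine_static(expert_outputs, weights):
    # Static combination: the weights are fixed in advance and do not
    # depend on the input (ensemble averaging uses equal weights).
    return np.dot(weights, expert_outputs)

def softmax_gate(x, V):
    # A gate assumed here for illustration: softmax over linear scores,
    # producing one non-negative weight per expert, summing to 1.
    scores = V @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

def combine_dynamic(x, expert_outputs, V):
    # Dynamic combination in the spirit of the mixture of experts [14]:
    # the weights depend on the input vector x through the gate.
    return np.dot(softmax_gate(x, V), expert_outputs)

outputs = np.array([1.0, 1.4, 0.9])             # predictions of 3 experts
print(combine_static(outputs, np.ones(3) / 3))  # ensemble averaging
V = np.array([[0.2], [-0.1], [0.5]])            # toy gating parameters
print(combine_dynamic(np.array([2.0]), outputs, V))

Hard splitting, discussed next, can be seen as the limiting case of dynamic combination in which the gate outputs weight 1 for exactly one expert and 0 for all others.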

The regions (or subsets of data) for which the experts are “responsible” can be constructed in two ways: (1) in a probabilistic fashion, so that they may contain repeated examples and may intersect; this is done, e.g., in boosting [6], [28]; (2) as a result of “hard” splitting of the input space. In the latter case each expert is trained individually on the subset of instances contained in its local region, and finally the output of only one specialized expert is taken into consideration. Indeed, if the regions are non-intersecting, there is no reason to combine the outputs of different experts or modules, and only one of them is explicitly used (a particular case in which the weights of all other experts are zero).

In tree-like models such regions are constructed by progressively narrowing the regions of the input space. The result is a hierarchy: a tree (often a binary one) with splitting rules in the non-terminal nodes and the expert models in the leaves (Fig. 1). Such models will be called in this paper mixtures of local experts (MLEs), and the experts will be referred to as modules, or specialized models.

Models in MLEs can be of any type. If the model output is a nominal variable, so that a classification problem is to be solved, then one of the popular methods is a decision tree. For solving a numerical prediction (regression) problem, there is a number of methods based on the idea of a decision tree: (a) the regression tree of Breiman et al. [4], where a leaf is associated with the average output value of the instances sorted down to it (a zero-order model), and (b) the model tree, where leaves carry regression functions of the input variables. Among model trees, two approaches can be distinguished: that of Friedman [10] in the MARS (multivariate adaptive regression splines) algorithm, implemented in the MARS software, and the M5 model trees of Quinlan [20], implemented in the Cubist software and, with some changes, in the Weka software ([8]). The mentioned algorithms are sub-optimal (“greedy”), since the choice of the attribute for a split node is made once and is not reconsidered.

The subject of this paper is optimisation of building MLEs consisting of simple (linear) regression models. The M5 algorithm is chosen as the basic greedy algorithm allowing for building MLEs of linear models; the aim is to propose an approach allowing for building optimal MLEs. The M5 algorithm is analogous to the ID3 decision tree algorithm of Quinlan (also greedy) in the sense that it minimizes the intra-subset variation in the output values down each branch. In each node, the standard deviation of the output values of the examples reaching the node is taken as a measure of the error of this node, and the expected reduction in this error is calculated for each candidate attribute and each possible split value. The split attribute and the split value that maximize the expected error reduction are chosen for the node. A minimal sketch of this criterion is given below.
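For the set T of examples reaching a node, this criterion is the standard deviation reduction SDR = sd(T) − Σ_i (|T_i|/|T|) × sd(T_i), where T_1, T_2, ... are the subsets produced by a candidate test. The following sketch (our own illustrative Python, not the Cubist or Weka implementation) shows the greedy choice of a binary split:

import numpy as np

def sdr(y, left_mask):
    # Standard deviation reduction of splitting the outputs y into two
    # subsets: sd(T) - sum_i |T_i|/|T| * sd(T_i).
    y_left, y_right = y[left_mask], y[~left_mask]
    if len(y_left) == 0 or len(y_right) == 0:
        return -np.inf
    n = len(y)
    return (y.std()
            - (len(y_left) / n) * y_left.std()
            - (len(y_right) / n) * y_right.std())

def best_split(X, y):
    # Greedy step of an M5-like tree: test every attribute and every
    # possible split value, and keep the pair maximizing the expected
    # error reduction. The choice is made once and never reconsidered.
    best_attr, best_value, best_gain = None, None, -np.inf
    for a in range(X.shape[1]):
        for v in np.unique(X[:, a])[:-1]:    # candidate split values
            gain = sdr(y, X[:, a] <= v)
            if gain > best_gain:
                best_attr, best_value, best_gain = a, v, gain
    return best_attr, best_value, best_gain

X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([1.0, 1.1, 3.0, 3.2])
print(best_split(X, y))   # -> (0, 2.0, ...): split attribute 0 at 2.0

Applied recursively to each resulting subset, this procedure grows the tree; a prediction is then obtained by sorting an input down the splits to exactly one leaf and using that leaf's model.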

The splitting process terminates when the output values of all the instances that reach a node vary only slightly, or when only a few instances remain. After the initial tree has been grown, the linear regression models in the leaves are generated and, possibly, simplified, pruned and smoothed. Wang & Witten [7] reported the M5' algorithm, based on the original M5 algorithm but able to deal