
Measuring Diversity in Regression Ensembles

Haimonti Dutta
The Center for Computational Learning Systems, Columbia University, New York, USA
[email protected]

Abstract. The problem of combining predictors to increase accuracy (often called ensemble learning) has been studied broadly in the machine learning community for both classification and regression tasks. The design of an ensemble is based on the individual accuracy of the predictors and also on how different they are from one another. There is a significant body of literature on how to design and measure diversity in classification ensembles. Most of these metrics are not directly applicable to regression ensembles, since the regression task inherently deals with continuous-valued labels. For regression ensembles, Krogh and Vedelsby show that the quadratic error of an ensemble estimator is guaranteed to be less than or equal to the average quadratic error of its components. However, this result does not provide a way to measure or create diverse regression ensembles. This paper presents metrics (correlation coefficient, covariance, dissimilarity measure, chi-square and mutual information) that can be used for measuring diversity in regression ensembles. Careful selection of diverse models can be used to reduce the overall ensemble size without substantial loss in performance. We present extensive empirical results on the performance of diverse regression ensembles formed by Bagging and Random Forest techniques.

1 Introduction

Ensemble based learning algorithms have been used for many machine learning applications ([Die02], [OT08], [BWT05]). They were first used for classification as early as the 1960s [Nil65], and more recently techniques such as Stacking ([Wol92], [Bre93]), Bagging [Bre96], Boosting ([Sch99], [DC96]) and Random Forests [Bre01] have been developed. Ensembles are groups of learners (such as decision trees, neural networks or Support Vector Machines (SVMs)) in which each learner provides an estimate of the target variable, which can be categorical or continuous; these estimates are combined by some technique (such as majority voting or averaging), thereby reducing the generalization error produced by a single learner. Thus, the central idea in ensemble learning is to exploit the information provided by the weak learners for improved performance. The success of an ensemble of learners relies upon the diversity among the individual learners [RP06]. Diversity is the degree of disagreement [BWT05] among the individual learners. The concept had its origin in software engineering, where the aim was to increase the reliability of solutions by combining programs whose failures were uncorrelated [SS97]. In the context of supervised machine learning, diversity measures have been studied widely for classification problems from different perspectives ([OS99], [KW03]). However, very little research has been done to address the diversity of regression ensembles.

The main contributions of this paper are: (1) a study of metrics for measuring diversity in regression ensembles, and (2) an extensive empirical analysis of the performance of "diverse" ensembles compared to larger ensembles built without consideration of diversity. The rest of the paper is organized as follows: Section 2 presents related work, Section 3 provides definitions and notation, Section 4 discusses diversity measures for regression ensembles, Section 5 presents how diverse ensembles can be created and Section 6 provides experimental results. Finally, Section 7 concludes the paper.

2 Related Work

The problem of combining predictors to increase accuracy (often called ensemble learning) has been studied extensively in the machine learning and statistics communities ([BHBK03], [REBK05], [CNMCK04], [CMNM06], [QSxSy05], [GRF00])¹. The design of an ensemble is often based on the assumption that each predictor is reasonably accurate and diverse, i.e., the hypotheses disagree with each other on many of the predictions [Die02]. Diversity is a crucial aspect since it determines whether the ensemble will perform better than its individual components. Breiman [Bre96] introduced a technique called "Bootstrap Aggregating" (Bagging), which works by sampling data points uniformly with replacement from the original training set. If the learning algorithm is such that small perturbations in the data lead to large changes in the resulting hypothesis, Bagging will produce a diverse ensemble of hypotheses. A second technique for constructing diverse ensembles is to use a subset of input features [Che96] identified by domain knowledge. A third mechanism is injecting noise (drawn from a normal distribution with zero mean) into the output labels of the predictors ([DB95], [Chr03], [Bre00]). Injecting randomness into decision tree ensembles has also been studied in the "random subspace method" [Ho98], which chooses a random subset of features at each node of the tree. Finally, Random Forests [Bre01], which combine bagging and the random subspace method, are known to give very good ensemble accuracy.

¹ A comprehensive listing of Ensemble Pruning literature is available from mlkd.csd.auth.gr/ensemblepruning/ensemblepruning bib.pdf

Metrics for measuring the diversity of classification and regression ensembles have received particular attention in recent years. For classification ensembles, Kuncheva and Whitaker [KW03] provide eight measures of dependency between a team of classifiers, divided into two groups: (1) pairwise measures (the Q statistic, correlation, disagreement and double fault) and (2) non-pairwise measures (entropy, the difficulty index, the Kohavi-Wolpert variance and interrater agreement). Tang et al. [TSY06] present a theoretical analysis of six of these diversity measures, show the underlying relationships between them, and relate them to the concept of margin, which is more explicitly related to the success of ensemble learning algorithms.

For regression ensembles, Krogh and Vedelsby [AV95] prove that the quadratic error of the ensemble estimator is guaranteed to be less than or equal to the average quadratic error of the components. This means:

\[
E_{\text{ensemble}} = \langle E_{\text{individual models}} \rangle - \langle A \rangle \tag{1}
\]

where E is the prediction error, ⟨···⟩ denotes averaging over all models, and ⟨A⟩ represents the ensemble "ambiguity", which measures how much the predictions of the individual models differ from that of the overall ensemble [Chr03]. Intuitively, this means that the larger the ambiguity term, the larger the reduction in ensemble error (a per-point form of this decomposition is sketched below). However, if all the models have low prediction error, it is likely that they are very similar to one another; and if they are very different from one another, they may not all have low prediction error. This implies that the right balance is required between diversity (the ambiguity term) and individual accuracy (the average error term) in order to achieve low ensemble error. Extensions of the model proposed by Krogh and Vedelsby have been studied by Brown et al. [BWT05], who show that Negative Correlation (NC) [LY99] plays an important role in the diversity of ensembles. All of the above theoretical analyses quantify diversity, but they do not show how diversity can be achieved in an ensemble, nor do they provide metrics that help evaluate whether including a predictor in the ensemble will improve its overall prediction. In this paper, we provide metrics to compare predictors in regression ensembles, with particular emphasis on bagging ensembles. In the following section, we introduce some definitions and metrics for measuring diversity.
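As referenced above, the decomposition in Eq. (1) can be written per test point. The sketch below assumes uniform weights over the k models, with H(x) denoting the ensemble average of the individual predictions h_i(x); the notation is ours rather than that of [AV95]:

\[
\underbrace{\bigl(H(\mathbf{x}) - y\bigr)^{2}}_{\text{ensemble error}}
\;=\;
\underbrace{\frac{1}{k}\sum_{i=1}^{k}\bigl(h_i(\mathbf{x}) - y\bigr)^{2}}_{\text{average individual error}}
\;-\;
\underbrace{\frac{1}{k}\sum_{i=1}^{k}\bigl(h_i(\mathbf{x}) - H(\mathbf{x})\bigr)^{2}}_{\text{ambiguity } A(\mathbf{x})}
\]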

3 Definitions and Notation

Let D be a data set with N instances. Each instance contains a vector of features (denoted by x) and a continuous² valued target variable (denoted by y). It is assumed that there exists an underlying function f such that y = f(x) for each instance (x, y) in the training set. The goal of a supervised learning algorithm is to find an approximation h of f that can be used to predict the values of unseen instances in the test set. Typically, a machine learning algorithm searches the space of possible functions and returns a single approximation h of f; in ensemble learning, multiple hypotheses are built and combined, which reduces the risk of committing to a single poor hypothesis when the space of candidate hypotheses is large. Let h_1, h_2, ..., h_k represent k such hypotheses built on the data set D. Assume that averaging is used to combine the outputs of the models. Then the final output for a test data point x is given by

\[
H(\mathbf{x}) = \frac{h_1(\mathbf{x}) + h_2(\mathbf{x}) + \cdots + h_k(\mathbf{x})}{k}
\]

(this averaging combiner is also sketched in code below). In the following section we introduce some of the diversity measures that can be used with regression ensembles.

² In this paper we are concerned only with regression problems.
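As a minimal illustration of the averaging combiner H(x) referenced above, the following Python sketch assumes scikit-learn style regressors exposing a .predict method; the function name is our own and not part of the paper:

```python
import numpy as np

def ensemble_average(models, X):
    """Averaging combiner: H(x) = (1/k) * sum_i h_i(x) for k fitted regressors."""
    # Each model is assumed to expose a scikit-learn style .predict(X) method.
    predictions = np.column_stack([m.predict(X) for m in models])  # shape (N, k)
    return predictions.mean(axis=1)                                # shape (N,)
```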

4 Diversity Measures

In this section we define the statistics used to assess the similarity between two regressors, e.g., R^m and R^n. Let Y^m and Y^n represent the continuous-valued outputs of the regressors R^m and R^n. Thus Y^m and Y^n are N-dimensional vectors, Y^m = [y_1^m, y_2^m, ..., y_N^m] and Y^n = [y_1^n, y_2^n, ..., y_N^n], assuming there are N instances in the test set.

4.1 Correlation Coefficient

The correlation coefficient between Y^m and Y^n is defined as:

\[
\rho = \frac{\sum_{i=1}^{N} (y_i^m - \mu_{Y^m})\,(y_i^n - \mu_{Y^n})}{\sqrt{\sum_{i=1}^{N} (y_i^m - \mu_{Y^m})^2 \; \sum_{i=1}^{N} (y_i^n - \mu_{Y^n})^2}} \tag{2}
\]

The diversity of two predictors is inversely proportional to the correlation between them. As such, two regressors with a low correlation coefficient are preferred over two with a high correlation coefficient.
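A small Python sketch of Eq. (2) applied to the output vectors of two regressors; the NumPy implementation and function name are our own choices, not prescribed by the paper:

```python
import numpy as np

def correlation_diversity(y_m, y_n):
    """Pearson correlation between two regressors' output vectors (Eq. 2).
    A lower correlation coefficient is interpreted as higher diversity."""
    y_m, y_n = np.asarray(y_m, dtype=float), np.asarray(y_n, dtype=float)
    dm, dn = y_m - y_m.mean(), y_n - y_n.mean()
    return np.sum(dm * dn) / np.sqrt(np.sum(dm ** 2) * np.sum(dn ** 2))
```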

4.2 Covariance

The covariance between Y^m and Y^n is defined as:

\[
\mathrm{Cov}(Y^m, Y^n) = E\big[(Y^m - \mu_{Y^m})(Y^n - \mu_{Y^n})\big] \tag{3}
\]

Covariance and correlation are closely related quantities. However, because of the difference in how they are computed, we consider it worthwhile to examine both. As with correlation, covariance is inversely proportional to diversity.
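A corresponding sketch for Eq. (3), again under our own naming assumptions; the expectation is replaced by a sample average over the test instances:

```python
import numpy as np

def covariance_diversity(y_m, y_n):
    """Sample covariance between two regressors' output vectors (Eq. 3).
    As with correlation, lower covariance is interpreted as higher diversity."""
    y_m, y_n = np.asarray(y_m, dtype=float), np.asarray(y_n, dtype=float)
    return np.mean((y_m - y_m.mean()) * (y_n - y_n.mean()))
```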

4.3 Chi-Square

The chi-square of Y^m with respect to Y^n is defined as:

\[
\chi^2 = \sum_{i=1}^{N} \frac{(y_i^m - y_i^n)^2}{y_i^n} \tag{4}
\]

This is an asymmetric measure, and diversity is directly proportional to the chi-square value.
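A sketch of Eq. (4); note the asymmetry (swapping the two regressors changes the value). The assumption that the reference outputs are nonzero is ours:

```python
import numpy as np

def chi_square_diversity(y_m, y_n):
    """Chi-square of Y^m with respect to Y^n (Eq. 4); larger values indicate more diversity.
    Assumes the reference outputs y_n are nonzero (an assumption, not stated in the paper)."""
    y_m, y_n = np.asarray(y_m, dtype=float), np.asarray(y_n, dtype=float)
    return np.sum((y_m - y_n) ** 2 / y_n)
```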

4.4 Disagreement Measure

Kuncheva et al. [KW01] define the disagreement measure for classification ensembles as the ratio of the number of observations on which one classifier is correct and the other incorrect to the total number of observations. Let N be the number of instances, and let 0 denote a correct classification and 1 an incorrect classification. Then

\[
Dis_{i,j} = \frac{N^{01} + N^{10}}{N^{11} + N^{10} + N^{01} + N^{00}} \tag{5}
\]

where N^{ab} denotes the number of instances labeled a by the first classifier and b by the second classifier. For regression problems, we extend this measure as follows. For each instance x, we calculate the standard deviation, σ, of the target variable as estimated by all predictors. If the true value of the target is α, then a prediction β is considered correct if α − σ < β < α + σ, i.e., the prediction has to fall within a margin of one standard deviation of the true value of the target variable. Otherwise, the prediction is taken to be incorrect. We count the number of correct and incorrect predictions for each predictor over the data set and use the formula above to measure disagreement.
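The regression extension of the disagreement measure can be sketched as follows; the per-instance σ is computed across all predictors' estimates, and the array layout and helper name are our own assumptions:

```python
import numpy as np

def disagreement_measure(preds, y_true, i, j):
    """Disagreement (Eq. 5) between predictors i and j of a regression ensemble.

    preds  : array of shape (k, N), each row holding one predictor's estimates.
    y_true : array of shape (N,) with the true target values.
    A prediction is counted as correct if it lies within one standard deviation
    (taken over all k predictors' estimates for that instance) of the true value.
    """
    preds, y_true = np.asarray(preds, dtype=float), np.asarray(y_true, dtype=float)
    sigma = preds.std(axis=0)                    # per-instance spread of the ensemble
    correct = np.abs(preds - y_true) < sigma     # shape (k, N); True means "correct"
    ci, cj = correct[i], correct[j]
    n01 = np.sum(ci & ~cj)                       # i correct, j incorrect
    n10 = np.sum(~ci & cj)                       # i incorrect, j correct
    return (n01 + n10) / ci.size                 # denominator N^11 + N^10 + N^01 + N^00 = N
```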

4.5 Entropy

The mutual information between Y^m and Y^n is given by

\[
I(Y^m; Y^n) = H(Y^m) + H(Y^n) - H(Y^m, Y^n) \tag{6}
\]

where H(Y^m) and H(Y^n) are the differential entropies of Y^m and Y^n and H(Y^m, Y^n) is their joint differential entropy [YL04]. If Y^m and Y^n are Gaussian random variables with variances σ_m^2 and σ_n^2, the differential entropies are H(Y^m) = (1/2)[1 + log(2π σ_m^2)] and H(Y^n) = (1/2)[1 + log(2π σ_n^2)]. The joint differential entropy is given by

\[
H(Y^m, Y^n) = 1 + \log(2\pi) + \tfrac{1}{2}\log\!\big(\sigma_m^2 \sigma_n^2 (1 - \rho^2)\big) \tag{7}
\]

where ρ is the correlation coefficient defined above. It can be shown that the mutual information I(Y^m; Y^n) is then

\[
I(Y^m; Y^n) = -\tfrac{1}{2}\log(1 - \rho^2) \tag{8}
\]
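Under the Gaussian assumption above, the mutual information reduces to a function of the correlation coefficient alone (Eq. 8); a minimal sketch:

```python
import numpy as np

def mutual_information_diversity(y_m, y_n):
    """Mutual information between two regressors' outputs under a bivariate Gaussian
    assumption (Eq. 8): I = -0.5 * log(1 - rho^2). Lower values indicate more diversity."""
    rho = np.corrcoef(np.asarray(y_m, dtype=float), np.asarray(y_n, dtype=float))[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)
```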

In the following section we provide a mechanism for creating diverse ensembles.

5 Creating Diverse Regression Ensembles

To construct a diverse ensemble, for each data set we first construct a large number (ε) of predictors on data obtained by sampling with replacement. We assume that the ensemble to be constructed is of size η. To measure the diversity of the predictors, an ε × ε matrix is constructed such that the ij-th element of the matrix gives the diversity between predictor i and predictor j. To create an ensemble of size η, the top η/2 pairs of predictors with the highest diversity between them are selected, and each predictor is picked only once. We illustrate the mechanism of creating diverse ensembles with the following example (a code sketch of the selection step follows the example).

Example 1. Assume there are five predictors (as shown in Figure 1) built on the Computer Activity data set (described in Section 6) using 0.5% of the instances and only 5 randomly chosen attributes. The individual predictors have RMSEs of 20.376, 17.7025, 19.1591, 17.8133 and 18.2342. We are required to build an ensemble of size 4. For the 5 predictors, there are at most 5² = 25 pairwise combinations, i.e., (1,2), (2,1), (1,3), (3,1), (1,4), (4,1), (1,5), (5,1), (2,3), (3,2), (2,4), (4,2), (2,5), (5,2), (3,4), (4,3), (4,5), (5,4), (3,5), (5,3), (1,1), (2,2), (3,3), (4,4), (5,5). If a predictor is paired with itself, the diversity is 0. Of the remaining 20 possibilities, notice that symmetry is respected, i.e., (1,2) and (2,1) have the same diversity. Figure 2 shows the 5 × 5 matrices built for each of the diversity metrics. We are therefore left with 20/2 = 10 distinct pairs, from which selecting the top two pairs exhibiting the most diversity yields an ensemble of size 4.
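As referenced above, a hedged sketch of the selection step: compute the pairwise diversity scores on a held-out sample, rank the pairs, and greedily keep predictors from the most diverse pairs until η are chosen. The greedy tie-handling and the convention that larger scores mean more diversity are our assumptions; correlation, covariance and mutual information scores would be negated before being passed in.

```python
import numpy as np
from itertools import combinations

def select_diverse_ensemble(models, X_val, diversity_fn, eta):
    """Pick eta predictors by walking the most diverse pairs first.

    diversity_fn(y_i, y_j) must return a score where larger means more diverse
    (negate correlation/covariance/mutual information before passing them in).
    """
    preds = [np.asarray(m.predict(X_val), dtype=float) for m in models]
    pairs = sorted(
        combinations(range(len(models)), 2),
        key=lambda p: diversity_fn(preds[p[0]], preds[p[1]]),
        reverse=True,
    )
    chosen = []
    for i, j in pairs:                       # most diverse pair first
        for idx in (i, j):
            if idx not in chosen and len(chosen) < eta:
                chosen.append(idx)           # each predictor is picked at most once
        if len(chosen) >= eta:
            break
    return [models[idx] for idx in chosen]
```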


Fig. 1. Five regression trees forming the Random Forest, built on the Computer Activity dataset using 0.5% of the instances and 5 randomly chosen attributes. Panels (a)–(e): Tree 1 through Tree 5.


Fig. 2. 5 × 5 matrices obtained using different diversity measures. Panels: (a) correlation, (b) covariance, (c) disagreement measure, (d) entropy, (e) chi-square.


6 Experimental Results

In this section, we study the effect of choosing the most diverse predictors on the accuracy of the overall ensemble. In all experiments we use regression trees as predictors; however, the approach extends easily to other regressors such as SVMs, neural networks and so on. For each data set, we performed two sets of experiments: (1) the effect of the diversity of predictors on a Random Forest and (2) the effect of the diversity of predictors on a Bagged ensemble. For Random Forests, the number of trees in the original ensemble was varied between 20 and 100. In the first set of experiments, the number of attributes and the percentage of training data used to construct the forest were kept constant, while the number of "diverse" trees chosen from the ensemble was varied between 5 and 20. The performance of the "diverse" ensemble was compared to that of the original ensemble with all of its trees. In either case, the predictions of the trees were combined by simple averaging. We note that this is not the only method of combining predictions; other schemes, such as variants of majority voting³ or taking the minimum or maximum predicted value, can also be used. In the second set of experiments, the number of attributes used to construct the regression trees was varied while the size of the ensemble was kept constant. This allowed us to study the effect of "diversity" obtained from different types of regression trees in the forest. The above two experiments were also repeated for Bagged predictors. The metric used for comparing the performance of the trees is the Root Mean Square Error (RMSE). The Mean Squared Error (MSE) is the average of the squared prediction errors (i.e., the amounts by which the predictor's estimates differ from the true values being estimated); the square root of the MSE yields the RMSE (a minimal sketch is given below). In the following section we present a brief description of the datasets used for our experiments.

³ Majority voting as such cannot be applied since we are concerned primarily with regression problems.
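As referenced above, a minimal sketch of the RMSE computation used to compare the ensembles:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: the square root of the average squared prediction error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```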

6.1 Description of Datasets

We conducted experiments on two datasets, both downloaded from online repositories. A brief description of each dataset, along with the regression task, is given below.

Computer Activity: This dataset was downloaded from the Delve data repository [delte] and is a collection of computer systems activity measures. It was collected on a Sun SPARCstation 20/712 with 128 Mbytes of memory running in a university department, and contains 8192 instances, each having 21 continuous-valued attributes which measure operating system statistics such as the number of system fork calls per second, the number of pages paged out per second, and the number of transfers per second between system and user memory. The goal is to predict the portion of time that the CPUs ran in user mode.

US Houses: This dataset was downloaded from the CMU StatLib repository [delte]. It contains information regarding houses in California collected in the 1990 US census. There are 20,640 instances with 8 attributes which measure different characteristics of the houses, such as median house value, total number of rooms, housing median age, total rooms, total number of bedrooms, population, median income and house age. The objective is to predict the median sale price of a house in that geographic region.

For each data set, approximately 70% of the data is used for training the models; the remaining 30% is used as a blind test set.

6.2 Results and Discussion

Tables 1 through 6 each contain two parts: a description of the ensemble (the number of regression trees and the number of attributes used to construct the ensemble of trees) and the RMSE values. Six different RMSE values are presented: (1) the original ensemble without application of any diversity metric, used as a baseline against which the diverse ensembles are compared; (2) the ensemble formed by choosing trees whose predictions have the least correlation coefficient; (3) the ensemble formed by choosing trees whose predictions have the least covariance; (4) the ensemble formed by choosing trees whose predictions have the highest dissimilarity (i.e., the disagreement measure of Section 4.4); (5) the ensemble formed by choosing trees whose predictions have the least entropy; and (6) the ensemble formed by choosing trees whose predictions have the maximum chi-square measure.

Original trees | Diverse trees | RMSE original | RMSE correlation | RMSE covariance | RMSE dissimilarity | RMSE entropy | RMSE chi-square
20 | 5 | 2.6765 | 2.7751 | 2.8228 | 3.0228 | 2.8118 | 2.7751
40 | 5 | 2.6251 | 2.8266 | 2.7418 | 2.863 | 2.7534 | 2.8266
60 | 5 | 2.6212 | 2.7632 | 2.8123 | 2.938 | 2.8094 | 2.7632
80 | 5 | 2.6101 | 2.7603 | 2.8263 | 3.0414 | 2.7771 | 2.7603
100 | 5 | 2.5927 | 2.7674 | 2.7598 | 2.9456 | 2.7611 | 2.7647
20 | 10 | 2.6765 | 2.7139 | 2.7372 | 2.8593 | 2.7318 | 2.7319
40 | 10 | 2.6251 | 2.7198 | 2.6888 | 2.7674 | 2.6926 | 2.7209
60 | 10 | 2.6212 | 2.6809 | 2.7325 | 2.8203 | 2.7261 | 2.692
80 | 10 | 2.6101 | 2.7226 | 2.7503 | 2.9109 | 2.699 | 2.7171
100 | 10 | 2.5927 | 2.6785 | 2.6881 | 2.8388 | 2.718 | 2.6968
20 | 15 | 2.6765 | 2.678 | 2.6775 | 2.7682 | 2.6775 | 2.678
40 | 15 | 2.6251 | 2.6924 | 2.6699 | 2.6898 | 2.6526 | 2.6924
60 | 15 | 2.6212 | 2.6455 | 2.6732 | 2.7305 | 2.6936 | 2.6427
80 | 15 | 2.6101 | 2.6705 | 2.7008 | 2.7743 | 2.663 | 2.6705
100 | 15 | 2.5927 | 2.6513 | 2.6595 | 2.7332 | 2.6604 | 2.6374
20 | 20 | 2.6765 | 2.6765 | 2.6765 | 2.7415 | 2.6765 | 2.6765
40 | 20 | 2.6251 | 2.669 | 2.6416 | 2.6804 | 2.6487 | 2.669
60 | 20 | 2.6212 | 2.6415 | 2.6583 | 2.7114 | 2.6835 | 2.6392
80 | 20 | 2.6101 | 2.6585 | 2.6794 | 2.7402 | 2.6447 | 2.6616
100 | 20 | 2.5927 | 2.6238 | 2.6515 | 2.7016 | 2.6369 | 2.6342

Table 1. Performance of Diverse Regression Trees on the Computer Activity data set using Bagging. Only 50% of the instances are used in the tree construction process.


Fig. 3. Visualization of the correlation, covariance, chi-square, entropy and disagreement measures for the CPU dataset for an ensemble of size 20. Panels: (a) correlation, (b) covariance, (c) chi-square, (d) entropy, (e) disagreement.


Results for the Computer Activity Data: Figures 3(a)–3(e) illustrate the degree of "diversity" among the regression predictors (ensemble size 20) using the different metrics described above. Note that for the correlation coefficient, covariance, entropy and dissimilarity measures the matrices are symmetric about the diagonal, i.e., the upper and lower triangular parts are identical. This is because, with these measures, the "diversity" between predictors i and j is identical to that between predictors j and i. This is not true for the chi-square measure, and hence the lower triangular region in Figure 3(c) is shown as a black box.

Original trees | No. of attributes | RMSE original | RMSE correlation | RMSE covariance | RMSE dissimilarity | RMSE entropy | RMSE chi-square
20 | 5 | 7.8058 | 7.7147 | 7.7147 | 7.1282 | 7.6957 | 7.7044
40 | 5 | 8.2633 | 10.6376 | 10.6283 | 10.3703 | 8.5319 | 8.8026
60 | 5 | 7.3042 | 9.0991 | 9.0851 | 7.2256 | 7.8218 | 8.7232
80 | 5 | 7.1681 | 9.2824 | 11.6946 | 8.5565 | 8.5936 | 9.7962
100 | 5 | 6.6297 | 10.8124 | 10.8105 | 5.7123 | 8.8505 | 8.7711
20 | 10 | 2.7261 | 2.7891 | 2.8072 | 2.7979 | 2.7888 | 2.7891
40 | 10 | 3.3521 | 5.776 | 5.7639 | 5.2076 | 5.7254 | 5.7825
60 | 10 | 3.2285 | 7.4817 | 7.4832 | 3.7794 | 4.893 | 7.4791
80 | 10 | 3.777 | 8.5187 | 8.5187 | 3.8247 | 6.7021 | 8.4156
100 | 10 | 3.4495 | 9.4278 | 10.3194 | 3.8363 | 5.1055 | 9.3333
20 | 15 | 2.6765 | 2.678 | 2.6775 | 2.7682 | 2.6775 | 2.678
40 | 15 | 2.6251 | 2.6924 | 2.6699 | 2.6898 | 2.6526 | 2.6924
60 | 15 | 2.6212 | 2.6455 | 2.6732 | 2.7305 | 2.6936 | 2.6427
80 | 15 | 2.6101 | 2.6705 | 2.7008 | 2.7743 | 2.663 | 2.6705
100 | 15 | 2.5927 | 2.6513 | 2.6595 | 2.7332 | 2.6604 | 2.6374
20 | 20 | 2.6765 | 2.6765 | 2.6765 | 2.7415 | 2.6765 | 2.6765
40 | 20 | 2.6251 | 2.669 | 2.6416 | 2.6804 | 2.6487 | 2.669
60 | 20 | 2.6212 | 2.6415 | 2.6583 | 2.7114 | 2.6835 | 2.6392
80 | 20 | 2.6101 | 2.6585 | 2.6794 | 2.7402 | 2.6447 | 2.6616
100 | 20 | 2.5927 | 2.6238 | 2.6515 | 2.7016 | 2.6369 | 2.6342

Table 2. Performance of Diverse Regression Trees on the Computer Activity data set using Random Forest. Only 50% of the instances are used in the tree construction process, and the number of attributes is varied between 5 and 20. The size of the diverse ensemble was kept constant at 15.

Table 3 illustrates the performance of "diverse" regression trees in a Random Forest using the different heuristics for diversity. The results show that when the original ensemble size is large (say 80 or 100 trees in the forest), selecting a small and diverse ensemble using entropy can be very useful, as the performance of the "diverse" ensemble is very close to that of the original large ensemble. However, the choice of diversity metric is critical for this purpose, and a similar effect is not seen when the correlation, covariance or dissimilarity measures are used; this is indicated by the much higher RMSE values for these metrics. Furthermore, increasing the size of the "diverse" ensemble is seen to produce lower RMSE values. In general, regression ensembles designed with the entropy or chi-square diversity measures result in lower RMSE values than those designed with correlation, covariance or the dissimilarity measure.

Another set of experiments was designed to study the effect of increasing the number of attributes used to construct the trees in the forest. The number of "diverse" trees was kept constant at 15, and both the size of the original ensemble and the number of attributes used to construct the forest were varied. Table 2 presents the results.

Table 1 illustrates the performance of "diverse" regression trees in a Bagged ensemble using the different heuristics for diversity. First, note that for this dataset Bagging performs better than a Random Forest of trees constructed using 5 randomly chosen attributes and 50% of the data for training. Furthermore, it is clear that a much smaller ensemble with high diversity can be constructed, and such a "diverse" ensemble is likely to have performance comparable to the original large ensemble. For example, an ensemble of 100 Bagged trees constructed using 50% of the instances and averaged to produce the final prediction has an RMSE of 2.5927. In comparison, if 5 Bagged trees are selected using the entropy diversity measure and averaged, the resulting ensemble gives an RMSE of 2.7611. This illustrates that careful choice of diversity metrics can lead to the formation of very small yet effective ensembles.

Original trees | Diverse trees | RMSE original | RMSE correlation | RMSE covariance | RMSE dissimilarity | RMSE entropy | RMSE chi-square
20 | 5 | 7.8058 | 9.334 | 9.2865 | 7.4017 | 9.3559 | 9.4126
40 | 5 | 8.2633 | 12.1441 | 12.1441 | 16.8118 | 9.4507 | 9.8094
60 | 5 | 7.3042 | 9.9148 | 9.9587 | 4.5755 | 9.1586 | 9.3385
80 | 5 | 7.1681 | 10.2086 | 12.0487 | 7.5286 | 9.0706 | 9.8139
100 | 5 | 6.6297 | 12.2378 | 12.3071 | 9.9604 | 7.7023 | 9.5028
20 | 10 | 7.8058 | 10.4705 | 10.4705 | 10.8207 | 7.3643 | 8.9764
40 | 10 | 8.2633 | 9.5627 | 9.5107 | 10.9644 | 9.1056 | 9.4502
60 | 10 | 7.3042 | 9.3533 | 9.2792 | 7.692 | 7.8043 | 9.4388
80 | 10 | 7.1681 | 10.6144 | 10.5883 | 6.1038 | 3.5539 | 9.1155
100 | 10 | 6.6297 | 9.4687 | 10.7988 | 8.9003 | 10.6472 | 10.4657
20 | 15 | 7.8058 | 7.7147 | 7.7147 | 7.1282 | 7.6957 | 7.7044
40 | 15 | 8.2633 | 10.6376 | 10.6283 | 10.3703 | 8.5319 | 8.8026
60 | 15 | 7.3042 | 9.0991 | 9.0851 | 7.2256 | 7.8218 | 8.7232
80 | 15 | 7.1681 | 9.2824 | 11.6946 | 8.5565 | 8.5936 | 9.7962
100 | 15 | 6.6297 | 10.8124 | 10.8105 | 5.7123 | 8.8505 | 8.7711
20 | 20 | 8.2633 | 6.4243 | 6.4243 | 8.7217 | 6.4243 | 6.4243
40 | 20 | 6.2769 | 10.1171 | 10.8036 | 8.4968 | 8.3658 | 9.2803
60 | 20 | 7.3042 | 8.9136 | 9.5645 | 6.144 | 7.8774 | 8.6936
80 | 20 | 7.1681 | 9.7454 | 11.0584 | 8.5051 | 8.4773 | 9.5162
100 | 20 | 6.6297 | 10.3971 | 10.3506 | 4.9325 | 8.6396 | 8.6989

Table 3. Performance of Diverse Regression Trees on the Computer Activity data set using Random Forests. Only 5 attributes are chosen at random for the construction of each decision tree in the forest, and 50% of the instances are used in the tree construction process.

Original trees | Diverse trees | RMSE original | RMSE correlation | RMSE covariance | RMSE dissimilarity | RMSE entropy | RMSE chi-square
20 | 5 | 1.6555 | 1.8539 | 2.1667 | 2.6256 | 1.5375 | 1.5673
40 | 5 | 1.3084 | 1.5474 | 1.5474 | 2.0026 | 1.5919 | 1.565
60 | 5 | 1.5131 | 1.5498 | 1.5305 | 2.6396 | 1.5609 | 1.562
80 | 5 | 1.3704 | 1.5368 | 1.5368 | 1.0236 | 1.6549 | 1.5521
20 | 10 | 1.6555 | 1.706 | 1.8963 | 1.8963 | 2.177 | 1.549
40 | 10 | 1.3084 | 1.5191 | 1.5201 | 1.8358 | 1.6068 | 1.5331
60 | 10 | 1.5131 | 1.5259 | 1.5199 | 1.8583 | 1.6059 | 1.5464
80 | 10 | 1.3704 | 1.5152 | 1.5152 | 1.1053 | 1.6248 | 1.5374
20 | 15 | 1.6555 | 1.6638 | 1.7531 | 1.8193 | 1.5791 | 1.5747
40 | 15 | 1.3084 | 1.5085 | 1.5005 | 1.6338 | 1.603 | 1.5227
60 | 15 | 1.5131 | 1.6179 | 1.61 | 1.8419 | 1.5433 | 1.5312
80 | 15 | 1.3704 | 1.4902 | 1.551 | 1.1291 | 1.551 | 1.496
20 | 20 | 1.6555 | 1.6555 | 1.6555 | 1.7719 | 1.6555 | 1.6555
40 | 20 | 1.3084 | 1.4989 | 1.5002 | 1.625 | 1.5764 | 1.5227
60 | 20 | 1.5131 | 1.6758 | 1.5858 | 1.789 | 1.5445 | 1.5233
80 | 20 | 1.3704 | 1.4732 | 1.4732 | 1.1871 | 1.5262 | 1.4899

Table 4. Performance of Diverse Regression Trees on the US Houses data set using Random Forest. Only 50% of the instances and 5 attributes are used in the tree construction process.

Results for the US Houses Data: Figures 4(a)–4(e) illustrate the degree of "diversity" among the regression predictors (ensemble size 20) using the different metrics described in Section 4. Table 4 shows the effect of choosing diverse ensembles of different sizes on this data; all of these trees were constructed using the Random Forest algorithm proposed by Breiman [Bre01]. Using the entropy and chi-square diversity metrics, relatively small diverse ensembles can be built such that their RMSE values are comparable to those of the original large ensemble. Table 5 provides results for Bagged ensembles. On this data set also, Bagging seems to perform slightly better than Random Forests, and the diverse ensembles built on Bagged trees have slightly lower RMSE than their Random Forest counterparts. In addition, the dissimilarity measure and entropy are found to outperform the other metrics, namely correlation, covariance and chi-square. Lastly, Table 6 reports the effect of varying the size of the feature space when building Random Forest ensembles. Since the number of attributes in this data set is small (only 8), we tested the effect of building trees on 3 and 5 attributes. No significant change is noticed as the number of attributes varies. This implies that one could safely replace a very large ensemble of, say, 80 trees with a more diverse ensemble of size 20.

7 Conclusions and Future Work

Ensemble learning techniques such as bagging, stacking, boosting and random forests have been used extensively for improving the accuracy of predictive models. Two factors play an important role in the design of ensembles: the accuracy of the individual predictors and how different they are from one another.


Fig. 4. Visualization of the correlation, covariance, chi-square, entropy and disagreement measures for the US Houses data for an ensemble of size 20. Panels: (a) correlation, (b) covariance, (c) chi-square, (d) entropy, (e) disagreement.


Original trees | Diverse trees | RMSE original | RMSE correlation | RMSE covariance | RMSE dissimilarity | RMSE entropy | RMSE chi-square
20 | 5 | 0.9975 | 1.1004 | 1.0702 | 1.0275 | 1.0847 | 1.1167
40 | 5 | 0.9798 | 1.0922 | 1.0702 | 1.1414 | 1.0388 | 1.0856
60 | 5 | 1.0199 | 1.117 | 1.1004 | 1.0269 | 1.1309 | 1.1028
80 | 5 | 0.9902 | 1.137 | 1.0609 | 1.0522 | 0.9708 | 1.1415
20 | 10 | 0.9803 | 1.0115 | 1.0038 | 1.0702 | 1.0015 | 1.0115
40 | 10 | 0.984 | 1.053 | 1.0295 | 1.0311 | 0.9497 | 1.0658
60 | 10 | 0.9888 | 1.0754 | 1.0641 | 1.1265 | 1.0023 | 1.0767
80 | 10 | 1.0066 | 1.0701 | 1.0367 | 0.9831 | 0.9967 | 1.0649
20 | 15 | 1.0032 | 1.0201 | 1.0216 | 1.0139 | 1.0198 | 1.0251
40 | 15 | 0.9854 | 1.0375 | 1.0246 | 1.0341 | 0.9882 | 1.0233
60 | 15 | 0.9998 | 1.0833 | 1.0601 | 0.9927 | 1.0213 | 1.0859
80 | 15 | 1.0300 | 0.9938 | 0.9678 | 1.0055 | 0.9751 | 1.0367
20 | 20 | 1.0053 | 1.0053 | 1.0053 | 1.0966 | 1.0053 | 1.0053
40 | 20 | 0.976 | 1.0139 | 0.9862 | 0.9623 | 0.9862 | 1.0165
60 | 20 | 0.9589 | 1.0291 | 1.0072 | 0.9911 | 0.9536 | 1.0339
80 | 20 | 0.9849 | 1.0602 | 1.0386 | 1.0349 | 0.9976 | 1.0562

Table 5. Performance of Diverse Regression Trees on the US Houses data set using Bagging. Only 50% of the instances and 5 attributes are used in the tree construction process.

Original trees | No. of attributes | RMSE original | RMSE correlation | RMSE covariance | RMSE dissimilarity | RMSE entropy | RMSE chi-square
20 | 3 | 1.6511 | 1.6512 | 1.6627 | 1.4219 | 1.6512 | 1.6512
40 | 3 | 1.7889 | 2.1544 | 2.1544 | 1.5794 | 1.5543 | 1.547
60 | 3 | 1.6789 | 2.2437 | 2.15 | 1.7107 | 1.5163 | 1.5487
80 | 3 | 1.6516 | 2.3614 | 2.2461 | 1.7212 | 1.4985 | 1.5397
20 | 5 | 1.6555 | 1.6555 | 1.6555 | 1.7719 | 1.6555 | 1.6555
40 | 5 | 1.3084 | 1.4989 | 1.5002 | 1.625 | 1.5764 | 1.5227
60 | 5 | 1.5131 | 1.6758 | 1.5858 | 1.789 | 1.5445 | 1.5233
80 | 5 | 1.3704 | 1.4732 | 1.4732 | 1.1871 | 1.5262 | 1.4899

Table 6. Performance of Diverse Regression Trees on the US Houses data set using Random Forest. Only 50% of the instances are used in the tree construction process, and the number of attributes is varied between 3 and 5. The size of the diverse ensemble is kept constant at 20.


If most individual predictors have low error, then it is likely that they are very similar to one another; in contrast, if they are very different, the prediction error may be high. In this paper, we first study metrics that can be used for measuring the diversity of regression ensembles. We suggest the correlation coefficient, covariance, chi-square, the disagreement measure and entropy as possible measures of diversity between regressors. We present experimental results testing these metrics on two real datasets, the Computer Activity data and the US Houses data. Our results indicate that relatively small ensembles created using the entropy and chi-square diversity metrics have performance comparable to that of the larger, more complex original ensembles. Our future work will involve analysis of the effect of pruning trees on the diversity of ensembles. In addition, we will perform a theoretical analysis of diversity measures and their effect on the accuracy of regression ensembles.

8 Acknowledgments

This section will be provided if the paper is accepted for publication.

References

[AV95] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, pages 231–238. MIT Press, 1995.
[BHBK03] R. E. Banfield, L. O. Hall, K. W. Bowyer, and P. W. Kegelmeyer. A new ensemble diversity measure applied to thinning ensembles. In Proceedings of the International Workshop on Multiple Classifier Systems, pages 306–316, 2003.
[Bre93] L. Breiman. Stacked regression. Technical Report TR-367, University of California, Berkeley, 1993.
[Bre96] L. Breiman. Bagging predictors. Machine Learning, pages 123–140, 1996.
[Bre00] L. Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40:229–242, 2000.
[Bre01] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[BWT05] G. Brown, J. L. Wyatt, and P. Tiňo. Managing diversity in regression ensembles. Journal of Machine Learning Research, 6:1621–1650, 2005.
[Che96] K. J. Cherkauer. Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks. In P. Chan, editor, Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, pages 15–21, Menlo Park, 1996. AAAI Press.
[Chr03] S. W. Christensen. Ensemble construction via designed output distortion. In Proceedings of the 4th International Workshop on Multiple Classifier Systems, Guildford, Surrey, UK, Lecture Notes in Computer Science 2709, pages 286–295, 2003.
[CMNM06] R. Caruana, A. Munson, and A. Niculescu-Mizil. Getting the most out of ensemble selection. In ICDM '06: Proceedings of the Sixth International Conference on Data Mining, pages 828–833, Washington, DC, USA, 2006. IEEE Computer Society.
[CNMCK04] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In Proceedings of the 21st International Conference on Machine Learning, pages 137–144, 2004.
[DB95] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
[DC96] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems, volume 8, pages 479–485, 1996.
[delte] Delve data repository, http://www.cs.toronto.edu/~delve/data/datasets.html.
[Die02] T. G. Dietterich. Ensemble learning. In The Handbook of Brain Theory and Neural Networks, Second Edition, 2002.
[GRF00] G. Giacinto, F. Roli, and G. Fumera. Design of effective multiple classifier systems by clustering of classifiers. In Proceedings of ICPR 2000, 15th International Conference on Pattern Recognition, pages 3–8, 2000.
[Ho98] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:832–844, 1998.
[KW01] L. I. Kuncheva and C. J. Whitaker. Ten measures of diversity in classifier ensembles: limits for two classifiers. In A DERA/IEE Workshop on Intelligent Sensor Processing (Ref. No. 2001/050), pages 10/1–10/10, February 2001.
[KW03] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.
[LY99] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12:1399–1404, 1999.
[Nil65] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York, 1965.
[OS99] D. Opitz and J. Shavlik. A genetic algorithm approach for creating neural network ensembles. In Combining Artificial Neural Networks, pages 79–99. Springer-Verlag, 1999.
[OT08] N. C. Oza and K. Tumer. Key real-world applications of classifier ensembles. Information Fusion, Special Issue on Applications of Ensemble Methods, 9(1):4–20, 2008.
[QSxSy05] F. Qiang, H. Shang-xu, and Z. Sheng-ying. Clustering-based selective neural network ensemble. Journal of Zhejiang University - Science A, 6:387–392, 2005.
[REBK05] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Ensemble diversity measures and their application to thinning. Information Fusion (Special Issue on Diversity in Multiple Classifier Systems), 6(1):49–62, 2005.
[RP06] R. Ranawana and V. Palade. Multi-classifier systems: Review and a roadmap for developers. International Journal of Hybrid Intelligent Systems, 3(1):35–61, 2006.
[Sch99] R. E. Schapire. Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference, pages 1–10, 1999.
[SS97] A. J. C. Sharkey and N. E. Sharkey. Combining diverse neural nets. The Knowledge Engineering Review, 12:231–247, 1997.
[TSY06] E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures. Machine Learning, 65(1):247–271, 2006.
[UCIte] UCI Machine Learning Repository, http://www.csi.uci.edu/mlearn.
[Wol92] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[YL04] X. Yao and Y. Liu. Evolving neural network ensembles by minimization of mutual information. International Journal of Hybrid Intelligent Systems, 1(1-2):12–21, 2004.
