Empirical comparison of resampling methods ... - Semantic Scholar

Empirical comparison of resampling methods using genetic fuzzy systems for a regression problem

Tadeusz Lasota1, Zbigniew Telec2, Grzegorz Trawiński3, Bogdan Trawiński2

1 Wrocław University of Environmental and Life Sciences, Dept. of Spatial Management, ul. Norwida 25/27, 50-375 Wrocław, Poland
2 Wrocław University of Technology, Institute of Informatics, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
3 Wrocław University of Technology, Faculty of Electronics, Wybrzeże S. Wyspiańskiego 27, 50-370 Wrocław, Poland

[email protected], [email protected], {zbigniew.telec, bogdan.trawinski}@pwr.wroc.pl

Abstract. Over the last fifteen years, much attention in the machine learning field has been given to the study of resampling techniques. This paper presents an investigation of m-out-of-n bagging with and without replacement and of repeated cross-validation using genetic fuzzy systems. All experiments were conducted with real-world data derived from a cadastral system and a registry of real estate transactions. The bagging ensembles created using genetic fuzzy systems achieved prediction accuracy no worse than the experts’ method employed in practice. This confirms that automated valuation models can be successfully used to support appraisers’ work.

Keywords: genetic fuzzy systems, bagging, subagging, cross-validation

1 Introduction

Resampling techniques and ensemble models have attracted the attention of many researchers over the last fifteen years. Bagging, which stands for bootstrap aggregating, was devised by Breiman [4] and is among the simplest and most intuitive ensemble algorithms that provide good performance. Diversity of learners is obtained by using bootstrapped replicas of the training data; that is, different training data subsets are randomly drawn with replacement from the original base dataset. The training data subsets obtained in this way, also called bags, are then used to train different classification or regression models. Finally, the individual learners are combined through algebraic expressions. The classic form of bagging is the n-out-of-n bootstrap with replacement, where the number of samples in each bag equals the cardinality of the base dataset and the whole original dataset is used as a test set.

Much effort has been made to achieve better computational effectiveness by introducing subsampling techniques, which consist in drawing smaller numbers of samples from the original dataset, with or without replacement. The m-out-of-n bagging without replacement, where at each step m < n distinct observations are chosen at random from the base dataset, is one such variant. This alternative aggregation scheme was called subagging, for subsample aggregating, by Bühlmann and Yu [5]. In the literature, resampling methods of the same nature as subagging are also named Monte Carlo cross-validation [24] or repeated holdout [3]. In turn, subagging with replacement was called moon-bagging, standing for m-out-of-n bootstrap aggregating [2].

The above-mentioned resampling techniques are still under active theoretical and experimental investigation [2], [3], [5], [6], [11], [12], [24]. Theoretical analyses and experimental results to date have shown the benefits of bagging, especially in terms of stability improvement and variance reduction of learners for both classification and regression problems. Bagging techniques both with and without replacement may provide improvements in prediction accuracy in a range of settings. Moreover, the n-out-of-n bootstrap with replacement and n/2-out-of-n sampling without replacement, i.e. half-sampling, may give fairly similar results. The majority of the experiments were conducted with statistical models, decision trees, and neural networks, which are less computationally intensive than the genetic fuzzy systems reported in this paper.

The size of bootstrapped replicas in bagging is usually equal to the number of instances in the original dataset, and the base dataset is commonly used as a test set for each generated component model. However, it is claimed that this leads to an optimistic estimate, i.e. an underestimation, of the prediction error. Therefore out-of-bag samples, i.e. those included in the base dataset but not drawn to the respective bags, are used to compute the test error. These, in turn, may lead to a pessimistic estimate, i.e. an overestimation, of the prediction error. In consequence, the .632 and .632+ corrections of the out-of-bag prediction error were proposed [3], [10].

The main focus of soft computing techniques to assist with real estate appraisals has been directed towards neural networks [17], [22]; fewer researchers have been involved in the application of fuzzy systems [1], [13].
So far, we have investigated several methods of constructing regression models to assist with real estate appraisal: evolutionary fuzzy systems, neural networks, decision trees, and statistical algorithms, using the MATLAB, KEEL, RapidMiner, and WEKA data mining systems [15], [18], [20]. We have also studied ensemble models created with these computational intelligence techniques [16], [19], [21], employing the classic bagging approach. In this paper we go one step further: we compare m-out-of-n bagging with and without replacement, using different sample sizes, with a property valuation method employed by professional appraisers in practice and with standard 10-fold cross-validation. We apply genetic fuzzy systems to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions obtained from a cadastral system. As it is often necessary to understand the behaviour of property valuation models, we chose genetic fuzzy systems as base learners because of their high interpretability compared to neural networks or support vector machines. The investigation was conducted with our newly developed system, implemented in Matlab, for testing multiple models using different resampling methods.
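To make the .632 correction mentioned in this introduction concrete, the following minimal Python sketch shows where the constant comes from and how the corrected estimate is formed. The function names are hypothetical and are not part of the authors' Matlab system; the weighting follows the standard .632 bootstrap estimator, and treating the out-of-bag error as the pessimistic component is an assumption of this sketch.

```python
def expected_unique_fraction(n: int) -> float:
    """Expected share of distinct base-set items in one n-out-of-n bootstrap bag.

    Each item is missed by a single draw with probability (1 - 1/n), so it is
    missed by all n draws with probability (1 - 1/n)**n, which tends to 1/e.
    """
    return 1.0 - (1.0 - 1.0 / n) ** n


def err_632(resubstitution_error: float, oob_error: float) -> float:
    """.632 estimate: weighted mix of the optimistic and the pessimistic error.

    resubstitution_error: error measured on the data used for training
    oob_error: error measured on out-of-bag samples (assumed input here)
    """
    return 0.368 * resubstitution_error + 0.632 * oob_error


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(round(expected_unique_fraction(200), 3))
    print(err_632(0.010, 0.020))
```

The corrected value always lies between the two inputs, closer to the out-of-bag error, which is why it is used to temper both the optimistic and the pessimistic estimate.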

2 Methods Used and Experimental Setup

The investigation was conducted with our new experimental system implemented in the Matlab environment using the Fuzzy Logic, Global Optimization, Neural Network, and Statistics toolboxes [9], [14]. The system was designed to carry out research into machine learning algorithms using various resampling methods and to construct and evaluate ensemble models for regression problems.

The real-world dataset used in the experiments was drawn from an unrefined dataset containing more than 50 000 records referring to residential premises transactions concluded in one big Polish city with a population of 640 000 within 11 years, from 1998 to 2008. The final dataset comprised the 5213 samples for which the experts could estimate the value using their pairwise comparison method. Because the prices of premises change substantially over time, the whole 11-year dataset cannot be used to create data-driven models; therefore it was split into 20 half-year subsets. The sizes of the half-year data subsets are given in Table 1.

Table 1. Number of instances in half-year datasets

Dataset  Instances    Dataset  Instances
1998-2   202          2003-2   386
1999-1   213          2004-1   278
1999-2   264          2004-2   268
2000-1   162          2005-1   244
2000-2   167          2005-2   336
2001-1   228          2006-1   300
2001-2   235          2006-2   377
2002-1   267          2007-1   289
2002-2   263          2007-2   286
2003-1   267          2008-1   181

In order to compare evolutionary machine learning algorithms with techniques applied to property valuation, we asked experts to evaluate premises by applying their pairwise comparison method to historical data of sales/purchase transactions recorded in a cadastral system. The experts developed a computer program which simulated their routine work and was able to estimate the experts’ prices of a large number of premises automatically.

First of all, the whole area of the city was divided into 6 quality zones. Next, the premises located in each zone were classified into 243 groups (3^5 = 243, i.e. three brackets for each of five features) determined by the following 5 quantitative features selected as the main price drivers: Area, Year, Storeys, Rooms, and Centre. The domain of each feature was split into three brackets as follows.

Area denotes the usable area of premises and comprises small flats up to 40 m2, medium flats in the bracket 40 to 60 m2, and big flats above 60 m2.

Year (Age) means the year of construction of a building and consists of old buildings constructed before 1945, medium-age ones built in the period 1945 to 1960, and new buildings constructed between 1960 and 1996. The buildings falling into the individual ranges are treated as being in bad, medium, and good physical condition, respectively.

Storeys denotes the height of a building and comprises low houses up to three storeys, multi-family houses from 4 to 5 storeys, and tower blocks above 5 storeys.

Rooms denotes the number of rooms in a flat including a kitchen. The data contain small flats up to 2 rooms, medium flats in the bracket 3 to 4, and big flats above 4 rooms.

Centre stands for the distance from the city centre and includes buildings located near the centre, i.e. up to 1.5 km, at a medium distance from the centre, in the bracket 1.5 to 5 km, and far from the centre, above 5 km.

Then the prices of premises were updated according to the trends of the value changes over time. Starting from the second half-year of 1998, the prices were updated for the last day of consecutive half-years. The trends were modelled by polynomials of degree three. The premises estimation procedure employed a two-year time window to take into consideration transaction data of similar premises.
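The grouping described above, together with the estimation steps listed next, can be sketched in Python as follows. This is an illustration under stated assumptions, not the experts' actual program: the names `bracket`, `group_index`, `Sale`, and `estimate` are hypothetical, the treatment of values lying exactly on a bracket boundary is assumed (the text does not specify it), and prices are assumed to have already been updated to the end of the relevant half-year.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import List, Optional


def bracket(value: float, low: float, high: float) -> int:
    """Return 0, 1 or 2 for the lower, middle and upper bracket.

    Assumption: a value equal to a cut point falls into the lower bracket.
    """
    if value <= low:
        return 0
    return 1 if value <= high else 2


def group_index(area_m2, year, storeys, rooms, centre_km) -> int:
    """Encode the five bracketed features as one of 3**5 = 243 groups."""
    codes = [
        bracket(area_m2, 40, 60),     # small / medium / big flats
        bracket(year, 1944, 1960),    # before 1945 / 1945-1960 / after 1960
        bracket(storeys, 3, 5),       # low / 4-5 storeys / tower blocks
        bracket(rooms, 2, 4),         # up to 2 / 3-4 / above 4 rooms
        bracket(centre_km, 1.5, 5),   # near / medium / far from the centre
    ]
    return sum(code * 3 ** i for i, code in enumerate(codes))


@dataclass
class Sale:
    zone: int      # quality zone, 1..6
    group: int     # value returned by group_index
    sold: date     # transaction date
    price: float   # assumed already updated to the end of its half-year


# Zones 1 & 2, 3 & 4 and 5 & 6 are merged in the fallback pass.
MERGED_ZONE = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}


def estimate(target: Sale, history: List[Sale]) -> Optional[float]:
    """Average price of similar earlier sales, or None if fewer than three."""
    # First pass: same zone, current and one preceding year.
    # Second pass: merged zones, current and two preceding years.
    for years_back, merge in ((1, False), (2, True)):
        prices = []
        for s in history:
            if merge:
                same_zone = MERGED_ZONE[s.zone] == MERGED_ZONE[target.zone]
            else:
                same_zone = s.zone == target.zone
            recent = target.sold.year - years_back <= s.sold.year
            if s.sold < target.sold and recent and same_zone and s.group == target.group:
                prices.append(s.price)
        if len(prices) >= 3:
            return mean(prices)
    return None
```

Encoding the five brackets as a base-3 number keeps the group a single integer, so "assigned to the same group" reduces to an equality test. Returning `None` when neither pass finds three similar sales is a design choice of this sketch; the text does not say how such premises were handled.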

1. Take the next premises to estimate.
2. Check the completeness of values of all five features and note the transaction date.
3. Select all premises sold earlier than the one being appraised, within the current and one preceding year, and assigned to the same group.
4. If there are at least three such premises, calculate the average price, taking the prices updated for the last day of a given half-year.
5. Return this average as the estimated value of the premises.
6. Repeat steps 1 to 5 for all premises to be appraised.
7. For all premises not satisfying the condition determined in step 4, extend the quality zones by merging zones 1 & 2, 3 & 4, and 5 & 6. Moreover, extend the time window to include the current and two preceding years.
8. Repeat steps 1 to 5 for all remaining premises.

Our study consisted in the application of an evolutionary approach, namely genetic fuzzy systems (GFS), to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions obtained from a cadastral system. In the GFS approach, three triangular and trapezoidal membership functions for each input variable, and five functions for the output, were automatically determined by the symmetric division of the individual attribute domains. The evolutionary optimization process combined both learning the rule base and tuning the membership functions using real-coded chromosomes. Similar designs are described in [7], [8], [18].

The following resampling methods were applied in the experiments and compared with the standard 10cv and the experts’ method.

Bag: B100, B70, B50, B30 – m-out-of-n bagging with replacement with different sizes of samples, using the whole base dataset as a test set. The numbers in the codes indicate what percentage of the base dataset was drawn to create a training set.

OoB: O100, O70, O50, O30 – m-out-of-n bagging with replacement with different sizes of samples, tested with the out-of-bag datasets. The numbers in the codes indicate what percentage of the base dataset was drawn to create a training set.

RHO: H90, H70, H50, H30 – repeated holdout (50 times in our research), i.e. m-out-of-n bagging without replacement with different sizes of samples. The numbers in the codes indicate what percentage of the base dataset was drawn to create a training set.

RCV: 1x50cv, 5x10cv, 10x5cv, 25x2cv – repeated cross-validation; k-fold cross-validation splits for k = 50, 10, 5, and 2 were repeated 1, 5, 10, and 25 times, respectively, to obtain 50 pairs of training and test sets.

In the case of bagging methods, 50 bootstrap replicates (bags) were created on the basis of each base dataset. The mean square error (MSE) was used as the performance function, and simple averages were employed as aggregation functions. The data were normalized using the min-max approach.
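The four resampling schemes described in this section differ only in how the 50 pairs of training and test indices are drawn. The following Python sketch generates such pairs with the standard library; it is not the authors' Matlab code, the function names are hypothetical, and rounding m to the nearest integer is an assumption.

```python
import random
from typing import List, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, test indices)


def bag_split(n: int, pct: int, rng: random.Random, out_of_bag: bool) -> Split:
    """m-out-of-n draw WITH replacement (Bag and OoB variants)."""
    m = round(n * pct / 100)
    train = [rng.randrange(n) for _ in range(m)]
    if out_of_bag:
        drawn = set(train)
        test = [i for i in range(n) if i not in drawn]  # OoB: items never drawn
    else:
        test = list(range(n))                           # Bag: whole base dataset
    return train, test


def holdout_split(n: int, pct: int, rng: random.Random) -> Split:
    """m-out-of-n draw WITHOUT replacement (RHO); the rest is the test set."""
    m = round(n * pct / 100)
    order = rng.sample(range(n), n)  # random permutation of 0..n-1
    return order[:m], order[m:]


def repeated_cv(n: int, k: int, repeats: int, rng: random.Random) -> List[Split]:
    """k-fold cross-validation repeated `repeats` times (RCV)."""
    pairs = []
    for _ in range(repeats):
        order = rng.sample(range(n), n)
        folds = [order[f::k] for f in range(k)]  # k nearly equal folds
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            pairs.append((train, folds[f]))
    return pairs


rng = random.Random(0)
n = 202  # size of the 1998-2 subset in Table 1
o70 = [bag_split(n, 70, rng, out_of_bag=True) for _ in range(50)]
h90 = [holdout_split(n, 90, rng) for _ in range(50)]
cv5x10 = repeated_cv(n, k=10, repeats=5, rng=rng)
print(len(o70), len(h90), len(cv5x10))  # 50 50 50
```

Each scheme yields 50 (training, test) pairs per base dataset, which matches the setup above and keeps the number of trained models comparable across methods.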

3 Results of Experiments

The performance of the Bag, OoB, RHO, and RCV models created by genetic fuzzy systems (GFS) in terms of MSE is illustrated graphically in Figure 1 (Bag and OoB) and Figure 2 (RHO and RCV). In each figure, the results for the 10cv and Expert methods are shown for comparison. The Friedman test performed on the MSE values of all models built over the 20 half-year datasets showed that there are significant differences between some models. Average ranks of individual models are shown in Table 2, where the lower the rank value, the better the model. Tables 3 and 4 present the results of the nonparametric Wilcoxon signed-rank test for pairwise comparison of model performance. The null hypothesis stated that there were no significant differences in accuracy, in terms of MSE, between given pairs of models. In both tables + denotes that the model in the row performed significantly better than, – significantly worse than, and ≈ statistically equivalent to the one in the corresponding column. Slashes (/) separate the results for the individual methods. The significance level considered for rejection of the null hypothesis was 5%.
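The average ranks used by the Friedman test can be computed as in the following Python sketch. It is a plain-Python illustration rather than the authors' procedure: the function names are hypothetical, the MSE values are synthetic and not taken from the paper, and the test statistic is the textbook chi-square form without tie correction.

```python
from typing import Dict, List


def average_ranks(mse_rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean rank of each model over all datasets (rank 1 = lowest MSE).

    mse_rows holds one dict per dataset, mapping model name to its MSE.
    Tied models share the mean of the ranks they occupy.
    """
    models = list(mse_rows[0])
    totals = {m: 0.0 for m in models}
    for row in mse_rows:
        ordered = sorted(models, key=lambda m: row[m])
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and row[ordered[j + 1]] == row[ordered[i]]:
                j += 1
            shared = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
            for m in ordered[i:j + 1]:
                totals[m] += shared
            i = j + 1
    return {m: totals[m] / len(mse_rows) for m in models}


def friedman_statistic(ranks: Dict[str, float], n_datasets: int) -> float:
    """Friedman chi-square computed from average ranks (no tie correction)."""
    k = len(ranks)
    sum_sq = sum(r * r for r in ranks.values())
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Synthetic MSE values for three hypothetical models on three datasets.
rows = [
    {"A": 0.010, "B": 0.020, "C": 0.030},
    {"A": 0.020, "B": 0.010, "C": 0.030},
    {"A": 0.010, "B": 0.010, "C": 0.020},
]
ranks = average_ranks(rows)
print(ranks)
print(friedman_statistic(ranks, len(rows)))
```

With six models per comparison, the average ranks of one comparison always sum to 21, which gives a quick consistency check on tables of this kind.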

Fig. 1 Performance of Bag (left) and OoB (right) models generated using GFS

Table 2. Average rank positions of models determined during Friedman test

Method  1st            2nd            3rd            4th            5th          6th
Bag     B100 (1.40)    B70 (2.40)     Expert (3.50)  B50 (3.70)     10cv (4.45)  B30 (5.55)
OoB     10cv (2.20)    Expert (2.30)  O100 (2.70)    O70 (3.55)     O50 (4.50)   O30 (5.75)
RHO     H90 (2.50)     Expert (2.55)  10cv (3.05)    H70 (3.10)     H50 (4.10)   H30 (5.70)
RCV     1x50cv (2.50)  Expert (3.00)  5x10cv (3.15)  10x5cv (3.25)  10cv (3.80)  25x2cv (5.30)

Fig. 2 Performance of RHO (left) and RCV (right) models created by GFS

Table 3. Results of Wilcoxon tests for the performance of Bag and OoB models

           B100/O100  B70/O70  B50/O50  B30/O30  10cv  Expert
B100/O100             +/+      +/+      +/+      +/–   ≈/≈
B70/O70    –/–                 +/+      +/+      +/–   ≈/≈
B50/O50    –/–        –/–               +/+      +/–   ≈/≈
B30/O30    –/–        –/–      –/–               –/–   ≈/≈
10cv       –/+        –/+      –/+      +/+            ≈/≈
Expert     –/≈        ≈/≈      ≈/≈      ≈/≈      ≈/≈

Table 4. Results of Wilcoxon tests for the performance of RHO and RCV models

            H90/1x50cv  H70/5x10cv  H50/10x5cv  H30/25x2cv  10cv  Expert
H90/1x50cv              +/≈         +/+         +/+         ≈/+   ≈/≈
H70/5x10cv  –/≈                     +/≈         +/+         ≈/≈   ≈/≈
H50/10x5cv  –/–         –/≈                     +/+         –/≈   ≈/≈
H30/25x2cv  –/–         –/–         –/–                     –/–   ≈/≈
10cv        ≈/–         ≈/≈         +/≈         +/+               ≈/≈
Expert      ≈/≈         ≈/≈         ≈/≈         ≈/≈         ≈/≈

The general outcome is as follows. Firstly, the performance of the experts’ method fluctuates strongly, yielding excessively high MSE values for some datasets and the lowest values for others; its MSE ranges from 0.007 to 0.023. Therefore, no significant difference in accuracy between the experts’ method and any other technique can be observed. Secondly, the bagging models created over 30% subsamples perform significantly worse than the ones trained using bigger portions of the base datasets, for all methods. The same applies to 25x2cv. Thirdly, for bagging and subagging, the greater the portion of a base set used as a training set, the better the accuracy of the models created. More specifically, the B100, B70, and B50 bagging ensembles outperform single base models assessed using 10cv. In turn, 10cv provides better results than the out-of-bag O100, O70, and O50 ensembles and the repeated holdout H70 and H50 ensembles. The H90 and 10cv models perform equivalently. Finally, 1x50cv turns out to be better than any other cross-validation model except 5x10cv. No significant differences are observed among 5x10cv, 10x5cv, and 10cv.

4 Conclusions and Future Work

The computationally intensive experiments aimed to compare the performance of bagging and subagging ensembles, as well as repeated cross-validation models, built using genetic fuzzy systems over real-world data taken from a cadastral system, with different numbers of training samples. Moreover, the predictive accuracy of a pairwise comparison method applied by professional appraisers in practice was compared with that of our genetic fuzzy systems supporting residential premises valuation.

The overall results of our investigation were as follows. The bagging ensembles created using genetic fuzzy systems achieved prediction accuracy no worse than the experts’ method employed in practice. This confirms that automated valuation models can be successfully used to support appraisers’ work. Moreover, we conducted our experiments with genetic fuzzy rule-based systems, which have the advantage of extracting and representing knowledge about complex systems in a way that can be understood by humans. The processing time needed to generate the models is higher than for other computational intelligence or statistical techniques, such as neural networks and support vector regression, but this drawback has a lower impact on the effectiveness of Computer Assisted Mass Appraisal systems, which may operate in off-line mode.

Acknowledgments. This paper was partially supported by the Polish National Science Centre under grant no. N N516 483840.

References

1. Bagnoli, C., Smith, H.C.: The Theory of Fuzzy Logic and its Application to Real Estate Valuation. Journal of Real Estate Research 16(2), 169–199 (1998)
2. Biau, G., Cérou, F., Guyader, A.: On the Rate of Convergence of the Bagged Nearest Neighbor Estimate. Journal of Machine Learning Research 11, 687–712 (2010)
3. Borra, S., Di Ciaccio, A.: Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis 54(12), 2976–2989 (2010)
4. Breiman, L.: Bagging Predictors. Machine Learning 24(2), 123–140 (1996)
5. Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics 30, 927–961 (2002)
6. Buja, A., Stuetzle, W.: Observations on bagging. Statistica Sinica 16, 323–352 (2006)
7. Cordón, O., Gomide, F., Herrera, F., Hoffmann, F., Magdalena, L.: Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Sets and Systems 141, 5–31 (2004)
8. Cordón, O., Herrera, F.: A Two-Stage Evolutionary Process for Designing TSK Fuzzy Rule-Based Systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B 29(6), 703–715 (1999)
9. Czuczwara, K.: Comparative analysis of selected evolutionary algorithms for optimization of neural network architectures. Master’s Thesis (in Polish), Wrocław University of Technology, Wrocław, Poland (2010)
10. Efron, B., Tibshirani, R.J.: Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92(438), 548–560 (1997)
11. Friedman, J.H., Hall, P.: On bagging and nonlinear estimation. Journal of Statistical Planning and Inference 137(3), 669–683 (2007)
12. Fumera, G., Roli, F., Serrau, A.: A theoretical analysis of bagging as a linear combination of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7), 1293–1299 (2008)
13. González, M.A.S., Formoso, C.T.: Mass appraisal with genetic fuzzy rule-based systems. Property Management 24(1), 20–30 (2006)
14. Góral, M.: Comparative analysis of selected evolutionary algorithms for optimization of fuzzy models for real estate appraisals. Master’s Thesis (in Polish), Wrocław University of Technology, Wrocław, Poland (2010)
15. Graczyk, M., Lasota, T., Trawiński, B.: Comparative Analysis of Premises Valuation Models Using KEEL, RapidMiner, and WEKA. In: Nguyen, N.T. et al. (Eds.): ICCCI 2009, LNAI 5796, pp. 800–812, Springer, Heidelberg (2009)
16. Graczyk, M., Lasota, T., Trawiński, B., Trawiński, K.: Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal. In: Nguyen, N.T. et al. (Eds.): ACIIDS 2010, LNAI 5991, pp. 340–350, Springer, Heidelberg (2010)
17. Kontrimas, V., Verikas, A.: The mass appraisal of the real estate by computational intelligence. Applied Soft Computing 11(1), 443–448 (2011)
18. Król, D., Lasota, T., Trawiński, B., Trawiński, K.: Investigation of Evolutionary Optimization Methods of TSK Fuzzy Model for Real Estate Appraisal. International Journal of Hybrid Intelligent Systems 5(3), 111–128 (2008)
19. Krzystanek, M., Lasota, T., Telec, Z., Trawiński, B.: Analysis of Bagging Ensembles of Fuzzy Models for Premises Valuation. In: Nguyen, N.T., Le, M.T., Świątek, J. (Eds.): ACIIDS 2010, LNAI 5991, pp. 330–339, Springer, Heidelberg (2010)
20. Lasota, T., Mazurkiewicz, J., Trawiński, B., Trawiński, K.: Comparison of Data Driven Models for the Valuation of Residential Premises using KEEL. International Journal of Hybrid Intelligent Systems 7(1), 3–16 (2010)
21. Lasota, T., Telec, Z., Trawiński, B., Trawiński, K.: Exploration of Bagging Ensembles Comprising Genetic Fuzzy Models to Assist with Real Estate Appraisals. In: Yin, H., Corchado, E. (Eds.): IDEAL 2009, LNCS 5788, pp. 554–561, Springer, Heidelberg (2009)
22. Lewis, O.M., Ware, J.A., Jenkins, D.: A novel neural network technique for the valuation of residential property. Neural Computing & Applications 5(4), 224–229 (1997)
23. Martínez-Muñoz, G., Suárez, A.: Out-of-bag estimation of the optimal sample size in bagging. Pattern Recognition 43, 143–152 (2010)
24. Molinaro, A.N., Simon, R., Pfeiffer, R.M.: Prediction error estimation: a comparison of resampling methods. Bioinformatics 21(15), 3301–3307 (2005)
