2016 IEEE International Congress on Big Data
Predictive modeling in a big data distributed setting: a scalable bias correction approach

Gianluca Bontempi, Yann-Aël Le Borgne
Machine Learning Group, Computer Science Department, Faculty of Sciences, ULB, Université Libre de Bruxelles, Belgium (mlg.ulb.ac.be)

Abstract—Massive datasets are becoming pervasive in computational sciences. Though this opens new perspectives for discovery, and an increasing number of processing and storage solutions is available, it is still an open issue how to transpose machine learning and statistical procedures to distributed settings. Big datasets are no guarantee of optimal modeling, since they do not automatically solve the issues of model design, validation and selection. At the same time, conventional techniques of cross-validation and model assessment are computationally prohibitive when the size of the dataset explodes. This paper claims that the main benefit of a massive dataset is not related to the size of the training set but to the possibility of assessing, in an accurate and scalable manner, the properties of the learner itself (e.g. bias and variance). Accordingly, the paper proposes a scalable implementation of a bias correction strategy to improve the accuracy of learning techniques for regression in a big data setting. An analytical derivation and an experimental study show the potential of the approach.

I. INTRODUCTION

The advent of big data analytics demands the adoption of effective and accurate machine learning methods. However, most of these techniques have been designed and assessed in an epoch when all data could reside in memory and be rapidly accessible. Also, most machine learning research was conceived for a "not enough data" setting, where the amount of data was limited and each single sample had to be taken into account both for fitting and assessment purposes. Think for instance of pervasive techniques like cross-validation and the bootstrap, where the reuse of data is crucial for assessing models and making design choices. These techniques, while recommended and extremely effective on small samples, are prohibitive in large data and distributed settings. This led to the introduction of "divide and conquer" and parallel versions of these procedures [9].

Suppose we want to estimate a quantity $\theta$ and that $\hat{\theta}_N$ is an estimator built with a very large number $N$ of observations. The idea of divide and conquer approaches is that a large dataset of size $N$ can be processed as multiple smaller datasets, on each of which an estimator is computed. A known difficulty of such subsample-based procedures is bringing the resulting confidence intervals to the right scale. In order to have an adequate scaling and a scalable, effective implementation, Kleiner et al. [9] proposed the Bag of Little Bootstraps (BLB) approach, which is a smart way to derive the statistical properties of $\hat{\theta}_N$ using only partitions of the original dataset. An extension of BLB was proposed in [1].

This paper proposes a different approach. In BLB and related approaches the aim is to assess the estimator $\hat{\theta}_N$ by using estimators $\hat{\theta}_{N'}$ computed on smaller partitions. Here the focus is to use the assessment of $\hat{\theta}_{N'}$ directly to improve its accuracy, and so to obtain a better estimate of $\theta$ in a scalable manner. The rationale is that, if we are confronted with two estimators ($\hat{\theta}_{N'}$ and $\hat{\theta}_N$) trained with different numbers of samples but targeting the same quantity $\theta$, the one to be preferred is the most accurate one (e.g. the unbiased one) and not necessarily the one estimated with the largest number of samples. So instead of using the properties of $\hat{\theta}_{N'}$ to derive indirectly the properties (e.g. the bias) of $\hat{\theta}_N$, we estimate directly the bias of $\hat{\theta}_{N'}$ [6] in order to improve the prediction of $\theta$.

The choice of the best estimator has typically been tackled in predictive modelling tasks (e.g. classification or regression) by model selection procedures [8], [4]. Model selection consists in (i) considering a set of alternative families of models (e.g. linear and nonlinear), (ii) assessing the generalisation error of these models (e.g. by means of cross-validation or resampling techniques) and (iii) proceeding to selection and/or averaging. This procedure is often long and time-consuming, since it requires iterating over different families of models and searching over large hyperparameter spaces. At the same time, its outcome is extremely dependent on the accuracy of the estimation of the generalisation error. However, it is well known that the generalisation error (e.g. the mean squared error in regression) is only an aggregated measure of estimation accuracy, since it integrates the bias and the variance of the model as well as the noise [8]. The choice of measuring directly the generalisation error, instead of its components, is dictated in small data settings by the difficulty of reproducing accurately the sampling distribution of the estimator given the low number of samples. This is indeed a major limitation, since the problem of estimation would be solved if we could have accurate measures of the components themselves (e.g. the bias and the variance of the estimator).
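As a concrete (non-distributed) illustration of this rationale, the following minimal sketch applies the classic bootstrap bias correction of [6] to a deliberately biased estimator. It is not the paper's Section III procedure, only the underlying idea: estimate the bias of the estimator from the data itself, then subtract it.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_bias_correct(sample, estimator, n_boot=500, rng=rng):
    # Classic bootstrap bias correction [6]:
    # bias_hat = mean_b(theta*_b) - theta_hat, return theta_hat - bias_hat.
    theta_hat = estimator(sample)
    boot = np.array([estimator(rng.choice(sample, size=sample.size, replace=True))
                     for _ in range(n_boot)])
    return theta_hat - (boot.mean() - theta_hat)

# Example: the maximum-likelihood variance estimator (np.var divides by n)
# is biased: E[np.var] = (n-1)/n * sigma^2.
sample = rng.normal(loc=0.0, scale=2.0, size=50)   # true variance = 4
print(np.var(sample), bootstrap_bias_correct(sample, np.var))

The corrected value is on average closer to the true variance than the raw maximum-likelihood estimate, which is the behavior the paper seeks to reproduce at scale.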
... is satisfied, where $\Delta B$ denotes the reduction in squared bias due to the correction, and $B_1(N) \le B_1\left(\frac{N}{2P}\right)$ holds because of the non-increasing behavior of $B_1(\cdot)$. It follows that the bias reduction is effective if

$B_1\left(\frac{N}{2P}\right) - B_1(N) < \Delta B \qquad (8)$

Supposing that $B_1(N) \approx o(1/N)$ (i.e. that it converges to 0 as fast as $1/N$), since

$B_1\left(\frac{N}{2P}\right) < \Delta B \;\Rightarrow\; B_1\left(\frac{N}{2P}\right) - B_1(N) < \Delta B,$

condition (8) is satisfied for a sufficiently large $N$ if

$\frac{1}{N'} = \frac{2P}{N} < \Delta B \;\Rightarrow\; N' > \frac{1}{\Delta B},$

where $N' = \frac{N}{2P}$ is the number of samples per partition. This means that the larger the squared bias correction $\Delta B$, the smaller the number of samples per partition required for the distributed approach to outperform the full-data approach. In other terms, if estimating the bias is not too difficult a task, a distributed bias-corrected approach with a reasonably small number of samples per partition can improve accuracy.
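As a purely illustrative numeric instance (the value of $\Delta B$ below is assumed, not taken from the paper): suppose the correction removes a squared bias of

$\Delta B = 5 \cdot 10^{-6} \;\Rightarrow\; N' > \frac{1}{\Delta B} = 2 \cdot 10^{5},$

so partitions of about $2 \cdot 10^5$ samples suffice, and a dataset of size $N = 10^9$ could then be handled with $P = \frac{N}{2N'} = 2500$ partitions.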
V. EXPERIMENTS

The experimental session aims to show that, in a big data setting, a predictive model based on (i) distributing the data into partitions, (ii) estimating the bias and (iii) correcting the bias can be competitive with, and sometimes outperform, a predictive model using the entire dataset for training. All the experiments were implemented in an embarrassingly parallel manner with Python Spark and carried out on a dedicated cluster of 10 nodes, each with 24 cores and 128 GB of RAM.

We consider 11 synthetic regression tasks and for each of them we set an increasing number of training samples N = 200K, 300K, 400K, 500K, 600K, 800K, while the test dataset has size $N_{ts}$ = 200K. The input matrix of size [N, n] is generated according to a multivariate Gaussian distribution with a non-diagonal covariance matrix, in order to emulate tasks with collinearity. Then we consider a set of nonlinear dependencies (functions $f(\cdot)$ detailed in Table I) with additive Gaussian noise whose standard deviation is chosen in the interval $0.25 \le \sigma_w \le 0.5$. The value of n is randomly chosen in the range $20 \le n \le 40$. Note that each time only a subset of the n measured inputs X is randomly chosen to be part of the Markov blanket $M \subset X$. The caption of the table details how these bivariate dependencies are used to implement multivariate functions.

Table I
Set of nonlinear bivariate dependencies. In our experiments a nonlinear multivariate input-output dependency is obtained by setting $z_1 = \sum_{x_k \in X_1} |x_k|$ and $z_2 = \sum_{x_k \in X_2} |x_k|$, where $X_1 \cap X_2 = \emptyset$ and $X_1 \cup X_2 = M \subset X$.

$y = \frac{|z_2|}{z_1^2 + 1}$
$y = \frac{z_1}{z_1^2 + z_2^2 + 1}$
$y = \frac{z_1 \sin(z_1 z_2)}{z_1^2 + z_2^2 + 1}$
$y = z_2 \exp(2 z_1^2)$
$y = z_2 \sin(z_1) + z_1 \sin(z_2)$
$y = \frac{z_1^3 - 2 z_1 z_2 + z_2^2}{z_1^2 + z_2^2 + 1}$
$y = z_1 + z_2$
$y = \sin(z_1) + \log(z_2)$
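The generation scheme above can be made concrete with a short sketch; the covariance construction, the choice of bivariate function and the specific parameter values below are illustrative assumptions within the stated ranges.

import numpy as np

rng = np.random.default_rng(1)

def make_task(N, n, sigma_w, rng):
    # Inputs: multivariate Gaussian with a non-diagonal covariance matrix,
    # so that the inputs are collinear (construction here is arbitrary).
    A = rng.normal(size=(n, n))
    cov = A @ A.T + np.eye(n)            # positive definite, correlated
    X = rng.multivariate_normal(np.zeros(n), cov, size=N)
    # Markov blanket: a random subset M of the n inputs, split into two
    # disjoint halves X1 and X2 as in the caption of Table I.
    m = rng.integers(4, n + 1)
    blanket = rng.permutation(n)[:m]
    X1, X2 = blanket[: m // 2], blanket[m // 2:]
    z1 = np.abs(X[:, X1]).sum(axis=1)
    z2 = np.abs(X[:, X2]).sum(axis=1)
    # One of the Table I dependencies (the choice is arbitrary here),
    # plus additive Gaussian noise of standard deviation sigma_w.
    y = z1 / (z1**2 + z2**2 + 1) + rng.normal(0.0, sigma_w, size=N)
    return X, y

X, y = make_task(N=200_000, n=30, sigma_w=0.3, rng=rng)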
We use Random Forest (the sklearn RandomForestRegressor implementation, with 100 trees and a maximum depth of 10) as the learner and we compared three strategies (a code sketch of the three follows the list):
• ALL: conventional training performed on the entire dataset;
• PASTING: a pasting [3] approach, consisting in partitioning the data, fitting a model in each partition and then averaging the results;
• BC: the bias correction approach described in Section III.
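A minimal single-machine sketch of the three strategies follows. The ALL and PASTING variants follow directly from the definitions above; the BC variant is only one plausible reading of the Section III procedure (train on one half of each partition, estimate the prediction bias on the held-out half, subtract it), since that section is not part of this excerpt.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_rf(X, y):
    return RandomForestRegressor(n_estimators=100, max_depth=10).fit(X, y)

def predict_all(Xtr, ytr, Xts):
    # ALL: a single model trained on the entire dataset.
    return fit_rf(Xtr, ytr).predict(Xts)

def predict_pasting(Xtr, ytr, Xts, P):
    # PASTING [3]: one model per partition, predictions averaged.
    parts = np.array_split(np.arange(len(ytr)), P)
    return np.mean([fit_rf(Xtr[idx], ytr[idx]).predict(Xts) for idx in parts],
                   axis=0)

def predict_bc(Xtr, ytr, Xts, P):
    # BC (assumed reading, not the paper's exact Section III procedure):
    # in each partition, train on N/(2P) samples, estimate the average
    # prediction bias on the held-out half, subtract it, then average.
    parts = np.array_split(np.arange(len(ytr)), P)
    preds = []
    for idx in parts:
        half = len(idx) // 2
        tr, held = idx[:half], idx[half:]
        model = fit_rf(Xtr[tr], ytr[tr])
        bias_hat = np.mean(model.predict(Xtr[held]) - ytr[held])
        preds.append(model.predict(Xts) - bias_hat)
    return np.mean(preds, axis=0)

The N/(2P) training size in the BC sketch is chosen to match the per-partition sample size appearing in the analytical condition of Section IV.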
For the sake of comparison we set the same number of partitions for the PASTING and BC distributions of the data. The aim of the comparison is to show that the BC approach is able to perform in a manner comparable to the ALL strategy and better than the PASTING strategy. The accuracy of each method was measured in terms of the Normalized Mean Squared Error (NMSE) on the test set.

Figure 3 compares the BC strategy with the ALL strategy by showing the boxplot of the paired differences $\mathrm{NMSE}_{BC} - \mathrm{NMSE}_{ALL}$. What appears is that, by increasing the size of the partitions, the probability of outperforming the ALL strategy increases. Note however that the size $N'$ of each partition is much smaller than the size of the original training set.

Figure 4 compares the BC strategy with the PASTING strategy by showing the boxplot of the paired differences $\mathrm{NMSE}_{BC} - \mathrm{NMSE}_{PASTING}$. What appears is that the bias correction strategy consistently improves on the PASTING strategy.

A final analysis concerns the relation between the size $N$ of the dataset, the size $N'$ of the partition and the expected NMSE improvement. On the basis of the experimental data we fitted a linear model linking these three variables and used it to extrapolate the behavior to larger values of $N$. Figure 5 shows, for each $N'$, the expected size (and related 95% confidence interval) of the training set for which $\mathrm{NMSE}_{BC}$ is equal to $\mathrm{NMSE}_{ALL}$. It is interesting to remark that this extrapolation suggests we could attain the same accuracy as a model trained on a huge dataset (e.g. $N = 10^9$) by using a BC distributed strategy with much smaller partitions (e.g. $N' = 2 \cdot 10^5$).
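Since all results are reported as paired NMSE differences, a small helper makes the metric explicit; the normalization by the variance of the test targets is a common convention, assumed here because the excerpt does not spell it out.

import numpy as np

def nmse(y_true, y_pred):
    # Normalized Mean Squared Error: MSE divided by the variance of the
    # targets (normalization convention assumed, not stated in the excerpt).
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Paired difference as in Figures 3 and 4: negative values mean that BC
# outperforms the reference strategy on that task/run, e.g.
# diff = nmse(y_ts, pred_bc) - nmse(y_ts, pred_all)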
Figure 3. BC vs. ALL strategy: boxplot of the difference $\mathrm{NMSE}_{BC} - \mathrm{NMSE}_{ALL}$ for different partition sizes. All the points falling outside the range between the upper and lower quartiles are denoted as open circles.

Figure 4. BC vs. PASTING strategy: boxplot of the difference $\mathrm{NMSE}_{BC} - \mathrm{NMSE}_{PASTING}$ for different partition sizes. All the points falling outside the range between the upper and lower quartiles are denoted as open circles.

Figure 5. Extrapolation of the relation between $\mathrm{NMSE}_{BC}$ and $\mathrm{NMSE}_{ALL}$. The curve shows, for each size $N'$ of the partition, the corresponding size $N$ (in logarithmic scale) of the dataset for which the BC strategy is as accurate as the ALL strategy.
VI. CONCLUSION

The paper proposes an original interpretation of how conventional statistical and learning methodology should evolve in a world involving massive data on parallel and distributed computing platforms. Statistics offers plenty of large-sample results concerning the behavior of estimators (e.g. learners). For instance, given an appropriate model, if the sample size is large then maximum likelihood provides estimators of the parameters that are consistent (i.e., asymptotically unbiased with variance tending to zero), fully efficient (i.e., of minimum variance among consistent estimators) and normally distributed. Unfortunately, all this is of little relevance when the model is not appropriate. In the majority of real applications, even with big data, neither the model parameters nor the model itself is known. Large and extensive datasets are likely to support more complexity, but the relation between complexity and number of samples is hard to define. So, in principle, model selection should be adopted even in a large data setting, and consistency results about the optimality of the selection are encouraging in this sense. The problem is that the training and assessment of many alternatives in a large data setting is impracticable and excessively time-consuming.

This paper has therefore presented a bias correction technique for training a number of distributed models, each using only a partition of the dataset, while at the same time assessing their estimation properties. The resulting predictive model is able
to be competitive with, and sometimes outperform, models trained on the entire dataset, even when each distributed model is trained on a partition of much smaller size. Future work will focus on the design of a smart partitioning of the data for improved accuracy, and on extending the approach to a streaming setting.
Acknowledgements—The authors acknowledge the support of the "BruFence: Scalable machine learning for automating defense system" project (RBC/14 PFS-ICT 5), funded by the Institute for the encouragement of Scientific Research and Innovation of Brussels (INNOVIRIS, Brussels Region, Belgium).
REFERENCES

[1] S. Basiri, E. Ollila, and V. Koivunen. Fast and robust bootstrap in analysing large multivariate datasets. In 2014 48th Asilomar Conference on Signals, Systems and Computers, pages 8–13, Nov 2014.
[2] Souhaib Ben Taieb and Amir F. Atiya. A bias and variance analysis for multistep-ahead time series forecasting. IEEE Transactions on Neural Networks and Learning Systems, 27(1):62–76, 2015.
[3] Leo Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning, 36(1):85–103, 1999.
[4] Gerda Claeskens and Nils Lid Hjort. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, New York, 2008.
[5] Bertrand Clarke, Ernest Fokoue, and Hao Helen Zhang. Principles and Theory for Data Mining and Machine Learning. Springer, 2009.
[6] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[7] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, February 2002.
[8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
[9] Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816, 2014.
[10] Nick Pentreath. Machine Learning with Spark - Tackle Big Data with Powerful Spark Machine Learning Algorithms. Packt Publishing, 2014.
[11] D. Politis, J. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York Inc., 1999.
[12] Panos Toulis and Edoardo M. Airoldi. Scalable estimation strategies based on stochastic approximations: classical results and new insights. Statistics and Computing, 25(4):781–795, 2015.
[13] S. Zhou, X. Shen, and D. A. Wolfe. Local asymptotics for regression splines and confidence regions. Annals of Statistics, 26(5):1760–1782, 1998.