Precise Wind Power Prediction with SVM Ensemble Regression

Justin Heinermann and Oliver Kramer
Department of Computing Science, University of Oldenburg, D-26111 Oldenburg, Germany
{justin.philipp.heinermann, oliver.kramer}@uni-oldenburg.de
Abstract. In this work, we propose the use of support vector regression ensembles for wind power prediction. Ensemble methods often yield better classification and regression accuracy than classical machine learning algorithms while reducing the computational cost. In the field of wind power generation, integration into the smart grid is only possible with a precise forecast computed in a reasonable time. Our support vector regression ensemble approach uses bootstrap aggregating (bagging), which can easily be parallelized. A set of weak predictors is trained and then combined into an ensemble by aggregating the predictions. We investigate how to choose and train the individual predictors and how to weight them best in the prediction ensemble. In a comprehensive experimental analysis, we show that our SVR ensemble approach yields significantly better forecast results than state-of-the-art predictors.

Keywords: Support Vector Regression, SVR Ensemble, Wind Power Prediction, Bagging, Weighting, Parallelization.
1 Introduction
The successful integration of wind energy into the grid depends on precise predictions of the amount of energy produced. It has been shown that good forecast results can be achieved using support vector regression (SVR) [11]. The main problem of the SVR algorithm is its high computational cost: in particular, when doing parameter studies and investigating the prediction performance on large data sets, the optimization process becomes impractically slow. The training time required for an acceptable forecast can easily reach thousands of seconds, so a worse prediction performance often has to be accepted. In this work, we propose an SVR ensemble method for wind power prediction in order to improve the forecast quality while spending less computation time. Instead of using one single support vector regressor, we train a number of regressors, called weak predictors, which together form an ensemble. Each of them is trained on a smaller subsample of the training set. The prediction is computed as a weighted average of the regression results of the weak predictors.

This paper is structured as follows. Our ensemble method is described in Section 2. The experimental results, presented in Section 3, show that a random parameter choice combined with a mean squared error (MSE)-based weighting yields the best trade-off between computation time and prediction performance. Compared to state-of-the-art algorithms, our approach achieves a better forecast performance in a reasonable computation time. Our conclusions and future work can be found in Section 4.
1.1 Related Work
A comprehensive overview and empirical analysis for ensemble classification is given by Bauer and Kohavi [1]. Another, more up-to-date review paper was written by Rokach [8]. The most important ensemble techniques are bagging and boosting. Bagging, which stands for bootstrap aggregating, was introduced by Breiman [2] and is a relatively simple algorithm. The main idea is to build independent predictors using samples of the training set and to average the output of these predictors. In contrast, boosting approaches like AdaBoost [4] make use of predictors trained in a consecutive manner with continuous adaptation of the ensemble weights. Kim et al. [6] built classifiers from support vector machine (SVM) ensembles using both bagging and boosting. The single SVMs are aggregated by majority voting, by an LSE-based weighting, or combined in a hierarchical manner. Important in SVM ensemble construction is "that each individual SVM becomes different with another SVM as much as possible" [6]. This aspect was also investigated by Tsang, Kocsor, and Kwok [12], who introduced orthogonality constraints for the weak SVM predictors in order to diversify them. Due to the lack of space, we refer to Schölkopf and Smola [9] for an introduction to support vector machines. In the field of wind power prediction, ensembles have been used for postprocessing numerical weather forecasts [5, 10]. For solar power output prediction, Chakraborty et al. [3] built an ensemble of a weather forecast-driven model and machine learning predictors.
2 Support Vector Regression Ensemble With Weighted Bagging
Our research goal is to answer the question of whether SVR ensembles can be used to improve the prediction accuracy in the field of wind power forecasting while reducing the computation time. The basic idea of the resulting training algorithm is depicted in Algorithm 1. Let P = {p_i | i = 1, ..., k} be the set of predictors and W = {w_i ∈ R | i = 1, ..., k} the set of corresponding weights, where each weight belongs to the predictor with the same index. For computing the final prediction value for an unknown instance x, the results of the weak predictors p_i ∈ P are combined using the weights w_i ∈ W:

\hat{y} = \sum_{i=1}^{k} w_i \cdot p_i(x)
As the design goal is to find a balance between good regression performance and feasible computational cost, we decided to implement a relatively simple bagging approach, which can easily be parallelized: every iteration of the for-loop in line 4 is independent of its preceding run, so the loop can be replaced by a map operation and executed on distributed computing systems or on multicore processors. The same holds for the prediction step. In contrast to our bagging approach, iterative algorithms like AdaBoost would be too expensive because of the interdependent steps of the training algorithm.

Algorithm 1 Training of the SVR Ensemble Predictor
 1: Inputs: T = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × R (training set), s (sample size), k (number of weak predictors)
 2: Returns: P = {p_i | i = 1, ..., k} (weak predictors), W = {w_i ∈ R | i = 1, ..., k} (ensemble weights)
 3: Initialize: w_i ← 1, i = 1, ..., k
 4: for i = 1 to k do
 5:   T_i ← sample(T, s)
 6:   T_val,i ← T − T_i
 7:   C, γ ← ChooseParameters(SVR, T_i)
 8:   p_i ← FitSVR(T_i, C, γ)
 9:   w_i ← 1 / RegressionPerformance(p_i, T_val,i)
10: end for
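The following Python sketch illustrates Algorithm 1 on top of the Scikit-learn SVR implementation used in Section 3. It is a minimal sketch under stated assumptions, not the authors' original code: the helper names (fit_weak_predictor, train_ensemble), the joblib-based parallelization of the map operation, and the hard-coded random parameter choice (the variant recommended in Section 3.2) are ours.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR


def fit_weak_predictor(X, y, s, seed):
    """One iteration of the for-loop in Algorithm 1: train a weak SVR on a
    bootstrap sample of size s and weight it by the inverse MSE on the rest."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=s, replace=True)       # T_i <- sample(T, s)
    mask = np.ones(len(X), dtype=bool)
    mask[idx] = False                                    # T_val,i <- T - T_i
    C = 10.0 ** rng.randint(0, 5)                        # random C in {1, ..., 10000}
    gamma = 10.0 ** -rng.randint(1, 6)                   # random gamma in {1e-1, ..., 1e-5}
    p = SVR(C=C, gamma=gamma).fit(X[idx], y[idx])        # p_i = FitSVR(T_i, C, gamma)
    w = 1.0 / mean_squared_error(y[mask], p.predict(X[mask]))  # w_i = 1/MSE on T_val,i
    return p, w


def train_ensemble(X, y, s=1000, k=32, n_jobs=-1):
    """Train k independent weak predictors; since the iterations do not depend
    on each other, the loop is mapped onto all available cores via joblib."""
    results = Parallel(n_jobs=n_jobs)(
        delayed(fit_weak_predictor)(X, y, s, seed) for seed in range(k))
    predictors, weights = zip(*results)
    return list(predictors), np.asarray(weights)
```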
2.1 Training the Weak Predictors
When training the weak predictors, one must consider which samples are used for the design of the training set and which settings are best for the particular machine learning algorithm. Like Kim et al. [6], we build the training sets T_i of the weak predictors by randomly sampling s instances from the global training set T. Hence, the single sets T_i are non-disjoint, and a single training instance (x_i, y_i) can occur multiple times or not at all. In future work, one could also test disjoint sets or even introduce orthogonality constraints like Tsang, Kocsor, and Kwok [12], but random sampling turned out to be a good choice.

An important research aspect is the parameter tuning implemented in the ChooseParameters method. In our case, we considered the regularization parameter C and the RBF (radial basis function) kernel bandwidth γ the most important parameters (we use the Scikit-learn implementation [7] of ε-support vector regression, which uses γ instead of σ) and only varied these two. We tested three different variants of parameter choice:

Global Optimization: Each weak predictor's regression performance is given by the mean squared error on the whole validation set T_val,i, which is the global training data set T without the training data sample T_i. Thus, all weak predictors are optimized for the global training data set via grid search.

Local Optimization: The weak predictor's regression performance is optimized by a grid search using cross-validation on the corresponding training data sample T_i.

Random Choice: The parameters are randomly chosen; no optimization is performed.

Obviously, both the global and the local optimization are inherently expensive because the SVR training algorithm has to be called multiple times (the same holds for other optimization techniques such as evolutionary algorithms, which likewise induce a long computation time). Thus, one would prefer the random method if the prediction quality is not worse. Our experiments show that a random choice of parameters can even result in a better prediction performance.
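As an illustration of the three variants, consider the following sketch (our code, not the paper's): the grids for C and γ are the ones used in Section 3.2, and the local optimization is realized with Scikit-learn's GridSearchCV.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

C_GRID = [1, 10, 100, 1000, 10000]               # grids from Section 3.2
GAMMA_GRID = [10.0 ** -e for e in range(1, 6)]


def choose_parameters_random(rng):
    """Random choice: no SVR fit is needed, hence the large speed-up."""
    return rng.choice(C_GRID), rng.choice(GAMMA_GRID)


def choose_parameters_local(X_i, y_i):
    """Local optimization: grid search with cross-validation on the
    training sample T_i only."""
    search = GridSearchCV(SVR(), {"C": C_GRID, "gamma": GAMMA_GRID},
                          cv=3, scoring="neg_mean_squared_error")
    search.fit(X_i, y_i)
    return search.best_params_["C"], search.best_params_["gamma"]


def choose_parameters_global(X_i, y_i, X_val, y_val):
    """Global optimization: pick the grid point with the least MSE on the
    validation set T_val,i = T - T_i."""
    errors = [(mean_squared_error(y_val,
                                  SVR(C=C, gamma=g).fit(X_i, y_i).predict(X_val)),
               C, g)
              for C in C_GRID for g in GAMMA_GRID]
    _, C, g = min(errors)
    return C, g
```

Both optimizing variants require on the order of |C_GRID| · |GAMMA_GRID| = 25 SVR fits per weak predictor (more with cross-validation folds), which explains the runtime gap observed in Table 1.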
2.2 Ensemble Prediction and Learning Weights
Many different methods of combining the weak predictors into a reasonable prediction output are possible. We decided to test uniform weighting with w_i = 1 as well as different weighting methods that use a prediction error E on a validation data set T_val,i, interpreted as the importance of each weak predictor p_i:

w_i = \frac{1}{E(p_i, T_{val,i})}

For E, we tested the mean squared error (MSE), the square of the MSE, the least square error (LSE), and the biggest square error (BSE). Besides the SVR and kernel parameters and the ensemble weights, the two most important factors for the algorithm's success are the sample size s and the number of weak predictors k. As shown in Section 3, it turns out to be best to balance both sizes.
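A sketch of the corresponding prediction step (again our code): the weak outputs are combined by the weighted sum from Section 2; we normalize the weights to sum to one, which the description as a weighted average implies, and the 1/MSE² variant is obtained by simply squaring the inverse-MSE weights.

```python
import numpy as np


def predict_ensemble(predictors, weights, X):
    """Weighted combination y_hat = sum_i w_i * p_i(x) of the weak
    predictors, with weights normalized to sum to one (a weighted average)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    preds = np.stack([p.predict(X) for p in predictors])  # shape (k, n_samples)
    return w @ preds


# The 1/MSE^2 weighting used in the experiments squares the
# inverse-MSE weights returned by the training step:
# predictions = predict_ensemble(predictors, weights ** 2, X_test)
```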
2.3 Runtime of the Approach
Because of the runtime bound O(N³) of the usual SVR training algorithm on a training set of size N, it is often much cheaper to train many SVR predictors on small training data sets than one single SVR predictor on a large training data set. For the case of partitioning the training data set into partitions of size n < N, the runtime bound of our approach is given by

\frac{N}{n} \cdot n^3 = N \cdot n^2 < N^3 \quad \text{for } n < N. \qquad (1)

In our case, we do not necessarily divide the whole training data set into partitions but rather sample k subsets of size n. Therefore, our runtime and space complexity no longer depends on the training set size, and the runtime bound is given by

O(k \cdot n^3) = k \cdot O(n^3). \qquad (2)
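As an illustration with numbers of our choosing (not taken from the paper): a single SVR trained on N = 50,000 instances scales like N³ ≈ 1.25 · 10¹⁴ elementary operations, whereas k = 64 weak predictors with sample size n = 1,000 scale like k · n³ = 6.4 · 10¹⁰, a reduction of more than three orders of magnitude, and the k fits can additionally be executed in parallel.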
3 Experimental Results
In our experiments, we analyze the prediction performance of our ensemble regression approach. First, we determine the settings that achieve the best prediction performance at the lowest computation time. Given the best results that can be reached with our approach, we compare the algorithm to the commonly used techniques k-nearest neighbors (kNN) and SVR. The experiments were run on an Intel Core i5 (4 × 3.10 GHz) with 8 GiB of RAM using the kNN and SVR implementations of Scikit-learn [7].
3.1 Wind Power Prediction with Machine Learning
In contrast to numerical weather forecast models, we use statistical learning methods for wind power prediction. The prediction task is formulated as a regression problem on observed wind speed or power output time series. The forecast problem for a given target turbine is to predict the measurement with a forecast horizon λ. As input features for the regression algorithm, we use a feature window µ, determining how many past measurements to consider for the forecast. For both parameters µ and λ, we use 30 minutes. It has been shown that involving the measurements of neighboring turbines within a radius of a few kilometers can help to improve the forecast accuracy [11]. Let p_i(t) be the measurement of turbine i at time t, and 1 ≤ i ≤ n the indices of the n neighboring turbines. Then, for a target turbine with index j, we define a pattern-label pair (x, y) for a given time t_0 as

\begin{pmatrix}
p_1(t_0 - \mu) & p_1(t_0 - \mu + 1) & \dots & p_1(t_0) \\
p_2(t_0 - \mu) & p_2(t_0 - \mu + 1) & \dots & p_2(t_0) \\
\vdots & \vdots & & \vdots \\
p_n(t_0 - \mu) & p_n(t_0 - \mu + 1) & \dots & p_n(t_0)
\end{pmatrix}
\rightarrow p_j(t_0 + \lambda). \qquad (3)
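The following sketch shows how the pattern-label pairs of Eq. (3) can be built from a matrix of per-turbine time series (our code and array layout, not the authors'): with 10-minute measurements, µ = λ = 30 minutes corresponds to three time steps.

```python
import numpy as np


def make_patterns(P, j, mu=3, lam=3):
    """Build pattern-label pairs according to Eq. (3).

    P   : array of shape (n_turbines, T) with power measurements p_i(t)
    j   : index of the target turbine
    mu  : feature window in time steps (30 min = 3 steps of 10-min data)
    lam : forecast horizon in time steps
    """
    n_turbines, T = P.shape
    X, y = [], []
    for t0 in range(mu, T - lam):
        X.append(P[:, t0 - mu : t0 + 1].ravel())  # all turbines, times t0-mu .. t0
        y.append(P[j, t0 + lam])                  # label: p_j(t0 + lambda)
    return np.asarray(X), np.asarray(y)
```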
In our experiments, we use the NREL western wind resources dataset (http://wind.nrel.gov/), which contains the simulated power output of 32,043 wind power stations in the US. Each grid point has a maximum power output of 30 MW, and 10-minute data is given for the years 2004–2006. Therefore, for every station, 157,680 wind speed and power output measurements are available. We use the power output data of five wind parks (NREL turbine IDs: Cheyenne = 17423, Lancaster = 2473, Palm Springs = 1175, Vantage = 28981, Yucca Valley = 1539), each consisting of the wind turbine whose power output shall be predicted and the turbines within a radius of 3 kilometers. The whole time series of the year 2004 is used as training data set, and the data of the year 2005 serves as test data set.
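Building on the make_patterns sketch above, the train/test split by year can be illustrated as follows; the synthetic array P is a stand-in for the real NREL park data, and the split ignores the small µ-step offset at the year boundary.

```python
import numpy as np

STEPS_PER_YEAR = 6 * 24 * 365        # 10-minute data: 52,560 steps per year

# Stand-in for one wind park from the NREL dataset:
# rows = turbines of the park, columns = 10-minute power values for 2004-2006.
P = np.random.rand(10, 3 * STEPS_PER_YEAR) * 30.0   # power output in MW

X, y = make_patterns(P, j=0)         # patterns for the target turbine
split = STEPS_PER_YEAR
X_train, y_train = X[:split], y[:split]                    # year 2004
X_test, y_test = X[split:2 * split], y[split:2 * split]    # year 2005
```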
3.2 Experiment 1: Optimization and Weighting of the Weak Predictors
In our first experiment, we analyze the use of the different parameter optimization variants for the weak predictors. For C, the possible values used are {1, 10, 100, 1,000, 10,000}, and γ is taken from {10^{-e} | e ∈ {1, ..., 5}}.
Table 1. Comparison of parameter choice methods and ensemble weighting methods (k = 32, n = 1,000). For the locally optimized, globally optimized, and randomly chosen parameters, the mean squared prediction error is evaluated for five turbines (repeated ten times). For the weighting of the ensemble members, the weights 1, 1/MSE, and 1/MSE² are tested. Furthermore, the runtime for training is given. The least error reached for each turbine is printed in bold, the least runtime in italic.

               Locally Optimized                Globally Optimized               Random
Turbine        1      1/MSE  1/MSE²  Time       1      1/MSE  1/MSE²  Time       1      1/MSE  1/MSE²  Time
Cheyenne       7.84   7.84   7.84    175.93s    7.87   7.87   7.86    607.56s    12.38  8.14   7.69    45.77s
Lancaster      8.89   8.89   8.89    161.37s    9.04   9.04   9.03    513.66s    13.85  9.52   8.81    36.80s
Palm Springs   6.12   6.12   6.11    221.60s    6.13   6.12   6.12    476.66s    7.75   6.04   5.96    41.68s
Vantage        5.63   5.63   5.63    151.42s    5.67   5.67   5.67    525.88s    8.41   6.19   5.75    37.68s
Yucca Valley   10.29  10.29  10.29   226.89s    10.44  10.43  10.43   614.51s    10.59  10.10  10.05   51.07s
Furthermore, we compare the different weighting methods for each of the three algorithm variants. The results are presented in Table 1: for five wind parks, the prediction error (MSE) is compared for the three parameter choice methods. LSE- and BSE-based weights are not listed because of their poor prediction performance. The results show that a random choice of the weak predictors' parameters is better in most cases while providing a much shorter runtime than the optimized variants. This behaviour may be surprising at first, but it complies with the intuition behind diversification [12] and random forests [2]. Another consideration is the possible overfitting of the optimized variants. Therefore, we use the computationally cheap random variant with 1/MSE² weighting in the following.
3.3 Experiment 2: Number of Weak Predictors and Samples
Because of the stochastic sampling of the training subsets, the expected prediction performance depends on the number of weak predictors. Furthermore, it is non-deterministic, so we have to analyze the properties of our algorithm for varying parameters. Figure 1 (a) shows the behaviour of our algorithm depending on the sample size n for a wind park near Palm Springs using k ∈ {8, 32, 64} weak predictors: when increasing the number n of samples used for each weak predictor, the prediction error decreases. The standard deviation also decreases: e.g., for k = 32, the standard deviation is 0.30 for sample size n = 100 and is reduced to 0.07 for n = 1,000. Every prediction error is given as the mean of 25 repeated runs. For the other parks, our approach shows the same behaviour. Figure 1 (b) shows the dependency of the prediction error on the number k of weak predictors used. The results show that a larger k greatly decreases the prediction error, and the standard deviation is reduced, too: e.g., for k = 5 the standard deviation is 0.58, decreasing to 0.10 for k = 100. Thus, one can only expect reliable results with sufficiently large k and n.
Fig. 1. (a) Prediction error (MSE) for a wind park near Palm Springs depending on the sample size n, for k ∈ {8, 32, 64}. (b) Prediction error (mean of 25 runs) and standard deviation for a wind park near Lancaster depending on the number of predictors k.
3.4 Experiment 3: Comparison with State-of-the-Art Predictors
Table 2 shows the comparison of our approach to the kNN and SVR algorithms, which are state-of-the-art prediction algorithms. In order to give a fair comparison for the kNN model, the number of neighbors was first optimized using a 3-fold cross-validation on the training set. The times for the cross-validation as well as the training times using the best number of neighbors are given. The SVR approach is far too expensive to justify any cross-validation; therefore, we only searched for a good parameter guess on 10% of the training data and used the parameters C = 10,000 and γ = 10⁻⁵ in the comparison. Our SVR ensemble approach uses k = 64 and n = 2,000. In four of five cases, our proposed prediction algorithm outperforms the kNN and SVR algorithms. A conclusion cannot be drawn without considering the runtime: the training and testing with our algorithm needs only a few minutes and gives better results than the SVR algorithm, which requires thousands of seconds. However, the kNN algorithm is faster for the given data and often gives a good first guess.
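The kNN baseline with 3-fold cross-validation can be sketched as follows (our code; the candidate neighbor counts are an assumption, as the paper only states that the number of neighbors was optimized by cross-validation):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Candidate neighbor counts (assumed; not listed in the paper).
param_grid = {"n_neighbors": [2, 5, 10, 20, 50, 100]}

knn_search = GridSearchCV(KNeighborsRegressor(), param_grid,
                          cv=3, scoring="neg_mean_squared_error")
knn_search.fit(X_train, y_train)          # cross-validation time in Table 2
best_knn = knn_search.best_estimator_     # refit on the full training set
test_error = ((best_knn.predict(X_test) - y_test) ** 2).mean()  # test MSE
```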
Table 2. Comparison of the SVR ensemble regressor (SVRENS, n = 2,000, k = 64, repeated 10 times) with state-of-the-art regressors. Runtimes are given in seconds; the error is the test MSE. For every turbine, the best result is printed in bold.

               kNN                               SVR                         SVRENS
Turbine        CV      Train  Test   Error      Train    Test    Error      Train   Test    Error
Cheyenne       319.33  0.21   42.76  7.85       1671.55  154.94  7.70       303.12  146.07  7.54
Lancaster      209.24  0.60   26.69  8.97       2067.55  126.43  9.98       266.35  115.30  8.87
Palm Springs   122.59  1.54   14.50  6.06       1907.32  87.56   7.59       252.69  83.88   6.12
Vantage        307.44  0.73   40.90  5.77       2194.81  164.78  8.33       224.04  122.77  5.55
Yucca Valley   147.36  0.08   19.67  10.35      536.80   115.43  10.40      276.51  121.70  10.20
4 Conclusions
The integration of wind power into the smart grid is only possible with a precise forecast computed in a reasonable time. To improve the prediction performance and reduce the computational cost, we presented an SVR ensemble method using bagging and weighted averaging. We showed that we obtain the best results when using a random parameter choice for the weak predictors. The number of weak predictors and the sample size have to be sufficiently large in order to provide reliable forecast accuracy. Compared to the state-of-the-art prediction methods kNN and SVR, the prediction error can be decreased while offering a reasonable runtime. As future work, we plan to analyze other methods for the optimization and diversification of the weak predictors in order to improve the prediction performance further.
References

1. E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
2. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
3. P. Chakraborty, M. Marwah, M. F. Arlitt, and N. Ramakrishnan. Fine-grained photovoltaic output prediction using a Bayesian ensemble. In AAAI Conference on Artificial Intelligence, 2012.
4. Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In International Conference on Machine Learning, volume 96, pages 148–156, 1996.
5. T. Gneiting, A. E. Raftery, A. H. Westveld, and T. Goldman. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5):1098–1118, 2005.
6. H.-C. Kim, S. Pang, H.-M. Je, D. Kim, and S. Y. Bang. Constructing support vector machine ensemble. Pattern Recognition, 36(12):2757–2767, 2003.
7. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
8. L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.
9. B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, 2002.
10. T. L. Thorarinsdottir and T. Gneiting. Probabilistic forecasts of wind speed: ensemble model output statistics by using heteroscedastic censored regression. Journal of the Royal Statistical Society: Series A (Statistics in Society), 173(2):371–388, 2010.
11. N. A. Treiber, J. Heinermann, and O. Kramer. Aggregation of features for wind energy prediction with support vector regression and nearest neighbors. In European Conference on Machine Learning, Workshop Data Analytics for Renewable Energy Integration, 2013.
12. I. W. Tsang, A. Kocsor, and J. T. Kwok. Diversified SVM ensembles for large data sets. In European Conference on Machine Learning, pages 792–800, 2006.