arXiv:1509.05142v4 [cs.LG] 14 Mar 2016
Fast Gaussian Process Regression for Big Data
Sourish Das¹, Sasanka Roy², and Rajiv Sambasivan¹
¹Chennai Mathematical Institute, ²Indian Statistical Institute
March 15, 2016
Abstract
Gaussian Processes are widely used for regression tasks. A known limitation in the application of Gaussian Processes to regression tasks is that the computation of the solution requires performing a matrix inversion. The solution also requires the storage of a large matrix in memory. These factors restrict the application of Gaussian Process regression to small and moderate size datasets. We present an algorithm based on empirically determined subset selection that works well on both real world and synthetic datasets. We also compare the performance of this algorithm with two other methods that are used to apply Gaussian Process regression to large datasets. On the synthetic and real world datasets used in this study, the algorithm demonstrated sub-linear time and space complexity. The accuracy obtained with this algorithm on the datasets used for this study is comparable to what is achieved with the two other methods commonly used to apply Gaussian Processes to large datasets.
1 Introduction
Gaussian Processes are attractive tools to perform supervised learning tasks on complex datasets on which traditional parametric methods may not be effective. They are also easier to use in comparison to alternatives like neural networks (Rasmussen [2006]). When Gaussian Processes are applied to model regression problems, exact inference is possible. When Gaussian Processes are applied to classification, we need to resort to approximate inference techniques such as Markov Chain Monte Carlo, Laplace Approximation or Variational Inference. Even though exact inference is possible for Gaussian Process regression, the computation of the solution requires matrix inversion. For a dataset of size $n$, the time complexity of matrix inversion is $O(n^3)$. The space complexity associated with storing a matrix of size $n$ is $O(n^2)$. This restricts the applicability of the technique to small or moderate sized datasets.

In this paper we present an algorithm that uses subset selection and ideas borrowed from bootstrap aggregation to mitigate the problem discussed above. Parallel implementation of this algorithm is also possible and can further improve performance.

The rest of this paper is organized as follows. In section 2, we discuss the problem context. In section 3, we present our solution to the problem. Our solution is based on combining estimators developed on subsets of the data. The selection of the subsets is based on simple random sampling with replacement (similar to what is done in the bootstrap). The size of the subset selected is a key aspect of this algorithm. This is determined empirically. We present two methods to determine the subset size. We present the motivating ideas leading to the final form of the algorithm in terms of elementary propositions and lemmas. We then provide a proof of correctness for the algorithm. Results from experiments to study the effect of algorithm parameters are presented. In section 4, we illustrate the application of our algorithm to synthetic and real world datasets. Typically, Gaussian Processes (GP) are applied to datasets with a few hundred to a few thousand data instances. We applied the algorithm developed in this work to datasets with over a million instances. The number of features in these datasets ranged from one (time series) to forty. Stochastic Variational Inference (Hensman et al. [2013]) and Sparse Gaussian Processes (Snelson and Ghahramani [2005]) are two recent techniques that have been used to apply Gaussian Processes to large datasets. We compare the performance (accuracy and running times) of the proposed method to these methods. In section 5, we present a brief summary of prior work. Applying Gaussian Processes to large datasets has attracted a lot of interest from the Machine Learning research community since this method is both practical and effective on complex datasets. The approach presented in this work falls in the category described as "subset of data" in the research literature (see Rasmussen and Williams [2005, Chapter 8]). It is characterized by selecting a subset of data from the dataset to develop the GP model. Other researchers have taken theoretical approaches to subset selection, while this work presents empirical methods to do this. In section 6, we present the conclusions from this work.
2 Problem Formulation
A Gaussian Process $y$ with additive noise can be represented as:

$$y = f(x) + \eta.$$

Here:
• $y$ represents the observed response
• $x$ represents an input vector of covariates
• $\eta$ represents the noise. The noise terms are assumed to be independent and identically distributed (IID) random variables drawn from a Normal distribution with variance $\sigma_n^2$.
• $f$ represents the function being modelled. It is a Multivariate Normal with mean function $\mu(x)$ and covariance function $\sigma(x, x')$.

If we want to make a prediction for the value of the response at a new prediction point $X_*$ ($X_* \in X_{test}$, $X_{test}$ being the test set), then the predictive equations are given by (see [Rasmussen and Williams, 2005]):

$$f_* \mid X, y, X_* \sim N\left(\bar{f}_*, \mathrm{cov}(f_*)\right), \tag{1}$$
$$\bar{f}_* = E[f_* \mid X, y, X_*] = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y, \tag{2}$$
$$\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*). \tag{3}$$

Here:
• $f_*$ is the value of the function at the test point $X_*$
• $K(X_*, X)$ represents the covariance between a test point and the training set
• $K(X, X)$ represents the covariance matrix for the training set
• $I$ is the identity matrix

Equation (2) is the key equation to make predictions. An inspection of equation (2) shows that it requires the computation of an inverse. Computation of the inverse has $O(n^3)$ time complexity. This is one of the bottlenecks in the application of Gaussian Processes. Calculation of the solution also requires the storage of the matrix $(K(X, X) + \sigma_n^2 I)$. This is associated with a space complexity of $O(n^2)$. This is the other bottleneck associated with Gaussian Processes. If we are to use the entire training set, we are restricted to applying Gaussian Process regression to moderately sized datasets. For example, on an Intel i7 laptop with 8 GB of RAM, we ran out of memory even with a dataset with only 10000 rows.
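The predictive equations (2) and (3) can be illustrated with a minimal NumPy sketch. It assumes a zero mean function, an RBF kernel and fixed hyperparameters; the helper names `rbf_kernel` and `gp_predict` and all numeric values are illustrative choices, not part of the original work. The sketch solves linear systems through a Cholesky factor rather than forming the inverse explicitly, which is numerically preferable but still cubic in the number of training points.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def gp_predict(X, y, X_star, noise_var=0.1):
    """Predictive mean (eq. 2) and covariance (eq. 3) for a zero-mean GP."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))  # K(X, X) + sigma_n^2 I: O(n^2) memory
    K_s = rbf_kernel(X_star, X)                        # K(X_*, X)
    K_ss = rbf_kernel(X_star, X_star)                  # K(X_*, X_*)
    L = np.linalg.cholesky(K)                          # O(n^3) factorisation
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v
    return mean, cov


# Toy data (assumed for illustration): noisy samples of sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
X_star = np.linspace(-2.0, 2.0, 5)[:, None]
mean, cov = gp_predict(X, y, X_star, noise_var=0.01)
```

The matrix `K` is the object whose storage and factorisation become prohibitive as the number of rows grows.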
3 Proposed Solution
Since GP regression is both effective and versatile on complex datasets in comparison to parametric methods, a solution to the bottlenecks mentioned above will enable us to apply GP regression to large, complex datasets. We run into such datasets routinely in this age of big data. Our solution to applying Gaussian Process regression to large datasets is based on developing estimators on smaller subsets of the data. The size of the subset and the desired accuracy are key parameters for the proposed solution. The accuracy is specified in terms of an acceptable error threshold, $\epsilon$. We pick $K$ smaller subsets of size $N_s$ from the original dataset (of size $N$) in a manner similar to bootstrap aggregation. We then fit a Gaussian Process to each subset. The GP fit procedure includes hyperparameter selection through a maximum likelihood estimation procedure (using a gradient descent type method). We want a subset size such that when we combine the estimators developed on the $K$ subsets, we have an estimator that yields an acceptable prediction error. Note that the proposed solution requires a space complexity of $O(K N_s^2)$, which was easily handled on a standard laptop (8 GB RAM).

The intuition for the proposed approach is that in many real world large datasets, a large proportion of data points are redundant in terms of developing a GP estimator. We posited that by a judicious choice of the subset size we could obtain estimators that we could combine to produce an estimator that performed at an acceptable level. To combine estimators, we simply average the predictions from each of the $K$ estimators. The time complexity for fitting a Gaussian Process regression on the entire dataset of size $N$ is $O(N^3)$. The algorithm presented above requires the fitting of $K$ Gaussian Process regression models to smaller datasets (of size $N_s$). The time complexity then becomes $O(K N_s^3)$.

We present two methods to determine the subset size:

1. Estimating the subset size using statistical inference. For a dataset of size $N$, the subset size is expressed as:

$$N_s = N^{\delta}, \quad \text{where } 0 < \delta < 1.$$

$\delta$ is a random variable that is determined based on inference of a proportion using a small sample. The details of this method are presented later in this section.

2. Estimating the subset size using an empirical estimator. We use the following lemma and proposition to derive this empirical estimator.

Lemma 1. The subset size $N_s$ should be proportional to the size of the dataset, $N$:

$$N_s \propto N$$

Proof. The proof is by contradiction. If the sample size is not an increasing function of $N$ then it implies that a single sample size would suffice for a dataset of any size. Since increasing $N$ may introduce more variation in the data, we would need to increase the number of samples ($N_s$) to maintain the same level of variance in our estimates for a test point $x_*$. Please see the note below for how this connects to the posterior predictive variance for a test point.

Note 1. For a given kernel, the variance associated with the estimate for a test point $x_*$ is given by:

$$V[f_*] = K(x_*, x_*) - K(x_*, x)\left(K(x, x) + \sigma_n^2 I\right)^{-1} K(x, x_*)$$

The first term in this equation is a non-negative scalar. The second term represents a quadratic form. Since $(K + \sigma_n^2 I)^{-1}$ is positive definite, the second term increases as the number of samples (sample size) increases. So increasing the sample size provides us a method of decreasing the variance.

Proposition 1. The sample size $N_s$ should be a decreasing function of the desired error rate ($\epsilon$). This means that decreasing the desired error rate should increase the sample size:

$$N_s \propto \frac{1}{g(\epsilon)}, \quad \text{where } g(\epsilon) \text{ is an increasing function.}$$
Application of the above lemma and proposition yields the following estimator for sample size:

$$N_s = \frac{N^{\delta(N)}}{g(\epsilon)}.$$

Here:
• $\delta(N)$ is a function that characterizes the fact that an increase in $N$ should increase the sample size $N_s$.
• $g(\epsilon)$ is a function that characterizes the fact that the sample size should increase as $\epsilon$ (the desired error level) decreases.

The algorithm is summarized in Algorithm 1.

Algorithm 1: Gaussian Process Regression Using Resampling
  input : A dataset D of size N, δ, K
  output: An estimator f̂_resampled that combines the estimators fitted from resampling
  for i ← 1 to K do
      /* select a sample from D. Two ways of selecting the sample size are presented */
      D_i ← SampleWithReplacement(D, δ);
      /* A kernel is fit for each sample. Hyperparameter selection is done for each sample */
      f̂_i ← FitGP(D_i);
  end
  /* the estimate for a point x ∈ D_test (the test dataset) is the average of the estimates from the K estimators fitted above */
  f̂_resampled(x) ← (1/K) Σ_{i=1}^{K} f̂_i(x)

We now need to show that the estimator obtained from the above algorithm is a valid estimator. To do so, we need the following preliminaries.

Assumption 1. The function being modeled is in the Reproducing Kernel Hilbert Space of the kernel $\sigma$. The reproducing property of the kernel can be expressed as:

$$f(t) = \langle f(\cdot), \sigma(\cdot, t) \rangle \quad \forall t \in T$$

$T$ represents the index set to select the predictor instances.
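Algorithm 1 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the kernel is a fixed RBF with hand-picked hyperparameters (the paper selects hyperparameters by maximum likelihood for every sample), the response is centred by its sample mean, and the names `fit_gp` and `resampled_gp` are hypothetical.

```python
import numpy as np


def rbf(A, B, lengthscale=1.0):
    """RBF kernel with unit variance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)


def fit_gp(X, y, noise_var=0.05):
    """Fit one GP on a subset; hyperparameters are fixed (an assumption)."""
    y_mean = y.mean()
    K = rbf(X, X) + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(K, y - y_mean)
    return lambda X_star: y_mean + rbf(X_star, X) @ alpha


def resampled_gp(X, y, delta, K=30, seed=0):
    """Average K GP predictors, each fitted on N^delta points drawn with replacement."""
    rng = np.random.default_rng(seed)
    N = len(X)
    n_s = max(2, int(round(N**delta)))       # subset size N_s = N^delta
    predictors = []
    for _ in range(K):
        idx = rng.integers(0, N, size=n_s)   # simple random sampling with replacement
        predictors.append(fit_gp(X[idx], y[idx]))
    # Combined estimator: the mean of the K individual predictions.
    return lambda X_star: np.mean([f(X_star) for f in predictors], axis=0)


# Toy data (assumed): 20000 noisy samples of sin(x).
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(20000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20000)
f_resampled = resampled_gp(X, y, delta=0.3, K=30)
X_test = np.linspace(-2.0, 2.0, 9)[:, None]
pred = f_resampled(X_test)
```

Each of the K fits touches only an N_s by N_s matrix, which is where the memory and time savings come from. The K fits are independent of each other, so they could also be run in parallel.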
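Algorithm 1 makes use of $K$ estimators, $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_K$. Each estimator is associated with a reproducing kernel, $\sigma_1, \sigma_2, \ldots, \sigma_K$. We show that the estimator resulting from combining these $K$ estimators is a valid estimator.

Lemma 2. The estimator $\hat{f}_{rm}$ obtained from the individual estimators $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_K$ using

$$\hat{f}_{rm}(t) = \frac{1}{K} \sum_{i=1}^{K} \hat{f}_i(t)$$

is associated with the following kernel:

$$\sigma_{rm} = \frac{1}{K} \sum_{i=1}^{K} \sigma_i$$

Proof. From Assumption 1, the kernel associated with each estimator has the reproducing property. So the following equation holds:

$$\langle f(\cdot), \sigma_1(\cdot, t) \rangle = \langle f(\cdot), \sigma_2(\cdot, t) \rangle = \ldots = \langle f(\cdot), \sigma_K(\cdot, t) \rangle = f(t)$$

Consider the kernel $\sigma_{rm} = \frac{1}{K} \sum_{i=1}^{K} \sigma_i$ and the inner product $\langle f(\cdot), \sigma_{rm} \rangle$. The inner product can be written as:

$$\begin{aligned}
\langle f(\cdot), \sigma_{rm} \rangle &= \frac{1}{K} \Big[ \overbrace{\langle f(\cdot), \sigma_1(\cdot, t) \rangle + \ldots + \langle f(\cdot), \sigma_K(\cdot, t) \rangle}^{K \text{ times}} \Big] \\
&= \frac{1}{K} \Big[ \overbrace{f(t) + \ldots + f(t)}^{K \text{ times}} \Big] \\
&= \frac{1}{K} K f(t) \\
&= f(t)
\end{aligned}$$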
We now show that the estimator $\hat{f}_{rm}(t) = \frac{1}{K} \sum_{i=1}^{K} \hat{f}_i(t)$ will converge to the true solution given enough samples.

Theorem 1. Suppose $F$ denotes the sampling distribution function of the estimator using all the data with a GP prior, and $F_n$ denotes the bootstrap empirical distribution function of the bootstrap estimate constructed using Algorithm 1. Then for every $\epsilon > 0$,

$$\lim_{n \to \infty} P\left\{ \sup_{t \in \mathbb{R}} |F_n(t) - F(t)| < \epsilon \right\} = 1$$

i.e.,

$$\|F_n - F\|_{\infty} = \sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \longrightarrow 0 \quad \text{almost surely}$$
Proof. The result is a direct consequence of the Glivenko-Cantelli Theorem (see [Pollard, 1984]).

So now we have established that if we have a subset size of the form $N_s = N^{\delta}$, then the Gaussian Process fitted using the above algorithm will approximate the Gaussian Process fitted with all the data. The choice of $\delta$ is therefore critical: it should minimize $N_s$ while still giving an appropriate approximation of the Gaussian Process with all the data.

Estimating Sample Size on the Basis of Inference of a Proportion

Since $\delta$ takes values between 0 and 1, it can be interpreted as a proportion. We treat it as a random variable that can be inferred from a small sample. To estimate $\delta$, we do the following. We pick a small sample of the original dataset (say a few thousand instances, by simple random sampling). We start with a small value of $\delta$ and check if the prediction error with this value of $\delta$ is acceptable. If not, we increment $\delta$ until we arrive at a $\delta$ that yields an acceptable error. This procedure yields the smallest $\delta$ value that produces an acceptable error on this sample. Since the size of this sample is small, the above procedure can be performed quickly. This technique yielded reliable estimates of $\delta$ on both synthetic and real world datasets.
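The incremental search for δ described above can be sketched as a small loop. The function name `smallest_acceptable_delta`, the starting value, the step size and the toy error curve are all assumptions made for illustration; in practice `error_for_delta` would run the resampling algorithm on the pilot sample and return its test error.

```python
def smallest_acceptable_delta(error_for_delta, threshold,
                              start=0.05, step=0.05, max_delta=0.95):
    """Return the first delta on the grid whose error is within the threshold.

    error_for_delta is a caller-supplied function (hypothetical here) that
    returns the prediction error observed on the pilot sample for a given delta.
    Returns None if no delta on the grid is acceptable.
    """
    n_steps = int(round((max_delta - start) / step))
    for i in range(n_steps + 1):
        delta = round(start + i * step, 10)   # avoid floating point drift on the grid
        if error_for_delta(delta) <= threshold:
            return delta
    return None


# Toy stand-in for the pilot-sample error: it decreases as delta grows.
def toy_error(delta):
    return 0.5 * (1.0 - delta) ** 4


chosen_delta = smallest_acceptable_delta(toy_error, threshold=0.1)
no_delta = smallest_acceptable_delta(toy_error, threshold=0.0)
```

With the toy error curve the search stops at δ = 0.35, the first grid value whose error falls below 10%. A finer step gives a smaller δ at the cost of more fits on the pilot sample.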
Empirical Estimation of the Subset Size

Application of Lemma 1 and Proposition 1 yields the following:

$$N_s = \frac{N^{\delta(N)}}{g(\epsilon)}$$

Here $\delta(N)$ is a function that characterizes the fact that an increase in $N$ should increase the sample size $N_s$. $g(\epsilon)$ is a function that characterizes the fact that the sample size should increase as $\epsilon$ decreases. The next section describes the effect of the parameters $N$, $\delta$ and $\epsilon$ on the proposed algorithm. To motivate the derivation of an expression for $N_s$, we need to use one of the observations that are discussed in the next section. This observation was that the increase in sample size needed to maintain a predefined error rate was a slowly growing function. Figure 2 shows the $\delta$ required to produce a 10% error for different dataset sizes, $N$, on one of the datasets studied in this work. It is evident that $\delta(N)$ should decrease very slowly as $N$ increases. We expect that we would need smaller values of $\delta$ as $N$ increases to produce an acceptable level of error.

After an extensive empirical study we found that choosing $\delta(N) = \frac{1}{\log(\log(N))}$ works well on all the real world and synthetic datasets we studied. The experiments for the empirical study included the household power consumption dataset and the synthetic dataset described in section 4.1. The household power consumption dataset represents voltage readings at a one minute sampling rate from an individual household; the synthetic data was generated using a sinusoidal function (with added noise). The effectiveness of the algorithm was also studied on various higher dimensional datasets. The number of features in these datasets ranged from four to forty while the number of instances ranged from the thousands to over a million. An exposition of the effect of the parameters is easier to illustrate on one dimensional datasets. To study the effect of $\delta$ on the dataset size, $N$, the number of bootstrap iterations was held at a constant level ($K = 30$) and the $\delta$ required to limit the test error ($\epsilon$) to less than 10% was observed for various values of $N$.

$g(\epsilon)$ is a monotonically increasing function that characterizes the fact that the sample size should increase as the acceptable error threshold decreases (e.g. $g(\epsilon) = \epsilon, \sqrt{\epsilon}, \epsilon^{\frac{1}{10}}, \ldots$). An optimal choice of $g(\epsilon)$ is an area of future work. For the experiments reported in this work, $g(\epsilon) = \epsilon^{\frac{1}{10}}$ worked well.

Effect of the Parameters

The subset size estimator has the following parameters:
• $N$, the size of the dataset
• $\delta$, which determines the sample size for the algorithm, $N^{\delta}$
• $K$, the number of bootstrap iterations
• $\epsilon$, the acceptable error (error threshold)

We used the Mean Absolute Percentage Error (MAPE) as the error metric for this study. This is defined as:

$$MAPE = \frac{1}{N} \sum_{y \in Y} \frac{|y_{actual} - y_{estimate}|}{y_{actual}}$$
Note that the MAPE provides a measure of the average residual in relation to the actual value of the response. An alternate metric to develop this estimator would be the RMSE. The error in this section refers to the test error (error observed on the test set).
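The two error metrics mentioned above can be computed as follows. This is a small illustrative sketch; it divides by the absolute value of the actual response, which coincides with the definition above whenever the responses are positive (an assumption made here), and the sample values are made up.

```python
import math


def mape(actual, estimate):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return sum(abs(a - e) / abs(a) for a, e in zip(actual, estimate)) / len(actual)


def rmse(actual, estimate):
    """Root mean square error, in the units of the response."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimate)) / len(actual))


y_actual = [100.0, 200.0, 400.0]
y_estimate = [110.0, 180.0, 400.0]
mape_value = mape(y_actual, y_estimate)   # (0.1 + 0.1 + 0.0) / 3
rmse_value = rmse(y_actual, y_estimate)   # sqrt((100 + 400 + 0) / 3)
```

Because the MAPE is relative to the actual value, it becomes unstable when actual responses are close to zero, whereas the RMSE does not have this issue.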
Figure 1: $\epsilon$ versus $K$ for $N = 50000$, $\delta = 0.3$

We observed the following based on experiments performed on the household electricity consumption and the synthetic dataset. These datasets are described in the next section.

1. The error (on the test set) is very sensitive to the subset size, $N^{\delta}$, but not very sensitive to $K$. Once we arrive at a value of $\delta$ that produces an acceptable error, increasing $K$ decreases the error very slowly (see Figure 1). For this reason, there is no benefit in using $K$ values bigger than 30.

2. Since $K$ is not an interesting parameter, we examine the relationship between $N$, $\delta$ and $\epsilon$ for a fixed $K$ of 30.

Figure 2: $\delta$ versus $\log(\log(N))$ for $\epsilon = 0.1$

3. The $\delta$ required to produce a fixed level of $\epsilon$ decreases with increasing $N$. This relationship (for a real world dataset) is shown in Figure 2. This is the slowly varying relationship discussed in the previous section. It is used to derive the functional form of the subset size, $N_s$, empirically.
Figure 3: $\epsilon$ versus $\delta$ for $N = 1500$
4. The effect of increasing δ on the actual error is shown in Figure 3. As is evident, increasing δ decreases the error. However, there is a point beyond which increasing δ is not going to yield an appreciable reduction in error.
Figure 4: $\epsilon$ versus $N$ for a fixed $\delta = 0.3$

5. The effect of increasing $N$ on $\epsilon$ is shown in Figure 4. The behaviour seen here is consistent with the behaviour seen in the previous graph. Here we are increasing $N$, keeping $\delta$ constant. In the previous case, we increased $\delta$ keeping $N$ constant. Increasing the subset size results in smaller error, but there is a point beyond which increasing the subset size does not give us any appreciable reduction in error.
3.1 Standard Error of the Estimator
It is natural to ask what the standard error of the estimate provided by the proposed algorithm would be. In this section we address this question. The standard error of our estimate is the standard deviation of the bootstrap estimates.

Construction 1. The sample for each estimator is drawn independently (see Algorithm 1). Hence the estimators are independent of each other.

Assumption 2. The variance of each of the bootstrap estimators is finite.

Theorem 2. Under Construction 1 and Assumption 2, the variance of the estimator derived from combining $K$ resampled estimators ($\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_K$) is $\frac{1}{K^2}$ times the sum of the variances of the individual estimators.

Proof. The individual estimators $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_K$ are associated with error components $\hat{\epsilon}_1, \hat{\epsilon}_2, \ldots, \hat{\epsilon}_K$. Let the variance associated with each of these error components be $\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2$. The variance of the combined estimator $\hat{f}_{rm}$ is then obtained as follows:

$$\begin{aligned}
Var(\hat{f}_{rm}) &= Var\left[\frac{1}{K}(\hat{f}_1 + \hat{f}_2 + \ldots + \hat{f}_K)\right] \\
&= \frac{1}{K^2} Var\left[\hat{f}_1 + \hat{f}_2 + \ldots + \hat{f}_K\right] \\
&= \frac{1}{K^2} \left(Var[\hat{f}_1] + Var[\hat{f}_2] + \ldots + Var[\hat{f}_K]\right) \\
&= \frac{1}{K^2} \left(\sigma_1^2 + \sigma_2^2 + \ldots + \sigma_K^2\right)
\end{aligned}$$

Also, since each of the $K$ terms in the sum is at most $\sigma_{max}$,

$$Var(\hat{f}_{rm}) = \frac{1}{K^2} \left(\sigma_1^2 + \sigma_2^2 + \ldots + \sigma_K^2\right) \le \frac{1}{K^2} K \sigma_{max} = \frac{\sigma_{max}}{K} \le \sigma_{max},$$

where $\sigma_{max} = \max(\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2)$.

4 Application of the Algorithm
The above algorithms were applied to large datasets, both synthetic and real. The number of features in these datasets ranged from one to forty. Illustration of the algorithm is easier on one dimensional datasets. For this reason the results of applying the algorithm are presented in two parts. The first illustrates various aspects of the algorithm on very large one dimensional datasets. The second provides the results of applying the algorithm on datasets where the number of features ranged from four to forty. The performance of the algorithm is compared with the Stochastic Variational method and the Sparse GP method. Both Sparse GP and Stochastic Variational GP require a subset size as input. The algorithms are compared on the basis of the subset size obtained from the first method (sample size estimated from statistical inference of a proportion). The subset size from the empirical method is not very different from the subset size obtained from the first method; therefore, for brevity, the comparison is made on the basis of the subset size from the first method. For these datasets, 70% of the dataset was used for training and 30% was used for testing. The datasets are described in the sections that follow. All results are based on the test set. The GPy package (GPy [2014]) was used for this work. This package provides the software abstractions for the kernel and optimization procedures. The experiments were performed on an Intel i7 laptop with 8 GB of memory running an Ubuntu operating system.

The examples illustrate subset size estimation using both the methods discussed above:

1. Using $N_s = N^{\delta}$, where the $\delta$ required to produce an acceptable error is estimated from a small sample ($N = 2000$).

2. Using the refinement, $N_s = \dfrac{N^{\frac{1}{\log(\log(N))}}}{\epsilon^{\frac{1}{10}}}$.
4.1 One Dimensional Datasets
A Synthetic Dataset

A synthetic dataset was constructed using the following function:

$$y = 500 \sin\left(\frac{2\pi}{500} x\right) + \epsilon, \quad \text{where } \epsilon \sim N(0, 75)$$

The predictor variable $x$ was generated using the following Normal distribution:

$$x \sim N(100, 50)$$
Figure 5: Fit Using Empirical Sample Estimate
Figure 6: Fit Using Small Sample Estimate
Table 1: Synthetic Dataset Results

GP Method                                 Subset Size   Running Time (s)   RMSE       MAPE      Max Memory (GB)
Bootstrap Method, δ = 0.2, K = 30         17            45.81              77.1673    0.1875    ∼3
SVIGP                                     17            0.2581             420.063    0.9445    ∼0.4
Sparse GP                                 17            573.71             74.8921    0.20072   ∼8.2
Bootstrap Method, Error Threshold = 0.2   21            52.06              77.3396    0.1904    ∼3
The RBF (Radial Basis Function) kernel was used for this dataset. The fit obtained on the test set is shown in Figure 5 and Figure 6. The training set was 2 million points and the test set was 600 K instances. The results of applying the algorithms presented above to this synthetic dataset are shown in Table 1. The running time reflects the time required to train and predict. Inspection of Table 1 shows that the proposed method is comparable in accuracy to the SVIGP and Sparse GP solutions. The acceptable error threshold was set at 20%. For the small sample estimation method a sample of size 2000 was used to estimate the δ required. This dataset required δ to be 0.2 to achieve the acceptable error threshold. With $N_s = N^{0.2}$ and a fixed $K$, the $O(K N_s^3)$ time and $O(K N_s^2)$ space costs grow as $N^{0.6}$ and $N^{0.4}$ respectively, so our algorithm was sublinear in time and space complexity.

Individual Household Electric Power Consumption

The individual household power consumption dataset [Lichman, 2013] from the UCI repository was used for this work. This dataset represents electricity usage measurements at a one minute sampling rate for an individual household over a period that spans almost four years. The above algorithm was used to predict the voltage attribute. The dataset has over 2 million instances. A training set of 1.43 million instances was used (the test set is the remainder). This dataset required a combination of an RBF kernel and a Bias kernel: the kernel was the sum of a Bias kernel and an RBF kernel.
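A sum of an RBF kernel and a Bias kernel can be illustrated with plain NumPy. The sketch below is not the GPy implementation used in this work; it assumes that a bias kernel contributes a constant covariance between every pair of points, and the function names and parameter values are illustrative. The constant term lets the GP represent a response that fluctuates around a non-zero level.

```python
import numpy as np


def rbf_kernel(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)


def bias_kernel(A, B, variance=1.0):
    """Constant covariance: every pair of points shares the same offset term."""
    return variance * np.ones((len(A), len(B)))


def rbf_plus_bias(A, B, rbf_variance=1.0, lengthscale=1.0, bias_variance=1.0):
    """Sum kernel: a sum of valid kernels is again a valid kernel."""
    return rbf_kernel(A, B, rbf_variance, lengthscale) + bias_kernel(A, B, bias_variance)


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(30, 1))
gram = rbf_plus_bias(X, X, rbf_variance=2.0, lengthscale=1.5, bias_variance=4.0)
min_eigenvalue = float(np.linalg.eigvalsh(gram).min())
```

The resulting Gram matrix is symmetric and positive semi-definite up to round-off, and its diagonal equals the sum of the two variances.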
Figure 7: Fit Using Empirical Sample Estimate
Figure 8: Fit Using Small Sample Estimate
Table 2: Household Power Consumption Results

GP Method                                  Subset Size   Running Time (s)   RMSE       MAPE      Max Memory Used (GB)
Bootstrap Method, δ = 0.2, K = 30          18            40.81              3.239      0.01036   ∼2.5
SVIGP                                      18            240.83             240.8392   0.99988   ∼0.5
Sparse GP                                  18            782.95             3.2348     0.01032   ∼9
Bootstrap Method, Error Threshold = 0.05   23            49.96              3.2419     0.01038   ∼2.5
The fit obtained is shown in Figure 7 and Figure 8. The acceptable error threshold was set at 5%. The running time and the error rates are shown in Table 2. Inspection of Table 2 shows that the proposed method is comparable in accuracy to the SVIGP and Sparse GP solutions. As with the synthetic dataset, the time shown in Table 2 is the time taken for training and prediction. The δ required for the small sample was estimated with a sample size of 2000. This dataset required δ to be 0.2 to achieve the acceptable error threshold. This implies our algorithm was sublinear in time and space complexity.
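The sublinearity claim can be made concrete with a short calculation based on the cost model stated earlier: K matrices of N_s by N_s entries in memory and on the order of K N_s^3 operations for fitting. The dataset size below is an illustrative round number rather than a value from the tables, and rounding N^δ to the nearest integer is an assumption.

```python
def resampling_cost(n, delta, k):
    """Subset size and leading-order costs for K GP fits on N^delta points each."""
    n_s = round(n ** delta)
    return {
        "subset_size": n_s,
        "memory_entries": k * n_s**2,   # O(K * N_s^2) kernel matrix entries
        "time_ops": k * n_s**3,         # O(K * N_s^3) fitting operations
    }


n = 2_000_000
cost = resampling_cost(n, delta=0.2, k=30)
full_memory_entries = n**2   # O(N^2) for a single GP on all the data
full_time_ops = n**3         # O(N^3) for a single GP on all the data
```

For these values the subset size is 18, so the resampling approach needs 9720 kernel entries and about 1.7e5 fitting operations, against 4e12 entries and 8e18 operations for the full problem. For fixed K the costs scale as N^(2δ) and N^(3δ), which is sublinear in N whenever δ is below 1/3.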
4.2 Higher Dimensional Datasets
The results of applying the proposed algorithm to higher dimensional datasets are presented in this section. For each dataset, the running time (for training and prediction), the RMSE (Root Mean Square Error) and the MAPE (Mean Absolute Percentage Error) are presented. Five datasets obtained from various public domain data repositories were used for this part of the study. The combined cycle power plant dataset was obtained from the UCI machine learning repository [Lichman, 2013]. The airline dataset is available from the Bureau of Transportation Statistics (http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236). The airline dataset reports flight delays for various domestic flights in the US. The data used for this study included flight delays from November 2013 through October 2014. The data from the Bureau of Transportation Statistics is available as multiple individual files that need to be collated. A collated version of the data is available at http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/CombinedFlights.csv. All the other datasets were obtained from the LIAD (Laboratory of Artificial Intelligence and Decision Support, University of Porto) repository. The kernels for these datasets are provided in Table 8.

The airline dataset has been studied in the context of applying Gaussian Processes to big data by Hensman et al. [2013], who used Stochastic Variational Inference on flight delay data from 2008. Our work uses data from a later period, but of comparable size. RMSE was the metric of accuracy in [Hensman et al., 2013]. This dataset has outliers. Most delays are small; however, there are a few records with very large delays. Including the outliers makes the MAPE an unsuitable metric for this dataset (the MAPE is over 1 for all methods). The functional form of the empirical estimator for subset size requires the MAPE to be less than one. Therefore the result from this estimator is not included in the results. There are alternate ways to model this problem so that the MAPE is less than one; however, we chose to stay consistent with the methodology used in [Hensman et al., 2013] (where a single kernel was used for the entire dataset) so that a comparison of the RMSE from this study to that observed by Hensman et al. [2013] is possible. It should be noted that the data for this study is from a later period and the predictor set is slightly different (this study does not include the aircraft age attribute). The predictors of arrival delay for this study were: day of the week, day of the month, month, arrival time, departure time, distance and travel time. The procedure used for this dataset is identical to what was applied to all other datasets reported in this study. Notably, there was no tuning of the optimizer parameters.

An inspection of Table 3 through Table 7 shows that for all the higher dimensional datasets, the accuracy of the proposed method is comparable to two of the most recent methods to scale Gaussian Processes to large datasets. In terms of running times, the performance of the proposed method was better in many cases.

Table 3:
Results: Delta Elevators, Training Size: 6661, Test Size: 2855, Number of Features: 6

GP Method                                Subset Size   Total Time (s)   RMSE       MAPE
Bootstrap Method, δ = 0.3, K = 30        15            5.84             0.002127   0.0375
SVIGP                                    15            134.370          0.002164   0.1587
Sparse GP                                15            1.36             0.002373   0.1223
Bootstrap Method, Err. Threshold = 0.1   18            5.58             0.002115   0.0552

Table 4:
Results: California Housing, Training Size: 14447, Test Size: 6192, Number of Features: 8

GP Method                                 Subset Size   Total Time (s)   RMSE   MAPE
Bootstrap Method, δ = 0.3, K = 30         18            5.77             0.51   0.0348
SVIGP                                     18            243.27           3.46   0.2323
Sparse GP                                 18            5.85             0.33   0.0213
Bootstrap Method, Err. Threshold = 0.15   22            5.45             0.45   0.03048

Table 5:
Results: Power Plant, Training Size: 6697, Test Size: 2871, Number of Features: 4

GP Method                                 Subset Size   Total Time (s)   RMSE     MAPE
Bootstrap Method, δ = 0.3, K = 30         15            7.13             6.82     0.0122
SVIGP                                     15            125.58           449.73   0.9889
Sparse GP                                 15            13.21            4.27     0.0072
Bootstrap Method, Err. Threshold = 0.05   19            6.35             7.192    0.01288

Table 6:
Results: Ailerons, Training Size: 5007, Test Size: 2146, Number of Features: 40

GP Method                                 Subset Size   Total Time (s)   RMSE        MAPE
Bootstrap Method, δ = 0.3, K = 30         13            2.67             0.0004052   0.4242
SVIGP                                     13            0.57             0.0004224   0.4002
Sparse GP                                 13            1.42             0.0004213   0.4194
Bootstrap Method, Err. Threshold = 0.45   14            2.93             0.0004052   0.4237
Table 7:
Results: Airline Dataset, Training Size: 991888, Test Size: 425096, Number of Features: 7

GP Method                            Subset Size   Total Time (s)   RMSE    MAPE
Bootstrap Method, δ = 0.21, K = 30   19            43.55            42.28   1.2334
SVIGP                                19            214.48           41.69   1.8113
Sparse GP                            19            203.55           41.62   1.9247
Table 8: Kernels for the Datasets

Dataset              Kernel
Airline              sum of RBF, Linear and Bias
Ailerons             sum of RBF and Bias
California Housing   sum of RBF, Linear and Bias
Power Plant          sum of RBF and Bias
Delta Elevators      sum of RBF and Bias

5 Review of Prior Work
[Rasmussen and Williams, 2005, Chapter 8] provides a detailed discussion of the approaches used to apply Gaussian Process regression to large datasets. Quiñonero-Candela et al. [2007] is another detailed review of the various approaches to applying Gaussian Processes to large datasets. More recently, variational approaches have been applied to this problem (see Hensman et al. [2013]).

The approach presented in this work belongs to the approximation technique described as the "subset of data" approach. As the name suggests, the subset of data approach is characterized by picking a small subset of $m$ data points from the dataset of size $n$. Since $m \ll n$, picking the subset of $m$ points from the dataset is clearly a problem to be addressed. The number of ways of picking $m$ points out of the $n$ point dataset is large when $n$ is large. The Informative Vector Machine (Lawrence et al. [2003]) provides one method of selecting this subset using a differential entropy score. An approach based on the information gain (KL Divergence) is presented in Seeger et al. [2003]. Sparse GP and Stochastic Variational GP also select a small subset of points from the large dataset to fit a GP. This work provides two empirical methods to obtain this subset of points. The estimates for the subset size obtained from these methods should provide some qualitative insights into the subset size we would need for datasets in the same domain. Refinement of this empirical estimate to characterize various datasets is an area for future work. Combining the approach used in this study with Sparse GP or Stochastic Variational GP is also an approach that needs to be examined in future work.

A key difference between the approach presented in this paper and approaches that find low rank approximations to the kernel Gram matrix is that our method estimates multiple kernels, one for each sample from the dataset. This is different from approaches such as the Nyström approximation, which approximate a single Gram matrix.
6
Discussion of Results
This work presents an algorithm that selects a subset of data points from the original dataset to fit a Gaussian Process regression model. We presented two methods to select this subset. The first requires experimentation on a small subset of the data to pick an algorithm parameter, which is then used to apply the algorithm to the larger dataset. This method requires some work, but the results we observed on synthetic and real world data sets indicate that it could be both reliable and fast. The second method presents an approach that the modeller can use to pick a sample size without experimentation. It should be viewed as a methodology that can be adapted to other similar datasets. Proving that a desired error bound can be achieved with high probability using the sample size estimated from this work is an open problem and an area for future work.
A key insight from this work is that for many real world datasets (time series and higher dimensional datasets), combining estimators developed on samples obtained by simple random sampling can match the accuracy of methods like Sparse GP or Stochastic Variational GP. Further, the tuning effort associated with hyperparameter selection for this method is minimal compared to Stochastic Variational GP: the default settings of the optimizer in the GPy package worked. This work used the GPy library for implementation. The running time of the proposed method was shorter than the running time of Sparse GP or Stochastic Variational GP for many of the larger datasets used for this study. The implementation simplicity of the proposed method is an attractive feature from the standpoint of applying Gaussian Processes to large scale regression problems.
The experimental results on the large real world and synthetic datasets used for this work suggest that, from the perspective of developing a Gaussian Process regression model, a large proportion of data points may be redundant. This is consistent with work such as Lawrence et al. [2003]. This redundancy can be exploited to apply Gaussian Process regression to large datasets.
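The idea of combining estimators fitted on small random samples can be illustrated with a short sketch. This is not the implementation used in this work: it uses plain NumPy instead of GPy, fixes the kernel hyperparameters rather than optimizing them, and assumes that the estimators are combined by simple averaging. The function names, subset size and number of estimators are hypothetical choices made for the illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def gp_predict_mean(x_train, y_train, x_test, noise=0.1):
    """Posterior mean of a zero-mean GP with fixed (assumed) hyperparameters."""
    # The noise term keeps the matrix positive definite even when sampling
    # with replacement produces duplicate training points.
    k = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    return rbf_kernel(x_test, x_train) @ alpha

def bagged_gp_predict(x, y, x_test, subset_size, n_estimators, seed=0):
    """Average the predictions of GPs fitted on random subsets of the data."""
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_estimators):
        # Simple random sampling with replacement, as in the bootstrap.
        idx = rng.integers(0, len(x), size=subset_size)
        predictions.append(gp_predict_mean(x[idx], y[idx], x_test))
    # Assumption: the estimators are combined by a plain average.
    return np.mean(predictions, axis=0)

# Synthetic one-dimensional example (illustrative data only).
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-3.0, 3.0, size=(5000, 1))
y = np.sin(x[:, 0]) + 0.1 * data_rng.standard_normal(5000)
x_test = np.linspace(-2.5, 2.5, 50)[:, None]

y_hat = bagged_gp_predict(x, y, x_test, subset_size=100, n_estimators=10)
rmse = float(np.sqrt(np.mean((y_hat - np.sin(x_test[:, 0]))**2)))
print(f"RMSE against the noise-free target: {rmse:.3f}")
```

Each estimator only ever factorizes a matrix of size subset_size by subset_size, so the cubic cost applies to the subset size rather than to the full dataset, and the estimators are independent of one another and could be fitted in parallel.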
References
Adler, R. (1990). An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. IMS.
Ghosal, S. and Roy, A. (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, pages 2413–2429.
GPy (2012–2014). GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.
Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence, pages 282–290. auai.org.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The Big Data Bootstrap. ArXiv e-prints.
Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
Lawrence, N., Seeger, M., and Herbrich, R. (2003). Fast sparse Gaussian process methods: The informative vector machine. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems, number EPFL-CONF161319, pages 609–616.
Lichman, M. (2013). UCI machine learning repository.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer Series in Statistics. Springer-Verlag.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.
Quiñonero-Candela, J., Rasmussen, C. E., and Williams, C. K. (2007). Approximation methods for Gaussian process regression. Large-scale kernel machines, pages 203–224.
Rasmussen, C. E. (2006). Gaussian processes for machine learning. MIT Press.
Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Seeger, M., Williams, C. K. I., and Lawrence, N. D. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics 9.
Snelson, E. and Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264.
Snelson, E. L. (2007). Flexible and Efficient Gaussian Process Models for Machine Learning. PhD thesis, University of London.
Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12(11):2719–2741.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer.