Computational Techniques and Applications Conference: CTAC97, World Scientific

Can MARS be improved with B-splines?

S. Bakin
School of Mathematical Sciences, Australian National University, Canberra, ACT 0200, Australia.
M. Hegland
Computer Sciences Laboratory, Australian National University, Canberra, ACT 0200, Australia.
M. Osborne
School of Mathematical Sciences, Australian National University, Canberra, ACT 0200, Australia.
1. Problem formulation
Let $y$ represent a single response variable that depends on a vector of $n$ predictor variables $x = (x_1, \ldots, x_n)$. Assume we are given $M$ data points $x_m = (x_{1m}, \ldots, x_{nm})$, $m = 1, \ldots, M$, in the predictor domain and response values
$$y_m = f(x_m) + \varepsilon_m, \qquad (1.1)$$
where $f(\cdot)$ is unknown but assumed smooth in $E^n$ (Euclidean $n$-space) and the $\{\varepsilon_m\}$ are mean-zero random variables. The goal is to determine an estimate $\hat{f}(\cdot)$ that is a reasonable approximation of $f(\cdot)$ over the predictor domain.
2. Friedman's MARS
There exist many techniques aimed at the solution of the problem outlined in the previous section (see, for instance, [1, 2, 6, 8]). However, when it comes to modeling functions of many variables ($\geq 10$, say) on the basis of large datasets, these methods may run into problems of computational feasibility and interpretability of the resulting models. The Multivariate Adaptive Regression Splines (MARS) algorithm developed by Friedman [4] presents an alternative way of dealing with the multivariate function approximation problem. For a particular dataset it chooses a relatively small subbasis from a (typically very large) $n$-variate complete tensor product spline basis constructed from truncated power basis functions of the form $(x - t)_+^q$, with knots $t$ chosen from the observed values of the corresponding components. Typically, the $j$-th basis function can be expressed by the following formula:
$$T_j(x) = \prod_{k=1}^{K_j} \left[ s_{kj}\,(x_{v(k,j)} - t_{kj}) \right]_+^q \qquad (2.1)$$
Here $K_j$ is the number of factors (or the interaction level) in the basis function $T_j$; $s_{kj}$ takes on the two values $+1, -1$; $v(k,j)$ labels the predictor variable associated with the corresponding factor of $T_j$, and $t_{kj}$ is a knot location for the variable $x_{v(k,j)}$. The exponent $q$ is the order of the spline approximation. The MARS algorithm uses a forward/backward stepwise strategy to produce a set of basis functions (2.1). The forward part is recursive. Each iteration constructs two new basis functions and adds them to the current model. This forward stepwise procedure is continued until a relatively large number of basis functions are included.

2.1. The Forward Stepwise Procedure of MARS. The forward stepwise procedure starts with the single basis function
$$T_0(x) = 1. \qquad (2.2)$$
After the $J$-th iteration there are $2J + 1$ basis functions
$$\{T_j(x)\}_0^{2J} \qquad (2.3)$$
in the model, each of the form (2.1). The $(J+1)$-st iteration adds two new basis functions
$$T_{2J+1}(x; l, v, t) = T_l(x)\,[+(x_v - t)]_+^q, \qquad T_{2J+2}(x; l, v, t) = T_l(x)\,[-(x_v - t)]_+^q. \qquad (2.4)$$
Here $T_l(x)$ is one of the $2J + 1$ already chosen basis functions (2.3), $0 \leq l \leq 2J$, $v$ is one of the predictor variables (not present in $T_l(x)$), and $t$ is the knot location of that variable. The three parameters $l, v, t$ defining the two new basis functions are chosen to be those that provide the most improvement in the least squares fit. The nature of the MARS algorithm allows for straightforward incorporation of restrictions on the level of interaction between predictor variables. For example, if one allows only the unity function (2.2) to serve as a parent for new basis functions, then the resulting model will be additive.

2.2. The Backward Elimination Procedure of MARS. The forward stepwise procedure of MARS builds up a model comprised of $J_{\max}$ tensor product basis functions. The number $J_{\max}$ is chosen to be substantially larger than would be optimal. The reason for doing this becomes clear from the observation that the MARS algorithm only enters basis functions that can be derived from those produced earlier. Thus, basis functions entered earlier may prove to be suboptimal; their role is to serve as ingredients for later functions. In order to give more complex functions an opportunity to enter the model, the forward stepwise procedure is allowed to produce an excess number of basis functions. To get rid of the suboptimal functions one has to apply the second module of the MARS algorithm, which is essentially a standard statistical backward subset selection algorithm [7]. This part of MARS selects a submodel which minimizes an estimate of prediction error; the Generalized Cross-Validation criterion (GCV) [5] is suggested:
$$\mathrm{GCV}(J) = \frac{\dfrac{1}{M}\sum_{m=1}^{M}\left[y_m - \hat{f}_J(x_m)\right]^2}{\left[1 - C(J)/M\right]^2}. \qquad (2.5)$$
The numerator in GCV is the average residual squared error and the denominator is a penalty term that reflects model complexity. The forward and backward procedures produce a model of the form
$$\hat{f}(x) = a_0 + \sum_{j=1}^{J} a_j \prod_{k=1}^{K_j} \left[ s_{kj}\,(x_{v(k,j)} - t_{kj}) \right]_+^q, \qquad (2.6)$$
where the coefficients $\{a_j\}_{j=0}^{J}$ are estimated using a linear least squares algorithm. The model can be used directly to estimate future response values given a set of predictor points. However, in this form the model is hard to interpret. This drawback can be removed by a rearrangement of the terms in (2.6):
$$\hat{f}(x) = a_0 + \sum_{K_j = 1} f_i(x_i) + \sum_{K_j = 2} f_{ij}(x_i, x_j) + \sum_{K_j = 3} f_{ijk}(x_i, x_j, x_k) + \cdots. \qquad (2.7)$$
The first term in (2.7) collects together all basis functions that involve only one variable ($K_j = 1$), the second sum collects together all basis functions involving only two variables ($K_j = 2$), and so on. The representation of the MARS model given by (2.7) is called an ANOVA decomposition since it breaks up the model into main (additive) effects and interaction effects of various orders. Each individual function in (2.7) is called an "ANOVA function" and is an expansion in tensor product functions involving identical predictor variable sets. The ANOVA decomposition identifies the variables that enter the model, whether they contribute additively or are involved in interactions, the order of the interaction effects, and the particular variables that participate in them. It is worth noting that first and second order ANOVA functions are easily visualized through standard graphical techniques.
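To make the pieces of this section concrete, the following Python sketch (an illustration, not Friedman's implementation) evaluates tensor product basis functions of the form (2.1), fits the coefficients of (2.6) by linear least squares, and scores the model with the GCV criterion (2.5). The complexity cost $C(J)$ is taken here simply as the number of columns of the design matrix, which is a simplifying assumption; MARS charges an additional cost per knot.

```python
import numpy as np

def tp_basis(X, factors, q=1):
    """Evaluate one tensor-product basis function (2.1) at the rows of X.

    factors is a list of (s, v, t) triples: sign s in {+1, -1}, predictor
    index v and knot t.  Returns the product of truncated powers
    [s * (x_v - t)]_+^q over all factors.
    """
    out = np.ones(X.shape[0])
    for s, v, t in factors:
        out *= np.maximum(s * (X[:, v] - t), 0.0) ** q
    return out

def fit_and_gcv(X, y, basis_list, q=1, cost_per_basis=1.0):
    """Least-squares fit of model (2.6) and its GCV score (2.5).

    basis_list is a list of basis functions, each given by its factor list;
    the constant function T_0 = 1 is always included.  C(J) is modelled as
    cost_per_basis * (number of basis functions), an assumption rather than
    the exact complexity measure used by MARS.
    """
    M = len(y)
    B = np.column_stack([np.ones(M)] +
                        [tp_basis(X, f, q) for f in basis_list])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    resid = y - B @ coef
    C = cost_per_basis * B.shape[1]
    gcv = np.mean(resid ** 2) / (1.0 - C / M) ** 2
    return coef, gcv

# Tiny usage example: a reflected pair of truncated linear functions in x_0,
# as produced by one step of (2.4).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 2.0 * np.maximum(X[:, 0] - 0.5, 0.0) + 0.1 * rng.standard_normal(200)
coef, gcv = fit_and_gcv(X, y, [[(+1, 0, 0.5)], [(-1, 0, 0.5)]])
print(coef, gcv)
```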
3. BMARS - MARS based on B-splines
The MARS algorithm is based on truncated power basis functions. It is known that such a basis may lead to ill-conditioned linear systems of equations [3], whereas some other bases for representing spline approximations, such as, for instance, B-splines, have superior numerical properties. In this section we describe a MARS-like algorithm that utilizes B-splines. In principle, one could implement BMARS using B-splines of any order. However, we utilized B-splines of the second order [3],
$$B_{2,t}(x_i) = (t - t_{-2})\,[t_{-2}, t_{-1}, t]\,(\cdot - x_i)_+\,,$$
where $t_{-2}, t_{-1}, t$ are knots and $[t_{-2}, t_{-1}, t](\cdot - x_i)_+$ denotes the divided difference at $t_{-2}, t_{-1}, t$ of the bivariate function $(s - x_i)_+$ with respect to its first variable [3]. The reason for using this type of function is two-fold: firstly, utilization of the simplest continuous B-splines significantly simplifies implementation of the BMARS algorithm and, secondly, approximation with piecewise linear models is more advantageous from the statistical point of view.
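For the second-order (piecewise linear) case used here, the divided-difference definition above reduces to the familiar "hat" function. A minimal Python sketch of its evaluation (the function name and knot naming are ours, purely for illustration):

```python
import numpy as np

def linear_bspline(x, t_minus2, t_minus1, t):
    """Order-2 (piecewise linear) B-spline on the knots t_minus2 < t_minus1 < t.

    Equivalent to (t - t_minus2) times the second divided difference of
    (s - x)_+ over the three knots: a 'hat' that rises linearly from 0 at
    t_minus2 to 1 at t_minus1 and falls back to 0 at t.
    """
    x = np.asarray(x, dtype=float)
    left = (x - t_minus2) / (t_minus1 - t_minus2)
    right = (t - x) / (t - t_minus1)
    return np.clip(np.minimum(left, right), 0.0, None)

# Example: the hat on knots 0.2, 0.5, 0.9 evaluated at a few points.
print(linear_bspline([0.1, 0.35, 0.5, 0.7, 0.95], 0.2, 0.5, 0.9))
```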
Assume that for each predictor variable univariate B-splines are introduced having various sizes of support intervals (or scales, as we will call them). Such B-splines can be constructed from a special sequence of sets of knots $\{S_{li}\}_{l=1}^{L}$, $i = 1, \ldots, n$. Each set $S_{li}$ contains $2^l$ knots placed at the $2^{-l} \cdot 100$ percentiles of the marginal distribution of the $i$-th predictor variable. The upper limit $L$ is some prespecified number. For each set of knots $S_{li}$ one can construct a set of univariate B-spline basis functions. It is easy to see that the scale of the B-splines associated with $S_{li}$ decreases as the index $l$ increases. The novelty of the BMARS algorithm is concerned with the order in which univariate B-spline functions of different scales are allowed to take part in forming tensor product basis functions.

3.1. Modified MARS algorithm - BMARS. Like the original MARS, our procedure consists of two phases: a forward stepwise procedure intended to construct a model made up of a large number of basis functions (and probably overfitting the data), and a backward stepwise procedure which removes suboptimal basis functions from the model produced at the previous stage. The forward procedure starts by following the MARS strategy, with the only difference being that only B-splines of the largest scale are allowed to form tensor product basis functions:
$$T_j(x) = \prod_{k=1}^{K_j} B_{2,\,t(k,j)}(x_{v(k,j)}). \qquad (3.1)$$
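As an illustration of the dyadic knot sets $S_{li}$ described before (3.1), here is a small Python sketch. Whether the boundary percentiles are included is not specified in the text, so the exact placement below is an assumption.

```python
import numpy as np

def knot_sets(x_i, L):
    """Dyadic percentile knot sets S_{l,i} for one predictor column x_i.

    Scale l (l = 1, ..., L) gets 2**l knots placed at the 2**-l * 100
    percentiles of the empirical marginal distribution of x_i.  The choice
    of quantile levels (multiples of 2**-l up to 1) is our assumption.
    """
    sets = {}
    for l in range(1, L + 1):
        qs = np.arange(1, 2 ** l + 1) / 2 ** l   # 2**l quantile levels
        sets[l] = np.quantile(x_i, qs)
    return sets

# Example: knot sets up to scale L = 3 for one uniformly distributed predictor.
rng = np.random.default_rng(1)
print(knot_sets(rng.uniform(size=500), L=3))
```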
To clarify this point, consider what the $(J+1)$-th iteration does in this case: after the $J$-th iteration there are $J + 1$ functions
$$\{T_j(x)\}_0^J \qquad (3.2)$$
in the model, each of the form (3.1). The $(J+1)$-st iteration adds one new basis function
$$T_{J+1}(x; l, v, s) = T_l(x)\,B_{2,s}(x_v). \qquad (3.3)$$
Here $T_l(x)$ is one of the $J+1$ already chosen basis functions (3.2), $0 \leq l \leq J$, $v$ is one of the predictor variables not present in $T_l(x)$, and $s$ is the label marking a univariate B-spline basis function of the variable $x_v$ at the currently active scale. The three parameters $l, v, s$ defining $T_{J+1}$ are chosen such that they provide the most improvement in the least squares fit. Proceeding in this way we are likely to reach the point where the approximating ability of the B-splines of large scale is exhausted. Indeed, basis functions of large scale are only able to approximate target functions whose values do not change dramatically over regions of much smaller scale than the large one. In order to determine the most appropriate moment for changing over to B-splines of smaller scale, we estimate the prediction accuracy of the current model by computing the Generalized Cross-Validation criterion (2.5). When the GCV score ceases to decrease we change over to the next smaller scale, which implies that from then on only B-spline functions of that scale are allowed to participate in the construction of new basis functions (3.1). The algorithm proceeds in this manner until the size of the model exceeds a predetermined level. It is worth noting that, because each new basis function is derived from an earlier basis function, B-splines of all scales may, in principle, appear as factors in any basis function (3.1). The backward elimination procedure is similar to that utilized in the original MARS. Like the original MARS procedure, BMARS produces a model of the form
$$\hat{f}(x) = a_0 + \sum_{j=1}^{J} a_j \prod_{k=1}^{K_j} B_{2,\,t(k,j)}(x_{v(k,j)}), \qquad (3.4)$$
where the regression coefficients $\{a_j\}_{j=0}^{J}$ are to be determined by a least squares fit. It should be noted that the rearrangement of the model (3.4) in terms of ANOVA functions (2.7) remains relevant. While working with models of functions having small scale features (like spikes or ridges) we found that a significant gain in accuracy can sometimes be achieved by adapting the knot locations of the tensor product basis functions. The adaption can be accomplished through a special optimization procedure that minimizes the sum of squared residuals with respect to the knot locations. The use of that procedure is optional and, strictly speaking, it is not part of the BMARS algorithm.

The first advantage of our strategy over the original MARS procedure is the significant reduction in computational cost. Indeed, at each step we allow only functions from a relatively small subset of all possible univariate B-splines (namely, B-splines of one particular scale) to form tensor product basis functions, whereas the MARS algorithm always considers all possible candidates. The second advantage is not as evident as the first one. It turns out that the scale-by-scale approach facilitates the separation of small scale features of the target function from the large scale ones. As our experiments have shown, this results in smoother approximating surfaces compared to those produced by an algorithm that strictly follows the MARS strategy.
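The following Python sketch outlines the scale-by-scale forward pass in the spirit of Section 3.1. It is a simplified illustration under our own assumptions (GCV complexity cost equal to the number of basis functions, boundary percentiles added so that complete linear B-splines exist at every scale, no restriction on repeated variables within a product); it is not the authors' implementation.

```python
import numpy as np

def hat(x, a, b, c):
    """Order-2 (piecewise linear) B-spline on knots a < b < c."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, None)

def gcv(B, y, cost_per_basis=1.0):
    """Least-squares fit of the design matrix B and GCV score (2.5)."""
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    resid = y - B @ coef
    C = cost_per_basis * B.shape[1]
    return np.mean(resid ** 2) / (1.0 - C / B.shape[0]) ** 2, coef

def bmars_forward(X, y, L=3, max_basis=20):
    """Greedy scale-by-scale forward pass in the spirit of Section 3.1.

    Starts with the constant function and, at the active scale, repeatedly
    multiplies an existing basis function by one univariate linear B-spline
    as in (3.3), keeping the product that most reduces the GCV score.  When
    GCV stops improving, the next (smaller) scale is activated.
    """
    M, n = X.shape
    basis = [np.ones(M)]                     # T_0(x) = 1
    best_gcv, _ = gcv(np.column_stack(basis), y)
    scale = 1
    while len(basis) < max_basis and scale <= L:
        best = None
        for parent in basis:
            for v in range(n):
                # dyadic percentile knots of the active scale; the 0th
                # percentile is included here so full hats exist (a simplification)
                qs = np.arange(0, 2 ** scale + 1) / 2 ** scale
                knots = np.quantile(X[:, v], qs)
                for a, b, c in zip(knots[:-2], knots[1:-1], knots[2:]):
                    cand = parent * hat(X[:, v], a, b, c)
                    score, _ = gcv(np.column_stack(basis + [cand]), y)
                    if best is None or score < best[0]:
                        best = (score, cand)
        if best is not None and best[0] < best_gcv:
            best_gcv = best[0]
            basis.append(best[1])            # accept the new basis function
        else:
            scale += 1                       # GCV stalled: move to a finer scale
    return np.column_stack(basis), best_gcv

# Example on a small synthetic additive target.
rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(300)
B, score = bmars_forward(X, y)
print(B.shape, score)
```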
4. Numerical examples
In this section we use the Prediction Squared Error (PSE) as a measure of the accuracy of a model. The PSE is defined by the formula
$$\mathrm{PSE} = \frac{\int (\hat{f} - f_{\mathrm{target}})^2\,dx}{\int (f - \bar{f})^2\,dx}\,.$$
Here $\hat{f}$ is the estimate of the target function $f$, and $\bar{f}$ is the mean value of the target function.

4.1. Example 1: A simple function of ten variables. This test example was taken from Friedman's paper on the MARS algorithm [4]. The function
$$f(x) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 \qquad (4.1)$$
Table 1. Performance of MARS and BMARS in Example 1.

         # bsfns   accuracy               CPU sec
BMARS    12        $5.0 \times 10^{-3}$    46
MARS     21        $1.16 \times 10^{-2}$   167
was considered in the ten-dimensional unit hypercube; that is, there were five noise variables having no influence on the response values. We formed a dataset of 1000 points by randomly generating covariates from a uniform distribution. The corresponding response values were generated according to formula (1.1) with $f(x)$ given by (4.1) and normally distributed errors, with the variance chosen so that the signal-to-noise ratio took the fairly low value of 2.4/1. Thus, the true underlying function accounted for 86% of the variance of the response. Table 1 compares the performance of the MARS and BMARS algorithms when run on this particular dataset. In both cases the maximum allowed order of interaction between variables was taken to be 2 (since the model with all possible interactions allowed gave less than a 1% increase in the value of the GCV score). The accuracy of the model found by BMARS is even better than that of the model found by MARS, and the former algorithm is more than three times as fast as the latter. This is due to the "scale-by-scale" strategy employed by the BMARS procedure. Figure 1 depicts all univariate and bivariate ANOVA functions (2.7) corresponding to the BMARS model of this example.
Figure 1. ANOVA functions of the model by BMARS of Example 1. [Panels: BMARS: X4; BMARS: X5; BMARS: X3; BMARS: X1 X2.]
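For readers who wish to reproduce the setup of Example 1 in outline, the sketch below generates data according to (1.1) and (4.1) and estimates the PSE of an arbitrary fitted model by Monte Carlo integration over the unit hypercube. The way the noise scale is derived from the 2.4/1 signal-to-noise ratio is our assumption about how that ratio was defined, not the paper's exact recipe.

```python
import numpy as np

def friedman(X):
    """The test function (4.1); only x_1..x_5 matter, x_6..x_10 are noise."""
    return (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20.0 * (X[:, 2] - 0.5) ** 2
            + 10.0 * X[:, 3] + 5.0 * X[:, 4])

def pse(f_hat, f_target, n_mc=100_000, dim=10, seed=0):
    """Monte Carlo estimate of the normalized PSE of Section 4:
    the integral of (f_hat - f_target)^2 over the unit hypercube divided by
    the integral of (f_target - mean)^2, both approximated by sample means.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_mc, dim))
    f = f_target(X)
    return np.mean((f_hat(X) - f) ** 2) / np.mean((f - f.mean()) ** 2)

# Example 1 data: 1000 uniform points, noisy responses as in (1.1).
rng = np.random.default_rng(3)
X = rng.uniform(size=(1000, 10))
sigma = friedman(X).std() / 2.4   # noise scale for a 2.4/1 signal-to-noise
                                  # ratio (one plausible reading of the text)
y = friedman(X) + sigma * rng.standard_normal(1000)

# PSE of the trivial constant predictor, for illustration only.
print(pse(lambda Z: np.full(len(Z), y.mean()), friedman))
```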
Table 2. Performance of MARS and BMARS in Example 2.

         # bsfns   accuracy              CPU sec
BMARS    43        $1.9 \times 10^{-3}$    74
MARS     46        $1.6 \times 10^{-2}$    76
4.2. Example 2: Modeling small scale features. In this example we study the ability of the BMARS algorithm to approximate smooth target functions with superimposed small scale features. For this purpose, the target function of two variables was taken to be
$$f(x) = \sin(x_1 x_2) + \exp\!\left(-(\|x - w_1\|/0.05)^2\right) + \exp\!\left(-(\|x - w_2\|/0.05)^2\right),$$
where $w_1 = (0.3, 0.3)$ and $w_2 = (0.6, 0.7)$ respectively. The dataset was comprised of 1500 covariates generated from a uniform distribution and the corresponding response values, with no noise added. Both the MARS and BMARS algorithms were applied to this dataset. The knot location optimization procedure was run after the BMARS model had been built up. The performance details for both algorithms are summarized in Table 2. The plots of the resulting approximating surfaces are shown in Figure 2.
Figure 2. BMARS and MARS approximating surfaces of Example 2. [Panels: BMARS: X1 X2; MARS: X1 X2.]
Although the knot location optimization procedure slowed down the BMARS algorithm, the resulting model was almost free of wiggles and proved to be nearly ten times as accurate as the model produced by MARS.
5. Conclusion
In this paper we have considered algorithmic and performance issues related to a new Multivariate Adaptive Regression Splines algorithm based on B-splines (BMARS).
The discussion and numerical examples in the previous sections show that the use of B-spline basis functions together with the "scale-by-scale" strategy of the BMARS procedure allows one to achieve a significant reduction in computational cost, especially when modeling smooth functions. There is also strong evidence that the BMARS algorithm is able to outperform MARS in terms of the accuracy achieved in a variety of situations.
6. Acknowledgements
We are most grateful to Dr. B. Turlach for very fruitful discussions. The research of S. Bakin was supported by the Australian Government (Overseas Postgraduate Research Scholarship), by the Australian National University (ANU PhD Scholarship) and also by the ACSys CRC, Australia.
References
[1] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regression Trees, Wadsworth, Belmont, California, 1984.
[2] Chen, Z., Beyond additive models: interactions by smoothing spline methods, Technical Report SMS-009-90, The Australian National University, 1990.
[3] Cox, M.G., Practical spline approximation, in Topics in Numerical Analysis, Lancaster, 1981, 79-112.
[4] Friedman, J.H., Multivariate Adaptive Regression Splines, The Annals of Statistics, 19 (1991), 1, 1-141.
[5] Friedman, J.H., Estimating functions of mixed ordinal and categorical variables, Technical Report No. 108, Stanford University, June 1991.
[6] Friedman, J.H. and Stuetzle, W., Projection Pursuit Regression, Journal of the American Statistical Association, 76, 817-823.
[7] Miller, A.J., Subset Selection in Regression, Chapman and Hall, 1990.
[8] Wahba, G., Spline Models for Observational Data, SIAM, Philadelphia, 1990.