A New Training Algorithm for the General Regression Neural Network

Timothy Masters and Walker Land
Abstract
The General Regression Neural Network (GRNN) is known to be widely effective for modeling and prediction, especially when a separate sigma weight is used for each predictor. However, the significant time required to execute the model, combined with the frequent presence of multiple local optima, makes this model difficult to train in many applications. This paper shows how differential evolution may be enhanced by direct gradient descent to produce a hybrid training algorithm that is both fast and effective.

Introduction

The General Regression Neural Network has been demonstrated to be extremely effective at solving a great variety of difficult function mapping and prediction problems. However, its widespread use has been impeded by the fact that the original single-sigma version suffers from the limitation of demanding one universal sigma, while the much more powerful multiple-sigma version can be difficult to train with reasonable speed. With moderately large datasets, as much as several minutes of CPU time may be required to compute the mean squared error for a single trial parameter set (sigma vector). Under these conditions, the optimization algorithm had better be good. The problem is further compounded by the fact that the parameter space is often filled with poor local minima. Moreover, the error function is often nearly flat, meandering slowly downward, twisting and turning along the way. This is not a friendly optimization problem. In [Masters, 1995] a new training algorithm for the GRNN was presented. Since that time, extensive use of this method in a variety of practical problems has revealed two characteristics of the algorithm:

• Its local behavior is even better than originally thought. In most situations it requires nearly the minimum number of function evaluations that could be expected, and its convergence to the nearest local minimum is usually rapid.

• Problems with inferior local minima are far more severe than originally expected. This algorithm seems to have virtually no global effectiveness, happily landing in whichever minimum happens to be most convenient.

The purpose of this paper is to demonstrate a new GRNN training algorithm that significantly improves global performance while usually costing relatively little in terms of total training time.

GRNN Error and its Derivatives

This section defines the General Regression Neural Network and its mean squared error, then demonstrates how the gradient and the diagonal of the Hessian of the error can be computed. Whenever we need to minimize a continuous function, the ability to efficiently compute partial derivatives of the function with respect to its parameters is invaluable. It should be noted that a simple extension of the methods shown here allows computation of the full Hessian matrix. However, the expense of this computation usually precludes its use. It is shown in the next section that the diagonal alone can be used in a rapidly converging algorithm. Let us begin by defining the GRNN. A vector x of p independent variables is used to predict a dependent scalar variable y. If we knew the joint density of these quantities, the prediction having minimum expected squared error would be given by the conditional expectation shown in Equation 1.
\hat{y}(\mathbf{x}) = E[y \mid \mathbf{x}] = \frac{\int_{-\infty}^{\infty} y \, f(\mathbf{x}, y) \, dy}{\int_{-\infty}^{\infty} f(\mathbf{x}, y) \, dy}   (1)

Naturally, we virtually never know the joint density. In practice we estimate this density by treating the quantity (x, y) as a single vector and using Parzen's method to estimate the joint density based on the set of training data. The mathematics is easiest (and performance usually best as well) if a Gaussian kernel is employed. Most applications require a separate sigma weight for each independent variable. This implies the distance functions shown in Equations 2 and 3, where x_i is the ith training vector.

D_i^2(\mathbf{x}) = \sum_{j=1}^{p} \left( \frac{x_j - x_{ij}}{\sigma_j} \right)^2   (2)

d_i^2(y) = \left( \frac{y - y_i}{\sigma_y} \right)^2   (3)

Parzen's estimator of the joint density is then given by Equation 4.

\hat{f}(\mathbf{x}, y) = \frac{1}{n \, c_x c_y} \sum_{i=1}^{n} e^{-D_i^2(\mathbf{x})} \, e^{-d_i^2(y)}   (4)

The two normalizing constants that appear in Equation 4 are needed to ensure that the joint density integrates to unity. The normalizer for y is shown in Equation 5, and that for x is defined in the corresponding multivariate way.

c_y = \sqrt{\pi} \, \sigma_y   (5)

When the predictor shown in Equation 1 is modified by replacing the exact densities with the Parzen approximators just shown, we arrive at Equation 6. The numerator and denominator of this expression are shown in Equations 7 and 8, respectively.

\hat{y}(\mathbf{x}) = \frac{\mathrm{num}(\mathbf{x})}{\mathrm{den}(\mathbf{x})}   (6)

\mathrm{num}(\mathbf{x}) = \frac{1}{n \, c_x} \sum_{i=1}^{n} y_i \, e^{-D_i^2(\mathbf{x})}   (7)

\mathrm{den}(\mathbf{x}) = \frac{1}{n \, c_x} \sum_{i=1}^{n} e^{-D_i^2(\mathbf{x})}   (8)

When the expressions for the numerator and denominator are inserted in Equation 6, the constants cancel and we are left with the fundamental equation of the GRNN shown in Equation 9.

\hat{y}(\mathbf{x}) = \frac{\sum_{i=1}^{n} y_i \, e^{-D_i^2(\mathbf{x})}}{\sum_{i=1}^{n} e^{-D_i^2(\mathbf{x})}}   (9)

In order to derive the partial derivatives of the mean squared error of the GRNN, it is helpful to express the GRNN's prediction using the simplified quantities just shown. Equations 10, 11, and 12 do this. The squared error associated with a single observation is given by Equation 13.

\hat{y} = \frac{A}{B}   (10)

A = \sum_{i=1}^{n} y_i w_i, \qquad w_i = e^{-D_i^2(\mathbf{x})}   (11)

B = \sum_{i=1}^{n} w_i   (12)

e = (\hat{y} - y)^2   (13)

Straightforward differentiation of Equation 13 gives the first and second partial derivatives of the error with respect to each sigma weight, as shown in Equations 14 and 15, respectively.

\frac{\partial e}{\partial \sigma_k} = 2 (\hat{y} - y) \frac{\partial \hat{y}}{\partial \sigma_k}   (14)

\frac{\partial^2 e}{\partial \sigma_k^2} = 2 \left[ \left( \frac{\partial \hat{y}}{\partial \sigma_k} \right)^2 + (\hat{y} - y) \frac{\partial^2 \hat{y}}{\partial \sigma_k^2} \right]   (15)

Twice differentiating the numerator and denominator of Equation 10 gives Equations 16 through 19. Note the use of the intermediate quantity v_{ik} = 2 (x_k - x_{ik})^2 / \sigma_k^3 to simplify later work.

\frac{\partial A}{\partial \sigma_k} = \sum_{i=1}^{n} y_i w_i v_{ik}   (16)

\frac{\partial B}{\partial \sigma_k} = \sum_{i=1}^{n} w_i v_{ik}   (17)

\frac{\partial^2 A}{\partial \sigma_k^2} = \sum_{i=1}^{n} y_i w_i \left( v_{ik}^2 - \frac{3 v_{ik}}{\sigma_k} \right)   (18)

\frac{\partial^2 B}{\partial \sigma_k^2} = \sum_{i=1}^{n} w_i \left( v_{ik}^2 - \frac{3 v_{ik}}{\sigma_k} \right)   (19)

All that remains is to apply to Equation 10 the rule for differentiating quotients. This is shown in Equations 20 and 21.

\frac{\partial \hat{y}}{\partial \sigma_k} = \frac{1}{B} \left( \frac{\partial A}{\partial \sigma_k} - \hat{y} \frac{\partial B}{\partial \sigma_k} \right)   (20)

\frac{\partial^2 \hat{y}}{\partial \sigma_k^2} = \frac{1}{B} \left( \frac{\partial^2 A}{\partial \sigma_k^2} - 2 \frac{\partial \hat{y}}{\partial \sigma_k} \frac{\partial B}{\partial \sigma_k} - \hat{y} \frac{\partial^2 B}{\partial \sigma_k^2} \right)   (21)
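To make the chain of Equations 9 through 21 concrete, the following sketch evaluates the GRNN prediction, its squared error, and the first and second derivatives with respect to each sigma weight for a single case. It is a minimal NumPy illustration of the formulas above, not the authors' original implementation; all function and variable names are our own.

import numpy as np

def grnn_derivatives(x, y, train_x, train_y, sigma):
    """Prediction, squared error, gradient, and Hessian diagonal for one case.

    Follows Equations 9 through 21: A and B are the numerator and denominator
    of the GRNN, and v_ik = 2 (x_k - x_ik)^2 / sigma_k^3 is the intermediate
    quantity used in Equations 16 through 19.
    """
    diff = (x - train_x) / sigma                 # (n, p) scaled differences
    w = np.exp(-np.sum(diff ** 2, axis=1))       # w_i = exp(-D_i^2)      (Eq. 9)
    A = np.sum(train_y * w)                      # Eq. 11
    B = np.sum(w)                                # Eq. 12
    yhat = A / B                                 # Eq. 10
    err = (yhat - y) ** 2                        # Eq. 13

    v = 2.0 * (x - train_x) ** 2 / sigma ** 3    # (n, p) intermediate v_ik
    dA = (train_y * w) @ v                       # Eq. 16, shape (p,)
    dB = w @ v                                   # Eq. 17
    d2A = (train_y * w) @ (v ** 2 - 3.0 * v / sigma)   # Eq. 18
    d2B = w @ (v ** 2 - 3.0 * v / sigma)               # Eq. 19

    dyhat = (dA - yhat * dB) / B                       # Eq. 20
    d2yhat = (d2A - 2.0 * dyhat * dB - yhat * d2B) / B # Eq. 21

    grad = 2.0 * (yhat - y) * dyhat                        # Eq. 14
    hess_diag = 2.0 * (dyhat ** 2 + (yhat - y) * d2yhat)   # Eq. 15
    return yhat, err, grad, hess_diag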
The previous presentation showed how to compute the error of a GRNN when the model is applied to a single observation, along with the first and second partial derivatives of the error with respect to the sigma weights. To compute these quantities for an entire training set, we simply sum the individual quantities. However, it must be remembered that the GRNN can be strongly biased toward optimism when a test case is included in the training set. Therefore, each test case must be temporarily removed from the training set by excluding it from the sums in Equations 11 and 12 and from all equations derived from them. This cross-validation procedure, along with a discussion of some of the efficiency issues and numerical pitfalls, can be found in [Masters, 1995].
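A minimal sketch of this cross-validated criterion follows, reusing the hypothetical grnn_derivatives function above: each case is scored against all other cases by excluding it from the kernel sums.

def grnn_training_error(train_x, train_y, sigma):
    """Leave-one-out mean squared error with its gradient and Hessian diagonal.

    Each case is scored against all OTHER cases, implementing the
    cross-validation described above.
    """
    n, p = train_x.shape
    total_err, total_grad, total_hess = 0.0, np.zeros(p), np.zeros(p)
    for i in range(n):
        mask = np.arange(n) != i          # exclude case i from the sums
        _, err, grad, hess = grnn_derivatives(
            train_x[i], train_y[i], train_x[mask], train_y[mask], sigma)
        total_err += err
        total_grad += grad
        total_hess += hess
    return total_err / n, total_grad / n, total_hess / n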
Quasi-Newton Optimization of the Sigma Vector

In the previous section we saw how to compute the error of a GRNN applied to a training set, along with the first and second partial derivatives of the error with respect to the sigma weights. We now explore how this information may be used to find a set of sigma weights that minimizes the training-set error of the GRNN.

The most straightforward use of derivative information is to move in a straight line from a starting point to another point that has lower error. This basic procedure is the foundation of most sophisticated optimization algorithms. In particular, suppose we are currently at some point that we will call p, and we decide to search from p in a direction s. We move along the line determined by that point and direction using a parameter t, seeking the value of t that minimizes e(p + ts). Let g be the gradient vector at p, and let H be the Hessian matrix. If the error surface is quadratic, Newton's method tells us that the minimum is found at the value of t given by Equation 22.

t = -\frac{\mathbf{g}^T \mathbf{s}}{\mathbf{s}^T \mathbf{H} \mathbf{s}}   (22)

This equation nearly always provides an excellent approximation to the line minimum. However, it sometimes fails, usually because the directional second derivative in the denominator is wildly inaccurate or even negative. Such inaccuracy is made even more likely when we compute only the diagonal of H, setting the rest of the matrix to zero. Nevertheless, computation of the full matrix is extremely expensive and virtually never justified for GRNN training. Experience indicates that as long as basic precautions are taken, use of just the diagonal is widely effective. A good precaution is to compute the heuristic shown in Equation 23. If Newton's estimate in Equation 22 is within an order of magnitude of the heuristic, use Newton's estimate; otherwise use the heuristic.

(23)

The value of a good initial estimate of the line minimum is that it facilitates rapid location of the true minimum. In the absence of a good starting point, the most expensive part of line minimization is usually the initial bounding of a minimum. But when we can immediately jump to the vicinity of the minimum, we can almost always bound the true minimum in just three function evaluations, then use Brent's algorithm or a relative to move rapidly to the bottom.

How can we make use of this efficient line minimization to find optimal sigma weights? [Masters, 1995] shows how it can be integrated into the conjugate gradient algorithm. Extensive experience with practical problems shows that this method is extremely effective at converging to the nearest local minimum. The technique of using Equation 22, bounded by the heuristic in Equation 23, locates the minimum to within about 20 percent of t most of the time, an exemplary figure. Unfortunately, this same practical experience shows that the nearest local minimum is very frequently inferior to the global minimum, which may be far away. The next section shows how the recently developed technique of differential evolution may be hybridized with line minimization to produce an algorithm that combines the excellent global search strategy of stochastic methods with the efficiency of deterministic descent.
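As an illustration of the safeguarded Newton estimate, the sketch below computes Equation 22 from the gradient and the Hessian diagonal. Because Equation 23 is not reproduced in this text, the fallback used here is our own stand-in for the paper's heuristic; only the order-of-magnitude test follows the rule described above.

import numpy as np

def newton_line_step(g, hess_diag, s):
    """Initial estimate of the line-minimum parameter t (Equation 22).

    Uses only the diagonal of the Hessian. The fallback below is an assumed
    stand-in for the paper's Equation 23 heuristic (not reproduced here):
    a gradient-scaled step used whenever the directional second derivative
    is non-positive or the Newton step is out of range.
    """
    num = -np.dot(g, s)
    den = np.dot(s * hess_diag, s)                  # s^T diag(H) s
    fallback = abs(num) / (np.dot(s, s) + 1e-30)    # assumed heuristic, not Eq. 23
    if den <= 0.0:
        return fallback                             # negative curvature: distrust Newton
    t = num / den
    if not (0.1 * fallback <= abs(t) <= 10.0 * fallback):
        return fallback                             # outside an order of magnitude
    return t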
Differential Evolution With Line Minimization

[Price and Storn, 1997] reported on a variation of genetic optimization called differential evolution. This variation appears to be much more appropriate than traditional genetic methods when optimizing a multivariate function. It is especially valuable when the scalings of the different dimensions are not commensurate, a situation commonly found in GRNN training with poorly prescaled or highly correlated variables.

Unfortunately, differential evolution shares the principal weakness of all stochastic methods: it can arrive frustratingly close to the global minimum, then fail to converge to that minimum in a reasonable period of time. The problem is that these algorithms generally operate in total ignorance of the local properties of the function being minimized. Sometimes this is out of necessity, because the derivatives cannot be computed. More often it is because we must avoid excessive use of local information if we are to preserve the global quality of the search for the minimum. But when we do have easy access to local information, it often makes sense to make modest use of it. Such is the case when using differential evolution to train a GRNN.

Differential evolution is similar to ordinary genetic optimization in that it starts with a collection of parameter sets that we will call the source population. The individuals comprising this population are combined with each other via crossover and subjected to mutation to produce the members of the destination population. The members of the destination population, taken as a group, are generally expected to be superior to the members of the source population. By repeating this process enough times, the best member of the final population is hopefully close to the global optimum.

There are several important differences between traditional genetic optimization and differential evolution. Probably the most important difference is the nature of the mutation. In traditional genetic optimization, mutation takes the form of a random perturbation of a fixed type, such as flipping bits in a binary representation of a parameter set or adding random numbers to individual parameters. The problem with this approach is that it fails to account for the fact that what is a small perturbation for one parameter may be gigantic for another. Also, random bit flipping can be extremely destructive. Differential evolution avoids these problems by using the source population itself to determine the nature and degree of mutation. It does this by randomly selecting a pair of individuals and computing the difference between their parameter vectors. This difference vector is multiplied by a fixed constant (typically around 0.5) and added to the individual being mutated. When the optimization begins, the average difference will be about the same for all variables being optimized. But as generations pass, the differences will tend to adapt to the natural scaling of the problem. Variables having a large natural scale will be distributed over a larger range in the population, so mutations of these variables will also be relatively large. As convergence approaches, variables having a narrow, well defined range around the minimum will show small variation among the population members, so their mutations will be relatively small. This automatic adaptation significantly improves the behavior of the algorithm as convergence nears.

Another important difference is that differential evolution does not select parents based on fitness. Instead, fitness determines which children are kept. In particular, one parent, called the primary parent, is selected deterministically: each individual in the source population is chosen as a primary parent exactly once. The other parent, called the secondary parent, is chosen randomly. Two other individuals, which make up the differential pair, are also selected randomly. Their difference is multiplied by a small fixed constant, and this scaled difference vector is added to the secondary parent to induce mutation. Ordinary crossover is applied to the primary parent and the mutated secondary parent. The resulting child's fitness is compared with that of the primary parent, and the winner becomes a new member of the destination population. This entire process is illustrated in Figure 1.

Figure 1: Differential evolution.
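The child-creation step just described can be sketched as follows. This is an illustrative rendering of standard differential evolution; the scale and crossover parameters are chosen arbitrarily rather than taken from the paper.

import numpy as np

rng = np.random.default_rng()

def de_child(pop, i, scale=0.5, cross_prob=0.5):
    """Create one differential-evolution child for primary parent i.

    pop is an (n_individuals, n_params) array. A secondary parent is mutated
    by a scaled difference of a random pair, then crossed with the primary
    parent. The parameter values here are illustrative only.
    """
    n = len(pop)
    # choose secondary parent and differential pair, all distinct from i
    sec, d1, d2 = rng.choice([j for j in range(n) if j != i], 3, replace=False)
    mutant = pop[sec] + scale * (pop[d1] - pop[d2])   # difference mutation
    # uniform crossover between primary parent and mutated secondary parent
    mask = rng.random(pop.shape[1]) < cross_prob
    return np.where(mask, mutant, pop[i])

The child's fitness would then be compared with that of primary parent i, and the winner would be copied into the destination population.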
The GRNN training algorithm that is the subject of this paper adds one last twist to the differential evolution just described. After the child for the destination population is created (perhaps simply by copying the primary parent), a random yes/no decision is made. If the decision is yes (which has small probability, perhaps 0.1 or so), the derivatives of the child are computed and a single line minimization is performed along the gradient. This is a relatively expensive operation, often requiring as much as ten times the computational effort of a single function evaluation. However, because it does not happen often, the expense is easily tolerated. And when it does happen, it nearly always produces a very significant improvement in the error of the GRNN, so the quality of the gene pool in the population is greatly increased. This results in much more rapid convergence to the minimum. At the same time, the global nature of the search does not suffer, because each line minimization works toward the basin of attraction of its own starting point. Individuals that have different basins of attraction will still remain in their respective basins, allowing subsequent processing to sort out the best basin. When training GRNNs on most practical problems, multiple minima tend to be restricted to some variables, while other variables possess a nice bowl shape in their cross sections. The line minimization has a strong tendency to cause such well-behaved variables to converge rapidly to their best positions, leaving the stochastic component of the algorithm to deal with the problem variables. The process of creating a child using enhanced differential evolution is illustrated in Figure 2.

Figure 2: Creation of a child.

Numerical Results

A test was devised to compare the performance of several GRNN training methods. Monthly temperature and precipitation data spanning 83 years for a region of New York State were used to predict monthly precipitation from six predictors: temperature at lags of 1, 2, 11, and 12 months, and precipitation at lags of 1 and 12 months. Five training methods were tried, as described below; a sketch of the hybrid child-creation step used by the last method follows the list.

• GRAD — Direct descent to the nearest local minimum is performed by computing the gradient and Hessian diagonal and using this in the conjugate gradient algorithm as described in [Masters, 1995].

• GRAD-STEP — The previous algorithm is modified by adding the predictor variables in a stepwise fashion. The best single predictor is chosen. Then the second best, given the first, is appended. This is repeated until all predictors are included. This extremely slow procedure tends to have good global behavior if its cost can be tolerated.

• GENETIC — Ordinary genetic optimization (population size = 25) with straightforward crossover (probability = 0.8) and mutation (probability = 0.1) serves as a baseline stochastic method.

• DIFFEV — Ordinary differential evolution is performed as previously described.

• DIFFEV-DES — The preceding algorithm is modified by including direct descent with probability 0.1 as already described.
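As promised above, here is a sketch of the hybrid child-creation step used by DIFFEV-DES. It builds on the de_child sketch shown earlier; line_minimize, a routine performing the gradient-direction line search described in the previous section, is hypothetical.

def hybrid_generation(pop, fitness, descent_prob=0.1):
    """One generation of differential evolution with occasional direct descent.

    fitness(v) returns the cross-validated GRNN error for sigma vector v.
    line_minimize is assumed to perform the gradient-direction line search
    described earlier (e.g., built on newton_line_step); it is hypothetical.
    """
    new_pop = np.empty_like(pop)
    for i in range(len(pop)):               # every individual is primary once
        child = de_child(pop, i)
        if rng.random() < descent_prob:     # rare, expensive refinement
            child = line_minimize(fitness, child)
        # the better of child and primary parent survives
        new_pop[i] = child if fitness(child) < fitness(pop[i]) else pop[i]
    return new_pop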
These five training algorithms were run on identical computers, and the mean squared error of each was sampled at the same elapsed times. The results are shown in the following table (rows are nine successive, equally spaced sampling times; -.- means the algorithm had already terminated) and are graphed in Figure 3.

GRAD    GRAD-STEP  GENETIC  DIFFEV  DIFFEV-DES
1.429   1.499      2.149    2.149   2.149
1.429   1.499      1.988    1.988   1.988
1.429   1.469      1.647    1.950   1.430
-.-     1.442      1.570    1.515   1.430
-.-     1.442      1.567    1.514   1.419
-.-     1.440      1.518    1.462   1.419
-.-     1.433      1.460    1.436   1.419
-.-     1.429      1.444    1.430   -.-
-.-     1.427      -.-      -.-     -.-

Figure 3: Comparative results.

This table reveals many interesting features. The direct descent method (GRAD) converges extremely rapidly to a good local minimum, though not the best among the minima found. The stepwise modification converged to the second-best minimum, although it required the most time to get there. The traditional genetic algorithm was the grand loser. It briefly pulled a bit ahead of ordinary differential evolution early on, but rapidly lost its lead and was unable to produce any further significant improvement after 15 generations. (Certainly it would be able to find improvement after a sufficient period of time, but the stopping criterion of failure to improve two generations in a row is employed for reasons of economy.) Differential evolution was nearly always ahead of genetic optimization, and it ultimately converged to a fairly good minimum after 13 generations. But the grand winner was clearly differential evolution with direct descent. This method quickly pulled into the lead (except for GRAD) and reached the best optimum long before the others were even close. In fact, it finished in just four generations, although the time taken to accomplish these four generations was about equal to the time taken for ordinary differential evolution to cover 8 generations. Direct descent with 0.1 probability just about doubles the time required per generation.
Conclusions

This paper has presented a modification of the differential evolution optimization algorithm that appears to speed convergence tremendously while almost certainly having little impact on the ability to find a good global optimum. This improved algorithm is particularly suitable for training general regression neural networks because the gradient and Hessian with respect to the GRNN parameters are easily computed, but their direct use is often impaired by the presence of multiple local minima in the error space. It is surely the case that this algorithm could enjoy widespread applicability in training other models. However, because the GRNN is the workhorse model of the author, this model was chosen for experimentation.

If an application has very few predictors, direct application of the enhanced conjugate gradient algorithm (GRAD) described in [Masters, 1995] is likely the best training method. As is shown here, this method converges extremely rapidly to the nearest local minimum, which is usually the global minimum when there are few predictors. But if there are more than a few predictors, the likely presence of multiple minima precludes use of direct descent. In this case either the painfully slow stepwise method must be used, or a stochastic algorithm is needed. Of the three stochastic contenders studied here, differential evolution enhanced by direct descent was the best both in speed and in finding the best optimum.
References

Lawrence Davis, Handbook of Genetic Algorithms, New York, NY: Van Nostrand Reinhold, 1991.

David Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Reading, MA: Addison-Wesley, 1989.

Timothy Masters, Advanced Algorithms for Neural Networks, New York, NY: John Wiley and Sons, 1995.

Kenneth Price and Rainer Storn, "Differential Evolution," Dr. Dobb's Journal, April 1997, pp. 18-24.