
MODEL FITTING FOR MULTIPLE VARIABLES BY MINIMISING THE GEOMETRIC MEAN DEVIATION

From: Total least squares and errors-in-variables modeling: algorithms, analysis and applications (2002). Eds. S. Van Huffel and P. Lemmerling. Kluwer Academic, Dordrecht.

Dr. Chris Tofallis
University of Hertfordshire Business School
Dept. of Statistics, Economics, Accounting and Management Systems
Mangrove Rd, Hertford SG13 8QF, United Kingdom
[email protected]

Abstract

We consider the problem of fitting a linear model for a number of variables but without treating any one of these variables as special, in contrast to regression where one variable is singled out as being a dependent variable. Each of the variables is allowed to have error or natural variability but we do not assume any prior knowledge about the distribution or variance of this variability. The fitting criterion we use is based on the geometric mean of the absolute deviations in each direction. This combines variables using a product rather than a sum and so allows the method to naturally produce units-invariant models; this property is vital for law-like relationships in the natural or social sciences.

Keywords: Geometric mean functional relationship, least area criterion, least volume criterion, measurement error, reduced major axis.


1. Properties of the geometric mean functional relation

In two dimensions the method we are interested in involves fitting a line by minimising the total area of the triangles defined by the data points and the line (see Figure 1). This is equivalent to minimising the sum of the geometric means of the squared deviations in each dimension.

Figure 1
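To spell out the equivalence (the Δ notation is introduced here for illustration and is not the paper's own): write Δx_i and Δy_i for the horizontal and vertical deviations of the i-th data point from the line. The right triangle in Figure 1 then has area

\[
\tfrac{1}{2}\,\lvert\Delta x_i\rvert\,\lvert\Delta y_i\rvert \;=\; \tfrac{1}{2}\sqrt{\Delta x_i^{2}\,\Delta y_i^{2}},
\]

which is half the geometric mean of the squared deviations, so summing areas over the data points and summing these geometric means lead to the same minimisation problem.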

This simple geometric approach provides a line with a number of very useful or interesting properties. These are gathered together here from papers scattered in the literatures of different fields.

1. The resulting line is symmetric with respect to the two variables, i.e. whichever variable is plotted on the vertical axis, the resulting models will be mutually equivalent (consistent with each other). This is because switching the axes does not affect the areas.

2. The fitted line is units invariant (or scale invariant). This means that if we change the units of measurement from, say, cm to mm we obtain an equivalent model. Surprisingly this is not true of some line-fitting procedures, notably orthogonal regression (also known as total least squares).

3. The slope of the line is the geometric mean of the two least squares regression slopes. In other words, if the regression of y on x gives slope b1 and the regression of x against y gives slope b2 (expressed as the rate of change of y with x), then our least areas procedure gives a slope of (b1 b2)^{1/2}. A proof appears in Woolley [11] and in Teissier [10]. For this reason this type of line is sometimes called the geometric mean functional relationship.

4. Minimising the areas implies that the objective function involves the product of the deviations in each dimension for each point. Hence, as already stated, we are also minimising the sum of the geometric means of the squared deviations in the x and y directions.

5. The expression for the magnitude of the slope is perhaps the simplest of any available fitting method: it is merely the standard deviation of the y-values divided by the standard deviation of the x-values (proved in Woolley [11]). The sign of the slope is given by the sign of the correlation.

6. Suppose the relative frequency function, or probability density function, f(x,y) is symmetrical in the two variables after allowing for re-scaling, i.e. f(x,y) = f(y/c, cx). Then Greenall [3] has proved that our line is unique in pairing off values of X and Y such that the proportion of x-scores less than X equals the proportion of y-scores less than Y. In other words it pairs off scores with equal percentiles (or quartiles, or more generally any quantile or cumulative measure of relative ranking). The bivariate normal distribution is one case that satisfies this symmetry condition.

7. By definition the least squares regression line of y on x has the lowest mean squared error of estimation of y. Likewise the regression of x on y has the lowest mean squared error of estimation of x. No single line can simultaneously achieve both these minima, so we must expect an increase in these errors. If we require that the proportional increase in each of these errors be the same (i.e. for x and y), then Greenall [3] has proved that our line is the only line which achieves this.

8. For the set of all possible line-fitting procedures that depend on standard deviations and correlations (i.e. second moments), ours is unique in behaving correctly, i.e. it is invariant to switching of axes and to change of scale. This was proved by Nobel laureate Paul Samuelson in 1942 [9].

9. When the data are divided by their standard deviations we obtain a 45 degree line which bisects the two least squares lines (regression of y on x and vice versa). This line minimises the sum of squares of perpendicular distances from the data points to the line (Ricker [7]). This arises from the fact that the triangles (Figure 1) become isosceles in this case.

For the case of two variables this method has appeared in the research literature of many disciplines throughout the twentieth century, but under different names. In astronomy it is known as Stromberg's impartial line. In biology it is the line of organic correlation. In economics it is the method of minimised areas or diagonal regression. In statistics it is sometimes referred to as the 'standard or reduced major axis'. This name derives from the fact that if the data are standardised by dividing by their standard deviations, then the fitted line corresponds to the major (i.e. principal) axis of the ellipse of constant probability for the bivariate normal distribution. Kruskal [5] proved that of all line-fitting procedures which depend only on first and second moments this is the only one which behaves correctly under translation and change of scale. Moreover, in a comparison of 34 fitting methods, Riggs et al. [8] found that in overall performance, judged both by root mean square error and percent bias, this method was almost always the best when the ratio of error variances is unity. In cases where this ratio is unknown, Ricker [6] concluded that this is the best available estimate of the functional relationship when the variability in both variates is due to measurement error or to natural or inherent variability (e.g. as arises in biology). For these reasons it seems worthwhile to investigate this method when extended to the case of multiple variables. A small code sketch illustrating the two-variable fit is given below.
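As a concrete illustration of properties 3 and 5, here is a minimal Python sketch of the two-variable fit. The code is not from the paper; the convention that the line passes through the centroid, standard for the reduced major axis, is assumed here.

```python
import numpy as np

def gm_line(x, y):
    """Geometric mean (reduced major axis) line y = a + b*x.

    Property 5: |slope| = sd(y)/sd(x), with the sign taken from the
    correlation.  The line is assumed to pass through the centroid.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = np.sign(np.corrcoef(x, y)[0, 1]) * np.std(y, ddof=1) / np.std(x, ddof=1)
    a = y.mean() - b * x.mean()
    return a, b

# Check property 3: the slope equals the geometric mean of the two OLS slopes.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
a, b = gm_line(x, y)
b1 = np.polyfit(x, y, 1)[0]          # slope of the regression of y on x
b2 = 1.0 / np.polyfit(y, x, 1)[0]    # x-on-y regression, expressed as dy/dx
print(b, np.sqrt(b1 * b2))           # the two values agree
```

Running the sketch prints two coinciding values, since both reduce to sd(y)/sd(x) in-sample.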

2. Extension to multiple dimensions

We now intend to fit a linear function of the form ∑ ajxj = c to data in p dimensions (j = 1 to p). Of course this is not uniquely specified because we can divide through by any non-zero constant. Thus we are free to impose a constraint on the coefficients. For later convenience the constraint we choose will involve the product of the coefficients: Π aj. One obvious way of generalising the least area procedure to higher dimensions is to minimise the volumes (or hypervolumes). Each data point will have associated with it a 'volume deviation', which is simply the product of its deviations from the fitted plane in each dimension. We must take care to make all these non-negative by taking absolute values. For the i'th data point this volume deviation Vi is proportional to

\[
\frac{\Bigl|\sum_{j=1}^{p} a_j x_{ij} \;-\; c\Bigr|^{\,p}}{\Bigl|\prod_{j=1}^{p} a_j\Bigr|}.
\]

Hence one possible objective might be to minimise ∑ Vi. This is an unconstrained fractional programming problem. An alternative objective that has already been investigated is the minimisation of ∑ Vi^{2/p} (Draper and Yang [2]). This is the route to generalisation based on property (4) above, i.e. the geometric mean of the squared deviations. Their method of solution involved treating the problem as a nonlinear weighted least squares problem, and they list a short piece of S-plus code for achieving this.

The numerator of Vi looks like an Lp norm, but with the degree p varying as the number of variables. The L1 norm would then correspond to taking the p'th root of the above volume deviation before summing and minimising, i.e. min ∑ Vi^{1/p}, so we are minimising the geometric mean of the absolute deviations in each dimension. Let us see how this particular case can be formulated. The denominator remains the same after summing over the data points. One way of dealing with fractional programmes is to set the denominator in the objective function to a particular value, say unity, and then optimise the numerator. Thus we can make good use of our condition on the coefficients: set Π aj = 1 and minimise the numerator subject to this condition. In fact we have to solve the problem a second time with Π aj = −1 to ensure that all combinations of signs are considered. To deal with the absolute values of the volume deviations we introduce pairs of non-negative variables ui, vi. The problem then becomes: find values of the coefficients aj and c so as to

minimise   ∑ (ui + vi)
subject to ∑ ajxij + ui − vi = c   (for i = 1 to n, where n is the number of data points)
and        Π aj = ±1.

Notice that the objective function is linear and that we only have one non-linear constraint. For fitted functions that are linear in the coefficients, the optimisation problem can be shown to be a convex fractional programme, so that any optimum is both unique and global. Whilst there are solution algorithms specifically for fractional programmes, the unique global optimum property implies that general convex solution methods can also be used.
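To make the formulation concrete, the following Python sketch hands the problem to a general-purpose constrained solver (SciPy's SLSQP). This is an illustration only, not the author's implementation; the starting point and solver choice are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gm_fit(X):
    """Fit sum_j a_j x_j = c by minimising sum_i (u_i + v_i) subject to
    X a + u - v = c (componentwise), prod(a) = +/-1 and u, v >= 0.

    X is an (n, p) data matrix; returns (a, c) from the better of the two
    sign constraints.  Illustrative sketch using SLSQP, not a dedicated
    fractional/convex programming solver.
    """
    n, p = X.shape
    def unpack(z):
        return z[:p], z[p], z[p + 1:p + 1 + n], z[p + 1 + n:]
    best = None
    for s in (1.0, -1.0):
        cons = [
            # n linear equality constraints: X a + u - v - c = 0
            {"type": "eq",
             "fun": lambda z: (lambda a, c, u, v: X @ a + u - v - c)(*unpack(z))},
            # the single non-linear constraint: prod(a) = s
            {"type": "eq", "fun": lambda z, s=s: np.prod(unpack(z)[0]) - s},
        ]
        bounds = [(None, None)] * (p + 1) + [(0.0, None)] * (2 * n)
        # Rough start: a = 1, c = mean of X a, with u and v absorbing residuals.
        c0 = X.sum(axis=1).mean()
        r = X @ np.ones(p) - c0
        z0 = np.concatenate([np.ones(p), [c0],
                             np.clip(-r, 0, None), np.clip(r, 0, None)])
        res = minimize(lambda z: unpack(z)[2].sum() + unpack(z)[3].sum(),
                       z0, bounds=bounds, constraints=cons, method="SLSQP")
        if best is None or res.fun < best.fun:
            best = res
    return best.x[:p], best.x[p]
```

SLSQP is chosen here purely for availability; as noted above, solvers exploiting the convex or fractional structure of the problem could equally be used.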

3. Numerical test

We shall now apply the least geometric mean deviation approach to try to uncover the coefficients from data which have been generated from a known underlying model with some randomness added. In order to make this a difficult test we choose data that suffer from multicollinearity, taken from Belsley's comprehensive monograph [1] on this problem. The generating model is

y = 1.2 − 0.4x1 + 0.6x2 + 0.9x3 + ε

with ε normally distributed with zero mean and variance 0.01. Two very similar data sets (A, B) are tabulated in Belsley based on this model. For set A ordinary least squares (OLS) gives:

y = 1.26 + 0.97x1 + 9.0x2 − 38.4x3

The fit as measured by R2 is very good at 0.992, but the underlying model is far from being uncovered. In particular, the coefficient of x2 is 15 times too high and two of the coefficients have the wrong sign. Getting the signs wrong is very serious if one is trying to understand how variables are related to each other. Turning to the least geometric mean deviation we find:

y = 1.28 − 0.44x1 + 0.33x2 + 1.74x3

We now have all the correct signs and the magnitudes are also reasonable.


Repeating this for data set B:

OLS:             y = 1.275 + 0.25x1 + 4.5x2 − 17.6x3

Geometric mean:  y = 1.29 − 0.44x1 + 0.42x2 + 1.98x3

Once again the geometric mean produces a superior model. It is also worth noting that the two OLS models are very different from each other, whereas the geometric mean models are more stable to small variations in the data. This is noteworthy because of how similar the two data sets are: the y-values are identical for sets A and B, and the x-values never differ by more than one in the third digit. Thus our method appears to be much more stable than OLS. Of course a comprehensive set of Monte Carlo simulations is required to explore this aspect fully. Furthermore, although it is not our intention to use this method for forecasting a particular dependent variable, we found that the mean absolute deviation for y is 10% lower than that from OLS for both data sets.
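The following sketch indicates how such a comparison could be reproduced. Belsley's data sets A and B are not reproduced here, so near-collinear data are simulated from the stated generating model; gm_fit refers to the sketch in Section 2 and must be defined first. The specific simulation (sample size, collinearity structure, seed) is my assumption, not the paper's.

```python
import numpy as np

# Simulated stand-in for Belsley's near-collinear data (the real sets A and B
# are tabulated in [1]).
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=1e-3, size=n)   # nearly collinear
y = 1.2 - 0.4 * x1 + 0.6 * x2 + 0.9 * x3 + rng.normal(scale=0.1, size=n)

# OLS: with regressors this collinear the estimates can land far from
# (1.2, -0.4, 0.6, 0.9) and can change sign between nearby data sets.
A = np.column_stack([np.ones(n), x1, x2, x3])
beta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Geometric mean fit treating all four variables symmetrically, then
# rearranged into y = ... form: a1 x1 + a2 x2 + a3 x3 + a4 y = c.
a, c = gm_fit(np.column_stack([x1, x2, x3, y]))
beta_gm = np.concatenate([[c / a[3]], -a[:3] / a[3]])
print(beta_ols)
print(beta_gm)
```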

4. Conclusion

Regression is not only one of the most used quantitative methods, it is also one of the most abused. In the simplest two-variable case it is well suited to providing an expected value for the y-variable for a given x-value. Unfortunately it is also widely, and incorrectly, applied to finding law-like relationships, or to estimating parameters within them.

This paper presents the idea of measuring deviations from the fitted function by means of the product of the deviations in each dimension. More generally we have a whole family of fitting norms based upon this volume deviation, in much the same way as we have Lp norms. A key advantage of this criterion is that the fitted models are scale-invariant, i.e. they are unaffected by the units of measurement. This is an essential property if one is seeking law-like relationships in the natural or social sciences. For the case of two variables this method has appeared in the research literature of many disciplines throughout the twentieth century, but under different names. This paper has extended the technique to multiple variables. The method can be used when there is random variation (due to natural individuality or measurement errors) in the variables, but we do not assume that any information about this variation is known, e.g. we make no assumptions regarding the distributions of these errors or their variances (in which case maximum likelihood estimation could be applied).

A numerical test showed the method to be superior to least squares in two regards: firstly, it was able to estimate the coefficients of the data-generating model more accurately, and secondly it showed itself to be more stable to small variations in the data. This preliminary evidence commends the method for further investigation.

References

[1] Belsley, DA (1991). Conditioning Diagnostics. Wiley, New York.
[2] Draper, NR and Yang, Y (1997). Generalization of the geometric mean functional relationship. Computational Statistics and Data Analysis, 23, 355-372.
[3] Greenall, PD (1949). The concept of equivalent scores in similar tests. British J. of Psychology: Statistical Section, 2, 30-40.


[4] Grosse, E (1989). A catalogue of algorithms for approximation. In Algorithms for Approximation II, eds. JC Mason and M Cox. Also available at www.hensa.ac.uk/netlib/a/catalog.html
[5] Kruskal, WH (1953). On the uniqueness of the line of organic correlation. Biometrics, 9, 47-58.
[6] Ricker, WE (1973). Linear regressions in fishery research. J. Fisheries Research Board of Canada, 30, 409-434.
[7] Ricker, WE (1984). Computation and uses of central trend lines. Canadian J. of Zoology, 62, 1897-1905.
[8] Riggs, DS, Guarnieri, JA and Adelman, S (1978). Fitting straight lines when both variables are subject to error. Life Sciences, 22, 1305-1360.
[9] Samuelson, PA (1942). A note on alternative regressions. Econometrica, 10(1), 80-83.
[10] Teissier, G (1948). La relation d'allometrie: sa signification statistique et biologique. Biometrics, 4(1), 14-48.
[11] Woolley, EB (1941). The method of minimized areas as a basis for correlation analysis. Econometrica, 9(1), 38-62.
