Computers & Geosciences Vol. 16, No. 7, pp. 933-952, 1990
MULTIPLE LINEAR REGRESSION WITH CORRELATIONS AMONG THE PREDICTOR VARIABLES. THEORY AND COMPUTER ALGORITHM RIDGE (FORTRAN 77)

P. F. M. VAN GAANS and S. P. VRIEND

Institute of Earth Sciences, Department of Geochemistry, University of Utrecht, Budapestlaan 4, P.O. Box 80.021, 3508 TA Utrecht, The Netherlands

(Received 5 July 1989; revised 7 February 1990)
Abstract--Application of ridge regression in geoscience usually is a more appropriate technique than ordinary least-squares regression, especially in the situation of highly intercorrelated predictor variables. A FORTRAN 77 program RIDGE for ridged multiple linear regression is presented. The theory of linear regression and ridge regression is treated, to allow for a careful interpretation of the results and an understanding of the structure of the program. The program gives various parameters to evaluate the extent of multicollinearity within a given regression problem, such as the correlation matrix, multiple correlations among the predictors, variance inflation factors, eigenvalues, condition number, and the determinant of the predictor correlation matrix. The best method for the optimum choice of the ridge parameter with ridge regression has not been established yet. Estimates of the ridge bias, ridged variance inflation factors, estimates, and norms for the ridge parameter therefore are given as output by RIDGE and should complement inspection of the ridge traces. Application within the earth sciences is discussed.

Key Words: Ridge regression, Multiple regression, Multicollinearity diagnostics, Robust estimators, FORTRAN.
INTRODUCTION

Multiple linear regression is used as a tool in most branches of geoscience to quantify relations between experimental data or field observations. The objective of the regression may be primarily to obtain an equation that accurately estimates the value of one (dependent) variable from the other (predictor) variables observed, that is, to smooth and interpolate between the observed data (Snedecor and Cochran, 1967). Usually, however, the regression parameters are meant to be indicative of the strength of influence of each predictor or to represent true physical parameters. The regression estimates obtained for these parameters then are considered to be of more general validity, and inferences are made from comparison with other (physical) parameters. This may imply prediction or extrapolation (far) outside the range of actual experiments or observations.

Although many different regression techniques have been developed and are well documented (e.g. Draper and Smith, 1981; Montgomery and Peck, 1982), ordinary least-squares regression (OLS) is applied almost exclusively in the problems mentioned previously. This probably is the result of the straightforwardness and relative simplicity of OLS and its general availability in computer software. Yet it also reveals that many researchers do not discriminate between multiple regression as a fitting or interpolation tool and multiple regression as a technique
to estimate physical parameters or strengths of influence, that is, a predictive or extrapolation tool. For these latter purposes OLS is not always the best technique and may be misleading. One type of situation in which OLS may fail is when the predictor or so-called independent variables are strongly intercorrelated, causing problems of multicollinearity. In the geosciences, controlled experiments, whereby selected parameters are kept constant, are not possible when natural processes are studied in their earthly environment. Various processes invariably act simultaneously and may be related or may interfere with one another. Also in experimental work the effects of predictor variables that are correlated inherently cannot be studied separately. The correlations may be the result of physical or economic constraints (Marquardt, 1970) or may be imposed by the theoretical model used to describe the experiments.

Ridge regression (RR) is a procedure only slightly different from OLS, specifically designed to overcome multicollinearity in the predictor variables (Hoerl, 1962). For these situations RR gives more robust estimates of the regression parameters than OLS, estimates that are more likely to be physically meaningful (Hoerl, Kennard, and Baldwin, 1975; Montgomery and Peck, 1982). Comprehensive texts on regression techniques are given by Draper and Smith (1981) and Montgomery and Peck (1982); they discuss a wide range of regression techniques other than OLS, including RR.
An extensive discussion of RR, in a general perspective of nonorthogonal multidimensionality, is given in Marquardt (1970). Application of RR in the earth sciences is advocated by Jones (1972) and Howarth (1984). A summary of RR and its advantages and disadvantages also is presented by Wood and Crerar (1985). A procedure to utilize the nonlinear regression program of the BMDP Statistical Software (Dixon, 1981) for RR is explained by Jennrich (1978). In principle this procedure, which generally will not give all necessary statistics, can be followed with other regression programs; preparation of input and subsequent interpretation of the output may be a tedious affair. Fully implemented RR with limited output (only tabular and graphical display of the ridge traces with linear incrementation of the ridge parameter) is provided by Statgraphics(R) (1989).

This paper presents a FORTRAN 77 routine for RR, RIDGE, that provides a wide variety of necessary parameters to gain insight into the data structure and to optimize the regression model. A concise treatise of OLS and RR is presented for proper interpretation of results, and the program structure is explained. The performance of the program is illustrated with the Hald data (Hald, 1952; Draper and Smith, 1981). Some examples of intercorrelated predictor variables in geosciences are discussed.

THEORY
Least-squares regression

A general linear regression model has the form:

    Y_i = B_0 + \sum_j B_j X_{ij} + E_i                                   (1)

where Y_i is the measured dependent quantity for the ith object or observation, X_{ij} is the value of the jth independent or predictor variable, and B_j is the associated regression coefficient estimate. B_0 is the intercept value and E_i is the error associated with the ith observation. Normalizing all variables to standard deviations (stdev) from the mean results in:

    y_i = \sum_j b_j z_{ij} + e_i                                         (2)

with

    z_{ij} = (X_{ij} - \mathrm{mean}(X_j)) / \mathrm{stdev}(X_j)          (3)

    y_i = (Y_i - \mathrm{mean}(Y)) / \mathrm{stdev}(Y)                    (4)

and e_i the normalized error. The b_j are the normalized regression coefficients from which the B_j and B_0 are computed easily (Li, 1964):

    B_j = b_j \cdot \mathrm{stdev}(Y) / \mathrm{stdev}(X_j)               (5)

    B_0 = \mathrm{mean}(Y) - \sum_j B_j \cdot \mathrm{mean}(X_j).         (6)

If the predictors are totally independent, that is uncorrelated, the b_j are equal to the correlations R(X_j, Y). Then the squared b_j represents the contribution of X_j to Y, or the percentage of variance of Y explained by the variation in X_j. More generally this contribution is given by the product b_j \cdot R(X_j, Y) (van der Grinten and Lenoir, 1973). In the situation of highly intercorrelated predictors the product b_j \cdot R(X_j, Y) may be negative. With OLS the summation \sum_j b_j \cdot R(X_j, Y) equals the overall goodness-of-fit or coefficient of determination, that is the squared multiple correlation R^2_{YX} between the dependent Y and all predictors.

The normalized problem generally is solved through matrix computations:

    b = (z^T z)^{-1} \cdot z^T y                                          (7)

or in correlation form:

    b = R^{-1} \cdot R_{XY}                                               (8)

where b is the column vector of the b_j, y is the column vector of the y_i, and z^T is the transpose of the matrix z of the z_{ij}. R^{-1} is the inverse of the correlation matrix of the predictor variables, R = z^T z / (N - 1), with N the number of observations; R_{XY} is the vector of correlations R(X_j, Y) between the dependent and each predictor, R_{XY} = z^T y / (N - 1). If R, or rather its determinant det(R), approaches zero owing to high correlations between the independent variables X_j, R^{-1} is ill-determined, resulting in an unstable solution for the b_j.

A measure of the extent of multicollinearity among the predictors is the multiple correlation coefficient R^2_{X_j,P} between the jth predictor and all other predictors, or the related variance inflation factor (VIF) (Marquardt, 1970; Montgomery and Peck, 1982). With OLS the VIFs can be calculated through inversion of the correlation matrix of the predictors (Anderson, 1958). The following relations hold:

    \mathrm{VIF} = R^{-1}                                                 (9)

    \mathrm{VIF}_{jj} = R^{jj} = 1 / (1 - R^2_{X_j,P})                    (10)

with R^{jj} an element of R^{-1}. For uncorrelated predictors VIF_{jj} of course equals 1.0. The VIF is the inverse of the tolerance as used in least-squares multiple regression in the SPSS (Norusis, 1986) and BMDP (Dixon, 1981) software. The larger VIF_{jj} or R^2_{X_j,P}, the less X_j independently contributes to the variation in Y and the more indeterminate b_j becomes with OLS. From practical experience it is inferred that no single VIF should exceed 5 or 10, or severe problems of multicollinearity exist (Montgomery and Peck, 1982). The label VIF characterizes the relation of the inverse of the correlation matrix (or VIF matrix) with the standard error (sterr) of the regression coefficients B_j (van der Grinten and Lenoir, 1973):

    \mathrm{sterr}^2(B_j) = \frac{s^2 \cdot \mathrm{VIF}_{jj}}{(N - 1) \cdot \mathrm{stdev}(X_j)^2}     (11)

    \mathrm{sterr}^2(B_0) = s^2 \cdot \left[ \frac{1}{N} + \sum_j \sum_{j'} \frac{\mathrm{mean}(X_j) \cdot \mathrm{mean}(X_{j'}) \cdot \mathrm{VIF}_{jj'}}{(N - 1) \cdot \mathrm{stdev}(X_j) \cdot \mathrm{stdev}(X_{j'})} \right]     (12)

    s^2 = \sum_i (Y_i - \hat{Y}_i)^2 / (N - M)                            (13)

with s^2 representing the squared standard error of prediction of the Y estimate, or residual mean square, the property to be minimized with OLS (M - 1 being the number of predictors, N - M the residual degrees of freedom).

Equations (11)-(13) show that in OLS a good fit, expressed as a small value for s^2, with large values of VIF_{jj} does not necessarily imply good estimates of the B_j. A measure of the error involved in the regression model for the determination of the B_j that takes into account the VIF effect is the mean square error of estimation of the B_j (MSE), the summation over the squared standard errors of the B_j of Equation (11). For reasons of comparability between different regression problems the MSE is best expressed in its normalized form, as the summation over the squared standard errors of the b_j [cf. Eq. (5)]:

    \mathrm{MSE} = \frac{s^2}{(N - 1) \cdot \mathrm{stdev}(Y)^2} \cdot \sum_j \mathrm{VIF}_{jj}.        (14)
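For example, with two predictors having mutual correlation r (a minimal sketch of Eqs. (9) and (10)):

    R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad R^{-1} = \frac{1}{1 - r^2} \begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix}

so that VIF_{11} = VIF_{22} = 1/(1 - r^2) and R^2_{X_j,P} = r^2. For r = 0.9 the VIF is 1/0.19 ≈ 5.3, at the lower end of the problematic range quoted previously; for r = 0.95 it reaches ≈ 10.3, and det(R) = 1 - r^2 is already close to zero.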
Another approach to identify possible multicollinearity in the predictors is through decomposition of the predictor correlation matrix in eigenvalues λ and eigenvectors. The eigenvalues give insight into the true dimensionality of the problem, or rather into the relative importance of the different dimensions (see also Marquardt, 1970). The condition number CN:

    \mathrm{CN} = \lambda_{\max} / \lambda_{\min}                          (15)

is a measure of the spread in the eigenvalues. Values above 100 indicate moderate to strong multicollinearity, values above 1000 indicate severe multicollinearity (Montgomery and Peck, 1982).
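For the same two-predictor example, R has eigenvalues λ = 1 + r and λ = 1 - r, so that

    \mathrm{CN} = (1 + r) / (1 - r)

which exceeds 100 for r > 0.98 and 1000 for r > 0.998; the determinant det(R) = λ(max) · λ(min) = 1 - r^2 shrinks correspondingly.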
Alternatives

Alternative procedures to OLS in the situation of multicollinearity among the predictors can be divided roughly into biased estimation and subset selection (Hoerl, Schuenemeyer, and Hoerl, 1986). In the latter situation one or more predictors are deleted from the complete set of variables because they are considered insignificant. Thus one or more of the b_j are constrained to zero whereas the other b_j change under this constraint. The most popular biased estimation procedures (Hoerl, Schuenemeyer, and Hoerl, 1986) are principal component (PC) regression, based on orthogonalization of the regression problem using eigenvalues and eigenvectors (Montgomery and Peck, 1982), and RR. Hoerl, Schuenemeyer, and Hoerl (1986) showed that RR generally is superior to PC regression and subset selection in estimating the b_j. Gibbons and McDonald (1984) demonstrated that RR in fact is a weighted average of all possible subset regressions. Subset selection is considered appropriate only if it is expected that some of the predictor variables indeed are superfluous (Hoerl, Schuenemeyer, and Hoerl, 1986). In these situations RR can be used more effectively than OLS to select the appropriate subset.
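In the notation of Equation (8), deleting the mth predictor amounts to solving the reduced system

    b' = (R')^{-1} \cdot R'_{XY}

where R' and R'_{XY} omit the mth row and column; b_m thereby is constrained to zero and all remaining b_j are refitted under that constraint.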
Ridge regression

Ridge regression may be viewed as a form of Bayesian statistics in which prior knowledge or belief about the type or size of the parameters is formalized (Hoerl, 1962; Marquardt, 1970; Draper and Smith, 1981; Montgomery and Peck, 1982; Barnett and Lewis, 1984). With RR the correlation matrix R is stabilized by adding a ridge parameter k > 0 to its diagonal values; with I the identity matrix we obtain:

    b(k) = (R + kI)^{-1} \cdot R_{XY}.                                    (16)

The effect of k is that the absolute values of the normalized coefficients b_j are forced to be smaller, towards a more equal distribution of the total variance in Y over the X_j, thereby avoiding large negative contributions of some X_j to Y. The solution is slightly biased, which indicates that for an infinite number of perfectly accurate data the calculated values of the b_j deviate slightly from the true values. The normal least-squares estimates in principle are unbiased but, with correlations among the X_j, sensitive to the experimental or measurement error and the experimental or sampling design. The ridged b_j are more robust, which signifies that with a limited number of data having some error the deviation from the true value is likely to be small. In other words, because of multicollinearity, the ridged b_j have lower accuracy but higher precision than the least-squares estimates of the b_j (see Montgomery and Peck, 1982). The elements of the VIF matrix that determine the standard errors of the B_j or b_j [Eqs. (11), (12)] are reduced. The VIF matrix now is given by (Marquardt, 1970):

    \mathrm{VIF}(k) = (R + kI)^{-1} \cdot R \cdot (R + kI)^{-1}.          (17)

The ridge bias--in normalized form--is given mathematically by:

    \mathrm{bias}(k) = \beta^T \cdot \{[(R + kI)^{-1} \cdot R] - I\}^T \cdot \{[(R + kI)^{-1} \cdot R] - I\} \cdot \beta     (18)

with β the vector of the true normalized coefficients, which in fact are the unknowns to be estimated. The ridged MSE is the sum of the squared standard deviations of the ridged b_j and the bias:

    \mathrm{MSE}(k) = \frac{s^2(k)}{(N - 1) \cdot \mathrm{stdev}(Y)^2} \cdot \sum_j \mathrm{VIF}(k)_{jj} + \mathrm{bias}(k).     (19)

For given βs it follows from Equation (18) that the bias is zero for k = 0 (OLS) and increases with k, approaching β^T · β at infinite k. The bias also causes the standard error to increase and all b_j to approach zero at infinitely high values of k. For this reason k must not be taken too large, usually not > 1.
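For the two-predictor example the ridged system of Equation (16) can be written in closed form:

    b(k) = \frac{1}{(1 + k)^2 - r^2} \begin{pmatrix} 1 + k & -r \\ -r & 1 + k \end{pmatrix} \begin{pmatrix} R(X_1, Y) \\ R(X_2, Y) \end{pmatrix}.

The determinant (1 + k)^2 - r^2 remains bounded away from zero even as r approaches 1, which is exactly the stabilization that makes the ridged b_j less sensitive to error in the correlations than the OLS solution at k = 0.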
Estimation of k

Whereas the bias increases with k, the squared standard error terms in the summation of Equation (19) decrease with k because of the decrease in the VIF(k)_{jj}. Theoretically there is an optimum value of k for which MSE(k) is minimized. Unfortunately this optimum cannot be calculated, because the true bias is unknown. Some mathematical procedures have been proposed to estimate the optimum k. Marquardt (1970) suggests that k should be increased until the maximum VIF(k)_{jj} is above 1 but below 10. If k is increased stepwise, at each value of k an estimate for the most appropriate value of k (Hoerl, Kennard, and Baldwin, 1975; Hoerl and Kennard, 1976), which we will term kHKB, may be given by:

    k_{HKB}(k) = \frac{(M - 1) \cdot s^2(0)}{(N - 1) \cdot \mathrm{stdev}(Y)^2 \cdot b(k)^T \cdot b(k)}.     (20)

Here M - 1 is the number of predictor variables X_j, s^2(0) is the residual mean square from the OLS analysis (k = 0), and b(k) is the vector of the b_j at the current value of k. Either the primary estimate kHKB(0) from the OLS solution may be taken, or kHKB(k) is updated until kHKB(k) - k approaches zero. Lawless and Wang (1976) estimate the appropriate k value (kLW) from the OLS F-ratio F(0) between the mean square resulting from regression and the residual mean square:

    k_{LW} = \frac{1}{F(0)} = \frac{(M - 1) \cdot s^2(0)}{\sum_i [\hat{Y}_i - \mathrm{mean}(Y)]^2}.     (21)

Simulation studies by Hoerl, Schuenemeyer, and Hoerl (1986) show that kLW generally is slightly better than kHKB(0), except for large signal-to-noise ratios; they did not compare kHKB(k).

Another approach to a mathematical selection of k is based on the stretching effect of small eigenvalues on the b vector (Agterberg, 1974). The expected squared distance E(D^2) between the endpoints of the true normalized β vector and the least-squares estimate b(0) equals MSE(0) and may be rewritten as:

    E(D^2) = \mathrm{MSE}(0) = \frac{s^2(0)}{(N - 1) \cdot \mathrm{stdev}(Y)^2} \cdot \sum_j \frac{1}{\lambda_j}     (22)

with

    D^2 = [b(0) - \beta]^T \cdot [b(0) - \beta].     (23)

McDonald and Galarneau (1975) then suggest that k should be increased to a certain value kMG such that the squared length of the b vector is reduced by MSE(0). A norm MGL that calculates the ratio between the squared length reduction of b(k) compared with b(0), and MSE(0):

    \mathrm{MGL}(k) = \frac{b(0)^T \cdot b(0) - b(k)^T \cdot b(k)}{\mathrm{MSE}(0)}     (24)

then should become 1 for k equal to kMG. There is no estimate for kMG if MSE(0) exceeds the squared length of the OLS b vector: because the squared length of b(k) >= 0 for all k, MGL(k) then will be < 1 for all values of k. A second norm MGD can be constructed, which has no such geometrical constraint with large MSE(0), and which calculates the ratio between the squared distance of the endpoints of b(0) and b(k), and E(D^2):

    \mathrm{MGD}(k) = \frac{[b(0) - b(k)]^T \cdot [b(0) - b(k)]}{\mathrm{MSE}(0)}.     (25)

The MGL norm seems to perform particularly well with large signal-to-noise ratios (Hoerl, Schuenemeyer, and Hoerl, 1986). Inasmuch as MSE(0) describes the confidence interval of b(0), the length reduction norm MGL or the distance norm MGD seems to put some upper limit on k.

None of the described or other mathematical criteria for the selection of k has proven superior (Montgomery and Peck, 1982). Normally the change in the b_j (or B_j) and the standard error s with k are plotted as the ridge traces (Fig. 1). We consider it informative to make a ridge trace of the regression constant B_0 and other parameters as well. A selection for k can be made visually at the smallest value for which a "stable" solution is obtained. The mathematical criteria can be used as additional tools for the final selection of k.
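Equations (20) and (21) can be checked directly against the least-squares output for the Hald data in Appendix 3, where M - 1 = 4, N - 1 = 12, s^2(0) = 5.9830, stdev(Y) = 15.04, F(0) = 111.4792, and b(0)^T · b(0) = .6065^2 + .5277^2 + .0434^2 + (-.1603)^2 = 0.674:

    k_{HKB}(0) = \frac{4 \times 5.9830}{12 \times 15.04^2 \times 0.674} \approx 0.0131, \qquad k_{LW} = \frac{1}{111.4792} \approx 0.0090

in agreement with the values .0130763 and .0089703 printed by RIDGE.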
PROGRAM STRUCTURE

The program was developed keeping in mind that RR is not yet an established and widely used procedure among geoscientists. Therefore, the algorithm and program are kept as simple as possible. Some routines as published by Davis (1973), among others for the extraction of eigenvalues and matrix inversion, are incorporated. The source code is documented extensively through comment lines. Users of FORTRAN should be able to adapt and extend the program to their needs.

Secondly, the program's output supplies a wide variety of parameters that may give insight into the structure of the regression model at hand. Correlations and multiple correlations, VIF(0) parameters, eigenvalues, and the condition number enable an evaluation of the extent and location of the intercorrelations. Variance inflation factor and eigenvalue parameters are considered the best currently available multicollinearity diagnostics (Montgomery and Peck, 1982). A comparison between the various criteria and parameters for an optimum k -- MSE(k), VIF(k), kLW, kHKB(0), kHKB(k), MGL(k), MGD(k), visual inspection of the ridge traces -- will lead to the best possible selection. Future application of RIDGE to geological problems may provide insight into the relative value of the different criteria.
[Figure 1. Ridge traces of regression coefficients for Hald data (Hald, 1952). A, normalized regression coefficients; B, intercept value. Note rapid change at small values of k.]

The source code is listed in Appendix 1. The program requires three user files, designated the input file, the print file, and the output file respectively. The names of these files may be entered interactively; if the program is run in batch mode the user file names must be stated on the default input on three consecutive lines in the order mentioned. The print file can be sent directly to a printer.
The input file

The input required by RIDGE basically is the matrix of the dependent variable Y and all the predictor variables X_j. The first column is reserved for the dependent Y. To ensure that the data are read correctly, the total number of variables M (including Y) and the number of observations N must be stated in the first line of the input file. Current maxima for M and N are 12 and 400 respectively. N should always exceed M, otherwise the problem is underdetermined and a different regression approach such as PC regression has to be selected (Montgomery and Peck, 1982). In the first line also the starting and increment value for the ridge parameter k must be given, and a flag (LOGOPT) must be set that indicates either linear (0) or exponential (1) incrementation of k. The latter option is suited especially for exploratory inspection of the ridge trace. All input values should be separated by at least one space or a comma. To increase readability of the output, the next M lines list the names of all variables in the order they occur in the data matrix. Names should not exceed ten characters. The input data matrix follows next. Several lines may be used for the input of one observation, but each observation should start on a new line. Again input values should be separated by at least one space or a comma. An example of the input file is given in Appendix 2.
Calculations and the contents of the print file

The first ten observations of the input matrix are written to the print file for verification. The number of observations read and the means and standard deviations of all variables are printed. All variables then are normalized to units of standard deviation from the mean. The matrix of correlations between the dependent variable and the predictors, and among the predictors, is calculated and written to the print file. In addition multiple correlations between each predictor and all other predictors and the related OLS VIF, that is VIF(0), are given. The determinant of R, the correlation matrix among the predictors, is printed with all significant digits as an auxiliary check on possible singularity. The eigenvalues of R and the condition number are given.

Regression proceeds as described under "Theory". The normalization of the predictor variables, which is required in RR, generally prevents invalidation of the results because of round-off errors. As an additional precaution against this type of error, double precision is used throughout the program. Irrespective of the starting value given for k on input, a basic OLS (RR at k = 0) always is performed. With exponential incrementation of k, or if the starting value given is other than zero, the second RR is performed at this starting value. In total 21 regression analyses are performed, k each time being raised or multiplied by the increment value. Default settings prevent repetitive output for a zero starting value of k with exponential incrementation.

For each analysis a table of the regression coefficients and an analysis of variance (ANOVA) table is recorded on the print file. The regression coefficients table lists the ridged VIFs, the ridged regression coefficients with their standard deviations, the normalized coefficients, the contribution of the normalized coefficients to the MSE(k), and the contribution of each predictor to R^2_{YX}. A lower and upper bound for the bias contributing to the mean square error are given: the lower estimate is based on the ridged b_j, which with increasing k tend to be too small; the higher estimate is based on the OLS solution. Summation of all contributions gives the range of the MSE(k). Only for k = 0 does the summation of all contributions to R^2_{YX} exactly equal R^2_{YX}. The ANOVA table shows the regression and residual contributions to the sum of squares of the dependent
Y, and calculates the F statistic for the regression [see Eq. (21)]. An additional F statistic for each RR with respect to the OLS (s^2(k)/s^2(0)) also is given. Obenchain (1977) has shown that at fixed k the distribution of the F statistic for hypothesis testing is not influenced by ridging. However, distributional properties are unknown for stochastic choice of k (Montgomery and Peck, 1982). Together with the standard deviation of prediction of the Y estimate and the bias, the second F statistic is an indication of the decrease in fit because of the ridge parameter. The goodness-of-fit is documented further by the (squared) multiple correlation coefficient of Y with all predictors, calculated as the correlation between observed and estimated Y.

From the OLS solution kLW and kHKB(0) are calculated. At each step of k the current estimate kHKB(k) is given. Ideally the plot of the estimate kHKB(k) vs k intercepts the 1:1 line; the intercept then gives the last updated value as described with Equation (20) (Fig. 2). However, kHKB(k) may diverge from k. The MGL and MGD norms are given at each step of k. They are zero for k = 0 and increase with increasing k. According to McDonald and Galarneau (1975) they should reach a value of 1 at the appropriate k. Variance inflation factor parameters, kLW, first and updated estimates for kHKB, and MGL or MGD, should be used as guidelines for the final selection of k from the ridge traces.

The names of the input and output files used are given on the print file. A message is triggered if the default settings with exponential incrementation of k were applied.

[Figure 2. Graphical display of estimates for appropriate value of k: kLW (Lawless and Wang, 1976), kHKB(0), and updated kHKB at intercept with 1:1 line (Hoerl, Kennard, and Baldwin, 1975).]

The output file

A table of all regression coefficients as a function of the ridge parameter k is followed by similar tables for the standard deviations of the regression coefficients, the normalized regression coefficients, and the VIF parameters. Next the standard deviation of prediction of the Y estimate, the minimum and maximum estimate for MSE, kHKB, and the MGL and MGD norms are given for each value of k. The last two tables on the output file list the estimated values of the dependent Y and the deviation from the observed value respectively, for each observation and as a function of the ridge parameter k. The contents of the output file may be used as input for a suitable spreadsheet or graphics program, to produce the ridge traces or to perform curve fitting.
Performance

The program was tested using the Hald data. Starting and increment values of k were selected so as to compare the results for k = 0.0131. Ridge regression results conform to those given in the literature (Draper and Smith, 1981). Input file, print file, and output file for the Hald data are given in Appendices 2, 3, and 4 respectively; the ridge traces are shown in Figures 1, 2, and 3. The ridge traces of the normalized coefficients conform to the output of the Statgraphics(R) RR routine. Ordinary least-squares statistics also compared adequately with output from the SPSS software.

APPLICATION OF RR IN GEOSCIENCE

As mentioned before, OLS is the widely used regression technique. However, RR already has been applied successfully in geoscience. Jones (1972) recommends the use of RR explicitly for closed or constant-sum systems--a constant problem in geology--and in trend-surface analysis. Howarth (1984) shows that with OLS the effects of the (intercorrelated) light rare earths on uranium content in stream-sediment samples are grossly overestimated, whereas RR shows the instability of the OLS regression coefficients; especially the pattern of the ridge traces is illustrative. Wood and Crerar (1985) and van Gaans, Oonk, and Somsen (1990) applied RR in geochemical thermodynamics. Both in the determination of solubility constants and in the extraction of Pitzer parameters, all predictor variables are some function of concentration, thereby being correlated inherently.

Other regression problems are given in recent literature where RR probably could have been clarifying. For example, Schulze and Schwertmann (1984), in their attempt to characterize the influence of aluminum substitution on goethite structure, give a complex OLS model, whereas multicollinearity is indicated by individual correlations between the predictors of over 0.80. Least-squares calculations by Barker and others (1989), used to demonstrate the fractionation relationship between different magmas, also are based on magma and mineral-type predictors that are correlated highly. Other variables whose separate influences one would like to deduce, and that may occur intrinsically correlated in geological studies, include P and T, age and sedimentary environment, and different particle-size parameters.
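The constant-sum situation can be made explicit with Equation (3): if the predictors are percentages subject to exact closure, \sum_j X_{ij} = 100 for every sample i, then

    \sum_j \mathrm{stdev}(X_j) \cdot z_{ij} = 0

so the columns of z are linearly dependent, det(R) = 0, and the OLS solution of Equation (8) does not exist; with approximate closure det(R) merely is near zero and the b_j become unstable in the manner described under "Theory".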
[Figure 3. Ridge traces of various parameters for Hald data (Hald, 1952); horizontal axis logarithmically scaled for better resolution. A, normalized regression coefficients; B, intercept value; C, standard error of prediction of Y estimate, maximum and minimum mean square error of estimation; D, variance inflation factors with approximate upper limit of 10 (Marquardt, 1970); E, estimates for appropriate value of k (compare Fig. 2); F, norms MGL and MGD for selection of k after McDonald and Galarneau (1975). Note minimum (although small) value of k needed to have impact on parameters. Normalized regression coefficients indeed rapidly approach zero for higher values of k; intercept value then approaches 95.4, the mean of Y. Note also that MGL and MGD norms will not reach the value of 1.]
With Hoerl, Kennard, and Baldwin (1975) and Montgomery and Peck (1982), we agree that whoever decides not to use RR but OLS to estimate regression coefficients at least should study the effects of not doing so. Inspection of the ridge traces seems most informative. The current program will be of help in this respect and may further the insight into the different aspects of RR.
Acknowledgments--Mark Dekkers kindly provided us with literature examples of correlated predictors in geoscience. Leon Linnartz critically read the manuscript.

REFERENCES

Agterberg, F. P., 1974, Geomathematics: Elsevier Scientific Publ. Co., Amsterdam, 596 p.
Anderson, T. W., 1958, An introduction to multivariate statistical analysis: John Wiley & Sons Inc., London, 374 p.
Barker, F., Sutherland Brown, A., Budahn, J. R., and Plafker, G., 1989, Back-arc with frontal-arc component origin of Triassic Karmutsen basalt, British Columbia, Canada: Chem. Geology, v. 75, no. 1, p. 81-102.
Barnett, V., and Lewis, T., 1984, Outliers in statistical data: John Wiley & Sons Inc., London, 463 p.
Davis, J. C., 1973, Statistics and data analysis in geology: John Wiley & Sons Inc., London, 550 p.
Dixon, W. J., editor, 1981, BMDP statistical software 1981: Univ. California Press, California, 725 p.
Draper, N. R., and Smith, H., 1981, Applied regression analysis: John Wiley & Sons Inc., London, 709 p.
van Gaans, P. F. M., Oonk, H. A. J., and Somsen, G., 1990, Thermodynamics of aqueous gallium chloride. Apparent molar volumes and heat capacities at 25°C: Jour. Sol. Chem., v. 19, no. 10.
Gibbons, D. I., and McDonald, G. C., 1984, A rational interpretation of the ridge trace: Technometrics, v. 25, no. 4, p. 339-346.
van der Grinten, P. M. E. M., and Lenoir, J. M. H., 1973, Statistische procesbeheersing: Uitg. Het Spectrum B.V., Utrecht, 599 p.
Hald, A., 1952, Statistical theory with engineering applications: John Wiley & Sons Inc., London, 783 p.
Hoerl, A. E., 1962, Application of ridge analysis to regression problems: Chem. Eng. Prog., v. 58, no. 3, p. 54-59.
Hoerl, A. E., and Kennard, R. W., 1976, Ridge regression: iterative estimation of the biasing parameter: Comm. Stat., v. A5, no. 1, p. 77-78.
Hoerl, A. E., Kennard, R. W., and Baldwin, K. F., 1975, Ridge regression: some simulations: Comm. Stat., v. A4, no. 2, p. 105-123.
Hoerl, R. W., Schuenemeyer, J. H., and Hoerl, A. E., 1986, A simulation of biased estimation and subset regression techniques: Technometrics, v. 28, no. 4, p. 369-380.
Howarth, R. J., 1984, Statistical applications in geochemical prospecting: a survey of recent developments: Jour. Geochem. Expl., v. 21, no. 1, p. 41-61.
Jennrich, R. I., 1978, Ridge regression, Bayes estimation, variance components and the general mixed model by means of nonlinear regression: BMDP Tech. Rept. No. 32, BMDP Statistical Software Inc., Los Angeles, California, 18 p.
Jones, T. A., 1972, Multiple regression with correlated independent variables: Jour. Math. Geology, v. 4, no. 3, p. 203-216.
Lawless, J. F., and Wang, P., 1976, A simulation study of ridge and other regression estimators: Comm. Stat., v. A5, no. 3, p. 307-323.
Li, C. C., 1964, Introduction to experimental statistics: McGraw-Hill Book Co. Inc., New York, 460 p.
Marquardt, D. W., 1970, Generalized inverses, ridge regression, biased linear estimation and nonlinear estimation: Technometrics, v. 12, no. 3, p. 591-612.
McDonald, G. C., and Galarneau, D. I., 1975, A Monte Carlo evaluation of some ridge-type estimators: Jour. Am. Stat. Assoc., Sect. Theory Methods, v. 70, no. 350, p. 407-416.
Montgomery, D. C., and Peck, E. A., 1982, Introduction to linear regression analysis: John Wiley & Sons Inc., London, 504 p.
Norusis, M. J., 1986, SPSS PC+ for the IBM PC/XT/AT: SPSS Inc., Chicago, Illinois, variously paginated.
Obenchain, R. L., 1977, Classical F-tests and confidence intervals for ridge regression: Technometrics, v. 19, no. 5, p. 429-439.
Schulze, D. G., and Schwertmann, U., 1984, The influence of aluminium on iron oxides: X. Properties of Al-substituted goethites: Clay Minerals, v. 19, no. 3, p. 521-539.
Snedecor, G. W., and Cochran, W. G., 1967, Statistical methods: The Iowa State Univ. Press, Ames, Iowa, 302 p.
Statgraphics, 1989, STSC Inc., USA, variously paginated.
Wood, S. A., and Crerar, D. A., 1985, A numerical method for obtaining multiple linear regression parameters with physically realistic signs and magnitudes: Application to the determination of equilibrium constants from solubility data: Geochim. Cosmochim. Acta, v. 49, no. 2, p. 165-172.
APPENDIX 1

Program Listing

C     RIDGE, VERSION 1989, FORTRAN 77
C     P.F.M. VAN GAANS AND S.P. VRIEND
C     INSTITUTE FOR EARTH SCIENCES, DEPARTMENT OF GEOCHEMISTRY,
C     UNIVERSITY OF UTRECHT
C
C     THIS PROGRAM PERFORMS MULTIPLE LINEAR REGRESSION AND RIDGE
C     REGRESSION ANALYSIS. THEORY IS GIVEN IN MONTGOMERY & PECK (1982).
C     ROUTINES AS PUBLISHED BY DAVIS (1973) ARE INCORPORATED.
C     VARIABLES ARE EXPLAINED IN COMMENT LINES WHERE NECESSARY
C
      PROGRAM RIDGE
      IMPLICIT DOUBLE PRECISION (A-H, O-Z)
      PARAMETER (NROW=400, MCOL=12, KRD=21)
C
C     NROW IS THE MAXIMUM NUMBER OF CASES
C     MCOL IS THE MAXIMUM NUMBER OF VARIABLES, INCLUDING THE DEPENDENT Y
C     KRD IS THE NUMBER OF STEPS FOR THE RIDGE PARAMETER RIDGEK
C
      DIMENSION X(NROW,MCOL), XMEAN(MCOL), XSTDEV(MCOL), XZ(NROW,MCOL)
      DIMENSION CORRXX(MCOL,MCOL), A(MCOL,MCOL), CORINV(MCOL,MCOL)
      DIMENSION EIGENV(MCOL), B(MCOL), R2X(MCOL), VIF(MCOL,MCOL)
      DIMENSION BIASQQ(MCOL,MCOL), BIASQ0(MCOL), BIASQ1(MCOL)
      DIMENSION RIDGEK(KRD), STDEVY(KRD), HKB(KRD), QMGL(KRD), QMGD(KRD)
      DIMENSION AVSER0(KRD), AVSER1(KRD)
      DIMENSION D(KRD,NROW,2), BCOEF(KRD,MCOL,3), DIAVIF(KRD,MCOL)
      CHARACTER NAME(MCOL)*10, NINFL*80, NPRNT*80, NOUTFL*80
C
      ZERO=1.0E-14
C
C     THE VALUE OF ZERO IS USED TO PREVENT DIVISION BY NEAR ZERO
C     NUMBERS IN MATRIX INVERSION
C
C...  READ USER NAMES OF INPUT FILE, PRINT FILE AND OUTPUT FILE
C
   10 WRITE (*,9999)
 9999 FORMAT (' ENTER NAME OF INPUT FILE:')
      READ (*,9998) NINFL
 9998 FORMAT (A80)
      OPEN (6,FILE=NINFL,STATUS='OLD',ERR=10)
      WRITE (*,9997) NINFL
 9997 FORMAT (1X,A80)
C
      WRITE (*,9996)
 9996 FORMAT (' ENTER NAME OF PRINT FILE:')
      READ (*,9998) NPRNT
      OPEN (8,FILE=NPRNT)
      WRITE (*,9997) NPRNT
C
      WRITE (*,9995)
 9995 FORMAT (' ENTER NAME OF OUTPUT FILE:')
      READ (*,9998) NOUTFL
      OPEN (9,FILE=NOUTFL)
      WRITE (*,9997) NOUTFL
C
C...  READ INPUT DATA MATRIX
C     M IS THE NUMBER OF VARIABLES IN THE INPUT DATA
C     N IS THE NUMBER OF CASES ON INPUT, N SHOULD BE LARGER THAN M
C     BEGINKR IS THE STARTING VALUE FOR RIDGEK.
C     STEPKR IS THE INCREMENT VALUE FOR RIDGEK IF LOGOPT=0
C     STEPKR IS THE MULTIPLIER IF LOGOPT=1
C     LOGOPT IS THE OPTION FOR LINEAR (=0) OR EXPONENTIAL (=1)
C     INCREMENTATION OF RIDGEK. IF LOGOPT=1 AND BEGINKR=0, THE DEFAULTS
C     BEGINKR=1E-5 AND STEPKR=2 ARE IMPLEMENTED.
C
      READ (6,*) M,N,BEGINKR,STEPKR,LOGOPT
      IF (N.LE.M) THEN
        WRITE (8,9994)
 9994   FORMAT (' NUMBER OF OBSERVATIONS DOES NOT EXCEED NUMBER OF',
     *' VARIABLES, PROGRAM TERMINATED'/)
        STOP
      ENDIF
      IF (LOGOPT.EQ.1.AND.BEGINKR.EQ.0) THEN
        BEGINKR=1E-5
        STEPKR=2
        LOGOPT=-1
      ENDIF
C
C     NAME(J) IS THE NAME OF THE JTH VARIABLE IN THE INPUT MATRIX
C     (NAME(1) IS THE NAME OF THE DEPENDENT Y)
C
      DO 20 J=1,M
        READ (6,9993) NAME(J)
 9993   FORMAT (A10)
   20 CONTINUE
C
C     X(I,J) IS THE VALUE OF THE JTH VARIABLE FOR THE ITH CASE
C     (X(I,1) IS THE VALUE OF Y FOR THE ITH CASE)
C
      DO 30 I=1,N
        READ (6,*) (X(I,J),J=1,M)
   30 CONTINUE
C
      WRITE (8,9992)
 9992 FORMAT (' INPUT-DATA MATRIX, FIRST TEN OBSERVATIONS'/)
      WRITE (8,9991) (NAME(J), J=1,M)
 9991 FORMAT (' NR',2X,12(1X,A10))
      NN=10
      IF (N.LT.10) NN=N
      DO 40 I=1,NN
        WRITE (8,9990) I, (X(I,J), J=1,M)
 9990 FORMAT (1X,I4,2X,1P,12E11.3)
   40 CONTINUE
C
C...  STANDARDIZE INPUT DATA MATRIX
C     CALCULATE MEAN AND STANDARD DEVIATION
C     SX IS THE SUMMATION OVER ALL CASES OF X
C     SXX IS THE SUMMATION OVER ALL CASES OF SQUARED X
C     XMEAN(J) IS THE MEAN OF X(I,J)
C     XSTDEV(J) IS THE STANDARD DEVIATION OF X(I,J)
C
      DO 70 J=1,M
        SX=0.0
        SXX=0.0
        DO 50 I=1,N
          SX=SX+X(I,J)
          SXX=SXX+X(I,J)**2
   50   CONTINUE
        XMEAN(J)=SX/FLOAT(N)
        XSTDEV(J)=SQRT((SXX-SX*SX/FLOAT(N))/FLOAT(N-1))
C
C     SUBTRACT MEAN FROM EACH ELEMENT IN COLUMN, THEN DIVIDE RESULT
C     BY STANDARD DEVIATION
C     XZ(I,J) IS THE STANDARDIZED VALUE OF X(I,J)
C
        DO 60 I=1,N
          XZ(I,J)=(X(I,J)-XMEAN(J))/XSTDEV(J)
   60   CONTINUE
   70 CONTINUE
C
      WRITE (8,9989) N
 9989 FORMAT (/1X,I4,' OBSERVATIONS')
      WRITE (8,9988) (XMEAN(J), J=1,M)
 9988 FORMAT (' MEAN ',1X,1P,12E11.3)
      WRITE (8,9987) (XSTDEV(J), J=1,M)
 9987 FORMAT (' STDEV',1X,1P,12E11.3)
C
C...  CALCULATE CORRELATION MATRIX CORRXX(J,JJ) BETWEEN ALL VARIABLES
C     SX1X2 IS THE SUM OF CROSS PRODUCTS OF XZ(J) AND XZ(JJ)
C
      DO 100 JJ=1,M
        DO 90 J=1,JJ
          IF (J.LE.JJ) THEN
            SX1X2=0.0
            DO 80 I=1,N
              SX1X2=SX1X2+XZ(I,J)*XZ(I,JJ)
   80       CONTINUE
            CORRXX(J,JJ)=SX1X2/(N-1)
            CORRXX(JJ,J)=CORRXX(J,JJ)
          ENDIF
   90   CONTINUE
  100 CONTINUE
C
      WRITE (8,9986)
 9986 FORMAT (//' CORRELATION MATRIX'/)
      WRITE (8,9985) (NAME(J), J=1,M)
 9985 FORMAT (11X,12A10)
      DO 110 J=1,M
        WRITE (8,9984) NAME(J), (CORRXX(J,JJ), JJ=1,M)
 9984 FORMAT (1X,A10,12F10.4)
  110 CONTINUE
C
C...  CALCULATE EIGENVALUES OF THE CORRELATION MATRIX OF THE PREDICTORS
C     AND CALCULATE THE DETERMINANT DETCOR. INITIALLY A EQUALS THE
C     CORRELATION MATRIX. A CONVERTS TO A DIAGONAL MATRIX OF THE
C     EIGENVALUES. EIGENV IS THE VECTOR OF SORTED EIGENVALUES
C
C     INITIALIZE A-MATRIX AND CALCULATE INITIAL AND FINAL THRESHOLDS
C     ANORM AND FNORM
C
      ANORM=0.0
      A(1,1)=1
      DO 130 J=2,M-1
        A(J,J)=1
        DO 120 JJ=1,J-1
          A(J,JJ)=CORRXX(J+1,JJ+1)
          A(JJ,J)=A(J,JJ)
          ANORM=ANORM+2*A(J,JJ)**2
  120   CONTINUE
  130 CONTINUE
      ANORM=SQRT(ANORM)
      FNORM=ANORM*1.0E-9/FLOAT(M-1)
C
C     COMPUTE CURRENT THRESHOLD ANORM AND INITIALIZE INDICATOR IND
C
  140 ANORM=ANORM/FLOAT(M-1)
  150 IND=0
C
C     SCAN DOWN COLUMNS FOR OFF-DIAGONAL ELEMENTS GREATER THAN OR
C     EQUAL TO THE THRESHOLD
C
      DO 190 J=2,M-1
        DO 180 JJ=1,J-1
          IF (ABS(A(JJ,J)).GE.ANORM) THEN
            IND=1
C
C     COMPUTE SIN AND COS
C
            AL=-A(JJ,J)
            AM=(A(JJ,JJ)-A(J,J))/2.0
            AO=AL/SQRT(AL*AL+AM*AM)
            IF (AM.LT.0) AO=-AO
            SINX=AO/SQRT(2.0*(1.0+SQRT(1.0-AO*AO)))
            SINX2=SINX*SINX
            COSX=SQRT(1.0-SINX2)
            COSX2=COSX*COSX
C
C     ROTATE COLUMNS J AND JJ
C
            DO 160 JJJ=1,M-1
              IF (JJJ.NE.JJ) THEN
                IF (JJJ.NE.J) THEN
                  AT=A(JJJ,JJ)
                  A(JJJ,JJ)=AT*COSX-A(JJJ,J)*SINX
                  A(JJJ,J)=AT*SINX+A(JJJ,J)*COSX
                ENDIF
              ENDIF
  160       CONTINUE
            XT=2.0*A(JJ,J)*SINX*COSX
            AT=A(JJ,JJ)
            BT=A(J,J)
            A(JJ,JJ)=AT*COSX2+BT*SINX2-XT
            A(J,J)=AT*SINX2+BT*COSX2+XT
            A(JJ,J)=(AT-BT)*SINX*COSX+A(JJ,J)*(COSX2-SINX2)
            A(J,JJ)=A(JJ,J)
            DO 170 JJJ=1,M-1
              A(JJ,JJJ)=A(JJJ,JJ)
              A(J,JJJ)=A(JJJ,J)
  170       CONTINUE
          ENDIF
  180   CONTINUE
  190 CONTINUE
C
      IF (IND.GT.0) GOTO 150
      IF (ANORM.GT.FNORM) GOTO 140
C
C     SORT EIGENVALUES
C
      DO 210 J=1,M-1
        EIGENV(J)=A(J,J)
        K=J
        DO 200 JJ=1,M-1
          IF (A(JJ,JJ).GT.EIGENV(J)) THEN
            EIGENV(J)=A(JJ,JJ)
            K=JJ
          ENDIF
  200   CONTINUE
        A(K,K)=0
  210 CONTINUE
C
      DETCOR=1.0
      DO 220 J=1,M-1
        DETCOR=DETCOR*EIGENV(J)
  220 CONTINUE
C
C...  SETUP AND SOLVE SIMULTANEOUS EQUATIONS
C     BEGIN WITH RIDGEK=0, THEN INCREASE RIDGEK IN 20 STEPS FROM
C     BEGINKR WITH STEPKR
C
      DO 420 KR=1,KRD
        IF (BEGINKR.EQ.0) THEN
          RIDGEK(KR)=BEGINKR+STEPKR*(KR-1)
        ELSE IF (KR.EQ.1) THEN
          RIDGEK(KR)=0
        ELSE IF (LOGOPT.EQ.0) THEN
          RIDGEK(KR)=BEGINKR+STEPKR*(KR-2)
        ELSE
          RIDGEK(KR)=BEGINKR*STEPKR**(KR-2)
        ENDIF
C
C     CALCULATE THE INVERSE OF THE RIDGED CORRELATION MATRIX OF THE
C     PREDICTORS AND SOLVE SIMULTANEOUS EQUATIONS.
C     INITIALLY A EQUALS THE RIDGED CORRELATION MATRIX, B THE COLUMN
C     VECTOR OF CORRELATIONS OF Y WITH THE PREDICTORS. CORINV IS SET TO
C     THE IDENTITY MATRIX. A CONVERTS TO THE IDENTITY MATRIX. B AND
C     CORINV ARE CHANGED ANALOGOUSLY AND FINALLY CONTAIN THE SOLUTION.
C
        DO 240 J=1,M-1
          DO 230 JJ=1,J
            A(J,JJ)=CORRXX(J+1,JJ+1)
            A(JJ,J)=A(J,JJ)
            CORINV(J,JJ)=0.0
            CORINV(JJ,J)=0.0
  230     CONTINUE
          A(J,J)=A(J,J)+RIDGEK(KR)
          B(J)=CORRXX(J+1,1)
          CORINV(J,J)=1.0
240 C C C
CONTINUE DIVIDE THE JTH ROW OF A BY A(J,J)
250 C C C
DO 280 J-I,M-I DIV=A(J,J) IF (ABS(DIV)-ZERO.GT.0) THEN DO 250 JJ=l,S-I A(J,JJ)-A(J.JJ)/DIV CORINV(J,JJ)-CORINV(J,JJ)/DIV CONTINUE B(j)=B(J)/DIV REDUCE THE JTH COLUMN OF A TO ZERO
260 270 9983 280 C C C C C C C C
DO 270 JJ=I,M-I IF (J.NE.JJ) THEN RATIO-A(JJ,J) DO 260 JJJ-I,M-i A(JJ,JJJ)-A(JJ,JJJ)-RATIO~A(J,JJJ) CORINV(JJ,JJJ)=CORINV(JJ,JJJ)-RATIO~CORINV(J,JJJ) CONTINUE B(JJ)-B(JJ)-RATIO"B(J) ENDIF CONTINUE ELSE WRITE (8,9983) FORMAT (/' DIVISION BY ZERO IN MATRIX INVERSION, SOLUTION STOPS'/) STOP ENDIF CONTINUE CALCULATE VARIANCE INFLATION FACTORS VIF AND FOR THE LEAST SQAURES SOLUTION THE MULTIPLE CORRELATIONS R2X AMONG THE PREDICTORS. DIAVIF ARE THE DIAGONAL VALUES OF THE VIF MATRIX THET ARE KEPT FOR OUTPUT. FOR THE LEAST SQUARES SOLUTION THE VlF MATRIX EQUALS THE INVERSE OF THE CORRELATION MATRIX.
290 300
IF (KR.EQ.I) THEN DO 300 J-I,M-I DO 290, JJ-I,J VIF(J,JJ)-CORINV(J,JJ) VIF(JJ,J)-VIF(J,JJ) CONTINUE R2X(J+I)-I-I/VI , DIAVIF(KR,J)-VIF J, CONTINUE
FIJ
C 9982 9981 9980
9979 9978 9977 C C C C C C C C C C C C
WRITE (8,9982) (R2X(JD), JD'2,M) FORMAT (IX,10('-')/' MULT. R~R',IIX, IIFI0.4) WRITE (8,9981) (VIF(JD,JD), JD-I,M-I) FORMAT (' VIF'.I7X, IIFI0.4) WRITE (8,9986) FORMAT (/' THE MULTIPLE R~R ARE THE SQUARED MULTIPLE CORRELATIONS' ' AMONG THE PREDICTORS'/' VIF IS THE VARIANCE INFLATION FACTOR.' "i' WITH LEAST SQUARES VIF-I/(I-R~R) ' ) WRITE (8,9979) DETCOR FORMAT (/' DETERMINANT OF Tile PREDICTOR CORRELATION MATRIX: ', F17.15) WRITE (8,9978) (EIGENV(J), J-I,M-I) FORMAT (/' EIGENVALUES',9X, IIFI0.6) WRITE (8,9977) EIGENV(1)/EIGENV(M-I) FORMAT (/' CONDITION NUMBER: ',FI0.2) FOR THE RIDGED SOLUTIONS THE VIF MATRIX EQUALS THE PRODUCT OF THE INVERSE OF THE RIDGED CORRELATION MATRIX TIMES THE NORMAL CORRELATION MATRIX TIMES THE INVERSE OF THE RIDGED CORRELATION MATRIX (MARQUARDT, 1970) ESTIMATE BIAS. THE MEAN SQUARE ERROR DUE TO THE RIDGE BIAS. (BIAS-B'(Zk-II'(Zk-I)B ACCORDING TO MARQUARDT, 1970). BIAS0 IS THE TOO OPTIMISTIC ESTIMATED BASED ON THE RIDGED B COEFFICIENTS, BIASI IS THE TOO PESSIMISTIC ESTIMATE BASED ON THE LEAST SQUARES SOLUTION. BIASQQ AND BIASQ ARE AN INTERMEDIATE MATRIX AND VECTORS IN THE CALCULATION (BIASQQ-Zk, BIASQ-B'(Zk-I)').
C ELSE BIAS0=0.0 BIAS I-0.0 DO 340 J-I.M-I BIASQ0(J)-0.0 BIASQI(J)-O.0 DO 330 JJ-i,J VIF(J,JJ)-0 DO 320 JJJ-I,M-1 BIASQQ(J,JJJ)-0.0 DO 310 JJJJ-I,M-I BIASQQ(J,JJJ)-BIASQQ(J,JJJ)+CORINV(JJJJ,JJJ) ~CORRXX(J+I,JJJJ+I) VlF(J,JJ)'VIF(J,JJ)+CORINV(JJJJ,JJ)eCORRXX(JJJ+I,JJJJ+I) * ~CORINV(J,JJJ)
Multiple linear regression with correlations among the predictor variables 310
CONTINUE IF (JJJ.EQ.J)
THEN
BIASQ0(J)-BIASQO(J)+B(JJJ)~(BIASQQ(O.JJJ)-I)
320
BIASQI(J)-BIASQI(J)+BCOEF(I,JJJ,3)n(BIASQQ(J,JJJ)-I) ELSE BIASQ0(J)'BIASQ0(J)+B(JJJ)~BIASQQ(J,JJJ) BIASQI(J)=BIASQI(J)+BCOEF(I.JJJ.3)~BIASQQ(J,JJJ) ENDIF CONTINUE IF (J.NE.JJ) THEN VIF(JJ,J)-VIF(J,JJ)
ELSE DIAVIF(KR,J)=VIF(J,J) 330 340 C C... C
C C C C C C
C C
350 C c C C C
ENDIF CONTINUE BIAS0-BIAS0+BIASQ0(J)e~2 BIASImBIASI+BIASQI(J)~"2 CONTINUE ENDIF CALCULATE NON-STANDARDIZED PARTIAL REGRESSION COEFFICIENTS, THE SQUARED LENGTH SMBZSQ OF THE NORMALIZED REGRESSION COEFFICIENT VECTOR AND THE SQUARED DISTANCE DISTBZ OF ITS ENDPOINT TO THAT OF THE LEAST SQUARES VECTOR. SMBZS0 IS THE SQUARED LENGTH OF THE NORMALIZED LEAST SQUARES VECTOR BCOEF(KR,J,I) IS THE JTH PARTIAL REGRESSION COEFFICIENT FOR STEP KR BCOEF(KR,J,2) IS RESERVED FOR THE STANDARD ERROR OF BCOEF(KR,J,I) BCOEF(KR,J,3) IS THE STANDARDIZED PARTIAL REGRESSION COEFFICIENT BCOEF(KR, I,I)-XMEAN(1) BCOEF(KR, I,3)-0 SMBZSQ-O DISTBZ-0 DO 350 J-I,M-I BCOEF(KR,J+I,3)-B(J) SMBZSQ-SMBZSQ+B(J).*2 DISTBZ-DISTBZ+(BCOEF(I,J+I,3)-B(J))~2 BCOEF(KR,J+I.I)-B(J)~XSTDEV(1)/XSTDEV(J+I) BCOEFIKR, I,I}-BCOEF(KR, I,I)-BCOEF(KR,J+I,I)~XMEAN(J+I) CONTINUE IF (KR. EQ.I) SMBZS0-SMBZSQ CALCULATE OIKR. I.I> DCKR..I2>
360 370 C C... C C C C C
E S T I M A T E D VALUE AND DEVIATION FOR EAC, O , S E R V A T I O N IS THE SET OF E S T I M A T E D VALUES AT STEP KR IS THE D E V I A T I O N
DO 370 I-I.N D(KR, I,II-BCOEF(KR, I,I) DO 360 J-I.M-I D(KR,I,I)-D(KR,I,I)+BCOEF(KR,J+I,I)~X(I,J+I) CONTINUE D(KR, I,2)-X(I,I)-D(KR,I,I) CONTINUE CALCULATION OF ERROR MEASURES. ANALYSIS OF VARIANCES SY IS THE SUM OVER ALL CASES FOR THE DEPENDENT Y SYY IS SUM OVER ALL CASES OF SQUARED Y SBXY IS THE SUM OVER ALL CASES OF ESTIMATED Y TIMES Y SBXBX IS THE SUM OVER ALL CASES OF SQUARED ESTIMATED Y
SY-O. 0 SYY-O. 0 SBXY-O.O SBXBX-O.O
380 C C C C C
DO 380 I-I,N SY-SY+X(I,I) SYY-SYY+X(I,I)"~2 SBXY-SBXY+D(KR,I,I)~X(I~I) SBXBX-SBXBX+D(KR, I,I)""2 CONTINUE SST IS THE TOTAL SUM OF SQUARES OF Y SSR IS THE SUM OF SQUARES DUE TO REGRESSION SSD IS THE RESIDUAL SUM OF SQUARES SST-SYY-SY"SY/FLOAT(N) SSR=2~SBXY-SBXBX-SY.SY/FLOAT(N) SSD-SST-SSR
C C C C C
AVSR IS THE MEAN SQUARE DUE TO REGRESSION AVSD IS THE RESIDUAL MEAN SQUARE AVSD0 IS THE RESIDUAL MEAN SQUARE FOR STANDARD SQUARES REGRESSION (K-0)
LEAST
C AVSR-SSR/FLOAT(M-I) AVSD-SSD/FLOAT(N-M) IF (KR.EQ.|) AVSDO-AVSD C C C
R2Y IS THE SQUARED MULTIPLE CORRELATION F AND FR ARE APPARENT F-RATIOS
COEFFICIENT
RY
C R2Y-SSR/SST RY=SQRT(R2Y) F-AVSR/AVSD FR-AVSD/AVSD0 C STDEVY(KR)=SQRT(AVSD) C C... C
390 400 C C C C C C C C 8876 9976 9975
9974
9973 410 • 9972
9971 C C C 9970
9969 9968 9967 9966 9965 C C... C C C C C
C C
CALCULATE
VARIANCE
OF P A R T I A L
REGRESSION
COEFFICIENTS
BCOEF(KR, I,2)=I/FLOAT(N) DO 400 J=I,M-I BCOEF(KR,J+I,2)=SQRT(VIF(J,J)~AVSD/((N-I)eXSTDEV(J+I)~"2)) DO 390 JJfI,M-I BCOEF(KR, I,2)-BCOEF(KR, I,2)+XMEAN(J+I)eXMEAN(JJ+I) ~ VIF(J,JJ)/((N-I)~XSTDEV(J+I)~XSTDEV(JJ+I)) CONTINUE CONTINUE BCOEF(KR, I,2)'SQRT(AVSD~BCOEF(KR, I,2)) PRINT P A R T I A L R E G R E S S I O N COEFFICIENTS C A L U L A T E THE C O N T I R B U T I O N CONTR OF EACH PREDICTOR TO R SQUARED SUM OVER ALL C O N T R I B U T I O N S IN TOTCON. CALCULATE THE C O N T R I B U T I O N SQRERR TO THE N O R M A L I Z E D MEAN SQUARE ERROR OF THE R E G R E S S I O N COEFFICIENTS. SUM O V E R A L L CONTRIBUTIONS AND THE BIAS TO C A L C U L A T E AVSER. A V S E R 0 AND AVSERI ARE MIN AND MAX VALUES. IF (KR. EQ.I) T H E N W R I T E (8,8876) FORMAT (//IX,77('-')/' LEAST SQUARES SOLUTION (K = 0)'/IX,77('-')) ELSE WRITE (8,9976) RIDGEK(KR) FORMAT (//IX,77('-')/' RIDGE PARAMETER K - ',FI0.7/IX,77('=')) ENDIF W R I T E (8,9975) FORMAT (/' R E G R E S S I O N C O E F F I C I E N T S ' / 31X,' STANDARD ',' NORM. ',' C O N T R I B U T I O N TO ',' CONTR.'/ llX,' VIF ' ' COEFFICIENT',' ERROR ' ' COEFF.', ' MEAN S Q U A R E ERROR ' ,' TO R~R ') TOTCON-0 AVSERO(KR)-O W R I T E (8,9974) (BCOEF(KR, I,L) L-Io2) FORMAT (/' I N T E R C E P T ',SX, IP,2EI2.4/) DO 410 J-I,M-I CONTR-BCOEF(KR,J+I,3)~CORRXX(I,J+I) TOTCON-TOTCON+CONTR SQRERR-VIF(J,J)*AVSD/((N-I)~XSTDEV(1)~*2) AVSERO(KR)-AVSERO(KR)+SQRERR W R I T E (8,9973) NAME(J+I), VIF(J,J), (BCOEF(KR,J+I,L), L-I,3), SQRERR, CONTR FORMAT ( I X , A I O , F S . 2 , 1 P , 2 E I 2 . 4 , 0 P , F 8 . 4 , 6 X , FS.4,5X, F8.4) CONTINUE W R I T E (8,9972) B I A S 0 , B I A S I FORMAT (' BIAS (MIN - MAX)',35X, FS.4j' -'F8.4) AVSER0(KR)'AVSER0(KR)+BIAS0 AVSERI(KR)-AVSER0(KR)-BIAS0+BIASI WRITE (8,9971! AVSER0(KR)I AVSERI(KR) TOTCON FORMAT (52X,18( - ) , 2 X , 6 ( ' - )/52X,FS.4, ~ -',2F8.4) PRINT
ERROR
MEASURES
WRITE (8,9970) FORMAT (//' A N A L Y S I S OF VARIANCE'//' SOURCE OF',I3X, 'SUM OF DEGREES OF MEAN'/' VARIATION',I3X, 'SQUARES FREEDOM SQUARES APP. F-TEST'/IX,62('-')) WRITE (8,9969) SST,N-I FORMAT (' T O T A L VARIATION',3X, FI2.4,18) W R I T E (8,9968) S S R , M - I , A V S R , F FORMAT (' R E G R E S S I O N ' , 8 X , F I 2 . 4 , 1 8 , 2 F I 2 . 4 ) WRITE (8,9967) S S D , N - M , A V S D , F R FORMAT (' RESIDUAL',|0X, FI2.4,18,2FI2.4) W R I T E (8,9966) (N-M)*AVSD0, N-M,AVSDO FORMAT (' LEAST SQ. RES.',4X, FI2.4,18,FI2.4) W R I T E (8,9965) S T D E V Y ( K R ) , R 2 Y , R Y FORMAT (/' S T A N D A R D ERROR OF THE Y ESTIMATE = ',IP,EI2.4,0P/ "' G O O D N E S S OF FIT (R SQUARED) - ',F10.4/ ~' C O R R E L A T I O N C O E F F I C I E N T R - ',F10.4) FOR KR-I C A L C U L A T E THE E S T I M A T E RKLW FOR THE APPROPRIATE VALUE OF K A C C O R D I N G TO LAWLWESS & WANG (1976). CALCULATE THE CURRENT ESTIMATE HKB FOR THE A P P R O P R I A T E VALUE OF K ACCORDING TO HOERL, K E N N A R D & B A L D W I N (1975) AND FOR KR>I THE RATIO QMGL OF THE LENGTH R E D U C T I O N OF T H E B V E C T O R COMPARED TO ITS LENGTH AT K-O AND THE M E A N SQUARE ERROR AT K=0 (MCDONALD & GALARNEAU, 1975). QMGD IS THE R A T I O OF THE D I S T A N C E DISTBZ AND THIS MEAN SQUARE ERROR. HKB(KR)-(M-I)~AVSDO/((N-I)*XSTDEV(1)~2)/SMBZSQ IF (KR. EQ. I) T H E N RKLW=I/F W R I T E (8,8864) RKLW, HKB(KR)
8864
FORMAT (//' LEAST SQUARES ESTIMATE FOR K (LAWLESS & WANG): ' ~,9X,FI0.7//' LEAST SQUARES ESTIMATE FOR K (HOERL,KENNARD &', *' BALDWIN): ',FIO.7) ELSE QMGL(KR)-(SMBZS0-SMBESQ)/AVSER0(1) QMGD(KR)-DISTBZ/AVSER0(1) WRITE (8,9964) HKB(KR), QMGL(KR), QMGD(KR) 9964 FORMAT (//' PRESENT ESTIMATE FOR K (HOERL,KENNARD & BALDWIN): ' *,FLU.7//' LENGTH REDUCTION NORM (AFTER MCDONALD & GALARNEAU): ' ~,F8.4/' DISTANCE NORM:',38X, FS.&) ENDIF C 420 CONTINUE C C... MESSAGES C WRITE (8,7763) NINFL 7763 FORMAT (//' DATA WERE READ FROM: ',A80) IF (LOGOPT.EQ.-I) WRITE (8,8863) 8863 FORMAT (/' DEFAULT SETTINGS WERE IMPLEMENTED WITH LOGOPTml ') WRITE (8,9963) NOUTFL 9963 FORMAT (/' OUTPUT MATRICES WERE WRITTEN TO: ',A80//) C C... WRITING SEPARATE OUTPUT FILES C WRITE (9,9962) 9962 FORMAT (' REGRESSION COEFFICIENT MATRIX') DO 430 KR=I.KRD WRITE (9,~961) RIDGEK(KR), (BCOEF(KR,J,I), J=I,M) 9961 FORMAT (IX,F|0.7,1P,12EI2.4) 430 CONTINUE WRITE (9,9960) 9960 FORMAT (' STANDARD DEVIATION OF REGRESSION COEFFICIENT MATRIX') DO 440 KR~I.KRD WRITE (9,~961) RIDGEK(KR), (BCOEF(KR,J,2), J~I,M) 440 CONTINUE WRITE (9,9959) 9959 FORMAT (' STANDARDIZED REGRESSION COEFFICIENT MATRIX') DO 450 KR-I,KRD WRITE (9,9958) RIDGEK(KR), (BCOEF(KR,J,3), J-2,M) 9958 FORMAT (IX,FI0.7,12X, IIFI2.6) 450 CONTINUE C WRITE (9,9957) 9957 FORMAT (' VARIANCE INFLATION FACTORS') DO 460 KR-I.KRD WRITE (9,~956) RIDGEK(KR), (DIAVIF(KR,J), J-I,M-I) 9956 FORMAT (IX, FI0.7,12X, IIFI2.6) 460 CONTINUE C WRITE (9,9955) 9955 FORMAT (' STANDARD ERROR OF THE ESTIMATE, NORMALIZED MEAN SQUARE', *' ERROR (MIN AND MAX),'/' ESTIMATE KHKB AND NORMS MGL AND MGD') DO 470 KR-I.KRD WRITE (9,~954) RIDGEK(KR), STDEVY(KR), AVSERO(KR), AVSERI(KR), * HKB(KR), QMGL(KR), QMGD(KR) 9954 FORMAT (lX, FI0.7,IP,E12.4,0P,2FI2.4,FI0.7,2F8.4) 470 CONTINUE C WRITE (9,9953) (RIDGEK(KR), KR-I,KRD) 9953 FORMAT (~ MATRIX OF ESTIMATED Y'/5X,'K-',SX,21FII.7/' Y') DO 480 I-I,N WRITE (9,9952) X(I,l), (D(KR, I,I), KR-I,KRD) 9952 FORMAT (IX, IP,22EII.3) 480 CONTINUE WRITE (9 9951) (RIDGEK(KR), KR-I,KRD) 9951 FORMAT ,| MATRIX OF DEVIATIONS'/5X,'K-',5X,21FII.7/' Y') DO 490 I-I,N WRITE (9,9952) X(I,I), (D(KR,I,2), KR-I,KRD) 490 CONTINUE C END
APPENDIX 2

Input File for Hald Data (Hald, 1952)

5, 13, 1.2793E-5, 2, 1
x5
x1
x2
x3
x4
 78.50   7.00  26.00   6.00  60.00
 74.30   1.00  29.00  15.00  52.00
104.30  11.00  56.00   8.00  20.00
 87.60  11.00  31.00   8.00  47.00
 95.90   7.00  52.00   6.00  33.00
109.20  11.00  55.00   9.00  22.00
102.70   3.00  71.00  17.00   6.00
 72.50   1.00  31.00  22.00  44.00
 93.10   2.00  54.00  18.00  22.00
115.90  21.00  47.00   4.00  26.00
 83.80   1.00  40.00  23.00  34.00
113.30  11.00  66.00   9.00  12.00
109.40  10.00  68.00   8.00  12.00

line 1:      number of variables; number of observations; starting value for k;
             increment or multiplier for k; LOGOPT (0 = linear, 1 = exponential
             incrementation).
lines 2-6:   variable names in order of appearance in data matrix.
lines 7-19:  data matrix, first column is dependent variable.
APPENDIX
Print Filefor Input of Appendix 2 is given for Selected Values of k INPUT-DATA NR I 2 3 4 5 6 7 8 9 lO
MATRIX,
x5 7 7 I 8 9 I
850E+01 430E+01 043E+02 760E+01 590E+01 092E+02 1 027E+02 7 250E+01 9.310E+01 1.159E+02
13 O B S E R V A T I O N S MEAN 9.542E+01 STDEV 1.504E+01 CORRELATION
FIRST TEN O B S E R V A T I O N S xl 7 I I I 7 I 3 I 2 2
000E+O0 O00E+O0 100E+01 100E+OI 000E+O0 IOOE+OI 000E+00 000E+00 000E+00 100E+OI
x2 2.600E+01 2.900E+01 5.600E+01 3.100E+Ol 5.200E+01 5.500E+01 7.I00E+01 3.100E+0! 5.400E+01 4.700E+01
7.462E+00 5.882E+00
4.815E+01 1.556E+01
MULT. VIF
x4 6.000E+OI 5.200E+01 2.000E+OI 4.700E+01 3.300E+01 2.200E+01 6.000E+O0 4.400E+01 2.200E+01 2.600E+01
1.177E+01 6.405E+00
3.000E+OI 1.674E+01
MATRIX
x5 x5 xl x2 x3 x4
000E+O0 500E+OI O00E+O0 O00E+00 O00E+O0 O00E+O0 700E+OI 200E+01 1 800E+01 4 000E+00
x3 6 I 8 8 6 9 1 2
xl 1.0000 .7307 .8163 -.5347 -.8213
R~R
x2
x3
x4
.7307 1.0000 .2286 .8241 -.2454
.8163 .2286 1.0000 -.1392 -.9730
-.5347 -.8241 -.1392 1.0000 .0295
-.8213 -.2454 .9730 .0295 1.0000
.9740 38.4962
.9961 254.4232
.9787 46.8684
.9965 282.5129
THE MULTIPLE R~R ARE T H E S Q U A R E D M U L T I P L E C O R R E L A T I O N S AMONG THE PREDICTORS VIF IS THE V A R X A N C E I N F L A T I O N FACTOR. W I T H LEAST SQUARES VIF-I/(I-R~R) DETERMINANT
OF T H E
EIGENVALUES CONDITION
NUMBER:
PREDICTOR
CORRELATION
2.235704 1376.88
1.576066
MATRIX: .186606
.001067659340641 ,001624
===============================================================================
 LEAST SQUARES SOLUTION,  K = 0
===============================================================================
 REGRESSION          COEFFICIENT   STANDARD      NORM.    CONTRIBUTION TO     CONTR.
 COEFFICIENTS  VIF                 ERROR         COEFF.   MEAN SQUARE ERROR   TO R*R

 INTERCEPT            6.2405E+01   7.0071E+01
 x1      38.50        1.5511E+00   7.4477E-01    .6065      .0848              .4432
 x2     254.42        5.1017E-01   7.2379E-01    .5277      .5605              .4307
 x3      46.87        1.0191E-01   7.5471E-01    .0434      .1033             -.0232
 x4     282.51       -1.4406E-01   7.0905E-01   -.1603      .6224              .1316
 BIAS (MIN - MAX)                                           .0000 -  .0000
                                                           -----------------
                                                           1.3710 - 1.3710     .9824

 ANALYSIS OF VARIANCE
 SOURCE OF          SUM OF       DEGREES OF     MEAN           APP.
 VARIATION          SQUARES      FREEDOM        SQUARES        F-TEST
 --------------------------------------------------------------------
 TOTAL VARIATION    2715.7631      12
 REGRESSION         2667.8994       4           666.9749       111.4792
 RESIDUAL             47.8636       8             5.9830         1.0000
 LEAST SQ. RES.       47.8636       8             5.9830

 STANDARD ERROR OF THE Y ESTIMATE  =  2.4460E+00
 GOODNESS OF FIT (R SQUARED)       =    .9824
 CORRELATION COEFFICIENT R         =    .9911

 LEAST SQUARES ESTIMATE FOR K (LAWLESS & WANG):            .0089703
 LEAST SQUARES ESTIMATE FOR K (HOERL,KENNARD & BALDWIN):   .0130763

===============================================================================
 RIDGE PARAMETER  K =   .0004094
===============================================================================
 REGRESSION          COEFFICIENT   STANDARD      NORM.    CONTRIBUTION TO     CONTR.
 COEFFICIENTS  VIF                 ERROR         COEFF.   MEAN SQUARE ERROR   TO R*R

 INTERCEPT            6.7113E+01   5.5985E+01
 x1      25.53        1.5016E+00   6.0663E-01    .5872      .0563              .4291
 x2     162.56        4.6183E-01   5.7871E-01    .4777      .3583              .3899
 x3      30.78        5.1759E-02   6.1181E-01    .0220      .0679             -.0118
 x4     180.37       -1.9141E-01   5.6671E-01   -.2130      .3976              .1749
 BIAS (MIN - MAX)                                           .0375 -  .1230
                                                           -----------------
                                                            .9176 - 1.0031     .9821

 ANALYSIS OF VARIANCE
 SOURCE OF          SUM OF       DEGREES OF     MEAN           APP.
 VARIATION          SQUARES      FREEDOM        SQUARES        F-TEST
 --------------------------------------------------------------------
 TOTAL VARIATION    2715.7631      12
 REGRESSION         2667.8722       4           666.9681       111.4146
 RESIDUAL             47.8909       8             5.9864         1.0006
 LEAST SQ. RES.       47.8636       8             5.9830

 STANDARD ERROR OF THE Y ESTIMATE  =  2.4467E+00
 GOODNESS OF FIT (R SQUARED)       =    .9824
 CORRELATION COEFFICIENT R         =    .9911

 PRESENT ESTIMATE FOR K (HOERL,KENNARD & BALDWIN):         .0142404
 LENGTH REDUCTION NORM (AFTER MCDONALD & GALARNEAU):       .0402
 DISTANCE NORM:                                            .0045

===============================================================================
 RIDGE PARAMETER  K =   .0131000
===============================================================================
 REGRESSION          COEFFICIENT   STANDARD      NORM.    CONTRIBUTION TO     CONTR.
 COEFFICIENTS  VIF                 ERROR         COEFF.   MEAN SQUARE ERROR   TO R*R

 INTERCEPT            8.3418E+01   7.9198E+00
 x1       2.83        1.2994E+00   2.0372E-01    .5081      .0063              .3713
 x2       3.79        2.9989E-01   8.9144E-02    .3102      .0085              .2532
 x3       2.74       -1.4208E-01   1.8401E-01   -.0605      .0061              .0323
 x4       3.87       -3.4865E-01   8.3676E-02   -.3879      .0087              .3186
 BIAS (MIN - MAX)                                           .0125 - 2.3810
                                                           -----------------
                                                            .0422 - 2.4107     .9754

 ANALYSIS OF VARIANCE
 SOURCE OF          SUM OF       DEGREES OF     MEAN           APP.
 VARIATION          SQUARES      FREEDOM        SQUARES        F-TEST
 --------------------------------------------------------------------
 TOTAL VARIATION    2715.7631      12
 REGRESSION         2667.0850       4           666.7712       109.5805
 RESIDUAL             48.6781       8             6.0848         1.0170
 LEAST SQ. RES.       47.8636       8             5.9830

 STANDARD ERROR OF THE Y ESTIMATE  =  2.4667E+00
 GOODNESS OF FIT (R SQUARED)       =    .9821
 CORRELATION COEFFICIENT R         =    .9910

 PRESENT ESTIMATE FOR K (HOERL,KENNARD & BALDWIN):         .0173292
 LENGTH REDUCTION NORM (AFTER MCDONALD & GALARNEAU):       .1206
 DISTANCE NORM:                                            .0872

===============================================================================
 RIDGE PARAMETER  K =   .4192010
===============================================================================
 REGRESSION          COEFFICIENT   STANDARD      NORM.    CONTRIBUTION TO     CONTR.
 COEFFICIENTS  VIF                 ERROR         COEFF.   MEAN SQUARE ERROR   TO R*R

 INTERCEPT            8.8904E+01   1.5984E+00
 x1        .41        8.2689E-01   1.2913E-01    .3233      .0025              .2363
 x2        .22        2.7679E-01   3.6113E-02    .2863      .0014              .2337
 x3        .40       -3.6208E-01   1.1781E-01   -.1542      .0025              .0824
 x4        .20       -2.9057E-01   3.1771E-02   -.3233      .0012              .2655
 BIAS (MIN - MAX)                                           .0693 - 2.7063
                                                           -----------------
                                                            .0770 - 2.7140     .8179

 ANALYSIS OF VARIANCE
 SOURCE OF          SUM OF       DEGREES OF     MEAN           APP.
 VARIATION          SQUARES      FREEDOM        SQUARES        F-TEST
 --------------------------------------------------------------------
 TOTAL VARIATION    2715.7631      12
 REGRESSION         2579.6508       4           644.9127        37.9048
 RESIDUAL            136.1122       8            17.0140         2.8438
 LEAST SQ. RES.       47.8636       8             5.9830

 STANDARD ERROR OF THE Y ESTIMATE  =  4.1248E+00
 GOODNESS OF FIT (R SQUARED)       =    .9499
 CORRELATION COEFFICIENT R         =    .9746

 PRESENT ESTIMATE FOR K (HOERL,KENNARD & BALDWIN):         .0279927
 LENGTH REDUCTION NORM (AFTER MCDONALD & GALARNEAU):       .2619
 DISTANCE NORM:                                            .1489

===============================================================================
 RIDGE PARAMETER  K =  6.7072164
===============================================================================

 REGRESSION          COEFFICIENT   STANDARD      NORM.    CONTRIBUTION TO     CONTR.
 COEFFICIENTS  VIF                 ERROR         COEFF.   MEAN SQUARE ERROR   TO R*R

 INTERCEPT            9.3721E+01   3.9528E+00
 x1        .01        2.1207E-01   8.1483E-02    .0829      .0010              .0606
 x2        .01        8.7731E-02   2.9858E-02    .0907      .0010              .0741
 x3        .01       -1.3743E-01   7.5547E-02   -.0585      .0010              .0313
 x4        .01       -8.2904E-02   2.7768E-02   -.0922      .0010              .0758
 BIAS (MIN - MAX)                                           .1161 - 3.3984
                                                           -----------------
                                                            .1200 - 3.4023     .2417

 ANALYSIS OF VARIANCE
 SOURCE OF          SUM OF       DEGREES OF     MEAN           APP.
 VARIATION          SQUARES      FREEDOM        SQUARES        F-TEST
 --------------------------------------------------------------------
 TOTAL VARIATION    2715.7631      12
 REGRESSION         1149.0332       4           287.2583         1.4668
 RESIDUAL           1566.7299       8           195.8412        32.7332
 LEAST SQ. RES.       47.8636       8             5.9830

 STANDARD ERROR OF THE Y ESTIMATE  =  1.3994E+01
 GOODNESS OF FIT (R SQUARED)       =    .4231
 CORRELATION COEFFICIENT R         =    .6505

 PRESENT ESTIMATE FOR K (HOERL,KENNARD & BALDWIN):         .3258504
 LENGTH REDUCTION NORM (AFTER MCDONALD & GALARNEAU):       .4718
 DISTANCE NORM:                                            .3502

 DATA WERE READ FROM: hald.txt

 OUTPUT MATRICES WERE WRITTEN TO: outlog
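The multicollinearity diagnostics printed at the head of this file follow directly from the definitions given there: each least-squares VIF equals 1/(1-R*R), and the condition number is the ratio of the largest to the smallest eigenvalue of the predictor correlation matrix. A few lines of FORTRAN 77 verify the printed values (our own sketch, working from the rounded numbers as printed, so agreement is only to within that rounding):

      PROGRAM DIAG
C     Verification sketch for the diagnostics of Appendix 3 (not part
C     of RIDGE).  RSQ holds the printed squared multiple correlations
C     among the predictors, EV the printed eigenvalues of the
C     predictor correlation matrix.
      REAL RSQ(4), EV(4), VIF, CN
      INTEGER J
      DATA RSQ /.9740, .9961, .9787, .9965/
      DATA EV /2.235704, 1.576066, .186606, .001624/
      DO 10 J = 1, 4
      VIF = 1.0/(1.0 - RSQ(J))
      WRITE (*,'(1X,A,I1,A,F10.4)') 'VIF OF PREDICTOR ', J, ': ', VIF
   10 CONTINUE
C     condition number = largest/smallest eigenvalue
      CN = EV(1)/EV(4)
      WRITE (*,'(1X,A,F8.2)') 'CONDITION NUMBER: ', CN
      END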
APPENDIX 4
Output File for Input of Appendix 2; the last two tables are given here for selected values of k only.

REGRESSION COEFFICIENT MATRIX
  .0000000  6.2405E+01  1.5511E+00  5.1017E-01  1.0191E-01 -1.4406E-01
  .0000128  6.2588E+01  1.5492E+00  5.0829E-01  9.9969E-02 -1.4590E-01
  .0000256  6.2768E+01  1.5473E+00  5.0644E-01  9.8058E-02 -1.4771E-01
  .0000512  6.3119E+01  1.5436E+00  5.0283E-01  9.4323E-02 -1.5125E-01
  .0001023  6.3791E+01  1.5366E+00  4.9593E-01  8.7181E-02 -1.5801E-01
  .0002047  6.5022E+01  1.5237E+00  4.8328E-01  7.4076E-02 -1.7039E-01
  .0004094  6.7113E+01  1.5016E+00  4.6183E-01  5.1759E-02 -1.9141E-01
  .0008188  7.0245E+01  1.4683E+00  4.2974E-01  1.8127E-02 -2.2282E-01
  .0016375  7.4156E+01  1.4259E+00  3.8982E-01 -2.4456E-02 -2.6186E-01
  .0032750  7.8073E+01  1.3810E+00  3.5027E-01 -6.8622E-02 -3.0045E-01
  .0065500  8.1235E+01  1.3394E+00  3.1931E-01 -1.0781E-01 -3.3043E-01
  .0131000  8.3418E+01  1.2994E+00  2.9989E-01 -1.4208E-01 -3.4865E-01
  .0262001  8.4852E+01  1.2524E+00  2.9060E-01 -1.7747E-01 -3.5595E-01
  .0524001  8.5895E+01  1.1864E+00  2.8852E-01 -2.2129E-01 -3.5380E-01
  .1048003  8.6834E+01  1.0923E+00  2.8995E-01 -2.7563E-01 -3.4265E-01
  .2096005  8.7821E+01  9.6975E-01  2.8887E-01 -3.3005E-01 -3.2200E-01
  .4192010  8.8904E+01  8.2689E-01  2.7679E-01 -3.6208E-01 -2.9057E-01
  .8384020  9.0098E+01  6.7019E-01  2.4634E-01 -3.5148E-01 -2.4672E-01
 1.6768041  9.1393E+01  5.0432E-01  1.9727E-01 -2.9608E-01 -1.9160E-01
 3.3536082  9.2665E+01  3.4409E-01  1.3958E-01 -2.1542E-01 -1.3318E-01
 6.7072164  9.3721E+01  2.1207E-01  8.7731E-02 -1.3743E-01 -8.2904E-02

STANDARD DEVIATION OF REGRESSION COEFFICIENT MATRIX
  .0000000  7.0071E+01  7.4477E-01  7.2379E-01  7.5471E-01  7.0905E-01
  .0000128  6.9523E+01  7.3936E-01  7.1815E-01  7.4912E-01  7.0352E-01
  .0000256  6.8985E+01  7.3403E-01  7.1260E-01  7.4362E-01  6.9807E-01
  .0000512  6.7932E+01  7.2363E-01  7.0175E-01  7.3288E-01  6.8743E-01
  .0001023  6.5920E+01  7.0380E-01  6.8102E-01  7.1240E-01  6.6710E-01
  .0002047  6.2236E+01  6.6761E-01  6.4308E-01  6.7497E-01  6.2987E-01
  .0004094  5.5985E+01  6.0663E-01  5.7871E-01  6.1181E-01  5.6671E-01
  .0008188  4.6632E+01  5.1677E-01  4.8247E-01  5.1840E-01  4.7223E-01
  .0016375  3.4973E+01  4.0849E-01  3.6266E-01  4.0496E-01  3.5454E-01
  .0032750  2.3343E+01  3.0841E-01  2.4359E-01  2.9846E-01  2.3735E-01
  .0065500  1.4064E+01  2.4006E-01  1.4958E-01  2.2397E-01  1.4432E-01
  .0131000  7.9198E+00  2.0372E-01  8.9144E-02  1.8401E-01  8.3676E-02
  .0262001  4.3810E+00  1.8353E-01  5.6809E-02  1.6322E-01  5.0293E-02
  .0524001  2.5667E+00  1.6678E-01  4.2069E-02  1.4784E-01  3.4898E-02
  .1048003  1.7586E+00  1.4985E-01  3.5914E-02  1.3334E-01  2.9277E-02
  .2096005  1.5032E+00  1.3570E-01  3.4174E-02  1.2194E-01  2.8736E-02
  .4192010  1.5984E+00  1.2913E-01  3.6113E-02  1.1781E-01  3.1771E-02
  .8384020  1.9894E+00  1.3011E-01  4.0671E-02  1.2044E-01  3.6973E-02
 1.6768041  2.6079E+00  1.2822E-01  4.3501E-02  1.1950E-01  4.0176E-02
 3.3536082  3.3132E+00  1.1131E-01  3.9675E-02  1.0363E-01  3.6847E-02
 6.7072164  3.9528E+00  8.1483E-02  2.9858E-02  7.5547E-02  2.7768E-02

STANDARDIZED REGRESSION COEFFICIENT MATRIX
  .0000000   .606512   .527706   .043390  -.160287
  .0000128   .605765   .525763   .042564  -.162334
  .0000256   .605029   .523851   .041750  -.164349
  .0000512   .603591   .520115   .040160  -.168286
  .0001023   .600840   .512975   .037119  -.175808
  .0002047   .595789   .499898   .031539  -.189586
  .0004094   .587174   .477703   .022037  -.212966
  .0008188   .574152   .444508   .007718  -.247921
  .0016375   .557543   .403222  -.010413  -.291358
  .0032750   .540003   .362316  -.029217  -.334289
  .0065500   .523725   .330283  -.045902  -.367645
  .0131000   .508088   .310202  -.060495  -.387916
  .0262001   .489701   .300587  -.075560  -.396044
  .0524001   .463922   .298443  -.094217  -.393656
  .1048003   .427112   .299915  -.117353  -.381247
  .2096005   .379193   .298800  -.140523  -.358270
  .4192010   .323330   .286302  -.154163  -.323304
  .8384020   .262059   .254804  -.149647  -.274505
 1.6768041   .197197   .204052  -.126062  -.213182
 3.3536082   .134546   .144376  -.091719  -.148176
 6.7072164   .082924   .090746  -.058513  -.092242

VARIANCE INFLATION FACTORS
  .0000000  38.496211  254.423166  46.868386  282.512865
  .0000128  37.938585  250.473088  46.176831  278.120520
  .0000256  37.393875  246.614568  45.501299  273.829984
  .0000512  36.341640  239.161075  44.196355  265.541972
  .0001023  34.375720  225.236121  41.758324  250.057947
  .0002047  30.927145  200.811526  37.481696  222.898766
  .0004094  25.525278  162.560667  30.783164  180.365509
  .0008188  18.504490  112.871755  22.078394  125.114226
  .0016375  11.539099   63.647818  13.445981   70.381627
  .0032750   6.557953   28.628504   7.281871   31.447440
  .0065500   3.958180   10.753198   4.084923   11.583019
  .0131000   2.832014    3.794794   2.739557    3.868651
  .0262001   2.260122    1.515310   2.119341    1.374109
  .0524001   1.773618     .789683   1.652321     .628761
  .1048003   1.248016     .501633   1.171483     .385692
  .2096005    .757280     .336076    .724981     .274960
  .4192010    .406956     .222724    .401572     .199454
  .8384020    .204809     .140049    .208076     .133912
 1.6768041    .096228     .077508    .099089     .076493
 3.3536082    .039815     .035397    .040914     .035326
 6.7072164    .014077     .013227    .014347     .013236

STANDARD ERROR OF THE ESTIMATE, NORMALIZED MEAN SQUARE ERROR (MIN AND MAX),
ESTIMATE KHKB AND NORMS MGL AND MGD
  .0000000  2.4460E+00   1.3710   1.3710  .0130763  .0000  .0000
  .0000128  2.4460E+00   1.3499   1.3500  .0131223  .0017  .0000
  .0000256  2.4460E+00   1.3295   1.3299  .0131676  .0034  .0000
  .0000512  2.4460E+00   1.2906   1.2922  .0132559  .0067  .0001
  .0001023  2.4461E+00   1.2194   1.2256  .0134242  .0127  .0004
  .0002047  2.4462E+00   1.0987   1.1224  .0137304  .0234  .0014
  .0004094  2.4467E+00    .9176   1.0031  .0142404  .0402  .0045
  .0008188  2.4479E+00    .6866    .9555  .0149673  .0621  .0123
  .0016375  2.4504E+00    .4416   1.1159  .0157800  .0842  .0277
  .0032750  2.4541E+00    .2337   1.5173  .0164567  .1010  .0491
  .0065500  2.4588E+00    .1023   2.0077  .0169255  .1118  .0706
  .0131000  2.4667E+00    .0422   2.4107  .0173292  .1206  .0872
  .0262001  2.4877E+00    .0230   2.6621  .0178848  .1322  .0984
  .0524001  2.5519E+00    .0238   2.7798  .0188241  .1501  .1067
  .1048003  2.7334E+00    .0343   2.8021  .0204225  .1768  .1158
  .2096005  3.1776E+00    .0487   2.7659  .0231186  .2135  .1292
  .4192010  4.1248E+00    .0770   2.7140  .0279927  .2619  .1489
  .8384020  5.8583E+00    .1396   2.7133  .0380907  .3228  .1776
 1.6768041  8.4228E+00    .2055   2.8402  .0621179  .3881  .2216
 3.3536082  1.1367E+01    .1967   3.1010  .1271316  .4410  .2831
 6.7072164  1.3994E+01    .1200   3.4023  .3258504  .4718  .3502

MATRIX OF ESTIMATED Y
     K=     .0000000   .0000128   ---   .0065500   .0131000   .0262001   ---   .4192010
 Y
 7.850E+01  7.850E+01  7.849E+01  ---  7.844E+01  7.854E+01  7.875E+01  ---  8.228E+01
 7.430E+01  7.279E+01  7.279E+01  ---  7.303E+01  7.315E+01  7.336E+01  ---  7.722E+01
 1.043E+02  1.060E+02  1.060E+02  ---  1.064E+02  1.064E+02  1.064E+02  ---  1.048E+02
 8.760E+01  8.933E+01  8.933E+01  ---  8.947E+01  8.949E+01  8.949E+01  ---  9.003E+01
 9.590E+01  9.565E+01  9.565E+01  ---  9.566E+01  9.575E+01  9.592E+01  ---  9.732E+01
 1.092E+02  1.053E+02  1.053E+02  ---  1.053E+02  1.053E+02  1.052E+02  ---  1.036E+02
 1.027E+02  1.041E+02  1.041E+02  ---  1.041E+02  1.041E+02  1.041E+02  ---  1.031E+02
 7.250E+01  7.567E+01  7.567E+01  ---  7.556E+01  7.555E+01  7.555E+01  ---  7.756E+01
 9.310E+01  9.172E+01  9.172E+01  ---  9.195E+01  9.198E+01  9.202E+01  ---  9.259E+01
 1.159E+02  1.156E+02  1.156E+02  ---  1.153E+02  1.152E+02  1.148E+02  ---  1.103E+02
 8.380E+01  8.181E+01  8.181E+01  ---  8.163E+01  8.159E+01  8.154E+01  ---  8.259E+01
 1.133E+02  1.123E+02  1.123E+02  ---  1.121E+02  1.120E+02  1.119E+02  ---  1.095E+02
 1.094E+02  1.117E+02  1.117E+02  ---  1.115E+02  1.115E+02  1.114E+02  ---  1.096E+02

MATRIX OF DEVIATIONS
     K=     .0000000   .0000128   ---   .0065500   .0131000   .0262001   ---   .4192010
 Y
 7.850E+01  4.760E-03  6.268E-03  ---  5.979E-02 -3.991E-02 -2.525E-01  --- -3.781E+00
 7.430E+01  1.511E+00  1.510E+00  ---  1.265E+00  1.146E+00  9.395E-01  --- -2.916E+00
 1.043E+02 -1.671E+00 -1.675E+00  --- -2.078E+00 -2.096E+00 -2.063E+00  --- -4.912E-01
 8.760E+01 -1.727E+00 -1.729E+00  --- -1.874E+00 -1.885E+00 -1.887E+00  --- -2.426E+00
 9.590E+01  2.508E-01  2.514E-01  ---  2.363E-01  1.494E-01 -1.873E-02  --- -1.423E+00
 1.092E+02  3.925E+00  3.925E+00  ---  3.910E+00  3.943E+00  4.017E+00  ---  5.629E+00
 1.027E+02 -1.449E+00 -1.448E+00  --- -1.409E+00 -1.401E+00 -1.389E+00  --- -4.372E-01
 7.250E+01 -3.175E+00 -3.174E+00  --- -3.062E+00 -3.048E+00 -3.047E+00  --- -5.060E+00
 9.310E+01  1.378E+00  1.376E+00  ---  1.154E+00  1.117E+00  1.076E+00  ---  5.063E-01
 1.159E+02  2.815E-01  2.828E-01  ---  5.529E-01  7.327E-01  1.055E+00  ---  5.626E+00
 8.380E+01  1.991E+00  1.992E+00  ---  2.167E+00  2.209E+00  2.256E+00  ---  1.206E+00
 1.133E+02  9.730E-01  9.748E-01  ---  1.193E+00  1.258E+00  1.361E+00  ---  3.778E+00
 1.094E+02 -2.294E+00 -2.293E+00  --- -2.114E+00 -2.084E+00 -2.045E+00  --- -2.104E-01
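As an independent check on the tables of this appendix, note that the standardized ridge coefficients satisfy (R + kI)b = r, with R the predictor correlation matrix and r the vector of correlations of the predictors with x5, both listed in Appendix 3. The fragment below (our own sketch, not extracted from RIDGE) solves this system by Gauss elimination; no pivoting is needed because R + kI is positive definite. With RK = .0131 it returns approximately .5081, .3102, -.0605, -.3879, matching the row for k = .0131000 of the standardized regression coefficient matrix to within rounding of the printed correlations.

      PROGRAM RCHECK
C     Check of the standardized ridge coefficients for the Hald data
C     (our own sketch).  A is the predictor correlation matrix, R the
C     correlations of the predictors with x5, both from Appendix 3;
C     (A + RK*I)B = R is solved for B by Gauss elimination.
      REAL A(4,4), R(4), B(4), RK, F
      INTEGER I, J, L
      DATA A /1.0000,  .2286, -.8241, -.2454,
     *         .2286, 1.0000, -.1392, -.9730,
     *        -.8241, -.1392, 1.0000,  .0295,
     *        -.2454, -.9730,  .0295, 1.0000/
      DATA R / .7307,  .8163, -.5347, -.8213/
      RK = .0131
      DO 10 I = 1, 4
      A(I,I) = A(I,I) + RK
      B(I) = R(I)
   10 CONTINUE
C     forward elimination
      DO 30 L = 1, 3
      DO 30 I = L+1, 4
      F = A(I,L)/A(L,L)
      DO 20 J = L, 4
      A(I,J) = A(I,J) - F*A(L,J)
   20 CONTINUE
      B(I) = B(I) - F*B(L)
   30 CONTINUE
C     back substitution
      DO 50 I = 4, 1, -1
      DO 40 J = I+1, 4
      B(I) = B(I) - A(I,J)*B(J)
   40 CONTINUE
      B(I) = B(I)/A(I,I)
   50 CONTINUE
      WRITE (*,'(1X,A,4F10.4)') 'STANDARDIZED B(K): ', B
      END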