Sankhyā: The Indian Journal of Statistics, 1999, Volume 61, Series B, Pt. 3, pp. 443–459
A NEW BIASED ESTIMATOR IN LINEAR REGRESSION AND A DETAILED ANALYSIS OF THE WIDELY-ANALYSED DATASET ON PORTLAND CEMENT∗

By SELAHATTİN KAÇIRANLAR, SADULLAH SAKALLIOĞLU, FİKRİ AKDENİZ
University of Çukurova, Adana, Turkey

GEORGE P. H. STYAN
McGill University, Montréal, Québec, Canada

and

HANS JOACHIM WERNER
University of Bonn, Bonn, Germany

SUMMARY. This paper deals with the standard multiple linear regression model $(y, X\beta, \sigma^2 I)$, where the model matrix $X$ is assumed to be of full column rank. We introduce a new biased estimator for $\beta$ and discuss its properties in some detail. In particular, we show that our new estimator is superior, in the scalar mean-squared error sense, to both the usual restricted least-squares estimator and to the new biased estimator introduced by Liu (1993). We illustrate our findings with a numerical example based on the widely-analysed dataset on Portland cement, cf. Woods, Steinour and Starke (1932), Hald (1952, pp. 635–652), Piepel and Redgate (1998).

Paper received March 1996; revised August 1999.
AMS (1991) subject classification. 62J05, 62J07.
Key words and phrases. Anti-quirk; biased estimator; ill-conditioning; least squares estimator; Liu estimator; mean-squared error; multicollinearity; Portland cement data; quirk; restricted least squares; ridge regression estimator.
∗ The research of the fourth author was supported in part by the Natural Sciences and Engineering Research Council of Canada. Financial support of the fifth author by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 303 at the University of Bonn, is gratefully acknowledged.
1. Introduction
In much econometric work, especially with time series data, there is often high but not exact multicollinearity. It is well known that one effect to expect from near multicollinearity is that of very large sampling variances of the ordinary least squares (OLS) estimates; the distance from $\hat\beta$ to $\beta$ will then tend to be large. Although the OLS estimator $\hat\beta$ is still the best linear unbiased estimator (BLUE) for $\beta$, with near multicollinearity this property is of little comfort.
To overcome this, different remedial actions have been proposed. A popular numerical technique for dealing with near multicollinearity is ridge regression. The (ordinary) ridge regression (ORR) estimator, however, is biased, but the variances of its elements are less than the variances of the corresponding elements of the OLS estimator. By accepting some bias to reduce variances, the matrix mean-squared error (MSE) or the weaker scalar mean-squared error (mse) might thus be improved.

A different approach to improving the MSE (or the mse) involves the suggestion that one or more of the explanatory variables be dropped to improve the MSE (or the mse) of the remaining coefficients. Setting a coefficient or a group of coefficients equal to zero is, obviously, a special case of imposing a set of linear restrictions on the unknown regression coefficients. By adding linear restrictions to a model, the resulting restricted least squares (RLS) estimator might again be better in some mean-squared error sense than the unrestricted OLS estimator, even though the restrictions may not, in fact, be valid.

The hope that the combination of two different estimators might inherit the advantages of both surely motivated both Sarkar (1992) and Liu (1993) to study new biased estimators. By grafting the ridge regression philosophy onto the RLS estimator, Sarkar (1992) was led to a biased estimator for $\beta$, which we will call the restricted ridge regression (RRR) estimator. To combat near multicollinearity, Liu (1993) combined the Stein (1956) estimator with the ORR estimator to obtain what we will call the Liu estimator.

Our primary aim in this paper is to introduce another new estimator, defined by combining in a special way the two approaches followed in obtaining the ORR estimator and the Liu estimator. We call our new biased estimator the restricted Liu (RL) estimator and discuss its properties in some detail. In particular, we show that this RL estimator is superior, in the (scalar) mse sense, to both the RLS estimator and to the Liu estimator when the restrictions are indeed correct. We also derive conditions for the superiority of the RL estimator over both the RLS and Liu estimators when the restrictions are not correct.

The paper is organised as follows. The statistical model and the various estimators are introduced in Section 2; this section also contains some basic properties which are important later on. Our main results on the restricted Liu (RL) estimator are then given in Section 3, while in Section 4 we illustrate some of these results with a numerical example based on the widely-analysed dataset on Portland cement originally due to Woods, Steinour and Starke (1932), cf. Hald (1952, pp. 635–652).
2. Preliminaries—the Model and Estimators
We consider the standard multiple linear regression model
$$y = X\beta + u, \qquad \ldots (2.1)$$
where $y$ is an $n \times 1$ vector of observations on the response (or dependent) variable, $X$ is an $n \times p$ model matrix of observations on $p$ nonstochastic explanatory regressor (or independent) variables, $\beta$ is a $p \times 1$ vector of unknown parameters associated with the $p$ regressors, and $u$ is an $n \times 1$ vector of disturbances with expectation $E(u) = 0$ and dispersion (variance-covariance) matrix $D(u) = \sigma^2 I_n$. Throughout this paper, we assume that the model matrix $X$ is of full column rank and that the unknown scalar variance $\sigma^2$ is strictly positive.

We note that for any given square-integrable estimator $b^\star$ of $\beta$, its matrix mean-squared error
$$\mathrm{MSE}(b^\star) = E(b^\star - \beta)(b^\star - \beta)' = D(b^\star) + \mathrm{BIAS}(b^\star)\,\mathrm{BIAS}(b^\star)',$$
with $\mathrm{BIAS}(b^\star) := E(b^\star) - \beta$. We recall that the scalar mean-squared error $\mathrm{mse}(b^\star) = \mathrm{tr}\,\mathrm{MSE}(b^\star)$, where $\mathrm{tr}$ denotes trace.

In this paper we often consider a set of $q$ linear restrictions on $\beta$, embodied in
$$R\beta = r, \qquad \ldots (2.2)$$
where the matrix $R$ is $q \times p$ and of full row rank $q < p$, the vector $r$ is $q \times 1$, and both $R$ and $r$ are known. The restricted least squares (RLS) estimator of $\beta$, that is, the estimator $\beta^*$ embodying these constraints, can be written as
$$\beta^* = \hat\beta + S^{-1}R'(RS^{-1}R')^{-1}(r - R\hat\beta), \qquad \ldots (2.3)$$
where $S := X'X$, while $\hat\beta := S^{-1}X'y$ is the usual OLS estimator for $\beta$. Letting $A := S^{-1} - S^{-1}R'(RS^{-1}R')^{-1}RS^{-1}$ and $\delta := r - R\beta$, we obtain
$$D(\beta^*) = \sigma^2 A \qquad \ldots (2.4)$$
and
$$\mathrm{mse}(\beta^*) = \sigma^2\,\mathrm{tr}(A) + \delta'(RS^{-1}R')^{-1}RS^{-2}R'(RS^{-1}R')^{-1}\delta, \qquad \ldots (2.5)$$
respectively, for the dispersion matrix and the (scalar) mse of the RLS estimator $\beta^*$.

The ordinary ridge regression (ORR) estimator is defined as $\hat\beta(k) := (S + kI_p)^{-1}X'y$, where $k$ is an arbitrary positive constant. Putting $W_k := (I_p + kS^{-1})^{-1}$, we may write the ORR estimator as
$$\hat\beta(k) = W_k\hat\beta. \qquad \ldots (2.6)$$
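In passing we note that these estimators are all easy to compute. The following Python/NumPy sketch (our illustration only, not part of the original computations; the function names are hypothetical) shows one way of evaluating the OLS estimator, the RLS estimator (2.3), and the ORR estimator (2.6):

```python
import numpy as np

def ols(X, y):
    # OLS estimator: beta_hat = S^{-1} X'y, with S = X'X
    S = X.T @ X
    return np.linalg.solve(S, X.T @ y)

def rls(X, y, R, r):
    # Restricted least squares estimator, eq. (2.3):
    # beta* = beta_hat + S^{-1} R'(R S^{-1} R')^{-1} (r - R beta_hat)
    S = X.T @ X
    beta_hat = np.linalg.solve(S, X.T @ y)
    Sinv_Rt = np.linalg.solve(S, R.T)                 # S^{-1} R'
    adj = np.linalg.solve(R @ Sinv_Rt, r - R @ beta_hat)
    return beta_hat + Sinv_Rt @ adj

def orr(X, y, k):
    # Ordinary ridge regression estimator, eq. (2.6): beta(k) = (S + k I)^{-1} X'y
    S = X.T @ X
    return np.linalg.solve(S + k * np.eye(S.shape[0]), X.T @ y)
```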
The restricted ridge regression (RRR) estimator $\beta^*(k)$, introduced by Sarkar (1992), is defined as follows: $\beta^*(k) := W_k\beta^*$, where $W_k$ is as above in (2.6). For $k = 0$, obviously $\beta^*(k) = \beta^*(0) = \beta^*$ and $\hat\beta(k) = \hat\beta(0) = \hat\beta$. In view of $E\beta^*(k) = W_k\beta + W_kS^{-1}R'(RS^{-1}R')^{-1}\delta$, it is clear that, unless $k = 0$ and $\delta = 0$, the RRR estimator is always a biased estimator for $\beta$. Furthermore, $D\beta^*(k) = \sigma^2 W_kAW_k'$.

The estimator $\hat\beta_d$, due to Liu (1993), which we will call the Liu estimator, is defined for each parameter $d \in (-\infty, +\infty)$ as follows: $\hat\beta_d := (S + I)^{-1}(X'y + d\hat\beta)$. Letting
$$F_d := (S + I)^{-1}(S + dI), \qquad \ldots (2.7)$$
this Liu estimator $\hat\beta_d$ can also be expressed as $\hat\beta_d = F_d\hat\beta$. Consequently, $E(\hat\beta_d) = F_d\beta$ and $D(\hat\beta_d) = \sigma^2 F_dS^{-1}F_d'$. For $d = 1$, evidently $\hat\beta_d = \hat\beta_1 = \hat\beta$. Since $\mathrm{mse}(\hat\beta_d) = \mathrm{tr}\,D(\hat\beta_d) + \|E(\hat\beta_d) - \beta\|^2$, we further obtain that
$$\mathrm{mse}(\hat\beta_d) = \sigma^2\,\mathrm{tr}(F_dS^{-1}F_d') + (d - 1)^2\beta'(S + I)^{-2}\beta; \qquad \ldots (2.8)$$
here $\|\cdot\|$ indicates the usual Euclidean norm.

We are now ready to introduce a new alternative estimator for $\beta$, for each parameter $d \in (-\infty, +\infty)$, according to
$$\hat\beta_{rd} := F_d\beta^*, \qquad \ldots (2.9)$$
where $F_d$ is defined as above in (2.7). Since our new estimator $\hat\beta_{rd}$ is obtained simply by replacing, in the Liu estimator, the OLS estimator by the RLS estimator, we will call $\hat\beta_{rd}$ the restricted Liu (RL) estimator of $\beta$. From our definition it directly follows that the RL estimator reduces to the RLS estimator $\beta^*$ when $d = 1$. In passing, we mention here that $\hat\beta_{rd}$ can also be obtained as the OLS estimator of $\beta$ in the framework of the following artificially augmented linear model (2.1):
$$\begin{pmatrix} y \\ d\beta^* + Sg \end{pmatrix} = \begin{pmatrix} X \\ I \end{pmatrix}\beta + \begin{pmatrix} u \\ \tilde u \end{pmatrix},$$
where $g := S^{-1}R'(RS^{-1}R')^{-1}(r - R\hat\beta)$, $E(\tilde u) = 0$, $D(\tilde u) = \sigma^2 I$, and $E(u\tilde u') = 0$. So we find $\hat\beta_{rd} = (S + I)^{-1}(X'y + d\beta^* + Sg) = (S + I)^{-1}(S + dI)\beta^*$.

We conclude this section by presenting, for later use, expressions for the expectation, the dispersion, and the mse of $\hat\beta_{rd}$; two special situations are also considered. From the definition (2.9) and the corresponding expressions for the RLS estimator $\beta^*$ we easily see that $E(\hat\beta_{rd}) = F_d\beta + F_dS^{-1}R'(RS^{-1}R')^{-1}\delta$ and $D(\hat\beta_{rd}) = \sigma^2 F_dAF_d'$. The formula for the expectation especially shows that $\hat\beta_{rd}$ is an unbiased estimator of $\beta$ if and only if the vector $\delta = 0$ and the scalar $d = 1$. Letting $\delta_* := R'(RS^{-1}R')^{-1}\delta$, we now further note that the (scalar) mean-squared error
$$\mathrm{mse}(\hat\beta_{rd}) = \sigma^2\,\mathrm{tr}(F_dAF_d') + \|F_d\beta + F_dS^{-1}\delta_* - \beta\|^2. \qquad \ldots (2.10)$$
When the restrictions are valid, that is, whenever $\delta = 0$, then (2.10) becomes
$$\mathrm{mse}(\hat\beta_{rd}) = \sigma^2\,\mathrm{tr}(F_dAF_d') + (d - 1)^2\beta'(S + I)^{-2}\beta. \qquad \ldots (2.11)$$
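To make the definition (2.9) concrete, the following sketch (again our own illustration, with hypothetical names, continuing the NumPy conventions used above) computes the restricted Liu estimator directly:

```python
import numpy as np

def restricted_liu(X, y, R, r, d):
    # Restricted Liu estimator, eq. (2.9): beta_rd = F_d beta*,
    # with F_d = (S + I)^{-1}(S + d I) and beta* the RLS estimator (2.3).
    S = X.T @ X
    p = S.shape[0]
    beta_hat = np.linalg.solve(S, X.T @ y)
    Sinv_Rt = np.linalg.solve(S, R.T)
    beta_star = beta_hat + Sinv_Rt @ np.linalg.solve(R @ Sinv_Rt, r - R @ beta_hat)
    # For d = 1, F_1 = I and beta_rd reduces to beta*, as noted in the text.
    return np.linalg.solve(S + np.eye(p), (S + d * np.eye(p)) @ beta_star)
```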
3. Superiority of the Restricted Liu Estimator
In the following three subsections we compare our restricted Liu (RL) estimator $\hat\beta_{rd}$ with the (unrestricted) Liu estimator $\hat\beta_d$ and the RLS estimator $\beta^*$. In Section 3.1 we show that, when the restrictions are correct, our restricted Liu estimator is, according to the strong matrix MSE criterion, better than the Liu estimator. Section 3.2 deals with the much weaker scalar mse criterion. There we prove that if the restrictions are correct then there exist parameters $d$ for which $\hat\beta_{rd}$ behaves even better, in the mse sense, than the RLS estimator. Section 3.3 is devoted to situations where the restrictions are incorrect. There we derive some sufficient conditions for $\hat\beta_{rd}$ to have smaller mse than $\beta^*$ and $\hat\beta_d$.

3.1. Comparisons between the restricted Liu estimator and the Liu estimator when the restrictions hold. Clearly, when the same parameter $d$ is used in $\hat\beta_d = F_d\hat\beta$ and $\hat\beta_{rd} = F_d\beta^*$, then
$$D(\hat\beta_d) - D(\hat\beta_{rd}) = F_d[D(\hat\beta) - D(\beta^*)]F_d'.$$
Since $D(\hat\beta) - D(\beta^*) = \sigma^2 S^{-1}R'(RS^{-1}R')^{-1}RS^{-1}$ is nonnegative definite (NND), it follows that, for each choice of $d$, the matrix $D(\hat\beta_d) - D(\hat\beta_{rd})$ is also NND. It further follows that $D(\hat\beta_d)$ equals $D(\hat\beta_{rd})$ if and only if $R(S + dI) = 0$; notice that
this happens (if at all) only for a rather particular negative value of $d$. Hence we have the following theorem.

Theorem 3.1.1. Irrespective of the choice of $d$, the dispersion of the restricted Liu estimator $\hat\beta_{rd}$ is always less than or equal to the corresponding dispersion of the Liu estimator $\hat\beta_d$. The dispersions are equal if and only if $R(S + dI) = 0$, which can happen (if at all) only for a rather particular negative value of $d$.

Next, let us assume that the restrictions (2.2) are correct. Notice that the restricted Liu estimator $\hat\beta_{rd}$ then has the same bias as the Liu estimator $\hat\beta_d$. Since $\mathrm{mse}(b^\star) = \mathrm{tr}\,\mathrm{MSE}(b^\star)$, and since the trace of any NND matrix is nonnegative, the following corollary follows at once from Theorem 3.1.1.

Corollary 3.1.1. Let the restrictions (2.2) be correct. Then, irrespective of the choice of $d \geq 0$:
1. The restricted Liu estimator $\hat\beta_{rd}$ is better in the (matrix) MSE sense than the Liu estimator $\hat\beta_d$.
2. The restricted Liu estimator $\hat\beta_{rd}$ is better in the (scalar) mse sense than the Liu estimator $\hat\beta_d$.

We note that the two statements in Corollary 3.1.1 hold even for almost all (if not all) negative values of $d$. For notice that, as a consequence of Theorem 3.1.1, we have $\mathrm{MSE}(\hat\beta_{rd}) = \mathrm{MSE}(\hat\beta_d)$ and so $\mathrm{mse}(\hat\beta_{rd}) = \mathrm{mse}(\hat\beta_d)$ if and only if $d$ is such that $R(S + dI) = 0$.

3.2. Comparisons between the restricted Liu estimator and the restricted least squares estimator when the restrictions hold. Since the matrix $S$ is positive definite, there exist an orthogonal matrix $P$ and a positive definite diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ such that $P'SP = \Lambda$, with $P'P = PP' = I$. Put $B := P'AP$, with $A$ as defined in Section 2. The matrix $A$ is NND and hence so is $B$. The diagonal elements $b_{ii}$ of $B$ are, therefore, all nonnegative. Exploiting some elementary properties of the trace function along with the definition of the matrix $F_d$, it is now not difficult to obtain
$$\mathrm{tr}(F_dAF_d') = \sum_{i=1}^{p}\frac{(\lambda_i + d)^2}{(\lambda_i + 1)^2}\,b_{ii}. \qquad \ldots (3.1)$$
Letting $\alpha := P'\beta$, we further find that
$$\beta'(S + I)^{-2}\beta = \sum_{i=1}^{p}\frac{\alpha_i^2}{(\lambda_i + 1)^2}. \qquad \ldots (3.2)$$
Substituting (3.1) and (3.2) in (2.11) yields
$$\mathrm{mse}(\hat\beta_{rd}) = \sigma^2\sum_{i=1}^{p}\frac{(\lambda_i + d)^2}{(\lambda_i + 1)^2}\,b_{ii} + (d - 1)^2\sum_{i=1}^{p}\frac{\alpha_i^2}{(\lambda_i + 1)^2} =: f(d), \qquad \ldots (3.3)$$
say; it is interesting to observe that, for each given $\beta$ and $\sigma^2$, this function $f(d) = \mathrm{mse}(\hat\beta_{rd})$ is convex in the argument $d$. For later use we mention here that similarly we obtain
$$\mathrm{mse}(\hat\beta_d) = \sigma^2\sum_{i=1}^{p}\frac{(\lambda_i + d)^2}{\lambda_i(\lambda_i + 1)^2} + (d - 1)^2\sum_{i=1}^{p}\frac{\alpha_i^2}{(\lambda_i + 1)^2} \qquad \ldots (3.4)$$
from (2.8). Differentiating (3.3) with respect to $d$ gives
$$f'(d) = 2\sigma^2\sum_{i=1}^{p}\frac{(\lambda_i + d)b_{ii}}{(\lambda_i + 1)^2} + 2\sum_{i=1}^{p}\frac{(d - 1)\alpha_i^2}{(\lambda_i + 1)^2} = 2\sum_{i=1}^{p}\frac{(\lambda_i + d)b_{ii}\sigma^2 + (d - 1)\alpha_i^2}{(\lambda_i + 1)^2}. \qquad \ldots (3.5)$$

We now recall that in this section the restrictions (2.2) are required to hold. Since then $\delta = 0$, we see that (2.5) now reduces to $\mathrm{mse}(\beta^*) = \sigma^2\,\mathrm{tr}(A)$. We next observe that $A = PBP'$ since $B = P'AP$ and $PP' = I$, and that for any pair of matrices $M$ and $N$ with $MN$ square, we have $\mathrm{tr}(MN) = \mathrm{tr}(NM)$. Hence
$$\mathrm{mse}(\beta^*) = \sigma^2\,\mathrm{tr}(A) = \sigma^2\,\mathrm{tr}(PBP') = \sigma^2\,\mathrm{tr}(B) = \sigma^2\sum_{i=1}^{p}b_{ii} = f(1). \qquad \ldots (3.6)$$
At the point $d = 1$, the first derivative of $f$ becomes, cf. (3.5),
$$f'(1) = 2\sigma^2\sum_{i=1}^{p}\frac{b_{ii}}{\lambda_i + 1}.$$
Since $\lambda_i > 0$ $(i = 1, 2, \ldots, p)$ and $\sigma^2 > 0$, clearly $f'(1) > 0$ whenever $B \neq 0$. But $B \neq 0$ if and only if $q < p$. Since $q < p$ is one of our model assumptions, we thus indeed have $f'(1) > 0$, which in turn implies that for each $d < 1$ sufficiently close to 1 we further have $f(d) < f(1)$ or, equivalently, $\mathrm{mse}(\hat\beta_{rd}) < \mathrm{mse}(\beta^*)$.

Introducing $L_\tau(f) := \{d : f(d) \leq \tau\}$ for each $\tau$, we can be even more precise. Since $f$ is a strictly convex function, $L_\tau(f)$ is convex for each $\tau$. The equation $f(d) = f(1)$, therefore, has at most two different solutions, for notice that $f$ is a second-degree polynomial in $d$. Not surprisingly, we see that two different solutions always exist, viz.,
$$d_l = 1 - 2\sum_{i=1}^{p}\frac{\sigma^2 b_{ii}}{\lambda_i + 1}\Bigg/\sum_{i=1}^{p}\frac{\sigma^2 b_{ii} + \alpha_i^2}{(\lambda_i + 1)^2} \quad\text{and}\quad d_u = 1. \qquad \ldots (3.7)$$
Evidently, $d_l < d_u = 1$. With all these observations in mind, we have the following theorem:
Theorem 3.2.1. Let the restrictions (2.2) hold. Moreover, let $\sigma^2 > 0$, $q < p$, and let $d_l$ and $d_u$ be as in (3.7). The restricted Liu estimator $\hat\beta_{rd}$ is then better, in the (scalar) mse sense, than the RLS estimator $\beta^*$ if and only if $d$ is such that $d_l < d < 1$.

We now wish to solve the following programming problem:
$$\min_d f(d), \qquad \ldots (3.8)$$
where as before $f(d) = \mathrm{mse}(\hat\beta_{rd})$. Since (3.8) is a convex optimization problem, each local minimum is a global minimum. The strict convexity of the function $f$ along with the existence of $d_l$ and $d_u$ thus guarantees that (3.8) has a unique solution, say $d_{\mathrm{opt}}$. From classical optimization theory it now follows that $d_{\mathrm{opt}}$ is the unique solution to the so-called first-order condition $f'(d) = 0$. Solving this equation we find that
$$d_{\mathrm{opt}} = 1 - \sigma^2\sum_{i=1}^{p}\frac{b_{ii}}{\lambda_i + 1}\Bigg/\sum_{i=1}^{p}\frac{b_{ii}\sigma^2 + \alpha_i^2}{(\lambda_i + 1)^2}. \qquad \ldots (3.9)$$
In view of Theorem 3.2.1 it is now clear that, irrespective of the true but unknown values of the parameters $\beta$ and $\sigma^2$, the restricted Liu estimator $\hat\beta_{rd}$ at the point $d = d_{\mathrm{opt}}$ is better, in the (scalar) mse sense, than the RLS estimator $\beta^*$. In other words, for given $\beta$ and $\sigma^2$, choosing $d = d_{\mathrm{opt}}$ leads to the least possible mse-value of $\hat\beta_{rd}$. Unfortunately, $d_{\mathrm{opt}}$ depends on the unknown model parameters and so, for practical purposes, we have to replace these unknown parameters by some suitable estimates. If, in (3.9), we now substitute the OLS estimators for $\alpha_i$ and $\sigma^2$, then we obtain
$$\hat d_{\mathrm{OLS}} = 1 - \hat\sigma^2\sum_{i=1}^{p}\frac{b_{ii}}{\lambda_i + 1}\Bigg/\sum_{i=1}^{p}\frac{b_{ii}\hat\sigma^2 + \hat\alpha_i^2}{(\lambda_i + 1)^2} \qquad \ldots (3.10)$$
as an operational estimator for $d_{\mathrm{opt}}$. We call this estimator the OLS-based minimum mse estimator for $d_{\mathrm{opt}}$. Since in this section, however, the restrictions (2.2) are assumed to hold, it should be expected that a better estimator for $d_{\mathrm{opt}}$ might be obtained by replacing $\alpha_i$ and $\sigma^2$ in $d_{\mathrm{opt}}$ not by their OLS estimates $\hat\alpha_i$ and $\hat\sigma^2$ but by their RLS estimates $\tilde\alpha_i$ and $\tilde\sigma^2$, respectively. When doing this, we obtain
$$\hat d_{\mathrm{RLS}} = 1 - \tilde\sigma^2\sum_{i=1}^{p}\frac{b_{ii}}{\lambda_i + 1}\Bigg/\sum_{i=1}^{p}\frac{b_{ii}\tilde\sigma^2 + \tilde\alpha_i^2}{(\lambda_i + 1)^2} \qquad \ldots (3.11)$$
as an alternative estimator for $d_{\mathrm{opt}}$; we will call this estimator the RLS-based minimum mse estimator for $d_{\mathrm{opt}}$.
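The quantities of this subsection are straightforward to evaluate numerically. The sketch below (our illustration, with hypothetical names) computes $f(d)$ of (3.3), the endpoints of (3.7), and $d_{\mathrm{opt}}$ of (3.9) from the spectral data $\lambda_i$, $b_{ii}$, $\alpha_i$ and a value of $\sigma^2$; the operational estimators (3.10) and (3.11) are obtained by passing OLS or RLS estimates in place of the unknown parameters.

```python
import numpy as np

def f_mse_rl(d, lam, b, alpha, sigma2):
    # f(d) = mse(beta_rd) when the restrictions hold, eq. (3.3)
    return (sigma2 * np.sum((lam + d)**2 / (lam + 1.0)**2 * b)
            + (d - 1.0)**2 * np.sum(alpha**2 / (lam + 1.0)**2))

def d_endpoints(lam, b, alpha, sigma2):
    # d_l and d_u = 1 of eq. (3.7): the two roots of f(d) = f(1)
    den = np.sum((sigma2 * b + alpha**2) / (lam + 1.0)**2)
    d_l = 1.0 - 2.0 * sigma2 * np.sum(b / (lam + 1.0)) / den
    return d_l, 1.0

def d_opt(lam, b, alpha, sigma2):
    # Unique minimiser of f, eq. (3.9); it lies midway between d_l and d_u = 1
    den = np.sum((sigma2 * b + alpha**2) / (lam + 1.0)**2)
    return 1.0 - sigma2 * np.sum(b / (lam + 1.0)) / den
```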
3.3. Comparisons between the restricted Liu estimator and the restricted least squares estimator and the Liu estimator when the restrictions do not hold. In this subsection we assume that the restrictions (2.2) do not hold. We first recall that now the estimators $\hat\beta_{rd}$ and $\beta^*$ are both biased. Using $F_d - I = (d - 1)(S + I)^{-1}$, we may rewrite (2.10) in the form
$$\mathrm{mse}(\hat\beta_{rd}) = \sigma^2\,\mathrm{tr}(F_dAF_d') + \|(d - 1)(S + I)^{-1}\beta + F_dS^{-1}\delta_*\|^2$$
$$= \sum_{i=1}^{p}\frac{1}{(\lambda_i + 1)^2}\left[\left(\sigma^2 b_{ii} + \frac{\tilde\delta_i^2}{\lambda_i^2}\right)(\lambda_i + d)^2 + (d - 1)^2\alpha_i^2 + \frac{2(d - 1)\alpha_i\tilde\delta_i(\lambda_i + d)}{\lambda_i}\right], \qquad \ldots (3.12)$$
where $\tilde\delta := P'\delta_*$. We see from (3.12) that $\mathrm{mse}(\hat\beta_{rd})$ can again be considered as a function, $h(d)$ say, of $d$, and note that $h(d)$ is strictly convex in $d$, being a second-degree polynomial in $d$. From (2.5) we likewise obtain
$$\mathrm{mse}(\beta^*) = \sum_{i=1}^{p}\left(\sigma^2 b_{ii} + \frac{\tilde\delta_i^2}{\lambda_i^2}\right).$$
By differentiating the function $h$ with respect to $d$, we find that
$$h'(d) = 2\sum_{i=1}^{p}\frac{1}{(\lambda_i + 1)^2}\left[\sigma^2 b_{ii}(\lambda_i + d) + (d - 1)\alpha_i^2 + \frac{\alpha_i\tilde\delta_i(\lambda_i + d) + (d - 1)\alpha_i\tilde\delta_i}{\lambda_i} + \frac{\tilde\delta_i^2(\lambda_i + d)}{\lambda_i^2}\right],$$
and so
$$h'(1) = \sum_{i=1}^{p}\frac{2}{\lambda_i + 1}\left(\sigma^2 b_{ii} + \frac{\alpha_i\tilde\delta_i}{\lambda_i} + \frac{\tilde\delta_i^2}{\lambda_i^2}\right).$$
Moreover,
$$h(1) = \sum_{i=1}^{p}\left(\sigma^2 b_{ii} + \frac{\tilde\delta_i^2}{\lambda_i^2}\right) = \mathrm{mse}(\beta^*).$$
We now notice that a sufficient condition for $h'(1) > 0$ to hold is that for each $i = 1, 2, \ldots, p$ we have $1 + \alpha_i\lambda_i\tilde\delta_i^\dagger \geq 0$, where $(\cdot)^\dagger$ denotes the Moore–Penrose inverse. But if $h'(1) > 0$ then again for any $d < 1$ sufficiently close to 1 we have $h(d) < h(1)$ or, equivalently, $\mathrm{mse}(\hat\beta_{rd}) < \mathrm{mse}(\beta^*)$. An analogous result holds when $h'(1) < 0$. This leads us to the following theorem:

Theorem 3.3.1. Let the restrictions (2.2) be incorrect. Then we have:
1. If $h'(1) > 0$ or, equivalently, if
$$\mathrm{tr}\{(S + I)^{-1}S^{-1}\delta_*(\beta + S^{-1}\delta_*)'\} > -\sigma^2\,\mathrm{tr}\{(S + I)^{-1}A\},$$
then there exists a parameter $d < 1$, sufficiently close to 1, for which the corresponding restricted Liu estimator $\hat\beta_{rd}$ is better, in the (scalar) mse sense, than the restricted least squares estimator $\beta^*$. A sufficient condition for $h'(1)$ to be positive is that $1 + \alpha_i\lambda_i\tilde\delta_i^\dagger \geq 0$ holds for each $i = 1, 2, \ldots, p$.
2. If $h'(1) < 0$ or, equivalently, if
$$\mathrm{tr}\{(S + I)^{-1}S^{-1}\delta_*(\beta + S^{-1}\delta_*)'\} < -\sigma^2\,\mathrm{tr}\{(S + I)^{-1}A\},$$
then there exists a parameter $d > 1$, sufficiently close to 1, for which the corresponding restricted Liu estimator $\hat\beta_{rd}$ is better, in the (scalar) mse sense, than the restricted least squares estimator $\beta^*$.
3. If $h'(1) = 0$ or, equivalently, if
$$\mathrm{tr}\{(S + I)^{-1}S^{-1}\delta_*(\beta + S^{-1}\delta_*)'\} = -\sigma^2\,\mathrm{tr}\{(S + I)^{-1}A\},$$
then the restricted least squares estimator $\beta^*$ cannot be improved, in the (scalar) mse sense, by a suitably chosen restricted Liu estimator $\hat\beta_{rd}$.

It is straightforward to prove that $\mathrm{mse}(\hat\beta_{rd}) < \mathrm{mse}(\beta^*)$ holds if and only if $d$ lies in the open interval whose endpoints are
$$d_1 := 1 \quad\text{and}\quad d_2 := 1 - \frac{2\,\mathrm{tr}\{(S + I)^{-1}(\sigma^2 A + S^{-1}\delta_*\beta' + S^{-2}\delta_*\delta_*')\}}{\mathrm{tr}\{(S + I)^{-2}(\sigma^2 A + \beta\beta' + 2S^{-1}\delta_*\beta' + S^{-2}\delta_*\delta_*')\}};$$
compare the corresponding part in Section 3.2. It is interesting that, depending on the values of $\beta$ and $\sigma^2$, the endpoint $d_2$ can here be greater than, equal to, or less than $d_1 = 1$.

We conclude this section by comparing, with respect to the (scalar) mse criterion, the restricted Liu estimator $\hat\beta_{rd}$ with the Liu estimator $\hat\beta_d$. Applying (3.4) and (3.12), and writing $c_{ii} := 1/\lambda_i - b_{ii}$, we easily obtain
$$\mathrm{mse}(\hat\beta_d) - \mathrm{mse}(\hat\beta_{rd}) = \sum_{i=1}^{p}\frac{1}{(\lambda_i + 1)^2}\left[\frac{\sigma^2(\lambda_i + d)^2}{\lambda_i} - \frac{2(d - 1)\alpha_i\tilde\delta_i(\lambda_i + d)}{\lambda_i} - \sigma^2 b_{ii}(\lambda_i + d)^2 - \frac{\tilde\delta_i^2(\lambda_i + d)^2}{\lambda_i^2}\right]$$
$$= \sum_{i=1}^{p}\frac{\lambda_i + d}{(\lambda_i + 1)^2}\left[\sigma^2(\lambda_i + d)c_{ii} - \frac{2(d - 1)\alpha_i\tilde\delta_i}{\lambda_i} - \frac{\tilde\delta_i^2(\lambda_i + d)}{\lambda_i^2}\right]$$
$$= \sum_{i=1}^{p}\frac{\lambda_i + d}{(\lambda_i + 1)^2}\left[\left(\sigma^2 c_{ii} - \frac{2\alpha_i\tilde\delta_i}{\lambda_i} - \frac{\tilde\delta_i^2}{\lambda_i^2}\right)\left(d - \frac{\tilde\delta_i^2/\lambda_i - 2\alpha_i\tilde\delta_i/\lambda_i - \sigma^2 c_{ii}\lambda_i}{\sigma^2 c_{ii} - 2\alpha_i\tilde\delta_i/\lambda_i - \tilde\delta_i^2/\lambda_i^2}\right)\right]. \qquad \ldots (3.13)$$
Let
$$\bar d := \max_i\left(\frac{\tilde\delta_i^2}{\lambda_i} - \frac{2\alpha_i\tilde\delta_i}{\lambda_i} - \sigma^2 c_{ii}\lambda_i\right)\Bigg/\min_i\left(\sigma^2 c_{ii} - \frac{2\alpha_i\tilde\delta_i}{\lambda_i} - \frac{\tilde\delta_i^2}{\lambda_i^2}\right), \qquad \ldots (3.14)$$
and let $\bar d_{\mathrm{den}}$ denote the denominator in this definition of $\bar d$. Using (3.13) then leads to the following theorem:

Theorem 3.3.2. Suppose that the restrictions (2.2) do not hold. Moreover, let $\bar d$ be defined according to (3.14) and let $\bar d_{\mathrm{den}}$ denote the denominator in this definition of $\bar d$. Suppose further that $\bar d > 0$. Then:
1. If $\bar d_{\mathrm{den}} > 0$, it follows that for each positive $d$ with $d > \bar d$ the restricted Liu estimator $\hat\beta_{rd}$ is better, in the (scalar) mse sense, than the Liu estimator $\hat\beta_d$.
2. If $\bar d_{\mathrm{den}} < 0$, it follows that for each positive $d$ with $d < \bar d$ the restricted Liu estimator $\hat\beta_{rd}$ is better, in the (scalar) mse sense, than the Liu estimator $\hat\beta_d$.
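The interval endpoints above can likewise be evaluated directly in matrix form. The following sketch (our own illustration, under the notation of this section; the function name is hypothetical) computes the endpoint $d_2$ displayed after Theorem 3.3.1:

```python
import numpy as np

def d2_endpoint(S, A, delta_star, beta, sigma2):
    # d2 of Section 3.3: mse(beta_rd) < mse(beta*) iff d lies strictly
    # between d1 = 1 and d2 (in whichever order the two endpoints fall).
    p = S.shape[0]
    M1 = np.linalg.inv(S + np.eye(p))        # (S + I)^{-1}
    Sinv = np.linalg.inv(S)
    C = np.outer(delta_star, beta)           # delta* beta'
    DD = np.outer(delta_star, delta_star)    # delta* delta*'
    num = np.trace(M1 @ (sigma2 * A + Sinv @ C + Sinv @ Sinv @ DD))
    den = np.trace(M1 @ M1 @ (sigma2 * A + np.outer(beta, beta)
                              + 2.0 * Sinv @ C + Sinv @ Sinv @ DD))
    return 1.0 - 2.0 * num / den
```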
4. Numerical Example
To illustrate our theoretical results we now consider in this last section a dataset on Portland cement originally due to Woods, Steinour and Starke (1932), which has since then been widely analysed, cf., e.g., Hald (1952, pp. 635–652), Hamaker (1962), Gorman and Toman (1966, pp. 35–36), Daniel and Wood (1980, pp. 89–91, 106–107), and Nomura (1988, p. 735).

These data come from an experimental investigation of the heat evolved during the setting and hardening of Portland cements of varied composition and the dependence of this heat on the percentages of four compounds in the clinkers from which the cement was produced. As observed by Woods, Steinour and Starke (1932, p. 1207), "This property is of interest in the construction of massive works such as dams, in which the great thicknesses severely hinder the outflow of the heat. The consequent rise in temperature while the cement is hardening may result in contractions and cracking when the eventual cooling to the surrounding temperature takes place."

The four compounds considered by Woods, Steinour and Starke (1932) are tricalcium aluminate: 3CaO·Al2O3, tricalcium silicate: 3CaO·SiO2, tetracalcium aluminoferrite: 4CaO·Al2O3·Fe2O3, and β-dicalcium silicate: 2CaO·SiO2, which we will denote by $x_1$, $x_2$, $x_3$ and $x_4$, respectively. The heat evolved after 180 days of curing, which we will denote by $y$, is measured in calories per gram of cement. Woods, Steinour and Starke (1932) also present data on the heat evolved after 3, 7, 28 and 90 days of curing.
We assemble our data as follows:
$$X = \begin{pmatrix}
7 & 26 & 6 & 60 \\
1 & 29 & 15 & 52 \\
11 & 56 & 8 & 20 \\
11 & 31 & 8 & 47 \\
7 & 52 & 6 & 33 \\
11 & 55 & 9 & 22 \\
3 & 71 & 17 & 6 \\
1 & 31 & 22 & 44 \\
2 & 54 & 18 & 22 \\
21 & 47 & 4 & 26 \\
1 & 40 & 23 & 34 \\
11 & 66 & 9 & 12 \\
10 & 68 & 8 & 12
\end{pmatrix}, \qquad
y = \begin{pmatrix}
78.5 \\ 74.3 \\ 104.3 \\ 87.6 \\ 95.9 \\ 109.2 \\ 102.7 \\ 72.5 \\ 93.1 \\ 115.9 \\ 83.8 \\ 113.3 \\ 109.4
\end{pmatrix}.$$
The four columns of the $13 \times 4$ matrix $X$ comprise the data on $x_1$, $x_2$, $x_3$ and $x_4$, respectively. We observe that this matrix $X$ does not include a column of ones; indeed we wish (first) to fit a homogeneous linear regression model (2.1) without intercept to the data, cf. Woods, Steinour and Starke (1932, p. 1211). We have $n = 13$ cases (experimental units) in this dataset and $p = 4$ unknown regression coefficients in our linear model. The scalar variance $\sigma^2$ is also unknown.

The $13 \times 4$ matrix $X$ has singular values $\theta_1 = 211.336941$, $\theta_2 = 77.235610$, $\theta_3 = 28.459657$, and $\theta_4 = 10.266734$. The condition number of $X$ is $\theta_1/\theta_4 = 20.5846$ and so $X$ may be considered "well-conditioned", possessing essentially no multicollinearity. [It is interesting to note that if $X$ is centered and standardised (normalised) so that $X'X$ is a correlation matrix then, cf. Nomura (1988, p. 735), the condition number of $X$ almost doubles to 37.1063.]

The OLS estimates and their estimated standard errors (in parentheses), together with the associated $t$-statistics with $n - p = 13 - 4 = 9$ degrees of freedom and two-sided critical levels (in parentheses), are:

$\hat\beta_1 = 2.1930\ (0.1853)$, $t = 11.84\ (< 0.0001)$;
$\hat\beta_2 = 1.1533\ (0.0479)$, $t = 24.06\ (< 0.0001)$;
$\hat\beta_3 = 0.7585\ (0.1595)$, $t = 4.76\ (0.0010)$;
$\hat\beta_4 = 0.4863\ (0.0414)$, $t = 11.74\ (< 0.0001)$.
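For readers who wish to reproduce these computations (the paper itself used JMP and Xlisp-Stat, as noted below), the following Python/NumPy sketch, our own illustration, defines the data above and checks the singular values, the condition number, and the no-intercept OLS fit just reported:

```python
import numpy as np

X = np.array([[ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20],
              [11, 31,  8, 47], [ 7, 52,  6, 33], [11, 55,  9, 22],
              [ 3, 71, 17,  6], [ 1, 31, 22, 44], [ 2, 54, 18, 22],
              [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
              [10, 68,  8, 12]], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])

theta = np.linalg.svd(X, compute_uv=False)       # singular values of X
print(theta)                                     # ~ 211.34, 77.24, 28.46, 10.27
print(theta[0] / theta[-1])                      # condition number ~ 20.58

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # homogeneous (no-intercept) OLS fit
print(beta_hat)                                  # ~ (2.1930, 1.1533, 0.7585, 0.4863)
```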
The corresponding values of the OLS estimates and their estimated standard errors (in parentheses) presented by Woods, Steinour and Starke (1932, p. 1212) are slightly different:

$\hat\beta_1 = 2.18\ (0.11)$, $\hat\beta_2 = 1.206\ (0.029)$, $\hat\beta_3 = 0.73\ (0.10)$, $\hat\beta_4 = 0.526\ (0.024)$.
While the computational algorithms available today are surely more accurate than 65 years ago, we note that Woods, Steinour and Starke (1932, p. 1208, Table I) present the values of the four compounds $x_1, \ldots, x_4$ as integers (percentages rounded to the nearest integer), and it is possible that the values of these percentages which these authors used to compute their OLS estimates (and standard errors) were not so rounded. Our computations here (and below) were performed both by using the JMP statistical package on the Macintosh, cf. Sall and Lehman (1996), and the R-code given by Cook and Weisberg (1994) and written in the Xlisp-Stat computer language due to Tierney (1990).

We note that all of the $t$-statistics here are highly significant (assuming that the underlying response vector $y$ follows a multivariate normal distribution). Indeed (the uncentered) $R^2$ is 99.96% and the overall $F$-statistic with $(4, 9)$ degrees of freedom is 5176.47 (with critical level less than 0.01%).

We now, however, following Hald (1952, pp. 648–649), Gorman and Toman (1966, pp. 35–36) and Daniel and Wood (1980, p. 89), add a column of ones to the matrix $X$ and fit an inhomogeneous linear regression model with intercept to the data. We still have $n = 13$ observations but there are now $p = 5$ unknown regression coefficients. The $13 \times 5$ matrix $X$ now has singular values $\theta_1 = 211.367467$, $\theta_2 = 77.236145$, $\theta_3 = 28.459657$, $\theta_4 = 10.267360$, and $\theta_5 = 0.034900$. The condition number of $X$ is now $\theta_1/\theta_5 = 6056.3744$ and so $X$ may now be considered quite "ill-conditioned."

It is interesting to observe how close the largest four singular values are here to those of the $13 \times 4$ matrix $X$ considered earlier without the column of ones (indeed $\theta_3 = 28.459657$ is the third largest singular value of both $X$ matrices, at least to six decimal places!). We feel that this "correspondence" of the singular values is connected with the fact that the row totals of the $13 \times 4$ matrix $X$ (without the column of ones) are all approximately equal to 100, respectively: 99, 97, 95, 97, 98, 97, 97, 98, 96, 98, 98, 98, 98 (the compounds are all measured in percentages!). Moreover, we note that the (normalised) "singular vector" $a$ corresponding to the smallest singular value $\theta_5 = 0.034900$, i.e., satisfying $X'Xa = \theta_5^2\,a$, is
$$a = (-0.999788,\ 0.010285,\ 0.010304,\ 0.010519,\ 0.010099)',$$
suggesting (again) that $x_1 + x_2 + x_3 + x_4 = 100$.

We find it curious that all the analyses of these data that we have found, from Hald (1952, pp. 635–652) to date, consider the inhomogeneous model with an intercept even though in their original analysis Woods, Steinour and Starke (1932, p. 1211) clearly used the homogeneous model without an intercept. With an intercept the associated inhomogeneous linear model certainly possesses "high multicollinearity", with possible effects on the OLS estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4$ as described in Section 1. The OLS estimates and their estimated standard errors (in parentheses), together with the associated $t$-statistics—now
with $n - p = 13 - 5 = 8$ degrees of freedom—and two-sided critical levels (in parentheses), are:

$\hat\beta_0 = 62.4054\ (70.0710)$, $t = 0.89\ (0.3991)$;
$\hat\beta_1 = 1.5511\ (0.7448)$, $t = 2.08\ (0.0708)$;
$\hat\beta_2 = 0.5102\ (0.7238)$, $t = 0.70\ (0.5009)$;
$\hat\beta_3 = 0.1019\ (0.7547)$, $t = 0.14\ (0.8959)$;
$\hat\beta_4 = -0.1441\ (0.7091)$, $t = -0.20\ (0.8441)$.
We note that not one of these $t$-statistics is significant (at the 5% level). In contrast, (the centered) $R^2$ is 98.24% and the overall $F$-statistic for the four regression coefficients with $(4, 8)$ degrees of freedom is 111.48 (again with critical level less than 0.01%). Even though it now seems that $y$ does not depend on any one of the $x_i$'s in a univariate way, we find that the sample correlation coefficients
$$r_{x_1,y} = 0.7307, \quad r_{x_2,y} = 0.8163, \quad r_{x_3,y} = -0.5347, \quad r_{x_4,y} = -0.8213$$
are all rather high (in absolute value). Indeed we have here an "anti-quirk", cf. Bertrand and Holder (1988), Bertrand (1998). An anti-quirk is said to occur whenever
$$R^2 < \sum_i r_{x_i,y}^2;$$
a quirk occurs when the converse inequality holds. Here we see that $R^2 = 0.9824 < \sum_i r_{x_i,y}^2 = 2.1607$ and so we have an anti-quirk. Indeed, as observed by Bertrand (1998), "An anti-quirk may be such that it is possible for many of the $x_i$ to appear to be strongly correlated with $y$, but the $x_i$ apparently having the strongest correlation with $y$ may not be significant in the multiple regression of $y$ on all the $x_i$."
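The anti-quirk inequality is straightforward to verify numerically. Continuing with $X$ and $y$ from the sketch given earlier in this section (again our own illustration), one may check:

```python
import numpy as np

# Anti-quirk check: compare the centered R^2 of the model with intercept
# against the sum of squared simple correlations of the x_i with y.
Xc = np.column_stack([np.ones(len(y)), X])            # add a column of ones
beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
resid = y - Xc @ beta
R2 = 1.0 - resid @ resid / np.sum((y - y.mean())**2)  # ~ 0.9824
sum_r2 = sum(np.corrcoef(X[:, j], y)[0, 1]**2 for j in range(X.shape[1]))  # ~ 2.1607
print(R2 < sum_r2)                                    # True: an anti-quirk
```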
In order to improve these estimates we now add one linear restriction to the model. In view of our discussion above it is clearly of interest to consider the restriction $\beta_0 = 0$, so that our inhomogeneous model with intercept becomes the original homogeneous model without intercept. The corresponding values of the RLS estimates and their estimated standard errors (in parentheses) are:
$\beta_1^* = 2.1930\ (0.1853, 0.1874)$, $\beta_2^* = 1.1533\ (0.0479, 0.0485)$, $\beta_3^* = 0.7585\ (0.1595, 0.1613)$, $\beta_4^* = 0.4863\ (0.0414, 0.0419)$.
The RLS estimates here are, of course, the same as the OLS estimates in the original homogeneous model (clearly $\beta_0^* = 0$, as per our restriction). The first of the two standard errors given here (in pairs in parentheses) uses the RLS estimate for $\sigma^2$ in the restricted (current) inhomogeneous model with the restriction $\beta_0 = 0$ (or, equivalently, the OLS estimate for $\sigma^2$ in the unrestricted original
homogeneous model), while the second uses the OLS estimate in the unrestricted (current) inhomogeneous model. We note that for each estimate the first standard error (using the RLS estimate for $\sigma^2$ in the restricted inhomogeneous model with the restriction $\beta_0 = 0$) is lower than the second (which uses the OLS estimate for $\sigma^2$ in the unrestricted inhomogeneous model).

With this particular restriction $\beta_0 = 0$, we compute $\hat d_{\mathrm{RLS}} = 0.7892$ and $\hat d_{\mathrm{OLS}} = 1.0 - (1.4 \times 10^{-7})$. Choosing $d = \hat d_{\mathrm{RLS}}$ we obtain

$\hat\beta_{rd,0} = 0.0099\ (0.0005, 0.0006)$,
$\hat\beta_{rd,1} = 2.1901\ (0.1849, 0.1871)$,
$\hat\beta_{rd,2} = 1.1539\ (0.0479, 0.0484)$,
$\hat\beta_{rd,3} = 0.7563\ (0.1592, 0.1611)$,
$\hat\beta_{rd,4} = 0.4867\ (0.0414, 0.0419)$.

As before, the first of the two standard errors given here (in parentheses) uses the RLS estimate for $\sigma^2$ in the restricted inhomogeneous model with the restriction $\beta_0 = 0$, while the second uses the OLS estimate in the unrestricted inhomogeneous model; and again we note that for each estimate the first standard error (using the RLS estimate for $\sigma^2$) is lower than the second (which uses the OLS estimate). We also observe that the standard errors of the $\hat\beta_{rd,i}$ here are very close to those of the $\beta_i^*$ above—indeed the pairs of standard errors of $\beta_4^*$ and of $\hat\beta_{rd,4}$ coincide completely (at least to 4 decimal places!).

For the estimated mse's we now find that $\widehat{\mathrm{mse}}(\beta^*) = 0.0637846$, whereas for $d = \hat d_{\mathrm{RLS}}$ we have $\widehat{\mathrm{mse}}(\hat\beta_{rd}) = 0.0636713$. These mse estimates are obtained by replacing, in the corresponding theoretical mse expressions, all unknown model parameters by their RLS estimates.

We may also improve our estimates by adding to the inhomogeneous model with intercept the (new) linear restriction $\beta_3 = \beta_2 - \beta_1$, or $R\beta = 0$ with $R := (0, 1, -1, 1, 0)$. We test the linear hypothesis $H_0: R\beta = 0$ in the framework of our unrestricted linear model (2.1). The test statistic for $H_0$, given our observations, is
$$F := \hat\beta'R'(RS^{-1}R')^{-1}R\hat\beta/\hat\sigma^2 = 1.92.$$
Since the upper percentage point $F_{0.05}(1, 8) = 5.32$ (here $n - p = 13 - 5 = 8$), our hypothesis $H_0: \beta_3 = \beta_2 - \beta_1$ is not rejected at the 5% significance level (again assuming that the underlying response vector $y$ follows a multivariate normal distribution). The RLS estimates and their associated estimated standard errors (in parentheses) are:

$\beta_0^* = 149.0789\ (33.1894, 31.6114)$,
$\beta_1^* = 0.5488\ (0.1871, 0.1782)$,
$\beta_2^* = -0.3546\ (0.3851, 0.3668)$,
$\beta_3^* = -0.9035\ (0.2187, 0.2083)$,
$\beta_4^* = -1.0014\ (0.3639, 0.3466)$.
Again here (and below) the first standard error given (in parentheses) uses the RLS estimate for $\sigma^2$ in the restricted inhomogeneous model (now with the restriction $\beta_3 = \beta_2 - \beta_1$), while the second uses the OLS estimate for $\sigma^2$ in the unrestricted inhomogeneous model; interestingly, here (and below) the standard error for each estimate based on the RLS estimate is now higher than that using the OLS estimate. This is not surprising, however, since we have the following estimates for $\sigma^2$:

(a) unrestricted inhomogeneous model: $\hat\sigma_a^2 = 5.982955$;
(b) restricted inhomogeneous model with the restriction $\beta_0 = 0$ (or unrestricted homogeneous model): $\hat\sigma_b^2 = 5.845461$;
(c) restricted inhomogeneous model with the restriction $\beta_3 = \beta_2 - \beta_1$: $\hat\sigma_c^2 = 6.595199$.

We observe that the estimate of $\sigma^2$ in a restricted (inhomogeneous) model very much depends upon the restriction—indeed with the restriction $\beta_0 = 0$ we obtain the lowest estimate $\hat\sigma_b^2 = 5.845461$, while with the restriction $\beta_3 = \beta_2 - \beta_1$ we obtain the highest estimate $\hat\sigma_c^2 = 6.595199$; in the unrestricted inhomogeneous model we have $\hat\sigma_a^2 = 5.982955$.

From Theorem 3.2.1 we know that these estimates can be further improved. Choosing $d$ in the RL estimator according to (3.11) as $\hat d_{\mathrm{RLS}} = 0.952698$, we obtain the following RL estimates along with their estimated standard errors:

$\hat\beta_{rd,0} = 142.0379\ (31.6215, 30.1181)$,
$\hat\beta_{rd,1} = 0.6207\ (0.1721, 0.1640)$,
$\hat\beta_{rd,2} = -0.2819\ (0.3690, 0.3514)$,
$\hat\beta_{rd,3} = -0.8298\ (0.2031, 0.1934)$,
$\hat\beta_{rd,4} = -0.9301\ (0.3481, 0.3316)$.
Comparing $\widehat{\mathrm{mse}}(\hat\beta_{rd}) = 1049.85$ with $\widehat{\mathrm{mse}}(\beta^*) = 1101.9$ and $\widehat{\mathrm{mse}}(\hat\beta) = 4912.09$ shows that $\hat\beta_{rd}$ (with the parameter estimate $\hat d_{\mathrm{RLS}}$) has indeed a smaller estimated (scalar) mse than the other two estimators; this agrees with our theoretical findings in Theorem 3.2.1. These estimated mse's for $\hat\beta_{rd}$ and $\beta^*$ are again obtained by replacing, in the corresponding theoretical mse expressions, all unknown model parameters by their RLS estimates. Only in $\widehat{\mathrm{mse}}(\hat\beta)$ have we used the OLS estimate $\hat\sigma^2 = 5.982955$; if instead we use the RLS estimate for $\sigma^2$, which is 6.5952, then $\widehat{\mathrm{mse}}(\hat\beta)$ increases to 5414.75.

Finally, we mention that if we estimate $d_{\mathrm{opt}}$ according to (3.10) we obtain $\hat d_{\mathrm{OLS}} = 0.795331$. At this value of $d$ we compute $\widehat{\mathrm{mse}}(\hat\beta_{rd}) = 1626$, which exceeds the mse estimate with the tuning parameter $\hat d_{\mathrm{RLS}}$; this is not a surprise—it is a direct consequence of our theoretical observations in Section 3.2, since in both mse estimates we have used the same RLS estimates as substitutes for the unknown model parameters.
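The whole example can be reproduced along the following lines. The sketch below (ours, and only a sketch; in particular we assume that the RLS residual variance uses $n - p + q$ degrees of freedom) treats the restriction $\beta_3 = \beta_2 - \beta_1$ and should recover, up to rounding, the value of $\hat d_{\mathrm{RLS}}$ and the RL estimates reported above:

```python
import numpy as np

# Continuing with X, y from the earlier sketches; model with intercept.
Xc = np.column_stack([np.ones(len(y)), X])
n, p = Xc.shape
S = Xc.T @ Xc
R = np.array([[0.0, 1.0, -1.0, 1.0, 0.0]])     # restriction beta3 = beta2 - beta1
r = np.zeros(1)
q = 1

beta_hat = np.linalg.solve(S, Xc.T @ y)
Sinv_Rt = np.linalg.solve(S, R.T)
beta_star = beta_hat + Sinv_Rt @ np.linalg.solve(R @ Sinv_Rt, r - R @ beta_hat)

# RLS-based estimate of sigma^2 (assumed n - p + q residual degrees of freedom)
resid = y - Xc @ beta_star
sigma2_rls = resid @ resid / (n - p + q)       # ~ 6.5952

# Spectral quantities of Section 3.2, evaluated at the RLS estimates
lam, P = np.linalg.eigh(S)                     # S = P diag(lam) P'
A = np.linalg.inv(S) - Sinv_Rt @ np.linalg.solve(R @ Sinv_Rt, Sinv_Rt.T)
b = np.diag(P.T @ A @ P)
alpha = P.T @ beta_star

# RLS-based minimum mse estimate of d, eq. (3.11)
den = np.sum((sigma2_rls * b + alpha**2) / (lam + 1.0)**2)
d_rls = 1.0 - sigma2_rls * np.sum(b / (lam + 1.0)) / den   # ~ 0.9527

# Restricted Liu estimates, eq. (2.9)
beta_rd = np.linalg.solve(S + np.eye(p), (S + d_rls * np.eye(p)) @ beta_star)
print(d_rls, beta_rd)
```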
Acknowledgements. The authors are most grateful to an anonymous referee for several useful suggestions which improved an earlier version of this paper and to Philip V. Bertrand and S. W. Drury for helpful discussions.

References

Bertrand, Philip V. (1998). Constructing explained and explanatory variables with strange statistical analysis results. The Statistician: Journal of the Royal Statistical Society, Series D 47, 377–383.
Bertrand, Philip V. and Holder, Roger L. (1988). A quirk in multiple regression: The whole regression can be greater than the sum of its parts. The Statistician 37, 371–374.
Cook, R. Dennis and Weisberg, Sanford (1994). An Introduction to Regression Graphics. Wiley, New York.
Daniel, Cuthbert and Wood, Fred S. (1980). Fitting Equations to Data: Computer Analysis of Multifactor Data. Second Edition, with the assistance of John W. Gorman. Wiley, New York.
Gorman, John W. and Toman, R. J. (1966). Selection of variables for fitting equations to data. Technometrics 8, 27–51.
Hald, Anders (1952). Statistical Theory with Engineering Applications. Wiley, New York.
Hamaker, H. C. (1962). On multiple regression analysis. Statistica Neerlandica 16, 31–56.
Liu, Kejian (1993). A new class of biased estimate in linear regression. Communications in Statistics–Theory and Methods 22, 393–402. [Author's name appears as "Liu Kejian" on title page, but surname is Liu.]
Nomura, Masuo (1988). On the almost unbiased ridge regression estimator. Communications in Statistics–Simulation and Computation 17, 729–743.
Piepel, Greg and Redgate, Trish (1998). A mixture experiment analysis of the Hald cement data. The American Statistician 52, 23–30.
Sall, John and Lehman, Ann (1996). JMP Start Statistics: A Guide to Statistical and Data Analysis using JMP and JMP IN Software. Duxbury Press, Belmont, California.
Sarkar, Nityananda (1992). A new estimator combining the ridge regression and the restricted least squares methods of estimation. Communications in Statistics–Theory and Methods 21, 1987–2000.
Stein, Charles (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1, 197–206.
Tierney, Luke (1990). Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. Wiley, New York.
Woods, Hubert; Steinour, Harold H. and Starke, Howard R. (1932). Effect of composition of Portland cement on heat evolved during hardening. Industrial and Engineering Chemistry 24, 1207–1214.

S. Kaçıranlar, S. Sakallıoğlu, F. Akdeniz
Dept. of Mathematics
Faculty of Arts & Science
University of Çukurova
01330 Adana, Turkey
e-mail: [email protected]

G. P. H. Styan
Dept. of Mathematics and Statistics
McGill University
805 Sherbrooke Street West
Montréal, Québec
Canada H3A 2K6
e-mail: [email protected]

H. J. Werner
Institute for Econometrics and Operations Research
Econometrics Unit
University of Bonn
Adenauerallee 24–42
D-53113 Bonn, Germany
e-mails: [email protected]
[email protected]