Robust Regression by Trimmed Least-Squares Estimation

David Ruppert and Raymond J. Carroll
Koenker and Bassett (1978) introduced regression quantiles and suggested a method of trimmed least-squares estimation based upon them. We develop asymptotic representations of regression quantiles and trimmed least-squares estimators. The latter are robust, easily computed estimators for the linear model. Moreover, robust confidence ellipsoids and tests of general linear hypotheses based on trimmed least squares are available and are computationally similar to those based on ordinary least squares.
David Ruppert is Assistant Professor and Raymond J. Carroll is Associate Professor, both at the Department of Statistics, the University of North Carolina at Chapel Hill. This research was supported by National Science Foundation Grant NSF MCS78-01240 and by the Air Force Office of Scientific Research under contract AFOSR-75-2796. The authors wish to thank Shiv K. Aggarwal for his programming assistance.
1. INTRODUCTION

We will consider the linear model

$$Y = X\beta + e \tag{1.1}$$

where $Y = (Y_1,\ldots,Y_n)'$, $X$ is an $n \times p$ matrix of known constants, $\beta = (\beta_1,\ldots,\beta_p)'$ is a vector of unknown parameters, and $e = (e_1,\ldots,e_n)'$, where $e_1,\ldots,e_n$ are i.i.d. with distribution $F$.
Recently there has been much interest in estimators of $\beta$ which do not have two serious drawbacks of the least-squares estimator: inefficiency when the errors have a distribution with heavier tails than the Gaussian, and great sensitivity to a few outlying observations. In general, such methods, which fall under the rubric of robust regression, are extensions of techniques originally introduced for estimating location parameters. In the location model three broad classes of estimators, M, L, and R estimators, are available. See Huber (1977) for an introduction to these classes.
For the linear model, M and R estimators have been studied extensively. Until recently only Bickel (1973) had studied regression analogs of L-estimators. His estimates have attractive asymptotic efficiencies, but they are computationally complex. Moreover, they are not equivariant to reparametrization (see remarks after his Theorem 3.1). There are compelling reasons for extending L-estimators to the linear model. For the location problem, L-estimators, particularly trimmed means, are attractive to those working with real data. Stigler (1977) recently applied robust estimators to historical data and concluded that "the 10% trimmed mean (the smallest nonzero trimming percentage included in the study) emerges as the recommended estimator." Jaeckel (1971) has shown that if $F$ is symmetric then for each L-estimator of location there are asymptotically equivalent M and R estimators.
However, without knowledge of $f$ it is not possible to match up an L-estimator with its corresponding M and R estimators. For example, trimmed means are asymptotically equivalent to Huber's M-estimate, which is the solution $b$ to

$$\sum_{i=1}^{n} \rho\bigl((X_i - b)/s_n\bigr) = \min!$$

where

$$\rho(x) = \begin{cases} x^2/2 & \text{if } |x| \le k \\ k|x| - k^2/2 & \text{if } |x| > k. \end{cases}$$

The value of $k$ is determined by the trimming proportion $\alpha$ of the trimmed mean, by $F$, and by the choice of $s_n$. In the scale non-invariant case ($s_n = 1$), $k = F^{-1}(1-\alpha)$.
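To make the mapping concrete, here is a minimal sketch (ours, not the paper's; it assumes SciPy is available) evaluating $k = F^{-1}(1-\alpha)$ for standard normal $F$:

```python
# Minimal sketch: the Huber cutoff k matching an alpha-trimmed mean in the
# scale non-invariant case (s_n = 1) is k = F^{-1}(1 - alpha).
from scipy.stats import norm

alpha = 0.10                # 10% trimming proportion
k = norm.ppf(1 - alpha)     # F^{-1}(1 - alpha) for standard normal F
print(f"alpha = {alpha:.2f} -> k = {k:.4f}")    # k is about 1.2816
```

For the 10% trimmed mean recommended by Stigler (1977), the matching cutoff $k \approx 1.28$ is far from obvious a priori, which is the point of the next remark.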
The practicing statistician, who knows only his data, may find his intuition of more assistance when choosing $\alpha$ than when choosing $k$.

Recently Koenker and Bassett (1978) have extended the concept of quantiles to the linear model.
Let $0 < \theta < 1$. Define

$$\psi_\theta(x) = \begin{cases} \theta & \text{if } x > 0 \\ \theta - 1 & \text{if } x < 0, \end{cases} \qquad \rho_\theta(x) = x\,\psi_\theta(x). \tag{1.2}$$

Then a $\theta$th regression quantile, $\hat\beta(\theta)$, is any value of $b$ which solves

$$\sum_{i=1}^{n} \rho_\theta(Y_i - x_i b) = \min! \tag{1.3}$$
Their Theorem 4.2 shows that regression quantiles have asymptotic behavior similar to sample quantiles in the location problem. Therefore, L-estimates consisting of linear combinations of a fixed number of order statistics, for example the median, trimean, and Gastwirth's estimator, are easily extended to the linear model and have the same asymptotic efficiencies as in the location model. Moreover, as they point out, regression quantiles can be easily computed by linear programming techniques.
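Since (1.3) is a linear program, any LP solver applies. The following sketch (ours, not the paper's; it assumes NumPy and SciPy and a small dense design) uses the standard formulation with nonnegative slack vectors $u$ and $v$ for the positive and negative parts of the residuals:

```python
# Minimal sketch of a regression quantile computed by linear programming:
#   minimize    theta * 1'u + (1 - theta) * 1'v
#   subject to  X b + u - v = y,   u >= 0,  v >= 0,  b free.
import numpy as np
from scipy.optimize import linprog

def regression_quantile(X, y, theta):
    n, p = X.shape
    c = np.concatenate([np.zeros(p),
                        theta * np.ones(n),
                        (1 - theta) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]        # beta_hat(theta)
```

With $\theta = .5$ this reduces to least absolute deviations regression.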
They also suggest the following trimmed least-squares estimator, call it $\hat\beta_T(\alpha)$: remove from the sample any observations whose residual from $\hat\beta(\alpha)$ is negative or whose residual from $\hat\beta(1-\alpha)$ is positive, and calculate the least-squares estimator using the remaining observations. They conjecture that if $\lim_{n\to\infty} n^{-1}(X'X) = Q$ (positive definite), then the asymptotic variance of $\hat\beta_T(\alpha)$ is $n^{-1}\sigma^2(\alpha,F)\,Q^{-1}$, where $n^{-1}\sigma^2(\alpha,F)$ is the variance of an $\alpha$-trimmed mean from a population with distribution $F$.
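In code, $\hat\beta_T(\alpha)$ is a two-stage procedure; here is a minimal sketch (ours, not the paper's), reusing the `regression_quantile` function above:

```python
def trimmed_least_squares(X, y, alpha):
    # Stage 1: residuals from the alpha-th and (1 - alpha)-th regression quantiles.
    r_lo = y - X @ regression_quantile(X, y, alpha)
    r_hi = y - X @ regression_quantile(X, y, 1 - alpha)
    # Stage 2: drop observations with a negative residual from beta_hat(alpha)
    # or a positive residual from beta_hat(1 - alpha); least squares on the rest.
    keep = (r_lo >= 0) & (r_hi <= 0)
    beta_T, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta_T
```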
In this paper we develop asymptotic expansions for $\hat\beta(\theta)$ and $\hat\beta_T(\alpha)$ which provide simple proofs of Koenker and Bassett's Theorem 4.2 and of their conjecture about the asymptotic covariance of $\hat\beta_T(\alpha)$. In the location model, if $F$ is asymmetric then there is no natural parameter to estimate.
In the linear model, if the design matrix is chosen so that one column, say the first, consists entirely of ones and the remaining columns each sum to zero, then our expansions show that for each $0 < \alpha < \tfrac12$

$$n^{1/2}\bigl(\hat\beta_T(\alpha) - \beta - \delta(\alpha)\bigr) \xrightarrow{L} N\bigl(0,\; \sigma^2(\alpha,F)\,Q^{-1}\bigr)$$

where $\delta(\alpha)$ is a vector whose components are all zero except for the first. Therefore, the ambiguity about the parameter being estimated involves only the intercept and none of the slope parameters. Additionally, we present a large-sample theory of confidence ellipsoids and general linear hypothesis testing which is quite similar to that of least-squares estimation with Gaussian errors. The close analogy between the asymptotic distributions of trimmed means and of our trimmed least-squares estimator $\hat\beta_T(\alpha)$ is remarkable.
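The role of $\delta(\alpha)$ can be seen in a small simulation (ours, not the paper's; it reuses the sketches above and assumes NumPy): with a centered slope column and skewed errors, only the intercept component of $\hat\beta_T(\alpha)$ is shifted.

```python
# Minimal simulation sketch: with skewed errors, the trimming shift delta(alpha)
# appears only in the intercept; the slope estimate stays centered at its target.
rng = np.random.default_rng(0)
n, alpha, beta = 200, 0.10, np.array([1.0, 2.0])     # true intercept and slope

estimates = []
for _ in range(100):
    z = rng.normal(size=n)
    X = np.column_stack([np.ones(n), z - z.mean()])  # ones column, centered slope column
    e = rng.exponential(size=n) - 1.0                # asymmetric error distribution
    y = X @ beta + e
    estimates.append(trimmed_least_squares(X, y, alpha))

est = np.array(estimates)
# The slope average should sit near beta[1] = 2.0; the intercept average is
# shifted by roughly delta(alpha), reflecting the intercept-only ambiguity.
print("intercept mean:", est[:, 0].mean(), "slope mean:", est[:, 1].mean())
```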
Other reasonable definitions of a trimmed least-squares estimator do not have this property. For example, define an estimator of $\beta$, call it K50, as follows: compute the residuals from $\hat\beta(.5)$ and, after removing the observations with the $k = [\alpha n]$ smallest and $k$ largest residuals, compute the least-squares estimator. The asymptotic behavior of K50, which is complicated and is not identical to that of $\hat\beta_T(\alpha)$, will be reported elsewhere.

Section 2 presents the formal model being considered, the main results are in Section 3, and several examples are found in Section 4. Proofs are in the Appendix.
2. NOTATION AND ASSUMPTIONS

Recall the form (1.1) of our model. Although $Y$, $X$, and $e$ will depend on $n$, we will not make that explicit in the notation. Let $x_i = (x_{i1},\ldots,x_{ip})$ be the $i$th row of $X$. Assume $x_{i1} = 1$ for $i = 1,\ldots,n$,

$$\lim_{n\to\infty}\; \max_{j\le p,\; i\le n} \bigl(n^{-1/2}|x_{ij}|\bigr) = 0, \tag{2.1}$$

and there exists positive definite $Q$ such that

$$\lim_{n\to\infty} n^{-1}(X'X) = Q. \tag{2.2}$$
For $0 < \theta < 1$ define $\xi(\theta) = F^{-1}(\theta)$. Fix $0 < \theta_1 < \cdots < \theta_m < 1$ and trimming proportions $0 < \alpha_1 < \alpha_2 < 1$, and define $\xi_1 = \xi(\alpha_1)$ and $\xi_2 = \xi(\alpha_2)$. Without loss of generality, we assume $F^{-1}(\tfrac12) = 0$. Let $N_p(\mu,\Sigma)$ denote the $p$-variate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$.
APPENDIX

Lemma A.3: For each $\varepsilon > 0$ and $\eta > 0$, there exist $K > 0$ and an integer $n_0$ such that if $n > n_0$, then

$$P\Bigl\{\inf_{\|\Delta\| \ge K} \|M(\Delta)\| < \eta\Bigr\} < \varepsilon.$$

Proof: This is shown in exactly the same manner as Lemma 5.2 of Jurečková (1977). (Her $c_{ji}$ is our $n^{-1/2}x_{ij}$.)
Proof of Theorem 2: In $M(\Delta)$ replace $\Delta$ by $n^{1/2}(\hat\beta(\theta) - \beta(\theta))$. By Theorem 1,

$$M\bigl(n^{1/2}(\hat\beta(\theta) - \beta(\theta))\bigr) = n^{-1/2}\sum_{i=1}^{n} x_i'\,\psi_\theta\bigl(Y_i - x_i\hat\beta(\theta)\bigr) + o_p(1). \tag{A.3}$$

From Lemma A.3, $n^{1/2}(\hat\beta(\theta) - \beta(\theta)) = O_p(1)$, and by the definition of $\hat\beta(\theta)$,

$$n^{-1/2}\sum_{i=1}^{n} x_i'\,\psi_\theta\bigl(Y_i - x_i\hat\beta(\theta)\bigr) = o_p(1). \tag{A.4}$$

Theorem 2 follows from (A.3) and (A.4).

Proof of Corollary 1:
Let $\gamma(\theta_i) = \hat\beta(\theta_i) - \beta(\theta_i)$ and $\gamma' = (\gamma(\theta_1)',\ldots,\gamma(\theta_m)')$. We need only show that if $c_j \in R^p$ for $j = 1,\ldots,m$ and $c' = (c_1',\ldots,c_m')$, then

$$n^{1/2}\, c'\gamma \xrightarrow{L} N\bigl(0,\; c'(\Omega \otimes Q^{-1})c\bigr). \tag{A.5}$$

Define

$$\eta_{ij} = \ldots \qquad\text{and}\qquad \eta_{i\cdot} = \sum_{j=1}^{m} \eta_{ij}.$$

By Theorem 2,

$$n^{1/2}\, c'\gamma = n^{-1/2}\sum_{i=1}^{n} \eta_{i\cdot} + o_p(1).$$

Then (A.5) follows since routine calculations show that $\eta_{1\cdot}, \eta_{2\cdot}, \ldots$ satisfy the conditions of Lindeberg's central limit theorem. (See, for example, Loève 1963, Section 20.2, Theorem B.)

For any matrix $A$, let $|A| = \max_{i,j} |A_{ij}|$.
Lemma A.4: Let $D_{in}$ ($= D_i$) be an $r \times c$ matrix. Suppose

$$\sup_n\; n^{-1}\sum_{i=1}^{n} |D_i|^2 < \infty. \tag{A.6}$$

Let $I$ be an open interval containing $\xi_1$ and $\xi_2$, and let the function $g(x)$ be defined for all $x$ and Lipschitz continuous on $I$. For $\Delta_1$, $\Delta_2$, and $\Delta_3$ in $R^p$ and $\Delta = (\Delta_1, \Delta_2, \Delta_3)$, define

$$T^*(\Delta) = \ldots, \tag{A.7}$$

$$T(\Delta) = T^*(\Delta) - E\,T^*(\Delta), \qquad S^*(\Delta) = T^*(\Delta) - T^*(0), \qquad S(\Delta) = S^*(\Delta) - E\,S^*(\Delta).$$

Then for all $M > 0$, $\sup_{|\Delta| \le M} |S(\Delta)| = o_p(1)$.

Proof:
Here we follow Bickel (1975, Lemma 4.1) closely. For convenience assume $M = 1$. Define for $\ell = 1,2$

$$I_{i\ell}(\Delta) = \ldots$$

and let

$$b_i(\Delta) = g\bigl(e_i + n^{-1/2} x_i \Delta_3\bigr).$$

Then for all $n$ large enough,

$$\mathrm{Var}\,S(\Delta) \le n^{-1}\sum_{i=1}^{n} |D_i|^2\, \mathrm{Var}\bigl[b_i(0)\bigl(I_{i1}(\Delta) - I_{i1}(0)\bigr)\bigr] + n^{-1}\sum_{i=1}^{n} |D_i|^2\, \mathrm{Var}\bigl[\bigl(b_i(\Delta) - b_i(0)\bigr)\, I\{e_i \in I\}\bigr].$$

Since $g$ is Lipschitz on $I$, the second sum is $o(1)$ by (A.6) and (2.1). Since $F$ is continuously differentiable in neighborhoods of $\xi_1$ and $\xi_2$, the first sum is

$$O\Bigl(n^{-1}\sum_{i=1}^{n} |D_i|^2\, n^{-1/2}|x_i|\Bigr) = o(1).$$
Therefore, for any fixed $\Delta$,

$$S(\Delta) \xrightarrow{P} 0. \tag{A.8}$$

Choose $\delta > 0$. Now cover the $p$-dimensional cube $[-1,1]^p$ with a union of cubes having vertices on the grid

$$J(\delta) = \bigl\{(j_1\delta,\, j_2\delta,\, \ldots,\, j_p\delta):\; j_i = 0,\, \pm 1,\, \pm 2,\, \ldots,\, \text{or } \pm([1/\delta]+1)\bigr\}.$$

If $|\Delta| \le 1$, then for $\ell = 1,2,3$ let $V_\ell(\Delta)$ be the lowest vertex of the cube containing $\Delta_\ell$, and let $V(\Delta) = (V_1(\Delta), V_2(\Delta), V_3(\Delta))$. Then straightforward calculations show that for some constant $C$,

$$|S^*(\Delta) - S^*(V(\Delta))| \le n^{-1/2}\sum_{i=1}^{n} |D_i|\,\bigl|b_i(\Delta) - b_i(V(\Delta))\bigr|\, I\{e_i \in I\} + n^{-1/2}\sum_{i=1}^{n} |D_i|\,\bigl|b_i(V(\Delta))\bigr|\,\bigl[I\{-a_i < \rho_{1i} < a_i\} + I\{-a_i < \rho_{2i} < a_i\}\bigr],$$

where $a_i = C\,\delta\, n^{-1/2}|x_i|$.
Now since $g$ is Lipschitz on $I$ and $F$ has a continuous derivative in neighborhoods of $\xi_1$ and $\xi_2$, there is a constant $K_1$ such that, for $m = 1,2$ and all $\Delta \in (J(\delta))^3$, the two sums above (call them $W_1(\Delta)$ and $W_2(\Delta)$) satisfy

$$E\,W_m(\Delta) \le K_1\,\delta$$

by (2.2), (A.6), and the Cauchy-Schwarz inequality. Exactly as in the argument leading to (A.8), we have $W_m(\Delta) - E\,W_m(\Delta) = o_p(1)$. Thus, for all $\varepsilon > 0$ and $\delta > 0$, there exists $n_0$ such that if $n \ge n_0$, then for $m = 1,2$

$$P\Bigl\{\max_{\Delta \in J(\delta)^3} W_m(\Delta) > \varepsilon + K_1\delta\Bigr\} < \varepsilon,$$

whence

$$P\Bigl\{\sup_{|\Delta| \le 1} |S^*(\Delta) - S^*(V(\Delta))| > \varepsilon + K_1\delta\Bigr\} < \varepsilon.$$

Note that there exists a constant $K_3$ such that

$$\sup_{|\Delta| \le 1} \bigl|E\bigl(S^*(\Delta) - S^*(V(\Delta))\bigr)\bigr| \le K_3\,\delta.$$

Choosing $\delta = \varepsilon/\max(K_1, K_3)$, we have that for all $\varepsilon > 0$ there exists $n_0$ such that

$$P\Bigl(\sup_{|\Delta| \le 1} |S(\Delta) - S(V(\Delta))| > 3\varepsilon\Bigr) < \varepsilon. \tag{A.9}$$

By (A.8), for this $\delta$ and $\varepsilon$ there exists $n_1$ such that if $n > n_1$,

$$P\Bigl(\max_{\Delta \in J(\delta)^3} |S(\Delta)| > \varepsilon\Bigr) < \varepsilon. \tag{A.10}$$

Inequalities (A.9) and (A.10) prove the lemma.
" South African Statistics Journal, 8, 127-134. Daniel, Cuthbert and Wood, Fred S. (1971).
Fitting Equations to Data>
New
York: John Wiley.
e.
Huber, Peter J. (1977), Robust Statistical Procedures> Philadelphia: SIAM. Jaeckel, Louis A. (1971), "Robust Estimates of Location: Symmetry and Asymmetric Contamination," The Annals of Mathematical Statistics> 42,
1020~l034.
Jureckova, Jana (1977), "Asymptotic Relations of M-estimates and R-estimates in Linear Regression Model", The Annals of Statistics, 5, 464-472. Koenker, Roger and Bassett, Gilbert, Jr. (1978), "Regression Quantiles,"
Econometrica> 46, 33-50. Lenth, Russell V. (1976), "A Computational Procedure For Robust Multiple Regression," Tech. Report 53, University of Iowa. Loeve, Michel (1963), Probability Theory> New York: Van Nostrand. McKeown, P.G. and Rubin, D.S. (1977), "A Student Oriented Preprocessor for MPS/360," Computers and Operations Research> 4, 227-229.
27
Searle, Shayle R. (1971), Linear
Models~
Scheffe, Henry l1959), The Analysis of
New York: John Wiley and Sons.
Variance~
New York: John Wiley and Sons.
Stigler, Stephen M. (1977), "Do Robust Estimators Work with Real Data?"
The Annals of
.•
Statistics~
5, 1055-1098 .