Journal of Econometrics 32 (1986) 5-34. North-Holland
A SEMI-PARAMETRIC CENSORED REGRESSION ESTIMATOR

Gregory M. DUNCAN*

Washington State University, Pullman, WA 99164-4860, USA
This paper introduces a semi-parametric method for estimating regression coefficients when the underlying parent population of errors is censored. The method is an example of the method of sieves, and it provides simultaneous estimates of the regression coefficients and the density of the underlying parent population. In the very simplest terms, the underlying unknown density is approximated by a spline with mesh size approaching zero with the sample size. The values of the density at the knots are then added to the list of the usual unknown parameters in a censored regression model, e.g., the regression coefficients and scale parameter. A quasi-likelihood function using the approximate spline density is then maximized over all the parameters mentioned above. The method is shown to result in strongly consistent parameter estimates.
1. Introduction

This paper introduces a semi-parametric method for estimating regression coefficients when the underlying parent population of errors is censored. The method is an example of the method of sieves, and it provides simultaneous estimates of the regression coefficients and the density of the underlying parent population. In the very simplest terms, the underlying unknown density is approximated by a spline with mesh size approaching zero with the sample size. The values of the density at the knots are then added to the list of the usual unknown parameters in a censored regression model, e.g., the regression coefficients and scale parameter. A quasi-likelihood function using the approximate spline density is then maximized over all the parameters mentioned above. The method is shown to result in strongly consistent parameter estimates.

2. Background

In a left (right) censored regression model, defined more formally below, the values of the dependent variable falling below (above) a given value are

* I am indebted to Jim Heckman, Hal White, Alan Marcus, Chris Sims, Steve Cosslett, David Spencer, Rob Engle, and Dale Poirier for helpful comments. Ib Hansen, Cathleen Leue-Roney, and Mark Thoma are thanked for research and programming assistance. This research was funded under NSF grant SES-8109274. Participants at seminars at Minnesota, Wisconsin, Chicago, Northwestern, Bell Labs, Princeton, Columbia, UC San Diego, UC Berkeley, UC Santa Barbara and UC Riverside are also thanked for their comments.
0304-4076/86/$3.50 (c) 1986, Elsevier Science Publishers B.V. (North-Holland)
replaced by that value. The regression errors are typically assumed to be independent and identically distributed, with zero expectation and with distribution function F(·); almost without exception, F(·) is assumed to be normal. Examples of normally censored regressions can be found in Tobin (1958), Amemiya (1973), Heckman (1976, 1979), Nelson (1977), and Lee and Trost (1978). They appear as a case of a much larger class of so-called selectivity problems which include mixed continuous/discrete models, Heckman (1978), Schmidt (1978), and Duncan (1980a, 1982). All of the above references explicitly assume normality.

The problem is that normally censored regression estimation methods are not robust against non-normality. For example, in the uncensored regression case, normality is a convenience, and the fact that the errors are, say, uniform has no effect on consistency. In the censored regression case, falsely assuming normality or generally misspecifying the distribution will lead to inconsistency. The obvious alternative is to model the true parent distribution, but economic theory, as yet, gives us no guide. The second alternative is to use a common flexible family of parametric distributions, such as the beta or Pearson family. This is unattractive since we still have no guides to the construction or justification of such families in practical situations. The final method, and the one we choose here, is to jointly estimate the regression coefficients and (to a high degree of accuracy in approximation) the parent population distribution and density function. In particular, we shall assume that the density can be uniformly approximated by a spline with exponential tails, and we find and estimate this approximation, together with the usual parameters.

3. A censored regression model

Let

    y_i = X_i'β + ε_i    if ε_i > −X_i'β,
        = 0              if ε_i ≤ −X_i'β,        i = 1, ..., N.
X_i and β are K×1 vectors, while the ε_i are independent identically distributed random errors with an absolutely continuous distribution, expectation zero and variance σ². It will be necessary to assume that the density of y_i | X_i can be written as

    (1/σ) f((y_i − X_i'β)/σ),

where f(·) does not depend explicitly on σ, β or X_i. This is a usual assumption and results in only a small loss in generality. Let ψ = {i: y_i > 0} and ψᶜ = {i: y_i = 0}; then the log-likelihood function can be
written as

    ℓ(β, σ) = Σ_{i∈ψ} ln f((y_i − X_i'β)/σ) + Σ_{i∈ψᶜ} ln F(−X_i'β/σ) + Σ_{i∈ψ} ln(1/σ),

where F(t) is ∫_{−∞}^t f(s) ds. Under the conditions listed in Amemiya (1973) or those listed in Hoadley (1971), a maximum-likelihood estimator of β and σ exists which is consistent and asymptotically normal.

4. Splines

Most simply, splines are continuous functions whose graphs are piecewise polynomials joined together at points in the domain called knots. A linear spline is a continuous piecewise linear function. The value of the spline at its knots defines the whole function. So if a linear spline is defined over [a, b] and the knots are at t_0 < t_1 < ... < t_m with t_0 = a, t_m = b, and if f_i = f(t_i), i = 0, ..., m, then the function f is completely determined by
    f(t) = f_i + ((t − t_i)/(t_{i+1} − t_i))(f_{i+1} − f_i),    t_i ≤ t ≤ t_{i+1}.    (1)
If the knots are equally spaced then the mesh size, h, is defined by
    h = (b − a)/m,    (2)

    t_i = t_0 + ih,    i = 0, ..., m,    (3)

and clearly t_{i+1} − t_i = h for all i.
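The interpolation formula (1) on the uniform mesh of (2)-(3) is direct to implement. The following sketch (in Python, with helper names of my own choosing, not from the paper) evaluates such a spline from its knot values:

```python
import numpy as np

def linear_spline(t, a, b, knot_values):
    """Evaluate the linear spline of eq. (1) on the uniform mesh of
    eqs. (2)-(3): t_i = a + i*h with h = (b - a)/m, where
    knot_values[i] = f_i.  Names are mine, not the paper's."""
    knot_values = np.asarray(knot_values, dtype=float)
    m = len(knot_values) - 1                  # number of mesh intervals
    h = (b - a) / m                           # mesh size, eq. (2)
    t = np.clip(t, a, b)
    i = np.minimum(((t - a) // h).astype(int), m - 1)   # interval index
    t_i = a + i * h                           # left knot, eq. (3)
    # eq. (1): f(t) = f_i + ((t - t_i)/h)(f_{i+1} - f_i)
    return knot_values[i] + ((t - t_i) / h) * (knot_values[i + 1] - knot_values[i])

# Example: interpolate g(t) = exp(-t^2) on [-1, 1] with m = 8 intervals.
ts = np.linspace(-1.0, 1.0, 201)
fs = np.exp(-np.linspace(-1.0, 1.0, 9) ** 2)            # f_i = g(t_i)
vals = linear_spline(ts, -1.0, 1.0, fs)
```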
Another, more general, way of writing a spline is as

    f(t) = Σ_i α_i φ_i(t),    (4)

where the α_i and φ_i(t) depend upon the spline being chosen. I shall refer to the α_i (or the f_i) as spline weights.
To write (1) in the form of (4) we define a set of basis functions φ_i(t); following Prenter (1976), define, for f_i and h given in (2) and (3), and for 1 ≤ i ≤ m − 1,

    L_i^m(t) = (t − t_{i−1})/h,    t_{i−1} ≤ t ≤ t_i,
             = (t_{i+1} − t)/h,    t_i ≤ t ≤ t_{i+1},
             = 0,                  otherwise.    (5)
L_i^m(t) is an isosceles triangle centered at t_i with height 1 and base 2h. For i = 0 and i = m, we use right triangles

    L_0^m(t) = (t_1 − t)/h,        t_0 ≤ t ≤ t_1,
             = 0,                  otherwise,    (6)

and

    L_m^m(t) = (t − t_{m−1})/h,    t_{m−1} ≤ t ≤ t_m,
             = 0,                  otherwise.    (7)
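A sketch of the basis (5)-(7), again with my own naming; the min-of-two-edges form below reproduces the interior tents and the boundary right triangles with a single formula:

```python
import numpy as np

def hat_basis(i, t, a, b, m):
    """The tent basis L_i^m(t) of eqs. (5)-(7) on the uniform mesh
    t_j = a + j*h, h = (b - a)/m: isosceles triangles of height 1 and
    base 2h at interior knots, right triangles at i = 0 and i = m."""
    h = (b - a) / m
    t = np.asarray(t, dtype=float)
    t_i = a + i * h
    rising = (t - (t_i - h)) / h              # edge on [t_{i-1}, t_i]
    falling = ((t_i + h) - t) / h             # edge on [t_i, t_{i+1}]
    # For i = 0 (resp. i = m) the rising (resp. falling) edge lies outside
    # [a, b] and never binds, which yields the right triangles of (6)-(7).
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

# L_i^m(t_j) = 1 if i = j and 0 otherwise, so the spline of eq. (8)
# below interpolates its knot values.
a, b, m = -1.0, 1.0, 8
knots = np.linspace(a, b, m + 1)
assert np.allclose(hat_basis(3, knots, a, b, m), np.eye(m + 1)[3])
```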
Now for f_i as in (1) we have

    f(t) = Σ_{i=0}^m f_i L_i^m(t).    (8)

So (8) represents a linear spline with knots at t_i, i = 0, ..., m, constant mesh size h, domain [t_0, t_m], and spline weights {f_i}_{i=0}^m.

We need some facts about spline approximation. Let g(t) be an arbitrary function in C²[a, b], let t_i be as above, and let

    f_i = g(t_i),    i = 0, ..., m.

Then f(t) = Σ f_i L_i^m(t) is the linear spline approximation to g(t), and we have the error estimate

    ‖f − g‖_∞ ≤ (‖g″‖_∞/4) h².    (9)
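The O(h²) rate in (9) is easy to verify numerically. The snippet below, reusing the linear_spline sketch above with a test function of my own choosing, compares the sup-norm error with the bound as the mesh is refined:

```python
import numpy as np

a, b = -1.0, 1.0
g = lambda t: np.exp(-t ** 2)                       # test function, my choice
tt = np.linspace(a, b, 5001)
g2_sup = np.max(np.abs((4 * tt ** 2 - 2) * np.exp(-tt ** 2)))   # sup |g''|

for m in (4, 8, 16, 32):
    h = (b - a) / m
    f_i = g(np.linspace(a, b, m + 1))               # f_i = g(t_i)
    err = np.max(np.abs(linear_spline(tt, a, b, f_i) - g(tt)))
    bound = (g2_sup / 4) * h ** 2                   # right side of (9)
    print(m, err, bound)                            # err stays below bound
```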
We mention in passing that if ω(f, h) is the modulus of continuity of f with respect to h, then

    ‖f − g‖_∞ ≤ ω(f, h),    (10)

so that if g ∈ C¹(a, b),

    ‖f − g‖_∞ ≤ (‖f′‖/α) h,    (11)

where α is a constant. And if g is simply continuous, then (10) defines the level
of approximation. We also need to approximate integrals; an obvious approximation to

    G(t) = ∫_a^t g(s) ds

is F(t) = ∫_a^t f(s) ds, where g and f are as above. Define

    M_i^m(t) = ∫_a^t L_i^m(s) ds;    (12)

then

    F(t) = Σ_{i=0}^m f_i M_i^m(t).

Note that M_i^m(t) is piecewise quadratic and an ogive; indeed, M_i^m(t)/h is a probability distribution function. Again using Prenter (1976, p. 46) we have

    |F(t) − G(t)| ≤ (‖g″‖_∞/4)(t − a) h²,

or

    ‖F(t) − G(t)‖_∞ ≤ (‖g″‖_∞/4)(b − a) h².

It is useful to note that

    ∫_{−∞}^∞ L_i^m(t) L_j^m(t) dt = 1    if i = j,
                                  = 0    otherwise.
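The ogive M_i^m of (12) can be tabulated from the hat functions; a minimal sketch by quadrature (the closed piecewise-quadratic form would do as well), reusing the hat_basis helper above:

```python
import numpy as np

def integrated_hat(i, t, a, b, m, grid_size=2001):
    """M_i^m(t) of eq. (12): the integral of L_i^m from a to t, computed
    here by numerical quadrature.  M_i^m is an ogive and M_i^m(t)/h is a
    probability distribution function."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.empty_like(t)
    for j, tj in enumerate(t):
        s = np.linspace(a, max(tj, a), grid_size)
        out[j] = np.trapz(hat_basis(i, s, a, b, m), s)
    return out

# An interior hat integrates to h, so M_i^m(b)/h = 1: a proper c.d.f.
a, b, m = -1.0, 1.0, 8
h = (b - a) / m
print(integrated_hat(3, b, a, b, m) / h)            # ~ 1.0
```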
Also, the two spaces

    𝔉* = {(f_0, ..., f_m): f_i ∈ A}    and    𝔉 = {Σ_{i=0}^m f_i L_i^m(t): f_i ∈ A}

are isomorphic. Moreover, if A is taken to be ℝ then 𝔉* and 𝔉 are linear. Consequently, 𝔉 has dimension m + 1 and basis {L_i^m(t)}_{i=0}^m.
5. Sieves

Consider now a space S(m + 1) of all linear splines over [a, b] with m + 1 knots and fixed mesh size h. Construct a finer space S(2m + 1) from S(m + 1) by halving h, maintaining the original knots and locating a new knot halfway between each pair of old knots. We may thus generate a sequence of sets {S(2ᵏm + 1)}_{k=0}^∞. The sequence is increasing,

    S(2ᵏm + 1) ⊆ S(2^{k+1}m + 1),    k = 0, 1, ...,    (14)

and has a limit that is dense in C[a, b]. That is,

    lim_{k→∞} sup_{g∈C[a,b]} inf {‖f − g‖_∞: f ∈ S(2ᵏm + 1)} = 0.    (15)
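The dyadic refinement that generates {S(2ᵏm + 1)} amounts to nothing more than halving the mesh while retaining the old knots; a minimal sketch:

```python
import numpy as np

def sieve_knots(a, b, m, k):
    """Knot set of S(2**k * m + 1): m initial intervals on [a, b], with
    the mesh halved k times while every old knot is retained."""
    return np.linspace(a, b, (2 ** k) * m + 1)

# Each refinement contains the previous knots, so the spline spaces are
# nested as in (14):
k0 = sieve_knots(0.0, 1.0, 4, 0)
k1 = sieve_knots(0.0, 1.0, 4, 1)
assert set(np.round(k0, 12)) <= set(np.round(k1, 12))
```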
As a further refinement let

    𝔉̂(k) = {f ∈ S(2ᵏm + 1):    (16)
             ∫_a^b f(t) dt = 1,  f ≥ 0}.    (17)

Then one may show that 𝔉̂(k) is isomorphic to the set of all vectors of knot values (f_0, ..., f_{2ᵏm+1}) satisfying the same constraints; since this is a closed and bounded subset of a finite-dimensional vector space, 𝔉̂(k) is compact.
The density of an observation (y, x) in the censored regression model is then

    g(y, x; θ) = (1/θ_1) α((y − x'θ_2)/θ_1) h_λ(x),        y > 0,  x ∈ X,
               = (∫_{−∞}^{−x'θ_2/θ_1} α(t) dt) h_λ(x),     y = 0,  x ∈ X,    (20)

where α ∈ 𝔄, (θ_1, θ_2) ∈ Θ and h_λ(x) is a density on X.
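For concreteness, here is a sketch of how (20) is evaluated when α is a linear spline: the uncensored branch evaluates the rescaled spline density, and the censored branch integrates it up to the index cutoff. All function and parameter names are mine, and the marginal h_λ(x) is omitted since it does not involve θ; the linear_spline helper is the one sketched in section 4.

```python
import numpy as np

def censored_density(y, x, theta1, theta2, alpha_knots, a, b):
    """Sketch of the observation density (20) when alpha is the linear
    spline with knot values alpha_knots on [a, b]."""
    index = float(np.dot(x, theta2))
    if y > 0:
        z = (y - index) / theta1
        if z < a or z > b:
            return 0.0                        # spline density vanishes off [a, b]
        return float(linear_spline(z, a, b, alpha_knots)) / theta1
    # censored cell: mass of the spline density below -x'theta2/theta1
    cut = min(max(-index / theta1, a), b)
    s = np.linspace(a, cut, 2001)
    return float(np.trapz(linear_spline(s, a, b, alpha_knots), s))

def log_quasi_likelihood(ys, xs, theta1, theta2, alpha_knots, a, b):
    """The quasi-likelihood that the method of this paper maximizes."""
    total = 0.0
    for y, x in zip(ys, xs):
        d = censored_density(y, x, theta1, theta2, alpha_knots, a, b)
        total += np.log(max(d, 1e-300))       # floor avoids log(0) in the sketch
    return total
```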
Finally, we need the following assumptions for the 'true' α.

Assumption 1. Let

    g_0(y, x) = g(y, x; θ_0),

where α(·) and h_0(·) are the 'true' densities; then for all α and θ' = (θ_1, θ_2') we have

    |ln g(y, x; θ)/g_0(y, x)| ≤ M(y, x),
and M(y, x) is integrable. This condition would be common in consistency proofs for maximum likelihood in the case that α were known.

Assumption 2. For g(y, x; θ) as above and for all α and θ' = (θ_1, θ_2') ∈ Θ, E(|y| | x) and E|x| exist and are finite.

Assumption 3. The sets {x: −1 ≤ x'θ_2/θ_1 ≤ 1} have positive probability.
(i) For C_m ⊂ A, C_m → a means sup_{b∈C_m} ‖a − b‖_A → 0.
Theorem 1 below also refers to the following conditions:

C.1. For every m and every n, Â_n^m is almost surely (dP_{a_0}) non-empty.

C.2. If, for some sequence a_m ∈ 𝒮_m, K(a_0, a_m) → 0, then a_m → a_0.

C.3. There exists a sequence a_m ∈ 𝒮_m such that K(a_0, a_m) → 0.

Finally, for each δ > 0 and each m, define

    D_δ = {a ∈ 𝒮_m: H(a_0, a) ≤ H(a_0, a_m) − δ},

where a_m is the sequence in C.3. Given l sets O_1, ..., O_l in 𝒮_m such that g(·, O_k) is measurable for each k, define

    g(·, O_k) = sup_{a∈O_k} g(·; a).
The following theorem holds generally, not just in the censored regression case considered here.

Theorem 1 [Geman and Hwang (1982)]. Assume {𝒮_m} is chosen so that C.1-C.3 are in force and let m(n) be a sequence diverging to ∞ with n. If for each δ > 0 there are sets O_1^m, O_2^m, ..., O_{l_m}^m in 𝒮_m, m = 1, 2, ..., such that

(i) D_δ ⊆ ∪_{k=1}^{l_m} O_k^m,

(ii) g(·, O_k^m) is measurable,

(iii) Σ_{n=1}^∞ l_{m(n)} (ρ_{m(n)})^n < ∞,

then Â_n^{m(n)} → a_0 almost surely.
For our application we need a relationship between convergence of Kullback-Leibler information and L_1 convergence. In this regard we have the following lemma from Geman (1981).

Lemma 1. Let f_0(x) be a density function satisfying

    |∫_{−∞}^∞ f_0(x) ln f_0(x) dx| < ∞.

If for each n, T_n is a collection of density functions, and if

    lim_{n→∞} sup_{f∈T_n} ∫_{−∞}^∞ f_0(x) ln{f_0(x)/f(x)} dx = 0,

then also

    lim_{n→∞} sup_{f∈T_n} ∫_{−∞}^∞ |f(x) − f_0(x)| dx = 0.
We now state the major result of this paper. Let δ_i = 1 if y_i > 0 and let δ_i = 0 otherwise; then we may write the log-likelihood as

    ln L_n(a) = Σ_{i=1}^n δ_i [ln α((y_i − x_i'β)/σ) − ln σ] + Σ_{i=1}^n (1 − δ_i) ln ∫_{−∞}^{−x_i'β/σ} α(t) dt,

and â_{m(n)} = (α̂_n, β̂_n, σ̂_{m(n)}) is the modified spline maximum likelihood estimator. Note that we abuse notation and identify α, the vector of knot values of the density, with α(·), the spline density. Employing Geman and Hwang (1982) we have the following theorem, which is the main result of this paper.

Theorem 2. Let θ̂_{m(n)} be the modified spline maximum likelihood estimator; then under the assumptions of the model θ̂_{m(n)} → θ_0 almost surely, provided that for some ε > 0, m = m(n) = O(n^{1−ε}) or k = O((1 − ε) log n).
Proof. We check the conditions of Geman and Hwang (1982) and Theorem 1 in a series of lemmas and propositions. The proofs are in the appendix.

8. Distribution theory

A complete asymptotic distribution theory is unavailable. This section sketches a partial theory. The problem in applying the usual theorems is the
fact that the number of parameters grows with the sample size. However, if one is willing to accept a level of asymptotic bias, then, using White (1982), one can achieve a sort of distribution theory by fixing the mesh size at a suitably small value. The estimated parameters converge to the parameters that minimize the Kullback-Leibler norm. Suitably normalized, the distribution of the estimated parameters may be approximated by a multivariate normal distribution. If one assumes the world may be characterized by a fixed-knot spline then, using Duncan (1980b), one has both consistency and normality. The deficiency here is that one has made a distributional assumption, and we are trying to avoid that. The deficiency in applying White (1982) is that I have been unable to develop a relationship between the 'true' parameters and their Kullback-Leibler approximations. In particular, I have been unable to develop a relationship between the Kullback-Leibler 'approximation' errors and the usual L_1 or L_2 approximation errors.

9. Empirical results

Lacking asymptotic distribution results is a problem; since the worth of a technique must be judged not on how well it ought to work in huge samples, but on how well it actually performs in moderate samples, I present here the results of some limited simulations. These simulations are meant to be indicative of the promise of the method rather than to provide a definitive justification of its use. There are two sets of results; the first gives point estimate comparisons between various estimates, including the ordinary least squares estimates for the complete (uncensored) sample. Monte Carlo simulations generated the second set of results, which include (Monte Carlo) standard error estimates. Currently, I bootstrap to obtain confidence intervals.

The first set of results was generated in the following manner. The independent variables were generated independent U(−20, 20), parameter values were chosen, and an error distribution was chosen. A sample size was chosen, the inner product of the true parameters and randomly generated independent variables was calculated for each observation, and an error was generated from the chosen error distribution and added to the 'true' value of the dependent variable. Ordinary least squares (OLS) was performed on these data. Next the dependent variable was replaced by a censored version; that is, negative values were replaced by zeros. Ordinary least squares (OLSC) and a TOBIT were run. Also, the method of this paper was applied with varying numbers of knots [C(K)]. Finally, the observations with zero values for the dependent variable were tossed out, and ordinary least squares (OLST) was performed on the resulting truncated data. My experience has been that the sieve method gives extraordinarily good estimates of the underlying regression coefficients, excluding the intercept, where by good I mean in terms of agreement with the ordinary least squares estimates that used the complete and uncensored data.
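The experimental design just described is easy to replicate. A sketch follows, using one of the mixture designs of Table 1d below; the helper names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_censored_sample(n, beta, error_draw):
    """x ~ U(-20, 20) independently, y* = x'beta + eps, and the observed
    y is y* censored below at zero."""
    X = rng.uniform(-20.0, 20.0, size=(n, len(beta)))
    y_star = X @ np.asarray(beta) + error_draw(n)
    y = np.maximum(y_star, 0.0)               # negative values replaced by zeros
    return X, y_star, y

def mixture_errors(n):                        # cf. the design of Table 1d
    pick = rng.random(n) < 0.5
    return np.where(pick, rng.normal(5.0, 1.0, n), rng.normal(0.0, 1.0, n))

X, y_star, y = make_censored_sample(600, [-4.0, 1.0, 5.0], mixture_errors)
beta_ols = np.linalg.lstsq(X, y_star, rcond=None)[0]          # OLS, uncensored
beta_olsc = np.linalg.lstsq(X, y, rcond=None)[0]              # OLSC, censored
keep = y > 0
beta_olst = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # OLST, truncated
```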
Table 1a. Parameters and estimates: ε ∼ t(10), n = 400.ᵃ

            θ₁
TRUE        20
OLS         20.08
C(3)        20
C(6)        20
C(12)       19.9
C(24)       19.9
OLSC        16.9
OLST        19.9
TOBIT       19.9

ᵃ C(i) = sieve/spline estimator with i knots. TRUE = true value used in estimation.

Table 1b. ε ∼ t(20), n = 400.

            θ₁         θ₂         θ₃         θ₄
TRUE        -10        0.02       -0.05      0.001
OLS         -9.8       0.339      -0.053     -0.103
C(3)        -10        0          0          0
C(6)        -10        0.066      -0.046     -0.066
C(12)       -10        0.1        -0.01      -0.08
C(24)       -10        0.117      0.005      -0.08
OLSC        -1.5       0.014      0.004      -0.06
OLST        -9.63      -0.016     -0.04      -0.18
TOBIT       -9.66      0.185      0.009      -0.13

Table 1c. ε ∼ 0.5N(5,100) + 0.5N(10,25) ⊕ 0.1, n = 600.ᵇ

            θ₁         θ₂         θ₃
TRUE        -4         1          5
OLS         -4.1       0.87       5.9
C(3)        -2.91      -2.53      -2.09
C(6)        -2.95      -2.51      -2.08
OLSC        -3.6       0.69       5.3
TOBIT       -3.99      0.98       4.98

ᵇ ⊕ = mixture.

Table 1d. ε ∼ 0.5N(5,1) + 0.5N(0,1) ⊕ 0.1, n = 600.

            θ₁         θ₂         θ₃
TRUE        -4         1          5
OLS         -4         0.99       4.99
C(3)        -4.0       1.0        5.0
C(6)        -4.02      1.00       5.00
C(12)       -4.03      0.992      4.99
OLSC        -2.86      0.66       3.69
TOBIT       -4.03      0.99       5.03

Table 1e. ε ∼ 0.5N(5,1) + 0.5N(0,1) ⊕ 0.0, n = 500.

            θ₁         θ₂         θ₃
TRUE        -4         1          5
OLS         -4.07      1.3        4.94
C(10)       -4         0.81       3.25
C(20)       -3.99      0.82       3.25
OLSC        -3.04      0.93       3.48
TOBIT       -5.81      2.6        5.53

Table 1f. ε ∼ 0.8N(0,1) ⊕ 0.2N(0,10), n = 200.

            θ₁         θ₂
TRUE        2          -2
OLS         1.86       -1.87
C(3)        2.00       -2.00
C(6)        2.04       -1.91
C(12)       2.00       -1.86
C(24)       1.99       -1.84
OLSC        1.24       -0.646
OLST        1.58       -1.13
TOBIT       1.74       -1.64
The model is y_t = Σ_{i=1}^K β_i x_{ti} + ε_t, t = 1, ..., N, with x_{ti} ∼ U(−20, 20) independently. In this first set of results, intercepts, the scale term and the spline weights are unreported.

During this first set of runs, which I made basically to test the program, I found that except where the signal-to-noise ratio was low, the sieve performed
almost as well as the ordinary least squares estimator (OLS). The number of knots didn't seem to matter.

Initially, I imposed the zero-median constraint, the unity constraint, and the interquartile-range constraint using penalty functions. I found that the penalty functions were never sufficiently severe to make the constraints bind exactly and still allow convergence. And I found that the resulting 'density' estimates were strange: highly oscillating if a large number of knots were used. I also found a kind of identification problem: the scale factor was 'off' proportionally to the same extent the density was unnormalized. That is, σ̂ = 4 and ∫f̂ = 4 when σ = 1 and ∫f = 1. In short, even though f̂ and σ̂ were inconsistent, f̂/σ̂ was a density and its interquartile range was approximately the right one. I found a similar problem with the intercept. The intercept was off by approximately the same amount that f̂/σ̂ was shifted off zero. That is, if β̂_0 should have been 10, I might find β̂_0 = 5 and median(f̂/σ̂) = 5; the sum of the two was the correct value. As a result I reprogrammed and imposed the unity constraint directly. Then I deviated from the model of the paper: I did not estimate the intercept or the scale; instead I let the domain of the density that was approximated by a spline be [a, b] and estimated a and b. These results are presented in the next tables. The scale parameter is the IQR calculated from the estimated density, and the intercept is the calculated median; a sketch of this reparameterization follows. Also, for these estimates I calculated the Monte Carlo standard errors based on 100 replications each. Clearly, more work must be done, but I found these results encouraging. Particularly troublesome are the oscillation of the estimated density and the difficulty in obtaining intercept and scale estimates. The former problem is common in density estimation; I don't know why the intercept and scale are so difficult to estimate.
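A sketch of the reparameterization just described, reusing the linear_spline helper from section 4: the density is estimated on [a, b], and the intercept and scale are read off as the median and interquartile range of the estimated density.

```python
import numpy as np

def median_and_iqr(alpha_knots, a, b, grid_size=4001):
    """Median and IQR of the estimated spline density on [a, b]."""
    t = np.linspace(a, b, grid_size)
    dens = linear_spline(t, a, b, alpha_knots)
    cdf = np.cumsum(dens)
    cdf = cdf / cdf[-1]                       # numerical normalization
    quantile = lambda p: t[np.searchsorted(cdf, p)]
    return quantile(0.5), quantile(0.75) - quantile(0.25)

med, iqr = median_and_iqr(np.ones(9), -1.0, 1.0)   # uniform: med ~ 0, IQR ~ 1
```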
Table 2a. ε ∼ N(0, 1), n = 1000.ᵃ

            θ₁         θ₂         Median     IQR
TRUE        -3         4          0          0.68
C(17)       -3.00      4.00       1.56       1.07
SI          (0.028)    (0.059)    (1.08)     (0.235)

ᵃ SI = estimate of the sampling variance of the estimates based on 100 replications.

Table 2b. ε ∼ N(0, 4), E(ε) = 1.5, n = 1000.

            θ₁         θ₂         Median     IQR
TRUE        -2         3          1.5        2.8
C(17)       -2.36      3.52       1.34       2.21
SI          (0.86)     (0.74)     (1.44)     (1.77)
By 'difficult' I mean that I need much more data than is required to obtain respectable coefficient estimates. I suspect that it derives from the fact that simple differentiability of the density is sufficient to identify the coefficients, whereas the precise density is required to identify the scale and intercept (cf. Lemma 4 of the appendix).

Appendix: Proof of Theorem 2

Lemma 2. Under the assumptions of this paper, C.1 holds.

Proof. Write
    ln L_n(a) = Σ_{i=1}^n δ_i ln[(1/τ) f((y_i − x_i'β)/τ)] + Σ_{i=1}^n (ln ∫_{−∞}^{−x_i'β/τ} f(t) dt)(1 − δ_i),

where a = {f, τ, β'} and f ∈ S(m); then

    ln L_n(a) = −Σ_{i=1}^n δ_i ln τ + Σ_{i=1}^n δ_i ln[Σ_{j=1}^{2ᵏm+1} a_j L_j^{2ᵏm+1}((y_i − x_i'β)/τ)]
                + Σ_{i=1}^n (1 − δ_i) ln[Σ_{j=1}^{2ᵏm+1} a_j M_j^{2ᵏm+1}(−x_i'β/τ)].
Now, ln L_n(a) is a continuous function of (a, τ, β) which, for each m, lies in a compact set 𝒮_m; hence ln L_n(a) achieves its maximum on 𝒮_m, and Â_n^m is non-empty for each n, m.    Q.E.D.

Lemma 3. Under the assumptions of this paper, C.3 is satisfied.

Proof. Write the Kullback-Leibler information as

    K(a_0, a) = ∫∫ ln[g(y, x; a_0)/g(y, x; a)] g(y, x; a_0) dy dx;
use the change of variable ξ = (y − x'β)/τ, also define

    λ = τ/θ_0,    Δ = (β − θ_2)/θ_0,

and write

    K(a_0, a) = ∫∫ [ln ᾱ(ξ, x) − ln f(ξ)] ᾱ(ξ, x) dξ h(x) dx
                + ∫ [ln G(−x'θ_2/θ_0) − ln F(−x'β/τ)] G(−x'θ_2/θ_0) h(x) dx,

where ᾱ(ξ, x) = λα(λξ + x'Δ),

    G(t) = ∫_{−∞}^t α(s) ds    and    F(t) = ∫_{−∞}^t f(s) ds.
Now in each 𝒮_m there are (α, θ) arbitrarily close to (θ_0, θ_2); so also there are λ arbitrarily close to one and Δ arbitrarily close to zero. Pick a sequence (τ_m, β_m) → (θ_0, θ_2); then, since α(·) is continuous,

    λα(λξ + x'Δ) → α(ξ),

and similarly

    ln G(−x'θ_2/τ + x'Δ) → ln G(−x'θ_2/θ_0),

and so on. Now f is a linear spline and F is its integral, so both are continuous; hence

    ln F(−x'β/τ) → ln F(−x'θ_2/θ_0).
Also, f may be chosen so that sup|α − f| = o(h²) and also sup|F − G| = o(h²); this follows from the theory of spline approximations and their integrals [see again Prenter (1976, p. 47)]. Now write

    K(a_0, a) = ∫∫ [ln α(t) − ln f(t)] α(t) dt h(x) dx
                + ∫ [ln G(−x'θ_2/θ_0) − ln F(−x'θ_2/θ_0)] G(−x'θ_2/θ_0) h(x) dx
                + ∫∫ [ln ᾱ(t, x) − ln α(t)] ᾱ(t, x) dt h(x) dx
                + ∫ [ln G(−x'θ_2/τ + x'Δ) − ln G(−x'θ_2/θ_0)] G(−x'θ_2/τ + x'Δ) h(x) dx
                + ∫∫ [ln α(t) − ln f(t)] (ᾱ(t, x) − α(t)) dt h(x) dx
                + ∫ [ln G(−x'θ_2/θ_0) − ln F(−x'β/τ)] [G(−x'θ_2/τ + x'Δ) − G(−x'θ_2/θ_0)] h(x) dx

              = A_1 + A_2 + A_3 + A_4 + A_5 + A_6,    (A.1)

where ᾱ(t, x) = λα(λt + x'Δ).
Now by Assumption 1, |ln α(t)| ᾱ(t, x) is dominated by an integrable function. Similarly, ln f and ln F are linear in the tails, so

    |ln f(z)| λα(λz + x'Δ) ≤ K|z| λα(λz + x'Δ) h(x),

and the corresponding bound holds for |ln F(−x'β/τ)| G(−x'θ_2/τ + x'Δ), where the right-hand sides are integrable by Assumption 2. Hence choose f, τ, and β as above and, using dominated convergence, we have term by term A_i = o(h²) and

    |A_i| ≤ c(x, y),

where c(x, y) is integrable; hence we have shown the existence of a sequence such that K(a_0, a_m) → 0.    Q.E.D.

Lemma 4 (Identification). Under the assumptions of this paper, g(z; a) = g(z; b), ∀z, implies a = b.

Proof. Assume g(z; a) = g(z; b); then

    (1/τ) g((y − x'γ)/τ) = (1/σ) f((y − x'β)/σ),    y > 0,
    G(−x'γ/τ) = F(−x'β/σ),    y = 0,    (A.2)

where a = (g, τ, γ), b = (f, σ, β),

    G(t) = ∫_{−∞}^t g(s) ds,    F(t) = ∫_{−∞}^t f(s) ds.
(a) γ = β. Under our assumptions g and f are in C²(−1, 1), hence are everywhere differentiable. Differentiating (A.2) above with respect to x and y gives

    −(1/τ²) γ g′((y − x'γ)/τ) = −(1/σ²) β f′((y − x'β)/σ),

    (1/τ²) g′((y − x'γ)/τ) = (1/σ²) f′((y − x'β)/σ),

hence γ = β. Note that here differentiation with respect to a 'constant' x makes no sense, so the intercept is, so far, undetermined.
(b) σ = τ. Take x = x_0 with

    x_0'β = x_0'γ = a;

then
    (1/τ) g((y − a)/τ) = (1/σ) f((y − a)/σ).    (A.3)

Let z = (y − a)/τ; then (A.3) becomes

    (1/τ) g(z) = (1/σ) f((τ/σ) z),    ∀z > −a/τ.    (A.4)

Using Assumption 3, let a/τ ≥ ½. Since

    ∫_{−1/2}^{1/2} α(z) dz = ½,    ∀α ∈ 𝔄,

integrating both sides of (A.4) with respect to z over (−½, ½) and changing variables gives

    ∫_{−σ/2τ}^{σ/2τ} g(t) dt = ½.    (A.5)

Now g(t) is a density with respect to dt, so μ(a, b) = ∫_a^b g(t) dt is a measure that is absolutely continuous with respect to dt, and (A.4) implies μ(−½, ½) = ½.
Likewise, (A.5) implies

    μ(−σ/2τ, σ/2τ) = ½.

Since either (−½, ½) ⊆ (−σ/2τ, σ/2τ) or vice versa, we have

    μ[(−σ/2τ, σ/2τ) Δ (−½, ½)] = 0,

where Δ is the symmetric difference [see Halmos (1950)]; hence

    σ/2τ = ½,    or    σ/τ = 1.
(c) β_0 = γ_0 (where the 0 indicates the constant term). Utilizing (a) and (b) and taking a = x_0'β, where now x_0 does not contain an intercept, we may write

    (1/σ) f((y − β_0 − a)/σ) = (1/σ) g((y − γ_0 − a)/σ)

for all a in the range of x'β. Using a change of variable, we may write

    f(z) = g(z − (γ_0 − β_0)/σ),    ∀z > −(a + β_0)/σ.

Using Assumption 3, x_0 can be chosen so that −(a + β_0)/σ ≤ −½. Then integrating both sides with respect to z gives
    μ(−½ − (γ_0 − β_0)/σ, ½ − (γ_0 − β_0)/σ) = μ(−½, ½)

from the median constraint. Since measures are shift-invariant, this implies

    (γ_0 − β_0)/σ = 0,    or    γ_0 = β_0.

(d) f = g. Finally, since γ = β and τ = σ, we have

    g(t) = f(t),    ∀t ≥ −x'β/σ,

    G(t) = F(t),    ∀t ≤ −x'β/σ.

Since g, f ∈ C²(−1, 1), the latter equality implies, by differentiation,

    g(t) = f(t),    ∀t ≤ −x'β/σ,

or f(t) = g(t).    Q.E.D.

Lemma 5. Under the assumptions of this paper, C.2 is satisfied.

Proof.
We need to show that K(a, a_0) → 0 implies ‖a − a_0‖_A → 0. Referring to (A.1) of Lemma 3, we write

    K(a, a_0) = (A_1 + A_2) + (A_3 + A_4) + (A_5 + A_6).

The sum (A_1 + A_2) is the Kullback-Leibler information for known α and unknown f; we denote it

    K_f(α, f).

The sum (A_3 + A_4) is the Kullback-Leibler information for known α, γ, τ and unknown λ, Δ; we denote it

    K_λ(1, 0; λ, Δ).

Note

    ∫_{−∞}^{−x'β/σ} = ∫_{−∞}^{(τ/σ)[−x'β/τ − x'Δ]}.

Finally, the sum (A_5 + A_6) has no particular interpretation, but will be denoted D. Thus

    K(a, a_0) = K_f(α, f_m) + K_λ(1, 0; λ_m, Δ_m) + D.

The fundamental information inequality tells us that

    K(a, a_0) ≥ 0,    K_f(α, f) ≥ 0,    K_λ(1, 0; λ, Δ) ≥ 0.

We first show that

    K(a, a_0) → 0  implies  K_f → 0,  K_λ → 0.

Say that K(a, a_0) → 0 but K_f or K_λ does not converge to 0. Then K_f + K_λ + D → 0, and D = K(a, a_0) − (K_f + K_λ) would be negative and bounded away from zero along a subsequence, while the dominated convergence argument of Lemma 3 shows D → 0, a contradiction. Hence K_f → 0 and K_λ → 0; applying Lemma 1 and the identification Lemma 4, we have

    sup{|α − f_m|, |λ_m − 1|, |Δ_m|} → 0.    Q.E.D.
Finally, we need a covering for the epigraph of H(a_0, a) for a ∈ 𝒮_m. That is, we need to construct a suitable covering O_k. Recall for each k that

    f(t) = Σ_{i=1}^{2ᵏ+1} a_i L_i^{2ᵏ+1}(t),

where a_i ≤ 2ᵏ + 1 for each i. Let m = 2ᵏ + 1. Now take

    a_i = p/m²    for some    p = 0, ..., m³,

    a_0 = a_{2ᵏ+1} = 1/m²,

and associate with each m-tuple a the set

    A_m(a) = {b ∈ S(2ᵏ + 1): |a − b| ≤ 1/m²}.
0 such that r(t) < 1, or that 3t > 0 such that and is negative, then we are done. The former case
r(t) is convex in t, it is If we can show either that r(t) exists and r’(0) exists is obvious, in the latter case
G.
Duncan, A semi-parametric censored regression estimator
31
convexity and r’(0) < 0 implies inf, t 0 r(t) < r(0) = 1. Following Example 1 in Geman and Hwang (1982), we show the latter. Fix 0$, and b, = (a,, fij) ~9’(k). Then (r’(O))=E[lng(t;a)-lng(z;b,)]
+E(lng(z;b,)-lng(z;a,)).
By definition, the latter expectation is ≤ −δ < 0; hence

    r′(0) ≤ E[ln g(z; a) − ln g(z; b_k)] − δ.

If we can show

    |E[ln g(z; a) − ln g(z; b_k)]| → 0    as    m → ∞,

then we are done, since then, for m large enough, r′(0) < 0. Now

    E[ln g(z; a) − ln g(z; b_k)] = ∫∫ [ln α((y − x'β)/σ) − ln α_k((y − x'β_k)/σ_k)] g(z; a_0) dy dx
                                 + ∫ [ln G(−x'β/σ) − ln G_k(−x'β_k/σ_k)] h(x) dx.
(a) Since σ and τ are in Θ_m, we have |σ − τ| ≤ 1/√m and the spline weights differ by at most 2/m, so, for z ≥ −1, each of the terms above is of order

    O(1/√m).

Hence
    r′(0) = O(1/√m) − δ.

Finally, we need to show that, for some t > 0, r(t) exists; for if r(t_0) exists, then r(t) exists for all t < t_0. But convexity and r′(0) < 0 imply that, for some t > 0, r(t) < 1. Thus

    ρ_m ≤ p < 1

for m large enough. So
    Σ_{m=1}^∞ l_m (ρ_m)^n ≤ Σ_{m=1}^∞ l_m (c_m)^{1/2} (p)^n = Σ_{m=1}^∞ q_m.

Using the root test we find that, with p < 1, Σ q_m is finite if m = O(n^{1−ε}), ε > 0. Finally, since m = 2ᵏ + 1,

    k = ln(m − 1)/ln 2 = O((1 − ε) log n).¹
¹ The key feature in the above proof is that, if f(t) − g(t) = ah², then ln f(t) − ln g(t) ≤ ah² if ah² is small enough, since the logarithm is a concave function. Also implied is F(t) − G(t) = bh², so that ln F(t) − ln G(t) ≤ bh²; dominated convergence does the rest.
References

Amemiya, T., 1973, Regression analysis when the dependent variable is truncated normal, Econometrica 41, 997-1016.
Dieudonné, J., 1960, Foundations of modern analysis (Wiley, New York).
Duncan, G.M., 1980a, Formulation and statistical analysis of the mixed continuous/discrete choice model in classical production theory, Econometrica 48, 839-852.
Duncan, G.M., 1980b, A relatively distribution robust censored regression estimator, Working paper (Washington State University, Pullman, WA).
Duncan, G.M., 1983, On the use and misuse of Gaussian selectivity corrections: Selection bias as a proxy variable problem, Research in Labor Economics 6, 333-345.
Geman, S. and C.R. Hwang, 1982, Nonparametric maximum likelihood estimation by the method of sieves, Annals of Statistics 10, 401-414.
Grenander, U., 1981, Abstract inference (Wiley, New York).
Halmos, P., 1950, Measure theory (Van Nostrand, Princeton, NJ).
Heckman, J., 1976, The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models, Annals of Economic and Social Measurement 5, 475-492.
Heckman, J., 1978, Dummy endogenous variables in a simultaneous equation system, Econometrica 46, 931-961.
Hoadley, B., 1971, Asymptotic properties of maximum likelihood estimators for the independent non-identically distributed case, Annals of Mathematical Statistics 42, 1977-1991.
Huber, P., 1967, The behavior of maximum likelihood estimates under non-standard conditions, in: Fifth Berkeley symposium on mathematical statistics and probability, Vol. 1 (University of California Press, Berkeley, CA).
Lee, L.F. and R.P. Trost, 1978, Estimation of some limited dependent variable models with applications to housing demand, Journal of Econometrics 8, 357-382.
Nelson, F., 1977, Censored regression models with unobserved, stochastic censoring thresholds, Journal of Econometrics 6, 309-328.
Prenter, P.M., 1976, Splines and variational methods (Wiley, New York).
Schmidt, P., 1978, Estimation of a simultaneous equations model with jointly dependent continuous and qualitative variables: The union-earnings question revisited, International Economic Review 19, 453-465.
Tobin, J., 1958, Estimation of relationships for limited dependent variables, Econometrica 26, 24-36.
White, H., 1982, Maximum likelihood estimation of misspecified models, Econometrica 50, 1-25.