Estimation of Discrete Response Models Under Multiplicative Heteroscedasticity

Songnian Chen
Hong Kong University of Science and Technology

Shakeeb Khan
University of Rochester

September 1999 (First Draft)
Abstract

In this paper, we consider estimation of discrete response models exhibiting conditional heteroscedasticity of a multiplicative form, where the latent error term is assumed to be the product of an unknown scale function and a homoscedastic error term. It is first shown that under this type of restriction, even when the homoscedastic error term is parametrically specified, the semiparametric information bound for the binary choice model is zero. Hence it is impossible to attain the parametric convergence rate for the parameters of interest. However, for ordered response models where the response variable can take at least three different values, the parameters of interest can be estimated at the parametric rate under the multiplicative heteroscedasticity assumption. Two estimation procedures are proposed. The first estimator, based on a parametric restriction on the homoscedastic component of the error term, is a two step MLE, where the unknown scale function is estimated nonparametrically in the first stage. The second procedure, which does not require the parametric restriction, estimates the parameters by a vector which is orthogonal to a kernel weighted regressor matrix. Under regularity conditions which are standard in the literature, both estimators are shown to be $\sqrt{n}$-consistent and asymptotically normal.
Key Words: binary choice, ordered response, multiplicative heteroscedasticity, semiparametric efficiency bound.
Corresponding author. Department of Economics, University of Rochester, Rochester, NY 14627; e-mail:
[email protected]. We are grateful to J. Powell and B. Honore for their helpful comments.
1 Introduction and Motivation

Models of discrete choice have always received a great deal of attention in both theoretical and applied econometrics, as they allow for the indivisibility or aggregation that arises in many data generating processes.¹ To relate a discrete valued response variable to a set of explanatory variables, econometricians most often work within the linear latent variable framework, which, in some cases, models a single latent (i.e. unobserved) dependent variable as a linear function of the explanatory variables, with an additive "error" term:

$$y_i^* = x_i'\beta_0 + \epsilon_i \tag{1.1}$$
Binary and ordered response models relate the observed dependent variable to the latent variable through a partition of the real line, which classifies the observed dependent variable according to where on the line the latent variable lies. Many estimation procedures have been proposed for the parameters of interest $\beta_0$ in these models. The parametric approach assumes statistical independence between the latent error term $\epsilon_i$ and the explanatory variables $x_i$, and imposes parametric restrictions on the distribution of the latent error term. Under these assumptions, maximum likelihood estimation can be used to estimate $\beta_0$. The drawback of this approach is that the estimator will be inconsistent if the underlying assumptions are violated. Thus alternative approaches are necessary if conditional heteroscedasticity and/or an unknown error distribution are to be allowed for.

Alternatively, semiparametric approaches impose no parametric assumptions on the latent error term. If the error terms are still assumed to be independent of the explanatory variables, then the binary and ordered response models can be viewed as special cases of the "single index" framework, for which many estimators have been proposed (e.g. Han (1987), Powell et al. (1989), Ichimura (1993), Ahn et al. (1997), to name a few). While these estimators have been proven to converge at the parametric ($\sqrt{n}$) rate and have limiting normal distributions under their assumptions, they too will be inconsistent if conditional heteroscedasticity of general forms is present.

Other approaches which relax the independence assumption have been proposed. For example, for the binary choice model, Manski (1975, 1985) proposed the "maximum score" estimator, which only requires a conditional median restriction on the error term, thus allowing for very general forms of conditional heteroscedasticity. However, this estimator has been shown to converge at the slow rate of $n^{-1/3}$ (see, e.g., Cavanagh (1987), Kim and Pollard (1991)). While variations of this approach have led to improved rates of convergence (e.g. Nawata (1989), Horowitz (1993)), the parametric rate cannot be obtained, as proven by Chamberlain (1986), who showed the semiparametric efficiency bound to be zero under the median restriction. These asymptotic results carry over to the ordered response model, which, not surprisingly, is also identified under a median restriction. Lee (1992) shows the maximum score estimator to be consistent for ordered response models as well, but establishes neither its rate of convergence nor its limiting distribution.

¹ See Amemiya (1985), Greene (1998) for specific examples.

Under an exclusion restriction, which requires that the latent error term be independent of only one of the explanatory variables (which is known to the econometrician), Lewbel (1998) proposed estimators for both models which converge at the parametric rate. While this approach may be desirable in certain cases, it is quite restrictive in the sense that the econometrician is required to have sufficient prior knowledge to treat one particular explanatory variable differently from the rest. Thus estimators for these two models which allow for general forms of heteroscedasticity and converge at the parametric rate are lacking in the literature. This paper further explores this problem by considering a latent variable model with multiplicative heteroscedasticity of the form:
$$y_i^* = x_i'\beta_0 + \sigma(x_i)\epsilon_i \tag{1.2}$$
where $\sigma(\cdot)$ represents an unknown "scale" function of the explanatory variables, and $\epsilon_i$ is a homoscedastic error term. Models with similar multiplicative structures have been studied elsewhere in the econometric literature. Harvey (1976) considers models where the scale function is parametrically specified, as does Engle (1982) in his analysis of ARCH models. In our context, we leave $\sigma(\cdot)$ nonparametrically specified, thus allowing for very general forms of heteroscedasticity.

We consider the asymptotic properties for estimating binary and ordered response models with this type of restriction. One interesting aspect of our results is that the asymptotics are very different for the two models. Specifically, we show that the information bound is 0 for the binary case, implying that an estimator which converges at the parametric rate cannot be found. Surprisingly, this result holds true even if the distribution of $\epsilon_i$ is assumed to be standard normal. In stark contrast, the ordered response model can be estimated at the parametric rate, and we propose two estimation procedures based on differing assumptions on $\epsilon_i$. The results in this paper are in contrast to those in the existing literature, where rates of convergence for the two models have been the same for the same set of error restrictions. In other words, one extra "choice" has made little difference until now.

The paper is organized as follows. The following section analyzes the binary choice model under multiplicative heteroscedasticity, and shows the information bound to be 0, even if the distribution of $\epsilon_i$ is parametrically specified. Sections 3 and 4 consider the ordered response model under multiplicative heteroscedasticity, proposing two estimators which are shown to be $\sqrt{n}$-consistent and asymptotically normal. Section 5 discusses how the estimation procedure can be extended to a more general class of models. Section 6 explores the finite sample properties of the estimators through a simulation study, and Section 7 concludes by summarizing results and suggesting areas for future research. An appendix collects the proofs of the main theorems.
2 Binary Choice: Zero Information Bound

In this section we consider the binary choice model with multiplicative heteroscedasticity:

$$y_i = I[x_i'\beta_0 + \sigma_0(x_i)\epsilon_i > 0] \tag{2.1}$$
where $I[\cdot]$ denotes the "indicator" function, taking the value 1 if its argument is true, and 0 otherwise, and $x_i$ is a $k$-dimensional vector of explanatory variables. We assume the sequence of vectors of observed variables, $(y_i, x_i')'$, to be i.i.d., with a common joint density function with respect to some measure $\mu$. Our main result for this model is that, given that the scale function is nonparametrically specified, the parameters of interest $\beta_0$ cannot be estimated at the parametric rate even if the distribution of $\epsilon_i$ is parametrically specified. To establish this result, we assume that $\epsilon_i$ has a standard normal distribution and consider the information bound for this model. Letting $\Phi(\cdot)$ denote the standard normal distribution function, the conditional likelihood function can be expressed as:

$$f(y_i, x_i; \beta, \sigma) = \Phi\left(\frac{x_i'\beta}{\sigma_i}\right)^{y_i}\left[1 - \Phi\left(\frac{x_i'\beta}{\sigma_i}\right)\right]^{1-y_i} \tag{2.2}$$
where, for notational convenience, we suppress the dependence of $\sigma_i$ on $x_i$. To derive the information bound for estimation of $\beta_0$ we adopt the directional derivative approach used in, for example, Chamberlain (1986). First, we restrict the nonparametric component of the model, $\sigma$, to lie in some space of smooth, positive functions $\Gamma$. We define a path, $\lambda$, through $\sigma_0$ as a mapping from the open interval of real numbers $(a, b)$ into $\Gamma$, where there exists a unique $\eta_0 \in (a, b)$ such that $\lambda(\eta_0) = \sigma_0$. Note that for a given path $\lambda$, we have a parametric likelihood function of the form:
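$$f_\lambda(y_i, x_i; \beta, \eta) \equiv f(y_i, x_i; \beta, \lambda(\eta))$$

For $j = 1, 2, \ldots, k$, define

$$\psi_j(y_i, x_i) = \frac{1}{2} f_\lambda^{-1/2}(y_i, x_i; \beta_0, \eta_0)\,\partial f_\lambda(y_i, x_i; \beta_0, \eta_0)/\partial\beta_j \tag{2.3}$$

and define:

$$\psi_\lambda(y_i, x_i) = \frac{1}{2} f_\lambda^{-1/2}(y_i, x_i; \beta_0, \eta_0)\,\partial f_\lambda(y_i, x_i; \beta_0, \eta_0)/\partial\eta \tag{2.4}$$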
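For a given path $\lambda$, the partial information for the first component of $\beta_0$, denoted by $\beta_0^{(1)}$, is defined as:

$$I_{\lambda,1} = 4 \inf_{\alpha_2, \ldots, \alpha_{k+1}} \int (\psi_1 - \alpha_2\psi_2 - \cdots - \alpha_k\psi_k - \alpha_{k+1}\psi_\lambda)^2\,d\mu \tag{2.5}$$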
Letting $\Lambda$ denote the space of "all" paths (weak conditions will be imposed on $\Lambda$ for the main theorem), the semiparametric information bound for $\beta_0^{(1)}$ can be defined as:

$$I_{\Lambda,1} = \inf_{\lambda \in \Lambda} I_{\lambda,1}. \tag{2.6}$$
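As informal intuition for why this bound can be zero (a sketch only, not the argument behind the theorem): when the scale function is unrestricted, any index value with the same sign as $x_i'\beta_0$ reproduces the choice probability in (2.2) once the scale is rescaled accordingly. The helper names below are hypothetical, and the snippet uses only the Python standard library.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def choice_probability(index, scale):
    """P(y = 1 | x) = Phi(x'beta / sigma(x)) for the binary model (2.1)."""
    return PHI(index / scale)


def compensating_scale(true_index, true_scale, alt_index):
    """Scale that lets a different index value mimic the true choice probability.

    This only works when both index values share the same sign, because the
    scale function has to stay positive.
    """
    ratio = alt_index / true_index
    if ratio <= 0:
        raise ValueError("index values must have the same sign")
    return true_scale * ratio


# At a point x with x'beta_0 = 0.4 and sigma_0(x) = 0.8, a candidate beta with
# x'beta = 0.6 fits equally well once sigma(x) is rescaled.
p_true = choice_probability(0.4, 0.8)
p_alt = choice_probability(0.6, compensating_scale(0.4, 0.8, 0.6))
```

The two probabilities coincide, so in this illustration only points where the sign of the index changes can separate the two candidates, which echoes the median restriction case discussed in Section 1.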
To establish the main theorem of this section, we first impose weak restrictions on $\Gamma$ and $\Lambda$.

Definition 1 $\Gamma$ is the space of positive, continuous functions from
$k/2$, and that $\sigma_0(\cdot)$ is bounded away from 0 and infinity on the support of $\tau(\cdot)$. Furthermore, assume that the truncation sequence satisfies $K_n/\sqrt{n} \to 0$ and $\sqrt{n}K_n^{-2p/k} \to 0$; then

$$\sqrt{n}(\hat\beta - \beta_0) \Rightarrow N(0, M^{-1} V M^{-1}) \tag{3.13}$$

where $V = E[\psi_i\psi_i']$.
The separate components of the limiting covariance matrix can easily be estimated by analog estimators, replacing $\beta_0$ with $\hat\beta$, $\sigma_0$ with $\hat\sigma$ and expectations with sample averages. Since the objective function is sufficiently smooth (i.e. twice continuously differentiable), neither new smoothing parameters nor numerical derivatives are required. It should be noted that the influence function
? P2i + di0 ? P0i D(xi) d(1i2 ? P2i) (P0i)
i
contains the \penalty term"
!
which represents the increased variance induced by the first step procedure. Thus our estimator does not fall into the class of "asymptotically orthogonal" estimators. For this reason, it remains an open question whether our estimator is the most efficient one possible under our set of restrictions.
4 $\sqrt{n}$-consistent Estimation without Normal Errors

As alluded to in the previous section, the normality assumption on $\epsilon_i$ may be restrictive in certain cases. In this section we propose an estimator which only requires the assumption that $\epsilon_i$ has a strictly monotonic distribution function, with positive density function with respect to Lebesgue measure. To motivate our estimator under this condition, we assume again that the functions $P_0, P_2$ are known to the econometrician, where now these functions no longer depend on $\Phi$, but on $F$, the (unknown) c.d.f. of $\epsilon_i$. Assume also that there exists a pair of observations, $x_i, x_j$, for which both $P_{0i} = P_{0j}$ and $P_{2i} = P_{2j}$. By the strict monotonicity of $F$, this will imply that:
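$$\frac{\gamma_2 - x_i'\beta_0}{\sigma_0(x_i)} = \frac{\gamma_2 - x_j'\beta_0}{\sigma_0(x_j)} \tag{4.1}$$

and

$$\frac{-x_i'\beta_0}{\sigma_0(x_i)} = \frac{-x_j'\beta_0}{\sigma_0(x_j)} \tag{4.2}$$

Hence

$$\sigma_0(x_i) = \sigma_0(x_j) \tag{4.3}$$

and

$$x_i'\beta_0 = x_j'\beta_0 \tag{4.4}$$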
Therefore, if $k - 1$ such pairs of observations could be found, and the corresponding $(k-1) \times k$ matrix of differenced regressor values had full row rank, then $\beta_0$ could be recovered, up to scale, as the nonzero vector orthogonal to the rows of this matrix. Of course such an approach is infeasible for two reasons. The first reason is that, as before, the probability functions are unknown. The second reason is that even if these functions were known, if the probability functions are not discrete valued, such "matches" will occur with probability zero. The first problem can be remedied as was done previously, by replacing the true function values with their nonparametric estimates. The second problem can be dealt with through the use of "kernel weights", as has been frequently employed in the semiparametric literature (see, for example, Ahn and Powell (1993), Ahn, Ichimura and Powell (1997)). The idea behind this procedure is to exploit the smoothness of the probability functions. Specifically, when the two probability functions take values that are "close" to one another for a pair of observations, the difference in index values will be "close" to 0:

$$P_{0i} \approx P_{0j},\ P_{2i} \approx P_{2j} \ \Rightarrow\ x_i'\beta_0 \approx x_j'\beta_0$$
This suggests finding the vector which is orthogonal to a "weighted" difference matrix, where relatively high weight is given to pairs whose probability functions are close, and relatively low weight to pairs whose probability functions are far apart. Following Powell (1987) we adopt "kernel weights", using the kernel function frequently encountered in nonparametric estimation as a weighting function. Specifically, assuming that the conditional probability functions were known, we use the following weighting function for pairs of observations:

$$\omega_{ij} = \frac{1}{h_n^2} K\left(\frac{P_{0i} - P_{0j}}{h_n}\right) K\left(\frac{P_{2i} - P_{2j}}{h_n}\right) \tag{4.5}$$

where $h_n$ is the "bandwidth", which converges to zero as the sample size increases, ensuring that in the limit, only pairs of observations with probability functions arbitrarily close to each other receive positive weight. $K(\cdot)$ is the kernel function, which is symmetric around 0, and assumed to have compact support, integrate to 1, and satisfy certain smoothness conditions discussed later on. With the weights defined, a natural estimate, $\hat\omega_{ij}$, follows from replacing the true function values with their nonparametric estimates discussed in the previous sections.

To estimate the parameter of interest $\beta_0$ it will first be convenient to adopt a different scale normalization from the one used previously. Now we assume $\beta_0$ does not contain an intercept term, and set its first component, $\beta_0^{(1)}$, to 1. The parameter of interest is now labeled $\theta_0$, where $\beta_0 \equiv (1, \theta_0')'$. This normalization implies that for a pair of observations for which the indexes are equal, we have:

$$x_{i1} - x_{j1} = (x_{j2} - x_{i2})'\theta_0 \tag{4.6}$$
where $x_{i1}$ is the first component of $x_i$, and $x_{i2}$ are the remaining $k - 1$ components. This suggests a weighted least squares estimator, regressing $x_{i1} - x_{j1}$ on $x_{j2} - x_{i2}$, with weights $\hat\omega_{ij}$. Thus we propose the following two stage procedure. The first stage is the same series estimator of the probability functions,³ and the second stage estimator is defined as:

$$\hat\theta = \Big(\sum_{i \ne j} \tau(x_i)\tau(x_j)\hat\omega_{ij}\,\Delta x_{2i}\,\Delta x_{2i}'\Big)^{-1}\Big(-\sum_{i \ne j} \tau(x_i)\tau(x_j)\hat\omega_{ij}\,\Delta x_{2i}\,\Delta x_{1i}\Big) \tag{4.7}$$

where $\Delta x_i \equiv x_i - x_j$, and $\tau(\cdot)$ is again a trimming function.
³ As specified in the regularity conditions in the appendix, the conditions on the truncation sequence are stricter than those needed for the previous estimator. Specifically, they will depend on the second stage bandwidth sequence used.
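The second stage in (4.5) and (4.7) can be sketched as follows, under simplifying assumptions that are not part of the paper: only two regressors (so $\theta_0$ is a scalar), an Epanechnikov kernel in place of the sixth order kernel required later, probabilities supplied by the caller rather than estimated by series, and trimming bounds of [0.05, 0.95]. The function names and the toy design are hypothetical; the toy index carries an intercept of 0.5, which only shifts the thresholds.

```python
import math
import random
from statistics import NormalDist


def epanechnikov(u):
    """Compact-support, second-order kernel (NOT the sixth-order kernel of Assumption K)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kwov_two_regressors(x1, x2, p0, p2, h, lo=0.05, hi=0.95):
    """Second-stage estimate of a scalar theta_0 following (4.5) and (4.7)."""
    # Trimming: keep observations whose probabilities lie in [lo, hi].
    keep = [i for i in range(len(x1)) if lo <= p0[i] <= hi and lo <= p2[i] <= hi]
    num = den = 0.0
    for i in keep:
        for j in keep:
            if i == j:
                continue
            w = epanechnikov((p0[i] - p0[j]) / h) * epanechnikov((p2[i] - p2[j]) / h) / h ** 2
            if w == 0.0:
                continue
            d1 = x1[i] - x1[j]  # Delta x_1 for the pair (i, j)
            d2 = x2[i] - x2[j]  # Delta x_2 for the pair (i, j)
            num -= w * d2 * d1
            den += w * d2 * d2
    return num / den


def toy_check(n=400, h=0.05, seed=0):
    """Hypothetical design: index 0.5 + x1 - x2 (theta_0 = -1), thresholds 0 and 1."""
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    x1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    p0, p2 = [], []
    for a, b in zip(x1, x2):
        index = 0.5 + a - b
        sigma = 0.75 * math.exp(0.15 * a * b)
        p0.append(cdf(-index / sigma))               # P(latent <= 0 | x)
        p2.append(1.0 - cdf((1.0 - index) / sigma))  # P(latent > 1 | x)
    # True probabilities stand in for first-stage estimates in this check.
    return kwov_two_regressors(x1, x2, p0, p2, h)
```

Because the toy check feeds in the true probabilities, it only exercises the matching logic; it says nothing about the effect of first-stage estimation error.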
4.1 Asymptotic Properties

The proposed estimator has a similar structure to those proposed in Powell (1987), Ahn and Powell (1993), and Ahn, Ichimura and Powell (1997). As is the case with those estimators, the proofs of its asymptotic properties are quite lengthy and detailed. Consequently, we only state the necessary regularity conditions and main theorems in this section, leaving the details of the main proofs to the appendix.
4.1.1 Regularity Conditions

The conditions necessary for developing the limiting distribution of this estimator are substantially more detailed than those required for the previous estimator. Specifically, conditions need to be imposed on the second stage kernel function and bandwidth sequence, the first stage truncation sequence, and the order of smoothness of certain density and conditional expectation functions. We first state the necessary identification condition on which our estimation procedure is based:
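Assumption I (Identification) The joint distribution of the propensity scores $P_0, P_2$ has a density with respect to Lebesgue measure, which is denoted by $f_{P_0,P_2}(\cdot,\cdot)$. Defining the following functions of $P_{0i}, P_{2i}$:

$$\begin{aligned}
f_{(P_0,P_2)i} &= f_{P_0,P_2}(P_{0i}, P_{2i}) \\
\mu_{\tau i} &= E[\tau(x_i)\,|\,P_{0i}, P_{2i}] \\
\mu_{\tau x i} &= E[\tau(x_i)x_{2i}\,|\,P_{0i}, P_{2i}] \\
\mu_{\tau xx i} &= E[\tau(x_i)x_{2i}x_{2i}'\,|\,P_{0i}, P_{2i}]
\end{aligned}$$

we require that the matrix

$$M_1 = 2E\left[f_{(P_0,P_2)i}\,(\mu_{\tau i}\,\mu_{\tau xx i} - \mu_{\tau x i}\,\mu_{\tau x i}')\right]$$

has full rank.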
We next impose the following conditions on the second stage kernel function and bandwidth sequence:
Assumption K (Second stage kernel function) The kernel function $K(\cdot)$ used in the second stage is assumed to have the following properties:

K.1 $K(\cdot)$ is continuously differentiable of order $p$, where $p > 9k/2$, and has compact support.

K.2 $K(\cdot)$ is symmetric about 0.

K.3 $K(\cdot)$ is a sixth order kernel:

$$\int u^l K(u)\,du = 0 \ \text{ for } l = 1, 2, 3, 4, 5, \qquad \int u^6 K(u)\,du \ne 0$$
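The moment conditions in K.3 can be checked numerically for a candidate kernel. As an illustration only, the sketch below uses the Gaussian-based kernel $K_6(u) = \frac{1}{8}(15 - 10u^2 + u^4)\phi(u)$, which satisfies the sixth order moment conditions but does not have compact support, so it does not satisfy K.1; the function names are hypothetical.

```python
import math


def k6(u):
    """Gaussian-based sixth-order kernel: (15 - 10 u^2 + u^4) * phi(u) / 8."""
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return (15.0 - 10.0 * u ** 2 + u ** 4) * phi / 8.0


def moment(l, half_width=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of u**l * k6(u) over [-half_width, half_width]."""
    step = 2.0 * half_width / steps
    total = 0.0
    for s in range(steps):
        u = -half_width + (s + 0.5) * step
        total += u ** l * k6(u)
    return total * step


# Moments of order 0 through 6; orders 1 to 5 should vanish, order 6 should not.
moments = [moment(l) for l in range(7)]
```

The same `moment` check can be pointed at any other candidate kernel by swapping out `k6`, provided the integration range covers its support.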
Assumption H (Second stage bandwidth sequence) The bandwidth sequence $h_n$ used in the second stage is of the form:

$$h_n = c\,n^{-\delta}$$

where $c$ is some constant and $\delta \in (\frac{1}{12}, \frac{1}{10})$.

The following assumption characterizes the order of smoothness of density and conditional expectation functions:
Assumption S (Order of Smoothness of Density and Conditional Expectation Functions)

S.1 The function $F^{-1}$ and its derivative are sixth order continuously differentiable with bounded derivatives.

S.2 The functions $f_{P_0,P_2}(\cdot,\cdot)$ and $E[x_{2i}\,|\,P_0 = \cdot, P_2 = \cdot]$ have order of differentiability of 6, with partial derivatives that are bounded.
The final set of assumptions involves restrictions for the first stage series estimator. These are smoothness conditions on the propensity scores $P_0, P_2$ and a condition on the rate at which the truncation sequence increases to infinity. We note that the conditions necessary are much stricter than those required for the previous estimator. Specifically, the rate at which the truncation sequence increases depends on the second stage bandwidth rate.
Assumption PS (Order of smoothness of propensity score functions) The functions $P_{0i}, P_{2i}$ are continuously differentiable of order $p$, where $p > \frac{9}{2}k$.
Assumption TS (Rate condition on first stage truncation sequence) The truncation sequence $K_n$ of the first stage series estimator is of the form:

$$K_n = [c_2 n^{\kappa}]$$

where $[\cdot]$ denotes the integer part of its argument, $c_2$ is some constant, and $\kappa$ satisfies:

$$\kappa \in \left(\frac{k}{2p}\left(\frac{1}{2} + 4\delta\right),\ \frac{1}{2} - 4\delta\right)$$

where $\delta$ is regulated by Assumption H.
4.1.2 Limiting Distribution

Based on the conditions outlined in the previous section, we now characterize the rate of convergence and limiting distribution of this estimator. The main result, detailed in the next theorem, is that the proposed estimator is $\sqrt{n}$-consistent and asymptotically normal. This theorem fully illustrates the difference between ordered response and binary choice models in the context of multiplicative heteroscedasticity; the latter cannot be estimated at the parametric rate, even when the homoscedastic component of the latent error term has a parametrically specified distribution.
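Theorem 3 Let $F^{-1\prime}(\cdot)$ denote the derivative of the function $F^{-1}(\cdot)$ and define the following functions of $P_{0i}, P_{2i}$:

$$\xi_{1i} = \frac{\gamma_2}{\left(F^{-1}(1 - P_{2i}) - F^{-1}(P_{0i})\right)^2}\,F^{-1\prime}(P_{0i})\,F^{-1}(1 - P_{2i})$$

and

$$\xi_{2i} = \frac{\gamma_2}{\left(F^{-1}(1 - P_{2i}) - F^{-1}(P_{0i})\right)^2}\,F^{-1\prime}(1 - P_{2i})\,F^{-1}(P_{0i})$$

Letting

$$\psi_{1i} = 2\tau(x_i) f_{(P_0,P_2)i}\left[(d_{0i} - P_{0i})\xi_{1i} + (d_{2i} - P_{2i})\xi_{2i}\right](\mu_{\tau i}x_{2i} - \mu_{\tau x i}) \tag{4.8}$$

then

$$\sqrt{n}(\hat\theta - \theta_0) \Rightarrow N(0, M_1^{-1} V_1 M_1^{-1}) \tag{4.9}$$

where

$$V_1 = E[\psi_{1i}\psi_{1i}'] \tag{4.10}$$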
4.2 A Simple Nearest Neighbor Estimator

While the estimator discussed in the previous section was shown to have desirable asymptotic properties, one of its drawbacks is its implementation: not only does it require the selection of two smoothing parameters (one at each stage), but the two are related in a complicated way, and there is no simple rule by which to select them in finite samples. Here we propose an alternative two stage procedure in which selection of the smoothing parameter in the second stage is much simpler. The idea is to replace the kernel weighted approach with a nearest neighbor approach, as considered in Yatchew (1997) for the semilinear model. This approach is much simpler as it only involves least squares on first differenced values after an appropriate reordering of the data. The procedure can be summarized in the following simple steps:
(i) Estimate $P_{0i}, P_{2i}$ as before using series estimation.

(ii) Trim observations based on estimated values $\hat P_{0i}, \hat P_{2i}$ so that all kept values have estimated probabilities which lie in $S$, a (closed) sub-square of the unit square.

(iii) Cover $S$ with sub-squares of area $n^{\nu - 1}$ for some small $\nu > 0$, and within each sub-square construct a path using the nearest neighbor algorithm. Then, knit the paths together by joining endpoints in adjacent sub-squares to obtain a reordering of the data.

(iv) Letting the subscript $i$ denote the reordered data, estimate $\theta_0$ by

$$\hat\theta_{nn} = \Big(\sum_{i=2}^{n} \Delta x_{2i}\,\Delta x_{2i}'\Big)^{-1}\Big(-\sum_{i=2}^{n} \Delta x_{2i}\,\Delta x_{1i}\Big) \tag{4.11}$$

where $n$ denotes the number of observations after trimming, $\Delta x_{2i}$ denotes $x_{2i} - x_{2(i-1)}$ and $\Delta x_{1i}$ denotes $x_{1i} - x_{1(i-1)}$.
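Steps (ii) to (iv) can be sketched as follows, under simplifying assumptions that depart from the procedure above: a single greedy nearest neighbor path is built through all trimmed points (no covering by sub-squares and no knitting), trimming uses a fixed interval for both probabilities, and there are only two regressors so that $\theta_0$ is a scalar. All function names are hypothetical.

```python
def nearest_neighbor_order(points):
    """Greedy nearest-neighbour path through 2-D points; returns the visiting order."""
    remaining = set(range(len(points)))
    # Start from the point with the smallest first coordinate (one end of the cloud).
    current = min(remaining, key=lambda i: points[i][0])
    order = [current]
    remaining.remove(current)
    while remaining:
        cx, cy = points[current]
        current = min(
            remaining,
            key=lambda i: (points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2,
        )
        order.append(current)
        remaining.remove(current)
    return order


def nn_difference_estimator(x1, x2, p0, p2, lo=0.05, hi=0.95):
    """First-difference least squares as in (4.11), for a scalar theta_0."""
    # Step (ii): trim on the (estimated) probabilities.
    keep = [i for i in range(len(x1)) if lo <= p0[i] <= hi and lo <= p2[i] <= hi]
    # Step (iii), simplified: reorder the kept observations along one greedy path.
    order = nearest_neighbor_order([(p0[i], p2[i]) for i in keep])
    path = [keep[t] for t in order]
    # Step (iv): least squares on first differences of the reordered data.
    num = den = 0.0
    for prev, cur in zip(path, path[1:]):
        d1 = x1[cur] - x1[prev]
        d2 = x2[cur] - x2[prev]
        num -= d2 * d1
        den += d2 * d2
    return num / den
```

The greedy path costs a number of distance evaluations that grows quadratically in the trimmed sample size, which is acceptable for samples of a few hundred observations.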
As discussed in Yatchew (1997), while this estimator is $\sqrt{n}$-consistent, it will be relatively inefficient, and efficiency can be improved by higher order differencing. It is also important to note that while the kernel weighted method can apply when the number of values the ordered response variable $y_i$ can take is finite and greater than or equal to 3, this nearest neighbor approach is only valid for comparing at most 3 choice probabilities.
5 Heteroscedastic Transformation Models

An interesting aspect of the estimation procedures introduced in the previous sections is that they can be easily modified to estimate the slope coefficients and transformation function in a monotonic transformation model of the form:

$$y_i = g(x_i'\beta_0 + \sigma_0(x_i)\epsilon_i) \tag{5.1}$$
where $g(\cdot)$ is an unknown monotonic function taking on at least three different values, and $\epsilon_i$ satisfies the same assumptions as in the previous section. Nonparametric transformation models have received a great deal of attention in both the theoretical and applied econometric literature. Several estimators⁴ for $\beta_0$ have been proposed under the restriction that $\sigma_0(x_i) \equiv 1$. More recently, Horowitz (1996), Ye and Duan (1997), Klein and Sherman (1998) and Chen (1998) have proposed $\sqrt{n}$-consistent estimators for $g(\cdot)$ under homoscedasticity.
5.1 Estimating the Slope Coefficients

To estimate $\beta_0$ in the presence of multiplicative conditional heteroscedasticity, one could select two "cut points", $y_0, y_2$, where $y_0 < y_2$ and $P(y_i \le y_0\,|\,x_i) < P(y_i \le y_2\,|\,x_i)$, with these probabilities bounded away from 0 and 1 respectively. With these cut points, the indicators

$$d_{i0} \equiv I[y_i \le y_0] \quad \text{and} \quad d_{i2} \equiv I[y_i \ge y_2]$$
⁴ References include Han (1987), Powell et al. (1989), Ichimura (1993), Horowitz and Härdle (1996), and Ahn et al. (1996).
could be constructed for the sample, and one could proceed exactly as before.

1. Under the assumption that $\epsilon_i \sim N(0, 1)$, the procedure discussed in Section 3.1 can be extended as follows. Define $g^{-1}(\cdot)$ by:

$$g^{-1}(y) = \inf\{x : g(x) > y\}$$

Since $g(\cdot)$ is not specified, an additional location restriction is required, so we set $g^{-1}(y_0) \equiv 0$, and as a scale normalization, we set $g^{-1}(y_2) \equiv 1$. It follows that:

$$E[d_{i0}\,|\,x_i] = \Phi\left(\frac{-x_i'\beta_0}{\sigma_0(x_i)}\right)$$

and

$$E[d_{i2}\,|\,x_i] = 1 - \Phi\left(\frac{1 - x_i'\beta_0}{\sigma_0(x_i)}\right)$$

implying the relationship:

$$\frac{1}{\sigma_0(x_i)} = \Phi^{-1}(1 - E[d_{i2}\,|\,x_i]) - \Phi^{-1}(E[d_{i0}\,|\,x_i])$$

So, as before, the scale function can be nonparametrically estimated in a first stage, and then plugged into the likelihood function, which now would treat the indicators $d_{i0}, d_{i2}$ as the values of the dependent variable.

2. Under the weaker assumption that $\epsilon_i$ has an unknown distribution function $F(\cdot)$ which is assumed to be strictly monotonic, we now have the relationship:
$$E[d_{i0}\,|\,x_i] = E[d_{i0}\,|\,x_j],\ E[d_{i2}\,|\,x_i] = E[d_{i2}\,|\,x_j] \ \Rightarrow\ x_i'\beta_0 = x_j'\beta_0$$

So either of the kernel-weighted or nearest neighbor approaches discussed in Section 4 can be applied in this context as well, yielding an estimator of $\beta_0$ up to scale which is $\sqrt{n}$-consistent and asymptotically normal. For situations where $y_i$ takes several values, more efficient estimators could be constructed by selecting more cut points.

In this context, the results are worth comparing with those found in Khan (1998), where a transformation model was estimated under a conditional quantile restriction on the error term. While the quantile restriction allows for more general forms of conditional heteroscedasticity than the multiplicative structure, a disadvantage of the quantile restriction is that $\sqrt{n}$-consistency is only attainable under smoothness conditions on $g(\cdot)$, and not under discrete valued transformations. However, for transformations where the dependent variable takes a continuum of values, the quantile approach would be more appropriate, as it does not involve approximating the set of values with a finite grid.
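Under normal errors and the normalizations $g^{-1}(y_0) \equiv 0$, $g^{-1}(y_2) \equiv 1$ of item 1, the two conditional probabilities pin down both the scale and the index at a given point. The round trip below is a minimal sketch of that inversion using true probabilities; in practice the inputs would be first-stage estimates, and the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def scale_and_index(p0, p2):
    """Invert P0 = Phi(-v / sigma) and P2 = 1 - Phi((1 - v) / sigma) for (sigma, v)."""
    lower = STD_NORMAL.inv_cdf(p0)        # equals -v / sigma
    upper = STD_NORMAL.inv_cdf(1.0 - p2)  # equals (1 - v) / sigma
    sigma = 1.0 / (upper - lower)         # because upper - lower = 1 / sigma
    return sigma, -lower * sigma


# Round trip at a point with sigma = 0.8 and index v = 0.3.
true_sigma, true_index = 0.8, 0.3
p0 = STD_NORMAL.cdf(-true_index / true_sigma)
p2 = 1.0 - STD_NORMAL.cdf((1.0 - true_index) / true_sigma)
sigma, index = scale_and_index(p0, p2)
```

With a single probability, as in the binary model, only the ratio of index to scale could be recovered this way; the second probability is what separates the two.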
5.2 Estimating the Transformation Function

Here we illustrate how the estimation procedure discussed in Section 4 can be extended to estimate the unknown function $g(\cdot)$ under a stronger assumption on the scale function $\sigma_0(\cdot)$. Specifically, we partition the regressor vector $x_i$ into two vectors:

$$x_i = (x_{1i}', x_{2i}')'$$

and assume the following exclusion restriction on the scale function:⁵

$$\sigma_0(x_i) = \sigma_0(x_{1i})$$

Adopting the location normalization that $g^{-1}(y_0) \equiv 0$, we assume without loss of generality that the parameter to be estimated is $g^{-1}(y_2)$.⁶ We now have the following relationship:

$$E[d_{i0}\,|\,x_i] = 1 - E[d_{i2}\,|\,x_j],\ x_{1i} = x_{1j} \ \Rightarrow\ g^{-1}(y_2) = (x_{2j} - x_{2i})'\beta_0^{(2)}$$

where $\beta_0^{(2)}$ corresponds to the slope coefficients of the subvector $x_{2i}$. This suggests a kernel weighted (or nearest neighbor) estimator. Again, let $\hat P_{0i}, \hat P_{2i}$ denote nonparametric estimators of $E[d_{i0}\,|\,x_i], E[d_{i2}\,|\,x_i]$, and define the kernel weights:

$$\hat\omega_{ij} = \frac{1}{h_n^2} K\left(\frac{\hat P_{0i} - (1 - \hat P_{2j})}{h_n}\right) K\left(\frac{x_{1i} - x_{1j}}{h_n}\right)$$

An estimator of $g^{-1}(y_2)$ is:

$$\hat g^{-1}(y_2) = \frac{\frac{1}{n(n-1)}\sum_{i \ne j}\hat\omega_{ij}\,(x_{2j} - x_{2i})'\hat\beta^{(2)}}{\frac{1}{n(n-1)}\sum_{i \ne j}\hat\omega_{ij}}$$
⁵ Note that under the special case where $x_i = x_{2i}$, the model is homoscedastic. Thus our approach will be valid for homoscedastic transformation models as well.

⁶ Recall that in Section 4 the scale normalization adopted was that the first component in $\beta_0$ (excluding the intercept) was set to 1, so $g^{-1}(y_2)$ can no longer be normalized.
where $\hat\beta^{(2)}$ is the corresponding subset of the slope coefficient estimator discussed in the previous section. The same arguments used in the proof of Theorem 3 can be used to establish the $\sqrt{n}$-consistency and asymptotic normality of this estimator.
6 Monte Carlo Results

In this section, the finite sample properties of the two proposed estimators are examined through the results of a small scale simulation study. In the study we consider various designs, allowing for both homoscedasticity and heteroscedasticity and allowing for differing latent error distributions. We report basic summary statistics for the two estimators introduced in this paper, referred to in this section as 2SMLE (2 step maximum likelihood) and KWOV (kernel weighted orthogonal vector), as well as the ordered probit estimator. These results are reported in Tables I-VIII. We simulated from models of the form

$$y_i^* = \alpha + x_{i1}\beta_0^{(1)} + x_{i2}\beta_0^{(2)} + \sigma(x_i)\epsilon_i$$

where $x_{i1}, x_{i2}$ are random variables each distributed uniformly between -1 and 1, $\beta_0^{(1)}$ was set to 1, $\beta_0^{(2)}$ was set to -1, and $\alpha$ was 0.5. The threshold values were set to 0 and 1, and the latent error $\sigma(x_i)\epsilon_i$ varied to allow for four different designs:
Design 1 homoscedastic normal: $\sigma(x_i) \equiv 1$, $\epsilon_i$ standard normal.

Design 2 homoscedastic logistic: $\sigma(x_i) \equiv 1$, $\epsilon_i$ logistic, mean 0, scale 1.

Design 3 heteroscedastic normal: $\sigma(x_i) = 0.75\,e^{0.15 x_{i1} x_{i2}}$, $\epsilon_i$ standard normal.

Design 4 heteroscedastic logistic: $\sigma(x_i) = 0.75\,e^{0.15 x_{i1} x_{i2}}$, $\epsilon_i$ logistic, mean 0, scale 1.
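The four designs can be simulated along the following lines. This is a sketch in Python rather than the GAUSS code used for the study; the coding of the response as 0, 1, 2, the treatment of ties at the thresholds, and the function name are assumptions made for illustration.

```python
import math
import random


def draw_sample(n, design, seed=None):
    """Simulate one sample from Designs 1-4; returns (x1, x2, y) with y in {0, 1, 2}."""
    if design not in (1, 2, 3, 4):
        raise ValueError("design must be 1, 2, 3 or 4")
    rng = random.Random(seed)
    heteroscedastic = design in (3, 4)
    logistic = design in (2, 4)
    x1, x2, y = [], [], []
    for _ in range(n):
        a = rng.uniform(-1.0, 1.0)
        b = rng.uniform(-1.0, 1.0)
        sigma = 0.75 * math.exp(0.15 * a * b) if heteroscedastic else 1.0
        if logistic:
            u = rng.random()
            while u == 0.0:  # guard against log(0)
                u = rng.random()
            eps = math.log(u / (1.0 - u))  # inverse c.d.f. of the standard logistic
        else:
            eps = rng.gauss(0.0, 1.0)
        latent = 0.5 + 1.0 * a - 1.0 * b + sigma * eps
        # Thresholds 0 and 1 split the latent variable into three categories.
        y.append(0 if latent <= 0.0 else (1 if latent <= 1.0 else 2))
        x1.append(a)
        x2.append(b)
    return x1, x2, y
```

For example, `draw_sample(400, 3, seed=0)` returns one heteroscedastic normal sample of size 400.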
Our simulation study was performed in GAUSS. Each design was replicated 801 times, for sample sizes of 100, 200, 400 and 800. The tables report mean bias, median bias, mean squared error, and mean absolute deviation for the three estimators.

For each of the estimators proposed in this paper, the first stage estimator used a power series and adopted a data driven approach to select the number of terms in the series. Specifically, for each replication, we considered powers of order 1 to 8, and selected the order which maximized the corrected $R^2$. Also, we adopted a data driven trimming device, deleting observations for which either of the propensity score estimates, $\hat P_{0i}, \hat P_{2i}$, was outside the interval $[0.05, 0.95]$. For the KWOV estimator, we used a normal kernel in the second stage. The bandwidth was set to $c\,n^{-1/10}$, and we considered a grid of values of $c$ ranging from 0.025 to 0.375, in order to examine the sensitivity of results to the bandwidth choice.

Tables I to IV directly compare the ordered probit and 2SMLE approaches. Table I reports the results for Design 1, where the ordered probit performs very well, as expected. In terms of all the summary statistics considered, it performs measurably better than the 2SMLE for the slope coefficients. Nonetheless, the 2SMLE's performance is also acceptable, especially for sample sizes exceeding 100.

Table II reports the results for Design 2. Here, the ordered probit performs poorly, again as expected. The values of the mean bias and RMSE reflect its inconsistency due to error distribution misspecification. On the other hand, the results are quite favorable for the 2SMLE, where the biases and values of RMSE decrease with the sample size. This is most likely due to the trimming procedure adopted, as the logistic and normal distributions differ most in their tail behavior. In that sense, comparing the two estimators is misleading, as no trimming was incorporated into the ordered probit estimator. Nonetheless, it is encouraging that the trimmed 2SMLE appears quite robust, in practice if not in theory, to misspecification of the latent error distribution.

The results in Table III, where Design 3 was simulated, fully reflect the benefits of the 2SMLE if conditional heteroscedasticity is present in the data. For this design, the ordered probit performs very poorly, exhibiting a bias and RMSE in the neighborhood of 0.4 for all sample sizes, fully exposing its inconsistency. In contrast, the 2SMLE performs quite well, with the summary statistics illustrating robustness to conditional heteroscedasticity. The values of the summary statistics are very encouraging for all sample sizes, indicating that this procedure can be used in practice even for small sample sizes. Qualitatively, the results are similar for Design 4, as reported in Table IV. The ordered probit performs just as poorly as before, and the 2SMLE, though exhibiting somewhat larger biases than in Design 3, still performs well for all sample sizes.

Tables V-VIII summarize the performance of the KWOV estimator of $\beta_0^{(2)}/\beta_0^{(1)}$ for designs 1 and 2, for a range of values for the bandwidth constant. Theoretically, the KWOV estimator should perform well for all designs, as it is not based on a specification of the latent error distribution. Its actual performance is satisfactory, though large biases are exhibited for sample sizes up to 200 for certain values of the constant. The estimator exhibits a moderate level of sensitivity to the choice of the bandwidth constant, performing best when $c$ is in the neighborhood of 0.1, regardless of the design or sample size. Nonetheless, the consistency of the estimator is clearly reflected in the decreasing values of the biases and RMSE as the sample size increases. One discouraging, though not surprising, result is that its RMSE is significantly larger (on the order of 50%) than that of the 2SMLE for all designs and all sample sizes. This is to be expected as the KWOV involves an extra nonparametric procedure.

In summary, the results of our simulation study generally agree with the predictions of the theory behind it. Specifically, the ordered probit procedure can lead to seriously misleading results when conditional heteroscedasticity is present. Furthermore, the positive results for the two estimators we introduce here suggest that they are viable alternatives which can be implemented in practice. However, as is always the case with 2-step estimators, caution is advised in situations where there are several more regressors, as the first stage estimator would suffer from the usual curse of dimensionality, and possibly have a (second order) effect on either of the second stage estimators.
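The summary statistics discussed in this section can be computed from a list of replicated estimates as in the sketch below. The function name is hypothetical, and taking the mean absolute deviation around the true parameter value (rather than around the sample mean) is an assumption.

```python
import math
from statistics import mean, median


def summarize(estimates, true_value):
    """Mean bias, median bias, RMSE and mean absolute deviation of replicated estimates."""
    errors = [estimate - true_value for estimate in estimates]
    return {
        "mean_bias": mean(errors),
        "median_bias": median(errors),
        "rmse": math.sqrt(mean([err * err for err in errors])),
        "mad": mean([abs(err) for err in errors]),
    }


# Three hypothetical replications of an estimator whose target is -1.
summary = summarize([-1.1, -0.9, -1.2], -1.0)
```

In a full study, `estimates` would hold one estimate per replication for a given design, sample size and, for the KWOV estimator, bandwidth constant.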
7 Conclusions
This paper introduces two estimation procedures for ordered response models under very general forms of conditional heteroscedasticity. Both estimation procedures are shown to converge at the parametric rate with limiting normal distributions. This is in contrast to the binary choice model under such conditions, where it was shown that the parametric rate is unattainable. One of these estimation procedures easily extends to estimating the slope coefficients in a monotonic transformation model with multiplicative heteroscedasticity. Thus it is a desirable alternative to most existing estimators of these types of models, which assume independence between the errors and regressors.
The results of this paper suggest areas for future research. First, it would prove useful to consider the efficiency of each of our estimators by evaluating the information bounds under the assumptions imposed. As mentioned, since the asymptotic orthogonality condition is not satisfied, even our maximum likelihood estimator may not be asymptotically efficient. Second, it would be interesting to consider the advantages of applying this type of multiplicative error behavior to other statistical models. For example, Chen and Khan (1998) apply this type of restriction to the censored regression model, and are able to generalize the full rank conditions needed in Powell's (1984) CLAD estimator.
References
[1] Ahn, H., Ichimura, H. and J.L. Powell (1997), "Simple Estimation of Monotone Single Index Models", manuscript.
[2] Ahn, H. and J.L. Powell (1993), "Semiparametric Estimation of Censored Selection Models with a Nonparametric Selection Mechanism", Journal of Econometrics, 58, 3-29.
[3] Andrews, D.W.K. (1994), "Empirical Process Methods in Econometrics", in Engle, R.F. and D. McFadden (eds.), Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland.
[4] Cavanagh, C.L. (1987), "Limiting Behavior of Estimators Defined by Optimization", unpublished manuscript.
[5] Chamberlain, G. (1986), "Asymptotic Efficiency in Semiparametric Models with Censoring", Journal of Econometrics, 32, 189-218.
[6] Chen, S. (1998), "Rank Estimation of Transformation Models", mimeo.
[7] Chen, S. and S. Khan (1998), "Generalizing Full Rank Conditions in Heteroscedastic Censored Regression Models", manuscript, University of Rochester.
[8] Donald, S.G. (1995), "Two-Step Estimation of Heteroscedastic Sample Selection Models", Journal of Econometrics, 65, 347-380.
[9] Engle, R.F. (1982), "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation", Econometrica, 50, 987-1008.
[10] Han, A. (1987), "Non Parametric Analysis of a Generalized Regression Model", Journal of Econometrics, 35, 303-316.
[11] Harvey, A.C. (1976), "Estimating Regression Models with Multiplicative Heteroscedasticity", Econometrica, 44, 461-465.
[12] Horowitz, J.L. (1993), "Semiparametric and Nonparametric Estimation of Quantal Response Models", in G.S. Maddala, C.R. Rao, H.D. Vinod (eds.), Handbook of Statistics - Econometrics, Amsterdam: North-Holland.
[13] Horowitz, J.L. (1996), "Semiparametric Estimation of a Regression Model with an Unknown Transformation of the Dependent Variable", Econometrica, 64, 103-137.
[14] Horowitz, J.L. and W. Härdle (1996), "Direct Semiparametric Estimation of Single-Index Models with Discrete Covariates", Journal of the American Statistical Association, 91, 1632-1640.
[15] Ichimura, H. (1993), "Semiparametric Least Squares and Weighted SLS Estimation of Single-Index Models", Journal of Econometrics, 58, 71-120.
[16] Khan, S. (1998), "Two Stage Rank Estimation of Quantile Index Models", manuscript, University of Rochester.
[17] Kim, J. and D. Pollard (1990), "Cube Root Asymptotics", Annals of Statistics, 18, 191-219.
[18] Lee, M. (1992), "Median Regression for Ordered Discrete Response", Journal of Econometrics, 51, 59-77.
[19] Lewbel, A. (1998), "Semiparametric Qualitative Response Model Estimation with Unknown Heteroscedasticity or Instrumental Variables", manuscript.
[20] Manski, C.F. (1975), "Maximum Score Estimation of the Stochastic Utility Model of Choice", Journal of Econometrics, 3, 205-228.
[21] Manski, C.F. (1985), "Semiparametric Analysis of Discrete Response: Asymptotic Properties of Maximum Score Estimation", Journal of Econometrics, 27, 313-334.
[22] Nawata, K. (1992), "Semiparametric Estimation of Binary Choice Models Based on Medians of Grouped Data", manuscript, University of Tokyo.
[23] Newey, W.K. (1990), "Semiparametric Efficiency Bounds", Journal of Applied Econometrics, 5, 99-135.
[24] Newey, W.K. (1994), "Series Estimation of Regression Functionals", Econometric Theory, 10, 1-28.
[25] Newey, W.K. and D. McFadden (1994), "Estimation and Hypothesis Testing in Large Samples", in Engle, R.F. and D. McFadden (eds.), Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland.
[26] Powell, J.L. (1984), "Least Absolute Deviations Estimation for the Censored Regression Model", Journal of Econometrics, 25, 303-325.
[27] Powell, J.L. (1986), "Censored Regression Quantiles", Journal of Econometrics, 32, 143-155.
[28] Powell, J.L. (1989), "Semiparametric Estimation of Censored Selection Models", unpublished manuscript.
[29] Powell, J.L. (1994), "Estimation of Semiparametric Models", in Engle, R.F. and D. McFadden (eds.), Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland.
[30] Powell, J.L., J.H. Stock, and T.M. Stoker (1989), "Semiparametric Estimation of Index Coefficients", Econometrica, 57, 1404-1430.
[31] Pratt, J.W. (1981), "Concavity of the Log Likelihood", Journal of the American Statistical Association, 76, 103-106.
[32] Serfling, R.J. (1980), Approximation Theorems of Mathematical Statistics, New York: Wiley.
[33] Sherman, R.P. (1994a), "U-Processes in the Analysis of a Generalized Semiparametric Regression Estimator", Econometric Theory, 10, 372-395.
[34] Yatchew, A. (1997), "An Elementary Estimator of the Partial Linear Model", Economics Letters, 57, 135-143.
[35] Ye, J. and N. Duan (1997), "Nonparametric $n^{-1/2}$-consistent Estimation for the General Transformation Model", Annals of Statistics, 25, 2682-2717.
A Appendix
Throughout this section we let $\|\cdot\|$ denote the matrix norm. That is, for a matrix $a$ with components denoted by $a_{ij}$, $\|a\| = \left(\sum_{i,j} a_{ij}^2\right)^{1/2}$.
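As a small worked illustration of this norm (the function name `matrix_norm` and the example matrix are illustrative choices, not part of the paper), the square root of the sum of squared components can be computed directly:

```python
import math


def matrix_norm(a):
    """Square root of the sum of squared entries of a matrix given as a list of rows."""
    return math.sqrt(sum(entry * entry for row in a for entry in row))


# Example matrix with entries 3, 4, 0, 0: the sum of squares is 25.
example = [[3.0, 4.0], [0.0, 0.0]]
norm_value = matrix_norm(example)
```

For a vector (a matrix with a single column) this reduces to the usual Euclidean length.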
A.1 Proof of Theorem 1
The following lemma establishes mean square differentiability, as defined in Chamberlain (1986), with respect to the space of paths considered; in establishing the result it will prove convenient to define $g \equiv 1/\sigma$, and redefine the space of paths as $\lambda(\delta) = g_0(1 + (\delta - \delta_0)h)$.
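Lemma 1
$$f^{1/2}(y_i, x_i; \beta, \delta) - f^{1/2}(y_i, x_i; \beta_0, \delta_0) = \sum_{j=1}^{k} \psi_j(y_i, x_i; \beta_0, \delta_0)\,(\beta^{(j)} - \beta_0^{(j)}) + \psi_\lambda(y_i, x_i; \beta_0, \delta_0)\,(\delta - \delta_0) + r(y_i, x_i; \beta, \delta) \tag{A.1}$$
where the remainder satisfies
$$\lim_{\beta \to \beta_0,\ \delta \to \delta_0} \frac{\int r^2(y_i, x_i; \beta, \delta)\, d\mu(y_i, x_i)}{(\|\beta - \beta_0\| + |\delta - \delta_0|)^2} = 0 \tag{A.2}$$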
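Proof: Without any loss of generality, we set $y_i = 1$ and evaluate bounds for
$$\left\| \frac{\partial}{\partial \beta} f^{1/2}(1, x_i; \beta, \delta) \right\|^2$$
and
$$\left( \frac{\partial}{\partial \delta} f^{1/2}(1, x_i; \beta, \delta) \right)^2$$
For the first of the above expressions we have:
$$\left\| \frac{\partial}{\partial \beta} f^{1/2}(1, x_i; \beta, \delta) \right\|^2 \le \frac{1}{4}\, \phi(g_\delta x_i'\beta)\, \zeta(g_\delta x_i'\beta)\, g_\delta^2\, \|x_i\|^2$$
where $g_\delta \equiv \lambda(\delta)$ and $\zeta(\cdot) \equiv \phi(\cdot)/\Phi(\cdot)$. We note that $\phi(\cdot)\zeta(\cdot)$ is bounded on the real line, and $g_\delta$ is bounded on the support of $x_i$ for $\delta$ in a neighborhood of $\delta_0$. Also, since $E[\|x_i\|^2] < \infty$ by assumption, we have a function $q_1(x_i)$ such that
$$\left\| \frac{\partial}{\partial \beta} f^{1/2}(1, x_i; \beta, \delta) \right\|^2 \le q_1(x_i)$$
and $E[q_1(x_i)] < \infty$. For the second term, we have
$$\left( \frac{\partial}{\partial \delta} f^{1/2}(1, x_i; \beta, \delta) \right)^2 \le \frac{1}{4}\, \phi(g_\delta x_i'\beta)\, \zeta(g_\delta x_i'\beta)\, g_0^2\, h^2\, (x_i'\beta)^2$$
Since $h$ is bounded on the support of $x_i$ by assumption and $(x_i'\beta)^2 \le \|\beta\|^2 \|x_i\|^2$, we have for $(\beta, \delta)$ in a neighborhood of $(\beta_0, \delta_0)$, there exists a function $q_2(x_i)$ such that:
$$\left( \frac{\partial}{\partial \delta} f^{1/2}(1, x_i; \beta, \delta) \right)^2 \le q_2(x_i)$$
and $E[q_2(x_i)] < \infty$. Thus to establish mean square differentiability, we let $\theta = (\beta, \delta)$ and apply the mean value theorem, yielding:
$$f^{1/2}(y_i, x_i; \theta) - f^{1/2}(y_i, x_i; \theta_0) = \frac{\partial}{\partial \theta} f^{1/2}(y_i, x_i; \theta_0)'(\theta - \theta_0) + r(y_i, x_i; \theta)$$
where by the continuity of the partial derivatives in $\theta$, we have with probability 1,
$$\lim_{\theta \to \theta_0} \frac{r^2(y_i, x_i; \theta)}{\|\theta - \theta_0\|^2} = 0$$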
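The first dominance bound in this proof relies on the product of the error density and the density-to-distribution-function ratio being bounded on the real line. As a rough numerical illustration only, assuming for the sake of the example that the homoscedastic error is standard normal (an assumption made here; the helper names below are hypothetical), the product can be evaluated on a grid:

```python
import math


def normal_pdf(z):
    """Standard normal density (assumed error distribution for this sketch)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal c.d.f., written with erfc for accuracy in the left tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def pdf_squared_over_cdf(z):
    """Density times (density / c.d.f.), the product appearing in the bound."""
    return normal_pdf(z) ** 2 / normal_cdf(z)


# Evaluate on a grid over [-10, 10]; the range is kept moderate to avoid underflow.
grid = [i / 100.0 for i in range(-1000, 1001)]
largest = max(pdf_squared_over_cdf(z) for z in grid)
```

A grid search of this kind does not prove boundedness; it only illustrates that, under the normality assumption, the product stays small over a wide range of arguments.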
Thus by the dominance conditions established, mean square differentiability follows by the dominated convergence theorem. □
To establish the desired result, we let $\epsilon > 0$ be given and note that for a given path $\lambda$, we have:
$$I_{\lambda,1} \le 4 \int (\psi_1 - \psi_\lambda)^2 \, dF_X(x_i)$$
where $F_X(\cdot)$ denotes the c.d.f. of $x_i$. We note that
$$\int (\psi_1 - \psi_\lambda)^2 \, dF_X(x) =$$
0 ?1 0i 0 x x 0 i (x ) 1 ? (x ) 0 i 0 i
Z
0 0 2 2 ( xi( x0 ) )0?2 (xi )(x(1) i ? (xi 0 ) h) dFX (xi )
0 i
Noting that (z )(?z ) is bounded for all z 2