Computation of Bounds on Population Parameters When the Data Are Incomplete

JOEL L. HOROWITZ, CHARLES F. MANSKI, MARIA PONOMAREVA, and JÖRG STOYE
Department of Economics, Northwestern University, 2001 Sheridan Road, Evanston, IL 60208–2600, USA, e-mail: {joel-horowitz, cfmanski, m-ponomareva, j-stoye}@northwestern.edu

(Received: 17 October 2002; accepted: 20 March 2003)

Abstract. This paper continues our research on the identification and estimation of statistical functionals when the sampling process produces incomplete data due to missing observations or interval measurement of variables. Incomplete data usually cause population parameters of interest in applications to be unidentified except under untestable and often controversial assumptions. However, it is often possible to identify sharp bounds on these parameters. The bounds are functionals of the population distribution of the available data and do not rely on untestable assumptions about the process through which data become incomplete. They contain all logically possible values of the population parameters. Moreover, every parameter value within the bounds is consistent with some model of the process that generates incomplete data. The bounds can be estimated consistently by replacing the population distribution of the data with the empirical distribution in the functionals that give the bounds. In practice, this is straightforward in some circumstances but computationally burdensome in others; in general, the bounds are the solutions to non-convex mathematical programming problems that can be difficult to solve. Horowitz and Manski (Censoring of Outcomes and Regressors Due to Survey Nonresponse: Identification and Estimation Using Weights and Imputations, Journal of Econometrics 84 (1998), pp. 37–58; Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data, Journal of the American Statistical Association 95 (2000), pp. 77–84) studied nonparametric mean regression with missing data. In this paper, we first describe the general problem. We then present new findings on the computation of bounds on best linear predictors under square loss. We describe a genetic algorithm to compute sharp bounds and a minimax approach yielding simple but non-sharp outer bounds. We use actual data to demonstrate the computations.
1. Introduction

Inference from incomplete data is a common problem in empirical research. For example, attrition from a panel and non-response to one or more questions on a survey are causes of missing observations and, therefore, incomplete data. Incomplete data also arise when responses specify only intervals that contain the variable of interest. For example, a survey may ask which of several intervals contains the respondent’s income. Then the exact value of a respondent’s income is missing.
Whatever the specific cause of incomplete data, the generic consequence is that the population parameters of interest in an application are not identified unless one makes untestable and frequently controversial assumptions about the distribution of missing data. For example, identification is possible if the missing and non-missing data have the same probability distribution or the same distribution conditional on some observed covariates. However, the hypothesis that missing and non-missing data have the same probability distribution cannot be tested and may not be plausible in a given application. Although missing data make point identification problematic, it is often possible to identify sharp bounds on population parameters without making untestable assumptions about missing data. The sharp bounds on a parameter contain all logically possible values of that parameter (that is, all values that are consistent with the observed data and some process for generating the missing data), and they exhaust the information on the parameter that is available from the data. We have previously studied the structure of these bounds when the problem is nonparametric inference on a conditional expectation and some realizations of the outcome and/or covariates are missing [5], [7]. See, also, [19]. Vansteelandt and Goetghebeur [18] investigated settings in which only outcomes are incomplete. Manski and Tamer [12] examined nonparametric and parametric inference of conditional expectations when the outcome or one covariate is interval measured. Here we describe the general problem (Section 2), present new findings on the computation of bounds on best linear predictors under square loss (Sections 3 and 4), provide an empirical illustration (Section 4), and draw conclusions (Section 5). 
An appendix (Section 6) elaborates on the computation of bounds on best linear predictors by providing a simple example, analysis of the complexity of the problem, and details of the genetic algorithm that we apply. As in our previous work, the analysis in this paper is deliberately conservative. We emphasize mainly “worst case” scenarios in which the researcher has no prior information about the parameter of interest or the process that generates missing data. Our worst case approach contrasts with the “best case” approach that dominates the literature on inference from incomplete data. For example, it is a common practice to assume that data are missing completely at random (MCAR) and to perform analyses using only the non-missing data (e.g., survey responses in which all relevant questions were answered). Conventional methods for imputing missing data assume that data are missing at random conditional on specified covariates (e.g., [16]). On occasion, a model of non-random missing data may be specified (e.g., [4], [13]). Either way, the identification problem is solved, and efficiency of estimation becomes the central matter of concern to statisticians. We have emphasized in [5], [7], [10], and elsewhere that it is not sufficient for empirical researchers to know the inferences that can be made if specified assumptions hold. It is also important to be able to characterize the inferences that may be made without imposing these assumptions. An especially appealing feature of conservative analysis is that it enables establishment of a domain of consensus among researchers who may
hold disparate beliefs about what assumptions are appropriate. A further important feature of our approach is that it provides an indication of the relative importance of the data and untestable assumptions in uniquely identifying the value of a parameter (point identification). If the identified bounds are narrow, then the data are highly informative about the parameter. However, if the bounds are wide, then the data contain little information about the parameter. Point identification must then rely heavily on untestable assumptions, and different assumptions can lead to very different identified values of the parameter.

A mathematically complementary approach is to begin with some point-identifying assumption and examine how identification decays as this assumption is weakened in specified ways. Research of the latter kind is often referred to as sensitivity analysis. For example, studying the problem of missing outcome data, Rosenbaum [15] and Scharfstein, Rotnitzky, and Robins [17] investigate classes of departures from the point-identifying assumption that data are missing at random.

As is discussed in more detail in Sections 2 and 4, sharp bounds on unidentified population parameters are the solutions to mathematical programming problems whose constraints are linear but whose objective functions may be highly nonlinear. In some cases, these mathematical programming problems can be solved analytically. In other cases, they can be transformed into linear programs, and numerical solutions can be found relatively easily. However, in many important cases, the bounds are solutions to non-convex mathematical programming problems for which global optimization techniques are needed. We report our experience with a variety of algorithms. We find that none of these algorithms is entirely satisfactory and hope that this article will encourage further research on the development of more efficient algorithms for the bounding problem.
2. Identification of General Statistical Functionals

2.1. BACKGROUND

Most population parameters of interest in applications can be expressed as statistical functionals. Specifically, let $F$ be the cumulative distribution function (CDF) of the sampled population. Then a parameter $\theta$ typically can be written in the form $\theta = g(F)$ for some known functional $g$. The problem is to infer the scalar quantity $h(\theta)$, where $h$ is a known function. For example, if $\theta$ is a vector, then $h(\theta)$ might be one of its components. This framework encompasses a large class of estimation problems. In particular, parameters that solve familiar problems of best prediction under specified loss functions are statistical functionals. Examples include unconditional and conditional means and medians and the best linear predictor (BLP) under square loss. To illustrate, the BLP under square loss of a real-valued random variable $Y$ conditional on a vector random variable $X$ is $x\theta$, where

\[
\theta = (EX'X)^{-1} EX'Y = \left[ \int x'x \, dF_{yx} \right]^{-1} \int x'y \, dF_{yx}, \tag{2.1}
\]

$X$ is a row vector of explanatory variables, and $F_{yx}$ is the joint CDF of $(Y, X)$. The best logit predictor under square loss of a binary random variable is $\exp(x\theta) / [1 + \exp(x\theta)]$, where

\[
\theta = (EX'X)^{-1} \{ EX' \log[P(Y = 1 \mid X) / P(Y = 0 \mid X)] \}
= \left[ \int x'x \, dF_{yx} \right]^{-1} \int \{ x' \log[P(Y = 1 \mid X = x) / P(Y = 0 \mid X = x)] \} \, dF_{yx}.
\]
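As a hedged illustration of (2.1), the sketch below evaluates the BLP coefficients for a discrete joint distribution in the special case where X consists of an intercept and one scalar regressor, so that the matrix formula reduces to slope = Cov(x, y) / Var(x). The function name `blp_coefficients` and the four-point distribution are assumptions made only for illustration; they are not taken from the data analyzed in this paper.

```python
def blp_coefficients(support, probs):
    """Return (intercept, slope) of the best linear predictor of y given (1, x)
    under square loss, for a discrete joint distribution of (y, x).

    support: list of (y, x) mass points; probs: their probabilities.
    Illustrative helper (hypothetical name), scalar-regressor case only.
    """
    ey = sum(p * y for (y, x), p in zip(support, probs))
    ex = sum(p * x for (y, x), p in zip(support, probs))
    exx = sum(p * x * x for (y, x), p in zip(support, probs))
    exy = sum(p * x * y for (y, x), p in zip(support, probs))
    var_x = exx - ex ** 2
    if var_x <= 0:
        raise ValueError("x must have positive variance")
    slope = (exy - ex * ey) / var_x      # Cov(x, y) / Var(x)
    intercept = ey - slope * ex
    return intercept, slope


# Toy distribution (assumed): four equally likely mass points (y, x).
support = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (3.0, 1.0)]
probs = [0.25, 0.25, 0.25, 0.25]
intercept, slope = blp_coefficients(support, probs)
```

In this toy distribution E(Y | x = 0) = 0.5 and E(Y | x = 1) = 2, so the sketch returns an intercept of 0.5 and a slope of 1.5.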
When estimating statistical functionals, empirical researchers routinely report estimates based only on sample realizations that are completely observed. This practice is justified if the same population probability distribution generates the realizations that are completely and incompletely observed. Otherwise, it is usually not justified and can produce seriously misleading results. To illustrate, consider the BLP of $Y$ given $X$. Let $Z = 1$ if $(Y, X)$ is completely observed and $Z = 0$ otherwise. Let $F_{yx \mid z=1}$ denote the CDF of $(Y, X)$ conditional on $Z = 1$. Then standard practice is to estimate $x\theta_c$ instead of $x\theta$, where

\[
\theta_c = \left[ \int x'x \, dF_{yx \mid z=1} \right]^{-1} \int x'y \, dF_{yx \mid z=1}.
\]

Except in special cases, $\theta_c \neq \theta$ unless $F_{yx} = F_{yx \mid z=1}$ almost surely, in which case observations are missing completely at random. It is important to know what can be learned about a parameter of interest when the researcher has no prior information about the distribution of the missing data or the process that causes data to be missing. This motivates the research that is described in the remainder of this paper.

2.2. FORMULATION OF THE PROBLEM

Let $V = (Y, X)$ be a random vector. Later, it will be assumed that $Y$ is a scalar dependent variable and that $X \in \mathbb{R}^d$ is a vector of explanatory variables, but this distinction is not necessary for the general formulation that is presented in this section. We assume that $\theta$ is a continuous functional of the CDF of $V$ (relative to the supremum norm) and that $V$ is a discrete random variable with support $\{v_i : i = 1, \ldots, I\}$. The assumption that $V$ is discrete entails no significant loss of generality, because the CDF of a continuously distributed random variable can be approximated with arbitrary accuracy by a discrete CDF. Let $Z$ ($1 \le Z \le z_{\max}$) be an integer-valued random variable that indicates the state of missingness of $V$. Define $Z = 1$ if all components of $V$ are observed and $Z = z_{\max}$ if none are
observed. Intermediate values of $Z$ indicate combinations of components of $V$ that are observed and missing or interval measured. For example, $Z = 2$ might indicate that all components of $V$ but the last are observed. Also define $\pi_z = P(Z = z)$ and $p_{zi} = P(V = v_i \mid Z = z)$. The sampling process identifies $\pi_z$ for all $z \in \{1, \ldots, z_{\max}\}$ and $p_{1i}$ for all $i \in \{1, \ldots, I\}$. The remaining $p_{zi}$'s are not identified. However, there are restrictions on the values of the unidentified $p_{zi}$'s. To obtain these, define $S_z$ to be the support of the components of $V$ that are observed (non-missing) when $Z = z$. Suppose that there are $K_z$ distinct points in $S_z$. Let $q_{zk_z}$ denote the probability of point $k_z \in S_z$ conditional on $Z = z$. These probabilities are identified by the sampling process. When $Z = z$, write $V \in k_z$ if the non-missing components of $V$ correspond to the point $k_z \in S_z$. Then the following relations hold:

\begin{align*}
\sum_{i \,:\, v_i \in k_z} p_{zi} &= q_{zk_z} && (z = 2, \ldots, z_{\max} - 1;\ k_z = 1, \ldots, K_z), \tag{2.2a} \\
\sum_{i=1}^{I} p_{zi} &= 1 && (z = 2, \ldots, z_{\max}), \tag{2.2b} \\
p_{zi} &\ge 0 && (z = 2, \ldots, z_{\max};\ i = 1, \ldots, I). \tag{2.2c}
\end{align*}
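In addition, the probability mass function corresponding to $F$ can be written in the form

\[
P(V = v_i) = \sum_{z=1}^{z_{\max}} \pi_z p_{zi}.
\]

Since the probability mass function determines $F$ uniquely, $\theta$ can be written in the form

\[
\theta = g\left( \sum_{z=1}^{z_{\max}} \pi_z p_{z1},\ \sum_{z=1}^{z_{\max}} \pi_z p_{z2},\ \ldots,\ \sum_{z=1}^{z_{\max}} \pi_z p_{zI} \right).
\]

Therefore, $h(\theta)$ has the form

\[
h(\theta) = h\left[ g\left( \sum_{z=1}^{z_{\max}} \pi_z p_{z1},\ \sum_{z=1}^{z_{\max}} \pi_z p_{z2},\ \ldots,\ \sum_{z=1}^{z_{\max}} \pi_z p_{zI} \right) \right].
\]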
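The identification problem is now clear: $h(\theta)$ depends on probabilities $p_{zi}$ ($z \ge 2$) that are not identified by the sampling process. The identified sharp upper and lower bounds on $h(\theta)$ are the maximum and minimum values of $h(\theta)$ that are consistent with the constraints (2.2). Thus, the bounds are the optimal solutions to

\[
\underset{p_{zi} \,:\, z \ge 2,\ i = 1, \ldots, I}{\text{minimize (maximize)}}: \quad
h(\theta) = h\left[ g\left( \sum_{z=1}^{z_{\max}} \pi_z p_{z1},\ \sum_{z=1}^{z_{\max}} \pi_z p_{z2},\ \ldots,\ \sum_{z=1}^{z_{\max}} \pi_z p_{zI} \right) \right] \tag{NLP}
\]

subject to:

\begin{align*}
\sum_{i \,:\, v_i \in k_z} p_{zi} &= q_{zk_z} && (z = 2, \ldots, z_{\max} - 1;\ k_z = 1, \ldots, K_z), \\
\sum_{i=1}^{I} p_{zi} &= 1 && (z = 2, \ldots, z_{\max}), \\
p_{zi} &\ge 0 && (z = 2, \ldots, z_{\max};\ i = 1, \ldots, I).
\end{align*}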
(NLP) is a mathematical programming problem with linear constraints and a nonlinear objective function. Section 6.1 of the appendix illustrates (NLP) for a simple problem of best linear prediction. (NLP) can be solved analytically in special cases. In particular, Horowitz and Manski [5] showed that there is a simple analytic solution when the objective is to infer a conditional expectation nonparametrically and the incompleteness problem is either missing outcome data or jointly missing outcome and covariate data. For the same objective, Horowitz and Manski [7] provided an analytic solution when the outcome variable is binary and there is a general pattern of missingness (that is, when some observations are missing outcome data, others are missing covariate data, and still others are missing both outcomes and covariates). In general, however, analytic solution is not possible and numerical methods must be used. The difficulty of solving (NLP) numerically depends on the size of the mathematical programming problem (that is, the numbers of variables and constraints in (NLP)). Section 6.2 of the appendix presents some illustrations of the size of (NLP). The difficulty of solving (NLP) also depends on the details of the objective function. In some cases, such as nonparametric estimation of a conditional mean function, (NLP) is a fractional linear program that can be solved relatively easily (e.g., [19]). In general, however, (NLP) has multiple local optima, and a global optimization technique is needed. In Section 4, we study best linear prediction under square loss and present an empirical illustration. We have experimented with a genetic algorithm, simulated annealing, and branch-and-bound methods to solve the resulting version of (NLP). We have also experimented with a simulation technique. In this approach, missing observations are assigned values that are sampled randomly from the uniform distribution over the set of logically possible values. 
This produces a complete pseudo dataset from which θ can be computed. The bounds on any component of θ are then estimated by the maximum and minimum values of this component over many imputations. This produces inner bounds, that is, bounds that are subsets of the sharp bounds. However, the inner bounds approach the sharp bounds as the number of repetitions of the imputation-estimation process increases. We applied these methods to the problem of finding bounds on the best linear predictor using data that are described in Section 4. We found that the genetic algorithm solves the required version of (NLP) more rapidly than do the other procedures, but even the genetic algorithm required many hours of computing. Thus, the question of what is the best algorithm for solving (NLP) remains open and is an important topic for further research.
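The simulation technique described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results in this paper: the data are a made-up toy sample with a binary regressor and a binary outcome, only outcomes are missing (marked `None`), and the names `ols_slope` and `simulated_inner_bounds` are hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on (1, x) for a complete dataset."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def simulated_inner_bounds(xs, ys, y_values, n_draws, seed=0):
    """Impute each missing outcome uniformly from y_values, recompute the
    slope, and track the smallest and largest slope over all imputations."""
    rng = random.Random(seed)
    lo, hi = float("inf"), float("-inf")
    for _ in range(n_draws):
        filled = [y if y is not None else rng.choice(y_values) for y in ys]
        s = ols_slope(xs, filled)
        lo, hi = min(lo, s), max(hi, s)
    return lo, hi


# Toy sample (assumed): two of six outcomes are missing.
xs = [0, 0, 0, 1, 1, 1]
ys = [0, 1, None, 1, None, 1]
lo, hi = simulated_inner_bounds(xs, ys, y_values=[0, 1], n_draws=200)
```

With a binary regressor the slope is the difference in group means, so the sharp bounds for this toy sample are 0 and 2/3. The simulated interval `[lo, hi]` can never be wider than that, which is the sense in which the procedure yields inner bounds.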
In some applications, information about the distribution of missing data or the process through which data become missing is available. Such information can be incorporated into (NLP) by adding constraints. For example, suppose that V can be partitioned (Va , Vb ) and that Vb is MCAR. Let Za = 1 if Va is observed and Za = 0 otherwise. Let Zb = 1 if Vb is observed and Zb = 0 otherwise. Define Z = (Za , Zb ). The assumption that Vb is MCAR implies that
\[
P(V = v_i \mid Z_a = z_a, Z_b = z_b) = P(V = v_i \mid Z_a = z_a)
\]

for $z_a, z_b = 0$ or $1$. This is equivalent to

\[
p_{zi} = \sum_{\zeta \in \{z_a, z_b \,:\, z_a = A\}} \pi_\zeta \, p_{\zeta i} \qquad (i = 1, \ldots, I;\ A = 0, 1). \tag{M}
\]
Therefore, the assumption that $V_b$ is MCAR can be incorporated in (NLP) by adding the constraints (M).

2.3. ESTIMATION AND INFERENCE

The optimal solution to (NLP) is a function of the population parameters $\pi_z$ ($z = 1, \ldots, z_{\max}$) and $q_{zk_z}$ ($z = 2, \ldots, z_{\max}$; $k_z = 1, \ldots, K_z$). Under the assumption that the objective function of (NLP) is Lipschitz continuous, the optimal solution can be estimated consistently by replacing $\pi_z$ and $q_{zk_z}$ with empirical analogs. To do this, let $\{V_i, Z_i : i = 1, \ldots, n\}$ be a random sample from the distribution of $Z$ and of the non-missing components of $V$ conditional on $Z$. Then the empirical analog of $\pi_z$ is

\[
\hat{\pi}_z = n^{-1} \sum_{i=1}^{n} I(Z_i = z),
\]

where $I(\cdot)$ is the indicator function. The empirical analog of $q_{zk_z}$ is

\[
\hat{q}_{zk_z} = \frac{1}{\hat{\pi}_z n} \sum_{i=1}^{n} I(V_i \in k_z, Z_i = z).
\]
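The two empirical analogs amount to relative frequencies, which the following sketch makes explicit. It assumes each sampled unit is summarized by a pair `(z, k)`, where `z` is the missingness state and `k` is a label for the observed components (a point of $S_z$); this encoding, the labels, and the function name `empirical_analogs` are illustrative assumptions.

```python
from collections import Counter


def empirical_analogs(sample):
    """Return (pi_hat, q_hat) for a list of (z, k) pairs.

    pi_hat[z]      = share of observations in missingness state z
    q_hat[(z, k)]  = share of state-z observations whose observed
                     components equal the point k
    """
    n = len(sample)
    z_counts = Counter(z for z, _ in sample)
    zk_counts = Counter(sample)
    pi_hat = {z: c / n for z, c in z_counts.items()}
    q_hat = {(z, k): c / (pi_hat[z] * n) for (z, k), c in zk_counts.items()}
    return pi_hat, q_hat


# Toy sample (assumed): state 1 = complete, state 2 = outcome missing.
sample = [(1, "a"), (1, "b"), (1, "a"), (2, "x0"),
          (2, "x1"), (2, "x1"), (2, "x1"), (1, "a")]
pi_hat, q_hat = empirical_analogs(sample)
```

Here half of the toy observations fall in each state, and three of the four state-2 observations have observed components equal to `"x1"`, so `q_hat[(2, "x1")]` is 0.75.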
Obtaining confidence intervals for the bounds produced by solving (NLP) presents a greater challenge. Let $h_{\min}$ and $h_{\max}$, respectively, denote the lower and upper bounds on $h$. Let $\hat{h}_{\min}$ and $\hat{h}_{\max}$ denote the estimators that are obtained by replacing $\{\pi_z, q_{zk_z}\}$ with $\{\hat{\pi}_z, \hat{q}_{zk_z}\}$ in (NLP). Here, we describe confidence intervals that have known (asymptotic) probabilities of containing both $h_{\min}$ and $h_{\max}$; other types of confidence intervals are studied in Imbens and Manski [8]. To obtain a symmetrical $1 - \alpha$ confidence interval, we seek functions $L_n(\hat{\pi}_z, \hat{q}_{zk_z} : z = 1, \ldots, z_{\max};\ k_z = 1, \ldots, K_z) \equiv L_n$ and $U_n(\hat{\pi}_z, \hat{q}_{zk_z} : z = 1, \ldots, z_{\max};\ k_z = 1, \ldots, K_z) \equiv U_n$ such that $P(L_n \le h_{\min},\ h_{\max} \le U_n) = 1 - \alpha$.

If the objective function of (NLP) is twice differentiable and the optimal solution to (NLP) satisfies $p_{zi} > 0$ for all $z = 2, \ldots, z_{\max}$ and $i = 1, \ldots, I$, then $h_{\min}$ and $h_{\max}$ are smooth functions of the population moments $\{\pi_z, q_{zk_z}\}$. See, for example, Horowitz and Manski [7] for cases in which this happens. The delta method can be used to show that $(\hat{h}_{\min}, \hat{h}_{\max})$ is asymptotically bivariate normally distributed with mean $(h_{\min}, h_{\max})$ and a covariance matrix that can be estimated consistently. This approach is unattractive, however, because the expressions for the asymptotic covariance matrices are very lengthy and, therefore, tedious to calculate. Asymptotically valid confidence intervals can be obtained more easily by using the bootstrap. This consists of carrying out a Monte Carlo simulation in which the data are sampled randomly with replacement (bootstrap sampling). Each bootstrap sample is used to compute a bootstrap estimate of $(h_{\min}, h_{\max})$, denoted $(h^*_{\min}, h^*_{\max})$, by solving (NLP) with bootstrap analogs of $\{\pi_z, q_{zk_z}\}$ in place of the population quantities. By repeated bootstrap sampling, the distribution of $(h^*_{\min}, h^*_{\max})$ conditional on the data can be estimated with arbitrary accuracy. The estimated distribution is used to find $c_{n\alpha}$ such that $P^*(h^*_{\min} - c_{n\alpha} \le \hat{h}_{\min},\ \hat{h}_{\max} \le h^*_{\max} + c_{n\alpha}) = 1 - \alpha$, where $P^*$ is the probability measure induced by bootstrap sampling conditional on the data. The bootstrap $1 - \alpha$ confidence interval for $(h_{\min}, h_{\max})$ is $[\hat{h}_{\min} - c_{n\alpha},\ \hat{h}_{\max} + c_{n\alpha}]$. It has asymptotic coverage probability $1 - \alpha$ [2].

In general, the optimal solution to (NLP) sets $p_{zi} = 0$ for some $(z, i)$ pairs. This creates an estimation problem in which the true parameter point (the vector consisting of the $p_{zi}$'s) is on a boundary of the parameter set. Consequently, $\hat{h}_{\min}$ and $\hat{h}_{\max}$ are not smooth functions of sample moments, and the methods of the previous paragraph cannot be used to obtain a confidence interval for $(h_{\min}, h_{\max})$ [1]. A procedure that can be used is the subsampling method of Politis and Romano [14]. In this procedure, subsamples of size $m < n$ are drawn by sampling the data randomly without replacement. The subsamples are used, as in the bootstrap, to find $c_{m\alpha}$ such that $\hat{P}_m(\hat{h}_{m,\min} - c_{m\alpha} \le \hat{h}_{\min},\ \hat{h}_{\max} \le \hat{h}_{m,\max} + c_{m\alpha}) = 1 - \alpha$, where $\hat{P}_m$ is the probability measure induced by subsampling and $\hat{h}_{m,\min}$ and $\hat{h}_{m,\max}$ are the lower and upper bounds obtained by solving (NLP) after replacing $\{\pi_z, q_{zk_z}\}$ with their empirical analogs computed from a subsample. The resulting asymptotic $1 - \alpha$ confidence interval for $(h_{\min}, h_{\max})$ is $[\hat{h}_{\min} - (m/n)^{1/2} c_{m\alpha},\ \hat{h}_{\max} + (m/n)^{1/2} c_{m\alpha}]$.
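The calibration of the critical value can be sketched in a few lines once the resampled bounds are available. The sketch assumes that each bootstrap (or subsample) replication has already produced a pair of bounds by solving (NLP); that step is not shown. A replication satisfies both inequalities exactly when the constant is at least max(lower* − ĥ_min, ĥ_max − upper*), so the critical value is taken as an empirical (1 − α) quantile of that quantity. The function names and the five toy replications are hypothetical.

```python
import math


def critical_value(draws, h_min_hat, h_max_hat, alpha):
    """draws: list of (h_min_star, h_max_star) pairs from resampling.
    Returns the smallest replication-specific constant c such that at least
    a share 1 - alpha of the replications satisfy
        h_min_star - c <= h_min_hat  and  h_max_hat <= h_max_star + c.
    """
    needed = sorted(max(lo - h_min_hat, h_max_hat - hi) for lo, hi in draws)
    rank = math.ceil((1.0 - alpha) * len(needed))
    return needed[max(rank, 1) - 1]


def confidence_interval(h_min_hat, h_max_hat, c, scale=1.0):
    """scale = 1 for the bootstrap; scale = sqrt(m / n) for subsampling."""
    return h_min_hat - scale * c, h_max_hat + scale * c


# Toy replications (assumed values, for illustration only).
draws = [(1.1, 2.1), (0.9, 1.8), (1.0, 2.0), (1.3, 2.2), (0.8, 1.5)]
c = critical_value(draws, h_min_hat=1.0, h_max_hat=2.0, alpha=0.2)
ci = confidence_interval(1.0, 2.0, c)
```

In practice the number of replications would be far larger than five; the tiny example only makes the quantile step easy to follow.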
3. Outer Bounds on the BLP

Because the computation of sharp bounds can be quite difficult, it is useful to investigate simpler bounding methods that can be implemented easily with at most minor enhancements to standard statistical software. This section, extending a minimax approach introduced in [6], describes easily implemented methods for obtaining outer bounds for the coefficients $\theta$ of the BLP. Outer bounds are upper (lower) bounds that are above (below) the sharp upper (lower) bounds of Section 2. Section 3.1 obtains minimax outer identification regions that treat incomplete observations as entirely missing. Section 3.2 shows how the resulting bounds can be narrowed when the components of the explanatory variable $X$ include binary indicator variables whose only possible values are 0 and 1. Section 3.3 shows how the bounds can be narrowed by using the partial data available in incomplete observations.

3.1. MINIMAX OUTER IDENTIFICATION REGIONS

This section obtains minimax bounds that treat all incomplete observations as entirely missing. Let $Z = 1$ if $(Y, X)$ is completely observed and $Z = 0$ otherwise. For $z = 0, 1$, let $\pi_z = P(Z = z)$. Let $S$ denote the support of $(Y, X)$. For any vector $\tau$ whose dimension is the same as that of $X$, define $S_1(\tau) = E[(Y - \tau'X)^2 \mid Z = 1]$ and $S_0(\tau) = E[(Y - \tau'X)^2 \mid Z = 0]$. Thus, $S_1(\tau)$ and $S_0(\tau)$, respectively, are the expected squared losses over complete and incomplete observations. Then

\[
E(Y - \tau'X)^2 = S_1(\tau)\pi_1 + S_0(\tau)\pi_0.
\]

Let $h_L(\tau) = \inf_{(y, x) \in S} (y - \tau'x)^2$ and $h_U(\tau) = \sup_{(y, x) \in S} (y - \tau'x)^2$. Then $E(Y - \tau'X)^2$ can be bounded as follows:

\[
S_1(\tau)\pi_1 + h_L(\tau)\pi_0 \le E(Y - \tau'X)^2 \le S_1(\tau)\pi_1 + h_U(\tau)\pi_0.
\]

Moreover, since $h_L(\tau) \ge 0$, we have the simpler (though looser) bounds

\[
S_1(\tau)\pi_1 \le E(Y - \tau'X)^2 \le S_1(\tau)\pi_1 + h_U(\tau)\pi_0.
\]

Now consider two vectors, $\tau_0$ and $\tau$. If the lower bound on $E(Y - \tau'X)^2$ exceeds the upper bound on $E(Y - \tau_0'X)^2$, then $\tau$ cannot minimize squared loss. Formally,

\[
S_1(\tau)\pi_1 > S_1(\tau_0)\pi_1 + h_U(\tau_0)\pi_0 \ \Rightarrow\ \tau \neq \arg\min_t E(Y - t'X)^2. \tag{3.1}
\]

It follows from (3.1) that any choice of “benchmark vector” $\tau_0$ yields a minimax outer identification region that is given by

\[
B_m(\tau_0) = \{\tau : S_1(\tau)\pi_1 \le S_1(\tau_0)\pi_1 + h_U(\tau_0)\pi_0\}. \tag{3.2}
\]

Equivalently,

\[
B_m(\tau_0) = \{\tau : S_1(\tau) - S_1(\tau_0) \le h_U(\tau_0)\pi_0 / \pi_1\}. \tag{3.3}
\]

Now consider (3.3). Let $\tau^* = \arg\min_t E[(Y - t'X)^2 \mid Z = 1]$. That is, $\tau^*$ minimizes squared loss over complete observations. Then

\[
S_1(\tau) - S_1(\tau_0) = S_1(\tau) - S_1(\tau^*) + S_1(\tau^*) - S_1(\tau_0)
= (\tau - \tau^*)' E(XX' \mid Z = 1)(\tau - \tau^*) + S_1(\tau^*) - S_1(\tau_0).
\]

Therefore, (3.3) can be written

\[
B_m(\tau_0) = \{\tau : (\tau - \tau^*)' E(XX' \mid Z = 1)(\tau - \tau^*) \le S_1(\tau_0) - S_1(\tau^*) + h_U(\tau_0)\pi_0 / \pi_1\}.
\]

It is clear from this expression that $B_m(\tau_0)$ is a (possibly truncated) ellipsoid centered at $\tau^*$. Moreover, the class of minimax identification sets is ordered by set inclusion. Accordingly, define

\[
\tau_{opt} = \arg\min_t\, [S_1(t)\pi_1 + h_U(t)\pi_0]
\]
whenever this quantity exists. Then $\tau_{opt}$ is the optimal benchmark in the sense that $B_m(\tau_{opt}) \subseteq B_m(\tau)$ for all $\tau$, and $B_m(\tau_{opt}) \subset B_m(\tau)$ if $S_1(\tau)\pi_1 + h_U(\tau)\pi_0 \neq S_1(\tau_{opt})\pi_1 + h_U(\tau_{opt})\pi_0$. Moreover, $\tau_{opt}$ is unique if $E(XX' \mid Z = 1)$ is nonsingular. In addition, the optimized identification region $B_m(\tau_{opt})$ is unique even if $\tau_{opt}$ is not. Computation of $\tau_{opt}$ is straightforward. Finding the largest and smallest values of components of $\tau$ within $B_m(\tau_{opt})$ is also easy because these quantities are solutions to mathematical programming problems with linear objective functions and convex feasible regions.

3.2. THE DOMINANCE CRITERION

The minimax outer bounds of Section 3.1 can sometimes be narrowed if $X$ has components whose only possible values are 0 and 1. Let $Y_L$ and $Y_U$, respectively, denote the lower and upper bounds of the support of $Y$. Let $S_X$ denote the support of $X$. Let $X^{(j)}$ denote the $j$-th component of $X$. Then it is easy to prove the following proposition.

PROPOSITION 3.1. Let $X^{(j)}$ be a binary component of $X$.

(i) Suppose that

\[
\min_{x \in S_X,\ x^{(j)} = 1} \tau'x > Y_U \quad \text{or} \quad \max_{x \in S_X,\ x^{(j)} = 1} \tau'x < Y_L.
\]

Then $\tau \neq \arg\min_t E(Y - t'X)^2$.

(ii) Let $X$ include a constant (say $X^{(1)} = 1$). Let $(X^{(j)}, \ldots, X^{(k)})$ be binary components of $X$ satisfying $\sum_{l=j}^{k} X^{(l)} \le 1$. Suppose that

\[
\min_{x \in S_X,\ x^{(j)} = \cdots = x^{(k)} = 0} \tau'x > Y_U \quad \text{or} \quad \max_{x \in S_X,\ x^{(j)} = \cdots = x^{(k)} = 0} \tau'x < Y_L.
\]

Then $\tau \neq \arg\min_t E(Y - t'X)^2$.

This proposition generates several constraints that must be satisfied by the BLP. Assume that $X^{(j)}$ is a binary component of $X$. Then application of (i) yields

\[
\max_{x \in S_X,\ x^{(j)} = 1} \tau'x \ge Y_L, \qquad \min_{x \in S_X,\ x^{(j)} = 1} \tau'x \le Y_U. \tag{3.5}
\]

Moreover, for any set of binary components $(X^{(j)}, \ldots, X^{(k)})$ satisfying $\sum_{l=j}^{k} X^{(l)} \le 1$, (ii) yields

\[
\max_{x \in S_X,\ x^{(j)} = \cdots = x^{(k)} = 0} \tau'x \ge Y_L, \qquad \min_{x \in S_X,\ x^{(j)} = \cdots = x^{(k)} = 0} \tau'x \le Y_U. \tag{3.6}
\]
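Part (i) of the proposition is easy to check numerically for a candidate coefficient vector. The sketch below is an illustration under stated assumptions: the support of X is supplied as a finite list of tuples, the first component is an intercept, and the function name `violates_dominance` and the example numbers are hypothetical.

```python
def violates_dominance(tau, x_support, y_lo, y_hi, j):
    """Return True if tau is ruled out by part (i) of the proposition for
    the binary component with index j: over all support points with
    x[j] == 1, the fitted values tau'x lie entirely above y_hi or entirely
    below y_lo."""
    fitted = [sum(t * xi for t, xi in zip(tau, x))
              for x in x_support if x[j] == 1]
    if not fitted:
        return False  # no support point with x[j] == 1, nothing to check
    return min(fitted) > y_hi or max(fitted) < y_lo


# Toy support (assumed): intercept plus one dummy, outcome in [0, 100].
x_support = [(1, 0), (1, 1)]
too_high = violates_dominance((50, 60), x_support, 0, 100, j=1)   # fits 110
feasible = violates_dominance((50, 10), x_support, 0, 100, j=1)   # fits 60
too_low = violates_dominance((-20, 5), x_support, 0, 100, j=1)    # fits -15
```

A vector that fails this check cannot be the BLP, so it can be discarded from the outer region; vectors that pass satisfy the pair of inequalities derived from part (i).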
The outer bounds of Section 3.1 can sometimes be narrowed by imposing (3.5) and, when applicable, (3.6). That is, one finds the largest and smallest values of the components of $\tau$ that are within $B_m(\tau_{opt})$ and satisfy (3.5) and (3.6). An improvement occurs in cases where constraint (3.5) or (3.6) is binding.

3.3. USING PARTIAL DATA

The minimax bounds can be improved further by using the partial data available in incomplete observations. Thus far, incomplete observations have been treated as entirely missing; the upper bound $h_U(\tau) = \sup_{(y, x) \in S} (y - \tau'x)^2$ on expected loss supposes that any partially observed realization of $(y, x)$ assumes the same “worst-case” value. This upper bound may not be feasible if some components of $(y, x)$ are observed, in which case the maximum loss caused by $\tau$ can be bounded more tightly. To formalize this, we replace the earlier upper bound on expected loss with

\[
h_U^{partial}(\tau) \equiv \max_{p_{zi} \,:\, z \ge 2,\ i = 1, \ldots, I} E(Y - \tau'X)^2 \quad \text{s.t. the constraints from (NLP)},
\]

where $Z \in \{1, \ldots, z_{\max}\}$ as in Section 2. Given that the objective function is additively separable in the observations, $h_U^{partial}(\tau)$ can be evaluated by sequentially considering all incomplete observations and imputing their missing components in a worst-case fashion. This is computationally inexpensive. In some cases, computation of the new optimal benchmark

\[
\tau_{opt}^{partial} \equiv \arg\min_t \{S_1(t)\pi_1 + h_U^{partial}(t)\pi_0\} \tag{3.8}
\]

is tractable. The implied benchmark loss, $S_1(\tau_{opt}^{partial})\pi_1 + h_U^{partial}(\tau_{opt}^{partial})\pi_0$, can then be substituted for the analogous expression in (3.2), yielding a tighter identification region $B_m(\tau_{opt}^{partial})$. Even where program (3.8) is too complex to solve, use of the partial observations without re-optimization of the benchmark vector typically narrows the identification region.

4. Computation of Bounds on the BLP: An Empirical Illustration

This section uses actual data to demonstrate computation of sharp bounds and minimax outer bounds on the coefficients of the best linear predictor under square loss. The empirical illustration described here provides information on the properties of algorithms and on the informativeness of the findings in a realistic setting. Manski and Straub [11] analyzed data on worker expectations of job loss collected between 1994 and 1998 in the (U.S.) nationwide Survey of Economic Expectations (SEE). Employed SEE respondents were asked to state their subjective probabilities of job loss in the coming year. The specific question posed was:

I would like you to think about your employment prospects over the next 12 months. What do you think is the percent chance that you will lose your job during the next 12 months?
Responses take values in the range [0, 100]. Respondents were also asked to provide information about various covariates, including age, race, and income. Most respondents provided responses to the expectations and covariate questions, but some refused to answer one or more questions. Thus, some outcome and covariate data are missing. Manski and Straub [11] assumed that missing data are MCAR and used the observations with complete data to estimate BLPs of job-loss expectations given various covariates. Here, we estimate bounds on BLPs without imposing assumptions on the process that generates missing data. The bounds are obtained by solving the version of (NLP) that results from replacing the objective function with (2.1) and replacing $\{\pi_z, q_{zk_z}\}$ with $\{\hat{\pi}_z, \hat{q}_{zk_z}\}$. It follows from (2.1) that the BLP of $Y$ given $X$ is

\[
\mathrm{BLP}(x) = x \left[ \int x'x \, dF_{yx} \right]^{-1} \int x'y \, dF_{yx}.
\]

This can be written in the form

\[
\mathrm{BLP}(x) = x \left[ \sum_{z=1}^{z_{\max}} \sum_{i=1}^{I} \pi_z p_{zi} X_i' X_i \right]^{-1} \sum_{z=1}^{z_{\max}} \sum_{i=1}^{I} \pi_z p_{zi} X_i' Y_i.
\]

Therefore, consistent estimators of sharp bounds on $\mathrm{BLP}(x)$ are given by the solutions to the nonlinear programming problems

\[
\underset{p_{zi} \,:\, z \ge 2,\ i = 1, \ldots, I}{\text{minimize (maximize)}}: \quad
x \left[ \sum_{z=1}^{z_{\max}} \sum_{i=1}^{I} \hat{\pi}_z p_{zi} X_i' X_i \right]^{-1} \sum_{z=1}^{z_{\max}} \sum_{i=1}^{I} \hat{\pi}_z p_{zi} X_i' Y_i \tag{BLP}
\]

subject to:

\begin{align*}
\sum_{i \,:\, v_i \in k_z} p_{zi} &= \hat{q}_{zk_z} && (z = 2, \ldots, z_{\max} - 1;\ k_z = 1, \ldots, K_z), \\
\sum_{i=1}^{I} p_{zi} &= 1 && (z = 2, \ldots, z_{\max}), \\
p_{zi} &\ge 0 && (z = 2, \ldots, z_{\max};\ i = 1, \ldots, I).
\end{align*}
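To make the structure of this program concrete, the following sketch solves a deliberately tiny instance by exhaustive grid search. It is a stand-in for illustration only and is not the genetic algorithm used for the reported results. The assumptions are: Y and x are both binary, X consists of an intercept and x, state 1 is complete, and in state 2 only the outcome is missing, so the free probabilities can be parameterized by `a[x]` = P(Y = 1 | x, state 2). All names and numbers are hypothetical.

```python
def blp_slope(pmf):
    """Slope of the BLP of y on (1, x) for pmf = {(y, x): probability}."""
    ey = sum(p * y for (y, x), p in pmf.items())
    ex = sum(p * x for (y, x), p in pmf.items())
    exx = sum(p * x * x for (y, x), p in pmf.items())
    exy = sum(p * x * y for (y, x), p in pmf.items())
    return (exy - ex * ey) / (exx - ex ** 2)


def mixture(pi1, p1, q2, a):
    """P(V = v) = pi1 * p1(v) + (1 - pi1) * p2(v), where p2 respects the
    identified marginal q2 of x in the outcome-missing state."""
    pmf = {}
    for x in (0, 1):
        for y in (0, 1):
            p2 = q2[x] * (a[x] if y == 1 else 1.0 - a[x])
            pmf[(y, x)] = pi1 * p1[(y, x)] + (1.0 - pi1) * p2
    return pmf


def grid_bounds(pi1, p1, q2, steps=20):
    """Minimize and maximize the slope over a grid of feasible a[0], a[1]."""
    grid = [k / steps for k in range(steps + 1)]
    values = [blp_slope(mixture(pi1, p1, q2, {0: a0, 1: a1}))
              for a0 in grid for a1 in grid]
    return min(values), max(values)


# Toy inputs (assumed): 80% complete cases, x observed when y is missing.
p1 = {(0, 0): 0.3, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.4}
q2 = {0: 0.5, 1: 0.5}
lower, upper = grid_bounds(0.8, p1, q2)
```

In this toy case the slope works out to 0.32 + 0.2 (a[1] − a[0]), so the bounds are 0.12 and 0.52 and are attained at corners of the feasible set. With several covariates and general missingness patterns the objective is no longer monotone in the free probabilities and a grid becomes infeasible, which is why a global optimization method is needed.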
The results reported here were obtained by using a genetic algorithm to solve (BLP). Judd [9] provides a general discussion of genetic algorithms. The specific algorithm that we used is given in Section 6.3 of the appendix. We also report minimax outer bounds that were computed using the methods of Section 3. In all cases, the computed bounds take the support of (y, x) to be the empirical support of the data. We have computed bounds on BLPs corresponding to two sets of covariates. One set consists of an intercept (that is, a covariate whose value is always 1 and is never missing), two race and three age dummies. One race dummy takes the value 1 if the respondent is black and 0 otherwise. The other race dummy takes the value 1
if the respondent is non-black and non-white (e.g., Asian) and 0 otherwise. The age dummies are indicators of the age groups 18–24 years, 25–49 years, and 50–64 years. The second set consists of the first set plus income. The income variable takes 7 values that are assigned as follows:

Reported income, Irep ($)      Value of income variable ($)
Irep < 10,000                  5,000
10,000 ≤ Irep < 20,000         15,000
20,000 ≤ Irep < 30,000         25,000
30,000 ≤ Irep < 40,000         35,000
40,000 ≤ Irep < 50,000         45,000
50,000 ≤ Irep < 60,000         55,000
60,000 ≤ Irep                  112,000
This relatively coarse grid of income values was chosen for computational reasons. With a finer grid, (BLP) becomes a very high-dimensional nonlinear optimization problem, and solving it repeatedly, as we do for reasons that are explained below, is very time-consuming. The grid points are the means of the within-cell observed incomes and span most of the range of observed incomes. Fewer than 2% of observed incomes exceed $112,000.

Some respondents reported a subjective probability of job loss, Y, during the next year of 50%. There is controversy in the survey research literature over whether people who report a subjective probability of 50% are expressing an opinion or are effectively not responding, in which case a response of 50% should be treated as an observation in which the outcome variable is missing [3]. We do not take a position on this controversy here. Rather, we report bounds on the BLPs under the assumption that 50% chances of job loss are non-missing and under the assumption that they are missing.

For each set of covariates, we computed bounds on the BLP when all but one of the covariates are zero. This is equivalent to finding bounds on the coefficients of the linear least-squares fit of Y to the covariates. Thus, we computed 24 bounds for the BLPs without the income variable (lower and upper bounds on 6 coefficients with and without Y = 50% treated as missing) and 28 bounds for the BLPs with the income variable (lower and upper bounds on 7 coefficients with and without Y = 50% treated as missing). This entails solving (BLP) 24 times without the income variable and 28 times with it. There were a total of 3860 respondents to the survey. Table 1 displays the pattern of missing data in the responses.

The computations were carried out in MATLAB on a 1.7 GHz microcomputer. Solving (BLP) without the income variable took approximately 12 min per bound. Thus, the complete set of computations for the BLPs without the income variable took about 4.8 hrs.
With the income variable, solving (BLP) took about 2.5 hr. per bound. Thus, the complete set of computations for the BLPs with income took about 70 hours. There are two reasons for the longer
Table 1. Pattern of missingness in the expectations data.

Missing variables              Y = 50% non-missing    Y = 50% missing
None                                          3077               2857
Outcome only                                   128                348
Age only                                         4                  4
Race only                                       16                 15
Income only                                    571                527
Outcome and age                                  0                  0
Outcome and race                                 1                  2
Outcome and income                              41                 85
Age and race                                     0                  0
Age and income                                   5                  4
Race and income                                 12                 10
Outcome, age, and race                           0                  0
Outcome, age, and income                         1                  2
Outcome, race, and income                        1                  3
Age, race, and income                            3                  3
Total                                         3860               3860
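As a quick arithmetic check on Table 1, the share of respondents with at least one missing variable follows from the "None" row and the total. The sketch below is illustrative only; the variable and function names are hypothetical and the counts are transcribed from the table.

```python
# Counts transcribed from Table 1 ("None" row and "Total" row).
complete_nonmissing = 3077   # complete cases, Y = 50% treated as non-missing
complete_missing = 2857      # complete cases, Y = 50% treated as missing
total = 3860                 # all respondents


def share_incomplete(n_complete, n_total):
    """Fraction of respondents with one or more missing variables."""
    return (n_total - n_complete) / n_total


share_a = share_incomplete(complete_nonmissing, total)
share_b = share_incomplete(complete_missing, total)
print(f"Y = 50% non-missing: {share_a:.1%}")
print(f"Y = 50% missing:     {share_b:.1%}")
```

Rounded to whole percentages, the two shares are 20% and 26%.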
computing time with the income BLPs. First, there are two additional sets of bounds when income is included (lower and upper bounds for the income variable with and without Y = 50% treated as missing). Second, the addition of income to the BLP increases the dimension of the optimization problem. Without the income variable, (BLP) has 1372 variables and 39 equality constraints with Y = 50% treated as non-missing. There are 1295 variables and 38 constraints with Y = 50% treated as missing. With the income variable, (BLP) has 7188 variables and 174 equality constraints with Y = 50% treated as non-missing. There are 6660 variables and 184 constraints with Y = 50% treated as missing. The size of (BLP) depends on whether Y = 50% is treated as missing because this affects the number of mass points in the empirical support of (Y, X) and the number of ways that observations can be missing.

Tables 2–3 display the results of solving (BLP). We do not report bootstrapped confidence intervals because their computation is very time-intensive. We call attention to two substantive features of the results. First, the bounds on the coefficient of the dummy for “black” lie above zero when observations for which Y = 50% are treated as non-missing. Manski and Straub [11], who treated observations with Y = 50% as non-missing and analyzed only the subsample with complete data, reported that blacks tend to have notably higher subjective probabilities of job loss than do whites. This finding is largely confirmed under the conservative analysis of missing data that is carried out here when observations for which Y = 50% are treated as non-missing. Second, the bounds on the coefficients of the dummies for age show considerable overlap and all cover zero. Manski and Straub [11] reported
Table 2. Estimated bounds from (NLP) with covariates age and race.

                         Y = 50 treated as non-missing    Y = 50 treated as missing
Variable                 Lower bound    Upper bound       Lower bound    Upper bound
Intercept                      10.17          23.34              4.87          27.59
Age (18−24)                    −9.28           7.93            −17.20          16.60
Age (25−49)                    −9.62           6.81            −17.14          15.17
Age (50−64)                   −12.42           6.06            −18.89          13.40
Black                           3.27          15.72             −7.16          26.25
Other non-white race           −3.48           8.43            −10.21          14.95
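The widths of the intervals in Table 2 can be compared directly across the two treatments of Y = 50%. The following sketch is only an arithmetic illustration; the list and function names are hypothetical, and the numbers are transcribed from Table 2.

```python
# Bounds transcribed from Table 2, in the order:
# Intercept, Age (18-24), Age (25-49), Age (50-64), Black, Other non-white race.
labels = ["Intercept", "Age (18-24)", "Age (25-49)", "Age (50-64)",
          "Black", "Other non-white race"]
lower_nonmissing = [10.17, -9.28, -9.62, -12.42, 3.27, -3.48]
upper_nonmissing = [23.34, 7.93, 6.81, 6.06, 15.72, 8.43]
lower_missing = [4.87, -17.20, -17.14, -18.89, -7.16, -10.21]
upper_missing = [27.59, 16.60, 15.17, 13.40, 26.25, 14.95]


def widths(lower, upper):
    """Width of each interval [lower, upper]."""
    return [u - l for l, u in zip(lower, upper)]


w_nonmissing = widths(lower_nonmissing, upper_nonmissing)
w_missing = widths(lower_missing, upper_missing)

# Ratio of interval widths: Y = 50% missing relative to non-missing.
ratios = [wm / wn for wm, wn in zip(w_missing, w_nonmissing)]
for name, r in zip(labels, ratios):
    print(f"{name:22s} width ratio = {r:.2f}")
```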
Table 3. Estimated bounds from (NLP) with covariates age, race, and income.

                         Y = 50 treated as non-missing    Y = 50 treated as missing
Variable                 Lower bound    Upper bound       Lower bound    Upper bound
Intercept                       7.06          30.82             −8.08          37.95
Income (in $1,000)             −0.21           0.14             −0.29           0.23
Age (18−24)                    −9.37          10.33            −16.75          19.23
Age (25−49)                    −7.84          11.21            −14.30          20.29
Age (50−64)                   −10.54          12.00            −16.09          21.00
Black                           0.58          15.72            −10.38          23.06
Other non-white race           −5.34           8.38            −12.54          13.90
that subjective probabilities of job loss tend to decrease as age increases. This substantive conclusion cannot be drawn under the conservative analysis carried out here.

The results in Tables 2–3 also illustrate the effects of missing observations on the information that the data provide about the parameters of interest. Treating Y = 50% as missing increases the proportion of individuals for whom one or more variables is missing from 20% to 26%. However, the widths of the intervals spanned by the estimated bounds approximately double. Thus, a relatively small increase in the proportion of incomplete observations causes a large reduction in the information about population parameters that is available from the data. Horowitz and Manski [5] found similar losses of information from an increase in the rate of non-response to a survey.

Tables 4 and 5 display minimax outer bounds obtained using the methods of Section 3. In all cases the optimal benchmark is used and observations with Y = 50% are treated as non-missing. The left two columns of Tables 4A and 5A report the simplest and least informative bounds, described in Section 3.1. The right
Table 4. Estimated outer bounds with covariates age and race.

A. Not using dominance criterion.

                  Not using partial data          Using partial data
Variable          Lower bound    Upper bound      Lower bound    Upper bound
Intercept              −99.42         125.60           −88.72         114.90
Age (18−24)           −125.39         130.63          −113.21         118.46
Age (25−49)           −113.95         116.39          −103.00         105.44
Age (50−64)           −123.45         120.67          −111.85         109.06
Race: black            −71.34          91.04           −63.62          83.32
Race: other            −79.70          83.87           −71.92          76.09

B. Using dominance criterion.

                  Not using partial data          Using partial data
Variable          Lower bound    Upper bound      Lower bound    Upper bound
Intercept              −67.79         121.47           −61.35         113.23
Age (18−24)           −123.18         103.30          −113.09          94.99
Age (25−49)           −109.20          82.70          −100.85          76.25
Age (50−64)           −120.99          91.62          −110.10          84.04
Race: black            −63.62          91.04           −57.18          83.32
Race: other            −69.50          83.87           −63.10          76.09
two columns of Tables 4B and 5B report refined bounds that use the dominance restriction of Section 3.2 and partial data as described in Section 3.3. The other columns of the tables report results using one refinement or the other. The outer bounds are always much wider than the bounds obtained by solving (BLP) (Tables 2–3). Thus, the minimax outer bounds of Section 3 do not provide an adequate approximation to the bounds that are obtained by solving (BLP). However, computation of these outer bounds using MATLAB was very inexpensive. Determination of an optimal benchmark required no more than 10 seconds of computing. Following determination of the benchmark, computation of bounds that do not use the dominance restriction likewise required only seconds. Implementation of the dominance restriction was somewhat more burdensome. In all, computation of the entries in Tables 4 and 5 took approximately 10 minutes each.

5. Conclusions

Incomplete data are a pervasive problem in empirical research. Missing observations or interval-measured data cause population quantities of interest to be unidentified unless untestable assumptions are made about the probability distribution of the missing data. However, sharp bounds on population parameters are often available
Table 5. Estimated outer bounds with covariates age, race, and income.

A. Not using dominance criterion.

                  Not using partial data          Using partial data
Variable          Lower bound    Upper bound      Lower bound    Upper bound
Intercept             −222.22         251.66          −121.95         151.38
Income                  −0.40           0.38            −0.24           0.21
Age (18−24)           −261.08         264.06          −149.96         152.94
Age (25−49)           −240.62         241.18          −138.67         139.23
Age (50−64)           −256.56         252.50          −148.84         144.78
Race: black           −142.15         161.63           −77.87          97.35
Race: other           −156.12         160.25           −89.17          93.30

B. Using dominance criterion.

                  Not using partial data          Using partial data
Variable          Lower bound    Upper bound      Lower bound    Upper bound
Intercept             −221.25         251.64          −121.17         151.38
Income                  −0.40           0.38            −0.24           0.21
Age (18−24)           −259.13         262.54          −149.94         151.78
Age (25−49)           −240.27         238.47          −138.67         137.30
Age (50−64)           −255.51         249.22          −148.83         142.48
Race: black           −141.15         161.57           −77.15          97.35
Race: other           −155.31         160.23           −88.55          93.30
without making untestable assumptions about the process that makes the data incomplete. The bounds exhaust the information that is available from the data. In previous research, we showed that the bounds can often be calculated analytically when the objective is to infer a conditional mean function nonparametrically. More generally, however, the bounds are solutions to nonlinear mathematical programming problems and must be computed numerically. This paper has shown that the computations are tractable in some important leading cases, but further research is necessary to improve the speed of computation and to develop a complete understanding of the computational issues that are involved in solving the general mathematical programming problem. An empirical illustration based on real data has demonstrated the ability of the bounding procedure to provide certain substantively useful results without making untestable assumptions about missing data.
Appendices

A.1. AN EXAMPLE OF PROBLEM (NLP)

This section presents a simple example that illustrates the formulation of problem (NLP). Let Y and X be binary random variables whose only possible values are 0 and 1. The example formulates (NLP) for finding bounds on the slope parameter of the BLP of Y conditional on X when there are missing data. It follows from (2.1) that the slope parameter is $\theta = E(YX)/E(X^2)$. There are four mass points of V = (Y, X). These are

i    (Y, X)
1    (0, 0)
2    (1, 0)
3    (0, 1)
4    (1, 1)

There are four possible missingness patterns. These are

z    Missing Vars
1    None
2    Only X
3    Only Y
4    X and Y
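$$E(YX) = \sum_{z=1}^{4} \pi_z p_{z4}, \qquad E(X^2) = \sum_{z=1}^{4} \pi_z (p_{z3} + p_{z4}).$$

It follows that the objective function of (NLP) is

$$\frac{\sum_{z=1}^{4} \pi_z p_{z4}}{\sum_{z=1}^{4} \pi_z (p_{z3} + p_{z4})}.$$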
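The equality constraints (2.2a) are

$$\begin{aligned}
p_{21} + p_{23} &= P(Y = 0 \mid Z = 2),\\
p_{22} + p_{24} &= P(Y = 1 \mid Z = 2),\\
p_{31} + p_{32} &= P(X = 0 \mid Z = 3),\\
p_{33} + p_{34} &= P(X = 1 \mid Z = 3).
\end{aligned}$$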
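The binary example is small enough to solve by enumeration. Because the objective is a ratio of linear functions of the $p_{zi}$ with a positive denominator, its extrema over the constraint polytope occur at vertices, so the sketch below simply evaluates the objective at every vertex. It imposes the constraints (2.2a) together with the requirement that each vector $(p_{z1}, \ldots, p_{z4})$ sums to one. All numerical inputs (the pattern probabilities, the complete-data distribution, and the two observed conditional probabilities) are hypothetical values chosen for illustration, and vertex enumeration is a substitute chosen for this tiny case, not the solution method used in the paper.

```python
from itertools import product

# Hypothetical inputs, chosen only for illustration (not from the paper).
pi = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}   # P(Z = z) for patterns z = 1..4
p1 = [0.3, 0.2, 0.1, 0.4]               # P(V = v_i | Z = 1), fully identified
prob_y1_z2 = 0.5                        # P(Y = 1 | Z = 2), X missing
prob_x1_z3 = 0.6                        # P(X = 1 | Z = 3), Y missing

# Mass points are ordered i = 1..4 as (Y, X) = (0,0), (1,0), (0,1), (1,1).


def theta(p2, p3, p4):
    """Slope E(YX) / E(X^2) for given conditional distributions p_z."""
    rows = {1: p1, 2: p2, 3: p3, 4: p4}
    num = sum(pi[z] * rows[z][3] for z in rows)
    den = sum(pi[z] * (rows[z][2] + rows[z][3]) for z in rows)
    return num / den


def vertices_z2(y1):
    """Vertices for z = 2: p21 + p23 = P(Y=0|Z=2), p22 + p24 = P(Y=1|Z=2)."""
    a, b = 1.0 - y1, y1
    return [[a * (1 - s), b * (1 - t), a * s, b * t]
            for s, t in product((0, 1), repeat=2)]


def vertices_z3(x1):
    """Vertices for z = 3: p31 + p32 = P(X=0|Z=3), p33 + p34 = P(X=1|Z=3)."""
    a, b = 1.0 - x1, x1
    return [[a * (1 - s), a * s, b * (1 - t), b * t]
            for s, t in product((0, 1), repeat=2)]


# For z = 4 only the adding-up constraint binds: vertices of the unit simplex.
vertices_z4 = [[1.0 if i == k else 0.0 for i in range(4)] for k in range(4)]

values = [theta(p2, p3, p4)
          for p2, p3, p4 in product(vertices_z2(prob_y1_z2),
                                    vertices_z3(prob_x1_z3),
                                    vertices_z4)]
lower, upper = min(values), max(values)
print(f"bounds on theta: [{lower:.3f}, {upper:.3f}]")
```

Enumeration is feasible here because the feasible set has only 64 vertices; the number of vertices grows rapidly with the number of mass points and missingness patterns.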
The equality constraints (2.2b) are

$$p_{21} + p_{22} + p_{23} + p_{24} = 1, \qquad p_{31} + p_{32} + p_{33} + p_{34} = 1, \qquad p_{41} + p_{42} + p_{43} + p_{44} = 1.$$

The variables of the problem are $\{p_{zi} : z = 2, 3, 4;\ i = 1, \ldots, 4\}$, so there are 12 variables and 7 equality constraints.

A.2. THE SIZE OF PROBLEM (NLP)

This appendix illustrates the size of (NLP) in the special case that each component of V has the same number of mass points. Define m = dim(V). Suppose that each component of V has J mass points. Then there are $J^m$ possible mass points of V. Suppose that each has non-zero probability. Then the support of V contains $J^m$ mass points. There are $2^m$ different missingness patterns and $2^m - 1$ patterns in which at least one component of V is missing. For each pattern, there are $J^m$ values of $p_{zi}$ (one for each mass point i). Therefore, the total number of unidentified $p_{zi}$'s is $(2^m - 1)J^m$. This is the number of variables of optimization in (NLP).

Now consider the equality constraints. Suppose that for missingness pattern z, there are $m_z$ observed variables. Then there are $J^{m_z}$ mass points in the support of the observed (non-missing) components of V. There are $\binom{m}{m_z}$ ways to choose a missingness pattern with $m_z$ observed variables, so there are $\binom{m}{m_z} J^{m_z}$ identified conditional probabilities associated with missingness patterns in which there are $m_z$ observed variables. Each conditional probability gives a constraint of the form (2.2a), so the total number of (2.2a) constraints is

$$\sum_{m_z=1}^{m-1} \binom{m}{m_z} J^{m_z}.$$

There is one (2.2b) constraint for each missingness pattern except the pattern in which there are no missing variables. Therefore, the number of (2.2b) constraints is $2^m - 1$. The total number of equality constraints is

$$2^m - 1 + \sum_{m_z=1}^{m-1} \binom{m}{m_z} J^{m_z}.$$
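The counts derived above are easy to tabulate. The following sketch, with a hypothetical function name `nlp_size`, evaluates the number of variables and the number of equality constraints for given m and J under the same assumption that every component of V has J mass points.

```python
from math import comb


def nlp_size(m, J):
    """Return (number of variables, number of equality constraints) of (NLP).

    m is the number of components of V and J is the number of mass points
    of each component.
    """
    n_variables = (2**m - 1) * J**m
    # (2.2a): one constraint per identified conditional probability, summed
    # over patterns with m_z = 1, ..., m - 1 observed variables.
    n_2a = sum(comb(m, m_z) * J**m_z for m_z in range(1, m))
    # (2.2b): one adding-up constraint per pattern with something missing.
    n_2b = 2**m - 1
    return n_variables, n_2a + n_2b


for m in (2, 3, 4, 5):
    for J in (5, 10, 15, 20, 25):
        n_vars, n_cons = nlp_size(m, J)
        print(f"m={m} J={J:2d} variables={n_vars:>13,} constraints={n_cons:>9,}")
```

For m = 2 and J = 2 the function returns 12 variables and 7 constraints, which matches the binary example of Appendix A.1.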
Table 6. The size of problem (NLP).

m     J     Number of variables    Number of equality constraints
2     5                      75                                13
2    10                     300                                23
2    15                     675                                33
2    20                   1,200                                43
2    25                   1,875                                53
3     5                     875                                97
3    10                   7,000                               337
3    15                  23,625                               727
3    20                  56,000                             1,267
3    25                 109,375                             1,957
4     5                   9,375                               685
4    10                 150,000                             4,655
4    15                 759,375                            14,925
4    20               2,400,000                            34,495
4    25               5,859,375                            66,365
5     5                  96,875                             4,681
5    10               3,100,000                            61,081
5    15              23,540,625                           289,231
5    20              99,200,000                           884,131
5    25           3.027 × 10^8                          2,115,781
Table 6 gives examples of the sizes of the problems (NLP) that are generated.

A.3. THE GENETIC ALGORITHM FOR MAXIMIZATION

Define $w = \{p_{zi} : z = 2, \ldots, z_{\max};\ i = 1, \ldots, I\}$. Denote the objective function by Q(w). Let C > 0 be a finite constant such that Q(w) + C > 0 for some feasible w. Set $\tilde{Q}(w) = \max[0, Q(w) + C]$.

Step 1: For each $z = 2, \ldots, z_{\max}$, draw a random sample of m = 50 values of the vector $\{p_{zi} : i = 1, \ldots, I\}$ from the uniform distribution over the set $T_z$ that consists of all values of $\{p_{zi} : i = 1, \ldots, I\}$ satisfying

$$\sum_{i:\, v_i \in k_z} p_{zi} = \hat{q}_{z k_z}, \quad k_z = 1, \ldots, K_z; \qquad \sum_{i=1}^{I} p_{zi} = 1; \qquad p_{zi} \ge 0, \quad i = 1, \ldots, I.$$

Combine these $p_{zi}$'s into 50 realizations of the vector w. Denote these by $W \equiv \{w_j : j = 1, \ldots, 50\}$.
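One way to carry out the uniform draw in Step 1 for a single pattern z can be sketched as follows. The sketch assumes that the cells $k_z$ partition the mass points and that the $\hat{q}_{z k_z}$ sum to one, so that $T_z$ is a product of scaled simplices; a uniform draw is then obtained by normalizing independent unit exponentials within each cell. The function name, the cell layout, and the numerical values of $\hat{q}$ are hypothetical.

```python
import random


def draw_from_T_z(cells, q_hat, n_points, rng):
    """Draw one vector (p_z1, ..., p_zI) uniformly from T_z.

    cells[k] lists the indices of the mass points in cell k (assumed to
    partition range(n_points)); q_hat[k] is the identified probability of
    cell k, so the p_zi in that cell must sum to q_hat[k].
    """
    p = [0.0] * n_points
    for idx, q in zip(cells, q_hat):
        # Normalized unit exponentials are uniform on the unit simplex;
        # scaling by q makes them uniform on the cell's scaled simplex.
        e = [rng.expovariate(1.0) for _ in idx]
        total = sum(e)
        for i, e_i in zip(idx, e):
            p[i] = q * e_i / total
    return p


# Illustration with the binary example of A.1 for pattern z = 2 (X missing):
# mass points 0 and 2 have Y = 0, mass points 1 and 3 have Y = 1.
rng = random.Random(0)
cells = [[0, 2], [1, 3]]
q_hat = [0.4, 0.6]          # hypothetical P(Y = 0 | Z = 2), P(Y = 1 | Z = 2)
sample = [draw_from_T_z(cells, q_hat, 4, rng) for _ in range(50)]
print(sample[0])
```

Repeating the draw for every pattern z and concatenating the results gives the 50 candidate vectors w.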
Step 2: Let $\{f_j : j = 1, \ldots, m\}$ be the probability mass function on W that is defined by

$$f_j = \frac{\tilde{Q}(w_j)}{\sum_{k=1}^{50} \tilde{Q}(w_k)}.$$
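The probability mass function of Step 2 can be computed as in the sketch below, where $\tilde{Q}(w) = \max[0, Q(w) + C]$ as defined at the start of this appendix. The function names and the objective values in the illustration are hypothetical.

```python
def q_tilde(q_value, c):
    """Shifted and truncated objective: max[0, Q(w) + C]."""
    return max(0.0, q_value + c)


def selection_pmf(q_values, c):
    """Return f_j = Q~(w_j) / sum_k Q~(w_k) for a list of objective values."""
    shifted = [q_tilde(q, c) for q in q_values]
    total = sum(shifted)
    if total <= 0.0:
        # C must be large enough that Q(w) + C > 0 for at least one candidate.
        raise ValueError("all shifted objective values are zero; increase C")
    return [s / total for s in shifted]


# Hypothetical objective values Q(w_j) for three candidates, with C = 0.2.
pmf = selection_pmf([-0.2, 0.1, 0.5], c=0.2)
print(pmf)
```

Candidates whose shifted objective is zero receive zero probability, and the remaining probabilities are proportional to $Q(w_j) + C$.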
6. Horowitz, J. L. and Manski, C. F.: Imprecise Identification from Incomplete Data, in: Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications, 2001, http://ippserv.rug.ac.be/~isipta01/proceedings/index.html.
7. Horowitz, J. L. and Manski, C. F.: Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data, Journal of the American Statistical Association 95 (2000), pp. 77–84.
8. Imbens, G. and Manski, C.: Confidence Intervals for Partially Identified Parameters, Department of Economics, University of California at Berkeley, 2003, in process.
9. Judd, K. L.: Numerical Methods in Economics, The MIT Press, Cambridge, 1998.
10. Manski, C. F.: Identification Problems in the Social Sciences, Harvard University Press, Cambridge, 1995.
11. Manski, C. F. and Straub, J.: Worker Perceptions of Job Insecurity in the Mid-1990s: Evidence from the Survey of Economic Expectations, Journal of Human Resources 35 (2000), pp. 447–479.
12. Manski, C. F. and Tamer, E.: Inference on Regressions with Interval Data on a Regressor or Outcome, Econometrica 70 (2002), pp. 519–546.
13. Myrtveit, I., Stensrud, E., and Olsson, U.: Analysing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods, IEEE Transactions on Software Engineering 27 (2001), pp. 999–1013.
14. Politis, D. N. and Romano, J. P.: Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions, Annals of Statistics 22 (1994), pp. 2031–2050.
15. Rosenbaum, P.: Observational Studies, Springer-Verlag, New York, 1995.
16. Rubin, D.: Inference and Missing Data, Biometrika 63 (1976), pp. 581–590.
17. Scharfstein, D., Rotnitzky, A., and Robins, J.: Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models, Journal of the American Statistical Association 94 (1999), pp. 1096–1120.
18. Vansteelandt, S. and Goetghebeur, E.: Analyzing the Sensitivity of Generalized Linear Models to Incomplete Outcomes Via the IDE Algorithm, Journal of Computational and Graphical Statistics 10 (2001), pp. 656–672.
19. Zaffalon, M.: Exact Credal Treatment of Missing Data, Journal of Statistical Planning and Inference 105 (2002), pp. 105–122.