A Bayesian Interpretation of Extremum Estimators

Mahmoud A. El-Gamal
Department of Economics, 1180 Observatory Drive, University of Wisconsin, Madison, WI 53706.

First Draft: December 1991. Revised: February 14, 1997.

Abstract

Extremum estimation is typically an ad hoc semi-parametric estimation procedure which is justified only on the basis of the asymptotic properties of the estimators. For a fixed finite data set, consider a large number of investigations using different extremum estimators to estimate the same parameter vector. The resulting empirical distribution of point estimates can be shown to coincide with a Bayesian posterior measure on the parameter space induced by a minimum information procedure. The Bayesian interpretation serves a number of purposes, ranging from lending legitimacy to the use of those procedures in small sample problems, to helping prove asymptotic properties by reference to Bayes central limit theorems, to laying a foundation for combining point estimates from various extremum estimation experiments for statistical decision processes.

I am grateful to Kim Border, Peter Bossaerts, David Grether, the participants in the NSF-NBER conference on Bayesian Statistics and Econometrics (in honor of Edwin Jaynes, St. Louis, April 1992), and seminar participants at Arizona, Caltech, Cornell, Rice, Rochester, Texas A&M, U.C. Santa Barbara, and Wisconsin, for their many useful comments. I am also grateful to an anonymous referee for a number of useful suggestions. Any remaining errors are, of course, my own.

A Bayesian Interpretation of Extremum Estimators

Mahmoud A. El-Gamal

1 Introduction

Perhaps the most popular class of estimators in use today is that of semi-parametric extremum estimators. This class includes maximum likelihood estimation as a special case, but also covers a large number of circumstances where we wish to estimate a low-dimensional vector of parameters of interest while the likelihood function depends on a (possibly infinite-dimensional) vector of nuisance parameters. In this paper, we consider the situation where a large number of extremum estimators are available for estimating a vector of parameters of interest. The asymptotic properties of the various extremum estimators may or may not be known. Given a data set, we ask whether there is a mathematical equivalence between the (empirical) distribution of point estimates arising from that family of extremum estimators and some probability measure on the parameter space of interest which arises from a Bayesian procedure. For instance, in the GMM literature, first order conditions from a structural model, or orthogonality conditions from a reduced form statistical model, give rise to a family of moment conditions:

$E_t[g_i(x_{t+1}, \theta_0)] = 0, \quad i = 1, 2, \ldots; \quad t = 1, \ldots, T-1,$

where $x = \{x_1, \ldots, x_T\} \in X \subseteq \mathbb{R}^{lT}$ is the observed data, and $\theta_0 \in \Theta \subseteq \mathbb{R}^d$ is the unknown parameter of interest.[1] Those conditions in turn imply that the unconditional moment restrictions, where $E_t$ is replaced by $E$, have to hold. One picks a finite number (say $m$) of those moment conditions. In vector notation, let us write those $m$ moment conditions as $E[\tilde g(x_t, \theta_0)] = \vec 0$. The GMM estimator then approximates the moments $E[\tilde g(x_t, \theta)]$ by their sample counterpart $\tilde g_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} \tilde g(x_t, \theta)$, and finds a value of $\theta$ that minimizes the quadratic form $Q_T(\theta) = \tilde g_T(\theta)'\, W_T^{-1}\, \tilde g_T(\theta)$, for some $m \times m$ positive definite matrix $W_T$ (which depends on the data). The literature (e.g. Hansen (1982)) then concentrates on the large sample properties of this estimator; namely consistency, asymptotic normality, and asymptotic efficiency. Now, if the model is correctly specified, it does not matter for consistency purposes which collection of moments (or how many of them) one picks, as long as some basic identification requirements are satisfied. Given the chosen collection of moments for the GMM exercise, under certain regularity conditions, an optimal $W_T$ (one which minimizes the asymptotic variance of the estimator within the family) can then be derived (e.g. Hansen (1982)).

[1] All the results in this paper, and their interpretations, will remain intact if we were to replace the Euclidean parameter and data spaces by any complete separable metric space.

Given a finite data set, different investigations use different subsets of the data, different collections of moment conditions, and different weighting matrices (i.e. different instruments), getting different parameter estimates. We typically study the asymptotics of a single estimation experiment as the data set gets large. Consider studying a different type of asymptotics. From a Bayesian point of view, let us condition on the observed data set $\{x_1, \ldots, x_T\}$, and consider the empirical distribution of parameter estimates from a family of extremum estimators as the number of investigations varies. Of course, a Bayesian interpretation of this empirical distribution of point estimates does not necessarily produce better small sample performance. It does, however, provide a logically consistent procedure (namely, Bayesian updating) which reproduces the same distribution of point estimates as a Bayesian posterior. Such logical consistency is valuable in itself, and under suitable conditions may facilitate the investigation of asymptotic properties of the estimators, as discussed below. With minor loss in generality, but no loss in intuition, one can think throughout this paper of the GMM example discussed above. Our Bayesian interpretation can serve a number of purposes:

1. It can give some legitimacy to the use of semi-parametric procedures whose optimality properties are known only asymptotically, by relating them to Bayesian procedures that have well known logical and axiomatic foundations regardless of the sample size. This purpose is similar to that of Shore (1984), where under restrictive assumptions, Shore shows that maximum likelihood and minimum-relative-entropy classification are equivalent.
He then argues that ML classification is incorrect if it is not assumed that one of the hypotheses is true, but that its equivalence to minimum-relative-entropy classification justifies its use in those cases. The result in Shore (1984) is specific to a single maximum-likelihood classifier, and it assumes a specific functional form for the likelihood functions. The result we prove in this paper provides a more general interpretation for general classes of extremum estimators without specifying their functional form.

2. Where Bayesian laws of large numbers and central limit theorems apply, the interpretation can be applied to study the asymptotic properties of a single extremum estimator (the estimate being a single random draw from our Bayesian posterior).

3. Our interpretation justifies combining a number of non-Bayesian results to get a Bayesian prior (as in El-Gamal (1993)), which can then be used to calculate Bayes risks for policy makers who do not have sufficient reason to choose only one of many studies on which to base their decisions.

4. Given the interpretation, and given a large class of extremum estimators at our disposal, we can randomly draw estimators to get a distribution of point estimates which we can then treat as a Bayes posterior. This can, for instance, help in resolving the data-mining problems that arise when one chooses moment conditions and instruments for a GMM exercise.
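As a concrete numerical illustration of the GMM estimator described above, the following sketch (our own toy setup, not from the paper) estimates the mean $\theta_0$ of normal data from the $m = 2$ moment conditions $E[x_t - \theta] = 0$ and $E[x_t^2 - \theta^2 - 1] = 0$, minimizing the quadratic form $Q_T(\theta)$ over a grid on a compact parameter space:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0
x = rng.normal(theta0, 1.0, size=500)    # observed data {x_1, ..., x_T}

def gbar(theta):
    """Sample counterpart gbar_T(theta) of the m = 2 moment conditions."""
    return np.array([np.mean(x - theta),
                     np.mean(x**2 - theta**2 - 1.0)])

W_inv = np.eye(2)                        # inverse weighting matrix; identity here

def Q(theta):
    """Quadratic form Q_T(theta) = gbar_T(theta)' W_T^{-1} gbar_T(theta)."""
    g = gbar(theta)
    return float(g @ W_inv @ g)

# minimize over a grid on a compact parameter space
thetas = np.linspace(-5.0, 5.0, 2001)
theta_hat = thetas[np.argmin([Q(t) for t in thetas])]
print(theta_hat)                         # close to theta0 = 2.0
```

Different choices of the moment set, the subsample, or the weighting matrix $W_T$ would produce different point estimates from the same data, which is exactly the variation across "investigations" exploited in the paper.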

The rest of the paper proceeds as follows. In section 2, we rigorously define extremum estimation, and state the assumptions under which a procedure of randomly drawing extremum estimators from a given family can give rise to our Bayesian interpretation. In that section, sufficient conditions are given under which the induced empirical distribution of parameter estimates (which depends on the actual observed data set) defines a random field with the strong Markovian property. In section 3, we present a Bayesian method whereby we find the measure on $\Theta$ which is closest to our prior (in the sense of cross entropy, relative entropy, or I-divergence) given some linear constraints on our posterior which depend on the data. We then show that, for our data set $x$, the probability measure induced on $\Theta$ by the random drawing of extremum estimators from the family described in section 2 can be perfectly mimicked by some information minimization Bayesian procedure. In section 4, we conclude the paper by discussing in more detail how our interpretation can be helpful in achieving the four goals enumerated above.

2 Randomly Drawn Extremum Estimators

Let $X \subseteq \mathbb{R}^n$ be a data space, and let $\mathcal{X}$ be the Borel $\sigma$-algebra generated by the open subsets of $X$. Our statistical experiment is:

$\mathcal{E} = \{(X, \mathcal{X}),\ P_{\theta,\eta};\ \theta \in \Theta,\ \eta \in H\},$

where $P_{\theta,\eta}$ is a family of probability measures indexed by our parameter of interest $\theta \in \Theta \subseteq \mathbb{R}^d$ and the nuisance parameter $\eta \in H$. Now, our extremum estimator is defined by the optimand $Q\colon X \times \Theta \to \mathbb{R}_+$. A typical optimand would be the $Q_T(\theta)$ defined in the introduction for a GMM estimator. Our estimator is defined by $q_x = \arg\min_{\theta \in \Theta} Q(x, \theta)$. For this class of estimators, the procedure to prove consistency and asymptotic normality of the estimates under certain regularity conditions has become quite standard (e.g. LeCam (1986, pp. 305-323), or Amemiya (1985, pp. 105-109)). The proof methodology finds sufficient conditions to approximate the minimization of $Q$ on $\Theta$ by the minimization of a quadratic approximation to $Q$ on a linear tangent space to $\Theta$. One can then write Taylor series expansions of the optimand, and apply laws of large numbers and central limit theorems to its terms. On that basis, this class of estimators has become very popular in recent years.[2]

[2] Of course, the GMM estimators and their Bayesian interpretation as developed in this paper are dependent on the correctness of the assumed model. If the model is misspecified, those procedures may produce arbitrarily erroneous conclusions. Moreover, in cases where the small sample properties of the extremum estimators (e.g. GMM) are poor, it is likely that the statistical properties of our induced Bayesian posterior will also be poor. All a Bayesian interpretation guarantees is logical consistency; but logical consistency does not ensure good statistical properties.

We consider a large class of optimands $Q_c\colon X \times \Theta \to \mathbb{R}_+$ which define various potential extremum estimators, indexed by a set of "configurations" $C$. A configuration entails the choice of a (structural or reduced form) model, a subsample of the data set on which to conduct the study, and an extremum estimation procedure (least squares, maximum likelihood, instrumental variables, etc.). Note well: in reality there is a single true $x \in X$ which all extremum estimators will use. Different investigations using different data sets are viewed in this context as using different subsamples of the data set $x$, the choice of these subsamples being captured in the configurations. For any given true $x$, the collection of extremum estimators we obtain with the different configurations will generate a probability measure on $\Theta$, to which we will give a Bayesian interpretation. We model the optimands $Q_c$ for extremum estimation experiments as randomly drawn from some configuration space. We formalize that assumption as follows:

(A.0) Extremum estimators are obtained by drawing configurations $c$ from the probability space $(C, \mathcal{C}, \nu)$.

Note well: The measure $\nu$ on $(C, \mathcal{C})$ need not give equal weight to all point estimators (configurations) $c \in C$. In that sense, the assumed randomness of extremum estimators does not imply simple ad hoc randomness, and assumption (A.0) allows for a rich class of possible sampling processes which favor some subclasses of estimators (e.g. ones which arise from more parsimonious models, etc.). We now make the following assumptions about the set of potential optimands $\{Q_c : c \in C\}$:

(A.1) The parameter space $\Theta$ is compact.

(A.2) Each $Q_c\colon X \times \Theta \to \mathbb{R}_+$ is continuous.

Assumptions (A.1) and (A.2) guarantee the existence of a minimizer of $Q_c(x, \cdot)$. We further assume:

(A.3) $Q_c(x, \cdot)$ has a unique minimizer in $\Theta$, a.s. $[\nu]$.

Assumptions (A.1) and (A.2) are standard, and they are extensively used for proving asymptotic properties of extremum estimators, e.g. Amemiya (1985, Assumptions (A) and (B), p. 107). Assumption (A.3) is different from the standard assumptions, e.g. Amemiya (1985, Assumption (C), p. 107), where $Q$ is assumed to converge in probability to a non-stochastic function which has a unique global minimum (as the sample size $T$ goes to infinity). Now, under (A.1), (A.2), and (A.3), we can define, almost surely $[\nu]$, our estimate of $\theta_0$ by a function $q\colon X \to \Theta$. For a given data set $x \in X$ and configuration $c \in C$, we write our estimate as $q_x(c) = \arg\min_{\theta} Q_c(x, \theta)$, which is well defined and single valued almost surely $[\nu]$. Now, let $\mathcal{T}$ be the Borel $\sigma$-algebra of subsets of $\Theta$; then, for a given $x \in X$, the measure $\nu$ on $(C, \mathcal{C})$ generates, through the random variable $q_x\colon C \to \Theta$, a measure on $(\Theta, \mathcal{T})$, defined for $A \in \mathcal{T}$ as the probability of drawing a configuration which, together with the given data set $x$, gives an estimate in $A$, i.e.:

$P\{A \mid x\} = \nu\{c \in C : q_x(c) \in A\}.$

To avoid notational problems later on, we emphasize the following:

- $q_\cdot(c)$ is a function $X \to \Theta$.
- $q_x(\cdot)$ is a measurable function from $(C, \mathcal{C}, \nu)$ to $(\Theta, \mathcal{T}, P\{\cdot \mid x\})$, i.e. it is a random variable taking values in $\Theta$.
- $q_\cdot(\cdot)$ is a function $X \times C \to \Theta$ which will later define our random field, indexed by all possible data sets $x \in X$ and determined by the random configuration draws $c$ from $(C, \mathcal{C}, \nu)$. To draw analogies with standard stochastic processes $\{\xi_t(\cdot)\}_{t \in \mathbb{Z}}$, we shall sometimes write the random field $q_\cdot(\cdot)$ also as $\{q_x(\cdot)\}_{x \in X}$.

We now state a straightforward property of the estimators $q_x(c)$.
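To make the sampling scheme concrete, here is a toy simulation (all names and choices are our own illustration, not the paper's): each configuration $c$ selects a subsample of the single fixed data set $x$ and one of two optimands; drawing $c$ from $\nu$ and minimizing $Q_c(x, \cdot)$ traces out the empirical distribution of point estimates, i.e. a Monte Carlo approximation of $P\{\cdot \mid x\}$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=300)    # the single true data set

def make_optimand(sub, kind):
    """Q_c(x, theta) for a configuration c = (subsample, criterion)."""
    if kind == "ls":                                        # least squares
        return lambda th: float(np.mean((sub - th) ** 2))
    return lambda th: float(np.mean(np.abs(sub - th)))      # least absolute deviations

thetas = np.linspace(0.0, 4.0, 801)   # grid over a compact parameter space (A.1)
estimates = []
for _ in range(200):                  # 200 independent "investigations"
    idx = rng.choice(len(x), size=150, replace=False)  # subsample of x
    kind = rng.choice(["ls", "lad"])  # nu: equal weight on the two criteria
    Q = make_optimand(x[idx], kind)
    estimates.append(thetas[np.argmin([Q(t) for t in thetas])])

estimates = np.array(estimates)       # empirical distribution of q_x(c), c ~ nu
print(estimates.mean())
```

The histogram of `estimates` is the object to which the paper gives a Bayesian interpretation: a distribution on the parameter space, conditional on the one observed data set $x$, generated purely by the random choice of configuration.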

Lemma 1 Under (A.1)-(A.3), for any given $c \in C$, the function $q_\cdot(c)\colon X \to \Theta$ is continuous.

Proof: The result follows directly from assumptions (A.2) and (A.3) and the theorem of the maximum (Berge (1963, p. 116)).

In the theory of standard stochastic processes, the process is defined by a sequence of random variables $\{\xi_t(\cdot)\}$ indexed by $t \in \mathbb{Z}$, defined on an underlying probability space $(\Omega, \mathcal{B}, \mu)$. This stochastic process can be identified by the sequence $\{\xi_t(\cdot)\}$ itself, or by the $\sigma$-fields $\mathcal{F}_t^\tau = \sigma(\xi_t(\cdot), \xi_{t+1}(\cdot), \ldots, \xi_{\tau-1}(\cdot), \xi_\tau(\cdot))$. For our problem, we have to generalize the concept of the stochastic process $\{\xi_t(\cdot)\}$, which is indexed by $t \in \mathbb{Z}$, to our process $\{q_x(\cdot)\}$, which is indexed by $x \in X$. As before, this random process (taking values in $\Theta$) can be described by the collection of random variables $\{q_x(\cdot)\}_{x \in X}$, or by the $\sigma$-fields $\mathcal{A}(S) = \sigma(\{q_x(\cdot)\}_{x \in S})$, for $S \subseteq X$ an open domain. In other words, $\mathcal{A}(S)$ is the smallest $\sigma$-field containing all events of the form $\{q_x(\cdot) \in \Theta_\alpha;\ x \in S,\ \Theta_\alpha \in \mathcal{T}\}_{\alpha \in J}$, for some index set $J$. Note well: To avoid confusion later on, we note that in the standard stochastic processes case, we imagine a single $\omega \in \Omega$ being drawn, thus determining the entire sequence $\xi_t(\omega) = \xi(T^t \omega)$. We can similarly imagine here each $c \in C$ defining the entire ensemble of extremum estimators $\{q_x(c)\}_{x \in X}$. Our goal is to fix a given $x$ (the true data set), and determine the distribution of the random variable $q_x(\cdot)$ (determined by $\nu$ on $(C, \mathcal{C})$). The various interpretations of that resulting measure were discussed in the introduction, and will be discussed again in the conclusion. It should be noted that the study of the random field $q_\cdot(\cdot)\colon X \times C \to \Theta$ is only a vehicle towards establishing the form of that distribution, $P\{\cdot \mid x\}$, on $(\Theta, \mathcal{T})$ generated by $q_x(\cdot)$ for the one given $x$.

Definition 1 (Random Field) The random process $q_\cdot(\cdot)$ is called a random field if, for any two open domains $S, S' \subseteq X$,

$\mathcal{A}(S \cup S') = \mathcal{A}(S) \vee \mathcal{A}(S').$

This additivity condition ensures that either way we collect information (first looking at all the events defined by the random variables $q_x(\cdot)$ for $x \in S$ and $x \in S'$ separately and then combining them, or looking at the events defined by $q_x(\cdot)$ for $x \in S \cup S'$), we get the same answer. This definition clearly applies to our random extremum estimators $q_x(\cdot)$ (to which we also refer by the family of $\sigma$-algebras $\mathcal{A}(S)$, for open $S \subseteq X$). We shall interchangeably speak of the family of random variables $\{q_x(\cdot)\}_{x \in X}$ and the family of $\sigma$-fields $\mathcal{A}(S)$ as our random field. Since we shall talk of the family of $\sigma$-fields as the random field, and discuss conditional independence in terms of those $\sigma$-fields, we need to define the measure on the space of continuous functions $X \to \Theta$ induced by the measure $\nu$ on our space of configurations. In order to prove the existence of a measure $\mu$ on the space of continuous functions $X \to \Theta$ which is defined by our measures $P\{A \mid x\}$, we first introduce some notation. Given a topological space $Y$, we define $\mathcal{P}(Y)$ to be the space of probability measures on $Y$, endowed with the weak-* topology (the topology of weak convergence). We also define $C_b(Y)$ to be the space of continuous and bounded functions on $Y$, endowed with the uniform topology. With the assumptions (A.1) and (A.4) below that $X$ and $\Theta$ are compact metric, we define $C_b(X, \Theta)$ as the space of continuous functions from $X$ to $\Theta$, with the uniform topology. The space $C_b(X, \Theta)$ is the space in which the functions $q_\cdot(c)$ for a given $c$ reside. The probability measure of our random field is a probability measure on $C_b(X, \Theta)$, and it will be defined by $\nu$, the probability measure on the configuration space $(C, \mathcal{C})$. For any such measure $\mu \in \mathcal{P}(C_b(X, \Theta))$, we define:

$\mu_x(A) = \mu\{f \in C_b(X, \Theta) : f(x) \in A\}.$

It is obvious that for any such $\mu$, the mapping $x \mapsto \mu_x$ is continuous. What we need is the converse result: that a given continuous function $\beta\colon X \to \mathcal{P}(\Theta)$ (in our case $\beta(x)(A) = P\{A \mid x\}$) generates a measure $\mu \in \mathcal{P}(C_b(X, \Theta))$ such that $\mu_x = \beta(x)$. Once such a measure $\mu$ is constructed, it will be treated as the measure with respect to which all statements of conditional independence of $\sigma$-fields $\mathcal{A}(S)$ are made. In order to prove the existence of the measure $\mu \in \mathcal{P}(C_b(X, \Theta))$ described above, we first need to prove the continuity of the mapping $\beta\colon X \to \mathcal{P}(\Theta)$. Lemma 2 establishes that result.

Lemma 2 Under (A.1)-(A.3), the mapping $\beta\colon X \to \mathcal{P}(\Theta)$ is continuous in the weak-* topology.

Proof: Given a sequence $x_n \to x$, and a continuous bounded function $h\colon \Theta \to \mathbb{R}$, we need to show that

$\int_\Theta h(\theta)\, d\beta(x_n)(\theta) \longrightarrow \int_\Theta h(\theta)\, d\beta(x)(\theta).$

From our definition of $\beta$,

$\beta(x)(A) = P\{A \mid x\} = \nu\{c \in C : q_x(c) \in A\},$

it follows that

$\int_\Theta h(\theta)\, d\beta(x_n)(\theta) = \int_C h(q_{x_n}(c))\, d\nu(c).$

By the continuity of $q_z(c)$ in $z$ (Lemma 1), and the continuity and boundedness of $h$, it follows by the bounded convergence theorem that the right-hand side of the last equation converges to

$\int_C h(q_x(c))\, d\nu(c) = \int_\Theta h(\theta)\, d\beta(x)(\theta),$

proving our result.

The existence of our measure $\mu$ generated by the conditionals $\beta(x)$ can now be proven under the following additional assumption.

(A.4) $X$ is compact, $X$ and $\Theta$ are connected and locally connected, and $\forall x \in X$, the support of $\beta(x) = P\{\cdot \mid x\}$ is all of $\Theta$.

The assumption of compactness of $X$ is non-standard, but seems harmless. The connectivity constraints on $X$ and $\Theta$ are typically satisfied in practice, where they are hyper-cubes in $\mathbb{R}^{lT}$ and $\mathbb{R}^d$, respectively. The third part of (A.4) states that for any given data set, all non-trivial sets in $\mathcal{T}$ can be reached with positive probability. This can easily be achieved by including some trivial optimands $Q_c$ in our collection of estimation configurations (presumably, those trivial optimands will be drawn with a very small, but positive, probability). Substantively, the full support assumption means that any researcher who searches long enough can find a model together with an estimation technique that will yield any prespecified result. We are now ready to prove the following lemma.

Lemma 3 Under assumptions (A.1)-(A.4), there exists a measure $\mu \in \mathcal{P}(C_b(X, \Theta))$ such that $\mu_x = \beta(x) = P\{\cdot \mid x\}$.

Proof: The result follows from Blumenthal and Corson (1972, Theorem 2.1, p. 34), given the result in Lemma 2, and (A.4).

The analog of this result for finite random fields (a finite random field is completely determined by its local characteristics, i.e. by its conditional probabilities) is much more straightforward (see Griffeath (1976, Corollary 12-13, p. 431), also used in the proof of Theorem 3 below). This is perhaps the most fundamental and useful property of random fields (see Dobrushin (1968)). From this point onwards, we shall use this measure on the random field $\mathcal{A}(S)$, and all statements of conditional independence will be defined with respect to that measure. Note that here, as in the simple construction of stochastic processes, conditional independence properties in terms of the $\sigma$-fields $\mathcal{A}(S)$ are determined by the conditional probabilities $\beta(x)(\cdot) = P\{\cdot \mid x\}$. Now, getting back to simple stochastic processes $\{\xi_t(\cdot)\}_{t \in \mathbb{Z}}$, we define the Markovian property by the conditional independence of $\sigma$-fields generated by temporally-distant random variables $(\xi_t(\cdot), \xi_s(\cdot))$ given intermediate variables $\xi_\tau(\cdot)$.[3] We say that the process $\{\xi_t(\cdot)\}_{t \in \mathbb{Z}}$ is Markovian if $P\{\xi_{t+1}(\cdot) \in B \mid \mathcal{F}_{-\infty}^{t}\} = P\{\xi_{t+1}(\cdot) \in B \mid \xi_t(\cdot)\}$. Another way to state that property is to say that $\mathcal{F}_{-\infty}^{t-1}$ and $\mathcal{F}_{t+1}^{\infty}$ are independent conditional on the random variable $\xi_t(\cdot)$. For the more general random field $\{q_x(\cdot)\}_{x \in X}$, alternatively defined by the collection of $\sigma$-fields $\mathcal{A}(S)$ for $S \subseteq X$ an open domain, we define the Markovian property in a similar fashion. In what follows, we use the notation $\partial A$ for the boundary of $A$, and $\bar{A} = A \cup \partial A$ for the closure of $A$.

[3] For an excellent treatment of conditional independence in the general case and its applications in Bayesian statistics, see Florens et al. (1990).

Definition 2 (Markov Random Field) The random field $\mathcal{A}(S)$, $S \subseteq X$ open, is called a Markov random field with respect to the system $\mathcal{S}$ of open subsets of $X$ if, for all open domains $S \in \mathcal{S}$, $\mathcal{A}(S)$ is independent of $\mathcal{A}(X \setminus \bar{S})$ conditional on $\mathcal{A}(O)$, for $O$ a sufficiently small neighborhood of $\partial S$. We say that $\mathcal{A}(S)$ is Markov if it is Markov with respect to a complete family of subsets $\mathcal{S}$ (i.e. $\mathcal{S}$ contains all the open domains that are relatively compact or with compact complements, contains all small neighborhoods of the boundaries of its elements, and is closed under unions and complements).

For our Bayesian interpretation in the next section, we shall start with a Borel subset of our parameter space $\Theta$, and we wish to interpret the proportion of point estimates that fall in that set as the integral of a Bayesian prior over that set. Given a Borel set $B \in \mathcal{T}$, we can define the random sets $\Gamma_c(B) = \{x \in X : q_x(c) \in B\}$. To draw the analogy with simple stochastic processes $\{\xi_t(\omega)\}_{t \in \mathbb{Z}}$, the corresponding notion would be that of a stopping time (corresponding perhaps to a first entrance to, or exit from, some set).

Definition 3 (Strong Markovian Property) Let $\mathcal{A}(S)$ be a Markov random field. If, for all open random sets $\Gamma_c(B)$ (where $B$ is a Borel subset of $\Theta$), $\mathcal{A}(\Gamma_c(B))$ is independent of $\mathcal{A}(X \setminus \overline{\Gamma_c(B)})$ conditional on $\mathcal{A}(\partial \Gamma_c(B))$ (where $S^\epsilon$ denotes an $\epsilon$-neighborhood of $S$, and $\mathcal{A}(\bar{S}) = \bigcap_{\epsilon} \mathcal{A}(S^\epsilon)$), then we say that $\mathcal{A}(S)$ has the Strong Markovian Property.

We now make an extra assumption regarding our random field $\{q_x(\cdot)\}_{x \in X}$, under which we shall prove that it is a Markov random field with the strong Markovian property:

(A.5) The random field $\{q_x(\cdot)\}_{x \in X}$ has independent "generalized increments", i.e. for any $(x, y) \in X^2$, $x \neq y$, and any $\alpha \in (0, 1)$, the two random variables $(q_{\alpha x + (1-\alpha) y} - q_x)(\cdot)$ and $(q_y - q_{\alpha x + (1-\alpha) y})(\cdot)$ are independent.

We call this assumption having independent generalized increments to draw analogies with stochastic processes indexed by a one-dimensional time parameter (e.g. Brownian motion having independent increments). If we take any two points $x \neq y$, and move along the vector connecting the two points, the changes in $q_\cdot(\cdot)$ along that vector occur by adding independent increments. The substantive import of this assumption is exactly the Markovian structure proven in Theorem 1: that there is no information in the estimates with data $y$ about the estimates with data $x$ that is not captured by some intermediate data set. In essence, the only information we are allowing estimates at some $y$ to give us about estimates

at some $x \neq y$ is that which follows from the continuity of $q_\cdot(c)$ for each $c \in C$. In other words, given the realized value of the random variable $q_y(\cdot)$, and if $y$ is "close" to $x$, we expect the realization of the random estimator $q_x(\cdot)$ to be close to the realization of $q_y(\cdot)$ (since, no matter which $c$ was drawn, the function $q_\cdot(c)$ is continuous). By assuming that the relevant information gained by moving from $x$ to $y$ along any linear segment is accumulated via independent increments, we are in essence ruling out extra knowledge about the class of functions $q_\cdot(c)$ beyond its continuity. For example, we are ruling out knowledge that $q_z(c) = q_{z+\delta}(c)$, $\forall c \in C$, a type of cyclicality that could allow estimators at distant data sets to carry more information than estimators at nearby data sets.

Theorem 1 The random field $\{q_x(\cdot)\}_{x \in X}$ generated by extremum estimators satisfying (A.0)-(A.5), alternatively defined by $\mathcal{A}(S)$, $S \subseteq X$ open, is a Markov random field with the Strong Markovian Property.

Proof: The fact that $\mathcal{A}(S)$ is a Markov random field follows from assumption (A.5) and Lemma 3. Given any open set $S \in \mathcal{S}$, the three $\sigma$-fields $\mathcal{A}(S)$, $\mathcal{A}(\partial S)$, and $\mathcal{A}(X \setminus \bar{S})$ are generated by the three disjoint sets $S$, $\partial S$, and $X \setminus \bar{S}$. Since $\mathcal{A}(S)$ is generated by the random variables $\{q_x(\cdot)\}_{x \in S}$, and $\mathcal{A}(X \setminus \bar{S})$ is generated by the random variables $\{q_x(\cdot)\}_{x \in X \setminus \bar{S}}$, and since, for any $x \in S$ and $y \in X \setminus \bar{S}$, there exists a $z \in \partial S$ and an $\alpha \in (0, 1)$ such that $z = \alpha x + (1-\alpha) y$, assumption (A.5) implies that $\mathcal{A}(S)$ and $\mathcal{A}(X \setminus \bar{S})$ are independent given $\mathcal{A}(\partial S)$. Let $O$ be a small neighborhood of $\partial S$; then $O = (S \cap O) \cup \partial S \cup ((X \setminus \bar{S}) \cap O)$. By the definition of a random field (Definition 1), it follows that $\mathcal{A}(O) = \mathcal{A}(S \cap O) \vee \mathcal{A}(\partial S) \vee \mathcal{A}((X \setminus \bar{S}) \cap O)$. It immediately follows that $\mathcal{A}(S)$ and $\mathcal{A}(X \setminus \bar{S})$ are independent conditional on $\mathcal{A}(O)$ (since they are independent conditional on the smaller $\sigma$-field $\mathcal{A}(\partial S)$). This proves that our random field is Markovian. Now, without loss of generality, let the Borel set $B \in \mathcal{T}$ be open (we know that all Borel sets can be generated by countable unions of such sets). By the continuity of $q_\cdot(c)$ for each $c$, $\Gamma_c(B)$ is an open set. Moreover, the closure of $\Gamma_c(B)$ is co-compatible with the random field $\mathcal{A}(S)$ (i.e. the events $\{\Gamma_c(B) \cup \partial\Gamma_c(B) \subseteq S\}$ and $\{\Gamma_c(B) \cup \partial\Gamma_c(B) \supseteq S\}$ are measurable elements of $\mathcal{A}(S)$). Now, appealing to Rozanov (1982, Theorem 4, p. 92), the result follows.

3 A Bayesian Interpretation of Extremum Estimators

In subsection 3.1, we introduce the Bayesian procedure of choosing the posterior measure on $\Theta$ that minimizes the cross entropy with a given prior measure, subject to some linear constraints. Theorem 2 will state that this minimum cross entropy procedure subject to linear constraints generates a posterior measure with a Gibbs density. Moreover, Theorem 2 states that any measure which has such a Gibbs density is the outcome of some minimum cross entropy procedure subject to some set of linear constraints. In Theorem 3, we show that the measure $P\{\cdot \mid x\}$ generated by the random variable $q_x(\cdot)$ on $(\Theta, \mathcal{T})$ is described by a Gibbs density. The two theorems answer the central question of our paper by offering

a Bayesian interpretation for the empirical distribution of point estimates arising from the collection of randomly drawn extremum estimators.

3.1 A Bayesian Procedure

We begin with our given data set $x = \{x_1, \ldots, x_T\}$ and a given prior measure $M$ on $(\Theta, \mathcal{T})$, and choose the posterior measure $F_x$ which is closest to $M$ in the information theoretic sense of Kullback and Leibler (1951), subject to a set of linear constraints of the form $\int_\Theta a_\gamma(x, \theta)\, F_x(d\theta) = \kappa_\gamma(x)$, $\gamma \in \Gamma$. This procedure has many axiomatic foundations in information theory (e.g. Shannon and Weaver (1962) for the discrete case, and Shore and Johnson (1980) for the general case), and was used in Zellner (1988) to derive and justify Bayes' inversion formula. We shall assume that both $M$ and $F_x$ are absolutely continuous with respect to Lebesgue measure, with densities $m(\cdot) > 0$ (i.e. the prior density $m(\theta)$ has full support) and $f_x(\cdot)$ respectively. The prior density $m(\theta)$ is typically chosen to be an uninformative prior which is invariant to an interesting group of transformations (e.g. Jaynes' work in Rosenkrantz (1983), Berger (1985, Ch. 6), Csiszar (1985)). The right hand sides of the linear constraints (the $\kappa_\gamma(x)$'s) typically consist of computed sample moments justified by laws of large numbers (e.g. Csiszar (1985), and the discussion of empirical Bayes methods in Berger (1985)). The following theorem specifies the form our posterior density $f_x(\cdot)$ (minimizing cross entropy with the prior density $m(\cdot)$ subject to the linear constraints) must take. It also states that any density of that form can be achieved by minimizing cross-entropy with respect to some set of linear constraints.

Remark: It is important to notice that in this subsection, $x$, the data set, is fixed. In other words, even though the statement of the theorem writes everything as a function of $x$, the functions $f_x(\cdot)$, $U(x, \cdot)$, $\kappa(x)$, etc. need not even obey simple measurability as functions of $x$. For a fixed $x$, all of those "functions" can be thought of as constants. Therefore, once we fix the value of $x$, the following theorem is identical to Csiszar (1975, Theorem 3.1, p. 152).

Theorem 2 Condition on the data set $x \in X$. Let $\{a_\gamma(x, \theta)\}_{\gamma \in \Gamma}$ be an arbitrary set of real-valued, $\mathcal{T}$-measurable functions on $\Theta$, and let $\{\kappa_\gamma(x)\}_{\gamma \in \Gamma}$ be real constants. If there exists a measure $F_x$ (with a density $f_x(\cdot)$) satisfying $\int_\Theta a_\gamma(x, \theta)\, F_x(d\theta) = \kappa_\gamma(x)$, $\gamma \in \Gamma$, which minimizes

$I(f_x, m) = \int_\Theta f_x(\theta)\, \log\!\left(\frac{f_x(\theta)}{m(\theta)}\right) d\theta,$

then $F_x$ will have a density of the form:[4]

$f_x(\theta) = H^{-1}(x)\, e^{-U(x, \theta)}.$

[4] I am grateful to an anonymous referee for pointing out that it is well known that such a measure need not exist. For instance, the most trivial way to make no such measure exist is to make the set of moments defined by the $a_\gamma$'s and the $\kappa_\gamma$'s inconsistent. A more sophisticated example is given in Cover and Thomas (1991, Section 11.3, pp. 270-272), where the maximal entropy subject to a set of constraints is well defined, but there does not exist a density which achieves that entropy.

Moreover, given a measure $\hat F$ on $(\Theta, \mathcal{T})$ having a density of the form

$\hat f(\theta) = G^{-1}(x)\, e^{-V(x, \theta)},$

where the function $V(x, \theta)$ is in the linear span (without closure) of a collection of functions $\{b_\gamma(x, \theta)\}_{\gamma \in \Gamma}$, then this measure minimizes $I(f, m)$ subject to a set of moment constraints of the form $\int_\Theta b_i(x, \theta)\, \hat F(d\theta) = \kappa_i(x)$.

Proof: See the proof of Csiszar (1975, Theorem 3.1, pp. 152-154). For a simpler proof of the first part, when the number of constraints is finite, consider the Lagrangian for our constrained optimization problem:

$L(f_x, \lambda) = \int_\Theta \left( -f_x(\theta) \log\frac{f_x(\theta)}{m(\theta)} - \lambda_0(x) f_x(\theta) - \sum_{i=1}^{N} \lambda_i(x) f_x(\theta)\, a_i(x, \theta) \right) d\theta + \lambda_0(x) + \sum_{i=1}^{N} \lambda_i(x)\, \kappa_i(x)$

$\equiv \int_\Theta r(\theta, f_x(\theta), \nabla f_x(\theta))\, d\theta + C(x),$

where $r(\cdot, \cdot, \cdot)$ contains all the terms that depend on $\theta$ or $f_x(\theta)$, and $C(x)$ contains all the terms that depend only on $x$. Now, the function $r$ is twice continuously differentiable; hence, by a standard result in the multiple integral calculus of variations (see Morrey (1966)), we get the Euler-Lagrange equation:

$\sum_{i=1}^{n} \frac{\partial}{\partial \theta_i}\, r_{\nabla_i f_x} = r_{f_x}.$

Now, $\nabla f_x(\theta)$ does not appear in $I(f_x, m)$, and $f_x\colon \Theta \to \mathbb{R}_+ \cup \{0\}$ (the constraint that $f_x$ is non-negative, and hence a density function, is implicit since logarithms of negative numbers do not exist). Hence, the first order conditions for a maximum reduce to $r_{f_x} = 0$, which can be written, after collecting terms, as:

$-\log(f_x(\theta)) + \log(m(\theta)) - (1 + \lambda_0(x)) - \sum_{i=1}^{N} \lambda_i(x)\, a_i(x, \theta) = 0.$

The positivity of $f_x$ guarantees that the second order conditions for a maximum hold. Now, since $f_x$ integrates to unity, setting $U(x, \theta) = \sum_{i=1}^{N} \lambda_i(x)\, a_i(x, \theta) - \log(m(\theta))$, and $H^{-1}(x) = e^{-1-\lambda_0(x)} = 1 \big/ \left[\int_\Theta e^{-U(x,\theta)}\, d\theta\right]$, we get

$f_x(\theta) = H^{-1}(x)\, e^{-U(x, \theta)}.$

The proof of the second part follows Csiszar (1975, Theorem 3.1, pp. 152-154). Notice here that the measure $\hat f$ is given, and so are the $b_\gamma(x, \theta)$'s and the corresponding coefficients attached to those functions in $V(x, \theta)$; the result is therefore that there exist appropriate $\kappa_i(x)$'s to satisfy the moment restrictions.
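Theorem 2 can be checked numerically in a simple discretized case (our own illustration, with names and values chosen for the example): with a uniform prior and a single linear (mean) constraint, the minimum cross entropy density is the exponentially tilted prior $f_x(\theta) \propto m(\theta)\, e^{-\lambda \theta}$, a Gibbs density whose Lagrange multiplier $\lambda$ is chosen so the constraint binds:

```python
import numpy as np
from scipy.optimize import brentq

theta = np.linspace(0.0, 1.0, 201)     # discretized compact parameter space
m = np.ones_like(theta) / len(theta)   # uniform (uninformative) prior density
kappa = 0.7                            # target posterior mean, E_f[theta] = kappa

def tilted(lam):
    """Gibbs density proportional to m(theta) * exp(-lam * theta) on the grid."""
    w = m * np.exp(-lam * theta)
    return w / w.sum()

# one-dimensional dual problem: pick lambda so the mean constraint binds
lam_star = brentq(lambda lam: tilted(lam) @ theta - kappa, -200.0, 200.0)
f = tilted(lam_star)

print(round(float(f @ theta), 3))      # constraint holds: 0.7
```

Any other density on the grid with mean 0.7 has strictly larger relative entropy with respect to the prior, which is the content of the first part of the theorem; the second part runs the construction in reverse, from the Gibbs form back to a set of binding linear constraints.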

Now, we can answer our central question. Our Bayesian interpretation can be established if, for our given $x$, we show that for any given Markov random field $q_\cdot(\cdot)$, there exist functions $\{a_\gamma(x,\theta)\}_{\gamma \in \Gamma}$ and constants $\{\alpha_\gamma(x)\}_{\gamma \in \Gamma}$ (for which we can solve if we knew the structure of the random field), such that the resulting density $p(\theta|x)$ is a minimum cross entropy posterior subject to the linear constraints $\int_\Theta a_\gamma(x,\theta)\, p(\theta|x)\, d\theta = \alpha_\gamma(x)$, $\gamma \in \Gamma$. Note, again, that these $\alpha_\gamma(x)$'s and $a_\gamma(x,\theta)$'s need only be found for each $x$ separately (i.e., they do not need to obey any conditions of measurability as functions of $x$).

3.2 The Interpretation Theorem

In a variety of different contexts, an equivalence theorem between Gibbs and Markov random fields on countable lattices has proven most useful. In statistical physics, the results in Spitzer (1971) are perhaps the most important; in probability theory, the most useful reference is Griffeath (1976); and the very extensive literature on applications in image processing is very well motivated and summarized in Geman (1988). The result in Shore (1984) mentioned in the introduction is achieved under the assumption that the likelihood function being maximized is a Gibbs density. Since that Gibbs density can be achieved as a minimum cross-entropy posterior (in his case, the discrete analog of Theorem 2), his equivalence result follows straightforwardly, without the need to refer to Markov random fields. In this paper, since we do not make the assumption that our extremum estimators have to be maximum likelihood estimators, and even if they are, we do not specify their functional form, we introduced the idea of random extremum estimators and showed that it results in a Markov random field with the strong Markovian property. To establish our interpretation result, we shall prove that our Markov random field indexed by $x \in X \subseteq \mathbf{R}^{lT}$ and taking values in $\Theta \subseteq \mathbf{R}^d$ induces, for a given $x$, a Gibbs distribution on $\Theta$.

Theorem 3 The random field $q_\cdot(\cdot)$ generated by our random extremum estimators satisfying (A.0)-(A.5) induces, for any given $x \in X$, a density on $\Theta$ of the form $H^{-1}(x) e^{-U(x,\theta)}$ for some function $U(x,\theta)$. Moreover, this density is the Bayesian posterior arising from minimizing cross entropy with some prior measure on $(\Theta, \mathcal{B})$ subject to some set of linear constraints.

Proof: We start by constructing two sequences of finite sets $X_n \uparrow X$ and $\Theta_n \uparrow \Theta$. For example, we can define $X_n$ and $\Theta_n$ by all the elements of $X$ and $\Theta$ respectively written up to $n$ decimal places. We then consider the random field $q^n_\cdot(\cdot)$ indexed by $X_n$ and taking values in $\Theta_n$ defined by $q^n_{x_n}(c) = \mathrm{argmin}_{\theta_n \in \Theta_n} \|\theta_n - q_{x_n}(c)\|$, where $\|\cdot\|$ is the Euclidean norm. By a simpler version of the argument used in the first part of Theorem 1, using assumption (A.5) of generalized independent increments, the finite random field $q^n_\cdot(\cdot)$ is Markovian. Continuity of $q^n_\cdot(c)$ follows immediately from finiteness, and hence this random field has the strong Markovian property. Now, for any fixed $x_n \in X_n$, define $p_n(\theta_n|x_n) = \mu\{c \in C : q^n_{x_n}(c) = \theta_n\}$ as the natural finite analog of the density function $p(\theta|x)$. By the construction of $q^n_\cdot(\cdot)$ and the continuity of $q_\cdot(c)$ (which followed by the theorem of the maximum and assumptions (A.1)-(A.3)), it


follows that, for $x_n \rightarrow x$, $P_n(\cdot|x_n) \Rightarrow P(\cdot|x)$ (where $\Rightarrow$ denotes weak convergence). We now need to show that $p_n(\cdot|x_n)$ is a Gibbs density; then it will be apparent that taking limits as $n \uparrow \infty$ will yield a $p(\cdot|x)$ which is also a Gibbs density. For the random field $\{q^n_{x_n}(\cdot)\}_{x_n \in X_n}$ (indexed by the finite set $X_n$ and taking values in the finite set $\Theta_n$), we can apply the finite analog of our Lemma 3, Griffeath (1976, Corollary 12-13, p. 431), to conclude that the random field is fully defined by its local characteristics $\mu\{q^n_{x_n}(\cdot) = \theta_n \,|\, q^n_y(\cdot) = \theta_y,\ y \in A\}$ where $x_n \in A \subseteq X_n$. An appeal to the Möbius inversion formula (Griffeath (1976, Lemma 12-11, p. 427)) applied to our case gives us an analog of the Markov random field / Gibbs random field equivalence theorem (Griffeath (1976, Theorem 12-16)). Therefore,

$$\mu\{c \in C : q^n_{x_n}(c) = \theta_{x_n},\ x_n \in X_n\} = Z^{-1} \exp\left( \sum_{B \subseteq X_n} \sum_{A \subseteq B} (-1)^{|B \setminus A|} \log \mu\{c \in C : q^n_y(c) = \theta_y,\ y \in A\} \right).$$
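The Möbius inversion step in this representation can be checked numerically on a toy finite field. The following sketch (the three binary sites and the randomly generated strictly positive joint measure are hypothetical) computes the potentials $V_B = \sum_{A \subseteq B} (-1)^{|B \setminus A|} \log \mu_A(\theta)$ from the marginals of a fixed configuration $\theta$, and verifies the telescoping identity that their sum over all $B \subseteq X_n$ recovers $\log \mu(\theta)$:

```python
import itertools
import math
import random

random.seed(0)

sites = (0, 1, 2)                                  # a tiny finite index set X_n
configs = list(itertools.product((0, 1), repeat=len(sites)))

# A strictly positive joint measure mu on the field (hypothetical weights)
weights = {c: random.random() + 0.1 for c in configs}
total = sum(weights.values())
mu = {c: w / total for c, w in weights.items()}

theta = (1, 0, 1)                                  # a fixed configuration

def subsets(s):
    for r in range(len(s) + 1):
        yield from itertools.combinations(s, r)

def marginal(A):
    """mu{c : c_y = theta_y for all y in A} (equals 1 when A is empty)."""
    return sum(p for c, p in mu.items() if all(c[y] == theta[y] for y in A))

def potential(B):
    """Moebius inversion: V_B = sum over A subset of B of (-1)^|B\\A| log mu_A."""
    return sum((-1) ** (len(B) - len(A)) * math.log(marginal(A))
               for A in subsets(B))

energy = sum(potential(B) for B in subsets(sites))
print(math.exp(energy), mu[theta])                 # the two numbers coincide
```

The identity holds for any strictly positive joint measure; the content of the equivalence theorem is that, for a Markov field, the potentials vanish outside cliques.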

This yields the probability measure $\mu_n$ on our random field $q^n_\cdot(\cdot)$, which is a joint measure on $(q^n_{x_n}(\cdot))_{x_n \in X_n}$. To obtain the marginal distribution of $q^n_{x_n}(\cdot)$, we integrate (sum) out all $q^n_y(\cdot)$, $x_n \neq y \in X_n$, and obtain another probability density function:

$$p_n(\theta_n|x_n) = \mu\{c \in C : q^n_{x_n}(c) = \theta_n\} = H^{-1}(x_n)\, e^{-U(x_n, \theta_n)}.$$

Taking limits as $n \uparrow \infty$, and by the above construction of weak convergence of $p_n(\theta_n|x_n)$ to $p(\theta|x)$ for $(x_n, \theta_n) \rightarrow (x, \theta)$, we get the desired result that $p(\theta|x) = H^{-1}(x) e^{-U(x,\theta)}$. The last part of the theorem (that the resulting Gibbs density is one which arises from some cross entropy minimization subject to linear constraints) follows by finding a collection of functions $a_\gamma(x,\theta)$ such that the function $U(x,\theta)$ is in their linear span, and applying the second part of Theorem 2. This can clearly be trivially satisfied (via $a_1(x,\theta) = U(x,\theta)$), and the strength of our interpretation depends on the actual $a_\gamma(\cdot,\cdot)$'s that give rise to the particular $U(\cdot,\cdot)$ at hand.
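The discretization and weak-convergence step used in the proof can also be illustrated with a short simulation. In the sketch below, the continuous point estimates $q_x(c)$ are simply draws from a hypothetical normal law (a stand-in for the true field), the projection onto $\Theta_n$ is rounding to $n$ decimal places, and the sup gap between the empirical CDFs of the discretized and continuous laws shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical continuous point estimates q_x(c) over many configurations c
q = rng.normal(loc=0.3, scale=0.8, size=100_000)

def discretize(values, n):
    """Project onto Theta_n: values written to n decimal places
    (the nearest-point map argmin over the grid, as in the proof)."""
    return np.round(values, n)

# Empirical CDFs on a fixed evaluation grid; the sup gap between the
# discretized law and the continuous law shrinks as n grows.
grid = np.linspace(-3.0, 3.0, 121)

def ecdf(values):
    return (values[:, None] <= grid).mean(axis=0)

base = ecdf(q)
gaps = [np.abs(ecdf(discretize(q, n)) - base).max() for n in (0, 1, 2, 3)]
for n, g in zip((0, 1, 2, 3), gaps):
    print(f"n = {n}: sup CDF gap = {g:.4f}")
```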

4 Concluding Remarks

By Theorem 3, we can now interpret the empirical distribution of point estimates in $\Theta$ induced by the random extremum estimators $q_x(\cdot)$ as a Bayesian posterior arising from a problem of choosing the least informative distribution on $\Theta$ subject to linear constraints. This is our Bayesian interpretation. We now turn to the four purposes for finding that interpretation which were enumerated in the introduction.

1. The first goal was to lend some legitimacy to procedures whose asymptotic properties may have been established, but that have no other grounds of epistemological legitimacy. The Bayesian interpretation shows that there exists a logically consistent procedure (namely Bayesian updating) which reproduces the empirical distribution of extremum estimators as a Bayesian posterior. In order to establish that this logically consistent procedure is a reasonable one for the given data set, one must be able to

find the moment restrictions that will generate the same Gibbs density as the random extremum estimators. If those moment restrictions are meaningful, then the Bayesian procedure which uses them as constraints is seen as a reasonable one. Bayesian procedures with poor choices of priors and constraints will not be deemed reasonable. The interpretation provided in this paper is only one step in the direction of finding a reasonable as well as logically consistent procedure which reproduces the empirical distribution of point estimates.

2. The second goal was to help in proving the asymptotic properties of some extremum estimators. A single point estimate $q_x(c) = \hat\theta$ can be seen as an unbiased estimate of the population mean of that distribution on $\Theta$; hence, in cases where Bayes consistency and central limit theorems are applicable, they can be applied to that single point estimate.

3. The third goal dealt with the situation where we start with an actual collection of point estimates. We would not want to base our decisions on the estimates arising from the best fitting model alone: that would ignore useful information. Under the appropriate assumptions on how different investigators happen to choose a particular configuration for their point estimation experiments (namely, our Assumption (A.5), which rules out the kinds of data mining and published research effects that introduce artificial dependence between the point estimates), we can combine the information from multiple point estimates to construct a pseudo-Bayesian posterior. There are a number of problems with the application of that procedure (the algorithms and the problems with their application are discussed in El-Gamal (1993)), but throwing away all but the point estimate from the best fitting model may pose even more troublesome pre-testing problems.

4.
The fourth goal was to deal with cases where a number of alternative extremum estimation experiments, with equally valid ex ante justifications (e.g. being derived from the same model, and having the same asymptotic properties), are available. Consider the example of GMM estimates using different collections of moment conditions and instruments. There is a natural data-mining problem that arises when investigators have so much freedom in choosing a procedure to estimate essentially the same model. Even if we ignore those data-mining problems, there is still the issue of which of the infinite number of possibilities to choose. Our interpretation allows us to randomly select different collections of moments and instruments (interpreted as a configuration $c \in C$), and treat the resulting empirical distribution of point estimates as a pseudo-Bayesian posterior. Note here that random selection does not necessarily imply that all estimators in our class will be drawn with equal probability. Indeed, the measure $\mu$ on our space of configurations (point estimators) can give more weight to some models (e.g. parsimonious models may be given more weight than highly parametrized ones). A discussion of how to select such weights is provided in El-Gamal (1993).
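This last point can be sketched with a small self-contained simulation. All of its details are hypothetical (the linear data-generating process, the pool of candidate instruments of varying strength, and uniform random sampling of instrument subsets): each random configuration $c$ picks a subset of moment conditions, each configuration yields an identity-weighted GMM point estimate, and the empirical distribution of those estimates plays the role of the pseudo-Bayesian posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = theta0 * z + noise, theta0 = 1
n, theta0 = 500, 1.0
z = rng.normal(size=n)
y = theta0 * z + rng.normal(scale=0.5, size=n)

# A pool of candidate instruments of varying strength (also hypothetical)
instruments = [z + rng.normal(scale=s, size=n) for s in (0.2, 0.5, 1.0, 2.0, 4.0)]

def gmm_estimate(subset):
    """Identity-weighted GMM for the moments E[w_j (y - theta z)] = 0, j in subset.
    The objective is quadratic in theta, so the minimizer is in closed form."""
    num = sum(float(instruments[j] @ y) * float(instruments[j] @ z) for j in subset)
    den = sum(float(instruments[j] @ z) ** 2 for j in subset)
    return num / den

# Each random configuration c picks a nonempty subset of moment conditions;
# the empirical distribution of the resulting point estimates is the
# pseudo-Bayesian posterior on Theta.
estimates = []
for _ in range(2000):
    k = int(rng.integers(1, len(instruments) + 1))
    subset = rng.choice(len(instruments), size=k, replace=False)
    estimates.append(gmm_estimate(subset))

estimates = np.array(estimates)
print(f"pseudo-posterior mean {estimates.mean():.3f}, sd {estimates.std():.3f}")
```

A non-uniform sampling scheme over subsets (e.g. favoring parsimonious configurations) corresponds to choosing a non-uniform measure $\mu$ on $C$, as discussed in the text.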

References

Amemiya, T. 1985. Advanced econometrics. Cambridge, MA: Harvard University Press.

Berge, C. 1963. Topological spaces. New York: The MacMillan Co.

Berger, J. 1985. Statistical decision theory and Bayesian analysis. New York: Springer-Verlag.

Blumenthal, R. and H. Corson. 1972. On continuous collections of measures. In LeCam, L., J. Neyman, and E. Scott, eds., Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, pages 33-40. U.C. Press, Berkeley.

Cover, T. and J. Thomas. 1991. Elements of information theory. New York: John Wiley & Sons, Inc.

Csiszar, I. 1975. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability 3, no. 1:146-158.

Csiszar, I. 1985. An extended maximum entropy principle and a Bayesian justification. In Bernardo, J. et al., eds., Bayesian Statistics 2, pages 83-98. North Holland, Amsterdam.

Dobrushin, P. 1968. The description of a random field by means of conditional probabilities and conditions of its regularity. Theory of Probability and Its Applications 13:197-224.

El-Gamal, M. 1993. The extraction of information from multiple point estimates. Nonparametric Statistics 2:369-378.

Florens, J., M. Mouchart, and J. Rolin. 1990. Elements of Bayesian statistics. New York: Marcel Dekker, Inc.

Geman, D. 1988. Random fields and inverse problems in imaging. In Ancona, A., D. Geman, and N. Ikeda, eds., École d'Été de Probabilités de Saint-Flour XVIII-1988. Springer-Verlag, New York.

Griffeath, D. 1976. Introduction to random fields. In Kemeny, J., J. Snell, and A. Knapp, eds., Denumerable Markov Chains. Springer-Verlag, New York.

Hansen, L. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50:1029-1054.

Kullback, S. and R. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics 22:79-86.

LeCam, L. 1986. Asymptotic methods in statistical decision theory. New York: Springer-Verlag.

Morrey, C. 1966. Multiple integrals in the calculus of variations. New York: Springer-Verlag.

Rosenkrantz, R. 1983. E. T. Jaynes: Papers on probability, statistics, and statistical physics. Dordrecht: Reidel Pub. Co.

Rozanov, Y. 1982. Markov random fields. New York: Springer-Verlag.

Shannon, C. and W. Weaver. 1962. The mathematical theory of communication. Urbana: University of Illinois Press.

Shore, J. 1984. On a relation between maximum likelihood classification and minimum relative-entropy classification. IEEE Transactions on Information Theory IT-30, no. 6:851-854.

Shore, J. and R. Johnson. 1980. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory IT-26:26-37.

Spitzer, F. 1971. Markov random fields and Gibbs ensembles. American Mathematical Monthly 78:142-154.

Zellner, A. 1988. Optimal information processing and Bayes's theorem. The American Statistician 42:278-284.

