Introduction to Bayesian Econometrics and Decision Theory
Karsten T. Hansen
April 1, 2002
Lecture notes, 319 Spring 2002: Bayesian Theory

Introduction

This note will give a short introduction to Bayesian Econometrics and Bayesian Decision Theory. Econometrics is usually taught from a classical, frequentist perspective. However, thinking of econometric models from a Bayesian viewpoint can often be illuminating. Here is a non-exhaustive list of arguments for considering a Bayesian analysis of an econometric model (taken from Berger's book):

1. Prior information about structural economic parameters is often available. In many econometric models we have information about the underlying unknown parameters. Often some parameter values just don't make much sense in terms of the underlying economic theory (e.g., own-price elasticities being positive in a demand function). A Bayesian analysis makes it very easy to incorporate such information directly.

2. Uncertainty = Probabilities. Any conclusion derived from a statistical analysis should have attached to it an indication of the uncertainty of the conclusion. For example, a point estimate of an unknown parameter is more or less worthless without an indication of the uncertainty underlying the estimate. In classical statistics one can only talk about uncertainty in a repeated sample framework (recall the construction of confidence intervals!). A Bayesian analysis will yield statements like "Given the observed data, I believe with 95 percent probability that this wage elasticity is between .1 and .18."
3. Allows for conditioning on data. A Bayesian analysis conditions on the observed data, whereas a classical analysis averages over all possible data structures.

4. Exact distribution theory. Frequentist distribution theory of estimators of parameters for all but the most simple, non-interesting econometric models relies on asymptotic approximations. These approximations are sometimes good, sometimes horrible. Bayesian distribution theory is always exact, never requiring the use of asymptotic approximations.

5. Coherency and Rationality. It has been shown that any statistical analysis which is not Bayesian must violate some basic "common sense" axiom of behavior. This is related to the fact that a Bayesian analysis is directly based on axiomatic utility theory.

6. Bayes is optimal from a classical perspective. It has been shown in numerous papers that whenever one finds a class of optimal decision rules from a classical perspective (optimal with respect to some acceptable principle), they usually correspond to the class of Bayes decision rules. An example is the many complete class theorems in the literature (which roughly say that all admissible decision rules are Bayes decision rules).

7. Operational advantage: "You always know what to do!" Researchers are often faced with problems like "How do I estimate the parameters of this econometric model in a good way?" In a Bayesian analysis you always do this the same way, and it usually gives a good answer.

8. Computation. In the past it was often very hard to carry out a Bayesian analysis in practice due to the need for analytical integration. With the introduction of cheap high-performance PCs and the development of Monte Carlo statistical methods it is now possible to estimate models with several thousand parameters.
9. Inference in non-regular models. Classical asymptotic results usually rely on a set of regularity conditions. Classical inference without these is a nightmare. However, non-regular models are no more challenging than regular models from a Bayesian perspective.

This is only a partial list. A few more (technical) reasons for considering a Bayesian approach are that it
• allows for parameter uncertainty when forming predictions,
• can test multiple non-nested models,
• allows for automatic James-Stein shrinkage estimation using hierarchical models.

Probability theory as logic

Probability spaces are usually introduced in the form of the Kolmogorov axioms. A probability space (Ω, F, P) consists of a sample space Ω, a set of events F consisting of subsets of Ω and a probability measure P with the properties
1. F is a σ-field,
2. P(A) ≥ 0 for all A ∈ F,
3. P(Ω) = 1,
4. for a disjoint collection {Aj ∈ F}, P(∪j Aj) = Σj P(Aj).
These are axioms and hence taken as given. The classical interpretation of the number P(A) is the relative frequency with which A occurs in a repeated random experiment when the number of trials goes to infinity. But why should we base probability theory on exactly these axioms? Indeed, many have criticized these axioms as being arbitrary. Can we derive them from deeper
principles that seem less arbitrary? Yes, and this also leads to an alternative interpretation of the number P(A).

Let us start by noting that in a large part of our lives our human brains are engaged in plausible reasoning. As an example of plausible reasoning, consider the following little story from Jaynes' book:

"Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion?"

The policeman's reasoning is clearly not deductive reasoning, which is based on relationships like "If A is true, then B is true." Deductive reasoning is then
A true ⟹ B true,   and   B false ⟹ A false.
The policeman's reasoning is better described by the following relationship: "If A is true, then B becomes more plausible." Plausible reasoning is then
B is true ⟹ A becomes more plausible.
How can one formalize this kind of reasoning? In chapters 1 and 2 of Jaynes' book it is shown that, given some basic desiderata that a theory of plausible reasoning should satisfy, one can derive the laws of probability from scratch. These desiderata are that (i) degrees of plausibility are represented by real numbers, (ii) if a conclusion can be reasoned out in more than one way, then every possible way must lead to the same
result (plus some further weak conditions requiring correspondence of the theory to common sense). So according to this approach probability theory is not a theory about limiting relative frequencies in random experiments, but a formalization of the process of plausible reasoning, and the interpretation of a probability is
P(A) = "the degree of belief in the proposition A".
This subjective definition of probability can now be used to formalize the idea of learning in an uncertain environment. Suppose my degree of belief in A is P(A). Then I learn that the proposition B is true. If I believe there is some connection between A and B, I have then also learned something about A. In particular, the laws of probability (or, according to the theory above, the laws of plausible reasoning) tell me that

Pr(A|B) = Pr(A ∩ B) / Pr(B) = Pr(B|A)Pr(A) / Pr(B),   (1)
which, of course, is known as Bayes' rule. If there is no logical connection between A and B then Pr(B|A) = Pr(B), and in this case Pr(A|B) = Pr(A) and I haven't learned anything by observing B. On the other hand, if Pr(B|A) ≠ Pr(B) then B contains information about A and therefore I must update my beliefs about A.

The Bayesian approach to statistics

The Bayesian approach to statistics is based on applying the laws of probability to statistical inference. To see what this entails simply replace A and B above by
A = the unobserved parameter vector, θ,
B = the observed data vector, y.
Replacing probabilities with pdfs we get

p(θ|y) = p(y|θ)p(θ) / p(y).   (2)
Here p(y|θ) is the sample distribution of the data given θ and p(θ) is the prior distribution of θ. So Bayesian statistics is nothing more than a formal model of learning in an uncertain environment applied to statistical inference. The prior expresses my beliefs about θ before observing the data; the distribution p(θ|y) expresses my updated beliefs about θ after observing the data.

Definition 1. p(θ) is the prior distribution of θ, p(θ|y) given in (2) is the posterior distribution of θ, and

p(y) = ∫ p(y|θ)p(θ) dθ
is the marginal distribution of the data.

Carrying out a Bayesian analysis is deceptively simple and always proceeds as follows:
• Formulate the sample distribution p(y|θ) and the prior p(θ).
• Compute the posterior p(θ|y) according to (2).
That's it! All information about θ is now contained in the posterior. For example, the probability that θ ∈ A is

Pr(θ ∈ A|y) = ∫_A p(θ|y) dθ.
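To make the two-step recipe above concrete, here is a minimal numerical sketch (not part of the original notes) that approximates a posterior on a grid for a Bernoulli sample distribution with a uniform prior; the data values and grid are made-up choices for illustration, assuming numpy is available.

```python
import numpy as np

# Hypothetical data: 7 successes in 10 Bernoulli trials.
n, k = 10, 7

# Step 1: formulate the sample distribution p(y|theta) and the prior p(theta)
# on a grid of theta values.
theta = np.linspace(0.001, 0.999, 999)
likelihood = theta**k * (1 - theta)**(n - k)   # p(y|theta) up to a constant
prior = np.ones_like(theta)                    # uniform prior p(theta)

# Step 2: compute the posterior p(theta|y) via Bayes' rule (2),
# normalizing so the grid values sum to one.
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

# All inference now uses the posterior, e.g. Pr(theta > 0.5 | y).
print(posterior[theta > 0.5].sum())
```

In higher-dimensional models the grid is replaced by Monte Carlo draws, but the two-step logic is unchanged.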
A couple of things to note:
• In the Bayesian approach "randomness" = uncertainty. The reason something is random is not because it is generated by a "random experiment" but because it is unknown. According to the Bayesian approach you are only allowed to condition on something you know.
• Note that data and parameters are treated symmetrically. Before observing any data both data and parameters are considered random (since they are unknown). After observing the data only the parameters are considered random (since now you know the data but you still don't know the parameters).
• The final product of a Bayesian analysis is the posterior distribution of θ, p(θ|y). This distribution summarizes your current state of knowledge about θ. Note that the posterior distribution is not an estimator. An estimator is a function of the data, θ̂ = θ̂(y), which given the data y yields a single value of θ. The posterior distribution is a distribution over the whole parameter space of θ. While the posterior distribution is the complete representation of your beliefs about θ, it is sometimes convenient to report a single estimate, e.g. the most likely value of θ.

The Likelihood Principle

Bayesian procedures satisfy the Likelihood Principle (Barnard, Fisher, Birnbaum):

"In making inferences or decisions about θ after the data is observed, all relevant experimental information is contained in the likelihood function for the observed data. Furthermore, two likelihood functions contain the same information about θ if they are proportional to each other."

Example 1. (From Robert (1994).) Suppose x1, x2 ~ N(θ, 1), so f(x1, x2|θ) = φ(x1|θ, 1) φ(x2|θ, 1). The likelihood function is
Lf(θ|x) ∝ exp{−(x̄ − θ)²}.
Consider the alternative sample distribution:
g(x1, x2|θ) ∝ π^(−3/2) exp{−(x1 + x2 − 2θ)²/4} / (1 + (x1 − x2)²).
Note that Lg(θ|x) ∝ exp{−(x̄ − θ)²}, which is equal to Lf(θ|x). The Likelihood Principle states that we should make the same inferences about θ.
Note that Maximum Likelihood would yield the same estimator for the two different sample distributions, and a Bayesian analysis would yield two identical posteriors. However, frequentist confidence intervals and test procedures would be different for the two distributions. This violates the likelihood principle.

Here is another famous example from Berger's book:

Example 2. Suppose a substance to be analyzed can be sent to either lab 1 or lab 2. The two labs seem equally good, so a fair coin is flipped to choose between them. The coin flip results in lab 1. A week later the results come back from lab 1 and a conclusion is to be made. Should this conclusion take into account the fact that the coin could have pointed to lab 2 instead? Common sense says no, but according to the frequentist principle we have to average over all possible samples, including the ones from lab 2.

Bayes estimators

How should one generate Bayes estimators from the posterior distribution? Since we argued above that the Bayesian approach is just a model of learning, we might as well ask how one should make an optimal decision in an uncertain environment. Well, as economists we know this: maximize expected utility or, equivalently, minimize expected loss. So let
Y = Data sample space, Θ = Parameter space, A = Action space,
where a ∈ A is an action or decision related to the parameter θ, e.g., a = θ̂(y), an estimator.

Definition 2. A function L : Θ × A → R with the interpretation that L(θ1, a1) is the loss incurred if action a = a1 is taken when the parameter is θ = θ1 is called a loss function.

With these definitions we can now define the posterior expected loss of a given action.

Definition 3. The posterior expected loss of an action a ∈ A is

ρ(a|y) = ∫ L(θ, a) p(θ|y) dθ.   (3)
Now we can define a Bayes estimator:

Definition 4. Given a sample distribution, prior and loss function, a Bayes estimator θ̂_B(y) is any function of y such that θ̂_B(y) = arg min_{a∈A} ρ(a|y).

Some typical loss functions when θ is one-dimensional are

L(θ, a) = (θ − a)²,   (quadratic)   (4)
L(θ, a) = |θ − a|,   (absolute error)   (5)
L(θ, a) = k2(θ − a) if θ > a, and k1(a − θ) otherwise.   (generalized absolute error)   (6)
Then the corresponding optimal Bayes estimators are

E[θ|y],   (posterior mean)   (7)
Q_{1/2}(θ|y),   (posterior median)   (8)
Q_{k2/(k1+k2)}(θ|y).   (k2/(k2 + k1) fractile of the posterior)   (9)
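As a small illustration (not from the notes), given Monte Carlo draws from some posterior, the three Bayes estimators in (7)-(9) can be read off as the mean and the appropriate quantiles of the draws. The posterior draws and the loss constants k1, k2 below are arbitrary, made-up choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are draws from some posterior p(theta|y).
theta_draws = rng.normal(loc=1.2, scale=0.5, size=100_000)

k1, k2 = 1.0, 3.0   # asymmetric loss: overestimation weighted by k1, underestimation by k2

post_mean = theta_draws.mean()                            # optimal under quadratic loss (7)
post_median = np.quantile(theta_draws, 0.5)               # optimal under absolute error loss (8)
post_fractile = np.quantile(theta_draws, k2 / (k1 + k2))  # optimal under generalized absolute error (9)

print(post_mean, post_median, post_fractile)
```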
Proof. Consider first the quadratic case. The posterior expected loss is
ρ(a|y) = ∫ (θ − a)² p(θ|y) dθ,
which is a continuous and convex function of a, so
∂ρ(a|y)/∂a = 0 ⟺ ∫ (θ − a*) p(θ|y) dθ = 0 ⟺ a* = ∫ θ p(θ|y) dθ ≡ E[θ|y].
For the generalized absolute error loss case we get
ρ(a|y) = ∫ L(θ, a) p(θ|y) dθ = k1 ∫_{−∞}^{a} (a − θ) p(θ|y) dθ + k2 ∫_{a}^{∞} (θ − a) p(θ|y) dθ.
Now using integration by parts,
∫_{−∞}^{a} (a − θ) p(θ|y) dθ = (a − a)Pr(θ < a|y) − lim_{x→−∞} (a − x)Pr(θ < x|y) + ∫_{−∞}^{a} Pr(θ < x|y) dx = ∫_{−∞}^{a} Pr(θ < x|y) dx,
and similarly for the second integral. Then

ρ(a|y) = k1 ∫_{−∞}^{a} Pr(θ < x|y) dx + k2 ∫_{a}^{∞} Pr(θ > x|y) dx.   (10)
This is a continuous convex function of a and
∂ρ(a|y)/∂a = k1 Pr(θ < a|y) − k2 Pr(θ > a|y).
Setting this equal to zero and solving (using that Pr(θ > a|y) = 1 − Pr(θ < a|y)) we find

Pr(θ < a*|y) = k2 / (k1 + k2),   (11)

which shows that a* is the k2/(k1 + k2) fractile of the posterior. For k1 = k2 we get the posterior median. One can also construct loss functions which give the posterior mode as an optimal Bayes estimator.

Here is a simple example of a Bayesian analysis.

Example 3. Suppose we have a sample y of size n where, by assumption, yi is sampled from a normal distribution with mean µ and known variance σ²,

yi | µ ~ N(µ, σ²),   i.i.d.   (12)
So the sample distribution is

p(y|µ) = (2π)^(−n/2) σ^(−n) exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − µ)² }.   (13)
Suppose we use the prior

p(µ) = (2π)^(−1/2) σ0^(−1) exp{ −(1/(2σ0²)) (µ − µ0)² },   (14)
where µ0 and σ0² are the known prior mean and variance. So before observing any data the best guess of µ is µ0 (at least under squared error loss). Usually one would have σ0 large to express that little is known about µ before observing the data. The posterior distribution of µ is

p(µ|y) = p(y|µ)p(µ) / p(y) = p(y|µ)p(µ) / ∫ p(y|µ)p(µ) dµ.

The numerator is

p(y|µ)p(µ) = (2π)^(−(n+1)/2) σ^(−n) σ0^(−1) exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − µ)² − (1/(2σ0²))(µ − µ0)² }.

Note that

Σ_{i=1}^n (yi − µ)² = Σ_{i=1}^n (yi − ȳ)² + n(µ − ȳ)².   (15)

Then, since by the convenient expansion a(x − b)² + c(x − d)² = (a + c)(x − (ab + cd)/(a + c))² + (ac/(a + c))(b − d)² (often useful when working with the normal distribution) we have

(n/σ²)(µ − ȳ)² + (1/σ0²)(µ − µ0)² = (1/σ̄²)(µ − µ̄)² + (1/(σ0² + n⁻¹σ²))(ȳ − µ0)²,   (16)

where

µ̄ = [(n/σ²)ȳ + (1/σ0²)µ0] / [(n/σ²) + (1/σ0²)],   (17)
σ̄² = 1 / [n/σ² + 1/σ0²],   (18)
the term in curly brackets is

−(1/(2σ²)) Σ_{i=1}^n (yi − µ)² − (1/(2σ0²))(µ − µ0)² = −(1/(2σ̄²))(µ − µ̄)² − h(y),   (19)

where

h(y) = (1/(2σ²)) Σ_{i=1}^n (yi − ȳ)² + (1/(2(σ0² + n⁻¹σ²)))(ȳ − µ0)².

Then

p(y|µ)p(µ) = p(y) (2π)^(−1/2) σ̄^(−1) exp{ −(1/(2σ̄²))(µ − µ̄)² },   (20)

where

p(y) = (2π)^(−n/2) σ^(−n) σ̄ σ0^(−1) exp{−h(y)}.

Then

p(µ|y) = p(y|µ)p(µ) / p(y) = (2π)^(−1/2) σ̄^(−1) exp{ −(1/(2σ̄²))(µ − µ̄)² },   (21)

which is the density of a normal distribution with mean µ̄ and variance σ̄², so we conclude that

p(µ|y) = N(µ̄, σ̄²).   (22)
To derive this we did more calculations than we actually had to. Remember that when deriving the posterior for µ we only need to include terms where µ enters. Hence,

p(µ|y) = p(y|µ)p(µ) / p(y)
∝ p(y|µ)p(µ)
∝ exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − µ)² − (1/(2σ0²))(µ − µ0)² }
∝ exp{ −(1/(2σ̄²))(µ − µ̄)² }.

So this quick calculation shows that

p(µ|y) ∝ exp{ −(1/(2σ̄²))(µ − µ̄)² }.

We recognize this as an unnormalized normal density. So we can immediately conclude (22).
Under squared error loss the Bayes estimator is the posterior mean,

E[µ|y] = µ̄ = [(n/σ²)ȳ + (1/σ0²)µ0] / [(n/σ²) + (1/σ0²)].   (23)

The optimal Bayes estimator is a convex combination of the usual estimator ȳ and the prior expectation µ0. When n is large and/or σ0 is large, most weight is given to ȳ. In particular,

E[µ|y] → ȳ as n → ∞,
E[µ|y] → ȳ as σ0 → ∞.
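The updating formulas (17), (18) and (23) are easy to check numerically. The sketch below is illustrative only (the data are simulated and the prior settings are made up), assuming numpy is available.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: known sampling std sigma, prior N(mu0, sigma0^2).
sigma, mu0, sigma0 = 1.0, 0.0, 10.0
mu_true, n = 2.0, 50
y = rng.normal(mu_true, sigma, size=n)

ybar = y.mean()
prec_post = n / sigma**2 + 1 / sigma0**2                       # posterior precision, 1/sigma_bar^2 in (18)
mu_bar = (n / sigma**2 * ybar + mu0 / sigma0**2) / prec_post   # posterior mean, (17) and (23)
sigma_bar = prec_post**-0.5                                    # posterior std

print(mu_bar, sigma_bar)   # close to ybar and sigma/sqrt(n) because sigma0 is large
```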
In this example there was a close correspondence between the optimal Bayes estimator and the classical estimator ȳ. But suppose now we had the knowledge that µ has to be positive. Suppose also we initially use the prior

p(µ) = I(K > µ > 0) (1/K),

where K is a large positive number. Then we can compute the posterior for K < ∞ and then let K approach infinity. The posterior is then

p(µ|y) ∝ exp{ −(1/(2σ̄²))(µ − µ̄)² } I(K > µ > 0) (1/K),

where now µ̄ = ȳ and σ̄² = σ²/n. This is an unnormalized doubly truncated normal distribution, so

p(µ|y) = φ(µ|µ̄, σ̄²) I(K > µ > 0) / [Φ((K − µ̄)/σ̄) − Φ(−µ̄/σ̄)] → φ(µ|µ̄, σ̄²) I(µ > 0) / Φ(µ̄/σ̄),   as K → ∞.

The posterior is a left-truncated normal distribution with mean

E[µ|y] = ȳ + σ̄ φ(ȳ/σ̄)/Φ(ȳ/σ̄).   (24)

Note that the unrestricted estimate is ȳ, which may be negative. Developing the repeated sample distribution of ȳ under the restriction µ > 0 is a tricky matter. On the other hand, the posterior analysis is straightforward and E[µ|y] is a reasonable and intuitive estimator of µ.
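Here is a hedged numerical check of (24), assuming scipy is available; the values of ȳ and σ̄ below are illustrative, not taken from the notes.

```python
import numpy as np
from scipy.stats import norm, truncnorm

ybar, sigma_bar = -0.3, 0.5   # illustrative posterior location and scale (sigma/sqrt(n))

# Posterior mean under the positivity restriction, formula (24).
post_mean = ybar + sigma_bar * norm.pdf(ybar / sigma_bar) / norm.cdf(ybar / sigma_bar)

# Cross-check against scipy's normal distribution truncated to (0, infinity).
a, b = (0 - ybar) / sigma_bar, np.inf
check = truncnorm.mean(a, b, loc=ybar, scale=sigma_bar)

print(post_mean, check)   # the two numbers agree
```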
Models via exchangeability

In criticisms of Bayesian statistics one often meets statements like "This is too restrictive since you have to use a prior to do a Bayesian analysis, whereas in classical statistics you don't." This is correct, but we will now show that under mild conditions there always exists a prior. Consider the following example. Suppose you wish to estimate the probability of unemployment for a group of (similar) individuals. The only information you have is a sample y = (y1, . . . , yn) where yi is one if individual i is employed and zero if unemployed. Clearly the indices of the observations should not matter in this case. The joint distribution of the sample p(y1, . . . , yn) should be invariant to permutations of the indices, i.e.,

p(y1, . . . , yn) = p(y_{i(1)}, . . . , y_{i(n)}),

where {i(1), . . . , i(n)} is a permutation of {1, . . . , n}. Such a condition is called exchangeability.

Definition 5. A finite set of random quantities z1, . . . , zn are said to be exchangeable if every permutation of z1, . . . , zn has the same joint distribution as every other permutation. An infinite collection is exchangeable if every finite subcollection is exchangeable.

The relatively weak assumption of exchangeability turns out to have a profound consequence, as shown by a famous theorem by de Finetti.

Theorem 1 (de Finetti's representation theorem). Let z1, z2, . . . be a sequence of 0-1 random quantities. The sequence (z1, . . . , zn) is exchangeable for every n if and only if

p(z1, . . . , zn) = ∫_0^1 ∏_{i=1}^n θ^{zi}(1 − θ)^{1−zi} dF(θ),

where

F(θ) = lim_{n→∞} Pr( (1/n) Σ_{i=1}^n zi ≤ θ ).
What does de Finetti's theorem say? It says that if the sequence z1, z2, . . . is considered exchangeable then it is as if the zi's are i.i.d. Bernoulli given θ,

zi | θ ~ Bernoulli(θ),   i = 1, . . . , n,   i.i.d.,

where θ is a random variable with a distribution which is the limit distribution of the sample average n⁻¹ Σ_{i=1}^n zi. So one way of defending a model like

zi | θ ~ Bernoulli(θ),   i = 1, . . . , n,   i.i.d.,   (25)
θ ~ π(θ),   (26)

is to appeal to exchangeability and think about your beliefs about the limit of n⁻¹ Σ_{i=1}^n zi when you pick the prior π(θ).
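For instance (an illustrative sketch, not from the notes), with a Beta prior π(θ) = Beta(a, b) the model (25)-(26) has a closed-form posterior, Beta(a + Σ zi, b + n − Σ zi); the data below are simulated and the prior is a made-up choice.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)

# Hypothetical exchangeable 0-1 data.
theta_true, n = 0.3, 200
z = rng.binomial(1, theta_true, size=n)

# Beta(a, b) prior on theta; a = b = 1 is the uniform prior.
a, b = 1.0, 1.0
posterior = beta(a + z.sum(), b + n - z.sum())

print(posterior.mean())           # posterior mean of theta
print(posterior.interval(0.95))   # central 95% posterior interval
```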
De Finetti's theorem can be generalized to sequences of continuous random variables; see the book by Schervish for theorems and proofs.

Frequentist properties of Bayes procedures

It is often of interest to evaluate Bayes estimators from a classical frequentist perspective. Consider first the issue of consistency. Suppose we have computed the posterior distribution of the parameter vector θ,

p(θ|y) = p(y|θ)p(θ) / p(y).

Suppose now we also make the assumption that there is a population distribution f(y). As a measure of the difference between the sample distribution used to compute the posterior, p(y|θ), and the actual population distribution we can use the Kullback-Leibler discrepancy,

H(θ) = ∫ log[ f(y) / p(y|θ) ] f(y) dy.   (27)
Let θ* be the value of θ that minimizes this discrepancy. One can show that if f(yi) = p(yi|θ0), i.e., the sample distribution is correctly specified and the population is indexed by some true value θ0, then θ* = θ0. We then have the following result (for proofs see the textbook by Schervish or Gelman et al. on the reading list).

Theorem 2. If the parameter space Θ is compact and A is a neighborhood of θ* with nonzero prior probability, then p(θ ∈ A|y) → 1 as n → ∞.

So in the case of correct specification, f(yi) = p(yi|θ0), the posterior will concentrate around the true value θ0 asymptotically, as long as θ0 is contained in the support of the prior. Under misspecification the posterior will concentrate around the value of θ that minimizes the distance to the true model.

Now we shall consider the frequentist risk properties of Bayes estimators. To this end we shall first define the frequentist risk of an estimator θ̂(y). This is

r(θ, θ̂) = ∫ L(θ, θ̂(y)) p(y|θ) dy.   (28)
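To see how the two integrals differ in practice, here is a small Monte Carlo sketch (illustrative only; the normal model with known variance, the flat prior, and quadratic loss are assumptions made for the example): the posterior expected loss (3) averages the loss over θ draws given the one observed y, while the frequentist risk (28) averages over repeated data sets for a fixed θ.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 1.0, 25

def loss(theta, a):
    return (theta - a) ** 2   # quadratic loss

# Posterior expected loss rho(a|y): condition on one observed data set.
y = rng.normal(1.0, sigma, size=n)                                   # the data we actually saw
theta_post = rng.normal(y.mean(), sigma / np.sqrt(n), size=50_000)   # draws from p(theta|y), flat prior
a = y.mean()                                                         # candidate action: the posterior mean
rho = loss(theta_post, a).mean()

# Frequentist risk r(theta, thetahat): average over repeated samples for a fixed theta.
theta0 = 1.0
y_rep = rng.normal(theta0, sigma, size=(50_000, n))   # many hypothetical data sets
thetahat = y_rep.mean(axis=1)                         # the estimator ybar applied to each
risk = loss(theta0, thetahat).mean()

print(rho, risk)   # both are close to sigma^2 / n in this example
```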
Note the difference between this risk measure and the Bayesian risk measure (3): the frequentist risk averages over the data for a given parameter θ, whereas the Bayesian risk measure averages over the parameter space given the data. Furthermore, the frequentist risk is a function of both θ and the proposed estimator θ̂. The Bayes risk is only a function of a = θ̂.

There are two popular ways to choose estimators optimally based on their frequentist risk: minimaxity and admissibility. It turns out that there is a close relationship between admissibility and Bayes estimators.

Definition 6. An estimator θ̂ is inadmissible if there exists an estimator θ̂1 which dominates θ̂, i.e., such that for every θ,

r(θ, θ̂) ≥ r(θ, θ̂1),

and, for at least one value θ0,

r(θ0, θ̂) > r(θ0, θ̂1).

If θ̂ is not inadmissible it is admissible.

The idea behind admissibility is to reduce the number of potential estimators to consider. Indeed, it seems hard to defend using an inadmissible estimator. Under mild conditions Bayes estimators can be shown to be admissible. Under somewhat stronger conditions one can in fact show the reverse: all admissible estimators are Bayes estimators (or limits of Bayes estimators). A theorem proving this is called a complete class theorem, and different versions of complete class theorems exist (for more about this see the book by Berger).

Comparisons between classical and Bayesian inference

The fundamental difference between classical frequentist inference and Bayesian inference is in the use of pre-data versus post-data probability statements. The frequentist approach is limited to pre-data considerations. This approach answers questions of the following form:
(Q1) Before we have seen the data, what data do we expect to get?
(Q2) If we use the as yet unknown data to estimate parameters by some known algorithm, how accurate do we expect the estimates to be?
(Q3) If the hypothesis being tested is in fact true, what is the probability that we shall get data indicating that it is true?
These questions can also be answered in the Bayesian approach. However, followers of the Bayesian approach argue that these questions are not relevant for scientific inference. What is relevant are post-data questions:
(Q1') After having seen the data, do we have any reason to be surprised by them?
(Q2') After we have seen the data, what parameter estimates can we now make, and what accuracy are we entitled to claim?
(Q3') What is the probability, conditional on the data, that the hypothesis is true?
Questions (Q1')-(Q3') are only meaningful in a Bayesian framework. In the frequentist approach one cannot talk about the probability of a hypothesis. The marginal propensity to consume is either .92 or not. A frequentist 95 pct. confidence interval (a, b) does not mean that the probability of a < θ < b is 95 pct.: θ either belongs to the interval (a, b) or not. Sometimes frequentist and Bayesian procedures give similar results although their interpretations differ.

Example 4. In Example 3 we found the posterior

p(µ|y) = N(µ̄, σ̄²),

where

µ̄ = [(n/σ²)ȳ + (1/σ0²)µ0] / [(n/σ²) + (1/σ0²)],
σ̄² = 1 / [n/σ² + 1/σ0²].
Suppose we look at the "limit prior" σ0 → ∞ (this is a special case of a "non-informative" prior; we will discuss these later). Then

p(µ|y) = N(ȳ, σ²/n),

and the Bayes estimate under squared error loss, plus/minus one posterior standard deviation, is

µ̂_B = ȳ ± σ/√n.

On the other hand, the repeated sample distribution of µ̂ = ȳ is

p(ȳ|µ) = N(µ, σ²/n),   (29)
and the estimate plus/minus one standard deviation of the repeated sample distribution is

ȳ = µ ± σ/√n.   (30)

Conceptually (29) and (30) are very different, but the final statements one would make about µ would be nearly identical. We stress once again the difference between (30) and (29). (30) answers the question
(Q1) How much would the estimate of µ vary over the class of all data sets that we might conceivably get?
whereas (29) answers the question
(Q2) How accurately is the value of µ determined by the one data set that we actually have?

The Bayesian camp has often criticized the fact that the frequentist approach takes data that could have been observed, but wasn't, into account when conducting inference about the parameter vector. Here is an often quoted example of how this affects testing.

Example 5. Suppose in 12 independent tosses of a coin you observe 9 heads and 3 tails. Let θ = probability of heads. You wish to test H0: θ = 1/2 vs. H1: θ > 1/2. Given that this is all the information you have, there are two candidates for the likelihood function:
(1) Binomial. The number n = 12 was fixed beforehand and the random quantity X was the number of heads observed in n tosses. Then X ~ Bin(12, θ) and
L1(θ) = C(12, 9) θ⁹(1 − θ)³.
(2) Negative binomial. The coin was flipped until the third tail appeared. Then the random quantity is X = the number of heads observed before the experiment terminated, so X ~ NegBin(3, θ) and
L2(θ) = C(11, 9) θ⁹(1 − θ)³.

Suppose we use the test statistic X = number of heads and the decision rule "reject H0 if X ≥ c". The p-value is the probability of observing the data X = 9 or something more extreme under H0. This is

α1 = Pr(X ≥ 9 | θ = 1/2) = Σ_{j=9}^{12} C(12, j) (1/2)^j (1/2)^{12−j} = .075,
α2 = Pr(X ≥ 9 | θ = 1/2) = Σ_{j=9}^{∞} C(2 + j, j) (1/2)^j (1/2)³ = .0325.

So using a conventional Type I error level α = .05, the two model assumptions lead to two different conclusions. But there is nothing in the situation that tells us which of the two models we should use. What happens here is that the Neyman-Pearson test procedure allows unobserved outcomes to affect the results: X values more extreme than 9 were used as evidence against the null. The prominent Bayesian Harold Jeffreys described this situation as "a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred".

There is also an important difference between frequentist and Bayesian approaches to the elimination of nuisance parameters. In the frequentist approach nuisance parameters are usually eliminated by the plug-in method. Suppose we have an estimator θ̂1 of a parameter θ1 which depends on another parameter θ2: θ̂1 = θ̂1(y, θ2). Typically one would get rid of the dependence on θ2 by plugging in an estimate of θ2, giving the plug-in estimator θ̂1(y, θ̂2(y)). In the Bayesian approach one gets rid of nuisance parameters by integration. Suppose the joint posterior distribution of θ1 and θ2 is p(θ1, θ2|y). Inference about θ1 is then based on the marginal posterior

p(θ1|y) = ∫ p(θ1, θ2|y) dθ2.
Note that we can rewrite this integration as

p(θ1|y) = ∫ p(θ1, θ2|y) dθ2 = ∫ p(θ1|θ2, y) p(θ2|y) dθ2,

so instead of plugging in a single value of θ2 we average over all possible values of θ2 by integrating the conditional posterior of θ1 given θ2 w.r.t. the marginal posterior for θ2.
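In simulation-based practice this integration is automatic: if we can draw θ2 from p(θ2|y) and then θ1 from p(θ1|θ2, y), the resulting θ1 draws come from the marginal posterior p(θ1|y). A schematic sketch follows; the two sampling functions are hypothetical placeholders, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_theta2_given_y():
    # Placeholder: draw the nuisance parameter from its marginal posterior p(theta2|y).
    return rng.gamma(shape=5.0, scale=0.2)

def draw_theta1_given_theta2_y(theta2):
    # Placeholder: draw theta1 from the conditional posterior p(theta1|theta2, y).
    return rng.normal(loc=1.0, scale=theta2**-0.5)

# Composition: each theta1 draw is marginal over theta2, so no plug-in is needed.
theta1_draws = np.array([draw_theta1_given_theta2_y(draw_theta2_given_y())
                         for _ in range(10_000)])
print(theta1_draws.mean(), theta1_draws.std())
```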
Bayesian Mechanics

The normal linear regression model

The sampling distribution of the n-vector of observable data y is

p(y|X, θ) = N(y | Xβ, τ⁻¹ I_n),   (31)
where X is n × k with rank k and θ = (β, τ). Note that the covariance matrix τ⁻¹ I_n is formulated in terms of the precision τ of the observations; the precision is the inverse of the variance. We need a prior for θ. There are two popular choices:

p(β) = N(β | β0, Λ0⁻¹), p(τ) = G(τ | α1, α2),   and   p(β, τ) ∝ τ⁻¹.   (32)
The first prior specifies that β and τ are a priori independent, with β having a multivariate normal prior with mean β0 and covariance Λ0⁻¹, and τ having a gamma prior with shape parameter α1 and inverse scale parameter α2. (We could also have chosen to work with the variance σ² = 1/τ; the implied prior on σ² is then an inverse gamma distribution.) The second prior is a "non-informative" prior. This is a prior that you may want to use if you don't have much prior information about θ available (you may be wondering why τ⁻¹ represents a "non-informative" prior on τ; this will become clearer below).

Consider the second prior first. The posterior distribution of θ is

p(θ|y) ∝ τ^(n/2−1) exp{ −(τ/2)(y − Xβ)′(y − Xβ) }
= τ^(n/2−1) exp{ −(τ/2)[ (y − Xβ̂)′(y − Xβ̂) + (β − β̂)′X′X(β − β̂) ] },   (33)

where β̂ = (X′X)⁻¹X′y.
If we are interested primarily in β we can integrate out the nuisance parameter τ to get the marginal posterior of β. Letting s(y) = (y − Xβ̂)′(y − Xβ̂) we get

p(β|y) = ∫ p(θ|y) dτ
∝ ∫ τ^(n/2−1) exp{ −(τ/2)[ s(y) + (β − β̂)′X′X(β − β̂) ] } dτ   (34)
∝ [ 1 + (1/s(y)) (β − β̂)′X′X(β − β̂) ]^(−n/2),   (35)

which is the kernel of a multivariate t distribution,

p(β|y) = t_{n−k}(β | β̂, Σ̂),   (36)
with n − k degrees of freedom, mean β̂ and scale matrix

Σ̂ = [s(y)/(n − k)] (X′X)⁻¹.

Note that this is exactly equivalent to the repeated sample distribution of β̂. We can also derive the marginal posterior of τ. From (33) we get

p(τ|y) = ∫ p(θ|y) dβ
∝ ∫ τ^(n/2−1) exp{ −(τ/2)[ s(y) + (β − β̂)′X′X(β − β̂) ] } dβ   (37)
∝ τ^(n/2−1) exp{ −(τ/2) s(y) } ∫ exp{ −(τ/2)(β − β̂)′X′X(β − β̂) } dβ
∝ τ^((n−k)/2−1) exp{ −(τ/2) s(y) },   (38)

which we recognize as the kernel of a gamma distribution with shape parameter (n − k)/2 and inverse scale s(y)/2,

p(τ|y) = G( τ | (n − k)/2, s(y)/2 ).   (39)

Note that the mean of this distribution is

E[τ|y] = (n − k)/s(y) = 1/σ̂²,

where σ̂² = s(y)/(n − k).
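Under the non-informative prior the joint posterior can therefore be simulated directly: draw τ from the gamma marginal (39), then β from its conditional given τ, which by (33) is N(β̂, (τX′X)⁻¹). A sketch with simulated data (illustrative values only), assuming numpy is available:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated regression data (illustrative only).
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true, tau_true = np.array([1.0, -2.0, 0.5]), 4.0
y = X @ beta_true + rng.normal(scale=tau_true**-0.5, size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
s = (y - X @ beta_hat) @ (y - X @ beta_hat)

# Monte Carlo draws from the posterior under p(beta, tau) proportional to 1/tau:
# tau | y ~ Gamma((n-k)/2, rate s/2), then beta | tau, y ~ N(beta_hat, (tau X'X)^-1).
ndraws = 5000
tau_draws = rng.gamma(shape=(n - k) / 2, scale=2 / s, size=ndraws)   # numpy uses scale = 1/rate
beta_draws = np.array([rng.multivariate_normal(beta_hat, np.linalg.inv(t * XtX))
                       for t in tau_draws])

print(beta_draws.mean(axis=0), tau_draws.mean())   # close to beta_hat and (n-k)/s
```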
Now we can see one way the prior p(β, τ) ∝ τ⁻¹ may be considered "non-informative": the marginal posterior distributions have properties closely resembling the corresponding repeated sample distributions.

For the first prior we get

p(θ|y) ∝ τ^(n/2) exp{ −(τ/2)(y − Xβ)′(y − Xβ) } × exp{ −(1/2)(β − β0)′Λ0(β − β0) } τ^(α1−1) exp{−α2τ}.   (40)

This can be rewritten as

p(θ|y) ∝ τ^(n/2+α1−1) exp{ −(τ/2)[ s(y) + (β − β̂)′X′X(β − β̂) ] } × exp{ −(1/2)(β − β0)′Λ0(β − β0) } exp{−α2τ}.   (41)
This joint posterior of θ does not lead to convenient expressions for the marginals of β and τ .
We can, however, derive analytical expressions for the conditional posteriors p(β|τ, y) and p(τ|β, y). These conditional posteriors turn out to play a fundamental role when designing simulation algorithms. Let us first derive the conditional posterior for β given τ. Remember that we then only need to include terms containing β. Then we get

p(β|τ, y) ∝ exp{ −(1/2)[ (β − β̂)′ τX′X (β − β̂) + (β − β0)′Λ0(β − β0) ] }.   (42)

Now we can use the following convenient expansion (for a proof see Box and Tiao (1973), p. 418):

Lemma 1. Let z, a, b be k-vectors and A, B be symmetric k × k matrices such that (A + B)⁻¹ exists. Then

(z − a)′A(z − a) + (z − b)′B(z − b) = (z − c)′(A + B)(z − c) + (a − b)′A(A + B)⁻¹B(a − b),

where c = (A + B)⁻¹(Aa + Bb).
Applying Lemma 1 it follows that

p(β|τ, y) ∝ exp{ −(1/2)(β − β̄)′ Σ̄⁻¹ (β − β̄) },

where

Σ̄ = (τX′X + Λ0)⁻¹,
β̄ = Σ̄(τX′y + Λ0β0).

So

p(β|τ, y) = N(β | β̄, Σ̄),   (43)

a multivariate normal distribution. Similarly, from (40) we get

p(τ|β, y) ∝ τ^(n/2+α1−1) exp{ −(τ/2)(y − Xβ)′(y − Xβ) − α2τ },

which is the kernel of a gamma distribution,

p(τ|β, y) = G( τ | n/2 + α1, (y − Xβ)′(y − Xβ)/2 + α2 ).   (44)
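As a preview of the simulation methods the notes return to later, the two conditionals (43) and (44) can be alternated in a Gibbs sampler. A minimal, self-contained sketch; the data are simulated and the hyperparameters β0, Λ0, α1, α2 are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data (illustrative), plus prior hyperparameters as in (32).
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
beta0, Lambda0 = np.zeros(k), np.eye(k) * 0.01   # vague normal prior N(beta0, Lambda0^-1)
a1, a2 = 2.0, 1.0                                # gamma prior G(alpha1, alpha2)

XtX, Xty = X.T @ X, X.T @ y
beta, tau = np.zeros(k), 1.0                     # starting values
draws = []
for _ in range(2000):
    # beta | tau, y ~ N(beta_bar, Sigma_bar), equation (43)
    Sigma_bar = np.linalg.inv(tau * XtX + Lambda0)
    beta_bar = Sigma_bar @ (tau * Xty + Lambda0 @ beta0)
    beta = rng.multivariate_normal(beta_bar, Sigma_bar)
    # tau | beta, y ~ Gamma(n/2 + alpha1, rate (y-Xb)'(y-Xb)/2 + alpha2), equation (44)
    resid = y - X @ beta
    tau = rng.gamma(shape=n / 2 + a1, scale=1 / (resid @ resid / 2 + a2))
    draws.append(np.append(beta, tau))

draws = np.array(draws)[500:]    # discard burn-in
print(draws.mean(axis=0))        # posterior means of (beta, tau)
```

Discarding an initial burn-in, the remaining draws behave like (correlated) samples from the joint posterior p(β, τ|y).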
We will see later in more detail how easy it is to simulate draws from p(β|y) and p(τ|y) using these conditional distributions.

The SURE model

Consider now the model

y_{ij} = x_{ij}′β_j + ε_{ij},   i = 1, . . . , n; j = 1, . . . , J,   (45)
where ε_i = (ε_{i1}, . . . , ε_{iJ}) is assumed jointly normal, ε_i|Λ ~ N(0, Λ⁻¹). We can rewrite this model as

Y_i = X_i β + ε_i,   i = 1, . . . , n,   (46)

where Y_i = (y_{i1}, . . . , y_{iJ})′,

X_i = diag(x_{i1}′, x_{i2}′, . . . , x_{iJ}′)

is the block-diagonal matrix with the row vectors x_{ij}′ on the diagonal, and

β = (β_1′, β_2′, . . . , β_J′)′

stacks the equation-specific coefficient vectors.
We need to specify a prior on θ = (β, Λ). Again we can consider a non-informative and an informative prior. The usual non-informative prior for this model is

p(β, Λ) ∝ |Λ|^(−(J+1)/2).   (47)
Alternatively, one can use p(β) = N(β|β0, Λ0⁻¹), p(Λ) = W(Λ|ν, S). The prior for β is a multivariate normal distribution as before. The Λ prior is a Wishart distribution; this is the multivariate generalization of the gamma distribution. The Wishart distribution has mean E[Λ] = νS.

The posterior under the first prior is

p(β, Λ|y) ∝ |Λ|^(−(J+1)/2) |Λ|^(n/2) exp{ −(1/2) Σ_{i=1}^n (Y_i − X_iβ)′Λ(Y_i − X_iβ) }
= |Λ|^((n−J−1)/2) exp{ −(1/2) Σ_{i=1}^n (Y_i − X_iβ)′Λ(Y_i − X_iβ) }.   (48)
Using the well-known result

Σ_{i=1}^n (Y_i − X_iβ)′Λ(Y_i − X_iβ) = Σ_{i=1}^n (Y_i − X_iβ̂(Λ))′Λ(Y_i − X_iβ̂(Λ)) + (β − β̂(Λ))′ [Σ_{i=1}^n X_i′ΛX_i] (β − β̂(Λ)),   (49)
n X i=1
we find the conditional posterior
−1 Xi0 ΛXi
n X
Xi0 ΛYi ,
i=1
X 1 0 ˆ ˆ p(β|Λ, y) ∝ exp − (β − β(Λ)) Xi0 ΛXi (β − β(Λ) 2 i=1
So
n
n −1 X ˆ Xi0 ΛXi . p(β|Λ, y) = N β|β(Λ),
(50)
(51)
i=1
The conditional posterior of β is normal with mean equal to the efficient GLS estimator (when Λ is known). To get the conditional posterior for Λ note that n n o X 0 (Yi − Xi β) Λ(Yi − Xi β) = Tr ΛM (β) , i=1
where
n X (Yi − Xi β)(Yi − Xi β)0 . M (β) = i=1
From (48) we then get
1 n o p(Λ|β, y) ∝ |Λ|(n−J−1)/2 exp − Tr ΛM (β) . 2
(52)
This is the kernel of a Wishart distribution, p(Λ|β, y) = W n, M (β)−1 )
(53)
Note that the posterior of mean of Λ given β is
E[Λ|β, y] = nM (β)−1 . The inverse of this is n−1 M (β) which is the usual estimate of the covariance matrix if β is known.
27
Next let’s derive the conditional posteriors under the proper prior distributions. The joint posterior is ν−J−1)/2
p(β, Λ|y) ∝ |Λ|
1 1 −1 0 exp − (β − β0 ) Λ0 (β − β0 ) × exp − Tr ΛS 2 2 n 1X n/2 0 |Λ| exp − (Yi − Xi β) Λ(Yi − Xi β) (54) 2 i=1
The conditional for β is then 1 p(β|Λ, y) ∝ exp − (β − β0 )0 Λ0 (β − β0 ) exp − 2 1 ∝ exp − (β − β0 )0 Λ0 (β − β0 ) exp − 2
1X (Yi − Xi β)0 Λ(Yi − Xi β) 2 i=1 n
n X 1 0 ˆ ˆ Xi0 ΛXi )(β − β(Λ) (β − β(Λ)) ( 2 i=1
Application of lemma 1 then gives ¯ ¯ p(β|Λ, y) = N β|β(Λ), Σ(Λ) ,
(55)
where n −1 X ¯ Σ(Λ) = Λ0 + Xi0 ΛXi ,
(56)
i=1
n X ¯ ¯ β(Λ) = Σ(Λ) Xi0 ΛYi + Λ0 β0 .
(57)
i=1
The conditional for Λ is n 1 1X p(Λ|β, y) ∝ |Λ|(n+ν−J−1)/2 exp − Tr ΛS −1 − (Yi − Xi β)0 Λ(Yi − Xi β) 2 2 i=1 1 1 ∝ |Λ|(n+ν−J−1)/2 exp − Tr ΛS −1 − Tr ΛM (β) 2 2 1 ∝ |Λ|(n+ν−J−1)/2 exp − Tr Λ S −1 + M (β) , 2
which is the kernel of a Wishart distribution,
−1 p(Λ|β, y) = W Λn + ν, S −1 + M (β) 28
(58)
Readings

• Bayesian foundations and philosophy
  – Jaynes, E.T. (1994), "Probability: The logic of science", unpublished book. Chapters may be downloaded from http://bayes.wustl.edu/etj/prob.html
  – Jeffreys, H. (1961), "Theory of Probability", Oxford University Press.
• Bayesian statistics and econometrics
  – Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995), "Bayesian Data Analysis", Chapman and Hall.
  – Schervish, M.J. (1995), "Theory of Statistics", Springer.
  – Zellner, A. (1971), "An Introduction to Bayesian Inference in Econometrics", Wiley.
• Bayesian statistics and Decision Theory
  – Berger, J.O. (1985), "Statistical Decision Theory and Bayesian Analysis", Springer.
  – Robert, C.P. (1994), "The Bayesian Choice", Springer.