Lecture Notes 14: Bayesian Inference

Relevant material is in Chapter 11.

1 Introduction
So far we have been using frequentist (or classical) methods. In the frequentist approach, probability is interpreted as long run frequencies. The goal of frequentist inference is to create procedures with long run guarantees. Indeed, a better name for frequentist inference might be procedural inference. Moreover, the guarantees should be uniform over θ if possible. For example, a confidence interval traps the true value of θ with probability 1 − α, no matter what the true value of θ is. In frequentist inference, procedures are random while parameters are fixed, unknown quantities.

In the Bayesian approach, probability is regarded as a measure of subjective degree of belief. In this framework, everything, including parameters, is regarded as random. There are no long run frequency guarantees. Bayesian inference is quite controversial. Note that when we used Bayes estimators in minimax theory, we were not doing Bayesian inference. We were simply using Bayesian estimators as a method to derive minimax estimators.

Before proceeding, here are two comics (Figures 1 and 2). The first is from xkcd.com and the second is from me (in response to xkcd). One very important point, which causes a lot of confusion, is this:
Using Bayes' Theorem ≠ Bayesian inference

The difference between Bayesian inference and frequentist inference is the goal.

Bayesian Goal: Quantify and analyze subjective degrees of belief.

Frequentist Goal: Create procedures that have frequency guarantees.

Neither method of inference is right or wrong. Which one you use depends on your goal. If your goal is to quantify and analyze your subjective degrees of belief, you should use Bayesian inference. If your goal is to create procedures that have frequency guarantees, then you should use frequentist procedures. Sometimes you can do both. That is, sometimes a Bayesian method will also have good frequentist properties. Sometimes it won't. A summary of the main ideas is in Table 1.
Figure 1: From xkcd.com
Figure 2: My comic. (Panel dialogue: "I need an estimate for the mass of this top quark." "Here is a Bayesian 95 percent interval." "I need an estimate for the mass of this muon neutrino." "Here is a Bayesian 95 percent interval." "I need an estimate for the Hubble constant." "Here is a Bayesian 95 percent interval." ... and so on ... "Hey! None of these intervals contained the true value! Can I get my money back?" "Of course not. Those were de Finetti style, fully coherent representations of your beliefs. They weren't confidence intervals.")
                      Bayesian                         Frequentist
Probability           subjective degree of belief      limiting frequency
Goal                  analyze beliefs                  create procedures with frequency
                                                       guarantees
θ                     random variable                  fixed
X                     random variable                  random variable
Use Bayes' theorem?   Yes. To update beliefs.          Yes, if it leads to a procedure with
                                                       good frequentist behavior. Otherwise no.

Table 1: Bayesian versus Frequentist Inference
To add to the confusion:

Bayes' nets are directed graphs endowed with distributions. This has nothing to do with Bayesian inference.

Bayes' rule is the optimal classification rule in a binary classification problem. This has nothing to do with Bayesian inference.
2 The Mechanics of Bayes
Let X1, ..., Xn ∼ p(x|θ). In Bayes we also include a prior π(θ). It follows from Bayes' theorem that the posterior distribution of θ given the data is

\[
\pi(\theta \mid X_1, \ldots, X_n) = \frac{p(X_1, \ldots, X_n \mid \theta)\,\pi(\theta)}{m(X_1, \ldots, X_n)}
\]

where

\[
m(X_1, \ldots, X_n) = \int p(X_1, \ldots, X_n \mid \theta)\,\pi(\theta)\,d\theta.
\]

Hence, π(θ|X1, ..., Xn) ∝ L(θ)π(θ) where L(θ) = p(X1, ..., Xn|θ) is the likelihood function. The interpretation is that π(θ|X1, ..., Xn) represents your subjective beliefs about θ after observing X1, ..., Xn. A commonly used point estimator is the posterior mean

\[
\bar\theta = E(\theta \mid X_1, \ldots, X_n) = \int \theta\, \pi(\theta \mid X_1, \ldots, X_n)\, d\theta
           = \frac{\int \theta\, L(\theta)\pi(\theta)\, d\theta}{\int L(\theta)\pi(\theta)\, d\theta}.
\]

For interval estimation we use C = (a, b) where a and b are chosen so that

\[
\int_a^b \pi(\theta \mid X_1, \ldots, X_n)\, d\theta = 1 - \alpha.
\]
The interpretation is that P(θ ∈ C|X1, ..., Xn) = 1 − α. This does not mean that C traps θ with probability 1 − α. We will discuss the distinction in detail later.
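Before the examples, here is a minimal computational sketch (not part of the notes) of the mechanics above: evaluate likelihood × prior on a grid of θ values, normalize numerically, and read off the posterior mean and an equal-tail interval. The model (Normal(θ, 1) data with a standard Cauchy prior, 50 simulated observations) is an arbitrary, deliberately non-conjugate choice so that the grid is actually needed.

    # A minimal sketch (not from the notes): posterior proportional to likelihood x prior,
    # evaluated on a grid. Model and prior are arbitrary illustrative choices.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(0.5, 1.0, size=50)                   # simulated data

    grid = np.linspace(-3, 3, 4001)                     # grid of theta values
    dt = grid[1] - grid[0]
    log_post = (stats.norm.logpdf(x[:, None], loc=grid, scale=1.0).sum(axis=0)  # log-likelihood
                + stats.cauchy.logpdf(grid))                                    # + log-prior
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * dt                             # normalize so it integrates to 1

    post_mean = (grid * post).sum() * dt                # posterior mean
    cdf = np.cumsum(post) * dt
    a, b = grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)]
    print(post_mean, (a, b))                            # 95 percent posterior interval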
Example 1 Let X1, ..., Xn ∼ Bernoulli(p). Let the prior be p ∼ Beta(α, β). Hence

\[
\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}
\]

where

\[
\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\, dt.
\]

Set Y = Σ_i Xi. Then

\[
\pi(p \mid X) \propto \underbrace{p^{Y}(1-p)^{n-Y}}_{\text{likelihood}} \times \underbrace{p^{\alpha-1}(1-p)^{\beta-1}}_{\text{prior}}
\propto p^{Y+\alpha-1}(1-p)^{n-Y+\beta-1}.
\]

Therefore, p|X ∼ Beta(Y + α, n − Y + β). (See page 325 for more details.) The Bayes estimator is

\[
\tilde p = \frac{Y+\alpha}{(Y+\alpha) + (n-Y+\beta)} = \frac{Y+\alpha}{\alpha+\beta+n}
         = (1-\lambda)\,\hat p_{\rm mle} + \lambda\,\bar p
\]

where

\[
\bar p = \frac{\alpha}{\alpha+\beta}, \qquad \lambda = \frac{\alpha+\beta}{\alpha+\beta+n}.
\]

This is an example of a conjugate prior.
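Here is a small numerical check (not part of the notes) that the posterior mean equals the shrinkage formula above; the prior parameters α = β = 2 and the simulated data are arbitrary choices.

    # A sketch (not from the notes) of the conjugate update in Example 1.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, alpha, beta = 40, 2.0, 2.0
    x = rng.binomial(1, 0.7, size=n)
    Y = x.sum()

    posterior = stats.beta(Y + alpha, n - Y + beta)      # p | X ~ Beta(Y + a, n - Y + b)
    p_mle = Y / n
    lam = (alpha + beta) / (alpha + beta + n)
    p_bar = alpha / (alpha + beta)

    print(posterior.mean())                              # posterior mean (Bayes estimator)
    print((1 - lam) * p_mle + lam * p_bar)               # same value: shrinkage toward the prior mean
    print(posterior.interval(0.95))                      # equal-tail 95 percent posterior interval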
Example 2 Let X1, ..., Xn ∼ N(µ, σ²) with σ² known. Let µ ∼ N(m, τ²). Then

\[
E(\mu \mid X) = \frac{\tau^2}{\tau^2 + \frac{\sigma^2}{n}}\,\bar X + \frac{\frac{\sigma^2}{n}}{\tau^2 + \frac{\sigma^2}{n}}\, m
\]

and

\[
\mathrm{Var}(\mu \mid X) = \frac{\sigma^2\tau^2/n}{\tau^2 + \frac{\sigma^2}{n}}.
\]
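A quick numerical illustration (not part of the notes): the posterior mean above is a weighted average of X̄ and the prior mean m, and it moves toward X̄ as n grows. The specific values (σ = 1, m = 0, τ = 2, true µ = 1.5) are arbitrary.

    # A sketch (not from the notes) of the Normal-Normal posterior in Example 2.
    import numpy as np

    rng = np.random.default_rng(2)
    sigma, m, tau, mu_true = 1.0, 0.0, 2.0, 1.5

    for n in (2, 20, 200):
        x = rng.normal(mu_true, sigma, size=n)
        xbar = x.mean()
        se2 = sigma**2 / n                                    # sigma^2 / n
        post_mean = (tau**2 * xbar + se2 * m) / (tau**2 + se2)
        post_var = (se2 * tau**2) / (tau**2 + se2)
        print(n, xbar, post_mean, post_var)                   # posterior mean approaches xbar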
Example 3 Suppose that X1, ..., Xn ∼ Bernoulli(p1) and that Y1, ..., Ym ∼ Bernoulli(p2). We are interested in δ = p2 − p1. Let us use the prior π(p1, p2) = 1. The posterior for (p1, p2) is

\[
\pi(p_1, p_2 \mid {\rm Data}) \propto p_1^{X}(1-p_1)^{n-X}\, p_2^{Y}(1-p_2)^{m-Y} \propto g(p_1)\,h(p_2)
\]

where X = Σ_i Xi, Y = Σ_i Yi, g is a Beta(X + 1, n − X + 1) density and h is a Beta(Y + 1, m − Y + 1) density. To get the posterior for δ we need to do a change of variables (p1, p2) → (δ, p2) to get π(δ, p2|Data). Then we integrate:

\[
\pi(\delta \mid {\rm Data}) = \int \pi(\delta, p_2 \mid {\rm Data})\, dp_2.
\]

(An easier approach is to use simulation, as sketched below.)
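Here is the simulation approach sketched in Python (not part of the notes); the counts n = 50, X = 30, m = 60, Y = 24 are invented for illustration.

    # A sketch (not from the notes): draw from the two independent Beta posteriors
    # and look at delta = p2 - p1 directly.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, X = 50, 30          # X = number of successes among the X_i
    m, Y = 60, 24          # Y = number of successes among the Y_i

    p1 = stats.beta.rvs(X + 1, n - X + 1, size=100_000, random_state=rng)
    p2 = stats.beta.rvs(Y + 1, m - Y + 1, size=100_000, random_state=rng)
    delta = p2 - p1        # draws from the posterior of delta

    print(delta.mean())                            # posterior mean of delta
    print(np.quantile(delta, [0.025, 0.975]))      # 95 percent posterior interval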
3 Where Does the Prior Come From?
This is the million dollar question. In principle, the Bayesian is supposed to choose a prior π that represents their prior information. This will be challenging in high dimensional cases, to say the least. Also, critics will say that someone's prior opinions should not be included in a data analysis because this is not scientific. There has been some effort to define "noninformative priors" but this has not worked out so well. An example is the Jeffreys prior, which is defined to be

\[
\pi(\theta) \propto \sqrt{I(\theta)}.
\]

You can use a flat prior, but be aware that this prior doesn't retain its flatness under transformations. In high dimensional cases, the prior ends up being highly influential. The result is that Bayesian methods tend to have poor frequentist behavior. We'll return to this point soon. It is common to use flat priors even if they don't integrate to 1. This is possible since the posterior might still integrate to 1 even if the prior doesn't.
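As a concrete illustration (a standard calculation, not from the notes): for the Bernoulli(p) model the Fisher information is I(p) = 1/(p(1 − p)), so the Jeffreys prior is

\[
\pi(p) \propto \sqrt{I(p)} = p^{-1/2}(1-p)^{-1/2},
\]

which is a Beta(1/2, 1/2) density. It is not flat, but unlike the flat prior it is invariant under reparametrization.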
4 Large Sample Theory
There is a Bayesian central limit theorem. In nice models, with large n,

\[
\pi(\theta \mid X_1, \ldots, X_n) \approx N\!\left(\hat\theta,\ \frac{1}{I_n(\hat\theta)}\right)
\tag{1}
\]

where θ̂ is the mle and I_n is the Fisher information. In these cases, the 1 − α Bayesian intervals will be approximately the same as the frequentist confidence intervals. That is, an approximate 1 − α posterior interval is

\[
C = \hat\theta \pm \frac{z_{\alpha/2}}{\sqrt{I_n(\hat\theta)}}
\]

which is the Wald confidence interval. However, this is only true if n is large and the dimension of the model is fixed. Let's summarize this point:

In low dimensional models, with lots of data and assuming the usual regularity conditions, Bayesian posterior intervals will also be frequentist confidence intervals. In this case, there is little difference between the two.

Here is a rough derivation of (1). Note that

\[
\log \pi(\theta \mid X_1, \ldots, X_n) = \sum_{i=1}^n \log p(X_i \mid \theta) + \log \pi(\theta) - \log C
\]

where C is the normalizing constant. The sum has n terms, which grows with the sample size, while the last two terms are O(1). So the sum dominates, that is,

\[
\log \pi(\theta \mid X_1, \ldots, X_n) \approx \sum_{i=1}^n \log p(X_i \mid \theta) = \ell(\theta).
\]

Next, Taylor expanding ℓ around the mle θ̂,

\[
\ell(\theta) \approx \ell(\hat\theta) + (\theta - \hat\theta)\,\ell'(\hat\theta) + \frac{(\theta - \hat\theta)^2}{2}\,\ell''(\hat\theta).
\]

Now ℓ'(θ̂) = 0, so

\[
\ell(\theta) \approx \ell(\hat\theta) + \frac{(\theta - \hat\theta)^2}{2}\,\ell''(\hat\theta).
\]

Thus, approximately,

\[
\pi(\theta \mid X_1, \ldots, X_n) \propto \exp\!\left( -\frac{(\theta - \hat\theta)^2}{2\sigma^2} \right)
\]

where

\[
\sigma^2 = -\frac{1}{\ell''(\hat\theta)}.
\]

Let ℓ_i(θ) = log p(X_i|θ) and let θ0 denote the true value. Since θ̂ ≈ θ0,

\[
\ell''(\hat\theta) \approx \ell''(\theta_0) = \sum_i \ell_i''(\theta_0)
 = n\left(\frac{1}{n}\sum_i \ell_i''(\theta_0)\right) \approx -nI_1(\theta_0) \approx -nI_1(\hat\theta) = -I_n(\hat\theta)
\]

and therefore σ² ≈ 1/I_n(θ̂).
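As a quick illustration (not part of the notes) of this point: for a Bernoulli model with a Beta(2, 2) prior (an arbitrary choice) and a fairly large n, the exact 95 percent posterior interval nearly coincides with the Wald interval p̂ ± 1.96 √(p̂(1 − p̂)/n), which is θ̂ ± z_{α/2}/√(I_n(θ̂)) for this model.

    # A sketch (not from the notes): Bayesian posterior interval vs. Wald interval, large n.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 2000
    x = rng.binomial(1, 0.4, size=n)
    Y = x.sum()
    phat = Y / n

    posterior = stats.beta(Y + 2, n - Y + 2)              # exact posterior under Beta(2, 2) prior
    bayes_int = posterior.interval(0.95)

    se = np.sqrt(phat * (1 - phat) / n)                   # 1 / sqrt(I_n(phat)) for Bernoulli
    wald_int = (phat - 1.96 * se, phat + 1.96 * se)

    print(bayes_int)
    print(wald_int)                                       # the two intervals nearly coincide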
5 Bayes Versus Frequentist
In general, Bayesian and frequentist inferences can be quite different. If C is a 1 − α Bayesian interval then P(θ ∈ C|X1, ..., Xn) = 1 − α. This does not imply that the frequentist coverage, inf_θ P_θ(θ ∈ C), is equal to 1 − α.
Typically, a 1 − α Bayesian interval has coverage lower than 1 − α. Suppose you wake up every day and produce a Bayesian 95 percent interval for some parameter. (A different parameter every day.) The fraction of times your interval contains the true parameter will not be 95 percent. Here are some examples to make this clear.
Example 4 Normal means. Let Xi ∼ N(µi, 1), i = 1, ..., n. Suppose we use the flat prior π(µ1, ..., µn) ∝ 1. Then, with µ = (µ1, ..., µn), the posterior for µ is multivariate Normal with mean X = (X1, ..., Xn) and covariance matrix equal to the identity matrix. Let

\[
\theta = \sum_{i=1}^n \mu_i^2.
\]

Let Cn = [cn, ∞) where cn is chosen so that P(θ ∈ Cn|X1, ..., Xn) = .95. How often, in the frequentist sense, does Cn trap θ? Stein (1959) showed that

\[
P_\mu(\theta \in C_n) \to 0 \quad \text{as } n \to \infty.
\]

Thus, Pµ(θ ∈ Cn) ≈ 0 even though P(θ ∈ Cn|X1, ..., Xn) = .95.
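A small simulation sketch (not part of the notes) of the case µ = 0, so that θ = Σ µi² = 0. Since µ|X ∼ N(X, I) under the flat prior, θ|X has a noncentral chi-square distribution with n degrees of freedom and noncentrality Σ Xi² (a standard fact, used here only to compute cn); the sample size n = 100 and the 1000 repetitions are arbitrary.

    # A sketch (not from the notes): frequentist coverage of the 95 percent Bayesian
    # interval C_n = [c_n, infinity) in Example 4 when every mu_i = 0 (so theta = 0).
    # Under the flat prior, theta | X is noncentral chi-square(df = n, nc = sum(X_i^2)).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, reps = 100, 1000
    hits = 0
    for _ in range(reps):
        X = rng.normal(0.0, 1.0, size=n)                    # data with mu = 0, so theta = 0
        c_n = stats.ncx2.ppf(0.05, df=n, nc=np.sum(X**2))   # P(theta >= c_n | X) = 0.95
        hits += (0.0 >= c_n)                                # does C_n contain the true theta = 0?
    print(hits / reps)                                      # essentially 0, not 0.95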
Example 5 Sampling to a Foregone Conclusion. Let X1, X2, ... ∼ N(θ, 1). Suppose we continue sampling until T > k where T = √n |X̄n| and k is a fixed number, say, k = 20. The sample size N is now a random variable. It can be shown that P(N < ∞) = 1. It can also be shown that the posterior π(θ|X1, ..., XN) is the same as if N had been fixed in advance. That is, the randomness in N does not affect the posterior. Now if the prior π(θ) is smooth then the posterior is approximately θ|X1, ..., XN ∼ N(X̄n, 1/n). Hence, if Cn = X̄n ± 1.96/√n then P(θ ∈ Cn|X1, ..., XN) ≈ .95. Notice that 0 is never in Cn since, when we stop sampling, T > 20, and therefore

\[
|\bar X_n| - \frac{1.96}{\sqrt{n}} > \frac{20}{\sqrt{n}} - \frac{1.96}{\sqrt{n}} > 0.
\tag{2}
\]

Hence, when θ = 0, Pθ(θ ∈ Cn) = 0. Thus, the coverage is

\[
{\rm Coverage} = \inf_\theta P_\theta(\theta \in C_n) = 0.
\]

This is called sampling to a foregone conclusion and is a real issue in sequential clinical trials.

Example 6 Here is an example we discussed earlier. Let C = {c1, ..., cN} be a finite set of constants. For simplicity, assume that cj ∈ {0, 1} (although this is not important). Let

\[
\theta = \frac{1}{N}\sum_{j=1}^N c_j.
\]

Suppose we want to estimate θ. We proceed as follows. Let S1, ..., SN ∼ Bernoulli(π) where π is known. If Sj = 1 you get to see cj. Otherwise, you do not. (This is an example of survey sampling.) The likelihood function is

\[
\prod_{j=1}^N \pi^{S_j}(1-\pi)^{1-S_j}.
\]

The unknown parameter does not appear in the likelihood. In fact, there are no unknown parameters in the likelihood! The likelihood function contains no information at all. The posterior is the same as the prior. But we can estimate θ. Let

\[
\hat\theta = \frac{1}{N}\sum_{j=1}^N \frac{c_j S_j}{\pi}.
\]
Then E(θ̂) = θ. Hoeffding's inequality implies that

\[
P(|\hat\theta - \theta| > \epsilon) \le 2 e^{-2N\epsilon^2\pi^2}.
\]

Hence, θ̂ is close to θ with high probability. In particular, a 1 − α confidence interval is

\[
\hat\theta \pm \sqrt{\frac{\log(2/\alpha)}{2N\pi^2}}.
\]
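A quick simulation (not part of the notes) of Example 6; the population values, N = 10000 and π = 0.1 are invented, and the Hoeffding interval follows the formula above.

    # A sketch (not from the notes) of the survey-sampling estimator in Example 6.
    import numpy as np

    rng = np.random.default_rng(6)
    N, pi, alpha = 10_000, 0.1, 0.05
    c = rng.binomial(1, 0.35, size=N)            # fixed 0/1 population values c_1, ..., c_N
    theta = c.mean()                             # the target theta = (1/N) sum c_j

    S = rng.binomial(1, pi, size=N)              # S_j = 1 means c_j is observed
    theta_hat = np.sum(c * S / pi) / N           # unbiased; uses only the observed c_j
    eps = np.sqrt(np.log(2 / alpha) / (2 * N * pi**2))
    print(theta, theta_hat, (theta_hat - eps, theta_hat + eps))   # interval contains theta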
6 Bayesian Computing
If θ = (θ1, ..., θp) is a vector then the posterior π(θ|X1, ..., Xn) is a multivariate distribution. If you are interested in one parameter, θ1 for example, then you need to find the marginal posterior:

\[
\pi(\theta_1 \mid X_1, \ldots, X_n) = \int \cdots \int \pi(\theta_1, \ldots, \theta_p \mid X_1, \ldots, X_n)\, d\theta_2 \cdots d\theta_p.
\]

Usually, this integral is intractable. In practice, we resort to Monte Carlo methods. These are discussed in 36/10-702.
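A tiny sketch (not part of the notes) of the Monte Carlo idea: if you can draw samples from the joint posterior, the draws of θ1 alone are draws from its marginal posterior, so no explicit integration is needed. The bivariate normal below is just a made-up stand-in for some joint posterior.

    # A sketch (not from the notes): marginalize a joint posterior by sampling it.
    import numpy as np

    rng = np.random.default_rng(7)
    mean = np.array([1.0, -0.5])
    cov = np.array([[1.0, 0.6],
                    [0.6, 2.0]])
    draws = rng.multivariate_normal(mean, cov, size=100_000)   # joint posterior samples

    theta1 = draws[:, 0]                                       # marginal posterior samples of theta_1
    print(theta1.mean(), np.quantile(theta1, [0.025, 0.975]))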
7 Bayesian Hypothesis Testing (Optional)
Bayesian hypothesis testing can be done as follows. Suppose that θ ∈ R and we want to test

\[
H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0.
\]

If we really believe that there is a positive prior probability that H0 is true then we can use a prior of the form

\[
a\,\delta_{\theta_0} + (1-a)\,g(\theta)
\]

where 0 < a < 1 is the prior probability that H0 is true and g is a smooth prior density over θ which represents our prior beliefs about θ when H0 is false. It follows from Bayes' theorem that

\[
P(\theta = \theta_0 \mid X_1, \ldots, X_n)
 = \frac{a\,p(X_1, \ldots, X_n \mid \theta_0)}{a\,p(X_1, \ldots, X_n \mid \theta_0) + (1-a)\int p(X_1, \ldots, X_n \mid \theta)\,g(\theta)\,d\theta}
 = \frac{a L(\theta_0)}{a L(\theta_0) + (1-a)m}
\]

where m = ∫ L(θ)g(θ)dθ. It can be shown that P(θ = θ0|X1, ..., Xn) is very sensitive to the choice of g. Sometimes, people like to summarize the test by using the Bayes factor B, which is defined to be the posterior odds divided by the prior odds:

\[
B = \frac{\text{posterior odds}}{\text{prior odds}}
\]

where

\[
\text{posterior odds} = \frac{P(\theta = \theta_0 \mid X_1, \ldots, X_n)}{1 - P(\theta = \theta_0 \mid X_1, \ldots, X_n)}
 = \frac{\frac{aL(\theta_0)}{aL(\theta_0)+(1-a)m}}{\frac{(1-a)m}{aL(\theta_0)+(1-a)m}}
 = \frac{aL(\theta_0)}{(1-a)m}
\]

and

\[
\text{prior odds} = \frac{P(\theta = \theta_0)}{P(\theta \neq \theta_0)} = \frac{a}{1-a}
\]

and hence

\[
B = \frac{L(\theta_0)}{m}.
\]
Example 7 Let X1, ..., Xn ∼ N(θ, 1). Let's test H0: θ = 0 versus H1: θ ≠ 0. Suppose we take g(θ) to be N(0, 1). Thus,

\[
g(\theta) = \frac{1}{\sqrt{2\pi}}\, e^{-\theta^2/2}.
\]

Let us further take a = 1/2. Then, after some tedious integration to compute m(X1, ..., Xn), we get

\[
P(\theta = 0 \mid X_1, \ldots, X_n) = \frac{L(0)}{L(0) + m}
 = \frac{e^{-n\bar X^2/2}}{e^{-n\bar X^2/2} + \frac{1}{\sqrt{n+1}}\, e^{-n\bar X^2/(2(n+1))}}.
\]

On the other hand, the p-value for the usual test is p = 2Φ(−√n |X̄|). Figure 3 shows the posterior of H0 and the p-value as a function of X̄ when n = 100. Note that they are very different. Unlike in estimation, in testing there is little agreement between Bayes and frequentist methods.
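A short sketch (not part of the notes) that evaluates the expressions above at a few values of X̄, with n = 100 and a = 1/2, alongside the p-value; it reproduces the contrast shown in Figure 3 (for example, at X̄ = 0.2 the p-value is about 0.046 while P(θ = 0|X1, ..., Xn) is above 0.5).

    # A sketch (not from the notes): posterior probability of H0, Bayes factor, and p-value.
    import numpy as np
    from scipy import stats

    n = 100
    for xbar in (0.0, 0.1, 0.2, 0.3):
        L0 = np.exp(-n * xbar**2 / 2)                            # L(0), common factors cancelled
        m = np.exp(-n * xbar**2 / (2 * (n + 1))) / np.sqrt(n + 1)
        post_H0 = L0 / (L0 + m)                                  # P(theta = 0 | data) with a = 1/2
        B = L0 / m                                               # Bayes factor
        pval = 2 * stats.norm.cdf(-np.sqrt(n) * abs(xbar))       # two-sided p-value
        print(xbar, post_H0, B, pval)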
8 Conclusion
Bayesian and frequentist inference are answering two different questions. Frequentist inference answers the question: How do I construct a procedure that has frequency guarantees?
Figure 3: Solid line: P(θ = 0|X1, ..., Xn) versus X̄. Dashed line: p-value versus X̄.

Bayesian inference answers the question: How do I update my subjective beliefs after I observe some data?

In parametric models, if n is large and the dimension of the model is fixed, Bayes and frequentist procedures will be similar. Otherwise, they can be quite different.