Faculty of Life Sciences
Frequentist and Bayesian statistics
Claus Ekstrøm
E-mail: [email protected]
Outline
1. Frequentists and Bayesians
   • What is a probability?
   • Interpretation of results / inference
2. Comparisons
3. Markov chain Monte Carlo
What is a probability? Two schools in statistics: frequentists and Bayesians.
Frequentist school
The school of Jerzy Neyman, Egon Pearson and Ronald Fisher.
Bayesian school
The “school” of Thomas Bayes:

      P(H | D) = P(D | H) P(H) / ∫ P(D | H) P(H) dH
Frequentists
Frequentists talk about probabilities in relation to experiments with a random component. The relative frequency of an event A is defined as

      P(A) = (number of outcomes consistent with A) / (number of experiments)

The probability of event A is the limiting relative frequency.

[Figure: relative frequency of A plotted against the number of experiments n, from 0 to 100.]
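To make the limiting relative frequency concrete, here is a minimal Python sketch (my illustration, not part of the slides): simulate repeated coin flips and track the running relative frequency of the event "heads".

import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000                                  # number of experiments
heads = rng.random(n) < 0.5               # event A: "heads", true probability 0.5
running_freq = np.cumsum(heads) / np.arange(1, n + 1)

# The relative frequency settles around 0.5 as n grows.
print(running_freq[[9, 99, 999]])         # after 10, 100 and 1000 experiments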
Frequentists — 2
The definition restricts the things we can assign probabilities to: what is the probability of there being life on Mars 100 billion years ago? We assume that there is an unknown but fixed underlying parameter, θ, for a population (e.g., the mean height of Danish men). Random variation (environmental factors, measurement errors, ...) means that each observation does not result in the true value.
The meta-experiment idea
Frequentists think of meta-experiments and consider the current dataset as a single realization from all possible datasets. Example observations: 167.2 cm, 175.5 cm, 187.7 cm, 182.0 cm.
Confidence intervals
Thus a frequentist believes that a population mean is real, but unknown, and unknowable, and can only be estimated from the data. Knowing the distribution of the sample mean, he constructs a confidence interval centered at the sample mean.
• Either the true mean is in the interval or it is not. We cannot say there is a 95% probability (long-run fraction having this characteristic) that the true mean is in this interval, because it is either already in it, or it is not.
• Reason: the true mean is a fixed value, which does not have a distribution.
• The sample mean does have a distribution! Thus we must use statements like “95% of similar intervals would contain the true mean, if each interval were constructed from a different random sample like this one.” (See the simulation sketch below.)
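The "long-run" reading of a 95% confidence interval can be checked by simulation. A minimal Python sketch, assuming a normal population with a known true mean (an illustration added here, not from the slides):

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
true_mean, sigma, n, reps = 180.0, 7.0, 25, 10_000   # hypothetical "height" population

covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, sigma, size=n)
    se = x.std(ddof=1) / np.sqrt(n)
    margin = stats.t.ppf(0.975, df=n - 1) * se        # half-width of a 95% t interval
    if x.mean() - margin <= true_mean <= x.mean() + margin:
        covered += 1

# Roughly 95% of the intervals contain the true mean; any single interval either does or does not.
print(covered / reps)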
Maximum likelihood
How will the frequentist estimate the parameter? Answer: maximum likelihood.

Basic idea
Our best estimate of the parameter(s) is the one(s) that makes our observed data most likely. We know what we have observed so far (our data). Our best “guess” is therefore to select the parameters that make our observations most likely. Binomial distribution:

      P(Y = y) = (n choose y) p^y (1 − p)^(n−y)
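A small Python sketch of the maximum likelihood idea for the binomial model (an added illustration; the data values are made up): the log-likelihood is evaluated over a grid of p and maximized, which recovers the estimate y/n.

import numpy as np
from scipy import stats

n, y = 10, 3                                   # e.g. 3 "successes" out of 10 trials
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = stats.binom.logpmf(y, n, p_grid)     # log P(Y = y) as a function of p

p_hat = p_grid[np.argmax(log_lik)]
print(p_hat)                                   # close to y / n = 0.3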
Bayesians
Each investigator is entitled to his/her personal belief ... the prior information. There are no fixed values for the parameters, but a distribution. Thumb tack pin pointing down:

[Figure: a prior distribution for θ, density plotted over θ from 0 to 1.]

All distributions are subjective. Yours is as good as mine. We can still talk about the mean, but it is the mean of my distribution. In many cases one tries to circumvent the subjectivity by using vague priors.
Credibility intervals
Bayesians have an altogether different world-view. They say that only the data are real. The population mean is an abstraction, and as such some values are more believable than others based on the data and their prior beliefs. The Bayesian constructs a credibility interval, centered near the sample mean but tempered by “prior” beliefs concerning the mean. Now the Bayesian can say what the frequentist cannot: “There is a 95% probability (degree of believability) that this interval contains the mean.”
Comparison

              Advantages                                      Disadvantages
Frequentist   Objective; calculations                         Confidence intervals (not quite the desired)
Bayesian      Credibility intervals (usually the desired);    Subjective; calculations
              complex models
In summary
• A frequentist is a person whose long-run ambition is to be wrong 5% of the time.
• A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.

“A frequentist uses impeccable logic to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully believe in.” (P. G. Hamer)
Jury duty
Example: speed of light
What is the speed of light in vacuum “really”? Results (m/s): 299792459.2, 299792460.0, 299792456.3, 299792458.1, 299792459.5
Example: frequentist solution
The average of our observations is an estimate of the true, fixed (but unknown) speed of light, θ̂ = 299792458.6. Conclusion: if we were to repeat this sequence of 5 measurements a large number of times, approximately 95% of the resulting estimates would lie within 1.83 m/s of the true speed of light. However, on this particular occasion, where I have already calculated my statistic, I have no clue how close I actually am to the true value, but I feel comfortable that I am doing okay because of certain properties that my estimator has on repeated uses.
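The numbers on this slide can be reproduced with a standard t-based interval; a short Python check (my addition, assuming the usual normal-theory interval is what is meant):

import numpy as np
from scipy import stats

x = np.array([299792459.2, 299792460.0, 299792456.3, 299792458.1, 299792459.5])

mean = x.mean()                                        # about 299792458.6
se = x.std(ddof=1) / np.sqrt(len(x))
margin = stats.t.ppf(0.975, df=len(x) - 1) * se        # about 1.83 m/s

print(round(mean, 1), round(margin, 2))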
Example: Bayesian solution
The observations are a fixed realization from the underlying distribution of the true speed of light.
1. “Guess” what the distribution of the speed of light is (the prior distribution).
2. Use Bayes’ theorem to modify/update the prior distribution based on the observed data.
3. The modified distribution is denoted the posterior distribution.
The posterior distribution holds the information about the true speed of light, and this distribution is entirely subjective. (A small sketch of such an update follows below.)
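A minimal Python sketch of steps 1–3 for this example (my illustration; the prior parameters and the measurement standard deviation are made-up assumptions): a conjugate normal prior is updated to a normal posterior.

import numpy as np

x = np.array([299792459.2, 299792460.0, 299792456.3, 299792458.1, 299792459.5])

# Step 1: "guess" a prior, theta ~ N(prior_mean, prior_sd^2)  (hypothetical values).
prior_mean, prior_sd = 299792000.0, 1000.0
sigma = 1.5                                   # assumed known measurement s.d.

# Step 2: Bayes' theorem; for a normal prior and normal likelihood the update is in closed form.
post_prec = 1.0 / prior_sd**2 + len(x) / sigma**2
post_var = 1.0 / post_prec
post_mean = post_var * (prior_mean / prior_sd**2 + x.sum() / sigma**2)

# Step 3: the posterior distribution of the speed of light.
print(post_mean, np.sqrt(post_var))           # close to the sample mean; the vague prior has little influence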
Markov Chain Monte Carlo
Having a likelihood does not necessarily make it easy to work with. In Bayesian statistics the posterior distribution contains all relevant information about the parameters. Statistical inference is often calculated from summaries (integrals)

      J = ∫ L(x) dx

However, these evaluations are not necessarily easy.
Bayesian modelling, Markov Chain Monte Carlo, Graphical Models
Søren Højsgaard
Department of Mathematical Sciences, Aalborg University, Denmark
August 13, 2012
Contents

1 Bayesian modeling
2 Inference
3 Bayesian models based on DAGs
   3.1 Example: Independent samples
   3.2 Example: Linear regression
   3.3 Example: Random regression model
4 Computations using Monte Carlo methods
   4.1 Rejection sampling
   4.2 Example: Rejection sampling
   4.3 Sampling importance resampling (SIR)*
   4.4 Markov Chain Monte Carlo methods
   4.5 The Metropolis–Hastings algorithm
   4.6 Special cases
   4.7 Example: Metropolis–Hastings algorithm
   4.8 Single component Metropolis–Hastings
   4.9 Gibbs sampler*
   4.10 Sampling in high dimensions – problems
5 Conditional independence
1 Bayesian modeling

• In a Bayesian setting, parameters are treated as random quantities on an equal footing with the random variables.
• The joint distribution of a parameter (vector) θ and data (vector) y is specified through a prior distribution π(θ) for θ and a conditional distribution p(y | θ) of the data for a fixed value of θ.
• This leads to the joint distribution

      p(y, θ) = p(y | θ) π(θ)

• The prior distribution π(θ) represents our knowledge (or uncertainty) about θ before data have been observed.
• After observing data y, the posterior distribution π∗(θ) of θ is obtained by conditioning on the data, which gives

      π∗(θ) = p(θ | y) = p(y | θ) π(θ) / p(y) ∝ L(θ) π(θ)

  where L(θ) = p(y | θ) is the likelihood and the marginal density p(y) = ∫ p(y | θ) π(θ) dθ is the normalizing constant.
2 Inference

• With respect to inference, we might be interested in the posterior mean of some function g(θ):

      E(g(θ) | π∗) = ∫ g(θ) π∗(θ) dθ

• However, usually π∗(θ) cannot be found analytically because the normalizing constant p(y) = ∫ p(y | θ) π(θ) dθ is intractable.
• In such cases one will often resort to sampling-based methods: if we can draw samples θ(1), ..., θ(N) from π∗(θ) we can do just as well:

      E(g(θ) | π∗) ≈ (1/N) Σ_i g(θ(i))

• The question is then how to draw samples from π∗(θ) when π∗(θ) is only known up to the normalizing constant.
• In some cases simple Monte Carlo methods will do (e.g. rejection sampling).
• Most often, however, we have to use Markov Chain Monte Carlo methods (e.g. the Metropolis–Hastings algorithm).
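A tiny Python sketch of the sampling-based approximation above (my own illustration, where the "posterior" is simply taken to be a Beta(4, 8) distribution so that the exact answer is available for comparison):

import numpy as np

rng = np.random.default_rng(seed=3)

theta = rng.beta(4, 8, size=100_000)     # pretend these are draws from the posterior
g = lambda t: t * (1.0 - t)              # some function of the parameter

print(g(theta).mean())                   # Monte Carlo estimate of E(g(theta) | posterior)

# Exact value for Beta(a, b): E[theta] - E[theta^2] = a/(a+b) - a(a+1)/((a+b)(a+b+1))
a, b = 4, 8
print(a / (a + b) - a * (a + 1) / ((a + b) * (a + b + 1)))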
3 Bayesian models based on DAGs

The graph is a directed acyclic graph (DAG) where the nodes represent random quantities.

[Figure 1: Directed acyclic graph with nodes xA, xT, xS, xL, xB, xE, xX, xD. Nodes represent random quantities.]

A joint distribution for x = (xA, xS, xT, xL, xB, xE, xX, xD) can be specified as a product of conditional distributions,

      p(x) = p(xA) p(xT | xA) p(xS) p(xL | xS) p(xB | xS) p(xE | xT, xL) p(xX | xE) p(xD | xE, xB)

• Notice: the specification has the form

      p(x) = ∏_v p(x_v | x_pa(v))

  where pa(v) denotes the parents of v in the directed acyclic graph.
• Hence, we define a complex multivariate distribution by multiplying conditional univariate densities.
• Notice also that we use x here as a generic symbol for a random quantity rather than using y to represent data and θ to represent parameters.
• This makes sense in a Bayesian setting where there is no conceptual difference between parameters and data.
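To illustrate the factorization p(x) = ∏_v p(x_v | x_pa(v)), here is a hypothetical Python sketch (not from the notes) that draws joint samples from a small three-node DAG A → B → C by sampling each node given its parents in topological order.

import numpy as np

rng = np.random.default_rng(seed=4)

def sample_joint():
    """One ancestral sample from the DAG A -> B -> C (all nodes binary)."""
    a = rng.random() < 0.3                      # p(A = 1) = 0.3
    b = rng.random() < (0.8 if a else 0.1)      # p(B = 1 | A)
    c = rng.random() < (0.9 if b else 0.2)      # p(C = 1 | B)
    return int(a), int(b), int(c)

samples = np.array([sample_joint() for _ in range(50_000)])
print(samples.mean(axis=0))                     # marginal probabilities of A, B and C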
3.1 Example: Independent samples

Joint distribution:

      p(x1, ..., x5, θ) = π(θ) ∏_i p(xi | θ)

For example we may have: xi | θ ∼ N(θ, 1) and θ ∼ N(0, 1).
[Figure 2: Representation of a Bayesian model for simple sampling: θ with arrows to x1, ..., x5 (left); the same model drawn with a plate over xν, ν = 1, ..., 5, which allows a more compact representation (right).]

[Figure 3: Graphical representations of a traditional linear regression model with unknown intercept α, slope β, and variance σ², for data (xi, yi), i = 1, ..., N. In the representation to the left, the means µi have been represented explicitly.]
3.2 Example: Linear regression

Regression model:

      Yi ∼ N(µi, σ²),   µi = α + β xi,   i = 1, ..., N

In this example, the parameters are θ = (α, β, σ). To complete the model specification we therefore need to specify a prior π(θ). The xi's are just explanatory variables and the µi's are deterministic functions of their parents.
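A short Python sketch of the generative reading of this model (my addition; the priors for α, β and σ are arbitrary choices, since the notes leave π(θ) unspecified): draw θ from a prior and then simulate data y given θ and the fixed xi's.

import numpy as np

rng = np.random.default_rng(seed=5)

N = 20
x = np.linspace(0.0, 10.0, N)            # fixed explanatory variable

# Hypothetical priors for the parameters.
alpha = rng.normal(0.0, 10.0)
beta = rng.normal(0.0, 2.0)
sigma = abs(rng.normal(0.0, 1.0)) + 0.1  # crude way to get a positive draw for sigma

mu = alpha + beta * x                    # deterministic nodes mu_i = alpha + beta * x_i
y = rng.normal(mu, sigma)                # y_i ~ N(mu_i, sigma^2)

print(alpha, beta, sigma)
print(y[:5])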
3.3 Example: Random regression model

Weights have been measured weekly for 30 rats over 5 weeks. Observations yij are the weights of rat i at age xj. Random regression model

      yij ∼ N(αi + βi (xj − x̄), σc²)

and

      αi ∼ N(αc, σα²),   βi ∼ N(βc, σβ²)

[Figure 4: Graphical representation of a random coefficient regression model for the growth of rats, with nodes αc, σα, α0, βc, σβ, αi, βi, σc, xj and yij, and plates over j = 1, ..., 5 and i = 1, ..., N.]

4 Computations using Monte Carlo methods
Consider a random variable (vector) X with density / probability mass function p(x). We shall call p(x) the target distribution (from which we want to sample). In many real world applications
• we cannot directly draw samples from p;
• p is only known up to a constant of proportionality, that is, p(x) = k(x)/c where k() is known and the normalizing constant c is unknown.
We reserve h(x) for a proposal distribution, which is a distribution from which we can draw samples.
4.1 Rejection sampling

Let p(x) = k(x)/c be a density where k() is known and c is unknown. Let h(x) be a proposal distribution from which we can draw samples. Suppose we can find a constant M such that k(x) < M h(x) for all x. The algorithm is then
1. Draw a sample x ∼ h(). Draw u ∼ U(0, 1).
2. Set α = k(x) / (M h(x)).
3. If u < α, accept x.
The accepted values x1, ..., xN are a random sample from p(·). Notice:
• It is tricky to choose a good proposal distribution h(). It should have support at least as large as p() and preferably heavier tails than p().
• It is desirable to choose M as small as possible. In practice this is difficult, so one tends to make a conservative (large) choice of M, whereby only few proposed values are accepted. Thus it is difficult to make rejection sampling efficient.
(A runnable sketch of the algorithm is given below.)
4.2 Example: Rejection sampling

[Figure: plot of the kernel k(x) for the rejection sampling example.]
MethComp(tJox)$VarComp

            s.d.
Method      IxR         MxI         res
CO          0.2575410   0.1811183   0.1243838
pulse       0.2232714   0.1565227   0.2031203
Transformation
If the data do not exhibit:
• a linear relationship between the methods
• constant variation across the range of measurements
then transform by some function, e.g. the logit, do the analysis on the transformed scale, and report on the original scale.
[Figure: scatter plot of CO against pulse (both on a 0–100 scale) with the two regression lines CO = −6.00 + 1.12·pulse (5.07) and pulse = 5.38 + 0.90·CO (4.55).]
[Figure: plot of CO − pulse against (CO + pulse)/2 (0–100 scale), with the conversion lines CO = −6.00 + 1.12·pulse (5.07) and pulse = 5.38 + 0.90·CO (4.55) and the fitted difference line CO − pulse = −5.68 + 0.11·(CO + pulse)/2 (4.80).]