Computer Intensive Statistical Methods

Martin Sköld

Lecture notes for FMS091/MAS221
October 2005, 2nd printing August 2006

Centre for Mathematical Sciences
Mathematical Statistics

Contents

Part I: Simulation and Monte-Carlo Integration

1 Introduction
  1.1 Scope of course
  1.2 Computers as inference machines
  1.3 References
  1.4 Acknowledgement

2 Simulation and Monte-Carlo integration
  2.1 Introduction
  2.2 Issues in simulation
  2.3 Buffon's Needle
  2.4 Raw ingredients

3 Simulating from specified distributions
  3.1 Transforming uniforms
  3.2 Transformation methods
  3.3 Rejection sampling
  3.4 Conditional methods

4 Monte-Carlo integration
  4.1 Importance sampling
    4.1.1 Unknown constant of proportionality

5 Markov Chain Monte-Carlo
  5.1 Markov chains with continuous state-space
  5.2 Markov chain Monte-Carlo integration
    5.2.1 Burn-in
    5.2.2 After burn-in
  5.3 The Metropolis-Hastings algorithm
  5.4 The Gibbs-sampler
  5.5 Independence proposal
  5.6 Random walk proposal
    5.6.1 Multiplicative random walk
  5.7 Hybrid strategies

Part II: Applications to classical inference

6 Statistical models
  6.1 Functions of data
  6.2 Point estimation
  6.3 Asymptotic results
  6.4 Interval estimates
  6.5 Bias
  6.6 Deriving the error distribution

7 The Bootstrap
  7.1 The plug-in principle for finding estimators
  7.2 The plug-in principle for evaluating estimators
  7.3 The non-parametric Bootstrap for i.i.d. observations
  7.4 Bootstrapping parametric models
  7.5 Bootstrap for semi-parametric models
    7.5.1 Regression models
    7.5.2 Time-series

8 Testing statistical hypotheses
  8.1 Testing simple hypotheses
  8.2 Testing composite hypotheses
    8.2.1 Pivot tests
    8.2.2 Conditional tests
    8.2.3 Bootstrap tests

9 Missing data models

Part III: Applications to Bayesian inference

10 Bayesian statistics
  10.1 Bayesian statistical models
    10.1.1 The prior distribution
    10.1.2 The posterior distribution
  10.2 Choosing the prior
    10.2.1 Conjugate priors
    10.2.2 Improper priors and representing ignorance
  10.3 Hierarchical models

11 Bayesian computation
  11.1 Using the Metropolis-Hastings algorithm
    11.1.1 Predicting rain
  11.2 Using the Gibbs-sampler
    11.2.1 A Poisson change-point problem
    11.2.2 A regression model
    11.2.3 Missing data models

Part I

Simulation and Monte-Carlo Integration


Chapter 1

Introduction

1.1 Scope of course

The term 'computer intensive methods' means different things to different people. It is also a dynamic subject: what requires intensive computing today may be solvable with a pocket calculator tomorrow. Not so long ago, the calculation of normal probabilities to reasonable accuracy would have required considerable CPU time. An initial classification of computer intensive methods as applied to statistics is the following:

1. Computers for graphical data exploration.
2. Computers for data modelling.
3. Computers for inference.

There is obviously some overlap in these three, but in this course I intend to focus mostly on the third. That is, we shall aim at understanding computer techniques which require innovative algorithms to apply standard inferences. I see two roles for this type of course. The first is to gain some understanding and knowledge of the techniques and tools which are available (so long as you've got the computing power). The second is that many of the techniques (if not all of them) are themselves clever applications or interpretations of the interplay between probability and statistics. So, understanding the principles behind the different algorithms can often lead to a better understanding of inference, probability and statistics generally. In short, the techniques are not just a means to an end: they have their own intrinsic value as statistical exercises.

This is not a course on computing itself. We won't get into the details of programming. Furthermore, this is not a course which will deal with the specialised statistical packages often used in statistical computing. All the examples will be handled using simple Matlab functions, far from the most efficient way of implementing the various techniques. It is important to recognise that high-dimensional complex problems do require more efficient programming (commonly in C or Fortran). However, the emphasis of this course is to illustrate the various methods and their application on relatively simple examples. A basic familiarity with Matlab will be assumed.

1.2 Computers as inference machines

It is something of a cliché to point out that computers have revolutionized all aspects of statistics. In the context of inference there have really been two substantial impacts. The first has been the freedom to make inferences without the catalogue of arbitrary (and often blatantly inappropriate) assumptions which standard techniques necessitate in order to obtain analytic solutions: Normality, linearity, independence, etc. The second is the ability to apply standard types of model to situations of greater data complexity: missing data, censored data, latent variable structures.

1.3 References

I've stolen quite heavily from the following books for this course:

- Stochastic Simulation, B. Ripley.
- An Introduction to the Bootstrap, B. Efron and R. Tibshirani.
- Tools for Statistical Inference, M. Tanner.

By sticking closely to specific texts it should be easy to follow up any techniques that you want to look at in greater depth. Within this course I'll be concentrating very much on developing the techniques themselves rather than elaborating the mathematical and statistical niceties. You're recommended to do this for yourselves if interested.

1.4 Acknowledgement

Much of these notes is adapted from previous course notes of one form or another. Thanks are due especially to Stuart Coles.

Gareth Roberts, 2002

The original notes, used in various British universities, have been extensively revised and extended. Some parts and many examples remain untouched; they can probably be recognised by their varied and proper use of the English language. For the reader who would like more in-depth coverage, the following advanced books are recommended:

- Monte Carlo Statistical Methods, 2nd ed. (2005), by C.P. Robert and G. Casella. This is the most up-to-date and extensive book on Monte Carlo methods available, with a focus on Markov Chain Monte-Carlo and problems in Bayesian statistics. Does not cover bootstrap methods.

- Bootstrap Methods and their Application (1997), by A.C. Davison and D.V. Hinkley. In-depth coverage of bootstrap methods with a focus on applications rather than theory.

For the theoretical statistical background, the choice is more difficult. You might want to try Statistical Inference, 2nd ed. (2001), by G. Casella and R.L. Berger, which is the book used in the course MAS207 and contains an introduction to both frequentist and Bayesian statistical theory.

Martin Sköld, 2005


Chapter 2

Simulation and Monte-Carlo integration

2.1 Introduction

In this chapter we look at different techniques for simulating from distributions and stochastic processes. Unlike all subsequent chapters we won't look much at applications here, but suffice it to say that the applications of simulation are as varied as the subject of statistics itself. In any situation where we have a statistical model, simulating from that model generates realizations which can be analyzed as a means of understanding the properties of that model. In subsequent chapters we will also see how simulation can be used as a basic ingredient for a variety of approaches to inference.

2.2 Issues in simulation

Whatever the application, the role of simulation is to generate data which have (to all intents and purposes) the statistical properties of some specified model. This generates two questions:

1. How to do it; and
2. How to do it efficiently.

To some extent, just doing it is the priority, since many applications are sufficiently fast for even inefficient routines to be acceptably quick. On the other hand, efficient design of simulation can add insight into the statistical model itself, in addition to CPU savings. We'll illustrate the idea simply with a well-known example.

2.3 Buffon's Needle

We'll start with a simulation experiment which has intrinsically nothing to do with computers. Perhaps the most famous simulation experiment is Buffon's needle, designed to calculate (not very efficiently) an estimate of pi. There's nothing very sophisticated about this experiment, but for me I really like the 'mystique' of being able to trick nature into giving us an estimate of pi. There are also a number of ways the experiment can be improved on to give better estimates, which will highlight the general principle of designing simulated experiments to achieve optimal accuracy in the sense of minimizing statistical variability.

Buffon's original experiment is as follows. Imagine a grid of parallel lines of spacing $d$, on which we randomly drop a needle of length $l$, with $l \le d$. We repeat this experiment $n$ times, and count $R$, the number of times the needle intersects a line. Denoting $\rho = l/d$ and $\phi = 1/\pi$, an estimate of $\phi$ is
$$\hat\phi_0 = \frac{\hat p}{2\rho}, \qquad \hat p = R/n.$$
Thus, $\hat\pi_0 = 1/\hat\phi_0 = 2\rho/\hat p$ estimates $\pi$. The rationale behind this is that if we let $x$ be the distance from the centre of the needle to the lower grid line, and $\theta$ be the angle with the horizontal, then under the assumption of random needle throwing we'd have $x \sim U[0,d]$ and $\theta \sim U[0,\pi]$. Thus
$$p = \Pr(\text{needle intersects grid}) = \frac{1}{\pi}\int_0^\pi \Pr(\text{needle intersects}\mid\theta=\phi)\,d\phi = \frac{1}{\pi}\int_0^\pi \frac{2}{d}\cdot\frac{l}{2}\sin\phi\,d\phi = \frac{2l}{\pi d}.$$

A natural question is how to optimise the relative sizes of $l$ and $d$. To address this we need to consider the variability of the estimator $\hat\phi_0$. Now, $R \sim \mathrm{Bin}(n,p)$, so $\mathrm{Var}(\hat p) = p(1-p)/n$. Thus
$$\mathrm{Var}(\hat\phi_0) = \frac{2\rho\phi(1-2\rho\phi)}{4\rho^2 n} = \frac{\phi^2(1/(2\rho\phi)-1)}{n},$$
which is minimized (subject to $\rho \le 1$) when $\rho = 1$. That is, we should set $l = d$ to optimize efficiency. Then $\hat\phi_0 = \hat p/2$, with $\mathrm{Var}(\hat\phi_0) = (\phi/2 - \phi^2)/n$. It follows that $\mathrm{Var}(\hat\pi_0) \approx \pi^4\,\mathrm{Var}(\hat\phi_0) \approx 5.63/n$.

Figure 2.1 gives 2 realisations of Buffon's experiment, based on 5000 simulations each. The figures, together with an estimate pihat of pi, can be produced in Matlab by

pihat=buf(5000);

where the m-file buf.m contains the code

function pihat=buf(n)
x=rand(1,n);
theta=rand(1,n)*pi;
I=(cos(theta)>x);
R=cumsum(I);
phat=R./(1:n);
if R(n)>0
  plot(find(phat>0),1./phat(phat>0))
  xlabel('number of simulations')
  ylabel('estimate of pi')
  pihat=n/R(n);
else
  disp('no successes')
  pihat=0;
end

[Figure 2.1: Two sequences of realisations of Buffon's experiment; running estimates of pi against the number of simulations.]

Thus I've used the computer to simulate the physical simulations. You might like to check why this code works. There is a catalogue of modifications which you can use which might (or might not) improve the efficiency of this experiment. These include:

1. Using a grid of rectangles or squares (which is best?) and basing the estimate on the number of intersections with either or both horizontal and vertical lines.
2. Using a cross instead of a needle.
3. Using a needle of length longer than the grid separation.

So, just to re-iterate, the point is that simulation can be used to answer interesting problems, but that careful design may be needed to achieve even moderate efficiency.


2.4 Raw ingredients

The raw material for any simulation exercise is random digits: transformation or other types of manipulation can then be applied to build simulations of more complex distributions or systems. So, how can random digits be generated?

It should be recognised that any algorithmic attempt to mimic randomness is just that: a mimic. By definition, if the sequence generated is deterministic then it isn't random. Thus, the trick is to use algorithms which generate sequences of numbers that would pass all the tests of randomness (from the required distribution or process) despite their deterministic derivation. The most common technique is to use a congruential generator. This generates a sequence of integers via the algorithm
$$x_i = a x_{i-1} \pmod{M} \qquad (2.1)$$
for suitable choices of $a$ and $M$. Dividing this sequence by $M$ gives a sequence $u_i$ which are regarded as realisations from the uniform $U[0,1]$ distribution. Problems can arise by using inappropriate choices of $a$ and $M$. We won't worry about this issue here, as any decent statistics package should have had its random number generator checked pretty thoroughly. The point worth remembering, though, is that computer generated random numbers aren't random at all, but that (hopefully) they look random enough for that not to matter. In subsequent sections, then, we'll take as axiomatic the fact that we can generate a sequence of numbers $u_1, u_2, \ldots, u_n$ which may be regarded as $n$ independent realisations from the $U[0,1]$ distribution.
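As an illustration (my addition, not from the original notes), here is a minimal congruential generator of the form (2.1), using the classical Lewis-Goodman-Miller constants a = 7^5 and M = 2^31 - 1; the function name lcgdemo is an illustrative choice, and in practice one should of course rely on the built-in generator rand.

function u=lcgdemo(n,seed)
% Minimal multiplicative congruential generator x_i = a*x_{i-1} (mod M),
% with the classical constants a=7^5, M=2^31-1 (Lewis-Goodman-Miller).
% The products a*x stay below 2^53, so they are exact in double precision.
a=16807; M=2^31-1;
x=zeros(1,n);
x(1)=mod(a*seed,M);
for i=2:n
  x(i)=mod(a*x(i-1),M);
end
u=x/M;    % scale the integers to (approximate) U[0,1] draws

For example, u=lcgdemo(1000,1); hist(u) should produce a roughly flat histogram.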

Chapter 3

Simulating from specified distributions

In this chapter we look at ways of simulating data from a specified distribution function $F$, on the basis of a simulated sample $u_1, u_2, \ldots, u_n$ from the distribution $U[0,1]$.

3.1 Transforming uniforms

We start with the case of constructing a draw $x$ from a random variable $X \in \mathbb{R}$ with a continuous distribution $F$ on the basis of a single $u$ from $U[0,1]$. It is natural to try a simple transformation $x = h(u)$, but how should we choose $h$? Let's assume $h$ is increasing with inverse $h^{-1}: \mathbb{R} \mapsto [0,1]$. The requirement is now that
$$F(v) = P(X \le v) = P(h(U) \le v) = P(h^{-1}(h(U)) \le h^{-1}(v)) = P(U \le h^{-1}(v)) = h^{-1}(v),$$
for all $v \in \mathbb{R}$, where in the last step we used that the distribution function of the $U[0,1]$ distribution equals $P(U \le u) = u$, $u \in [0,1]$. The conclusion is clear: we should choose $h = F^{-1}$. If $F$ is not one-to-one, as is the case for discrete random variables, the above argument remains valid if we define the inverse as
$$F^{-1}(u) = \inf\{x;\, F(x) \ge u\}. \qquad (3.1)$$
The resulting algorithm for drawing from $F$ is the Inversion Method:

Algorithm 3.1 (The Inversion Method).
1. Draw $u$ from $U[0,1]$.
2. $x = F^{-1}(u)$ can now be regarded as a draw from $F$.

16 CHAPTER 3. SIMULATING FROM SPECIFIED DISTRIBUTIONS Figure 3.1 illustrates how this works. For example, to simulate from the 1

0.9

0.8

0.7 F

u

0.6

0.5

0.4

0.3 U 0.2

0.1

0 −4

−3

−2

−1

X

0 x

1

2

3

4

Figure 3.1: Simulation by inversion; the random variable X = F −1 (U ) will have distribution F if U is uniformly distributed on [0, 1]. exponential distribution we have F (x) = 1 − exp(−λx), so F −1 (u) = −λ−1 log(1 − u). Thus with u=rand(1,n); x=-(log(1-u))/lambda; we can simulate n independent values from the exponential distribution with parameter lambda. Figure 3.2 shows a histogram of 1000 standard (λ = 1) exponential variates simulated with this routine. For discrete distributions, the procedure then simply amounts to searching through a table of the distribution function. For example, the distribution function of the Poisson(2) distribution is x F(x) -----------0 0.1353353 1 0.4060058 2 0.6766764 3 0.8571235 4 0.9473470 5 0.9834364 6 0.9954662

17

0

100

frequency 200

300

3.1. TRANSFORMING UNIFORMS

0

2

4

6

x

Figure 3.2: Histogram of 1000 simulated unit exponential variates

7 8 9 10

0.9989033 0.9997626 0.9999535 0.9999917

so, we generate a sequence of standard uniforms u1 , u2 , . . . , un and for each ui obtain a Poisson(2) variate xi where F (xi − 1) < ui ≤ F (xi ). So, for example, if u1 = 0.7352 then x1 = 3. The limitation on the efficiency of this procedure is due to the necessity of searching through the table, and there are various schemes to optimize this aspect. Returning to the continuous case, it may seem that the inversion method is sufficiently universal to be the only method required. In fact, there are many situations in which the inversion method is either (or both) complicated to program or excessively inefficient to run. The inversion method is only really useful if the inverse distribution function is easy to program and compute. This is not the case, for example, with the Normal distribution function for which the inverse distribution function, Φ−1 , is not available analytically and slow to evaluate numerically. An even more serious limitation is that the method only applies for generating draws from univariate random variables. To deal with such cases, we turn to a variety of alternative schemes.
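As a sketch of the table-search idea (my addition, using the running-product recursion P(X=x) = P(X=x-1)*2/x for Poisson(2) probabilities, so no stored table is needed; the function name poiss2inv is an illustrative choice):

function x=poiss2inv(u)
% Inversion for the Poisson(2) distribution by sequential search:
% accumulate the distribution function F until it first exceeds u.
x=0;
p=exp(-2);   % P(X=0)
F=p;
while u>F
  x=x+1;
  p=p*2/x;   % P(X=x) = P(X=x-1)*2/x for Poisson(2)
  F=F+p;
end

For instance, poiss2inv(0.7352) returns 3, in agreement with the table above.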

3.2 Transformation methods

The inversion method is a special case of more general transformation methods. The following theorem can be used to derive the distribution of $Z = h(X)$ for a more general class of real-valued random variables $X$.

Theorem 3.1 (Transformation theorem). Let $X \in \mathcal{X} \subseteq \mathbb{R}$ have a continuous density $f$ and $h$ a function with differentiable inverse $g = h^{-1}$. Then the random variable $Z = h(X) \in \mathbb{R}$ has density
$$f(g(z))|g'(z)|, \qquad (3.2)$$
for $z \in h(\mathcal{X})$ and zero elsewhere.

Proof. The proof is a direct application of the change-of-variable theorem for integrals. Note that two random variables $X$ and $Z$ have the same distribution iff $P(X \in A) = P(Z \in A)$ for all sets $A$.

This device is used extensively in simulation; for example, when we want to generate an $N(\mu, \sigma^2)$ variate $y$, it is common to first draw $x$ from $N(0,1)$ and then set $y = \sigma x + \mu$. Use Theorem 3.1 to show that this works. Sums of random variables can also be useful in creating new variables. Recall that

Theorem 3.2. Let $X \in \mathbb{R}$ and $Y \in \mathbb{R}$ be independent with densities $f$ and $g$ respectively; then the density of $Z = X + Y$ equals
$$f*g(z) = \int f(z-t)g(t)\,dt.$$

This can be used to generate Gamma random variables. A random variable $X$ has a Gamma$(a,1)$ distribution if its density is proportional to $x^{a-1}\exp(-x)$, $x > 0$. Using Theorem 3.2 we can show that if $X$ and $Y$ are independent Gamma$(a,1)$ and Gamma$(a',1)$ respectively, then $Z = X + Y$ has a Gamma$(a+a',1)$ distribution. Since Gamma$(1,1)$ (i.e. Exponential(1)) variables are easily generated by inversion, a Gamma$(k,1)$ variable $z$, for integer values $k$, is generated by
$$z = \sum_{i=1}^{k} -\log(u_i) \qquad (3.3)$$
using independent draws of uniforms $u_1, \ldots, u_k$. As an alternative, we can use a combination of Theorems 3.1 and 3.2 to show that
$$z = \sum_{i=1}^{2k} x_i^2/2 \qquad (3.4)$$
is a draw from the same distribution if $x_1, \ldots, x_{2k}$ are independent standard Normal draws.

Example 3.1 (The Box-Muller transformation). This is a special trick to simulate from the Normal distribution. In fact it produces two independent variates in one go. Let $u_1, u_2$ be two independently sampled $U[0,1]$ variables; then it can be shown that
$$x_1 = \sqrt{-2\log(u_2)}\cos(2\pi u_1) \quad\text{and}\quad x_2 = \sqrt{-2\log(u_2)}\sin(2\pi u_1)$$
are two independent $N(0,1)$ variables.
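A minimal Matlab sketch of the Box-Muller transformation (my illustration; the built-in randn is of course preferable in practice):

function x=boxmuller(n)
% Generate 2n independent N(0,1) draws from 2n uniforms
% via the Box-Muller transformation of Example 3.1.
u1=rand(1,n);
u2=rand(1,n);
r=sqrt(-2*log(u2));                      % common radius
x=[r.*cos(2*pi*u1), r.*sin(2*pi*u1)];    % two independent N(0,1) batches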

Below we give the multivariate version of Theorem 3.1.

Theorem 3.3 (Multivariate transformation theorem). Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$ have a continuous density $f$ and $h: \mathcal{X} \mapsto \mathbb{R}^d$ a function with differentiable inverse $g = h^{-1}$. Further, write $J(z)$ for the determinant of the Jacobian matrix of $g = (g_1, \ldots, g_d)$,
$$J(z) = \begin{vmatrix} dg_1(z)/dz_1 & \cdots & dg_1(z)/dz_d \\ \vdots & \ddots & \vdots \\ dg_d(z)/dz_1 & \cdots & dg_d(z)/dz_d \end{vmatrix}. \qquad (3.5)$$
Then the random variable $Z = h(X) \in \mathbb{R}^d$ has density
$$f(g(z))|J(z)|, \qquad (3.6)$$
for $z \in h(\mathcal{X})$ and zero elsewhere.

Example 3.2 (Choleski method for multivariate Normals). The Choleski method is a convenient way to draw a vector $z$ from the multivariate Normal distribution $N_n(0, \Sigma)$ based on a vector of $n$ independent $N(0,1)$ draws $x = (x_1, x_2, \ldots, x_n)$. Choleski decomposition is a method for computing a matrix $C$ such that $CC^T = \Sigma$; in Matlab the command is chol. We will show that $z = Cx$ has the desired distribution. The density of $X$ is $f(x) = (2\pi)^{-n/2}\exp(-x^T x/2)$ and the Jacobian of the inverse transformation, $x = C^{-1}z$, equals $J(z) = |C^{-1}| = |C|^{-1} = |\Sigma|^{-1/2}$. Hence, according to Theorem 3.3, the density of $Z$ equals
$$f(C^{-1}z)|\Sigma|^{-1/2} = (2\pi)^{-n/2}\exp(-(C^{-1}z)^T(C^{-1}z)/2)|\Sigma|^{-1/2} = (2\pi)^{-n/2}\exp(-z^T\Sigma^{-1}z/2)|\Sigma|^{-1/2},$$
which we recognise as the density of an $N_n(0, \Sigma)$ distribution. Of course, $z + \mu$, $\mu \in \mathbb{R}^n$, is then a draw from $N_n(\mu, \Sigma)$.
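A short Matlab sketch of the Choleski method (my illustration; note that Matlab's chol returns an upper-triangular matrix R with R'R = Sigma, so it is transposed here to get C):

function z=mvnchol(mu,Sigma)
% Draw one vector from N_n(mu,Sigma) via the Choleski method of Example 3.2.
n=length(mu);
C=chol(Sigma)';    % lower-triangular C with C*C' = Sigma
x=randn(n,1);      % n independent N(0,1) draws
z=mu(:)+C*x;

For example, z=mvnchol([0;0],[1 .9;.9 1]) draws from a bivariate Normal with correlation 0.9.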

3.3 Rejection sampling

The idea in rejection sampling is to simulate from one distribution which is easy to simulate from, but then to only accept the simulated value with some probability $p$. By choosing $p$ correctly, we can ensure that the sequence of accepted simulated values is from the desired distribution. The method is based on the following theorem:

Theorem 3.4. Let $f$ be the density function of a random variable on $\mathbb{R}^d$ and let $Z \in \mathbb{R}^{d+1}$ be a random variable that is uniformly distributed on the set $A = \{z;\, 0 \le z_{d+1} \le M f(z_1, \ldots, z_d)\}$ for an arbitrary constant $M > 0$. Then the vector $(Z_1, \ldots, Z_d)$ has density $f$.

Proof. First note that
$$\int_A dz = \int_{\mathbb{R}^d}\Big(\int_0^{Mf(z_1,\ldots,z_d)} dz_{d+1}\Big)\,dz_1\cdots dz_d = \int Mf(z_1,\ldots,z_d)\,dz_1\cdots dz_d = M.$$
Hence, $Z$ has density $1/M$ on $A$. Similarly, with $B \subseteq \mathbb{R}^d$, we have
$$P((Z_1,\ldots,Z_d)\in B) = \int_{\{z \in A\}\cap\{(z_1,\ldots,z_d)\in B\}} M^{-1}\,dz = M^{-1}\int_B Mf(z_1,\ldots,z_d)\,dz_1\cdots dz_d = \int_B f(z_1,\ldots,z_d)\,dz_1\cdots dz_d,$$
and this is exactly what we needed to show.

The conclusion of the above theorem is that we can construct a draw from $f$ by drawing uniformly on an appropriate set and then dropping the last coordinate of the drawn vector. Note that the converse of the above theorem is also true, i.e. if we draw $(z_1, \ldots, z_d)$ from $f$ and then $z_{d+1}$ from $U(0, Mf(z_1, \ldots, z_d))$, then $(z_1, \ldots, z_{d+1})$ will be a draw from the uniform distribution on $A = \{z;\, 0 \le z_{d+1} \le Mf(z_1, \ldots, z_d)\}$. The question is how to draw uniformly on $A$ without having to draw from $f$ (since this was our problem in the first place); the rejection method solves this by drawing uniformly on a larger set $B \supset A$ and rejecting the draws that end up in $B\backslash A$. A natural choice of $B$ is $B = \{z;\, 0 \le z_{d+1} \le Kg(z_1, \ldots, z_d)\}$, where $g$ is another density, the proposal density, that is easy to draw from and satisfies $Mf \le Kg$.

Algorithm 3.2 (The Rejection Method).

1. Draw $(z_1, \ldots, z_d)$ from a density $g$ that satisfies $Mf \le Kg$.
2. Draw $z_{d+1}$ from $U(0, Kg(z_1, \ldots, z_d))$.
3. Repeat steps 1-2 until $z_{d+1} \le Mf(z_1, \ldots, z_d)$.
4. $x = (z_1, \ldots, z_d)$ can now be regarded as a draw from $f$.

It might seem superfluous to have two constants $M$ and $K$ in the algorithm. Indeed, the rejection method is usually presented with $M = 1$. We include $M$ here to illustrate the fact that you only need to know the density up to a constant of proportionality (i.e. you know $Mf$ but not $M$ or $f$). This situation is very common, especially in applications to Bayesian statistics.

The efficiency of the rejection method depends on how many points are rejected, which in turn depends on how close $Kg$ is to $Mf$. The probability of accepting a particular draw $(z_1, \ldots, z_d)$ from $g$ equals
$$P(Z_{d+1} \le Mf(Z_1,\ldots,Z_d)) = \int\Big(\int_0^{Mf(z_1,\ldots,z_d)} (Kg(z_1,\ldots,z_d))^{-1}\,dz_{d+1}\Big) g(z_1,\ldots,z_d)\,dz_1\cdots dz_d = \frac{M}{K}\int f(z_1,\ldots,z_d)\,dz_1\cdots dz_d = \frac{M}{K}.$$

For large $d$ it becomes increasingly difficult to find $g$ and $K$ such that $M/K$ is large enough for the algorithm to be useful. Hence, while the rejection method is not strictly univariate like the inversion method, it tends to be practically useful only for small $d$. The technique is illustrated in Figure 3.3.

[Figure 3.3: Simulation by rejection sampling from a U[0,1] distribution (here M = 1 and K = 1.5); the x-coordinates of the points in the right panel constitute a sample with density f.]

As an example, consider the distribution with density
$$f(x) \propto x^2 e^{-x}, \qquad 0 \le x \le 1, \qquad (3.7)$$
a truncated gamma distribution. Then, since $f(x) \le e^{-x}$ everywhere, we can set $g(x) = \exp(-x)$ and so simulate from an exponential distribution, rejecting according to the above algorithm. Figure 3.4 shows both $f(x)$ and $g(x)$. Clearly in this case the envelope is very poor, so the routine is highly inefficient (though statistically correct). Applying this to generate a sample of 100 data by

[Figure 3.4: Scaled density and envelope exp(-x).]

[x,m]=rejsim(100);
hist(x);

using the following code

function [x,m]=rejsim(n)
m=0;
for i=1:n
  acc=0;
  while(~acc)
    m=m+1;
    z1=-log(rand);                       % proposal from g(x)=exp(-x)
    z2=rand*exp(-z1);                    % uniform height on (0, g(z1))
    if (z2<=z1^2*exp(-z1)) && (z1<=1)    % accept if z2 <= Mf(z1)
      acc=1;
      x(i)=z1;
    end
  end
end

[...]

$> 0$. Then
$$P\big(\sqrt{C_\alpha n}\,(Z_{(\lceil n\alpha\rceil)} - \tau) \le x\big) \to \Phi(x) \quad\text{as } n \to \infty, \qquad (4.9)$$
where $C_\alpha = f^2(0)/(\alpha(1-\alpha))$ and $\Phi$ is the distribution function of the $N(0,1)$ distribution.

Bias and the Delta method

It is not always possible to find a function $t_n$ such that $E(T_n) = \tau$. For example, we might be interested in $\tau = h(E(X))$ for some specified smooth function $h$. If $\bar X$ again is the arithmetic mean, then a natural choice is $T_n = h(\bar X)$. However, unless $h$ is linear, $E(T_n)$ is not guaranteed to equal $\tau$. This calls for a definition: the bias of $t$ (when viewed as an estimator of $\tau$), $T_n = t(X_1, \ldots, X_n)$, is
$$\mathrm{Bias}(t) = E(T_n) - \tau. \qquad (4.10)$$
The concept of bias allows us to more fully appreciate the concept of mean square error, since
$$E(T_n - \tau)^2 = \mathrm{Var}(T_n) + \mathrm{Bias}^2(t) \qquad (4.11)$$
(show this as an exercise). The mean square error equals variance plus squared bias. In the abovementioned example, a Taylor expansion gives an impression of the size of the bias. Roughly, with $\mu = E(X)$, we have
$$E(T_n - \tau) = E[h(\bar X) - h(\mu)] \approx E\Big[(\bar X - \mu)h'(\mu) + \frac{(\bar X - \mu)^2}{2}h''(\mu)\Big] = \frac{\mathrm{Var}(X)}{2n}h''(\mu). \qquad (4.12)$$
And it is reassuring that (4.12) suggests a small bias when the sample size $n$ is large. Moreover, since the variance of $T_n$ generally is of order $O(n^{-1})$, it will dominate the $O(n^{-2})$ squared bias in (4.11), suggesting that bias is a small problem here (though it can be a serious problem if the above Taylor expansions are not valid).

We now turn to the variance of $T_n$. First note that while $\mathrm{Var}(\bar X)$ is easily estimated by e.g. (4.8), estimating $\mathrm{Var}(h(\bar X))$ is not so straightforward. A useful result along this line is the Delta Method.

Theorem 4.4 (The Delta method). Let $r_n$ be an increasing sequence and $S_n$ a sequence of random variables. If there is $\mu$ such that $h$ is differentiable at $\mu$ and $P(r_n(S_n - \mu) \le x) \to F(x)$ as $n \to \infty$ for a distribution function $F$, then
$$P(r_n(h(S_n) - h(\mu)) \le x) \to F(x/|h'(\mu)|).$$

Proof. Similar to the Taylor expansion argument in (4.12).

This theorem suggests that if $S_n = \bar X$ has variance $\sigma^2/r_n$, then the variance of $T_n = h(S_n)$ will be approximately $\sigma^2 h'(\mu)^2/r_n$ for large $n$. Moreover, if $S_n$ is asymptotically Normal, so is $T_n$.
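As an illustration (my addition, not from the notes), one can check the delta-method variance approximation by simulation; here for h(x) = exp(x) and Exponential(1) data, where mu = 1, Var(X) = 1 and the approximation gives Var(h(Xbar)) roughly e^2/n:

% Monte-Carlo check of the delta-method variance approximation
% for T_n = h(Xbar) with h(x)=exp(x), X ~ Exp(1) (mu=1, Var(X)=1).
n=100; reps=10000;
Tn=zeros(1,reps);
for j=1:reps
  x=-log(rand(1,n));      % n Exp(1) draws by inversion
  Tn(j)=exp(mean(x));     % h(Xbar)
end
var(Tn)                   % empirical variance of T_n
exp(1)^2/n                % delta-method approximation h'(mu)^2*Var(X)/n

The two printed numbers should agree to within Monte-Carlo error.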

4.1 Importance sampling

Importance sampling is a technique that can substantially decrease the variance of the Monte-Carlo error. It can also be used as a tool for estimating $E(\phi(X))$ in cases where $X$ cannot be sampled easily. We want to calculate
$$\tau = \int \phi(x)f(x)\,dx, \qquad (4.13)$$
which can be re-written
$$\tau = \int \psi(x)g(x)\,dx, \qquad (4.14)$$
where $\psi(x) = \phi(x)f(x)/g(x)$. Hence, if we obtain a sample $x_1, x_2, \ldots, x_n$ from the distribution of $g$, then we can estimate the integral by the unbiased estimator
$$t_n = n^{-1}\sum_{i=1}^n \psi(x_i), \qquad (4.15)$$
for which the variance is
$$\mathrm{Var}(T_n) = n^{-1}\int \{\psi(x) - \tau\}^2 g(x)\,dx. \qquad (4.16)$$
This variance can be very low, much lower than the variance of an estimate based on draws from $f$, provided $g$ can be chosen so as to make $\psi$ nearly constant. Essentially what is happening is that the simulations are being concentrated in the areas where there is greatest variation in the integrand, so that the informativeness of each simulated value is greatest. Another important advantage of importance sampling comes in problems where drawing from $f$ is difficult. Here draws from $f$ can be replaced by draws from an almost arbitrary density $g$ (though it is essential that $\phi f/g$ remain bounded).

This example illustrates the idea. Suppose we want to estimate the probability $P(X > 2)$, where $X$ follows a Cauchy distribution with density function
$$f(x) = \frac{1}{\pi(1+x^2)}, \qquad (4.17)$$
so we require the integral
$$\int 1\{x>2\}\,f(x)\,dx. \qquad (4.18)$$
We could simulate from the Cauchy directly and apply basic Monte-Carlo integration, but the variance of this estimator is substantial. As with the bivariate Normal example, the estimator is the empirical proportion of exceedances; exceedances are rare, so the variance is large compared to its mean. Put differently, we are spending most of our simulation budget on estimating the integral of $1\{x>2\}f(x)$ over an area (i.e. around the origin) where we know it equals zero.

Alternatively, we observe that for large $x$, $f(x)$ is similar in behaviour to the density $g(x) = 2/x^2$ on $x > 2$. By inversion, we can simulate from $g$ by letting $x_i = 2/u_i$ where $u_i \sim U[0,1]$. Thus, our estimator becomes
$$t_n = n^{-1}\sum_{i=1}^n \frac{x_i^2}{2\pi(1+x_i^2)}, \qquad (4.19)$$
where $x_i = 2/u_i$. Implementing this with the Matlab function

function [tn,cum]=impsamp(n)
x=2./rand(1,n);
psi=x.^2./(2*pi*(1+x.^2));
tn=mean(psi);
cum=cumsum(psi)./(1:n);

and processing

[tn,cum]=impsamp(1000);
plot(cum)

gave the estimate $t_n = 0.1478$. The exact value is $0.5 - \pi^{-1}\arctan 2 = 0.1476$. In Figure 4.3 the convergence of this sample mean to the true value is demonstrated as a function of $n$ by plotting the additional output vector cum.

[Figure 4.3: Convergence of the importance sampled mean.]

For comparison, in Figure 4.4 we show how this compares with a sequence of estimators based on the sample mean when simulating directly from a Cauchy distribution. Clearly, the reduction in variability is substantial (the importance sampled estimator looks like a straight line).

[Figure 4.4: Comparison of the importance sampled mean with the standard estimator.]
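For reference, a sketch of the direct (crude) Monte-Carlo estimator used in the comparison of Figure 4.4 might look as follows (my reconstruction; Cauchy draws are obtained by inverting the distribution function, and the function name is an illustrative choice):

function cum=cauchydirect(n)
% Crude Monte-Carlo estimate of P(X>2) for a standard Cauchy:
% simulate Cauchy draws by inversion, F^{-1}(u)=tan(pi*(u-1/2)),
% and average the exceedance indicators.
x=tan(pi*(rand(1,n)-0.5));   % Cauchy(0,1) draws by inversion
cum=cumsum(x>2)./(1:n);      % running estimate of P(X>2)

Then plot(cauchydirect(1000)) produces trajectories like the jagged ones in Figure 4.4.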

4.1.1 Unknown constant of proportionality

To be able to use the above importance sampling techniques, we need to know $f(x)$ explicitly. Just knowing $Mf$ for an unknown constant of proportionality $M$ is not sufficient. However, importance sampling can also be used to approximate $M$. Note that
$$M = \int Mf(x)\,dx = \int \frac{Mf(x)}{g_2(x)}\,g_2(x)\,dx, \qquad (4.20)$$
for a density $g_2$. Thus, based on a sample $x_1, x_2, \ldots, x_N$ from $g_2$, we can approximate $M$ by
$$t_N = \frac{1}{N}\sum_{i=1}^N \frac{Mf(x_i)}{g_2(x_i)}. \qquad (4.21)$$
It should be noted that this approximation puts some restrictions on the choice of $g_2$. To have a finite variance, we need (with $X' \sim g_2$)
$$E\bigg(\frac{(Mf(X'))^2}{g_2(X')^2}\bigg) = \int \frac{(Mf(x))^2}{g_2(x)}\,dx$$
to be finite, i.e. $f^2/g_2$ must be integrable. Hence, a natural requirement is that $f/g_2$ is bounded. This can now be used to approximate $\tau = E(\phi(X))$ using sequences $x_1, x_2, \ldots, x_N$ from $g_2$ and $y_1, y_2, \ldots, y_N$ from $g$ through
$$\frac{t'_N}{t_N} = \bigg(\sum_{i=1}^N \frac{\phi(y_i)Mf(y_i)}{g(y_i)}\bigg) \bigg/ \bigg(\sum_{i=1}^N \frac{Mf(x_i)}{g_2(x_i)}\bigg), \qquad (4.22)$$
where the numerator approximates $M\tau$ and the denominator $M$. Of course, we could use $g = g_2$ and $x_i = y_i$ in (4.22), but this is not usually the most efficient choice.
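As a concrete sketch (my addition), with g = g_2 and x_i = y_i the ratio estimator (4.22) reduces to a weighted average; here for the unnormalised target Mf(x) = x^2*exp(-x) on [0,1] from (3.7), with uniform proposals, estimating tau = E(X):

% Self-normalized importance sampling for the unnormalised density
% Mf(x) = x^2*exp(-x) on [0,1] (cf. (3.7)), estimating tau = E(X).
% Proposal g = g2 = U[0,1], so the weight is just Mf(x).
N=10000;
x=rand(1,N);            % draws from g2 = U[0,1]
w=x.^2.*exp(-x);        % importance weights Mf(x)/g2(x)
tau=sum(x.*w)/sum(w)    % ratio estimator (4.22) with phi(x)=x

Note that the unknown normalizing constant of f never has to be computed: it cancels in the ratio.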


Chapter 5

Markov Chain Monte-Carlo

Today, the most-used method for simulating from complicated and/or high-dimensional distributions is Markov Chain Monte Carlo (MCMC). The basic idea of MCMC is to construct a Markov chain that has $f$ as stationary distribution, where $f$ is the distribution we want to simulate from. In this chapter we introduce the algorithms; more applications will be given later.

5.1 Markov chains with continuous state-space

The theory for Markov chains that take values in a continuous state-space is complex, much more so than the theory for finite/countable state-spaces that you might already have been exposed to. We will not pretend to give a full account of the underlying theory here, but will be content with formulating a number of sufficient conditions under which the methods work. Links to more advanced material will be provided on the course web-page.

In this chapter, we will consider Markov chains $X_1, X_2, \ldots, X_N$ defined through a starting value $x_0$ and a transition density $\tilde q$. The transition density $x_i \mapsto \tilde q(x_{i-1}, x_i) = g(x_i|x_{i-1})$ is the density of $X_i|X_{i-1} = x_{i-1}$, and this determines the evolution of the Markov chain across the state-space. We would ideally like to draw the starting value $x_0$ and choose $\tilde q$ in such a way that the subsequent realisations $x_1, x_2, \ldots, x_N$ are independent draws from $f$, but this is too ambitious a task in general. In contrast to the previous chapter, here our draws will be neither independent nor exactly distributed according to $f$. What we will require is that $f$ is the asymptotic distribution of the chain, i.e. if $X_i$ takes values in $\mathcal{X}$,
$$P(X_n \in A) \to \int_A f(x)\,dx, \qquad (5.1)$$
as $n \to \infty$, for all subsets $A$ of $\mathcal{X}$ and independently of the starting value $x_0 \in \mathcal{X}$.

A condition on the transition density that ensures the Markov chain has $f$ as a stationary distribution is that it satisfies the detailed balance condition
$$f(y)\tilde q(y,x) = f(x)\tilde q(x,y). \qquad (5.2)$$
This says roughly that the flow of probability mass is equal in both directions (i.e. from $x$ to $y$ and vice versa). Detailed balance is not sufficient for the stationary distribution to be unique (hence the Markov chain might converge to a different stationary distribution). Sufficient conditions for uniqueness are irreducibility and aperiodicity of the chain; a simple sufficient condition for this is that the support of $f(y)$ is contained in the support of $y \mapsto \tilde q(x,y)$ for all $x$ (minimal necessary conditions for $\tilde q$ to satisfy (5.1) are not known in general; however, our sufficient conditions are far from the weakest known in the literature).

5.2 Markov chain Monte-Carlo integration

Before going into the details of how to construct a Markov chain with a specified asymptotic distribution, we will look at Monte-Carlo integration under the new assumption that draws are neither independent nor exactly from the target distribution. Along this line, we need a Central Limit Theorem for Markov chains. First we observe that if $X_1, X_2, \ldots, X_n$ is a Markov chain on $\mathbb{R}^d$ and $\phi: \mathbb{R}^d \mapsto \mathbb{R}$, then with $Z_i = \phi(X_i)$, the sequence $Z_1, Z_2, \ldots, Z_n$ forms a Markov chain on $\mathbb{R}$. As before, we want to approximate $\tau = E(Z) = E(\phi(X))$ by
$$t_N = N^{-1}\sum_{i=1}^N z^{(i)},$$
for a sequence $z^{(i)} = \phi(x^{(i)})$ of draws from the chain.

5.2.1 Burn-in

An immediate concern is that, unless we can produce a single starting value $x_0$ with the correct distribution, our sampled random variables will only asymptotically have the correct distribution. This will induce a bias in our estimate of $\tau$. In general, given some starting value $x^{(0)}$, there will be iterations $x^{(i)}$, $i = 1, \ldots, k$, before the distribution of $x^{(i)}$ can be regarded as "sufficiently close" to the stationary distribution $f(x)$ in order to be useful for further analysis (i.e. the value of $x^{(i)}$ is still strongly influenced by the choice of $x^{(0)}$ when $i \le k$). The values $x^{(0)}, \ldots, x^{(k)}$ are then referred to as the burn-in of the chain, and they are usually discarded in the subsequent output analysis. For example, when estimating $\int \phi(x)f(x)\,dx$, it is common to use
$$\frac{1}{N-k}\sum_{i=k+1}^N \phi(x^{(i)}).$$
An illustration is given in Figure 5.1.

[Figure 5.1: A Markov chain starting from x^{(0)} = -20 that seems to have reached its stationary distribution after roughly 70 iterations.]

We now provide the CLT for Markov chains.

Theorem 5.1. Suppose a geometrically ergodic Markov chain $X_i$, $i = 1, \ldots, n$, on $\mathbb{R}^d$ with stationary distribution $f$ and a real-valued function $\phi: \mathbb{R}^d \mapsto \mathbb{R}$ satisfies $E|\phi(X)|^{2+\epsilon} < \infty$ for some $\epsilon > 0$. Then, with $\tau = E(\phi(X))$ and $T_n = n^{-1}\sum_{i=1}^n \phi(X_i)$,
$$P\Big(\frac{\sqrt n\,(T_n - \tau)}{\sigma} \le x\Big) \to \Phi(x), \qquad (5.3)$$
where $\Phi$ is the distribution function of the $N(0,1)$ distribution and
$$\sigma^2 = r(0) + 2\sum_{i=1}^\infty r(i), \qquad (5.4)$$
where $r(i) = \lim_{k\to\infty} \mathrm{Cov}\{\phi(X_k), \phi(X_{k+i})\}$ is the covariance function of a stationary version of the chain.

Note that the above holds for an arbitrary starting value $x_0$. The error due to a "wrong" starting value, i.e. the bias $E(T_n) - \tau$, is of $O(n^{-1})$. Hence the squared bias is dominated by the variance, and asymptotically negligible. This does not suggest that it is unnecessary to discard burn-in, but rather that it is unnecessary to put too much effort into deciding on how long it should be. A visual inspection usually suffices.

5.2.2 After burn-in

If we manage to simulate our starting value $x^{(0)}$ from the target distribution $f$, all subsequent values will also have the correct distribution. Would our problems then be solved if we were given that single magic starting value with the correct distribution? As the above discussion suggests, the answer is no. The "big problem" of MCMC is that (5.4) can be very large, especially in problems where the dimension $d$ is large. It is important to understand that converging to the stationary distribution (or getting close enough) is just the very beginning of the analysis. After that we need to do sufficiently many iterations in order for the variance $\sigma^2/n$ of $T_n$ to be small.


It is actually a good idea to choose starting values $x^{(0)}$ that are far away from the main support of the distribution. If the chain then takes a long time to stabilize, you can expect that even after reaching stationarity the autocorrelation of the chain will be high and $\sigma^2$ in (5.4) large. Here we give a rough guide for estimating $\sigma^2$, which is needed if we want to produce a confidence interval (a sketch implementing the batch-means step follows the list):

1. We assume your draws $x^{(i)}$, $i = 1, \ldots, N$, are $d$-variate vectors. First make a visual inspection of each of the $d$ trajectories, and decide on a burn-in $1, \ldots, k$ where all of them seem to have reached stationarity. Throw away the first $k$ sample vectors.

2. Now you want to estimate $E(\phi(X))$. Compute $z_i = \phi(x^{(i+k)})$, $i = 1, \ldots, N-k$, and estimate the autocorrelation function of the sequence $z_i$, i.e. the function $\rho(t) = \mathrm{Corr}(Z_1, Z_t)$, over a range of values $t = 1, \ldots, L$; this can be done with Matlab's xcorr function (though it uses the range $-L, \ldots, L$). A reasonable choice of $L$ is around $(N-k)/50$ (estimates of $\rho(L)$ tend to be too unreliable for larger values). If the estimated autocorrelation function does not reach zero in the range $1, \ldots, L$, go back to the first step and choose a larger value for $N$.

3. Divide your sample into $m$ batches of size $l$, $ml = N-k$. Here $l$ should be much larger than the time it took for the autocorrelation to reach zero in the previous step. The batches are $(z_1, \ldots, z_l), (z_{l+1}, \ldots, z_{2l}), \ldots, (z_{(m-1)l+1}, \ldots, z_{ml})$. Now compute the arithmetic mean $\bar z_j$ of each batch, $j = 1, \ldots, m$. Estimate $\tau$ by the mean of the batch means, and the variance of this estimate by $s^2/m$, where $s^2$ is the empirical variance of the batch means.

This procedure is based on the fact that if batches are sufficiently large, their arithmetic means should be approximately uncorrelated. Hence, since our estimate is the mean of $m$ approximately independent batch means, it should have variance $\sigma_b^2/m$, where $\sigma_b^2$ is the variance of a batch mean. We now turn to the actual construction of the Markov chains.
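A minimal Matlab sketch of the batch-means step (my illustration; z is the post-burn-in sequence and l a batch size chosen as in step 3 above):

function [tau,se]=batchmeans(z,l)
% Batch-means estimate of tau=E(Z) and the standard error of the estimate.
m=floor(length(z)/l);            % number of complete batches
zb=mean(reshape(z(1:m*l),l,m));  % batch means (columns of an l-by-m array)
tau=mean(zb);                    % estimate of tau (mean of batch means)
se=sqrt(var(zb)/m);              % standard error, sqrt(s^2/m)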

5.3 The Metropolis-Hastings algorithm

How do you choose the transition density $\tilde q$ in order to satisfy (5.1)? The idea behind the Metropolis-Hastings algorithm is to start with an (almost) arbitrary transition density $q$. This density will not give the correct asymptotic distribution $f$, but we can try to repair this by rejecting some of the moves it proposes. Thus, we construct a new transition density $\tilde q$ defined by
$$\tilde q(x,y) = \alpha(x,y)q(x,y) + (1-\alpha(x,y))\delta_x(y), \qquad (5.5)$$
where $\delta_x(y)$ is a point-mass at $x$. This implies that we stay at the current value $x$ if the proposed value is rejected, and we reject a proposal $y^*$ with probability $1 - \alpha(x, y^*)$. Simulating from the density $y \mapsto \tilde q(x,y)$ works as follows:

1. Draw $y^*$ from $q(x, \cdot)$.
2. Draw $u$ from $U(0,1)$.
3. If $u < \alpha(x, y^*)$ set $y = y^*$, else set $y = x$.

We now have to match our proposal density $q$ with a suitable acceptance probability $\alpha$. The choice of the Metropolis-Hastings algorithm, based on satisfying the detailed balance equation (5.2), is
$$\alpha(x,y) = \min\Big(1, \frac{f(y)q(y,x)}{f(x)q(x,y)}\Big); \qquad (5.6)$$
you might want to check that this actually satisfies (5.2).

Algorithm 5.1 (The Metropolis-Hastings algorithm).
1. Choose a starting value $x^{(0)}$.
2. Repeat for $i = 1, \ldots, N$:
   i.1 Draw $y^*$ from $q(x^{(i-1)}, \cdot)$.
   i.2 Draw $u$ from $U(0,1)$.
   i.3 If $u < \alpha(x^{(i-1)}, y^*)$ set $x^{(i)} = y^*$, else set $x^{(i)} = x^{(i-1)}$.
3. $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$ is now a sequence of dependent draws, approximately from $f$.

There are three general types of Metropolis-Hastings candidate generating densities $q$ used in practice: the Gibbs sampler, the independence sampler and the random walk sampler. Below we will discuss their relative merits and problems; a generic implementation is sketched first.
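Here is a generic Matlab sketch of Algorithm 5.1 (my addition, not from the notes) for a target known only up to proportionality, Mf, and a proposal that can be both simulated (qrnd) and evaluated (qpdf); the function name mhsample0 and the function-handle interface are illustrative choices for a scalar state:

function x=mhsample0(N,x0,Mf,qrnd,qpdf)
% Generic Metropolis-Hastings: Mf is the (unnormalised) target density,
% qrnd(x) draws a proposal from q(x,.), qpdf(x,y) evaluates q(x,y).
x(1:N)=x0;
for i=1:N-1
  y=qrnd(x(i));                                  % propose y* from q(x,.)
  a=min(1,Mf(y)*qpdf(y,x(i))/(Mf(x(i))*qpdf(x(i),y)));
  if rand<a
    x(i+1)=y;       % accept the proposal
  else
    x(i+1)=x(i);    % reject, stay at the current value
  end
end

For example, x=mhsample0(1e4,0,@(x)exp(-x^2/2),@(x)x+randn,@(x,y)normpdf(y,x,1)); runs a Gaussian random walk for a standard Normal target.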

5.4 The Gibbs-sampler

The Gibbs-sampler is often viewed as a separate algorithm rather than a special case of the Metropolis-Hastings algorithm. It is based on partitioning the vector state-space $\mathbb{R}^d = \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_m}$ into blocks of sizes $d_i$, $i = 1, \ldots, m$, such that $d_1 + \cdots + d_m = d$. We will write $z = (z_1, \ldots, z_m) = (x_1, \ldots, x_d)$, where $z_i \in \mathbb{R}^{d_i}$ (e.g. $z_1 = (x_1, \ldots, x_{d_1})$). With this partition, the Gibbs-sampler with target $f$ is now:

Algorithm 5.2 (The Gibbs-sampler).
1. Choose a starting value $z^{(0)}$.
2. Repeat for $i = 1, \ldots, N$:
   i.1 Draw $z_1^{(i)}$ from $f(z_1 \mid z_2^{(i-1)}, \ldots, z_m^{(i-1)})$.
   i.2 Draw $z_2^{(i)}$ from $f(z_2 \mid z_1^{(i)}, z_3^{(i-1)}, \ldots, z_m^{(i-1)})$.
   i.3 Draw $z_3^{(i)}$ from $f(z_3 \mid z_1^{(i)}, z_2^{(i)}, z_4^{(i-1)}, \ldots, z_m^{(i-1)})$.
   ...
   i.m Draw $z_m^{(i)}$ from $f(z_m \mid z_1^{(i)}, z_2^{(i)}, \ldots, z_{m-1}^{(i)})$.
3. $z^{(1)}, z^{(2)}, \ldots, z^{(N)}$ is now a sequence of dependent draws, approximately from $f$.

This corresponds to an MH-algorithm with a particular proposal density and $\alpha = 1$. What is the proposal density $q$? Note the similarity with Algorithm 3.3. The difference is that here we draw each $z_i$ conditionally on all the others, which is easier since these conditional distributions are much easier to derive. For example,
$$z_1 \mapsto f(z_1 \mid z_2, \ldots, z_m) = \frac{f(z_1, \ldots, z_m)}{f(z_2, \ldots, z_m)} \propto f(z_1, \ldots, z_m).$$
Hence, if we know $f(z_1, \ldots, z_m)$, we also know all the conditionals needed (up to a constant of proportionality).

The Gibbs-sampler is the most popular MCMC algorithm, and given a suitable choice of partition of the state-space it works well for most applications to Bayesian statistics. We will have the opportunity to study it more closely in action in the subsequent part of this course on Bayesian statistics. Poor performance occurs when there is high dependence between the components $Z_i$. This is due to the fact that the Gibbs-sampler only moves along the coordinate axes of the vector $(z_1, \ldots, z_m)$, as illustrated by Figure 5.2. One remedy to this problem is to merge the dependent components into a single larger component, but this is not always practical.

Example 5.1 (Bivariate Normals). Bivariate Normals can be drawn with the methods of Examples 3.4 or 3.2. Here we will use the Gibbs-sampler instead. We want to draw from
$$(X_1, X_2) \sim N_2\Big(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\Big),$$
and choose the partition $(X_1, X_2) = (Z_1, Z_2)$. The conditional distributions are given by
$$Z_1 \mid Z_2 = z_2 \sim N(\rho z_2, 1-\rho^2) \quad\text{and}\quad Z_2 \mid Z_1 = z_1 \sim N(\rho z_1, 1-\rho^2).$$

[Figure 5.2: For a target (dashed) with strongly dependent components, the Gibbs sampler will move slowly across the support, since "big jumps", like the dashed move, would involve first simulating a highly unlikely value from f(z_1|z_2).]

The following script draws 1000 values starting from $(z_1^{(0)}, z_2^{(0)}) = (0,0)$.

function z=bivngibbs(rho)
z=zeros(1001,2);
for i=2:1001
  z(i,1)=randn*sqrt(1-rho^2)+rho*z(i-1,2);
  z(i,2)=randn*sqrt(1-rho^2)+rho*z(i,1);
end
z=z(2:1001,:);

In Figure 5.3 we have plotted the output for rho = 0.5 and rho = 0.99. Note the strong dependence between successive draws when rho = 0.99. Also note that each individual panel constitutes approximate draws from the same distribution, i.e. the N(0,1) distribution.

5.5 Independence proposal

The independence proposal amounts to proposing a candidate value $y^*$ independently of the current position of the Markov chain, i.e. we choose $q(x,y) = q(y)$. A necessary requirement here is that $\mathrm{supp}(f) \subseteq \mathrm{supp}(q)$; if this is not satisfied, some parts of the support of $f$ will never be reached. This candidate is then accepted with probability

[Figure 5.3: Gibbs-draws from Example 5.1, left with rho = 0.5 and right with rho = 0.99.]

$$\alpha(x, y^*) = \begin{cases} \min\Big\{\dfrac{f(y^*)q(x)}{f(x)q(y^*)},\, 1\Big\} & \text{if } f(x)q(y^*) > 0, \\ 1 & \text{if } f(x)q(y^*) = 0. \end{cases}$$

And we immediately see that if $q(y) = f(y)$, $\alpha(x,y) \equiv 1$, i.e. all candidates are accepted. Of course, if we really could simulate a candidate directly from $q(y) = f(y)$ we would not have bothered implementing an MH algorithm in the first place. Still, this fact suggests that we should attempt to find a candidate generating density $q(y)$ that is as good an approximation to $f(y)$ as possible, similarly to the rejection sampling algorithm. The main difference is that we don't need to worry about deriving constants $M$ or $K$ such that $Mf < Kq$ when we do independence sampling. To ensure good performance of the sampler it is advisable to ensure that such constants exist, though we do not need to derive them explicitly. If they do not exist, the algorithm will have problems reaching parts of the target support, typically the extreme tail of the target. This is best illustrated with an example; assume we want to simulate from $f(x) \propto 1/(1+x)^3$ using an Exp(1) MH independence proposal. A plot of the (unnormalised) densities $q$ and $1/(1+x)^3$ in Figure 5.4 does not indicate any problems: the main support of the two densities seems similar. The algorithm is implemented as follows

function [x,acc]=indsamp(m,x0)
x(1:m)=x0;
acc=0;
for i=1:m-1
  y=gamrnd(1,1);
  a=min(((y+1)^(-3))*gampdf(x(i),1,1)...
    /gampdf(y,1,1)/((x(i)+1)^(-3)),1);
  if (rand<a)
    x(i+1)=y;
    acc=acc+1;
  else
    x(i+1)=x(i);
  end
end

[...]

5.6 Random walk proposal

In the random walk proposal, candidates are generated by perturbing the current state, $y^* = x + \epsilon$, where $\epsilon$ has a symmetric density $g$, so that $q(x,y) = g(y-x) = g(x-y)$ and the acceptance probability (5.6) simplifies to $\alpha(x,y) = \min(1, f(y)/f(x))$. In particular, moves "uphill", i.e. to proposals $y^*$ with $f(y^*) > f(x)$, are always accepted. A common choice is to let $g$ be an $N(0, s^2\Sigma)$ density with a suitably chosen covariance matrix $\Sigma$ and a scaling constant $s$. In this case proposals are generated by adding a zero-mean Gaussian vector to the current state, $y^* = x + s\epsilon$ where $\epsilon \sim N(0, \Sigma)$. What remains for the practitioner in this case is the choice of the scaling constant $s$ and covariance matrix $\Sigma$. The latter choice is difficult, and in practice $\Sigma$ is often chosen to be a diagonal matrix of ones. For $s$, some general rules of thumb can be derived, though. Let's first look at a simple example. The function rwmh.m implements a Gaussian random walk for a standard Gaussian target density, given as input N the number of iterations, x0 a starting value and s a scaling constant:

function [x,acc]=rwmh(N,x0,s)
x(1:N)=x0;
acc=0;
for i=1:N-1
  y=x(i)+s*randn;
  a=min(exp(-y^2/2+x(i)^2/2),1);
  if rand<a
    x(i+1)=y;
    acc=acc+1;
  else
    x(i+1)=x(i);
  end
end

[...]

Chapter 8

Testing statistical hypotheses

A hypothesis $H_0$ that specifies the distribution of data completely is called simple, while a hypothesis such as $H_0: \theta > 0$ is composite. More examples of composite hypotheses are $H_0: \theta = 0$ in the model of $N(\theta, \sigma^2)$ distributions ($\sigma^2$ remains unspecified by the hypothesis and $\Theta_0 = \{0\} \times \mathbb{R}_+$) and, if the model contains all bivariate distributions $F(x,y)$, $H_0$: "X and Y are independent", which corresponds to the subset of distributions in $\mathcal{P}$ such that $F(x,y) = H(x)G(y)$ for two univariate distribution functions $H$ and $G$. The hypothesis $H_A: \theta_0 \in \Theta\backslash\Theta_0$ is usually referred to as the alternative hypothesis.

A statistical test of a hypothesis $H_0$ specifies a rejection region $R \subseteq \mathcal{Y}$ and a rule stating that if data $y \in R$, $H_0$ should be rejected (and if $y \in \mathcal{Y}\backslash R$, $H_0$ is not rejected). The question as to whether a hypothesis can ever be accepted is a philosophical one; in this course we will only attempt to find evidence against hypotheses. The significance level of a test with rejection region $R$ is defined as

$$\alpha = \sup_{\theta\in\Theta_0} P_\theta(Y \in R) = \sup_{\theta\in\Theta_0} \int_R f_\theta(y)\,dy, \qquad (8.1)$$
that is, the probability of rejecting a true null-hypothesis, maximized over all probability distributions satisfying the null-hypothesis. Typically $\alpha$ is a fixed small value, e.g. $\alpha = 0.05$.

8.1 Testing simple hypotheses

In the case of a simple hypothesis, the distribution of data under the null-hypothesis is completely specified, and we write
$$\alpha = P(Y \in R \mid H_0) \qquad (8.2)$$
for the level (8.1) of the test. To define a rejection region $R$, a common approach is to first define a test-statistic $t: \mathcal{Y} \mapsto \mathbb{R}$. $t(y)$ provides a real-valued summary of observed data $y$, and it is usually designed in such a way that "large" values of $t(y)$ will constitute evidence against $H_0$. Hence, a natural rejection region is of the form $R = \{y;\, t(y) > r\}$, which rejects $H_0$ for large values of $t(y)$. The question remains how to choose $r$ (i.e. what is "large") in order to achieve a pre-specified level $\alpha$. To achieve this, we introduce the P-value as the statistic $p: \mathcal{Y} \mapsto [0,1]$ defined by
$$p(y) = P(t(Y) \ge t(y) \mid H_0), \qquad (8.3)$$
i.e. the probability of, under $H_0$, observing a larger value of the test-statistic than the one we actually observed. The simple lemma below allows us to determine the distribution of the random variable $p(Y)$:

Lemma 8.1.1. If $X \in \mathbb{R}$ is a random variable with a continuous distribution function $F$, then the random variable $Z = 1 - F(X)$ has a uniform distribution on $[0,1]$.

Hence, if $F_{T_0}$ is the distribution function of $t(Y)$ under the null-hypothesis, then $p(Y) = 1 - F_{T_0}(t(Y))$, which has a uniform distribution under the null-hypothesis. Note also that the result only applies to continuous distribution functions in general. This result provides a convenient way of constructing rejection regions: we set $R = \{y;\, p(y) \le \alpha\}$. Since $p(Y)$ is uniformly distributed under $H_0$, $P(Y \in R \mid H_0) = \alpha$, as required by a level-$\alpha$ test. Hence, to test a hypothesis, we simply compute $p(y)$ and reject the hypothesis if this value is less than $\alpha$.

The mechanical "rejection" involved in hypothesis testing has often been criticised; an alternative approach favoured by many is to report the P-value rather than the decision, since the P-value itself is a measure of the accuracy (or rather the lack thereof) of $H_0$. The scientific community is then left to interpret this evidence provided by the statistician.

What is not so simple here is to compute $p(y)$, since this involves the distribution $F_{T_0}$, which is not analytically available in general. However, if we can simulate new samples from the null-hypothesis, $p(y)$ can easily be approximated by Monte-Carlo integration.

8.1. TESTING SIMPLE HYPOTHESISES

79

Algorithm 8.1 (Monte-Carlo test of a simple hypothesis).

1. Draw N samples y (1) , . . . , y (N ) , from the distribution specified by H0 . 2. Compute t(i) = t(y (i) ) for i = 1, . . . , N . P (i) ≥ t(y)}. 3. Compute pˆ(y) = N −1 N i=1 1{t 4. Reject H0 if pˆ ≤ α.

Traditionally, P -values have been approximated through asymptotic expansions of the distribution of t(Y ). One advantage of Monte-Carlo based tests is that we have almost complete freedom in choosing our test-statistic (without having to proove an asymptotic result). Example 8.1 (Normal mean). If we have observed y1 , . . . , yn independently from N (µ0 , σ02 ), where σ02 is known, we may want to test H0 : µ0 = 0. A natural test-statistic here is t(y) = |¯ y | and testing proceed by 1. Draw y (1) , . . . , y (N ) , N vectors of n independent N (0, σ 2 ) variables. 2. Compute t(i) = y¯(i) , i = 1, . . . , N . P (i) ≥ t(y)}. 3. Compute pˆ(y) = N −1 N i=1 1{t 4. Reject H0 if pˆ ≤ α.

Randomized tests and the problem with discretely distributed test-statistics When the distribution of t(Y ) is discrete, the above procedure does not guarantee a test that rejects a true null-hypothesis with probability α. Consider, for example the (slightly pathological) case where we have one observation from a Bernoulli(p) distribution, and want to test H0 : p = 1/2. In this case, no matter how we define the rejection region R, it will contain Y ∼ Bernoulli(1/2) with probability 0, 1/2 or 1. Hence, for example, we can’t define a level α = 0.05 test. We can construct a level 0.05 test by a randomization argument however. By augmenting the sample space, i.e. defining Y ′ = Y × R, where the “new” data-set is y ′ = (y, z) ∈ Y ′ and z is a draw from, say, N (0, σ 2 ), the test-statistic t′ (y ′ ) = t(y) + z will now have a continuous distribution and the procedure applies again. Since the result of such a procedure depends on z, which we just drew using a random number generator independently of data, randomized tests are not very popular in practise but rather employed as technical devices in theory. Example 8.2 (Flipping a coin). To test whethter a coin was biased, 100 independent coin-flips were performed out of which 59 turned out to be

80

CHAPTER 8. TESTING STATISTICAL HYPOTHESISES

1 0.9 0.8 0.7

F(p)

0.6 0.5 0.4 0.3 0.2 0.1 0

0

0.2

0.4

0.6

0.8

1

p

Figure 8.1: The null-distribution function of p(Y ) for the coin-flipping example (solid line).

"tails". If the coin was unbiased, y = 59 could be regarded as a draw from a Bin(100, 1/2) distribution; hence we choose this as our null-hypothesis. A reasonable choice of test-statistic is t(y) = |y/100 − 1/2|, since this can be expected to be large if the true probability of "tails" deviates from 1/2. To estimate the P-value, we perform 1000 draws from Bin(100, 1/2), y^(1), ..., y^(1000), and compute t(y^(1)), ..., t(y^(1000)). In a particular application, 75 of the t(y^(i)) were larger than the observed t(59). This gives an approximate P-value of 75/1000, which is larger than 0.05. Hence, we cannot reject the null-hypothesis at this level. In Matlab:

t=abs(59/100-0.5);
Y=binornd(100,.5,1,1000);
tY=abs(Y/100-0.5);
p=mean(tY>=t)
p = 0.0750

For this simple example, we can calculate the P-value exactly as p(59) = P(Y ≤ 41) + P(Y ≥ 59) = 0.088.... In Figure 8.1 we have plotted the distribution function of p(Y). As you can see, it is not exactly uniform (the dashed line is the uniform), but fairly close to uniform in its left tail. The true level of the test is α ≈ 0.035.
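The exact value can be confirmed with the Binomial distribution function (a one-line check, not in the original listing):

p = binocdf(41,100,0.5) + 1 - binocdf(58,100,0.5)   % = 0.088...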

8.2 Testing composite hypotheses

Unfortunately, most null-hypotheses of interest are composite, and for a composite hypothesis the P-value (8.3) is no longer well-defined, since it might take different values in different parts of Θ0. There are three basic strategies for dealing with this problem:


1. Pivotal tests; here we find a pivotal test-statistic t such that t(Y) has the same distribution for all values of θ ∈ Θ0.
2. Conditional tests; here we convert the composite hypothesis into a simple one by conditioning on a sufficient statistic.
3. Bootstrap tests; here we replace the composite hypothesis Θ0 by the simple {θ̂}, θ̂ ∈ Θ0.

8.2.1 Pivot tests

A pivotal statistic is a statistic t such that the distribution of t(Y), Y ~ Pθ, is the same for all values θ ∈ Θ0. Hence, in Step 2 of Algorithm 8.1, we can simulate from any such Pθ.

Example 8.3 (Testing for an equal mean, Normal case). Assume we have two independent samples, y1, ..., yn from N(µy, 1) and x1, ..., xm from N(µx, 1), and we want to test H0: µy = µx. Here the null-hypothesis contains one parameter that remains unspecified (µy = µx). Since we can write yi = ui + µy and xi = vi + µx, for independent draws ui and vi from N(0, 1), the statistic t given by

t(x, y) = |(1/n) Σ_{i=1}^{n} yi − (1/m) Σ_{i=1}^{m} xi| = |(1/n) Σ_{i=1}^{n} ui − (1/m) Σ_{i=1}^{m} vi + (µy − µx)|,   (8.4)

is independent of µy and µx under the null-hypothesis. Hence, a test might proceed as follows:

1. Draw N samples y^(1), ..., y^(N), each being a vector of n independent N(37, 1) draws.
2. Draw N samples x^(1), ..., x^(N), each being a vector of m independent N(37, 1) draws.
3. Compute t^(i) = t(x^(i), y^(i)), i = 1, ..., N, with t as defined in (8.4).
4. Compute p̂ = N^{-1} Σ_{i=1}^{N} 1{t^(i) ≥ t(x, y)}.
5. Reject H0 if p̂ ≤ α.
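A Matlab sketch of this test (y, x, n and m are placeholders for the two observed samples and their sizes):

t = abs(mean(y) - mean(x));      % observed value of (8.4)
N = 1000;
for i = 1:N
  ystar = normrnd(37,1,1,n);     % common mean 37 under H0
  xstar = normrnd(37,1,1,m);
  T(i) = abs(mean(ystar) - mean(xstar));
end
phat = mean(T >= t)              % reject H0 if phat <= alpha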

Of course, the value "37" can be replaced by any value in R as long as it is the same in Steps 1 and 2.

Tests based on ranks

Example 8.4 (Ph-levels cont.). Consider the data of Example 7.6, where it was relevant to test for an increased Ph-level. If we again denote by F0 the distribution of the historical measurements and by G0 the distribution of the current, and assume all measurements independent, a natural null-hypothesis is H0: F0 = G0. This is a composite hypothesis since F0 and G0 are unknown. We represent data as y = (y1, ..., y273), where y1, ..., y124 are


historical and y125, ..., y273 current measurements. To find a pivot, we let ri be the rank of yi, i.e. if the components of y are sorted with respect to size, then yi comes in ri:th place. What is important to note here is that if the null-hypothesis is true, the vector (r1, ..., r273) can be seen as a draw from the uniform distribution on the set of all permutations of the vector (1, 2, ..., 273). Hence, any statistic based on r alone is pivotal. Since there was some prior suspicion that the Ph-level had increased, this would correspond to the current measurements having larger ranks than what could be expected from the null-distribution. Hence, we choose as a test-statistic the average difference in ranks

t(y) = (1/149) Σ_{i=125}^{273} ri − (1/124) Σ_{i=1}^{124} ri.

In Matlab the test is implemented as follows

[dummy,r]=sort(y);
t=mean(r(125:273))-mean(r(1:124));
for i=1:1000
  R=randperm(273);
  T(i)=mean(R(125:273))-mean(R(1:124));
end
phat=mean(T>=t)
phat = 0.0770

and this test could not find any hard evidence of an increase.

Pivotal tests of the above form are (naturally) called rank-based tests. They are generally applicable when the null-hypothesis is exchangeable, i.e. when (Y_{i1}, ..., Y_{in}) has (under the null-hypothesis) the same distribution for all permutations (i1, ..., in) of the vector (1, ..., n). They are of somewhat limited efficiency in many situations, since there is a loss of information in the transformation from y to r.

Example 8.5 (Hormone levels cont.). In Example 7.8 we assumed a time-series model for the measurements of luteinizing hormone levels. The assumption that consecutive measurements really are dependent on each other may be questioned. Can we reject the null-hypothesis H0: "measurements are independent from F0"? If there is (positive) serial dependence, the ranks of consecutive observations will be closer to each other than under the null-hypothesis. Hence, we choose as a test statistic

t(y) = 1 / Σ_{i=2}^{48} |ri − r_{i−1}|.

In Matlab

[dummy,r]=sort(hormone);
t=1/sum(abs(r(2:48)-r(1:47)));
for i=1:1000
  R=randperm(48);
  T(i)=1/sum(abs(R(2:48)-R(1:47)));
end
phat=mean(T>=t)
phat = 0.0670

and no significant deviation from the null-hypothesis was found with this test either.

8.2.2 Conditional tests

Consider the usual setup with data y ∈ Y a realisation of Y ~ Pθ0. A sufficient statistic s(y) for θ is a statistic that summarises all information available on θ in the sample.

Definition 8.1. A statistic s(·) is said to be sufficient for a parameter θ if the distribution of Y|s(Y) = s(y) does not depend on θ (for any y).

Obviously, s(y) = y is always a sufficient statistic. The following theorem provides a convenient way to check whether a particular statistic is sufficient:

Theorem 8.1. A statistic t is sufficient for θ if and only if the likelihood factorises as

f_θ(y) = h(y) k(t(y), θ),   θ ∈ Θ, y ∈ Y,   (8.5)

i.e. into a part that does not depend on θ and a part that only depends on y through t(y).

Example 8.6. Assume y1, ..., yn are independent draws from N(θ, 1). Then

f_θ(y) = C exp(−Σ_{i=1}^{n} (yi − θ)²/2) = C exp(−Σ_{i=1}^{n} yi²/2) exp(−nθ²/2 + θ Σ_{i=1}^{n} yi).

Hence t(y) = Σ_{i=1}^{n} yi is a sufficient statistic for θ. This means that there is no loss of information if we forget the individual values and just keep their sum.

A conditional test is performed conditionally on the observed value of s(y). Hence, the P-value (8.3) is replaced by the conditional P-value defined by

p(y) = P(t(Y) ≥ t(y) | s(Y) = s(y), H0).   (8.6)


Example 8.7 (A call for help?). Assume that person D has agreed with person E that, in the need of assistance, he should continuously transmit a binary message ...0101010101010101... (i.e. alternating zeros and ones). When D is not transmitting, E receives background noise ...yi yi+1 yi+2..., where the yi are independent and equal to zero with probability p and one with probability 1 − p. One day, E received the sequence 000101100101010100010101. Was this just background noise or a call for help where some signals got distorted?


Figure 8.2: Recorded message in Example 8.7.

Here we want to test the hypothesis H0: "message is background noise" and a possible test-statistic is t(y) = max(|y − I1|, |y − I2|), where I1 and I2 are alternating sequences of zeros and ones starting with a zero and a one respectively (and |·| denotes the sum of absolute componentwise differences). If we knew p, this would be a straightforward Monte-Carlo test. However, s(y) = Σ yi, the total number of ones, is a sufficient statistic for p (check this). Conditionally on the observed s(y) = 10, Y|s(y) = 10 is just a random arrangement of 10 ones and 14 zeros in a vector. Hence we proceed as follows in Matlab:

y=[0 0 0 1 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1];
I1=repmat([0 1],1,12);
I2=repmat([1 0],1,12);
t=max(sum(abs(y-I1)),sum(abs(y-I2)));
for i=1:10000
  Y=y(randperm(24));
  T(i)=max(sum(abs(Y-I1)),sum(abs(Y-I2)));


end
phat=mean(T>=t)
phat = 0.0018

Looks like a call for help.

Example 8.8 (Pines). Figure 8.3 describes the location of pines in a square forest; the data is contained in the file pines.mat. It is of interest to determine if they are randomly scattered or tend to cluster in groups. Hence, we might want to test H0: "locations form a Poisson process with rate λ". Simulating from H0 involves first generating a random variable N with a Poisson(λA) distribution (where A is the area of the forest) and then N points uniformly distributed over the forest. Unfortunately we can't simulate such an N since we do not know λ. Hence, we condition on the value of N we actually observed: N = 118. We let

H0: "the locations are randomly scattered"   (8.7)

and as a test statistic we choose

t = Σ_{i=1}^{n} τi,   (8.8)

where τi is the distance between pine i and its nearest neighbour. We would expect t to be small if pines tend to be clustered and large if they tend to avoid each other. Since there are 118 pines in the data-set, simulating from H0 simply amounts to simulating 118 points uniformly on [0, 100] × [0, 100]. t is calculated by the following Matlab function

function t=tn2(x);
n=length(x);
for i=1:n
  d=sqrt((x(i,1)-x(:,1)).^2+(x(i,2)-x(:,2)).^2);
  tau(i)=min(d(d>0));
end
t=sum(tau);

and the analysis, simulating 1000 values from t(Y)|H0, N = 118 (WARNING: 1000 values might take some time on a slow computer, use less), is performed by

for i=1:1000
  T(i)=tn2(rand(118,2)*100);
end

A histogram of the simulated test statistics is given in Figure 8.4. Evaluating (8.8) for the observed data gave the value 536, which gives no grounds for rejecting the hypothesis that pines are randomly scattered. Note again that (8.8) was rather arbitrarily chosen as a test statistic, and perhaps with another choice we would have produced a stronger test that could reject H0.



Figure 8.3: Locations of pines.


Figure 8.4: Histogram of simulated test statistic t∗ .


Permutation tests

We have already mentioned that the full sample, s(y) = (y1, ..., yn), is a sufficient statistic for any parameter θ. It is not very useful in testing though, since the distribution of t(Y)|Y = y is a point mass at t(y). However, the related statistic s(y) = (y(1), ..., y(n)), the ordered sample, is often useful.

Definition 8.2 (Exchangeable random variables). A vector Y = (Y1, ..., Yn) of random variables is said to be exchangeable if (Y_{i1}, ..., Y_{in}) has the same distribution for all permutations i = (i1, ..., in) of the index set (1, ..., n).

Obviously independent and identically distributed random variables are exchangeable, but the exchangeability concept is slightly more general. For example, if (Z1, ..., Zn) are i.i.d. and X a random variable independent of the Zi, then (Y1, ..., Yn) = (Z1 + X, ..., Zn + X) are exchangeable but not independent. Examples of non-exchangeable vectors are cases where (some of) the Yi have different distributions, and vectors with serial dependence like time-series models. What is important here is that if Y = (Y1, ..., Yn) with distribution Pθ is exchangeable, then s(Y) = (Y(1), ..., Y(n)) is a sufficient statistic for θ. This is clear since, with fθ the density corresponding to Pθ, by exchangeability

fθ(y1, ..., yn) = fθ(y(1), ..., y(n)) = k(s(y), θ).

Moreover, the distribution of (Y1, ..., Yn)|(Y(1), ..., Y(n)) = (y(1), ..., y(n)) is the uniform distribution over the set of all permutations of the vector (y1, ..., yn). Based on these results, a permutation test proceeds as follows:

Algorithm 8.2 (Monte-Carlo permutation test).

1. Draw N random permutations y^(1), ..., y^(N) of the vector y.
2. Compute t^(i) = t(y^(i)) for i = 1, ..., N.
3. Compute p̂(y) = N^{-1} Σ_{i=1}^{N} 1{t^(i) ≥ t(y)}.
4. Reject H0 if p̂ ≤ α.

Permutation tests can be very efficient for testing an exchangeable null-hypothesis against a non-exchangeable alternative.

Example 8.9 (Ph-levels cont.). Returning to the Ph-levels again, we assume this time that historical data are independent observations from some unknown distribution F(x) and current data from G(y) = F(y − θ). The null-hypothesis is H0: θ = 0. Under this hypothesis the full vector of current and historical measurements y = (y1, ..., y273) is an observation of a vector of i.i.d., and hence also exchangeable, random variables.

A natural test-statistic is the difference in means

t = t(y) = (1/149) Σ_{i=125}^{273} yi − (1/124) Σ_{i=1}^{124} yi,

and we want to reject H0 when this is "large". In Matlab

y=[ph1;ph2];
t=mean(ph2)-mean(ph1);
for i=1:1000
  Y=y(randperm(273));
  T(i)=mean(Y(125:273))-mean(Y(1:124));
end
p=mean(T>=t)
p = 0.0240

Since the P-value is less than 0.05, we can reject the null-hypothesis at this level.

Example 8.10 (Law school data cont.). Returning to the law school scores from Example 7.5, of interest is again to test for independence between the two tests, and hence we define H0: F0(x, y) = H(x)G(y), where data are assumed to be drawn from F0. This time, a sufficient statistic under the null-hypothesis is the pair of ordered samples, s(x, y) = ((y(1), ..., y(15)), (x(1), ..., x(15))). Drawing from (X, Y)|s(X, Y) = s(x, y) proceeds by combining the two randomly permuted vectors in pairs. As test-statistic we choose the absolute value of the empirical correlation of the pairs.

C=corr(law); t=abs(C(1,2));      % off-diagonal element of the correlation matrix
for i=1:1000
  LAW=[law(randperm(15),1) law(randperm(15),2)];
  C=corr(LAW);
  T(i)=abs(C(1,2));
end
p=mean(T>=t)
p = 0.0010

Based on this value, we can reject the null-hypothesis of independence with large confidence.

8.2.3 Bootstrap tests

Tests based on confidence intervals

With the knowledge of how to produce bootstrap confidence intervals, we immediately have the tools for testing hypotheses of e.g. the form H0: θ = θ0 or H0: θ > θ0 for fixed values θ0, using the relation between the construction of confidence intervals and hypothesis testing. For example, using the analysis in Example 7.5 we can immediately reject the hypothesis that the two law-scores are uncorrelated, since the bootstrap confidence interval did not cover


0. Similarly, a quick look at the histogram in Figure 7.7 suggests that 0 is not contained in a percentile confidence interval of the median differences, and we can reject the hypothesis that there has been no increase in Ph-level with some confidence. The above procedure generalises traditional tests based on the normal distribution to cases where there is evidence of non-normality.

Sampling from an approximate null-distribution

As before, the Bootstrap idea is to estimate P0 by P̂ and then proceed as if the latter was the correct distribution. Hence, we want to find a Monte-Carlo approximation of the P-value

p(y) = P(t(Y*) > t(y)|H0), where Y* ~ P̂.   (8.9)

In contrast to the previous sections, this is only a statistical approximation of a P-value, but if H0 is true and P̂ is a good estimate, it should provide sensible results. An important difference from the Bootstrap algorithms discussed earlier is that here we need to insist that P̂ satisfies H0.

Example 8.11 (Testing for a zero mean). Assume we have observations y = (y1, ..., yn), independent from F0, and we want to test if they have a zero mean, H0: E(Yi) = 0. Here F0 remains unspecified by the null-hypothesis and can be estimated by the empirical distribution function F̂. However, if Z* ~ F̂, then E(Z*) = ȳ, which is not necessarily zero. Hence we choose to approximate P0 by the distribution function Fn(u) = Π_{i=1}^{n} F̂(ui − ȳ) instead, which has zero-mean marginals. With t(y) = |ȳ| as test-statistic, we proceed as follows:

1. Repeat for j = 1, ..., N
   j.1 Draw i1, ..., in independently from the uniform distribution on {1, ..., n}.
   j.2 Set y*j = (y_{i1} − ȳ, ..., y_{in} − ȳ).
   j.3 Compute t*j = t(y*j).
2. Compute p̂ = N^{-1} Σ_{j=1}^{N} 1{t*j ≥ t(y)}.
3. Reject H0 if p̂ ≤ α.
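A Matlab sketch of the procedure (y is a placeholder for the observed data vector and n its length):

t = abs(mean(y));           % observed t(y) = |ybar|
yc = y - mean(y);           % centred data satisfy H0
N = 1000;
for j = 1:N
  idx = ceil(n*rand(1,n));  % i_1,...,i_n uniform on {1,...,n}, with replacement
  T(j) = abs(mean(yc(idx)));
end
phat = mean(T >= t)         % reject H0 if phat <= alpha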

The procedure seems similar to that of a permutation test, the main difference being that we draw new observations with replacement in the Bootstrap version and without replacement in the permutation test. However, a permutation test would not be appropriate here, since the model under the alternative hypothesis is also exchangeable. Moreover, the test-statistic t(y) = |ȳ| is invariant under permutations of the vector y.

Example 8.12 (Testing goodness of fit). Consider Example 7.3 with the failure-times for airconditioning equipment. In Figure 7.2 we plotted the


empirical distribution function together with the fitted Exponential distribution. Of course we would not expect the functions to agree fully, but we might worry that the discrepancy is larger than what could be expected from an Exponential distribution. This could be formulated in terms of the hypothesis H0: "data are independent draws from an Exponential distribution", and as a test-statistic we might choose

t(y) = Σ_{i=1}^{n} (i/n − (1 − exp(−y(i)/ȳ)))²,   (8.10)

which is the sum of the squared differences between the empirical distribution function and the fitted Exponential distribution function, evaluated at the sampled points. In the Bootstrap approximation, we sample from Y* ~ Exp(106.4), the fitted Exponential (which obviously is a member of Θ0). In Matlab

ysort=sort(y);
m=mean(y);
u=(1:12)/12;
t=sum((u-(1-exp(-ysort/m))).^2);
for i=1:1000
  ystar=sort(exprnd(106.4,1,12));
  m=mean(ystar);
  T(i)=sum((u-(1-exp(-ystar/m))).^2);
end
phat=mean(T>=t)
phat = 0.2110

hence, the Exponential seems a safe assumption.

Example 8.13 (Comparing models). In the above example, the alternative hypothesis contained all distributions that are not Exponential. Sometimes you have two candidate models and want to choose the "best" by testing them against each other. For the air-conditioning example we might test H0: "data are independent draws from an Exponential distribution" against HA: "data are independent draws from a log-normal distribution". The difference in procedure lies mainly in the choice of test-statistic; here a natural choice is the difference of log-likelihoods

t(y) = log(L_y^A(θ̂_A(y))) − log(L_y^0(θ̂_0(y))),

where L_y^A, θ̂_A(y) and L_y^0, θ̂_0(y) are the likelihoods and maximum-likelihood estimates under the alternative and null hypotheses respectively. In Matlab

tA=lognfit(y);
t0=mean(y);
t=sum(log(lognpdf(y,tA(1),tA(2))))-sum(log(exppdf(y,t0)));
for i=1:1000
  ystar=exprnd(106.4,1,12);
  tA=lognfit(ystar);
  t0=mean(ystar);
  T(i)=sum(log(lognpdf(ystar,tA(1),tA(2))))-sum(log(exppdf(ystar,t0)));
end
phat=mean(T>=t)
phat = 0.4600

suggesting little evidence for or against either (if the Exponential had a superior fit we would expect a P-value close to 1).


Figure 8.5: Empirical distribution function (solid), fitted Exponential (dotted) and fitted log-Normal (dashed) for the airconditioning data (×).


Chapter 9

Missing data models

In this chapter we will be concerned with Maximum-Likelihood estimation of parameters in models involving stochastic variables that are not directly observed. This includes cases where part of the data was actually lost for some reason, models where some of the unknowns are better modelled as random variables than as fixed parameters (often referred to as latent/hidden variable models), and cases where random variables are introduced into the model for purely computational reasons. The general setup is that of an observed data model: y ∈ Y is an observation of Y ~ Pθ0, Pθ0 ∈ P = {Pθ; θ ∈ Θ}, just as in Chapter 6. The difference is that in this chapter we will consider cases where the likelihood Ly(θ) = fθ(y) is not explicitly known or is difficult to maximise.

Example 9.1 (Truncated data). Assume you have observed y ∈ R^n, independent observations of Yi = ⌈Zi⌉, where Zi has density fθ0 for i = 1, ..., n. Here Z = (Z1, ..., Zn) can be viewed as "missing" data. The likelihood in this example is

Ly(θ) = Π_{i=1}^{n} ∫_{yi−1}^{yi} fθ(zi) dzi,   (9.1)

which can be difficult to compute or maximise. The idea here and in the following is that deriving the likelihood would have been much easier had we observed the missing data z. The augmented/complete data is (y, Z) ∈ Y × Z, where marginally y is an observation of Y ~ Pθ0 and (Y, Z) ~ P^aug_θ0. Correspondingly, we may define the augmented/complete data model as P^aug_θ0 ∈ {P^aug_θ; θ ∈ Θ} and the augmented likelihood

L_{y,Z}(θ) = fθ(y, Z),   (9.2)

where fθ(y, Z) is the density corresponding to P^aug_θ evaluated at observed data y and the random variable Z. Note that L_{y,Z}(θ) is a random variable.


Our quantity of interest is now the Maximum-Likelihood estimator based on observed data, i.e.

θ̂ = argmax_θ Ly(θ) = argmax_θ ∫ fθ(y, z) dz.   (9.3)

We will later in this chapter discuss the EM-algorithm, a numerical scheme for finding θ̂ defined in (9.3) while avoiding explicit evaluation of the integral on the right-hand side. In Example 9.1, the augmented data had a clear interpretation as being "lost" in the observation process. An example where hidden random variables are part of the modelling is hidden Markov models:

Example 9.2 (Noisy binary data). Assume a signal Z1, ..., Zn consisting of zeros and ones evolves according to a Markov chain with transition matrix M = (mij), (i, j) ∈ {0, 1} × {0, 1}, i.e. P(Zk = j|Zk−1 = i) = mij, but we can only observe a noisy version y1, ..., yn, where Yk = Zk + εk and the εk are independent N(0, σ²). This is an example of a hidden Markov model (HMM); a sample data set is shown in Figure 9.1.


Figure 9.1: Observations (dots) from a hidden Markov model, the dotted line is the (unobserved) signal.

A similar example, but where the hidden variables are artificially introduced, is given by mixture models:

Example 9.3. For a certain species of insects it is difficult to distinguish the sex of the animals. However, based on previous experiments, the weights of males are known to be distributed according to a density f0 and those of females according to f1. The weights of 1000 randomly chosen insects in a large population are recorded, and of interest is the proportion θ of males in the population. Here, the sample can be viewed as independent draws from the mixture density

θf0(y) + (1 − θ)f1(y).   (9.4)

The likelihood is

Ly(θ) = Π_{i=1}^{n} (θf0(yi) + (1 − θ)f1(yi)),   (9.5)

and requires numerical maximisation. However, if we introduce latent indicator variables Zi, where Zi = 0 if yi is measured on a male and Zi = 1 if on a female (and hence P(Zi = 0) = θ), we get the augmented likelihood

L_{y,z}(θ) = Π_{i=1}^{n} f_{zi}(yi) θ^{1−zi} (1 − θ)^{zi},   (9.6)

which is often easier to handle numerically.


Figure 9.2: Histogram of data from the mixture example.

The EM-algorithm is an iterative algorithm for approximating θ̂ in (9.3). Given a current approximation θ^(i), we define the function

θ ↦ Q(θ, θ^(i)) = ∫ log(L_{y,z}(θ)) f_{θ^(i)}(z|y) dz = E(log(L_{y,Z^(i)}(θ))),   (9.7)

where Z^(i) is a random variable with density f_{θ^(i)}(z|y), the density of missing data conditionally on observed data and on the unknown parameter equalling θ^(i).

Algorithm 9.1 (The EM-algorithm).

1. Choose a starting value θ^(0).
2. Repeat for i = 1, 2, ... until convergence:
   Compute θ^(i) = argmax_θ Q(θ, θ^(i−1)).
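To make the E- and M-steps concrete, the following sketch implements the algorithm for the mixture model of Example 9.3, with hypothetical Normal component densities standing in for the known f0 and f1 (neither is specified in the text); the E-step computes E(1 − Zj) given the data and θ^(i), and the M-step maximises the resulting Q:

f0 = @(y) normpdf(y,100,1);   % hypothetical male weight density (assumption)
f1 = @(y) normpdf(y,102,1);   % hypothetical female weight density (assumption)
th = 0.5;                     % starting value theta^(0)
for i = 1:50
  w  = th*f0(y)./(th*f0(y)+(1-th)*f1(y));  % E-step: P(Z_j = 0 | y_j, theta^(i))
  th = mean(w);                            % M-step: argmax of Q(theta, theta^(i))
end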


Note that if L_{y,z}(θ) = h(θ, y, z)g(y, z), we can replace Q(θ, θ^(i)) by

Q̃(θ, θ^(i)) = E(log(h(θ, y, Z^(i))))

in the algorithm without changing the location of the maxima. The name of the algorithm comes from the fact that each iteration involves taking an expectation when computing Q(θ, θ^(i−1)) (the E-step) and performing a maximisation (the M-step) when maximising the same. For the algorithm to be of any use, it is of course essential that these two steps are easier to perform than a direct evaluation of (9.3).

Theorem 9.1. If θ^(i) = argmax_θ Q(θ, θ^(i−1)), then Ly(θ^(i)) ≥ Ly(θ^(i−1)), with equality holding if and only if Q(θ^(i), θ^(i−1)) = Q(θ^(i−1), θ^(i−1)).

I.e. the (observed data) likelihood is increased in each step of the EM-algorithm. If in addition Q(θ, θ′) is continuous in both variables, the algorithm will converge to a stationary point θ̃, i.e. such that θ̃ = argmax_θ Q(θ, θ̃). Under further technical conditions, a stationary point will correspond to a local maximum of Ly(θ), but there is no guarantee that this will be the global maximum.

Example 9.4 (Truncated data cont.). Assume we have n independent observations of Y = ⌈Z⌉, where we assume Z ~ Exp(θ0). To apply the EM-algorithm, we first need to define our augmented data-set, which here is naturally given by (y1, Z1), ..., (yn, Zn), where Zi is the value of yi before truncation. Next we need to derive the augmented likelihood L_{y,z}(θ) and the distribution of Z|y. First note that the conditional distribution of Yi|Zi = zi is just a point-mass at ⌈zi⌉. Hence

L_{y,z}(θ) = fθ(y, z) = fθ(y|z)fθ(z) = Π_{i=1}^{n} 1{yi = ⌈zi⌉} exp(−zi/θ)/θ,   (9.8)

where of course the factor fθ(y|z) does not depend on θ and can be dropped. Further,

zi ↦ fθ(zi|yi) ∝ fθ(zi, yi) = 1{yi = ⌈zi⌉} exp(−zi/θ)/θ,

an Exp(θ) random variable truncated to [yi, yi + 1]. Hence,

Q(θ, θ^(i)) = Σ_{j=1}^{n} (−E(Z_j^(i))/θ − log(θ)),

which is maximised by θ^(i+1) = n^{-1} Σ_{j=1}^{n} E(Z_j^(i)). Finally,

E(Z_j^(i)) = ∫_{yj}^{yj+1} z exp(−z/θ^(i))/θ^(i) dz / ∫_{yj}^{yj+1} exp(−z/θ^(i))/θ^(i) dz
           = (exp(1/θ^(i))(yj + θ^(i)) − yj − 1 − θ^(i)) / (exp(1/θ^(i)) − 1),   (9.9)

which specifies the recursion. In Matlab, with a simulated data-set:

y=ceil(exprnd(3,1,100));
t(1)=1;
for i=1:20
  t(i+1)=mean((exp(1/t(i))*(y+t(i))-y-1-t(i)))/(exp(1/t(i))-1);
end

In Figure 9.3 we have plotted the 20 iterations of the algorithm; convergence seems almost immediate, to θ̂ = 3.7.... It is perhaps questionable whether using the EM-algorithm here was simpler than direct maximisation, and the example should be viewed as an illustration only.
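For comparison, the direct maximisation can be carried out numerically; a sketch using the same truncation intervals [yj, yj + 1] as the recursion above:

% Direct numerical maximisation of the observed-data likelihood
negloglik = @(th) -sum(log(exp(-y/th)-exp(-(y+1)/th)));
thhat = fminsearch(negloglik,1)   % should agree with the EM limit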


Figure 9.3: Iterations θ (i) for the truncated data example.

Example 9.5. Assume a radioactive material is known to emit Po(100) particles in unit time. A measurement device records each particle with probability θ0. If the recorded value was y = 84, what is the maximum-likelihood estimate of θ0? Here we failed to observe Z, the total number of particles emitted. Conditionally on Z, y is an observation of a Bin(Z, θ0) variable. The augmented likelihood is

L_{y,z}(θ) = fθ(y, z) = (z!/(y!(z − y)!)) θ^y (1 − θ)^{z−y} · 100^z exp(−100)/z! ∝ θ^y (1 − θ)^{z−y}


and

z ↦ fθ(z|y) ∝ (z!/(z − y)!) (1 − θ)^{z−y} · 100^z exp(−100)/z! ∝ (100(1 − θ))^{z−y}/(z − y)!,

which we recognise as the probability mass function of y + W, where W ~ Po(100(1 − θ)). Hence,

Q̃(θ, θ^(i)) = y log(θ) + log(1 − θ) E(Z^(i) − y) = y log(θ) + 100 log(1 − θ)(1 − θ^(i)),

which is maximised by θ^(i+1) = y/(y + 100(1 − θ^(i))). In Matlab

y=84;
t(1)=0.5;
for i=1:50
  t(i+1)=y/(y+100*(1-t(i)));
end

which converges to 0.84 as expected.
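The limit can be checked by solving for a fixed point of the recursion: with y = 84,

θ = y/(y + 100(1 − θ))  ⟺  100θ² − 184θ + 84 = 0  ⟺  θ ∈ {0.84, 1},

and 0.84 is the root the iteration converges to from the starting value 0.5.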


Figure 9.4: Iterations θ (i) for the Poisson-Binomial example.

Temp (C)   Failure times (hours)
150        8064* 8064* 8064* 8064* 8064* 8064* 8064* 8064* 8064* 8064*
170        1764 2772 3444 3542 3780 4860 5196 5448* 5448* 5448*
190        408 408 1344 1344 1440 1680* 1680* 1680* 1680* 1680*
220        408 408 504 504 504 528* 528* 528* 528* 528*

Table 9.1: Censored regression data.

Example 9.6 (Censored regression). Here we consider a regression problem involving censored data. Consider the data in Table 9.1. These data represent failure times (hours) of motorettes at four different temperatures (Celsius). The asterisks denote a censored observation, so that, for example, 8064* means the item lasted at least 8064 hours. Physical considerations suggest the following model relating the logarithm (base 10) of lifetime, yi = log10(ti), to xi = 1000/(temperature + 273.2):

Yi = β0 + β1 xi + εi,   (9.10)

where εi ~ N(0, σ²) and the xi are fixed. On this scale a plot of yi against xi is given in Figure 9.5, in which censored data are plotted as open circles. The transformed data are stored in motorette.mat, where the first 17 values are uncensored and the rest censored. Now, in this situation, if the data were uncensored we would have a simple regression problem. Because of the censoring, we use the EM algorithm to first estimate where the censored values actually are (the E-step) and then fit the model to the augmented data set (the M-step). Re-ordering the data so that the first m values are uncensored, and denoting by Zi the augmented data values, we have the augmented log-likelihood:

log(L_{y,Z}(θ)) = −n log σ − (Σ_{i=1}^{m} (yi − β0 − β1 xi)² + Σ_{i=m+1}^{n} (Zi − β0 − β1 xi)²)/(2σ²).   (9.11)

The E-step requires us to find the expectations of Zi given the censoring time ci. Unconditionally on the censoring, the distribution of each Zi is normal, so we need the moments of a normal distribution conditioned on Zi > ci. It is fairly straightforward to show that, with θ = (β0, β1, σ²),

Mi(θ) = E(Zi|Zi > ci) = µi + σH((ci − µi)/σ)

and

Vi(θ) = E(Zi²|Zi > ci) = µi² + σ² + σ(ci + µi)H((ci − µi)/σ),


Figure 9.5: Censored regression data


where µi = β0 + β1 xi and H(x) = φ(x)/(1 − Φ(x)). Thus

Q(θ, θ^(i)) = −n log σ − Σ_{j=1}^{m} (yj − β0 − β1 xj)²/(2σ²) − Σ_{j=m+1}^{n} E((Z_j^(i) − β0 − β1 xj)²)/(2σ²)
            = −n log σ − Σ_{j=1}^{m} (yj − β0 − β1 xj)²/(2σ²) − Σ_{j=m+1}^{n} (Vj(θ^(i)) − 2(β0 + β1 xj)Mj(θ^(i)) + (β0 + β1 xj)²)/(2σ²).

σ (i+1)2 =

1X (i+1) 2 (yj − µj ) n j=1

n σ (i)2 X (i) (i) + (1 + ((cj − µj )/σ (i) )H((cj − µj )/σ (i) ). n j=m+1

In Matlab, with starting values fitted through the uncensored observations, b=polyfit(x(1:17),y(1:17),1); mu=b(1,2)+b(1,1).*x; s=std(y(1:17)-mu(1:17)); d=(y(18:40)-mu(18:40))/s(1); for i=1:50 M(1:23)=mu(18:40)+s(i)*normpdf(d)./(1-normcdf(d)); b(i+1,1:2)=polyfit(x,[y(1:17) M],1); mu=b(i+1,2)+b(i+1,1).*x; d=(y(18:40)-mu(18:40))/s(i); s1=(y(1:17)-b(i+1,2)-b(i+1,1)*x(1:17)).^2; s2=s(i)^2*(1+d.*normpdf(d)./(1-normcdf(d))); s(i+1)=sqrt(mean([s1 s2])); end The sequence of linear fits is shown in Figure 9.7. Note that the initial fit (bottom line) was made without taking censored data into account, so the difference between the first and final fits gives an indication of the necessity of including the censored observations.

102

CHAPTER 9. MISSING DATA MODELS

6

4

2

0

−2

−4

−6

−8

0

10

20

30 Iteration

40

50

60

Figure 9.6: Convergence of parameter estimates for censored regression example, from top to bottom β1 , σ, β0 .


Figure 9.7: EM iterates to regression fit of censored data

Part III

Applications to Bayesian inference


Chapter 10

Bayesian statistics

During the past 30 years, interest in the Bayesian approach to statistics has increased dramatically, much due to the popularisation of MCMC methods, which have made previously intractable calculations straightforward with a fast computer. The fundamental difference between the Bayesian and the classical approach is that instead of viewing parameters as unknown and fixed quantities, the Bayesian views them as unobserved random variables. Hence, to a Bayesian statistician, a statement like P(θ > 0) makes sense and will give an answer in [0, 1], while a classical/frequentist statistician would argue that the probability is either zero or one (and we are not likely to deduce the answer) since there are no random variables involved. The view that the parameter is a random variable should not be interpreted as "there is no fixed true value θ0", but rather as a way to model the uncertainty involved in this unknown value.

10.1 Bayesian statistical models

Remember that a classical statistical model was defined by the set P of possible distributions Pθ of data. A Bayesian statistical model is defined by the joint distribution of data and parameters, i.e. the density f(y, θ) = f(y|θ)f(θ) of (Y, Θ) (here and further on, Θ is a random variable and not a parameter space). Roughly, a Bayesian analysis proceeds through the following four steps:

1. Specification of a likelihood f(y|θ), the distribution of Y|Θ = θ, data conditionally on the parameters.
2. Determination of a prior distribution f(θ) of the unknown parameters Θ.
3. Observing data y and computing the posterior distribution f(θ|y), the distribution of Θ|Y = y, parameters conditionally on observed data.
4. Drawing inferences from this posterior distribution.

Essentially new here (in comparison with the classical approach) is the concept of a prior distribution, and the fact that inference is based on the posterior


distribution f(θ|y) rather than the likelihood f(y|θ). Another difference is that the classical assumption of independent and identically distributed observations transfers to the assumption of exchangeable observations. In a Bayesian model, the distribution of the data vector Y = (Y1, ..., Yn) depends on the value of the random variable Θ; hence the components of the data-vector will in general not be independent (though they are often assumed independent conditionally on Θ).

10.1.1 The prior distribution

The prior distribution f(θ) summarises the a priori knowledge of Θ, i.e. the knowledge of Θ obtained before observing data. For example, if you "know" that 2 ≤ Θ ≤ 5, it may be appropriate to let f(θ) be a uniform distribution on this interval. Prior considerations tend to vary from person to person, and this leads to subjective choices of prior distribution. The main objection to Bayesian inference is that such subjective choices of prior will necessarily affect the final conclusion of the analysis. Historically, Bayesians have argued that prior information is always available and that the subjective view is advantageous. Nowadays, much research in Bayesian statistics is instead focused on finding formal rules for "objective" prior elicitation. Classical statisticians, on the other hand, tend to argue: "Bayesians are subjective, I am not a Bayesian, hence my analysis is objective". Yet classical choices of estimators and test-statistics are of course subjective in a similar manner, and both approaches make the subjective choice of a likelihood function. We will leave this discussion to the philosophers for now and just mention that as more data is collected, the prior distribution will have less influence on the conclusion.

10.1.2 The posterior distribution

The posterior distribution f(θ|y) summarises the information about Θ provided by data and prior considerations. It is given by Bayes' Theorem, which is essentially a trivial consequence of the definition of conditional probabilities:

θ ↦ f(θ|y) = f(y|θ)f(θ)/f(y) ∝ f(y|θ)f(θ),   (10.1)

i.e. the posterior density is proportional to the product of the likelihood and the prior. Based on this density, any question of interest regarding Θ can be answered through basic manipulations derived from probability theory only. No asymptotic or "statistical" approximations are necessary. However, for complex models, numerical approximations are often necessary, and this is where MCMC comes into the picture. By constructing a Markov chain Θ^(i), i = 1, ..., with f(θ|y) as stationary distribution, we can answer questions about f(θ|y) by Monte-Carlo integration.

Example 10.1. If Y|Θ = θ ~ Bin(n, θ) and, a priori, Θ is standard uniform, then

f(θ|y) ∝ θ^y (1 − θ)^{n−y},   θ ∈ [0, 1],   (10.2)


which we recognise as (proportional to) the density of a Beta(y + 1, n − y + 1) distribution. Of course we do not need to calculate the normalising constant f(y) = ∫_0^1 θ^y (1 − θ)^{n−y} dθ to come to this conclusion. In Chapter 8 we considered a problem where 59 out of 100 coin-flips turned tails up. If Θ is the probability of tails and with the above prior specification, the posterior distribution of Θ is Beta(60, 42). The posterior probability P(Θ = 1/2|Y = 59) is then equal to 0, since the Beta-distribution is a continuous distribution. But this is simply a consequence of the precise mathematical nature of the statement; we do not really believe there is a coin that lands tails up with probability exactly equal to 1/2 (or a coin-flipper that could flip it in an exactly fair manner, for that matter). More interesting is the statement P(Θ > 1/2|Y = 59) = ∫_{1/2}^{1} f(θ|y = 59) dθ = 0.9636..., which suggests rather strong posterior evidence that the coin is biased. In the coin-flipping example it seems reasonable to assume a priori that Θ is close to 1/2 (it seems hard to construct a coin that always/never ends up tails). Hence, we may choose a Beta(α, α) prior for some α > 1 (the standard uniform corresponds to α = 1). This gives posterior

f(θ|y) ∝ θ^y (1 − θ)^{n−y} θ^{α−1} (1 − θ)^{α−1}   (10.3)
       = θ^{y+α−1} (1 − θ)^{n−y+α−1},   θ ∈ [0, 1],   (10.4)

i.e. a Beta(y + α, n − y + α) posterior distribution. In Figure 10.1 we have plotted the posterior distributions for this example together with the priors for α equal to 1, 2, 10 and 50. Note that in this example we consider α a fixed parameter that indexes the prior distribution. Such a parameter is often referred to as a hyperparameter. Note that the mean of the Beta(y + α, n − y + α) distribution is (y + α)/(n + 2α) and its variance is O(n^{-1}); thus, for sufficiently large n, the posterior distribution will be concentrated around the Maximum-Likelihood estimator y/n regardless of the choice of prior hyperparameter α. This is generally the case; Bayesian and classical conclusions tend to agree for large samples, in the sense that the Maximum-Likelihood estimator t(Y) will have a distribution close to f(θ|y) for large samples. For small samples you have to choose whether you want to rely on the classical statistician's large-sample approximations or the Bayesian statistician's prior distribution.

Multi-parameter models

Classical statistical approaches often fail to produce reasonable answers when a model contains too many unknown parameters; this is not a problem (in principle) with a Bayesian model. Note that since parameters are treated as unobserved random variables, there is no formal distinction between parameters and missing/latent data. Hence, if you are only interested in one of the model-parameters, say Θ1, where Θ = (Θ1, ..., Θd), its posterior distribution is given by integrating the other parameters out of the problem, i.e.

f(θ1|y) = ∫ f(θ|y) dθ2 ··· dθd ∝ ∫ f(y|θ)f(θ) dθ2 ··· dθd.   (10.5)


Figure 10.1: Prior (dotted) and posterior (solid) distributions for α equal to 1, 2, 10 and 50

When the above integration is not feasible, a Monte-Carlo approach is to simulate vectors θ^(i) = (θ1^(i), ..., θd^(i)) from the posterior f(θ|y) (which is easy to derive up to proportionality using (10.1)); the first coordinates θ1^(i) are then a sample from the marginal f(θ1|y) and can be used to approximate related quantities through Monte-Carlo integration.

Prediction

Suppose you have data y, an observation of Y = (Y1, ..., Yn), and want to say something about a "future observation" Y_{n+1}. For example, you might be interested in the probability that Y_{n+1} lies in some set A given the information contained in y, P(Y_{n+1} ∈ A|Y = y). A solution based on classical statistics (using the plug-in principle) would be to compute P(Y_{n+1} ∈ A|Θ = θ̂) for an estimate θ̂ of θ0. This can be a reasonable approximation, but it fails to take into account the uncertainty of the estimate θ̂ (we are effectively conditioning on an event that we know is only approximately true). Hence, for example, prediction intervals based on classical statistics tend to be too short. The Bayesian solution, however, follows directly from probabilistic principles (rather than more ad-hoc statistical principles like the plug-in principle), since given a Bayesian model

P(Y_{n+1} ∈ A|y) = ∫_A ∫ f(y_{n+1}, θ|y) dθ dy_{n+1},   (10.6)


where f(y_{n+1}, θ|y) is the density of (Y_{n+1}, Θ)|Y = y. Hence, a Bayesian integrates out instead of plugging in. This is a very attractive feature of Bayesian statistics; once a model is defined, answers follow directly from probability theory only, and "statistical approximations" are unnecessary. The "ad-hockery" lies instead in the model-formulation, and especially so in the choice of prior.

Example 10.2. Let's continue Example 10.1 and predict the result Y2 of 10 independent new flips of the coin. We have observed y1 = 59, an observation of Y1|Θ = θ ~ Bin(100, θ), and we choose a uniform prior for Θ. First note that since Y1 and Y2 are conditionally independent given Θ = θ,

f(θ|y1, y2) ∝ f(y1, y2|θ)f(θ) = f(y2|θ)(f(y1|θ)f(θ)) ∝ f(y2|θ)f(θ|y1).

Hence, deriving the posterior f(θ|y1, y2) with prior f(θ) is equivalent to deriving the posterior from the likelihood f(y2|θ) with f(θ|y1) as prior. Now, we were interested in

f(y2|y1) = ∫ f(y2, θ|y1) dθ = ∫ f(y1, y2|θ)f(θ)/f(y1) dθ
        ∝ ∫ f(y1|θ)f(y2|θ)f(θ) dθ
        ∝ (1/((10 − y2)! y2!)) ∫_0^1 θ^{y1+y2} (1 − θ)^{110−y1−y2} dθ
        ∝ (y1 + y2)!(110 − y1 − y2)!/((10 − y2)! y2!),   y2 ∈ {0, 1, ..., 10}.

We have plotted this distribution in Figure 10.2.
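Both the posterior probability quoted in Example 10.1 and this predictive distribution are easy to evaluate in Matlab (a quick sketch, not in the original text):

% P(Theta > 1/2 | Y = 59) under the uniform prior (Example 10.1)
P = 1 - betacdf(0.5,60,42)     % approx 0.9636

% Predictive pmf of Y2 | Y1 = 59 (Example 10.2), normalised over y2
y1 = 59; y2 = 0:10;
f = factorial(y1+y2).*factorial(110-y1-y2)./(factorial(10-y2).*factorial(y2));
f = f/sum(f);
bar(y2,f)                      % cf. Figure 10.2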

10.2 Choosing the prior

10.2.1 Conjugate priors

In Example 10.1 we saw that when using a Beta-distribution as a prior for Θ, the posterior distribution was also Beta but with different parameters. This property of the model, that the prior and posterior distributions belong to the same family of distributions, is an example of conjugacy, and a prior with this property (i.e. a Beta-prior in Example 10.1) is referred to as a conjugate prior.

Example 10.3. Assume y is an observation of Y|Θ = θ ~ Po(θ) and we choose a Gamma(α, β) prior for Θ, where (α, β) are fixed hyperparameters. Then

f(θ|y) ∝ f(y|θ)f(θ) ∝ θ^y exp(−θ) θ^{α−1} exp(−θβ) = θ^{y+α−1} exp(−θ(1 + β)),

i.e. the posterior is Gamma(y + α, 1 + β) and hence the Gamma-prior is conjugate for a Poisson likelihood just as the Beta was conjugate for a Binomial



Figure 10.2: Predictive distribution of Y2 |Y1 = 59 in Example 10.2.

likelihood. Note that we use Gamma(α, β) to represent the distribution with density proportional to y^{α−1} exp(−yβ), and that this distribution is sometimes denoted Gamma(α, 1/β) (the latter is for example used by Matlab!). Conjugate priors are often applied in practice, but they are difficult to derive in more complex models. Moreover, their motivation is mainly based on mathematical convenience, and the concept does not give much guidance on how to choose the hyperparameters. Table 10.1 lists conjugate priors and their posteriors for some common models; you may want to check these results.

10.2.2 Improper priors and representing ignorance

Consider the conjugate N(m, s²) prior for Θ in the N(θ, σ²) likelihood (Table 10.1). The posterior is here

N((m/s² + nȳ/σ²)/(1/s² + n/σ²), 1/(1/s² + n/σ²)) → N(ȳ, σ²/n) as s → ∞.

The limiting prior distribution (i.e. N(m, ∞)) can be interpreted as f(θ) ∝ 1 for θ ∈ R, but this is not a probability density since it is not integrable. A prior of this type, i.e. one that fails to integrate to one, is called an improper prior. Improper priors are allowed in Bayesian analysis as long as the posterior is proper, i.e. as long as the right-hand side of f(θ|y) ∝ f(y|θ)f(θ) is integrable. Choosing a prior that is uniformly distributed over the whole real line is a common strategy in cases where there exists no reliable prior information.

Likelihood      Prior          Posterior
Bin(n, θ)       Beta(α, β)     Beta(α + y, β + n − y)
Ge(θ)           Beta(α, β)     Beta(α + n, β + Σ_{i=1}^{n} yi − n)
NegBin(n, θ)    Beta(α, β)     Beta(α + n, β + y − n)
Gamma(k, θ)     Gamma(α, β)    Gamma(α + nk, β + Σ_{i=1}^{n} yi)
Po(θ)           Gamma(α, β)    Gamma(α + Σ_{i=1}^{n} yi, β + n)
N(µ, θ^{-1})    Gamma(α, β)    Gamma(α + n/2, β + Σ_{i=1}^{n} (yi − µ)²/2)
N(θ, σ²)        N(m, s²)       N((m/s² + nȳ/σ²)/(1/s² + n/σ²), 1/(1/s² + n/σ²))

Table 10.1: Conjugate priors for Θ for some common likelihoods. All parameters except θ are assumed fixed and data (y1 , . . . , yn ) are conditionally independent given Θ = θ.
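One way to check a row of the table numerically is to compare the claimed posterior with likelihood × prior on a grid; a sketch for the Binomial/Beta row, with arbitrary illustrative numbers:

n = 100; y = 59; a = 2; b = 2;              % illustrative values (assumptions)
th = linspace(0.001,0.999,999);
post1 = binopdf(y,n,th).*betapdf(th,a,b);   % f(y|theta)f(theta)
post1 = post1/trapz(th,post1);              % normalise numerically
post2 = betapdf(th,a+y,b+n-y);              % claimed Beta posterior
max(abs(post1-post2))                       % should be close to zero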

Indeed, the uniform distribution does not "favour" any particular value and can be seen as a way to represent ignorance. Now, if you are ignorant about a parameter Θ, you are likely to be equally ignorant about Θ′ = h(Θ) for some function h. But if Θ has a uniform prior, fΘ(θ) = 1, then (with g = h^{-1}) this corresponds to using a prior fΘ′(θ′) = fΘ(g(θ′))|g′(θ′)| = |g′(θ′)| on Θ′ = h(Θ) (c.f. the theorem on transformation of random variables, Theorem 3.1). Hence, ignorance represented by the uniform prior distribution does not translate across scales. For example, if you put a uniform prior on R+ on the parameter Θ = σ in a Normal likelihood, this corresponds to using a prior proportional to 1/√θ′ for the parameter Θ′ = σ². Uniformity is also not very natural for parameters that do not represent location. For example, if Θ is a scale parameter, i.e. the likelihood is of the form f(y|θ) = g(y/θ)/θ, then we would like to put more weight on the interval θ ∈ [0, 1] than on the interval θ ∈ [10, 11], since even if the intervals are of the same length, choosing between θ = 0.1 and θ = 0.9 is going to affect the conclusion dramatically while choosing between θ = 10.1 and θ = 10.9 is not.

Jeffreys' prior

A prior that does translate across scales is Jeffreys' prior. Recall the concept of Fisher information,

IΘ(θ) = −E(d² log(f(Y|θ))/dθ²) = E((d log(f(Y|θ))/dθ)²).   (10.7)

The Jeffreys prior is then defined as

f(θ) = |IΘ(θ)|^{1/2}.   (10.8)

This means that if Θ′ = h(Θ) for a one-to-one transformation h with inverse g, then

fΘ′(θ′) = |IΘ(g(θ′))|^{1/2} |g′(θ′)| = |IΘ′(θ′)|^{1/2},   (10.9)


since

IΘ′(θ′) = −E(d² log(f_{Y|Θ′}(Y|θ′))/dθ′²) = −E(d² log(f_{Y|Θ}(Y|g(θ′)))/dθ′²) = IΘ(g(θ′)) g′(θ′)²,

which also explains the square root.

Example 10.4. If we have observations y1, ..., yn from Y|Θ = θ ~ N(θ, σ²), σ² known, then

I(θ) = −E(d²(Σ_{i=1}^{n} −(Yi − θ)²/(2σ²))/dθ²) = n/σ²,   (10.10)

which does not depend on θ, i.e. for a Normal location parameter the Jeffreys prior is uniform on the real line.

Example 10.5. If y is an observation of Y|Θ = θ ~ Bin(n, θ), then

I(θ) = −E(d²(Y log(θ) + (n − Y) log(1 − θ))/dθ²) = E(Y/θ² + (n − Y)/(1 − θ)²) = n(1/θ + 1/(1 − θ)) = n/(θ(1 − θ)).

Hence, the Jeffreys prior for the Binomial likelihood is proportional to θ^{-1/2}(1 − θ)^{-1/2}, i.e. a Beta(1/2, 1/2) prior (and not a standard uniform as we might have expected). Jeffreys priors can be extended to multi-parameter models (though Jeffreys himself emphasised the use in single-parameter models), but we will not pursue this further here.
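As a further worked case, following the same recipe (not in the original text): for n conditionally independent Po(θ) observations,

I(θ) = −E(d²(Σ_{i=1}^{n} (Yi log(θ) − θ))/dθ²) = E(Σ_{i=1}^{n} Yi/θ²) = n/θ,

so the Jeffreys prior is f(θ) ∝ θ^{-1/2}, an improper prior on the positive half-line.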

10.3 Hierarchical models

Consider a Bayesian statistical model where the parameter Θ can be partitioned into blocks Θ = (Θ1, ..., Θd), where the components Θi may be vector-valued. A hierarchical prior is a prior that factorises as

f(θ) = f(θ1)f(θ2|θ1) ··· f(θd|θ_{d−1}),   (10.11)

i.e. according to a Markov structure where Θi is independent of Θj, j < i − 1, conditionally on Θ_{i−1} = θ_{i−1}. In a similar fashion, a hierarchical model is a model where the joint density f(y, θ) factorises as

f(y, θ) = f(θ1)f(θ2|θ1) ··· f(θd|θ_{d−1})f(y|θd).   (10.12)

A model of this form is often represented graphically as

Θ1 −→ Θ2 −→ ··· −→ Θd −→ Y,   (10.13)

emphasising that data is generated sequentially, starting with Θ1. Hierarchical models form a convenient framework in many cases where observations can be assumed exchangeable within subsets of the sample.


Example 10.6. Assume we are interested in the proportion of domestic animals in a population that carry a certain virus. In order to estimate this proportion, n randomly chosen herds are examined, giving observations y1, ..., yn, where yi is the number of infected animals in herd i. A simple model here would be to assume yi is an observation of Yi|Θ = θ ~ Bin(ni, θ), where ni is the number of animals in herd i. But this is clearly inappropriate if e.g. the virus is contagious, since it is based on the assumption that animals in a particular herd are infected independently of each other. More appropriate is to assume that animals are exchangeable within but not across herds, which leads to the likelihood model Yi|Pi = pi ~ Bin(ni, pi). If we in addition assume the herds to be exchangeable, the model is completed by assuming Pi|(A, B) = (a, b), i = 1, ..., n, to be independent Beta(a, b) conditionally on (A, B) = (a, b), and finally A and B to be a priori independent Gamma(αA, βA) and Gamma(αB, βB) for hyperparameters (αA, βA, αB, βB). This defines a hierarchical model with Θ1 = (A, B) and Θ2 = (P1, ..., Pn). In this model, the parameter Θ1 describes the spread of the disease at the full population-level and Θ2 the spread at the herd-level (which is likely to be less interesting than the former). A graphical description for n = 3 would be

      ր P1 −→ Y1
Θ1 −→ P2 −→ Y2
      ց P3 −→ Y3
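A forward simulation from this hierarchy follows the graph step by step (a sketch with hypothetical hyperparameters and herd sizes; recall that Matlab's gamrnd takes the scale 1/β as its second argument):

aA=1; bA=1; aB=1; bB=1;        % hypothetical hyperparameters (assumptions)
ni=[25 40 30]; n=length(ni);   % hypothetical herd sizes (assumptions)
A=gamrnd(aA,1/bA);             % A ~ Gamma(alphaA, betaA)
B=gamrnd(aB,1/bB);             % B ~ Gamma(alphaB, betaB)
p=betarnd(A,B,1,n);            % P_i | (A,B) i.i.d. Beta(A,B)
y=binornd(ni,p)                % Y_i | P_i = p_i ~ Bin(n_i, p_i)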

Example 10.7 (HMM). A hidden Markov model as in Example 9.2 is also hierarchical: let Θ2 = (Z1, ..., Zn), where Θ2|Θ1 = θ1 is a stationary Markov chain with transition matrix Θ1, and let the observations be draws from Y = (Y1, ..., Yn), Yi = Zi + εi, where the εi are conditionally (on Θ3 = θ3) independent N(0, θ3) and Θ3 is a priori independent of everything else. The joint distribution can then be factorised as

f(y, θ) = f(θ1)f(θ2|θ1)f(θ3)f(y|θ2, θ3).

This can be described by the graph

Θ1 −→ Θ2 −→ Y
           ր
        Θ3


Chapter 11

Bayesian computation

While the posterior distribution of the full parameter vector Θ = (Θ1, ..., Θd) is usually easy to derive (up to proportionality) using Bayes' theorem, it is difficult to interpret (being a function of several variables θ1, ..., θd). Hence, being able to compute marginals, i.e.

f(θi|y) = ∫ f(θ1, ..., θd|y) dθ1 ··· dθ_{i−1} dθ_{i+1} ··· dθd,   (11.1)

probabilities

P(Θ ∈ A|Y = y) = ∫_A f(θ1, ..., θd|y) dθ1 ··· dθd,   (11.2)

or predictive distributions

f(y_{n+1}|y) = ∫ f(y_{n+1}, θ1, ..., θd|y) dθ1 ··· dθd,   (11.3)

is crucial in order to draw conclusions in a Bayesian analysis. Historically, the above integrals have only been analytically available for rather simple models. It is with the advent of efficient MCMC simulation techniques that Bayesian statistical analysis has been popularised and its full potential realised. Indeed, with the aid of a large sample θ^(1), ..., θ^(N) from f(θ|y), approximating the above integrals is straightforward with Monte-Carlo methods, since the integrands are easily derived up to proportionality using Bayes' theorem.

11.1 Using the Metropolis-Hastings algorithm

If Θ ∈ R^d and d is not too large, a simple choice is to sample from f(θ|y) using the Metropolis-Hastings algorithm. Here we can fully appreciate the fact that the normalising constant f(y) = ∫ f(y, θ) dθ need not be known.

11.1.1 Predicting rain

Figure 11.1 displays annual maximum daily rainfall values recorded at the Maiquetia international airport in Venezuela for the years 1951-1998. In December 1999 a daily precipitation of 410 mm caused devastation and an estimated 30000 deaths.


Figure 11.1: Annual maximum daily rainfall.

A popular model for extreme values (based on asymptotic arguments, see the course MAS231/FMS155) is the Gumbel distribution with density function

f(x|θ) = (1/θ2) exp((x − θ1)/θ2) exp{−exp((x − θ1)/θ2)},   θ2 > 0.   (11.4)

It is not really appropriate for this example, since it may give negative values, but we disregard this fact since we only aim to illustrate the computations involved. Our goal is to predict future rainfall. We will assume the data to be independent and Gumbel(θ) conditionally on Θ = θ. Further, we assume Θ1 and Θ2 to be a priori independent with improper prior f(θ1, θ2) = θ2^{-1}, θ2 > 0. This gives a posterior distribution

f(θ|y) ∝ (Π_{i=1}^{n} f(yi|θ)) f(θ) ∝ (Π_{i=1}^{n} (1/θ2) exp((yi − θ1)/θ2) exp{−exp((yi − θ1)/θ2)}) / θ2.

To compute the predictive density, we need to solve the integral

∫ f(y_{n+1}, θ|y) dθ = ∫ f(y_{n+1}, y|θ)f(θ)/f(y) dθ   (11.5)
                     ∝ ∫ f(y_{n+1}|θ)f(y|θ)f(θ) dθ,   (11.6)

where f(y_{n+1}|θ) is the Gumbel density. We start by sampling from f(θ|y) using a Gaussian random walk MH-algorithm, in Matlab

acc=0; n=length(y);
th=50*ones(2,10000);
for i=1:9999;
  thprop=th(:,i)+2*mvnrnd([0 0],[32 -6;-6 16])';
  yi=(y-th(1,i))/th(2,i);
  yp=(y-thprop(1))/thprop(2);
  a=min(exp(sum(yp-yi)-sum(exp(yp)-exp(yi)))*...
    (th(2,i)/thprop(2))^(n+1)*(thprop(2)>0),1);
  if rand<a                    % accept/reject step; the remainder of this loop
    th(:,i+1)=thprop;          % is a sketch completing the truncated original
    acc=acc+1;                 % listing
  else
    th(:,i+1)=th(:,i);
  end
end

Based on the output th, we approximate P(Y_{n+1} > y_{n+1}|y) by

P̂(Y_{n+1} > y_{n+1}|y) = (1/10000) Σ_{i=1}^{10000} (1 − F(y_{n+1}|θ^(i))),

where F is the Gumbel distribution function in (11.7). Figure 11.6 shows a plot of this function.
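From (11.4), 1 − F(y|θ) = exp{−exp((y − θ1)/θ2)}, so given the matrix th of posterior draws above, the Monte-Carlo sum is a few lines of Matlab (a sketch; one may wish to discard an initial burn-in period from th first):

yn1 = 0:200;                  % grid of future rainfall values
Phat = zeros(size(yn1));
for k = 1:length(yn1)
  Phat(k) = mean(exp(-exp((yn1(k)-th(1,:))./th(2,:))));
end
plot(yn1,Phat)                % cf. Figure 11.6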


Figure 11.6: Predictive distribution P̂(Y_{n+1} > y_{n+1}|y) for rainfall data.

A more appropriate analysis of this example can be found in
http://web.maths.unsw.edu.au/~scott/papers/paper_hydrology.pdf

11.2 Using the Gibbs-sampler

Recall the concept of conjugate priors that led to simple posteriors within the same family of distributions. For complex statistical models, conjugate


priors are rarely available. However, the related property of conditional conjugacy often is. If the parameter is Θ = (Θ1, ..., Θd), then a prior f(θ) is said to be conditionally conjugate for Θi if f(θi|θ_{−i}, y) is in the same family of distributions as f(θi|θ_{−i}) for all y (here we write θ_{−i} for (θ1, ..., θ_{i−1}, θ_{i+1}, ..., θd), i.e. the parameter vector with θi removed). While it is not too important that f(θi|θ_{−i}, y) belongs to the same class of distributions as the conditional prior, the fact that it is often of simple form (when the full posterior f(θ|y) is not) greatly simplifies computation, since it implies there will be a Gibbs-sampler naturally available.

11.2.1 A Poisson change-point problem

To illustrate the use of Gibbs sampling in Bayesian statistics, we will look at a series relating to the number of British coal mining disasters per year, over the period 1851 – 1962. A plot of these data is given in Figure 11.7, and the time series itself is stored in the file coal.mat.

Figure 11.7: Time series of counts of coal mine disasters

From this plot it does seem to be the case that there has been a reduction in the rate of disasters over the period. We will assume there has been an abrupt change at year Θ1 (corresponding to e.g. the introduction of a new legislation on security measurements) for which we assume an uniform prior


on {1, . . . , 1962 − 1850 = n}. Further, we assume the number of disasters Yi in year i to follow a Po(Θ2) distribution for i = 1, . . . , Θ1 and a Po(Θ3) distribution for i = Θ1 + 1, . . . , n. Finally we choose a hierarchical prior for (Θ2, Θ3) by assuming them to be independent Exp(Θ4), where Θ4 is a priori Exp(0.1), independent of everything else. We want to update the four parameters using a Gibbs-sampler, and hence compute the conditionals

f(θ1|θ2, θ3, θ4, y) ∝ f(y|θ1, . . . , θ4)f(θ1, . . . , θ4)
                    ∝ ∏_{i=1}^{θ1} [θ2^{yi} exp(−θ2)/yi!] ∏_{i=θ1+1}^{n} [θ3^{yi} exp(−θ3)/yi!]
                    ∝ θ2^{Σ_{i=1}^{θ1} yi} exp(−θ1 θ2) θ3^{Σ_{i=θ1+1}^{n} yi} exp(−(n − θ1)θ3) = L(θ1),

for θ1 = 1, . . . , n. This is a discrete distribution with

P(Θ1 = i|θ2, θ3, θ4, y) = L(i) / Σ_{j=1}^{n} L(j),  i = 1, . . . , n.   (11.10)

Next,

f(θ2|θ1, θ3, θ4, y) ∝ f(y|θ1, . . . , θ4)f(θ1, . . . , θ4)
                    ∝ θ2^{Σ_{i=1}^{θ1} yi} exp(−θ1 θ2) f(θ2|θ4)
                    ∝ θ2^{Σ_{i=1}^{θ1} yi} exp(−θ1 θ2) exp(−θ2 θ4),

which we recognise as a Gamma(1 + Σ_{i=1}^{θ1} yi, θ1 + θ4) distribution. In a similar fashion we find that f(θ3|θ1, θ2, θ4, y) is a Gamma(1 + Σ_{i=θ1+1}^{n} yi, (n − θ1) + θ4) distribution. Finally,

f(θ4|θ1, θ2, θ3, y) ∝ f(y|θ1, . . . , θ4)f(θ1, . . . , θ4) ∝ f(θ2, θ3, θ4) = f(θ4)f(θ2|θ4)f(θ3|θ4)
                    ∝ exp(−0.1θ4) θ4 exp(−θ2 θ4) θ4 exp(−θ3 θ4) = θ4^2 exp(−(0.1 + θ2 + θ3)θ4),

i.e. a Gamma(3, 0.1 + θ2 + θ3) distribution. Hence, while the posterior distributions of Θ2, Θ3, Θ4 are not Gamma-distributed (you may want to check this), their conditional distributions are. This is an example of the conditional conjugacy that makes Gibbs-sampling easy to implement. A Gibbs-sampler where Θ1 is updated using a random-walk MH-algorithm (adding a uniform random variable on {−5, . . . , 5}) is set up in Matlab as follows:

n=length(y); th1=50*ones(1,10000);
th2=ones(1,10000); th3=th2; th4=th3;
for i=1:9999
  % Draw theta1 with RW-MH
  Li=getL(th1(i),th2(i),th3(i),y);
  th1prop=th1(i)+floor(11*rand)-5;
  Lp=getL(th1prop,th2(i),th3(i),y);
  if rand<Lp/Li              % accept/reject; the original listing breaks
    th1(i+1)=th1prop;        % off here, so the remaining lines are a
  else                       % reconstruction along the derivation above
    th1(i+1)=th1(i);
  end
  % Draw theta2, theta3, theta4 from their Gamma conditionals
  % (gamrnd takes the scale, i.e. 1/rate, as its second argument)
  th2(i+1)=gamrnd(1+sum(y(1:th1(i+1))),1/(th1(i+1)+th4(i)));
  th3(i+1)=gamrnd(1+sum(y(th1(i+1)+1:n)),1/(n-th1(i+1)+th4(i)));
  th4(i+1)=gamrnd(3,1/(0.1+th2(i+1)+th3(i+1)));
end

From the output we can, for instance, estimate P(Θ2 > Θ3|y) by the proportion of sampled pairs with θ2^(i) > θ3^(i); this estimate roughly equals 1, since θ2^(i) was greater than θ3^(i) for all of the sampled pairs (θ2^(i), θ3^(i)). Trace plots of the draws are shown in Figures 11.8 and 11.9, autocorrelations in Figure 11.10, a histogram of the Θ1-draws in Figure 11.11, and draws from the resulting posterior of the intensity function in Figure 11.12.
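The helper getL, which evaluates L(θ1) of (11.10) up to proportionality, is not listed in the notes; a possible implementation (saved as getL.m) is the following sketch. Working inside a single exp avoids overflow in θ2^{Σ yi}, and returning 0 outside {1, . . . , n} makes the MH-step above reject such proposals automatically.

function L=getL(th1,th2,th3,y)
% L(theta1) of (11.10), up to a constant of proportionality
n=length(y);
if th1<1 || th1>n, L=0; return; end     % outside the support of Theta1
s1=sum(y(1:th1)); s2=sum(y)-s1;
L=exp(s1*log(th2)-th1*th2+s2*log(th3)-(n-th1)*th3);

Since Θ1 only takes n values, one could alternatively compute L(1), . . . , L(n) in every sweep, normalise as in (11.10), and draw Θ1 exactly by inversion; the random-walk update above is cheaper, requiring only two evaluations of L per iteration.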

11.2.2 A regression model

Figure 11.13 shows a plot of hardness against density of Australian timbers. An appropriate model for these data could be a simple linear regression, where we let density x be fixed and model hardness Y as

Y = Θ1 + Θ2 x + ε,  where ε ∼ N(0, 1/Θ3),   (11.11)



Figure 11.8: Draws from Θ1 |y.


Figure 11.9: Draws from Θi |y for i = 2, 3, 4 (from top to bottom)



Figure 11.10: Autocorrelation plot for the draws from Θi , i = 1, . . . , 4.


Figure 11.11: Histogram of sample from Θ1 |y




Figure 11.12: Draws from the posterior distribution of the intensity function.


Figure 11.13: Hardness versus density for Australian timbers.


where we have chosen to work with the precision Θ3 rather than the variance. We complete the model by choosing an improper prior f(θ1, θ2, θ3) = 1. The posterior distribution is given by

f(θ|y) ∝ f(y|θ)f(θ) ∝ θ3^{n/2} exp(−Σ_{i=1}^{n} (yi − θ1 − θ2 xi)^2 θ3/2),

which is of non-standard form. However,

f(θ1|θ2, θ3, y) ∝ exp(−Σ_{i=1}^{n} (yi − θ1 − θ2 xi)^2 θ3/2)
               ∝ exp(−nθ1^2 θ3/2 + θ1 θ3 Σ_{i=1}^{n} (yi − θ2 xi))
               ∝ exp(−nθ3 (θ1 − ȳ + θ2 x̄)^2 / 2),

i.e. an N(ȳ − θ2 x̄, (nθ3)^{-1}) distribution. Similarly we find that f(θ2|θ1, θ3, y) is N((sxy − θ1 n x̄)/sxx, (sxx θ3)^{-1}), where sxx = Σ xi^2 and sxy = Σ xi yi. Finally,

f(θ3|θ1, θ2, y) ∝ θ3^{n/2} exp(−Σ_{i=1}^{n} (yi − θ1 − θ2 xi)^2 θ3/2),

which we recognise as a Gamma(n/2 + 1, Σ_{i=1}^{n} (yi − θ1 − θ2 xi)^2 / 2) distribution. Hence, we can implement a Gibbs-sampler in Matlab as follows:

n=length(y); sxx=sum(x.^2); sxy=sum(x.*y); sx=sum(x); sy=sum(y);
th=ones(3,10000);
for i=1:9999
  th(1,i+1)=normrnd((sy-th(2,i)*sx)/n,sqrt(1/(n*th(3,i))));
  th(2,i+1)=normrnd((sxy-th(1,i+1)*sx)/sxx,sqrt(1/(sxx*th(3,i))));
  % squared residuals in the rate; gamrnd takes the scale (1/rate)
  th(3,i+1)=gamrnd(n/2+1,1/(sum((y-th(1,i+1)-th(2,i+1)*x).^2)/2));
end

Traceplots are shown in Figure 11.14; convergence seems fast, but a plot of Θ1 against Θ2 in Figure 11.15 shows they are strongly dependent a posteriori (the spread-out points to the right of the figure constitute the burn-in). The reason for the strong correlation is that the design points x are not centered around the origin; if x̄ is large, Θ2 is constrained by the fact that the regression line has to have intercept Θ1 and also pass through the point cloud defined by the data. On the other hand, if x̄ = 0, we see that the conditional distribution of Θ1|Θ2, Θ3, y does not depend on Θ2 (and similarly for

128

CHAPTER 11. BAYESIAN COMPUTATION

Θ2|Θ1, Θ3, y). Hence, if we reparametrise the model as Yi = Θ1* + Θ2 zi + εi, with zi = xi − x̄, the performance of the Gibbs-sampler will be improved (the improvement is not huge here, since x̄ is fairly small to begin with; try adding a factor 1000 to the original design points and see what happens then). Traceplots of the reparametrised model are shown in Figure 11.16, and in Figure 11.17 we have made autocorrelation plots of the parameters under the two parametrisations.
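Concretely, the centered sampler might look as follows (a sketch; z, szz, szy are our names, and th is overwritten so that the prediction code below operates on the centered draws). With z̄ = 0 the conditional means simplify, decoupling the updates of Θ1* and Θ2:

z=x-mean(x); szz=sum(z.^2); szy=sum(z.*y);
n=length(y); sy=sum(y);
th=ones(3,10000);
for i=1:9999
  th(1,i+1)=normrnd(sy/n,sqrt(1/(n*th(3,i))));        % Theta1*
  th(2,i+1)=normrnd(szy/szz,sqrt(1/(szz*th(3,i))));   % Theta2
  th(3,i+1)=gamrnd(n/2+1,1/(sum((y-th(1,i+1)-th(2,i+1)*z).^2)/2));
end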


Figure 11.14: Draws from Θ|y in the regression example.

We may now be interested in predicting the hardness (which could be difficult to measure) of a timber with density 65, i.e. (in the new parametrisation) Y_{n+1} = Θ1* + Θ2 (65 − x̄) + ε_{n+1}. The predictive density is given by f(y_{n+1}|y) = E(f(y_{n+1}|Θ)|y), where f(y_{n+1}|Θ) is the N(Θ1* + Θ2 (65 − x̄), 1/Θ3) density. In Matlab:

ypred=linspace(1700,3400,1000); fy=zeros(1,1000);
for i=1:1000
  % 19.3 = 65-mean(x) for these data; th holds the centered draws
  fy(i)=mean(normpdf(ypred(i),th(1,:)+th(2,:)*19.3,1./sqrt(th(3,:))));
end

The result is plotted in Figure 11.18.

11.2.3 Missing data models

In a Bayesian framework there is no formal difference between missing data and unknown parameters. The algorithm corresponding to the EM-algorithm is here the Gibbs-sampler, which iterates draws from the missing data



Figure 11.15: Draws from (Θ1 , Θ2 )|y in the regression example.


Figure 11.16: Draws from Θ|y in the regression example after reparametrisation.



Figure 11.17: Autocorrelation plots of parameters, after reparametrisation (top) and before (bottom).


Figure 11.18: Predictive density for hardness of a timber with density 65


Z and the unknown parameters Θ. This special case of the Gibbs-sampler is often referred to as the data augmentation algorithm. Figure 11.19 shows a histogram of service-times for customers arriving at one of two service stations; however, it is not known which station each customer used. The service-time for station 1 is Exponential with mean 1/Θ1 and for station 2 Exponential with mean 1/Θ2. If a customer uses station 1 with probability Θ3, the observations are draws from the mixture density

f(yi|Θ) = Θ3 exp(−yi Θ1)Θ1 + (1 − Θ3) exp(−yi Θ2)Θ2.   (11.12)

Service station 1 is known to be faster; hence we choose a flat prior f(θ1, θ2) = 1 on 0 < θ2 < θ1. We could sample from the posterior using a Metropolis-Hastings algorithm, but we choose instead to augment the model with indicator variables I1, . . . , In, where Ii = 1 if the i:th customer used service station 1 and Ii = 2 if he used station 2. Hence, a priori, P(Ii = 1) = Θ3, and the model is completed by choosing a standard uniform prior on Θ3.
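For intuition, data of this kind can be simulated from (11.12) by first drawing the indicators; a sketch with hypothetical parameter values:

n=200; th1=3; th2=0.5; th3=0.6;   % hypothetical parameter values
I=1+(rand(1,n)>=th3);             % I_i=1 with probability th3
mu=[1/th1 1/th2];
y=exprnd(mu(I));                  % exprnd is parametrised by the mean
hist(y,30)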


Figure 11.19: Histogram of service times.

To derive the conditional distributions, we note that

f(θ1|θ2, θ3, I, y) ∝ f(y|θ, I)f(θ1|θ2)
                   ∝ (∏_{i=1}^{n} exp(−yi θ_{Ii}) θ_{Ii}) 1{θ1 > θ2}
                   ∝ θ1^{Σ_{i=1}^{n} (2−Ii)} exp(−θ1 Σ_{i=1}^{n} yi (2−Ii)) 1{θ1 > θ2},

i.e. a Gamma(Σ_{i=1}^{n} (2−Ii) + 1, Σ_{i=1}^{n} yi (2−Ii)) distribution truncated to θ1 > θ2. Similarly, f(θ2|θ1, θ3, I, y) will be Gamma(Σ_{i=1}^{n} (Ii−1) + 1, Σ_{i=1}^{n} yi (Ii−1)), truncated to the same region. We also have that

f(θ3|θ1, θ2, I, y) ∝ f(y|θ, I)f(I|θ3)f(θ3) ∝ θ3^{Σ_{i=1}^{n} (2−Ii)} (1 − θ3)^{Σ_{i=1}^{n} (Ii−1)},

since the likelihood f(y|θ, I) does not depend on θ3, and this we recognise as a Beta(Σ_{i=1}^{n} (2−Ii) + 1, Σ_{i=1}^{n} (Ii−1) + 1) distribution. Finally,

P(Ii = j|I−i, θ, y) ∝ exp(−yi θj) θj θ3^{2−j} (1 − θ3)^{j−1},

which implies that

P(Ii = j|I−i, θ, y) = exp(−yi θj) θj θ3^{2−j} (1 − θ3)^{j−1} / (exp(−yi θ1)θ1 θ3 + exp(−yi θ2)θ2 (1 − θ3)),

conditionally independently for i = 1, . . . , n. I will present a full Matlab analysis of this example on the course web-page.
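Pending that, a minimal sketch of the data-augmentation sampler (our variable names; (θ1, θ2) are updated as a block, and redrawing the pair from the unconstrained Gamma product until θ1 > θ2 is exact rejection sampling from the truncated joint conditional):

y=y(:)';                            % ensure a row vector of service times
n=length(y); M=10000;
th=zeros(3,M); th(:,1)=[2;1;0.5];   % arbitrary starting values
for i=1:M-1
  % update I_1,...,I_n conditionally independently
  p1=exp(-y*th(1,i))*th(1,i)*th(3,i);
  p2=exp(-y*th(2,i))*th(2,i)*(1-th(3,i));
  I=1+(rand(1,n)>=p1./(p1+p2));     % I_i=1 or 2
  n1=sum(I==1); n2=n-n1;
  s1=sum(y(I==1)); s2=sum(y(I==2));
  % joint draw of (theta1,theta2): rejection from the Gamma product
  t1=gamrnd(n1+1,1/s1); t2=gamrnd(n2+1,1/s2);
  while t2>=t1
    t1=gamrnd(n1+1,1/s1); t2=gamrnd(n2+1,1/s2);
  end
  th(1,i+1)=t1; th(2,i+1)=t2;
  th(3,i+1)=betarnd(n1+1,n2+1);     % Beta conditional for theta3
end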
