School of Mathematical Sciences
Queensland University of Technology
Bayesian latent variable models for biostatistical applications
Peter Gareth Ridall M.Sc. (Hons)
A thesis submitted for the degree of Doctor of Philosophy in the Faculty of Science, Queensland University of Technology.
2004
Principal Supervisor: Professor A. N. Pettitt
Abstract

In this thesis we develop several kinds of latent variable models in order to address three types of biostatistical problem. The three problems are the treatment effect of carcinogens on tumour development, spatial interactions between plant species and motor unit number estimation (MUNE). The three types of data looked at are: highly heterogeneous longitudinal count data, quadrat counts of species on a rectangular lattice and, lastly, electrophysiological data consisting of measurements of compound muscle action potential (CMAP) area and amplitude. Chapter 1 sets out the structure and the development of ideas presented in this thesis from the point of view of model structure, model selection and efficiency of estimation. Chapter 2 is an introduction to the relevant literature that has influenced the development of this thesis. In Chapter 3 we use the EM algorithm for an application of an autoregressive hidden Markov model to describe longitudinal counts. The data are collected from experiments to test the effect of carcinogens on tumour growth in rats. Here we develop forward and backward recursions for calculating the likelihood and for estimation. Chapter 4 is the analysis of a similar kind of data using a more sophisticated model, incorporating random effects, but estimation this time is conducted from the Bayesian perspective. Bayesian model selection is also explored. In Chapter 5 we move to the two-dimensional lattice and construct a model for describing the spatial interaction of tree types. We also compare the merits of directed and undirected graphical models for describing the hidden lattice. Chapter 6 is the application of a Bayesian hierarchical model to MUNE, where the latent variable this time is multivariate Gaussian and dependent on a covariate, the stimulus. Model selection is carried out using the Bayes Information Criterion (BIC). In Chapter 7 we approach the same problem by using the reversible jump methodology (Green, 1995), where this time we use a dual Gaussian-binary representation of the latent data. We conclude in Chapter 8 with suggestions for the direction of new work. In this thesis, all of the estimation carried out on real data has been performed only once we have been satisfied that estimation is able to retrieve the parameters from simulated data.
Keywords: Amyotrophic lateral sclerosis (ALS), carcinogens, hidden Markov models (HMM), latent variable models, longitudinal data analysis, motor neurone disease (MND), partially ordered Markov models (POMMs), the pseudo-autologistic model, reversible jump, spatial interactions.
List of publications & manuscripts

This thesis comprises the following publications which have been accepted, or submitted, for publication in international refereed journals.

• [Chapter 3:] P. G. Ridall and A. N. Pettitt. Markov switching models for longitudinal counts.
• [Chapter 4:] P. G. Ridall and A. N. Pettitt. A Bayesian hidden Markov model for longitudinal counts. The Australian and N.Z. Journal of Statistics (Submitted: November 2003. Accepted: April 2004. In press: June 2005).
• [Chapter 5:] P. G. Ridall and A. N. Pettitt. A multivariate hidden binary model for spatial interactions. Statistical Modelling (Submitted: April 2003).
• [Chapter 6:] P. G. Ridall, A. N. Pettitt, R. Henderson and P. McCombe. A Bayesian hierarchical model for estimating motor unit numbers. Biometrics (Submitted: February 2004. Accepted: May 2005).
• [Chapter 7:] P. G. Ridall, A. N. Pettitt, N. Friel, P. McCombe and R. Henderson. Motor unit number estimation using reversible jump Markov chain Monte Carlo. Journal of the Royal Statistical Society, Series C (Submitted: April 2004).
Contents

Abstract
List of Publications & Manuscripts
List of Figures
List of Tables
Statement of Original Authorship
Acknowledgments

1 Introduction and motivation
1.1 The aim
1.2 The applications
1.3 The models of the thesis
1.4 Problems for latent variable models

2 Relevant ideas from the literature
2.1 Introduction
2.2 Early latent variable models
2.2.1 The EM algorithm
2.3 Mixtures, finite mixtures and their generalisations
2.3.1 Discrete z
2.3.2 Continuous z
2.3.3 The hidden Markov model
2.3.4 Graphical models
2.4 Auxiliary variables
2.5 Bayesian model selection
2.5.1 The M-open view
2.5.2 The M-closed view
2.5.3 Sampling over model and parameter space
2.6 Conclusion

3 A Markov switching model for longitudinal counts
3.1 Introduction
3.2 The rat tumour data
3.2.1 Previous analysis of the data
3.3 The proposed model
3.3.1 The state structure
3.3.2 The observation process
3.3.3 Likelihoods for the complete and observed data
3.4 Estimation of unknown parameters
3.4.1 The filter
3.4.2 The smoother
3.4.3 The M step of the EM algorithm
3.4.4 Standard error estimation for the EM algorithm
3.5 Estimation and inferences from the data
3.5.1 Predictive fits and predictive Pearson residuals
3.5.2 Smoothed fits and smoothed Pearson residuals
3.6 Conclusion

4 Bayesian hidden Markov models for longitudinal counts
4.1 Introduction
4.2 The tumour data
4.2.1 Previous analysis of the data
4.2.2 Outline of our approach
4.3 The proposed model
4.3.1 The state structure
4.3.2 Models for the observations conditional on the states
4.3.3 Likelihood definition for model selection
4.3.4 Priors
4.4 Bayesian inference
4.4.1 Sampling from the states
4.4.2 Sampling from the parameters for the transition matrix
4.4.3 Sampling from the parameters for the probability of the observations
4.5 Results and analysis
4.5.1 Posterior distributions of the parameters
4.5.2 Predictive fits
4.5.3 Model selection
4.5.4 Summary
4.6 Conclusion

5 Hidden bivariate binary models for spatial interactions
5.1 Introduction
5.2 The Lansing Woods data set
5.3 Outline of our approach
5.3.1 Graphical models
5.3.2 POMMs
5.3.3 A generalisation of a POMM
5.3.4 A generalisation of the pseudo-autologistic model
5.3.5 A bivariate hidden structure
5.3.6 Analysis of interactions of any number of tree types
5.4 Details of analysis and results
5.4.1 Results
5.5 Discussion and conclusion

6 A Bayesian hierarchical model for estimating motor unit numbers
6.1 Introduction
6.2 Background to our model
6.2.1 Protocol and data collection
6.2.2 Technical problems with data collection and model
6.3 Previous methods of MUNE
6.3.1 The incremental technique
6.3.2 The Poisson or statistical method
6.3.3 Method of estimation for our model
6.4 Our proposed stochastic model for MUNE for a variable stimulus for a fixed N
6.4.1 Priors
6.4.2 The probability model
6.4.3 Some MCMC considerations
6.4.4 The marginal likelihood
6.4.5 Model selection to infer N
6.5 Results of our analysis
6.5.1 The properties of the motor units
6.5.2 Model diagnostics
6.6 Discussion

7 Motor unit number estimation using reversible jump Markov chain Monte Carlo
7.1 Introduction
7.2 Electrophysiological techniques
7.3 Background to the fixed N model
7.3.1 Details of the statistical model
7.4 The trans-dimensional Markov chain
7.4.1 The probability model
7.4.2 The types of moves
7.5 Results of our RJMCMC
7.5.1 Patient 1: A replicated study
7.5.2 Patient 2: A replicated study
7.5.3 Patient 3: A serial study
7.6 Discussion
7.6.1 Priors and sensitivity analysis
7.7 Conclusion

8 Concluding remarks and recommendations for further work
List of Figures

2.1 (a) The graphical model for the mixture model and two generalisations of it, (b) the hidden Markov model (HMM) and (c) the autoregressive hidden Markov model (AHMM).
2.2 The DAG depicted by (a) shows the relationship between the data, y, the latent states, z, the parameters for the random effects, θz, random effects for the observations, θy, and the hyper-parameters for the observations, Φ. (b) shows replicates of a new observation, y*, given the state and individual. (c) shows replication of a new observation from a new state, z*, given the individual. (d) shows replication of a new observation from a new individual, θy*, given the current state. (e) shows replication of a new observation from a new individual, θy*, and a new state, z*.
3.1 Directed acyclic graph for our proposed model. The observations are denoted by yt and the states by st.
3.2 Counts of tumours against time.
3.3 The evolution of the underlying hidden state.
3.4 Replicates of the data simulated from the model with parameters equal to the estimated values.
3.5 The one step ahead predictions are shown for two rats (1,2) from the acetone group and four rats (8,10,12,13) from the benzene group.
3.6 Pearson residuals are plotted against smoothed fits for two rats (1,2) from the acetone group and four rats (8,10,12,13) from the benzene group.
3.7 Smoothed fits are shown for two rats (1,2) from the acetone group and four rats (8,10,12,13) from the benzene group.
4.1 Plots of tumour count against time for each animal. The missing data are rats that died in the course of the study.
4.2 Posterior distributions for the hyper-parameters and transition parameters for the three state model. The solid line indicates Strain I and the dotted line Strain II.
4.3 Posterior distributions for the hyper-parameters for the four state model. The solid line indicates Strain I and the dotted line Strain II.
4.4 Posterior distributions for the parameters for transition probabilities for the four state model. The solid line indicates Strain I and the dotted line Strain II.
4.5 95% credible intervals for the random effects for individual animals for the four state model.
4.6 95% credible intervals showing predictive fits for the three state model.
4.7 95% credible intervals showing predictive fits for the four state model.
5.1 The Lansing Woods data set. The location of Maple and Hickory trees. The length of each axis represents approximately 280m.
5.2 (a) shows an undirected graphical model depicting an autologistic model. (b), (c) and (d) show directed acyclic graphs (DAGs) depicting POMMs of order (b) two, (c) three and (d) four. The order is the number of parents of each interior node. The ordering of a POMM depends on the position of the root node (the node without parents).
5.3 The diagram shows the Markov blanket of (b) the 2nd order, (c) the 3rd order and (d) the 4th order POMMs. The neighbourhood structure of a first order Markov random field (a) is shown for comparison. Note that the third order POMM, (c), has the same neighbourhood structure as a second order Markov random field.
5.4 95% credible intervals for the spatial interaction parameters for the second order POMM and the pseudo-autologistic models at four different resolutions, varying from 20 x 20 to 160 x 160.
5.5 95% credible intervals for the spatial parameters for a second order POMM with four orderings at an 80 x 80 resolution.
5.6 Maps showing the averaged states for the second order POMMs for two different orderings, for the Maples. The left hand frame shows a root node on the top left and the right hand frame shows a root node on the top right. Dark regions indicate regions with high expected counts. The locations of the trees are depicted by the "+" symbol.
5.7 Maps showing the averaged states for the second order POMMs over two different orderings, for the Hickorys. The left hand frame shows a root node on the top left and the right hand frame shows a root node on the top right. Dark regions indicate regions with high expected counts.
5.8 Averaged states for the pseudo-autologistic model at an 80 by 80 resolution. Maple is shown on the left and Hickory on the right.
5.9 The averaged states for second order POMMs averaged over the four different orderings at an 80 x 80 resolution. Maple is shown on the left and Hickory on the right.
6.1 The diagram shows the three components of a motor unit: (i) the ventral or anterior horn cell in the spinal cord, (ii) the axon, a myelinated fibre running in the peripheral nerves which among other things conducts the action potential, and (iii) the nerve terminals at the neuromuscular junction where acetylcholine is released, stimulating nicotinic receptors in the muscle membrane.
6.2 The diagram shows the components of the action potential. When the stimulus exceeds the threshold, a depolarisation caused by an increase of permeability of the membrane to sodium ions generates a spike. This is followed by a re-polarisation caused by a decrease in the sodium permeability and an increase of potassium permeability in the reverse direction. The slow change in potassium permeability is also responsible for the slightly negative after-potential. (A) shows the absolute refractory period where the axon is unresponsive to another stimulation. (B) is the relative refractory period where a larger stimulus is required to generate a smaller action potential. (C) is the supernormal period of increased excitability where a smaller stimulus is required for stimulation. (D) is the subnormal period of decreased excitability where an increased stimulus is required to generate an action potential.
6.3 This shows the CMAP recorded and superimposed on the screen of the computer at several different stimuli. The vertical axis represents the magnitude of the potential (V) and the horizontal axis, (V=0), time. The units, (1 mV, 2.5 kHz), for these graduations are shown at the top of the figure. The portion of the waveform between A and B is called the baseline and ideally should be as flat as possible. The short vertical segments, B and C, on the horizontal axis are markers that are set by the user to enclose the portion of the trace above the x-axis. The supra-maximal CMAP area (10098 µV) and amplitude (3.699 mV) are shown at the bottom of the figure.
6.4 The diagram shows recordings of percentage CMAP area plotted against the stimulus intensity serially taken from three ALS patients over a period of about six months. Figures 1A, 1B, 1C and 1D show recordings taken from patient one; Figures 2B, 2C and 2D show recordings from patient two; Figure 3A shows a recording from patient three.
6.5 The diagram shows the probability-stimulus plots of 100 simulated motor units. S = 20 mA is a fixed stimulus and na is the number of motor units that are always firing at that stimulus.
6.6 The directed acyclic graph depicts the probability model. This depicts the relation between the priors (the parent nodes), the stimulus (square node) and the data (y).
6.7 Each of the plots shows the mean BIC plotted against the number of motor units, N. The mean is calculated over 49 repetitions and the estimate is displayed and enclosed by standard error bars. Figures (1A), (1B), (1C) and (1D) show the stable condition of patient 1. Figures (2A), (2B) and (2C) show a loss of units taken from a patient with rapidly advancing ALS. Figure (3A) shows the results for a poliomyelitis patient.
6.8 The diagrams show how the preferred five unit model, using the effect known as 'alternation', can explain the data collected from configurations 1, 2, 3 and 4 of electrodes for the left hand of patient 1. The diagram shows why the number of increments is not equal to the number of units. Each configuration shows a different number of steps (6, 8, 6 and 7) but all are described with a 5 unit model.
6.9 The diagrams show diagnostic plots for the 479 observations of Figure 6.4, 1A: (a) a normal probability plot of the Pearson residuals, (b) a sequential plot of the Pearson residuals, (c) a plot of the autocorrelation function, (d) a plot of the partial autocorrelation function, (e) a plot of the Pearson residuals against the same residuals lagged by one observation, and (f) a plot of the observations against the expected values under the model.
7.1 This is a printout of the virtual oscilloscope provided by the manufacturer in low resolution. It shows several recorded waveforms superimposed. Each represents a different combination of units. The vertical axis represents potential and the horizontal axis time. The portion of the waveform between A and B is called the baseline and ideally should be as flat as possible. The short vertical segments, B and C, on the horizontal axis are markers that are set by the user to enclose the portion of the trace above the x-axis, which by convention is referred to as a negative deflection. Software provided by the manufacturer is then able to estimate the area of that part of the waveform.
7.2 This is an illustration of the implementation of our fixed N model on a data set taken from patient 1, whose clinical details are given in Table 7.1. In this model N = 12. From top left we have: (A) the data set consisting of percentage CMAP area plotted against stimulus; the vertical lines show the positions of the mean threshold, mk, for each unit, or the location parameter. Panel (B) shows the stimulus as a solid line being manually increased from the submaximal value, when no units are firing, to the supramaximal value, where all units are firing. The first broad horizontal band (shown by an arrow) is actually four overlapping lines depicting the thresholds of four separate units. As the stimulus moves through the thresholds of these units we get the multiple levels shown in panel (A) (also shown by an arrow) at about S = 45-47 mA and an effect called alternation when the number of increments exceeds the number of units. Panel (C) shows the excitability curves that describe the probability of each unit firing as a function of stimulus. These values are taken from the posterior modes. The four units mentioned above are depicted by an arrow. The intercepts of these curves with the horizontal line, p = 0.5, also give the mean thresholds, mk. Panel (D) shows the expected CMAP area of every tenth point in the data set together with the actual CMAP area, marked by a '+' symbol. The shaded areas within each bar of CMAP show the contributing single MUAPs.
7.3 The DAG represents the probability model, Equation (7.14), implemented in this paper. The round nodes represent stochastic variables and the rectangular nodes represent data. The binary latent variable is denoted by sk,t and the Gaussian latent variable by τk,t. Note that the parameters shown can be partitioned into two subsets: Θ = {Θz, Θy}, where {mk, δk²} ⊂ Θz are parameters that describe the latent variable and {µk, µb, σ², σb²} ⊂ Θy are parameters that describe the data. The priors have been omitted to improve the appearance of the diagram. Note that as N changes the dimension of all the variables on the upper plate also changes.
7.4 (1A) and (2A) show two data sets collected from patient 1 with the recording electrode fixed but the position of the stimulating electrode altered. Panels (1B) and (2B) show the rate of convergence of the RJMCMC on a log scale. Each line shows a separate chain with different initialisations. Panels (1C) and (2C) show the posterior distributions of the corresponding data sets.
7.5 (1A) and (2A) show data collected from patient 2, whose clinical details are given in Table 7.1. Both recordings were taken on the same day with both electrodes removed and repositioned. Panels (1B) and (2B) show the corresponding posterior distributions.
7.6 (1A)-(4A) show data collected on four occasions over a period of almost two years. (1B)-(4B) are the corresponding trace plots for N and (1C)-(4C) the corresponding posterior distributions for N.
List of Tables

1.1 The table compares the assumptions used in this thesis with those of the mixture model and its derivative, the hidden Markov model. A1-A3 are assumptions about the hidden structure. A4 is an assumption about the conditional independence of the observations. A5 deals with whether or not there is a one to one relationship between states and observations.
3.1 Tabulation of zero, increasing, decreasing and tied observations.
4.1 Posterior means, standard deviations and 95% credible intervals for the transition matrix parameters for the three state model.
4.2 Posterior means and standard deviations for the transition matrix probabilities for the four state model. Note that the transition matrix parameter d is not included because it is identical to that of the three state model.
4.3 Posterior means and standard deviations for the hyper-parameters.
4.4 95% credible intervals for the difference between parameters that describe Strain I and Strain II animals.
7.1 The table summarises the clinical details of three patients with ALS or a closely related disease. Note that all recordings were taken from the abductor digiti minimi (ADM) muscle group in the hand upon stimulation of the ulnar nerve in the wrist.
7.2 Here we compare two ways of moving down a dimension in our trans-dimensional model. The first move involves replacing three units (where two of the three single MUAPs are nearly equal) by two units. The second involves replacing two units by one unit. The left hand table shows the values of µ5i from the model that incorporates alternation and µ6i from the model that does not. Note that µ64 + µ65 + µ66 ≈ µ54 + µ55, µ64 ≈ µ66 ≈ µ54 and µ65 ≈ µ55 − µ54. Moves of this kind are incorporated in moves 2 to 5 and shown in Appendix B. The right hand table compares µ5i from the 5 unit model and µ4i from the 4 unit model. Here we note µ43 = µ53 + µ54. Moves capable of describing such a transformation between these two models are incorporated in moves 6 and 7 and shown in Appendix C. These data are also analysed in Ridall et al. (2005), where MUNE using the BIC yields a most probable estimate of N = 5.
7.3 The table shows a three way ANOVA from a Poisson GLM.
7.4 The table shows a summary of the posterior distribution for each of the data sets used in this paper. It shows the dates on which the recordings were made, the mode of the posterior, p(N|y), and a 95% credible interval for the same posterior. The last column is the number of observations in the data set.
Statement of Original Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by any other person except where due reference is made.
Signed:
Date:
Acknowledgments

Thanks to Tony Pettitt for his enthusiasm, patience, inspiration and for the innumerable rewrites he has had to endure. Thanks to Ross McVinnish for the numerous times he has given me help. Thanks to Rob Reeves for his help with the BUGS coding used in Chapter 5. Thanks also to the staff of QUT for their considerable support and tolerance. Thanks to Rob Henderson for providing me with the motor unit problem and the encouragement and support to solve it. Thanks also to my late father who provided me with my love of mathematics.
Chapter 1

Introduction and motivation

Summary

In this chapter we discuss the aims, scope and structure of this thesis. It is essentially a work in applied statistics, centred on various contentious problems in biostatistics. New theory is developed only where existing theory has been shown to be problematic, incomplete or able to be improved in some way.
1.1 The aim
The aim of the research in this thesis is to develop and apply Bayesian statistical methodology to a number of complex situations arising in biological and medical sciences so that new scientific insight into the relevant processes results.
1.2 The applications
For our applications, we have chosen three fields where a need for new work has been expressed. The first comes from studies of carcinogenicity where the data consist of longitudinal counts of tumours on rats. The call for work in this area
was expressed by Dunson (2000). Our second application comes from the field of ecology where interest lies in the equilibrial spatial dynamics of a population of species of trees on a rectangular lattice. Such interactions, either attractive or repulsive, can suggest the presence of a variety of causal environmental factors. Spatial repulsion is often caused by competition for resources, and spatial attraction by various symbiotic relationships. Our third application comes from the field of neurophysiology where a reliable and universally accepted method for estimating motor unit numbers (MUNE) is sought. A reliable method has been sought, without success, for over 30 years (Shefner, 2001). MUNE is needed primarily as a means of tracking the progress of a variety of neuro-muscular diseases that lead to gradual paralysis. Examples of these diseases are motor neurone disease (MND), also known as amyotrophic lateral sclerosis (ALS), some neuropathies and polio. Reliable MUNE will aid the developers of effective treatments for these incurable diseases.
1.3 The models of the thesis
All of the models used in this thesis are generalisations of a simple mixture model. Below is a list of assumptions (or observations) of the very simplest mixture model. We will use these assumptions to illustrate model development of this thesis. They are discussed in more detail in the next chapter.

A1. The states or hidden variables are discrete.
A2. The states are conditionally independent given the parameters.
A3. The states are identically distributed.
A4. The observations are conditionally independent given the states.
A5. One state is allocated to each observation.
Our models are built on assumed state structures of varying degrees of complexity. In Chapter 3, for the longitudinal count data, we relax (A2) and (A4) and hold the others fixed. Rather than have independent states, we assume that the states evolve in a Markov manner governed by a transition matrix. In addition, rather than using a mixture of independent observations, we assume a mixture of autoregressive processes. In Chapter 5 we hold fixed (A1), (A3) and (A4), but for (A2) we substitute a multivariate and spatially correlated hidden structure on a two-dimensional lattice. In Chapter 6, only (A2) and (A4) hold. Instead of (A1) we use a Gaussian hidden state. For (A3), the distribution of the states is not identical but dependent on a covariate. In this chapter, instead of using (A5), we allocate each observation to a subset of N states. In Chapter 7, for (A1) we use a dual representation of the latent state. Instead of (A5), as in the previous chapter, each observation is allocated to a subset of N states, but this time N is a variable.
Model            A1   A2   A3   A4   A5
Finite mixture   X    X    X    X    X
HMM              X         X    X    X
Chapter 3        X         X         X
Chapter 4        X         X         X
Chapter 5        X         X    X    X
Chapter 6             X         X
Chapter 7             X         X

Table 1.1: The table compares the assumptions used in this thesis with those of the mixture model and its derivative, the hidden Markov model. A1-A3 are assumptions about the hidden structure. A4 is an assumption about the conditional independence of the observations. A5 deals with whether or not there is a one-to-one relationship between states and observations.
1.4
Problems for latent variable models
An appreciation of the structure of the thesis can be gained by considering the problems involved in the estimation process. These problems present a major limiting factor for the development of our models. Because the great bulk of this thesis approaches estimation and inference from the Bayesian paradigm, we focus on problems involved in Markov chain Monte Carlo (MCMC), the principal Bayesian computational method. Most of these problems are specific to latent variable models; others are more general. In the case of mixture type models, estimation can be done without the latent variable formulation by direct maximisation of the likelihood. Lehmann (1983) has shown that this technique, even for the simplest of the mixture models, can be highly problematic. Estimation for mixtures can be made more straightforward by the adoption of a latent variable, a discrete state that allocates each observation to a component of the mixture. Inferences about the individual states can also help with the interpretation of the model. Much of the early work on latent variables was done with the use of the EM algorithm of Dempster et al. (1977). Chapter 3 is an application of this methodology. The EM algorithm is a method for finding the maximum likelihood estimates of parameters in the presence of missing data. However, the calculation of the variability, or the standard error, of the estimates is usually not a natural part of the EM algorithm, and its estimation can be problematic (see Chapter 3). Problems of calculating the variability of estimates are not experienced in a fully Bayesian approach, which predominantly uses Markov chain Monte Carlo (MCMC) methods as its means of computation. The output of an MCMC run, unlike that of the EM algorithm, fully expresses the variability of the parameters. The Bayesian approach is not without problems. However, most of these are shared with other approaches. We now detail what we consider to be the major difficulties.

Identifiability problems.
One of these, in the case of mixtures, is known as
the label switching problem and is caused by the fact that the likelihood is invariant to permutations of the labels for the components or states of the mixture. It can lead to multi-modality in the conditional posterior distributions.

Model selection problems. In order to carry out model selection one needs the likelihood and the deviance, both measures of fit or model adequacy. For mixture models the marginal likelihood is usually required and this requires integrating out, or summing over, the hidden structure. This can be difficult, if not impossible, to carry out. We now list parts of the thesis that involve the problem of likelihood calculation and how we deal with the problem.

1. If there are T observations and N states then direct summation over the states of an autoregressive hidden Markov model would involve N^T terms, which would be computationally expensive. In Chapter 3, we develop a forward recursion that efficiently calculates the likelihood.

2. For a similar reason, the calculation of the normalising constant for the autologistic model is impossible for large lattices. In Chapter 5 we use a substitute for the likelihood called the pseudo-likelihood (Besag, 1974) but we generalise it to describe a multivariate hidden lattice.

3. For large multivariate truncated Gaussian hidden lattices, calculation of the marginal likelihood cannot be done analytically and must be approximated. In Chapter 6, for each observation, we repeatedly simulate from the prior and then calculate the conditional likelihood, in order to approximate the likelihood for our model for estimating motor unit numbers.

4. In models where scientific truth rather than prediction is the principal aim, the marginal posterior distribution of the model is required. In
Chapter 6 we use the BIC approximation to estimate the marginal likelihood in order to carry out model selection.

5. Green (1995) has developed the theory of trans-dimensional modelling where the marginal posterior distribution of the model is calculated rather than the marginal likelihoods of each model. This method, also known as reversible jump Markov chain Monte Carlo (RJMCMC), has been shown to be a feasible method of model selection for simple mixture models. In Chapter 7, we extend the methods of Richardson and Green (1997) to our more complex state structure for the purpose of motor unit number estimation.

Convergence problems. With greater numbers of parameters and with an increasing proportion of the total information held in the missing data structure (missing information), convergence of the MCMC algorithm becomes increasingly problematic. Convergence is also a major problem in the case of between-model moves in RJMCMC, where acceptance rates can be lower than 1%. A default procedure for updating parameters, the random walk Metropolis-Hastings update, has the disadvantage that each parameter requires the adjustment of a separate tuning parameter. These are generally found by trial and error. For large models this is prohibitively time-consuming. Below we list some of the strategies that we have considered to improve mixing of the MCMC.

1. For the hidden Markov model and its generalisations, rather than using repeated Gibbs updates of the discrete states, it can be more efficient to develop forward and backward recursions. In Chapter 4, for the autoregressive hidden Markov model, having performed a forward recursion to obtain the likelihood, we sample the states in a backward order in the style of Scott (2002).
2. Rather than sampling large numbers of parameters individually and sequentially, it is often more efficient to use a block Gibbs update. We use this mechanism in Chapters 6 and 7. Here we re-parameterise the regression parameters and re-express the latent variables as multivariate Gaussian. This is done primarily to enable approximate block Gibbs updates of these parameters. In Chapter 7 we use a dual representation of the hidden variables. The discrete version of the hidden variable is needed for efficient between-model moves. The continuous Gaussian form is needed for within-model updates.

3. For the precision parameters, when the full conditionals do not have a recognised form and Gibbs updates are not possible, we consider approximate Gibbs updates by using a Gamma approximation to the full conditional as a proposal. In Chapters 6 and 7 we do this by calculating the mode and the Hessian at the mode of the full conditional.

4. If approximate Gibbs updates are not possible then delayed Metropolis-Hastings could be used. This allows the application of a second try following an initial rejection. Although we have not used this method, we are considering it for between-model updates (Chapter 7) because of the very low acceptance rate.

5. Often a re-parameterisation can aid convergence for mixtures. Fortunately for our applications, this was not necessary.

6. Mechanisms for the choice of proposals to promote good mixing can be constructed by introducing a Taylor expansion of the expression for the Metropolis-Hastings ratio and comparing it to the actual rate of acceptance. Brooks and Giudici (1998) illustrate how to use this method for across-model moves. Although we do not use this method in this thesis, our acceptance rates for between-model moves in Chapter 7 are low. We are currently looking to this method to improve
mixing in our model.

7. Van Dyk and Meng (2001) show how auxiliary variables can be used to improve mixing for Gaussian latent variable models. We make use of this innovation in Chapter 7 and discuss it in more detail in the next chapter.
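Strategy 1 above, a forward recursion followed by backward sampling of the states in the style of Scott (2002), can be sketched for a generic discrete-state HMM with a finite observation alphabet. This is an illustrative sketch only, not the autoregressive model of Chapters 3 and 4; the function names and toy parameterisation are our own.

```python
import math

def hmm_loglik_forward(y, pi0, P, emis):
    """Scaled forward recursion for a discrete-state HMM.

    y    : sequence of observation indices, length T
    pi0  : initial state distribution, pi0[i] = p(z_1 = i)
    P    : transition matrix, P[i][j] = p(z_t = j | z_{t-1} = i)
    emis : emission matrix, emis[i][k] = p(y_t = k | z_t = i)

    Returns log p(y) and the filtered probabilities
    alpha[t][i] = p(z_t = i | y_1, ..., y_t).  The cost is O(T N^2)
    rather than the O(N^T) of direct summation over all state paths.
    """
    N = len(pi0)
    alpha, loglik, pred = [], 0.0, list(pi0)
    for obs in y:
        unnorm = [pred[i] * emis[i][obs] for i in range(N)]
        c = sum(unnorm)                  # scaling constant p(y_t | y_1:t-1)
        alpha.append([u / c for u in unnorm])
        loglik += math.log(c)
        pred = [sum(alpha[-1][i] * P[i][j] for i in range(N)) for j in range(N)]
    return loglik, alpha

def sample_states_backward(alpha, P, rng):
    """Draw one state path z_1:T from p(z | y) by sampling in backward order,
    using p(z_t | y_1:t, z_{t+1}) proportional to alpha[t][i] * P[i][z_{t+1}]."""
    N = len(P)
    z = [_categorical(alpha[-1], rng)]
    for t in range(len(alpha) - 2, -1, -1):
        w = [alpha[t][i] * P[i][z[-1]] for i in range(N)]
        s = sum(w)
        z.append(_categorical([wi / s for wi in w], rng))
    return z[::-1]

def _categorical(p, rng):
    # inverse-cdf draw from a discrete distribution p
    u, acc = rng.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if u <= acc:
            return i
    return len(p) - 1
```

The forward pass accumulates log p(y) from the scaling constants, so the likelihood needed for model comparison comes out of the same recursion that drives the state sampling.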
Bibliography

Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. Ser. B 36, 192–236.

Brooks, S. P. and P. Giudici (1998). Convergence assessment for reversible jump MCMC simulations. In Bernardo, Berger, Dawid, and Smith (Eds.), Bayesian Statistics 6, Oxford, pp. 733–742. Oxford University Press.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–38.

Dunson, D. B. (2000). Models for papilloma multiplicity and regression: Applications to transgenic mouse studies. Appl. Statist. 49, 9–30.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.

Lehmann, E. (1983). Theory of point estimation. New York: Wiley.

Richardson, S. and P. J. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59, 731–792.

Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st Century. J. Am. Statist. Ass. 97, 337–351.
Shefner, J. M. (2001). Motor unit number estimation in human neurological diseases and animal models. J. Clin. Neurophysiol. 112, 955–964.

Stephens, M. (2000). Dealing with label switching in mixture models. J. Roy. Statist. Soc. Ser. B 62, 795–809.

Van Dyk, D. A. and X. Meng (2001). The art of data augmentation. J. Comp. and Graph. Stat. 10, 933–944.
Chapter 2

Relevant ideas from the literature

The topic of latent or hidden variables is the unifying theme of this thesis. Latent variables give the statistician the extraordinary ability to describe complexity in data through the modelling of a hidden or unobserved structure. In this chapter we define the meaning of the term latent variable and related terms, present the main motivations for using a latent variable approach, and give a little of the history of the evolution of knowledge in this field.
2.1
Introduction
In this thesis, we use a latent variable approach to model different types of biostatistical data. A latent variable, z, is the name for an unobserved or hidden variable that is embedded in the model for a variety of reasons. The observations, y_t, t = 1, 2, ..., T, are expressed in terms of the latent variables in what is called an observation process, with distribution p(y | z, θ_y), which contains noise. For each observation we allocate a latent variable, z_t, t = 1, 2, ..., T, possibly a vector, with distribution p(z | θ_z). We denote the parameters of the model as θ = {θ_z, θ_y}, where θ_z are the parameters that describe the distribution of the latent variables and θ_y the parameters that describe the distribution of the
observations conditional on the latent variables. The distribution of the original data alone can then be obtained by a process of marginalisation over the latent variable, z. For continuous z the marginal distribution of y is

    p(y | θ) = \int_z p(y | z, θ_y) p(z | θ_z) dz        (2.1)

and for discrete z,

    p(y | θ) = \sum_z p(z | θ_z) p(y | z, θ_y).        (2.2)
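As a numerical illustration of (2.1), suppose z ~ N(μ_0, τ²) and y | z ~ N(z, σ²); integrating z out gives the closed form p(y) = N(y; μ_0, σ² + τ²). The following sketch (our own helper names, chosen purely for illustration) approximates the integral on a grid and can be checked against the closed form.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_by_quadrature(y, mu0, tau2, sigma2, lo=-30.0, hi=30.0, n=20000):
    """Approximate p(y) = integral of p(y | z) p(z) dz by the trapezoidal rule,
    for z ~ N(mu0, tau2) and y | z ~ N(z, sigma2)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0   # trapezoidal end-point weights
        total += weight * normal_pdf(y, z, sigma2) * normal_pdf(z, mu0, tau2)
    return total * h
```

The closed-form answer is normal_pdf(y, mu0, sigma2 + tau2), so the quadrature can be checked directly.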
In (2.1), p(y | θ) is said to be a mixture of the distributions p(y | z, θ_y) and p(z | θ_z). In (2.2), p(y | θ) is said to be a finite mixture. We formulate the assumptions of several kinds of finite mixtures in Section 2.3. Data augmentation (Tanner and Wong, 1987) is a wider term which also includes the addition of hidden parameters, such as random effects, into the model. An auxiliary variable has the same definition as a latent variable, in that the latent data can be integrated out to yield the distribution of the original data. However, this term tends to be used in the context of improving the convergence properties of the MCMC. There are several reasons why latent variables have become popular in statistical modelling.

• Latent variables are able to expand the range of distributions that can be modelled, without having to resort to distributions outside the exponential family. By allocating p(z | θ_z) in (2.1) a distribution conjugate to p(y | z, θ_y), tractable expressions for p(y | θ) are created which can be sampled from in a way that involves considerable computational savings. Finite mixtures, (2.2), are useful for modelling multi-modal distributions.

• By allowing the latent variable, z, to take on a more complex dependence structure, we gain the ability to describe an underlying, unobserved or hidden causal structure. This underlying causal structure, which
can be described indirectly through an observation process, can have important biological implications. The assumption that the observations are conditionally independent makes estimation relatively straightforward but, when z is one dimensional, this assumption too can be relaxed. We now give three examples of latent variables, z, taking on a Markov structure. Firstly, when the latent variable is one dimensional and discrete we have the hidden Markov model (HMM). HMMs are an important tool for speech recognition and DNA sequencing. Secondly, if z is two dimensional and discrete, we have a model capable of representing a hidden lattice suitable for modelling spatial processes. Thirdly, if both the observation process and the hidden or state process are linear and have Gaussian errors, prediction and the calculation of the likelihood are possible through application of the Kalman filter, a forward recursion. The Kalman smoother is a backward recursion that is used for estimation of the state process. The latent variable, z, can be allocated yet more complex structures. For example, if z is allocated a hierarchical Gaussian structure, with z representing communalities, we have linear factor analysis (see, for example, Hogan and Tchernis (2004)). Factor analysis is a framework frequently used by social scientists to search for causal relationships. In fact, there is no limit to the complexity of the hidden process and that is one of the reasons it has been so attractive to computer scientists and statisticians alike.
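The Kalman filter forward recursion mentioned above can be written down compactly for the simplest case, a one-dimensional local-level model. This is a minimal sketch of the standard recursion, not code from the thesis; the variance parameters q and r are assumed known.

```python
import math

def kalman_filter_1d(y, m0, v0, q, r):
    """Forward recursion (Kalman filter) for a one-dimensional local-level model:
        z_t = z_{t-1} + w_t,   w_t ~ N(0, q)    (hidden state)
        y_t = z_t + e_t,       e_t ~ N(0, r)    (observation)
    Returns the filtered means and variances of z_t and the log-likelihood,
    accumulated from the one-step-ahead predictive densities of the y_t."""
    m, v = m0, v0
    means, variances, loglik = [], [], 0.0
    for obs in y:
        v_pred = v + q                     # predict: state variance grows by q
        s = v_pred + r                     # predictive variance of y_t
        loglik += -0.5 * (math.log(2 * math.pi * s) + (obs - m) ** 2 / s)
        gain = v_pred / s                  # Kalman gain
        m = m + gain * (obs - m)           # update mean towards the observation
        v = (1 - gain) * v_pred            # update variance
        means.append(m)
        variances.append(v)
    return means, variances, loglik
```

With q = 0 the state never moves, so the recursion collapses to evaluating each y_t under a fixed N(m0, r) density, which gives a simple sanity check.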
• Latent variables are able to facilitate efficient computation. This is because the complete likelihood, p(y, z | θ), has a much simpler form than the observed data likelihood, p(y | θ). The factorisation of latent variable models is revealed through examination of a directed acyclic graph (DAG). A DAG is a powerful aid to both understanding and communication because it encodes the dependence structure of the model thus providing the local Markov and global properties of the probability model (Pearl, 1988). DAGs
are discussed in greater detail in Section 2.3.4.
• Latent variables can also be used to improve the convergence properties of chains. In this context they are normally referred to as auxiliary variables (Besag and Green, 1993). They can be used to decouple the complex dependencies in MCMC output that lead to slow mixing. However, auxiliary variables have more general uses. For example, they can be used as an alternative to the Metropolis-Hastings algorithm, which requires the adjusting of tuning parameters. Damien et al. (1999) show how auxiliary variables can be used for sampling from non-standard densities; these uses are discussed in more detail in Section 2.4. Although auxiliary variables are used primarily to improve the convergence properties of the MCMC, they can have other uses. For example, see Møller et al. (2003).
Having outlined the main uses of hidden variables in statistics, we now describe some of the important works in their evolution. This history has an important bearing on the range of methods of estimation that are available today.
2.2
Early latent variable models
An important work in the history of research into latent variables is that of Dempster et al. (1977). This article greatly popularised the idea of missing data and it was initially through this approach that a great number of important concepts, such as mixed models, were subsequently tackled. In this section we describe the EM algorithm, write about convergence and discuss attempts by researchers to improve various aspects of the EM algorithm. These efforts ultimately led many researchers to realise that they would be better served by employing a fully stochastic algorithm and the Bayesian approach to inference.
2.2.1
The EM algorithm
Dempster et al. (1977) present a general formalisation of the EM (expectation-maximisation) algorithm using the method of maximum likelihood in the presence of incomplete or missing data. However, this article merely put the EM in a general form, recognising that the ideas had been around for specific applications for some time. Indeed, Dempster et al. (1977) credited a large number of individuals, including Baum (1972), in the development of the algorithm. In order to maximise the marginal likelihood, p(y | θ), the EM algorithm utilises the complete likelihood, p(y, z | θ), or the joint probability density function of the observations, y, and hidden data, z, given the parameters, θ. It repeatedly carries out an E-step then an M-step, described below, until convergence is attained.
• The E-step. Here the conditional expected value (with respect to the hidden data given the current estimate, θ_old, of the parameters and the observed data, y) of the complete log-likelihood is computed:

    Q(θ | y, θ_old) = E_{z | θ_old, y} [ log p(y, z | θ) ].

• The M-step. A new value of the parameter, θ_new, is found by

    θ_new = arg max_θ Q(θ | y, θ_old).
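For concreteness, the two steps can be sketched for a two-component Gaussian mixture with unit variances, where both steps are available in closed form. This is a generic textbook illustration, not an implementation of the models in this thesis; the function name and the fixed-variance simplification are ours.

```python
import math

def em_gauss_mixture(y, mu, w=0.5, iters=50):
    """EM for a two-component N(mu[0], 1) / N(mu[1], 1) mixture with weight w.
    E-step: responsibilities r_t = p(z_t = 1 | y_t, theta).
    M-step: closed-form updates of (w, mu).
    Returns the fitted parameters and the observed-data log-likelihood trace,
    which the EM theory says is monotonically non-decreasing."""
    mu = list(mu)
    trace = []
    for _ in range(iters):
        # E-step: posterior probability of component 1 for each observation
        r = []
        for yt in y:
            a = (1 - w) * math.exp(-0.5 * (yt - mu[0]) ** 2)
            b = w * math.exp(-0.5 * (yt - mu[1]) ** 2)
            r.append(b / (a + b))
        # observed-data log-likelihood at the current parameter value
        ll = sum(math.log(((1 - w) * math.exp(-0.5 * (yt - mu[0]) ** 2)
                           + w * math.exp(-0.5 * (yt - mu[1]) ** 2))
                          / math.sqrt(2 * math.pi)) for yt in y)
        trace.append(ll)
        # M-step: weighted means and the mixing weight
        s = sum(r)
        w = s / len(y)
        mu[1] = sum(rt * yt for rt, yt in zip(r, y)) / s
        mu[0] = sum((1 - rt) * yt for rt, yt in zip(r, y)) / (len(y) - s)
    return w, mu, trace
```

Running this on well-separated simulated data shows the monotone increase of the likelihood discussed next.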
The EM algorithm has the very attractive property that it converges monotonically in the likelihood to a maximum. The following inequality from Roweis and Ghahramani (1999) illustrates the property.
Let q(z) be any distribution over the hidden variables. First we establish a lower bound for the log of the marginal likelihood, L(θ) = log p(y | θ):

    L(θ) = log \int_z p(y, z | θ) dz
         = log \int_z q(z) [ p(y, z | θ) / q(z) ] dz
         ≥ \int_z q(z) log [ p(y, z | θ) / q(z) ] dz        (2.3)
         = \int_z q(z) log p(y, z | θ) dz − \int_z q(z) log q(z) dz
         = F(q(z), θ).
Inequality (2.3) follows from Jensen's inequality, which is a result of the concavity of the log function. Note that the second term in F is independent of θ and is the entropy of q(z). The EM algorithm alternates between maximising F with respect to the distribution, q(z), and the parameters, θ, in each case holding the other fixed. Hence the EM algorithm can be re-expressed as

• The E-step: q_{k+1} ← arg max_q F(q, θ_k)
• The M-step: θ_{k+1} ← arg max_θ F(q_{k+1}, θ).
We now show that this version of the E-step, the maximising of F(q, θ) with respect to q, amounts to setting q(z) to p(z | y, θ_k). To see this, let

    F(p(z | y, θ), θ) = \int_z p(z | y, θ) log [ p(y, z | θ) / p(z | y, θ) ] dz
                      = \int_z p(z | y, θ) log p(y | θ) dz
                      = log p(y | θ)
                      = L(θ).

But from (2.3), L(θ) is an upper bound for F, and therefore setting q(z) = p(z | y, θ) maximises F(q(z), θ) and sets it equal to L(θ). Thus p(z | y, θ) is our best guess
at the latent variable, z, and represents our averaging distribution on the next E-step. Note that for the E-step we close the gap between the function, L, and the lower bound, F, so that when the next M-step is applied it not only raises the lower bound but also increases L(θ). Note that instead of using an M-step, any step which increases the expected complete log-likelihood (for instance a single Newton-Raphson step) will result in an increase in the marginal log-likelihood. We now outline, in a little more detail, attempts by researchers to enhance the EM algorithm and correct its deficiencies. The trend over time was to replace the deterministic components of the EM algorithm by stochastic components. This led from Celeux and Diebolt (1985) to the data augmentation algorithm of Tanner and Wong (1987) and finally to the fully Bayesian approach of Smith and Roberts (1993) and Diebolt and Robert (1994).

1. A concern about the speed of the EM algorithm has led to a large amount of work on efforts to improve the rate of convergence, e.g. Louis (1982) and Lange (1995).

2. The EM algorithm suffers problems caused by multi-modality of the expected complete log-likelihood. This has led to research to provide mechanisms for escaping from local modes. An obvious way of dealing with this problem is to use a variety of different starting points and calculate the log-likelihood at convergence. A second is to introduce a simulated annealing scheme. Geman and Geman (1984) show that a suitably designed cooling schedule will guarantee convergence to the global mode.

3. A serious deficiency is that the standard error is not a natural by-product of the algorithm and its calculation can be highly problematic (Jamshidian and Jennrich, 2000). Meng and Rubin (1991) attempt to solve this problem by constructing a Supplemented EM algorithm which obtains the asymptotic variance-covariance matrix by monitoring the rate of convergence, which is related to the fraction of missing data.
Baker (1992) presents a review of
methods for calculation of standard errors. Jamshidian and Jennrich (2000) use numerical differentiation to calculate them.
4. Neither the E-step nor the M-step may have an analytical solution. In the absence of an analytical M-step a variety of iterative methods can be used. One solution is to perform just a single Newton step. Another, known as the ECM algorithm, is to apply a sequence of conditional maximisation steps (Meng and Rubin, 1993). Of these two problems, the E-step has the potential to be particularly difficult. Wei and Tanner (1990) suggest using Monte Carlo methods to estimate the expectation. In a Bayesian approach, Celeux and Diebolt (1985) formulated the stochastic EM. This incorporates an S-step which simulates a realisation of the missing data set from the posterior density based on the current estimate, which is then updated by maximising the likelihood function of the restored data set. If necessary, the S-step can be completed through Gibbs sampling or Metropolis-Hastings. The stochastic expectation-maximisation algorithm provides a Monte Carlo approximation of the observed data information through the Louis (1982) formulation (Diebolt and Ip, 1996).
In the context of maximising the posterior rather than the likelihood, Tanner and Wong (1987) suggested replacing the E-step by an imputation step, where the latent variables, z^(1), ..., z^(m), are drawn from an approximation to p(z | y), followed by a posterior step instead of an M-step, where p(θ | y) is approximated by m^{-1} \sum_{i=1}^{m} p(θ | z^(i), y). Since we are sampling directly from the posterior distribution the standard error is not needed. A simpler and more effective technique of estimating the posterior distribution of parameters of interest in the presence of missing data came with the formulation of the fully Bayesian approach.
2.3
Mixtures, finite mixtures and their generalisations
Two different types of mixture are created according to whether the latent variable, z, is discrete or continuous.
2.3.1
Discrete z
A finite mixture is created when the hidden process is allocated a discrete variable (or a state). Following from (2.2), the probability of observation y_t is:

    p(y_t | θ) = \sum_{z_t} p(z_t | θ_z) p(y_t | z_t, θ_y)        (2.4)
               = \sum_i p_i f_i(y_t; θ_y).                        (2.5)
This is a conventional finite mixture. The finite mixture has components, or a basis, f_i, and weights, p_i. (2.5) shows the alternative representation of the finite mixture without any reference to a latent variable. Finite mixtures are widely used in statistics for describing multi-modal distributions (Robert, 1996), contaminated data and count data with an excess of zeros. The computational advantage that latent variables offer can be seen by comparing the likelihood of the observations, y, expressed without the latent variable, referred to as the observed data likelihood, to the likelihood of the observations and the latent variable, which is called the likelihood for the complete data or the complete likelihood. If the observations are conditionally independent and the states independent then the following hold. The observed data likelihood is

    L_o(θ) = p(y | θ) = \prod_{t=1}^{T} \sum_{z_t} p(z_t | θ_z) p(y_t | z_t, θ_y)        (2.6)
and the complete likelihood is

    L_c(θ) = p(y, z | θ) = \prod_{t=1}^{T} p(z_t | θ_z) p(y_t | z_t, θ_y).        (2.7)
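The relationship between (2.6) and (2.7) can be checked numerically for a small Gaussian mixture: summing the complete likelihood (2.7) over all state configurations z recovers the observed data likelihood (2.6). A sketch, with our own helper names:

```python
import math

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def observed_loglik(y, p, mus, var=1.0):
    """Observed data log-likelihood (2.6): a sum over the state inside a
    product over observations."""
    return sum(math.log(sum(pi * norm_pdf(yt, mu, var)
                            for pi, mu in zip(p, mus))) for yt in y)

def complete_loglik(y, z, p, mus, var=1.0):
    """Complete data log-likelihood (2.7): with the states z treated as data,
    the sum over states disappears and the terms factorise."""
    return sum(math.log(p[zt] * norm_pdf(yt, mus[zt], var))
               for yt, zt in zip(y, z))
```

With T observations and N states there are N^T configurations in the check, which is exactly why the direct summation is avoided in practice.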
Estimation from (2.6) directly is problematic. Lehmann (1983) shows that, because there is a finite possibility that a component has no observations, instability may be observed when carrying out maximum likelihood (M.L.) estimation of the parameters of a finite mixture model. In some situations, M.L. estimates may not exist. See Titterington et al. (1985) for a survey of methods of estimation for mixtures. In using a latent variable formulation, the log-likelihood of the complete data becomes the sum of separate terms, those involving θ_z and those involving θ_y, assuming there are no common parameters. Thus estimates of the parameters can be found by using simple iterative procedures such as the EM algorithm (Dempster et al., 1977) or Gibbs sampling (Gelman et al., 1996). Thus the latent variable can be seen as arising from the need for computational ease. Note that Equation (2.7) marginalised over z gives Equation (2.6), thus qualifying this as a latent variable model. This approach towards the analysis of mixtures was outlined in Smith and Roberts (1993) and applied to mixtures by Diebolt and Robert (1994). In this formulation, the missing data z and the parameters θ are given equal status and sampled alternately from their full conditionals in what is known as a Gibbs scheme:

    θ ∼ p(θ | y, z),
    z ∼ p(z | y, θ).        (2.8)

Marginal inferences about θ can be made by simply ignoring z in the output. Inference on the z's enables us to find the weighting given to the ith component of each data point, y_t.
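The Gibbs scheme (2.8) can be sketched for a two-component Gaussian mixture with unit variances, equal weights and conjugate normal priors on the component means, so that both full conditionals are available exactly. This is an illustrative sketch under those simplifying assumptions, not one of the samplers used later in the thesis.

```python
import math, random

def gibbs_mixture(y, iters=2000, prior_var=100.0, rng=None):
    """Two-block Gibbs sampler in the style of (2.8) for a two-component
    N(mu_k, 1) mixture with equal weights and N(0, prior_var) priors on the
    means: alternately draw z | y, theta and theta | y, z."""
    rng = rng or random.Random(0)
    mu = [min(y), max(y)]                    # crude ordered starting values
    draws = []
    for _ in range(iters):
        # z | y, theta: independent categorical draws, one per observation
        z = []
        for yt in y:
            w0 = math.exp(-0.5 * (yt - mu[0]) ** 2)
            w1 = math.exp(-0.5 * (yt - mu[1]) ** 2)
            z.append(1 if rng.random() < w1 / (w0 + w1) else 0)
        # theta | y, z: conjugate normal full conditional for each mean
        for k in (0, 1):
            yk = [yt for yt, zt in zip(y, z) if zt == k]
            prec = 1.0 / prior_var + len(yk)   # posterior precision (obs var 1)
            mean = sum(yk) / prec              # posterior mean (prior mean 0)
            mu[k] = rng.gauss(mean, 1.0 / math.sqrt(prec))
        draws.append(tuple(mu))
    return draws
```

Averaging the retained draws of each mean, after a burn-in, gives the marginal posterior inference about θ described above, with z simply ignored.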
For hidden Markov models, the assumption about independence of states is dropped and instead the states z are given the Markov property. For instance, the states can be given the first order Markov property by allowing each state to be dependent on just its neighbouring states:

    p(z_t | z_{\setminus t}) = p(z_t | z_{t−1}, z_{t+1}),  where z_{\setminus t} = {z_i, i ≠ t}.

HMMs have had a remarkable impact in a large number of fields and we outline the history of research in this area in Section 2.3.3. Further generalisations of the HMM can be made by dropping the conditional independence assumption of the observation process. The conditional independence assumptions of the observations are discussed in Section 2.3.4.
2.3.2
Continuous z
Equation (2.1) can be used to expand the range of available distributions by allocating z a continuous distribution. If p(y | z) is allocated the one-parameter Poisson distribution, p(y | z) = Po(y; z), where z is its rate parameter, and if z is allocated a gamma distribution, p(z | θ) = Ga(z; α, β), this yields, upon a change of variable and integration, the two-parameter negative-binomial distribution (Cameron and Trivedi, 1998). Because of its ability to model over-dispersion, this has wider applicability. In a similar way, if the proportion parameter of the binomial distribution is allowed to take on a beta distribution then this generates the beta-binomial distribution. Exactly the same thing happens if the variance parameter of the Gaussian distribution is given an inverse-gamma distribution, generating a t-distribution. All of the distributions mentioned are useful for describing over-dispersion.

Problems with mixtures

We now discuss the label switching problem mentioned in the last chapter. This can cause unwanted multi-modality in the conditional posterior distributions and
will disrupt attempts to find the marginal likelihood. Often this problem is not easily solved. Richardson and Green (1997) attempt to solve it by removing the symmetry and assuming an artificial ordering on one of the parameters, but a close examination of plots of their posterior distributions shows that multi-modality has not been eliminated. They order on the means, μ_i, of each component of the mixture, but other authors have ordered on the variances, σ_i². Stephens (2000) argues that imposing such an ordering on the Markov chain is ineffective in solving the problem and that the MCMC should be allowed to run its natural course without constraints. Stephens (2000) describes a decision theoretic approach which involves the post-processing of the MCMC output. By selecting the permutation of the component parameters so that the posterior expectation of a loss function is minimised, uni-modality can be preserved as far as is possible. He argues that the label switching problem can be viewed as a version of the transport problem known as the assignment problem in the operations research literature. See, for example, page 195 of Taha (1989). Another problem with mixtures, pointed out by Mengersen and Robert (1995), arises when the priors for the parameters are non-informative. They claim that when only a few observations are allocated to one component, the Gibbs sampler is unable to escape from a local mode, leading to trapping states. Their solution to the problem is a re-parametrisation of the mixture.
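A crude version of such post-processing can be sketched as follows: for each MCMC draw, choose the permutation of component labels closest, in squared distance, to a reference ordering. This is only a simple stand-in for the decision-theoretic relabelling of Stephens (2000), whose loss acts on classification probabilities rather than raw parameter values; the function name is ours.

```python
import itertools

def relabel_draws(draws, reference):
    """Post-process MCMC output for label switching: for each draw (a tuple of
    component means), pick the permutation of its entries that minimises the
    squared distance to a reference ordering."""
    out = []
    for d in draws:
        best = min(itertools.permutations(d),
                   key=lambda perm: sum((a - b) ** 2
                                        for a, b in zip(perm, reference)))
        out.append(best)
    return out
```

After relabelling, componentwise posterior summaries (means, intervals) are no longer contaminated by the switched draws.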
2.3.3
The hidden Markov model
Hidden Markov models (HMMs) (MacDonald and Zucchini, 1997), a generalisation of mixture models, are a rich and diverse family and are extensively used in statistics and allied disciplines. Their strength is their ability to describe an unobserved causal process, z, given a well defined observation process, y | z. Over the years, numerous applications have emerged in areas as diverse as acoustics (Raphael, 1999), bioinformatics (Boys et al., 2000), economics (Billio et al., 1999),
spatial data analysis (Green and Richardson, 2002), speech recognition (Baker, 1975), character recognition (Raviv, 1966), climatology (Hughes et al., 1999), image processing (Li et al., 2000) and neurophysiology (de Gunst et al., 2001). An HMM consists of two processes: a hidden process, or a sequence of states that evolves in a Markov manner, and an observed process that is dependent on this hidden process. Here we present a brief history of research regarding the HMM. In the next two chapters two applications in this field are presented. The first of the applications comes from the frequentist paradigm and applies the methodology of the Baum-Welch algorithm (Baum, 1972) to an autoregressive HMM. The second uses a Bayesian approach for inference on the same kind of model.
A brief history of the HMM

HMMs were introduced by Baum and Petrie (1966). Baum et al. (1970) developed forward-backward recursions for estimation of the states and an iterative procedure for finding the maximum likelihood parameters of the HMM. This procedure is just the expectation-maximisation (EM) algorithm of Dempster et al. (1977) and is sometimes referred to as the Baum-Welch algorithm. HMMs are an obvious generalisation of a mixture model but, unlike a mixture model, the observations are not independent. HMMs are a special case of an autoregressive HMM, or a switching autoregressive process under a Markov regime (Hamilton, 1994). In these models the type of autoregressive process is determined by a hidden Markov chain. When the order of the autoregressive process is zero, the switching autoregressive process degenerates into an HMM. Another close relative of the HMM is the Markov modulated Poisson process (MMPP) (Fisher and Meier-Hellstern, 1992). These are Poisson processes where the rate parameter is controlled by a hidden continuous time Markov chain. One of the first applications of the HMM was in the field of automatic character recognition (Raviv, 1966). Another major application of HMMs occurred in the field of speech recognition (Jelinek, 1976; Baker, 1975). Linguistic decoding was performed by the Viterbi algorithm (Viterbi, 1967), which is a recursion for finding the most probable sequence of states. Applications of the HMM to speech processing were further studied by Baker (1980) and Rabiner (1989). These studies popularised the HMM and since then applications have become very widespread. Conditions for identifiability of an HMM were developed by Leroux (1992) and Ryden (1994). Conditions for identifiability of a Markov modulated Poisson process were given by Ryden (1995). A Gibbs sampling Bayesian approach for estimating the parameters of an HMM was developed by Robert et al. (1993). Scott (2002) adapted the forward-backward recursions to a Bayesian setting. Finally, Robert et al. (2000) showed how reversible jump MCMC could be used to estimate the assumed unknown number of states of a Bayesian HMM.
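The Viterbi recursion mentioned above can be sketched as follows for a generic discrete HMM; working in log space avoids the underflow that products of many small probabilities would cause. The function name and toy interface are ours.

```python
import math

def viterbi(y, pi0, P, emis):
    """Viterbi recursion (Viterbi, 1967): the most probable state path
    arg max_z p(z, y) for a discrete HMM, computed in log space."""
    N, T = len(pi0), len(y)
    delta = [math.log(pi0[i]) + math.log(emis[i][y[0]]) for i in range(N)]
    back = []
    for t in range(1, T):
        new_delta, pointers = [], []
        for j in range(N):
            # best score of any path ending in state j at time t
            scores = [delta[i] + math.log(P[i][j]) for i in range(N)]
            best = max(range(N), key=lambda i: scores[i])
            pointers.append(best)
            new_delta.append(scores[best] + math.log(emis[j][y[t]]))
        back.append(pointers)
        delta = new_delta
    # backtrack from the best final state through the stored pointers
    path = [max(range(N), key=lambda i: delta[i])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

For short sequences the result can be verified against brute-force enumeration of all N^T paths.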
2.3.4 Graphical models
This thesis is concerned with generalisations of the mixture model, most of which entail the relaxation of the independence assumption for the hidden variable. In its place we use a hidden process with an alternate dependence structure. An important device in the exploration of dependence structures is the graphical model. Graphical models comprise a fusion of graph theory and probability theory and offer an elegant formalism by which various models can be portrayed. With this device conditional relationships can be explored and the model factorised. A directed acyclic graph (DAG) (Wermuth and Lauritzen, 1983) is a particular type of graphical model, G, and consists of vertices, V (which denote parameters, latent variables and data), and edges, E, which assume a causal direction (used in the statistical sense). Interpreting a statistical problem in terms of a causal process is an effective and economical way of understanding and communicating that problem. In a directed graph, each node, ν ∈ V, is expressed in terms of its parents, pa(ν). The local Markov properties, or the dependence properties of each node, can then be deduced using the formulation of Pearl (1988), which involves the moralisation, or marrying of co-parents of nodes, to form a moral graph which is undirected. A DAG for the mixture model is shown in Figure 2.1(a) and for the hidden Markov model in Figure 2.1(b). From both of these DAGs, for example, the dependence structure of the observations can be seen to be:

p(yt | . . .) = p(yt | zt, θy).  (2.9)
Figure 2.1(c) shows the autoregressive hidden Markov model (AHMM), where the dependence structure of the observations can be seen to be: p(yt | . . .) = p(yt | yt−1, yt+1, zt, θy). AHMMs are used in Chapters 3 and 4. The joint probability distribution that can be represented by a DAG inherits a particularly simple recursive factorisation:

p(V) = ∏_{ν∈V} p(xν | pa(ν)),  (2.10)
where ν ∈ V are the nodes belonging to the set of vertices of the DAG, G = {V, E}. In Bayesian statistics, DAGs play an important role because it is through an application of (2.10) that the full probability model is obtained. This is the joint probability distribution of all observations, states and parameters. From this the full conditionals are obtained, and these define repeated sequential draws of the parameters and states from the probability model. For example, using (2.10), the full probability model of the mixture depicted in Figure 2.1(a) can be expressed as

π(θz) π(θy) ∏_{t=1}^{T} p(zt | θz) p(yt | zt, θy),  (2.11)
Figure 2.1: (a) The graphical model for the mixture model and two generalisations of it, (b) the hidden Markov model (HMM) and (c) the autoregressive hidden Markov model (AHMM).
where π(θz) and π(θy) are priors. From the full probability model, (2.11), Gibbs sampling for the parameters is particularly simple to carry out if π(θz) is set to the conjugate of p(z | θz) and π(θy) is set to the conjugate of p(y | θy, z). For instance, because p(z | θz) is multinomial, if π(θz) is allocated a Dirichlet distribution then draws from θz | . . . can also be made from the Dirichlet distribution. Using the same method, the full probability model for the AHMM, shown in Figure 2.1(c), is

p(θz) p(θy) ∏_{t=1}^{T} p(zt | zt−1, θz) p(yt | yt−1, zt, θy).

This time p(zt | zt−1, θz) is described by a transition matrix and θz refers to the parameters of this matrix. In this case each row of the transition matrix is allocated a Dirichlet prior, and each row is then sampled from the Dirichlet distribution. This is discussed in much greater detail in Chapter 4. A consequence of the simple recursive factorisation of a DAG, (2.10), is that the normalising constant of the joint probability density function is easily calculated. On the other hand, for some kinds of undirected graphical model, such as the auto-logistic model, a model that is used for portraying spatial binary data, there is no such simple factorisation and hence the calculation of the normalising constant is a major problem. Undirected models can be used for representing latent variables and offer some distinct advantages. We look at the advantages and disadvantages of using directed and undirected graphical models for spatial data sets in Chapter 5.
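The conjugate Dirichlet update for the transition matrix described above can be sketched as follows (our own illustration; the function name and the flat prior value are assumptions): each row is drawn from a Dirichlet whose parameters are the prior counts plus the transition counts tallied from the current draw of the hidden state sequence z.

```python
import numpy as np

def sample_transition_matrix(z, K, prior=1.0, rng=None):
    """Draw each row of a K-state transition matrix from its Dirichlet
    full conditional, given the current hidden state sequence z."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros((K, K))
    for a, b in zip(z[:-1], z[1:]):     # tally observed transitions
        counts[a, b] += 1
    # conjugacy: Dirichlet row prior + multinomial transition counts
    return np.array([rng.dirichlet(prior + counts[k]) for k in range(K)])
```

Each call gives one sweep's draw of the transition matrix within a larger Gibbs sampler.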
2.4 Auxiliary variables
For Bayesian applications using MCMC, convergence of the chain becomes an issue as the complexity of the model increases. Convergence is also problematical if the variables are highly correlated. Augmented or missing data, used principally to improve the mixing of the Markov chain, are known as auxiliary variables.
A description of the use of an auxiliary variable follows; the first part follows Besag (2001). Let {f(x) : x ∈ S} denote a probability distribution for a random quantity x, where x is possibly, but not necessarily, a parameter. Given the current realisation x ∈ S, we create a random vector U whose conditional distribution q(u | x) is under our control. Then, given x and U = u, the subsequent state x′ ∈ S is drawn from the conditional distribution η(x′ | x, u), where, to achieve detailed balance, the following expression must be satisfied:

f(x) q(u | x) η(x′ | x, u) = f(x′) q(u | x′) η(x | x′, u).  (2.12)
An important special case of an auxiliary variable that satisfies (2.12) arises if η is chosen to be the conditional distribution of x given u, that is, η(x′ | x, u) ∝ f(x′) q(u | x′). Thus x and u can be sampled alternately using block Gibbs sampling. The most important successes of the auxiliary variable technique have been for:

• Sampling from non-standard distributions
Slice sampling (Neal, 2000) can be used to sample from non-standard distributions as an alternative to the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) when Gibbs sampling cannot be used. Unlike the Metropolis-Hastings algorithm, it does not require tuning. Let f(x) be the un-normalised distribution, with normalising constant Z, that we wish to sample from, and let the pair (x, u) be uniformly distributed over the region below the curve:

p(x, u) = 1/Z,  0 < u < f(x),

and 0 otherwise. Then the marginal distribution of x can be shown to be p(x) = f(x)/Z, which is the distribution that we wish to sample from. The joint distribution can be re-expressed as:

p(x, u) ∝ f(x) I(u < f(x)).

Thus if we sample (x, u) jointly, then u is simply ignored in the output. Sampling from (x, u) is done by Gibbs sampling from the full conditionals of x and u. Sampling from u | x ∼ U(0, f(x)) is simple, but for x | u, where x is sampled uniformly over the slice, x | u ∼ U({x : f(x) > u}), the procedure is not always as easy, particularly when the distribution is multimodal. Slice sampling is closely related to the Swendsen-Wang algorithm (Swendsen and Wang, 1987), which is described later in this section. An extension to slice sampling, which can involve the simulation of several auxiliary variables, was proposed by Damien et al. (1999) and we describe it briefly below. It avoids the problem of sampling from the horizontal slice by expressing the target distribution, f, in terms of invertible functions, li(x), i = 1, 2, . . . , N, as explained below. Let

f(x) ∝ π(x) ∏_{i=1}^{N} li(x).  (2.13)
Here the li(x) are functions, not necessarily density functions, and π(x) is a density of known form. To sample from f we introduce the non-negative random variables U = {ui, i = 1, . . . , N} and define the joint density of U and x to be:

f(x, u1, u2, . . . , uN) ∝ π(x) ∏_{i=1}^{N} I[0 ≤ ui ≤ li(x)].  (2.14)

Note that Equation (2.14) marginalises to (2.13). The full conditionals for the ui are

p(ui | x) = (1/li(x)) I[0 ≤ ui ≤ li(x)],  i = 1, 2, . . . , N,  (2.15)

which means that ui is sampled from ui ∼ U(0, li(x)), i = 1, 2, . . . , N. The full conditional for x can be obtained by Bayes' theorem using (2.13) and (2.15), which yields

p(x | u) = π(x) ∏_{i=1}^{N} I[ui ≤ li(x)].  (2.16)
This involves sampling from π subject to the condition that ui ≤ li(x). This step can be problematical in some cases.
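As a concrete illustration of the simple single-auxiliary-variable slice sampler described earlier in this section, the following univariate sketch locates the slice numerically by stepping out and shrinkage, in the style of Neal (our own illustration; the initial interval width w is a tuning assumption):

```python
import numpy as np

def slice_sample(f, x0, n, w=1.0, rng=None):
    """Univariate slice sampler for an un-normalised density f,
    using stepping out and shrinkage to sample x | u."""
    rng = np.random.default_rng() if rng is None else rng
    x, out = x0, []
    for _ in range(n):
        u = rng.uniform(0.0, f(x))        # vertical level: u | x ~ U(0, f(x))
        # step out an interval [l, r] that brackets the slice {x : f(x) > u}
        l = x - w * rng.uniform()
        r = l + w
        while f(l) > u:
            l -= w
        while f(r) > u:
            r += w
        # shrink until a uniform draw lands inside the slice
        while True:
            x1 = rng.uniform(l, r)
            if f(x1) > u:
                x = x1
                break
            if x1 < x:
                l = x1
            else:
                r = x1
        out.append(x)
    return np.array(out)
```

For multimodal targets the stepping-out interval may fail to bracket distant modes, which is exactly the difficulty with sampling x | u noted above.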
If each li(x) is an invertible (monotone increasing) function then x can be sampled from the left-truncated distribution: x ∼ π(x) restricted to x ≥ li^{-1}(ui), i = 1, 2, . . . , N.
Damien et al. (1999) show how this scheme can be used to sample from a large number of non-conjugate pairs of priors and likelihoods. Below we describe the operation of the Swendsen-Wang algorithm (Swendsen and Wang, 1987).

• Simulating the auto-logistic or Potts models with high spatial interaction
The autologistic model is used for describing lattices of binary data by using a first order Markov random field. The Ising model is a special case of the autologistic model, and the Potts model is a generalisation of it to N states rather than two. The Swendsen-Wang algorithm (Swendsen and Wang, 1987) was designed to speed up convergence of the autologistic and Potts models in the presence of high spatial interactions. The method achieves this by using the latent variable to decouple the term describing abundance from the term describing spatial interaction. We let S be the set of lattice sites, with pairwise adjacencies denoted by i ∼ j. The parameters are θ = {α, β}, where α is the abundance parameter and β describes the spatial interactions. The likelihood can be written as:

L(x) ∝ exp{ Σ_{i∈S} αi(xi) } × exp{ Σ_{i∼j} βij I[xi = xj] },  x ∈ {0, 1}^n.
The first term is the abundance term and the second term the interaction term. We assume that βij > 0. The full conditionals for the uij are then:

p(uij | x) = (1 / exp{βij I[xi = xj]}) I(0 ≤ uij ≤ exp{βij I[xi = xj]}),

that is, uij | x ∼ U(0, exp{βij I[xi = xj]}).
Thus the interaction parameter partitions the lattice into several clusters C according to whether uij > 1 or uij ≤ 1. Using a similar argument to that above, the full conditional for x is thus:

p(x | u) ∝ exp{ Σ_{i∈S} αi(xi) } × ∏_{i∼j} I[0 ≤ uij ≤ exp{βij I[xi = xj]}].
x is easy to sample from because uij > 1 implies that exp{βij I[xi = xj]} > 1, and so uij > 1 implies that xi = xj. The sites within one cluster C therefore take a common value, and for one cluster C the sites xi ∈ C are set to one with probability proportional to exp{Σ_{i∈C} αi(xi)}. Nott and Green (2002) use the same concept to solve the problem of variable selection for regression in the presence of collinearity.

• Dealing with intractable normalising constants in spatial models
Bayesian estimation for the autologistic model involves Metropolis-Hastings updates of the parameters, which require the ratio of two likelihoods with differing normalising constants. Despite work by Pettitt et al. (2003) and Friel and Pettitt (2004), the calculation of this constant for the autologistic model is computationally difficult for lattices of width greater than 20 pixels. Møller et al. (2003) have greatly extended the size of lattices for which estimation may be performed by simulating an auxiliary lattice which, in effect, allows the normalising constants to cancel when the Metropolis-Hastings update of the parameters is required.

• Accelerating convergence of estimation with probit models
Probit models use a Gaussian latent variable with variance one (McCullagh and Nelder, 1989). Van Dyk and Meng (2001) describe how, by allowing the variance parameter to vary, the convergence properties of the MCMC can be improved. We use this innovation in our last chapter.
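Returning to the Swendsen-Wang construction above, the decoupling can be sketched for a binary autologistic field on a rectangular lattice (our own illustration, with constant α and β as simplifying assumptions): like-valued neighbours are bonded with probability 1 − exp(−β), corresponding to uij > 1, and each resulting cluster is then relabelled as a block using the abundance term alone.

```python
import numpy as np

def swendsen_wang_step(x, alpha, beta, rng):
    """One Swendsen-Wang update of a binary field x on an n x m lattice with
    likelihood  exp{ alpha * sum_i x_i + beta * sum_{i~j} I[x_i = x_j] }."""
    n, m = x.shape
    p_bond = 1.0 - np.exp(-beta)
    parent = np.arange(n * m)               # union-find over lattice sites

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    # bond like-valued neighbours (right/down edges) with prob 1 - exp(-beta)
    for i in range(n):
        for j in range(m):
            for di, dj in ((1, 0), (0, 1)):
                ii, jj = i + di, j + dj
                if ii < n and jj < m and x[i, j] == x[ii, jj]:
                    if rng.uniform() < p_bond:
                        parent[find(i * m + j)] = find(ii * m + jj)

    # relabel each cluster as a block: P(cluster = 1) proportional to
    # exp(alpha * cluster size), i.e. a logistic of the summed abundance
    roots = np.array([find(k) for k in range(n * m)])
    new = np.empty(n * m)
    for r in np.unique(roots):
        size = (roots == r).sum()
        p1 = 1.0 / (1.0 + np.exp(-alpha * size))
        new[roots == r] = 1.0 if rng.uniform() < p1 else 0.0
    return new.reshape(n, m)
```

Because whole clusters flip at once, the update mixes well even when β is large and single-site Gibbs updates would stall.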
2.5 Bayesian model selection
This thesis is confined to likelihood based model selection techniques. But even within these techniques there is a wide range of choice, and considerable care needs to be taken in choosing the appropriate model selection process. The most important factor is careful consideration of the purpose for which the model has been designed. Bernardo and Smith (1994) distinguish between explanatory models, used by scientists trying to establish the truth of the matter, and predictive models, used by technologists whose primary aim is to predict and control. Another important consideration is whether priors on models have any meaning. Bernardo and Smith (1994) define three alternative ways in which models can be viewed. These are:

1. The M-closed view
Here the Bayesian method of model selection is simple in principle. Priors are placed on all unknowns, including those for model choice, and these are updated with respect to the data, giving the posterior distributions. There are N models, M = {Mj}, j = 1, . . . , N, under consideration, all with priors p(Mj), and one of the models is believed to be true.

2. The M-completed view
Here it does not make sense to assign priors, but models are evaluated and compared in terms of beliefs, p(x | Mj), given the model. For instance a model may have attractions because of parsimony, ease of use, or because it fits in with established theories.

3. The M-open view
This is the same as the M-completed view in that the priors have no meaning. However, this time model choice is done without the existence of a belief model.

We now examine the M-closed and M-open views of model selection in the presence of missing data.
2.5.1 The M-open view
The M-open view of Bernardo and Smith (1994) can be looked on as the exploratory stage of modelling. Generally, a measure of model adequacy which combines both goodness-of-fit and model complexity is needed. For a measure of fit we use measures based on the likelihood. However, in the case of latent variables there are a number of ways that the likelihood can be defined, and we discuss them below.

The likelihood and the deviance
When constructing predictive models, the likelihood and deviance can both be looked on as penalties for prediction. However, in the presence of missing data both the likelihood and prediction can be defined in a variety of ways, as we discuss below. Consider a model describing a data set y with latent variables z and parameters θ. Possible likelihoods are:

• p(y | θ): the marginal likelihood.
• p(y, z | θ): the complete likelihood.
• p(y | z, θ): the conditional likelihood.

If random effects (φi, i = 1, 2, . . . , N) are added to the model, even more ambiguity is added. We must then decide, for the likelihoods defined above, whether the random effects should be kept in the likelihood or integrated out. Once the likelihood has been defined we can define the deviance. For example, the marginal deviance is defined as

D(θ) = −2 log p(y | θ) + 2 log p(y)  (2.17)

and can be looked on as a logarithmic penalty for prediction of the observations. Here p(y) is a standardising term that is a function of the data alone. Frequentists set this by applying a saturated model, which is to use a distinct parameter for each data point, and calculating the likelihood. However, when comparing models for a fixed data set, log p(y) can be set to zero without loss of generality. A likelihood or a deviance can be looked on as a penalty for replication. However, the word replication, like the likelihood, can take on a variety of meanings. It can mean, for example:
1. Replication of a new observation given the current state and individual.
2. Replication of a new observation from a new state given the individual.
3. Replication of a new observation from a new individual given the state.
4. Replication of a new observation from a new individual and a new state.

These alternative definitions of replication are depicted in Figure 2.2. So when a good predictive model is sought, it is important to be precise about the kind of replication that fits the statistical question.

Integrating out the latent states
The marginal deviance, (2.17), a measure of fit, requires the calculation of the marginal likelihood, which requires the integrating out of the latent variable from the complete likelihood. We discuss this problem briefly. In the case of a model that contains continuous latent variables only, the marginal likelihood can be obtained by integrating the continuous states out of the complete likelihood:

Lm(θ) = ∫ p(y | z, θy) p(z | θz) dz,  (2.18)

where θ = {θy, θz} is partitioned into those parameters that describe the hidden variables, θz, and those parameters that describe the observations given the states, θy. If the components of (2.18) are conjugate then the integral can be calculated analytically. However, the integral is usually intractable. In Chapter 6 of this thesis we use the following method, which relies on the assumption of conditional independence of the observations, that is, equation (2.9). For each observation, we make R draws from the distribution of the latent variable, zt^(r) ∼ p(z | θz), r = 1, 2, . . . , R, and then calculate

L̂t = (1/R) Σ_{r=1}^{R} p(yt | zt^(r), θy).  (2.19)

Having done this for each data point, the likelihood for all the data can be approximated as L̂m(θ) = ∏_{t=1}^{T} L̂t. In the case of a discrete hidden variable, the
Figure 2.2: The DAG depicted by (a) shows the relationship between the data, y, the latent states, z, the parameters for the random effects, θz, random effects for the observations, θy, and the hyper-parameters for the observations, Φ. (b) shows replicates of a new observation, y*, given the state and individual. (c) shows replication of a new observation from a new state, z*, given the individual. (d) shows replication of a new observation from a new individual, θy*, given the current state. (e) shows replication of a new observation from a new individual, θy*, and a new state, z*.
marginal likelihood can be obtained by summing the complete likelihood over the states:

Lm(θ) = Σ_z p(y | z, θy) p(z | θz).  (2.20)

If the dimension of zt is low then (2.20) can be summed over all possibilities. Alternatively, (2.19) could be tried after zt has been sampled from the prior. In Section 2.5.2 we approach the even more difficult problem of integrating out the parameters of the model.

The AIC and BIC
The BIC (Schwarz, 1978) is used when an explanatory model is required and is used to establish the empirical truth of a model. The BIC is related to the log of the marginal probability, log p(y | M) (see Section 2.5.2). The AIC (Akaike, 1978), on the other hand, is typically used by frequentists as a measure of the predictive power of a model. The type of deviance used will depend on the type of prediction required. Expressions for the AIC and BIC are given below:

AIC = D(θ̂) + 2p,
BIC = D(θ̂) + p log T.

However, neither the AIC nor the BIC account for the reduction in complexity that occurs if the parameters are correlated, as in random effects models. This was a major motivation for the development of the deviance information criterion (DIC) (Spiegelhalter et al., 2002).

The DIC
The DIC, introduced specifically for Bayesian models by Spiegelhalter et al. (2002), uses an alternative measure of the complexity of a model. However, the DIC was not developed for mixture models and research into the DIC for this purpose is in its very early stages. Because of this there is considerable doubt about
whether the DIC is able to be of any reliable assistance when comparing mixture models. The DIC for comparing HMMs is discussed in Chapter 4. For the definition of the DIC we require the posterior mean of the deviance, D̄, and the deviance, D(θ̂), evaluated at a typical posterior value, θ̂, such as the mean. The complexity of the model, pD, is estimated by the difference D̄ − D(θ̂). The DIC is then defined as pD + D̄. The DIC in its original form was not designed for missing data models, although the discussion of Spiegelhalter et al. (2002) contains speculations by various authors about how the DIC for missing data models could be defined. Ambiguity in its definition is caused partly by the deviance term, as the likelihoods can themselves be defined in various ways in the presence of missing data. Indeed, Celeux et al. (2003) examine eight different definitions of the DIC for mixture models. In one of their studies they simulate from a four component mixture (N = 4) and test their eight measures of DIC for mixtures with numbers of components N = 2, 3, . . . , 7. The only DIC that performs adequately is based on the complete likelihood. All of the others either produce negative pD's, have pD's that do not increase with complexity (N), or produce DICs that are too variable from one model to the next and thus are not able to pick the correct model.
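Given posterior draws of θ and a chosen deviance function, the quantities D̄, D(θ̂) and pD just defined can be computed directly, as in the following sketch (our own illustration; it plugs in the posterior mean for θ̂):

```python
import numpy as np

def dic(theta_draws, deviance):
    """DIC = Dbar + pD, with pD = Dbar - D(theta_hat),
    where theta_hat is taken to be the posterior mean of the draws."""
    theta_draws = np.asarray(theta_draws, dtype=float)
    d_bar = np.mean([deviance(t) for t in theta_draws])   # posterior mean deviance
    d_hat = deviance(theta_draws.mean(axis=0))            # deviance at posterior mean
    p_d = d_bar - d_hat                                   # effective number of parameters
    return d_bar + p_d, p_d
```

The ambiguity discussed above enters entirely through the choice of the deviance function passed in: marginal, complete or conditional likelihoods give different DICs.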
Model selection using measures of distance
The Kullback-Leibler (KL) distance between two distributions f(x) and g(x) is:

KLD(f, g) = ∫ f(x) log( f(x)/g(x) ) dx.  (2.21)

Robert (1996) suggests using (2.21) for measuring the distance between a one and a two component mixture, where if (2.21) is below a certain threshold then the two component mixture is not warranted. However, closed expressions for (2.21) are not available even for small numbers of components, so analytical approximations such as the Laplace approximation must be used. Another alternative is to use the Hellinger distance (Robert and Rousseau, 2003).
2.5.2 The M-closed view
Suppose there are N models under consideration, M = {Mj}, j = 1, . . . , N. Let the observed likelihood under each model be given by p(y | zj, θj, Mj), the prior for the missing data for each model be p(zj | θj, Mj), the prior for the parameters for each model be given by p(θj | Mj), the prior for each model be given by p(Mj), and the predicted data be y*. Thus the posterior distribution of all unknowns will be:

p(y*, zj, θj, Mj | y) ∝ p(y* | zj, θj, Mj, y) p(y | zj, θj, Mj) × p(zj | θj, Mj) p(θj | Mj) p(Mj).  (2.22)
The above expression can be used for a variety of Bayesian inferences and decisions. If prediction is the goal then, in order to obtain the predictive distribution, p(y* | y), the above expression must be marginalised over θj and the missing data, zj, and averaged over all models, Mj. Thus model selection is avoided and replaced by model averaging. Bernardo and Smith (1994) show how model selection within this framework can be unified using the concept of expected utility. Let ω be the aspect of the model that is of interest: it might be the truth of the model or the predictive power of the model for future observations. Then the optimal model choice is given by

Û(Mopt) = sup_{j=1,...,N} Û(Mj | y),

where Û(Mj | y) = ∫ U(Mj, ω) p(ω | y) dω and p(ω | y) = Σ_{j=1}^{N} p(ω | y, Mj) p(Mj | y). Bernardo and Smith (1994) show that if ω corresponds to the truth of the model then the utility function U is set to one if the model is correct and zero otherwise. From this they show that

Û(Mj) = p(Mj | y)  (2.23)
and the optimal model choice is the one with the highest posterior probability. If prediction is the goal the utility function can be defined in terms of a loss function, for example, quadratic or logarithmic loss.
The marginal posterior probability of the model
The estimation of equation (2.23) is seldom straightforward and usually highly problematical. Its calculation involves the application of Bayes' theorem:

p(Mj | y) = p(y | Mj) p(Mj) / Σ_{j=1}^{N} p(y | Mj) p(Mj),

where p(y | Mj), j = 1, 2, . . . , N, is the marginal likelihood of model Mj, and is calculated by integrating all the other parameters out of the likelihood:

p(y | Mj) = ∫ p(y | Mj, θ) p(θ | Mj) dθ.  (2.24)

The key expression required for model selection, p(y | Mj), is the normalising constant and denominator of the expression produced when Bayes' theorem is applied to find the posterior distribution of θ:

p(θ | y, M) = p(θ | M) p(y | θ, M) / p(y | M).

If all priors are conjugate the normalising constant can be calculated from a closed analytical expression. Closed form expressions of this kind, however, are rarely available. The calculation of p(y | Mj) is a continuing open area of research and still remains a potentially formidable problem. There are at least three approaches for approximating intractable normalising constants: analytical approximations (DiCiccio et al., 1997), numerical integration, and Monte Carlo simulation (Evans and Swartz, 2000). The Bayes factor, a statistic that is often used to compare models, also uses the marginal likelihood. The Bayes factor is the ratio of the posterior odds for
model M2 compared to model M1 to the prior odds for M2 to M1:

BF = [p(M2 | y) / p(M1 | y)] / [p(M2) / p(M1)] = p(y | M2) / p(y | M1).

It can be interpreted as the weight of evidence for M2 against M1. Because of the difficulty in evaluating (2.24), a possible alternative is to use the suggestion of Aitkin (1991) and use the posterior Bayes factor. This uses output of the posterior distribution from the MCMC. The posterior Bayes factor is defined as:

PBF = E_{z,θ|y,M1} p(y | θ, z, M1) / E_{z,θ|y,M2} p(y | θ, z, M2),  (2.25)

where the expectation is taken by calculating the conditional likelihood from each sweep of the MCMC output once convergence has been attained. This method is suspect because it uses the data twice and therefore needs calibration (Aitkin, 1997).

Approximation of the marginal likelihood
A naive way of calculating (2.24) is to sample from the prior, θ^(i) ∼ p(θ | M), and then to calculate (Carlin and Louis, 1996)

p(y | M) ≈ (1/G) Σ_{i=1}^{G} p(y | θ^(i), M).

Such an estimator is likely to be highly inefficient. To find the marginal likelihood using samples from the MCMC, θ^(i) ∼ p(θ | y, M), once convergence has been achieved, requires an importance sampling type correction and leads to the harmonic mean estimator (Kass and Raftery, 1995; Newton and Raftery, 1994):

p(y | M) ≈ [ (1/G) Σ_{i=1}^{G} 1/p(y | θ^(i), M) ]^{−1}.

The problem with this method is that occasional very large values in the tails of the distribution unduly influence this calculation. Newton and Raftery (1994)
point out the instability of this estimator. A generalisation of this result was provided by Gelfand and Dey (1994), leading to an alternative estimator using the MCMC output, θ^(i):

p(y | M) ≈ [ (1/G) Σ_{i=1}^{G} h(θ^(i)) / { p(y | θ^(i), M) p(θ^(i) | M) } ]^{−1}.

Gelfand and Dey (1994) suggest using an h that closely matches the distribution of the MCMC output, e.g. a multivariate normal or t density with the same mean and covariance matrix as collected from the MCMC output, θ^(i).

The Laplacian approximation to the marginal likelihood
This section follows Raftery (1996). The Laplace method is based on a Taylor series expansion of the log of the unnormalised posterior distribution, g(θ | M) = log(p(θ | M) p(y | θ, M)), about the mode, θ*. A measure of curvature, I*, is the generalised observed Fisher information matrix, which is minus the Hessian of g(θ | M) calculated at the posterior mode:

I*_ij = − ∂²/(∂θi ∂θj) log(p(y | θ, M) p(θ | M)) |_{θ=θ*}.  (2.26)

The Taylor approximation enables the vector of parameters, θ, to be integrated out, thus yielding

p(y | M) ≈ (2π)^{p/2} p(y | θ*, M) p(θ* | M) / |I*|^{1/2},  (2.27)
where p is the number of parameters used. The mode is not always easy to find but can be approximated in a number of ways (Raftery, 1996):

1. Calculate g(θ | M) for each iteration of the MCMC and then choose the value from the MCMC that maximises this.
2. The multivariate median.
3. The component-wise posterior median.
4. The mode of a nonparametric density estimator calculated at the posterior.
5. As the number of observations, T, gets large, the mean of the posterior output can be used.

For models involving latent states the first option can be problematical because of the computational burden of integrating out the states for each iteration of the MCMC. A further problem with the analytical Laplacian method arises when the derivatives required are either hard or tedious to calculate. When the prior is flat, (2.26) can be approximated by the observed Fisher information matrix, Îij, which relies on the maximum likelihood estimator θ̂. It is defined as

Îij = − Σ_{t=1}^{T} ∂²/(∂θi ∂θj) log(p(yt | θ, M)) |_{θ=θ̂},  (2.28)

where θ̂ is the maximum likelihood estimate of θ. In these cases one can use the following suggestions put forward by Raftery (1996):

1. By making use of the property that the Hessian, H(θ*), is asymptotically equal to minus the inverse of the posterior covariance matrix, Ψ,

lim_{T→∞} H(θ*) = −Ψ^{−1},

a robust estimator of Ψ can be made from the MCMC output.

2. Another asymptotic approximation to the Laplace estimator can be made by assuming that |I*| = T^p |I|, where I is the expected information matrix of a single observation and T is the number of observations. This yields:

log p(y | M) ≈ log(p(y | θ̂, M)) − (p/2) log(T).  (2.29)
The BIC can be conveniently expressed in terms of the marginal likelihood, Lm(θ) (2.18), as

BIC = −2 log Lm(θ̂) + p log(T),  (2.30)

with a lower value indicating a better model. A problem with the last method for models that contain latent variables can be the computational burden of calculating the marginal deviance. This was discussed in Section 2.5.1.
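The naive prior-sampling estimator and the harmonic mean estimator of the marginal likelihood described earlier in this section can be sketched as follows (our own illustration; both are computed on the log scale for stability, and the instability of the harmonic mean noted above still applies):

```python
import numpy as np

def log_marglik_prior(loglik, prior_draws):
    """Naive estimator: p(y|M) ~ (1/G) sum_i p(y | theta_i, M), theta_i ~ prior."""
    ll = np.array([loglik(t) for t in prior_draws])
    m = ll.max()                                  # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(ll - m)))

def log_marglik_harmonic(loglik, posterior_draws):
    """Harmonic mean estimator: p(y|M) ~ [(1/G) sum_i 1/p(y | theta_i, M)]^{-1},
    with theta_i drawn from the posterior."""
    ll = np.array([loglik(t) for t in posterior_draws])
    m = (-ll).max()
    return -(m + np.log(np.mean(np.exp(-ll - m))))
```

In a conjugate normal test case both recover the analytic marginal likelihood, but the harmonic mean's variance deteriorates rapidly as the posterior becomes diffuse relative to the likelihood.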
2.5.3 Sampling over model and parameter space
The model selection described so far requires the running of several models, Mi, i = 1, 2, . . . , N, and for each model the calculation of p(y | Mi) or some other measure of model adequacy such as the AIC (Akaike, 1978) or the BIC (Schwarz, 1978). For large N this may not be feasible. One way of dealing with this is to use the model indicator, M, as a parameter and to design a Markov chain that is able to traverse model space, generating the indicator M^(i) ∼ p(M | y), i = 1, 2, . . . , G. Here G is the length of the run obtained once convergence is attained. Thus the marginal posterior estimate of the model indicator is merely:

p̂(M = j | y) = (number of M^(i) = j) / (total number of M^(i)).

An early attempt at using this approach was demonstrated by Carlin and Polson (1991), where M was used as a parameter in a Gibbs sampler. The limitation of the method is that it requires the same parametrisation and therefore cannot be used for comparing mean structures (Carlin and Polson, 1991). George and McCulloch (1993) introduced an algorithm for stochastic search variable selection (SSVS) in the context of multiple regression problems, where a binary latent indicator was used to indicate which of the regression terms, βi, would be set to a low value, almost to zero. Thus the meta model had fixed dimension. Unfortunately,
the Bayes factors obtained are highly dependent on the tuning parameters used in the algorithm. In 1995 two important and widely applicable ideas were put forward. The first is the Gibbs sampling method of Carlin and Chib (1995).

Carlin and Chib (1995)
Suppose there are N candidate models and for each model there is a parameter vector, θj, of dimension nj, j = 1, 2, . . . , N, where the likelihood is p(y | θj, M = j), with priors p(θj | M = j) and pseudo-priors p(θj | M = i ≠ j), where i = 1, 2, . . . , N. Two important assumptions are made: first, the observations given the model indicator, M = j, are independent of θ_{i≠j}, so that p(y | θ, M = j) = p(y | θj, M = j); and, secondly, the priors and pseudo-priors are independent, θi ⊥ θk | M = j (i ≠ k). Using these two assumptions the full probability model can be expressed as

p(y, θ, M = j) = p(y | θj, M = j) { ∏_{i=1}^{N} p(θi | M = j) } P(M = j).

In order to implement Gibbs sampling the full conditionals of the θj and M are needed. They are given by

p(θj | M, θ_{i≠j}, y) ∝ p(y | θj, M = j) p(θj | M = j)  if M = j,
p(θj | M, θ_{i≠j}, y) ∝ p(θj | M = i)  if M = i ≠ j,  (2.31)

p(M = j | θ, y) ∝ p(y | θj, M = j) { ∏_{i=1}^{N} p(θi | M = j) } P(M = j).  (2.32)

(2.31) is easily calculated if the prior is conjugate to the likelihood, and the normalising constant for (2.32) is easily found because M has a finite number of states. Nevertheless, the computational burden of calculating (2.32) for N models at every step of the MCMC can be overwhelming. This led Dellaportas et al. (1998) to propose a Metropolised Carlin and Chib in which only a proposal, rather than all models, need be considered at each step.
Reversible jump and birth death MCMC
A more general way of traversing model space is the reversible jump Markov chain Monte Carlo (RJMCMC) of Green (1995). As this theory is explained and used in Chapter 7 of this thesis, we will not discuss it in detail here. RJMCMC was extended to model selection problems in the presence of missing data by Richardson and Green (1997). Obtaining good convergence properties for the trans-dimensional model is, however, often difficult. This problem was addressed by Brooks and Giudici (1998). Stephens (2000a) suggests an alternative to RJMCMC for model selection in mixtures called birth-death Markov chain Monte Carlo (BDMCMC). Here Stephens (2000a) creates a continuous time Markov birth-death process in which births occur at a constant rate determined by the prior and deaths occur at a high rate for components that are not critical for explaining the data. In order that detailed balance is attained in the BDMCMC, Stephens (2000a) derives an expression for the death rate in terms of the birth rate.
2.6
Conclusion
In this chapter we have presented a little of the history and theory of latent variables, of the simplest type of latent variable model, the mixture model, and of its close derivatives. The terms latent variables, missing data, hidden data, augmented data and auxiliary data, all with very similar meanings, have been described. Finally, the history of research into Bayesian model selection in the presence of missing data has been outlined.
Bibliography

Aitkin, M. (1991). Posterior Bayes factors (with discussion). J. Roy. Statist. Soc. B, 53:111–142.

Aitkin, M. (1997). The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood. Statist. Comp., 7:253–261.

Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math., 30:9–14.

Baker, J. K. (1975). The Dragon System: an overview. IEEE Trans. Acoust. Speech Signal Processing, 23:24–29.

Baker, J. K. (1980). Variable duration models for speech. In Proc. Symposium on the Application of Hidden Markov Models to Text and Speech, 64:143–179.

Baker, S. J. (1992). A simple method for computing the observed information matrix when using the EM algorithm for categorical data. J. Comput. and Graph. Statist., 1:63–76.

Baum, L. (1972). An inequality and associated maximisation technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8.

Baum, L. E. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist., 37:1554–1563.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximisation technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164–171.

Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, Chichester.

Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. B, 36:192–236.

Besag, J. E. (2001). An introduction to Markov chain Monte Carlo. Technical report, University of Washington.

Besag, J. E. and Green, P. J. (1993). Spatial statistics and Bayesian computation. J. Roy. Statist. Soc. B, 55:25–38.

Billio, M., Monfort, A., and Robert, C. P. (1999). Bayesian estimation of switching ARMA models. J. of Econom., 93:229–255.

Boys, R. J., Henderson, D. A., and Wilkinson, D. J. (2000). Detecting homogeneous segments in DNA sequences by using hidden Markov models. Appl. Statist., 49:269–285.

Brooks, S. P. and Giudici, P. (1998). Convergence assessment for reversible jump MCMC simulations. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 6, pages 733–742. Oxford University Press, Oxford.

Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press, Cambridge.

Carlin, B. P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B, 57:473–484.

Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, London.
Carlin, B. P. and Polson, N. G. (1991). Inference for nonconjugate Bayesian models using the Gibbs sampler. Canad. J. Statist., 19:399–405.

Celeux, G. and Diebolt, J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statist. Quarter., 2:73–82.

Celeux, G., Forbes, F., Robert, C., and Titterington, M. (2003). Deviance information criteria for missing data models. Technical report, INRIA.

Damien, P., Wakefield, J., and Walker, S. (1999). Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. J. Roy. Statist. Soc. B, 61:331–344.

de Gunst, M. C. M., Kunsch, H. R., and Schouten, J. (2001). Statistical analysis of ion channel data using hidden Markov models with correlated state-dependent noise and filtering. J. Am. Statist. Assoc., 96:805–815.

Dellaportas, P., Forster, J. J., and Ntzoufras, I. (1998). On Bayesian model and variable selection using MCMC. Technical report, Athens University of Economics and Business.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39:1–38.

DiCiccio, T. J., Kass, R. E., Raftery, A., and Wasserman, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximations. J. Am. Statist. Ass., 92:903–915.

Diebolt, J. and Ip, E. H. S. (1996). A stochastic EM algorithm for approximating the maximum likelihood estimate. Chapter 15 of Markov Chain Monte Carlo in Practice, pages 259–273. Chapman and Hall.
Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. J. Roy. Statist. Soc. B, 56:363–375.

Dunmur, A. P. and Titterington, D. M. (1997). Computational Bayesian analysis of hidden Markov mesh models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:1296–1300.

Dunson, D. B. (2000). Models for papilloma multiplicity and regression: applications to transgenic mouse studies. Appl. Statist., 49:9–30.

Dunson, D. B. and Haseman, J. K. (1999). Modelling tumour onset and multiplicity using transition models with latent variables. Biometrics, 55:965–970.

Evans, M. and Swartz, T. (2000). Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press, Oxford.

Fisher, W. and Meier-Hellstern, K. (1992). The Markov-modulated Poisson process (MMPP) cookbook. Perf. Eval., 18:149–171.

Friel, N. and Pettitt, A. N. (2004). Likelihood estimation and inference for the autologistic model. J. Comp. Graph. Statist., 13:232–246.

Gelfand, A. E. and Dey, D. K. (1994). Bayesian model choice: asymptotics and exact calculations. J. Roy. Statist. Soc. B, 56:501–514.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1996). Bayesian Data Analysis. Chapman and Hall, London.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6:721–741.

George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Am. Statist. Assoc., 88:881–889.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732.

Green, P. J. and Richardson, S. (2002). Hidden Markov models and disease mapping. J. Am. Statist. Assoc., 97:1–16.

Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, New Jersey.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Hogan, J. and Tchernis, R. (2004). Bayesian factor analysis for spatially correlated data, with application to summarising area-level material deprivation from census data. J. Am. Statist. Assoc., 99:314–324.

Hughes, J. P., Guttorp, P., and Charles, S. P. (1999). A non-homogeneous hidden Markov model for precipitation occurrence. Appl. Statist., 48:15–30.

Jamshidian, M. and Jennrich, R. I. (2000). Standard errors for EM estimation. J. Roy. Statist. Soc. B, 62:257–270.

Jelinek, F. (1976). Continuous speech recognition by statistical methods. IEEE Proceedings, 64:532–556.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Am. Statist. Ass., 90:773–795.

Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statist. Sin., 5:1–18.

Lehmann, E. L. (1983). Theory of Point Estimation. Wiley, New York.

Leroux, B. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Process. Appl., 40:127–143.
Li, J., Gray, R. M., and Olshen, R. A. (2000). Multiresolution image classification by hierarchical modelling with two-dimensional hidden Markov models. IEEE Transactions on Information Theory, 46:1826–1841.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statist. Soc. B, 44:226–233.

MacDonald, I. L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. Chapman and Hall.

McCullagh, P. and Nelder, J. A. (1989). Generalised Linear Models. Chapman and Hall.

Meng, X. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J. Am. Statist. Assoc., 86:899–909.

Meng, X. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80:267–278.

Mengersen, K. L. and Robert, C. P. (1995). Testing for mixtures via entropy distance and Gibbs sampling. In Berger, J. O., Bernardo, J. M., Dawid, A. P., Lindley, D. V., and Smith, A. F. M., editors, Bayesian Statistics 5.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092.

Møller, J., Pettitt, A. N., Berthelsen, K. K., and Reeves, R. W. (2003). An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. Technical report.

Neal, R. M. (2000). Slice sampling. Technical report, Department of Statistics, University of Toronto.
Newton, M. and Raftery, A. (1994). Approximate Bayesian inference by the weighted likelihood bootstrap (with discussion). J. Roy. Statist. Soc. B, 56:3–48.

Nott, D. J. and Green, P. J. (2002). Bayesian variable selection and the Swendsen-Wang algorithm. Technical report, University of Bristol.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.

Pettitt, A. N., Friel, N., and Reeves, R. (2003). Efficient calculation of the normalising constant of the autologistic and related models on the cylinder and lattice. J. Roy. Statist. Soc. B, 65:235–246.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77:257–286.

Raftery, A. E. (1996). Hypothesis testing and model selection. In Gilks, W. R., Richardson, S. T., and Spiegelhalter, D. J., editors, Markov Chain Monte Carlo in Practice. Chapman and Hall.

Raphael, C. (1999). Automatic segmentation of acoustic musical signals using hidden Markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:360–370.

Raviv, J. (1966). Decision making in Markov chains applied to the problem of pattern recognition. IEEE Trans., 4:536–551.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. B, 59:731–792.

Robert, C. P. (1996). Mixtures of distributions: inference and estimation. Chapter 24 of Markov Chain Monte Carlo in Practice, pages 441–464. Chapman and Hall.
Robert, C. P., Celeux, G., and Diebolt, J. (1993). Bayesian estimation of hidden Markov chains: a stochastic implementation. Statist. Probab. Lett., 16:77–83.

Robert, C. P. and Rousseau, J. (2003). A mixture approach to Bayesian goodness of fit. Technical report, Université Paris Dauphine.

Robert, C. P., Rydén, T., and Titterington, D. M. (2000). Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. J. R. Statist. Soc. B, 62:57–75.

Roweis, S. and Ghahramani, Z. (1999). An EM algorithm for identification of nonlinear systems. Technical report, Gatsby Computational Neuroscience Unit, University College London.

Rydén, T. (1994). Consistent and asymptotically normal parameter estimates for hidden Markov models. Ann. Statist., 22:1884–1895.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6:461–464.

Scott, S. L. (2002). Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Statist. Ass., 97:337–351.

Shefner, J. M. (2001). Motor unit number estimation in human neurological diseases and animal models. J. Clin. Neurophysiol., 112:955–964.

Smith, A. F. M. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc. B, 55:3–23.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Roy. Statist. Soc. B, 64:583–639.
Stephens, M. (2000a). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Ann. Statist., 28:40–74.

Stephens, M. (2000b). Dealing with label switching in mixture models. J. Roy. Statist. Soc. B, 62:795–809.

Swendsen, R. H. and Wang, J. S. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86–88.

Taha, H. A. (1989). Operations Research: An Introduction. Macmillan Publishing Company, New York.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Am. Statist. Assoc., 82:528–550.

Titterington, D., Smith, A. F. M., and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, New York.

Van Dyk, D. A. and Meng, X. (2001). The art of data augmentation. J. Comp. and Graph. Stat., 10:933–944.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260–269.

Wei, G. C. G. and Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J. Am. Statist. Assoc., 85:699–704.

Wermuth, N. and Lauritzen, S. L. (1983). Graphical and recursive models for contingency tables. Biometrika, 70:537–552.
Chapter 3 A Markov switching model for longitudinal counts
Statement of joint authorship

P. G. Ridall (Candidate): Conducted a search of the literature relating to the EM algorithm for hidden Markov models, wrote code for the EM algorithm and the estimation of standard errors, and wrote and proofread the manuscript.

A. N. Pettitt: Helped with some of the technical details, assisted in extensions, and proofread the manuscript several times.

R. Reeves: Coded up a (univariate) pseudo-autologistic model and POMM in BUGS and checked the code using simulated data. Made useful comments about the script.
This figure is not available online. Please consult the hardcopy thesis available from the QUT Library
Chapter 4 Bayesian hidden Markov Models for longitudinal counts
A paper accepted for publication in the Australian and New Zealand Journal of Statistics, April 2004; in print, June 2005.
Statement of joint authorship

P. G. Ridall (Candidate): Developed all modelling equations, carried out MCMC, interpreted the results, wrote and proofread the manuscript, and revised the manuscript based on the referees' comments.

A. N. Pettitt: Gave technical advice on the development of some of the full conditionals. Made notational suggestions. Assisted with the structure of the manuscript, proofread the manuscript and offered advice on the revision of the manuscript based on the referees' comments.
This figure is not available online. Please consult the hardcopy thesis available from the QUT Library
Chapter 5 Hidden bivariate binary models for spatial interactions
A paper submitted to Statistical Modelling.
Statement of joint authorship

P. G. Ridall (Candidate): Developed the new models, performed the analysis of the equations, made interpretations, and wrote and proofread the manuscript.

A. N. Pettitt: Provided the initial idea, offered advice on the structure of the manuscript, proofread the manuscript, and helped with the interpretation of results.

R. Reeves: Provided the initial coding for the univariate pseudo-autologistic model and the POMM. Tested the models by simulating from known parameters.
This figure is not available online. Please consult the hardcopy thesis available from the QUT Library
Chapter 6 A Bayesian hierarchical model for estimating motor unit numbers
A paper accepted by Biometrics (Journal of the International Biometric Society), May 2005.
Statement of joint authorship

P. G. Ridall (Candidate): Developed the model, carried out MCMC simulations, interpreted the results, and wrote and proofread the manuscript.

A. N. Pettitt: Provided technical assistance in the derivation of full conditionals, assisted with the structure of the manuscript, and proofread the manuscript.

R. Henderson: Motivated the problem, supplied the data and some of the references, and assisted by reading and commenting on drafts.

P. McCombe: Read and made comments on the manuscript. Assisted in the biological interpretation of results. Helped construct and reference the section referring to biological assumptions.
This figure is not available online. Please consult the hardcopy thesis available from the QUT Library
Chapter 7 Motor unit number estimation using reversible jump Markov chain Monte Carlo
A paper conditionally accepted by the Journal of the Royal Statistical Society, Series C, December 2004.
Statement of joint authorship

P. G. Ridall (Candidate): Developed all modelling equations, carried out MCMC simulations, interpreted the results, and wrote and proofread the manuscript.

A. N. Pettitt: Helped with the construction of the full conditionals and priors, assisted with the structure of the manuscript, and provided critical feedback on the model.

R. Henderson: Motivated the problem, supplied the data, and assisted with the introduction.

N. Friel: Provided advice on the allocation problem of the reversible jump and provided comments on the final manuscript.

P. McCombe: Read and made comments on the manuscript. Assisted in the biological interpretation of results. Commented on and helped reference the section referring to biological assumptions.
This figure is not available online. Please consult the hardcopy thesis available from the QUT Library
Appendix A

The full conditionals for within-model MCMC (move type M = 1)

We use Metropolis-Hastings within Gibbs sampling.

τ_t | ...

Let τ_t = {τ_{1,t}, τ_{2,t}, ..., τ_{N,t}}. Since

\[
p(\tau_t \mid \ldots) \propto p(\tau_t \mid \Theta_z)\, p(y_t \mid \tau_t, \Theta_y, S_t)
\]

at each time instance, t, a proposal τ_t → τ̃_t is made from the first term, the prior distribution, τ̃_t ~ N(m, 1/δ²), and accepted with probability calculated from the second term.

μ_b | ...

We use (7.9) and (7.10) for our definitions of μ_{T_t}, n_t and V_t. Since

\[
p(\mu_b \mid \ldots) \propto p(\mu_b) \prod_{t=1}^{T} N\!\left(y_t;\, \mu_{T_t},\, V_t\right)
\propto p(\mu_b)\, N\!\left(\mu_b;\ \frac{\sum_{t=1}^{T} \frac{y_t - \sum_{k=1}^{N} \mu_k s_{k,t}}{V_t}}{\sum_{t=1}^{T} \frac{1}{V_t}},\ \frac{1}{\sum_{t=1}^{T} \frac{1}{V_t}}\right),
\]

a proposal can be made from the second term and the acceptance probability calculated using the first.

μ_k | ...

We let μ̆_{k,t} = μ_{T_t} − s_{k,t} μ_k and V̆_{k,t} = V_t − s_{k,t} σ². Since

\[
p(\mu_k \mid \ldots) \propto p(\mu_k)\, N\!\left(\mu_k;\ \frac{\sum_{t=1:\, s_{k,t}=1}^{T} \frac{y_t - \breve\mu_{k,t}}{\breve V_{k,t}}}{\sum_{t=1:\, s_{k,t}=1}^{T} \frac{1}{\breve V_{k,t}}},\ \frac{1}{\sum_{t=1:\, s_{k,t}=1}^{T} \frac{1}{\breve V_{k,t}}}\right),
\]

a proposal can be made from the second term and the acceptance probability calculated using the first.

σ² | ...

The mode, σ̂², of the full conditional is found by numerically solving for σ² in the equation

\[
\frac{\partial \log p(\sigma^2 \mid \ldots)}{\partial \sigma^2}
= -\frac{1}{2}\sum_{t=1}^{T} \frac{n_t}{V_t}
+ \frac{1}{2}\sum_{t=1}^{T} \frac{n_t (y_t - \mu_{T_t})^2}{V_t^2}
+ \frac{\beta_3}{\sigma^4} - \frac{\alpha_3 + 1}{\sigma^2} = 0,
\]

and approximating the variance by V̂(σ̂²) = (−H(σ̂²))^{−1} at that mode, where

\[
H(\hat\sigma^2) = \left.\left(\frac{1}{2}\sum_{t=1}^{T} \frac{n_t^2}{V_t^2}
- \sum_{t=1}^{T} \frac{n_t^2 (y_t - \mu_{T_t})^2}{V_t^3}
- \frac{2\beta_3}{\sigma^6} + \frac{\alpha_3 + 1}{\sigma^4}\right)\right|_{\hat\sigma^2}.
\]

We make a proposal from the gamma distribution Gamma(α, β), with the same mode and variance as the full conditional, where the parameters α and β can be found by solving the simultaneous equations

\[
\frac{\alpha - 1}{\beta} = \hat\sigma^2
\quad\text{and}\quad
\frac{\alpha}{\beta^2} = \left(-H(\hat\sigma^2)\right)^{-1}.
\]
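Matching the mode and variance of the Gamma(α, β) proposal reduces to a single quadratic: substituting α = 1 + σ̂²β into α/β² = V̂ gives V̂β² − σ̂²β − 1 = 0, whose positive root determines β. A small self-contained check, with arbitrary numerical values:

```python
import math

def gamma_match(mode, var):
    """Gamma(alpha, beta) in the rate parameterisation with a given mode
    and variance: solves (alpha - 1)/beta = mode and alpha/beta**2 = var,
    which reduces to var*beta**2 - mode*beta - 1 = 0."""
    beta = (mode + math.sqrt(mode ** 2 + 4.0 * var)) / (2.0 * var)
    alpha = 1.0 + mode * beta
    return alpha, beta

alpha, beta = gamma_match(2.0, 0.5)             # arbitrary mode and variance
assert abs((alpha - 1.0) / beta - 2.0) < 1e-9   # mode recovered
assert abs(alpha / beta ** 2 - 0.5) < 1e-9      # variance recovered
```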
σ_b² | ...

A similar scheme to the above is used for σ_b². The mode, σ̂_b², is found by numerically solving for σ_b² in the equation

\[
\frac{\partial \log p(\sigma_b^2 \mid \ldots)}{\partial \sigma_b^2}
= -\frac{1}{2}\sum_{t=1}^{T} \frac{1}{V_t}
+ \frac{1}{2}\sum_{t=1}^{T} \frac{(y_t - \mu_{T_t})^2}{V_t^2}
+ \frac{\beta_4}{\sigma_b^4} - \frac{\alpha_4 + 1}{\sigma_b^2} = 0,
\]

and approximating the variance by V̂(σ̂_b²) = (−H(σ̂_b²))^{−1} at that mode, where

\[
H(\hat\sigma_b^2) = \left.\left(\frac{1}{2}\sum_{t=1}^{T} \frac{1}{V_t^2}
- \sum_{t=1}^{T} \frac{(y_t - \mu_{T_t})^2}{V_t^3}
- \frac{2\beta_4}{\sigma_b^6} + \frac{\alpha_4 + 1}{\sigma_b^4}\right)\right|_{\hat\sigma_b^2}.
\]

We make a proposal from the gamma distribution Gamma(α, β), with the same mode and variance as the full conditional, using the same method as above.
δ² | ...

For k = 1, 2, ..., N we sample δ_k² from its full conditional:

\[
(\delta_k^2 \mid \ldots) \sim \text{Gamma}\!\left(\frac{T}{2} + \alpha_\delta,\ \frac{\sum_{t=1}^{T}(\tau_{k,t} - m_k)^2}{2} + \beta_\delta\right).
\]

m | ...

For k = 1, 2, ..., N we make a proposal

\[
\tilde m_k \sim N\!\left(\frac{\sum_{t=1}^{T} \tau_{k,t}}{T},\ \frac{1}{T \delta_k^2}\right) \tag{7.25}
\]

and accept this proposal with probability α, where

\[
\alpha = \min\left\{1,\
\frac{\phi\!\left(\frac{\tilde m_k - \mu_m}{\sigma_m}\right)
\left[\Phi\!\left(\frac{m_{k+1} - \mu_m}{\sigma_m}\right) - \Phi\!\left(\frac{\tilde m_k - \mu_m}{\sigma_m}\right)\right]
\left[\Phi\!\left(\frac{\tilde m_k - \mu_m}{\sigma_m}\right) - \Phi\!\left(\frac{m_{k-1} - \mu_m}{\sigma_m}\right)\right]}
{\phi\!\left(\frac{m_k - \mu_m}{\sigma_m}\right)
\left[\Phi\!\left(\frac{m_{k+1} - \mu_m}{\sigma_m}\right) - \Phi\!\left(\frac{m_k - \mu_m}{\sigma_m}\right)\right]
\left[\Phi\!\left(\frac{m_k - \mu_m}{\sigma_m}\right) - \Phi\!\left(\frac{m_{k-1} - \mu_m}{\sigma_m}\right)\right]}\right\},
\]

with m_{N+1} = max(S_t) and m_0 = min(S_t).
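Because δ_k² is a precision with a conjugate Gamma full conditional, its Gibbs update is a single draw. The sketch below uses illustrative hyperparameter values and synthetic τ's; note that Python's `random.gammavariate` is parameterised by shape and scale, so the rate above enters as a reciprocal.

```python
import random

random.seed(3)

def draw_precision(tau_k, m_k, alpha_d, beta_d):
    """One Gibbs draw of the precision delta_k^2 from its Gamma full
    conditional: shape T/2 + alpha_d, rate sum((tau - m_k)^2)/2 + beta_d.
    random.gammavariate takes a SCALE parameter, hence the reciprocal."""
    T = len(tau_k)
    shape = T / 2.0 + alpha_d
    rate = sum((t - m_k) ** 2 for t in tau_k) / 2.0 + beta_d
    return random.gammavariate(shape, 1.0 / rate)

# Synthetic check: tau generated with precision 4 (standard deviation 0.5)
tau = [random.gauss(1.0, 0.5) for _ in range(500)]
draws = [draw_precision(tau, 1.0, 0.01, 0.01) for _ in range(200)]
print(sum(draws) / len(draws))   # should sit near the generating precision
```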
α_δ, β_δ | ...

The derivative of the log of the full conditional, log p(δ | α_δ, β_δ), with respect to α_δ and β_δ gives

\[
S = \begin{pmatrix}
N \log \beta_\delta + \sum_{k=1}^{N} \log \delta_k - N \Psi_2(\alpha_\delta) + \dfrac{\alpha_1 - 1}{\alpha_\delta} - \beta_1 \\[6pt]
\dfrac{N \alpha_\delta + \alpha_2 - 1}{\beta_\delta} - \sum_{k=1}^{N} \delta_k - \beta_2
\end{pmatrix}.
\]

The modal values of α_δ and β_δ can be found by solving S = 0 using Newton's method and the Hessian

\[
H = \begin{pmatrix}
-N \Psi_3(\alpha_\delta) - \dfrac{\alpha_1 - 1}{\alpha_\delta^2} & \dfrac{N}{\beta_\delta} \\[6pt]
\dfrac{N}{\beta_\delta} & -\dfrac{N \alpha_\delta + \alpha_2 - 1}{\beta_\delta^2}
\end{pmatrix},
\]

where Ψ_2 is the digamma function and Ψ_3 is the trigamma function. If (α̂_δ, β̂_δ) is the solution to this, then a proposal q: (α_δ, β_δ) → (α̃_δ, β̃_δ) is made from the bivariate normal distribution

\[
\begin{pmatrix} \tilde\alpha_\delta \\ \tilde\beta_\delta \end{pmatrix}
\sim N\!\left(\begin{pmatrix} \hat\alpha_\delta \\ \hat\beta_\delta \end{pmatrix},\
\left.\left(-H\right)^{-1}\right|_{\hat\alpha_\delta,\, \hat\beta_\delta}\right).
\]
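With the hyperprior terms dropped, the score and Hessian above are those of an ordinary Gamma(α, β) likelihood, and the Newton iteration can be sketched directly. This is an illustration, not the thesis code: the digamma and trigamma functions are approximated by finite differences of `lgamma` so that only the standard library is needed, and the synthetic data replace the δ_k.

```python
import math, random

def digamma(x, h=1e-5):
    # central finite difference of log-gamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    # second central difference of log-gamma
    return (math.lgamma(x + h) - 2.0 * math.lgamma(x) + math.lgamma(x - h)) / h ** 2

def gamma_mle(data, iters=50):
    """Newton's method on the Gamma(alpha, beta) log-likelihood (rate
    parameterisation): the score and Hessian of the appendix with the
    hyperprior terms dropped."""
    n = len(data)
    slog = sum(math.log(d) for d in data)
    ssum = sum(data)
    mean = ssum / n
    var = sum((d - mean) ** 2 for d in data) / n
    a, b = mean ** 2 / var, mean / var      # moment estimates as the start
    for _ in range(iters):
        s1 = n * math.log(b) + slog - n * digamma(a)
        s2 = n * a / b - ssum
        h11 = -n * trigamma(a)
        h12 = n / b
        h22 = -n * a / b ** 2
        det = h11 * h22 - h12 * h12
        da = (h22 * s1 - h12 * s2) / det    # solve H [da, db]^T = S
        db = (h11 * s2 - h12 * s1) / det
        a, b = max(a - da, 1e-8), max(b - db, 1e-8)
    return a, b

random.seed(4)
sample = [random.gammavariate(3.0, 0.5) for _ in range(2000)]  # shape 3, rate 2
a_hat, b_hat = gamma_mle(sample)
print(a_hat, b_hat)
```

The Hessian is negative definite here (αΨ_3(α) > 1 for α > 0), so Newton's method from the moment estimates converges quickly; the fitted shape and rate should land near the generating values 3 and 2.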
α_μ, β_μ | ...

A similar scheme to the one above is used for α_μ and β_μ. To construct a proposal we assume that the random effects, μ, are approximately gamma distributed. The derivative of the log of the full conditional, log p(μ | α_μ, β_μ), with respect to α_μ and β_μ gives

\[
S = \begin{pmatrix}
N \log \beta_\mu + \sum_{k=1}^{N} \log \mu_k - N \Psi_2(\alpha_\mu) + \dfrac{\alpha_5 - 1}{\alpha_\mu} - \beta_5 \\[6pt]
\dfrac{N \alpha_\mu + \alpha_6 - 1}{\beta_\mu} - \sum_{k=1}^{N} \mu_k - \beta_6
\end{pmatrix}.
\]

The modal values of α_μ and β_μ can be found by solving S = 0 using Newton's method with the Hessian

\[
H = \begin{pmatrix}
-N \Psi_3(\alpha_\mu) - \dfrac{\alpha_5 - 1}{\alpha_\mu^2} & \dfrac{N}{\beta_\mu} \\[6pt]
\dfrac{N}{\beta_\mu} & -\dfrac{N \alpha_\mu + \alpha_6 - 1}{\beta_\mu^2}
\end{pmatrix}.
\]
μ_m | ...

We make a proposal μ̃_m from

\[
(\tilde\mu_m \mid \ldots) \sim N\!\left(\frac{\tau_m \sum_{i=1}^{N} m_i}{\tau + N \tau_m},\ \frac{1}{\tau + N \tau_m}\right),
\]

where τ_m = 1/σ_m² and τ = 1/σ², and accept using an expression calculated from p(m) and given by (7.4).

σ_m² | ...

We make a proposal σ̃_m² from

\[
(\tilde\sigma_m^2 \mid \ldots) \sim \text{IGamma}\!\left(\frac{N}{2} + \alpha_1,\ \frac{\sum_{i=1}^{N}(m_i - \mu_m)^2}{2} + \beta_1\right)
\]

and accept this proposal using an expression calculated from p(m) and given by (7.4).
Appendix B

Details of the varying N move types M = 2, 3, 4, 5

These moves consist of reallocating two units to three units, or the reverse move. For the birth move, we randomly select i ∈ {2, 3, ..., N} with equal probability and propose replacing units i − 1 and i in N model space by three units, i − 1, i and i + 1, in N + 1 space. For the death move, in the reverse direction, N + 1 → N (assuming there are N + 1 units), we select i ∈ {2, 3, ..., N} and replace units i − 1, i and i + 1 by two units. Note that p_birth(i) = p_death(i). The general form of the bijection on the continuous parameters is

\[
\begin{pmatrix} \mu^N_{i-1:i} \\ \delta^N_{i-1:i} \\ m^N_{i-1:i} \\ u_1 \\ u_2 \\ u_3 \end{pmatrix}
\ \overset{\text{birth}}{\underset{\text{death}}{\rightleftharpoons}}\
\begin{pmatrix} \mu^{N+1}_{i-1:i+1} \\ \delta^{N+1}_{i-1:i+1} \\ m^{N+1}_{i-1:i+1} \end{pmatrix}. \tag{7.26}
\]

The random variables u_1, u_2 and u_3 are randomly generated from univariate distributions: u_1 is simulated from the normal distribution, u_1 ~ N(0, σ_1²), where σ_1² is a tuning parameter; u_2 is simulated from the gamma distribution, u_2 ~ G(ν, ν), where ν is a tuning parameter; and u_3 is simulated from the uniform distribution, u_3 ~ U(0, 1). The corresponding transformation on the discrete variables is

\[
s^N_{i-1:i}\ \overset{\text{birth}}{\underset{\text{death}}{\rightleftharpoons}}\ s^{N+1}_{i-1:i+1}, \tag{7.27}
\]

with the units before and after remaining unchanged but relabelled. For instance, in the forward move, μ^N_{1:i−2} will be relabelled as μ^{N+1}_{1:i−2} and, when μ^{N+1}_{i−1:i+1} is inserted, μ^N_{i+1:N} will be relabelled as μ^{N+1}_{i+2:N+1}, with corresponding changes in the labels for δ^{N+1}, m^{N+1} and s^{N+1}.

1. The excitability parameters, m and δ. The proposals from N model space to N + 1 model space for the shape and location parameters are

\[
\begin{pmatrix}
\delta^{N+1}_{i-1} & \delta^{N+1}_{i} & \delta^{N+1}_{i+1} \\[4pt]
m^{N+1}_{i-1} & m^{N+1}_{i} & m^{N+1}_{i+1}
\end{pmatrix}
:=
\begin{pmatrix}
\delta^N_{i-1} & u_2\,\dfrac{\delta^N_{i-1} + \delta^N_i}{2} & \delta^N_i \\[6pt]
m^N_{i-1} & m^N_{i-1} + u_3 (m^N_i - m^N_{i-1}) & m^N_i
\end{pmatrix}. \tag{7.28}
\]

The inverse of this transformation, from N + 1 model space to N model space for the excitability parameters, is

\[
\begin{pmatrix}
\delta^N_{i-1} & \delta^N_{i} & u_2 \\[4pt]
m^N_{i-1} & m^N_{i} & u_3
\end{pmatrix}
:=
\begin{pmatrix}
\delta^{N+1}_{i-1} & \delta^{N+1}_{i+1} & \dfrac{2\delta^{N+1}_i}{\delta^{N+1}_{i-1} + \delta^{N+1}_{i+1}} \\[6pt]
m^{N+1}_{i-1} & m^{N+1}_{i+1} & \dfrac{m^{N+1}_i - m^{N+1}_{i-1}}{m^{N+1}_{i+1} - m^{N+1}_{i-1}}
\end{pmatrix}.
\]
rest of the µs would have to be rescaled. We consider four possible moves, M = 2, 3, 4, 5, represented by the rows of the matrix below, designed for switching between the alternation and the no alternation models. Either the first or the second will be rejected with probability one depending on
the relative size of µi and µi−1 . Similar comments also apply to the third and fourth moves.
+1 +1 +1 µN µN µN i−1 i i+1
µN i−1
N N µN i − µi−1 + u1 µi−1 − u1
N N N µi µN i−1 − µi + u1 µi − u1 := µN + u µN − µ N − u µN 1 1 i−1 i i−1 i−1 N N µN µN i + u1 i−1 − µi − u1 µi
. (7.29)
+1 N +1 +1 If the proposal does not satisfy the prior condition: µN , µN i−1 , µi i+1 >
µmin , it is rejected. For the inverse transformation we select i ∈ {2, 3, . . . , N } and propose replacing units i − 1, i and i + 1 by two units. The transformation from N + 1 model space to N model space for the increments, the 219
inverse of the moves shown in (7.29), is
N µN i−1 µi
+1 µN i−1
+1 +1 +1 N +1 µN + µN µN i i+1 i−1 − µi+1
N +1 +1 +1 +1 N +1 µi + µN µN µN i+1 i−1 i−1 − µi+1 u1 := +1 +1 N +1 +1 N +1 µN µN µN i+1 i−1 + µi i−1 − µi+1 +1 N +1 +1 +1 N +1 µN µN µN i−1 + µi i+1 i−1 − µi+1
.
+1 N +1 The right hand column of these matrices show that if µN i−1 and µi+1 differ
greatly in magnitude then the move will be rejected with high probability. 3. The discrete variable, s To reallocate the new units to each observation, we propose discarding two of the allocations for each observation but keep the rest as shown in (7.27). Having new values of δ N +1 , µN +1 and mN +1 , the three new +1 states sN i−1:i+1,t for each observation are selected by block Gibbs sampling +1 N +1 N +1 from the full conditionals. sN i−1:i+1,t ∼ p(si−1:i+1,t | . . .) where si−1:i+1,t ∈
{(0, 0, 0), (0, 0, 1), . . . , (1, 1, 1)} and N +1 N +1 N +1 N +1 +1 N +1 , s ()7.30) p(sN i−1:i+1,t | . . .) ∝ p(si−1:i+1,t | δ i−1:i+1,t , mi−1:i+1,t ) p(yt | Θy N +1 +1 N +1 and p(sN i−1:i+1,t | δ i−1:i+1,t , mi−1:i+1,t ) =
Qi+1
j=i−1
sN +1
N +1
pj,tj,t (1 − pj,t )1−sj,t
and
pj,t is given by (7.12). The proposal probability, needed to calculate the QT N +1 acceptance ratio, is given by q N →N +1 = t=1 pt (si−1:i+1,t | . . .) where +1 p(sN i−1:i+1,t | . . .) are the normalised full conditionals given in(7.30). The re-
verse proposal from three to two units must also be considered: q N +1→N = QT N t=1 pt (si−1:i,t | . . .) where N N N N N p(sN i−1:i,t | . . .) ∝ p(si−1:i,t δ i−1:i,t , mi−1:i+1,t ) p(yt | Θy , s )
(7.31)
and sN i−1:i,t ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}. Finally, the birth move {xN , u} −→ xN +1 is accepted with probability 220
A(xN , xN +1 ) where xN = {N, ΘN , sN } and N +1→N p(xN +1 | y) q N +1→N rM J A(xN , xN +1 ) = min 1, Q3 N →N +1 N N →N +1 p(x | y) q rM i=1 p(ui )
!
, (7.32)
where r is described below equation (7.22) and where ∂ΘN +1 J = ∂(u, ΘN ) N δiN + δi−1 N = (mN − m ) . i i−1 2 N +1 N +1 +1 N +1 δi+1 + δi−1 = (mN − m ) i+1 i−1 2
The acceptance ratio for the reverse of this move: the N + 1 → N move, 1 N +1 N is A(x , x ) = min 1, B(xN ,xN +1 ) where B(xN , xN +1 ) is given by the
second term in brackets on the right-hand side of (7.32).
If the proposal is accepted the newly created binary random variables are reexpressed as Gaussian latent variables using (7.13).
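A useful sanity check on trans-dimensional moves of this kind is to confirm numerically that the forward and inverse maps really are mutual inverses, and that the claimed Jacobian agrees with a finite-difference determinant. The sketch below does this for the excitability part of the bijection in (7.28); the input values are arbitrary.

```python
import itertools

def forward(d_im1, d_i, m_im1, m_i, u2, u3):
    """Birth part of (7.28): two units (i-1, i) become three units."""
    d_new = (d_im1, u2 * (d_im1 + d_i) / 2.0, d_i)
    m_new = (m_im1, m_im1 + u3 * (m_i - m_im1), m_i)
    return d_new + m_new

def inverse(d3, m3):
    """Death move: recover (delta, m, u2, u3) from three units."""
    d_im1, d_mid, d_ip1 = d3
    m_im1, m_mid, m_ip1 = m3
    u2 = 2.0 * d_mid / (d_im1 + d_ip1)
    u3 = (m_mid - m_im1) / (m_ip1 - m_im1)
    return d_im1, d_ip1, m_im1, m_ip1, u2, u3

x = (1.5, 2.5, 0.2, 0.9, 0.7, 0.3)        # arbitrary test point
out = forward(*x)
back = inverse(out[:3], out[3:])
assert all(abs(p - q) < 1e-9 for p, q in zip(x, back))  # round trip

# Finite-difference Jacobian of the forward map versus the closed form
eps = 1e-6
J = [[(forward(*[v + (eps if c == k else 0.0) for k, v in enumerate(x)])[r]
       - out[r]) / eps
      for c in range(6)] for r in range(6)]

def det(M):
    # Leibniz expansion; fine for a one-off 6x6 check
    n = len(M)
    total = 0.0
    for perm in itertools.permutations(range(n)):
        inv = sum(1 for a in range(n) for b in range(a + 1, n) if perm[a] > perm[b])
        prod = 1.0
        for r in range(n):
            prod *= M[r][perm[r]]
        total += (-1.0 if inv % 2 else 1.0) * prod
    return total

closed_form = (x[0] + x[1]) / 2.0 * (x[3] - x[2])   # (delta sum)/2 * (m_i - m_{i-1})
assert abs(abs(det(J)) - closed_form) < 1e-3
print("bijection and Jacobian checks passed")
```

Checks like this catch a mis-specified inverse or Jacobian before they silently bias the acceptance ratio (7.32).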
Appendix C

Details of the varying N move types M = 6, 7

These moves replace one unit by two units, or the reverse move. There are two moves of this type. For the birth move, we randomly select with equal probability i ∈ {1, 2, ..., N} and propose replacing that unit in N model space by two units, i and i + 1, in N + 1 space. In the reverse direction (assuming there are N units), we select with equal probability i ∈ {1, 2, ..., N − 1} and replace units i and i + 1 by one unit. The general form of the bijection on the continuous parameters is

\[
\begin{pmatrix} \mu^N_i \\ \delta^N_i \\ m^N_i \\ u_4 \\ u_5 \\ u_6 \end{pmatrix}
\ \overset{\text{birth}}{\underset{\text{death}}{\rightleftharpoons}}\
\begin{pmatrix} \mu^{N+1}_{i:i+1} \\ \delta^{N+1}_{i:i+1} \\ m^{N+1}_{i:i+1} \end{pmatrix}. \tag{7.33}
\]
In the birth move, a new location and shape parameter are introduced and the area of one unit is shared between two: the contribution of one unit to the observation is reallocated to two units.

1. The shape and location parameters, δ and m. For the shape parameters we propose two moves, M = 6, 7, which we depict by the rows of the matrix below:

\[
\begin{pmatrix} \delta^{N+1}_{i} & \delta^{N+1}_{i+1} \end{pmatrix}
:=
\begin{pmatrix}
u_5\, \delta^N_i & \delta^N_i \\
\delta^N_i & u_5\, \delta^N_i
\end{pmatrix},
\]

where u_5 ~ Gamma(ν, ν) and ν is a tuning parameter. For the reverse move from N + 1 to N model space (assuming there are N + 1 units),

\[
\begin{pmatrix} \delta^{N}_{i} & u_5 \end{pmatrix}
:=
\begin{pmatrix}
\delta^{N+1}_{i+1} & \dfrac{\delta^{N+1}_{i}}{\delta^{N+1}_{i+1}} \\[6pt]
\delta^{N+1}_{i} & \dfrac{\delta^{N+1}_{i+1}}{\delta^{N+1}_{i}}
\end{pmatrix}.
\]

For the location parameters we propose two moves, M = 6, 7, which we depict by the rows of the matrix below:

\[
\begin{pmatrix} m^{N+1}_{i} & m^{N+1}_{i+1} \end{pmatrix}
:=
\begin{pmatrix}
m^N_{i-1} + u_6 (m^N_i - m^N_{i-1}) & m^N_i \\
m^N_i & m^N_i + u_6 (m^N_{i+1} - m^N_i)
\end{pmatrix}, \tag{7.34}
\]

where u_6 ~ U(0, 1). The reverse moves are

\[
\begin{pmatrix} m^{N}_{i} & u_6 \end{pmatrix}
:=
\begin{pmatrix}
m^{N+1}_{i+1} & \dfrac{m^{N+1}_{i} - m^{N+1}_{i-1}}{m^{N+1}_{i+1} - m^{N+1}_{i-1}} \\[6pt]
m^{N+1}_{i} & \dfrac{m^{N+1}_{i+1} - m^{N+1}_{i}}{m^{N+1}_{i+2} - m^{N+1}_{i}}
\end{pmatrix},
\]

where m^N_0 = min(S_t) and m^N_{N+1} = max(S_t).
2. The single MUAPs, μ. The transformation from N model space to N + 1 model space on μ_i is

\[
\begin{pmatrix} \mu^{N+1}_{i} & \mu^{N+1}_{i+1} \end{pmatrix}
:=
\begin{pmatrix} u_4 (\mu^N_i - \mu_{\min}) & \mu^N_i - u_4 (\mu^N_i - \mu_{\min}) \end{pmatrix}, \tag{7.35}
\]

where u_4 ~ U(0, 1). Equation (7.35) was motivated by observing that, in comparing N with N + 1 dimensional models, one unit was shared between two others. The right-hand panel of Table 7.2 compares μ^5_i from the 5 unit model and μ^4_i from the 4 unit model; note that μ^4_3 ≈ μ^5_3 + μ^5_4. The reverse move from N + 1 to N model space is

\[
\begin{pmatrix} \mu^{N}_{i} & u_4 \end{pmatrix}
:=
\begin{pmatrix} \mu^{N+1}_{i} + \mu^{N+1}_{i+1} & \dfrac{\mu^{N+1}_{i}}{\mu^{N+1}_{i} + \mu^{N+1}_{i+1} - \mu_{\min}} \end{pmatrix}.
\]
3. The discrete variables, s. For the transformation N → N + 1 one unit is reallocated to two units. This and the reverse move are given by

\[
s^{N}_{i,t}\ \overset{\text{birth}}{\underset{\text{death}}{\rightleftharpoons}}\ s^{N+1}_{i:i+1,t}, \qquad t = 1, 2, \ldots, T. \tag{7.36}
\]

The allocation is done by Gibbs sampling in the same manner as for the previous move. In the forward move, four possible reallocations are considered for each observation, and in the reverse direction only two. The acceptance ratio for the forward move N → N + 1 for M = 6, 7 is given by (7.32), where for move M = 6

\[
J = \left|\frac{\partial \Theta^{N+1}}{\partial(u, \Theta^N)}\right|
= \delta^N_i\, (m^N_i - m^N_{i-1})(\mu^N_i - \mu_{\min})
= \delta^{N+1}_{i+1}\, (m^{N+1}_{i+1} - m^{N+1}_{i-1})(\mu^{N+1}_{i} + \mu^{N+1}_{i+1} - \mu_{\min}),
\]

and for move M = 7

\[
J = \delta^N_i\, (m^N_{i+1} - m^N_i)(\mu^N_i - \mu_{\min})
= \delta^{N+1}_{i}\, (m^{N+1}_{i+2} - m^{N+1}_{i})(\mu^{N+1}_{i} + \mu^{N+1}_{i+1} - \mu_{\min}).
\]
Chapter 8 Concluding remarks and recommendations for further work

This thesis has been an exploration of the use of latent variable models for modelling complexity within biological systems. Firstly we summarise what we have done in this thesis; secondly we present areas where new work is needed. We have done the following:

• We have developed an autoregressive hidden Markov model for modelling longitudinal counts. This is essentially a mixture of autoregressive processes in which the transitions between these components, or states, are explicitly modelled. This is an extension of the model of Dunson (2000). We believe the Bayesian method provides a superior means of expressing the variability of the parameters, expresses better the within-rat and between-rat variability, and is more parsimonious than the Dunson (2000) model.

• The pseudo-autologistic model of Besag (1974) was introduced as a substitute for the true likelihood of the autologistic model for binary spatial data because of the need to calculate the normalising constant. We have
extended this model to a hidden multivariate lattice and applied it to the Lansing Woods data set to describe the interaction of two types of trees.

• Motor unit number estimation has been an actively researched but unsolved problem for over 30 years (Shefner, 2001). A method developed by Daube (1995) seems to give reasonable estimates for patients with normal or mildly diseased muscle, but fails for patients with advanced disease, where it is needed the most. We have developed a method of MUNE for patients with disease by using the theory of Green (1995). Using reversible jump MCMC we have developed an algorithm for calculating the posterior marginal distribution of motor unit numbers. Through our method of MUNE, researchers now have the means to track diseases of the neuromuscular system such as motor neurone disease or amyotrophic lateral sclerosis. Hopefully this will lead to an experimental protocol for testing the efficacy of drugs that are claimed to be capable of stopping or slowing the progress of these diseases.

In addition we have exposed several potential areas for new work.

1. We have discussed the versatility of the hidden Markov model, which makes it very useful. It does, however, have one fundamental limitation: the discrete hazard of exiting a state means that the waiting time to exit must follow the geometric distribution. This one-parameter distribution is not versatile enough for some kinds of problem; an example is the inability of the standard HMM to model the variability of segment lengths encountered in gene recognition systems. At the moment the rows of a transition matrix are modelled from the Dirichlet distribution. Could the transition matrix be modelled as a mixture of distributions? This and other alternatives need to be examined as ways of providing more flexibility in the ability of the HMM to model waiting time or length distributions.
2. In Chapter 5 we used a generalisation of the pseudo-autologistic model of Besag (1974) to describe the interactions of a number of species on a rectangular lattice. The true likelihood was not used because of the intractability of calculating its normalising constant. Recently Møller et al. (2003) have used an auxiliary lattice in such a way that estimation can be conducted on large lattices without the need to calculate the normalising constant. The use of a hidden true autologistic lattice needs to be explored for our application, and the question asked of whether it offers any advantage over the pseudo-autologistic model.

3. The MUNE problem has thrown up several additional challenges.

(a) An efficient Bayesian means of conducting MUNE for normal patients, or patients with moderate disease, needs to be developed. The method of Daube (1995) does not provide a measure of the variability of the estimate.

(b) Our method of MUNE relies on one informative prior: the parameter representing the lower threshold for the increments of area. By conducting many repeated studies (where MUNE is carried out at the same session but with different configurations of the electrodes) we need to investigate ways of determining the optimal setting of this parameter so that it provides the most reliable estimate of motor unit numbers.

(c) Patients with rapidly advancing disease show a decremental response, in which reductions in area and amplitude are observed. This effect is thought to occur because a diseased unit becomes depleted of adenosine triphosphate (ATP) when it fires, reducing its ability to fire with the same magnitude of CMAP area immediately afterward. This effect
is also thought to be related to the frequency at which pulses are delivered. Not only must our method of MUNE be adjusted for this effect, but there is a case for modelling the process in order to learn more about this depletion.

(d) An adequate solution to the problem of drift, discussed in Chapter 7, needs to be found.

(e) Instead of using area alone to determine MUNE, a bivariate model needs to be constructed that considers both area and amplitude simultaneously.
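As background to point 2, Besag's pseudo-likelihood replaces the joint autologistic likelihood by a product of full conditionals, one per site, so the intractable normalising constant of the joint distribution never has to be evaluated. The sketch below is illustrative only (the function name, parameterisation and toy lattice are our own, not the model fitted in Chapter 5): it computes the negative log pseudo-likelihood of a binary lattice under a first-order (4-neighbour) autologistic model with field parameter alpha and interaction parameter beta.

```python
import math

def neg_log_pseudolikelihood(alpha, beta, lattice):
    """Besag (1974) negative log pseudo-likelihood for a binary lattice.

    Each site contributes log P(x_ij | its 4-neighbourhood), so the
    normalising constant of the joint autologistic model never appears.
    """
    n_rows, n_cols = len(lattice), len(lattice[0])
    nll = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # sum of the (up to four) neighbouring values, truncated
            # at the lattice boundary
            s = sum(lattice[r][c]
                    for r, c in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= r < n_rows and 0 <= c < n_cols)
            eta = alpha + beta * s
            # log P(x_ij | neighbours) = x_ij * eta - log(1 + exp(eta))
            nll -= lattice[i][j] * eta - math.log1p(math.exp(eta))
    return nll

# toy 4x4 binary lattice with two like-valued clusters
x = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(neg_log_pseudolikelihood(0.0, 0.5, x))
```

For this clustered lattice a positive beta yields a smaller negative log pseudo-likelihood than beta = 0, as expected for an attractive interaction. Maximising this criterion gives the maximum pseudo-likelihood estimate; the auxiliary-variable approach of Møller et al. (2003) would instead target the true likelihood.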
Bibliography

Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. Ser. B 36, 192–236.

Daube, J. R. (1995). Estimating the number of motor units in a muscle. J. Clin. Neurophysiol. 16, 585–594.

Dunson, D. B. (2000). Models for papilloma multiplicity and regression: applications to transgenic mouse studies. Biometrics 49, 19–30.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.

Møller, J., A. N. Pettitt, K. K. Berthelsen, and R. Reeves (2003). An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. Technical report.

Shefner, J. M. (2001). Motor unit number estimation in human neurological diseases and animal models. Clin. Neurophysiol. 112, 955–964.