ENSEMBLE HIDDEN MARKOV MODELS FOR BIOSIGNAL ANALYSIS

Iead Rezek and Stephen J. Roberts
Robotics Research Group, Department of Engineering Science, University of Oxford, UK.
{irezek,sjrob}@robots.ox.ac.uk

Abstract: Variational learning theory allows the estimation of posterior probability distributions of model parameters, rather than the parameters themselves. We demonstrate the use of variational learning methods on Hidden Markov models with different observation models and apply the HMM to a range of biomedical signals, such as EEG, periodic breathing and RR-interval series.

1. INTRODUCTION

Hidden Markov models (HMMs) are well-established models with a wide range of applications. The two main components of the HMM are its hidden state sequence, which encodes abrupt changes in the data, and a set of observation models, which model the within-state dynamics of the data. Traditionally, the HMM parameters are estimated in the maximum-likelihood framework. Maximum-likelihood approaches, however, suffer from well-known problems of over-fitting (the likelihood always increases with model complexity) and thus the number of states in the HMM must be known a priori or estimated by application of (possibly inconsistent) penalty terms to the likelihood score. A full Bayesian approach to learning avoids these problems. This paper describes such a scheme and its application to the analysis of biomedical time series.
2. OVERVIEW OF VARIATIONAL LEARNING

Variational inference is a relatively new method. It can be regarded as a collection of inference methods of which, for example, the Expectation-Maximisation algorithm is a special case. In this framework, integration over computationally or analytically intractable models is circumvented by minimising the distance between an approximate but tractable distribution and the exact but intractable true model distribution (see [4] and [6] for excellent tutorials). In this paper the distribution to be approximated is the full posterior probability distribution over all hidden variables (parameters or hidden states). By approximating the full posterior, one can reap the benefits of Bayesian analysis, such as full Bayesian model estimation and automatic penalties for over-complex models, thus avoiding over-fitting (note that, being an approximation, the optimal model is chosen from the class of approximated and thus suboptimal models).

Variational learning aims to minimise the so-called variational free energy [5] between the (intractable) model posterior $P$ and a simpler (analytic) approximating distribution $Q$. The free energy is given in terms of the Kullback-Leibler (KL) divergence between $Q$ and $P$, i.e.

$$\mathcal{F}(Q) = \int Q(\mathcal{H}) \ln \frac{Q(\mathcal{H})}{P(\mathcal{X}, \mathcal{H})} \, d\mathcal{H} = \mathrm{KL}\left[\, Q(\mathcal{H}) \,\|\, P(\mathcal{H} \mid \mathcal{X}) \,\right] - \ln P(\mathcal{X}), \qquad (1)$$

where the distribution $Q$ is defined over the hidden variables $\mathcal{H}$, such as parameters or hidden states, and $\mathcal{X}$ denotes the data. Since the first term on the right-hand side is always non-negative, the free energy is an upper bound on the negative log-probability of the data, i.e. the negative log-evidence. The integral (1) is minimised with respect to the individual Q-distributions.

Given a set of hidden variables $\mathcal{H} = \{H_1, \dots, H_K\}$, the method known as "Mean Field" variational approximation assumes that the Q-distributions factorise, i.e.

$$Q(\mathcal{H}) = \prod_i Q_i(H_i), \qquad (2)$$

with the additional constraint that $\int Q_i(H_i)\, dH_i = 1$. Under the mean-field assumption, the distributions which minimise the free energy integral (1) can be shown to be [3]

$$Q_i(H_i) = \frac{1}{Z_i} \exp \left\langle \ln P(\mathcal{X}, \mathcal{H}) \right\rangle_{\prod_{j \neq i} Q_j}, \qquad (3)$$

where $\langle \cdot \rangle_{\prod_{j \neq i} Q_j}$ denotes the expectation with respect to all Q-distributions except $Q_i$, and $Z_i$ is just a normalisation constant.

In this paper, we deviate from the mean field approach in two ways. First, we drop the mean field assumption for the hidden state variables but maintain a mean-field assumption for the parameters. Thus, denoting the hidden state sequence by $S$ and the model parameters by $\Theta = \{\theta_i\}$, we assume the Q's to be of the following form

$$Q(S, \Theta) = Q(S) \prod_i Q(\theta_i).$$
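Although the derivation is not spelled out above, equation (3) follows from a standard stationarity argument; a minimal sketch, using the symbols defined above, is: holding all factors except $Q_i$ fixed and adding a Lagrange multiplier $\lambda_i$ for the normalisation constraint,

$$\frac{\delta}{\delta Q_i(H_i)} \left[ \mathcal{F}(Q) + \lambda_i \left( \int Q_i(H_i)\, dH_i - 1 \right) \right] = 0 \;\Longrightarrow\; \ln Q_i(H_i) = \big\langle \ln P(\mathcal{X}, \mathcal{H}) \big\rangle_{\prod_{j \neq i} Q_j} + \mathrm{const},$$

which, after exponentiating and normalising, is exactly the update (3).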
Secondly, we use mixtures of Q-distributions,

$$Q(S, \Theta) = \sum_{a} Q(a)\, Q(S, \Theta \mid a), \qquad (4)$$

where each Q-component $Q(S, \Theta \mid a)$ corresponds to a different state-space dimension $a$ and is weighted by $Q(a)$ ($0 \le Q(a) \le 1$, $\sum_a Q(a) = 1$). The two assumptions jointly reflect the structure of the HMM [9], in that they capture the chain structure of the HMM and the different state-space dimensionalities. The free energy, under the above conditions, may then be expressed in terms of the free energies $\mathcal{F}_a$ of the individual mixture components and the weights $Q(a)$. The estimation thus involves optimisation of the individual free-energy components and estimation of the free-energy weights from the relation [5]

$$Q(a) \propto \exp\{ -\mathcal{F}_a \}. \qquad (5)$$

3. VARIATIONAL LEARNING OF HIDDEN MARKOV MODELS

3.1. Definitions

In the following, for clarity, we temporarily drop the subscript of the mixture components. The free energy integral to be minimised is

$$\mathcal{F}(Q) = \sum_{S} \int Q(S, \Theta) \ln \frac{Q(S, \Theta)}{P(\mathcal{X}, S, \Theta)} \, d\Theta, \qquad (6)$$

where we assume an HMM with a discrete M-dimensional state space (defined by probabilities for the initial state and for the state transitions) and Gaussian observation models (parameterised by means and precision matrices). In the above, the Hidden Markov model likelihood of the complete data is given by the usual product of the initial-state probability, the transition probabilities along the state sequence, and the state-conditional observation densities. The model parameter priors are assumed to be conjugate and thus [1]: an M-dimensional Dirichlet density for the initial state probability, M-dimensional Dirichlet densities for the transition probabilities, Normal densities for the observation model means, and Wishart densities for the observation model precisions.

The choice of Gaussian observation model depends on the data and, hence, might not always be appropriate. A change to another distribution within the exponential family is theoretically easy, provided conjugate priors exist. For example, a Poisson observation model might be appropriate for count data, such as RR-intervals obtained from electrocardiogram recordings. In this case, each of the M states has a Poisson observation distribution: the data points are counts with a state-specific rate parameter, while the associated values are called the exposure of the corresponding unit, i.e. a fraction of the unknown parameter of interest [2]. The prior for the rate parameter of the Poisson distribution is chosen to be conjugate and thus a Gamma density.

As mentioned in the previous section, we take the Q-distributions to factorise into a distribution over the hidden state sequence and distributions over the individual parameters. For Gaussian observation models the parameter Q-distributions are again Dirichlet, Normal and Wishart densities or, for Poisson observation models, Gamma densities. The Q-distributions are identical in form to the priors and so, to avoid confusion, we denote the parameters of the Q-distributions with tildes.
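To make the Poisson observation model above concrete, the short sketch below evaluates the per-state log-likelihood of a count given a rate and an exposure. It is purely illustrative: the function and variable names (poisson_loglik, rate, exposure) are ours, not the paper's, and the paper does not specify its exact parameterisation.

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglik(x, rate, exposure=1.0):
    """Log-likelihood of a count x under a Poisson observation model with a
    state-specific rate and a per-sample exposure, i.e. x ~ Poisson(exposure * rate).
    Names and parameterisation are illustrative only."""
    mean = exposure * rate
    return x * np.log(mean) - mean - gammaln(x + 1.0)

# Example: log-likelihood of observing 7 events with exposure 0.5
# under each of three hypothetical state rates.
rates = np.array([5.0, 10.0, 20.0])
print(poisson_loglik(7, rates, exposure=0.5))
```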
3.2. Estimation

3.2.1. Model Parameters

By taking the derivatives of the free energy with respect to the distributions of the unknown parameters, we obtain a set of update formulae for the parameters of the distributions, which are given in the appendix.
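As a rough illustration of what such conjugate updates look like, the sketch below performs the standard variational update of the Dirichlet pseudo-counts for the initial-state and transition distributions. It is written under our own assumptions (hyperparameter names alpha_pi0 and alpha_A0 are ours) and is not a transcription of the authors' exact equations.

```python
import numpy as np

def update_dirichlet_params(gamma, xi, alpha_pi0, alpha_A0):
    """One variational update of the Dirichlet pseudo-counts.

    gamma     : (T, M) state responsibilities gamma_t(m) from forward-backward
    xi        : (T-1, M, M) pairwise responsibilities xi_t(m, m')
    alpha_pi0 : (M,) prior pseudo-counts for the initial-state Dirichlet
    alpha_A0  : (M, M) prior pseudo-counts for the transition Dirichlets

    Standard conjugate form: posterior pseudo-counts = prior pseudo-counts
    plus expected counts under the Q-distribution over state sequences.
    """
    alpha_pi = alpha_pi0 + gamma[0]       # expected initial-state counts
    alpha_A = alpha_A0 + xi.sum(axis=0)   # expected transition counts
    return alpha_pi, alpha_A
```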
3.2.2. Hidden States
The hidden variables (i.e. the state sequence) can be estimated using standard forward-backward message passing [7], conditioned on the data and the expectations of the model parameters under the Q-distributions. The use of the forward-backward recursions is justified by the fact that the message passing equations are fixed-point equations of the free energy when the Q-distributions are assumed to be of the form given in equation (4) [10].
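For concreteness, here is a minimal sketch of scaled forward-backward recursions of the kind referred to above, written against expected (log) parameters. All function and variable names are ours; the paper itself gives no code.

```python
import numpy as np

def forward_backward(log_pi, log_A, log_obs):
    """Scaled forward-backward recursions for a single HMM.

    log_pi : (M,)   expected log initial-state probabilities under Q
    log_A  : (M, M) expected log transition probabilities under Q
    log_obs: (T, M) expected log observation densities ln p(x_t | state m)

    Returns state responsibilities gamma (T, M), pairwise responsibilities
    xi (T-1, M, M) and the log-normaliser of the recursion.
    """
    T, M = log_obs.shape
    pi, A, B = np.exp(log_pi), np.exp(log_A), np.exp(log_obs)

    alpha = np.zeros((T, M))
    c = np.zeros(T)                        # per-step scaling factors
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                   # normalised by construction
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return gamma, xi, np.log(c).sum()
```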
3.2.3. State Space Dimension

Estimation is performed over several state-space dimensions. Given a fixed state-space dimension, estimation involves iterative application of the forward-backward message passing, updating of the model parameters, and evaluation of the free energies. The free energies obtained for each state-space dimension are then used to determine the highest-probability model according to equation (5), and thus the optimal state-space dimension.
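A minimal sketch of this selection step, assuming the free energies have already been computed for each candidate dimension. The softmax form follows the convention of equation (5) as written above; the variable names and the numerical values are ours and purely hypothetical.

```python
import numpy as np

def model_weights(free_energies):
    """Turn per-dimension free energies F_a into weights Q(a) ~ exp(-F_a),
    normalised to sum to one (computed stably by subtracting the minimum)."""
    f = np.asarray(free_energies, dtype=float)
    w = np.exp(-(f - f.min()))
    return w / w.sum()

# Hypothetical free energies for state-space dimensions 2..7.
dims = np.arange(2, 8)
F = np.array([6310.0, 6205.0, 6150.0, 6160.0, 6180.0, 6210.0])
Q = model_weights(F)
print("optimal dimension:", dims[np.argmax(Q)])
```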
4. EXPERIMENTS

4.1. Sleep EEG

The first application to medical time-series analysis demonstrates the use of the variational HMM on features extracted from a section of electroencephalogram data. The recordings were taken from a subject during a sleep experiment. The study looked at changes of cortical activity during sleep in response to external stimuli (e.g. from a vibrating pillow under the subject's head). The fractional spectral radius (FSR) measure and the spectral entropy [8] were computed for consecutive non-overlapping windows of one second length. In figure (1), the model clearly shows a preference for a particular state-space dimension. Figure (2) shows a section of data with the corresponding Viterbi state sequence. The data is segmented into the following regimes: wake (first part of the recording), deeper sleep, and light sleep, the latter clearly visible at sleep onset and at the arousal.

Fig. 1. State Space Dimension Selection: variational free energy and model probability versus state-space dimension.

Fig. 2. HMM Sleep EEG Segmentation: state label sequence versus time (sec).

4.2. Periodic Respiration

We also applied the HMM to features extracted from a section of Cheyne-Stokes data (a breathing disorder in which bursts of fast respiration are interspersed with periods of absent breathing), consisting of one EEG recording and a simultaneous respiration recording, both sampled at the same rate. The feature, the fractional spectral radius (FSR) [8], was computed from consecutive non-overlapping windows of two seconds length for the EEG and respiration signals separately. The features thus extracted jointly formed the feature space to which the HMM was then applied. As seen in figure (3), the model clearly shows a preference for a particular state-space dimension. Figure (4) shows a data section with the corresponding Viterbi state sequence. The data is segmented predominantly into the following regimes: segments of arousal from sleep, a wake state with rapid respiration, and two sleep states differing only in the EEG micro-structure.
Fig. 3. Model order for CS-data: variational free energy and model probability versus state-space dimension.
Fig. 4. CS-data: respiration and EEG signals with their respective segmentations.

4.3. Heart Beat Intervals
In order to apply the Poisson observation model of the HMM, we took a sequence of RR-intervals obtained from the RR database at UPC, Barcelona. A subject (identifier RPP1, male, age 25 years, height 178 cm, weight 70 kg) underwent a controlled respiration experiment. While sitting, the subject took 6 deep breaths between 30 and 90 seconds after recording onset. The optimal segmentation was found to have 3 states and is shown in figure 5. The top plot depicts the original RR-interval time series and the middle plot the state labels resulting from the Viterbi path. The segmentation based on the Viterbi path is shown in the bottom plot. A change in state dynamics is clearly visible in the state sequence between 30 and 90 seconds, i.e. the period during which the subject took deep breaths. The state dynamics parallel those observed in the heart-rate signal obtained from the RR-intervals by interpolation. The difference, however, is that the state sequence is essentially a smoothed version of the heart-rate signal, as several heart-rate levels will fall into one state. In addition, no interpolation as such is done, i.e. the HMM derives the heart rate statistically.
Fig. 5. HMM 3-state Viterbi segmentation for subject RPP1 during the controlled respiration experiment: RR intervals (top), Viterbi state sequence (middle) and RR-interval segmentation (bottom).
5. CONCLUSION

Unlike traditional maximum-likelihood learning of HMMs, our approach does not require additional penalisation of over-complex models; the algorithm also converges more rapidly and is much more robust. We have found the Gaussian observation model to be the most robust of the observation models considered; it clearly has many applications and we have found it extremely useful in other applications as well. The Poisson model has the disadvantage of quickly running into machine-precision problems if the rate in the exponent of the exponential in the density function becomes too large.

Acknowledgement

The authors would like to thank R. Conradt for her invaluable contributions to this research. I.R. is supported by the UK EPSRC, to whom we are most grateful. We would also like to thank Dr. M.A. García González from the Electronic Engineering Department at the Polytechnic University of Catalonia in Barcelona for providing the RR-interval recording (available at http://petrus.upc.es/wwwdib/people/GARCIA MA/database/main.htm).

REFERENCES

[1] J.M. Bernardo and A.F.M. Smith. Bayesian Theory. John Wiley and Sons, 1994.

[2] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2000.

[3] M. Haft, R. Hofmann, and V. Tresp. Model-Independent Mean Field Theory as a Local Method for Approximate Propagation of Information. Computation in Neural Systems, 10:93-105, 1999.

[4] T.S. Jaakkola. Tutorial on Variational Approximation Methods. In M. Opper and D. Saad, editors, Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.

[5] T.S. Jaakkola and M.I. Jordan. Improving the Mean Field Approximation via the Use of Mixture Distributions. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1997.

[6] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An Introduction to Variational Methods for Graphical Models. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1997.

[7] L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-284, 1989.

[8] I. Rezek and S.J. Roberts. Stochastic Complexity Measures for Physiological Signal Analysis. IEEE Transactions on Biomedical Engineering, 44(9):1186-1191, 1998.

[9] L.K. Saul and M.I. Jordan. Exploiting Tractable Substructures in Intractable Networks. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8. MIT Press, 1996.

[10] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Bethe Free Energy, Kikuchi Approximations and Belief Propagation Algorithms. In NIPS, 2000.

A. APPENDIX: UPDATE EQUATIONS

In the following we make use of the notation introduced in [7], specifically the state responsibilities $\gamma_t(m)$ and the pairwise state responsibilities $\xi_t(m, m')$. For Gaussian observation models, the update equations for the posterior means and posterior precisions add the responsibility-weighted data statistics to the corresponding prior parameters. The posterior initial state and transition probabilities are Dirichlet distributed, with parameters obtained by adding the expected initial-state counts $\gamma_1(m)$ and the expected transition counts $\sum_t \xi_t(m, m')$, respectively, to the prior Dirichlet parameters.

B. VARIATIONAL FREE ENERGY

The free energy is given by the sum of three terms: the negative entropy of the hidden state variables, the average complete-data log-likelihood under the Q-distribution, and the sum of all the KL-divergences between the Q-distributions and the prior distributions. The latter are available in closed form for the Gaussian, Wishart and Dirichlet densities used here.

C. POISSON MODEL UPDATE EQUATIONS

Using conjugate prior distributions for the observation model, it can be shown that the optimised Q-distribution for the Poisson rate of state $m$ is a Gamma density whose parameters are obtained from the prior Gamma parameters, the responsibility-weighted counts and the responsibility-weighted exposures.
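For readers who want the concrete shape of the Poisson update referred to in Appendix C, the following is the standard conjugate form it typically takes. The notation ($b_m$, $c_m$ for the prior Gamma shape and rate, $x_t$ for the counts, $e_t$ for the exposures) is ours and is written as an illustration of the pattern described above rather than a transcription of the authors' equations:

$$\tilde{b}_{m} = b_{m} + \sum_{t} \gamma_{t}(m)\, x_{t}, \qquad \tilde{c}_{m} = c_{m} + \sum_{t} \gamma_{t}(m)\, e_{t},$$

i.e. the posterior shape accumulates the responsibility-weighted counts and the posterior rate accumulates the responsibility-weighted exposures, in direct analogy with the Dirichlet updates sketched earlier.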