Blind Separation and Deconvolution

University of California Los Angeles

Blind Separation and Deconvolution : Contributions to Aggregated Time Series Analysis and Signal Processing

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Statistics

by

Kerby Shedden

1999

© Copyright by

Kerby Shedden 1999

The dissertation of Kerby Shedden is approved.

Bengt Muthén

Wing Hung Wong

Donald Ylvisaker

Ker-Chau Li, Committee Chair

University of California, Los Angeles 1999


Table of Contents

1  Introduction

2  Data analysis through blind separation
   2.1  The Computational method
        2.1.1  Identification of the Sources
        2.1.2  Estimation of the intensity weights
        2.1.3  Posterior Reconstruction
   2.2  Application to fMRI Series
        2.2.1  fMRI background
        2.2.2  Application of the procedure
   2.3  U.S. unemployment rates
   2.4  Non-Gaussian alternatives
   2.5  Comparison to other decompositions
        2.5.1  Raw data as the PCA input
        2.5.2  Frequency spectra as the PCA input
   2.6  Comparison to ARIMA Modeling
   2.7  Conclusion

3  Digital blind deconvolution
   3.1  Deconvolution and Bayesian restoration
   3.2  Importance Sampling, Sequential Importance Sampling
   3.3  The Inverse Filter
   3.4  Application of the Inverse Filter to Monte Carlo Deconvolution
        3.4.1  The Forward Approach
        3.4.2  The Inverse Approach
   3.5  Blind Deconvolution
   3.6  The method of moments
   3.7  Further discussion and Conclusion

References

List of Figures

2.1   Two views of the autocovariance for example 2.1
2.2   Raw data sequences corresponding to the boundaries of the 2-dimensional autocovariance cone for example 2.1
2.3   True and reconstructed series from example 2.1
2.4   True and reconstructed series from example 2.1
2.5   Mean image for fMRI series
2.6   Cross section of the autocovariance cone
2.7   The fMRI autocovariance cone, and some sample trajectories
2.8   A view of the main three dimensions of the fMRI autocovariance cone
2.9   Boundary autocovariances for the fMRI autocovariance cone
2.10  Spatial distribution of fMRI intensity weights
2.11  A reconstructed sequence from the 3 source fMRI model
2.12  A reconstructed sequence from the 3 source fMRI model
2.13  The unemployment rate autocovariances projected onto their first two eigenvectors
2.14  Posterior reconstruction for a single time series from example 2.2
2.15  Posterior reconstruction for a single time series from example 2.2
3.1   Error counts for the exact Bayesian restoration
3.2   Weights for 1000 sequential importance sampling draws computed after 2, 5, 50, and 150 time points have been processed
3.3   The inverse filter for example 3.1
3.4   The inverse filter for example 3.2
3.5   Error counts using the forward approach for example 3.1
3.6   Error counts using the forward approach for example 3.2
3.7   Error counts using the backward approach for example 3.1
3.8   Error counts using the backward approach for example 3.2
3.9   Blind deconvolution error counts for example 3.1
3.10  Blind deconvolution error counts for example 3.2

Vita

1971    Born, St. Louis, Missouri, USA.

1994    B.A. Mathematics, University of Michigan

1998    C. Phil. Statistics, UCLA

Abstract of the Dissertation

Blind Separation and Deconvolution: Contributions to Aggregated Time Series Analysis and Signal Processing

by

Kerby Shedden
Doctor of Philosophy in Statistics
University of California, Los Angeles, 1999
Professor Ker-Chau Li, Chair

This dissertation is concerned with time series analysis and signal processing. Two related topics are addressed. The first topic involves the problem of data analysis for very large collections of time series curves. The primary result is an analytic procedure that is computationally efficient, and that is interpretable from several different perspectives. The method involves viewing each of the observed time series curves as a randomly weighted sum of independent draws from a collection of stationary sources. This probability model is shown to have a natural correspondence to the geometry of the set of autocovariance functions of the time series in the collection, viewed as points in Euclidean space. The estimation algorithm consists of a PCA-like procedure for performing the source identification, followed by a posterior restoration of the hidden components. We apply the method to various types of simulated data, as well as to functional brain imaging data, and to U.S. unemployment rates. Results from these applications can be viewed at http://www.stat.ucla.edu/~kshedden. We also draw connections to basis-construction methods (ICA, PCA, FDA), and to ARIMA models. The second topic of the dissertation involves signal restoration for a single observed time series. We introduce two sampling procedures for carrying out Monte Carlo signal restoration that are shown to significantly reduce the computational burden relative to other Monte Carlo restoration procedures. These procedures can be incorporated into the data analysis described in the first section of the dissertation as a method for restoring the hidden signal components.


CHAPTER 1

Introduction

A number of important scientific problems involve the analysis of massive collections of high dimensional data in which the dimensions can be sequentially ordered. Generically, we have a large number n of records {z_1, . . . , z_n}, such that each record z_j is itself a vector: z_j = (z_j(1), . . . , z_j(d)), and the d scalar components of z_j have a natural ordering imposed by known information about the data generating mechanism. The most common situation of this type is when the components of z_j comprise a scalar time series. Other important examples include gene and protein sequences and representations of natural language.

The data structure described above is often referred to as a multiple time series, but historically much of the work using that name has focused on the situation in which there are, by our scale, a moderate number of measurements at each time point (a few hundred at most). In contrast, we will primarily be interested in situations in which there are thousands, or tens of thousands, of measurements per time point. An example of this type of situation occurs when the time series are recorded spatially, on a grid. If measurements are taken on an 80 × 80 × 10 three-dimensional grid, then 64,000 measurements are obtained at each time point. Much (but not all) of the familiar multiple time series methodology will not scale to problems of this size.

This thesis will broadly be concerned with the problem of data analysis for scientific measurements of the type described above. Estimation of specific, highly restricted stochastic models will not be a major focus. Rather, we take the perspective of a researcher who, at least for a moment, wishes to avoid directly incorporating prior physical or biological knowledge into the analysis. Even when strong external information is available, it will often be useful to estimate a less restricted model, and then check that whatever structure is found in the unrestricted case is a likely consequence of the hypothesized model. Lacking the ability to formulate a model in terms of physical quantities and laws, the models that we describe here will by necessity be described in terms of generic statistical quantities, such as covariances, and occasionally higher-order moments and frequency spectra. Our definition of interesting structure will be a stable functional relationship among such quantities. Since it is difficult to verify the stability of a functional relationship among high dimensional quantities, we will primarily be interested in relationships that can be described in terms of low dimensional projections.

The work is divided into two roughly independent parts. Chapter two describes a broad class of models that express each of the observed time series additively in terms of a small number of basic stochastic components. A computational procedure is described, and several simulated and real data examples are given. Chapter three focuses very specifically on a particular type of computation that arises in chapter two – the reconstruction of moving average signal components whose input distributions are finitely-supported. In addition to applications in the modeling framework described in chapter 2, this computation is a crucial part of several important problems in digital signal processing.


CHAPTER 2

Data analysis through blind separation

A number of important scientific problems involve the analysis of large ensembles of time series curves. For instance, if a series of images is recorded over time, as in functional medical imaging or satellite remote sensing, a time series is obtained for each pixel in the field of view. Such a data collection will have a very complex probability structure, since there will be spatial dependence within each image, as well as temporal dependence at a fixed location over time.

The complex systems on which these time series measurements are being made often are comprised of a relatively small number of subsystems that might be viewed, at least approximately, as contributing additively to a measurement that we make of the system. This motivates the following approach for analyzing high dimensional systems: search for a small collection of probability laws such that the sum of an independent draw from each law provides a good approximation to the observed probability distribution.

For example, consider the problem of quantifying the activity of the systems in the human brain that control high level mental functions such as language and memory. At the finest scale, the electrical activity of a single neuron in the human cortex depends in a complex way on its precise location in the network of neurons in the brain. However on a coarser scale, it is hoped that a collection of neurons that are involved in a particular type of functional activity will have a distinctive dynamical behavior. If we consider as a simple model that the cortex is an ensemble of spatially interwoven, but functionally homogeneous neural systems, then the dynamical character of neural activity in a region might become a surrogate indicator for the underlying type of mental activity.

This kind of situation is captured in the following model. Let {P_j; j = 1, . . . , d} denote a collection of stationary probability laws normalized so that when z_j ∼ P_j, then E z_j(t) = 0 and Var z_j(t) = 1 for each t. Let Λ = (λ_1, . . . , λ_d) denote a random weight vector, which we will call the vector of intensity weights. For the ith observed time series y^i, let

$$y^i(t) = \sum_{j=1}^{d} \lambda^i_j z^i_j(t), \qquad (2.1)$$

where zji ∼ Pj , independently across j. The independence of the component time series zji over distinct sources represents the central structural restriction of this model, since it allows for the identification of the source distributions Pj . If the zji are also independent across i, then all of the between time series structure is captured in the low dimensional vector of intensity weights Λi. This accomplishes a dimension reduction that can greatly simplify a representation of the joint distribution of the y i .

2.1 The Computational method

Algorithmically, the first step is to identify the sources, which amounts to estimating the parameters of the probability laws P_j. The second step is to resolve each observed time series in terms of the sources. This resolution will be accomplished using the posterior mean of each source, multiplied by a simple consistent estimate λ̂^i_j of the intensity weight λ^i_j:

$$\widehat{\lambda^i_j z^i_j} = \hat{\lambda}^i_j\, E(z^i_j \mid y^i). \qquad (2.2)$$

For this chapter, we assume that the observed time series y^i are stationary. This assumption is not clearly violated in any of our examples, and is valuable from a computational standpoint, because it allows us to make consistent estimates of the covariance structures of each observed time series without referring to the other observations. Alternatively, a non-stationary covariance can be assumed, as long as it is sufficiently restricted to allow it to be estimated with reasonable accuracy from a single time series. The outline of the algorithm follows. Detailed explanations of each step will be given in the following sections.

1. Let M denote the matrix whose rows are the sample autocovariances γ̂^i of the y^i. Let β_1, . . . , β_d denote the right singular vectors of M with largest singular values.

2. Project all of the γ̂^i onto ⟨β_1, . . . , β_d⟩. For simplicity, henceforth refer to the projected autocovariances as γ̂^i.

3. Find d generator vectors g_1, . . . , g_d that generate a cone approximately equal to the cone generated by the γ̂^i. Each g_j estimates the autocovariance of the corresponding P_j.

4. For each i, solve the linear system γ̂^i = Σ_j (λ̂^i_j)^2 g_j. The λ̂^i_j are estimates of the intensity weights.

5. Compute the posterior means ẑ^i_j = λ̂^i_j E(z^i_j | y^i).


2.1.1 Identification of the Sources

Recall that the sources are normalized so that E zj (t) = 0, and V ar zj (t) = 1 for all t. Let γj (`) = Cov(y(t), y(t + `)). Since we will only be working with stationary time series, Cov Pj will always be taken to be the one-sided autocovariance sequence:

$$\Gamma_j = (\gamma_j(0), \gamma_j(1), \ldots). \qquad (2.3)$$

Since we identify each source using its autocovariance, the identification step requires only that we estimate the Γj . The information that we will use for this estimation is the set of sample autocovariances γˆi = Cov y i(t). Even in the Gaussian case, the γˆi will be sufficient only when the y i are independent across i. In the case in which we are specifically interested, the y i are dependent across i. We will see that under mild assumptions, the γˆi can be used to consistently identify the sources even if the y i are dependent. The key fact that allows source identification from the autocovariances is the following relation, which follows from the independence within i of the input time series zji .

$$\mathrm{Cov}(y(t) \mid \Lambda = (\lambda_1, \ldots, \lambda_d)) = \sum_{\ell=1}^{d} \lambda_\ell^2\, \Gamma_\ell \qquad (2.4)$$

Before we consider the estimation problem more formally, consider the following example. Example 2.1: Generate 400 draws from each of the following two moving average sources:


$$z^i_1(t) = 0.8805\,x^i_1(t) - 0.4402\,x^i_1(t-1) + 0.1761\,x^i_1(t-2)$$
$$z^i_2(t) = 0.8111\,x^i_2(t) + 0.4867\,x^i_2(t-1) + 0.3244\,x^i_2(t-2),$$

where the x^i_j(t) are standard normal random variables, independent across i, j, and t. Next, generate λ^i_j independently for i = 1, . . . , 400 and j = 1, 2 as uniform random variables supported on the interval [0, 1]. Finally, let

$$y^i(t) = \lambda^i_1 z^i_1(t) + \lambda^i_2 z^i_2(t). \qquad (2.5)$$

We computed the sample autocovariance functions γˆ i to five lags for each of the simulated time series y i(t). Under mild conditions, each γˆi is unbiased and consistent for the corresponding value in (2.4). This implies that the set of γˆ i should roughly cover the cone spanned by the Γj . We can check that this is the case for example 2.1 by examining figure 2.1. The top left plot in the figure shows a side view of the autocovariance cone, which is obtained by projecting the set of sample autocovariances onto their first two singular vectors. The top right plot of the figure shows a different projection, obtained using the first and third singular vectors. The true autocovariance cone should have zero width along the third eigenvector – with the sampling error in the autocovariance estimates, the width is not zero, but it is still possible to discern that the cone is thinner along the third direction than it is along the second direction. This conclusion is much more obvious in the lower set of plots, where time series of length 1600 rather than 400 are used, so the standard errors of the sampling autocovariance are reduced by a factor of two relative to the top pair of plots.
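As a concrete illustration, the following sketch (not taken from the dissertation; all function and variable names are mine) simulates the two-source model of example 2.1 with numpy and computes the one-sided sample autocovariances used in the rest of the procedure.

```python
# Sketch: simulate example 2.1 and compute sample autocovariances.
import numpy as np

rng = np.random.default_rng(0)

def sample_autocov(y, nlags):
    """One-sided sample autocovariance (gamma(0), ..., gamma(nlags))."""
    y = y - y.mean()
    T = len(y)
    return np.array([np.dot(y[:T - k], y[k:]) / T for k in range(nlags + 1)])

def simulate_example(n=400, T=400, nlags=5):
    psi1 = np.array([0.8805, -0.4402, 0.1761])   # MA filter for source 1
    psi2 = np.array([0.8111, 0.4867, 0.3244])    # MA filter for source 2
    G = np.empty((n, nlags + 1))                 # rows: sample autocovariances
    for i in range(n):
        x1 = rng.standard_normal(T + 2)
        x2 = rng.standard_normal(T + 2)
        z1 = np.convolve(x1, psi1, mode="valid") # MA(2) draws from each source
        z2 = np.convolve(x2, psi2, mode="valid")
        lam = rng.uniform(0.0, 1.0, size=2)      # intensity weights, U[0, 1]
        y = lam[0] * z1 + lam[1] * z2
        G[i] = sample_autocov(y, nlags)
    return G

G = simulate_example()
print(G.shape)   # (400, 6)
```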


Figure 2.1: Left top: view of autocovariance cone for example 2.1 projected onto e1 and e2 , right top: view of autocovariance cone for example 2.1 projected onto e1 and e3 . Left bottom: same as left top, but with time series length 1600, right bottom: same as right top, but with time series length 1600.


2.1.1.1 Dimension Reduction

Let p denote the number of time series lags that are to be used for the source identification. Generally we will have p > d, so that the span of {Γ_1, . . . , Γ_d} will lie in a proper subspace S of R^p. We begin by identifying the subspace S. Since the expected values of the sample autocovariances lie in the span of the Γ_j, this subspace can be consistently identified by applying any of a number of dimension reduction procedures to the set of sample autocovariances. In the above outline, the singular value decomposition was used. Below, the SVD as well as two other procedures that may be substituted for it are described. Note that it may be useful to consider normalizing the γ̂^i to have length 1 before applying any of these procedures.

• Let M denote the matrix whose rows are the γ̂^i. Use the singular value decomposition to write M = ADB′, where A and B are orthogonal, and D is diagonal. Assume that the diagonal elements of D decrease from top to bottom. Use the left-most columns of B as a basis for S. The number of basis elements to use can be roughly estimated using the magnitude of the diagonal elements of D.

• Use the mean of the γ̂^i as the first basis vector for S. Let M denote the matrix whose rows are the γ̂^i projected onto the complement of their mean. Apply the SVD to M, and use the left-most columns of B together with the mean as a basis for S.

• Use the mean of the γ̂^i as the first basis vector for S. Project each γ̂^i onto the complement of the mean, and then use the mean of the projections as the second basis vector. Iterate through these two steps until a complete basis is found. The number of basis vectors to use can be roughly estimated using the residual sum of squares following each projection.

Once the subspace S is found, the γ̂^i are projected onto S. Only the projections are referenced for the rest of the procedure. For simplicity, we will henceforth refer to the projections as γ̂^i.

Example 2.1 (continued): The orthogonal basis for the autocovariance cone in example 2.1 is computed using the SVD. The singular values (λ) show a clear 2-dimensional structure:

    Eigenvector                      λ
     0.9765   0.0530   0.2088     10.7
    -0.0715   0.9941   0.0820      3.0
    -0.2032  -0.0950   0.9745      0.5

The theoretical autocovariances for the two sources are

    Γ_1 = (1.00, −0.47, 0.16)′
    Γ_2 = (1.00, 0.55, 0.26)′.

Note that the bisector of the theoretical autocovariances, (Γ_1 + Γ_2)/2 = (1.00, 0.04, 0.21)′, is very close to the first singular vector. This should be the case whenever the intensity weight distribution is roughly symmetric. In the same way, the difference between the theoretical autocovariances, Γ_1 − Γ_2 = (0.00, 1.02, 0.11)′, should be close to the second singular vector.
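A minimal sketch of the SVD option described in the first bullet above, assuming `G` is the n × (p + 1) array of sample autocovariances produced by the earlier sketch; the names are mine.

```python
# Sketch: subspace identification via the SVD of the autocovariance matrix.
import numpy as np

def autocov_subspace(G, d):
    """Orthonormal basis (columns) for the estimated span of the source
    autocovariances, plus the singular values for inspection."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    basis = Vt[:d].T          # right singular vectors with largest singular values
    return basis, s

def project(G, basis):
    """Coordinates of each sample autocovariance in the reduced subspace."""
    return G @ basis

# basis, s = autocov_subspace(G, d=2)
# coords = project(G, basis)   # n x d coordinates used in the later steps
```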


2.1.1.2 The general procedure

For this chapter, we assume that as the sample size grows to infinity, then with probability one, for each j we observe a subsequence Λ^{i_j} such that

$$\frac{\max_{j' \neq j} \lambda^{i_j}_{j'}}{\lambda^{i_j}_{j}} \to 0. \qquad (2.6)$$

This condition is imposed to simplify the source identification, although it will be shown later to be stronger than what is needed. Condition (2.6) will be satisfied, for instance, if the components of λ are independent, and each component has a marginal density with mass at the origin. As long as condition (2.6) is satisfied, then for a large enough sample, we expect to have a few ‘pure’ observations for which Λ is nearly proportional to a coordinate vector e_j. A sequence approaching a pure observation from the jth source can be used to consistently identify the parameters for P_j.

Let S denote the convex cone in R^p generated by the Γ_j, and let R_n denote the random cone generated by the values in (2.4) for a sample of n time series. Note that we do not have access to R_n, since we do not observe the intensity weights. Under the assumption given by (2.6), the boundary of R_n converges pointwise to the boundary of S. If we could observe the cone R_n directly, then the Γ_j could be consistently identified by applying the following algorithm:

1. Let R̃_n = {v/v(0) − e_1 | v ∈ R_n, v ≠ 0}. Note that if {g_j | j ∈ I} generate the convex hull of R̃_n, then {g_j + e_1 | j ∈ I} generate the convex hull of R_n.

2. Let {g_j; j = 1, . . . , k} denote a set of generators for the convex hull of R̃_n. These generators can be computed in n log n time by using a convex hull algorithm such as the Jarvis march, or Graham's scan.

3. For each j, let V_j denote the volume of the convex set generated by {g_i | i ≠ j}.

4. Reorder the g_j so that V_1 < . . . < V_k. Then e_1 + g_{k−d+1}, . . . , e_1 + g_k are consistent for Γ_1, . . . , Γ_d.

As stated above, in practice we do not observe the cone generated by the vectors in (2.4). Rather, we observe a noisy version of the cone that is generated by the γ̂^i. For sufficiently long time series, the result of applying the above procedure to the γ̂^i should give adequate results. It will be consistent if both the number of time series, and the length of each time series, grows arbitrarily large. If we apply the above algorithm to the γ̂^i for a very large sample of moderately long time series, then we will expect to overestimate the cone, since variation in the sample autocovariances will eventually lead to the appearance of points that lie outside S. On the other hand, we will expect to underestimate the cone for small samples, since nearly ‘pure’ observations from each of the sources might not be obtained. Optimistically, we might hope that for samples of moderate size these two effects will compensate to some degree.

When there are only two sources, the above algorithm degenerates to a substantially simpler procedure. In this situation, once the supporting 2-plane has been identified in the dimension reduction step, there is only one degree of freedom in the location of the generators. This degree of freedom can be taken to be the angle relative to an arbitrary vector in ⟨G_1, G_2⟩. We use the normalized mean

$$\bar{\gamma} = \frac{\sum_i \hat{\gamma}^i}{\left\|\sum_i \hat{\gamma}^i\right\|} \qquad (2.7)$$

as this reference vector, and then compute the relative angles

$$\theta_i = \cos^{-1}\!\left(\frac{\hat{\gamma}^i \cdot \bar{\gamma}}{\|\hat{\gamma}^i\|\,\|\bar{\gamma}\|}\right). \qquad (2.8)$$

The pair consisting of the γ̂^i with the most positive θ_i, and the γ̂^i with the most negative θ_i, are chosen as the estimates of the generators.

It is a simple matter to eliminate the necessity of condition (2.6), as long as we eliminate the requirement that the vertex of the autocovariance cone lie at the origin. Suppose that we have d sources, and we have a procedure for fitting a d-generated cone to the set of sample autocovariances, with arbitrary origin ν. Let ĝ_j denote the estimated set of generators, and ν̂ denote the estimated origin. Writing ν̂ = Σ_j c_j ĝ_j, we can conclude that ĝ_j estimates Γ_j, and that c_j estimates the minimum of the support of the distribution of λ_j.

Example 2.1 (continued): The pair of projected sample autocovariances having the widest angle are

    Γ̂_1 = (1.0, −0.51, 0.16)′
    Γ̂_2 = (1.0, 0.61, 0.27)′.

The true intensity weights for these two curves are:

    Γ_1:  λ_1 = 0.82, λ_2 = 0.03
    Γ_2:  λ_1 = 0.02, λ_2 = 0.40

This shows that we have succeeded in identifying an approximately pure observation from each of the two sources. The raw data for these two observations are shown in figure 2.2. It is easy to see the marked difference between the dynamic characters of these two signals.
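The two-source boundary search described above can be sketched as follows. Rather than the arccosine in (2.8), signed angles relative to the normalized mean direction are computed with atan2 within the projected 2-plane; `G` and `coords` are assumed to come from the previous sketches, and all names are mine.

```python
# Sketch: two-source generator estimates as the extreme-angle observations.
import numpy as np

def two_source_generators(G, coords):
    m = coords.sum(axis=0)
    m = m / np.linalg.norm(m)                    # normalized mean direction (2.7)
    perp = np.array([-m[1], m[0]])               # direction orthogonal to the mean
    theta = np.arctan2(coords @ perp, coords @ m)  # signed relative angles
    i_lo, i_hi = np.argmin(theta), np.argmax(theta)
    g1 = G[i_hi] / G[i_hi, 0]                    # boundary autocovariances, scaled
    g2 = G[i_lo] / G[i_lo, 0]                    #   so that the lag-0 value is 1
    return g1, g2

# g1, g2 = two_source_generators(G, coords)
```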


Figure 2.2: Raw data sequences corresponding to the boundaries of the 2-dimensional autocovariance cone for example 2.1. (Panels: Boundary 1, Boundary 2.)


2.1.2 Estimation of the intensity weights

For the examples in this chapter, we used a simple, second-moment based procedure for estimating the intensity weights. The equality

$$E(\hat{\gamma}^i \mid \Lambda^i) = \mathrm{Cov}(y^i(t) \mid \Lambda^i) \qquad (2.9)$$

motivates substituting γˆ i into the left side of (2.4), and the estimated generators gj into the right side of (2.4). This gives the method of moments equation

$$\hat{\gamma}^i = \sum_j (\lambda^i_j)^2\, \hat{\Gamma}_j. \qquad (2.10)$$

Thus a set of simple estimates λ̂^i_j of the intensity weights can be obtained very rapidly by solving a small linear system for each observation.

Example 2.1 (continued): One of the 200 time series in the collection had sample autocovariance (0.22, −0.09, 0.02). Regressing this on the estimated source autocovariances

    Γ̂_1 = (1.00, −0.51, 0.16)′
    Γ̂_2 = (1.00, 0.61, 0.27)′

gives estimated intensity weights of λ_1 = 0.45, λ_2 = 0.13. It turns out that the true intensity weights for this observation were λ_1 = 0.48 and λ_2 = 0.08.
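A minimal sketch of the method of moments step (2.10), assuming `G` holds the sample autocovariances and `generators` the estimated source autocovariances. The clipping of negative solutions to zero is my own choice, not something prescribed by the text.

```python
# Sketch: squared intensity weights by regressing each sample autocovariance
# on the estimated source autocovariances.
import numpy as np

def intensity_weights(G, generators):
    """G: n x (p+1) sample autocovariances; generators: list of estimated
    source autocovariances.  Returns an n x d array of estimated weights."""
    X = np.column_stack(generators)               # (p+1) x d design matrix
    sq, *_ = np.linalg.lstsq(X, G.T, rcond=None)  # squared weights, shape d x n
    sq = np.clip(sq, 0.0, None)                   # negative solutions set to 0
    return np.sqrt(sq).T

# lam_hat = intensity_weights(G, [g1, g2])
```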


As long as the time series are long enough for the covariance estimates to be stable, this procedure has worked well in practice. If the time series are very short, say with 60 time points, then the following more complex algorithm may be used (a sketch of the iteration is given after the list).

1. Compute the fast estimates of the intensity weights by solving the linear method of moments equation, as described above. This will provide a tentative value Λ̂ for the intensity weight vector.

2. Use Λ̂ to carry out the posterior reconstruction ẑ_j = E(z_j | y, Λ̂), as described in the next section.

3. Regress y on the ẑ_j to get an updated estimate of λ. Return to step 2 until a fixed point is reached.
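A sketch of this fixed-point refinement; `posterior_mean` stands in for the reconstruction routine of the next section and is assumed, not defined here.

```python
# Sketch: iterate posterior reconstruction and regression to a fixed point.
import numpy as np

def refine_weights(y, lam0, posterior_mean, n_iter=20, tol=1e-6):
    lam = np.asarray(lam0, dtype=float)
    for _ in range(n_iter):
        Z = posterior_mean(y, lam)        # columns: E(z_j | y, lam), one per source
        new_lam, *_ = np.linalg.lstsq(Z, y, rcond=None)  # regress y on the z-hats
        if np.max(np.abs(new_lam - lam)) < tol:          # stop at a fixed point
            return new_lam
        lam = new_lam
    return lam
```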

2.1.3 Posterior Reconstruction

The algorithm used in the posterior reconstruction depends on the distributional assumption made on the sources Pj . For the work in this chapter, we looked at three different possibilities for this distribution – Gaussian moving average, Gaussian autoregression, and discrete moving average. These three distributions were chosen because dynamic programming algorithms are available for rapidly computing the posterior means.

2.1.3.1 Gaussian Moving Average or Autoregression

If a stationary Gaussian distribution is used for the P_j, then we will assume that it is either a moving average or an autoregression. The selection between these two parameterizations can be made by inspecting the estimate of Γ_j that was obtained previously – if it appears to have finite support, then one should choose a moving average parameterization, while if it decreases more slowly, choose an autoregression. The posterior distribution in either case can be computed directly using a regression computation, or sequentially by using the Kalman filter. The Kalman filter will require much less computation, since it scales linearly with increasing time series length T.

In order to apply either posterior computation procedure, we will need access to the probability densities for the Gaussian AR and MA processes, which are usually expressed in terms of filters, rather than autocovariances. Therefore we must convert the autocovariance estimates Γ̂_j obtained previously to filter estimates ψ̂_j. For the AR process, we write

$$z_j(t) = \sum_{\ell=1}^{p} \psi_j(\ell)\, z_j(t-\ell) + \epsilon_j(t) \qquad (2.11)$$

while for the MA process, we have

$$z_j(t) = \sum_{\ell=0}^{p-1} \psi_j(\ell)\, \epsilon_j(t-\ell). \qquad (2.12)$$

In both the AR and MA cases, ε_j(t) denotes white noise. Recall that the identification condition for the intensity weights requires that Var z_j(t) = 1 for each t. To achieve this in the MA case, we require that ‖ψ_j‖ = 1 and Var ε_j(t) = 1 for each t. In the AR case, the variance of ε_j(t) is set to the value that implies unit variance for each term of the process z_j. This value can be computed as follows (a sketch of the computation is given after the list).

1. Write γ_ℓ = Cov(z_j(t), z_j(t − ℓ)).

2. Multiply both sides of (2.11) by z_j(t − t_0), and take the expectation to get the relation

$$\gamma_{t_0} = \sum_{\ell=1}^{p} \psi_j(\ell)\, \gamma_{|t_0 - \ell|} + \sigma^2 I(t_0 = 0).$$

3. Tentatively set σ² = 1, and solve the system of equations implied by the above relation for any p values of γ_ℓ that include ℓ = 0. Denote the solution components by γ̂_ℓ.

4. The variance of the AR white noise should be set to γ̂_0^{-1}.
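The computation in steps 1–4 can be sketched as follows, assuming a stationary AR process; the construction of the linear system and all names are mine.

```python
# Sketch: innovation variance that gives unit marginal variance for an AR(p).
import numpy as np

def ar_innovation_variance(psi):
    """psi: AR filter coefficients.  Returns the white noise variance giving
    Var z_j(t) = 1."""
    p = len(psi)
    A = np.eye(p + 1)                      # unknowns: gamma_0, ..., gamma_p
    for t0 in range(p + 1):
        for ell, coef in enumerate(psi, start=1):
            A[t0, abs(t0 - ell)] -= coef   # gamma_t0 - sum psi_l gamma_|t0-l| = I(t0=0)
    b = np.zeros(p + 1)
    b[0] = 1.0                             # tentatively sigma^2 = 1
    gamma = np.linalg.solve(A, b)          # autocovariances under unit innovations
    return 1.0 / gamma[0]                  # rescaled innovation variance

# ar_innovation_variance([0.5]) returns 0.75 for an AR(1) with coefficient 0.5
```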

In the AR case, the filter can be estimated very simply by regressing the value observed at each time point on the predictor vector formed from the values of the previous p time points. For the MA case, the estimation solution that we employed is to use the method of moments, which reduces to a well-known algorithm that involves factorizing the autocovariance generating function.

Once the filter estimates are available, we express the model in state-space form in order to simplify the application of the Kalman filter. Let

$$y(t) = Z' x(t) + \epsilon(t) \qquad (2.13)$$
$$x(t) = F\, x(t-1) + \zeta(t), \qquad (2.14)$$

where ε(t) ∼ N(0, σ²), and ζ(t) ∼ N(0, Ψ). For our application, we will have σ² = 0, and x(t) will be a concatenation of blocks containing p lagged values from each of the k sources:

$$x(t) = (z_1(t), \ldots, z_1(t-p+1), \ldots, z_k(t), \ldots, z_k(t-p+1)) \qquad (2.15)$$

The transition matrix F will be block diagonal, with blocks of shape p × p. Let F_j be the block for the jth process. If this process is an autoregression, then F_j has the following structure:

$$F_j = \begin{pmatrix} \psi_j' \\ B \;\; 0 \end{pmatrix}, \qquad (2.16)$$

where B is the back shift operator. On the other hand, if the j th process is a moving average, then Fj is simply a back shift. Similarly, Z will be the concatenation of k p-vectors,

$$Z' = (Z_1', \ldots, Z_k') \qquad (2.17)$$

where Z_j is (λ^i_j, 0, . . . , 0)′ if the jth source is parameterized as an autoregression, or Z_j is λ^i_j ψ̂_j′ if the jth block is parameterized as a moving average. Finally, Ψ will also be constructed in block-diagonal form from p × p blocks Ψ_1, . . . , Ψ_k, where Ψ_j is 0 except in the (1, 1) position. Ψ_j(1, 1) will equal 1 for the blocks corresponding to moving average processes, while for autoregressions it will equal the value computed previously that gives the AR process unit variance.

Once the state-space parameters for the Kalman filter are available, it is a simple matter to carry out the posterior computation. Roughly, it involves making a forward pass through the data to calculate the forward conditional means and variances:

$$\alpha_{t|t} = E(x_t \mid y_1, \ldots, y_t), \qquad V_{t|t} = \mathrm{Var}(x_t \mid y_1, \ldots, y_t)$$

and then making a backward pass through the data to get the full posterior means and variances:

$$\alpha_{t|T} = E(x_t \mid y_1, \ldots, y_T), \qquad V_{t|T} = \mathrm{Var}(x_t \mid y_1, \ldots, y_T)$$

The details of this procedure are available from many references on Kalman filtering (for instance [17]) and will not be given here.

Example 2.1 (continued): Figures 2.3 and 2.4 show the result of carrying out the posterior reconstruction for two of the 400 time series in example 2.1.
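Before turning to those figures, the following sketch shows one way the block-diagonal state-space quantities F, Z, and Ψ described above might be assembled for a single observed series; the encoding of the source list and all names are my own, not the dissertation's.

```python
# Sketch: assemble state-space matrices for a mix of AR and MA sources.
# Each source is ("ar", psi, innov_var, lam) or ("ma", psi, lam); p is the block size.
import numpy as np

def build_state_space(sources, p):
    k = len(sources)
    F = np.zeros((k * p, k * p))
    Z = np.zeros(k * p)
    Psi = np.zeros((k * p, k * p))
    for j, src in enumerate(sources):
        sl = slice(j * p, (j + 1) * p)
        Fj = np.zeros((p, p))
        Fj[1:, :-1] = np.eye(p - 1)              # back shift of the lagged values
        if src[0] == "ar":
            kind, psi, innov_var, lam = src
            Fj[0, :len(psi)] = psi               # first row carries the AR filter
            Zj = np.zeros(p); Zj[0] = lam        # observation picks the current value
            Psi[j * p, j * p] = innov_var        # variance giving unit Var z_j(t)
        else:                                    # moving average block
            kind, psi, lam = src
            Zj = np.zeros(p); Zj[:len(psi)] = lam * np.asarray(psi)
            Psi[j * p, j * p] = 1.0
        F[sl, sl] = Fj
        Z[sl] = Zj
    return F, Z, Psi   # y(t) = Z' x(t), x(t) = F x(t-1) + zeta(t), zeta ~ N(0, Psi)
```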


Figure 2.3: True and reconstructed series from example 2.1. (Panels: observed process and two reconstructed MA components; estimated intensity weights 0.5899, 0.6040; true values 0.5624, 0.5342.)

Figure 2.4: True and reconstructed series from example 2.1. (Panels: observed process and two reconstructed MA components; estimated intensity weights 0.3460, 0.3057; true values 0.3705, 0.2963.)

2.2 Application to fMRI Series

2.2.1 fMRI background

We have applied our procedure to a time series of images generated by the imaging technique known as functional magnetic resonance imaging (fMRI). This technique is an extension of the more standard MRI procedure that is used to generate static images. The basic idea behind MRI is that protons from water molecules are induced to spin in unison by applying an external magnetic field with rapidly reversing polarity. When this magnetic field is removed, the protons will return to their equilibrium state at a rate that is determined by the electrostatic properties of their local environments. Since this electrostatic environment will vary across different tissue types, an image can be generated with the brightness at each pixel determined by the measured relaxation rate in the pixel. Contrast in the image will then correspond to tissue type differences within the region being imaged. A major problem for functional MRI is the difficulty in evoking a signal which tracks well with the dynamic processes of scientific interest. Scientific interest rests primarily with the electrical activity in the neurons, which can not be directly observed by any variant of the MRI procedure. For many years it has been known that the electrical behavior of the neurons is correlated with changes in regional cerebral blood flow (rCBF). These blood flow (vascular) changes are temporally delayed relative to the neural firing, a confounding factor known as hemodynamic lag. Since the hemodynamic lag varies in a complex way from tissue to tissue, it is not possible to completely recover the electrical process from the vascular process. Nevertheless, the vascular process remains an informative surrogate for electrical activity.


Since heavy neural firing in a particular region will lead to an increase in rCBF, it is natural to try to collect images that have contrast deriving from changes in local blood flow. Two key observations make this possible [6]: • Oxygenated hemoglobin has a slightly different magnetic susceptibility compared to unoxygenated hemoglobin. • For a period of time following activation, fresh blood is delivered to activated areas in excess of the demand. The underlying explanation for the second point, and the relative quantity of excess blood that is delivered are not well understood. In any case, an excess of oxygenated blood will collect in areas receiving heavy use relative to other areas. Therefore, from the first point, the local electrostatic environment will change as a response to heavy neuronal firing (spins in the activated areas will relax slightly more slowly). This leads to temporal contrast arising in the image that is related to blood flow changes, and ultimately to neuronal firing. The exploitation of this effect for functional MRI is known as BOLD [15], for Blood Oxygenation Level Dependent imaging. While it is not the only method that is used for acquiring fMRI images with useful levels of temporal contrast, it is currently the most popular. Nevertheless, it should be recognized that it is only through a chain of poorly-understood links that fMRI raw data obtained with the BOLD technique are related to the underlying physiology of scientific interest.

2.2.2 Application of the procedure

We applied our procedure to an image series provided by Professor Mark Cohen, of the UCLA Neuroscience Department. This series consisted of 108 images recorded at 1 second intervals. Each image is comprised of 16 axial slices obtained at resolution 64 × 64 with 16 bit depth. The mean image from this series is shown in figure 2.5.

Figure 2.5: Mean image for fMRI series. The 108 second duration of the experiment is subdivided into 9 intervals of 12 second duration each. During intervals 1, 3, 5, 7, and 9, no external stimulus was presented. During intervals 2 and 6 a visual stimulus was presented, and during intervals 4 and 8 motor activity was elicited from the subject’s fingers. We will not be immediately concerned with identifying the response to the stimulus, a process known as activation mapping. A number of procedures for activation mapping are available, most of which involve projecting the raw data using a contrast function that measures differences between the stimulus and rest periods. The effectiveness of such an approach will be highly dependent on the error structure of the data. Our goal is to reach a better understanding of this structure before applying contrast-measuring projections. Two ways that our method could be


combined with the projection approach would be: • Identify the activated regions using any of the currently-popular techniques, then apply our method separately to the activated and non-activated regions. Differences in the source probability structures may be informative about the activation dynamics. • Apply our method to the entire image series, and then carry out the projection-based activation mapping separately to the series of images for each component. Both of these approaches will be taken up in the future, but they will not be mentioned again in this dissertation. The sample autocovariances were truncated to 10 lags, a value which was chosen visually to support most of the mass in the estimated sequences. The singular values for the set of autocovariances in the slices in the second row of figure 2.5 are given in table 2.1. The most notable feature in this table is that none of the singular values are close to zero, which suggests that a large number of processes with widely varying autocovariances contribute to the observed signals. Nevertheless, a pattern in the variation can be detected – the drop from the second to the third singular value is always at least a factor of 2, while the total drop from the third to the tenth singular value never approaches a factor of 2 for any slice. This suggests that there is a 3-dimensional structure to the autocovariances that is buried in a large amount of rotationally symmetric noise. To further explore the structure of the autocovariance cone, we plotted a 2dimensional cross section of it. Using a plane that is parallel to the second and third eigenvectors, we sliced through the cone 1 unit above the vertex along the


first eigenvector. Each sample autocovariance vector was scaled so that the head of the vector was in the slicing plane. Figure 2.6 shows this plot for slice (2, 1) – the other three slices have similar cross sections.

    Singular values    (2, 1)     (2, 2)     (2, 3)     (2, 4)
                      24684.2    24564.8    28201.6    31445.1
                       7384.1     7349.5     8517.9    10552.7
                       2789.8     3127.8     3636.6     3908.1
                       1377.8     1286.1     1528.5     1876.0
                       1043.0     1050.4     1147.8     1397.8
                       1017.4     1009.8     1093.8     1199.2
                        942.2     1064.9     1164.8      977.2
                        901.4      923.1     1042.0      893.0
                        861.6      902.0      983.3      856.0
                        828.5      865.6      941.4      830.2

Table 2.1: Eigenvalues for the autocovariance cones of four axial fMRI slices.

A clear triangular pattern can be discerned from figure 2.6. From the standpoint of the model discussed in this chapter, this indicates that three source distributions are contributing to each of the observed signals. To gain further insight into this hypothesis, we used a three-dimensional rotating scatterplot of the autocovariance cone projected onto the first three eigenvectors. Two views of this rotating scatterplot are especially informative. On the left plot in figure 2.7, the upper edge corresponds to a single edge of the three dimensional cone, while the lower edge corresponds to a face of the 3 dimensional cone viewed from the side (so that two edges of the three dimensional cone coincide in the projection). The right plot in figure 2.7 shows the sample autocovariances for a set of points selected from the upper edge. These autocovariances clearly imply that the upper edge corresponds to a white noise source.


Figure 2.6: Cross section of the autocovariance cone.

Figure 2.7: Left: a two dimensional projection of the sample autocovariance cone. Right: the sample autocovariances for the set of points selected from the upper boundary of the cone.


A different angle of the cone is informative about the other two sources. The cone shown in the left plot above is rotated 90 degrees around its principal axis. This causes the upper edge from figure 2.7 to become occluded by the rest of the cone, while the lower edge from figure 2.7 resolves into the upper and lower edges of the cone seen in figure 2.8.

Figure 2.8: A view of the main three dimensions of the fMRI autocovariance cone. Next we selected a set of points from the upper edge of figure 2.8, and plotted their autocovariances in the lower left plot shown below. Similarly, a set of points from the lower edge of figure 2.8 are selected, and their autocovariances are shown in the lower right plot in figure 2.9. From figure 2.9, we see that in addition to the white noise source that was identified previously, there is a source whose autocovariance drops rapidly after one lag and then stays flat indefinitely, and a source whose autocovariance drops monotonically to zero in about 10 lags. The estimated autocovariances of these sources are shown in table 2.2. In order to carry out the posterior reconstruction, we need to specify the distribution for each of these sources. As a first attempt, we will set P1 to be white noise, and P2 and P3 to be Gaussian autoregressions with the filters given in table 2.3.


Figure 2.9: Right side, several autocovariances from points on the upper-left edge of the cone. Left side, several autocovariances from points on the lower-right edge of the cone.

      Γ̂_1       Γ̂_2       Γ̂_3
    566.18    467.32    365.01
    -35.16    252.90    193.11
    -13.58    236.34    166.76
     -6.70    230.82    133.98
      3.89    220.12     65.19
    -54.63    223.00     62.31
    -17.96    246.74     -3.49
    -45.69    232.39    -24.37
     23.86    247.45    -55.31
    -61.62    218.87    -89.84

Table 2.2: Estimated source autocovariances for the 3 component fMRI model.


      P_2        P_3
    0.1935     0.3609
    0.0984     0.1877
    0.0971     0.1405
    0.0306    -0.1294
    0.0577     0.1333
    0.1519    -0.1242
    0.0537     0.0271
    0.1514    -0.0370
    0.0065    -0.0318
    0.0632    -0.1754

Table 2.3: AR filters for the non-white processes in the 3 component fMRI model.


The three images in figure 2.10 show the spatial distribution for the intensity weights of each of the three processes. From these images, we see that the white noise process is nearly spatially homogeneous, with the exception of two small symmetrically located areas with very high white noise loadings. The other two processes are markedly inhomogeneous, although a simple pattern is difficult to discern. In the future it may be possible to use smoothing in conjunction with edge detection to reveal a spatial pattern. It is also seen that there are large weights for P2 and P3 in certain areas of the skull (bone tissue), which may be accountable to movement, but is scientifically unimportant in any case.

Figure 2.10: Spatial distribution of intensity weights. Left image: P1 (white noise), center image: P2 , right image: P3 . Posterior reconstructions for two sequences using these distributions are shown in figures 2.11 and 2.12. An interesting feature that will be a subject of future work is the periodicity that often appears in the reconstruction from the third source, P3 . This periodicity should be expected from the periodic structure of the filter estimate given above. An interesting possibility that we have not yet explored would be to allow the sources for this image series to be something other than a pure MA or AR law. In particular, source P2 does not appear to have the autocovariance structure


of an AR law, even though for simplicity we are presently modeling it as such. As is well-known, the autocovariance for an AR model decreases exponentially. Since Γ̂_2 drops very far at the first lag, but then is level through the next 9 lags, it is unlikely to correspond to a pure AR process. We are currently exploring modeling source P_2 as an independent sum of a white noise process and a unit root AR process. The white noise will account for a portion of the autocovariance at lag 0, so that the autocovariance for the unit root AR component will be characteristically flat across all lags.


Figure 2.11: A reconstructed sequence from the 3 source fMRI model. (Panels: observed process, one MA and two AR components; estimated intensity weights 8.1514, 7.1979, 8.0070.)

Figure 2.12: A reconstructed sequence from the 3 source fMRI model. (Panels: observed process, one MA and two AR components; estimated intensity weights 12.0122, 3.9916, 2.9048.)

2.3 U.S. unemployment rates

Regional U.S. unemployment rates provide another collection of multiple time series that can be used to explore our method. For each of the 50 American states, we recorded the monthly unemployment rate for the 133 months between January 1988 and February 1999. These data are available from the www site of the US Bureau of Labor Statistics (www.bls.gov). We obtained the data in a seasonally adjusted form. The details of the procedure for seasonal adjustment are explained on the BLS website, but essentially involve removing the seasonal trend separately for each major industry division, and then aggregating over these divisions.

Seasonal adjustment removes one source of non-stationarity in the unemployment series. Additional non-stationarity results from the inherently integrated nature of unemployment rates, as well as from the presence of a long term national trend. To mitigate these effects, we first differenced each series once, which results in series of monthly net job loss or job creation in each region. Then we subtracted the national mean from each of the 50 state-level series.

The eigenvalues (λ) for the set of 50 10-lag autocovariances are shown in table 2.4. These values strongly suggest that only a single source is present in the state level unemployment rate series. Figure 2.13, which shows the projection of the 10-lag autocovariances onto the first two eigenvectors, provides further evidence for this conclusion (notice that the cone is very narrow). While the conclusion that a one source model is appropriate for the unemployment rates does not allow us to carry out the reconstruction stage of our algorithm, it is still a potentially useful observation that might easily be overlooked. One immediate application of this observation is that if a Gaussian model


      λ
    0.2306
    0.0530
    0.0341
    0.0235
    0.0210
    0.0163
    0.0133
    0.0113
    0.0109
    0.0082

Table 2.4: Eigenvalues for US unemployment rate autocovariances.

Figure 2.13: The unemployment rate autocovariances projected onto their first two eigenvectors. (Each point is labeled with its state abbreviation.)


is used, then all of the states can be pooled together for purposes of parameter estimation. Rather than reserving a full set of parameters for each state, a common set of parameters can be used for all of the states, with a single state-varying degree of freedom reserved to account for the length differences among the autocovariance vectors.

2.4 Non-Gaussian alternatives

At the level of probability models, the decomposition in terms of independent components is represented as a deconvolution. Such an operation in general requires full knowledge of the probability law that we are attempting to decompose. For purposes of computational speed and stability however, we have used only the second moments to carry out the identification of the components. This restriction leads to fast algorithms that can be rapidly executed on large data sets, such as arise in fMRI and other remote sensing applications. To summarize the reason that our procedure will scale well to large data sets:

• At the stage of the algorithm when all of the time series are considered jointly, only the low dimensional autocovariance vectors are used for computation.

• At the stage of the algorithm when computations must be carried out on the time series in their raw form, a separate computational branch can be created for each time series, and there is no need for any exchange of information between the branches.

An approach based on second moments naturally corresponds to models having Gaussian error structure. As stated above, our procedure for identifying the


sources which uses only the sample autocovariances can be efficient only when the y^i(t) are independent and have a quadratic-exponential distribution such as the Gaussian. However, although we restrict to using covariance information at the stage where we are identifying the sources, the posterior signal separation can be carried out for any distribution, as long as there is a good algorithm available for computing the posterior means. The tradeoff of restricting to covariance information at the stage of source identification is that non-Gaussian sources with similar covariance structures may be difficult to separate using only second order information. One possible direction for problems of this type would be to apply our procedure to the cone generated by higher order cumulants, or the power spectra. Since both of these functionals will be additive over independent components, the basic structure of the algorithm will not need to be changed.

As a non-Gaussian alternative, we have implemented a discrete moving average distribution for the sources P_j. Discrete moving averages have been written about in a number of places, especially in the context of digital blind deconvolution [13], [12], [9]. The simplest formulation of the discrete moving average is in terms of an unobserved discrete iid sequence ε(t). For each t, ε(t) has probability 1 of lying among a finite set of support points {s_1, . . . , s_k}. The functional relationship between the hidden inputs and the realized process is identical to the case of the Gaussian moving average, as given in (2.12).

The dynamic programming algorithm that is used for constructing the posterior distribution in the discrete moving average case is known variously as the forward-backward algorithm, or as Baum's algorithm. This procedure was developed to handle a generic class of models for sequential data known as hidden Markov models, or HMMs. The general formulation of an HMM involves


an unobserved Markov chain x(t), with law π(x(t) | x(t − 1)), and a distribution P(y(t) | x(t)), which specifies the way that the observed data y(t) is related to the hidden state x(t). The key restriction is that given x(t), y(t) is independent of x(t − 1), x(t − 2), . . .. This restriction, together with the Markovian assumption on x(t), guarantees that the marginal posterior P(x(t) | y(1), . . . , y(T)) is also Markovian, and that it can be computed with complexity T k^{p+1}, where T is the length of the time series, k is the number of hidden states, and p is the length of the filter used to generate the moving average. Detailed descriptions of the algorithm are now widely available (for example, [2], [7]), so one will not be given here.
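A generic sketch of the forward–backward recursions is given below. It works with an explicit transition matrix and emission likelihoods, so applying it to the discrete moving average requires encoding the last p inputs as a single state (giving k^p states), which is not shown here; all names are mine.

```python
# Sketch: scaled forward-backward smoothing for a discrete-state HMM.
import numpy as np

def forward_backward(pi0, P, emit):
    """pi0: initial state probabilities (S,); P: transition matrix (S, S);
    emit: emission likelihoods per time point (T, S).
    Returns the marginal posteriors gamma (T, S)."""
    T, S = emit.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi0 * emit[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                       # forward pass
        alpha[t] = emit[t] * (alpha[t - 1] @ P)
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = P @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # P(x_t | y_1, ..., y_T)
    return gamma
```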

Example 2.2. Generate 400 draws of length 200 from each of the following two digital moving average sources:

$$z^i_1(t) = 0.8805\,x^i_1(t) - 0.4402\,x^i_1(t-1) + 0.1761\,x^i_1(t-2)$$
$$z^i_2(t) = 0.8111\,x^i_2(t) + 0.4867\,x^i_2(t-1) + 0.3244\,x^i_2(t-2),$$

where the x^i_j(t) are distributed independently and identically across i, j, and t. The distribution of each x^i_j(t) is supported on the set {0, 1}, with equal probability at each of the support points. The first four steps of the algorithm are the same as for the Gaussian case. The final step, in which the posterior reconstruction is carried out, uses the forward-backward algorithm. The eigenvectors of the set of sample autocovariances are given in table 2.5, along with the corresponding eigenvalue λ.


    Eigenvector                       λ
     0.9767   0.1026  -0.1887     149.9
    -0.1261   0.9851  -0.1166      61.1
     0.1739   0.1377   0.9751       4.9

Table 2.5: Eigenvectors and eigenvalues for discrete moving average.

The estimated boundaries of the autocovariance cone are:

    Γ̂_1 = (1.00, −0.51, 0.13)
    Γ̂_2 = (1.00, 0.61, 0.27)

The true boundaries of the autocovariance cone are:

    Γ_1 = (1.00, −0.47, 0.16)
    Γ_2 = (1.00, 0.55, 0.26)

The filter estimates, obtained by applying the method of moments to the Γ̂_j, are:

    ψ̂_1 = (0.85, −0.50, 0.16)
    ψ̂_2 = (0.76, 0.55, 0.35).

Figures 2.14 and 2.15 show the posterior reconstructions for two of the time series in the collection.


Figure 2.14: Posterior reconstruction for a single time series from example 2.2. (Panels: observed process and two reconstructed MA components; estimated intensity weights 2.2719, 0.4673; true values 2.2705, 0.2154.)

Figure 2.15: Posterior reconstruction for a single time series from example 2.2. (Panels: observed process and two reconstructed MA components; estimated intensity weights 1.2409, 0.6615; true values 1.1286, 0.7234.)

2.5 Comparison to other decompositions

From the standpoint of probability modeling, the analysis method that we have described constitutes a form of stochastic decomposition, or dimension reduction. On the other hand, viewed algorithmically, what we are doing clearly falls within the wide variety of techniques for multivariate analysis deriving from principal component analysis (PCA). The key difference between different PCA-derived algorithms for time series analysis is the choice of transformed data unit that is used as the input to the PCA procedure. In this section, we will compare three different types of input for the PCA algorithm:

• Raw data, or partially-aggregated raw data.

• Spectral density estimates.

• Autocovariance estimates.

The comparison of these methods will be carried out by considering the class of models for which the output of the PCA procedure is most interpretable.

2.5.1 Raw data as the PCA input

A common framework for investigating the mean structure of multiple time series curves is to express the observations in terms of basis curves {βj (t)|j = 1, . . . , d}. For instance, we may write

$$y^i(t) = \sum_{j=1}^{d} \lambda^i_j \beta_j(t) + \epsilon^i(t), \qquad (2.18)$$

where the λ^i_j are random coefficients, and ε^i(t) is a white noise process that is independent of the coefficients. To express this model in terms of stochastic sources, one can set the jth source P_j to be β_j(t) + ε_j(t), where the ε_j(t) are independent white noise processes. To make it easier to identify the β_j, it is usually assumed that the λ^i_j are independent. In fact, it is not possible to simultaneously identify the basis vectors β_j and the covariance structure of the λ^i_j.

Interestingly, for the model with stochastic sources that is the main subject of this chapter, dependence over j of the λ^i_j within a single Λ^i does not affect our ability to identify the source autocovariances. Since the mean of each source is zero, the following calculation shows that for fixed i, the scaled signal components z̃^i_j = λ^i_j z^i_j remain uncorrelated.

Cov(z̃_j^i, z̃_{j'}^i) = E_{Λ^i} Cov(z̃_j^i, z̃_{j'}^i | Λ^i) + Cov(E(z̃_j^i | Λ^i), E(z̃_{j'}^i | Λ^i)) = 0 + 0.

The uncorrelatedness of the z˜ji preserves the additivity of the squared intensity weights, 2.10, and so does not affect the source identification procedure. At the same time, it may be valuable that there is some dependence among the z˜ji within i, induced through possible dependence of λij within i. This allows the scaled signal components to interact in their overall levels, but not in their small-scale dynamics. In the next three sections, we will briefly describe three algorithms that are useful in investigating this model.

2.5.1.1 SIR

The model given in (2.18) is essentially a regression model, and it is particularly popular in the presence of covariates x^i, so that the mean can be modeled in


terms of the covariates:

E(y^i(t) | x^i) = Σ_{j=1}^d E(λ_{ij} | x^i) β_j(t).        (2.19)

The dimension-reduction procedure known as Sliced Inverse Regression (SIR) [11] can be used to carry out the estimation for this model in the common situation that the link between y and x is not known beforehand. The βj can always be selected so that the λij are uncorrelated given x. Therefore, if we take the variance of (2.19), we get

Var E(y^i(t) | x^i) = Σ_{j=1}^d Var E(λ_{ij} | x^i) β_j(t) β_j(t)'.        (2.20)

This implies that the βj (t) are the first d eigenfunctions of V ar E(y(t)|x). This does not completely identify the βj (t), since the uncorrelatedness condition will be preserved by transformations that are orthogonal with respect to Cov x. To completely identify the basis functions, a second symmetric positive definite matrix must be specified so that the βj (t) will be constrained to be orthogonal with respect to the metric implied by this matrix. The SIR algorithm for estimating the basis functions provides a simple procedure for estimating the covariance (2.20) which is based on slicing the x distribution. Furthermore, SIR specifies a metric with respect to which the singular vectors are computed. Since the covariance estimation is achieved by aggregating the data within bins determined by the slices of the covariate, it can be viewed as an application of PCA to an aggregated refinement of the raw data.
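A minimal sketch of the slicing idea for curve data, assuming NumPy; the number of slices, the simulated covariate, and the simple within-slice averaging are illustrative choices, and the metric adjustment mentioned above is omitted.

```python
import numpy as np

def sir_basis(Y, x, n_slices=10, d=2):
    """Estimate basis curves by the slicing idea: eigenvectors of the
    between-slice covariance of the curve means.  Y is (n, T), x is (n,)."""
    n, T = Y.shape
    order = np.argsort(x)
    slices = np.array_split(order, n_slices)
    overall_mean = Y.mean(axis=0)
    M = np.zeros((T, T))
    for idx in slices:
        m = Y[idx].mean(axis=0) - overall_mean
        M += (len(idx) / n) * np.outer(m, m)   # estimate of Var E(y(t) | x)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, ::-1][:, :d], evals[::-1][:d]

# Purely illustrative usage with simulated curves:
rng = np.random.default_rng(1)
n, T = 300, 40
x = rng.normal(size=n)
t = np.linspace(0, 1, T)
Y = (np.outer(x, np.sin(2 * np.pi * t)) +
     np.outer(x ** 2, np.cos(2 * np.pi * t)) +
     0.2 * rng.normal(size=(n, T)))
basis, ev = sir_basis(Y, x)
```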


2.5.1.2 FDA

Other work on model (2.18) has focused on regularizing the basis function estimates through the incorporation of smoothness penalties. Let D denote an operator that measures smoothness, such as the second derivative operator. By subtracting λD'D from the covariance before taking the spectral decomposition (where λ is a smoothing parameter), the estimated basis functions will be pulled closer to the nullspace of D by an amount that depends on the size of λ. From a Bayesian standpoint, this can be viewed as operating under a prior which has mass concentrated near the nullspace of D. Work in this area often goes under the name ‘functional data analysis’ (see, for instance, the book by Ramsay and Silverman [16]).
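A rough sketch of this penalized decomposition, assuming NumPy; the second-difference matrix stands in for D, and the value of the smoothing parameter is arbitrary.

```python
import numpy as np

def smoothed_pca(Y, lam=1.0, d=2):
    """Basis estimation with a roughness penalty: subtract lam * D'D from the
    sample covariance before the spectral decomposition (Y is n x T)."""
    n, T = Y.shape
    S = np.cov(Y, rowvar=False)              # T x T sample covariance
    D = np.diff(np.eye(T), n=2, axis=0)      # (T-2) x T second-difference operator
    evals, evecs = np.linalg.eigh(S - lam * (D.T @ D))
    return evecs[:, ::-1][:, :d]             # basis pulled toward smooth functions
```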

2.5.1.3 ICA

A third path of work on (2.18) has arisen out of investigations into the possibility that PCA-derived methods, which take advantage only of second moment information, may be sacrificing efficiency by ignoring information that is contained in higher order moments. One popular approach to ICA [3] uses the mutual information MI(·, ·), a nonnegative, entropy-based function of a pair of random variables that satisfies MI(X, Y) = 0 if and only if X and Y are independent. The mutual information is generally difficult to work with, so in computational work it is typically replaced with an approximating polynomial in the cumulants. When using mutual information for carrying out ICA, the first basis function β_1 maximizes the mutual information with respect to an empirically-obtained estimate of the distribution of y. Subsequent basis functions β_2, β_3, . . . maximize the same function, subject to having 0 mutual information with the basis functions that have already been estimated. A number of other algorithms are also available for carrying out ICA, such as the gradient method of Amari [1].

A frequently cited motivation for the development of ICA is the difficulty of basis selection in the situation where only the second moments are used to estimate the basis functions. All of the algorithms that use the spectral decomposition can only identify the β_j(t) up to an orthogonal transformation that preserves the uncorrelatedness of the loadings λ_j. But if we require the λ_j to be independent, rather than merely uncorrelated, the basis functions will often be uniquely determined (the Gaussian case is an important exception). In PCA-derived procedures, the most common practice is to leave the basis vectors in the form in which they are returned by the PCA program, so that they are orthogonal in the Euclidean sense. The apparent arbitrariness of this convention has motivated several appearances of the phrase “the brain is not orthogonal” in papers advocating the use of ICA, rather than PCA, in the analysis of EEG and fMRI data. Notably, the procedure described in this chapter, while based on second moments, does not require an arbitrary restriction to identify the sources. Although a dimension-reduction procedure similar to PCA is used in the first stage of source identification, the geometric structure of the autocovariance cone is sufficiently restricted by the positivity constraint to allow unique identification of the generators.
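For illustration only, the following sketch contrasts PCA with FastICA from scikit-learn (an assumed dependency). FastICA maximizes an approximation to negentropy rather than the cumulant-approximated mutual information criterion described above, but it illustrates the same identifiability point: with non-Gaussian sources, independence pins down the basis where uncorrelatedness does not.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(2)
t = np.linspace(0, 8, 500)
# Two non-Gaussian sources: a square wave and a sawtooth.
S = np.c_[np.sign(np.sin(3 * t)), np.mod(t, 1.0) - 0.5]
A = np.array([[1.0, 0.5], [0.4, 1.0]])                 # assumed mixing matrix
X = S @ A.T + 0.02 * rng.normal(size=(500, 2))

pca_fit = PCA(n_components=2).fit_transform(X)         # identified only up to rotation
ica_fit = FastICA(n_components=2, random_state=0).fit_transform(X)  # sources recovered (up to order/scale)
```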

2.5.2 Frequency spectra as the PCA input

The basis construction methods described in the previous section are appropriate for a complicated mean structure coupled with a simple error structure. In fact, these methods are usually proposed in the context of white noise errors. A com-


pletely different class of models can be treated by applying PCA in the frequency domain, as described in chapter 9 of Brillinger [5]. The output of this analysis is most interpretable in the context of multivariate moving averages with rank-deficient filters. Given an observed n-variate time series Y(t), let Z(t) denote a q-vector of independent Gaussian variates, and let Z(1), Z(2), . . . denote a sequence of such random variables generated independently over time. Let Ψ denote an n × q matrix-valued filter of length p, so that for j = 1, . . . , p, Ψ(j) is an n × q matrix. We will be concerned with the case where n is much larger than q. The observed data will be assumed to arise in the following way:

Y(t) = Σ_{j=1}^p Ψ(j) Z(t − j).        (2.21)

As usual, the goal will be to estimate the filter Ψ, and then obtain posterior reconstructions of the inputs Z(t). There are essentially two differences between the frequency domain PCA and the autocovariance-based PCA described in this chapter.

• When the autocovariances are used as PCA inputs, the input series z_j^i(t) are independent across the observations y^i(t). In the frequency domain PCA, the same input time series are used at each observation.

• When the autocovariances are used as PCA inputs, the filters will lie in a low-dimensional subspace with dimension equal to the number of components. In the frequency-domain PCA, the filters vary freely.

To begin, let

Y(t) = (y^1(t), . . . , y^n(t))'        (2.22)

denote the vector of all observations recorded at time t, and let

C(s) = E[Y(t) Y(t + s)']        (2.23)

denote the autocovariance at lag s. Note that unlike the procedure described above, in which only marginal autocovariances are considered, in this context we will require the joint autocovariance function. C(s) can be estimated in the usual way, by

Ĉ(s) = (1/T) Σ_t Y(t) Y(t + s)'.        (2.24)

The spectral density f (λ) is the Fourier transform of C(s), so that

f(λ) = (1/2π) Σ_s C(s) e^{−isλ}.        (2.25)

A crude estimate of f(λ) can be obtained by substituting Ĉ(s) into (2.25). In practice, this estimator performs quite poorly, and a large literature is devoted to procedures for improved spectral density estimation. Rather than review that area here, we refer to chapter 7 of Brillinger’s book [5], and assume for the remainder that a suitable spectral density estimate f̂(λ) has been obtained. Given the moving average model (2.21), it is simple to compute that the relationship between the filter and the autocovariance sequence is given by the following:

C(s) = Σ_j Ψ(j) Ψ(j + s)'.        (2.26)

Substituting this into (2.25), it is an easy matter to obtain the following relation:

f(λ) = (1/2π) B_λ B̄_λ',        (2.27)

where

B_λ = Σ_j Ψ(j) e^{−ijλ},        (2.28)

and the bar denotes complex conjugation. Thus we have written 2πf(λ) = B_λ B̄_λ'.

Since f(λ) is Hermitian, it can be written

f(λ) = A_d Ā_d',        (2.29)

where for each d = 1, . . . , n, Ad is the n×d matrix whose columns are the singular vectors of f (λ) with the largest d singular values. Thus, Bλ can be estimated using Aq where q is the number of principal component time series specified in the model. Since (2.29) is invariant under right multiplication of Ad by an orthogonal matrix, Bλ is not completely identified in this fashion. In fact, in the model (2.21), the filter components Ψ(j) are only identified up to arbitrary right multiplication by an orthogonal matrix. Finally, we can invert the Fourier transform to recover estimates of the filter components

Ψ(j) = (1/2π) Σ_λ B_λ e^{iλj}.        (2.30)
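A condensed sketch of these steps, assuming NumPy; the crude autocovariance-based spectral estimate stands in for a properly smoothed one, the frequency grid and lag truncation are arbitrary choices, and the rotation ambiguity noted above is left unresolved.

```python
import numpy as np

def freq_domain_pca(Y, max_lag=20, q=2, n_freq=64):
    """Y is (T, n): T time points of an n-variate series.  For each frequency,
    return the leading q (scaled) eigenvectors of a crude spectral estimate."""
    T, n = Y.shape
    Y = Y - Y.mean(axis=0)
    C = [Y[:T - s].T @ Y[s:] / T for s in range(max_lag + 1)]   # C(s) estimates
    lambdas = np.linspace(0, np.pi, n_freq)
    top_vectors = []
    for lam in lambdas:
        f = C[0].astype(complex)
        for s in range(1, max_lag + 1):
            f = f + C[s] * np.exp(-1j * s * lam) + C[s].T * np.exp(1j * s * lam)
        f = f / (2 * np.pi)
        f = (f + f.conj().T) / 2                                # enforce Hermitian symmetry
        evals, evecs = np.linalg.eigh(f)
        scale = np.sqrt(np.maximum(evals[::-1][:q], 0.0))
        top_vectors.append(evecs[:, ::-1][:, :q] * scale)
    return lambdas, top_vectors
```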

From a computational standpoint, it would be extremely costly to apply the frequency domain PCA in situations such as the analysis of brain image data. The spectral density function f (λ) will be very high dimensional – at each frequency


λ, order n2 components must be estimated, where n is the number of pixels. On the other hand, the autocovariance-based PCA that is described earlier in this chapter is much less demanding. The autocovariance sequences will be truncated to be much shorter than the raw sequences, so the dimension reduction procedure will be applied to n p-vectors (with p typically between 10 and 20), rather than n n−vectors, as in the frequency domain PCA.

2.6 Comparison to ARIMA Modeling

One of the most popular frameworks for time series modeling over the last 25 years has been provided by the class of ARIMA models (autoregressive integrated moving average models), which is described in the book by Box and Jenkins [4]. One way of expressing an observation y(t) from this class of models is as follows:

ỹ(t) = ∇^r y(t)
φ(B) ỹ(t) = ψ(B) z(t),

where z(t) is a white noise sequence, ∇ is the differencing operator, defined by (∇y)(t) = y(t) − y(t − 1), B is the backshift operator, defined by (Bz)(t) = z(t − 1), and ψ(·) and φ(·) are polynomials. ARIMA models are popular for a number of reasons, including their theoretical tractability and the wide range of dynamic behaviors that they encompass. There are also efficient, though not necessarily simple, computational procedures for carrying out maximum likelihood and method of moments estimation of the parameters. The model that we have used for the marginal law of one observed time series


– an independent sum of MA or AR processes – is conceptually simpler than the ARIMA(p, q, r) model. Unlike the ARIMA model, for the Gaussian case our model is not even identified when only a single time series is observed. In order to identify k sources, we need to observe at least k different time series with linearly independent intensity weights. Nevertheless, we have found that this class of models naturally extends to a class of models for describing multiple time series, and that parameter estimation is relatively simple and efficient for the multiple time series case. One can imagine dimensionally-reduced multivariate extensions of the ARIMA class. For instance, the order of the model could be fixed, but the coefficients of ψ and φ might be constrained to vary over a low dimensional subspace. On the other hand, the sources in our model could be allowed to come from an arbitrary ARIMA distribution. It is hoped that the added flexibility that results from having multiple independent sources eliminates the need for complete generality in this area. However, as discussed above in the case of the fMRI data, it may sometimes be useful to move beyond pure AR or MA models for the sources.

2.7 Conclusion

In this chapter, we have discussed models for an ensemble of time series curves in which each time series in the ensemble arises as a randomly weighted sum of independent draws from a small collection of stationary sources. We have argued that this model provides a probabilistic justification for carrying out principal component analysis (PCA) on the set of sample autocovariance functions for all of the time series in the ensemble. A simulation study demonstrated that the basic shape of the hidden components can be recovered for moderate sample sizes. Applications to two real data sets (U.S. unemployment rates, and functional


neuroimaging using MRI) revealed interesting features that might not be easy to discover using other methods. Finally, we described the relationship between our method, and the more popular analysis techniques of basis search, frequency domain PCA, and ARIMA modeling. It seems clear that the method that we propose is worth considering further, especially for situations in which a very large number of time series is collected.


CHAPTER 3
Digital blind deconvolution

Deconvolution and blind deconvolution have diverse applications in areas such as geophysics and telecommunications. The problem is to recover an input sequence based on a blurred, noisy observation sequence. The relationship between the input sequence and the observation sequence is given by the following:

y_t = Σ_{j=0}^{p−1} ψ_j x_{t−j} + ε_t,        (3.1)

where the ψ_j are referred to as the filter coefficients, x_t is the input sequence, y_t is the observation sequence, p is the filter length, and ε_t is the noise. Written in vector form, the model is Y = X ⋆ ψ + ε, where ⋆ denotes convolution. Typically, the noise is taken to be iid Gaussian. If the filter is known, then the problem of restoring the x_t is known as deconvolution, while if the filter must be estimated, the problem is referred to as blind deconvolution. Recently, a series of papers has appeared focusing on the case that the input sequence is drawn from a distribution with finite support (‘digital blind deconvolution’). These papers fall into two groups. The first group is primarily concerned with the use of the inverse filter, as exemplified by papers of T.H. Li [12], and Gamboa and Gassiat [9]. The second group approaches the problem as a Bayesian computation, and focuses on the development of Monte Carlo algorithms for computing the relevant posterior distributions. Work of this type has been carried


out by Chen, Liu, and Wong [13], [14]. The inverse-filtering approach focuses mainly on the estimation of filter coefficients. There is very little discussion on how to restore signals efficiently after filter estimation. The Bayesian approach on the other hand provides a natural framework for addressing the issue of efficient signal restoration. As is typical in the Bayesian context, the computation of the relevant posterior quantities is a challenging task. The work of Chen, Liu and Wong makes a significant advance in this problem by introducing sequential importance sampling (SIS), which was initially proposed in [10]. While the technique of SIS is promising, its performance is known to be tempered by the skewness in the distribution of the importance weights attached to the sampled sequences. For example, starting from a pool of 2000 trial sequences, it is not unusual to end up with only three or four sequences receiving more than 50% of the total weight. This is the case for sequences with say only 100 time points. The problem gets worse as the sequence grows longer. In this chapter, we seek other ways of implementing the posterior computation. We still use importance sampling as the basic method of generating samples. However unlike SIS, our new strategy does not have a severe weight-skewness problem. The key idea in our approach relies on the use of inverse filtering for localizing the information most critical for restoring the signal at each time point. This chapter is organized in the following way. First, in section 3.1 we will discuss an exact but intractable method for computing the posterior restoration. In section 3.2, we will discuss the basic importance sampling and sequential importance sampling algorithms. In section 3.3 we discuss the inverse filter, and in section 3.4 we introduce our Bayesian restoration which uses the inverse filter for information localization. In section 3.5 we discuss the blind case, and in section


3.6 we review the method of moments estimation of the filter.

3.1 Deconvolution and Bayesian restoration

If the filter is known, and the goal is to minimize the number of misclassified time points in the restoration, then the optimal restoration can be found by maximizing the posterior:

x̂_t = argmax_j P_ψ(x_t = j | Y).        (3.2)

For simpler problems, like example 3.1 given at the end of this section, the exact computation of the posterior probabilities is feasible after we reformulate the deconvolution problem as a hidden Markov model. To proceed, we begin by blocking the x_t into runs of the same length as the filter. Write z_t = (x_t, x_{t−1}, . . . , x_{t−p+1}). It is easy to see that P(y_t | z_t) = N(ψ · z_t, σ²). Since z_t follows a Markov chain distribution, we can regard this as a hidden Markov model. The forward-backward algorithm of Baum et al. [2] can be used to compute P(z_t | y_1, . . . , y_T). More specifically, define the forward probabilities α_t(j) and the backward probabilities β_t(j) as follows:

α_t(j) = P(y_1, . . . , y_t, z_t = j)        (3.3)
β_t(j) = P(y_{t+1}, . . . , y_T | z_t = j)        (3.4)

They can be updated recursively:

α_{t+1}(j) = P(y_{t+1} | z_{t+1} = j) Σ_i α_t(i) π(z_{t+1} = j | z_t = i)        (3.5)
β_t(j) = Σ_i β_{t+1}(i) P(y_{t+1} | z_{t+1} = i) π(z_{t+1} = i | z_t = j),        (3.6)

where π(z_{t+1} = j | z_t = i) is 1/k or 0, according to whether state i is compatible with the left-shifted state j (L denoting the left shift operator). The posterior probabilities are obtained using the following:

P(z_t = i | y_1, . . . , y_T) ∝ α_t(i) β_t(i).        (3.7)
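A compact sketch of this forward-backward restoration for a known filter, assuming NumPy; the states are the k^p blocks z_t, a uniform prior is placed on the initial block, and the per-step rescaling of α and β is a standard numerical device rather than part of the formulas above.

```python
import numpy as np
from itertools import product

def fb_restore(y, psi, support, sigma2):
    """Exact posterior restoration argmax_v P(x_t = v | y_1..y_T) for the model
    y_t = sum_j psi_j x_{t-j} + noise, via the forward-backward recursions."""
    p, k = len(psi), len(support)
    states = list(product(range(k), repeat=p))       # z_t = (x_t, ..., x_{t-p+1})
    means = np.array([sum(psi[j] * support[s[j]] for j in range(p)) for s in states])
    T, S = len(y), len(states)
    # transition i -> j allowed iff j drops the oldest entry of i and appends a new x_t
    trans = [[j for j, sj in enumerate(states) if sj[1:] == states[i][:-1]]
             for i in range(S)]
    emit = lambda t: np.exp(-(y[t] - means) ** 2 / (2 * sigma2))
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = emit(0) / S
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        e = emit(t)
        for i in range(S):
            for j in trans[i]:
                alpha[t, j] += alpha[t - 1, i] * (1.0 / k) * e[j]
        alpha[t] /= alpha[t].sum()                   # rescale to avoid underflow
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        e = emit(t + 1)
        for i in range(S):
            beta[t, i] = sum(beta[t + 1, j] * (1.0 / k) * e[j] for j in trans[i])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)
    # marginal posterior of x_t: sum state probabilities over the leading entry
    x_post = np.zeros((T, k))
    for s_idx, s in enumerate(states):
        x_post[:, s[0]] += post[:, s_idx]
    return np.array(support)[x_post.argmax(axis=1)]

# Usage in the spirit of example 3.1 below (sequence length shortened here):
rng = np.random.default_rng(0)
psi = np.array([1.0, 0.8, -0.4]); support = [0.0, 1.0, 3.0]; sigma2 = 0.1
x = rng.choice(support, size=200)
y = np.convolve(x, psi)[:200] + np.sqrt(sigma2) * rng.normal(size=200)
x_hat = fb_restore(y, psi, support, sigma2)
```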

As an illustration, consider the following example.

Example 3.1. Suppose that the length of the filter is 3 and ψ = (1, 0.8, −0.4). The input signal x_t takes values from {0, 1, 3} with equal probability. The length of the observed sequence is 500. We simulate 100 sequences of y_t, t = 1, . . . , 500 according to equation (3.1) with Gaussian noise at 4 different levels (0, 0.1, 0.2, 0.3). We then apply the Bayesian restoration computation as described above to each sequence. The total numbers of errors are shown in figure 3.1. The averages are 1.5, 7.9, 28.6, and 49.2, and the standard deviations are 0.73, 2.98, 6.33, and 7.63. The error counts follow approximately a Poisson distribution.

The complexity of the forward-backward algorithm for blind deconvolution is T k^{p+1}, where T is the number of time points, k is the number of possible input values, and p is the length of the filter. Note that this is much better than the usual hidden Markov complexity of T k^{2p}, since we can take advantage of the zeros in the transition matrix of z_t.

[Figure 3.1: Error counts for the exact Bayesian restoration.]

Nevertheless, the complexity is such that this method is only practical for off-line processing of small problems. The hidden Markov chain for the above example has only 3³ = 27 states, which is the reason that we can still carry out the simulation reasonably fast. The next example poses a great challenge because the number of states increases to 16⁴ = 65,536. We cannot do anything with this example now, but we will return to it later in the chapter when more efficient methods have been introduced.

Example 3.2. Take p = 4 and ψ = (.9162, −.1833, .4812, −.1987). The support of x_t is given by {−15, −13, . . . , 13, 15}. The noise level is set at σ² = 0.2. The sequence length is n = 500.


3.2 Importance Sampling, Sequential Importance Sampling

Ever since the Monte Carlo approach became a popular method for performing difficult integrations, importance sampling has been one of the most useful tools in this area. It has the advantage of being straightforward to implement and analyze, and it produces independent samples. However, it requires the specification of a good trial density, and the efficiency of importance sampling can be quite sensitive to this choice.

Suppose that one is trying to estimate the expectation of a functional φ(X), where X is drawn from a distribution π(X), so that the expectation is equal to ∫ φ(X) π(X) dX. Then if independent samples X_i can be drawn from π(X), and if ∫ φ(X)² π(X) dX < ∞, then (1/n) Σ_{i=1}^n φ(X_i) converges strongly to the desired quantity.

Now suppose that it is not possible to sample directly from π(X), but it is possible to sample from a trial density π*(X) whose support contains the support of π. Then since ∫ φ(X) π(X) dX = ∫ φ(X) [π(X)/π*(X)] π*(X) dX, the expectation being estimated is also the strong limit of (1/n) Σ_{i=1}^n φ(X_i*) π(X_i*)/π*(X_i*), where the X_i* are drawn independently from π* (as long as ∫ [φ(X) π(X)/π*(X)]² π*(X) dX < ∞).
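As a concrete illustration of the identity above (not specific to deconvolution), the following sketch estimates E φ(X) for X ~ N(0, 1) using draws from a heavier-tailed trial density; NumPy is assumed, and the choice φ(x) = x⁴ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
phi = lambda x: x ** 4                       # functional of interest; E phi(X) = 3
n = 100_000

# Target pi = N(0, 1); trial pi* = N(0, 2^2), whose support contains that of pi.
x_star = rng.normal(0.0, 2.0, size=n)
# log of pi(x*) / pi*(x*); the common 1/sqrt(2 pi) constants cancel.
log_w = (-0.5 * x_star ** 2) - (-0.5 * (x_star / 2.0) ** 2 - np.log(2.0))
w = np.exp(log_w)

estimate = np.mean(phi(x_star) * w)          # converges to 3 as n grows
```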

The SIS (sequential importance sampling) algorithm was introduced as a way to generate trial densities for importance sampling in the case that the quantity being sampled can be linearly ordered, and the dependence is spatially local with respect to this ordering. Deconvolution of noisy signals when the filter is known falls naturally into this context (the blind case can be handled as an extension). The main idea is that the trial density incorporates the observations yt sequentially, instead of as a batch. Thus, for the deconvolution problem, the


trial density at time t, given the trial samples obtained previously, is

π*(x_t | x_{t−1}, x_{t−2}, . . .) = P(x_t | y_t, y_{t−1}, . . . , x*_{t−1}, x*_{t−2}, . . .).        (3.8)

The resulting importance weights are shown by Kong, Liu, and Wong [10] to be algebraically equivalent to the products of the predictive probabilities of each time point, given the past draws and observations, as follows:

P(x_1, . . . , x_T | y_1, . . . , y_T) / π*(x_1, . . . , x_T) = ∏_{t=2}^T P(y_t | y_{t−1}, y_{t−2}, . . . , x_{t−1}, x_{t−2}, . . .).        (3.9)

While SIS is simple to implement, there is one serious drawback. This is the weight-skewness problem. Recall that for any importance sampling procedure, if the trial density π ∗ is not well-chosen, then the weights π(X)/π ∗ (X) will have a highly skewed distribution. In this case the average will be effectively based only on a tiny proportion of the samples, and hence will be very unstable. Figure 3.2 illustrates the seriousness of the weight-skewness problem. Here we generate a sequence of yt from the setting of Example 3.2 with length n = 150. One thousand sequences are generated using the sequential importance sampling procedure described above. The relative weight received by each sequence at time points t = 2, 5, 50, and 150 is plotted on the vertical axis. At time t = 150, it is quite apparent that most of the weight is controlled by a single sequence. To make SIS practical for deconvolution, two methods of monitoring the weight skewness have been introduced - rejuvenation [13], and rejection control [14]. In general, they require more trial sequences and a good deal of training in order to find a balance between the tolerable weight skewness and the sampling cost.

[Figure 3.2: Weights for 1000 sequential importance sampling draws computed after 2, 5, 50, and 150 time points have been processed.]

Our strategy is different. Instead of fighting against the weight distribution, we will find other ways of conducting importance sampling which can avoid the serious weight-skewness problem. There are two key ideas. The first one is what we call ‘information localization’: in order to restore the signal at time t = t_0, we really do not need much information from those y_t values taken at times t that are too far away from t_0. The second idea is the use of the inverse filter to achieve information localization.

3.3 The Inverse Filter

Suppose ψ_1 and ψ_2 are filters, and ψ_1 ⋆ ψ_2 = δ, where δ(t) = I(t = 0). If this identity holds, then ψ_2 is called the inverse filter of ψ_1, and vice-versa. Note that if ψ_1 is compactly supported and causal, then ψ_2 must be non-compactly supported, and may be non-causal. The two filters are functional inverses of each other, in the sense that if ψ^{−1}_{j,k} is the inverse of ψ truncated to j positive and k negative terms, then

(ψ^{−1}_{j,k} ⋆ (ψ ⋆ x))_t → x_t,        (3.10)

as j, k → ∞, separately for each t.

It is simple to show that any nonzero filter which has no characteristic roots on the unit circle will have such an inverse, possibly with complex coefficients. To construct the inverse, write the filter ψ as a polynomial ψ(B), where B is the backward shift operator. Factor ψ(B) = ψ_{p−1} ∏_{j=0}^{p−1} (B − r_j), and invert the filter one factor at a time. If |r_i| > 1, then

(B − r_i)^{−1} = −Σ_{j=0}^{∞} B^j / r_i^{j+1}.

Similarly, if |r_i| < 1, then

(B − r_i)^{−1} = Σ_{j=0}^{∞} r_i^j F^{j+1},

where F = B^{−1} is the forward shift operator. By

inverting each factor of ψ and then multiplying the inverses, we obtain a bi-infinite series in B which is the inverse filter of ψ. In practice, these power series expansions must be truncated. We use a straightforward truncation, although more sophisticated approaches that adjust the remaining coefficients to minimize the truncation error are available.

The properties of the inverse filter are immediately clear in the case that there is no noise. In this situation, the x_t can be recovered without error by applying the inverse filter. We are primarily interested in the case that the blurred input signal is also contaminated with noise. In this situation, the inverse filter does not completely localize the information in Y related to x_t. In other words, if we apply the inverse filter, we get ψ^{−1} ⋆ Y = X + ψ^{−1} ⋆ ε. A straightforward way to carry out the restoration is to round the inverted sequence to the nearest value in the set of admissible input signal values. However, since the components of ψ^{−1} ⋆ ε are correlated, there is information about x_t in components of ψ^{−1} ⋆ Y other than (ψ^{−1} ⋆ Y)_t. Without exploring this dependence structure, the result cannot compete with the Bayesian restoration.

Example 3.1 (continued): ψ = (1, 0.8, −0.4), x_t ∈ {0, 1, 3}, σ² = 0.1. The roots of ψ are 2.8708 and −0.8708.

Example 3.2 (continued): ψ = (.9162, −.1833, .4812, −.1987), x_t ∈ {−15, −13, . . . , 13, 15}, σ² = 0.2. The roots of ψ are 2.7094, −0.1438 + 1.2966i, and −0.1438 − 1.2966i.
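A sketch of this construction, assuming NumPy: factor the filter polynomial, expand each factor as a truncated series in B or F according to the location of its root, and convolve the expansions. The truncation length and the bookkeeping of the lag-0 offset are implementation choices.

```python
import numpy as np

def inverse_filter(psi, n_terms=30):
    """Truncated two-sided inverse of the causal filter psi = (psi_0, ..., psi_{p-1}).
    Returns (coeffs, offset): coeffs[offset + m] multiplies B^m, so negative m
    corresponds to the forward shift F."""
    psi = np.asarray(psi, dtype=complex)
    roots = np.roots(psi[::-1])                 # roots of psi(B); np.roots wants highest degree first
    coeffs = np.array([1.0 / psi[-1]])          # start from 1 / psi_{p-1}
    offset = 0
    j = np.arange(n_terms)
    for r in roots:
        if abs(r) > 1:
            # (B - r)^{-1} = -sum_j B^j / r^{j+1}   (causal expansion)
            factor, f_offset = -(1.0 / r) ** (j + 1), 0
        else:
            # (B - r)^{-1} = sum_j r^j F^{j+1}      (anticausal expansion)
            factor, f_offset = np.concatenate(((r ** j)[::-1], [0.0])), n_terms
        coeffs = np.convolve(coeffs, factor)    # lag-0 offsets add under convolution
        offset += f_offset
    return coeffs.real, offset                  # imaginary parts cancel up to rounding

# Quick check with the filter of example 3.1: psi * psi^{-1} should be close to
# a delta at lag 0, up to truncation error.
psi = np.array([1.0, 0.8, -0.4])
inv, off = inverse_filter(psi)
delta_approx = np.convolve(psi, inv)            # lag 0 sits at index `off`
```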

[Figure 3.3: Top: graph of the inverse filter for example 3.1. Bottom: graph of the inverse filter applied to a simulated observation sequence. The ’+’ denotes the true input, and the solid line interpolates the estimated values.]

[Figure 3.4: Top: graph of the inverse filter for example 3.2. Bottom: graph of the inverse filter applied to a simulated observation sequence. The ’+’ denotes the true input, and the solid line interpolates the estimated values.]

3.4 Application of the Inverse Filter to Monte Carlo Deconvolution

Write Y˜ for the inverted sequence after the application of the inverse filter to Y , and y˜t for the tth term of Y˜ . The conditional probabilities in (3.2) can be expressed in terms of Y or Y˜ , since P (xt |Y ) = P (xt |Y˜ ). As mentioned above, the information in Y or Y˜ that is useful for restoring xt will be relatively local to yt . This will hold to an even greater degree if we are conditioning on Y˜ instead of Y . This suggests that gains in computational speed might be realized by replacing the conditional probabilities in (3.2) with the slightly less informative values

P(x_t | y_{t−d_b}, . . . , y_{t+d_f})        (3.11)

or

P(x_t | ỹ_{t−d_b}, . . . , ỹ_{t+d_f}),        (3.12)

where db and df are positive integers to be specified. We refer to the approach that conditions on subintervals of Y as the forward approach, whereas if we are conditioning on subintervals of Y˜ , we have the inverse approach.

3.4.1 The Forward Approach

The first step of our forward algorithm is to apply the inverse filter to get ψ^{−1} ⋆ y = x + ψ^{−1} ⋆ ε. Writing ỹ_t = (ψ^{−1} ⋆ y)_t, it is clear that marginally, the components of ỹ follow a normal distribution as follows.

P(ỹ_t | x_t) = N(x_t, σ*²),        (3.13)

σ*² = σ² Σ_i (ψ^{−1}_i)²,        (3.14)

where ψ^{−1}_i denotes the i-th coefficient of the inverse filter.

The marginal trial density for xt in our forward algorithm is π ∗ (xt ) = P (xt |˜ yt ), which is easily obtained by evaluating (3.13) at all values of xt , and normalizing the probabilities. To generate a sample chain, we sample from these trial densities independently. Our posterior reconstruction at time t will be based on

P (xt |yt−db , . . . , yt+df ),

(3.15)

where db and df are small integers, generally taken to be between p and 2p. The use of a limited interval of observations on the conditioning side of (3.15) will result in a negligible loss of statistical efficiency, but leaves a much more tractable problem from the standpoint of Monte Carlo computation. Sample (xt−db −p+1 , . . . , xt+df ) from the trial density π ∗ , and use these values to get the conditional mean predictions of yt−db , . . . , yt+df , which we denote yˆt−db , . . . , yˆt+df . Next compute the logarithm of the importance weight for estimating functionals of

P (xt−db −p+1, . . . , xt+df |yt−db , . . . , yt+df ),

(3.16)

which is given by

−(1/2) Σ_{s=t−d_b}^{t+d_f} (y_s − ŷ_s)²/σ² − Σ_{s=t−d_b−p+1}^{t+d_f} log π*(x_s).        (3.17)

The functionals that we are estimating are just I(x_t = s_j) for each time point t and each support point s_j. Thus the quantities (3.15) can be estimated by computing the following at each time point:

Σ_ℓ I(x_t^ℓ = s_j) e^{w_ℓ} / Σ_ℓ e^{w_ℓ},        (3.18)

where w_ℓ is given by (3.17). Note that it is not necessary to generate a new set of draws for each point that is to be restored. At the outset, x_1, . . . , x_T, π*(x_1), . . . , π*(x_T), and ŷ_1, . . . , ŷ_T can be computed and stored. Then the weights (3.17) can be rapidly re-computed at each time point as a moving average, allowing most of the computational effort to be recycled.

In many cases, the weights given by (3.17) will be tightly focused on the values of (x_{t−d_b−p+1}, . . . , x_{t+d_f}) that are neighbors of the value given by the inverse filter. That is, for each t, if s_1 and s_2 are the support points on either side of ỹ_t, then x_t is equal to either s_1 or s_2. In this situation, the expectation (3.2) can be computed exactly by exhaustive summation over these neighboring values.

Example 3.1 (continued): Figure 3.5 shows the error counts for 50 simulated draws from the model given above as example 3.1. Four error variances are used: 0, 0.1, 0.2, and 0.3. The mean error counts out of 500 points are 0.94, 11.88, 39.3, and 62.9. The standard deviations are 1.53, 4.40, 7.99, and 7.25. 500 Monte Carlo draws were used for each restoration.
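A rough sketch of the forward restoration just described, assuming NumPy. The simulated setup mimics example 3.1, but the sequence length, window sizes, and number of draws are arbitrary choices, and the inverse filtering is approximated here by FFT division (circular, so edge effects are ignored) rather than by the truncated construction of section 3.3.

```python
import numpy as np

rng = np.random.default_rng(4)

psi = np.array([1.0, 0.8, -0.4]); support = np.array([0.0, 1.0, 3.0])
sigma2, T, p = 0.1, 200, len(psi)
x_true = rng.choice(support, size=T)
y = np.convolve(x_true, psi)[:T] + np.sqrt(sigma2) * rng.normal(size=T)

# Approximate inverse filtering and the variance sigma*^2 of (3.14).
Psi = np.fft.fft(psi, T)
y_tilde = np.real(np.fft.ifft(np.fft.fft(y) / Psi))
sigma_star2 = sigma2 * np.mean(np.abs(1.0 / Psi) ** 2)   # sum of squared inverse coefficients

# Marginal trial densities pi*(x_t) = P(x_t | y_tilde_t), as in (3.13).
logits = -(y_tilde[:, None] - support[None, :]) ** 2 / (2 * sigma_star2)
pi_star = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_star /= pi_star.sum(axis=1, keepdims=True)

db, df, n_draws = p, p, 300
x_hat = support[np.argmin(np.abs(y_tilde[:, None] - support[None, :]), axis=1)]
for t in range(db + p, T - df):                 # interior points only, for simplicity
    xs = np.arange(t - db - p + 1, t + df + 1)  # indices of sampled inputs
    ys = np.arange(t - db, t + df + 1)          # indices of conditioned observations
    score = np.zeros(len(support))
    for _ in range(n_draws):
        idx = np.array([rng.choice(len(support), p=pi_star[s]) for s in xs])
        xw = support[idx]
        y_hat = np.array([np.dot(psi, xw[np.searchsorted(xs, s) - np.arange(p)])
                          for s in ys])
        log_w = (-0.5 * np.sum((y[ys] - y_hat) ** 2) / sigma2
                 - np.sum(np.log(pi_star[xs, idx])))       # weight (3.17)
        score[idx[np.searchsorted(xs, t)]] += np.exp(log_w)
    x_hat[t] = support[np.argmax(score)]        # weighted vote as in (3.18)
```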

[Figure 3.5: Error counts for simulated sequences of length 500 from example 3.1. From the left, the error variances are: 0, 0.1, 0.2, 0.3.]

Example 3.2 (continued): Figure 3.6 shows the error counts for 50 simulated draws from the model given above as example 3.2. Four error variances are used: 0, 0.1, 0.2, and 0.3. The mean error counts out of 500 points are 0, 2.14, 31.58, and 75.18. The standard deviations are 0, 1.91, 7.57, and 10.98. 500 Monte Carlo draws were used for each restoration. For all noise levels, fewer than 1% of the time points were off by more than 2 (further off than the adjacent value).

3.4.2 The Inverse Approach

When taking the inverse approach, we have a more sophisticated way of generating the trial density that tailors the shape of the distribution individually for each time point. This results in sampling that is statistically more efficient; however, there is a computational cost to re-evaluating the sampling density at each time point. The benefit of this approach would be most noticeable in a parallel environment, where the number of draws, rather than the expense per

draw would be the rate-limiting factor.

[Figure 3.6: Error counts for simulated sequences of length 500 from example 3.2. From the left, the error variances are: 0, 0.1, 0.2, 0.3.]

As with the forward approach, the trial density at the point being restored is equal to P(x_t | ỹ_t). Unlike in the forward approach, the neighboring values are drawn from a distribution that is adjusted depending on what value is drawn at the central point. The neighboring values are drawn alternately from the right and left sides of the point being restored. Following the draw of x_t, x_{t+1} is drawn from

P(x_{t+1} | ỹ_t, ỹ_{t+1}, x_t),        (3.19)

then x_{t−1} is drawn from

P(x_{t−1} | ỹ_t, ỹ_{t+1}, ỹ_{t−1}, x_t, x_{t+1}).        (3.20)

This process continues until a specified window t − db , . . . , t + df is filled out. The importance weights can be obtained by using (3.9), or by evaluating the ratio


of P (xt−db , . . . , xt+df |˜ yt−db , . . . , y˜t+df ) to the trial density explicitly. An immediate advantage of the inverse approach is that it is not necessary to sample the ‘tail’ (xt−db −1 , . . . , xt−db −p ). This is because the distribution

P (xt−db , . . . , xt+df |˜ yt−db , . . . , y˜t+df )

(3.21)

is directly accessible, in contrast to the corresponding distribution

P (xt−db , . . . , xt+df |yt−db , . . . , yt+df )

(3.22)

that appears in the forward approach. This is a sequential procedure, although it differs in two key respects from the sequential importance sampling applied in [13]. Firstly, the raw data Y have been transformed to Ỹ, to enhance the degree of local dependence. Secondly, the order in which the components are sampled is changed to focus the trial distribution on the support of the functional of interest. Note that it is not necessary to use the same values of d_b and d_f for each value that is to be restored. In a more advanced implementation, larger intervals could be taken at the points for which there is greater ambiguity in the inverse filter restoration.

Example 3.1 (continued): Figure 3.7 shows the error counts for 50 simulated draws from the model given above as example 3.1. Four error variances are used: 0, 0.1, 0.2, and 0.3. The mean error counts out of 500 points are 2.9, 11.78, 36.0, and 60.7. The standard deviations are 1.22, 3.76, 8.53, and 6.27. 100 Monte Carlo draws were used for each restoration.

[Figure 3.7: Error counts for simulated sequences of length 500 from example 3.1. From the left, the error variances are: 0, 0.1, 0.2, 0.3.]

Example 3.2 (continued): Figure 3.8 shows the error counts for 50 simulated draws from the model given above as example 3.2. Four error variances are used: 0, 0.1, 0.2, and 0.3. The mean error counts out of 500 points are 5.62, 7.38, 32.26, and 70.82. The standard deviations are 0.60, 2.07, 7.65, and 7.87. 100 Monte Carlo draws were used for each restoration. For all noise levels, fewer than 1% of the time points were off by more than 2 (further off than the adjacent value).

3.5 Blind Deconvolution

The problem of simultaneously estimating the filter and restoring the input sequence is considerably more difficult than performing the restoration alone in the case that the filter is known. There are several methods for estimating the filter described in the literature. For instance, Chen and Liu describe a fully

Bayesian approach in [13]. Since their method treats the filter as an additional random quantity to be estimated through its posterior distribution, the burden placed on the Monte Carlo computation will be even greater than in the non-blind case.

[Figure 3.8: Error counts for simulated sequences of length 500 from example 3.2. From the left, the error variances are: 0, 0.1, 0.2, 0.3.]

As an illustration of how our restoration method can be extended to the blind case without severely complicating the Monte Carlo computations, we adopt the simpler strategy of estimating the filter in a separate step from the signal restoration. To begin with, suppose we have a tentative estimate of the filter, from which we obtain a tentative signal restoration x_t. Put Y = (y_1, . . . , y_n)' and construct the matrix D(X), where D(X)_{ij} = x_{i−j+1} for i = 1, . . . , n and j = 1, . . . , p. We can express the relationship between Y and ψ through the linear model Y = D(X)ψ + ε. This allows us to update the filter estimate using least squares. This suggests alternating between restoring the input sequence and updating the filter estimate using regression. Of course, if the initial value of the filter is not reasonably close to the truth, then the first restoration will be effectively a


random guess. This throws doubt on the ability of this procedure to climb uphill to the true filter. However we find that if the sample size is moderate, then reasonably good starting values can be obtained using the method of moments. In this case, the second restoration of the input sequence has fewer errors than the first one, and ultimately a good estimate of the filter is obtained. This alternating algorithm gives an approximation to the EM algorithm update of the filter, which has the form

ψ̂ = (E(D(X)'D(X) | Y))^{−1} E(D(X)'Y | Y).        (3.23)

However, in this context, the usual conditions required for the EM algorithm to be locally convergent are not met, so it is not clear that this lends any theoretical support to the procedure.

Within the iterations of the alternating algorithm, we do not use either of the importance sampling restorations that were described previously. Instead, we use a simpler deterministic procedure within the iterations, and then do a polishing Monte Carlo restoration once the final filter estimate is obtained. The simple deterministic restoration has the following stages.

• Compute the inverted filter ψ^{−1}, and apply it to the observation sequence to get ỹ = ψ^{−1} ⋆ y.

• Tentatively set x_t equal to the support point s_j that is closest to ỹ_t.

• Cycle through the time points t = 1, . . . , T, updating each term x_t to the support point s_j that maximizes P_ψ(Y | X) when x_{t'}, t' ≠ t, are held fixed.

• Repeat the previous step until convergence to a local mode of P(Y | X) occurs.
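A sketch of the least-squares filter update and of the deterministic restoration used inside the iterations, assuming NumPy; the inverse filtering is again approximated by FFT division, the sweep limit is arbitrary, and clarity is preferred over speed in the coordinate-wise updates.

```python
import numpy as np

def ls_filter_update(x, y, p):
    """Regression update of the filter: build D(X) with D[i, j] = x_{i-j} (zero
    for i < j) and solve Y = D(X) psi by least squares."""
    n = len(y)
    D = np.zeros((n, p))
    for j in range(p):
        D[j:, j] = x[:n - j]
    return np.linalg.lstsq(D, y, rcond=None)[0]

def deterministic_restore(y, psi, support, max_sweeps=20):
    """Crude restoration: start from the rounded inverse-filter output, then make
    coordinate-wise updates of each x_t that increase the likelihood P(y | x)."""
    n, p = len(y), len(psi)
    Psi = np.fft.fft(psi, n)
    y_tilde = np.real(np.fft.ifft(np.fft.fft(y) / Psi))     # approximate psi^{-1} * y
    x = support[np.argmin(np.abs(y_tilde[:, None] - support[None, :]), axis=1)]
    for _ in range(max_sweeps):
        changed = False
        for t in range(n):
            old = x[t]
            best, best_rss = old, np.inf
            for v in support:
                x[t] = v
                y_hat = np.convolve(x, psi)[:n]
                seg = slice(t, min(t + p, n))    # only these residuals depend on x_t
                rss = np.sum((y[seg] - y_hat[seg]) ** 2)
                if rss < best_rss:
                    best, best_rss = v, rss
            x[t] = best
            changed = changed or (best != old)
        if not changed:
            break
    return x

# One alternation step, given a starting filter psi0 (e.g. from the method of moments):
#   x0   = deterministic_restore(y, psi0, support)
#   psi1 = ls_filter_update(x0, y, p=len(psi0))
```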

Example 3.1 (continued): The following plot shows the error counts for the blind deconvolution algorithm applied to 50 simulated draws from the model given above as example 3.1. Four error variances are used: 0, 0.1, 0.2, and 0.3. The mean error counts out of 1000 points are 1.32, 24.3, 82.6, and 149.74. The standard deviations are 1.41, 6.33, 11.95, and 28.45.

[Figure 3.9: Blind deconvolution error counts for simulated sequences of length 1000 from example 3.1. From the left, the error variances are: 0, 0.1, 0.2, 0.3.]

Example 3.2 (continued): The following plot shows the error counts for the blind deconvolution algorithm applied to 50 simulated draws from the model given above as example 3.2. Four error variances are used: 0, 0.1, 0.2, and 0.3. The mean error counts out of 1000 points are 4.8, 27, 102.8, and 181.7. The standard deviations are 33.9, 94.5, 97.06, and 54.37. Note that the procedure does well except for a few runs which are complete failures. These failures can be easily identified in practice by


examining the magnitude of the residuals. This leads to the possibility of hybrid procedures, in which a simple method like ours is applied first, and then only in the cases that it fails would the more complex methods be applied.

[Figure 3.10: Blind deconvolution error counts for simulated sequences of length 1000 from example 3.2. From the left, the error variances are: 0, 0.1, 0.2, 0.3.]

3.6 The method of moments

Details of computing the method of moments estimate for the filter are widely available. The basic idea is to find ψ̂ so that the covariance of ψ ⋆ X matches the empirical autocovariance of the y. When there are no unit roots, there are 2^{p−1} distinct solutions to the moment equations. We choose the best of these solutions using the following procedure.

1. For each method of moments solution ψ^i, apply the fast deterministic restoration given above to get an estimate of the input sequence, which we will denote x^i.

2. Compute the predicted observation sequence ŷ^i = ψ^i ⋆ x^i.

3. Rank the ψ^i according to the residual sum of squares Σ_t (y_t − ŷ_t^i)². The method of moments solutions with the smallest RSS will be used as starting values.

The method of moments solutions can be computed very rapidly, as follows. Write ψ(B) = Σ_{j=0}^{p−1} ψ_j B^j for the characteristic polynomial of ψ, and write ψ^r(B) for the same polynomial with reversed coefficients. If y = ψ(B)x + ε, the autocovariance of y is given by

γ_0 = var(y_t) = σ² + var(x) Σ_j ψ_j²        (3.24)
γ_i = cov(y_t, y_{t+i}) = var(x) Σ_j ψ_j ψ_{j+i}        (3.25)

Set γ̃_0 = (γ_0 − σ²)/var(x), and γ̃_i = γ_i/var(x). The coefficients of ψ(B)ψ^r(B) are γ̃_{p−1}, . . . , γ̃_0, . . . , γ̃_{p−1}. Therefore the roots of the polynomial

Γ(t) = Σ_{j=0}^{2p−2} γ̃_{|p−1−j|} t^j

are the same as the roots of ψ(B)ψ^r(B). By symmetry, the roots of ψ^r(B) are reciprocals of the roots of ψ(B). Set ψ = ∏ (B − r_i), where r_i runs over a set consisting of one member of each reciprocal pair of roots of Γ(t). Every such ψ is a method of moments solution.
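A sketch of this root-pairing computation, assuming NumPy; σ² and var(x) are treated as known, reciprocal roots are matched numerically, candidates whose coefficients are not (numerically) real are discarded, and each candidate is determined only up to sign.

```python
import numpy as np
from itertools import product

def mom_filters(y, p, sigma2, var_x):
    """All method-of-moments filter candidates of length p for y = psi(B)x + noise."""
    n = len(y)
    yc = y - y.mean()
    gamma = np.array([np.dot(yc[:n - s], yc[s:]) / n for s in range(p)])
    g = gamma / var_x
    g[0] = (gamma[0] - sigma2) / var_x                 # gamma-tilde sequence
    # Coefficients of Gamma(t), powers t^0 .. t^{2p-2}: g_{p-1}, ..., g_0, ..., g_{p-1}
    coeffs = np.concatenate((g[::-1], g[1:]))
    roots = np.roots(coeffs[::-1])                     # np.roots wants highest degree first
    # Match each root outside the unit circle with its (approximate) reciprocal inside.
    big = sorted([r for r in roots if abs(r) >= 1], key=abs)
    small = [r for r in roots if abs(r) < 1]
    pairs = []
    for r in big:
        j = int(np.argmin([abs(s - 1.0 / r) for s in small]))
        pairs.append((r, small.pop(j)))
    candidates = []
    for pick in product(*pairs):                       # one member from each reciprocal pair
        poly = np.poly(pick)[::-1]                     # coefficients in powers of B, monic
        if np.max(np.abs(poly.imag)) > 1e-8:
            continue                                   # not a real-coefficient filter
        psi = poly.real
        scale = np.sqrt(max(g[0], 1e-12) / np.sum(psi ** 2))   # match the lag-0 moment
        candidates.append(scale * psi)
    return candidates
```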

3.7 Further discussion and Conclusion

We find that for the case that the filter is known, the inverse filter is very successful at guiding the importance sampling in performing the deconvolution. The error rates that we attain are not far from optimal, and require relatively few


Monte Carlo draws. The inverse sequential importance sampling method is particularly promising, in that it required only around 100 draws per point to get the optimal mode. For blind deconvolution, we have achieved a qualified success. When sufficient data is available to produce a good initial filter estimate using the method of moments, the alternating algorithm that we advocate works well, and almost certainly will be much faster than a full-blown Bayesian approach. Unfortunately, the region of attraction for the correct mode of the filter in these problems was found to be surprisingly small. An impractically large sample size may be required for the method of moments to give a sufficiently good starting value for more difficult problems. Further work should focus on fast ways to generate good starting values, or efficient strategies for performing an initial randomized search for the general area of the correct mode. On a related front, Fredkin and Rice [8] have recently developed a deterministic numerical procedure for accelerating the forward-backward algorithm of Baum et al. It will also be worthwhile to investigate the possibility of applying their procedure to this problem.


References

[1] S. Amari, A. Cichocki, H.H. Yang, 1996. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems 8. Cambridge: MIT Press.

[2] L. Baum, T. Petrie, G. Soules, and N. Weiss, 1970. A maximization technique in statistical estimation for probabilistic functions of Markov processes. Annals of Mathematical Statistics 41, 164-171.

[3] A.J. Bell, T.J. Sejnowski, 1995. An information maximisation approach to blind separation and blind deconvolution. Neural Computation 7(6), 1129-1159.

[4] G.E.P. Box, G.M. Jenkins, 1976. Time series analysis: forecasting and control. San Francisco: Holden-Day.

[5] D.R. Brillinger, 1975. Time series: data analysis and theory. New York: Holt, Rinehart and Winston.

[6] M.S. Cohen, 1996. Rapid MRI and functional applications. In Brain mapping: the methods, A. Toga and J. Mazziotta, eds. San Diego: Academic Press.

[7] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, 1998. Biological sequence analysis: probabilistic models of proteins and nucleic acids. UK: Cambridge University Press.

[8] D.R. Fredkin, J. Rice. Fast evaluation of the likelihood of an HMM: ion channel currents with filtering and colored noise. Manuscript.

[9] F. Gamboa, E. Gassiat, 1996. Blind deconvolution of discrete linear systems. Ann. Statist. 24, no. 5, 1964-1981.

[10] A. Kong, J.S. Liu, W.H. Wong, 1994. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89, 278-288.

[11] K.C. Li, 1991. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86, 316-342.

[12] T.H. Li, 1995. Blind deconvolution of linear systems with multilevel nonstationary inputs. Ann. Statist. 23, no. 2, 690-704.

[13] J.S. Liu, R. Chen, 1995. Blind deconvolution via sequential imputations. Journal of the American Statistical Association 90, 567-576.

[14] J.S. Liu, R. Chen, W.H. Wong, 1998. Rejection control and importance sampling. Journal of the American Statistical Association 93, 1022-1031.

[15] S. Ogawa, T.M. Lee, 1990. Magnetic resonance imaging of blood vessels at high fields: in vitro and in vivo measurements and image simulation. Magn. Reson. Imaging 8(5), 557-566.

[16] J.O. Ramsay, B.W. Silverman, 1997. Functional data analysis. New York: Springer.

[17] L.A. Zadeh, C.A. Desoer, 1979. Linear systems theory: the state space approach. Huntington, N.Y.: R.E. Krieger Pub. Co.
