Multiscale Graphical Modeling in Space: Applications to Command and Control

Hsin-Cheng Huang and Noel Cressie
ABSTRACT Recently, a class of multiscale tree-structured models was introduced in terms of scale-recursive dynamics defined on trees. The main advantage of these models is their association with a fast, recursive, Kalman-filter prediction algorithm. In this article, we propose a more general class of multiscale graphical models over acyclic directed graphs, for use in command and control problems. Moreover, we derive the generalized-Kalman-filter algorithm for graphical Markov models, which can be used to obtain the optimal predictors and prediction variances for multiscale graphical models.
1 Introduction

Almost every aspect of command and control (C2) involves dealing with information in the presence of uncertainty. Since information in a battlefield is never precise, its status is rarely known exactly. In the face of this uncertainty, commanders must make decisions, issue orders, and monitor the consequences. The uncertainty may come from noisy data or, indeed, regions of the battle space where there are no data at all. It is this latter aspect, namely lack of knowledge, that is often most important. Statistical models that account for noisy data are well accepted by the research community, but the full quantification of uncertainty due to lack of knowledge (caused either by hidden processes or missing data) also falls under the purview of statistical modeling. Statistical models for C2 are relatively new and still rather embryonic; for example, Pate-Cornell and Fischbeck (1995) consider a very simple statistical model for a binary decision in C2. Given the spatial nature of the battlefield, it seems that a logical next step is to propose and fit spatial statistical models. Because timely analysis and decision-making in C2 is crucial, very fast optimal-filtering algorithms are needed. The purpose of this article is to present spatial statistical models for which first and second posterior moments can be obtained quickly and exactly.
Consider a scene of interest made up of $M \times N$ pixels $\{P(p,q) : p = 1, \ldots, M;\ q = 1, \ldots, N\}$ centered at pixel locations $\{s(p,q) : p = 1, \ldots, M;\ q = 1, \ldots, N\}$. For the moment, assume the scene is observed completely at one time instant, yielding multivariate spatial data from $L$ sensors:

$$y_l \equiv (y_l(s(p,q)) : p = 1, \ldots, M;\ q = 1, \ldots, N)', \qquad l = 1, \ldots, L,$$

where the index $l$ corresponds to the $l$-th sensor and the datum $y_l(s(p,q))$ is observed on pixel $P(p,q)$ centered at pixel location $s(p,q)$. One powerful way to deal with multi-sensor information is to use a hierarchical statistical model. The multi-sensor data $\{y_l : l = 1, \ldots, L\}$ are noisy. The "true" scene is an image

$$x \equiv (x(s(p,q)) : p = 1, \ldots, M;\ q = 1, \ldots, N)'.$$
Then a hierarchical statistical model is obtained upon specifying the probability densities, $[y_1, \ldots, y_L \mid x] = f_\theta$ and $[x] = g_\phi$. The general goal is to obtain $[x \mid y_1, \ldots, y_L]$, from which predictions about the true scene can be made. In this article, we assume only one sensor and we give very fast algorithms that yield $E(x \mid y)$ and $\mathrm{var}(x \mid y)$. If the parameters $(\theta, \phi)$ are estimated, then our approach could be called empirical Bayesian. A fully Bayesian approach would express the uncertainty in $(\theta, \phi)$ through a prior distribution but, while conceptually more pleasing, the very fast optimal-filtering algorithms referred to above are no longer available.

Optimal C2 decisions, in the face of uncertainty, can be handled through statistical decision theory, provided commanders can quantify their loss functions. That is, what loss

$$L(\delta(y), x)$$

would be incurred if the true scene is $x$ but an estimate $\delta(y)$ was used. Some basic results in decision theory demonstrate that the optimal filter $\delta^0$ is the one that minimizes $E\{L(\delta(y), x) \mid y\}$ with respect to measurable functions $\delta$. For example, if the squared-error loss function,

$$L(\delta(y), x) = (\delta(y) - x)'(\delta(y) - x),$$
is used, the optimal filter is:

$$\delta^0(y) = E(x \mid y).$$
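As a brief aside (our own standard verification, not spelled out in the article), the optimality of the posterior mean under squared-error loss follows from the decomposition

$$E\{L(\delta(y), x) \mid y\} = E\{(x - E(x \mid y))'(x - E(x \mid y)) \mid y\} + (\delta(y) - E(x \mid y))'(\delta(y) - E(x \mid y)),$$

whose cross term vanishes because $x - E(x \mid y)$ has zero conditional mean; the first term does not depend on $\delta$, so the minimum is attained at $\delta(y) = E(x \mid y)$.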
This is the loss function we shall use henceforth in this article. It is an appropriate loss function for drawing maps of $x$, although it is not all that useful for identifying nonlinear features (e.g., extremes in $x$).

It is typically the case that the battle commander has a different "aperture" than a platoon commander. The battle commander needs more global, aggregated data, whereas the platoon commander is often making decisions based on local, more focused information. Yet the sum of the parts is equal to the whole, and it can be important for the battle commander, and those in a chain of command all the way to the platoon commander, to query and receive information at different scales. Our intent here is not to decide which scales belong to which level of command but rather to develop a methodology that makes multiscale modeling and analysis possible.

Recently, a class of multiscale tree-structured models was introduced in terms of scale-recursive dynamics defined on trees (Chou, 1991; Basseville et al., 1992a, 1992b, 1992c). This class of models is very rich and can be used to describe stochastic processes with multiscale features. Let $x_t$ denote a vector representing the (multivariate) status of the battlefield at some location $t$ within a given scale. Then the multiscale tree-structured model is:

$$x_t = A_t x_{pa(t)} + w_t, \qquad t \in T \setminus \{t_0\},$$

where $T$ is a tree with root $t_0$, and $pa(t)$ denotes a location connected to $t$ but within a coarser scale. Think of $pa(t)$ as a parent of $t$. The potential observations $\{y_t\}$ are given by
$$y_t = C_t x_t + \varepsilon_t, \qquad t \in T.$$
Our goal is to obtain the conditional distribution of $\{x_t : t \in T\}$ based on the data $\{y_t\}$. The main advantage of these models over other statistical methods is their association with a fast, recursive, Kalman-filter prediction algorithm (Chou et al., 1994). The algorithm can be used to handle massive amounts of data efficiently, even in the presence of missing data. However, as models for time series or spatial processes at the finest level of the tree, they are somewhat restricted in that they may enforce certain discontinuities in covariance structures. That is, two points that are adjacent in time or space may or may not have the same parent node (or even grandparent node) on the tree. These discontinuities can sometimes lead to blocky artifacts in the resulting predicted values.

There are at least two possible ways that the tree-structured model above can be generalized to describe a more realistic situation in C2. First, we may consider an index set $T$ that is more general than a tree. Second, we may keep the tree structure, but allow for more general nonlinear and
non-Gaussian models as follows:

$$x_t = f_t(x_{pa(t)}, w_t), \qquad t \in T \setminus \{t_0\},$$
$$y_t = g_t(x_t, \varepsilon_t), \qquad t \in T.$$
This paper addresses the first approach. We consider graphical Markov models for C2, where the nodes of the graph represent random variables and the edges of the graph characterize certain Markov properties (conditional-independence relations). In particular, we propose multiscale graphical models in terms of scale-recursive dynamics defined on acyclic directed graphs. This class of models not only allows us to describe a wide range of spatial-dependence structures, but also can be used to reduce the discontinuities sometimes obtained from tree-structured models. In Section 2, we describe tree-structured Markov models and their associated Kalman-filter algorithm. In Section 3, we introduce graphical Markov models and develop the generalized-Kalman-filter algorithm for these models. The proposed multiscale graphical models, which are more general than multiscale tree-structured models, are presented in Section 4. Finally, a brief discussion is given in Section 5.
2 Tree-Structured Markov Models

In application areas such as command and control, fast filtering and prediction of data at many different spatial resolutions is an important problem. Gaussian tree-structured Markov models are well adapted to solving the problem in simple scenarios. Consider a (multivariate) Gaussian random process $\{x_t : t \in T\}$ indexed by the nodes of a tree $(T, E)$, where $T$ is the set of nodes, and $E$ is the set of directed edges. Let $t_0$ denote the root of the tree, and let $pa(t)$ denote the parent of a node $t \in T$. Then
$$E = \{(pa(t), t) : t \in T \setminus \{t_0\}\}.$$

The Gaussian tree-structured Markov model on the tree $(T, E)$ is defined as follows. Assume that the Gaussian process evolves from parents to children in a Markovian manner according to the following state-space model:

$$x_t = A_t x_{pa(t)} + w_t, \qquad t \in T \setminus \{t_0\}, \qquad (1.1)$$
$$y_t = C_t x_t + \varepsilon_t, \qquad t \in T, \qquad (1.2)$$

where $\{y_t : t \in T\}$ are (potential) observations, $\{x_t : t \in T\}$ are hidden (i.e., unobserved), zero-mean, Gaussian state vectors that we would like
to predict, $\{A_t : t \in T \setminus \{t_0\}\}$ and $\{C_t : t \in T\}$ are deterministic matrices, and $\{w_t : t \in T \setminus \{t_0\}\}$ and $\{\varepsilon_t : t \in T\}$ are independent zero-mean Gaussian vectors with covariance matrices given by

$$W_t \equiv \mathrm{var}(w_t), \qquad t \in T \setminus \{t_0\},$$
$$V_t \equiv \mathrm{var}(\varepsilon_t), \qquad t \in T.$$

Further, $\{x_t : t \in T\}$ and $\{\varepsilon_t : t \in T\}$ are independent, and $x_{pa(t)}$ and $w_t$ are independent for $t \in T \setminus \{t_0\}$. For example, we consider a tree-structured model with the hidden process given by
$$x_{j,k} = x_{j-1,[k/2]} + w_{j,k}, \qquad j = J_0+1, \ldots, J, \quad k \in \mathbb{Z}, \qquad (1.3)$$

where $[k/2]$ denotes the largest integer that is less than or equal to $k/2$, $\mathrm{var}(x_{J_0,k}) = \sigma_0^2$, $k \in \mathbb{Z}$, and $\mathrm{var}(w_{j,k}) = \sigma_j^2$, $j = J_0+1, \ldots, J$, $k \in \mathbb{Z}$. The associated tree of this model is shown in Figure 1.

FIGURE 1. Graphical representation of a tree-structured model.

For $J_0 = 0$, $J = 6$, $\sigma_0^2 = 10$, $\sigma_1^2 = 8$, $\sigma_2^2 = 6$, $\sigma_3^2 = 4$, $\sigma_4^2 = 3$, $\sigma_5^2 = 2$, and $\sigma_6^2 = 1$, the correlation function of the latent process,

$$f(t,s) \equiv \mathrm{corr}(x_{6,t}, x_{6,s}), \qquad t, s = 0, 1, \ldots, 127,$$

constructed from the two nodes $x_{0,0}$ and $x_{0,1}$ at the coarsest scale, is shown in Figure 2. Notice the blocky nature of the correlation function and, while it is not stationary, en gros it is nearly so.

The standard Kalman-filter algorithm developed for time series based on a state-space model (e.g., Harvey, 1989) can be generalized to tree-structured models. The derivation of the algorithm has been developed by Chou et al. (1994). In this section, a new and somewhat simpler approach based on Bayes' theorem is used to derive the algorithm.
FIGURE 2. Correlation function $f(t,s) \equiv \mathrm{corr}(x_{6,t}, x_{6,s})$ of $(x_{6,0}, x_{6,1}, \ldots, x_{6,127})'$ for the tree-structured model given by (1.3).
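The blockiness in Figure 2 can be checked directly. Under (1.3), $x_{6,t}$ is the sum of its scale-0 ancestor and the increments along its ancestral path, so the covariance simply accumulates $\sigma_j^2$ over shared ancestors. The short Python sketch below is our own illustration (numpy and the helper names are assumptions, not the article's code); it evaluates $f(t,s)$ exactly for the parameter values above.

```python
# Evaluate f(t, s) = corr(x_{6,t}, x_{6,s}) exactly for model (1.3):
# cov(x_{6,t}, x_{6,s}) accumulates sigma_j^2 over shared ancestors.
import numpy as np

J = 6
sigma2 = [10.0, 8.0, 6.0, 4.0, 3.0, 2.0, 1.0]    # sigma_0^2, ..., sigma_6^2

def cov(t, s):
    # the ancestor of x_{6,k} at scale j is x_{j, k // 2**(J - j)}
    return sum(sigma2[j] for j in range(J + 1)
               if t // 2 ** (J - j) == s // 2 ** (J - j))

n = 128                                           # nodes x_{6,0},...,x_{6,127}
C = np.array([[cov(t, s) for s in range(n)] for t in range(n)])
f = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))
print(round(f[0, 1], 2), round(f[63, 64], 2))     # adjacent: within vs across
```

Adjacent leaves in the same coarse block are highly correlated, while the adjacent pair $t = 63$, $s = 64$, which shares no ancestor, is uncorrelated; this is the discontinuity visible in Figure 2.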
Kalman filtering (Kalman, 1960; Kalman and Bucy, 1961) in time series is a recursive procedure to obtain the optimal predictors (i.e., conditional expectations) of state vectors based on all the observations that are currently available. Once the end of the series is reached, we can move backward to obtain the optimal predictors of state vectors based on all the data. These two steps, the forward recursion and the backward recursion, are known as filtering and smoothing, respectively. In a tree-structured model given by (1.1) and (1.2), our goal is to obtain the optimal predictors of state vectors based on all the data observed on a tree $(T, E)$. In order to visit all the nodes on $T$, the generalized Kalman filter for tree-structured models also consists of two steps: the uptree filtering step, followed by the downtree smoothing step. In the uptree filtering step, the generalized Kalman filter goes upward from the leaves of the tree, and successively collects information from each stage's children (i.e., it computes the optimal predictor of the state vector $x_t$ at a node $t$ based only on the data observed at $t$ and its descendants). Once the root $t_0$ is reached, we obtain the optimal predictor of $x_{t_0}$ based on all the data, since all the other nodes are descendants of the root $t_0$. We then go back downward by sending information to the root's children, who in turn send it to their children, and so on. That is, we proceed in the reverse order of the uptree step, and recursively compute the optimal predictor of the state vector $x_t$ at a node $t$ based on all the data. These two steps, though more complicated, are analogous to the filtering and the smoothing steps of the standard Kalman
filter.

First, we introduce some notation. We use a boldface letter to denote both a vector and a set. Write $t' \preceq t$ if $t' = t$ or $t'$ is a descendant of $t$. For $t \in T$, let

$$\delta_t \equiv \begin{cases} 1, & \text{if } y_t \text{ is observed}; \\ 0, & \text{otherwise}; \end{cases}$$
$$Y \equiv \{y_t : \delta_t = 1\};$$
$$Y_t \equiv \{y_{t'} : \delta_{t'} = 1,\ t' \preceq t\};$$
$$Y_t^- \equiv \{y_{t'} : \delta_{t'} = 1,\ t' \preceq t,\ t' \neq t\} = Y_t \setminus \{y_t\};$$
$$Y_t^c \equiv Y \setminus Y_t;$$
$$\hat{x}_{t_1|t_2} \equiv E(x_{t_1} \mid Y_{t_2});$$
$$\hat{x}_{t_1|t_2}^- \equiv E(x_{t_1} \mid Y_{t_2}^-);$$
$$\Sigma_t \equiv \mathrm{var}(x_t);$$
$$\Sigma_{t_1,t_2} \equiv \mathrm{cov}(x_{t_1}, x_{t_2});$$
$$\Gamma_{t_1|t_2} \equiv \mathrm{var}(x_{t_1} \mid Y_{t_2}) = \mathrm{var}(x_{t_1} - \hat{x}_{t_1|t_2});$$
$$\Gamma_{t_1|t_2}^- \equiv \mathrm{var}(x_{t_1} \mid Y_{t_2}^-) = \mathrm{var}(x_{t_1} - \hat{x}_{t_1|t_2}^-);$$
$$\Gamma_t \equiv \mathrm{var}(x_t \mid Y) = \mathrm{var}(x_t - \hat{x}_t);$$
$$\Gamma_{t_1,t_2} \equiv \mathrm{cov}(x_{t_1}, x_{t_2} \mid Y) = \mathrm{cov}(x_{t_1} - \hat{x}_{t_1},\ x_{t_2} - \hat{x}_{t_2});$$

where $\hat{x}_t \equiv E(x_t \mid Y)$.
The uptree filtering algorithm starts with the leaves and proceeds along all upward paths to the root of the tree. At each node $t$, $\hat{x}_{t|t}$ and $\Gamma_{t|t}$ are obtained recursively. Since the algorithm proceeds in a backward direction, as opposed to the forward direction given by (1.1), it is necessary to have a backward representation for each $t \in T \setminus \{t_0\}$; that is,

$$x_{pa(t)} = E(x_{pa(t)} \mid x_t) + \{x_{pa(t)} - E(x_{pa(t)} \mid x_t)\} \equiv B_t x_t + \eta_t, \qquad t \in T \setminus \{t_0\},$$

where, for $t \in T \setminus \{t_0\}$,

$$B_t \equiv \Sigma_{pa(t)} A_t' \Sigma_t^{-1},$$
$$\eta_t \equiv x_{pa(t)} - \Sigma_{pa(t)} A_t' \Sigma_t^{-1} x_t,$$

and

$$R_t \equiv \mathrm{var}(\eta_t) = \Sigma_{pa(t)} - \Sigma_{pa(t)} A_t' \Sigma_t^{-1} A_t \Sigma_{pa(t)}.$$

Note that $x_t$ and $\eta_t$ are uncorrelated, since $B_t x_t = E(x_{pa(t)} \mid x_t)$ and $\eta_t = x_{pa(t)} - E(x_{pa(t)} \mid x_t)$ are uncorrelated, for all $t \in T \setminus \{t_0\}$. Also note that $\{\Sigma_t : t \in T \setminus \{t_0\}\}$ can be computed recursively from the forward representation (1.1) by

$$\Sigma_t = A_t \Sigma_{pa(t)} A_t' + W_t,$$
where $W_t \equiv \mathrm{var}(w_t)$, $t \in T \setminus \{t_0\}$.

For a leaf $t \in T$, if $\delta_t = 0$ (i.e., $y_t$ is not observed), then $\hat{x}_{t|t} = 0$ and $\Gamma_{t|t} = \Sigma_t$, whereas if $\delta_t = 1$ (i.e., $y_t$ is observed), then by (1.2),

$$\hat{x}_{t|t} = \mathrm{cov}(x_t, y_t)\{\mathrm{var}(y_t)\}^{-1} y_t = \Sigma_t C_t' (C_t \Sigma_t C_t' + V_t)^{-1} y_t,$$
$$\Gamma_{t|t} = \mathrm{var}(x_t) - \mathrm{cov}(x_t, y_t)\{\mathrm{var}(y_t)\}^{-1} \mathrm{cov}(y_t, x_t) = \Sigma_t - \Sigma_t C_t' (C_t \Sigma_t C_t' + V_t)^{-1} C_t \Sigma_t.$$

Therefore, for a leaf $t \in T$, we have

$$\hat{x}_{t|t} = \delta_t \Sigma_t C_t' (C_t \Sigma_t C_t' + V_t)^{-1} y_t, \qquad (1.4)$$
$$\Gamma_{t|t} = \Sigma_t - \delta_t \Sigma_t C_t' (C_t \Sigma_t C_t' + V_t)^{-1} C_t \Sigma_t. \qquad (1.5)$$

For a node $t \in T$ that is not a leaf, let $\{t_1, t_2, \ldots, t_{k_t}\}$ denote all the children of $t$, where $k_t$ is the number of children of the node $t$. Suppose that we have computed $\hat{x}_{t_i|t_i}$ and $\Gamma_{t_i|t_i}$, $i = 1, \ldots, k_t$. Since
$$x_t = B_{t_i} x_{t_i} + \eta_{t_i}, \qquad i = 1, \ldots, k_t,$$

it follows that for $i = 1, \ldots, k_t$,

$$x_t \mid Y_{t_i} \sim N\left(B_{t_i} \hat{x}_{t_i|t_i},\ B_{t_i} \Gamma_{t_i|t_i} B_{t_i}' + R_{t_i}\right).$$

Therefore, for $i = 1, \ldots, k_t$, the following recursions are obtained:

$$\hat{x}_{t|t_i} = B_{t_i} \hat{x}_{t_i|t_i}, \qquad (1.6)$$
$$\Gamma_{t|t_i} = B_{t_i} \Gamma_{t_i|t_i} B_{t_i}' + R_{t_i}. \qquad (1.7)$$

Next we compute $\hat{x}_{t|t}^-$ and $\Gamma_{t|t}^-$. Suppose that we have computed $\hat{x}_{t|t_i}$ and $\Gamma_{t|t_i}$, $i = 1, \ldots, k_t$. Using Bayes' theorem, we have
$$p(x_t \mid Y_t^-) = p(x_t \mid Y_{t_1}, \ldots, Y_{t_{k_t}}) \propto p(Y_{t_1}, \ldots, Y_{t_{k_t}} \mid x_t)\, p(x_t) = p(x_t) \prod_{i=1}^{k_t} p(Y_{t_i} \mid x_t).$$

Note that the last equality holds, since $Y_{t_1}, \ldots, Y_{t_{k_t}}$ are independent, conditional on $x_t$. Hence by using Bayes' theorem again, we have

$$p(x_t \mid Y_t^-) \propto p(x_t) \prod_{i=1}^{k_t} \frac{p(x_t \mid Y_{t_i})}{p(x_t)}$$
$$\propto \exp\left[-\frac{1}{2}\left\{x_t' \Sigma_t^{-1} x_t + \sum_{i=1}^{k_t}\left((x_t - \hat{x}_{t|t_i})' \Gamma_{t|t_i}^{-1} (x_t - \hat{x}_{t|t_i}) - x_t' \Sigma_t^{-1} x_t\right)\right\}\right]$$
$$\propto \exp\left[-\frac{1}{2}\left\{x_t' \left(\Sigma_t^{-1} + \sum_{i=1}^{k_t}\left(\Gamma_{t|t_i}^{-1} - \Sigma_t^{-1}\right)\right) x_t - 2\, x_t' \sum_{i=1}^{k_t} \Gamma_{t|t_i}^{-1} \hat{x}_{t|t_i}\right\}\right].$$

Therefore,

$$\hat{x}_{t|t}^- = \Gamma_{t|t}^- \sum_{i=1}^{k_t} \Gamma_{t|t_i}^{-1} \hat{x}_{t|t_i}, \qquad (1.8)$$
$$\Gamma_{t|t}^- = \left\{\Sigma_t^{-1} + \sum_{i=1}^{k_t} \left(\Gamma_{t|t_i}^{-1} - \Sigma_t^{-1}\right)\right\}^{-1}. \qquad (1.9)$$
The final uptree step for each update is to compute $\hat{x}_{t|t}$ and $\Gamma_{t|t}$. Suppose that we have computed $\hat{x}_{t|t}^-$ and $\Gamma_{t|t}^-$. If $\delta_t = 0$ (i.e., $y_t$ is not observed), then $\hat{x}_{t|t} = \hat{x}_{t|t}^-$ and $\Gamma_{t|t} = \Gamma_{t|t}^-$. If $\delta_t = 1$ (i.e., $y_t$ is observed), the standard Kalman-filter algorithm can be applied to update $\hat{x}_{t|t}$ and $\Gamma_{t|t}$. Using Bayes' theorem again, we have

$$p(x_t \mid Y_t) = p(x_t \mid y_t, Y_t^-) \propto p(y_t, Y_t^- \mid x_t)\, p(x_t) = p(y_t \mid x_t)\, p(Y_t^- \mid x_t)\, p(x_t) \propto p(y_t \mid x_t)\, p(x_t \mid Y_t^-)$$
$$\propto \exp\left[-\frac{1}{2}\left\{(y_t - C_t x_t)' V_t^{-1} (y_t - C_t x_t) + (x_t - \hat{x}_{t|t}^-)' (\Gamma_{t|t}^-)^{-1} (x_t - \hat{x}_{t|t}^-)\right\}\right]$$
$$\propto \exp\left[-\frac{1}{2}\left\{x_t' \left(C_t' V_t^{-1} C_t + (\Gamma_{t|t}^-)^{-1}\right) x_t - 2\, x_t' \left(C_t' V_t^{-1} y_t + (\Gamma_{t|t}^-)^{-1} \hat{x}_{t|t}^-\right)\right\}\right].$$

Since

$$\left(C_t' V_t^{-1} C_t + (\Gamma_{t|t}^-)^{-1}\right)^{-1} = \Gamma_{t|t}^- - \Gamma_{t|t}^- C_t' \left(C_t \Gamma_{t|t}^- C_t' + V_t\right)^{-1} C_t \Gamma_{t|t}^-,$$

we see that,

$$\hat{x}_{t|t} = \Gamma_{t|t} \left(\delta_t C_t' V_t^{-1} y_t + (\Gamma_{t|t}^-)^{-1} \hat{x}_{t|t}^-\right), \qquad (1.10)$$
$$\Gamma_{t|t} = \Gamma_{t|t}^- - \delta_t \Gamma_{t|t}^- C_t' \left(C_t \Gamma_{t|t}^- C_t' + V_t\right)^{-1} C_t \Gamma_{t|t}^-. \qquad (1.11)$$
Thus, the uptree filtering algorithm can be carried out recursively by (1.6) through (1.11), with the initial predictors and the prediction variances of the state vectors at the leaves given by (1.4) and (1.5). The algorithm stops at the root $t_0$, where we obtain $\hat{x}_{t_0} = \hat{x}_{t_0|t_0}$ and $\Gamma_{t_0} = \Gamma_{t_0|t_0}$.

The downtree smoothing algorithm moves downward from the root to the leaves of the tree, allowing $\hat{x}_t$ and $\Gamma_t$ (i.e., the optimal predictor of $x_t$ and its prediction variance based on all the data) to be computed recursively for $t \in T$. To do this, we first compute the following conditional density for each $t \in T \setminus \{t_0\}$:

$$p(x_t, x_{pa(t)} \mid Y) = p(x_{pa(t)} \mid Y)\, p(x_t \mid x_{pa(t)}, Y) = p(x_{pa(t)} \mid Y)\, p(x_t \mid x_{pa(t)}, Y_t, Y_t^c) = p(x_{pa(t)} \mid Y)\, \frac{p(x_t, Y_t^c \mid x_{pa(t)}, Y_t)}{p(Y_t^c \mid x_{pa(t)}, Y_t)}. \qquad (1.12)$$

Since $(x_t', Y_t')'$ and $Y_t^c$ are independent, conditional on $x_{pa(t)}$, we have from (1.12),

$$p(x_t, x_{pa(t)} \mid Y) = p(x_{pa(t)} \mid Y)\, p(x_t \mid x_{pa(t)}, Y_t)$$
$$= \frac{p(x_{pa(t)} \mid Y)}{p(x_{pa(t)} \mid Y_t)}\, p(x_t, x_{pa(t)} \mid Y_t)$$
$$= \frac{p(x_{pa(t)} \mid Y)}{p(x_{pa(t)} \mid Y_t)}\, p(x_t \mid Y_t)\, p(x_{pa(t)} \mid x_t, Y_t)$$
$$= \frac{p(x_{pa(t)} \mid Y)}{p(x_{pa(t)} \mid Y_t)}\, p(x_t \mid Y_t)\, p(x_{pa(t)} \mid x_t)$$
$$\propto \frac{p(x_{pa(t)} \mid Y)}{p(x_{pa(t)} \mid Y_t)} \exp\left[-\frac{1}{2}\left\{(x_t - \hat{x}_{t|t})' \Gamma_{t|t}^{-1} (x_t - \hat{x}_{t|t}) + (x_{pa(t)} - B_t x_t)' R_t^{-1} (x_{pa(t)} - B_t x_t)\right\}\right].$$
Evaluating the conditional density above at $x_{pa(t)} = \hat{x}_{pa(t)}$ yields

$$p(x_t, x_{pa(t)} = \hat{x}_{pa(t)} \mid Y) \propto \exp\left[-\frac{1}{2}\left\{(x_t - \hat{x}_{t|t})' \Gamma_{t|t}^{-1} (x_t - \hat{x}_{t|t}) + (\hat{x}_{pa(t)} - B_t x_t)' R_t^{-1} (\hat{x}_{pa(t)} - B_t x_t)\right\}\right]$$
$$\propto \exp\left[-\frac{1}{2}\left\{x_t' \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right) x_t - 2\, x_t' \left(\Gamma_{t|t}^{-1} \hat{x}_{t|t} + B_t' R_t^{-1} \hat{x}_{pa(t)}\right)\right\}\right].$$

We obtain the recursion,

$$\hat{x}_t = \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} \left(\Gamma_{t|t}^{-1} \hat{x}_{t|t} + B_t' R_t^{-1} \hat{x}_{pa(t)}\right). \qquad (1.13)$$

Also,
$$\left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} B_t' R_t^{-1}$$
$$= \left\{\Gamma_{t|t} - \Gamma_{t|t} B_t' \left(R_t + B_t \Gamma_{t|t} B_t'\right)^{-1} B_t \Gamma_{t|t}\right\} B_t' R_t^{-1}$$
$$= \Gamma_{t|t} B_t' R_t^{-1} - \Gamma_{t|t} B_t' \left(R_t + B_t \Gamma_{t|t} B_t'\right)^{-1} B_t \Gamma_{t|t} B_t' R_t^{-1}$$
$$= \Gamma_{t|t} B_t' R_t^{-1} - \Gamma_{t|t} B_t' \left(R_t + B_t \Gamma_{t|t} B_t'\right)^{-1} \left(B_t \Gamma_{t|t} B_t' + R_t - R_t\right) R_t^{-1}$$
$$= \Gamma_{t|t} B_t' R_t^{-1} - \Gamma_{t|t} B_t' \left\{R_t^{-1} - \left(R_t + B_t \Gamma_{t|t} B_t'\right)^{-1}\right\}$$
$$= \Gamma_{t|t} B_t' \left(R_t + B_t \Gamma_{t|t} B_t'\right)^{-1}$$
$$= \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1},$$
where the last equality follows from (1.7). Therefore, (1.13) can also be written as
$$\hat{x}_t = \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} \left(\Gamma_{t|t}^{-1} \hat{x}_{t|t} + B_t' R_t^{-1} \hat{x}_{pa(t)}\right)$$
$$= \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} \left(\Gamma_{t|t}^{-1} \hat{x}_{t|t} + B_t' R_t^{-1} B_t \hat{x}_{t|t} + B_t' R_t^{-1} \hat{x}_{pa(t)} - B_t' R_t^{-1} B_t \hat{x}_{t|t}\right)$$
$$= \hat{x}_{t|t} + \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} \left(B_t' R_t^{-1} \hat{x}_{pa(t)} - B_t' R_t^{-1} B_t \hat{x}_{t|t}\right)$$
$$= \hat{x}_{t|t} + \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} \left(B_t' R_t^{-1} \hat{x}_{pa(t)} - B_t' R_t^{-1} \hat{x}_{pa(t)|t}\right)$$
$$= \hat{x}_{t|t} + \left(\Gamma_{t|t}^{-1} + B_t' R_t^{-1} B_t\right)^{-1} B_t' R_t^{-1} \left(\hat{x}_{pa(t)} - \hat{x}_{pa(t)|t}\right)$$
$$= \hat{x}_{t|t} + \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1} \left(\hat{x}_{pa(t)} - \hat{x}_{pa(t)|t}\right), \qquad (1.14)$$
where $\hat{x}_{pa(t)|t}$ and $\Gamma_{pa(t)|t}$ have been computed from (1.6) and (1.7), respectively, and $\hat{x}_{t|t}$ and $\Gamma_{t|t}$ have been computed from (1.10) and (1.11), respectively, in the uptree filtering step. From (1.14), the mean squared prediction error (prediction variance) of $\hat{x}_t$ can be obtained recursively by
$$\Gamma_t = \mathrm{var}(x_t - \hat{x}_t) = \mathrm{var}(x_t) - \mathrm{var}(\hat{x}_t)$$
$$= \mathrm{var}(x_t) - \mathrm{var}\left(\hat{x}_{t|t} + \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1} (\hat{x}_{pa(t)} - \hat{x}_{pa(t)|t})\right)$$
$$= \mathrm{var}(x_t) - \left\{\mathrm{var}(\hat{x}_{t|t}) + \mathrm{var}\left(\Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1} (\hat{x}_{pa(t)} - \hat{x}_{pa(t)|t})\right)\right\}$$
$$= \mathrm{var}(x_t) - \mathrm{var}(\hat{x}_{t|t}) - \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1}\, \mathrm{var}\left(\hat{x}_{pa(t)} - \hat{x}_{pa(t)|t}\right) \Gamma_{pa(t)|t}^{-1} B_t \Gamma_{t|t}$$
$$= \Gamma_{t|t} + \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1} \left\{\mathrm{var}(\hat{x}_{pa(t)|t}) - \mathrm{var}(\hat{x}_{pa(t)})\right\} \Gamma_{pa(t)|t}^{-1} B_t \Gamma_{t|t}$$
$$= \Gamma_{t|t} + \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1} \left(\Gamma_{pa(t)} - \Gamma_{pa(t)|t}\right) \Gamma_{pa(t)|t}^{-1} B_t \Gamma_{t|t}, \qquad (1.15)$$

where the second equality holds since $x_t - \hat{x}_t$ and $\hat{x}_t$ are uncorrelated, and the last equality holds since

$$\mathrm{var}(\hat{x}_{pa(t)|t}) - \mathrm{var}(\hat{x}_{pa(t)}) = \left\{\mathrm{var}(x_{pa(t)}) - \mathrm{var}(\hat{x}_{pa(t)})\right\} - \left\{\mathrm{var}(x_{pa(t)}) - \mathrm{var}(\hat{x}_{pa(t)|t})\right\} = \Gamma_{pa(t)} - \Gamma_{pa(t)|t}.$$

Note that $\Gamma_{pa(t)|t}$ and $\Gamma_{t|t}$ have been computed from (1.7) and (1.11), respectively, in the uptree filtering step. Thus, the generalized Kalman filter for tree-structured models can be carried out recursively by uptree filtering: use (1.6) through (1.11) with the initial predictors and the prediction variances of the state vectors at the leaves given by (1.4) and (1.5); follow this by downtree smoothing: use (1.14) and (1.15) to obtain the optimal predictor $\hat{x}_t$ and the prediction variance $\Gamma_t$, for all $t \in T$.

Luettgen and Willsky (1995) show that the prediction errors $\hat{x}_t - x_t$, $t \in T$, also follow a multiscale tree-structured model. That is,
$$\hat{x}_t - x_t = G_t \left(\hat{x}_{pa(t)} - x_{pa(t)}\right) + \zeta_t, \qquad t \in T \setminus \{t_0\}, \qquad (1.16)$$

where

$$G_t \equiv \Gamma_{t|t} B_t' \Gamma_{pa(t)|t}^{-1}, \qquad t \in T \setminus \{t_0\},$$

$\hat{x}_{pa(t)} - x_{pa(t)}$ and $\zeta_t$ are independent for $t \in T \setminus \{t_0\}$, and

$$\mathrm{var}(\zeta_t) = \Gamma_{t|t} - G_t \Gamma_{pa(t)|t} G_t', \qquad t \in T \setminus \{t_0\},$$

which is obtained by matching the variances implied by (1.15) and (1.16).
Notice that $G_t$, $t \in T$, can be computed in the uptree filtering step. From (1.16), for any $t, t' \in T$, we can obtain the prediction covariance between any two variables as:

$$\Gamma_{t,t'} = \mathrm{cov}(\hat{x}_t - x_t,\ \hat{x}_{t'} - x_{t'}) = \left(G_t \cdots G_{t_2} G_{t_1}\right) \mathrm{var}\left(\hat{x}_{an(t,t')} - x_{an(t,t')}\right) \left(G_{t'} \cdots G_{t'_2} G_{t'_1}\right)',$$

where $(an(t,t'), t_1, t_2, \ldots, t)$ and $(an(t,t'), t'_1, t'_2, \ldots, t')$ are the paths from $an(t,t')$ (the first common ancestor of $t$ and $t'$) to $t$ and to $t'$, respectively. In particular, we have

$$\Gamma_{t,pa(t)} = G_t \Gamma_{pa(t)}.$$
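To make the two passes concrete, here is a minimal Python sketch of the uptree filtering and downtree smoothing recursions, specialized to scalar states so that matrix inverses become divisions; the tree, the parameter values, and all function and variable names are our own illustrative assumptions rather than the article's code. A leaf has no children, so its merge step reduces to the prior and reproduces (1.4)-(1.5), and a missing observation simply skips the measurement update, as in (1.10)-(1.11) with $\delta_t = 0$.

```python
# A scalar-state sketch of the uptree/downtree generalized Kalman filter,
# following (1.4)-(1.15); tree and parameters below are illustrative only.

def tree_kalman(children, parent, root, A, C, W, V, Sigma_root, y):
    # Marginal variances Sigma_t = A_t^2 Sigma_pa(t) + W_t (forward recursion).
    Sigma = {root: Sigma_root}
    order = [root]
    for t in order:                              # 'order' ends up top-down
        for c in children.get(t, []):
            Sigma[c] = A[c] ** 2 * Sigma[t] + W[c]
            order.append(c)

    # Backward-representation coefficients B_t and noise variances R_t.
    B = {t: Sigma[parent[t]] * A[t] / Sigma[t] for t in order if t != root}
    R = {t: Sigma[parent[t]] - Sigma[parent[t]] ** 2 * A[t] ** 2 / Sigma[t]
         for t in order if t != root}

    xf, Gf = {}, {}    # filtered x_hat_{t|t} and Gamma_{t|t}
    xp, Gp = {}, {}    # x_hat_{pa(t)|t} and Gamma_{pa(t)|t} from (1.6)-(1.7)
    for t in reversed(order):                    # leaves first (uptree)
        kids = children.get(t, [])
        # Merge the children's messages, (1.8)-(1.9).
        Ginv = 1.0 / Sigma[t] + sum(1.0 / Gp[c] - 1.0 / Sigma[t] for c in kids)
        Gm = 1.0 / Ginv
        xm = Gm * sum(xp[c] / Gp[c] for c in kids)
        # Measurement update (1.10)-(1.11); skipped when y_t is missing.
        if y.get(t) is not None:
            gain = Gm * C[t] / (C[t] ** 2 * Gm + V[t])
            xf[t] = xm + gain * (y[t] - C[t] * xm)
            Gf[t] = Gm - gain * C[t] * Gm
        else:
            xf[t], Gf[t] = xm, Gm
        if t != root:                            # message to the parent
            xp[t] = B[t] * xf[t]
            Gp[t] = B[t] ** 2 * Gf[t] + R[t]

    xs, Gs = {root: xf[root]}, {root: Gf[root]}  # smoothed values
    for t in order:                              # root first (downtree)
        if t == root:
            continue
        J = Gf[t] * B[t] / Gp[t]                 # gain in (1.14)
        xs[t] = xf[t] + J * (xs[parent[t]] - xp[t])          # (1.14)
        Gs[t] = Gf[t] + J ** 2 * (Gs[parent[t]] - Gp[t])     # (1.15)
    return xs, Gs

# Toy binary tree with a missing observation at node 2.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
ones = {t: 1.0 for t in range(1, 7)}
xs, Gs = tree_kalman(children, parent, 0, ones,
                     {t: 1.0 for t in range(7)}, ones,
                     {t: 0.5 for t in range(7)}, 1.0,
                     {0: 0.9, 1: 1.1, 2: None, 3: 0.7, 4: 1.3, 5: 0.2, 6: 0.4})
```

For vector states, the divisions are replaced by matrix inverses and transposes, but the flow of messages up and down the tree is identical.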
3 Graphical Markov Models

The results in Section 2 demonstrate how recursive filtering yields optimal prediction of the $x$-process at all resolutions of the tree. Some data structures will dictate the use of acyclic directed graphical models, which represent a generalization of the tree-structured models. This extra flexibility hints at graphical models' usefulness in command and control problems.

Graphical Markov models, first introduced by Wright (1921, 1934), use graphs to represent possible dependencies among random variables. Graphs can be either undirected (edges represented by lines), directed (edges represented by arrows), or a mixture of the two. In graphical Markov models, the nodes of the graph represent random variables, and the edges of the graph characterize certain Markov properties (conditional-independence relations). These models have been applied in a wide variety of areas, including undirected graphical models (also known as Markov random fields) in spatial statistics (Besag, 1974; Isham, 1981; Cressie, 1993), and acyclic directed graphical models in image analysis (Davidson and Cressie, 1993; Fieguth et al., 1995; Fosgate et al., 1997; Cressie and Davidson, 1998), categorical data analysis (Darroch et al., 1980; Edwards and Kreiner, 1983; Wermuth and Lauritzen, 1983), influence diagrams (Howard and Matheson, 1981; Shachter, 1986; Smith, 1989), and probabilistic expert systems (Lauritzen and Spiegelhalter, 1988; Pearl, 1988; Spiegelhalter et al., 1993). The books by Pearl (1988), Whittaker (1990), Edwards (1995), and Lauritzen (1996) are good references for graphical Markov models and their applications.

In this section, we shall introduce graphs and graphical Markov models. The emphasis throughout is on Gaussian graphical models, which are also called covariance-selection models (Dempster, 1972). We shall develop a fast optimal-prediction algorithm for Gaussian graphical models.
3.1 Graphs
A graph $G$ is made up of the pair $(V, E)$, where $V$ denotes a finite set of nodes (or vertices), and $E \subseteq V \times V$ denotes the set (possibly empty) of edges between two distinct nodes. An edge $(v, v') \in E$ such that $(v', v) \in E$ is called an undirected edge and is represented as a line between $v$ and $v'$ in our figures. An edge $(v, v') \in E$ such that $(v', v) \notin E$ is called a directed edge and is represented by an arrow from $v$ to $v'$ in our figures. A subset $A \subseteq V$ induces a subgraph $G_A = (A, E_A)$, where $E_A = E \cap (A \times A)$.

For both directed and undirected graphs, we define a path of length $k$ from $v_0$ to $v_k$ to be a sequence of nodes $v_0, v_1, \ldots, v_k$, $k \geq 1$, such that $(v_i, v_{i+1})$ is an edge for each $i = 0, \ldots, k-1$. A directed path $v_0, v_1, \ldots, v_k$ is a path such that $(v_i, v_{i+1})$ is a directed edge for each $i = 0, \ldots, k-1$. A cycle of length $k$ in both directed and undirected graphs is a path $(v_0, v_1, \ldots, v_k)$ such that $v_0 = v_k$. A graph $G$ is said to be connected if there is a path between any pair of nodes. If $E$ contains only undirected edges, then the graph $G$ is called an undirected graph. If $E$ contains only directed edges, then the graph $G$ is called a directed graph. An acyclic directed graph is a directed graph that has no cycles in it. A graph with both directed and undirected edges but with no directed cycle is called a chain graph. We shall consider only undirected graphs and acyclic directed graphs.

Two nodes $v, v' \in V$ are said to be adjacent (or neighbors) if either $(v, v') \in E$ or $(v', v) \in E$. For $A \subseteq V$, we define the boundary of $A$ to be

$$bd(A) \equiv \{v' \in V \setminus A : (v, v') \in E \text{ or } (v', v) \in E, \text{ where } v \in A\}.$$
The closure of $A \subseteq V$ is $cl(A) \equiv A \cup bd(A)$. For a directed edge $(v, v')$, $v$ is said to be a parent of $v'$, and $v'$ is said to be a child of $v$. A node $v$ of a directed graph is said to be terminal if it has no children. We say that $v'$ is a descendant of $v$ if there exists a directed path from $v$ to $v'$. An undirected tree is a connected undirected graph with no cycles. A directed tree is a connected acyclic directed graph such that there is a node $v_0 \in V$ that has no parent, called the root, and each $v \in V \setminus \{v_0\}$ has exactly one parent node. Note that, given an undirected tree, we can construct a directed tree by first choosing any node of the tree as a root, and directing all edges away from this root.

A graphical Markov model is defined by a collection of conditional-independence relations among a set of random variables (vectors) in the form of a graph, where the set of random variables (vectors) is indexed by the nodes of the graph. For example, the tree-structured models described in Section 2 are a special class of graphical Markov models. The generalized-Kalman-filter algorithm we shall derive does not work directly on the graph, but on an associated tree, called a junction tree. Each node on a junction tree contains clusters of variables obtained from the graphical structure.
Let $G = (V, E)$ be a graph. We consider a (multivariate) stochastic process $\{x_v : v \in V\}$ with $V$ as an index set. For $A \subseteq V$, denote $x_A \equiv \{x_v : v \in A\}$. For $A, B \subseteq V$, denote $p(x_A)$ as the probability density of $x_A$, and denote $p(x_A \mid x_B)$ as the conditional probability density of $x_A$ given $x_B$. For disjoint sets $A, B, C \subseteq V$, denote the conditional-independence relation, $x_A$ conditionally independent of $x_B$, given $x_C$, as $x_A \perp x_B \mid x_C$; that is,

$$p(x_A, x_B \mid x_C) = p(x_A \mid x_C)\, p(x_B \mid x_C).$$

Without loss of generality, we assume throughout the section that $G$ is connected, since all the results can likewise be applied to a disconnected graph by applying them successively to each connected component.

A graph is complete if all pairs of distinct nodes are joined by a (directed or undirected) edge. A subset is complete if it induces a complete subgraph. A complete subset that is not a proper (genuine) subset of any other complete subset is called a clique. For $A, B, S \subseteq V$, we say $S$ separates $A$ and $B$ if all paths from an element $a \in A$ to an element $b \in B$ intersect $S$. A pair $(A, B)$ of nonempty subsets of $V = A \cup B$ is said to form a decomposition of $G$ if $A \cap B$ is complete, and $A \cap B$ separates $A \setminus B$ and $B \setminus A$. If the sets $A$ and $B$ are both proper subsets of $V$, the decomposition is proper. An undirected graph $G$ is said to be decomposable if it is complete, or if there exists a proper decomposition $(A, B)$ into decomposable subgraphs $G_A$ and $G_B$.

A chord in a cycle is an edge connecting two nonconsecutive nodes in the cycle. An undirected graph is chordal (or triangulated) if every cycle of length greater than three has a chord. A well-known result states that an undirected graph is decomposable if and only if it is chordal (see Golumbic, 1980; Lauritzen et al., 1984; Whittaker, 1990). For an acyclic directed graph $G$, we define its moral graph $G^m$ as the undirected graph obtained from $G$ by "marrying" parents, that is, by placing undirected edges between all parents of each node and then dropping the directions from all directed edges.
3.2 Graphical Markov Models Over Undirected Graphs

Let $G = (V, E)$ be an undirected graph, and $x_V \equiv \{x_v : v \in V\}$ be a stochastic process with index set $V$. Associated with this graphical structure, there are at least three Markov properties for $x_V$.

Definition 1 A stochastic process $x_V \equiv \{x_v : v \in V\}$ on an undirected graph $G = (V, E)$ is said to have the local Markov property if, for any $v \in V$,

$$x_v \perp x_{V \setminus cl(v)} \mid x_{bd(v)};$$
it is said to have the pairwise Markov property if, for any nonadjacent nodes $(v_1, v_2)$,

$$x_{v_1} \perp x_{v_2} \mid x_{V \setminus \{v_1, v_2\}};$$

it is said to have the global Markov property if, for any disjoint subsets $A, B, S \subseteq V$ such that $S$ separates $A$ from $B$ in $G$,

$$x_A \perp x_B \mid x_S.$$
We call $x_V$ a graphical Markov model over $G$ if any one of the Markov properties above is satisfied. Note that, by the definition above, it is clear that the global Markov property implies the local Markov property, and the local Markov property implies the pairwise Markov property. Though the three Markov properties are in general different (some discussion can be found in Speed, 1979), and thus may lead to different graphical Markov models, they are equivalent under quite general conditions. For example, the following proposition is due to Lauritzen et al. (1990).

Proposition 1 Suppose that $x_V$ is a stochastic process with index set $V$. Let $G = (V, E)$ be an undirected graph, and let $p(x_V)$ be the joint probability density of $x_V$ with respect to a product measure. Then $x_V$ satisfies the global Markov property over $G$ if $p(x_V)$ factorizes, namely

$$p(x_V) = \prod_{A \in \mathcal{A}} \psi_A(x_A),$$

for some nonnegative functions $\{\psi_A : A \in \mathcal{A}\}$, where $\mathcal{A}$ is a collection of complete subsets of $V$.

Note that a result known as the Hammersley-Clifford theorem (Hammersley and Clifford, 1971) shows that when $x_V$ satisfies the positivity condition (i.e., the support of $x_V$ is the Cartesian product of the supports of the individual components of $x_V$), any one of the three Markov properties implies that $p(x_V)$ factorizes. Therefore, the three Markov properties are equivalent under the positivity condition. Since the Gaussian processes we are focusing on obviously satisfy the positivity condition, we shall consider only the global Markov property throughout Section 3.

A graphical model over an undirected graph $G$ is called a decomposable graphical model (Frydenberg and Lauritzen, 1989) if $G$ is decomposable. This class of models is crucial for the rest of this section, and is the key to deriving a fast optimal-prediction algorithm for Gaussian graphical models. An important structure associated with computational aspects of decomposable graphs is a junction tree, defined as follows:

Definition 2 Let $H_J$ denote the set of cliques of an undirected graph $G = (V, E)$. Then $T_J = (H_J, E_J)$ is called a junction tree for $G$, where $E_J$ is the
set of edges of $T_J$, if $T_J$ is an undirected tree and $T_J$ satisfies the clique-intersection property (i.e., for every two cliques $K, K' \in H_J$, all cliques of the undirected tree on the path between the clique $K$ and the clique $K'$ contain all the elements of $K \cap K'$).
Buneman (1974) and Gavril (1974) independently proved that a connected undirected graph $G$ is decomposable if and only if there exists a junction tree for $G$ (i.e., there is an associated tree $T_J = (H_J, E_J)$ for which the clique-intersection property holds). Pearl (1988) showed that, given a junction-tree representation for a decomposable graphical model $\{x_v : v \in V\}$, the joint probability density can be written as

$$p(x_V) = \frac{\prod_{K_i \in H_J} p(x_{K_i})}{\prod_{(K_i, K_j) \in E_J} p(x_{K_i \cap K_j})}. \qquad (1.17)$$

For any two cliques $K_i, K_j \in H_J$ such that $(K_i, K_j) \in E_J$, their intersection $K_i \cap K_j$ is called their separator.
Though many undirected graphs of interest may not be decomposable, one may always identify a decomposable cover $G'$ of an undirected graph $G$ by the addition of appropriate edges, called a filling-in. For example, the complete graph is a decomposable cover for any undirected graph. By filling in new edges, we may obscure certain conditional independencies; however, because filling-in does not introduce any new conditional-independence relations, the underlying probability distribution is unaffected. A filling-in is said to be minimal if it contains a minimal number of edges. Tarjan and Yannakakis (1984) give various methods for testing whether an undirected graph is decomposable. Algorithms for finding minimal decompositions can be found in Tarjan (1985) and Leimer (1993). A small sketch of one standard (non-minimal) filling-in strategy is given below.
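The following rough Python sketch is our own illustration (the function names and the min-degree elimination heuristic are assumptions, not the article's prescription, and the heuristic is not guaranteed to give a minimal filling-in). It eliminates vertices one at a time, filling in edges among the not-yet-eliminated neighbours of each eliminated vertex; the filled-in graph is chordal, and the maximal recorded clusters are its cliques.

```python
# Produce a decomposable (chordal) cover by vertex elimination.

def triangulate(adj):
    """adj: dict mapping node -> set of neighbours (undirected graph)."""
    adj = {v: set(nb) for v, nb in adj.items()}
    remaining = set(adj)
    fill_in, clusters = set(), []
    while remaining:
        # min-degree heuristic: a cheap (not minimal) fill-in choice
        v = min(remaining, key=lambda u: len(adj[u] & remaining))
        nbrs = adj[v] & remaining
        clusters.append({v} | nbrs)
        for a in nbrs:                      # marry v's remaining neighbours
            for b in nbrs:
                if a != b and b not in adj[a]:
                    adj[a].add(b); adj[b].add(a)
                    fill_in.add(frozenset((a, b)))
        remaining.remove(v)
    # maximal clusters are exactly the cliques of the chordal cover
    cliques = [c for c in clusters if not any(c < d for d in clusters)]
    return fill_in, cliques

# 4-cycle a-b-c-d: not chordal; one filled-in edge makes it decomposable.
adj = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'a', 'c'}}
extra, cliques = triangulate(adj)
print(sorted(map(sorted, cliques)), [sorted(e) for e in extra])
```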
3.3 Graphical Markov Models Over Acyclic Directed Graphs

Consider the same setup as in Section 3.2, except that now $G = (V, E)$ is assumed to be an acyclic directed graph, where $V$ is the set of nodes and $E$ is the set of directed edges.
Definition 3 We say that $x_V$ is a graphical Markov model over an acyclic directed graph $G = (V, E)$ if each $x_v$, $v \in V$, is conditionally independent of its nondescendants given its parents, where the nondescendants of $v$ are those $v' \in V$ such that there is no directed path from $v$ to $v'$.
Note that acyclic directed graphical models are also referred to in the literature as recursive graphical models, Bayesian networks, belief networks, causal networks, influence diagrams, causal Markov models, and directed
Markov fields. In the spatial context, these models are called partially ordered Markov models (Cressie and Davidson, 1998; Davidson, Cressie, and Hua, 1999). Assume that $p(x_V)$ is the joint probability density of $x_V$ with respect to a product measure. It follows that

$$p(x_V) = p(x_{V_0}) \prod_{v \in V \setminus V_0} p(x_v \mid x_{pa(v)}), \qquad (1.18)$$
where $V_0 \equiv \{v : pa(v) = \emptyset\}$ is the set of minimal elements. The following lemma enables us to transform an acyclic directed graphical model into an undirected graphical model.

Lemma 1 Suppose that $x_V$ is a graphical Markov model over an acyclic directed graph $G$. Then $x_V$ satisfies the global Markov property over the moral graph $G^m$.

Proof. The result follows directly from Proposition 1 and (1.18), since for each $v \in V$, the set $\{v\} \cup pa(v)$ is complete in $G^m$. □

Note that, by filling in appropriate undirected edges, the moral graph $G^m$ can then be converted into a decomposable graph $G'$, as described at the end of Section 3.2. Therefore, we can make any acyclic directed graph decomposable by marrying parents, making all directed edges undirected, and adding new undirected edges.
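In code, moralization is only a few lines. The sketch below is our own helper (the node names echo Figure 3 but are otherwise arbitrary): it adds an undirected edge for every arrow and then marries the parents of each node.

```python
# Form the moral graph G^m of an acyclic directed graph.

def moralize(parents):
    """parents: dict mapping node -> list of its parent nodes in the DAG."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for v, ps in parents.items():
        for p in ps:                        # undirected version of each arrow
            adj[v].add(p); adj[p].add(v)
        for i, p in enumerate(ps):          # 'marry' the parents pairwise
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    return adj

# x5 has two parents, c2 and c3, which the moral graph joins by an edge.
print(moralize({'x5': ['c2', 'c3'], 'c2': ['b1'], 'c3': ['b1', 'b2']}))
```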
3.4 Generalized Kalman Filter for Gaussian Graphical Models

We shall now derive the generalized-Kalman-filter algorithm for Gaussian graphical models over an undirected graph or an acyclic directed graph. Recall from Section 3.2 and Section 3.3 that any graphical model over an undirected graph or an acyclic directed graph $G$ can be represented as a graphical model over a decomposable graph $G'$, which is defined at the end of Section 3.2. Also, we have the following proposition:

Proposition 2 Let $G = (V, E)$ be a connected decomposable graph and $T_J = (H_J, E_J)$ be a junction tree for $G$. Suppose that $x_V \equiv \{x_v : v \in V\}$ satisfies the global Markov property over $G$. Then the stochastic process $\{x_H : H \in H_J\}$ satisfies the global Markov property over $T_J$, where $x_H \equiv \{x_v : v \in H\}$.

Proof. Let $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{S}$ be disjoint subsets of $H_J$ such that $\mathcal{S}$ separates $\mathcal{A}$ from $\mathcal{B}$ in $T_J$. Define $A \equiv \{v \in H : H \in \mathcal{A}\}$, $B \equiv \{v \in H : H \in \mathcal{B}\}$, and $S \equiv \{v \in H : H \in \mathcal{S}\}$. By Definition 1, it is enough to show that $S$ separates $A \setminus S$ from $B \setminus S$ in $V$. Suppose that $S$ does not separate $A \setminus S$ from $B \setminus S$ in $V$. Then there exists a path $(v_0, v_1, \ldots, v_k)$ in $G$ with $v_0 \in A \setminus S$, $v_k \in B \setminus S$, and $v_i \in V \setminus S$, $i = 1, \ldots, k-1$. Hence, we can find $H_1 \in \mathcal{A}$, $H_k \in \mathcal{B}$,
and $H_i \in H_J \setminus \mathcal{S}$, $i = 2, \ldots, k-1$, such that $\{v_{i-1}, v_i\} \subseteq H_i$, $i = 1, \ldots, k$. For each $i = 2, \ldots, k$, since $(H_{i-1} \cap H_i) \setminus S \neq \emptyset$, it follows from the clique-intersection property that the path from $H_{i-1}$ to $H_i$ in $T_J$ does not intersect $\mathcal{S}$. Therefore, the path from $H_1 \in \mathcal{A}$ to $H_k \in \mathcal{B}$ does not intersect $\mathcal{S}$. This contradicts the assumption that $\mathcal{S}$ separates $\mathcal{A}$ from $\mathcal{B}$ in $T_J$. □

By choosing a node $H_0$ of the tree $T_J$ as a root, and directing all edges away from this root, we obtain a directed tree $T_D = (H_J, E_D)$, where $E_D$ is the set of the corresponding directed edges of $E_J$. Since, from Proposition 2, $\{x_H : H \in H_J\}$ satisfies the global Markov property over $T_J$, it follows from Definition 3 that $\{x_H : H \in H_J\}$ is a graphical Markov model over the directed tree $T_D$.

Let $G = (V, E)$ be a connected decomposable graph and let $x_V \equiv \{x_v : v \in V\}$ be a zero-mean Gaussian graphical model over $G$; that is, $x_V$ is a Gaussian process with index set $V$, and it satisfies the global Markov property over $G$ (Definition 1). We assume that the data are observed, perhaps incompletely, according to

$$y_v = C_v x_v + \varepsilon_v, \qquad v \in V, \qquad (1.19)$$

where $\{C_v : v \in V\}$ are deterministic matrices, and $\{\varepsilon_v : v \in V\}$ is a zero-mean Gaussian white-noise process independent of $x_V$. For each $H \in H_J$, denote $pa(H)$ as the parent of $H$ in $T_D$. Let $y_H \equiv \{y_v : v \in H\}$ and $\varepsilon_H \equiv \{\varepsilon_v : v \in H\}$. We have,

$$x_H = A_H x_{pa(H)} + \eta_H, \qquad H \in H_J \setminus \{H_0\}, \qquad (1.20)$$
$$y_H = C_H x_H + \varepsilon_H, \qquad H \in H_J, \qquad (1.21)$$

for some matrices $\{A_H\}$ and $\{C_H\}$, where (1.21) is obtained from (1.19), and $\{\eta_H : H \in H_J \setminus \{H_0\}\}$ are zero-mean, independent Gaussian vectors with

$$\mathrm{var}(\eta_H) = \mathrm{var}(x_H \mid x_{pa(H)}), \qquad H \in H_J \setminus \{H_0\}.$$

We therefore obtain a Gaussian tree-structured Markov model given by (1.20) and (1.21). Thus, the generalized-Kalman-filter algorithm constructed in Section 2 can now be used to compute the optimal predictors and their prediction variances for the state vectors $\{x_H : H \in H_J\}$. In particular, we obtain the optimal predictors and their prediction variances for the state vectors $\{x_v : v \in V\}$.

For computational convenience, nodes that have only one child on the directed tree $T_D$ can be removed to simplify the tree structure without affecting the global Markov property. When a node $H$ is removed, the edges between $H$ and its parent and between $H$ and its children are also removed, and new directed edges from the parent of $H$ to each child of $H$ are added. Moreover, for a terminal node $H$ on the directed tree $T_D$, we can reduce the size of $H$ by replacing $H$ by $H \setminus pa(H)$. In both cases, the global Markov property of $\{x_H : H \in H_J\}$ over $T_J$ guarantees that we still have a tree-structured model over the new reduced directed tree $T_R$.
Note that nodes can be removed only if no information is lost. A node $H$ with no data observed can also be removed if we are not interested in obtaining the optimal predictor for $x_H$, or if we can obtain the optimal predictor for $x_H$ through other nodes.
3.5 Example
We consider a Gaussian graphical model over an acyclic directed graph $G$, shown in Figure 3(a). We assume that the data are observed only at the terminal nodes by
$$y_k = x_k + \varepsilon_k, \qquad k = 1, \ldots, 15,$$

where $\{y_1, \ldots, y_{15}\}$ are observed data, and $\{\varepsilon_1, \ldots, \varepsilon_{15}\}$ is a Gaussian white-noise process independent of $\{x_1, \ldots, x_{15}\}$. The goal is to predict $\{x_1, \ldots, x_{15}\}$ based on the data $\{y_1, \ldots, y_{15}\}$. The corresponding moral graph $G^m$ is shown in Figure 3(b). A decomposable graph $G'$, obtained by filling in certain undirected edges from $G^m$, is shown in Figure 4(a). A junction tree $T_J$ composed of cliques of $G'$ is shown in Figure 4(b). Finally, a reduced directed tree $T_R$, to which the generalized Kalman filter will be applied, is shown in Figure 4(c). A more general Gaussian graphical model with this acyclic directed graphical structure will be described in the next section.
FIGURE 3. (a) An acyclic directed graph $G$; (b) the corresponding moral graph $G^m$.

FIGURE 4. (a) A decomposable graph $G'$ obtained from $G^m$ in Figure 3(b); (b) a corresponding junction tree $T_J$ of $G'$; (c) a corresponding reduced directed tree $T_R$.

4 Multiscale Graphical Models

In Section 2, we introduced tree-structured models with natural multiscale structures. The main advantage of these models is that associated with them there is a fast optimal-prediction algorithm that can be used to handle large amounts of data. However, to represent a time series or a spatial process by the variables in the finest level of a tree, these models are somewhat restricted in that they may enforce certain discontinuities in covariance structures (see Figure 2). That is, two points that are adjacent in time or space may or may not have the same parent node on the tree, and even their parents may come from different parents. These discontinuities sometimes lead to blocky artifacts in the resulting predicted values. To fix this problem, we propose more general multiscale models based on an acyclic directed graphical structure.

We consider a series of hidden zero-mean Gaussian stochastic processes $\{x_{j,k} : k \in \mathbb{Z}^d\}$ with scales $j = J_0, \ldots, J$. The stochastic processes evolve in a Markov manner from coarser (smaller $j$) to finer (larger $j$) scales based on
$$x_{j,k} = \sum_{l \in M_{j,k}} a_{j,k,l}\, x_{j-1,l} + w_{j,k}, \qquad j = J_0+1, \ldots, J, \quad k \in \mathbb{Z}^d, \qquad (1.22)$$

where $\{M_{j,k}\}$ are finite subsets of $\mathbb{Z}^d$, $\{a_{j,k,l}\}$ are constants,

$$x_{J_0,k} \sim N(0, \sigma_{J_0}^2), \quad \text{independently for } k \in \mathbb{Z}^d,$$
$$w_{j,k} \sim N(0, \sigma_j^2), \quad \text{independently for } j = J_0+1, \ldots, J,\ k \in \mathbb{Z}^d,$$

and $\{x_{J_0,k} : k \in \mathbb{Z}^d\}$ and $\{w_{j,k} : j = J_0+1, \ldots, J,\ k \in \mathbb{Z}^d\}$ are independent. Let $D_J$ be a finite subset of $\mathbb{Z}^d$, and let

$$D_{j-1} \equiv \{l : ((j-1, l), (j, k)) \in F,\ k \in D_j\}, \qquad j = J_0+1, \ldots, J,$$

where

$$F \equiv \{((j-1, l), (j, k)) : j = J_0+1, \ldots, J,\ l \in M_{j,k},\ k, l \in \mathbb{Z}^d\}.$$

For each scale $j = J_0, \ldots, J$, we assume that the data are observed (perhaps incompletely) according to

$$y_{j,k} = x_{j,k} + \varepsilon_{j,k}, \qquad k \in D_j, \qquad (1.23)$$
where $\{\varepsilon_{j,k}\}$ is a zero-mean Gaussian white-noise process, independent of $\{x_{j,k}\}$, with variance $\sigma_\varepsilon^2$. Define, for $k \in D_j$, $j = J_0, \ldots, J$,

$$\delta_{j,k} \equiv \begin{cases} 1, & \text{if } y_{j,k} \text{ is observed}; \\ 0, & \text{otherwise}. \end{cases}$$

The goal is to predict $\{x_{j,k} : k \in D_j,\ j = J_0, \ldots, J\}$ based on the data $Y \equiv \{y_{j,k} : \delta_{j,k} = 1\}$. Notice that if each $M_{j,k}$ contains only a single element, we obtain a tree-structured model. Write

$$X_j \equiv (x_{j,k} : k \in D_j)', \qquad j = J_0, \ldots, J;$$
$$X \equiv (X_{J_0}', \ldots, X_J')';$$
$$w_j \equiv (w_{j,k} : k \in D_j)', \qquad j = J_0+1, \ldots, J;$$
$$\varepsilon \equiv (\varepsilon_{j,k} : \delta_{j,k} = 1)'.$$

Then, (1.22) and (1.23) can be written as

$$X_j = A_j X_{j-1} + w_j, \qquad j = J_0+1, \ldots, J, \qquad (1.24)$$
$$Y = C X + \varepsilon, \qquad (1.25)$$

for some matrices $A_{J_0+1}, \ldots, A_J$ and for some matrix $C$. Note that from (1.24) and (1.25), we have the following representation:

$$X_j = (A_j \cdots A_{J_0+1}) X_{J_0} + (A_j \cdots A_{J_0+2})\, w_{J_0+1} + \cdots + A_j w_{j-1} + w_j,$$

for $j = J_0+1, \ldots, J$. Therefore, for $j = J_0+1, \ldots, J$,

$$\mathrm{var}(X_j) = \sigma_{J_0}^2 (A_j \cdots A_{J_0+1})(A_j \cdots A_{J_0+1})' + \sigma_{J_0+1}^2 (A_j \cdots A_{J_0+2})(A_j \cdots A_{J_0+2})' + \cdots + \sigma_{j-1}^2 A_j A_j' + \sigma_j^2 I. \qquad (1.26)$$
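Since $w_j$ in (1.24) is independent of $X_{j-1}$, the expansion (1.26) is equivalent to the one-step recursion $\mathrm{var}(X_j) = A_j\, \mathrm{var}(X_{j-1})\, A_j' + \sigma_j^2 I$. The brief numerical sketch below is our own check (arbitrary square matrices of a common dimension are a simplifying assumption; in general the $A_j$ are rectangular).

```python
# Check that the expansion (1.26) matches the one-step recursion
# var(X_j) = A_j var(X_{j-1}) A_j' + sigma_j^2 I implied by (1.24).
import numpy as np

rng = np.random.default_rng(0)
J0, J, n = 0, 3, 4
sigma2 = [2.0, 1.5, 1.0, 0.5]                 # sigma_{J0}^2, ..., sigma_J^2
A = {j: rng.normal(size=(n, n)) for j in range(J0 + 1, J + 1)}

V = sigma2[0] * np.eye(n)                     # var(X_{J0})
for j in range(J0 + 1, J + 1):                # one-step recursion
    V = A[j] @ V @ A[j].T + sigma2[j] * np.eye(n)

prod = np.eye(n)                              # expansion (1.26) for j = J
expansion = sigma2[J] * np.eye(n)
for j in range(J, J0, -1):                    # accumulate A_J ... A_j
    prod = prod @ A[j]
    expansion += sigma2[j - 1] * prod @ prod.T
print(np.allclose(V, expansion))              # True
```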
4.1 Multiscale Graphical Models for One-Dimensional Processes
We give here two examples of multiscale graphical models for one-dimensional processes. For the first example, we consider a multiscale graphical model with the hidden process given by
$$x_{j,2k} = x_{j-1,k} + w_{j,2k}, \qquad j = J_0+1, \ldots, J, \quad k \in \mathbb{Z}, \qquad (1.27)$$
$$x_{j,2k+1} = \frac{1}{1+\eta_j}\left(x_{j-1,k} + \eta_j\, x_{j-1,k+1}\right) + w_{j,2k+1}, \qquad j = J_0+1, \ldots, J, \quad k \in \mathbb{Z}, \qquad (1.28)$$
where $\eta_j > 0$, $j = J_0+1, \ldots, J$, are constants. The associated acyclic directed graph of this model is shown in Figure 5. Note that because each node has more than one parent, this is not a tree. If we assume that the data are observed (perhaps incompletely) according to (1.23), then we can apply the generalized-Kalman-filter algorithm (Section 3.4) to these models to compute the optimal predictors of $\{x_{j,k}\}$. Recall that in Section 3.5, we gave an example of a Gaussian graphical model whose graph structure is precisely that implied by (1.27) and (1.28), for construction of the algorithm.

FIGURE 5. Graphical representation of a multiscale graphical model.

For the second example, we consider a multiscale graphical model with the hidden process given by

$$x_{j,2k} = \frac{1}{1+\eta_j}\left(x_{j-1,k-1} + \eta_j\, x_{j-1,k}\right) + w_{j,2k}, \qquad j = J_0+1, \ldots, J, \quad k \in \mathbb{Z}, \qquad (1.29)$$
$$x_{j,2k+1} = \frac{1}{1+\eta_j}\left(\eta_j\, x_{j-1,k} + x_{j-1,k+1}\right) + w_{j,2k+1}, \qquad j = J_0+1, \ldots, J, \quad k \in \mathbb{Z}, \qquad (1.30)$$

where $\eta_j > 0$, $j = J_0+1, \ldots, J$, are constants. The associated acyclic directed graph of this model is shown in Figure 6. For $J_0 = 0$, $J = 6$, $\sigma_0^2 = 10$, $\sigma_1^2 = 8$, $\sigma_2^2 = 6$, $\sigma_3^2 = 4$, $\sigma_4^2 = 3$, $\sigma_5^2 = 2$, $\sigma_6^2 = 1$, and $\eta_1 = \eta_2 = \cdots = \eta_5 = 2$, the correlation function of the latent process,

$$f(t,s) \equiv \mathrm{corr}(x_{6,t}, x_{6,s}), \qquad t, s = 0, 1, \ldots, 127,$$

constructed from the two nodes $x_{0,0}$ and $x_{0,1}$ at the coarsest scale, is shown in Figure 7. The correlation function is not stationary, but the figure shows that it is nearly so. Notice there is no blocky structure of the sort seen in Figure 2, where the correlation function for a tree-structured model was given.
FIGURE 6. Graphical representation of a multiscale graphical model.
FIGURE 7. Correlation function $f(t,s) \equiv \mathrm{corr}(x_{6,t}, x_{6,s})$ of $(x_{6,0}, x_{6,1}, \ldots, x_{6,127})'$ for the multiscale graphical model given by (1.29) and (1.30).
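The smooth behavior in Figure 7 can be reproduced by direct simulation. The sketch below is our own Monte Carlo illustration (the names and the replication count are assumptions): it simulates (1.29)-(1.30) with the parameter values above, padding each coarser scale by one index on each side so that every required neighbor exists, and then estimates $f(t,s)$.

```python
# Monte Carlo estimate of f(t, s) for the model (1.29)-(1.30).
import numpy as np

rng = np.random.default_rng(1)
J, eta = 6, 2.0
sigma2 = [10.0, 8.0, 6.0, 4.0, 3.0, 2.0, 1.0]
nrep = 20000

lo, hi = {J: 0}, {J: 127}                     # index range kept at scale j
for j in range(J, 0, -1):                     # pad one node on each side
    lo[j - 1], hi[j - 1] = lo[j] // 2 - 1, hi[j] // 2 + 1

x = rng.normal(scale=np.sqrt(sigma2[0]), size=(nrep, hi[0] - lo[0] + 1))
for j in range(1, J + 1):
    cols = []
    for k in range(lo[j], hi[j] + 1):
        km, odd = divmod(k, 2)                # coarse-scale parent index
        p = lambda i: x[:, i - lo[j - 1]]     # column at coarse index i
        if odd:                               # (1.30)
            m = (eta * p(km) + p(km + 1)) / (1 + eta)
        else:                                 # (1.29)
            m = (p(km - 1) + eta * p(km)) / (1 + eta)
        cols.append(m)
    x = np.column_stack(cols) + rng.normal(
        scale=np.sqrt(sigma2[j]), size=(nrep, hi[j] - lo[j] + 1))

f = np.corrcoef(x, rowvar=False)              # estimated f(t, s), 128 x 128
print(round(f[63, 64], 2), round(f[0, 63], 2))
```

In contrast to the tree-structured example, the estimated correlation of the adjacent pair $t = 63$, $s = 64$ stays high: the extra parents smooth over the old block boundary.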
Figure 8 shows several correlation functions of this model with $J_0 = 0$, $J = 6$, and $\eta_1 = \eta_2 = \cdots = \eta_5 = 2$, where each is generated from only one scale (i.e., only one of $\sigma_0^2, \sigma_1^2, \ldots, \sigma_6^2$ is nonzero). Note that, from (1.26), a linear combination of these correlation functions, plus the white-noise correlation function, generates the correlation function shown in Figure 7. It is easy to see from these multiscale features how different covariance structures can be constructed from this model.
4.2 Multiscale Graphical Models for Two-Dimensional Processes
Consider a two-dimensional multiscale graphical model with the corresponding acyclic directed graph having a graphical structure illustrated in Figure 9. From the scale $J_0$ to the scale $J_1$, a non-tree acyclic directed graph is assumed (Figure 9(a)), but from the scale $J_1+1$ to the scale $J$, a quadtree structure (Figure 9(b)) is assumed, where $J_0 \leq J_1 \leq J$. Specifically, for $j = J_0+1, \ldots, J_1$, $k_1, k_2 \in \mathbb{Z}$, we assume
$$x_{j,2k_1,2k_2} = x_{j-1,k_1,k_2} + w_{j,2k_1,2k_2},$$
$$x_{j,2k_1+1,2k_2} = \frac{1}{1+\eta_{j,1}}\left(x_{j-1,k_1,k_2} + \eta_{j,1}\, x_{j-1,k_1+1,k_2}\right) + w_{j,2k_1+1,2k_2},$$
$$x_{j,2k_1,2k_2+1} = \frac{1}{1+\eta_{j,2}}\left(x_{j-1,k_1,k_2} + \eta_{j,2}\, x_{j-1,k_1,k_2+1}\right) + w_{j,2k_1,2k_2+1},$$
$$x_{j,2k_1+1,2k_2+1} = \frac{1}{1+\eta_{j,3}+\eta_{j,4}+\eta_{j,5}}\left(x_{j-1,k_1,k_2} + \eta_{j,3}\, x_{j-1,k_1+1,k_2} + \eta_{j,4}\, x_{j-1,k_1,k_2+1} + \eta_{j,5}\, x_{j-1,k_1+1,k_2+1}\right) + w_{j,2k_1+1,2k_2+1},$$

and for $j = J_1+1, \ldots, J$, $k_1, k_2 \in \mathbb{Z}$, we assume

$$x_{j,k_1,k_2} = x_{j-1,[k_1/2],[k_2/2]} + w_{j,k_1,k_2},$$

where $\eta_{j,m} > 0$, $j = J_0+1, \ldots, J_1$, $m = 1, \ldots, 5$, are constants. Recall that the data are observed (perhaps incompletely) according to (1.23).

Note that the generalized-Kalman-filter algorithm (Section 3.4) for a two-dimensional process is usually difficult to implement, except for a tree-structured model. However, by treating all the variables at each scale, from the scale $J_0$ to the scale $J_1$, as a single node on the graph, we obtain a tree-structured model that consists of a multivariate Markov-chain structure from the scale $J_0$ to the scale $J_1$, a multi-branch tree structure between the scale $J_1$ and the scale $J_1+1$, and a univariate quadtree structure from the scale $J_1+1$ to the scale $J$. We can now apply the generalized-Kalman-filter algorithm (Section 2) to this tree-structured model to compute the optimal predictors of $\{x_{j,k}\}$.
FIGURE 8. Correlation functions $f(t,s) \equiv \mathrm{corr}(x_{6,t}, x_{6,s})$ of $\{x_{6,t} : t \in \mathbb{Z}\}$ for the model given by (1.29) and (1.30) with $J_0 = 0$, $J = 6$, $\eta_j = 2$, $j = 1, \ldots, 6$, and the only nonzero element of $\{\sigma_j^2 : j = 0, \ldots, 6\}$ being (a) $\sigma_0^2 = 1$; (b) $\sigma_1^2 = 1$; (c) $\sigma_2^2 = 1$; (d) $\sigma_3^2 = 1$; (e) $\sigma_4^2 = 1$; (f) $\sigma_5^2 = 1$.
FIGURE 9. (a) An acyclic directed graph; (b) a quadtree.
Note that the generalized-Kalman-filter algorithm involves multivariate updates for state vectors from the scale $J_0$ to the scale $J_1$. However, these multivariate computations are not all that computationally intensive, because the number of variables in the coarser scales is relatively small. In fact, the number of variables in each scale is around one fourth the number of variables at the next finer scale. Hence, the generalized-Kalman-filter algorithm is still computationally efficient for this model.
5 Discussion

While the theory of multiscale graphical modeling is quite general, our examples in this article are restricted to zero-mean Gaussian distributions with linear autoregressive structure. Our future research will be devoted to generalizing the filtering algorithms to allow for non-Gaussian distributions with non-zero means. This will be a challenging problem since, recall, model specification consists of a data step (noise model for $y$) and a process step (graphical model for $x$). It is not yet clear how (moments of) the posterior distribution of $x$ given $y$ can be obtained from a fast Kalman-filter-like algorithm. Currently, if data exhibit obvious spatial trend and obvious non-Gaussianity, our recommendation is to try to transform the data (e.g., using a Box-Cox family of transformations) and then detrend (e.g., median polish). The generalized-Kalman-filter algorithm may then be applied to the residuals.

Incorporation of time into the graphical Markov model is, in principle, straightforward. Any autoregressive temporal structure on the spatio-temporal
process $\{x_\tau\}$, where $\tau$ indexes time, can be conceptualized as a tree with an $x_\tau$ at each node. Since those nodes are themselves trees or more general graphical structures, the resulting graph with individual random variables at each node is of the general type considered in this article. Ideally, it is a tree, for which the algorithms of Section 2 are immediately applicable.

Another generalization of these models is to deal with the case where data are multivariate. Again, this should be straightforward; instead of individual random variables at each node, there are random vectors. Within the linear Gaussian modeling framework, this results in bigger vectors and matrices, but the algorithms are conceptually the same. The one new feature will be inversions of matrices of order determined by the number of variables in the original multivariate random vector. This does have the potential to slow up the algorithm considerably if the order becomes too large.

Finally, in applications to command and control, one is often interested in nonlinear functions of $x$; for example, $x_J^\dagger \equiv \max\{x_{J,k} : k \in \mathbb{Z}\}$ is the largest feature in the region of interest, at the finest resolution. Loss functions other than squared error (Section 1) are needed to extract nonlinear functions; their development and implementation, in spatio-temporal contexts, will be a topic for future research.
Acknowledgements

The authors would like to thank Marc Moore for helpful comments. This research was supported by the Office of Naval Research under grant no. N00014-99-1-0214.
Bibliography

Basseville, M., Benveniste, A., Chou, K. C., Golden, S. A., Nikoukhah, R., and Willsky, A. S. (1992a). Modeling and estimation of multiresolution stochastic processes. IEEE Transactions on Information Theory, 38:766-784.

Basseville, M., Benveniste, A., and Willsky, A. S. (1992b). Multiscale autoregressive processes, part I: Schur-Levinson parametrizations. IEEE Transactions on Signal Processing, 40:1915-1934.

Basseville, M., Benveniste, A., and Willsky, A. S. (1992c). Multiscale autoregressive processes, part II: lattice structures for whitening and modeling. IEEE Transactions on Signal Processing, 40:1935-1954.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society B, 36:192-236.

Buneman, P. (1974). A characterization of rigid circuit graphs. Discrete Mathematics, 9:205-212.

Chou, K. C. (1991). Multiscale systems, Kalman filters, and Riccati equations. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.

Chou, K. C., Willsky, A. S., and Nikoukhah, R. (1994). Multiscale systems, Kalman filters, and Riccati equations. IEEE Transactions on Automatic Control, 39:479-492.

Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York, revised edition.

Cressie, N. and Davidson, J. L. (1998). Image analysis with partially ordered Markov models. Computational Statistics and Data Analysis, 29:1-26.

Darroch, J. N., Lauritzen, S. L., and Speed, T. P. (1980). Markov fields and log-linear interaction models for contingency tables. Annals of Statistics, 8:522-539.
Davidson, J. L. and Cressie, N. (1993). Markov pyramid models in image analysis. In Dougherty, E. R., Gader, P. D., and Serra, J. C., editors, Image Algebra and Morphological Image Processing IV, volume 2030 of Society of Photo-Optical Instrumentation Engineers (SPIE) Proceedings, pages 179-190. SPIE, Bellingham, WA.

Davidson, J. L., Cressie, N., and Hua, X. (1999). Texture synthesis and pattern recognition for partially ordered Markov models. Pattern Recognition. Forthcoming.

Dempster, A. P. (1972). Covariance selection. Biometrics, 28:157-175.

Edwards, D. (1995). Introduction to Graphical Modelling. Springer-Verlag, New York.

Edwards, D. and Kreiner, S. (1983). The analysis of contingency tables by graphical models. Biometrika, 70:553-565.

Fieguth, P. W., Karl, W. C., Willsky, A. S., and Wunsch, C. (1995). Multiresolution optimal interpolation and statistical analysis of TOPEX/POSEIDON satellite altimetry. IEEE Transactions on Geoscience and Remote Sensing, 33:280-292.

Fosgate, C. H., Krim, H., Irving, W. W., Karl, W. C., and Willsky, A. S. (1997). Multiscale segmentation and anomaly enhancement of SAR imagery. IEEE Transactions on Image Processing, 6:7-20.

Frydenberg, M. and Lauritzen, S. L. (1989). Decomposition of maximum likelihood in mixed interaction models. Biometrika, 76:539-555.

Gavril, F. (1974). The intersection graphs of subtrees in trees are exactly the chordal graphs. Journal of Combinatorial Theory B, 16:47-56.

Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic Press, London.

Hammersley, J. M. and Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript, Oxford University.

Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.

Howard, R. A. and Matheson, J. E. (1981). Influence diagrams. In Howard, R. A. and Matheson, J. E., editors, Readings on the Principles and Applications of Decision Analysis, volume II, pages 719-762. Strategic Decisions Group, Menlo Park, CA.

Isham, V. (1981). An introduction to spatial point processes and Markov random fields. International Statistical Review, 49:21-43.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82:34-45.

Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. Journal of Basic Engineering, 83:95-108.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press, Oxford.

Lauritzen, S. L., Dawid, A. P., Larsen, B. N., and Leimer, H. G. (1990). Independence properties of directed Markov fields. Networks, 20:491-505.

Lauritzen, S. L., Speed, T. P., and Vijayan, K. (1984). Decomposable graphs and hypergraphs. Journal of the Australian Mathematical Society A, 36:12-29.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society B, 50:157-224.

Leimer, H.-G. (1993). Optimal decomposition by clique separators. Discrete Mathematics, 113:99-123.

Luettgen, M. R. and Willsky, A. S. (1995). Multiscale smoothing error models. IEEE Transactions on Automatic Control, 40:173-175.

Pate-Cornell, M. E. and Fischbeck, P. S. (1995). Probabilistic interpretation of command and control signals: Bayesian updating of the probability of nuclear attack. Reliability Engineering and System Safety, 47:27-36.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Shachter, R. D. (1986). Evaluating influence diagrams. Operations Research, 34:871-882.

Smith, J. Q. (1989). Influence diagrams for statistical modelling. Annals of Statistics, 17:654-672.

Speed, T. P. (1979). A note on nearest-neighbour Gibbs and Markov probabilities. Sankhya A, 41:184-197.

Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L., and Cowell, R. G. (1993). Bayesian analysis in expert systems (with discussion). Statistical Science, 8:219-283.

Tarjan, R. E. (1985). Decomposition by clique separators. Discrete Mathematics, 55:221-232.
Tarjan, R. E. and Yannakakis, M. (1984). Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing, 13:566-579.

Wermuth, N. and Lauritzen, S. L. (1983). Graphical and recursive models for contingency tables. Biometrika, 70:537-552.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley, Chichester, UK.

Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 20:557-585.

Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5:161-215.