Predictive information and Bayesian surprise in exchangeable random processes Samer Abdallah and Mark Plumbley

Centre for Digital Music, Queen Mary, University of London Technical Report C4DM-TR10-09 Version 0.1 – 31st October, 2010

Abstract In this report we examine the connection between Itti and Baldi's 'Bayesian surprise' [1] and instantaneous predictive information [2], and show that the two are identical in certain classes of exchangeable random sequences. We also examine how these relationships can be used to compute the instantaneous predictive information in a Markov chain where the transition matrix is unknown. Samples from such a process are not exchangeable, but nonetheless usable results can be obtained.


1 Introduction

In this technical note we examine predictive information in exchangeable random sequences [3], a class which includes sequences of observations derived from well known Bayesian models such as the Dirichlet-discrete conjugate system, and the closely related Chinese Restaurant process. The main result is that, in exchangeable random sequences, the instantaneous predictive information, when finite, is equal to the information gained from each observation about the underlying de Finetti measure. When this measure can be parameterised, e.g. as a discrete distribution with a Dirichlet prior, the information gained is equal to the ‘Bayesian surprise’ introduced by Itti and Baldi [1, 4]. We also generalise some of the analysis to random sequences which are not exchangeable but for which the Bayesian surprise is still meaningful. Predictive Information The concept of predictive information in a random process has developed over a number of years, with many contributions to be found in the physics and machine learning literature. For example, Crutchfield and his coworkers [5, 6] called the mutual information between the semi-infinite past and future of a sequential random process the excess entropy, and examined its relationship with statistical complexity in 1-dimensional spin systems. Bialek et al [7] took the idea of excess entropy further and examined the mutual information between a finite segment of the process and its infinite future, which they called the predictive information. The excess entropy and the predictive information are illustrated in fig. 1(a,b). Of particular interest


[Figure 1: four information diagrams. (a) excess entropy EE between the infinite past and the infinite future; (b) predictive information PI between a finite past (N) and the infinite future; (c) instantaneous information gain IIG between the present and the parameters, with a zero in the region shared by past and present given the parameters; (d) instantaneous predictive information IPI between the present and the future given the past.]

Figure 1: Schematic illustration of (a) Crutchfield et al.'s excess entropy, the mutual information between the infinite past and the infinite future; (b) Bialek et al.'s predictive information, which, as a function of N, is the mutual information between a finite segment of length N and the infinite future; and (c) Haussler and Opper's instantaneous information gain, which is the information in a single observation about the parameters of the process generating a sequence of iid observations. The zero in the overlap area representing the mutual information between the past and the present given the parameters represents the conditional independence of the observations given the parameters. Finally, (d) illustrates the instantaneous predictive information, which is the information in a single observation about the future given the past.

is the asymptotic growth of this predictive information as the length of the segment increases, which was shown to be an effective index of process complexity. Bialek et al. note in passing [7, p. 2420] that, when the observations are independent given a parameterisable model, their predictive information is equal to the mutual information between a block of observations and the parameters. Information-theoretic characterisations of learning in parameterised models were previously investigated by Haussler and Opper [8, 9, 10], who obtained upper and lower bounds on the mutual information between a sequence of observations and the parameters of the model from which they were sampled. They used the term instantaneous information gain—see fig. 1(c)—to denote the new information about the parameters obtained from a single observation. This is essentially what Itti and Baldi [1] later described as 'Bayesian surprise'. Abdallah and Plumbley [2] examined several information-theoretic measures that could be used to characterise not only a random process as a whole (i.e., an ensemble of possible sequences), but also specific realisations of such a process, from the point of view of a Bayesian observer. One of these measures was the instantaneous predictive information, which is the information in one observation about the infinite future given all the previous observations. When averaged over all possible realisations of a stationary process, this gives the predictive information rate. The purpose of this article is to show that the instantaneous predictive information equals the instantaneous information gain or Bayesian surprise when the observations are conditionally independent given the parameters. A slight


generalisation of the analysis can be used to relate instantaneous predictive information to Bayesian surprise in non-conditionally iid models, specifically, in Markov chains with unknown transition probabilities. Notational conventions In the following, all random variables are defined on a probability space (Ω, A, P), where (Ω, A) is a measurable space and P is a probability measure on that space. The probability that some proposition Q is true with respect to this probability space will be written as Pr(Q). Random variables will be declared as functions from Ω to some set of values, e.g. X : Ω → X. The expectation of X will be written as E X = ∫_Ω X(ω) dP(ω). The mutual information between two discrete random variables X : Ω → X and Y : Ω → Y is defined as

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X{=}x \wedge Y{=}y) \log \frac{\Pr(X{=}x \wedge Y{=}y)}{\Pr(X{=}x)\,\Pr(Y{=}y)}.  (1)

The information in the observation event X = x about a random variable Y is defined as the Kullback-Leibler (KL) divergence from the 'posterior' distribution pY|X=x to the 'prior' pY:

I(X{=}x; Y) = \sum_{y \in \mathcal{Y}} \Pr(Y{=}y | X{=}x) \log \frac{\Pr(Y{=}y | X{=}x)}{\Pr(Y{=}y)}.  (2)

The mutual information I(X; Y) is simply the expectation of the information I(X= x; Y) before X is observed. An underline will be used to denote vectors or 1-D arrays, e.g. x, α ∈ R^K. δx, when x is in a continuous domain, will denote a delta distribution or unit mass at x, i.e., δx(A) = 1 if x ∈ A and 0 otherwise. N denotes the set of natural numbers starting from 1.
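As a concrete illustration of equations (1) and (2), the following sketch (our own; the names `information_gained`, `p_y_given_x` and the example distributions are not from the report) computes the information in an observation as a KL divergence and recovers the mutual information as its prior expectation:

```python
import numpy as np

def information_gained(p_posterior, p_prior):
    """Information in an observation (eq. 2): KL divergence from the
    posterior distribution over Y to the prior distribution over Y."""
    post = np.asarray(p_posterior, dtype=float)
    prior = np.asarray(p_prior, dtype=float)
    mask = post > 0  # terms with zero posterior probability contribute 0
    return float(np.sum(post[mask] * np.log(post[mask] / prior[mask])))

# A small joint distribution over X and Y.
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],   # p(y | X = 0)
                        [0.2, 0.8]])  # p(y | X = 1)
p_y = p_x @ p_y_given_x               # prior over Y

# Mutual information I(X; Y) (eq. 1) as the expectation of eq. (2) over X.
mi = sum(p_x[i] * information_gained(p_y_given_x[i], p_y) for i in range(2))
assert mi >= 0.0  # mutual information is nonnegative
```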

2 Direct approach

2.1 The Dirichlet-discrete system

Assume we have a sequence (X1, X2, . . .) of random variables Xi : Ω → X, whose range X is the set of integers 1..K. These are drawn independently from a discrete distribution parameterised by θ = (θ1, . . . , θK) such that Pr(Xi= k|θ) = θk. This means that the Xi are independent and identically distributed (iid) given θ; however, we will assume that θ is unknown and is modelled by a random variable Θ : Ω → ∆K, where ∆K denotes the K − 1 dimensional simplex in K dimensions, and that Θ has a Dirichlet distribution Dirichlet(α), where α ∈ R^K. To summarise,

\Theta : \Omega \to \Delta_K, \quad \Theta \sim \mathrm{Dirichlet}(\alpha)  (3)
\forall i \in \mathbb{N}, \quad X_i : \Omega \to \mathcal{X}, \quad (X_i | \Theta{=}\theta) \sim \mathrm{Discrete}(\theta)  (4)

When θ is unknown, the Xi are no longer independent, since observing one of the values provides information about θ and therefore affects our beliefs about the rest of the variables. Thus, we may ask, how much information is there in X1 about the infinite sequence of future observations (X2, X3, . . .)?

To answer this question, we first consider the finite sequence (X1, . . . , XN+1), or X1:N+1 for short. If X1 is observed to be k, the information so gained about the remainder of the sequence X2:N+1 is the Kullback-Leibler (KL) divergence between the distributions of (X2:N+1) and (X2:N+1|X1= k). From the definition of the model, we have, for x ∈ X^N,

\Pr(X_{1:N}{=}x) = \int_{\Delta_K} \Pr(X_{1:N}{=}x | \Theta{=}\theta)\, p_\Theta(\theta)\, d\theta
               = \int_{\Delta_K} \left( \prod_{k=1}^K \theta_k^{n_k} \right) p_D(\theta; \alpha)\, d\theta,  (5)

where pΘ(·) is the probability density function (pdf) of Θ, which in this case is pD(·; α), the Dirichlet distribution at α, and n_k = \sum_{i=1}^N \delta_{k,x_i}, that is, the number of times each value k appears in the sequence x. Notice that the probability of observing a sequence x is invariant to permutations of the values it contains, which means that X1:N is an exchangeable random sequence [3]. Let us define the function qα : X^N → [0, 1] as the distribution of any N values drawn from the Dirichlet-discrete system given that Θ ∼ Dirichlet(α), so we have

\Pr(X_{1:N}{=}x) = \Pr(X_{2:N+1}{=}x) = q_\alpha(x).  (6)

Since X1 and X2:N+1 are conditionally independent given Θ, the posterior distribution of X2:N+1 after the observation X1= k is

\Pr(X_{2:N+1}{=}x | X_1{=}k) = \int_{\Delta_K} \Pr(X_{2:N+1}{=}x | \Theta{=}\theta)\, p_\Theta(\theta | X_1{=}k)\, d\theta.  (7)

The fact that the Dirichlet distribution is a conjugate prior for the discrete distribution means that the conditional density pΘ(·|X1= k) is also Dirichlet, with some new parameter value α′. In this case, it can easily be shown that α′ = α + δk, where δk ∈ {0, 1}^K is the kth 'unit vector'. Hence, we can set pΘ(θ|X1= k) = pD(θ; α′) and expand Pr(X2:N+1= x|Θ= θ), yielding

\Pr(X_{2:N+1}{=}x | X_1{=}k) = \int_{\Delta_K} \left( \prod_{k=1}^K \theta_k^{n_k} \right) p_D(\theta; \alpha')\, d\theta.  (8)

Comparing equations (5) and (8), we see that Pr(X2:N+1= x|X1= k) = qα′(x), so the information gain associated with a change in predictive distribution over X2:N+1 from qα to qα′ is the KL divergence

D(q_{\alpha'} \| q_\alpha) = \sum_{x \in \mathcal{X}^N} q_{\alpha'}(x) \log \frac{q_{\alpha'}(x)}{q_\alpha(x)}.  (9)

By exploiting the exchangeability of the Dirichlet-discrete system, or by direct analysis, it can be shown that

q_\alpha(x_{1:N}) = \frac{\alpha_1^{[n_1]} \cdots \alpha_K^{[n_K]}}{\alpha_X^{[N]}},  (10)

where t^{[n]} denotes the ascending factorial function, which can be expressed in terms of the Gamma function as

t^{[n]} = t(t+1)\cdots(t+n-1) = \frac{\Gamma(t+n)}{\Gamma(t)}.  (11)
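Equations (10) and (11) are easy to check numerically. The sketch below is our own illustration (symbols 0..K−1 stand in for 1..K): it evaluates log qα via ascending factorials, and confirms both the exchangeability of the sequence probability and its agreement with the sequential predictive probabilities:

```python
from math import lgamma, exp

def log_ascending_factorial(t, n):
    """log t^[n] = log Gamma(t + n) - log Gamma(t) (eq. 11)."""
    return lgamma(t + n) - lgamma(t)

def log_q(alpha, x):
    """log q_alpha(x) for the Dirichlet-discrete system (eq. 10).
    x is a list of symbols in 0..K-1."""
    counts = [x.count(k) for k in range(len(alpha))]
    return (sum(log_ascending_factorial(a, n) for a, n in zip(alpha, counts))
            - log_ascending_factorial(sum(alpha), len(x)))

alpha = [1.0, 2.0]
# Exchangeability: permuting a sequence leaves its probability unchanged.
assert abs(log_q(alpha, [0, 1, 1]) - log_q(alpha, [1, 1, 0])) < 1e-12
# Agreement with sequential prediction: Pr(X1=0) Pr(X2=1|X1=0) = (1/3)(2/4).
assert abs(exp(log_q(alpha, [0, 1])) - (1.0 / 3.0) * (2.0 / 4.0)) < 1e-12
```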

Hence, the log of the probability ratio in (9) is

\log \frac{q_{\alpha'}(x)}{q_\alpha(x)} = -\log \frac{\Gamma(\alpha'_X + N)\,\Gamma(\alpha_X)}{\Gamma(\alpha'_X)\,\Gamma(\alpha_X + N)} + \sum_{k=1}^K \log \frac{\Gamma(\alpha'_k + n_k)\,\Gamma(\alpha_k)}{\Gamma(\alpha'_k)\,\Gamma(\alpha_k + n_k)}.  (12)

Since the updated parameters α′ are obtained from the initial parameters α by incrementing just the kth element, the above expression can be simplified quite a bit. After some manipulation, we obtain

\log \frac{q_{\alpha'}(x_{1:N})}{q_\alpha(x_{1:N})} = \log \frac{\alpha_X (\alpha_k + n_k)}{\alpha_k (\alpha_X + N)}.  (13)

We can now substitute this back into (9) to obtain

D(q_{\alpha'} \| q_\alpha) = \sum_{x \in \mathcal{X}^N} q_{\alpha'}(x) \log \frac{\alpha_X (\alpha_k + n_k)}{\alpha_k (\alpha_X + N)}.  (14)

Note that this has the form of an expectation of a function of x (via the dependence on nk) when x is drawn from a Dirichlet-discrete system starting with the parameters α′. If we take the limit as N → ∞, then a simple answer is obtained. By the law of large numbers, the relative frequencies nk/N will converge almost surely to their probabilities θk in the underlying distribution, so that

\lim_{N \to \infty} \frac{\alpha_k + n_k}{\alpha_X + N} = \theta_k.  (15)

Bearing in mind that θ is not known but is distributed as Dirichlet(α′), the instantaneous predictive information (IPI) in the limit N → ∞ is

\lim_{N \to \infty} D(q_{\alpha'} \| q_\alpha) = \log \frac{\alpha_X}{\alpha_k} + \mathrm{E} \log \Theta'_k,  (16)

where Θ′ ∼ Dirichlet(α′). It is well known that, when Θ ∼ Dirichlet(α), E log Θk = ψ(αk) − ψ(αX) (where ψ denotes the digamma function), and so we obtain our result, that the information in the observation X1= k about the infinite future is

I(X_1{=}k; X_{2:\infty}) = \log \frac{\alpha_X}{\alpha_k} + \psi(\alpha_k + 1) - \psi(\alpha_X + 1).  (17)

The entire derivation can be applied recursively to successive observations: if the next observation is X2= j, and α′′ = α′ + δj, then

I(X_2{=}j; X_{3:\infty} | X_1{=}k) = \log \frac{\alpha'_X}{\alpha'_j} + \psi(\alpha''_j) - \psi(\alpha''_X),  (18)

and so on. Similarly, the expected IPI, which at the first observation is just the mutual information I(X1; X2:∞), is given by averaging the IPI over the predictive distribution for the next observation:

I(X_1; X_{2:\infty}) = \sum_{k=1}^K \Pr(X_1{=}k)\, I(X_1{=}k; X_{2:\infty})
                    = \sum_{k=1}^K \frac{\alpha_k}{\alpha_X} \left( \log \frac{\alpha_X}{\alpha_k} + \psi(\alpha'_k) - \psi(\alpha'_X) \right).  (19)
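A minimal numerical sketch of (17) and (19), assuming SciPy's digamma function (`scipy.special.psi`); the function names are ours, not from the report:

```python
import numpy as np
from scipy.special import psi  # digamma function

def ipi(alpha, k):
    """Information in the observation X1 = k about X_{2:inf} (eq. 17)."""
    a_sum = alpha.sum()
    return np.log(a_sum / alpha[k]) + psi(alpha[k] + 1) - psi(a_sum + 1)

def expected_ipi(alpha):
    """Expected IPI, i.e. the mutual information I(X1; X_{2:inf}) (eq. 19)."""
    a_sum = alpha.sum()
    return sum((alpha[k] / a_sum) * ipi(alpha, k) for k in range(len(alpha)))

alpha = np.array([1.0, 2.0, 3.0])
# Each term is a KL divergence, so IPI and its expectation are nonnegative;
# rarer symbols (smaller alpha_k) carry more predictive information.
assert ipi(alpha, 0) > ipi(alpha, 2) > 0
assert expected_ipi(alpha) > 0
```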


Figure 2: Graphical model representation of an infinite exchangeable random sequence (X1 , X2 , . . .), which are conditionally independent given the distribution-valued latent variable Θ.

3 General information theoretic approach

Our aim is to compute the instantaneous predictive information (IPI) and the expected IPI at any point during the sequential observation of an exchangeable random process (X1, X2, . . .), where the Xt : Ω → X are random variables taking values in X. Given a sequence of observations (x1, x2, . . .), the IPI at any time t ∈ N is defined as

I_t = I(X_t{=}x_t; X_{t+1:\infty} | X_{1:t-1}{=}x_{1:t-1}),  (20)

and the expected IPI is a mutual information conditioned on the previous observations,

\bar{I}_t = I(X_t; X_{t+1:\infty} | X_{1:t-1}{=}x_{1:t-1}).  (21)

De Finetti's representation theorem and its generalisations [ref] state that, with some restrictions, the distribution of any infinite exchangeable sequence can be represented as a mixture (possibly infinite) of factorial distributions under which the observables are independent and identically distributed. Expressed in generative terms, this means that there exists an unobserved random variable Θ : Ω → P whose values are probability distributions (i.e., P denotes the set of probability distributions over X), and that the Xt are conditionally iid given Θ (see fig. 2). If pΘ(·) is the pdf of Θ and pX(·|θ) is the probability density or mass function of the distribution represented by θ (depending on whether X is continuous or discrete), then

p_X(x_{1:N}) = \int_{\mathcal{P}} p_\Theta(\theta) \prod_{t=1}^N p_X(x_t | \theta)\, d\theta.  (22)

To simplify the notation, we can at any time t consider a model for random variables (Θt, Yt, Zt) = (Θ, Xt, Xt+1:∞ |X1:t−1= x1:t−1) which is conditioned on the previous observations X1:t−1 = x1:t−1. In these terms, the expected IPI is the mutual information I(Yt; Zt) and the IPI can be found by considering the function I(Yt= y; Zt) for y ∈ X. In the following, it will be useful to work with finite futures of length N, that is, Zt^N = Xt+1:N+1, and then to consider how the various quantities behave as N tends to infinity, since Zt = Zt^∞. By construction, Yt and Zt are conditionally independent given Θt, so for all θ ∈ P, y ∈ X, N ∈ N and z ∈ X^N,

\Pr(Y_t{=}y \wedge Z_t^N{=}z | \Theta_t{=}\theta) = \Pr(Y_t{=}y | \Theta_t{=}\theta)\, \Pr(Z_t^N{=}z | \Theta_t{=}\theta).  (23)


Figure 3: Schematic representation of entropies and mutual information for an exchangeable random sequence. The zeros represent our conditional independence assumptions: I(Yt ; Zt |Θt ) = 0 due to the conditional independence of observables given the parameters Θt , and I(Θt ; Yt |Zt ) = 0 due to the assumption that Yt adds no new information about Θt in the limit of an infinitely long sequence Zt = Xt+1:∞ . Hence, I(Yt ; Zt ) = I(Yt ; Θt ) = It .

This implies that the conditional mutual information is zero:

\forall N \in \mathbb{N}, \quad I(Y_t; Z_t^N | \Theta_t) = 0.  (24)

Next, we make an additional assumption concerning the asymptotic behaviour of I(Zt^N; Θt) as N increases. Essentially, we suppose that if we have a long sequence of N observations, each additional observation carries less and less extra information about Θ, until, in the limit N → ∞, an extra observation will not change our beliefs about Θ and will therefore carry no information. Since Zt^N and Yt play precisely this role, we can write our assumption as

\lim_{N \to \infty} I(Y_t; \Theta_t | Z_t^N) = 0.  (25)

This does not mean that if we observe the infinite sequence of values represented by Zt we must be able to determine Θt fully (which would imply that H(Θt|Zt) = 0); only that we cannot learn any more about Θt by observing the single extra value represented by Yt. With our assumptions in place, we can show that I(Yt; Zt) = I(Yt; Θt), that is, the (expected) predictive information at time t is equal to the mutual information between the current observation and the parameters given the observations so far, and, with a minor strengthening of the assumptions, that It = I(Yt= xt; Θt), that is, the instantaneous predictive information in the current observation is equal to the instantaneous information gain about the parameters, which is the same as Itti and Baldi's Bayesian surprise.

Theorem 1. If, for all positive integers N ∈ N, Yt and Zt^N are conditionally independent given Θt, and limN→∞ I(Θt; Yt|Zt^N) = 0, then

\lim_{N \to \infty} I(Y_t; Z_t^N) = I(Y_t; \Theta_t).  (26)

Proof. McGill [11] showed that, for three random variables U, V and W, the conditional mutual information can be expressed as

I(U; V | W) = I(U; V) - A(U, V, W),  (27)

where A(U, V, W) is the 'interaction information', defined as

A(U, V, W) = H(U, V, W) + H(U) + H(V) + H(W) - H(U, V) - H(V, W) - H(W, U).  (28)

On the Venn-diagram visualisation of informational quantities, this corresponds to the central overlapped region such as the one labelled It in fig. 3. Applied to the three variables Θt, Yt and Zt^N this gives

I(Y_t; Z_t^N) = A(\Theta_t, Y_t, Z_t^N) + I(Y_t; Z_t^N | \Theta_t)
             = I(Y_t; \Theta_t) - I(Y_t; \Theta_t | Z_t^N) + I(Y_t; Z_t^N | \Theta_t).  (29)

Applying our two assumptions I(Yt; Zt^N|Θt) = 0 and limN→∞ I(Yt; Θt|Zt^N) = 0 yields the theorem directly.

Definition 1. If U : Ω → U, V : Ω → V and W : Ω → W are random variables, then the conditional information in the event U= u about V given W is

I(U{=}u; V | W) = \int_{\mathcal{W}} p(w|u)\, I(U{=}u; V | W{=}w)\, dw
               = \int_{\mathcal{W}} \int_{\mathcal{V}} p(v, w|u) \log \frac{p(v|u, w)}{p(v|w)}\, dv\, dw,  (30)

where p(·|·) denotes the appropriate conditional probability density or mass functions for the implied random variables.

Theorem 2. If, for all positive integers N ∈ N, Yt and Zt^N are conditionally independent given Θt, and, for y ∈ X, the asymptotic conditional information vanishes,

\lim_{N \to \infty} I(Y_t{=}y; \Theta_t | Z_t^N) = 0,  (31)

then

\lim_{N \to \infty} I(Y_t{=}y; Z_t^N) = I(Y_t{=}y; \Theta_t).  (32)

Proof. Consider the difference between the left and right hand sides of (32). Let Zt^N take values in X^N and let p(·|·) denote the appropriate conditional probability density function or mass function for the variables implied:

I(Y_t{=}y; \Theta_t) - I(Y_t{=}y; Z_t^N)
  = \int_{\mathcal{P}} p(\theta|y) \log \frac{p(\theta|y)}{p(\theta)}\, d\theta - \int_{\mathcal{X}^N} p(z|y) \log \frac{p(z|y)}{p(z)}\, dz
  = \int_{\mathcal{P}} \int_{\mathcal{X}^N} p(\theta, z|y) \log \frac{p(\theta|y)\, p(z)}{p(\theta)\, p(z|y)}\, dz\, d\theta
  = \int_{\mathcal{P}} \int_{\mathcal{X}^N} p(\theta, z|y) \log \frac{p(\theta, y)\, p(z)}{p(\theta)\, p(z, y)}\, dz\, d\theta.

Since Yt and Zt^N are conditionally independent given Θt, we can multiply the probability ratio inside the logarithm by p(y, z|θ)/(p(y|θ) p(z|θ)) = 1:

I(Y_t{=}y; \Theta_t) - I(Y_t{=}y; Z_t^N)
  = \int_{\mathcal{P}} \int_{\mathcal{X}^N} p(\theta, z|y) \log \left( \frac{p(\theta, y)\, p(z)}{p(\theta)\, p(z, y)} \cdot \frac{p(y, z, \theta)\, p(\theta)}{p(y, \theta)\, p(z, \theta)} \right) dz\, d\theta
  = \int_{\mathcal{P}} \int_{\mathcal{X}^N} p(\theta, z|y) \log \frac{p(z)\, p(y, z, \theta)}{p(z, y)\, p(z, \theta)}\, dz\, d\theta
  = \int_{\mathcal{P}} \int_{\mathcal{X}^N} p(\theta, z|y) \log \frac{p(\theta|y, z)}{p(\theta|z)}\, dz\, d\theta.

Noticing that this is beginning to look like a KL divergence, we can rearrange the integrals to make the correspondence more obvious:

I(Y_t{=}y; \Theta_t) - I(Y_t{=}y; Z_t^N)
  = \int_{\mathcal{X}^N} p(z|y) \int_{\mathcal{P}} p(\theta|z, y) \log \frac{p(\theta|y, z)}{p(\theta|z)}\, d\theta\, dz
  = \int_{\mathcal{X}^N} p(z|y)\, D(p_{\Theta_t|Y_t=y, Z_t^N=z} \| p_{\Theta_t|Z_t^N=z})\, dz
  = \int_{\mathcal{X}^N} p(z|y)\, I(Y_t{=}y; \Theta_t | Z_t^N{=}z)\, dz.

This is the information in the observation Yt= y about Θt given Zt^N averaged over all the possible realisations of Zt^N given Yt= y, which is the definition of the conditional information that we have assumed is zero in the limit N → ∞. The conditional information is defined in such a way that the conditional mutual information I(Yt; Θt|Zt) is the expectation of the conditional information as a function of Yt. If I(Yt= y; Θt|Zt) = 0 for all y ∈ X, then I(Yt; Θt|Zt) = 0, but the reverse is not necessarily true—even if I(Yt; Θt|Zt) = 0, there may be zero-probability points in Y at which the conditional information is not zero. Thus the zero conditional information assumption is slightly stronger than the zero conditional mutual information assumption. Returning to the sequential prediction problem in terms of the original variables, we find that

I(X_t{=}x_t; X_{t+1:\infty} | X_{1:t-1}{=}x_{1:t-1}) = I(X_t{=}x_t; \Theta | X_{1:t-1}{=}x_{1:t-1}).  (33)

The left hand side is just the instantaneous predictive information and the right hand side is Haussler and Opper’s instantaneous information gain or Itti and Baldi’s Bayesian surprise.
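Theorem 1 can be illustrated by brute force for a small Dirichlet-discrete system: compute I(X1; X2:N+1) by enumerating all sequences and watch it increase towards I(X1; Θ), given in closed form by (19). The sketch below is our own, assuming NumPy and SciPy:

```python
import itertools
import numpy as np
from scipy.special import gammaln, psi

def log_q(alpha, x):
    """log marginal probability of sequence x under the Dirichlet-discrete
    system (a rearrangement of eq. 10 using log-Gamma functions)."""
    alpha = np.asarray(alpha)
    counts = np.bincount(x, minlength=len(alpha))
    return (np.sum(gammaln(alpha + counts) - gammaln(alpha))
            - (gammaln(alpha.sum() + len(x)) - gammaln(alpha.sum())))

def mi_first_vs_rest(alpha, N):
    """I(X1; X_{2:N+1}) by exhaustive enumeration over all sequences."""
    K = len(alpha)
    mi = 0.0
    for seq in itertools.product(range(K), repeat=N + 1):
        x = np.array(seq)
        lp = log_q(alpha, x)
        # By exchangeability, Pr(X_{2:N+1} = x_{2:N+1}) = q_alpha(x_{2:N+1}).
        mi += np.exp(lp) * (lp - log_q(alpha, x[:1]) - log_q(alpha, x[1:]))
    return mi

alpha = [1.0, 2.0]
a_sum = sum(alpha)
# I(X1; Theta): the closed form of eq. (19).
limit = sum((a / a_sum) * (np.log(a_sum / a) + psi(a + 1) - psi(a_sum + 1))
            for a in alpha)
vals = [mi_first_vs_rest(alpha, N) for N in (1, 4, 8)]
# I(X1; Z^N) is nondecreasing in N and bounded above by I(X1; Theta).
assert vals[0] <= vals[1] <= vals[2] <= limit + 1e-9
```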

3.1 Bayesian surprise in the Dirichlet-discrete system

We verify that this is indeed the case for the Dirichlet-discrete system by computing the KL divergence between the Dirichlet distributions parameterised by α and α′ = α + δk, that is, before and after the observation X= k respectively. The KL divergence between two Dirichlet distributions parameterised by α and α′ can be shown to be

D_{\mathrm{Dir}}(\alpha', \alpha) = \log \frac{B(\alpha)}{B(\alpha')} + \sum_{j=1}^K (\alpha'_j - \alpha_j) \left[ \psi(\alpha'_j) - \psi(\alpha'_X) \right].  (34)

Using the definition of the multivariate Beta function B : R^K → R,

\frac{B(\alpha)}{B(\alpha')} = \frac{\Gamma(\alpha'_X) \prod_{j=1}^K \Gamma(\alpha_j)}{\Gamma(\alpha_X) \prod_{j=1}^K \Gamma(\alpha'_j)}
                             = \frac{\Gamma(\alpha_X + 1) \prod_{j=1}^K \Gamma(\alpha_j)}{\Gamma(\alpha_X) \prod_{j=1}^K \Gamma(\alpha_j + \delta_{jk})}
                             = \frac{\alpha_X \Gamma(\alpha_X) \prod_{j=1}^K \Gamma(\alpha_j)}{\alpha_k \Gamma(\alpha_X) \prod_{j=1}^K \Gamma(\alpha_j)} = \frac{\alpha_X}{\alpha_k}.

Hence, we obtain

D_{\mathrm{Dir}}(\alpha', \alpha) = \log \frac{\alpha_X}{\alpha_k} + \sum_{j=1}^K \delta_{jk} \left[ \psi(\alpha'_j) - \psi(\alpha'_X) \right]
                                 = \log \frac{\alpha_X}{\alpha_k} + \psi(\alpha_k + 1) - \psi(\alpha_X + 1),  (35)

which is in agreement with (17), the expression we found by direct calculation. We can also use (34) to verify that the vanishing conditional information assumption I(Yt= k; Θt|Zt^N) → 0 holds when Θt ∼ Dirichlet(α), (Yt|Θt= θ) ∼ Discrete(θ) and (Zt^N|Θt= θ) ∼ [Discrete(θ)]^N, that is, Zt^N consists of N iid copies of Discrete(θ). For z ∈ X^N, the conditional distribution of (Θt|Zt^N= z) is Dirichlet(α + n), where n ∈ Z^K is the histogram of values occurring in the sequence z, i.e., n_k = \sum_{i=1}^N \delta_{k,z_i}. Similarly, the conditional distribution of (Θt|Yt= k, Zt^N= z) is Dirichlet(α + n + δk). The information I(Yt= k; Θt|Zt^N= z) is therefore DDir(α + n + δk, α + n), which, by analogy with (35), we can immediately write down:

I(Y_t{=}k; \Theta_t | Z_t^N{=}z) = \log \frac{\alpha_X + N}{\alpha_k + n_k} + \psi(\alpha_k + n_k + 1) - \psi(\alpha_X + N + 1).

Since ψ(t + 1) ≈ log t + 1/2t + O(1/t²) for large t, we obtain

I(Y_t{=}k; \Theta_t | Z_t^N{=}z) \approx \frac{1}{2(\alpha_k + n_k)} - \frac{1}{2(\alpha_X + N)}

so long as nk tends to infinity as N increases. This will be the case almost surely if αk > 0, and so limN→∞ I(Yt= k; Θt|Zt^N= z) = 0 for all typical sequences z, that is, except for sequences z whose total probability vanishes in the limit N → ∞. Hence, the conditional information (see Definition 1), obtained by averaging over Zt^N, is also zero when αk > 0:

\alpha_k > 0 \implies \lim_{N \to \infty} I(Y_t{=}k; \Theta_t | Z_t^N) = 0.

If, for some k, αk = 0, then Pr(Yt= k) = 0 and so the probability that the zero information condition is violated is zero.
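The agreement between the general Dirichlet KL divergence (34) and the closed form (35) is easy to confirm numerically. A sketch, assuming SciPy's `gammaln` and `psi`; the function names are ours:

```python
import numpy as np
from scipy.special import gammaln, psi

def dirichlet_kl(a_new, a_old):
    """D(Dirichlet(a_new) || Dirichlet(a_old)), eq. (34)."""
    def log_beta(a):  # log of the multivariate Beta function, eq. (47)
        return gammaln(a).sum() - gammaln(a.sum())
    return (log_beta(a_old) - log_beta(a_new)
            + np.sum((a_new - a_old) * (psi(a_new) - psi(a_new.sum()))))

alpha = np.array([0.5, 1.5, 2.0])
k = 1
a_post = alpha.copy()
a_post[k] += 1.0  # posterior parameters after observing X = k

surprise = dirichlet_kl(a_post, alpha)         # Bayesian surprise, eq. (34)
closed = (np.log(alpha.sum() / alpha[k])       # closed form, eq. (35)
          + psi(alpha[k] + 1) - psi(alpha.sum() + 1))
assert abs(surprise - closed) < 1e-12
```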


4 Predictive information in Markov chains

If we relax the assumption that the 'present' Yt and the 'future' Zt are conditionally independent given the parameters Θt, we can still obtain a usable result in some cases. From (29), we see that, assuming still I(Yt; Θt|Zt) = 0,

I(Y_t; Z_t) = I(Y_t; \Theta_t) + I(Y_t; Z_t | \Theta_t).  (36)

In some models (such as Markov chains), I(Yt; Zt|Θt= θ) is easy to compute for given parameter settings θ, in which case we may be able to evaluate exactly or approximately the conditional mutual information using

I(Y_t; Z_t | \Theta_t) = \int_{\mathcal{P}} p_{\Theta_t}(\theta)\, I(Y_t; Z_t | \Theta_t{=}\theta)\, d\theta.  (37)

Similarly, we can revisit Theorem 2 without the present/future conditional independence assumption, but retaining the asymptotic zero information assumption I(Yt= y; Θt|Zt) = 0, to obtain

I(Y_t{=}y; Z_t) = I(Y_t{=}y; \Theta_t) + \int_{\mathcal{P}} p_{\Theta_t|Y_t}(\theta|y)\, I(Y_t{=}y; Z_t | \Theta_t{=}\theta)\, d\theta.  (38)

In the rest of this section we will apply these results to the analysis of instantaneous predictive information in a Markov chain when the transition matrix is unknown and must be inferred from the observations. Suppose we have a discrete-valued Markov chain (X1, X2, . . .) and a matrix-valued random variable A such that Pr(Xt+1= k|Xt= j, A= a) = a_kj. The columns of A are independent and Dirichlet distributed, with parameters given by the columns of α ∈ R^{K×K}; that is, A:j ∼ Dirichlet(α:j). The distribution of X1 is not specified as it is not needed in the following. In summary:

\alpha \in \mathbb{R}^{K \times K}, \quad \forall i, j \in 1..K,\ \alpha_{ij} > 0
A : \Omega \to \Delta_K^K, \quad A \sim [\mathrm{Dirichlet}(\alpha_{:j})]_{j=1}^K
\forall t \in \mathbb{N}, \quad X_t : \Omega \to 1..K, \quad (X_{t+1} | X_t{=}j, A{=}a) \sim \mathrm{Discrete}(a_{:j}).

Note that ∆K^K ⊂ R^{K×K} denotes the set of matrices where each column is in the simplex ∆K. With an eye to applying (38), we define the time-dependent variables for t ≥ 2 as

Θt = (A|X1:t−1= x1:t−1),  Yt = (Xt|X1:t−1= x1:t−1),  Zt = (Xt+1:∞|X1:t−1= x1:t−1).

The asymptotic zero information condition I(Yt= k; Θt|Zt) = 0 will be satisfied in this system as long as all transitions occur infinitely often in the infinite future Zt, which will be the case if the Markov chain is irreducible. To simplify the subsequent notation, we let Ot denote the observation event X1:t−1= x1:t−1, in which terms the IPI at time t ≥ 2 is

I_t = I(X_t{=}x_t; A | O_t) + \int_{\Delta_K^K} p_A(a | O_{t+1})\, I(X_t{=}x_t; X_{t+1:\infty} | O_t, A{=}a)\, da.  (39)

Note that the conditional density in the integral is pA(a|Ot+1), as it must be conditioned on all the observations up to and including Xt= xt. Since the column-wise Dirichlet distributions of A are conjugate to the discrete conditional distributions of (Xt+1|Xt), the distribution of (A|Ot) is also Dirichlet:

(A | X_{1:t-1}{=}x_{1:t-1}) \sim \left[ \mathrm{Dirichlet}\left( \alpha^{(t-1)}_{:j} \right) \right]_{j=1}^K,

where

\alpha^{(t)}_{kj} = \alpha_{kj} + \sum_{i=1}^{t-1} \delta_{j,x_i}\, \delta_{k,x_{i+1}}.

This means that the instantaneous information gain about A is straightforward to compute as the KL divergence between the Dirichlet distributions of A:xt−1 before and after the observation Xt= xt:

I(X_t{=}x_t; A | O_t) = D_{\mathrm{Dir}}\left( \alpha^{(t)}_{:j}, \alpha^{(t-1)}_{:j} \right) \quad \text{where } j = x_{t-1}.  (40)

Turning to the second term in (39), the IPI at time t given a known transition matrix A= a reduces to the instantaneous information gained about just the next variable in the sequence Xt+1 and is a function of the transition matrix a and the current and previous observations Xt= xt and Xt−1= xt−1 only [2]:

I(X_t{=}x_t; X_{t+1:\infty} | O_t, A{=}a) = \pi(x_t, x_{t-1}, a),  (41)

where π : (1..K) × (1..K) × ∆K^K → R is defined as

\pi(k, j, a) = \sum_{m=1}^K a_{mk} \log \frac{a_{mk}}{\sum_{l=1}^K a_{ml}\, a_{lj}}.  (42)

Assembling these parts and rewriting the integral in (39) as an expectation of a function of At = (A|Ot+1) yields

\text{let } A_t \sim \left[ \mathrm{Dirichlet}\left( \alpha^{(t)}_{:l} \right) \right]_{l=1}^K \text{ and } j = x_{t-1} \text{ in}
I_t = D_{\mathrm{Dir}}\left( \alpha^{(t)}_{:j}, \alpha^{(t-1)}_{:j} \right) + \mathrm{E}\, \pi(x_t, x_{t-1}, A_t).  (43)
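The two components of (43) can be sketched as follows (our own illustration, assuming NumPy; names are ours): the 'data' component E π(xt, xt−1, At) is approximated by Monte Carlo sampling of the transition matrix, alongside the point estimate π(xt, xt−1, E At) discussed in the next subsection:

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_fn(k, j, a):
    """pi(k, j, a) of eq. (42): the IPI in a Markov chain with known
    transition matrix a, whose columns are distributions over next states."""
    p_next = a[:, k]      # p(X_{t+1} | X_t = k)
    p_marg = a @ a[:, j]  # p(X_{t+1} | X_{t-1} = j), marginalising out X_t
    return float(np.sum(p_next * np.log(p_next / p_marg)))

def sample_transition(alpha):
    """Draw A with independent Dirichlet columns (params: columns of alpha)."""
    return np.column_stack([rng.dirichlet(alpha[:, j])
                            for j in range(alpha.shape[1])])

alpha_t = np.full((3, 3), 2.0)  # posterior Dirichlet parameters at time t
k, j = 0, 1                     # current and previous states

# Monte Carlo estimate of the 'data' component E pi(k, j, A_t).
mc_est = np.mean([pi_fn(k, j, sample_transition(alpha_t)) for _ in range(200)])
# Point estimate: plug in the expected transition matrix E A_t.
point_est = pi_fn(k, j, alpha_t / alpha_t.sum(axis=0))
assert mc_est >= 0.0 and point_est >= 0.0  # each pi value is a KL divergence
```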

Predictive information in Markov chains was analysed in our previous paper [2], but in that paper, the predictive information at any point in time was computed using a point estimate of the transition matrix (effectively, π(xt, xt−1, E At) in the present notation), disregarding any uncertainty the observer may have about it. The contribution to the predictive information resulting from information gained about the transition matrix was computed separately (the 'model information', or 'Bayesian surprise'), but no attempt was made to relate it to the theoretical basis of predictive information. We can now see that the predictive information in such a system does indeed comprise two terms, one of which is the Bayesian surprise, and the other the expectation of the Markov chain-based predictive information considered as a function of the unknown transition matrix. These two components will be referred to as the 'parameter' component and the 'data' component of the predictive information in the following section.


4.1 Estimation of 'data' component

As noted above, our previous work [2] involved computing an estimate of the IPI equivalent to π(xt, xt−1, E At) in the present notation, where At is a random variable representing the observer's beliefs about the transition matrix at time t. We can now consider using this as an estimate of the 'data' component of the IPI, which is E π(xt, xt−1, At), where the expectation is taken over the distribution of At. Another way of estimating this quantity is via Monte Carlo integration, computing the numerical average of π(xt, xt−1, a) for many values of a sampled independently from At. A third approach, which we detail in the appendix, is to derive an upper bound on E π(k, j, A) using the log-sum inequality. This yields

A \sim [\mathrm{Dirichlet}(\alpha_{:l})]_{l=1}^K \implies
\mathrm{E}\, \pi(k, j, A) \le \sum_{l \ne k} \frac{\alpha_{lj}}{\alpha_{\Sigma j}} \Big( \psi(\alpha_{\Sigma l} + \delta_{jl}) - \psi(\alpha_{\Sigma k} + \delta_{jk} + 1)
  + \sum_{m=1}^K \frac{\alpha_{mk} + \delta_{ml}\delta_{jk}}{\alpha_{\Sigma k} + \delta_{jk}} \big[ \psi(\alpha_{mk} + \delta_{ml}\delta_{jk} + 1) - \psi(\alpha_{ml} + \delta_{ml}\delta_{jl}) \big] \Big),  (44)

where αΣk denotes the sum of values in the kth column of α. These three approaches were implemented numerically and compared. The dataset for each test consisted of 400 matrices intended to serve as the parameter matrix α for some product-of-Dirichlets distribution over the columns of a K × K transition matrix. In each dataset, the parameter matrices were themselves drawn from a symmetric Dirichlet distribution and then multiplied by a 'weight' parameter. Then, for each combination of previous state j and current state k, the data component of the predictive information, E π(k, j, A), was estimated using the three different methods and compared with an accurate estimate obtained by Monte Carlo integration over 500 iterations. The results, illustrated in fig. 4, suggest that the upper bound obtained can be rather loose, especially as the transition matrix becomes even moderately sparse (the α = 1 case), and therefore not very useful as an approximation. In most cases, the point estimate method (shown in green in fig. 4) was closer to the true value than the upper bound (shown in red). The Monte Carlo method provided good estimates with as few as 50 samples, and so we would suggest that this is the best method unless the application cannot afford the added computational complexity.

5 Conclusions

The main conclusion of this technical note is that, for exchangeable random processes, the instantaneous predictive information (IPI) is the same as Itti and Baldi's Bayesian surprise: any information gained about the 'parameters' of the exchangeable process (which amount to a description of the underlying

[Figure 4 appears here: four error-histogram panels, (K=4, alpha=1, weight=10), (K=4, alpha=1, weight=50), (K=4, alpha=4, weight=10) and (K=4, alpha=4, weight=50), each comparing MC(50), the point estimate and the upper bound.]

Figure 4: Error histograms for three different methods of estimating instantaneous predictive information in a Markov chain. The three methods are Monte Carlo integration with 50 iterations (in blue); point estimate using the expectation of the transition matrix (in green); and the upper bound of [ref] (in red). In each case, the method was compared with accurate estimates obtained by Monte Carlo integration with 500 iterations. The approximations were tested on 400 product-of-Dirichlet distributions of varying sparsity (controlled by the 'alpha' parameter) and overall concentration (controlled by the 'weight' parameter). In all cases, the Monte Carlo estimator performed best and the upper bound worst. The point estimator gives better estimates as the 'weight' parameter increases. See text for further details. Note that much of the area under the red curves appears to be missing, because in many cases the upper bound was much higher than the true value and these points are not represented in the histograms.

de Finetti measure) is eventually manifest in the infinite sequence of future observations. This provides motivation for the Bayesian surprise approach in online data analysis or perceptual modelling: learning about the parameters of a perceptual model is useful because it provides information about future observations. Theorem 2 unifies the two concepts of instantaneous predictive information and Bayesian surprise for a substantial class of random processes, including the Chinese Restaurant Process. A secondary conclusion is that much of the analysis also applies to the random process defined as a first-order Markov chain with an unknown transition matrix. Though not an exchangeable process, the IPI in this model decomposes into two components, one of which is the Bayesian surprise, while the other is equal to the expectation of the IPI in a simpler, analytically tractable model: the Markov chain with known transition matrix. However, even though the IPI in this simpler model is easy to compute for a given transition matrix, its


expectation with respect to the current posterior over transition matrices is not. Hence, we derive an upper bound on this expectation. Our final conclusion is that this upper bound is too loose to be useful as an approximation, and that the 'point-estimate' and Monte Carlo methods described in § 4.1 are preferable.

Acknowledgements

Research supported by EPSRC grant GR/S82213/01.

A Dirichlet distributions

For a random variable Θ : Ω → ∆K with Dirichlet distribution parameterised by α ∈ RK, we write

$$ \Theta \mid \alpha \sim \mathrm{Dir}(\alpha), \qquad (45) $$
$$ p_\alpha(\theta) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad (46) $$

where the multivariate Beta function B : RK → R is defined as

$$ B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\bigl(\sum_{k=1}^{K} \alpha_k\bigr)}. \qquad (47) $$

For convenience, we let X = {1, . . . , K} denote the range of the index k and define αX = Σ_{k=1}^{K} α_k, in terms of which

$$ E(\Theta_k \mid \alpha) = \alpha_k / \alpha_X, \qquad (48) $$
$$ E(\log \Theta_k \mid \alpha) = \psi(\alpha_k) - \psi(\alpha_X). \qquad (49) $$
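Equations (48) and (49) are easy to check numerically. The following sketch (ours, assuming NumPy and SciPy are available; it is not part of the original report) compares both expectations against Monte Carlo averages over Dirichlet samples.

```python
# Sanity check of (48) and (49): compare closed-form Dirichlet moments
# against Monte Carlo averages. Our own illustration, not from the report.
import numpy as np
from scipy.special import digamma  # the psi function

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
alpha_X = alpha.sum()
samples = rng.dirichlet(alpha, size=200_000)

mean_mc = samples.mean(axis=0)                      # Monte Carlo E(Theta_k)
mean_exact = alpha / alpha_X                        # (48)
logmean_mc = np.log(samples).mean(axis=0)           # Monte Carlo E(log Theta_k)
logmean_exact = digamma(alpha) - digamma(alpha_X)   # (49)
```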

The entropy of a distribution θ drawn from a Dirichlet distribution is

$$ H(\theta) = -\sum_{k=1}^{K} \theta_k \log \theta_k. \qquad (50) $$

Hence, to compute the expected entropy E H(Θ) we need to consider E(Θk log Θk | α). Since the marginals of the Dirichlet distribution are Beta distributions,

$$ \Theta_k \mid \alpha \sim \mathrm{Beta}(\alpha_k, \alpha_X - \alpha_k), \qquad (51) $$

we will compute E(X log X | a, b) for a Beta-distributed random variable (X | a, b) ∼ Beta(a, b):

$$ \begin{aligned}
E(X \log X \mid a, b) &= \int_0^1 (x \log x)\, p(x; a, b)\, \mathrm{d}x \\
&= \int_0^1 x \log x\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1-x)^{b-1}\, \mathrm{d}x \\
&= \int_0^1 \log x\, \frac{a\,\Gamma(1+a+b)}{(a+b)\,\Gamma(1+a)\,\Gamma(b)}\, x^{a} (1-x)^{b-1}\, \mathrm{d}x \\
&= \frac{a}{a+b} \int_0^1 (\log x)\, p(x; a+1, b)\, \mathrm{d}x \\
&= \frac{a}{a+b}\, E(\log X \mid a+1, b) \\
&= \frac{a}{a+b} \bigl[ \psi(1+a) - \psi(1+a+b) \bigr].
\end{aligned} $$

So, for a Dirichlet distributed variable Θ, we obtain

$$ E(\Theta_k \log \Theta_k \mid \alpha) = \frac{\alpha_k}{\alpha_X} \bigl[ \psi(1+\alpha_k) - \psi(1+\alpha_X) \bigr], \qquad (52) $$

and the expected entropy is

$$ E\bigl(H(\Theta) \mid \alpha\bigr) = -\sum_{k=1}^{K} \frac{\alpha_k}{\alpha_X} \bigl[ \psi(1+\alpha_k) - \psi(1+\alpha_X) \bigr]. \qquad (53) $$
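Both (52) and (53) can be verified by sampling. The sketch below (ours, assuming NumPy/SciPy; not part of the original report) compares the closed forms against Monte Carlo averages.

```python
# Sanity check of (52) and (53) by Monte Carlo sampling from a Dirichlet.
# Our own illustration, not from the report.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([1.5, 2.5, 4.0])
aX = alpha.sum()
S = rng.dirichlet(alpha, size=300_000)

# (52): E(Theta_k log Theta_k)
tlt_mc = (S * np.log(S)).mean(axis=0)
tlt_exact = alpha / aX * (digamma(1 + alpha) - digamma(1 + aX))

# (53): expected entropy, with H(theta) = -sum_k theta_k log theta_k
ent_mc = -(S * np.log(S)).sum(axis=1).mean()
ent_exact = -tlt_exact.sum()
```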

Below are results for a few more expectations of functions of the components of a Dirichlet distribution. These can be obtained by direct integration using the Beta-marginal and aggregration properties of the Dirichlet distribution. αi (αj + δij ) αX (αX + 1) αi (αj + δij ) E Θi log Θj = αX (αX + 1) αi (αi + 1) E Θ2i log Θi = [ψ(αi + 2) − ψ(αX + 2)] αX (αX + 1) αi (αj + δij ) E Θi Θj log Θj = [ψ(αi + δij + 1) − ψ(αX + 2)] αX (αX + 1) E Θ i Θj =

B

(54) (55) (56) (57)
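Identities (54)–(57) can be spot-checked by Monte Carlo, both for i = j and i ≠ j. The sketch below (ours, assuming NumPy/SciPy; not part of the original report) computes the difference between each closed form and the corresponding sample average.

```python
# Spot check of (54)-(57) by Monte Carlo sampling from a Dirichlet.
# Our own illustration, not from the report.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 4.0])
aX = alpha.sum()
S = rng.dirichlet(alpha, size=400_000)

def check(i, j):
    """Return closed-form minus Monte Carlo value for (54)-(57)."""
    d = 1.0 if i == j else 0.0  # Kronecker delta
    e54 = alpha[i] * (alpha[j] + d) / (aX * (aX + 1))
    e55 = alpha[i] / aX * (digamma(alpha[j] + d) - digamma(aX + 1))
    e56 = (alpha[i] * (alpha[i] + 1) / (aX * (aX + 1))
           * (digamma(alpha[i] + 2) - digamma(aX + 2)))
    e57 = (alpha[i] * (alpha[j] + d) / (aX * (aX + 1))
           * (digamma(alpha[j] + d + 1) - digamma(aX + 2)))
    mc54 = (S[:, i] * S[:, j]).mean()
    mc55 = (S[:, i] * np.log(S[:, j])).mean()
    mc56 = (S[:, i] ** 2 * np.log(S[:, i])).mean()
    mc57 = (S[:, i] * S[:, j] * np.log(S[:, j])).mean()
    return np.array([e54 - mc54, e55 - mc55, e56 - mc56, e57 - mc57])

err_same = check(0, 0)   # i = j case
err_diff = check(0, 1)   # i != j case
```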

B IPI in Markov chains

Here we derive an upper bound on the parameter-conditional component of the instantaneous predictive information in (43), which is E π(k, j, A) for k, j ∈ 1..K when A ∼ [Dirichlet(α:l)]_{l=1}^{K}. Since the expectation operator is linear and distributes across sums, this amounts to computing

$$ \sum_{m=1}^{K} E(A_{mk} \log A_{mk}) \;-\; E\Bigl( \sum_{m=1}^{K} A_{mk} \log \sum_{l=1}^{K} A_{ml} A_{lj} \Bigr). $$

The first term (the negative of the expected entropy of A:k) is easily obtained, but the second term, the expected cross entropy of two distributions both depending on A,

cannot be expressed simply in closed form. Applying the log-sum inequality to the (negative) cross entropy gives

$$ \sum_{m=1}^{K} a_{mk} \log \sum_{l=1}^{K} a_{ml} a_{lj} \;\ge\; \sum_{m=1}^{K} \sum_{l=1}^{K} a_{mk} a_{lj} \log a_{ml}. $$
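This inequality holds pointwise for any stochastic matrix A whose columns sum to one, since log is concave. A quick numerical check on random matrices (our own sketch, assuming NumPy; not part of the original report):

```python
# Check the inequality above on random column-stochastic matrices.
# Our own illustration, not from the report.
import numpy as np

rng = np.random.default_rng(0)
K = 5
holds = []
for _ in range(100):
    # columns of A are independent Dirichlet draws, so each sums to one
    A = np.column_stack([rng.dirichlet(np.ones(K)) for _ in range(K)])
    for k in range(K):
        for j in range(K):
            lhs = np.sum(A[:, k] * np.log(A @ A[:, j]))
            rhs = np.sum(A[:, k][:, None] * A[:, j][None, :] * np.log(A))
            holds.append(lhs >= rhs - 1e-9)
```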

Expectations of products of the form a_{mk} a_{lj} log a_{ml} can be found by considering separately the cases when k, j, l and m are different or the same, using the independence of the columns of A and the results of § A. Omitting the details, the result can be combined into a single expression:

$$ E(a_{jx} a_{kz} \log a_{jk}) = \frac{\alpha_{kz} (\alpha_{jx} + \delta_{jk}\delta_{zx})}{\alpha_{\Sigma z} (\alpha_{\Sigma x} + \delta_{zx})} \bigl[ \psi(\alpha_{jk} + \delta_{jk}\delta_{zk} + \delta_{xk}) - \psi(\alpha_{\Sigma k} + \delta_{zk} + \delta_{xk}) \bigr], \qquad (58) $$

where $\alpha_{\Sigma k} = \sum_{j=1}^{K} \alpha_{jk}$. Including the summation over j yields

$$ -\sum_{j=1}^{K} E(a_{jx} a_{kz} \log a_{jk}) = \frac{\alpha_{kz}}{\alpha_{\Sigma z}} \Bigl[ \psi(\alpha_{\Sigma k} + \delta_{zk} + \delta_{xk}) - \sum_{j=1}^{K} \frac{\alpha_{jx} + \delta_{jk}\delta_{zx}}{\alpha_{\Sigma x} + \delta_{zx}}\, \psi(\alpha_{jk} + \delta_{jk}\delta_{zk} + \delta_{xk}) \Bigr]. \qquad (59) $$
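The combined expression (58) can be checked over all index combinations by sampling stochastic matrices whose columns are independent Dirichlet draws. The sketch below (ours, assuming NumPy/SciPy; not part of the original report) records the largest discrepancy between the formula and the Monte Carlo average.

```python
# Monte Carlo check of (58) over all combinations of j, x, k, z.
# Our own illustration, not from the report.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K, N = 3, 400_000
alpha = rng.uniform(1.5, 4.0, size=(K, K))
aS = alpha.sum(axis=0)  # column sums alpha_{Sigma k}

# A[n, m, c] = entry (m, c) of the n-th sampled stochastic matrix,
# with each column c drawn independently from Dirichlet(alpha[:, c]).
A = np.stack([rng.dirichlet(alpha[:, c], size=N) for c in range(K)], axis=2)

def e58(j, x, k, z):
    """Right-hand side of (58)."""
    djk, dzx = float(j == k), float(z == x)
    dzk, dxk = float(z == k), float(x == k)
    pref = alpha[k, z] * (alpha[j, x] + djk * dzx) / (aS[z] * (aS[x] + dzx))
    return pref * (digamma(alpha[j, k] + djk * dzk + dxk)
                   - digamma(aS[k] + dzk + dxk))

max_err = 0.0
for j in range(K):
    for x in range(K):
        for k in range(K):
            for z in range(K):
                mc = (A[:, j, x] * A[:, k, z] * np.log(A[:, j, k])).mean()
                max_err = max(max_err, abs(mc - e58(j, x, k, z)))
```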

Similarly, the expectations required to compute the entropy term are

$$ E(a_{jx} \log a_{jx}) = \frac{\alpha_{jx}}{\alpha_{\Sigma x}} \bigl[ \psi(\alpha_{jx} + 1) - \psi(\alpha_{\Sigma x} + 1) \bigr], \qquad (60) $$

or, including the summation over j and noticing that ψ(α_{Σx} + 1) is independent of j and can therefore be taken out of the sum,

$$ -\sum_{j=1}^{K} E(a_{jx} \log a_{jx}) = \psi(\alpha_{\Sigma x} + 1) - \sum_{j=1}^{K} \frac{\alpha_{jx}}{\alpha_{\Sigma x}}\, \psi(\alpha_{jx} + 1). \qquad (61) $$

However, in anticipation of the eventual combination with the expectation of the cross entropy, we choose to evaluate it as follows: since $\sum_{k=1}^{K} a_{kz} = 1$, we may write

$$ E(a_{jx} \log a_{jx}) = E\Bigl( \sum_{k=1}^{K} a_{kz} a_{jx} \log a_{jx} \Bigr) = \sum_{k=1}^{K} E(a_{kz} a_{jx} \log a_{jx}). $$

Using the same kind of reasoning as that used to obtain (58), we obtain

$$ E(a_{kz} a_{jx} \log a_{jx}) = \frac{\alpha_{kz} (\alpha_{jx} + \delta_{jk}\delta_{zx})}{\alpha_{\Sigma z} (\alpha_{\Sigma x} + \delta_{zx})} \bigl[ \psi(\alpha_{jx} + \delta_{jk}\delta_{zx} + 1) - \psi(\alpha_{\Sigma x} + \delta_{zx} + 1) \bigr]. \qquad (62) $$

Notice that the prefactor on the right-hand side is the same as that in (58). Summing over j yields

$$ -\sum_{j=1}^{K} E(a_{kz} a_{jx} \log a_{jx}) = \frac{\alpha_{kz}}{\alpha_{\Sigma z}} \Bigl[ \psi(\alpha_{\Sigma x} + \delta_{zx} + 1) - \sum_{j=1}^{K} \frac{\alpha_{jx} + \delta_{jk}\delta_{zx}}{\alpha_{\Sigma x} + \delta_{zx}}\, \psi(\alpha_{jx} + \delta_{jk}\delta_{zx} + 1) \Bigr]. \qquad (63) $$

The upper bound on the information is obtained by summing each expression over j and k and taking the difference:

$$ I(x \mid z, \alpha) \le \sum_{k=1}^{K} \frac{\alpha_{kz}}{\alpha_{\Sigma z}} \Bigl[ \psi(\alpha_{\Sigma k} + \delta_{zk} + \delta_{xk}) - \psi(\alpha_{\Sigma x} + \delta_{zx} + 1) + \sum_{j=1}^{K} \frac{\alpha_{jx} + \delta_{jk}\delta_{zx}}{\alpha_{\Sigma x} + \delta_{zx}} \bigl( \psi(\alpha_{jx} + \delta_{jk}\delta_{zx} + 1) - \psi(\alpha_{jk} + \delta_{jk}\delta_{zk} + \delta_{xk}) \bigr) \Bigr]. \qquad (64) $$

Note that the ψ(·) terms cancel when k = x, leaving:

$$ I(x \mid z, \alpha) \le \sum_{k \ne x} \frac{\alpha_{kz}}{\alpha_{\Sigma z}} \Bigl[ \psi(\alpha_{\Sigma k} + \delta_{zk}) - \psi(\alpha_{\Sigma x} + \delta_{zx} + 1) + \sum_{j=1}^{K} \frac{\alpha_{jx} + \delta_{jk}\delta_{zx}}{\alpha_{\Sigma x} + \delta_{zx}} \bigl( \psi(\alpha_{jx} + \delta_{jk}\delta_{zx} + 1) - \psi(\alpha_{jk} + \delta_{jk}\delta_{zk}) \bigr) \Bigr]. \qquad (65) $$
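As a check on the derivation, the bound (64) should dominate the quantity it was derived from, which (combining the entropy and cross-entropy terms above) is the expected KL divergence E KL(A_{:x} || A A_{:z}). The sketch below (ours, assuming NumPy/SciPy; the helper names are hypothetical and not from the report) evaluates the right-hand side of (64) and compares it with a Monte Carlo estimate.

```python
# Numerical check that the right-hand side of (64) upper-bounds the
# expected KL divergence E KL(A_{:x} || A A_{:z}). Our own illustration.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K = 4
alpha = rng.uniform(1.0, 5.0, size=(K, K))  # one Dirichlet column per state
aS = alpha.sum(axis=0)                      # column sums alpha_{Sigma k}

def upper_bound(x, z):
    """Right-hand side of (64)."""
    dzx = float(z == x)
    total = 0.0
    for k in range(K):
        dzk, dxk = float(z == k), float(x == k)
        term = digamma(aS[k] + dzk + dxk) - digamma(aS[x] + dzx + 1.0)
        for j in range(K):
            djk = float(j == k)
            w = (alpha[j, x] + djk * dzx) / (aS[x] + dzx)
            term += w * (digamma(alpha[j, x] + djk * dzx + 1.0)
                         - digamma(alpha[j, k] + djk * dzk + dxk))
        total += alpha[k, z] / aS[z] * term
    return total

def mc_info(x, z, n=2000):
    """Monte Carlo estimate of E KL(A_{:x} || A A_{:z})."""
    vals = np.empty(n)
    for i in range(n):
        A = np.column_stack([rng.dirichlet(alpha[:, c]) for c in range(K)])
        p, q = A[:, x], A @ A[:, z]
        vals[i] = np.sum(p * np.log(p / q))
    return float(vals.mean())

gaps = [upper_bound(x, z) - mc_info(x, z) for x in range(K) for z in range(K)]
```

The gaps are typically well above zero, which is consistent with the looseness of the bound observed in fig. 4.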

References

[1] Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems (NIPS 2005), volume 19, pages 547–554, Cambridge, MA, 2005. MIT Press.

[2] Samer A. Abdallah and Mark D. Plumbley. Information dynamics: Patterns of expectation and surprise in the perception of music. Connection Science, 21(2):89–117, 2009.

[3] D. Heath and W. Sudderth. De Finetti's theorem on exchangeable variables. American Statistician, 30(4):188–189, 1976.

[4] Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306, 2009.

[5] James P. Crutchfield and Karl Young. Inferring statistical complexity. Phys. Rev. Lett., 63(2):105–108, Jul 1989.

[6] J. P. Crutchfield and D. P. Feldman. Statistical complexity of simple 1D spin systems. Physical Review E, 55(2):1239R–1243R, 1997.

[7] William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning. Neural Computation, 13:2409–2463, 2001.

[8] D. Haussler, M. Kearns, and R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14(1):83–113, 1994.

[9] David Haussler and Manfred Opper. General bounds on the mutual information between a parameter and n conditionally independent observations. In Proceedings of the Seventh Annual ACM Workshop on Computational Learning Theory (COLT '95), pages 402–411, New York, NY, USA, 1995. ACM.

[10] David Haussler and Manfred Opper. Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451–2492, 1997.

[11] W. McGill. Multivariate information transmission. IRE Professional Group on Information Theory, 4(4):93–111, 1954.
