On Universal Linear Least-Squares Prediction

Andrew C. Singer∗ and Meir Feder†

August 23, 2000

Abstract

We consider the problems of linear prediction and adaptive filtering using an approach that is motivated by recent developments in the universal coding and computational learning theory literature. This development provides a novel perspective on the adaptive filtering problem and represents a significant departure from traditional adaptive filtering methodologies. In this context, we demonstrate a sequential algorithm for linear prediction whose accumulated squared prediction error, for every possible sequence, is asymptotically as small as that of the best fixed linear predictor for that sequence. The redundancy, or excess prediction error above that of the best linear predictor for that sequence, is upper bounded by roughly A² ln(n)/n, where n is the data length and the sequence is assumed to be bounded by some A, though A need not be known in advance.

Index Terms: Universal algorithms, prediction, sequential probability assignment.

1 Introduction

There are a number of problems in signal processing, control, and information theory that share a common framework in which a signal is to be processed in the presence of a large degree of uncertainty. This problem can often be described with the following generic sequential decision problem, as described in [1]. A player observes a sequence of outcomes, x[t], t = 1, 2, .... At each time t, the player is required to select a strategy, b[t], from a given family of permitted strategies, b ∈ B. The player then incurs a loss, l(b[t], x[t]), which is a function of the selected strategy and the next outcome. There is also a class K of “experts” available, who offer “advice” b_k[t], k ∈ K. The goal of the game is to develop a sequential strategy for which the sequentially accumulated average loss, $L_n(\{b[t]\}, x) = \frac{1}{n}\sum_{t=1}^{n} l(b[t], x[t])$, is asymptotically as small as that of the best expert in the class, $\min_k L_n(\{b_k[t]\}, x)$. The algorithm must be sequential, in that the player can only observe the sequence of outcomes and the actions of the experts as they occur. The relative performance of any such strategy can be measured in terms of the excess loss, or regret, against the best expert in the class, $R[n] = L_n(\{b[t]\}, x) - \min_k L_n(\{b_k[t]\}, x)$. Following the universal source coding literature, when R[n] → 0 we call such a strategy “universal” with respect to the class K. In [2], Vovk points out that a good strategy

∗ Andrew Singer is with the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Email: [email protected]
† Meir Feder is with the Department of EE-Systems at Tel Aviv University. Email: [email protected]
Portions of this work were presented at the International Symposium on Information Theory, 2000, Sorrento, Italy.


for a class of problems in this framework has been known for decades, namely the Bayesian mixture, which is optimal in an MAP sense when the loss function is the log-loss, such that minimizing the associated loss corresponds to maximizing the a posteriori probability. In the last decade, a number of researchers have developed formal means for extending Bayesian prior probability models to include merging strategies for more general loss functions [3–5]; the “aggregating algorithm” of Vovk [3] and the “weighted majority algorithm” of Haussler, Kivinen, and Warmuth [4] are two related examples. Analytical results bounding the excess loss have shown these approaches to be minimax optimal for a variety of loss functions [6] for binary-valued data and for a finite class of experts. This relatively generic sequential decision problem is closely related to a number of problems in information theory, including universal source coding [7–12], sequential probability assignment [13], binary prediction [1, 14–16], and classification [17–19] of sources with unknown statistics. For many of these problems, a variety of universal algorithms have appeared in the research literature, with some of the earliest work in the field of lossless source coding starting with the pioneering work of Fittingoff [20], Davisson [21], Elias [22], Ziv [23] and later Lempel and Ziv [8], Krichevsky and Trofimov [24], Rissanen and Langdon [25], and others [26–30]. The connections between universal source coding and sequential decision problems with other loss functions can be traced back to the work of Cover [31], Rissanen [10], Rissanen and Langdon [25], Ryabko, and others [15, 32, 33]. Following the development of universal source coding algorithms based on Bayesian mixtures and sequential probability assignment [9, 11, 13], we recently developed a related algorithm for universal linear prediction of real-valued data with respect to both finite and continuous classes of experts [34–36].
Our approach is based on some of the well-known properties of Bayesian mixture probability models for sequential encoding of binary data [10, 22, 24, 27, 32, 33]. The essential idea behind universal sequential probability assignment [13] is to obtain a probability model for a sequence of data that is almost as good a fit as the best out of a large class of models. Of course, this could be achieved by collecting all of the data, testing each of the models on the data, and then simply selecting the model in the class with the largest probability. Sequential probability assignment is a method of obtaining a sequential algorithm, i.e., one with minimal delay, that achieves the same goal. For general loss functions, we can use these ideas as follows. Given a sequence of data x[t] and a class K of “experts”, we can pretend that each of the experts is simply a probability model for the data, where the probability from the k-th expert is of the form $P_k(\{x[t]\}_{t=1}^{n}) = \exp(-L_n(\{b_k[t]\}, x))$. Now, the expert with the smallest loss in the class is also the probability model with the highest probability. We can then apply methods of universal sequential probability assignment to achieve a probability that is almost as large as the largest probability in the class. The key to applying Bayesian mixtures to problems with general loss functions is to define a “universal algorithm” whose probability is of the proper form and is at least as large as the universal probability, so that the loss of this algorithm is therefore almost as small as that of the best expert. In this paper we explore this problem in detail for the quadratic loss function as applied to


prediction and filtering of arbitrary sequences. The approach taken in this paper, based on universal sequential probability assignment, is very similar to the use of Vovk’s Aggregating Algorithm for linear regression with the square-error loss function. During the development of this paper, Vovk [2] obtained results similar to ours for the regression problem, where the sequence x[n] is to be predicted based on regressors $\vec{y}[n]$ of order p. The results due to Vovk arise as an application of his general Aggregating Algorithm to the specific problem of regression with square-error loss. The results obtained in this paper arise from a different approach, with a simpler and more direct proof.

2 Linear Prediction

While the linear prediction problem has received considerable attention in the research literature over the past several decades [37], remarkably, this relatively simple problem formulation continues to provide a source of as yet unanswered questions. For example, in the deterministic case, it is relatively simple to find the best predictor coefficients to minimize the total squared error over a given sequence of data, i.e., to find $a_k$, $k = 1, \ldots, p$, such that the total prediction error

$E_a[N] = \sum_n \Big( x[n] - \sum_{k=1}^{p} a_k\, x[n-k] \Big)^2$

is minimized. However, much is still unknown about how small the sequentially accumulated prediction error can be when the coefficients must be determined on-line as the data are revealed, i.e., how small the on-line error

$E[N] = \sum_n \Big( x[n] - \sum_{k=1}^{p} a_k[n-1]\, x[n-k] \Big)^2$

can be made for all sequences, where $a_k[n-1]$ must be determined based only on $x[1], \ldots, x[n-1]$. In this paper, we will address this issue, and provide an algorithm whose on-line prediction error can be made almost as small as $\min_a E_a[N]$, i.e., almost as small as the prediction error of the best fixed linear predictor for that sequence. The same issue arises in adaptive filtering, where the coefficients of a linear filter are adjusted to construct an estimate $\hat{x}[n]$ of a desired signal $x[n]$ based on a vector of observations $\vec{y}[n] = [y_1[n], \ldots, y_p[n]]^T$. While a variety of methods have been developed to sequentially adapt the coefficients $w_k$ such that the expected filtering error

$e_w[N] = E\Big[ \Big( x[n] - \sum_{k=1}^{p} w_k\, y_k[n] \Big)^2 \Big]$

is minimized, little is known about the total sequentially accumulated filtering error,

$E_w[N] = \sum_n \Big( x[n] - \sum_{k=1}^{p} w_k[n-1]\, y_k[n] \Big)^2.$

The linear prediction problem is easily seen to be a special case of the adaptive filtering problem, where $\vec{y}[n] = [x[n-1], \ldots, x[n-p]]^T$. The main results of this paper include an adaptive filtering and prediction algorithm whose sequentially accumulated prediction error, for every individual sequence, is asymptotically as small as that of the best fixed predictor for that sequence. In section 3, a scalar predictor is derived along with a proof of its performance. These results are then generalized to the vector case in section 4. The bounds developed in this paper improve on those developed in [35] and are obtained with a more direct and intuitive proof. The basic idea behind the proof is to define a “probability” assignment from each of the continuum of all possible linear predictors to the data sequence, such that the probability is an exponentially decreasing function of the total squared error for that predictor. If a universal probability is defined as an a priori mixture of the assigned probabilities, then to first order in the exponent, the universal probability will be dominated by the largest exponential, i.e., the probability assignment of the model with the smallest total squared error. The conjugate prior is used so that the mixture over the parameters can be obtained in closed form. The universal probability assignment can then be related to the accumulated squared error of the universal predictor, giving the desired result.

3 First-Order Prediction and Adaptive Filtering

In this section, we consider the scalar adaptive filtering problem. Given a sequence of scalar observations $y[t]$, $t = 1, \ldots, n$, and past values of the sequence $x[t]$, $t = 1, \ldots, n-1$, a first-order estimate or prediction of the signal is to be constructed of the form $\hat{x}[n] = w\,y[n]$. This can be viewed as a classical linear regression problem, where the parameter $w$ is sought such that the total squared prediction error is minimized over a batch of data of length $N$. In this case, $w$ would be selected according to the following minimization:

$w[N] = \arg\min_w \sum_{n=1}^{N} (x[n] - w\,y[n])^2. \qquad (1)$

Minimizing (1) yields the well-known equation for the least-squares optimal parameter,

$w[N] = \frac{\sum_{n=1}^{N} x[n]\,y[n]}{\sum_{n=1}^{N} y[n]^2} \qquad (2)$

$= \frac{R^N_{xy}[0]}{R^N_{yy}[0]}, \qquad (3)$

where $R^N_{ab}[m] = \sum_{n=1}^{N} a[n]\,b[n+m]$ and $N$ denotes the block of data over which the minimization was taken.

The parameter $w[N]$ can be computed recursively with the recursive least-squares (RLS) algorithm according to

$w[t] = w[t-1] + K[t]\,(x[t] - w[t-1]\,y[t]) = w[t-1] + K[t]\,\Delta[t],$

where $\Delta[t]$ is the instantaneous prediction error. The gain $K[t]$ and the inverse deterministic autocorrelation $P[t] = (R^t_{yy}[0])^{-1}$ satisfy

$K[t] = \frac{P[t-1]\,y[t]}{1 + y^2[t]\,P[t-1]},$

$P[t] = P[t-1] - \frac{P^2[t-1]\,y^2[t]}{1 + y^2[t]\,P[t-1]}.$
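As a quick numerical check, the scalar recursion above can be compared against the batch solution (2). This is a minimal sketch of ours, not code from the paper; the data and the initialization $P[0] = 10^6$ (which approximates the unregularized solution) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.uniform(-1, 1, n)
x = 0.7 * y + 0.05 * rng.standard_normal(n)

# Scalar RLS: w[t] = w[t-1] + K[t] (x[t] - w[t-1] y[t]),
# with K[t] = P[t-1] y[t] / (1 + y^2[t] P[t-1]) and the matching P[t] update.
w, P = 0.0, 1e6          # large P[0] approximates the unregularized batch solution
for t in range(n):
    K = P * y[t] / (1.0 + y[t]**2 * P)
    w = w + K * (x[t] - w * y[t])
    P = P - P**2 * y[t]**2 / (1.0 + y[t]**2 * P)

w_batch = np.sum(x * y) / np.sum(y**2)   # eq. (2)
print(abs(w - w_batch))                  # recursion agrees with the batch value
```

As the text notes next, this agreement is about the final parameter; the sequentially accumulated error of the predictor $\hat{x}[t] = w[t-1]y[t]$ is a different quantity.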

However, while the parameters of the RLS algorithm are guaranteed to result in the same parameter $w[N]$ that could have been obtained by processing the data all at once, the accumulated square-prediction error of a sequential predictor that used these parameters, $\hat{x}[t] = w[t-1]\,y[t]$, is lower bounded by, and will usually be substantially greater than, that of a predictor which used the best batch parameter $w[N]$ for the entire sequence, $\hat{x}[t] = w[N]\,y[t]$. A slightly more general loss function which often arises in many signal processing problems is

$\min_w \sum_{t=1}^{N} (x[t] - w\,y[t])^2 + \delta (w - w_0)^2,$

where $\delta \ge 0$ and $w_0$ is given. Choosing $\delta = 0$ yields the original least-squares expression. Here, $\delta$ is typically used to incorporate additional a priori knowledge concerning $w$ into the problem statement [38]. This can enable a tradeoff between confidence in the parameter $w$ suggested by the observations and prior knowledge that $w$ should be near $w_0$. Setting $\delta$ very small places most of the emphasis on the observations, while a large value of $\delta$ places a strong emphasis on the prior knowledge that $w$ should be near $w_0$. In this paper, we will assume that $w_0 = 0$, which could also be obtained through a suitable change of variables. The minimizing value of $w$ for this problem is given by

$w^*[N] = \frac{\sum_{n=1}^{N} x[n]\,y[n]}{\sum_{n=1}^{N} y[n]^2 + \delta} = \frac{R^N_{xy}[0]}{R^N_{yy}[0] + \delta}.$

These parameters can also be computed recursively using the RLS algorithm, with the modification $P[t] = (R^t_{yy}[0] + \delta)^{-1}$.

For the scalar linear prediction problem,

$\min_a \sum_{t=1}^{N} (x[t] - a\,x[t-1])^2 + \delta (a - a_0)^2,$

the minimizing value of $a$ for this problem is given by

$a^*[N] = \frac{\sum_{n=1}^{N} x[n]\,x[n-1]}{\sum_{n=1}^{N} x[n-1]^2 + \delta} = \frac{R^N_{xx}[-1]}{R^{N-1}_{xx}[0] + \delta}.$

The main contribution of this paper is a universal linear prediction algorithm whose accumulated average square error is as small, to within a negligible term, as that of a linear predictor whose parameters were preset to the best set of parameters given the sequences in advance. Here we consider first a scalar universal predictor. In our derivation we only assume that the sequences $x[1], x[2], \ldots$ and $y[1], y[2], \ldots$ are bounded, i.e., $|x[t]| < A_x < \infty$ and $|y[t]| < A_y < \infty$ for all $t$, but otherwise arbitrary, real-valued sequences. It is not required, nor is it assumed, that the values of $A_x$, $A_y$, or any other properties of the sequences are known in advance. An explicit description of the universal predictor we suggest is as follows. We can write $\tilde{x}_u[n] = w_u[n-1]\,y[n]$, where

$w_u[n] = \frac{R^n_{xy}[0]}{R^{n+1}_{yy}[0] + \delta},$

and $\delta > 0$ is a constant. The following theorem relates the performance of the universal predictor,

$l_n(x, \tilde{x}_u) = \sum_{t=1}^{n} (x[t] - \tilde{x}_u[t])^2,$

to that of the best batch predictor.

Theorem 1  Let $x[n]$ and $y[n]$ be bounded, real-valued, arbitrary sequences, such that $|x[n]| < A_x$ and $|y[n]| < A_y$ for all $n$. Then $l_n(x, \tilde{x}_u)$ satisfies

$l_n(x,\tilde{x}_u) \le \min_w \left\{ l_n(x,\hat{x}_w) + A_x^2 \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right) + \left( 2wR^n_{xy}[0] - w^2 R^n_{yy}[0] - \frac{(R^n_{xy}[0])^2}{\delta + R^n_{yy}[0]} \right) \right\}$

$l_n(x,\tilde{x}_u) \le \min_w \{ l_n(x,\hat{x}_w) \} + \delta\, w[n]^2\, \frac{R^n_{yy}[0]}{R^n_{yy}[0] + \delta} + A_x^2 \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right)$

$l_n(x,\tilde{x}_u) \le \min_w \left\{ l_n(x,\hat{x}_w) + \delta w^2 \right\} + A_x^2 \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right).$

Therefore,

$\frac{1}{n}\, l_n(x,\tilde{x}_u) \le \min_w \left\{ \frac{1}{n}\, l_n(x,\hat{x}_w) + \frac{\delta w^2}{n} \right\} + \frac{A_x^2}{n} \ln\!\left(1 + \frac{n A_y^2}{\delta}\right).$

Theorem 1 tells us that the average squared prediction error of the scalar universal predictor is within $O(n^{-1}\ln(n))$ of that of the best batch scalar linear predictor, uniformly, for every pair of individual sequences $x[n]$ and $y[n]$. The theorem also holds for every individual sequence $x[n]$ in the prediction problem where $y[n] = x[n-1]$. For example,

$\frac{1}{n}\, l_n(x,\tilde{x}_u) \le \min_a \left\{ \frac{1}{n}\, l_n(x,\hat{x}_a) + \frac{\delta a^2}{n} \right\} + \frac{A_x^2}{n} \ln\!\left(1 + \frac{n A_x^2}{\delta}\right),$

where the class of linear predictors is given by $\hat{x}_a[n] = a\,x[n-1]$.
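The universal predictor and the third bound of Theorem 1 can be checked empirically. The following is a sketch of ours (not the paper's code); the values of `delta`, `Ax`, and the random bounded sequence are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, Ax = 500, 1.0, 1.0
x = rng.uniform(-Ax, Ax, n)          # arbitrary bounded sequence, |x[t]| < Ax
y = np.concatenate(([0.0], x[:-1]))  # prediction problem: y[n] = x[n-1]

loss_u = 0.0
Rxy, Ryy = 0.0, 0.0                  # running R^{n-1}_xy[0] and R^{n-1}_yy[0]
for t in range(n):
    Ryy += y[t]**2                   # R^n_yy[0] now includes y[t]
    xu = Rxy / (Ryy + delta) * y[t]  # universal prediction  x~_u[t]
    loss_u += (x[t] - xu)**2
    Rxy += x[t] * y[t]

# Best regularized batch predictor, and the Theorem 1 bound with c = Ax^2.
w_star = np.sum(x * y) / (np.sum(y**2) + delta)
best = np.sum((x - w_star * y)**2) + delta * w_star**2
bound = best + Ax**2 * np.log(1.0 + np.sum(y**2) / delta)
print(loss_u <= bound)               # the accumulated loss respects the bound
```

The per-sample gap `bound - best` behaves like $A_x^2 \ln(n)/n$ after dividing by $n$, which is the redundancy rate quoted in the abstract.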

3.1 Proof of Theorem 1

Suppose that a prediction of the value of $x[n+1]$ is formed based on the prior values of $x[n]$ and $y[n]$ in the following way. Given a continuum of predictors, each with a different value of the parameter $w$, denoted $\hat{x}_w[n] = w\,y[n]$, then for each of the predictors a measure of their sequential prediction performance, or loss, is constructed,

$l_n(x, \hat{x}_w) = \sum_{t=1}^{n} (x[t] - w\,y[t])^2.$

Also define a function of the loss, namely the “probability,”

$P_w(x^n) = \exp\!\left(-\frac{1}{2c}\sum_{k=1}^{n} (x[k] - w\,y[k])^2\right) = \exp\!\left(-\frac{1}{2c}\, l_n(x, \hat{x}_w)\right) \qquad (4)$

$= \exp\!\left(-\frac{1}{2c}\left(R^n_{xx}[0] - 2wR^n_{xy}[0] + w^2 R^n_{yy}[0]\right)\right), \qquad (5)$

which can be viewed as a probability assignment of the predictor with parameter $w$ to the data $x[t]$, for $1 \le t \le n$, induced by the performance of $w$ on the sequences $x[n]$ and $y[n]$. We will refer to such exponential functions of the loss as probabilities, in analogy to problems in sequential data compression. Similar use of exponentiated loss functions, and Bayesian mixtures over such exponentiated loss functions, has led to a variety of results in the on-line algorithms and machine learning literature [2]. We construct a universal estimate of the probability of the sequence $x[n]$ as an a priori weighted combination, or mixture, of all of the probabilities,

$P_u(x^n) = \int_{-\infty}^{\infty} p(w)\, P_w(x^n)\, dw, \qquad (6)$

where $p(w)$ is an a priori weighting assigned to the parameter $w$. Since the assigned probabilities for the square-error loss are Gaussian in form, the Gaussian prior enables the integration of the probabilities assigned to the sequence. We let

$p(w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{w^2}{2\sigma^2}\right).$


The universal probability assignment can thus be obtained in closed form; integrating (6),

$P_u(x^n) = \frac{1}{\sqrt{\frac{\sigma^2}{c}R^n_{yy}[0] + 1}} \exp\!\left\{-\frac{1}{2c}\left(\frac{\frac{\sigma^2}{c}R^n_{xx}[0]R^n_{yy}[0] + R^n_{xx}[0] - \frac{\sigma^2}{c}(R^n_{xy}[0])^2}{\frac{\sigma^2}{c}R^n_{yy}[0] + 1}\right)\right\} \qquad (7)$

$= \frac{1}{\sqrt{\delta^{-1}R^n_{yy}[0] + 1}} \exp\!\left\{-\frac{1}{2c}\left(\frac{R^n_{xx}[0]R^n_{yy}[0] + \delta R^n_{xx}[0] - (R^n_{xy}[0])^2}{R^n_{yy}[0] + \delta}\right)\right\}, \qquad (8)$

where $\delta = c/\sigma^2$.
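The closed form (8) can be verified against a direct numerical evaluation of the mixture integral (6). This is a sketch of ours with illustrative parameter values, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, sigma2 = 25, 1.0, 2.0
delta = c / sigma2
y = rng.uniform(-1, 1, n)
x = rng.uniform(-1, 1, n)
Rxx, Rxy, Ryy = np.sum(x * x), np.sum(x * y), np.sum(y * y)

# Closed form (8).
Pu_closed = np.exp(-(Rxx*Ryy + delta*Rxx - Rxy**2) / (2.0*c*(Ryy + delta))) \
            / np.sqrt(Ryy/delta + 1.0)

# Direct numerical evaluation of (6): Gaussian prior times the assigned
# probabilities P_w(x^n) from (5), integrated over a fine grid of w.
w = np.linspace(-8.0, 8.0, 200001)
prior = np.exp(-w**2 / (2.0*sigma2)) / np.sqrt(2.0*np.pi*sigma2)
Pw = np.exp(-(Rxx - 2.0*w*Rxy + w**2*Ryy) / (2.0*c))
Pu_numeric = np.sum(prior * Pw) * (w[1] - w[0])

print(abs(Pu_closed - Pu_numeric) / Pu_closed)   # relative discrepancy is tiny
```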

Comparing (4) and (7) yields, for any $w$,

$-2c \ln P_u(x^n) = -2c \ln P_w(x^n) + c \ln(1 + \delta^{-1}R^n_{yy}[0]) + \left( 2wR^n_{xy}[0] - w^2 R^n_{yy}[0] - \frac{(R^n_{xy}[0])^2}{R^n_{yy}[0] + \delta} \right) \qquad (9)$

$= l_n(x, \hat{x}_w) + c \ln(1 + \delta^{-1}R^n_{yy}[0]) + 2wR^n_{xy}[0] - w^2 R^n_{yy}[0] - \frac{(R^n_{xy}[0])^2}{R^n_{yy}[0] + \delta}. \qquad (10)$

We would like to have the universal probability be as large as the probability assigned to the sequence by the predictor with the smallest prediction error, i.e., the largest probability among the continuum of probabilities $P_w(x^n)$. For each $n$, the value of $w$ which maximizes the assigned probability is the value of $w$ given by (2). The probability assigned to the data by the maximizing value of $w$ is given by

$\max_w P_w(x^n) = P_{w[n]}(x^n) = \exp\!\left\{-\frac{1}{2c}\left(R^n_{xx}[0] - \frac{(R^n_{xy}[0])^2}{R^n_{yy}[0]}\right)\right\} = \exp\!\left(-\frac{1}{2c}\, l_n(x, \hat{x}_{w[n]})\right).$

To compare the universal probability to the maximizing probability, we look at their ratio,

$\frac{P_u(x^n)}{\max_w P_w(x^n)} = \frac{1}{\sqrt{\delta^{-1}R^n_{yy}[0] + 1}} \exp\!\left\{-\frac{\delta\,(R^n_{xy}[0])^2}{2c\, R^n_{yy}[0]\,(R^n_{yy}[0] + \delta)}\right\}.$

Taking the logarithm of both sides, we obtain

$\ln \frac{P_u(x^n)}{\max_w P_w(x^n)} = -\left[ \frac{1}{2}\ln(1 + \delta^{-1}R^n_{yy}[0]) + \frac{\delta\,(R^n_{xy}[0])^2}{2c\, R^n_{yy}[0]\,(R^n_{yy}[0] + \delta)} \right].$

Now from (4),

$-2c \ln P_u(x^n) = l_n(x, \hat{x}_{w[n]}) + \delta\, w[n]^2\, \frac{R^n_{yy}[0]}{R^n_{yy}[0] + \delta} + c \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right)$

$= \min_w\{ l_n(x, \hat{x}_w) \} + \delta\, w[n]^2\, \frac{R^n_{yy}[0]}{R^n_{yy}[0] + \delta} + c \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right).$

It is also convenient to consider the above comparison for

$w^*[n] = \arg\min_w\{ l_n(x, \hat{x}_w) + \delta w^2 \} = \frac{R^n_{xy}[0]}{R^n_{yy}[0] + \delta}.$


For this value of $w = w^*[n]$, we obtain

$-2c \ln P_u(x^n) = l_n(x, \hat{x}_{w^*[n]}) + c \ln(1 + \delta^{-1}R^n_{yy}[0]) + 2w^*[n]R^n_{xy}[0] - w^*[n]^2 R^n_{yy}[0] - \frac{(R^n_{xy}[0])^2}{R^n_{yy}[0] + \delta}$

$= l_n(x, \hat{x}_{w^*[n]}) + \delta\, w^*[n]^2 + c \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right)$

$= \min_w\{ l_n(x, \hat{x}_w) + \delta w^2 \} + c \ln\!\left(1 + R^n_{yy}[0]\,\delta^{-1}\right). \qquad (11)$
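The identity (11) is purely algebraic and can be confirmed numerically. A minimal sketch (ours, with illustrative values of `c` and `delta`):

```python
import numpy as np

rng = np.random.default_rng(3)
n, c, delta = 50, 1.0, 0.5
y = rng.uniform(-1, 1, n)
x = rng.uniform(-1, 1, n)
Rxx, Rxy, Ryy = np.sum(x*x), np.sum(x*y), np.sum(y*y)

# Left side: -2c ln Pu(x^n) from the closed form (8).
lhs = c * np.log(1.0 + Ryy / delta) + (Rxx*Ryy + delta*Rxx - Rxy**2) / (Ryy + delta)

# Right side: regularized loss at its minimizer w*[n], plus the log term, as in (11).
w_star = Rxy / (Ryy + delta)
loss = Rxx - 2.0*w_star*Rxy + w_star**2 * Ryy     # l_n(x, x^_{w*}) via (5)
rhs = loss + delta * w_star**2 + c * np.log(1.0 + Ryy / delta)

print(abs(lhs - rhs))    # agreement to floating-point precision
```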

We now have a method of assigning a universal probability to the sequence that achieves, to first order in the exponent, the same sequential probability as the best predictor, i.e., one with either $w^*[n]$ or $w[n]$. We must now relate this universal probability to an actual prediction. As we will see, the universal probability will suggest a particular linear predictor, which is obtained through a mixture over all of the predictors $w$. In fact, the mixture over the parameters $w$ is precisely the same as the mixture over the assigned probabilities. Since each of the predictors assigns a probability which is exponential in the prediction error for that predictor, we look to the exponent of $P_u(x^n)$ for the predictor. Specifically, we have

$P_w(x_n|x^{n-1}) = \exp\!\left(-\frac{1}{2c}(x[n] - w\,y[n])^2\right),$

relating the prediction error at time $n$ to the probability $P_w(x_n|x^{n-1})$. Similarly, we expect to obtain an expression of this form for $P_u(x_n|x^{n-1})$. From (7), we obtain

$P_u(x_n|x^{n-1}) = \sqrt{\frac{R^{n-1}_{yy}[0] + \delta}{R^{n}_{yy}[0] + \delta}}\; \exp\!\left\{-\frac{1}{2c}\left(\frac{R^{n-1}_{yy}[0] + \delta}{R^{n}_{yy}[0] + \delta}\right)\left(x[n] - \frac{R^{n-1}_{xy}[0]}{R^{n-1}_{yy}[0] + \delta}\, y[n]\right)^2\right\}. \qquad (12)$

From (12), and since

$\frac{R^{n-1}_{yy}[0] + \delta}{R^{n}_{yy}[0] + \delta} \to 1$

for $R^n_{yy}[0] \to \infty$, we might infer that the predicted value of the universal predictor be given by

$\hat{x}_u[n] = \frac{R^{n-1}_{xy}[0]}{R^{n-1}_{yy}[0] + \delta}\, y[n], \qquad (13)$

which yields

$w_u[n-1] = \frac{R^{n-1}_{xy}[0]}{R^{n-1}_{yy}[0] + \delta} = w^*[n-1].$

The universal probability assignment is given by

$P_u(x^n) = \frac{1}{\sqrt{\delta^{-1}R^n_{yy}[0] + 1}} \exp\!\left\{-\frac{1}{2c}\left(\frac{R^n_{xx}[0]R^n_{yy}[0] + \delta R^n_{xx}[0] - (R^n_{xy}[0])^2}{R^n_{yy}[0] + \delta}\right)\right\},$

which, although Gaussian (quadratic exponential), cannot be expressed in the same form as $P_w(x^n)$, i.e.,

$P_w(x^n) = \exp\!\left(-\frac{1}{2c}\, l_n(x^n, \hat{x}^n_w)\right).$

However, from the conditional universal probability (looking term by term), we see that it is almost in this form:

$P_u(x_n|x^{n-1}) = \sqrt{\frac{R^{n-1}_{yy}[0] + \delta}{R^{n}_{yy}[0] + \delta}}\; \exp\!\left\{-\frac{1}{2c}\left(\frac{R^{n-1}_{yy}[0] + \delta}{R^{n}_{yy}[0] + \delta}\right)\left(x[n] - \frac{R^{n-1}_{xy}[0]}{R^{n-1}_{yy}[0] + \delta}\, y[n]\right)^2\right\}$

$= \alpha \exp\!\left\{-\frac{\alpha^2}{2c}\left(x[n] - \frac{R^{n-1}_{xy}[0]}{R^{n-1}_{yy}[0] + \delta}\, y[n]\right)^2\right\} = \alpha \exp\!\left\{-\frac{\alpha^2}{2c}\left(x[n] - w^*[n-1]\, y[n]\right)^2\right\},$

where $\alpha = \sqrt{(R^{n-1}_{yy}[0] + \delta)/(R^{n}_{yy}[0] + \delta)}$. If we could find another Gaussian, expressed in the form

$\tilde{P}_u(x_n|x^{n-1}) = \exp\!\left\{\frac{-1}{2c}(x[n] - \tilde{x}_u[n])^2\right\} \ge \alpha \exp\!\left\{\frac{-\alpha^2}{2c}(x[n] - w^*[n-1]\,y[n])^2\right\}$

for the sequences of interest, i.e., for $|x[n]| \le A_x$, then we would have $l_n(x, \tilde{x}_u) \le -2c \ln P_u(x^n)$, completing the proof of the theorem. Comparing $\tilde{P}_u(x_n|x^{n-1})$ and $P_u(x_n|x^{n-1})$, we require

$P_u(x_n|x^{n-1}) = \alpha \exp\!\left\{-\frac{\alpha^2}{2c}(x[n] - \hat{x}_u[n])^2\right\} \le \exp\!\left\{-\frac{1}{2c}(x[n] - \tilde{x}_u[n])^2\right\},$

for $\hat{x}_u[n] = w^*[n-1]\,y[n]$ and for some $\tilde{x}_u[n]$. Note that these are two Gaussians, with different means and different variances. We would like to select an appropriate mean for $\tilde{P}_u(x_n|x^{n-1})$, i.e., $\tilde{x}_u[n]$, such that

over the range $x[n] \in [-A_x, A_x]$, $\tilde{P}_u(x_n|x^{n-1})$ is larger than $P_u(x_n|x^{n-1})$. This would ensure that the loss of the predictor $\tilde{x}_u[n]$ satisfies Theorem 1. Suppose that $A_x = 1$ and $\hat{x}_u[n] = 1$. In this case, $P_u(x_n|x^{n-1})$ is a Gaussian which is centered at $x[n] = 1$.

The goal is to identify the mean $\tilde{x}_u[n]$ of a second Gaussian, whose variance is smaller by a factor of $\alpha^2$, such that the second Gaussian lies above $P_u(x_n|x^{n-1})$ for all possible values of $x[n]$, i.e., $x[n] \in [-A_x, A_x]$. This is depicted in Fig. 1, where $x_l$ and $x_r$ are the locations of the crossover for the two Gaussians. While the actual variance of each of the two Gaussians will be a function of the parameter $c$, for illustration purposes we set $c = 1$ in the figure. To identify the locations of the crossover points $x_l$ and $x_r$, where $\tilde{P}_u = P_u$, we let $\tilde{x}_u[n] = \gamma\, \hat{x}_u[n]$ and set the expressions for the two Gaussians equal, i.e.,

$\alpha \exp\!\left\{-\frac{1}{2c}\alpha^2 (x[n] - \hat{x}_u[n])^2\right\} = \exp\!\left\{-\frac{1}{2c}(x[n] - \gamma\,\hat{x}_u[n])^2\right\},$

$\ln\alpha - \frac{1}{2c}\alpha^2 (x[n] - \hat{x}_u[n])^2 = -\frac{1}{2c}(x[n] - \gamma\,\hat{x}_u[n])^2.$

Solving for $x[n]$, i.e., for $x_l$ and $x_r$, yields

$x[n] = \frac{\hat{x}_u[n](\alpha^2 - \gamma)}{\alpha^2 - 1} \pm \frac{\sqrt{\alpha^2 \hat{x}_u[n]^2 (\gamma - 1)^2 + 2c\ln(\alpha)(\alpha^2 - 1)}}{\alpha^2 - 1}.$
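The crossover formula can be checked directly: at the two roots, the two Gaussians take equal values. A minimal sketch of ours, with illustrative values of $\alpha$, $\gamma$, $c$, and $\hat{x}_u[n]$:

```python
import numpy as np

alpha, gamma, c, xh = 0.8, 0.5, 1.0, 0.6   # illustrative values, alpha < 1

Pu  = lambda x: alpha * np.exp(-alpha**2 * (x - xh)**2 / (2*c))   # P_u(x_n|x^{n-1})
Put = lambda x: np.exp(-(x - gamma*xh)**2 / (2*c))                # candidate P~_u

# Crossover points from the quadratic in x[n] derived above.
disc = alpha**2 * xh**2 * (gamma - 1)**2 + 2*c*np.log(alpha)*(alpha**2 - 1)
x_lo = (xh*(alpha**2 - gamma) + np.sqrt(disc)) / (alpha**2 - 1)
x_hi = (xh*(alpha**2 - gamma) - np.sqrt(disc)) / (alpha**2 - 1)

# The two Gaussians intersect exactly at the computed crossover points.
print(abs(Pu(x_lo) - Put(x_lo)), abs(Pu(x_hi) - Put(x_hi)))
```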

Figure 1: The two Gaussians for $P_u(x_n|x^{n-1})$ and $\tilde{P}_u(x_n|x^{n-1})$ are shown as a function of $x[n]$. Here, $A_x = 1$ and $c = 1$ were selected for illustration. Note that the size of the region $[x_l, x_r]$ over which $\tilde{P}_u \ge P_u$ grows with increasing $c$, since $\alpha \le 1$, and is centered about $x[n] = \hat{x}_u[n](\alpha^2 - \gamma)/(\alpha^2 - 1)$.

We will consider the case of $\alpha = 1$ separately, and for now assume that $\alpha < 1$. We would like to select a value of $c$ as small as possible, since it appears as a constant multiplier of the redundancy, or excess prediction error, of the universal predictor. Since we require that $\tilde{P}_u \ge P_u$ for all $x[n] \in [-A_x, A_x]$, the smallest value of $c$ can be selected only when the region $[x_l, x_r]$ is centered about $x[n] = 0$. This can be achieved only by the choice $\gamma = \alpha^2$. Note that for the case of $\alpha = 1$, then $\gamma = 1$ and $\tilde{P}_u = P_u$. For this choice of $\gamma$, we have

$\tilde{P}_u(x_n|x^{n-1}) = \exp\!\left\{\frac{-1}{2c}(x[n] - \tilde{x}_u[n])^2\right\},$

where the prediction is given by

$\tilde{x}_u[n] = \alpha^2\, w^*[n-1]\, y[n] = \frac{R^{n-1}_{xy}[0]}{R^{n}_{yy}[0] + \delta}\, y[n].$

Note that $\tilde{x}_u[n]$ can be viewed as using $w^*[n]\,x[n-1]$ where we assume that $x[n] = 0$ to update $R^n_{xy}[0]$ (which then remains at $R^{n-1}_{xy}[0]$) and $R^n_{yy}[0]$ accordingly before computing $w^*[n]$. Perhaps some intuition for this

interpretation is the following. We do not know what the next value of $x[n]$ will be; however, we do know that $x[n] \in [-A_x, A_x]$. Since $w^*[n]$ may be favoring a positive or negative value for the prediction, we can certainly reduce the prediction error, if the sign has been guessed incorrectly, by moving this prediction back towards zero. If the sign is guessed correctly, then it should not affect things too much. This may also be related to the autocorrelation method of linear prediction, as opposed to the covariance method. When solving for the least-squares optimal parameter $a[n-1]$ which minimizes the sum of squares of the prediction error over the observed data, it is well known that $a[n-1]$ can be arbitrarily large, even for a bounded sequence. However, in the autocorrelation method, the data are assumed to be zero outside the region of observed data, and the prediction error is minimized over a region which includes these zero values. In this case, the optimal parameter $a[n-1]$ satisfies $|a[n-1]| < 1$. For higher-order prediction problems, the prediction error filter is guaranteed to be stable when using the autocorrelation method. Now we can select the smallest value of $c$ so that the region $[-A_x, A_x] \subseteq [x_l, x_r]$, i.e.,

$A_x \le \frac{\sqrt{2c\ln(\alpha)(\alpha^2 - 1) + \alpha^2 \hat{x}_u[n]^2 (1 - \alpha^2)}}{1 - \alpha^2},$

$c \ge \frac{A_x^2 (1 - \alpha^2) - \alpha^2 \hat{x}_u[n]^2}{-2\ln(\alpha)},$

which must hold for all values of $\hat{x}_u[n] \in [-A_x, A_x]$. Therefore,

$c \ge A_x^2\, \frac{1 - \alpha^2}{-2\ln(\alpha)},$

where $\alpha < 1$. Note that for $0 < \alpha < 1$ the function
$P_u(x^n)$ over the range $x[n] \in [-1, 1]$, as shown with the connected line segment with circles. Contrary to the algorithm using $\tilde{x}_u[n]$, this range is no longer symmetric about $x[n] = 0$, and therefore $c$ must be larger in Fig. 3. Since the value of $c$ selected is significantly larger than needed for the algorithm using $\tilde{x}_u[n]$, the range over which $\tilde{P}_u(x^n) > P_u(x^n)$ is much larger than necessary, as illustrated by the

Figure 2: The two Gaussians for $P_u(x_n|x^{n-1})$ and $\hat{P}_u(x_n|x^{n-1})$ are shown as a function of $x[n]$. Here, $A_x = 1$ and $c = 2.75$ were selected for illustration.

Figure 3: The three Gaussians for $P_u(x_n|x^{n-1})$, $\hat{P}_u(x_n|x^{n-1})$, and $\tilde{P}_u(x_n|x^{n-1})$ are shown as a function of $x[n]$. Here, $A_x = 1$ and $c = 2.75$ were selected for illustration.


Figure 4: The three Gaussians for $P_u(x_n|x^{n-1})$, $\hat{P}_u(x_n|x^{n-1})$, and $\tilde{P}_u(x_n|x^{n-1})$ are shown as a function of $x[n]$. Here, $A_x = 1$ and $c = 1$ were selected for illustration.

connected line segment with asterisks. Equivalently, this value of $c$ would permit sequences $x[n]$ with larger values of $A_x$. Note that in Fig. 4, the value $c = 1$ selected is significantly smaller than needed for the algorithm using $\hat{x}_u[n]$. As a result, the range over which $\hat{P}_u(x^n) > P_u(x^n)$ no longer includes the range $x[n] \in [-1, 1]$. Equivalently, this value of $c$ would only permit sequences $x[n]$ with smaller values of $A_x$.

3.2.2 Bounded Predictions

For the prediction problem, i.e., $y[n] = x[n-1]$, there is another interesting difference between the algorithm using $\tilde{x}_u[n]$ and that using $\hat{x}_u[n]$. When the data $x[n]$ satisfy $|x[n]| < A_x$, the predictor $\tilde{x}_u[n]$ also satisfies $|\tilde{x}_u[n]| < A_x$, while the predictor $\hat{x}_u[n]$ may not. This can be shown in the scalar case using the Schwarz inequality:

$\tilde{x}_u[n] = \frac{\sum_{k=1}^{n-1} x[k]\,x[k-1]}{\sum_{k=1}^{n-1} x[k]\,x[k] + \delta}\; x[n-1],$

$|\tilde{x}_u[n]| \le \frac{\left|\sum_{k=1}^{n-1} x[k]\,x[k-1]\right|}{\sum_{k=1}^{n-1} x[k]\,x[k]}\; |x[n-1]| \le A_x\, \frac{\left|\sum_{k=1}^{n-1} x[k]\,x[k-1]\right|}{\sum_{k=1}^{n-1} x[k]\,x[k]}.$

Using the vector notation $x_1^n = [x[1]\; x[2]\; \cdots\; x[n]]^T$, we have

$|\tilde{x}_u[n]| \le A_x\, \frac{\left|\langle x_2^{n-1}, x_1^{n-2}\rangle\right|}{\|x_1^{n-1}\|^2} \le A_x\, \frac{\|x_2^{n-1}\|\, \|x_1^{n-2}\|}{\|x_1^{n-1}\|^2} \le A_x.$

For $\hat{x}_u[n]$, we have

$\hat{x}_u[n] = \frac{\sum_{k=1}^{n-1} x[k]\,x[k-1]}{\sum_{k=1}^{n-2} x[k]\,x[k] + \delta}\; x[n-1].$

For a sequence $x_1^{n-1} = [0, \ldots, 0, 1, A_x]^T$, this yields

$\hat{x}_u[n] = \frac{A_x}{1 + \delta}\, A_x,$

which, for $A_x > 1 + \delta$, yields $|\hat{x}_u[n]| > A_x$, i.e., a prediction outside the range $[-A_x, A_x]$. Note that for this sequence,

$\tilde{x}_u[n] = \frac{A_x}{1 + A_x^2 + \delta}\, A_x = \frac{A_x}{A_x^{-1} + A_x + \delta A_x^{-1}} \le \frac{1}{2} A_x.$
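The worked example above is easy to reproduce numerically; here is a sketch of ours with illustrative values $A_x = 3$ and $\delta = 0.5$ (so that $A_x > 1 + \delta$):

```python
import numpy as np

Ax, delta = 3.0, 0.5
x = np.array([0.0, 0.0, 0.0, 1.0, Ax])   # the sequence x_1^{n-1} = [0,...,0,1,Ax]^T

num = np.sum(x[1:] * x[:-1])              # sum_k x[k] x[k-1]  (here 1 * Ax)
x_tilde = num / (np.sum(x**2) + delta) * x[-1]        # denominator uses R^{n-1}_xx[0]
x_hat   = num / (np.sum(x[:-1]**2) + delta) * x[-1]   # denominator uses R^{n-2}_xx[0]

print(abs(x_tilde) <= Ax)    # universal predictor stays inside [-Ax, Ax]
print(abs(x_hat) > Ax)       # plain RLS prediction Ax*Ax/(1+delta) leaves the range
```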

While this simple argument verifies that in the scalar case $\tilde{x}_u[n]$ guarantees bounded predictions (i.e., within the range $[-A_x, A_x]$) while $\hat{x}_u[n]$ does not, we suspect that this property holds in general for higher-order predictors as well.

3.2.3 Prediction vs. Filtering

We also note that the bounds developed in this section apply both to the prediction problem and to the more general filtering problem. Since the prediction problem is significantly more constrained than the filtering problem, we might expect that tighter bounds could be found. For example, for the filtering problem, we found that the constant $c$, which appears as a multiplier of the redundancy, is the smallest value which satisfies

$c \ge \frac{A_x^2 (1 - \alpha^2) - \alpha^2 \hat{x}_u[n]^2}{-2\ln(\alpha)}.$

In the prediction problem, this expression can be reduced somewhat, to yield

$c \ge A_x^2 - \frac{(R^n_{xx}[-1])^2}{R^{n-1}_{xx}[0] + \delta},$

which may be significantly less than the $A_x^2$ bound for the filtering case.

3.2.4 Algorithm Details

The predictor described in this section is closely related to the standard recursive least squares linear prediction algorithm, for which a variety of fast and efficient implementations exist. As such, we note that these fast algorithms can be modified slightly to provide the outputs x ˜u [n] with either no increase, or only a slight increase in computational complexity. For the filtering case, the recursive least squares algorithm provides a means for computing x ˆu [n] with O(p2 ) computations per output sample, for a p − th order filter. To compute x˜u [n] with precisely the same number of operations, the order in which the RLS update is computed needs to be slightly adjusted. As we shall see in the next section, it is also true for the p-th order case, that x ˜u [n] is computed by updating the RLS algorithm with the assumption that the next sample x[n] is zero. This can be accomplished in the RLS algorithm by updating the inverse autocorrelation matrix before computing the prediction, rather than after the prediction, as in the standard RLS algorithm. In the prediction case, ~y [n] = [x[n−1], . . . , x[n−p]]T , there exist RLS prediction algorithms that take O(p) computations per output sample to compute x ˆu [n]. For example, the fast transversal filter (FTF) algorithm [40] updates the predictor coefficients, a1 , . . . , ap , at each time step using only O(p) computations. A simple procedure for computing x ˜u [n] would be to store the current state of the FTF algorithm, and then update the FTF with the assumption that x[n] = 0. This would use O(p) computations. Then these filter coefficients can be applied to the lag vector [x[n − 1], . . . , x[n − p]], to compute x ˜u [n]. To compute the next output sample, the state of the FTF needs to be restored (to “undo” the effect of the “zero-sample-update”), and the standard FTF update can proceed. 
This would require at most twice the number of computations of the original FTF algorithm, and still require O(p) computations per output sample.
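For concreteness, the save/zero-update/restore idea can be sketched with a plain $O(p^2)$ inverse-correlation recursion. This is an illustrative sketch only, not the paper's FTF implementation; the function and variable names are ours:

```python
import numpy as np

def universal_predictions(x, p=2, delta=1.0):
    """Compute the universal predictions x_tilde_u[n] for an order-p autoregression.

    Maintains P = (R_yy + delta*I)^{-1} and r = r_xy.  The key ordering trick from
    the text: rank-one update P with the new lag vector y[n] BEFORE predicting
    (equivalent to an RLS update under the assumption x[n] = 0), then predict,
    then fold the observed x[n] into r afterwards.
    """
    P = np.eye(p) / delta              # (R_yy^0 + delta*I)^{-1}
    r = np.zeros(p)                    # r_xy^0
    preds = np.zeros(len(x))
    for n in range(len(x)):
        # lag vector y[n] = [x[n-1], ..., x[n-p]]^T (zeros before the data starts)
        y = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(1, p + 1)])
        # Sherman-Morrison update of P with y[n] before forming the prediction
        Py = P @ y
        P -= np.outer(Py, Py) / (1.0 + y @ Py)
        preds[n] = r @ (P @ y)         # x_tilde_u[n] = (r_xy^{n-1})^T (R_yy^n + delta*I)^{-1} y[n]
        r += x[n] * y                  # now incorporate x[n] into r_xy
    return preds
```

Each step costs $O(p^2)$; the FTF variants discussed above reduce this to $O(p)$ for the prediction case.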

4 p-th-Order Linear Prediction

In this section, we consider the problem of linear prediction with a predictor of fixed order $p$. The predictor is now parameterized by the vector $\vec{w} = [w_1, \ldots, w_p]^T$, and the predicted value can be written $\hat{x}[n] = \vec{w}^T \vec{y}[n]$, where $\vec{y}[n] = [y_1[n], \ldots, y_p[n]]^T$ and, in the case of an autoregressive model, $\vec{y}[n] = [x[n-1], \ldots, x[n-p]]^T$. If the parameter vector $\vec{w}$ is selected such that the total squared prediction error is minimized over a batch of data of length $N$, then the coefficients are given by
$$\vec{w}_N = \arg\min_{\vec{w}} \sum_{k=1}^{N} \left(x[k] - \vec{w}^T \vec{y}[k]\right)^2.$$

The well-known least-squares solution to this problem is given by
$$\vec{w}_N = (R_{yy}^N)^{-1} r_{xy}^N,$$
where
$$R_{yy}^N = \sum_{k=1}^{N} \vec{y}[k]\vec{y}[k]^T, \tag{15}$$
and
$$r_{xy}^N = \sum_{k=1}^{N} x[k]\vec{y}[k]. \tag{16}$$
We will also consider the more general regularized least-squares problem:
$$\vec{w}^*[N] = \arg\min_{\vec{w}} \left\{ l_N(x, \hat{x}_{\vec{w}}) + \delta \|\vec{w}\|^2 \right\} = \left[R_{yy}^N + \delta I\right]^{-1} r_{xy}^N.$$
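In code, the regularized batch solution is ordinary ridge regression on the lag vectors. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def regularized_ls(x, Y, delta):
    """Closed-form solution of w* = argmin_w sum_k (x[k] - w^T y[k])^2 + delta*||w||^2,
    i.e. w* = (R_yy^N + delta*I)^{-1} r_xy^N, with Y an (N, p) array whose rows are y[k]^T."""
    p = Y.shape[1]
    R = Y.T @ Y + delta * np.eye(p)   # R_yy^N + delta*I  (eq. (15) plus regularizer)
    r = Y.T @ x                       # r_xy^N            (eq. (16))
    return np.linalg.solve(R, r)
```

Setting delta = 0 recovers the unregularized solution $\vec{w}_N$ whenever $R_{yy}^N$ is invertible.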

We now construct a universal $p$-th-order linear predictor using a mixture over all predictors $\vec{w}$. The following theorem extends Theorem 1 using a vector version of the mixture approach.

Theorem 2  Let $\vec{y}[n]$ and $x[n]$, $n = 1, \ldots, N$, be bounded, but otherwise arbitrary, vector and scalar sequences, such that $|y_k[n]| \le A_y$ and $|x[n]| < A_x$. Let $\hat{x}_{\vec{w}}[n] = \vec{w}^T \vec{y}[n]$ be the output of a $p$-th-order linear predictor with parameter vector $\vec{w}$, and let $l_n(x, \hat{x}_{\vec{w}})$ be the running total squared prediction error, i.e.
$$l_n(x, \hat{x}_{\vec{w}}) = \sum_{t=1}^{n} \left(x[t] - \hat{x}_{\vec{w}}[t]\right)^2.$$
Define a universal predictor $\tilde{x}_u[n]$ as $\tilde{x}_u[n] = \vec{w}_u[n-1]^T \vec{y}[n]$, where
$$\vec{w}_u[n] = \left[R_{yy}^{n+1} + \delta I\right]^{-1} r_{xy}^{n},$$
$R_{yy}^n$ is the deterministic autocorrelation matrix defined in (15), $r_{xy}^n$ is the correlation vector defined in (16), and $\delta > 0$ is a positive constant. Then the total squared prediction error of the $p$-th-order universal predictor,
$$l_n(x, \tilde{x}_u) = \sum_{t=1}^{n} \left(x[t] - \tilde{x}_u[t]\right)^2,$$
satisfies
$$l_n(x, \tilde{x}_u) \le \min_{\vec{w}}\left\{l_n(x, \hat{x}_{\vec{w}})\right\} + \delta\, \vec{w}[n]^T R_{yy}^n \left[\delta I + R_{yy}^n\right]^{-1} \vec{w}[n] + A_x^2 \ln\left|I + R_{yy}^n \delta^{-1}\right|,$$
$$l_n(x, \tilde{x}_u) \le \min_{\vec{w}}\left\{l_n(x, \hat{x}_{\vec{w}}) + \delta\|\vec{w}\|^2\right\} + A_x^2 \ln\left|I + R_{yy}^n \delta^{-1}\right|,$$
and therefore,
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \le \min_{\vec{w}}\left\{\frac{1}{n}\, l_n(x, \hat{x}_{\vec{w}}) + \frac{\delta}{n}\|\vec{w}\|^2\right\} + \frac{A_x^2\, p}{n} \ln\left(1 + \frac{A_y^2 n}{\delta}\right).$$
In the case of a pure autoregression we have $\vec{y}[n] = [x[n-1], \ldots, x[n-p]]^T$ and
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \le \min_{\vec{a}}\left\{\frac{1}{n}\, l_n(x, \hat{x}_{\vec{a}}) + \frac{\delta}{n}\|\vec{a}\|^2\right\} + \frac{A_x^2\, p}{n} \ln\left(1 + \frac{A_x^2 n}{\delta}\right).$$

Theorem 2 tells us that the average squared prediction error of the $p$-th-order universal predictor is within $O(p\ln(n)/n)$ of that of the best batch $p$-th-order linear predictor, for every individual sequence $x[n]$. The proof of Theorem 2 follows that of Theorem 1.

Proof of Theorem 2: Over the continuum of predictors with coefficients $\vec{w}$, we assign the a priori Gaussian mixture
$$p(\vec{w}) = \left(\sqrt{2\pi}\,\sigma\right)^{-p} \exp\left(-\frac{1}{2\sigma^2}\,\vec{w}^T\vec{w}\right),$$
and define the universal probability
$$P_u(x^n) = \int p(\vec{w})\, P_{\vec{w}}(x^n)\, d\vec{w},$$
with
$$P_{\vec{w}}(x^n) = \exp\left(-\frac{1}{2c}\, l_n(x, \hat{x}_{\vec{w}})\right) = \exp\left(-\frac{1}{2c}\left(R_x^n[0] - 2\vec{w}^T r_{xy}^n + \vec{w}^T R_{yy}^n \vec{w}\right)\right).$$
This yields
$$P_u(x^n) = \left|\delta^{-1} R_{yy}^n + I\right|^{-1/2} \exp\left(-\frac{1}{2c}\left(R_x^n[0] - (r_{xy}^n)^T\left[R_{yy}^n + \delta I\right]^{-1} r_{xy}^n\right)\right),$$
where $\delta = c/\sigma^2$ and
$$R_x^n[0] = \sum_{k=1}^{n} x^2[k].$$

To compare the universal probability with the maximum probability over all parameters $\vec{w}$, observe that
$$\max_{\vec{w}} P_{\vec{w}}(x^n) = P_{\vec{w}}(x^n)\Big|_{\vec{w} = \vec{w}[n]} = \exp\left(-\frac{1}{2c}\, l_n(x, \hat{x}_{\vec{w}[n]})\right),$$
where
$$\vec{w}[n] = (R_{yy}^n)^{-1} r_{xy}^n.$$
Since
$$l_n(x, \hat{x}_{\vec{w}[n]}) = \sum_{k=1}^{n}\left(x[k] - \left((R_{yy}^n)^{-1} r_{xy}^n\right)^T \vec{y}[k]\right)^2 = R_x^n[0] - (r_{xy}^n)^T (R_{yy}^n)^{-1} r_{xy}^n,$$
we obtain
$$P_{\vec{w}[n]}(x^n) = \exp\left(-\frac{1}{2c}\left(R_x^n[0] - (r_{xy}^n)^T (R_{yy}^n)^{-1} r_{xy}^n\right)\right).$$

Comparing $P_u(x_1^n)$ to $\max_{\vec{w}} P_{\vec{w}}(x_1^n)$, we observe
$$\begin{aligned}
\frac{P_u(x_1^n)}{\max_{\vec{w}} P_{\vec{w}}(x_1^n)}
&= \frac{\sigma^{-p}\left|\frac{1}{c} R_{yy}^n + \frac{1}{\sigma^2} I\right|^{-1/2} \exp\left\{-\frac{1}{2}\left(\frac{1}{c} R_x^n[0] - \left(\frac{1}{c}\, r_{xy}^n\right)^T\left[\frac{1}{c} R_{yy}^n + \frac{1}{\sigma^2} I\right]^{-1} \frac{1}{c}\, r_{xy}^n\right)\right\}}{\exp\left\{-\frac{1}{2c}\left(R_x^n[0] - (r_{xy}^n)^T (R_{yy}^n)^{-1} r_{xy}^n\right)\right\}} \\
&= \sigma^{-p}\left|\frac{1}{c} R_{yy}^n + \frac{1}{\sigma^2} I\right|^{-1/2} \exp\left\{-\frac{1}{2c}\left((r_{xy}^n)^T (R_{yy}^n)^{-1} r_{xy}^n - (r_{xy}^n)^T \left[\frac{1}{c} R_{yy}^n + \frac{1}{\sigma^2} I\right]^{-1} \frac{1}{c}\, r_{xy}^n\right)\right\} \\
&= \sigma^{-p}\left|\frac{1}{c} R_{yy}^n + \frac{1}{\sigma^2} I\right|^{-1/2} \exp\left\{-\frac{1}{2c}\,(r_{xy}^n)^T (R_{yy}^n)^{-1}\left[c\,(R_{yy}^n)^{-1} + \sigma^2 I\right]^{-1} c\,(R_{yy}^n)^{-1} r_{xy}^n\right\} \\
&= \left|\delta^{-1} R_{yy}^n + I\right|^{-1/2} \exp\left\{-\frac{1}{2c}\,(r_{xy}^n)^T\left[R_{yy}^n \delta^{-1} R_{yy}^n + R_{yy}^n\right]^{-1} r_{xy}^n\right\},
\end{aligned}$$
where the matrix inversion lemma was used from the second line to the third.

Taking the logarithm, and for the maximizing $\vec{w} = \vec{w}[n]$,
$$-\ln\frac{P_u(x^n)}{\max_{\vec{w}} P_{\vec{w}}(x^n)} = \frac{1}{2}\ln\left|\delta^{-1} R_{yy}^n + I\right| + \frac{1}{2c}\,(r_{xy}^n)^T\left[R_{yy}^n \delta^{-1} R_{yy}^n + R_{yy}^n\right]^{-1} r_{xy}^n = \frac{1}{2}\ln\left|\delta^{-1} R_{yy}^n + I\right| + \frac{\delta}{2c}\, \vec{w}[n]^T R_{yy}^n \left[R_{yy}^n + \delta I\right]^{-1} \vec{w}[n].$$
This gives
$$-2c \ln P_u(x^n) = \min_{\vec{w}}\left\{l_n(x, \hat{x}_{\vec{w}})\right\} + c \ln\left|\delta^{-1} R_{yy}^n + I\right| + \delta\, \vec{w}[n]^T R_{yy}^n \left[R_{yy}^n + \delta I\right]^{-1} \vec{w}[n]. \tag{17}$$
Repeating the same analysis for $\vec{w} = \vec{w}^*[n]$, we obtain
$$-2c \ln P_u(x^n) = \min_{\vec{w}}\left\{l_n(x, \hat{x}_{\vec{w}}) + \delta\|\vec{w}\|^2\right\} + c \ln\left|\delta^{-1} R_{yy}^n + I\right|. \tag{18}$$

Once again, we now have $-2c \ln P_u(x^n)$ in terms of the loss of the two predictors with coefficients $\vec{w}[n]$ and $\vec{w}^*[n]$, which minimize the total accumulated loss and the regularized loss, respectively. However, although $P_u(x^n)$ is Gaussian in the data, it is not of the form $\exp(-l_n(x, \hat{x}_u)/2c)$. If it were, then we could directly relate the loss of the associated predictor to the loss of the best predictor. So, we again look for another Gaussian, which is larger than $P_u$ but is of the correct form. To this end, we again examine the conditional density,
$$\begin{aligned}
P_u(x_n \mid x^{n-1}) &= \frac{P_u(x^n)}{P_u(x^{n-1})} \\
&= \sqrt{\frac{\left|R_{yy}^{n-1} + \delta I\right|}{\left|R_{yy}^{n} + \delta I\right|}}\; \exp\left\{-\frac{1}{2c}\, \frac{\left(x[n] - (r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]\right)^2}{1 + \vec{y}[n]^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]}\right\} \\
&= \frac{\left|R_{yy}^{n-1} + \delta I\right|^{1/2}}{\left|R_{yy}^{n} + \delta I\right|^{1/2}}\; \exp\left\{-\frac{1}{2c}\, \frac{\left|R_{yy}^{n-1} + \delta I\right|}{\left|R_{yy}^{n} + \delta I\right|}\left(x[n] - (r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]\right)^2\right\} \\
&= \frac{\left|R_{yy}^{n-1} + \delta I\right|^{1/2}}{\left|R_{yy}^{n} + \delta I\right|^{1/2}}\; \exp\left\{-\frac{1}{2c}\, \frac{\left|R_{yy}^{n-1} + \delta I\right|}{\left|R_{yy}^{n} + \delta I\right|}\left(x[n] - \hat{x}_u[n]\right)^2\right\},
\end{aligned}$$
where
$$\hat{x}_u[n] = (r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n] = \vec{w}^*[n-1]^T \vec{y}[n],$$

and the second equality follows from the following lemma, proved in the appendix.

Lemma 1  For an $n \times n$ invertible matrix $A$ and $n \times 1$ vectors $b$ and $c$,
$$\left|A + bc^T\right| = |A|\left(1 + c^T A^{-1} b\right).$$

Note that once again, $P_u(x_n \mid x^{n-1})$ is of the form
$$P_u(x_n \mid x^{n-1}) = \alpha\, e^{-\frac{1}{2c}\alpha^2\left(x[n] - \hat{x}_u[n]\right)^2},$$
where $0 < \alpha \le 1$.

As in the scalar prediction case, we seek another Gaussian, $\tilde{P}_u(x_n \mid x^{n-1}) \ge P_u(x_n \mid x^{n-1})$ for all $x[n] \in [-A_x, A_x]$, such that
$$\tilde{P}_u(x_n \mid x^{n-1}) = \exp\left(-\frac{1}{2c}\left(x[n] - \tilde{x}_u[n]\right)^2\right).$$
Once again, by selecting $\tilde{x}_u[n] = \alpha^2 \hat{x}_u[n]$, we have $\tilde{P}_u(x_n \mid x^{n-1}) \ge P_u(x_n \mid x^{n-1})$ over a region $[x_l, x_r]$ which is centered at $x[n] = 0$. Thus, setting our universal predictor to $\tilde{x}_u[n]$, we are guaranteed that the accumulated prediction error will be smaller than $-2c \ln P_u(x^n)$, and thus Theorem 2 will be satisfied.

We again look at the form of $\tilde{x}_u[n]$ for an interpretation of this result. We have
$$\tilde{x}_u[n] = \alpha^2 (r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n] = \frac{(r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]}{1 + \vec{y}[n]^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]}.$$
If we again have the interpretation that $\tilde{x}_u[n]$ is simply the result of updating the least-squares solution $\vec{w}^*[n]$ with the assumption that $x[n] = 0$, before predicting $x[n]$, then we would have
$$\tilde{x}_u[n] \stackrel{?}{=} (r_{xy}^{n-1})^T\left[R_{yy}^{n} + \delta I\right]^{-1}\vec{y}[n] = (r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I + \vec{y}[n]\vec{y}[n]^T\right]^{-1}\vec{y}[n] = \frac{(r_{xy}^{n-1})^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]}{1 + \vec{y}[n]^T\left[R_{yy}^{n-1} + \delta I\right]^{-1}\vec{y}[n]},$$
which is indeed the case, by application of the matrix inversion lemma. This completes the proof of Theorem 2.

We note that many of the comments in Section 3.2 apply in the $p$-th-order case. As such, we include in the appendix a statement indicating that the calculation of the upper bound in Theorem 2 is asymptotically tight for this algorithm.
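The final chain of equalities is exactly the Sherman-Morrison form of the matrix inversion lemma, and it can be checked numerically. In this illustrative sketch, the random matrices and vectors are stand-ins for $R_{yy}^{n-1}+\delta I$, $r_{xy}^{n-1}$, and $\vec{y}[n]$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
A = rng.standard_normal((p, p))
M = A @ A.T + np.eye(p)          # stands in for R_yy^{n-1} + delta*I (positive definite)
r = rng.standard_normal(p)       # stands in for r_xy^{n-1}
y = rng.standard_normal(p)       # stands in for y[n]

# alpha^2 * x_hat_u[n]: the mixture prediction shrunk by alpha^2
My = np.linalg.solve(M, y)
lhs = (r @ My) / (1.0 + y @ My)
# the "zero-sample update" prediction: r_xy^{n-1} through (R_yy^{n-1} + y*y^T + delta*I)^{-1}
rhs = r @ np.linalg.solve(M + np.outer(y, y), y)
assert np.isclose(lhs, rhs)
```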

5 Appendix

5.1 The upper bound in Theorem 1

The following statement indicates that the calculation of the upper bound in Theorem 1 is asymptotically tight (for this algorithm).

Statement 1  Let
$$\inf_n \frac{1}{n}\left(\sum_{k=1}^{n} y^2[k] + \delta\right) > 0.$$
Then $l_n(x, \tilde{x}_u)$ satisfies
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \ge \min_{w}\left\{\frac{1}{n}\, l_n(x, \hat{x}_w) + \frac{\delta}{n}\, w^2\right\} + \frac{A_x^2}{n}\ln\left(1 + R_{yy}^n[0]\,\delta^{-1}\right) - \epsilon_n,$$
where $\epsilon_n = O(\ln(n)/n)$.

Comparing the two Gaussians, $\tilde{P}_u(x_n \mid x^{n-1})$ and $P_u(x_n \mid x^{n-1})$, we obtain
$$\begin{aligned}
\ln\left(\frac{\tilde{P}_u(x_n \mid x^{n-1})}{P_u(x_n \mid x^{n-1})}\right) &= \ln\left(\frac{1}{\alpha_n}\, e^{-\frac{1}{2c}\left(x[n] - \alpha_n^2 \hat{x}_u[n]\right)^2 + \frac{\alpha_n^2}{2c}\left(x[n] - \hat{x}_u[n]\right)^2}\right) \\
&= -\ln(\alpha_n) + \frac{1}{2c}\left[(1 - \alpha_n^2)\left(\alpha_n^2\, \hat{x}_u[n]^2 - x[n]^2\right)\right] \\
&\le -\ln(\alpha_n) + \frac{1}{2c}\left[(1 - \alpha_n^2)\,\alpha_n^2\, \hat{x}_u[n]^2\right],
\end{aligned} \tag{19}$$
where we have made explicit the dependence of the parameter $\alpha$ on the time index $n$. This provides the lower bound
$$-2c \ln \tilde{P}_u(x_n \mid x^{n-1}) \ge -2c \ln P_u(x_n \mid x^{n-1}) + 2c \ln \alpha_n - \alpha_n^2\left(1 - \alpha_n^2\right)\hat{x}_u[n]^2. \tag{20}$$

Since the performance of the universal predictor is given by the Gaussian $\tilde{P}_u(x_n \mid x^{n-1})$, and the upper bound developed in Theorem 1 corresponds to the Gaussian $P_u(x_n \mid x^{n-1})$, two factors will contribute to the bound in Theorem 1 not being tight. First, the performance of the predictor $\tilde{x}_u[n]$ is better than that of $\hat{x}_u[n]$ for the actual value of $x[n]$ that occurs. This was guaranteed by construction in Theorem 1, by producing a Gaussian that was larger than the mixture Gaussian over the range $x[n] \in [-A_x, A_x]$. However, the precise amount by which the performance exceeds that of $\hat{x}_u[n]$ is difficult to estimate over the course of the sequence. This excess is given by (19), and depends on the exact values of the sequences $\hat{x}_u[n]$ and $x[n]$. The ratio between $\tilde{P}_u(x_n \mid x^{n-1})$ and $P_u(x_n \mid x^{n-1})$ is maximized at $x[n] = 0$, which, combined with Theorem 1, gives
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \ge \min_{w}\left\{\frac{1}{n}\, l_n(x, \hat{x}_w) + \frac{\delta}{n}\, w^2\right\} + \frac{A_x^2}{n}\ln\left(1 + R_{yy}^n[0]\,\delta^{-1}\right) + \frac{1}{n}\sum_{k=1}^{n}\left[2c \ln \alpha_k - \alpha_k^2\left(1 - \alpha_k^2\right)\hat{x}_u[k]^2\right].$$
Hence, we have
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \ge \min_{w}\left\{\frac{1}{n}\, l_n(x, \hat{x}_w) + \frac{\delta}{n}\, w^2\right\} + \frac{A_x^2}{n}\ln\left(1 + R_{yy}^n[0]\,\delta^{-1}\right) - \epsilon_n,$$
where
$$\epsilon_n = -\frac{1}{n}\sum_{k=1}^{n}\left[2c \ln \alpha_k - \alpha_k^2\left(1 - \alpha_k^2\right)\hat{x}_u[k]^2\right] \ge 0.$$
It remains to show that $\epsilon_n = O(\ln(n)/n)$; we bound its two terms separately.

The first term in $\epsilon_n$ can be shown to be $O(\ln(n)/n)$:
$$-\frac{2c}{n}\sum_{k=1}^{n}\ln\alpha_k = \frac{c}{n}\sum_{k=1}^{n}\ln\left(\frac{R_{yy}^{k}[0] + \delta}{R_{yy}^{k-1}[0] + \delta}\right) = \frac{c}{n}\ln\left(\frac{R_{yy}^{n}[0] + \delta}{\delta}\right) \le \frac{c}{n}\ln\left(\frac{nA_y^2 + \delta}{\delta}\right) = O\left(\frac{\ln(n)}{n}\right),$$
where the second equality follows from the telescoping product and the inequality uses $R_{yy}^n[0] \le nA_y^2$. Now suppose $\left(R_{yy}^n[0] + \delta\right) \ge \gamma n$, i.e.
$$\gamma = \inf_n \frac{1}{n}\left(R_{yy}^n[0] + \delta\right) > 0. \tag{21}$$

Then the second term in $\epsilon_n$ satisfies
$$\begin{aligned}
\frac{1}{n}\sum_{k=1}^{n}\left(1 - \alpha_k^2\right)\alpha_k^2\, \hat{x}_u[k]^2
&= \frac{1}{n}\sum_{k=1}^{n}\left(1 - \frac{R_{yy}^{k-1}[0] + \delta}{R_{yy}^{k}[0] + \delta}\right)\left(\frac{R_{yy}^{k-1}[0] + \delta}{R_{yy}^{k}[0] + \delta}\right)\hat{x}_u[k]^2 \\
&\le \frac{A_x^2 A_y^4}{\gamma^2 n}\sum_{k=1}^{n}\left(1 - \frac{R_{yy}^{k-1}[0] + \delta}{R_{yy}^{k}[0] + \delta}\right) \\
&= \frac{A_x^2 A_y^4}{\gamma^2 n}\sum_{k=1}^{n}\frac{y^2[k]}{R_{yy}^{k}[0] + \delta} \\
&\le \frac{A_x^2 A_y^4}{\gamma^2 n}\left(\frac{A_y^2}{\delta} + \ln\left(\frac{A_y^2(n-1) + \delta}{\delta}\right)\right) \\
&= O\left(\frac{\ln(n)}{n}\right),
\end{aligned}$$
where the second line uses $\alpha_k^2 \le 1$ and $|\hat{x}_u[k]| \le A_x A_y^2/\gamma$ (which follows from (21)), and the following lemma, applied to $a[k] = y^2[k]/A_y^2$, is used for the last inequality. Note that the condition for the statement to hold, (21), is fairly unrestrictive: it simply lower bounds the average energy of the sequence $y[n]$.

Lemma 2  For any sequence $a[1], a[2], \ldots, a[n]$ such that $0 \le a[k] \le 1$ for all $k$, and any constant $\delta > 0$, the following inequalities hold:
$$\sum_{k=1}^{n}\frac{a[k]}{\sum_{i=1}^{k} a[i] + \delta} \;\le\; \sum_{k=1}^{n}\frac{a[k]}{\sum_{i=1}^{k-1} a[i] + \delta} \;\le\; \frac{1}{\delta} + \ln\left(\frac{(n-1) + \delta}{\delta}\right).$$

5.2 Proof of Lemma 2

For any $a[k]$, $k = 1, \ldots, n$, with $0 \le a[k] \le 1$, and any $\delta > 0$, let
$$F(a[1], \ldots, a[n]) = \frac{a[1]}{\delta} + \frac{a[2]}{a[1] + \delta} + \cdots + \frac{a[n]}{a[1] + \cdots + a[n-1] + \delta}$$
and
$$G(a[1], \ldots, a[n]) = \frac{a[1]}{a[1] + \delta} + \frac{a[2]}{a[1] + a[2] + \delta} + \cdots + \frac{a[n]}{a[1] + \cdots + a[n] + \delta}.$$
Since corresponding terms of $F$ and $G$ satisfy
$$\frac{a[k]}{a[1] + \cdots + a[k-1] + \delta} \ge \frac{a[k]}{a[1] + \cdots + a[k] + \delta},$$
we have $F(a[1], \ldots, a[n]) \ge G(a[1], \ldots, a[n])$. We therefore seek an upper bound on $F$. Let $a[1]^*, \ldots, a[n]^*$ denote a maximizing sequence of $F$, i.e. $F(a[1], \ldots, a[n]) \le F(a[1]^*, \ldots, a[n]^*)$. Note that the partial derivative
$$\frac{\partial F}{\partial a[n]} = 0 + \cdots + 0 + \frac{1}{a[1] + \cdots + a[n-1] + \delta} > 0$$
for all $a[k]$, $k = 1, \ldots, n$. Therefore the maximum of $F$ with respect to $a[n]$ must occur at $a[n]^* = 1$, regardless of the values of the other $a[i]$, $i = 1, \ldots, n-1$. Next,
$$\left.\frac{\partial^2 F}{\partial a[n-1]^2}\right|_{a[n]=1} = \frac{2}{\left(a[1] + \cdots + a[n-1] + \delta\right)^3} > 0$$
for all $a[k]$, $k = 1, \ldots, n-1$, so $F$ is convex in $a[n-1]$ and its maximum over $a[n-1] \in [0, 1]$ must occur at either $a[n-1]^* = 1$ or $a[n-1]^* = 0$. We compare these two values of $F$:
$$F(\cdots, 1, 1) = \frac{a[1]}{\delta} + \cdots + \frac{a[n-2]}{a[1] + \cdots + a[n-3] + \delta} + \frac{1}{a[1] + \cdots + a[n-2] + \delta} + \frac{1}{a[1] + \cdots + a[n-2] + 1 + \delta},$$
$$F(\cdots, 0, 1) = \frac{a[1]}{\delta} + \cdots + \frac{a[n-2]}{a[1] + \cdots + a[n-3] + \delta} + \frac{0}{a[1] + \cdots + a[n-2] + \delta} + \frac{1}{a[1] + \cdots + a[n-2] + 0 + \delta}.$$
The first $n-2$ terms of each sum are equal. The $(n-1)$st term of $F(\cdots, 1, 1)$ equals the last term of $F(\cdots, 0, 1)$, and the $n$th term of $F(\cdots, 1, 1)$ is positive. Therefore $F(\cdots, 1, 1) > F(\cdots, 0, 1)$, and $a[n-1]^* = 1$.

Now assume that $a[n]^* = 1, a[n-1]^* = 1, \ldots, a[k+1]^* = 1$. We show that $a[k]^* = 1$. Note that
$$\left.\frac{\partial^2 F}{\partial a[k]^2}\right|_{a[k+1] = \cdots = a[n] = 1} = \frac{2}{\left(a[1] + \cdots + a[k] + \delta\right)^3} + \frac{2}{\left(a[1] + \cdots + a[k] + 1 + \delta\right)^3} + \cdots + \frac{2}{\left(a[1] + \cdots + a[k] + 1 + \cdots + 1 + \delta\right)^3} > 0.$$
Therefore $a[k]^* = 1$ or $a[k]^* = 0$. Comparing these choices,
$$F(\cdots, a[k-1], 1, 1, \ldots, 1) = \frac{a[1]}{\delta} + \cdots + \frac{a[k-1]}{a[1] + \cdots + a[k-2] + \delta} + \frac{1}{a[1] + \cdots + a[k-1] + \delta} + \frac{1}{a[1] + \cdots + a[k-1] + 1 + \delta} + \cdots + \frac{1}{a[1] + \cdots + a[k-1] + 1 + \cdots + 1 + \delta},$$
$$F(\cdots, a[k-1], 0, 1, \ldots, 1) = \frac{a[1]}{\delta} + \cdots + \frac{a[k-1]}{a[1] + \cdots + a[k-2] + \delta} + \frac{0}{a[1] + \cdots + a[k-1] + \delta} + \frac{1}{a[1] + \cdots + a[k-1] + 0 + \delta} + \frac{1}{a[1] + \cdots + a[k-1] + 1 + \delta} + \cdots + \frac{1}{a[1] + \cdots + a[k-1] + 1 + \cdots + 1 + \delta}.$$
Again, the first $k-1$ terms of each sum are equal, and the $n-k$ nonzero terms of the second sum equal the next $n-k$ terms of the first sum. This leaves one remaining positive term in the first sum, and therefore $F(\cdots, a[k-1], 1, 1, \ldots, 1) > F(\cdots, a[k-1], 0, 1, \ldots, 1)$, i.e. $a[k]^* = 1$. By induction on $k$, the maximum of $F$ is attained at $a[1]^* = \cdots = a[n]^* = 1$, so
$$F(a[1], \ldots, a[n]) \le F(1, \ldots, 1) = \frac{1}{\delta} + \sum_{k=1}^{n-1}\frac{1}{k + \delta} \le \frac{1}{\delta} + \ln\left(\frac{(n-1) + \delta}{\delta}\right),$$
which proves the lemma.
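Lemma 2's inequalities are easy to spot-check numerically. An illustrative sketch, where `a` is any sequence in $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta = 50, 0.5
a = rng.uniform(0.0, 1.0, size=n)      # any sequence with 0 <= a[k] <= 1
cs = np.cumsum(a)

G = np.sum(a / (cs + delta))           # sum_k a[k] / (a[1]+...+a[k] + delta)
F = np.sum(a / (cs - a + delta))       # sum_k a[k] / (a[1]+...+a[k-1] + delta)
bound = 1.0 / delta + np.log((n - 1 + delta) / delta)

assert G <= F <= bound
```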

5.3 Proof of Lemma 1

The following lemma will be helpful.

Lemma 3  Express $A$ by its columns, writing $A = [a_1, a_2, \ldots, a_n]$, and let $b$ be a column vector. Then
$$\left|[a_1, a_2, \ldots, a_i + b, \ldots, a_n]\right| = |A| + \left|[a_1, a_2, \ldots, b, \ldots, a_n]\right|.$$

Proof of Lemma 3: Let $A_{i,j}$ be the $(i,j)$ cofactor of the matrix $A$. Expanding the determinant along the $i$th column,
$$\left|[a_1, a_2, \ldots, a_i + b, \ldots, a_n]\right| = \left(a_i[1] + b[1]\right)A_{1,i} + \cdots + \left(a_i[n] + b[n]\right)A_{n,i} = |A| + \left|[a_1, a_2, \ldots, b, \ldots, a_n]\right|.$$

Now we return to Lemma 1. Express the required determinant as
$$\left|A + bc^T\right| = \left|[a_1 + c[1]b,\; a_2 + c[2]b,\; \ldots,\; a_n + c[n]b]\right|.$$
Note that, by Lemma 3,
$$\left|[a_1 + c[1]b, a_2, \ldots, a_n]\right| = |A| + \left|[c[1]b, a_2, \ldots, a_n]\right|.$$
Adding the next column,
$$\begin{aligned}
\left|[a_1 + c[1]b, a_2 + c[2]b, a_3, \ldots, a_n]\right| &= |A| + \left|[c[1]b, a_2, \ldots, a_n]\right| + \left|[a_1 + c[1]b, c[2]b, a_3, \ldots, a_n]\right| \\
&= |A| + \left|[c[1]b, a_2, \ldots, a_n]\right| + \left|[a_1, c[2]b, a_3, \ldots, a_n]\right| + \left|[c[1]b, c[2]b, a_3, \ldots, a_n]\right| \\
&= |A| + \left|[c[1]b, a_2, \ldots, a_n]\right| + \left|[a_1, c[2]b, a_3, \ldots, a_n]\right| + 0,
\end{aligned}$$
since a determinant with two proportional columns vanishes. Proceeding by induction on $j$, every determinant containing two columns proportional to $b$ vanishes, and
$$\left|[a_1 + c[1]b, \ldots, a_j + c[j]b, a_{j+1} + c[j+1]b, a_{j+2}, \ldots, a_n]\right| = |A| + \left|[c[1]b, a_2, \ldots, a_n]\right| + \cdots + \left|[a_1, \ldots, a_j, c[j+1]b, a_{j+2}, \ldots, a_n]\right|.$$
Therefore,
$$\left|A + bc^T\right| = |A| + \left|[c[1]b, a_2, \ldots, a_n]\right| + \cdots + \left|[a_1, \ldots, a_{n-1}, c[n]b]\right| = |A| + \sum_{i=1}^{n}\sum_{j=1}^{n} c[i]\, b[j]\, A_{j,i} = |A|\left(1 + c^T A^{-1} b\right).$$
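Lemma 1 is the familiar matrix determinant lemma; a quick numerical sanity check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, invertible
b = rng.standard_normal(n)
c = rng.standard_normal(n)

lhs = np.linalg.det(A + np.outer(b, c))
rhs = np.linalg.det(A) * (1.0 + c @ np.linalg.solve(A, b))
assert np.isclose(lhs, rhs)
```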

5.4 The upper bound in Theorem 2

Statement 2  Let $\lambda_n$ be the smallest eigenvalue of $\frac{1}{n}\left(R_{yy}^n + \delta I\right)$, and let
$$\lambda_{\min} = \inf_n \lambda_n > 0.$$
Then $l_n(x, \tilde{x}_u)$ satisfies
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \ge \min_{\vec{w}}\left\{\frac{1}{n}\, l_n(x, \hat{x}_{\vec{w}}) + \frac{\delta}{n}\|\vec{w}\|^2\right\} + \frac{A_x^2\, p}{n}\ln\left(1 + \frac{A_y^2 n}{\delta}\right) - \epsilon_n,$$
where $\epsilon_n = O(\ln(n)/n)$.

As in the scalar case, we compare the two Gaussians, $\tilde{P}_u(x_n \mid x^{n-1})$ and $P_u(x_n \mid x^{n-1})$, to obtain
$$\frac{1}{n}\, l_n(x, \tilde{x}_u) \ge \min_{\vec{w}}\left\{\frac{1}{n}\, l_n(x, \hat{x}_{\vec{w}}) + \frac{\delta}{n}\|\vec{w}\|^2\right\} + \frac{A_x^2\, p}{n}\ln\left(1 + \frac{A_y^2 n}{\delta}\right) - \epsilon_n,$$
where
$$\epsilon_n = -\frac{1}{n}\sum_{k=1}^{n}\left[2c \ln \alpha_k - \alpha_k^2\left(1 - \alpha_k^2\right)\hat{x}_u[k]^2\right].$$
The first term in $\epsilon_n$ can be shown to be $O(p\ln(n)/n)$:
$$-\frac{2c}{n}\sum_{k=1}^{n}\ln\alpha_k = \frac{c}{n}\sum_{k=1}^{n}\ln\frac{\left|R_{yy}^{k} + \delta I\right|}{\left|R_{yy}^{k-1} + \delta I\right|} = \frac{c}{n}\ln\frac{\left|R_{yy}^{n} + \delta I\right|}{|\delta I|} \le \frac{cp}{n}\ln\left(1 + \frac{A_y^2 n}{\delta}\right) = O\left(\frac{p\ln(n)}{n}\right),$$
where the second equality follows from the telescoping product, and the inequality follows since the log-determinant of a matrix equals the sum of the logarithms of its eigenvalues, together with Jensen's inequality.

The second term in $\epsilon_n$ satisfies
$$\begin{aligned}
\frac{1}{n}\sum_{k=1}^{n}\left(1 - \alpha_k^2\right)\alpha_k^2\, \hat{x}_u[k]^2
&= \frac{1}{n}\sum_{k=1}^{n}\left(1 - \frac{\left|R_{yy}^{k-1} + \delta I\right|}{\left|R_{yy}^{k} + \delta I\right|}\right)\hat{x}_u[k]\,\tilde{x}_u[k] \\
&= \frac{1}{n}\sum_{k=1}^{n} \vec{y}[k]^T\left(R_{yy}^{k-1} + \delta I\right)^{-1}\vec{y}[k]\;\tilde{x}_u[k]^2 \\
&\le \frac{p^2 A_x^2 A_y^4}{n\,\lambda_{\min}^2}\sum_{k=1}^{n} \vec{y}[k]^T\left(R_{yy}^{k-1} + \delta I\right)^{-1}\vec{y}[k] \\
&\le \frac{p^2 A_x^2 A_y^4}{n\,\lambda_{\min}^2}\left(\frac{pA_y^2}{\delta} + \frac{pA_y^2}{\lambda_{\min}}\left(1 + \ln(n)\right)\right) \\
&= O\left(\frac{\ln(n)}{n}\right),
\end{aligned}$$
where the second line uses $1 - \alpha_k^2 = \alpha_k^2\, \vec{y}[k]^T(R_{yy}^{k-1} + \delta I)^{-1}\vec{y}[k]$ and $\alpha_k^2\,\hat{x}_u[k] = \tilde{x}_u[k]$, the third line uses $|\tilde{x}_u[k]| \le |\hat{x}_u[k]| \le pA_xA_y^2/\lambda_{\min}$, and the fourth uses $\|\vec{y}[k]\|^2 \le pA_y^2$ together with the fact that $\lambda_{\min}$ lower bounds the smallest eigenvalue of $\frac{1}{k}\left[R_{yy}^{k} + \delta I\right]$, so that the smallest eigenvalue of $R_{yy}^{k-1} + \delta I$ is at least $\max\{(k-1)\lambda_{\min}, \delta\}$.

References

[1] N. Merhav and M. Feder, "Universal schemes for sequential decision from individual sequences," IEEE Trans. Inform. Theory, vol. 39, pp. 1280-1292, July 1993.
[2] V. Vovk, "Competitive on-line statistics," preprint, 1999.
[3] V. Vovk, "Aggregating strategies (learning)," in Proceedings of the Third Annual Workshop on Computational Learning Theory (M. Fulk and J. Case, eds.), (San Mateo, CA), pp. 371-383, Morgan Kaufmann, 1990.
[4] D. Haussler, J. Kivinen, and M. Warmuth, "Tight worst-case loss bounds for predicting with expert advice," in Computational Learning Theory. Second European Conference, EuroCOLT '95. Proceedings (P. Vitanyi, ed.), pp. 69-83, March 1995.
[5] N. Cesa-Bianchi, Y. Freund, D. Helmbold, D. Haussler, R. Schapire, and M. Warmuth, "How to use expert advice," Annual ACM Symposium on Theory of Computing, pp. 382-391, 1993.
[6] D. Haussler, J. Kivinen, and M. Warmuth, "Sequential prediction of individual sequences under general loss functions," IEEE Trans. Inform. Theory, vol. 44, pp. 1906-1925, Sept. 1998.
[7] G. Shamir and N. Merhav, "Low complexity sequential lossless coding for piecewise stationary memoryless sources," IEEE Trans. Inform. Theory, vol. 45, pp. 1498-1519, July 1999.
[8] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, September 1978.
[9] P. A. J. Volf and F. M. Willems, "Switching between two universal source coding algorithms," in Proc. 1998 Data Compression Conference, (Snowbird, UT), pp. 491-500, 1998.
[10] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inform. Theory, vol. IT-30, pp. 629-636, 1984.
[11] F. Willems, Y. Shtarkov, and T. Tjalkens, "The context-tree weighting method: basic properties," IEEE Trans. Inform. Theory, vol. IT-41, pp. 653-664, May 1995.
[12] T. Shamoon and C. Heegard, "Adaptive update algorithms for fixed dictionary lossless data compressors," Proc. 1994 IEEE International Symposium on Information Theory, p. 14, 1994.
[13] M. J. Weinberger, N. Merhav, and M. Feder, "Optimal sequential probability assignment for individual sequences," IEEE Trans. Inform. Theory, vol. 40, pp. 384-396, March 1994.
[14] I. Csiszar and P. Narayan, "Capacity and decoding rules for classes of arbitrarily varying channels," IEEE Trans. Inform. Theory, vol. 35, pp. 752-769, July 1989.
[15] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform. Theory, vol. IT-44, pp. 2124-2147, Oct. 1998.
[16] A. Lapidoth and J. Ziv, "Universal sequential decoding," 1998 Information Theory Workshop, p. 58, 1998.
[17] J. Ziv and N. Merhav, "A measure of relative entropy between individual sequences with application to universal classification," IEEE Trans. Inform. Theory, vol. 39, pp. 1270-1279, July 1993.
[18] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[19] N. Warke and G. Orsak, "Impact of statistical mismatch on both universal classifiers and likelihood ratio tests," Proc. 1998 IEEE International Symposium on Information Theory, p. 140, 1998.
[20] B. Fittingoff, "Universal methods of coding for the case of unknown statistics," Proc. 5th Symp. Information Theory, (Moscow/Gorky, USSR), 1972.
[21] L. Davisson, "Universal noiseless coding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 783-795, 1973.
[22] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inform. Theory, vol. 21, pp. 194-203, March 1975.
[23] J. Ziv, "Coding of sources with unknown statistics, Part I: Probability of encoding error," IEEE Trans. Inform. Theory, vol. IT-18, pp. 373-343, May 1972.
[24] R. Krichevsky and V. Trofimov, "The performance of universal encoding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 199-207, March 1981.
[25] J. Rissanen and G. Langdon, "Universal modeling and coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, Jan. 1981.
[26] J. Rissanen, "A universal data compression system," IEEE Trans. Inform. Theory, vol. IT-29, pp. 656-664, September 1983.
[27] B. Y. Ryabko, "Twice-universal coding," Problems of Information Transmission, vol. 20, pp. 173-177, Jul.-Sep. 1984.
[28] P. Chou, M. Effros, and R. Gray, "Universal quantization of parametric sources has redundancy k/2 log n/n," Proc. 1995 IEEE International Symposium on Information Theory, p. 371, 1995.
[29] M. Effros, P. Chou, and R. Gray, "Rates of convergence in adaptive universal vector quantization," Proc. 1994 IEEE International Symposium on Information Theory, p. 456, 1994.
[30] R. Zamir and M. Feder, "On universal quantization by randomized uniform/lattice quantizers," IEEE Trans. Inform. Theory, vol. 38, pp. 428-436, March 1992.
[31] T. Cover, "Universal gambling schemes and the complexity measures of Kolmogorov and Chaitin," Tech. Rep. 12, Dept. Statist., Stanford Univ., Stanford, CA, 1974.
[32] B. Y. Ryabko, "Prediction of random sequences and universal coding," Problems of Information Transmission, vol. 24, pp. 87-96, Apr.-June 1988.
[33] P. Algoet, "Universal schemes for prediction, gambling, and portfolio selection," The Annals of Probability, pp. 901-941, 1992.
[34] A. Singer and M. Feder, "Universal linear prediction by model order weighting," IEEE Transactions on Signal Processing, pp. 2685-2699, October 1999.
[35] A. Singer and M. Feder, "Twice universal linear prediction of individual sequences," 1998 IEEE International Symposium on Information Theory, 1998.
[36] A. Singer and M. Feder, "Universal linear least-squares prediction," in Proceedings of the 2000 International Symposium on Information Theory, (Sorrento, Italy), June 25-30, 2000.
[37] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, pp. 561-580, April 1975.
[38] A. H. Sayed and T. Kailath, "A state-space approach to adaptive RLS filtering," IEEE Signal Processing Magazine, pp. 18-60, July 1994.
[39] M. Feder and A. Singer, "Universal data compression and linear prediction," Proceedings of the 1998 Data Compression Conference, March 1998.
[40] J. Cioffi and T. Kailath, "Fast, recursive-least-squares transversal filters for adaptive filtering," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 304-337, April 1984.
