
A NOTE ON SEQUENCE PREDICTION

Travis Gagie
Department of Computer Science
University of Toronto
Email: [email protected]

ABSTRACT

We show sequences are predictable, in a certain sense, if they are sufficiently long and contain relatively few distinct elements. We define a statistic $P_\ell(S)$ that measures how well any function predicts a sequence $S$ using only contexts of length $\ell$. If a function's success rate predicting $S$ is at least nearly $P_\ell(S)$, then we call it a close $\ell$th-order predictor for $S$. Given $\ell$, we show how to construct a function that is a close $\ell$th-order predictor for all sequences whose length $n$ is sufficiently large and that contain $o(n^{1/(\ell+3)})$ distinct elements. Finally, we show this bound cannot be raised to $O(n^{1/\ell})$.

1 PROBLEM STATEMENT

In this paper, we explore how a sequence's predictability is affected by the number of distinct elements it contains. Let $S = s_1, \ldots, s_n$ be a fixed sequence. For any non-negative integer $\ell$, we define the $\ell$th-order predictability of $S$, denoted $P_\ell(S)$, to be the maximum probability of predicting $s_i$ given a context of length $\ell$, as in the following experiment: we are given $S$; $i$ is chosen uniformly at random from $\{1, \ldots, n\}$; if $i \le \ell$, we are told $s_i$; if $i > \ell$, we are told $s_{i-\ell}, \ldots, s_{i-1}$. Specifically,

\[
P_\ell(S) =
\begin{cases}
\displaystyle \max_{x \in S} \frac{|\{i : s_i = x\}|}{n} & \text{if } \ell = 0; \\[1ex]
\displaystyle \frac{1}{n} \Bigl( \ell + \sum_{|\alpha| = \ell} |S_\alpha| \, P_0(S_\alpha) \Bigr) & \text{if } \ell > 0.
\end{cases}
\]

Here, $x \in S$ means $x$ occurs in $S$, and $S_\alpha$ is the sequence whose $i$th element is the one immediately following the $i$th occurrence of the $\ell$-tuple $\alpha$ in $S$; the length of $S_\alpha$ is the number of occurrences of $\alpha$ in $S$ unless $\alpha$ is a suffix of $S$, in which case it is 1 less.

Notice

\[
0 < \frac{1}{|\{x : x \in S\}|} \le P_0(S) \le \cdots \le P_{n-1}(S) = P_n(S) = \cdots = 1.
\]

For example, if $S$ is the string TORONTO, then

\[
\begin{aligned}
\frac{1}{|\{x : x \in S\}|} &= 0.25, \\
P_0(S) &= 3/7 \approx 0.429, \\
P_1(S) &= \frac{1}{7} \bigl( 1 + P_0(S_{\mathrm{N}}) + 2 P_0(S_{\mathrm{O}}) + P_0(S_{\mathrm{R}}) + 2 P_0(S_{\mathrm{T}}) \bigr) \\
       &= \frac{1}{7} \bigl( 1 + P_0(\mathrm{T}) + 2 P_0(\mathrm{RN}) + P_0(\mathrm{O}) + 2 P_0(\mathrm{OO}) \bigr) \\
       &= 6/7 \approx 0.857,
\end{aligned}
\]

and all higher-order predictabilities of $S$ are 1. In other words, if someone chooses a character uniformly at random from TORONTO and asks us to guess it, then our chance of doing so is about 0.429; if they tell us the preceding character before we guess, then on average our chance is about 0.857; if they tell us the preceding two or more characters, then we are certain of the answer.

We call a (possibly randomized) function $F$ an $\epsilon$-close $\ell$th-order predictor for $S$ if

\[
R(F, S) = E\!\left[ \frac{|\{i : F(s_1, \ldots, s_{i-1}) = s_i\}|}{n} \right] \ge P_\ell(S) - \epsilon;
\]

by the definition of $P_\ell(S)$, for any function $G$ on contexts of length $\ell$,

\[
E\!\left[ \frac{|\{i : G(s_{\max(i-\ell,1)}, \ldots, s_{i-1}) = s_i\}|}{n} \right] \le P_\ell(S).
\]

We call $F$ an $o(1)$-close $\ell$th-order predictor for a set $\mathcal{S}$ of sequences if, for any $\epsilon > 0$ and any sufficiently long $S \in \mathcal{S}$, $F$ is an $\epsilon$-close $\ell$th-order predictor for $S$.

We would like to give a close high-order predictor for any sufficiently long sequence, but we cannot: fix $F$ and suppose $S$ is a permutation of $\{1, \ldots, n\}$ chosen uniformly at random; it is not hard to show $P_1(S) = 1$ but $R(F, S) \le H_n / n$, where $H_n$ is the $n$th harmonic number. Instead, we give a family of functions such that the $\ell$th function in the family is an $o(1)$-close $\ell$th-order predictor for any set $\mathcal{S}$ of sequences with

\[
\max_{S \in \mathcal{S},\, |S| = n} |\{x : x \in S\}| \in o\bigl( n^{1/(\ell+3)} \bigr).
\]

We then show this bound cannot be raised to

\[
\max_{S \in \mathcal{S},\, |S| = n} |\{x : x \in S\}| \in O(n^{1/\ell}).
\]
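To make the definition of the $\ell$th-order predictability concrete, the following Python sketch computes it exactly for short sequences and can be used to check the TORONTO values. The function names `p0` and `predictability` are illustrative and do not come from the paper. The sketch also assumes that the additive term $\ell$ should be capped at $n$ when $\ell > n$, in line with the remark that all sufficiently high-order predictabilities equal 1.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def p0(seq):
    """0th-order predictability: relative frequency of the most common element."""
    if len(seq) == 0:
        raise ValueError("sequence must be non-empty")
    return Fraction(max(Counter(seq).values()), len(seq))


def predictability(seq, order):
    """Exact order-th predictability of seq, following the two-case definition.

    Assumption (not stated in the paper): the additive term is capped at
    len(seq), so that orders larger than the sequence length still give 1.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    if order == 0:
        return p0(seq)

    # followers[alpha] plays the role of S_alpha: the elements that
    # immediately follow each occurrence of the context alpha.
    followers = defaultdict(list)
    for i in range(order, n):
        followers[tuple(seq[i - order:i])].append(seq[i])

    # |S_alpha| * P_0(S_alpha) is the count of the most common follower.
    best = sum(max(Counter(f).values()) for f in followers.values())
    return Fraction(min(order, n) + best, n)


if __name__ == "__main__":
    for order in range(4):
        print(order, predictability("TORONTO", order))
```

Exact rational arithmetic via `Fraction` is used so the results can be compared directly with 3/7 and 6/7 rather than with rounded decimals.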

2 RELATED WORK

It is surprisingly difficult even to find a close 0th-order predictor for the set of binary sequences. For example, by diagonalization, such a predictor must be randomized: if $F$ is deterministic, then $R(F, S) = 0$ when each $s_i \neq F(s_1, \ldots, s_{i-1})$. Hannan [2] randomized the obvious function (always predict the most common element so far) and gave a function $A$ that, among other things, is an $o(1)$-close 0th-order predictor for any set $\mathcal{S}$ of sequences with

\[
\max_{S \in \mathcal{S},\, |S| = n} |\{x : x \in S\}| \in o(n^{1/3}).
\]

Kalai and Vempala [4] present Hannan's predictor in the context of current machine learning theory and applications. Feder, Merhav and Gutman [1] and Krishnan and Vitter [5] combined Hannan's predictor and the LZ78 compression algorithm [6] to obtain functions asymptotically as good as any finite-state predictor; in particular, for any $\ell \in O(1)$, Feder, Merhav and Gutman's is an $o(1)$-close $\ell$th-order predictor for the set of binary sequences, and Krishnan and Vitter's is an $o(1)$-close $\ell$th-order predictor for any set of sequences containing a constant number of distinct elements.

3 BOOSTING HANNAN'S PREDICTOR

Let $S$ be a sequence, let $\ell$ be a positive integer and let $\alpha$ be the suffix of $S$ of length $\ell$; we define $A_\ell(S) = A(S_\alpha)$, where $A$ is Hannan's predictor and $S_\alpha$ is as in Section 1.

Theorem 1. The function $A_\ell$ is an $o(1)$-close $\ell$th-order predictor for any set $\mathcal{S}$ of sequences with

\[
\max_{S \in \mathcal{S},\, |S| = n} |\{x : x \in S\}| \in o\bigl( n^{1/(\ell+3)} \bigr).
\]

Proof. Fix $\epsilon > 0$ and let $S = s_1, \ldots, s_n \in \mathcal{S}$ be such that $\ell / n \le \epsilon / 3$ and $A$ is an $(\epsilon/3)$-close 0th-order predictor for any subsequence of $S$ with length at least $\epsilon n / (3 |\{x : x \in S\}|^\ell)$: any sufficiently long sequence in $\mathcal{S}$ has these properties because

\[
\max_{S \in \mathcal{S},\, |S| = n} |\{x : x \in S\}| \in o\!\left( \max_{S \in \mathcal{S},\, |S| = n} \left( \frac{n}{|\{x : x \in S\}|^\ell} \right)^{1/3} \right).
\]

Let $L$ be the set of $\ell$-tuples $\alpha$ such that $A$ is an $(\epsilon/3)$-close 0th-order predictor for $S_\alpha$. Thus,

\[
\begin{aligned}
P_\ell(S) &= \frac{1}{n} \Bigl( \ell + \sum_{\alpha \in L} |S_\alpha| \, P_0(S_\alpha) + \sum_{\alpha \notin L} |S_\alpha| \, P_0(S_\alpha) \Bigr) \\
          &\le \frac{1}{n} \sum_{\alpha \in L} |S_\alpha| \, P_0(S_\alpha) + \frac{2\epsilon}{3}
\end{aligned}
\]

and

\[
\begin{aligned}
R(A_\ell, S) &\ge \frac{1}{n} \sum_{\alpha \in L} |S_\alpha| \, R(A, S_\alpha) \\
             &\ge \frac{1}{n} \sum_{\alpha \in L} |S_\alpha| \Bigl( P_0(S_\alpha) - \frac{\epsilon}{3} \Bigr) \\
             &\ge \frac{1}{n} \sum_{\alpha \in L} |S_\alpha| \, P_0(S_\alpha) - \frac{\epsilon}{3},
\end{aligned}
\]

so $A_\ell$ is an $\epsilon$-close $\ell$th-order predictor for $S$.

We now show the bound cannot be raised to

\[
\max_{S \in \mathcal{S},\, |S| = n} |\{x : x \in S\}| \in O(n^{1/\ell}),
\]

regardless of which function is chosen. Fix $c > 0$ and, for all $k \ge 1$, let $\mathcal{S}$ contain all sequences of $\lceil k^\ell / c \rceil$ elements drawn from $\{1, \ldots, k\}$. Suppose we choose $S$ uniformly at random from the sequences of length $n$ in $\mathcal{S}$. By definition, $P_\ell(S) \, n$ is at least the number of distinct $\ell$-tuples in $S$ minus 1. Janson, Lonardi and Szpankowski [3] showed the expected number of distinct $\ell$-tuples in $S$ is

\[
cn \left( 1 - \frac{1}{e^{1/c}} \right) + O(\ell),
\]

so

\[
E[P_\ell(S)] \ge c \left( 1 - \frac{1}{e^{1/c}} \right) + o(1).
\]

However, for any function $F$,

\[
E[R(F, S)] \le \frac{1 + o(1)}{(cn)^{1/\ell}}
\]

approaches 0 as $n$ grows. It follows there is no $o(1)$-close $\ell$th-order predictor for $\mathcal{S}$.
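The counting step in this lower bound can be checked empirically. The Python sketch below draws random sequences of length $\lceil k^\ell / c \rceil$ over $\{1, \ldots, k\}$, counts their distinct $\ell$-tuples, and compares the average with the leading term $cn(1 - e^{-1/c})$ quoted from [3]. It is a simulation under stated assumptions, not part of the proof: the function names, the trial count and the seed are all illustrative choices, and the $O(\ell)$ correction is ignored.

```python
import math
import random


def count_distinct_tuples(seq, order):
    """Number of distinct length-`order` windows occurring in seq."""
    return len({tuple(seq[i:i + order]) for i in range(len(seq) - order + 1)})


def simulate_distinct_tuples(k, order, c, trials=200, seed=0):
    """Average number of distinct tuples in uniformly random sequences.

    Returns (n, empirical_mean, leading_term), where n = ceil(k**order / c)
    and leading_term = c * n * (1 - exp(-1 / c)).
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n = math.ceil(k ** order / c)
    total = 0
    for _ in range(trials):
        seq = [rng.randrange(1, k + 1) for _ in range(n)]
        total += count_distinct_tuples(seq, order)
    empirical_mean = total / trials
    leading_term = c * n * (1 - math.exp(-1 / c))
    return n, empirical_mean, leading_term


if __name__ == "__main__":
    n, empirical, leading = simulate_distinct_tuples(k=30, order=2, c=1.0)
    print(f"n = {n}, empirical = {empirical:.1f}, leading term = {leading:.1f}")
    # Each element is uniform on {1, ..., k} and independent of the prefix,
    # so any predictor guesses it correctly with probability exactly 1/k.
    print(f"success rate of any predictor = {1 / 30:.4f}")
```

Dividing the empirical count by $n$ gives a lower estimate of $E[P_\ell(S)]$ that stays bounded away from 0, while the $1/k$ success rate of any predictor shrinks as $k$ (and hence $n$) grows, which is the gap the argument exploits.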

References

[1] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38(4):1258–1270, 1992.
[2] J. Hannan. Approximation of Bayes risk in repeated plays. In M. Dresher, A.W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97–139. Princeton University Press, 1957.
[3] S. Janson, S. Lonardi, and W. Szpankowski. On average sequence complexity. Theoretical Computer Science, 326(1–3):213–227, 2004.
[4] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. In Proceedings of the 16th Conference on Computational Learning Theory, pages 26–40, 2003.
[5] P. Krishnan and J.S. Vitter. Optimal prediction for prefetching in the worst case. SIAM Journal on Computing, 27(6):1617–1636, 1998.
[6] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.