A NOTE ON SEQUENCE PREDICTION

Travis Gagie
Department of Computer Science
University of Toronto
Email: [email protected]

ABSTRACT

We show sequences are predictable, in a certain sense, if they are sufficiently long and contain relatively few distinct elements. We define a statistic P_ℓ(S) that measures how well any function predicts a sequence S using only contexts of length ℓ. If a function's success rate predicting S is at least nearly P_ℓ(S), then we call it a close ℓth-order predictor for S. Given ℓ, we show how to construct a function that is a close ℓth-order predictor for all sequences whose length n is sufficiently large and that contain o(n^{1/(ℓ+3)}) distinct elements. Finally, we show this bound cannot be raised to O(n^{1/ℓ}).
1 PROBLEM STATEMENT

In this paper, we explore how a sequence's predictability is affected by the number of distinct elements it contains. Let S = s_1, ..., s_n be a fixed sequence. For any non-negative integer ℓ, we define the ℓth-order predictability of S, denoted P_ℓ(S), to be the maximum probability of predicting s_i given a context of length ℓ, as in the following experiment: we are given S; i is chosen uniformly at random from {1, ..., n}; if i ≤ ℓ, we are told s_i; if i > ℓ, we are told s_{i−ℓ}, ..., s_{i−1}. Specifically,
$$P_\ell(S) =
\begin{cases}
\displaystyle \max_{x \in S} \frac{|\{i : s_i = x\}|}{n} & \text{if } \ell = 0;\\[2ex]
\displaystyle \frac{1}{n}\Bigl(\ell + \sum_{|\alpha| = \ell} |S_\alpha|\, P_0(S_\alpha)\Bigr) & \text{if } \ell > 0.
\end{cases}$$
Here, x ∈ S means x occurs in S, and S_α is the sequence whose ith element is the one immediately following the ith occurrence of the ℓ-tuple α in S; the length of S_α is the number of occurrences of α in S unless α is a suffix of S, in which case it is 1 less.
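To make these definitions concrete, the following Python sketch (the helper name `predictability` is ours, not the paper's) computes P_ℓ(S) directly from the definition above; running it on the string TORONTO reproduces the values worked out in the example below.

```python
from collections import Counter, defaultdict

def predictability(S, ell):
    """P_ell(S), computed directly from the definition above."""
    n = len(S)
    if ell == 0:
        return max(Counter(S).values()) / n
    # S_alpha: the elements that follow each occurrence of the ell-tuple alpha.
    successors = defaultdict(list)
    for i in range(n - ell):
        successors[tuple(S[i:i + ell])].append(S[i + ell])
    return (ell + sum(len(s) * predictability(s, 0)
                      for s in successors.values())) / n

print(predictability("TORONTO", 0))  # 3/7 ~ 0.429
print(predictability("TORONTO", 1))  # 6/7 ~ 0.857
print(predictability("TORONTO", 2))  # 1.0
```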
Notice 0 < 1/|{x : x ∈ S}| ≤ P_0(S) ≤ ··· ≤ P_{n−1}(S) = P_n(S) = ··· = 1. For example, if S is the string TORONTO, then
$$\begin{aligned}
\frac{1}{|\{x : x \in S\}|} &= 0.25\,,\\
P_0(S) &= 3/7 \approx 0.429\,,\\
P_1(S) &= \tfrac{1}{7}\bigl(1 + P_0(S_{\mathrm N}) + 2P_0(S_{\mathrm O}) + P_0(S_{\mathrm R}) + 2P_0(S_{\mathrm T})\bigr)\\
       &= \tfrac{1}{7}\bigl(1 + P_0(\mathrm T) + 2P_0(\mathrm{RN}) + P_0(\mathrm O) + 2P_0(\mathrm{OO})\bigr)\\
       &= 6/7 \approx 0.857\,,
\end{aligned}$$
and all higher-order predictabilities of S are 1. In other words, if someone chooses a character uniformly at random from TORONTO and asks us to guess it, then our chance of doing so is about 0.429; if they tell us the preceding character before we guess, then on average our chance is about 0.857; if they tell us the preceding two or more characters, then we are certain of the answer.

We call a (possibly randomized) function F an ε-close ℓth-order predictor for S if
$$R(F, S) = \mathrm{E}\!\left[\frac{|\{i : F(s_1, \dots, s_{i-1}) = s_i\}|}{n}\right] \ge P_\ell(S) - \epsilon\,;$$
by the definition of P_ℓ(S), for any function G on contexts of length ℓ,
$$\mathrm{E}\!\left[\frac{|\{i : G(s_{\max(i-\ell,1)}, \dots, s_{i-1}) = s_i\}|}{n}\right] \le P_\ell(S)\,.$$
We call F an o(1)-close ℓth-order predictor for a set S of sequences if, for any ε > 0 and any sufficiently
long S ∈ S, F is an ε-close ℓth-order predictor for S. We would like to give a close high-order predictor for any sufficiently long sequence, but we cannot: fix F and suppose S is a permutation of {1, ..., n} chosen uniformly at random; it is not hard to show P_1(S) = 1 but R(F, S) ≤ H_n/n, where H_n is the nth harmonic number. Instead, we give a family of functions such that the ℓth function in the family is an o(1)-close ℓth-order predictor for any set S of sequences with
$$\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in o\!\left(n^{\frac{1}{\ell+3}}\right).$$
We then show this bound cannot be raised to $\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in O(n^{1/\ell})$.
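As a quick illustration of the permutation example (our own sketch, not part of the paper), the following Python snippet estimates the success rate of one concrete deterministic predictor, guessing the smallest value not yet seen, on random permutations; the average comes out near H_n/n even though P_1(S) = 1 for every permutation S.

```python
import random

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

def trial(n):
    """Fraction of correct guesses on one random permutation of {1, ..., n}.

    The predictor always guesses the smallest value it has not seen yet;
    any deterministic guess computed from the prefix does no better in
    expectation.
    """
    S = list(range(1, n + 1))
    random.shuffle(S)
    seen, correct, guess = set(), 0, 1
    for s in S:
        while guess in seen:
            guess += 1
        correct += (s == guess)
        seen.add(s)
    return correct / n

n, trials = 1000, 200
print(sum(trial(n) for _ in range(trials)) / trials)  # observed success rate
print(harmonic(n) / n)                                # H_n / n, about 0.0075
```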
2 RELATED WORK
It is surprisingly difficult even to find a close 0th-order predictor for the set of binary sequences. For example, by diagonalization, such a predictor must be randomized: if F is deterministic, then R(F, S) = 0 for the binary sequence defined by s_i ≠ F(s_1, ..., s_{i−1}). Hannan [2] randomized the obvious function (always predict the most common element so far) and gave a function A that, among other things, is an o(1)-close 0th-order predictor for any set S of sequences with $\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in o(n^{1/3})$.
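The following Python sketch shows the follow-the-perturbed-leader idea usually attributed to Hannan; the uniform perturbation and its sqrt(n) scale are common textbook choices rather than Hannan's exact scheme, and the alphabet is assumed to be known up front.

```python
import math
import random
from collections import Counter

def perturbed_leader_predictions(S):
    """Run a Hannan-style randomized 0th-order predictor over S.

    Before each position, add a fresh random perturbation to every
    element's count so far and predict the element whose perturbed
    count is largest.
    """
    scale = math.sqrt(len(S))          # a common perturbation scale, assumed here
    alphabet = sorted(set(S))          # assume the alphabet is known up front
    counts, predictions = Counter(), []
    for s in S:
        predictions.append(max(alphabet,
                               key=lambda x: counts[x] + random.uniform(0, scale)))
        counts[s] += 1
    return predictions

S = "ABRACADABRA"
preds = perturbed_leader_predictions(S)
print(sum(p == s for p, s in zip(preds, S)) / len(S))  # compare with P_0(S) = 5/11
```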
Kalai and Vempala [4] present Hannan's predictor in the context of current machine learning theory and applications. Feder, Merhav and Gutman [1] and Krishnan and Vitter [5] combined Hannan's predictor and the LZ78 compression algorithm [6] to obtain functions asymptotically as good as any finite-state predictor; in particular, for any ℓ ∈ O(1), Feder, Merhav and Gutman's is an o(1)-close ℓth-order predictor for the set of binary sequences, and Krishnan and Vitter's is an o(1)-close ℓth-order predictor for any set of sequences containing a constant number of distinct elements.
3 BOOSTING HANNAN'S PREDICTOR

Let S be a sequence, let ℓ be a positive integer and let α be the suffix of S of length ℓ; we define A_ℓ(S) = A(S_α), where A is Hannan's predictor and S_α is as in Section 1.
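In code, the construction looks roughly as follows; this is a minimal sketch in which any 0th-order predictor can stand in for Hannan's A (here a plain majority vote, just for the demonstration), and the helper names are ours.

```python
from collections import Counter

def boosted_predictor(A, ell):
    """Sketch of A_ell from Section 3:  A_ell(S) = A(S_alpha).

    A is any 0th-order predictor (a function mapping a sequence to a guess)
    and alpha is the length-ell suffix of the prefix seen so far; A is applied
    to the subsequence of symbols that followed earlier occurrences of alpha.
    """
    def predict(prefix):
        if len(prefix) < ell:
            return None            # the experiment reveals s_i when i <= ell
        alpha = tuple(prefix[-ell:])
        s_alpha = [prefix[i + ell]
                   for i in range(len(prefix) - ell)
                   if tuple(prefix[i:i + ell]) == alpha]
        return A(s_alpha)
    return predict

# Usage with a deliberately simple stand-in for Hannan's predictor:
def majority_vote(seq):
    # "?" is an arbitrary guess when alpha has not occurred before
    return Counter(seq).most_common(1)[0][0] if seq else "?"

A1 = boosted_predictor(majority_vote, 1)
S = "TORONTO"
print([A1(S[:i]) for i in range(1, len(S))])  # guesses for s_2, ..., s_7
```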
Theorem 1. The function A_ℓ is an o(1)-close ℓth-order predictor for any set S of sequences with
$$\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in o\!\left(n^{\frac{1}{\ell+3}}\right).$$

Proof. Fix ε > 0 and let S = s_1, ..., s_n ∈ S be such that ℓ/n ≤ ε/3 and A is an (ε/3)-close 0th-order predictor for any subsequence of S with length at least εn / (3|{x : x ∈ S}|^ℓ): any sufficiently long sequence in S has these properties because
$$\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in o\!\left(n^{\frac{1}{\ell+3}}\right)$$
implies
$$\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in o\!\left(\left(\frac{n}{|\{x : x \in S\}|^{\ell}}\right)^{1/3}\right).$$
Let L be the set of ℓ-tuples α such that A is an (ε/3)-close 0th-order predictor for S_α. Thus,
$$P_\ell(S) = \frac{1}{n}\Bigl(\ell + \sum_{\alpha \in L} |S_\alpha|\, P_0(S_\alpha) + \sum_{\alpha \not\in L} |S_\alpha|\, P_0(S_\alpha)\Bigr) \le \frac{1}{n}\sum_{\alpha \in L} |S_\alpha|\, P_0(S_\alpha) + \frac{2\epsilon}{3}\,,$$
since ℓ/n ≤ ε/3 and each of the at most |{x : x ∈ S}|^ℓ tuples α ∉ L has |S_α| < εn / (3|{x : x ∈ S}|^ℓ), and
$$R(A_\ell, S) \ge \frac{1}{n}\sum_{\alpha \in L} |S_\alpha|\, R(A, S_\alpha) \ge \frac{1}{n}\sum_{\alpha \in L} |S_\alpha|\Bigl(P_0(S_\alpha) - \frac{\epsilon}{3}\Bigr) \ge \frac{1}{n}\sum_{\alpha \in L} |S_\alpha|\, P_0(S_\alpha) - \frac{\epsilon}{3}\,,$$
so A_ℓ is an ε-close ℓth-order predictor for S.

We now show the bound cannot be raised to $\max_{S \in \mathbf{S},\, |S| = n} |\{x : x \in S\}| \in O(n^{1/\ell})$, regardless of
which function is chosen. Fix c > 0 and, for all k ≥ 1, let S contain all sequences of ⌈k^ℓ/c⌉ elements drawn from {1, ..., k}. Suppose we choose S uniformly at random from the sequences of length n in S. By definition, P_ℓ(S) · n is at least the number of distinct ℓ-tuples in S minus 1. Janson, Lonardi and Szpankowski [3] showed the expected number of distinct ℓ-tuples in S is
$$c n \left(1 - \frac{1}{e^{1/c}}\right) + O(\ell)\,,$$
so
$$\mathrm{E}[P_\ell(S)] \ge c \left(1 - \frac{1}{e^{1/c}}\right) + o(1)\,.$$
However, for any function F,
$$\mathrm{E}[R(F, S)] \le \frac{1 + o(1)}{(c n)^{1/\ell}}\,,$$
which approaches 0 as n grows. It follows there is no o(1)-close ℓth-order predictor for S.
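To put rough numbers on this argument, here is a small simulation (the parameters ℓ = 3, k = 40, c = 2 are our own choice) comparing the observed number of distinct ℓ-tuples in a random sequence with the estimate cited above, and comparing the resulting lower bound on E[P_ℓ(S)] with the roughly 1/k success rate available to any predictor.

```python
import math
import random

def distinct_tuples(ell, k, c, trials=20):
    """Average number of distinct ell-tuples in a random sequence over
    {1, ..., k} of length ceil(k**ell / c)."""
    n = math.ceil(k ** ell / c)
    total = 0
    for _ in range(trials):
        S = [random.randint(1, k) for _ in range(n)]
        total += len({tuple(S[i:i + ell]) for i in range(n - ell + 1)})
    return n, total / trials

ell, k, c = 3, 40, 2.0
n, observed = distinct_tuples(ell, k, c)
print(observed)                        # observed average
print(c * n * (1 - math.exp(-1 / c)))  # cn(1 - 1/e^{1/c}), ignoring the O(ell) term
print(c * (1 - math.exp(-1 / c)))      # lower bound on E[P_ell(S)], about 0.79 here
print(1 / k)                           # success rate of any predictor, here 0.025
```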
References

[1] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38(4):1258–1270, 1992.

[2] J. Hannan. Approximation of Bayes risk in repeated plays. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97–139. Princeton University Press, 1957.

[3] S. Janson, S. Lonardi, and W. Szpankowski. On average sequence complexity. Theoretical Computer Science, 326(1–3):213–227, 2004.

[4] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. In Proceedings of the 16th Conference on Computational Learning Theory, pages 26–40, 2003.

[5] P. Krishnan and J. S. Vitter. Optimal prediction for prefetching in the worst case. SIAM Journal on Computing, 27(6):1617–1636, 1998.

[6] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.