Learning Changing Concepts by Exploiting the Structure of Change

Peter L. Bartlett
Department of Systems Engineering, Research School of Information Sciences and Engineering, Australian National University, Canberra, ACT 0200, Australia
+61 2 6279 8681, [email protected]

Shai Ben-David
Department of Computer Science, Technion, Haifa 32000, Israel
[email protected]

Sanjeev R. Kulkarni
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA
[email protected]
An earlier version of this paper was presented at the Ninth Annual Conference on Computational Learning Theory.
Abstract

This paper examines learning problems in which the target function is allowed to change. The learner sees a sequence of random examples, labelled according to a sequence of functions, and must provide an accurate estimate of the target function sequence. We consider a variety of restrictions on how the target function is allowed to change, including infrequent but arbitrary changes, sequences that correspond to slow walks on a graph whose nodes are functions, and changes that are small on average, as measured by the probability of disagreements between consecutive functions. We first study estimation, in which the learner sees a batch of examples and is then required to give an accurate estimate of the function sequence. Our results provide bounds on the sample complexity and allowable drift rate for these problems. We also study prediction, in which the learner must produce a hypothesis online after each labelled example, and the average misclassification probability over this hypothesis sequence should be small. Using a deterministic analysis in a general metric space setting, we provide a technique for constructing a successful prediction algorithm, given a successful estimation algorithm. This leads to sample complexity and drift rate bounds for the prediction of changing concepts.
Key words: computational learning theory, sample complexity, concept drift, estimation, tracking
Running head: Learning changing concepts
1 Introduction

We consider the problem of learning to track a changing subset of a domain from random examples. In many learning problems for which the environment is changing, there is some structure to the change. For example, the daily weather at a given location may be viewed as a changing concept having some basic underlying structure to its change. On a short-term scale changes are of bounded variation, on a larger annual scale changes are roughly cyclic, and there are probably some further subtle rules governing the structure of daily weather changes. Another rather practical example arises in a steel rolling mill, where the efficiency of the mill's operation depends on how accurately the behavior of the rolling surfaces can be predicted [6]. As in many industrial processes, there is an accurate physical model of the target function (relating the measured variables to the desired quantity), but there are several unknown parameters, and these may change over time. The change may be slow (as the rollers wear), or occasionally fast (when something fails). Again, there is a definite structure to the change. Yet another scenario that can be viewed similarly arises in computer vision, when one wishes to identify an object from a sequence of photographs taken while the object or the camera is in motion.

In this paper we address the question: when can we exploit the structure of change to learn a changing concept? More formally, we assume that the learner sees, at time $t$, a random example $x_t$ from some domain $X$, together with the value of an unknown target function $f_t \colon X \to \{0,1\}$ at the point $x_t$. The function is an element of a known class $F$. The distribution that generates the examples is assumed to remain fixed, but the function can change between examples, with some structure to the change. We formalize the structure by defining a set of legal function sequences. For instance, cyclic or seasonal changes correspond to a walk on a directed cyclic graph. In the rolling mill example, the legal sequences might be those corresponding to smooth paths in parameter space.

We consider the following two problems of learning in a changing environment.

Estimation: When can one estimate a sequence of concepts $(f_1, f_2, \dots, f_n)$ on the basis of a set of random samples $(x_1, f_1(x_1)), \dots, (x_n, f_n(x_n))$? This may be thought of as 'understanding the past on the basis of gathered experience.'

Prediction: When can one predict the next concept in a sequence of concepts $(f_1, f_2, \dots, f_n)$, on the basis of random samples of previous concepts? This
may be thought of as 'predicting the future on the basis of past experience.'

Note that in the usual PAC model these two issues coincide. In that model, there is only one target concept per learning session (it remains fixed throughout the learning process).

The problem of predicting labels for a changing concept has been considered elsewhere. Helmbold and Long [10] consider prediction when the concept is allowed to drift slowly between trials. That is, any two consecutive functions $f_i$ and $f_{i+1}$ must have $\Pr(f_i \neq f_{i+1})$ small. This is a natural measure of concept drift, since it can be thought of as the weakest assumption that implies the labels of random examples will not vary much. Their work is in a slightly different setting—they consider prediction strategies that aim to minimize the probability over a long sequence of misclassifying the last example—but the results can be easily converted between settings. The bound on allowable drift that they obtained, $\Omega(\epsilon^2/(d\log(1/\epsilon)))$ (where $\epsilon$ is the allowable prediction error and $d$ is the VC-dimension of the class), was subsequently improved by Bartlett and Helmbold [2] to $\Omega(\epsilon^2/(d + \log(1/\epsilon)))$. This result is also a special case of a later result due to Barve and Long [3]. It has recently been improved by Long [11] to $\Omega(\epsilon^2/d)$. These papers impose a uniform bound on the probability of disagreement between consecutive functions. One of the function sequence classes considered here contains sequences that satisfy a weaker (time-averaged) version of this bound. The final result of the paper, when converted to the setting of the earlier work described above, shows that with this weaker constraint, the allowable drift rate decreases by no more than log factors: $\epsilon^2/d$ versus $\epsilon^2/(d\log^2(d/(\epsilon\delta)))$.

Several authors [1, 2, 3, 11] have considered learning problems in which a changing environment is modelled by a slowly changing distribution on the product space $X \times \{0,1\}$. The allowable drift is restricted by ensuring that consecutive probability distributions are close in total variation distance. Clearly, allowing a changing concept with a bound on the probability of disagreement between consecutive functions is a special case of this model. More recently, Freund and Mansour [7] have investigated learning when the distribution changes as a linear function of time. They present algorithms that estimate the error of functions, using knowledge of this linear drift.

Blum and Chalasani [4] consider learning switching concepts. The target concept is allowed to switch between concepts in the class, but with some constraint on the total number of concepts visited, or on the frequency of switches. Their most closely related results concentrate on the computational complexity of predicting switching concepts from particular concept classes. In contrast, we give
a general condition on switching frequency for which the estimation and prediction of switching concepts from any class with finite VC-dimension is possible, ignoring computational constraints.

The next section introduces the notation. Section 3 considers the problem of sequence estimation (from a batch of labelled samples). Our main result here is the derivation of a sufficient condition that guarantees estimability of a family of sequences of functions. This result may be viewed as an extension of the basic sufficiency theorem of [5] for PAC learnability of classes of (single) functions. We go on to apply this result to provide sample size upper bounds for the estimation of several naturally arising families of function sequences. In Section 4 we discuss the function prediction problem. We show the success of certain $k$-th order Markovian prediction strategies in the setting of function prediction. We deviate from previous work on prediction via Markovian strategies in that, rather than assuming access to the complete last-$k$-steps information (and then looking for the best Markovian strategy, or the best one in some computationally restricted family of strategies, as in the work of Merhav and Feder [12]), we assume that our predictor can only approximate the past sequence $(f_{t-k}, \dots, f_{t-1})$. We conclude the paper by gluing together our estimation and prediction results to obtain sample size upper bounds for prediction of changing concepts under several types of change constraints.
2 Basic Notation

Throughout, we let $X$ be a set, and we consider classes $F$ of $\{0,1\}$-valued functions defined on $X$. We fix some $\sigma$-algebra of subsets of $X$ and consider probability distributions over $X$ that are defined over this algebra. Furthermore, we shall assume that all functions we consider are measurable with respect to this algebra of sets. This is true if $X$ is countable; for uncountable $X$, [5] gives mild conditions on $F$ that suffice.

The growth function of $F$, $\Pi_F \colon \mathbb{N} \to \mathbb{N}$, is defined as $\Pi_F(n) = \max\{|F_{|x}| : x \in X^n\}$, where
\[ F_{|x} = \{(f(x_1), \dots, f(x_n)) : f \in F\}. \]
The Vapnik-Chervonenkis dimension of $F$ is defined as
\[ \mathrm{VCdim}(F) = \max\{n : \Pi_F(n) = 2^n\}. \]
For a sequence of functions $f = (f_1, \dots, f_n)$ and a sequence of points $x = (x_1, \dots, x_n) \in X^n$, let $f(x)$ denote the sequence $(f_1(x_1), \dots, f_n(x_n))$. For a set of sequences of functions $F_n \subseteq F^n$, we denote $\{f(x) : f \in F_n\}$ by $F_{n|x}$.
For a probability distribution $P$ on $X$, define the $P$-induced pseudometric $d_P$ over a class of functions $F$ by $d_P(f,g) = P\{x : f(x) \neq g(x)\}$ (for $f, g \in F$). For sequences $f = (f_1, \dots, f_n)$ and $g = (g_1, \dots, g_n)$ in $F^n$, extend $d_P$ to the pseudometric $\bar d_P$ on $F^n$ by
\[ \bar d_P(f, g) = \frac{1}{n} \sum_{i=1}^{n} d_P(f_i, g_i), \]
and for a sequence $x$ in $X^n$ define
\[ \hat d_x(f, g) = \frac{1}{n} \left| \{ i : f_i(x_i) \neq g_i(x_i) \} \right| . \]

We will consider a variety of constraints on the function sequences that restrict how much the functions can fluctuate over time. Given a (pseudo)metric $d$ over a class $F$ of functions, a natural measure of these fluctuations is the average distance, in the metric $d$, between subsequent functions,
\[ V_d(f) \stackrel{\mathrm{def}}{=} \frac{1}{|f| - 1} \sum_{i=1}^{|f|-1} d(f_i, f_{i+1}) \qquad (1) \]
(where $|f|$ stands for the length of the sequence $f$). Note that, for the discrete metric $D$ over $F$ (for which $D(f,g)$ takes value 0 if $f = g$ and 1 otherwise), one gets
\[ V_D(f) \stackrel{\mathrm{def}}{=} \frac{1}{|f| - 1} \left| \{ 1 \le i < |f| : f_i \neq f_{i+1} \} \right| . \qquad (2) \]
We will refer to $V_{d_P}(f)$ as $V_P(f)$ and to $V_D(f)$ as $D(f)$. Note also that, for every sequence $f$ and for any probability distribution $P$ over $X$, $V_P(f) \le D(f)$.

A third type of metric we shall refer to arises when one is given a directed graph whose nodes are members of the function class $F$. Such a graph may be used to model a scenario in which, if a system is in some state $f \in F$ at one moment, it may switch at the next moment to a state in a restricted subset of $F$. In such a case we shall define a digraph by having the edges reflect this 'possible next state' relation. Given a directed graph $G$ over $F$, let $d_G$ denote the metric defined by the length of the shortest path in $G$. We will refer to $V_{d_G}$ as $V_G$.

Let $\log$ denote logarithm to base 2 and $\ln$ denote the natural logarithm.
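To make these quantities concrete, here is a minimal Python sketch of the empirical distance $\hat d_x$ and the variation functionals; the representation of functions as Python callables and all helper names (empirical_distance, variation, discrete_metric, make_threshold) are ours, not the paper's.

```python
def empirical_distance(f_seq, g_seq, x):
    """Empirical distance: fraction of indices i with f_i(x_i) != g_i(x_i)."""
    return sum(fi(xi) != gi(xi) for fi, gi, xi in zip(f_seq, g_seq, x)) / len(x)

def variation(f_seq, d):
    """Average variation V_d(f): mean distance between consecutive functions."""
    n = len(f_seq)
    return sum(d(f_seq[i], f_seq[i + 1]) for i in range(n - 1)) / (n - 1)

def discrete_metric(f, g):
    """Discrete metric D: 0 if the two functions are identical, 1 otherwise.
    (Identity of the callables stands in for equality of functions here.)"""
    return 0.0 if f is g else 1.0

def make_threshold(theta):
    """Indicator of [theta, 1], a simple {0,1}-valued function on [0, 1]."""
    return lambda x: 1 if x >= theta else 0

# A sequence of six functions with a single switch: D(f) = 1/5,
# one disagreement among five consecutive pairs.
f = [make_threshold(0.3)] * 3 + [make_threshold(0.7)] * 3
print(variation(f, discrete_metric))  # 0.2
```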
3 Sequence estimation

Definition 1. For $n \in \mathbb{N}$ and any distribution $P$ on $X$, let $F_n(P) \subseteq F^n$ be a set of function sequences of length $n$. We call these legal sequences.

An estimation algorithm $A$ is a function that maps sequences from $X^n \times \{0,1\}^n$ to function sequences from $F^n$.

For $0 < \epsilon, \delta < 1$, we say that $A$ $(\epsilon,\delta)$-estimates $F_n$ on $n$ examples if, for all distributions $P$ on $X$ and all $f \in F_n(P)$, the probability over $x \in X^n$ that $\bar d_P(A(x, f(x)), f) \ge \epsilon$ is less than $\delta$.
Often, a plausible sequence estimation algorithm is a consistent algorithm, that is, one that chooses a function sequence from $F_n$ that agrees with the target sequence on all of the examples. The following theorem gives a uniform convergence result for classes of function sequences, which implies a sufficient condition for a consistent algorithm to be able to estimate $F_n$.

Theorem 2. For all $0 < \epsilon < 1$, $n \ge 6/\epsilon$, and $f \in F_n$, and for all distributions $P$ on $X$,
\[ P^n\left\{ x \in X^n : \exists g \in F_n \text{ s.t. } g_i(x_i) = f_i(x_i) \text{ for all } i, \text{ and } \bar d_P(f, g) \ge \epsilon \right\} < 2^{-n\epsilon/2 + 1}\, \mathbb{E}\, |F_{n|x}|^2, \]
where the expectation is over $x$ in $X^n$.
Proof. Using Chernoff bounds (see, for example, [8]), we have
\[ P^n\left\{ x : \exists g \in F_n \text{ s.t. } g(x) = f(x),\ \bar d_P(f,g) \ge \epsilon \right\} \le \left(1 - e^{-n\epsilon/8}\right)^{-1} P^{2n}\left\{ (x,y) \in X^{2n} : \exists g \in F_n \text{ s.t. } g(x) = f(x),\ \bar d_P(f,g) \ge \epsilon, \text{ and } \hat d_y(f,g) \ge \epsilon/2 \right\}. \]
Let $U$ be the uniform distribution on the set of permutations on $\{1, \dots, 2n\}$ that swap elements from the first to the second half of the sequence (that is, all permutations $\sigma$ that satisfy $\{\sigma(i), \sigma(i+n)\} = \{i, i+n\}$ for all $i \in \{1, \dots, n\}$). For such a permutation $\sigma$, denote the permuted version of a sequence $(x, y) \in X^{2n}$ by $(x_\sigma, y_\sigma)$. Then, given the assumption on $n$, the probability above is less than
\begin{align*}
& 2\, \mathbb{E}_{(x,y) \sim P^{2n}}\, U\left\{ \sigma : \exists g \in F_n \text{ s.t. } g(x_\sigma) = f(x_\sigma),\ \bar d_P(f,g) \ge \epsilon, \text{ and } \hat d_{y_\sigma}(f,g) \ge \epsilon/2 \right\} \\
& \quad \le 2\, \mathbb{E}\, |F_{n|(x,y)}| \sup_{(v,w) \in F_{n|(x,y)}} U\left\{ \sigma : f(x_\sigma) = v \text{ and } \frac{1}{n}\left|\{ i : f_i(y_{\sigma(i)}) \neq w_i \}\right| \ge \epsilon/2 \right\} \\
& \quad \le 2^{-n\epsilon/2}\, 2\, \mathbb{E}\, |F_{n|(x,y)}|,
\end{align*}
where $F_{n|(x,y)}$ is the set
\[ \left\{ (h_1(x_1), \dots, h_n(x_n), h_1(y_1), \dots, h_n(y_n)) : h \in F_n \right\}. \]
But this set is contained in
\[ \left\{ (h_1(x_1), \dots, h_n(x_n)) : h \in F_n \right\} \times \left\{ (h_1(y_1), \dots, h_n(y_n)) : h \in F_n \right\}, \]
which gives the desired result.

In fact, the proof of Theorem 2 did not make use of the fact that the target sequence $f$ was in the set $F_n$ of legal sequences. This observation by itself is not useful for learning, since we cannot be sure that there will be a function sequence in the class $F_n$ that is consistent with an arbitrary target sequence. However, we can use a similar argument (together with techniques of Haussler [9]) to prove the following more general uniform convergence result, which is useful for learning when the target sequence $f$ is arbitrary.
Theorem 3. For $a, b \ge 0$ and $\nu > 0$, define
\[ d_\nu(a, b) = \frac{|a - b|}{a + b + \nu}. \]
For all $0 < \alpha, \nu < 1$, $n \ge 2/(\alpha^2 \nu)$, all sequences $f$ of measurable functions, and all distributions $P$ on $X$,
\[ P^n\left\{ x : \exists g \in F_n \text{ s.t. } d_\nu\left(\hat d_x(f,g), \bar d_P(f,g)\right) \ge \alpha \right\} < 2\, \mathbb{E}\, |F_{n|x}|^2 \exp\left(-2\alpha^2 \nu n\right), \]
where the expectation is over $x$ in $X^n$.
Proof. Using Chernoff bounds and properties of the metric $d_\nu$ shown in [9], if $n \ge 2/(\alpha^2 \nu)$,
\begin{align*}
& P^n\left\{ x : \exists g \in F_n \text{ s.t. } d_\nu\left(\hat d_x(f,g), \bar d_P(f,g)\right) \ge \alpha \right\} \\
& \quad \le 2 P^{2n}\left\{ (x,y) : \exists g \in F_n \text{ s.t. } d_\nu\left(\hat d_x(f,g), \hat d_y(f,g)\right) \ge \alpha/2 \right\} \\
& \quad = 2\, \mathbb{E}_{(x,y) \sim P^{2n}}\, U\left\{ \sigma : \exists g \in F_n \text{ s.t. } d_\nu\left(\hat d_{x_\sigma}(f,g), \hat d_{y_\sigma}(f,g)\right) \ge \alpha/2 \right\} \\
& \quad \le 2\, \mathbb{E}\, |F_{n|(x,y)}| \sup_{g \in F_n} U\left\{ \sigma : d_\nu\left(\hat d_{x_\sigma}(f,g), \hat d_{y_\sigma}(f,g)\right) \ge \alpha/2 \right\},
\end{align*}
where $U$ is the uniform distribution on the group of swapping permutations on the set $\{1, \dots, 2n\}$, as in the proof of Theorem 2. Now, $d_\nu\left(\hat d_{x_\sigma}(f,g), \hat d_{y_\sigma}(f,g)\right) \ge \alpha/2$ if and only if
\[ \left| \sum_{i=1}^{n} a_i \left( |f_i(x_i) - g_i(x_i)| - |f_i(y_i) - g_i(y_i)| \right) \right| \ge \frac{\alpha}{2} \left( \sum_{i=1}^{n} \left( |f_i(x_i) - g_i(x_i)| + |f_i(y_i) - g_i(y_i)| \right) + \nu n \right), \]
where $a_i = 1$ if $\sigma(i) = i$ and $a_i = -1$ otherwise. By Hoeffding's inequality and some easy manipulations, the probability of this event under the distribution $U$ is no more than $2\exp(-2\alpha^2 \nu n)$.

A slightly weaker version of Theorem 2 follows easily from this result. Choosing appropriate values for $\alpha$ and $\nu$ gives the following corollary.
Corollary 4. Suppose $0 < \epsilon < 1$, $f$ is a sequence of measurable functions defined on $X$, and $P$ is a distribution on $X$.

1. If $n \ge 3/\epsilon$ then
\[ P^n\left\{ x : \exists g \in F_n \text{ s.t. } \hat d_x(f,g) < \tfrac{1}{3} \bar d_P(f,g) - \epsilon \text{ or } \hat d_x(f,g) > 3 \bar d_P(f,g) + \epsilon \right\} < 2\, \mathbb{E}\, |F_{n|x}|^2 \exp(-n\epsilon/2). \]

2. If $n \ge 18/\epsilon^2$ then
\[ P^n\left\{ x : \exists g \in F_n \text{ s.t. } \left| \hat d_x(f,g) - \bar d_P(f,g) \right| \ge \epsilon \right\} < 2\, \mathbb{E}\, |F_{n|x}|^2 \exp(-2n\epsilon^2/9). \]

In both inequalities, the expectation is over $x$ in $X^n$.

Proof. The first part follows from Theorem 3 with $\alpha = 1/2$ and $\nu = 3\epsilon$. For the second part, use $\alpha = \epsilon/3$ and $\nu = 1$, and notice that $\bar d_P(f,g) \le 1$ and $\hat d_x(f,g) \le 1$.
It is not hard to see that, for sets of sequences $F_n$ whose definition does not depend on the underlying distribution $P$, the slow growth of $\mathbb{E}|F_{n|x}|$ for all distributions is also necessary for uniform convergence. More precisely, if for some $a > 1$, $\mathbb{E}|F_{n|x}| = \Omega(a^n)$, then there exist distributions $P$ relative to which the empirical weighted differences between sequences in $F_n$ do not converge uniformly to $\bar d_P$. However, the following example shows that there are distributions $P$ and distribution-dependent families of sequences $F_n$ which exhibit uniform convergence as above, despite the exponential growth rate of $\mathbb{E}|F_{n|x}|$. Let $F_n$ be the set of sequences $f$ that satisfy $P(f_i(x) \neq f_{i+1}(x)) = 0$ for all $i$. Clearly, if the VC-dimension of $F$ is finite, the components of any sequence in $F_n$ are all equal except on a set of measure zero, and so uniform convergence follows from standard results. But if, for instance, $X = [0,1]$, $P$ is the uniform distribution on $X$, and $F$ is the set of indicator functions of singletons $\{x\}$, then $\mathbb{E}|F_{n|x}| = 2^n$.

Clearly, Theorem 2 implies that any finite set $F_n$ can be estimated. As another example, consider a function class $F$ and suppose that $G$ is a directed graph with nodes in $F$. For $n \in \mathbb{N}$, let $F_n$ be the set of walks of length $n$ on $G$. If $F$ has finite VC-dimension and for all nodes $f$ in $F$ the number of walks of length $n$ on $G$ starting at $f$ grows polynomially in $n$, then $F_n$ can be $(\epsilon,\delta)$-estimated for sufficiently large $n$. By restricting the set of legal sequences to sequences defined by slow walks on the graph, the restriction on the underlying graph may be relaxed.

Theorem 5. Let $G = (F, E)$ be a directed graph, where $F$ is a function class. For $n \in \mathbb{N}$ and $0 < \gamma < 1/2$, let $F_n(G, \gamma)$ be the set of sequences $f = (f_1, \dots, f_n)$ whose $d_G$ variation is bounded by $\gamma$, i.e., $F_n(G, \gamma) = \{f \in F^n : V_G(f) \le \gamma\}$.
1. Let $p_k$ denote the number of paths in $G$ of length no more than $k$. Suppose that, for some $\beta > 0$, $p_k < 2^{\beta k}$ for all $k$. Then for any $0 < \delta < 1$, if
\[ \epsilon > 8\gamma\left(\beta + \log\frac{4}{\gamma}\right) \quad \text{and} \quad n \ge \frac{4}{\epsilon} \log\frac{2}{\delta}, \]
it follows that any consistent algorithm will $(\epsilon,\delta)$-estimate $F_n(G,\gamma)$ (on $n$ examples).

2. Let $O$ be the outdegree of $G$, and let $d$ be the VC-dimension of $F$. Then for $0 < \epsilon, \delta < 1$, if
\[ \epsilon > 8\gamma \log\frac{4O}{\gamma} \quad \text{and} \quad n > \frac{8}{\epsilon}\left(2d \log\frac{16e}{\epsilon} + \log\frac{2}{\delta}\right), \]
then any consistent algorithm will $(\epsilon,\delta)$-estimate $F_n(G,\gamma)$ from $n$ examples.
Proof. 1. The number of paths in $G$ of length $\lfloor \gamma n \rfloor$ is no more than $2^{\beta \gamma n}$. So
\[ |F_n| \le \binom{n}{\lfloor \gamma n \rfloor} 2^{\beta \gamma n} \le \left(\frac{4}{\gamma}\right)^{\gamma n} 2^{\beta \gamma n}. \]
It follows from Theorem 2 that $F_n(G,\gamma)$ can be $(\epsilon,\delta)$-estimated by any consistent algorithm whenever
\[ \frac{n\epsilon}{2} - 2\gamma n\left(\log\frac{4}{\gamma} + \beta\right) > \log\frac{2}{\delta}, \]
for which the conditions of the theorem suffice.

2. Since the outdegree of $G$ is bounded by $O$, the number of function sequences of length $k$ starting from a given function is no more than $O^k$. It follows that, for any $x$ in $X^n$ we have
\[ |F_{n|x}| \le \binom{n}{\lfloor \gamma n \rfloor} O^{\lfloor \gamma n \rfloor} \left(\frac{en}{d}\right)^d \le \left(\frac{4O}{\gamma}\right)^{\gamma n} \left(\frac{en}{d}\right)^d, \]
where we have used the fact that $n \ge 1/\gamma$ (if $n < 1/\gamma$, the result follows trivially from results for learning (constant) functions from $F$; see, for example, [13, 5]). Theorem 2 implies that $F_n(G,\gamma)$ can be $(\epsilon,\delta)$-estimated by any consistent algorithm whenever
\[ \frac{n\epsilon}{2} - 2\gamma n \log\frac{4O}{\gamma} - 2d \log\frac{en}{d} > \log\frac{2}{\delta}. \]
So if $\epsilon > 8\gamma \log(4O/\gamma)$, it suffices that
\[ n > \frac{8}{\epsilon}\left(2d \log\frac{16e}{\epsilon} + \log\frac{2}{\delta}\right). \]
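Part 1 of Theorem 5 only asks for a bound $p_k < 2^{\beta k}$ on the number of walks of length at most $k$. For a finite graph this quantity can be computed directly; the following Python sketch (with an assumed adjacency-list representation; the name count_walks is ours) counts walks by dynamic programming.

```python
def count_walks(adj, k):
    """Number of directed walks in G of length at most k.
    adj[u] lists the possible next states of u, as in the 'possible next
    state' relation of Section 2; a walk of length j traverses j edges."""
    ending_at = {u: 1 for u in adj}          # walks of length 0
    total = sum(ending_at.values())
    for _ in range(k):
        nxt = {u: 0 for u in adj}
        for u, successors in adj.items():
            for v in successors:
                nxt[v] += ending_at[u]       # extend each walk by one edge
        ending_at = nxt
        total += sum(ending_at.values())
    return total

# A directed 4-cycle has only 4*(k+1) walks of length <= k, so the
# condition p_k < 2^{beta k} holds for any beta > 0 once k is large enough.
cycle = {0: [1], 1: [2], 2: [3], 3: [0]}
print([count_walks(cycle, k) for k in range(5)])  # [4, 8, 12, 16, 20]
```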
The first part of the next theorem gives a similar result that applies when the functions can change arbitrarily over some class $F$, but only occasionally. The lower bound on $\epsilon$ (the guaranteed proximity of the estimate to the target sequence) now depends on the VC-dimension of $F$, whereas when graph constraints apply, the out-degree of the graph imposed a fixed bound for all (finite VC-dimension) classes $F$. The second part of the theorem shows that, if there is a slowly changing function sequence that is close to the target $f$ with respect to $\bar d_P$, an algorithm that chooses any slowly changing sequence with minimal error on the examples will find a good approximation to the target.

Theorem 6. For a set $F$ of $\{0,1\}$-valued functions defined on $X$ and $\gamma > 0$, define the set of $\gamma$-frequently-switching function sequences of $F$ as
\[ F_n(D, \gamma) = \{ f \in F^n : D(f) \le \gamma \}, \]
where $D(f)$ is the average variation of a sequence $f$, as defined by Equation (2) above. Suppose that $\mathrm{VCdim}(F) = d \ge 2$.

1. There are constants $c_1$ and $c_2$ such that, for $0 < \epsilon, \delta, \gamma < 1$, if
\[ \epsilon > c_1 \gamma d \log\frac{1}{\gamma} \quad \text{(for which } \gamma < c\,\epsilon/(d \log(d/\epsilon)) \text{ suffices, for a suitable constant } c\text{)} \]
and
\[ n \ge \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma} + \log\frac{1}{\delta}\right), \]
then any consistent algorithm will $(\epsilon,\delta)$-estimate $F_n(D,\gamma)$ (on $n$ examples).
2. There are constants $c_1$ and $c_2$ such that, for $0 < \epsilon, \delta, \gamma < 1$ and any sequence $f$ of measurable $\{0,1\}$-valued functions, if
\[ \epsilon > c_1 \gamma d \log\frac{1}{\gamma} \quad \text{(for which } \gamma < c\,\epsilon/(d \log(d/\epsilon)) \text{ suffices)} \]
and
\[ n \ge \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma} + \log\frac{1}{\delta}\right), \]
then
\[ P^n\Big\{ x : \exists g \in F_n(D,\gamma) \text{ s.t. } \big(\hat d_x(f,g) < \epsilon/6 \text{ and } \bar d_P(f,g) \ge \epsilon\big) \text{ or } \big(\hat d_x(f,g) > 3\epsilon \text{ and } \bar d_P(f,g) < \epsilon\big) \Big\} \le \delta. \]
Proof. In order to apply Theorem 2 and Corollary 4, we begin by establishing a bound on $\mathbb{E}|F_{n|x}|$. Fix $x \in X^n$. Notice that it does not suffice to argue that $|F_{n|x}| \le \binom{n}{\gamma n} \Pi_F(n)^{\gamma n}$, since this grows too quickly with $n$. Suppose for now that $\gamma = 1/q$ for some positive integer $q$, and that $n$ is such that $k = \gamma n$ is an integer. Clearly,
\[ |F_{n|x}| \le \sum \prod_{j=0}^{k} \Pi_F(i_j), \]
where the sum is over all $1 \le i_j \le n$ satisfying
\[ \sum_{j=0}^{k} i_j = n. \]
Sauer's lemma (see, for example, [13]) implies $\Pi_F(i) \le 2 i^d$. Now, for all legal choices of the $i_j$,
\[ \prod_{j=0}^{k} 2 i_j^{d} \le \prod_{j=0}^{k} 2 (n/k)^{d}; \]
that is, the choice of the $i_j$ that maximizes this product is $i_j = n/k$. To see this, suppose that we have a set $\{i_0, \dots, i_k\}$ with $i_0 < i_1$. Consider what happens when we change these $i_j$'s to
\[ \{i_0', \dots, i_k'\} = \{i_0 + 1, i_1 - 1, i_2, \dots, i_k\}. \]
In this case,
\[ \frac{\prod_{j=0}^{k} 2 i_j^{d}}{\prod_{j=0}^{k} 2 i_j'^{d}} = \left(\frac{i_0 i_1}{(i_0+1)(i_1-1)}\right)^{d}, \]
and this is no more than 1, since $i_1 \ge i_0 + 1$. It follows that
\[ |F_{n|x}| \le \binom{n-1}{k} 2^{k} (n/k)^{dk}. \]
For the first part of the theorem, if $k = \gamma n$ and $n/k$ are both positive integers, Theorem 2 implies that any consistent algorithm can $(\epsilon,\delta)$-estimate provided that $n \ge 6/\epsilon$ and
\[ 2^{-n\epsilon/2+1} \binom{n-1}{k}^{2} (n/k)^{2kd}\, 2^{2k} < \delta. \]
So it suffices if
\[ \frac{n\epsilon}{2} \ln 2 - 2k \ln\frac{2(n-1)}{k} - 2kd \ln\frac{n}{k} - 2k \ln 2 \ge \ln\frac{2}{\delta}, \]
which is implied by
\[ \frac{n\epsilon}{2} \ln 2 - 2k(d+2) \ln\frac{2n}{k} \ge \ln\frac{2}{\delta}. \]
Recalling that $k = \gamma n$, if
\[ \epsilon > \frac{8}{\ln 2}\, \gamma (d+2) \ln\frac{2}{\gamma}, \]
then $n = (c/\epsilon) \log(1/\delta)$ examples will suffice, for some constant $c$. The case where $\gamma n$ and $n/\lfloor \gamma n \rfloor$ are not integers introduces no more than an extra constant factor, provided that $\lfloor \gamma n \rfloor > 0$. Otherwise, the result follows from standard results (see, for example, [13, 5]).
For the second part of the theorem, suppose again that $k = \gamma n$ and $n/k$ are both positive integers. If $n \ge 18/\epsilon$, Corollary 4 implies that the desired probability is less than
\[ 2 \binom{n-1}{k}^{2} (n/k)^{2dk}\, 2^{2k} \exp(-n\epsilon/4), \]
which is no more than $\delta$ when
\[ \frac{n\epsilon}{4} - 2k \ln\frac{2(n-1)}{k} - 2kd \ln\frac{n}{k} - 2k \ln 2 \ge \ln\frac{2}{\delta}. \]
This is implied by
\[ \frac{n\epsilon}{4} - 2k(d+2) \ln(2n/k) \ge \ln\frac{2}{\delta}. \]
But $k = \gamma n$, so it suffices if
\[ 2\gamma(d+2) \ln\frac{2}{\gamma} < \frac{\epsilon}{8} \quad \text{and} \quad n \ge \frac{8}{\epsilon} \ln\frac{2}{\delta}. \]
The case where $\gamma n$ and $n/\lfloor \gamma n \rfloor$ are not integers follows as above.
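For a finite class $F$, a consistent algorithm for $F_n(D,\gamma)$, as Theorem 6 requires, can be implemented by dynamic programming over (time, current function) pairs. The sketch below is ours, not the paper's; it returns a sequence that labels every example correctly using the fewest switches, or None when no sequence in $F_n(D,\gamma)$ is consistent.

```python
import math

def consistent_switching_sequence(F, examples, gamma):
    """Dynamic program: find (f_1, ..., f_n) in F^n consistent with the
    labelled examples [(x_1, y_1), ...] and with D(f) <= gamma, i.e. at
    most gamma*(n-1) switches.  O(n * |F|) time for a finite list F."""
    n = len(examples)
    budget = math.floor(gamma * (n - 1))
    INF = float("inf")
    x0, y0 = examples[0]
    # cost[f] = fewest switches over consistent prefixes ending in f
    cost = {f: (0 if f(x0) == y0 else INF) for f in F}
    parents = [{f: None for f in F}]
    for x, y in examples[1:]:
        best_prev = min(F, key=lambda g: cost[g])  # cheapest function to leave
        new_cost, parent = {}, {}
        for f in F:
            if f(x) != y:                          # f inconsistent at this step
                new_cost[f], parent[f] = INF, None
                continue
            stay, switch = cost[f], cost[best_prev] + 1
            if stay <= switch:
                new_cost[f], parent[f] = stay, f
            else:
                new_cost[f], parent[f] = switch, best_prev
        parents.append(parent)
        cost = new_cost
    last = min(F, key=lambda g: cost[g])
    if cost[last] > budget:
        return None                                # no legal consistent sequence
    seq = [last]
    for parent in reversed(parents[1:]):           # walk the back-pointers
        seq.append(parent[seq[-1]])
    return list(reversed(seq))
```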
Notice that if the class Fn of legal sequences depends on the distribution P , then it is not clear how to construct a consistent algorithm or an algorithm that minimizes error over Fn , since the algorithm does not have access to P . In the following result, we avoid this problem by relating such a sequence class to one that does not depend on the distribution.
Theorem 7. Suppose $F$ is a set of $\{0,1\}$-valued functions defined on $X$ with $\mathrm{VCdim}(F) = d$, $P$ is a probability measure on $X$, and $\gamma > 0$. Define the set of $(P,\gamma)$-slowly changing sequences in $F^n$ as
\[ F_n(P, \gamma) = \{ f \in F^n : V_P(f) \le \gamma \}. \]
There are constants $c_1$ and $c_2$ such that, for any $0 < \epsilon, \delta < 1$, any
\[ \gamma < c_1 \epsilon^2 / (d \log(d/\epsilon)), \]
and any
\[ n \ge \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma} + \log\frac{1}{\delta}\right), \]
there exists an algorithm that $(\epsilon,\delta)$-estimates any sequence in $F_n(P,\gamma)$.
Proof. We consider an estimation algorithm that works with the class $F_n(D, \gamma')$ of slowly switching function sequences from $F^n$, for an appropriate value of $\gamma'$. The algorithm chooses a sequence from that class with minimal error on the training examples. We show that this class (with $\gamma' = 18\gamma/\epsilon$) approximates $F_n(P,\gamma)$, and that this implies the algorithm is successful.

Consider a target sequence $f \in F_n(P,\gamma)$. We will construct a piecewise constant sequence $g$ that approximates $f$ with respect to $\bar d_P$, and does not have too many switches. Let $g_1 = f_1$, and let $i_1 > 1$ denote the first index such that $d_P(f_1, f_{i_1}) \ge \epsilon/18$. Set $g_i = g_1$ for $1 \le i \le i_1 - 1$. Let $g_{i_1} = f_{i_1}$, and let $i_2 > i_1$ be the smallest index such that $d_P(f_{i_1}, f_{i_2}) \ge \epsilon/18$. Then set $g_i = f_{i_1}$ for $i_1 \le i \le i_2 - 1$. Continue in this manner to form the piecewise constant sequence $g$. Note that, by the construction of $g$ and the triangle inequality, $V_P(f) \ge (\epsilon/18) D(g)$. Therefore, if $f \in F_n(P,\gamma)$ then $D(g) \le 18\gamma/\epsilon$. Also note that, by construction, $\bar d_P(f, g) < \epsilon/18$.

Thus, for any target sequence $f \in F_n(P,\gamma)$, there is a sequence $g$ in the class $F_n(D, 18\gamma/\epsilon)$ with $\bar d_P(g, f) \le \epsilon/18$. Let $\gamma' = 18\gamma/\epsilon$. Theorem 6 implies that, if
\[ \epsilon > c_1 \gamma' d \log\frac{1}{\gamma'} \quad \text{and} \quad n \ge \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma'} + \log\frac{1}{\delta}\right), \]
then with probability $1 - \delta$ we have both $\hat d_x(g, f) \le \epsilon/6$ and, for any $h \in F_n(D, \gamma')$ with $\bar d_P(h, f) \ge \epsilon$, $\hat d_x(f, h) > \epsilon/6$. That is, with probability $1 - \delta$, the algorithm returns a function sequence $h$ that satisfies $\bar d_P(f, h) < \epsilon$. For some positive constants $c_3$ and $c_4$, the condition $\epsilon > c_1 \gamma' d \log(1/\gamma')$ is implied by the condition $\epsilon^2 > c_3 \gamma d \log(\epsilon/\gamma)$, which is implied by $\gamma < c_4 \epsilon^2/(d \log(d/\epsilon))$.

Note that, in contrast with Theorem 6, one cannot in general estimate a sequence in $F_n(P,\gamma)$ using just any consistent algorithm. There are examples for which, for every $n$, there are consistent sequences $h$ with $V_P(h) = 0$ but $\bar d_P(h, f) = 1$. For example, let $X = [0,1]$ and $P$ be the uniform distribution on $[0,1]$. Let $F$ contain the indicator functions of the concept $[0,1]$ and of all singletons $\{x\}$ for $x \in [0,1]$. Note that $\mathrm{VCdim}(F) = 2$. Now, for any $n$ consider the sequence $f = (f_1, \dots, f_n)$ with $f_i = \mathbf{1}_{[0,1]}$ for $i = 1, \dots, n$. For every $x_1, \dots, x_n$, the sequence $f$ labels each $x_i$ as 1. The sequence $h = (h_1, \dots, h_n)$ defined by $h_i = \mathbf{1}_{\{x_i\}}$ also labels each $x_i$ as 1. Thus, if the true sequence of concepts is $f$, then for every $x_1, \dots, x_n$ the corresponding $h$ is consistent and yet $V_P(h) = 0$ while $\bar d_P(h, f) = 1$. Clearly, Theorem 2 implies that $\mathbb{E}|F_n(P,\gamma)_{|x}|$ grows exponentially with $n$ in this case. In fact, it is easy to see directly that $|F_n(P,\gamma)_{|x}| = 2^n$.
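The covering step in the proof of Theorem 7 is constructive, and easy to state as code. The sketch below uses our own names; $d_P$ is passed in as a callable, although in the actual learning problem $P$ is unknown and the construction serves only as an existence argument.

```python
def piecewise_constant_cover(f_seq, d_P, eps):
    """Greedy construction from the proof of Theorem 7: a piecewise constant
    sequence g with d_P(f_i, g_i) < eps/18 at every index.  By the triangle
    inequality, g has at most 18 * V_P(f) * (n - 1) / eps switches."""
    anchor = f_seq[0]                      # g starts at f_1
    g_seq = []
    for f in f_seq:
        if d_P(anchor, f) >= eps / 18.0:   # f has drifted eps/18 away: switch g
            anchor = f
        g_seq.append(anchor)
    return g_seq
```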
4 Prediction

In this section, we consider the problem of online prediction of function sequences from labelled examples.

Definition 8. Suppose $M$ is a positive integer, $P$ is a distribution on $X$, and $F_M$ is a set of legal function sequences of length $M$ from a set $F$ of functions. (In some cases, $F_M$ depends on $P$.) A prediction problem for $F_M$ proceeds as follows. There is an unknown target sequence $f = (f_1, \dots, f_M) \in F_M$. At each time $1 \le t \le M$, a prediction strategy receives an example $(x_t, f_t(x_t))$, where $x_t$ is chosen according to $P$. The strategy then hypothesizes a function $h_{t+1}$. For $0 < \epsilon, \delta < 1$, we say that a strategy can $(\epsilon,\delta)$-predict $F_M$ online from $M$ examples if, for all probability distributions $P$ on $X$ and all legal sequences $f$ in $F_M$, the probability over $x$ in $X^M$ that $\bar d_P(h, f) \ge \epsilon$ is less than $\delta$, where $h = (h_1, \dots, h_M)$.

Clearly, if there is a strategy that can $(\epsilon,\delta)$-predict a class $F_M$ from $M$ examples, then it can be used to construct an algorithm that can $(\epsilon,\delta)$-estimate $F_M$ in the sense of Definition 1. We will construct prediction strategies that are based on the estimation results of the previous section. Theorem 10 below shows that these strategies predict well whenever the underlying estimation strategy works well. In fact, this result applies more generally to the problem of online prediction of elements of an arbitrary pseudometric space.

Definition 9. Let $(F, d)$ be a pseudometric space. A prediction problem on this space is defined as follows. For $M \in \mathbb{N}$, there is a sequence $f$ in $F^M$. At time $t-1$, a prediction strategy receives partial information $G_{t-1}(f_{t-1})$ about $f_{t-1}$. The strategy then predicts $h_t$. At time $t$, it receives partial information about $f_t$,
and so on. The strategy aims to ensure that
\[ \bar d(h, f) \stackrel{\mathrm{def}}{=} \frac{1}{M} \sum_{i=1}^{M} d(f_i, h_i) \]
Ltail(g) = max fi 2 f1; : : : ; kg : gk+1?i = gk+2?i = = gk g ; the length of g’s “constant tail”. We say that H is Vd -conservative if Vd (f^tk ) Vd (ftk ). 1. If H is D-conservative then, If k > 1=(D(f) + 1=M ) then d(h; f) k + 2kD(f) log 1 + Vd (f): M D (f ) 18
If k 1=(D(f) + 1=M ) then
d(h; f) Mk + + 2kD(f) (log(k) ? 1) + Vd(f):
2. If H is Vd -conservative, then
d(h; f) k=M + minf + 2kVd (f); kg + Vd (f): Furthermore, the bounds are ‘local’ in the sense that, if H only satisfies n
1
o
(f^tk ; ftk ) > < ; k + 1 t M : d M ?k ; f) increase only by an additive term of . then the upper bounds on d(h Proof. 1. The proof proceeds in two stages. We start by producing an upper bound on the instantaneous error of a prediction strategy at each time point t. We then apply this bound to derive the desired upper bound for the cumulative error over the full prediction process. Since Ltail(f^tk ) Ltail(ftk ), for all t ? Ltail(ftk ) i t ? 1, we have d(fi; f^t;i) = d(ft?1 ; f^t;t?1). By the triangle inequality, the error of the hypothesis ht = f^t;t?1 is
\[ d(h_t, f_t) \le d(\hat f_{t,t-1}, f_{t-1}) + d(f_{t-1}, f_t). \]
Note that the first term depends upon the strategy $H$, whereas the second term is an intrinsic property of the target sequence $f$, and it sums up to $M V_d(f)$. Regarding the first term, we have:
\[ \epsilon \ge \bar d(f_t^k, \hat f_t^k) = \frac{1}{k} \sum_{i=t-k}^{t-1} d(\hat f_{t,i}, f_i) \ge \frac{1}{k} \sum_{i=t-\mathrm{Ltail}(f_t^k)}^{t-1} d(\hat f_{t,i}, f_i) = \frac{1}{k}\, \mathrm{Ltail}(f_t^k)\, d(\hat f_{t,t-1}, f_{t-1}). \]
Applying the assumption that the hypothesis sequence $\hat f_t^k$ is $\epsilon$-close to the target sequence $f_t^k$, we conclude that $d(\hat f_{t,t-1}, f_{t-1}) \le k\epsilon/\mathrm{Ltail}(f_t^k)$. We now turn to the
cumulative error over the full sequence $f$. We have
\[ M \bar d(f, h) = \sum_{t=1}^{M} d(\hat f_{t,t-1}, f_t) \le k + k\epsilon \sum_{t=k+1}^{M} 1/\mathrm{Ltail}(f_t^k) + M V_d(f). \]
Denote $\ell_t = \mathrm{Ltail}(f_t^k)$. Notice that $\ell_t$ takes values in the range 1 (when $f_{t-2} \neq f_{t-1}$) up to $k$ (when $f_{t-k} = \dots = f_{t-1}$). So from a point of change in $f$, $1/\ell_t$ starts at 1 and decays as $1/i$ until it reaches $1/k$, where it remains until the next change in $f$. A straightforward perturbation argument implies that, among the sequences $f$ with a fixed number of function switches (i.e., a fixed value of $D(f)$), $\sum_{t=k+1}^{M} 1/\ell_t$ assumes its maximal value when these switches are equally spaced along the sequence $f$. To upper bound the cumulative error for sequences of a fixed length $M$ and a fixed $D(f)$, we therefore consider sequences composed of $MD(f) + 1$ blocks, each consisting of $1/(D(f) + 1/M)$ many identical functions. Let us consider two cases.

Case 1: $k > 1/(D(f) + 1/M)$. On each subsequence of $1/(D(f) + 1/M)$ identical $f_t$'s, the sum of the appropriate terms $1/\ell_t$ is the harmonic series,
\[ \sum_{i=1}^{1/(D(f)+1/M)} 1/i \le \log\frac{1}{D(f) + 1/M}. \]
As there are $MD(f) + 1$ many such subsequences in $f$, we get
\[ \bar d(f, h) \le \frac{k}{M} + k\epsilon \left(D(f) + \frac{1}{M}\right) \log\frac{1}{D(f) + 1/M} + V_d(f) \le \frac{k}{M} + 2k\epsilon D(f) \log\frac{1}{D(f)} + V_d(f). \]

Case 2: $k \le 1/(D(f) + 1/M)$. In this case, on each subsequence of identical functions $f_t$, the corresponding sequence of $1/\ell_t$'s consists of the harmonic sequence $1, 1/2, \dots, 1/k$ followed by a sequence of $1/(D(f) + 1/M) - k$ values equal to $1/k$. A straightforward calculation shows that in this case we get
\[ \bar d(f, h) \le \frac{k}{M} + \epsilon + 2k\epsilon D(f) (\log(k) - 1) + V_d(f). \]
2. Just as in the first part of the proof, at each stage $t$,
\[ d(h_t, f_t) = d(\hat f_{t,t-1}, f_t) \le d(\hat f_{t,t-1}, f_{t-1}) + d(f_{t-1}, f_t). \qquad (3) \]
The second term sums up to $M V_d(f)$, so our task is to upper bound the sum $\sum_{t=1}^{M} d(\hat f_{t,t-1}, f_{t-1})$. Since $H$ is $V_d$-conservative, for every $k+1 \le t \le M$, $V_d(\hat f_t^k) \le V_d(f_t^k)$. This, together with the triangle inequality, implies that, for every $t-k \le i \le t-1$,
\begin{align*}
d(\hat f_{t,t-1}, f_{t-1}) &\le d(\hat f_{t,i}, f_i) + \sum_{j=i}^{t-2} d(f_j, f_{j+1}) + \sum_{j=i}^{t-2} d(\hat f_{t,j}, \hat f_{t,j+1}) \\
&\le d(\hat f_{t,i}, f_i) + k V_d(f_t^k) + k V_d(\hat f_t^k) \\
&\le d(\hat f_{t,i}, f_i) + 2k V_d(f_t^k). \qquad (4)
\end{align*}
Since, by assumption,
\[ \frac{1}{k} \sum_{i=t-k}^{t-1} d(\hat f_{t,i}, f_i) \le \epsilon, \]
there must be a $t-k \le i \le t-1$ with $d(\hat f_{t,i}, f_i) \le \epsilon$. Applying inequality (4), we conclude that
\[ d(\hat f_{t,t-1}, f_{t-1}) \le \epsilon + 2k V_d(f_t^k). \]
Averaging inequality (3) over all $t$, one gets
\[ \bar d(h, f) \le k/M + \epsilon + 2k V_d(f) + V_d(f). \]
But the assumption $\bar d(\hat f_t^k, f_t^k) \le \epsilon$ also implies that $d(\hat f_{t,t-1}, f_{t-1}) \le k\epsilon$. Averaging inequality (3) using this last inequality amounts to
\[ \bar d(h, f) \le k/M + k\epsilon + V_d(f). \]
We can now combine this theorem with the estimation theorems derived in Section 3 to give a result on the prediction of changing concepts.

Definition 11. For a set $F$ of $\{0,1\}$-valued functions defined on $X$, and numbers $k, M \in \mathbb{N}$ and $0 < \gamma < 1$, define the set of $k$-local-$\gamma$-frequently-switching function $M$-sequences of $F$ as
\[ F_M^k(D, \gamma) = \left\{ f \in F^M : \text{for } k < t \le M,\ D(f_t^k) \le \gamma \right\}. \]
For a probability measure $P$, define the set of $k$-local-$(P,\gamma)$-slowly-changing $M$-sequences of $F$ as
\[ F_M^k(P, \gamma) = \left\{ f \in F^M : \text{for } k < t \le M,\ V_P(f_t^k) \le \gamma \right\}. \]
Theorem 12. Suppose that $F$ is a class with VC-dimension $d$, and $P$ is a probability distribution on $X$.

1. There are constants $c_1$ and $c_2$ such that, for $0 < \epsilon, \delta, \gamma < 1$, if
\[ \epsilon > c_1 \gamma d \log\frac{1}{\gamma} \log\frac{1}{\delta}, \quad k = \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma} + \log\frac{1}{\delta}\right), \quad \text{and} \quad M \ge k/\epsilon, \]
then there is a $k$-Markovian prediction strategy that can $(\epsilon,\delta)$-predict $F_M^k(D,\gamma)$ online from $M$ examples.

2. There are constants $c_1$ and $c_2$ such that, for $0 < \epsilon, \delta, \gamma < 1$, if
\[ \epsilon^2 > c_1 \gamma d \log^2\frac{d}{\epsilon\delta}, \quad k = \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma} + \log\frac{1}{\delta}\right), \quad \text{and} \quad M \ge k/\epsilon, \]
then there is a $k$-Markovian prediction strategy that can $(\epsilon,\delta)$-predict $F_M^k(P,\gamma)$ online from $M$ examples.

Proof. 1. For all $t \in \{k+1, \dots, M\}$, $f_t^k \in F_k(D,\gamma)$. At time $t$, our estimation algorithm chooses a consistent sequence $\hat f_t^k$. For any $t$, let $M_t$ represent the set of training samples for which $\bar d_P(\hat f_t^k, f_t^k) \ge \epsilon$. Theorem 6 implies that, if $\epsilon > c_1 \gamma d \log(1/\gamma)$ and
\[ k \ge \frac{c_2}{\epsilon}\left(d \log\frac{1}{\gamma} + \log\frac{1}{\delta\epsilon}\right), \]
then for all $t$ we have $\Pr(M_t) < \delta\epsilon$. It follows that
\[ \mathbb{E}\left( \frac{1}{M-k} \sum_{t=k+1}^{M} \mathbf{1}_{M_t} \right) < \delta\epsilon. \]
Markov's inequality implies that, with probability at least $1 - \delta$,
\[ \left| \left\{ t \in \{k+1, \dots, M\} : \bar d_P(\hat f_t^k, f_t^k) \ge \epsilon \right\} \right| < \epsilon (M - k). \]
If our consistent estimation algorithm also maximizes $\mathrm{Ltail}(\hat f_t^k)$ for its estimate $\hat f_t^k$ at each time $t$, it is also $D$-conservative. Applying Theorem 10 (taking the maximum of both bounds), if $k < 1/(2\gamma \log(1/\gamma))$ and $M \ge k/\epsilon$, we have $\bar d_P(h, f) \le 5\epsilon$. For the first inequality, it suffices if
\[ \epsilon > c_3 \gamma d \log\frac{1}{\gamma} \log\frac{1}{\delta}. \]
2. The algorithm of Theorem 7 can easily be modified: instead of choosing the $h$ that is closest to the target and does not have too many switches, we choose the one that maximizes the length of its 'constant tail' but is sufficiently close to the target and does not have too many switches. The proof of Theorem 7 shows that, with probability at least $1 - \delta$, there is a function sequence $g$ in $F_n(D, 18\gamma/\epsilon)$ with $\hat d_x(g, f) \le \epsilon/6$. So if the algorithm returns a function sequence $h$ that has the longest 'constant tail' of those sequences in $F_n(D, 18\gamma/\epsilon)$ satisfying $\hat d_x(h, f) \le \epsilon/6$, it will satisfy the conditions of the first part of Theorem 10. We can then use an argument identical to the proof above of the first part of the theorem, but with $\gamma$ replaced by $18\gamma/\epsilon$. In this case, we get the condition
\[ \epsilon > c_3 \frac{\gamma}{\epsilon}\, d \log\frac{\epsilon}{\gamma} \log\frac{1}{\delta}, \]
which is implied by
\[ \epsilon^2 > c_4 \gamma d \log^2\frac{d}{\epsilon\delta}, \]
for some universal constant $c_4$.
Acknowledgements This research was partially supported by the Australian Research Council, and by a DIST Bilateral Science and Technology Collaboration Program Travel Grant. Thanks to some anonymous reviewers for helpful comments on an earlier version of this paper.
References

[1] P. L. Bartlett. Learning with a slowly changing distribution. In Proceedings of the 1992 Workshop on Computational Learning Theory, pages 243–252, 1992.

[2] P. L. Bartlett and D. P. Helmbold. Learning changing problems. Technical report, Australian National University, 1996.

[3] R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2):101–123, 1997.

[4] A. Blum and P. Chalasani. Learning switching concepts. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 231–242. ACM Press, New York, NY, 1992.

[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929–965, 1989.

[6] A. J. Connolly, J. F. Chicharo, and P. Wilbers. Temperature modelling and prediction of steel strip in a HSM. In Control 92 Conference Proceedings. Institute of Engineers Australia, 1992.

[7] Y. Freund and Y. Mansour. Learning under persistent drift. In S. Ben-David, editor, Computational Learning Theory: Third European Conference, EuroCOLT'97, pages 109–118. Springer, 1997.

[8] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Inform. Proc. Lett., 33:305–308, 1990.

[9] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. Comput., 100(1):78–150, September 1992.

[10] D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14:27–, 1994.

[11] P. M. Long. The complexity of learning according to two models of a drifting environment. To be presented at the Eleventh Annual Conference on Computational Learning Theory, 1998.
[12] N. Merhav and M. Feder. Universal schemes for sequential decision from individual data sequences. IEEE Transactions on Information Theory, 39(4):1280–1292, 1993. [13] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.