Machine Learning, 37, 337–354 (1999). © 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.
The Complexity of Learning According to Two Models of a Drifting Environment

PHILIP M. LONG    [email protected]
Department of Computer Science, National University of Singapore, Singapore 119260, Republic of Singapore

Editors: Jonathan Baxter and Nicolò Cesa-Bianchi

Abstract. We show that a $c\epsilon^3/\mathrm{VCdim}(F)$ bound on the rate of drift of the distribution generating the examples is sufficient for agnostic learning to relative accuracy $\epsilon$, where $c > 0$ is a constant; this matches a known necessary condition to within a constant factor. We establish a $c\epsilon^2/\mathrm{VCdim}(F)$ sufficient condition for the realizable case, also matching a known necessary condition to within a constant factor. We provide a relatively simple proof of a bound of $O(\frac{1}{\epsilon^2}(\mathrm{VCdim}(F) + \log\frac{1}{\delta}))$ on the sample complexity of agnostic learning in a fixed environment.
Keywords: computational learning theory, concept drift, context-sensitive learning, prediction, PAC learning, agnostic learning, uniform convergence, VC theory
1. Introduction
Learning often takes place in a gradually changing environment. This phenomenon has been studied theoretically by assuming that the function to be learned, the distribution generating the examples, or both, change at most a certain amount between examples (see Helmbold & Long, 1994; Bartlett, 1992; Bartlett & Helmbold, 1995; Barve & Long, 1997).1 In this paper, we study the problem of learning functions from some set $X$ to $\{0, 1\}$ (“concepts”) using two models of a drifting environment.

In the first (Bartlett, 1992), it is assumed that examples $(x_1, y_1), (x_2, y_2), \ldots$ are generated independently at random from a sequence of joint distributions over $X \times \{0, 1\}$, and the only constraint is that consecutive pairs of distributions have small total variation distance. If this distance is always at most $\Delta$, then the sequence of distributions is called $\Delta$-gradual. For each $t$, the learning algorithm must output a hypothesis $h_t$ using only the first $t - 1$ examples. For some concept class $F$ and drift rate $\Delta$, if, for any sequence of $\Delta$-gradual joint distributions, for large enough $t$, the probability that $h_t(x_t) \ne y_t$ is at most $\epsilon$ more than the minimum such probability from among $f \in F$, then we say that $F$ is $(\epsilon, \Delta)$-trackable in the agnostic case.

The second model of learning in a drifting environment (Helmbold & Long, 1994; Bartlett, 1992; Bartlett & Helmbold, 1995) is obtained from the above by adding the requirement that each distribution $P_t$ has some $f_t \in F$ such that the probability that the pair $(x_t, y_t)$ drawn according to $P_t$ has $f_t(x_t) = y_t$ is 1. Here, if, for large enough $t$, the probability that $h_t(x_t) \ne y_t$ is at most $\epsilon$, we say that $F$ is $(\epsilon, \Delta)$-trackable in the realizable case.
In this paper, we show that there is a constant $c > 0$ such that a $c\epsilon^3/\mathrm{VCdim}(F)$ bound on $\Delta$ is sufficient for $F$ to be $(\epsilon, \Delta)$-trackable in the agnostic case, and a $c\epsilon^2/\mathrm{VCdim}(F)$ bound is sufficient for the realizable case. This work continues an existing line of research (Helmbold & Long, 1994; Bartlett, 1992; Bartlett & Helmbold, 1995; Barve & Long, 1997), and matches known necessary conditions for both the agnostic (Barve & Long, 1997) and realizable (Bartlett, 1992) cases to within a constant factor, closing log-factor gaps. Note that both models allow for variation both in the target and in the marginal distribution on the domain elements; some previous work addressed these two types of changes separately.

The agnostic drift analysis uses a technique called chaining from empirical process theory (see Pollard, 1984, 1990). We defer a high-level description of this technique until later in the paper when appropriate context is available. In the realizable case, as in (Helmbold & Long, 1994; Bartlett, 1992; Bartlett & Helmbold, 1995), we consider an algorithm based on the one-inclusion graph algorithm (Haussler, Littlestone & Warmuth, 1994), which was originally designed for learning concepts in a fixed environment. To determine $h(x_m)$ from some sample
$(x_1, y_1), \ldots, (x_{m-1}, y_{m-1})$, the original algorithm constructs a graph whose vertices are $\{(f(x_1), \ldots, f(x_m)) : f \in F\}$ and has edges between pairs of vertices that differ in only one component (the “one-inclusion graph”).2 The edges of the graph are then directed, and these orientations are used to determine $h(x_m)$. The analysis involves relating the probability of a mistake for some target $f$ to the maximum (over $x_1, \ldots, x_m$) of the outdegree of the vertex associated with $f$. Since any one-inclusion graph for $F$ can be shown to be sparse relative to $\mathrm{VCdim}(F)$, the edges can be directed so that the outdegree of any vertex is at most $\mathrm{VCdim}(F)$ (Haussler, Littlestone & Warmuth, 1994). In (Helmbold & Long, 1994; Bartlett & Helmbold, 1995), the vertex set was expanded to include elements of $\{0,1\}^m$ that are within some Hamming distance of elements of $\{(f(x_1), \ldots, f(x_m)) : f \in F\}$; these graphs also can be shown to be sparse. The main new idea in this paper’s realizable drift analysis is to show, for each $F$, how to direct all the edges of the $m$-dimensional hypercube so that the outdegree of each vertex is bounded appropriately in terms of its distance to the closest element of $\{(f(x_1), \ldots, f(x_m)) : f \in F\}$ as well as the VC-dimension of $F$.

1.1. Agnostic learning in a fixed environment
In the standard agnostic learning model (Haussler, 1992; Kearns et al., 1994), random examples $(x_1, y_1), \ldots, (x_m, y_m)$
are drawn from an arbitrary joint distribution $P$, and the learner’s goal is to output a function $h$ such that the probability that $h(x) \ne y$ for another pair $(x, y)$ drawn according to $P$ is nearly as small as that of the best function in $F$. We give a proof that, in a fixed environment, for any concept class $F$,
$$O\left(\frac{1}{\epsilon^2}\left(\mathrm{VCdim}(F) + \log\frac{1}{\delta}\right)\right)$$
examples are sufficient for an algorithm to, with probability $1 - \delta$, output a hypothesis whose error is at most $\epsilon$ worse than the best in $F$. This bound, which also follows from previous work of Talagrand (1994), improves on the bound of
$$O\left(\frac{1}{\epsilon^2}\left(\mathrm{VCdim}(F)\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$$
that follows from Vapnik and Chervonenkis’ results (see Haussler, 1992), and matches Simon’s general lower bound (Simon, 1996) to within a constant factor for each concept class $F$. Our constants are greater than Talagrand’s, but our proof is simpler and more elementary.

2. Preliminaries
Fix a countable set $X$. Denote the reals by $\mathbf{R}$, and the natural numbers by $\mathbf{N}$. An example is an element of $X \times \{0, 1\}$, and a sample is a finite sequence of examples. A learning algorithm takes a sample as input, and outputs a hypothesis, which is a function from $X$ to $\{0, 1\}$. We will also consider randomized learning algorithms, which can be modelled as deterministic functions of another random input along with the sample. For a real-valued function $g$ defined on $Z$, and $\vec{z} \in Z^m$, define
$$\hat{E}_{\vec{z}}(g) = \frac{1}{m}\sum_{i=1}^{m} g(z_i).$$
The VC-dimension of a set $G \subseteq \{0,1\}^m$ is the length $d$ of the longest sequence $i_1, \ldots, i_d$ of indices such that $\{(g_{i_1}, \ldots, g_{i_d}) : g \in G\} = \{0,1\}^d$. The VC-dimension of a set $G$ of functions from $X$ to $\{0,1\}$ is the maximum, over $m \in \mathbf{N}$ and $\vec{x} \in X^m$, of the VC-dimension of $\{(g(x_1), \ldots, g(x_m)) : g \in G\}$. The metric $d_{TV}$ on probability distributions is defined by
$$d_{TV}(P, Q) = 2\sup_{E} |P(E) - Q(E)|.$$
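As a concrete illustration of these two definitions, here is a small brute-force sketch (the function names are ours, not the paper's). It computes the VC-dimension of a finite set of behavior vectors, and $d_{TV}$ for distributions on a countable (here finite) set, where $2\sup_E |P(E) - Q(E)|$ reduces to the $L_1$ distance between the probability vectors.

```python
from itertools import combinations

def vc_dim(G, m):
    """Brute-force VC-dimension of a finite G subset of {0,1}^m: the largest d
    such that some d coordinates are shattered (all 2^d patterns occur)."""
    patterns = set(map(tuple, G))
    best = 0
    for d in range(1, m + 1):
        shattered = False
        for idx in combinations(range(m), d):
            projected = {tuple(g[i] for i in idx) for g in patterns}
            if len(projected) == 2 ** d:
                shattered = True
                break
        if shattered:
            best = d
        else:
            # if no d-set is shattered, no (d+1)-set can be, so stop
            break
    return best

def d_tv(P, Q):
    """d_TV(P, Q) = 2 sup_E |P(E) - Q(E)| for distributions on a finite set,
    which equals the L1 distance between the probability vectors."""
    keys = set(P) | set(Q)
    return sum(abs(P.get(k, 0.0) - Q.get(k, 0.0)) for k in keys)
```

The early break in `vc_dim` is valid because shattering is downward monotone: if some $(d+1)$-set of coordinates is shattered, every $d$-subset of it is as well.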
Say a sequence $P_1, P_2, \ldots$ of probability distributions is $\Delta$-gradual if for each $t \in \mathbf{N}$, $d_{TV}(P_t, P_{t+1}) \le \Delta$. For a learning algorithm $A$, we say that a sample $(x_1, y_1), \ldots, (x_m, y_m)$ and randomization $r$ cause a mistake for $A$ if $A$, given $(x_1, y_1), \ldots, (x_{m-1}, y_{m-1})$ and $r$, outputs a hypothesis $h$ for which $h(x_m) \ne y_m$.
Recall that the Hamming distance, which we will denote by $\rho$, is defined by $\rho(\vec{v}, \vec{w}) = \sum_i |v_i - w_i|$. For $m \in \mathbf{N}$, $F \subseteq \{0,1\}^m$, $\vec{v} \in \{0,1\}^m$, define $\rho(\vec{v}, F) = \min\{\rho(\vec{v}, \vec{f}) : \vec{f} \in F\}$. For each $k \in \{0, \ldots, m\}$, define $\rho_k(F) = \{\vec{v} \in \{0,1\}^m : \rho(\vec{v}, F) = k\}$.

Both analyses will use Fubini’s Theorem.

Lemma 1 (see Royden, 1963). Choose countable sets $Z_1$ and $Z_2$, a function $f : Z_1 \times Z_2 \to [0,1]$ and probability distributions $D_1$ over $Z_1$ and $D_2$ over $Z_2$. Then
$$\int_{Z_1 \times Z_2} f(z_1, z_2)\, d(D_1 \times D_2)(z_1, z_2) = \int_{Z_1}\left(\int_{Z_2} f(z_1, z_2)\, dD_2(z_2)\right) dD_1(z_1) = \int_{Z_2}\left(\int_{Z_1} f(z_1, z_2)\, dD_1(z_1)\right) dD_2(z_2).$$
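To make the notions $\rho(\vec{v}, F)$ and $\rho_k(F)$ defined above concrete, here is a minimal sketch (helper names are ours); it computes the distance from a vector to a set and partitions the cube $\{0,1\}^m$ into the layers $\rho_0(F), \ldots, \rho_m(F)$.

```python
from itertools import product

def hamming(v, w):
    # rho(v, w): number of coordinates where the two vectors differ
    return sum(vi != wi for vi, wi in zip(v, w))

def dist_to_set(v, F):
    # rho(v, F) = min over f in F of rho(v, f)
    return min(hamming(v, f) for f in F)

def layers(F, m):
    # rho_k(F): the vertices of {0,1}^m at Hamming distance exactly k from F
    out = {k: [] for k in range(m + 1)}
    for v in product((0, 1), repeat=m):
        out[dist_to_set(v, F)].append(v)
    return out
```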
We will also use the standard Hoeffding bound.

Lemma 2 (see Pollard, 1984). Let $Y_1, \ldots, Y_m$ be independent random variables taking values in $[a_1, b_1], \ldots, [a_m, b_m]$ respectively. Then
$$\Pr\left(\left|\sum_{i=1}^{m} Y_i - \sum_{i=1}^{m} E(Y_i)\right| > \eta\right) \le 2\exp\left(\frac{-2\eta^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\right).$$

3. Agnostic learning
In this section, we consider agnostic learning in both fixed and drifting environments. We begin with a fixed environment.

3.1. Fixed environment
Choose a class $F$ of functions from $X$ to $\{0,1\}$. For a probability distribution $P$ on $X \times \{0,1\}$ and a function $h$ from $X$ to $\{0,1\}$, the error of $h$ with respect to $P$, denoted by $\mathrm{er}_P(h)$, is $P\{(x, y) : h(x) \ne y\}$. A learning algorithm $A$ is said to $(\epsilon, \delta)$-agnostically learn $F$ from $m$ examples if for all distributions $P$ on $X \times \{0,1\}$,
$$P^m\left\{\vec{z} : \mathrm{er}_P(A(\vec{z})) > \epsilon + \inf_{f \in F} \mathrm{er}_P(f)\right\} \le \delta.$$
To set the context, we briefly review the work that our analysis builds on (Vapnik & Chervonenkis, 1971; Pollard, 1984; Blumer et al., 1989; Haussler, 1992). For each $f \in F$, define $L_f : X \times \{0,1\} \to \{0,1\}$ by $L_f(x, y) = |f(x) - y|$. Define $L_F = \{L_f : f \in F\}$. The following reduces the learning problem to that of obtaining uniformly good estimates of the errors of possible hypotheses (i.e., expectations of elements of $L_F$).
Lemma 3 (Haussler, 1992). Choose $\epsilon, \delta > 0$, $m \in \mathbf{N}$. If for all distributions $P$ on $X \times \{0,1\}$,
$$P^m\left\{\vec{z} : \exists g \in L_F,\ \left|\hat{E}_{\vec{z}}(g) - \int_{X \times \{0,1\}} g(u)\, dP(u)\right| > \frac{\epsilon}{2}\right\} \le \delta$$
then $F$ is $(\epsilon, \delta)$-agnostically learnable from $m$ examples.

The following will also be useful.

Lemma 4 (see Blumer et al., 1989). $\mathrm{VCdim}(L_F) \le \mathrm{VCdim}(F)$.

So now we can concentrate on determining distribution-free bounds, in terms of the VC-dimension, on the number of examples required to obtain uniformly good estimates of the expectations of random variables in some set. Choose some countable3 set $Z$ (in the learning application, $Z$ will be $X \times \{0,1\}$) and some set $G$ of functions from $Z$ to $\{0,1\}$ (in the learning application, $G$ will be $L_F$).

The first lemma bounds the probability that any estimate is inaccurate in terms of the probability that two samples yield substantially different estimates.

Lemma 5 (Vapnik & Chervonenkis, 1971). Choose $\eta > 0$ and $m \in \mathbf{N}$ for which $m \ge 2/\eta^2$ and some probability distribution $P$ on $Z$. Then
$$P^m\left\{\vec{z} : \exists g \in G,\ \left|\hat{E}_{\vec{z}}(g) - \int_Z g(u)\, dP(u)\right| > \eta\right\} \le 2P^{2m}\left\{(\vec{z}, \vec{u}) : \exists g \in G,\ \left|\hat{E}_{\vec{z}}(g) - \hat{E}_{\vec{u}}(g)\right| > \frac{\eta}{2}\right\} = 2P^{2m}\left\{(\vec{z}, \vec{u}) : \exists g \in G,\ \left|\sum_{i=1}^{m} g(z_i) - g(u_i)\right| > \frac{\eta m}{2}\right\}.$$
The next lemma is an example of the “permutation trick”: note that setting $\sigma_i = -1$ has the effect of exchanging $z_i$ and $u_i$.

Lemma 6 (Vapnik & Chervonenkis, 1971; Pollard, 1984). Choose $\eta > 0$, $m \in \mathbf{N}$ and some probability distribution $P$ on $Z$. Then if $U$ is the uniform distribution on $\{-1, 1\}^m$,
$$P^{2m}\left\{(\vec{z}, \vec{u}) : \exists g \in G,\ \left|\sum_{i=1}^{m} g(z_i) - g(u_i)\right| > \eta m\right\} \le \sup_{\vec{z}, \vec{u} \in Z^m} U\left\{\vec{\sigma} : \exists g \in G,\ \left|\sum_{i=1}^{m} \sigma_i(g(z_i) - g(u_i))\right| > \eta m\right\}.$$
The previous lemma allows us to fix some sequence of $2m$ elements of $Z$, and restrict our attention to the behaviors of elements of $G$ on those $2m$ elements. The following lemma is an immediate consequence of Lemma 2.

Lemma 7. Choose $m \in \mathbf{N}$ and $G \subseteq \{0,1\}^{2m}$. Then if $U$ is the uniform distribution over $\{-1,1\}^m$,
$$U\left\{\vec{\sigma} : \exists g \in G,\ \left|\sum_{i=1}^{m} \sigma_i(g_i - g_{m+i})\right| > \eta m\right\} \le 2|G|e^{-\eta^2 m/2}.$$
By combining Lemmas 3, 4, 5, 6, and 7, and applying a bound on $|G|$ in terms of $\mathrm{VCdim}(G)$ (Sauer, 1972; Shelah, 1972; Vapnik & Chervonenkis, 1971) in Lemma 7, one gets a bound of
$$O\left(\frac{1}{\epsilon^2}\left(\mathrm{VCdim}(F)\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$$
on the sample complexity of agnostically learning $F$ (Haussler, 1992). Our argument will take advantage of the following refinement of a slight generalization of Lemma 7, which also follows directly from Lemma 2.

Lemma 8. Choose $m, k \in \mathbf{N}$, and suppose that $H \subseteq \mathbf{R}^m$ has the property that each $h \in H$ has $\sum_{i=1}^{m} h_i^2 \le k$. Then if $U$ is the uniform distribution over $\{-1,1\}^m$,
$$U\left\{\vec{\sigma} : \exists h \in H,\ \left|\sum_{i=1}^{m} \sigma_i h_i\right| > \eta m\right\} \le 2|H|e^{-\eta^2 m^2/2k}.$$
The idea of Lemma 8 is that if all of the elements of $H$ are small, then the variances of the random terms $\sigma_i h_i$ tend to be small, which means that it’s less likely that any sum of them will stray far from 0 (its expectation).

The following lemma is the heart of our analysis.

Lemma 9. Choose $\eta > 0$ and $d \in \mathbf{N}$. Choose an integer $m \ge \frac{278(d+1)}{\eta^2}$ and $G \subseteq \{0,1\}^{2m}$ for which $\mathrm{VCdim}(G) = d$. Then if $U$ is the uniform distribution over $\{-1,1\}^m$,
$$U\left\{\vec{\sigma} : \exists g \in G,\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i(g_i - g_{m+i})\right| > \eta\right\} \le 4 \cdot 41^d e^{-\eta^2 m/400}.$$
The proof is a chaining argument. See Pollard’s books (Pollard, 1984, 1990) for others and for further references. The idea is as follows. First, we form a sequence $G_0, \ldots, G_n$ of approximations to $G$. The approximations get successively finer until $G_n = G$. Next, we
Figure 1. A schematic representation of the $G_j$’s and $H_j$’s from the proof of Lemma 9 in the case $m = 4$. The $G_j$’s, which form increasingly accurate approximations to $G$, are represented by increasingly dense rows of nodes. For each $j > 0$, an edge is added between the node representing each element of $G_j$ and that representing the closest element of $G_{j-1}$. If you think of this edge as representing the difference between the two, then each $H_j$ (for $j > 0$) consists of the $j$th layer of edges.
consider the sets $H_1, H_2, \ldots, H_n$, where each $H_j$ consists of the adjustments that need to be made to $G_{j-1}$ to get the improved approximation $G_j$. In particular, $H_j$ consists of the differences between each element of $G_j$ and the closest element of $G_{j-1}$. (See figure 1.) If we define $H_0 = G_0$, then each element of $G$ is the sum of an element of $H_0$, an element of $H_1$, and so on up to an element of $H_n$. So, loosely speaking, if things are OK for each of the $H_j$’s, then they’re OK for $G$.

We will apply Lemma 8 to analyze each of the $H_j$’s. For relatively large $j$, $H_j$ consists of those adjustments needed to make an already fine approximation finer. Thus, the elements of $H_j$ are small, and we can use the fact that Lemma 8 provides a better bound in this case. When $j$ is small, since $|H_j| \le |G_j|$, and $G_j$ is a relatively coarse approximation to $G$, $H_j$ does not have many elements, which provides partial compensation for the fact that its elements might be large.

We will use the following result due to Haussler, which bounds the number of significantly different elements of a set $G$ in terms of its VC-dimension. This can be used to bound the size of an approximation to $G$ (Kolmogorov & Tihomirov, 1961).

Lemma 10 (Haussler, 1995). For all $m \in \mathbf{N}$, for all $k \le m$, if each pair $g, h$ of elements of $G \subseteq \{0,1\}^m$ has $\rho(g, h) > k$, then
$$|G| \le \left(\frac{41m}{k}\right)^{\mathrm{VCdim}(G)}.$$
Proof (of Lemma 9): Let $n = 1 + \lfloor \log_2 m \rfloor$. Construct $G_0, \ldots, G_n$ as follows. Let $G_0$ consist of an arbitrary single element of $G$, and for each $j \in \{1, \ldots, n\}$, construct $G_j$ by initializing it to $G_{j-1}$, and, as long as there is a $g \in G$ for which $\rho(g, G_j) > m/2^j$, choosing such a $g$ and adding it to $G_j$. Note that $G_0 \subseteq G_1 \subseteq \cdots \subseteq G_n = G$. For each $g \in G$ and $j \in \{0, \ldots, n\}$ choose an element $\psi_j(g)$ of $G_j$ such that $\rho(g, \psi_j(g))$ is minimized. Note that $\rho(g, \psi_j(g)) \le m/2^j$, since otherwise $g$ would have been added to $G_j$. Let $H_0 = G_0$,
and for each $j \in \{1, \ldots, n\}$, define $H_j$ to be $\{g - \psi_{j-1}(g) : g \in G_j\}$. Note that since for all $g \in G$, $\rho(g, \psi_{j-1}(g)) \le m/2^{j-1}$, for each $h \in H_j$, $\sum_{i=1}^{2m} |h_i| \le m/2^{j-1}$.

By induction, for each $k \in \{0, \ldots, n\}$, for each $g \in G_k$, there exist $h_{g,0} \in H_0, \ldots, h_{g,k} \in H_k$ such that $g = \sum_{j=0}^{k} h_{g,j}$. Thus, for each $g \in G = G_n$, there exist $h_{g,0} \in H_0, \ldots, h_{g,n} \in H_n$ such that $g = \sum_{j=0}^{n} h_{g,j}$. Let
$$p = U\left\{\vec{\sigma} : \exists g \in G,\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i(g_i - g_{m+i})\right| > \eta\right\}.$$
Then, expressing $g$ as $\sum_{j=0}^{n} h_{g,j}$, we get
$$p = U\left\{\vec{\sigma} : \exists g \in G,\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i \sum_{j=0}^{n} \left((h_{g,j})_i - (h_{g,j})_{m+i}\right)\right| > \eta\right\}.$$
Rearranging the sums yields
$$p = U\left\{\vec{\sigma} : \exists g \in G,\ \left|\sum_{j=0}^{n} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\left((h_{g,j})_i - (h_{g,j})_{m+i}\right)\right| > \eta\right\},$$
and applying the triangle inequality, we get
$$p \le U\left\{\vec{\sigma} : \exists g \in G,\ \sum_{j=0}^{n} \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i\left((h_{g,j})_i - (h_{g,j})_{m+i}\right)\right| > \eta\right\}.$$
For each $j \in \{0, \ldots, n\}$, let $\eta_j = (\eta/7)\sqrt{(j+1)/2^j}$. Then $\sum_{j=0}^{n} \eta_j \le \eta$, and therefore
$$p \le U\left\{\vec{\sigma} : \exists g \in G,\ \exists j \in \{0, \ldots, n\},\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i\left((h_{g,j})_i - (h_{g,j})_{m+i}\right)\right| > \eta_j\right\},$$
which implies
$$p \le \sum_{j=0}^{n} U\left\{\vec{\sigma} : \exists g \in G,\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i\left((h_{g,j})_i - (h_{g,j})_{m+i}\right)\right| > \eta_j\right\}.$$
Since each $h_{g,j} \in H_j$, we have
$$p \le \sum_{j=0}^{n} U\left\{\vec{\sigma} : \exists h \in H_j,\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i(h_i - h_{m+i})\right| > \eta_j\right\}.$$
Choose $j \in \{0, \ldots, n\}$. For each $h \in H_j$, $\sum_{i=1}^{2m} |h_i| \le m/2^{j-1}$. Thus, since $h \in \{-1, 0, 1\}^{2m}$,
$$\sum_{i=1}^{m} (h_i - h_{m+i})^2 = 4|\{i : |h_i - h_{m+i}| = 2\}| + |\{i : |h_i - h_{m+i}| = 1\}| \le 2\sum_{i=1}^{2m} |h_i| \le \frac{m}{2^{j-2}}.$$
Applying Lemma 8, we have
$$p \le \sum_{j=0}^{n} 2|H_j| \exp\left(\frac{-(\eta_j m)^2}{2 \cdot \frac{m}{2^{j-2}}}\right).$$
Substituting the value of $\eta_j$, we get
$$p \le \sum_{j=0}^{n} 2|H_j| \exp\left(\frac{-\eta^2(j+1)m}{400}\right).$$
By construction, each pair of elements of $G_j$ have Hamming distance more than $m/2^j$. Applying Lemma 10, we get $|H_j| \le |G_j| \le (41 \cdot 2^j)^{\mathrm{VCdim}(G_j)} \le (41 \cdot 2^j)^d$, since $G_j \subseteq G$. Therefore
$$p \le 2\sum_{j=0}^{\infty} \exp\left((\ln 41 + j\ln 2)d - \frac{\eta^2(j+1)m}{400}\right) = \frac{2 \cdot 41^d e^{-\eta^2 m/400}}{1 - 2^d e^{-\eta^2 m/400}} \le 4 \cdot 41^d e^{-\eta^2 m/400},$$
since $m \ge \frac{278(d+1)}{\eta^2}$. □
Putting together Lemmas 3, 4, 5, 6, and 9, and solving for m, we get a new proof of the following result due to Talagrand.
Theorem 1 (Talagrand, 1994). There is a constant $c$ such that for any class $F$ of functions from $X$ to $\{0,1\}$, for any $\epsilon, \delta > 0$, there is an algorithm $A$ that $(\epsilon, \delta)$-agnostically learns $F$ from at most $\frac{c}{\epsilon^2}(\mathrm{VCdim}(F) + \ln\frac{1}{\delta})$ examples.

3.2. Drifting environment
For a class $F$ of functions from $X$ to $\{0,1\}$, we say a learning algorithm $A$ agnostically $(\epsilon, \Delta)$-tracks $F$ if for all $\Delta$-gradual sequences $P_1, P_2, \ldots$ of distributions over $X \times \{0,1\}$, there is an $m_0$ such that for all $m \ge m_0$, the probability that a sample drawn according to $\prod_{t=1}^{m} P_t$ and $A$’s randomization cause a mistake for $A$ is at most $\epsilon + \inf_{f \in F} P_m\{(x, y) : f(x) \ne y\}$. If there is a prediction strategy that agnostically $(\epsilon, \Delta)$-tracks $F$ then we say $F$ is $(\epsilon, \Delta)$-trackable in the agnostic case.

For our analysis of agnostic learning in a drifting environment, we will replace Lemmas 5 and 6 with the following.

Lemma 11 (Barve & Long, 1997). Choose a countable set $Z$, and a set $G$ of functions from $Z$ to $\{0,1\}$. Choose $\alpha > 0$ and $0 \le \kappa < \alpha$. Choose $m \in \mathbf{N}$ such that $m \ge 4/\alpha^2$. Choose distributions $D, D_1, \ldots, D_m$ on $Z$ such that for each $1 \le i \le m$, $d_{TV}(D_i, D) \le \kappa$. If $U$ is the uniform distribution over $\{1, -1\}^m$,
$$\left(\prod_{i=1}^{m} D_i\right)\left\{\vec{z} \in Z^m : \exists g \in G,\ \left|\hat{E}_{\vec{z}}(g) - \int_Z g(v)\, dD(v)\right| > \alpha\right\} \le 2\sup_{(\vec{z}, \vec{u}) \in Z^m \times Z^m} U\left\{\vec{\sigma} : \exists g \in G,\ \left|\frac{1}{m}\sum_{i=1}^{m} \sigma_i(g(u_i) - g(z_i))\right| > (\alpha - \kappa)/2\right\}.$$
Putting together Lemmas 11 and 9, we get the following.

Lemma 12. Choose a countable set $Z$, and a set $G$ of functions from $Z$ to $\{0,1\}$. Let $d = \mathrm{VCdim}(G)$. Choose $\alpha > 0$ and $0 \le \kappa < \alpha$. Choose distributions $D, D_1, \ldots, D_m$ on $Z$ such that for each $1 \le i \le m$, $d_{TV}(D_i, D) \le \kappa$. If $m \ge (1112(d+1))/((\alpha - \kappa)^2)$ then
$$\left(\prod_{i=1}^{m} D_i\right)\left\{\vec{z} \in Z^m : \exists g \in G,\ \left|\hat{E}_{\vec{z}}(g) - \int_Z g(v)\, dD(v)\right| > \alpha\right\} \le 8 \cdot 41^d e^{-(\alpha - \kappa)^2 m/1600}.$$
Next, we record a slight variant of a well-known lemma for converting tail bounds to expectation bounds.

Lemma 13. For any $[0,1]$-valued random variable $Y$, if $\varphi : [0,1] \to [0,1]$ is such that for all $\beta$, $\Pr(Y > \beta) \le \varphi(\beta)$, then for all $0 = a_0 \le a_1 \le \cdots \le a_k \le a_{k+1} = 1$, $E(Y) \le \sum_{i=0}^{k} \varphi(a_i)a_{i+1}$.
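A quick numerical check of the bound in Lemma 13 (helper names are ours): the function below evaluates $\sum_{i=0}^{k} \varphi(a_i)a_{i+1}$ for given breakpoints, and any $Y$ satisfying the tail condition should have expectation at most this value.

```python
def expectation_bound(phi, a):
    """Lemma 13: if Pr(Y > beta) <= phi(beta) for a [0,1]-valued Y, then
    E(Y) <= sum_{i=0}^{k} phi(a_i) * a_{i+1}, where 0 = a_0 <= ... <= a_{k+1} = 1."""
    assert a[0] == 0.0 and a[-1] == 1.0
    return sum(phi(a[i]) * a[i + 1] for i in range(len(a) - 1))
```

For example, with $\varphi(\beta) = 1$ for $\beta < 0.2$ and $\varphi(\beta) = 0.1$ otherwise, and breakpoints $(0, 0.2, 1)$, the bound is $0.3$; the variable taking value $0.2$ with probability $0.9$ and $1$ with probability $0.1$ satisfies the tail condition and has expectation $0.28 \le 0.3$.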
Proof: The distribution on $Y$ that maximizes its expectation subject to $\forall i, \Pr(Y > a_i) \le \varphi(a_i)$ assigns probability $\varphi(a_k)$ to 1, probability $\varphi(a_{k-1}) - \varphi(a_k)$ to $a_k$, and so on, until all the probability has been distributed. This can be verified by induction moving from right to left, using a perturbation argument for the induction step. □

Theorem 2. There is a constant $c > 0$ such that for any set $F$ of functions from $X$ to $\{0,1\}$, for any $\epsilon > 0$, if $\Delta \le \frac{c\epsilon^3}{\mathrm{VCdim}(F)}$, then $F$ is $(\epsilon, \Delta)$-trackable in the agnostic case.

Proof: Let $d = \mathrm{VCdim}(F)$. Choose $\epsilon \le 1$, and $\Delta \le \frac{\epsilon^3}{5000000\,d}$. Let $m = \lfloor \epsilon/(16\Delta) \rfloor$. For each $f \in F$, define $L_f : X \times \{0,1\} \to \{0,1\}$ by $L_f(x, y) = |f(x) - y|$. Consider the algorithm $A$ which, given $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$, returns a hypothesis $h \in F$ that minimizes $\sum_{i=t-m}^{t-1} L_h(x_i, y_i)$. Let $L_F = \{L_f : f \in F\}$. Recall that $\mathrm{VCdim}(L_F) \le \mathrm{VCdim}(F)$ (Lemma 4).

Choose a $\Delta$-gradual sequence $P_1, P_2, \ldots$ of probability distributions, an arbitrary $f^* \in F$ (to compare $h$ with), and $t > m$. Applying Lemma 1 as in (Haussler, Littlestone & Warmuth, 1994), the probability that $(x_1, y_1), \ldots, (x_t, y_t)$ drawn according to $\prod_{i=1}^{t} P_i$ causes a mistake for $A$ is equal to the expectation, with respect to the first $t-1$ examples, of $P_t\{(x_t, y_t) : h(x_t) \ne y_t\}$ (recall that $h$ is a function of the first $t-1$ examples). Choose $\beta \ge 6\Delta m$. Since for all $i \le m$, $d_{TV}(P_{t-i}, P_t) \le \Delta m$, applying Lemma 12 with $\alpha = \beta/2$, $Z = X \times \{0,1\}$, and $G = L_F$, and doing some simple calculations, we get
$$\Pr\left(\exists f \in F,\ \left|P_t\{(x_t, y_t) : f(x_t) \ne y_t\} - \frac{1}{m}\sum_{i=t-m}^{t-1} L_f(x_i, y_i)\right| > \frac{\beta}{2}\right) \le 8 \cdot 41^d \exp\left(\frac{-\beta^2 m}{14400}\right).$$
Since $\sum_{i=t-m}^{t-1} L_h(x_i, y_i) \le \sum_{i=t-m}^{t-1} L_{f^*}(x_i, y_i)$, for all $\beta > 6\Delta m$,
$$\Pr\left(P_t\{(x_t, y_t) : h(x_t) \ne y_t\} - P_t\{(x_t, y_t) : f^*(x_t) \ne y_t\} > \beta\right) \le 8 \cdot 41^d \exp\left(\frac{-\beta^2 m}{14400}\right).$$
Applying Lemma 13 with $\varphi$ given by the above bound when $\beta \ge 6\Delta m$ and 1 otherwise, and with $a_1 = 6\Delta m$, and for all relevant $i > 1$, $a_i = \sqrt{\frac{14400(\ln 8 + (\ln 41)d + i\ln 2)}{m}}$, we get
$$E\left(P_t\{(x_t, y_t) : h(x_t) \ne y_t\} - P_t\{(x_t, y_t) : f^*(x_t) \ne y_t\}\right) \le 6\Delta m + \sum_{i=1}^{\infty} \sqrt{\frac{14400(\ln 8 + (\ln 41)d + (i+1)\ln 2)}{m}}\; 2^{-i} \le 6\Delta m + \sqrt{\frac{d}{m}}\sum_{i=1}^{\infty} \sqrt{14400(6 + (i+1)\ln 2)}\; 2^{-i} \le 6\Delta m + 341\sqrt{\frac{d}{m}}.$$
Substituting the values of $m$ and $\Delta$ and approximating, we get
$$E\left(P_t\{(x_t, y_t) : h(x_t) \ne y_t\} - P_t\{(x_t, y_t) : f^*(x_t) \ne y_t\}\right) \le \epsilon.$$
As discussed above, this completes the proof. □
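The algorithm $A$ analyzed in the proof of Theorem 2, empirical risk minimization over a sliding window of the most recent examples, can be sketched as follows for a finite class (a minimal illustration; names and the streaming interface are ours, and ties are broken arbitrarily, as the proof allows).

```python
def track_agnostic(F, window, stream):
    """Sliding-window ERM tracker: at each time t, predict with a hypothesis
    in F minimizing empirical error on the last `window` examples.
    F is a list of callables X -> {0,1}; stream yields (x, y) pairs.
    Returns the number of prediction mistakes made."""
    history, mistakes = [], 0
    for x, y in stream:
        recent = history[-window:]
        # empirical risk minimization over the window (first minimizer wins)
        h = min(F, key=lambda f: sum(f(xi) != yi for xi, yi in recent))
        mistakes += h(x) != y
        history.append((x, y))
    return mistakes
```

The window length plays the role of $m = \lfloor \epsilon/(16\Delta) \rfloor$ in the proof: large enough to estimate errors well, short enough that the distribution cannot drift far within it.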
4. The realizable case
Say a probability distribution $P$ over $X \times \{0,1\}$ is consistent with a function $f$ from $X$ to $\{0,1\}$ if the probability that a pair $(x, y)$ drawn according to $P$ has $f(x) = y$ is 1. For a set $F$ of functions from $X$ to $\{0,1\}$, say that $P$ is consistent with $F$ if it is consistent with some member of $F$.

For a class $F$ of functions from $X$ to $\{0,1\}$, we say a learning algorithm $A$ $(\epsilon, \Delta)$-tracks $F$ in the realizable case if for all $\Delta$-gradual sequences $P_1, P_2, \ldots$ of distributions over $X \times \{0,1\}$ that are consistent with $F$, there is an $m_0$ such that for all $m \ge m_0$, the probability that $(x_1, y_1), \ldots, (x_m, y_m)$ drawn according to $\prod_{t=1}^{m} P_t$ and $A$’s randomization cause a mistake for $A$ is at most $\epsilon$. If there is a prediction strategy that $(\epsilon, \Delta)$-tracks $F$ in the realizable case then we say $F$ is $(\epsilon, \Delta)$-trackable in the realizable case.

Recall that the $m$th hypercube, which we will denote by $H_m$, is the undirected graph whose vertex set is $\{0,1\}^m$, and whose edges are all $\{\vec{v}, \vec{w}\}$ such that $\rho(\vec{v}, \vec{w}) = 1$.

Theorem 3 (Haussler, Littlestone & Warmuth, 1994). For any $m \in \mathbf{N}$, for any $F \subseteq \{0,1\}^m$, if $G$ is the subgraph of $H_m$ induced by $F$, the edges of $G$ can be directed so that the maximum outdegree of any node is at most $\mathrm{VCdim}(F)$.

Lemma 14 (Shelah, 1972; Sauer, 1972; Blumer et al., 1989). For $m \in \mathbf{N}$, $F \subseteq \{0,1\}^m$, $|F| \le (em/\mathrm{VCdim}(F))^{\mathrm{VCdim}(F)}$.
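The one-inclusion graph of Theorem 3 can be built directly; the sketch below constructs it and then orients its edges with a simple greedy pass (names are ours). Note the caveat: the greedy orientation is only an illustration and carries no guarantee; Theorem 3 establishes by a different argument that an orientation with maximum outdegree at most $\mathrm{VCdim}(F)$ always exists.

```python
def one_inclusion_graph(F):
    """Vertices are the distinct behavior vectors of F on the sample;
    edges join pairs of vectors differing in exactly one coordinate."""
    V = sorted(set(map(tuple, F)))
    E = [(v, w) for i, v in enumerate(V) for w in V[i + 1:]
         if sum(a != b for a, b in zip(v, w)) == 1]
    return V, E

def orient_greedy(V, E):
    """Heuristic orientation: each edge leaves the endpoint that currently
    has fewer outgoing edges. (Illustrative only; see Theorem 3 for the
    existence of an orientation with outdegree at most VCdim(F).)"""
    out = {v: 0 for v in V}
    heads = {}
    for v, w in E:
        tail, head = (v, w) if out[v] <= out[w] else (w, v)
        out[tail] += 1
        heads[(v, w)] = head   # the edge is directed toward `head`
    return heads, out
```

In the prediction rule described later for algorithm $A_F$, the label assigned to the unknown coordinate is read off from the direction chosen for the single edge joining the two candidate labelings.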
The proof of our next lemma is similar to that of a related result of Roy (1991).

Lemma 15. For any $m \in \mathbf{N}$, for any $F \subseteq \{0,1\}^m$, for any $k \in \{1, \ldots, m\}$, $\mathrm{VCdim}(\rho_{k-1}(F) \cup \rho_k(F)) \le 5(\mathrm{VCdim}(F) + k)$.

Proof: Assume without loss of generality that $|F| > 1$. Let $d = \mathrm{VCdim}(\rho_{k-1}(F) \cup \rho_k(F))$. Choose a set $i_1, \ldots, i_d$ such that $\{(g_{i_1}, \ldots, g_{i_d}) : g \in \rho_{k-1}(F) \cup \rho_k(F)\} = \{0,1\}^d$. Each element of $\{(g_{i_1}, \ldots, g_{i_d}) : g \in \rho_{k-1}(F)\}$ can be derived from an element of $\{(f_{i_1}, \ldots, f_{i_d}) : f \in F\}$ and a subset of $k-1$ elements of $\{1, \ldots, d\}$, and therefore
$$|\{(g_{i_1}, \ldots, g_{i_d}) : g \in \rho_{k-1}(F)\}| \le \binom{d}{k-1}|\{(f_{i_1}, \ldots, f_{i_d}) : f \in F\}|.$$
Applying a similar observation with regard to $\rho_k(F)$, we get
$$|\{(g_{i_1}, \ldots, g_{i_d}) : g \in \rho_{k-1}(F) \cup \rho_k(F)\}| \le \left(\binom{d}{k-1} + \binom{d}{k}\right)|\{(f_{i_1}, \ldots, f_{i_d}) : f \in F\}| = \binom{d+1}{k}|\{(f_{i_1}, \ldots, f_{i_d}) : f \in F\}| \le \binom{d+1}{k}\left(\frac{ed}{\mathrm{VCdim}(F)}\right)^{\mathrm{VCdim}(F)}$$
by Lemma 14. Thus
$$2^d \le \binom{d+1}{k}\left(\frac{ed}{\mathrm{VCdim}(F)}\right)^{\mathrm{VCdim}(F)} \le \left(\frac{e(d+1)}{k}\right)^k \left(\frac{ed}{\mathrm{VCdim}(F)}\right)^{\mathrm{VCdim}(F)}.$$
Taking logs, we get
$$d\ln 2 \le k\ln\left(\frac{e(d+1)}{k}\right) + \mathrm{VCdim}(F)\ln\left(\frac{ed}{\mathrm{VCdim}(F)}\right).$$
Since for all $x, \lambda > 0$, $1 + \ln x \le \lambda x + \ln(1/\lambda)$ (see Anthony, Biggs & Shawe-Taylor, 1990), we have that for all $\lambda > 0$, $d\ln 2 \le \lambda(2d+1) + (\mathrm{VCdim}(F) + k)\ln(1/\lambda)$. Solving for $d$ and substituting $\lambda = 1/10$ completes the proof. □
Lemma 16. Choose $m \in \mathbf{N}$ and $F \subseteq \{0,1\}^m$. Then the edges of $H_m$ can be oriented so that the outdegree of any $\vec{v} \in \{0,1\}^m$ is at most $15(\mathrm{VCdim}(F) + \rho(\vec{v}, F))$.

Proof: Let $d = \mathrm{VCdim}(F)$. Assume without loss of generality that $|F| > 1$ (and therefore $d > 0$). Let $G_0$ be the subgraph of $H_m$ induced by $F$, and for each $k = 1, \ldots, m$, let $G_k$ be the subgraph of $H_m$ induced by $\rho_k(F) \cup \rho_{k-1}(F)$. (See figure 2.) For each $k$, let $G_k'$ be a directed graph obtained by directing the edges of $G_k$ so that the outdegree of each vertex in $G_k'$ is at most $5(d + k)$.

By the triangle inequality, if $\{\vec{v}, \vec{w}\}$ is an edge in $H_m$, then $|\rho(\vec{v}, F) - \rho(\vec{w}, F)| \le 1$. Therefore, each edge of $H_m$ is in $G_k$ for at least one $k$. Form a directed graph $H_m'$ by directing the edges of $H_m$, choosing the direction for each edge from the graph $G_k'$ with the least $k$ such that the undirected edge is in $G_k$.
Figure 2. For m = 4 and some F ⊆ {0, 1}m , the m-dimensional hypercube has been diagrammed with F at the bottom, those vertices at a Hamming distance 1 from some element of F in the row above, and so on. The subgraphs G 0 , . . . , G 4 from the proof of Lemma 16 are as shown.
Choose a vertex $\vec{v} \in \{0,1\}^m$. Assume without loss of generality that $\rho(\vec{v}, F) < m$. Then $\vec{v}$ appears in $G_k'$ exactly when $k \in \{\rho(\vec{v}, F), \rho(\vec{v}, F) + 1\}$. Hence the outdegree of $\vec{v}$ in $H_m'$ is at most $5(d + \rho(\vec{v}, F)) + 5(d + \rho(\vec{v}, F) + 1) \le 15(d + \rho(\vec{v}, F))$, completing the proof. □
For each set $F$ of possible targets, the tracking algorithm $A_F'$ used to prove Theorem 4 will apply a subalgorithm $A_F$ to a subsequence consisting of the most recent examples. We begin by describing and analyzing $A_F$.

Algorithm $A_F$ will make use of an arbitrary order on $X$. For each $F$, we will describe the hypothesis $h$ output by $A_F$ on input $(x_1, y_1), \ldots, (x_{m-1}, y_{m-1})$ by describing a process for generating $h(x_m)$ for each possible $x_m$. Algorithm $A_F$ first sorts $x_1, \ldots, x_m$ (let $a_1, \ldots, a_m$ be the resulting reordering of $x_1, \ldots, x_m$; let $b_1, \ldots, b_m$ be the corresponding reordering of $y_1, \ldots, y_{m-1}, \square$, where $\square$ serves to hold the position corresponding to $x_m$; and let $i^*$ be the position of $x_m$ in $a_1, \ldots, a_m$). Next, it sets $\mathbf{F} = \{(f(a_1), \ldots, f(a_m)) : f \in F\}$, and creates a directed graph $H_m'$ by orienting the edges of $H_m$ so that the outdegree of each vertex $\vec{v}$ is at most $15(\mathrm{VCdim}(\mathbf{F}) + \rho(\vec{v}, \mathbf{F}))$ as in Lemma 16. Finally, it sets $h(x_m) = 1$ if and only if the edge in $H_m'$ between $(b_1, \ldots, b_{i^*-1}, 0, b_{i^*+1}, \ldots, b_m)$ and $(b_1, \ldots, b_{i^*-1}, 1, b_{i^*+1}, \ldots, b_m)$ is oriented toward $(b_1, \ldots, b_{i^*-1}, 1, b_{i^*+1}, \ldots, b_m)$.

Lemma 17 (Bartlett, 1992). For any probability distributions $P$ and $Q$, $d_{TV}(P \times Q, Q \times P) \le d_{TV}(P, Q)$.
Lemma 18. Choose $m \in \mathbf{N}$, a set $F$ of functions from $X$ to $\{0,1\}$, and a $\Delta$-gradual sequence $P_1, \ldots, P_m$ of probability distributions on $X \times \{0,1\}$ that are consistent with $F$. The probability under $\prod_{t=1}^{m} P_t$ that $(x_1, y_1), \ldots, (x_m, y_m)$ causes a mistake for $A_F$ is at most
$$\frac{15\,\mathrm{VCdim}(F)}{m} + 6\Delta m + \Pr(\exists i \ne j,\ x_i = x_j).$$

Proof: Define $\chi((x_1, y_1), \ldots, (x_m, y_m))$ to indicate whether $(x_1, y_1), \ldots, (x_m, y_m)$ causes a mistake for $A_F$ and $x_1, \ldots, x_m$ are distinct. Clearly, $\Pr(\text{mistake}) \le E(\chi) + \Pr(\text{not distinct})$, so we will bound $E(\chi)$.

Let $Z = X \times \{0,1\}$. For $\vec{z} \in Z^m$, $j \in \{1, \ldots, m\}$, define $\varphi(\vec{z}, j)$ to be the result of exchanging $z_j$ and $z_m$. By the triangle inequality, for all $t \in \{1, \ldots, m\}$, $d_{TV}(P_t, P_m) \le \Delta m$. Choose $j \in \{1, \ldots, m-1\}$. Repeatedly applying Fubini’s Theorem (Lemma 1),
$$\int \chi(\vec{z})\, d\left(\prod_{t=1}^{m} P_t\right)(\vec{z}) = \int\left(\int \chi(\vec{z})\, d(P_j \times P_m)(z_j, z_m)\right) d\left(\prod_{t \notin \{j,m\}} P_t\right)(z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_{m-1}).$$
Applying Lemma 17 and the definition of $d_{TV}$,
$$\int \chi(\vec{z})\, d\left(\prod_{t=1}^{m} P_t\right)(\vec{z}) \le \int\left(\int \chi(\vec{z})\, d(P_m \times P_j)(z_j, z_m)\right) d\left(\prod_{t \notin \{j,m\}} P_t\right)(z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_{m-1}) + \frac{\Delta m}{2} = \int \chi(\varphi(\vec{z}, j))\, d\left(\prod_{t=1}^{m} P_t\right)(\vec{z}) + \frac{\Delta m}{2},$$
again, because of Fubini’s Theorem. Thus
$$\int \chi(\vec{z})\, d\left(\prod_{t=1}^{m} P_t\right)(\vec{z}) \le \int\left(\frac{1}{m}\sum_{j=1}^{m} \chi(\varphi(\vec{z}, j))\right) d\left(\prod_{t=1}^{m} P_t\right)(\vec{z}) + \frac{\Delta m}{2}. \qquad (1)$$
Fix an arbitrary $\vec{z} = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{0,1\})^m$. If $x_1, \ldots, x_m$ are not distinct, then the definition of $\chi$ implies that $\frac{1}{m}\sum_{j=1}^{m} \chi(\varphi(\vec{z}, j)) = 0$. Assume $x_1, \ldots, x_m$ are distinct. Let $a_1, \ldots, a_m$ be $x_1, \ldots, x_m$ in sorted order, and let $v_1, \ldots, v_m$ be the corresponding reordering of the $y_i$’s. Let $\mathbf{F} = \{(f(a_1), \ldots, f(a_m)) : f \in F\}$.

Since algorithm $A_F$ sorts the sample, the directed graph $H_m'$ constructed by algorithm $A_F$ using any reordering of the $x_i$’s is the same. Choose $j \in \{1, \ldots, m\}$. Let $j'$ be the position of $x_j$ when $x_1, \ldots, x_m$ is sorted. Then $\varphi(\vec{z}, j)$ causes a mistake for $A_F$ if and only if the edge in $H_m'$ between $\vec{v}$ and the vertex obtained by negating the $j'$th bit of $\vec{v}$ is oriented away from $\vec{v}$. (This is because $\vec{v}$ represents the correct labellings, and $A_F$ predicts according to the direction of the named edge.) Thus $\sum_{j=1}^{m} \chi(\varphi(\vec{z}, j)) \le \mathrm{outdegree}(\vec{v})$.

For each $t \in \{1, \ldots, m\}$ choose $f_t \in F$ such that $P_t$ is consistent with $f_t$. Then $\rho(\vec{v}, \mathbf{F}) \le |\{t : f_t(x_t) \ne f_m(x_t)\}|$. Thus $\mathrm{outdegree}(\vec{v}) \le 15(\mathrm{VCdim}(\mathbf{F}) + |\{t : f_t(x_t) \ne f_m(x_t)\}|)$ and therefore
$$\sum_{j=1}^{m} \chi(\varphi(\vec{z}, j)) \le 15(\mathrm{VCdim}(\mathbf{F}) + |\{t : f_t(x_t) \ne f_m(x_t)\}|).$$
Since $\mathrm{VCdim}(\mathbf{F}) \le \mathrm{VCdim}(F)$, plugging into (1), we have
$$\int \chi(\vec{z})\, d\left(\prod_{t=1}^{m} P_t\right)(\vec{z}) \le \frac{15\,\mathrm{VCdim}(F)}{m} + \frac{\Delta m}{2} + \frac{15}{m} E(|\{t : f_t(x_t) \ne f_m(x_t)\}|). \qquad (2)$$
Since $P_m$ is consistent with $f_m$,
$$P_m\{(x, y) : f_m(x) \ne y\} = 0. \qquad (3)$$
For any $t \in \{1, \ldots, m\}$, since $d_{TV}(P_t, P_m) \le \Delta m$, (3) implies
$$P_t\{(x, y) : f_t(x) \ne f_m(x)\} = P_t\{(x, y) : f_m(x) \ne y\} \le \frac{\Delta m}{2}.$$
Thus $E(|\{t : f_t(x_t) \ne f_m(x_t)\}|) \le \frac{\Delta m^2}{2}$. Substituting into (2) completes the proof. □
Theorem 4. There is a constant $c > 0$ such that for any set $F$ of functions from $X$ to $\{0,1\}$, for any $\epsilon > 0$, if
$$\Delta \le \frac{c\epsilon^2}{\mathrm{VCdim}(F)},$$
then $F$ is $(\epsilon, \Delta)$-trackable in the realizable case.
Proof: Let d = VC dim(F). Consider the algorithm A′_F defined as follows. First, it sets R = {1, . . . , ⌈11560d²/ε³⌉}, and for each t, it draws r_t uniformly at random from R. Given (x_1, y_1), . . . , (x_m, y_m), if m > 33d/ε, then A′_F gives the last m′ = ⌈33d/ε⌉ elements of ((x_1, r_1), y_1), . . . , ((x_m, r_m), y_m) to A_F. Let U be the uniform distribution over R. For some Δ ≥ 0, choose a Δ-gradual sequence P_1, P_2, . . . of distributions over X. Then P_1 × U, P_2 × U, . . . is also Δ-gradual. Also, if for each f ∈ F we define a function f_R from X × R to {0, 1} by f_R(x, r) = f(x), then, straight from the definitions, VC dim({f_R : f ∈ F}) = d. So, applying Lemma 18, if m > 33d/ε, the probability that A′_F makes a mistake is at most 15d/m′ + 6Δm′ + (m′)²/|R|. Substituting the definitions of m′ and R and observing that 33d/ε ≤ m′ ≤ 34d/ε, if Δ ≤ ε²/(458d), this probability is at most ε, completing the proof. □

Acknowledgments

We thank the COLT’98 program committee and anonymous referees for valuable comments, including pointing out some mistakes in earlier drafts of the paper. We gratefully acknowledge the support of National University of Singapore Academic Research Fund Grant RP960625.

Notes

1. Recently, other constraints on the drift have been examined (e.g., Bartlett, Ben-David & Kulkarni, 1996; Freund & Mansour, 1997). In this paper we restrict our attention to the simplest drift models, but direct application of a slight variant of Lemma 12 of this paper leads to a small improvement in the analysis of (Freund & Mansour, 1997). Models of a changing environment less similar to the one studied here were considered in (Littlestone & Warmuth, 1994; Kuh, Petsche & Rivest, 1990; Kuh, Petsche & Rivest, 1991; Blum & Chalasani, 1992; Freund & Ron, 1995; Herbster & Warmuth, 1995; Auer & Warmuth, 1995; Kuh, 1997; Tian & Kuh, 1997; Herbster & Warmuth, 1998).
2. Their statement of their algorithm is slightly different; we describe an equivalent algorithm to facilitate comparison with our modification.
3. We assume that Z is countable for convenience. Considerably weaker measurability assumptions suffice for the results mentioned in this paper (Pollard, 1984; Haussler, 1992).
4. The behavior of A′_F for small m is immaterial.
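As an aside, the constants in the proof of the preceding theorem can be sanity-checked numerically. The sketch below (illustrative code, not part of the original paper; the function name and the sample values d = 10, ε = 0.1 are assumptions) evaluates the mistake-probability bound 15d/m′ + 6Δm′ + (m′)²/|R| from Lemma 18 at the parameter settings chosen in the proof, with the drift rate set to its maximum tolerated value Δ = ε²/(458d).

```python
import math

def tracking_parameters(d, eps):
    """Parameter settings from the proof of the realizable-case theorem:
    window length m', randomization-range size |R|, the tolerated drift
    Delta = eps^2/(458 d), and the resulting mistake-probability bound
    15d/m' + 6*Delta*m' + (m')^2/|R| from Lemma 18."""
    m_prime = math.ceil(33 * d / eps)          # window of recent examples fed to A_F
    r_size = math.ceil(11560 * d**2 / eps**3)  # size of the randomization range R
    delta = eps**2 / (458 * d)                 # maximum tolerated per-step drift
    bound = 15 * d / m_prime + 6 * delta * m_prime + m_prime**2 / r_size
    return m_prime, r_size, delta, bound

m_prime, r_size, delta, bound = tracking_parameters(d=10, eps=0.1)
print(f"m' = {m_prime}, Delta = {delta:.2e}, mistake bound = {bound:.4f}")
```

Since 33d/ε ≤ m′ ≤ 34d/ε, the three terms are at most (15/33)ε, (204/458)ε, and (1156/11560)ε respectively, which sum to just under ε, in agreement with the last step of the proof.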
References

Anthony, M., Biggs, N., & Shawe-Taylor, J. (1990). The learnability of formal concepts. Proceedings of the 1990 Workshop on Computational Learning Theory (pp. 246–257).
Auer, P., & Warmuth, M.K. (1995). Tracking the best disjunction. Proceedings of the 36th Annual Symposium on the Foundations of Computer Science.
Bartlett, P.L. (1992). Learning with a slowly changing distribution. Proceedings of the 1992 Workshop on Computational Learning Theory (pp. 243–252).
Bartlett, P.L., Ben-David, S., & Kulkarni, S.R. (1996). Learning changing concepts by exploiting the structure of change. Proceedings of the 1996 Conference on Computational Learning Theory (pp. 131–139).
Bartlett, P.L., & Helmbold, D.P. (1995). Manuscript.
Barve, R.D., & Long, P.M. (1997). On the complexity of learning from drifting distributions. Information and Computation, 138(2), 101–123.
Blum, A., & Chalasani, P. (1992). Learning switching concepts. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 231–242).
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1989). Learnability and the Vapnik-Chervonenkis dimension. JACM, 36(4), 929–965.
Freund, Y., & Mansour, Y. (1997). Learning under persistent drift. Proceedings of the 1997 European Conference on Computational Learning Theory.
Freund, Y., & Ron, D. (1995). Learning to model sequences generated by switching distributions. Proceedings of the 1995 Conference on Computational Learning Theory (pp. 41–50).
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1), 78–150.
Haussler, D. (1995). Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2), 217–232.
Haussler, D., Littlestone, N., & Warmuth, M.K. (1994). Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115(2), 129–161.
Helmbold, D.P., & Long, P.M. (1994). Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1), 27–46.
Herbster, M., & Warmuth, M.K. (1995). Tracking the best expert. Proceedings of the Twelfth International Conference on Machine Learning.
Herbster, M., & Warmuth, M.K. (1998). Tracking the best regressor. Proceedings of the 1998 Conference on Computational Learning Theory.
Kearns, M.J., Schapire, R.E., & Sellie, L.M. (1994). Toward efficient agnostic learning. Machine Learning, 17, 115–141.
Kolmogorov, A.N., & Tihomirov, V.M. (1961). ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations (Ser. 2), 17, 277–364.
Kuh, A. (1997). Comparison of tracking algorithms for single layer threshold networks in the presence of random drift. IEEE Trans. on Signal Processing, 45(3), 640–650.
Kuh, A., Petsche, T., & Rivest, R. (1990). Learning time varying concepts. In NIPS 3. Morgan Kaufmann.
Kuh, A., Petsche, T., & Rivest, R. (1991). Mistake bounds of incremental learners when concepts drift with applications to feedforward networks. In NIPS 4. Morgan Kaufmann.
Littlestone, N., & Warmuth, M.K. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer Verlag.
Pollard, D. (1990). Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Math. Stat. and Am. Stat. Assoc.
Roy, S. (1991). Semantic complexity of relational queries and data independent data partitioning. Proceedings of the ACM SIGACT-SIGART-SIGMOD Annual Symposium on Principles of Database Systems.
Royden, H.L. (1963). Real Analysis. Macmillan.
Sauer, N. (1972). On the density of families of sets. J. Combinatorial Theory (A), 13, 145–147.
Shelah, S. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific J. Math., 41, 247–261.
Simon, H.U. (1996). General lower bounds on the number of examples needed for learning probabilistic concepts. Journal of Computer and System Sciences, 52(2), 239–254.
Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22, 28–76.
Tian, X., & Kuh, A. (1997). Performance bounds for single layer threshold networks when tracking a drifting adversary. Neural Networks, 10(5), 897–906.
Vapnik, V.N., & Chervonenkis, A.Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264–280.