Extension of the PAC Framework to Finite and Countable Markov Chains*

David Gamarnik
IBM T.J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598
[email protected]

July 16, 2002

* A preliminary version of this paper appeared in the Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT), 1999.
Abstract

We consider a model of learning in which the successive observations follow a certain Markov chain. The observations are labeled according to membership in some unknown target set. For a Markov chain with finitely many states we show that, if the target set belongs to a family of sets with a finite VC dimension, then probably approximately correct learning of this set is possible with polynomially large samples. Specifically, for observations following a random walk with state space X and uniform stationary distribution, the sample size required is no more than Ω((t_0/(1 − λ_2)) log(t_0 |X|/δ)), where δ is the confidence level, λ_2 is the second largest eigenvalue of the transition matrix, and t_0 is the sample size sufficient for learning from i.i.d. observations. We then obtain similar results for Markov chains with countably many states, using a Lyapunov function technique and recent results on mixing properties of infinite state Markov chains.
1 INTRODUCTION
The subject of this paper is a model of learning in which the input observations follow a Markov chain with a finite or countable state space. A certain target subset C of the state space is fixed and the learner is presented with a sequence of labeled observations. The label is equal to one if the observation belongs to C and is equal to zero if it does not. Given a finite sequence of labeled observations, the goal
of the learner is to select a subset of the state space which is sufficiently close to the target set C with high probability. The hypothesis set is selected from some given collection of sets C which contains C and has a finite VC dimension. Thus our learning model is the classical PAC (Probably Approximately Correct) model, with the only exception that the successive observations are not i.i.d. but rather follow a Markov chain dynamics. The goal of this paper is to obtain bounds on sample sizes which guarantee PAC learning for our Markov chain model.

The literature on PAC learning with dependent successive observations is not extensive compared to its i.i.d. counterpart. Some results on this subject include Aldous and Vazirani [2], Buescher and Kumar [7], [8], and Nobel [18]. It was shown by Nobel [18] that stationarity of the stochastic process by itself is not sufficient for learning, even if the family of concept sets has a finite VC dimension. Learning in a nonlinear time series setting was considered by Modha and Masry [17] and by Meir [14]. In the latter paper, bounds on sample sizes were obtained which are expressed in terms of the mixing parameters of the underlying stochastic process and in terms of the accuracy level ε and confidence δ. The dependence on ε is of the form Ω(1/ε^4), which, as we show in this paper, can be significantly improved for Markov chains with finitely many states. Our infinite state Markov chain model does not fall into Meir's framework, since in our case the mixing time depends on the initial state of the Markov chain.

It is not surprising that PAC learning is possible for Markov chains with finitely many states. Such a Markov chain, if aperiodic and irreducible, always possesses the exponential mixing property: the transient distribution converges to the steady state distribution exponentially fast. For the case of reversible Markov chains, the mixing rate can be estimated via the second largest eigenvalue of the underlying transition matrix. Thus, by considering only observations which are sufficiently far apart in time, the learner obtains a model which is close to i.i.d. In this paper we estimate the sample sizes which make this approximation possible and, in particular, show that the required sample sizes are polynomial. This approach was used by Bartlett, Fischer and Hoffgen [5] specifically for uniform random walks on a binary cube.

PAC learning for a Markov chain with a countably infinite state space is more complicated. Such a Markov chain does not necessarily have a steady state distribution. And even if a steady state distribution exists and the Markov chain is mixing, there is no general algorithmic procedure for estimating mixing rates. In this paper we use a Lyapunov function technique, which has proved to be very useful for establishing the existence of a steady state distribution and for estimating mixing rates. For a comprehensive survey of Lyapunov function techniques for Markov chains see [15]. We obtain bounds on sample sizes sufficient for PAC learning in infinite state Markov chains whenever a Lyapunov function can be constructed. The polynomial type bounds are expressed in terms of the parameters of this Lyapunov function and the usual parameters ε (accuracy) and δ (confidence level). We illustrate our results on a simple random walk with state space Z_+, where Z_+ is the set of nonnegative integers. In an implicit way, our results use large deviation bounds for the Markov chain, expressed in terms of the Lyapunov function. Large deviation bounds in terms of the parameters of Lyapunov functions can be obtained explicitly, as was shown in Balaji and Meyn [3] for irreducible Markov chains.
2 MODEL DESCRIPTION AND ASSUMPTIONS

2.1 STOCHASTIC INPUT PROCESS
The input process for our learning model is a Markov chain X_t, t = 0, 1, 2, . . ., which has a finite or countably infinite state space X. For any two states x, y ∈ X we denote by p(y|x) the conditional probability of going from the state x to the state y. Let π denote the stationary distribution, if one exists. Then for any state y

    π(y) = Σ_{x∈X} π(x) p(y|x).

Let π_0 denote the probability distribution of the initial state X_0: π_0(x) = Pr{X_0 = x}, x ∈ X. We index the probability of future events by this initial distribution using the notation Pr_{π_0}{·}. For example,

    Pr_{π_0}{X_t = y} = Σ_{x∈X} Pr{X_t = y | X_0 = x} π_0(x).
From the theory of Markov chains, if the state space is finite and the Markov chain is aperiodic and irreducible, then a unique stationary distribution π exists and, for some computable values β, ψ > 0,

    |Pr{X_t = y | X_0 = x} − π(y)| ≤ β e^{−ψt},   ∀x, y ∈ X.   (1)

For reversible Markov chains with stationary distribution π this bound can be sharpened as follows:

    |Pr{X_t = y | X_0 = x} − π(y)| ≤ (1/√π_min) e^{−(1−λ_2)t},   ∀x, y ∈ X,   (2)

where π_min = min_{x∈X} π(x) and λ_2 < 1 is the second largest eigenvalue of the transition matrix (see Aldous and Fill [1]).

Throughout the paper, g = O(f), g = Ω(f), g = Θ(f) mean that for functions f(n), g(n), n = 0, 1, 2, . . ., and for some positive constants c_1, c_2, c_3, c_4, the following holds respectively: g(n) ≤ c_1 f(n), g(n) ≥ c_2 f(n), and c_3 f(n) ≤ g(n) ≤ c_4 f(n) for all n = 0, 1, 2, . . . .
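As a concrete check of the reversible bound (2), the following short sketch, which is only an illustration and not part of the original development, builds a lazy symmetric random walk on a cycle of seven states (a chain chosen purely for the example), computes λ_2 numerically, and compares both sides of (2):

    import numpy as np

    n = 7                           # number of states on the cycle
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5               # laziness gives aperiodicity and a nonnegative spectrum
        P[i, (i + 1) % n] += 0.25
        P[i, (i - 1) % n] += 0.25

    pi = np.full(n, 1.0 / n)        # uniform stationary distribution (P is doubly stochastic)
    eigvals = np.sort(np.linalg.eigvalsh(P))   # P is symmetric, hence reversible w.r.t. pi
    lam2 = eigvals[-2]                          # second largest eigenvalue

    for t in [5, 20, 80]:
        Pt = np.linalg.matrix_power(P, t)
        lhs = np.max(np.abs(Pt - pi))           # max_{x,y} |Pr{X_t=y | X_0=x} - pi(y)|
        rhs = np.exp(-(1.0 - lam2) * t) / np.sqrt(pi.min())
        print(t, lhs, rhs, lhs <= rhs)

For this chain the smallest eigenvalue is nonnegative, so the decay is indeed governed by λ_2 and the printed comparison holds at every t.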
2.2 LEARNING MODEL
We now introduce the learning model of interest. The model is precisely the classical PAC model, with the only exception that the process X_t is not i.i.d. but is a Markov chain. Assume that we have a concept space C, a collection of subsets of X. We assume that C has a finite VC dimension d. For the definition of the VC dimension see [4], [11], [21]. A concept C ∈ C, unknown to the learner, is fixed, and the learner is presented with a finite labeled sequence (X_t, l_t), t = 0, 1, . . . , T, where X_t is a Markov chain taking values in X, and l_t = 1 if X_t ∈ C, l_t = 0 otherwise. Equivalently, l_t = 1{X_t ∈ C}, where 1{A} is the indicator function of the event A. The goal of the learner is to construct a concept C′ ∈ C which is close to the target concept C, based on the fixed sample (X_t, l_t), t = 0, 1, . . . , T. The difference between C and C′ is measured with respect to the stationary distribution π. Thus, the goal of the learner is to find C′ ∈ C which guarantees small π(C∆C′), where ∆ denotes the symmetric difference. Note that, because of the dependence, this measure does not coincide with the probability of the learner misclassifying (mislabeling) the next input X_{t+1}. But, by the law of large numbers, it does measure the long term rate of misclassification:

    lim_{τ→∞} (1/(τ − T)) Σ_{j=T+1}^{τ} 1{X_j ∈ C′∆C} = π(C′∆C).
Moreover, by a large deviations result for Markov chains (to be discussed below), the probability of any constant size deviation of the above average from its limit is exponentially small. In this paper we ignore the algorithmic aspect of actually choosing the hypothesis concept C′. Rather, in the spirit of the PAC framework, we concentrate on the information-theoretic limits of learning for our Markovian learning model. A concept C′ ∈ C is defined to be consistent with the observations (X_t, l_t), 0 ≤ t ≤ T, if l_t = 1 if and only if X_t ∈ C′, for all 0 ≤ t ≤ T. We denote by C(X_0, X_1, . . . , X_t) the set of concepts consistent with (X_t, l_t) (we omit the labels l_t for brevity). We are interested in the minimal number of observations that guarantees, with high probability, a certain level of closeness of any consistent concept C′ to the target concept C. Specifically, given ε, δ > 0, we would like to estimate the sample size T that guarantees that for any concept C ∈ C,

    Pr_{π_0}{ sup_{C′∈C(X_0, ..., X_T)} π(C′∆C) < ε } > 1 − δ.

The probability is taken with respect to the random observations X_t, t = 0, 1, . . . , T. In the following sections we obtain explicit bounds on the sample size that guarantee a given level of closeness ε > 0 with a given level of confidence δ > 0. The finite state and countable state Markov chain cases are considered in Sections 3 and 4, respectively.
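To make the learning protocol concrete, here is a minimal simulation sketch, not taken from the paper, which assumes the lazy random walk on a cycle and the class of circular arcs used later in Section 3.2: it generates a labeled trajectory, enumerates all arcs consistent with the sample, and reports the stationary error π(C′∆C) of one consistent hypothesis.

    import random

    def step(i, n):
        """One step of the lazy symmetric random walk on the cycle {0, ..., n-1}."""
        r = random.random()
        if r < 0.25:
            return (i + 1) % n
        if r < 0.5:
            return (i - 1) % n
        return i

    def arc(a, b, n):
        """The arc {a, a+1, ..., b} on the cycle, with wraparound when a > b."""
        if a <= b:
            return set(range(a, b + 1))
        return set(range(a, n)) | set(range(0, b + 1))

    n, T = 30, 2000
    target = arc(3, 11, n)                          # unknown target concept C
    random.seed(0)
    xs = [0]
    for _ in range(T):
        xs.append(step(xs[-1], n))
    sample = [(x, int(x in target)) for x in xs]    # labeled observations (X_t, l_t)

    consistent = []                                  # all arcs consistent with the sample
    for a in range(n):
        for b in range(n):
            h = arc(a, b, n)
            if all((x in h) == bool(l) for x, l in sample):
                consistent.append(h)

    C_prime = consistent[0]                          # any consistent hypothesis
    error = len(C_prime ^ target) / n                # pi(C' Δ C): stationary law is uniform
    print(len(consistent), error)

The target arc itself is always in the consistent set, so the enumeration never comes back empty; the question studied below is how long the trajectory must be before every surviving hypothesis has small stationary error.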
3 LEARNING IN FINITE STATE MARKOV CHAINS

3.1 UPPER BOUNDS ON THE REQUIRED SAMPLE SIZES
We now state and prove the main result of this section.

Theorem 1 Suppose (X_t, l_t), t = 0, 1, . . . , T, is a sequence of labeled observations such that X_t forms a Markov chain with a finite state space X. The labeling is done with respect to some target set C ⊂ X. Let β and ψ be the parameters for which relation (1) holds. If a collection of subsets C of the set X contains C and has a finite VC dimension d, then for any ε, δ > 0

    Pr_{π_0}{ sup_{C′∈C(X_0, X_1, ..., X_T)} π(C′∆C) > ε } < δ   (3)

whenever

    T = max{ t_0, (t_0/ψ)(log(1/δ) + log t_0 + log|X| + log β) },   (4)

where

    t_0 = Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)).   (5)
In particular, the sample size is polynomial in 1/ψ, log β, 1/ε, log(1/δ), d and log|X|.

Proof: See the Appendix.

Remarks. 1. We have expressed the bound on the sample size in terms of the bound

    t_0 = Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)),

which is a sufficient sample size for PAC learning from i.i.d. observations, see [11]. This format makes explicit how much we need to increase the sample size to achieve the same level of accuracy ε and confidence δ when X_t is a Markov chain.

2. Observe that the sample size required for PAC learning depends linearly on log|X|. Thus fast learning is possible even if the state space is exponentially large (for example, a binary cube).

3. Theorem 1 is a generalization of Theorem 3.3 of [5] to arbitrary finite state Markov chains. The proof is based on establishing an analogue of Lemmas 3.1 and 3.2 of [5]. However, unlike [5], we do not assume that the Markov chain is in steady state.

We now refine Theorem 1 for reversible Markov chains. As we mentioned above, for reversible Markov chains the bound (1) holds with β = 1/√π_min and ψ = 1 − λ_2.
Corollary 1 Under the assumptions of Theorem 1, suppose in addition that the Markov chain X_t is reversible. Then the bound (3) holds whenever

    T = (t_0/(1 − λ_2))(log(1/δ) + log t_0 + log|X| + log(1/π_min)),

where

    t_0 = Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)).

If the Markov chain has a uniform stationary distribution, then the result holds for

    T = (t_0/(1 − λ_2))(log(1/δ) + log t_0 + log|X|),

where

    t_0 = Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)).
Proof: The result follows immediately from Theorem 1 and (2). Note that for the uniform stationary distribution π_min = π(x) = 1/|X| for all x ∈ X. □

As we mentioned before, because of the dependence, the probability π(C′∆C) does not measure the probability of the learner misclassifying the next input X_{t+1}. That is, Pr_{π_0}{X_{t+1} ∈ C′∆C} is not necessarily equal to π(C′∆C). But, as we show below, it does measure a long term rate of misclassification. We use a Chernoff type bound for Markov chains, proven by Gillman [9], to obtain the following result.

Proposition 1 Let the concept C′ ∈ C(X_0, X_1, . . . , X_T) be a consistent concept chosen by a learner based on a sequence of labeled observations (X_0, l_0), (X_1, l_1), . . . , (X_T, l_T). If the Markov chain is reversible, then for any α > 0 and any integer τ > T,

    Pr_{π_0}{ (1/(τ − T)) Σ_{j=T+1}^{τ} 1{X_j ∈ C′∆C} − π(C′∆C) > α } ≤ (1 + α(1 − λ_2)/10) (1/√π_min) e^{−(1−λ_2)α²(τ−T)/20}.

In words, the probability that the average rate of misclassification over the next τ − T observations exceeds its expectation by more than α decays exponentially fast in τ − T.

Proof: Follows immediately from Theorem 2.1 of [9]. □
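Before turning to examples, the following sketch shows how the bound of Corollary 1 can be evaluated numerically. It is illustrative only: the leading constant in t_0 is set to 1, and the chain parameters used in the example call are hypothetical.

    import math

    def t0(d, eps, delta, const=1.0):
        """i.i.d. sample size t_0 = const * ((d/eps) log(1/eps) + (1/eps) log(1/delta))."""
        return const * ((d / eps) * math.log(1 / eps) + (1 / eps) * math.log(1 / delta))

    def sample_size_reversible(d, eps, delta, lam2, n_states, pi_min):
        """Corollary 1: T = t_0/(1-lam2) * (log(1/delta) + log t_0 + log|X| + log(1/pi_min))."""
        t = t0(d, eps, delta)
        return t / (1.0 - lam2) * (math.log(1 / delta) + math.log(t)
                                   + math.log(n_states) + math.log(1 / pi_min))

    # Hypothetical reversible chain on 1000 states, uniform stationary law, lam2 = 0.99.
    print(sample_size_reversible(d=3, eps=0.05, delta=0.05, lam2=0.99,
                                 n_states=1000, pi_min=1.0 / 1000))

The dominant factors are 1/(1 − λ_2) and the logarithmic dependence on the state space size, in line with Remark 2 above.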
3.2 EXAMPLES
We now consider some examples of Markov chains and obtain specific learning bounds.

Example 1. Symmetric random walk on the circle S_n. Let X = {1, 2, . . . , n} ≡ S_n. The Markov chain considered on this state space is a symmetric random walk: p(i + 1|i) = p(i − 1|i) = 1/2 for i = 1, 2, . . . , n, and p(j|i) = 0 otherwise. We identify n + 1 with 1 and 0 with n. If n is odd then this random walk is aperiodic. It has a uniform stationary distribution. It is known (see [1]) that for this random walk ψ = 1 − λ_2 ≈ 2π²/n². Suppose the concept class C is the collection of intervals [i, j] = {i, i + 1, . . . , j}, where for i > j we put [i, j] = {i, i + 1, . . . , n, 1, 2, . . . , j}. It is not hard to see that this concept class has VC dimension d = 3. The bound from Corollary 1 then becomes

    T = Ω(t_0 n² (log(1/δ) + log t_0 + log n)),   (6)

where

    t_0 = Ω((1/ε) log(1/ε) + (1/ε) log(1/δ)).

We may rewrite this as

    T = Ω((n²/ε) log²(1/ε) + (n²/ε) log²(1/δ) + ((n² log n)/ε) log(1/ε) + ((n² log n)/ε) log(1/δ)).
We see that for fixed ε and δ the sample size sufficient for learning is Ω(n² log n). In the following subsection we will show that this is tight up to the log n factor.

Example 2. Multidimensional torus S_n^r. Our state space is now the multidimensional torus X = S_n^r, for some positive integer r. Let C be the collection of r-dimensional rectangles ∏_{l=1}^{r} [i_l, j_l]. This concept class has VC dimension d = Θ(r). Then the sample size sufficient to learn this concept class with ε accuracy and 1 − δ confidence is

    T = Ω(t_0 n² (log(1/δ) + log t_0 + r log n)),

where

    t_0 = Ω((r/ε) log(1/ε) + (1/ε) log(1/δ)).
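The spectral gap approximation 1 − λ_2 ≈ 2π²/n² quoted in Examples 1 and 2 is easy to check numerically; the sketch below (illustrative only, assuming an odd cycle length) compares the exact gap with the approximation.

    import math
    import numpy as np

    def circle_walk(n):
        """Transition matrix of the symmetric random walk on the circle S_n."""
        P = np.zeros((n, n))
        for i in range(n):
            P[i, (i + 1) % n] = 0.5
            P[i, (i - 1) % n] = 0.5
        return P

    n = 101                                           # odd, so the walk is aperiodic
    lam2 = np.sort(np.linalg.eigvalsh(circle_walk(n)))[-2]
    print(1 - lam2, 2 * math.pi ** 2 / n ** 2)        # exact gap vs. 2*pi^2/n^2

For the circle the eigenvalues are cos(2πk/n), so the second largest is cos(2π/n) and the two printed numbers agree to leading order.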
Example 3. Uniform random walk on a binary cube. Let X = {0, 1}^n. For x, y ∈ {0, 1}^n we set p(y|x) = 1/(n + 1) if x and y differ in at most one coordinate, and p(y|x) = 0 otherwise. This is the example considered in [5]. For this random walk 1 − λ_2 = Θ(1/n). The stationary distribution is uniform and in particular π_min = 1/2^n. Using these parameters and invoking Corollary 1 we obtain the bound Ω(n² t_0 + n t_0 log(t_0/δ)) on the sample size needed for learning concepts with finite VC dimension.
3.3 LOWER BOUNDS ON SAMPLE SIZES
Theorem 1 raises the question of the extent to which the bound (4) on the sample size required for learning is sharp. Note that, unlike the i.i.d. case, the bounds are not and cannot be distribution free. Indeed, by changing the transition rates of the Markov chain, we can make the mixing rate very small. For example, in the symmetric random walk on S_n, by making p(i + 1|i) = p(i − 1|i) = ε, p(i|i) = 1 − 2ε, for some very small ε > 0, we obtain a Markov chain with a very small mixing rate. It is clear then that the sample size required for learning the concept class of intervals becomes progressively larger as we decrease this ε. We now show that, for our Example 1 of a symmetric random walk on S_n, Ω(n²) is a lower bound on the sample size required for learning the concept class of intervals, when the parameters ε, δ are fixed. In particular, the upper bound obtained above is sharp up to the factor log n.

Proposition 2 Consider a symmetric random walk X_t on S_n starting from the state X_0 = n/2. Let 1/4 > ε > 0 and δ > 0 be fixed. Let α(n) be any function such that α(n) = o(n²), and let c be an arbitrary constant that may depend on ε and δ. Then, for T = cα(n),

    Pr{ sup_{0≤t≤T} |X_t − n/2| ≤ n/4 } ≥ 1 − e^{−Θ(n²/α(n))}.

In particular, this probability is larger than δ for large enough n, and the concept class of intervals cannot be learned with ε accuracy and δ confidence in time T = O(α(n)). We see that Θ(n²) is a lower bound on the sample size necessary for learning intervals in the case of a random walk on S_n. Contrast this with the upper bound of Θ(n² log n) obtained in the previous subsection.

Proof: See the Appendix.
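The behavior asserted by Proposition 2 is easy to observe in simulation. The sketch below is an illustration rather than part of the proof: it uses the coupling with a walk on Z described in the Appendix and estimates the probability that the walk started at n/2 stays within distance n/4 of its starting point for T steps. This probability should be close to one while T = o(n²) and drop once T is of order n².

    import random

    def stays_near_center(n, T, trials=100, seed=1):
        """Fraction of walks on Z started at 0 that never leave (-n/4, n/4) within T steps."""
        rng = random.Random(seed)
        stay = 0
        for _ in range(trials):
            z, ok = 0, True
            for _ in range(T):
                z += rng.choice((-1, 1))
                if abs(z) > n // 4:
                    ok = False
                    break
            if ok:
                stay += 1
        return stay / trials

    n = 200
    for T in [n, n * n // 50, n * n]:
        print(T, stays_near_center(n, T))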
4 LEARNING IN INFINITE STATE MARKOV CHAINS
In this section we extend Theorem 1 to certain Markov chains with countably many states. The analysis for infinite state Markov chains is more complicated, since an infinite state Markov chain does not necessarily have a steady state distribution. Even if the stationary distribution exists, bounds of the type (1) do not necessarily hold. Nevertheless, in recent years there has been progress in establishing the existence of steady state distributions and proving exponentially fast mixing for infinite (even uncountable) state Markov chains using Lyapunov function techniques. For a comprehensive survey on this subject see [15]. The goal of this section is to link these techniques with the PAC analysis of infinite Markov chains. We will consider Markov chains for which the mixing rates and the other relevant parameters can be computed explicitly, and obtain explicit bounds on sample sizes sufficient for learning. In what follows we introduce some preliminary assumptions on our Markov chain X_t. We then state and prove the main result of this section.

We consider a Markov chain X_t, t = 0, 1, 2, . . ., which takes values in some countable state space X. As usual, denote by p(y|x) the probability of transition from a state x to a state y.

Definition 1 Given a Markov chain with state space X, a function V : X → [1, +∞) is defined to be a Lyapunov function if there exist a state x* ∈ X and a parameter 0 < λ < 1 (called the drift) such that p(x*|x*) > 0 and for any state x ≠ x*

    E[V(X_{t+1}) | X_t = x] ≤ λ V(x).   (7)
Our definition of a Lyapunov function is somewhat restrictive compared to the one in [15], but it simplifies the analysis significantly. The existence of a Lyapunov function guarantees the existence of a stationary distribution (see [15]). Moreover, a certain mixing property also holds. The following result was proven by Meyn and Tweedie in [16] (Theorems 2.1, 2.2).

Theorem 2 Let X_t be a Markov chain with a countable state space X and stationary distribution π. If there exists a Lyapunov function V, then there exists a computable value 0 < θ < 1 such that for any ρ > θ and any x, y ∈ X

    |Pr{X_t = y | X_0 = x} − π(y)| ≤ V(x) ρ^{t+1}/(ρ − θ).   (8)

In particular, the transient distribution converges to the steady state distribution exponentially fast and the Markov chain is mixing. We will not describe here how to compute θ; the details can be found in [16]. Note that, unlike the finite state case, the bound on the mixing time depends on the initial state X_0 = x via V(x). Note also that, optimizing over the choice of ρ, we obtain from (8)

    |Pr{X_t = y | X_0 = x} − π(y)| ≤ V(x) e t θ^t.   (9)
We assume that our Markov chain satisfies certain additional assumptions, which we describe below. For any state x ∈ X let D_m(x) be the number of states that the Markov chain can reach from state x within m steps:

    D_m(x) = |{y : p^{(j)}(y|x) > 0 for some j = 0, 1, 2, . . . , m}|.

Assumption A. There exists a constant α > 0 such that sup_{x∈X} D_m(x) ≤ m^α for all m = 1, 2, . . . .

Assumption B. There exists a Lyapunov function V and there exists γ ≥ 1 such that for any x, y ∈ X with p(y|x) > 0,

    1/γ ≤ V(y)/V(x) ≤ γ.
Assumption A is not very restrictive. It is satisfied, for example, by random walks on an integer lattice Z^d; note that for any z ∈ Z^d, D_m(z) ≤ m^d. Assumption B does apply to some random walks on the nonnegative integer lattice Z_+^d. We will discuss a simple one-dimensional example later in this section. Assumption B allows us to bound the tails of the stationary distribution of our Markov chain. The following theorem is proven in [6].

Theorem 3 Let X_t be a Markov chain with a countable state space X and a stationary distribution π. Suppose V is a Lyapunov function with drift λ which satisfies Assumption B. Then for any l = 0, 1, 2, . . .

    Pr_π{ V(X_t) ≥ V(x*) γ^{2l} } ≤ ( log γ / (log γ + log(1/λ)) )^l,

where Pr_π{·} denotes the probability with respect to the stationary distribution π.
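As an illustration of Definition 1, the following sketch numerically verifies the drift condition (7) for the reflected random walk on Z_+ treated at the end of this section, assuming p = 1/3 and the Lyapunov function V(x) = ((1 − p)/p)^{x/2}; it is a check of the stated parameters only, not part of the formal development.

    import math

    p = 1.0 / 3.0                        # up-step probability of the reflected walk on Z_+
    r = math.sqrt((1 - p) / p)           # V(x) = r**x

    def V(x):
        return r ** x

    lam = 2 * math.sqrt(p * (1 - p))     # claimed drift parameter

    for x in range(1, 20):
        drift = p * V(x + 1) + (1 - p) * V(x - 1)   # E[V(X_{t+1}) | X_t = x] for x >= 1
        assert drift <= lam * V(x) + 1e-12
    print("drift condition (7) holds with lambda =", lam)

A short calculation shows the drift inequality is in fact an equality for this V, which is why the assertion passes with only a rounding tolerance.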
We now state and prove the main result of this section.

Theorem 4 Suppose (X_t, l_t), t = 0, 1, . . . , T, is a sequence of labeled observations such that X_t forms a Markov chain with a countable state space X, with a deterministic initial state X_0 = x_0. Suppose also that the labeling is done with respect to some target set C ⊂ X. If a collection of subsets C of the set X contains C and has a finite VC dimension d, and the Markov chain satisfies Assumptions A and B, then for any ε, δ > 0

    Pr_{x_0}{ sup_{C′∈C(X_0, X_1, ..., X_T)} π(C′∆C) > ε } < δ   (10)

whenever

    T = Ω( t_0 + (t_0/log(1/θ)) [ log t_0 + log V(x*) + log V(x_0) + (log γ (log(1/δ) + log t_0)) / log(1 + log(1/λ)/log γ) ] + (α + 1)²/log(1/θ) ),   (11)

where

    t_0 = Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)).
and the initial distribution is π0 (x0 ) = 1, π0 (x) = 0, for x 6= x0 . In infinite state Markov chains it is possible that πmin = inf x π(x) = 0. Thus the bounds of the form (2) might not be possible. It is natural then that the bounds on sample sizes sufficient for learning depend on the initial state. If the Markov chain has a some specific structure, for example it satisfies a certain Doeblin condition (see [15]), then probably the dependence on the initial state can be dropped, as in this case the mixing time does not depend on the initial state. Proof : See the Appendix. We now consider a simple example of a learning model on an infinite state Markov chain. Let X = Z+ - be the set of nonnegative integers. The transition probabilities are given as p(i + 1|i) = p, p(i − 1|i) = 1 − p for i ≥ 1 and p(1|0) = p, p(0|0) = 1 − p, where p < 1/2. It is well known that p i 1−2p 1−p ( 1−p ) , i = 0, 1, 2, . . . . This example q x ∗ ( 1−p p ) is a Lyapunov function with x =
this Markov chain has a steady state distribution π(i) =
is
considered in [16]. It is easy to check that V (x) =
0,
λ = 2 p(1 − p) and V (x∗ ) = 1. Note also that p
V (x + 1) = V (x)
for any positive integer x so we may set γ =
q
1−p p .
s
1−p p
Now suppose p = 1/3. It is shown in [16] that
θ = .996. Let C be a collection of subsets of Z_+ with a finite VC dimension d (for example, C is a set of unions of d intervals). Then the sample size sufficient for learning this concept class (we omit the constants) is given by

    T = Ω( t_0 (x_0 + log t_0 + log(1/δ)) ),
where, as usual,

    t_0 = Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)).
EXTENSIONS
There are two natural extensions of the results in this paper. First, our Markov chain process does not have to be of first order, and the labeling process might be a function of several prior observations of the process, possibly even noisy. Specifically, there exist integers p and q such that the following holds. For any sequence of states y, x_1, x_2, . . . , x_i, . . ., we have Pr{X_t = y | X_{t−1} = x_1, . . . , X_{t−i} = x_i, . . .} = Pr{X_t = y | X_{t−1} = x_1, . . . , X_{t−p} = x_p}. In other words, the Markov chain is of order p. The labeling process is determined as follows. A collection C of subsets of X^q
is given. For any t, l_t = 1 if (X_t, X_{t−1}, . . . , X_{t−q+1}) ∈ C, and l_t = 0 otherwise. Our results extend to this model as well, since we may embed our order p Markov chain into a first order Markov chain X̄_t on the state space X^r, r = max{p, q}, by setting X̄_t = (X_t, X_{t−1}, . . . , X_{t−r+1}), and we may trivially embed the concept class C of subsets of X^q into a concept class of subsets of X^r. This reduces the model to the one considered earlier in the paper. One interesting question remaining is whether the learner can determine the correct or approximately correct order of the Markov chain X_t. This is an algorithmic question and we leave it outside the scope of the paper.

A second possible extension is the agnostic learning framework, in which the correct concept C which generates the labels l_t does not belong to the concept class D available to the learner. In this case the best we can hope for is to obtain a concept D ∈ D which minimizes the stationary discrepancy measure π(C∆D), where C is the correct concept. We let Er = Er(D, C) ≡ inf_{D′∈D} π(D′∆C). Note that Er might not be achieved by any single concept D, or might be achieved by several. In this agnostic learning framework we denote by D(X_0, X_1, . . . , X_T) the set of concepts D ∈ D which minimize the number of misclassifications |{t : X_t ∈ D, l_t = 0} ∪ {t : X_t ∉ D, l_t = 1}|. Since C does not necessarily belong to D, we cannot guarantee the existence of sets with zero misclassifications. The following result is proven in a way analogous to the proofs of Theorems 1 and 4; the details are omitted.

Theorem 5 Suppose (X_t, l_t), t = 0, 1, . . . , T, is a sequence of labeled observations such that X_t forms a Markov chain with a finite or countable state space X, with a deterministic initial state X_0 = x_0, such that the Markov chain mixes exponentially fast. Suppose also that the labeling is done with respect to some target set C ⊂ X, and a concept class D is available to the learner which possibly does not contain C. If the collection D has a finite VC dimension d (and the Markov chain satisfies Assumptions A and B in the infinite state case), then for any ε > 0

    Pr_{x_0}{ sup_{D′∈D(X_0, X_1, ..., X_T)} |π(D′∆C) − Er(D, C)| > ε } → 0   (12)
as T → ∞. We do not specify the bounds on the sample size needed for learning; such bounds can be obtained in a way similar to Theorems 1 and 4.
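Returning to the first extension above, the reduction to a first order chain is just a sliding window construction. The following small sketch (illustrative only, applied to a hypothetical binary sequence) shows the embedding X̄_t = (X_t, X_{t−1}, . . . , X_{t−r+1}); the labels l_t would then be computed from these windows exactly as before.

    def embed(xs, r):
        """Embed a sequence from an order-p chain into the first order chain
        X_bar_t = (X_t, X_{t-1}, ..., X_{t-r+1}) on the state space X^r."""
        return [tuple(xs[t - r + 1 : t + 1][::-1]) for t in range(r - 1, len(xs))]

    # Toy usage: a binary sequence viewed with window r = 3.
    xs = [0, 1, 1, 0, 1, 0, 0, 1]
    print(embed(xs, 3))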
6 CONCLUSIONS
We have extended the classical PAC learning model to a model with observations following a Markov chain dynamics. Explicit bounds are obtained on the sample sizes that guarantee approximation of the target concept with a certain confidence level. Our learning results for infinite state Markov chains are based on the existence of a Lyapunov function, which simultaneously witnesses the existence of a stationary distribution and exponentially fast mixing. It would be interesting to complement these results with lower bounds on the sample sizes required for learning. Such lower bounds exist for learning in the i.i.d. setting and almost match the upper bounds, see [20], [13], [12]. Obtaining lower bounds for finite state Markov chains would probably depend on estimating the mixing rates from below, and this by itself is a fairly complicated area of research. Lower bounds on mixing for countable Markov chains are a virtually unexplored area.

Acknowledgments. The author wishes to thank the anonymous referees for constructive criticism and suggestions for extending the scope of the paper.
References

[1] D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs. In preparation.

[2] D. Aldous and U. Vazirani. A Markovian extension of Valiant's learning model. Proc. 31st Symposium on the Foundations of Computer Science, pages 392–396, 1990.

[3] S. Balaji and S. P. Meyn. Multiplicative ergodicity and large deviations for an irreducible Markov chain. Stochastic Processes and their Applications, 90:123–144, 2000.

[4] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.

[5] P. Bartlett, P. Fischer, and K. Hoffgen. Exploiting random walks for learning. Proc. 7th ACM Conf. on Computational Learning Theory, 1994.

[6] D. Bertsimas, D. Gamarnik, and J. Tsitsiklis. Performance of multiclass Markovian queueing networks via piecewise linear Lyapunov functions. Ann. Appl. Probab., 11(4):1384–1428, 2001.

[7] K.L. Buescher and P.R. Kumar. Learning by canonical smooth estimation, part I: Simultaneous estimation. IEEE Transactions on Automatic Control, 41:545–556, 1996.

[8] K.L. Buescher and P.R. Kumar. Learning by canonical smooth estimation, part II: Learning and model complexity. IEEE Transactions on Automatic Control, 41:557–569, 1996.

[9] D. Gillman. A Chernoff bound for random walks on expander graphs. SIAM J. Comput., 27(4):1203–1219, 1998.

[10] J.M. Harrison. Brownian Motion and Stochastic Flow Systems. Krieger Publishing Company, 1990.

[11] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[12] Y. Li, P. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. Proc. 11th ACM-SIAM Symposium on Discrete Algorithms, pages 309–318, 2000.

[13] P. Long. The complexity of learning according to two models of a drifting environment. Proc. 11th ACM Conf. on Computational Learning Theory, pages 116–125, 1998.

[14] R. Meir. Performance bounds for nonlinear time series prediction. Proc. 10th ACM Conf. on Computational Learning Theory, pages 122–130, 1997.

[15] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, 1993.

[16] S. P. Meyn and R. L. Tweedie. Computable bounds for geometric convergence rates of Markov chains. Ann. of Appl. Prob., 4:981–1011, 1994.

[17] D. S. Modha and E. Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44(1):117–133, 1998.

[18] A. Nobel. A counterexample concerning uniform ergodic theorems for a class of functions. Statistics and Probability Letters, 24:165–168, 1995.

[19] A. Shwartz and A. Weiss. Large Deviations for Performance Analysis. Chapman and Hall, 1995.

[20] M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76, 1994.

[21] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
7 APPENDIX
Proof of Theorem 1: Consider first T = t_0 m + 1 observations (X_t, l_t), t = 0, 1, 2, . . . , T, of our labeled stochastic process, where t_0 is given by (5). The value of m will be specified later. Select from it every m-th observation together with the first observation: (X_0, l_0), (X_m, l_m), . . . , (X_{t_0 m}, l_{t_0 m}). Our goal is to take m sufficiently large to make the process Y_i = X_{im}, i = 0, 1, . . . , t_0, close to i.i.d. Naturally we are
bounded by the total number of observations t. Note that C(X0 , X1 , . . . , Xt ) ⊂ C(Y0 , Y1 , . . . , Yt0 ). It follows Prπ0 {
sup
C ′ ∈C(X
0 ,X1 ,...,Xt )
π(C ′ ∆C) > ǫ} ≤ Prπ0 {
π(C ′ ∆C) > ǫ}
sup
(13)
C ′ ∈C(Y0 ,Y1 ,...,Yt0 )
We denote the event π(C ′ ∆C) > ǫ
sup C ′ ∈C(Y1 ,...,Yt0 )
by AC,ǫ (Y0 , Y1 , . . . , Yt0 ) or shortly by AC,ǫ , and obtain an upper bound on its probability. From a Markov chain dynamics X
Pr{AC,ǫ } =
(y0 ,...,yt0
)∈X t0 +1
1{AC,ǫ }π0 (y0 )
t0 Y
i=1
p(m) (yi |yi−1 )
(14)
where p(m) (y|x) denotes the m-th step probability of going from state x to state y. We have from (1) that p(m) (yt0 |yt0 −1 ) ≤ π(yt0 ) + βe−ψm .
(15)
Substituting, we obtain Pr{AC,ǫ } ≤
X
(y0 ,...,yt0 )∈X t0 +1
X
(y0 ,...,yt0 )∈X t0 +1
1{AC,ǫ }π0 (y0 )
1{AC,ǫ }π0 (y0 )
tY 0 −1 i=1
tY 0 −1 i=1
p(m) (yi |yi−1 )π(yt0 )+
p(m) (yi |yi−1 )βe−ψm .
Since 1{A} ≤ 1 then X
(y0 ,...,yt0 )∈X t0 +1
1{AC,ǫ }π0 (y0 )
tY 0 −1 i=1
p(m) (yi |yi−1 )βe−ψm ≤
But X
π0 (y0 )
(y0 ,...,yt0 −1 )∈X t0
tY 0 −1 i=1
X
(y0 ,...,yt0 −1
π0 (y0 ) )∈X t0
tY 0 −1 i=1
p(m) (yi |yi−1 )|X |βe−ψm . (16)
p(m) (yi |yi−1 ) = 1.
We obtain that the left hand side of (16) is not bigger than |X |βe−ψm . Similarly, we have X
(y0 ,...,yt0 )∈X t0 +1
X
(y0 ,...,yt0 )∈X t0 +1
1{AC,ǫ }π0 (y0 )
1{AC,ǫ }π0 (y0 )
tY 0 −1
tY 0 −2 i=1
15
i=1
p(m) (yi |yi−1 )π(yt0 ) ≤
p(m) (yi |yi−1 )π(yt0 −1 )π(yt0 )+
(17)
X
+
(y0 ,...,yt0 )∈X t0 +1
1{AC,ǫ }π0 (y0 )
tY 0 −2 i=1
p(m) (yi |yi−1 )βe−ψm π(yt0 )
The second summand in the inequality above is not bigger than X
π0 (y0 )
(y0 ,...,yt0 )∈X t0 +1
Since
P
yt0 ∈X
tY 0 −2 i=1
p(m) (yi |yi−1 )βe−ψm π(yt0 )
π(yt0 ) = 1 then for each fixed sequence y0 , . . . , yt0 −2 ∈ X t0 −1 we have X
yt0 −1 ,yt0 ∈X
βe−ψm π(yt0 ) = |X |βe−ψm .
But X
π0 (y0 )
tY 0 −2 i=1
(y0 ,...,yt0 −2 )∈X t0 −1
p(m) (yi |yi−1 ) = 1.
As a result, the second summand in (17) is not bigger than |X |βe−ψm . We continue this chaining process. As intermediate steps we obtain terms X
(y0 ,...,yt0 )∈X t0 +1
1{AC,ǫ }π0 (y0 )
j Y
p
(m)
i=1
−ψm
(yi |yi−1 )βe
t0 Y
π(yj ).
i=j+2
Each of these terms is bounded by |X |βe−ψm . In the end we obtain Pr{AC,ǫ } ≤
X
(y0 ,...,yt0
)∈X t0 +1
1{AC,ǫ }π0 (y0 )
t0 Y
i=1
π(yi ) + t0 |X |βe−ψm .
(18)
Note, X
(y0 ,...,yt0 )∈X t0 +1
1{AC,ǫ }π0 (y0 )
t0 Y
i=1
π(yi ) ≤
X
(y1 ,...,yt0 )∈X t0
1{AC,ǫ }
t0 Y
π(yi ).
i=1
But the last expression is exactly the probability of the event Pr{AC,ǫ }, when the random variables Y0 , Y1 , . . . , Yt0 are drawn independently according to the distribution π. The classical PAC result (see [11], [4]) states that this probability is upper bounded by 2(
2et0 d − ǫt0 ) 2 2 . d
In particular, the probability above is smaller than δ/2 whenever ³d
t0 = Ω
ǫ
log
1 1 1´ + log . ǫ ǫ δ
(19)
Now, if ³1³
m = ⌈Ω
ψ
log
´´ 1 + log t0 + log |X | + log β ⌉ δ
16
(20)
then t0 |X |βe−ψm ≤ δ/2. Combining this with (18), we obtain that the probability of the event AC,ǫ does not exceed δ whenever (19) and (20) hold. Recall T ≥ t0 m + 1. This completes the proof of the theorem. 2 Proof of Proposition 2: For each realization of our random walk Xt starting with n/2 consider a corresponding random walk Zt on the set of integers Z starting from the state X0 = 0. The correspondence is obvious: whenever the random walk on Sn increases (decreases) by a unit, the random walk on Z also increases (decreases) by a unit. It suffices then to prove that 2
Pr{ sup |Zt | ≤ n/4} ≥ 1 − e
n ) −Θ( α(n)
.
0≤t≤T
We now obtain an upper bound on Pr{sup0≤t≤T Zt ≥ n/4}. A bound on Pr{sup0≤t≤T Zt ≤ −n/4} is obtained similarly. Consider Pr{ZT ≥ n/4} which we write as Pr{ZT ≥ n/4| sup Zτ ≥ n/4}Pr{ sup Zt ≥ n/4}+Pr{ZT ≥ n/4| sup Zτ < n/4}Pr{ sup Zt < n/4}. 0≤t≤T
0≤t≤T
0≤t≤T
0≤t≤T
Note, Pr{ZT ≥ n/4| sup0≤t≤T Zt < n/4} = 0. Also note that Pr{ZT ≥ n/4| sup0≤t≤T Zt ≥ n/4} ≥ 1/2 since after crossing threshold n/4 at some time t ≤ T , the random walk is equally likely to end up strictly above or strictly below n/4 at time T . (In addition, it can end up exactly at n/4 if n/4 is an integer). This is a well known Reflection Principle of a symmetric random walk (or of a Brownian motion in continuous time continuous state case, see [10]). Therefore Pr{ sup Zt ≥ n/4} ≤ 2Pr{ZT ≥ n/4}. 0≤t≤T
Using the Chernoff bound [19] for Bernoulli type random variables n2
n2
Pr{ZT ≥ n/4} ≤ e−Θ( T 2 T ) = e−Θ( T ) . 2
n −Θ( α(n) )
By the choice of T = cα(n) we have Pr{ZT ≥ n/4} ≤ e
. But α(n)/n2 → 0 as n → ∞. The
required bound is proven. Now let us go back to the original random walk on Sn and suppose that the target concept is an interval C = [1, n/4]. We showed that with very high probability the random walk starting from the state X0 = n/2 will not end up in this interval within T time steps. So with a very high probability C ′ = ∅ is a concept consistent with observations. But π(C∆C ′ ) = π(C) = 1/4 > ǫ. We see that 17
T = Ω(n2 ) is a lower bound on the sample size needed for learning of the concept class of intervals. Comparing to the bound (6) we see that our lower bound is sharp up to log n factor. Proof of Theorem 4. The argument is very similar to the one of Theorem 1. We consider first T = t0 m + 1 observations (Xt , lt ), t = 0, 1, 2, . . . T of our labeled stochastic process and select from it every m-th observation together with first observation Yi = Xim , i = 0, 1, . . . t0 . Actual value of m will be specified later. We have Y0 = X0 = x0 , which we also denote by y0 . An upper bound on Pry0 {
π(C ′ ∆C) > ǫ}
sup
(21)
C ′ ∈C(Y0 ,Y1 ,...,Yt0 )
is our goal. Again we denote the event π(C ′ ∆C) > ǫ
sup C ′ ∈C(Y0 ,Y1 ,...,Yt0 )
by AC,ǫ (Y0 , Y1 , . . . , Yt0 ) ≡ AC,ǫ . We have Pr{AC,ǫ } =
X
(y1 ,...,yt0 )∈X t0
1{AC,ǫ }
t0 Y
i=1
p(m) (yi |yi−1 ).
(22)
By Assumption B, Theorem 2 and (9) |p(m) (y1 |y0 ) − π(y1 )| ≤ V (y0 )emθm Therefore Pr{AC,ǫ } ≤ X
X
(y1 ,...,yt0 )∈X t0
(y1 ,...,yt0 )∈X t0
1{AC,ǫ }π(y1 )
1{AC,ǫ }V (y0 )emθm
t0 Y
i=2
t0 Y
i=2
p(m) (yi |yi−1 )+
p(m) (yi |yi−1 )
(Recall that the initial state y0 is fixed and, as a result, does not appear in the summation). Note that by definition there are at most Dm states y1 that Markov chain can get into in m steps starting from the state y0 . Also for each fixed y1 we have
P
y2 ,...,yt0
Qt0
i=2 p
(m) (y
i |yi−1 )
= 1. It follows then that the
second summand above is not bigger than Dm V (y0 )emθm . Now let us analyze the first summand. Let L=
log t0 + log 3δ log(1 +
1 log λ log γ
.
(23)
)
This guarantees (
δ log γ L 1 ) ≤ 3t . log γ + log λ 0 18
(24)
We then have X
(y1 ,...,yt0 )∈X t0
1{AC,ǫ }π(y1 )
t0 Y
i=2
p(m) (yi |yi−1 ) ≤ X
y1 ,...,yt0 :V (y1 )>V (x∗ )γ 2L
X
y1 ,...,yt0 :V (y1 )≤V (x∗ )γ 2L
1{AC,ǫ }π(y1 )
t0 Y
i=2
1{AC,ǫ }π(y1 )
t0 Y
i=2
p(m) (yi |yi−1 )+
p(m) (yi |yi−1 )
The second summand is upper bounded by X
π(y1 )
y1 ,...,yt0 :V (y1 )>V (x∗ )γ 2L
t0 Y
i=2
p(m) (yi |yi−1 ) =
π(y1 ) = Prπ {V (Xt ) ≥ V (x∗ )γ 2L }.
X
y1 :V (y1 )>V (x∗ )γ 2L
By Theorem 3 it is upper bounded by (
log γ )L log γ + log λ1
which by (24) is not bigger than δ/(3t0 ). For the first summand, using Theorem 2 we have X
y1 ,...,yt0 :V (y1 )≤V (x∗ )γ 2L
X
y1 ,...,yt0 :V (y1 )≤V
(x∗ )γ 2L
X
y1 ,...,yt0 :V (y1 )≤V (x∗ )γ 2L
1{AC,ǫ }π(y1 )
t0 Y
i=2
1{AC,ǫ }π(y1 )π(y2 )
p(m) (yi |yi−1 ) ≤ t0 Y
i=3
p(m) (yi |yi−1 )+
1{AC,ǫ }π(y1 )V (y1 )emθm
t0 Y
i=3
p(m) (yi |yi−1 )
The second summand is upper bounded by X
y1 ,y2 :V (y1 )≤V (x∗ )γ 2L
π(y1 )V (y1 )emθm ≤ V (x∗ )γ 2L Dm emθm .
We break again the first summand in (25) as follows: X
y1 ,...,yt0 :V (y1 ),V (y2 )≤V (x∗ )γ 2L
X
1{AC,ǫ }π(y1 )π(y2 )
y1 ,...,yt0 :V (y1 )≤V (x∗ )γ 2L ,V (y2 )>V (x∗ )γ 2L
t0 Y
i=3
p(m) (yi |yi−1 )+
1{AC,ǫ }π(y1 )π(y2 )
t0 Y
i=3
where again, by Theorem 3, the second summand is upper bounded by (
log γ )L ≤ δ/(3t0 ). log γ + log λ1
19
p(m) (yi |yi−1 )
(25)
We continue this chaining argument. In the end we obtain three summations. The first sum is X
(y1 ,...,yt0 ):V (yi )≤V
The second sum is upper bounded by
(x∗ )γ 2L ,1≤i≤t
Pt0 P i=1
0
1{AC,ǫ }
yi :V (yi )>V (x∗ )γ 2L
t0 Y
π(yi ).
i=1
π(yi ) = t0 Prπ {V (Xt ) ≥ V (x∗ )γ 2L }, which
is upper bounded by t0 δ/(3t0 ) = δ/3 by Theorem 3 and (24) The third sum is upper bounded by (t0 V (x∗ )γ 2L + V (y0 ))Dm emθm .
(26)
The first sum is upper bounded (lift the constraints V (yi ) ≤ V (x∗ )γ 2L ) by the probability of the event AC,ǫ when the successive observations are i.i.d. with distribution π. By PAC analysis, this probability is smaller then δ/3 whenever ³d
t0 = Ω
ǫ
log
1 1 1´ + log . ǫ ǫ δ
We now analyze the bound (26) on the third sum. We obtain a value of m which guarantees this sum to be smaller than δ/3. In the analysis we omit the constants for simplicity. By Assumption B, Dm ≤ mα . Fix an arbitrary constant 0 < c < 1. We analyze an inequality mα+1 θm < c. It is equivalent to m≥
(α + 1) log m + log 1c . log 1θ
α+1 Since f (x) = x − log 1 log x is a growing function of x for x ≥ θ
m≥
α+1 log θ1
2
and is positive for x = ⌈ (α+1) ⌉, then log2 1 θ
(α + 1) log m log 1θ
2
for m = Ω( (α+1) + 1). We use this inequality for c = δ/(t0 V (x∗ )γ 2L e) and c = δ/(V (y0 )e) to conclude log 1 θ
that the sum (26) is smaller than δ whenever ³ (α + 1)2
m=Ω
log2 1θ
+1+
1 ³ 1 ´´ ∗ . 1 log t0 + log V (x ) + log V (x0 ) + 2L log γ + log δ log θ
The value of L is given by (23). Recall, T = t0 m + 1. This completes the proof of the theorem. 2