Learning in the Limit and Non-Uniform (ε, δ)-Learning

Shai Ben-David    Michal Jacovi
Department of Computer Science
Technion, Haifa 32000, Israel
email: [email protected], [email protected]

July 23, 2001
Abstract

We compare the two most common types of success criteria for learning processes: the (ε, δ) criterion employed in the PAC model and its many variants and extensions, and the identification-in-the-limit criterion used in inductive inference models. By applying common techniques from the theory of probability, and a stochastic variant of Rissanen's MDL principle, we demonstrate close connections between these two types of learnability notions. We also show that, once computability issues are set aside, stochastic identification in the limit is intimately related to the countability of a concept class.
1 Introduction

Gold's Inductive Inference paradigm [Gol67] and Valiant's PAC model [Val84] are the cornerstones of major subfields of computational learnability theory. The rich body of research carried out in these areas has taken directions that have very little in common. In this paper we wish to understand some connections between the notions of learnability in these two frameworks. One difference between these settings is that while the PAC model is inherently concerned with learning from random inputs, the scenario laid out by Gold is usually applied to learning from deterministically generated input sequences. Wishing to emphasize the proximity between the models, we consider a variation of the Identification In the Limit model that is based on randomly
generated examples. This is a natural extension of the basic IIL model, and similar variations have been investigated before (see, e.g., [KB92]). Maybe the most conspicuous difference between the two models is the difference between their respective success criteria. While the IIL model is concerned with the eventual behavior of the learning process, PAC success is required to occur within a training sequence of bounded, finite length. The emergence of the non-uniform variants of PAC learnability (see, e.g., [LMR88] and [BI88]) provides the framework for bridging the remaining differences between the frameworks. There is another important difference between the two branches of research. Research in the IIL framework has taken the computability of a learning method as a central theme, a trend resulting in the prevalence of recursion-theoretic tools and results. The PAC school, on the other hand, has focused on issues of efficiency, both from the information-theoretic point of view (i.e., sample size bounds) and from the computational complexity angle (bounding computational resources). In this research we focus on the information-theoretic aspects and ignore the computational and efficiency issues. As a common ground for a discussion of the two models we have chosen the framework of generalized PAC, introduced by Haussler in [Hau92]. It provides a sufficiently general setting for the embedding of all the models we wish to compare. After providing precise definitions of several models of learnability, we turn to a detailed analysis of the relationship between their respective success criteria. We show that the notion of Identification In the Limit, the notion demanding the production of an exact identification at some finite stage in the learning process, is somewhat uninteresting: we prove that it is basically equivalent to the countability of the hypothesis class.
We then go on to investigate weaker notions that settle for eventual convergence of the student's hypothesis to an optimal one. These notions turn out to be equivalent to the PAC-originated notion of non-uniform (ε, δ)-learnability. We conclude our discussion with a diagram that summarizes the relationships between the different notions.
2 Formal Definitions

In order to discuss the relations between Identification In the Limit and Non-Uniform (ε, δ)-Learning, we need a framework that captures both kinds of learning. We choose to work in the framework of generalized PAC, introduced by Haussler in [Hau92].
Let X be an instance space and let Y be an outcome space. A concept P in this framework is a probability distribution over X × Y representing a possible "state of nature". An example of a concept P is a pair (x, y) ∈ X × Y that is randomly drawn from X × Y according to P. A concept class P is a collection of concepts. The learner receives a finite sequence of examples of an unknown target distribution P (a sample of P) and develops a deterministic strategy that specifies what he believes is the appropriate action a for each instance x ∈ X. The action is chosen from a set of possible actions A, called the decision space. In the future, for each example (x, y), the learner will be shown only the instance x and will use his strategy in order to choose an action a. Following this, the outcome y will be revealed to the learner. For each action a and outcome y, the learner will suffer a loss, which is measured by a fixed real-valued loss function l : A × Y → [0, M]. The learner's strategy, which is a function mapping the instance space X into the decision space A, is called a decision rule. The decision rule is chosen from a fixed decision rule space H of functions mapping X into A. The goal of the learner, who knows H, is to find a decision rule h ∈ H that minimizes the expected loss E_P[l(h(x), y)] (or, in short, E_P[l_h]). If such an h exists, we call it an optimal description of P. A basic learning problem is defined by the six components (X, Y, A, H, P, l).
Definition 2.1 [Learning Method] A learning method, A, for solving (X, Y, A, H, P, l) is a function mapping the space of all finite samples, ∪_{m=1}^∞ (X × Y)^m, into H.
Definition 2.2 [Learning In the Limit] A learning method, A, solves (X, Y, A, H, P, l) in the limit if, for any P ∈ P, with probability 1 over the infinite sequences of examples of P, the sequence of A's hypotheses A_1 = A((x_1, y_1)), ..., A_n = A((x_1, y_1), ..., (x_n, y_n)), ... contains a decision rule A_m that is an optimal description of P, and A_m = A_{m+1} = ..., i.e., with probability 1, A converges to an optimal description of P after a finite number of examples.
Note that the probability distribution, relative to which A's success is defined, is the product distribution P^∞ over the infinite sequences of examples of P. This formulation of the success of a learning method is sometimes denoted EX. There is yet another possibility, in which the learning method is not required to converge to one optimal description but has to, eventually, output only optimal descriptions of the target P. The latter formulation is sometimes denoted BC [Ful90].
Learning in the limit is only defined for problems in which every P ∈ P has an optimal description in H. This fact leads us to the introduction of a weaker notion of learning in the limit, which settles for eventual convergence of the losses of the learner's hypotheses towards the infimum of all expected losses.
Definition 2.3 [Weakly Learning in the Limit] A learning method is said to weakly solve (X, Y, A, H, P, l) in the limit if, for any P ∈ P, lim_{n→∞} E_P[l_{A_n}] = inf_{h∈H} E_P[l_h], with probability 1 over the infinite sequences of examples of P.
Definition 2.4 [(ε, δ)-Learning] A learning method, A, (ε, δ)-solves (X, Y, A, H, P, l) if for any 0 < ε, δ < 1 and any P ∈ P, there exists a sample size m such that n ≥ m implies |E_P[l_{A_n}] − inf_{h∈H} E_P[l_h]| < ε, with probability exceeding (1 − δ) over the n-samples of P (with respect to the product distribution P^n).
Definition 2.5 [Uniform and Non-Uniform (ε, δ)-Learning] If the sample size m depends upon ε and δ only, and not upon the unknown target distribution P, the learning is said to be uniform. The learning is non-uniform if m is a function of P as well.
Embedding known models in our framework:

Inductive Inference: We work in a stochastic model of Identification In the Limit in the spirit of a suggestion by Gold [Gol67], in which the instances are randomly chosen and convergence in the limit is with probability 1. Furthermore, the Haussler model we have adopted allows treatment of concepts that are probabilistic, i.e., a concept may be a distribution over X × Y rather than just a subset of X.
PAC-learning: PAC-learning is a special case of (ε, δ)-learning. In PAC, Y = A = {0, 1}. The collection P of distributions over X × Y is replaced by a concept class C. A concept is a function c : X → {0, 1} (i.e., a subset of X). If c(x) = 1 then the instance x is labeled 1 deterministically. The instances x are drawn according to some probability distribution D over X that remains unknown.
p-concepts: Another special case of (ε, δ)-learning is the model of p-concepts, introduced by Kearns and Schapire [KS90]. In the model of p-concepts, Y = {0, 1} and A = [0, 1]. The collection P of distributions over X × Y is replaced by a p-concept class C. A p-concept is a function c : X → [0, 1] stating the conditional probability P(y = 1 | x) of the label y being 1 given the instance x, which is drawn according to some probability distribution D over X that remains unknown.
3 Some Tools From Probability Theory

Following are some basic notions of probability theory. We refer the reader to standard probability textbooks, such as [GS82], for elaborations.
Definition 3.1 Let Z_1, ..., Z_n, ... and Z be random variables on some probability space (Ω, F, P). We say that

(a) Z_n → Z almost surely, written Z_n →^{a.s.} Z, if {ω ∈ Ω : lim_{n→∞} Z_n(ω) = Z(ω)} is an event whose probability is 1. (This kind of convergence is sometimes called convergence almost everywhere or convergence with probability 1.)

(b) Z_n → Z in probability, written Z_n →^P Z, if for all ε > 0, lim_{n→∞} P(|Z_n − Z| > ε) = 0.

Theorem 3.1 (Z_n →^{a.s.} Z) ⇒ (Z_n →^P Z).

Theorem 3.2 If Z_n →^P Z then there exists an increasing sequence of integers n_1, ..., n_i, ... such that Z_{n_i} →^{a.s.} Z.
Lemma 3.1 Let ε_n ↘ 0 be a monotonic, nonincreasing sequence. If

Ω_0 = {ω : ∃n_0 s.t. ∀n ≥ n_0, |Z_n(ω) − Z(ω)| < ε_n}   (3.1)

is an event whose probability is 1, then Z_n →^{a.s.} Z.
A straightforward approach to learning is to search the decision rule space for a decision rule that minimizes the average empirical loss. The following lemma of Hoeffding [Hoe63] offers an upper bound on the rate of convergence of such an approach (as a function of the sample size).
Lemma 3.2 ([Hoe63]) Let Z_i (1 ≤ i ≤ n) be n independent random variables with identical probability distributions, each ranging over the (real) interval [0, M]. Let Z̄ = (1/n) Σ_{i=1}^n Z_i. Then

P(|Z̄ − E_P[Z̄]| ≥ ε) ≤ 2e^{−2nε²/M²}.   (3.2)
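For concreteness, the bound of lemma 3.2 can be checked numerically. The following Python sketch is an illustration only, not part of the analysis; the uniform distribution on [0, 1] and the function names are our own choices. It compares a Monte Carlo estimate of the deviation probability with the Hoeffding bound:

```python
import math
import random

def hoeffding_bound(n, eps, M):
    """Right-hand side of (3.2): 2 * exp(-2 * n * eps^2 / M^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / M ** 2)

def empirical_tail(n, eps, trials=20000, M=1.0):
    """Monte Carlo estimate of P(|mean - M/2| >= eps) for n i.i.d.
    Uniform[0, M] draws (whose expectation is M/2)."""
    rng = random.Random(0)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.uniform(0, M) for _ in range(n)) / n
        if abs(mean - M / 2) >= eps:
            hits += 1
    return hits / trials
```

For n = 50 and ε = 0.1 the bound evaluates to 2e^{−1} ≈ 0.74, while the observed deviation frequency for uniform variables is far smaller, as the bound holds for every distribution on [0, M] and is not tight for any particular one.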
Corollary 3.1 For a given decision rule h ∈ H and a given sample ((x_1, y_1), ..., (x_m, y_m)) drawn according to some distribution P, let the average empirical loss be

E_m[l_h] =def (1/m) Σ_{j=1}^m l(h(x_j), y_j).   (3.3)

Let N(k, ε, δ) be

N(k, ε, δ) = ⌈ (M²/(2ε²)) ln(2k/δ) ⌉.   (3.4)

For every probability distribution P and every finite sequence of decision rules (h_1, ..., h_k), if m ≥ N(k, ε, δ) then, with P^m-probability exceeding (1 − δ), for all i ≤ k, E_m[l_{h_i}] is ε-close to the true expected loss, E_P[l_{h_i}].
It is worthwhile to note that the needed sample size, N(k, ε, δ), depends upon ε, δ and k only and does not depend upon the unknown P.
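The sample size bound (3.4) is easy to compute. The following Python sketch (an illustration; the function name `sample_size` is ours) makes explicit that the dependence on the number of rules k is only logarithmic:

```python
import math

def sample_size(k, eps, delta, M=1.0):
    """N(k, eps, delta) from corollary 3.1: enough examples so that, with
    probability exceeding 1 - delta, the empirical losses of k fixed decision
    rules are simultaneously eps-close to their expectations.  Obtained by a
    union bound over Hoeffding's inequality (losses bounded in [0, M])."""
    return math.ceil(M ** 2 / (2 * eps ** 2) * math.log(2 * k / delta))
```

For example, with ε = 0.1 and δ = 0.05, moving from 1 rule to 1000 rules multiplies the required sample size by less than 3, reflecting the ln(2k/δ) term.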
Lemma 3.3 (Borel-Cantelli) If {B_n}_{n∈N} is a sequence of events and if

Σ_{n∈N} P(B_n) < ∞,   (3.5)

then P(B_n infinitely often) = 0.
If {B_n}_{n∈N} is a sequence of independent events and if

Σ_{n∈N} P(B_n) = ∞,   (3.6)

then P(B_n i.o.) = 1.
The following, amusing result of Rudich [Rud85] turns out to be helpful for the analysis in the next section.
Lemma 3.4 ([Rud85]) Given a finite canvas, a countably infinite set of colors, and an ε > 0, it is impossible to paint a set of measure at least ε with each color in such a way that the set of points painted with an infinite number of colors has measure 0.
Note that this lemma is a variant of the second part of the Borel-Cantelli lemma: it drops the independence requirement at the cost of a weaker conclusion.
4 A Characterization of Learning in the Limit

The analysis carried out in this section leads to a conclusion that we find quite surprising. Roughly speaking, we prove that the learnability in the limit of a problem (X, Y, A, H, P, l) is basically equivalent to the countability of its decision rule space H.
Theorem 4.1 If H = {h_1, ..., h_n, ...} is a countable decision rule space, then the problem (X, Y, A, H, P, l) is learnable in the limit.
Proof: Given a sample, we would like to use it for calculating the average empirical losses of all the decision rules, and return the one with the smallest empirical loss. The problem with this approach is twofold: from a computational point of view, calculating the averages of all members of an infinite set of rules is not feasible; and even if we ignore the issue of computability, as long as the convergence of the empirical estimates to the true losses is not uniform over all members of H, we cannot guarantee the convergence of such a guessing process. Our solution is to apply the above approach to a finite number of decision rules for any given sample. This will also enable us to use corollary 3.1 for the calculation of the empirical losses. Let f(k) be the number of decision rules whose expected losses we approximate given a k-sample. What should such an f satisfy? f should be a monotonic, non-decreasing function, so that for any n, h_n's loss is approximated for all but finitely many sample sizes. It is also important that the number n returned by f is such that the input k-sample is large enough for all n decision rules to be approximated with accuracy ε_n and confidence (1 − δ_n), for some decreasing sequences ε_n ↘ 0 and δ_n ↘ 0. Another problem with this approach is that choosing the decision rule with the smallest empirical loss may generate a non-convergent sequence of hypotheses. Our remedy to this difficulty is to return, at each stage of the learning process, the first decision rule, relative to some fixed enumeration of H, whose empirical loss is `close enough' to the smallest empirical loss. A careful choice of the accuracy parameters defining the meaning of `close enough' at each stage will guarantee that the method we end up with is a successful learning method for (X, Y, A, H, P, l). Let ε_n ↘ 0 be any decreasing sequence and let δ_n ↘ 0 be a decreasing sequence satisfying Σ_{n=1}^∞ δ_n < ∞. Given a sample size k, f returns the maximal n such that the sample suffices for the approximation of n decision rules with accuracy ε_n and confidence (1 − δ_n):
f(k) = max{n : k ≥ N(n, ε_n, δ_n)},

where N is the function defined in corollary 3.1. N(n, ε_n, δ_n) is an increasing function of n, thus f is a monotonic, non-decreasing function. Let A be the following method. Given a k-sample:
Use the first N(n, ε_n, δ_n) examples to approximate the expected losses of the first n = f(k) decision rules in H. Return the first decision rule whose empirical loss is 2ε_n-close to the smallest empirical loss (among those just calculated).
Let us show that A indeed solves (X, Y, A, H, P, l) in the limit. Let P ∈ P be the target distribution, and suppose h_i ∈ H is an optimal description of P, with minimal i. By corollary 3.1, N(n, ε_n, δ_n) examples suffice for the approximation of the losses with probability exceeding (1 − δ_n). Given any k-sample, for any h_j with j ≤ n = f(k), E_P[l_{h_j}] is approximated and |E_k[l_{h_j}] − E_P[l_{h_j}]| < ε_n, with probability exceeding (1 − δ_n). Let k_n be the smallest sample size k for which f(k) ≥ n, and let B_n^j be the random event |E_{k_n}[l_{h_j}] − E_P[l_{h_j}]| ≥ ε_n. Then

Σ_{n=j}^∞ P^∞(B_n^j) ≤ Σ_{n=j}^∞ δ_n < ∞.   (4.1)
(4.1) and the Borel-Cantelli lemma imply that B_n^j occurs only finitely often, with probability 1, which means that for any j,

|E_k[l_{h_j}] − E_P[l_{h_j}]| < ε_{f(k)}   (4.2)

all but finitely often, with probability 1. By the choice of h_i, E_P[l_{h_i}] = min_{h∈H} E_P[l_h]; thus (4.2) implies that the empirical loss of h_i is 2ε_{f(k)}-close to the smallest empirical loss all but finitely often, with probability 1. On the other hand, for any j < i,

E_P[l_{h_j}] − E_P[l_{h_i}] = ε^j > 0.   (4.3)

Thus, for any large enough k (namely, whenever ε^j > 4ε_{f(k)}), the empirical loss of h_j is more than 2ε_{f(k)} away from the smallest empirical loss, so h_j is rejected all but finitely often, with probability 1. It follows that, with probability 1, A eventually always returns h_i. The method employed in the above proof relies on a fixed enumeration of all candidate decision rules and can be viewed as a stochastic version of the method of identification by enumeration
([Gol67]). It actually employs a minimal description length principle in the spirit of the work of Rissanen [Ris85] and Cover [Cov73]: one considers the set of all hypotheses that are not ruled out by probabilistically significant evidence and chooses the lowest-indexed hypothesis in this set.
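The enumeration learner constructed in the proof can be sketched in code. The following Python fragment is an illustration only: it runs the scheme on a finite prefix of the enumeration, with the particular schedules ε_n = δ_n = 2^{-n} (one admissible choice, since ε_n decreases and Σδ_n < ∞), and returns the index of the chosen rule.

```python
import math

def N(k, eps, delta, M=1.0):
    """Sample size from corollary 3.1 (Hoeffding + union bound over k rules)."""
    return math.ceil(M ** 2 / (2 * eps ** 2) * math.log(2 * k / delta))

def learn_in_limit(sample, hypotheses, loss):
    """Stochastic identification by enumeration: a sketch of the learner built
    in the proof of theorem 4.1.  `hypotheses` is a finite prefix of the fixed
    enumeration h_1, h_2, ...; losses are assumed bounded in [0, 1], and the
    sample is assumed to hold at least N(1, 1/2, 1/2) examples."""
    eps = lambda n: 2.0 ** -n
    delta = lambda n: 2.0 ** -n
    k = len(sample)
    # f(k): the largest n whose first n rules the k-sample can approximate
    # with accuracy eps_n and confidence 1 - delta_n.
    n = 1
    while k >= N(n + 1, eps(n + 1), delta(n + 1)):
        n += 1
    prefix = sample[:N(n, eps(n), delta(n))]
    emp = [sum(loss(h(x), y) for x, y in prefix) / len(prefix)
           for h in hypotheses[:n]]
    best = min(emp)
    # MDL flavour: return the lowest-indexed rule not ruled out by the evidence,
    # i.e. the first whose empirical loss is 2*eps_n-close to the smallest.
    return next(i for i, e in enumerate(emp) if e <= best + 2 * eps(n))
```

On a deterministic sample where the third rule is the unique optimal one, sufficiently many examples drive ε_n below the loss gaps and the learner settles on that rule.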
Definition 4.1 For a problem (X, Y, A, H, P, l), we say that P is covered by a subset H_P ⊆ H if for any P ∈ P there exists an optimal description of P in H_P.

Claim 4.1 Let (X, Y, A, H, P, l) be any learning problem and let H_P ⊆ H be a cover for P. If (X, Y, A, H_P, P, l) is learnable then so is (X, Y, A, H, P, l).

Corollary 4.1 If P is covered by a countable H_P ⊆ H, then (X, Y, A, H, P, l) is learnable in the limit.
Proof: Apply the above learning method to the countable H_P. □
We now turn to the reverse implication. We wish to show that any problem that is strongly learnable in the limit is `essentially countable'. For an exact formulation of what we can prove along this theme, we need some further terminology. Given a learning problem (X, Y, A, H, P, l), let us define a relation R over the decision rule space H: two decision rules h, h' ∈ H are in the relation if there exists a distribution P ∈ P for which both h and h' are optimal descriptions. R is not necessarily an equivalence relation, as it is not necessarily transitive. However, in many settings R is an equivalence relation; for example, in PAC and in the model of p-concepts with loss function l(a, y) = (a − y)² or l(a, y) = |a − y|. Let [h]_R be the equivalence class of the decision rule h in such settings.
Theorem 4.2 Let (X, Y, A, H, P, l) be a problem in which R is an equivalence relation, Y is a finite outcome space, and all members of P share the same projection on the X coordinate. If (X, Y, A, H, P, l) is learnable in the limit, then P is covered by a countable H_P ⊆ H.
Let P_X denote the projection of any P ∈ P on the X coordinate.
Proof: Each distribution P ∈ P induces a function f_P : (X × Y)^∞ → N ∪ {∞}, as follows: given a sequence (x_1, y_1), ..., (x_n, y_n), ..., f_P returns the first m such that A((x_1, y_1), ..., (x_m, y_m)) is an optimal description of P and A_m = A_{m+1} = ...; f_P returns ∞ if no such m exists. For any P ∈ P, P^∞(f_P^{-1}(N)) = 1; thus there exists an m_P such that, with probability exceeding 1/2, A converges to an optimal description of P after m_P examples.
Claim 4.2 For each m, the set P_m =def {P ∈ P : m_P = m} is covered by a finite H_m ⊆ H.
The proof of the theorem follows from this claim, since P = ∪_{m=1}^∞ P_m and is therefore covered by the countable set H_P = ∪_{m=1}^∞ H_m.
Proof of Claim: Assume, by way of contradiction, that there exists an m for which P_m cannot be covered by a finite H_m ⊆ H. Let us paint the set X^m with the colors [h]_R for which there exists a P ∈ P_m that h optimally describes. That is, we paint X^m with [h]_R's that are optimal descriptions of distributions on which A converges after m examples with probability exceeding 1/2. By our assumption, the collection of colors we use is infinite. We paint with [h]_R any sequence (x_1, ..., x_m) for which there exist a P that h optimally describes, a sequence of labels (y_1, ..., y_m), and an infinite extension of the prefix ((x_1, y_1), ..., (x_m, y_m)) on which f_P returns m (= m_P). The choice of m_P is such that the set painted [h]_R is of (P_X)^m-measure at least 1/2; hence, by lemma 3.4, there must be a set X' ⊆ X^m of positive measure that is painted with an infinite number of colors. Let (x_1, ..., x_m) ∈ X' be any such sequence. The fact that (x_1, ..., x_m) is painted with an infinite number of colors means that A returns an infinite number of non-equivalent answers given the sequence (along with labels). But the number of different possible labelings of the sequence is |Y|^m, which is finite, and A is a deterministic algorithm; therefore A returns only a finite number of answers given (x_1, ..., x_m). Thus we have contradicted the assumption, and there is no such m. □
Note that theorem 4.2 can make do with the weaker BC criterion.
Corollary 4.2 In the PAC model, Learning In the Limit is equivalent to the countability of the concept class.

Corollary 4.3 In the model of p-concepts [KS90], Learning In the Limit is equivalent to the countability of the p-concept class.
Theorem 4.2 relies on the assumption that R is an equivalence relation over H. We believe it is possible to omit this assumption.
Conjecture 4.1 Let (X, Y, A, H, P, l) be a problem in which Y is a finite outcome space, and all members of P share the same projection on the X coordinate. If (X, Y, A, H, P, l) is learnable in the limit, then P is covered by a countable H_P ⊆ H.
We were not able to prove this conjecture. However, we can prove the theorem under each of the following conditions, instead of the assumption that R is an equivalence relation:
1. For any h ∈ H, the set of distributions P ∈ P that h optimally describes is countable. Or,
2. For any h ∈ H, the set

{E*_P : h is an optimal description of P}   (4.4)

is not dense in {E*_P : P ∈ P}, where E*_P = min_{h∈H} E_P[l_h].
Any natural learning task we are aware of satisfies one of these conditions.
5 Weak Learning in the Limit and Non-Uniform (ε, δ)-Learning

Weak learning in the limit turns out to be equivalent to non-uniform (ε, δ)-learning. This is nothing more than a reformulation of some basic probability theorems.
Theorem 5.1 If a problem (X, Y, A, H, P, l) is weakly learnable in the limit, then it is non-uniformly (ε, δ)-learnable.

Proof. The theorem follows from the fact that convergence almost surely implies convergence in probability (theorem 3.1). The space Ω is the collection of all infinite sequences of labeled examples, (X × Y)^∞; any distribution P over X × Y induces a probability distribution P̄ = P^∞ over Ω. For any learning method A that weakly solves (X, Y, A, H, P, l) in the limit, the random variables

Z_n((x_1, y_1), ..., (x_n, y_n), ...) = E_P[l_{A((x_1, y_1), ..., (x_n, y_n))}]   (5.1)

converge almost surely to inf_{h∈H} E_P[l_h], for any P ∈ P. Theorem 3.1 implies that the Z_n's converge to inf_{h∈H} E_P[l_h] in probability, for any P ∈ P, which means that for any P and any ε > 0,

lim_{n→∞} P̄(|Z_n − inf_{h∈H} E_P[l_h]| > ε) = 0,   (5.2)

or, in other words, for any 0 < ε, δ < 1 and any P ∈ P, there exists an m such that n ≥ m implies

|E_P[l_{A_n}] − inf_{h∈H} E_P[l_h]| ≤ ε   (5.3)

with probability exceeding (1 − δ). Thus (X, Y, A, H, P, l) is non-uniformly (ε, δ)-learned. □
For the reverse direction we shall have to work harder.
Theorem 5.2 If a problem (X, Y, A, H, P, l) is non-uniformly (ε, δ)-learnable, then it is weakly learnable in the limit.
Proof. Let A' be a learning method that non-uniformly (ε, δ)-learns (X, Y, A, H, P, l). A straightforward approach for weakly learning (X, Y, A, H, P, l) in the limit would be to
1. Run A' on prefixes of a part of the input sample and collect its hypotheses.
2. Use the rest of the sample in order to approximate the expected losses of the hypotheses.
3. Return the hypothesis with smallest empirical loss.
Let f(k) be the number of prefixes we run A' on, given a k-sample. f should be a monotonic, non-decreasing, unbounded function. It is also important that f(k) is such that the input k-sample can be divided into two parts: an "input part" of size f(k), whose prefixes are given to A', and a "test part", ((x_{f(k)+1}, y_{f(k)+1}), ..., (x_k, y_k)), that is used for the evaluation of the returned decision rules. The "test part" should be large enough to allow the approximation of the losses of the first f(k) hypotheses with accuracy ε_{f(k)} and confidence (1 − δ_{f(k)}), for some decreasing sequences ε_n ↘ 0 and δ_n ↘ 0. By corollary 3.1, f can be defined as

f(k) = max{n : k ≥ n + N(n, ε_n, δ_n)}.   (5.4)

As in the proof of theorem 4.1, the above heuristics do not suffice to guarantee convergence of the sequence of losses of the generated output hypotheses. Our solution is, once again, to employ an MDL principle, though the calculations in this case should be more careful. The tedious calculations are deferred to the full version of this paper. □
5.1 From Uniform (ε, δ)-Learning to Weak Learning In the Limit

The most prevalent notion of (ε, δ)-learning is the uniform learnability criterion, i.e., learnability where the needed sample size depends only upon the (ε, δ) parameters (and not upon the target to be learned). Benedek and Itai [BI88], introducing the first non-uniform variant of PAC learning, relate this notion to uniform learnability as follows:
Theorem 5.3 ([BI88]) A concept class C is non-uniformly PAC learnable iff there exist classes {C_n : n ∈ N} such that
1. C = ∪_{n∈N} C_n;
2. each C_n is uniformly learnable.
Their proof relies heavily on the [BEHW89] characterization of uniform learnability of a class by the finiteness of its VC-dimension. We wish to claim that the equivalence exhibited in [BI88] transcends the scope of PAC learnability. We shall now prove several instances of this thesis in the general context of the Haussler framework.
Theorem 5.4
1. If for any i, (X, Y, A, H, P_i, l) is uniformly (ε, δ)-learnable and P = ∪_{i=1}^∞ P_i, then (X, Y, A, H, P, l) is weakly learnable in the limit.
2. If for any i, (X, Y, A, H_i, P, l) is uniformly (ε, δ)-learnable and H = ∪_{i=1}^∞ H_i, then (X, Y, A, H, P, l) is weakly learnable in the limit.
3. If for any i, (X, Y, A, H_i, P_i, l) is uniformly (ε, δ)-learnable, and P_1 ⊆ P_2 ⊆ ... and H_1 ⊆ H_2 ⊆ ..., then (X, Y, A, ∪_{i=1}^∞ H_i, ∪_{i=1}^∞ P_i, l) is weakly learnable in the limit.
Proof. We shall prove claim 1 of the theorem; claims 2 and 3 are proven similarly. Let A^(i) be a learning method that uniformly (ε, δ)-learns (X, Y, A, H, P_i, l). A straightforward approach for weakly learning (X, Y, A, H, P, l) in the limit would be to
1. Run the A^(i)'s on a part of the input sample and collect their hypotheses.
2. Use the rest of the sample in order to approximate the expected losses of the hypotheses.
3. Return the hypothesis with smallest empirical loss.
The problem with this approach is that we have to handle infinitely many A^(i)'s. Our solution is to apply the above approach to a finite number of learning methods for a given sample. This will also enable us to use corollary 3.1 in the approximations of task 2. Let f(k) be the number of A^(i)'s we run given a k-sample. f should be a monotonic, non-decreasing function, so that for any n, A^(n) is run on all but finitely many sample sizes. It is also important that the number n returned by f is such that the input k-sample can be divided into two parts: an "input part" that is given to the first n learning methods, and a "test part" that is used for the approximation of the returned decision rules. The "input part" must be large enough to enable each of the first n methods to come up with a hypothesis whose expected loss is ε_n-close to the minimal expected loss with probability exceeding (1 − δ_n), for some decreasing sequences ε_n ↘ 0 and δ_n ↘ 0, and the "test part" must be large enough for all n hypotheses to be approximated with accuracy ε_n and confidence (1 − δ_n).
If a learning method A uses the suggested approach with such a function f, then for any target distribution P in any P_i, the expected losses of A's hypotheses on growing k-samples are 2ε_{f(k)}-close to the minimal expected loss with probability exceeding (1 − 2δ_{f(k)}), for all but finitely many k's. Let Z_k be the random variable Z_k = E_P[l_{A_k}], and let Z be inf_{h∈H} E_P[l_h]. By the above discussion, Z_k →^P Z. If the sequence {δ_n}_{n∈N} satisfies Σ_{n=1}^∞ δ_n < ∞, then it can be shown, by the Borel-Cantelli lemma, that the event
{ω ∈ (X × Y)^∞ : ∃k_0 s.t. ∀k ≥ k_0, |Z_k(ω) − Z| < 2ε_{f(k)}}   (5.5)

is an event whose probability is 1. (Let k_n be the smallest k for which f(k) ≥ n, and define B_n to be the event |Z_{k_n}(ω) − Z| ≥ 2ε_n; note that Σ_{n=1}^∞ P(B_n) ≤ Σ_{n=1}^∞ 2δ_n < ∞, and apply the Borel-Cantelli lemma.) Thus, by lemma 3.1, Z_k →^{a.s.} Z, which completes the proof of the theorem.
It is left to show that such a function f exists. Let m_i(ε, δ) be the sample size of the uniformly (ε, δ)-learning method A^(i). Define
M_i(n) = max_{j≤i} m_j(ε_n, δ_n).   (5.6)

An M_i(n)-sample suffices for any A^(j), j ≤ i, to come up with a decision rule whose expected loss is ε_n-close to the minimal expected loss with probability exceeding (1 − δ_n), for any target distribution P ∈ P_j. Let M : N → N be a function that grows faster than any M_i; that is, for any M_i there exists an n_i such that n ≥ n_i implies M_i(n) ≤ M(n). Such an M exists, since {M_i : i ∈ N} is a countable collection of functions. For any n ≥ n_i, an M(n)-sample suffices for any A^(j), j ≤ i, to come up with a decision rule whose expected loss is ε_n-close to the minimal expected loss with probability exceeding (1 − δ_n), for any target distribution P ∈ P_j. Let I_n be the maximal i such that n ≥ n_i. Given a sample size k, f returns the maximal n such that the sample can be divided into an "input part" of length M(n) and a "test part" of length N(I_n, ε_n, δ_n), where N is the function defined in corollary 3.1:

f(k) = max{n : k ≥ M(n) + N(I_n, ε_n, δ_n)}.

M(n) + N(I_n, ε_n, δ_n) is an increasing function of n, thus f is a monotonic, non-decreasing function. □
Example 5.1 [p-concepts] Let X be an arbitrary set and f_1, ..., f_d functions mapping X into R. Kearns and Schapire [KS90] defined C(f_1, ..., f_d) as the class of all p-concepts of the form c(x) = Σ_{i=1}^d a_i f_i(x) for a_i ∈ R, where the f_i and a_i are such that c(x) ∈ [0, 1] for all x ∈ X. By theorem 10 in [KS90], for any computable functions f_i : X → R, 1 ≤ i ≤ d, the p-concept class C(f_1, ..., f_d) is uniformly (ε, δ)-learnable w.r.t. the loss function l(a, y) = (a − y)².
Let f_i : X → R, i ∈ N, be any sequence of functions and define C(f_1, ..., f_d, ...) to be the p-concept class

C(f_1, ..., f_d, ...) = ∪_{d∈N} C(f_1, ..., f_d).   (5.7)

By claim 1 of theorem 5.4, for any computable functions f_i : X → R, i ∈ N, the p-concept class C(f_1, ..., f_d, ...) is weakly learnable in the limit.
Corollary 5.4 The p-concept class C_poly = C(1, x, ..., x^d, ...) of all polynomials p : [0, 1] → [0, 1] is weakly learnable in the limit.
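As an illustration of the degree-1 case of C_poly (a sketch of ours, not the algorithm of [KS90]): minimizing the empirical quadratic loss over linear rules is an ordinary least-squares fit, and with 0/1 labels it approximates the conditional probability P(y = 1 | x).

```python
import random

def fit_linear_pconcept(sample):
    """Least-squares fit of c(x) = a0 + a1*x to (x, y) pairs with y in {0, 1}:
    the minimizer of the empirical quadratic loss, approximating the
    p-concept c(x) = P(y = 1 | x).  Higher degrees would use the same
    normal-equations idea with more basis functions."""
    m = len(sample)
    sx = sum(x for x, _ in sample)
    sy = sum(y for _, y in sample)
    sxx = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    a1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    a0 = (sy - a1 * sx) / m
    return lambda x: a0 + a1 * x

# Demo: draw labels with P(y = 1 | x) = x, so the true p-concept is c(x) = x.
rng = random.Random(0)
sample = [(x, 1.0 if rng.random() < x else 0.0)
          for x in (rng.random() for _ in range(4000))]
c_hat = fit_linear_pconcept(sample)
```

With 4000 examples, the fitted rule is already close to the true conditional probability near the center of the domain.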
6 Acknowledgments We wish to thank Moti Frances and Shai Halevi for helpful discussions.
References

[BD92] S. Ben-David. Can Finite Samples Detect Singularities of Real-Valued Functions? In Proceedings of the 24th ACM Symposium on the Theory of Computing, pp. 390-399, May 1992.

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis Dimension. Journal of the Association for Computing Machinery, 36(4):929-965, October 1989.

[BI88] G.M. Benedek and A. Itai. Non-Uniform Learnability. In Proceedings of the 15th ICALP, pp. 82-92, 1988.

[Cov73] T.M. Cover. On Determining the Irrationality of the Mean of a Random Variable. The Annals of Statistics, 1(5):862-871, 1973.

[Ful90] M.A. Fulk. Robust Separations in Inductive Inference. In Proceedings of the 31st Symposium on Foundations of Computer Science, pp. 405-410, 1990.

[Gol67] E.M. Gold. Language Identification in the Limit. Information and Control, 10:447-474, 1967.

[GS82] G.R. Grimmett and D.R. Stirzaker. Probability and Random Processes. Clarendon Press, Oxford, 1982.

[Hau92] D. Haussler. Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Technical Report, University of California, Santa Cruz, 1992.

[Hoe63] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58:13-30, 1963.

[KB92] S. Kapur and G. Bilardi. Language Learning from Stochastic Input. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pp. 303-310, 1992.

[KS90] M.J. Kearns and R.E. Schapire. Efficient Distribution-free Learning of Probabilistic Concepts. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science, pp. 382-391, 1990.

[LMR88] N. Linial, Y. Mansour, and R. Rivest. Results on Learnability and the Vapnik-Chervonenkis Dimension. In Proceedings of the 29th Symposium on Foundations of Computer Science, pp. 120-129, 1988.

[Ris85] J. Rissanen. Minimum Description Length Principle. Encyclopedia of Statistical Sciences, 5:523-527, 1985.

[Rud85] S. Rudich. Inferring the Structure of a Markov Chain from its Output. In Proceedings of the Annual Symposium on Foundations of Computer Science, pp. 321-326, 1985.

[Val84] L.G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134-1142, November 1984.
Figure: a diagram summarizing the relationships between the notions discussed in this paper. In words:
- Learnability of problems with a countable cover H_P implies (strong) Learnability In the Limit; the converse holds under the conditions stated in section 4.
- Learnability of countable unions of uniformly (ε, δ)-learnable problems implies Weak Learnability In the Limit (theorem 5.4); the converse holds under the conditions stated in section 5.
- Weak Learnability In the Limit is equivalent to Non-Uniform (ε, δ)-Learnability (theorems 5.1 and 5.2).