Trial and Error: A New Approach to Space-Bounded Learning

Foued Ameur    Paul Fischer    Klaus-U. Höffgen    Friedhelm Meyer auf der Heide

Abstract

A pac-learning algorithm is d-space bounded if it stores at most d examples from the sample at any time. We characterize the d-space learnable concept classes. For this purpose we introduce the compression parameter of a concept class C and design our Trial and Error Learning Algorithm. We show: C is d-space learnable if and only if the compression parameter of C is at most d. Unlike previous approaches, e.g. by Floyd, who presents consistent space-bounded learning algorithms but has to restrict herself to very special concept classes, our learning algorithm does not produce a hypothesis consistent with the whole sample. On the other hand, our algorithm needs large samples; the compression parameter appears as an exponent in the sample size. We present several examples of polynomial-time space-bounded learnable concept classes:
- all intersection-closed concept classes with finite VC-dimension,
- convex n-gons in ℝ²,
- halfspaces in ℝⁿ,
- unions of triangles in ℝ².
We further relate the compression parameter to the VC-dimension and discuss variants of this parameter.

Foued Ameur and Friedhelm Meyer auf der Heide: Heinz Nixdorf Institut und Fachbereich Mathematik-Informatik, Universität-GH Paderborn, D-33098 Paderborn, Germany.
Paul Fischer and Klaus-U. Höffgen: Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany.

1 Introduction

We consider pac-learning algorithms that are space bounded, i.e. that can only store a fixed number d of examples (d is independent of the quality parameters ε and δ). We completely characterize the space needed for pac-learning a concept class C by introducing the compression parameter of C and our Trial and Error Learning Algorithm. We present several applications of this general method. In contrast to previous approaches to space-bounded learning, where results were achieved only for a few concept classes, our learning algorithm does not insist on producing a hypothesis which is consistent with the whole sample.

1.1 Definitions

Let X be the learning domain, C ⊆ 2^X a concept class and H ⊆ 2^X a hypothesis class. X, C and H may be structured according to a complexity parameter n, i.e. X = (X_n)_{n∈ℕ}, C = (C_n)_{n∈ℕ} and H = (H_n)_{n∈ℕ}. For example X_n = ℝⁿ or X_n = {0,1}ⁿ. We assume that C and H satisfy the measure-theoretic condition of being well-behaved, as defined in [1]. Moreover we assume that there is a representation language for the objects involved.

A sample of c ∈ C is a sequence of labeled examples of the form (x, l), x ∈ X, l = 1 if x ∈ c and l = 0 otherwise. Samples are generated according to an unknown distribution D on X. The learning task can then be formulated as follows: given an unknown target concept c ∈ C, find a hypothesis h ∈ H that is a good approximation to c. Information about c is presented through a sample, drawn according to the unknown distribution D, and the quality of the hypothesis is measured by D(c Δ h), where Δ denotes the symmetric difference. In what follows we refer to h ∈ H or c ∈ C both as the set and as its characteristic function (i.e. x ∈ c is also expressed by c(x) = 1).

We call C (polynomially) pac-learnable by H if there exists a (polynomial-time) algorithm A such that for any 0 < ε, δ < 1 and n > 0 there is a sample size m(ε, δ, n) (which is polynomial in 1/ε, 1/δ and n) such that for any target concept c ∈ C_n and any distribution D, given a sample of size m(ε, δ, n), A produces an ε-good hypothesis h ∈ H, i.e. an h ∈ H such that D(c Δ h) ≤ ε, with probability at least 1 − δ. A is called a (polynomial) pac-learning algorithm for C.

Next we define the Vapnik-Chervonenkis dimension of a concept class. A finite subset T of X is said to be shattered by C if {T ∩ c | c ∈ C} = 2^T. The Vapnik-Chervonenkis dimension of C, VCdim(C), is the largest integer k such that there is a set T ⊆ X of cardinality k that is shattered by C. If no such k exists, VCdim(C) is infinite.

Now we are in a position to state a theorem from [1] which relates finding consistent hypotheses to pac-learnability. We call a hypothesis h consistent with a sample S = ((x_1, l_1), ..., (x_m, l_m)) if h(x_i) = l_i for each 1 ≤ i ≤ m.
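The shattering condition is easy to check mechanically for small cases. As an added illustration (not from the paper), the following Python sketch tests whether a finite set of reals is shattered by the class of closed intervals: a 2-point set is shattered, while no 3-point set is (the labeling 1, 0, 1 cannot be realized), so the VC-dimension of intervals is 2.

```python
def interval_labels(points, a, b):
    """Labeling induced on `points` by the closed interval [a, b]."""
    return tuple(1 if a <= x <= b else 0 for x in points)

def shattered_by_intervals(points):
    """True iff every labeling of `points` is realized by some closed interval."""
    # Intervals with endpoints among the points already realize every labeling
    # that is realizable at all; add the all-zero labeling separately.
    labelings = {interval_labels(points, a, b)
                 for a in points for b in points if a <= b}
    labelings.add(tuple(0 for _ in points))
    return len(labelings) == 2 ** len(points)

print(shattered_by_intervals([0.0, 1.0]))        # True:  VCdim >= 2
print(shattered_by_intervals([0.0, 1.0, 2.0]))   # False: 1,0,1 is not realizable
```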

Theorem 1 [1] Let the notation be as above. If VCdim(C) = k, where k < ∞, then for 0 < ε, δ < 1 and sample size at least

    m(ε, δ) = max( (4/ε) log(2/δ), (8k/ε) log(13/ε) )

any algorithm A finding a consistent hypothesis is a learning algorithm. In particular, if C = (C_n)_{n∈ℕ}, k = k(n) = poly(n) and A is polynomially time bounded, then C is polynomially pac-learnable.
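Plugged-in numbers make the bound tangible (an added illustration; logarithms are taken base 2 here, as the bound is usually stated):

```python
import math

def theorem1_sample_size(eps, delta, k):
    """m(eps, delta) = max((4/eps) log2(2/delta), (8k/eps) log2(13/eps))."""
    return math.ceil(max((4 / eps) * math.log2(2 / delta),
                         (8 * k / eps) * math.log2(13 / eps)))

# Closed intervals on the line have VC-dimension k = 2.
print(theorem1_sample_size(eps=0.1, delta=0.05, k=2))   # on the order of 10**3
```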

The learning algorithms of this model are in general batched. This means that the full sample is available to the algorithm all the time. The type of algorithm we consider is not only on-line but also space-bounded. Basically, such an algorithm proceeds as follows: it receives the examples one at a time and its memory is restricted to storing at most d examples, d finite (and independent of ε and δ). After having received a new example the algorithm does some computation (which may use additional memory), but before receiving the next example the algorithm has to decide which d of the d + 1 currently known examples it wants to save, and has to clear the additional memory. There is also a counter which is incremented with each new example. This counter can be read and be set to 0 by the algorithm. The current hypothesis h of the learning algorithm is not explicitly available, but there is a scheme in the learning algorithm which can compute h from the stored examples without any further information. Such a learning algorithm is called d-space bounded.

We proceed with a formal definition. A concept class C is d-compressible for d ∈ ℕ if there exists a scheme F : (X × {0,1})^d → H such that for all c ∈ C, all m ≥ d and all samples S = ((x_1, l_1), ..., (x_m, l_m)) for c there exist 1 ≤ i_1, ..., i_d ≤ m such that

    F((x_{i_1}, l_{i_1}), ..., (x_{i_d}, l_{i_d}))(x_j) = l_j

for all 1 ≤ j ≤ m. This means that the hypothesis recovered by F from the d examples is consistent with the whole sample. In the above model we explicitly use an ordered sequence ((x_{i_1}, l_{i_1}), ..., (x_{i_d}, l_{i_d})) of examples, which is called the compression sequence. F is called a d-recovering scheme for C. The ordered compression parameter of C, d(C), is defined by

    d(C) = min{ d | C is d-compressible }.

A concept class C is called (polynomially) d-space learnable if there is a (polynomial) pac-learning algorithm A for C which stores at most d examples at a time and uses no additional information except a counter as described above. In case that F does not depend on the order of the compression sequence we talk about a compression set, an order independent recovering scheme, the order independent compression parameter d_ind(C), and refer to C as order independent d-space learnable. In this case we require that the set of examples in the sample S has cardinality at least d. Note that we do not demand that the d-compression scheme, i.e. the computation of the compression sequence from the sample, is efficient. We only require the existence of such a sequence. We shall refer to these two models as the ordered and the order independent model.
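As a concrete instance of these definitions (an added sketch, not from the paper), closed intervals on the real line are 2-compressible with an order independent scheme: keep the leftmost and rightmost positive examples, and recover the smallest interval containing them; an all-negative sample falls back to the empty interval.

```python
def compress_interval(sample):
    """Choose 2 examples from `sample` (a list of (x, label) pairs consistent
    with some closed interval) as an order independent compression set."""
    positives = [x for x, l in sample if l == 1]
    if positives:
        return [(min(positives), 1), (max(positives), 1)]
    return sample[:2]                     # all-negative sample: any 2 examples do

def recover_interval(comp):
    """2-recovering scheme F: map a compression set to a hypothesis."""
    pos = [x for x, l in comp if l == 1]
    if not pos:
        return lambda x: 0                # empty interval
    lo, hi = min(pos), max(pos)
    return lambda x: 1 if lo <= x <= hi else 0

# The recovered interval [lo, hi] lies inside the target interval and contains
# every positive example, so it is consistent with the whole sample.
sample = [(0.3, 0), (1.1, 1), (2.5, 1), (4.0, 0), (1.9, 1)]
h = recover_interval(compress_interval(sample))
assert all(h(x) == l for x, l in sample)
```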


1.2 Known results about space-bounded learning and the compression parameter

Previous work on learning with space restrictions includes [4], [6] and [8]. Haussler in [4] was the first to deal with space-bounded learning. He presents space efficient learning algorithms for a restricted class of decision trees and for certain finite valued functions on the real line. The space used by these two algorithms is polynomially bounded in the VC-dimension of the concept class. The class of nested differences of intersection-closed concept classes has been investigated by Helmbold, Sloan and Warmuth in [6]. They show that for special classes with this property space efficient learning is possible.

Littlestone and Warmuth in [8] introduce the notion of a kernel size, which is very similar to what we call the order independent compression parameter of a concept class. They show a bound on the sample size necessary for pac-learning in which the kernel size plays a role like the VC-dimension in the main theorem of [1]. Their sample size is max( (2/ε) ln(1/δ), (4k/ε) ln(4k/ε) + 2k ), where k is the kernel size. However, Littlestone and Warmuth in [8] do not consider space-bounded learning. Space-bounded learning where the space also depends on the quality parameters ε, δ has been investigated by Boucheron and Sallantin in [2], and by Schapire in [10].

Our definition of compression is similar to that used by Floyd in [3]. She uses maximum and maximal classes. A concept class is maximal if adding any concept to the class increases the VC-dimension of the concept class. Let Φ_d(m) be defined as Σ_{i=0}^{d} (m choose i) for m ≥ d, and as 2^m for m < d. A concept class is maximum if, for every finite subset Y of X, C|_Y = {c ∩ Y | c ∈ C} contains Φ_d(|Y|) concepts on Y. In [3] Floyd proves the existence of a d-compression scheme for maximum classes C ⊆ 2^X of VC-dimension d on a finite X, as well as for maximum and maximal concept classes C ⊆ 2^X when X is infinite. She then constructs space-bounded learning algorithms for some concept classes with this property. Her learning algorithms produce hypotheses which are consistent with the whole sample.
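A direct Python rendering of Φ_d (added for illustration):

```python
from math import comb

def phi(d, m):
    """Phi_d(m) as defined above: sum of binomials (m choose i), i = 0..d,
    for m >= d, and 2**m for m < d."""
    return 2 ** m if m < d else sum(comb(m, i) for i in range(d + 1))

print(phi(2, 5))   # 1 + 5 + 10 = 16 concepts on a 5-point set
print(phi(3, 2))   # m < d, so 2**2 = 4
```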

2 New results

As indicated by the results of Floyd in [3], further progress in space-bounded learning seems to require learning algorithms that no longer insist on consistency with the whole sample. In this paper we present a (non-consistent) learning algorithm for space-bounded learning, the Trial and Error Learning Algorithm, and show the following.

Theorem 2 (Characterisation of d-space learnable classes) Let C = (C_n)_{n∈ℕ} be d(n)-compressible (or order independent d(n)-compressible) and pac-learnable by the hypothesis class H = (H_n)_{n∈ℕ} with a sample size m(ε, δ, n) as given in Theorem 1. For fixed n, ε, δ let m_0 := m(ε, δ/2, n) and let t := ln(2/δ) · (m_0 ln m_0)^{d(n)} · (d(n)+1)^{d(n)}. Then C is d(n)-space learnable by H (or order independent d(n)-space learnable by H) with sample size t · m(ε, δ/(2t), n) (or (t/d(n)!) · m(ε, δ/(2t), n)). In particular, if m(ε, δ, n) = poly(1/ε, 1/δ, n), d(n) ≤ d for all n ∈ ℕ and some constant d, and the recovering algorithm is polynomially time bounded, then C is polynomially d-space learnable by H.
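The compression parameter enters the sample size as an exponent. The following small helper (an illustration, not from the paper; it reflects only the condition t ≥ m^d · ln(2/δ) used in the proof of Theorem 2 below) shows how quickly the number of trials grows with d:

```python
import math

def trials_needed(m, d, delta):
    """Smallest t with t >= m**d * ln(2/delta), so that the chance of never
    seeing a compression sequence in the first d positions of any trial is
    at most delta/2 (see the proof of Theorem 2 below)."""
    return math.ceil(m ** d * math.log(2 / delta))

for d in (1, 2, 3, 4):
    print(d, trials_needed(m=1000, d=d, delta=0.05))   # grows like m**d
```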

Note that the compression parameter is a lower bound on the number of examples to be stored by any learning algorithm. Thus it completely characterizes the space necessary and sufficient for pac-learning. Theorem 2 states that we can achieve polynomial-time learning algorithms for concept classes with constant compression parameter. The relationship between the compression parameters d_ind, d and the complexity parameter n (resp. the VC-dimension) is discussed in Section 5.

For the following problems we compute order independent compression parameters and design recovering schemes with hypothesis class H = C:
- intersection closed concept classes,
- halfspaces in ℝⁿ.
Hence we have space-bounded learning algorithms for these classes. We also present such space-bounded learning algorithms in the order dependent model for the following problems:
- n-polygons in ℝ²,
- finite unions of n-polygons in ℝ².

We further give some insight into the influence of order dependency on space-bounded learning. We show that for C = 2^X the compression parameter and the order independent compression parameter differ by at least a logarithmic factor. Finally we show that in both cases there are classes where d(C) < VCdim(C).

A further extension of the model allows robust learning as defined in [5, 7]. Here the hypothesis class may be much weaker than the target concept class. The objective is to find a hypothesis which is almost the best approximation possible in this hypothesis class. The modifications necessary consist of a second counter and one more memory cell capable of storing an integer. An easy modification of our Trial and Error Learning Algorithm shows that in the extended model the results of Theorem 2 also hold for robust learning if space 2d(C) is allowed.

3 The Trial and Error Learning Algorithm

We begin this section with a description of the Trial and Error Learning Algorithm. On input ε, δ, and n the algorithm determines m_0 := m(ε, δ/2, n) and d = d(n), the values mentioned in Theorem 2, and t := ln(2/δ) · (m_0 ln m_0)^d · (d + 1)^d. It further computes m := m(ε, δ/(2t), n), again using the bound of Theorem 1. The algorithm runs at most t trials. In each trial it sees m labeled examples (called an m-sample), one after another. It stores the first d examples and uses the d-recovering scheme to produce a hypothesis h ∈ H_n from them. For each of the m examples it tests whether h is consistent with it. The trial is successful if h happens to be consistent with all of them. In this case h is the hypothesis produced by the learning algorithm and the algorithm stops. Otherwise a new trial is started.

For the proof of Theorem 2 we need the following definition. Fix c ∈ C. We say that an m-sample S is ε-bad iff there exists a hypothesis h ∈ H_n which is consistent with S but D(c Δ h) > ε, where c is the target concept.
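To fix ideas, here is a schematic Python rendering of the trial loop (an illustration, not the authors' implementation; `draw_example` and `recover` are assumed stand-ins for the example oracle and the d-recovering scheme F):

```python
def trial_and_error(draw_example, recover, d, t, m):
    """Schematic Trial and Error Learning Algorithm.

    draw_example() -> (x, label): draws one labeled example from D.
    recover(stored) -> hypothesis: the d-recovering scheme F; a hypothesis
        maps a point x to a label in {0, 1}.
    d: space bound, t: number of trials, m: examples inspected per trial.
    """
    for _ in range(t):
        stored = [draw_example() for _ in range(d)]     # keep only d examples
        h = recover(stored)                             # hypothesis from storage
        consistent = all(h(x) == label for x, label in stored)
        consistent = consistent and all(
            h(x) == label
            for x, label in (draw_example() for _ in range(m - d)))
        if consistent:
            return h                                    # successful trial
    return None                                         # all t trials failed
```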

Lemma 1 With m and t as above, the probability that at least one ε-bad m-sample appears in the t trials is at most δ/2.

Proof. Since m = m(ε, δ/(2t), n), Theorem 1 guarantees that the probability that an arbitrary m-sample S is ε-bad is at most δ/(2t). The assertion follows by summing over the t trials. □

Lemma 2 A trial is successful with probability at least 1/m^d in the ordered model and with probability at least d!/m^d in the order independent model.

Proof. We know that there is a sequence (set) M = ((x_{i_1}, l_{i_1}), ..., (x_{i_d}, l_{i_d})) such that F(M) is a hypothesis consistent with the whole m-sample S. The probability of success is at least the probability of getting the examples from M at the first d positions. The latter probability in the ordered case is

    1 / ( m(m−1)···(m−d+1) ) ≥ 1/m^d,

and in the order independent model

    d! / ( m(m−1)···(m−d+1) ) ≥ d!/m^d.  □
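A quick Monte Carlo check of the ordered-case probability (illustrative only): the chance that d designated examples arrive first, in the right order, is 1/(m(m−1)···(m−d+1)) ≥ 1/m^d.

```python
import random

def hit_probability(m, d, trials=200_000):
    """Estimate the probability that items 0..d-1 occupy the first d positions,
    in this order, of a random permutation of m items (ordered model)."""
    hits = 0
    for _ in range(trials):
        perm = random.sample(range(m), m)
        hits += perm[:d] == list(range(d))
    return hits / trials

m, d = 8, 2
exact = 1 / (m * (m - 1))                       # 1/(m(m-1)...(m-d+1)) for d = 2
print(hit_probability(m, d), exact, 1 / m**d)   # estimate, exact, lower bound
```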

We now proceed with the proof of the theorem. With the above choice of t one can easily verify that t ≥ m^d · ln(2/δ) holds. If we run t trials in the ordered model (resp. t_ind = t/d! trials in the order independent model), then the probability that all of them are unsuccessful is at most

    (1 − 1/m^d)^t ≤ (1 − 1/m^d)^{m^d · ln(2/δ)} ≤ e^{−ln(2/δ)} = δ/2

(the case of the order independent model is analogous). As a successful trial produces an ε-good hypothesis with probability at least 1 − δ/2 by Lemma 1, we conclude that with probability (1 − δ/2)² > 1 − δ an ε-good hypothesis is found within m^d · ln(2/δ) trials, which implies Theorem 2.

The modification of the Trial and Error Learning Algorithm into a robust learning algorithm is very simple. In each trial, the number of incorrectly classified examples is counted using the second counter. The additional memory cell stores the smallest number of misclassified examples observed in the trials already executed. The additional d(C) storage units store the examples used to recover the currently best hypothesis. An easy variant of the above analysis shows that the hypothesis produced in the "best" trial fulfills the requirements of robust learning.
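A sketch of the robust variant just described (again illustrative; `draw_example` and `recover` as in the earlier sketch), keeping the stored examples of the best trial seen so far:

```python
def robust_trial_and_error(draw_example, recover, d, t, m):
    """Robust variant: run t trials and keep the compression examples of the
    hypothesis with the fewest misclassifications observed so far."""
    best_errors, best_stored = None, None
    for _ in range(t):
        stored = [draw_example() for _ in range(d)]
        h = recover(stored)
        errors = sum(h(x) != label for x, label in stored)
        errors += sum(h(x) != label
                      for x, label in (draw_example() for _ in range(m - d)))
        if best_errors is None or errors < best_errors:   # new best trial
            best_errors, best_stored = errors, stored     # 2d examples stored overall
    return recover(best_stored)
```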

4 Applications

In this section we present some examples of d-space bounded learnable concept classes with hypothesis class H = C. In fact we only design d-recovering schemes for the concept classes; the rest follows from the results of the last section. In order to get a good upper bound on the compression parameter d_ind resp. d of special concept classes, we first note a result which appears in (almost) equivalent versions in [6] and [9]. We say that a concept class is intersection closed if the intersection of the elements of any subclass is also an element of the class.

Example 1 Let C be an intersection closed concept class with finite VC-dimension k. Then d_ind(C) ≤ k.

The proof is a direct consequence of the following lemma. For each c ∈ C and each set T ⊆ X of positive examples, the intersection of all c' ∈ C consistent with T is called Closure(T). We define ⋂∅ = ∅, which means that Closure(T) = ∅ if and only if there is no c ∈ C such that c(x) = 1 for all x ∈ T. Obviously Closure(T) ∈ C. We say that T' ⊆ T is a spanning set of T if and only if T' is a set of minimal cardinality with Closure(T) = Closure(T'). Thus, given a sample S of positive examples, the compression set is a spanning set S' of S and the recovering scheme is Closure(S'). In [6, 9] the following is shown:

Lemma 3 [6, 9] If C is an intersection closed concept class and T is a finite subset of the domain, then every spanning set T' of T is shattered by C.

This lemma guarantees that d_ind(C) ≤ VCdim(C) holds for all intersection closed classes C. Examples of such classes are the monomials or conjunctive normal forms on n Boolean variables.
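As a concrete instance of the closure idea (an added illustration): axis-parallel boxes in ℝⁿ are intersection closed with VC-dimension 2n, the closure of a set of positive points is their bounding box, and a spanning set consists of at most 2n positives attaining the extreme coordinates.

```python
def closure_box(points):
    """Closure(T) for axis-parallel boxes: the smallest box containing T."""
    n = len(points[0])
    lows = [min(p[i] for p in points) for i in range(n)]
    highs = [max(p[i] for p in points) for i in range(n)]
    return lambda x: 1 if all(lo <= xi <= hi
                              for lo, xi, hi in zip(lows, x, highs)) else 0

def compress_box(sample):
    """Spanning set: at most 2n positive examples attaining the min/max
    coordinate in each of the n dimensions (assumes a positive example exists)."""
    pos = [x for x, l in sample if l == 1]
    chosen = []
    for i in range(len(pos[0])):
        chosen.append(min(pos, key=lambda p: p[i]))
        chosen.append(max(pos, key=lambda p: p[i]))
    return chosen

sample = [((1, 2), 1), ((3, 5), 1), ((2, 3), 1), ((0, 0), 0), ((4, 4), 0)]
h = closure_box(compress_box(sample))
assert all(h(x) == l for x, l in sample)   # consistent with the whole sample
```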

Our next example is a concept class which is not intersection closed.

Example 2 Let P_n be the class of closed halfspaces in ℝⁿ. We show that d_ind(P_n) ≤ n + 1 and give the corresponding recovering scheme.

We first state a few properties of convex polytopes which we need in the following. For a set X of points let CH(X) denote the closed convex hull of X. The boundary of a convex polytope P is denoted by ∂(P), its interior by int(P) and its exterior by ext(P). We make the convention that for a point p (a zero-dimensional polytope) int(p) = ∂(p) = p. A face of P is a maximal n'-dimensional convex polytope in ∂(P), for some n' ≤ n. Let d(x, y) denote the Euclidean distance of points x and y.

A sample for a halfspace consists of finitely many positively and negatively labeled points of ℝⁿ. Let POS and NEG denote the respective sets. If NEG = ∅ then let z denote the largest x_1-value in any of the examples (x_1, ..., x_n) in S, and let (x, 1) ∈ S be an example with this property. Any (n+1)-element subset of S including this example is a compression set. Similarly, if POS = ∅, we look for the smallest x_1-value. We now assume that POS and NEG are both nonempty. Let r = d(CH(POS), CH(NEG)) := min{ d(x, y) | x ∈ CH(POS), y ∈ CH(NEG) }. As CH(POS) ∩ CH(NEG) = ∅, we have r > 0. From now on let x ∈ CH(POS) and y ∈ CH(NEG) be such that they minimize the distance, i.e. the length of the line segment s = xy is r. Let H_{x,y} denote the hyperplane orthogonal to s which cuts s in the middle. The hyperplane H_{x,y} separates POS from NEG. Let F_x be the face of CH(POS) of minimum dimension such that x is in the interior of F_x, and let F_y be defined analogously. By simple geometric considerations x, y can be chosen such that dim(F_x) + dim(F_y) ≤ n − 1. The faces F_x and F_y can then be defined by at most n + 1 points, which form the compression set.

We now describe the recovering algorithm. If the n + 1 points it receives are all positively labeled, it computes the maximum z of the x_1-values of the examples (x_1, ..., x_n) ∈ S. Let h be the hyperplane orthogonal to the x_1-axis which contains (z + 1, 0, ..., 0). The recovering algorithm returns the halfspace of h containing the examples. The algorithm proceeds similarly if it gets only negative examples. Otherwise the examples are split into positive and negative ones; let A and B denote the resulting subsets. Form all pairs (A', B') with A' ⊆ A and B' ⊆ B, and compute the distance between the affine spaces spanned by A' and B'. For a minimum distance compute a line segment s = xy which realizes this distance, and output as hypothesis that halfspace of H_{x,y} which contains the positive examples. For A ∪ B a compression set as defined above, this algorithm produces a halfspace consistent with the whole sample.
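In practice the hull-distance step of the recovering algorithm can be delegated to a hard-margin linear SVM, which for separable data returns exactly the hyperplane bisecting the shortest segment between the two convex hulls; this is a stand-in for the face enumeration described above, not the authors' construction (it assumes numpy and scikit-learn are available).

```python
import numpy as np
from sklearn.svm import SVC

def recover_halfspace(examples):
    """Recover a halfspace from labeled points (x, l), l in {0, 1}.
    A hard-margin linear SVM (large C) bisects the shortest segment between
    the convex hulls of the positives and negatives, like H_{x,y} above."""
    X = np.array([x for x, _ in examples], dtype=float)
    y = np.array([l for _, l in examples])
    clf = SVC(kernel="linear", C=1e9).fit(X, y)          # ~hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    return lambda p: 1 if np.dot(w, p) + b >= 0 else 0   # the halfspace

# Recovering from a small compression set; the result also classifies
# points outside the set correctly as long as the data are separable.
comp = [((0.0, 0.0), 1), ((1.0, 0.0), 1), ((0.0, 2.0), 0)]
h = recover_halfspace(comp)
print(h((0.5, -1.0)), h((0.0, 3.0)))   # expected: 1 0
```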

With our next example we switch to the ordered model. The recovering algorithm relies essentially on the fact that the d examples are presented in a special order. We do not know of any recovering algorithm for this class that works in the order independent model. We use another recovering algorithm for halfplanes, which we need as a subroutine in the original problem.

Example 3 The concept class in the focus of this example is convex polygons with at most n corners in ℝ². The corresponding VC-dimension is 2n + 1 (see [1]). For simplicity we describe a recovering scheme for triangles, noting that the same arguments suffice for any n. Let T be the concept class of triangles. Given a sample S of m examples, a compression sequence of length 7 can be selected as follows: we choose 2 sample points for every edge of the target triangle. If at least one of the chosen pair of points is negative, then we have to move the edge a bit until we can describe a triangle which contains all positive points and no negative one. To describe the limits of this motion we need an additional 7th point. If S contains fewer than 7 distinct examples then there is a trivial solution.

Now we present the details. Let t ∈ T be the target triangle. Let e be an arbitrary edge of t and l be its supporting line. We shift l parallel towards the interior of t until we meet the first positive example p. Next we determine the example q in S which induces the smallest angle between the two lines l and l_1, the latter given by the two points p and q. If q is also a positive example and no negative point is lying on l_1, then we have a sufficient description of the edge e' of our hypothesis triangle. Otherwise q is a negative point. We choose a negative point q' on l_1 such that no other example lies between p and q'. Now we determine the example r such that the line l_2 defined by p and r has smallest angle with l_1. We then rotate l_1 around the center p by half this angle. The direction of rotation is such that q' will lie on the opposite side of the line from the hypothesis triangle. Then the hypothesis formed by replacing the original edge e by an appropriate part of the rotated line is consistent with the sample. Since we know the order of the three points, our recovering scheme can easily compute the edge given only these points.

The compression sequence is given as follows: for each of the three edges select 3 points p, q' and r as above. Check for which of the 3 edges the final angle of rotation is the smallest one; this angle will also do for the other 2 edges. The corresponding 3 points are presented as the first 3 examples; the other 2 edges are given by the following 4 points. From this compression sequence the recovering scheme can compute 3 lines and the minimum angle in a straightforward way. It is interesting to note that we have again a d-compressible concept class C such that d ≤ VCdim(C). □
The last example of this section shows that the VCdim is not the lower bound of the compression parameter d, if we are in the ordered model.

Example 4 Let U be the concept class containing the union of 2 triangles. From the above

example we conclude that 14 is an upper bound of the compression parameter d. But we can save a further example by choosing the minimum angle of rotation of both triangles and presenting this one rst. The recovering scheme works basically as the one given in the above example. A simple argument shows that the VC{dimension of this class is at least 14.

It is obvious that we can enlarge this gap between the VC{dimension and the compression parameter d inspecting unions of more than only 2 triangles. With the above idea we only need one point for the determination of the minimum angle of rotation, which is relevant for all edges. They can be represented as before by an (ordered) pair of examples. Note that the concept classes C in examples ??, ?? and ?? given in this section are polynomially d(C )-space learnable, if n is a constant.

5 Discussion of the Compression Parameters

We have already seen in example ?? that the VC{dimension is not a lower bound on the ordered compression parameter. We would like to show that the Vapnik-Chervonenkis dimension is also not a lower bound on the compression parameter in the unordered model.

Example 5 We consider the case X = f1; : : :; ng and C = 2X . Then VCdim(C ) = n. Let c  X be the target concept. We shall identify c with the bit vector (b ; : : :; bn), where 1

bi = c(i). A compression set for c can also be expressed as a vector (r ; : : : ; rn) where ri is equal to bi or ri is the gap symbol X . ri = X means that the example (i; bi) is not in the compression set. If a vector contains k gap symbols we call it a compression vector with k 1

gaps. We shall show in the following that gaps can be used to recover the missing labels. First we show how an injective mapping from the concepts (represented by the bit vectors) to the compression vectors with k gaps can be found (for a suitably chosen k). To this end x 0  k  n. We de ne a bipartite graph with vertex set V = B [ R and edge set E . B consists of all bit vectors and R of all compression vectors with k gaps. Then

9

jB j = 2n and jRj =

  n 2n?k . k

For b 2 B and r 2 R the edge (b; r) exists if ri 6= X )bi= ri, i.e., b and r agree outside the gaps. Then, for all b 2 B and r 2 R, deg(b) = nk and deg(r) = 2k . Note that any matching of size 2n will establish an injective mapping from B to R. A sucient condition for such a matching to exist is following

j?(B 0)j  jB 0j; for all B 0  B ;

(1)

where ?(B 0) is the set of all neighbouring vertices of B 0. Condition ?? is automatically  satis ed if for every b 2 B and r 2 R it holds that deg(r)  deg(b), i.e., 2k  nk . It can be shown that this holds if k < 0:77  n, for large n. Moreover, for k  0:78  n and large n, no matching of size 2n exists because jRj < 2n . Hence the gap-coding technique cannot lead to compression sets smaller than O(n). For k = 1 a direct coding-decoding procedure can be given: The bit vectors (0; : : : ; 0) and (1; : : : ; 1) are mapped to (X; 0; : : : ; 0) and (X; 1; : : : ; 1), respectively. For any other bit vector let i be the rst position such that bi 6= bi . This vector is mapped to (b ; : : : ; bi ; X; bi ; : : :; bn), i.e., the example (i + 1; bi ) is not in the compression set. The label of i + 1 is recovered as bi = 1 ? bi. Note that the gap is at the rst position only for the one and zero vector. +1

1

+2

+1

+1

In the ordered model an even stronger compression is possible.

Example 6 Let X = f1; : : : ; ng and C = 2X . Then VCdim(C ) = n. We show that d(C ) = n . n

2

Let d be such that d!  2n . d = nn is sucient for this choice. Let  be some (easy computable) surjective mapping from the permutations in Sd to C . For a given sample S  X let h be a consistent hypothesis. Any sequence of d distinct examples in S , ordered according to a  with  = ? (h) (relative to the lexicographic order on C ), is a compression sequence. Given a sequence of d examples, the recovering scheme simply computes the order type  of this sequence and returns the hypothesis (). log

2

log

1

Acknowledgements. We would like to thank Martin Dietzfelbinger for helpful discussions and Hans-Ulrich Simon for pointing out an inaccuracy in a previous version of this paper. Foued Ameur and Friedhelm Meyer auf der Heide are supported in part by the ESPRIT Basic Research Action No 7141 (ALCOM II) and by the DFG grant Di 412/2-1. Paul Fischer and Klaus-U. Ho gen are supported by the DFG grant We 1066/6-1 and by Bundesministerium fur Forschung und Technologie grant 01IN102C/2.

References [1] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik{Chervonenkis dimension. Journal of the Association on Computing Machinery, 36(4):929{965, 1989. 10

[2] Stephane Boucheron and Jean Sallantin. Some remarks about space-complexity of learning, and circuit complexity of recognizing. In Proceedings of the 4th Annual Workshop on Computational Learning Theory, pages 125{138, 1988. [3] Sally Floyd. On Space-bounded Learning and the Vapnik-Chervonenkis Dimension. Technical report, ICSI Berkeley, 1989. [4] David Haussler. Space Ecient Learning Algorithms. Technical report, UCSC, 1988. [5] David Haussler. Generalizing the pac model: Sample size bounds from metric{dimension based uniform convergence results. In Proceedings of the 30'th Annual Symposium on the Foundations of Computer Science, pages 40{46, 1989. [6] David Helmbold, Robert Sloan, and Manfred K. Warmuth. Learning nested di erences of intersection-closed concept classes. In Proceedings of the 2nd Annual Workshop on Computational Learning Theory, pages 41{56, 1989. [7] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward ecient agnostic learning. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 341{353, 1992. [8] Nick Littelstone and Manfred Warmuth. Relating Data Compression and Learnability. Technical report, UCSC, 1987. [9] B.K. Natarajan. On learning boolean functions. In Proceedings 19th ACM Symposium on Theory of Computing, pages 296{304, 1987. [10] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197{227, 1990.

11