Probably Almost Bayes Decisions

Paul Fischer, Stefan Pölt, Hans Ulrich Simon
Lehrstuhl Informatik II, Universität Dortmund, D-4600 Dortmund 50
paulf, poelt, [email protected]
Abstract
We put Bayes decision theory into the framework of pac-learning as introduced by Valiant [Val84]. Unlike classical Boolean concept learning, where functions f : {0,1}^n → {0,1} are approximated, we assume here that f(x) is 0 (or 1) with a certain probability. We develop a theoretical framework for estimating functions and reduce the classification problem to the problem of estimating parameters. Within this framework it is shown that classifications based on n conditionally independent Boolean features can efficiently be learned from examples. Our learning algorithm achieves, with probability 1 − δ, an error which comes arbitrarily close (up to an additive ε) to the optimal error of a perfect Bayes decision. It requires O((n^3/ε^4) · ln(n/δ)) examples. In the particular case of two-state classification, learning can be performed on a single neuron. Moreover, we relax the restriction of conditional independence to dependencies of bounded order k and show that in this case we need O((n^{3k^2+5k+3}/ε^{3k+9}) · ln(n^{k+1}/δ)) examples.
1 INTRODUCTION

This paper deals with classification problems of the following kind. There is a feature space X, a state space Y and a distribution function D : X × Y → [0,1]. The objective is to guess, with minimum error rate, the underlying state if we observe an object with feature vector x ∈ X. If the distribution function is known and easy to evaluate, then Bayes decision theory gives a constructive answer on how to achieve the minimum error rate: guess the state y ∈ Y with maximum a posteriori probability D[y | x]. For a more detailed introduction to Bayes decision theory and its importance for pattern recognition see [DH73]. Our aim is to put the elements of Bayes decision theory into the framework of pac-learning. An extension of pac-learning to real-valued functions in a probabilistic setting has been investigated by Haussler [Hau89]. Also, Kearns and Schapire [KS90] have recently considered the learning of probabilistic concepts (p-concepts). In this paper we investigate a case that cannot (obviously) be treated with their methods. We assume that D is fixed but unknown and that all predictions are based only on sample observations drawn according to D. We want to find a prediction mechanism which brings us with high probability arbitrarily close to the optimal Bayes decision (probably almost Bayes decisions). This project is certainly overambitious if we do not put any restriction on the class of distributions under consideration. It becomes manageable, however, if the following conditions hold:

1. D is describable by a formula depending on a "sufficiently small" set of parameters.
2. The effect of inaccurate estimations of the parameters on the prediction error can be "reasonably" bounded.

The paper is structured as follows: In chapter 2 we give a brief survey of Bayes decision theory. In chapter 3 we develop a theoretical framework for estimating distribution functions which satisfy conditions 1) and 2). Within this framework we reduce the problem of finding an approximate Bayes decision to the statistical problem of parameter estimation. In chapters 4 and 5 we apply our theory to two particular classes of distribution functions. In chapter 4 we consider the case of independent Boolean features. In chapter 5 we deal with the more complex case where dependencies of bounded order are allowed.
2 BAYES AND PROBABLY ALMOST BAYES DECISIONS

For the sake of a clear description of our approach, we restrict ourselves to the following simple situation:

1. X = {0,1}^n (Boolean feature vectors)
2. Y = {ω_0, ω_1} (2 states)

Let D : X × Y → [0,1] be a Boolean classification problem with 2 states and n Boolean features, say x_1, ..., x_n. Given a feature vector x ∈ X, Bayes decision means choosing the state ω_i with the higher a posteriori probability D[ω_i | x], which is related to the conditional probability D[x | ω_i] by:

\[
D[\omega_i \mid x] = \frac{D[x \mid \omega_i]\, D[\omega_i]}{D[x]} . \tag{2.1}
\]

Equivalently we may choose the state ω_i with the higher value of

\[
G_i(x) := D[\omega_i \mid x]\, D[x] = D[\omega_i]\, D[x \mid \omega_i] . \tag{2.2}
\]

The Bayes decision function is then

\[
B(x) = \begin{cases} 0 & \text{if } G_0(x) \ge G_1(x), \\ 1 & \text{otherwise.} \end{cases} \tag{2.3}
\]

The prediction error of any decision function C : X → {0,1} is given by

\[
\mathrm{error}(C) := D\big[\{(x, i) \mid i \ne C(x)\}\big] = \sum_{x} D[\omega_{1-C(x)} \mid x]\, D[x] . \tag{2.4}
\]

Since the Bayes decision chooses the state with the higher a posteriori probability, its error opt is minimal and equals the average minimum a posteriori probability, i.e.,

\[
\mathrm{opt} = \sum_{x} \min\{D[\omega_0 \mid x],\, D[\omega_1 \mid x]\}\, D[x] . \tag{2.5}
\]
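To make the decision rule (2.3) and the optimal error (2.5) concrete, here is a small sketch in Python (function and variable names are ours, not the paper's) that computes the Bayes decision and opt for a fully known joint distribution on {0,1}^2 × {ω_0, ω_1}.

```python
def bayes_decision_and_error(joint):
    """joint[(x, i)] = D[x, omega_i] for x a 0/1 tuple and i in {0, 1}.
    Returns the Bayes decision function B (eq. 2.3) and its error opt (eq. 2.5)."""
    xs = sorted({x for (x, _) in joint})
    decision, opt = {}, 0.0
    for x in xs:
        g0 = joint.get((x, 0), 0.0)          # G_0(x) = D[omega_0] D[x | omega_0] = D[x, omega_0]
        g1 = joint.get((x, 1), 0.0)          # G_1(x)
        decision[x] = 0 if g0 >= g1 else 1   # eq. (2.3)
        opt += min(g0, g1)                   # eq. (2.5): minimal a posteriori mass on x
    return decision, opt

# toy joint distribution with n = 2 features
D = {((0, 0), 0): 0.30, ((0, 1), 0): 0.10, ((1, 0), 0): 0.10, ((1, 1), 0): 0.05,
     ((0, 0), 1): 0.05, ((0, 1), 1): 0.10, ((1, 0), 1): 0.10, ((1, 1), 1): 0.20}
B, opt = bayes_decision_and_error(D)
print(B[(1, 1)], round(opt, 3))   # state with the larger joint mass, and the optimal error 0.3
```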
2.1 Definition

Let D_n be a class of distribution functions of the form D : {0,1}^n × {ω_0, ω_1} → [0,1] and D := ⋃_n D_n. We call D probably almost Bayes decidable (pab-decidable) if there exist an algorithm A and polynomials p_1, p_2 such that for all ε, δ ∈ [0,1], n ∈ ℕ, D ∈ D_n: If the input of A is a sample of size p_1(1/ε, 1/δ, n) drawn according to the distribution D, then A computes in time p_2(1/ε, 1/δ, n), with confidence at least 1 − δ, a decision function B̃ such that error(B̃) ≤ opt + ε.

A Bayes decision can be made if the probabilities of the two states and the conditional probabilities of a given feature vector x are known (see (2.2) and (2.3)). It therefore becomes important to find good estimations for the (conditional) distribution functions. In the next chapter we introduce new methods for achieving this aim.
3 PAC-ESTIMABLE DISTRIBUTION CLASSES

In this chapter we introduce the notions of pac-estimability with multiplicative or additive error and prove relations between them as well as closure properties. Our aim is to apply this theoretical framework to classes of distribution functions. In order to motivate our abstract approach, we consider the following two concrete examples:

Example 1 The class of distribution functions with independent Boolean features:

\[
D_1[x] = \prod_{j=1}^{n} (p_j)^{x_j} (1-p_j)^{1-x_j} , \tag{3.1}
\]

where p_j is the probability that (x_j = 1).

Example 2 The class of k-th order Bahadur-Lazarsfeld expansions, i.e., the class of distributions in which at most k Boolean features are correlated:

\[
D_2[x] = D_1[x] \sum_{l=0}^{k} \sum_{j \in J_l} \gamma_{j_1 \dots j_l}\, y_{j_1}(x) \cdots y_{j_l}(x) , \tag{3.2}
\]

where J_l := {(j_1, ..., j_l) | 1 ≤ j_1 < ... < j_l ≤ n}. The y_j(x) are the normalized variables

\[
y_j(x) = \frac{x_j - p_j}{\sqrt{p_j (1-p_j)}} . \tag{3.3}
\]

The γ_{j_1...j_l} are the correlation coefficients of order at most k of the variables x_{j_1}, ..., x_{j_l} and are defined as

\[
\gamma_{j_1 \dots j_l} = \sum_{x} y_{j_1}(x) \cdots y_{j_l}(x)\, D_2[x]
  = \sum_{b \in \{0,1\}^l} y^{b_1}_{j_1} \cdots y^{b_l}_{j_l}\, p^{b_1 \dots b_l}_{j_1 \dots j_l} , \tag{3.4}
\]

where y_j^0 = −√(p_j/(1−p_j)), y_j^1 = √((1−p_j)/p_j), and p^{b_1...b_l}_{j_1...j_l} is the probability that (x_{j_1} = b_1) ∧ ... ∧ (x_{j_l} = b_l). Therefore we may write (3.2) as

\[
D_2[x] = D_1[x] \sum_{l=0}^{k} \sum_{j \in J_l} \sum_{b \in \{0,1\}^l} Y(l, j, b) , \tag{3.5}
\]

where Y(l, j, b) := y^{b_1}_{j_1} y_{j_1}(x) \cdots y^{b_l}_{j_l} y_{j_l}(x)\, p^{b_1 \dots b_l}_{j_1 \dots j_l}.

Notice that each of the two classes is representable by a single formula ((3.1) resp. (3.5)). Each formula depends on n Boolean variables and poly(n) parameters (the p_j and the p^{b_1...b_l}_{j_1...j_l}). Any specific distribution function in the class is obtained by an appropriate choice of the latter parameters.
Note that not all choices lead to a distribution function; those which do are called reasonable. In addition, if examples are drawn according to a specific distribution, we may compute estimations for the underlying parameters, because they are probabilities of events ((x_j = 1) or (x_{j_1} = b_1) ∧ ... ∧ (x_{j_l} = b_l)) for which the examples constitute a Bernoulli experiment. It is a straightforward idea to convert these parameter estimations into estimations for the whole formula. The conversion process is guided by the structure of the formula and will produce estimations for subformulas as intermediate results. It is therefore reasonable to define the notion of pac-estimability in such a way that it applies to arbitrary formulas (the subformulas, for instance) which operate on the same parameters as the formula for the distribution itself.

Let us now assume that a distribution class D = (D_n)_{n∈ℕ} is given by a family (f_n)_{n∈ℕ} of formulas and a function N(n) such that for any n ∈ ℕ the following holds (we suppress the index n where it is self-evident):

1. f_n(·,·) depends on N(n) parameters from [0,1] and n Boolean variables, and N(n) is polynomially bounded.
2. Any distribution function f(q,·) in the class is obtained by a choice q ∈ R_n ⊆ [0,1]^{N(n)}, where R_n is the set of reasonable choices.
3. There are efficiently decidable events E_j ⊆ {0,1}^n for j = 1,...,N such that for all q, j:

\[
q_j = \sum_{x \in E_j} f(q, x) .
\]

In other words: q_j is the probability of E_j w.r.t. f(q,·).
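The parameter-estimation step described above can be sketched as follows (a minimal illustration with helper names of our own choosing; the paper only requires that each q_j be the probability of an efficiently decidable event E_j): the empirical frequency of E_j in the sample is the Bernoulli estimate used throughout.

```python
def estimate_parameters(sample, events):
    """sample: list of feature vectors drawn from f(q, .).
    events: list of predicates E_j(x); q_j is the probability of E_j under f(q, .).
    Returns the empirical frequencies q~_j, i.e. the raw Bernoulli estimates."""
    m = len(sample)
    return [sum(1 for x in sample if E(x)) / m for E in events]

# Example 1 (independent features): E_j(x) = (x_j = 1), so q~_j estimates p_j.
sample = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
events = [lambda x, j=j: x[j] == 1 for j in range(3)]
print(estimate_parameters(sample, events))   # [0.75, 0.25, 0.5]
```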
3.1 Definition

1) Let D be a probability distribution on {0,1}^n and θ ∈ [0,1]. Any subset E ⊆ {0,1}^n with

\[
\sum_{x \in E} D[x] \le \theta
\]

is called a D-fraction of at most θ. We simply say fraction if D is evident from the context.

Let us now assume that (g_n)_{n∈ℕ} is a family of formulas depending on the same parameters as the above-mentioned family (f_n)_{n∈ℕ} of distributions. Throughout the rest of the paper we use a tilde to denote the estimation of the corresponding object.

2) We call (g_n) multiplicatively pac-estimable w.r.t. (f_n) iff there exist an algorithm A and polynomials p_1, p_2 such that for all ε, θ, δ ∈ [0,1], n ∈ ℕ, q ∈ R_n ⊆ [0,1]^{N(n)}: If A is started on a sample S ⊆ {0,1}^n of size p_1(n, 1/ε, 1/θ, 1/δ) drawn according to the distribution function f_n(q,·), then A outputs in time p_2(n, 1/ε, 1/θ, 1/δ), with confidence at least 1 − δ, an estimation g̃(·) for g_n(q,·) and a predicate exception(x). The predicate holds only for an f_n(q,·)-fraction of at most θ of all x. In addition the following inequality is satisfied for all x which are not exceptions (called well-behaved in the sequel):

\[
\frac{\tilde g(x)}{1+\varepsilon} \le g_n(q, x) \le (1+\varepsilon)\, \tilde g(x) . \tag{3.6}
\]

The pair (g̃(·), exception) is called a multiplicative (ε, θ)-estimation for g_n(q,·) w.r.t. f_n(q,·). Analogously, the conditions

\[
\tilde g(x) - \alpha \le g_n(q, x) \le \tilde g(x) + \alpha \tag{3.7}
\]

and

\[
\frac{\tilde g(x)}{1+\varepsilon} - \alpha \le g_n(q, x) \le (1+\varepsilon)\, \tilde g(x) + \alpha \tag{3.8}
\]

lead to the notions of additive and linear pac-estimability, and of additive (α, θ)- and linear (ε, α, θ)-estimations w.r.t. f_n(q,·).

3) The distribution class D (given by (f_n)) is called multiplicatively pac-estimable if (f_n) is multiplicatively pac-estimable w.r.t. itself. Additive and linear pac-estimability are defined analogously.

4) Let (g_n^1), ..., (g_n^M) be families of multiplicatively (resp. additively, linearly) pac-estimable formulas (w.r.t. (f_n)). We call (g_n^1), ..., (g_n^M) uniformly multiplicatively (resp. additively, linearly) pac-estimable if all g_n^j can be estimated by a single algorithm which gets j as an additional input.

The notion of multiplicative pac-estimability is powerful because it provides a sufficient condition for pab-decidability (see theorem 3.3). In some applications it is hard to verify multiplicative pac-estimability directly and we have to use a detour via linear or additive pac-estimability. Lemmas 3.5 through 3.10 show how to convert one type of estimability into another and present some closure properties. The following result will prove useful in theorem 3.3 and in some applications:
3.2 Lemma

Let x be a Bernoulli variable, i.e. a random variable with values 0 or 1, let p be the probability of (x = 1), and let p̃_m be the estimated probability (relative frequency) after m independent Bernoulli trials. Then for all 0 < ε, θ, δ ≤ 1 and

\[
m \ge K(\varepsilon, \theta, \delta) := \left\lceil \frac{c}{\varepsilon^2 \theta} \ln\frac{2}{\delta} \right\rceil
\]

(for a suitable absolute constant c, made explicit via the Chernoff bounds in the full paper), the probability that for p̃ = p̃_m the following implications are valid is at least 1 − δ:

1. If p̃ ≥ θ/2 then p/(1+ε) ≤ p̃ ≤ p(1+ε),
2. If p̃ < θ/2 then p < θ.
Proof: The proof, which relies on the Chernoff bounds [Che52], will be presented in the full paper.
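A sketch of the estimator behind lemma 3.2, taking c = 3 in K purely for illustration (the exact constant is in the full paper, so treat this choice as an assumption): given at least K(ε, θ, δ) trials, the relative frequency is either accepted as a (1+ε)-multiplicative estimate or the probability is declared smaller than θ.

```python
import math
import random

def K(eps, theta, delta):
    # assumed instantiation of the sample size in lemma 3.2, with c = 3
    return math.ceil(3.0 / (eps * eps * theta) * math.log(2.0 / delta))

def bernoulli_estimate(trials, eps, theta):
    """trials: list of 0/1 outcomes with len(trials) >= K(eps, theta, delta).
    Returns ('multiplicative', p~) if p~ >= theta/2, else ('small', theta)."""
    p_tilde = sum(trials) / len(trials)
    if p_tilde >= theta / 2:
        return "multiplicative", p_tilde     # then p/(1+eps) <= p~ <= p(1+eps) w.h.p.
    return "small", theta                    # then p < theta w.h.p.

eps, theta, delta, p = 0.1, 0.05, 0.05, 0.3
m = K(eps, theta, delta)
trials = [1 if random.random() < p else 0 for _ in range(m)]
print(m, bernoulli_estimate(trials, eps, theta))
```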
3.3 Theorem

Let D = ⋃_n D_n be a class of distribution functions of the form D : {0,1}^n × {ω_0, ω_1} → [0,1]. For D ∈ D, let D_0 and D_1 denote the conditional distribution functions defined by D_i[x] := D[x | ω_i], and let D_c := {D_i | D ∈ D, i = 0, 1}, i.e., the class of conditional distribution functions corresponding to D. Then the following holds: If D_c is multiplicatively pac-estimable, then D is pab-decidable.
Proof: We only present a sketch here and refer the reader to the full version of the paper. By assumption there is an algorithm for D_c which produces (ε', θ)-estimations for arbitrary ε', θ. Our aim is to find good approximations to the discriminant functions defined as G_i(x) = D[x | ω_i] D[ω_i]. If D[ω_i] and D[x | ω_i], i = 0, 1, can be estimated with a small multiplicative error 1 + ε', then a decision based on the estimations will coincide with the Bayes decision whenever G_0(x) and G_1(x) differ by more than a factor of (1+ε')^4. In the other case, when the difference is less than a factor of (1+ε')^4, the estimations may not be in the same order as the G_i. Then we may decide in the wrong way, which blows up the error on x by a factor of at most (1+ε')^4. Thus the overall error may exceed the optimal one by a factor of at most (1+ε')^4. There are two problems that arise in this context. First, D[ω_i] may be very small, which implies that good multiplicative estimations require superpolynomial sample size. Using lemma 3.2 this case can be detected with high confidence, and we may then output a constant decision, namely ω_{1−i}. Secondly, the exceptions require a special treatment, but their additive contribution to the overall error can be shown to be bounded by θ. In the full paper we shall present the appropriate choices of ε' and θ which make the overall error less than (1 + ε/2)·opt + ε/2, which is at most opt + ε because opt ≤ 1. We also show that the desired confidences can be achieved and that the sample required is of polynomial size.
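The decision rule used in this reduction can be sketched as follows (a simplification of the proof: the names are ours, exceptions are not treated separately, and the small-prior test is assumed to have been carried out beforehand with lemma 3.2).

```python
def plug_in_decision(prior_est, cond_est, negligible, x):
    """prior_est[i] ~ multiplicative estimate of D[omega_i];
    cond_est[i](x) ~ multiplicative estimate of D[x | omega_i] (may be off on exceptions);
    negligible[i]  ~ True if lemma 3.2 certified that D[omega_i] < theta.
    Exceptions are ignored here; they only add at most theta to the overall error."""
    if negligible[0]:
        return 1                       # constant decision: always guess omega_1
    if negligible[1]:
        return 0                       # constant decision: always guess omega_0
    g = [prior_est[i] * cond_est[i](x) for i in (0, 1)]
    return 0 if g[0] >= g[1] else 1    # agrees with Bayes unless G_0, G_1 differ by < (1+eps')^4

prior_est = [0.6, 0.4]                        # toy estimates of D[omega_0], D[omega_1]
cond_est = [lambda x: 0.10, lambda x: 0.20]   # toy conditional estimates D~[x | omega_i]
print(plug_in_decision(prior_est, cond_est, [False, False], x=(1, 0)))   # -> 1
```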
3.4 Remark
In the full paper we prove that theorem 3.3 can be generalized to the case of a constant number of states.
3.5 Lemma
Uniform multiplicative pac-estimability is closed under:

1. poly(n)-times product,
2. poly(n)-times sum of purely positive (resp. purely negative) terms.
Proof: Let n be fixed, let M = M(n) be a polynomially bounded function, let h_1, ..., h_M be the uniformly multiplicatively pac-estimable formulas, and let h(q, x) = ∏_{j=1}^{M} h_j(q, x) be their product. Let q ∈ R_n be arbitrary but fixed.

i) Starting the estimation algorithms for the h_j with the parameters ε' = ε/(2M), θ' = θ/M, δ' = δ/M, we get with confidence at least 1 − Mδ' = 1 − δ multiplicative (ε', θ')-estimations h̃_j and predicates exception_j(x). Setting exception(x) := exception_1(x) ∨ ... ∨ exception_M(x), the probabilities of the exceptions for the h̃_j accumulate to at most Mθ' = θ. For those x which are well-behaved w.r.t. all formulas the following holds:

\[
\tilde h(x) = \prod_{j=1}^{M} \tilde h_j(x) \le (1+\varepsilon')^M\, h(q, x) \le e^{\varepsilon' M} h(q, x) = e^{\varepsilon/2} h(q, x) < (1+\varepsilon)\, h(q, x) .
\]

Symmetrically one can show h̃(x) > h(q, x)/(1+ε).

ii) Analogously, with ε' = ε, θ' = θ/M and δ' = δ/M the same holds for the sum. The factor 1 + ε is preserved by the distributive law.
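A quick numerical check of the parameter splitting in part i) (the values are chosen by us): with ε' = ε/(2M), the per-factor errors compound to at most (1+ε')^M ≤ e^{ε/2} < 1 + ε.

```python
import math

def compounded_error(eps, M):
    eps_prime = eps / (2 * M)        # per-factor multiplicative accuracy
    worst = (1 + eps_prime) ** M     # worst-case blow-up over M factors
    return eps_prime, worst

for eps, M in [(0.5, 10), (0.1, 1000)]:
    eps_prime, worst = compounded_error(eps, M)
    # (1 + eps')^M <= e^{eps/2} < 1 + eps, as claimed in the proof
    print(eps, M, round(worst, 4), round(math.exp(eps / 2), 4), 1 + eps)
```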
The proofs of lemmas 3.6 through 3.10 are similar to the one of lemma 3.5 and are omitted from this abstract.
3.6 Lemma

Uniform additive pac-estimability is closed under poly(n)-times sum. The following two lemmas show that, under some conditions, it is possible to convert one type of estimability into another.
3.7 Definition

We call (g_n) probably bounded w.r.t. (f_n) iff for all n, q, δ the following holds: f_n(q, {x | g_n(q, x) > 1/δ}) < δ.

We call (g_n) probably bounded away from zero w.r.t. (f_n) iff for all n, q, δ the following holds: f_n(q, {x | g_n(q, x) < δ}) < δ.
3.8 Lemma

A family of formulas which is additively pac-estimable and probably bounded away from zero w.r.t. (f_n) is also multiplicatively pac-estimable w.r.t. (f_n).
3.9 Lemma

A family of formulas which is multiplicatively pac-estimable and probably bounded w.r.t. (f_n) is also additively pac-estimable w.r.t. (f_n).
3.10 Lemma
A product of constantly many uniformly linearly pac-estimable and probably bounded (w.r.t. (f_n)) formulas is additively pac-estimable w.r.t. (f_n).
In the next two chapters we assume that n is fixed and speak of formulas instead of families of formulas.
4 INDEPENDENT BOOLEAN FEATURES

This chapter and the next one contain applications of the previously developed theory. We first deal with the class of distribution functions of independent Boolean features as defined in (3.1).
4.1 Lemma

{p_j^{x_j} | j = 1, ..., n} ∪ {(1 − p_j)^{1−x_j} | j = 1, ..., n} is a set of uniformly multiplicatively pac-estimable formulas.
Proof: We show how to estimate p_j^{x_j} for an arbitrary but fixed j ∈ {1, ..., n}. The considerations for (1 − p_j)^{1−x_j} are symmetric. Notice that for x_j = 0 the constant 1 is a perfect estimation. Let x_j = 1. Given a sample of size K(ε, θ, δ) we compute an estimation p̃_j for p_j. We know from lemma 3.2 that the following implications hold with a confidence of at least 1 − δ:

1. If p̃_j ≥ θ/2 then p_j/(1+ε) ≤ p̃_j ≤ p_j(1+ε),
2. If p̃_j < θ/2 then p_j < θ.

We consider the confident case. If p̃_j < θ/2, any x with x_j = 1 is classified as an exception. The probability of an exception is then bounded by p_j < θ. If p̃_j ≥ θ/2, then p̃_j is an appropriate estimation for p_j^{x_j} = p_j. The (ε, θ)-estimation output by the algorithm is now obvious. This proves the lemma.

4.2 Theorem

The class of classification problems with two states and conditionally independent Boolean features is pab-decidable.
Proof: Applying the preceding lemma and the closure property of lemma 3.5, we know that

\[
D_1[x] = \prod_{j=1}^{n} (p_j)^{x_j} (1-p_j)^{1-x_j}
\]

is a multiplicatively pac-estimable distribution class. The assertion of the theorem now follows immediately from theorem 3.3. An inspection of the relevant lemmas shows that we may bound the sample size for this class of problems by

\[
O\!\left(\frac{n^3}{\varepsilon^4} \ln\frac{n}{\delta}\right) .
\]

For a proof of the following theorem see the full paper.
4.3 Theorem

Learning the class of classification problems with two states and conditionally independent Boolean features can be performed on a single neuron.
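The following sketch (our own rendering, not the algorithm from the full paper) shows the plug-in learner for conditionally independent Boolean features and why a single neuron suffices: with estimated parameters, the comparison G̃_1(x) ≥ G̃_0(x) is a linear threshold test over the x_j whose weights are log-odds. Zero frequencies are handled by crude smoothing instead of the exception mechanism of lemma 4.1.

```python
import math

def fit_independent(sample):
    """sample: list of (x, i) pairs with x a 0/1 tuple and i in {0, 1}.
    Returns estimated priors D~[omega_i] and per-class feature probabilities p~_{i,j}."""
    n = len(sample[0][0])
    prior = [0.0, 0.0]
    counts = [[0.0] * n, [0.0] * n]
    for x, i in sample:
        prior[i] += 1
        for j, xj in enumerate(x):
            counts[i][j] += xj
    m = len(sample)
    p = [[(counts[i][j] + 1) / (prior[i] + 2) for j in range(n)] for i in (0, 1)]  # crude smoothing
    return [prior[i] / m for i in (0, 1)], p

def neuron_weights(prior, p):
    """Single-neuron (linear threshold) form of the plug-in decision:
    decide omega_1 iff sum_j w_j x_j + w_0 >= 0, with log-odds weights."""
    n = len(p[0])
    w = [math.log(p[1][j] / p[0][j]) - math.log((1 - p[1][j]) / (1 - p[0][j])) for j in range(n)]
    w0 = math.log(prior[1] / prior[0]) + sum(math.log((1 - p[1][j]) / (1 - p[0][j])) for j in range(n))
    return w, w0

sample = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 1), 0), ((0, 1, 0), 0), ((1, 0, 0), 0)]
prior, p = fit_independent(sample)
w, w0 = neuron_weights(prior, p)
x = (1, 0, 1)
print(1 if sum(wj * xj for wj, xj in zip(w, x)) + w0 >= 0 else 0)
```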
5 DEPENDENT BOOLEAN FEATURES

In this chapter we show the pab-decidability of the class of k-th order Bahadur-Lazarsfeld expansions as defined in (3.5).
5.1 Lemma

{y_j^b y_j(x) | j = 1, ..., n and b ∈ {0,1}}, where

\[
y_j(x) = \frac{x_j - p_j}{\sqrt{p_j(1-p_j)}}, \qquad
y_j^0 = -\sqrt{\frac{p_j}{1-p_j}}, \qquad
y_j^1 = \sqrt{\frac{1-p_j}{p_j}},
\]

is a set of probably bounded and uniformly linearly pac-estimable formulas.
Proof: Let j, ε, α, θ, δ be arbitrary but fixed and let b = 1 (the case b = 0 is symmetric).

Case 1: x_j = 0. Then y_j(x) = y_j^0 and therefore y_j(x) y_j^1 = y_j^0 y_j^1 = −1. A constant is obviously probably bounded and precisely estimated by itself.

Case 2: x_j = 1. Then y_j(x) = y_j^1 and therefore y_j(x) y_j^1 = y_j^1 y_j^1 = (1−p_j)/p_j. If y_j(x) y_j^1 > 1/δ then (1−p_j)/p_j = 1/p_j − 1 > 1/δ, which implies that p_j < δ/(1+δ) < δ. The function is therefore bounded by 1/δ for all but a fraction of at most δ of all x, i.e., it is probably bounded.

Informally, we estimate (1−p_j)/p_j as follows: If there is evidence that p_j ≤ θ, we may classify x with x_j = 1 as an exception. If there is evidence that 1 − p_j ≤ α/(1+α), then (1−p_j)/p_j ≤ α and 0 is an appropriate additive estimation. If there is evidence that p_j > θ and 1 − p_j > α/(1+α), then a multiplicative estimation is produced by applying the Chernoff bounds. We proceed in the spirit of lemma 3.2, but have to make a careful choice of the sample size K(ε', θ', δ'). Combining the multiplicative estimations p̃_j and 1 − p̃_j for p_j and 1 − p_j into the estimation (1 − p̃_j)/p̃_j for (1 − p_j)/p_j, the multiplicative factor of uncertainty is (1+ε')², which should be no more than 1 + ε. Since (1+ε')² = 1 + 2ε' + ε'² ≤ 1 + 3ε' for ε' ≤ 1, ε' = ε/3 is a suitable choice. Furthermore, δ' = δ/2 is a good choice, because we apply lemma 3.2 two times (for p_j and 1 − p_j). Finally, θ' should be chosen as min{θ, α/(1+α)}, because θ and α/(1+α) are the critical thresholds for p_j and 1 − p_j. Our estimation procedure therefore starts on a sample of K(ε/3, min{θ, α/(1+α)}, δ/2) examples and then computes the estimation p̃_j. From lemma 3.2 we know that the following holds with a confidence of at least 1 − δ: If p̃_j ≥ θ/2 and 1 − p̃_j ≥ α/(2(1+α)), then ((1−p_j)/p_j)/(1+ε) ≤ (1−p̃_j)/p̃_j ≤ (1+ε)·(1−p_j)/p_j. If p̃_j < θ/2, then p_j < θ and x is classified as an exception. If 1 − p̃_j < α/(2(1+α)), then (1−p_j)/p_j < α and 0 is an appropriate additive estimation.

In any case, we deliver with high confidence a value which is precise, a multiplicative estimation or an additive estimation. This constitutes a linear estimation, and it is straightforward to construct the (ε, α, θ)-estimation.
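Procedurally, the case analysis of this proof looks as follows (a sketch in our notation; the sample is assumed to have the size K(ε/3, min{θ, α/(1+α)}, δ/2) chosen above, so the multiplicative accuracy ε enters only through the sample size).

```python
def estimate_odds(trials, alpha, theta):
    """Estimate (1 - p)/p from 0/1 trials drawn with success probability p,
    following the case analysis in the proof of lemma 5.1.
    Returns ('exception', None), ('additive', 0.0) or ('multiplicative', value)."""
    p_tilde = sum(trials) / len(trials)
    if p_tilde < theta / 2:
        return "exception", None                       # evidence that p < theta
    if 1 - p_tilde < alpha / (2 * (1 + alpha)):
        return "additive", 0.0                         # evidence that (1 - p)/p < alpha
    return "multiplicative", (1 - p_tilde) / p_tilde   # within factor (1 + eps) w.h.p.

trials = [1, 0, 1, 1, 0, 1, 1, 1]   # toy sample; in the proof its length is K(eps/3, ..., delta/2)
print(estimate_odds(trials, alpha=0.2, theta=0.05))
```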
5.2 Lemma

{p^{b_1...b_l}_{j_1...j_l} | b = (b_1, ..., b_l) ∈ {0,1}^l, 1 ≤ l ≤ k, 1 ≤ j_1 < ... < j_l ≤ n} is a system of probably bounded and linearly pac-estimable (constant) function classes.

Proof: Since p^{b_1...b_l}_{j_1...j_l} is a probability, its value is bounded by 1 and it is therefore probably (even certainly) bounded. A sample size of K(ε, α, δ) leads to a linear estimation (with no exceptions). As in the proof of lemma 4.1, either p̃^{b_1...b_l}_{j_1...j_l} ≤ α/2 and 0 is an additive estimation, or p̃^{b_1...b_l}_{j_1...j_l} > α/2 and it estimates p^{b_1...b_l}_{j_1...j_l} multiplicatively. Both assertions hold with a confidence of at least 1 − δ. Again it is clear how the (ε, α, θ)-estimation is defined.
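To see that only polynomially many such parameters occur, the following sketch (our construction) enumerates the events (x_{j_1} = b_1) ∧ ... ∧ (x_{j_l} = b_l) for all orders l ≤ k — there are Σ_{l≤k} C(n,l)·2^l = O(n^k 2^k) of them — and estimates each by its empirical frequency.

```python
from itertools import combinations, product

def joint_events(n, k):
    """All events (x_{j1}=b1) ^ ... ^ (x_{jl}=bl) with 1 <= l <= k, as (indices, bits) pairs."""
    return [(js, bs) for l in range(1, k + 1)
            for js in combinations(range(n), l)
            for bs in product((0, 1), repeat=l)]

def estimate_joint(sample, n, k):
    """Empirical frequencies p~^{b1...bl}_{j1...jl} for every event of order at most k."""
    m = len(sample)
    return {(js, bs): sum(1 for x in sample if all(x[j] == b for j, b in zip(js, bs))) / m
            for js, bs in joint_events(n, k)}

sample = [(1, 0, 1), (1, 1, 0), (0, 0, 1)]
est = estimate_joint(sample, n=3, k=2)
print(len(est), est[((0, 2), (1, 1))])   # number of parameters, and the estimate for (x_1=1) ^ (x_3=1)
```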
5.3 Lemma
If D, P are probability distributions and D(x) = P(x) · Q(x) for all x ∈ {0,1}^n, then Q(x) is probably bounded away from zero (w.r.t. the distribution D).

Proof: Since 0 ≤ P(x) ≤ 1, Q(x) < δ implies that D(x) < δ · P(x). Summing over all such x shows that Q(x) ≥ δ for all but a D-fraction of at most δ of all x ∈ {0,1}^n.

5.4 Theorem

The class of Boolean classification problems with 2 states and conditional distributions which have a Bahadur-Lazarsfeld expansion of bounded order is pab-decidable.

Proof: Let k be the bound on the order of the expansions. According to theorem 3.3, it is sufficient to show that D_2[x] as given by formula (3.5) is a multiplicatively pac-estimable distribution class. Setting

\[
T^{b_1 \dots b_l}_{j_1 \dots j_l}(x) := y^{b_1}_{j_1} y_{j_1}(x) \cdots y^{b_l}_{j_l} y_{j_l}(x)\, p^{b_1 \dots b_l}_{j_1 \dots j_l}
\]

and

\[
Q(x) := \sum_{l=0}^{k} \; \sum_{1 \le j_1 < \dots < j_l \le n} \; \sum_{b \in \{0,1\}^l} T^{b_1 \dots b_l}_{j_1 \dots j_l}(x)
\]