Efficient Learning of Monotone Concepts via Quadratic Optimization

David Gamarnik Mathematical Sciences Department IBM, T.J.Watson Research Center P.O.Box 218, Yorktown Heights, NY 10598 [email protected]

Abstract We consider a non-parametric regression model in which the regression function to be estimated is coordinate-wise monotone. Such functions can model, for example, the dependence of an option price on the strike price, the duration, and the price of the underlying asset. We propose a simple algorithm, based on quadratic optimization, which estimates the regression function in polynomial time. Numerical results are provided which show that the algorithm is quite robust to possible discontinuities of the regression function. Our main theoretical result shows that the expected VC-entropy of the space of coordinate-wise monotone functions grows subexponentially. As a result, the estimating function constructed by our algorithm converges to the true regression function as the number of observations goes to infinity, and the estimation procedure is consistent.

1 INTRODUCTION

We consider a regression model of the form
$$Y = f(X) + \varepsilon,$$
where $(X, Y)$ is an observable pair of input-output random variables and $\varepsilon$ is a zero-mean unobserved random noise (in particular, $E[Y \mid X] = f(X)$). The regression function $f : D \to \mathbb{R}$ is defined on some bounded domain $D \subset \mathbb{R}^d$.
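For illustration, here is a minimal computational sketch of the quadratic-optimization idea behind the estimator: fit values $\hat y_i$ by least squares subject to the monotonicity constraints induced by the coordinate-wise partial order. This is a generic monotone least-squares program, not necessarily the paper's exact formulation; the use of cvxpy and all names below are assumptions of the sketch.

```python
# Hypothetical sketch: least-squares fit over coordinate-wise monotone values.
# An illustration of "monotone regression via quadratic optimization", not
# necessarily the exact algorithm of the paper; cvxpy is assumed available.
import numpy as np
import cvxpy as cp

def monotone_fit(X, Y):
    """Minimize sum_i (y_hat_i - Y_i)^2 subject to y_hat_i <= y_hat_j
    whenever X_i <= X_j coordinate-wise."""
    n = len(Y)
    y_hat = cp.Variable(n)
    constraints = []
    for i in range(n):
        for j in range(n):
            if i != j and np.all(X[i] <= X[j]):   # X_i precedes X_j in the partial order
                constraints.append(y_hat[i] <= y_hat[j])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y_hat - Y)), constraints)
    problem.solve()
    return y_hat.value

# Toy usage: noisy observations of a coordinate-wise monotone function on [0,1]^2.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
Y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(50)
print(monotone_fit(X, Y)[:5])
```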

$$\binom{n-1}{k}\,\Pr^{k}\{X_1 > x_1,\ X_2 < x_2\}\,\Pr^{\,n-1-k}\{X_1 > x_1,\ X_2 > x_2\}\,h_C(x_1, x_2)$$
$$=\binom{n-1}{k}\,\Pr^{k}\{X_2 < x_2 \mid X_1 > x_1\}\,\bigl(1 - \Pr\{X_2 < x_2 \mid X_1 > x_1\}\bigr)^{n-k-1}\,\Pr^{\,n-1}\{X_1 > x_1\}\,h_C(x_1, x_2), \qquad (18)$$
where $(X_1, X_2)$ is a random vector chosen according to the distribution $H_C$ with density function $h_C$. Combining with (17) and (18) we obtain
$$L(2, n, H_C) \le L(2, n-1, H_C) + \sum_{k=0}^{n-1}\binom{n-1}{k}\,\sup_{C \in \mathcal{C}} L(2, k, H_C)\int_{c_1^1}^{c_2^1}\!\!\int_{c_1^2}^{c_2^2}\Pr^{k}\{X_2 < x_2 \mid X_1 > x_1\}\,\bigl(1 - \Pr\{X_2 < x_2 \mid X_1 > x_1\}\bigr)^{n-k-1}\,\Pr^{\,n-1}\{X_1 > x_1\}\,h_C(x_1, x_2)\,dx_1\,dx_2. \qquad (19)$$
But
$$h_C(x_1, x_2) = h_C^{X_2}(x_2 \mid X_1 = x_1)\,h_C^{X_1}(x_1) = h_C^{X_2}(x_2 \mid X_1 \ge x_1)\,h_C^{X_1}(x_1)\,\frac{h_C^{X_2}(x_2 \mid X_1 = x_1)}{h_C^{X_2}(x_2 \mid X_1 \ge x_1)}.$$
Note that
$$\frac{h_C^{X_2}(x_2 \mid X_1 = x_1)}{h_C^{X_2}(x_2 \mid X_1 \ge x_1)} = \frac{h^{X_2}(x_2 \mid X_1 = x_1)}{h^{X_2}(x_2 \mid x_1 \le X_1 \le c_2^1)} \le \alpha,$$
by the choice of $\alpha$. Therefore
$$h_C(x_1, x_2) \le \alpha\,h_C^{X_2}(x_2 \mid X_1 \ge x_1)\,h_C^{X_1}(x_1).$$
We now evaluate the double integral in (19) by integrating first over $x_2$ and then over $x_1$. Note that
$$\frac{d\Pr\{X_2 < x_2 \mid X_1 > x_1\}}{dx_2} = h_C^{X_2}(x_2 \mid X_1 \ge x_1).$$
It follows that
$$\int_{c_1^2}^{c_2^2}\Pr^{k}\{X_2 < x_2 \mid X_1 > x_1\}\,\bigl(1 - \Pr\{X_2 < x_2 \mid X_1 > x_1\}\bigr)^{n-k-1}\,h_C^{X_2}(x_2 \mid X_1 \ge x_1)\,dx_2$$
$$=\frac{1}{k+1}\int_{c_1^2}^{c_2^2}\bigl(1 - \Pr\{X_2 < x_2 \mid X_1 > x_1\}\bigr)^{n-k-1}\,d\Pr^{\,k+1}\{X_2 < x_2 \mid X_1 > x_1\} = \cdots = \frac{(n-1-k)(n-2-k)\cdots 1}{n\,(k+1)(k+2)\cdots(n-1)} = \frac{1}{n\binom{n-1}{k}}.$$
The last equality is obtained by successive integration by parts, observing that $\Pr\{X_2 < c_1^2 \mid X_1 > x_1\} = 0$ and $\Pr\{X_2 < c_2^2 \mid X_1 > x_1\} = 1$. Also
$$\int_{c_1^1}^{c_2^1}\Pr^{\,n-1}\{X_1 > x_1\}\,h_C^{X_1}(x_1)\,dx_1 = \frac{1}{n}.$$
From the previous equalities it follows that
$$L(2, n, H_C) \le L(2, n-1, H_C) + \frac{1}{n}\sum_{k=0}^{n-1}\sup_{C \in \mathcal{C}} L(2, k, H_C).$$
Combining, we obtain
$$\sup_{C \in \mathcal{C}} L(2, n, H_C) \le \sup_{C \in \mathcal{C}} L(2, n-1, H_C) + \frac{1}{n}\sum_{k=0}^{n-1}\sup_{C \in \mathcal{C}} L(2, k, H_C).$$
This completes the proof. $\Box$
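The value obtained above by successive integration by parts is the Beta integral $\int_0^1 t^k(1-t)^{n-k-1}\,dt = 1/\bigl(n\binom{n-1}{k}\bigr)$; a quick numerical check (an illustration only, not part of the proof):

```python
# Numerical check of the integration-by-parts identity used above:
# integral_0^1 t^k (1-t)^(n-k-1) dt  =  1 / (n * C(n-1, k)).
from math import comb
from scipy.integrate import quad

def check(n, k):
    numeric, _ = quad(lambda t: t**k * (1 - t)**(n - k - 1), 0.0, 1.0)
    closed_form = 1.0 / (n * comb(n - 1, k))
    return numeric, closed_form

for n, k in [(5, 2), (10, 0), (10, 7), (25, 12)]:
    num, exact = check(n, k)
    print(f"n={n:2d} k={k:2d}  quad={num:.3e}  1/(n*C(n-1,k))={exact:.3e}")
```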

We use the recursion in Lemma 4 to deduce the rate of divergence of $\sup_{C \in \mathcal{C}} L(2, n, H_C)$ as $n$ diverges to infinity. We argue that the rate of divergence is dominated by $e^{2\sqrt{n}}$, which has a subexponential growth rate.

Lemma 5 There holds
$$e^{2\sqrt{n}} \ge e^{2\sqrt{n-1}} + \frac{1}{n}\sum_{k=0}^{n-1} e^{2\sqrt{k}} \qquad (20)$$
for all sufficiently large $n$.
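A short numerical illustration (ours, not part of the proof) of how the two sides of (20) compare as $n$ grows:

```python
# Compare the two sides of (20):
#   e^{2 sqrt(n)}  vs  e^{2 sqrt(n-1)} + (1/n) * sum_{k=0}^{n-1} e^{2 sqrt(k)}.
# A ratio above 1 indicates that (20) holds at this n.
from math import exp, sqrt

def sides(n):
    lhs = exp(2 * sqrt(n))
    rhs = exp(2 * sqrt(n - 1)) + sum(exp(2 * sqrt(k)) for k in range(n)) / n
    return lhs, rhs

for n in [2, 5, 10, 50, 200, 1000]:
    lhs, rhs = sides(n)
    print(f"n={n:5d}  lhs/rhs = {lhs / rhs:.6f}")
```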

Proof: Introduce
$$\Delta(n) = e^{2\sqrt{n}} - e^{2\sqrt{n-1}} - \frac{1}{n}\sum_{k=0}^{n-1} e^{2\sqrt{k}}.$$
We need to show that $\Delta(n) \ge 0$ for sufficiently large $n$. We have
$$n\,\Delta(n) = n\,e^{2\sqrt{n}} - n\,e^{2\sqrt{n-1}} - \sum_{k=0}^{n-1} e^{2\sqrt{k}}.$$
Similarly,
$$(n+1)\,\Delta(n+1) = (n+1)\,e^{2\sqrt{n+1}} - (n+1)\,e^{2\sqrt{n}} - \sum_{k=0}^{n} e^{2\sqrt{k}}.$$
Subtracting, we obtain
$$(n+1)\,\Delta(n+1) - n\,\Delta(n) = (n+1)\,e^{2\sqrt{n+1}} - (2n+2)\,e^{2\sqrt{n}} + n\,e^{2\sqrt{n-1}}. \qquad (21)$$
Now, let us consider the function $e^{2x^{1/2}}$ and its third-order Taylor expansion around $x = n$:
$$e^{2\sqrt{n+1}} = e^{2\sqrt{n}}\Bigl[1 + \frac{1}{n^{1/2}}(n+1-n) + \frac{1}{2}\Bigl(\frac{1}{n} - \frac{1}{2n^{3/2}}\Bigr)(n+1-n)^2 + \frac{1}{6}\Bigl(\frac{1}{n^{3/2}} - \frac{3}{2n^{2}} + \frac{3}{4n^{5/2}}\Bigr)(n+1-n)^3 + O\Bigl(\frac{1}{n^{2}}\Bigr)\Bigr].$$
Similarly,
$$e^{2\sqrt{n-1}} = e^{2\sqrt{n}}\Bigl[1 + \frac{1}{n^{1/2}}(n-1-n) + \frac{1}{2}\Bigl(\frac{1}{n} - \frac{1}{2n^{3/2}}\Bigr)(n-1-n)^2 + \frac{1}{6}\Bigl(\frac{1}{n^{3/2}} - \frac{3}{2n^{2}} + \frac{3}{4n^{5/2}}\Bigr)(n-1-n)^3 + O\Bigl(\frac{1}{n^{2}}\Bigr)\Bigr].$$
Applying this to (21) and simplifying, we obtain
$$(n+1)\,\Delta(n+1) - n\,\Delta(n) = e^{2\sqrt{n}}\Bigl[\frac{1}{2n^{1/2}} + O\Bigl(\frac{1}{n}\Bigr)\Bigr] \ge e^{2\sqrt{n}}\,\frac{1}{4n^{1/2}},$$
for sufficiently large $n$. Since the sum
$$\sum_{n=1}^{\infty} e^{2\sqrt{n}}\,\frac{1}{4n^{1/2}}$$
diverges to infinity,
$$n\,\Delta(n) - \Delta(1) = \sum_{m=1}^{n-1}\bigl[(m+1)\,\Delta(m+1) - m\,\Delta(m)\bigr]$$
diverges to infinity as well. In particular, $\Delta(n)$ is positive for all sufficiently large $n$. $\Box$

Our final step in proving Theorem 2 is to use Lemma 5 to show that there exists some constant $\beta > 0$ for which $\sup_{C \in \mathcal{C}} L(2, n, H_C)$ is upper bounded by $\beta\,e^{2\sqrt{n}}$ for all $n$. Let $n_0$ be such that (20) holds for all $n > n_0$. We now show by induction that
$$\sup_{C \in \mathcal{C}} L(2, n, H_C) \le 2^{n_0}\,e^{2\sqrt{n}}$$
for all $n = 1, 2, \ldots$. Since $L(2, n, H_C) \le 2^{n}$ (the labeling number of $n$ points cannot exceed $2^{n}$), the inequality holds for all $n \le n_0$. Suppose it holds for all $n \le m-1$, where $m > n_0$. Then, applying Lemma 4,
$$\sup_{C \in \mathcal{C}} L(2, m, H_C) \le 2^{n_0}\Bigl(e^{2\sqrt{m-1}} + \frac{1}{m}\sum_{k=0}^{m-1} e^{2\sqrt{k}}\Bigr) \le 2^{n_0}\,e^{2\sqrt{m}},$$
where the last inequality follows from Lemma 5 and $m > n_0$. This proves the induction step. We conclude that for the constant $\beta = 2^{n_0}$ there holds
$$\sup_{C \in \mathcal{C}} L(2, n, H_C) \le \beta\,e^{2\sqrt{n}}.$$
In particular, let us take the rectangle $C$ to be the whole domain $D$. Then (9) is satisfied. We conclude that the bound (16) holds; namely, $\hat f_n$ converges to $f$ in probability, exponentially fast. The almost sure convergence follows from the Borel-Cantelli lemma. $\Box$
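As a numerical illustration of the induction (ours, not part of the proof), one can iterate the extremal form of the recursion of Lemma 4, $a_n = a_{n-1} + \frac{1}{n}\sum_{k=0}^{n-1} a_k$ with $a_0 = 1$, and track the ratio $a_n / e^{2\sqrt{n}}$; the ratio stays bounded, which is consistent with the bound $\beta\,e^{2\sqrt{n}}$:

```python
# Iterate the extremal form of the Lemma 4 recursion,
#   a_n = a_{n-1} + (1/n) * sum_{k<n} a_k,  a_0 = 1,
# and compare against e^{2 sqrt(n)} (illustration only).
from math import exp, sqrt

def ratios(n_max):
    a = [1.0]                  # a_0 = 1: a single labeling of the empty sample
    partial_sum = 1.0          # sum_{k=0}^{n-1} a_k
    out = {}
    for n in range(1, n_max + 1):
        a.append(a[-1] + partial_sum / n)
        partial_sum += a[-1]
        out[n] = a[-1] / exp(2 * sqrt(n))
    return out

r = ratios(1000)
for n in [10, 50, 100, 500, 1000]:
    print(f"n={n:5d}  a_n / e^(2 sqrt n) = {r[n]:.4f}")
```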

Proof of Proposition 5: Without loss of generality, we may assume that the input distribution $H(x)$ is uniform. We have $L(m, n, H) \ge L(2, n, H)$, and it suffices to prove the lower bound for the case of two labels $1, 2$ only. Let $N = \lceil\sqrt{n}\rceil$ and let $p_k^1 = a_1^1 + \frac{k}{N}(a_2^1 - a_1^1)$, $p_k^2 = a_1^2 + \frac{k}{N}(a_2^2 - a_1^2)$, $k = 0, 1, \ldots, N$. Then we have partitioned the rectangle $D$ into $N^2$ equal-size rectangles $P_{kj} \equiv [p_k^1, p_{k+1}^1] \times [p_j^2, p_{j+1}^2]$. Let us consider the rectangles forming the anti-diagonal $P_{N-1,0}, P_{N-2,1}, \ldots, P_{0,N-1}$. Given random input points $X_1, X_2, \ldots, X_n$, for each $k = 0, 1, \ldots, N-1$ select one point out of $X_1, X_2, \ldots, X_n$ that falls into the rectangle $P_{k,N-1-k}$, if there is any. Let $\nu = \nu(X_1, X_2, \ldots, X_n)$ denote the number of points selected; in particular, $\nu \le N$. Notice that these points are incomparable with respect to the $\preceq$ order, and therefore can be labeled with $1, 2$ arbitrarily. Any such labeling can be extended to a labeling of all the points $X_1, X_2, \ldots, X_n$. We conclude
$$L(2, n, H) \ge E[2^{\nu}].$$
By Jensen's inequality, $E[2^{\nu}] \ge 2^{E[\nu]}$. We now show that $E[\nu] \ge \sqrt{n}\,(1 - e^{-1})$. Since all the rectangles have equal size, the probability of a point falling into a given rectangle $P_{k,N-1-k}$ is $1/N^2$. Therefore the probability that at least one point falls into $P_{k,N-1-k}$ is $1 - (1 - 1/N^2)^{N^2}$. It follows that
$$E[\nu] \ge N\bigl[1 - (1 - 1/N^2)^{N^2}\bigr].$$
Since $(1 - 1/N^2)^{N^2} \le e^{-1}$, then
$$E[\nu] \ge N\,(1 - e^{-1}) \ge \sqrt{n}\,(1 - e^{-1}).$$
This proves the first part of the proposition. The second part follows immediately from Proposition 4. $\Box$
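A quick Monte-Carlo illustration (ours, not part of the proof) of the anti-diagonal construction for uniform inputs on the unit square: for this setting the expected number of occupied anti-diagonal cells equals $N\bigl(1 - (1 - 1/N^2)^n\bigr)$, and the simulation below reproduces it.

```python
# Monte-Carlo illustration of the anti-diagonal construction: nu = number of
# occupied cells P_{k, N-1-k} of an N x N grid, for n uniform points in [0,1]^2.
# For uniform inputs, E[nu] = N * (1 - (1 - 1/N^2)^n)  (closed form for this illustration).
import numpy as np
from math import ceil, sqrt

def mc_expected_nu(n, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    N = ceil(sqrt(n))
    total = 0
    for _ in range(trials):
        cells = np.minimum((rng.random((n, 2)) * N).astype(int), N - 1)
        total += len({k for k, j in cells if j == N - 1 - k})
    return N, total / trials

for n in [25, 100, 400]:
    N, est = mc_expected_nu(n)
    exact = N * (1 - (1 - 1 / N**2) ** n)
    print(f"n={n:4d}  N={N:2d}  Monte-Carlo E[nu]={est:.3f}  exact={exact:.3f}")
```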

Acknowledgments. Many thanks to Don Coppersmith for suggesting the proof of the labeling problem and for many helpful discussions.
