Randomized Neural Networks for Learning Stochastic Dependences

Vivek S. Borkar and Piyush Gupta

Abstract

We consider the problem of learning the dependence of one random variable on another, from a finite string of i.i.d. copies of the pair. The problem is first converted to that of learning a function of the latter random variable and an independent random variable uniformly distributed on the unit interval. However, this cannot be achieved using the usual function-learning techniques because the samples of the uniformly distributed random variable are not available. We propose a novel loss function, the minimizer of which results in an approximation to the needed function. Through successive approximation results (suggested by the proposed loss function), a suitable class of functions, represented by combinations of feed-forward neural networks, is selected as the class to learn from. These results are also extended to countable as well as continuous state-space Markov chains. The effectiveness of the proposed method is indicated through simulation studies.

1 Introduction

In this paper, we consider the following problem: given a finite string $\{(X_k, Y_k),\ 1 \le k \le m\}$ of copies of the random-variable pair $(X, Y)$, learn to estimate $Y_k$ given $X_k$. The usual procedure is to find a function $h$ such that a suitably defined error between $h(X_k)$ and $Y_k$ is minimized (for example, if the mean-square error between the estimate $h(X_k)$ and $Y_k$ is to be minimized, then it is easily seen that $h(X_k) = E[Y_k \mid X_k]$). Now, given such a "best" estimating function $h$, the estimate of $g(Y_k)$ given $X_k$ is $g(h(X_k))$. But clearly $g(h(X_k))$ need not be the best estimate of $g(Y_k)$ given $X_k$.
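A quick numerical illustration of this point, using an assumed toy model ($Y = X + {}$standard Gaussian noise, $g(y) = y^2$) in which the plug-in estimate $g(h(X_k))$ misses $E[g(Y_k) \mid X_k]$ by exactly the conditional variance:

```python
import numpy as np

# Toy check (assumed model, for illustration): Y = X + standard Gaussian noise,
# g(y) = y^2.  Then E[Y | X = x] = x, so g(E[Y | X = x]) = x^2, whereas the best
# estimate of g(Y) is E[g(Y) | X = x] = x^2 + 1.
rng = np.random.default_rng(0)
x = 0.5                                 # condition on a fixed value of X
y = x + rng.normal(0.0, 1.0, 100_000)   # samples of Y given X = x

g = np.square
plug_in = g(y.mean())                   # g(h(x)) with h the conditional mean: ~0.25
best = g(y).mean()                      # E[g(Y) | X = x]: ~1.25
print(plug_in, best)                    # the gap is Var(Y | X = x) = 1
```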

In contrast, we consider the general problem of learning the conditional law of $Y$ given $X$. For this purpose, we first obtain a canonical representation of the dependence of $Y_k$ on $X_k$, using some results from [1]. For arbitrary random variables $X_k$ and $Y_k$, taking values in compact subsets of real Euclidean spaces, we can always find a function $f$ such that $Y_k = f(X_k, \xi_k)$, where $\{\xi_k\}$ are i.i.d., uniform over $[0,1]$, and independent of $\{X_k\}$. Hence the problem of learning the conditional law of $Y_k$ given $X_k$ reduces to that of learning the function $f$. However, $f$ cannot be learned using the usual techniques because the corresponding samples of $\{\xi_k\}$ are not available. Hence an appropriate loss function that does not require $\{\xi_k\}$ is minimized. The minimizer of this loss function gives an approximation of $f$, denoted by $\hat{f}$. Moreover, by using a result of Dubins [6], it is shown that one such approximation of $f$ is $\hat{f}(X, \xi) = \sum_{i=1}^{N+1} I\{\xi \in A_i\}\, f_i(X)$, where $N$ depends on how "closely" $\hat{f}$ is required to match $f$, and $A_i$, $1 \le i \le N+1$, is a partition of $[0,1]$ into intervals. Now, given such an approximation $\hat{f}$ of $f$, the best estimate of $Y_k$ given $X_k$ is $\sum_{i=1}^{N+1} p_i\, f_i(X_k)$, where $p_i$ is the length of the interval $A_i$. Moreover, the estimate of $g(Y_k)$ given $X_k$ is $\sum_{i=1}^{N+1} p_i\, g(f_i(X_k))$, which is not the same as $g\bigl(\sum_{i=1}^{N+1} p_i\, f_i(X_k)\bigr)$. This is so because we are essentially estimating an "approximate" conditional law of $Y_k$ given $X_k$, and not any particular conditional average as is usually done. Also, as will be shown shortly, estimating $p_i, f_i$, $1 \le i \le N+1$, does not require explicit randomization; that is, we do not need to simulate $\{\xi_k\}$ explicitly in order to learn the $p_i$'s and $f_i$'s.
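To make the use of such an approximation concrete, the following sketch computes the above estimates from hypothetical learned components; the functions $f_i$, the interval lengths $p_i$, and the choice $N = 2$ are made-up stand-ins for the feed-forward networks and partition that the learning procedure would actually produce:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical learned quantities (here N = 2): simple closed-form functions
# stand in for the feed-forward networks f_i, and p_i = |A_i| are the lengths
# of the intervals partitioning [0, 1].
f = [lambda x: x, lambda x: x ** 2, lambda x: np.sin(x)]
p = np.array([0.5, 0.3, 0.2])        # interval lengths; must sum to 1

def estimate_y(x):
    """Best estimate of Y given X = x: sum_i p_i * f_i(x)."""
    return sum(pi * fi(x) for pi, fi in zip(p, f))

def estimate_g_of_y(x, g):
    """Estimate of g(Y) given X = x: sum_i p_i * g(f_i(x)), which in general
    differs from g(estimate_y(x))."""
    return sum(pi * g(fi(x)) for pi, fi in zip(p, f))

def sample_y(x):
    """Draw from the approximate conditional law of Y given X = x: simulating
    xi ~ U[0, 1] and locating its interval A_i is equivalent to selecting
    component i with probability p_i = |A_i|."""
    i = rng.choice(len(f), p=p)
    return f[i](x)

x = 0.7
print(estimate_y(x), estimate_g_of_y(x, np.square), sample_y(x))
```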

A number of researchers have previously considered the problem of learning the conditional distribution [5, 9]. In these approaches, however, the true conditional distribution is assumed to come from a known countable class of distributions. For example, DeSantis et al. [5] consider the problem of learning, from a countable class of distributions, the conditional distribution that minimizes the entropy of the observed data. In the approach taken here, we make no such assumption.

Another approach closely related to the issues addressed here is the generalized Probably Approximately Correct (PAC) model of learning from examples [8]. In this model, the learning system is required to decide on an action, say $a_k$, after observing the input $X_k$. Depending on $a_k$ and the actual output $Y_k$, there is a loss $l(a_k, Y_k)$. The objective for the learning system is to infer a decision rule (i.e., a function mapping $X_k$ to $a_k$) so as to minimize the loss. By choosing an appropriate loss function, the system can be made to learn various features of the conditional distribution. For example, by choosing the squared-error loss function, the system can learn the conditional mean of $Y_k$ given $X_k$. In contrast, our approach aims not only at learning a particular conditional average, but also at learning an approximation of the conditional distribution.

In this paper, we discuss two cases: first, when the given string $\{(X_k, Y_k),\ k \ge 1\}$ consists of i.i.d. copies of the random-variable pair $(X, Y)$; and second, when $Y_k = X_{k+1}$, where $\{X_k\}$ is a Markov chain (with either a countable or continuous state space) whose occupation measure we wish to learn. In the former case, we extend the results of [2] by considering an alternative loss function, while we make use of some results from [3] to arrive at the latter.

The rest of the paper is organized as follows. In section 2 we discuss the i.i.d. case in detail. For the sake of completeness, we first give a brief review of the results of [2]. The new loss function is given in section 2.2. In section 3 we extend the results to Markov chains. Appendix A gives the results of simulating the learning algorithm on a number of synthetic problems. We conclude with some continuing work in section 4.
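To fix ideas for the second case, the following sketch (with an assumed three-state chain and a made-up transition matrix) shows how the training string is formed by setting $Y_k = X_{k+1}$, together with the empirical occupation measure of the chain:

```python
import numpy as np

# Minimal sketch for the Markov case, assuming a made-up three-state chain:
# the training string {(X_k, Y_k)} is obtained by setting Y_k = X_{k+1}, and
# the empirical occupation measure is the fraction of time spent in each state.
rng = np.random.default_rng(2)
P = np.array([[0.7, 0.2, 0.1],       # hypothetical transition matrix
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

m = 10_000
chain = np.empty(m + 1, dtype=int)
chain[0] = 0
for k in range(m):
    chain[k + 1] = rng.choice(3, p=P[chain[k]])

pairs = list(zip(chain[:-1], chain[1:]))                   # (X_k, Y_k) with Y_k = X_{k+1}
occupation = np.bincount(chain, minlength=3) / len(chain)  # empirical occupation measure
print(pairs[:5], occupation)
```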

2 The i.i.d. Case

Let $(X_k, Y_k)$, $k \ge 1$, be independent and identically distributed pairs of random variables taking values in $C_3 = C_1 \times C_2$, where $C_1$ and $C_2$ are compact subsets of real Euclidean spaces.
