IEEE SIGNAL PROCESSING LETTERS
On-Line Learning for Active Pattern Recognition
Jong-Min Park, Yu Hen Hu
Abstract: An adaptive on-line learning method is presented to facilitate pattern classification using active sampling to identify the optimal decision boundary of a stochastic oracle with a minimum number of training samples. The strategy of sampling at the current estimate of the decision boundary is shown to be optimal compared to random sampling, in the sense that the probability of convergence toward the true decision boundary at each step is maximized, offering theoretical justification for the popular strategy of category-boundary sampling used by many query learning algorithms.
I. Introduction

Pattern recognition via active sampling traces its roots to statistical experiment design, where performing an experiment (acquiring one training sample) may incur significant cost. A number of active learning strategies based on the concept of optimal experimental design, as well as importance sampling, have been reported ([1], [2], [3], [4], [5]). References [1] and [2] focused on active learning for pattern classification applications, with a common heuristic of sampling at or near the present estimate of the category boundary, justified by the argument ([1], [2]) that the function approximation of the posterior probability is most uncertain near the category boundary. In this letter, we examine the validity of this argument using a two-class pattern classification problem as an example. We show that the variance of the approximation error reaches its maximum at the true category boundary. However, the present estimate of the category boundary may not coincide with the true boundary until convergence is reached. Based on a stochastic oracle model, we show that the strategy of sampling at the present estimate of the category boundary is optimal for a perceptron-like learning algorithm. This result offers a direct theoretical justification of the "sample-at-current-boundary" strategy. Preliminary simulation results are consistent with what the analysis predicts.

II. Problem Formulation
In a two-class pattern recognition problem, the feature vector x in R and the class label C in {0, 1} are random variables with conditional probability density functions f_{x|C}(x | C = i) = f_i(x) and prior probabilities P(C = i) = pi_i, i in {0, 1}. The posterior probability that C = i given x is

q_i(x) = P{C = i | x} = pi_i f_i(x) / [pi_0 f_0(x) + pi_1 f_1(x)], i = 0, 1.

Since q_0(x) = 1 - q_1(x), for simplicity we shall denote q_1(x) by q(x) in the rest of this letter. A maximum a posteriori probability (MAP) classifier is a decision rule that determines that x is drawn from class C = 1 if q(x) > 0.5 and from class C = 0 if q(x) < 0.5. The set of points B = {x | q_0(x) = q_1(x) = 1/2} is called the decision boundary. In general, B may contain more than one point. In this work, we are mainly interested in applications where B contains exactly one point. This is the case if, say, q(x) is a monotonically increasing function of x, as we will assume in the sequel.

In a conventional pattern recognition problem, a set of "training samples" is given so that the decision boundary w*, where q(w*) = 0.5, can be estimated from these samples. In an active learning (also known as query learning, [6]) formulation, the training samples are not given. Instead, a "learner" (the classification algorithm) samples a feature vector x and presents it to an oracle (by performing an experiment or running a simulation) to learn the corresponding class label of x. In a two-class pattern recognition problem, this is equivalent to evaluating a function y(x) at a specific value of x. The oracle returns y(x) = 1 or y(x) = 0 as the class label associated with x according to the posterior probabilities P{C = 1 | x} = q(x) and P{C = 0 | x} = 1 - q(x). For x >> w* (w* is unknown to the learner), q(x) -> 1 and the oracle will most likely return y(x) = 1, while for x << w*, it will most likely return y(x) = 0. For x near w*, the oracle is equally likely to return y(x) = 0 or y(x) = 1. In other words, y(x) is an oracle whose answer is probabilistic in nature. This property is quite distinct from previous works in which the oracle is assumed to be deterministic ([5], [2]). The goal of active learning in this pattern recognition setting is to find w* with a minimum number of queries to the oracle. Minimizing the number of queries not only reduces the cost associated with each experiment (query), but also expedites the convergence of the algorithm.

(Yu Hen Hu and Jong-Min Park are with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison.)

III. Minimum Error Active Learning

To devise a learning rule that learns the optimal decision boundary w* using active learning, let us define a 0-1 loss function L(y(x), f(w, x)) = [y(x) - f(w, x)]^2 = [y(x) - u(x - w)]^2, where u(x) = 1 if x > 0 and u(x) = 0 if x < 0. That is, if f(w, x) and y(x) have the same value for a given x, the loss is 0; otherwise, the loss is 1. A cost function, the conditional risk given x, can then be defined as:

Cost(x) = E[[y(x) - f(w, x)]^2 | x]
        = P{y(x) = 1 | x} [1 - u(x - w)]^2 + P{y(x) = 0 | x} [0 - u(x - w)]^2
        = q(x) [1 - u(x - w)] + (1 - q(x)) u(x - w)
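To make the setting concrete, the stochastic oracle and the conditional risk above can be sketched in a few lines of Python. This is an illustration, not part of the letter; the Gaussian form of q(x) is borrowed from the simulation in Section IV.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(x):
    """Posterior P{C=1 | x} for two equal-prior Gaussian classes
    (means +1 for C1 and -1 for C0, variance 1, as in Section IV)."""
    a = np.exp(-(x - 1.0) ** 2 / 2)
    b = np.exp(-(x + 1.0) ** 2 / 2)
    return a / (a + b)

def oracle(x):
    """Stochastic oracle: returns label y(x) = 1 with probability q(x)."""
    return 1 if rng.random() < q(x) else 0

def cost(x, w):
    """Conditional risk of the threshold rule f(w, x) = u(x - w):
    Cost(x) = q(x)[1 - u(x - w)] + (1 - q(x)) u(x - w)."""
    u = 1.0 if x > w else 0.0
    return q(x) * (1 - u) + (1 - q(x)) * u
```

For this symmetric setup the true boundary is w* = 0, where Cost(w*) reaches its maximum value of 0.5: the oracle's answer there is a fair coin flip, so no threshold rule can do better.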
Since u(x - w) = 1 for x > w and u(x - w) = 0 for x < w, the cost reduces to

Cost(x) = 1 - q(x),  x > w
        = q(x),      x < w                                          (1)

Integrating Cost(x) over x, we have the expected risk

R = Integral of Cost(x) p(x) dx
  = Integral_{w}^{inf} P{x in C0 | x} p(x) dx + Integral_{-inf}^{w} P{x in C1 | x} p(x) dx
  = P{x in C0} Integral_{w}^{inf} p{x | x in C0} dx + P{x in C1} Integral_{-inf}^{w} p{x | x in C1} dx
  = P{x in C0} P{x > w | x in C0} + P{x in C1} P{x < w | x in C1}   (2)

This shows that the expected risk R is the probability of misclassification. Thus, for any fixed x, minimizing Cost(x) minimizes the probability of misclassification. Since Cost(x) is not differentiable with respect to the decision boundary w, we use an adaptive formula similar to that of the classical perceptron learning algorithm:

w_{n+1} = w_n - eta [y(x_{n+1}) - 0.5]                              (3)

where eta is the learning rate, a.k.a. step size, and the new estimate of the boundary moves to the left or right by eta/2 depending on the sample output y(x_{n+1}) at the next sampling point x_{n+1}. Note that E[y(w*)] = 0.5. From (3),

|w_{n+1} - w*| = |w_n - w*| +/- eta/2.                              (4)

The algorithm moves toward convergence in the present step if |w_{n+1} - w*| = |w_n - w*| - eta/2.

Theorem 1: Given the a posteriori probability q(x) of a two-class problem, the probability of moving away from the true boundary, P{|w_{n+1} - w*| = |w_n - w*| + eta/2 | x_{n+1}}, is minimized when x_{n+1} -> w_n. Moreover,

P{|w_{n+1} - w*| = |w_n - w*| - eta/2 | x_{n+1} = w_n} > 0.5        (5)

Proof: Let

P_e = P{|w_{n+1} - w*| = |w_n - w*| + eta/2}
    = P{y(x_{n+1}) = 0 | w_n > w*} P{w_n > w*}
      + P{y(x_{n+1}) = 1 | w_n < w*} P{w_n < w*}                    (6)

The event y(x_{n+1}) = 0 given x_{n+1} and the event w_n > w* are independent, hence

P{y(x_{n+1}) = 0 | w_n > w*} = P{y(x_{n+1}) = 0} = 1 - q(x_{n+1})   (7)
P{y(x_{n+1}) = 1 | w_n < w*} = P{y(x_{n+1}) = 1} = q(x_{n+1})       (8)

Thus (6) becomes

P_e = [1 - q(x_{n+1})] P{w_n > w*} + q(x_{n+1}) P{w_n < w*}         (9)

If w_n > w*, one would want to minimize the 1 - q(x_{n+1}) term by choosing x_{n+1} >= w_n, since 1 - q(x_{n+1}) < 0.5 for w* < w_n <= x_{n+1}. If we chose x_{n+1} < w_n, it might be that x_{n+1} < w*, in which case 1 - q(x_{n+1}) > 0.5, actually increasing P_e and thus not optimal. On the other hand, if w_n < w*, one would choose x_{n+1} <= w_n to minimize the term q(x_{n+1}), since q(x_{n+1}) < 0.5 for x_{n+1} <= w_n < w*. If we chose x_{n+1} > w_n, it might be that x_{n+1} > w*, in which case q(x_{n+1}) > 0.5, again increasing P_e. Since one has no prior knowledge of whether w_n > w* or w_n < w*, the only choice of x_{n+1} that guarantees P_e < 0.5 is x_{n+1} = w_n. Equation (9) now becomes

P{|w_{n+1} - w*| = |w_n - w*| + eta/2}
  = [1 - q(w_n)] P{w_n > w*} + q(w_n) P{w_n < w*}                   (10)
  < [1 - q(w*)] P{w_n > w*} + q(w*) P{w_n < w*} = 0.5.

Since, for any fixed x_{n+1},

P{|w_{n+1} - w*| = |w_n - w*| - eta/2}
  = 1 - P{|w_{n+1} - w*| = |w_n - w*| + eta/2} > 0.5                (11)

this proves (5).

This theorem establishes that the optimal active learning strategy for the two-class pattern classification problem is to sample at the current estimate of the category boundary, w_n, and not at the unknown theoretical decision boundary w*.

Due to space limitations, the convergence of this proposed learning algorithm will be discussed formally in a separate paper. Briefly, we note that the update formula (3) is Markovian in nature because the next state w_{n+1} depends only on the previous state w_n and the current input x_{n+1}. Since the probability of moving toward the decision boundary (eqs. (7), (8)) is a function of x_{n+1} only, the process is a time-invariant Markov chain with a continuous state space of w. As the Markov chain converges to its limiting distribution P_{w,inf}(w) = lim_{t->inf} Pr{W(t) <= w}, so will the proposed algorithm. Furthermore, the corresponding limiting density function will peak at w = w*.

IV. Simulation

We have conducted a simulation to validate the theorem developed in this letter. We assume that the sampled data are drawn from two Gaussian distributions with means equal to 1 (C1) and -1 (C0), respectively, and variance equal to 1. This yields the posterior probability

q(x) = exp[-(x - 1)^2 / 2] / ( exp[-(x - 1)^2 / 2] + exp[-(x + 1)^2 / 2] )
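With this q(x) in hand, the update rule (3) and the claim of Theorem 1 can be checked numerically. This is a sketch, not part of the letter; it assumes the symmetric two-Gaussian setup, for which w* = 0.

```python
import numpy as np

rng = np.random.default_rng(1)

def q(x):
    # Posterior P{C=1 | x} for Gaussian classes with means +1 and -1, variance 1.
    a = np.exp(-(x - 1.0) ** 2 / 2)
    b = np.exp(-(x + 1.0) ** 2 / 2)
    return a / (a + b)

def update(w_n, x_next, eta=0.05):
    """One step of the perceptron-like rule (3):
    w_{n+1} = w_n - eta * (y(x_{n+1}) - 0.5),
    where the oracle answer y is 1 with probability q(x_next)."""
    y = 1 if rng.random() < q(x_next) else 0
    return w_n - eta * (y - 0.5)

# Eq. (9) with w* = 0: when w_n > w* (so P{w_n > w*} = 1), the probability
# of moving AWAY from w* is P_e = 1 - q(x_{n+1}).  Sampling at x_{n+1} = w_n
# guarantees P_e < 0.5; sampling on the far side of w* can give P_e > 0.5.
w_n = 0.3
p_away_at_wn  = 1 - q(w_n)    # x_{n+1} = w_n  ->  P_e < 0.5
p_away_beyond = 1 - q(-0.3)   # x_{n+1} < w*   ->  P_e > 0.5
```

Every step moves the estimate by exactly eta/2, so near w* the estimate necessarily oscillates rather than settling, which is the behavior the simulation's variance curves exhibit.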
Fig. 1. Two-class problem, q(x) = P{Y = 1 | x}.
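A minimal script for this two-Gaussian setup, comparing sample-at-the-boundary against uniform random sampling as described in this section, might look as follows. This is a sketch under stated assumptions: w* = 0, eta = 0.05, and trial counts trimmed relative to the letter's 100 x 100 runs.

```python
import numpy as np

rng = np.random.default_rng(2)
ETA = 0.05       # step size eta, as in the letter
W_STAR = 0.0     # true boundary of the symmetric two-Gaussian problem

def q(x):
    # q(x) = exp(-(x-1)^2/2) / (exp(-(x-1)^2/2) + exp(-(x+1)^2/2))
    a = np.exp(-(x - 1.0) ** 2 / 2)
    b = np.exp(-(x + 1.0) ** 2 / 2)
    return a / (a + b)

def run(active, n_trials=200, n_steps=100):
    """Mean |w_n - w*| after n_steps, averaged over n_trials runs."""
    errs = []
    for _ in range(n_trials):
        w = rng.uniform(-0.5, 0.5)                   # initial estimate w_0
        for _ in range(n_steps):
            x = w if active else rng.uniform(-0.5, 0.5)
            y = 1 if rng.random() < q(x) else 0      # stochastic oracle label
            w -= ETA * (y - 0.5)                     # update rule (3)
        errs.append(abs(w - W_STAR))
    return float(np.mean(errs))

err_active = run(active=True)
err_random = run(active=False)
```

With random sampling, the label probability E[y] averaged over a symmetric interval is 0.5 regardless of w_n, so the estimate performs nearly driftless steps; active sampling has a systematic drift toward w*, which is why it converges faster in the figures of this section.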
For each x, a random variable r, uniformly distributed over [0, 1], is generated. If r < q(x), then y(x) = 1; otherwise y(x) = 0. The initial estimate of w*, w_0, is chosen randomly over the interval [-0.5, 0.5]. Two methods are compared. One chooses x_{n+1} = w_n, the sample-at-the-boundary method discussed in this letter. The other, a random sampling method, draws x_{n+1} randomly from the interval [-0.5, 0.5]. The step size is eta = 0.05 in both methods. One hundred random initial estimates have been drawn from [-0.5, 0.5]. For each of them, 100 trial runs are performed, with each trial lasting 100 steps. Figures 2 and 3 show, at each step, the average of the mean of |w_n - w*| and the average of the variance of the boundary estimates, |w_n - w*|^2. Figure 2 shows that active sampling converges faster than random sampling with the same update step eta. But when the estimate approaches the true boundary (the mean nears zero), the variance of active sampling becomes higher than that of random sampling. This is expected, since eta now causes oscillation when |w_{n+1} - w*| < eta; that is, the update step is too large for w_{n+1} to land exactly on the true boundary. This can be seen in Figure 2, where |w_n - w*| approaches zero near the point where the variance becomes higher.

V. Conclusion
Active learning in a stochastic environment reflects the method of estimating the learning model from existing samples, querying new samples that may optimize the estimation process, and iterating. It is theoretically shown that sampling at the current boundary estimate is the optimal way to perform active learning in a stochastic environment. An example demonstrates the formulation for a two-class problem with active querying near the boundary points.

Fig. 2. Average mean of active and random sampling.

Fig. 3. Average variance of active and random sampling.

References
[1] David A. Cohn, "Neural network exploration using optimal experiment design," in Advances in Neural Information Processing Systems, vol. 6, 1994.
[2] Jenq-Neng Hwang, Jai J. Choi, Seho Oh, and Robert J. Marks II, "Query-based learning applied to partially trained multilayer perceptrons," IEEE Transactions on Neural Networks, vol. 2, no. 1, pp. 131-136, Jan. 1991.
[3] Yoshiyuki Kabashima and Shigeru Shinomoto, "Incremental learning with and without queries in binary choice problems," in Proc. International Joint Conference on Neural Networks, 1993, vol. 2, pp. 1637-1640.
[4] D. MacKay, "Information-based objective functions for active data selection," Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.
[5] Mark Plutowski and Halbert White, "Active selection of training examples for network learning in noiseless environments," Tech. Rep. CS91-180, Univ. of California, San Diego, Feb. 1990.
[6] D. Angluin, "Queries and concept learning," Machine Learning, vol. 2, pp. 319-342, 1988.