Optimality of Pocket Algorithm

Marco Muselli
Istituto per i Circuiti Elettronici
Consiglio Nazionale delle Ricerche
via De Marini, 6 - 16149 Genova, Italy
Email:
[email protected]

Abstract. Many constructive methods use the pocket algorithm as a basic component in the training of multilayer perceptrons. This is mainly due to the good properties of the pocket algorithm, confirmed by a proper convergence theorem which asserts its optimality. Unfortunately, the original proof holds vacuously and does not ensure the asymptotical achievement of an optimal weight vector in a general situation. This inadequacy can be overcome by a different approach that leads to the desired result. Moreover, a modified version of this learning method, called pocket algorithm with ratchet, is shown to obtain an optimal configuration within a finite number of iterations, independently of the given training set.
1 Introduction
Besides the minimization of the error scored on a given training set, an effective learning technique for supervised neural networks must pursue another important aim: the optimization of the number of weights in the final configuration. In fact, the reduction of the number of parameters in a multilayer perceptron lowers the probability of overfitting and increases its generalization ability in the employment phase [1].

Unlike the back-propagation algorithm, which works on a fixed architecture and is therefore unable to optimize the resulting configuration, constructive methods generate the desired neural network by consecutively adding units to the hidden layers only when necessary [2, 3, 4]. This allows one to obtain in a direct way the optimal or near-optimal number of weights needed to solve a given problem.

In most cases the execution of a constructive method requires the repeated application of a training algorithm for a single neuron, which provides proper values for the weights of the unit that is going to be added to the network. Such an algorithm must have good convergence properties which lead to a high compactness of the final configuration. To this end the pocket algorithm [5] is often used, since a favorable theorem ensures the achievement of an optimal weight vector when the execution time increases indefinitely.

A careful examination of the proposed proof [5] shows that the assertion of the convergence theorem holds vacuously and the achievement of an optimal configuration is not generally ensured. The present paper therefore has the aim of
extending previous results in order to dispel any doubt about the asymptotical optimality of the pocket algorithm. Moreover, a new theorem shows that the version with ratchet is finite-optimal, i.e. it reaches the desired configuration within a finite number of iterations. For the sake of brevity all the proofs are only outlined, leaving details to a future publication [7].
2 Optimality definitions
Let N(w) denote a single threshold neuron whose input weights and bias are contained in the vector w. The output is binary and can assume values in {0, 1}. Furthermore, let ε(w) be the cumulative error (total number of unsatisfied input-output pairs) scored by N(w) on a given training set S containing a finite number s of samples, and denote with εmin the minimum of ε(w). A learning algorithm for the neuron N(w) provides at every iteration t a weight vector w(t) which depends on the particular rule chosen for minimizing the cumulative error ε. Thus we can introduce the following definitions, where the probability P is defined over the possible sequences w(t) generated by the given procedure:

Definition 1. A learning algorithm for the neuron N(w) is called (asymptotically) optimal if

    lim_{t→+∞} P(ε(w(t)) − εmin < η) = 1    for every η > 0        (1)
independently of the given training set S.

Definition 2. A learning algorithm for the neuron N(w) is called finite-optimal if there exists t̄ such that

    P(ε(w(t)) = εmin, for every t ≥ t̄) = 1        (2)

independently of the given training set S.

To reach an optimal configuration the pocket algorithm repeatedly executes perceptron learning, whose basic iteration is formed by the following two steps:

1. a sample in the training set S is randomly chosen with uniform probability;
2. if the current threshold neuron N(v) does not satisfy this sample, its weight vector v is modified in such a way as to provide the correct output.

It can be shown that perceptron learning is finite-optimal when the training set S is linearly separable [8]. In this case there exists a threshold neuron that correctly satisfies all the samples contained in the training set S. In the opposite case the perceptron algorithm cycles indefinitely through feasible weight vectors without providing any (deterministic or probabilistic) information about their optimality. However, we can observe that a neuron N(v) which satisfies a high number of samples of the training set S has a small probability of being modified at step 2.
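The two-step basic iteration above can be sketched in Python (a minimal illustration, not code from the paper: the list-based encoding of samples, the appended constant input for the bias, and the function name are assumptions):

```python
import random

def perceptron_step(v, X, y):
    """One basic iteration of perceptron learning on the training set (X, y):
    pick a sample uniformly at random and, if N(v) misclassifies it,
    move the weight vector v toward the correct output."""
    i = random.randrange(len(X))                  # step 1: uniform random choice
    x = X[i] + [1.0]                              # constant input for the bias
    out = 1 if sum(w * c for w, c in zip(v, x)) >= 0 else 0
    if out != y[i]:                               # step 2: update only on error
        sign = 1 if y[i] == 1 else -1
        v = [w + sign * c for w, c in zip(v, x)]
    return v
```

On a linearly separable training set the updates stop after a finite number of mistakes, in agreement with the perceptron convergence theorem [6, 8].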
This consideration motivates the procedure followed by the pocket algorithm: it keeps in memory at every iteration t the weight vector w(t) (called pocket vector) which has remained unchanged for the greatest number of iterations during perceptron learning. The corresponding neuron N(w(t)) forms the output of the training.

Before approaching the convergence theorems for the pocket algorithm we need to introduce some notations which will be used in the corresponding proofs. If the training set S contains a finite number s of samples, we can always subdivide the feasible weight vectors for the considered threshold neuron into s + 1 (possibly empty) sets Wm, with m = 0, 1, . . . , s. Each of them contains the configurations that satisfy exactly m samples of the given training set, and consequently ε(w) = s − m for every w ∈ Wm. If r is the number of input patterns in S correctly classified by an optimal neuron, then we have Wm = ∅ for m > r and εmin = s − r. Thus, the set Wr contains all the optimal weight vectors for the given problem.

Now, let S(w) denote the subset of samples of the training set S which are satisfied by the neuron N(w). By the perceptron convergence theorem [6, 8], if samples in S(w) are repeatedly chosen, then the weight vector w will be generated in a finite number of iterations, starting from any initial configuration. Furthermore, a corollary of the perceptron cycling theorem [8] ensures that the number of different weight vectors which can be generated by the perceptron algorithm is always finite if the input patterns of the training set have integer or rational components. Then, there exists a maximum number τ of iterations required for reaching a configuration w starting from any initial weight vector, assuming that only samples in S(w) are chosen in the perceptron learning algorithm.
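The pocket procedure just described can be sketched as follows (again an illustrative Python rendering, not Gallant's original code [5]; the run-length bookkeeping names are assumptions):

```python
import random

def pocket_algorithm(X, y, iterations, seed=None):
    """Minimal pocket algorithm sketch: run perceptron learning and keep
    in the 'pocket' the weight vector whose run of unchanged iterations
    is the longest observed so far."""
    rng = random.Random(seed)
    v = [0.0] * (len(X[0]) + 1)            # current perceptron vector (with bias)
    pocket, best_run, run = list(v), 0, 0
    for _ in range(iterations):
        i = rng.randrange(len(X))          # random sample, uniform probability
        x = X[i] + [1.0]
        out = 1 if sum(w * c for w, c in zip(v, x)) >= 0 else 0
        if out == y[i]:
            run += 1                       # v survives this iteration
            if run > best_run:             # longest run so far: save v
                pocket, best_run = list(v), run
        else:
            sign = 1 if y[i] == 1 else -1
            v = [w + sign * c for w, c in zip(v, x)]
            run = 0                        # v was modified: its run restarts
    return pocket
```

Since a vector satisfying many samples is rarely modified, long runs concentrate on low-error configurations, which is the intuition the convergence theorems below make precise.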
3 Convergence theorems for the pocket algorithm
The existence of a convergence theorem for the pocket algorithm [5] which ensures the achievement of an optimal configuration when the number of iterations increases indefinitely has been an important motivation for the employment of this learning procedure. If we use the definitions introduced in the previous section, this theorem asserts that the pocket algorithm is optimal (1) when the training set S contains a finite number of samples and all their inputs are integer or rational.

Nevertheless, a correct interpretation of the corresponding proof requires the introduction of some additional quantities which are used in the mathematical formulation. As a matter of fact the convergence relations are characterized by the number N of generations (visits) of an optimal weight vector instead of the corresponding number t of iterations. The value of N does not take into account the permanences obtained by a given weight vector after each generation. Then, let MN be the greatest number of generations scored by a non-optimal configuration during the same training time t; it can be shown that the ratio M remains finite when the number t of iterations increases indefinitely. In fact, as
we noted in the previous section, the set of different weight vectors generated by the pocket algorithm is finite if the inputs in the training set are integer or rational. Moreover, we can observe that the configurations reached by perceptron learning form a Markov chain, since the generation of a new weight vector only depends on the previous one and on the randomly chosen sample of the training set S. Now, the theory of Markov chains [9] ensures that for every pair of configurations the ratio between the numbers of generations remains constant when the learning time increases indefinitely. Since the number of these pairs is finite, the maximum M of the ratio above always exists.

With these premises we can analyze the original proof of the pocket convergence theorem: it asserts that for every σ ∈ (0, 1) the pocket vector w(t) is optimal with probability greater than σ if the number k of permanences scored by this vector is contained in the following interval:

    I(σ) = [ log(1 − σ^{1/(MN)}) / log p_m , log(1 − (1 − σ)^{1/N}) / log p_r ]        (3)

for N ≥ N̄. The value of N̄ depends on σ, although an explicit relation is not given. The probability p_m in the expression for I(σ) is related to the non-optimal weight vector v̄ which has obtained the greatest number of permanences at iteration t (v̄ ∈ Wm with m < r).

The validity of the original proof is then limited by the constraint (3) on the number k of permanences obtained by the pocket vector. In fact, general conditions on the feasible values for k during the training of a threshold neuron are not given. On the other hand, from definition 1 we obtain that the pocket algorithm is asymptotically optimal only if the convergence to the optimal configuration occurs independently of the particular samples chosen at every iteration. Thus, the proof of the pocket convergence theorem must hold for any value assumed by the number of permanences of the pocket vector or by other learning-dependent quantities. Consequently, the pocket convergence theorem has only been shown to hold vacuously and the convergence to an optimal weight vector is not ensured in a general situation.

A theoretical proof which presents the desired range of applicability can be obtained by a different approach: consider a binary experiment (such as a coin toss) with two possible outcomes, one of which has probability p and is called success. The other outcome then has probability q = 1 − p of occurring and is named failure. Let Q_k^t(p) denote the probability that in t trials of this experiment there does not exist a run of consecutive successes having length k. The complexity of the exact expression for Q_k^t(p) [10] prevents us from directly using it in the mathematical equations; if we denote with ⌊x⌋ the truncation of x, good approximations for Q_k^t(p) are given by:
Lemma 3. If 1 ≤ k ≤ t and 0 ≤ p ≤ 1 the following inequalities hold:

    (1 − p^k)^{t−k+1} ≤ Q_k^t(p) ≤ (1 − p^k)^{⌊t/k⌋}        (4)
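The inequalities of Lemma 3 can be checked numerically with a small Monte Carlo experiment (the function name and the chosen parameter values are illustrative, not from the paper):

```python
import random

def no_success_run(t, k, p, trials=20000, seed=1):
    """Monte Carlo estimate of Q_k^t(p): the probability that t Bernoulli(p)
    trials contain no run of k consecutive successes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        run = 0
        for _ in range(t):
            if rng.random() < p:
                run += 1
                if run >= k:               # a run of length k appeared
                    break
            else:
                run = 0
        else:
            hits += 1                      # inner loop ended with no such run
    return hits / trials

t, k, p = 30, 4, 0.5
lower = (1 - p**k) ** (t - k + 1)          # left side of (4)
upper = (1 - p**k) ** (t // k)             # right side of (4)
estimate = no_success_run(t, k, p)
```

For these values the estimate falls well inside the interval given by the two bounds, as the lemma predicts.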
This lemma, whose technical proof is omitted for the sake of brevity, gives a mathematical basis for establishing the asymptotical optimality of the pocket algorithm.

Theorem 4. (pocket convergence theorem) The pocket algorithm is optimal if the training set is finite and contains only input patterns with integer or rational components.

Proof (outline). Consider t iterations of the pocket algorithm and denote with v∗ and v̄ the optimal and non-optimal weight vectors, respectively, which have obtained the greatest number of permanences during the training. Let a_t and b_t be these numbers of permanences. From the procedure followed by the pocket algorithm we have that the current pocket vector w(t) is equal to v∗ or v̄ according to the values of a_t and b_t; in particular, if a_t > b_t the saved weight vector is optimal. Consequently we obtain for 0 < η < 1:

    P(ε(w(t)) − εmin < η) ≥ 1 − Σ_{k=0}^{t} P(a_t ≤ k | b_t = k) P(b_t = k)        (5)
But the training set is finite and contains only input patterns with integer or rational components; thus, if the algorithm consecutively chooses k + τ + 1 samples belonging to S(v∗) we obtain k + 1 permanences of the weight vector v∗. Some mathematical passages involving elementary probabilistic concepts and the application of lemma 3 yield the following inequalities:

    P(a_t ≤ k | b_t = k) ≤ min( 1, (1 − p_r^{k+τ+1})^{t/(k+τ+1) − 3} )        (6)

    P(b_t = k) ≤ 1 − Q_k^t(p_m) ≤ 1 − (1 − p_m^k)^t        (7)

where p_r = r/s is the probability of choosing a sample in S(v∗) and p_m = m/s (m < r) is the probability of having a permanence for v̄. By substituting (6) and (7) in (5) we obtain an upper bound for the probability of error at iteration t (note that 0 < p_m < p_r < 1):

    P(ε(w(t)) − εmin ≥ η) ≤ Σ_{k=0}^{t} min( 1, (1 − p_r^{k+τ+1})^{t/(k+τ+1) − 3} ) (1 − (1 − p_m^k)^t)        (8)

Thus, the optimality of the pocket algorithm is directly shown if the right side of (8) vanishes when the number t of iterations increases indefinitely. This can be verified by breaking the summation above into three contributions, with the integer k ranging in the intervals [0, ⌊t1⌋], [⌊t1⌋ + 1, ⌊t2⌋], [⌊t2⌋ + 1, t], where

    t1 = −α log t / log p_r ,    t2 = t^{(1−α)/2} ,    α = log p_r / log √(p_r p_m)
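The vanishing of the upper bound in (8) can also be illustrated numerically; the sketch below evaluates the right-hand side for growing t (the values of p_r, p_m and τ are arbitrary choices for illustration, not quantities derived in the paper):

```python
def error_bound(t, p_r, p_m, tau):
    """Right-hand side of (8): an upper bound on the probability that the
    pocket vector is still non-optimal after t iterations."""
    total = 0.0
    for k in range(t + 1):
        # bound (6) on P(a_t <= k | b_t = k); min caps values above 1
        first = min(1.0, (1 - p_r ** (k + tau + 1)) ** (t / (k + tau + 1) - 3))
        # bound (7) on P(b_t = k)
        second = 1 - (1 - p_m ** k) ** t
        total += first * second
    return total
```

Evaluating `error_bound` at increasing values of t with fixed p_r > p_m shows the bound decreasing toward zero, in accordance with the theorem.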
Unfortunately, the cumulative error ε(w(t)) associated with the pocket vector w(t) does not decrease monotonically with the number t of iterations. To eliminate this undesired effect a modified version of the learning method, called pocket algorithm with ratchet, has been proposed. It computes the corresponding error ε(w(t)) at every possible saving of a new pocket vector w(t) and maintains the previous configuration when the number of misclassified samples would increase. In this way the cumulative error ε decreases monotonically when a new pocket vector is saved and the stability of the algorithm increases. The theoretical properties of the pocket algorithm with ratchet are supported by the following:

Theorem 5. The pocket algorithm with ratchet is finite-optimal if the training set is finite and contains only input patterns with integer or rational components.

Proof. From the procedure followed by the pocket algorithm with ratchet we obtain that the cumulative error ε(w(t)) associated with the pocket vector w(t) decreases monotonically towards a minimum ε(w(t̄)) corresponding to the last saved configuration. We have thus w(t) = w(t̄) for every t ≥ t̄. Let k̄ denote the number of permanences scored by w(t̄); we have

    P(ε(w(t)) ≠ εmin for some t ≥ t̄) = P(w(t̄) ∉ Wr) ≤ lim_{t→+∞} P(w(t) ∉ Wr) ≤ lim_{t→+∞} Q^t_{k̄+τ+1}(p_r) = 0

having used the upper bound in (4).
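The ratchet variant can be sketched as follows (an illustrative Python rendering under an assumed list-based sample encoding; not the original implementation):

```python
import random

def cumulative_error(v, X, y):
    """Number of training samples misclassified by the threshold unit N(v)."""
    wrong = 0
    for xi, yi in zip(X, y):
        x = xi + [1.0]
        out = 1 if sum(w * c for w, c in zip(v, x)) >= 0 else 0
        wrong += out != yi
    return wrong

def pocket_with_ratchet(X, y, iterations, seed=None):
    """Pocket algorithm with ratchet: a candidate with a record run replaces
    the pocket vector only if its cumulative error on S is strictly smaller,
    so the saved error decreases monotonically."""
    rng = random.Random(seed)
    v = [0.0] * (len(X[0]) + 1)
    pocket = list(v)
    pocket_err = cumulative_error(pocket, X, y)
    best_run, run = 0, 0
    for _ in range(iterations):
        i = rng.randrange(len(X))
        x = X[i] + [1.0]
        out = 1 if sum(w * c for w, c in zip(v, x)) >= 0 else 0
        if out == y[i]:
            run += 1
            if run > best_run:                     # possible saving: ratchet check
                err = cumulative_error(v, X, y)
                if err < pocket_err:               # never accept a worse vector
                    pocket, pocket_err = list(v), err
                best_run = run
        else:
            sign = 1 if y[i] == 1 else -1
            v = [w + sign * c for w, c in zip(v, x)]
            run = 0
    return pocket, pocket_err
```

The extra error evaluations make each saving more expensive, but the stored error can never increase, which is exactly the monotonicity used in the proof of Theorem 5.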
References

1. Hertz, J., Krogh, A., and Palmer, R. G. Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
2. Mézard, M., and Nadal, J.-P. Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A 22 (1989), 2191–2203.
3. Frean, M. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation 2 (1990), 198–209.
4. Muselli, M. On sequential construction of binary neural networks. IEEE Transactions on Neural Networks 6 (1995), 678–690.
5. Gallant, S. I. Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1 (1990), 179–191.
6. Rosenblatt, F. Principles of Neurodynamics. Washington, DC: Spartan Press, 1961.
7. Muselli, M. On convergence properties of pocket algorithm. Submitted for publication in IEEE Transactions on Neural Networks.
8. Minsky, M., and Papert, S. Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press, 1969.
9. Nummelin, E. General Irreducible Markov Chains and Non-Negative Operators. New York: Cambridge University Press, 1984.
10. Godbole, A. P. Specific formulae for some success run distributions. Statistics & Probability Letters 10 (1990), 119–124.