Gilles Labonté
Department of Mathematics and Computer Science Royal Military College of Canada Kingston, Ontario, K7K 5L0 Canada
Abstract. The implementation of the relaxation-projection algorithm by artificial neural networks to solve sets of linear inequalities is examined. The different versions of this algorithm are described, and theoretical convergence results are given. The best known analogue optimization solvers are shown to use the simultaneous projection version of it. Neural networks that implement each version are described. The results of tests, made with simulated realizations of these networks, are reported. These tests consisted in having all networks solve some sample problems. The results obtained help determine good values for the step size parameters, and point out the relative merits of the different networks.
The problem of solving a system of linear inequalities arises in numerous applications. It is omnipresent in optimization, where it is solved by itself, or concurrently with the problem of finding the minimum value of a cost function, as with the simplex algorithm [6], or as a preliminary step for interior point methods (see for example, Chapter 5 of [9]).
The particular method of solution of this problem, called the relaxation-projection method, is the main object of the present article. The original research on this method was carried out around 1954 by S. Agmon [1] and by T.S. Motzkin and I.J. Schoenberg [17]. Associating the inequalities with half-spaces, in which lie the points corresponding to the feasible solutions, they proved that such a point can be reached, from an arbitrary outside point, by constructing a trajectory of straight line segments, each of which is in the direction of one of the half-spaces corresponding to violated constraints.
Many neural network training procedures consist of, or are based upon, a method very closely related to that algorithm. For example, the single layer perceptron training method is such a process. Notwithstanding this fact, it seems that F. Rosenblatt [21], H.D. Block [3] and the many others who contributed to its proof of convergence were unaware of the work of Agmon, Motzkin and Schoenberg, since no mention of it can be found in their writings.
Recently, H. Oh and S.C. Kothari [18,19], in their study of neural networks used as bidirectional associative memory, realized the usefulness of these results and, in effect, proposed using a particular version of the relaxation-projection algorithm to calculate directly the weights of the neurons.
Even though the mathematical results concerning these algorithms are clearly very pertinent to the field of artificial neural networks, they seem to have gone very much unnoticed by the researchers in that field, until their use by Oh and Kothari. Thus, one of our aims, in the present article, is to draw attention to the most important results concerning these methods. We shall describe the different versions of the relaxation-projection algorithm, known as the maximal distance, the maximal residual, the systematic, the general recurrent and the simultaneous relaxation-projection methods. We shall also give definite theorems concerning the step size parameters for which convergence, and even termination in a finite number of steps, is guaranteed.
After having done so, we shall look at some of the best known analogue optimization networks, namely those of L.O. Chua and G.N. Lin [5] and of M.P. Kennedy and L.O. Chua [13], of D.W. Tank and J.J. Hopfield [23], and of A. Rodríguez-Vázquez et al. [20]. We shall demonstrate that they are all making use of a continuous time version of the simultaneous relaxation-projection algorithm.
We shall then show neural networks which implement each of the different versions of the relaxation-projection algorithm. We shall give the number of units of time needed to perform one step, and the formulas for the number of neurons these networks require, in terms of the number of variables and the number of inequalities to solve. We shall describe two types of implementations, one with fixed weights, and one with weights varying according to Hebb's rule.
Finally, we report on tests we made with simulated realizations of all these networks. These tests consisted in having each network solve a set of fifteen small problems, with from two to six variables, and from four to sixteen inequalities, and one somewhat larger problem with twenty variables and thirty-five inequalities. Different step size parameters were used, so that we can determine good values to use for these parameters, as well as compare the relative merits of the different networks.
1.1 Notation
We consider the problem of finding a vector x ∈ R^n such that Ax + b ≥ 0, where A is a constant m×n matrix and b is a constant vector in R^m. If we let ai denote the transpose of the i'th row of A, these inequalities can be written as
$$w_i(x) = \langle a_i, x\rangle + b_i \ge 0 \quad \text{for } i = 1,\dots,m, \tag{1}$$
where < , > is the Euclidean scalar product. We assume that no ai is the zero vector.
Define the closed half-space hi and its bounding hyperplane πi as
hi = {x : wi(x) ≥ 0 },
πi = {x : wi(x) = 0 }.
ni = ai / |ai| is then the unit normal to πi that points inward of hi. A point x is "on the right side" of πi if it is in hi; otherwise, it is "on the wrong side" of it. The Euclidean distance between point x and hyperplane πi is
$$\mathrm{dist}(x, \pi_i) = \varepsilon_i\,(\langle n_i, x\rangle + \beta_i) \tag{3}$$
where βi = bi / |ai| and εi = 1 if x is on the right side of πi and -1 if it is on its wrong side. The distance between point x and the half-space hi is dist(x, hi) = dist(x, πi) if x ∉ hi and zero if x ∈ hi. The solutions to the system of inequalities correspond to the points of the convex polytope Ω, defined as the intersection of all half-spaces hi. We shall assume hereafter that Ω is non empty.
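The notation above can be made concrete with a few helper functions. The following is only a minimal sketch in plain Python; the function names and the sample half-space are our own illustrative choices, not part of the original text.

```python
import math

def dot(u, v):
    """Euclidean scalar product <u, v>."""
    return sum(p * q for p, q in zip(u, v))

def w(a_i, b_i, x):
    """Linear form w_i(x) = <a_i, x> + b_i."""
    return dot(a_i, x) + b_i

def dist_to_hyperplane(a_i, b_i, x):
    """dist(x, pi_i) = |<a_i, x> + b_i| / |a_i|."""
    return abs(w(a_i, b_i, x)) / math.sqrt(dot(a_i, a_i))

def dist_to_halfspace(a_i, b_i, x):
    """dist(x, h_i): zero inside h_i, distance to pi_i outside."""
    return 0.0 if w(a_i, b_i, x) >= 0 else dist_to_hyperplane(a_i, b_i, x)

# Hypothetical example: the half-space x1 + x2 - 2 >= 0 in R^2
a, b = [1.0, 1.0], -2.0
print(dist_to_halfspace(a, b, [0.0, 0.0]))  # about 1.414 (wrong side)
print(dist_to_halfspace(a, b, [3.0, 3.0]))  # 0.0 (right side)
```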
1.2 Methods of Solution
Essentially all optimization methods, except of course those that require starting from a feasible solution, will solve the feasibility problem when the true cost function is set to zero. This is particularly straightforward to implement for methods whose objective function consists of two terms: a term for the actual cost to be minimized, and a penalty term for the unsatisfied constraints. We recall (see, for example, Chapter 5 of Ref. [9]) that the penalty functions that are most commonly used in optimization are the two functions F1 and F2 defined by
$$F_1(x) = \sum_{i \in I(x)} \eta_i\, |\langle a_i, x\rangle + b_i|, \qquad F_2(x) = \sum_{i \in I(x)} \eta_i\, (\langle a_i, x\rangle + b_i)^2,$$
where Ι(x) is the set of the indices of the constraints which are violated by x, and ηi are some positive constants.
There are also algorithms which have been developed especially for the solution of the feasibility problem. This is the case for the relaxation-projection method of S. Agmon [1], T.S. Motzkin and I.J. Schoenberg [17], mentioned above, and for the simultaneous relaxation-projection method, proposed more recently by Y. Censor and T. Elfving [4]. This latter method is a variant of the former, in which the steps of the iteration sequence are made in the direction of an average of the normals toward all the half-spaces of the violated constraints. This method is remarkable in that, as proved by A.R. De Pierro and A.N. Iusem [7], even for inconsistent problems, it will produce a point for which the weighted average of the squares of the distances to the half-spaces, i.e. the value of the function F2, is minimum.
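Define the operators T(hi), i=1,...,m, such that
$$T(h_i)\,x = \begin{cases} x & \text{if } x \in h_i, \\ x + [\lambda_i\,\mathrm{dist}(x, \pi_i) + \rho_i]\, n_i & \text{if } x \notin h_i, \end{cases} \tag{5}$$
where λi and ρi are non-negative constants. Define also the operator T:
$$T = \sum_{i=1}^{m} \gamma_i\, T(h_i), \tag{6}$$
where each γi > 0 and
$$\sum_{i=1}^{m} \gamma_i = 1.$$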
Single-Plane Algorithm. Define an infinite sequence of half-spaces {Hν}, by repeating elements of the set {hi,i=1...,m}, as prescribed below. Take an arbitrary x0 in Rn and as long as xν ∉ Ω, define inductively xν+1 = T(Hν) xν.
How the sequence {Hν} is defined characterizes different versions of this algorithm. Some often considered choices are
1) The maximal distance algorithm, for which Hν is the half-space farthest away from xν, or any one of them if more than one is at the largest distance.
2) The maximal residual algorithm, for which Hν = hi if wi(xν) is the linear form in the set of Ineqs. (1) which has the most negative value, or any one of them if more than one has this value.
3) The systematic projection algorithm, for which {Hν} is the infinite cyclic sequence with Hν = hi for ν = i (mod m).
4) The general recurrent projection algorithm, for which the infinite sequence of half-spaces {Hν} is arbitrary except for the requirement that any one half-space hi must reoccur in a finite number of steps, after any given ν. Sequences so defined are commonly considered in neural network theory, when it comes to presenting a finite set of exemplars to a learning neural network; (see, for example, F. Rosenblatt [21] and H.D. Block [3]). The systematic projection algorithm is obviously a particular case of it.
Multi-Plane Algorithm. Take an arbitrary x0 in Rn, and as long as xν ∉ Ω, define inductively xν+1 = T xν.
This is the simultaneous projection algorithm. In its more general form, the γi's are allowed to vary from step to step, as long as their sum remains 1.
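The single-plane and multi-plane iterations described above can be sketched as follows. This is a minimal illustration, not an implementation taken from the references: the function names, the toy system, and the parameter values λi = 1, ρi = 0.1, γi = 1/m are all assumptions made for the example. A positive ρi is used so that the exact feasibility test stops after finitely many steps.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def T_h(A, b, i, x, lam, rho):
    """T(h_i) x: x itself if x is in h_i, else x + [lam*dist(x, pi_i) + rho] n_i."""
    norm_a = math.sqrt(dot(A[i], A[i]))
    w = dot(A[i], x) + b[i]
    if w >= 0:
        return list(x)
    step = (lam * (-w / norm_a) + rho) / norm_a   # coefficient of a_i = |a_i| n_i
    return [xj + step * aj for xj, aj in zip(x, A[i])]

def T_simultaneous(A, b, x, lam, rho, gamma):
    """T x = sum_i gamma_i T(h_i) x, with gamma_i > 0 summing to 1."""
    y = [0.0] * len(x)
    for i, g in enumerate(gamma):
        y = [yj + g * tj for yj, tj in zip(y, T_h(A, b, i, x, lam, rho))]
    return y

def is_feasible(A, b, x):
    return all(dot(a, x) + bi >= 0 for a, bi in zip(A, b))

# Selection rules for the single-plane variants (nu is the step counter).
def maximal_distance(A, b, x, nu):
    return min(range(len(A)),
               key=lambda i: (dot(A[i], x) + b[i]) / math.sqrt(dot(A[i], A[i])))

def maximal_residual(A, b, x, nu):
    return min(range(len(A)), key=lambda i: dot(A[i], x) + b[i])

def systematic(A, b, x, nu):
    return nu % len(A)

def solve(A, b, x0, rule=None, lam=1.0, rho=0.1, max_steps=10000):
    """Iterate until x is in Omega; rule=None selects the simultaneous method."""
    x = list(x0)
    gamma = [1.0 / len(A)] * len(A)
    for nu in range(max_steps):
        if is_feasible(A, b, x):
            return x, nu
        if rule is None:
            x = T_simultaneous(A, b, x, lam, rho, gamma)
        else:
            x = T_h(A, b, rule(A, b, x, nu), x, lam, rho)
    return x, max_steps

# Toy system (hypothetical): x1 >= 1, x2 >= 1, x1 + x2 <= 4
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [-1.0, -1.0, 4.0]
for rule in (maximal_distance, maximal_residual, systematic, None):
    x, steps = solve(A, b, [-3.0, 5.0], rule)
    print(rule.__name__ if rule else "simultaneous", steps, x)
```

The general recurrent variant is obtained by passing any selection rule that revisits every index within a bounded number of steps; `systematic` is the simplest such rule.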
2.1 General Convergence Properties
In this and the following two sections, we review some important results concerning the convergence of the relaxation-projection algorithms described above. Because the results we could find published did not cover all the variants of these algorithms, we had to generalize some of them. We simply state those results that had already been proven as such, and prove those that resulted from some generalization. The proofs are worth going over in that they provide a good understanding of the nature of the algorithms.
We start with two preliminary lemmas on which most of the proofs are based.
Lemma 2.1: Let x be an arbitrary point, let y = T(hi) x with 0 ≤ λi ≤ 2 and let the point a ∈ hi be such that 0 ≤ ρi ≤ 2 dist (a, πi), then | y - a | ≤ | x - a |.
Proof: If x ∈ hi, then y = x and the result is trivial. Let us therefore consider x ∉ hi and, to simplify the notation, define di(x) = dist (x, πi). A straightforward algebraic calculation, making use of Eqs. (5) and (3), yields the following equation:
$$|y - a|^2 = |x - a|^2 - Q_i(x), \tag{7}$$
$$Q_i(x) = [\lambda_i d_i(x) + \rho_i]\,[(2 - \lambda_i)\, d_i(x) + 2 d_i(a) - \rho_i]. \tag{8}$$
The factorization we have made in Qi(x) makes it evident that, under the hypotheses of this lemma, Qi(x) ≥ 0 ∀ i.
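For readers who want the intermediate algebra, the calculation behind this identity can be spelled out as below. This is our own expansion of the step the proof calls straightforward; the shorthand s for the step length is introduced only for this derivation. It uses the fact that, for x ∉ hi and a ∈ hi, the distance formula gives <ni, x> + βi = -di(x) and <ni, a> + βi = di(a).

```latex
\begin{align*}
y - a &= (x - a) + s\, n_i, \qquad s = \lambda_i d_i(x) + \rho_i, \\
|y - a|^2 &= |x - a|^2 + 2 s\, \langle n_i, x - a\rangle + s^2 \\
          &= |x - a|^2 - 2 s\, [d_i(x) + d_i(a)] + s^2 \\
          &= |x - a|^2 - s\, [\,2 d_i(x) + 2 d_i(a) - \lambda_i d_i(x) - \rho_i\,] \\
          &= |x - a|^2 - [\lambda_i d_i(x) + \rho_i]\,[(2 - \lambda_i)\, d_i(x) + 2 d_i(a) - \rho_i].
\end{align*}
```

Both factors of the last product are non-negative when 0 ≤ λi ≤ 2 and 0 ≤ ρi ≤ 2 di(a), which is exactly the hypothesis of the lemma.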
Lemma 2.2: Let Ω be non empty, let x be an arbitrary point outside Ω, let y = Tx while for each T(hi) entering in T, 0 ≤ λi ≤ 2 and 0 ≤ ρi ≤ 2 dist (a, πi) for some point a ∈ Ω, then | y - a | ≤ | x - a |.
Proof: Upon using the definition of T, given in Eq.(6), and the result of Lemma 2.1, one gets:
$$|y - a| \le \sum_{i=1}^{m} \gamma_i\, |T(h_i)x - a| \le \sum_{i=1}^{m} \gamma_i\, |x - a| = |x - a|. \tag{9}$$
Theorem 1: Let {xν} be any type of single-plane or multi-plane relaxation-projection sequence, with 0 ≤ λi ≤ 2 ∀ i and with 0 ≤ ρi ≤ 2 dist (a, πi) ∀ i, for some point a ∈ Ω, then the sequence of distances { | xν - a | } is monotonically non-increasing and thus convergent.
Proof: Lemmas 2.1 and 2.2 imply that the inequality | xν+1 - a | ≤ | xν - a | holds for all ν's. The sequence of distances { | xν - a | } is therefore monotone non-increasing and since it is obviously bounded below by zero, it is then necessarily convergent.
Theorem 1 states that all relaxation-projection sequences have the remarkable geometrical property, called Fejér monotonicity, of approaching pointwise the polytope Ω or a subset of it. Indeed, when ρi = 0 ∀ i, each step xν → xν+1 of the algorithm produces a point xν+1 that is closer than xν, or at an equal distance, to every point of the polytope Ω. When ρi > 0, a similar property holds with respect to the subset {a: 2 dist (a, πi) ≥ ρi ∀ i} of Ω. This subset is a sort of core inside Ω, the boundaries of which are obtained by translating inwards by ρi/2 each hyperplane πi, i=1...m. Note that when Ω is not full-dimensional and all the ρi's are positive, this subset is empty.
Theorem 1, or particular cases of it, can be found stated in most articles dealing with the convergence of relaxation-projection algorithms. S. Agmon [1] and T.S. Motzkin and I.J. Schoenberg [17] were the first to mention it for single-plane algorithms, with ρi = 0, λi = λ ∀ i, and 0 < λ < 2. H. Oh and S.C. Kothari [19] proved the same result for algorithms with ρi > 0 and the same conditions as above on the λi's. Although the latter authors talk explicitly about the systematic relaxation-projection algorithm, their proof clearly holds for all single-plane algorithms. However, they do not provide an explicit upper bound on the ρi's, as we did in Theorem 1; they simply state that if Ω is full-dimensional, the ρi's can always be taken small enough for the property to hold. Y. Censor and T. Elfving [4] proved the Fejér monotonicity of the multi-plane simultaneous relaxation-projection algorithm with ρi = 0 and 0 < λi < 2 ∀ i.
Our proof of Theorem 1 has the merit of covering all variants of the algorithm and it is somewhat more direct than some of the above mentioned proofs.
Theorem 2: Let {xν} be any type of single-plane or multi-plane relaxation-projection sequence, with 0 ≤ λi ≤ 2 ∀ i and with 0 < ρi < 2 dist (a, πi) ∀ i, for some point a ∈ Ω, then {xν} terminates after a finite number of steps at a point of Ω.
Proof: Let ρm be the smallest of the ρi's and define the positive constant K = Min { [2di(a) - ρi]: i=1,...,m}. Eq.(8) shows that Qi(x) ≥ ρm K ∀ i.
i) Single-plane methods. With the help of Eq.(7) and the above lower bound on Qi, one sees that each time a step xν → xν+1 of the algorithm is made, with a half-space that does not contain xν, | xν+1 - a |² ≤ | xν - a |² - ρm K. Thus, if at point xµ there have been N non-trivial steps, | xµ - a |² ≤ | x0 - a |² - N ρm K. Since distances are bounded below by zero, there can only be a finite number of non-trivial steps.
ii) Multi-plane method. With the lower bound obtained above on Qi, Eq.(7), which holds for x ∉ hi, leads to | T(hi) x - a | ≤ | x - a | - [ρm K]½. Thus, Ineq.(9) can be refined to yield
$$|x_{\nu+1} - a| \le |x_\nu - a| - \sum_{i \in I_\nu} \gamma_i\, [\rho_m K]^{1/2} \le |x_\nu - a| - \gamma_m\, [\rho_m K]^{1/2},$$
where Ιν is the set of indices of the half-spaces not containing xν and γm is the smallest of the γi's. Thus, at point xµ, | xµ - a | ≤ | x0 - a | - µ γm [ρm K]½. Again, since distances are bounded below by zero, there can only be a finite number of non-trivial steps.
F. Rosenblatt, in Chapter 5 of Ref.[21] and H.D. Block in his article [3] about the convergence of the learning procedure of single (evolving) layer perceptrons, have proven the result of Theorem 2, for the single-plane general recurrent algorithm, with λi = 0 and ρi > 0 ∀ i, in the particular situation where the polytope Ω is actually a hypercone. For hypercones, the conditions of the theorem impose no upper bound on the values of the ρi's since whatever these values, it is always possible to find a point a inside the hypercone, far enough from its apex, so that these conditions hold. Although it is not obvious, the pseudo-relaxation method proposed by H. Oh and S.C. Kothari [18,19] can be recognized as the single-plane systematic relaxation-projection algorithm, with λi = λ, 0 < λ < 2 and ρi > 0 ∀ i. The argument we used above, for the part of our Theorem 2 that deals with the general single-plane algorithm, is the same one they used in proving their Theorem 1 of Ref.[19]. Note however that they did not provide an explicit upper bound on the ρi's, as we do; they simply stated that if Ω is full-dimensional, the ρi's can always be taken small enough for the property to hold. We did not find any published proof of Theorem 2 for the simultaneous relaxation-projection algorithm.
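The step-count argument of part (i) of the proof of Theorem 2 gives an explicit bound, N ≤ | x0 - a |² / (ρm K), which can be checked on a small example. The sketch below runs the maximal distance algorithm on a hypothetical three-inequality system; the system, the starting point, the interior point a and the values of λ and ρ are all assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Toy system (hypothetical): x1 >= 1, x2 >= 1, x1 + x2 <= 4
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [-1.0, -1.0, 4.0]
lam, rho = 1.0, 0.1          # same lambda_i and rho_i for every i
a_core = [1.5, 1.5]          # a point of Omega with 2*dist(a, pi_i) > rho

def signed_dist(i, x):
    """<n_i, x> + beta_i: positive inside h_i, negative outside."""
    return (dot(A[i], x) + b[i]) / math.sqrt(dot(A[i], A[i]))

def T(i, x):
    """One relaxation-projection step toward h_i."""
    s = signed_dist(i, x)
    if s >= 0:
        return list(x)
    step = (lam * (-s) + rho) / math.sqrt(dot(A[i], A[i]))
    return [xj + step * aj for xj, aj in zip(x, A[i])]

def run_maximal_distance(x0, max_steps=100000):
    x, steps = list(x0), 0
    while steps < max_steps:
        i = min(range(len(A)), key=lambda k: signed_dist(k, x))
        if signed_dist(i, x) >= 0:      # no violated constraint: x is in Omega
            return x, steps
        x = T(i, x)
        steps += 1
    return x, steps

x0 = [-3.0, 5.0]
K = min(2 * signed_dist(i, a_core) - rho for i in range(len(A)))
bound = sum((p - q) ** 2 for p, q in zip(x0, a_core)) / (rho * K)
x_final, n_steps = run_maximal_distance(x0)
print(n_steps, "non-trivial steps; bound from the proof:", bound)
```

On such a small problem the bound is typically very loose; its role in the proof is only to show that the number of non-trivial steps is finite.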
2.2 Convergence of Single-Plane Methods with ρi = 0
Theorem 3: Let {xν} be any type of single-plane relaxation-projection sequence with 0 < λi < 2 and ρi = 0 ∀ i, then {xν} converges to a point of Ω.
For the proof of this theorem, we shall use the following lemma, which holds under the same hypotheses as Theorem 3.
Lemma 2.3: The sequence {| xν+1 - xν |} converges to zero.
Proof: The definition of the sequence {xν} is such that it terminates only if it has reached a point of Ω; thus the theorem needs to be proven only for infinite sequences {xν}. We write hereafter Λν for λi and Πν for πi if the half-space Hν is hi, and we write Dν(x) for dist(x, Πν). Thus, Eq.(7) becomes
$$|x_{\nu+1} - a|^2 = |x_\nu - a|^2 - Q_\nu, \tag{10}$$
with, when xν ∉ Hν,
$$Q_\nu = \Lambda_\nu D_\nu(x_\nu)\,[(2 - \Lambda_\nu)\, D_\nu(x_\nu) + 2 D_\nu(a)] \ge \Lambda_\nu (2 - \Lambda_\nu)\,[D_\nu(x_\nu)]^2 .$$
Since the sequences { | xν+1 - a | } and { | xν - a | } have the same limit, it follows from Eq.(10) that the limit of the non-negative sequence {Qν} must be zero. The positive sequence { Λν(2 - Λν) } takes only finitely many values and is therefore bounded away from zero, so this can happen if and only if the sequence {Dν(xν)} converges to zero. The conclusion of the lemma follows from the fact that for any single-plane relaxation-projection sequence, with ρi = 0 ∀ i, the step size is | xν+1 - xν | = Λν Dν(xν).
Proof of Theorem 3: Lemma 2.1 states that the sequence {xν} is bounded. Thus, by the Bolzano-Weierstrass theorem, it must have at least one accumulation point l. Our proof of Theorem 3 will consist in proving that there is only one such point, which is thus the limit of {xν}, and that this point is in Ω. We first show that l cannot be outside of Ω. For this, we consider the possibility that it is, and let d be the distance between the point l and the closest half-space not containing it, and let λm be the smallest of the λi's. By the definition of accumulation points, for any ε > 0, there exists an index ν0(ε) after which the sequence {xν} has an infinite number of its points in the closed sphere Sc(l,ε) of radius ε, centered on l. Consider then ε [...]. Therefore | xν+1 - xν | = Λν Dν(xν) ≥ λm Dν(xν) > (2+λm)ε, which is incompatible with the latter condition on xν. Thus no accumulation point of the sequence {xν} can be outside Ω.
The accumulation point l must be on the surface of Ω. Indeed, it cannot be inside Ω, since then ε can be taken such that the sphere Sc(l,ε) lies entirely inside Ω. The first point of the sequence {xν} to enter this sphere would then be inside Ω, and the sequence would terminate at this point.
There then remains only to prove that there can be only one accumulation point at the surface of Ω. Suppose there are two different such points l and q, and take ε < | l - q | /2 and small enough that the sphere Sc(l,ε) is traversed only by hyperplanes containing l. Then, if xν is a point of the sequence in this sphere, xν+1 must also be in this sphere because the reflecting hyperplane Πν necessarily passes through l. By induction, one can prove that all following points are also necessarily in Sc(l,ε), contrary to the hypothesis of existence of another accumulation point q ≠ l.
As for our Theorem 1 above, S. Agmon was the first to prove this result explicitly, in his Theorem 3 of Ref.[1], for maximal distance relaxation-projection sequences, with ρi = 0 and λi = λ ∀ i, and 0 < λ < 2. In Section 4 of Ref. [1], he states that his result can be proven as well for maximal residual and systematic relaxation-projection sequences. T.S. Motzkin and I.J. Schoenberg [17] also proved exactly the same result as Agmon, but by a different method. The proof we present above covers explicitly all variants of the single-plane algorithm and involves ideas similar to those used by Motzkin and Schoenberg.
Theorem 4: (a) When Ω is full-dimensional, there exists a constant λ0 ∈ [1,2) such that all the single-plane relaxation-projection sequences, for which λ0 < λi ≤ 2 and ρi = 0 ∀ i, terminate after a finite number of steps. (λ0 is a geometrical constant associated with the polytope Ω, defined in Ref.[10]). (b) Furthermore, if λ > 2, then the sequence either converges finitely to a point of Ω or it does not converge.
T.S. Motzkin and I.J. Schoenberg [17] were the first to prove finite termination, for the particular case λi = 2 and ρi = 0 ∀ i. J.L. Goffin (see his Theorem (3.3.1) and his Section 4.1 of Ref. [10]) then proved the above theorem, which constitutes a noteworthy improvement over the result of Motzkin and Schoenberg, in that it guarantees termination for a whole interval of λi's. This fact may prove important in numerical computations, since it avoids the inevitable instability of a property which holds only for one particular value of some parameter.
2.3 Convergence of the Multi-Plane Method with ρi = 0
Theorem 5: Consider
$$F(x) = \sum_{i=1}^{m} \gamma_i \lambda_i\, [\mathrm{dist}(x, h_i)]^2 .$$
A simultaneous relaxation-projection sequence {xν}, with ρi = 0 and 0 < λi < 2, produces a monotonically non-increasing sequence { F(xν) }. {xν} converges to a solution, if the system of inequalities is consistent and, if not, to a minimizer of F(x).
Y. Censor and T. Elfving [4] were the first to prove the convergence of the simultaneous relaxation-projection sequence under the conditions of Theorem 5. A proof similar to theirs is easily produced with the help of the ideas we used in the proofs presented above. A different proof was given by A.R. De Pierro and A.N. Iusem in Ref. [7] who also had the merit of proving Theorem 5 as such.
It is remarkable that Theorem 5 is the only result to the effect that each step of a relaxation-projection sequence decreases an objective function. Even the similar simultaneous relaxation-projection sequence with ρi > 0 has not been proven to have this property with respect to its objective function F:
$$F(x) = \sum_{i=1}^{m} \gamma_i\, \{\lambda_i\, [\mathrm{dist}(x, h_i)]^2 + 2\rho_i\, \mathrm{dist}(x, h_i)\}. \tag{11}$$
It is, of course, nonetheless true that all relaxation-projection sequences, which converge to a point of Ω, produce a sequence of values {F(xν)} which converges to 0, and therefore to the minimum of F(x), where F is the objective function of Eq.(11), with arbitrary positive constants γi.
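The monotone decrease asserted by Theorem 5 can be observed numerically. The following sketch evaluates F(x) = Σ γi λi [dist(x, hi)]² along a simultaneous relaxation-projection sequence with ρi = 0; the toy system and the values λi = 1.5, γi = 1/3 are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Toy system (hypothetical): x1 >= 1, x2 >= 1, x1 + x2 <= 4
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [-1.0, -1.0, 4.0]
lam = [1.5, 1.5, 1.5]            # 0 < lambda_i < 2
gamma = [1.0 / 3.0] * 3          # positive weights summing to 1

def dist_h(i, x):
    """dist(x, h_i): zero when the i-th inequality holds."""
    w = dot(A[i], x) + b[i]
    return max(0.0, -w) / math.sqrt(dot(A[i], A[i]))

def F(x):
    """Objective function of Theorem 5."""
    return sum(g * l * dist_h(i, x) ** 2
               for i, (g, l) in enumerate(zip(gamma, lam)))

def T(x):
    """Simultaneous step with rho_i = 0: x + sum_i gamma_i lambda_i dist(x, h_i) n_i."""
    y = list(x)
    for i in range(len(A)):
        c = gamma[i] * lam[i] * dist_h(i, x) / math.sqrt(dot(A[i], A[i]))
        y = [yj + c * aj for yj, aj in zip(y, A[i])]
    return y

x = [-3.0, 5.0]
values = [F(x)]
for _ in range(50):
    x = T(x)
    values.append(F(x))
print(values[:5])
```

The form of `T` used here follows from Eqs. (5) and (6): since the γi sum to 1 and dist(x, hi) vanishes for satisfied constraints, Σ γi T(hi) x reduces to x plus the weighted sum of the corrections for the violated constraints.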
Widespread interest in the use of electrical circuits as analog solvers for optimization problems was really awakened in 1986 by D.W. Tank and J.J. Hopfield [23] (for an overview of this subject, see C.Y. Maa and M. Shanblatt [15]). As M.P. Kennedy and L.O. Chua [12] pointed out, the network proposed by Tank and Hopfield, with corrected sign for the penalty function, is closely related to the canonical nonlinear programming circuit of L.O. Chua and G.N. Lin [5]. Another circuit that is also often mentioned is that of A. Rodríguez-Vázquez et al. [20].
We shall briefly examine, in this section, the feasibility solvers that these networks become when they are used with a zero cost function. Although they are also of interest, we shall not discuss, in the present article, models where equality and inequality constraints are treated separately (as, for example, the model of S.H. Zak et al. [24]). Thus, we consider models that solve instances of the optimization problem:
Minimize φ(x), subject to the set of constraints Ax + b ≥ 0.
3.1 The Model of Chua-Lin and Kennedy-Chua
The circuit of Chua-Lin and Kennedy-Chua [5, 13] is devised to solve the general optimization problem, where minimal assumptions are made about the cost function and the constraint functions. When the constraints are taken to be linear in x, the evolution equation for the n-vector x of voltages in this circuit is
$$C\,\frac{dx}{dt} = -\nabla\varphi + \frac{1}{R} \sum_{i=1}^{m} |a_i|^2\, \mathrm{dist}(x, h_i)\, n_i, \tag{12}$$
where C is an n×n diagonal matrix of constant capacitances and R is a constant resistance. A Liapunov function that is minimized by this system is
$$E(x) = \varphi + \frac{1}{2R} \sum_{i=1}^{m} |a_i|^2\, [\mathrm{dist}(x, h_i)]^2 .$$
When the cost function φ is zero, the network implements a continuous time version of the simultaneous relaxation-projection algorithm, with ρi = 0 ∀ i. In order to see this, make the change of variables x̃ = (RC)^{1/2} x, ãi = (RC)^{-1/2} ai to remove the constants R and C from Eq.(12). A first order Euler discretization of the resulting equation then produces the equation xν+1 = T xν, which describes one step of the simultaneous relaxation-projection algorithm, with ρi = 0 ∀ i and γiλi = ∆t |ãi|². The value of the Liapunov function E is then F(x)/2R∆t, where F is the objective function of our Theorem 5.
Note that the conditions for the convergence of the discrete algorithm, seen in Section 2, correspond here to upper bounds on the step size ∆t.
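The correspondence between the Euler step and the simultaneous relaxation-projection step can be verified directly. The sketch below works in the rescaled variables, that is, it assumes that R and C have already been absorbed so that the zero-cost dynamics read dx/dt = Σ |ai|² dist(x, hi) ni. The toy system, the choice γi = 1/m, and the function names are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Toy system (hypothetical); the first row is deliberately not normalized
A = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 2*x1 - 2 >= 0, x2 - 1 >= 0, 4 - x1 - x2 >= 0
b = [-2.0, -1.0, 4.0]
m = len(A)

def euler_step(x, dt):
    """x + dt * dx/dt, written as dx/dt = -sum over violated i of (<a_i, x> + b_i) a_i."""
    y = list(x)
    for a, bi in zip(A, b):
        w = dot(a, x) + bi
        if w < 0:
            y = [yj - dt * w * aj for yj, aj in zip(y, a)]
    return y

def projection_step(x, dt):
    """T x with rho_i = 0, gamma_i = 1/m and lambda_i = m * dt * |a_i|^2."""
    y = [0.0] * len(x)
    for a, bi in zip(A, b):
        na = math.sqrt(dot(a, a))
        lam_i = m * dt * na ** 2
        d = max(0.0, -(dot(a, x) + bi)) / na                    # dist(x, h_i)
        t = [xj + lam_i * d * aj / na for xj, aj in zip(x, a)]  # T(h_i) x
        y = [yj + tj / m for yj, tj in zip(y, t)]
    return y

dt = 0.1
x = [-3.0, 5.0]
print(euler_step(x, dt), projection_step(x, dt))

# The condition lambda_i < 2 for every i becomes an upper bound on the step size
dt_max = 2.0 / (m * max(dot(a, a) for a in A))
print("dt must stay below", dt_max)
```

With γi = 1/m, the requirement λi < 2 of Section 2 translates into ∆t < 2 / (m max |ai|²), which is one concrete form of the upper bound on the step size.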
3.2 The Model of Rodríguez-Vázquez et al.
The circuit proposed by Rodríguez-Vázquez et al.[20] is also devised to solve the general optimization problem, with minimal conditions on the cost and constraint functions. Their model is characterized by its division of the space in two regions: the region of feasibility, inside of which the objective function is solely the cost function, and the rest of space, where the objective function is solely the penalty function for the violated constraints. Correspondingly, they define the pseudo-cost function
$$\psi(x) = U(Ax+b)\,\varphi(x) + \mu P(x), \tag{14}$$
with U(Ax+b) = 1 if Ax+b ≥ 0, and U(Ax+b) = 0 otherwise.
The constant µ is called the penalty multiplier and P is the penalty function for the violated constraints. P can be taken to be either one of P1 or P2, with
$$P_1(x) = \sum_{i=1}^{m} |a_i|\, \mathrm{dist}(x, h_i), \qquad P_2(x) = \sum_{i=1}^{m} |a_i|^2\, [\mathrm{dist}(x, h_i)]^2 .$$
The equation of motion that corresponds to their circuit is
$$\frac{dx}{dt} = -U(Ax+b)\,\nabla\varphi - \mu \sum_{i \in I(x)} v_i\, a_i, \tag{16}$$
where Ι(x) is the set of indices of the violated constraints, and vi = -1 if P = P1, while vi = - |ai| dist(x, πi) if P = P2.
When the cost function φ is zero, this model corresponds to a continuous time version of the simultaneous relaxation-projection algorithm, with λi =0 ∀ i when the penalty function P is P1 and with ρi=0 ∀ i when P = P2.
A first order Euler discretization of Eq.(16) transforms it into the equation describing a step of the simultaneous relaxation-projection algorithm, with λi = 0 ∀ i and γiρi = µ∆t |ai| when P = P1, and with ρi = 0 ∀ i and γiλi = µ∆t |ai|² when P = P2.
The value of the pseudo-cost function in Eq.(14), when P = P1, is F(x)/2∆t, where F is the objective function of Eq.(11) with λi =0 ∀ i. When P=P2, it is F(x)/∆t, where F is the objective function defined in our Theorem 5. The conditions of convergence, discussed in Section 2, correspond to upper bounds on ∆t.
3.3 The Model of Tank-Hopfield
The linear programming network of Tank and Hopfield [23] minimizes the objective function φ(x) = <k, x>, where k is a constant n-vector. Its equation of motion, for the n-vector of voltages x, with corrected sign for the penalty term (see M.P. Kennedy and L.O. Chua [12]), is
$$\frac{C}{r}\,\frac{dx}{dt} = -k - (rR)^{-1} x + \sum_{i=1}^{m} |a_i|^2\, \mathrm{dist}(x, h_i)\, n_i, \tag{17}$$
where C is an n×n constant diagonal matrix of capacitances, R is a constant n×n diagonal matrix of resistances, and r is the proportionality constant in the linear input-output function for the variable amplifiers. The Liapunov function for this model is
$$E(x) = \langle k, x\rangle + \frac{1}{2} \sum_{i=1}^{m} |a_i|^2\, [\mathrm{dist}(x, h_i)]^2 + \frac{1}{2r}\, \langle x, R^{-1} x\rangle .$$
When the cost function is zero, this model does not quite correspond to the simultaneous relaxation-projection algorithm, due to the additional term -(rR)-1x on the right hand side of Eq.(17).
In order to make conspicuous the effect of this additional term, we consider a problem with a one-dimensional variable x, and the single constraint x ≥ b > 0. The equation of motion and its solution, in the region x < b, are then
$$\frac{C}{r}\,\frac{dx}{dt} = -\frac{1}{rR}\, x + (b - x), \qquad x(t) = \beta + [x(0) - \beta]\, e^{-\alpha t},$$
with
$$\alpha = \frac{1 + rR}{RC} \quad \text{and} \quad \beta = \frac{rRb}{1 + rR}.$$
It is straightforward to show that x(t) always
remains smaller than b at finite times, when x(0) < b, and that the limit x(∞) = β < b. Thus, x(t) never moves into the region where the constraint is satisfied, even asymptotically. Not only that, if x(0) is such that x(∞) < x(0) < b, x(t) will actually move away from the region of feasibility and decrease monotonically to x(∞).
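The one-dimensional example can be checked numerically. The sketch below uses the closed-form solution with α = (1 + rR)/(RC) and β = rRb/(1 + rR), and compares it with a forward-Euler integration of the equation of motion; the circuit constants r, R, C and the bound b are hypothetical values picked only for illustration.

```python
import math

# Hypothetical circuit constants, chosen only for illustration
r, R, C, b = 1.0, 10.0, 1.0, 1.0

alpha = (1.0 + r * R) / (R * C)
beta = r * R * b / (1.0 + r * R)

def x_exact(t, x0):
    """Closed-form solution in the region x < b."""
    return beta + (x0 - beta) * math.exp(-alpha * t)

def x_euler(t_end, x0, dt=1e-4):
    """Forward-Euler integration of (C/r) dx/dt = -x/(rR) + (b - x)."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x += dt * (r / C) * (-x / (r * R) + (b - x))
    return x

for x0 in (0.0, 0.95):
    print(x0, [round(x_exact(t, x0), 4) for t in (0.0, 1.0, 5.0, 20.0)])
print("beta =", beta, "< b =", b)
```

Starting from x(0) = 0 the trajectory rises toward β but never reaches b, and starting from x(0) = 0.95, which lies between β and b, it moves away from the feasible region, as described in the text.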
These calculations corroborate the remark, made by M.P. Kennedy and L.O. Chua [12] and C.Y. Maa and M. Shanblatt [15], to the effect that the Tank-Hopfield network should be used only when the resistances in R are very large, so that the second term on the right hand side of Eq. (17) is negligible. When this is the case, the circuit implements the simultaneous relaxation-projection algorithm.
The analogue networks of Section 3 were all implementations of the simultaneous relaxation-projection algorithm. We now present networks which implement all the variants of the relaxation-projection algorithm. Although we describe them as digital networks performing the discrete algorithms, it should be clear that they can also be realized as analogue electrical circuits, performing the continuous time algorithms.
We consider networks of McCulloch-Pitts type neurons, such that when a neuron has input vector x, weight vector w and activation function f, its output is f(<w, x>). These neurons are arranged in layers, the data taking one unit of time to go through each layer. A clock, and possibly delays, ensure that the proper data enter and leave each layer in step. As with the analogue networks of Section 3, the neurons all have fixed weights which depend on the parameters in the inequalities.
These networks all work on the same principle: an arbitrary vector x0 is initially fed to them as input. They are then left to cycle, their output being fed back as input, until a solution is reached.
4.1 Maximal Distance and Maximal Residual Algorithms
The maximal distance and the maximal residual algorithm each requires a Winner-Takes-All (WTA) subnetwork to select the maximum xM in a set of values {x1,...,xn}. This WTA subnetwork must take the n-vector x = (x1,...,xn)t as input, and return as output the n-vector y = (y1,...,yn)t of zeros, except for a 1 at only one of the positions of xM in x.
Some Winner-Takes-All Networks. 1) Feldman and Ballard [8] have presented a WTA network which operates in one unit of time. It is however composed of neurons that are somewhat more complex than McCulloch-Pitts neurons, in that the value of their activation function depends on the position of the inputs on their surface, as well as the connection weights. When n inputs arrive at different locations on the surface of such a neuron, one of them is favoured in that only when this one is the largest of all inputs does the neuron fire, with output 1. The same behavior is obtained when considering that the "favoured" input is presented directly to the neuron, and the other (n-1) inputs come from the neighboring neurons, through inhibiting channels.
A single layer of such neurons, each having one of the xi's arriving at its favoured surface location, constitutes a WTA network: its output vector will have a 1 only at the positions of xM in x, and zeros elsewhere. This WTA network is obviously the optimum. However, we cannot use it if we restrict ourselves to networks with only McCulloch-Pitts neurons.
2) A WTA network which operates in only two time steps can be made with McCulloch-Pitts neurons.
Its first layer has n(n -1)/2 neurons, which we label as nij with i < j, ∀ i and j ∈ {1,...,n}, according to the two components xi and xj of x that it receives as input. The weights of its connections to these inputs are respectively +1 and -1. Their activation function is the sign function sgn+ : sgn+(x) = 1 if x ≥ 0 and -1 if x < 0. Thus, the components of the output vector of this first layer are the signs of all the possible differences between two components of the input x.
The second layer has n neurons, with the k'th one connected to each neuron nij of the first layer for which either i = k or j = k. Its connection weight is +1 if i = k and -1 if j = k. These neurons have the activation function f: f(x) = 1 if x ≥ (n - 3/2) and 0 otherwise. Thus, when the components of the input to the network are all different, the total input to the k'th neuron is
$$\sum_{i=1\,(i \ne k)}^{n} \mathrm{sgn}(x_k - x_i),$$
which equals n - 1 only when xk is the largest component. It can be seen that, with sgn+ as activation function for the first layer, the output vector of the network will have a 1 only at the first position of xM in x, and 0 everywhere else, as desired. This WTA network requires a total of n(n+1)/2 neurons. An example of this network, with n = 4, is shown in Figure 1.
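The two-layer WTA network just described can be simulated in a few lines. This is a minimal sketch under the stated construction (pairwise sign neurons, then threshold neurons with threshold n - 3/2); the function names and the test vectors are our own.

```python
def sgn_plus(v):
    """sgn+(v) = 1 if v >= 0 and -1 if v < 0."""
    return 1 if v >= 0 else -1

def wta(x):
    """Two-layer winner-takes-all: 1 at the first position of the maximum, 0 elsewhere."""
    n = len(x)
    # First layer: one neuron n_ij per pair i < j, with weights +1 on x_i and -1 on x_j.
    first = {(i, j): sgn_plus(x[i] - x[j])
             for i in range(n) for j in range(i + 1, n)}
    # Second layer: neuron k receives +n_kj (for j > k) and -n_ik (for i < k),
    # and fires when its total input reaches the threshold n - 3/2.
    out = []
    for k in range(n):
        total = (sum(first[(k, j)] for j in range(k + 1, n))
                 - sum(first[(i, k)] for i in range(k)))
        out.append(1 if total >= n - 1.5 else 0)
    return out

print(wta([0.3, 2.0, -1.0, 2.0]))   # [0, 1, 0, 0]
print(wta([5.0, 1.0, 2.0]))         # [1, 0, 0]
```

In the first call the maximum 2.0 occurs twice; because sgn+(0) = 1, the tie is resolved in favour of the earlier position, which is the behaviour the text asks for.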
3) As a final example of a WTA network, we mention the binary maximum selector network devised by T. Martin [16], and described by R.P. Lippmann [14]. By appropriately defining its activation functions at zero, this network can be made to return an output vector y which has a 1 only at the first position of xM in x and 0 everywhere else. This network requires 2[log2 n] +1 layers of neurons, if n > 2 (where [x] = x if x is an integer and the next integer greater than x otherwise) and 1 layer if n = 2. It then operates in as many time steps. It has (5 x 2[log2n] - 6 + n) neurons if n > 2 and 2 neurons if n = 2.
We remark that when n ≥ 5, this network is slower than the second network mentioned above. However, it requires fewer neurons than the latter network whenever n ≥ 13. Since we shall here consider neurons to be inexpensive, we will use the faster second WTA network.
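The second WTA network above can be sketched in a few lines. This is only an illustrative simulation, assuming NumPy; the function name `wta` and the loop organization are our own choices, and the first-layer neurons nij are simulated one after the other rather than in parallel.

```python
import numpy as np

def wta(x):
    """Two-layer WTA network: 1 at the first position of the maximum, 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    total = np.zeros(n)  # total input to each second-layer neuron
    # First layer: neuron n_ij (i < j) outputs sgn+(x_i - x_j).
    for i in range(n):
        for j in range(i + 1, n):
            s = 1.0 if x[i] - x[j] >= 0 else -1.0
            total[i] += s  # weight +1 toward second-layer neuron i
            total[j] -= s  # weight -1 toward second-layer neuron j
    # Second layer: step activation with threshold n - 3/2.
    return (total >= n - 1.5).astype(int)

print(wta([3, 1, 4, 1]))  # the maximum is at the third position
print(wta([2, 5, 5, 1]))  # tie: only the first maximum is marked
```

Because sgn+(0) = 1, a tie between xi and xj with i < j counts in favour of the i'th neuron, which is why only the first maximal position is marked.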
The network that realizes the algorithm. The network shown in Figure 2 performs one step of the maximal distance or the maximal residual algorithm. It takes an arbitrary vector x as input. If this x is a solution of Ineqs.(1), it is returned as the output. If it is not, the output is the vector T(hk) x, where k is the index of the half-space farthest from x, if the maximal distance algorithm is performed, and the index of the smallest negative linear form wi of Ineqs.(1), if the maximal residual algorithm is performed.
Here is how this network functions. Its first layer comprises m neurons: one for each inequality. The threshold of the i'th neuron of this layer is αi βi, where αi = 1 for the maximal distance algorithm and αi = |ai| for the maximal residual algorithm. Its weight vector is - αi ni, where ni is the unit vector normal to the hyperplane πi. As is common practice, we shall take the threshold into account, by augmenting the weight vector by one component. Thus, we let the threshold be its zeroth component, so that it becomes Wi = αi (-βi, -ni1,...,-nin)t. Correspondingly, an augmented input vector X is defined as (1, x1,...,xn)t. The activation function for each of these neurons is f: f(x) = x if x ≥ 0 and 0 if x < 0. The output vector of the first layer is therefore [α1dist(x,h1),..., αmdist(x,hm)]t.
This output vector serves as input for the WTA subnetwork. If x is already a solution, the output of this subnetwork is z = 0 and y = [1,0,...,0]t, and if it is not, it is z = αkdist(x, πk) and the vector y = [0,...,1,...,0]t, where the 1 is at the k'th position.
The WTA subnetwork is followed by a layer of m neurons, with zero threshold, and multiplicatively arranged input connections to the input and output ports of the WTA subnetwork. These connections are such that the i'th neuron of this layer has the input OiIi, where Oi is the i'th output of the WTA network and Ii is its i'th input. How to realize multiplicative synaptic arrangements has been discussed by G.E. Hinton [11] and others (see, for example, Section 9.6 of Ref. [2]). The activation function of the i'th neuron is fi: fi(x) = (λi/αi) x + ρi if x > 0 and 0 if x ≤ 0. The output vector of this layer is then the zero vector if the point x input to the first layer of the network is a solution, and the vector [0,...,0, λk dist(x, πk) + ρk, 0,...,0]t if it is not, the factor 1/αk in fk cancelling the factor αk present in its input.
The last layer is made up of n neurons, each with zero threshold and activation function fl: fl(x) = x. There is a connection from the i'th neuron of the previous layer to the j'th one of this layer, with weight nij, where nij is the j'th component of the unit vector ni, normal to the hyperplane πi. This j'th neuron is also fed, with weight one, the j'th component of the input vector x to the first layer of the network. The output vector of this last layer is then T(hk) x.
If one unit of time is required for the data to go through each layer of the network, 5 units of time will be required for it to perform one step of the algorithm. The network has (m² + 5m + 2n)/2 neurons. A solution to the system of inequalities is obtained when the output vector of the network is identical to its input vector x.
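The passage of x through this network can be simulated as follows. This is a minimal sketch under stated assumptions: the inequalities are written as A x + b ≥ 0 (so that ni = ai/|ai| and βi = bi/|ai|), NumPy is used, the names `max_distance_step` and `solve` are ours, and the two WTA layers are replaced by an `argmax`, which also returns the first maximal position.

```python
import numpy as np

def max_distance_step(x, A, b, lam=1.5, rho=0.25, residual=False):
    """One pass through the network for the inequalities A @ x + b >= 0."""
    norms = np.linalg.norm(A, axis=1)
    N = A / norms[:, None]          # unit normals n_i (one per row)
    beta = b / norms                # normalized offsets beta_i
    alpha = norms if residual else np.ones(len(b))
    # First layer: rectified outputs alpha_i * dist(x, h_i).
    d = np.maximum(0.0, -alpha * (N @ x + beta))
    # WTA subnetwork: 1 at the first position of the largest output.
    y = np.zeros(len(b))
    y[int(np.argmax(d))] = 1.0
    # Multiplicative layer: input O_i * I_i, then (lam / alpha_i) * u + rho if u > 0.
    u = y * d
    s = np.where(u > 0, lam * u / alpha + rho, 0.0)
    # Last layer: x plus the selected step along n_k.
    return x + N.T @ s

def solve(x0, A, b, max_steps=100, **params):
    """Feed the output back as input until it equals the input."""
    x = np.asarray(x0, dtype=float)
    for step in range(max_steps):
        x_new = max_distance_step(x, A, b, **params)
        if np.array_equal(x_new, x):
            return x, step          # number of steps that moved x
        x = x_new
    return x, None

# Sample problem of the test section: x1 - x2 >= 1 and -x1 + 5 x2 >= 5.
A = np.array([[1.0, -1.0], [-1.0, 5.0]])
b = np.array([-1.0, -5.0])
print(solve([0.0, 0.0], A, b))
```

With `residual=True` the first layer is scaled by αi = |ai|, so that the WTA subnetwork selects the largest residual instead of the largest distance, while the step taken remains λk dist(x, πk) + ρk.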
4.2 Systematic Projection Algorithm
The network for this algorithm is composed of as many subnetworks of the kind illustrated in Figure 3 as there are inequalities to satisfy. These subnetworks are chained together to perform a full cycle of the algorithm.
Here is how the i'th subnetwork functions. Its first layer contains one neuron, with the same weights and threshold as the i'th neuron in the first layer of our maximal distance algorithm network. Its activation function fi is, however, different: fi(x) = λi x + ρi if x > 0 and fi(x) = 0 if x ≤ 0.
The last layer of this subnetwork is similar to that of the maximal distance algorithm network. The connection from the single neuron of the previous layer to the j'th one of this layer has weight nij, where nij is the j'th component of the unit vector ni, normal to the hyperplane πi. This j'th neuron is also fed, with weight one, the j'th component of the input vector x for the first layer of this i'th subnetwork. The output vector of this last layer is therefore T(hi) x.
Two units of time are required to perform one step of the algorithm, i.e. for the data to go through one subnetwork, which contains (n+1) neurons. Since m such subnetworks, connected in series, are required for a whole cycle through all the inequalities, 2m units of time and m(n+1) neurons will be required for one such cycle. A solution is obtained when the output vector, at the end of the chain, is identical to the input vector x, at its beginning.
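The chain of subnetworks can be sketched as below, again assuming inequalities of the form A x + b ≥ 0 and using NumPy; the function names are illustrative. Each call to `subnetwork` stands for the two layers of Figure 3, and `solve` counts full cycles, including the final one that leaves x unchanged.

```python
import numpy as np

def subnetwork(x, n_i, beta_i, lam, rho):
    """i'th subnetwork: returns T(h_i) x for the half-space n_i . x + beta_i >= 0."""
    d = -(n_i @ x + beta_i)                  # positive only if x violates h_i
    s = lam * d + rho if d > 0 else 0.0      # first layer: one neuron with f_i
    return x + s * n_i                       # last layer: n linear neurons

def systematic_cycle(x, A, b, lam=1.0, rho=0.25):
    """One full cycle: the m subnetworks connected in series."""
    norms = np.linalg.norm(A, axis=1)
    for a_i, b_i, norm_i in zip(A, b, norms):
        x = subnetwork(x, a_i / norm_i, b_i / norm_i, lam, rho)
    return x

def solve(x0, A, b, max_cycles=100, **params):
    """Repeat cycles until the output of the chain equals its input."""
    x = np.asarray(x0, dtype=float)
    for cycle in range(1, max_cycles + 1):
        x_new = systematic_cycle(x, A, b, **params)
        if np.array_equal(x_new, x):
            return x, cycle                  # cycles run, including the last check
        x = x_new
    return x, None

# Sample problem of the test section: x1 - x2 >= 1 and -x1 + 5 x2 >= 5.
A = np.array([[1.0, -1.0], [-1.0, 5.0]])
b = np.array([-1.0, -5.0])
x, cycles = solve([0.0, 0.0], A, b)
print(x, cycles, "cycles =", cycles * len(b), "subnetwork passes")
```

Counting one step per subnetwork, a run of c cycles corresponds to c·m steps and 2·c·m units of time.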
4.3 Simultaneous Projection Algorithm
The basic structure of the network for this algorithm, shown in Figure 4, can be recognized in each of the analogue optimization networks discussed in Section 3.
Its first layer has m neurons, the i'th one of which is identical to that of the first layer of the i'th subnetwork for the systematic algorithm, with its activation function multiplied by γi.
The last layer of this network is identical to that of the maximal distance algorithm network, and it is connected in the same way to its preceding layer and the input x for the whole network. Its output vector is Tx.
Each step of the algorithm is performed in two units of time. The network has (m+n) neurons. A solution is obtained when its output vector is identical to its input vector x.
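A short simulation of this two-layer network is given below as a sketch. It assumes inequalities of the form A x + b ≥ 0, NumPy, and equal weights γi = 1/m when none are supplied; the function names and default parameter values are illustrative only.

```python
import numpy as np

def simultaneous_step(x, A, b, lam=2.0, rho=2.5, gamma=None):
    """One pass through the network: returns T x for the inequalities A @ x + b >= 0."""
    norms = np.linalg.norm(A, axis=1)
    N = A / norms[:, None]                   # unit normals n_i (one per row)
    beta = b / norms
    m = len(b)
    gamma = np.full(m, 1.0 / m) if gamma is None else np.asarray(gamma, dtype=float)
    # First layer: gamma_i * (lam * dist + rho) for each violated inequality.
    d = -(N @ x + beta)
    s = np.where(d > 0, gamma * (lam * d + rho), 0.0)
    # Last layer: x plus the weighted sum of the steps along the normals.
    return x + N.T @ s

def solve(x0, A, b, max_steps=100, **params):
    """Feed the output back as input until it equals the input."""
    x = np.asarray(x0, dtype=float)
    for step in range(max_steps):
        x_new = simultaneous_step(x, A, b, **params)
        if np.array_equal(x_new, x):
            return x, step                   # number of steps that moved x
        x = x_new
    return x, None

# Sample problem of the test section: x1 - x2 >= 1 and -x1 + 5 x2 >= 5.
A = np.array([[1.0, -1.0], [-1.0, 5.0]])
b = np.array([-1.0, -5.0])
print(solve([0.0, 0.0], A, b))
```

All m first-layer neurons act in parallel here, which is why one step costs only two units of time whatever the number of inequalities.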
Another set of networks implementing the same algorithms is obtained by interchanging, in the networks of Section 4, the way in which the coordinates X and the inequality parameters Wi are treated. Thus, the weights of the neurons of the first layer of the networks would now all be set to X, and the i'th neuron of the first layer would receive the vector Wi as input. This interchange leaves its total input unchanged. When these neurons are allowed to evolve according to Hebb's rule, a solution to the system of inequalities is obtained as the final value of their weights.
More precisely, consider a neuron with (n+1)-dimensional weight vector X = (1, x1,...,xn)t, and activation function fi: fi(x) = λi x + ρi if x > 0 and fi(x) = 0 if x ≤ 0.
When the vector Wi is presented to it as input, its output fi(Wi·X) will be 0 if x ∈ hi, and λi dist(x, πi) + ρi if x ∉ hi. The first component of its weight vector is then kept at the constant value 1, and its other weights are made to change according to the Hebbian learning rule: x → x + fi(Wi·X) ni. Thus, this neuron implements the action of the operator T(hi) on the vector x.
1) Systematic and General Recurrent Algorithm. A single neuron, as described above, can perform the systematic and the general recurrent versions of the single-plane algorithm, with λi = λ and ρi = ρ for all i's. For this, it suffices to present to it the inputs Wi, in the order specified by these algorithms.
In order to allow for different parameters λi and ρi, it would be necessary to use m neurons, each one with a different value of these parameters in its activation function. The exemplar Wi would then be presented only to the i'th neuron, the output of which would provide the weight correction for all m neurons.
2) Maximal Distance and Maximal Residual Algorithms. The interchange of the roles of X and Wi, as described above, is made for the neurons of the first layer of the network described in Section 4.1. To execute one step, all the Wi's are presented simultaneously as inputs (Wi being the input for the i'th neuron). If the direct connections between the input to the network and the last layer are removed, the output of the last layer will be the zero vector if x is a solution, and the vector [λk dist(x, πk) + ρk] nk if not. This output is the correction to be added to the x part of the weight vector for each neuron of the first layer. A solution is recognized as such when the output of the network is zero.
3) Simultaneous Algorithm. The same modifications as those made to the maximal distance network should be made to the network described in Section 4.3. The network would then perform one step of the algorithm, by weight correction for the neurons of the first layer, exactly as described above for the maximal distance algorithm network.
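The single-neuron scheme of item 1) can be illustrated as follows. This is only a sketch: it assumes inequalities of the form A x + b ≥ 0, exemplars Wi = (-βi, -ni) with αi = 1, and a cyclic order of presentation; the class `HebbianNeuron` and the helpers `exemplars` and `train` are hypothetical names introduced for the illustration.

```python
import numpy as np

def exemplars(A, b):
    """Augmented inputs W_i = (-beta_i, -n_i1, ..., -n_in), one per row."""
    norms = np.linalg.norm(A, axis=1)
    return -np.hstack([(b / norms)[:, None], A / norms[:, None]])

class HebbianNeuron:
    """Neuron whose weight vector X = (1, x1, ..., xn) holds the current point."""

    def __init__(self, x0, lam=1.0, rho=0.25):
        self.X = np.concatenate(([1.0], np.asarray(x0, dtype=float)))
        self.lam, self.rho = lam, rho

    def present(self, W):
        u = W @ self.X                                   # equals dist(x, pi_i) when x is outside h_i
        out = self.lam * u + self.rho if u > 0 else 0.0  # activation f
        self.X[1:] += out * (-W[1:])                     # Hebbian correction along n_i = -W[1:]
        return out                                       # X[0] is never modified

def train(neuron, W, max_cycles=100):
    """Present the exemplars cyclically until a whole cycle produces no output."""
    for cycle in range(1, max_cycles + 1):
        outputs = [neuron.present(w) for w in W]
        if not any(outputs):
            return cycle
    return None

# Sample problem of the test section: x1 - x2 >= 1 and -x1 + 5 x2 >= 5.
A = np.array([[1.0, -1.0], [-1.0, 5.0]])
b = np.array([-1.0, -5.0])
neuron = HebbianNeuron([0.0, 0.0])
print(train(neuron, exemplars(A, b)), neuron.X[1:])
```

The sequence of weight vectors produced in this way is the same as the sequence of points produced by the fixed-weight systematic projection network run with the same λ and ρ.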
We have simulated the digital neural networks implementing the maximal distance, the systematic and the simultaneous relaxation-projection algorithms. In a first series of tests, they were used to solve 15 small feasibility problems (most of these problems are optimization problems from Ref. [22], in which we have set the cost function to zero). Upon characterizing a problem, with n variables and m inequalities, by the pair (n, m), the problems solved can be described as two of each of the types (2, 4), (3, 6) and (4, 7), four of the type (5, 9) and one of each of the types (3, 7), (4, 8), (5, 3), (5, 8) and (6, 16). For each algorithm, the same step size parameters λi and ρi were used for all hyperplanes. Values of λ = 0.5k, with k=0,...,6 and ρ ∈ {0, 0.25, 0.5, 1} were tried. For the simultaneous projection algorithm, the additional values of λ with k = 7,...,20 and ρ = 0.5s, with s=3,...,11, were also tried. Note that whenever this algorithm was used, its parameters γi, for i=1,...,m, were all taken to be 1/m, where m is the number of inequalities. Table 1 reports the total number of steps and the total number of units of time each network required to solve all of these problems, when the best values for λ and ρ were used. Notice that for the simultaneous relaxation-projection algorithm, the best results were obtained for values of λ well outside the bounds given in Theorems 2 and 5. These results, and those with λ = 2, the upper "safety" bound, appear in Table 1.
TABLE 1: The values of the parameters λ and ρ for which the three networks took the least overall time to solve 15 small feasibility problems. [Table body not recovered; surviving entries: "Max. Distance", "(> safe bounds)".]
As this table indicates, all the networks, given appropriate step size parameters, solved the 15 problems in a finite number of steps. In terms of the number of steps, the best performances of the maximal distance and of the simultaneous projection algorithms are comparable, and are much better than that of the systematic projection algorithm. In terms of the time required, however, the simultaneous projection network is faster because it has fewer layers. Nonetheless, if λ had to be taken within the safe range λ ≤ 2, the times required by the two would be comparable (595 vs 554 time units). And if we had allowed ourselves the single-layer WTA network, the maximal distance network would have performed best, with 476 vs 554 time units.
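The last figure can be checked by a short calculation, assuming that the single-layer WTA network removes one layer from the maximal distance network, so that each of its steps takes 4 units of time instead of 5:

$$\frac{595}{5} = 119 \ \text{steps}, \qquad 119 \times 4 = 476 \ \text{units of time}.$$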
Figures 5 to 7 are graphs showing the values taken, at each step of the solution, by the two variables x1 and x2, when the following sample problem is solved by the different networks.
Find x1 and x2 such that:
x1 - x2 ≥ 1
-x1 + 5x2 ≥ 5.
The behavior seen is representative of the general one observed with the different networks. The maximal distance and the simultaneous projection networks are seen to be of somewhat similar effectiveness in terms of the number of steps. However, the trajectories produced by the first network oscillate more, in general, as should be expected from the fact that the simultaneous projection algorithm involves an average of the directions toward all violated constraints hyperplanes. The systematic projection algorithm is seen to converge most slowly. All networks started from the point (0, 0)t. The maximal distance algorithm network reached the solution (3.371, 1.864)t after 6 steps, taking 36 units of time. The systematic algorithm network reached the solution (2.976, 1.623)t after 16 steps, taking 32 time units. The simultaneous algorithm network reached the solution (4.961, 2.193)t after 8 steps, taking 16 units of time.
Table 2 shows the number of neurons each network requires for solving a type (20, 35) problem. The systematic projection and the maximal distance networks are seen to be, by far, the most costly in terms of the number of neurons. This same table also shows the number of steps and of units of time required for the solution, with the best values for the parameters, as determined in the tests with the 15 small problems, as well as with those values, among those mentioned above, which yielded the best solution time for this (20, 35) problem alone. The mention "Ended" in the table indicates that the network was stopped, the algorithm having run for 100 steps without producing a solution. The systematic projection network proved the least efficient, exceeding the 100-step limit for most values of the parameters. The performances of the maximal distance algorithm and the simultaneous projection algorithm are comparable in the number of steps, when λ is in the "safe" range. The latter algorithm is, however, definitely superior in terms of the number of units of time required. The best performance, obtained with the simultaneous projection network, is remarkable in that the solution is obtained in a single step.
TABLE 2: Values of the network parameters and the corresponding times taken by the three networks to solve a 20 variables, 35 inequalities problem. [Table body not recovered; surviving entries: "Max. Distance", "Ended Ended".]
We have shown that the solution method used by the best known analogue optimization networks is a continuous time version of the simultaneous relaxation-projection algorithm. As for the Tank-Hopfield network, however, the input resistances for the variable amplifiers make its behavior deviate slightly from that of this algorithm. By solving exactly its equation of motion, we have demonstrated that this additional term has a negative effect, in that it prevents a feasible solution from being reached.
We have produced neural networks that implement each of the relaxation-projection algorithms. For the fixed weights implementation, the numbers of neurons required to solve a problem with n variables and m inequalities are (m² + 5m + 2n)/2 for the maximal distance version, m(n+1) for the systematic projection version, and (m+n) for the simultaneous projection version. These numbers clearly show that, among these three versions, the last one is the most economical in terms of neurons used, its number of neurons increasing only linearly with the problem parameters. The variable weights networks have the same basic structure and the same efficiency as the above ones. However, as mentioned in Section 5, a single neuron with the Hebbian learning capacity suffices to perform the systematic and any recurrent single-plane algorithm. Although we have found these algorithms to be generally less efficient than the other ones, this fact definitely renders them worthy of consideration for certain applications where speed of solution is not a critical factor.
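As a small worked illustration of these formulas, the sketch below evaluates them for the (20, 35) problem of Section 6. The function name `neuron_counts` is ours, and the layer-by-layer breakdown given in the comment is our own reading of the description of Section 4.1.

```python
def neuron_counts(n, m):
    """Neurons needed by each fixed-weight network for n variables, m inequalities."""
    return {
        # m (first layer) + m(m+1)/2 (two-layer WTA) + m (multiplicative layer) + n (last layer)
        "maximal distance": (m * m + 5 * m + 2 * n) // 2,
        # m subnetworks of (n + 1) neurons each
        "systematic projection": m * (n + 1),
        # m first-layer neurons + n last-layer neurons
        "simultaneous projection": m + n,
    }

print(neuron_counts(20, 35))
# {'maximal distance': 720, 'systematic projection': 735, 'simultaneous projection': 55}
```

The integer division is exact, since m² + 5m = m(m + 5) is always even.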
The results of the preliminary tests we have conducted with these networks have been discussed to a certain extent in Section 6. We sum them up as follows. For the sample problems solved, the maximal distance and the simultaneous projection algorithms required comparable numbers of steps, always far fewer than the systematic projection algorithm. In terms of the number of units of time used, the simultaneous projection network appears superior to the maximal distance network. This comes from the fact that the latter has more layers than the former.
The simultaneous projection algorithm furthermore provides its user with the important unique advantage of guaranteeing to minimize the objective function, even when the system of inequalities has no solution (see Theorem 5).
For the single plane methods, we have found that good values, among those tried, for the step size parameters λ and ρ are λ ≈ 1.5 and ρ ≈ 0.25. This is consistent with the convergence theorems mentioned. For the simultaneous projection algorithm, although good results were found with λ ≈ 2, the best results were consistently obtained for larger λ's as well as for rather large ρ's, between 1 and 2.5. This fact can very well be interpreted as an indication that the sufficient conditions in the convergence theorems are not really necessary, and that the theoretical results need to be refined.
We note that, when both step size parameters λ and ρ are non-zero, the convergence should generally be better than when one of the two is zero. Indeed, when the point xν is far from the polytope, the distance dependence of the step size ensures that the points of the sequence approach the polytope at a faster pace than if the steps were of constant lengths. On the other hand, as the points get close to the polytope and the distance term in the step size becomes small, the constant term takes over and ensures that the points of the sequence keep on moving toward the polytope at a minimum, non-infinitesimal, rate, so that it is reached in a finite number of steps.
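The role of the constant term can be seen on a one-dimensional example. The sketch below is purely illustrative and is not one of the test problems: it assumes the single inequality x ≥ 1, the starting point x = 0, and exact rational arithmetic, so that no rounding can make the sequence terminate artificially.

```python
from fractions import Fraction

def steps_to_reach(lam, rho, x0=Fraction(0), bound=Fraction(1), max_steps=50):
    """Relax the single inequality x >= bound on the real line, in exact arithmetic.

    Returns the number of steps taken before x satisfies the inequality,
    or None if it is still violated after max_steps steps.
    """
    x = x0
    for step in range(max_steps + 1):
        d = bound - x            # distance to the half-space when positive
        if d <= 0:
            return step
        x += lam * d + rho       # the unit normal is +1 in one dimension
    return None

print(steps_to_reach(Fraction(1, 2), Fraction(1, 4)))  # 2
print(steps_to_reach(Fraction(1, 2), 0))               # None
```

With λ = 1/2 and ρ = 0, the distance is halved at every step and never vanishes, whereas with ρ = 1/4 the second step already carries the point into the half-space.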
For computing solutions, it suffices, of course, to know that the iteration sequence converges; the calculations can then always be stopped when a certain precision criterion is satisfied. This will always happen after a finite number of steps, even though the exact sequence {xν} may actually be infinite. Nevertheless, it still remains a very important property for an iteration sequence to exactly terminate in a finite number of steps. Indeed, this generally means that its limit point is inside the polytope Ω, while for infinite sequences, it is necessarily on its surface. Interior point solutions are more stable and more robust because they are completely surrounded by a whole neighborhood of other solutions. On the other hand, surface limit points have neighbors both in Ω and outside of it; so that they can easily cease to be solutions under small perturbations of the parameters of the problem, as when the coefficients of the inequalities are slightly modified. For example, this is the kind of stability that leads to a better ability of neural networks to generalize to new inputs the knowledge they have accumulated during their training.
Given the fact that all the networks we described can be realized with very inexpensive computing elements, it would be practical to further improve on the solution time by having many copies of the networks work simultaneously on the same problem, some using different values of the step size parameters, some even with λ > 2, and some using different starting points x0.
It is certainly worthwhile to conduct other tests, with more complex and larger sample problems, in order to see whether the results we observed persist.
We believe that the study reported in the present article is important for the theory of optimization neural networks, as well as for feasibility networks, since after all, the latter networks are always an essential part of the former.
The author is grateful to the reviewers for their constructive comments and suggestions for improving this manuscript.
Fig. 1: Winner-Takes-All network with 4 inputs. Full lines have weight 1, dashed lines -1. Activation functions are a sign function for the first layer and a step function, with threshold of 5/2, for the second layer. Data transits from left to right. The outputs yi are 0 except at the position of the maximum input xi.
Fig. 2: Artificial neural network to perform the maximal distance and maximal residual relaxation-projection algorithms.
Fig. 3: The i'th subnetwork of the chain that constitutes the artificial neural network to perform the systematic relaxation-projection algorithm.
Fig. 4: Artificial neural network to perform the simultaneous relaxation-projection algorithm.
Fig. 5: Trajectories (value vs step number) of the variables x1 and x2, produced by the maximal distance relaxation-projection network, for a sample problem, with λ = 1.5 and ρ = 0.25. x2 is the variable that increases at the start.
Fig. 6: Trajectories (value vs step number) of the variables x1 and x2, produced by the systematic relaxation-projection network, for a sample problem, with λ = 1 and ρ = 0.25. x1 is the variable that increases at the start.
Fig. 7: Trajectories (value vs step number) of the variables x1 and x2, produced by the simultaneous relaxation-projection network, for a sample problem, with λ = 2 and ρ = 2.5. x1 is the variable that increases most at the start.