Evolutionary Strategy for Learning Multiple-Valued Logic Functions

Alioune Ngom *
Computer Science Department, 5115 Lambton Tower
University of Windsor, 401 Sunset Avenue
N9B 3P4 Windsor, Ontario, Canada

Dan A. Simovici
Department of Mathematics and Computer Science
University of Massachusetts at Boston
02125 Boston, Massachusetts, USA

Ivan Stojmenović †
Computer Science Department, School of Information Technology and Engineering
University of Ottawa, K1N 6N5 Ottawa, Ontario, Canada
Abstract

We consider the problem of synthesizing multiple-valued logic functions by neural networks. An evolutionary strategy (ES) which finds the longest strip in $V$ is described. A strip contains points located between two parallel hyperplanes. Repeated application of ES partitions the space $V$ into a certain number of strips, each of them corresponding to a hidden unit. We construct neural networks based on these hidden units. Preliminary experimental results are presented and discussed.
Keywords: Multiple-valued logic, Multiple-threshold perceptron, Evolution strategy, Neural network, Partitioning method, Constructive algorithm.
1. Introduction

In this paper we propose to synthesize multiple-valued logic functions by minimal multilayer feedforward neural networks. There are various measures that can be used in constructing multiple-valued neural networks. The most important measures are depth (number of layers) and size (number of processing units). The depth is related to the speed of computing a function, whereas the size determines the hardware cost. In this paper, we use multiple-valued multiple-threshold perceptrons as the basic processing elements (i.e. nodes) of the network. We apply the strip-based neural network growth algorithm of Ngom et al. [9] to realize multiple-valued logic functions by minimal networks. The technique in [9] applies a genetic algorithm to find small networks for given arbitrary functions. Our strip-based method, however, uses an evolutionary strategy to construct minimal networks.

Let $K = \{0, 1, \ldots, k-1\}$ with $k \ge 2$. A $k$-valued logic function $f$ maps the Cartesian product $K^n$ into $K$. Denote by $P_k^n$ the set of all such functions $f : K^n \to K$. The set $P_k$ defined by

$$P_k = \bigcup_{n \ge 1} P_k^n$$
is the set of all $k$-valued logic functions. For instance, $P_2$ is the set of all two-valued logic functions.

An $n$-input $k$-valued $s$-threshold perceptron [8], abbreviated as $(n,k,s)$-perceptron, computes an $n$-input $k$-valued $s$-threshold weighted function $f_{\mathbf{w},\mathbf{t}}(\mathbf{x})$ given by

$$f_{\mathbf{w},\mathbf{t}}(\mathbf{x}) = \begin{cases}
o_0 & \text{if } \mathbf{w}\cdot\mathbf{x} < t_1 \\
o_i & \text{if } t_i \le \mathbf{w}\cdot\mathbf{x} < t_{i+1}, \quad 1 \le i \le s-1 \\
o_s & \text{if } t_s \le \mathbf{w}\cdot\mathbf{x}
\end{cases} \tag{1}$$

where $\mathbf{o} = (o_0, \ldots, o_s) \in K^{s+1}$ is the output vector; $\mathbf{t} = (t_1, \ldots, t_s) \in \mathbb{R}^s$ is the threshold vector, with $t_1 \le \cdots \le t_s$, where $s$ is the number of threshold values; $\mathbf{x} = (x_1, \ldots, x_n) \in K^n$ is the input vector; $\mathbf{w} = (w_1, \ldots, w_n) \in \mathbb{R}^n$ is the weight vector; and $\mathbf{w}\cdot\mathbf{x}$ is the dot product of $\mathbf{w}$ and $\mathbf{x}$. The perceptron's transfer function is a $k$-valued $s$-threshold function $f_{\mathbf{w},\mathbf{t}}$. An $(n,k,s)$-perceptron partitions the input space $K^n$ into $s+1$ disjoint classes using $s$ parallel hyperplanes (we assume $o_i \ne o_{i+1}$ and $t_i < t_{i+1}$). Each hyperplane, denoted by $H_i$ ($1 \le i \le s$), is of the form

$$H_i : \mathbf{w}\cdot\mathbf{x} = t_i. \tag{2}$$
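As a concrete illustration (not from the original paper), the following Python sketch implements the transfer function of equation (1); the function name and the sample parameters are our own.

```python
import numpy as np

def perceptron_output(w, t, o, x):
    """(n,k,s)-perceptron transfer function of equation (1).

    w : weight vector (length n)
    t : non-decreasing threshold vector (t_1, ..., t_s)
    o : output vector (o_0, ..., o_s), entries in K = {0, ..., k-1}
    x : input vector in K^n
    """
    dot = float(np.dot(w, x))
    # The number of thresholds t_i with t_i <= w.x selects the class index i.
    i = int(np.searchsorted(t, dot, side="right"))
    return o[i]

# Example: a 2-input 3-valued 2-threshold perceptron.
w = np.array([1.0, 2.0])
t = np.array([2.0, 4.0])   # two parallel hyperplanes w.x = 2 and w.x = 4
o = np.array([0, 2, 1])    # class outputs o_0, o_1, o_2
print(perceptron_output(w, t, o, np.array([1, 1])))  # w.x = 3, so output o_1 = 2
```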
Our model of a multiple-valued logic neuron is the $(n,k,s)$-perceptron defined above. The first model of multiple-valued logic neurons (and neural networks) was introduced by Chan [2], and since then various other models have been described (see, for instance, [7]). The problem we address in this paper is that of learning multiple-valued logic functions using minimal neural networks composed of $(n,k,s)$-perceptrons. The problem of deciding whether or not a given task can be performed by a given architecture is known to be NP-complete [4]. Also, it has been shown in [1] that the problem of finding the absolute minimal architecture for a given task is NP-hard.
* Research supported by NSERC grant RGPIN22811700 and by the University of Windsor's Startup Fund.
† Also with DISCA, IIMAS, UNAM, Circuito Escolar s/n, Coyoacán, México D.F. 04510, Mexico. Research supported by a REDII grant.
Figure 1. Example of a longest strip (the illustration, a grid of sample function values over two variables, is not reproduced here).

Procedure STRIP-BasedSynthesis:
    j := 0;
    Repeat
        j := j + 1;
        Apply ES to find a subset S'_j of V_j s.t. |S'_j| is maximal with respect to f;
        Create a new hidden unit j;
        V_{j+1} := V_j \ S'_j;
    Until V_{j+1} = empty set;
    Construct a network with j hidden units on the first layer;

Figure 2. STRIP-based synthesis algorithm.
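To make the growth loop concrete, here is a minimal Python sketch of the procedure in Figure 2, assuming a helper find_strip (for instance the ES of section 3) that returns a maximal subset together with the parameters of the corresponding hidden unit; all names are ours.

```python
def strip_based_synthesis(points, f, find_strip):
    """Sketch of the STRIP-based synthesis loop of Figure 2.

    points     : list of input tuples (the space V)
    f          : lookup table mapping each point to its k-valued output
    find_strip : optimizer returning, for the current point set, a maximal
                 subset S' and the parameters of its hidden unit
    """
    remaining = list(points)
    hidden_units = []
    while remaining:
        strip, unit_params = find_strip(remaining, f)
        hidden_units.append(unit_params)   # one (n,k,s)-perceptron per strip
        remaining = [x for x in remaining if x not in strip]
    # First-layer units only; the rest of the network is built as in section 4.
    return hidden_units
```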
Our approach to the problem's solution is discussed in section 2. The learning method is based on the general principle of partitioning algorithms discussed in [9]. A partitioning algorithm seeks to construct a minimal network by partitioning the input space into classes that are as large as possible. Each class of the partition is then assigned to a new hidden unit. The connections and weights of the new units are determined in such a way that the constructed network always gives the correct answer for any input. Distinct partitioning algorithms differ in the way the input space is partitioned. Also, network topologies obtained from different partitioning algorithms may differ in the way new hidden units are connected.

In [9], a minimal neural network is obtained by assigning nodes to optimal subsets (called strips) of a function's space and combining those nodes in such a way that the given function is synthesized. A genetic algorithm (GA) is used to find the optimal strips of a function. Although the GA is a powerful optimization method, it is slow compared to other optimization techniques. In this paper, we use the computationally faster evolutionary strategy in place of the genetic algorithm to obtain the function's strips.

2. Longest strip based growth algorithm

A strip is a set of points between two parallel hyperplanes which have the same value. The longest strip is the strip with the maximum possible cardinality. A maximum separable subset is a set of points having equal values, with the maximum possible cardinality, that can be separated from all other points by exactly one hyperplane. Examples of a longest strip and a maximum separable subset are shown in Figure 1. The original STRIP algorithm of [9] uses a genetic algorithm (GA) to determine the longest strip or the maximum separable subset of the current set of training examples. Here, we use an evolution strategy (ES) in place of the genetic algorithm (see Figure 2). The underlying growth algorithm of STRIP constructs a network by removal of points in a predefined objective subset $S \subseteq V$; $S$ is either the longest strip or the maximum separable subset in $V$. The goal is then, using ES, to obtain a subset $S'$ such that $|S'|$ is as close as possible to $|S|$, if not equal to it (of course we have $|S'| \le |S|$). In the algorithm, ES finds subsequent halfspaces delimited by either one or two hyperplanes (depending on the predefined objective subset $S$). Each halfspace is assigned to a hidden unit that correctly classifies all elements of it. Once the first hidden layer is complete, the remaining weights, layers and units of the network are determined to complete the network construction (the details of the network architecture are described in section 4).

3. Determining longest strips by evolutionary strategy

Evolutionary strategies have been proposed by Schwefel [11] as optimization methods for real-valued parameters. ES manipulates a single potential solution to a problem. Specifically, it operates on an encoded representation of the solution, and not directly on the solution itself. Schwefel's ES encodes a solution as a real-valued vector; we will refer to such a vector as a chromosome. The solution is associated with an objective value that reflects how good or bad it is compared with other potential solutions in the space. ES is a randomly guided hill-climbing technique in which a good candidate solution is obtained by applying a certain number of small mutations to a given parent solution. The best result of mutation is used again to generate the next best solution, and so on until some convergence criterion is satisfied.
3.1. Problem representation

Our ES method uses the same solution representation as [9]. That is, a potential solution (a subset $S'$ of $V$) is represented as a weight vector $\mathbf{w}$ (such a vector can be decoded to obtain $S'$). More formally, a potential solution is a subset $S' \subseteq S$, and the best solution is one whose size is closest to $|S|$ (if not equal to $|S|$). Given a weight vector $\mathbf{w} \in \mathbb{R}^n$, we can find the unique strip (or separable subset) of maximum cardinality that is associated with $\mathbf{w}$ (see section 3.2). Each chromosome $\mathbf{w}$ uniquely determines a partition of $V$ into classes separated by parallel hyperplanes (for some number of thresholds $s$), and the best chromosome is the one that maximizes the number of points between a pair of parallel hyperplanes. To determine how good a solution is, the ES needs an objective function to evaluate each chromosome $\mathbf{w}$.
3.2. Fitness function

The objective function, the function to optimize, provides the mechanism for evaluating each chromosome. Let $V_j$ be the current set of points; initially, $V_1 = V$. To compute the longest strip generated by $\mathbf{w}$, we calculate for every $\mathbf{x} \in V_j$ the value $\mathbf{w}\cdot\mathbf{x}$ and construct a sorted list of records of the form $(\mathbf{w}\cdot\mathbf{x}, f(\mathbf{x}))$. The array is sorted using $\mathbf{w}\cdot\mathbf{x}$ as primary key and $f(\mathbf{x})$ as secondary key. Let these records be sorted as $\mathbf{w}\cdot\mathbf{x}_1 \le \mathbf{w}\cdot\mathbf{x}_2 \le \cdots \le \mathbf{w}\cdot\mathbf{x}_m$, where $m = |V_j|$. A strip in the sorted list is a sequence $\mathbf{x}_a, \ldots, \mathbf{x}_b$ such that

1. $f(\mathbf{x}_a) = f(\mathbf{x}_{a+1}) = \cdots = f(\mathbf{x}_b)$, and
2. the sequence can be delimited from the rest of the list by two hyperplanes, that is, $\mathbf{w}\cdot\mathbf{x}_{a-1} < \mathbf{w}\cdot\mathbf{x}_a$ if $a > 1$, and $\mathbf{w}\cdot\mathbf{x}_b < \mathbf{w}\cdot\mathbf{x}_{b+1}$ if $b < m$.

The length of the strip is $b - a + 1$ and $f(\mathbf{x}_a)$ is the value of the strip. Given a set of points $V_j$ and a function $f$ over $V_j$, let $L_1$ and $L_2$ be, respectively, the leftmost and rightmost strips generated by $\mathbf{w}$, with strip values $v_1$ and $v_2$. We denote by $\lambda(\mathbf{w})$ the length of the longest strip generated by $\mathbf{w}$, and by $\mu(\mathbf{w})$ the length of the longer of the leftmost and rightmost strips, on set $V_j$ and function $f$. To evaluate how good $\mathbf{w}$ is, we propose the following fitness function, with respect to the definition of the objective subset $S$:

$$\mathrm{Fitness}(\mathbf{w}) = \begin{cases} \lambda(\mathbf{w}) & \text{if } S = \text{longest strip} \\ \mu(\mathbf{w}) & \text{if } S = \text{maximum separable subset} \end{cases} \tag{3}$$

Let $V_j^v = \{\mathbf{x} \in V_j \mid f(\mathbf{x}) = v\}$, that is, the set of points of value $v$. An alternative objective is to select a strip $L$ of value $v$ which maximizes $|L| / |V_j^v|$, where $|L|$ denotes the length of $L$. That is, as in [6, 12], the selection criterion chooses the strip that constitutes the largest proportion of a class of points that can be separated. We denote by

$$\lambda_p(\mathbf{w}) = \max_{v} \frac{|L_v(\mathbf{w})|}{|V_j^v|} \tag{4}$$

the largest strip proportion, where $L_v(\mathbf{w})$ is the longest strip of value $v$ generated by $\mathbf{w}$, and by

$$\mu_p(\mathbf{w}) = \max\left\{ \frac{|L_1|}{|V_j^{v_1}|},\ \frac{|L_2|}{|V_j^{v_2}|} \right\} \tag{5}$$

the corresponding proportion restricted to the leftmost and rightmost strips, on set $V_j$ and function $f$. Our alternative fitness function with respect to $S$ is

$$\mathrm{Fitness}(\mathbf{w}) = \begin{cases} \lambda_p(\mathbf{w}) & \text{if } S = \text{largest strip proportion} \\ \mu_p(\mathbf{w}) & \text{if } S = \text{largest separable proportion} \end{cases} \tag{6}$$
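The following sketch (our own naming; $f$ is assumed given as a lookup table) computes the longest-strip term $\lambda(\mathbf{w})$ of equation (3) by the sort just described; for brevity it ignores the corner case of points that share the same value of $\mathbf{w}\cdot\mathbf{x}$ but differ in $f$.

```python
import numpy as np

def longest_strip(points, f, w):
    """Length and value of the longest strip generated by w (section 3.2).

    points : list of input tuples in the current set V_j
    f      : dict mapping a point to its k-valued output
    w      : weight vector (chromosome)
    """
    # Sort by w.x (primary key) and f(x) (secondary key); tuple order does both.
    recs = sorted((float(np.dot(w, x)), f[x]) for x in points)
    best_len, best_val = 0, None
    run_len, run_val = 0, None
    for _, v in recs:
        run_len = run_len + 1 if v == run_val else 1
        run_val = v
        if run_len > best_len:
            best_len, best_val = run_len, run_val
    return best_len, best_val

# Example: 2-input 3-valued function given as a table.
pts = [(x, y) for x in range(3) for y in range(3)]
f = {p: (p[0] + p[1]) % 3 for p in pts}
print(longest_strip(pts, f, np.array([1.0, 1.0])))  # (3, 2): the points with x+y == 2
```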
3.3. Mutation

Chromosomes are subject to random mutations. With some mutation probability, each coordinate of a vector $\mathbf{w}$ is altered according to some mutation operator. We use the three mutation operators of [9]. For a chromosome to be mutated, one of the three mutation operators is selected with equal probability.

3.4. Weight neighborhood

Recall that to find the subset $S'$ for a current $\mathbf{w}$ in a set of points, we sort the points with respect to $\mathbf{w}\cdot\mathbf{x}$ (first key) and $f(\mathbf{x})$ (second key), and find the longest strip or the maximum separable subset in the sorted list. The neighbors of $\mathbf{w}$ are precisely those weight vectors that yield the least change in the ordering of the $\mathbf{w}\cdot\mathbf{x}$'s. For instance, if $\mathbf{w}$ gives the order $\mathbf{w}\cdot\mathbf{x}_1 \le \mathbf{w}\cdot\mathbf{x}_2 \le \mathbf{w}\cdot\mathbf{x}_3$, then a near neighbor $\mathbf{w}'$ of $\mathbf{w}$ may produce the order $\mathbf{w}'\cdot\mathbf{x}_2 \le \mathbf{w}'\cdot\mathbf{x}_1 \le \mathbf{w}'\cdot\mathbf{x}_3$, while a far neighbor of $\mathbf{w}$ may produce the order $\mathbf{w}'\cdot\mathbf{x}_3 \le \mathbf{w}'\cdot\mathbf{x}_2 \le \mathbf{w}'\cdot\mathbf{x}_1$.

Consider the $i$-th coordinate, $w_i$, of $\mathbf{w}$. Let $\mathbf{w}\cdot\mathbf{x}$ and $\mathbf{w}\cdot\mathbf{y}$ be two consecutive elements in the order produced by $\mathbf{w}$, and let $\delta = \mathbf{w}\cdot\mathbf{y} - \mathbf{w}\cdot\mathbf{x}$. For a vector $\mathbf{w}'$ that agrees with $\mathbf{w}$ on all coordinates except the $i$-th, we will have $\mathbf{w}'\cdot\mathbf{y} - \mathbf{w}'\cdot\mathbf{x} = \delta + (w'_i - w_i)(y_i - x_i)$. Vector $\mathbf{w}'$ is a neighbor of $\mathbf{w}$ with respect to coordinate $i$ if $w'_j = w_j$ for $j \ne i$ and $|w'_i - w_i|$ is minimal such that the sorted order is changed; then only the relative order of $\mathbf{w}\cdot\mathbf{x}$ and $\mathbf{w}\cdot\mathbf{y}$ is affected. Setting $\delta + (w'_i - w_i)(y_i - x_i) = 0$ and solving for $w'_i$, we obtain

$$w'_i = w_i - \frac{\delta}{y_i - x_i}. \tag{7}$$

Only differences $\delta$ between distinct consecutive elements in the sorted order are considered. For each coordinate $i$ ($1 \le i \le n$) there is one neighbor only, chosen such that $|\delta / (y_i - x_i)|$ is minimized and $y_i - x_i$ is not equal to zero. Let $\Delta_i$ be this minimal perturbation and denote the nearest neighbor of $\mathbf{w}$ with respect to $i$ by $\mathbf{w}^{(i)}$. Then $\mathbf{w}^{(i)}$ differs from $\mathbf{w}$ only in the $i$-th coordinate, by $\Delta_i$; that is, $w^{(i)}_i = w_i - \Delta_i$ and $w^{(i)}_j = w_j$ for $j \ne i$. In other words, the Euclidean distance between $\mathbf{w}$ and $\mathbf{w}^{(i)}$ is $|\Delta_i|$. Thus there are $n$ neighbors of $\mathbf{w}$, that is, one nearest neighbor per coordinate of $\mathbf{w}$. We denote the set of all such neighbors of $\mathbf{w}$ by $N(\mathbf{w})$; this set is called the neighborhood of $\mathbf{w}$.
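A Python sketch of this neighborhood computation, under our reading of equation (7); all names are ours.

```python
import numpy as np

def neighborhood(points, w):
    """Nearest neighbors N(w) of a chromosome w (section 3.4).

    For each coordinate i, find the smallest perturbation delta/(y_i - x_i)
    over consecutive points x, y in the w-sorted order (equation (7)), and
    return the vectors obtained by applying one such perturbation each.
    """
    order = sorted(points, key=lambda x: float(np.dot(w, x)))
    neighbors = []
    for i in range(len(w)):
        best = None
        for x, y in zip(order, order[1:]):
            denom = y[i] - x[i]
            gap = float(np.dot(w, y)) - float(np.dot(w, x))
            if denom != 0 and gap > 0:  # distinct consecutive elements only
                step = gap / denom
                if best is None or abs(step) < abs(best):
                    best = step
        if best is not None:
            w_i = np.array(w, dtype=float)
            w_i[i] -= best  # reaches the nearest order boundary (a tiny extra
            neighbors.append(w_i)  # epsilon would strictly swap the pair)
    return neighbors
```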
3.5. Evolution strategy
Our evolutionary strategy is shown in Figure 3. In the algorithm, $\mathbf{w}^g$ is the current solution at generation $g$, $\mathbf{w}^{best}$ is the best solution generated so far, $\mathbf{u}$ is the best solution in the neighborhood $N(\mathbf{w}^g) = \{\mathbf{u}_1, \ldots, \mathbf{u}_n\}$, and $\mathbf{w}^{g+1}$ is the next solution to generate.
Procedure EvolutionaryStrategy:
    g := 0; w^0 := random unit vector; w^best := w^0;
    Repeat
        u := best in N(w^g);
        If Fitness(u) > Fitness(w^g) then
            w^{g+1} := u; w^best := u;
        Else
            Success := false; i := 0;
            Repeat    {we search in the N(u_i)'s}
                i := i + 1;
                v := best in N(u_i);
                If Fitness(v) > Fitness(w^g) then
                    w^{g+1} := v; w^best := v; Success := true;
            Until Success = true or i = n;
            If Success = false then    {w^g may be a local optimum}
                With probability 1/2 do either one of
                    1: w^{g+1} := best between w^g and all u_i's;
                    2: With probability 1/4 do either one of
                        2.1: w^{g+1} := random unit vector;    {big jump}
                        2.2: w^{g+1} := random mutation of w^g;
                        2.3: w^{g+1} := random mutation of w^best;
                        2.4: w^{g+1} := random mutation of u_i for a random i;
        g := g + 1;
    Until stopping criterion is true;

Figure 3. Evolution strategy to find S' such that |S'| <= |S|.
We initially start with a random unit vector $\mathbf{w}^0$ and set $\mathbf{w}^{best}$ to $\mathbf{w}^0$. The algorithm works as follows. At generation $g$, we compute the best neighbor $\mathbf{u}$ of $\mathbf{w}^g$, that is, a vector in $N(\mathbf{w}^g)$ that has the highest fitness value. If $\mathbf{u}$ is better than $\mathbf{w}^g$ then we are done for this generation; $\mathbf{w}^{g+1}$ and $\mathbf{w}^{best}$ are both set to $\mathbf{u}$ and we move to the next generation. If $\mathbf{u}$ is worse than $\mathbf{w}^g$ then we must decide how to set $\mathbf{w}^{g+1}$. We first attempt to find a solution better than $\mathbf{w}^g$ in the neighborhoods of the neighbors of $\mathbf{w}^g$. That is, we search successively in each $N(\mathbf{u}_i)$ ($1 \le i \le n$) until such a solution is found or all neighborhoods are tried without success. If a solution is found then we set both $\mathbf{w}^{g+1}$ and $\mathbf{w}^{best}$ to that solution and move on to the next generation. If, otherwise, no such solution is found, then $\mathbf{w}^g$ is possibly a local optimum. We have two choices: with probability $\frac{1}{2}$ we either 1) set $\mathbf{w}^{g+1}$ to the best among $\mathbf{w}^g$ and all the $\mathbf{u}_i$'s and move on to generation $g+1$, or 2) to jump out of a possible local optimum, we set $\mathbf{w}^{g+1}$ to a random unit vector or to a random alteration of $\mathbf{w}^g$, $\mathbf{w}^{best}$, or $\mathbf{u}_i$ for some random $i$. In the second choice, one of the four possibilities is selected with probability $\frac{1}{4}$, and the alteration of a vector is done by choosing any one of the three mutation techniques described in section 3.3 (with equal probability). Let $G$ be the number of generations. At each generation at most $n(n+1)$ new chromosomes are evaluated for their fitness (the $n$ neighbors of $\mathbf{w}^g$ and, in the worst case, the $n$ neighbors of each $\mathbf{u}_i$), hence our ES performs $O(Gn^2)$ fitness evaluations. The GA of [9] performs $O(GP)$ fitness evaluations, where $P$ is the population size of the GA. Since the neighborhood is typically much smaller than a GA population, our ES is much faster than the GA of [9].
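A compact Python sketch of the loop of Figure 3; fitness, neighborhood and mutate are assumed to be the routines sketched in sections 3.2 to 3.4, and all names and the generation limit are ours.

```python
import random
import numpy as np

def evolutionary_strategy(points, f, fitness, neighborhood, mutate, max_gens=200):
    """Sketch of the neighborhood hill-climbing ES of Figure 3."""
    n = len(points[0])
    w = np.random.randn(n); w /= np.linalg.norm(w)   # random unit vector
    best = w.copy()
    for g in range(max_gens):
        nbrs = neighborhood(points, w)
        if not nbrs:                                  # degenerate: no order-changing move
            break
        u = max(nbrs, key=lambda z: fitness(points, f, z))
        if fitness(points, f, u) > fitness(points, f, w):
            w = best = u
            continue
        for ui in nbrs:                               # search the neighbors' neighborhoods
            v = max(neighborhood(points, ui), key=lambda z: fitness(points, f, z))
            if fitness(points, f, v) > fitness(points, f, w):
                w = best = v
                break
        else:                                         # possible local optimum
            if random.random() < 0.5:
                w = max([w] + nbrs, key=lambda z: fitness(points, f, z))
            else:
                choice = random.randrange(4)
                if choice == 0:
                    w = np.random.randn(n); w /= np.linalg.norm(w)   # big jump
                elif choice == 1:
                    w = mutate(w)
                elif choice == 2:
                    w = mutate(best)
                else:
                    w = mutate(random.choice(nbrs))
    return best
```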
4. Constructing the neural network

In the $j$-th iteration of the STRIP-based synthesis algorithm, ES will find a chromosome $\mathbf{w}_j$ which generates a subset $S'_j$; $S'_j$ is the longest strip (or maximum separable subset) found by the search. The optimization ability of ES makes it possible to attain a solution as close (in size) to $S_j$ as possible, if not equal. The ability of ES to produce $S_j$ depends on its many control parameters and on the complexity of the task to learn.

Longest strip based network. At every iteration $j$ of the STRIP-based synthesis algorithm, ES finds a chromosome $\mathbf{w}_j$ which produces the longest strip $S'_j = \{\mathbf{x} \in V_j \mid t_1 \le \mathbf{w}_j\cdot\mathbf{x} \le t_2\}$, with strip value $v_j$. Then we create an $(n,k,2)$-perceptron (hidden unit $j$) whose weight vector is $\mathbf{w}_j$, threshold vector is $(t_1, t_2)$ and output vector is $(0, v_j, 0)$. In other words, the perceptron has a transfer function of the form $f_{\mathbf{w}_j,(t_1,t_2)}$ (that is, a $k$-valued two-threshold function). The $(n,k,2)$-perceptron will output the value $v_j$ for all points of $S'_j$ and will output the value 0 for all other points. In order to achieve a good accuracy on the testing set, that is, a good generalization ability of our algorithm when approximating a function, we set the threshold vector to $(t_1 - \frac{\epsilon_1}{2}, t_2 + \frac{\epsilon_2}{2})$. Thus test points of value $v_j$ which are outside but close to the strip (that is, test points that lie between $t_1 - \frac{\epsilon_1}{2}$ and $t_1$, or between $t_2$ and $t_2 + \frac{\epsilon_2}{2}$) will be correctly classified by unit $j$, since they are now spanned by the enlarged strip. The offsets $\epsilon_1$ and $\epsilon_2$ are the gaps between the strip's delimiting hyperplanes and the nearest training points outside the strip:

$$\epsilon_1 = t_1 - \max\{\mathbf{w}_j\cdot\mathbf{x} \mid \mathbf{x} \in V_j,\ \mathbf{w}_j\cdot\mathbf{x} < t_1\} \quad\text{and}\quad \epsilon_2 = \min\{\mathbf{w}_j\cdot\mathbf{x} \mid \mathbf{x} \in V_j,\ \mathbf{w}_j\cdot\mathbf{x} > t_2\} - t_2. \tag{8}$$
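A sketch of this unit construction under the above reading of equation (8); all names are ours.

```python
import numpy as np

def make_strip_unit(points, w, t1, t2, v):
    """Build the (n,k,2) hidden unit for a strip of value v (section 4).

    Widens the thresholds halfway toward the nearest training points outside
    the strip, so nearby test points are still captured (equation (8)).
    """
    dots = np.array([float(np.dot(w, x)) for x in points])
    below = dots[dots < t1]
    above = dots[dots > t2]
    eps1 = t1 - below.max() if below.size else 0.0
    eps2 = above.min() - t2 if above.size else 0.0
    thresholds = (t1 - eps1 / 2.0, t2 + eps2 / 2.0)
    outputs = (0, v, 0)   # 0 outside the widened strip, v inside
    return w, thresholds, outputs
```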
Maximum separable subset based network. At every iteration $j$ of the STRIP-based synthesis algorithm, ES finds a chromosome $\mathbf{w}_j$ which produces a maximum separable subset $S'_j = \{\mathbf{x} \in V_j \mid \mathbf{w}_j\cdot\mathbf{x} \le t_1\}$ or $S'_j = \{\mathbf{x} \in V_j \mid \mathbf{w}_j\cdot\mathbf{x} \ge t_2\}$ (i.e. the larger of the leftmost and the rightmost strips), with strip value $v_j$. Then we create an $(n,k,1)$-perceptron (hidden unit $j$) whose weight vector is $\mathbf{w}_j$, threshold vector is $(t_1)$ if $S'_j$ is the leftmost strip or $(t_2)$ if $S'_j$ is the rightmost strip, and output vector is $(v_j, 0)$ or $(0, v_j)$, depending on the side. In other words, the perceptron has a transfer function of the form $f_{\mathbf{w}_j,(t_1)}$ or $f_{\mathbf{w}_j,(t_2)}$ (that is, a $k$-valued one-threshold function). The $(n,k,1)$-perceptron will output the value $v_j$ for all points of $S'_j$ and will output the value 0 for all other points.