A VLSI-OPTIMAL CONSTRUCTIVE ALGORITHM FOR CLASSIFICATION PROBLEMS

SORIN DRAGHICI
Vision and Neural Networks Laboratory, Dept. of Comp. Science, Wayne State University, 431 State Hall, Detroit, MI, USA

VALERIU BEIU
Los Alamos National Laboratory, Division NIS-1, MS D466, Los Alamos, New Mexico 87545, USA

ISHWAR K. SETHI
Vision and Neural Networks Laboratory, Dept. of Comp. Science, Wayne State University, 431 State Hall, Detroit, MI, USA

ABSTRACT: If neural networks are to be used on a large scale, they have to be implemented in hardware. However, the cost of the hardware implementation is critically sensitive to factors like the precision used for the weights, the total number of bits of information and the maximum fan-in used in the network. This paper presents a version of the Constraint Based Decomposition training algorithm which is able to produce networks using limited precision integer weights and units with limited fan-in. The algorithm is tested on the 2-spiral problem and the results are compared with other existing algorithms.

INTRODUCTION

It is widely accepted that hardware implementation is the most cost-effective solution for the large scale use of computational paradigms. Artificial neural networks (NNs) are no exception. One particular difficulty regarding the migration of neural networks towards hardware is that their present software simulations do not take into consideration issues which are very important from a VLSI point of view. For instance, most software simulations use floating point arithmetic and either single or double precision weights. Any hardware implementation would be unreasonably expensive if it were to implement floating point operations and to store so many bits for each weight. Limited precision weight neural networks are better suited for such purposes because limited precision requires fewer bits for storing the weights and allows simpler computations. This decreases the size of the VLSI chip and, therefore, its cost. Another important issue for hardware implementations of neural networks is the complexity of the network. The complexity of a network can be described using different measures such as the number of neurons or the depth (i.e. the number of edges on the longest input/output path). For VLSI implementation purposes, the

depth can be put into correspondence with the delay T and the number of neurons can be put into correspondence with the area A of a VLSI chip. However, these measures are not the best criteria because the area of a single neuron depends on the precision of its associated weights, as already mentioned. Better criteria are the total number of connections (Beiu, 1997b and references therein), the total number of bits needed to represent the weights (Bruck, 1990; Williamson, 1991; Denker, 1988) or the sum of all the weights and thresholds (Beiu, 1997b). In the following, the total number of bits will be adopted as one appropriate measure of network complexity. Another important element is the maximum fan-in of the units. (Abu-Mostafa, 1988) and (Hammerstrom, 1988) are among the first papers to suggest that a limited maximum fan-in can be beneficial. Since then, several researchers have presented algorithms and architectures able to work with limited fan-in (Klaggers, 1993; Phatak, 1994). Recently, it has been shown that a small and constant fan-in (in the range [6..9]) is indeed optimal from a VLSI point of view because it minimizes the AT² measure of the chip (Beiu, 1997). In conclusion, a VLSI-friendly constructive algorithm should be able to:

• decide the appropriate architecture for the network
• construct a solution network (both architecture and weights)
• do so using limited precision integer weights in a range chosen by the user
• construct solutions with a limited fan-in chosen by the user

This paper will present such an algorithm, some results of tests performed on the 2-spiral problem and a comparison with other related algorithms.

THEORETICAL FOUNDATIONS

An important and interesting question is: how many bits of information do we need for a given problem or, alternatively, how complex a problem can we solve with a network of a given complexity? Recent work has established bounds which relate the number of bits (as defined in information theory) to the complexity of the problem, described by the number of patterns m and the minimum distance d between the closest patterns from opposite classes. Using an entropy based reasoning, Beiu first calculated an upper bound of mn{⌈log(D/d)⌉ + 5/2} bits in (Beiu, 1996) and then improved it to mn{⌈log(D/d)⌉ + 2}/2 in (Beiu, 1997a). In these expressions, n is the number of dimensions, D is the radius of the smallest hypersphere including all patterns and the logarithm is in base 2. These results have been obtained by considering a uniform quantization of the space which, in turn, assumes that the separating hyperplanes can be placed in any necessary position. For integer weights, these assumptions no longer hold. The case of weights restricted to integer values in the range [-p, p] is analyzed in (Draghici, 1997; 1997a; 1997b). (Draghici, 1997) shows that neural networks using limited precision weights are indeed capable of solving any classification problem if the distance between the closest patterns of opposite classes is at least 1/p:

Proposition 1 (from Draghici, 1997). Using integer weights in the range [-p, p], one can correctly classify any set of patterns for which the minimum distance between two patterns of opposite classes is dmin = 1/p.
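As an illustrative numeric instance (the values below are hypothetical, not taken from the paper, and they assume the reading of the improved bound given above), a problem with m = 100 patterns in n = 2 dimensions and a ratio D/d = 16 would require at most

\[
\frac{mn\left(\left\lceil \log_2 \tfrac{D}{d} \right\rceil + 2\right)}{2}
= \frac{100 \cdot 2 \cdot (4 + 2)}{2}
= 600 \text{ bits.}
\]

Proposition 1 can be read in the same spirit: if the closest opposite-class patterns are at distance dmin = 0.25, then integer weights in the range [-4, 4] (i.e. p = 1/dmin = 4) are sufficient.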

Further work in this field has shown that the number of bits necessary for solving a given classification problem with limited precision weights can be bounded (Beiu, 1996, 1997a; Draghici, 1997, 1997a, 1997b). However, all these results show only that it is theoretically possible to implement a solution able to solve the given classification problem with limited precision integer weights. Several algorithms which limit the precision of the weights have already been presented, and there are several different ways of dealing with the imposed limitations. For instance, (Höhfeld, 1992; Xie, 1991; Coggins, 1994; Tang, 1993) use a dynamic rescaling of the weights and a corresponding adaptation of the gain of the activation function. (Höhfeld, 1992; Vincent, 1992) rely on probabilistic rounding, whereas (Dundar, 1995; Khan, 1994; Kwan, 1992, 1993; Marchesi, 1990, 1993; Tang, 1993) use weight values which are restricted to powers of two. Several other possibilities for improving the VLSI efficiency of neural network implementations (including a VLSI friendly constructive algorithm) are presented in (Beiu, 1997b). (Draghici, 1997c) presents a version of a constructive algorithm which produces networks with limited weights. Although VLSI friendly due to the use of limited precision weights, this algorithm does not limit the fan-in of the units used. Thus, networks with an unreasonably large maximum fan-in might result. This paper presents an enhanced version of this algorithm (VCBD) which produces networks that use both limited precision weights and limited fan-in units. For a chosen fan-in in the range [6, 9], the resulting networks will be VLSI-optimal in the AT² sense (Beiu, 1997).

THE VLSI FRIENDLY CONSTRAINT BASED DECOMPOSITION ALGORITHM (VCBD)

The VCBD algorithm presented here (see Figure 5) is based on the Constraint Based Decomposition (CBD) algorithm (Draghici, 1994; 1996a; 1996b) and on the Integer Weight Constraint Based Decomposition (ICBD) algorithm (Draghici, 1997c). As already mentioned, ICBD builds a network using limited precision integer weights but does not limit the fan-in of the units. The VCBD algorithm is based on the classical 'divide and conquer' strategy. Its convergence is ensured by choosing the dividing hyperplane from a set of 'canonical' hyperplanes. This set of canonical hyperplanes is calculated for each given range p such that i) they can be implemented with weights in the given range [-p, p] and ii) they partition the space into a set of subspaces such that the largest possible distance within each subspace is less than the distance d between the closest two patterns from opposite classes. Proposition 1 gives a simple formula for calculating the range p for a given d in the general case. However, since the canonical set contains fewer hyperplanes than are available in the general case, this formula has to be amended for this particular case. One possibility is to consider the largest possible distance equal to the hyperdiagonal of the hypercube of side 1/p. This is justified by the fact that all hyperplanes normal to the axes are always included in the canonical set. If we consider only these hyperplanes normal to the axes, the largest decision region they can form is a hypercube of side 1/p. Therefore, the largest internal distance in such a hypercube is √n/p and any pattern set with d > √n/p (where n is the number of dimensions) will be separated successfully by some hyperplanes in the

canonical set. From here, one can calculate the necessary weight range p for a given minimum distance d as p > ⌈√n/d⌉. The algorithm searches through this set of candidate canonical hyperplanes until a hyperplane is found which correctly classifies at least some of the patterns in the current subgoal. Since this set of hyperplanes respects condition ii) above, there will always be a hyperplane which separates at least two patterns from opposite classes. This ensures that the training of any given subgoal will terminate and, since the algorithm is called on gradually smaller example sets, it also guarantees that the overall training will terminate and that it will actually find a solution. The algorithm builds a 3-layer network with a first layer of units implementing hyperplanes, a second layer of AND units and a third layer of OR units. Similarly structured networks are also built by the algorithms described in (Sethi, 1990; Beiu, 1997b; Ayestaran, 1993; Bose, 1993). This structure allows a simple approach to limiting the fan-in. The solution can be expressed in DNF form. For instance, a solution with maximum fan-in 5 would be: C1 = h0 + h0h1h2 + h0h1h2h3h4 (see Figure 1, where the units on the first layer implement h0 to h4). The maximum fan-in can be reduced by inserting a new unit F and rewriting this solution as: C1 = h0 + h0h1h2 + (h0h1h2)·h3h4 = h0 + h0h1h2 + F·h3h4. Figure 2 presents the new architecture of the network (with max fan-in = 3). Unit 10 implements the same function as before but now has a maximum fan-in of 3.
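As a small illustration of the range computation (not taken from the paper; the function name and the example values are ours), the smallest integer p satisfying √n/p < d can be obtained as follows:

import math

def min_weight_range(d_min, n):
    # Smallest integer p with sqrt(n)/p < d_min, i.e. the smallest range [-p, p]
    # for which two opposite-class patterns at distance >= d_min can no longer both
    # fall inside a single hypercube cell of side 1/p formed by the axis-parallel
    # canonical hyperplanes.
    return math.floor(math.sqrt(n) / d_min) + 1

# e.g. n = 2 inputs and d_min = 0.5: sqrt(2)/0.5 is about 2.83, so p = 3 suffices
print(min_weight_range(0.5, 2))   # -> 3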

Figure 1. An architecture generated by the ICBD. Note that one unit has fan-in 5.

Figure 2. An architecture generated by VCBD for the same problem. The square unit is an AND gate.
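As a purely symbolic sketch (ours, not the authors' code) of the rewriting described above, an AND term with too many literals can be reduced by repeatedly collapsing a prefix of the term into a new intermediate unit; literal names are treated as opaque strings and the F1, F2, ... naming is hypothetical:

def reduce_fanin(term, max_fanin):
    """Rewrite an AND term (a list of literal names) so that no AND unit has more
    than max_fanin inputs, by collapsing prefixes of the term into new units."""
    new_units = {}
    term = list(term)
    while len(term) > max_fanin:
        head, term = term[:max_fanin], term[max_fanin:]
        name = 'F%d' % (len(new_units) + 1)
        new_units[name] = head        # the new AND unit implements the factored prefix
        term = [name] + term          # the prefix is now a single input of the old unit
    return term, new_units

# e.g. the fan-in-5 term of the text, reduced to a maximum fan-in of 3
term, extra_units = reduce_fanin(['h0', 'h1', 'h2', 'h3', 'h4'], max_fanin=3)
print(term)          # ['F1', 'h3', 'h4']
print(extra_units)   # {'F1': ['h0', 'h1', 'h2']}

Note that VCBD itself applies this idea during construction (see Figure 5) rather than as a post-processing step; the sketch only illustrates the effect on one term.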

Figure 3. The XOR problem solved with integer weights in the range [-3, 3]. The solution used 2 hyperplanes for a total number of 2x3x3 = 18 bits.

Figure 4. The 2-spiral problem solved with integer weights in the range [-4, 4]. The solution used 40 hyperplanes for a total number of 40x3x4 = 480 bits.

Separate(region, C1 (class 1), C2 (class 2), factor, range, fan-in)
  if fan-in == max_fan-in then
    create a new F unit which will implement the given factor
    new_factor = F
    Separate(region, C1, C2, new_factor, range, 1)
  else
    Build a subgoal S with patterns x1C1 and x1C2 taken at random from C1 and C2.
    Delete x1C1 and x1C2 from C1 and C2.
    Choose h, a canonical hyperplane (in range) which separates x1C1 and x1C2.
    For each pattern p in C1 ∪ C2:
      Add p to the current subgoal S and save h in h_copy.
      Try to find another canonical hyperplane (in range) which separates the current subgoal S.
      if not success then
        Restore h from h_copy and remove p from S.
    Let new_factor = factor AND (h, '+').
    If the positive half-space determined by new_factor is consistently Cj then
      Classify new_factor as Cj /* done */
    else /* call the procedure recursively in the smaller region */
      Delete from C1 and C2 all patterns which are not in h+. Store the result in new_C1 and new_C2.
      Separate(h+, new_C1, new_C2, new_factor, range, fan-in + 1)
    Let new_factor = factor AND (h, '-').
    If the negative half-space determined by new_factor is consistently Cj then
      Classify new_factor as Cj /* done */
    else /* call the procedure recursively in the smaller region */
      Delete from C1 and C2 all patterns which are not in h-. Store the result in new_C1 and new_C2.
      Separate(h-, new_C1, new_C2, new_factor, range, fan-in + 1)

Figure 5. The VCBD algorithm for limited precision integer weights and limited fan-in. The canonical hyperplanes depend on the given range (precision) p.
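To make the control flow concrete, the following is a minimal, runnable Python sketch of the recursion in Figure 5. It is our simplification, not the authors' implementation: the canonical set is restricted to its axis-parallel members x_i >= k/p, the subgoal-growing loop of Figure 5 is omitted (the first canonical hyperplane that actually splits the remaining patterns is accepted), and the fan-in limit is handled by replacing the current factor with a fresh unit name (the mapping from that unit back to the factor it implements is not recorded here).

import itertools

def axis_parallel_canonical(n, p):
    """Axis-parallel canonical hyperplanes for integer weights in [-p, p]:
    p*x_i >= k, i.e. x_i >= k/p, encoded as (weight vector, threshold)."""
    planes = []
    for i in range(n):
        for k in range(-p, p + 1):
            w = [0] * n
            w[i] = p
            planes.append((tuple(w), k))
    return planes

def on_positive_side(h, x):
    w, k = h
    return sum(wi * xi for wi, xi in zip(w, x)) >= k

def separate(C1, C2, p, max_fanin, factor=(), counter=None):
    """Return a list of (AND-term, class) pairs; each AND-term is a tuple of
    (hyperplane-or-unit-name, sign) pairs with at most max_fanin entries."""
    counter = counter if counter is not None else itertools.count(1)
    if not C2:                       # region is consistently class 1
        return [(factor, 'C1')]
    if not C1:                       # region is consistently class 2
        return [(factor, 'C2')]
    if len(factor) >= max_fanin:     # fan-in limit reached: introduce a new unit F
        f_unit = ('F%d' % next(counter), '+')
        return separate(C1, C2, p, max_fanin, (f_unit,), counter)
    n = len(C1[0])
    for h in axis_parallel_canonical(n, p):
        pos1 = [x for x in C1 if on_positive_side(h, x)]
        pos2 = [x for x in C2 if on_positive_side(h, x)]
        neg1 = [x for x in C1 if not on_positive_side(h, x)]
        neg2 = [x for x in C2 if not on_positive_side(h, x)]
        if (pos1 or pos2) and (neg1 or neg2):        # h actually splits the region
            return (separate(pos1, pos2, p, max_fanin, factor + ((h, '+'),), counter) +
                    separate(neg1, neg2, p, max_fanin, factor + ((h, '-'),), counter))
    raise RuntimeError('no canonical hyperplane splits the remaining patterns')

# toy usage: XOR with weights in [-3, 3] and max fan-in 3 (cf. Figure 3)
if __name__ == '__main__':
    class1 = [(0.0, 0.0), (1.0, 1.0)]
    class2 = [(0.0, 1.0), (1.0, 0.0)]
    for term, label in separate(class1, class2, p=3, max_fanin=3):
        print(label, term)

Each returned AND term corresponds to one unit of the second (AND) layer of the constructed network; the OR layer then collects all terms belonging to the same class.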

EXPERIMENTAL RESULTS

The algorithm was tested on several problems. Figure 3 presents the XOR problem solved with integer weights in the range [-3, 3]. Note that the solution uses only 2x3x3 = 18 bits for storing the weights; compare this with the 2x3x32 = 192 bits which is the minimum number of bits used by any algorithm employing 32-bit floating point weights. Figure 4 presents a solution for the 2-spiral problem. The solution uses 40 hyperplanes; each hyperplane uses 3 integer weights in the range [-5, 5], which can be stored on 4 bits. The total number of bits used for this solution is 40x3x4 = 480. Also, note that the 4 bits have not been exploited fully (they can implement weights in the range [-7, 7], not only the used range of [-5, 5]). The maximum fan-in of this network is 7, which is within the optimal range [6, 9]. In comparison, the floating point version of the CBD algorithm (Draghici, 1994) managed to produce a solution with only 34 hyperplanes. However, each of these 34 hyperplanes used three floating point weights, for a total of 34 x 3 x 32 = 3264 bits (for the particular compiler used). (Lang, 1988) solved the problem with 138

connections. If we assume these connections were 32-bit floating point numbers, we obtain 138 x 32 = 4416 for the total number of bits used. It can be noted that, although the number of hyperplanes used by our algorithm is larger, the number of bits is actually greatly reduced.

CONCLUSIONS

This paper presented a constructive algorithm able to build solutions using limited precision integer weights and limited fan-in. The algorithm does not require an a priori design of the network architecture and has guaranteed convergence. It has been shown on a few examples that the algorithm is able to build solutions to complicated problems (like the 2-spiral problem) which use integer weights in a very limited range and an optimal fan-in. The number of bits necessary for storing the weights for the problems studied was approximately one order of magnitude less than the number of bits used by classical algorithms using floating point weights. At the same time, the number of bits used was approximately 40% less than the number of bits given by the theoretical limits (480 vs. 776). Future work will include extensive testing of this algorithm on different real-world problems. Also, we intend to test the CBD algorithm with different limited precision integer weight (LPIW) weight-changing mechanisms, which might bring improvements in the generalization performance of the generated solutions.

REFERENCES

Abu-Mostafa Y.S., Connectivity Versus Entropy, Neural Information Processing Systems, D.Z. Anderson (ed.), Amer. Inst. of Physics, NY, pp. 1-8, 1988.
Ayestaran H.E., R.W. Prager, The Logical Gates Growing Network, TR 137, Cambridge Univ., Eng. Dept., July 1993.
Beiu V., Entropy bounds for classification algorithms, Neural Network World, Vol. 6, No. 4, pp. 497-505, 1996.
Beiu V., Constant Fan-In Discrete Neural Networks Are VLSI-Optimal, in S.W. Ellacott et al. (eds.): "Mathematics of Neural Networks: Models, Algorithms and Applications", Kluwer, 1997.
Beiu V., T. De Pauw, Tight bounds on the size of neural networks for classification problems, Proc. of Intl. Work-Conf. on Artificial and Natural Neural Networks IWANN'97, Lanzarote, June 4-6, 1997 (a).
Beiu V., VLSI Complexity of Discrete Neural Networks, Gordon & Breach, 1997 (b).
Bose N.K., A.K. Garga, Neural Network Design Using Voronoi Diagrams, IEEE Trans. on Neural Networks, 4(5), 778-787, 1993.
Bruck J., Goodman J.W., On the power of neural networks for solving hard problems, J. of Complexity, 6, 129-135, 1990.
Coggins R., M. Jabri, Wattle: A Trainable Gain Analogue VLSI Neural Network, Advances in NIPS 6 (NIPS*93, Denver, CO), Morgan Kaufmann, San Mateo, CA, 874-881, 1994.
Denker J.S., Wittner B.S., Network Generality, Training Required and Precision Required, NIPS'88, D.Z. Anderson (ed.), Amer. Inst. of Physics, New York, 219-222, 1988.
Draghici S., The Constraint Based Decomposition Training Architecture, Proc. of World Congress on Neural Networks, San Diego, Vol. III, pp. 545-555, 1994.
Draghici S., Enhancements of the Constraint Based Decomposition Training Architecture, Proc. of the Intl. Conf. on Neural Networks, Washington, 3-6 June, Vol. I, pp. 317-322, 1996 (a).

Draghici S., Improving the speed of some constructive algorithms by using a locking detection mechanism, Neural Network World, Vol. 6, No. 4, pp. 563-575, 1996 (b).
Draghici S., Sethi I.K., On the possibilities of the limited precision weights neural networks in classification problems, Proc. of Intl. Work-Conf. on Artificial and Natural Neural Networks IWANN'97, Lanzarote, June 4-6, pp. 753-762, 1997.
Draghici S., Beiu V., Entropy based comparison of neural networks for classification, to appear in Proc. of the 9th Italian Workshop on Neural Nets, WIRN Vietri-sul-Mare, 22-24 May, 1997 (a).
Draghici S., Sethi I.K., Choosing the appropriate size network for low-cost hardware implementation: a tight lower bound for the number of bits of integer limited weights neural nets, Proc. of CSCS 11, Bucharest, Romania, 28-31 May, 1997 (b).
Draghici S., Sethi I.K., Adapting theoretical constructive algorithms to hardware implementations for classification problems, to appear in Proc. of the Intl. Conf. on Eng. Appls. of Neural Networks, Stockholm, Sweden, 16-18 June, 1997 (c).
Dundar G., K. Rose, The Effect of Quantization on Multilayer Neural Networks, IEEE Transactions on Neural Networks, 6(6), pp. 1446-1451, 1995.
Hammerstrom D., The Connectivity Analysis of Simple Associations - or - How Many Connections Do We Need?, NIPS, D.Z. Anderson (ed.), Amer. Inst. of Physics, NY, pp. 338-347, 1988.
Höhfeld M., S.E. Fahlman, Learning with limited numerical precision using the Cascade-Correlation Algorithm, IEEE Transactions on Neural Networks, NN-3(4), 602-611, 1992.
Khan A.H., E.L. Hines, Integer-weights neural nets, Electronics Letters, 30(15), 1237-1238, 1994.
Khan A.H., R.G. Wilson, Integer-weight approximation of continuous-weight multilayer feedforward nets, Proc. of the IEEE Intl. Conf. on Neural Networks, Vol. 1, pp. 392-397, IEEE Press, NY, 1996.
Klaggers H., M. Soegtrop, Limited Fan-In Random Wired Cascade-Correlation, in D.J. Myers and A.F. Murray (eds.): MicroNeuro'93, UnivEd Tech. Ltd., Edinburgh, pp. 79-82, 1993.
Kwan H.K., Tang C.Z., Designing Multilayer Feedforward Neural Networks Using Simplified Activation Functions and One-Power-of-Two Weights, Electronics Letters, 28(25), 2343-2344, 1992.
Kwan H.K., Tang C.Z., Multiplierless Multilayer Feedforward Neural Networks Design Suitable for Continuous Input-Output Mapping, Electronics Letters, 29(14), pp. 1259-1260, 1993.
Lang K.J., M.J. Witbrock, Learning to tell two spirals apart, Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988.
Marchesi M., G. Orlandi, F. Piazza, L. Pollonara, A. Uncini, Multilayer Perceptrons with Discrete Weights, Proc. Intl. Joint Conf. on Neural Networks, San Diego, Vol. II, pp. 623-630, 1990.

Marchesi M., G. Orlandi, F. Piazza, A. Uncini, Fast Neural Networks without Multipliers, IEEE Transactions on Neural Networks, NN-4(1), pp. 53-62, 1993.
Phatak D.S., I. Koren, Connectivity and Performance Tradeoffs in the Cascade-Correlation Learning Architecture, IEEE Trans. on Neural Networks, 5(6), pp. 930-935, 1994.
Sethi I., Entropy nets: From decision trees to neural networks, Proc. of the IEEE, 1605-1613, 1990.
Tang C.Z., H.K. Kwan, Multilayer Feedforward Neural Networks with Single Power-of-Two Weights, IEEE Trans. on Signal Processing, SP-41(8), 2724-2727, 1993.
Vincent J.M., D.J. Myers, Weight Dithering and Wordlength Selection for Digital Backpropagation Networks, BT Technology J., 10(3), pp. 124-133, 1992.
Williamson R.C., ε-Entropy and the complexity of feedforward neural networks, NIPS'90, R.P. Lippmann, J.E. Moody and D.S. Touretzky (eds.), Morgan Kaufmann, San Mateo, 946-952, 1991.
Xie Y., M.A. Jabri, Training Algorithms for Limited Precision Feedforward Neural Networks, SEDAL TR 1991-8-3, School of EE, University of Sydney, Australia, 1991.
