Training Algorithms for Limited Precision Feedforward Neural Networks

Yun Xie
Department of Electronic Engineering, Tsinghua University, Beijing 100084, P.R. China

Marwan A. Jabri
Department of Electrical Engineering, The University of Sydney, N.S.W. 2006, Australia
SEDAL Technical Report No. 1991-8-3
Abstract

In this paper we analyse the training dynamics of limited precision feedforward multilayer perceptrons implemented in digital hardware. We show that special techniques have to be employed to train such networks, in which every variable is quantised to a limited number of bits. Based on the analysis, we propose a Combined Search (CS) training algorithm which consists of partial random search and weight perturbation and can easily be implemented in hardware. Computer simulations were conducted on IntraCardiac ElectroGram and sonar reflection pattern classification problems. The results show that using CS, the training performance of limited precision feedforward MLPs with 8 to 10 bit resolution can be as good as that of unlimited precision networks. The results also show that CS is insensitive to variations in its training parameters.
1 Introduction

When neural networks are to be used on limited precision digital hardware, problems may arise in their training because all network parameters and variables are quantised. The characteristics of limited precision networks may be heavily affected by the quantisation, and their training may become much harder. Several researchers have conducted simulations that implement existing training algorithms using limited precision [1,2,3]. However, few attempts have been made to develop training algorithms specially geared to limited precision. The work described in this paper was conducted to investigate the performance and the training dynamics of digitally implemented feedforward MLPs in which all the variables are represented by a limited number of bits in fixed point format. The paper is structured as follows. In Section 2, the quantization effects on the training dynamics are analyzed. In Section 3, the Combined Search training algorithm (CS) for limited
precision neural networks is presented. The performance of the algorithm is analyzed and implementation issues are discussed. In Section 4, we present extensive simulations on IntraCardiac ElectroGram (ICEG) and sonar reflection pattern recognition problems, and compare the performance of our algorithm against several others, including backpropagation. We also discuss the sensitivity of CS to its training parameters. In Section 5 the hardware implementation of CS is discussed briefly. Throughout the paper, and unless indicated otherwise, we use the term MLP to indicate a feedforward multi-layer perceptron network.
2 Quantization Effects on Neural Network Training

In [4], we have used statistical models to analyze the effects of quantization on MLPs. As for the dynamics of the training process, the major effect of quantization (especially when small numbers of bits are used) is that small variations in the error function are quantized to zero. This leads to the formation of a large number of plateaus in the error surface in addition to those that already exist (see Figure 1). These plateaus can be absolutely flat as far as quantisation is concerned, and they make training much harder because they trap the training procedure. Trapping occurs more frequently if the training algorithm only makes use of the information in the vicinity of the current point in the weight space, and especially if it is based on the differentiability of the error function. Therefore, if as few bits as possible are to be used for the quantisation while ensuring successful training and performance, training cannot rely solely on local information such as the gradient of the error function. Methods performing non-local search in the weight space have to be incorporated.
Figure 1: The error function before and after quantization

In this paper, the terms "local search algorithm" and "non-local search algorithm" are used to differentiate between algorithms that use information from a close vicinity of the current point and algorithms that search a wider area.
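To illustrate the quantisation effect described above, the following minimal sketch (not taken from the paper; the fixed-point format, step size and numerical values are assumptions chosen for illustration) rounds an error value onto a fixed-point grid and shows a small improvement being quantised to zero, i.e. a flat step on one of the additional plateaus.

    /* Minimal illustration (not the paper's code): quantising an error value to
     * a fixed-point format with `frac_bits` fractional bits.  Any change in the
     * error smaller than one quantisation step (2^-frac_bits) is rounded away,
     * which is how the additional plateaus described above arise. */
    #include <math.h>
    #include <stdio.h>

    /* Round a real value to the nearest representable fixed-point number. */
    static double quantize(double x, int frac_bits)
    {
        double step = ldexp(1.0, -frac_bits);      /* 2^-frac_bits */
        return step * floor(x / step + 0.5);
    }

    int main(void)
    {
        int frac_bits = 8;                         /* 8 fractional bits */
        double error  = 0.125;
        double delta  = 0.0009;                    /* a "small variation" */

        double qe  = quantize(error, frac_bits);
        double qed = quantize(error + delta, frac_bits);

        /* With 8 fractional bits the step is 1/256 (about 0.0039), so the
         * 0.0009 improvement disappears: both quantised errors are equal. */
        printf("step = %g, q(E) = %g, q(E+delta) = %g, difference = %g\n",
               ldexp(1.0, -frac_bits), qe, qed, qed - qe);
        return 0;
    }

Any training rule that drives its weight updates from such a quantised error difference sees a zero "gradient" over the whole flat step, which is why local search alone stalls.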
3 The Combined Search Algorithm for Training Limited Precision MLPs

It is desirable that limited precision training algorithms possess the following properties:
they should require few computations to deal with limited precision training compared to the unlimited precision case, they should be able to escape plateaus and local minima, and they should be simple to implement in hardware.

Backpropagation (BP) is the most popular training algorithm for MLPs [5]. It is efficient in training but difficult to implement in hardware, because of its high precision requirements (12 to 16 bits) and because it involves a relatively large number of computations in the forward and backward paths [3,1,6]. For hardware implementation simplicity, the Weight Perturbation algorithm (WP) [6], originally proposed for analog neural networks, is more attractive. The idea is that by injecting a small perturbation ΔW_ij into weight W_ij, the partial derivative ∂Error/∂W_ij can be approximated by the measurement (Error(W_ij + ΔW_ij) − Error(W_ij))/ΔW_ij, where Error(·) is the error (cost) function to be minimized during training. The weights can therefore be updated according to

    W_ij^new = W_ij^old + η (Error(W_ij) − Error(W_ij + ΔW_ij))/ΔW_ij        (1)

where η is the learning rate. WP, although not as efficient as BP, requires only feedforward propagation during training; as a result it is easier to implement in hardware and is less sensitive to quantization errors.
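As a concrete reading of Equation (1) under fixed-point arithmetic, the sketch below applies the weight perturbation update to a single weight of a toy quadratic cost. The cost function, the 8-bit fractional format, the perturbation size and the learning rate are illustrative assumptions only, not the experimental setup used later in the paper.

    /* Hedged sketch of the weight perturbation update of Equation (1) in fixed
     * point.  The toy cost function and all parameter values are assumptions
     * made for illustration. */
    #include <math.h>
    #include <stdio.h>

    #define FRAC_BITS 8                            /* 8 fractional bits */

    static double quantize(double x)
    {
        double step = ldexp(1.0, -FRAC_BITS);
        return step * floor(x / step + 0.5);
    }

    /* Toy cost: a quadratic bowl with its minimum at w = 0.3, evaluated in
     * quantised arithmetic so that small differences can vanish. */
    static double error_fn(double w)
    {
        return quantize((w - 0.3) * (w - 0.3));
    }

    int main(void)
    {
        double w   = quantize(-0.5);               /* initial weight          */
        double dw  = ldexp(1.0, -FRAC_BITS);       /* perturbation = one step */
        double eta = 0.05;                         /* learning rate           */

        for (int epoch = 0; epoch < 200; epoch++) {
            double e0 = error_fn(w);
            double e1 = error_fn(w + dw);          /* forward pass with W+dW  */
            /* Equation (1): W_new = W_old + eta*(Error(W)-Error(W+dW))/dW.   */
            w = quantize(w + eta * (e0 - e1) / dw);
            if (e0 == e1)                          /* difference quantised to 0 */
                break;                             /* trapped on a plateau      */
        }
        printf("final weight = %g, final error = %g\n", w, error_fn(w));
        return 0;
    }

The loop stops either after a fixed number of epochs or as soon as the quantised error difference is zero, which is exactly the plateau-trapping behaviour discussed in Section 2.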
There are two serious problems with BP and WP when they are implemented with limited precision. First, the training convergence is more sensitive to the learning rate than in the unlimited precision case: the learning rate must be small enough to avoid dramatic changes to the weights in a single iteration, yet large enough to avoid the weight change being quantised to zero. Second, BP and WP are local search algorithms. As indicated in Section 2, their performance degrades at lower bit resolutions and they are likely to be trapped on plateaus.

To overcome these problems, a combination of local and non-local search techniques has to be employed. The training procedure we propose in this paper combines such techniques, hence its name, the Combined Search algorithm (CS). The algorithm makes use of Modified Weight Perturbation (MWP) and Partial Random Search (PRS). In order to keep the weights out of saturation, neurons are equipped with a gain control which is adjusted during training. We describe the algorithm in detail below.

A flow diagram of the Combined Search algorithm is shown in Figure 3. In the diagram, "Weight(I,J)" is the Jth weight connecting the (I-1)th and the Ith layer of neurons; "Gain(I)" is the gain of the neurons in the Ith layer; "L" is a constant larger than 1 (see Section 3.1.2); "Error" is the cost function to be minimized; "Nlayers", "Nweight[I]" and "Nepochs" respectively denote the number of layers, the number of weights in the Ith layer, and the number of epochs.
    CombinedSearch {
        do {
            PartialRandomSearch();
            ModifiedWeightPerturbation();
        } while (terminating conditions are not met)
    }
    PartialRandomSearch {
        do {
            for (I ...

y = f(x) = 1 if x > 1; x if in between; 0 if x < 0

From [10]: 99.7, 0.2, 83.5, 5.6
multi-layer feedforward neural networks. The Partial Random Search and the Modified Weight Perturbation algorithms do compensate for each other's limitations when combined in the Combined Search algorithm.

4.4 Convergence Speed and Gain Adjustment
The convergence speed of CS and PRS is compared on the training of a one-hidden-layer architecture with 10 hidden units and 9 bit quantization on the ICEG problem. The training is terminated either when the average output error falls below 6 × 10^-3 or when the number of training epochs reaches 1000. Out of 10 trials, CS stopped training 8 times within 1000 epochs, and the average number of training epochs over these 8 sessions is 463; PRS failed to reduce the error below 6 × 10^-3 within 1000 epochs. This indicates that, in terms of convergence speed, CS is more efficient than PRS in searching for minima.

As mentioned in Section 3.1, the gains of the neurons in the network are adjusted during training, and it was shown that the Combined Search algorithm is not sensitive to the initial values of the gains. The experiments show that as long as the initial gain values are 1 to 2 orders of magnitude smaller than the final gains, there is no obvious difference in the training results. This suggests that CS is robust with respect to the gain, although the final gains of the neurons in a network are somewhat problem-dependent. It also indicates that if the gain adjustment coefficient is set to 2, 6 to 7 bits are sufficient for the control of the gain.

The gain adjustment coefficient, which was set to 1.5 in most of the computer simulations in this paper, is independent of the problem to be solved by the network. Too large a value can make the training unstable, especially when low precision is used. The reason is that when the gains are adjusted, even though the weights are scaled down accordingly at the same time, the mapping implemented by the network may be altered due to the limited precision. This phenomenon can be seen clearly in the sudden error increases in Figure 6. If the gain adjustment coefficient is very close to 1, the training becomes slow, since the growth margin in the PRS's search range is too small for large weights. In our simulations, 1.5 was found to be a satisfactory value for the gain adjustment coefficient. Values of 1.2 and 2 were also tried on the ICEG data. Table 5 shows the training results for the single hidden layer architecture with 10 hidden units, where CS was used and each training session was run for 900 epochs.

Table 5: Training results on ICEG data with 9 bit resolution and one hidden layer

Gain Adjustment   Average Performance     Standard Deviation      Average Performance    Standard Deviation
Coefficient       on Training Sets (%)    on Training Sets (%)    on Testing Sets (%)    on Testing Sets (%)
1.2               97.3                    0.644                   89.9                   0.935
1.5               98.4                    0.771                   89.6                   0.978
2.0               98.3                    0.614                   89.7                   1.03
The results show that CS gives almost the same performance when the gain adjustment coefficient varies from 1.2 to 2. If it is set to 2 and a binary representation is used, the gain adjustment can be realized just by a "shift" operation.
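As a hedged illustration of this shift-based gain adjustment (the 16-bit fixed-point format, the number of fractional bits and the example weights are assumptions, not the paper's hardware), doubling the gain and halving the weights can be written as a pair of shifts; the discarded low-order bit of odd weights is the quantisation loss that, as noted above, can slightly alter the mapping implemented by the network.

    /* Gain adjustment by shifting, for a gain adjustment coefficient of 2.
     * The 16-bit format with 6 fractional bits and the weight values are
     * illustrative assumptions.  An arithmetic right shift is assumed for
     * negative weights, as on most targets. */
    #include <stdint.h>
    #include <stdio.h>

    #define NWEIGHTS 4

    int main(void)
    {
        int16_t gain = 1 << 6;                    /* gain = 1.0, 6 fractional bits */
        int16_t weights[NWEIGHTS] = { 37, -123, 5, 64 };   /* fixed-point weights  */

        /* Double the gain ... */
        gain <<= 1;

        /* ... and halve every weight so that gain * weight is nominally
         * unchanged.  The right shift drops the least significant bit, so odd
         * weights change value slightly; this is the quantisation effect that
         * can alter the mapping and cause the sudden error increases seen in
         * Figure 6. */
        for (int j = 0; j < NWEIGHTS; j++)
            weights[j] >>= 1;

        printf("gain = %d, weights = %d %d %d %d\n",
               gain, weights[0], weights[1], weights[2], weights[3]);
        return 0;
    }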
4.5 The Influence of Bit Resolution on the Performance of the Network

Using the ICEG classification problem and the three layer architecture used in the previous sections, another set of simulations was conducted in which the number of quantization bits was varied. Table 6 and Figure 6 show the training performance and the training curves, respectively. The Combined Search algorithm is used in this experiment. Figure 6 plots the average training error (×10^-3) against the number of epochs for 10, 9, 8, 7, 6 and 5 bit quantization.
Figure 6: Training curves of CS on ICEG data with one hidden layer

As the number of bits decreases, the performance of the network deteriorates much faster than linearly. What is not shown in Table 6 is that as the number of bits decreases, CS relies increasingly on the Partial Random Search algorithm during training, while the performance of the Modified Weight Perturbation algorithm degrades until its contribution becomes marginal. The reason is that as fewer bits are used, both the number and the extent of the plateaus in the error function caused by the quantization increase. Another interesting phenomenon is that as the number of bits is reduced, an increasing number of hidden units become irrelevant after the network is trained: their outputs are either zero or one for all the input patterns (a simple check for this is sketched after Table 6). This is due to the fact that as fewer bits are used, less information can be extracted by the hidden units from the input data.
Table 6: Training results of CS on ICEG data with one hidden layer

Number of   Average Performance     Standard Deviation      Average Performance    Standard Deviation
Bits        on Training Sets (%)    on Training Sets (%)    on Testing Sets (%)    on Testing Sets (%)
10          98.7                    0.712                   90.0                   1.09
9           98.4                    0.771                   89.6                   0.978
8           98.5                    0.740                   89.2                   1.46
7           97.5                    2.24                    87.7                   5.29
6           95.7                    2.70                    86.4                   5.85
5           89.2                    8.80                    81.2                   8.63
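The "irrelevant hidden unit" observation above can be checked mechanically: a unit is flagged when its output is identically zero or identically one over the whole training set. The sketch below uses made-up example outputs and array sizes; it illustrates the check and is not code from the experiments.

    /* Illustrative check (not from the paper) for irrelevant hidden units:
     * a unit is flagged if its output is 0 for every pattern or 1 for every
     * pattern.  The example data are invented. */
    #include <stdio.h>

    #define NPATTERNS 4
    #define NHIDDEN   3

    int main(void)
    {
        /* hidden[p][h]: output of hidden unit h for input pattern p (0..1) */
        double hidden[NPATTERNS][NHIDDEN] = {
            { 0.0, 0.25, 1.0 },
            { 0.0, 0.75, 1.0 },
            { 0.0, 0.50, 1.0 },
            { 0.0, 1.00, 1.0 },
        };

        for (int h = 0; h < NHIDDEN; h++) {
            int always_zero = 1, always_one = 1;
            for (int p = 0; p < NPATTERNS; p++) {
                if (hidden[p][h] != 0.0) always_zero = 0;
                if (hidden[p][h] != 1.0) always_one  = 0;
            }
            if (always_zero || always_one)
                printf("hidden unit %d is irrelevant (constant %s)\n",
                       h, always_zero ? "0" : "1");
        }
        return 0;
    }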
5 Hardware Implementation of the Combined Search Algorithm

It is shown in [6] that in terms of chip area and hardware complexity, WP is ideal for on-chip training. In addition to the components required for the normal operation of a network, WP requires the following three modules for on-chip learning: perturbation generation, error difference evaluation, and weight write. Though the analysis in [6] is done for analog VLSI, it applies to digital VLSI as well. For the Combined Search algorithm, two parts are required in addition to the three above:

1. a random number generator, which can be implemented as a shift register with feedback loops (a minimal sketch is given below);

2. a gain adjustment module, which can be simplified when the gain adjustment coefficient is 2.
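As an illustration of item 1, a linear feedback shift register (LFSR) is the usual way to build such a generator. The sketch below uses a standard maximal-length 16-bit tap set (polynomial x^16 + x^14 + x^13 + x^11 + 1) purely as an example; the paper does not specify the register length or the feedback taps.

    /* Minimal 16-bit Fibonacci LFSR as an illustration of the "shift register
     * with feedback loops" mentioned in item 1.  The tap set is a textbook
     * maximal-length choice, not taken from the paper's hardware. */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t lfsr_state = 0xACE1u;         /* any non-zero seed works */

    static uint16_t lfsr_next(void)
    {
        /* Feedback bit from the taps of x^16 + x^14 + x^13 + x^11 + 1. */
        uint16_t bit = ((lfsr_state >> 0) ^ (lfsr_state >> 2) ^
                        (lfsr_state >> 3) ^ (lfsr_state >> 5)) & 1u;
        lfsr_state = (uint16_t)((lfsr_state >> 1) | (bit << 15));
        return lfsr_state;
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++)               /* print a few pseudo-random words */
            printf("%04x\n", lfsr_next());
        return 0;
    }

In hardware this reduces to one 16-bit register, a few XOR gates and a clock, which is why it is attractive for on-chip random search.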
6 Conclusions

An efficient and effective limited precision training algorithm is presented in this paper. It combines the Modified Weight Perturbation and the Partial Random Search algorithms. A detailed description of the training characteristics of limited precision multi-layer feedforward neural networks and comparative simulation results on two pattern recognition problems are also presented, along with preliminary testing results on an analog neural network chip. The following conclusions can be drawn:

Training algorithms that are effective and efficient for unlimited precision networks may not be suitable for the limited precision case. Some need very high precision computation and are hard to implement in hardware.
The proposed Combined Search algorithm is an effective training algorithm for limited precision feedforward neural networks. It is simple to use, has stable performance, and is economical for hardware implementation.

In the Combined Search algorithm the Partial Random Search and the Modified Weight Perturbation algorithms do compensate for each other's limitations. Both the training performance and the convergence speed are improved as a result.

Although the simulations were conducted only on multi-layer feedforward neural networks, there is basically no difficulty in applying the Combined Search algorithm to other types of supervised network models.

Gain adjustment is essential in training limited precision neural networks. The simulation results show that 6 to 7 bits are sufficient for the control of the gains.

In our experiments, only one combination of local and non-local search methods was studied. In fact, various local and non-local search methods can be combined in different ways in order to meet various requirements in performance and implementation. The applications of such algorithms are not restricted to limited precision neural network training [8].
7 Acknowledgment

This work was supported in part by the Australian Research Council and the Australian Telecommunication and Electronic Research Board. One of us (YX) acknowledges the support of the National Education Committee of P.R. China.
References

[1] J. L. Holt and J. Hwang. Finite precision error analysis of neural network hardware implementation. Technical Report FT-10, Dept. of E.E., University of Washington, 1991.

[2] M. Hoehfeld and S. Fahlman. Learning with limited numerical precision using the cascade-correlation algorithm. Technical Report CMU-CS-91-130, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, May 1991.

[3] P. W. Hollis, J. S. Harper, and J. J. Paulos. The effects of precision constraints in a backpropagation learning network. Neural Computation, 2(3), 1990.

[4] Y. Xie and M. A. Jabri. Analysis of effects of quantization in multilayer neural networks using statistical model. IEEE Transactions on Neural Networks, 3(2), March 1992.

[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.

[6] M. A. Jabri and B. Flower. Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multi-layer networks. IEEE Transactions on Neural Networks, 3(1), 1992.

[7] J. K. Kruschke and J. R. Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on Systems, Man and Cybernetics, 21(1), 1991.

[8] P. D. Wasserman. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.

[9] E. Barnard and R. Cole. A neural-net training program based on conjugate-gradient optimization. OGC Technical Report No. CSE 89-014, 1989.

[10] R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1:75-89, 1988.