Eighth International Conference on Intelligent Systems Design and Applications
FPGA Realization of Activation Function for Artificial Neural Networks

Venakata Saichand (P.G. Student, M.Tech VLSI), Nirmala Devi M. and N. Mohankumar (Faculty in Electronics and Communication Engineering), School of Engineering, Amrita Vishwa Vidyapeetham, Tamil Nadu, India
Arumugam S., Nanda College of Engineering, Tamil Nadu, India
[email protected]
Abstract

Implementation of an Artificial Neural Network (ANN) in hardware is needed to fully exploit its inherent parallelism. The presented work focuses on configuring a Field-Programmable Gate Array (FPGA) to realize the activation function used in an ANN. The computation of the nonlinear activation function (AF) is one of the factors that constrain the area or the computation time. The most popular AF is the log-sigmoid function, which can be realized in digital hardware in several ways: equation approximation, a Lookup Table (LUT) based approach and Piecewise Linear (PWL) approximation, to mention a few. A two-fold approach to optimize the resource requirement is presented here. First, fixed-point (FXP) computation, which needs minimal hardware, is followed instead of floating-point (FLP) computation. Second, the PWL approximation of the AF, with higher precision, is shown to consume less silicon area than the LUT based AF. Experimental results are presented for comparison.

Keywords- ANN, FPGA, Activation Function, Lookup table based, Piecewise Linear Approximation

1. Introduction

Artificial neural networks (ANNs) are computational models of the human brain. Applications of ANNs range from function approximation to pattern classification and recognition. A multilayer perceptron (MLP) with one or two hidden layers can solve almost any problem of high complexity. ANNs exhibit a high degree of parallelism when the complete architecture with all interconnections is realized in hardware, but the implementation cost is high. On the other hand, implementing a single neuron or a single layer of neurons and configuring it to act as the other layers or neurons in a time-multiplexed way, using a control unit, offers a low level of parallelism; performance is traded for the conservation of silicon area [1]. Hardware implementation of an ANN to exploit the parallelism can follow analog, digital or mixed-signal design techniques. Mature and flexible digital design in Very Large Scale Integration (VLSI) can be implemented on Application Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs) [2]. ASIC design has some drawbacks, such as the ability to run only a specific algorithm and limitations on the size of a network. Hence the FPGA offers a suitable alternative that provides flexibility with appreciable performance [3]. It maintains the high processing density needed to exploit the parallel computation in an ANN [4]. Every digital module is concurrently instantiated on the FPGA and hence operates in parallel; thus the speed of the network is independent of its complexity [2]. When implementing an ANN on an FPGA, certain measures must be taken to minimize the hardware. The two-fold approach presented here aims to meet this objective. The first step is representing the data in fixed-point (FXP) rather than floating-point (FLP) format to reduce the circuit complexity significantly, at a slight cost in accuracy. The second step proposes the use of a Piecewise Linear (PWL) activation function (AF) rather than a Look Up Table (LUT) based activation function, based on the fact that the realization of the activation function plays a major role in deciding the resource requirement. Sections 2 and 3 present the background details and a review of the implementation, respectively. Section 4 deals with the architectural design. Sections 5 and 6 discuss the experimental results and conclusion, respectively.
2. Design concepts

Computation in a neuron of a network involves multiplication, addition and an activation function that decides the output, where f(yj) is the activation function, as shown in Figure 1. The approach proposed in this paper attempts to optimize the silicon area required to realize the activation function block. Instead of representing signals as floating-point data, the computation uses fixed-point arithmetic to reduce the hardware further. Experimental results clearly show that the PWL activation function design is hardware friendly and hence could be used to realize a single neuron and then the complete network.

Fig. 1. Neuron of an ANN (x1 to xn: inputs; w1 to wn: weights; yj = x·w; Oj = f(yj): output)
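As an illustration of the computation in Figure 1, the following Python sketch (software only, with illustrative input and weight values; the paper's design is realized in hardware) evaluates the weighted sum yj = x·w and the log-sigmoid output Oj = f(yj) of a single neuron.

    import math

    def log_sigmoid(y):
        # Log-sigmoid activation, f(y) = 1 / (1 + e^-y), as in equation (2)
        return 1.0 / (1.0 + math.exp(-y))

    def neuron_output(x, w):
        # Weighted sum y_j = sum(x_i * w_i), followed by the activation f(y_j)
        y_j = sum(xi * wi for xi, wi in zip(x, w))
        return log_sigmoid(y_j)

    # Illustrative inputs and weights (not taken from the paper)
    x = [0.5, -1.0, 0.25]
    w = [0.8, 0.3, -0.6]
    print(neuron_output(x, w))   # O_j = f(y_j)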
2.1 Data Precision and Representation

When implementing an ANN on an FPGA, the design poses a number of challenges [5]. Arithmetic representation is a major concern; it may range from 16-bit, 32-bit or 64-bit floating-point (FLP) formats to fixed-point (FXP) formats. The former is area hungry with improved precision, whereas the latter needs minimal resources at the expense of precision [6]. Hence an area-versus-precision tradeoff decides the data format. The presented work uses FXP formats.

FXP format: it has two parts, an integer part and a fractional part, separated by a radix point. The integer part is represented by bits bi-1 down to b3 and the fractional part extends from b2 to b0, as shown in Figure 2.

Fig. 2. FXP format (integer part: bi-1 ... b3; fractional part: b2 b1 b0)

The base B of a fixed-point number is 2 for binary data, and its decimal equivalent is obtained by the conversion represented in equation (1), with 2's complement used for negative numbers:

bi-1·B^(i-4) + bi-2·B^(i-5) + ... + b4·B^1 + b3·B^0 + b2·B^(-1) + b1·B^(-2) + b0·B^(-3)    (1)

The minimum allowable precision and the minimum allowable range have to be decided so as to conserve silicon area without affecting the performance of the ANN. For example, a 5-bit datum with a 2-bit integer part and a 3-bit fractional part represents a smallest non-zero value of (00.001)2 = 0.125 and a largest value of (11.111)2 = 3.875 in decimal. The precision of fixed-point data is independent of the number itself but depends on the position of the radix point. Direct implementation of arithmetic modules in hardware generally alters the operation due to the limited precision; hence the implemented result deviates slightly from the software simulation results, as shown in Fig. 6 [1].
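To make equation (1) concrete, the following Python sketch (illustrative only; the helper names are not from the paper) converts an unsigned bit pattern with a given number of fractional bits to its decimal value and reproduces the 5-bit (2 integer, 3 fractional) example above, including the slight precision loss when a real number is quantized into that format.

    def fxp_to_decimal(bits, frac_bits):
        # Equation (1) with B = 2: value = sum of b_k * 2^(k - frac_bits), k = 0 at the LSB
        value = 0.0
        for k, bit in enumerate(reversed(bits)):
            value += int(bit) * 2.0 ** (k - frac_bits)
        return value

    # 5-bit format with a 2-bit integer part and a 3-bit fractional part
    print(fxp_to_decimal("00001", 3))   # smallest non-zero value: 0.125
    print(fxp_to_decimal("11111", 3))   # largest value: 3.875

    def quantize(x, frac_bits=3, total_bits=5):
        # Round a real number to the nearest representable value of the format
        step = 2.0 ** -frac_bits
        max_val = (2 ** total_bits - 1) * step
        return min(round(x / step) * step, max_val)

    print(quantize(1.3))   # 1.25 -- deviates slightly from the exact value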
2.2 Activation Function of ANN

The main operations of an ANN are multiplication, summation and a squashing operation, as shown in Figure 3. The squashing function is known as the threshold function or activation function. The activation function may range from a simple step function to a sigmoid function. It restricts the applied input to lie within the specified range (0, 1) so as to decide the output, or the value to be applied as input to the next layer of neurons. A monotonically increasing, differentiable, non-linear sigmoid activation function is reported to be a better activation function for an ANN. Either the log-sigmoid function

f(x) = 1 / (1 + e^-x)    (2)

or the tan-sigmoid function

f(x) = 2 / (1 + e^-2x) - 1    (3)

can be used. The former is chosen here, where x = Σ xi·wi. Software implementation of these functions can easily be done by direct calculation using built-in functions, but a hardware design that targets an FPGA should consider the following points:
- arithmetic modules for x^y or exp(x) are difficult to synthesize with matched performance;
- the design of a divider is far from area and speed efficient.
Hence a common way is to implement the AF using an LUT based [5] or PWL approximation. Both implementations are tried here and the results are presented for comparison.
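For reference, a direct software evaluation of equations (2) and (3), the kind of built-in-function computation that is easy in software but costly to synthesize, might look as follows (the sample inputs are illustrative).

    import math

    def log_sigmoid(x):
        # Equation (2): output lies in (0, 1)
        return 1.0 / (1.0 + math.exp(-x))

    def tan_sigmoid(x):
        # Equation (3): output lies in (-1, 1); mathematically equal to tanh(x)
        return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

    for x in (-4.0, 0.0, 4.0):
        print(x, round(log_sigmoid(x), 4), round(tan_sigmoid(x), 4))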
Fig. 3. Modules of a neuron in an ANN: input registers (input 1: x, input 2: w), multiplier (x·w), adder (Σ x·w) and activation function f(yj) producing the output.
2.2.1 LUT based approach

An LUT can be used to implement the sigmoid activation function by means of discrete values, but it consumes a large storage area when a moderately high precision is required. If the input range needs 21 bits and the precision needs 16 bits to represent the LUT contents, then a 2^21 x 16-bit (4 MB) LUT is needed, which consumes a large area and access time and may affect the speed of computation. If a Taylor series is chosen instead to represent the exponential function, the computation involves more multipliers and thus increases the area. The LUT design does not optimize well under the FLP format, and hence the fixed-point format is used; but even then, on-chip realization of the log-sigmoid function increases the size of the hardware considerably. To optimize the area to some extent, the built-in RAM available in FPGAs is used to realize the LUT based activation function, which reduces area and improves speed. If the precision is to be improved further, the hardware-friendly PWL approximation of the activation function is to be used.

2.2.2 PWL approximation approach

The log-sigmoid activation function of equation (2) is realized by piecewise linear (PWL) approximation. The activation function is approximated by five linear segments [6], called pieces; the representation of the segments is shown in equation (4).

(4)

The number of pieces required in the PWL approximation of the activation function depends upon the complexity of the problem to be solved [5]. Using powers-of-two coefficients to represent the slopes of the linear segments allows right-shift operations to be used instead of multiplications [5], which reduces the hardware. The agreement of the PWL plot with the actual plot is shown in Fig. 7.

3. Review of AF realization

Antony W. Savich et al. [8] have shown that the resource requirement of networks implemented with FXP arithmetic is appreciably two times smaller than with FLP arithmetic of similar precision and range, and that FXP computation also converges in fewer clock cycles. For a normalized input and output in the range [0, 1], Holt and Baker [6] showed that a 16-bit FXP (1-3-12) representation was the minimum allowable precision for realizing the log-sigmoid transfer function. Ligon III et al. [11] demonstrated the density advantage of 32-bit FXP computation over FLP computation for Xilinx 4020E FPGAs. Software implementation of the activation function is easy, but implementation in hardware is costly, whether using an LUT or an equation approximation. Hence the PWL approximation is preferred.

4. Realization of the Activation Function

4.1 LUT based AF

The activation function in equation (2) is realized by representing the input x in 10 bits and the output f(x) in 8 bits; therefore 2^10 x 8 = 1024 x 8 = 8192 bits are required for LUT storage. The 10-bit input x has the FXP format shown in Figure 4 and serves as the address of the memory location, while the output f(x) is represented as an 8-bit fractional part only, indicating the contents of the respective memory location.
Fig. 4. FXP format of the 10-bit input x: sign bit b9, followed by the integer part and the fractional part (the 1-3-6 format listed in Table 3).
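As a software analogue of this LUT (an illustrative sketch only; the addressing below is a simple linear index over the input range, whereas the hardware uses the signed FXP bits of x as the RAM address, as in Table 1), the table stores 2^10 = 1024 eight-bit codes of the log-sigmoid.

    import math

    ADDR_BITS, OUT_BITS = 10, 8
    X_MIN, STEP = -5.5413, 0.015625          # lower input limit and step size, derived below

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Build the 1024-entry LUT: each entry is an 8-bit fractional code of f(x)
    lut = []
    for addr in range(2 ** ADDR_BITS):
        x = X_MIN + addr * STEP
        code = min(int(round(sigmoid(x) * 2 ** OUT_BITS)), 2 ** OUT_BITS - 1)
        lut.append(code)

    print(len(lut) * OUT_BITS)               # 8192 bits of storage, as stated above
    print(format(lut[0], "08b"))             # 00000001, the first code in Table 1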
The smallest non-zero value that can be represented at the output is 2^-n, where n is the number of bits in the output stream. This value has to stand in for zero, since f(x) -> 0 only as x -> -infinity, which never occurs; hence an error of 2^-n is introduced. Similarly, the largest possible value of f(x) is taken as 1 - 2^-n, to limit the output to lie within (0, 1). Therefore the minimum value of the AF is

f(x1) = 1 / (1 + e^-x1) = 2^-n    (5)

which decides the lower limit of the input,

x1 = -ln(2^n - 1)    (6)

Similarly, the maximum value of the AF is

f(x2) = 1 / (1 + e^-x2) = 1 - 2^-n    (7)

which decides the upper limit of the input,

x2 = ln(2^n - 1)    (8)

Hence the step size, or increment, of the input is found to be

Δx = ln[(0.5 + 2^-n) / (0.5 - 2^-n)]    (9)

and the minimum number of LUT entries is given by

LUTmin = (x2 - x1) / Δx    (10)

For the x1 and x2 of equations (6) and (8), LUTmin is found to be 710 for 8-bit precision. Hence a 1K RAM with 10-bit address lines is chosen, giving 1024 locations with a step size Δx = 0.015625. For the 10-bit input x, the minimum value is x1 = -5.5413 and the maximum value is x2 = 5.5371. Of the 1K output RAM, 710 locations are used, leaving 314 locations unused. This wastage of memory is a drawback, which the second method (PWL) overcomes. Simulated results are listed in Table 1.
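A quick numerical check of equations (5)-(10) for n = 8 output bits (a software sketch that simply evaluates the closed-form expressions above) reproduces the limits and LUT size quoted in the text.

    import math

    n = 8                                    # output precision in bits
    x1 = -math.log(2 ** n - 1)               # equation (6): lower input limit
    x2 = math.log(2 ** n - 1)                # equation (8): upper input limit
    step = math.log((0.5 + 2 ** -n) / (0.5 - 2 ** -n))   # equation (9)
    lut_min = (x2 - x1) / step               # equation (10)

    print(round(x1, 4), round(x2, 4))        # -5.5413  5.5413
    print(round(step, 6))                    # approximately 0.015625
    print(math.ceil(lut_min))                # 710 entries, so a 1K (2^10) RAM is chosen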
4.2 PWL approximation of AF

The PWL approximation of f(x) using the five segments of equation (4) is realized. The implementation needs only comparison, subtraction, addition and shifting operations, so the design occupies little area while giving appreciable performance. The computational modules of the activation function are shown in Figure 5.

Fig. 5. Computational modules of the activation function

The individual sub-blocks are realized as follows, using FXP formats: 8-n: inverter module; /n: constant bit shifts; Mod n: subtraction; +0.5: constant adder. Simulated results are tabulated in Table 2.
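Equation (4) is not reproduced above, but the region boundaries and output values reported in Table 2 are consistent with a five-segment approximation whose slopes are powers of two (1/4 and 1/64), so that the multiplications reduce to right shifts in hardware. The following Python sketch is an illustrative reconstruction under that assumption, not the paper's exact segment equations.

    def pwl_log_sigmoid(x):
        # Five-segment PWL approximation of the log-sigmoid.
        # Slopes 1/4 and 1/64 are powers of two, so in hardware the
        # multiplications become right shifts (>> 2 and >> 6).
        if x <= -8.0:
            return 0.0                   # region 1: saturate low
        elif x < -1.6:
            return x / 64.0 + 0.125      # region 2
        elif x <= 1.6:
            return x / 4.0 + 0.5         # region 3: steep central segment
        elif x < 8.0:
            return x / 64.0 + 0.875      # region 4
        else:
            return 1.0                   # region 5: saturate high

    # Spot-check against the values listed in Table 2
    for x in (-8.123, -7.0, -1.5, 1.5, 7.0, 8.123):
        print(x, pwl_log_sigmoid(x))
    # -7.0 -> 0.015625, -1.5 -> 0.125, 1.5 -> 0.875, 7.0 -> 0.984375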
5. Experimental Results

Synthesis of the circuits is performed using Xilinx ISE 9.1i, considering a Virtex-4 FPGA and targeting the device XC4VLX15-12FF668. The data precision followed in the two approaches is shown in Table 3.

Table 3. Data precision of the LUT based and PWL AF

Data   | LUT based AF: FXP format | No. of bits | PWL approximation AF: FXP format | No. of bits
Input  | 1-3-6                    | 10          | 1-4-8                            | 13
Output | 0-0-8                    | 8           | 0-4-8                            | 12

It is observed that around 40% of the LUT utilization and 20% of the gate count are saved by the PWL approximation of the AF. The silicon area conserved on a single activation-function realization confirms that a significant area would be conserved when the neuron, and then the network as a whole, is realized. The increase in the number of bonded IOBs is the price paid for the increased precision of the PWL approximation. The results are given in Table 4.
Table 1. Simulated results of the LUT based approach

Input range        | Binary address of input range         | Output range           | Binary code of output
-5.5413 to -4.3069 | 1101100010 (866) to 1100010011 (787)  | 0.0039063 to 0.013297  | 00000001 to 00000011
-4.2912 to -4.01   | 1100010010 (786) to 1100000000 (768)  | 0.013503 to 0.017811   | 00000011 to 00000100
-3.9944 to -3.2287 | 1011111111 (767) to 1011001110 (718)  | 0.018086 to 0.038099   | 00000100 to 00001001
-3.2131 to -2.009  | 1011001101 (717) to 1010000000 (640)  | 0.038676 to 0.11816    | 00001001 to 00011110
-1.9943 to 1.9901  | 1001111111 (639) to 0001111111 (127)  | 0.1198 to 0.87976      | 00011110 to 11100001
2.0058 to 3.2089   | 0010000000 (128) to 0011001101 (205)  | 0.8814 to 0.96117      | 11100001 to 11110110
3.2245 to 3.9902   | 0011001110 (206) to 0011111111 (255)  | 0.96175 to 0.98184     | 11110110 to 11111011
4.0058 to 4.2871   | 0100000000 (256) to 0100010010 (274)  | 0.98212 to 0.98644     | 11111011 to 11111100
4.3027 to 5.5371   | 0100010011 (275) to 0101100010 (354)  | 0.98665 to 0.99688     | 11111100 to 11111111
Table 2. Simulated results of the PWL based approach

Region | Input x value | Binary encoding of input          | Output f(x) value | Binary encoding of output
1      | -8.123        | 1100000011111                     | 0                 | 000000000000
2      | -7 to -1.6    | 1011100000000 to 1000110011001    | 0.015625 to 0.1   | 000000000100 to 000000011001
3      | -1.5 to 1.5   | 1000110000000 to 0000110000000    | 0.125 to 0.875    | 000000100000 to 000011100000
4      | 1.6 to 7      | 0000110011001 to 0011100000000    | 0.9 to 0.984375   | 000011100111 to 000011111100
5      | 8.123         | 0100000011111                     | 1                 | 000100000000
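The 12-bit output codes in Table 2 are consistent with quantizing the PWL output into the 0-4-8 FXP format of Table 3 (4 integer and 8 fractional bits, i.e. scaling by 2^8); the following sketch checks that assumption on a few output values taken from Table 2.

    def to_fxp_0_4_8(value):
        # Quantize into the 0-4-8 format: scale by the 2^8 fractional bits, 12-bit code
        code = int(round(value * 2 ** 8))
        return format(code, "012b")

    # (output value, expected code from Table 2)
    checks = [(0.015625, "000000000100"),
              (0.875,    "000011100000"),
              (0.984375, "000011111100"),
              (1.0,      "000100000000")]
    for value, expected in checks:
        print(value, to_fxp_0_4_8(value), to_fxp_0_4_8(value) == expected)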
Table 4. Comparison of synthesis reports for the LUT based approach and the PWL approximation using Xilinx ISE 9.1i

Component / module                 | Used (LUT approach) | Used (PWL approximation) | Available
Number of 4-input LUTs             | 187                 | 108                      | 12288
Number of slices                   | 102                 | 58                       | 6144
Number of bonded IOBs              | 19                  | 26                       | 320
Maximum delay (ns), critical path  | 1.934               | 1.834                    | -
Total gate count for the design    | 1348                | 1029                     | -
Fig. 6. Comparison of actual values and hardware-calculated values

Fig. 7. Comparison of the actual plot and the PWL approximation plot

6. Conclusion

Hardware implementation of an activation function used to realize an ANN is attempted in this paper. A two-fold approach to minimize the hardware, aimed at FPGA realization, is proposed. Firstly, FXP arithmetic is used instead of FLP arithmetic to reduce area, at a slight cost in accuracy. Secondly, the PWL approximation of the activation function with five linear segments is shown to be hardware friendly when compared with the LUT based AF. Experimental results on the XOR and parity functions show that the accuracy of the AF is acceptable. The number of linear segments could be increased when the resolution is to be improved. Complete realization of a neuron, and hence of the network, is left for future work.

ACKNOWLEDGEMENT

The authors wish to thank the management of Amrita Vishwa Vidyapeetham, Coimbatore, Tamil Nadu, India for financial support, the Department of Electronics and Communication for providing the facilities, and the anonymous reviewers for their comments.

REFERENCES

[1] Hiroomi Hikawa, "A Digital Hardware Pulse-Mode Neuron With Piecewise Linear Activation Function," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1028-1037, Sept. 2003.
[2] Nitish D. Patel, Sing Kiong Nguang, and George G. Coghill, "Neural Network Implementation Using Bit Streams," IEEE Trans. Neural Networks, vol. 18, no. 5, Sept. 2007.
[3] S. Li, M. Moussa, and S. Areibi, "Arithmetic formats for implementing artificial neural networks on FPGAs," Can. J. Elect. Comput. Eng., vol. 31, no. 1, pp. 31-40, 2006.
[4] D. Anguita, A. Boni, and S. Ridella, "A digital architecture for support vector machines: Theory, algorithm, and FPGA implementation," IEEE Trans. Neural Netw., vol. 14, no. 5, pp. 993-1009, Sep. 2003.
[5] J. Zhu and P. Sutton, "FPGA implementations of neural networks - a survey of a decade of progress," in Proc. Conf. Field Programm. Logic, Lisbon, Portugal, 2003, pp. 1062-1066.
[6] J. Holt and T. Baker, "Back propagation simulations using limited precision calculations," in Proc. Int. Joint Conf. Neural Netw. (IJCNN-91), Seattle, WA, Jul. 1991, vol. 2, pp. 121-126.
[7] Y. Maeda, A. Nakazawa, and Y. Kanata, "Hardware implementation of a pulse density neural network using simultaneous perturbation learning rule," Analog Integr. Circuits Signal Process., vol. 18, no. 2, pp. 1-10, 1999.
[8] Antony W. Savich, Medhat Moussa, and Shawki Areibi, "The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study," IEEE Trans. Neural Networks, vol. 18, no. 1, Jan. 2007.
[9] Yutaka Maeda and Masatoshi Wakamura, "Simultaneous Perturbation Learning Rule for Recurrent Neural Networks and Its FPGA Implementation," IEEE Trans. Neural Networks, vol. 16, no. 6, Nov. 2005.
[10] J. Holt and T. Baker, "Back propagation simulations using limited precision calculations," in Proc. Int. Joint Conf. Neural Netw. (IJCNN-91), Seattle, WA, Jul. 1991, vol. 2, pp. 121-126.
[11] W. Ligon III, S. McMillan, G. Monn, K. Schoonover, F. Stivers, and K. Underwood, "A re-evaluation of the practicality of floating point operations on FPGAs," in Proc. IEEE Symp. FPGAs Custom Comput. Mach., K. L. Pocek and J. Arnold, Eds., Los Alamitos, CA, 1998, pp. 206-215. [Online]. Available: citeseer.nj.nec.com/95888.html.