FPGA Realization of Activation Function for Artificial Neural Networks

Eighth International Conference on Intelligent Systems Design and Applications

Venakata Saichand 1, Nirmala Devi M. 2, Arumugam S. 3, N. Mohankumar 4
1 P.G. Student, M.Tech VLSI; 2, 4 Faculty in Electronics and Communication Engineering
1, 2, 4 School of Engineering, Amrita Vishwa Vidyapeetham, Tamil Nadu, India
3 Nanda College of Engineering, Tamil Nadu, India
[email protected]


Abstract

Implementation of an Artificial Neural Network (ANN) in hardware is needed to fully exploit its inherent parallelism. The presented work focuses on configuring a Field-Programmable Gate Array (FPGA) to realize the activation function used in an ANN. The computation of the nonlinear activation function (AF) is one of the factors that constrain the area or the computation time. The most popular AF is the log-sigmoid function, which can be realized in digital hardware in several ways: equation approximation, a Lookup Table (LUT) based approach and Piecewise Linear (PWL) approximation, to mention a few. A two-fold approach to optimizing the resource requirement is presented here. First, fixed-point (FXP) computation, which needs minimal hardware, is used instead of floating-point (FLP) computation. Second, the PWL approximation of the AF, with higher precision, is shown to consume less silicon area than the LUT based AF. Experimental results are presented for both realizations.

Keywords: ANN, FPGA, Activation Function, Lookup Table based, Piecewise Linear Approximation

1. Introduction

Artificial neural networks (ANNs) are computational models of the human brain. Applications of ANNs range from function approximation to pattern classification and recognition. A multilayer perceptron (MLP) with one or two hidden layers can solve almost any problem of high complexity. ANNs exhibit a high degree of parallelism when the complete architecture with all interconnections is realized in hardware, but the implementation cost is high. On the other hand, implementing a single neuron or a single layer of neurons and configuring it to act as the other layers or neurons in a time-multiplexed way, using a control unit, offers a low level of parallelism; performance is traded for the conservation of silicon area [1]. Hardware implementation of an ANN to exploit its parallelism can follow analog, digital or mixed-signal design techniques. Mature and flexible digital designs in Very Large Scale Integration (VLSI) can be implemented on Application Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs) [2]. ASIC designs have drawbacks such as the ability to run only a specific algorithm and limitations on the size of the network. Hence the FPGA offers a suitable alternative that combines flexibility with appreciable performance [3]. It maintains the high processing density needed to exploit the parallel computation in an ANN [4]. Every digital module is concurrently instantiated on the FPGA and hence operates in parallel; thus the speed of the network is independent of its complexity [2]. When implementing an ANN on an FPGA, certain measures must be taken to minimize the hardware. The two-fold approach presented here aims to meet this objective. The first step is to represent the data in fixed-point (FXP) rather than floating-point (FLP) format, which reduces the circuit complexity significantly at a slight cost in accuracy. The second step is to use a Piecewise Linear (PWL) activation function (AF) rather than a Look-Up Table (LUT) based activation function, based on the observation that the realization of the activation function plays a major role in deciding the resource requirement. Sections 2 and 3 present the background details and a review of related implementations, respectively. Section 4 deals with the architectural design. Sections 5 and 6 discuss the experimental results and the conclusion, respectively.

2. Design concepts

Computation in a neuron of the network involves multiplication, addition and an activation function that decides the output, where f(yj) is the activation function, as shown in Figure 1. The approach proposed in this paper attempts to optimize the silicon area required to realize the activation function block. Instead of representing signals as floating-point data, the computation uses fixed-point arithmetic to reduce the hardware further. The experimental results clearly show that the PWL activation function is hardware friendly and hence could be used to realize a single neuron and then the complete network.


Fig. 1. Neuron of an ANN: x1 to xn are the inputs, w1 to wn the corresponding weights, yj = Σ xi·wi, and Oj = f(yj) is the output.
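For reference, the computation of Fig. 1 in software form (an illustrative Python sketch, not part of the paper's hardware design; the function and variable names are ours):

```python
import math

def log_sigmoid(y):
    """Log-sigmoid activation, f(y) = 1 / (1 + e^-y) -- equation (2) below."""
    return 1.0 / (1.0 + math.exp(-y))

def neuron(inputs, weights):
    """Single neuron of Fig. 1: yj = sum(xi * wi), Oj = f(yj)."""
    yj = sum(x * w for x, w in zip(inputs, weights))
    return log_sigmoid(yj)

# Example with three illustrative inputs and weights.
print(neuron([0.5, -1.0, 0.25], [0.8, 0.3, -0.5]))
```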


2.1 Data Precision and Representation

Implementing an ANN on an FPGA poses a number of challenges [5]. Arithmetic representation is a major concern; it may range from 16-bit, 32-bit or 64-bit floating-point (FLP) formats to fixed-point (FXP) formats. The former is area hungry but offers improved precision, whereas the latter needs minimal resources at the expense of precision [6]. Hence an area versus precision trade-off decides the data format. The presented work uses FXP formats. An FXP word has two parts, an integer part and a fractional part, separated by the radix point. The integer part is represented by bits b(i-1) down to b3 and the fractional part extends from b2 down to b0, as shown in Figure 2.

Fig. 2. FXP format: integer part b(i-1) ... b3, fractional part b2 b1 b0.

The base B of a fixed-point number is 2 for binary data, and its decimal equivalent is obtained by the conversion of equation (1), with negative numbers represented in 2's complement:

b(i-1)·B^(i-4) + b(i-2)·B^(i-5) + ... + b4·B^1 + b3·B^0 + b2·B^(-1) + b1·B^(-2) + b0·B^(-3)    (1)

The minimum allowable precision and range must be chosen so as to conserve silicon area without affecting the performance of the ANN. For example, a 5-bit word with a 2-bit integer part and a 3-bit fractional part represents a smallest value of (00.001)2 = 0.125 and a largest value of (11.111)2 = 3.875 in decimal. The precision of fixed-point data is independent of the number itself but depends on the position of the radix point. Direct implementation of arithmetic modules in hardware generally alters the operation because of the limited precision; hence the implemented results deviate slightly from the software simulation results, as shown in Fig. 6 [1].
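The conversion of equation (1) and the 5-bit example above can be checked with a small sketch (illustrative Python, not from the paper; an unsigned format is assumed here, whereas equation (1) also covers negative numbers in 2's complement):

```python
def to_fxp(value, int_bits=2, frac_bits=3):
    """Quantize a non-negative real value to the nearest code of an unsigned
    FXP format with int_bits integer and frac_bits fractional bits."""
    scale = 1 << frac_bits                        # LSB weight is 2**-frac_bits
    code = round(value * scale)
    max_code = (1 << (int_bits + frac_bits)) - 1  # largest representable code
    return max(0, min(max_code, code))

def from_fxp(code, frac_bits=3):
    """Decimal equivalent of an FXP code: sum of b_k * 2**(k - frac_bits)."""
    return code / (1 << frac_bits)

# The 5-bit example from the text: (00.001)2 = 0.125 and (11.111)2 = 3.875.
print(from_fxp(0b00001), from_fxp(0b11111))
print(to_fxp(3.9))   # saturates at the largest code, 0b11111 = 31
```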

2.2 Activation Function of ANN

The main operations of an ANN are multiplication, summation and squashing, as shown in Figure 3. The squashing function, also known as the threshold or activation function, may range from a simple step function to a sigmoid. It restricts the weighted sum to the range (0, 1), which then forms the neuron output or the input to the next-layer neurons. A monotonically increasing, differentiable, non-linear sigmoid is reported to be a better activation function for an ANN. Either the log-sigmoid function

f(x) = 1 / (1 + e^(-x))    (2)

or the tan-sigmoid function

f(x) = 2 / (1 + e^(-2x)) - 1    (3)

can be used. The former is chosen here, where x = Σ xi·wi. A software implementation of these functions is straightforward using built-in routines, but a hardware design targeting an FPGA must consider that arithmetic modules for x^y or exp(x) are difficult to synthesize with matched performance, and that a divider is far from being area and speed efficient. Hence a common approach is to implement the AF using an LUT [5] or a PWL approximation. Both implementations are tried here and the results are presented for comparison.
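The direct software evaluation mentioned above is immediate (illustrative Python; note that the tan-sigmoid of equation (3) is mathematically identical to tanh(x)):

```python
import math

def log_sigmoid(x):   # equation (2)
    return 1.0 / (1.0 + math.exp(-x))

def tan_sigmoid(x):   # equation (3), equal to math.tanh(x)
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

x = 0.75
print(log_sigmoid(x), tan_sigmoid(x), math.tanh(x))
```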


Fig. 3. Modules of a neuron in an ANN: input registers (x, w), multiplier (x·w), adder (Σ x·w) and activation function f(yj) producing the output.


2.2.1 LUT based approach

An LUT can be used to implement the sigmoid activation function by storing discrete values, but it consumes a large storage area for even moderately high precision. If the input range needs 21 bits and the output precision needs 16 bits, a 2^21 x 16 bit = 4 MB LUT is required, which costs a large area and access time and may affect the speed of computation. If a Taylor series is chosen to represent the exponential function instead, the computation involves more multipliers and thus again increases the area. The LUT design does not optimize well in FLP format, and hence the fixed-point format is used; even then, an on-chip realization of the log-sigmoid function increases the size of the hardware considerably. To optimize the area to some extent, the built-in RAM available in FPGAs is used to realize the LUT based activation function, which reduces the area and improves the speed. If the precision is to be improved further, the hardware-friendly PWL approximation of the activation function is to be used.
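A minimal sketch of generating such a sigmoid LUT offline (illustrative Python, not from the paper; the 10-bit-address, 8-bit-output configuration of Section 4.1 and a simple uniform input grid are assumed, whereas the actual design addresses the RAM with the FXP code of x):

```python
import math

ADDR_BITS = 10   # 2**10 = 1024 LUT locations (configuration of Section 4.1)
OUT_BITS = 8     # 8-bit output codes

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_lut(x_min, step):
    """Quantize f(x) over a uniform input grid into OUT_BITS-bit codes,
    confined to (0, 1), i.e. codes 1 .. 2**OUT_BITS - 1."""
    lut = []
    for addr in range(1 << ADDR_BITS):
        code = round(sigmoid(x_min + addr * step) * (1 << OUT_BITS))
        lut.append(max(1, min(code, (1 << OUT_BITS) - 1)))
    return lut

lut = build_lut(x_min=-5.5413, step=0.015625)   # limits quoted in Section 4.1
print(len(lut) * OUT_BITS, "bits of storage")   # 1024 x 8 = 8192 bits
```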

2.2.2 PWL approximation approach

The log-sigmoid activation function of equation (2) is realized by piecewise linear (PWL) approximation. The activation function is approximated with five linear segments [6], called pieces, whose representation is shown in equation (4). The number of pieces required in the PWL approximation depends upon the complexity of the problem to be solved [5]. Using powers-of-two coefficients to represent the slopes of the linear segments allows the multiplications to be replaced by right-shift operations [5], which reduces the hardware. The PWL approximation is compared with the actual plot in Fig. 7.
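The segment definition of equation (4) is not reproduced here; the following sketch (illustrative Python) uses five segments with powers-of-two slopes whose breakpoints and constants are inferred from the simulated values in Table 2, so the exact coefficients should be read as an assumption rather than as the paper's equation (4):

```python
def pwl_sigmoid(x):
    """Five-segment PWL approximation of the log-sigmoid.

    Segments inferred from Table 2: saturation for |x| >= 8, an outer slope
    of 1/64 and an inner slope of 1/4, so every multiplication maps to a
    right shift in hardware.
    """
    if x <= -8.0:
        return 0.0
    if x < -1.6:
        return x / 64.0 + 0.125      # shift right by 6, add constant
    if x < 1.6:
        return x / 4.0 + 0.5         # shift right by 2, add 0.5
    if x < 8.0:
        return x / 64.0 + 0.875
    return 1.0

# Spot checks against Table 2 region boundaries:
print(pwl_sigmoid(-7.0))    # 0.015625
print(pwl_sigmoid(1.5))     # 0.875
print(pwl_sigmoid(8.123))   # 1.0
```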

3. Review of AF realization

Antony W. Savich et al. [8] have shown that the resource requirement of networks implemented with FXP arithmetic is appreciably lower, by about a factor of two, than that of FLP arithmetic of similar precision and range, and that FXP computation also converges in fewer clock cycles. For normalized inputs and outputs in the range [0, 1] and a log-sigmoid transfer function, Holt and Baker [6] showed that a 16-bit FXP (1-3-12) representation is the minimum allowable precision. Ligon III et al. [11] demonstrated the density advantage of 32-bit FXP computation over FLP computation for Xilinx 4020E FPGAs. The activation function is easy to implement in software but costly in hardware, whether realized with an LUT or by equation approximation; hence the PWL approximation is preferred.

4. Realization of the Activation Function

4.1 LUT based AF

The activation function of equation (2) is realized by representing the input x with 10 bits and the output f(x) with 8 bits; therefore 2^10 x 8 = 1024 x 8 = 8192 bits are required for LUT storage. The 10-bit input x has the FXP format shown in Figure 4 and serves as the address of the memory location, while the output f(x) is represented as an 8-bit fractional part only and forms the content of the corresponding memory location.



Fig. 4. FXP format of the 10-bit input x: bit 9 is the sign bit, bits 8 to 6 form the integer part and bits 5 to 0 the fractional part.

The smallest possible output value that can be represented is 2^-n, where n is the number of output bits. This value must stand in for zero, since f(x) → 0 only as x → -∞, which never occurs; hence an error of 2^-n is introduced. Similarly, the largest possible value of f(x) is 1 - 2^-n, which keeps the output within (0, 1). Therefore the minimum value of the AF is

f(x1) = 1 / (1 + e^(-x1)) = 2^-n    (5)

which decides the lower limit of the input,

x1 = -ln(2^n - 1)    (6)

Similarly, the maximum value of the AF is

f(x2) = 1 / (1 + e^(-x2)) = 1 - 2^-n    (7)

which decides the upper limit of the input,

x2 = ln(2^n - 1)    (8)

Hence the step size, or increment of the input, is

Δx = ln[(0.5 + 2^-n) / (0.5 - 2^-n)]    (9)

and the minimum number of LUT entries is

LUTmin = (x2 - x1) / Δx    (10)

For the x1 and x2 of equations (6) and (8), LUTmin is found to be 710 for 8-bit precision. Hence a 1K RAM with 10 address lines is chosen, giving 1024 locations with step size Δx = 0.015625. For the 10-bit input x, the minimum value is x1 = -5.5413 and the maximum value is x2 = 5.5371. Of the 1K output RAM, 710 locations are used, leaving 314 locations unused. This wastage of memory is a drawback, which the second method (PWL) overcomes. Simulated results are listed in Table 1.
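The figures quoted above follow from equations (5)-(10); a quick numerical check for n = 8 (illustrative Python):

```python
import math

n = 8                                           # output precision in bits
x1 = -math.log(2**n - 1)                        # eq. (6): lower input limit
x2 = math.log(2**n - 1)                         # eq. (8): upper input limit
dx = math.log((0.5 + 2**-n) / (0.5 - 2**-n))    # eq. (9): step size
lut_min = (x2 - x1) / dx                        # eq. (10)

print(round(x1, 4), round(x2, 4))   # about -5.5413 and 5.5413 (before quantization)
print(round(dx, 6))                 # about 0.015625
print(math.ceil(lut_min))           # 710
```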

4.2 PWL approximation of AF

The PWL approximation of f(x) using the five segments of equation (4) is realized. The implementation needs only comparison, subtraction, addition and shifting operations, so the design consumes a small area with appreciable performance. The computational modules of the activation function are shown in Figure 5.

Fig. 5. Computational modules of the activation function.

The individual sub-blocks are realized using FXP formats as follows: 8-n: inverter module; /n: constant bit shifts; Mod n: subtraction; +0.5: constant adder. Simulated results are tabulated in Table 2.
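A fixed-point flavour of the same computation, using only comparisons, shifts and additions, is sketched below (illustrative Python, not the paper's hardware description; 8 fractional bits are assumed for both input and output, and the segment constants are the ones inferred from Table 2):

```python
FRAC = 8                      # fractional bits of input and output (assumed)

def pwl_sigmoid_fxp(q):
    """PWL log-sigmoid on a fixed-point input q = round(x * 2**FRAC).

    Only comparisons, shifts and additions are used, mirroring the
    sub-blocks of Fig. 5; the segment constants are inferred from Table 2,
    not taken from the paper's equation (4).
    """
    one = 1 << FRAC           # 1.0
    half = one >> 1           # 0.5
    if q <= -(8 << FRAC):
        return 0
    if q < int(-1.6 * one):
        return (q >> 6) + (one >> 3)        # q/64 + 0.125
    if q < int(1.6 * one):
        return (q >> 2) + half              # q/4  + 0.5
    if q < (8 << FRAC):
        return (q >> 6) + one - (one >> 3)  # q/64 + 0.875
    return one

# x = -7.0 -> q = -1792; expected output code 4 (= 0.015625, as in Table 2).
print(pwl_sigmoid_fxp(-7 * (1 << FRAC)))
```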

5. Experimental Results

Synthesis of the circuits is performed using Xilinx ISE 9.1i, targeting a Virtex-4 FPGA, device XC4VLX15-12FF668. The data precision followed in the two approaches is shown in Table 3.

Table 3. Data precision of the LUT based and PWL AF

Data   | LUT based AF: FXP format | No. of bits | PWL approximation AF: FXP format | No. of bits
Input  | 1-3-6                    | 10          | 1-4-8                            | 13
Output | 0-0-8                    | 8           | 0-4-8                            | 12

It is observed that around 40% of the LUT utilization and 20% of the gate count are saved by the PWL realization of the AF. The silicon area conserved on a single activation function confirms that a significant area would be saved once the neuron, and then the whole network, is realized. The increase in the number of bonded IOBs is the price paid for the increased precision of the PWL approximation. The synthesis results are given in Table 4.
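The quoted savings can be cross-checked against the entries of Table 4 (illustrative Python):

```python
lut_saving = (187 - 108) / 187       # 4-input LUTs, Table 4
gate_saving = (1348 - 1029) / 1348   # total gate count, Table 4
print(f"{lut_saving:.0%} {gate_saving:.0%}")   # approx. 42% and 24%
```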

Table 1. Simulated results of the LUT based approach

Input range         | Binary address range (decimal)       | Output range          | Binary output code range
-5.5413 to -4.3069  | 1101100010 (866) to 1100010011 (787) | 0.0039063 to 0.013297 | 00000001 to 00000011
-4.2912 to -4.01    | 1100010010 (786) to 1100000000 (768) | 0.013503 to 0.017811  | 00000011 to 00000100
-3.9944 to -3.2287  | 1011111111 (767) to 1011001110 (718) | 0.018086 to 0.038099  | 00000100 to 00001001
-3.2131 to -2.009   | 1011001101 (717) to 1010000000 (640) | 0.038676 to 0.11816   | 00001001 to 00011110
-1.9943 to 1.9901   | 1001111111 (639) to 0001111111 (127) | 0.1198 to 0.87976     | 00011110 to 11100001
2.0058 to 3.2089    | 0010000000 (128) to 0011001101 (205) | 0.8814 to 0.96117     | 11100001 to 11110110
3.2245 to 3.9902    | 0011001110 (206) to 0011111111 (255) | 0.96175 to 0.98184    | 11110110 to 11111011
4.0058 to 4.2871    | 0100000000 (256) to 0100010010 (274) | 0.98212 to 0.98644    | 11111011 to 11111100
4.3027 to 5.5371    | 0100010011 (275) to 0101100010 (354) | 0.98665 to 0.99688    | 11111100 to 11111111

Table 2. Simulated results of the PWL based approach

Region | Input x value | Binary encoding of input       | Output f(x) value | Binary encoding of output
1      | -8.123        | 1100000011111                  | 0                 | 000000000000
2      | -7 to -1.6    | 1011100000000 to 1000110011001 | 0.015625 to 0.1   | 000000000100 to 000000011001
3      | -1.5 to 1.5   | 1000110000000 to 0000110000000 | 0.125 to 0.875    | 000000100000 to 000011100000
4      | 1.6 to 7      | 0000110011001 to 0011100000000 | 0.9 to 0.984375   | 000011100111 to 000011111100
5      | 8.123         | 0100000011111                  | 1                 | 000100000000

Table 4. Comparison of synthesis reports for the LUT based approach and the PWL approximation using Xilinx ISE 9.1i

Components/modules                | Used (LUT approach) | Used (PWL approximation) | Available
Number of 4-input LUTs            | 187                 | 108                      | 12288
Number of slices                  | 102                 | 58                       | 6144
Number of bonded IOBs             | 19                  | 26                       | 320
Maximum delay (ns), critical path | 1.934               | 1.834                    | -
Total gate count for design       | 1348                | 1029                     | -

Fig. 6. Comparison of actual values and hardware-calculated values.

Fig. 7. Comparison of the actual plot and the PWL approximation plot.

6. Conclusion

Hardware implementation of the activation function used to realize an ANN is addressed in this paper, with a two-fold approach to minimizing the hardware for FPGA realization. First, FXP arithmetic is used instead of FLP arithmetic to reduce the area, at a slight cost in accuracy. Second, the PWL approximation of the activation function with five linear segments is shown to be more hardware friendly than the LUT based AF. Experimental results on the XOR and parity functions show that the accuracy of the AF is acceptable; the number of linear segments can be increased when the resolution is to be improved. Complete realization of the neuron, and hence of the network, is left for future work.

ACKNOWLEDGEMENT

The authors wish to thank the management of Amrita Vishwa Vidyapeetham, Coimbatore, Tamil Nadu, India for financial support, the Department of Electronics and Communication for providing the facilities, and the anonymous reviewers for their comments.

REFERENCES

[1] H. Hikawa, "A Digital Hardware Pulse-Mode Neuron With Piecewise Linear Activation Function," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1028–1037, Sept. 2003.
[2] N. D. Patel, S. K. Nguang, and G. G. Coghill, "Neural Network Implementation Using Bit Streams," IEEE Trans. Neural Networks, vol. 18, no. 5, Sept. 2007.
[3] S. Li, M. Moussa, and S. Areibi, "Arithmetic formats for implementing artificial neural networks on FPGAs," Can. J. Elect. Comput. Eng., vol. 31, no. 1, pp. 31–40, 2006.
[4] D. Anguita, A. Boni, and S. Ridella, "A digital architecture for support vector machines: Theory, algorithm, and FPGA implementation," IEEE Trans. Neural Netw., vol. 14, no. 5, pp. 993–1009, Sep. 2003.
[5] J. Zhu and P. Sutton, "FPGA implementations of neural networks—a survey of a decade of progress," in Proc. Conf. Field Programm. Logic, Lisbon, Portugal, 2003, pp. 1062–1066.
[6] J. Holt and T. Baker, "Back propagation simulations using limited precision calculations," in Proc. Int. Joint Conf. Neural Netw. (IJCNN-91), Seattle, WA, Jul. 1991, vol. 2, pp. 121–126.
[7] Y. Maeda, A. Nakazawa, and Y. Kanata, "Hardware implementation of a pulse density neural network using simultaneous perturbation learning rule," Analog Integr. Circuits Signal Process., vol. 18, no. 2, pp. 1–10, 1999.
[8] A. W. Savich, M. Moussa, and S. Areibi, "The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study," IEEE Trans. Neural Networks, vol. 18, no. 1, Jan. 2007.
[9] Y. Maeda and M. Wakamura, "Simultaneous Perturbation Learning Rule for Recurrent Neural Networks and Its FPGA Implementation," IEEE Trans. Neural Networks, vol. 16, no. 6, Nov. 2005.
[10] J. Holt and T. Baker, "Back propagation simulations using limited precision calculations," in Proc. Int. Joint Conf. Neural Netw. (IJCNN-91), Seattle, WA, Jul. 1991, vol. 2, pp. 121–126.
[11] W. Ligon III, S. McMillan, G. Monn, K. Schoonover, F. Stivers, and K. Underwood, "A re-evaluation of the practicality of floating point operations on FPGAs," in Proc. IEEE Symp. FPGAs Custom Comput. Mach., K. L. Pocek and J. Arnold, Eds., Los Alamitos, CA, 1998, pp. 206–215. [Online]. Available: citeseer.nj.nec.com/95888.html
