A Pipeline Hardware Implementation for an Artificial Neural Network

0 downloads 0 Views 198KB Size Report
Abstract: Artificial Neural Networks are computational devices inspired by the human brain for solving problems. Currently, they are being widely applied for ...
A Pipeline Hardware Implementation for an Artificial Neural Network Denis F. Wolf, Gedson Faria, Roseli A. F. Romero, Eduardo Marques, Marco A. Teixeira, Alexandre A. L. Ribeiro, Leandro C. Fernandes, Jean M. Scatena, Rovilson Mezencio Instituto de Ciências Matemáticas e de Computação - Universidade de São Paulo Av. Trabalhador São-carlense, 400 13560-970 - São Carlos – SP - Brasil {denis, gedson, rafrance, emarques}@icmc.sc.usp.br

Abstract: Artificial Neural Networks are computational devices inspired by the human brain for solving problems. Currently, they are being widely applied for solving problems in several areas such as: robotics, image processing, pattern recognition, etc.... The neural network model, Multilayer Perceptrons, is one of the most used due to its simple learning algorithm. However, its convergence is very slow. To take advantage of the massive parallelism inherent to this model, a hardware parallel implementation should be performed. There are different hardware parallel implementations for this particular model. This paper presents a reconfigurable hardware parallel implementation for Multilayer Perceptrons by using pipelines. Tests realized showed that the use of pipelines speeded up the execution time of the hardware parallel implementation.

1. Introduction The most general-purpose computers are based on the von-Neuman architecture, which is sequential in nature, on the other hand, artificial neural networks profit from massively parallel processing [Schonauer et al. 1998]. The Multilayer Perceptrons (MLPs) model is being widely applied successfully to solve difficult and diverse problems by training them in a supervised manner with a highly popular algorithm known as the error back-propagation algorithm [Haykin 1999]. An interesting method to explore this parallelism is using hardware implementations. The good performance and the natural parallel operation of the hardware devices, turn it as an attractive option to implement neural network algorithms. Today, the reconfigurable computing has been used as a very interesting technique to project and prototype hardware. Reconfigurable computing joins the hardware performance with the software flexibility [Gonçalves et al. 2000]. The hardware implementation of neural network has been already discussed in many articles, such as [Pérez-Uribe and Sanchez 1996], [Demian et al. 1996], and [Molz et al. 2000]. Some principles and perspectives of the digital neurohardware are discussed by Schonauer et al. [Schonauer et al. 1998] with some implementation examples. Suggestions of adaptations to hardware implementation of neural nets are presented by Moerland et al. [Moerland and Fiesler 1997]. According to Morenno [Moreno et al. 1999a][Moreno et al. 1999b], a reconfigurable device was proposed, which can be used to realize the most of arithmetic operations necessary to implement neural networks in hardware.

As it is known, two basic hardware models can be considered to hardware projects: the analog hardware and the digital hardware. The analog hardware can reproduce better the non-linear functions, which are frequently present in the transfer functions of the neurons belonging to a MLP. The digital hardware is not so accurate, but provides the projects with simplicity and good performance. In this work, a pipeline hardware implementation for a MLP by using the digital reconfigurable hardware is proposed. For this, some modifications in the transfer functions of neurons have been realized and the float pointing numbers have been converted to integer numbers. This paper is organized as it follows. In Section 2, a brief description about the FPGA technology and Reconfigurable Computing is presented. Some modifications in the neuron model necessary for improving the performance and adjust the implementation are presented in Section 3. In Section 4, the hardware implementation is discussed based on a simple example and the obtained results are also showed. Finally, in Section 5, a conclusion and future work are presented.

2. Reconfigurable Computing and FPGA technology A Field Programmable Gate Array (FPGA) is an array of logic blocks (configurable cells), enclosed within a single chip. Each one of these cells has a computational capability to implement logic functions and to do a route allowing communication among the cells, all this operations can occur at the same time on entire array of cells [Rose et al. 1993][Brown and Rose 1996]. In the FPGA architecture, the interconnection between the configurable logic blocks takes place through electrically programmable commuter like in the traditional Programmable Logic Devices (PLD). The FPGA matches the gate array's versatility and the PLD's programming capability [Donachy 1996].

Interconnection

I/O Blocks

Logic Blocks Figure 1: Basic Architecture of FPGA

The Figure 1 shows the basic architecture of a FPGA. It consists of a 2-D array of logic blocks. The communication between these blocks is done through interconnection resources. The external array's edges are special blocks that perform input/output operations. The FPGA technology has evolved towards better performance, higher density levels and lower cost price. This fact makes shorter the distance between the FPGA and

the chips implemented directly on silicon, allowing that this technology be used to construct more and more complex architecture. The utilization of FPGA to realize computing lead to a new general class of computer organization called Reconfigurable Computing Architecture [DeHon and Wawrzynek 1999]. This class of architecture provides a highly custom-built machine that can attend to the instantaneous needs of an application. Thus, it is possible to have the application running over a specially developed architecture, bringing more efficiency than general-purpose processors. In other words, to achieve the best performance of an algorithm, it has to be executed in a specific hardware. With this inherent speed and adaptability, the reconfigurable computing can be specially exploited on applications that need high performance like parallel architectures, image processing and real-time applications.

3. Neuron Adaptations A hardware implementation of MLP has been built for solving the classification problem for the iris database set. The iris set is a very popular data set that has been widely used for learning algorithms testing.

Input Set

Results

Figure 2: MLP Topology

The iris set contains a database with 150 flowers, classified into 3 different groups. Each sample (or pattern) has 4 attributes. So, a MLP topology has been considered for the classification of this particular data set constituted by 3 layers being 4 neurons on the first layer, 4 neurons on the hidden layer, and 3 neurons on the output layer (Figure 2). The main purpose of the first layer is just to deliver the input signals for all the neurons of the hidden layer. As the signals are not modified by the first layer neurons (the neurons do not have arithmetical operations), the first layer can be represented by a single set of busses (Figure 3).

Busses Input Set

Results

Figure 3: The network topology without the first layer

1

0,75

Piecewise Linear Function

0,5

Sigmoid Function

0,25

0 -8

-6

-4

-2

0

2

4

6

8

Figure 4: Sigmoid and Piecewise Linear Functions

The model MLP, in general, receives floating pointing numbers as input values. Working with floating point numbers in hardware is a problem because the arithmetic operations are more complex than with integer numbers. Further the dedicated circuits for these operations are more complex, slower, and faster (occupying a larger chip area). A good idea to make easy the project and improve the performance is to convert the float pointing numbers to integer numbers. This implies in some loss of precision but in most of cases is sufficient to achieve good results. Other problem for representing the arithmetic operations, using digital hardware is related with the neuron's transfer functions. Some transfer functions like the sigmoid function (Figure 4) need some modifications to make easy the design of the hardware. In this case, the sigmoid function has been substituted by a piecewise linear function as that shown in Figure 4. The precision is decreased again but in this case good results have been obtained.

4. Hardware Implementation The hardware model of the neural network has been designed and simulated with the Altera Quartus Design Tool [Altera 2000]. To generate a hardware model from software algorithm some simple logic and arithmetic blocks such as: multipliers, adders, divisors and logic gates have been used. Basically, each neuron has four multipliers to multiply each input value by the corresponding weight. The four results of the multipliers are added with the bias and finally, the transfer function delivers the neuron's output.

Figure 5: The Complete Neural Network

The complete diagram of the neural network can be seen in Figure 5. The four pins on the left side of the picture are the four input values. They are connected to the set of busses (first layer); these busses distribute the input signal to the next layer (hidden layer). The results are provided to the output layer and finally the results are showed in the output pin (on the right side of Figure 5). In this particular implementation, the floating-point numbers (input set, bias, weights, and results) have been converted to 16-bit integers. It has allowed that 100% of precision in the classification of the iris database set. The neural network has been trained in software once this hardware implementation does not allow on-chip training. The input signals need 32 ns to be processed by the circuit. In Figure 6, the waveform analysis are showed, and the input and output signals can be seen over the time.

Input Signals at 5ns

Result at 37 ns

Figure 6: Time Analysis

A pipeline has been used to improve the performance. In this case, the neural network has been divided in two stages (Figure 7), and each stage can process a different set of input signals at the same time.

Input Set

Hidden Layer Stage 1

Out Layer Results

Stage 2

Figure 7: The two stages of the network

The hidden layer receives the set of input values, when it finishes the process the results from the hidden layer are sent to the output layer. At the same time, the hidden layer receives a new input set and both layers operate with different values. When both layers finish their task, the hidden layer sends the results to the output layer again and

receives another input set. At the same time, the output layer shows the results and receives the results from the hidden layer. Through the use of this technique, the neural net can work with two different sets of input values and, a better performance has been obtained. The time analysis can be seen in Figure 8.

Input set 1 at 10ns

Input set 2 at 30ns

Result 1 at 52ns

Result 2 at 72ns

Figure 8: Two stages of pipeline analysis

The same process that was applied for the neural network can be applied for each neuron. A neuron can be divided in two stages and can process two different sets of values in each stage. The neuron has been divided as it follows: the first stage is constituted by the set of multipliers, and the second stage is constituted by the adders and the circuit of the transfer function, as it can be seen in Figure 9. Multipliers Input Signal Stage 1

Adder + Transfer Function

Results

Stage 2

Figure 9: A pipeline neuron

Through the use of pipeline neurons in a two stages pipeline network, a four stages pipeline circuit has been obtained. The synchronization of all the circuits with pipeline has been done by the use of clock signals. Using the four stages pipeline network, the circuit can process four different sets of input signals at the same time. The time analysis can be seen in the Figure 10. In the Table 1, all of the hardware and software implementations are compared. A natural performance difference between the hardware and the software models can be

seen. The modifications, such as: the use of integer numbers and the use of piecewise transfer function, present better results even in software implementation. Joining the neuron modifications with a parallel hardware and a pipeline, a very efficient circuit has been obtained.

Input 1 at 10 ns

Input 2 at 20 ns

Result 1 at 62 ns

Result 2 at 72 ns

Figure 10: Four Stages of Pipeline Analysis

5. Conclusion The Neural Networks and the Reconfigurable Computing are two important and promising technologies to scientific researches. When combined one can obtain very interesting results, as it was showed in this paper. There are many applications for these two technologies, such as: robotics, imaging processing and pattern recognizing.

Implementation Software

Transfer Function Sigmoid

Number Representation Floating Point

Software

Piecewise

Floating Point

Software

Piecewise

Integer

Hardware Without pipeline Hardware with a 2 stages pipeline Hardware with a 4 stages pipeline

Piecewise

Integer

Piecewise

Integer

Piecewise

Integer

Machine

Frequency

Latency

Pentium III Pentium III Pentium III Dedicated Circuit Dedicated Circuit Dedicated Circuit

550 MHz

---

Response Time 92 µs

550 MHz

---

74 µs

550 MHz

---

23 µs

---

---

32 ns

100 MHz

42 ns

20 ns

100 MHz

52 ns

10 ns

Table 1: Test Results

With the performance of the reconfigurable hardware, the use of the neural nets can be improved and more powerful applications can be obtained. For example, more precise images can be processed in a shorter period of time, a faster pattern recognition problem can be realized and a faster response time robot can be obtained. As a future work, a MLP hardware implementation for using in gesture recognition will be developed by using the same ideas proposed in this work.

References Schonauer T., Jahnke A., Roth U. Klar H., “Digital Neurohardware: Principles and Perspectives”, In: Proc. Neuronale Netze in der Anwendung - NN'98, Magdeburg, invited paper, pp.101-106, February 1998. Haykin S., Neural Networks A comprehensive Foundation, 2nd edition, Prentice Hall, 1999. Gonçalves R. A., Wolf D. F., Coelho F. A. S., Teixeira M. A., Ribeiro A. A. L., Marques E., “Architect: Um Sistema de Computação Reconfigurável”, In Proc. of the Workshop de Computação Reconfigurável CORE2000, Marília, Brazil , pp.4245, 2000. (In portuguese) Pérez-Uribe A., Sanchez E., “FPGA Implementation of an Adaptable-Size Neural Network”, Proc. of VI International Conference on Artificial Neural Networks ICANN96, pp.29-31, 1996. Demian V., Desprez F., Paugan-Moisy H., Pourzandi M., “Paralel Implementation of RBF Neural Networks”, Ecole Normale Supérieure de Lyon, Research Report No. 96-11, 1996. Molz R. F., Engel P. M., Moraes F. G., Torres L., Robert M., “Estudo da Viabilidade de Implementação de um Sistema de Localização e Reconhecimento de Objetos com uso de RNAs Implementadas em FPGAs”, Workshop de Computação Reconfigurável CORE2000, Marília, Brazil, pp. 226-235, 2000 (in portuguese) Moerland P., Fiesler E., “Neural Network Adaptations to Hardware Implementations”, Handbook of Neural Computation E1.2:1-13 Institute of Physics Publishing and Oxford University Publishing, New York, 1997. Moreno J. M., Cabestany J., Cantó E., Faura J., Insenser J. M., “Dynamically Reconfigurable Strategies for Implementing Artificial Neural Networks Models in Programmable Hardware”, Proceedings of the 6th Conference Mixed Design of Integrated Circuits and Systems (MIXDES'99), pp. 379-384, Kraków, Poland, June 1999. Moreno J. M., Cabestany J., Cantó E., Faura J., Insenser J. M., "The Role of Dynamic Reconfiguration for Implementing Artificial Neural Networks Models in Programmable Hardware”, Engineering Applications of Bio-Inspired Artificial Neural Networks, J. Mira, J.V. Sánchez Andrés (eds.), pp. 85-94, Springer-Verlag, 1999. Rose, J.; Gamal A. E.; Vincentelli, A. S., “Architecture of Field-Programmable Gate Arrays”, Proceedings of the IEEE, vol. 81, no. 7, p.1013-28, 1993.

Brown, S.; Rose, J. “Architecture of FPGAs and CPLDs”, A Tutorial, IEEE Design and Test of Computers, vol. 13, no. 2, pp. 42-57, June 1996. Donachy, P., “Design and Implementation of a High Level Image Processing Machine using Reconfigurable Hardware”, Ph.D. Thesis, Dept. of Computer Science, The Queen’s University of Belfast, 1996. DeHon, A.; Wawrzynek, J., “Reconfigurable Computing: What, Why, and Design Automation Requirements", In Proceedings of the 1999 Design Automation Conference, pp. 610-615, June 1999. Altera Web Site "www.altera.com" visited in January/2000.