2012 Brazilian Symposium on Neural Networks
Hardware/Software Co-design Implementation of On-Chip Backpropagation

Mauricio A. Dias, Fernando S. Osorio, Denis Wolf
Mobile Robotics Lab (LRM), Institute of Mathematical Science and Computation (ICMC), University of Sao Paulo (USP), Sao Carlos - SP - Brazil
macdias, fosorio, [email protected]
Abstract—Artificial neural networks are a parallel, fault-tolerant, robust solution for computational tasks such as associative memories, pattern recognition and function approximation. Many implementations of artificial neural networks and of their learning algorithms have been proposed, both in hardware and in software. Hardware implementation of learning algorithms is a computational challenge because constraints such as the maximum number of neurons and layers, training time, precision and data representation are difficult to optimize together. This paper describes a hardware/software co-design implementation of the error-backpropagation algorithm for multi-layer perceptron networks. Different types of processors, with different hardware features and goals, were created, and the results were analyzed considering the mentioned constraints. The resulting hardware/software co-design allows a large number of neurons and layers and maintains the initial precision without restrictions on data representation. Platform limitations resulted in high execution times, but solutions to this problem are also proposed. The developed hardware thus proved to be a good alternative considering current hardware implementations of training algorithms and the mentioned requirements.
Keywords-artificial neural networks; hardware; backpropagation; hardware/software co-design

I. INTRODUCTION

Artificial Neural Networks (ANNs) are structures that compute in a complex, nonlinear and parallel way. Networks have the capability to organize neurons and execute computations faster than conventional computers [1]. These capabilities, together with the generalization that is a natural feature of ANNs, are responsible for their success and wide use in solving a large number of computational and non-computational problems. ANNs need to learn about the problem to be solved, and this knowledge is represented by the network's connection weights. Learning methods can be divided basically into two types: (i) supervised learning, in which a teacher corrects the network's weights based on the errors of the neurons' outputs, and (ii) unsupervised learning, which changes the weights according to a rule without the error-correction procedure. One of the most important supervised learning algorithms is Error-Backpropagation (also known simply as backpropagation).

1522-4899/12 $26.00 © 2012 IEEE DOI 10.1109/SBRN.2012.9

The development of this algorithm allowed perceptron neurons [2] to be organized in layers and to learn about and solve problems more complex than linear ones. This fact brought ANNs back to the top of computational research topics in the late 80's. Hardware implementation of a neural network training algorithm is a design challenge: the number of numerical operations executed many times is very high, and the hardware structure needed to parallelize all these operations is very complex. Several algorithm modifications and hardware structures have been proposed by researchers. Algorithm modifications such as those proposed in [3][4] are problematic because the precision loss in the backpropagation algorithm delays training (to achieve the same expected error) and also considerably worsens the network outputs. Other works propose hardware structures that incorporate training inside the neuron structures, or give the system a control structure for training. Neurons designed for a unique training algorithm, as proposed in [5], inevitably occupy hardware area that is not entirely used after the training step. Hardware controllers [6][7] and pipelining [8] are good hardware design alternatives, but these structures are limited in scalability (maximum number of neurons). Out-of-pattern hardware modifications (using a different number representation, for example [9]), in spite of solving some problems, generate nonstandard hardware that is difficult to integrate with other systems, in addition to the precision loss. The structure proposed by Nedjah et al. [9] uses an approximation of the sigmoid function that also causes precision loss. Moreover, the results in [9] came from simulation: synthesis was used only to estimate hardware resources, not to execute the algorithms, and the execution times were calculated by multiplying clock cycles by the duration of one clock cycle. This "execution time" should be measured in real hardware executions, and the results would probably be different.

Hardware development for problems such as the backpropagation training algorithm has some advantages and some critical problems. Hardware structures are faster, but if the network needs to become larger, with more neurons, or smaller, with fewer connections, the hardware structure fails if it is not designed correctly. Furthermore, a very large hardware structure is expensive and cannot accept higher clock frequencies, because the clock signal is not able to propagate through the whole circuit before the next clock pulse. These problems can be solved by adopting a different development method called hardware/software co-design. Hardware/software co-design is a hardware development method that concurrently designs, develops, tests and simulates a system's hardware and software components [10]. A hybrid system with hardware and software components can better solve the scalability problem (number of neurons and connections) and is able to generate a smaller hardware that achieves higher clock frequencies, lower costs and better power consumption. This method needs a development tool that allows fast prototyping and testing. To achieve these features, the hardware development tool used in this work was a Field Programmable Gate Array (FPGA). FPGAs are reconfigurable hardware [11] that can be configured, reconfigured and physically tested before real hardware production. In this case the FPGA manufacturer provides a soft-processor (a processor that can be fully configured in a reconfigurable hardware [12]) that is able to run system software and provides communication between the software and the developed hardware. Based on these features, this work used a hardware/software co-design method to implement the backpropagation algorithm. This method solves many problems present in previous hardware implementations and allows further improvements and modifications, considering not only the method but also the development tool. Previous works also ignored this development method, which is widely used in hardware development nowadays [10].

The main contributions of this work are the development of a hardware/software co-design for the backpropagation algorithm, the analysis of the behavior of the NIOS II soft-processor with different problem-specific hardware optimizations, and an analysis of the efficiency of the hardware/software co-design method for this problem.

Considering a set of inputs x_1, x_2, x_3, ..., x_n, a set of weights w_1, w_2, w_3, ..., w_n related to the inputs, the bias b, and the activation function ϕ, the equation that describes perceptron behavior is:
v = Σ_{i=1}^{n} w_i x_i + b    (1)
After the forward step, the output of the network is used to update the weights of the output layer, starting the second step, also known as the backward step. Considering that w_{ji} is the weight from neuron i to neuron j, η the learning rate, and y_1, y_2, y_3, ..., y_m the output vector, where m is the number of neurons, the update equation (2) for the output layer is:

w_{ji}(n + 1) = w_{ji}(n) + η δ_j(n) y_i(n)    (2)

considering δ_j(n):

δ_j(n) = e_j(n) ϕ'_j(v_j(n))    (3)
where e_j(n) represents the output error compared to the desired output. The hidden layers' update equation is similar to (2), but δ_j(n) is given by equation (4):

δ_j(n) = ϕ'_j(v_j(n)) Σ_k δ_k(n) w_{kj}(n)    (4)
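The backward-step rules (2)-(4) can be sketched in C as follows. This is an illustrative fragment with hypothetical helper names (not the authors' implementation), assuming a logistic sigmoid activation, whose derivative can be computed directly from the neuron output y as ϕ'(v) = y(1 - y):

```c
/* Derivative of the logistic sigmoid, expressed through its output y = phi(v). */
static double sigmoid_deriv_from_output(double y) { return y * (1.0 - y); }

/* Equation (3): local gradient for an output neuron j,
 * given its error e_j(n) and its output y_j. */
double delta_output(double error_j, double y_j) {
    return error_j * sigmoid_deriv_from_output(y_j);
}

/* Equation (4): local gradient for a hidden neuron j, summing the
 * back-propagated gradients over the k neurons of the next layer. */
double delta_hidden(double y_j, const double *delta_next,
                    const double *w_next_j, int k_count) {
    double s = 0.0;
    for (int k = 0; k < k_count; k++)
        s += delta_next[k] * w_next_j[k];   /* sum_k delta_k(n) * w_kj(n) */
    return sigmoid_deriv_from_output(y_j) * s;
}

/* Equation (2): in-place update w_ji(n+1) = w_ji(n) + eta*delta_j(n)*y_i(n). */
void update_weight(double *w_ji, double eta, double delta_j, double y_i) {
    *w_ji += eta * delta_j * y_i;
}
```

Applying these three helpers layer by layer, from the output layer back to the first hidden layer, performs one backward step; one forward plus one backward step over the training set makes up an epoch.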
Equation (4) represents the error being propagated backwards to the hidden layers' weights. This rule, applied repeatedly to the network, represents the learning process. There are many possible stopping criteria for the training [1], but in this work, to validate the execution time measurements and to allow comparisons between them, the adopted stopping criterion is the number of epochs. Each epoch consists of one forward and one backward step. To implement a hardware/software co-design for the backpropagation algorithm, a profile-based method was used, as described in section II.B.

B. Profile-Based HW/SW Co-design Method
II. MATERIALS AND METHODS
Hardware/software co-design can be considered a new area in hardware design due to its relatively recent creation at the beginning of the 90's [16]. Many development methods have been proposed since then, but all of them follow the same principle: hardware and software design have to interact in some way during the design process. The first design methods adopted a bottom-up approach, starting with hardware development and later moving some functionalities to software. The problem with this approach is that hardware development costs a lot of design time. The second attempt was to develop the software solution and only afterwards implement some functionalities in hardware. The second approach is better in many ways, but designing the software without considering
A. Backpropagation

Backpropagation was the training algorithm chosen to be implemented as a hardware/software co-design. This choice is based on the algorithm's wide use in solving a huge number of computational and non-computational problems [1]. Adding to the applicability of backpropagation-trained multi-layer perceptron (MLP) networks, an MLP with two hidden layers can approximate any function [13]. The backpropagation training algorithm is based on a two-step procedure [14][15]. The first step is known as the forward step: the signal is propagated forward through the multi-layer perceptron network following the perceptron equation (1).
Figure 1. HW/SW Co-Design Method.
the hardware is also a problem, because it makes the hardware implementation more difficult. To solve these design problems, hardware/software co-design methods evolved into variations of the basic design method described in figure 1. The most important feature of this basic method is that hardware and software are developed and tested concurrently. The basic design method solves some design problems, but not all of them. One of the most important steps of co-design development is hardware/software partitioning. Deciding which functionality of the system is going to be implemented in hardware and which in software impacts the final design directly, which is why it is considered one of the most important steps in hw/sw co-design.
Figure 2. Profile-Based HW/SW Co-Design Method.
Usually designers implement in hardware the most critical functions of the initially designed software, because hardware development is more expensive in manufacturing and development time. This approach is very interesting, but a tool is needed to show exactly which are the real critical functions of the software according to some metric, such as execution time. Designers may have an idea of the critical functions of an algorithm, but sometimes simple functions can also be critical and have more hardware-friendly implementation features. Software profilers [17] are tools that measure software execution features and generate logs that can be analyzed in order to find the critical functions of a software system. These tools allowed modifications to the basic method (figure 1), incorporating profiling and the top-down design approach so that hardware/software partitioning is done intrinsically. In this work a modified profile-based method [18][19][20] is used, described by the diagram in figure 2. The main modification compared to traditional profile-based methods is that hardware development starts only after all software optimizations have been applied to the designed software. In this work the main requirement is a considerably fast execution time without compromising the desired features discussed in section I.

C. Development Tools

The chosen profile-based hardware/software co-design method needs a lot of hardware implementation tests and validations before the final system implementation is achieved. Considering the high cost of hardware manufacturing and the number of prototypes needed, the hardware implementation tool chosen for this work is the Altera Cyclone II FPGA1. A Field Programmable Gate Array is a reconfigurable hardware [11] that can be configured to execute the hardware design as many times as needed, so all measurements and experiments can be evaluated without higher costs. Considering reconfigurable hardware features, any FPGA could have been chosen for this work. A development tool for hardware/software co-design should make prototyping, testing and validation easy, and the Altera Cyclone II FPGA fulfills all the requirements. This tool should also be able to test and validate the full system behavior; that means the hardware and the software parts of the system have to be executed together. Altera provides a soft-processor called NIOS II2 that has hardware and software development tools for configuration and use. The NIOS II soft-processor can have hardware added to its structure as custom processor instructions or co-processors (figure 3). This fact allows designing with the chosen method because: (i) the NIOS II soft-processor executes the developed software; (ii) the NIOS II soft-processor executes C code compiled by a modified version of the Gnu C Compiler (GCC), and this version of GCC also has a Gnu Profiler tool that allows software profiling and critical function analysis; (iii) after the identification of the critical functions, they can be implemented in hardware and added to the NIOS II soft-processor as custom instructions or co-processors; (iv) the entire system can be tested and validated during development. Also, the final system hardware can be synthesized with an option called "Hard Copy" that generates a mask ready for manufacturing. The presented features (FPGA hardware; IDEs for hardware development, configuration and programming; and the NIOS II soft-processor), together with the researchers' experience with all the mentioned tools, are the reasons for choosing the Altera Cyclone II FPGA as the development tool for this work.

1 http://www.altera.com/products/devices/cyclone2/cy2-index.jsp
2 http://www.altera.com/devices/processor/nios2/ni2-index.html

Figure 3. Final System Representation

III. RESULTS AND ANALYSIS

The backpropagation algorithm implementation follows every step described in [1]. The precision and convergence of the algorithm were tested with classic problems such as XOR and the Iris (flower) dataset3. To obtain execution times and co-design implementation features, a few executions of networks of any size are necessary using the previously tested algorithm. Following the proposed method, three different types of NIOS II soft-processor were tested: (i) economic (e), with no hardware acceleration and a 4-stage pipeline; (ii) standard (s), with hardware acceleration for basic mathematical operations and a 5-stage pipeline; and (iii) fast (f), with many hardware accelerators for basic operations and a 6-stage pipeline. None of these processors comes with a floating-point hardware accelerator. After executing the algorithm on these processors, the profile of the code was analyzed. The results (figure 4) showed that more than 50% of the execution time is due to floating-point operations (muldf3, pack_d, fpadd_parts, muldi3, unpack_d, divdf3). Considering this fact, a floating-point unit was added to the soft-processor, initially without hardware division (xf processors) and then with hardware division (xfd). Further experiments were executed, but the execution time remained high and the number of neurons remained insufficient. The alternative was to include a Phase-Locked Loop (PLL) to increase the clock frequency, together with 80MB of RAM memory, in the fastest processor. The best processor's final architecture (NIOS II fast with floating-point unit, PLL and RAM memory) is represented in figure 5. The components are: the processor with all developed hardware inside it, the memory controller connected to RAM memory, a clock timer, a JTAG interface for USB communication, the PLL and the system identification. All components are connected to the Altera AVALON Bus, which is an address-oriented bus.

Figure 4. Execution Time Profile
Figure 5. Best Processor's Architecture
The comparison between the processors' results is presented in figure 6. The profiling results show that the number of Connection Updates Per Second (CUPS) is very low compared to other implementations (table I), so the fastest processor was chosen to receive the PLL and the RAM memory (ffd). Even though it is the largest, most expensive and most power-consuming processor, the difference between the NIOS II fast processor and the other two is not relevant compared to the advantage in CUPS. All processors that included the Floating-Point Unit (FPU) have very close area consumption, and all NIOS II fast processors achieved a close number of CUPS; this fact suggests that if area is an important requirement, NIOS II ffd or even NIOS II ff can be used.

Figure 6. Processors' Results Comparison

After choosing the best processor, its results were compared to other on-chip implementations of backpropagation networks (table I).

Table I
FINAL RESULTS
Work: This, [3], [4], [7], [8], [6], [9]
Neurons per Chip: >100.000, 10, 100, 92, 4, 99
CUPS: 1250, 1800000, 77840, 80000, -
Layers: >7, 3, 3, >7, 7
Max Freq. (MHz): 100, 40, 17, 20, 30

Table I compares seven different solutions for the same problem. Some references were not complete, and the missing information is indicated by "-". Analyzing table I, it is easy to notice the advantages of using the hardware/software co-design method with reconfigurable hardware. The possibility of configuring a soft-processor to run the algorithm solved many problems that occur in hardware implementations: (i) activation function approximation; (ii) system bus width (NIOS II is a 32-bit soft-processor); and (iii) numeric representation (NIOS II is an IEEE-standard soft-processor). The scalability problem was solved by using RAM memory together with the soft-processor, because in this case the software is able to allocate 80MB of memory, dividing neurons and layers according to the problem. The NIOS II processor frequency can reach 150MHz in more complex FPGAs, and this fact, together with the possibility of adding custom instructions and co-processors to the NIOS II system, is a possible solution for the difference in CUPS. RAM memory also solves the problem of data input, because the memory can be initialized with previously known values. Information about area consumption is given in different units, and conversion is not trivial in this case. Although the Cyclone II is not a high-performance FPGA, the final design, which consumed the largest area compared to the other designs, occupied less than 22% of the device's total area. This fact is important because there is a considerable amount of free hardware area available for other hardware optimizations. Considering NIOS II e as the reference sequential implementation of the algorithm, because it is the simplest processor without any added unit or instruction, the speedup of 23.44 achieved by the fastest processor is presented in figure 7, compared to all other processors.

Figure 7. Speedup Achieved during Development

IV. CONCLUSION

This work presented a hardware/software co-design of a feedforward multi-layer perceptron network trained with the backpropagation algorithm. The final system, which is the main contribution of this work, is composed of a soft-processor with hardware acceleration for floating-point numerical operations; it achieved good results compared to other proposed architectures (table I), solving some hardware implementation problems common to the other architectures. The system is the result of a profile-based hardware/software co-design method and is configured on a reconfigurable hardware platform (Altera Cyclone II FPGA). The proposed method achieved good results considering the speedup obtained during development (figure 7) and the final system results (table I). This FPGA is not a high-performance device, and this hardware can be configured on any other Altera FPGA, including the high-performance Stratix4 FPGAs. Implementing the system on another FPGA can raise the final clock frequency and the amount of system memory. The final system can be integrated and embedded in other systems without much effort due to its USB JTAG interface. Another important feature of this system is that more hardware accelerators can be added to the soft-processor as custom instructions or co-processors in order to raise the CUPS.

3 http://www.inf.upol.cz/iris/
4 http://www.altera.com/devices/fpga/stratix-fpgas/stratix/stratix/stxindex.jsp
The final system achieved better results than previously proposed architectures considering algorithm precision (the system is implemented without any precision loss caused by the hardware architecture, and the backpropagation algorithm was not modified to be implemented on-chip), scalability (of neurons and layers) and flexibility (the final system is a more generic hardware that can be re-programmed to train with any other training algorithm or to execute with other activation functions). Future efforts in this work will focus on: (i) platform changes to high-performance devices and implementation on platforms from different manufacturers (Xilinx5, for example, which has the MicroBlaze6 soft-processor); (ii) development and evaluation of other hardware accelerators optimizing other critical functions (figure 4); (iii) implementation of other networks and training algorithms; and (iv) implementation and evaluation of a multi-core architecture.
[8] R. G. Gironés, R. C. Palero, J. C. Boluda, and A. S. Cortés, “FPGA implementation of a pipelined on-line backpropagation,” J. VLSI Signal Process. Syst., vol. 40, no. 2, pp. 189–213, Jun. 2005. [Online]. Available: http://dx.doi.org/10.1007/s11265-005-4961-3

[9] R. M. da Silva, L. de Macedo Mourelle, and N. Nedjah, “Compact yet efficient hardware implementation of artificial neural networks with customized topology,” Expert Systems with Applications, vol. 39, no. 10, pp. 9191–9206, 2012.

[10] W. Wolf, High-Performance Embedded Computing: Architectures, Applications, and Methodologies. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2006.

[11] C. Bobda, Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications, 1st ed. Springer Publishing Company, Incorporated, 2007.

[12] P. Yiannacouras, J. Rose, and J. G. Steffan, “The microarchitecture of FPGA-based soft processors,” in CASES '05: Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. New York, NY, USA: ACM, 2005, pp. 202–212.
ACKNOWLEDGMENT

The authors acknowledge the support granted by CNPq and FAPESP to the INCT-SEC (National Institute of Science and Technology - Critical Embedded Systems - Brazil), processes 573963/2008-8 and 08/57870-9. The authors also acknowledge CAPES for financial support.
[13] G. Cybenko, “Continuous valued neural networks with two hidden layers are sufficient,” 1988.

[14] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing - Explorations in the Microstructure of Cognition. A Bradford Book - The MIT Press, 1986, vol. 1: Foundations.
REFERENCES

[1] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall, Jul. 1998. [Online]. Available: http://www.worldcat.org/isbn/0132733501
[15] J. L. McClelland and D. E. Rumelhart, Parallel Distributed Processing - Explorations in the Microstructure of Cognition. A Bradford Book - The MIT Press, 1986, vol. 2: Psychological and Biological Models.
[2] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, pp. 386–408, 1958.
[16] W. Wolf, “A decade of hardware/software codesign,” Computer, vol. 36, no. 4, pp. 38–43, April 2003.
[3] H. Ishii, T. Shibata, H. Kosaka, and T. Ohmi, “Hardware-backpropagation learning of neuron MOS neural networks,” in Electron Devices Meeting, 1992. IEDM '92. Technical Digest., International, Dec. 1992, pp. 435–438.
[17] H. Hubert and B. Stabernack, “Profiling-based hardware/software co-exploration for the design of video coding architectures,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 11, pp. 1680–1691, Nov. 2009.
[4] J. Cloutier and P. Y. Simard, “Hardware implementation of the backpropagation without multiplication,” in Proceedings of the Fourth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, 1994, pp. 46–55.
[18] M. Dias, D. Sales, and F. Osorio, “Hw/sw co-design architecture for evolutionary robotics,” in 2010 Latin American Robotics Symposium and Intelligent Robotic Meeting (LARS), Oct. 2010, pp. 43–48.
[5] R. A. Khalil, “Hardware Implementation of Backpropagation Neural Networks on Field Programmable Gate Array,” Ph.D. dissertation, University of Mosul, College of Engineering, Mosul, Iraq, 2007.
[19] M. A. Dias and F. S. Osorio, “Hardware/software co-design for image cross-correlation,” in Integrated Computing Technology, ser. Communications in Computer and Information Science, E. R. Hruschka, J. Watada, and M. Carmo Nicoletti, Eds. Springer Berlin Heidelberg, 2011, vol. 165, pp. 161– 175.
[6] D. Myers, J. Vincent, and D. Orrey, “HANNIBAL: A VLSI building block for neural networks with on-chip backpropagation learning,” Neurocomputing, vol. 5, no. 1, pp. 25–37, 1993.

[7] J. G. Eldredge and B. L. Hutchings, “RRANN: A hardware implementation of the backpropagation algorithm using reconfigurable FPGAs,” in IEEE World Conference on Computational Intelligence, 1994, pp. 2097–2102.
[20] M. Dias, D. Sales, and F. Osorio, “A profile-based method for hardware/software co-design applied in evolutionary robotics using reconfigurable computing,” in Electronics, Robotics and Automotive Mechanics Conference (CERMA), 2010, pp. 463–468.
5 http://www.xilinx.com/ 6 http://www.xilinx.com/tools/microblaze.htm