ACCEPTED FOR PUBLICATION IN PROCEEDINGS OF ISCAS’06, ISLAND OF KOS, GREECE, MAY 21-24, 2006.
Neural Network Stream Processing Core (NnSP) for Embedded Systems

Hadi Esmaeilzadeh, Pooya Saeedi, Babak Nadjar Araabi, Caro Lucas, and Sied Mehdi Fakhraie
Department of Electrical and Computer Engineering, University of Tehran, Tehran 14395-515, Iran
Email: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract— NnSP is a stream-based, programmable, and code-level statically reconfigurable processor for the realization of neural networks in embedded systems. NnSP is provided with a neural-network-to-stream compiler and a hardware core builder. The stream compiler makes it possible to realize various neural networks on NnSP, while the builder turns the NnSP processor into an IP core that can be restructured to satisfy different demands and constraints. This paper presents the architecture of the NnSP processor, the streaming mechanism, and the builder facilities. Synthesis results of a 64-PE NnSP on a 0.18µm standard-cell library are also presented. The results show that a 64-PE NnSP can compute 25.6 giga connections per second, with a throughput of up to 51.2 giga 32-bit fixed-point operations per second. A comparison with high-performance parallel architectures places the 64-PE NnSP among the best state-of-the-art parallel processors.
I. INTRODUCTION

Intelligent systems offer dependability, efficiency, autonomy, modeling convenience, cost-efficient maintenance, and uniqueness in complicated applications. These capabilities are potential reasons for employing intelligent solutions in embedded systems [1]. The most applicable instance of bio-inspired computing for embedded systems is an artificial-neural-network engine. Considering the real-time requirements of embedded applications, most embedded microprocessor cores lack the performance needed to run neural networks or other bio-inspired sub-blocks. Employing neural networks in embedded applications therefore requires efficient custom IP-core implementations of them [2]. On the other hand, the competitive market of embedded systems admits solutions that require shorter design time, are cost-efficient in development, are flexible in utilization, are simple to integrate, and are reusable [3]. Over the past decade, general-purpose commercial and academic neural devices [4], [5] have been presented, but the market has not accepted them, for reasons such as their very limited flexibility, insufficient technical and software support, and often insufficient performance [6]. Recently, some research activities have started on embedding intelligent cores in embedded systems [2], [7]. A cellular spiking neural network with run-time reconfigurable connectivity is realized on a Nios-based Altera SoPC bed for mobile robot navigation control [7]. Although the implemented network is utilized in an embedded robot control system, it is not an IP core
for other realizations of cellular neural networks. In [2], rapid automated generation of a low-power, high-performance, VLIW stream-based processor optimized for speech recognition and vision is addressed. Although the perception processor [2] is a complete embedded core, it is not a neural embedded core.

The central idea of stream processing is organizing the computations of an application into streams of data [8]. Streams are sequences of similar data records. Data records are encapsulated as packets which are processed by a relatively simple and regular algorithm [8]. The neural network stream processor (NnSP) is a programmable, code-level statically reconfigurable stream processor core for the realization of neural networks in embedded systems. While preserving this flexibility, it has high computational power due to its parallel processing architecture. The NnSP embedded processor is accompanied by a builder that generates its synthesizable Verilog code from the structural specification fed in by the designer; this ensures the reusability of the NnSP core. A stream compiler is developed for NnSP that maps and streamizes a neural network for execution on NnSP. This makes a small NnSP core a full neural-network engine capable of performing the computations of a relatively large neural network. Section II discusses the details of the NnSP architecture and the streaming approach for neural networks. Section III elaborates on performance analysis and comparison with state-of-the-art parallel processors. Finally, the paper is concluded in Section IV.

II. ARCHITECTURE, STREAMING, AND REUSABILITY OF NNSP

Each neuron in NnSP is virtually realized as a stream of packetized weight parameters. The stream flows through a processing engine in which the computations of its synaptic packets are performed. The stream preamble programs the processing engine to perform the required computations on the upcoming synaptic packets of the stream. After performing the computations of a synaptic stream, the processing engine is ready to take on another stream. NnSP is a composition of such programmable stream processing engines, which are not exclusively committed to a specific neuron. As depicted in Fig. 1, a number of stream processing engines in association with a FIFO-based cache build a stream processing unit.
Fig. 1. The NnSP embedded core architecture, along with its stream processing units and processing engines.

Synaptic data streams are fetched from external memory and fill the processing unit cache. The processing unit cache parses the header of each synaptic packet, which contains the ID of its target processing engine, and delivers the packet to its target in a first-in-first-out manner. When a processing engine receives a packet of a synaptic stream in its cache interface FIFO, its computation core reads the synaptic weight packet and starts the calculations, provided the associated input data is ready in the bus input interface FIFO of the processing engine. The bus arbiter is responsible for conveying the input value of each synaptic computation, which is the output of another processing engine. In addition to the three FIFOs intended to prevent blocking on the cache and on the internal inter-processing-engine communication bus, each processing engine has a configuration unit and a controller. These units are programmed by the stream preamble to control the computations and trace the flow of streams through the processing engines.

A. Data Structure of Streams

There are three types of packets in a stream that encapsulates a neuron: BusConfigPacket, PeConfigPacket, and SynapticDataPacket. The two former packets form the preamble of the stream, while the SynapticDataPackets build the payload (see Fig. 2). Synaptic parameters are packed in SynapticDataPackets. The targetPeId field identifies the processing engine that is the destination of the packet within a stream processing unit; the processing unit cache uses targetPeId, the packet header, as an index to direct the packet to its target. The rest of the packet, the weight field, contains the synaptic weight in two's-complement fixed-point format. Prior to the SynapticDataPackets, one PeConfigPacket is sent to the processing engine, programming it for the upcoming data. The third packet type, BusConfigPacket, which the bus arbiter uses to establish data exchanges between processing engines, is discussed in Section II-B after the bussing architecture is introduced.

Fig. 2. Data structure of synaptic data streams.
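As an illustration of this data structure, the following sketch models one streamized neuron in software. The field layout follows the description above, but the concrete widths, the 16.16 weight format, and the helper names are assumptions made here for illustration; the actual formats are set by the NnSP Builder from system-level simulation results.

```python
# Illustrative software model of a streamized neuron.
# Field widths and the 16.16 weight format are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List

WEIGHT_FRAC_BITS = 16  # assumed fractional bits of the fixed-point weight

def to_fixed(w: float) -> int:
    """Encode a weight as a 32-bit two's-complement fixed-point word."""
    return int(round(w * (1 << WEIGHT_FRAC_BITS))) & 0xFFFFFFFF

@dataclass
class BusConfigPacket:          # preamble: tells the arbiter how to route this neuron's output
    pe_id: int                  # engine that will write the output onto the bus
    out_index: int              # index into that engine's bus output buffer
    wrt_pattern: int            # one bit per destination engine's bus input buffer

@dataclass
class PeConfigPacket:           # preamble: programs the engine for the upcoming payload
    target_pe_id: int
    config: dict = field(default_factory=dict)

@dataclass
class SynapticDataPacket:       # payload: header (targetPeId) plus the weight field
    target_pe_id: int
    weight: int

def streamize_neuron(pe_id: int, weights: List[float], bus_cfg: BusConfigPacket) -> list:
    """Build one neuron's stream: preamble first, then the synaptic payload."""
    preamble = [bus_cfg, PeConfigPacket(pe_id)]
    payload = [SynapticDataPacket(pe_id, to_fixed(w)) for w in weights]
    return preamble + payload
```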
B. Caching and Local Communications

1) FIFO-Based Caching: As depicted in Fig. 1, each processing unit has a FIFO-based cache. The memory interface unit fetches synaptic data streams from an external memory and places them into the processing unit cache. Since the streams are contiguous data, the external memory works in its burst access mode at maximum throughput, which enhances the performance of the NnSP core. Moreover, if the streamized neural network fits in the caches, the caches enter a circulating mode and only the inputs need to be fetched from the external memory; this is another benefit of the FIFO-based caching scheme for the overall performance of NnSP. In addition to the processing unit caches, two small FIFO-based caches are included for inputs and outputs.

2) Local Bussing Architecture: The bus arbiter also contains a FIFO-based cache in which the BusConfigPackets of the synaptic data streams are cached. A BusConfigPacket (see Fig. 2) configures the bus so that a processing engine can write an output value of a stream to the bus input buffers of other processing engines (see Fig. 1). The peId field of the BusConfigPacket is the global index of the processing engine that should write its output over the bus. The outIndex field directly indexes an output value in the bus output buffer of the source processing engine. The wrtPattern field is a bit pattern in which each bit position corresponds to one processing engine; wrtPattern drives the write signals of the bus input buffers of the destination processing engines. The bus interactions are therefore multicast, which means the shared-bus architecture does not degrade the parallelism of the computations. Furthermore, BusConfigPackets are sent to the bus arbiter ahead of, and in parallel with, the SynapticDataPackets, so the bus is configured before the computations complete. Then, as each datum becomes available, it is sent to the processing engines that will need it. This prebussing scheme lowers the possibility of degrading parallelism due to bus sharing and enhances the scalability of the NnSP core in terms of processing engine count.
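The prebussing scheme just described can be summarized with a minimal software model. It assumes the arbiter services its cached BusConfigPackets in FIFO order and that every set bit of wrtPattern selects one destination engine's bus input buffer; queue depths and the matching policy are illustrative assumptions, not hardware details from the paper.

```python
# Toy model of the bus arbiter's prebussing and multicast delivery.
from collections import deque

class BusArbiter:
    def __init__(self, num_pes: int):
        self.cfg_fifo = deque()                                  # cached BusConfigPackets
        self.bus_input_fifos = [deque() for _ in range(num_pes)]

    def cache_config(self, pe_id: int, out_index: int, wrt_pattern: int) -> None:
        """BusConfigPackets arrive ahead of the synaptic data (prebussing)."""
        self.cfg_fifo.append((pe_id, out_index, wrt_pattern))

    def deliver_ready_output(self, bus_output_buffers) -> None:
        """When the source engine's output is available, multicast it: every set bit
        of wrtPattern selects one destination engine's bus input buffer."""
        if not self.cfg_fifo:
            return
        pe_id, out_index, wrt_pattern = self.cfg_fifo[0]
        source = bus_output_buffers[pe_id]
        if out_index >= len(source):
            return                                               # output not produced yet
        value = source[out_index]
        self.cfg_fifo.popleft()
        for dst, fifo in enumerate(self.bus_input_fifos):
            if (wrt_pattern >> dst) & 1:
                fifo.append(value)
```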
C. Stream Compiler and NnSP Builder

1) Stream Compiler: The reusability and computational power of the NnSP core are the result of its Stream Compiler and the streaming algorithm developed for it. The Stream Compiler takes the architecture specification of the target NnSP core on one hand and the topology of the network under realization on the other, and generates the synaptic data streams of the network so that it runs on the target NnSP with as much parallelism as possible. The output of the Stream Compiler is three memory files containing the input streams, the bus configuration streams, and the synaptic data streams, which are loaded into memory for running on the target NnSP core.

2) NnSP Builder: The NnSP core is constructed from a collection of Verilog HDL templates. The NnSP Builder takes the architecture specification of the intended core, including the number of stream processing units, the number of processing engines per processing unit, cache and buffer sizes, and the type of processing engine computation core. System-level simulation results, in the form of word lengths and fixed-point precisions, are also applied to the NnSP Builder. The result is the synthesizable Verilog code of the required core.

III. PERFORMANCE ANALYSIS

A 64-PE NnSP core is implemented using a 0.18µm standard-cell library and employed in various pattern recognition applications for performance evaluation.

A. ASIC SoC Implementation of 64-PE NnSP

1) Architecture of 64-PE NnSP: The implemented 32-bit NnSP core has 64 processing engines directly connected to 4K×32-bit caches, which form 64 processing units. The cache interface FIFOs are removed to achieve one synaptic packet per clock for the MLP computational cores. A small 8-word buffer is included in each processing engine as the bus input interface FIFO to prevent bus blocking. A dedicated 8K×80-bit cache is allotted to the bus arbiter. One 1K×32-bit cache is employed for caching input vectors of up to 1024 elements. The output cache has 8 locations, which enables the 64-PE NnSP to interface with memories running at one-eighth of the clock frequency of the core. The total cache size of the 64-PE NnSP is 1.082 Mbytes.

2) MLP Computation Core Implementation: A 7-stage pipelined MLP computation core is implemented for the 64-PE NnSP core. The computations of MLP neurons require a multiply-and-accumulate (MAC) unit plus a sigmoid calculation unit. The first five stages of the pipelined MLP core form a pipelined 32-bit radix-4 Booth-encoded carry-save-form array multiplier; in each stage, 8 bits of the multiplication result plus the partial product in carry-save form are calculated. The fifth stage produces the upper half of the result with a 33-bit carry-look-ahead adder, then saturates the multiplication result and converts it to a two's-complement 32-bit word. The sixth stage is the accumulation stage, performed by another 32-bit CLA adder. After completion of the synaptic calculations of a neuron, the accumulated result is applied to a sigmoid calculation unit.
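Behaviorally, this datapath performs a fixed-point multiply, saturation to a 32-bit two's-complement word, and accumulation. The sketch below captures that behavior only; it does not model the 7-stage pipeline, and the 16.16 fixed-point split is an assumed example format, since the paper fixes only the 32-bit word length and leaves precisions to system-level simulation.

```python
# Behavioral sketch of the MLP core's multiply-saturate-accumulate datapath.
# The 16.16 fixed-point split is an assumed example; the paper fixes only 32-bit words.
FRAC_BITS = 16
INT32_MAX = (1 << 31) - 1
INT32_MIN = -(1 << 31)

def saturate32(x: int) -> int:
    """Clamp to the two's-complement 32-bit range, as the fifth pipeline stage does."""
    return max(INT32_MIN, min(INT32_MAX, x))

def fx_mul(a: int, b: int) -> int:
    """Full-precision product rescaled to the weight format, then saturated to 32 bits."""
    return saturate32((a * b) >> FRAC_BITS)

def mac_neuron(weights, inputs) -> int:
    """Accumulate all synaptic products of one neuron (the accumulation stage);
    the final sum is then applied to the sigmoid lookup table."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc = saturate32(acc + fx_mul(w, x))
    return acc
```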
TABLE I
SYNTHESIS RESULTS OF THE 64-PE NNSP ON A 0.18µm STANDARD-CELL LIBRARY.
3) Sigmoid Implementation: The sigmoid calculation unit is implemented as a hardwired lookup table. The input and output bit-widths of the sigmoid LUT are selected based on a quantization-error analysis, yielding 2 bits for the integer part and 15 bits for the fractional part.

4) Synthesis Results: The specifications of the 64-PE NnSP are given to the NnSP Builder and its synthesizable Verilog code is obtained. The 64-PE NnSP core is synthesized on a 0.18µm standard-cell library, and the synthesis results are presented in Table I.

B. Benchmarking

For performance evaluation, several neural networks from previous publications on pattern recognition applications are programmed on the 64-PE NnSP. The 784:500:1 and 784:500:500:500:1 networks are employed for handwritten digit recognition in [9]; the input patterns are 28×28 handwritten digits. The 400:3136:784:1200:300:10 network is a handwritten digit recognizer for 20×20 input patterns [10]. The 256:128:256 network is employed for performance evaluation in [11]. The 234:1024:61 network is a speech-to-phoneme network [11]; its 61 output units correspond to 61 phonetic classes. The streaming results of the benchmark networks over the 64-PE NnSP core are presented in Table II. The following two parameters are defined for quantitative throughput analysis:

Parallelability = c / (pc × pen × cd) × 100%    (1)

Cachability = c_cached / c × 100%    (2)

In (1) and (2), c is the number of connections, pc is the number of parallel computations, cd and pen are the computational depth of the processing engines and the number of processing engines, respectively, and c_cached is the number of connections residing in the processing unit caches. Parallelability measures the structure of the streamized neural network in terms of its degree of parallelism over the NnSP core, while Cachability is the hit rate of the caches in terms of streamized connections. Fig. 3 plots throughput versus Parallelability × Cachability and shows a linear dependency between throughput and Parallelability in conjunction with Cachability, underscoring the importance of caching and parallelism for performance.

Fig. 3. Throughput vs. Parallelability × Cachability.
TABLE II
BENCHMARK RESULTS.

Network [Ref.]                | # Neurons | # Connections | # Parallel Computations | # Bus Accesses | Parallelability | Memory Usage (MByte) | Memory Overhead per Connection | Cachability | Throughput (GCPS) | Throughput (KPPS)
784:500:1 [9]                 | 501       | 392,500       | 6,772                   | 6,773          | 90.561%         | 1.564                | 4.442%                         | 65.223%     | 15.260            | 59.067
784:500:500:500:1 [9]         | 1,501     | 892,500       | 14,772                  | 15,773         | 94.404%         | 3.561                | 4.586%                         | 84.725%     | 20.537            | 27.078
400:3136:784:1200:300:10 [10] | 5,430     | 135,736       | 3,200                   | 3,210          | 66.277%         | 0.569                | 9.913%                         | 100%        | 16.967            | 125.000
256:128:256 [11]              | 384       | 65,536        | 1,024                   | 1,280          | 100%            | 0.264                | 5.469%                         | 100%        | 25.600            | 390.625
234:1024:61 [11]              | 1,084     | 302,080       | 4,768                   | 4,829          | 98.993%         | 1.203                | 4.355%                         | 85.601%     | 21.751            | 83.893
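As a quick sanity check on (1) and (2), the snippet below recomputes Parallelability for the first row of Table II, taking pen = 64 and assuming cd = 1 for the one-packet-per-clock MLP core; the value of cd is an inference from that design choice, not a figure stated explicitly in the paper.

```python
# Recompute the metrics of (1) and (2) for the 784:500:1 row of Table II.
def parallelability(c: int, pc: int, pen: int, cd: int = 1) -> float:
    """Eq. (1): connections / (parallel computations x engines x computational depth)."""
    return 100.0 * c / (pc * pen * cd)

def cachability(c_cached: int, c: int) -> float:
    """Eq. (2): fraction of connections resident in the processing unit caches."""
    return 100.0 * c_cached / c

# 392,500 connections executed in 6,772 parallel computations on 64 processing engines.
print(round(parallelability(392_500, 6_772, 64), 3))  # 90.561, matching Table II
```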
C. Comparison

Several high-performance parallel processing architectures are selected from a survey of parallel computing architectures [12] for comparison with the 64-PE NnSP core. Their performances are summarized in Table III.

TABLE III
PERFORMANCE SUMMARY.
Processor | Throughput
β chip    | 10 GFLOPS
Imagine   | 10-20 GFLOPS, 40 GOPS
IMAP-CE   | 51.2 GOPS
CS301     | 25.6 GFLOPS, 12.8 GMAC/s
PC101     | 140 GOPS, 30 GMAC/s
NnSP      | 51.2 GOPS, 25.6 GMAC/s
Considering the 25.6 GMAC/s capability of the 64-PE NnSP core and the pipelined architecture of its processing engines, and counting each MAC as two operations (a multiply and an add), its peak performance is 51.2 GOPS. This peak performance places NnSP among the best high-performance parallel processing architectures.

IV. CONCLUSION

The architecture of the embedded NnSP stream processing core was presented. The data structure of streams, the FIFO-based caching scheme, and the inter-processing-engine communication approach with burst transactions over a common bus were explained. The software support, including a stream compiler and a core builder, which guarantees NnSP reusability, was also described. A 64-PE NnSP architecture synthesized on a 0.18µm standard-cell library was discussed; the obtained clock frequency for the 64-PE core was 400 MHz. Several benchmark networks were streamized over it, and the performance evaluations showed that the NnSP core reached a throughput of 25.6 GCPS. Comparisons with several high-performance parallel processors showed that NnSP, with up to 51.2 GOPS, is located among the best parallel processing architectures.

REFERENCES
[1] W. Elmenreich, "Intelligent methods for embedded systems," in Proc. of the First Workshop on Intelligent Solutions for Embedded Systems, Vienna, Austria, June 2003, pp. 3–11.
[2] B. Mathew, A. Davis, and M. Parker, "A low power architecture for embedded perception," in Proc. of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'04), Washington DC, USA: ACM Press, 2004, pp. 46–56.
[3] T. Zhang, L. Benini, and G. D. Micheli, "Component selection and matching for IP-based design," in Proc. of Design, Automation and Test in Europe (DATE'01), Mar. 13–16, 2001, pp. 40–46.
[4] S. M. Fakhraie and K. C. Smith, VLSI-Compatible Implementations for Artificial Neural Networks. Norwell, Massachusetts: Kluwer Academic Publishers, 1997.
[5] J. Zhu and P. Sutton, "FPGA implementations of neural networks - a survey of a decade of progress," in Proc. of the Field Programmable Logic Conference (FPL'2003), 2003, pp. 1062–1066.
[6] L. M. Reyneri, "Implementation issues of neuro-fuzzy hardware: Going toward HW/SW codesign," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 176–194, Jan. 2003.
[7] D. Roggen, S. Hofmann, Y. Thoma, and D. Floreano, "Hardware spiking neural network with run-time reconfigurable connectivity in an autonomous robot," in Proc. of the 2003 NASA/DoD Conference on Evolvable Hardware, Los Alamitos, California, 2003, pp. 189–198.
[8] U. Kapasi, S. Rixner, W. Dally, B. Khailany, J. Ahn, P. Mattson, and J. Owens, "Programmable stream processors," IEEE Computer, vol. 36, no. 8, Aug. 2003.
[9] G. Mayraz and G. E. Hinton, "Recognizing handwritten digits using hierarchical products of experts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 189–197, Feb. 2002.
[10] B. Boser, E. Sackinger, J. Bromley, Y. LeCun, and L. Jackel, "Hardware requirements for neural network pattern classifiers: a case study and implementation," IEEE Micro, vol. 12, no. 1, pp. 32–40, Feb. 1992.
[11] K. W. Przytula and V. K. Prasanna, Eds., Parallel Digital Implementations of Neural Networks. Englewood Cliffs, New Jersey: Prentice-Hall, 1993.
[12] P. Foldesy, "Trends in design of massively parallel coprocessors implemented in digital ASICs," in Proc. of the International Joint Conference on Neural Networks (IJCNN'04), vol. 4, July 25–29, 2004, pp. 3131–3135.