A Choice of SM/DM Parallel ANN Implementation for Embedded Applications

V. Dvorak, R. Cejka
Dept. of Computer Science, Technical University of Brno, Bozetechova 2, Brno, 612 66 Czech Republic
{dvorak, cejkar}@dcse.fee.vutbr.cz

Abstract

This paper examines implementations of a multi-layer perceptron (MLP) on bus-based shared memory (SM) and on distributed memory (DM) multiprocessor systems. The goal has been to optimize the HW and SW architectures in order to obtain the fastest possible response. Parallel MLP algorithms for up to 8 processing nodes with DM as well as SM were prototyped using the CSP-based TRANSIM tool [1]. The results of prototyping MLPs of different sizes on various numbers of processing nodes demonstrate the feasible speedups, efficiencies and response times for a given CPU speed, link speed or bus bandwidth.

Key words: performance modeling/prediction, prototyping parallel programs, parallel implementations of neural networks

1. Introduction

Implementations of artificial neural networks (ANNs) for real-time applications must frequently provide a very fast response so that the continuous inputs to the ANN can be sampled and processed often enough, e.g. a hundred times per second. For some applications, such as 2D-pattern classification, the ANN can be quite large and the computation load excessive even for the most modern CPUs. Efficient implementation of ANNs has therefore been an active research area, and dedicated analog as well as digital hardware and architectures have received much attention in recent years. Neural networks have also become one of the favorite applications of parallel processing [2].

In this paper we analyze traditional parallel architectures with a few CPUs and want to find out whether a shared memory or a distributed memory system is faster and therefore more suitable for real-time processing. We consider CPUs such as the ADSP-21062 SHARC by Analog Devices, with six 4-bit links at 40 MB/s each, which can also use inter-processor bus transfers at 240 MB/s and access a unified address space.

In the following paragraphs we describe the architecture, the prototype software for the MLP (feed-forward propagation mode only), and the mapping of software onto hardware for the sake of simulation. We investigate only the three-layered perceptron (TLP); extensions to a general MLP are straightforward. We do not consider the learning process, because learning is usually done only once and its time is not critical. We are interested in the case when the ANN must operate in real time with a continuous flow of input data and where only a very fast response matters.
2. Parallelization of multi-layer ANNs

We have used a common neuron model with a linear part described by equation
x = a1*w1 + a2*w2 + ... + an*wn + a0*w0,        (1)
where the threshold operation is taken as an extra input with activity a0 = 1 and weight w0 = -t. All constants and variables in equation (1) are real numbers (REAL32); single precision proved sufficient for ANN applications. The nonlinear part of a neuron is specified by the sigmoid function
f(x) = 1 / (1 + exp(-x)).        (2)
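For illustration only, the neuron model of equations (1) and (2) can be sketched in C (the function name and argument layout are ours, not part of the implementation); activities and weights are single-precision REAL32 values:

  #include <math.h>

  /* One neuron: weighted sum (1) followed by the sigmoid (2).
     a[0..n-1] are the input activities, w[0..n-1] the corresponding
     weights and w[n] the threshold weight w0 = -t applied to the
     constant extra activity a0 = 1.                                 */
  float neuron(const float *a, const float *w, int n)
  {
      float x = w[n];                    /* a0*w0 with a0 = 1         */
      for (int j = 0; j < n; j++)
          x += a[j] * w[j];              /* one MAC per input         */
      return 1.0f / (1.0f + expf(-x));   /* f(x) = 1/(1 + exp(-x))    */
  }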
The inputs and the hidden neurons are fully interconnected, as are the hidden and output neurons. To parallelize this structure we assign a certain segment of hidden and output neurons to each CPU, thus creating slices of the ANN perpendicular to the layers. Using this technique of "vertical slicing", the TLP nin - nhn - non (the numbers of input, hidden and output neurons, respectively) is decomposed into S segments, each of which evaluates nhn1t = nhn/S hidden neurons and non1t = non/S output neurons. The required communication among CPUs and their computation load for processing one input vector can be easily estimated. The actions to be carried out are:
- read the new input vector out0 from outside into the SM, or broadcast the new vector to all nodes;
- compute a subset of nhn1t hidden neuron outputs from the complete input vector. This requires nin MAC (multiply and accumulate) operations for each hidden neuron, i.e. nin*nhn1t MAC operations on one CPU. When each hidden neuron accumulation is complete, a sigmoid must be evaluated for each of the nhn1t stored values on every CPU;
- read the blocks of nhn1t hidden neuron output values computed by the other CPUs from the SM, or communicate these values in an all-to-all fashion;
- compute a subset of non1t outputs from the complete vector of hidden outputs out1. This requires nhn*non1t MAC operations and non1t sigmoid evaluations on each CPU;
- read out the output vector from the SM, or do an all-to-one communication of blocks of non1t output values from each CPU to the root and then on to the outside world.

Implementing the above steps in sequence using message passing would separate computation from communication and could not provide good efficiency and speedup. On the contrary, we have to overlap computation and communication as much as possible, because the CPU and the links can work in parallel. The following pipeline exploits exactly this internal parallelism:

A) on each CPU do in parallel:
   - compute hidden node output values for the input vector (time n),
   - read the new input vector (n+1) into the SM or broadcast it to all CPUs (one-to-all broadcast, oab),
   - collect blocks of output values (n-1), all-to-one gather (aog), and (root CPU only) send out the complete output vector (n-1).
B) get hidden node output values (time n) from the SM or communicate the blocks of hidden node output values (time n) all-to-all (aab), and compute output node values for the input vector (time n).

Testing of various alternatives of the SW and HW organization and prototyping of the parallel algorithm described above was done using the simulating and prototyping tool TRANSIM.
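Before turning to the TRANSIM description, the work done by one segment for a single input vector can be summarized in a rough C sketch, reusing the neuron() helper above; the comm_* calls are hypothetical placeholders for the SM accesses or the message-passing steps oab, aab and aog, and the weight layout is ours:

  /* Hypothetical communication hooks: exchange of hidden blocks among all
     segments (aab) and gathering of output blocks to the root (aog).      */
  void comm_all_to_all(float *hid_all, int s, int nhn1t);
  void comm_gather_to_root(const float *out_seg, int s, int non1t);

  /* Forward pass of segment s out of S ("vertical slice" of the TLP).
     Each segment evaluates nhn1t = nhn/S hidden and non1t = non/S output
     neurons; hid_all holds the hidden outputs of all segments.            */
  void segment_forward(int s, int S, int nin, int nhn, int non,
                       const float *in,     /* nin input activities        */
                       const float *w_hid,  /* nhn1t rows of (nin+1) wts   */
                       const float *w_out,  /* non1t rows of (nhn+1) wts   */
                       float *hid_all,      /* nhn hidden outputs          */
                       float *out_seg)      /* non1t outputs of segment s  */
  {
      int nhn1t = nhn / S, non1t = non / S;

      /* nin*nhn1t MACs + nhn1t sigmoids */
      for (int h = 0; h < nhn1t; h++)
          hid_all[s * nhn1t + h] = neuron(in, &w_hid[h * (nin + 1)], nin);

      comm_all_to_all(hid_all, s, nhn1t);      /* get the other segments' blocks */

      /* nhn*non1t MACs + non1t sigmoids */
      for (int o = 0; o < non1t; o++)
          out_seg[o] = neuron(hid_all, &w_out[o * (nhn + 1)], nhn);

      comm_gather_to_root(out_seg, s, non1t);  /* collect outputs at the root    */
  }

In the real pipeline the broadcast of the next input vector and the gathering of the previous output vector run concurrently with the hidden-layer loop, which is what the PRI PAR constructs in the Appendix express.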
3. TRANSIM prototyping tool

The TRANSIM prototyping tool [1] has been used to predict the time performance of various configurations of 1 to 6 SHARC processors and to test quickly new ideas on how to organize the SW and HW. With this tool, successive versions of the design can be built and modified more easily than in the environment intended for the final design. The code for processing an input vector by the fully connected 3-layer perceptron, as well as the hardware architecture and the mapping of software modules onto processors, have been described in the CSP-based TRANSIM input language, a subset of Occam 2 with various extensions. TRANSIM is primarily intended for message-passing distributed memory systems; nevertheless, it has also been used to simulate shared memory (SM) systems [3]. The software description is of skeletal form, in which all the communications of the original code are retained and all pieces of sequential code are replaced by special timing constructs. Bus transactions in SM systems can be modeled as communications between node processes and a central process ("smbus") running on an extra processor. The evaluation of hidden neuron activities, for example, is expressed as the following loop in TRANSIM code:

  SEQ h = 0 FOR nhn1t
    SEQ
      SEQ j = 0 FOR nin
        SERV ( fadd + fmul )          -- fadd = 6-9, fmul = 11-18
      SERV ( fadd + fdiv + fexp )     -- fdiv = 16-28, fexp = 90
In this loop, input activations are multiplied by weights and summed over the number of input nodes (nin). Then the sigmoid function is evaluated using a single addition, division and exponential function. The argument of the SERV construct specifies the number of CPU cycles taken by the floating point operations and functions. For superscalar CPUs the number of CPU cycles in a SERV construct is lower than the sum of the cycles of the component operations, due to pipelined processing, and it can be measured accurately. As far as communication is concerned, the simulated messages can be of arbitrary length, whereas the value actually transferred is always a single value of type INT and the type of a channel is always ANY. The length of the message in bytes is specified in the output statement. An example of the communications oab and aog from step A) above follows:
  INT x, y:
  SEQ | io0
    IF
      i <> 0                             -- common nodes 1,2,3
        PAR
          SEQ | oab
            ch[0][i] ? x                 -- read input vector
          SEQ | aog
            ch[i][0] ! i | non1p*4       -- a part of the output vector back
      i = 0                              -- root
        SEQ
          DEVICE ? y | nin*4             -- get a new input vector from outside
          PAR                            -- send in parallel to 1,2,3
            SEQ | oab1
              ch[0][1] ! 0 | nin*4
            SEQ | oab2
              ch[0][2] ! 0 | nin*4
            SEQ | oab3
              ch[0][3] ! 0 | nin*4
          PAR                            -- and get the pieces of output vector
            SEQ | oag1
              ch[1][0] ? x1
            SEQ | oag2
              ch[2][0] ? x2
            SEQ | oag3
              ch[3][0] ? x3
          DEVICE ! 0 | non*4             -- send outside
There are at most 2 predefined DEVICE channels on each processor, one for input and one for output, for communication with an external device; the message length must be specified in both cases. Internal and external channels within a single array are allowed. Receiving and transmitting messages on channels "ch[x][i]" and "ch[i][x]" and processing on CPU[i] are done in parallel, simultaneously on all CPUs, by interleaving 3 consecutive tasks (threads) in a software pipeline (see the code in the Appendix). Process-level parallelism in the pipeline of processors is thus combined with thread-level parallelism within processors. All processors run the same code, with different conditional parts for the root and the remaining processors. In TRANSIM we cannot call procedures, but we can use the same code on several CPUs by means of a replicated PAR statement.

Specification of the hardware means specification of the individual processing elements and of the topology. As the topology can usually be determined internally from the software description, we specify only the parameters of processors by the NODE construct. The default hardware parameters may be overridden by explicit parameters, e.g.

  NODE i = 0 FOR S
    NODE n : SPD = 40, ECS = 40
says that all CPUs n[0], n[1], ..., n[S-1] have a clock speed SPD = 40 MHz (default value is 20 MHz) and an external channel speed of 40 MB/s (default value is 10 Mbit/s at 11.25 bits per byte). Other optional parameters are ICS (internal channel speed), ICD/ECD (internal/external channel delay), TSL (the time-slice period in CPU cycles) and EF (external memory factor, the number of additional CPU cycles on top of the internal memory access time). The connection between software and hardware is made through the MAP construct; channel placement is done automatically. For example, in the TLP SM simulation all n[i] processors run modules "cpu" and a root processor runs "smbus":
  MAP i = 0 FOR P
    MAP n[i] : cpu[i]
  MAP root : smbus
4. Shared memory simulation with message passing

Each processor is connected by two channels, fcpu and tcpu (from and to the CPU), to an extra processing element running the "smbus" process. Simulation of bus transactions in an SM system is based on a property of the channel selection statement ALT: if more processes are ready to communicate, one is selected randomly. In SM organizations other arbitration strategies are also used: fair, cyclic, priority, etc. These are modeled by modifying a loop with ALT. For example, fair arbitration is implemented by rotating the order of the input channels fcpu[i] in the ALT statement:

  WHILE TRUE
    SEQ j = 0 FOR n      -- fair multiplexer for n inputs (priority scheme could be used, PRI ALT)
      ALT                -- ALT cannot be replicated
        fcpu[(j+0) REM n] ? x
          q := (j+0) REM n
        fcpu[(j+1) REM n] ? x
          q := (j+1) REM n
        fcpu[(j+2) REM n] ? x
          q := (j+2) REM n
        fcpu[(j+3) REM n] ? x
          q := (j+3) REM n
        ...
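The rotating scheme modeled by this loop can also be illustrated in C, purely as a sketch; the pending flags and the serve() callback are our own abstractions, not part of the simulator:

  /* Round-robin bus arbiter: scan the request flags starting from a
     rotating index, grant the bus to the first pending CPU found, then
     rotate the scan order, as the (j+k) REM n indices do in the ALT loop.
     (This sketch busy-waits when idle; the ALT in TRANSIM blocks instead.) */
  void arbiter(int n, volatile int *pending, void (*serve)(int cpu))
  {
      int start = 0;
      for (;;) {                              /* WHILE TRUE                */
          for (int k = 0; k < n; k++) {
              int q = (start + k) % n;
              if (pending[q]) {               /* a request from CPU q      */
                  pending[q] = 0;
                  serve(q);                   /* one bus transaction       */
                  break;
              }
          }
          start = (start + 1) % n;            /* rotate the priority order */
      }
  }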
The second problem is the simulation of memory and synchronization operations such as read request, write-back, semaphore operations, barrier, etc. Processing of these transactions inside the process "smbus" is given below:

  IF
    x = rr                                -- read request
      tcpu[q] ! x | cline
    x = br                                -- barrier request
      SEQ
        barin := barin + 1
        IF
          barin = barnum
            SEQ
              SEQ z = 0 FOR barnum
                tcpu[z] ! br | rbsize
              barin := 0
          TRUE
            SKIP
    x = sr                                -- semaphore request
      locked := NOT locked
    x = wt                                -- write request (write-through)
      SKIP
    x = wb                                -- write request (write-back)
      SKIP

The corresponding transactions issued by the "cpu" processes are:

  Read request:  fcpu[i] ! rr | rrsize
                 tcpu[i] ? x
  Write back:    fcpu[i] ! wb | wbsize
  Semaphore:     fcpu[i] ! sr | srsize
                 ...
                 fcpu[i] ! sr | srsize
  Barrier:       fcpu[i] ! br | brsize
                 tcpu[i] ? x
Here rr, br and sr denote read, barrier and semaphore requests, and wt or wb are write requests. In the ANN implementation only the barrier synchronization has been used, namely after the evaluation of hidden nodes and before their values are used for the evaluation of output nodes.
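Where the barrier sits in the SM computation can be shown with a short C sketch; barrier_wait() and the compute_* helpers are hypothetical stand-ins for the br transaction and the slice loops sketched in Section 2:

  void barrier_wait(void);             /* models the br request/reply pair   */
  void compute_hidden_slice(int s);    /* writes nhn1t hidden outputs to SM  */
  void compute_output_slice(int s);    /* reads all nhn hidden outputs       */

  /* The only synchronization point in the SM forward pass on CPU s:
     the barrier guarantees that every segment's hidden outputs have been
     written to shared memory before any segment starts reading them.      */
  void sm_forward(int s)
  {
      compute_hidden_slice(s);
      barrier_wait();
      compute_output_slice(s);
  }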
Generally, for simulation purposes the bus bandwidth in the SM system corresponds to the external channel speed, ECS = 240 MB/s in the case of SHARC processors. All values in the ANN simulations are represented by REAL32 floating point numbers. Since the cache line is 16 B and a REAL32 value occupies 4 B, a read of a weight from the shared memory, with the related bus transaction, occurs only once in every four accesses.

5. Conclusions

SHARC-based implementations of three ANNs (small, medium and large size) have been analyzed in the following cases:
a) write-through shared memory (SM-WT), cache line 16 B, bus bandwidth 240 MB/s;
b) write-back shared memory (SM-WB), cache line 16 B, bus bandwidth 240 MB/s;
c) distributed memory (DM) with message passing, full connection, channel speed 40 MB/s.

The results of the simulation are given in Table 1. They demonstrate a very good speedup for all studied neural networks and a limited number of CPUs. Further improvement of the SW and HW architecture will be very difficult, because communications are already very well overlapped with processing and there is no more communication overhead to hide. According to the results, using multiple processors with distributed memory and message passing is more efficient, but under the assumption that access to the external memory is as fast as access to the on-chip cache (1 cycle). If it is slower (default value EF = 5), then the response time of the ANN gets a little longer, but efficiency slightly improves, because accessing memory is counted as useful work and the longer processing overlaps communication better.

Performance prediction for SHARC processors can be given in Millions of Connections Per Second (MCPS), a standard measure of ANN performance. For the TLP nin : nhn : non there are nin*nhn + nhn*non connections evaluated during the response time T of the forward pass, so that for 6 processors and the large ANN the measure is MCPS = nhn * (nin + non) / T = 60 * 216 / 947 = 13.7.

Predicting the performance of an MLP with more than one hidden layer requires only a minor modification of the TRANSIM code. E.g., in the message-passing architecture the process "seg" (an ANN segment), a sequence of 3 components (two PRI PARs followed by a replicated SEQ, see Appendix 1), has to be augmented so that the all-to-all broadcast of hidden values (the 2nd PRI PAR) and the output node evaluation (now again a hidden node evaluation) are repeated for each additional hidden layer. Experiments showed only a small decrease in efficiency (1-2 %), depending on the number of nodes in the additional hidden layers.

If the number of processors were increased above 6, a full connection of nodes in the DM system would no longer be possible, because the number of links is at most 6 in all available commercial processors. For 8 and 16 processors the most efficient topology seems to be a hypercube [4]. For node counts such as 10, 12, 14, 18, 20, etc., the best arrangement is a modified ring topology [5]. Simulations and performance predictions for up to 32 processors are planned for the near future. Since especially the SM simulations are very time consuming, a workstation version of TRANSIM is being sought.
SM - WT               P=1           P=2           P=4           P=6
Small  32-24-12       99.2 / 1.0    96.2 / 1.9    89.3 / 3.6    81.4 / 4.9
Medium 100-48-12      99.7 / 1.0    98.0 / 2.0    94.7 / 3.8    91.2 / 5.5
Large  200-60-12      99.8 / 1.0    98.4 / 2.0    96.0 / 3.8    93.3 / 5.6

SM - WB               P=1           P=2           P=4           P=6
Small  32-24-12       99.7 / 1.0    96.8 / 1.9    90.1 / 3.6    82.3 / 4.9
Medium 100-48-12      99.9 / 1.0    98.4 / 2.0    95.3 / 3.8    92.0 / 5.5
Large  200-60-12      99.9 / 1.0    98.7 / 2.0    96.5 / 3.9    94.0 / 5.6

DM                    P=1           P=2           P=4           P=6
Small  32-24-12       99.7 / 1      98.8 / 2      97.8 / 3.9    97.0 / 5.8
Medium 100-48-12      99.9 / 1      99.6 / 2      99.5 / 4      99.3 / 6
Large  200-60-12      100 / 1.0     99.8 / 2      99.8 / 3.9    99.7 / 6

DM time responses     P=1           P=2           P=4           P=6
Small  32-24-12       579.6 µs      260.6 µs      123.3 µs      81.1 µs
Medium 100-48-12      2.593 ms      1.236 ms      602.9 µs      399 µs
Large  200-60-12      5.933 ms      2.89 ms       1.426 ms      946.9 µs
Table 1. Efficiency [%] / speedup and response times of various ANN implementations.

References

[1] Hart, E.: TRANSIM - Prototyping Parallel Algorithms. University of Westminster Press, 1994.
[2] Sundararajan, N., Saratchandran, P.: Parallel Architectures for Artificial Neural Networks. IEEE Computer Society Press, Los Alamitos, CA, 1998. ISBN 0-8186-8399-6.
[3] Čejka, R., Dvořák, V.: CSP-based Modeling of SM Architectures. In: Proceedings of the Conference Computer Engineering and Informatics CE&I'99, FEI TU Kosice Publ., Kosice - Herlany, Slovakia, 1999, pp. 163-168. ISBN 80-88922-05-4.
[4] Dvořák, V., Matoušek, P.: Highly Efficient Parallel ANN Implementation for Real-Time Processing. In: Proceedings of the Conference Computer Engineering and Informatics CE&I'99, FEI TU Kosice Publ., Kosice - Herlany, Slovakia, 1999, pp. 186-191. ISBN 80-88922-05-4.
[5] Dvořák, V.: Prototyping Parallel ANN Implementations with TRANSIM. In: HPCS'96 Conference Proceedings, Carleton University Press, Ottawa, 1996, pp. 7/1-16. ISBN 0-88629-301-4.

Appendix 1.
Listing of full4.IN TRANSIM input file
  -- DM implementation of ANN, full connection of p processors
  INTC p:
  p := 4
  [p][p] CHAN OF ANY ch:
  VAL nin   IS 32:        -- number of input nodes
  VAL nhn   IS 24:        -- number of hidden nodes
  VAL non   IS 12:        -- number of output nodes
  VAL nhn1p IS nhn/p:     -- number of hidden nodes per processor
  VAL non1p IS non/p:     -- number of output nodes per processor
  VAL fadd IS 7:
  VAL fmul IS 11:
  VAL fdiv IS 19:
  VAL fexp IS 90:
  VAL incode  IS 24:
  VAL hidcode IS 16:
  VAL outcode IS 8:

  PLACED PAR i = 0 FOR p
    INT x0, x1, x2, x3, x4, x5, x6, x7:
    SEQ | seg : MR = 0    -- process segment of the ANN, 0