General Mapping of Feed-Forward Neural Networks onto an MIMD Computer

Jim Torresen(1,2), Hiroshi Nakashima(1), Shinji Tomita(1), Olav Landsverk(2)

(1) Department of Information Science, Faculty of Engineering, Kyoto University, Yoshida-hon-machi, Sakyo-ku, Kyoto 606-01, Japan
(2) Department of Computer Systems and Telematics, The Norwegian Institute of Technology, N-7043 Trondheim, Norway

Email: [email protected]

Abstract

This paper describes a scheme for mapping the back propagation algorithm onto an MIMD computer with a 2D-torus network. We propose a new strategy that allows arbitrary assignment of processors to the multiple degrees of back propagation parallelism (training set parallelism, pipelining and node parallelism). Thus, the method allows a flexible mapping that fits various neural network applications well. Moreover, we consider the effect of the weight update interval on the number of iterations required for convergence. The results from implementations on a Fujitsu AP1000 show that it may be beneficial to make a mapping involving contention in the communication network. Further, even though the convergence in number of iterations is slower for parallel implementations than for a serial program, parallel processing can be a means of achieving considerable speedup.

1 Introduction

Parallel processing is mandatory to reduce the long training time of neural network learning algorithms. In this paper we concentrate on how to make a flexible mapping of the back propagation (BP) training algorithm onto an MIMD message passing computer to reduce the total training time of an application. Many possible implementations of BP on parallel computers have been suggested and implemented [1, 2, 3, 4]. However, they all implement a fixed parallel algorithm without taking into account that different kinds of applications (neural network structure and training set) require different kinds of mappings to obtain minimum training time. Our mapping strategy combines multiple degrees of BP parallelism to obtain high performance on highly parallel computers. We have used the 2D torus-connected MIMD computer Fujitsu AP1000 [5] in our experiments. However, the main principles are transferable to other kinds of parallel computers. The AP1000 processing element is called a cell. It consists of a SPARC CPU, an FPU, a message passing chip, 128 KB cache and 16 MB main memory. Inter-cell communication is by message passing.

Section 2 of this paper describes the notation used and the parallel aspects of back propagation. The proposed mapping method is outlined in Section 3, followed by experimental results in Section 4. Finally, some concluding remarks are given in Section 5.

2 Parallel implementation of back propagation

This research considers a fully connected two layer feed-forward neural network, i.e. one hidden weight layer followed by an output weight layer. The NN literature lacks a formalized terminology; thus, we now give the terminology used in this paper.
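To make the network structure concrete, the following minimal sketch (our own illustration, not code from the paper) shows a forward pass through such a two layer network; the dimensions match the NETtalk setup used in Section 4, and names such as W_h, W_o and sigmoid are our own.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Dimensions as in the NETtalk experiments of Section 4:
    # 203 input, 120 hidden and 26 output nodes.
    n_in, n_hid, n_out = 203, 120, 26

    rng = np.random.default_rng(0)
    W_h = rng.uniform(-0.1, 0.1, (n_hid, n_in))   # hidden weight layer
    W_o = rng.uniform(-0.1, 0.1, (n_out, n_hid))  # output weight layer

    def forward(x):
        """One forward pass: hidden weight layer followed by output weight layer."""
        h = sigmoid(W_h @ x)   # hidden layer activations
        y = sigmoid(W_o @ h)   # output layer activations
        return h, y

    h, y = forward(rng.uniform(0.0, 1.0, n_in))   # one (random) training pattern
    print(y.shape)                                # (26,)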

Basic notation

Training Set. Consists of a number of training samples, each given by an input vector and the corresponding output vector.

(Training) Iteration. Denotes one presentation of the whole training set.

Weight updating strategies. Three different approaches:
- Learning by pattern (lbp): update the weights after each training pattern has been presented.
- Learning by block (lbb): update the weights after a subset of the training patterns has been presented.
- Learning by epoch (lbe): update the weights after all patterns have been presented (i.e. one training iteration).

Weight update interval, μ. Denotes the number of training patterns presented between weight updates. For lbp, μ = 1, while for lbe μ = P, where P is the number of patterns in the training set. (A minimal training-loop sketch showing how μ is used is given after this list.)

Degrees of parallelism. The BP algorithm reveals several different kinds of parallelism, as described in [3, 6]:
- Training session parallelism: start training sessions with different initial starting values on different processors.
- Training set parallelism: split the training set across the cells. Each cell has a local copy of the complete weight matrix and accumulates weight change values for its own training patterns. The weights are updated using lbb/lbe.
- Pipelining: pipeline the training patterns through the layers, i.e. compute the hidden and output layers on different processors. While the output layer processor calculates output and error values, the hidden layer processor concurrently processes the next training pattern. Pipelining requires a delayed weight update, i.e. lbb/lbe.
- Node parallelism: the nodes within a layer run in parallel. Further, the computation within each node may also run in parallel (Nordstrom [6] names this weight parallelism). With this method, the weights can be updated using lbp.
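The three weight updating strategies differ only in how often the accumulated weight changes are applied. The sketch below (an illustration under our own naming, not the implementation used in the paper) shows a plain serial BP training loop parameterized by the weight update interval: mu = 1 gives lbp, 1 < mu < P gives lbb, and mu = P gives lbe.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_one_iteration(X, T, W_h, W_o, mu, eta=0.5):
        """One pass over the training set with weight update interval mu:
        mu = 1 is learning by pattern (lbp), 1 < mu < P is learning by
        block (lbb) and mu = P is learning by epoch (lbe)."""
        dW_h = np.zeros_like(W_h)                    # accumulated weight changes
        dW_o = np.zeros_like(W_o)
        for p, (x, t) in enumerate(zip(X, T), start=1):
            h = sigmoid(W_h @ x)                     # forward pass, hidden layer
            y = sigmoid(W_o @ h)                     # forward pass, output layer
            d_o = (t - y) * y * (1.0 - y)            # output layer deltas
            d_h = (W_o.T @ d_o) * h * (1.0 - h)      # back propagated hidden deltas
            dW_o += eta * np.outer(d_o, h)
            dW_h += eta * np.outer(d_h, x)
            if p % mu == 0:                          # update every mu patterns
                W_o += dW_o; W_h += dW_h
                dW_o[:] = 0.0; dW_h[:] = 0.0
        W_o += dW_o; W_h += dW_h                     # flush any remaining changes
        return W_h, W_o

    rng = np.random.default_rng(0)
    P = 10                                           # a tiny made-up training set
    X, T = rng.uniform(0, 1, (P, 203)), rng.uniform(0, 1, (P, 26))
    W_h = rng.uniform(-0.1, 0.1, (120, 203))
    W_o = rng.uniform(-0.1, 0.1, (26, 120))
    W_h, W_o = train_one_iteration(X, T, W_h, W_o, mu=5)   # lbb with blocks of 5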

Parallel implementations of BP

Several BP implementations have been done on general purpose computers. Node parallelism is used to implement BP on the Intel iPSC/860 hypercube in [1] and on the MasPar MP-1 SIMD computer in [2]. A combination of training set and node parallelism for the MasPar MP-1216 is given in [4]. Common to all these implementations is the lack of possible adaptation to the large range of neural network applications; e.g., a node parallel implementation scales badly for small feed-forward networks. Kumar et al. [7] propose a hybrid scheme using node and training set parallelism for hypercube architectures. The method makes it possible to vary the number of processors assigned to each degree of parallelism and utilizes the hypercube communication network. However, the effect of μ is not considered.

As shown in [8], less frequent weight updates make the convergence rate slower (i.e. the decrease in error per iteration is smaller) during training. Since the main interest is to minimize the total training time, the selection of a proper μ cannot be omitted when training set parallelism is used. To obtain high speedup on a highly parallel computer, Torresen [9] showed that multiple inherent degrees of parallelism in the BP algorithm should be combined. The partitioning of the network when pipelining and node parallelism are combined is given in Figure 1. Moreover, we included training set parallelism. In this paper, we propose a flexible and scalable mapping based on these results. While the previous work was on fixed combinations of the degrees of parallelism, the new scheme allows arbitrary combinations of training set parallelism, pipelining and node parallelism.

Figure 1: Partitioning of the network, when pipelining and node parallelism are combined. The hidden layer is computed on one set of processors and the output layer on another (pipelining); the dotted lines show which nodes are mapped to each processor.

3 The proposed mapping scheme

Our mapping strives to obtain the best possible load balance between the processing cells for any neural network application. Hence, we allow the number of processors assigned to each degree of parallelism to be changeable. The proposed mapping is shown in Figure 2. It combines all degrees of parallelism except training session parallelism. Within each dotted rectangle, node parallelism is used. The box within the figure indicates the mapping of the elements for node parallelism, which is based on an ordinary parallelization of matrix-vector multiplication. The mapping can be simplified by removing the pipelining part, i.e., merging two boxes (one O-H pair) into one in the figure. The efficiency of including pipelining depends on the number of processors and the number of nodes in each layer [10].
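As stated above, node parallelism within each rectangle is an ordinary parallelization of matrix-vector multiplication. The following sketch (a serial simulation under assumed names, not the AP1000 code) illustrates the row-wise partitioning of a weight layer across N_NP cells.

    import numpy as np

    def partition_rows(n_rows, n_cells):
        """Split the nodes (matrix rows) of a weight layer across n_cells
        processors as evenly as possible."""
        bounds = np.linspace(0, n_rows, n_cells + 1, dtype=int)
        return [(bounds[i], bounds[i + 1]) for i in range(n_cells)]

    def node_parallel_matvec(W, x, n_cells):
        """Each simulated cell computes the output values of its own nodes;
        concatenating the partial results gives the full layer output."""
        parts = [W[lo:hi, :] @ x for lo, hi in partition_rows(W.shape[0], n_cells)]
        return np.concatenate(parts)

    rng = np.random.default_rng(1)
    W = rng.standard_normal((26, 120))      # e.g. the output weight layer
    x = rng.standard_normal(120)            # the hidden layer activations
    assert np.allclose(node_parallel_matvec(W, x, 4), W @ x)

Note that splitting 26 output nodes over 4 cells gives blocks of 6 and 7 rows, the kind of uneven load balance that shows up in the measurements of Section 4.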

Figure 2: BP mapping including training set parallelism, pipelining and node parallelism. The mappings range from (N_TSP, N_NP) = (1, C_x C_y / 2), i.e. only node parallelism and pipelining, through intermediate combinations such as (C_x/2, C_y), to (C_x C_y / 2, 1), i.e. only training set parallelism and pipelining. Each training set partition holds a copy of the whole network; H denotes the hidden layer, O the output layer, and C_x and C_y the number of processors in the horizontal and vertical directions. Below each processor mapping, the number of training set partitions (N_TSP) and the number of processors used for node parallelism (N_NP) are given.

The relation between N_TSP and N_NP can be expressed by:

    N_TSP = C_x C_y / (2 N_NP)                                        (1)

where C_x and C_y are the number of cells in the horizontal and vertical directions, respectively. If pipelining is removed:

    N_TSP = C_x C_y / N_NP                                            (2)
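Equations (1) and (2) constrain which (N_TSP, N_NP) combinations are available on a given cell array. A small helper like the one below (our own illustration, not part of the paper's implementation) enumerates them:

    def mappings(cx, cy, pipelining=True):
        """Enumerate the (N_TSP, N_NP) pairs satisfying equation (1) when
        pipelining is used, or equation (2) when it is removed, i.e.
        N_TSP * N_NP = cx*cy/2 or cx*cy respectively."""
        cells = cx * cy // 2 if pipelining else cx * cy
        return [(cells // n_np, n_np)
                for n_np in range(1, cells + 1) if cells % n_np == 0]

    # An 8 x 8 configuration without pipelining yields, among others, the
    # combinations measured in Figure 3b): (64,1), (32,2), (16,4), (8,8), ...
    print(mappings(8, 8, pipelining=False))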

For a given weight update interval, a larger number of training set partitions implies fewer training pattern computations on each processor between weight updates. Our scheme implies network contention, i.e., several processors may send messages concurrently over the same link. However, the communication time on the AP1000 remains reasonable when network contention exists: it is approximately doubled if 5 messages share the same channel, indicating that much time is spent within each cell on communication overhead and less on the physical transfer of data [11].

Before running a neural network application, we have to select the mapping giving the shortest possible total training time. Thus, for each mapping, we have to predict the total training time for a set of weight update intervals. The total training time is given by:

    T_total = T_1it(μ) N(μ)                                           (3)

where T_1it(μ) is the time for one iteration and N(μ) is the number of iterations needed. T_1it can be found either by estimation or by running each of the possible mappings for one iteration. N(μ) can be estimated from the error after a few iterations [12]. We should select the mapping giving the smallest possible T_total and use the corresponding μ-value.
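Putting equation (3) to work is then a simple minimization over the candidate mappings and μ-values. The sketch below assumes T_1it and N(μ) have already been measured or estimated as described above; the numbers are made up purely for illustration.

    def best_mapping(candidates):
        """candidates: iterable of ((N_TSP, N_NP), mu, t_1it, n_iter) tuples,
        where t_1it is the (measured or estimated) time per iteration and
        n_iter the estimated N(mu). Returns the candidate minimizing
        T_total = t_1it * n_iter, i.e. equation (3)."""
        return min(candidates, key=lambda c: c[2] * c[3])

    # Hypothetical numbers, for illustration only:
    candidates = [((64, 8), 906, 1.2, 220),
                  ((32, 16), 259, 1.5, 200)]
    mapping, mu, t_1it, n_iter = best_mapping(candidates)
    print(mapping, mu, t_1it * n_iter)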

4 Results and discussion

In this section we evaluate our proposed methods on the AP1000. We use the NETtalk [13] neural network application in our experiments. NETtalk is a two layer feed-forward network that transforms text to phonemes, using 203 input, 60-120 hidden and 26 output nodes. The training set consists of the 1000 most common English words (5438 characters in total). We have done experiments with μ = 63, 259, 494, 906, 1360, 2719 and 5438.

Parallel BP implementation with network contention

To be able to evaluate the benefit of a parallel back propagation algorithm that can vary the amount of each degree of parallelism, we implemented the lower part of Figure 2, but without pipelining. Hence, we are able to use a configuration of an even amount of node and training set parallelism, just training set parallelism, or an intermediate solution. The latter implies that there will be conflicts in the communication.
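Conceptually, each of the N_TSP training set groups accumulates its own weight change matrix for its share of the patterns, and the partial changes are then summed before the weights are updated; this summation is where the inter-cell communication (and, for some mappings, the contention) arises. The serial simulation below is our own sketch of that idea, with grad_fn standing in for the per-pattern BP weight change computation; it is not the AP1000 implementation.

    import numpy as np

    def split_training_set(patterns, n_tsp):
        """Partition the training patterns across n_tsp training set groups."""
        return np.array_split(patterns, n_tsp)

    def accumulate_group(W, group_patterns, grad_fn):
        """Each group holds a local copy of W and accumulates weight changes
        for its own training patterns only."""
        dW = np.zeros_like(W)
        for x in group_patterns:
            dW += grad_fn(W, x)
        return dW

    def lbb_update(W, patterns, n_tsp, grad_fn, eta=0.5):
        """One lbb/lbe-style update: sum the partial weight changes of all
        groups (done by message passing on the AP1000) and apply them."""
        groups = split_training_set(patterns, n_tsp)
        dW_total = sum(accumulate_group(W, g, grad_fn) for g in groups)
        return W + eta * dW_total

    # Toy usage with a dummy gradient function:
    rng = np.random.default_rng(2)
    W = rng.standard_normal((26, 120))
    patterns = rng.standard_normal((64, 120))
    W_new = lbb_update(W, patterns, n_tsp=8, grad_fn=lambda W, x: np.outer(x[:26], x))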

Figure 3: MCUPS performance for different combinations of node and training set parallelism running the NETtalk application with 120 hidden nodes. a) 4 x 4 processor configuration for μ = 906, with (N_TSP, N_NP) = (16,1), (8,2) and (4,4). b) 8 x 8 processor configuration for μ = 906 and 259, with (N_TSP, N_NP) = (64,1), (32,2), (16,4) and (8,8).

Figure 3a) shows the performance of the algorithm running the NETtalk structure on 16 processors with 120 hidden nodes. The weights are updated after 906 patterns, i.e. μ = 906. Performance is measured in MCUPS (Million Connections Updated Per Second). The configuration for each bar is indicated by (N_TSP, N_NP), as explained in the theory section. The fastest solution is not the even combination, but (N_TSP, N_NP) = (8, 2). This means that using 8 training set groups is better than using 4, even though network contention occurs. One reason is that mapping the 26 output nodes onto 4 processors gives uneven load balance, while mapping them onto 2 processors gives even load balance. However, there is not much difference in performance compared to the other configurations. That is, on a small number of processors the computation grains are quite large, so the way BP is parallelized is not that critical on a small system.

When the number of processors increases, as in Figure 3b), the difference in performance becomes larger. In this figure, we have included the results for μ = 259 in the lefthand bars and μ = 906 in the righthand bars. The combined solutions perform better than using only training set parallelism. However, when μ = 906, the fastest solution is, as in the previous figure, not the even combination, but (N_TSP, N_NP) = (16, 4). When the weights are more frequently updated, a smaller N_TSP-value is more beneficial, as seen in the bar chart.
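As a reminder of what the MCUPS numbers mean, the following sketch computes the metric from the connection count of the network and the time taken to present a given number of patterns. The function name, the choice to exclude bias weights and the timing value are our assumptions, not figures from the paper.

    def mcups(n_in, n_hid, n_out, n_patterns, seconds):
        """Million Connections Updated Per Second: connection count times the
        number of patterns presented, divided by the elapsed time.
        Bias weights are not counted here (an assumption on our part)."""
        connections = n_in * n_hid + n_hid * n_out
        return connections * n_patterns / (seconds * 1e6)

    # NETtalk with 120 hidden nodes and 5438 training patterns: if one
    # training iteration took, say, 12 s (a made-up figure), this gives
    print(mcups(203, 120, 26, 5438, 12.0))   # about 12.5 MCUPS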

Figure 4 shows the performance of NETtalk on a large system for several μ-values. The performance improves with less frequent weight updates. The most efficient configuration is (N_TSP, N_NP) = (32, 16) for frequent updates and (N_TSP, N_NP) = (64, 8) for rare updates.

Figure 4: MCUPS performance for different combinations of node and training set parallelism ((N_TSP, N_NP) = (32,16), (64,8) and (128,4)) on a 16 x 32 processor configuration running the NETtalk application. The number of hidden nodes is 120 and the performance for μ = 63, 259, 906 and 5438 is given.

Minimizing the total training time

In this section we show the relation between the weight update interval and the number of iterations needed to obtain convergence for the NETtalk training set. Moreover, we find the best weight update interval when running on the AP1000. Convergence was measured as the percentage of the characters that have obtained an error of less than 0.1. Due to representation limitations of the network, the number of patterns that are not trained stabilized slightly below 5%.

Figure 5 shows the number of iterations needed (N(μ)) to reduce the error to 5% for the investigated μ-values.

Figure 5: Number of iterations needed to obtain convergence, plotted against the weight update interval.

Figure 6: Total training time for NETtalk running on 512 cells, using 120 hidden units, plotted against the average weight update interval.

The time for one iteration, T_1it, can be computed from Figure 4 (not all μ-values are shown). Hence, we can find the best weight update interval by employing (3) for μ = 63, 259, 494, 906, 1360, 2719 and 5438. We get the total times shown in Figure 6. To avoid over-training, we omit accumulating the weight change for a pattern when its error is sufficiently small; thus T_1it decreases as training converges. Since we use a constant T_1it, the computed T_total will be larger than the real execution time. The best weight update interval is every 906 patterns with (N_TSP, N_NP) = (64, 8). Convergence is obtained after 265 s. Running the same program (with parallel parts removed) on a Sparc10 workstation, using learning by pattern, converged after 73 iterations and 169 minutes. The AP1000 speedup based on total execution time (T_1it N) is then 38 times. This is a conservative measure, since the decreasing T_1it on the AP1000 is not accounted for. Thus, training time can be reduced from hours to minutes by using parallel processing.
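The reported speedup can be checked with a one-line calculation using only the figures quoted above:

    serial_seconds = 169 * 60       # Sparc10, learning by pattern, 73 iterations
    parallel_seconds = 265          # AP1000, mu = 906, (N_TSP, N_NP) = (64, 8)
    print(round(serial_seconds / parallel_seconds, 1))   # ~38.3, i.e. the 38x above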

5 Conclusions

We have proposed a way to map the back propagation algorithm near optimally onto a parallel computer. A variable assignment of processors to the multiple degrees of BP parallelism has been exploited to best parallelize the large range of existing neural network applications. Most other BP implementations do not adapt the mapping strategy to the given neural network application; our approach includes this aspect. Moreover, we have considered the effect of the weight update interval on the number of iterations needed. Although more iterations are needed in the parallel implementation, the training time for a back propagation neural network can be reduced substantially through the use of parallel processing.

References

[1] Darin Jackson and Dan Hammerstrom. Distributing back propagation networks over the Intel iPSC/860 hypercube. In Proc. of Int. Joint Conference on Neural Networks, volume I, pages 569–574, 1991.
[2] G. Chinn et al. Systolic array implementations of neural nets on the MasPar MP-1 massively parallel processor. In Proc. of Int. Joint Conference on Neural Networks, volume II, pages 169–173, 1990.
[3] Alexander Singer. Implementation of artificial neural networks on the Connection Machine. Parallel Computing, 14:305–315, Summer 1990.
[4] Andreas Zell et al. Problems of massive parallelism in neural network simulation. In Proc. of IEEE Int. Conference on Neural Networks, pages 1890–1895, 1993.
[5] Hiroaki Ishihata et al. Third generation message passing computer AP1000. In Proc. of the International Symposium on Supercomputing, pages 46–55, Nov. 1991.
[6] Tomas Nordstrom and Bertil Svensson. Using and designing massively parallel computers for artificial neural networks. Journal of Parallel and Distributed Computing, 14(3):260–285, March 1992.
[7] Vipin Kumar et al. A scalable parallel formulation of the back propagation algorithm for hypercubes and related architectures. Trans. on Parallel and Distributed Systems, 5(10):1073–1090, October 1994.
[8] Helene Paugam-Moisy. Parallel neural computing based on neural network duplicating. In Ioannis Pitas, editor, Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks, chapter 10, pages 305–340. John Wiley & Sons, 1993.
[9] Jim Torresen et al. Parallel back propagation training algorithm for MIMD computer with 2D-torus network. In Proceedings of International Conference on Neural Network Processing, Seoul, Korea, volume 1, pages 140–145, October 1994.
[10] Jim Torresen et al. Exploiting multiple degrees of BP parallelism on the highly parallel computer AP1000. Submitted to the Fourth International Conference on Artificial Neural Networks.
[11] Hiroaki Ishihata. Performance evaluation of the AP1000. In Proc. of CAP Workshop, pages N-1–N-8, 1991.
[12] Jim Torresen et al. The relation of weight update frequency to convergence of BP. Submitted to World Congress on Neural Networks '95.
[13] Terrence J. Sejnowski and Charles R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145–168, 1987.
