A Parallel Implementation of a Neural-Network Based Object Classifier†

Enrique V. Carrera E.
COPPE Systems Engineering - UFRJ
Postal Box 68511
Rio de Janeiro - RJ, Brazil 21945-970
[email protected]

Paulo C. Amaral Pereira
Computer Science Department - UFMG
ICEX - Av. Antônio Carlos, 6627
Belo Horizonte - MG, Brazil 31270-010
[email protected]

† This research was partially supported by the CNPq.

Abstract

This paper presents a parallel implementation of an intelligent 2-D object classifier. The artificial vision system, based on neural networks, distinguishes among several different classes of images and provides invariance with respect to rotation, translation and scaling of the input image. The parallel implementation was built on the P-RIO environment, which facilitates parallel programming on a cluster of workstations. The results obtained in this work show a significant reduction of time in both learning and recognition tasks. Moreover, they demonstrate that a low-cost network of workstations may be used to achieve good speed-up in solving artificial vision tasks.

1. Introduction

Artificial neural networks (ANN) have been applied to many problems, including signal processing, control tasks, pattern recognition, time-series forecasting, and game playing, to mention just a few. Another area that has benefited from ANN models is the implementation of artificial vision systems [1, 2, 3]. Unfortunately, these models rely heavily on highly interconnected computational units functioning in parallel. Implementing the resulting massively parallel computation on sequential machines is therefore highly inefficient and extremely time consuming, as one processor has to simulate the models unit by unit. It is thus natural to develop implementations of these ANN models on parallel computers. These considerations are especially true for artificial vision systems, where the number of interconnections and processing units is very high.

The present paper describes a parallel implementation of an intelligent 2-D object classifier. The classifier distinguishes among several different classes of images and provides invariance with respect to rotation, translation and scaling of the input image [3]. The artificial vision system consists of two main tasks: pre-processing and classification. The pre-processing task executes a series of basic digital signal processing procedures on the matrix that represents the image to be classified. The classification task is implemented as a neural network that is trained with the back-propagation algorithm.

The parallel implementation of this classifier was done in the P-RIO environment [4, 5]. This environment offers an object-based software construction methodology whose high-level abstractions help inexperienced users build modular concurrent software. Both the pre-processing and the classification tasks were parallelized to reduce their computing time, using the facilities offered by P-RIO for parallel programming on a cluster of workstations.

The results obtained during the experiments show a considerable reduction of time in both learning and recognition tasks. They demonstrate that a low-cost network of workstations can be used by non-expert users to reach good speed-ups for artificial vision systems based on ANN.

The rest of this paper is organized as follows. The next section presents some background material on P-RIO and describes the intelligent 2-D object classifier. Section 3 describes the parallel implementation of the artificial vision system, and Section 4 presents the experimental results. Related work is presented in Section 5. Finally, Section 6 concludes the paper and suggests some future work.

2. Background

In this section we present some background material on the P-RIO environment and describe the intelligent 2-D object classifier.

2.1. The P-RIO System

The P-RIO (Parallel Reconfigurable Interconnectable Objects) environment [4, 5] is an object-based software construction methodology that provides high-level abstractions to help build modular concurrent software. The methodology rests on the configuration paradigm, in which a system or program is assembled by externally interconnecting modules taken from a library. To this end, P-RIO distinguishes between building the individual system modules and constructing a system from these modules.

Figure 1. Basic Module (a module with its connectors, code, and internal data)

Each module is composed of code and data (Figure 1). The data are modified only by the methods encapsulated in the associated code. The code can be written in any language; the modules used in this paper were implemented in C. Each module also has one or more communication points called connectors. They are the points of interaction with other modules, since they concentrate the communication activities. Composed modules, or a complete application, can be built by interconnecting the connectors exposed by the modules. Connectors thus serve as "glue" among the modules. There is no restriction on the number of modules used to implement a composed module or an application. Moreover, existing (primitive and composed) modules may be used to form new modules. Composition therefore supports module reuse and encapsulation.

A special configuration language was created to specify the module composition and the communication requirements of an application. The configuration process comprises the selection of modules and the definition of their interconnections. A graphical interface is supplied to help users with little programming experience carry out this process. The protocol used by the connectors can be selected at configuration time. Currently, UDP or TCP can be chosen, in addition to multicast dissemination. On machines with specialized interconnection architectures, particular protocols could be supported transparently.

The first version of P-RIO was implemented on top of the PVM (Parallel Virtual Machine) library [6]. The P-RIO environment performs PVM calls to build an easy-to-use communication interface that is highly portable across operating systems and architectures. Throughout the experiments, the P-RIO system ran on a network of SPARC workstations. The advantages of using a cluster of conventional workstations are the easy availability of such networks and their low cost compared to implementations on a supercomputer or on dedicated hardware.
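The module-and-connector model described in this section can be illustrated with a small sketch. It is only a toy model of the configuration paradigm, written in Python rather than in the C used for the paper's modules; the names `Module`, `link`, and `run` are hypothetical and do not correspond to the P-RIO configuration language or API. The point it tries to show is the separation between writing a module (its handler code) and wiring modules together afterwards.

```python
from collections import deque


class Module:
    """Code plus private data; other modules reach it only through connectors."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler   # the encapsulated code
        self.inbox = deque()     # input connector, modeled as a message queue
        self.targets = []        # output connector: modules linked at configuration time

    def step(self):
        """Handle one pending message; return True if any work was done."""
        if not self.inbox:
            return False
        result = self.handler(self.inbox.popleft())
        for target in self.targets:
            target.inbox.append(result)
        return True


def link(source, target):
    """Configuration step: connect the output of `source` to the input of `target`."""
    source.targets.append(target)


def run(modules):
    """Let the modules work until no message is left in any inbox."""
    while any(module.step() for module in modules):
        pass


# Module construction: each module only knows its own code.
collected = []
double = Module("double", lambda x: 2 * x)
increment = Module("increment", lambda x: x + 1)
sink = Module("sink", collected.append)

# System construction: the topology is decided outside the modules.
link(double, increment)
link(increment, sink)

double.inbox.extend([1, 2, 3])
run([double, increment, sink])
print(collected)  # [3, 5, 7]
```

Changing the topology, for example by linking `double` directly to `sink`, requires no change to the handler code, which is the property the configuration paradigm relies on.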

2.2. The Intelligent Artificial Vision System

The intelligent artificial vision system (Figure 2) parallelized in this work was presented in [3]. It can be divided into two main blocks: pre-processing and classifier.

Figure 2. Artificial Vision System (Image, Pre-processing, Neural Classifier)

The pre-processing block (Figure 3), which was also used in Acocella's work [7], executes a series of basic digital signal processing procedures (filters, transforms, convolutions, and so forth) on the matrix that represents the image. The input to the pre-processing task is a matrix of 256x256 pixels with 256 gray levels. First, a median filter is applied to attenuate any noise present in the captured image. Then the image borders are detected with Sobel's algorithm in order to locate the object within the system's field of vision, which reduces the size of the image to be processed.
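The first two pre-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the paper: the 3x3 window, the |Gx| + |Gy| magnitude, the treatment of border pixels, the threshold, and the helper names (`median_filter`, `sobel`, `bounding_box`) are all choices made here for the example, and a 9x9 image stands in for the 256x256 input.

```python
from statistics import median


def median_filter(img):
    """3x3 median filter; border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


def sobel(img):
    """Gradient magnitude |Gx| + |Gy| using the 3x3 Sobel kernels; borders are 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


def bounding_box(edges, threshold):
    """Smallest (top, left, bottom, right) box containing every strong edge pixel."""
    points = [(y, x) for y, row in enumerate(edges)
              for x, value in enumerate(row) if value >= threshold]
    if not points:
        return None
    ys, xs = zip(*points)
    return min(ys), min(xs), max(ys), max(xs)


# A 3x3 bright object plus one isolated noise pixel at row 1, column 7.
image = [[0] * 9 for _ in range(9)]
for y in range(3, 6):
    for x in range(3, 6):
        image[y][x] = 200
image[1][7] = 255

filtered = median_filter(image)
box = bounding_box(sobel(filtered), threshold=1)
print(box)  # (2, 2, 6, 6)
```

The median filter removes the isolated noise pixel, so the bounding box derived from the edge image encloses only the object. This is the sense in which edge detection reduces the area that later stages have to process.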

Figure 3. The Pre-processing Block (stages shown: Median Filter, Edge Detection, Center, Homomorphic Processing, Polar Coordinates, Circular Harmonic Coefficients, Mellin Transform)

Simultaneously, the original image undergoes homomorphic processing, which eliminates the distortions caused by different incidence angles of the light beams over the image. This effect arises from the different light sources used while the image is captured. After the homomorphic processing, the coordinate system is changed to a polar representation; this step takes the results of the edge-detection process and performs a sampling operation of 128 rays and 512 angles. The circular harmonic coefficients are then obtained, resulting in an image that is invariant to rotation. The last activity in the pre-processing task is the Mellin transform, which is applied to the set of circular harmonic coefficients. The result of this operation is the so-called intrinsic image, which is invariant to translation, rotation and scaling.

The classification task uses the same architecture as Perelmuter's work [3], which was implemented as an artificial neural network trained with the back-propagation algorithm. The network consists of one input layer with 16384 neurons (the coefficients obtained from the intrinsic image) and one output layer whose number of neurons equals the number of object classes to be classified (one neuron for each object class). A single hidden layer with 100 neurons was used in the experiments, to allow comparison with the previous work [3]. Since we are interested in the parallelization process, all the parameters used by the system are the same as in the previous works [7, 3], where they were reported to produce the best results for classification tasks.
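To give a feeling for the size of this network, the number of connection weights can be worked out directly from the layer sizes. The short sketch below assumes a fully connected network with five output neurons (the number of classes reported in Section 4); whether the original implementation used bias terms is not stated, so both counts are shown. The function name `connection_count` is introduced here only for illustration.

```python
def connection_count(layer_sizes, with_bias=False):
    """Number of weights in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # one weight per input of each neuron
        if with_bias:
            total += n_out             # optional bias per neuron (an assumption)
    return total


layers = [16384, 100, 5]               # input, hidden, output (five classes assumed)
print(connection_count(layers))                  # 1638900
print(connection_count(layers, with_bias=True))  # 1639005
```

Almost all of the roughly 1.6 million weights sit between the input and the hidden layer, which suggests why the size of the input dominates the cost of the classification task.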

Figure 4. Configuration in the P-RIO Environment (modules shown include Edges, ANN, and several Neural instances)

3. The Parallel Implementation

In the serial implementation of the intelligent classifier, the edge-detection and classification phases were observed to be the most expensive activities, taking 50% and 30% of the total CPU time, respectively. This is due to the complexity of Sobel's algorithm and to the large input required by the neural network. Furthermore, training the ANN took approximately two days on a SPARC II workstation. Thus, in an attempt to optimize the system's performance, the artificial vision system was parallelized using the P-RIO parallel facilities. The parallelization includes both the pre-processing operations and the neural-network based classifier. Details of this parallelization process are discussed in the following subsections.

3.1. Parallel Edge-Detection

Each operation of the pre-processing task was converted into a P-RIO module, since all the operations are completely independent of one another. The result of an operation is passed to the next one through the interconnection of their modules' connectors (Figure 4). Hence, the pre-processing phase was structured as a set of interconnected components working concurrently. In addition to simplifying the programming of the application, this methodology increases the concurrency of the system, since each module can be assigned to a different processor and executed in parallel with the rest.

Because the edge-detection phase requires the most CPU time, it was split into several P-RIO modules. Each edge-detection module works on a single part of the original matrix. In this way, the most CPU-intensive operation can be sped up by a factor that depends on the number of CPUs available in the system.
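One way to split edge detection over several workers is to give each worker a horizontal strip of the image plus one extra "halo" row above and below it, since a 3x3 Sobel window needs the neighbouring rows. The sketch below illustrates this decomposition under several assumptions: the paper does not say how the matrix was partitioned, the function names are hypothetical, and Python threads are used only to show the structure (they stand in for P-RIO modules on separate workstations and do not give real CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor


def sobel(img):
    """Gradient magnitude |Gx| + |Gy| using the 3x3 Sobel kernels; borders are 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


def split_rows(height, parts):
    """Split range(height) into `parts` contiguous, nearly equal (first, last) strips."""
    base, extra = divmod(height, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def edge_strip(img, first, last):
    """Work of one module: edges for rows first..last-1, computed from the strip
    plus one halo row on each side where such a row exists."""
    top, bottom = max(first - 1, 0), min(last + 1, len(img))
    local = sobel(img[top:bottom])       # only this sub-matrix would be sent to the module
    offset = first - top
    return local[offset:offset + (last - first)]


def parallel_sobel(img, workers):
    """Assumes workers <= number of image rows, so that no strip is empty."""
    strips = split_rows(len(img), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda strip: edge_strip(img, *strip), strips))
    return [row for part in parts for row in part]


# Synthetic 16x12 test image; the strip-wise result must match the serial one.
image = [[(7 * y * y + 13 * x + x * y) % 256 for x in range(12)] for y in range(16)]
print(parallel_sobel(image, 4) == sobel(image))  # True
```

Because each strip needs only two extra rows, the communication per module stays small compared with the computation, which is consistent with the near-linear pre-processing speed-up reported later in the paper.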

3.2. Parallel Back-propagation

The most common strategies to parallelize the training of artificial neural networks are:

- Pattern Parallel Training (PPT). This strategy replicates the entire network on each node and presents different patterns in parallel. After each batch of patterns, the nodes synchronize by exchanging weight update information.

- Network Parallel Training (NPT). This strategy splits the network across all the processors. Each pattern must be sent to every processor, and the whole weight updating process is local to each node.

The advantage of PPT over NPT is its lower communication requirement. The advantage of NPT over PPT is that the network scales with the number of nodes, enabling users to train very large networks. However, the problem is not restricted to the training of the network: the recall process is also an important part of this application. It was therefore decided to use the NPT scheme, which also reduces the time spent on the classification task.

Each layer was divided into several sets of neurons, and each set was implemented as a P-RIO module. The code for all these modules is the same; the only required change is the initial parameter that specifies the number of neurons and the number of inputs to each neuron. Thus, the module called "neural" was replicated several times and its instances were interconnected to form the desired topology (Figure 4). Accordingly, the entire neural network was created from just one type of module. This simplifies programming, provides flexibility in the construction of the topology, and permits code reuse.

4. Results

The neural network was implemented to mimic part of an intelligent vision system. It was trained to recognize five different classes of objects (Figure 5), which were also used in Acocella's work [7]. Each object is a mechanical part that has been rotated, scaled and/or translated with respect to the original image. The success rates obtained during the classification process are shown in Table 1.

Figure 5. Mechanical parts used in the classifier tests

Class   Rate
1       95%
2       74%
3       71%
4       52%
5       75%

Table 1. Classification Rates
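The network-parallel (NPT) organization of Section 3.2, in which one generic "neural" module is instantiated several times per layer, can be sketched for the recall (classification) step as follows. This is an illustration under assumptions rather than the paper's implementation: the class `NeuralModule`, the sigmoid activation, the random weights, and the tiny layer sizes are all chosen here for the example, and the modules run one after another instead of on separate processors. Training is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class NeuralModule:
    """A group of neurons from one layer. As in the text, instances differ only
    in their initial parameters: number of neurons and inputs per neuron."""

    def __init__(self, n_neurons, n_inputs, rng):
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                        for _ in range(n_neurons)]

    def forward(self, inputs):
        return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
                for row in self.weights]


def layer_forward(modules, inputs):
    """Every module of a layer receives the full input vector; the partial
    outputs are concatenated to form the layer output."""
    outputs = []
    for module in modules:   # under NPT each call would run on a different node
        outputs.extend(module.forward(inputs))
    return outputs


rng = random.Random(0)
n_inputs = 8
hidden = [NeuralModule(2, n_inputs, rng) for _ in range(3)]    # 6 hidden neurons in 3 modules
output = [NeuralModule(3, 6, rng), NeuralModule(2, 6, rng)]    # 5 output neurons in 2 modules

pattern = [rng.random() for _ in range(n_inputs)]
scores = layer_forward(output, layer_forward(hidden, pattern))
predicted_class = scores.index(max(scores))    # one output neuron per class
print(len(scores), predicted_class)
```

Splitting a layer this way does not change its output: concatenating the module outputs gives the same vector as a single module holding all the weights. Each module stores only its own slice of the weight matrix, but every input pattern has to be sent to every module, which matches the communication cost attributed to NPT in Section 3.2.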

Number of Processors   Execution Time (sec)   Speed-up
1                      168                    1.0
2                      85                     1.9
3                      62                     2.7
4                      44                     3.8
5                      36                     4.6

Table 2. Pre-processing Performance

Considering only the pre-processing task, the execution times show that the speed-up is approximately linear (Table 2). The pre-processing time of 168 seconds on one workstation was reduced to 36 seconds on five SPARC workstations. This is due to three main factors: a) the pipeline formed by the independent pre-processing phases; b) the concurrency between the edge-detection phase and the homomorphic phase; and, principally, c) the efficient parallelization of the edge-detection phase. It must be pointed out that in the sequential version of the pre-processing phase more than 70% of the execution time was spent in the edge-detection phase.

The classification phase, on the other hand, was reduced from 240 seconds on one workstation to approximately 110 seconds on five workstations. The relatively low speed-up (2.18 in this case) was due to contention in the parallel accesses to the input data stored in disk files. These files hold the network connection weights and several other parameters used by the application. They were stored in an NFS file system, which becomes a bottleneck when many concurrent read requests are sent to the same disk. One solution to this problem is to use a parallel file system, for example PIOUS (Parallel Input Output System), which is also based on PVM.

Besides the shorter execution times reached during these experiments, each workstation has lower memory requirements than in the sequential version. This is due to the better distribution of code and data obtained in the parallel version. It is also important to mention that the number of modules (and processors) used in each experiment may be modified at the configuration level, without further change to the code associated with the modules. This gives high flexibility and permits code reuse in other applications.

In addition to the performance gain, the approach offers some implementation conveniences:

- Theoretical analysis of the inherent parallelism can be made through the study of the communication requirements of each connector.

- The system is easy to use from the user's perspective, since a graphical interface allows on-line interaction between the user and the system.

- The code and modules are portable to several architectures and operating systems.
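The speed-up figures quoted in this section follow from the usual definition, speed-up = T(1) / T(p), and dividing by the number of processors gives the parallel efficiency. The short sketch below recomputes both from the reported execution times; the function names are introduced here for illustration, and efficiency is a derived quantity that the paper itself does not report.

```python
processors = [1, 2, 3, 4, 5]
preprocessing_time = [168, 85, 62, 44, 36]     # seconds, from Table 2


def speed_up(t_serial, t_parallel):
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    return speed_up(t_serial, t_parallel) / n_processors


for p, t in zip(processors, preprocessing_time):
    s = speed_up(preprocessing_time[0], t)
    e = efficiency(preprocessing_time[0], t, p)
    print(f"{p} processors: speed-up {s:.2f}, efficiency {e:.2f}")

# Classification phase: 240 s on one workstation, about 110 s on five.
classification_speed_up = speed_up(240, 110)
print(f"classification: speed-up {classification_speed_up:.2f}, "
      f"efficiency {efficiency(240, 110, 5):.2f}")
```

The recomputed pre-processing speed-ups agree with Table 2 when truncated to one decimal place. On five workstations the pre-processing efficiency is about 0.93, whereas the classification phase reaches only about 0.44, which quantifies the effect of the file-system bottleneck described above.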

5. Related Work

Several methods have been proposed for the implementation of back-propagation on distributed and parallel architectures. Rosemberg [9] maps each neuron onto a different processing element of a parallel architecture. In Hwang's approach [10], each processing element handles several neurons. Another solution [11] distributes the set of training patterns, the connections, or both over different processing elements. There are also many examples of special hardware [12, 13].

On the other hand, few approaches exploit loosely-coupled distributed systems, such as workstation clusters. One of the most interesting works in this field is reported in [14], which studies the optimal mapping, with respect to the learning time, of a multi-layer feed-forward network on message-passing multicomputers.

6. Conclusions

The intelligent artificial vision system presented in [3] can be used as a general image recognition system. However, it has a crucial drawback: its massively parallel processing is highly inefficient on sequential machines, producing extremely high execution times. The alternative of using a supercomputer or a dedicated machine also has its disadvantages, especially its high cost. This paper therefore proposed the use of a network of workstations to parallelize the intelligent classifier.

Both the pre-processing and the classification tasks were parallelized using the facilities of the P-RIO environment. The results obtained in the experiments show a significant reduction of time. This demonstrates that a low-cost network of workstations can be used by non-expert users to reach good speed-ups for artificial vision tasks based on ANNs. Moreover, the methodology adopted by P-RIO facilitates experimenting with different alternatives for a particular problem.

Our future work will search for new alternatives for the construction of vision systems and will optimize these alternatives through the parallelization of their processing units.

Acknowledgments

We are grateful to Emilio Acocella, who implemented the sequential version of the pre-processing task, and to the anonymous referees for comments that helped to improve this paper.

References

[1] Wechsler H., Computational Vision, Academic Press Inc., San Diego, 1990.
[2] Fukushima K., Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition, Neural Networks, Vol. 1, pp. 119-130, 1988, USA.
[3] Perelmuter G., Carrera E., Vellasco M., Pacheco M., Image Classification using Artificial Neural Networks, V Int'l. Conf. EPMESC, Macao China, April 1995.
[4] Carrera E. et al., P-RIO: The RIO Methodology on PVM, II EuroPVM Users' Meeting, Lyon France, pp. 89-94, September 1995.
[5] Loques O., Leite J., Carrera E., P-RIO: A Modular Parallel-Programming Environment, IEEE Concurrency, IEEE Computer Society, Vol. 6, No. 1, pp. 36-46, January-March 1998.
[6] Geist A. et al., Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.
[7] Acocella E., Grivet M., Extração de Invariâncias em Processamento de Imagem Aplicado a Visão Computacional, X Congresso Brasileiro de Automática, Vol. 2, pp. 879-884, Rio de Janeiro Brazil, September 1994.
[8] Rumelhart D.E. et al., Parallel Distributed Processing, MIT Press, Massachusetts, 1986.
[9] Rosemberg C. et al., An implementation of network learning on the connection machine, 10th Int. Conf. on AI, Milan, Italy, 1987.
[10] Hwang J.N. et al., A systolic neural-network architecture for hidden markov model, IEEE Trans. on ASSP, Vol. 37, No. 12, 1989.
[11] Pomerleau D. A. et al., Neural network simulation at warp speed: how we got 17 million connections per second, Proc. of IEEE ICNN, San Diego, USA, 1988.
[12] Hammerstrom D. A., VLSI architecture for high-performance, low cost, on chip learning, IJCNN'90, San Diego, USA, 1990.
[13] Ramacher U. et al., Design of a 1st generation neurocomputer, VLSI Design of Neural Networks, Kluwer Academic, 1991, pp. 271-310.
[14] Chu L.C., Wah B.W., Optimal mapping of neural networks learning on message passing multicomputers, JPDC, No. 14, 1992.
