A New Orthogonal Multiprocessor and its Application to Image Processing

L. A. Sousa email: [email protected]

M. S. Piedade email: [email protected]

DEEC IST/INESC R. Alves Redol,9 1000 Lisboa, PORTUGAL

Abstract

In this paper a new orthogonal partially shared memory architecture for the design of multiprocessor systems is proposed. The architecture allows processors to partially share a 2-D array of memory modules in an orthogonal way, with fewer limitations than those imposed by the traditional orthogonal multiprocessor (OMP) architecture. Processors have direct access to large neighborhoods of memory modules, which can be used to improve processing efficiency, namely for local processing. The main characteristics of the new architecture are analyzed and compared with those of the traditional orthogonal architecture. Image processing algorithms have been mapped onto the new architecture in order to evaluate its efficiency for local processing. The fault-tolerant characteristics of the system are also discussed.

1 Introduction

Various multiprocessor architectures with shared memory and distributed memory have been proposed for the development of parallel systems [2]. In the case of distributed memory architectures, processors have their own local memory and communicate through interconnection channels. In the case of shared memory architectures, all processors have access to a common memory, but memory access conflicts may occur. Nevertheless, in this case each processor has access to the data stored in the common memory at any time. Shared memory multiprocessor architectures can be grouped in two main classes: fully shared and partially shared memory architectures. Fully shared memory multiprocessor systems place few limitations on memory sharing, but are either expensive or require complex processor-memory interconnection networks. Partially shared memory multiprocessors, especially orthogonal multiprocessor (OMP) systems [5, 8], are simpler and cheaper, but their memory access limitations may reduce processing efficiency. This may occur in orthogonal multiprocessor systems whenever processors need to access data which are not stored in memory directly connected to their corresponding row/column buses.

In this paper, a new partially shared memory multiprocessor architecture is proposed in section 2. It is shown that it inherits all the interesting properties of an OMP system, such as a regular and simple structure, conflict-free memory access and easy control. Image processing has been one of the most successful application areas of parallel systems [7]. In section 3, some usual image processing algorithms are mapped onto the new architecture and the efficiency of the architecture is evaluated for local processing. In section 4 we discuss the ability of the new architecture to operate under defective conditions (with fewer available resources). Finally, conclusions are drawn in section 5.

2 The New Orthogonal Multiprocessor Architecture

An OMP system with $n$ processors uses $n^2$ memory modules organized in a 2-D array. A dedicated bus is used to connect each processor to the memory. Each processor can access memory in one of two mutually exclusive modes (Fig. 1): in row mode processor $P_i$ has exclusive access to the memory modules in row $i$, and in column mode processor $P_i$ has exclusive access to the memory modules in column $i$ ($0 \le i < n$). Therefore, processor $P_i$ has direct access to a full row and a full column of memory modules ($M_{i,i}$ can be viewed as local memory). It can access adjacent memory modules in the horizontal direction for row $i$, and in the vertical direction for column $i$. It can move data and communicate with processor $P_j$ using two memory modules, $M_{i,j}$ and $M_{j,i}$ ($0 \le i, j < n$).

Figure 1: Diagram of an OMP architecture with n processors. [Diagram: an $n \times n$ array of memory modules $M_{0,0}$ to $M_{n-1,n-1}$, with processors $P_0$ to $P_{n-1}$ attached to the row and column buses.]

Supposing that the data required for local processing are distributed over neighboring memory modules, these modules should be directly accessible to a single processor in order to minimize data moves and maximize the processing efficiency. To increase the number of adjacent memory modules that can be directly accessed by a single processor, some of the memory access restrictions of orthogonal architectures must be removed. This is achieved with the new architecture while maintaining the orthogonal memory access property.

In the new architecture, depicted in Figure 2, a two-way switch (positions up/down) is placed in the path from each processor to the corresponding dedicated bus. With the switch in position 'up', processor $P_i$ is connected to the bus of row and column $i$, while in position 'down' $P_i$ is connected to the bus of row and column $(i+1) \bmod n$ (see footnote 1). The first processor-memory connection is characteristic of a traditional orthogonal multiprocessor. The new connection gives processor $P_i$ the ability to access data placed in the adjacent row and column of memory modules, formerly associated only with processor $P_{i+1}$. Two new mutually exclusive modes of memory access are introduced, the aligned mode and the advanced mode, leading to a total of four memory access modes: aligned row, aligned column, advanced row, and advanced column.

Figure 2: Diagram of the new architecture with n processors. [Diagram: the same $n \times n$ array of memory modules $M_{0,0}$ to $M_{n-1,n-1}$ with processors $P_0$ to $P_{n-1}$ on the row and column buses.]

Table 1: Access of $P_i$ ($0 \le i < n$) to data stored in memory.

Directly accessible modules:

    Mode        Row            Column
    Aligned     $M_{i,:}$      $M_{:,i}$
    Advanced    $M_{i+1,:}$    $M_{:,i+1}$

Data moving ($P_i \leftrightarrow P_j$):

    Case $i \ne (j+1) \wedge j \ne (i+1)$
        Aligned:    $M_{i,j}$, $M_{j,i}$
        Advanced:   $M_{i,j+1}$, $M_{i+1,j}$, $M_{j,i+1}$, $M_{i+1,j+1}$, $M_{j+1,i}$, $M_{j+1,i+1}$
    Case $i = j+1$
        $M_{i,:}$, $M_{:,i}$, $M_{i+1,j}$, $M_{j,i+1}$

Table 1 presents the memory modules directly accessible to each processor, and all the modules that may be used to move data between any pair of processors. With the new architecture each processor has direct access to approximately twice as many memory modules as in a conventional OMP system. For any module of row and column $i$, processor $P_i$ has direct access to neighborhoods of six memory modules, against only two in an OMP system. The number of memory modules shared by each pair of processors has increased from 2 to $(2 \times n + 1)$ for processors with consecutive indexes, and from 2 to 8 in all the other cases (see Table 1). These results show that the new architecture has better characteristics for local processing.
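The module counts quoted above can be checked with a small model. The following Python sketch is an illustration only, not part of the proposed system: the function names `accessible` and `shared` are hypothetical, and a module is assumed to be reachable by $P_i$ whenever it lies on one of the rows or columns that the processor can select through its switch.

```python
from itertools import product


def accessible(i, n, advanced=True):
    """Set of (row, col) modules that processor P_i can reach directly.

    With advanced=False only row i and column i are reachable (traditional
    OMP); with advanced=True rows/columns i and (i + 1) mod n are reachable.
    """
    lines = {i, (i + 1) % n} if advanced else {i}
    return {(r, c) for r, c in product(range(n), repeat=2)
            if r in lines or c in lines}


def shared(i, j, n, advanced=True):
    """Modules that both P_i and P_j can reach, usable to move data."""
    return accessible(i, n, advanced) & accessible(j, n, advanced)
```

For n = 8, `accessible(2, 8, advanced=False)` holds 2n - 1 = 15 modules and `accessible(2, 8)` holds 4n - 4 = 28, which is the "approximately twice" stated in the text. The counts for pairs of processors assume n is at least 4, so that non-consecutive indexes exist.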

3 Local Processing Algorithms

In this section two parallel algorithms for image processing, median filtering and image segmentation, are proposed. It is assumed that the array of pixels is mapped onto the memory array of the $n$-processor orthogonal multiprocessor system for parallel processing. A mapping that preserves the relative position of the pixels of an $N \times N$ image ($I[i,j]$, $0 \le i, j < N$) in the memory array is:

$$M_{\lfloor i/k \rfloor, \lfloor j/k \rfloor}[\, i \bmod k,\; j \bmod k \,] \leftarrow I_{i,j}, \qquad k = N/n.$$

The algorithms are presented in Figures 3 and 4. The construct do all processors ... end all is used to indicate that the processing of the block inside the construct is performed in parallel. The processing inside pairs of braces ({ }) is performed by each processor in a similar way to the sequential algorithm.

The median filtering algorithm replaces each pixel by the median value of a window ($w \times w$) containing neighbor pixels. The median values can be computed in sequence by columns (or rows), starting from the first pixel and updating the values for consecutive pixels according to the pixels that leave and enter the window [4]. Processing can be straightforwardly parallelized with maximum efficiency whenever the pixels of the windows can be directly accessed. This is true for pixels $I[0, i \times k]$ to $I[N, (i+1) \times k - w]$, for all processors $P_i$ ($0 \le i < n$), when $k \ge w - 1$ and the architecture is in the aligned column mode.

Footnote 1: In the rest of this summary we omit the mod operation, which affects only processors with index $n - 1$.

The parallel algorithm presented in Fig. 3 can be used to compute the median for pixels with neighborhoods not directly accessible to a single processor. Pixels in the last $(w - 1)$ columns of the sub-images placed in columns of memory modules are processed in parallel. The histogram and the median value for the first pixel of each image column are computed separately. For the remaining pixels, the information contained in the histogram is updated with the values of the pixels that leave and enter the window. With the system in the advanced column mode, processor $P_\pi$ updates the histogram for pixels of the window in the columns beyond $(\pi + 1) \times k - 1$, which are not directly accessible in traditional systems. This algorithm neither implies any data moves between processors nor involves any conflict in the memory access modes requested. For images with $L$ intensity levels, $n \times L \times 2 \times \log_2 w$ bits of memory are required for storing the histograms, and the memory access mode has to be changed $2 \times w \times N$ times.

When the median algorithm is computed in a traditional orthogonal system, pixels that are accessed in the advanced mode must be moved between processors (for each window). With the system in column mode, processor $P_\pi$ moves all pixels required by processor $P_{\pi-1}$ to the memory module $M_{\pi-1,\pi}$. Accessing the memory in row mode, processor $P_{\pi-1}$ has direct access to this memory module and updates the histogram with the values of those pixels. The total number of pixels that have to be moved by each processor is

$$2N \sum_{i=1}^{w-1} i = w \times (w - 1) \times N.$$
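The sequential scheme that each processor follows, sliding the window down a column and updating a histogram with the pixels that leave and enter it, can be sketched as below. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the names `column_median_filter` and `median_from_histogram` are hypothetical, the window is anchored at its top-left pixel, only positions where the window fits inside the image are produced, and the median is re-read from the histogram at every step instead of being tracked incrementally as in [4].

```python
def median_from_histogram(hist, rank):
    """Smallest intensity level whose cumulative count exceeds `rank`."""
    total = 0
    for level, count in enumerate(hist):
        total += count
        if total > rank:
            return level
    raise ValueError("histogram holds too few samples")


def column_median_filter(image, w, levels=256):
    """Median-filter `image` with a w x w window, sliding down each column."""
    rows, cols = len(image), len(image[0])
    rank = (w * w) // 2          # 0-based rank of the median among w*w values
    out = [[0] * (cols - w + 1) for _ in range(rows - w + 1)]

    for c in range(cols - w + 1):
        # Build the histogram of the first window of this column.
        hist = [0] * levels
        for r in range(w):
            for m in range(c, c + w):
                hist[image[r][m]] += 1
        out[0][c] = median_from_histogram(hist, rank)

        # Slide the window down one row at a time.
        for r in range(1, rows - w + 1):
            for m in range(c, c + w):
                hist[image[r - 1][m]] -= 1        # pixel leaving the window
                hist[image[r + w - 1][m]] += 1    # pixel entering the window
            out[r][c] = median_from_histogram(hist, rank)
    return out
```

In the parallel version of Fig. 3 the inner loop over the window columns `m` is split in two parts, one served in the aligned column mode and one in the advanced column mode; the update itself is unchanged.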

The number of pixels moved grows linearly with the image size ($N$) and with the square of the window size ($w$), that is, $O(N \times w^2)$. If a fixed number of compare and arithmetic operations is considered, which can be the average number of operations for all pixels, the complexity of the processing part of the algorithm is $O(k \times N \times w)$. For $k = w$, the number of data moves required grows at the same rate as the number of operations required to process the sub-images ($O(N^3/n^2)$). It can thus considerably reduce the system performance. Moreover, another disadvantage of a traditional system is the number of extra memory bits ($2 \times n \times w \times \log_2 L$) it requires to compute the median.

The second algorithm discussed is for image segmentation, another important operation that belongs to an intermediate level of processing [3]. The homogeneity of a region and the discontinuity between regions are the two properties usually exploited for image segmentation. The processing is normally applied in sequence and restricted to local areas of the image. For example, an algorithm for labeling and feature extraction of a binary image sequentially compares the coordinates of the transitions which occur in pairs of adjacent image rows [1].

The main phases of a typical parallel algorithm for image segmentation using a divide-and-conquer strategy are presented in Figure 4. With the system in the aligned row mode, processors analyze different sub-images, composed of $k$ complete image rows each, in parallel. The resulting information is uniformly distributed over all modules of the corresponding rows. Regions which have elements on the first row and on the last row of the sub-image must be marked. They may not be really independent regions and should be combined with pseudo-regions of other sub-images. In a second phase, each processor has direct access to the information about the regions of two adjacent sub-images. No extra memory is required for combining the pseudo-regions that cross their common border if the memory access mode is alternated between aligned and advanced.

The complexity of the algorithm is the sum of the complexities of both phases. In the first phase the complexity depends on the distribution of the processing over the processors, that is, on how the regions are spread over the sub-images. In the worst case, all the regions ($R$) are concentrated in just one sub-image ($O(\#R)$), and in the best case the regions are equally shared out by all the sub-images ($O(\#R/n)$). The complexity of the second phase of the algorithm depends on how the regions cross the boundaries of the sub-images. The best case is when no region crosses the boundaries of the sub-images ($O(1)$), and the worst when the majority of the regions cross a great number of the sub-image boundaries ($O(\#R \times n)$).

The implementation of the second phase of the algorithm in a traditional orthogonal system is much more difficult. The results for a sub-image have to be transferred between adjacent processors.
The results can be centralized in the modules shared by adjacent processors or must be moved during the combination phase. If the results are centralized, the algorithm requires that $n$ memory modules have $n$ times more capacity than the remaining ones. Otherwise, the memory access mode has to be changed a number of times proportional to the largest number of regions to combine in each iteration of the combination phase of the algorithm. It is difficult to distribute this task over the processors in a balanced way, which contributes even more to a decrease of the system's efficiency.

We have made some computer simulations of the operation of orthogonal systems using specific simulation programming tools [9]. The complexity of a simple request for changing the architecture memory access mode is considered one order of magnitude larger than that of a simple arithmetic instruction. For the algorithms presented in this section, it has been concluded that the efficiency gains achieved with the proposed architecture increase with the need for data transfer between processors. Efficiency gains of about 20% are reached when the values of $w$ and $N/n$ are similar in the median filtering algorithm. The efficiency gains for the image segmentation algorithm vary considerably with the image contents [10].

    do all processors (π = 0, ..., n - 1)
      // I_O: Original Image; I_P: Processed Image
      for l = 0 to w - 2 do begin          // For each of the w - 1 image columns
        { advanced column mode }
        // Compute the HISTOGRAM vector for the pixel I_O[0, (π + 1) × k - w + 1 + l]
        for m = (π + 1) × k to (π + 1) × k + l do
          for i = 0 to w - 1 do
            HISTOGRAM[I_O[i, m]]++
        { aligned column mode }
        for m = (π + 1) × k - w + l + 1 to (π + 1) × k - 1 do
          for i = 0 to w - 1 do
            HISTOGRAM[I_O[i, m]]++
        { Computing the median value P (from the histogram) }
        I_P[0, (π + 1) × k - w + 1 + l] := P
        // Process the remaining N - 1 pixels of the image column (π + 1) × k - w + 1 + l
        for i = 1 to N - 1 do begin
          { advanced column mode }
          for m = (π + 1) × k to (π + 1) × k + l do begin
            HISTOGRAM[I_O[i - 1, m]]--
            HISTOGRAM[I_O[i + w - 1, m]]++
            { Updating the number of pixels of the neighborhood with a value below P }
          end(m)
          { aligned column mode }
          for m = (π + 1) × k + l - w + 1 to (π + 1) × k - 1 do begin
            HISTOGRAM[I_O[i - 1, m]]--
            HISTOGRAM[I_O[i + w - 1, m]]++
            { Updating the number of pixels of the neighborhood with a value below P }
          end(m)
          { Computing the new value of P (new pixel) }
          I_P[i, (π + 1) × k - w + 1 + l] := P
        end(i)
      end(l)
    end all

Figure 3: Median filtering algorithm for the proposed system.
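The two-phase divide-and-conquer segmentation discussed in this section can be illustrated sequentially. The sketch below is a simplified model under several assumptions that are not taken from the paper: the image is binary, regions are 4-connected, the number of strips `n` divides the number of rows, and the pseudo-regions are combined with a union-find structure. The names `label_strip`, `find` and `count_regions` are hypothetical.

```python
from collections import deque


def label_strip(image, top, bottom, labels, next_label):
    """Label 4-connected foreground regions using only rows [top, bottom)."""
    cols = len(image[0])
    for r in range(top, bottom):
        for c in range(cols):
            if image[r][c] and labels[r][c] == 0:
                labels[r][c] = next_label
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (top <= ny < bottom and 0 <= nx < cols
                                and image[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                next_label += 1
    return next_label


def find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a


def count_regions(image, n):
    """Count the regions of a binary image split into n horizontal strips."""
    rows, cols = len(image), len(image[0])
    k = rows // n                       # rows per sub-image (n must divide rows)
    labels = [[0] * cols for _ in range(rows)]

    # Phase 1: every strip is labelled independently (one processor each).
    next_label = 1
    for p in range(n):
        next_label = label_strip(image, p * k, (p + 1) * k, labels, next_label)

    # Phase 2: combine pseudo-regions that touch across a strip border.
    parent = list(range(next_label))
    for p in range(n - 1):
        last, first = (p + 1) * k - 1, (p + 1) * k
        for c in range(cols):
            if image[last][c] and image[first][c]:
                a = find(parent, labels[last][c])
                b = find(parent, labels[first][c])
                parent[a] = b
    return len({find(parent, lab) for lab in range(1, next_label)})
```

Phase 2 only reads the last row of one strip and the first row of the next, which corresponds to the pair of adjacent sub-images that a single processor reaches by alternating between the aligned and advanced modes.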

4 System Operation under Defective Conditions

The proposed modifications for orthogonal systems allow all processors to directly access a greater number of memory modules, and memory modules to be shared by more processors. This is useful when a fault occurs in a device, causing incorrect operation. By disconnecting the device from the rest of the system and taking advantage of the new system characteristics, it is possible to keep the system working properly.

Assuming that one processor ($P_i$) stops working properly, in a traditional orthogonal architecture the data in module $M_{i,i}$ can no longer be accessed, while the data stored in the remaining memory modules directly connected to the processor can only be accessed by various processors. Therefore, it is very difficult to keep the system working in acceptable conditions. With a system based on the new architecture, processor $P_{i-1}$ may carry out the processing that was assigned to $P_i$: with the advanced modes, processor $P_{i-1}$ has direct access to the memory modules in row and column $i$. In this case, processor $P_{i-1}$ becomes responsible for the processing of two processors and the system continues to work properly. This procedure can be extended to more processor faults, provided that the faulty processors are not consecutive.

In order to evaluate the system performance degradation with processor faults, we have simulated its operation in normal and fault conditions. The metric used for this evaluation is the effective memory bandwidth, which is defined as the number of memory requests accepted per cycle. The simulation model assumes that the processors generate memory requests synchronously and that each processor generates its own memory requests in a random and independent way with a probability $p$ [6]. The memory requests are uniformly distributed over all the memory modules. The required number of mutually exclusive memory access modes has been increased from 2 to 4 to model the alterations in system operation when a fault occurs in one of the processors. This is the alteration that causes the largest reduction in the effective memory bandwidth, since it requires doubling the number of memory access modes.

The effective bandwidth (BW) values are plotted in Fig. 5 for different memory request generation rates ($p$). The difference between the effective bandwidth values for a system in normal and defective conditions grows when the probability $p$ of requesting the memory increases. Processors request the memory more often, which causes a relatively larger number of rejected memory requests. The difference between the effective bandwidth in both operation modes also grows with the number of processors. However, these differences increase at a lower rate, with the growth of the number of memory requests, as the number of processors increases. The BW differences grow from 16% to 36% when $p$ increases from 0.1 to 1 in a system with 16 processors, and from 30% to 43% in a system with 64 processors.

    do all processors (π = 0, ..., n - 1)
      // I: Image
      { aligned row mode }
      { Segmenting sub-image I[π × k, 0] ... I[(π + 1) × k - 1, N - 1] }
      // Information about the regions is uniformly distributed over the memory modules
      { advanced row mode }
      while (sub-images with regions crossing their borders are not all combined)
        if (all the regions (R) present in the first row (I[(π + 1) × k]) end on the respective sub-image) then
          { Combining information of the regions of the 2 adjacent sub-images,
            by alternating the memory access mode between advanced and aligned }
        else
          // Nothing can be done in this iteration
      end{while}
    end all

Figure 4: Image segmentation algorithm for the proposed system.
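The takeover rule of this section, in which $P_{i-1}$ replaces a faulty $P_i$ unless $P_{i-1}$ is itself faulty, can be written as a short check. The function name `takeover_plan` and its return convention are hypothetical, and indexes are taken modulo $n$, as in the definition of the advanced modes.

```python
def takeover_plan(n, faulty):
    """Map each faulty processor index to the neighbour that replaces it.

    P_{i-1} (index taken mod n) takes over the work of a faulty P_i by using
    its advanced modes. Returns None when two faulty processors are
    consecutive, because the designated replacement is then faulty as well.
    """
    faulty = set(faulty)
    plan = {}
    for i in sorted(faulty):
        backup = (i - 1) % n
        if backup in faulty:
            return None
        plan[i] = backup
    return plan
```

For example, `takeover_plan(8, [0, 4])` assigns the work of $P_0$ to $P_7$ and the work of $P_4$ to $P_3$, while `takeover_plan(8, [2, 3])` reports that no plan exists.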

5 Conclusions

In this paper, a new orthogonal multiprocessor is proposed. While maintaining the orthogonal memory access principle, the architecture allows direct access to larger neighborhoods of memory modules. Typical image processing algorithms have been mapped onto the new architecture and its advantages have been demonstrated for local processing. The possibility of system operation in defective conditions was considered, and the fault-tolerant characteristics of the new architecture were discussed. It was shown that the new orthogonal system has enhanced capabilities for local processing and an improved ability to operate under faulty conditions when compared to conventional orthogonal systems. The reduction of the effective memory bandwidth introduced by a processor fault has been estimated. For the cases studied, we estimate a maximum reduction of the effective memory bandwidth of less than 50%, for a 64-processor system that requests access to the memory in all cycles.

References

[1] Gerald J. Agin. Handbook of Industrial Robotics, chapter Vision Systems. John Wiley & Sons, 1985.

[2] Ralph Duncan. A survey of parallel computer architectures. IEEE Computer, 23(2):5-16, February 1990.

[3] Robert M. Haralick and Linda G. Shapiro. Computer and Robot Vision, volume I. Addison-Wesley, 1992.

[4] T. S. Huang, G. Y. Yang, and G. Y. Tang. A fast two-dimensional median filtering algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(1):13-18, February 1979.

[5] Kai Hwang, Ping-Sheng Tseng, and Dongseung Kim. An orthogonal multiprocessor for parallel scientific computations. IEEE Transactions on Computers, 38(1):47-61, January 1989.

[6] Janak H. Patel. Performance of processor-memory interconnections for multiprocessors. IEEE Transactions on Computers, C-30(10):771-780, October 1981.

[7] Ioannis Pitas, editor. Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. Wiley Series in Parallel Computing. John Wiley & Sons, England, 1993.

[8] Isaac D. Scherson and Yiming Ma. Analysis and applications of the orthogonal access multiprocessor. Journal of Parallel and Distributed Computing, 7(2):232-255, October 1989.

[9] Leonel Sousa and Moises Piedade. Simulation of SIMD and MIMD shared memory architectures on UNIX based systems. In IEEE International Symposium on Circuits and Systems, pages 637-640, California, May 1992. IEEE Circuits and Systems Society.

[10] Leonel Augusto Sousa. Parallel Image Processors with Orthogonal Access to Shared Memory. PhD thesis, Instituto Superior Tecnico, Lisboa, 1996. (Available only in Portuguese.)

Figure 5: Effective bandwidth (BW, vertical axis) of the system as a function of p (horizontal axis, 0.1 to 1), in normal mode and with one processor fault, for: a) 16 processors; b) 64 processors.
