Parallel Architecture Dedicated to Connected Component Analysis

Eril Mozef, Serge Weber, Jamal Jaber, and Etienne Tisserand
Laboratoire d'Instrumentation Electronique de Nancy (L.I.E.N), Université de Nancy I, BP 239, 54506 Vandoeuvre Cedex, FRANCE
Phone: (33) 83 91 20 71, Fax: (33) 83 91 23 91

Abstract

This work presents the design of a dedicated parallel architecture for connected component analysis. Categorized among one-dimensional array processors, for an image of n×n pixels the proposed architecture has n-1 linear PEs, n^2 CAM memory modules, and a tree structure of (n/2)-1 switches allowing communication through the global bus in O(log n) units of propagation time. Well suited for low- and intermediate-level vision, this architecture allows sequential processing through its line structure, which is perfectly adapted to real-time image analysis from any interlaced-mode video signal. This paper presents algorithms for connected component labeling and for area and perimeter determination, all of which are in O(n log n). The performances of the proposed architecture are compared with another architecture type. The simulation results, the possibility of implementation, and future work are discussed.

1. Introduction

The enormous amount of information to be processed and the constant development of artificial vision and real-time processing have prompted a large body of research on parallel architectures and algorithms [1], [2]. Among these architectures, one-dimensional array processors are of special interest because they are well suited to simple interconnection and efficient hardware implementation [3], [4], [5], [6], [7].

This paper presents the design of a parallel architecture dedicated to connected component analysis. Compared to the above-mentioned architectures, the proposed architecture uses distributed CAM (Content Addressable Memory) modules providing a PSMU (Parallel Search and Multiple Update) operation. It has n-1 PEs (Processing Elements), which operate synchronously in SIMD mode. The proposed architecture provides extremely efficient global communication, as communication between a PE and the CAM modules is ensured in O(log n) units of propagation time by a tree structure of switches. Hence this architecture is well suited to real-time processing and intermediate-level vision. This article also puts forth labeling, area determination, and perimeter determination algorithms, thereby broadening the application of the proposed architecture. These algorithms have a complexity of O(n log n).

The organization of the proposed architecture is described in the following section. Each algorithm is then detailed. Finally, we compare the performances of the proposed architecture with another type of architecture, and discuss the simulation results, the possibility of implementation, and future work.

1015-4651/96 $5.00 © 1996 IEEE. Proceedings of ICPR '96

Fig. 1. The proposed architecture

2. The Proposed Architecture

The organization of the proposed architecture is depicted in Fig. 1. It consists of 2 memory planes denoted M1[i,j] and M2[i,j] (0 ≤ i, j ≤ n-1), local operator modules LO_j, n-1 processors PE_r (0 ≤ r ≤ n-2), local and global communication, and a tree structure of switches.

2.3. The Local Operator

The local operator LO_j computes any logical function of 5 locations of the M1 and/or M2 planes, i.e. M[i-1,j], M[i+1,j], M[i,j-1], M[i,j+1], and M[i,j]. Since LO_j is placed in column 0, i = 0; each LO_j therefore processes M[n-1,j], M[1,j], M[0,j-1], M[0,j+1], and M[0,j]. This operator can be used to indicate that the current pixel in M[0,j] belongs to the object boundary, or to transfer data locally. LO_j also has a simple local up-counter to increment the M[0,j] value for area and perimeter determination.

2.1. The Processing Element (PE)

Each PE has 2 principal functions, the MinMax function and the Adder function, which are realized by a magnitude comparator and an adder of 2 log n bits. The MinMax function multiplexes 2 adjacent memory values onto the global bus: after comparison, the smaller value is multiplexed to the data bus and the larger to the address bus. The activation of each PE depends on the merge operation. In the inactive state, the PE disconnects from the global bus. In the first merge operation, PE_0, PE_2, PE_4, ..., PE_{n-2} are activated; in the second, PE_1, PE_5, PE_9, ...; in the third, PE_3, PE_11, PE_19, ...; and so forth.
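The activation pattern above follows a simple rule: in merge number s, the regions being joined are 2^(s-1) rows tall, so the first active PE is 2^(s-1)-1 and the others follow every 2^s rows. The following Python sketch illustrates that rule and the MinMax routing; the function names `active_pes` and `min_max` are ours, and the sketch assumes n is a power of two.

```python
def active_pes(n, stage):
    """Indices r of the PEs activated in merge number `stage` (1-based).

    PE_r sits between rows r and r+1, and there are n-1 PEs (r = 0..n-2).
    """
    first = 2 ** (stage - 1) - 1   # boundary of the first pair of regions
    step = 2 ** stage              # distance between successive boundaries
    return list(range(first, n - 1, step))


def min_max(a, b):
    """MinMax function: returns (data bus value, address bus value)."""
    return (min(a, b), max(a, b))
```

For n = 32, `active_pes(32, 3)` gives PE_3, PE_11, PE_19 and PE_27, matching the third merge operation described above.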

2.4. The Communication Network

The communication network consists of a local bus and a global bus, each of which consists of a data bus and an address bus of 2 log n bits. The local bus allows the 2 values of 2 adjacent rows in the first column to be read by the PE. The global bus allows, in the merging mode, all rows connected by the tree structure to be written by a PE. We remark that a number of PEs are connected to 2 global buses corresponding to 2 adjacent rows. The tree structure is constituted of (n/2)-1 configurable switches denoted SW_k (0 ≤ k ≤ n/2-2). Each SW_k is built from unidirectional three-state buffers. In the merging mode, the direction of the switches yields a transfer of data down from the apex to the base of the tree. The number of switches required is, in this case, considered optimal, and ensures data transfer in O(log n) units of propagation time.
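One way to see how (n/2)-1 switches can yield a logarithmic propagation depth is to count them stage by stage. The sketch below assumes n is a power of two and a binary tree with n/4 switches at its base, halving at each stage up to a single switch at the apex. This per-stage layout is our assumption for illustration; the text above only gives the total. The function name `switches_per_stage` is hypothetical.

```python
def switches_per_stage(n):
    """Switch counts from the base stage to the apex: n/4, n/8, ..., 1.

    Assumes n is a power of two (n >= 4) and a binary tree of switches.
    """
    counts = []
    k = n // 4
    while k >= 1:
        counts.append(k)
        k //= 2
    return counts
```

Under this assumption the counts sum to (n/2)-1 and the number of stages is (log n)-1, which agrees with the switch total given above and with the O(log n) propagation time.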

2.2. The Memory Modules

To save the intermediate and final values while providing a PSMU operation, CAM memory modules are used. This module is similar to the MUCAM (Multiple Update CAM) described in [8]. A CAM module updates its content with a New-Data value on the condition that the Target-Data value is equal to its content. The PSMU operation updates rows of CAMs of any length in one clock cycle. This operation can be expressed as a simple parallel algorithm, for n CAMs, as follows:

    forall M[address], (0 ≤ address ≤ n-1) do-in-parallel
        if ( M[address] == Target-Data ) { M[address] = New-Data; }
    endforall

Each module in the M1[i,j] plane is a simple register, while each module in the M2[i,j] plane is of the CAM type. Both planes consist of n rows, and each row contains n modules of 2 log n bits. The CAM module is realized by an identity comparator and registers. In the labeling operation, M1[i,j] is used to store the initial gray-level image and M2[i,j] to store the labeling results. In contrast, in an area or perimeter determination operation, M2[i,j] is used to store the labeled image and M1[i,j] to store the results.

To simplify the memory architecture and to speed up processing, the pipeline principle is used. A row of CAMs operates in a circular FIFO mode and realizes a left-shift operation. Local or global memory operations are used to write and read the memories. The local memory operation consists in writing or reading data in the first column of each row. The global memory operation consists of 2 write modes. The first writes (updates) a row of CAMs in M2 by addressing the row in which it is located. The second writes a row in M1 by addressing the opposite row in M2. We remark that the read mode is not used in the global memory operation.
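The PSMU operation and the circular left-shift of a row can be modelled in a few lines of Python. This is a sequential sketch of what the hardware does in a single clock cycle; the function names `psmu` and `left_shift` are ours, not identifiers from the paper.

```python
def psmu(cam_row, target_data, new_data):
    """Parallel Search and Multiple Update, modelled sequentially.

    Every cell whose content equals target_data is replaced by new_data;
    all other cells keep their content.
    """
    return [new_data if cell == target_data else cell for cell in cam_row]


def left_shift(row):
    """Circular FIFO left-shift: the first cell wraps around to the end."""
    return row[1:] + row[:1]
```

For example, `psmu([3, 5, 3, 0], 3, 1)` replaces both occurrences of 3 at once, and n successive calls to `left_shift` bring a row of n cells back to its initial arrangement.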

Fig. 2. System architecture

3. System Architecture

The complete system is illustrated in Fig. 2. The memory modules, the PEs, and the tree structure of switches are controlled by 3 different synchronous signals: a shift signal for the memory modules, a compare signal for the PEs, and a stage signal C_i for the tree of switches. The shift signal triggers a simultaneous left-shift of every row in one clock cycle. The compare signal triggers, in each PE, the simultaneous comparison of 2 adjacent rows and their multiplexing to the global bus. C_i activates the stages of the tree structure consecutively, one after the other, starting from the base stage and ending with the apex stage. In parallel, it also activates or deactivates the PEs in the order described in Section 2.1. The shift and compare signals each consist of n clock cycles, while C_i consists of (log n)-1 cycles.

The image input can be a digital gray-level image. Access to the memory (M1[i,j]) is through sequential loading. This can be done by pipelining the input image pixel by pixel and multiplexing line by line. Using the same principle, the result can be unloaded from the memory to the output bus.

4. The Algorithms

In this section, we develop three different algorithms: connected component labeling, and area and perimeter determination.

4.1. Connected Component Labeling

Labeling consists in assigning a unique label to each connected component in the image, ensuring a different label for each distinct object [9]. Quite a number of parallel algorithms have been developed. A local approach on a mesh-connected architecture [10], [11] yields a complexity of O(n^2). On a reconfigurable-mesh architecture [12], [13], labeling based on a regions approach [14] reduces the complexity to O(n); based on a boundary approach, labeling on the same architecture has a complexity of O(log n) [15]. Using CAM, the algorithms given in [8] and [16] yield a complexity of O(n^2). Using a similar solution, the algorithm described in this section has a complexity of O(n log n) [17]. This algorithm, which is based on a divide-and-conquer technique, is similar to that presented in [14]. Here, instead of merging in 2 directions, vertical and horizontal, our algorithm proceeds in a horizontal direction.

We suppose that a gray-level image is available in M1[i,j]. M2[i,j], which will be used to store the labeling results, is initialized to 0. The algorithm comprises 2 stages.

Row processing: Each LO_j tests the M1[0,j] and M1[n-1,j] values. (1) If M1[0,j] is equal to the Threshold-Value (a positive non-zero value) and M1[n-1,j] has the same value as M1[0,j], the value of M2[n-1,j] is propagated to M2[0,j]. (2) If M1[0,j] is equal to the Threshold-Value and M1[n-1,j] has a value different from that of M1[0,j], then M2[0,j] is assigned its row-major order. (3) Otherwise, there is no operation. Parallel propagation is undertaken for each row. The following iteration is done by first shifting each row to the left. We remark that at the first iteration (initialization stage), condition (2) is supposed true. At the end of this stage, all objects of each row are labeled with the smallest value. The complexity of this stage is O(n).

Merging: Each pair of adjacent rows is then merged simultaneously. This is done by comparing, during each iteration, the 2 labels of the 2 adjacent rows in the first column, M2[0,j] and M2[0,j+1]. The smaller value becomes the input data (New-Data) while the larger becomes the address (Target-Data). Both are transferred to the 2 global buses belonging to these 2 rows in order to update the CAMs. Otherwise, there is no operation. The 2 merged rows become a region. The following iteration is done by activating the following stage of the tree structure. The above procedure is then repeated, this time merging 2 adjacent regions instead of 2 adjacent rows, and so forth until a labeled image is obtained. Merging 2 rows or regions takes n iterations, with (log n)-1 as the maximum number of mergings. It follows that this phase takes O(n log n).
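The two stages of the labeling algorithm (row processing followed by merging) can be simulated sequentially. The following Python sketch is a functional model under our own assumptions, not the authors' implementation: rows are stored as `m[j][i]` with column 0 first, n is a power of two, the row-major order is taken 1-based so that 0 remains the background value, a merge is triggered only when both compared labels are non-zero and different, and the region height is doubled until one region covers the whole image. The names `label`, `psmu_rows` and `left_shift` are hypothetical.

```python
def left_shift(row):
    """Circular left-shift of one memory row (FIFO mode)."""
    return row[1:] + row[:1]


def psmu_rows(rows, target_data, new_data):
    """PSMU over several rows: every cell equal to target_data is updated."""
    for row in rows:
        for i, cell in enumerate(row):
            if cell == target_data:
                row[i] = new_data


def label(image, threshold=1):
    """Label the 4-connected components of pixels equal to `threshold`."""
    n = len(image)
    m1 = [list(row) for row in image]      # gray-level plane
    m2 = [[0] * n for _ in range(n)]       # label plane, initialized to 0

    # Stage 1: row processing, one column per iteration, then left-shift.
    for t in range(n):
        for j in range(n):
            if m1[j][0] == threshold:
                if t > 0 and m1[j][-1] == threshold:
                    m2[j][0] = m2[j][-1]       # rule (1): propagate label
                else:
                    m2[j][0] = j * n + t + 1   # rule (2): row-major order
        m1 = [left_shift(row) for row in m1]
        m2 = [left_shift(row) for row in m2]

    # Stage 2: merging of rows, then of regions of doubling height.
    half = 1                                   # rows per region being merged
    while half < n:
        for _ in range(n):
            for r in range(half - 1, n - 1, 2 * half):   # active PEs
                a, b = m2[r][0], m2[r + 1][0]
                if a and b and a != b:
                    region = m2[r + 1 - half:r + 1 + half]
                    psmu_rows(region, max(a, b), min(a, b))
            m2 = [left_shift(row) for row in m2]
        half *= 2
    return m2


EXAMPLE = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
```

On `EXAMPLE`, which contains three components, `label(EXAMPLE)` returns `[[1, 1, 0, 4], [0, 1, 0, 4], [9, 0, 0, 4], [9, 9, 0, 0]]`: each component ends up carrying the smallest row-major index among its pixels.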

Fig. 3. Example of connected component labeling. [Figure panels: plane of M1; plane of M2 at the 1st, 3rd and 6th iterations of merge 1, at the 1st iteration of merge 2, and at the 6th iteration of merge 3. Legend: modified values (all of these values change simultaneously); two regions; merging operation; processors; i-th iteration = i-th left-shift operation.]

The algorithm thus requires n operations for row processing and n((log n)-1) operations for merging, hence representing a complexity of O(n log n). We remark that this complexity is independent of the shape of the object. An example of labeling is represented in Fig. 3.
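As a check on the count stated above, the two stage costs sum exactly to n log n. Taking the logarithm base 2 is our assumption, consistent with rows being merged in pairs:

```latex
T(n) \;=\; \underbrace{n}_{\text{row processing}}
      \;+\; \underbrace{n\,(\log_2 n - 1)}_{\text{merging}}
      \;=\; n \log_2 n \;=\; O(n \log n)
```

For instance, for n = 256 this gives 256 + 256 × 7 = 2048 operations, whatever the shape of the objects.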


4.2. Area of a region

The area of a region is among its most important descriptive properties. It is defined, in a discrete image, as the total number of pixels belonging to the region. This algorithm takes O(n log n), independent of the number of connected components. To compute the area of a region, the algorithm uses a procedure similar to that described for labeling. We suppose the labeled image to be available in M2[i,j]. M1[i,j], which will be used to store the results of area determination, is initialized to 0.

Row processing: Each LO_j tests 2 values at its own location, M1[0,j] and M2[0,j]. If both M1[0,j] and M2[0,j] are positive non-zero values, the counter is incremented. However, if the value of M1[0,j] is 0 and M2[0,j] is a positive non-zero value, the local counter is first reset to 0 and then incremented. The value of the counter becomes the New-Data while that of M2[0,j] becomes the Target-Data. Both are transferred, through the PE, to the global bus corresponding to each LO_j, and they then update the CAMs. Otherwise, there is no operation. The global memory operation for addressing and writing data in 2 different planes or rows is carried out as described in Section 2.2: the Target-Data addresses the row in the M2 plane, and each match between the Target-Data and a CAM validates the updating of the register in the opposite row (in the M1 plane) with the New-Data. Parallel propagation is undertaken for each row. The following iteration is done by first shifting each row to the left. At the end of this stage, each row of M1[*,j] holds the areas corresponding to the objects of each row of M2[*,j]. This stage takes O(n).

Merging: Each PE tests the 2 adjacent values M1[0,j] and M1[0,j+1]. If both are positive non-zero values, the PE validates the addition of these values. The resulting value, denoted B, becomes the New-Data, while the M2[0,j] or M2[0,j+1] value becomes the Target-Data. Both are transferred to the 2 global buses belonging to these 2 rows in order to update the CAMs. Otherwise, there is no operation. The same procedure as described above for row processing is carried out for data addressing and updating in 2 different planes. We remark that a constant value must be added to B before it is transferred. For the following iteration, this indicates that the area determination of the 2 objects (regions) has already been undertaken. The presence of this constant value in either M1[0,j] or M1[0,j+1] must therefore be tested by each PE before the commencement of each merging process. The areas of the regions of the 2 rows are thus determined. The following iteration is conducted by activating the following stage of the tree structure. The above procedure is then repeated, this time to determine the area of 2 adjacent regions instead of 2 adjacent rows, and so forth. A merge operation takes n iterations, with (log n)-1 as the maximum number of mergings. Thus this phase takes O(n log n).

4.3. Perimeter of a region

The perimeter of a region is defined as the total length, or number, of boundary pixels belonging to the region. This algorithm takes O(n log n), independent of the number of connected components. We suppose that the labeled image is available in M2[i,j]. M1[i,j], which will be used to store the results of perimeter determination, is initialized to 0. This algorithm uses a procedure similar to that described for area determination, except that the values added all belong to the perimeter of an object, which can be detected by LO_j.
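To check the output of a simulation of the area and perimeter procedures, a plain sequential reference computation is convenient. The Python sketch below is such a reference, not the parallel row-processing and merging schedule described above. It assumes a labeled image stored as `m2[j][i]` with 0 as background, and it treats a pixel as a boundary pixel when at least one of its 4 neighbours lies outside the image or carries a different value. That boundary rule and the name `area_and_perimeter` are our assumptions.

```python
from collections import Counter


def area_and_perimeter(m2):
    """Return (area, perimeter) dictionaries keyed by label.

    area[label]      = number of pixels carrying that label
    perimeter[label] = number of those pixels that are boundary pixels
    """
    n = len(m2)
    area, perimeter = Counter(), Counter()
    for j in range(n):
        for i in range(n):
            lab = m2[j][i]
            if lab == 0:
                continue                      # background pixel
            area[lab] += 1
            neighbours = [(j - 1, i), (j + 1, i), (j, i - 1), (j, i + 1)]
            # Boundary if a neighbour is outside the image or differs.
            if any(not (0 <= y < n and 0 <= x < n) or m2[y][x] != lab
                   for y, x in neighbours):
                perimeter[lab] += 1
    return dict(area), dict(perimeter)
```

For a 4×4 image filled with a single label, this gives an area of 16 and a perimeter of 12, since only the 4 inner pixels have all their neighbours inside the object.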

5. Results and Discussion

Table 1. Comparison of two architecture types (one-dimensional array processors)

Performances of 2 architecture types | Orthogonal multiprocessors type (ACMAA) (1) | Linear array of processors type (proposed architecture)
Processors | O(n) | ...
... | ... | ...
Perimeter (1) | O(Cn) | O(n log n)

(1) According to [4]. C = number of connected components in the image.

In Table 1, the performances of 2 architectures in the one-dimensional array processors class are compared: the orthogonal multiprocessors type, represented by ACMAA [4], and the linear array processors type, represented by the proposed architecture. A detailed comparison with other architectures can be found in [4]. The table shows that ACMAA has a more efficient communication diameter than the proposed architecture. However, the proposed architecture has more efficient algorithm complexities for labeling, area determination, and perimeter determination. On ACMAA, labeling depends on the shape of an object, while area and perimeter determination depend on the number of connected components. In contrast, the proposed architecture is independent of both. Hence if, at worst, the number of connected components C in an image is considered to be O(n^2), area or perimeter determination on ACMAA will each take O(n^3).

These algorithms have been tested by functional simulation in the C language. However, hardware simulation using FPGA tools has validated the theoretical results as far as the labeling algorithm, the architecture optimization, and real-time processing from any interlaced-mode video signal are concerned [17]. We are currently working on the development of this architecture for general purposes by adding a reconfigurable communication network and an interactive CAM mode.

6. Conclusion

A parallel architecture dedicated to connected component analysis has been presented, together with algorithms for connected component labeling and for area and perimeter determination. The PSMU operation of the CAM has demonstrated that the proposed architecture yields excellent performance for connected component analysis and is well suited for intermediate-level vision. The tree structure of switches for data communication across global buses, together with the row structure, makes the proposed architecture well suited for real-time processing from any interlaced-mode video signal. The development of this architecture for general purposes, using an identical principle, will be considered in future work.

References

[1] V.K.P. Kumar, "Parallel architectures and algorithms for image understanding", Academic Press Inc., 1991.
[2] H. M. Alnuweiri and V.K. Prasanna, "Parallel architectures and algorithms for image component labeling", IEEE Transactions on Pattern Analysis and Machine Intelligence, Oct. 1992, vol. 14, no. 10, pp. 1014-1034.
[3] K. Hwang, P.S. Tseng, and D. Kim, "An Orthogonal multiprocessor for parallel scientific computations", IEEE Transactions on Computers, Jan. 1989, vol. 38, no. 1, pp. 47-61.
[4] P. T. Balsara and M. J. Irwin, "Intermediate-level vision tasks on a memory array architecture", Machine Vision and Applications, 1993, vol. 6, pp. 50-65.
[5] A.L. Fisher, "Scan line array processors for image computation", International Conference on Computer Architecture, 1986, pp. 338-345.
[6] D. Chin, J. Passe, F. Bernard, H. Taylor, and S. Knight, "The Princeton engine: A Real-Time video system simulator", IEEE Transactions on Consumer Electronics, May 1988, vol. 34, no. 2, pp. 285-297.
[7] S. Knight, D. Chin, H. Taylor, and J. Peters, "The Sarnoff engine: A Massively parallel computer for high definition system simulation", International Conference on Application Specific Array Processors, 1992, pp. 342-356.
[8] Y-C. Shin, R. Sridhar, V. Demjanenko, P. W. Palumbo, and S. N. Srihari, "A Special-purpose content addressable memory chip for real-time image processing", IEEE Journal of Solid-State Circuits, May 1992, vol. 27, no. 5, pp. 737-744.
[9] A. Rosenfeld and J. L. Pfaltz, "Sequential operations in digital picture processing", Journal of the Association for Computing Machinery, 1966, vol. 13, no. 4, pp. 471-494.
[10] C. R. Dyer and A. Rosenfeld, "Parallel image processing by memory-augmented cellular automata", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1981, vol. PAMI-3, no. 1, pp. 29-41.
[11] M. J. Duff and T. J. Fountain, "Cellular logic image processing", New York: Academic, 1986.
[12] R. Miller and V. K. Prasanna-Kumar, "Meshes with reconfigurable buses", Proc. of the 5th MIT Conf. Advanced Research in VLSI, March 1988, pp. 163-178.
[13] H. Li and M. Maresca, "Polymorphic-Torus architecture for computer vision", IEEE Transactions on Pattern Analysis and Machine Intelligence, March 1989, vol. 11, no. 3, pp. 233-243.
[14] M. Maresca, H. Li, and M. Lavin, "Connected component labeling on Polymorphic-Torus architecture", IEEE Int. Conf. Computer Vision Pattern Recognition, Ann Arbor, 1988, pp. 951-956.
[15] S. Olariu, J. L. Schwing, and J. Zhang, "Fast component labelling and convex hull computation on reconfigurable meshes", Image and Vision Computing, 1993, vol. 11, no. 7, pp. 447-455.
[16] C. C. Weems, S. P. Levitan, A. R. Hanson, E. M. Riseman, D. B. Shu, and J. G. Nash, "The Image understanding architecture", International Journal of Computer Vision, 1989, vol. 2, pp. 251-282.
[17] E. Mozef, S. Weber, J. Jaber, and G. Prieur, "Parallel architecture dedicated to image component labelling in O(n log n): FPGA implementation", European Symposium on Lasers, Optics, and Vision for Productivity in Manufacturing I, Besançon, France, June 96 (accepted for publication in the Conference Proceedings).