PISA - Parallel Image Segmentation Algorithms*

Alina Lindner (Moga), Andreas Bieniek, and Hans Burkhardt
Albert-Ludwigs-Universität Freiburg, Institut für Informatik, Am Flughafen 17, D-79110 Freiburg
Abstract. Parallelisation of the watershed segmentation method is studied in this paper. Starting from a successful parallel watershed design solution, extensive tests on various parallel machines are presented to prove its portability and performance. Next, the watershed algorithm has been reformulated as a modified connected component problem. Consequently, we present a scalable parallel implementation of the connected component problem, which is the key to future improvements of the parallel design of the watershed algorithm.
1 Introduction

Segmentation is the process of partitioning an image into disjoint regions such that each region is homogeneous, according to a certain uniformity criterion, and no union of two adjacent regions is homogeneous. The problem has been broadly investigated using both classical and fuzzy reasoning techniques. Different algorithms can be found in the literature, seeking either feature homogeneity (region growing) or dissimilarities (edge detection) [12,15,23]. Segmentation plays a crucial role in image processing, in particular for coding, edge detection, object recognition, automatic measurements, and analysis. Additionally, in most applications segmentation is expected to complete in real time. However, this goal is difficult to achieve for large images and image sequences, for which the complexity of serial segmentation is usually high. Therefore, fast, scalable parallel algorithms are called for. The emphasis in this project is on designing high-level, portable MIMD parallel algorithms which execute efficiently and exhibit scalability, independent of the image content.

Our work is dedicated to parallelising a recent segmentation technique named the watershed transformation [5,17-19]. As a segmentation method, the watershed transformation is often found successfully incorporated into image analysis systems in various domains, e.g., in industry and biomedicine (segmentation of electrophoresis gels, a moving heart, 3D holographic images, road traffic analysis) [19]. Starting from a successful parallel design solution of the watershed algorithm [6,22], further tests on different parallel machines have been performed to evaluate its portability and performance. In our efforts to find a new parallel algorithm for the watershed problem displaying concurrency, locality, modularity, data independence, and resilience to an increasing number of processors, special attention is concentrated on elaborating robust and less error-prone sequential techniques with minimal memory costs, reduced software-engineering cost for implementation and testing, and simple data structures. Consequently, the classical watershed algorithm has been reformulated as a connected component problem [6]. The latter algorithm is simpler and faster than the classical method based on hierarchical queues [5]. Therefore, an efficient parallel connected component operator is the key to further improvements of the parallel watershed algorithm, as well as to the parallelisation of other problem classes.

The rest of the paper is organised as follows. In Section 2, the parallel watershed algorithm based on hierarchical queues is described in more detail. Relevant timings and speedups of the algorithm running on various parallel systems are also included to compare the performance. A new approach to the algorithm, based on the connected component problem, along with preliminary results, follows in Section 3. Finally, conclusions and highlights of future work are given in Section 4.

* We acknowledge the High Performance Computing Center Stuttgart for granting us the use of the Cray T3E parallel computer, as well as all firms through which the results included in this report were made possible.
2 A Parallel Watershed Algorithm based on Ordered Queues and Connected Components

The classical watershed algorithm [5,19] regards the gradient of an image as a topographical relief, in which flooding starting from the regional minima is simulated. The algorithm starts by detecting and labelling initial seeds, e.g. the minima of the gradient. Minima pixels are then sorted according to their grey-level and stored in a hierarchical queue. The latter consists of N FIFO queues, one queue for each grey-level h existing in the image. Flooding is then performed in increasing order of grey-level: a pixel removed from the queue assigns its label to each unlabelled neighbour, which is, in turn, inserted into the FIFO queue allocated to its grey-level. However, because the classical algorithm implements a global and highly data-dependent operation, its parallelisation is not straightforward.

An efficient parallel implementation of the watershed algorithm has been extensively presented in [22], with results collected on a Cray T3D parallel computer. In order to prove the portability of the design solution to different massively parallel computers and, hence, of its performance, the algorithm has been tested on several parallel machines. Before presenting the results, a short description of the algorithm follows in the next subsection.
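For illustration, the flooding step recalled above can be sketched as follows. This is a minimal sketch, not the implementation evaluated in this paper: it assumes an 8-bit grey-value image stored row-major, 4-connectivity, and a label image that already holds positive seed labels on the minima and zero elsewhere; watershed-line handling and the plateau ordering discussed below are omitted.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <queue>
#include <vector>

// Minimal sketch of hierarchical-queue flooding (4-connectivity).
// On entry, labels[p] > 0 for seed (minimum) pixels and 0 elsewhere.
void flood(const std::vector<std::uint8_t>& grey, std::vector<int>& labels,
           int width, int height) {
    std::array<std::queue<int>, 256> hq;           // one FIFO per grey-level
    for (int p = 0; p < width * height; ++p)       // enqueue the labelled seeds
        if (labels[p] > 0) hq[grey[p]].push(p);

    for (int h = 0; h < 256; ++h) {                // flood in increasing grey-level order
        while (!hq[h].empty()) {
            int p = hq[h].front(); hq[h].pop();
            int x = p % width, y = p / width;
            const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                int q = ny * width + nx;
                if (labels[q] == 0) {              // unlabelled neighbour
                    labels[q] = labels[p];         // inherits the label of p
                    // insert it at the queue of its own grey-level, but never
                    // below the current flooding level h
                    hq[std::max<int>(grey[q], h)].push(q);
                }
            }
        }
    }
}
```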
2.1 Description of the algorithm

From the brief description of the sequential watershed algorithm above, one can observe that the watershed transformation operates recursively and that the overall data access pattern is global. Global operations are generally tackled in MIMD implementations in a divide-and-conquer fashion. The solution in [22] resembles the parallel connected component designs in [1,10,13]. For this purpose, the image domain is divided into sub-domains (blocks or slices) and intermediate labelling, as in the sequential case, is performed within each sub-domain. The boundary connectivity information between sub-images is locally stored in a graph [1] or in an equivalence table [10,13]. More specifically, labels on each side of a common boundary
between adjacent sub-images and pertaining to the same component are retained as belonging to the same equivalence class (see Fig. 1). Final labels correspond to the connected components of the global boundary connectivity graph obtained by combining all the local graphs (see Fig. 1, where two connected components result).

Fig. 1. Boundary connectivity graph of a distributed binary image.

Unlike in the case of binary images, for the watershed transformation the grey-levels of two neighbouring pixels do not suffice to retrieve the boundary connectivity information. A solution results from the property that flooding is ruled by a two-dimensional ordering relation. Thus, a pixel p gets a label from an immediate neighbour q of lowest grey-level, if any. On plateaus surrounded by pixels of lower grey-level, the lower distance, given by the length of the shortest path completely included in the plateau to a lower border of the plateau (see also [17]), imposes the propagation order. Details on the parallel computation of the lower distance on plateaus can be found in [6,22]. Once this predecessor-successor flooding relation between pixels has been established, the boundary connectivity information can easily be recovered: if a pixel has its predecessor in an adjacent sub-domain, the two definitely pertain to the same component. After the boundary connectivity graph has been set up in every processor, global connected components are computed in log2(P) steps on P processors, regardless of the data complexity.
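The last step, merging the boundary label equivalences into global components, can be illustrated with a union-find structure. The sketch below is ours, not the data structure of [22]; `edges` stands for pairs of local labels that meet across a sub-domain boundary and belong to the same component, as in Fig. 1.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Illustrative union-find over the local labels of all sub-domains.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// edges: pairs of labels that touch across a sub-domain boundary and are
// known (via the flooding relation) to belong to the same component.
std::vector<int> globalLabels(int numLocalLabels,
                              const std::vector<std::pair<int, int>>& edges) {
    UnionFind uf(numLocalLabels + 1);
    for (const auto& e : edges) uf.unite(e.first, e.second);
    std::vector<int> global(numLocalLabels + 1);
    for (int l = 1; l <= numLocalLabels; ++l) global[l] = uf.find(l);
    return global;                        // one representative per local label
}
```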
2.2 Experimental Results

The algorithm described above has been implemented on top of the Message Passing Interface (MPI) [11] and tested on six different parallel computers: Comparex/Hitachi, Hewlett-Packard X-Class, Hewlett-Packard Hyper-Class (beta version), SGI/Origin 2000, SGI/Cray T3E, and IBM/PWR2. Because the implementation is built on top of MPI, the code was easily ported to all machines. The running time T for various images is tabulated in Table 1 and the relative speedup S(P) = T(1)/T(P) is illustrated in logarithmic scale in Fig. 2. The result of the transformation is shown in Fig. 6 for two example images. Although the optimisation flags used for each machine are not totally identical (O3, assisted by different machine-specific flags), one can observe that the steeply ascending relative speedup is preserved, irrespective of the underlying parallel platform. For each of the four images, the speedup curves grow close to each other up to a certain number of processors, after which they diverge. For the image Cermet this threshold is reached at 8 CPUs, from which point the curve for the SGI/Origin 2000 drops. Next, at 16 CPUs, the performance on the IBM/PWR2 and HP/X-Class saturates. A still increasing relative speedup, up to 64 CPUs, remains for the Comparex/Hitachi and SGI/Cray T3E. A more uniform behaviour is observed for the other three images, excepting the downward slope for the image Lenna between 16 and 32 CPUs on the SGI/Origin 2000. Naturally, better performance is obtained for larger images, like Peppers and Lenna of size 512 x 512, or People of size 1024 x 1024. Globally, however, the algorithm proves to be extremely portable, yielding very good performance on each parallel machine.

Table 1. Execution times of the watershed algorithm (in seconds)

Machine            Image                 P=1    P=2    P=4    P=8    P=16   P=32   P=64
Comparex/Hitachi   Cermet (256x256)      0.601  0.315  0.171  0.105  0.072  0.053  0.046
                   Peppers (512x512)     2.397  1.333  0.671  0.356  0.208  0.126  0.087
                   Lenna (512x512)       2.382  1.308  0.629  0.359  0.214  0.132  0.099
                   People (1024x1024)    9.596  5.491  3.064  1.504  0.727  0.412  0.249
HP/X-Class         Cermet (256x256)      0.238  0.129  0.074  0.054  0.039  0.048
                   Peppers (512x512)     0.912  0.481  0.257  0.146  0.118  0.089
                   Lenna (512x512)       0.931  0.475  0.253  0.151  0.109  0.092
                   People (1024x1024)    4.28   2.42   1.21   0.610  0.360  0.241
HP/Hyper-Class     Cermet (256x256)      0.185  0.100  0.056  0.033  0.021
                   Peppers (512x512)     0.680  0.392  0.203  0.109  0.067
                   Lenna (512x512)       0.700  0.384  0.197  0.114  0.071
                   People (1024x1024)    3.259  1.919  0.950  0.462  0.234
SGI/Origin 2000    Cermet (256x256)      0.113  0.057  0.037  0.025  0.028  0.047
                   Peppers (512x512)     0.404  0.253  0.142  0.089  0.068  0.089
                   Lenna (512x512)       0.411  0.245  0.129  0.086  0.065  0.065
                   People (1024x1024)    1.563  0.973  0.631  0.361  0.226  0.180
SGI/Cray T3E       Cermet (256x256)      0.180  0.098  0.056  0.035  0.022  0.012  0.010
                   Peppers (512x512)     0.686  0.405  0.210  0.110  0.066  0.040  0.026
                   Lenna (512x512)       0.726  0.421  0.221  0.119  0.070  0.041  0.028
                   People (1024x1024)    2.885  1.772  1.024  0.506  0.257  0.148  0.084
IBM/PWR2           Cermet (256x256)      0.154  0.084  0.047  0.028  0.020  0.037  0.027
                   Peppers (512x512)     0.608  0.370  0.197  0.103  0.063  0.037  0.027
                   Lenna (512x512)       0.606  0.357  0.183  0.102  0.062  0.038  0.028
                   People (1024x1024)    2.578  1.629  1.008  0.505  0.235  0.129  0.077
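As a quick check on the notation, the relative speedup values plotted in Fig. 2 follow directly from Table 1, e.g. S(64) = T(1)/T(64) = 2.885/0.084 for the People image on the SGI/Cray T3E. The following small sketch (ours, for illustration only) recomputes that row:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Timings of the People (1024 x 1024) image on the SGI/Cray T3E (Table 1).
    const double t[] = {2.885, 1.772, 1.024, 0.506, 0.257, 0.148, 0.084};
    for (int i = 0; i < 7; ++i) {
        int P = 1 << i;                      // P = 1, 2, 4, ..., 64
        double S = t[0] / t[i];              // relative speedup S(P) = T(1)/T(P)
        std::printf("P=%2d  S(P)=%6.2f  log2(S(P))=%5.2f\n", P, S, std::log2(S));
    }
    return 0;
}
```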
Fig. 2. Relative speedup of the watershed algorithm on different machines and with different images (panels: Cermet 256 x 256, Peppers 512 x 512, Lenna 512 x 512, People 1024 x 1024; axes: log2(P) versus log2(S(P)); curves: linear speedup, Comparex/Hitachi, HP/X-Class, HP/Hyper-Class, SGI/Origin, SGI/Cray T3E, IBM/PWR2).

3 A Parallel Connected Component Algorithm

In the previous sections, we have shown that the watershed algorithm can be regarded as a connected component operator, where the notion of connectivity between two neighbouring pixels is defined by the predecessor-successor flooding relation [6,7,21,22]. Therefore, in the first phase of the project, our research is focused on finding efficient parallel algorithms for the connected component problem. The connected component operator can additionally be regarded as a foundation for the efficient parallelisation of various other problems. A good survey of sequential and parallel connected component algorithms for different computation models is given in [1]. Most sequential connected component algorithms perform two passes: first, a raster scan through the image to set up equivalences between neighbouring pixels, and second, a pass to replace each label with the representative of its equivalence class [1,10,14,16,24]. Equivalences are usually solved with the help of a lookup table. Alternatively, the image connected component problem can be transformed into the equivalent graph connected component problem; an example of such an equivalent graph is shown in Fig. 1.

Several divide-and-conquer algorithms for parallel connected components on MIMD machines exist in the literature. In most cases the image domain is divided into sub-domains and the local connected component problem is first solved with a sequential algorithm. Global connected components are then computed by applying a global merge-and-split operation on the boundary data of the sub-images. A global boundary connectivity graph can be constructed by combining all the local graphs, as described in Section 2. Alternatively, a global equivalence table can be built by combining the local tables of all sub-domains. These solutions are not optimal because, in the worst case, the size of the graph or table increases with the number of processors. The size of the connectivity graph can be bounded by the perimeter of the merged sub-domains if the boundary information is updated with the help of a separate graph at each merging step, as shown in [2-4]. In [9,10], the boundaries are updated with the help of a separate, and therefore small, equivalence table at every merging step. Nevertheless, since the sequential connected component operation is relatively fast, it is essential that the overhead of the merge-and-split operation is kept as small as possible; otherwise, the scalability of the algorithm is compromised.
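For reference, the classical two-pass labelling scheme with a lookup (equivalence) table, as recalled above, can be sketched as follows for a binary image and 4-connectivity. This is an illustrative sketch of the cited classical approach, not the label-image variant [7,8,20] used in our implementation.

```cpp
#include <algorithm>
#include <vector>

// Two-pass connected component labelling (binary input, 4-connectivity)
// with an equivalence table resolved by union-find with path halving.
std::vector<int> labelComponents(const std::vector<int>& img, int w, int h) {
    std::vector<int> label(w * h, 0);
    std::vector<int> parent(1, 0);                    // equivalence (lookup) table
    auto find = [&](int x) {                          // representative of a label
        while (parent[x] != x) x = parent[x] = parent[parent[x]];
        return x;
    };
    int next = 0;
    for (int y = 0; y < h; ++y)                       // pass 1: provisional labels
        for (int x = 0; x < w; ++x) {
            int p = y * w + x;
            if (img[p] == 0) continue;
            int up   = (y > 0 && img[p - w] != 0) ? find(label[p - w]) : 0;
            int left = (x > 0 && img[p - 1] != 0) ? find(label[p - 1]) : 0;
            if (up == 0 && left == 0) {               // start a new component
                parent.push_back(++next);
                label[p] = next;
            } else {
                label[p] = std::max(up, left);        // take any labelled neighbour
                if (up != 0 && left != 0 && up != left)
                    parent[std::max(up, left)] = std::min(up, left);  // record equivalence
            }
        }
    for (int p = 0; p < w * h; ++p)                   // pass 2: replace by representative
        if (label[p] != 0) label[p] = find(label[p]);
    return label;
}
```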
3.1 Outline of the parallel algorithm
The outline of the parallel algorithm is as follows. First, a fast sequential connected component algorithm, which uses the image of labels itself to solve equivalences [7,8,20], is performed independently within each sub-domain. After the sequential algorithm, the label of every pixel refers to the representative of the component the pixel pertains to within the local sub-image. A global representative denotes a representative of a connected component which extends across the boundaries of a sub-domain. At the next step, the boundary data structure of each sub-domain is packed. The boundary data structure is organised in such a way that a graph-type connected component algorithm works directly on the communicated data. For each pixel on the image boundary, its label, grey-value, and a relative offset to a replacement pixel within the boundary data structure are stored, as shown in Fig. 3.
Fig. 3. Data structure to store the border information of sub-images (image label, grey-value, relative offset per border pixel).

The use of relative offsets has the advantage that the references between border pixels remain valid after a communication step, as long as all references stay within the communicated data structure. The packing operation moves all global representatives into the boundary data structure. This is done by linking the old representative, and therefore the whole component, to the new representative within the border data structure. Then a resolve phase is performed on the boundary data, following and short-cutting all references to the root. The effect is that all border pixels refer to a representative within the border data structure, as shown in Fig. 4a. The representatives of the grey-shaded connected components are marked with black squares. Global representatives within the image point to a new representative in the border stripes, as shown by an arrow. At the next step, the boundary data is merged recursively until the whole image is encompassed. At each merging step, neighbouring edges of adjacent sub-images are merged. A new data packet, which consists of the combined edges except the merged ones, is created. The merging algorithm resembles the sequential connected component algorithm on the neighbouring edges, but, for the sake of communication, references (relative offsets) within the edge data structures are used (Fig. 4b). The key idea of the presented merging algorithm is that, after linking neighbouring pixels which belong to the same equivalence class, all global representatives are
moved to the new boundary data. A resolve phase guarantees that all new border pixels refer to a representative within the new border data structure (Fig. 4c). Following the merging process shown in Fig. 4, the representatives of the local sub-images in Fig. 4a point to a single representative in Fig. 4d. At the next step, the final labels of the last merging step are distributed in the reverse order of the merging process. For this purpose, the label of each border pixel is updated with the label of the final representative. Then, the stored and updated data of the corresponding merging step is sent back to its source. Finally, all pixels in the sub-images are re-labelled, replacing each representative with the global representative.

Fig. 4. An example of the divide-and-conquer phase of the parallel connected component algorithm for a stripe-wise distributed image (a-d: successive packing, merging, and resolve steps).
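To make the border record of Fig. 3 and the resolve phase more concrete, the following is a minimal sketch under our own assumptions; the field names and exact packing are illustrative and do not reproduce the actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry of the packed border data structure (cf. Fig. 3). `offset` is a
// relative index into the same array, so references remain valid after the
// array has been communicated to another processor.
struct BorderPixel {
    int          label;    // label within the local sub-image
    std::uint8_t grey;     // grey-value of the pixel
    int          offset;   // relative offset to the replacement entry (0 = representative)
};

// Resolve phase: follow and short-cut all references so that every entry
// points directly to its representative inside the border array.
void resolve(std::vector<BorderPixel>& border) {
    for (std::size_t i = 0; i < border.size(); ++i) {
        long long j = static_cast<long long>(i);
        while (border[j].offset != 0)      // walk the chain of relative references
            j += border[j].offset;
        border[i].offset = static_cast<int>(j - static_cast<long long>(i));  // short-cut
        border[i].label  = border[j].label;  // adopt the representative's label
    }
}
```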
3.2 Experimental results
Fig. 5 shows the speedup of the algorithm compared with the unchanged sequential algorithm for 512 x 512 and 1024 x 1024 images. There is no significant difference in speedup (see Fig. 5a,b) between the different images we tested. In Fig. 5c and Fig. 5d, stripe-wise decomposition of the image is compared with block-wise decomposition. The results show that the speedup for block-wise decomposition is significantly better than for stripe-wise decomposition. Additionally, redundant merging, which performs the merging steps redundantly on all processors, is compared with non-redundant merging [6]. The results show that, when the merging process is performed redundantly, a slightly better speedup is observed. The reason is that the split phase can then be done within local memory, at the expense of sending much more data in the merging phase. This can lead to network saturation, which is visible for 1024 x 1024 stripe-wise distributed images in Fig. 5d. Performing the merging operation using a butterfly communication graph or a de Bruijn graph makes little difference on a Cray T3E.
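One intuition behind the block-wise advantage is the amount of boundary data each processor contributes: a stripe always has two full-width edges, whereas a block's perimeter shrinks with the square root of the number of processors. The following back-of-the-envelope sketch, under our own simplifying assumptions and ignoring the constants of the packed data structure, illustrates this for an N x N image.

```cpp
#include <cmath>
#include <cstdio>

// Rough per-processor boundary size for an N x N image: stripes keep two
// edges of length N, blocks have a perimeter of 4 * N / sqrt(P).
int main() {
    const int N = 1024;
    for (int P = 4; P <= 64; P *= 4) {
        double stripe = 2.0 * N;                                  // stripe decomposition
        double block  = 4.0 * N / std::sqrt(static_cast<double>(P)); // block decomposition
        std::printf("P=%2d  stripe edge pixels=%6.0f  block edge pixels=%6.0f\n",
                    P, stripe, block);
    }
    return 0;
}
```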
Fig. 5. Speedup of the connected component algorithm on the Cray T3E: (a) different images, block-wise, redundant merging (512 x 512); (b) different images, block-wise, redundant merging (1024 x 1024); (c) different communication modes (512 x 512); (d) different communication modes (1024 x 1024). Axes: log2(P) versus log2(S(P)); test images bcomb, bpeppers.gt, bsnake, bsnake-r, bspiral.

4 Conclusions and Future Work

The results in Section 2 show the portability and scalability of a parallel watershed segmentation algorithm. In Section 3, we propose a new, scalable connected component algorithm, which can be seen as a foundation for the efficient parallelisation of the watershed transformation, as well as of various other problems in the field of image processing. Therefore, our future work is focused on extending the parallel connected component algorithm for watersheds.
Furthermore, analysing the drawbacks of the watershed algorithm, namely over-segmentation, remedies based on multiscale pyramid algorithms have been investigated and implemented. This part is still under development and, therefore, results will be published later. However, the problem of multi-resolution watersheds opens an extremely challenging topic for parallel computation.
Fig. 6. Example images Lenna and Peppers with the result of the watershed transformation.
References

1. H. M. Alnuweiri and V. K. Prasanna. Parallel architectures and algorithms for image component labeling. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(10):1014-1034, October 1992.
2. D. A. Bader and J. JaJa. Parallel algorithms for image histogramming and connected components with an experimental study. Journal of Parallel and Distributed Computing, 35(2):173-190, June 1996.
3. D. A. Bader, J. JaJa, D. Harwood, and L. S. Davis. Parallel algorithms for image enhancement and segmentation by region growing with an experimental study. Journal of Supercomputing, 10(2):141-168, 1996.
4. K. P. Belkhale and P. Banerjee. Parallel algorithms for geometric connected component labeling on a hypercube multiprocessor. IEEE Transactions on Computers, 41(6):699-709, June 1992.
5. S. Beucher and F. Meyer. The morphological approach to segmentation: The watershed transformation. In E. R. Dougherty, editor, Mathematical Morphology in Image Processing, pages 433-481, N.Y., 1993. Marcel Dekker Inc.
6. A. Bieniek, H. Burkhardt, H. Marschner, M. Noelle, and G. Schreiber. A parallel watershed algorithm. In Proceedings of the 10th Scandinavian Conference on Image Analysis, pages 237-244, Lappeenranta, Finland, June 1997.
7. A. Bieniek and A. Moga. A connected component approach to the watershed segmentation. In Mathematical Morphology and its Applications to Image and Signal Processing, volume 12 of Computational Imaging and Vision, pages 215-222. Kluwer Academic Publishers, 1998.
8. A. Choudhary and R. Thakur. Connected component labeling on coarse grain parallel computers: an experimental study. Journal of Parallel and Distributed Computing, 20(1):78-83, January 1994.
9. H. Embrechts. MIMD Divide-and-Conquer Algorithms for Geometric Operations on Binary Images. PhD thesis, Department of Computer Science, Katholieke Universiteit Leuven, Belgium, March 1994.
10. H. Embrechts, D. Roose, and P. Wambacq. Component labelling on a MIMD multiprocessor. CVGIP: Image Understanding, 57(2):155-165, March 1993.
11. Message Passing Interface Forum. MPI: A message-passing interface standard. Technical report, University of Tennessee, Knoxville, Tennessee, June 1995. Version 1.1.
12. R. M. Haralick and L. G. Shapiro. Image segmentation techniques. Computer Vision, Graphics, and Image Processing, 29:100-132, 1985.
13. T. Johansson. Image Analysis Algorithms on General Purpose Parallel Architectures. PhD thesis, Centre for Image Analysis, University of Uppsala, Uppsala, 1994.
14. T. Johansson and E. Bengtsson. A new parallel MIMD connected component labeling algorithm. In PARLE: Parallel Architectures and Languages Europe, pages 801-804. LNCS, Springer-Verlag, 1994.
15. T. Kanade. Survey, region segmentation: Signal vs semantics. CVGIP, 13(4):279-297, August 1980.
16. R. Lumia, L. Shapiro, and O. Zuniga. A new connected components algorithm for virtual memory computers. Computer Vision, Graphics, and Image Processing, 22(2):287-300, 1983.
17. F. Meyer. Integrals, gradients and watershed lines. In Proceedings of the Workshop on Mathematical Morphology and its Applications to Signal Processing, pages 70-75, Barcelona, Spain, May 1993.
18. F. Meyer. Topographic distance and watershed lines. Signal Processing, 38(1):113-125, July 1994.
19. F. Meyer and S. Beucher. Morphological segmentation. Journal of Visual Communication and Image Representation, 1(1):21-46, September 1990.
20. R. Miller and Q. F. Stout. Parallel Algorithms for Regular Architectures: Meshes and Pyramids. MIT Press, Cambridge, Massachusetts, 1996.
21. A. Moga. Parallel Watershed Algorithms for Image Segmentation. PhD thesis, Tampere University of Technology, Tampere, Finland, February 1997.
22. A. N. Moga and M. Gabbouj. Parallel image component labeling with watershed transformation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):441-450, May 1997.
23. N. R. Pal and S. K. Pal. A review on image segmentation techniques. Pattern Recognition, 26(9):1277-1294, 1993.
24. A. Rosenfeld and J. L. Pfaltz. Sequential operations in digital picture processing. Journal of the ACM, 13(4):471-494, October 1966.