Implementation of a Region Growing Algorithm on Multicomputers: Analysis of the Work Load Balance. M. G. MONTOYA, C. GIL and I. GARCIA
Dept. Arquitectura de Computadores y Electronica Universidad de Almeria. 04120 Almeria. Spain.
[email protected] tel: 34 - 50 215709 fax: 34 - 50 215486
Abstract. This paper discusses and evaluates different parallel implementations of a Region Growing algorithm based on the Split-and-Merge approach proposed by Horowitz and Pavlidis [1]. Region growing is a general technique for image segmentation. The basic scheme consists of joining adjacent pixels to form regions; adjacent regions are then merged to obtain larger regions. The solution has been conceived for a multiprocessor using the SPMD (Single Program Multiple Data) programming model, and executions have been carried out on a Cray-T3E system. From a parallel point of view, the region growing problem is an irregular and dynamic problem which exhibits unpredictable load fluctuations. Therefore it requires the use of load balancing schemes to achieve efficient parallel solutions. In this sense, we propose both a static and a dynamic load balancing strategy. The efficiency of the algorithm also depends on the procedure applied for determining the region identifiers (IDs) of the set of regions. To show this fact, we propose and analyze several strategies for the selection of the IDs and their influence on the execution time and the load distribution. keywords: parallel algorithms, load balancing, segmentation, region growing, multicomputers.
1. Introduction

Segmentation is a procedure to subdivide an image into its constituent parts or objects, called regions, using image attributes such as pixel intensity, spectral values, and/or textural properties. Image segmentation is a key step in many approaches to data compression and image analysis [2]. Region growing is a general technique for image segmentation. The basic scheme consists of joining adjacent pixels to form regions; adjacent regions are then merged to obtain larger regions. The association of neighboring pixels or regions in the region growing process is governed by a homogeneity criterion that must be satisfied in order for pixels and regions to combine. The homogeneity criterion is application dependent and may be dynamic within a given
application. Based on the use of merge and split operators, region growing techniques may be classified as Merging, Splitting, or Split-and-Merge [3, 4]. The simplest approach is pixel aggregation, which starts with a set of "seed" pixels; from these, regions grow by appending to each seed point the neighboring pixels that have similar properties. Two immediate problems are the selection of initial seeds that properly represent the regions of interest and the selection of suitable properties for including pixels in regions during the growing process. An alternative method consists of successively subdividing the entire image R into smaller and smaller square regions so that, for any region Ri ⊂ R, P(Ri) = TRUE, where P(Ri) is a logical predicate over the set of pixels in Ri. That is, if P(Ri) = FALSE, Ri is subdivided again, and so on. If only splitting were used (Region Splitting) [5], the final partition would likely contain adjacent regions with identical properties. This drawback may be solved by allowing merging as well as splitting. The Split-and-Merge approach proposed by Horowitz and Pavlidis [1] solves the region growing problem in two stages: the Split and the Merge stages. The Split phase is a preprocessing stage that aims to reduce the number of merge steps required to solve the problem. This paper presents a parallel algorithm for solving the region growing problem based on the Horowitz and Pavlidis approach. In section 2, a description of the algorithm for the image segmentation problem is given. From a computational point of view, region growing algorithms are representative of the class of non-uniform problems, which are characterized by a behavior that is data dependent and cannot be determined at compilation time. Moreover, region growing has a very volatile behavior, starting with a high degree of parallelism that very rapidly diminishes to a much lower degree of parallelism.
Our parallel implementations of the algorithm proposed in section 2 are described in section 3. Finally, the performance evaluation of our parallel implementations executed on a message-passing architecture is shown in section 4.
2. Region Growing based on the Split-and-Merge approach

Region growing is a technique for partitioning an image by linking individual pixels into groups of pixels called regions. Merging pixels or regions to generate larger regions is usually governed by a set of predefined homogeneity criteria that must be satisfied. Every time pixels are combined to form a region, the region acquires certain properties based on the combined characteristics of the pixels as a group. Because homogeneity criteria usually depend on the specific application, a wide variety of them has been investigated. This criterion will be used as a test to determine whether a given group of pixels can be classified as a region. As an example, a threshold-based homogeneity criterion might be defined as:
H(R) = TRUE,  if |f(x) − f(y)| ≤ T for all x, y ∈ R;
H(R) = FALSE, otherwise,
where T is the threshold. This simple homogeneity criterion will be used in the evaluation of our parallel implementations.
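A minimal sketch of this criterion (the function name and the list-of-intensities representation are illustrative, not from the paper): since |f(x) − f(y)| ≤ T must hold for every pair of pixels in the region, it suffices to compare the maximum and minimum intensities.

```python
def homogeneous(region_pixels, threshold):
    """Threshold homogeneity criterion H(R): TRUE iff every pair of
    pixel intensities differs by at most the threshold T, which is
    equivalent to (max intensity - min intensity) <= T."""
    return max(region_pixels) - min(region_pixels) <= threshold
```

For example, `homogeneous([6, 7, 6, 7], 3)` is True, while `homogeneous([1, 8, 5], 3)` is False.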
2.1. Algorithmic Description

The Split-and-Merge technique used in our algorithm requires two types of operations: a fast split phase is followed by one or more merge phases. The split stage rapidly partitions an image into square regions which conform to a first homogeneity criterion; then a region growing technique is used to merge these square regions into larger regions which conform to a second homogeneity criterion.
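As an illustration, the split stage can be sketched with a top-down recursion (a hypothetical helper: the paper's own implementation builds the quad-tree bottom-up, and the image values and names here are illustrative, not taken from Figure 1):

```python
def split(image, x, y, size, threshold, regions):
    """Recursively split a square block until it satisfies the threshold
    homogeneity criterion, collecting homogeneous leaves as regions.
    `image` is a list of rows of pixel intensities; each leaf is
    recorded as (x, y, size)."""
    pixels = [image[y + j][x + i] for j in range(size) for i in range(size)]
    if size == 1 or max(pixels) - min(pixels) <= threshold:
        regions.append((x, y, size))  # homogeneous leaf of the quad-tree
        return
    half = size // 2
    for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
        split(image, x + dx, y + dy, half, threshold, regions)

# Illustrative 4x4 image with threshold T = 3: the whole image is not
# homogeneous, but each of its four 2x2 quadrants is.
img = [[6, 7, 1, 3],
       [6, 7, 1, 3],
       [8, 6, 5, 4],
       [8, 8, 5, 6]]
leaves = []
split(img, 0, 0, 4, 3, leaves)
```

For this image the recursion stops after one split level, yielding four 2×2 leaf regions.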
Figure 1: Split phase. (a) Pixel values and (b) pixel ID numbers at start of the split phase, (c) pixel values and (d) region IDs after first and final split iteration.
Split phase. The Split phase consists of generating a quad-tree, where nodes at level i are the result of splitting subimages of size N1/2^i × N2/2^i into four subimages of size N1/2^(i+1) × N2/2^(i+1). The construction of this quad-tree can be carried out following a bottom-up strategy of merging subimages or a top-down strategy of splitting. In both strategies, only those nodes satisfying the homogeneity criterion are generated. The set of leaves (regions) of the quad-tree produced by the Split phase is the set of input regions for the Merge phase. The Split phase ends when no more square regions can be generated. Each region is labeled by a unique identifier number (ID). Figure 1 shows the square regions produced by the split phase for a 4×4 image where the threshold value is T = 3.

Merge phase. In the merge stage, adjacent square regions obtained by the split stage are iteratively merged in order to build larger and larger regions according to the homogeneity criterion. This procedure continues until no more merges are possible. For many merging criteria, the order of merging is an important matter for both the computational cost of the algorithm and the final set of regions. Therefore, certain constraints on the merging order are imposed. The merge paradigm which tends to yield better results (the Best Merge Paradigm) allows all regions to check their neighbors concurrently, but requires that a region only merge with the neighboring region that best satisfies the homogeneity requirements. This imposes an ordering on the merge sequence. The Best Merge Paradigm is based on the following rules [2, 6]:

1. Each region can only merge with one other region at a time: the neighboring region with the best value of the homogeneity criterion.
2. A merge choice must be mutual for two regions to merge.
3. A tie is broken, arbitrarily, by selecting the neighbor with the smallest region ID, although other implementations solve ties with a random approach [7].

The merge process may be reformulated as an undirected, weighted graph problem. Let G = (V, E) be an undirected graph with weighted edges. The vertices V of the graph correspond to the regions in the image. The set of edges E is composed of the edges (v, w) such that the regions corresponding to vertices v and w share a common boundary. The edge weight equals the value of the homogeneity criterion evaluated for the regions represented by v and w, and it is given by:
Figure 2: Merge phase. (a) At start of the merge phase, (b) after first merge iteration, (c) after second merge iteration and (d) after third and final merge iteration.
e(v, w) = h(v, w) = H(v ∪ w)

Using this model, the process of merging regions in parallel is performed as follows: for all edges in E, merge those vertices v and w for which e(v, w) is the edge of minimum weight for both v and w. For vertices with more than one edge having the same minimum weight, the edge connecting to the vertex with the smaller ID is selected. Only edges whose weight is within the homogeneity threshold are considered. Two vertices v and w are merged by deleting the edge e(v, w). When two regions merge, the region with the smallest ID becomes the representative of both; all edges connected to w, (i, w), are relabeled as (i, v). All edges connected to v must then be updated with the new values of e(i, v). The process continues until there is no edge in G with a weight smaller than the homogeneity threshold. Figure 2 shows the different regions obtained and their corresponding graphs at each iteration of the merge phase for the 4×4 image of Figure 1. The small numbers in parentheses in the corners of the regions denote the region IDs.

3. Parallel Implementation

In this work several parallel implementations of the Split-and-Merge segmentation algorithm are proposed. They use an SPMD programming model, working on a MIMD parallel architecture with message-passing communications. The region growing process transforms a set of pixels into a set of labeled regions, and our goal is to perform these tasks concurrently in such a way that each processor works on disjoint subsets of pixels or regions. The Split phase works on the set of individual pixels of the image and is an inherently parallel task. So, initially, an image of size N1 × N2 is partitioned into P1 × P2 subimages containing N1/P1 × N2/P2 adjacent pixels. Neighboring subimages are located in neighboring processor elements (PEs). At the beginning of the merge phase, the computational work load of a processor element depends on the set of local regions created during the split phase as well as on the set of edges created between local regions and regions located at neighboring processors.

Figure 3: Examples of the methods for the region identifier assignment.
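The parallel merge selection described above — each vertex picks its minimum-weight neighbor, ties broken in favor of the smaller ID, and a pair merges only when the choice is mutual — can be sketched as follows (the dict-of-edges representation and function name are illustrative assumptions, not the paper's data structures):

```python
def best_merge_pairs(edges, threshold):
    """One parallel merge step over a weighted region graph.
    `edges` maps frozenset({v, w}) -> weight. Each vertex chooses its
    lowest-weight neighbor (ties broken by the smaller neighbor ID);
    a pair is returned only if the choice is mutual and the weight is
    within the homogeneity threshold."""
    best = {}  # vertex -> (weight, chosen neighbor ID)
    for edge, w in edges.items():
        if w > threshold:
            continue  # only edges within the homogeneity threshold
        v1, v2 = sorted(edge)
        for a, b in ((v1, v2), (v2, v1)):
            # tuple order prefers smaller weight, then smaller neighbor ID
            cand = (w, b)
            if a not in best or cand < best[a]:
                best[a] = cand
    # keep only mutual choices
    return {frozenset({a, b}) for a, (w, b) in best.items()
            if best.get(b, (None, None))[1] == a}
```

For a chain of regions 0–1–2–3 with edge weights 2, 1, 1, only the mutual pair {1, 2} merges in the first step, illustrating how tie chains serialize the merge order.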
Our implementation of the Split-and-Merge parallel region growing algorithm can be described by the following steps [8]:

1. A set of P1 × P2 subimages is created and mapped onto the set of PEs.
2. Each PE executes the Split phase on its own subimage using a bottom-up strategy and assigns an identifier (ID) to each of the created regions.
3. Each PE builds its own local weighted graph by creating the set of edges between its own local regions which satisfy the vicinity and homogeneity criteria.
4. PEs exchange information about the regions at the border of the initial subimage and create the set of weighted edges between regions that belong to different neighboring PEs.
5. Each region determines the linked regions that best satisfy the homogeneity criterion as candidate regions to merge with.
6. PEs exchange information about the best selection for those linked regions located at different PEs.
7. Vertices and edges of the graph are updated for the current set of regions, and the edges that do not satisfy the homogeneity criterion are removed.
8. Steps 5 to 7 are repeated while linked regions remain; otherwise the algorithm ends.

In step 2 of the algorithm, when the split phase is finished, each PE assigns an identifier to each of its created regions. These IDs are calculated as a function of the processor element identifier and depend on the maximum number of regions generated locally by a processor element. Region identifiers must be different for every region. Therefore, the efficiency of the algorithm also depends on the procedure applied for determining the IDs of the set of regions. To show this fact, the following three methods for defining the region IDs have been implemented and evaluated:

a) Consecutive. The set of regions obtained at the end of the split phase is examined sequentially from left to right and top to bottom, and consecutive IDs are assigned to consecutive regions. In this method, the ID given to a region r located at processor p is:

ID_r = p · (N1/P1) · (N2/P2) + i,

where i is an integer, 0 ≤ i < (N1/P1) · (N2/P2). So IDs of regions located at the same PE form a block of consecutive integers. Figure 3(a) shows an example of this method where the number of processors is Np = 4. After the split phase, the number of regions for processors p = 0, 1, 2, 3 is 1, 4, 4, 1, respectively. IDs are assigned consecutively, taking into account the maximum number of regions in a processor element (4 in this example) and the processor identifier.

b) Cyclic. IDs of the regions are cyclically distributed among PEs. The ID assigned to the i-th region located at processor p is determined as:

ID_i = p + (P1 · P2) · i.

Figure 3(b) shows this method for the same example as Figure 3(a). In this way, IDs are assigned cyclically, taking into account the maximum number of regions in a PE and the processor identifier.

c) Random. Region IDs are cyclically and randomly generated. This method is shown in Figure 3(c).

In the Consecutive method, adjacent regions tend to have consecutive identifiers. When two regions are merged, the new region obtains the smaller identifier of the two. As the algorithm evolves, regions start to merge and only those regions with the smallest IDs survive, so most of the computational work is likely to move to PE p = 0, and a strong load imbalance appears. On the other hand, ties are broken by choosing the neighbor with the smallest ID; since the region identifiers are consecutive, this imposes a serialization on the order of merges and, as a result, a small number of merges is carried out per iteration.

The Cyclic method tries to solve the problems of the Consecutive method. Region IDs are cyclically generated as a function of the number of processors and the maximum number of regions in a processor. In this way the distribution of the region IDs is more uniform, and as a consequence the load imbalance can be reduced. Moreover, this helps to break the chains of merge dependencies at the frontiers of the processors when ties are produced. An example of this behavior is presented in Table 1, where the number of iterations of the merge phase is given for two images (see Figure 5). These values correspond to executions carried out on a sequential processor which simulates the behavior of these methods when the number of processors is Np = 2, 4, 8 and 16. As Table 1 shows, the Cyclic method reduces the number of iterations with respect to the Consecutive method, because a larger number of merges can be carried out simultaneously; therefore the execution time is also reduced.

However, the Cyclic method maintains the serialization problem on the order of merges within every processor element, since the region identifiers follow an increasing order inside a processor element. We try to solve this problem with the Random method, which retains the cyclic properties, so that the good load balance is maintained, but the IDs inside a processor element follow a random order. The results for this approach are also presented in Table 1. For all the tested images, Random is the fastest method and Consecutive the slowest. This is mainly because the Random method eliminates the chains of merge dependencies in the selection of the regions to be merged, introducing an additional degree of parallelism into the algorithm.

This can be understood more clearly from the example of Figure 4, where multiple ties for the homogeneity criterion are represented. Suppose that at some point of the algorithm several regions have the same value of the homogeneity criterion, so the regions to be merged are determined by the smallest IDs. Figure 4(a) represents the Consecutive case, with ID values 0, 4, 5, 6 and 7. Regions 4 and 6 should merge with region 0 because it has the smallest ID, but region 0 chooses region 4, region 5 chooses region 4 and region 7 chooses region 5; then only regions 0 and 4 are fused. However, when the Random method is applied for identifying regions, the probability of merging two couples of regions is higher: Figure 4(b) shows that in the first step regions 4 and 5, and regions 9 and 1, are fused simultaneously. In this example the Random approach saves one merge iteration with respect to the Consecutive method.

Table 1: Number of iterations for different images and different region identifier assignments.

Image art1
Alg/Np          2      4      8     16
Consecutive   2965   2942   2974   2974
Cyclic        3091   1955   1871   1277
Random         888    731    961    834

Image rlf1
Alg/Np          2      4      8     16
Consecutive   1142   1142   1099   1099
Cyclic        1127    887    850    934
Random         392    455    519    348

Figure 4: Example of the merge phase when multiple ties occur. (a) Consecutive; (b) Random.

Figure 5: Examples of the original and the segmented images. (a) image art1 (224x224) and (b) image rlf1 (256x256).
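The three identifier schemes can be sketched as follows (a sketch assuming every PE reserves ID slots for the maximum number of locally generated regions; the function names and the seeded shuffle are illustrative assumptions):

```python
import random

def consecutive_id(p, i, max_regions):
    """Consecutive: PE p labels its i-th region from a block of
    consecutive integers, ID = p * max_regions + i."""
    return p * max_regions + i

def cyclic_id(p, i, num_pes):
    """Cyclic: IDs are interleaved among PEs, ID = p + num_pes * i."""
    return p + num_pes * i

def random_ids(p, count, max_regions, num_pes, seed=0):
    """Random: take this PE's cyclic IDs but shuffle their local order,
    breaking the increasing-ID serialization inside each PE."""
    ids = [cyclic_id(p, i, num_pes) for i in range(max_regions)]
    random.Random(seed + p).shuffle(ids)  # deterministic per-PE shuffle
    return ids[:count]
```

With num_pes = 4 and max_regions = 4, as in the Figure 3 example, PE 1's consecutive IDs are 4, 5, 6, 7 while its cyclic IDs are 1, 5, 9, 13; the random scheme reuses the cyclic ID set in a shuffled order.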
From a parallel point of view the region growing problem is an irregular and dynamic problem; i.e., the size and the number of regions, as well as the number of edges between regions, are data dependent and evolve during the algorithm execution. To solve this kind of problem on a multiprocessor, it is very convenient to apply load balancing strategies and to establish efficient communication patterns [9]. At the beginning of the merge phase the number of edges at each processor is data dependent, so it is necessary to apply a work load balancing strategy. We propose a static load balancing strategy as a step previous to the merge phase and a dynamic load balancing scheme after each iteration of the merge phase. In what follows, a description and evaluation of five different parallel implementations of the Split-and-Merge parallel region growing algorithm are presented. One of these implementations uses neither a balancing strategy nor a special communication scheme; the objective of this non-optimal implementation is to highlight the complexity of irregular and dynamic problems. The five implementations can be described as follows:
1. Block Consecutive (NB cons). This is the simplest implementation: no balancing strategies are applied and the interprocessor communications follow an All-to-All model. After the Split phase, regions must be identified by an integer. In this algorithm, the ID given to a region follows the Consecutive method, so IDs of regions located at the same PE are a block of consecutive integers. This implementation has serious load imbalance problems, as commented above.

2. Block Cyclic (NB cyc). This implementation is similar to the previous one but, in order to avoid the imbalance problem, IDs of the regions are cyclically numbered among PEs. This implementation also breaks the chains of merge dependencies at the frontiers of the PEs when ties take place. Therefore the number of iterations and the execution time are reduced with respect to the previous one.

3. Static Balanced Cyclic (SB cyc). This approach uses a static scheme for load balancing, and regions are identified as in the Block Cyclic strategy. Just before the merge phase starts, a work load balancing step is applied: those edges shared by regions belonging to two neighboring PEs are located at the PE with the minimum work load. On the other hand, communications follow a personalized strategy: messages are only exchanged between the two processors involved in a specific merge of regions.

4. Dynamic Balanced Cyclic (DB cyc). The unpredictable migration of edges at execution time may lead to severe load imbalances. This can be controlled by a dynamic load balancing scheme that manages the migration of edges during the merge process. This implementation is similar to the previous one, but it also includes a simple dynamic load balancing scheme which assigns some of the updated edges (created after a fusion of regions) to the PE with the minimum work load. This scheme is carried out after every iteration, without any extra interprocessor communication. Therefore, this implementation includes both a static and a dynamic scheme, and the regions are identified using a cyclic strategy in order to balance the load between processors.

5. Dynamic Balanced Random (DB rnd). The only difference with the DB cyc strategy is that the region identifiers are cyclically and randomly generated.

Speed-Up (art1)
Nproc        2      4      8      16
NB cons    0.80   1.87   3.83    8.57
NB cyc     0.99   2.66   5.26   11.06
SB cyc     1.75   4.2    7.42   13.5
DB cyc     1.85   4.4    8.2    15.5
DB rnd     1.72   3.66   7.23   13.48

Figure 6: Results obtained for the image art1 of size 224x224 pixels: total time versus number of processors (2 to 16) for the five implementations, and the corresponding Speed-Up values.
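As a sketch of the static balancing step used by SB cyc — each edge shared by two PEs is placed on the currently less loaded of the two — under the illustrative assumption that load is measured simply by edge counts (the names here are not from the paper):

```python
def place_frontier_edges(frontier_edges, load):
    """Assign each frontier edge to the PE currently holding the
    smaller work load. `frontier_edges` is a list of (pe_a, pe_b)
    pairs naming the two PEs that share an edge; `load` maps
    PE -> current edge count and is updated as edges are placed."""
    placement = []
    for pe_a, pe_b in frontier_edges:
        target = pe_a if load[pe_a] <= load[pe_b] else pe_b
        load[target] += 1  # placing the edge increases that PE's load
        placement.append(target)
    return placement
```

Because the load map is updated as edges are placed, successive frontier edges spread toward the lighter PE until the loads even out.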
4. Evaluations and Results

Executions of the five versions of the algorithm have been carried out on a Cray-T3E using up to 16 processors. The data presented in the graphs correspond to real executions measured on this machine. Both the size of the images used and the nature of the image contents were important factors in the performance evaluation. Large images with small objects yielded the best results. This is due to the fact that small objects require less interprocessor merging and allow a higher degree of parallelism, since there are fewer dependencies between merges. Figure 5(a) represents an example of this type of image. A single large object, or a few large objects, tends to produce a more restricted ordering in the merge sequence and requires a greater amount of interprocessor communication; Figure 5(b), which shows the original and the segmented image, belongs to this class of images. In Figures 6 and 7 the execution times of the different implementations (NB cons, NB cyc, SB cyc, DB cyc, DB rnd) are represented as a function of the number of processors for the images art1 and rlf1. The execution times are better when both the static and the dynamic balancing schemes are included. Personalized interprocessor communications also improve the performance of this algorithm. The region growing problem clearly shows a great level of irregularity both in load
distribution and interprocessor communication. Also, these graphs show that the execution time may be reduced by applying an effective region identifier scheme, which reduces both the imbalance and the serialization problems. The schemes which obtain the best results follow the cyclic and random strategies. The tables in Figures 6 and 7 give the Speed-Up values for the five implementations. The best Speed-Up values correspond to the DB cyc implementation, due to the improvement introduced by the work load balancing schemes and the communication optimization, together with the cyclic region identification. Although the dynamic balancing scheme is only effective in balancing loads locally, it is performed with very little overhead; the results show small improvements in performance with respect to the SB cyc implementation. We can also observe that although the DB rnd strategy obtains the best execution times, it does not obtain the best Speed-Up values. This is primarily due to two reasons: (i) the execution times of its sequential version are very small with respect to those of the other sequential versions, and (ii) this version does not execute the same number of iterations, which varies with the number of processors used. Therefore the parallel implementations may need more iterations than the sequential one, since this depends on the way the regions are identified and, in this case, the identifiers are random values.

Speed-Up (rlf1)
Nproc        2      4      8      16
NB cons    0.78   1.56   2.92    4.69
NB cyc     0.88   2.15   4.42    5.74
SB cyc     1.38   2.98   5.70    7.75
DB cyc     1.58   3.60   7.38   10.33
DB rnd     1.66   2.63   5.27    7.30

Figure 7: Results obtained for the image rlf1 of size 256x256 pixels: total time versus number of processors (2 to 16) for the five implementations, and the corresponding Speed-Up values.

5. Conclusions

In this work it has been shown how the method chosen for identifying regions may reduce work load imbalance problems in a parallel implementation. Results indicate that the algorithm performance is best when a dynamic load balancing scheme is applied but, in general, the performance degradation is due primarily to the amount of interprocessor communication required by a given solution, as well as to the level of dependence existing in the merge order for a particular image.

References

[1] S. L. Horowitz and T. Pavlidis. Picture segmentation by a directed split-and-merge procedure. Proc. 2nd IJCPR, pages 424–433, 1974.
[2] J. C. Tilton. Image segmentation by iterative parallel region growing and splitting. Quantitative Remote Sensing: An Economic Tool for the Nineties, 4:2420–2423, 1989.
[3] S. W. Zucker. Region growing: Childhood and adolescence. Comput. Graphics Image Process., 5:382–399, 1976.
[4] G. N. Khan and D. F. Gillies. Parallel-hierarchical image partitioning and region extraction. Computer Vision and Image Processing, pages 123–140, 1992.
[5] R. Ohlander, K. Price and D. R. Reddy. Picture segmentation using a recursive region splitting method. Comput. Graphics Image Process., 8:313–333, 1978.
[6] M. Willebeek-LeMair and A. Reeves. Solving nonuniform problems on SIMD computers: Case study on region growing. J. Parallel Distrib. Comput., 8:135–149, 1990.
[7] N. Copty, S. Ranka, G. Fox and R. V. Shankar. A data parallel algorithm for solving the region growing problem on the Connection Machine. Journal of Parallel and Distributed Computing, 21(1):160–168, 1994.
[8] M. G. Montoya, C. Gil and I. Garcia. Load balancing for a class of irregular and dynamic problems: region growing image segmentation algorithms. Euromicro Workshop on Parallel and Distributed Processing, pages 163–169, 1999.
[9] C. Gil, J. Ortega, A. F. Diaz and M. G. Montoya. Annealing-based heuristics and genetic algorithms for circuit partitioning in parallel test generation. Future Generation Computer Systems, 14:439–451, 1998.