IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 12, DECEMBER 2001

L2 Vector Median Filters on Arrays with Reconfigurable Optical Buses

Chin-Hsiung Wu and Shi-Jinn Horng

Abstract: In spite of their good filtering characteristics for vector-valued image processing, the usability of vector median filters is limited by their high computational complexity. Given an $N \times N$ image and a $W \times W$ window, the computational complexity of the vector median filter is $O(W^4 N^2)$. In this paper, we design three fast and efficient parallel algorithms for vector median filtering based on the 2-norm ($L_2$) on arrays with reconfigurable optical buses (AROB). For $1 \le p \le W \le q \le N$, our algorithms run in $O\left(\frac{W^4}{p^4}\log W\right)$, $O\left(\frac{W^4}{p^4}\frac{N^2}{q^2}\log W\right)$, and $O(1)$ time using $p^4 N^2/\log W$, $p^4 q^2/\log W$, and $W^4 N^2 \log N$ processors, respectively. In the sense of the product of time and the number of processors used, the first two results are cost optimal and the last one is time optimal.

Index Terms: Parallel algorithm, scalable algorithm, vector median filter, nonlinear filter, image (signal) processing, reconfigurable optical bus system.

1 INTRODUCTION

The vector median filter (VMF) [3] is a well-known technique in vector-valued image and signal processing since it is invariant to scale and bias. It has been proven to be a highly effective tool for removing impulse noise from color images without degrading image contours [26]. Color images may be processed using the vector median filter, where pixel values are treated as vector quantities rather than as separate components. Given a two-dimensional (2D) vector-valued image $A = [a(i,j)]$, $0 \le i,j < N$, in which each pixel has $d$ components $a_0(i,j), \ldots, a_{d-1}(i,j)$, $d \ge 1$, the vector median filtering operation replaces a noisy pixel of image $A$ by one of its neighbors within a neighborhood of that pixel. Usually, the neighborhood of a specified pixel is established using a $W \times W$ window centered at that pixel, where $W = 2w + 1$ is an odd number. In practice, $W \ll N$ since a larger filter might introduce extraneous pixels into the range of the offending noise; larger windows can remove impulses of larger width, but they also remove detailed lines [3]. When pixel $(i,j)$ is near the boundaries of $A$, the window is wrapped around the appropriate boundaries.

In median filtering, the image components are processed separately. However, this method has some drawbacks. The image components in real applications are generally correlated; if each component is processed separately, this correlation is not utilized [3]. Vector median filters can overcome these drawbacks of median filters for vector-valued images. The vector median filter may be evaluated as follows: For each pixel $a(i,j)$, $0 \le i,j < W$, within a specified window, the following sums of distances must be computed

C.-H. Wu is with the Department of Information Management, Chinese Naval Academy, Kaohsiung, Taiwan, Republic of China. E-mail: [email protected].
S.-J. Horng is with the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, Republic of China. E-mail: [email protected].
Manuscript received 2 May 2000; revised 18 Jan. 2001; accepted 11 June 2001. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112046.

$$S(i,j) = \sum_{k=0}^{W-1} \sum_{l=0}^{W-1} \| a(i,j) - a(k,l) \|, \qquad (1)$$

where $\|a(i,j) - a(k,l)\|$ is the norm (distance) between the two pixels. The norm may be measured on the basis of any $L_\gamma$ metric for $1 \le \gamma < \infty$. Since the vector median filters based on Euclidean distance ($L_2$-norm) are rotation invariant and have the best performance [4], the distances are computed on the basis of the $L_2$-norm in this paper. The pixel which yields the minimum value of $S$ is selected as the vector median $vm(i,j)$ used at the corresponding pixel position in the filtered image. That is, the vector median of window $\mathcal{W} = \{a(i+k, j+l) \mid -w \le k,l \le w\}$ is the pixel $vm(i,j) \in \mathcal{W}$ such that

$$S(vm) = \min\{S(i+k, j+l),\ -w \le k,l \le w\}. \qquad (2)$$
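The definitions in (1) and (2) can be illustrated with a minimal sequential sketch. It is not one of the parallel algorithms of this paper; the function names `vector_median` and `vmf` are hypothetical, pixels are assumed to be tuples of numbers, and the window is wrapped around the image boundaries as described above.

```python
import math

def vector_median(window):
    """Return the pixel of `window` with the smallest sum of L2 distances, as in (1)-(2)."""
    best_pixel, best_sum = None, math.inf
    for a in window:
        # Sum of Euclidean distances from pixel a to every pixel of the window.
        s = sum(math.dist(a, b) for b in window)
        if s < best_sum:  # ties keep the first minimum found
            best_pixel, best_sum = a, s
    return best_pixel

def vmf(image, w):
    """Filter an N x N image (list of rows of tuples) with a (2w+1) x (2w+1) window."""
    n = len(image)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Window centered at (i, j), wrapped around the image boundaries.
            window = [image[(i + k) % n][(j + l) % n]
                      for k in range(-w, w + 1) for l in range(-w, w + 1)]
            out[i][j] = vector_median(window)
    return out
```

Each output pixel costs $W^4$ distance evaluations in this sketch, which is the source of the sequential bound discussed next.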

The main drawback of vector median filters is their high computational complexity. In particular, the use of the Euclidean distance leads to the slowest algorithm since it involves the computation of a square root for each distance. The computational complexity of vector median filtering is $O(dN^2W^4)$, so the more components the vectors have, the higher the computational cost. For instance, we generally have three components for color images and a higher number for satellite images. We usually assume that each pixel has a constant number of components (i.e., $d$ is a constant); a straightforward sequential algorithm for vector median filtering then takes $O(N^2W^4)$ time. This is because of the computation of pairwise distances for each window. Previously, several fast sequential algorithms for 1-norm vector median filtering were proposed [7], [14]. No fast algorithm based on the 2-norm has been proposed so far. Only two related algorithms attempt to speed up such a filter based on Euclidean norm approximation [5], [6]. Caselles et al. showed that the vector median filter of a vector-valued image is equivalent to a collection of infimum-supremum morphological operations [10]. However, calculating the vector median of a window is an

1045-9219/01/$10.00 © 2001 IEEE


TABLE 1 Summary of Comparison Results for Parallel Vector Median Filtering Algorithms

inherently slow operation. In many applications, especially in real-time image processing for military and industrial uses, the speed of computation is very important. Because a computation-intensive task such as 2D vector median filtering requires much processing time, designing parallel algorithms is the only way to obtain a real-time response. Astola et al. [3] proposed an $O(WN)$ time algorithm for 1D vector median filtering. They used $W - 1$ distance computing elements (DCEs) to calculate the pairwise distances. Angelopoulos and Pitas [2] proposed an $O(W^4)$ time mesh algorithm for vector median filtering using $N^2$ processors. They also examined the performance of the proposed algorithm on the Massively Parallel Processor (MPP) machine [8]. In this paper, we will develop more flexible and faster algorithms for vector median filtering on the arrays with reconfigurable optical buses (AROB) model.

The mesh-connected computers (MCCs) are useful for vector median filtering [2] because of their simplicity and regularity in structure. Unfortunately, the MCC has two drawbacks: a fixed architecture and a long communication diameter. These two drawbacks can be overcome by equipping it with various types of bus systems. Recently, reconfigurable meshes have received much attention from researchers because they can reduce the drawbacks of the MCC [9], [24]. However, the exclusive access to the bus resources limits the throughput of end-to-end communication. Optical interconnections may provide an ultimate solution to this problem [11], [12], [13], [18], [21], [22]. The array with a reconfigurable optical bus system is defined as an array of processors connected to a reconfigurable optical bus system whose configuration can be dynamically changed by setting up the local switches of each processor. Due to unidirectional signal propagation and predictable delay of the signal per unit length, the optical buses enable synchronized concurrent access in a pipelined fashion. More recently, two related models have been proposed, namely the array with reconfigurable optical buses (AROB) [19] and the linear array with a reconfigurable pipelined bus system (LARPBS) [18]. A major difference between them is that online switch setting is allowed in the AROB model during a bus cycle, but it is not permitted in the LARPBS model. Many algorithms have been proposed for these two related models [13], [17], [18], [20], [21], [23], [25]. These indicate that arrays with a reconfigurable optical bus system are very efficient for parallel computation due to the high bandwidth and flexibility of a reconfigurable optical bus system.

The main contribution of this paper is in designing fast and flexible algorithms for computing the vector median filtering on an AROB. We first develop two optimal basic operations for integer summation and shifting a data matrix on an AROB model. Then, using these operations, we design three efficient and scalable parallel algorithms for the vector median filtering problem on the AROB model. Assume that the values of the pixels are $O(\log N)$-bit integers. Our algorithms run in $O\left(\frac{W^4}{p^4}\log W\right)$, $O\left(\frac{W^4}{p^4}\frac{N^2}{q^2}\log W\right)$, and $O(1)$ time using $p^4 N^2/\log W$, $p^4 q^2/\log W$, and $W^4 N^2 \log N$ processors, respectively. To the best of our knowledge, this is the first constant time algorithm for vector median filtering. Compared to previous results, as shown in Table 1, ours are more scalable and are either time or cost optimal.

The rest of this paper is organized as follows: We give a brief introduction to the AROB computational model in Section 2. Section 3 describes some data manipulation operations which will be used in the parallel vector median filtering algorithms. We develop the efficient vector median filtering algorithms in Section 4. Finally, some concluding remarks are included in the last section.

2 THE COMPUTATIONAL MODEL

A linear array of processors with pipelined buses (1D APPB) [11] of size $N$ contains $N$ processors connected to the optical bus with two couplers. One is used to write data on the upper (transmitting) segment of the bus and the other is used to read the data from the lower (receiving) segment of the bus. An example of a 1D APPB of size 5 is shown in Fig. 1. In order to ensure that all processors in Fig. 1 can write their messages on the bus simultaneously without collision and to determine which slot to use, each processor contains a counter used for the time waiting function, and the following collision-free condition must be satisfied: $d_o > b\,w\,c_g$, where $d_o$ is the optical distance between two adjacent processors (as shown in Fig. 1), $b$ is the maximum number of binary bits in each message, $w$ is the width of an optical pulse used to represent one bit in each message (in seconds), and $c_g$ is the velocity of light in the waveguide. Thus, in the same cycle time, the pipelined optical bus can transmit up to $N$ times more messages compared to the electrical bus of


Fig. 1. A linear APPB of size 5.

the same length without collisions on the bus, where $N$ is the number of processors in the array [11], [13], [20], [22]. The AROB model is essentially a mesh using the basic structure of a classical reconfigurable network (RN) [9] and optical buses. The linear AROB (LAROB or 1D AROB) extends the capabilities of the 1D APPB by permitting each processor to connect to the bus through a pair of switches. Each processor with a local memory is identified by a unique index, denoted as $P_i$, $0 \le i < N$, and each switch can be set to either cross or straight by the local processor. The optical switches are used for reconfiguration. When all switches are set to straight, the bus system operates as a regular optical bus. When both switches of processor $P_i$, $1 \le i < N-1$, are set to cross, the LAROB will be partitioned into two independent subbuses at processor $P_i$, where each of them forms an LAROB. That is, one consists of processors $P_0, P_1, \ldots, P_{i-1}$ and the other consists of processors $P_i, P_{i+1}, \ldots, P_{N-1}$, whose leader processor is


$P_i$ (as shown in Fig. 2c). Each processor uses a set of control registers to store information needed to control the transmission and reception of messages by that processor. An example of an LAROB of size $N$ is shown in Fig. 2a. Two interesting switch configurations derivable from a processor of an LAROB are also shown in Fig. 2b.

A 2D AROB of size $M \times N$, denoted as a 2D $M \times N$ AROB, contains $M \times N$ processors arranged in a 2D grid. Each processor is identified by a unique 2-tuple index $(i,j)$, $0 \le i < M$, $0 \le j < N$. The processor with index $(i,j)$ is denoted by $P_{i,j}$. Let $P_{i,*}$ denote the $i$th row, where $*$ represents all indexes along the $i$th row. For example, $P_{1,*}$ is equivalent to $P_{1,j}$, $0 \le j < N$. Each processor has four I/O ports, denoted by E, W, S, and N, to be connected with a reconfigurable optical bus system. The interconnection among the four ports of a processor can be reconfigured during the execution of algorithms. Thus, multiple arbitrary linear arrays like the LAROB can be specified in a 2D AROB. The two terminal processors which are located at the end points of the constructed LAROB may serve as the leader processors (similar to $P_0$ in Fig. 2a). The relative position of any processor on a bus to which it is connected is its distance from the leader processor. For more details on the AROB, see [19]. An example of a 2D $4 \times 4$ AROB and the 10 allowed switch configurations are shown in Fig. 3.

A 2D AROB model can be extended to a 3D AROB model with two extra ports, U and D, for communication with the neighboring processors in the third dimension. For a 3D AROB of size $M \times N \times L$, each processor is identified by

Fig. 2. (a) An LAROB of size N. (b) The switch states. (c) An example of bus reconfiguration.


Fig. 3. (a) A $4 \times 4$ AROB. (b) The processor connection in a 2D AROB. (c) The allowed switch configurations.

a unique 3-tuple $(i,j,k)$, $0 \le i < M$, $0 \le j < N$, $0 \le k < L$. The processor with index $(i,j,k)$ is denoted by $P_{i,j,k}$. An example of a 3D $4 \times 4 \times 4$ AROB is shown in Fig. 4. Similarly, the $l$-D AROB can be extended from $(l-1)$-D AROBs with two extra ports for each processor, $l > 3$.

A petit cycle length is defined as the time needed for a pulse to traverse the optical distance between two consecutive processors on the bus. A bus cycle length is defined as the end-to-end propagation delay (about $2N$ petit cycles) of the messages, together with the message processing time for message generation, buffering, encoding, decoding, etc., in the source and destination processors. Although the bus cycle length can be considered to be $O(N)$, Melhem et al. [15] assumed that the bus cycle length in a pipelined optical bus is compatible with the computation speed of the processors. This assumption needs further justification. The ratio $r$ between the message processing time and the time needed for communication over a distance $d_o$ is on the order of $10^3$ [11], [19]. This implies that a bus cycle length can be regarded as a constant for systems of size $O(r)$ [13]. In order for the message processing time to be of the same order of magnitude as the end-to-end communication time, the number of processors should satisfy $N = O(r)$. Guo et al. [11] also showed that, as $d_o$ approaches $b\,w\,c_g$, the pipelined

bus can accommodate $N$ messages in the same cycle time as the exclusive access bus. Although the message communication time is proportional to the system size $N$, in parallel computing, the time for a message to propagate over a link in an RN or in a mesh with global broadcasting buses is assumed to be constant [1]. Since the message communication time on the optical pipelined bus is very small, we also assume that the bus cycle length is constant and compatible with the computation time of any arithmetic or logic operation. This assumption has been widely adopted in the literature [1], [9], [13], [15], [16], [17], [18], [23], [24], [25]. In the literature [11], [19], [20], [21], the time complexity of algorithms simulated on an array with optical pipelined buses is expressed in terms of the number of steps, in order to abstract from the details introduced by technology-dependent parameters. A unit delay is defined to be the spatial length of a single optical pulse, shown as a loop in Fig. 2a, which may introduce a time slot delay between two processors on the receiving segment. Several approaches can be applied to route messages from one processor to another in an optical bus system: the time waiting function [11], the time-division multiplexing scheme [22], and the coincident pulse technique


Fig. 4. A $4 \times 4 \times 4$ AROB.

[12], [22]. It has been shown that these data routing approaches can be implemented in constant time [19]. An arbitrary permutation can be performed using these methods. For the sake of completeness, we may use a time waiting function, $wait(s, t) = s + t + 1$, to determine how long processor $P_t$ should wait, relative to the beginning of the bus cycle, before reading the message sent on the bus from source processor $P_s$. For example, in Fig. 2a, assume processors $P_0$ and $P_1$ want to send messages $m_1$ and $m_2$ to processors $P_2$ and $P_3$, respectively. First, messages $m_1$ and $m_2$ are written on the bus (by processors $P_0$ and $P_1$, respectively) simultaneously at the beginning of a bus cycle. Then, processors $P_2$ and $P_3$ will receive messages $m_1$ and $m_2$ at times $2 + 0 + 1 = 3$ and $3 + 1 + 1 = 5$, relative to the beginning of the bus cycle, respectively.

In a unit of time, assume each processor can either perform arithmetic and logic operations or communicate with others on a bus. The model also allows multiple processors to broadcast data on different buses, or to broadcast the same data on the same bus, simultaneously in one time unit if there is no collision [13], [20], [21], [23], [25]. Let $var(i)$ and $array(i)_k$ denote a local variable $var$ and element $k$ of an array $array$ in the processor with index $i$. For example, $sum(0)$ and $a(0)_k$ are the local variable $sum$ and element $k$ of the array $a$ of processor $P_0$.

3 BASIC DATA MANIPULATION OPERATIONS

In this section, we will develop several basic operations on the AROB model. These basic operations will be used to develop efficient vector median filtering algorithms in the next section. Conceptually, a 3D $W^4 \times N \times N$ AROB may be partitioned into $W^4$ layers, denoted by $M_m$, $0 \le m < W^4$, each of size $1 \times N \times N$; each $M_m$ consists of $N^2/W^2$ windows, each of size $W \times W$, denoted by $\mathcal{W}_{s,t}$, $0 \le s,t < N/W$. Without loss of generality, assume that $W$ is a factor of $N$ and $W \ll N$. To make the algorithm presentation more comprehensible, the previous results which have been proposed on the AROB are summarized in the following:

Lemma 1 [16]. Given $N$ integers or normalized real numbers, each of size $O(\log N)$ bits, these $N$ numbers can be added in $O(\log N)$ time on an LAROB using $N$ processors.

Lemma 2 [18]. Given $N$ numbers, each of size $O(\log N)$ bits, the maximum (minimum) of these $N$ numbers can be found either in $O(\log \log N)$ time on an LAROB using $N$ processors, or in $O(1)$ time on an AROB using $N^2/2$ processors.

Lemma 3 [20]. Given $N$ numbers, each of size $O(\log N)$ bits, these $N$ numbers can be added in $O(1)$ time on an $N \times \log N$ AROB.
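The logarithmic bound of Lemma 1 is the usual cost of a pairwise (divide-and-conquer) reduction. The sketch below is a sequential illustration that only counts parallel rounds; it is not the LAROB algorithm of [16], and the names `tree_sum` and `scalable_sum` are hypothetical. The second function assumes $p$ workers, where worker $j$ holds items $j, j+p, j+2p, \ldots$ and first adds its own items before the tree reduction, which gives on the order of $N/p + \log p$ steps.

```python
def tree_sum(values):
    """Pairwise summation; returns (total, number of parallel rounds)."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # One parallel round: element 2k absorbs element 2k+1 (if it exists).
        vals = [sum(vals[k:k + 2]) for k in range(0, len(vals), 2)]
        rounds += 1
    return (vals[0] if vals else 0), rounds

def scalable_sum(values, p):
    """Sum N values with p workers; returns (total, local steps + tree rounds)."""
    vals = list(values)
    # Worker j adds its own items j, j+p, j+2p, ... sequentially.
    partial = [sum(vals[j::p]) for j in range(p)]
    local_steps = -(-len(vals) // p)  # ceil(N / p)
    total, rounds = tree_sum(partial)
    return total, local_steps + rounds
```

For 16 values, `tree_sum` needs 4 rounds, while `scalable_sum` with 4 workers needs 4 local steps plus 2 rounds.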

3.1 Scalable Operations

In reality, the number of processors available in the system is not always enough for applications. If the number of processors available in the system is $p$, $1 \le p \le N$, then each processor contains an array with $O(N/p)$ data items to be processed. Therefore, Lemmas 1 and 2 can be modified to run on an AROB using $p$ processors since they are based on the divide-and-conquer technique. Assume that data item $a_i$, $0 \le i < N$, is stored in array element $a(i \bmod p)_{\lfloor i/p \rfloor}$ of processor $P_{i \bmod p}$, i.e., processor $P_j$ holds data items $a_j, a_{j+p}, \ldots, a_{j+N-p}$. The scalable summation operation can be performed as follows: Each processor first sums up the $O(N/p)$ numbers stored in its local memory sequentially. This takes $O(N/p)$ time. Then, apply Lemma 1 to sum up these $p$ partial sums in $O(\log p)$ time. Thus, the summation operation can be done in $O(N/p + \log p)$ time on an LAROB using $p$ processors. In particular, if $p = N/\log N$, then the time complexity is $O(\log N)$. The maximum (minimum) operation can also be done in $O(N/p)$ time on an LAROB with $p$ processors by a similar method. Thus, we have the following lemmas:

Lemma 4. Given $N$ integers or normalized real numbers, each of size $O(\log N)$ bits, these $N$ numbers can be added in $O(N/p + \log p)$ time on an LAROB using $p$ processors.

Lemma 5. Given $N$ numbers, each of size $O(\log N)$ bits, the maximum (minimum) of these $N$ numbers can be found in $O(N/p)$ time on an AROB using $p^2$ processors.

3.2 Shift

Let $A = [a_{i,j}]$, $0 \le i,j < N$, be a data matrix of size $N \times N$ stored in the local variable $a(i,j)$ of a 2D $N \times N$ AROB. Also, let $x$ and $y$ be a pair of vertical and horizontal displacements. The 2D shift is defined to circularly shift all data elements of $A$ up and left by $x$ and $y$ positions relative to the original matrix $A$. That is, for a fixed pair of displacements $x$ and $y$, $0 \le x,y < W$, move $a(i,j)$, $0 \le i,j < N$, to $a'((i-x) \bmod N, (j-y) \bmod N)$. For example, if $x = 2$ and $y = 3$, then $a(2,3)$ is moved to $a'(0,0)$. By applying the pipelining ability and reconfigurability of the optical buses, the detailed 2D shift algorithm (2DSA) is shown in the following:

Procedure 2DSA($a$, $x$, $y$, $a'$)
1. // Rotate all data elements up by $x$ positions. //
   Processor $P_{i,j}$, $0 \le i,j < N$, copies $a(i,j)$ to $ax((i-x) \bmod N, j)$ through the $i$-dimensional bus.
2. // Rotate all data elements to the left by $y$ positions. //
   Processor $P_{i,j}$, $0 \le i,j < N$, copies $ax(i,j)$ to $axy(i, (j-y) \bmod N)$ through the $j$-dimensional bus.
3. Processor $P_{i,j}$, $0 \le i,j < N$, copies $axy(i,j)$ to $a'(i,j)$.
End{2DSA}

Each step of procedure 2DSA can be easily verified to run in $O(1)$ time. We have the following lemma:

Lemma 6. Procedure 2DSA can be computed in $O(1)$ time on a 2D $N \times N$ AROB.

Fig. 5. An illustration of the shift operation on a 3D $16 \times 4 \times 4$ AROB, where $N = 4$, $W = 2$, $x = y = 1$. $a(0,1,1)$ is moved to $a'(3 + 4z, 0, 0)$, for $x = y = 1$ and $0 \le z < W^2$.

Next, we will perform the 3D shift. Consider a 3D $W^4 \times N \times N$ AROB with $W^4$ layers ($M_0$ through $M_{W^4-1}$), each layer of size $N \times N$. Let the data matrix $A$ be stored in the local variable $a(0,i,j)$ of the first layer (i.e., $M_0$). Also, let $x$ and $y$, $0 \le x,y < W$, be a pair of vertical and horizontal displacements relative to $A$, respectively. The 3D shift is defined to circularly shift the data elements of $A$ located on $M_0$ up and left by $x$ and $y$ positions and then store them on $M_{zW^2+xW+y}$, $0 \le z < W^2$. That is, for each pair of displacements $x$ and $y$, $0 \le x,y < W$, move $a(0,i,j)$ located on $M_0$, $0 \le i,j < N$, to $a'(zW^2+xW+y, (i-x) \bmod N, (j-y) \bmod N)$ located on $M_{zW^2+xW+y}$, $0 \le x,y < W$ and $0 \le z < W^2$. For example, as shown in Fig. 5, where $x = 1$, $y = 1$, and $W = 2$, $a(0,1,1)$ is moved to $a'(3+4z, 0, 0)$, $0 \le z < W^2$. Following the idea of procedure 2DSA, this operation can also be run in $O(1)$ time; the detailed 3D shift algorithm (3DSA) is listed below. Assuming $N = 4$ and $W = 2$, an illustration of procedure 3DSA is also shown in Fig. 6.

Procedure 3DSA($a$, $a'$)
1. Processor $P_{0,i,j}$, $0 \le i,j < N$, broadcasts $a(0,i,j)$ to $a(k,i,j)$, $0 \le k < W^4$, through the $k$-dimensional buses.
2. For each $M_{zW^2+xW+y}$, $0 \le x,y < W$, $0 \le z < W^2$, call 2DSA($a$, $x$, $y$, $a'$).
End{3DSA}

Lemma 7. Procedure 3DSA can be computed in $O(1)$ time on a 3D $W^4 \times N \times N$ AROB. Initially, the data matrix $A$ is stored in the local variable $a(0,i,j)$ of $M_0$ for all $0 \le i,j < N$.

Proof. The correctness of this procedure clearly follows. The $O(1)$ time complexity can be easily verified by the pipelined transmission ability of optical buses and procedure 2DSA. □
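The index arithmetic of the two shift procedures can be checked with a small sequential sketch. It only mimics where each data element ends up, not the bus operations or their constant running time; the function names `shift_2d` and `shift_3d` and the list-of-lists representation are assumptions made for illustration.

```python
def shift_2d(a, x, y):
    """Circularly shift an N x N matrix up by x rows and left by y columns (cf. 2DSA)."""
    n = len(a)
    # Step 1: rotate up by x, so a[i][j] lands in row (i - x) mod n.
    ax = [[a[(i + x) % n][j] for j in range(n)] for i in range(n)]
    # Step 2: rotate left by y, so ax[i][j] lands in column (j - y) mod n.
    return [[ax[i][(j + y) % n] for j in range(n)] for i in range(n)]

def shift_3d(a, w):
    """Return W^4 layers; layer z*W^2 + x*W + y holds `a` shifted by (x, y) (cf. 3DSA)."""
    layers = [None] * w ** 4
    for z in range(w * w):
        for x in range(w):
            for y in range(w):
                layers[z * w * w + x * w + y] = shift_2d(a, x, y)
    return layers
```

With a $4 \times 4$ matrix whose element at row $i$, column $j$ is the pair `(i, j)`, `shift_2d(a, 2, 3)[0][0]` is `(2, 3)`, and for $W = 2$ every layer $3 + 4z$ of `shift_3d(a, 2)` holds `(1, 1)` at position $(0,0)$, in line with the examples in the text.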



Fig. 6. An illustration of procedure 3DSA. (a) Initialization. (b) After Step 1. (c) After Step 1 of procedure 2DSA. (d) After Step 2 of procedure 2DSA.

4 PARALLEL VECTOR MEDIAN FILTERING ALGORITHMS

Given an $N \times N$ image $A = [a(i,j)]$, $0 \le i,j < N$, and a $W \times W$ window, the vector median filtering problem is to find, as the filter window moves over all pixels of the input image, the pixel whose sum of distances to the other pixels inside the window is the minimum. Since the window is centered at each specified pixel, the image is shifted circularly down and right by $w$ positions. Then, the result $vm(i,j)$ of each window operation is stored in a location corresponding to the top-left corner of the window. That is, the result of vector median filtering $vm(i,j)$, $0 \le i,j < N$, is computed as follows: For each window $\mathcal{W}_{i,j} = \{a(i+c, j+d) \mid 0 \le c,d < W\}$ whose top-left corner is $a(i,j)$, $0 \le i,j < N$, first compute the sum of


Fig. 7. The conceptual mapping of an $N \times N$ image and a $W \times W$ window to a 3D $W^4 \times N \times N$ AROB for algorithm VMFA, where $W = 2$ and $N = 4$.

distances from every pixel within the window to all the other pixels inside the window as:

$$S(i+c, j+d) = \sum_{k=0}^{W-1} \sum_{l=0}^{W-1} \| a((i+c) \bmod N, (j+d) \bmod N) - a((i+k) \bmod N, (j+l) \bmod N) \|, \quad 0 \le c,d < W. \qquad (3)$$

Then, find the minimum among these sums of distances by setting

$$m(i,j) = \min\{S(i+c, j+d),\ 0 \le c,d < W\}. \qquad (4)$$

Finally, pixel $a(c', d')$ within the window $\mathcal{W}_{i,j}$ is the vector median of this window if its sum of distances $S(c', d')$ is the minimum. That is, set

$$vm(i,j) = a(c', d'), \quad \text{if } S(c', d') = m(i,j). \qquad (5)$$

Assume that the value of a pixel of the image is an $O(\log N)$-bit integer in the following. In the remainder of this section, we will develop several efficient parallel algorithms for vector median filtering on the AROB model based on the number of processors available in the system. To implement (3), (4), and (5) on a 3D AROB of size $W^4 \times N \times N$, we redefine the coordinates of the image in the following. Conceptually, the $N \times N$ image can be viewed as $\frac{N}{W} \times \frac{N}{W}$ subimages, each of size $W \times W$, denoted by $SI_{s,t}$, $0 \le s,t < N/W$. Without loss of generality, assume that $W$ is a factor of $N$. Our algorithms can be easily extended to the general case when the shape of the image is not square. Then, for each pair of the vertical and horizontal displacements $x$ and $y$ of each subimage $SI_{s,t}$, the sum of distances specified in (3) can be reformulated as

$$S(sW+x+c, tW+y+d) = \sum_{k=0}^{W-1} \sum_{l=0}^{W-1} \| a((sW+x+c) \bmod N, (tW+y+d) \bmod N) - a((sW+x+k) \bmod N, (tW+y+l) \bmod N) \|, \qquad (6)$$

where $0 \le s,t < N/W$, and $0 \le c,d,x,y < W$.

As stated in (3), (4), and (5), there are $N^2$ vector medians to be computed. A 3D $W^4 \times N \times N$ AROB can be viewed as $N^2$ LAROBs, each of size $W^4 \times 1 \times 1$, denoted by $LA_{i,j}$, $0 \le i,j < N$; each $LA_{i,j}$ can be further viewed as $W^2$ subLAROBs, each of size $W^2 \times 1 \times 1$, denoted by $SLA_{z,i,j}$, $0 \le z < W^2$. Each $SLA_{z,i,j}$ and $LA_{i,j}$ are responsible for computing a sum of distances and a vector median, respectively. Thus, there are $N^2$ LAROBs to compute these $N^2$ vector medians. That is, $LA_{i,j}$ is responsible for computing $vm(i,j)$, specified in (5), $0 \le i,j < N$. A conceptual mapping of the computations is shown in Fig. 7. In order to let these $N^2$ LAROBs compute (3) in parallel, we must shift the image pixels to their corresponding LAROBs, i.e.,

$$a'(zW^2+xW+y, i, j) = a((i+x) \bmod N, (j+y) \bmod N).$$

It means that, for a specified LAROB $LA_{i,j}$ with a fixed pair of displacements $x$ and $y$, (3), (4), and (5) can be reformulated as follows:



Fig. 8. An illustration of algorithm VMFA, where $N = 4$ and $W = 2$. (a) After Step 3, each processor contains two vectors and computes the distance between them. (b) After Step 4.1, each subLAROB sums up the distances. (c) Each LAROB finds its corresponding vector median and the final result is stored at its first processor.

$$S(zW^2+z, i, j) = \sum_{k=0}^{W-1} \sum_{l=0}^{W-1} \| a'(zW^2+z, i, j) - a'(zW^2+kW+l, i, j) \|, \qquad (7)$$

where $z = xW + y$.

$$m(0,i,j) = \min\{S(zW^2+z, i, j),\ 0 \le i,j < N,\ 0 \le z < W^2\}. \qquad (8)$$

Set

$$vm(0,i,j) = a'(zW^2+z, i, j) \qquad (9)$$

when $S(zW^2+z, i, j)$ is the minimum among the sums of distances computed on this LAROB, as specified in (8). Based on (7), (8), and (9), the high level description of the vector median filtering algorithm on a 3D AROB is specified by the following two steps: First, shift the image pixels located at $M_0$ to their corresponding LAROBs according to the specified $x$ and $y$ displacements. Then, all LAROBs compute the vector medians simultaneously. Assume that the image is initially stored in the local variables $a(0,i,j)$, $0 \le i,j < N$, of a 3D $W^4 \times N \times N$ AROB. Finally, the result is stored in the local variable $vm(0,i,j)$ of a 3D $W^4 \times N \times N$ AROB. The detailed algorithm (VMFA) is listed in the



following. Following procedure 3DSA (Fig. 6), an illustration of algorithm VMFA is also shown in Fig. 8.

Algorithm VMFA($a$, $vm$)
1. // Since the window is centered at each specified pixel, the image is shifted circularly down and right by $w$ positions. //
   Call 2DSA($a$, $-w$, $-w$, $a'$).
2. Call 3DSA($a'$, $a''$).
3. Partition each LAROB into $W^2$ subLAROBs; processor $P_{zW^2+z,i,j}$, $0 \le z < W^2$, $0 \le i,j < N$, broadcasts $a''(zW^2+z, i, j)$ to $\bar{a}(zW^2+z', i, j)$, $0 \le z' < W^2$, through the $k$-dimensional bus.
4. Compute the vector median on each LAROB according to (7), (8), and (9).
   4.1. For each subLAROB, compute (7) using Lemma 1.
   4.2. For each LAROB, compute (8) using Lemma 2.
   4.3. For each LAROB, find the vector median according to (9), then move it to $M_0$.

Theorem 1. Algorithm VMFA can be computed in $O(\log W)$ time on a 3D $W^4 \times N \times N$ AROB, where $W < N$.

Proof. The correctness of this algorithm directly follows from (7), (8), (9), and Lemmas 1-7. Steps 4.1-4.3 are used to implement (7), (8), and (9) for computing a specified vector median. For computing (7), processor $P_{zW^2+xW+y,i,j}$ first copies $a''(zW^2+xW+y, i, j)$ to all processors on its subLAROB (i.e., to another variable named $\bar{a}(zW^2+z', i, j)$, $0 \le z,z' < W^2$, of processor $P_{zW^2+z',i,j}$). Then, processor $P_{zW^2+z',i,j}$ computes the distance between $\bar{a}(zW^2+z', i, j)$ and $a''(zW^2+xW+y, i, j)$ (these correspond to $a'(zW^2+z, i, j)$ and $a'(zW^2+kW+l, i, j)$, as specified in (7), respectively) for a fixed pair of displacements $x$ and $y$. Then, each subLAROB sums up these $W^2$ distances. Finally, each LAROB computes the corresponding vector median by finding the minimum among the $W^2$ sums of distances computed in the previous step. Since there are $N^2$ LAROBs, all $N^2$ vector medians can be correctly computed by the proposed algorithm.

The time complexity is analyzed as follows: Steps 1 and 2 each take $O(1)$ time by Lemmas 6 and 7, respectively. Step 3 also takes $O(1)$ time. Step 4.1 takes $O(\log W)$ time by Lemma 1, since the parallel addition algorithm is based on the divide-and-conquer technique and is independent of the domain of the input data. Similarly, Step 4.2 takes $O(1)$ time by Lemma 2. Step 4.3 takes $O(1)$ time. Hence, the total time complexity is $O(\log W)$. □

Next, we will derive a scalable and cost optimal algorithm from algorithm VMFA when the number of processors available in the system is $p^4 N^2$. Without loss of generality, assume that $p$ is a factor of $W$. Our algorithms can be easily extended to the general case when the shape of the image is not square. As stated before, a 3D AROB of size $p^4 \times N \times N$ consists of $N^2$ LAROBs, each of size $p^4 \times 1 \times 1$, denoted by $LA'_{i,j}$, $0 \le i,j < N$; each $LA'_{i,j}$ can be further viewed as $p^2$ subLAROBs, each of size $p^2 \times 1 \times 1$, denoted by $SLA'_{z,i,j}$, $0 \le z < p^2$. Each subimage $SI_{s,t}$,


N 0 s; t < W , also consists of Wp Wp blocks, each of size p p, denoted by Bu; v , 0 u; v < Wp . In this case, we can use the divide-and-conquer technique to compute all vector medians. Similarly, each LAROB is responsible for computing a specified vector median for each pair of displacements x and y of each specified subimage SIs; t . To compute one vector median needs OW 4 operations and we use W 4 processors to do it in algorithm VMFA. Recall that each LAROB performs two substeps to compute one vector median: Each subLAROB first computes the sum of distances for a specified pixel to all other pixels within the window (i.e., (7)); then, each LAROB finds the minimum of these sums (i.e., (8)). Now, since each LAROB has only p2 subLAROBs, and each subLAROB has only p2 processors, it needs two nested loops to compute one vector median. In the inner loop, each subLAROB computes only the partial sum of distances for a specified pixel to all other pixels within a specified block of the window. Since each 2 window contains Wp2 blocks, the inner loop must be 2 repeated Wp2 times. Similarly, since there are only p2 subLAROBs available, the outer loop must also be repeated W2 2 p2 times to compute the sums of distances of all W pixels. In order to let these N 2 LAROBs compute the vector medians in parallel, we must shift the image pixels to their corresponding LAROBs, respectively. That is, for each iteration of the inner loop, shift the image pixels of each block Bu; v to their corresponding LA0i;j such that each processor of the LA0i;j stores the image pixels up by x vertical displacement and left by y horizontal displacement corresponding to the original image pixels located on the M0 . Then, each SLA0i;j;z computes (7). Finally, each LA0i;j computes (8) and (9). Repeating these processes for each block of each subimage from left to right and then up to down for the outer loop, we can compute all vector 4 medians. 
That is, by repeating algorithm VMFA $W^4/p^4$ times, all $N^2$ vector medians can be computed. Assume that the image is initially stored in the local variables $a(0, i, j)$, $0 \le i, j < N$, of a 3D $p^4 \times N \times N$ AROB. Finally, the result is stored in the local variable $vm(0, i, j)$ of a 3D $p^2 \times N \times N$ AROB. Since each LAROB contains only $p^4$ processors and the complexity of computing a specified vector median is $W^4$, $\frac{p^4}{W^4} N^2$ vector medians are computed at each iteration on a 3D $p^4 \times N \times N$ AROB. Therefore, $W^4/p^4$ iterations are required to compute the $N^2$ vector medians.

Furthermore, if the number of processors available in the system is $p^4 N^2 / \log W$, then the summation operations used in Step 4.1 of algorithm VMFA run in $O(\log W)$ time using $p^4 N^2 / \log W$ processors by Lemma 4, and the minimum operation used in Step 4.2 of algorithm VMFA runs in $O(\log \log W)$ time by Lemma 2. This implies that the time complexity of algorithm VMFA becomes $O(\frac{W^4}{p^4} \log W)$, while the number of processors is reduced by a factor of $O(\log W)$, so the proposed algorithm is cost optimal. This leads to the following theorem:

Theorem 2. Algorithm VMFA can be computed in $O(\frac{W^4}{p^4} \log W)$ time on a 3D $p^4 \times N \times N / \log W$ AROB, where $W < N$.
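The two nested loops described above can be emulated sequentially for a single window. The following is only a minimal sketch, not the AROB algorithm itself: the function names (`distance_sums_direct`, `distance_sums_blocked`, `vector_median`) and the data layout (a W x W list of tuples) are assumptions made for illustration, and each "step" stands for the work that one LAROB with $p^4$ processors would perform in one iteration.

```python
import math
import random


def distance_sums_direct(window):
    """Sum of L2 distances from every pixel to all pixels of a W x W window."""
    W = len(window)
    pixels = [window[i][j] for i in range(W) for j in range(W)]
    return [[sum(math.dist(window[i][j], q) for q in pixels) for j in range(W)]
            for i in range(W)]


def distance_sums_blocked(window, p):
    """Same sums, accumulated block by block with p x p blocks (p divides W)."""
    W = len(window)
    if W % p != 0:
        raise ValueError("p must be a factor of W")
    nb = W // p  # number of blocks per side
    sums = [[0.0] * W for _ in range(W)]
    steps = 0
    # Outer loop: one group of p^2 candidate pixels at a time (W^2/p^2 groups).
    for cu in range(nb):
        for cv in range(nb):
            candidates = [(i, j)
                          for i in range(cu * p, (cu + 1) * p)
                          for j in range(cv * p, (cv + 1) * p)]
            # Inner loop: partial sums against one p x p block at a time.
            for u in range(nb):
                for v in range(nb):
                    block = [window[i][j]
                             for i in range(u * p, (u + 1) * p)
                             for j in range(v * p, (v + 1) * p)]
                    for (i, j) in candidates:
                        sums[i][j] += sum(math.dist(window[i][j], q) for q in block)
                    steps += 1
    return sums, steps


def vector_median(window, sums):
    """Pixel of the window whose sum of distances is minimal."""
    W = len(window)
    i, j = min(((i, j) for i in range(W) for j in range(W)),
               key=lambda ij: sums[ij[0]][ij[1]])
    return window[i][j]


if __name__ == "__main__":
    random.seed(0)
    W, p, d = 9, 3, 3
    win = [[tuple(random.random() for _ in range(d)) for _ in range(W)]
           for _ in range(W)]
    sums, steps = distance_sums_blocked(win, p)
    print(steps, vector_median(win, sums))
```

With $W = 9$ and $p = 3$ the blocked version performs $(W/p)^4 = 81$ steps, which matches the $W^4/p^4$ iteration count used in the analysis.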


Similarly, if the number of processors available in the system is $p^4 q^2$, then each processor holds an array with $O(N^2/q^2)$ data to be processed. Without loss of generality, assume that $q$ is a factor of $N$ and $W \le q \le N$. For this case, the 3D shift operation can be done in $O(N^2/q^2)$ time using the pipelined optical buses. Since each LAROB is of size $p^4 \times 1 \times 1$, each LAROB computes $p^4/W^4$ vector medians at each iteration, from Theorem 2. Since we have $q^2$ LAROBs, $\frac{p^4 q^2}{W^4}$ vector medians can be computed at each iteration. By repeating algorithm VMFA $N^2/q^2$ times, $\frac{p^4 N^2}{W^4}$ vector medians can be computed. Furthermore, by repeating algorithm VMFA $W^4/p^4$ times again, all $N^2$ vector medians can be computed. Based on the divide-and-conquer technique as stated above, algorithm VMFA can be modified to run flexibly and efficiently. This leads to the following result:

Theorem 3. Algorithm VMFA can be computed in $O(\frac{W^4}{p^4} \frac{N^2}{q^2} \log W)$ time on a 3D $p^4 \times q \times q / \log W$ AROB, where $W \le q \le N$.

On the other hand, a constant time algorithm for vector median filtering can also be derived. By slightly increasing the number of processors used, algorithm VMFA can be easily modified to run faster. The time complexity of the proposed algorithm is dominated by computing the sums of distances and finding the minimum, as shown in Steps 4.1 and 4.2 of algorithm VMFA, respectively. The summation of the $W \times W$ distances of each window can be computed in $O(1)$ time by Lemma 3, and the minimum of each window can be found by Lemma 2. Hence, the total time complexity of the modified algorithm can be reduced from $O(\log W)$ to $O(1)$ by increasing the number of processors from $W^4 N^2$ to $W^4 N^2 \log N$. This leads to the following theorem:

Theorem 4. The vector median filtering can be solved in $O(1)$ time on a 4D $W^4 \times \log N \times N \times N$ AROB, where $W < N$.
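As a quick check of the cost claims, the running time in each of Theorems 2 to 4 can be multiplied by the corresponding processor count. The products below drop constant factors and are meant as a sanity check against the sequential bound $O(W^4 N^2)$, not as part of the proofs.

```latex
\begin{align*}
\text{Theorem 2:}\quad & \frac{W^4}{p^4}\log W \cdot \frac{p^4 N^2}{\log W} = W^4 N^2,\\
\text{Theorem 3:}\quad & \frac{W^4}{p^4}\,\frac{N^2}{q^2}\log W \cdot \frac{p^4 q^2}{\log W} = W^4 N^2,\\
\text{Theorem 4:}\quad & 1 \cdot W^4 N^2 \log N = W^4 N^2 \log N.
\end{align*}
```

The first two products equal the sequential complexity, which is what cost optimality requires. The third exceeds it by a factor of $\log N$, so the constant time algorithm is time optimal rather than cost optimal.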

5 CONCLUDING REMARKS

Vector median filters require a large amount of computation and global propagation of image pixels. In this paper, three fast and efficient algorithms for vector median filtering are proposed. Our algorithms run in $O(W^4 \log W / p^4)$, $O(\frac{W^4}{p^4} \frac{N^2}{q^2} \log W)$, and $O(1)$ time using $p^4 N^2 / \log W$, $p^4 q^2 / \log W$, and $W^4 N^2 \log N$ processors, respectively. To the best of our knowledge, this is the first constant time vector median filtering algorithm developed on any parallel computational model. Compared with the previous results shown in Table 1, ours are more scalable and achieve time or cost optimality.

If the system size $R$ is less than the problem size $I = W^4 N^2$, the time complexity may be represented as $O(T(I)/R + T_o(I, R))$, where $T(I)$ is the time complexity of the sequential algorithm and $T_o(I, R)$ is the overall overhead of a parallel implementation. A parallel implementation is scalable in the range $1 \ldots P$ if linear speedup can be achieved for all $1 \le R \le P$. A highly scalable parallel implementation means that the sequential algorithm can be highly parallelized and the overhead in parallelization is small (i.e., $R$ is as large as $T(I) / (T^*(I) (\log I)^k)$, where $k \ge 0$ is a constant and $T^*(I)$ is the best possible parallel time [13]). According to this definition, we have obtained a highly scalable parallelization of the vector median filtering algorithm on the AROB model when $k = 1$. Furthermore, a fully scalable parallel implementation means that the sequential algorithm can be fully parallelized and the overhead in parallelization is negligible [13]. According to this definition, we have obtained a fully scalable parallelization of the vector median filtering algorithm on the AROB model when the system size $R \le W^4 N^2 / \log W$.

In one cycle, the number of messages that can be transmitted on a pipelined optical bus is larger than the number that can be transmitted on an electrical bus. Optical transmission can therefore substantially reduce the data transmission time between processors. The transmission time of a data item between processors is determined by the size of the data item and the bus capacity. Parallel image processing jobs usually require a large amount of computation and communication. Due to its high communication bandwidth, its bus reconfigurability as a computation tool, and the versatile communication patterns it supports, the AROB is useful for solving image problems, and its computational power is superior to that of other existing reconfigurable networks, such as the PARBS.
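The linear-speedup condition can be illustrated numerically with the bounds of Theorem 3. This is a rough model only: constant factors are dropped, the logarithm base (2) is an assumption, and the function names are hypothetical.

```python
import math


def parallel_time(W, N, p, q):
    """Theorem 3 running time, constants dropped: (W^4/p^4)(N^2/q^2) log W."""
    return (W**4 / p**4) * (N**2 / q**2) * math.log2(W)


def system_size(W, p, q):
    """Processor count R = p^4 q^2 / log W used in Theorem 3."""
    return p**4 * q**2 / math.log2(W)


def speedup(W, N, p, q):
    """Sequential work W^4 N^2 divided by the modelled parallel time."""
    return (W**4 * N**2) / parallel_time(W, N, p, q)


if __name__ == "__main__":
    W, N = 9, 81
    for p, q in [(1, 9), (3, 27), (9, 81)]:
        print(p, q, system_size(W, p, q), speedup(W, N, p, q))
```

Under this model the speedup equals the system size $R$ for every admissible pair $(p, q)$, which is the sense in which the overhead term $T_o(I, R)$ does not dominate.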

ACKNOWLEDGMENTS

The authors are grateful to the four anonymous referees whose constructive comments helped to improve the presentation of the paper. This work was partially supported by the National Science Council under contract number NSC-89-2213-E011-007.

REFERENCES

[1] S.G. Akl, Parallel Computation: Models and Methods. Prentice Hall, 1997.
[2] G. Angelopoulos and I. Pitas, "Two-Dimensional Vector Median Filters on Mesh Connected Computers," Proc. Int'l Conf. Image Processing, pp. 650-653, 1994.
[3] J. Astola, P. Haavisto, and Y. Neuvo, "Vector Median Filters," Proc. IEEE, vol. 78, no. 4, pp. 678-689, 1990.
[4] M. Barni, V. Cappellini, and A. Mecocci, "The Use of Different Metrics in Vector Median Filtering: Application to Fine Arts and Paintings," Proc. Sixth Int'l Conf. Signal Processing, Theories, and Applications, pp. 1485-1488, 1992.
[5] M. Barni, V. Cappellini, and A. Mecocci, "Fast Vector Median Filter Based on Euclidean Norm Approximation," IEEE Signal Processing Letters, vol. 1, pp. 92-94, 1994.
[6] M. Barni, F. Bartolini, and V. Cappellini, "Optimum Linear Approximation of the Euclidean Norm to Speed up Vector Median Filtering," Proc. Int'l Conf. Image Processing, vol. 1, pp. 362-365, 1995.
[7] M. Barni, "A Fast Algorithm for 1-norm Vector Median Filter," IEEE Trans. Image Processing, vol. 6, no. 10, pp. 583-586, 1997.
[8] K.E. Batcher, "Design of a Massively Parallel Processor," IEEE Trans. Computers, vol. 29, pp. 836-840, 1980.
[9] Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster, "The Power of Reconfiguration," J. Parallel and Distributed Computing, vol. 13, no. 2, pp. 139-153, 1991.
[10] V. Caselles, G. Sapiro, and D.H. Chung, "Vector Median Filters, Morphology, and PDE's: Theoretical Conceptions," Proc. Int'l Conf. Image Processing, vol. 1, pp. 177-181, 1999.
[11] Z. Guo, R.G. Melhem, R.W. Hall, D.M. Chiarulli, and S.P. Levitan, "Pipelined Communications in Optically Interconnected Arrays," J. Parallel and Distributed Computing, vol. 12, no. 3, pp. 269-282, 1991.
[12] S.P. Levitan, D.M. Chiarulli, and R.G. Melhem, "Coincident Pulse Technique for Multiprocessor Interconnection Structures," Applied Optics, vol. 29, no. 14, pp. 2024-2033, 1990.
[13] K. Li, Y. Pan, and S.Q. Zheng, "Fast and Processor Efficient Parallel Matrix Multiplication Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 8, pp. 705-720, Aug. 1998.
[14] L. Lucat and P. Siohan, "Vector-Median Type Filters and Fast Computation Algorithms," Proc. Int'l Symp. Circuits and Systems, pp. 2469-2472, 1997.
[15] R.G. Melhem, D.M. Chiarulli, and S.P. Levitan, "Space Multiplexing of Waveguides in Optically Interconnected Multiprocessor Systems," Computer J., vol. 32, no. 4, pp. 362-369, 1989.
[16] R. Miller, V.K.P. Kumar, D. Reisis, and Q.F. Stout, "Image Computations on Reconfigurable VLSI Arrays," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 925-930, 1988.
[17] Y. Pan and M. Hamdi, "Quicksort on a Linear Array with a Reconfigurable Pipelined Bus System," Proc. Int'l Symp. Parallel Architectures, Algorithms, and Networks, pp. 313-319, 1996.
[18] Y. Pan and K. Li, "Linear Array with a Reconfigurable Pipelined Bus System - Concepts and Applications," Information Sciences, vol. 106, nos. 3/4, pp. 237-258, 1998.
[19] S. Pavel and S.G. Akl, "On the Power of Arrays with Reconfigurable Optical Bus," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 1443-1454, 1996.
[20] S. Pavel and S.G. Akl, "Matrix Operations Using Arrays with Reconfigurable Optical Buses," Parallel Algorithms and Applications, vol. 8, pp. 223-242, 1996.
[21] S. Pavel and S.G. Akl, "Integer Sorting and Routing in Arrays with Reconfigurable Optical Buses," Int'l J. Foundations of Computer Science, vol. 9, no. 1, pp. 99-120, 1998.
[22] C. Qiao and R.G. Melhem, "Time-Division Communications in Multiprocessor Arrays," IEEE Trans. Computers, vol. 42, no. 5, pp. 577-590, May 1993.
[23] S. Rajasekaran and S. Sahni, "Sorting, Selection, and Routing on the Array with Reconfigurable Optical Buses," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 11, pp. 1133-1142, Nov. 1997.
[24] B.F. Wang and G.H. Chen, "Constant Time Algorithms for Transitive Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus Systems," IEEE Trans. Parallel and Distributed Systems, vol. 1, pp. 500-507, 1990.
[25] C.H. Wu, S.J. Horng, and H.R. Tsai, "Efficient Parallel Algorithms for Hierarchical Clustering on Arrays with Reconfigurable Optical Buses," J. Parallel and Distributed Computing, vol. 60, no. 9, pp. 1137-1153, 2000.
[26] J. Zheng, K.P. Valavanis, and J.M. Gauch, "Noise Removal from Color Images," J. Intelligent Robotic Systems, vol. 7, pp. 257-285, 1993.


Chin-Hsiung Wu received the BS degree from the Chinese Naval Academy, the MS degree in information management from the National Defence Management College, and the PhD degree in electrical engineering from the National Taiwan University of Science and Technology, Republic of China, in 1986, 1991, and 2000, respectively. He is an associate professor in the Department of Information Management, Chinese Naval Academy. His research interests include image processing, computer vision, and parallel and distributed computing.

Shi-Jinn Horng received the BS degree in electronics engineering from the National Taiwan Institute of Technology, the MS degree in information engineering from the National Central University, and the PhD degree in computer science from the National Tsing Hua University in 1980, 1984, and 1989, respectively. Currently, he is a full professor in the Department of Electrical Engineering, the National Taiwan University of Science and Technology. His research interests include VLSI design, multiprocessing systems, and parallel algorithms.

