IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 14, NO. 10, OCTOBER 2003
Fast and Scalable Selection Algorithms with Applications to Median Filtering

Chin-Hsiung Wu and Shi-Jinn Horng

Abstract—The main contributions of this paper are in designing fast and scalable parallel algorithms for selection and median filtering. Based on the radix-$\omega$ representation of data and the prune-and-search approach, we first design a fast and scalable selection algorithm on the arrays with reconfigurable optical buses (AROB). To the authors' knowledge, this is the most time-efficient algorithm yet published, especially compared to the algorithms proposed by Han et al. [8] and Pan [16]. Then, given an $N \times N$ image and a $W \times W$ window, based on the proposed selection algorithm, several scalable median filtering algorithms are developed on the AROB model with various numbers of processors. In the sense of the product of time and the number of processors used, most of the proposed algorithms are time or cost optimal.

Index Terms—Median filter, scalable selection algorithm, parallel algorithm, image processing, reconfigurable optical bus system.
1 INTRODUCTION

GIVEN a sequence Q of N numbers $a_0, a_1, \ldots, a_{N-1}$ and $i \le N$, the problem of selection is to find the ith smallest data element from these N given numbers. For the selection problem, some algorithms have been derived in parallel computation models with optical buses. Pan [16] proposed a selection algorithm on the linear array with a reconfigurable pipelined bus system (LARPBS). His algorithm runs in $O(\frac{N}{p}\log N)$ expected time and in $O(N)$ worst-case time using p processors. Han et al. [8] proposed an $O((\log\log N)^2/\log\log\log N)$ time algorithm on the LARPBS model using N processors. Rajasekaran and Sahni [22] proposed an $O(1)$ time sorting algorithm on the array with reconfigurable optical buses (AROB) using $N \times N^{\epsilon}$ processors for any constant $\epsilon > 0$. They also proposed an $\tilde{O}(1)$ time randomized selection algorithm on the 2D AROB that succeeds with high probability [22]. Although their sorting algorithm can solve the selection problem, their algorithm is complicated and hard to scale. Their sorting algorithm implements the column sort [10] recursively on an $N \times N^{\epsilon}$ AROB, for $\epsilon > 0$. The column sort is based on the idea that sorting N data elements can be decomposed into sorting s groups, where each group consists of r data elements. This limits the algorithm's implementation on existing architectures. The sorting algorithm contains eight steps: four steps for sorting data items on each column, and the other four steps for permuting data items.
C.-H. Wu is with the Department of Information Management, Chinese Naval Academy, 669 Chun-Hsiao Road, Kaohsiung, Taiwan, Republic of China. E-mail: [email protected].
S.-J. Horng is with the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, Republic of China. E-mail: [email protected].
Manuscript received 19 June 2001; revised 22 Jan. 2003; accepted 2 May 2003.
If the number of processors is less than $N^{2/3} \times N$, their algorithm becomes recursive and incurs additional inner iterations.

The median filter is a very important nonlinear filter used in image processing to eliminate noise while preserving details. Given a two-dimensional (2D) image $A = a_{i,j}$, $0 \le i, j < N$, the median filtering operation replaces the intensity of each pixel of image A by the median of the intensities in a neighborhood of that pixel. Usually, the neighborhood of a specified pixel is established using a $W \times W$ window centered at that pixel, where W is the width of the filter and $W = 2w + 1$ is an odd number. In practice, $W < 10$, since a larger filter might introduce extraneous pixels into the range of the offending noise: larger windows remove impulses of larger width, but also remove detailed lines [9]. When pixel $(i, j)$ is near the boundaries of A, the window is wrapped around the appropriate boundaries. In median filtering, a straightforward serial algorithm takes $O(N^2W^2)$ time. However, calculating the median of a window is an inherently slow operation. Several techniques have been developed to improve the speed of median filtering, such as sliding window (running median) methods [2], histogram methods [6], and separable median filters [15]. The running median/histogram methods reuse partial results, whereas the other approaches recalculate the median for each window. Comparisons of various median filtering methods are given in [2]. In many applications, especially in real-time image processing such as military and industrial applications, the speed of computation is very important. Because a computation-intensive task such as 2D median filtering requires much processing time, designing parallel algorithms is the only way to obtain a real-time response. Some parallel algorithms have been proposed for computing the median filtering of images [23], [24], [25]. For an $N \times N$ image and a $W \times W$ window, Stout [24] proposed an $O(W)$ time pyramid algorithm for
median filtering using $N^2$ processors. His algorithm is complex and needs multiple sort steps. Ranka and Sahni [23] proposed two CREW PRAM algorithms for separable median filtering. Separable median filtering is potentially quick but inexact, since the medians it uses are local, not global. Their algorithms run in $O(\log^2 W \log\log W)$ time and $O(\log^2 W)$ time using $O(N^2/(\log W \log\log W))$ and $O(N^2\log^2 W)$ processors, respectively. Tanimoto [25] proposed two mesh algorithms for median filtering based on internal scanning. His algorithms run either in $O(W^2)$ time if the pixel intensities are unbounded, or in $O(W)$ time if the pixel intensities are bounded (constant) and relatively small, using $N \times N$ processors. In this paper, the proposed median filtering algorithm is scalable, so we need a simple and scalable selection method to find the medians. This motivates us to look for an alternative approach for designing an efficient selection algorithm. The median filtering algorithm proposed by Ataman et al. [2] is based on the radix-2 selection algorithm. Our algorithms are based on an efficient selection algorithm using a higher-radix number representation and the reconfigurable optical bus technique, since a higher-radix number representation allows arithmetic operations to be better parallelized. Mesh-Connected Computers (MCCs) are useful for median filtering [25] because of their simplicity and regularity in structure. Unfortunately, there are two drawbacks of the MCC: its fixed architecture and its long communication diameter. These two drawbacks can be overcome by equipping it with various types of bus systems. Recently, reconfigurable meshes have received much attention from researchers because they can reduce the drawbacks of the MCC [3], [27]. However, exclusive access to the bus resources limits the throughput of end-to-end communication. Optical interconnections may provide an ultimate solution to this problem [7], [11], [12], [17], [20], [21]. The array with a reconfigurable optical bus system is defined as an array of processors connected to a reconfigurable optical bus system whose configuration can be dynamically changed by setting up the local switches of each processor. Due to the unidirectional signal propagation and the predictable delay of the signal per unit length, optical buses enable synchronized concurrent access in a pipelined fashion. More recently, two related models have been proposed, namely, the array with reconfigurable optical buses [18] and the linear array with a reconfigurable pipelined bus system [17]. A major difference between them is that the processors are able to count the optical pulses during a bus cycle in the AROB model, but this is not permitted in the LARPBS model. Many algorithms have been proposed for these two related models [8], [12], [19], [20], [22], [28]. This indicates that arrays with a reconfigurable optical bus system are very efficient for parallel computation, due to the high bandwidth and flexibility of a reconfigurable optical bus system. The contributions of this paper are in designing fast and scalable algorithms for computing the selection and median filtering problems. Instead of using the doubly logarithmic technique proposed by Han et al. [8], our selection algorithm is based on both the radix-$\omega$ technique [14] and the prune-and-search approach [5].
Fig. 1. An LAPPB of size 5.
Compared to Han et al. [8] and Pan [16], our approach reduces the time complexity from $O((\log\log N)^2/\log\log\log N)$ to $O(1)$, where N is the number of data elements. In this paper, the proposed algorithms are designed for the restricted AROB, which is equivalent to the LARPBS model. That is, our algorithms can be implemented on the LARPBS model with the same bound. Overall, we first develop constant time basic operations for selection, and for rotating and inverting a data matrix, on the AROB model. Then, using these operations, we design efficient parallel algorithms for the median filtering problem on the AROB model with various numbers of processors. One algorithm runs in constant time either using $W^2N^2$ processors if the intensity of the image is bounded (i.e., a constant), or using $W^2N^{2+\frac{1}{c}}$ processors for some fixed $c \ge 1$ if the intensity of the image is unbounded or an $O(\log N)$-bit integer. The second algorithm runs in $O(\frac{W^2}{p^2}(\log\log N)^2/\log\log\log N)$ time using $p^2N^2$ processors, where $p \le W$. The last algorithm runs in $O(\frac{W^2}{p^2}\frac{N^2}{q^2})$ time either using $p^2q^2$ processors if the intensity of the image is bounded, or using $p^2q^2N^{1/c}$ processors for unbounded intensities, where $q \le N$. The remainder of this paper is organized as follows: We give a brief introduction to the AROB computational model in Section 2. Section 3 describes the selection algorithms and some data manipulation operations which will be used in the parallel median filtering algorithms. In Section 4, we develop the efficient and scalable algorithms for median filtering. Finally, some concluding remarks are included in the last section.
2 THE COMPUTATIONAL MODEL AND BASIC NOTATIONS
A linear processor array with pipelined buses (LAPPB) [7] of size N contains N processors connected to the optical bus with two couplers. One is used to write data on the upper (transmitting) segment of the bus and the other is used to read data from the lower (receiving) segment of the bus. The optical bus has three waveguides: the message waveguide is used for sending data, and the selection and reference waveguides are used for sending address information. An example of an LAPPB of size 5 is shown in Fig. 1.
Fig. 2. (a) An LAROB of size N. (b) The switch states. (c) An example of bus reconfiguration.
In order to ensure that all processors in Fig. 1 can write their messages on the bus simultaneously without collision, and to determine which slot to use, each processor contains a counter used for the time-waiting function, and the following collision-free condition must be satisfied: $d_o > b \cdot w \cdot c_g$, where $d_o$ is the optical distance between two adjacent processors (as shown in Fig. 1), b is the maximum number of binary bits in each message, w is the width of an optical pulse used to represent one bit in each message (in seconds), and $c_g$ is the velocity of light in the waveguide. Thus, in the same cycle time, the pipelined optical bus can transmit up to N times more messages than an electrical bus of the same length, where N is the number of processors in the array.
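As a quick illustration of this condition, the following sketch plugs in illustrative values; the pulse width, waveguide velocity, and message size below are assumptions made for this example only, not parameters taken from the text.

# Illustrative check of the collision-free condition d_o > b * w * c_g.
# All numeric values here are hypothetical and only indicate the order of magnitude.
b = 32          # assumed maximum number of bits per message
w = 10e-12      # assumed optical pulse width in seconds
c_g = 2.0e8     # assumed light velocity in the waveguide, in m/s

min_spacing = b * w * c_g          # spatial length of one message on the bus
print(f"d_o must exceed {min_spacing * 100:.1f} cm")   # about 6.4 cm for these values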
The AROB model is essentially a mesh using the basic structure of a classical reconfigurable network (RN) [3] and optical buses. The linear AROB (LAROB) extends the capabilities of the LAPPB by permitting each processor to connect to the bus through a pair of switches. Each processor with a local memory is identified by a unique index denoted as $P_i$, $0 \le i < N$, and each switch can be set to either cross or straight by the local processor. The optical switches are used for reconfiguration. When all switches are set to straight, the bus system operates as a regular optical bus. When both switches of processor $P_i$, $1 \le i < N - 1$, are set to cross, the LAROB is partitioned into two independent subbuses from processor $P_i$, where each of them forms an LAROB. That is, one consists of processors $P_0, P_1, \ldots, P_{i-1}$, and the other consists of processors $P_i, P_{i+1}, \ldots, P_{N-1}$, whose leader processor is $P_i$ (as shown in Fig. 2c). Each processor uses a set of control registers to store the information needed to control the transmission and reception of messages by that processor. An example of an LAROB of size N is shown in Fig. 2a. Two interesting switch configurations derivable from a processor of an LAROB are also shown in Fig. 2b. A petit cycle ($\tau$) is defined as the time needed for a pulse to traverse the optical distance between two consecutive processors on the bus. A bus cycle is defined as the end-to-end propagation delay of the messages on the optical bus, i.e., the time needed to traverse the entire optical bus. Then, the bus cycle is $2N\tau$, where N is the number of processors in the array. A unit delay is defined to be the spatial length of a single optical pulse, shown as a loop in Fig. 2a, which may introduce a time-slot delay between two processors on the receiving segment. A 2D AROB of size $M \times N$, denoted as a 2D $M \times N$ AROB, contains $M \times N$ processors arranged in a 2D grid. Each processor is identified by a unique 2-tuple index $(i, j)$, $0 \le i < M$, $0 \le j < N$. The processor with index $(i, j)$ is denoted by $P_{i,j}$. Let $P_{i,*}$ denote the ith row, where $*$ represents all indexes along the ith row. For example, $P_{1,*}$ is equivalent to $P_{1,j}$, $0 \le j < N$. Each processor has four I/O ports, denoted by E, W, S, and N, to be connected with a reconfigurable optical bus system. The interconnection among the four ports of a processor can be reconfigured during the execution of the algorithms.
Fig. 3. (a) A 4 × 4 AROB. (b) The allowed switch configurations.
Thus, multiple arbitrary linear arrays like the LAROB can be specified in a 2D AROB. The two terminal processors located at the end points of a constructed LAROB may serve as the leader processors (similar to $P_0$ in Fig. 2a). The relative position of any processor on a bus to which it is connected is its distance from the leader processor. For more details on the AROB, see [18]. An example of a 2D $4 \times 4$ AROB and the ten allowed switch configurations is shown in Fig. 3. The extended AROB extends the capabilities of the AROB by allowing switch settings to change during a bus cycle, triggered by the detection of a pulse [20]. Several approaches can be applied to route messages from one processor to another in an optical bus system: the time-waiting function [7], the time-division multiplexing scheme [21], and the coincident pulse technique [11], [21]. It is assumed in [18] that, during a read/write phase, each memory location knows that it will be accessed by certain processors. It is shown that these approaches for data routing can be implemented in constant time [18]. Arbitrary permutations can be performed using these methods. The coincident pulse technique is quite different from time waiting and time-division source or destination multiplexing [13]. The LARPBS model only uses the coincident pulse technique for addressing [13]. In this paper, we also use this technique to route messages on the AROB model. In a unit of time, assume each processor can either perform arithmetic and logic operations, or communicate with others on a bus. Since the bus cycle length can be considered to be $O(1)$, we assume that it is compatible with the computation time of any arithmetic or logic operation. Multiple processors are allowed to broadcast data on different buses, or to broadcast the same data on the same bus, simultaneously in a time unit if there is no collision. Let var(i) or arr(i)[ ] denote a local variable var or an array arr[ ] in the processor with index i. For example, sum(0) and data(0)[ ] are a local variable sum and an array data[ ] of processor $P_0$, respectively.
3 BASIC DATA MANIPULATION OPERATIONS

In this section, we will develop several basic operations on the AROB model. These basic operations will be used to develop efficient median filtering algorithms in the next section. Conceptually, a 2D $pN \times pN$ AROB may be partitioned into $p^2$ subAROBs, denoted by $SA_{x,y}$, $0 \le x, y < p$, each of size $N \times N$; each $SA_{x,y}$ consists of $(\frac{N}{W})^2$ blocks, each of size $W \times W$, denoted by $B_{u,v}$, $0 \le u, v < \frac{N}{W}$; and each block of size $W \times W$ consists of $(\frac{W}{p})^2$ subblocks, each of size $p \times p$, denoted by $SB_{s,t}$, $0 \le s, t < \frac{W}{p}$. Without loss of generality, assume that p is a factor of W, W is a factor of N, and $1 \le p \le W < N$. In order to make the algorithm presentation more comprehensible, previous results proposed on the AROB are summarized in the following:
Lemma 1 [17], [19]. Given N Boolean data, each either 0 or 1, the binary prefix sums of these N Boolean data can be computed in $O(1)$ time on an N LAROB.

Lemma 2 [28]. Given N integers, each of size $O(\log N)$ bits, the prefix sums of these N integers can be computed in $O(1)$ time on an $N^{1+\frac{1}{c}}$ LAROB for some fixed $c \ge 1$.
3.1 Selection

In this section, we will present a parallel algorithm for the selection problem in a sequence Q of N numbers, each with n bits. If n is a constant, then the selection problem can be easily solved using binary prefix sums from the most significant bit down to the least significant bit on an AROB of size N. When $i = 1$, $\lceil N/2 \rceil$, or N, the ith smallest data element is called the minimum, median, and maximum, respectively. If there are nondistinct elements in Q, the selection problem of finding the ith smallest data element will be called "finding the ith rank of an element of Q." Note that, if $a_i = a_j$, then $a_i$ precedes $a_j$ if and only if $i < j$. Instead of using a complicated sorting method, we will design a simple and scalable selection algorithm based on the radix-$\omega$ technique and the prune-and-search approach. The proposed algorithm runs in $O(N/q)$ time on a $q \times N^{\frac{1}{c}}$ AROB, where $q \le N$, c is a constant, and $c \ge 1$. Although the same bound can be achieved, our selection algorithm is simpler and more scalable than that of [22]. To the authors' knowledge, this is the most time-efficient algorithm yet published, especially compared to [8], [16] on the LARPBS.

Assume that the data size of each element is n bits. We first represent the N numbers in the radix-$\omega$ representation. Since $a_j < 2^n$ and $0 \le j < N$, it is originally represented in the radix-2 representation as
$$a_j = \sum_{k=0}^{n-1} b_{j,k} 2^k, \quad (1)$$
where $b_{j,k} \in \{0, 1\}$, $0 \le a_j < 2^n$, and $0 \le j < N$. Instead of using the radix-2 representation, $a_j$ can be represented in the radix-$\omega$ representation as
$$a_j = \sum_{k=0}^{T-1} m_{j,k} \omega^k, \quad (2)$$
where $T = \lfloor \log_\omega 2^n \rfloor + 1$, $0 \le m_{j,k} < \omega$, $0 \le a_j < 2^n$, and $0 \le j < N$.
Fig. 4. An illustration of algorithm SELECTION for $N = 8$, $i = 4$, and $\omega = 3$. (a) Initialization. (b) After Step 1. (c) b(j, l) for k = 1, after Step 3.2. (d) s(N − 1, l) for k = 1, after Step 3.3. For example, $a_0$, $a_3$, and $a_4$ belong to $Q_0$. (e) ps(N − 1, l) for k = 1, after Step 3.4. (f) b(j, l) for k = 0, after Step 3.2. (g) s(N − 1, l) for k = 0, after Step 3.3. (h) ps(N − 1, l) for k = 0, after Step 3.4. (i) ise = 3 for k = 0, after Step 3.6. (j) The final result.
The coefficient $m_{j,k}$ can be computed recursively for each $a_j$ as follows:
$$m_{j,0} = a_j \bmod \omega = a_j - \omega r_0, \quad 0 \le r_0 < a_j,$$
$$m_{j,1} = r_0 \bmod \omega = r_0 - \omega r_1, \quad 0 \le r_1 < r_0,$$
$$\vdots$$
$$m_{j,T-1} = r_{T-2} \bmod \omega = r_{T-2} - \omega r_{T-1}, \quad 0 \le r_{T-1} < r_{T-2}. \quad (3)$$
Since $a_j > r_0 > r_1 > \cdots > r_{T-1}$, we obtain $r_{T-1} = 0$. That is, with this approach, each digit has a value ranging from 0 to $\omega - 1$, and an $\omega$-ary representation $m_{T-1} \cdots m_2 m_1 m_0$ is equal to $m_0\omega^0 + m_1\omega^1 + m_2\omega^2 + \cdots + m_{T-1}\omega^{T-1}$, where the most significant digit $m_{T-1}$ is assumed to be the first digit of this $\omega$-ary representation. We can also compute (2) (i.e., the binary-to-residue conversion) in $O(1)$ time using table-lookup schemes [1] or bus reconfigurability schemes [4].

Then, we apply the prune-and-search technique to the $m_{j,k}$, where $0 \le j < N$ and $0 \le k < T$, to find the ith smallest element of these N numbers. That is, we use the kth digit of $a_j$, $0 \le j < N$, to divide the sequence Q into $\omega$ subsequences $Q_0, Q_1, \ldots, Q_{\omega-1}$ during the kth iteration, so that all elements of $Q_l$ are less than those of $Q_{l+1}$, $0 \le l < \omega - 1$. For each subsequence $Q_l$, we calculate its size $s_l$ and the accumulated size $ps_l$ from $Q_0$ up to $Q_l$. If $ps_{l-1} < i \le ps_l$, then the ith smallest data element is located in $Q_l$, $0 \le l < \omega$.
Then, we search only $Q_l$ and prune the other subsequences in the next iteration. To specify the data elements whose first k digits are the same as those of the ith smallest data element, a flag f is assigned to each data element. If $f = 1$ after iteration k, then the first k digits of the associated data element are the same as those of the ith smallest data element, and it remains enabled in the next search; otherwise, it is disabled. Repeating this process from the most significant digit to the least significant digit, the ith smallest data element in Q can be obtained. If the algorithm has not terminated after the last iteration, we have the nondistinct case. Then, we use the index of each data element of Q to identify the ith smallest data element in the remaining subsequence. Initially, we assume that $a_j$, $0 \le j < N$, and i are located in the local variables a(j, 0) and ith(0, 0) of processors $P_{j,0}$ and $P_{0,0}$, respectively. Finally, the ith smallest data element ise and its index id are stored in processor $P_{0,0}$. Assume $N = 8$ and $\omega = 3$; Fig. 4 shows a snapshot of algorithm SELECTION selecting the fourth smallest data element ($i = 4$) from the data sequence 2, 6, 3, 0, 1, 7, 4, 5. By applying the pipelining ability and reconfigurability of the optical buses, the detailed selection algorithm is given below.
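Before the parallel listing, the following sequential sketch (an illustrative Python rendering written for this presentation, not the AROB algorithm itself) traces the same radix-$\omega$ prune-and-search on the Fig. 4 example; the parallel algorithm replaces its counting and prefix-sum steps with Lemmas 1 and 2.

def select_radix_omega(a, i, omega):
    """Return the i-th smallest element (1-based) of a by radix-omega prune-and-search."""
    n = max(max(a).bit_length(), 1)          # data size in bits
    T = 1
    while omega ** T < 2 ** n:               # enough radix-omega digits to cover a_j < 2^n
        T += 1
    alive = list(range(len(a)))              # indices whose flag f is still 1
    for k in range(T - 1, -1, -1):           # most significant digit down to the least
        digit = {j: (a[j] // omega ** k) % omega for j in alive}
        for l in range(omega):               # subsequences Q_0, ..., Q_{omega-1}
            s_l = sum(1 for j in alive if digit[j] == l)
            if i <= s_l:                     # ps_{l-1} < i <= ps_l: the answer lies in Q_l
                alive = [j for j in alive if digit[j] == l]
                break
            i -= s_l                         # move past Q_l and keep the relative rank
    return a[alive[i - 1]]                   # nondistinct case: break ties by index order

# Fig. 4 example: the 4th smallest of 2, 6, 3, 0, 1, 7, 4, 5 with omega = 3 is 3.
print(select_radix_omega([2, 6, 3, 0, 1, 7, 4, 5], 4, 3))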
Algorithm SELECTION (Q, i, ise, id);
/* Q and i are input variables; ise and id are output variables. Let f(j, *), located in the processors $P_{j,*}$, be a flag associated with the data element $a_j$. We use it to identify whether the data element $a_j$ has been pruned or not; that is, f(j, *) = 1 if $a_j$ is not pruned, and f(j, *) = 0 otherwise. Initially, set f(*, *) = 1. */
1: Processor $P_{j,0}$, $0 \le j < N$, copies a(j, 0) (i.e., $a_j$) to processors $P_{j,l}$, $0 \le l < \omega$.
2: Processor $P_{0,0}$ copies ith(0, 0) (i.e., i) to processors $P_{N-1,*}$.
3: Repeat Steps 3.1-3.6 from $k = T - 1$ down to 0.
3.1: Processor $P_{j,*}$ with f(j, *) = 1, $0 \le j < N$, computes m(j, *)[k] (i.e., $m_{j,k}$ in (2)) from a(j, *) by using (3).
3.2: Processor $P_{j,l}$ with f(j, l) = 1, $0 \le j < N$, $0 \le l < \omega$, sets b(j, l) = 1 if l = m(j, l)[k]; b(j, l) = 0, otherwise.
3.3: // Divide Q into $\omega$ subsequences $Q_0, Q_1, \ldots, Q_{\omega-1}$ by using the kth digit, where $Q = Q_0 \cup Q_1 \cup \cdots \cup Q_{\omega-1}$. //
Compute $s(N-1, l) = \sum_{j=0}^{N-1} b(j, l)$, $0 \le l < \omega$, applying Lemma 1 in parallel, so that $a_j$ is in $Q_l$ if b(j, l) = 1. Then, use $s_l$ to divide Q into $\omega$ subsequences. The size of $Q_l$, $0 \le l < \omega$, is $s_l$.
3.4: // Identify the rank of each subsequence. //
Compute $ps(N-1, l) = \sum_{c=0}^{l} s(N-1, c)$, $0 \le l < \omega$, applying the integer prefix sums algorithm (Lemma 2).
3.5: // Prune the other subsequences and search only in $Q_l$ in the next iteration. //
If $ps(N-1, l-1) < ith(N-1, l) \le ps(N-1, l)$, then the ith smallest data element is located in $Q_l$. Set $ith(N-1, l) = ith(N-1, l) - ps(N-1, l-1)$ and f(j, t) = 0 if $m(j, t)[k] \ne l$, $0 \le j < N$, $0 \le t, l < \omega$.
3.6: If $ps(N-1, l) = ith(N-1, l)$ and $s_l = 1$, then the processor $P_{j,0}$ with f(j, 0) = 1 sets ise = a(j, 0). That is, the ith smallest data element is the only element in $Q_l$.
4: // For the nondistinct case, we use the rank of each data element in the remaining sequence to find the ith smallest data element. //
Processor $P_{j,0}$ with f(j, 0) = 1, $0 \le j < N$, computes the binary prefix sums $bp(j, 0) = \sum_{c=0}^{j} f(c, 0)$, applying Lemma 1. Let $i'$ be the value stored in the local variable ith(N − 1, *). Then, the rank of each data element in the remaining sequence is found, and the ith smallest data element is the element whose rank is $i'$ (i.e., bp(j, 0) = ith(N − 1, 0), where ith(N − 1, 0) = $i'$).
5: Copy the ith smallest data element and its column index to ise(0, 0) and id(0, 0), respectively.

Lemma 3. Given N numbers, each of size n bits, the selection problem of these N numbers can be solved in $O(T)$ time on a 2D $\omega \times N$ AROB, where $2 \le \omega < N$ and $T = \lfloor \log_\omega 2^n \rfloor + 1$.

Proof. We have N data elements and each is allocated to a processor. There are at most T digits in each data element under the radix-$\omega$ representation. We compare all data elements from the most significant digit to the least significant digit in Step 3. During the tth iteration (i.e., $k = T - t$) of Step 3, only the subsequence containing the
ith smallest data element will survive and the others will be pruned. That is, at most $\frac{N}{\omega^t}$ data elements survive after the tth iteration of Step 3 if all data elements are distinct. We use ith(N − 1, *) to record the relative rank of the ith smallest data element in the remaining sequence. The size of the remaining sequence at iteration t + 1 is no larger than that at iteration t. Finally, only the ith smallest data element survives if all data elements are distinct. For the nondistinct case, we use Step 4 to find the result. There are two cases for the sequence Q: either all data elements of Q are distinct, or they are not. We prove the distinct case only; the correctness of the nondistinct case then follows. Let the ith smallest data element be represented by the $\omega$-ary representation $m_{i,T-1} m_{i,T-2} \cdots m_{i,1} m_{i,0}$. We will show that only the ith smallest data element survives at the end of the algorithm.

Basis Step. During the first iteration, we divide Q into $\omega$ subsequences. The most significant digit (i.e., $m_{j,T-1}$) under the radix-$\omega$ representation of each data element of $Q_l$ is l, $0 \le l < \omega$, by Step 3. The ith smallest data element is located in $Q_l$ only when its most significant digit is l and $ps_{l-1} < i \le ps_l$, by Step 3.5. Therefore, it survives in $Q_l$ and the other subsequences are pruned. If the size of $Q_l$ is one, then the algorithm terminates.

Induction Step. Suppose that the statement is true at the tth iteration; that is, the ith smallest data element is located in $Q_l$, and its relative rank is greater than $ps_{l-1}$ and less than or equal to $ps_l$. We now wish to prove that the statement is also true at the (t + 1)th iteration. Let r be the relative rank of the ith smallest data element in the remaining sequence $Q_l$ after the tth iteration. This means that there are r data elements in $Q_l$ whose (t + 1)th digit is less than or equal to the digit $m_{i,t+1}$. During the (t + 1)th iteration, assume $ps_{l-1} < r \le ps_l$, but the ith smallest data element is not located in $Q_l$ after Step 3.5. This means that the digit $m_{i,t+1}$ is not equal to l. Suppose that it is less than l; the case when it is greater than l can be proved similarly. Then the ith smallest data element is located in one of the sequences $Q_u$, $0 \le u < l$. By assumption, $ps_{l-1} < r$; then the rank of each data element located in $Q_u$, $0 \le u < l$, is less than r. This leads to a contradiction. Thus, by the principle of induction, the statement is true at every iteration and the algorithm is correct.

The time complexity is analyzed as follows: Steps 1, 2, 4, and 5 each take $O(1)$ time. Step 3 is repeated at most T times, and each iteration takes $O(1)$ time. Hence, the total time complexity is $O(T)$.

Moreover, assume $n = O(\log N)$, $\omega = N^{1/c}$, and $N^{1/c} > \log N$, where c is a constant and $c \ge 1$. Then, $T = \lfloor \log_{N^{1/c}} 2^n \rfloor + 1 = \lfloor \log_{N^{1/c}} O(N) \rfloor + 1 = \lfloor O(c) \rfloor + 1$ is also a constant in Lemma 3. This leads to the following corollary.

Corollary 1. Given N numbers, each of size n bits, the selection problem of these N numbers can be solved in $O(1)$ time, either on an N LAROB if n is a constant, or on a 2D $N^{1/c} \times N$ AROB if $n = O(\log N)$, for some constant $c \ge 1$.
Fig. 5. Illustrations of rotation and inversion, where N = 8, W = 4, u = v = 1, and p = 2. (a) Rotation: a(1, 1) is moved to a′(N, N) for x = y = 1. (b) Inversion: d(N + W, N + W) is moved back to d′(W + p + 1, W + p + 1) for x = y = 1.
In reality, the number of processors available in the system is not always sufficient for applications. In many cases, the number of data elements is much larger than the size of the system. If the number of processors available in the system is $q \times N^{1/c}$, $1 \le q \le N$, then each processor contains an array with $O(N/q)$ data items to be processed. Therefore, Corollary 1 can be modified to run on an AROB using $q \times N^{1/c}$ processors, since it is based on the divide-and-conquer technique. Many basic operations, such as broadcast and binary summation, are scalable on the LARPBS model and can be run in $O(N/q)$ time [26]. These scalable operations can be implemented on the AROB model with the same bound, since the number of iterations remains the same and each iteration now takes $O(N/q)$ time. Thus, we can adapt our algorithm to obtain a scalable selection algorithm. This leads to the following lemma.

Lemma 4. Given N numbers, each of size n bits, the selection problem of these N numbers can be solved in $O(N/q)$ time, either on a q LAROB if n is a constant, or on a 2D $N^{1/c} \times q$ AROB if $n = O(\log N)$, for some constant $c \ge 1$.

Consequently, let $A = a_{i,j}$, $0 \le i, j < W < N$, be a data matrix of size $W \times W$ stored on an AROB. The selection problem for the data matrix A can be solved as follows: First, connect the row buses in a snake-like fashion. Then, apply Lemma 4 to select the ith smallest data element from these $W^2$ data. For single-mode fibers with a length of a few kilometers, the synchronization error is small enough to be tolerated. Hence, using current technology, a system using a few thousand processors will not present any problems. This leads to the following lemma.

Lemma 5. Given $W \times W$ numbers, each of size n bits, the selection problem of these $W^2$ numbers can be solved in $O(W^2/q^2)$ time, either on a 2D $q \times q$ AROB if n is a constant, or on a 3D $N^{1/c} \times q \times q$ AROB if $n = O(\log N)$ and $q^2 \le W^2 \le N$, for some constant $c \ge 1$.
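As an illustration of the data distribution behind Lemma 4, the sketch below (a sequential simulation with assumed parameter names, not AROB code) shows the digit counting that q processors, each holding $O(N/q)$ items, would perform in one $O(N/q)$-time iteration before the partial counts are combined by prefix sums.

def digit_counts(items, q, omega, k):
    """Simulate q processors counting the k-th radix-omega digit over their local slices."""
    chunk = (len(items) + q - 1) // q        # each processor holds O(N/q) items
    totals = [0] * omega
    for proc in range(q):                    # one pass per simulated processor
        for a in items[proc * chunk:(proc + 1) * chunk]:
            totals[(a // omega ** k) % omega] += 1
    return totals                            # s_0, ..., s_{omega-1} for digit position k

print(digit_counts([2, 6, 3, 0, 1, 7, 4, 5], q=4, omega=3, k=1))   # [3, 3, 2]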
3.2 Rotation and Inversion

Let $A = a_{i,j}$, $0 \le i, j < N$, be a data matrix of size $N \times N$ stored in the local variable a(i, j) of a 2D $pN \times pN$ AROB (i.e., in $SA_{0,0}$). Also, let x and y, $0 \le x, y < p$, be a pair of vertical and horizontal displacements relative to A, respectively. The rotation is defined as circularly shifting the data elements of A located on $SA_{0,0}$ up and left by x and y positions, respectively, and then storing them on $SA_{x,y}$. That is, for each pair of displacements x and y, $0 \le x, y < p$, move a(i, j) located in $SA_{0,0}$, $0 \le i, j < N$, to $a'(xN + (i - x) \bmod N,\; yN + (j - y) \bmod N)$ located in $SA_{x,y}$. For example, as shown in Fig. 5a, where $x = 1$ and $y = 1$, a(1, 1) of $SA_{0,0}$ is moved to a′(N, N) located in $SA_{1,1}$. Let D be a sparse data matrix of size $pN \times pN$, stored in the local variable $d(xN + uW, yN + vW)$ of the block $B_{u,v}$ of $SA_{x,y}$ of a 2D $pN \times pN$ AROB, $0 \le x, y < p$, $0 \le u, v < \frac{N}{W}$. The inversion is defined as moving the data elements $d(xN + uW, yN + vW)$, $0 \le u, v < \frac{N}{W}$, located at the block $B_{u,v}$ of $SA_{x,y}$, back to $d'(uW + sp + x, vW + tp + y)$ located at the subblock $SB_{s,t}$, $0 \le s, t < \frac{W}{p}$, of the block $B_{u,v}$ of $SA_{0,0}$. For example, as shown in Fig. 5b, where $x = y = s = t = 1$, $d(N + uW, N + vW)$ is moved back to $d'(uW + p + 1, vW + p + 1)$. By applying the pipelining ability and reconfigurability of the optical buses [19], both the rotation and inversion operations can be performed in $O(1)$ time.

Lemma 6. Both the rotation and inversion operations can be computed in $O(1)$ time on a 2D $pN \times pN$ AROB.
4 PARALLEL ALGORITHMS FOR MEDIAN FILTERING
Given an $N \times N$ digital image and a $W \times W$ window, the median filtering problem is to find the medians of the neighborhoods of all possible windows of the image, where the result of each window operation is stored in the location corresponding to the top-left corner of the window. Given an $N \times N$ image $A = a_{i,j}$, $0 \le i, j < N$, the result of median filtering, denoted by $d_{i,j}$, $0 \le i, j < N$, is formulated as
$$d_{i,j} = \mathrm{MED}\{a_{(i+l) \bmod N,\, (j+m) \bmod N} : -w \le l, m \le w\}, \quad (4)$$
where $w = (W - 1)/2$ and MED denotes the median (i.e., the $\lceil W^2/2 \rceil$th smallest data element of the window). In the following, assume that the intensity of each pixel of the image is an n-bit integer. In the remainder of this section, we will develop several scalable and efficient parallel algorithms for median filtering on the AROB, based on the number of processors available in the system. To implement (4) easily on a 2D $pN \times pN$
AROB, we redefine the coordinates of the image as follows. Let the $N \times N$ image consist of $\frac{N}{W} \times \frac{N}{W}$ subimages, each of size $W \times W$, denoted by $SI_{u,v}$, $0 \le u, v < \frac{N}{W}$; also, let each $W \times W$ subimage consist of $\frac{W}{p} \times \frac{W}{p}$ groups, each of size $p \times p$, denoted by $G_{s,t}$, $0 \le s, t < \frac{W}{p}$. Without loss of generality, assume that p is a factor of W and W is a factor of N. Our algorithms can be easily extended to the general case in which the image is not square. Then, for each pair of vertical and horizontal displacements x and y of each group $G_{s,t}$, relative to the top-left corner of each subimage $SI_{u,v}$, the medians specified in (4) can be reformulated as
$$d_{uW+sp+x,\, vW+tp+y} = \mathrm{MED}\{a_{(uW+sp+x+l) \bmod N,\, (vW+tp+y+m) \bmod N} : -w \le l, m \le w\}, \quad (5)$$
where $0 \le u, v < \frac{N}{W}$, $0 \le s, t < \frac{W}{p}$, and $0 \le x, y < p$.
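For reference, the following brute-force sketch (a sequential check written for this presentation, not one of the AROB algorithms) computes the medians exactly as defined in (4); the decomposition in (5) merely re-indexes the same windows.

def median_filter(a, W):
    """Brute-force reference for (4): circular W x W window medians of an N x N image a."""
    N = len(a)
    w = (W - 1) // 2
    d = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            window = sorted(a[(i + l) % N][(j + m) % N]
                            for l in range(-w, w + 1) for m in range(-w, w + 1))
            d[i][j] = window[(W * W - 1) // 2]     # the ceil(W^2 / 2)-th smallest element
    return d

For example, with a 3 × 3 window the selected entry is the fifth smallest of the nine window values, i.e., the middle element of the sorted window.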
As stated in (4) and (5), there are $N^2$ medians to be computed. Consider a 2D $pN \times pN$ AROB: it can be partitioned into $p \times p$ subAROBs, each of size $N \times N$, denoted by $SA_{x,y}$, $0 \le x, y < p$; each $SA_{x,y}$ can be further partitioned into $\frac{N}{W} \times \frac{N}{W}$ blocks, each of size $W \times W$, denoted by $B_{u,v}$, $0 \le u, v < \frac{N}{W}$. Each $B_{u,v}$ is responsible for computing one median; in total, there are $\frac{N^2}{W^2}p^2$ medians computed in an iteration. Therefore, it requires $\frac{W^2}{p^2}$ iterations to compute the $N^2$ medians. That is, at the $(s\frac{W}{p} + t)$th iteration, $SA_{x,y}$ is responsible for computing $d_{uW+sp+x,\, vW+tp+y}$ specified in (5), $0 \le u, v < \frac{N}{W}$, $0 \le s, t < \frac{W}{p}$, and $0 \le x, y < p$. In order to let these $p^2$ subAROBs compute (5) in parallel, we must rotate the image data to their corresponding subAROBs (i.e., $a'_{xN+uW,\, yN+vW} = a_{(uW+sp+x) \bmod N,\, (vW+tp+y) \bmod N}$). This means that, for a specified group $G_{s,t}$ with a fixed pair of displacements x and y, $d_{uW+sp+x,\, vW+tp+y}$ can be computed by
$$d_{xN+uW,\, yN+vW} = \mathrm{MED}\{a'_{xN+uW+l,\, yN+vW+m} : -w \le l, m \le w\}, \quad (6)$$
where $0 \le u, v < \frac{N}{W}$, $0 \le s, t < \frac{W}{p}$, and $0 \le x, y < p$. Based on (6), the high-level description of the computation at the $(s\frac{W}{p} + t)$th iteration consists of the following three steps. First, rotate the image data located at $SA_{0,0}$ to its corresponding subAROB according to the specified x and y displacements. Then, each block computes its median simultaneously. Finally, the results of all blocks are moved back to their corresponding positions of $SA_{0,0}$. Repeating this process for each group $G_{s,t}$ of each subimage $SI_{u,v}$ from left to right and then top to bottom, all $N^2$ medians can be computed. Assume that the image is initially stored in the local variables a(i, j), $0 \le i, j < N$, of a 2D $pN \times pN$ AROB. Finally, the result is stored in the local variable $d(uW + sp + x, vW + tp + y)$ of the 2D $pN \times pN$ AROB, where x and y are a pair of displacements of each group $G_{s,t}$ relative to the top-left corner of each subimage $SI_{u,v}$. The detailed median filtering algorithm (MFA) is described in the following:
Algorithm MFA(a, d)
1: Since the window is centered at each specified pixel, the image is shifted circularly down and right by w positions.
2: for $s = 0$ to $\frac{W}{p} - 1$ do Step 3.
3: for $t = 0$ to $\frac{W}{p} - 1$ do Steps 3.1-3.4.
3.1: The image is shifted circularly up and left by one position.
3.2: Apply Lemma 6 to rotate the image data to the other subAROBs.
3.3: Each block $B_{u,v}$, $0 \le u, v < \frac{N}{W}$, of subAROB $SA_{x,y}$, $0 \le x, y < p$, applies the selection algorithm proposed by Han et al. [8] to find the median (i.e., compute (6)).
3.4: Apply Lemma 6 to move the results back to $SA_{0,0}$.

Theorem 1. Algorithm MFA can be computed in $O(\frac{W^2}{p^2}(\log\log N)^2/\log\log\log N)$ time on a 2D $pN \times pN$ AROB if the domain of the image is unbounded, where $1 \le p \le W < N$.

Proof. The correctness of this algorithm follows directly from (4)-(6) and Lemma 6. Steps 3.1-3.3 implement (6) for a specified group $G_{s,t}$, and Steps 2 and 3 compute all $\frac{W}{p} \times \frac{W}{p}$ groups. Hence, all $N^2$ medians are correctly computed by the proposed algorithm. The time complexity is analyzed as follows: Steps 3.2 and 3.4 each take $O(1)$ time by Lemma 6. Step 3.3 takes $O((\log\log N)^2/\log\log\log N)$ time, as derived by Han et al. [8]. Steps 3.1-3.4 are repeated $\frac{W}{p} \times \frac{W}{p}$ times. Hence, the total time complexity is $O(\frac{W^2}{p^2}(\log\log N)^2/\log\log\log N)$.

If we increase the number of processors slightly, it is possible to achieve a better time complexity. Since the time complexity of the proposed algorithm is dominated by finding the median in Step 3.3, the median of the $W \times W$ elements of each window can instead be found in $O(1)$ time by Lemma 5. Hence, the total time complexity of the modified algorithm can be reduced from $O(\frac{W^2}{p^2}(\log\log N)^2/\log\log\log N)$ to $O(\frac{W^2}{p^2})$ by increasing the number of processors from $p^2N^2$ to $p^2N^{2+\frac{1}{c}}$, for some fixed $c \ge 1$. Furthermore, if the number of processors is increased to $W^2N^2$ and $W^2N^{2+\frac{1}{c}}$, two corresponding $O((\log\log N)^2/\log\log\log N)$ time and $O(1)$ time algorithms can also be obtained. This leads to the following corollaries.
Corollary 2. The image median filtering problem can be solved in $O(\frac{W^2}{p^2})$ time, either on a 2D $pN \times pN$ AROB if the domain of the image is bounded, or on a 3D $pN \times pN \times N^{\frac{1}{c}}$ AROB if the domain of the image is unbounded, for a constant $c \ge 1$, where $1 \le p \le W < N$.

Corollary 3. The median filtering problem can be solved in $O(1)$ time, either on a 2D $WN \times WN$ AROB if the domain of the image is bounded, or on a 3D $WN \times WN \times N^{\frac{1}{c}}$ AROB if the domain of the image is unbounded, for a constant $c \ge 1$.
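As a sanity check on the indexing that the MFA decomposition relies on, the sketch below (sequential, with assumed toy sizes satisfying p | W and W | N) verifies that iterating over subimages, groups, and displacements enumerates each of the $N^2$ output positions of (5) exactly once.

# Every output pixel (uW + sp + x, vW + tp + y) from (5) is produced exactly once.
N, W, p = 8, 4, 2                      # assumed toy sizes with p a factor of W, W of N
seen = set()
for u in range(N // W):
    for v in range(N // W):            # subimages SI_{u,v}
        for s in range(W // p):
            for t in range(W // p):    # groups G_{s,t}
                for x in range(p):
                    for y in range(p): # displacements within a group
                        seen.add((u * W + s * p + x, v * W + t * p + y))
assert len(seen) == N * N              # all N^2 medians are covered, none twice
print("covered", len(seen), "pixels")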
TABLE 1 Summary of Comparison Results for Image Median Filtering Algorithms
On the other hand, suppose there are $pq \times pq$ processors available on a 2D AROB, where q is a factor of N and $q \le N$. As stated previously, the high-level description of Step 3 consists of the following three steps. First, rotate the image data to the corresponding subAROBs. Next, each block computes its corresponding median. Finally, rotate the results back to their corresponding positions. Since we have $N^2$ processors for each subAROB, it can compute $\frac{N^2}{W^2}$ medians as defined in (6). Since we have $p^2$ subAROBs, we can compute $\frac{N^2}{W^2}p^2$ medians in an iteration. By repeating Step 3 $\frac{W^2}{p^2}$ times, all $N^2$ medians can be computed. By restricting the number of processors of each subAROB to $q^2$, where each processor has $O(\frac{N^2}{q^2})$ restricted memory to store the allocated image pixels, we can compute only $\frac{q^2}{W^2}$ medians per step. Then, by repeating Step 3 $\frac{N^2}{q^2}$ times, each subAROB can compute $\frac{N^2}{W^2}$ medians. Since we have $p^2$ subAROBs, after repeating Step 3 $\frac{W^2}{p^2}$ times, all $N^2$ medians can be computed. Furthermore, if there are only $q^2N^{1/c}$ processors to find the medians for each subAROB, we may use the scalable selection algorithm (Lemma 5) to solve this problem. Based on the divide-and-conquer technique as stated above, algorithm MFA can be modified to run efficiently. This leads to the following results.

Corollary 4. The image median filtering problem can be solved in $O(\frac{W^2}{p^2}\frac{N^2}{q^2}(\log\log N)^2/\log\log\log N)$ time on a 2D $pq \times pq$ AROB if the domain of the image is unbounded, where $1 \le p \le W$ and $q \le N$.

Corollary 5. The image median filtering problem can be solved in $O(\frac{W^2}{p^2}\frac{N^2}{q^2})$ time, either on a 2D $pq \times pq$ AROB if the domain of the image is bounded, or on a 3D $pq \times pq \times N^{\frac{1}{c}}$ AROB if the domain of the image is unbounded, for a constant $c \ge 1$, where $1 \le p \le W$ and $q \le N$.
5 CONCLUDING REMARKS

In this paper, several fast and scalable algorithms for selection and median filtering are proposed. We first design $O(1)$ time basic operations to speed up the median filtering algorithms. These include the selection, rotation, and inversion operations for an image; all of them require global propagation of image data. Based on these basic operations, several efficient and scalable parallel median filtering algorithms are derived. To the authors' knowledge, this is the first constant time median filtering algorithm developed on any parallel computational model. Compared to the previous results shown in Table 1, our algorithms are more time flexible and scalable. In image processing, median filtering has long been used to suppress noise while preserving edges in an image. The median filter is applied with small windows to avoid image distortion; that is, $W \ll N$. For such a case, a result of $O(1)$ is not much different from $O(W)$, where $W \ll N$ is the window size. Image intensities are usually represented with gray levels ranging from 0 to 255, whereas unbounded intensities are rarely met in images. In a cycle time, the number of messages that can be transmitted by a pipelined optical bus is larger than the number that can be transmitted by an electrical bus. Optical transmission can greatly reduce the data transmission time between processors. The transmission time of a data element between processors is determined by the size of the data element and the bus capacity. Usually, parallel image processing jobs require a large amount of computation and communication. Due to its high communication bandwidth, its bus reconfigurability as a computation tool, and the versatile communication patterns it supports, the AROB is useful for solving image problems, and its computational power is superior to that of other existing reconfigurable networks, such as the PARBS [27].

ACKNOWLEDGMENTS

This work was partially supported by the National Science Council under contract nos. NCS91-2623-7-012-001 and NSC91-2213-E-011-115.
REFERENCES
[1] G. Alia and E. Martinelli, "VLSI Binary-Residue Converters for Pipelined Processing," The Computer J., vol. 33, pp. 473-474, 1990.
[2] E. Ataman, V.K. Aatre, and K.M. Wong, "A Fast Method for Real-Time Median Filtering," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, pp. 415-421, 1980.
[3] Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster, "The Power of Reconfiguration," J. Parallel and Distributed Computing, vol. 13, pp. 139-153, 1991.
[4] A. Bertossi and A. Mei, "A Residue Number System on Reconfigurable Mesh with Applications to Prefix Sums and Approximate String Matching," IEEE Trans. Parallel and Distributed Systems, vol. 11, pp. 1186-1199, 2000.
[5] M. Blum, R.W. Floyd, V.R. Pratt, R.L. Rivest, and R.E. Tarjan, "Time Bounds for Selection," J. Computer and System Sciences, vol. 7, no. 4, pp. 448-461, 1972.
[6] J. Gil and M. Werman, "Computing 2-D Min, Median, and Max Filters," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, pp. 504-507, 1993.
[7] Z. Guo, R.G. Melhem, R.W. Hall, D.M. Chiarulli, and S.P. Levitan, "Pipelined Communications in Optically Interconnected Arrays," J. Parallel and Distributed Computing, vol. 12, no. 3, pp. 269-282, 1991.
[8] Y. Han, Y. Pan, and H. Shen, "Sublogarithmic Deterministic Selection on Arrays with a Reconfigurable Bus," IEEE Trans. Computers, vol. 51, no. 6, pp. 702-707, June 2002.
[9] L. Hayat, M. Fleury, and A.F. Clark, "Two-Dimensional Median Filter Algorithm for Parallel Reconfigurable Computers," IEE Proc. Vision Image Signal Processing Conf., vol. 142, no. 6, pp. 345-350, 1995.
[10] T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Trans. Computers, vol. 34, pp. 344-354, 1985.
[11] S.P. Levitan, D.M. Chiarulli, and R.G. Melhem, "Coincident Pulse Technique for Multiprocessor Interconnection Structures," Applied Optics, vol. 29, no. 14, pp. 2024-2033, 1990.
[12] K. Li, Y. Pan, and S.Q. Zheng, "Fast and Processor Efficient Parallel Matrix Multiplication Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System," IEEE Trans. Parallel and Distributed Systems, vol. 9, pp. 705-720, 1998.
[13] K. Li, Y. Pan, and S.Q. Zheng, "Efficient Deterministic and Probabilistic Simulations of PRAMs on Linear Arrays with Reconfigurable Pipelined Bus Systems," J. Supercomputing, vol. 15, pp. 163-181, 2000.
[14] D. Matula, "Basic Digit Sets for Radix Representations," J. ACM, vol. 29, pp. 1131-1143, 1982.
[15] P.M. Narendra, "A Separable Median Filter for Image Noise Smoothing," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 3, pp. 20-29, 1981.
[16] Y. Pan, "Order Statistics on Optically Interconnected Multiprocessor Systems," Proc. First Int'l Workshop Massively Parallel Processing Using Optical Interconnections, pp. 162-169, 1994.
[17] Y. Pan and K. Li, "Linear Array with a Reconfigurable Pipelined Bus System—Concepts and Applications," Information Sciences—An Int'l J., vol. 106, pp. 237-258, 1998.
[18] S. Pavel and S.G. Akl, "On the Power of Arrays with Reconfigurable Optical Bus," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 1443-1454, 1996.
[19] S. Pavel and S.G. Akl, "Matrix Operations Using Arrays with Reconfigurable Optical Buses," Parallel Algorithms and Applications, vol. 8, pp. 223-242, 1996.
[20] S. Pavel and S.G. Akl, "Integer Sorting and Routing in Arrays with Reconfigurable Optical Buses," Int'l J. Foundations of Computer Science, vol. 9, no. 1, pp. 99-120, 1998.
[21] C. Qiao and R.G. Melhem, "Time-Division Communications in Multiprocessor Arrays," IEEE Trans. Computers, vol. 42, no. 5, pp. 577-590, 1993.
[22] S. Rajasekaran and S. Sahni, "Sorting, Selection and Routing on the Arrays with Reconfigurable Optical Buses," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 11, pp. 1123-1132, Nov. 1997.
[23] S. Ranka and S. Sahni, "Efficient Serial and Parallel Algorithms for Median Filtering," IEEE Trans. Signal Processing, vol. 39, no. 6, pp. 1462-1466, 1991.
[24] Q. Stout, "Sorting, Merging, Selecting, and Filtering on Tree and Pyramid Machine," Proc. Int'l Conf. Parallel Processing, pp. 214-221, 1983.
[25] S.L. Tanimoto, "Fast Median Filtering Algorithms for Mesh Computers," Pattern Recognition, vol. 28, no. 12, pp. 1965-1972, 1995.
[26] J.L. Trahan, Y. Pan, R. Vaidyanathan, and A.G. Bourgeois, "Scalable Basic Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System," Proc. First Int'l Conf. Parallel and Distributed Computing Systems, pp. 564-569, 1997.
[27] B.F. Wang and G.H. Chen, "Constant Time Algorithms for Transitive Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus Systems," IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 10, pp. 500-507, 1990.
[28] C.H. Wu, S.J. Horng, and H.R. Tsai, "Optimal Parallel Algorithms for Computer Vision Problems," J. Parallel and Distributed Computing, vol. 62, no. 6, pp. 1021-1041, 2002.

Chin-Hsiung Wu received the BS degree from the Chinese Naval Academy, the MS degree in information management from the National Defense Management College, and the PhD degree in electrical engineering from the National Taiwan University of Science and Technology, Republic of China, in 1986, 1991, and 2000, respectively. He is an associate professor in the Department of Information Management at the Chinese Naval Academy. His research interests include image processing, computer vision, and parallel and distributed computing.

Shi-Jinn Horng received the BS degree in electronics engineering from the National Taiwan Institute of Technology, the MS degree in information engineering from the National Central University, and the PhD degree in computer science from the National Tsing Hua University in 1980, 1984, and 1989, respectively. Currently, he is a full professor in the Computer Science and Information Engineering Department at the National Taiwan University of Science and Technology. He has published more than 100 research papers and received many awards. His research interests include VLSI design, multiprocessing systems, and parallel algorithms.