A Low Complexity Architecture for Binary Image Erosion and Dilation using Structuring Element Decomposition
Hugo Hedberg, Fredrik Kristensen, Peter Nilsson, and Viktor Öwall
Dept. of Electroscience, Box 118, Lund University, SE-221 00 Lund, Sweden
Email:
[email protected],
[email protected]
Abstract: This paper describes a new hardware architecture for binary image erosion and dilation. The design is to be used in a self-contained real-time surveillance system. Thus, low complexity and low power consumption are the main constraints. To achieve this goal, the aim has been to reduce the memory requirements and the number of memory accesses per pixel. By storing only the number of consecutive ones that appear horizontally and vertically in the input image, only two internal memory accesses per calculated output pixel are required. The number of memory accesses is independent of the size of the structuring element (SE) as long as it is rectangular and only contains ones, which is a common case. The internal memory size is proportional to $\log_2(SE_{height})$, which means that a large span of SE sizes can be supported with a small amount of hardware.
0-7803-8834-8/05/$20.00 ©2005 IEEE.

I. INTRODUCTION
Erosion and Dilation (E&D) are the two foundations of mathematical morphology, since all morphological operations can be broken down into these two basic operations [1]. For example, operations such as opening, closing, gradient, and skeletonization are performed with these two base functions. From these facts, the need for low complexity architectures to perform E&D becomes evident.

In this paper a binary E&D unit, which is to be used in a real-time surveillance system, is presented. A conceptual overview of the surveillance system is shown in Fig. 1. The camera feeds the image processing system with a real-time image stream, i.e., 25-30 frames per second. A segmentation algorithm, e.g., based on the Stauffer and Grimson algorithm [2], [3], preprocesses the image stream and produces a binary mask in which zeros and ones correspond to background and foreground, respectively. In theory, the moving parts of an image should be distinguished as independent objects in the binary mask. However, in reality the mask will be distorted by noise and single objects will be split into multiple objects, e.g., if parts of the moving object have the same color as the background. In order to remove noise and reconnect split objects, one or more opening (erosion followed by dilation) and closing (dilation followed by erosion) operations are performed on the mask. The object classification part will then use the mask to cut out the interesting parts of the image and perform classification and tracking.

[Fig. 1. Surveillance system: Cam, Segmentation algorithm, Noisy Mask, Morphology, Object classification & tracking.]

To easily incorporate the E&D unit into the system, some requirements are placed on the architecture. First and most important, input and output data must be processed sequentially from first to last pixel in the binary image to avoid unnecessary memory handling. In addition, this allows burst reads from memory and allows several E&D units to be placed sequentially after each other without any storage in between. Secondly, the hardware should be small, simple, and fast, in order to allow as much time and hardware space for the object classification/tracking part of the system as possible. To increase the overall performance of the system, it is also desirable that the size of the Structuring Element (SE) can be changed during run time. With a flexible SE size comes the ability to compensate for different types of noise and to sort out certain types of objects in the mask, e.g., high and thin objects (standing humans) or wide and low objects (side view of cars).

The main contribution of this paper is a new hardware architecture for binary E&D that meets the requirements mentioned above, i.e., low memory requirement, sequential processing, and flexible SE size. However, to fulfill the requirements, some limitations are placed on the SE: it has to be rectangular, though any height and width are acceptable, and it can only contain ones. These limitations are not seen as major disadvantages, since this is a common case in opening, closing, and boundary extraction [4].

A. Previous work
Previous hardware implementations of binary E&D are mostly focused on small to medium sized SEs. For example, in [5] and [6], a direct implementation of the E&D algorithm makes them unsuitable for larger SEs, due to the long delay lines between each row of the SE. However, they do support any binary SE. In [7], a fast implementation is presented, where each row in the SE is processed in parallel, making it fast, but it requires one memory access for each row in the SE. Other implementations of E&D are used for grayscale input and are not really suited for the presented system. In general, few E&D implementations with binary input and binary output are published, especially small and simple ones for the basic and common case of rectangular SEs.

[Fig. 2. Decomposition of structuring element B (Bheight = 3, Bwidth = 5) into B1 and B2.]
[Fig. 3. Input and output to decomposition windows B1 and B2.]
II. MORPHOLOGY
In this paper, A represents the binary input image and B the structuring element, with A and B as sets in $Z^2$. Both erosion and dilation are sliding window operations with B over A. In words, the result of binary erosion of A by B is all locations of B in A where the content of A is the same as B. In the case of dilation, it is all locations of $\hat{B}$, the geometric inverse or reflection of B, in A where at least one element of $\hat{B}$ is the same as A. Mathematically, the erosion and dilation of A by B, denoted $A \ominus B$ and $A \oplus B$ respectively, are defined as

$$A \ominus B = \{z \mid (B)_z \subseteq A\}$$
$$A \oplus B = \{z \mid [(\hat{B})_z \cap A] \neq \emptyset\}$$

and one is the dual of the other according to

$$A \oplus B = (A' \ominus \hat{B})' \qquad (1)$$
$$A \ominus B = (A' \oplus \hat{B})' \qquad (2)$$

where $'$ is bit inversion [8]. It is assumed that the height and width of B are odd numbers, i.e., B always has a well defined origin. All references to A and B in the form X = r × c mean that X consists of r rows and c columns. Furthermore, dilation is associative, which means that if the SE can be decomposed into smaller SEs according to

$$B = B_1 \oplus B_2, \qquad (3)$$

as shown in Fig. 2, then dilating A with B gives the same result as first dilating A with $B_1$ and then dilating the result with $B_2$ according to:

$$A \oplus B = A \oplus (B_1 \oplus B_2) = (A \oplus B_1) \oplus B_2. \qquad (4)$$

This process is shown in Fig. 3, where the input is first eroded with $B_1$ and then $B_2$. The first position of $B_1$ and $B_2$ that produces a one is also shown, together with the location of this one in the output. With a decomposed SE, the number of comparisons per output is decreased from the number of ones in B to the number of ones in $B_1$ plus $B_2$. For example, if B in Fig. 2 only consists of ones, the number of comparisons per output is decreased from 15 to 8.

If the SE is both reflection invariant, i.e., $B = \hat{B}$, and decomposable, then combining Equations 1, 2, 3, and 4, the following two equations can be derived:

$$A \oplus B = (A \oplus B_1) \oplus B_2 = (A' \ominus B_1)' \oplus B_2 = ((A' \ominus B_1)'' \ominus B_2)' = ((A' \ominus B_1) \ominus B_2)' \qquad (5)$$

$$A \ominus B = A \ominus (B_1 \oplus B_2) = (A' \oplus (B_1 \oplus B_2))' = ((A' \oplus B_1) \oplus B_2)' = ((A \ominus B_1)' \oplus B_2)' = ((A \ominus B_1)'' \ominus B_2)'' = (A \ominus B_1) \ominus B_2 \qquad (6)$$

which implies that the same hardware can be used to perform both erosion and dilation with a decomposed SE; this is discussed further in Section III.

Finding a decomposition of a general SE is a hard problem and not always possible [9], [10]. In addition, for a SE to be reflection invariant it has to be symmetric with respect to both the x and y directions, e.g., a square or a circle. However, one common class of SEs that is both decomposable and reflection invariant is rectangles of ones. This type of SE is well suited for the opening and closing operations that are needed in the system described in Section I.

A known problem with all kinds of sliding window operations, including erosion and dilation, is the boundary problem. It occurs when the sliding window, B, is centered on the boundary of A and thus extends outside A, as shown in Fig. 4a. The most common solution is to pad the input image, A, until B, centered on the original boundary, is completely covered and a well defined answer can be obtained, as shown in Fig. 4b. Padding is defined to be ones if erosion is performed and zeros if dilation is performed [8]. With these definitions, information around the border area of A is not lost and more complex operations, e.g., closing and opening, will perform correctly.

[Fig. 4. Boundary problem (a), padding (b).]

III. ARCHITECTURE

[Fig. 5. Simplified architecture of stage-1.]
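As a quick numerical check of the two identities this section builds on, Equations 5 and 6, the following sketch compares brute-force erosion and dilation by a rectangle of ones with their decomposed forms. It is an illustration only, not part of the presented hardware: the function names (`erode`, `dilate`, `invert`, `identities_hold`) are hypothetical, images are plain lists of 0/1 rows, and the padding convention of Section II (ones for erosion, zeros for dilation) is assumed.

```python
import random


def _slide(img, se_h, se_w, pad, reduce_fn):
    """Apply reduce_fn (all or any) over a centred se_h x se_w window of ones.

    Pixels outside the image read as `pad`; se_h and se_w are assumed odd.
    """
    rows, cols = len(img), len(img[0])
    rh, rw = se_h // 2, se_w // 2

    def px(r, c):
        return img[r][c] if 0 <= r < rows and 0 <= c < cols else pad

    return [[int(reduce_fn(px(r + dr, c + dc)
                           for dr in range(-rh, rh + 1)
                           for dc in range(-rw, rw + 1)))
             for c in range(cols)]
            for r in range(rows)]


def erode(img, se_h, se_w):
    # Erosion: every pixel under the SE must be one (padding = ones).
    return _slide(img, se_h, se_w, pad=1, reduce_fn=all)


def dilate(img, se_h, se_w):
    # Dilation: at least one pixel under the SE must be one (padding = zeros).
    return _slide(img, se_h, se_w, pad=0, reduce_fn=any)


def invert(img):
    return [[1 - p for p in row] for row in img]


def identities_hold(img, se_h, se_w):
    """Check Eq. 5 and Eq. 6 with B1 = 1 x se_w and B2 = se_h x 1."""
    two_step = lambda x: erode(erode(x, 1, se_w), se_h, 1)
    eq6 = erode(img, se_h, se_w) == two_step(img)
    eq5 = dilate(img, se_h, se_w) == invert(two_step(invert(img)))
    return eq5 and eq6


rng = random.Random(0)  # fixed seed, arbitrary test image
image = [[rng.randint(0, 1) for _ in range(12)] for _ in range(9)]
print(identities_hold(image, 3, 5))
# Comparisons per output pixel: direct SE vs. decomposed SE.
print(3 * 5, 5 + 3)
```

The last line reproduces the comparison count from the Fig. 2 example: 15 comparisons for the full 3 × 5 SE against 8 for the decomposed pair.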
The proposed architecture is based on Equations 5 and 6, in order to take advantage of the reduced number of comparisons that a decomposed SE requires. In addition, when comparing Equations 5 and 6, it can be seen that only the erosion operation with a decomposed SE has to be implemented. To perform a dilation, the input A and the result are inverted. Hence, the same inner kernel can be used for both operations.

With a rectangular SE of ones, erosion can be performed as a summation followed by a comparison. To perform binary erosion, the bits in A that lie directly below the current position of B are added and the sum is compared to the size of B. If the sum is equal to the size of B, the result is one, otherwise zero. When combining this with decomposition, the summation can be broken up into two stages, where the first stage, stage-1, compares the number of ones under $B_1$ to the width of $B_1$, and the second stage, stage-2, compares the number of ones under $B_2$ in the result from stage-1 to the height of $B_2$.

When sliding $B_1$ over a row in A, each position in A is used as many times as the width of $B_1$ to calculate the sum. However, if a running sum records the number of consecutive ones in the currently processed input row, each input is used only once. A simplified block diagram of stage-1 is shown in Fig. 5, where ff is the flip-flop that stores the sum of consecutive ones. When the input is one, the recorded sum is increased, and if the input is zero the sum is reset to zero. Each time the sum plus the input equals the width of $B_1$, stage-1 outputs a one to stage-2 and the old sum, i.e., $B_{1,width} - 1$, is kept to be compared to the next input. The same principle is used in stage-2, but instead of a flip-flop a row memory is used to store the number of consecutive hits from stage-1 for each column in A. In Fig. 7, the final architecture of the datapath is shown.
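The running-sum principle described above can be sketched in software as follows. This is a behavioural model under stated assumptions, not the VHDL design: the names `two_stage_erode` and `corner_reference` are hypothetical, border handling is left out (all counters start at zero), and the output at position (r, c) refers to the window whose bottom-right corner lies at (r, c) rather than to a centred window.

```python
def two_stage_erode(img, se_h, se_w):
    """Stream the image pixel by pixel through stage-1 and stage-2.

    out[r][c] is 1 when the se_h x se_w window of ones whose bottom-right
    corner is at (r, c) lies completely on ones in img.
    """
    cols = len(img[0])
    row_mem = [0] * cols  # stage-2 row memory: consecutive stage-1 hits per column
    out = []
    for row in img:
        sum1 = 0          # stage-1 flip-flop: consecutive ones in this row
        out_row = []
        for c, pixel in enumerate(row):
            # Stage-1: compare the running sum plus the input with the SE width.
            if pixel == 0:
                sum1, hit1 = 0, 0
            elif sum1 + 1 == se_w:
                hit1 = 1  # the old sum (se_w - 1) is kept for the next input
            else:
                sum1, hit1 = sum1 + 1, 0
            # Stage-2: same principle, one read and one write of row_mem[c].
            if hit1 == 0:
                row_mem[c], hit2 = 0, 0
            elif row_mem[c] + 1 == se_h:
                hit2 = 1
            else:
                row_mem[c], hit2 = row_mem[c] + 1, 0
            out_row.append(hit2)
        out.append(out_row)
    return out


def corner_reference(img, se_h, se_w):
    """Brute-force reference with the same bottom-right window alignment."""
    return [[int(r >= se_h - 1 and c >= se_w - 1 and
                 all(img[r - i][c - j]
                     for i in range(se_h) for j in range(se_w)))
             for c in range(len(img[0]))]
            for r in range(len(img))]


example = [[1, 1, 1, 1, 0],
           [1, 1, 1, 1, 1],
           [1, 1, 1, 1, 1],
           [0, 1, 1, 1, 1]]
for out_row in two_stage_erode(example, 3, 3):
    print(out_row)
print(two_stage_erode(example, 3, 3) == corner_reference(example, 3, 3))
```

In this model each pixel touches `row_mem` exactly twice, one read and one write, regardless of the SE size, which mirrors the two memory accesses per output pixel stated in the abstract.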
To handle the boundary problem discussed in Section II, the padding is split into four parts, namely north, east, south, and west padding, corresponding to the respective side of A. The north and south padding should extend $\lfloor SE_{height}/2 \rfloor$ rows and the east and west padding should extend $\lfloor SE_{width}/2 \rfloor$ columns outside A. Since dilation is performed with the inverted image $A'$ as input, the padding will be ones independent of which operation, dilation or erosion, is performed. Furthermore, the result of all padding to the west and north can be precalculated. The precalculated result of the west padding will always be equal to $\lfloor SE_{width}/2 \rfloor$ and will be the initial value in stage-1. Similarly, the north padding is equal to $\lfloor SE_{height}/2 \rfloor$ and will be the initial value in stage-2. The east and south padding have to be inserted in the data stream, the east padding in between the rows of A and the south padding after A. Fig. 6 shows all padding in the case that the SE is seven rows and five columns of ones.

[Fig. 6. Padding when SE is seven rows and five columns of ones. X marks a don't care position.]

IV. RESULTS AND PERFORMANCE
In Fig. 7 the final architecture of the datapath is shown, together with the wordlengths in each stage. The input and output parts, stage-0 and 3, have a single bit wordlength, whereas the wordlengths in stage-1 and 2 depend on the largest supported size of B. The wordlengths are $\lceil \log_2(B_{width}) \rceil$ and $\lceil \log_2(B_{height}) \rceil$ in stage-1 and 2, respectively. Thus, the total amount of memory required to perform dilation or erosion is

$$Mem = \lceil \log_2(B_{width}) \rceil + \lceil \log_2(B_{height}) \rceil \cdot A_{columns} \text{ bits,}$$

where the first part is the flip-flop in stage-1 and the second part is the row memory in stage-2. For example, with A = 288 × 352 bits and the size of B = 15 × 19, the required amount of memory is $\lceil \log_2(19) \rceil + \lceil \log_2(15) \rceil \cdot 352 = 1413$ bits. The delay line implementations in [5] and [6], with the same A and B, would require $(B_{height} - 1) A_{columns} + B_{width} = 4947$ bits of memory, which is 3.5 times more than the required memory in the presented implementation.

The execution time, i.e., the time from first input to last output, is $T_{ex} = T_{pp} + T_{pad}$, where $T_{pp}$ and $T_{pad}$ are the pixel processing and padding time, respectively. The pixel processing time is equal to the size of the input image, whereas the padding time depends on the size of both A and the SE according to

$$T_{pad} = \lfloor B_{height}/2 \rfloor (A_{columns} + \lfloor B_{width}/2 \rfloor) + \lfloor B_{width}/2 \rfloor A_{rows} \text{ clock cycles.}$$

Padding time includes all extra clock cycles due to padding, i.e., the east and south padding in Fig. 6. With the same A and B as in the memory example above, the total execution time is $T_{ex} = T_{pp} + T_{pad} = 288 \cdot 352 + \lfloor 15/2 \rfloor \cdot (352 + \lfloor 19/2 \rfloor) + \lfloor 19/2 \rfloor \cdot 288 = 101376 + 5119 = 106495$ clock cycles.
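The memory and execution-time expressions above are simple enough to evaluate directly. The sketch below reproduces the numbers of the worked example (A = 288 × 352, B = 15 × 19); the function names are hypothetical, and the ceiling and floor roundings are assumed from the integer results quoted in the text.

```python
from math import ceil, log2


def memory_bits(a_cols, b_height, b_width):
    """Flip-flop in stage-1 plus row memory in stage-2, in bits."""
    return ceil(log2(b_width)) + ceil(log2(b_height)) * a_cols


def delay_line_bits(a_cols, b_height, b_width):
    """Memory of a delay line implementation as quoted for [5] and [6]."""
    return (b_height - 1) * a_cols + b_width


def execution_cycles(a_rows, a_cols, b_height, b_width):
    """T_ex = T_pp + T_pad in clock cycles."""
    t_pp = a_rows * a_cols                                # one cycle per input pixel
    t_pad = ((b_height // 2) * (a_cols + b_width // 2)    # south padding rows
             + (b_width // 2) * a_rows)                   # east padding per image row
    return t_pp + t_pad


rows, cols, height, width = 288, 352, 15, 19
print(memory_bits(cols, height, width))             # 1413
print(delay_line_bits(cols, height, width))         # 4947
print(execution_cycles(rows, cols, height, width))  # 106495
```

Because the row memory term grows with $\lceil \log_2(B_{height}) \rceil$ rather than with $B_{height}$ itself, doubling the supported SE height adds only one bit per image column in this model.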
[Fig. 7. Architecture of the datapath in the erosion and dilation unit and the wordlength (WL) in each stage: Stage-0, WL = 1; Stage-1, WL = $\lceil \log_2(B_{width}) \rceil$; Stage-2, WL = $\lceil \log_2(B_{height}) \rceil$; Stage-3, WL = 1.]
Almost the same result is obtained with the delay line implementations in [5] and [6]. However, the presented implementation will have a somewhat shorter execution time, since the west and north padding never affect the padding time. This effect is, however, marginal compared to the processing time for larger input images. No fair comparison of execution time and memory requirement is feasible with the implementation in [7], since all data management is omitted from that design. Instead, $B_{height}$ different values of A are required as input each clock cycle.

Input signals to the E&D unit, in addition to the data and the erode/dilate select, are the width and height of B. A controller sets the north and west padding values and produces the control signals W-boundary, N-boundary, and E or S-boundary accordingly, see Fig. 7. In addition, it produces an output signal hold, which stalls the input data, or the component that produces the input data, until the internally produced padding bits are processed.

V. IMPLEMENTATION
The architecture has been implemented in VHDL and synthesized to the UMC 0.13 µm CMOS process. The row memory in stage-2 is implemented as a shift-register of flip-flops, but to save power and area other memory structures will be investigated. Optimizing the row memory is of high importance, since it is by far the largest single component in the design. In Tab. I some characteristics of the implementation are presented. Since the E&D units we have found in the literature are implemented on different platforms or with different processes, it is not possible to make a fair comparison of hardware size, power, and speed. Instead, the comparison is based on process independent measurements, e.g., execution time in clock cycles and the required number of memory elements, see Section IV.

TABLE I
THE MOST IMPORTANT CHARACTERISTICS OF THE E&D UNIT
Maximum A    288 × 352
Maximum B    15 × 15
Size         0.05 mm^2
Speed        250 MHz
VI. CONCLUSION
In this paper a low complexity architecture of a binary E&D unit with reduced memory requirement is presented. The architecture has been implemented in VHDL and synthesized to the UMC 0.13 µm CMOS process. This unit can be used when the SE is rectangular and only contains ones, which is a common case when dealing with morphological operations. More than one unit can easily be connected in series without any intermediate storage to perform more advanced morphological operations like opening and closing. The width and height of the SE are controlled by input signals and the architecture supports all sizes of SE up to the limits placed by the wordlengths in stage-1 and 2. In addition, large SE sizes can be supported without a large increase in hardware, since for a fixed image size the memory requirement is mainly proportional to $\log_2(SE_{height})$.

REFERENCES
[1] J. Serra, Image Analysis and Mathematical Morphology, Vol. 1. Academic Press, 1982.
[2] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Ft. Collins, CO, USA, June 23-25, 1999.
[3] H. Jiang, H. Ardö, and V. Öwall, “Hardware accelerator design for video segmentation with multi-modal background modelling,” in Proc. of IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 23-26, 2005.
[4] R. Gonzalez and R. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall, Inc., 2002.
[5] S. Fejes and F. Vajda, “A data-driven algorithm and systolic architecture for image morphology,” in Proc. of IEEE Image Processing, Austin, Texas, Nov. 13-16, 1994, pp. 550–554.
[6] J. Velten and A. Kummert, “FPGA-based implementation of variable sized structuring elements for 2D binary morphological operations,” in Proc. of IEEE International Symposium on Circuits and Systems, Bangkok, Thailand, Mar. 2003, pp. 706–709.
[7] E. N. Malamas, A. G. Malamos, and T. A. Varvarigou, “Fast implementation of binary morphological operations on hardware-efficient systolic architectures,” The Journal of VLSI Signal Processing, vol. 25, pp. 79–93, 2000.
[8] J. Goutsias and H. J. Heijmans, “Fundamenta morphologicae mathematicae,” Fundamenta Informaticae, vol. 41, pp. 1–31, 2000.
[9] H. Park and R. Chin, “Decomposition of arbitrarily shaped morphological structuring elements,” IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 2–15, 1995.
[10] G. Anelli and A. Broggi, “Decomposition of arbitrarily shaped binary morphological structuring elements using genetic algorithms,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 217–224, 1998.