
A Run-Length Based Connected Component Algorithm for FPGA Implementation

Kofi Appiah, Andrew Hunter, Patrick Dickinson and Jonathan Owens
Department of Computing and Informatics, Faculty of Media, Humanities and Technology
University of Lincoln, Lincoln, LN6 7TS, England
{kappiah,ahunter,pdickinson,joowens}@lincoln.ac.uk

Abstract

This paper introduces a real-time connected component labelling algorithm designed for Field Programmable Gate Array (FPGA) implementation. The algorithm run-length encodes the image, and performs connected component analysis on this representation. The run-length encoding, together with other parts of the algorithm, is performed in parallel; sequential operations are minimized as the number of runs is typically less than the number of pixels. The architecture is designed mainly on Block RAM (i.e. internal RAM) of the FPGA. A comparison with the multi-pass algorithm in hardware and software is presented to show the advantages of the algorithm. The algorithm runs comfortably in real-time with reasonably low resource utilization, making integration with other real-time algorithms feasible.

1 Introduction

The labelling of connected components in a binary image is a fundamental operation in pattern recognition. This operation transforms a binary image into a symbolic one with each connected component having a unique numeric label. The image can be represented in a number of ways, using arrays, run-lengths, quadtrees, octrees and bintrees [9]. Labelling algorithms can broadly be classified into five groups [19, 16]:

• Algorithms which perform two scans, assigning provisional labels and then resolving label equivalences.
• Algorithms which perform forward and backward (multi-pass) scans alternately until there are no changes in labels.
• Algorithms for hierarchical tree structured images that resolve the label equivalence with UNION-FIND.
• Algorithms designed to perform a single scan of the entire image with irregular access pattern.
• Algorithms designed specifically for parallel machines based on divide-and-conquer.

In this paper we present a run-length encoding based connected component labelling algorithm [5] suitable for hardware implementation on FPGA. Using run-length encoding reduces the number of sequential operations significantly, as these are concentrated on processing runs, which are less numerous than pixels. The conversion to run-length encoding is highly parallelizable. The processing time is significantly lower than that of the multi-pass algorithm. The new algorithm also has low memory requirements, making it possible to implement the design on Block RAM, with a combination of parallel and efficient sequential operations.

The paper is organized as follows. Section 2 discusses some existing labelling algorithms and some modifications to suit hardware implementation. Section 3 introduces the implemented labelling algorithm; details on implementation are given in section 4. Section 5 presents results and analysis of the implementation. Finally, we summarise and point out some future directions.

2 Previous Work

Several sequential and parallel algorithms have been proposed to solve the connected components problem [12]. The original two-pass algorithm (Rosenfeld and Pfaltz [10, 13]) uses two raster scans of the image and an equivalence table resolution stage. Details of the algorithm (using the 4-adjacency definition of connectivity) are as follows:

Initial labelling: The binary input image is scanned in raster order, to label foreground pixels by reference to the left and upper neighbours. If both neighbours are background pixels, a new label is assigned. If they are both foreground with different labels, the left label is copied to the current pixel, and an entry is made in the equivalence table linking the upper and left labels. If only one neighbour is labelled, or they share the same label, the value is propagated.

Resolving equivalences: The initial labelling step is followed by equivalence resolution, after which the image is scanned again and each label is replaced by the label assigned to its equivalence class.

This approach has been used in [11, 14, 18, 10, 15, 9], directly or modified for implementation, as the resolution of label equivalences has a great impact on run time. One drawback of this algorithm is the high dynamic storage required for the equivalence table.
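As an illustration of the two steps just described, the following Python sketch labels a small binary image with 4-adjacency. It is a minimal sketch rather than the implementation of [13]: the function name two_pass_label and the choice of a union-find forest to hold the equivalence table are assumptions made for this example.

```python
def two_pass_label(image):
    """Two-pass labelling with 4-adjacency (illustrative sketch only)."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    parent = {}  # equivalence table, kept as a union-find forest (assumption)

    def find(a):
        # Follow parent links to the representative label, halving the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    next_label = 1
    # First scan: provisional labels from the left and upper neighbours.
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            left = labels[y][x - 1] if x > 0 else 0
            up = labels[y - 1][x] if y > 0 else 0
            if not left and not up:
                labels[y][x] = next_label
                parent[next_label] = next_label
                next_label += 1
            elif left and up and left != up:
                labels[y][x] = left
                ra, rb = find(left), find(up)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)  # record the equivalence
            else:
                labels[y][x] = left or up  # propagate the only (or shared) label

    # Second scan: replace each label by its equivalence-class representative.
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels


u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
print(two_pass_label(u_shape))  # the two arms meet in the bottom row
```

The equivalence table in this sketch grows with the number of provisional labels, which is the storage cost the text refers to.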


Figure 1. (a) 8-adjacency forward scan and (b) 8-adjacency backward scan.

To reduce storage requirements, Haralick [4] suggested a multi-pass labelling algorithm with increased computational time [12]. In this algorithm, a foreground pixel with only background neighbours is assigned a new label. However, where there are any labelled neighbours, the minimum label is assigned. The algorithm alternates raster and reverse-raster scans until the labelling stabilises (the number of iterations depending on the component's geometric complexity). The algorithm is highly sequential, and the repeated image scans make it computationally inefficient, as scan-times tend to dominate connected component cycles.

Efforts to further improve the space/time requirements led to the development of one-pass algorithms, such as Contour Tracing (Chang et al. [2]). This algorithm scans the image in raster order until an unlabelled external contour is encountered. A complete trace of the contour is then made, assigning the same label to all subsequent object pixels until an internal contour (if any) or the start point is reached again. Labelling in a single scan is very attractive, yet the nature of the algorithm calls for an irregular access pattern to the image in memory.

Other algorithms have been developed specifically for complex image formats like hierarchical tree structures [16]. These algorithms are built around the two-pass or the iterative forward and backward scans (multi-pass), and hence suffer from the problems described above. The final set of algorithms is the parallel algorithms, also designed around the first two groups for implementation on parallel machine models such as a mesh connected massively parallel processor. Yet very few of these algorithms have been implemented on single hardware architectures.

Crookes et al. [3] successfully implemented the multi-pass connected component labelling algorithm on an FPGA using off-chip RAM. The implementation fits on a single Xilinx XC4010E chip with 20 × 20 CLBs and, for an image size of 256 × 256, the design runs at 76 MHz. The implementation suffers from all the problems associated with the base algorithm and hence the time taken to process a frame is dependent on the complexity of objects in the scene. A similar implementation is presented by Benkrid et al. [1] using a bigger device, a Xilinx Virtex-E FPGA. For an image size of 1024 × 1024, the design consumes 583 slices and 5 Block RAMs, and runs at 72 MHz.

Jablonski et al. [6] have implemented a connected component labelling algorithm with a Xilinx FPGA similar to the one used in [1]. The implementation uses 8-adjacency to label the image after segmentation. Their implementation is divided into three stages: the segmentation and labelling stage, the label reordering stage and the label translation stage. The entire design is capable of labelling at 25 fps with a 512 × 512 image. Using a two-pass algorithm similar to [13], the implementation has a memory access bottleneck [6].

Contour Tracing (CT) presented by Chang [2] has been described as one of the most efficient one-pass algorithms [19]. Hedberg et al. [7] presented a hardware implementation of this algorithm; this guarantees labelling of a predefined number of objects (61 in their implementation). The system is capable of processing 25 fps at an image resolution of 320 × 240 (standard QVGA). The implementation needs irregular access to memory, a possible bottleneck.

A single pass labelling algorithm that allows conflict resolution to be deferred, implemented on FPGA, is presented in [8]. This uses a merger table to record the minimum of two distinct labels in the neighbourhood of any pixel, which is resolved during the horizontal blanking period.

The following section presents a connected component algorithm that fits on a single chip and runs in real time for a standard VGA sized image. The implementation also surmounts most of the memory bottlenecks presented in the previous implementations.
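For comparison with the run-based method that follows, the multi-pass scheme described earlier in this section can be sketched in Python. This is an illustrative reading of the description, not the code of [4] or of the FPGA designs cited; the function name multipass_label, the 8-adjacency masks and the returned scan counter are assumptions made for the example.

```python
def multipass_label(image):
    """Alternate forward and backward scans until no label changes (sketch)."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    # Neighbours already visited in a raster scan and in a reverse-raster scan.
    forward_mask = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]
    backward_mask = [(1, 1), (1, 0), (1, -1), (0, 1)]
    next_label, scans = 1, 0
    forward, changed = True, True
    while changed:
        changed = False
        mask = forward_mask if forward else backward_mask
        ys = range(h) if forward else range(h - 1, -1, -1)
        xs = range(w) if forward else range(w - 1, -1, -1)
        for y in ys:
            for x in xs:
                if not image[y][x]:
                    continue
                # Collect the labels of in-bounds, already labelled neighbours.
                candidates = [labels[y + dy][x + dx] for dy, dx in mask
                              if 0 <= y + dy < h and 0 <= x + dx < w
                              and labels[y + dy][x + dx]]
                current = labels[y][x]
                if current:
                    candidates.append(current)
                if candidates:
                    new = min(candidates)  # minimum label wins
                else:
                    new = next_label       # only background neighbours
                    next_label += 1
                if new != current:
                    labels[y][x] = new
                    changed = True
        scans += 1
        forward = not forward
    return labels, scans


labels, scans = multipass_label([[1, 0, 1],
                                 [1, 0, 1],
                                 [1, 1, 1]])
print(labels, scans)
```

Every scan touches every pixel, and the final scan exists only to confirm that nothing changed, which is the inefficiency the text attributes to this family of algorithms.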

3 Our Approach

The labelling algorithm as presented in [13] can result in a very large equivalence table. Resolving the equivalence table has been the focus of most labelling algorithms, with little effort to implement such algorithms on hardware architecture for real-time processing. Our algorithm is suitable for implementation on a hardware platform, and also minimises use of memory.

Our key contribution is to process using a run-length encoding representation. Conversion of the original binary image to run-length encoded format is easily parallelised by processing multiple rows in parallel. The run-length encoded format is much more compact than the binary image (individual runs have a single label), and so the sequential label propagation stage is much faster than in the conventional algorithm. Details of the algorithm are given below. The stages involved in our implementation are as follows:

1. Pixels are converted to runs in parallel by rows,

2. Initial labelling and propagation of labels,

3. Equivalence table resolution and

4. Translating run labels to connected component.

The design is parallelised as far as possible. Although stages 2 and 3 are sequential, they operate on runs, which are far less numerous than pixels. Similar to stage 1, stage 4 can be executed in parallel by row. A run has the properties {ID, EQ, s, e, r}, where ID is the identity number of the run, EQ is the equivalence value, s the x offset of the start pixel, e the x offset of the end pixel, and r the row.

The first stage involves row-wise parallel conversion from pixels to runs. Depending on the location and access mode of the memory holding the binary image, the entire image may be partitioned into n parts to achieve n run-length encodings in parallel. The use of runs rather than pixels reduces the size of the equivalence table in some cases similar to figure 4 and hence makes it easier to resolve. The following sequential local operations are performed in parallel on each partition, for an image size M × N, to assign pixels to runs:

Algorithm 3.1: PixelToRuns(T)

  ∀T : T(x, y) = I(x, y)
  i ← 0
  if T(x, y) = 1 and isBlock = 0
    then s_i ← x
         isBlock ← 1
  if isBlock = 1 and (T(x, y) = 0 or x = M)
    then e_i ← x
         r_i ← y
         ID_i ← EQ_i ← 0
         i ← i + 1
         isBlock ← 0

where isBlock is 1 when a new run is scanned for partition T and M is the width of the image. A run is complete when the end of a row is reached or when a background pixel is reached. The maximum possible number of runs in an M × N image is MN/2. This worst case occurs when the image is a pixel-wise chequerboard pattern; see figure 2.

Figure 2. Worst case scenario.

The second stage involves initial labelling and propagation of labels. The IDs and equivalences (EQs) of all runs are initialized to zero. This is followed by a raster scan of the runs, assigning provisional labels which propagate to any adjacent runs on the row below. For any unassigned run (ID_i = 0) a unique value is assigned to both its ID and EQ. For each run i with ID ID_i, excluding runs on the last row of the image, the runs one row below run i are scanned for an overlap. An overlapping run j in 4-adjacency (i.e. s_i ≤ e_j and e_i ≥ s_j) or 8-adjacency (i.e. s_i − 1 ≤ e_j and e_i + 1 ≥ s_j) is assigned the ID ID_i if and only if ID_j is unassigned. If there is a conflict (the overlapping run already has an assigned ID_j), the equivalence of run i, EQ_i, is set to ID_j. This is summarized in algorithm 3.2.

Algorithm 3.2: InitLabelling(runs)

  m ← 1
  for i ← 1 to TotalRuns
    do if ID_i = 0
         then ID_i ← EQ_i ← m
              m ← m + 1
       for each run j on row r_i + 1
         do if ID_j = 0 and e_i ≥ s_j and s_i ≤ e_j
              then ID_j ← ID_i
                   EQ_j ← ID_i
            if ID_j ≠ 0 and e_i ≥ s_j and s_i ≤ e_j
              then EQ_i ← ID_j

where TotalRuns excludes runs on the last row of the image. Applying PixelToRuns() to the object in figure 3 (a 'U' shaped object) will generate four runs, each with unassigned ID and EQ.
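The first two stages can be sketched in Python as follows. This is a sequential, illustrative transcription of PixelToRuns and InitLabelling, not the parallel Handel-C design. Several details are assumptions made for the example: the function names, the dictionary representation of a run, inclusive run ends (chosen to agree with the start and end values listed in Table 2), the use of an if/else to mirror two tests that hardware would evaluate together, and visiting last-row runs so that isolated ones still receive a label. The pixel coordinates of the U shaped test image are likewise hypothetical.

```python
def pixel_to_runs(image):
    """Run-length encode each row; run ends are inclusive (assumption)."""
    runs = []
    for y, row in enumerate(image):
        start = None  # plays the role of isBlock
        for x, pixel in enumerate(row):
            if pixel and start is None:
                start = x
            if start is not None and (not pixel or x == len(row) - 1):
                end = x if pixel else x - 1
                runs.append({"ID": 0, "EQ": 0, "s": start, "e": end, "r": y})
                start = None
    return runs


def overlaps(a, b, eight=False):
    """Overlap test between runs on adjacent rows (4- or 8-adjacency)."""
    pad = 1 if eight else 0
    return a["s"] - pad <= b["e"] and a["e"] + pad >= b["s"]


def init_labelling(runs, eight=False):
    """Assign provisional IDs and record conflicts in EQ."""
    m = 1
    for run in runs:
        if run["ID"] == 0:
            run["ID"] = run["EQ"] = m
            m += 1
        for other in runs:
            if other["r"] != run["r"] + 1 or not overlaps(run, other, eight):
                continue
            if other["ID"] == 0:
                other["ID"] = other["EQ"] = run["ID"]  # propagate downwards
            else:
                run["EQ"] = other["ID"]                # conflict: note equivalence
    return runs


# Hypothetical U shaped object producing four runs B1..B4 in raster order.
u_shape = [[0, 0, 0, 0, 0, 1, 1],
           [1, 1, 0, 0, 0, 1, 1],
           [1, 1, 1, 1, 1, 1, 1]]
runs = init_labelling(pixel_to_runs(u_shape))
print([(r["ID"], r["EQ"]) for r in runs])  # [(1, 1), (2, 2), (1, 2), (2, 2)]
```

With this geometry the (ID, EQ) pairs match the last iteration of Table 1: the third run keeps ID 1 but records the equivalence with label 2.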

Figure 3. U shaped object with 4 runs after PixelToRuns()

        IT 1        IT 2        IT 3
      ID   EQ     ID   EQ     ID   EQ
B1     1    1      1    1      1    1
B2     0    0      2    2      2    2
B3     1    1      1    1      1    2
B4     0    0      2    2      2    2

Table 1. Results for the object in fig.3 after 3 iterations.

      ID   EQ   ST   EN   RW
B1     0    0   16   17    0
B2     0    0    6    8    1
B3     0    0   14   15    1
B4     0    0    3    5    2
B5     0    0   12   13    2
B6     0    0    0    2    3
B7     0    0   10   11    3

Table 2. Results after first image scan, where ST=start, EN=end and RW=row.

      ID   EQ
B1     1    1
B2     2    2
B3     1    1
B4     2    2
B5     1    1
B6     2    2
B7     1    1

Table 3. Results after second scan. The start, end and row remain unchanged.

The third stage is resolution of conflicts, similar to [13], as shown in algorithm 3.3. In the example above (figure 3 and table 1) a conflict occurs at B3; the initially assigned EQ = 1 in iteration 1 changes to EQ = 2 in iteration 3 due to the overlap with B1 and B4, see table 1. This conflict is resolved in ResolveConflict(), resulting in ID = 2 and EQ = 2 for all four runs. Even though ResolveConflict() is highly sequential, it takes half the total cycles, as the two if statements in the second loop are executed simultaneously. The final IDs (final labels) are written back to the image at the appropriate pixel locations, without scanning the entire image, as each run has associated s, e and r values.

Algorithm 3.3: ResolveConflict(runs)

  for i ← 1 to TotalRuns
    do if ID_i ≠ EQ_i
         then TID ← ID_i
              TEQ ← EQ_i
              for j ← 1 to TotalRuns
                do if ID_j = TID
                     then ID_j ← TEQ
                   if EQ_j = TID
                     then EQ_j ← TEQ

To illustrate the performance of the algorithm, consider the stair-like 8-connected components illustrated in figure 4. Using the multiple pass approach, a stair-like component with N steps will require 2(N − 1) scans to completely label. Using our approach, both images take two scans to completely label. As shown in figure 4, the runs are extracted in the first scan, while the 8-adjacency labelling is done in the second scan. Tables 2 and 3 show results after the first and second scan respectively. It becomes clear from table 3 that no further scans are required. The same image (figure 4) will result in an equivalence table with five entries if the two-pass algorithm [13] is used, due to the successive offsetting of rows to the left.
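The conflict resolution and write-back stages can be sketched in Python in the same spirit. The sketch below assumes that a conflict is a run whose ID differs from its EQ, which is the reading consistent with Table 1; the function names, the dictionary representation of a run, the inclusive run ends and the run coordinates in the example are assumptions, with the (ID, EQ) pairs taken from the last iteration of Table 1.

```python
def resolve_conflict(runs):
    """Replace every occurrence of a conflicting ID by its equivalent label."""
    for i in range(len(runs)):
        if runs[i]["ID"] != runs[i]["EQ"]:
            tid, teq = runs[i]["ID"], runs[i]["EQ"]
            for run in runs:
                # The two tests touch different fields, so a hardware design
                # can evaluate them together; here they simply run in turn.
                if run["ID"] == tid:
                    run["ID"] = teq
                if run["EQ"] == tid:
                    run["EQ"] = teq
    return runs


def write_back(runs, width, height):
    """Paint final labels using only the s, e and r of each run."""
    labels = [[0] * width for _ in range(height)]
    for run in runs:
        for x in range(run["s"], run["e"] + 1):  # inclusive end (assumption)
            labels[run["r"]][x] = run["ID"]
    return labels


# (ID, EQ) pairs from the last iteration of Table 1; coordinates are hypothetical.
runs = [
    {"ID": 1, "EQ": 1, "s": 5, "e": 6, "r": 0},  # B1
    {"ID": 2, "EQ": 2, "s": 0, "e": 1, "r": 1},  # B2
    {"ID": 1, "EQ": 2, "s": 5, "e": 6, "r": 1},  # B3, the conflicting run
    {"ID": 2, "EQ": 2, "s": 0, "e": 6, "r": 2},  # B4
]
resolve_conflict(runs)
print([(r["ID"], r["EQ"]) for r in runs])  # [(2, 2), (2, 2), (2, 2), (2, 2)]
for row in write_back(runs, width=7, height=3):
    print(row)
```

Only pixels covered by a run are written, so the cost of the write-back depends on the run lengths rather than on a further full image scan.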

3.1 Analysis of our algorithm

As with other connected components algorithms [1, 13, 17], the efficiency of our algorithm depends on the geometrical complexity of the components. However, our approach typically results in fewer labelling conflicts and a somewhat smaller equivalence table. We benchmarked the performance of our approach against the base algorithm presented in [1, 13, 17].

Figure 4. Example of a 3 and 4-stairs connected component

We compared our implementation and that presented in [1] using a set of complex images of various sizes, six of them shown in figure 5. The testing environment is an Intel Pentium IV 2.8 GHz personal computer with 2.0 GB SDRAM running MATLAB/C. The top-left image has two irregular-shaped objects which generated 360 runs and an equivalence table with 3 entries. Thus, using 2 × MN scans (sequential access of all pixels) to generate the runs and write the labels back to the final image, 360 run operations (sequential access of all runs) to identify overlapping runs and a further 360 × 3 run operations to resolve the equivalence table, our method takes approximately 2 × 720 × 576 + 360 × 4 = 830,880 operations to completely label the image.

Figure 5. Sample images used as benchmark to compare our approach with [1]

The same image (top-left in figure 5) will require a total of 5 × framesize = 2,073,600 scans to completely label the components using the multi-pass approach [1]: 1 complete scan for the initial labelling, followed by 2 complete scans for the forward and backward passes. It is worth pointing out that the components will then have the appropriate labels, yet a further forward and backward pass is required to guarantee there are no changes. The middle-left image in figure 5 (a noisy image with one object) resulted in 2,379 runs and 120 entries in the equivalence table. Our approach took approximately 2 × 720 × 576 + 2379 × (120 + 1) = 1,117,299 operations, while the method in [1] took approximately 13 × 720 × 576 = 5,391,360 scans to label the image. Similarly, the bottom-left image with a total of 103 components resulted in 1600 runs and 9 entries in the equivalence table.

Table 4 shows the time taken in seconds to process the six images in figure 5. A graph with the processing time (in seconds) of the two implementations, for 200 naturalistic images each of size 720 × 576, is shown in figure 6. In the worst case scenario, where there are no continuous pixels in a row as in figure 2, our approach incurs extra writing overheads. The implementation in hardware running at real-time is a significant advantage.

Figure 6. Graph showing processing time for 200 images of size 720x576 (x-axis: frame; y-axis: processing time in seconds; series: ours and Benkrid's)

Image          Size      Ours    [1]     Ratio
top-left       720x576   0.063   0.486    7.72
top-right      720x576   0.063   0.533    8.46
middle-left    720x576   0.079   1.889   23.91
middle-right   406x277   0.030   0.406   13.53
bottom-left    405x266   0.015   0.743   49.53
bottom-right   640x480   0.063   0.590    9.36

Table 4. Performance of the two approaches in seconds
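The operation counts quoted in this section follow a simple cost model, which the short Python sketch below makes explicit. The function names and parameter names are assumptions introduced for the example; the formulas are the ones used in the text (two full pixel passes plus one run pass per equivalence entry and one for overlap detection, against one full pixel pass per multi-pass scan).

```python
def run_length_ops(width, height, runs, eq_entries):
    """Approximate operations for the run-length method, as counted in the text."""
    pixel_passes = 2 * width * height      # extract runs, then write labels back
    run_passes = runs * (eq_entries + 1)   # overlap detection plus resolution
    return pixel_passes + run_passes


def multipass_ops(width, height, scans):
    """Approximate operations for the multi-pass method: full image per scan."""
    return scans * width * height


# Top-left and middle-left benchmark images, using the figures given in the text.
print(run_length_ops(720, 576, runs=360, eq_entries=3))     # 830880
print(multipass_ops(720, 576, scans=5))                     # 2073600
print(run_length_ops(720, 576, runs=2379, eq_entries=120))  # 1117299
print(multipass_ops(720, 576, scans=13))                    # 5391360
```

Under this model the pixel passes dominate the run-length method for both images, so the number of runs and equivalence entries has a comparatively small effect on the total.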

4 Hardware Implementation

The algorithm presented in section 3 has been coded in Handel-C and fully implemented on a Xilinx Virtex-4 FPGA development board, RC340 from Celoxica. The Xilinx Virtex-4 FPGA chip (XC4VLX160), has approximately 152,064 logic cells with embedded Block RAM totalling 5,184 Kbits, making it possible to store 2 grayscale intensity images of size 640x480 or 16 binary images each of size 640x480. The development board is also packaged with 4 banks of zero-bus turnaround (ZBT) SRAM totalling 32MB, making it possible to buffer frames for the display unit when needed.

4.1 The Architecture

The implementation consists of image binarization, run extraction, initial labelling and conflict resolution stages. Figure 7 is a conceptual overview of the run-length connected component algorithm on FPGA. There are three different settings for our implementation to show the most efficient setting and architectural implementation. Again for

[Block diagram: Complete frame input, Comparator, Binary image, Labelling unit]

pixel into binary using a threshold value. The output of this block is written to Block RAM (binary image buffer).
