CORDIC Based Parallel/Pipelined Architecture for the Hough Transform J.D. Bruguera N. Guil T. Lang J. Villalba E.L. Zapata
January 1996 Technical Report No: UMA-DAC-96/02
Published in: To appear in the Journal of VLSI Signal Processing
University of Malaga Department of Computer Architecture C. Tecnologico • PO Box 4114 • E-29080 Malaga • Spain
CORDIC BASED PARALLEL/PIPELINED ARCHITECTURE FOR THE HOUGH TRANSFORM* by J.D. Bruguera(1), N. Guil, T. Lang(2), J. Villalba and E.L. Zapata
Dept. Arquitectura de Computadores, University of Malaga, Plaza El Ejido, 29013 Málaga, SPAIN
(1) Dept. Electrónica, Facultad de Física, Univ. Santiago de Compostela, 15706 Santiago de Compostela, SPAIN
(2) Dept. Elect. and Comp. Eng., University of California, Irvine, CA 92717, U.S.A.
Mailing address: Emilio L. Zapata, Dept. Arquitectura de Computadores, University of Málaga, Plaza El Ejido s/n, 29013 Málaga, SPAIN
* This work was supported by the Ministry of Education and Science (CICYT) of Spain under project TIC-92-0942.
Abstract
We present the design of parallel architectures for the computation of the Hough transform based on application-specific CORDIC processors. The design of the circular CORDIC in rotation mode is simplified by the a priori knowledge of the angles participating in the transform, and a high throughput is obtained through a pipelined design combined with redundant arithmetic (carry-save adders in this paper). Saving area is essential to the design of a pipelined CORDIC and can be achieved by reducing the number of microrotations and/or the size of the coefficient ROM. To reduce the number of microrotations we incorporate radix 4 where possible, or mixed radix (radix 2 and radix 4), in the design of the processor, reducing the number of microrotations by 50% and 25%, respectively, with respect to a fully radix-2 implementation. Furthermore, if we allocate two circular CORDIC rotators to one processor, the shared coefficient ROM is only 50% of the size of the ROM in a design based on two separate rotators. Finally, we have also incorporated additional microrotations in order to reduce the scale factor to one. The result is a pipelined architecture that can be easily integrated in VLSI technology due to its regularity and modularity.
1.- Introduction
The Hough transform is a powerful technique for the detection of patterns in images [27]. With the Hough transform the image space is mapped onto a parameter space so that the detection of a specific pattern in the image space becomes the detection of peaks in the parameter space. This transform has been used, in different variants, in a wide range of applications such as straight line recognition [16], extraction of straight line segments [31], object recognition [25], extraction of planar figures [42], [43], vectorization of aerial photographs [46], etc.

The high computational cost of the Hough transform has motivated a considerable research effort to reduce its calculation time. We can identify three directions for accelerating the execution of the Hough transform: the design of fast Hough transform algorithms, the use of parallel architectures, and the design of specific architectures for the Hough transform.

The aim of the fast Hough transform algorithms is to reduce the calculation time and memory requirements. Examples of this approach are the Piecewise Linear (PLHT) [33], Combinatorial (CHT) [6], Binary (BHT) [21], Randomized (RHT) [54] and Fast (FHT) [38], [22] Hough Transforms. The most important problem of these fast algorithms is their low regularity and parallelism [22], whereas the traditional algorithm has a strong implicit parallelism which can be conveniently exploited in shared memory [11] or distributed memory multiprocessors (linear array [20]; mesh [44], [5], [19], [3], [13]; hypercube [7], [45]; binary tree [9]).

The algorithm associated with the Hough transform presents a high level of regularity and locality. These properties make it appropriate for the VLSI design of a specific purpose architecture which permits high speed computation. Systolic solutions along this line are the designs by Chuang and Li [12], Li et al. [39] and Fontoura and Sandler [21].
One of the main problems of the Hough transform is the evaluation of the trigonometric functions implicit in the algorithm (sines and cosines), which are difficult to implement with conventional arithmetic units. Some authors have proposed the use of tables [5], [24], [40] or a formulation with only addition and shift operations, at the cost of a loss of precision in the results [21]. Recently, Timmermann et al. [48] have proposed an architecture based on two general CORDIC processors. The alternative we propose in this work consists of formulating the Hough transform in terms of rotations, which permits the design of a specific purpose CORDIC (COordinate Rotation DIgital Computer) processor.

The CORDIC algorithm was developed by Volder [51] in 1959 for the calculation of rotations and the conversion from rectangular to polar coordinates. The only operations necessary for the implementation of the algorithm are additions, subtractions and shifts. In 1971 Walther [53] generalized the algorithm for operation in three coordinate systems (linear, circular and hyperbolic), extending its applications to the solution of problems using trigonometric, hyperbolic and arithmetic functions. Several VLSI designs based on the CORDIC algorithm have been developed for applications in linear algebra, matrix algebra, digital signal processing and image processing [1], [2], [10], [15], [26], [49], [30]. When the application requires high computation speeds it is necessary to incorporate pipelining and/or redundant arithmetic into the design [32], [34], [35], [41].

In this article we present the design of a specific purpose processor based on the CORDIC algorithm for the calculation of the Hough transform. The CORDIC processor includes the following features: pipelined design and use of redundant arithmetic to increase the speed; use of mixed radix (2 and 4) to reduce the number of microrotations; and use of additional microoperations to reduce the scale factor to one. In this way the output of the CORDIC directly gives the memory address in which the vote for the angle being analyzed is located.

The paper is structured as follows. In section 2 we formulate the Hough transform by means of CORDIC rotations. The design of an application-specific processor for the computation of the transform is presented in section 3. We analyze the implementation of the radix-2 and radix-4 microrotations of the CORDIC algorithm, we determine the size of the ROM memory for the coefficients and we generalize the design to arbitrarily sized images. In section 4 we present a parallel implementation of the Hough transform, starting from a partition of the angle space into several subspaces which can be processed simultaneously. Finally, in section 5 we summarize the main features of the processor and compare it to other designs based on the CORDIC algorithm found in the literature.
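As background for the sections that follow, the shift-and-add rotation performed by CORDIC can be sketched in Python. This is a minimal floating-point sketch of the classic radix-2 rotation mode, not the application-specific radix-4 processor designed in this paper. The function name `cordic_rotate`, the iteration count and the final division by the accumulated gain (which hardware would replace by scaling stages) are assumptions made for illustration; the multiplications by powers of two stand in for hardware shifts.

```python
import math

def cordic_rotate(x, y, theta, n_iter=16):
    """Radix-2 CORDIC in rotation mode (illustrative float model).

    Returns (x*cos(theta) + y*sin(theta), y*cos(theta) - x*sin(theta)).
    Valid for |theta| up to about 1.74 rad, the sum of the elementary angles.
    """
    z = theta          # residual angle still to be rotated
    gain = 1.0         # accumulated scale factor K
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0      # direction of this microrotation
        t = d * 2.0 ** -i                # a shift by i positions in hardware
        x, y = x + t * y, y - t * x      # additions and subtractions only
        z -= d * math.atan(2.0 ** -i)    # elementary angle, stored in a ROM
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Assumption: the scale factor is removed by a plain division here.
    return x / gain, y / gain

# rho for the point (3, 4) and an angle of 30 degrees
rho, _ = cordic_rotate(3.0, 4.0, math.radians(30.0))
```

The first output is the quantity the Hough transform needs for each pixel and angle, which is why a rotator can replace explicit sine and cosine evaluations.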
2.- Hough Transform Processor
We can differentiate three stages in the process of detecting lines by means of the Hough transform: 1) creation of the contour of the image using an edge detector (the Sobel operator, for instance); 2) application of the Hough transform to each point of the image; and 3) voting in the parameter space in order to extract the lines. A fourth stage of edge linking can additionally be included. We concentrate here on the second stage, which is also the one with the highest algorithmic complexity. Without loss of generality we consider an image space of dimension N×N. The normal equation of a line that passes through a point (x,y) is

$$\rho = x\cos\theta + y\sin\theta \tag{1}$$

where ρ is the perpendicular distance from the line to the coordinate origin and θ is the angle between the abscissa axis and the normal to the line. The Hough transform of a line produces a set of curves in the parameter space which cross at a point of coordinates (ρ,θ). Also, equation (1) represents the set of values (ρ,θ) of all the possible straight lines passing through the point (x,y) of the image space. In digital image processing both the image space (image array) and the parameter space (Hough array, H(m,j)) are discretized. Therefore, each illuminated point of the image space "votes" for a set of points of the parameter space, so that collinear points vote for a common point in the parameter space. Detecting the crossing point of the curves produced by the points of a straight line thus translates into finding peaks in the parameter space (Hough space).

Hough transform algorithm
1. Initialize the Hough array to zero.
2. For each pixel (x,y) with gray level equal to one,
a) for x,y=[0,1,...,N-1] compute

$$\rho_j = x\cos\theta_j + y\sin\theta_j \tag{2}$$

where 0≤θj
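The voting loop outlined above can be sketched in Python. This is a minimal sketch under stated assumptions: the angle quantization θj = jπ/n_theta over [0, π), the rounding of ρ to the nearest integer and the offset used to index negative ρ values are all choices made here for illustration, and the names `hough_transform`, `n_theta` and `rho_max` are hypothetical.

```python
import math

def hough_transform(image, n_theta):
    """Accumulate votes H[m][j] for a square binary image given as a list of rows."""
    n = len(image)
    rho_max = math.ceil(math.sqrt(2) * (n - 1))    # |rho| never exceeds this bound
    # Assumed quantization of the angle axis
    thetas = [j * math.pi / n_theta for j in range(n_theta)]
    # Step 1: initialize the Hough array to zero (row m = round(rho) + rho_max)
    H = [[0] * n_theta for _ in range(2 * rho_max + 1)]
    # Step 2: every pixel with gray level one votes once per angle
    for y in range(n):
        for x in range(n):
            if image[y][x] != 1:
                continue
            for j, th in enumerate(thetas):
                rho = x * math.cos(th) + y * math.sin(th)   # equation (2)
                H[round(rho) + rho_max][j] += 1
    return H, thetas, rho_max

# A vertical line x = 3 in an 8x8 image: all eight points share rho = 3 at theta = 0
image = [[1 if x == 3 else 0 for x in range(8)] for _ in range(8)]
H, thetas, rho_max = hough_transform(image, n_theta=4)
peak = max(max(row) for row in H)
```

The eight collinear points pile up in a single cell of the Hough array, which is the peak that the voting stage looks for.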
$$i > \frac{1}{8}\left(n + \log_2 3\right) \tag{13}$$
As an example, for a precision of n=12 bits, the compensation of individual stages by scaling may be introduced for i≥2, that is, in microrotations 2 and 3. The remaining compensation is determined by $\sigma_0$ and $\sigma_1$, and the number of scale factors is reduced to 9.

2) Use the scheme that divides the space into four subspaces, as described by expression (10). The corresponding iterations are

$$\begin{aligned}
x_{i+1} &= x_i + \sigma_i 4^{-i} y_i \\
y_{i+1} &= y_i - \sigma_i 4^{-i} x_i \\
x'_{i+1} &= y'_i + \sigma_i 4^{-i} x'_i \\
y'_{i+1} &= -x'_i + \sigma_i 4^{-i} y'_i
\end{aligned} \tag{14}$$
The division of the space of angles into four subspaces (eqs. (10) and (14)) limits the range of angles to the subset {0,...,π/4}. As seen below, this facilitates the selection of a reduced number of scale factors. Moreover, the division of the space into four subspaces facilitates the use of the elementary angles $\tan^{-1}(\tfrac{1}{2}\sigma_i 4^{-i})$ instead of $\tan^{-1}(\sigma_i 4^{-i})$. This is advantageous because it changes the angles of the first iteration from 45° ($\sigma_0=1$) and 63.4° ($\sigma_0=2$) to 26.5° and 45°, which makes better use of the elementary angles when the range of angles is π/4. Now stage i=n/4 does not influence the scaling factor, so it does not need compensation. This approach is only possible if the angles of the transform range from 0 to π/4, since when the angles range from 0 to π/2 the whole range is not covered by the new elementary angles. The iterations now are:

$$\begin{aligned}
x_{i+1} &= x_i + \tfrac{1}{2}\sigma_i 4^{-i} y_i \\
y_{i+1} &= y_i - \tfrac{1}{2}\sigma_i 4^{-i} x_i \\
x'_{i+1} &= y'_i + \tfrac{1}{2}\sigma_i 4^{-i} x'_i \\
y'_{i+1} &= -x'_i + \tfrac{1}{2}\sigma_i 4^{-i} y'_i
\end{aligned} \tag{15}$$
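The first pair of recurrences in (15) can be exercised numerically. The sketch below is illustrative only: it applies the unprimed x and y updates for a given digit sequence and reports the angle and scale factor that the sequence implies. The name `radix4_rotate` and the example digit word are hypothetical, and the primed recurrences are not modelled.

```python
import math

def radix4_rotate(x, y, digits):
    """Apply x' = x + t*y, y' = y - t*x with t = (1/2) * sigma_i * 4**-i per stage.

    digits holds sigma_i in {-2, -1, 0, 1, 2}, one per radix-4 stage.
    Returns the scaled outputs, the total rotation angle and the scale factor.
    """
    angle, gain = 0.0, 1.0
    for i, sigma in enumerate(digits):
        t = 0.5 * sigma * 4.0 ** -i
        x, y = x + t * y, y - t * x       # one microrotation, shifts and adds only
        angle += math.atan(t)             # elementary angle of this stage
        gain *= math.sqrt(1.0 + t * t)    # its contribution to the scale factor
    return x, y, angle, gain

# Hypothetical six-digit ROM word for one angle of the transform
digits = (1, 0, 2, -1, 1, 2)
x_n, y_n, theta, K = radix4_rotate(5.0, 2.0, digits)
rho = x_n / K    # equals 5*cos(theta) + 2*sin(theta)
```

Because the digits of every angle are known in advance, the angle never has to be tracked at run time; only the digit word needs to be read from the ROM, and the scale factor K is what the scaling stages discussed next must remove.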
Using these elementary angles, equation (13) is modified. Now, compensation of individual stages may be introduced for

$$i > \frac{1}{8}\left(n + \log_2\frac{3}{16}\right) \tag{16}$$
3) Since the angles have several possible decompositions, we can choose the decompositions that minimize the number of scale factors.

We now describe an example of how these approaches reduce the number of scale factors and the number of scaling stages needed for their compensation. For a precision of n=12 bits, the initial number of scale factors, $2\cdot 3^{n/4}$, is 54. Using the elementary angles $\tan^{-1}(\tfrac{1}{2}\sigma_i 4^{-i})$, the compensation of individual stages by scaling may be introduced for i≥2 (see eq. (16)). Within this precision and with these elementary angles, only stages 0, 1 and 2 influence the scaling factor, so only stage 2 needs to be compensated individually (stage 3 does not influence the scaling factor due to the use of the new elementary angles). The compensation for stage 2 can be performed by:

$$\left(\left(\tfrac{1}{2}\sigma_i 4^{-i}\right)^2 + 1\right)^{-1/2} \approx 1 - \tfrac{1}{2}\left(\tfrac{1}{2}\sigma_2 4^{-2}\right)^2 = 1 - \sigma_2^2\, 2^{-11} \tag{17}$$
and the remaining compensation corresponds to 9 scale factors. Table III displays the scaling stages necessary for each of the nine scaling factors. We specify the values of $\sigma_0$ and $\sigma_1$, which determine the scale factor, together with the type of scaling applied and the resulting scale factor. We have not included repetitions because they do not reduce the resulting number of stages.

On the other hand, in order to maintain a full radix-4 CORDIC design and to reduce the number of processors in a parallel implementation, we have to group the angles so that all the angles in the same subset have decompositions producing the same scale factor. In this way, we can choose the angle decompositions which generate the smallest number of scale factors. Table IV displays the angular coverings obtained by each scale factor. Analyzing Table IV and the scaling factors of Table III, we can choose those factors that lead to the minimum number of scaling stages (three): the factors characterized by $\sigma_0\sigma_1 = (00, 01, 10, 11, 12)$. As a consequence, it is possible to select just five scaling factors for all angles; with these five scale factors we cover the entire range of angles. Observe that in reality only four factors are different from unity, because the first one is always unity.

Therefore, the implementation of the Hough transform with the angle parallelization scheme needs only five CORDIC processors. Each processor is assigned the angles producing the same scale factor. With a full radix-4 design, the number of microrotations is n/4 plus the microrotations and scaling stages needed to compensate the assigned scale factor.

However, this kind of solution presents several problems. First, a parallel system with as many processors as different scale factors produces a significantly unbalanced workload among processors, because each processor is assigned a different number of angles.
Examining the angular coverings of the scale factors in Table IV, it is easy to verify that the angle subsets associated with each scale factor are different. Second, the total number of stages (standard radix-4 microrotations and scaling stages) is different in each processor. All processors have the same number of radix-4 microrotations, but the number of scaling stages for the scale factor compensation differs (see Table III). With n=12, the number of scaling stages ranges from 0, for $\sigma_0\sigma_1 = (00)$, to 3, for $\sigma_0\sigma_1 = (10, 11, 12)$. Each processor therefore has a different operation time, which makes the synchronization among processors difficult.

Finally, this solution is not flexible, because the number of processors cannot be modified directly. To increase the parallelism by incorporating more processors, the angles generating a given scale factor must be assigned to several processors, and the new processors must include the scaling stages that compensate that particular scale factor. To implement the Hough transform with fewer than five processors, some processors must perform the compensation of several scale factors. With this solution, the resulting parallel system is inhomogeneous, with processors having different latencies and hardware structures, and is not flexible.
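The scale-factor bookkeeping of this section for n=12 can be checked numerically. The sketch below is an illustration, not part of the design: it assumes that a deviation below $2^{-12}$ is what "does not influence the scaling factor" means at 12-bit precision, and the names `stage_gain`, `N_BITS` and `factors` are hypothetical.

```python
import math
from itertools import product

N_BITS = 12
EPS = 2.0 ** -N_BITS    # assumed threshold for a negligible deviation

def stage_gain(sigma, i):
    """Scale factor of stage i for the elementary angle atan((1/2)*sigma*4**-i)."""
    t = 0.5 * sigma * 4.0 ** -i
    return math.sqrt(1.0 + t * t)

# Initial number of scale factors quoted in the text: 2 * 3**(n/4)
initial_count = 2 * 3 ** (N_BITS // 4)

# Stage 3: even the largest digit leaves 1/gain within EPS of one
stage3_dev = 1.0 - 1.0 / stage_gain(2, 3)

# Stage 2: residual error after multiplying by 1 - sigma^2 * 2^-11, as in eq. (17)
stage2_err = max(
    abs((1.0 - s * s * 2.0 ** -11) * stage_gain(s, 2) - 1.0) for s in (0, 1, 2)
)

# Remaining scale factors, indexed by the magnitudes of sigma_0 and sigma_1
factors = {
    (s0, s1): stage_gain(s0, 0) * stage_gain(s1, 1)
    for s0, s1 in product((0, 1, 2), repeat=2)
}
```

Under these assumptions the nine entries of `factors` are all distinct, the entry for (0, 0) is exactly one, and both `stage3_dev` and `stage2_err` fall below the 12-bit threshold, which matches the counts given in the text.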
Splitting of the angle space
The solution we consider most adequate consists of using identical radix-4 CORDIC processors. The processors must incorporate the scaling stages necessary to compensate five different scale factors. In a parallel system with m processors, the set of angles of the Hough transform is split into m disjoint subsets of $N/4m$ angles each, so that each processor is associated with a different subset. In this case, the computation of the Hough transform of an N×N image requires $(N^3/4m)$ + latency cycles, since for each pixel of the image equations (15) are evaluated sequentially for $N/4m$ angles. No voting conflict exists because, as the processors are assigned different angles, voting is performed over different elements of the Hough space. Increasing the parallelism of the system only requires incorporating more processors. The only difference among processors is in the ROM memory where the coefficients $\sigma_i$ are stored.

With five processors, the difference from the previous solution (assigning the angles that generate the same scale factor to one processor) is that all processors now have the same number of stages, which is larger than the number of stages of the processors in the previous solution. In exchange, the problems of unbalanced workload and different operation times are avoided. The flexibility and high regularity of this solution make its VLSI implementation easy.

Considering 12-bit precision, the resulting radix-4 CORDIC processor has ten stages: six standard radix-4 iteration stages, one for the compensation of the scaling factor of stage i=2 and three for performing the scaling. The standard radix-4 iteration stages follow equations (15) and need 3 bits to code $\sigma_i$ $(\sigma_{i1},\sigma_{i2},\sigma_{i3})$, which implies 18 bits in the ROM for each angle.
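The angle splitting and the cycle count given above can be sketched as follows. The contiguous assignment of angle indices to processors is an assumption (the text only requires disjoint subsets), m is assumed to divide N/4 exactly, and the latency value in the example is a placeholder; `split_angles` and `hough_cycles` are hypothetical names.

```python
def split_angles(n_angles, m):
    """Split angle indices 0..n_angles-1 into m disjoint, equally sized subsets."""
    size = n_angles // m           # assumes m divides n_angles exactly
    return [range(p * size, (p + 1) * size) for p in range(m)]

def hough_cycles(N, m, latency):
    """Cycle count N^3/(4m) + latency for an N x N image on m processors."""
    return N ** 3 // (4 * m) + latency

# N = 512 and m = 4 processors: N/4 = 128 angles in total, 32 per processor
subsets = split_angles(512 // 4, 4)
cycles = hough_cycles(512, 4, latency=10)   # latency of 10 is only a placeholder
```

Since the subsets do not overlap, two processors never vote for the same column of the Hough array, and doubling m halves the dominant term of the cycle count.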
The compensation of the scale factor of stage i=2 has the form of a scaling stage, multiplying each coordinate by $(1-|\sigma_2|^2 2^{-11})$ (see equation (17)). This compensation does not add any bits to the ROM because it uses the same coefficient $\sigma_2$ as the stage. The scaling needed to compensate the four selected scaling factors appears in Table III. Each of the three scaling stages can be characterized by just 2 bits, $\sigma_i=(\sigma_{i1},\sigma_{i2})$ (see the right side of Table III). This adds 6 bits, resulting in a total of 24 bits per angle.

Figures 4.a and 4.b display the general design of a radix-4 stage and of a scaling stage, respectively. We have specified three control signals (c1, c2, c3). Table V establishes the correspondence between the control signals and the coding of the coefficients in each stage. It also shows the shifts (A and B in Figure 4) for each of the stages. c3 determines the type of operation (addition or subtraction); c1 and c2 select the shift to be applied. The scaling is always negative for the stages identified by i=6,7 in the table; we code this fact by setting c3=1 (subtraction). In the last scaling stage (i=8) the control signals c2 and c3 have the same value in order to obtain a positive scaling. The right side of Table III reflects a possible coding (not the only one) for the coefficients $\sigma_{ij}$, with i=6,7,8 and j=1,2.

It is possible to reduce the latency if we carry out the compensation of the scale factor in parallel using the double rotation method proposed in [52]. This method works in radix 2 and mixed radix and does not need additional scaling iterations. Basically, to rotate by an angle θ we carry out two parallel rotations: a rotation by θ+β and a rotation by θ−β, where $\beta=\cos^{-1}(K^{-1})$. Finally, we make two parallel sums with the values obtained by both rotators.
To apply this method to a full radix-4 CORDIC, we can add to each angle in the ROM a code identifying the scale factor involved. The total number of iterations can then be reduced to n/4+2, at the cost of doubling the hardware. Finally, it is possible to save 50% of the hardware if we multiplex the inputs to the processor, given the high level of symmetry and independence found in Figure 4. Obviously, the number of angles processed per unit time will then be halved.
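The identity behind the double rotation method mentioned above can be illustrated with an idealised model. In this sketch each rotator is modelled as an exact rotation followed by an uncompensated gain K, which is a simplification of a real CORDIC datapath; the halving of the two sums is a normalisation chosen here (a one-bit shift in hardware), and the names `scaled_rotate` and `double_rotation` are hypothetical.

```python
import math

def scaled_rotate(x, y, angle, K):
    """Idealised rotator: exact rotation by angle, output scaled by the gain K."""
    c, s = math.cos(angle), math.sin(angle)
    return K * (x * c + y * s), K * (y * c - x * s)

def double_rotation(x, y, theta, K):
    """Rotate by theta with the gain K cancelled by two parallel rotations."""
    beta = math.acos(1.0 / K)                       # beta = arccos(1/K), K >= 1
    xa, ya = scaled_rotate(x, y, theta + beta, K)   # first rotator
    xb, yb = scaled_rotate(x, y, theta - beta, K)   # second rotator
    # The two sums carry a factor 2*K*cos(beta) = 2; halving removes it
    return 0.5 * (xa + xb), 0.5 * (ya + yb)

K = math.sqrt(1.25)      # example gain, the stage-0 factor for sigma_0 = 1
xr, yr = double_rotation(3.0, 4.0, 0.3, K)
```

The sum of the two rotations equals the desired rotation multiplied by 2K cos β, and the choice of β makes K cos β equal to one, so no scaling iterations are needed.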
5.- Comparison and conclusions
In this paper we have presented the design of parallel architectures for the computation of the Hough transform based on application-specific CORDIC processors. Although this transform has a significant and natural parallelism, only those parallel architectures which maintain independent Hough subspaces are of interest, because their voting process is free of conflicts. The parallel architecture we propose is based on the division of the Hough space into independent angle subspaces, with each subspace assigned to a different processor. In this way we obtain a parallel system without conflicts in the voting process that is highly regular, because it is based only on CORDIC rotators, and that maintains the Duda and Hart parametrization (equations (4) and (10)). This regularity simplifies the design and facilitates the VLSI integration. The application-specific CORDIC processors have the following characteristics: pipelining, redundant arithmetic, compensation of the scale factor, decomposition of the angles and use of full radix 4 or mixed radix.

We now concentrate on the evaluation of the parallel system we propose, comparing it with other solutions that have appeared recently in the literature [39], [48], [49].

Li et al. [39] present a systolic architecture that processes one row (or column) of pixels concurrently. By reformulating equation (1), the parameter ρ depends only on cos θ (not on sin θ). They use a table to store the cosines of the angles. In this way, for a given value of θ, ρ can be computed by simple addition. The basic systolic architecture consists of a linear array of N compute cells, where ρ is obtained, a routing network and a linear array of 2N accumulator cells. The routing network, composed of a 2N×N array of routing cells, sends the pixels with a given ρ to the corresponding accumulator. The systolic array processor computes the Hough transform for a particular value of θ.
The hardware complexity of the systolic array processor grows linearly with the size N of the image, so when N is large the complexity of the systolic array is very high. In this case, it is necessary to partition the image into blocks, where each block is processed in a different array. Moreover, the computation of the Hough transform with this systolic array requires the processing of all the pixels of the image, whereas in the solution proposed in this paper only pixels with gray level equal to one are processed.

Timmermann et al. [48] propose another CORDIC-based processor design. In this solution, equation (1) is reformulated as

$$\rho = (x^2+y^2)^{1/2}\,\sin\!\left(\theta \pm \tan^{-1}(x/y)\right) \tag{18}$$
with $(x^2+y^2)^{1/2}$ being the amplitude of sinusoidal curves with phase shifts of $\tan^{-1}(x/y)$ and 0≤θ

[Table IV. Angular coverings obtained by each scale factor $\sigma_0\sigma_1$ (surviving column labels: 01, 02, 10, 11, 12, 20, 21, 22). Angle intervals, in degrees: ... to 2.373, 2.373 to 4.833, 4.833 to 7.822, 7.822 to 9.316, 9.316 to 11.953, 11.953 to 14.677, 14.677 to 17.314, 17.314 to 18.896, 18.896 to 21.796, 21.796 to 24.257, 24.257 to 26.191, 26.191 to 28.916, 28.918 to 31.376, 31.376 to 35.771, 35.771 to 35.859, 35.859 to 40.253, 40.253 to 42.714, 42.714 to 45.000. The coverage marks of the individual cells are not recoverable.]
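| STAGE | c1 | c2 | c3 | A | B |
|---|---|---|---|---|---|
| i=0,1,...,5 | $\sigma_{i1}$ | $\sigma_{i2}$ | $\sigma_{i3}$ | $2^{-2i}$ | $2^{-2i+1}$ |
| Compensation i=2 | $\sigma_{21}$ | $\sigma_{22}$ | $\sigma_{23}$ | $2^{-9}$ | $2^{-11}$ |
| Scaling i=6 | $\sigma_{61}$ | $\sigma_{62}$ | 1 | $2^{-3}$ | $2^{-4}$ |
| Scaling i=7 | $\sigma_{71}$ | $\sigma_{72}$ | 1 | $2^{-5}$ | $2^{-10}$ |
| Scaling i=8 | $\sigma_{81}$ | $\sigma_{82}$ | $\sigma_{82}$ | $2^{-6}$ | $2^{-7}$ |

Table V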
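| Image size n | Radix-2 stages | Mixed radix: radix-2 stages | Mixed radix: radix-4 stages | Mixed radix: total stages | % reduction in stages |
|---|---|---|---|---|---|
| 14 | 16 | 8 | 4 | 12 | 25 % |
| 13 | 15 | 9 | 3 | 12 | 20 % |
| 12 | 14 | 8 | 3 | 11 | 21.5 % |
| 11 | 13 | 7 | 3 | 10 | 23 % |
| 10 | 12 | 6 | 3 | 9 | 25 % |
| 9 | 11 | 7 | 2 | 9 | 18 % |
| 8 | 10 | 6 | 2 | 8 | 20 % |
| 7 | 9 | 5 | 2 | 7 | 22.2 % |

Table VI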
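The stage counts of Table VI can be cross-checked with a few lines of Python. The sketch reads the two mixed-radix sub-columns as radix-2 and radix-4 stage counts, in that order; this reading is an interpretation, chosen because it is the one under which each radix-4 stage replaces two radix-2 stages in every row. The names `TABLE_VI` and `summarize` are hypothetical.

```python
# (n, radix-2 stages, mixed radix-2 stages, mixed radix-4 stages, quoted % reduction)
TABLE_VI = [
    (14, 16, 8, 4, 25.0),
    (13, 15, 9, 3, 20.0),
    (12, 14, 8, 3, 21.5),
    (11, 13, 7, 3, 23.0),
    (10, 12, 6, 3, 25.0),
    (9, 11, 7, 2, 18.0),
    (8, 10, 6, 2, 20.0),
    (7, 9, 5, 2, 22.2),
]

def summarize(row):
    """Recompute the total stage count and the percentage reduction of one row."""
    n, full_r2, mixed_r2, mixed_r4, _quoted = row
    total = mixed_r2 + mixed_r4
    reduction = 100.0 * (full_r2 - total) / full_r2
    # Assumed relation: one radix-4 stage stands in for two radix-2 stages
    same_coverage = (mixed_r2 + 2 * mixed_r4 == full_r2)
    return n, total, round(reduction, 1), same_coverage

summary = [summarize(row) for row in TABLE_VI]
```

Under this reading the recomputed totals match the tabulated ones and the recomputed reductions agree with the quoted percentages to within rounding.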
Figure 1
Figure 2
Figure 3
Figure 4