V4b.6

PARALLEL PROCESSING ARCHITECTURES FOR ITERATIVE IMAGE RESTORATION

A. K. Katsaggelos, S. P. R. Kumar and M. Sarrafzadeh
Northwestern University
Department of Electrical Engineering and Computer Science
The Technological Institute, Evanston, Illinois 60208
ABSTRACT

In this paper Mesh and Mesh of Pyramids implementations of iterative image restoration algorithms are proposed. These implementations are based on a single-step, as well as on a multistep, iterative algorithm derived from the single-step regularized iterative restoration algorithm. One processor is assigned to each picture element, with local memory depending on the support of the restoration filter. The implementations consist of interprocessor communication and intraprocessor computations. The efficiency of the proposed VLSI algorithm is judged by establishing lower bounds on AT^2, where A is the area of the VLSI chip and T is its computation time.
I. INTRODUCTION

The recovery or restoration of an image that has been distorted is one of the most important problems in image processing applications [1]. Iterative restoration algorithms will be used in this work, due to certain advantages they offer over other existing techniques [2,3]. Iterative restoration tasks are generally computationally extensive and time consuming, as is the case with most image processing tasks. There has been a natural interest in improving the response times of image processors to extend the horizon of their applicability. While early research in this direction focused on exploiting the structure of the computation on a single processor (e.g., the FFT algorithm), enhancing the speed by employing multiprocessors is currently of intense interest. Several image processing systems with multiprocessors, such as STARAN (a general purpose system employing an interconnection network, see [4,5]), have already been implemented with some success. The recent technological revolution, represented by very-large-scale integration (VLSI), has generated considerable interest in hardware implementation of complex operations (e.g., see [6] for applications in signal/picture processing).

In general, algorithm design is the development of better procedures to reduce the time to solve a given problem on a given computing system. Exploitation of a multiprocessor system requires a radical departure from the traditional von Neumann environment. Detection of parallelism in sequential programs is essential to the discipline. In VLSI there is a group of processors, each with a local memory, cooperating to solve a given problem. The new challenge is to exploit properties of VLSI to build effective and efficient computing structures. The fundamental criteria of optimality are A, the area of the VLSI chip, and T, its computation time. The aim is to design architectures that use these two resources in an optimal manner.

In this paper we propose Mesh and Mesh of Pyramids VLSI implementations of iterative image restoration algorithms. As will be discussed in the following sections, most of the theoretical analysis (issues such as convergence) of iterative picture processing algorithms has, in the past, presumed a single or a central processor, i.e., all the computations necessary in an algorithm step are carried out by this single processor. In many instances, it is possible to implement the centralized algorithm by a system of parallel processors, without
changing the mathematical structure of the algorithm (i.e., a sequence of steps in the parallel processor system can be identified with a single step of the centralized algorithm). However, such a straightforward implementation may not be (and generally is not) efficient. An efficient implementation may necessitate altering the mathematical structure of the algorithm. We derive and implement a multistep iterative image restoration algorithm, which is characterized by localized data transactions, an important feature for VLSI implementation.

This paper is organized in the following manner. In Sec. II the form and the properties of the first-order iterative algorithms are described. The VLSI implementations to be considered in this work are presented in Sec. III. Finally, in Sec. IV conclusions and current research directions are described.

The work of A. K. Katsaggelos was supported in part by the National Science Foundation under Grant No. MIP-8614217.

II. ITERATIVE RESTORATION ALGORITHMS
An appropriate mathematical model of the image distorting process is the following [1]:

    y = D z + u                                                    (1)

where the vectors y, z and u represent, respectively, the lexicographically ordered blurred and original images and the additive noise. The matrix D represents the space-invariant or space-varying deterministic distortion. The signal restoration problem is then to invert Eq. (1), or to find an image as close as possible to the original one subject to a suitable optimality criterion, given y and D. We follow a regularization approach in solving the image restoration problem [3]. This results in obtaining a restored image by solving the following set of linear equations:
    (D^T D + α C^T C) z = D^T y                                    (2)

or

    A z = g                                                        (3)
where T denotes the transpose of a vector or matrix and α, the regularization parameter, is inversely proportional to the signal-to-noise ratio (SNR). The matrix C represents a high-pass filter which is chosen in such a way that the energy of the restored image at high frequencies (due primarily to noise amplification) is bounded [3].
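As a concrete illustration of the formulation above (not part of the original development), the following sketch builds the system of Eqs. (2)-(3) for a small one-dimensional toy signal with circulant operators; the kernel values, signal length, noise level and regularization parameter are arbitrary choices made only for demonstration.

```python
import numpy as np

# Hypothetical toy setup: form A = D^T D + alpha C^T C and g = D^T y
# of Eqs. (2)-(3) for a short 1D signal with circulant blur and constraint.

def circulant(first_row):
    """Circulant matrix whose k-th row is the first row cyclically shifted by k."""
    n = len(first_row)
    return np.array([np.roll(first_row, k) for k in range(n)])

n = 8
d = np.zeros(n); d[[0, 1, -1]] = [0.5, 0.25, 0.25]   # symmetric 3-tap blur kernel
c = np.zeros(n); c[[0, 1, -1]] = [2.0, -1.0, -1.0]   # discrete Laplacian (high-pass)

D = circulant(d)
C = circulant(c)
alpha = 0.01                                 # regularization parameter (SNR-dependent)

z_true = np.random.rand(n)                   # unknown original signal
y = D @ z_true + 0.01 * np.random.randn(n)   # blurred, noisy observation, Eq. (1)

A = D.T @ D + alpha * C.T @ C                # left-hand side of Eq. (2)
g = D.T @ y                                  # right-hand side of Eq. (2)
z_restored = np.linalg.solve(A, g)           # direct solution of A z = g, Eq. (3)
```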
A. Single Step Iteration

The following iteration

    z_0 = β A^T g                                                  (4a)

    z_{k+1} = (I - β A^T A) z_k + β A^T g = W z_k + b              (4b)

converges to the minimum norm least squares solution
(m.n.l.s.s.) z^+ of Eq. (3), defined by z^+ = A^+ g, where A^+ is the generalized inverse of A, for 0 < β < 2/‖A‖^2. Algorithm (4) exhibits a linear rate of convergence, since it can be shown that [7]
    ‖z_{k+1} - z^+‖ ≤ c ‖z_k - z^+‖                                (5)

where

    c = max{ |1 - β‖A‖^2| , |1 - β‖A^+‖^{-2}| }                    (6)
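The convergence condition and the contraction factor of Eqs. (5)-(6) can be checked numerically. The sketch below is illustrative only: the matrix A is a random symmetric positive definite stand-in (not the restoration operator above), and β is an arbitrary admissible choice inside 0 < β < 2/‖A‖^2.

```python
import numpy as np

# Illustrative check of the convergence condition (stand-in A, hypothetical size).
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M.T @ M + 0.01 * np.eye(8)            # symmetric positive definite stand-in

sigma = np.linalg.svd(A, compute_uv=False)
sigma_max, sigma_min = sigma.max(), sigma.min()

beta = 1.0 / sigma_max**2                 # satisfies 0 < beta < 2 / ||A||^2
c = max(abs(1.0 - beta * sigma_max**2),   # contraction factor of Eq. (6)
        abs(1.0 - beta * sigma_min**2))
print(f"beta = {beta:.4g}, c = {c:.4g}")  # c < 1 implies linear convergence
```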
The pointwise version of iteration (4) may be useful in considering different ways of implementing it in VLSI. When the distortion D and the constraint C in Eq. (2) are space invariant, A in Eq. (3) is a circulant matrix and it can be characterized by the impulse response a(i,j). Then, the pointwise version of iteration (4) is given by

    z_0(i,j) = β a(-i,-j) ** g(i,j)

    z_{k+1}(i,j) = z_k(i,j) + β a(-i,-j) ** [ g(i,j) - a(i,j) ** z_k(i,j) ]
                 = [ δ(i,j) - β a(-i,-j) ** a(i,j) ] ** z_k(i,j) + β a(-i,-j) ** g(i,j)
                 = w(i,j) ** z_k(i,j) + b(i,j)                     (7)

where ** denotes the two-dimensional (2D) discrete convolution and δ(i,j) is the 2D impulse function.
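A direct software transcription of the pointwise iteration (7) can be written with standard 2D convolution routines. The sketch below assumes the space-invariant (circulant) case, so circular boundary handling is used; the function names, the number of iterations, and the centered odd-sized kernel are assumptions made here for illustration, not a hardware mapping from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def cconv2(x, h):
    """2D convolution with circular (wrap-around) boundaries, as in the circulant case."""
    return convolve2d(x, h, mode="same", boundary="wrap")

def restore_single_step(g, a, beta, num_iter=100):
    """Sketch of the pointwise single-step iteration of Eq. (7).

    g : right-hand-side image of A z = g (i.e., g = D^T y)
    a : impulse response a(i,j) of the circulant matrix A,
        stored as an odd-sized kernel centered at the origin
    """
    a_flip = a[::-1, ::-1]                    # a(-i,-j) for a centered kernel
    z = beta * cconv2(g, a_flip)              # z_0(i,j)
    for _ in range(num_iter):
        residual = g - cconv2(z, a)           # g(i,j) - a(i,j) ** z_k(i,j)
        z = z + beta * cconv2(residual, a_flip)
    return z
```

Note that when A = D^T D + α C^T C with symmetric blur and constraint kernels, A is symmetric and a(i,j) = a(-i,-j), so the flipped copy coincides with a itself.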
B. Multistep Iteration

Some of the useful properties of an algorithm for its VLSI implementation are regularity, recursiveness and localized data transactions. The iterative algorithms presented in Sec. II.A require operations in a neighborhood of a pixel. For example, according to Eq. (7), in generating the restored value of the (i,j)-th pixel at the (k+1)-st iteration step, the restored image values from the previous iteration step, in a neighborhood of (i,j) whose size depends on the support of the impulse response w(i,j), are required. However, an efficient implementation of an algorithm may require that the communication time is minimized. This results in reducing the size of the neighborhood to its minimum, a 3x3 template. In achieving this we have proposed [8,9] a modification of iteration (7). More specifically, a multistep iteration has been derived from (7), according to the following procedure. The impulse response w(i,j) in (7) is additively decomposed as

    w(i,j) = w_1(i,j) + w_2(i,j) + ... + w_{2L}(i,j)               (8)
where the functions w_l(i,j), l = 1, ..., 2L, are depicted in Fig. 1. Then iteration (4) takes the form

    z_0 = β A^T g

    z_k = W_1 z_{k-1} + W_2 z_{k-2} + ... + W_{2P} z_{k-2P} + ... + W_{2L} z_{k-2L} + b      (9)
where the sequences w_1(i,j), ..., w_{2L}(i,j) are used in forming the matrices W_1, ..., W_{2L}, respectively. We have assumed, without loss of generality, that the matrix W in Eq. (4b) has support (2L+1)x(2P+1) pixels, where L ≥ P. The convergence of (9) was studied in [9]. It was found that the convergence proofs available for the single-step iterative algorithm do not directly carry over to the multistep case.
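A software analogue of the multistep update (9) can be sketched as follows. The sketch is schematic only: the particular small-support decomposition w_1, ..., w_{2L} of Fig. 1 is not reproduced here and is taken as given, the earlier iterates needed at start-up are initialized to z_0 (an assumption, not specified above), and the circular-convolution helper mirrors the single-step sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def cconv2(x, h):
    """2D convolution with circular boundaries (space-invariant case)."""
    return convolve2d(x, h, mode="same", boundary="wrap")

def restore_multistep(b, w_terms, num_iter=100):
    """Schematic sketch of the multistep iteration of Eq. (9).

    b       : bias image b(i,j) = beta * a(-i,-j) ** g(i,j)
    w_terms : list [w_1, ..., w_2L] of small-support kernels summing to w(i,j)
              (a decomposition as in Fig. 1 is assumed to be supplied)
    """
    z0 = b.copy()                              # z_0 = beta A^T g in pointwise form
    history = [z0] * len(w_terms)              # history[l-1] holds z_{k-l}
    for _ in range(num_iter):
        z = b.copy()
        for w_l, z_prev in zip(w_terms, history):
            z = z + cconv2(z_prev, w_l)        # accumulate W_l z_{k-l}
        history = [z] + history[:-1]           # newest iterate first
    return history[0]
```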
III. VLSI IMPLEMENTATIONS

A. VLSI Model of Computation

In this section, we first review the VLSI model of computation and discuss the computational limits of VLSI. We will implement our iterative image restoration algorithms on a Mesh and a Mesh of Pyramids. Meshes and Pyramids have proven effective for a number of problems in digital signal processing; however, their combination has not been studied.
We briefly review the synchronous model of VLSI computation [10,11,12]. A computational problem Π is a Boolean mapping from a set of input variables to a set of output variables. The mapping embodied by Π is realized by a Boolean machine described as a computation graph, G = (V, E), whose vertices V are information processing devices or input/output ports and whose edges E are wires. A VLSI chip is a two-dimensional embedding of this computation graph according to the prescriptions of the model. The model is characterized by a collection of rules concerning layout, timing, and input/output (I/O) protocol; in addition, the model restricts the class of computation graphs to those having bounded fan-in and fan-out. The layout rules are the following: (1) wires (edges) have minimum width λ and at most ν wires (ν ≥ 2) can overlap at any point; (2) nodes have minimum area cλ^2, for some c ≥ 1. No loss of generality is incurred if the layout is restricted to be an embedding of the computation graph in a uniform grid, typically the square grid: the latter is the plane grid, the vertices of which have integer coordinates (layout grid). The timing rules specify that both gate switching and wire propagation of a bit take a fixed time τ_0 (hereafter assumed equal to 1), irrespective of wire length (synchronous system). In addition, the I/O protocol is semelective (each input is received exactly once), unilocal (each input is received at exactly one input port), and time- and place-determinate (each I/O variable is available in a prespecified sequence at a prespecified port, for all instances of the problem).

Two other types of I/O protocol constraints appear in this paper: the word-local assumption and the word-serial assumption. An I/O protocol is word-local if, for any cut partitioning the chip, o(N) input (output) words have some bit entering (exiting) the chip on each side of the cut [14]. This constraint is used in the derivation of the AT^2 lower bound and is adhered to in the construction of the upper bounds (designs). An I/O protocol is word-serial if, at any time instant, o(N) input (output) words have some, but not all, of their bits read (written). This constraint is used in the derivation of the A lower bound and is adhered to in the construction of the minimal-area circuit.
B. Lower Bound

In the VLSI model of computation as formulated in [10,11,12], the fundamental complexity measures are A, the area of the VLSI chip, and T, its computation time. VLSI computation theory addresses the problem of designing algorithms (and the corresponding architectures) that use these two resources in an optimal manner. In order to judge the efficiency of a VLSI algorithm, it is useful to establish lower bounds on area, time, or various functions that capture an area-time tradeoff (e.g., AT^2). Standard techniques exist for proving lower bounds on T and AT^2; they are based on fan-in arguments (in the case of T) and on information-flow arguments (in the case of AT^2) [10,14].
To establish a lower bound on bisection flow for a problem there are two ways to proceed. The traditional approach is to start essentially from scratch, without taking advantage of previously derived lower bounds. A different approach is to utilize facts already known about another problem and show, by means of problem transformation, that the problem under consideration is at least as hard as that problem. Thompson [10] established a now widely-used technique for obtaining area-time lower bounds by quantifying the information exchange required to solve the problem Π. This quantity, denoted by I, is defined as the minimum number of bits that two processors must exchange in order to solve Π, when
exactly half of the input variables of Π are available to each processor at the beginning of the computation. Thompson showed that the area-time complexity of a problem with information exchange I satisfies the bound AT^2 = Ω(I^2) [10]. With a suitable change in I/O protocol semantics [13], information-exchange arguments also give lower bounds on area, namely, A = Ω(I) [15]. Assuming that the image size is N and that the support of the restoration filter is Q, Eq. (7) computes a transitive function of order O(NQ); therefore, the information flow is lower bounded by NQ. We conclude:

Theorem: Any chip that computes a single-step iterative image restoration algorithm must satisfy A = Ω(NQ) and AT^2 = Ω(N^2 Q^2).
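The reasoning can be summarized by the following chain, which merely combines the quoted bounds with the information flow I ≥ NQ of the single-step algorithm (N is the image size, Q the filter support, and the two bounds hold under their respective I/O protocol assumptions):

```latex
% I >= NQ (transitive-function argument for Eq. (7)), hence
\[
  AT^2 \;=\; \Omega(I^2) \;=\; \Omega(N^2 Q^2),
  \qquad
  A \;=\; \Omega(I) \;=\; \Omega(NQ).
\]
```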
C. Mesh Implementation

This processor organization falls into the class of array processor architectures that have been extensively proposed and implemented for image processing tasks [5]. The processors are organized as a two-dimensional array. For convenience we assume that there is one processor per pixel. We implement the pointwise version of Eqs. (4) and (9). The size of the template was assumed to be (4L+1)x(4P+1). For ease of exposition, we assume L = P, and (4P+1)^2 = Q. Processor (i,j), corresponding to pixel (i,j), has a memory location for b(i,j) and two memory matrices in its local memory, denoted by M_z^(i,j) and M_w^(i,j). These matrices contain, respectively, the restored image values in the (4P+1)x(4P+1) neighborhood of (i,j) and the weights relevant for the computation of z(i,j). The matrix M_w^(i,j) contains the weights w(l,m), where -2P ≤ l ≤ 2P and -2L ≤ m ≤ 2L; memory locations M_z^(i,j)(l,m), where -2P