SYSTEMATIC METHODOLOGY OF MAPPING SIGNAL PROCESSING ALGORITHMS INTO ARRAYS OF PROCESSORS J. Tasič, U. Burnik Faculty for Electrical and Computer Engineering University of Ljubljana Ljubljana, Slovenia
Abstract
Given the rapid growth of microelectronics technology, high speed signal processing has become the only viable option in modern communication systems. This high speed, real time signal processing depends critically both on parallel algorithms and on parallel processor technology. Special purpose array processor structures will become a realistic option for high speed signal processing in the next few years. Most signal processing algorithms have built-in recursivity and regularity and, in the case of parallel arrays, local connectivity. The availability of such low cost devices, in which parallel-pipelined computing represents the main limitation on application, has opened up a new possibility for the implementation of advanced and sophisticated algorithms and VLSI structures. The main advantage of VLSI array processors is the increased throughput rate required by present-day real time signal processing applications. The answer to this challenge lies in the development of techniques for algorithm analysis, dependency analysis, techniques for mapping algorithms onto massively parallel array structures, appropriate VLSI design techniques for systolic arrays, partitioning techniques and Digital Signal Processing (DSP) applications. The application domain of such array processors covers different areas, including digital filtering, spectrum estimation, adaptive signal processing, image processing, medical image processing, seismic processing, etc. In this paper we present an overview of the basic mapping procedure with some applications from the area of 1D and 2D signal processing. After a short introduction to systolic processor arrays, the importance of high-speed linear algebra computations is highlighted from the perspective of digital signal processing problems. Section 3 contains the main results of the paper (a systematic methodology for transforming an algorithm into a systolic array architecture); first a technique for mapping sequential algorithms onto suitable array architectures is proposed, through index set extension and elimination of data broadcast (affine dependencies), then a procedure is proposed to map the parallelised algorithm into a VLSI systolic array architecture. Further in the section, the proposed method is applied to some basic linear algebra algorithms applicable in DSP. Finally, in section 4, DSP implementations are presented, including simulation results of a notch filter and of a two-dimensional FIR adaptive filter based on the LS SVD algorithm.
1 An Overview and History of the Problem
The systolic array idea was introduced by H. T. Kung and C. E. Leiserson [1], who defined such an array as a network of processors with rhythmical data computation and propagation through the system. In systolic arrays data is pumped from cell to cell through the array, and the required computations are performed concurrently in the cells. Jose Fortes et al. in their article [2] systematically analyze different approaches to the procedure of transforming an algorithm represented in high-level constructs into a systolic architecture. They grouped all known transformation procedures into four classes:
• direct mapping from the algorithm-representation level to the systolic architecture,
• mapping from the algorithm representation over an algorithm model into hardware,
• mapping of previously designed architectures into a new architecture,
• symbolic transformations and regular transformations.
All presently known methods can be found in the second class, such as the H. T. Kung method, the Moldovan and Fortes method, Miranker and Winkler's method, the S. Y. Kung method, the Quinton method, Cappello and Steiglitz's method, etc. We will describe a method, developed in cooperation between the University of Ljubljana and Loughborough University of Technology [6], that can be classified in the same second class as the Moldovan and Fortes method and Cappello and Steiglitz's method [3]. What is characteristic of the first one is that the programs include loops and the index of computation is related to the index of the loop. Also, a global mapping exists into two subspaces: one is the time subspace and the other the space subspace. The latter defines the connections in the mesh structure and the direction of data movement. In the Cappello and Steiglitz method each index corresponds to one axis of the geometric space. Each point in this space corresponds to a processor with simple computational operations. Very popular among researchers is S. Y. Kung's algorithm [4], in which the algorithm is presented by a Signal Flow Graph (SFG). After some operations the resulting Signal Flow Graph with operation and delay modules maps straightforwardly onto the systolic array. The effectiveness of algorithmic parallel arrays depends upon an appropriate understanding and specification of the problem, mathematical analysis, algorithm understanding and analysis, parallelism analysis and the mapping of the algorithms onto the array of processors (systolic arrays). Most modern DSP applications have been influenced by linear algebra algorithms. In sequential algorithms the complexity of the algorithm depends on the required computation and storage capacity. The complexity analysis of parallel algorithms includes another important parameter, the required communication. Therefore in massively parallel computation the most important factors are computation, communication and memory. In systolic array design, certain constraints should be analyzed and incorporated, such as local communication paths, limited input-output communications, time consumption and data synchronization. Some other properties of repetitive structures could also be incorporated, such as modularity, partitioning, flexibility and reconfigurability. For the efficient use of all processors in the systolic array structure, these processors should have a balanced distribution of computation and data. Current massively parallel computers are proposed for special applications. Systolic and wavefront arrays are characterised by pipelining, data concurrency, and data processing.
Wavefront arrays use data-driven processing capabilities, while systolic arrays use globally synchronized local instruction codes. Systolic arrays can also be classified into semisystolic arrays, with global and local data communications, and pure systolic arrays, with local data communications only. Local communication lines in VLSI technology influence the delays between two processing elements, because the line capacitance influences the transmission speed of the data. In the case of systolic arrays, the data communication links become a great problem. Therefore in large VLSI designs of massively parallel arrays, localised communication is required in order to reduce the interdependence and delays arising from communications. These local constraints prevent the utilization of centralized control and global synchronization [4]. The solution to this problem lies in an asynchronous mode of operation of the systolic arrays. Such a method is data driven, and such arrays are known as wavefront arrays [5]. The restrictions mentioned and the finite number of processing elements confine us to a special class of applications, where recursion and local dependency play a very important role. These restrictions influence the generality of the procedure [6]. Several parameters characterise processor arrays, such as the size of the algorithm, denoted by n, the number of processors used in the processor array, p, the time used for the algorithm execution on the processor array, T_p, the speedup of the algorithm execution achieved by the processor array, S_p, and the efficiency of the processor array, E_p, which will be explained in section 3 [6].
1.1 Systolic Array Algorithms
In this paper we deal with array algorithms implemented on VLSI arrays. After identification of the tasks and possible VLSI architectures, new algorithms with a high degree of parallelism and regularity, and with low communication overheads, have to be developed [6]. An array algorithm is a set of rules for solving a problem in a finite number of steps on a multiple number of locally connected processors. Array algorithms are characterized by synchronicity, concurrency control, granularity and communication geometry. A tool for the design of systolic algorithms has been proposed by Leiserson and Saxe [7]. It defines a special class of algorithms that are recursive and locally dependent. The great majority of digital signal processing algorithms possess such properties. A typical class are matrix-based algorithms. Major computational needs in signal processing and applied mathematical problems can be reduced to a basic set of matrix operations and other related algorithms [8]. All these algorithms involve repeated application of relatively simple operations, with regular structure and local interconnections of the computing network. This leads to the computational wavefront [5]. The recursive nature of the algorithm and the local data dependency lead to continuous waves of data. The computation starts on one level of elements and propagates through the processor array. This concept of locality and recursivity provides a theoretical basis for the design of highly parallel processor arrays.
1.2 Architecture of VLSI arrays
Several types of linear arrays can be distinguished according to the data flow. Usually there are one, two or three data paths, with identical or opposite directions. Some of them are named Unidirectional Linear Array (ULA), Bidirectional Linear Array (BLA), or Three-path
communication Linear Array (TLA) [6]. Triangular, square and hexagonal geometries of processor arrays, with three nearest-neighbour interconnections, are usually used. The ULA is usually used for the matrix-vector computation algorithm. The processors in this array work very efficiently. An example of the TLA is the ARMA filter proposed by H. T. Kung in [9]. TLAs are also called double pipelines, since they can be decomposed into two simple linear pipelines of BLA type. The ULA and BLA implementations of the matrix-vector multiplication algorithm execute the algorithm in the same time, but the ULA uses half the processors of the BLA implementation. Since the ULA implementation uses resident data, additional load/store times are required in the total processing time. The BLA implementation also requires additional input/output time. Therefore, in the total processing time, all the timings of the processor array have to be summed: the times for data loading, input, execution, output and unloading. Sometimes the total pipeline data period is required for comparison between the various systolic arrays. In the next section we introduce some basic mathematical operations widely used in DSP and communication systems, together with some basic systolic structures for the mentioned mathematical algorithms.
2 Mathematical Background of the Signal Processing Problems Digital Signal Processing encompasses a broad spectrum of mathematical methods. These include: transform techniques, convolution, correlation techniques in filtering processes, and sets of linear algebraic methods such as matrix multiplication, pseudo inverse calculation, linear system solver, different decomposition methods, geometric rotation and annihilation. Generally we can classify all signal processing algorithms into two groups: basic matrix operations and special signal processing algorithms. Fortunately, most of the algorithms fall into the classes of the matrix calculations, convolution, or transform type algorithms. These algorithms possess common properties such as regularity, locality and recursiveness.
2.1 Arithmetic algorithms
Each arithmetic algorithm E(n) is defined as a string of the four basic arithmetic operations (+, -, *, /) between the parameters defined in the expression. On a single processor system such an expression can be computed sequentially in O(n) time steps. In the case of a massively parallel computer with an unlimited number of processors n_p, the computing time of the arithmetic expression E(n) can be O(log n) steps instead of O(n) steps on a single processor computer. The total speedup is therefore O(n / log n).
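As an illustration of this tree-like evaluation (a minimal Python sketch added here, not part of the original text), the operands are combined pairwise and each pass over the list models one parallel time step:

import math

def tree_reduce(values, op):
    """Pairwise reduction; every pass over the list models one parallel time step."""
    steps = 0
    while len(values) > 1:
        # All pairs in a pass could be evaluated concurrently on separate processors.
        values = [op(values[i], values[i + 1]) if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_reduce(list(range(1, 17)), lambda a, b: a + b)
print(total, steps, math.ceil(math.log2(16)))  # 136 4 4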
Inner–vector multiplication
The inner product of two n-dimensional vectors x and y comes close to this number of steps. This product is obtained as the product of the row vector x^T and the column vector y, and can be given as
$$\mathbf{x}^T \mathbf{y} = \sum_{j=1}^{n} x_j y_j.$$
Sequentially it can be computed in (2n-1) steps; on a parallel computer with n processors it can be computed in 1 + log n steps.

Matrix–vector multiplication
The matrix–vector multiplication of an n×m matrix A with a vector x of dimension m results in y = Ax, where y is an n-element vector. The i-th element of y is defined as
$$y_i = \sum_{j=1}^{m} a_{ij} x_j,$$
where a_ij is the (i, j)-th element of the matrix A.

Matrix–matrix multiplication
The matrix–matrix multiplication of an n×m matrix A with an m×p matrix B results in a new matrix C = AB of dimension n×p, whose elements are defined as
$$c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}.$$
This method can be realized with an array of processors of dimension n×p. The principle is the same as in Figure 1. The connections are realized in the horizontal and vertical directions. Therefore, a mesh-connected processor array structure is suitable for this operation, in which the data stream of matrix B flows to the right and the data stream of matrix A flows from top to bottom. The elements of matrix C are stored in the appropriate processors of the array.

Linear equation solver
Solving a system of linear equations is one of the most important problems in DSP. The problem is to find the solution vector x of dimension (n×1) for a given system of n linear equations Ax = y, where A is a nonsingular matrix of dimension (n×n). The problem can be solved by computing the inverse matrix A^-1, that is x = A^-1 y. This matrix inversion is computationally very intensive, and the procedure is numerically unstable. The approach based on triangularization of the matrix A is therefore often used. The resulting upper triangular system A*x = y*, where A* is an n×n upper triangular matrix, is solved by back-substitution to obtain the solution x.
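For reference, a minimal sequential sketch of the back-substitution step mentioned above (plain Python, with an example system invented for the illustration):

def back_substitution(U, y):
    """Solve U x = y for an upper triangular, nonsingular U (sequential reference version)."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

U = [[2.0, 1.0, -1.0],
     [0.0, 3.0,  2.0],
     [0.0, 0.0,  4.0]]
print(back_substitution(U, [1.0, 12.0, 12.0]))   # [1.0, 2.0, 3.0]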
In the numerical analysis literature there are many methods to triangularize a matrix, e.g. Gauss elimination, QR and LU decomposition, or other methods such as those based on Givens rotations or Householder reflections. There also exist other effective methods for solving the system of equations: bidiagonalization methods and Singular Value Decomposition methods. The first step in the procedure for solving the system of n linear equations Ax = y is to factorize the matrix A. The best known methods are LU and QR factorization.
• PA = LU
• QA = R
where P is a permutation matrix, L is lower triangular, and R, U are upper triangular matrices. Once such a factorization is applied, solving the system of equations Ax = y is reduced to solving triangular systems of equations, followed by algorithms for dense and tridiagonal matrices.
Triangular matrices
Gauss elimination is a systematic elimination process that reduces Ax = y stepwise to triangular form [11, 12]. Variants of the Gauss elimination method use the idea of factoring A = LU, setting Ux = y and Ax = LUx = Ly = b. First the triangular system Ly = b is solved for y, and then the triangular system Ux = y for x. Without pivoting, such an algorithm requires 3(n-1) steps with (n-1)^2 processors [10]. A pivoting process is necessary for the factorization. LU factorization based on Gaussian elimination with partial pivoting can be obtained in O(n log n) steps on (n-1)^2 processors. The speedup in this case is of O(n^2 / log n). Using O(n^2) processors, direct implementation of Householder's reduction and the Gram-Schmidt algorithm requires O(n log n) steps. Givens' reduction can be modified to produce a parallel algorithm in O(n) steps with O(n^2) processors. We will briefly describe Givens' reduction instead of Gaussian elimination with partial pivoting. In this case the elementary Gaussian elimination is replaced by Givens plane rotations
$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \qquad c^2 + s^2 = 1.$$
In a sequential algorithm, the orthogonal matrix Q in QA = R is formed as the product of plane rotations, each step annihilating an element below the main diagonal of matrix A. In the parallel scheme, Q is the product of orthogonal matrices Q_j, each being the direct sum of independent rotations. In such cases, each Q_j can annihilate more than one element under the diagonal. Each Q_j is a block diagonal matrix with 2×2 rotation matrices on the diagonal. There exist several more annihilation orderings. Generally we construct an orthogonal matrix Q such that R is upper triangular and the linear system Ax = y is reduced to the upper triangular system Rx = y+, where y+ = Qy. The achieved reduction is equivalent to O(kn) steps. After triangularizing the linear equations, the remaining problem is to find an (n×1) vector x such that A+x = b+, where A+ is an upper triangular matrix. This problem can be solved by back substitution.
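To make a single rotation step concrete, the following sketch (added for illustration; not from the paper) forms the 2×2 Givens pair (c, s) with c^2 + s^2 = 1 and applies it to two rows so that the entry below the diagonal is annihilated:

import math

def givens(a, b):
    """Return (c, s) with c^2 + s^2 = 1 so that [[c, s], [-s, c]] applied to (a, b) zeroes b."""
    r = math.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

row_k, row_i = [3.0, 1.0], [4.0, 2.0]
c, s = givens(row_k[0], row_i[0])                         # annihilate the element below the diagonal
new_k = [c * xk + s * xi for xk, xi in zip(row_k, row_i)]
new_i = [-s * xk + c * xi for xk, xi in zip(row_k, row_i)]
print(new_k[0], abs(new_i[0]) < 1e-12)                    # 5.0 True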
Eigenvalue and Singular Value Decomposition Problem
Another important method is that for finding the eigenvalues and eigenvectors, or the singular values and singular vectors, of the matrix A. Several algorithms have been developed, such as the
parallel versions of the Jacobi and Jacobi-like algorithms, or the QR algorithm for obtaining several eigenvalues of a symmetric tridiagonal matrix [10]. The Jacobi algorithm is described in Golub [11] and in Wilkinson and Reinsch [12]. A real symmetric matrix A can be reduced to diagonal form by a sequence of plane rotations. In practice, this iterative process of reduction of the off-diagonal elements is terminated when the off-diagonal elements become negligible compared to the elements on the main diagonal. The main task is to find a sequence of reductions of the off-diagonal elements in parallel, where we are not concerned about destroying zeros that were previously introduced. It is possible to eliminate more than one element simultaneously in one sweep. The maximal number of annihilated off-diagonal elements in one sweep is (n^2-n)/2 pairs. In relatively few (about 8) sweeps the matrix becomes practically diagonal. The diagonal elements represent the eigenvalues, and the products of the individual transformations are taken as the eigenvectors. In a structure of O(n^2) processors, one sweep requires O(n) steps, yielding a speedup over the sequential algorithm of O(n^2). Other methods reduce the matrix to tridiagonal or upper Hessenberg form, depending on whether the matrix is symmetric or not. If the matrix is symmetric tridiagonal, we may apply the QR algorithm. This method is described in Wilkinson and Reinsch [12]. Singular Value Decomposition of matrices is useful in multidimensional signal processing. Matrix A can be factorized as Q1 Σ Q2^T, where Q1 is an m×m orthogonal matrix, Q2 is an n×n
orthogonal matrix, and Σ has the diagonal form
$$\Sigma = \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix}, \qquad S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r),$$
where σ1 ≥ σ2 ≥ ... ≥ σr > 0 and r is the rank of matrix A. The form
$$A = Q_1 \Sigma Q_2^T = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$
is called the SVD of the matrix A, where the singular values σi are the square roots of the nonzero eigenvalues of A^T A, and ui and vi are the column vectors of the matrices Q1 and Q2, respectively. The column vectors of Q2 are the eigenvectors of A^T A (and those of Q1 the eigenvectors of A A^T). The preferred method for solving the SVD problem is described in Golub and Van Loan [11]. The technique described finds U and V simultaneously by implicitly applying the symmetric QR algorithm to A^T A. This method can also be applied to solve the least squares problem, commonly used in DSP.
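The stated relation between the SVD of A and the eigendecomposition of A^T A can be checked numerically; the short sketch below uses NumPy purely as an illustration (the library is an assumption of the example, not a tool discussed in the paper):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)          # ascending eigenvalues of A^T A

# Singular values are the square roots of the eigenvalues of A^T A.
print(np.allclose(np.sort(sigma), np.sqrt(np.maximum(eigvals, 0.0))))  # True
# A is recovered as the sum of rank-one terms sigma_i * u_i * v_i^T.
print(np.allclose(A, (U * sigma) @ Vt))                                 # True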
3 Algorithm transformation from recursive algorithm to systolic array application
Modifying an algorithm to run on a systolic array is a complex procedure. It is necessary to provide the required data for each time step of the algorithm. The problem is well known in parallel processing theory as the data-dependence problem. Another major problem, characteristic of systolic array processing, is that of global data distribution. Even when already calculated, the necessary data need to be distributed to particular processors in the array. As only local connections are possible in this architecture, the data dependencies of the systolic array algorithm need to be localized.
Several tools have been produced to assist the development of the systolic array algorithm. Mathematically, the basic factor of all possible approaches is first to determine the index space of the initial algorithm as well as existing data dependencies, and then to transform the results to the index space and data dependencies of the systolic algorithm. The most common approach is to use the theory of graphs. However, the solution may also be found by using pure analytical methods. The approach suggested by Moldovan [13], and later extended by Gušev [6], will be presented in this chapter.
3.1 Model of algorithm
The basic formulation of an algorithm, commonly applied in DSP and other linear algebra procedures, is a notation in iterative form. This is a very compact notation, yet not directly applicable to any computer architecture. For execution on a computer, the algorithm should be transformed into a set of recursive scalar operations with indexed arguments. The model should represent the most important properties necessary for further parallelization and mapping of the algorithm onto a systolic array. Following the suggested model, the algorithm can be presented as a triple consisting of the index set J^n, the set of all data dependencies D and the set of all arithmetical operations S in the algorithm; A = (J^n, D, S). Each combination of indices appearing in any expression of the particular algorithm can be represented as an index point of the algorithm. The index point is an integer vector containing all corresponding index values. The set of all index points p of the recursive algorithm is referred to as the index set J of the algorithm, with the lower and upper bounds of the indices defined as
$$\mathbf{p} = [p_1, \ldots, p_n]^T,\ p_i \in \mathbb{Z}; \qquad J = \{\mathbf{p}\},\quad A_w \mathbf{p} \ge b_w,\quad A_u \mathbf{p} \le b_u.$$
Variables in the system of recurrence equations can appear as input or output functions in particular expressions of the algorithm. Input functions are variables required to calculate the value of the output function in the expression S_r. Input and output functions form the input and output sets of the algorithm, $X = \bigcup_{r=1}^{N} IN(S_r)$ and $Y = \bigcup_{r=1}^{N} OUT(S_r)$, respectively. The relation between input and output functions is known as data dependence, and can be defined for each function and each expression in the algorithm. This dependence should be strictly observed in order to preserve data consistency when several expressions of the algorithm are to be executed concurrently. There are three major types of data dependence:
• data dependence, when OUT(S_r(p_v)) ∩ IN(S_t(p_w)) ≠ ∅,
• output data dependence, when OUT(S_r(p_v)) ∩ OUT(S_t(p_w)) ≠ ∅,
• data anti-dependence, when IN(S_r(p_v)) ∩ OUT(S_t(p_w)) ≠ ∅.
All of the mentioned types of data dependence should be considered for the purpose of our model. It would be possible to represent each data dependence for each index point and expression separately by means of Boolean algebra. A more general notation defines the data dependence vector as the difference between the two index points in a particular occurrence of data dependence. Where there exists a data dependence d(p_v, p_w) = 1, the corresponding data dependence vector equals d = p_w - p_v. A single data dependence vector refers to a single variable in the algorithm. The existence and efficiency of the transformation are determined
mostly by data dependence, as this is the crucial property for parallel execution of any algorithm. The third part of the triple, the set of arithmetical operations S refers to the operations to be performed at specified index points. The mapping procedure is independent of those operations. It is only important which data should be available at the time, so S will not be a part of the mapping procedure until the final result. The model presented contains all the basic information necessary to perform the mapping of the algorithm onto the array of processor elements. As the problem is highly complex, and may not always have an efficient solution, only the solution for a certain subclass of affine data dependencies will be presented later in this chapter.
3.2 Model of systolic array algorithm The systolic array algorithm model is a special subclass of the general algorithm model. The index and data dependence sets in this model directly refer to location and time slice at which specific expression of the algorithm is to be performed. The time (k) and space (M) models of the algorithm can easily be separated:
$$J^n = \{ J^k, J^M \}, \qquad D^n = [ D^k \; D^M ], \qquad k = n - M.$$
The basic elements of the space algorithm model have their analogy in the placement and connectivity of the array processor elements. The following model describes any set of processor elements connected into a logical system known as a processor array. We assume the following properties of the processor elements in the processor array:
• the processor element is able to perform the basic operations required by the algorithm in a single processing cycle,
• the active processor element operates for at least one processing cycle, after which it transfers the results of the operation to the processor elements to which physical connections are established,
• the location of the processor element is defined by its Cartesian coordinates.
Each processor element in the array is defined by an index point l; its value equals the Cartesian coordinates of the processor element in the array. Correspondingly, all index points together create the index set J^M = {l} of the processor array. A single connection between processor elements, also called a connection primitive, is the difference between their index points, and all the existing connections combined create the connectivity matrix
$$P = \left[ \mathbf{l}_1^{(p)}, \mathbf{l}_2^{(p)}, \ldots, \mathbf{l}_r^{(p)} \right]; \qquad \mathbf{l}_i^{(p)} = \mathbf{l}_w - \mathbf{l}_v; \quad \mathbf{l}_w, \mathbf{l}_v \in J^M.$$
VLSI systolic arrays typically consist of many processor elements with only local interconnection capability. Normally, only connections to the closest neighbours are possible, with the additional use of the limited local memory of the elements. The matrix of connectivity of the systolic array therefore refers to constant and local connections. The entire model of the array can be described as (J^M, P).
index vector p    t = p^T g, g1 = (1,10)    t = p^T g, g2 = (10,1)
(0,0)                      0                          0
(0,1)                     10                          1
(0,9)                     90                          9
(1,0)                      1                         10
...                      ...                        ...
(9,8)                     89                         98
(9,9)                     99                         99

Table 1: Time mapping using weight vector g

The index set of the space algorithm model equals the index set of the processor array. Since not all of the available connections are always used, the matrix of space data dependence D^M may differ from the matrix of connectivity. The second part of the model describes the time scheduling of algorithm operations. In general, the index sets of the time algorithm model have more than one dimension, so a mapping to a single time coordinate needs to be defined. The actual execution time of an expression is determined by the scalar product of the index point p with a pre-determined weight vector g, t = p^T g. The principle can be seen from Table 1. The pair (J^k, g) therefore represents the time model of the algorithm. The remaining part of the data dependence matrix, the time dependence vector d^k ∈ D^k, determines the time flow of the algorithm execution. The execution needs to be performed in real time, d^k > 0.
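The time mapping of Table 1 is a plain scalar product; the following short sketch (illustrative Python, not part of the original text) recomputes the table entries for the two weight vectors used there:

index_points = [(0, 0), (0, 1), (0, 9), (1, 0), (9, 8), (9, 9)]
g1, g2 = (1, 10), (10, 1)

def exec_time(p, g):
    """Execution time t = p^T g of the expression at index point p."""
    return sum(pi * gi for pi, gi in zip(p, g))

for p in index_points:
    print(p, exec_time(p, g1), exec_time(p, g2))
# (0, 0) 0 0 ... (9, 9) 99 99, matching Table 1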
3.3 Mapping of algorithm: step-by-step
A clear analogy between the index sets of the algorithm and of the processor array, as well as between the data dependence matrix and the matrix of connectivity, may be seen from the models presented. The problem of the transformation onto the parallel structure is that not all data dependencies of the algorithm can be realized directly on the VLSI array. The mapping procedure should transform the structure of the algorithm to fit the hardware capabilities of the physical systolic structure. The procedure should follow the basic steps listed:
• determining the model of the algorithm with the corresponding index set and data dependence set,
• transforming the algorithm to a system of uniform recurrence equations,
• determining regular time and space transformations,
• selecting the optimal solution.
Transformation of algorithm to system of uniform recurrence equations
The notation of the algorithm as a system of uniform recurrence equations is the notation most easily transformed to the systolic array architecture. The detailed procedure consists of consecutive procedures A, C, D and E, as noted by Gušev [6], performing a step-by-step transformation from the iterative notation to a full-index algorithm with constant and localized data dependencies.
Procedure A: Iterative algorithm
a_j(p) := iteration(operator, argument, range)
denotes iterative repetition of the arithmetical or logical operator on arguments for the predetermined range of parameters. This notation should be rewritten into a recursive form:

a_j(p) := neutral(operator)
for s = s0 to sout step sh do
    a_j(p_s) := a_j(p_s) <operator> argument

Procedure A needs an extension for the case when a_j(p) in the iterative equation is not only the result but also an argument of the iteration. In this case, in the recurrence equation we apply the value a_j(p_sout) instead of a_j(p). This extension is called procedure C.
Some of the arguments in the recursive algorithm, as created by using procedures A and C, may appear in restricted index form. In order to ensure unified notation, the index space of all functions of the algorithm should be of the same dimension. A procedure needs to be defined to extend the index space of such functions to the full index space. The easiest applicable method is described as procedure D. The index space of a function in restricted index form can be extended by transferring all existing values and by assigning the value zero to the missing index: p'_s = 0, p'_k = p_k, k = 1, ..., m; k ≠ s. The result of the procedures so far is an iterative algorithm transformed to a system of affine recurrence equations. The resulting algorithm can be represented by the index space J = {p} and affine data dependencies of the form d = Ap + d0. The problem of affine data dependencies is that they generally represent broadcasting of data, which cannot be realized on a systolic array structure. Two major types of broadcasting may exist in an algorithm. Data broadcast demands global distribution of the same arguments to more than a single function of the algorithm. It would be possible to satisfy this requirement by use of a global bus or shared memory, which is not the case for systolic processor arrays. Computational broadcast requires a single function parameter to be calculated more than once during the algorithm. To preserve regularity of the algorithm execution it is necessary to retain the sequential ordering where computational broadcasting occurs. The only way to handle the problem is by introducing a single assignment code, which allows each function parameter to be assigned only once during the algorithm execution. The key transformation to be applied before the mapping procedure is to transform the system of affine recurrence equations to a system of uniform recurrence equations with local data
dependencies. There exist several procedures to transform broadcasting in the algorithm into propagation. The one presented below uses the propagation vector of the broadcasting functions. Procedure E determines propagation vectors, based on the set of index points where any type of broadcasting appears. If A represents the matrix of affine data dependencies, a matrix of linear data dependence can be defined as B = I - A. The rank of matrix B is always smaller than its dimension. For each matrix B there exists a permutation P such that the matrix B' = PB contains zeros in the last n - rank(B) rows only:
$$B' = PB = \begin{bmatrix} M & N \\ 0 & 0 \end{bmatrix}.$$
The propagation vectors can in general be determined as the column vectors of the matrices
$$\begin{bmatrix} -M^{-1}N \\ E \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} E \\ -N^{-1}M \end{bmatrix}$$
for non-singular M or N, respectively. For special cases with n - rank(B) ≤ 2 we may even omit the inversion procedure and use the following simplification:
• n - rank(B) = 1: the propagation vector is created from the subdeterminants of the last row of matrix B':
$$\mathbf{r} = (b_1^*, b_2^*, \ldots, b_n^*)^T.$$
• n - rank(B) = 2: the solution set Q is determined by the plane n^T p = d with p ∈ Q and n = [a b c]^T, a, b, c, d ∈ ℝ. The solution is a plane determined by the propagation vectors r1 and r2, selected by the following rules:
abc ≠ 0 → r1 = [bc, ac, -2ab]^T, r2 = [bc, -2ac, ab]^T
a = 0 → r1 = [1, 0, 0]^T, r2 = [0, c, -b]^T
b = 0 → r1 = [0, 1, 0]^T, r2 = [c, 0, -a]^T
c = 0 → r1 = [0, 0, 1]^T, r2 = [b, -a, 0]^T
The propagation vectors provided by procedure E can now be used to eliminate either computational broadcast or data broadcast in the algorithm. For the final propagation vector value we are free to select either the incrementing or the decrementing value, r or -r. Note that for each particular occurrence of broadcasting a corresponding operation is to be performed separately, including the determination of the propagation vectors. After all propagation vectors for each occurrence of computational broadcast or data broadcast have been successfully determined, they only have to be applied in the final step of broadcast elimination. The elimination procedure should first be applied to computational broadcast operations. Expressions like a_i(d_i0) := F_i(..., a_i(d_i0), ...), where d_i0 is a function of the algorithm's index point p, should be rewritten as an equivalent expression a_i(p) := F_i(..., a_i(p - r), ...) with new, suitable initial conditions. In the case of data broadcast, a very similar operation is required. All the occurrences of a_j(q) with data broadcast should be replaced with a_j(p - r) at the source (right side of the expression) of the data broadcast, and with a_j(p) for the computed value (left side) and for all
further references to this value. If there is no data broadcast with an assignment for this function, then data propagation needs to be ensured by adding a new expression a_j(p) := a_j(p - r).
Mapping of the system of recurrence equations onto the systolic array structure
So far, the algorithm has been transformed into a form which is most appropriate for further parallelization. The second part of the transformation is the final mapping of the algorithm to the space-time model of the hardware structure. We are looking for an equivalent transformation of the algorithm model with a new index set and new data dependencies, while at the same time obtaining the same result as the initial algorithm. The resulting index set points to the indices of the processor elements, whereas the new data dependencies follow the existing connections between processor elements. With the index and data dependence sets of the algorithm presented in matrix form, and considering linear transformations, the transformation can be represented by the transformation matrix T. The transformation T: (J^n, D) → (J_T^n, D_T) should:
• not affect the structure of the body of the algorithm; the set of the assignment operations should remain the same,
• not change the number of operations; the number of index points in the transformed algorithm should remain unchanged,
• not reverse the ordering of the execution of the algorithm operations.
The index vectors and data dependence vectors of the transformed algorithm are defined by the matrix-vector products p_T = T·p and d_T = T·d. As the time and space models of the systolic array algorithm are separated, the same should be assumed for the transformation matrix. The transformation matrix T therefore consists of the time transformation Π: J^n → J_T^k and the space transformation S: J^n → J_T^M as follows:
$$T = \begin{bmatrix} \Pi \\ S \end{bmatrix}$$
A complete procedure defining the optimum algorithm transformation should follow these steps:
Determine the time transformation Π for which the inequality g·Π·d_i > 0 holds.
Time mapping should enable accurate real-time algorithm execution. Each operation demands at least one time interval. In this interval the parameter acquisition, function recalculation and assignment of the results have to be executed.
Select the transformation Π that minimizes the execution time
$$\frac{\max(\Pi \cdot (\mathbf{p}_v - \mathbf{p}_w) + 1)}{\min(\Pi \cdot \mathbf{d})}.$$
The selected transformation should result in an algorithm with minimum execution time.
Select Χ for which $\sum_j \chi_{ji} \le g \cdot d_i^k$ holds.
Data dependencies D_T^M of the systolic algorithm model will have to be realized over physical connections between elements of the array. Data connections are described using the matrix Χ = {χ_ij} that describes the necessary linear combinations of connection primitives l^(p) ∈ P. Each transformed data dependence is to be expressed by the relation
$$\mathbf{d}_i^M = \sum_j \mathbf{l}_j^{(p)} \chi_{ji}.$$
A single communication primitive, described in matrix P, requires one processing cycle for the transfer of data. Complex connections which are physically non-existent have to be established over a sequence of communication primitives. The time necessary to perform a complex data distribution therefore equals Σ_j χ_ji, and should be shorter than the time required by the transformed algorithm operations, g·Π·d_i = g·d_i^k. All existing solutions for Χ are to be considered in the further procedure.
For S, solve the system S·D = P·Χ.
S·D = P·Χ is the basic equation for data distribution over the processor array. A selection is to be made among all existing solutions of the above equation, for all available Χ.
If T is singular, select another solution.
The algorithm transformation is a bijective operation. The selected matrix transformation has to be non-singular. Selecting an optimal solution.
From all regular solutions, the transformation with the smallest number of index set bands should be selected. Index set bands are index subsets with different basic operations to be executed. This allows us to apply special-purpose instead of general-purpose processor elements at specified index points, thus reducing the complexity of the entire array. If any special properties of the systolic array are known, they should be considered when selecting the most appropriate transformation. In the following section, the transformation of some basic linear operations is presented to illustrate the procedure described.
3.4 Examples
Matrix-Vector Multiplication
The very basic problem of matrix-vector multiplication was selected to present the complete transformation procedure, from the definition of the problem to a couple of possible array solutions. The formal notation of the algorithm to be created is the expression y = Ax. The parameters of the operation are the matrix A_{n×n} and the vector x_n, with indices representing their dimensions. The result of the operation is the vector y_n. A square matrix A was selected for the purpose of this example, although generalization to other matrices can easily be made. The suggested iterative algorithm solves the equation by
$$y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, \ldots, n.$$
Referring to the notation a_j(p) := iteration(operator, argument, range) of the procedure described, the operator to be used is summation, the arguments of the operation are the elements a_ij and x_j, and the range of parameters is the entire set [1, ..., n]. Using procedure A, the following recursive algorithm is obtained:

for i = 1 to n
    y(i) := 0
    for j = 1 to n
        y(i) := y(i) + a(i,j)*x(j)
Since there is no recurrence in the arguments of this function (y is not an argument in the iterative notation), there is no need to apply procedure C. By applying procedure D, the index space of this algorithm is extended to full-index form. The index space of each function is extended using the value 0 for the missing index:
y(i) → y(i,0)
x(j) → x(j,0)
The full-index form of the algorithm is then:

for i = 1 to n
    y(i,0) := 0
    for j = 1 to n
        y(i,0) := y(i,0) + a(i,j)*x(j,0)

The index set and index point of the algorithm, including the lower and upper bounds, are
$$J = \left\{ \mathbf{p} \,\middle|\, \mathbf{p} = \begin{bmatrix} i \\ j \end{bmatrix};\ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\mathbf{p} \ge \begin{bmatrix} 1 \\ 1 \end{bmatrix},\ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\mathbf{p} \le \begin{bmatrix} n \\ n \end{bmatrix} \right\}.$$
There are three data dependencies, but the dependence on the function a is zero, so it is not considered further. The dependence for the function y,
$$\mathbf{d}_y = \begin{bmatrix} i \\ j \end{bmatrix} - \begin{bmatrix} i \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} i \\ j \end{bmatrix},$$
is an affine data dependence defining a computational broadcast. Procedure E is applied, determining the matrix B and the propagation vector r,
$$B_y = I - A_y = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \mathbf{r}_y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$
The computed value in the expression should change to y(p) = y(i,j). We are free to choose between two possible solutions of the broadcast elimination problem. Using the incrementing propagation vector, the used value changes to y(p - r_y) = y(i, j-1), with initial conditions y(i,0) = 0 and output value y(i,n). The alternative is the decrementing propagation vector, substituting the used value with y(p + r_y) = y(i, j+1), with initial conditions y(i,n+1) = 0 and output results in y(i,1). For the function x there exists a data broadcast with affine data dependence
$$\mathbf{d}_x = \begin{bmatrix} i \\ j \end{bmatrix} - \begin{bmatrix} j \\ 0 \end{bmatrix} = \begin{bmatrix} i - j \\ j \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} i \\ j \end{bmatrix},$$
which is eliminated by computing
$$B_x = I - A_x = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad \mathbf{r}_x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$
Since there exists no recurrence expression for x, a new recurrence equation will have to be introduced to allow data propagation. Either x(i,j) := x(i-1,j) or x(i,j) := x(i+1,j) can be used, substituting x(j,0) with x(i,j) in the original recurrence equation. For the solution, there are four combinations available, obtained by selecting incrementing or decrementing propagation vectors. We shall select the solution with local data dependencies:

for i = 1 to n
    y(i,0) := 0
for j = 1 to n
    x(0,j) := x(j)
for i = 1 to n
    for j = 1 to n
        x(i,j) := x(i-1,j)
        y(i,j) := y(i,j-1) + a(i,j)*x(i,j)
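As a quick consistency check of the localized system above (an illustrative Python sketch, not part of the original text), executing the single-assignment recurrences reproduces y = Ax:

n = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x_in = [1, 1, 2]

# Full-index variables of the localized algorithm (1-based indices kept explicit).
x = {(0, j): x_in[j - 1] for j in range(1, n + 1)}
y = {(i, 0): 0 for i in range(1, n + 1)}

for i in range(1, n + 1):
    for j in range(1, n + 1):
        x[(i, j)] = x[(i - 1, j)]                    # propagation instead of broadcast
        y[(i, j)] = y[(i, j - 1)] + a[i - 1][j - 1] * x[(i, j)]

print([y[(i, n)] for i in range(1, n + 1)])          # [9, 21, 33] = A @ x_in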
Now the data dependencies are constant,
$$\mathbf{d}_x = \begin{bmatrix} i \\ j \end{bmatrix} - \begin{bmatrix} i-1 \\ j \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \mathbf{d}_y = \begin{bmatrix} i \\ j \end{bmatrix} - \begin{bmatrix} i \\ j-1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad D = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},$$
and the above is a system of uniform recurrence equations. The initial broadcasting and the localized data dependence graph are presented in Figure 1. From here we may start the mapping of the algorithm onto a systolic array structure.
Figure 1: (a) data broadcast of the original algorithm, (b) localized data dependence
First, the time mapping function has to be selected, which will enable real-time execution in minimum time. The index space of the algorithm is two-dimensional, so a one-dimensional
time model of the algorithm is satisfactory. From the condition Π·d_i > 0 the appropriate time mapping function satisfies the equations
$$\Pi \cdot \mathbf{d}_x = \Pi \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix} > 0, \qquad \Pi \cdot \mathbf{d}_y = \Pi \cdot \begin{bmatrix} 0 \\ 1 \end{bmatrix} > 0.$$
The solution that minimizes the execution time is Π = [1 1].
The linear processor array with bi-directional communications is to be applied. The matrix of communication primitives for this case equals P = [-1 1] and the matrix of algorithm data dependence is
$$D = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
We have to define which communication primitives are to be used in the systolic algorithm. We need to satisfy the condition $\sum_j \chi_{ji} = \sum_j d^k_{ji}$ as well as the nonsingularity of the matrix $T = \begin{bmatrix} \Pi \\ S \end{bmatrix}$. The space model transformation is a solution of the equation S·D = P·Χ. From the set of all existing solutions we select
$$T = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$
The data dependencies of the algorithm map into the time space of the algorithm model as
$$d^k_x = \Pi \mathbf{d}_x = \begin{bmatrix} 1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = 1, \qquad d^k_y = \Pi \mathbf{d}_y = \begin{bmatrix} 1 & 1 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = 1,$$
whereas the space model requires the connections
$$\mathbf{d}^M_x = S \mathbf{d}_x = \begin{bmatrix} 0 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = 0, \qquad \mathbf{d}^M_y = S \mathbf{d}_y = \begin{bmatrix} 0 & 1 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = 1.$$
These results mean that the values x are stationary in the processor elements and the values y propagate from the left to the right side of the array. The activities of the algorithm are best presented in the time-space diagram of this uni-directional algorithm, as shown in Figure 2.
Figure 2: Time-space diagram of the parallelized algorithm
For this solution, n processor elements are required. The processor array is shown in Figure 3. The total execution time of the algorithm is t=2n-1.
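The chosen transformation can be verified by brute force: mapping every index point p with Π = [1 1] and S = [0 1] (the second row of T) gives the time slot and the processor element of each operation. The sketch below (illustrative Python, assuming the 1-based index points used above) confirms that no processor is used twice in the same slot and that the schedule spans 2n-1 slots:

n = 4
Pi = (1, 1)    # time mapping
S = (0, 1)     # space mapping (second row of T)

schedule = {}
for i in range(1, n + 1):
    for j in range(1, n + 1):
        t = Pi[0] * i + Pi[1] * j        # time slot of index point (i, j)
        pe = S[0] * i + S[1] * j         # processor element that executes it
        schedule.setdefault(t, []).append(pe)

# No processor is used twice in the same time slot, and the schedule spans 2n-1 slots.
print(all(len(pes) == len(set(pes)) for pes in schedule.values()))   # True
print(max(schedule) - min(schedule) + 1)                             # 2n - 1 = 7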
QR Decomposition
Another important linear algebra problem is the solution of systems of linear equations. A stable method for solving a system of linear equations is based on QR decomposition. The approach presented is based on the use of Givens rotations, which are especially suitable for parallel execution [6]. The QR triangularization procedure uses Givens rotations to annihilate the lower triangular elements. For each annihilation, one rotation is performed. The entire process of triangularization can be written as
$$R = Q^T A, \qquad Q^T = Q_1 \cdot Q_2 \cdots Q_k \cdots Q_n, \qquad Q_k = Q^{(k,k+1)} \cdots Q^{(k,n)},$$
where the rotation Q^(k,i) differs from the identity matrix only in rows and columns k and i, in which it contains the 2×2 block
$$\begin{bmatrix} \cos\theta_{ki} & \sin\theta_{ki} \\ -\sin\theta_{ki} & \cos\theta_{ki} \end{bmatrix}, \qquad \theta_{ki} = \arctan\frac{a_{ki}}{a_{kk}}.$$
A single rotation Q^(k,i) affects only rows k and i. This row transformation can symbolically be expressed in the following form:

for k := 1 to n
    for i := k+1 to n
        row´(k) := F(row(k), row(i))
        row´(i) := G(row(k), row(i))
        row(k) := row´(k)
        row(i) := row´(i)
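The functions F and G above are the Givens row updates; a compact sequential sketch of the whole triangularization loop (illustrative Python, not the systolic implementation itself) is:

import math

def qr_givens(A):
    """Triangularize A with Givens rotations, R = Q^T A (sequential reference sketch)."""
    n = len(A)
    R = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            r = math.hypot(R[k][k], R[i][k])
            if r == 0.0:
                continue
            c, s = R[k][k] / r, R[i][k] / r
            row_k = [c * R[k][j] + s * R[i][j] for j in range(n)]    # F(row(k), row(i))
            row_i = [-s * R[k][j] + c * R[i][j] for j in range(n)]   # G(row(k), row(i))
            R[k], R[i] = row_k, row_i
    return R

R = qr_givens([[4.0, 1.0], [3.0, 2.0]])
print(round(R[1][0], 12), round(R[0][0], 6))   # 0.0 5.0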
This algorithm should be transformed into a localized system of uniform recurrence equations. We have decided to introduce new variables instead of extending the index space by one more dimension. The rewritten algorithm is

for i := 1 to n
    rowb(0,i) := row(i)
for k := 1 to n
    rowa(k,k) := rowb(k-1,k)
    for i := k+1 to n
        rowa(k,i) := F(rowa(k,i-1), rowb(k-1,i))*
        rowb(k,i) := G(rowa(k,i-1), rowb(k-1,i))*

with the final result in rowa(i,n). The Givens rotations themselves are denoted with *. For partial matrix elements they can be specified as

calculate(c(k,i), s(k,i), a(k,i,k), b(k,i,k))
for j := k+1 to n
    a(k,i,j) := f(c(k,i), s(k,i), a(k,i-1,j), b(k-1,i,j))
    b(k,i,j) := g(c(k,i), s(k,i), a(k,i-1,j), b(k-1,i,j))

Figure 3: Processor array for matrix-vector multiplication (the input data stream of matrix A enters the array in skewed, diagonal order)
An additional global data distribution appears, which distributes the values c(k,i) and s(k,i). In addition, the index space has to be completed. The indices of the localized form are presented in Table 2.

values     a            b            c,s
new        (k,i,j)      (k,i,j)      (k,i,j)
used       (k,i-1,j)    (k-1,i,j)    (k,i,j-1)
initial    (k,k,j)      (k-1,k,j)    (k,i,k)
output     (k,n,j)      (n,i,j)      (k,i,n)

Table 2: Index space of the localized QR algorithm
The data dependencies of this algorithm are
$$\mathbf{d}_b = \mathbf{d}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad \mathbf{d}_a = \mathbf{d}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad \mathbf{d}_{sc} = \mathbf{d}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$
The system obtained is a system of uniform recurrence equations, identified by a unity data dependence matrix with a pyramidal index structure. The optimal time mapping function is Π = [1 1 1]. The space mapping function should be selected from the functions S1 = [1 0 0], S2 = [0 1 0], S3 = [0 0 1], S4 = [-1 1 0], S5 = [-1 0 1]. To reduce the quantity of data to be propagated through the array, we select a transformation with stationary data c and s. This is achieved using the transformation
$$T = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix},$$
which transforms the data dependence vectors to
$$T \cdot \mathbf{d}_b = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \quad \text{(1 time step / downwards)},$$
$$T \cdot \mathbf{d}_a = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} \quad \text{(1 time step / to the right)},$$
$$T \cdot \mathbf{d}_{sc} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad \text{(1 time step / stationary data)}.$$

Figure 4: Triangular array for QR decomposition
The result is a triangular systolic array, as shown in Figure 4. Processor elements of two different types are used. The elements on the diagonal are simply delay elements used to transfer the values of b coming from the top to the right. The other elements perform Givens parameter generation in the first operational step and Givens rotations afterwards. The results can be obtained from the right side of the array. For the realization, n(n-1)/2 processor elements are required, as the delay elements on the diagonal of the array can simply be realized using registers instead of processor elements.
The Jacobi Algorithm
The procedure was originally aimed at calculating the eigendecomposition of real symmetric matrices, but it was later extended to perform singular value decomposition (SVD) of general matrices. For all Jacobi variants, the convergence is asymptotically quadratic. The Jacobi idea will be presented for a symmetric matrix, although an analogous SVD procedure for general matrices exists [14, 15, 16]. The Jacobi method generates a sequence of matrices $A_{k+1} = J_k^T A_k J_k$, where $J_k$ is a plane rotation, such that $\lim_{k\to\infty} A_k = \Lambda$ and $\lim_{k\to\infty} J_1 \cdot J_2 \cdots J_k = Q$. Each $J_i$ is determined so as to annihilate a pair of matrix elements $a_{pq} = a_{qp}$ in A; $J_i$ equals the identity matrix except for the elements $(J_i)_{pp} = (J_i)_{qq} = \cos\Theta$, $(J_i)_{pq} = \sin\Theta$ and $(J_i)_{qp} = -\sin\Theta$. The rotation angle is chosen so that
$$a_{pq}^{k+1} = a_{qp}^{k+1} = a_{pq}^{k}(\cos^2\Theta - \sin^2\Theta) + (a_{pp}^{k} - a_{qq}^{k})\cos\Theta\sin\Theta = 0.$$
The algorithm is described as:

do until off(A) < ε
    for p = 1 to n-1
        for q = p+1 to n
        begin
            determine Θ
            A_{k+1} := J^T(Θ) · A_k · J(Θ)
        end
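A minimal sequential sketch of this cyclic iteration (illustrative Python using NumPy; the rotation angle is derived from the annihilation condition given above):

import numpy as np

def jacobi_sweeps(A, sweeps=8):
    """Cyclic Jacobi sweeps for a real symmetric matrix A (sequential reference sketch)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                # Rotation angle from the annihilation condition for a_pq.
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J            # similarity update annihilates A[p, q]
    off = np.sqrt(np.sum((A - np.diag(np.diag(A)))**2))
    return np.sort(np.diag(A)), off

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.5], [2.0, 0.5, 1.0]]
vals, off = jacobi_sweeps(A)
print(np.allclose(vals, np.sort(np.linalg.eigvalsh(A))), off < 1e-12)   # True True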
The ordering of p and q is free and is not limited by the above rule. Several operations in the Jacobi method are independent of each other. Pre- and post-multiplication with the matrix $J_k$ only affects rows and columns p and q of the matrix $A_{k+1}$. We are allowed to perform rotations for different values of p and q simultaneously, as long as the operations do not affect the same rows or columns. To achieve the highest level of parallelism, an ordering different from the row ordering (the sequential algorithm in 2.2) is preferable [15]. For a systolic array realization, the broadcast-defining algorithm with affine data dependencies has to be rewritten in a form with constant data dependencies. Some new variables are introduced to ensure the proper distribution of the parameters. The entire problem is split into n/2 × n/2 two-by-two problems, each of them to be solved on the same processor element, avoiding further data interchange requirements. This leads to an array built of n/2 × n/2 processors, each of them containing a 2×2 submatrix of $A_k$. The diagonal elements of the array are capable of calculating the rotation parameters and of zeroing the necessary elements; the off-diagonal elements perform the row and column recalculations. The matrix elements as well as the rotation parameters need to be interchanged between the processors after each step of the operation is completed. The eigenvectors need to be calculated as well, and their updates are stored in additional registers μ, ν, σ, τ of the processor. Initially the processors $P_{ij}$, i, j = 1, 2, ..., n/2, contain submatrices of $A_1$ and eigenvector submatrices of $Q_1 = I$,
$$\begin{bmatrix} \alpha_{ij} & \beta_{ij} \\ \gamma_{ij} & \delta_{ij} \end{bmatrix} = \begin{bmatrix} a_{2i-1,2j-1} & a_{2i-1,2j} \\ a_{2i,2j-1} & a_{2i,2j} \end{bmatrix}, \qquad \begin{bmatrix} \mu_{ij} & \nu_{ij} \\ \sigma_{ij} & \tau_{ij} \end{bmatrix} = \begin{bmatrix} q_{2i-1,2j-1} & q_{2i-1,2j} \\ q_{2i,2j-1} & q_{2i,2j} \end{bmatrix}.$$
The diagonal processors $P_{ii}$, i = 1, 2, ..., n/2, first have to calculate the required rotation parameters and to perform the appropriate transformation of the matrix elements [11]. The rotation parameters $c_i$ and $s_i$ have to be calculated by the diagonal processors and then distributed vertically and horizontally through the array. The rotation itself is represented by
$$\begin{bmatrix} \alpha'_{ij} & \beta'_{ij} \\ \gamma'_{ij} & \delta'_{ij} \end{bmatrix} = \begin{bmatrix} c_i & s_i \\ -s_i & c_i \end{bmatrix}^T \begin{bmatrix} \alpha_{ij} & \beta_{ij} \\ \gamma_{ij} & \delta_{ij} \end{bmatrix} \begin{bmatrix} c_i & s_i \\ -s_i & c_i \end{bmatrix}.$$
The update of the eigenvector matrix has to be performed as well:
$$\begin{bmatrix} \mu_{ij} & \nu_{ij} \\ \sigma_{ij} & \tau_{ij} \end{bmatrix} \begin{bmatrix} c_i & s_i \\ -s_i & c_i \end{bmatrix} = \begin{bmatrix} \mu'_{ij} & \nu'_{ij} \\ \sigma'_{ij} & \tau'_{ij} \end{bmatrix}.$$
To complete a step, columns and corresponding rows have to be interchanged between adjacent processors, so that a new set of off-diagonal elements is ready to be annihilated during the next step. The available connections between processors are represented by the set of communication primitives
$$P = \begin{bmatrix} 0 & 0 & 1 & 0 & -1 & -1 & 1 & -1 & 1 \\ 0 & 1 & 0 & -1 & 0 & -1 & -1 & 1 & 1 \end{bmatrix}.$$
The dependence matrix is shown for an inner sub-diagonal element of the array [16]:
$$D = \begin{bmatrix} -1 & 1 & -1 & 1 & 0 & 2 \\ -1 & 1 & 1 & -1 & -2 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \end{bmatrix}.$$
This solution requires non-local distribution of the rotation parameters in constant time. On a systolic array only transmission of parameters at constant speed is possible, i.e. only local data dependencies are allowed. The time required to send data from one processor to another is proportional to the distance between the processors. For the given example, the maximum distance between data-dependent array elements, regardless of the matrix size, is $\max \lVert \Delta_{ij} - \Delta_{i\pm1,j\pm1} \rVert = 2$, as shown in [15]. The same result can be obtained directly from the matrix D. To allow normal data interchange, each processor has to be idle for two time steps; by that time all the required data is accessible.
Figure 5: Systolic array implementation of the Jacobi decomposition
The algorithm terminates after a fixed number of sweeps M. By parallelisation, an execution time of O(n) per sweep is achieved. The suggested array is shown in Figure 5. In a very similar way it is possible to construct an equivalent array capable of performing the SVD [14].
4 Recent Examples of the Parallel Approach Presented
4.1 Notch Filter for Multiple Notches
Most popular algorithms for adaptive filtering are based on calculating an estimate of the inverse of the input signal autocorrelation matrix, R^-1. The result may be reached directly or iteratively. The Least Mean Square (LMS) algorithm is still the most widely used real time adaptive filtering algorithm due to its low computing requirements. As VLSI processing units become cheaper, more effective algorithms with higher computational requirements can be built into the digital signal processor. The advantage of such methods is their faster convergence rates. Many users of adaptive filtering only know when to use a transversal filter rather than a recursive one, and this is why they prefer the LMS gradient-descent technique: they do not spend much time analysing the known algorithms from linear algebra and choosing the appropriate one for the problem. From the literature we know that there exist other, faster and more stable solutions based on the QR or Jacobi SVD decomposition. The aim of all the methods suggested [19, 20] is to find an optimum way to solve the set of linear equations represented by the adaptive filtering process, w_opt = R^-1 p. The correlation matrix for discrete-time stochastic signals is R = E[u_M · u_M^H],
where u_M(n) = [u(n), u(n-1), ..., u(n-M+1)]. R is a symmetric Toeplitz matrix. Suppose our transmitted signal is represented by u(n). In the channel some additive, linearly independent noise is added, so that the received signal equals x(n) = u(n) + v(n). The principal axes of the signal space x are determined by the eigenvectors and eigenvalues of its correlation matrix R_x. The set of all M eigenvectors q forms a vector space of the signal [19]. The corresponding eigenvalues λ_i, i = 1, ..., M, represent the power of the signal x(n) in the direction of the i-th axis component [18]. The signal space consists of independent subspaces of the signals u(n) and v(n). To compensate the effect of the channel, we intend to use a linear filter with finite impulse response. In this case, the power at the output of the filter is equal to
$$P_y = \mathbf{w}^H R_x \mathbf{w} = \mathbf{w}^H R_u \mathbf{w} + \sigma_v^2 \mathbf{w}^H \mathbf{w} = P_{yu} + P_{yv}.$$
The ratio between the signal subspace power and the noise subspace power is then defined by the equation
$$\frac{P_{yu}}{P_{yv}} = \frac{1}{\sigma_v^2} \cdot \frac{\mathbf{w}^H R_u \mathbf{w}}{\mathbf{w}^H \mathbf{w}}.$$
According to the min-max definition of eigenvalues, $\lambda_i = \frac{\mathbf{q}_i^H R \mathbf{q}_i}{\mathbf{q}_i^H \mathbf{q}_i}$, we may state [18] that
$$\left( \frac{P_{yu}}{P_{yv}} \right)_{\max} = \frac{\lambda_{\max}}{\sigma^2} \quad \text{at} \quad \mathbf{w} = \mathbf{q}_{u\,\max}.$$
Because the eigenspace of R_u is an independent subspace of the eigenspace of R_x, it is possible to determine the required weights by an eigendecomposition of R_x, using q_x_max as the filter parameter vector w. The same method can be applied to achieve effects other than maximising the signal-to-noise ratio. The filter weights required to remove a single interference frequency are given by q_x_min (the eigenvector that corresponds to the minimum eigenvalue). The Jacobi method is convenient for calculating the necessary eigendecomposition, because it is inherently parallel. It works by performing a sequence of orthogonal similarity updates Q^T R Q → A with the property that each new R, although full, is "more diagonal" than its predecessor [11]. After some iterations the off-diagonal elements are small enough to be declared zero. The algorithm itself is presented in section 3. In the present approach, the FIR filter is realized as a narrow band-stop filter. This "notch" filter can be adapted to minimize its output, which causes the resonant frequency of the band-stop filter to be equal to the frequency of the sinusoid. Generally, a number of cascaded IIR notch filters can be used to track multiple sinusoids. In our approach, we have successfully simulated a 32nd-order FIR filter using a Jacobi-based adaptation algorithm. It is also shown that, when tracking multiple sinusoids, the individual sinusoids can be isolated even though they have different signal power levels. Such a system has a limitation in the case where the sinusoidal frequencies are much lower than the sampling frequency.

Figure 6: Single Interference Signal Filtering Simulation

The notch filter is realized as a 32nd-order FIR filter where the input signal is analyzed by the Jacobi-based algorithm and where the weights of the filter are calculated as the elements of the eigenvector corresponding to the smallest eigenvalue. The results of this simulation are shown in Figures 6 and 7. In Figure 6, a single sinusoid is being tracked and the corresponding transfer function of the FIR filter is presented. When tracking multiple sinusoids, cascaded IIR solutions are usually proposed. In simulations of the Jacobi-based parallel Adaptive Notch Filter, multiple interference sinusoids were effectively eliminated with only one FIR filter. The simulations of the parallel systolic structure were performed on a PC 486 workstation in sequential form. In all cases, the notch frequencies were normalised with respect to the sampling frequency. In the first example (Figure 6) the desired signal is evenly distributed in frequency, while the original signal is disturbed by a single-frequency interference signal. The Interference-to-Signal Ratio (ISR) was around 30 dB. In the second example (Figure 7) we had two interference frequencies; the ISR in this case varied from 20 to 30 dB for each interference signal.
Figure 7: Double Interference Signal Filtering Simulation
A single adaptive notch eigenfilter capable of tracking multiple sinusoids has been presented. In the case of multiple sinusoids it was shown that the proposed structure, with the proposed algorithm, can isolate individual sinusoids in the presence of other sinusoids with large power. From the filter coefficients, or directly from the decomposition results, the frequencies of the input sinusoids can be determined as well.
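To make the adaptation procedure concrete, the following Python sketch (illustrative only; the helper names are ours, and the plain cyclic-by-rows Jacobi loop is a serial stand-in for the systolic array of Section 3) estimates the autocorrelation matrix of the input, diagonalises it by Jacobi rotations and takes the eigenvector of the smallest eigenvalue as the FIR notch weights.

import numpy as np

def autocorrelation_matrix(u, M):
    """Sample estimate of the M-by-M autocorrelation matrix of a real signal u."""
    N = len(u)
    R = np.zeros((M, M))
    for n in range(M - 1, N):
        x = u[n - M + 1:n + 1][::-1]          # [u(n), u(n-1), ..., u(n-M+1)]
        R += np.outer(x, x)
    return R / (N - M + 1)

def jacobi_eig(R, sweeps=10, tol=1e-12):
    """Cyclic-by-rows Jacobi eigendecomposition of a symmetric matrix R.
    Returns (eigenvalues, eigenvectors); a serial stand-in for the systolic array."""
    A = R.copy()
    M = A.shape[0]
    V = np.eye(M)
    for _ in range(sweeps):
        off = np.sum(A**2) - np.sum(np.diag(A)**2)
        if off < tol:                          # off-diagonal energy small enough
            break
        for p in range(M - 1):
            for q in range(p + 1, M):
                if abs(A[p, q]) < tol:
                    continue
                # rotation angle that zeroes A[p, q] in the similarity update
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(M)
                J[p, p] = c; J[q, q] = c
                J[p, q] = s; J[q, p] = -s
                A = J.T @ A @ J                # orthogonal similarity update
                V = V @ J                      # accumulate eigenvectors
    return np.diag(A), V

def notch_eigenfilter_weights(u, M=32):
    """FIR notch weights: eigenvector of the smallest eigenvalue of R."""
    R = autocorrelation_matrix(u, M)
    lam, Q = jacobi_eig(R)
    return Q[:, np.argmin(lam)]

# Usage: a broadband signal corrupted by a strong sinusoidal interference.
n = np.arange(4000)
u = 0.05 * np.random.randn(len(n)) + np.sin(0.3 * np.pi * n)
w = notch_eigenfilter_weights(u, M=32)
y = np.convolve(u, w, mode='valid')            # interference largely suppressed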
4.2 Two-Dimensional Least Square SVD Algorithm

In this last example, the use of singular value decomposition in two-dimensional filtering applications will be presented. First, the Wiener solution will be extended to a two-dimensional problem, introducing a special formulation of the image signal matrix. The problem will be solved algebraically through the implementation of a two-dimensional convolution filter. The Wiener normal equation will be solved using the singular value decomposition of the image
signal matrix. The effectiveness of the method suggested will be illustrated on a practical filtering problem.
Introduction

We have decided to represent the degradation model of our imaging system in the form of discrete linear point-spread degradation functions. For a discrete image F degraded to an image G and subjected to additive noise N, we may write

$$g(x, y) = \sum_{u=1}^{N} \sum_{v=1}^{N} h(x, y, u, v)\, f(u, v) + n(x, y)$$
or, alternatively, in tensor notation,

$$[G] = [[H]]\{[F]\} + [N],$$

with two-dimensional matrices G, F and N and the four-index operator H [22]. The objective of restoration is to find an inverse to the degradation function. The solution presented here is not valid for all cases of image degradation; in some cases it is possible to use a convolution filter to restore the image. The solution may then be represented in the form $\hat{F} = W ** G$, with the symbol ** standing for two-dimensional convolution. We should mention that a solution in this form exists only for linear, space-invariant distortion functions with a finite (space-limited) response. The general adaptive filter structure for this case is illustrated in Figure 8. The filter operates on a real image (signal matrix) X that is corrupted with noise; the desired signal (reference image) is also provided. The filtering parameters can be represented in the form of an M×M matrix W, and the filtering process may be represented as the convolution of the input image X with the matrix W. During the adaptation, the filtering weights may be changed in order to obtain the optimal solution. The filtering result is given by

$$\hat{f}(x, y) = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} w(i, j)\, g(x + i - k, y + j - k), \qquad M = 2k + 1.$$
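For reference, the short Python sketch below (the function name is ours) applies an M-by-M weight matrix W, with M = 2k+1, to an image in the way the summation above describes; this is the operation the adapted filter finally performs.

import numpy as np

def convolve2d_filter(g, W):
    """Apply an M-by-M filter W (M = 2k+1) to image g:
    f_hat(x, y) = sum_i sum_j W[i, j] * g(x + i - k, y + j - k)."""
    M = W.shape[0]
    k = (M - 1) // 2
    padded = np.pad(g, k, mode='edge')         # simple border handling (assumption)
    f_hat = np.zeros(g.shape, dtype=float)
    rows, cols = g.shape
    for i in range(M):
        for j in range(M):
            f_hat += W[i, j] * padded[i:i + rows, j:j + cols]
    return f_hat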
The difference between the desired and the resulting image,

$$e(x, y) = f(x, y) - \hat{f}(x, y),$$

is called the estimation error. From Wiener filter theory, the optimal filtering coefficients W are defined by the minimum mean-square error criterion: the objective function $J(W) = E[e^2(x, y)]$ should be minimised over W to obtain the optimum filter.
Wiener Solution

The idea is well known from 1-D adaptive filtering, where instantaneous estimates of the gradient of the error surface J(W) are used to approach the optimum solution iteratively; the resulting procedure is popularly called the LMS algorithm. It is possible to extend the algorithm to both the x and y image dimensions, iteratively searching for the solution either column-wise or row-wise. The procedure is numerically convenient because of its low storage and computing requirements. The problem with this approach is that the instantaneous estimates of the error surface have relatively large variances; the estimated gradient vectors may then not always point towards the global optimum, which can cause unstable behaviour of the algorithm. The stability may be improved by using a smaller adaptation step size; however, this seriously affects the convergence rate of the procedure.
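A minimal Python sketch of this iterative alternative is given below (the step size mu and the helper name are our own choices; with too large a step the recursion diverges, which is precisely the stability problem mentioned above).

import numpy as np

def lms_2d(g, f, M=7, mu=1e-4, passes=1):
    """Row-wise 2-D LMS adaptation of an M-by-M weight matrix W (M = 2k+1).
    g: degraded image, f: reference image. Returns the adapted W."""
    k = (M - 1) // 2
    W = np.zeros((M, M))
    gp = np.pad(g, k, mode='edge')             # border handling (assumption)
    rows, cols = g.shape
    for _ in range(passes):
        for x in range(rows):                  # raster scan, row by row
            for y in range(cols):
                patch = gp[x:x + M, y:y + M]   # M-by-M neighbourhood of (x, y)
                e = f[x, y] - np.sum(W * patch)        # instantaneous estimation error
                W += mu * e * patch            # gradient-descent weight update
    return W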
Figure 8: SVD-based 2-D adaptive filter
Another approach captures the information from the entire image, determining the shape and orientation of the global error surface and calculating the best possible weight matrix in a single operation. The mean-squared error surface can be described as

$$J(w) = c - 2 w^T p + w^T R\, w.$$

For stationary images this is exactly a second-order function of the filter coefficients w. The function can be visualised as a parabolic surface with a unique minimum $w_0$, called the optimum in the minimum mean-squared sense [19]. Setting the gradient $\nabla_w J = -2p + 2Rw$ to zero yields

$$R\, w_0 = p.$$
This is the well-known normal equation that defines the optimum solution for the convolution weight matrix W. A matrix inversion is required. Although the matrix R is generally of smaller size than the matrix H, the inversion may still be an ill-conditioned problem, and the solution for W will then be unstable. That is the reason why special inversion techniques should be employed to solve the problem. Estimates of the correlation matrix R and the cross-correlation vector p are defined by introducing the input signal matrix

$$X = [x(1,1), x(2,1), \ldots, x(1,2), \ldots, x(N,N)]^T$$

with the input signal vector

$$x(x, y) = [g(x-k, y-k), g(x-k+1, y-k), \ldots, g(x, y), \ldots, g(x+k, y+k)]^T.$$

Good estimates of the autocorrelation matrix and the cross-correlation vector are the products

$$\hat{R} = \frac{1}{N-M+1} X^T X, \qquad \hat{p} = \frac{1}{N-M+1} X^T f.$$

Ignoring constant scalar premultiplications, the normal equation is finally expressed as

$$X^T X\, w_0 = X^T f, \qquad w_0 = (X^T X)^{-1} X^T f.$$
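The only delicate step in this formulation is the assembly of X and of the desired-response vector from the images. The Python sketch below (helper names are ours; the edge-padding border handling is an assumption) builds one neighbourhood row of X per pixel and solves the least-squares problem with a standard routine.

import numpy as np

def build_data_matrix(g, f, M):
    """Stack M*M-long neighbourhood vectors of g into X (one row per pixel)
    and the corresponding reference pixels of f into d."""
    k = (M - 1) // 2
    gp = np.pad(g, k, mode='edge')
    rows, cols = g.shape
    X = np.empty((rows * cols, M * M))
    d = np.empty(rows * cols)
    idx = 0
    for x in range(rows):
        for y in range(cols):
            X[idx] = gp[x:x + M, y:y + M].ravel()   # neighbourhood of pixel (x, y)
            d[idx] = f[x, y]                        # desired (reference) pixel
            idx += 1
    return X, d

# Hypothetical usage, with g the degraded and f the reference image:
# X, d = build_data_matrix(g, f, M=7)
# w0, *_ = np.linalg.lstsq(X, d, rcond=None)        # solves X^T X w0 = X^T d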
Singular Value Decomposition

For a stable inversion of the matrix $R = X^T X$, one possibility is to use SVD pseudo-inversion. Rather than computing the straight inverse $R^{-1} = (X^T X)^{-1}$, a better approach is to apply the pseudo-inversion technique directly to the matrix X. The solution for W can then be expressed as

$$w_0 = X^+ d,$$

where the pseudo-inverse $X^+$ is defined in terms of the factors of the singular value decomposition $U^T X V = \Sigma$ of X. The procedure is numerically stable and its solution is unique in the sense that its vector norm is minimum [18]. Explicitly,

$$w = \sum_{\substack{i=1 \\ \sigma_i \neq 0}}^{M^2} \frac{u_i^T d}{\sigma_i}\, v_i,$$

and the convolution operator W is created by restacking the values of the vector w into the M × M matrix

$$W = \begin{bmatrix} w(1) & \cdots & w(M) \\ \vdots & \ddots & \vdots \\ w(M(M-1)+1) & \cdots & w(M^2) \end{bmatrix}.$$
The non-iteratively calculated filtering parameters are optimal for the specific combination of image and distorted image. They may be applied directly in a classical two-dimensional convolution filter.
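Assuming X and d have been assembled as in the previous sketch, the pseudo-inverse solution and the restacking of w into the convolution matrix W might be written as follows (the relative truncation threshold for small singular values is our own choice).

import numpy as np

def svd_filter_weights(X, d, M, rel_tol=1e-10):
    """w0 = X^+ d via the SVD of X, restacked into the M-by-M convolution matrix W.
    Singular values below rel_tol * sigma_max are treated as zero."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > rel_tol * s[0]                  # discard sigma_i ~ 0 for a stable pseudo-inverse
    w = (Vt[keep].T * (1.0 / s[keep])) @ (U[:, keep].T @ d)   # sum_i v_i (u_i^T d) / sigma_i
    return w.reshape(M, M)                     # row-wise restack: W[i, j] = w(i*M + j)

# Hypothetical usage, continuing the previous sketch:
# W = svd_filter_weights(X, d, M=7)
# restored = convolve2d_filter(g, W)           # 2-D convolution sketch given earlier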
Implementation of the procedure

The procedure may be implemented as a systolic array algorithm. The actual algorithm is assembled from the partial linear algebra solutions presented in Section 3. Note that the array performing the singular value decomposition is almost identical to the eigendecomposition array.
Figure 9: Sharpening of a blurred image using the 2D LS SVD algorithm: (a) original image; (b) image blurred with a 5-by-5 low-pass convolution filter, with additive noise; (c) image (b) restored with a 7-by-7 2D LS SVD filter; (d) image (b) inverse filtered; (e) image blurred without noise; (f) image (e) restored
Figure 9 illustrates the sharpening of blurred images using the two-dimensional least-squares SVD algorithm. The original image is shown in Figure 9a. The image was blurred using a 5×5 low-pass convolution filter, and some uncorrelated noise was added after the blurring procedure (Figure 9b). The image was restored using a 7×7 adaptive algorithm; the result is shown in Figure 9c. The result of inverse filtering the same image is presented in Figure 9d. From these results we can conclude that the proposed algorithm is less sensitive to noise than the classical inverse filtering procedure. The image blurred without the presence of noise and that noiseless image sharpened using the 2D LS SVD algorithm are shown in Figures 9e and 9f, respectively.
The same procedure applied to noise removal is shown in Figure 10. From the images it is clear that the 2D LS SVD algorithm does not simply converge to the expected low-pass solution: a large amount of the uncorrelated noise has been removed from the image while the sharp edges of the image are preserved. The softening of image contours is a common problem when low-pass filters are used for noise removal (Figure 10d). The simulation results show that the Wiener filtering principle can be implemented successfully in image restoration, and that methods well known from linear algebra may be applied instead of classical methods based on the Fourier transform. The effectiveness of the procedure may be improved further by using special updating techniques.
Conclusion

A characteristic of almost all the linear algebra operations proposed here for use in digital signal processing applications is that they consist of a huge number of relatively basic mathematical operations. The fact that the operations are repetitive, yet applied to a wide set of data, inspired us to employ several processor elements performing the same task on separate data elements in parallel. The special properties of the processing problems considered allow us to construct a massive array of identical processor elements which concurrently perform the necessary numerical operations. Several parallel computer architectures are well known; they vary in the processor elements applied, the reconfigurability, the data interchange connections, etc. The architecture to be applied to a specific problem depends mostly on the problem itself. Since digital signal processing demands high-speed computing with fixed procedures at relatively low cost, general-purpose parallel computers are not convenient. Digital signal processing is a data-oriented computing problem, so architectures with global data interchange are to be avoided. What is really needed is an array of locally interconnected processor elements with local memory, in which the processor elements synchronously perform the same set of operations on the data structure. This architecture is the systolic array; the rhythmical operation of the array is reminiscent of the systolic rhythm of the heart. The basic approach to the mapping techniques and some possible applications were presented in this paper. However, this was only a brief introduction to the world of special-purpose VLSI systolic architectures; more details on the described procedures, as well as on optimisation techniques not presented here, may be found in the literature.
Bibliography

[1] H. T. Kung and C. E. Leiserson, Systolic Arrays (for VLSI), Tech. Rep. CS-79-103, Carnegie Mellon University, Pittsburgh, PA, Apr. 1978.
[2] J. A. B. Fortes, K. S. Fu and B. W. Wah, Systematic approaches to the design of algorithmically specified systolic arrays, Proc. IEEE ICASSP '85, IEEE Computer Society Press, March 1985, pp. 300-303.
[3] P. R. Cappello and K. Steiglitz, Unifying VLSI array designs with geometric transformations, Proc. 1983 Int. Conf. on Parallel Processing, 1983, pp. 448-457.
[4] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[5] S. Y. Kung, K. S. Arun, R. J. Gal-Ezer and D. B. Rao, Wavefront array processor: Language, architecture and applications, IEEE Trans. Comput., vol. C-31, 1982, pp. 1054-1066.
[6] M. Gušev, Processor Array Implementations of Systems of Affine Recurrence Equations for Digital Signal Processing, PhD dissertation, University of Ljubljana, June 1992.
[7] C. E. Leiserson and J. B. Saxe, Optimizing synchronous systems, J. VLSI and Computer Systems, 1983, pp. 41-67.
[8] S. Y. Kung, VLSI array processor for signal processing, Conf. Advanced Res. in Integrated Circuits, MIT, Cambridge, Jan. 28-30, 1980.
[9] H. T. Kung, Notes on VLSI computation, in Parallel Processing Systems, D. J. Evans (ed.), Cambridge University Press, 1983, pp. 339-356.
[10] A. H. Sameh, Numerical parallel algorithms - a survey, in High Speed Computer and Algorithm Organization, Academic Press, 1977, pp. 207-228.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 1989.
[12] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, London, 1965.
[13] D. I. Moldovan, On the analysis and synthesis of VLSI algorithms, IEEE Trans. Comput., vol. C-31, 1982, pp. 1121-1126.
[14] R. P. Brent, F. T. Luk and C. Van Loan, Computation of the singular value decomposition using mesh-connected processors, J. VLSI and Computer Systems, vol. 1, no. 3, 1985, pp. 250-260.
[15] R. P. Brent and F. T. Luk, The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays, SIAM J. Sci. Stat. Comput., vol. 6, no. 1, January 1985.
[16] L. Thiele, Computational arrays for cyclic-by-rows Jacobi algorithms, in SVD and Signal Processing: Algorithms, Applications and Architectures, Elsevier Science Publishers B.V. (North-Holland), 1988.
[17] U. Burnik, G. Cain and J. Tasič, On the parallel Jacobi method based eigenfilters, COST 229 WG4 Workshop on Parallel Computing, Funchal, Portugal, 1993.
[18] S. Haykin, Modern Filters, Macmillan, New York, 1989.
[19] S. Haykin, Adaptive Filter Theory, Prentice Hall, Englewood Cliffs, NJ, 1991.
[20] J. Tasič et al., Eigenanalysis in Adaptive FIR Filtering, Internal Report, University of Westminster, January 1993.
[21] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison-Wesley, 1992.
[22] H. C. Andrews and B. R. Hunt, Digital Image Restoration, Prentice Hall, Englewood Cliffs, NJ, 1977.