Bit-Level Systolic Architectures for High Performance IIR Filtering

Journal of VLSI Signal Processing, 1, 9-24 (1989). © 1989 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

S.C. KNOWLES AND J.G. McWHIRTER, Signal Processing Group, Royal Signals and Radar Establishment, Saint Andrews Road, Great Malvern, Worcestershire, England

R.F. WOODS AND J.V. McCANNY, Department of Electrical and Electronic Engineering, The Queen's University, Stranmillis Road, Belfast, Northern Ireland. Received October 13, 1988; Revised February 24, 1989.

Abstract. Several novel systolic architectures for implementing densely pipelined bit parallel IIR filter sections are presented. The fundamental problem of latency in the feedback loop is overcome by employing redundant arithmetic in combination with bit-level feedback, allowing a basic first-order section to achieve a wordlength-independent latency of only two clock cycles. This is extended to produce a building block from which higher order sections can be constructed. The architecture is then refined by combining the use of both conventional and redundant arithmetic, resulting in two new structures offering substantial hardware savings over the original design. In contrast to alternative techniques, bit-level pipelinability is achieved with no net cost in hardware.

1. Introduction

The use of bit-level systolic arrays for the high performance VLSI implementation of signal processing functions is now well established. A number of devices have been successfully fabricated and some are available as commercial products [2,3]. While the technique has been applied fairly widely to the design of non-recursive components such as FIR filters, correlators and circuits for computing the Discrete Fourier Transform, it has so far had limited application to the design of devices such as IIR filters which involve recursive computation. Bit-level systolic arrays for this type of application are much more difficult to design because the effect of introducing M pipelined stages into such a system is to introduce an M cycle delay into the feedback loop. For a bit serial design [4,5], this does not present a serious problem since the clock rate exceeds the input word rate by a factor ~p (where p denotes the wordlength) and, for a fully pipelined system, M is also ~p. The overall loop latency for a bit serial device is therefore ~one word delay. However, for a bit parallel design the input word rate is equivalent to the clock rate and so the pipeline latency corresponds to M word delays. The potential speed of the device is therefore reduced by a factor of M. The pipelining does of course allow M independent computations to be carried out in parallel, but this mode of operation is not always useful in practice.

A significant contribution toward the pipelining of recursive filters has recently been made by Parhi and Messerschmitt [6,7] in proposing "scattered lookahead" recursive filters which are both pipelinable and stable. Related earlier work by Loomis and Sinha [8] did not guarantee stability. Unfortunately the basic lookahead technique requires an increase in hardware complexity by a multiplicative factor ~M (where M is again the number of pipeline stages). The overhead is obviously severe if fine-grain (e.g., bit-level) pipelining is required. Parhi and Messerschmitt have gone some way toward minimizing this overhead by proposing decomposition techniques to implement the lookahead algorithm with logarithmic rather than linear multiplicative complexity. Parhi and Hatamian have used this technique to implement a 100 MHz fourth order IIR filter as a single chip, believed to be the first pipelined IIR filter to be implemented in silicon. Another approach to high sample rate filter design is the block processing technique. The best known block filter structures are incremental block filters, reported by Wu and Cappello [9] for direct form filters and Parhi and Messerschmitt [10] for state space filters. Parhi et al. have shown how to combine this technique with fine-grain pipelining [11], but again the complexity of these structures scales linearly with the density of pipelining.

In this paper we present a novel and hardware-efficient approach to the problem of pipelining bit parallel IIR filter sections at the bit level. This has resulted from a detailed analysis of the mechanisms of partial product formation and accumulation at the bit level, and necessitates the use of redundant arithmetic. Our approach implements the recursion directly but with minimal latency; i.e., the number of cycles which must elapse between the beginning of successive stages of the recursion is small (typically two or three) and independent of the wordlength. We show that the necessary redundant arithmetic can be performed using only conventional binary full adders and subtracters. Furthermore we show that this allows fine-grain pipelinability to be achieved with zero hardware overhead, in notable contrast to the techniques previously proposed [6-11].

(This paper is a revised and extended version of a paper by the same authors presented at the Second International Conference on Systolic Arrays, San Diego, May 1988 [1].)

2. Pipelining Recursive Computations

The difficulty involved in pipelining any recursive operation which involves bit parallel data can be illustrated by considering a first order filter section. In this case the relevant computation can be expressed in the form

yn = b1.yn-1 + un    (1)

where

un = a0.xn + a1.xn-1    (2)

xn is a continuously sampled data stream and a0, a1

and b1 are coefficients which determine the filter frequency response. The non-recursive component defined by equation (2) can easily be implemented using conventional pipelined multiplier-adders. In this paper, we will concentrate on the design of a circuit for computing the recursive term in equation (1) on the assumption that un is available as an input. One way to implement the recursion defined in equation (1) would be to use a conventional pipelined "shift and add" multiplier array with the output fed back to the input. An example of such a circuit with four-bit (binary) input data is illustrated in figure 1. In this diagram it has been assumed that all input values are fractional and a convention is adopted whereby the most significant fractional bit is denoted by superscript 1, the next most significant bit by superscript 2, and so on. This convention is retained throughout the paper. For simplicity figure 1 assumes unsigned values, and the coefficient range b1 < ½ has been chosen so that the output yn is always less than 2 with a fractional input un. The most significant bit (msb) of b1 is therefore b1², the msb of un is un¹, and the msb of yn-1 is yn-1⁰. The solid black circles in figure 1 represent latches and there are two types of combinatorial logic cells: type 1 cells are gated full adders, type 2 cells are simple full adders.

Fig. 1. Pipelined shift and add array with feedback.

The circuit operates as follows. The array of type 1 cells performs the multiplication of b1 by yn-1. Each bit of the output yn-1 is fed back through a series of latches and then broadcast to the appropriate row of cells in the array. The function of each row of type 1 cells is to multiply the entire coefficient word by a single feedback bit and to add this product to the accumulating result passed down from the row above. Any ripple-carry bits which are generated in the process are propagated from right to left along the row. After settling, the row results are latched in the vertical direction before being passed to the row below. As will be noted from figure 1, each row is left-shifted by one cell with respect to the row above in order to ensure that bits of the same significance are added together. The multiplying array of type 1 cells operates in a least significant bit (lsb) first fashion; the lsb of the output is fed back and broadcast first to the top row of the array, the second bit of the output is broadcast to the second row one cycle later, and so on. The non-recursive component un is added at the base of the array by the row of type 2 cells. The final output is thus formed five cycles after the corresponding operands were input to the circuit. In situations where the coefficients do not need to be updated "on the fly" the latches on the coefficient bits in figure 1 (and all other circuits described in this paper) are optional and can be omitted.

Two factors limit the speed of the circuit in figure 1. The first is that it takes five cycles (in general p+1 cycles, where p is the wordlength) before the bits which must be fed back into the array for the next iteration become available at the array outputs. Note that the calculation in equation (1) yields a 9 bit result, so it is necessary to round or truncate this to 4 feedback bits in order to prevent the size of the output word yn-1 from growing in the direction of lower significance. Thus the first five bits output from the array are discarded (these lower significance bits emerge at the right-hand edge of the array and are not shown in figure 1). The five (in general p+1) cycle latency of the system means that computation of the value yn+1 must be held up for five (p+1) cycles until the bits of yn are available. The circuit in figure 1 can therefore only produce a new result every five (p+1) cycles. This limitation is obviously severe for most practical systems where the value of p is typically between 16 and 24. The second factor limiting the performance of the circuit in figure 1 is the ripple-propagation of carries along each row before the row outputs can be latched in parallel. Again the problem becomes more severe for longer wordlengths. This latter problem can be solved by conventional techniques such as carry lookahead (rather inelegant in this application) or carry-save, which we will discuss later in this paper.

The problem of pipeline latency within the feedback loop appears to be quite fundamental to the recursion process itself. In fact it is only fundamental to the use of conventional arithmetic, as we will now demonstrate. An apparent solution would be to rearrange the computation so that it proceeds msb first rather than lsb first, so that the most significant bits of the output would be generated first and could be fed back immediately. A circuit for performing this type of computation is illustrated conceptually in figure 2. It will be noted that the successive rows in this array are right-shifted (rather than left-shifted as in figure 1) to ensure that all accumulating partial products of the same significance pass in the vertical direction.

Fig. 2. Conceptual msb first array with feedback.

If the output bits are fed back immediately, as illustrated in figure 2, then the problem of latency in the feedback loop and the resultant problem of low throughput rate can be avoided. The flaw in the approach illustrated in figure 2 is that a circuit of this type cannot be implemented using conventional binary arithmetic since this would necessitate the propagation of carries from the lsb to the msb stages, as indicated by the dashed lines. Such propagation violates the cut theorem [12] and would therefore disallow the illustrated pipelining of the circuit. The problem can be avoided if a redundant number system is used because arithmetic redundancy allows carry propagation chains to be broken with the result that the pipelined calculation can be carried out msb first. In this paper we propose a range of novel circuits based on this approach. The next section introduces one type of redundant number system suitable for application to the recursion latency problem.
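Before turning to circuit detail, the target computation itself is easy to pin down in software. The following Python sketch is a plain behavioural model of the recursion in equations (1) and (2) only; it says nothing about pipelining or number representation, and the function name is our own illustration, not part of the paper's designs.

```python
# Behavioural model of the first order section:
#   y[n] = b1*y[n-1] + u[n],  u[n] = a0*x[n] + a1*x[n-1]
# (equations (1) and (2)).  Fixed-point wordlength effects are ignored.

def first_order_section(x, a0, a1, b1):
    """Return the filtered sequence y for input sequence x."""
    y_prev = 0.0   # y[n-1], zero initial condition
    x_prev = 0.0   # x[n-1]
    out = []
    for xn in x:
        un = a0 * xn + a1 * x_prev   # non-recursive part, eq. (2)
        yn = b1 * y_prev + un        # recursive part, eq. (1)
        out.append(yn)
        x_prev, y_prev = xn, yn
    return out

# Impulse response: with a0 = 1, a1 = 0 the output decays as b1**n.
h = first_order_section([1.0, 0.0, 0.0, 0.0], a0=1.0, a1=0.0, b1=0.5)
```

The feedback dependence of yn on yn-1 in the loop body is precisely what the hardware must resolve within the pipeline latency.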

3. Redundant Signed-Digit Number Representations

Signed-Digit Number Representations (SDNRs) were originally introduced by Avizienis [13] to reduce carry propagation in parallel numerical operations such as add, subtract, multiply, and divide. They differ from conventional numbers in that the individual digits comprising a number may assume negative as well as positive values, the number of possible digit values being termed the cardinality of the number system. Conventional number representations have cardinality equal to the radix. For example, in the conventional radix-10 (decimal) system the digits are restricted to the set {0..9} which has 10 elements. SDNRs may have cardinality greater than the radix, in which case they are termed redundant because any given algebraic value may have several possible representations. For example, a radix-10 SDNR might use digits from the set {9̄..0..9} which has a cardinality of 19 (an overbar is conventionally used to indicate negative digits). This would allow the decimal value 147 to be alternatively represented as 153̄ (i.e., 100 + 50 - 3), 25̄3̄ (200 - 50 - 3), etc. In this paper we are concerned principally with the radix-2 SDNR (also known as the signed-binary number representation, or SBNR) because, as with conventional binary logic, a low radix leads to fast and simple hardware implementations. The redundant SBNR allows digits to assume the values 1, 0 or -1 (denoted by 1̄). Note that there is no need for an explicit mechanism (such as the two's-complement system in conventional binary) to handle the overall sign of a signed-digit number. The sign is determined by the sign of the most significant non-zero digit.

Redundancy in a number system allows methods of addition and multiplication to be devised in which each digit of the result is (typically) a function only of the digits in two or three adjacent positions of the operands and does not depend on the other digits in any way. This feature has a number of important consequences. It allows arithmetic operations to be carried out completely in parallel with no carry propagation from the least significant digit (lsd) through to the most significant digit (msd). The time required for an operation such as parallel addition is therefore constant and does not depend on the word length. Furthermore the calculation of least significant digits may be avoided in situations where they are not required since the calculation may be performed msd first.

4. The Fully-Redundant First Order IIR Filter Section
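The redundancy of the SBNR described in the preceding section is easy to demonstrate numerically. The short Python sketch below is our own illustration, not part of the paper's hardware: it evaluates an SBNR digit string and shows two distinct encodings of the same value.

```python
# Evaluate a radix-2 signed-digit (SBNR) string, most significant
# digit first, with digits drawn from {-1, 0, 1}.  Integer weights
# are used purely for illustration; the paper's circuits work with
# fractional digit weights.

def sbnr_value(digits):
    v = 0
    for d in digits:
        assert d in (-1, 0, 1), "SBNR digits are -1, 0 or 1"
        v = 2 * v + d
    return v

# Redundancy: the strings 0,1,1 and 1,0,-1 both denote decimal 3.
print(sbnr_value([0, 1, 1]), sbnr_value([1, 0, -1]))  # 3 3
```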

An architecture for computing the recursion in equation (1) is illustrated in figure 3. This circuit is based on the type of shift and add scheme shown in figure 2, uses a radix-2 redundant number system (i.e., r = 2) and has a most significant digit first data organization. The redundant digit set {1̄, 0, 1} is used to represent the coefficient b1, the input un and the result yn-1. For ease of illustration, it has been assumed that |un| < ½ and |b1| < ½ so that the output remains in the range |yn-1| < 1. This restriction is not fundamental, as will be demonstrated later. Note that the type 2 cell in figure 2 has been replaced by three new cells labelled 2A, 2B and 2C which enable the carry propagation chain at the left-hand edge of the circuit to be broken. Note also that the digits of yn-1 in the circuit of figure 3 allow a range of |yn-1| < 2 although it has already been established that the value of yn-1 will never exceed ±1. This additional headroom is a necessary part of the redundant arithmetic technique. In figure 3, it has been assumed that b1, yn-1 and un are all four-digit (in general p-digit) numbers. The circuit consists of a 4×4 (p×p) array of type 1 cells connected to a lattice of 12 (3p) type 2 cells as shown. Both the coefficient b1 and the accumulating sum un are input at the top of the array in a bit parallel manner, the output digits being generated at the left-hand side of the array. Each output digit is immediately fed back and broadcast to the appropriate row of cells as before. The main cell in the array, the type 1 cell, takes as its inputs digit b of the coefficient word b1 and digit y of the result word yn-1. It multiplies these together and

Fig. 3. Architecture of a first order section.

adds the product to an incoming digit y′in to produce an output digit y′out. The digit y′out, which represents one digit of the accumulating result yn, is then passed down vertically to a cell in the row below. In the process two other digits tout and t′out (transfer digits, analogous to carry/borrow digits) are generated and passed to the cell on the left. All digits which pass downwards are latched as indicated by the black circles, but those which propagate in the horizontal direction are not (for reasons which will become apparent). As in figure 2, all accumulating partial products which are generated in the same vertical column of cells have equal significance.

The use of a redundant number system in the circuit of figure 3 means that carry propagation (transfer digit propagation) can be reduced to a minimum. It is this essential feature which allows the msd first data organization. However, the avoidance of carry propagation requires that the limits of the various values entering and leaving the cells be carefully chosen and that the addition function of the type 1 cell be split into three stages. This aspect is illustrated more clearly in figure 4 which shows the architecture in more detail. The structure of the type 1, type 2A, type 2B, and type 2C cells is indicated by means of a dotted line drawn around the appropriate subcells and, for ease of illustration, the coefficient lines have been excluded from the diagram. Note that a number of the least significant cells/subcells in figure 3 have been omitted from the more detailed diagram in figure 4 which shows quite clearly that they do not contribute to the output of the circuit.

The first stage of the calculation performed within each type 1 cell is the multiplication of a digit b by a digit yin as defined by

v = b.yin    (3)

followed by an addition of the form

wout + r.tout = y′in + v    (4)

This takes place in the type A subcell. The transfer digit tout in equation (4) is generated to ensure that the result wout (which could otherwise produce a value between -2 and +2) may be represented by a digit in the set {2̄, 1̄, 0}. This requires tout to assume the binary value 0 or 1. If tout were simply passed on as an input to the type A subcell immediately on its left, then carry propagation (or transfer digit propagation) would occur and defeat the objective of using a redundant number system. The transfer digit tout is passed on, instead, as an input to the second stage of computation in the adjacent type 1 cell.

The second stage of the calculation within each type 1 cell is that of addition as defined by

w′out + r.t′out = win + tin    (5)

Fig. 4. Detailed structure of the fully redundant first order section.

This takes place in the type B subcell. Addition of the digits win and tin (the transfer digit from the type A subcell on the right) results in a value which lies within the range -2 to +1. This result w′out is kept within the digit set {0, 1} through the generation of a second transfer digit t′out which can take the value 1̄ or 0 and is passed on as an input to the third stage of computation in the type 1 cell to the left.

The third stage of the computation in each type 1 cell is an addition of the form

y′out = w′in + t′in    (6)

This takes place in the type C subcell. The digit w′in is the sum from the type B subcell above and the digit t′in is the transfer from the type B subcell on the right.

The resulting digit y′out lies in the set {1̄, 0, 1} and constitutes the output of the type 1 cell. In order to complete the computation on each row of the array, it is necessary to combine the digits y′out, tout, and t′out produced by the most significant type 1 cell. This is performed by the type 2A, 2B, and 2C cells. The type 2A cells perform an addition of the form

w″out + r.t″out = y′in + tin + t′in    (7)

where w″out is a digit from the set {2̄, 1̄, 0} and the transfer digit t″out is either 0 or 1. This may be implemented very simply using one type A and one type C subcell as shown in figure 4. The type 2B cells compute the sum

w‴out + r.t‴out = w″in + t″in    (8)

where w‴out takes the value 0 or 1 and the transfer digit t‴out takes the value 1̄ or 0. This may be implemented using a single type B subcell as indicated in figure 4. Finally, the type 2C cell performs the addition

yout = w‴in + t‴in    (9)

which may be carried out using a single type C subcell as shown. The output of the circuit is generated in msd first skew-parallel format with a single clock cycle skew between consecutive digits of the same output word. Note that the digit which must be fed back along each row is available two cycles after the inputs were applied to that row, and, accordingly, a throughput rate of one input sample every two cycles can be achieved.

Figure 4 clearly illustrates the second important feature resulting from the adoption of a redundant number system; transfer digits propagate diagonally rather than horizontally through the array of type 1 cells. Thus the outputs from any given type 1 cell can only influence the two type 1 cells immediately to its left. There is no ripple-propagation of transfers along the full length of the rows as there was with the conventional binary circuit of figure 1, so the settling time of the rows (and hence throughput rate) is independent of the wordlength. The maximum clock rate of figure 4 is determined by the longest propagation path between latches. Assuming that broadcast of the y digits along the rows is not a limiting factor, this critical path occurs at the left-hand edge of the array where outputs from the type 1 cells are summed by the type 2 cells. The critical path contains seven subcells but may be shortened to four subcells by adding latches along the secondary pipeline cut indicated in figure 4. The penalty for this is that the latency of the circuit increases from two to three clock cycles, although those cycles may now be shorter.
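The three subcell functions can be checked in isolation. The Python sketch below is our reconstruction of equations (3) to (6) with r = 2; the digit-set choices follow the text, but the function names and coding are ours and carry no claim about the actual cell logic.

```python
# Three-stage redundant addition of the type 1 cell, radix r = 2.
# Each stage emits a result digit from a restricted set plus a bounded
# transfer digit, so no transfer propagates more than two positions.

def type_a(y_prime_in, v):
    # eq. (4): w_out + 2*t_out = y'_in + v,
    # with w_out in {-2,-1,0} and t_out in {0,1}
    s = y_prime_in + v
    t_out = 1 if s > 0 else 0
    return s - 2 * t_out, t_out

def type_b(w_in, t_in):
    # eq. (5): w'_out + 2*t'_out = w_in + t_in,
    # with w'_out in {0,1} and t'_out in {-1,0}
    s = w_in + t_in
    t_out = -1 if s < 0 else 0
    return s - 2 * t_out, t_out

def type_c(w_in, t_in):
    # eq. (6): y'_out = w'_in + t'_in, a digit in {-1,0,1}
    return w_in + t_in

# Exhaustive check of the digit-set claims over all legal inputs.
for y in (-1, 0, 1):
    for v in (-1, 0, 1):            # v = b * y_in, eq. (3)
        w, t = type_a(y, v)
        assert w in (-2, -1, 0) and t in (0, 1) and w + 2 * t == y + v
for w in (-2, -1, 0):
    for t in (0, 1):
        wp, tp = type_b(w, t)
        assert wp in (0, 1) and tp in (-1, 0) and wp + 2 * tp == w + t
        assert type_c(wp, tp) in (-1, 0, 1)
```

The exhaustive loops confirm that every stage's output digit and transfer digit stay within the sets quoted in the text, which is what bounds transfer propagation to two cell positions.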

5. Performance of the Basic Circuit

Before developing the architecture further, we consider how the complexity and speed of the basic redundant circuit of figure 4 compare with binary circuits typical of conventional bit-level systolic array designs [2,3]. We will first compare the latched type 1 redundant multiply-add cell of figure 4 (including a latched coefficient) with a binary gated full adder having three latched inputs and one broadcast input. Estimates are for CMOS implementation and assume that 8-transistor static half-latches with propagate and set-up time of 3 typical gate delays are used, and that EXOR/EXNOR functions introduce two gate delays. Initial design studies indicate that a latched type 1 cell can be implemented using 88 transistors and has a critical path of 12 gate delays. An equivalent latched binary cell requires 54 transistors and has a critical path of 8 gate delays. The signed-digit cell is thus ~35% slower and carries an area overhead of ~65%. However, it should be noted that whereas a conventional p-bit fractional multiplier-adder requires p(p+1) cells, the p-digit redundant arithmetic circuit of figure 4 requires fewer than p(p+1) cells because it is not necessary to include cells which evaluate the least significant digits. Furthermore, a p-digit SBNR coefficient can represent an equivalent range of values to a (p+1)-bit conventional binary signed coefficient. Overall we conclude that the silicon area required to implement the circuit of figure 4 will be typically ~30% greater than that required for a conventional bit-level systolic multiplier-adder. This in turn means that, for all practical values of input data wordlength, such a circuit could be readily accommodated on a single silicon chip.

The circuit of figure 4 has a throughput of one sample every two clock cycles, and a critical path of 17 gate delays, occurring at the left-hand edge of the main array of type 1 cells where it interfaces to the accumulating chain of type 2 cells. Adding a secondary pipeline cut would reduce the critical path to that of the type 1 cells (12 gate delays) but limit throughput to one sample every three cycles. This does not appear to be worthwhile for the illustrated case.


6. A Building Block for Second and Higher Order Sections

Before exploring possible variations of the basic first order architecture of figure 4 we demonstrate how it may be extended to implement higher order filters. The basic computation required for a second order filter section takes the form

yn = b1.yn-1 + b2.yn-2 + un    (10)

where

un = a0.xn + a1.xn-1 + a2.xn-2    (11)

and, as before, the non-recursive term defined by equation (11) may be evaluated using one of the established bit-level systolic array designs. The recursive component may be expressed in the form

yn = b1.yn-1 + u′n    (12)

where

u′n = b2.yn-2 + un    (13)

and so it may be evaluated using the type of block schematic circuit shown in figure 5(b). Block 1 performs the computation in equation (12) while block 2 performs the computation in equation (13). Δ denotes the maximum processing delay associated with each block and this, in turn, defines the delay between consecutive input words to the circuit. Third and higher order sections may be implemented by a straightforward extension of this circuit involving three or more blocks. Since the structure of equations (12) and (13) is identical to that of the first order recursion in equation (1),

they could obviously be evaluated using the type of circuit illustrated in figure 4. As it stands, however, this circuit is not quite suitable for use as the basic building block in figure 5(a) because it assumes that all digits of the input un are available simultaneously. From figure 5(b) it can be seen that the corresponding input u′n to block 1 is obtained from the output of block 2, and this would be produced in msd first skew-parallel form using the circuit of figure 4. A wedge of latches needed to deskew the u′n input would introduce ~p extra clock delays into the feedback loop and so defeat the objective of maximizing throughput. This problem may be avoided by using the modified circuit shown in figure 6. The main difference between this circuit and the one in figure 4 is that the digits of un are input along the left-hand edge of the array and not from the top as before. Thus un is now input in msd first skew-parallel manner, so the circuit in figure 6 can be used as the building block in figure 5 without introducing an extra delay of ~p cycles into the feedback loop. In order to accommodate the extra input digits along the left-hand edge of the array, it has been necessary to modify the type 2 cells and introduce two new subcells X and Y as shown in figure 6. Figure 6 also demonstrates the effect of increasing the number range so that |b1| < 2 and |un| < 2, this coefficient range being the upper limit for a stable second order filter section. The effect of increasing the coefficient range by a factor of two is to introduce an extra cycle of latency. Thus, increasing the coefficient range from ±½ (figure 4) to ±2 (figure 6) increases the latency of the circuit by two cycles. It is important to note, however, that latency is still independent of

Fig. 5. Basic building block and second order connection.


Fig. 6. Detailed structure of building block for higher order sections.

wordlength and that second and higher order sections constructed according to figure 5(b) will have the same throughput as the basic building block. Note that the output yn-1 is now capable of overflow, as it would be in any design of first or second order section which assumes a coefficient in this range.

The secondary row of latches referred to in relation to figure 4 has been included in figure 6 to pipeline the interface between type 1 and type 2 cells at the left-hand edge. For this circuit the effect is to increase latency from 4 to 5 clock cycles but decrease the critical path (and hence minimum clock period) by ~40%. The net effect is an increase in throughput of ~30%.
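At the behavioural level, the block decomposition of figure 5(b) amounts to chaining two first order updates per sample. The Python sketch below models equations (12) and (13) only; the function name is ours, and wordlength, skew and latency, which are the subject of the circuit discussion above, are ignored.

```python
# Second order recursion (10) computed as the two chained first order
# updates of figure 5(b):
#   u'[n] = b2*y[n-2] + u[n]   (block 2, eq. 13)
#   y[n]  = b1*y[n-1] + u'[n]  (block 1, eq. 12)

def second_order_section(u, b1, b2):
    """Return y for the input sequence u (non-recursive part
    assumed already computed, as in the text)."""
    y1 = y2 = 0.0           # y[n-1] and y[n-2], zero initial state
    out = []
    for un in u:
        u_prime = b2 * y2 + un      # block 2
        yn = b1 * y1 + u_prime      # block 1
        out.append(yn)
        y2, y1 = y1, yn
    return out

y = second_order_section([1.0, 0.0, 0.0, 0.0], b1=1.0, b2=-0.5)
```

Each sample still passes through the same single-recursion structure as equation (1), which is why the building block of figure 6 can serve unchanged for any section order.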


7. Architectures Employing Mixed Arithmetic

The previous sections have demonstrated that the key to our solution of the recursion latency problem is the use of redundant arithmetic. This has so far carried with it a penalty in hardware complexity at the cell level. The redundant type 1 digit multiply-add cells of figure 4 require 65% more transistors than the gated full-adder cells typical of conventional systolic multiplier-adder designs. It follows that hardware savings might be made if redundancy is only used in those parts of the circuit where it is strictly necessary, rather than throughout as is the case for the circuits of figures 4 and 6. Recall that redundant signed-digit arithmetic achieves two aims in the circuits presented so far. Firstly and most fundamentally, it allows the carry propagation path shown dashed at the left-hand edge of the conceptual circuit in figure 2 to be broken. It is this that enables the calculation to proceed msd first and thereby solve the latency problem. We must use redundant arithmetic at this edge. Secondly, redundant arithmetic prevents ripple-propagation of carries along the rows of the array so that fully systolic performance can be achieved without pipelining along the rows. If such pipelining were necessary it would again incur delays within the feedback loop and hence defeat the objective. However, use of redundancy here is avoidable, as we will demonstrate later. More immediately there is a third area where we have applied redundant arithmetic but where it is not strictly necessary: the digit multipliers. We develop this theme of redundancy minimization in two stages in the next two sections, firstly by removing redundancy from the digit multipliers and subsequently by removing redundancy from the whole of the main array of type 1 cells. We demonstrate the techniques using first order sections similar to figure 4 for simplicity.

In all cases a generalized block for higher order sections may be derived by extending the redundant accumulator at the left-hand edge to accommodate skew-parallel addition of the un term in a similar manner to that illustrated in figure 6.

8. Two's-Complement Coefficients

The purpose of each row of digit multipliers in the circuit of figure 4 is to multiply the entire coefficient word by a single feedback digit. This feedback digit is generated redundantly and must therefore employ the full SBNR digit set {1̄, 0, 1}. In figures 4 and 6 we have

chosen to represent the coefficient using the same digit set. This conveniently allows each digit multiplier to be both independent and identical, generating a result which is also in the same range. A three-stage redundant adder structure is then required to add this to the redundant accumulating partial product passing down the array. If the multiplication of feedback digit by coefficient word could be arranged to yield a result in nonredundant binary form, such as two's-complement, then the subsequent adder would have one non-redundant (binary) input and one redundant (trinary) input. Such an adder can be implemented as a two-stage singletransfer structure, and is significantly simpler than the fully-redundant three-stage form. Figure 7 illustrates how this approach may be implemented in a circuit which is equivalent to figure 4 in terms of significance of operand digits. The coefficient is now represented in conventional two's-complement binary form. Essentially this means that all bits of the coefficient employ the binary digit set {0,1} except that the msb has negative weight and therefore may be thought of as employing the digit set {1,0}. Note that representation of the coefficient in two's-complement form requires one more digit than representation using the {1,0,1} digit set, for an equivalent range of coefficient values. Thus the circuit of figure 7 is limited to acoefficientrange [bl[ < ~,4 ratherthan [bl[ < 1/2 for figure 4. Again this restriction is not fundamental, but every doubling of coefficient range incurs an additional cycle of latency within the feedback loop and hence a decrease in throughput. The additive input Un retains the full {1,0,1} digit set and is therefore suitable for either redundant or conventional two's-complement binary signals. The two's-complement coefficient must be multiplied by a feedback digit in the set {1,0,1} in such a manner that the result digits are still binary. 
Multiplication by a feedback digit of 0 or 1 is a simple AND gating operation for all bits. Multiplication by −1 is achieved by the two's-complement "invert and add one" algorithm, i.e., invert all bits, add one to the lsb, and interpret the result in the normal two's-complement manner (msb negative, remainder positive). The inversion is achieved by EXORing all bits with the sign of the feedback digit. The addition of one to the lsb could cause a carry to propagate along the full length of the row, but fortunately this addition may be performed redundantly at the same time as the major row addition, by using a spare adder input at the right-hand end of the row. This is clearly illustrated in figure 7.
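The digit multiplication just described can be sketched in software. The following Python model is our own illustration (the function names are not from the original design); bit lists stand in for the hardware word, and the "+1" of the invert-and-add-one algorithm is returned as a deferred lsb carry rather than rippled along the row, mirroring the spare adder input described above.

```python
def digit_times_coeff(d, coeff_bits):
    """Multiply a two's-complement coefficient word by a signed digit d in
    {-1, 0, +1}, keeping the result in plain binary form.

    coeff_bits is a list of bits, msb first, with the msb carrying negative
    weight.  Returns (result_bits, lsb_carry): for d == -1 the word is simply
    inverted (EXOR with the sign of the feedback digit) and the "+1" of the
    invert-and-add-one algorithm is deferred as a separate lsb carry, to be
    absorbed by the spare adder input at the right-hand end of the row.
    """
    if d == 0:
        return [0] * len(coeff_bits), 0
    if d == 1:
        return list(coeff_bits), 0
    return [b ^ 1 for b in coeff_bits], 1  # d == -1


def tc_value(bits):
    """Interpret a bit list (msb first, msb negative) as a two's-complement value."""
    n = len(bits)
    return -bits[0] * (1 << (n - 1)) + sum(b << (n - 1 - i)
                                           for i, b in enumerate(bits) if i > 0)


# Exhaustive check over all 4-bit coefficients: word + deferred carry = d * coefficient.
for c in range(-8, 8):
    bits = [(c >> i) & 1 for i in range(3, -1, -1)]
    for d in (-1, 0, 1):
        word, carry = digit_times_coeff(d, bits)
        assert tc_value(word) + carry == d * c
```

Note that no carry chain is ever exercised inside the digit multiplier itself: the result word is at most an inversion of the coefficient, and the correction term travels separately.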

Fig. 7. Modified first order section with two's-complement coefficients. (Subcell function key: 2C denotes two's-complement, with the msd interpreted as negative; S denotes a sign bit only, input 0 if positive and 1 if negative. Subcells C and F are logically identical.)

Figure 7 introduces three new types of cell. The digit multiplier cell (designated the type M subcell) has been modified to implement the two's-complement multiplication algorithm described above. The subcells which make up the new two-stage redundant adder are designated D, E, and F. The ranges of digit values appearing at the inputs and outputs of these new subcells are indicated in figure 7, together with possible implementations in conventional binary logic. The type D subcell appearing at the left-hand end of each row functions as a modified version of the type E subcell and allows for the negative weight of the most significant coefficient
bit in two's-complement notation. Note also that the type F subcell is logically identical to the type C subcell introduced earlier. The redundant accumulator at the left-hand edge of the circuit still requires the three-stage A/B/C subcell structure, as for figure 4, because all values passed to it from the main array still use the full {−1,0,1} digit set.

Figure 7 clearly demonstrates a substantial reduction in gate count over figure 4. Making the same assumptions as before, the new latched multiply-add cell (consisting of subcells M, E, and F) can be implemented using 56 transistors, compared with 88 for the corresponding cell in figure 4 and 54 for an equivalent binary cell. Similarly, the critical path of the new cell is ten nominal gate delays, compared with twelve for the figure 4 cell and eight for the conventional binary case. Thus the multiply-add cells of figure 7 are 20% faster and 35% smaller than those of figure 4. Clearly there is no longer any significant penalty to be paid in terms of cell complexity for using redundant arithmetic within the rows.

The speed improvement in the multiply-add cells of the main array does not map directly to a corresponding speed-up of the entire circuit. The critical path of 14 nominal gate delays occurs at the interface between the main multiplier array and the redundant accumulator at the left-hand edge of the circuit. A secondary pipeline cut similar to that indicated in figure 4 would reduce the critical path to 11 nominal gate delays, while increasing the number of clock cycles per iteration from two to three. Such a pipeline cut is therefore scarcely worthwhile with the illustrated coefficient range.

9. The Carry-Save Architecture

We have noted earlier that the use of redundant arithmetic for the main array avoids ripple-propagation of carries along the rows, and thus obviates the need for pipelining along this axis. There is, however, a mechanism whereby this can be achieved using conventional arithmetic: the carry-save technique. Carry-save is commonly used in conventional systolic multiplier design and involves passing carries diagonally between rows rather than allowing them to ripple along the rows. The diagonal carries are latched as they pass between rows and thus achieve a propagation of a single digit position per cycle. The technique is best illustrated by figure 8, which is equivalent to our earlier high-level diagram in figure 3. A more detailed diagram is given in figure 9 in the same manner as figures 4 and 7, although to a certain extent this representation masks the simplicity of the carry-save approach.

Fig. 8. Basic architecture of a first order section with carry-save main array.

Fig. 9. Detailed architecture of the carry-save first order section. (Notation as in figure 7.)

Both the coefficient b1 and the additive input u_n are now represented in two's-complement binary form, and the whole of the main array uses non-redundant binary arithmetic. The invert-and-add-one scheme for multiplication of feedback digit by coefficient word is the same as that described for the circuit of figure 7, yielding a two's-complement binary result. Again, a spare input at the right-hand end of each row is used to implement the addition of 1 to the lsb of a row when the feedback digit is −1. The main array employs binary full adders and full subtracters, the latter being required at the left-hand end of each row because of the negative effective weight of the msb of a two's-complement number. The cell at the left-hand end of the top row is a special case; it is logically an adder but both additive inputs, and hence both outputs, have negative effective weight. The range of coefficient and input values is the same as that of figure 4, as is the throughput of one sample every two clock cycles (again the coefficient range may be extended at the expense of reduced throughput).

The requirement for a redundant accumulator at the left-hand edge of the circuit remains unchanged for the fundamental reasons outlined earlier, and the output and feedback still employ the SBNR digit set. Furthermore, because of the diagonal flow of carries through the main array, triplets of outputs must be summed rather than pairs as in figures 4 and 7. However, since all signals emerging from the main array are now binary (some having negative weight as indicated), this redundant accumulator can be somewhat simplified. A two-stage structure results, in which the first stage is comprised of subcells which are logically identical to full subtracters and the second stage is comprised of type C subcells, which are logically identical to half subtracters.

Thus the circuit of figure 9 is constructed entirely from conventional binary adders, subtracters, and digit multipliers. Furthermore, the total number of such cells is approximately the same as the number required for a conventional binary multiplier-adder. We are left with the quite remarkable fact that the redundant low latency multiplier-adder presented here is no more complex than a conventional non-redundant circuit for performing the same operation with somewhat greater latency. This negates the traditional assumption that the use of arithmetic redundancy requires special hardware and leads to an increase in overall complexity.
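The carry-save mechanism itself is easy to model in software. The following Python sketch is our own illustration (it uses unsigned lsb-first bit vectors rather than the two's-complement cells of figure 9): each row absorbs one operand with independent full adders, latching every carry into the next digit position instead of rippling it along the row.

```python
def carry_save_add(sum_bits, carry_bits, addend_bits):
    """One row of a carry-save array: add three equal-length bit vectors
    (lsb first) with independent full adders.  Each position emits its sum
    bit in place and latches its carry into the next digit position of the
    carry vector, so nothing ripples along the row; carries advance
    diagonally, one position per row.
    """
    n = len(sum_bits)
    new_sum, new_carry = [0] * n, [0] * n
    for i in range(n):
        t = sum_bits[i] + carry_bits[i] + addend_bits[i]
        new_sum[i] = t & 1
        if i + 1 < n:
            new_carry[i + 1] = t >> 1  # latched diagonal carry
    return new_sum, new_carry


def value(bits):
    """Unsigned value of an lsb-first bit vector."""
    return sum(b << i for i, b in enumerate(bits))


# Accumulate several operands; the redundant (sum, carry) pair tracks the
# running total, and a single carry-propagating add would resolve it.
s, c = [0] * 8, [0] * 8
for a in (23, 45, 67, 89):
    s, c = carry_save_add(s, c, [(a >> i) & 1 for i in range(8)])
assert (value(s) + value(c)) % 256 == (23 + 45 + 67 + 89) % 256
```

The (sum, carry) pair is itself a redundant representation, which is why a final carry-propagating stage is still needed at the array output, just as the redundant accumulator is needed at the left-hand edge of figure 9.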

The critical path of the circuit again occurs at the interface of the main array with the redundant accumulator at the left-hand edge, being 14 nominal gate delays including latch set-up and hold times. A secondary pipeline cut would reduce the critical path to that of the main array cells, which is ten nominal gate delays, but at the usual penalty of an additional cycle of latency.

10. Binary-to-SBNR and SBNR-to-Binary Conversion

There arises the question of the overhead required for conversion between signed-digit numbers and binary numbers where the filter circuits presented here are to be used in binary systems. Input conversion, where necessary, is trivial since the binary digits are a subset of the signed digits; two's-complement inputs can be accommodated simply by setting the sign bit of the msd to indicate that it has negative weight while all other digits have positive weight. It is worth mentioning that the msd-first skew-parallel input format of figure 4 is well suited to the input of raw data from an analog-to-digital converter, since this is the format naturally produced by pipelined successive approximation conversion.

Conversion from SBNR to two's-complement is usually accomplished by separating positive and negative digits into two words, one containing only positive digits and zeros, the other containing only negative digits and zeros. The negative word (ignoring the sign of the individual digits) is then subtracted from the positive word in any conventional binary subtracter to yield a binary result. This subtraction requires full carry propagation, but since the subtracter is outside the feedback loop, it can be pipelined to any required extent without reducing the throughput of the filter.

Alternatively, a parallel version of the successive approximation on-line conversion algorithm recently proposed by Ercegovac and Lang [14] may be used. This operates in a similar manner to Sklansky's conditional-sum addition algorithm [15]. As originally proposed, the Ercegovac/Lang on-line algorithm is only suitable for normalized operands, i.e., where the msd is guaranteed to be non-zero and thus indicates the sign of the number without reference to lower significance digits. In general the outputs of the circuits described here will not be normalized, so the basic algorithm is not directly applicable. A modified version of the algorithm which operates on un-normalized skew-parallel data may be simply derived and has a lower latency, when fully pipelined at the digit level, than the simple separator/subtracter scheme above. The derivation of this modified algorithm is beyond the scope of this paper, but a suitable hardware implementation is given in figure 10. Note that this pipelined implementation also conveniently removes the output data skew.
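The separator/subtracter scheme is easily modelled in software. The following Python sketch is our own illustration (function names are not from the original); it also makes the redundancy of SBNR visible, since distinct digit strings denote the same value.

```python
def sbnr_to_int(digits):
    """Value of an SBNR word, digits msd first, each in {-1, 0, +1}."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v


def sbnr_to_twos_complement(digits, width):
    """Separator/subtracter conversion: split the SBNR word into a word of
    positive digits and a word of negative digits, then subtract the latter
    from the former in ordinary binary arithmetic.  The final mask yields a
    'width'-bit two's-complement result.
    """
    pos = sum(1 << i for i, d in enumerate(reversed(digits)) if d == 1)
    neg = sum(1 << i for i, d in enumerate(reversed(digits)) if d == -1)
    return (pos - neg) & ((1 << width) - 1)


# Redundancy: distinct digit strings denote the same value.
assert sbnr_to_int([1, -1, 1]) == sbnr_to_int([0, 1, 1]) == 3
assert sbnr_to_twos_complement([1, -1, 1], 4) == 0b0011
assert sbnr_to_twos_complement([-1, 0, 1], 4) == 0b1101  # -3 as a 4-bit word
```

In hardware the subtraction requires full carry propagation, but as noted above it sits outside the feedback loop and may therefore be pipelined to any depth.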

Fig. 10. Pipelined circuit for conversion from SBNR to two's-complement binary.

11. Discussion

In this paper we have shown how IIR filter sections with bit parallel inputs can be pipelined right down at the bit (or digit) level. This has been achieved by (a) using a redundant number system which allows results to be computed most significant digit first and (b) feeding digits from the result back as soon as they are available. The concept has been established by proposing a circuit which uses a redundant number system throughout, and then progressively refined by restricting the use of redundancy to those parts of the circuit where it is essential.

The most important feature of the circuits described is that they allow the recursive computation to be implemented directly without suffering the throughput limitations normally associated with pipeline latency in a feedback loop. A considerable body of work in the recent past has been devoted to the development of on-line computation, which exploits redundant arithmetic to reduce the latency of digit serial computations; see for example Ercegovac [16]. In the introductory section we pointed out that the problems of latency in recursive systems such as IIR filters are much less severe for digit serial systems than for digit parallel systems. In this paper we have proposed an msd-first skew-parallel data format which enables principles similar to the on-line methodology to be efficiently applied to digit parallel circuits, with consequent advantages in terms of throughput.

The introductory section referred to alternative solutions to the recursion latency problem, notably those due to Parhi and Messerschmitt [6,7]. Using their method, the recursion equations are iterated M−1 times so that each new input value depends only on computations initiated M cycles earlier. With this approach, circuits can be implemented using conventional pipelined multiplier-adders, but additional hardware is needed to compute all the intermediate terms which are generated when the recursive equations are iterated. The number of multiplier-adders required therefore increases with wordlength p, the overhead being typically in the range log p to p. This contrasts with the method proposed here, where only one redundant multiplier-adder structure is needed whatever the wordlength. Furthermore, this redundant multiplier-adder has identical complexity to a conventional binary multiplier-adder.
A detailed comparison of the various techniques is currently being carried out, and the results will be reported in a future publication.

The arrays shown in figures 4, 7, and 9 are only 50% efficient in that they produce one output every two clock cycles. There are at least two techniques which could be employed to increase this to 100%. (a) The array size could be reduced by half by projecting the function of two neighboring rows of cells onto that of a single row. This would require some additional hardware for multiplexing and storage. (b) One could make limited use of the look-ahead technique by iterating equation (1) once, so that y_n depends on y_{n-2} rather than y_{n-1}, thus

y_n = (b1)^2 y_{n-2} + b1 u_{n-1} + u_n    (14)


With the latter scheme, the array could be used to produce a result on every cycle, but an additional multiplier-adder is required to compute the non-recursive term b1 u_{n-1}. Similar techniques could obviously be applied to higher order building block circuits such as that in figure 6 which, as it stands, is only 20% efficient.

For reasons of simplicity, all of the circuits presented here employ simple truncation of least significant digits to prevent word growth at each stage of the recursion. This has subtly different implications for signed-digit numbers than for conventional number representations, because the part which is discarded may have opposite sign to the part which is retained. If this is the case then truncation will increase the magnitude of the number. To ensure the stability of recursive systems it is often necessary that truncation or rounding be toward zero. Modification to force this behavior in the circuits presented here is straightforward, requires very little hardware, and does not significantly affect performance. However, discussion of such schemes is beyond the scope of this paper.

We have concentrated here on radix-2 implementations of redundant signed-digit arithmetic. Elsewhere we have discussed fully-redundant designs employing a radix of 4 [17,18]. One radix-4 cell takes the place of four radix-2 cells but is correspondingly more complicated. Other radices are also possible, each offering a multiplicity of possible digit sets and arithmetic algorithms. A detailed study of the relative merits of the different possible radices and digit sets for this application is currently being undertaken.
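The algebra behind look-ahead scheme (b) can be checked numerically. The following Python sketch is our own illustration: it confirms that the once-iterated recursion of equation (14) produces the same output sequence as the original first order recursion, assuming zero initial conditions.

```python
def iir_direct(b1, u):
    """First order recursion y_n = b1*y_(n-1) + u_n (equation (1)), y_(-1) = 0."""
    out, y = [], 0.0
    for un in u:
        y = b1 * y + un
        out.append(y)
    return out


def iir_lookahead(b1, u):
    """Once-iterated recursion y_n = (b1**2)*y_(n-2) + b1*u_(n-1) + u_n
    (equation (14)): the feedback now reaches back two samples, so the array
    can accept a new input every cycle, at the cost of one extra
    multiplier-adder for the non-recursive term b1*u_(n-1).
    """
    out = []
    for n, un in enumerate(u):
        y2 = out[n - 2] if n >= 2 else 0.0
        u1 = u[n - 1] if n >= 1 else 0.0
        out.append(b1 * b1 * y2 + b1 * u1 + un)
    return out


u = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
assert all(abs(x - y) < 1e-12
           for x, y in zip(iir_direct(0.5, u), iir_lookahead(0.5, u)))
```

The same substitution applied M−1 times gives the deeper look-ahead of Parhi and Messerschmitt [6,7], at a correspondingly larger cost in non-recursive hardware.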

References

1. S.C. Knowles, R.F. Woods, J.G. McWhirter, and J.V. McCanny, "Bit-Level Systolic Arrays for IIR Filtering," Proc IEEE Int Conf on Systolic Arrays, San Diego, May 1988, pp. 653-663.
2. J.V. McCanny and J.G. McWhirter, "Some Systolic Array Developments in the UK," IEEE Computer, vol. 20, no. 7, July 1987, pp. 53-65.
3. J.V. McCanny and J.G. McWhirter, "Systolic and Wavefront Arrays," VLSI Technology and Design, eds. J.V. McCanny and J.C. White, Academic Press, 1987, pp. 253-299.
4. R.F. Lyon, "Filters: An Integrated Digital Filter Subsystem," VLSI Signal Processing: A Bit-Serial Approach, eds. P.B. Denyer and D. Renshaw, Addison-Wesley, 1985, pp. 253-262.
5. L.B. Jackson, J.F. Kaiser, and H.S. McDonald, "An Approach to the Implementation of Digital Filters," IEEE Trans on Audio and Electroacoustics, vol. AU-16, no. 3, 1968, pp. 413-421.
6. K.K. Parhi and D.G. Messerschmitt, "Concurrent Cellular VLSI Adaptive Filter Architectures," IEEE Trans on Circuits and Systems, vol. CAS-34, no. 10, October 1987, pp. 1141-1151.
7. K.K. Parhi and D.G. Messerschmitt, "Pipelined VLSI Recursive Filter Architectures Using Scattered Look-Ahead and Decomposition," Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, New York, April 1988, pp. 2120-2123.

8. H.H. Loomis and B. Sinha, "High Speed Recursive Digital Filter Realization," Circuits, Systems, and Signal Processing, vol. 3, no. 3, pp. 267-294.
9. C.W. Wu and P.R. Cappello, "Application Specific CAD of VLSI Second Order Sections," IEEE Trans on Acoustics, Speech, and Signal Processing, May 1988, pp. 813-825.
10. K.K. Parhi and D.G. Messerschmitt, "Block Digital Filtering via Incremental Block-State Structure," Proc IEEE Int Symp on Circuits and Systems, Philadelphia, May 1987, pp. 645-648.
11. K.K. Parhi et al., "Architecture Considerations for High Speed Recursive Filtering," Proc IEEE Int Symp on Circuits and Systems, Philadelphia, May 1987, pp. 374-377.
12. H.T. Kung and M.S. Lam, "Fault-Tolerant VLSI Systolic Arrays and Two-Level Pipelining," Proc SPIE, vol. 431 (Real Time Signal Processing VI), 1983, pp. 143-158.
13. A. Avizienis, "Signed-Digit Number Representations for Fast Parallel Arithmetic," IRE Trans on Electronic Computers, vol. EC-10, Sept. 1961, pp. 389-400.

14. M.D. Ercegovac and T. Lang, "On-the-Fly Conversion of Redundant into Conventional Representations," IEEE Trans on Computers, vol. C-36, no. 7, July 1987, pp. 895-897.
15. J. Sklansky, "Conditional-Sum Addition Logic," IRE Trans on Electronic Computers, vol. EC-9, June 1960, pp. 226-231.
16. M.D. Ercegovac, "On-Line Arithmetic: An Overview," Proc SPIE, vol. 495 (Real Time Signal Processing VII), 1984, pp. 86-93.
17. R.F. Woods, S.C. Knowles, J.V. McCanny, and J.G. McWhirter, "Systolic IIR Filters with Bit Level Pipelining," Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, New York, April 1988, pp. 2072-2075.
18. R.F. Woods, S.C. Knowles, J.G. McWhirter, and J.V. McCanny, "Systolic Building Block for High Performance Recursive Filtering," Proc IEEE Int Conf on Circuits and Systems, Helsinki, June 1988, pp. 2761-2764.

Simon C. Knowles is a Senior Scientific Officer at the Royal Signals and Radar Establishment in Malvern, England. He joined the establishment in 1983 with a BA in Electrical Sciences from Cambridge University. After spending some time developing VLSI CAD tools, he became a member of the Signal Processing Group at RSRE. His current research interests include VLSI architectures, computer number systems and arithmetics, and mixed-mode circuit simulation.

John G. McWhirter is a Senior Principal Scientific Officer at the Royal Signals and Radar Establishment. He joined the establishment in 1973 and is currently carrying out research on advanced algorithms and architectures for signal processing with particular emphasis on systolic and wavefront arrays. He is a fellow of the Institute of Mathematics and its Applications and a visiting professor at the Queen's University of Belfast. He graduated in applied mathematics from the Queen's University of Belfast in 1970 and received a PhD from the same university in 1973 for theoretical research on atomic and molecular physics.

Roger F. Woods is an Engineering Assistant at the Queen's University of Belfast, Northern Ireland. He received a BSc in Electrical and Electronic Engineering from Queen's University in 1985 and is presently studying for a PhD at the same university. He was employed as a Research Assistant on a contract from the Royal Signals and Radar Establishment from July 1986 to July 1988. His current research interests include VLSI architectures and digital filtering.

John V. McCanny is Professor of Microelectronic Systems at the Queen's University of Belfast. He joined the university in 1984 and currently leads a broad based Digital Signal Processing Group whose activities include VLSI signal processing, speech coding and hi-fi music coding. His current research interests include bit level systolic arrays, the systematic design of VLSI array processor architectures, and the design of very high performance DSP chips. Formerly he worked at the Royal Signals and Radar Establishment as a Principal Scientific Officer. He graduated with a BSc in Physics from the University of Manchester in 1973 and obtained his PhD from the University of Ulster in 1978 for research on the electronic structure of various layered compounds.
