Linköping Studies in Science and Technology, Dissertation No. 583

STUDIES ON THE DESIGN AND IMPLEMENTATION OF DIGITAL FILTERS

Kent Palmkvist

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Linköping 1999


Dissertations, Division of Electronics Systems, Department of Electrical Engineering, Linköpings universitet, Sweden:

Vesterbacka M.: On Implementation of Maximally Fast Wave Digital Filters, Linköping Studies in Science and Technology, Diss. No. 487, Linköping University, Sweden, June 1997.

Johansson H.: Synthesis and Realization of High-Speed Recursive Digital Filters, Linköping Studies in Science and Technology, Diss. No. 534, Linköping University, Sweden, May 1998.

Gustavsson M.: CMOS A/D Converters for Telecommunications, Linköping Studies in Science and Technology, Diss. No. 552, Linköping University, Sweden, Dec. 1998.


Copyright © 1999 Kent Palmkvist
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping
Sweden

ISBN 91-7219-488-X

ISSN 0345-7524

Printed in Sweden by UniTryck, Linköping 1999

ABSTRACT

In this dissertation an efficient approach to the design and implementation of fixed-function, high-speed recursive digital filters is presented. For a recursive algorithm there is an upper bound on the sample frequency of the corresponding implementation. A maximally fast implementation is an implementation whose sample frequency equals this bound. The maximal sample frequency is determined by the ratio between the number of delay elements and the operation latency in the most time-critical recursive loop(s). We show how maximally fast implementations are obtained using a cyclic scheduling formulation that spans several sample periods. This formulation allows a simple isomorphic mapping of the arithmetic operations onto a resource-optimal hardware structure. The presented implementations are based on bit-serial arithmetic, but digit-serial and bit-parallel arithmetic are also feasible. The cyclic scheduling formulation can also be used to design shared-memory architectures with processing elements that are multiplexed to execute the operations. Two different wave digital filters that have been implemented using the proposed design approach are presented.

We propose several numerically equivalent transformations that may yield algorithms with reduced iteration period bounds. These transformations are applied at a lower level of abstraction, i.e., the arithmetic level, but they affect the critical loops of the algorithms. Further, we define several new latency models for the arithmetic operations with different amounts of pipelining and discuss their effect on the maximal sample frequency. A number of digital filters have been implemented to demonstrate that an increase in sample rate can often be achieved by the use of an appropriate logic style, pipelining of the arithmetic operations, and numerically equivalent transformations. An important advantage of this approach is that the excess speed achieved by a maximally fast implementation can be converted into reduced power consumption by operating a CMOS implementation at a reduced power supply voltage.

ACKNOWLEDGMENT

I would like to thank my wife Katarina for her support and inspiration. I also want to thank all my former and current colleagues at the Division of Electronics Systems (formerly known as the Division of Applied Electronics), especially my supervisor Prof. Lars Wanhammar, Dr. Mark Vesterbacka, Erik Nordhamn, Peter Sandberg, and Dr. Håkan Johansson. This work has been supported by NUTEK.

CONTENTS

1 INTRODUCTION
  1.1 DIGITAL SIGNAL PROCESSING SYSTEMS
  1.2 THROUGHPUT
  1.3 POWER CONSUMPTION
  1.4 DESIGN EFFORT
2 ALGORITHM PROPERTIES
  2.1 DSP ALGORITHMS
  2.2 LATENCY AND SAMPLE PERIOD BOUND
3 THE DESIGN PROCESS
  3.1 DSP ALGORITHM SELECTION
  3.2 PARTITIONING INTO COOPERATING PROCESSES
  3.3 SCHEDULING
  3.4 RESOURCE ALLOCATION AND ASSIGNMENT
  3.5 SYNTHESIS OF AN OPTIMAL ARCHITECTURE
  3.6 PROCESSING ELEMENT DESIGN
  3.7 ARITHMETIC
  3.8 LOGIC DESIGN
  3.9 CIRCUIT DESIGN
  3.10 VLSI
4 SCHEDULING FORMULATION
  4.1 PRECEDENCE GRAPHS
  4.2 COMPUTATION GRAPHS
  4.3 PIPELINING AND INTERLEAVING
  4.4 CYCLIC SCHEDULE FORMULATION
  4.5 RESOURCE ALLOCATION AND ASSIGNMENT
PART I: SIC
  SHARED-MEMORY ARCHITECTURES
  SIC
  EXAMPLES
PART II: BIT-SERIAL ARITHMETIC
  EXAMPLES
INCLUDED PAPERS
REFERENCES
PAPER 1: IMPLEMENTATION OF STATIC DSP ALGORITHMS USING MULTIPLEXED PE:S
PAPER 2: DESIGN AND IMPLEMENTATION OF AN INTERPOLATOR USING WAVE DIGITAL FILTERS
PAPER 3: SOME EXPERIENCES FROM AUTOMATIC SYNTHESIS OF DIGITAL FILTERS
PAPER 4: IMPLEMENTATION OF FAST BIT-SERIAL LATTICE WAVE DIGITAL FILTERS
PAPER 5: MAXIMALLY FAST, BIT-SERIAL LATTICE WAVE DIGITAL FILTERS
PAPER 6: ARITHMETIC TRANSFORMATIONS FOR FAST BIT-SERIAL VLSI IMPLEMENTATIONS OF RECURSIVE ALGORITHMS
PAPER 7: HIGH-SPEED MULTIPLICATION IN BIT-SERIAL DIGITAL FILTERS
PAPER 8: A COMPARISON OF THREE LATTICE WAVE DIGITAL FILTER IMPLEMENTATIONS
PAPER 9: SIGN-EXTENSION AND QUANTIZATION IN BIT-SERIAL DIGITAL FILTERS
PAPER 10: REALIZATION OF SERIAL/PARALLEL MULTIPLIERS WITH FIXED COEFFICIENTS
PAPER 11: DESIGN AND IMPLEMENTATION OF A HIGH-SPEED INTERPOLATION AND DECIMATION FILTER

1 INTRODUCTION

1.1 DIGITAL SIGNAL PROCESSING SYSTEMS

Many large-volume electronic consumer products are based on digital signal processing (DSP) techniques. Examples of such products are mobile phones, GPS receivers, and modems. In a digital signal processing system, the signal that contains the information of interest is represented by a sequence of numbers, so-called samples. The DSP system operates on this (input) sequence of numbers to form an output sequence. In general, the overall aim of a signal processing system is to reduce the information content, or to modify it so that it can be efficiently stored or transmitted over a transmission channel. In practice, we also need to perform auxiliary operations on signals and partition the system into smaller parts that perform elementary functions in order to achieve the overall goal.

The market for many of these products is highly competitive, and the time window in which a particular model can be sold with a profit is steadily decreasing. Hence, electronics companies must rapidly develop new or improved products to stay competitive and deliver them to the market at the appropriate time. Efficient design methodologies that allow rapid and error-free integrated circuits to be developed within a given time frame, with a limited number of people, tools, and accumulated experience, are therefore of strategic importance to modern electronics companies.

Here we are mainly interested in a special class of DSP algorithms, so-called frequency selective digital filters. These filters are used in many DSP systems to separate signals that occupy different frequency bands. However, many of the methods and ideas discussed here are applicable to a wider class of algorithms.

Many of the new products that use DSP algorithms are mobile and, hence, battery powered, which imposes strict requirements on the acceptable power consumption. Low power consumption, which corresponds to longer talk and stand-by times between battery recharges, has become an important feature of mobile phones that is easily recognized and appreciated by the users. Hence, it is important to minimize the power consumption in implementations of DSP algorithms.

1.2 THROUGHPUT

Today, digital signal processing systems may process samples at rates (throughputs) of 100 Msamples/s or more. However, most applications use much lower sample rates; in a GSM mobile phone, for example, a speech signal typically uses only 8 ksamples/s.


From an algorithm as well as from an implementation point of view, it is important whether the samples to be processed arrive uniformly in time, in bursts, or randomly. We will not discuss the random case in detail since it is less common. The computational requirements in the burst case are more relaxed than when the samples arrive and must be processed at a constant rate. Hence, we will focus on the uniform case.

The uniform arrival of samples in conjunction with the throughput requirement yields a sample period in which the operations of the algorithm must be computed. From an implementation point of view, the arithmetic operations should be evenly distributed within the sample period to minimize the number of concurrent operations and thereby the amount of hardware resources. We refer to this design step as scheduling; it will be discussed in detail later.

The complexity and regularity of the computations range from simple and regular algorithms, such as frequency selective digital filters, to complex and non-regular algorithms, such as protocol stacks and user interfaces. The affordable computational power in most applications limits the use of complex non-regular algorithms to low sample rates, while the highest sample rates only allow less complex and more regular algorithms to be used.

1.3 POWER CONSUMPTION

As mentioned above, power consumption is a very important issue, especially for battery powered systems, since a reduction in power consumption prolongs the operation time between recharges of the battery. A reduction of the power consumption also reduces the physical size of the battery required.

It is, however, not only battery life that limits the acceptable power consumption, but also the cost of cooling for non-battery powered equipment. Implementations that dissipate large amounts of power (in excess of 5-10 watts per chip) require cooling. This can range from simple conductive cooling to forced cooling using fans and liquids. It is not only the power consumption of individual chips that affects the cost of cooling: cooling a large system where each individual chip dissipates a small amount of power can still be expensive if the total power dissipation is high. A reduced cooling requirement will generally result in lower cost.

Minimal power consumption is obtained if the energy consumption per operation is minimized at the required throughput. Note that minimizing the energy consumption per operation alone yields a long execution time and is therefore often not applicable. The energy required to perform an operation is the product of the average power consumption of the operation and the duration of the computation [4].

In CMOS circuits the switchings, which occur at the clock edges, generate large current peaks in the power supply lines. The current peaks may produce problems


with noise on the power supply lines and possibly also electromigration. The current peaks should therefore be limited as well.

The power consumption in CMOS circuits is due to charging and discharging of parasitic capacitances, short-circuit currents, leakage in reverse-biased diodes, and sub-threshold leakage [42, 51]. For most well-designed CMOS circuits the short-circuit current is on the order of 10% and the leakage components are on the order of 1% of the dynamic switching power consumption. In many cases we can therefore disregard these two latter components. However, for large RAMs, ROMs, and low-activity circuits these components may be more significant. The major dynamic component is due to the charging and discharging of internal parasitic capacitances and wiring capacitances. It can be estimated as

P = f × C_L × V_DD^2    (1)

where f is the switching frequency, C_L is the switched capacitance, and V_DD is the power supply voltage. Note that the power consumption is proportional to the square of the power supply voltage, and is therefore rapidly reduced if the power supply voltage is reduced. A reduction of the power supply voltage will, however, affect the maximal clock frequency, f_CLmax, of the circuit. The maximal clock frequency can be approximated as

f_CLmax = (V_DD − V_T)^α / (K × C_L × V_DD)    (2)

where α ≈ 1.55 for short-channel devices, V_T is the threshold voltage of the transistor, and K is a constant defined by the circuit. The maximal clock frequency, f_CLmax, therefore varies with the power supply voltage as shown in fig. 1. Implementations that are faster than required can therefore have their power consumption reduced by reducing the power supply voltage [7]. For example, if an implementation can run at 100 MHz at a 5 V power supply voltage and the required sample rate is only 80 MHz, the power supply voltage can be reduced to about 3 V. The power consumption is thereby reduced by a factor (3/5)^2 = 0.36.

As a consequence of the discussion above, it is not sufficient to compare the number of arithmetic operations in different algorithms in order to estimate the power consumption. The maximal clock frequency, or sample rate, is also important, since it limits the possibility of using a lower power supply voltage to reduce the power consumption.


Fig. 1. Normalized power consumption and clock frequency as functions of the power supply voltage V_DD.
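To make the voltage-scaling arithmetic concrete, the following minimal Python sketch evaluates eqs. (1) and (2). The threshold voltage V_T = 0.7 V and the unit constants are assumptions chosen for illustration, not process data, so the voltage found here differs from the roughly 3 V read from fig. 1; the quadratic power saving of eq. (1) is the point.

    # Sketch of the voltage-scaling arithmetic in eqs. (1) and (2).
    # alpha = 1.55 per the text; V_T = 0.7 V and unit constants are
    # assumptions for illustration only.

    def max_clock_freq(vdd, vt=0.7, alpha=1.55, k_cl=1.0):
        # Eq. (2): f_CLmax = (V_DD - V_T)^alpha / (K * C_L * V_DD)
        return (vdd - vt) ** alpha / (k_cl * vdd)

    def power(f, cl, vdd):
        # Eq. (1): P = f * C_L * V_DD^2
        return f * cl * vdd ** 2

    # An implementation running at 100 MHz at 5 V only needs 80 MHz:
    # search downwards for the lowest supply that still meets the target.
    target = 0.8 * max_clock_freq(5.0)
    vdd = 5.0
    while max_clock_freq(vdd - 0.01) >= target:
        vdd -= 0.01

    print(f"reduced supply: {vdd:.2f} V")
    print(f"power ratio:    {(vdd / 5.0) ** 2:.2f}")  # quadratic saving, eq. (1)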

Low power consumption can be accomplished by following these guidelines:

1) Select an algorithm that has both a high maximal sample frequency and a low number of arithmetic operations.

2) Schedule the operations in such a way that the maximal sample rate is obtained without excessive use of resources.

3) Reduce the power supply voltage if the required sample rate is lower than the maximal sample rate of the implementation. This converts the excess speed into reduced power consumption.

There are a number of techniques available that enable a designer to modify an algorithm in order to obtain a new algorithm with a higher maximal sample rate [39, 40]. For frequency selective digital filters it is, however, better to use other techniques to design filters with high maximal sample rates [15]. Better numerical properties are obtained by avoiding algorithm transformations, and the resulting filters are often more computationally efficient.

Exploiting computational parallelism and pipelining in combination with power supply voltage scaling is a well-known technique for reducing the power consumption at the expense of increased chip area [6, 7]. Interleaving can in some cases also be used. However, the algorithm must then be partitioned into cascaded blocks, since true pipelining and interleaving cannot be performed in recursive loops. Techniques such as clustered and scattered look-ahead pipelining [40, 64] are not true pipelining.

1.4 DESIGN EFFORT

Integrated circuits are cheap to produce in large volumes, reliable, small in size, and can perform large amounts of complex computations per second. One drawback, however, is the long design time and the long time it takes to manufacture an integrated circuit. The long turn-around time for manufacturing an integrated circuit, and the difficulty of finding and correcting errors, make it vital to produce a correct design


the first time around. It is therefore important to use a synthesis method that prevents the designer from introducing errors into the design. It is also desirable to automate the design methods to a high degree in order to reduce the risk of the designer introducing errors.

An approach that reduces the design time is to reuse previously designed and tested building blocks. The most common case of design reuse is the use of standard cell libraries. These typically contain inverters, simple gates, flip-flops, and latches. The process vendors often provide a standard cell library for their own technology. The most common way to use standard cell libraries is to use logic synthesis tools, for example Autologic™ and Synopsys™, to map a register transfer level description onto an implementation with standard cells.

There is also an emerging approach based on "IP blocks", where IP stands for "intellectual property". These blocks implement larger functions, ranging from bit-parallel adders and multipliers up to complete DSP cores. There is, however, currently no widely accepted standard for interfacing data buses and control, and blocks may only work with a specific vendor's technology, although standardization efforts like VSI and RAPID are ongoing.

The digital filters and the corresponding maximally fast implementations that are discussed in this dissertation are highly modular and regular, which makes them suitable for implementation as, e.g., IP blocks. Regularity of an implementation means that only a few subcircuits, or components, are used repeatedly to build the whole circuit.


2 ALGORITHM PROPERTIES

2.1 DSP ALGORITHMS

In this dissertation we are only concerned with fixed-function DSP algorithms that can be scheduled at design time. Examples of such algorithms are digital filters, fast Fourier transforms (FFTs), discrete cosine transforms (DCTs), and wavelet transforms [5, 8, 9, 13, 28]. These algorithms are often used in applications characterized by a high, but fixed, throughput and data-independent operation sequences.

Further, we are only concerned with hard real-time systems. DSP systems are usually used in a hard real-time environment: the system is required to respond to an input before a given deadline imposed by the application, and an unacceptable error occurs if the deadline is not met. It is therefore important for the designer to be able to guarantee that the implementation will fulfill the timing requirements imposed by the deadlines. A constant sample rate implies such a deadline. In most cases the real-time requirements are specified as the minimum number of samples that must be processed per second. The delay before the computed result arrives at the output, i.e., the latency, is not equally important in all applications [20]. For example, the latency in a CD player is not important, but for a digital filter in a recursive loop the latency is of main concern.

Statically schedulable algorithms can be scheduled at design time. The use of a static schedule makes it possible to guarantee that the deadlines imposed by a real-time application are met. It also allows a more elaborate search for an efficient schedule at design time. Dynamic scheduling, on the other hand, requires that a scheduler is implemented and that it selects a suitable schedule at run-time. This is costly both in terms of power consumption and chip area.

Many subsystems are designed to perform a single, fixed function. Systems that do not perform a fixed function can often be decomposed into parts that perform fixed functions and some parts that perform non-fixed functions. Such a partitioning of the functions allows the fixed-function subsystems to be specialized and optimized with respect to the application requirements. Flexibility in terms of allowing alternating functions will generally induce an additional cost in terms of chip area, power consumption, and reduced speed. Hence, a flexible design will in general be more costly than a fixed-function implementation. In this dissertation we are mainly concerned with the design and implementation of fixed-function subsystems.


The most common approach to flexible implementations is to use general-purpose or standard DSP processors. These processors are programmable and thereby adaptable to many different types of applications. However, they consume more power, require larger chip area, and need longer execution time compared to a fixed-function implementation.

Data-independent algorithms constitute a subset of statically schedulable algorithms. A data-independent algorithm is an algorithm in which the operation sequence and operation types do not depend on the input data, i.e., there are no conditional expressions that control the data flow. Typical data-independent algorithms are the FFT, the DCT, and digital filters. Examples of data-dependent algorithms are sorting, computation of the factorial function, and iterative optimization algorithms.

The classification into data-independent and data-dependent algorithms also depends on the atomic operations that are available. This is illustrated with the bubble-sort algorithm in fig. 2 [1]. If an atomic compare-and-swap operation is available, then bubble sort is a data-independent algorithm, since the operation sequence and operation types are independent of the input data. However, if only separate swap and compare instructions are available, then bubble sort is a data-dependent algorithm, because the execution of the swap operation is determined by a comparison of the input data values.

procedure bubble ( var A : array[1..n] of integer );
{ bubble sorts array A into increasing order }
var
  i, j, temp : integer;
begin
  for i := 1 to n-1 do
    for j := n downto i+1 do
      if A[j-1] > A[j] then
      begin
        { swap A[j-1] and A[j] }
        temp := A[j-1];
        A[j-1] := A[j];
        A[j] := temp
      end
end; { bubble }

Fig. 2. Bubble-sort algorithm. The data-dependent part is the conditional swap.

Many algorithms can be described using various graph representations. These graphs are efficient for analyzing the computational properties of DSP algorithms and for displaying the inherent parallelism among different subtasks [40].

A linear signal-flow graph (SFG) is often used to describe linear single-rate algorithms [28]. The signal-flow graph describes the operations to be performed during one sample period, or one input frame in the case of a transform. The nodes represent variables, or node values, and the arcs denote the transmittances. By


definition, the node value at a node is the sum of the inputs to the node. A special node, called the source node, injects external inputs into the graph, and the sink node extracts outputs from the graph. A transmittance of z^-1 describes variables that are used in the next sample period.

Fig. 3. Signal-flow graph of a second-order digital filter.

A block diagram is a slightly more general alternative to a signal-flow graph. For example, the ordering of the operations is described explicitly in this representation. The graph consists of operations, i.e., blocks, connected by directed arcs representing the data flow between blocks. We denote the unit delay operation z^-1 by a T block, as shown in the block diagram in fig. 4.

Fig. 4. Block diagram of a second-order digital filter.

Another description is the data-flow graph (DFG), which focuses on the flow of data. The nodes represent computations, functions, or subtasks, and the directed arcs represent data transfers. This enables non-linear, multi-rate DSP algorithms to be described. The arcs may also represent delays. The DFG describes the data-driven property of the DSP algorithm: an operation can be executed as soon as all of its inputs are available.


Fig. 5. Data-flow graph of a second-order digital filter.

2.2 LATENCY AND SAMPLE PERIOD BOUND

The data rate of a system is bounded by the speed of the internal components of the system and their interconnection, as well as by the algorithm that is being executed. The speed of a component is characterized by its throughput and latency. It is important to differentiate between throughput and latency in order to understand how systems may be modified to accommodate higher data rates.

The latency is the time required for an input of a given significance level to generate an output of the same significance level. The latency and throughput of a bit-serial multiplier are illustrated in fig. 6 for two cases: least-significant-bit (LSB) first and most-significant-bit (MSB) first. In the LSB-first case, the latency corresponds to the number of fractional bits in the coefficient a. For example, the product of a signed 16-bit data word and a signed 8-bit coefficient is 16 + 8 - 1 = 23 bits. Assuming that the magnitude of the coefficient is less than one, the latency is seven clock cycles, since one bit is processed per clock cycle and there are 7 fractional bits in the coefficient. The latency in the MSB-first case is more complicated and depends on the values of the operands [19]. Figure 6 also shows that the latency is shorter than the time required to compute the complete result. It is therefore possible to start another bit-serial operation earlier, on the partial output. The latency of digit-serial and bit-parallel operations is defined in a similar way.

Throughput is defined as the rate at which new samples can be processed. It is in general not a function of the latency; the latency can be shorter as well as longer than the reciprocal of the throughput.

An algorithm is sometimes described using a signal-flow graph that is not fully specified from a computational point of view. The operation order influences the throughput and latency, and the signal-flow graph must therefore be fully specified in order to uniquely determine the computational properties of the algorithm.


Fig. 6. Latency and throughput for a multiplication y = a·x with |a| < 1, using bit-serial arithmetic with a) LSB first, b) MSB first.
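The LSB-first bookkeeping above is easy to check numerically. The following Python sketch merely restates the word-length arithmetic from the text; the variable names are illustrative, not taken from the dissertation.

    # LSB-first latency bookkeeping for y = a * x (see fig. 6a).
    data_bits = 16                 # signed data word
    coeff_bits = 8                 # signed coefficient, |a| < 1

    product_bits = data_bits + coeff_bits - 1   # 16 + 8 - 1 = 23 bits
    frac_bits = coeff_bits - 1                  # 7 fractional bits in a

    latency_cycles = frac_bits                  # LSB of y after 7 clock cycles
    full_result_cycles = product_bits           # all 23 bits take 23 cycles

    print(product_bits, latency_cycles, full_result_cycles)   # 23 7 23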

The throughput is limited by the critical path, which is the longest computational path in the algorithm. The throughput can often be increased by breaking the critical path into smaller pieces, using techniques such as pipelining and interleaving. Note that pipelining and interleaving do not in principle increase the latency in terms of absolute time. Non-recursive algorithms can reach arbitrarily high sample rates by utilizing pipelining and interleaving. The data rates of recursive algorithms, on the other hand, are bounded by the time required to complete the computations in the recursive loops. Pipelining and interleaving at the algorithm level cannot be used to reduce the latency in these loops. However, pipelining at the arithmetic and logic levels can often be used to improve the throughput [52, 58]. The minimum sample period of a recursive algorithm [43] is given by

T_min = max_i { T_OPi / N_i }    (3)


where T_OPi is the total operation latency in the recursive loop i, and N_i is the number of delay elements in the recursive loop i. The computational loop that limits the throughput is called the critical loop. The block diagram in fig. 4 has two recursive loops, as indicated in fig. 7a. It has a minimum sample period of

T_min = max { (T_a + T_add + T_add + T_Q) / 1 , (T_b + T_add + T_add + T_Q) / 2 }    (4)

where T_a is the latency of the multiplication a, T_b is the latency of the multiplication b, and T_add is the latency of the adder. Assuming that T_b < 2T_a + 2T_add + T_Q, the inner loop with the multiplication a limits the minimum sample period to T_a + 2T_add + T_Q. This loop is then the critical loop.

Fig. 7. Recursive loops limiting the minimum sample period: a) original addition order, b) alternative addition order.

Algorithm transformations can be used to generate new algorithms with the same behavior but a shorter minimum sample period. One approach is to increase the number of delay elements in the critical loop. Such transformations include scattered look-ahead pipelining, clustered look-ahead pipelining, block processing, and parallel processing [40, 64]. The complexity, and thus the implementation cost, often increases rapidly with the use of these methods, and some of the techniques may even result in an unstable filter implementation.

It is also possible to reduce the latency of the critical loop by moving non-critical operations out of the loop. Such transformations include modifications of the operations and the operation order using the commutative, associative, and distributive algebraic laws [29, 36]. Care must be taken, however, to guarantee that the computational properties (including the overflow and truncation behavior) of the original algorithm are not adversely affected. An example of such an algorithm transformation is to change the addition order of the digital filter in fig. 7a, which is acceptable under certain mild restrictions [64] if two's-complement arithmetic is used. The specification of a second-order direct form II digital filter structure does not define the addition order. One


alternative addition order is shown in fig. 7b. This structure computes the same values as in fig. 7a, but the minimum sample period is now

T_min = max { (T_a + T_add + T_Q) / 1 , (T_b + T_add + T_add + T_Q) / 2 }    (5)

Assuming that T_b < 2T_a + T_Q, the critical loop is now T_a + T_add + T_Q, which is T_add shorter. The minimum sample period has thus been reduced.
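As a numeric illustration of eqs. (3)-(5), the following Python sketch evaluates the two loop bounds. The latencies (multiplications four time units, additions and the quantization Q one unit) are the illustrative values used for the schedules in chapter 4, not measured figures.

    # Loop bounds of eqs. (3)-(5) for the second-order direct form II filter.

    def t_min(loops):
        # Eq. (3): max over loops of (total operation latency / #delays)
        return max(t_op / n_delays for t_op, n_delays in loops)

    T_a = T_b = 4    # multiplier latencies (illustrative time units)
    T_add = 1        # adder latency
    T_Q = 1          # quantization latency

    # Fig. 7a, eq. (4): both loops contain two additions and Q.
    original = [(T_a + 2 * T_add + T_Q, 1), (T_b + 2 * T_add + T_Q, 2)]

    # Fig. 7b, eq. (5): only one addition remains in the inner loop.
    reordered = [(T_a + T_add + T_Q, 1), (T_b + 2 * T_add + T_Q, 2)]

    print(t_min(original))   # 7.0 time units
    print(t_min(reordered))  # 6.0 time units: one adder latency saved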


3 THE DESIGN PROCESS

We advocate that the design of a DSP system should start with a system specification that includes a behavioral description together with constraints and a test specification that defines how the functionality should be validated. The design process then consists of building a sequence of models of the system that are successively more detailed, until the model is so close to the implementation that it can be implemented directly. Each model should be validated against the specification, and the differences between successive models should be so small that errors are avoided. Here we are mainly interested in the design steps from scheduling to architecture design in fig. 8, where the functions are mapped to appropriate hardware structures [64]. A hardware structure can either be custom made to match the specification, or be a more general structure that can be programmed to accommodate the algorithm. The latter means that the hardware must be more general and flexible than the custom-made structure, which usually yields an increased cost in terms of chip area and power consumption.

The design process for a full-custom hardware structure suitable for the design of fixed-function systems can be described by the design-flow graph shown in fig. 8. This design flow has been used throughout this dissertation. The graph illustrates a sequence of tasks, where each step should end with a verification or validation, as well as a feasibility check that tests whether the design is feasible for implementation. The design flow is forced to restart at a previous step if this is not the case.

[Fig. 8 shows the design flow as a chain of steps: DSP Algorithm, Partitioning into cooperating processes, Scheduling, Resource Allocation, Resource Assignment, Architecture Design, Processing Elements, Arithmetics, Logic Design, Logic Styles, VLSI Design, and finally the Integrated Circuit.]

Fig. 8. Idealized view of the main steps in DSP system design.

Many of the steps represent tasks that are difficult to optimize. Many of the available algorithms for creating an optimized VLSI design, as well as the algorithms used for scheduling and resource allocation/assignment, are NP-complete [42].

3.1 DSP ALGORITHM SELECTION

The starting point in DSP system design is a DSP algorithm specification. The selected algorithm should be validated or verified to fulfill the specification of the system. The algorithm should be simplified if possible, word lengths selected, etc. This is usually done by means of system simulations, using either dedicated design tools such as DSP Station™ or Ptolemy™, or simulation models written in MATLAB™ or C. A testbench is also needed to ascertain that every subsequent model also fulfills the specification.

3.2 PARTITIONING INTO COOPERATING PROCESSES

The next step is to partition the algorithm into cooperating processes. Different levels of abstraction can be used to represent these processes; examples are FFTs, DCTs, and digital filters. Larger processes are usually further partitioned into subprocesses, creating a hierarchy of processes. The aim of this partitioning is to find a description where each process can be mapped to a hardware structure using the procedure described later.

The selection and partitioning of the processes have a major influence on the performance of the final system. The overall architecture, as well as the architectures corresponding to the subprocesses, are directly influenced by the process selection. A poor choice could result in an excessive amount of interprocess communication, or destroy the regularity of the algorithm.

It is a good strategy to reduce the interprocess communication when partitioning a system into processes. Any communication between processes requires signals interconnecting the corresponding hardware structures and a protocol that specifies when to read the inputs and generate new outputs. This requires either that the processes are synchronized, or that a handshake protocol is implemented. Loss of performance may occur if, for example, a process must wait for the receiving process to finish before proceeding.

Recurring sequences of operations in the algorithm are usually described as a loop in a sequential description. These can either be kept as special control processes that execute the loop body the required number of times, or be unfolded. Loop unfolding may allow the implementation to reach higher sample rates, but may at the same time increase the controller complexity.

3.3 SCHEDULING

High throughput requirements typically necessitate that the hardware executes the processes concurrently. However, many high-level programming languages do not

Chapter 3 Ð The Design Process

17

support parallelism and concurrency explicitly. The algorithm description therefore often needs to be modified to explicitly describe the inherent parallelism.

The assignment of execution times to process instances is often referred to as scheduling. Typically, the time is quantized, and the assignment of processes to different time slots results in a combinatorial optimization problem. Most scheduling problems are NP-complete [1, 2, 11, 14]. The schedule defines the number of concurrent processes and therefore also gives a lower bound on the amount of hardware resources: processing elements, storage, communication, and control. The schedule should therefore be optimized with a cost function that reflects the costs of the different types of resources.
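The resource lower bound just mentioned is easy to extract from a given schedule. The following Python sketch counts the maximum number of simultaneously active operations on the cyclic time axis; the start times and durations are hypothetical, and in practice the count would be kept separately per processing element type.

    # Lower bound on processing elements: maximum overlap of scheduled
    # operations within one (cyclic) sample period.

    def pe_lower_bound(ops, period):
        # ops: list of (start_time, duration) in integer time units
        busy = [0] * period
        for start, duration in ops:
            for t in range(start, start + duration):
                busy[t % period] += 1      # wrap around the schedule cylinder
        return max(busy)

    # Two multiplications (4 units) at t = 0, two more at t = 5, plus
    # two additions (1 unit): at most two operations run at once.
    ops = [(0, 4), (0, 4), (4, 1), (5, 4), (5, 4), (9, 1)]
    print(pe_lower_bound(ops, 10))   # 2 -> at least two PEs are needed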

3.4 RESOURCE ALLOCATION AND ASSIGNMENT

Resources must be allocated to meet the schedule. A hard real-time system corresponds to a resource-adequate design problem, while, for example, a general-purpose computer represents a resource-limited problem. Any increase of resources above the limit defined by the maximum number of concurrent operations implies that parts of the hardware will be idle for some period of time. There are, however, situations where an increase of the resources leads to a significant reduction of the control or communication resources. Using more than the minimum amount of resources can also reduce the power consumption, as discussed before [6].

Resource assignment is the process of binding, for example, each arithmetic operation to one particular processing element in a pool of processing elements. Resource assignment affects the communication requirements and the control complexity. Selecting the type of processing elements (ALU, dedicated arithmetic unit, etc.) is also done in this step. A more flexible resource, such as an ALU, allows greater freedom in the resource assignment step, but often also costs more in terms of chip area and power consumption.

3.5 SYNTHESIS OF AN OPTIMAL ARCHITECTURE

The generated schedule defines the overall requirements of an optimal hardware structure with processing elements, communication channels, storage elements, and control units. This structure can be implemented as a shared-memory architecture.

3.6 PROCESSING ELEMENT DESIGN

The processing elements are designed in this step: their internal datapaths, control structures, and communication interfaces are defined.

3.7 ARITHMETIC

The selection of the type of arithmetic and number system has a large impact on the speed and power consumption. The arithmetic can be implemented using bit-parallel or bit-serial arithmetic, or some intermediate form such as digit-serial arithmetic [12, 46]. Data and constants can be represented in two's-complement, sign-magnitude, one's-complement, SDC, CSDC, etc. [19]. Each of these representations has its advantages and drawbacks [6]. Arithmetic operations can be implemented using different types of algorithms. There is, for example, a large variety of multiplication schemes available, with varying performance in terms of speed, power consumption, and chip area.
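As an illustration of how the number representation matters, the following Python sketch recodes an integer coefficient into canonic signed-digit (CSD) form, one of the representations listed above. CSD minimizes the number of nonzero digits, and each nonzero digit costs one adder or subtractor in a serial/parallel multiplier with a fixed coefficient. This is a generic textbook recoding, not code from the dissertation.

    # Canonic signed-digit (CSD) recoding of an integer coefficient.
    # Returns digits in {-1, 0, +1}, least significant digit first.

    def to_csd(value):
        digits = []
        while value != 0:
            if value % 2 == 0:
                digits.append(0)
            else:
                # Pick +1 or -1 so that the remaining value becomes even;
                # this collapses runs of ones into single nonzero digits.
                d = 2 - (value % 4)        # +1 if value = ...01, -1 if ...11
                digits.append(d)
                value -= d
            value //= 2
        return digits

    print(to_csd(7))    # [-1, 0, 0, 1]: 7 = 8 - 1, two nonzero digits
    print(to_csd(15))   # [-1, 0, 0, 0, 1]: 15 = 16 - 1, two instead of four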

3.8 LOGIC DESIGN

The description is then synthesized to the logic level, where boolean expressions and storage elements are the building blocks. All arithmetic operations are decomposed into boolean expressions. Many tools, e.g., Autologic™ and Synopsys™, are available that simplify the logic description and generate a netlist of gates and flip-flops.

3.9 CIRCUIT DESIGN

The circuit design step includes selection of logic styles (static CMOS, CPL, etc.) [52, 66]. This step may in most cases just consist of selecting a standard cell library, but unconventional logic styles and clocking schemes may be used to obtain high speed and low power consumption.

3.10 VLSI

The final step involves the floorplanning and layout of the integrated circuit, where transistors and wires are positioned on the chip. This design step is very complex and is therefore divided into several substeps, such as floorplanning, placement, and routing [3]. Automatic layout tools are available, but for high performance, full-custom layout is required.

4 SCHEDULING FORMULATION

4.1 PRECEDENCE GRAPHS

A signal-flow graph is not well suited for analysis of the computational properties of an algorithm. Aspects such as parallelism and required operation sequences are easier to extract from the precedence form of the algorithm. The precedence form describes the data dependencies between operations. The operations are sorted so that the data flows from left to right in the graph, except for the delay elements, which go from right to left. Steps 1-8 in fig. 9 describe a method to transform a signal-flow graph into its precedence form. Note that the first node set (N1) and the last (Nj+1) are the same.

1. Collapse unnecessary nodes in the fully specified signal-flow graph by removing all arcs with transmittance equal to 1. Transmittances of -1 can often be propagated into adjacent arcs.

2. Assign node variables to all nodes in the fully specified signal-flow graph according to:
   • input and output nodes with variables x_i(n) and y_i(n), respectively;
   • contents of the delay elements (outputs) with variables v_i(n);
   • outputs from the basic operations, i.e., all the remaining nodes, with variables u_i.

3. Remove all arcs with delay element operations in the signal-flow graph. Initialize the loop index variable j = 1.

4. Choose all initial nodes (nodes with no entering arcs) in the (remaining) signal-flow graph and denote this set of nodes by Nj. Initial nodes already included in a previous set of variables are not included.

5. Delete the arcs that correspond to basic operations that are executable (that is, operations for which all inputs are initial nodes). Remove all initial nodes that no longer have any outgoing branches.

6. If there are any nodes left, increase j by one and repeat from step 4. The algorithm is not sequentially computable if there are operations left but no initial nodes. In that case the precedence graph does not exist.

7. Choose all terminal nodes (nodes with only entering arcs) in the graph from step 3 and denote this set of nodes by Nj+1.

8. Connect the nodes with arcs (basic operations) according to the signal-flow graph, with the indices of the node sets increasing to the right.

Fig. 9. Algorithm to transform a signal-flow graph into a precedence graph.
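Steps 4-6 amount to repeatedly peeling off the executable operations, which is straightforward to express in code. The following Python sketch is a simplified model: each operation is named by its output variable and mapped to its input variables, and the dependency lists are a plausible reading of fig. 10 rather than an exact transcription.

    # Simplified model of steps 4-6: group operations into node sets
    # N2, N3, ... and detect graphs that are not sequentially computable.

    def level_operations(ops, initial):
        # ops: {output_var: [input_vars]}; initial: external inputs and
        # delay-element outputs (the node set N1).
        available, remaining, levels = set(initial), dict(ops), []
        while remaining:
            ready = [v for v, ins in remaining.items()
                     if all(i in available for i in ins)]
            if not ready:   # operations left but nothing executable
                raise ValueError("not sequentially computable")
            levels.append(ready)
            available.update(ready)
            for v in ready:
                del remaining[v]
        return levels

    # Direct form II example in the spirit of fig. 10 (x = input,
    # v1, v2 = delay outputs; u1..u8, y = operation outputs).
    ops = {"u1": ["v1"], "u2": ["v2"], "u3": ["u1", "u2"],
           "u4": ["x", "u3"], "u8": ["u4"],
           "u5": ["v1"], "u6": ["v2"], "u7": ["u5", "u6"],
           "y":  ["u8", "u7"]}
    print(level_operations(ops, {"x", "v1", "v2"}))
    # [['u1', 'u2', 'u5', 'u6'], ['u3', 'u7'], ['u4'], ['u8'], ['y']]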


The generated precedence graph will have all the input values and delay element output values located in the first node set to the left, and the operations will be placed as far to the left as possible. Finally, the delay elements are updated, and their values are used as inputs to the next sample period. Figure 10 shows the precedence graph for the direct form II digital filter in fig. 7a.

Fig. 10. Precedence graph for a direct form II digital filter.

The delay elements describe which values to save from one sample period to the next. This means that the delay elements are not operations in the sense that they modify values; they are only an artifact of the description of a recursive algorithm. The precedence graph is therefore drawn on a cylinder in order to illustrate the iterative property of the algorithm, as shown in fig. 11.

Fig. 11. Cylinder transformation of a precedence graph.


The ordering has then become a circular dependency, where the start and end of the algorithm may be defined at an arbitrary vertical cut of the cylinder. Figure 12 shows the precedence graph from fig. 10 drawn on a cylinder. The gray arcs representing the delay elements are now removed completely from the graph. Non-recursive algorithms may also use this description.

Fig. 12. Cylinder representation of the direct form II digital filter.

4.2 COMPUTATION GRAPHS

The scheduling task consists of defining start times for all operations while respecting the data dependencies and minimizing the implementation cost, directly or indirectly. Scheduling is performed on a computation graph, which is an extension of the precedence graph [64] obtained by adding timing to the operations, as shown in fig. 13. The circumference of the cylinder defines the sample period. The latency of an operation is described by the horizontal distance from the end of the input arc to the start of the output arc. The reciprocal of the throughput of each operation is described by the length of the symbol. Figure 14 shows an example of an as-soon-as-possible (ASAP) schedule for the direct form II digital filter, with the execution time of multiplications being four time units and that of additions one time unit. The sample period requirement is 10 time units.


Fig. 13. Operation description in the computation graph.

Fig. 14. Initial schedule (ASAP) for the direct form II digital filter.
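An ASAP schedule like the one in fig. 14 can be computed with a few lines of code. The following Python sketch uses the latencies stated above (multiplications four time units, additions and Q one); the operation names and dependency lists are a simplified reading of the example, where the multiplications read either x(n) or delay-element outputs and can therefore all start at t = 0.

    # ASAP scheduling: every operation starts as soon as all of its
    # predecessors have finished.

    def asap(ops):
        # ops: {name: (latency, [predecessors])} -> {name: start_time}
        start = {}
        def finish(name):
            lat, preds = ops[name]
            if name not in start:
                start[name] = max((finish(p) for p in preds), default=0)
            return start[name] + lat
        for name in ops:
            finish(name)
        return start

    ops = {"a": (4, []), "b": (4, []), "c": (4, []), "d": (4, []),
           "add1": (1, ["a", "b"]),     # recursive part, feeds Q
           "add2": (1, ["add1"]), "Q": (1, ["add2"]),
           "add3": (1, ["c", "d"]),     # non-recursive part
           "add4": (1, ["Q", "add3"])}  # forms y(n)
    print(asap(ops))
    # a-d start at 0 (four simultaneous multiplications, as in fig. 14);
    # Q starts at 6 and the last addition at 7, within the 10-unit period.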

This scheduling formulation is well suited for many combinatorial optimization algorithms. A new schedule can be created by moving operations on the cylinder while keeping the lengths of the memory arcs non-negative. It is often efficient to change the order of shift-invariant operations to reduce the implementation cost. The necessary resource allocation will be a consequence of the operation schedule. It is therefore of interest to compare the number of concurrently active processes (number of processing elements), the total length of the arcs corresponding to storage, and the number of storage arcs with overlapping lifetimes (number of memory cells). The


cost function for optimization of the schedule may aim at, for example, sample frequency, amount of resources, latency, or a combination of these. Figure 15 shows an improved schedule in which the number of simultaneous multiplications has been reduced by moving the multiplications c and d, together with the two following additions, four time units to the right.

Fig. 15. An improved schedule for the direct form II digital filter.

4.3 PIPELINING AND INTERLEAVING

The two most common techniques to increase the throughput of a system are pipelining and interleaving. Neither of these techniques can be used in recursive loops. Pipelining can only be applied to feed-forward cut-sets of the signal-flow graph, and interleaving requires that the signal-flow graph can be parallelized into identical subgraphs with only feed-forward cut-sets between them. It is therefore not possible to improve the minimum sample period bound using interleaving or pipelining.

Figure 16 shows an algorithm with a sequential operation dependency. The schedule requires three time slots per sample period. Since there is no recursion, pipelining and interleaving can be used to reduce the sample period. Interleaving exploits the fact that the algorithm can be partitioned into independent parts, i.e., parts that are not connected in a loop. These parts can then be executed concurrently.


Fig. 16. Sequential algorithm with a sample period of three operation latencies.

Interleaving is performed by parallelizing the algorithm and duplicating the resources implementing the algorithm N times. New samples are input at a sample rate N times higher than in the original algorithm and are fed to the parallel branches in an alternating fashion. Figure 17 shows how the algorithm in fig. 16 is interleaved and skewed in time to obtain a uniform sample period. The new sample period is one third of the original sample period.

Fig. 17. Interleaved version of the algorithm.

Pipelining is used to break the critical path(s) into smaller pieces in order to increase the throughput. Pipelining of an algorithm can be performed by first adding delay elements to all inputs or outputs. These delay elements are then propagated into the algorithm by using the fact that the arithmetic operations are shift-invariant. Adding N delay elements at the input or output causes the output of the pipelined algorithm to be delayed by N samples. However, the latency in absolute time is not increased, except for the time to write and read the registers that have been introduced.

The aim is to break long computational paths into shorter pieces and thereby increase the throughput. It is not meaningful to make the pieces shorter than T_min for a recursive algorithm, but for a non-recursive algorithm the pieces can be made arbitrarily short. Hence, the throughput of non-recursive algorithms is not upper


bounded. Of course, the interchange of delay elements and arithmetic operations cannot introduce delay elements into recursive loops; hence, pipelining does not improve the minimum sample period.

Pipelining by adding N delay elements to, for example, the outputs of the algorithm corresponds to drawing the delay arcs N times around the cylinder, since the circumference corresponds to one sample period. Figure 18 shows the pipelined version of the algorithm in fig. 16. Three pipelining stages have been introduced. The operations in the different stages can be assigned to their own processing elements.

Fig. 18. Pipelined version of the algorithm.

Pipelining and interleaving of an algorithm differ in that interleaving is performed by parallelizing the algorithm, while pipelining is performed by adding registers between operations. The processing elements in the interleaved case must thus be able to perform the same operations as in the original algorithm. The processing elements in the pipelined algorithm perform only a subset of the operations, i.e., the operations of each stage, and may therefore be simplified compared to the non-pipelined structure. In the pipelined case the stages should be of equal length, since the sample period must be set equal to the longest execution time of any stage. Interleaving does not have this constraint, as each processing element in the interleaved algorithm performs all the operations.

4.4 CYCLIC SCHEDULE FORMULATION

In order to obtain the minimum sample period, or to reach a better resource utilization, a cyclic scheduling formulation involving multiple sample intervals is in general required. One reason is that an operation may have a reciprocal throughput longer than T_min, which can occur when there is more than one delay element in the critical loop. This indicates that more than one hardware resource is needed to compute this operation. Another reason arises when the operations have


latencies that are integers while T_min is a fraction of the basic time unit. It is then not possible to reach an average sample period equal to T_min unless a schedule over multiple sample periods is used. Both situations are resolved by expanding the schedule over an integer number of sample periods, as shown in fig. 19. The fractional sample period problem is solved by having an average sample period equal to T_min, while long operations are handled by expanding the schedule to allow alternating resource assignments.

Fig. 19. Expansion of the sample period by a factor K.
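One plausible way to choose the expansion factor K, assuming integer operation latencies measured in clock cycles, is sketched below in Python: K must make K·T_min an integer, and K·T_min must leave room for the reciprocal throughput of the slowest operation. This selection rule is our reading of the two cases above, not a formula taken from the dissertation.

    # Choosing the expansion factor K of fig. 19 (illustrative sketch).
    from fractions import Fraction

    def expansion_factor(loop_latency, n_delays, slowest_op_cycles):
        t_min = Fraction(loop_latency, n_delays)   # eq. (3), critical loop
        k = t_min.denominator                      # makes K * T_min an integer
        while k * t_min < slowest_op_cycles:       # room for the slowest PE
            k += t_min.denominator
        return k, t_min

    # Critical loop: 21 cycles of latency around two delay elements, and a
    # PE that accepts a new operand every 14 cycles (hypothetical numbers).
    print(expansion_factor(21, 2, 14))   # (2, Fraction(21, 2)): schedule
                                         # two periods of average length 10.5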

4.5 RESOURCE ALLOCATION AND ASSIGNMENT

The resource allocation and assignment step consists of mapping operations, variables, and communication channels to hardware resources. The mapping is often limited to one-to-one or many-to-one; the cyclo-static scheduling algorithm is, however, an example of a one-to-many mapping [48].

A special case of the one-to-one mapping is the isomorphic mapping. It corresponds to a hardware structure that has an interconnection network with the same topology as the computation graph and the same number of processing elements as the schedule. This approach is best suited for very high throughput, where multiplexing of the processing elements is not possible. For applications with modest throughput requirements, it is often possible to share processing elements between operations. In this case the schedule also defines the essential properties of a corresponding optimal architecture, with its processing elements, communication channels, memories, and control.

PART I: SIC

SHARED-MEMORY ARCHITECTURES

The architecture shown in fig. 20 is a shared-memory architecture. The N processing elements are connected to the shared memory, which has a single address space, through an interconnection network. Each PE is given access to the shared memory once every N clock cycles using a rotating access schedule. At each access, either new inputs are read from the segmented shared memory, or the output values are saved by simultaneously writing the outputs into different memory segments. This may, however, cause memory access conflicts, since a segment typically can read or write only one value at a time.

Fig. 20. Shared-memory architecture: N processing elements PE1-PEN connected to memory segments M1-MK through an interconnection network.

This architecture is suited for tightly coupled algorithms, such as recursive algorithms with complicated data dependencies. Each processing element has the same access time to every memory cell. One drawback of the shared-memory architecture is the memory bandwidth bottleneck. Typically, the access time of the memories is of the same order of magnitude as the execution times of the processing elements. Larger architectures with many processing elements can therefore not fully utilize the processing elements. The effects of this bottleneck can be alleviated by using faster memories and by reducing the communication requirements through, for example, broadcasting, cache memories, or direct interprocessor communication [64]. A better balance between processing and communication capacity may be obtained by increasing the execution time of the processing elements [64]. Using bit-serial instead of bit-parallel arithmetic yields small processing elements with smaller area and longer execution times, but the throughput per unit chip area is often larger than for processing elements based on bit-parallel arithmetic.


The memory bottleneck is further reduced if complex basic operations are used. Complex basic operations require, in general, longer execution times and tend to reduce the number of memory accesses. For example, the use of a higher-radix FFT, and thereby more complex butterflies, reduces the number of stages and memory accesses in the FFT.

SIC

We have proposed a special case of the shared-memory architecture that is suitable for fixed-function DSP algorithms with modest throughput requirements. The latter means that the processing elements can be multiplexed between different operations. Hence, this corresponds to the previously discussed case of many-to-one mapping of the operations. The acronym SIC has at least two different interpretations: Single Instruction Computer, which is also referred to as a MOVE computer or transaction triggered computer, and Structured InterConnection.

The basic idea of the SIC is that some of the addresses in the common address space are given special functions, as illustrated in Fig. 21. For example, some of the addresses correspond to the input and output ports, while other addresses are inputs and outputs of fixed-function processing elements like adders, ALUs, multipliers, butterflies, and adaptors. The execution of an algorithm in a SIC only involves moving data back and forth between ordinary memory cells and the inputs and outputs of the processing elements. Hence the name MOVE processor. The control structure for the SIC becomes very simple, i.e., read and write addresses to the memory. The SIC can also be viewed as a structured approach to interconnecting processing elements or more complex computational blocks, e.g., another SIC structure [64]. In this paradigm the processing elements communicate through the shared memory.

In a large system with many processing elements the communication between memory and processing elements can be asynchronous, and in order to save power the processing elements should remain idle when there are no inputs to be processed. Bit-serial processing elements can run at very high clock frequencies, which presents a major problem if a global high-speed clock is used. A better solution is then to use a local clock generator for each processing element. The clock starts when inputs become available, runs for the appropriate number of clock cycles, and then stops. This approach is efficient for statically schedulable algorithms with only a few different types of operations, but it can also be applied to more general problems. For example, Lipovski proposed a similar architecture as an alternative to general CPU structures for simple control tasks [22, 23, 49]. However, we only recently became aware of his work.


Fig. 21. SIC architecture.
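The move-only execution model is small enough to sketch in a few lines of Python (a behavioral illustration with invented names and addresses, not a description of the actual hardware): the address space mixes ordinary RAM cells with PE ports, and a program is nothing but a sequence of moves.

    class Sic:
        IN_A, IN_B, OUT = "add.a", "add.b", "add.sum"   # adder PE ports

        def __init__(self):
            self.cell = {}                   # unified address space

        def read(self, addr):
            if addr == self.OUT:             # reading a PE output port
                return self.cell.get(self.IN_A, 0) + self.cell.get(self.IN_B, 0)
            return self.cell.get(addr, 0)    # ordinary RAM cell or input port

        def move(self, src, dst):            # the single instruction
            self.cell[dst] = self.read(src)

    sic = Sic()
    sic.cell["x0"], sic.cell["x1"] = 3, 4
    program = [("x0", Sic.IN_A), ("x1", Sic.IN_B), (Sic.OUT, "y")]
    for src, dst in program:                 # the controller: a move sequence
        sic.move(src, dst)
    print(sic.cell["y"])                     # 7

The controller of a real SIC correspondingly reduces to a sequence of read and write addresses.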

EXAMPLES

We have implemented two different wave digital filters (WDF) using the cyclic scheduling formulation and used the SIC architecture with bit-serial processing elements for the implementation. High-throughput FFTs have also been implemented by our group using the SIC architecture [25, 24, 65]. We conclude that this approach is efficient for regular, fixed-function algorithms or systems that can be partitioned into such subsystems. Each subsystem can then be implemented using a SIC architecture. The subsystems may be connected in series, in parallel, hierarchically, or in a combination thereof.

Paper 1 describes the overall design method. It focuses on the scheduling formulation together with the SIC architecture. Paper 2 describes the design of an interpolator filter using the cyclic scheduling formulation together with the corresponding SIC architecture with two memories. The emphasis is on the scheduling since the interpolator is a multirate filter. Paper 3, finally, describes in more detail the implementation results for the IF filter described briefly in Paper 1. The description of the architecture in VHDL is discussed together with some reflections on the use of VHDL synthesis.


PART II: BIT-SERIAL ARITHMETIC

Bit-serial computations are performed one bit at a time. The most common case is to use LSB-first operations, where the LSBs of the input values are processed first and the MSB is processed last. The small number of inputs allows the processing elements to have small area and short combinatorial delays, allowing for high clock frequencies. Interconnections are also simplified since they consist of single wires.

Bit-serial addition is performed using a single full adder with a flip-flop that stores the carry to the next clock interval, as shown in Fig. 22. The carry flip-flop is reset at the start of the operation. Subtraction is performed almost identically, with the difference that the carry flip-flop is set and the input to be subtracted is inverted. The latency of addition and subtraction is zero clock cycles, and the delay is that of one full adder in absolute time. The number of clock cycles required to complete an addition is Wd + 1, where Wd is the word length of the operands.

Fig. 22. Bit-serial addition and subtraction.
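The full-adder/carry-flip-flop structure is easy to verify with a small behavioral model (a sketch, not the circuit itself; the words are given LSB first and are assumed to be sign-extended to the result word length):

    def bit_serial_add_sub(x_bits, y_bits, subtract=False):
        """LSB-first bit-serial addition/subtraction. For subtraction
        the carry flip-flop is preset to 1 and the subtrahend bits are
        inverted (two's-complement negation)."""
        carry = 1 if subtract else 0             # carry FF reset (add) or set (sub)
        out = []
        for x, y in zip(x_bits, y_bits):
            if subtract:
                y ^= 1                           # invert the input to be subtracted
            out.append(x ^ y ^ carry)            # full-adder sum bit
            carry = (x & y) | (carry & (x | y))  # full-adder carry to next cycle
        return out

    # 5 - 3 with 5-bit two's-complement words, LSB first:
    print(bit_serial_add_sub([1, 0, 1, 0, 0], [1, 1, 0, 0, 0], subtract=True))
    # [0, 1, 0, 0, 0], i.e., 2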

Bit-serial multiplication can be performed using two bit-serial inputs. A common situation is, however, that the coefficients are available in parallel form. The corresponding serial/parallel multiplier has a latency equal to Wf clock cycles, where Wf is the number of fractional bits in the coefficient. The number of clock cycles required to complete an operation is equal to Wd + Wf, where Wd is the data word length. The large difference between latency and throughput requires the schedule to be performed over multiple sample periods in order to reach the minimum sample period.

The small area of the processing elements and the simple interconnections make an isomorphic mapping almost ideal. Each operation in the schedule is then assigned its own processing element, and every processing element input is driven by a single output only. Storage can be implemented using simple shift registers.
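The cycle counts can be illustrated with a behavioral model of an unsigned serial/parallel multiplier (a simplified sketch; the two's-complement sign handling of a real implementation is omitted). One product bit leaves the multiplier per clock cycle, so a Wd-bit data word and a Wf-bit coefficient occupy the multiplier for Wd + Wf cycles:

    def serial_parallel_multiply(x_bits, coeff_bits):
        """Unsigned serial/parallel multiplication, LSB first. The
        coefficient is applied in parallel; the accumulator models the
        internal carry-save adders as a single integer."""
        coeff = sum(b << i for i, b in enumerate(coeff_bits))
        acc, out = 0, []
        for x in x_bits + [0] * len(coeff_bits):  # Wd data bits + Wf flush cycles
            if x:
                acc += coeff          # add coefficient where the data bit is 1
            out.append(acc & 1)       # one product bit is completed...
            acc >>= 1                 # ...and the rest shifts one position
        return out                    # Wd + Wf product bits, LSB first

    # 5 * 11 = 55 with Wd = 3 and Wf = 4:
    print(serial_parallel_multiply([1, 0, 1], [1, 1, 0, 1]))
    # [1, 1, 1, 0, 1, 1, 0], i.e., 55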


EXAMPLES

A third-order bireciprocal lattice wave digital filter (WDF) has been selected to illustrate the design method. This filter has been implemented by others using bit-parallel carry-save arithmetic [18, 38].

Paper 4 describes the first, initial implementation, made to validate the design method and to demonstrate that high data rates can be achieved for bit-serial implementations. This design was carried out using a full-custom layout. Paper 5 illustrates that the design methodology can be used for larger filters, and that the regularity of the algorithms can be exploited. This work has since been developed into a CAD design tool [54]. Paper 6 goes into detail about how algorithms can be transformed into new structures by using arithmetic transformations. The paper describes how multiplications and additions can be combined to obtain a lower Tmin. Paper 7 introduces different timing models for the latency. These timing models describe the properties of the underlying logic styles used for the implementation of the arithmetic operations. Paper 8 compares three different implementations of the filter, realized using different implementation styles. Paper 9 discusses the implementation of the sign-extension and truncation needed in recursive algorithms. Paper 10 presents how fixed-coefficient multiplications can be implemented. The implementation technique reduces the cost of the multipliers to a fraction of that of the non-optimized version. Paper 11 describes how a multirate decimator/interpolator filter can be implemented using bit-serial arithmetic while maintaining high data rates. A single implementation of a combined interpolation and decimation filter, based on bireciprocal lattice wave digital filters, is presented. The selection of the function is controlled by a single control signal. This target application had a predefined parallel interface that required the design to be interfaced to a parallel data bus, and the high-speed serial data clock to be synchronized to the slower parallel data clock.

INCLUDED PAPERS

Paper 1

Palmkvist K., Vesterbacka M., Wanhammar L.: Implementation of Static DSP Algorithms Using Multiplexed PE:s, Proc. 3rd IEEE Conf. on Electronics, Circuits, and Systems, ICECS'96, Vol. 1, pp. 824-827, Rodos, Greece, Oct. 13-16, 1996.

Paper 2

Palmkvist K., Vesterbacka M., Wanhammar L.: Implementation of an Interpolator Using Wave Digital Filters, Proc. Nat. Conf. on Radio Science, RVK'93, pp. 205-208, Lund, Sweden, April 5-7, 1993.

Paper 3

Sandberg P., Palmkvist K., Wanhammar L.: Some Experiences From Automatic Synthesis of Digital Filters, Proc. NorChip-94, Gothenburg, Sweden, Nov. 8-9, 1994.

Paper 4

Vesterbacka M., Palmkvist K., Sandberg P., Wanhammar L.: Implementation of Fast Bit-Serial Lattice Wave Digital Filters, Proc. IEEE Int. Symp. on Circuits and Systems, ISCAS'94, Vol. 2, pp. 113-116, London, England, May 29 - June 2, 1994.

Paper 5

Vesterbacka M., Palmkvist K., Wanhammar L.: Maximally Fast, Bit-Serial Lattice Wave Digital Filters, Proc. 7th IEEE DSP Workshop, DSPWS'96, pp. 207-210, Loen, Norway, Sept. 1-4, 1996.

Paper 6

Palmkvist K., Vesterbacka M., Wanhammar L.: Arithmetic Transformations for Fast Bit-Serial VLSI Implementations of Recursive Algorithms, Proc. IEEE Nordic Signal Processing Symp., NORSIG'96, pp. 391-394, Espoo, Finland, Sept. 24-27, 1996.

Paper 7

Vesterbacka M., Palmkvist K., Wanhammar L.: High-Speed Multiplication in Bit-Serial Digital Filters, Proc. IEEE Nordic Signal Processing Symp., NORSIG'96, pp. 179-182, Espoo, Finland, Sept. 24-27, 1996.

Paper 8

Vesterbacka M., Palmkvist K., Wanhammar L.: A Comparison of Three Lattice Wave Digital Filter Implementations, Proc. 7th Int. Conf. on Signal Processing Applications & Technology, ICSPAT'96, Vol. 2, pp. 1909-1913, Boston, MA, Oct. 7-10, 1996.

Paper 9

Vesterbacka M., Palmkvist K., Wanhammar L.: Sign-Extension and Quantization in Bit-Serial Digital Filters, Proc. 3rd IEEE Conf. on Electronics, Circuits, and Systems, ICECS'96, Vol. 1, pp. 394-397, Rodos, Greece, Oct. 13-16, 1996.


Paper 10

Vesterbacka M., Palmkvist K., Wanhammar L.: Realization of Serial/Parallel Multipliers with Fixed Coefficients, Proc. Nat. Conf. on Radio Science, RVK'93, pp. 209-212, Lund, Sweden, April 5-7, 1993.

Paper 11

Johansson H., Palmkvist K., Wanhammar L.: Design and Implementation of a High-Speed Interpolation and Decimation Filter, European Microelectronics Application Conf., EMAC-97, pp. 97-100, Barcelona, Spain, May 28-30, 1997.

REFERENCES

[1] Aho V., Hopcroft J. E., Ullman J. D.: Data Structures and Algorithms, Addison-Wesley, 1987.

[2] Akl S. G.: The Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.

[3] Bakoglu H. B.: Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, 1980.

[4] Bellaouar A. and Elmasry M. I.: Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publ., 1995.

[5] Bracewell R. N.: The Fourier Transformation and Its Applications, McGraw-Hill, 1986.

[6] Chandrakasan A. P. and Brodersen R. W.: Low Power Digital CMOS Design, Kluwer Academic Publ., 1995.

[7] Chandrakasan A. P. and Brodersen R. W.: Minimizing Power Consumption in Digital CMOS Circuits, Proc. of the IEEE, Vol. 83, No. 4, pp. 498-523, April 1995.

[8] Crochiere R. E. and Rabiner L. R.: Multirate Digital Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1983.

[9] Fettweis A.: Wave Digital Filters: Theory and Practice, Proc. IEEE, Vol. 74, No. 2, pp. 270-327, Feb. 1986.

[10] Fettweis A. and Meerkötter K.: Suppression of Parasitic Oscillations in Wave Digital Filters, IEEE Trans. on Circuits and Systems, Vol. CAS-22, No. 3, pp. 239-246, March 1975.

[11] Gebotys C. H.: Throughput Optimized Architectural Synthesis, IEEE Trans. on VLSI, Vol. 1, No. 3, pp. 254-261, Sept. 1993.

[12] Hartley R. I. and Parhi K. K.: Digit-Serial Computation, Kluwer Academic Publ., 1995.

[13] Haykin S.: Digital Communications, Wiley & Sons, 1988.

[14] Heemstra de Groot S. M. and Herrmann O. E.: Range-Chart-Guided Iterative Data-Flow Graph Scheduling, IEEE Trans. on Circuits and Systems, Vol. CAS-39, No. 5, pp. 351-364, May 1992.

[15] Johansson H.: Synthesis and Realization of High-Speed Recursive Digital Filters, Linköping Studies in Science and Technology, Diss. No. 534, Linköping University, Sweden, June 1998.

[16] Johansson H., Palmkvist K., and Wanhammar L.: Design and Implementation of a High-Speed Interpolation and Decimation Filter, European Microelectronics Application Conf., EMAC-97, pp. 97-100, Barcelona, Spain, May 28-30, 1997.


[17] Johansson H., Palmkvist K., Vesterbacka M., and Wanhammar L.: High-Speed Lattice Wave Digital Filters for Interpolation and Decimation, Nation. Conf. on Radio Science, RVK-96, Luleå, Sweden, pp. 543-547, June 3-6, 1996.

[18] Kleine U. and Böhner M.: A High-Speed Wave Digital Filter Using Carry-Save Arithmetic, Proc. ESSCIRC'87, Bad-Soden, pp. 43-46, 1987.

[19] Koren I.: Computer Arithmetic Algorithms, Prentice Hall, 1993.

[20] Lawson H. W.: Parallel Processing in Industrial Real-Time Applications, Prentice-Hall, 1992.

[21] Lin H. D. and Messerschmitt D. G.: Finite State Machine has Unlimited Concurrency, IEEE Trans. on Circuits and Systems, Vol. CAS-38, No. 5, pp. 465-475.

[22] Lipovski G. J.: The Architecture of a Simple, Effective, Control Processor, in M. Sami, et al. (Eds.), Microprocessing and Microprogramming, 1976.

[23] Lipovski G. J.: On Conditional Moves in Control Processors, Proc. 2nd Rocky Mountain Symp. on Microcomputers, pp. 63-94, Pingree Park, Colorado, 1978.

[24] Melander J.: Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, 1997.

[25] Melander J., Widhe T., Sandberg P., Palmkvist K., Vesterbacka M., and Wanhammar L.: Implementation of a Bit-Serial FFT Processor with a Hierarchical Control Structure, Proc. European Conf. on Circuit Theory and Design, ECCTD'95, Istanbul, Turkey, pp. 423-426, Aug. 27-31, 1995.

[26] Nilsson P., Torkelson M., Palmkvist K., Vesterbacka M., and Wanhammar L.: A Bit-Serial CMOS Digital IF-Filter for Mobile Radio Using an On-Chip Clock, Proc. 1994 Intern. Zurich Seminar on Digital Communications, Zurich, Switzerland, pp. 510-521, March 8-11, 1994.

[27] Nordhamn E., Palmkvist K., Wanhammar L., Nilsson P., and Torkelsson M.: Implementation of a Digital MF-Filter, Nordiskt Radioseminarium, Luleå, Sweden, April 2-3, 1992.

[28] Oppenheim A. V. and Schafer R. W.: Discrete-Time Signal Processing, Prentice Hall, 1989.

[29] Palmkvist K.: Design and Implementation of Recursive Digital Filters Using Bit-Serial Arithmetic, Linköping Studies in Science and Technology, Thesis No. 568, Linköping University, Sweden, 1996.

[30] Palmkvist K., Sandberg P., Vesterbacka M., and Wanhammar L.: Digital IF Filter for Mobile Radio, Proc. Nordic Radio Symp., NRS95, Saltsjöbaden, Sweden, April 24-27, 1995.

[31] Palmkvist K., Vesterbacka M., and Wanhammar L.: Arithmetic Transformations for Fast Bit-Serial VLSI Implementations of Recursive Algorithms, Proc. IEEE Nordic Signal Processing Symp., NORSIG'96, pp. 391-394, Espoo, Finland, Sept. 24-27, 1996.


[32] Palmkvist K., Vesterbacka M., and Wanhammar L.: Implementation of an Interpolator Using Wave Digital Filters, Proc. Nation. Conf. on Radio Science, RVK'93, pp. 205-208, Lund, Sweden, April 5-7, 1993.

[33] Palmkvist K., Vesterbacka M., and Wanhammar L.: Implementation of Static DSP Algorithms Using Multiplexed PE:s, Proc. 3rd IEEE Conf. on Electronics, Circuits, and Systems, ICECS'96, Vol. 1, pp. 824-827, Rodos, Greece, Oct. 13-16, 1996.

[34] Palmkvist K., Vesterbacka M., and Wanhammar L.: Use of Computer Simulations in ASIC System Design, Proc. Nation. Conf. on Radio Science, RVK-96, Luleå, Sweden, pp. 513-517, June 3-6, 1996.

[35] Palmkvist K., Vesterbacka M., Nordhamn E., and Wanhammar L.: A Fast Bit-Serial Lattice Wave Digital Filter, Proc. Workshop on Digital Communications, Uppsala, Sweden, pp. 88-92, May 25-26, 1992.

[36] Palmkvist K., Vesterbacka M., Sandberg P., and Wanhammar L.: Maximally Fast Recursive Algorithms, Proc. GIGAHERTZ 94, Linköping, Sweden, March 22-23, 1994.

[37] Palmkvist K., Vesterbacka M., Sandberg P., and Wanhammar L.: Scheduling of Data-Independent Recursive Algorithms, Proc. European Conf. on Circuit Theory and Design, ECCTD'95, Istanbul, Turkey, pp. 855-858, Aug. 27-31, 1995.

[38] Pandel J. and Kleine U.: Design of Bireciprocal Wave Digital Filters for High Sampling Rate Applications, Frequenz, Vol. 40, No. 11/12, 1986.

[39] Parhi K. K.: Algorithm Transformation Techniques for Concurrent Processors, Proc. of the IEEE, Vol. 77, No. 12, pp. 1879-1895, Dec. 1989.

[40] Parhi K. K.: VLSI Digital Signal Processing Systems: Design and Implementation, Wiley & Sons, 1999.

[41] Parhi K. K. and Messerschmitt D. G.: Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding, IEEE Trans. on Computers, Vol. C-40, pp. 178-195, Feb. 1991.

[42] Rabaey J.: Digital Integrated Circuits: A Design Perspective, Prentice Hall, 1996.

[43] Renfors M. and Neuvo Y.: The Maximum Sampling Rate of Digital Filters Under Hardware Speed Constraints, IEEE Trans. on Circuits and Systems, Vol. CAS-28, No. 3, pp. 196-202, March 1981.

[44] Sandberg P., Palmkvist K., and Wanhammar L.: Some Experiences From Automatic Synthesis of Digital Filters, Proc. NorChip-94, Gothenburg, Sweden, Nov. 8-9, 1994.

[45] Sandberg P., Palmkvist K., Wanhammar L., and Gustavsson R.: Synthesis of the SIC Architecture from VHDL, Report LiTH-ISY-R-1610, Linköping University, Sweden.

[46] Smith S. G. and Denyer P. B.: Serial-Data Computation, Kluwer Academic Publ., 1988.


[47] Stone H. S.: High-Performance Computer Architecture, Addison-Wesley, 1990.

[48] Swartz D. A.: Synchronous Multiprocessor Realizations of Shift Invariant Flow Graphs, Ph.D. Diss., Georgia Institute of Technology, 1985.

[49] Tabak D. and Lipovski G. J.: MOVE Architecture in Digital Controllers, IEEE Trans. on Computers, Vol. C-29, pp. 180-190, 1980.

[50] Uyemura J. P.: Circuit Design for CMOS VLSI, Kluwer Academic Publ., 1992.

[51] Uyemura J. P.: Fundamentals of MOS Digital Integrated Circuits, Addison-Wesley Publ., 1988.

[52] Vesterbacka M.: On Implementation of Maximally Fast Wave Digital Filters, Linköping Studies in Science and Technology, Diss. No. 487, Linköping University, Sweden, June 1997.

[53] Vesterbacka M., Johansson H., Palmkvist K., and Wanhammar L.: Implementation of Narrow-Band Lattice Wave Digital Filters, IEEE Nordic Signal Processing Symp., NORSIG-98, pp. 153-156, Denmark, June 8-11, 1998.

[54] Vesterbacka M., Palmkvist K., and Wanhammar L.: A CAD Tool for Synthesis of Maximally Fast Lattice Wave Digital Filters, Nation. Conf. on Radio Science, RVK-99, Karlskrona, Sweden, June 14-17, 1999.

[55] Vesterbacka M., Palmkvist K., and Wanhammar L.: A Comparison of Three Lattice Wave Digital Filter Implementations, Proc. 7th Int. Conf. on Signal Processing Applications & Technology, ICSPAT'96, Vol. 2, pp. 1909-1913, Boston, MA, Oct. 7-10, 1996.

[56] Vesterbacka M., Palmkvist K., and Wanhammar L.: High-Speed Multiplication in Bit-Serial Digital Filters, Proc. IEEE Nordic Signal Processing Symp., NORSIG'96, pp. 179-182, Espoo, Finland, Sept. 24-27, 1996.

[57] Vesterbacka M., Palmkvist K., and Wanhammar L.: Maximally Fast, Bit-Serial Lattice Wave Digital Filters, Proc. 7th IEEE DSP Workshop, DSPWS'96, pp. 207-210, Loen, Norway, Sept. 1-4, 1996.

[58] Vesterbacka M., Palmkvist K., and Wanhammar L.: On Implementation of Fast, Bit-Serial Loops, Proc. 1996 Midwest Symp. Circuits and Systems, MWSCAS'96, Ames, Iowa, Aug. 18-21, 1996.

[59] Vesterbacka M., Palmkvist K., and Wanhammar L.: Realization of Serial/Parallel Multipliers with Fixed Coefficients, Nation. Conf. on Radio Science, RVK'93, pp. 209-212, Lund, Sweden, April 5-7, 1993.

[60] Vesterbacka M., Palmkvist K., and Wanhammar L.: Serial Squarers and Serial/Serial Multipliers, Nation. Conf. on Radio Science, RVK-96, Luleå, Sweden, pp. 518-522, June 3-6, 1996.


[61] Vesterbacka M., Palmkvist K., and Wanhammar L.: Sign-Extension and Quantization in Bit-Serial Digital Filters, Proc. 3rd IEEE Conf. on Electronics, Circuits, and Systems, ICECS'96, Vol. 1, pp. 394-397, Rodos, Greece, Oct. 13-16, 1996.

[62] Vesterbacka M., Palmkvist K., Sandberg P., and Wanhammar L.: Implementation of Fast DSP Algorithms using Bit-Serial Arithmetic, Report LiTH-ISY-R-1577, Linköping, Sweden, 1994, EDA Träff'94, Stockholm, March 15, 1994.

[63] Vesterbacka M., Palmkvist K., Sandberg P., and Wanhammar L.: Implementation of Fast Bit-Serial Lattice Wave Digital Filters, Proc. IEEE Int. Symp. on Circuits and Systems, ISCAS'94, Vol. 2, pp. 113-116, London, England, May 29 - June 2, 1994.

[64] Wanhammar L.: DSP Integrated Circuits, Academic Press, 1999.

[65] Widhe T.: Efficient Implementations of FFT Processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Sweden, 1997.

[66] Yuan J. and Svensson C.: High Speed CMOS Circuit Technique, IEEE J. Solid-State Circuits, Vol. CS-24, No. 1, pp. 62-70, Feb. 1989.


PAPER 1

IMPLEMENTATION OF STATIC DSP ALGORITHMS USING MULTIPLEXED PE:S

Kent Palmkvist, Mark Vesterbacka, and Lars Wanhammar

Proceedings of the 3rd IEEE Conference on Electronics, Circuits, and Systems (ICECS'96), Vol. 1, pp. 824-827, Rodos, Greece, Oct. 13-16, 1996.

Presented at the 3rd IEEE Conference on Electronics, Circuits, and Systems (ICECS'96), Rodos, Greece, Oct. 13-16, 1996.


IMPLEMENTATION OF STATIC DSP ALGORITHMS USING MULTIPLEXED PE:S

Kent Palmkvist, Mark Vesterbacka, and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Email: [email protected], [email protected], [email protected]

ABSTRACT

An efficient and flexible ASIC implementation method suited for static DSP algorithms is presented. It is aimed at low-power implementations with moderate speed requirements. The method allows the processing elements to be multiplexed in order to reduce the amount of resources required. A method to find a minimal number of resources and a corresponding architecture from the cyclic scheduling formulation is described. An implementation of a wave digital bandpass filter is used as an example. The low power consumption and high resource utilization are obtained by using the cyclic scheduling formulation, which leads to a maximally fast implementation. The excess speed can be converted to low power consumption by reducing the power supply voltage.

1. INTRODUCTION

Broadband communication equipment, for example, utilizes advanced signal processing in order to reach high data transfer rates, efficient use of the available frequency bands, etc. The signal processing algorithms involved are usually of a static nature, e.g., digital filtering and FFTs. This property allows the hardware implementation to be significantly simplified, since the arithmetic operations in the algorithm can be scheduled optimally at design time. It is important for mobile, battery-powered equipment to have a low power consumption. Another important factor for consumer electronics is a high level of integration and the possibility to reuse components of the design. In the case of moderate speed requirements, a single computational unit may not be fully utilized unless it is multiplexed between multiple processes. Multiplexing of the computational units may therefore yield an implementation with high resource utilization.


2. STATIC DSP ALGORITHMS

Fixed-function, data-independent DSP algorithms allow for static scheduling of the arithmetic operations. This allows an exhaustive search for the best schedule with respect to resource utilization, since the execution times of the operations are known at design time. It also guarantees that the deadlines of a hard real-time system can always be met.

The granularity of the basic operations to be executed on the processing elements needs to be chosen carefully. It is advantageous if the basic operations in the algorithm can be chosen so that it becomes modular and uses only a few different types of operations. The basic operations should be small enough so that the operation parallelism becomes large. On the other hand, the basic operations should be large enough to make the multiplexing and communication cost small.

Recursive algorithms have a minimal sample period, denoted Tmin [1], that determines the maximal throughput. The sample period bound is given by Eq. (1), where Topi is the sum of the processing element (PE) latencies and Ni is the number of delay elements in recursive loop i. The loop with the largest ratio determines the overall throughput.

Tmin = max_i { Topi / Ni }    (1)

The PE latency depends on the logic style selected to implement the PE:s, but also on the architecture, as the PE-memory communication structure may introduce additional latency.
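Eq. (1) is trivial to evaluate once the loops have been enumerated; the following sketch (a hypothetical helper with invented loop data, not part of the paper) makes the computation explicit:

    def minimal_sample_period(loops):
        """loops: list of (Topi, Ni) pairs, where Topi is the total PE
        latency around recursive loop i and Ni its number of delay
        elements. Returns Tmin according to Eq. (1)."""
        return max(t_op / n for t_op, n in loops)

    # A loop with 24 cycles of latency around one delay element and a
    # loop with 40 cycles around two delay elements give Tmin = 24.
    print(minimal_sample_period([(24, 1), (40, 2)]))  # 24.0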

3. CYCLIC SCHEDULING FORMULATION

Scheduling consists of defining the time of execution for each operation. Shimming delays must be inserted into branches if the result of an operation is computed earlier than it is needed by a subsequent operation. In some cases, for example for bit-serial arithmetic, the execution times of the processing elements are often longer than the minimum sample period. Furthermore, the critical loops may contain more than one delay element. For these cases it is not sufficient to schedule the arithmetic operations over a period equal to the sample period to achieve a minimum sample period [2]. Instead, a cyclic operation scheduling has to be performed over multiple sample periods. In a cyclic schedule, operations belonging to several successive sample periods are scheduled, as illustrated in Fig. 1, where the operations belonging to sample interval i have been collected in the box denoted Ni. This scheduling formulation leads to a maximally fast implementation if the scheduling period is selected appropriately.


Fig. 1. Operation schedule for m sample periods.

Note that the delay elements are redundant since they only define boundaries between operations belonging to different sample periods. In Fig. 2 the redundant delay elements have been removed, and the periodicity of the schedule has been visualized by drawing the operations on a cylinder with a circumference equal to the scheduling period. This graph, the so-called computation graph, contains only arithmetic operations with appropriate execution times and shimming delays [2]. From this graph an optimal architecture can be synthesized as described below.

Fig. 2. Illustration of the cyclic operation schedule.

By using the shift-invariant property of the arithmetic operations, operations and storage (arcs) may be reordered in the computation graph. The reordering is done in order to find a schedule with as few simultaneous operations as possible. This number is a lower bound on the required number of processing elements. It is also important to find a schedule with a minimal number of simultaneous PE-memory transactions, as this number implies a lower limit on the necessary communication resources and memory ports. The total length of the arcs indicates the necessary amount of storage. From this scheduling formulation, it is possible to find a resource-optimal schedule using arbitrary constraints on the sample rate (≥ Tmin).

4. RESOURCE ALLOCATION

A lower limit on the number of resources (processing elements, storage, and communication resources) may be extracted from the cyclic computation graph. A resource may be multiplexed between multiple tasks if, and only if, only one of the tasks is active at a time. The resource allocation problem can be solved by the use of inclusion graphs, where the tasks are indicated by vertices and arcs indicate the possibility to share a resource, i.e., non-overlapping lifetimes [2]. Figure 3 shows a lifetime diagram of 5 tasks and the corresponding inclusion graph. Cliques (sub-graphs where every vertex is connected to every other vertex in that sub-graph) are found such that every vertex belongs to one, and only one, clique. All tasks in a clique may share the same resource. Finding a minimal number of cliques yields an optimal solution to the resource allocation problem; a small sketch of this idea follows Fig. 3.

Fig. 3. Resource allocation and assignment using clique partitioning.
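For interval-shaped lifetimes the clique partitioning can be done with a simple greedy scan, sketched below (the task lifetimes are invented for illustration, and wrap-around of lifetimes in a cyclic schedule is ignored here):

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def clique_partition(tasks):
        """tasks: dict name -> (start, stop). A task joins a clique only
        if its lifetime overlaps none of the members', i.e., it is
        connected to all of them in the inclusion graph. Every clique
        can then be served by one resource."""
        cliques = []
        for name, life in sorted(tasks.items(), key=lambda t: t[1]):
            for clique in cliques:
                if all(not overlaps(life, tasks[m]) for m in clique):
                    clique.append(name)
                    break
            else:
                cliques.append([name])
        return cliques

    tasks = {"a": (0, 2), "b": (1, 4), "c": (2, 5), "d": (5, 7), "e": (4, 6)}
    print(clique_partition(tasks))  # [['a', 'c', 'd'], ['b', 'e']] -> 2 resources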

5. MAPPING TO ARCHITECTURE

Multiplexing of hardware resources is a common way to increase resource utilization. The multiplexing requires that results and internal states of the processes are stored in a separate storage. We propose the use of a SIC (Single Instruction Computer [2-5], also called a MOVE computer) as a simple and efficient means to solve the communication problem between many PE:s. The PE:s are mapped into the memory space as illustrated in Fig. 4. This leads to a simple programming model with only one instruction (MOVE). The programming consists of defining the transactions between the PE:s and the memory, which define the interconnection structure extracted from the algorithm. A new communication structure is therefore simple to implement, and design reuse is possible at various levels.

Fig. 4. MOVE architecture.

6. BIT-SERIAL PROCESSING ELEMENTS

A problem with bit-parallel designs is the large routing area needed for the data buses. The buses also consume significant amounts of power. Both of these problems may be reduced if bit-serial communication and processing elements are used. The bit-parallel memories are interfaced to the PE:s by using serial/parallel and parallel/serial converters, as shown in Fig. 5. The PE:s can be provided with private clocks that are controlled by the availability of data and run for an appropriate number of cycles, making it possible to clock each PE only when necessary. This simplifies the design by removing global high-speed clocks and reduces the power consumption. Using on-chip clock oscillators also removes the need for feeding a high-speed clock into the chip. A well-known method to reduce the power consumption is to exploit excess speed by reducing the power supply voltage. This method is ideally suited for this approach since the cyclic scheduling formulation inherently leads to a maximally fast implementation.


Fig. 5. MOVE architecture with bit-serial PE:s.

7. EXAMPLE

The wave-flow graph shown in Fig. 6 is a 12th-order lattice wave digital filter (WDF) for use as an IF filter in a mobile communication system. It uses a sample rate of 380 kHz. The data word length is 15 bits, and the adaptor coefficients are 6 bits.

Fig. 6. Lattice wave digital filter of bandpass type.


In the first design step the adaptor operation is selected as the basic operation. The multiplications by -1 are incorporated into the adaptor operations, and the two additions are performed using a simplified adaptor operation. Hence, only one type of processing element needs to be implemented. A bit-serial PE was implemented using a standard-cell layout style. It is enough to use a single PE for the adaptor operations.

The next design step consists of the cyclic scheduling of the adaptor operations. Fig. 7 shows the final adaptor schedule and the extracted resource utilization graphs. The 28 variables can be compacted into a memory of 15 cells.

This filter has been implemented in an AMS 0.8 µm double-metal CMOS process, using VHDL synthesis and standard-cell layout. The dual-port RAM was synthesized using a 3-transistor-cell RAM generator. The implementation requires an active area of 3 mm². The required clock frequency was only 12 MHz for the control unit and 51.6 MHz for the bit-serial processing element. Figure 8 shows the layout of the wave digital filter, with the PE to the left and the RAM to the right.

The standard-cell library has been shown to work properly well below 3.3 V. In fact, it is possible to reduce the power supply voltage to only 2 V and thereby reduce the power consumption significantly. Since the clock frequency in this case is relatively low, it encourages the use of a low supply voltage to reduce the power consumption.

REFERENCES

[1] M. Renfors and Y. Neuvo, "The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints," IEEE Trans. on Circuits and Systems, vol. CAS-28, no. 3, pp. 196-202, March 1981.

[2] L. Wanhammar, DSP Integrated Circuits, Linköping University, 1996.

[3] G. J. Lipovski, "The Architecture of a Simple, Effective, Control Processor," in M. Sami, et al. (Eds.), Microprocessing and Microprogramming, 1976.

[4] G. J. Lipovski, "On Conditional Moves in Control Processors," in Proc. 2nd Rocky Mountain Symp. on Microcomputers, Pingree Park, Colorado, pp. 63-94, 1978.

[5] D. Tabak and G. J. Lipovski, "MOVE Architecture in Digital Controllers," IEEE Trans. on Computers, Vol. C-29, pp. 180-190, 1980.


Fig. 7. a) Computation graph for the bandpass WDF. b) Schedule for the PE:s. c) Memory variables. d) Memory-PE transactions.


Fig. 8. Layout of the BP filter with a full-custom RAM and a bit-serial adaptor.

PAPER 2

DESIGN AND IMPLEMENTATION OF AN INTERPOLATOR USING WAVE DIGITAL FILTERS

Kent Palmkvist, Mark Vesterbacka, and Lars Wanhammar

Proceedings of the National Conference on Radio Science (RVK'93), Lund Institute of Technology, Lund, Sweden, April 5-7, pp. 205-208, 1993.

Presented at the National Conference on Radio Science (RVK-93), Lund Institute of Technology, Lund, Sweden, April 5-7, 1993.


DESIGN AND IMPLEMENTATION OF AN INTERPOLATOR USING WAVE DIGITAL FILTERS

Kent Palmkvist, Mark Vesterbacka, and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Email: [email protected], [email protected], [email protected]

ABSTRACT

The design and implementation of an interpolator using wave digital filters is presented. The interpolator increases the sample frequency by a factor of 4, from 800 kHz to 3.2 MHz. The design approach yields implementations with low power consumption and small chip area. The excellent stability and sensitivity properties of wave digital filters are retained.

1. INTRODUCTION

A fundamental property of a DSP system is the sample frequency. It sets the requirements on the calculation speed in the system. The whole system may, however, not work at the same sample rate. The transition from one sample frequency to another requires interpolation or decimation. Typical examples are interpolation in the D/A converters used in CD players to allow a simpler analog filter after the D/A, decimation of the signal from ΣΔ A/D converters, and interpolation and decimation to realize narrow-band filters.

The design and implementation of an interpolator is discussed in this paper. A decimator can be designed using the same approach. The interpolator interpolates a signal that is sampled at 800 kHz up to 3.2 MHz. The data word length is 21 bits. The interpolator suppresses false images by more than 73 dB.

2. WDF INTERPOLATOR

Figure 1. The structure of the interpolator.


An L times higher sample frequency is achieved by first inserting L-1 zero-valued samples between the samples in the input sequence x(n). The new sequence is then filtered so that the repeated baseband images in the spectrum are removed. It is more efficient to realize the interpolation by cascading interpolation stages, each interpolating by a factor of two.

An important aspect of digital filters is stability. We use wave digital filters to be able to guarantee stability under non-linear conditions [2]. The lowpass filters in each stage must have a stopband angle slightly less than π/2. An allpass filter is placed in front of the interpolator to compensate for the variation in the group delay. The stopband angle allows us to use an 11th-order bireciprocal lattice WDF as lowpass filter. This structure has a low computational complexity; in fact, it requires only five multiplications and 16 additions. A 7th-order lattice WDF is used to equalize the group delay. The allpass filter requires 7 multiplications and 21 additions. Furthermore, it can be shown that each stage of the interpolating filter can be made to work at its input sample frequency. This reduces the arithmetic workload by 50 percent. The simplified interpolator is shown in Fig. 2.

Figure 2. Wave-flow graph of the interpolator.
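The zero-insertion step mentioned above is simple; the sketch below illustrates it for L = 4 (illustration only; the image-rejection filtering is of course performed by the WDFs of Fig. 2):

    def insert_zeros(x, L):
        """Insert L-1 zero-valued samples after every input sample,
        raising the sample frequency by a factor L."""
        y = []
        for sample in x:
            y.append(sample)
            y.extend([0] * (L - 1))
        return y

    print(insert_zeros([1, 2, 3], 4))
    # [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0]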


3. SCHEDULING OF OPERATIONS

The wave-flow graph for the interpolator is first redrawn to display the operations necessary during one input sample period, see Fig. 3. A suitable atomic operation is then to be identified. It will be implemented as a PE (processing element), which is to be shared between different operations.

Figure 3. The whole interpolator using the input sample frequency.

A selection of large operations results in a lower communication cost, but it also reduces the scheduling possibilities. Generally, it is favorable to choose relatively large atomic operations. The adaptor operation, including a possible negation of one output, is selected as the atomic operation. The adaptor-PE will have two inputs and two outputs and use bit-serial arithmetic. With a clock frequency of 115.2 MHz, each PE will compute 6 results during each sample period, because a new result is computed every 24 clock cycles. The number of PEs must therefore be at least ⌈22/6⌉ = 4, as there are a total of 22 operations in the filter. Each PE is pipelined; hence it takes 48 clock cycles between an input value and the corresponding output value. The critical path is 11 operations long and would therefore require 48 * 11 = 528 clock cycles to be calculated. This is longer than the sample period. Pipelining is therefore necessary to split the critical path into smaller parts.

The precedence graph is extracted from the wave-flow graph. The result of scheduling the recursive parts of the interpolator, while viewing the operations over more than one sample period, is shown in Fig. 4. This figure illustrates the possibilities to move operations over sample period bounds and how the pipelining stages are distributed. This graph does not, however, show the computational limits in the precedence graph that are due to the recursive loops inside the interpolator filter. An example of this is operations #16 and #21, which form a recursive loop. The output from operation #21 is fed back to operation #16 of the next sample calculation, and operations #16 and #21 can therefore not be placed further apart. This is, however, not visible in Fig. 4, and it is therefore not suitable to use this representation to do the actual scheduling.

Figure 4. Optimal schedule showing operations connected to one input sample.
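The resource arithmetic above can be verified directly (a small sketch using the figures quoted in the text):

    import math

    f_clk, f_sample = 115.2e6, 800e3
    cycles_per_period = round(f_clk / f_sample)   # 144 clock cycles
    results_per_pe = cycles_per_period // 24      # 6 adaptor results per PE
    min_pes = math.ceil(22 / results_per_pe)      # ceil(22/6) = 4 PEs
    print(cycles_per_period, results_per_pe, min_pes)  # 144 6 4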

To see the number of required PEs and communication resources, the schedule in Fig. 4 can be folded into the form shown in Fig. 5. This graph shows all the operations and their positions in time. The communication, which occurs at the start and end points of the operations, is not explicitly shown in this figure. Every operation takes two inputs and creates two output values; each start point of an operation corresponds to writing the previous result to a shared memory and reading the input values for the next calculation. From this schedule the minimum number of resources can be calculated. In this case, 4 PEs and 2 communication resources are the minimum.

Figure 5. The optimal schedule.

4. RESOURCE ALLOCATION

The optimal schedule requires 4 PEs and 2 communication resources. Each operation must now be mapped to a PE. This can be done by using clique partitioning of a graph, where each node corresponds to an operation and arcs connect operations that can share the same PE. The partitioning is done in such a way that every node in a partition is connected to every other node in the same partition [1]. The schedule shown in Fig. 5 forces each row of processes (e.g. 1, 12, 21, 8, 7, 20) to be mapped to a single PE if a minimal number of resources is to be used. This mapping also specifies the need for communication between the different PEs. In the interpolator, the PEs are assumed to communicate only through the use of shared memories (see below).

The communication of an item back and forth to a PE corresponds to a variable that must be stored. Using a RAM for this gives low power consumption. Each cell in the memory may, over time, store more than one variable, due to the fact that some variables have short lifetimes. The total number of memory accesses is 22*2*2 = 88 accesses each sample period. This corresponds to an access rate of 70.4 MHz if only one memory is used. This may be feasible, but it is much simpler to use two memories with half the access rate. This means that the variables must also be partitioned into two sets that are stored in different memories. Using the same clique partitioning technique as discussed above and the left-edge algorithm [1], we obtain the final compacted variable lifetime table that is shown in Fig. 6. Two RAMs containing 6 and 9 cells are sufficient.

Figure 6. The final variable lifetime diagram with memory cell assignment.
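The left-edge algorithm itself is compact; the sketch below packs invented variable lifetimes into memory cells (the real assignment of Fig. 6 also splits the variables between the two RAMs, which is omitted here):

    def left_edge(lifetimes):
        """lifetimes: list of (birth, death) pairs. Variables are taken
        in order of their left edge and each memory cell is filled with
        as many non-overlapping lifetimes as possible before a new cell
        is opened. Returns the cells as lists of variable indices."""
        unplaced = sorted(range(len(lifetimes)), key=lambda i: lifetimes[i][0])
        cells = []
        while unplaced:
            cell, t, rest = [], None, []
            for i in unplaced:
                birth, death = lifetimes[i]
                if t is None or birth >= t:   # fits after previous variable
                    cell.append(i)
                    t = death
                else:
                    rest.append(i)
            cells.append(cell)
            unplaced = rest
        return cells

    v = [(0, 3), (1, 2), (2, 6), (4, 5), (5, 8)]
    print(left_edge(v))  # [[0, 3, 4], [1, 2]] -> two memory cells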

5. ARCHITECTURE

The schedule and resource allocation imply that the architecture shall have two shared memories that communicate with the PEs. RAMs are used instead of shift registers, since the latter continuously move data and therefore consume large amounts of power. The bit-serial PEs are connected to two parallel memory buses through shift registers. By this, a PE can work asynchronously while the other PEs access the memory to store their results and get new inputs. The use of bit-serial PEs has the advantages of small power consumption and size, high utilization, and a good balance allowing both memory and PEs to work at their top speed.


Figure 7. Architecture used for the interpolator.

6. VLSI

A VLSI implementation of the PEs has been designed using a 1.6 µm CMOS process. The area for one PE is about 0.22 mm². It is estimated that the size of a complete chip is about 1.4 x 1.8 ≈ 2.52 mm² excluding pads.

REFERENCES

[1] Wanhammar L.: System Design: DSP Integrated Circuits, Prentice-Hall, (In print) 1993.

[2] Fettweis A.: Wave Digital Filters: Theory and Practice, Proc. IEEE, Vol. 74, No. 2, pp. 270-327, Feb. 1986.

[3] Crochiere R. E. and Rabiner L. R.: Multirate Digital Signal Processing, Prentice-Hall Inc., Englewood Cliffs, N.J., 1983.

PAPER 3

SOME EXPERIENCES FROM AUTOMATIC SYNTHESIS OF DIGITAL FILTERS

Peter Sandberg, Kent Palmkvist, and Lars Wanhammar
Peter Nilsson and Mats Torkelsson

Proceedings of the NorChip Conference (NORCHIP-94), Gothenburg, Sweden, Nov. 8-9, 1994.

Presented at the NorChip Conference (NorChip-94), Gothenburg, Sweden, Nov. 8-9, 1994.


SOME EXPERIENCES FROM AUTOMATIC SYNTHESIS OF DIGITAL FILTERS

Peter Sandberg, Kent Palmkvist, and Lars Wanhammar

Division of Applied Electronics, Department of Electrical Engineering, Linköping University, Sweden
Email: [email protected], [email protected], [email protected]

Peter Nilsson and Mats Torkelsson

Department of Applied Electronics, University of Lund, Sweden
Email: [email protected], [email protected]

ABSTRACT

In order to evaluate the usefulness of a standard-cell library for the design of digital filters, a 12th-order BP filter, realized as a lattice wave digital filter, has been implemented using a number of different approaches. The filter has been implemented using a cell library specially adapted to bit-serial arithmetic, using a commercial standard-cell library, and finally partly using full-custom blocks mixed with synthesized blocks of standard cells. The conclusion is that the full standard-cell approach is inefficient compared to using an adapted library. However, it is competitive if the full standard-cell approach is augmented with full-custom blocks for critical parts, such as RAMs and ROMs. We have also pointed out that an implementation using the SIC architecture gives advantages compared to a traditional multiplexed realization, since reconfiguration of a design with non-homogeneous processing elements is easy.

1. INTRODUCTION

To evaluate the usefulness of implementing digital filters by using synthesis tools and commercial standard-cell libraries, we have chosen a 12th-order BP filter as a test vehicle [1]. The filter, see Fig. 1, is intended as an intermediate filter in a mobile radio system and is an alternative to an analog counterpart.

Fig 1. The 12th-order BP filter realized as six two-port adaptors and a number of delay elements.

The reason for implementing the intermediate filter as a digital filter is mainly that it can be integrated among the other digital parts of the system. The filter structure is built from two cascaded lattice wave digital filters, where each filter consists of three two-port adaptors [2, 3]. An adaptor realizes the two functions, see Fig. 2:

• B1 = A2 + a (A2 - A1)
• B2 = A1 + a (A2 - A1)

The adaptor coefficients, a, are therefore pair-wise equal in the two sections. The filter specification consists of a very stringent magnitude requirement, see Fig. 3, but also a group delay requirement [3].
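The adaptor equations translate directly into code; the following behavioral sketch uses floating point for clarity (the actual implementation uses fixed-point bit-serial arithmetic with 6-bit coefficients):

    def two_port_adaptor(a1, a2, alpha):
        """Symmetrical two-port adaptor. Only one multiplication is
        needed, since the product alpha*(A2 - A1) is shared by the two
        outputs B1 and B2."""
        d = alpha * (a2 - a1)
        return a2 + d, a1 + d    # (B1, B2)

    print(two_port_adaptor(1.0, 0.5, 0.375))  # (0.3125, 0.8125)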


Fig. 2. The wave-flow diagram of a symmetrical two-port adaptor.

Fig. 3. Magnitude response and specification for the BP filter.

Other characteristics of the filter are:

Sample frequency: fs = 380 kHz
Data word length: Wd = 15 bits
Coefficient word length: Wcoeff = 6 bits

The sampling frequency, fs = 380 kHz, represents a relatively modest processing requirement and is realizable with standard cells as regards speed. The main objective is to achieve the lowest possible power consumption. In the following, three different implementations of the filter will be discussed: first an isomorphic implementation using a cell library adapted to bit-serial arithmetic [1], second a fully synthesized bit-parallel version with multiplexed processing elements [4], and last a synthesized version using bit-serial arithmetic with some of the blocks replaced with modules made in a full-custom design style.


2. ISOMORPHIC MAPPING OF THE FILTER

In an isomorphic implementation each operation in the signal-flow graph is mapped to a corresponding processing unit in the actual implementation [5]. The advantages are that each processing element can be optimized for chip area, speed, and workload; on the other hand, the fixed binding of operations to processing elements makes multiplexing, and hence higher utilization of the processing elements, difficult. In this case, each two-port adaptor operation in Fig. 1 is mapped to a separate processing element, and each adaptor can therefore be optimized according to its coefficient. Since a bit-serial implementation has been chosen, to minimize power consumption and area, the communication paths are only one bit wide.

The bit-serial, isomorphic implementation of the filter requires six processing elements and a large number of delay elements (shift registers) in order to realize the BP filter. The required clock frequency for the implementation is 14.4 MHz. The layout has been generated by using a high-level LISP-like description language to specify the filter, together with a cell library intended for bit-serial implementations [6]. The LISP-like description has been synthesized and targeted to the specific cell library. The final chip consists of 13800 transistors, and the size is approximately 11.1 mm² in a 1.0 µm CMOS process. If the design is scaled down to a 0.8 µm CMOS process the area is reduced to 7 mm². The power dissipation is approximately 9 mW at 3 V power supply voltage in the isomorphic implementation. If we scale up to a power supply voltage of 5 V we get a power dissipation of approximately 25 mW.

3. THE SIC ARCHITECTURE

The isomorphic mapping of operations onto processing elements reduces the freedom to fully utilize the processing elements. In a multiplexed realization a number of operations can share the same processing element. The multiplexing is often achieved by using multiplexers connected to the inputs of the processing elements. Depending on which step in the algorithm is currently being processed, the multiplexer chooses the appropriate data and directs it to the input of the processing element. This approach requires storage elements to store the data not currently used in any of the processing elements. It also requires a large number of multiplexers to direct the data to and from the processing elements.

A slightly different approach is to use a RAM as storage element and direct the data to a suitable processing element by reading from an appropriate address. Each processing element has its inputs connected to cells in the memory map, and data is directed by moving data from one cell in the memory to the cell connected to the input of the processing element, see Fig. 4. We call this approach a SIC, Single Instruction Computer, since the only operation that is performed is movement of data.

Fig. 4. The generic SIC architecture with two processing elements.

The advantage with this approach is the ease with which it is possible to add new processing elements. The different processing elements need not be homogeneous, and hence it is possible to mix very simple processing elements, such as the two-port adaptor, together with highly complex modules, such as an FFT processing element. As new processing elements are added, only the controller has to be redesigned to generate a new address pattern. In the case of the BP filter, we have chosen the two-port adaptor as the basic processing element. It can be shown that only one processing element is needed to fulfill the requirements stated earlier [3]. We also need a dual-port RAM with 16 cells to store intermediate data.

4. FULLY AUTOMATIC SYNTHESIS FROM VHDL

In our first attempt to automatically generate the digital filter, we made a description in VHDL at the register-transfer level (RTL), containing no timing information, i.e., a purely behavioral description. The SIC architecture was used with a bit-parallel implementation of the processing element. In this case the coefficient word length is only six bits; hence the bit-parallel multiplier is relatively small. The VHDL description was transformed to a gate-level description by the use of a synthesis tool, AutoLogic [7]. As the gate-level description, including pads, was found to be correct by simulations, it was fed into an automatic placement and routing (APAR) tool, IC Station [8], to generate a layout-level description suitable for manufacturing. As target technology we selected a 0.8 µm CMOS standard-cell library from AMS [9]. The library consists mainly of traditional gates, such as NAND gates, flip-flops, etc., and has no special features to support bit-serial implementations of digital filters.

The APAR tool generated a layout, using the selected library, requiring a total area of approximately 9 mm², see Fig. 5. The active area is, however, significantly smaller, approximately 4 mm². The required clock frequency for this implementation is approximately 6 MHz, which is significantly lower than for the previous one. The power dissipation is approximately 45 mW at 5 V power supply voltage, which proves to be significantly less power-efficient than the isomorphic implementation.

Fig. 5. The layout generated by the APAR tool for the automatically synthesized filter.

When examining the design more closely, one quickly reveals a number of drawbacks. First, as seen from the layout above, the design has been flattened and now consists of only one large block of standard cells with a pad frame. Unfortunately, the APAR tool we use requires the design to be flat to be able to do the routing and placement tasks. When the design is flattened the designer loses the possibility to manually check the results from the APAR process. This has proved to be a major drawback, since the APAR tool probably has introduced an error in this implementation of the BP filter. This was revealed after testing manufactured chips, even though extensive simulations had been done at various levels of the design to verify the functionality. However, the final step in the design process, the APAR task, was not possible to verify due to the flattening of the design.


Second, the generation of the RAM is very inefficient. Each RAM cell has been generated as a flip-flop with a multiplexer inserted in a feedback loop. Thus, the data is not stored in the cell; instead it is circulated. This increases the power consumption significantly, as seen before, and this approach is also extremely area-inefficient. Third, ROMs are synthesized as combinatorial networks, not, as commonly done, as a programmable area and a decoder. If a traditional ROM could have been generated, savings in terms of area and an increased operating speed would have been possible.

5. MODIFIED DESIGN PATH

As seen in the previous section, one of the main drawbacks with the fully synthesized version is the badly generated RAM. To improve the competitiveness of the synthesis approach we have augmented the design path to use blocks made in full-custom. With the use of a RAM generator we have generated an appropriate RAM for the filter. The terminals of the generated RAM have been named in the same way as in the synthesized RAM, which made it possible to directly replace the previous RAM with the enhanced version.

We have also designed a bit-serial adaptor, since the previous adaptor in the fully synthesized filter was bit-parallel. The use of a bit-serial adaptor makes comparisons with the isomorphic implementation more fair. However, the extremely short coefficient word length yields an area that is only slightly smaller than the bit-parallel version. The difference in area will, for normal coefficient word lengths, be significant. To force the synthesis tool to generate a bit-serial adaptor, the VHDL code had to be written as a structural description. Once the gate-level description of the filter was complete, the filter was transformed into a layout description, see Fig. 6. The layout is here partitioned into four major blocks. From the left are a bit-serial adaptor, a control unit, an I/O unit, and a dual-port RAM. The core is surrounded by a pad frame as in the previous design. The active area is approximately 3 mm², with a total area of approximately 6.4 mm², in the 0.8 µm AMS CMOS process.

Fig. 6. Layout of the improved design of the BP filter with a full-custom RAM and a bit-serial adaptor.

The main advantage with this approach is the partitioning of the layout, which makes it possible for the designer to simulate the blocks independently. The use of a full-custom RAM also reduces the power dissipation and area requirement, compared to the naive implementation obtained with full synthesis. Further, the expected area reduction from the use of a bit-serial adaptor instead of a bit-parallel adaptor did not occur. This is mainly due to the fact that we in this case have a very short adaptor coefficient, only six bits. Another reason for the low gain in terms of area is that the bit-parallel adaptor is realized as a combinatorial net, contrary to the bit-serial adaptor, which is clocked. The bit-serial adaptor requires numerous delay elements to realize shimming delays, and additional buffer registers not needed in the bit-parallel implementation. Despite the fact that a large amount of area is wasted when routing between the blocks, see Fig. 6, we still gain some area compared to the isomorphic implementation. There is, however, significant room for improvement by using a more suitable cell library and a more efficient routing between the blocks.

The commercial standard-cell library we are currently using has been shown to work properly well below 3.3 V, which is the lowest power supply voltage that AMS guarantees. However, simulations show that a reduction of the power supply voltage down to 2 V is possible with only minor losses in speed [10]. Since the clock frequency in this case is relatively low, it encourages the use of low-voltage designs to further reduce the power consumption. The design was sent for fabrication at AMS in early June 1994 and is expected to be back and tested in October; hence, no power measurements have yet been done, but a significantly lower power dissipation is expected compared to the fully synthesized version of the BP filter.

6. CONCLUSIONS

In this paper we have discussed three different approaches to implementing digital filters. Depending on the library that is used, the results will differ significantly. If there exists a library adapted to synthesis of bit-serial arithmetic, it will probably outperform the use of a commercial standard-cell library, if the standard-cell library is used as it is. However, a standard-cell library augmented with additional full-custom blocks for the critical parts, such as RAMs and ROMs, will yield the best overall result. The inclusion of bit-serial adders, subtractors, and multipliers in a standard-cell library will improve its competitiveness even further. We have also pointed out the advantages of the SIC architecture when non-homogeneous processing elements are used.

REFERENCES
[1] Nilsson, P., et al.: A Bit-Serial Realization of a Lattice Wave Digital Intermediate Frequency Filter, Proceedings of Sixth Annual IEEE International ASIC Conference and Exhibit (ASIC '93), Rochester, New York, USA, Sept. 27 - Oct. 1, 1993.
[2] Fettweis, A.: "Wave Digital Filters: Theory and Practice", Proceedings of the IEEE, Vol. 74, No. 2, pp. 270-327, February 1986.
[3] Nordhamn, E., et al.: "Implementation of a Digital MF-Filter", Proceedings of Nordic Radio Symposium, Luleå Institute of Technology, Sweden, April 2-3, pp. 19-25, 1992.
[4] Sandberg, P., et al.: Synthesis of the SIC Architecture from VHDL, Report LiTH-ISY-R-1610, Linköping, Sweden, 1994.
[5] Wanhammar, L.: System Design: DSP Integrated Circuits, Prentice-Hall, 1994 (in preparation).
[6] Nilsson, P.: A CMOS VLSI Cell Library for Digital Signal Processing, Licentiate Thesis, Department of Applied Electronics, Lund Institute of Technology, Lund, Sweden, 1992.
[7] AutoLogic User Interface Reference Manual, ver. 8.2_5, Mentor Graphics Corporation, 1993.
[8] IC Station Reference Manual, ver. 8.2_5, Mentor Graphics Corporation, 1993.
[9] AMS HIT-Kit Reference Guide, ver. 2.20, Austria Micro Systeme International, Austria, May 1994.
[10] Sandberg, P., et al.: Low-Power Design with Synthesis Tools, Report LiTH-ISY-R-1649, Linköping, Sweden, 1994.

PAPER 4

IMPLEMENTATION OF FAST BIT-SERIAL LATTICE WAVE DIGITAL FILTERS

Mark Vesterbacka, Kent Palmkvist, Peter Sandberg, and Lars Wanhammar

Proceedings of IEEE Int. Symposium on Circuits and Systems (ISCAS'94), Vol. 2, pp. 113-116, London, England, May 29 - June 2, 1994. Presented at the IEEE Int. Symposium on Circuits and Systems (ISCAS'94), London, England, May 29 - June 2, 1994.


IMPLEMENTATION OF FAST BIT-SERIAL LATTICE WAVE DIGITAL FILTERS

Mark Vesterbacka, Kent Palmkvist, Peter Sandberg, and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Email: [email protected], [email protected], [email protected], [email protected]

ABSTRACT

In this paper we discuss the design and implementation of fixed-function wave digital lattice filters. We demonstrate by means of an example that a sampling frequency of more than 130 MHz can be achieved by using bit-serial arithmetic. The proposed approach leads to very fast filters with low power consumption and minimal chip area. Further, we show that the iteration period bound by Renfors et al. can often be lowered by applying numerical equivalence transformations to the signal-flow graph. The proposed implementation technique can easily be extended to higher-order bireciprocal and non-bireciprocal lattice filters as well as other types of filters.

1. INTRODUCTION

As a demonstration vehicle we select a well-known third-order bireciprocal lattice wave digital filter [1, 2]. This filter has earlier been implemented in a 2 µm, double-metal CMOS process using redundant, bit-parallel arithmetic. The fabricated chip in [1] has the following characteristics:

Input data word length: 8 bits
Output data word length: 11 bits
Maximal sampling rate: ≈ 35 MHz
Number of devices: ≈ 9000
Chip area: 14.7 mm2 (pads included)
Power consumption: 150 mW

Figure 1 shows the filter structure, which has a single coefficient a = 0.375. The attenuation of the filter, which is a half-band lowpass filter, is shown in Fig. 2. The filter can easily be simplified to be used for interpolation or decimation of the sample frequency by a factor of two. In both cases the filter can operate at the lower of the two sample rates. For example, the filter can be used to interpolate the sample rate from 130 MHz up to 260 MHz. In the following we will assume that all data is represented by two's-complement binary numbers and that the data word length Wd is 12 bits.

Fig. 1. Third-order lattice wave digital filter.

Fig. 2. Attenuation.

2. MINIMUM ITERATION PERIOD BOUND

The crucial point in the implementation of fast filters is the scheduling of the arithmetic operations. According to the iteration period bound by Renfors et al. [3], the minimal sample period is

Tmin = max_i { Top,i / Ni }

where Top,i is the total latency (delay) due to the arithmetic operations and Ni is the number of delay elements in the directed loop i, respectively. The critical loop is indicated in Fig. 3. It contains two additions, one multiplication, and two delay elements, implying that the minimum sample period is (2Tadd + Tmult)/2. In bit-serial arithmetic, the operations are pipelined at the bit level. The latency Top of a bit-serial PE is the time it takes for a bit of significance level j to propagate through the PE. For realizations using dynamic logic, Tadd = 1 clock cycle and Tmult = Wc clock cycles, where Wc is the coefficient word length. With the coefficient a = 0.375, i.e., Wc = 4 bits, we get a minimum sample period of 3 clock cycles. For static CMOS logic, Tadd = 0 and Tmult = Wc − 1 clock cycles, respectively.

Fig. 3. Critical loop.
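As a concrete check of the numbers above, the bound is mechanical to evaluate once the loop contents and the latency model are fixed. The following minimal sketch (the helper name is ours, not from the paper) computes the bound for the critical loop in Fig. 3 under both logic styles:

```python
def iteration_period_bound(loops):
    """Renfors-Neuvo bound: Tmin = max_i(Top_i / N_i), in clock cycles."""
    return max(t_op / n_delays for t_op, n_delays in loops)

Wc = 4  # coefficient word length for a = 0.375

# Critical loop of Fig. 3: two additions, one multiplication, two delays.
dynamic_logic = [(2 * 1 + Wc, 2)]        # Tadd = 1, Tmult = Wc
static_logic  = [(2 * 0 + (Wc - 1), 2)]  # Tadd = 0, Tmult = Wc - 1

print(iteration_period_bound(dynamic_logic))  # 3.0 clock cycles
print(iteration_period_bound(static_logic))   # 1.5 clock cycles
```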

3. MAXIMALLY FAST SCHEDULING FORMULATION

In order to attain the minimum iteration period, it is necessary to perform a cyclic scheduling of the operations belonging to several successive sample intervals [4, 5], as shown in Fig. 4, if
• the computation time for a PE is longer than Tmin, or
• the critical loop(s) contain more than one delay element.

Fig. 4. m concatenated signal-flow graphs.


Generally, the scheduling period should be at least as long as the longest execution time for any of the PEs in the critical loop shown in Fig. 3. In this case, the schedule must cover at least 5 sample periods, since the multiplier requires Wd + Wc − 1 = 15 clock cycles and Tmin = 3 clock cycles. If an improved multiplier design without the overhead for truncation of the additional Wc − 1 product bits is used, it is possible to schedule over only four sample intervals [5].
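The bookkeeping in this argument is a single ceiling division. A small sketch, with schedule_periods as an illustrative helper name:

```python
from math import ceil

def schedule_periods(longest_exec_time, t_min):
    """Sample periods the cyclic schedule must span (see Fig. 4)."""
    return ceil(longest_exec_time / t_min)

Wd, Wc, Tmin = 12, 4, 3
print(schedule_periods(Wd + Wc - 1, Tmin))  # 5: full-precision multiplier
print(schedule_periods(Wd, Tmin))           # 4: improved multiplier [5]
```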

4. MAXIMALLY FAST SCHEDULE FOR THE ORIGINAL WDF

A maximally fast, resource minimal schedule with a sample period of 3 clock cycles is shown in Fig. 5. Here, the algorithmic delay elements in the critical loop have been replaced by connections to the node u(n) in Fig. 3, and the remaining delay element by a connection to the appropriate input node in the cyclic schedule. The white areas indicate the delay for the least significant bit of the data word to reach the output, and the shaded areas indicate the execution time for the remaining bits. Arrows indicate shimming delays. Input and output of the filter consist of five bit-serial streams that are skewed in time.

Fig. 5. Maximally fast, resource minimal operation schedule.

It is impossible for the operations to share hardware resources in this schedule if non-preemptive PEs are used. We are therefore forced to use an isomorphic mapping from the operation schedule to hardware resources. Figure 5 further shows that the adders will be unused three clock periods every sample period. This enables us to extend the internal representation of the sums with three most significant bits. The extended representation is then sufficient to prevent overflow in the filter at no extra expense.

5. MAXIMALLY FAST SCHEDULE FOR TRANSFORMED WDF

It is possible to reduce the length of the critical loop in Fig. 3, and thereby the required number of clock cycles per sample, by performing numerical equivalence transformations on the signal-flow graph [7]. This particular transformation is based on the observation that if the critical loop contains an addition tree, the organization of the additions can be rearranged so that only one addition remains in the critical loop. In this case, the critical loop contains two such additions. In a first transformation step we select the addition before the critical loop and move it across the algorithmic delays, as shown in Fig. 6. By duplicating the addition, one copy into each of its output branches, we get an addition tree in front of the multiplication. Now, the additions in the tree can be rearranged so that only the last addition remains in the critical loop. The transformed signal-flow graph is shown in Fig. 7.

Fig. 6. First transformation step.


Fig. 7. Transformed signal-flow graph.

These transformations reduce the minimal sample period to 2.5 clock cycles. The conclusion is that the iteration period bound is only valid for a fully specified signal-flow graph. It should be stressed that the values computed in the modified signal-flow graph are the same as in the original signal-flow graph. Hence, the unique stability properties of wave digital filters are retained. A minor drawback is the increase in the number of additions, and the number of shimming delays may or may not change. More important, however, is that the maximal sample frequency has increased by 17%. A resource minimal schedule for the transformed algorithm, shown in Fig. 7, reveals that we now must schedule over six sample intervals and that we are again forced to use an isomorphic mapping between operations and PEs.
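That the transformation is numerically exact can be checked behaviorally. The sketch below assumes the state recurrence v(n) = x(n) + a·(x(n) − v(n−2)) implied by the loop in Fig. 3, and compares it with the rearranged addition order of the transformed graph; since a = 0.375 is exactly representable in binary, the two versions must agree bit for bit:

```python
# Behavioral check: both recurrences are evaluated exactly in binary
# floating point, so the transformed ordering must match exactly.
a = 0.375  # = 0.011 in binary
x = [1.0, -2.0, 3.0, 0.5, -1.25, 4.0, 2.0, -0.5]  # made-up test input

def simulate(step):
    v = [0.0, 0.0]  # zero initial state; two delays in the loop
    for n, xn in enumerate(x):
        v.append(step(xn, v[n]))  # v[n] holds v(n-2) at time n
    return v[2:]

original    = simulate(lambda xn, v2: xn + a * (xn - v2))
transformed = simulate(lambda xn, v2: (xn + a * xn) - a * v2)
assert original == transformed  # identical values, not merely close
```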

6. IMPLEMENTATION ISSUES

We have chosen to implement the resource minimal schedule with a sample period of 3 clock cycles (Fig. 5). It requires 5 multipliers, 20 adders, and 70 D-elements. Since the coefficient a is fixed, the multipliers were simplified into a sign-extension circuit, one bit-serial adder with an implicit D-element, and one D-element according to [8]. Figure 8 shows more explicitly the logic realization of the filter core. It is based on true single-phase clocked logic [6]. The device count is approximately 2400 for the filter, and an additional 400 transistors are required for the control circuitry. In Fig. 9 the logic realization of the control unit is shown. It generates the necessary timing signals for initializing new operations in the filter core. Basically, it consists of a 15-bit wide shift register that shifts a single '1' through its output stages. Along with the shift register, some additional reset logic has been added to guarantee a correct starting condition during power-up.
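The control unit thus behaves as a one-hot ring counter whose period, 15 clock cycles, equals the scheduling period (5 samples of 3 clock cycles each). A behavioral sketch with illustrative names, not taken from the design:

```python
def control_unit(n_stages=15):
    """One-hot ring counter; yields (c(15n), c(15n+1), ..., c(15n+14))."""
    state = [1] + [0] * (n_stages - 1)    # reset logic forces a single '1'
    while True:
        yield tuple(state)
        state = [state[-1]] + state[:-1]  # circular shift every clock cycle

timing = control_unit()
assert next(timing)[0] == 1  # c(15n) is active in the first clock cycle
```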


Fig. 8. Logic realization of the filter.

Fig. 9. Control unit.

Due to the circular structure of the data flow in the filter core, the core has been laid out around the control unit. To account for clock skew, edge-triggered devices have been inserted along two radial cuts in the data paths. Clocks are distributed in a direction opposite to the data flow in order to reduce the sensitivity to clock delay. An internal clock buffer is used to ensure a well-shaped clock signal. The chip has been designed to operate with a clock frequency of at least 400 MHz, giving a sample frequency of 130 MHz. For high-speed testing purposes, a bit-parallel input and output format was realized by using shift registers as serial-parallel converters [5], along with a pseudo-random generator for test pattern generation and a parity checker for verification of the output. Of course, it is also possible to use the same approach with bit-parallel arithmetic. However, the maximal sample rate will not increase significantly, and the required chip area will be much larger.

7. PERFORMANCE ESTIMATES

Simulation results indicate that the following performance can be achieved for the bit-serial implementation:

Input data word length: 12 bits
Output data word length: 12 bits
Maximal sampling rate: ≈ 130 MHz
Number of devices: ≈ 2800
Chip area: 1 mm2 (pads excluded)
Power consumption: 30 mW @ fsample = 35 MHz; 110 mW @ fsample = 130 MHz

In comparison with the bit-parallel implementation in [1], the device count is reduced by a factor of 3 for the bit-serial implementation. This leads to significantly less chip area and power consumption compared to the bit-parallel implementation. The power consumption is estimated at 0.85 mW/MHz. A sample frequency of 35 MHz corresponds to a clock frequency of 105 MHz and a power consumption of 30 mW. The chip area is about 1 mm2 using a 0.8 µm, double-metal CMOS process. Simulations as well as measurements on similar circuits show that these circuits can be clocked at very high speeds [6]. It is estimated that a clock frequency well above 400 MHz is feasible with careful circuit design. Hence, a sample frequency of more than 130 MHz is feasible with a power consumption of 110 mW. A photograph of the chip is shown in Fig. 10.

Fig. 10. Chip photograph.


8. SUMMARY

In this paper we have presented an approach to design and implement very fast fixed-function digital filters using bit-serial arithmetic. We have shown that maximally fast and resource minimal implementations can be found by using a periodic scheduling technique. Sampling frequencies of more than 130 MHz are feasible. We have also shown how numerical equivalence transformations can be applied to the signal-flow graph to reduce the iteration period. A third-order lattice wave digital filter has been implemented using a state-of-the-art, dynamic CMOS logic style. The estimated performance in terms of chip area and power consumption is an order of magnitude better than state-of-the-art designs based on traditional techniques.

REFERENCES
[1] Kleine U. and Böhner M.: A High-Speed Wave Digital Filter Using Carry-Save Arithmetic, Proc. ESSCIRC'87, Bad-Soden, pp. 43-46, 1987.
[2] Pandel J. and Kleine U.: Design of Bireciprocal Wave Digital Filters for High Sampling Rate Applications, Frequenz, Vol. 40, No. 11/12, 1986.
[3] Renfors M. and Neuvo Y.: The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints, IEEE Trans. on Circuits and Systems, Vol. CAS-28, No. 3, pp. 196-202, March 1981.
[4] Wanhammar L., Afghahi M., and Sikström B.: On Mapping of DSP Algorithms onto Hardware, IEEE Intern. Symp. on Circuits and Systems (ISCAS-88), Espoo, Finland, pp. 1967-1970, June 1988.
[5] Wanhammar L.: System Design: DSP Integrated Circuits, Prentice-Hall, 1994 (in preparation).
[6] Yuan J.: High Speed CMOS Circuit Technique, Linköping Studies in Science and Technology, Thesis No. 132, Linköping University, Sweden, 1988.
[7] Palmkvist K., Vesterbacka M., Nordhamn E., and Wanhammar L.: A Fast Bit-Serial Lattice Wave Digital Filter, NUTEK Workshop on Digital Comm., Uppsala University, Uppsala, Sweden, May 25-26, pp. 88-92, 1992.
[8] Vesterbacka M., Palmkvist K., and Wanhammar L.: Realization of Serial/Parallel Multipliers with Fixed Coefficients, National Conference on Radio Science - 93, Lund Institute of Technology, Lund, Sweden, April 5-7, pp. 209-212, 1993.

PAPER 5

MAXIMALLY FAST, BIT-SERIAL LATTICE WAVE DIGITAL FILTERS

Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar

Proceedings of the 7th IEEE Digital Signal Processing Workshop (DSPWS'96), pp. 207-210, Loen, Norway, Sept. 1-4, 1996. Presented at the 7th IEEE Digital Signal Processing Workshop (DSPWS'96), Loen, Norway, September 1-4, 1996.


MAXIMALLY FAST, BIT-SERIAL LATTICE WAVE DIGITAL FILTERS

Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Email: [email protected], [email protected], [email protected]

ABSTRACT

An approach to scheduling lattice wave digital filters so that the maximal sample frequency is obtained is presented. In the approach, bit-serial arithmetic and a scheduling method that decouples the sample period from the scheduling period are used. A lower bound on the scheduling period required to arrive at the minimum sample period is given. Different latency models for the arithmetic operations, and their effect on the minimum sample period, are discussed. The operation schedule is mapped to a hardware structure using an isomorphic mapping. The throughput of the resulting implementations is comparable to that of corresponding bit-parallel implementations.

1. INTRODUCTION

Lattice wave digital filters are robust, low-sensitivity filter structures with favorable properties from an implementation point of view [1, 2]. The lattice structure is modular and regular, with a high degree of parallelism that allows for high throughput. An implementation of a maximally fast, third-order, bireciprocal lattice wave digital filter has earlier been presented by the authors in [3]. This implementation achieved a sampling frequency of more than 100 MHz using bit-serial arithmetic.

In this paper, the more complex case of implementing a maximally fast, (k+1)th-order, general lattice wave digital filter is studied. Here, more than one recursive section has to be considered, since a general lattice wave digital filter consists of a number of cascaded first- and second-order allpass sections Si, as shown in Figure 1.

Figure 1. Lattice wave digital filter with allpass sections.

In the following it is shown how the maximally fast implementation is achieved, i.e., an implementation yielding maximum throughput. The filter is also assumed to be implemented using bit-serial arithmetic, which may yield a throughput comparable to bit-parallel arithmetic [3, 4]. The benefits of this design choice are a significantly smaller cost in terms of hardware resources and chip area. As a consequence, power consumption and speed are also affected in a positive sense, due to the lower capacitive load originating from wire routing.

2. MINIMAL SAMPLE PERIOD

Since the allpass sections of a lattice wave digital filter are recursive, the throughput of the filter is bounded by the critical loops. The minimum sample period Tmin is given by

Tmin = max_i { Top(i) / N(i) }

where Top(i) is the total latency due to the operations and N(i) is the number of delay elements in the directed loop i [5].

The loop in a first order section is indicated in Figure 2. This loop contains two additions, one multiplication with a0, and one delay element, yielding the minimum sample period bound T0min = 2Tadd + Ta0.

Figure 2. The loop in a first order section.

In Figure 3, the critical loop in a second order section is indicated. For this loop, four additions, two multiplications, and one delay element yield the bound Ti/2min = Tai + Tai−1 + 4Tadd.

Figure 3. The critical loop in a second order section.

The minimum sample period of the filter is then bounded by the section with the lowest throughput, i.e., the loop i that yields Tmin, since each section has to operate with the same throughput.
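Under latency model 1 from Section 3 (Tadd = 1 clock cycle, Ta = Wcf + 1 clock cycles), the bounds above are straightforward to evaluate. A sketch, with made-up coefficient word lengths:

```python
def t_add():
    return 1            # model 1 adder latency, in clock cycles

def t_mult(wcf):
    return wcf + 1      # model 1 multiplier latency, Wcf fractional bits

def first_order_bound(wcf0):
    return 2 * t_add() + t_mult(wcf0)                     # T0min

def second_order_bound(wcf_i, wcf_im1):
    return 4 * t_add() + t_mult(wcf_i) + t_mult(wcf_im1)  # Ti/2min

bounds = [first_order_bound(4), second_order_bound(5, 3)]
print(max(bounds))  # the slowest section bounds the whole filter
```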

3. LATENCY MODELS

To identify the critical loops in the filter, the latency of the additions and multiplications must be known. Here, decisions about the level of pipelining have to be made. For a static CMOS logic style, the processing elements may be implemented with a gate-delay latency, or with a clock-cycle latency if D flip-flops are used for pipelining. For dynamic CMOS logic styles with the logic and latches merged, pipelining of the gates is required. Two latency models for a bit-serial adder are shown in Figure 4.

Figure 4. Latency models for a bit-serial adder.

In model 0, which corresponds to a static CMOS logic style without pipelining of the gates, the latency is equal to the gate delay of a full adder. In model 1, which corresponds to a dynamic CMOS logic style, or a static CMOS logic style with pipelining on the gate level, the full adder followed by a D flip-flop causes the latency to become one clock cycle. Example implementations show that model 1 generally results in faster bit-serial implementations, due to the shorter logic paths between the flip-flops in successive operations [6]. For multiplication, a simplified serial/parallel multiplier that uses bit-serial adders may be used [6]. The corresponding latency models for a serial/parallel multiplier are shown in Figure 5.


Figure 5. Latency models for a serial/parallel multiplier.

Denoting the number of fractional bits of the coefficient by Wcf, the latencies become Wcf for latency model 0 and Wcf + 1 for latency model 1. It is thus important to minimize the word lengths of the coefficients when the filter is designed, since they affect the minimum sample period.

4. CYCLIC SCHEDULING

For bit-serial arithmetic, the execution times of the processing elements are normally longer than the minimum sample period. In some algorithms, the critical loops may also contain more than one delay element. In these cases, it is not sufficient to schedule the arithmetic operations over a period equal to the minimum sample period in order to achieve that sample period [4]. Instead, a cyclic scheduling has to be performed, accounting for the fact that the schedule is inherently periodic. In a cyclic schedule, operations belonging to several successive sample periods are included. This principle is illustrated in Figure 6, where the operations belonging to one sample interval i have been collected into a set Ni. From this scheduling formulation, it is possible to find a resource-optimal schedule using arbitrary constraints on sample rate and latency. Note that delay elements are redundant, since they only define boundaries between operations belonging to different sample periods. In Figure 6 the delay elements have been omitted, and the periodicity in the schedule has been visualized by drawing the operations on a cylinder of circumference equal to the scheduling period.

Figure 6. Illustration of the cyclic operation schedule.

In Figure 7, the operations belonging to m sample periods in a first order allpass section have been scheduled. The shaded areas indicate execution time for the operations, with darker shaded areas indicating latency. A corresponding, maximally fast schedule for a second order section is shown in Figure 8. Since a processing element only processes one bit in each clock cycle, while a minimal sample period of Tmin clock cycles requires a sample of length Wd bits to be processed, operations belonging to m sample periods need to be included in the schedule. Therefore, m must be selected as

m ≥ Wd / Tmin

in order to match the bandwidths between the processing elements and the sample period. The scheduling period for a maximally fast schedule becomes mTmin.

Figure 7. Maximally fast schedule for a first order section.

For the allpass sections with a loop not yielding Tmin, the throughput has to be lowered to equalize the throughput in all sections. This can be achieved in two ways. First, m can be selected equal in all sections, which will require additional equalizing delays in the schedule to lengthen the scheduling period. Second, each section can be scheduled to be maximally fast, and then the clock period can be lengthened to match the bandwidths. The obvious choice is the first alternative, since the second alternative requires excessive hardware in terms of multiple clocks, bit-buffers, and multiplexers/demultiplexers between sections with different m in order to adjust the number of sample inputs. Additional equalizing delays should preferably be inserted in cuts with few branches to minimize the amount of extra storage. Suitable locations for additional delay in the schedules above are between the processor sets Ni, since these cuts only contain the branches corresponding to the loops in the algorithm.
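The selection of m and the equalizing delays can be summarized in a few lines. The section bounds below are made-up example values, in clock cycles per delay element:

```python
from math import ceil

Wd = 16                        # data word length
section_bounds = [7, 14, 10]   # Top(i)/N(i) for each allpass section
Tmin = max(section_bounds)     # 14: the slowest section sets the pace
m = ceil(Wd / Tmin)            # 2 samples per scheduling period
print(m * Tmin)                # scheduling period: 28 clock cycles

# With equal m in all sections, each faster loop is padded up to Tmin.
print([Tmin - b for b in section_bounds])  # equalizing delays: [7, 0, 4]
```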

5. MAXIMALLY FAST FILTER SCHEDULE

The resulting scheduling formulation for the complete lattice wave digital filter is shown in Figure 9. Here, each allpass section is scheduled with a scheduling period of mTmin.

Figure 8. Maximally fast schedule for a second order section.


Figure 9. Scheduling formulation for a complete filter.

The input and output of the filter consist of m parallel bit-serial streams that are distributed in time over the scheduling period. Since the regular structure of lattice wave digital filters is also found in the maximally fast schedule, it should be a reasonable task to develop software tools for synthesis of such filters from the transfer function.

6. MAPPING TO A HARDWARE STRUCTURE

An isomorphic mapping between the arithmetic operations and the bit-serial processing elements can generally be used for a maximally fast implementation, since the resource utilization usually must be high in order to reach the minimum sample period. Thus, a hardware structure is achieved that is maximally fast with a high resource utilization [4]. It has been shown that an efficient approach to implementing digital filters with low power consumption is to design a maximally fast implementation and then convert the excess speed to lower power consumption by reducing the supply voltage [6]. The required chip area is very small for bit-serial implementations.

7. SUMMARY

The methodology previously used in the design of a maximally fast, bireciprocal lattice wave digital filter has been extended to include the general structure of lattice wave digital filters. Different latency models for the arithmetic operations, and their effect on the minimum sample period, have been discussed. Maximally fast, general schedules for the allpass sections have been presented. A lower bound on the scheduling period required to arrive at the minimum sample period using cyclic scheduling was given. The final schedule is regular and suitable for use with filter synthesis tools.


REFERENCES
[1] Fettweis A., Levin H., and Sedlmeyer A.: Wave Digital Lattice Filters, Intern. J. Circuit Theory and Appl., Vol. 2, pp. 203-211, June 1975.
[2] Fettweis A.: Wave Digital Filters: Theory and Practice, Proc. IEEE, Vol. 74, No. 2, pp. 270-327, Feb. 1986.
[3] Vesterbacka M., Palmkvist K., Sandberg P., and Wanhammar L.: Implementation of Fast Bit-Serial Lattice Wave Digital Filters, Proc. of IEEE Intern. Symp. on Circuits and Systems (ISCAS '94), Vol. 2, pp. 113-116, London, May 30 - June 1, 1994.
[4] Wanhammar L.: DSP Integrated Circuits, Linköping University, 1996.
[5] Renfors M. and Neuvo Y.: The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints, IEEE Trans. on Circuits and Systems, Vol. CAS-28, No. 3, pp. 196-202, March 1981.
[6] Vesterbacka M.: Implementation of Maximally Fast Wave Digital Filters, Thesis No. 495, LiU-Tek-Lic-1995:27, Linköping University, June 1995.


PAPER 6

ARITHMETIC TRANSFORMATIONS FOR FAST BIT-SERIAL VLSI IMPLEMENTATIONS OF RECURSIVE ALGORITHMS

Kent Palmkvist, Mark Vesterbacka, and Lars Wanhammar

Proceedings of IEEE Nordic Signal Processing Symposium (NORSIG'96), pp. 391-394, Espoo, Finland, Sept. 24-27, 1996. Presented at IEEE Nordic Signal Processing Symposium (NORSIG'96), Espoo, Finland, Sept. 24-27, 1996.


ARITHMETIC TRANSFORMATIONS FOR FAST BIT-SERIAL VLSI IMPLEMENTATIONS OF RECURSIVE ALGORITHMS

Kent Palmkvist, Mark Vesterbacka, and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
E-mail: [email protected], [email protected], [email protected]

ABSTRACT

A method to increase the throughput of static recursive algorithms is presented. The signal-flow graph is transformed by first minimizing the number of summation points in the computational loops. A second transformation rewrites the fixed-coefficient multiplications as a sum of weighted signals, which is then followed by a reordering of the summations. It is shown how a sum of products can be implemented in this way. Sharing of sub-expressions is also discussed. A bit-serial implementation of a third-order bireciprocal lattice WDF is used to illustrate the transformations and the sharing of sub-expressions.

1. INTRODUCTION

The throughput of a recursive algorithm is limited by the latency of the computational loops. Equation (1) describes the minimal sample period Tmin [1]. Various methods have been proposed in order to realize circuits with a sample period equal to the minimal sample period [2, 5, 6].

Tmin = max_i { Topi / Ni }    (1)

where Topi is the total latency due to the arithmetic operations, etc., in the directed loop i, and Ni is the number of delay elements in the directed loop i. The loops having the maximal Topi/Ni ratio are called critical loops. The sample period cannot be reduced without modifying some property of the operations in these loops.

Various methods have earlier been proposed in order to increase the throughput by modifying the algorithm [2, 5, 6]. However, some of these transformations may modify the computational properties of the original algorithm and may even yield an unstable algorithm. The method described here modifies the signal-flow graph, but it retains the computational properties of the original algorithm.

2. BIT-SERIAL ARITHMETIC

The arithmetic operations in digital filters, FFTs, DCTs, etc., consist of additions, subtractions, and multiplications. In the bit-serial case, one bit of the input data is processed in each clock cycle, usually starting with the LSB. The use of bit-serial arithmetic results in small processing elements and short interconnection paths between the processing elements. Combinatorial paths through the logic are short, allowing for high bit-rates. This also makes bit-serial arithmetic well suited for implementation using dynamic logic styles of type model 1 [6]. These logic styles are characterized by the fact that the latency is increased by one clock cycle compared to the inherent latency of the operation. Combining such blocks has the favorable property of not reducing the maximal clock frequency, as the length of the combinatorial paths does not increase. Examples of logic families that are described by model 1 are TSPC and clocked CVSL.

The latency is defined as the time required for an input of a given significance level to affect the output at the same significance level. This gives an inherent latency of a bit-serial adder equal to zero clock cycles, but an implementation corresponding to model 1 will yield a latency of one clock cycle. A common case is multiplication with a fixed coefficient, which may be realized as multiplication of a bit-serial input of Wd bits with a bit-parallel coefficient of Wc bits, generating a bit-serial output of Wd + Wc − 1 bits. A model 1 realization of this yields a latency equal to the number of fractional bits in the coefficient plus one.

3. ARITHMETIC TRANSFORMATIONS

Using two's-complement representation allows the designer, under mild conditions, to select an arbitrary addition order. This property has previously been used to reduce Tmin by removing all but one addition from the critical loop [5, 6]. In many algorithms the addition order is not relevant, or the algorithms are described under the assumption that few operations correspond to fast implementations. This is however not the case for applications with high throughput requirements implemented using algorithm-specific hardware. In these cases, the parallelism of the operations is more important.


3.1 Addition Reordering

Addition of multiple values is usually implemented using a summation tree, as shown in Fig. 1. The depth of the tree depends on the number of terms in the sum. An increase in the number of levels corresponds to an increase in the maximum latency through the addition tree.

Fig. 1. Different summation orders result in different latency.

In the case of model 1 logic, the latency measured in clock cycles through the addition tree is equal to the number of levels between the input and the output. It is, however, also possible to implement the addition as a sequence of additions. Compared to the tree structure, this results in an increased latency for most of the inputs and a decreased latency for a few inputs. The maximum clock frequency is not affected by this reordering. By mapping the critical paths to the adder inputs that have a small latency, while using the long-latency inputs for the non-critical signals, the total system may be implemented with a higher throughput compared to a system with non-optimized addition orders.

A minimum number of summation points in the critical loop allows the designer to select an addition order that results in a minimal latency. This generally corresponds to having a single summation point in the critical loop. Explicit expressions for intermediate values are then eliminated, and only the expressions for the state variables are retained explicitly. This transformation step may alter, for example, the stability property of an algorithm, as it may move truncation/rounding of multiplication results. If multiple sums share input values, sub-expressions may be shared. There are several ways in which sub-expressions may be shared. Sharing is not allowed if it increases the latency in the critical path.
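The trade-off can be made concrete by counting the adders on the path from each input to the sum, which under model 1 logic is the latency in clock cycles. A small sketch for k inputs (the helper names are ours):

```python
from math import ceil, log2

def tree_latencies(k):
    """Balanced tree: every input passes ceil(log2(k)) adders."""
    return [ceil(log2(k))] * k

def chain_latencies(k):
    """Chain ((x0+x1)+x2)+...: input i passes k - max(i, 1) adders."""
    return [k - max(i, 1) for i in range(k)]

print(tree_latencies(4))   # [2, 2, 2, 2]: uniform latency
print(chain_latencies(4))  # [3, 3, 2, 1]: the last input sees one adder
```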


3.2 Multiplication to Addition Transformation

The structure of a general serial-parallel multiplier is shown in Fig. 2a. The input signal is delayed, which multiplies it by powers of two. These weighted input signals are then added together to form the final product. This corresponds directly to Eq. (2).

p(n) = 2a0 x(n) + 4a1 x(n) + 8a2 x(n) + 16a3 x(n)    (2)

Multiplying the input value by a coefficient containing f fractional bits gives an expression of the form

p(n) = 2^-(f+1) (2a0 x(n) + 4a1 x(n) + 8a2 x(n) + 16a3 x(n))    (3)

The term 2^-(f+1) corresponds directly to the inherent delay of f clock cycles, plus one clock cycle of delay due to the model 1 logic style.

Fig. 2. Serial-parallel multipliers: a) general structure, b) retimed structure.

The first flip-flop in Fig. 2a is not required, but is included in order to illustrate how this structure can be retimed for implementation using model 1 logic. Figure 2b shows a retimed version of the multiplier structure. This structure is well suited for model 1 logic, as every adder is followed by a flip-flop. This structure corresponds to the mathematical expression (4).

p(n) = 2(a0 x(n) + 2(a1 x(n) + 2(a2 x(n) + 2a3 x(n))))    (4)
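The equality of the retimed form (4) and the flat sum (2) is a plain algebraic identity; the loop below checks it exhaustively for all four-bit coefficient patterns:

```python
x = 1.0
for bits in range(16):
    a0, a1, a2, a3 = ((bits >> j) & 1 for j in range(4))
    flat   = 2*a0*x + 4*a1*x + 8*a2*x + 16*a3*x        # Eq. (2)
    nested = 2*(a0*x + 2*(a1*x + 2*(a2*x + 2*a3*x)))   # Eq. (4)
    assert flat == nested
```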


It is possible to simplify the structure in Fig. 2 if a fixed coefficient is used. Every adder corresponding to a bit position in the coefficient containing a zero is then available for addition of another value. This allows summations from ordinary additions to be interleaved into the summation originating from the multiplication. A model 1 logic implementation of a multiply-and-add operation would then have a latency consisting only of the inherent latency of the multiplier plus one, as the addition operation is included in the multiplication. Several values can be added without additional latency if the coefficient is described using signed-digit coding (SDC), as an addition can be replaced by a subtraction. The SDC coefficient must, however, not contain more fractional bits than the two's-complement version if the value being multiplied is formed in the critical loop. Otherwise, the increase in the number of fractional bits would correspond to an increased inherent delay in the multiplication.

The combination of summation and multiplication may be further generalized to a sum of products with fixed coefficients. The inputs are in this case fed into shift registers that form the weighted versions, and the selected signals are then added together. The addition order is then selected in the same way as previously described in order to reach a low latency. Finally, the structure is retimed in order to be implemented using model 1 logic.

Fig. 3. Sum of products implemented using summation of weighted inputs.

3.3 Sharing Sub-Expressions

If more than one expression is to be calculated, sub-expressions may be shared. These must be available with the correct timing. It is often possible to share sub-expressions even if the available results contain too many terms, by removing values from an already calculated sum.

4. IMPLEMENTATION EXAMPLE

A third-order bireciprocal lattice wave digital filter has previously been implemented using both bit-parallel and bit-serial arithmetic [4, 5, 6]. Figure 4 shows the signal-flow graph. The minimal sample period was then reduced from 3 down to 2.5 clock cycles by reordering of the additions in the critical loop.

Paper 6

97

The algorithm has a critical loop consisting of the single computational loop in which the state variable v(n) is calculated. The algorithm uses the input value x(n), the previous input value x(n−1), and a previous internal state v(n−2) to calculate the output value y(n) and a new internal state v(n). First, the intermediate nodes are removed by describing only the expressions to be calculated. These are then rewritten in the form of a sum of products, as shown in Eq. (5) and Eq. (6).

Fig. 4. Third-order lattice wave digital filter.

v(n) = x(n) + 0.011₂ (x(n) − v(n−2))
     = x(n) + 2^-2 x(n) + 2^-3 x(n) − 2^-2 v(n−2) − 2^-3 v(n−2)    (5)

y(n) = x(n−1) + v(n−2) + 0.011₂ (x(n) − v(n−2))
     = x(n−1) + 2^-2 x(n) + 2^-3 x(n) + v(n−2) − 2^-2 v(n−2) − 2^-3 v(n−2)    (6)

The multiplication of v(n−2) in (5) has an inherent latency of three clock cycles, as the coefficient contains three fractional bits. The total latency in the critical loop is therefore at least four clock cycles for a model 1 logic implementation. The next step is to rearrange the addition order. The critical loop consists of the scaling of v(n−2) followed by the additions. All other terms consist of scaled versions of x(n) and x(n−1), which may be delayed an arbitrary number of clock cycles. The scaled versions of v(n−2) are therefore the last terms to be added to the sum, resulting in a low latency. The selected addition order is shown in Fig. 5. Each input starts with flip-flops that are propagated into the circuit during retiming. The x(n) input has two extra flip-flops compared to v(n−2), meaning that the x(n) input must be available two clock cycles prior to v(n−2).
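The rewrite in (5) can be checked numerically: since a = 0.011₂ = 0.375 and the made-up test values below are exactly representable in binary, the original adaptor expression and the sum of weighted signals must agree exactly:

```python
for xn in (1.0, -2.5, 3.75):
    for v2 in (0.0, 1.5, -0.625):          # v2 plays the role of v(n-2)
        direct   = xn + 0.375 * (xn - v2)
        weighted = xn + 2**-2*xn + 2**-3*xn - 2**-2*v2 - 2**-3*v2
        assert direct == weighted
```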

98

Studies on the Design and Implementation of Digital Filters

Fig. 5. Selected addition order for v(n).

The final step is to retime the additions, allowing a model 1 logic implementation. The addition order selection and retiming may be performed simultaneously by rewriting the expressions. Every factor of 2 corresponds to a flip-flop. The equations are restructured to contain factors of 2(a + b), each of which corresponds to a single full adder implemented using a logic style of type model 1. The final expression contains an initial factor of 2^-4, indicating a total latency of the recursive loop equal to 4 clock cycles. This yields a minimal sample period of two clock cycles.

The computation of y(n) is non-critical. The goal is therefore to allow for a maximal amount of sub-expression reuse from the computation of v(n). The value v(n) is selected as a sub-expression, as it contains four of the six weighted inputs to add. The unwanted term containing x(n) is removed, and the missing terms are added as shown in Fig. 6.

Fig. 6. Computation of y(n) by reusing v(n).

The logic realization of one block consisting of full adders and flip-flops is shown in Fig. 7. This structure closely resembles the signal-flow graph in Fig. 4, as no transformations have been applied to this realization. This block may then be cascaded in a cyclic fashion in order to implement a complete filter [4]. The minimal sample period is in this case 3 clock cycles. Figure 8 shows a logic realization of one block for the transformed signal-flow graph. The minimal sample period has been reduced by 33%. The number of adders has increased by 2 (40%) compared to the initial bit-serial implementation, while the number of flip-flops has been reduced by 7 (35%). A speed gain is in this case achieved without increased chip area, as an extra adder consists of two flip-flops plus logic.

Fig. 7. Logic realization of the filter without transformations.

Fig. 8. Logic realization of the filter using transformations.

5. CONCLUSIONS

It has been shown that addition reordering may also be applied to multiplications in order to reach high throughput using bit-serial arithmetic. The total latency may in most cases be decreased to the number of fractional bits in the coefficient by which the critical loop result is multiplied. We have also described how the latency may be extracted directly from the mathematical expressions.

REFERENCES
[1] Renfors M. and Neuvo Y.: "The Maximum Sampling Rate of Digital Filters Under Hardware Speed Constraints", IEEE Trans. on Circuits and Systems, Vol. CAS-28, pp. 196-202, Mar. 1981.
[2] Parhi K. K.: "Algorithm Transformation Techniques for Concurrent Processors", Proc. of the IEEE, Vol. 77, No. 12, pp. 1879-1895, Dec. 1989.
[3] Fettweis A.: "Wave Digital Filters: Theory and Practice", Proc. of the IEEE, Vol. 74, No. 2, pp. 270-327, Feb. 1986.
[4] Kleine U. and Böhner M.: "A High-Speed Wave Digital Filter Using Carry-Save Arithmetic", in Proc. ESSCIRC'87, Bad-Soden, 1987, pp. 43-46.
[5] Vesterbacka M., Palmkvist K., Sandberg P., and Wanhammar L.: "Implementation of Fast Bit-Serial Lattice Wave Digital Filters", in Proc. IEEE Int. Symposium on Circuits and Systems (ISCAS'94), London, England, May 29 - June 2, 1994, Vol. 2, pp. 113-116.
[6] Vesterbacka M.: Implementation of Maximally Fast Wave Digital Filters, Linköping Studies in Science and Technology, Thesis No. 495, Linköping University, Sweden, 1995.




PAPER 7

HIGH-SPEED MULTIPLICATION IN BIT-SERIAL DIGITAL FILTERS

Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar

Proceedings of IEEE Nordic Signal Processing Symposium (NORSIG'96), pp. 179-182, Espoo, Finland, Sept. 24-27, 1996. Presented at IEEE Nordic Signal Processing Symposium (NORSIG'96), Espoo, Finland, Sept. 24-27, 1996.


HIGH-SPEED MULTIPLICATION IN BIT-SERIAL DIGITAL FILTERS

Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Email: [email protected], [email protected], [email protected]

ABSTRACT

Canonic signed-digit code representation of multiplier coefficients is often used in digital filters to reduce the required amount of hardware resources. Another approach, taken in this paper, is to use canonic signed-digit coded coefficients to increase the throughput of the multiplier. We show how the suggested approach applies to serial/parallel multipliers with fixed coefficients. A maximally fast implementation of a digital filter is further used as an example to demonstrate the use of the multipliers in recursive digital filters. The resulting bit-serial filters yield a throughput comparable to bit-parallel implementations, while using only a fraction of the hardware resources. The filters can be used directly in high-speed applications, or in low-power applications after supply voltage scaling.

1. INTRODUCTION

For a maximally fast implementation, the sampling period is equal to the minimum sample period, which is [1]

Tmin = max_i { Topi / Ni }    (1)

where Topi is the total latency due to the operations and Ni is the number of delay elements in the directed loop i. A maximally fast implementation using bit-serial arithmetic is attractive because of the high throughput that is possible, the small chip area, and the low power consumption. To arrive at a maximally fast implementation, the method of cyclic scheduling must be used [2, 3].

In earlier work [4] we have shown that the throughput in a recursive filter may be increased by increasing the latency of the processing elements. This seems to be a paradox, but the observation that the latency of a serial/parallel multiplier actually is dependent on the throughput explains the result. For such an implementation, the computations in the critical loop will require more clock cycles, while the clock period is shortened. The result is a net gain in the overall throughput of an implementation of a recursive algorithm.

In this paper it is shown how bit-serial adders with larger latency may be used in the realization of serial/parallel multipliers, yielding an increased throughput for both the multipliers and the whole filter. The application to maximally fast filter implementation is then demonstrated by means of an example.

2. BIT-SERIAL ADDITION

The latency of a bit-serial processing element is defined as the time it takes to produce an output bit from an input bit of the same significance. To investigate how the latency affects the throughput of an implementation, three models for bit-serial addition, with the least significant bit first, have been studied. These models are further used in the realization of corresponding serial/parallel multipliers, since such multipliers basically consist of addition and delay (shifts). The models for the adder latency are illustrated in Fig. 1. Each model is described in the subsequent subsections.

2.1. Model 0 Adder

Model 0 in the top of Fig. 1 yields a latency equal to the gate delay of a full adder. The effective contribution to the latency is Tadd0 = 0·Tclk0, i.e., the latency in terms of clock cycles is zero, with a prolonged clock period Tclk0.

2.2. Model 1 Adder

Model 1 in the middle of Fig. 1 corresponds to pipelining on the adder level. For this case the latency is Tadd1 = 1·Tclk1. Thus, the latency is increased by one clock cycle, while the clock period Tclk1 remains constant.

2.3. Model 2 Adder

Model 2 in the bottom of Fig. 1 corresponds to pipelining inside the adders. For this case the latency is Tadd2 = 2·Tclk2. The latency is increased by two clock cycles, while the clock period Tclk2 remains constant. The pipelined model 2 adders may be realized according to Fig. 2. Here the circuit has been retimed, delimiting the critical path to one exclusive-OR gate and one D flip-flop. This allows for a clock period Tclk2 shorter than the clock period Tclk1 of latency model 1, where the critical path consists of one full adder and one D flip-flop.

Fig. 1. Computational models of the latency for bit-serial addition.

Fig. 2. Logic realization of the model 2 adder.


3. SERIAL/PARALLEL MULTIPLICATION

The latency of a serial/parallel multiplier is the time it takes to produce the additional least-significant bits of the full-precision product compared to the data. This number of bits is equal to the number of fractional bits in the coefficient, which is denoted Wf. In Fig. 3, serial/parallel multipliers have been realized using the corresponding adders in Fig. 1. Two's-complement (2C) representation of the data has been assumed, which requires the input data x(n) to the multipliers to be sign-extended with Wf bits. Also, the leftmost adder of each multiplier has been converted into a subtracter in order to handle the sign bit of the coefficient. The first D flip-flop, marked with a star (*), is initially set. Note that the zero input of the leftmost adder may alternatively be used to add data to the product without additional hardware.

Fig. 3. Realization of serial/parallel multipliers based on different latency models.

3.1. Model 0 Serial/Parallel Multiplier

There is a direct path through one AND gate and one full adder in the model 0 serial/parallel multiplier in Fig. 3. Since the multipliers may be cascaded without intermediate D flip-flops, similarly to the latency model 0 adder, cascades of the model 0 processing elements yield a prolonged clock period Tclk0. The model 0 multiplier latency becomes Tmul0 = Wf·Tclk0, due to the initial computation of the additional least-significant bits of the product.

3.2. Model 1 Serial/Parallel Multiplier

The model 1 serial/parallel multiplier in Fig. 3 has an additional D flip-flop at the output compared to the model 0 multiplier. Hence all adders may be implemented as latency model 1 adders. The latency for the model 1 multiplier becomes Tmul1 = (Wf + 1)·Tclk1.

3.3. Model 2 Serial/Parallel Multiplier

The serial/parallel multiplier based on model 2 adders in Fig. 3 has in principle two D flip-flops at the output of each adder. For this realization, multiplication with an arbitrarily chosen two's-complement coefficient cannot be accomplished, since subsequent coefficient bits must have a relative bit-weight of at least four. To solve this problem, the coefficient is represented in canonic signed-digit code (CSDC) [5]. The key idea is to use the fact that all non-zero digits in the CSDC representation are isolated, which results in a relative bit-weight of four, or larger, between subsequent coefficient bits. A drawback with the method is that it cannot handle variable coefficients, since the non-zero digits may change position for different coefficients. For fixed coefficients, the model 2 multipliers can be realized with a latency of Tmul2 = (Wf + 2)·Tclk2.
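CSD recoding itself is a short routine. The sketch below (csd_fraction is an illustrative name, not from the paper) recodes a fixed fractional coefficient; the non-adjacency of the resulting non-zero digits is exactly what guarantees the relative bit-weight of at least four:

```python
def csd_fraction(a, frac_bits):
    """Canonic signed-digit recoding of a with frac_bits fractional bits."""
    n = round(a * 2**frac_bits)  # scale the fraction to an integer
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)      # +1 or -1; forces the next digit to zero
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits[::-1]          # most significant digit first

print(csd_fraction(0.40625, 5))  # [1, 0, -1, 0, 1], i.e. (0.10-101)CSDC
```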

4. PARTITIONING OF THE MULTIPLIERS

Assuming a fixed coefficient, the multipliers may be simplified, since each AND gate used to form partial product bits may be replaced by a zero or by the input x(n). Furthermore, full adders with one zero input can be removed. Multipliers simplified for the coefficient a = (0.01101)2C = (0.10-1010)CSDC are shown in Fig. 4. Note that, after simplification, the critical path of the multipliers equals the critical path of the adders of the corresponding latency models.

Paper 7

109

Fig. 4. Simplified serial/parallel multipliers with fixed coefficient a = (0.01101)2C = (0.10-1010)CSDC.

In Fig. 4, the partitioning of the multipliers into adders of the corresponding model has been indicated by the dashed boxes.

† For the CSDC multiplier at the bottom of Fig. 4, the addition of the partial product involving a coefficient digit with value −1 is taken care of by a subtracter. The same approach is feasible for every CSDC coefficient digit with a value of −1.

‡ For this particular coefficient, a multiplier derived from the two's-complement representation could alternatively have been partitioned into model 2 adders if two D flip-flops were added to the output. This is however not true for all coefficients, e.g., a = (0.0111)2C, where the two adders remaining after simplification would be separated by a single D flip-flop.


5. QUANTIZATION

If the multipliers are used in a recursive filter, quantization of the additional least-significant bits of the product is required to maintain the word length of the data in the loop. Two quantization circuits are shown in Fig. 5.

Fig. 5. Sign-extension and quantization circuits.

The quantization circuits are assumed to be located at the multiplier outputs. When the sign bit of the input x(n) arrives, the sign-extension control 's.e.' goes high for Wf clock cycles, which causes the multiplexer to copy the sign bit. The sign-extended product will then overwrite the least-significant bits of the following product. Thereby the product is truncated, while providing a sign-extended output that can be used directly as input to further serial/parallel multipliers. The type of quantization circuit should be selected to match the latency model of the implementation.

5.1. Model 0 Quantization Circuit

For model 0, the clock period Tclk0 is determined by the delay of a critical path through a number of processing elements. The delay of the multiplexer is assumed to be significantly shorter than this path delay. Therefore the leftmost circuit in Fig. 5, with no D flip-flop at the output, has been chosen to realize the quantization. The latency becomes Tq0 = 0Tclk0.

5.2. Model 1 Quantization Circuit

Also for model 1, the leftmost circuit of Fig. 5 has been chosen to realize the quantization, assuming that the delay of the quantization circuit is small compared to the delay of one full adder. The latency becomes Tq1 = 0Tclk1.

5.3. Model 2 Quantization Circuit

To match the short critical path of the latency model 2 adder, the quantization circuit is realized with a D flip-flop at the output, according to the circuit to the right in Fig. 5. The latency then becomes Tq2 = 1Tclk2 for model 2.


6. FILTER EXAMPLE

To demonstrate the use of the multipliers in filter design, the first-order recursive filter of Fig. 6 is taken as an example. The coefficient a has arbitrarily been chosen as 0.40625, the same value used in the multipliers in Section 4. The required data word length Wd is assumed to be 15 bits.

[Figure: first-order recursive filter with multiplier coefficient a, an adder, a quantizer Q, and a delay element T in the loop.]

Fig. 6. First-order recursive filter.

6.1. Implementations

One implementation of the filter in Fig. 6 has been made for each latency model [4]. The implementation process is illustrated in Fig. 7. First, the minimum sample period was determined to be Tmin = Tadd + Tmul + Tq, since there is only one delay element in the single loop. Then different cyclic schedules were made for the three latency models in order to reach Tmin. Finally, an isomorphic mapping from the cyclic schedules to VHDL netlists was used. The netlists were synthesized using a 0.8 µm static CMOS standard-cell library from AMS, resulting in a layout of each implementation. The power supply voltage is 5 V.

6.2. Results

Implementation results obtained from the synthesis tools are shown in Fig. 8. The number of devices (excluding control circuitry) remains fairly constant for all implementations. The maximum clock frequency increases rapidly with higher-order latency models. The clock frequency corresponds to the throughput of the individual processing elements, yielding an increase in throughput of 40% for the novel latency model 2 serial/parallel multiplier over the conventional model 1 multiplier. For the overall throughput of the filter, latency models 1 and 2 yield approximately the same maximum sample frequency, which is a factor of 2 larger than for model 0. This is due to the relatively long delay of the standard-cell D flip-flops. If D flip-flops with shorter delay are used, an increase of up to 50% in the maximum sample frequency can be expected for model 2 over model 1.
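These figures can be reproduced from the latency models. The Python sketch below (our illustration) computes the loop bound in clock cycles for each model and the resulting sample frequency from the clock rates reported in Fig. 8; the adder and quantizer latencies for models 0 and 1 are as stated in the papers, while the 2-cycle model 2 adder latency is our assumption, inferred from its two output D flip-flops.

    Wf = 5                        # a = 0.40625 = (0.01101)2C
    latency = {                   # (Tadd, Tmul, Tq) in clock cycles
        0: (0, Wf,     0),
        1: (1, Wf + 1, 0),
        2: (2, Wf + 2, 1),        # Tadd = 2 is an assumption
    }
    fclk = {0: 70e6, 1: 190e6, 2: 270e6}    # clock rates from Fig. 8

    for m, ops in latency.items():
        Tmin = sum(ops)           # one delay element in the single loop
        print(f"model {m}: Tmin = {Tmin} cycles, "
              f"fsample = {fclk[m] / Tmin / 1e6:.0f} MHz")

Running the sketch gives 5, 7, and 10 cycles, i.e., sample frequencies of 14, 27, and 27 MHz, in agreement with Fig. 8.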


[Figure: the implementation process; the signal-flow graph of the filter is cyclically scheduled over m sample periods into sets S0 ... Sm-1, which are then isomorphically mapped to hardware.]

Fig. 7. Implementation process.

[Figure: bar charts of the synthesis results; the device count stays roughly between 1100 and 1400 for all models, while the maximum clock frequency is 70, 190, and 270 MHz and the maximum sample frequency 14, 27, and 27 MHz for models 0, 1, and 2, respectively.]

Fig. 8. Implementation results.


7. CONCLUSION

With three different latency models for bit-serial addition as a base, corresponding serial/parallel multipliers have been developed. For one of the multipliers, canonic signed-digit coding of the coefficient was used to increase the throughput by 40% over the two's-complement representation. The method is applicable to multiplication with fixed coefficients. Quantization schemes for the different multipliers were also discussed. The results of an implementation of a first-order filter using the three latency models have been presented.

REFERENCES

[1] M. Renfors and Y. Neuvo: "The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints", IEEE Trans. on Circuits and Systems, CAS-28, No. 3, pp. 196-202, March 1981.
[2] L. Wanhammar: DSP Integrated Circuits, Linköping University, 1996.
[3] M. Vesterbacka, K. Palmkvist, P. Sandberg, and L. Wanhammar: "Implementation of Fast Bit-Serial Lattice Wave Digital Filters", Proc. of IEEE Intern. Symp. on Circuits and Systems, London, May 30 - June 1, 1994, Vol. 2, pp. 113-116.
[4] M. Vesterbacka, K. Palmkvist, and L. Wanhammar: "On Implementation of Fast, Bit-Serial Loops", Proc. of 1996 Midwest Symp. on Circuits and Systems, Ames, Iowa, August 18-21, 1996.
[5] R.I. Hartley and K.K. Parhi: Digit-Serial Computation, Kluwer Academic Publishers, 1995.


PAPER 8 A COMPARISON OF THREE LATTICE WAVE DIGITAL FILTER IMPLEMENTATIONS Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar Proceedings of 7th Int. Conf. on Signal Processing Applications & Technology (ICSPAT'96), Vol. 2, pp. 1909-1913, Boston, MA, Oct. 7-10, 1996. Presented at the 7th International Conference on Signal Processing Applications & Technology (ICSPAT'96), Boston, MA, Oct. 7-10, 1996.


A COMPARISON OF THREE LATTICE WAVE DIGITAL FILTER IMPLEMENTATIONS Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden Email: [email protected], [email protected], [email protected]

ABSTRACT

Three VLSI implementations of a third-order lattice wave digital filter are compared with respect to maximal sample frequency, chip area, and power consumption. Two of the implementations are maximally fast, bit-serial implementations, while the third is a conventional bit-parallel implementation. One of the maximally fast implementations yields a sampling frequency of more than 100 MHz, which can be used to achieve as low a power consumption as for the bit-parallel implementation after supply voltage scaling. This implementation also has the lowest device count.

1 INTRODUCTION

Three VLSI implementations have been made for a well-known, third-order lattice wave digital filter [1, 2] which earlier has been implemented using redundant, bit-parallel arithmetic in which long carry propagation is avoided. The signal-flow graph is shown in Fig. 1, where the coefficient a is 0.375 = (0.011)2.

[Figure: third-order lattice wave digital filter with coefficient a, an adder/subtracter pair, state v(n), and three delay elements T.]

Fig. 1. Third-order lattice wave digital filter.

The filter is a bireciprocal (half-band) lowpass filter, which can also be modified into a highpass filter by changing the last adder to a subtracter. It is also suitable for use in interpolation and decimation, where the filter can operate at the lower of the two sample rates. The attenuation of the filter is shown in Fig. 2.

[Figure: attenuation A [dB] versus ωT [rad] from 0 to π, with the stopband edge at π/2.]

Fig. 2. Attenuation.

Two of the implementations use bit-serial arithmetic with the least-significant bit first. These implementations are maximally fast, i.e., they yield the maximum throughput limited by the recursive structure of the algorithm and the latency of the processing elements in the critical loop. Cyclic scheduling is used to arrive at the maximum throughput, where operations belonging to several successive sample periods are included in the final schedule [3, 4]. As the third case, a conventional, bit-parallel implementation based on an isomorphic mapping of the initial signal-flow graph to a hardware structure has been selected for comparison. Carry-look-ahead adders are used for the additions. The fact that the single multiplier coefficient is fixed was used in all cases to simplify the multiplier. In the following sections, the different implementations of this filter are described and compared. In all cases, the input word length is assumed to be 12 bits.

2 CYCLIC SCHEDULING

To achieve a maximally fast implementation, the sample period is made equal to the minimum sample period Tmin, which is given by

    Tmin = max_i { Top(i) / N(i) }    (1)

where Top(i) is the total latency due to the operations and N(i) is the number of delay elements in the directed loop i [5]. For the algorithm in Fig. 1, the minimum sample period is Tmin = (2Tadd + Tmul)/2, where Tadd is the adder latency and Tmul is the multiplier latency.
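Equation (1) is easy to evaluate mechanically. The short Python sketch below (our illustration) computes the bound from a list of directed loops and checks the numbers used later in the paper:

    def t_min(loops):
        # loops: iterable of (total operation latency, number of delays)
        return max(t_op / n for t_op, n in loops)

    # Static CMOS bit-serial case of Section 3.1: Tadd = 0, Tmul = Wf = 3,
    # and the critical loop of Fig. 1 has two adders, one multiplier, and
    # two delay elements.
    Tadd, Tmul = 0, 3
    print(t_min([(2 * Tadd + Tmul, 2)]))    # 1.5 clock cycles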


For a maximally fast implementation using bit-serial arithmetic it is generally not sufficient to schedule the arithmetic operations over a single sample period. Instead the schedule has to include sets of operations Ni from a number of sample periods m, as illustrated in Fig. 3.

[Figure: the algorithm unfolded over m sample periods; inputs x(mn) ... x(mn+m-1) feed the operation sets N0 ... Nm-1, which produce y(mn) ... y(mn+m-1).]

Fig. 3. Unfolded algorithm including operations from m sample periods.

Bit-serial operations with execution times longer than the minimal sample period can thereby be completed within the scheduling period. Also, accounting for the periodic property of the algorithm, the schedule may be drawn on the circumference of a cylinder, without a beginning or an end delimiting the freedom of the scheduling. We denote this scheduling formulation cyclic scheduling.

3 IMPLEMENTATIONS

All of the filters were implemented in the AMS 0.8 µm double-metal CMOS process. The first (Section 3.1) and last (Section 3.3) filters were synthesized from VHDL using the AMS 0.8 µm standard-cell library, while the TSPC implementation (Section 3.2) was done in a full-custom layout style.

3.1 Bit-Serial Implementation with Static CMOS Logic

For bit-serial arithmetic it is convenient to express the latency in terms of clock cycles. In an implementation with static logic, the latency of a serial/parallel multiplier is Tmul = WfTclk, where Wf is the number of fractional bits in the coefficient. The latency of a bit-serial adder using static logic equals the gate delay of the full adder; in terms of clock cycles, the latency of an addition is therefore zero clock cycles. Of course, the adder will determine the length of the clock period. According to Eq. (1), the minimum sample period becomes Tmin = 1.5Tclk, since Wf = 3 bits for the filter implemented with static CMOS logic.


Figure 4 shows the schedule for the bit-serial processing elements in a single sample interval (Ni), while the scheduling of all the sets is shown in Fig. 5. The shaded areas indicate execution time for the arithmetic operations, while the darker shaded areas indicate operation latency.

[Figure: schedule for one set Ni of bit-serial processing elements operating on x(n), v(n), v(n-2), and x(n-1) to produce y(n) over the interval 0 to 10Tmin.]

Fig. 4. Schedule for a set Ni of processing elements.

[Figure: maximally fast cyclic schedule of the sets N0 ... N9 over 10 sample periods, with inputs x(10n) ... x(10n+9) and outputs y(10n) ... y(10n+9) skewed in time.]

Fig. 5. Maximally fast schedule.

The algorithm has been scheduled over 10 sample periods. The inputs and outputs of the filter consist of 10 bit-serial streams skewed in time. Since the maximally fast implementation requires input samples to arrive with a difference of 1.5 clock cycles


between two subsequent samples, a distribution of the inputs with alternating 1 and 2 clock cycles has been chosen to align the samples to an integer clock phase. The adders are unused during three clock periods every sample period in the schedule. The internal representation of the sums has been extended with three most-significant bits to equalize the word length for the operations. The extended representation is also sufficient to prevent overflow in all critical nodes. Since the coefficient was fixed, the multipliers were simplified into a sign-extension circuit and a bit-serial adder. Shimming delays were introduced at branches with different start and end times. An isomorphic mapping from the cyclic schedule resulted in a maximally fast resource structure that requires 10 multipliers, 40 adders, and 75 D flip-flops.

3.2 Bit-Serial Implementation with TSPC Logic

In TSPC logic, the logic is merged with the flip-flops. For TSPC and similar clocked logic styles, the latencies are Tadd = 1Tclk and Tmul = (Wf+1)Tclk. Thus a maximally fast implementation will have the minimum sample period Tmin = 3Tclk. The schedule for a single sample interval of bit-serial processing elements is shown in Fig. 6, while the scheduling of the operations for five sample intervals is shown in Fig. 7 [4]. The inputs and outputs of the filter consist of five bit-serial streams skewed in time by three clock cycles. The internal representation of the sums was also extended with three most-significant bits for this case. Thereby the word length for the operations was equalized and overflow prevented in all critical nodes. The multipliers were simplified as in the previous case. An isomorphic mapping from the cyclic schedule to a maximally fast resource structure requires 5 multipliers, 20 adders, and 70 D flip-flops.

[Figure: schedule for one set Ni of processing elements over 0 to 5Tmin, operating on x(n), v(n), v(n-2), and x(n-1) to produce y(n).]

Fig. 6. Schedule for a set Ni of processing elements.

[Figure: maximally fast cyclic schedule of the sets N0 ... N4 over 5 sample periods, with inputs x(5n) ... x(5n+4) and outputs y(5n) ... y(5n+4).]

Fig. 7. Maximally fast schedule.

3.3 Bit-Parallel Implementation with Static CMOS Logic

The lattice wave digital filter has also been implemented using bit-parallel, static CMOS logic. This implementation is based on an isomorphic mapping of the signal-flow graph to a hardware structure. Note that this approach does not yield a maximally fast implementation. The hardware structure corresponding to the signal-flow graph required one bit-parallel multiplier, four bit-parallel adders, and three registers for the delay elements. Each bit-parallel adder was implemented as a two-level carry-look-ahead adder with a word length of 15 bits to mimic the number range of the bit-serial implementations. The multiplication was implemented as an addition of right-shifted input words. For the parallel implementation the sampling frequency is the same as the clock frequency.

4 RESULTS

All the implementations described have been manufactured, and the measurement results have been compiled into Table 1. Simulation of the TSPC implementation indicated that about half of the power is consumed in the clock driver, which is included in all power measures. Results published in [1] have also been included for reference. It is, however, difficult to use this implementation for comparison, since it was implemented in a 2 µm, double-metal CMOS process. If the same process were used, both the speed and the power consumption would improve. The number of devices would however remain the same and can be taken as a crude measure of chip area.


The low density of the TSPC implementation is explained by the fact that the compacting tool was not able to compact the final layout properly. The TSPC implementation has the highest maximum sample frequency. The low maximum sample frequency of the static CMOS implementation is due to the long sum propagation path in the critical loop. The densities of the layouts differ, which makes it hard to compare the chip area required for the different implementations. Instead it can be noted that the number of devices required is lowest for the TSPC implementation, while it is significantly higher for the static CMOS implementation. Again, insertion of D flip-flops after the adders would decrease the device count by approximately a factor of two after the cyclic scheduling. The device efficiency (devices per MHz sample frequency) is taken as a measure of how efficiently the devices are used in an implementation. The TSPC implementation clearly has the best ratio. Table 1 also shows the measured power consumption at the maximum sample frequency and 5 V power supply. The TSPC implementation has the highest power consumption. To compare the power consumption with respect to throughput, voltage scaling has been used to lower the power consumption and thereby the maximum sample frequency [6]. These measures have been plotted in Fig. 8. Here it can be seen that the power consumption with respect to throughput is similar for the TSPC implementation and the bit-parallel implementation, while the static CMOS implementation, which misleadingly had the lowest power consumption in Table 1, is actually the worst. From the device density of the layouts it could be noted that the area of the TSPC implementation could be improved by up to a factor of two. This would reduce the power consumption due to reduced interconnection capacitance. Less aggressive transistor sizing would further reduce the power consumption.

5 CONCLUSION

Three VLSI implementations have been made for a third-order lattice wave digital filter. Two were maximally fast, bit-serial implementations while the third was a conventional bit-parallel implementation. A sampling frequency of more than 100 MHz was measured for one of the maximally fast implementations.


Description                        Sec. 3.1   Sec. 3.2   Sec. 3.3   Ref. [1]
Arithmetic                         serial     serial     parallel   parallel
Logic style                        static     TSPC       static     static
Input word length [bits]           12         12         12         8
Internal word length [bits]        15         15         15         11
Output word length [bits]          12         12         12         11
Max. clock frequency [MHz]         50         310        80         35
Max. sample frequency [MHz]        33         105        80         35
Number of devices [devices]        6600       2800       3800       9000
Chip area [mm2]                    0.9        1.0        0.6        15*)
Density [devices/mm2]              7300       2800       6300       n/a
Device efficiency [devices/MHz]    190        30         50         260
Supply voltage [V]                 5          5          5          5
Power consumption [mW]             50         250        80         150
Process [µm]                       AMS 0.8    AMS 0.8    AMS 0.8    2.0

*) pad areas included.

Table 1. Implementation characteristics.

The power consumption was compared at the normal power supply voltage as well as at reduced supply voltages. After supply voltage scaling, the high sample rate could also be used to achieve as low a power consumption as for the conventional, bit-parallel implementation. Thereby, the number of transistors was decreased while maintaining the same maximum sampling frequency and power consumption as the bit-parallel implementation.

REFERENCES

[1] Kleine U. and Böhner M.: A High-Speed Wave Digital Filter Using Carry-Save Arithmetic, Proceedings of ESSCIRC'87, Bad-Soden, pp. 43-46, 1987.


[Figure: power consumption P [mW] versus maximum sample frequency fmax up to 100 MHz for implementations 3.1, 3.2, and 3.3.]

Fig. 8. Power consumption versus the maximum sample frequency fmax when the power supply voltage varies in the range 1.2 - 5.0 V.

[2] Pandel J. and Kleine U.: Design of Bireciprocal Wave Digital Filters for High Sampling Rate Applications, Frequenz, Vol. 40, No. 11/12, 1986.
[3] Wanhammar L.: DSP Integrated Circuits, Linköping University, 1996.
[4] Vesterbacka M., Palmkvist K., Sandberg P., and Wanhammar L.: Implementation of Fast Bit-Serial Lattice Wave Digital Filters, IEEE Intern. Symp. on Circuits and Systems, ISCAS '94, Vol. 2, pp. 113-116, London, May 30 - June 1, 1994.
[5] Renfors M. and Neuvo Y.: The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints, IEEE Trans. on Circuits and Systems, CAS-28, No. 3, March, pp. 196-202, 1981.
[6] Chandrakasan A.P. and Brodersen R.W.: Low Power Digital CMOS Design, Kluwer Academic Publ., 1995.

PAPER 9 SIGN-EXTENSION AND QUANTIZATION IN BIT-SERIAL DIGITAL FILTERS Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar Proceedings of 3rd IEEE Conf. on Electronics, Circuits, and Systems (ICECS'96), Vol. 1, pp. 394-397, Rodos, Greece, Oct. 13-16, 1996. Presented at the 3rd IEEE Conference on Electronics, Circuits, and Systems (ICECS'96), Rodos, Greece, Oct. 13-16, 1996.


SIGN-EXTENSION AND QUANTIZATION IN BIT-SERIAL DIGITAL FILTERS Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden Email: [email protected], [email protected], [email protected]

ABSTRACT

A method for handling overflow and quantization in recursive digital filters is described. The method merges the sign-extension required for a serial/parallel multiplier with the required truncation in the bit-serial loops. The method works with maximally fast implementations, i.e., implementations for which the minimum sample period is used as the sample period. The method is first described using a first-order recursive filter, and then applied to a third-order bireciprocal lattice wave digital filter.

1 INTRODUCTION

Due to finite word lengths in recursive digital filters, the processing elements must handle overflow and quantization of data (rounding or truncation). To describe how to handle these issues, the first-order recursive algorithm in Fig. 1 is taken as an example. A more complex filter will be presented further on. In Fig. 1, the coefficient a is assumed to be fixed and has arbitrarily been chosen as 0.40625, i.e., the number of fractional bits in the coefficient, Wf, is 5. In the example, the required data word length Wd is assumed to be 15 bits.

[Figure: first-order recursive filter with coefficient a and delay element T.]

Fig. 1. First-order recursive filter.


2 OVERFLOW AND QUANTIZATION

In bit-serial processing with the least-significant bit first, the correction of overflow must be delayed until the result is available. Since additional delay in a loop causes the throughput to decrease, overflow correction has to be avoided in a fast implementation. A solution is to increase the word length in the loop so that overflow cannot occur. An investigation of the algorithm in Fig. 1 reveals that the output magnitude of the adder may be |(1-a)^(-1)| times as large as the input magnitude; thus an additional most-significant bit is required in the number representation of this node to prevent overflow. Quantization of the data must also be performed in the loop, for example by truncation of the least-significant bits. If the word length is Wd bits, the input to a serial/parallel multiplier in a loop generally needs to be sign-extended to Wd+Wf bits. It is advantageous to move the sign-extension across the addition in the algorithm, as shown in Fig. 2.

[Figure: two equivalent signal-flow graphs of the first-order filter; the sign-extension (s.e.) is moved across the addition, with quantization Q in the loop.]

Fig. 2. Moving the sign-extension (s.e.).

First, the word length at the output of the addition may be extended so that no overflow can occur. Second, the sign-extension may be used to truncate the output from the multiplier without additional delay or hardware in the loop, as illustrated in Fig. 3.

[Figure: bit streams of the sign-extended input, the multiplier output, and the output after sign-extension; the least-significant bit of the product is replaced by a sign-extended bit from the previous multiplication.]

Fig. 3. Truncation of one bit by sign-extension of the multiplier output.


In this example, a 16-bit product is truncated to 15 bits by replacement of the least significant bit with a sign-extended bit from the previous multiplication. Since the output of the filter is sign-extended, cascades of such filter sections do not require any additional sign-extension circuitry at the inputs, if the word lengths are equal in the sections.
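The loop behavior with this truncation is easy to model bit-true. The Python sketch below (our illustration, with the fractional data scaled to integers) iterates y(n) = x(n) + a*y(n-1) for a = 0.40625 and truncates the product by Wf = 5 bits, exactly as the sign-extension circuit does:

    Wd, Wf = 15, 5
    A = 13                      # a = 0.40625 = 13 / 2**Wf

    def filter_step(x, y_prev):
        prod = A * y_prev       # (Wd + Wf)-bit product
        prod >>= Wf             # truncate Wf LSBs (floors, matching
                                # two's-complement truncation)
        return x + prod         # extended word length prevents overflow

    y = 0
    for n, x in enumerate([1 << 13, 0, 0, 0]):   # impulse of amplitude 0.5
        y = filter_step(x, y)
        print(n, y / 2**(Wd - 1))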

3 MAXIMALLY FAST IMPLEMENTATION

The addition is performed by a bit-serial adder and the multiplication by a serial/parallel multiplier. The speed of an implementation of a recursive algorithm is constrained by the time for a bit-serial word to propagate through the critical loop. To achieve a maximally fast implementation, the sample period is made equal to the minimum sample period, which is given by

    Tmin = max_i { Top(i) / N(i) }

where Top(i) is the total latency due to the operations and N(i) is the number of delay elements in the directed loop i [1]. For the algorithm in Fig. 1, the minimum sample period becomes Tmin = Tadd + Tmul, where Tadd is the adder latency and Tmul is the multiplier latency. For bit-serial arithmetic, the execution times of the processing elements are normally longer than the minimum sample period. In some algorithms, the critical loops may contain more than one delay element. For these cases, it is not sufficient to schedule the arithmetic operations over a period equal to the minimum sample period in order to achieve a sample period equal to the minimum sample period. Instead a cyclic scheduling over several sample periods has to be performed, accounting for the fact that the schedule is inherently periodic [2, 3]. The periodicity of the schedule is visualized in Fig. 4, where the operations belonging to several sample intervals of an algorithm have been drawn on a cylinder, with the operations belonging to one sample interval collected into a set Ni. Expressing the operation latencies in clock cycles Tclk, Tadd = 0Tclk and Tmul = WfTclk = 5Tclk for the filter in Fig. 1. This results in a minimal sample period of Tmin = 5Tclk. Since the word in the loop has to be sign-extended to Wd+Wf = 20 bits, it is obvious that a single adder and multiplier cannot handle a new calculation every fifth clock cycle. To arrive at a maximally fast schedule, operations belonging to four sample intervals have to be included in the schedule, as shown in Fig. 5.


[Figure: operations from m sample intervals drawn on a cylinder, with each interval's operations collected into a set Ni; in the four-interval example the inputs are x(4n) ... x(4n+3) and the outputs y(4n) ... y(4n+3).]

Fig. 4. Illustration of the cyclic operation schedule.

[Figure: multiplications and additions from four sample intervals scheduled over 0 to 4Tmin.]

Fig. 5. Maximally fast schedule.

In the schedule, the shaded areas indicate execution time for the operations, with darker shaded areas indicating operation latency. The result of an isomorphic mapping from the schedule to a hardware structure is shown in Fig. 6. This hardware structure can lead to problems for synthesis tools, since the tools may erroneously interpret the loop as non-sequentially computable due to the absence of D flip-flops in the loop. A workaround is to break the loop at some point during synthesis and optimization, and to reconstruct the loop afterwards.


[Figure: hardware structure with four multiplier/adder stages processing x(4n) ... x(4n+3) into y(4n) ... y(4n+3).]

Fig. 6. Hardware structure of the filter.

4 OPERATION OF THE FIRST-ORDER RECURSIVE FILTER

The operation of the filter is illustrated in Fig. 7. At the output of each multiplier, a sign-extension circuit is included that sign-extends and truncates the product by Wf = 5 bits. Multipliers whose output is sign-extended at a certain time have been shaded and labeled 'se'.

[Figure: four steps of the filter operation; in each step the sign-extension ('se') is active in one of the four stages in turn.]

Fig. 7. Operation of the filter.

In step 1, referring to the left-most stage with input x(4n), the sign-extension circuit becomes active when the least significant bit of x(4n) arrives. Thus the product is truncated by replacing the multiplier output bit with the sign-bit of the previous


output. A total of five bits are input sequentially and truncated in the same fashion. In step 2, the sign-extension is released in the left-most stage. This ends the truncation, and the least-significant bit of the truncated product becomes available at the output. During the output of five bits, the second stage concurrently receives a new input sample x(4n+1), which is added to the product. Now the sign-extension circuit is active in this stage, which truncates its output. The processing continues with the third stage truncating five bits of the third product in step 3, and five more bits are truncated in step 4. After completion of the four steps, all bits of sample x(4n) have been processed, and the schedule is cyclically repeated with new input samples.

5 HARDWARE IMPLEMENTATIONS

5.1 First-Order Recursive Filter

To demonstrate the method, the example filter in Fig. 1 has been synthesized from a VHDL description using AMS 0.8 µm standard cells. Results obtained from the synthesis tools and SPICE simulations of the layout have been compiled into Table 1. The control unit required for timing of the sign-extension circuits is excluded. A clock buffer is included in the power estimation.

Chip area               0.23 mm2
Transistors             1400
Max clock frequency     70 MHz
Max sample frequency    14 MHz
Power dissipation       10 mW
Supply voltage          5 V

Table 1. Implementation 1 characteristics.

5.2 Bireciprocal Lattice Wave Digital Filter

The second example is a well-known third-order bireciprocal lattice wave digital filter [2, 4, 5]. The filter is a half-band lowpass filter, which can also be used as a highpass filter as well as for decimation or interpolation of the sample frequency by a factor of two. In both cases, the filter can operate at the lower of the two sample rates. Figure 8 shows the filter structure, which has a single coefficient a = 0.375.


A full-custom layout has been made for this filter [2]. The maximum sample frequency for the maximally fast implementation becomes fclk/3, where fclk is the clock frequency. A clock frequency of 400 MHz was used as design goal for the logic circuitry in this implementation, using the TSPC logic style. This would result in a sampling frequency of more than 130 MHz. The implementation has been manufactured and tested to work at a sampling frequency of 105 MHz (limited by the measurement equipment). The measurement results have been compiled into Table 2.

Chip area               1.0 mm2
Transistors             2800
Max clock frequency     310 MHz
Max sample frequency    100 MHz
Power dissipation       260 mW
Supply voltage          5 V

Table 2. Implementation 2 characteristics.

[Figure: third-order lattice wave digital filter with single coefficient a, an adder/subtracter pair, and three delay elements T.]

Fig. 8. Lattice wave digital filter.

The measured energy dissipation of 2.5 nJ/sample agrees well with the simulation results obtained using SPICE. A photograph of the chip is shown in Fig. 9. The filter is shown as the block to the right, and a test block is shown in the middle.

6 CONCLUSION

A method of merging the sign-extension and truncation in bit-serial loops consisting of serial/parallel multipliers and adders has been described. The method is suited for use in maximally fast implementations. A standard-cell layout has been made for a first-order recursive filter, and a full-custom layout has been made for a bireciprocal lattice wave digital filter.


Fig. 9. Photograph of the lattice wave digital filter.

REFERENCES

[1] Renfors M. and Neuvo Y.: The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints, IEEE Trans. on Circuits and Systems, CAS-28, No. 3, March, pp. 196-202, 1981.
[2] Vesterbacka M., Palmkvist K., Sandberg P., and Wanhammar L.: Implementation of Fast Bit-Serial Lattice Wave Digital Filters, Proc. of IEEE Intern. Symp. on Circuits and Systems (ISCAS '94), Vol. 2, pp. 113-116, London, May 30 - June 1, 1994.
[3] Wanhammar L.: DSP Integrated Circuits, Linköping University, 1996.
[4] Kleine U. and Böhner M.: A High-Speed Wave Digital Filter Using Carry-Save Arithmetic, Proceedings of ESSCIRC'87, Bad-Soden, pp. 43-46, 1987.
[5] Pandel J. and Kleine U.: Design of Bireciprocal Wave Digital Filters for High Sampling Rate Applications, Frequenz, Vol. 40, No. 11/12, 1986.


PAPER 10 REALIZATION OF SERIAL/PARALLEL MULTIPLIERS WITH FIXED COEFFICIENTS Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar Proceedings of the National Conference on Radio Science (RVK93), Lund Institute of Technology, Lund, Sweden, April 5-7, pp. 209-212, 1993. Presented at the National Conference on Radio Science (RVK-93), Lund Institute of Technology, Lund, Sweden, April 5-7, 1993.


REALIZATION OF SERIAL/PARALLEL MULTIPLIERS WITH FIXED COEFFICIENTS Mark Vesterbacka, Kent Palmkvist, and Lars Wanhammar Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden Email: [email protected], [email protected], [email protected]

ABSTRACT

In this paper we discuss trade-offs in the realization of serial/parallel multipliers with fixed coefficients. The coefficients are represented as two's-complement binary numbers or in canonic signed-digit code.

1. INTRODUCTION

The arithmetic operations in DSP algorithms can be realized using many different techniques. For example, addition, subtraction, and multiplication can be performed by processing the data bit-serially or bit-parallel. The arithmetic operations can also be performed using different binary representations of negative numbers. These representations differ with respect to their implementation cost (space/time/power). In this paper we compare bit-serial to bit-parallel arithmetic. We also demonstrate how serial/parallel multipliers with fixed coefficients can be realized efficiently. Finally, a method for incorporating truncation into a serial/parallel multiplier is proposed.

2. BIT-SERIAL VERSUS BIT-PARALLEL ARITHMETIC

For bit-serial arithmetic, the computational throughput per unit chip area is higher than for bit-parallel arithmetic, which makes it interesting when area and power criteria dominate. However, speed requirements can also demand the use of bit-serial arithmetic, especially when the implemented algorithm has an inherently high degree of parallelism that can be exploited by duplicating the bit-serial processing elements while maintaining a low cost in area and power consumption. The following characteristics are important when comparing bit-serial to bit-parallel arithmetic.

POWER CONSUMPTION: A major advantage of bit-serial arithmetic is that lower power consumption is obtained by avoiding the redundant switching operations in


the carry propagation paths. Also, shorter data paths result in reduced switching loads, thereby reducing the power consumption.

SPEED: Theoretically, the speed will be of the same order for both types of arithmetic, since the arithmetic operations essentially use the same hardware in the critical paths. In reality, however, the potential performance of bit-serial processing elements may be somewhat degraded due to practical problems with high-frequency clocking.

AREA: Another advantage is that bit-serial arithmetic significantly reduces the chip area. This is a consequence of smaller processing elements together with the elimination of wide buses between the processing elements.

DESIGN COMPLEXITY: The design time for a bit-serial system increases due to the higher complexity of timing the bit-serial streams. On the other hand, the design time for the integrated circuit tends to decrease due to simplified routing and the lower complexity of the processing elements. The availability of appropriate design tools is a disadvantage for bit-serial arithmetic compared to the numerous tools adapted to bit-parallel arithmetic.

3. SERIAL/PARALLEL MULTIPLIERS

In the serial/parallel (S/P) multipliers which we will consider, data is represented by two's-complement (2C) binary numbers and coefficients by two's-complement or canonic signed-digit code (CSDC).

3.1 The Basic Serial/Parallel Multiplier

A basic S/P multiplier is shown in Fig. 1. The bit-serial word X is input with the least-significant bit first. To account for negative numbers, a hold (H) circuit repeats the sign-bit. X is multiplied with the bit-parallel coefficient Y by producing partial bit-products in a set of AND gates, which are successively added to previously generated partial bit-products in a chain of carry-save adders. The current sum of partial products is shifted right after each clock cycle, i.e., divided by 2. If the word length of X is denoted Wd and the word length of Y is denoted Wc, the product Z will have a word length of Wd + Wc - 1 bits. A multiplication takes as many clock cycles as there are bits in the product.


[Figure: basic serial/parallel multiplier; a hold circuit H sign-extends X(n), AND gates form the partial bit-products with the coefficient bits Y0 ... YWc-1, and a chain of full adders (FA) with D flip-flops accumulates the product Z(n), one bit-slice #0 ... #(Wc-1) per coefficient bit.]

Figure 1. Basic serial/parallel multiplier.
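The arithmetic the structure computes can be modeled compactly. The Python sketch below is a behavioral model of the result (our illustration, not of the carry-save structure); it accumulates the partial products of an LSB-first two's-complement word, giving the sign bit its negative weight:

    def sp_multiply(x_bits, y):
        # x_bits: LSB-first bits of X; the last bit is the sign bit.
        # y: integer coefficient. Returns the product X * Y.
        z = 0
        for i, b in enumerate(x_bits):
            w = 1 << i
            if i == len(x_bits) - 1:
                w = -w                  # sign bit has negative weight
            z += b * y * w
        return z

    assert sp_multiply([1, 0, 1, 1], 5) == -15   # X = -3 in 4 bits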

3.2 Fixed Coefficients

If the coefficient Y is fixed, the multiplier can be simplified in order to reduce the amount of hardware. Referring to Fig. 2, where a simplified multiplier with the fixed coefficient Y = (0.1011)2C is shown, the following method can be applied: For a positive coefficient, remove bit-slices in increasing order until the first bit-slice with a coefficient bit of 1 is found; replace this bit-slice with a delay element (see bit-slice #1). Replace all bit-slices having a coefficient bit of 0 with delay elements (see bit-slice #2). Finally, replace all remaining AND gates with a connection (see bit-slices #3 and #4).

[Figure: simplified multiplier for Y = (0.1011)2C; bit-slice #0 is removed, bit-slices #1 and #2 are reduced to delay elements, and the AND gates of bit-slices #3 and #4 are replaced by connections.]

Figure 2. Simplified multiplier with a fixed coefficient Y = (0.1011)2C.

3.3 Fixed Negative Coefficients

If the fixed coefficient is negative, the first carry-save adder can be removed by using the identity Z = XY = (-X)(-Y) = X'(-Y) + 2^(1-Wd)(-Y). This identity corresponds to adding an inverter at the input of the multiplier and initializing the delay elements in the carry paths to 1. This technique can also be used in the case of a positive Y


consisting of mainly 1's. Negating the coefficient gives a coefficient consisting of mainly 0's, whose corresponding carry-save adders can be simplified into delay elements, as discussed above.
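The identity is easy to verify numerically. In the Python sketch below (our illustration), the data are scaled to integers so that the LSB weight 2^(1-Wd) becomes 1; since -X = X' + 1 in two's complement, X*Y = X'(-Y) + (-Y):

    Wd = 8
    Y = -37                                  # an arbitrary fixed coefficient
    for X in range(-(1 << (Wd - 1)), 1 << (Wd - 1)):
        assert X * Y == (~X) * (-Y) + (-Y)   # ~X is X' read as a signed value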

3.4 Canonic Signed Digit Code

A coefficient Y represented in canonic signed-digit code can be written as the difference between two positive binary numbers, (Y)CSDC = (Y+)2 - (Y-)2, where (Y+)2 contains 1's in the bit positions where (Y)CSDC has the value 1, and (Y-)2 contains 1's in the bit positions where (Y)CSDC has the value -1. The product can then be written as Z = X(Y+ - Y-) = XY+ + X'Y- + 2^(1-Wd)Y-. This product can be computed with the basic S/P multiplier if the input is inverted in the bit positions corresponding to the 1's in Y-, and the delay elements holding the corresponding carry values are initially set instead of reset. A multiplier with the fixed coefficient Y = (0.100-1)CSDC is shown in Fig. 3.

[Figure: simplified multiplier for Y = (0.100-1)CSDC; the input is inverted in the bit-slice of the -1 digit, and the corresponding carry flip-flop is initially set instead of reset.]

Figure 3. Simplified multiplier with a fixed coefficient Y = (0.100-1)CSDC.
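The CSDC product expression can be checked the same way as the negation identity, again with integer-scaled data so that 2^(1-Wd) becomes 1 (a sketch, our illustration):

    Wd = 8
    Yp, Ym = 8, 1      # Y = (0.100-1)CSDC scaled by 2**4: Y+ = 8, Y- = 1
    for X in range(-(1 << (Wd - 1)), 1 << (Wd - 1)):
        assert X * (Yp - Ym) == X * Yp + (~X) * Ym + Ym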

4. TRUNCATION

Often, the product Z is truncated, whereby Wc - 1 bits are discarded. If throughput is of importance, the multiplication and truncation can be done simultaneously in the multiplier by omitting the computation of some, or all, of the truncated bits. Successive multiplications can then be performed with a throughput proportional to 1/Wd instead of 1/(Wd + Wc - 1). It is possible to initialize the multiplier to a state equivalent to the state immediately after a truncation without actually performing these computations. Calculating the initialization values would effectively duplicate the hardware if it had to be done for all of the delay elements. Fortunately, this is not the case. During the last Wc - 1 clock cycles of a multiplication, the bit-slices successively perform redundant operations in increasing order, referring to the ordering in Fig. 1. Hence,


these bit-slices can be used for the following multiplication, if the hold circuit is duplicated into the X input path of every bit-slice. The hold circuits are then successively released as the corresponding bit-slices become available for the following multiplication. An advantage is that only the carry state of a bit-slice has to be calculated, since the preceding bit-slice has already calculated the sum. Efficient solutions are obtained when this method is combined with the simplification methods for multipliers with fixed coefficients mentioned earlier. In Fig. 4 the principle of truncating Wc - 1 bits is applied to the same example as in Fig. 2.

[Figure: multiplier with inherent truncation; hold circuits H are duplicated into the X path of the bit-slices, released by the controls se1, se3, and se4, and the carry states are initialized as C3 = f(Xd-3, Xd-2, Xd-1) and C4 = f(Xd-4, Xd-3, Xd-2, Xd-1).]

Figure 4. Multiplier with inherent truncation.

Now, realization of the two initializing carry functions in Fig. 4 is costly compared to the original multiplier, but a partial truncation can be made almost for free. In Fig. 5, two bits out of four are truncated at the cost of one extra hold circuit and an AND gate. Here, bit-slices #3 and #4 share a hold circuit and have their initial internal states calculated from the delay elements in bit-slices #1 and #2.

[Figure: multiplier truncating two of the four bits; bit-slices #3 and #4 share a hold circuit and obtain their initial states from the delay elements in bit-slices #1 and #2.]

Figure 5. Optimized multiplier with respect to hardware resources and speed.



PAPER 11 DESIGN AND IMPLEMENTATION OF A HIGH-SPEED INTERPOLATION AND DECIMATION FILTER Håkan Johansson, Kent Palmkvist, and Lars Wanhammar Proceedings of European Microelectronics Application Conference (EMAC-97), pp. 97-100, Barcelona, Spain, May 28-30, 1997. Presented at the European Microelectronics Application Conference (EMAC-97), Barcelona, Spain, May 28-30, 1997.


DESIGN AND IMPLEMENTATION OF A HIGH-SPEED INTERPOLATION AND DECIMATION FILTER Håkan Johansson, Kent Palmkvist, and Lars Wanhammar Dept. Electrical Engineering, Linköping University, 581 83 Linköping, Sweden email: [email protected], [email protected], [email protected]

ABSTRACT

A wave digital filter for interpolation and decimation by factors of two has been implemented. A novel realization technique has been used in order to obtain a filter structure having a high maximal sample frequency, low power consumption, and small chip area. The filter has been implemented using bit-serial arithmetic in a standard-cell layout. It can be used for sample rate conversions between 25 and 50 MHz.

1. INTRODUCTION

Oversampling techniques are often used to relax the requirements on the analog parts in mixed analog/digital systems. This introduces the need for interpolation and decimation filters in the digital parts. In wideband communication systems where the bit rates are high, these filters must be designed to work at high sample frequencies. Often, the systems are battery-powered, which means that the design must also have low power consumption. The overall performance of the final implementation of a filter is largely dependent upon the filter structure, the type of arithmetic, and the logic style. Generally, recursive filters are superior to their non-recursive counterparts when the number of arithmetic operations is considered. However, recursive filters have the drawback that their recursive parts restrict the maximal sample frequency. This is an important factor to consider for implementations aiming at high speed as well as low power consumption. The reason for the latter is that the excess speed can be converted into low power consumption via voltage scaling techniques [1]. The minimal sample period Tmin for a recursive filter is determined by the latency and the number of delay elements in its critical loop. It is defined as

    Tmin = max_i { Top,i / Ni }    (1)


where Top,i is the latency of the operations and Ni is the number of delay elements in the directed loop i [2]. The operation latency is in turn determined by the type of arithmetic and logic style. For both bit-parallel and bit-serial arithmetic the latency is largely dependent upon the coefficient word length of the filter. For bit-serial arithmetic the latency is proportional to the number of fractional bits in the coefficients [3]. For implementations aiming at high speed and low power consumption it is thus essential to use recursive filters with very short coefficients, especially when using bit-serial arithmetic. This paper concerns the design and implementation of a wave digital filter that can be used for both interpolation and decimation by factors of two. Wave digital filters are well known for their low coefficient sensitivity and robustness. A novel realization technique has been used which results in wave digital filter structures that have high maximal sample frequencies and few arithmetic operations. A combined interpolation/decimation filter structure has been implemented using bit-serial arithmetic with a standard-cell layout. The implemented filter is to be used in a future experimental wideband (2 Mbits/s) communication system based on OFDM [4] in which sample rate conversions are required between 25 and 50 MHz. The external data word length is 10 bits while the internal data word length is 15 bits.

2. FILTER STRUCTURE

Recursive half-band filters are in general the most efficient filters for interpolation and decimation with factors of two [5]. The transfer function of a recursive half-band filter can be written in polyphase form

    H(z) = H0(z^2) + z^(-1) H1(z^2)    (2)

where H0(z) and H1(z) are allpass filters. It is possible to obtain a filter having approximately linear phase by letting one of the branches be a pure delay. The actual filtering takes place at the lower of the two sample frequencies involved by using the corresponding polyphase interpolator and decimator, as shown in Figs. 1 and 2. The allpass filters can be realized as a cascade of first- and second-order sections, resulting in modular and parallel filter algorithms. Further, by using wave digital filters for the allpass sections it is possible to ensure stability under finite-arithmetic conditions [6]. However, filters composed of allpass subfilters in parallel have a high coefficient sensitivity in the stopband. Roughly, the coefficient word length is proportional to the stopband attenuation [5]. For stringent requirements on the stopband attenuation this leads to long coefficients and therefore a reduction of the maximal sample frequency, as discussed earlier.


[Figure: polyphase interpolator; x(n) at fsample feeds H0(z) and H1(z), whose outputs are interleaved into y(m) at 2fsample.]

Fig. 1. Polyphase interpolator.

[Figure: polyphase decimator; x(m) at 2fsample is distributed over H0(z) and H1(z), whose outputs are combined into y(n) at fsample.]

Fig. 2. Polyphase decimator.
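The polyphase operation of Eq. (2) is illustrated by the Python sketch below. It is our illustration, with first-order allpass branches and hypothetical coefficients a0 and a1 standing in for the allpass filters designed in the paper; both branches run at the input rate, and the interpolated output simply interleaves them:

    import numpy as np

    def allpass1(x, a):
        # First-order allpass: y(n) = a*x(n) + x(n-1) - a*y(n-1)
        y = np.zeros(len(x))
        xp = yp = 0.0
        for n, xn in enumerate(x):
            y[n] = a * xn + xp - a * yp
            xp, yp = xn, y[n]
        return y

    def interpolate2(x, a0=0.4, a1=0.7):      # hypothetical coefficients
        b0, b1 = allpass1(x, a0), allpass1(x, a1)
        y = np.empty(2 * len(x))
        y[0::2] = b0                          # H0(z^2) branch
        y[1::2] = b1                          # z^-1 H1(z^2) branch
        return y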

2.1 Improved Sensitivity and Speed

The required coefficient word length can, however, be substantially reduced by using two or several low-order filters in cascade. In the context of interpolation and decimation this approach is not directly applicable, due to the different sample rates involved. Using a straightforward cascade realization it is only possible to exploit the efficient polyphase structure for the first stage. For the following stages the filtering takes place at the higher of the two sample rates involved, which increases the workload. It is possible, though, to realize the cascaded filters in such a way that the filtering is performed at the lower sample frequency, see for example [7]. However, that realization contains six subfilters in total, which means that the number of filter operations per second is still the same as for the straightforward realization. We have used a novel technique that reduces the number of required subfilters from six to five. The overall transfer function of two cascaded half-band filters is

    H(z) = (H0(z^2) + z^(-1) H1(z^2))^2
         = H0^2(z^2) + z^(-2) H1^2(z^2) + 2 z^(-1) H0(z^2) H1(z^2)    (3)

By exploiting the fact that this is in polyphase form and that the two branches of the overall filter contain common factors we can use the structure shown in Fig. 3. By adding an extra multiplier, adder, and delay element it is thus possible to reduce the number of realized subfilters compared with existing realizations [7]. Note that the extra multiplication can be implemented as a simple shift operation. A generalization of this technique to N cascaded half-band filters can be found in [8].


2.2 Implemented Structure

For the implemented filter, two 7th-order linear-phase lattice wave digital filters in cascade have been used. In this case, H1(z) corresponds to one delay, whereas H0(z) corresponds to a second-order allpass section. The latter has been realized using symmetric two-port adaptors. Interpolation and decimation are dual operations, which means that the polyphase decimator can be obtained from the polyphase interpolator, and vice versa, by using the transposition theorem. By exploiting this fact and using the structure in Fig. 3, we get the final filter structure, shown in Fig. 4, that has been implemented. The position of the switches determines whether the filter is used for interpolation or decimation. The filter coefficients are as simple as a1 = 0.125 and a2 = -0.5. Input and output signals are 10 bits wide in a bit-parallel format. In interpolation mode, the input x(n) is applied at the upper input and the lower input is not used. In decimation mode only the upper output is used. Figures 5 and 6 show the magnitude function and the group delay, respectively.

[Figure: interpolation structure derived from two cascaded half-band filters; the branches share the subfilters H0(z) and H1(z), so only five subfilters run at fsample before the output at 2fsample.]

Fig. 3. Structure for interpolation, derived from two cascaded half-band filters.

[Figure: combined interpolator/decimator structure; switches marked I and D select the mode, the coefficients are a1 and a2, and a 1/2 scaling multiplier is included.]

Fig. 4. Interpolator/decimator structure.


Fig. 5. Magnitude function for the interpolator/decimator.

Fig. 6. Group delay for the interpolator/decimator.

3. ALGORITHM MODIFICATIONS

Each filter H0(z) is a second-order allpass section as shown in Fig. 7. This filter section is a recursive structure and therefore has a limited sample rate dependent on


operation latencies [2]. Each adder in the final design will be followed by a flip-flop in order to shorten the combinatorial paths. This corresponds to the timing properties of model 1 logic [9]. The multiplications are, however, simple enough to allow the use of model 0 logic, which lacks the ending flip-flop. The complete design is therefore a mixture of model 0 and model 1 logic. The latencies of the operations are 1 clock cycle for the additions, and 1 and 3 clock cycles for the multiplications with the coefficients -0.5 and 0.125, respectively. The minimal sample period is then Tmin = 4Tadd + 2Tmult = 8 clock cycles.

[Figure: signal-flow graph of the second-order allpass section with states v1(n), v2(n) and coefficients a1, a2.]

Fig. 7. Signal-flow graph of second-order section.

A direct implementation of this structure would require a clock frequency of 200 MHz, which is too fast for a standard-cell implementation, where the propagation delay from a flip-flop through a full adder at 3.0 V is larger than the available 5 ns clock period. The signal-flow graph is therefore modified using the distributive, associative, and commutative properties of addition and multiplication [9]. The final signal-flow graph is shown in Fig. 8 and has a minimal sample period of 5 clock cycles. This corresponds to a clock frequency of 125 MHz, which is within reach using standard cells.
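The clock-rate figures follow directly from the loop bounds and the 25 MHz sample rate; the small Python sketch below (our illustration) reproduces them:

    f_sample = 25e6
    for name, cycles in [("direct (Fig. 7)", 4 * 1 + 1 + 3),
                         ("modified (Fig. 8)", 5)]:
        fclk = cycles * f_sample
        print(f"{name}: Tmin = {cycles} cycles -> fclk = {fclk / 1e6:.0f} MHz")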

-

v3(n) v4(n)

a1

x(n)

Fig. 8. Modified second-order section.

a2

y(n)

a2

v2(n+1) v2(n) v (n+1) 1 v3(n+1) v4(n+1)


4. OPERATION SCHEDULE

Cyclic scheduling [3] is used to reach the minimal sample period, which can be illustrated by drawing the precedence graph of the algorithm on the surface of a cylinder, as shown in Fig. 9. The scheduling formulation utilizes both inter- and intra-sample parallelism in the algorithm. This allows operations from different sample periods to be executed concurrently. An operation is characterized by both latency and execution time. A special property of the bit-serial operations is the large difference between the latency and the throughput. The latency of the bit-serial adder is equal to 1 clock cycle when using model 1 logic, while the execution time is equal to the data word length. This is compatible with a schedule over multiple sample periods. In this case it is necessary to perform the scheduling over three sample periods [10], as shown in Fig. 9. The scheduling length equals the internal data word length, which is 15 bits, of which three are used for truncation in the multiplications.

[Figure: cyclic schedule drawn on a cylinder over three sample periods, with inputs x(3n) ... x(3n+2) and outputs y(3n) ... y(3n+2).]

Fig. 9. Cyclic scheduling formulation.

4.1 Mapping to Hardware

An isomorphic mapping is used to assign each operation to a dedicated bit-serial processing element. The number of shimming delays needed to synchronize the data streams within the filter is also extracted from the schedule.

5. LOGIC DESIGN

Besides the filter, the complete design contains input and output buffers, a control unit, and a synchronization block. The interface to the exterior consists of a sample clock and bit-


parallel buses. The final logic design of a second-order section is shown in Fig. 10. The additions in this figure include flip-flops on the sum output and the internal carry.

[Figure: logic design of the three scheduled instances of the second-order section, with coefficients a1 and -a2 and shimming delays (D, 2D, 3D, 4D) inserted to synchronize the bit-serial streams.]

Fig. 10. Logic design of the second-order section including shimming delays.

Parallel-to-serial converters are used on the inputs to buffer the inputs and feed multiple samples in bit-serial form into the filter circuitry. Serial-to-parallel converters are used to generate bit-parallel outputs. A large number of sample delays are needed, as H1(z) contains only delays. These have been implemented as parallel shift registers. A high-speed clock is needed for the bit-serial processing elements; it is supplied from outside the chip. As the sample period is 5 clock periods, the bit-serial clock must be at least 5 times faster than the sample clock. Synchronization between the bit-serial clock and the sample clock is performed on-chip, allowing the use of a free-running high-speed clock local to the chip. This indicates the possibility of using an on-chip clock in later revisions.

6. IMPLEMENTATION

The design was first described as schematics and then mapped to a standard-cell layout using automatic place-and-route tools from Mentor Graphics. The layout is shown in Fig. 11. The total area required is 5.9 mm2 using a 0.8 µm double-metal CMOS process, with an active area of 2.2 mm2. It contains approximately 21000 devices. Measurements show that the chip meets the specification. The power consumption of the chip, excluding pads, is 150 mW at 3.0 V. The circuit has been shown to meet the specification at a power supply voltage of only 2.5 V, with a power consumption of 105 mW.

REFERENCES [1] Chandrakasan A.P. and Brodersen R.W.: Low Power Digital CMOS Design, Kluwer Academic Publ., 1995.


Fig. 11. Chip layout.

[2] Renfors M. and Neuvo Y.: The Maximal Sampling Rate of Digital Filters Under Hardware Speed Constraints, IEEE Trans. on Circuits and Systems, CAS-28, No. 3, March, pp. 196-202, 1981.
[3] Wanhammar L.: System Design: DSP Integrated Circuits, Linköping University, 1997.
[4] Di Zenobio D. and Santella G.: OFDM Technique for Digital Television Broadcasting to Portable Receivers, The Fourth Intern. Symp. on Personal, Indoor and Mobile Radio Communications, Yokohama, Japan, Sept. 8-11, 1993, pp. 244-248.
[5] Renfors M. and Saramäki T.: Recursive Nth-Band Digital Filters - Part I: Design and Properties, IEEE Trans. on Circuits and Systems, Vol. CAS-34, No. 1, pp. 24-39, Jan. 1987.
[6] Fettweis A. and Meerkötter K.: Suppression of Parasitic Oscillations in Wave Digital Filters, IEEE Trans. on Circuits and Systems, Vol. CAS-22, No. 3, pp. 239-246, March 1975.
[7] Kale I., Morling R.C.S., Krukowski A., and Tsang C.W.: A High-Fidelity Decimator Chip for the Measurement of Sigma-Delta Modulator Performance, IEEE Trans. on Instrument. and Measurement, Vol. 44, No. 5, pp. 933-939, Oct. 1995.
[8] Johansson H. and Wanhammar L.: A Recursive Filter Structure for High-Speed Interpolation and Decimation, to be published.
[9] Vesterbacka M.: Implementation of Maximally Fast Wave Digital Filters, Thesis No. 495, LiU-Tek-Lic-1995:27, Linköping University, June 1995.
[10] Vesterbacka M., Palmkvist K., and Wanhammar L.: Maximally Fast, Bit-Serial Lattice Wave Digital Filters, IEEE Proc. DSP Workshop '96, Loen, Norway, Sept. 2-4, 1996.
