Scalable, Power and Area Efficient High Throughput Viterbi Decoder Implementations

T. Gemmeke, V. S. Gierenz, T. G. Noll
Chair of Electrical Engineering and Computer Systems, University of Technology RWTH Aachen, Germany
[email protected]

Abstract

Today's increasing demand for implementations featuring low power consumption and small area with adequate performance to comply with specifications requires a physically oriented design approach. In principle, a scalable, platform-based, physically optimized design library in combination with a flexible datapath generator establishes the basis for various physical-design-oriented implementations of high efficiency. As an example of a frequently used module for data reconstruction in digital communication systems, implementations of the Viterbi algorithm are presented applying this design approach and covering a broad range of high throughput rates from 50 Mbd up to 550 Mbd in a 0.25 µm CMOS technology. The development of a platform for scalable Viterbi decoder implementations is shown, including quantitative optimization steps on different levels of the design hierarchy. The resulting designs are compared to other leading-edge Viterbi decoders, validating this design style.

1. Introduction

The well-known Viterbi algorithm performs a maximum-likelihood estimation on encoded data sequences. The decoder calculates metrics $\Gamma_k^j$ for all traced trellis paths of the coding polynomial. The branch metrics $\lambda_k^{i,j}$ account for the corresponding algorithmic cost of a transition from state $i$ to state $j$ with respect to the received input data at iteration $k$. In each iteration of the so-called ACS operation the most likely transition is selected by choosing the minimum updated state metric
$$\Gamma_{k+1}^{j} = \min_i \left( \Gamma_k^i + \lambda_k^{i,j} \right).$$
These decisions determine the most likely sequence of transmitted data. The survivor path memory unit (SMU) traces the decisions to estimate the sequence of coded data symbols.
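As an illustration of this recursion, the following minimal behavioral sketch (Python) performs one ACS iteration; the representation of states, predecessors and branch metrics is a generic assumption for readability and does not reflect the hardware organization discussed below.

```python
# Behavioral sketch of one add-compare-select (ACS) iteration.
# Assumed data layout (not from the paper): gamma[i] holds the current path
# metric of state i, predecessors[j] lists the states with a transition to j,
# and branch_metric(i, j) returns the cost lambda_k of the transition i -> j.

def acs_iteration(gamma, predecessors, branch_metric):
    gamma_next, decisions = [], []
    for j, preds in enumerate(predecessors):
        # add: candidate metrics Gamma_k[i] + lambda_k[i, j]
        candidates = [gamma[i] + branch_metric(i, j) for i in preds]
        # compare / select: keep the minimum and remember the surviving branch
        best = min(range(len(preds)), key=lambda n: candidates[n])
        gamma_next.append(candidates[best])
        decisions.append(preds[best])   # surviving predecessor, fed to the SMU
    return gamma_next, decisions
```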

Figure 1. Schematic 'add-compare-select' operation (path metrics $\Gamma_k^{i_0}$, $\Gamma_k^{i_1}$ of states $S_{i_0}$, $S_{i_1}$, branch metrics $\lambda_k^{i_0,j}$, $\lambda_k^{i_1,j}$; a decision-controlled MUX selects the updated metric $\Gamma_{k+1}^{j}$ of state $S_j$)

The next section presents our quantitative approach to yield highly efficient and scalable implementations. As the straightforward implementation of the unit calculating the branch metrics $\lambda_k^{i,j}$ accounts for only a relatively small part of the overall cost, the paper focuses on the implementations of the ACS operation presented in section 3, with optimization from architecture level to circuit level including the physical layout. The design of the SMU is shown in section 4. The different implementations are compared to published designs in section 5.
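For completeness, a possible hard-decision branch metric computation is sketched below; the rate-1/2 generator polynomials and the encoder shift convention are illustrative assumptions, not the coding polynomial of the presented designs.

```python
# Hard-decision branch metric sketch for a rate-1/2 convolutional code:
# lambda_k is the Hamming distance between the received 2-bit symbol and the
# encoder output expected for the transition taken from 'state' with 'input_bit'.
# G1, G2 and the constraint length are example values only.

G1, G2 = 0b111, 0b101          # illustrative generators, constraint length 3

def expected_output(state, input_bit, k_constraint=3):
    reg = (input_bit << (k_constraint - 1)) | state   # assumed shift convention
    out1 = bin(reg & G1).count("1") & 1               # parity of the tapped bits
    out2 = bin(reg & G2).count("1") & 1
    return (out1 << 1) | out2

def branch_metric(received, state, input_bit):
    return bin(received ^ expected_output(state, input_bit)).count("1")
```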

2. Design Methodology

The synthesis of standard cell based designs suffers from suboptimal throughput rate, large area and considerably high power consumption. Additionally, back-annotation of physical design properties has to traverse a rather long design cycle. With a physical-design-oriented implementation, macro blocks of better area utilization are created with precisely controlled timing-critical paths even at lower power consumption. A datapath generator (DPG) [3] allows the generation of such blocks at low design effort. A parameterizable signal flow graph (SFG) coded in an HDL controls the automated macro implementation using a small library of optimized leaf cells which are modified to fit their local environment. This process exploits the regularity of the algorithm and additionally preserves locality from the algorithmic down to the physical layout level.

The application of the highly flexible DPG includes generic macros starting at parameterizable basic DSP functions such as multiplication, division or square root. The continuously growing library includes complex building blocks such as FIR filters, Reed-Solomon decoders or, as presented here, Viterbi decoders.

A primary design challenge is to close the gap between a technology's speed potential and the required throughput rate. By changing parameters of the generic macro description, throughput is scaled, e.g. by trading speed for area on architecture level through the degree of parallelism and pipelining, and on circuit level by selecting, e.g., adder implementations of suitable performance. The quantitative design optimization, e.g. minimizing area and power consumption at a given throughput rate, requires a quantitative quality criterion, defined as the efficiency $\eta = \frac{\text{Performance}}{\text{Cost}}$.

Generally, performance is measured in throughput rate R. For Viterbi decoding this is equivalent to the number of decoded data symbols per second, i.e. the inverse of the time per ACS iteration T. A primary measure of cost is silicon area A. In addition, mobile applications and low-cost packaging require lowest power consumption. Hence, power dissipation per throughput rate, i.e. the energy conversion per symbol E = P/R, represents another important cost factor. The derived efficiencies
$$\eta = \frac{1}{A \cdot T} = \frac{R}{A} \quad \text{and} \quad \eta_E = \frac{1}{A \cdot T \cdot E}$$
are defined for the purpose of quantitative optimization of the Viterbi decoder implementations from architecture to circuit level applying a physically oriented design style.
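As a small worked example of these figures of merit, the sketch below evaluates $\eta$ and $\eta_E$ for two hypothetical design points; the numbers are placeholders, not measured design data.

```python
# Efficiency figures for the quantitative comparison:
#   eta   = 1 / (A * T)       (equivalently R / A)
#   eta_E = 1 / (A * T * E)
# The candidate tuples (area / mm^2, T per ACS iteration / ns, E per symbol / nJ)
# below are made-up placeholders for illustration.

def efficiencies(area, t_acs, energy):
    eta = 1.0 / (area * t_acs)
    eta_e = eta / energy
    return eta, eta_e

candidates = {"low-power": (1.4, 10.0, 1.2), "high-speed": (0.9, 1.8, 1.0)}
best = max(candidates, key=lambda name: efficiencies(*candidates[name])[1])
print(best, efficiencies(*candidates[best]))
```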

Figure 2. Radix-2 Viterbi decoder layout (blocks: ACSU, TFU, TBU, clock & control)

3. Implementation of the Add-Compare-Select Unit

The classical high-speed implementation (e.g. [7]) performs one iteration per clock cycle on the preceding path metrics (PM) and the actual branch metrics (BM), the so-called word-parallel radix-2 operation (code rate $r = 1/2$, fig. 2). The widespread field of high-speed Viterbi decoders includes throughput rates from several $10^6$ (e.g. satellite communication systems) up to $10^9$ (e.g. the read channel of a hard-disk drive) data symbols per second.

3.1. Optimization on architecture-level

There are basically two ways to exploit the excess speed of a design. One efficient way is to trade throughput for power consumption through voltage scaling, but in our case system aspects prevented the use of scaled or clustered supply voltages. The other is to reduce area by reusing single processing elements in time for the ACS operation of different trellis paths, reducing the degree of parallelism. As there is only the small overhead of some additional multiplexers and registers, area is almost cut by the factor by which each PE is shared in time. Additionally, area can be reduced by increasing the integration density using dedicated routing and placement strategies. Global path metric routing can be cut in half by searching for Hamiltonian cycles in de Bruijn graphs. Placing the PEs in such cycles keeps the communication of every other path metric local between two direct PE neighbors. A two-column approach originally led to a large aspect ratio [11]. Folding the unit to obtain 4 columns not only improves the aspect ratio but also introduces additional local communication. Finally, this results in a smaller unit with less power consumption due to reduced routing capacitance.

Typically, the recursive non-linear data-dependent update of the path metrics $\Gamma_k^j$ limits the clock frequency. A significant increase in throughput rate above the classical approach can only be achieved by modifying the architecture. One way to break the bottleneck is to implement a so-called n-step ACS unit instead of n sequential radix-2 iterations. This results in an ideal speed-up of n, but the more complex $2^n$-way add-compare-select operation limits the ideal increase in performance. Therefore only the so-called radix-4 structure, i.e. n = 2, is of practical relevance [7].

For high-speed applications, a more significant increase in throughput rate can be achieved by pipelining the ACS recursion down to bit level: Pipelining along the LSB-first carry-propagate path would speed up the add and compare operation, but conflicts with the select operation. Therefore, an MSB-first compare-and-select technique based on redundant number representation combined with carry-save arithmetic was proposed [4]. Pipelining along the MSB-first compare-propagate path now becomes possible up to a degree where the time-critical path is cut down to the ACS recursion for two single digit weights. The removal of the coding redundancy for the digit "1" in the carry-save digit alphabet {0,1,2} was proposed in [10] for the purpose of a simplified maximum selection. The bit-level maximum selection is turned into a single bit-wise OR operation independent of the actual compare operation. However, the n-way carry-save select operation still limits the maximal throughput rate. Therefore, in [5] a novel concept for a high-performance bit-level comparison was employed: Instead of a pairwise comparison of all 4 updated path metrics and evaluating / combining the six relevant results, it is sufficient to compare every updated path metric against the maximum metric of all other three metrics. This not only reduces the number of bit-level comparisons from six to four [5] but also leads to a high-speed CS select operation. Combining the presented speedup techniques leads to a power-efficient, scalable high-speed ACS unit architecture as reported in [5] (fig. 4).
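To illustrate the reduced comparison scheme of [5] at word level, the sketch below selects among the four radix-4 candidates by comparing each one against the maximum of the other three; the bit-level pipelined carry-save implementation of the actual design is not modeled here.

```python
# Word-level illustration of the 4-way selection from [5]: four comparisons
# (each candidate against the maximum of the remaining three) replace the six
# pairwise comparisons.  Maximum selection is assumed, matching the
# redundant-arithmetic formulation; carry-save digit handling is omitted.

def select_radix4(candidates):
    assert len(candidates) == 4
    for n, c in enumerate(candidates):
        others = candidates[:n] + candidates[n + 1:]
        if c >= max(others):        # one comparison per candidate, four in total
            return n, c             # index of the surviving branch and its metric
```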

3.2. Optimization on circuit-level

Figure 3. Design space of selected static full-adders: (a) AT-efficiency (area over delay, both normalized to the symmetrical FA, i.e. $A / A_{\mathrm{FA,sym}}$ over delay$/\tau_{\mathrm{FA,sym}}$); (b) $\eta_E$-efficiency ($\eta_E = (A \cdot T \cdot E)^{-1} / \eta_{E,\mathrm{FA,sym}}$ over normalized delay). Compared cells: symmetrical FA, symmetrical FA + inverter, transmission-gate based and gate based FAs, and accelerated CRA variants (CRA accel. 1, CRA accel. 2).

Optimization on circuit level is focused on the add and compare operations of the PE (see fig. 1), which basically consist of full adders (FA). Therefore, several static CMOS design alternatives from low power to high speed were analyzed for the efficiency $\eta$ resp. $\eta_E$. In fig. 3a) the symmetrical 24-transistor FA (sym. FA) shows the best efficiency $\eta$, lying on the hyperbola closest to the origin in the 'area-time' space (AT space). Its advantage in energy efficiency $\eta_E$ is even more significant, as can be seen in fig. 3b). Hence, this adder was chosen for the low-power implementation of the Viterbi decoder (fig. 2 and tab. 1). When selecting other FAs on or close to the displayed hyperbola, speed is adapted in a fine-grained manner while keeping the efficiency $\eta$ almost constant, i.e. trading area for speed.

This exemplary quantitative optimization of the full adder applies analogously to the implementation of any other leaf cell. The optimization results depend on the design at layout level. As mentioned before, devices, except for those in the critical path, are minimum sized. In combination with the short local interconnections of the abutting leaf cells, this leads to high performance at minimized power consumption.
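The selection rule described above can be stated compactly as follows; the full-adder variants correspond to those of fig. 3, but the normalized area/delay values are placeholders, not the measured data of the figure.

```python
# Leaf-cell selection along the efficiency hyperbola: among full-adder variants
# of comparable A*T product, take the smallest cell that still meets the delay
# budget of the critical path.  The (relative area, relative delay) values
# below are illustrative placeholders normalized to the symmetrical FA.

fa_variants = {
    "sym_fa":      (1.00, 1.00),
    "sym_fa_inv":  (1.15, 0.85),
    "cra_accel_1": (1.60, 0.65),
    "cra_accel_2": (2.20, 0.50),
}

def pick_full_adder(delay_budget):
    feasible = {k: v for k, v in fa_variants.items() if v[1] <= delay_budget}
    if not feasible:
        raise ValueError("no full-adder variant meets the timing budget")
    return min(feasible, key=lambda k: feasible[k][0])   # smallest feasible cell
```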

4. Implementation of the Survivor-Path Memory Unit

The decisions determined by the ACS operations are traced in the so-called survivor path memory unit (SMU) for at least $L_{\mathrm{SMU}}$ iterations until all trellis paths have merged. Tracing can either be done forward, processing the incoming decisions using the so-called register-exchange algorithm (REA), or backward, tracing back the previously stored decisions (trace-back algorithm, TBA).

Figure 4. Radix-4 bit-level pipelined Viterbi decoder

The REA offers minimal latency as required e.g. in disk drives [5] (fig. 4). Unfortunately, the capacitive load of the clock tree consumes significant energy, as do the trellis-structured interconnect networks, which additionally require considerable area. The TBA, in contrast, is usually implemented using low-power static memory blocks consuming less area per stored decision. An optional trace-forward unit replaces the merge operation of the TBA by forward-processing the incoming decisions to evaluate the proper initial decode state. The hybrid trace-forward-trace-back algorithm [8] is integrated in the radix-2 Viterbi decoder (tab. 1, fig. 2).
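A minimal behavioral sketch of the trace-back idea is given below; it assumes, for simplicity, that each stored decision directly names the surviving predecessor state and that the decoded bit is the LSB of the current state, which depends on the encoder convention and differs from the hardware SMU organization.

```python
# Behavioral trace-back (TBA) sketch: walk backwards through the stored ACS
# decisions for L_SMU iterations to recover the decoded symbols.
# decisions[k][j] is assumed to hold the surviving predecessor state of state j
# at iteration k; the decoded bit per step is taken as the LSB of the state
# (assumed shift-register convention).

def trace_back(decisions, final_state):
    state, symbols = final_state, []
    for k in range(len(decisions) - 1, -1, -1):
        symbols.append(state & 1)          # assumed: input bit shifted into LSB
        state = decisions[k][state]        # follow the surviving branch backwards
    symbols.reverse()
    return symbols
```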

5. Results and Comparison

The features of our implementations are shown in tab. 1, spanning a broad range of applicable high throughput rates R. Several leading-edge implementations are compared using the AT diagram (see fig. 5). For this purpose the different implementations were matched with respect to technology and to specifications. Fig. 5 shows our different implementations on one hyperbola of identical efficiency $\eta$ but scaled throughput. The ratio of area to throughput rate is constant, signifying a proportional trade-off between area and speed. This perfect AT scaling covers the industry-relevant range of one decade in throughput rate.

Through the continuous progress in technology, area and supply voltage are scaled, resulting in reduced cost per operation. The corresponding trend of the power consumption per computation, cost$_E$, of different Viterbi decoder implementations over the years is highlighted in fig. 6. The cost of energy conversion per symbol, cost$_E$, is normalized with respect to the specification of the presented 64-state Viterbi decoder. The main focus of the radix-2 implementation was low power, showing lower cost$_E$ than the high-speed bit-level pipelined radix-4 design. Both presented implementations disrupt the classical trend as they feature significantly lower cost$_E$. Regarding the optimization problem of maximizing efficiency with respect to area, throughput and power, the presented decoders dominate the remaining Viterbi decoder design space.

Table 1. Technical data of the presented Viterbi decoder implementations

Design                        Radix-2           Radix-4 (bit-level pipelined)
Throughput rate R / Mbd       96                550
P_V / mW                      130 (@ 3.0 V)     570 (@ 2.25 V)
A / mm²                       1.42              0.92
L_min / µm                    0.35              0.25
V_DD / V                      3.3               2.5
States                        64                16
L_SMU                         128               64
w_PM / w_BM / bit             8 / 4             12 / 8


Figure 5. Normalized AT-efficiency of Viterbi decoders (symbol period [ns] over area A [mm²]; this work — radix-2, radix-4 and resource-sharing variants — on an $\eta$ = const hyperbola, compared with [1], [2], [7], [9], [10])

Figure 6. Trend of energy conversion per symbol (normalized cost$_E$ over year, 1992–2000; designs [1], [2], [7], [8], [9], [13] compared with the presented radix-2 and radix-4 implementations)

6. Conclusion

The implementation of highly efficient Viterbi decoders with scalable throughput rate was presented. The use of a datapath generator dedicated to the physically oriented design style increases flexibility and efficiency with respect to layout design and productivity. The quality of the generated Viterbi decoders was validated in a comparison to other leading-edge designs, revealing our implementations to have better AT-efficiency and lower power consumption per decode operation. Scaled to current 0.18 µm technologies, the presented architectures would provide a decode rate of 200 Mbd (41 mW, radix-2) resp. 800 Mbd (310 mW, radix-4), and in a future 0.12 µm technology 300 Mbd (18 mW) resp. 1.2 Gbd (140 mW).

Acknowledgements

The authors would like to thank Mr. F. Frieling from Infineon Technologies for cooperation and valuable discussions.

References

[1] G. Abouyannis et al., "Prest: Power reduction for system technology," in ESPLPD, pp. 22–27, 2000.
[2] S. Sridharan and L. R. Carley, "A 110 MHz 350 mW 0.6 µm CMOS 16-State Generalized-Target Viterbi Detector for Disk Drive Read Channels," IEEE Journal of Solid-State Circuits, vol. 35, pp. 362–70, March 2000.
[3] M. Gansen, F. Richter, O. Weiß, and T. G. Noll, "A Datapath Generator for Full-Custom Macros of Iterative Logic Arrays," Proc. of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP 97), July 1997.
[4] G. Fettweis and H. Meyr, "A 100 Mb/s Viterbi-decoder chip: Novel architecture and its realization," Proc. IEEE ICC, Atlanta, vol. 2, pp. 463–7, August 1990.
[5] V. Gierenz, O. Weiß, T. Noll, I. Carew, J. Ashley, and R. Karabed, "A 550 Mb/s Radix-4 Bit-Level Pipelined 16-State 0.25-µm CMOS Viterbi Decoder," ASAP, Boston, 2000.
[6] C. Henning and T. G. Noll, "Scalable Architectures and VLSI Implementations of High Performance Image Processing Algorithms," IEEE International Conference on Imaging Science, Systems and Technology, 1999.
[7] P. J. Black and T. H. Meng, "A 140-Mb/s, 32-State, Radix-4 Viterbi Decoder," IEEE Journal of Solid-State Circuits, vol. 27, pp. 1877–85, December 1992.
[8] P. J. Black and T. H. Meng, "A 1 Gb/s, 4-State, Sliding Block Viterbi Decoder," Symp. on VLSI Circuits, 1993.
[9] Mitel Semiconductor Ltd., "System specification," in ESPLPD, pp. 5–7, 2000.
[10] A. K. Yeung and J. M. Rabaey, "A 210 Mb/s Radix-4 Bit-level Pipelined Viterbi Decoder," ISSCC, pp. 88–92, 1995.
[11] J. Sparsø, H. N. Jørgensen, E. Paaske, S. Pedersen, and T. Rübner-Petersen, "An Area-Efficient Topology for VLSI Implementation of Viterbi Decoders and Other Shuffle-Exchange Type Structures," IEEE Journal of Solid-State Circuits, vol. 26, pp. 90–7, February 1991.
[12] W. Wilhelm, "A New Scalable VLSI Architecture for Reed Solomon Decoders," IEEE Journal of Solid-State Circuits, vol. 34, pp. 388–96, March 1999.
[13] I. Kang and A. W. Willson, "A 0.24 mW, 14.4 kbps, r=1/2, k=9 Viterbi Decoder," IEEE Custom Integrated Circuits Conference, pp. 603–6, 1997.
