Architectural Synthesis of a Complex Application : the Viterbi Algorithm

Christophe JEGO, Emmanuel CASSEAU, Eric MARTIN
LESTER Lab, UBS University, 2 Rue Le Coat Saint Haouen, 56100 LORIENT, France
Email : {First-name.Surname}@univ-ubs.fr
Tel : +33.(0)2.97.88.05.47 ; Fax : +33.(0)2.97.88.05.51
http://lester.univ-ubs.fr:8080/

Abstract
Behavioural synthesis is a promising design approach for substantially reducing the overall design cycle time, from system level down to physical level. Behavioural synthesis tools map algorithms to architectures and provide fast estimations of area and speed. This paper presents the synthesis of a complex application, the Viterbi algorithm, from behavioural synthesis through to logic synthesis. The differences between the high-level area and path delay estimations and the real cost after logic and physical synthesis are highlighted.

Keywords Behavioural synthesis, digital ASIC design, interconnect cost estimations, Viterbi algorithm.

Introduction
Increasing design complexity and time-to-market constraints have led to new design methodologies such as high-level synthesis. Although commercial tools are available, they are not yet widely used. Architectural synthesis, also called behavioural synthesis, makes it possible to rapidly explore many different architectures according to specified constraints. The Viterbi algorithm is an interesting example to illustrate the synthesis of a typically complex application. Unlike the usual synthesis examples (FIR filter,…), it highlights various drawbacks of architectural synthesis, particularly the problem of interconnect cost estimation. Since behavioural synthesis tools quickly provide area and performance estimations, they allow a more efficient exploration of the design space. However, the cost estimation models do not reliably take into account the interconnect cost, which becomes significant in complex applications and/or with today's technologies.

The paper is structured as follows : section 1 briefly presents the Viterbi algorithm. Section 2 presents the results of the behavioural synthesis of this algorithm. The logic synthesis of the generated RTL descriptions is presented in section 3, where the differences between the estimations and the interconnect cost after placement and routing are analysed.

1 Viterbi algorithm overview
The Viterbi algorithm [1] is known as an optimal solution to the problem of estimating the state sequence of a finite-state discrete-time Markov process, as in convolutional and trellis decoding. This algorithm is therefore of great interest for digital communication systems such as mobile communications, satellite communications... Based on the received symbols, the function of a Viterbi decoder is to estimate the most likely state sequence according to an optimisation criterion, such as the a posteriori maximum likelihood criterion, through a trellis which generally represents the process. The Viterbi algorithm can be divided into three parts that form the different units of a Viterbi decoder (figure 1) [2,3,4]. The branch metrics, which represent the probabilities of the transitions from one state of the trellis to another or, in other words, the difference between the received and the transmitted symbols, are computed in the branch metric unit (BMC). Different distances can be used for this estimation (Euclidean, Manhattan, Hamming…). For each state, the add-compare-select unit (ACS) computes the state metrics by recursively accumulating the branch metrics for all possible transitions and selecting the survivor (the most likely path). In this paper, we consider the case in which the ACS unit has to select one of two paths for each state. Finally, the survivor memory evaluation unit (SME) performs the trace-back operation using the decisions provided by the ACS unit and sends out the decoded data. The two classical algorithms for survivor path storage and decoding are the traceback algorithm (TBA) and the register exchange algorithm (REA).
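The add-compare-select recursion described above can be sketched as follows. This is a minimal illustration, not the architecture synthesised in this paper: the function name `acs_step`, the list-based data layout and the 4-state toy trellis are assumptions made for the example, and a smaller metric is assumed to mean a more likely path (distance-type metric).

```python
from typing import List, Sequence, Tuple


def acs_step(
    path_metrics: Sequence[int],
    predecessors: Sequence[Tuple[int, int]],
    branch_metrics: Sequence[Tuple[int, int]],
) -> Tuple[List[int], List[int]]:
    """One add-compare-select update over all trellis states.

    predecessors[s]   = the two states that can lead to state s
    branch_metrics[s] = the metrics of those two incoming branches
    Returns the new path metrics and, per state, a decision bit
    (0 or 1) telling which predecessor survived.
    """
    new_metrics: List[int] = []
    decisions: List[int] = []
    for s, (p0, p1) in enumerate(predecessors):
        b0, b1 = branch_metrics[s]
        m0 = path_metrics[p0] + b0  # add
        m1 = path_metrics[p1] + b1
        if m1 < m0:                 # compare (smaller distance wins)
            new_metrics.append(m1)  # select
            decisions.append(1)
        else:
            new_metrics.append(m0)
            decisions.append(0)
    return new_metrics, decisions


# Hypothetical 4-state trellis with two incoming branches per state.
TOY_PREDECESSORS = [(0, 1), (2, 3), (0, 1), (2, 3)]

if __name__ == "__main__":
    metrics, decisions = acs_step(
        [0, 3, 2, 5],
        TOY_PREDECESSORS,
        [(4, 0), (2, 0), (0, 4), (5, 1)],
    )
    print(metrics, decisions)
```

The decision bits are what the SME unit consumes; the path metrics are fed back into the next ACS iteration.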

Figure 1 : structure of a Viterbi decoder (input symbol, BMC, ACS, SME, decision)

2 Viterbi algorithm behavioural synthesis

2.1 A behavioural synthesis tool : GAUT
The HLS tool used for this work was GAUT. It is a pipeline architectural synthesis tool which is dedicated to signal and image processing applications under real time execution constraints. This tool is developed by two university laboratories : Lester (University of South Brittany, France) and Lasti (University of Rennes, France). From one behavioural specification, one technology and one real time constraint, an optimised architecture is synthesised [5,6]. The global synthesis flow of GAUT is described in figure 2.

Figure 2 : GAUT synthesis flow (blocks shown: behavioural description; library description; compiler with analysis and transformations; data flow graph with graphic interface for visualization and manipulation; design of the FUs with operators, registers and busses; design of the memory unit with memory bench and address generator; design of the communication unit with buffers, state machine and protocol controller; architecture; graphic interface for synthesis results with efficiency of operators, efficiency of registers and structural description of the architecture; output interface with structural description, functional description and stimuli (test) file; VHDL (RTL))

The specification is written in VHDL, at a behavioural level without any architectural directives. After an algorithm analysis, the tool synthesises the Data Flow Graph (an internal representation obtained during the compilation phase) according to a generic model of architecture. Then, the architecture is optimised and finally, an RTL description of a dedicated architecture is generated from the operators defined in the library.

2.2 Characteristics of the algorithm description

A description of the Viterbi algorithm has been written in behavioural VHDL. As with many algorithms, some parameters of the Viterbi algorithm influence the architecture. Our description has three generic parameters :
- node number of the trellis
- quantification bits of the received symbols
- truncation length.
The synthesis can thus rapidly be targeted to a specific domain (GSM, DAB, DVB…) according to these parameters. The Euclidean distance method has been chosen to calculate the branch metrics and the REA technique for the SME block.

The GAUT library defined by the designer contains the characteristics of components that come from logic synthesis. The larger the library, the larger the solution space explored by the architectural synthesis tool. Different kinds of operators can be defined : standard operators, multi-function operators, pipeline operators, macro-function operators [6]. We have introduced three dedicated operators (macro-functions) in the library for the Viterbi algorithm synthesis (see section 2.3) :
- BM : computes the branch metrics
- simple-ACS : updates one path of the trellis
- double-ACS : updates two paths of the trellis.
For instance, the RTL VHDL description of the "double-ACS" operator has been synthesised with the Compass tool, and the physical parameters that are relevant for architectural synthesis (area, delay time) have then been extracted. The characteristics inserted in the library are shown in figure 3.

    Component double-ACS_vhdl
      Generic (area : integer := 125 ;                 (10⁻³ mm²)
               function : function_al := double-ACS ;
               delay time : integer := 37 ;            (ns)
      Port (Mch :… , Mb1 :… , Mb2 :… , result : …) ;
    End component ;

Figure 3 : generic parameters of a double-ACS operator
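The Euclidean branch metric chosen in this section can be illustrated with a short sketch. The paper does not give the exact formula or symbol mapping, so the following are assumptions for illustration only: the squared Euclidean distance is used (it avoids a square root and preserves the ordering of candidates), and the ideal transmitted levels are taken as 0 and full scale for the given number of quantification bits. The names `ideal_symbol` and `branch_metric` are hypothetical.

```python
from typing import Sequence, Tuple


def ideal_symbol(code_bits: Sequence[int], quant_bits: int) -> Tuple[int, ...]:
    """Ideal received levels for a branch label (assumed mapping:
    bit 0 -> 0, bit 1 -> full scale of the quantiser)."""
    full_scale = (1 << quant_bits) - 1
    return tuple(full_scale if bit else 0 for bit in code_bits)


def branch_metric(
    received: Sequence[int], code_bits: Sequence[int], quant_bits: int = 8
) -> int:
    """Squared Euclidean distance between the received quantised
    symbols and the ideal symbols of one trellis branch."""
    expected = ideal_symbol(code_bits, quant_bits)
    return sum((r - e) ** 2 for r, e in zip(received, expected))


if __name__ == "__main__":
    received = (200, 30)  # two 8-bit soft values
    for label in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(label, branch_metric(received, label))
```

With these assumed levels, the branch labelled (1, 0) is the closest to the received pair (200, 30), so it contributes the smallest metric to the ACS recursion.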

2.3 Architectural synthesis of the Viterbi algorithm
A Viterbi decoder with a 64 state trellis and 8 bit received symbols has been studied. The synthesis objective was to minimise the area of the processing unit under a throughput constraint. For this project, the GAUT library was characterised for a 0.8 µ CMOS technology, and only the processing unit of the architecture was considered.

BMC+ACS block : A block which calculates both the branch metrics and the path metrics has been described (BMC + ACS). Different syntheses have been realised for throughputs from 10 Kb/s up to 5 Mb/s. For instance, the architectural characteristics (area, propagation time, operator and interconnect numbers) of three different syntheses for a 500 Kb/s throughput constraint are presented in table 1. The three syntheses use different operators :
- Synthesis 1 : standard operators
- Synthesis 2 : BM + simple-ACS operators
- Synthesis 3 : BM + double-ACS operators.

64 state trellis ; Technology : ES2 0.8 µ CMOS ; Throughput constraint : 500 Kbits/s

                                              | Synthesis 1 | Synthesis 2 | Synthesis 3
Execution time (ns) {specified constraint}    | 2000        | 2000        | 2000
Propagation time (ns) {architectural result}  | 1536        | 1340        | 1320
Area (10⁻³ mm²) without busses                | 2125        | 942         | 979
Library (number of bits)                      | 8           | 8           | 16
BM operator                                   | 0           | 1           | 1
Simple-ACS operator                           | 0           | 2           | 0
Double-ACS operator                           | 0           | 0           | 1
Surv                                          | 0           | 1           | 2
Add-Sub                                       | 1           | 0           | 0
Shift register                                | 1           | 0           | 0
Comparator / Multiplexor                      | 2/1         | 0           | 0
Register                                      | 25          | 14          | 9
Mux                                           | 14          | 2           | 1
Demux                                         | 19          | 8           | 3
Tristate                                      | 99          | 17          | 6
Bus                                           | 8           | 9           | 7

Table 1 : BMC+ACS block synthesis

Synthesis 1 with standard operators has a high area cost because it uses a lot of computing operators (standard gates) and therefore many interconnect operators (multiplexor, demultiplexor, tristate). Better results can be obtained using dedicated operators (macro-functions). As said previously, dedicated operators have been designed with the Compass CAD tool for the computation of the branch metrics (BM) and for the computation of the path metrics of one or two trellis states (simple or double-ACS). Afterwards, their characteristics were included in the GAUT library. The processing unit architecture obtained for a 500 Kb/s throughput with simple-ACS operators (synthesis 2) is presented in figure 4. This solution requires one BM operator, two simple-ACS operators and one operator (Surv) which only extracts the survivor (1 bit) from the chosen path metric (8 bits). The components numbered from 1 to 14 are the registers. The multiplexors, demultiplexors and tristates (interconnect operators), which allow operators and registers to be reused, appear as black triangles.

Figure 4 : architecture scheme of the processing unit (operators shown: BM, two ACS, Surv)

The processing unit of synthesis 3 requires one BM operator, one double-ACS operator, nine registers and interconnect operators. Synthesis 1 is thus about twice as costly in area as synthesis 2 (BM + simple-ACS) and synthesis 3 (BM + double-ACS). Figure 5 shows the area of the architecture for the three syntheses at different throughputs. Synthesis 3, using double-ACS, becomes better as the throughput increases because it uses fewer operators. Nevertheless, the area difference between syntheses 2 and 3 is attenuated : a double-ACS operator updates two path metrics of the trellis while a simple-ACS operator updates only one, so their output data are coded on 16 and 8 bits respectively. A 16 bit library was therefore used for the synthesis with double-ACS operators, which is why there is no significant difference on the graph.
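The figures of table 1 can be cross-checked with a few lines of arithmetic. The sketch below only reuses values printed in table 1; the helper names are hypothetical, and it assumes that the execution time constraint is the inverse of the throughput (one decoded bit per iteration), which is consistent with the 2000 ns constraint at 500 Kbits/s.

```python
# Values copied from table 1 (500 Kbits/s constraint, 64 state trellis).
THROUGHPUT_BPS = 500_000
AREA = {1: 2125, 2: 942, 3: 979}              # 10^-3 mm^2, without busses
PROPAGATION_NS = {1: 1536, 2: 1340, 3: 1320}  # architectural result


def period_ns(throughput_bps: float) -> float:
    """Time available per decoded bit (assumed: one bit per iteration)."""
    return 1e9 / throughput_bps


def timing_slack_ns(synthesis: int) -> float:
    """Margin between the specified constraint and the propagation time."""
    return period_ns(THROUGHPUT_BPS) - PROPAGATION_NS[synthesis]


def area_ratio_vs_standard(synthesis: int) -> float:
    """Area of synthesis 1 (standard operators) divided by the area
    of the given synthesis."""
    return AREA[1] / AREA[synthesis]


if __name__ == "__main__":
    for s in (1, 2, 3):
        print(s, timing_slack_ns(s), round(area_ratio_vs_standard(s), 2))
```

With these values, synthesis 1 comes out a little more than twice as large as syntheses 2 and 3, and all three syntheses meet the 2000 ns constraint.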

SME block : The SME block has been synthesised for 500 Kb/s and 3 Mb/s throughputs and for truncation lengths from 5 up to 50. The intermediate decisions, which are normally saved in the registers of elementary cells, are stored in RAM (Memory Unit of the architecture). In fact, the solution with intermediate decisions in registers (typical REA structure) is too complex to be synthesised. Moreover, the results would not be better than those of a direct RTL VHDL description.
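For reference, the register exchange principle behind the typical REA structure can be sketched as follows. This is a generic textbook-style illustration, not the RAM-based SME block synthesised here: the function name, the list-based registers and the 4-state toy trellis are assumptions. Each state keeps a register holding the decoded bits of its survivor path; at every step it copies the register of the surviving predecessor and appends the bit associated with the transition.

```python
from typing import List, Sequence, Tuple


def register_exchange_step(
    registers: Sequence[List[int]],
    predecessors: Sequence[Tuple[int, int]],
    decisions: Sequence[int],
    entering_bits: Sequence[int],
    truncation_length: int,
) -> List[List[int]]:
    """One register exchange update.

    decisions[s]     = ACS decision bit for state s (which predecessor survived)
    entering_bits[s] = information bit that leads into state s (assumed to
                       depend only on the destination state)
    Registers are truncated to the last `truncation_length` bits.
    """
    new_registers: List[List[int]] = []
    for s, preds in enumerate(predecessors):
        survivor = preds[decisions[s]]
        extended = list(registers[survivor]) + [entering_bits[s]]
        new_registers.append(extended[-truncation_length:])
    return new_registers


# Hypothetical 4-state trellis, same shape as a 2-bit shift register.
TOY_PREDECESSORS = [(0, 1), (2, 3), (0, 1), (2, 3)]
TOY_ENTERING_BITS = [0, 0, 1, 1]

if __name__ == "__main__":
    regs = [[0, 0], [0, 1], [1, 0], [1, 1]]
    print(register_exchange_step(
        regs, TOY_PREDECESSORS, [1, 0, 0, 1], TOY_ENTERING_BITS, 3))
```

Every state needs a register as long as the truncation length, and every register is rewritten at each step, which gives an idea of why this structure grows quickly with 64 states and truncation lengths up to 50.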

Figure 5 : architecture area for different syntheses (area in 10⁻³ mm², from 0 to 30000, versus throughput in Kbits/s on a logarithmic axis from 1 to 100000 ; 64 state trellis, ES2 0.8 µ CMOS technology ; curves : synthesis with standard arithmetic operators, synthesis with simple-ACS, synthesis with double-ACS ; the GAUT synthesis limits and typical digital communication applications (GSM, DAB, digital TV, HDTV) are marked on the graph)

3 Interconnect cost in a complex architecture
Even more than for logic synthesis, routing may be critical for architectural synthesis. The area and timing costs of interconnects are known to be difficult to estimate, especially when an architecture becomes complex [7,8]. This prediction problem is highlighted by the synthesis of the Viterbi algorithm. The RTL VHDL descriptions of the BMC+ACS blocks generated by the GAUT architectural synthesis tool have been synthesised with the Compass tool for a 0.8 µ double metal CMOS technology. The results of syntheses 2 and 3 for different throughputs are presented in table 2, which gives the area in mm2 of the processing unit and the percentage of increase between the high-level area estimation and the area after placement and routing. The results of synthesis 1 with standard operators are not presented because the architecture becomes too complex from 500 Kb/s. The more complex the architecture, the higher the area difference. For instance, for synthesis 2, the difference between estimated and obtained area already reaches 95% at 3 Mb/s and exceeds 100% from 4 Mb/s. The first priority of high-level synthesis tools is an optimal use of the computing operators and of the registers, and reusing operators and registers requires interconnect operators. For this reason, the architecture contains many interconnect operators when the throughput increases. The interconnect cost thus depends on the number of operators and on how often they are reused.

64 State Trellis, 0.8 µ Technology

Throughput | Area (mm2)  | Synthesis 2 (BM and simple-ACS operators) | Synthesis 3 (BM and double-ACS operators)
500 Kb/s   | Gaut PU     | 0.942                                      | 0.979
           | Compass PU  | 1.13                                       | 1.02
           | Increase    | 20 %                                       | 4 %
1 Mb/s     | Gaut PU     | 1.20                                       | 1.43
           | Compass PU  | 1.55                                       | 1.69
           | Increase    | 29 %                                       | 19 %
2 Mb/s     | Gaut PU     | 2.17                                       | 1.98
           | Compass PU  | 3.06                                       | 2.58
           | Increase    | 41 %                                       | 30 %
3 Mb/s     | Gaut PU     | 3.31                                       | 2.59
           | Compass PU  | 6.21                                       | 3.34
           | Increase    | 95 %                                       | 29 %
4 Mb/s     | Gaut PU     | 4.45                                       | 3.52
           | Compass PU  | 10.27                                      | 4.90
           | Increase    | 131 %                                      | 40 %
5 Mb/s     | Gaut PU     | 5.45                                       | 4.11
           | Compass PU  | 14.3                                       | 5.51
           | Increase    | 163 %                                      | 34 %

Gaut PU : area estimated by the Gaut tool for the Processing Unit. Compass PU : area obtained with the Compass tool for the Processing Unit.

Table 2 : difference between estimated area and area after placement and routing

64 State Trellis, 0.8 µ Technology

Throughput | Operators            | Synthesis 2 : Number | Synthesis 2 : Using rate | Synthesis 3 : Number | Synthesis 3 : Using rate
500 Kb/s   | ACS                  | 2                    | 64%                      | 1                    | 64%
           | BM                   | 1                    | 8%                       | 1                    | 8%
           | Register before opt. | 17                   | 38%                      | 13                   | 23%
           | Register after opt.  | 14                   | 46%                      | 9                    | 33%
           | Interconnect         | 27                   |                          | 10                   |
2 Mb/s     | ACS                  | 6                    | 80%                      | 3                    | 88%
           | BM                   | 1                    | 32%                      | 1                    | 32%
           | Register before opt. | 40                   | 66%                      | 25                   | 54%
           | Register after opt.  | 30                   | 80%                      | 17                   | 80%
           | Interconnect         | 90                   |                          | 22                   |
5 Mb/s     | ACS                  | 13                   | 99%                      | 7                    | 92%
           | BM                   | 1                    | 80%                      | 1                    | 80%
           | Register before opt. | 80                   | 83%                      | 48                   | 72%
           | Register after opt.  | 57                   | 100%                     | 37                   | 92%
           | Interconnect         | 302                  |                          | 39                   |

Opt. : optimisation

Table 3 : operator number and efficiency

Reuse has two design drawbacks : uncertain area estimations and uncertain path delay estimations. For this study, the option for optimal use of the registers was also selected. Table 3, together with table 2, presents the number and using rate of the different operators which compose the processing unit.

Area : The routing problem is more penalising with complex algorithms. Table 2 shows that the interconnect wiring area can, in some cases, be as large as the operator area. For instance, synthesis 2 with a 2 Mb/s throughput leads to a routing area increase of 41%, whereas the increase is 95% for 3 Mb/s. In fact, the gap between the predicted and final areas grows with the complexity of the architecture and with the register using rate ratio (before and after optimisation). For example, synthesis 3 has an interconnect cost which never exceeds 40% for throughputs between 500 Kb/s and 5 Mb/s. Although double-ACS are 16 bit operators, these macro-functions perform twice as many useful computations as simple-ACS operators. There are therefore about twice as many simple-ACS operators (synthesis 2) as double-ACS operators (synthesis 3) for the same throughput constraint. Synthesis 3 thus has a lower interconnect area cost than synthesis 2.
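The percentages discussed above follow from the two area columns of table 2 as (final − estimated) / estimated. The sketch below recomputes them from the printed areas only; the dictionary layout and function names are hypothetical. The recomputed values agree with the printed percentages to within about a point, except for synthesis 2 at 3 Mb/s, where the printed areas give roughly 88 % against a printed 95 %; the source does not indicate which of these figures is affected by rounding or transcription.

```python
from typing import Dict, Tuple

# (Gaut PU estimate, Compass PU result) in mm^2, copied from table 2.
TABLE2: Dict[str, Dict[int, Tuple[float, float]]] = {
    "500 Kb/s": {2: (0.942, 1.13), 3: (0.979, 1.02)},
    "1 Mb/s": {2: (1.20, 1.55), 3: (1.43, 1.69)},
    "2 Mb/s": {2: (2.17, 3.06), 3: (1.98, 2.58)},
    "3 Mb/s": {2: (3.31, 6.21), 3: (2.59, 3.34)},
    "4 Mb/s": {2: (4.45, 10.27), 3: (3.52, 4.90)},
    "5 Mb/s": {2: (5.45, 14.3), 3: (4.11, 5.51)},
}


def area_increase_percent(estimated: float, final: float) -> float:
    """Relative increase from the high-level estimate to the area
    obtained after placement and routing."""
    return 100.0 * (final - estimated) / estimated


def increases(synthesis: int) -> Dict[str, float]:
    """Area increase per throughput for synthesis 2 or 3."""
    return {
        throughput: area_increase_percent(*row[synthesis])
        for throughput, row in TABLE2.items()
    }


if __name__ == "__main__":
    for synthesis in (2, 3):
        for throughput, pct in increases(synthesis).items():
            print(f"synthesis {synthesis}, {throughput}: {pct:.1f} %")
```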

Path delays : In order to respect the real time constraint, behavioural synthesis tools require a library in which the operator delay time (DT) is specified. Since the tool priority is an optimal use of the operators, the throughput is a function of the whole data path from register to register (register DT, interconnect operator DT, functional operator DT, wiring DT). The data path depends on the architectural synthesis and its optimisations. However, the optimisation is performed on the whole architecture without placement information, which leads to widely varying wiring lengths. In order to reduce the synthesis time, a typical data path is usually considered during the allocation process (for example, with GAUT : register + demux + functional operator + mux) and a single wiring delay time is specified (the interconnect wiring is difficult to predict in high-level synthesis and is CAD tool dependent [8]). For the synthesis of a complex architecture like a Viterbi decoder, the difference between the estimated and the real path delay may thus become significant. For instance, the delay time of the simple-ACS dedicated operator (37 ns = 6{reg} + 1{demux} + 25{op} + 2{mux} + 3{estimated interconnect wiring time}) has been defined in our high level library. This delay time has then been measured after placement and routing for different throughputs, and a path delay of 42 ns has been found for the 4 Mb/s throughput. Routing thus increases the delay time by 14% in this case, even though a 0.8 µ technology is considered ; manufacturing trends suggest that the interconnect delay will become even more dominant in the foreseeable future.
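The delay budget quoted above can be checked numerically. The sketch below only uses the component values and the 42 ns measurement given in the text; the names are hypothetical. The last helper rests on an extra assumption that is not stated in the paper, namely that the whole difference between estimate and measurement is due to wiring.

```python
from typing import Dict

# Register-to-register data path of the simple-ACS operator, in ns,
# as defined in the high level library.
SIMPLE_ACS_PATH_NS: Dict[str, int] = {
    "register": 6,
    "demux": 1,
    "operator": 25,
    "mux": 2,
    "estimated_wiring": 3,
}
MEASURED_NS_AT_4_MBPS = 42  # after placement and routing


def estimated_delay_ns(path: Dict[str, int]) -> int:
    """Sum of the component delay times along the typical data path."""
    return sum(path.values())


def delay_increase_percent(estimated_ns: float, measured_ns: float) -> float:
    """Relative increase of the measured delay over the estimate."""
    return 100.0 * (measured_ns - estimated_ns) / estimated_ns


def implied_wiring_ns(path: Dict[str, int], measured_ns: int) -> int:
    """Wiring delay implied by the measurement, ASSUMING every other
    component keeps its library value (an assumption of this sketch)."""
    return measured_ns - (estimated_delay_ns(path) - path["estimated_wiring"])


if __name__ == "__main__":
    est = estimated_delay_ns(SIMPLE_ACS_PATH_NS)
    print(est, round(delay_increase_percent(est, MEASURED_NS_AT_4_MBPS), 1))
    print(implied_wiring_ns(SIMPLE_ACS_PATH_NS, MEASURED_NS_AT_4_MBPS))
```

Under that assumption, the wiring share of the path would be several nanoseconds larger than the 3 ns budgeted in the library, which is consistent with the point that a single wiring delay value is hard to choose before placement.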

Conclusion
The architectural synthesis of the Viterbi algorithm has been presented in this paper. The behavioural synthesis of this typically complex application has been performed with GAUT, followed by logic synthesis. This work highlights the problem of interconnect cost (area and path delay) which may occur with the behavioural synthesis of complex algorithms.

Acknowledgements : Section 2.3 was supported by the MOCAT Brittany ITR research program (ref. : 8 97 C741).

References
[1] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm", IEEE Trans. Inform. Theory, Vol. IT-13, pp. 260-269, April 1967.
[2] "A Medium Rate Integrated Viterbi Decoder", Data sheet, COMATLAS, Cesson Sévigné, France.
[3] M. Biver, H. Kaeslin, C. Tommasini, "Architectural design and realization of a single-chip Viterbi decoder", Integration, the VLSI Journal, No. 8, pp. 3-16, 1989.
[4] E. Casseau, E. Lüthi, "Architecture of a high-rate VLSI Viterbi decoder", In proceedings of the 3rd IEEE International Conference on Electronics, Circuits, and Systems, ICECS 96, Rodos, Greece, 1996.
[5] E. Martin, O. Sentieys, H. Dubois, J.L. Philippe, "GAUT, an Architecture Synthesis Tool for Dedicated Signal Processors", In proceedings of EURO-DAC 93, pp. 14-19, 1993.
[6] J.L. Philippe, O. Sentieys, J.P. Diguet, E. Martin, "From digital signal processing specification to layout", In Logic and Architecture Synthesis : state-of-the-art and novel approaches, pp. 307-313, Chapman & Hall, 1995.
[7] S. Y. Ohm, F. J. Kurdahi, N. D. Dutt, "A Unified Lower Bound Estimation Technique for High-Level Synthesis", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 458-472, May 1997.
[8] H. Mecha and M. Fernandez, "Interconnection Delay and clock Cycle Selection in High-Level Synthesis", 10th Int. Conference on VLSI Design, Hyderabad, India, Jan. 4-7 1997.
