
Toward behavioural level IPs : high level synthesis of the Viterbi algorithm

Christophe JEGO, Emmanuel CASSEAU, Eric MARTIN
LESTER Lab, UBS University, 2 Rue Le Coat Saint Haouen, 56100 LORIENT, France
Tel : (33)02/97/88/05/47; Fax : (33)02/97/88/05/51
http://lester.univ-ubs.fr:8080/

Abstract

This paper presents the behavioural synthesis of a complex application widely used in digital communications : the Viterbi algorithm. Since architectural synthesis allows different design alternatives to be rapidly explored under various constraints, dedicated architectures can be generated. In the same way, when the behavioural specification is generic, the synthesis can be rapidly turned to a specific application or reused as a virtual component in a high level system description.

Keywords : behavioural IP, behavioural synthesis, digital ASIC design, high level synthesis, Viterbi algorithm

Introduction

Recent advances in VLSI technology have led to new design methodologies such as high level synthesis. Architectural synthesis significantly increases productivity by raising the abstraction level of digital designs. The process explores the space of possible designs and reaches the "best" architectural solution satisfying a set of specified constraints. The Viterbi algorithm is a good example of a typically complex application that benefits from such synthesis : behavioural synthesis provides a range of solutions from the design space, which allows new architectures to be designed. Furthermore, the designer can easily modify the parameters of the description and/or the architectural constraints, which provides high flexibility and reactivity. The paper is structured as follows : section 1 briefly presents the Viterbi algorithm. Section 2 presents the main characteristics of a Viterbi decoder at the RTL level. Section 3 is dedicated to the behavioural synthesis of this algorithm. Concluding remarks are drawn in section 4.

1 Viterbi algorithm overview

The Viterbi algorithm [1] is applicable to a variety of decoding and detection problems which can be modelled by a finite-state discrete-time Markov process, such as convolutional (figure 1.a) and trellis decoding in digital communications. Based on the received symbols, the Viterbi algorithm estimates the most likely state sequence according to an optimisation criterion, such as the a posteriori maximum likelihood criterion, through a trellis which generally represents the behaviour of the encoder (figure 1.b).

[Figure 1 diagram. 1.a : encoder with input data dk, delay elements dk-1 and dk-2, two adders (+) and output symbol. 1.b : trellis with encoder states 00 (0), 01 (1), 10 (2), 11 (3) and the possible transitions between states.]

Figure 1 : a) 4-state convolutional encoder. b) 4-state trellis.

The Viterbi algorithm can be divided into three main tasks. The branch metrics (BM), also called transition metrics, which represent the probabilities of the transitions from encoder state si to state sj, are computed first. Then, for each state, the state metrics are computed by recursively accumulating the transition metrics over all possible transitions. The survivor (the most likely path) is selected and the state metric updated accordingly. This step is usually called the add-compare-select (ACS) process. Finally, the surviving paths (one for each state) are stored and the best of them is traced back through the trellis over the truncation length in order to provide the decoded symbol (decision).

2 Viterbi decoder synthesis

Basically, a Viterbi decoder is divided into three synchronous circuit blocks, one for each of the three main algorithm tasks [2,3,4] : the BM unit, the ACS unit, and the survivor memory evaluation (SME) unit (figure 2).
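As an illustration, the 4-state encoder and trellis of figure 1, on which these three blocks operate, can be modelled in a few lines of Python. This is only a sketch : the adder taps (generator polynomials 7 and 5 in octal) are an assumption, since the figure does not give them in this text, and the function names are hypothetical. The state numbering packs d(k-2) and d(k-1) into two bits so that the next state is (2·state + dk) mod 4, which agrees with the indices used in equations 2a and 2b.

```python
N_STATES = 4  # 4-state trellis of figure 1 (two delay elements)


def encode_step(state, bit):
    """Advance the encoder by one input bit.

    Returns (next_state, (s1, s2)) where (s1, s2) is the output symbol.
    """
    d1 = state & 1         # d(k-1), the most recent stored bit
    d2 = (state >> 1) & 1  # d(k-2), the oldest stored bit
    s1 = bit ^ d1 ^ d2     # assumed taps 1 + D + D^2 (octal 7)
    s2 = bit ^ d2          # assumed taps 1 + D^2 (octal 5)
    next_state = ((state << 1) | bit) & (N_STATES - 1)
    return next_state, (s1, s2)


def encode(bits, state=0):
    """Encode a bit sequence, starting from the given state."""
    symbols = []
    for bit in bits:
        state, symbol = encode_step(state, bit)
        symbols.append(symbol)
    return symbols


def trellis():
    """Map (state, input bit) to (next state, output symbol)."""
    return {(s, b): encode_step(s, b) for s in range(N_STATES) for b in (0, 1)}
```

With these assumed taps, encode([1, 0, 1, 1]) gives the symbols (1,1), (1,0), (0,0), (0,1), and each state j of the trellis is reached only from states j//2 and j//2 + 2.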

[Figure 2 diagram : input symbol -> BMC unit -> ACS unit -> SME unit -> decision]

Figure 2 : Structure of a Viterbi decoder

BM unit : In each time slot, the BM unit determines the branch metrics from the received symbol, which is usually a soft decision symbol. Different algorithms can be used to estimate this distance. For example, the branch metric Bm(i) using the Euclidean distance is :

$Bm(i) = |X - C(i)|^{2}$    (1)

where X is the received symbol and C(i) a transmitted (noiseless) symbol. Since there are only 4 possible transmitted symbols, the structure of the BM unit is easily designed.

ACS unit : The performance of a Viterbi decoder is actually limited by the ACS unit : for each of the states, the current state metric has to be known before the next state metric can be calculated. This means that in each time slot, for an N-state encoder, N ACS computations have to be performed according to the recurrent equations 2a and 2b (Sm(i) : state metric of state i) :

$Sm(2n) = \min[Sm(n)+Bm(0),\ Sm(n+N/2)+Bm(1)]$    (2a)

$Sm(2n+1) = \min[Sm(n)+Bm(2),\ Sm(n+N/2)+Bm(3)]$    (2b)

A regular and modular design is possible : the ACS unit is usually built from dedicated double-ACS operators that concurrently update two states of the trellis (states 2n and 2n+1). The maximum decoding speed is therefore set by the maximum achievable computation speed of a double-ACS operator and by the number of such operators. Thus, high data rate implementations of the Viterbi algorithm require N/2 double-ACS operators operating concurrently, whereas a single double-ACS operator operating sequentially is used when the decoder size and/or consumption are of prime importance. In that case, a finite state machine (FSM) is necessary to control the computations and the state metric storage. Mixed designs, which combine sequential and concurrent ACS computations, are possible; however, a tricky state allocation of the ACS operators has to be performed manually in order to minimise the interconnect and the control process. ACS units are therefore typically fully concurrent (N/2 double-ACS) or fully sequential (one double-ACS) circuits.

SME unit : The two classical algorithms for survivor path storage and decoding are the trace-back algorithm (TBA) and the register exchange algorithm (REA) [5,6]. The trace-back method is a backward processing algorithm which estimates the previous state given the current state and the current state decision. This approach allows a RAM to be used for the survivor memory implementation. The register exchange method, by contrast, requires for each state a shift register which contains the survivor path leading to this state. The registers are interconnected like the trellis, and they are updated by exchanging their contents according to the new decisions provided by the ACS unit. A register exchange circuit is shown in figure 3 for a 4-state trellis. Since the registers contain the survivor paths, this algorithm avoids trace-back, and the decoded data is provided at the output of the shift register associated with the most likely state.

[Figure 3 diagram : decisions from the ACS unit d0(k,0), d0(k,1), d0(k,2), d0(k,3) feed rows of 2:1 multiplexers and flip-flops; output : decoded data d(k-L)]

Figure 3 : Register exchange circuit for a 4-state trellis
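The behaviour of the three units can be sketched in Python as follows. This is a software model under stated assumptions, not the RTL design of [2,3,4] : the function names are hypothetical, equation 1 is read as a squared Euclidean distance, the branch metric is passed as a function bm(p, j) of the transition from state p to state j, a decision bit of 0 selects predecessor n and a decision bit of 1 selects predecessor n + N/2, and the survivor registers are not truncated.

```python
def branch_metric(x, c):
    """Squared Euclidean distance between received symbol x and
    noiseless symbol c, both given as sequences of samples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def acs_update(sm, bm):
    """One add-compare-select step over all states (equations 2a, 2b).

    sm -- list of the N current state metrics
    bm -- function bm(p, j) giving the branch metric of transition p -> j
    Returns (new state metrics, decision bit per state).
    """
    n_states = len(sm)
    half = n_states // 2
    new_sm = [0] * n_states
    decisions = [0] * n_states
    for n in range(half):
        for j in (2 * n, 2 * n + 1):  # the two states of one double-ACS
            upper = sm[n] + bm(n, j)                # path through state n
            lower = sm[n + half] + bm(n + half, j)  # path through state n + N/2
            decisions[j] = 1 if lower < upper else 0
            new_sm[j] = min(upper, lower)
    return new_sm, decisions


def register_exchange(registers, decisions):
    """Register exchange: each state copies the register of its
    surviving predecessor and shifts in the new decision."""
    half = len(registers) // 2
    return [registers[(j >> 1) + half * d] + [d]
            for j, d in enumerate(decisions)]


def trace_back(decision_memory, state):
    """Trace-back: walk the stored decisions backwards from `state`."""
    if not decision_memory:
        return []
    half = len(decision_memory[0]) // 2
    path = []
    for decisions in reversed(decision_memory):
        d = decisions[state]
        path.append(d)
        state = (state >> 1) + half * d  # previous state on the survivor
    return path[::-1]
```

For any sequence of decisions, the register held for a state by register_exchange equals the sequence returned by trace_back from that state; the two methods differ only in where the work is done (register copies at every step versus a backward walk through a decision memory).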

Thus, the designer needs several basic parameters to synthesise a Viterbi decoder : the number of quantization bits of the received symbols for the BM unit (and, indirectly, the ACS design), the number of states of the trellis for the ACS unit, the truncation length for the SME unit and, obviously, the decoder throughput. Usually, the BM unit is a dedicated design, optimised for the number of bits. The number of states and the truncation length may be generic parameters of an RTL description, which allows the description to be reused. Finally, for a given technology, a concurrent or a sequential ACS structure is chosen according to the throughput.

3 Viterbi algorithm behavioural synthesis

High level synthesis (HLS) maps a behavioural algorithm description onto a register-transfer level (RTL) implementation. The process includes the following tasks [7,8] :

- specification : writing the behavioural description of the application + operator library,
- scheduling : determining the cycle by cycle execution of each algorithmic statement,
- allocating resources : deciding the numbers and types of functional units,
- binding and fusion : mapping the algorithmic behaviour to structure and optimising the use of registers and busses,
- finite-state machine : generating the FSM to control data path elements and interconnects according to the schedule and bindings.
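To make the scheduling and allocation tasks concrete, the sketch below schedules a small data flow graph with a basic resource-constrained list scheduler. It is a generic, textbook-style illustration and not the algorithm used by any particular HLS tool; the function name, the operation names and the unit-latency assumption are all hypothetical.

```python
def list_schedule(ops, deps, units):
    """Resource-constrained list scheduling with unit-latency operations.

    ops   -- dict: operation name -> operator kind
    deps  -- dict: operation name -> list of predecessor names
    units -- dict: operator kind -> number of allocated operators
    Returns a dict: operation name -> cycle in which it executes.
    """
    for kind in ops.values():
        if units.get(kind, 0) < 1:
            raise ValueError(f"no operator allocated for kind {kind!r}")
    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        free = dict(units)  # operators still available in this cycle
        # An operation is ready once all its predecessors ran in earlier cycles.
        ready = sorted(
            name for name in ops
            if name not in schedule
            and all(schedule.get(p, cycle) < cycle for p in deps.get(name, []))
        )
        if not ready:
            raise ValueError("cyclic or unknown dependencies")
        for name in ready:
            if free[ops[name]] > 0:
                free[ops[name]] -= 1
                schedule[name] = cycle
        cycle += 1
    return schedule


# One branch metric computation followed by four ACS updates.
ops = {"bm": "BM", "acs0": "ACS", "acs1": "ACS", "acs2": "ACS", "acs3": "ACS"}
deps = {name: ["bm"] for name in ops if name != "bm"}
two_acs = list_schedule(ops, deps, {"BM": 1, "ACS": 2})
one_acs = list_schedule(ops, deps, {"BM": 1, "ACS": 1})
```

With two ACS operators allocated, the four updates fit in two cycles after the BM cycle; with a single operator they take four. This is the kind of trade-off between operator count and execution time that the allocation and scheduling tasks resolve under a timing constraint.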

High level synthesis enables rapid design space exploration through the examination of different design alternatives. The exploration is guided by constraints such as area, throughput or power.

3.1 A behavioural synthesis tool : GAUT

The HLS tool used for this work was GAUT. This tool is developed by two university laboratories : Lester (University of South Brittany, France) and Lasti (University of Rennes, France). A commercial architectural synthesis tool, such as Behavioral Compiler, could also have been used; the behavioural VHDL description of the Viterbi algorithm would then have to be adapted to that specific tool. GAUT is a pipeline architectural synthesis tool dedicated to signal and image processing applications under real time execution constraints. From one behavioural specification, one technology and one real time constraint, an optimised architecture is synthesised [9,10]. The global synthesis flow of GAUT is described in figure 4.

[Figure 4 diagram. Inputs : behavioural description, library description. Compiler (analysis, transformations) producing the data flow graph, with a graphic interface for visualization and manipulation. Architecture : design of the FUs (operator, register, bus), design of the memory unit (memories bench, address generator), design of the communication unit (buffers, state machine, controller of protocol). Graphic interface for synthesis results (efficiency of operators, efficiency of registers, structural description of the architecture). Output interface : structural description, functional description, stimuli (test) file, in VHDL (RTL).]

Figure 4 : GAUT synthesis flow

The specification is written in VHDL, at a behavioural level, without any architectural directives. After an algorithm analysis, the tool synthesises the Data Flow Graph (an internal representation obtained during the compilation phase) according to a generic model of architecture. Then, the architecture is optimised and finally, an RTL description of a dedicated architecture is generated from the operators defined in the library.

3.2 Characteristics of the algorithm description

The Viterbi algorithm has been specified at a behavioural VHDL level. The description has three generic parameters :

- the number of states of the trellis : the higher this number, the more complex the ACS process,
- the number of quantization bits of the received symbols : the branch metrics are computed from the quantized samples received by the decoder,
- the truncation length : the length of the SME memory used to find the final decision depends on the coding gain and the transmission channel.

The synthesis can thus rapidly be turned to a specific domain (global system for mobile communications (GSM), digital audio broadcasting (DAB), digital video broadcasting (DVB), …) according to these parameters. The GAUT library defined by the designer contains the specification of components that come from logic synthesis. The larger the library, the larger the solution space explored by the architectural synthesis tool. Different kinds of operators can typically be defined : standard operators, multi-function operators, pipeline operators and macro-function operators [9]. We have introduced the following dedicated operators (macro-functions) in the library for the Viterbi algorithm synthesis :

- a BM operator : computes the branch metrics according to the Euclidean distance method,
- a simple-ACS operator : updates one path of the trellis,
- a double-ACS operator : updates two paths of the trellis,
- a REA-based operator : switches a decision.

For instance, the RTL VHDL description of the double-ACS operator has been synthesised with the Compass tool, and its physical parameters (area, delay time) have then been extracted. The characteristics inserted in the library are shown in figure 5.

    Component double-ACS_vhdl
    Generic (area : integer := 125 ;                (10^-3 mm^2)
             function : function_al := double-ACS ;
             delay time : integer := 37 ;           (ns)
    Port (Mch :… ,Mb1 :… ,Mb2 :… ,result : …) ;
    End component ;

Figure 5 : generic parameters of a double-ACS operator

As far as an RTL description of a Viterbi decoder is concerned, double-ACS operator based designs are known to be a good trade-off between interconnect area and operator area. However, the two different ACS components introduced in the GAUT library allow different original architectural solutions to be explored. In the same way, other components (2n-ACS which updates 2n paths of the trellis) could have been introduced in this library.

3.3 Architectural synthesis of the Viterbi algorithm

An implementation of the widely used 64-state trellis Viterbi decoder with 8-bit received symbols has been studied. The synthesis objective was an area optimisation of the processing unit under a throughput constraint. The objective could equally have been a propagation time optimisation of the processing unit under an area constraint. Nevertheless, like commercial architectural synthesis tools, the GAUT university tool is dedicated to area optimisation. For this project, the GAUT library has been defined according to a 0.8 µ CMOS technology. This study concerns the processing unit of the Viterbi algorithm architecture and its control.

BM+ACS unit : A block that calculates both the branch metrics and the state metrics has been described. The description uses dedicated operators. Obviously, the area cost of a description with standard operators is higher than that of a description with dedicated operators : the first priority of HLS tools is an optimal use of the computing resources, hence operator reuse, which, with standard operators, involves too many interconnect operators for this complex algorithm. The architectural characteristics (area, propagation time, operator and interconnect numbers) of two different syntheses for a 500 Kb/s throughput constraint are presented in table 1.

64 state trellis. Technology : 0.8 µ CMOS ES2. Throughput constraint : 500 Kbits/s

                                              Synthesis 1   Synthesis 2
Execution time (ns) {specified constraint}           2000          2000
Propagation time (ns) {architectural result}         1340          1320
Area (10^-3 mm^2) without busses                      942           979
Library (number of bits)                                8            16
BM operator                                             1             1
Simple-ACS operator                                     2             0
Double-ACS operator                                     0             1
Surv                                                    1             2
Register                                               14             9
Mux                                                     2             1
Demux                                                   8             3
Tristate                                               17             6
Bus                                                     9             7

Synthesis 1 : BM + simple-ACS operators
Synthesis 2 : BM + double-ACS operators

Table 1 : BM+ACS block synthesis

The processing unit (PU) architecture associated with synthesis 1 is presented in figure 6. This solution requires one BM operator, two simple-ACS operators and one operator (Surv) which only extracts the survivor (1 bit) from the chosen path metric (8 bits). The components numbered from 1 to 14 are the registers. The multiplexers, demultiplexers and tristates (interconnect operators), which allow operators and registers to be reused, appear as black triangles.

[Figure 6 diagram; operator labels : Surv, ACS, ACS, BM]

Figure 6 : Architecture scheme of synthesis 1 processing unit (500 Kb/s)

With regard to synthesis 2, the PU architecture is presented in figure 7. This solution requires one BM operator, one double-ACS operator, two operators (Surv) and nine registers.

The control of the processing unit (FSM) represents less than ten percent of the PU area for syntheses 1 and 2, whatever the throughput.

[Figure 7 diagram; operator labels : Surv, Surv, BM, ACS]

Figure 7 : Architecture scheme of synthesis 2 processing unit (500 Kb/s)

Looking at figures 6 and 7, the difference in interconnect cost between the two syntheses is obvious : resource reuse costs more for the synthesis with simple-ACS operators than for the synthesis with double-ACS operators. The area of the Viterbi algorithm architecture as a function of throughput is presented in figure 8. The synthesis can be turned to a particular application (GSM, DAB, HDTV, …) by modifying the throughput constraint, and architectural estimations (processing unit propagation time, area, power, …) are quickly provided. Moreover, the optimisation of the resources is oriented to a particular domain, i.e. the number of ACS operators is optimal for a specific throughput. By comparison, an RTL description of the ACS unit is usually fully concurrent (Number_of_states/2 double-ACS operators) or fully sequential (one double-ACS operator). Architectures that are tricky to design at the RTL level can thus be explored with HLS tools.
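The link between the throughput constraint and the number of ACS operators can be illustrated with a rough lower-bound estimate. The sketch below is not part of GAUT : the function name and the model are assumptions. It counts only ACS computation time, using the 37 ns double-ACS delay of figure 5 and the 64-state trellis, and ignores the BM and Surv operators, register transfers and control.

```python
import math


def min_acs_operators(n_states, throughput_bps, op_delay_ns, states_per_op=2):
    """Lower bound on the number of ACS operators needed to update all
    state metrics within one decoded-bit period (sequential reuse)."""
    period_ns = 1e9 / throughput_bps         # time available per decoded bit
    ops_per_bit = n_states // states_per_op  # ACS operations per decoded bit
    return math.ceil(ops_per_bit * op_delay_ns / period_ns)


# 64-state trellis, double-ACS delay of 37 ns (figure 5)
for rate in (500e3, 3e6):
    print(rate, min_acs_operators(64, rate, 37))
```

At 500 Kb/s the period is 2000 ns and the 32 double-ACS operations need 32 × 37 = 1184 ns, so a single operator is enough, in line with synthesis 2 in table 1; at 3 Mb/s the same estimate gives 4 operators. Real syntheses need more time than this bound, as the 1320 ns propagation time of table 1 shows.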

Synthesis 2, using double-ACS operators, becomes interesting as the throughput increases because it uses fewer operators. Nevertheless, the area difference between syntheses 1 and 2 is considerably attenuated. Since a double-ACS operator concurrently updates two path metrics of the trellis while a simple-ACS operator updates only one, their output data are coded on 16 and 8 bits respectively. Thus, a 16-bit library had to be used for the synthesis with double-ACS operators. That is why there is no significant difference on the graph although, a priori, there are fewer interconnections.

SME unit : The SME unit (REA) has been synthesised for 500 Kb/s and 3 Mb/s throughputs and truncation lengths from 5 up to 50. The intermediate decisions, which are normally saved in the registers of elementary cells, are stored in RAM (the Memory Unit of the architecture). In fact, the solution with intermediate decisions in registers (typical REA structure) is too complex to be synthesised. Moreover, the results would not have been better than those of a direct RTL VHDL description. The behavioural synthesis tool is not well suited to this particular type of task : the register exchange algorithm is directly dedicated to RTL designs and, furthermore, the SME process is typically data control and storage, whereas the GAUT tool is dedicated to computing applications.

[Figure 8 plot : area (10^-3 mm^2, from 0 to 30000) versus throughput (Kb/s, logarithmic axis from 1.0E+00 to 1.0E+05) for a 64-state trellis in 0.8 µ CMOS ES2 technology. Two curves : synthesis with simple-ACS and synthesis with double-ACS. Typical digital communication applications (GSM, DAB, HDTV) and the GAUT synthesis limits are marked.]

Figure 8 : architecture area as a function of throughput

4 Concluding remarks

Advances in design process technology have enabled the design of complex systems. In order to meet design deadlines with limited engineering resources, system developers need to use pre-designed blocks, which leads to the creation and integration of reusable blocks of IP. Although commercial tools are currently available, they are not widely used. Yet architectural synthesis, also called behavioural synthesis, allows many different architectures to be rapidly explored according to specified constraints, which can be useful for the system developer. In this paper, the architectural synthesis of the Viterbi algorithm has been presented. This is our first experience with behavioural IP specification. The specification is written in behavioural VHDL and needs to be generic if high flexibility (reuse) is required. Using a particular library, which conveys basic information such as the function, area, power and performance of the operators, the design space can be explored under a set of constraints (throughput, …) and, additionally, a set of optimisation directives. Consequently, a behavioural IP can be defined as a behavioural specification plus a library of RTL level IPs. Architectures that are tricky to design at the RTL level can then be explored. Furthermore, the system developer can quickly retrieve algorithmic parameters from simulations of the architecture. In addition, since no architectural directives are specified, the specification requires few modifications to be supported by a particular HLS tool. We did not describe the architecture interface, e.g. how the I/O communication constraints can be introduced in the specification. This work is under investigation [11] and will be reported later.

References

[1] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm", IEEE Trans. Inform. Theory, Vol. IT-13, pp. 260-269, April 1967.
[2] P. G. Gulak, T. Kailath, "Locally connected VLSI architectures for the Viterbi algorithm", IEEE Jour. Sel. Areas in Com., Vol. 6, No. 3, pp. 527-537, April 1988.
[3] M. Biver, H. Kaeslin, C. Tommasini, "Architectural design and realization of a single-chip Viterbi decoder", Integration, the VLSI Journal, No. 8, pp. 3-16, 1989.
[4] E. Casseau, E. Lüthi, "Architecture of a high-rate VLSI Viterbi decoder", 3rd IEEE International Conference on Electronics, Circuits, and Systems, ICECS 96, Rodos, Greece, 1996.
[5] G. Feygin, P. G. Gulak, "Architectural tradeoffs for survivor sequence memory management in Viterbi decoders", IEEE Trans. on Commun., Vol. 41, No. 3, pp. 425-429, March 1993.
[6] E. Paaske, S. Pedersen, J. Sparso, "An area-efficient path memory structure for VLSI implementation of high speed Viterbi decoders", Integration, the VLSI Journal, No. 12, pp. 79-91, 1991.
[7] D. D. Gajski, N. Dutt, "High Level Synthesis", Ed. Kluwer Academic Publisher, 1992.
[8] M. C. McFarland, A. C. Parker, R. Camposano, "The High-Level Synthesis of Digital Systems", Proc. IEEE, Vol. 78, No. 2, pp. 301-318, Feb. 1990.
[9] E. Martin, O. Sentieys, H. Dubois, J.L. Philippe, "GAUT, an Architecture Synthesis Tool for Dedicated Signal Processors", In proceedings of EURO-DAC 93, pp. 14-19, 1993.
[10] J.L. Philippe, O. Sentieys, J.P. Diguet, E. Martin, "From digital signal processing specification to layout", In Logic and Architecture Synthesis : state-of-the-art and novel approaches, pp. 307-313, Chapman & Hall, 1995.
[11] A. Baganne, J.L. Philippe, E. Martin, "A Formal Technique for Hardware Interface design", IEEE Transactions on Circuits and Systems II (TCAS II), May 1998.

Acknowledgements : Section 3.3 was supported by the MOCAT Brittany ITR research program (ref. : 8 97 C741).