Design and Implementation of Low Power High Speed ... - ScienceDirect

Available online at www.sciencedirect.com

Procedia Engineering 30 (2012) 61 – 68

Procedia Engineering www.elsevier.com/locate/procedia

International Conference on Communication Technology and System Design 2011

Design and Implementation of Low Power High Speed Viterbi Decoder

K. Cholan a,*

a Sree Vidyanikethan Engineering College, Tirupati.

Abstract This paper describes the design of the Viterbi decoding algorithm and presents an implementation of the decoder for UWB MB-OFDM technology. The Viterbi algorithm is a maximum-likelihood algorithm for decoding convolutional codes. The algorithm searches the trellis diagram for the path whose sequence of output symbols most closely matches the received sequence. To accomplish this, it calculates for each path a path metric, which measures the distance to the received symbol sequence. The branch metric unit (BMU) calculates the distance (metric) between the received noisy symbol and the output symbol of each state transition (branch). The add-compare-select unit (ACSU) computes the accumulated metric associated with the sequence of transitions (path) that reaches a state. When more than one path arrives at a state, the ACSU selects the path with the lowest metric value, which is the survivor path. The survivor memory unit (SMU) stores the information needed to trace back from a state to the previous one. This work also includes the design of a rate-1/2 convolutional encoder. The overall system is designed in an HDL, and simulation, synthesis and implementation (translation, mapping, placing and routing) are carried out using FPGA-based EDA tools. Finally, the performance of the proposed architecture is reported in terms of speed, area, power and throughput.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of ICCTSD 2011

Open access under CC BY-NC-ND license.

Keywords: Viterbi decoding; BMU; ACSU; SMU; HDL.

1. Introduction
The Viterbi decoder is one of the most widely used components in digital communications and storage devices. Although its VLSI implementation has been studied in depth over the last decades, every new design still starts with design space exploration. This can be partially explained by the fact that the design space is huge. In addition, the optimization criteria and the design figures keep changing with advances in CMOS technology and design tools.

* K.Cholan. Tel.: +91- 9885455947. E-mail address: [email protected]

1877-7058 © 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. doi:10.1016/j.proeng.2012.01.834


Different design aspects of the Viterbi decoder have been studied in a number of research papers [1]–[7]. However, most researchers concentrate on one specific component of the design (e.g., the path metric unit or the survivor memory unit). Somewhat more general studies are presented in [1], [8], and [9]. Still, in the author's view, a systematic and comprehensive analysis summarizing and characterizing as many of the trade-offs and implementation techniques as possible is missing. This contribution presents such a survey, providing designers with clear guidelines and references to find the best solution for every specific case. The proposed work is carried out on reconfigurable FPGA technology, exploiting the parallel and pipelined features of the hardware resources. The existing algorithm is redesigned in an HDL, with simulation, synthesis and implementation (translation, mapping, place and route) done using FPGA-based EDA tools.

1.1. General structure of the Viterbi algorithm

Fig.1. Proposed Viterbi Decoder

2. Convolution codes:
The Viterbi decoding algorithm is the most popular method to decode convolutional error-correcting codes. In a convolutional encoder, an input bitstream is passed through a shift register. Input bits are combined with several outputs of the shift register cells using binary single-bit addition (XOR). The resulting output bitstreams represent the encoded input bitstream. Generally speaking, every input bit is encoded using n output bits, so the coding rate is defined as 1/n (or k/n if k input bits are used). The constraint length K of the code is defined as the length of the shift register plus one. Finally, generator polynomials define which bits in the input stream have to be added to form the output. An encoder is completely described by n polynomials of degree K-1 or less. Figure 2 shows an example of a simple convolutional encoder together with the corresponding parameters.

Fig.2. Convolutional encoder (index t denotes timing in clock cycles).

Every state transition in the encoder results in one codeword being produced. After transmission over a noisy channel, the codewords may be corrupted. The Viterbi decoder reconstructs the initial input sequence of the encoder by calculating the most probable sequence of state transitions.


Fig.3. Trellis diagram.

This is done by tracing the trellis in reverse while examining the sequence of incoming codewords. Every codeword is produced by the combination of a certain input bit with a specific encoder state. This means that the likelihood of the state transitions can be calculated even if the received codewords are corrupted. By comparing the likelihood values, some transitions can be discarded immediately, thereby pruning the search space. Once the state transition sequence is determined, the reconstruction of the transmitted bit sequence is trivial.

2.1. Convolution Encoding

Fig.4. 1/2 convolutional coding with constraint length = 3 and generator polynomials 111 and 101
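The rate-1/2 encoder named in the caption of Fig.4 (constraint length 3, generator polynomials 111 and 101) can be sketched in software as follows. This is only an illustrative model, not the paper's HDL: the function name `conv_encode` and the bit ordering (current input bit in the most significant position of the register window) are assumptions made for the example.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Encode a list of bits with a rate-1/n feedforward convolutional code.

    Assumption: the register window holds the current input bit in its MSB,
    followed by the K-1 previously stored bits (newest first).
    """
    state = 0  # contents of the K-1 shift-register cells, initially all zero
    out = []
    for b in bits:
        # Window = current input bit followed by the K-1 stored bits.
        window = (b << (constraint_length - 1)) | state
        for g in generators:
            # Each output bit is the XOR (parity) of the taps selected by g.
            out.append(bin(window & g).count("1") & 1)
        # Shift: the current bit enters the register, the oldest bit drops out.
        state = window >> 1
    return out


if __name__ == "__main__":
    print(conv_encode([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```

Each input bit produces two output bits, so a 4-bit message yields an 8-bit codeword sequence.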

At the top level, the Viterbi decoder consists of three units: the branch metric unit (BMU), the path metric unit (PMU), and the survivor memory unit (SMU). The BMU calculates the distances from the received (noisy) symbols to all legal codewords. For the encoder of Figure 2, the only legal codewords are “00”, “01”, “10”, and “11”. The measure calculated by the BMU can be the Hamming distance for hard-input decoding, or the Manhattan/Euclidean distance for soft-input decoding (i.e., every incoming symbol is represented using several bits). The PMU accumulates the branch metrics along each path and, when more than one path reaches a state, keeps the one with the lowest metric. Finally, the SMU stores the bit decisions produced by the PMU for a defined number of clock cycles (referred to as the traceback depth, TBD) and processes them in reverse order, which is called backtracking. Starting from an arbitrary state, all state transitions in the trellis merge into the same state after TBD (or fewer) clock cycles. From this point on, the decoded output sequence can be reconstructed.

3. Branch metric unit:

3.1. Design Space
The BMU is typically the smallest unit of the Viterbi decoder. Its complexity increases exponentially with n (the reciprocal of the coding rate) and with the number of samples processed by the decoder per clock cycle (the radix factor; e.g., radix-2 corresponds to one sample per clock cycle). The complexity increases linearly with softbits, the number of bits per received symbol. So, the area and throughput of the BMU can be completely described by these factors. As the BMU is not the critical block in terms of area or throughput, its design is quite straightforward. The version calculating the Hamming distance for coding rate 1/2, presented in Figure 5, performs well in terms of both area and throughput. A BMU calculating the squared Euclidean or Manhattan distance is slightly more complex but can also be easily mapped to an array of adders and subtractors.

3.2. Branch Metric Precision
The BM should be able to hold the maximum possible difference (distance) between ideal and received symbols. For hard input (softbits = 1), the Hamming distance between a codeword and its complement is the maximum possible BM (i.e., the distance between the codewords consisting of all 0's and all 1's). In this case, the maximum distance is equal to the symbol length n, so the BM precision is given by

$$BM_{precision} = \lfloor \log_2 n \rfloor + 1$$

In soft-input decoding, each received symbol is represented by more than one bit (softbits > 1). In this case, the bit width needed for the maximum possible branch metric is

$$BM_{BW} = \lfloor \log_2 ((2^{softbits} - 1) \cdot n) \rfloor + 1$$

3.3. Radix-2 BMU

Fig.5. Radix-2 BMU
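As a rough software counterpart of the radix-2 BMU, the sketch below computes the Hamming distance from a received hard-decision symbol to every one of the 2^n codewords, and evaluates the branch metric bit width from Section 3.2. The function names and the packing of a symbol into an integer are assumptions made for illustration; the actual unit is built from adders and inverters.

```python
def hamming_branch_metrics(received, n=2):
    """Return {codeword: Hamming distance to `received`} for all 2**n codewords.

    Assumption: a symbol of n hard-decision bits is packed into an integer.
    """
    metrics = {}
    for codeword in range(2 ** n):
        # XOR marks the differing bit positions; counting them gives the distance.
        metrics[codeword] = bin(codeword ^ received).count("1")
    return metrics


def branch_metric_bits(n, softbits=1):
    """Bit width for the largest branch metric: floor(log2((2**softbits - 1) * n)) + 1."""
    max_metric = (2 ** softbits - 1) * n
    # For a positive integer x, x.bit_length() equals floor(log2(x)) + 1 exactly.
    return max_metric.bit_length()


if __name__ == "__main__":
    print(hamming_branch_metrics(0b01))       # {0: 1, 1: 0, 2: 2, 3: 1}
    print(branch_metric_bits(2))              # 2 bits for hard input, rate 1/2
    print(branch_metric_bits(2, softbits=3))  # 4 bits for 3 soft bits, rate 1/2
```

For hard input the formula reduces to the first expression of Section 3.2, since 2^1 - 1 = 1.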

The hardware cost of the BMU of Figure 5 in terms of the areas of basic components can be expressed as follows:

$$BMU2_{area} = 2^n \cdot (HA_{area} + FA_{area} + (softbits - 1)) + softbits \cdot n \cdot INV_{area}$$

where $HA_{area}$, $FA_{area}$, and $INV_{area}$ denote the area of a half adder, a full adder, and an inverter, respectively.

4. Path metric unit:

4.1. Design Space
The PMU is a critical block in terms of both area and throughput. The key problem of PMU design is the recursive nature of the add-compare-select (ACS) operation: path metrics calculated in the previous clock cycle are used in the current clock cycle, as Figure 6 shows. To increase the throughput or reduce the area, optimizations can be introduced at the algorithmic, word or bit level. To obtain a very high throughput, parallelism at the algorithmic level is exploited. By algorithmic transformations, Viterbi decoding is converted to a purely feedforward computation, which allows independent processing of input blocks. These algorithmic parallel block processing methods aim at unlimited concurrency by decoding blocks of the input stream independently. These techniques result in quite high area figures.


Fig.6. Radix-2 ACS

However, as technology scaling shrinks the devices, these techniques are becoming more attractive. Still, if the required throughput for a specific case can be achieved with word- or bit-level optimization techniques, there is no need to use algorithmic transformations.

4.2. Path Metric Precision
The register temporarily storing the path metrics should be wide enough to avoid overflow errors in the PMU operations. Modulo arithmetic is usually used for this purpose [11]. PM bitwidth is determined as follows:

where B is given as $B = (2^{softbits} - 1) \cdot n$
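The add-compare-select recursion of Section 4.1 can be illustrated with a minimal sketch. The first function is a plain ACS on unbounded integers. The second shows one common way to realize the modulo arithmetic mentioned in Section 4.2: metrics wrap around in a fixed-width register and are compared through the sign bit of their difference. The source only cites [11] for this, so the details here (function names, the `width` parameter, the tie-breaking rule) are assumptions, and the comparison is valid only while the true difference between the two candidates stays below 2^(width-1).

```python
def acs(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state with two incoming branches.

    Returns (surviving path metric, decision), where decision 0 selects
    branch a and decision 1 selects branch b. Ties go to branch a.
    """
    cand_a = pm_a + bm_a  # add
    cand_b = pm_b + bm_b
    if cand_a <= cand_b:  # compare
        return cand_a, 0  # select
    return cand_b, 1


def acs_modulo(pm_a, bm_a, pm_b, bm_b, width):
    """Same operation with metrics stored modulo 2**width (hypothetical sketch)."""
    mask = (1 << width) - 1
    cand_a = (pm_a + bm_a) & mask  # additions are allowed to wrap around
    cand_b = (pm_b + bm_b) & mask
    diff = (cand_a - cand_b) & mask
    # In two's complement, a set MSB of the difference means cand_a < cand_b.
    a_wins = diff == 0 or diff >= (1 << (width - 1))
    return (cand_a, 0) if a_wins else (cand_b, 1)


if __name__ == "__main__":
    print(acs(3, 1, 2, 0))                # (2, 1)
    print(acs_modulo(15, 2, 15, 0, 4))    # 17 wraps to 1, yet branch b (15) is kept
```

In the second call the true candidates are 17 and 15; the 4-bit register stores 17 as 1, and the sign-based comparison still picks the smaller true metric.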

5. Survivor sequence update and storage unit:

5.1. Design Space
The survivor sequence update and storage unit (briefly called the survivor memory unit, SMU) receives decisions from the PMU and produces the decoded sequence. Figure 7 shows the design space for the SMU implementation. The implementation techniques for this unit can be divided into two classes: standard and adaptive. TBD determines the number of memory access operations required by the SMU. The value of the TBD parameter depends on the transmission channel conditions (or the channel model used in the simulation). On noisy channels, trace-back paths need longer to merge, so TBD values become larger (since a Viterbi decoder is usually designed for the worst case). On the other hand, with a fixed trace-back depth and good channel conditions (higher SNR), the SMU may run redundant memory access operations, since paths tend to merge faster. Adaptive SMU implementations are used to reduce the number of these redundant accesses.

Fig.7. Design space for SMU


There are a number of adaptive techniques mentioned in the literature. Although they all require additional hardware in the SMU, the average power dissipation of the unit can be reduced owing to the lower number of memory accesses. In [12], [13], the adaptive techniques known as the M-algorithm and T-algorithm are discussed; these reduce the number of states to consider by eliminating states with too large a path metric, in effect dynamically adapting the number of ACS operations to calculate and trace-back paths to store. This idea is further developed in [14] as adaptive approximation by adding a pruning unit to the decoder's core. The pruning unit limits the number of trace-back paths to follow, again aiming at reducing power consumption. Moreover, an SMU with dynamic prediction combined with the trace-forward technique is introduced in [15] for Viterbi decoders used in wireless applications, which reduces the number of memory accesses by more than 70% in certain cases.

It is quite difficult to give general guidelines for the use of adaptive techniques in the SMU, since their positive effects are very channel- and application-dependent (as mentioned in all references). For some designs, a very noisy channel can even lead to data loss in the decoder, so additional hardware is required for recovery [12]. In addition, the reported results are obtained for AWGN channels, so the claimed power saving figures are questionable if, e.g., wireless channels with multipath and fading are considered. That is why the scope of this paper is limited to standard nonadaptive SMU designs. Nevertheless, for Viterbi decoder designs with a strong focus on power reduction, an adaptive SMU is definitely something to consider, provided the designer is willing to spend some time simulating the decoder with proper channel models. A detailed overview of adaptive trellis decoding techniques can be found in [16].

Fig.8. Simple register exchange network corresponding to the code introduced in Figure 2.

For a standard (nonadaptive) Viterbi decoder, there are two major techniques to implement the SMU: register exchange (RE) and trace-back (TB). RE uses a set of multiplexers and registers to store and update the survivor sequence of each state for trace-back depth (TBD) trellis stages. It is highly regular, area efficient and has low latency. However, due to frequent switching of states in the registers (bits are physically shifted from one stage to the next), it consumes a lot of power, especially for larger values of K and TBD. The structure of the RE network mirrors the trellis of the convolutional code, as shown in Figure 8 for a simple K=3 code. Since it can be assumed that all survivor paths have merged after TBD cycles, the output of any register from the last stage of the RE network can be used as the output of the Viterbi decoder.
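One stage of a register-exchange network for a K=3 code can be modeled as below. This is a behavioral sketch under stated assumptions rather than the paper's hardware: a state is taken to be the two stored bits with the newest bit in the upper position, and `decisions[s]` is a hypothetical encoding in which the decision bit supplies the lower bit of the surviving predecessor state.

```python
def register_exchange_step(registers, decisions, tbd):
    """Advance a K=3 register-exchange network by one trellis stage.

    registers[s]: survivor input bits for state s, oldest bit first
    decisions[s]: 0/1 from the PMU, selecting which predecessor of s survives
    Returns (new_registers, output_bit); output_bit is None until the
    registers hold more than `tbd` bits.
    """
    new_registers = []
    for s in range(4):
        # Both predecessors of s share the upper bit (s & 1); the decision
        # supplies their lower bit (assumed state encoding).
        pred = ((s & 1) << 1) | decisions[s]
        # Copy the survivor of the chosen predecessor and append the input
        # bit that leads into state s, which is the upper bit of s.
        new_registers.append(registers[pred] + [s >> 1])
    output_bit = None
    if len(new_registers[0]) > tbd:
        # After TBD stages the survivors are assumed to have merged, so the
        # oldest bit of any register can serve as the decoder output.
        output_bit = new_registers[0][0]
        new_registers = [r[1:] for r in new_registers]
    return new_registers, output_bit


if __name__ == "__main__":
    regs = [[], [], [], []]
    for column in ([0, 0, 0, 0], [1, 0, 1, 1], [0, 0, 0, 0]):
        regs, out = register_exchange_step(regs, column, tbd=2)
    print(regs, out)  # [[0, 0], [1, 0], [0, 1], [1, 1]] 0
```

Every stage copies whole registers between states, which is the switching activity the paragraph above identifies as the source of the high power consumption.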

Fig.9. General trace-back scheme [18].


The TB technique stores the decisions in an SRAM and reads them in reverse order during the trace-back cycle. As the memory is updated much less frequently than the registers in the RE technique, the power consumption is significantly reduced. TB is split into three major operations: write, merge, and decode, as shown in Figure 9. If a survivor path is traced back for TBD stages, then the state at which all paths merge can be determined. Symbols traced back beyond this state can be used as decoded output. In each cycle, one column of decisions is written into memory and M columns are read for trace-back and decode. Pointers are used to keep track of the columns read. The 1-pointer technique uses a single pointer multiple times to read M columns, whereas the M-pointer technique uses multiple pointers to read the M columns. The M-pointer technique requires inefficient and expensive multi-port or multibank memory, so it is rather unsuitable for a VLSI implementation. Pre-traceback (PTB) [4] is a technique used to reduce the memory access frequency. Instead of writing decisions at each stage processed by the PMU, m stages are pre-traced and written as one composite decision. Thus, only one column of write operations is required for every m trellis stages processed by the PMU. The pre-trace is implemented by an RE network of typically 3 or 4 stages. Note that, especially for small values of TBD, some trace-back techniques such as trace-forward can show lower BER performance than conventional trace-back [16].

Throughput and area figures for the SMU can be calculated quite easily. For RE and TB, the critical path delay is determined by the flip-flop toggle rate and the SRAM access time, respectively. The area for both increases linearly with TBD and exponentially with K:

$$SMU_{Area} = A \cdot TBD \cdot B \cdot 2^{K-1}$$

where A is the number of trace-back memory blocks (4 with the straightforward single-port memory approach, 3 with folded read-write access, 2 if trace-forward is used, 1 for RE), and B is the area of a flip-flop plus multiplexer for RE or the area of an SRAM bit for TB.

PERFORMANCE            EXISTING    PROPOSED
DELAY (ns)             14.47       4.250
CLOCK (MHz)            70          239.294
AREA (no. of LUTs)     -           21
POWER (mW)             117         81

Table.1. Design Comparison Of Synthesis Of Radix-2:
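To tie the three units together, the trace-back technique of Section 5.1 can be sketched end to end for the K=3 code with generator polynomials 111 and 101. This is a simplified behavioral model and not the synthesized design: it uses hard-decision Hamming metrics, writes one decision column per stage, and traces back once over the whole block instead of using a sliding window of TBD stages. The function names and the state encoding (newest stored bit in the upper position) are assumptions.

```python
def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def viterbi_traceback(received):
    """Hard-decision Viterbi decoding of a flat list of rate-1/2 coded bits."""
    n_states = 4
    inf = float("inf")
    pm = [0, inf, inf, inf]  # assumption: the encoder starts in state 0
    decision_memory = []     # "write": one column of decisions per stage
    for t in range(0, len(received), 2):
        r0, r1 = received[t], received[t + 1]
        new_pm, column = [], []
        for s in range(n_states):
            b = s >> 1  # input bit that leads into state s
            candidates = []
            for d in (0, 1):
                pred = ((s & 1) << 1) | d  # predecessor selected by decision d
                window = (b << 2) | pred   # input bit followed by stored bits
                # Branch metric: Hamming distance to the expected codeword.
                bm = (parity(window & 0b111) ^ r0) + (parity(window & 0b101) ^ r1)
                candidates.append(pm[pred] + bm)
            d = 0 if candidates[0] <= candidates[1] else 1  # compare-select
            new_pm.append(candidates[d])
            column.append(d)
        pm = new_pm
        decision_memory.append(column)
    # "Trace-back" and "decode": start from the best final state and follow
    # the stored decisions in reverse order.
    state = min(range(n_states), key=lambda s: pm[s])
    decoded = []
    for column in reversed(decision_memory):
        decoded.append(state >> 1)
        state = ((state & 1) << 1) | column[state]
    decoded.reverse()
    return decoded


if __name__ == "__main__":
    codeword = [1, 1, 1, 0, 0, 0, 0, 1]  # encoding of the message 1 0 1 1
    corrupted = [0] + codeword[1:]       # first coded bit flipped
    print(viterbi_traceback(codeword))   # [1, 0, 1, 1]
    print(viterbi_traceback(corrupted))  # [1, 0, 1, 1]
```

The decision memory here grows with the block length; a hardware SMU would instead keep only the most recent TBD columns, as described in Section 5.1.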

Fig.10. (a) Encoder output waveform; (b) Decoder output waveform


5.2. Applications
1. Bit error detection in bulk storage devices.
2. Error detection and correction in wireless data communication.
3. Data error recovery in audio and image processing.

Conclusion: In this paper, a new Viterbi decoder design is presented. The design is based on reconfigurable FPGA technology and exploits the parallel and pipelined features of the hardware resources. The existing algorithm is redesigned in an HDL; simulation, synthesis and implementation (translation, mapping, place and route) are done with FPGA-based EDA tools. The overall system performance is improved in terms of area and speed.

References
[1] H. D. Lin and D. G. Messerschmitt, “Algorithms and architectures for concurrent Viterbi decoding,” in Proc. IEEE Int. Conf. Commun., Jun. 1989, pp. 836–840.
[2] G. Fettweis and H. Meyr, “High rate Viterbi processor: A systolic array solution,” IEEE J. Sel. Areas Commun., vol. 8, no. 8, pp. 1520–1534, Oct. 1990.
[3] G. Feygin, G. Gulak, and F. Pollar, “Survivor sequence memory management in Viterbi decoder,” in Proc. IEEE Int. Sym. Circuits Syst., Jun. 1991, vol. 5, pp. 2967–2970.
[4] P. J. Black and T. H.-Y. Meng, “Hybrid survivor path architectures for Viterbi decoders,” in Proc. Int. Conf. Acoust. Speech, Signal Process., 1993, vol. 1, pp. 433–436.
[5] M. Boo, F. Arguello, J. D. Bruguera, R. Doallo, and E. L. Zapata, “High-performance VLSI architecture for the Viterbi algorithm,” IEEE Trans. Commun., vol. 45, no. 2, pp. 168–176, Feb. 1997.
[6] K. K. Parhi, “An improved pipelined MSB-first add-compare select unit structure for Viterbi decoders,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 504–511, Mar. 2004.
[7] Y. Engling, S. A. Augsburger, W. R. Davis, and B. Nicolic, “A 500-mb/s soft-output Viterbi decoder,” IEEE J. Solid-State Circuits, vol. 38, no. 7, pp. 1234–1241, Jul. 2005.
[8] G. Fettweis and H. Meyr, “High speed parallel Viterbi decoding: Algorithm and VLSI architecture,” IEEE Commun. Mag., vol. 29, no. 5, pp. 46–55, May 1991.
[9] H. M. H. Dawid and G. Fettweis, “A CMOS IC for gb/s Viterbi decoding: System design and VLSI implementation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 4, no. 1, pp. 17–31, Mar. 1996.
[10] R. D. Wesel, Wiley Encyclopedia of Telecommunications. New York: Wiley, 2003, ch. Convolutional Codes.
[11] A. Hekstra, “An alternative to metric rescaling in Viterbi decoders,” IEEE Trans. Commun., vol. 37, no. 11, pp. 1220–1222, Nov. 1989.
[12] M. H. Chang, W. T. Lee, M. C. Lin, and L. G. Chen, “IC design of an adaptive Viterbi decoder,” IEEE Trans. Consum. Electron., vol. 42, no. 1, pp. 52–62, Feb. 1996.
[13] E. Boutillon and L. Gonzalez, “Trace back techniques adapted to the surviving memory management in the m algorithm,” in Proc. Int. Conf. (ICASSP), Istanbul, Turkey, Jun. 2000, pp. 3366–3369.
[14] R. Henning and C. Chakrabarti, “An approach for adaptively approximating the Viterbi algorithm to reduce power consumption while decoding convolutional codes,” IEEE Trans. Signal Process., vol. 52, no. 5, pp. 1443–1451, May 2004.
[15] C.-C. Lin, Y. H. Shih, H. C. Chang, and C. Y. Lee, “Design of a power-reduction Viterbi decoder for WLAN applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1148–1156, Jun. 2005.
[16] M. Kamuf, “Trellis decoding: From algorithm to flexible architectures,” Ph.D. dissertation, Dept. Electrosci., Lund Univ., Lund, Sweden, Mar. 2007.
[17] Y. Gang, A. T. Erdogan, and T. Arslan, “An efficient pre-traceback architecture for the Viterbi decoder targeting wireless communication applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp. 1918–1927, Sep. 2006.
