16-Bit Viterbi Decoder Processor

Ashkan Borna

Mojtaba Mehrara

Robert Mullenix

Brian Pietras

Student Dept. of EECS University of Michigan





Figure 1 shows an encoder with a constraint length of 3 and generator polynomials ‘111’ and ‘101’ [1], written compactly as (3, 7, 5), where 7 and 5 are the octal forms of the two polynomials. We used this structure as the reference for designing our Viterbi decoder.

ABSTRACT

This report describes in detail the implementation of a Viterbi decoder on a 16-bit RISC microprocessor with a two-stage pipeline. We added extra modules and instructions to the baseline processor to optimize it as a Viterbi decoder. The Viterbi algorithm decodes a class of error-correcting codes called convolutional codes, which are widely used as channel coders in digital communication systems. We discuss our design choices and supplemental additions, as well as some of the pitfalls encountered.

Keywords Viterbi algorithm, convolutional codes, RISC processor, VLSI.

1. INTRODUCTION

The Viterbi algorithm is widely used in communication systems to extract the most probable bit sequence out of a transmitted data stream that has been encoded using convolutional codes. The algorithm computes the distance (Hamming distance for hard input data and Euclidean distance for soft input data) between the received data sequence and all possible sequences, and extracts the most probable one.
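As a small illustration of the two distance measures named above, the following Python sketch compares a received sequence with one candidate sequence. The function names are hypothetical, and the soft-input version uses the squared Euclidean distance, which ranks candidates in the same order as the Euclidean distance itself.

```python
def hamming_distance(received, expected):
    """Hard-input distance: number of positions where the bits differ."""
    return sum(r != e for r, e in zip(received, expected))


def squared_euclidean_distance(received, expected):
    """Soft-input distance: sum of squared differences between samples."""
    return sum((r - e) ** 2 for r, e in zip(received, expected))


if __name__ == "__main__":
    print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))
    print(squared_euclidean_distance([6, 1], [7, 0]))
```

The decoder picks the candidate sequence with the smallest accumulated distance.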

Figure 1. Encoder Structure

2.2 The Viterbi Algorithm

When a sequence of data is received from the channel, it is desirable to estimate the original sequence that was sent. Such a sequence can be identified using a diagram called a ‘trellis’ (Figure 2). The detection of the original stream can be described as finding the most probable path through the trellis.

In this project we have fit a soft-input Viterbi decoder with a constraint length of three and an input width of three bits into a 16-bit RISC microprocessor. The design goal was to minimize the memory usage while maintaining the decoding speed. The report is organized as follows. Section 2 presents a brief description of convolutional coding and Viterbi decoding. Section 3 gives an overview of the whole chip implementation along with our approach for implementing the Viterbi algorithm in the processor. Sections 4 and 5 describe testability features and pin placement, while Section 6 includes the timing analysis of the final design. Finally, Section 7 contains our final results.


2. BACKGROUND

2.1 Convolutional Encoder


In a convolutional encoder, the input is fed into a shift register and each output is the modulo-2 sum (XOR) of selected registers and the primary input. The number of registers in the encoder plus one is called the ‘constraint length’. A generator polynomial is assigned to each output as a structural specification of the encoder; it defines which registers are used to produce that specific output. For instance, a generator polynomial of ‘101’ indicates that the output is the sum of the primary input and the output of the second register. Each encoder can therefore be uniquely identified by its constraint length and generator polynomials.
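The encoder described above can be sketched in a few lines of Python. This is a minimal model rather than the authors' hardware: the function name `conv_encode` is hypothetical, and it assumes the leftmost polynomial bit corresponds to the primary input, consistent with the ‘101’ example in the text.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Rate-1/2 convolutional encoder; returns one (x, y) pair per input bit.

    The window holds [input, reg1, reg2] with the current input in the
    most significant position (an assumed bit ordering).
    """
    window = 0  # shift register contents, initially all zero
    encoded = []
    for bit in bits:
        # Shift the new bit in at the top; the oldest register falls off.
        window = (bit << (constraint_length - 1)) | (window >> 1)
        # Each output is the parity (XOR) of the taps its polynomial selects.
        encoded.append(tuple(bin(window & g).count("1") & 1 for g in generators))
    return encoded


if __name__ == "__main__":
    print(conv_encode([1, 0, 1, 1]))
```

With the default (3, 7, 5) parameters, each input bit produces two output bits, so the code rate is 1/2.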

Figure 2. Trellis diagram for a (4,13,17) decoder [2]

In the trellis diagram each node corresponds to an individual state at a given time and indicates a possible pattern of recently received data bits. Each branch indicates the transition to a new state at the next timing cycle [1]. The transition from each stage to the next is defined by a state transition diagram, which is sometimes called the ‘trellis legend’. This legend is constructed according to the structure of the encoder. Figure 3 shows the legend that corresponds to our reference encoder.


lot in implementing the desired application. The arithmetic shift operations stand as the sole exception: due to the nature of our funnel shifter design, we essentially got them for free. This principle put an emphasis on minimizing area and reduced our core size dramatically, since we excised any extraneous features. The “RAM” block provides another example of this principle at work. Since the dataflow in our application is real-time and our algorithm only makes use of the register file, we do not need to save information into memory. Even the smallest RAM block available in the parts library would add unwanted delay and complexity in addition to doubling our chip size. The specifications, however, required us to implement the load and store operations. So we adapted a copy of our register file, connected a 4-to-16 decoder to it, and used it as a 16-word “RAM”, fulfilling both the requirements and our design philosophy.

Figure 3. Trellis legend for (3, 7, 5) decoder [3]

Each branch on the trellis has an assigned metric which represents the cost of passing through that branch to the next state. For a soft-input Viterbi decoder this value equals the Euclidean distance between the actual received bits and the expected branch data. The state metrics, or path metrics, are the accumulation of the branch metrics along the most probable path arriving at a specific state. Section 3.3 explores these operations in more detail. After building up the trellis, one of two approaches, called ‘trace back’ and ‘register exchange’, may be used to decode the data. In the register exchange method, a register is assigned to each state and records the decoded output sequence along the path from the initial state to the final state. At the last stage, the decoded output sequence is the one stored in the survivor path register assigned to the state with the minimum path metric [4]. The trace back approach needs less computation than register exchange, but since register exchange is faster with our approach and requires less memory, we chose it for the decoder. Ideally the sequence along the final survivor path is valid only when all data in a given sequence has been received and the whole trellis has been constructed. In practice, however, beyond a certain depth in the trellis all survivor paths originate from the same state. This depth is called the trace back length, and it equals five times the constraint length.

3.2 Chip Architecture

3. CHIP OVERVIEW

3.1 Design Considerations

We used the one-master, 16-slave-latch design as the template for the register file. Although this approach made driving the bus difficult in the absence of read signals, we addressed this problem by placing keepers on the data lines. Several of the Viterbi instructions required three distinct registers: two source registers and one destination. The normal baseline instructions use one of the source registers as the destination register. We used a mux to selectively couple and decouple one of the read ports with the write port. This allowed us to maintain normal functionality during baseline instructions, but expand when executing the Viterbi instructions.

To accomplish Viterbi decoding while still preserving the flexibility of a general-purpose processor, we designed and laid out the base architecture that was provided and supplemented it with a module that added all of the functions Viterbi decoding required. We also support load and store instructions through a 16-word RAM block based on our register file. As mentioned before, we did not require additional memory, since the Viterbi algorithm makes efficient use of our register space. We implemented the suggested two-stage pipeline, fetching each instruction and then executing it. A deeper pipeline could possibly have increased our performance, but it would have added a greater degree of complexity. In the next few sections, we outline individual components.

3.2.1 Datapath

Our fully custom datapath incorporates the register file, ALU, shifter, and enough multiplexers (muxes) and tri-state buffers to control the flow of data from one unit to another. Although we designed each component independently, we adjusted input and output ports as their interactions revealed more accurate knowledge about timing and load constraints. Section 6 details the timing specifics.

Throughout our design, we kept a number of principles in mind to guide us when making critical decisions. We decided not to design around maximizing performance or minimizing power consumption, both traditional and highly desirable VLSI objectives. In picking one, a team must ultimately sacrifice the other, in addition to other important parameters. Oftentimes the marginal gains in speed or power savings come at a tremendous cost elsewhere. Instead, we based our decisions around providing an acceptable balance between complexity, performance, power, and development time.

The adder design afforded us the most flexibility and range in possible implementations. After an examination of the descriptions and trade-offs between adder families, we decided on the Variable-length (square-root) Carry Increment Adder [6]. It closely resembles the Carry Select Adder, improving it with a number of logic optimizations (mainly using propagate-generate logic versus full-adder logic). During the layout, we learned that

Finally, we always kept our application at the forefront of our minds when making design choices. We decided not to spend time implementing features that would not benefit Viterbi decoding. As a result, we did not implement most of the instructions in the included Instruction Set Architecture (ISA) above the required minimum, but we added several new instructions that helped us a


removing the multipliers needed for doing the squares. If we use seven bits for the branch metric, the normalization operation, which is discussed in the next section, occurs quite often, and this reduces the accuracy of decoding. With four bits for each branch metric, normalization occurs far less often, which improves the BER. Figure 9 shows the results of the Matlab simulation for the Matlab Viterbi decoder, our decoder with the Euclidean distance branch metric, and our decoder with the simplified branch metric.

the non-uniform length made custom implementation very time-consuming with only marginal perceived performance gains. In hindsight, we probably should have picked the fixed-length design. For our shifter, we used the funnel shifter design described in [6]. We picked this shifter because of its simple, intuitive layout and flexible functionality. We achieved logical and arithmetic right and left shifts just by using a 4-to-1 mux on the input. Although the Viterbi algorithm did not require arithmetic shifts explicitly, we gained this functionality with almost zero additional work. During the datapath construction, we noticed poorer-than-expected performance from our components. Our muxes did not do an adequate job of passing the data with minimal delay. We spent some time investigating different mux designs until we settled on a six-transistor circuit (a PMOS, an NMOS, and two inverters as output buffers) that met our needs. It had the minimum delay of the designs we tested, and we could fit two of them inside our bit-slice width of 73.5 lambda.
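To illustrate why the arithmetic shifts come almost for free in the funnel shifter mentioned above, the following Python sketch models a generic funnel shifter as a window selected from two concatenated words. It is a behavioral model under stated assumptions, not the authors' layout: the function names and the string-valued `mode` argument are hypothetical stand-ins for the input mux selection.

```python
def funnel_shift(high, low, offset, width=16):
    """Select a width-bit window starting at bit `offset` of {high, low}."""
    mask = (1 << width) - 1
    combined = ((high & mask) << width) | (low & mask)
    return (combined >> offset) & mask


def shift(value, amount, mode, width=16):
    """Build each shift type only by choosing what feeds the funnel."""
    mask = (1 << width) - 1
    value &= mask
    if mode == "left":
        # Value in the high half, zeros below; window moves down from the top.
        return funnel_shift(value, 0, width - amount, width)
    if mode == "logical_right":
        return funnel_shift(0, value, amount, width)
    if mode == "arithmetic_right":
        # Fill the high half with copies of the sign bit.
        sign_fill = mask if value >> (width - 1) else 0
        return funnel_shift(sign_fill, value, amount, width)
    raise ValueError(f"unknown shift mode: {mode}")


if __name__ == "__main__":
    print(hex(shift(0x8001, 1, "arithmetic_right")))
```

The only difference between the logical and arithmetic right shifts is the value fed into the upper half, which is why supporting both costs little more than one extra mux input.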

The bmu computes X + Y, (~X) + (~Y), (~X) + Y, and X + (~Y) and stores them in the destination register. X and Y are the three-bit soft inputs to the decoder and come directly from the IO registers; ~X and ~Y are their inverted values. Since we used three bits for the soft input, we allocated four bits to each branch metric container and used one 16-bit word to store all of them.
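The four sums computed by the bmu can be modelled as follows. This is a sketch only: the function name `bmu_word` is hypothetical, and the placement of each metric within the 16-bit word (first sum in the lowest nibble) is an assumption, since the text does not specify the packing order.

```python
def bmu_word(x, y):
    """Pack the four 4-bit branch metrics for 3-bit soft inputs x and y.

    Assumed packing, lowest nibble first:
    X + Y, (~X) + (~Y), (~X) + Y, X + (~Y).
    """
    if not (0 <= x <= 7 and 0 <= y <= 7):
        raise ValueError("soft inputs must be 3-bit values")
    nx, ny = x ^ 0b111, y ^ 0b111  # 3-bit inversion
    metrics = (x + y, nx + ny, nx + y, x + ny)
    word = 0
    for i, metric in enumerate(metrics):
        word |= metric << (4 * i)  # each sum is at most 14, so 4 bits suffice
    return word


if __name__ == "__main__":
    print(hex(bmu_word(3, 5)))
```

Because each input is at most 7, every sum is at most 14, which is why four bits per branch metric are enough.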

3.2.2 Controller

At each stage of the trellis we need to add the previous path metrics to the branch metrics of the current stage according to the trellis legend (Figure 3), and update the path metric assigned to each state. To update the path metric we perform one add-compare-select operation for each state. This operation adds two previous path metrics to the current branch metrics, compares the two values entering each state, and selects the minimum as the winner.

3.3.3 Path Metric Unit

In addition to the logic that determined the datapath and Viterbi control signals, the controller contained the Instruction Register (IR), the Program Counter (PC), and the Program Status Register (PSR) bits. Since we didn’t need an entire word for the PSR, we broke up the condition codes into individual registers, although the scan chain connected them in the same order as the ISA specification detailed. The condition codes were set during Addition, Subtraction, Comparison, and certain Viterbi instructions. For Addition, Subtraction, and Comparison, the output of the operation, regardless of whether it was written back, determined the value of the condition codes. For example, a negative 2’s complement result would set the N register, while an overflow would set the F register. The Viterbi instructions treated the condition code registers differently than the specification detailed.

We have a ‘pmu’ instruction in our ISA and we run it twice in the assembly program for each decoding cycle. The first run computes the path metrics of states 00 and 01 from the previous path and branch metrics related to states 00 and 10. The second one does the same thing for states 10 and 11 from states 01 and 11. It is obvious that we have a butterfly structure between two successive stages when we perform the path metric computation. This structure is shown in Figure 4. We perform a swap operation among path metrics later on to implement this butterfly.
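The add-compare-select step behind the pmu instruction can be sketched in Python as below. The sketch assumes the state numbering implied by the text (the new state is the previous state shifted left with the input bit appended, so states 00 and 01 are reached from 00 and 10) and the (3, 7, 5) encoder outputs; the function names and the dictionary of branch metrics are hypothetical, and one call handles all four states rather than two per instruction.

```python
def expected_output(state, u):
    """Encoder output pair for input bit u leaving `state` (bits: older, newer)."""
    older, newer = (state >> 1) & 1, state & 1
    return (u ^ newer ^ older, u ^ older)  # generators 111 and 101


def acs_stage(path_metrics, branch_metrics):
    """One add-compare-select pass over the four states.

    branch_metrics maps an expected output pair to its cost.
    Returns the new path metrics and the winning previous state per state.
    """
    new_metrics, winners = [0] * 4, [0] * 4
    for new_state in range(4):
        u = new_state & 1                  # input bit that leads to new_state
        low = new_state >> 1               # shared low bit of both predecessors
        candidates = [
            (path_metrics[prev] + branch_metrics[expected_output(prev, u)], prev)
            for prev in (low, low | 2)
        ]
        new_metrics[new_state], winners[new_state] = min(candidates)
    return new_metrics, winners


if __name__ == "__main__":
    bm = {(0, 0): 0, (1, 1): 14, (1, 0): 7, (0, 1): 7}
    print(acs_stage([0, 10, 10, 10], bm))
```

States 00 and 01 share the predecessors 00 and 10, and states 10 and 11 share 01 and 11, which is the butterfly pairing described above.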

For the most part, the controller determines the next state logic on the positive edge of the clock. We could not, however, get the Next_PC register to set correctly unless we used the negative edge. This had a detrimental effect on our final clock speed, since that forced the branch instruction to finish in half a cycle.

As stated above, the bit width of each branch metric container is four bits. The minimum bit width for the path metrics is five bits, and since we fit two path metrics in one register, we use eight bits for each path metric. Due to the cumulative nature of the path metric computation, these values tend to overflow after several stages of computation, so it is important to normalize them. There are several approaches to path metric normalization [5]. Among them, Fixed Shift normalization fits our design well. In this method we use nine bits to perform the addition for each path metric in pmu and, depending on the possibility of overflow in the next stage, we choose either the eight most significant or the eight least significant bits as the final value of the path metric. When there is a possibility of overflow in one of the metrics, all of them are shifted right by one bit. To keep track of overflow detection, we used the C, N, Z and L flags in the PSR. These are set by pmu instructions and are checked afterwards to perform shift operations when necessary.
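The Fixed Shift normalization described above can be sketched as follows. This is an illustration under an assumption: the text does not state the exact overflow condition, so the `threshold` parameter (here, any 9-bit sum with bit 7 or bit 8 set) is hypothetical, as is the function name.

```python
def fixed_shift_normalize(sums9, threshold=128):
    """Reduce 9-bit path metric sums to 8 bits.

    If any sum reaches `threshold` (an assumed overflow-risk condition),
    every metric keeps its eight most significant bits, i.e. all are
    shifted right by one; otherwise the eight least significant bits
    are kept unchanged.
    """
    if any(not 0 <= s < 512 for s in sums9):
        raise ValueError("sums must fit in nine bits")
    shifted = any(s >= threshold for s in sums9)
    if shifted:
        return [s >> 1 for s in sums9], True
    return [s & 0xFF for s in sums9], False


if __name__ == "__main__":
    print(fixed_shift_normalize([10, 130, 40, 300]))
```

Shifting all metrics together preserves their ordering up to rounding, which is what the compare-select step depends on.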

3.3 Viterbi Decoding Implementation

3.3.1 Viterbi Decoder Components

In order to fit the decoding algorithm in our chip, we decided to implement some extra modules in Verilog. We based our design choices on the simulation results of a Viterbi decoder which we implemented in Matlab. Later on, we used Matlab to fully verify the assembly program that we had developed to perform the decoding using our chip’s extra features and instructions.

3.3.2 Branch Metric Unit

This module computes the metrics of the branches at each stage of the trellis diagram. The ‘bmu’ instruction in our processor performs the branch metric computation at each decoding stage based on the inputs. As stated before, Euclidean distance is traditionally used as the branch metric for soft input data types. But during our Matlab simulations, we discovered that using the regular distance without the squares improved both the total Bit Error Rate (BER) and the Viterbi module area. The improvement in area is caused by


[Figure 4 image: PMU butterflies connecting path metrics PM0, PM1, PM2, and PM3 between two successive stages.]

Figure 4. Butterfly structure in computing path metrics [3]

Instruction | Description
BMU Rdest | Computes four 4-bit branch metrics and stores them in Rdest.
PMU Rsrc1,Rsrc2,Rdest | Adds the proper branch metrics in Rsrc1 to the previous path metrics in Rsrc2, compares them, and puts the minimum in Rdest. Each PMU instruction computes two 8-bit path metrics.
Swap1(2) Rsrc1,Rsrc2,Rdest | Copies one of the computed path metrics of Rsrc1 and Rsrc2 and puts them in Rdest to prepare the path metrics for the next stage computation.
Hs Rsrc,Rdest (half shift) | Shifts the higher and lower 8 bits of Rsrc and copies the result into Rdest. Used for normalization of the path metrics.
Hcmp1(2,3) Rsrc, Rdest | Compares the higher and lower 8-bit values in Rsrc and copies the minimum into Rdest. Used for finding the minimum metric at each stage in order to send out data from the corresponding path register.

Table 1. Extra instructions

3.3.4 Register Exchange

During this operation, the path related to each state is updated. Since the trace back length is fifteen for our decoder, each path can be stored in a 16-bit register in the register file. At each stage the path related to the previous winning state is shifted left by one bit and the bit on the winner branch is inserted at bit position zero. This path is then copied to the register assigned to the current state. At the same time, the last bit of the register that corresponds to the path with the minimum path metric is sent to the output port as the decoded data. We used the processor’s shift instruction to complete this stage, so it required no extra modules.
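The register exchange update of Section 3.3.4 can be sketched in Python as follows. The sketch makes several assumptions that the text does not confirm: states are numbered so that the low bit of the new state is the input bit on the winning branch, the decoded bit is read from bit 15 of the updated register, and the function names are hypothetical.

```python
def register_exchange(path_regs, winners, width=16):
    """Update the survivor path register of every state for one stage.

    winners[s] is the index of the winning previous state for new state s.
    The winner's path is shifted left by one and the branch's input bit
    (assumed to be the low bit of s) is inserted at bit position zero.
    """
    mask = (1 << width) - 1
    return [
        ((path_regs[prev] << 1) | (new_state & 1)) & mask
        for new_state, prev in enumerate(winners)
    ]


def decoded_bit(path_regs, path_metrics, traceback=15):
    """Oldest bit of the register whose state has the minimum path metric."""
    best = min(range(len(path_metrics)), key=path_metrics.__getitem__)
    return (path_regs[best] >> traceback) & 1


if __name__ == "__main__":
    regs = register_exchange([0b0, 0b1, 0b10, 0b11], [0, 2, 1, 1])
    print([bin(r) for r in regs])
```

Because every register is rewritten from a copy of its winner, the update must read all old registers before writing any new one, which the list comprehension above does implicitly.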

Our additional instructions save us twenty-two cycles for each decoding stage. We also save about twenty instructions because most of our registers hold an operand in both the upper and lower 8 bits. This means that we do not have to load from and store to RAM in our implementation. Overall, these savings effectively double our throughput.

3.5 Our Custom Serial Interface

Our chip implements a serial connection with a complete handshake to the outside world. The external device asserts the reset and do_viterbi signals to begin. At the rising edge of do_viterbi, our chip asserts a send_data signal to the external device to indicate that it is ready to receive data. The device then provides the data on opx_serial and opy_serial and asserts data_in_valid for three clock cycles to build up the 3-bit registers inside the Viterbi module. From then on, our chip asserts send_data after each BMU command, receives opx and opy, and stores them for the next BMU. Sixteen decoding cycles after the assertion of do_viterbi, our chip starts outputting data and asserting data_out_valid at the end of each decoding stage, which takes approximately forty-three cycles. To prevent processing without proper data, the controller stalls if it reaches a BMU instruction while the BMU_ready signal is low. When the outside world is done sending data, it drops the do_viterbi signal low so the chip knows to simply output the remaining sixteen bits of data.

Figure 5 shows the program flow for decoding the input data.

4. TESTABILITY

Figure 5. Viterbi decoder program flow

We provided a scan chain in order to access important control registers. When an external device asserts the scan_en signal via an input pin to our chip, both the processor and the controller stall normal operations. Figure 6 shows the scan path through the PSR, IR, and PC registers. On every clock cycle, data from the scan_in input pin is stored in the N register, the value previously in the N register moves to the L register, and so on, with the value in the MSB of the PC appearing on the scan_out output pin.
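The shifting behavior of the scan chain can be modelled with a short Python sketch. The chain is represented as a list whose first element is the register nearest scan_in (the N register) and whose last element is the register nearest scan_out (the MSB of the PC); the function name and the four-element chain in the example are hypothetical, and the real chain length is not given in the text.

```python
def scan_shift(chain, scan_in_bits):
    """Clock the scan chain once per input bit while scan_en is asserted.

    Returns the new chain contents and the bits observed on scan_out.
    """
    chain = list(chain)
    scan_out_bits = []
    for bit in scan_in_bits:
        scan_out_bits.append(chain[-1])  # last register drives scan_out
        chain = [bit] + chain[:-1]       # every value moves one place along
    return chain, scan_out_bits


if __name__ == "__main__":
    print(scan_shift([1, 0, 1, 1], [0, 0, 0, 0]))
```

Shifting in as many bits as the chain is long reads out the entire previous state while loading a new one.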

3.4 Extra Instructions

We added five new types of instruction; a few instructions have extra variations. Some of our instructions require two source registers and a destination register, so we used the opcodes 0x6xxx and 0xAxxx to handle these. The controller uses a mux to determine which bits of the opcode to use as the destination register based on the instruction. The new instructions are listed in Table 1.

[Figure 6 image: scan path scan_in -> PSR -> IR -> PC -> scan_out, LSB -> MSB for all modules.]

Figure 6. Scan chain.

Upon lowering the scan_en signal, the processor resumes executing the instruction stored in the IR and advancing the PC from its current value. The reset signal restarts the processor by clearing the PC and other registers and loading the first instruction stored in the ROM into the IR.

Component | Isolated Delay | Integrated Delay
Register File Writes | 0.7835 ns | 0.8646 ns
Register File Reads | 0.6744 ns | 0.7224 ns
ALU | 2.4154 ns | 2.3589 ns
Shifter | 0.290 ns | 0.6709 ns

Table 2. Component delays, isolated and integrated.

The critical path through the datapath occurred during a subtraction. The XOR gates that inverted the Rsrc inputs did a poor job of driving the signal through the ALU.

5. PINS

Our processor uses twenty-six pins. I/O requires twelve pins, leaving the remaining fourteen for power and ground. Figure 7 shows the I/O placement around our chip with respect to the major internal components. Five pins serve testability and overall system control/synchronization (scan_en, scan_in, scan_out, reset, and clk). We use one pin to start Viterbi decoding (do_viterbi), three pins for validation/handshaking (send_data, data_out_valid, and din_valid), two pins that provide input to the Viterbi module (opx_serial and opy_serial), and one pin (data_out) that carries the recovered signal.

We found that the critical instruction, however, was the Branch. Because the controller required the value of the next PC to be ready at the negative edge of the clock, we needed to resolve the Branch in a half cycle. The Branch instruction took 4.75 ns to complete, which brought our minimum clock period to 9.5 ns. We set our final clock frequency to 105 MHz, which yields a throughput of about 2.45 Mbps for the Viterbi decoder, since it takes around forty-three instructions to decode each bit. We tried to remove the artificial and constricting negative-edge requirement of the branch instruction, but could not find an adequate solution. Had we resolved that issue, the Viterbi PMU instruction, at 7.4 ns, would have been the critical instruction, allowing us to increase the clock frequency to 133 MHz, an increase of 26%.
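The clock and throughput figures quoted above follow from simple arithmetic, reproduced in the Python sketch below. The function names are hypothetical; the inputs (a 4.75 ns half-cycle branch, about forty-three instructions per decoded bit, and a 7.4 ns PMU instruction) come from the text, and one instruction per clock cycle is assumed.

```python
def max_clock_mhz(period_ns):
    """Highest clock frequency, in MHz, for a given minimum period in ns."""
    return 1000.0 / period_ns


def throughput_mbps(clock_mhz, instructions_per_bit):
    """Decoded bits per microsecond, assuming one instruction per cycle."""
    return clock_mhz / instructions_per_bit


if __name__ == "__main__":
    branch_period_ns = 2 * 4.75                    # branch must fit in half a cycle
    print(round(max_clock_mhz(branch_period_ns), 1))   # upper bound with the branch limit
    print(round(throughput_mbps(105, 43), 2))          # at the chosen 105 MHz clock
    print(round(max_clock_mhz(7.4), 1))                # upper bound if PMU were critical
```

The computed bounds sit slightly above the 105 MHz and 133 MHz values chosen in the text, which is consistent with leaving a small timing margin.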

[Figure 7 image: pads opx_serial, opy_serial, send_data, data_out, data_out_valid, do_viterbi, din_valid, reset, clk, scan_en, scan_in, and scan_out, together with the clean and dirty vdd/gnd pads, arranged around the RAM, Viterbi, ROM, controller (Ctrlr), and datapath (Dtpth) blocks.]

Figure 7. Pin Placement

6. TIMING ANALYSIS

We ran timing analysis on each component as we built it, determining the rise and fall times, worst-case delays, and critical paths. As the circuit became progressively more complex, we realized that testing each component in isolation would yield inaccurate results due to a number of factors such as input drive strength and output capacitance. So we went back and recalculated many of the delays. Table 2 highlights some of the differences we observed.

Figure 8. Final Layout of the chip


Figure 9. Output Bit Error Rate vs. SNR for one million samples

7. CONCLUSION

In this project we fitted a Viterbi decoder into a baseline 16-bit RISC processor by adding some extra features and instructions. Although we could have implemented the decoder in a separate module like an ASIC, our design fits well into the processor and uses the processor’s features to implement our application. To verify the final processor design, we wrote an assembly program that performed the decoding and compared the chip’s outputs with the Matlab outputs; the hardware implementation and the Matlab code matched completely. Figure 9 shows the results of Bit Error Rate measurements of our decoder in Matlab for one million samples and for different channel signal-to-noise ratios.

8. REFERENCES

[1] Forney, G.D., Jr., “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278 (March 1973)

[2] Afzali-Kusha, A., “IP Core Library Development for use in Digital System designs (viterbi block),” technical report, University of Tehran (May 2005)

[3] Mehrara, M., “FPGA implementation of a Turbo decoder using SOVA algorithm,” BS project report, Sharif University of Technology (June 2005)

[4] El-Dib, D.A., Elmasry, M.I., “Modified register-exchange Viterbi decoder for low-power wireless communications,” IEEE Transactions on Circuits and Systems, vol. 51, pp. 371-378 (Feb 2004)

[5] Shung, C.B., Siegel, P.H., Ungerboeck, G., Thapar, H.K., “VLSI architectures for metric normalization in the Viterbi algorithm,” Proc. IEEE Int. Conf. Communications (ICC’90), vol. 4, pp. 1723-1728 (Apr 1990)

[6] Weste, Neil H.E., and Harris, David, “CMOS VLSI Design,” 3rd ed., Addison-Wesley, Reading, MA (2004)