A Silicon Efficient High Speed L = 3 rate 1/2 Convolutional Decoder Using Recurrent Neural Networks Arto Rantala, Silja Vatunen, Timo Harinen, Markku Åberg VTT Electronics, P.O.Box 11012, FIN-02044 VTT, Finland E-mail:
[email protected]

Abstract

A silicon-efficient real-time approach to decoding convolutional codes is presented. The algorithm is a special recurrent neural network which needs no supervision. The standard solution for convolutional decoding has been the Viterbi algorithm, which is optimal, but the complexity of a Viterbi decoder increases exponentially as a function of the constraint length. The complexity of the utilized algorithm increases only polynomially, which makes it attractive for applications with a long constraint length. The algorithm requires massively parallel and fast computing, which is hard to achieve efficiently with standard digital logic. Novel floating-gate structures are used to perform highly parallel signal processing within a minimal silicon area. The silicon area of a decoder with constraint length 3 and rate 1/2 is only 950 x 450 µm2 using 0.35 µm CMOS. Measurements show that a BER of 0.06 can be obtained at a decoding speed of 1.25 MHz with an input signal having an SNR of 0 dB.
1. Introduction

A digital signal is easily corrupted during transmission. This causes errors in the transmitted data, and the error rate is usually expressed as a bit error rate (BER). The expected BER can be improved by coding the signal, e.g. with convolutional codes. The coding is easy to implement efficiently, but the decoding process needs a lot of computation. The complexity of the standard solution, the Viterbi algorithm, increases exponentially as a function of the constraint length, which makes it uneconomical for long constraint lengths. A new algorithm based on a recurrent neural network (RNN) for convolutional decoding has been presented [1]. The main benefit of the algorithm over Viterbi is that its complexity increases only polynomially, so it is expected to be more efficient than Viterbi when a long constraint length is used. The RNN algorithm requires a highly parallel implementation and efficient calculation methods for real-time decoding. In this study floating-gate MOS transistors are used to obtain a maximally parallel structure while keeping silicon area and power consumption low. A RNN
decoder having L = 3 (constraint length) and rate 1/2 has been investigated. A test chip has been designed in a 0.35 µm CMOS process. Measurement results show that the presented ideas work and that the performance of the decoder is quite good.
2. RNN algorithm

A new method to decode convolutional codes using recurrent neural networks (RNN) is presented in [1]. The decoding problem is not treated in detail here; only the results relevant to implementing the RNN decoder are given.
2.1 Original algorithm

The decoding problem is to find the bit sequence B which minimizes the function

    min_B  ∑_{s=0}^{T} ‖ r(t + s·dt) − γ(B(t + s·dt)) ‖²  =  min_B f(B)    (1)
where r(t + s·dt) is the received codeword at time t + s·dt, dt is the sampling interval, γ is the encoder and B(t + s·dt) is the corresponding bit vector. Using the gradient descent rule

    b(k) ← b(k) − α · ∂f(B)/∂b(k)    (2)
the structure for the neurons of the RNN decoder is obtained. By setting α properly, an artificial neuron can be formulated. The output of the kth neuron of the L = 3 rate 1/2 decoder can be expressed as

    Sk = fa( −Ik,1·Sk−2 + Ik,2·Sk−1·Sk−2 + Ik+1,2·Sk+1·Sk−1
             − Ik+2,1·Sk+2 + Ik+2,2·Sk+1·Sk+2 + AWGN )    (3)
The Ij,i's are external inputs, where i denotes the ith bit of the input word and j denotes the delay of the signal. The Sl's are the outputs of the neurons and fa is the activation function. AWGN is added white Gaussian noise, which prevents the system from getting stuck in local minima. The decoding operation of the RNN decoder can be expressed as the following sequence. For each received codeword r(0) at time t, do:
1. Fix S−(L−1),...,S−1 to the previous decisions and ST+1,...,ST+(L−1) to zeros. Apply the received codewords to the inputs.
2. Initialize S(i), i = 0,...,T.
3. Update each neuron k = 0,...,T using EQ (3).
4. Decrease the variance of the AWGN.
When the stopping criterion (e.g. a fixed number of iterations) is fulfilled, the state of neuron S0 gives the decoded bit. The size of the decoder, i.e. the number of neurons (T), sets the complexity of the decoder. The size was selected to be T = 5, a trade-off between decoder performance (achieved BER) and decoding speed.
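The iteration above can be sketched in Python. This is a hypothetical reconstruction, not the authors' implementation: the state layout, the linearly decreasing noise schedule and the parameter values are our assumptions, and the signals are kept bipolar as in the original algorithm rather than the quantized hardware version.

```python
import numpy as np

def decode_block(I, prev=(1.0, 1.0), T=5, n_iter=200, sigma0=1.0, seed=0):
    """One decoding pass of the RNN decoder (steps 1-4), bipolar signals."""
    rng = np.random.default_rng(seed)
    N = T + 1                                   # neurons S_0 ... S_T
    A = np.zeros(N + 4)                         # state array with boundary padding
    A[0], A[1] = prev                           # step 1: S_-2, S_-1 = previous decisions
    A[N + 2:] = 0.0                             #         S_{T+1}, S_{T+2} = 0
    A[2:N + 2] = rng.choice([-1.0, 1.0], N)     # step 2: initialize S(i)
    for it in range(n_iter):                    # step 3: update every neuron via EQ (3)
        sigma = sigma0 * (1.0 - it / n_iter)    # step 4: decrease the noise variance
        for k in range(N):
            s = lambda d: A[k + 2 + d]          # S_{k+d}, boundary values included
            u = (-I[k][0] * s(-2)
                 + I[k][1] * s(-1) * s(-2)
                 + I[k + 1][1] * s(1) * s(-1)
                 - I[k + 2][0] * s(2)
                 + I[k + 2][1] * s(1) * s(2)
                 + sigma * rng.standard_normal())
            A[k + 2] = 1.0 if u >= 0 else -1.0  # f_a: hard-limiting activation
    return A[2]                                 # decoded bit = state of neuron S_0
```

The input I holds the received codeword components Ij,i (rows j = 0,...,T+2, columns i = 1, 2); the decision of S0 is read out once the iterations finish.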
2.2 Methods to realize the algorithm

The decoder using the algorithm presented above can easily be implemented in software on a special signal processor. However, the required number of iteration loops makes real-time decoding impossible, at least at a reasonable cost and power consumption. It was estimated that a high-performance DSP processor (capable of 2400 MIPS) could decode at a speed of 8.8 kb/s when 800 iteration cycles are applied to a decoder having 17 neurons. A digital ASIC was considered in the first place, but some details, e.g. the variable variance of the noise generator, made this alternative less attractive. Moreover, the multiplications, summings and decision functions need several gates, and the delay in the signal becomes larger than with the topology presented in this paper. A mixed-signal ASIC using novel floating-gate circuit blocks was selected as the most efficient way to implement the algorithm.
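The 8.8 kb/s estimate can be checked with simple arithmetic. The figure of roughly 20 instructions per neuron update is our assumption, chosen so that the numbers agree; the paper does not state the per-neuron instruction cost.

```python
# Back-of-the-envelope check of the DSP throughput estimate:
# 2400 MIPS spread over 800 iteration cycles x 17 neurons per decoded bit,
# assuming ~20 instructions per neuron update (our assumption).
mips = 2400e6
iterations, neurons, instr_per_update = 800, 17, 20
rate = mips / (iterations * neurons * instr_per_update)  # bits per second
assert abs(rate - 8.8e3) / 8.8e3 < 0.01   # within 1 % of the quoted 8.8 kb/s
```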
2.3 Modified algorithm

The silicon realization of the RNN decoder using floating-gate transistors requires some modifications to the original algorithm. The input signals Ij,i are originally continuous-amplitude signals, so a quantization is needed. Multiplication with analog signals can be implemented, but the required delayed input vectors are not easily accomplished. Moreover, fully analog signal processing would draw more current than the selected mixed-signal strategy. The number of quantization levels is critical, since the amount of hardware doubles for each added bit (see the next chapter). The selected value of 8 levels (3 bits) is a trade-off between performance and cost (silicon area and power consumption). The performance with this value is near the optimum case (continuous-amplitude inputs), and the improvement obtained by adding extra bits is not significant [2].

White Gaussian noise (AWGN) is added to the summing node of the neuron (see EQ (3)). The noise added to the different neurons has to be uncorrelated, which means that several different noise sources are required. Moreover, the variance of the noise has to be varied during the iteration process. Noise sources with these properties cannot easily be implemented, so a semi-optimal solution was investigated. Matlab simulations with the original algorithm showed that a small correlation between the noise sources (less than 0.5) can be accepted, which makes the implementation of a multiple number of noise sources practically possible.

The signals in the original algorithm were continuous and bipolar, with a range of [−1, 1]. These signals were mapped to digital unipolar values {0, 1} and {000,...,111} for the 1- and 3-bit signals, respectively. The most important modification to the original algorithm was to let the neurons run free: no clocking or iteration cycles are applied to the neurons, but the decoder is allowed to stabilize to a certain state (a local optimum). If the duration and the properties of the added noise (the level and the decrease function) are properly chosen, the decoder hopefully finds the global minimum.
3. RNN decoder implementation

In this chapter the structure and functionality of the RNN circuit blocks are briefly explained. The following blocks are considered: the multiplier, the CSD-block (D/A conversion, summing, decision) and the AWGN noise generator.
3.1 Multiplier

A number of multiplications is needed to calculate the state of a neuron (see EQ (3)). The input signals Ij,i are 3-bit and the Sl's are 1-bit digital signals. It can easily be verified that each sum term is the result of one 3-times-1 multiplication, several 1-times-1 multiplications and a sign bit. With the above signal notations, an XNOR gate is the most efficient way to perform the multiplication (Fig. 1).

Figure 1. A schematic diagram of a 1-times-1-times-1 multiplier with XNOR gates

Each sum term in EQ (3) can easily be implemented with a few XNOR gates, and the sign of a sum term is changed with an XNOR gate having its other input tied to '0'. It should be noted that the resulting sum terms are 3-bit signals (3 XNOR multiplier blocks in parallel for each sum term).
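Why XNOR works as a 1-bit multiplier can be verified exhaustively: when the bipolar values {−1, +1} are mapped to logic levels {0, 1}, the XNOR of the two bits is exactly the mapped bipolar product. The helper names below are illustrative.

```python
def xnor(a, b):
    # 1-bit "multiplication" of sign bits: XNOR of the logic levels
    return 1 ^ (a ^ b)

def to_bit(x):       # map bipolar {-1, +1} to logic {0, 1}
    return (x + 1) // 2

def to_bipolar(b):   # inverse mapping {0, 1} to {-1, +1}
    return 2 * b - 1

# exhaustive check: XNOR of the mapped bits equals the bipolar product
for x in (-1, 1):
    for y in (-1, 1):
        assert to_bipolar(xnor(to_bit(x), to_bit(y))) == x * y
```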
3.2 A CSD-block

The core of the neuron is a CSD-block, which D/A-converts the sum terms, sums them together with an AWGN signal and finally makes a decision. The implementation of these functions is so closely tied together that the operation of the whole CSD-block must be considered. The CSD-block is constructed using floating-gate MOS transistors [3]. The basic idea of the floating-gate transistor is shown in Fig. 2. The MOS transistor has a floating gate (FG) and a multiple number of input
gates capacitively coupled to the floating gate, as shown in Fig. 2 (a).
Figure 2. (a) A schematic diagram of a floating-gate MOS transistor (b) A schematic cross-section of a floating-gate MOS transistor implemented with CMOS

The combined weight of the input couplings controls the channel current in the floating-gate MOS transistor. The potential of the floating gate can be expressed as

    φF = ( ∑_{i=1}^{n} Ci·Vi ) / ( ∑_{i=1}^{n} Ci )    (4)
where the Vi's are the input voltages. Thus the floating-gate potential is determined as a linear sum of all the input signals weighted by the capacitive coupling coefficients Ci / Ctotal. Fig. 2 (b) illustrates a schematic cross-section of the floating-gate MOS transistor implemented with a double-polysilicon CMOS process: the upper polysilicon layer forms the floating gate and the bottommost polysilicon layer forms the capacitively coupled input gates. The floating-gate transistor gives an ideal way of summing analog and digital signals, as required within the CSD-block. The floating-gate transistor can be e.g. the input device of the comparator, so the summing of the signals is performed in the same node that makes the decision. Clearly the signal delay through the neuron is minimized, and the decoding rate is high.
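EQ (4) is just a capacitively weighted average, which a few lines of code make concrete (the function name and example voltages are illustrative):

```python
def floating_gate_potential(V, C):
    # EQ (4): linear sum of the input voltages V_i weighted by C_i / C_total
    total = sum(C)
    return sum(c * v for c, v in zip(C, V)) / total

# two equal couplings simply average the two inputs
assert floating_gate_potential([0.0, 3.3], [1.0, 1.0]) == 1.65
```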
Figure 3. A schematic diagram of the D/A converter implemented with a floating gate

The signals (multiplied sum terms) fed into the CSD-block are 3-bit, so a D/A conversion has to be performed. A multiple number of standard D/A converters would consume a lot of power due to the required high-speed operation (the neurons change states at the free-running frequency, > 500 MHz). This was not an attractive alternative, and the basic floating-gate structure was instead altered so that the 3-bit D/A conversion is accomplished by dividing the sum capacitor into 3 binary-weighted parts. The division of one sum branch is shown in Fig. 3.
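The effect of splitting one branch capacitance C = 7·Cu into the binary-weighted parts Cu, 2·Cu and 4·Cu (Fig. 3) can be sketched numerically. The values of Cu, Vdd and the total floating-gate capacitance C_total below are illustrative assumptions, not the chip's actual values.

```python
def branch_contribution(bits, Cu=1.0, Vdd=3.3, C_total=64.0):
    # Contribution of one 3-bit sum branch to the floating-gate potential:
    # the branch capacitor is split into binary-weighted parts Cu, 2*Cu, 4*Cu,
    # so the coupled charge is proportional to the 3-bit code (EQ (4)).
    caps = [Cu, 2 * Cu, 4 * Cu]          # weights for bits Q0, Q1, Q2
    return sum(c * (Vdd if b else 0.0) for c, b in zip(caps, bits)) / C_total

# code Q2 Q1 Q0 = 101 (value 5) couples 5*Cu*Vdd into the floating gate
assert abs(branch_contribution((1, 0, 1)) - 5 * 3.3 / 64.0) < 1e-12
```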
3.3 AWGN noise generator

The following specifications were given for the additive white Gaussian noise sources:
1. Each neuron needs its 'own' noise source, i.e. the correlation with the other noise sources must be minimal.
2. The number of neurons is T + 2(L−1), so 17 noise sources are needed.
3. The variance of the noise has to be controlled with at least 4-bit accuracy (16 levels).
A linear feedback shift register was chosen to generate a multiple number of digital pseudorandom sequences. A VLSI-efficient method was utilized to minimize the required hardware [4]: only one shift register is needed, since properly placed XOR functions produce 17 uncorrelated sequences. An analog noise signal is obtained when the digital sequence is fed through a low-pass filter. A level control of the noise is needed to perform the variance decrease. One branch of the floating-gate summing element is dedicated to the AWGN. This branch is divided into 4 binary-weighted parts; the part for bit i is shown in Fig. 4.
Figure 4. A schematic of the ith bit of the variance control circuit

AWGN is the digital noise input and AWGN_ctr(i) (i = 0,...,3) is an external level control input. The circuit works as follows:
- If AWGN_ctr(i) = '0', then A = '1' and B = '0' (AWGN is attenuated).
- If AWGN_ctr(i) = '1', then A = B = AWGN (AWGN is fed to the floating gate).
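The pseudorandom core of such a generator can be sketched with a small Fibonacci LFSR. The 7-bit register and tap positions below are illustrative, not the chip's actual register length; the decoder derives its 17 low-correlation sequences from one register via properly placed XOR taps [4].

```python
def lfsr7_step(state):
    # One step of a 7-bit Fibonacci LFSR with feedback taps at bits 7 and 6,
    # a maximal-length configuration (period 2^7 - 1 = 127).
    bit = ((state >> 6) ^ (state >> 5)) & 1     # XOR of the two tap bits
    return ((state << 1) | bit) & 0x7F          # shift left, feed the bit back

# Maximal-length check: from any nonzero seed the register cycles through
# all 127 nonzero states before repeating.
state, seen = 1, set()
for _ in range(127):
    state = lfsr7_step(state)
    seen.add(state)
assert len(seen) == 127 and 0 not in seen
```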
3.4 RNN decoder

The circuit blocks explained above are used to build the RNN neuron illustrated in Fig. 5. The summing node of the neuron is at the same time the input of the analog comparator, which performs the hard-limiting activation function, i.e. the decision (see fa( ) in EQ (3)). A digital multiplexer is added to the output so that a certain initial state can be set to the neuron.

Figure 5. A simplified schematic of the neuron

A RNN decoder for L = 3 rate 1/2 is constructed with 17 real neurons (neurons S-1 and S-2 are made with flip-flops), a shift register to delay the input signals and the AWGN generator. An internal VCO was also designed to provide the high-frequency clock for the AWGN generator. A 0.35 µm CMOS process provided by STMicroelectronics was selected for the decoder test chip. The simulations were carried out with the mixed-signal simulator Continuum by Mentor Graphics. During the simulations it was found that the performance of the decoder (BER and speed) depends heavily on the parameters and the control signals. The simulation time of one test set was very long, due to the huge stabilization time of a very complex system having lots of feedback paths. Although the optimum operating conditions could not be found with the simulator, the design was found to be robust, i.e. several sets of usable parameters were found (the decoder is working). As a result, all control signals were taken external to allow maximal flexibility in the measurements.

4. Experimental results

The active area consumed by the designed decoder is as low as 950 µm x 450 µm with the 0.35 µm CMOS technology used. A microphotograph of the decoder is shown in Fig. 6. The chip was packaged in a PQFP144 package and soldered on the test PCB. The input data (encoded bits) and the control signals were fed to the decoder, and the results were acquired with a Hewlett-Packard 16500B Logic Analysis System.

Figure 6. Microphotograph of the L = 3 rate 1/2 RNN decoder (size 950 x 450 µm2)

Preliminary results show that the presented idea works and that optimum performance for the decoder is found when:
- the maximal AWGN-generator frequency is applied
- the duration of the added noise is properly chosen (depends on the decoding speed)

A summary of the measured BER under different conditions is shown in Table 1. The AWGN clock frequency was 500 MHz and the variance of the noise was decreased linearly during the decoding process. The signal-to-noise ratio (SNR) of the input signal was 0 dB (3 dB per bit) and 5 bursts of 1000 bits each were applied. The optimum (ideal) BER values are 0.03 and 0.004 for 1-bit quantized (hard) and continuous-amplitude analog (soft) input signals, respectively; in the case of the 3-bit input signal the optimum BER would lie between those values. The power consumption of the decoder is 48.8 mW and 11.4 mW at a decoding frequency of 5 MHz with VDD = 3.3 V and VDD = 2.0 V, respectively.

TABLE 1. Measured BER of the decoder as a function of the decoding frequency (fd)

    fd          VDD = 3.3 V    VDD = 2.0 V
    5 MHz       0.132          0.243
    1.25 MHz    0.063          0.186
    0.5 MHz     0.044          0.180

5. Conclusions

The implementation of a new decoding algorithm for convolutional codes has been investigated. The algorithm is based on tailored neurons and recurrent neural networks (RNN). Special floating-gate MOSFET circuits have been utilized to obtain the maximally parallel and fast neurons required for real-time decoding. A pseudorandom generator with multiple outputs and a variance control has been designed. The decoder for L = 3 rate 1/2 has been processed using a 0.35 µm CMOS process within a small silicon area (950 x 450 µm2). Measurement results show that a BER of 0.06 can be achieved at a decoding frequency of 1.25 MHz, while the power consumption is 48.8 mW.

Acknowledgements

The authors would like to thank Ari Hämäläinen, Jukka Henriksson and Nokia Corporation for their support of this work.

References
[1] A. Hämäläinen and J. Henriksson, "Convolutional Decoding Using Recurrent Neural Networks," Proceedings of the International Joint Conference on Neural Networks, Washington, DC, USA, 1999.
[2] J. S. Lee and L. E. Miller, CDMA Systems Engineering Handbook, Artech House Publishers, Boston, 1998, pp. 916-922.
[3] A. Rantala, P. Kuivalainen and M. Åberg, "Low Power High-Speed Neuron MOS Digital-to-Analog Converters with Minimal Silicon Area," Analog Integrated Circuits and Signal Processing, vol. 26, pp. 53-61, 2001.
[4] J. Alspector et al., "A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks," IEEE Trans. on Circuits and Systems, vol. 38, pp. 109-123, 1991.