Digital Signal Processing Engine Design for Polar Transmitter in Wireless Communication Systems

Hung-Yang Ko, Yi-Chiuan Wang and An-Yeu (Andy) Wu

Graduate Institute of Electronics Engineering, and Department of Electrical Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C.

Abstract-Polar modulation techniques offer the capability of a multimode wireless system and the potential for a high-efficiency Power Amplifier (PA). This paper describes a new design of a Digital Signal Processing (DSP) engine for the polar transmitter. The digital part includes a rectangular-to-polar converter and a digital phase modulator, and the engine is designed for the EDGE (2.5G) system. We employ the Coordinate Rotation Digital Computer (CORDIC) and Direct Digital Frequency Synthesizer (DDFS) techniques in our design. A prototype chip has been designed and fabricated in the UMC 0.18 um CMOS process with 1P6M technology.

1. INTRODUCTION

Polar modulation offers the capability of achieving high linearity and high efficiency simultaneously in a wireless transmitter. Improved efficiency is achieved by operating a highly efficient, non-linear PA at its peak efficiency. Linear transmission is achieved by modulating the envelope of the signal through the voltage supply of the PA. Polar transmission uses envelope and phase components to represent the digital symbols instead of the conventional I/Q format [1]. The baseband signal V(t) is split into the phase signal θ(t) and the envelope signal A(t).

$$V(t) = x(t) + j \cdot y(t). \tag{1}$$

$$A(t) = \sqrt{x(t)^2 + y(t)^2}, \qquad \theta(t) = \tan^{-1}\left(\frac{y(t)}{x(t)}\right). \tag{2}$$

It is clear from Eq. (2) that a phase-only signal can be produced by the phase modulator and multiplied by its envelope at the PA to recreate the original complex signal V(t). This polar modulation process resembles the Envelope Elimination and Restoration (EER) [2] architecture. In the conventional design, one path goes through a limiter, which removes the envelope and keeps only the phase information, and the other path goes through an envelope detector, which extracts the envelope information. But both circuits suffer from the non-linearity and distortion of the analog devices and cause a mismatch problem between the two paths. In this paper we propose a DSP engine which includes a rectangular-to-polar converter and a digital Phase Modulator (PM). The design does not have the distortion problem caused by the analog components, and the phase modulation process can be precisely controlled by the digital phase modulator. The baseband phase signal is modulated by the digital phase modulator in the specified frequency range. The phase-modulated signal is represented as s_IF-PM(t).

$$s_{IF-PM}(t) = \cos(\omega_c t + \phi(t)). \tag{3}$$

The PA stage of the amplitude modulator (AM) operates in principle as a multiplier in our design model. This gives the output signal in the specified frequency band as follows:

$$s_{IF}(t) = A(t) \cdot s_{IF-PM}(t) = A(t) \cdot \mathrm{Re}\left\{ e^{j\phi(t)} \cdot e^{j\omega_c t} \right\} = x(t)\cos(\omega_c t) - y(t)\sin(\omega_c t), \tag{4}$$

where the last step takes φ(t) to be the phase θ(t) of Eq. (2), so that x(t) = A(t) cos φ(t) and y(t) = A(t) sin φ(t).

For convenience of the simulation model [2], the gain of the PA is set to one. Thus Eq. (4) equals the EDGE signal up-converted to the Intermediate Frequency (IF) band. The nonlinearity of the PA and the analysis of the up-converter to the Radio Frequency (RF) stage are beyond the scope of this paper.

This work is supported by MediaTek Inc. under the NTU-MTK wireless research project.

0-7803-8834-8/05/$20.00 ©2005 IEEE.

Fig. 1. Architecture of polar transmitter.

2. POLAR TRANSMITTER ARCHITECTURE

The architecture of the polar transmitter is shown in Fig. 1. The rectangular-to-polar converter extracts the symbol phase and envelope information in the digital domain. The phase information is then modulated by the digital phase modulator to create a constant-envelope, phase-modulated signal. The phase modulation precision and the channel selection can thus be well controlled in the digital part. In this paper we use the concept from [3] to realize the digital phase modulator design. The digital fine-tune frequencies are generated by the DDFS. The DDFS interpolates

the carrier frequencies between the coarse frequencies generated by the integer-N PLL. The main design considerations of the DSP engine include: (a) the bandwidth of the envelope and phase signals; (b) the number of fine-tune frequencies generated by the DDFS, which affects the clock rate of the DDFS and of the rectangular-to-polar converter; (c) the quantization effect in the digital domain, which causes phase noise and frequency spurs and also influences the Error Vector Magnitude (EVM) performance and the signal spectrum. Typically the bandwidths of the envelope and phase signals are 1~2 MHz, larger than the 200 kHz EDGE signal bandwidth. The clock rate of the DDFS can be derived [3] as below:

$$f_{clk} = S \cdot f_{sym} > \frac{1}{0.4} \cdot \left( f_{cs} \times (N+1) + \frac{f_{tb}}{2} \right), \tag{5}$$

where f_clk is the clock rate of the DDFS, S is the number of samples per symbol and f_sym is the symbol rate of the EDGE signal. The maximum output frequency of the DDFS is limited to 0.4 times the clock frequency. The parameter f_cs is the carrier spacing (200 kHz) in the EDGE system, N is the number of digital fine-tune frequencies and f_tb is the transition bandwidth of the filter located after the up-converter stage. In our design, we choose N=25, f_tb=10 MHz, f_cs=200 kHz and S=96. Thus the DDFS should be clocked at 26 MHz. The digital fine-tune frequencies generated by the DDFS lie in the range 5 MHz~10.4 MHz. Each interpolated frequency (channel) is stored in the fine-tune Frequency Control Word (FCW) table.

3. RECTANGULAR-TO-POLAR CONVERTER

For the coordinate converter, we adopt the CORDIC algorithm, since it is very simple and has a low hardware cost. In order to further reduce the complexity, we also apply the technique in [4] to our rectangular-to-polar converter. First, in the first iteration we move the input vector into the first or fourth quadrant by simple sign inversion and data exchange. Second, we replace y_i by y_i·2^-i as compared with the conventional CORDIC algorithm. This modification saves one iteration and one barrel shifter in the rectangular-to-polar converter, and therefore saves area in our design. For i=1 and an input vector (x1, y1) from the EDGE signal:

$$x_2 = d_1 \cdot y_1, \qquad y_2 = -d_1 \cdot x_1, \qquad z_2 = 0.5 \cdot d_1, \tag{6}$$

$$d_i = \mathrm{sign}(y_i) = \begin{cases} -1, & y_i < 0 \\ \phantom{-}1, & y_i \ge 0 \end{cases}$$

The remaining iterations (for i=2~n) are shown in Eq. (7).

$$x_{i+1} = x_i + d_i \cdot 2^{-2(i-2)} \cdot y_i, \qquad y_{i+1} = 2 \cdot [\, y_i - d_i x_i \,], \qquad z_{i+1} = z_i + d_i \cdot p_i, \tag{7}$$

$$p_i = \frac{1}{\pi} \cdot \tan^{-1}\left(2^{-(i-2)}\right), \qquad K = \frac{1}{\prod_i \sqrt{1 + 2^{-2(i-2)}}}.$$

The desired phase is z_{i+1}, and the desired envelope value is x_{i+1} multiplied by the constant scaling factor K. Because the CORDIC algorithm is iterative, the clock rate of this module would be n*f_clk, where n is the number of iterations. It is hard for the module to operate at such a high clock rate. As a compromise we use the unfolding technique; the resulting architecture is shown in Fig. 2.

Fig. 2. Architecture of rectangular-to-polar converter.

4. DIGITAL PHASE MODULATOR

The DDFS architecture is shown in Fig. 3. The DDFS has three basic blocks: the FCW table, the phase accumulator and the phase-to-amplitude converter. The FCW table stores the desired fine-tune frequency control words, which can be derived from Eq. (8).

$$f_c = \frac{FCW \cdot f_{clk}}{2^{L}}, \qquad \forall\, FCW < 2^{L-1}. \tag{8}$$

Fig. 3. Architecture of DDFS
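As a rough numerical sketch of the two digital blocks described so far, the Python below runs the vectoring iterations of Eqs. (6)-(7) in floating point and then evaluates Eq. (8) for one carrier. Several things here are assumptions made for illustration and are not taken from the paper: the function names, the iteration count n=16, reading the scaling factor K as the reciprocal of the accumulated per-iteration gain, treating L in Eq. (8) as the phase-accumulator word length, and the 24-bit value chosen for it.

```python
import math

def sign(v):
    # d_i = sign(y_i): -1 for negative inputs, +1 otherwise
    return -1.0 if v < 0 else 1.0

def cordic_rect_to_polar(x1, y1, n=16):
    """Floating-point model of Eqs. (6)-(7).

    Returns (envelope, phase), with the phase expressed as a fraction of pi.
    """
    d = sign(y1)
    x, y, z = d * y1, -d * x1, 0.5 * d          # Eq. (6): +/- 90 degree pre-rotation
    gain = 1.0                                   # accumulates 1/K (assumed product form)
    for i in range(2, n + 1):                    # Eq. (7), i = 2..n
        d = sign(y)
        shift = 2.0 ** -(i - 2)
        p_i = math.atan(shift) / math.pi
        # Tuple assignment uses the old x and y on the right-hand side
        x, y, z = x + d * shift * shift * y, 2.0 * (y - d * x), z + d * p_i
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, z

def fcw_for(fc_hz, fclk_hz, acc_bits):
    """Nearest integer FCW for a target carrier f_c under Eq. (8)."""
    fcw = round(fc_hz * 2 ** acc_bits / fclk_hz)
    if not 0 <= fcw < 2 ** (acc_bits - 1):
        raise ValueError("Eq. (8) requires FCW < 2**(L-1)")
    return fcw

def synthesized_hz(fcw, fclk_hz, acc_bits):
    """Output frequency produced by a given FCW, Eq. (8)."""
    return fcw * fclk_hz / 2 ** acc_bits

if __name__ == "__main__":
    env, phase = cordic_rect_to_polar(3.0, 4.0)
    print("envelope", env, "phase [rad]", phase * math.pi, "atan2", math.atan2(4.0, 3.0))
    fcw = fcw_for(8e6, 26e6, acc_bits=24)        # 24-bit accumulator is an assumption
    print("FCW", fcw, "f_c [Hz]", synthesized_hz(fcw, 26e6, 24))
```

The factor of 2 applied to y in every iteration is what lets the x update use the fixed exponent -2(i-2); in hardware this would replace a variable shift on the y path, whereas the floating-point model simply lets y grow.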

In our design we focus on the phase-to-amplitude converter and propose an architecture based on the Least Squares (LS) algorithm [5] and the Merged-Multiply Accumulator (MAC) technique [6]. The input phase is first truncated by 3 bits according to the π/4 symmetry, and the amplitude of the sine function is then expressed by a polynomial. The approximating polynomial is generated according to the LS algorithm. In this paper we compare its Spurious Free Dynamic Range (SFDR) performance with other approximation algorithms such as Taylor and Chebyshev [9]. For the comparison, the input phase is swept from 0 to π/2, the phase word-length is 15 bits and the amplitude output is 15 bits. From the simulation result in Fig. 4, we can see that the LS-based polynomial achieves better performance than the Taylor and Chebyshev approximations with a lower polynomial order. A lower polynomial order also means lower hardware complexity.
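To make the LS-versus-Taylor comparison concrete, the sketch below fits a second-order polynomial to sin(x) on [0, π/4] by least squares and compares its worst-case error with a second-order Taylor expansion about the segment midpoint. It is only an illustration under stated assumptions: it measures maximum absolute error rather than SFDR, it uses floating point instead of the 15-bit word lengths, and the helper names, sample count and the optional split into equal segments are hypothetical choices rather than details of the proposed design.

```python
import math

def solve3(m):
    """Gauss-Jordan elimination on a 3x4 augmented matrix; returns the 3 unknowns."""
    m = [row[:] for row in m]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        pivot = m[col][col]
        m[col] = [v / pivot for v in m[col]]
        for r in range(3):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][3] for r in range(3)]

def ls_max_error(a, b, samples=401):
    """Max |sin(x) - p(x)| on [a, b] for the least-squares quadratic p."""
    mid = 0.5 * (a + b)
    # Fit in the centred variable u = x - mid to keep the normal equations well conditioned
    us = [(a - mid) + (b - a) * k / (samples - 1) for k in range(samples)]
    s = [sum(u ** p for u in us) for p in range(5)]
    t = [sum(math.sin(mid + u) * u ** p for u in us) for p in range(3)]
    c0, c1, c2 = solve3([[s[r + c] for c in range(3)] + [t[r]] for r in range(3)])
    return max(abs(math.sin(mid + u) - (c2 * u * u + c1 * u + c0)) for u in us)

def taylor_max_error(a, b, samples=401):
    """Max error of the second-order Taylor expansion of sin about the midpoint."""
    mid = 0.5 * (a + b)
    s0, c0 = math.sin(mid), math.cos(mid)
    xs = [a + (b - a) * k / (samples - 1) for k in range(samples)]
    return max(abs(math.sin(x) - (s0 + c0 * (x - mid) - 0.5 * s0 * (x - mid) ** 2))
               for x in xs)

def segmented(err_fn, segments, lo=0.0, hi=math.pi / 4):
    """Worst error over `segments` equal sub-intervals of [lo, hi]."""
    edges = [lo + (hi - lo) * k / segments for k in range(segments + 1)]
    return max(err_fn(edges[k], edges[k + 1]) for k in range(segments))

if __name__ == "__main__":
    for n in (1, 8):
        print(n, "segment(s): LS", segmented(ls_max_error, n),
              "Taylor", segmented(taylor_max_error, n))
```

For the same order, the least-squares fit spreads its error across the interval instead of concentrating it at the ends, which is the reason a lower-order LS polynomial can match a higher-order series expansion.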

In order to reduce the polynomial order we further divide the approximated region into eight segments. In each segment, the approximating polynomial p(X) can be represented as in Eq. (9).

$$\begin{aligned} p(X) &= c_2 \cdot X^2 + c_1 \cdot X + c_0 \\ &= \sum_{i=0}^{n_1-1} R_i \cdot 2^i + [c_1]_{n_2} \cdot [X]_{n_3} + \sum_{k=0}^{n_4-1} C_k \cdot 2^k \\ &= \mathrm{MAC}\Big( [rom\_1]_{n_1} + \sum_{j=0}^{n_2/2} Q_j \cdot [X]_{n_3} \cdot 4^j + [rom\_2]_{n_4} \Big). \end{aligned} \tag{9}$$

$$Q_j = -2c_{1,2j+1} + c_{1,2j} + c_{1,2j-1}, \qquad c_{1,j} \in \{0, 1\} \ \text{and} \ c_{1,-1} = 0, \tag{10}$$

where [ ]_n denotes truncation to n bits, c_i represents the coefficients, and X is the phase within each divided region. In Eq. (9) we store the first term and the third term in look-up tables. The sizes of rom_1 and rom_2 are 1,536 bits and 232 bits respectively. The operations in Eqs. (9)-(10) now become one Booth multiplication and two constant additions. These can be merged into a modified MAC (Fig. 4). First the binary phase X is fed to the Booth decoder circuit, and a partial product term is generated in each row of the MAC. The partial product terms are summed through a Carry-Save-Adder (CSA) tree. Compared with a direct implementation of the second-order polynomial, the CSA tree avoids the carry ripple problem in the early stages, and carry ripple occurs only at the final stage. Due to the EDGE spectral requirement we target an SFDR above 80 dBc. From Matlab simulation, we set the truncated accumulated phase word-length to W=15 bits and the amplitude word-length to P=14 bits. These hardware parameters achieve SFDR=86 dBc. The other parameter is the word-length of the phase of the EDGE signal. This also introduces phase noise and spurs in the output spectrum and is discussed in Section 5. The proposed DDFS circuit is simulated with the NANOSIM tool and compared with the state of the art in Table 1. The proposed DDFS achieves high SFDR performance, and its power efficiency is also superior to the other designs.

Fig. 4. Architecture of Modified-MAC.

Fig. 4. SFDR comparison between LS, Taylor and Chebyshev.

Table 1. Comparison with the existing DDFS designs.

| DDFS | CMOS tech. (um) | SFDR (dBc) | Latency | Power efficiency (mW/MHz) |
|---|---|---|---|---|
| Ours | 0.18 | 86 | 5 | 0.15 |
| Ref [7] / Ref [8] | 0.18 / 0.25 | 84 / 90.3 | 13 | 0.22 / 0.66 |
| Ref [9] (Taylor) | 0.35 | 82.5 | 9 | 0.26 |
| Ref [9] (Chebyshev) | 0.35 | 73 | 7 | 0.35 |
| Ref [10] | 0.35 | 80 | 2 | 0.44 |

5. SIMULATION RESULT

For the Mobile Station (MS), the EVM-rms and EVM-peak requirements are below 9% and 30%. For the Base Transceiver Station (BTS), EVM-rms and EVM-peak must be below 7% and 22%. The SFDR performance of the digital frequency synthesizer is suitable for the up-link and down-link spectral requirements. But the phase signal word-length also contributes spurs and phase noise, and it affects the EVM and the signal spectrum. In this paper we simulate the effect of the finite word-length (J) of the phase signal against the EVM measurement and the spectral mask requirement. The performance summary is in Table 2.

Table 2. Simulation result and EVM measurement.

| J-bits | EVM-rms | EVM-peak | Spectral requirement |
|---|---|---|---|
| 9-bits | 0.028% | 0.094% | No (Spurs at -66dBc) |
| 10-bits | 0.014% | 0.046% | No (Spurs at -74dBc) |
| 11-bits | 0.007% | 0.018% | No (Spurs at -79dBc) |
| 12-bits | 0.003% | 0.011% | Yes (Spurs at -81dBc) |

From Table 2, we can see that the errors produced by the phase quantization are very small for word-lengths above 9 bits, and the errors introduced by the entire digital phase modulator can be eliminated. But the spectrum of the s_IF-PM(t) signal is not always below the spectral mask, especially for the BTS mask, whose requirement is more stringent than that of the MS mask. Since the phase quantization error degrades the synthesizer SFDR performance, it is conservative to choose J=12 bits in our design. The signal spectrum with J=12 bits at a carrier of 8 MHz is shown in Fig. 5. The digital phase-modulated signal generated by the DDFS meets the spectral requirement for both the BTS mask and the MS mask.

Fig. 5. The spectrum of the EDGE signal through DSP engine.

6. IMPLEMENTATION RESULT

The proposed DSP engine was implemented in the UMC 0.18 um CMOS process with 1P6M technology. The layout of the DSP engine is shown in Fig. 6. The summary of the circuit is listed in Table 3.

Fig. 6. Layout of the proposed DSP engine.

Table 3. Implementation summary of the DSP engine.

| Item | Value |
|---|---|
| Technology | UMC 0.18 um 1P6M CMOS |
| Voltage | 1.8 V |
| Core layout area | 0.51x0.51 mm2 |
| Chip layout area | 1.114x1.114 mm2 |
| System clock frequency | 26 MHz |
| Power consumption @ 26 MHz | 3.92 mW |

7. CONCLUSION

In this paper, we proposed a DSP engine for the polar transmitter. The engine is realized with the CORDIC and DDFS techniques. In the digital phase modulator we adopt the LS algorithm. We also apply the MAC technique in our DDFS architecture to reduce the hardware complexity and to reduce the carry ripple problem of the direct polynomial implementation. The chip implementation in the UMC 0.18 um CMOS process with 1P6M technology is also presented in this paper.

8. REFERENCES

[1] Nagle, P.; Burton, P.; Heaney, E.; McGrath, F., “A wideband linear amplitude modulator for polar transmitters based on the concept of interleaving delta modulation,” IEEE Journal of Solid-State Circuits, vol. 37, pp. 1748-1756, Dec. 2002.
[2] Rudolph, D.; “Out-of-band emissions of digital transmissions using Kahn EER technique,” IEEE Trans. Microwave Theory and Techniques, vol. 50, pp. 1979-1983, Aug. 2002.
[3] Vankka, J.; “Digital frequency synthesizer/modulator for continuous-phase modulations with slow frequency hopping,” IEEE Trans. Vehicular Technology, vol. 46, pp. 933-940, Nov. 1997.
[4] Chen, A.; Yang, S.; “Reduced complexity CORDIC demodulator implementation for D-AMPS and digital IF-sampled receiver,” in Proc. Globecom '98, vol. 3, pp. 1491-1496, Nov. 1998.
[5] M. Flickner, J. Hafner, E.J. Rodriguez, and J.L.C. Sanz; “Fast least-squares curve fitting using quasi-orthogonal splines,” IEEE Int. Image Processing, vol. 1, pp. 686-690, Nov. 1994.
[6] Elguibaly, F.; “A fast parallel multiplier-accumulator using the modified Booth algorithm,” IEEE Trans. Circuits and Systems, vol. 47, pp. 902-908, Sept. 2000.
[7] Langlois, J.M.P.; Al-Khalili, D.; “Low power direct digital frequency synthesizers in 0.18 µm CMOS,” IEEE CICC Proceedings, pp. 21-24, Sept. 2003.
[8] A. Torosyan, Dengrwei Fu, Jr. Willson A. N., “A 300 MHz quadrature direct digital synthesizer/mixer in 0.25 µm CMOS,” in IEEE Solid-State Circuits Conference, vol. 1, pp. 132-133, 2002.
[9] Kalle I. Palomaki and Jarkko Niittylahti; “Phase-to-Amplitude Mapping in Direct Digital Frequency Synthesizers Using Series Approximation,” EURASIP Journal on Applied Signal Processing, 2001.
[10] D. De Caro, E. Napoli, A. G. M. Strollo, “Direct digital frequency synthesizers using high-order polynomial approximation,” IEEE Solid-State Circuits Conference, vol. 1, pp. 134-135, 2002.

A Memory-Reduced Log-MAP Kernel for Turbo Decoder

Tsung-Han Tsai
Department of Electrical Engineering, National Central University, Jhongli 320, Taiwan, R.O.C.

Cheng-Hung Lin and An-Yeu Wu
Graduate Institute of Electronic Engineering, National Taiwan University, Taipei 102, Taiwan, R.O.C.

Abstract-Generally, the Log-MAP kernel of a turbo decoder consumes large memories in hardware implementation. In this paper, we propose a new Log-MAP kernel to reduce memory usage. The comparison result shows that our proposed architecture can reduce the memory size to 26% of the classical architecture. We also simplify the memory data access in this kernel design without extra address generators. For the 3GPP standard, a prototyping chip of the turbo decoder is implemented to verify the proposed memory-reduced Log-MAP kernel in a 3.04×3.04 mm2 core area in the UMC 0.18 um CMOS process.

I. INTRODUCTION

Turbo codes have been shown to achieve a high coding gain near the Shannon capacity limit [1]. The reduction in bit error rate (BER) is achieved at the expense of the intensive computations involved in the iterative turbo decoding steps. Iterative turbo decoding is composed of soft-input soft-output (SISO) decoding algorithms. A powerful SISO algorithm for turbo decoding is the maximum a posteriori (MAP) algorithm. However, because of its large requirements for multiplication and exponential computations, the MAP algorithm is converted into the logarithmic maximum a posteriori (Log-MAP) algorithm, which has an additive form [2]. The memory organization of the SISO algorithms, especially the Log-MAP, is critical because they are highly data dominated. Some previous works have developed techniques for memory reduction [3], [4], [7]. In this paper, we focus on the Log-MAP kernel of the turbo decoder, as shown in Fig. 1, and propose a memory-reduced Log-MAP architecture with low complexity in data access.

Figure 1. Block diagram of the turbo decoder.

This paper is organized as follows. The basic Log-MAP algorithm is described in Section II. In Section III, we describe the details of the previous and proposed organizations of the Log-MAP algorithm. The analysis and evaluation results are presented in Section IV. Finally, the experimental results and the conclusions are given in Section V and Section VI, respectively.

II. FUNDAMENTALS OF LOG-MAP ALGORITHM

In turbo decoding, the Log-MAP decides the binary encoded bit u_k from the sign bit of the a posteriori log-likelihood ratio (LLR). The arithmetic operations of the Log-MAP are described as follows. The branch metrics are computed as

$$\gamma_k(S_{k-1}, S_k) = \frac{1}{2}\left( \Lambda_{in,k}(u_k)\, x_k^{s} + L_c\, y_k^{s} x_k^{s} + L_c \sum_{i=1}^{m} y_k^{pi} x_k^{pi} \right), \tag{1}$$

where Λ_in,k is the a priori information, x_k denotes the transmitted codewords, y_k denotes the received codewords, m is the number of parity bits, and L_c = 4E_s/N_0. Note that x_k^s is the systematic bit, which equals u_k. The state metrics are computed as

$$\alpha_k(S_k) = \underset{S_{k-1}}{\mathrm{MAX}^*}\big( \gamma_k(S_{k-1}, S_k) + \alpha_{k-1}(S_{k-1}) \big), \tag{2}$$

$$\beta_k(S_k) = \underset{S_{k+1}}{\mathrm{MAX}^*}\big( \gamma_{k+1}(S_k, S_{k+1}) + \beta_{k+1}(S_{k+1}) \big), \tag{3}$$

where α_k is the forward recursion state metric and β_k is the backward recursion state metric. The LLR is defined as

$$\Lambda_k(u_k) = \underset{S_k,\, u_k = +1}{\mathrm{MAX}^*}\big( \alpha_k(S_k) + \gamma_{k+1}(S_k, S_{k+1}) + \beta_{k+1}(S_{k+1}) \big) - \underset{S_k,\, u_k = -1}{\mathrm{MAX}^*}\big( \alpha_k(S_k) + \gamma_{k+1}(S_k, S_{k+1}) + \beta_{k+1}(S_{k+1}) \big). \tag{4}$$

The sign bit of the LLR value decides whether u_k = +1 or u_k = -1. The MAX* operation is defined as

$$\mathrm{MAX}^*(x, y) = \ln(e^{x} + e^{y}) = \mathrm{MAX}(x, y) + \ln\left(1 + e^{-|x - y|}\right). \tag{5}$$

The MAX* can be implemented by an add-compare-select-offset (ACSO) unit, as shown in Fig. 2. Small look-up tables can implement the corrective term ln(1 + e^(-|x-y|)).
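A minimal sketch of the MAX* operation of Eq. (5) is given below, first in exact form and then with the corrective term read from a small table, as an ACSO unit would. The table step, the number of entries and the function names are hypothetical choices for illustration; the paper does not specify the look-up table contents.

```python
import math

def max_star(x, y):
    """Exact MAX*: ln(e^x + e^y) = max(x, y) + ln(1 + e^-|x-y|)."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def make_lut(step=0.5, entries=8):
    """Hypothetical correction table: ln(1 + e^-d) sampled at d = 0, step, 2*step, ..."""
    return [math.log1p(math.exp(-k * step)) for k in range(entries)]

def max_star_lut(x, y, lut, step=0.5):
    """Compare-select plus table offset; the offset is taken as 0 beyond the table."""
    d = abs(x - y)
    idx = int(d / step)
    corr = lut[idx] if idx < len(lut) else 0.0
    return max(x, y) + corr

if __name__ == "__main__":
    lut = make_lut()
    for a, b in [(0.0, 0.0), (1.0, 2.0), (-3.0, 1.5), (10.0, -10.0)]:
        print(a, b, max_star(a, b), max_star_lut(a, b, lut))
```

Because the corrective term depends only on |x - y| and decays quickly, a handful of table entries is enough to keep the approximation error small, which is why a small table suffices in the ACSO unit.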

Figure 2. Architecture of the ACSO.

Because of the turbo decoding, there is one value to be iteratively interleaved and fed back to the Log-MAP as a priori information. This value, called the extrinsic information, is defined as

$$\Lambda_{e,k}(u_k) = \Lambda_k(u_k) - \Lambda_{in,k}(u_k) - L_c\, y_k^{s}. \tag{6}$$

III. MEMORY-REDUCED APPROACH

A. Memory Reduction in Algorithms

Conventionally, all branch metrics γ_k^i, i = 0 ~ 2^(m+1)-1, are calculated and then stored in the branch metrics memory (BM) until Λk(uk) is calculated. If each branch metric is nbm-bit wide, the bit-length of the BM is 2^(m+1)*nbm bits. However, we can further reduce the branch metrics stored in the BM. Table I shows that γ_k^j = -γ_k^(M-j), where j = 0 ~ 2^m-1 and M = 2^(m+1)-1. To reduce the memory, the BM only stores the γ_k^(M-j), and the other branch metrics are generated by multiplying the stored γ_k^(M-j) by -1. Fig. 3 shows an example of our BM-reduced approach when m = 1. The bit-length of the BM is thereby reduced to 2^m*nbm bits.

TABLE I. THE BRANCH METRICS GENERATION.

Case (1) m = 1

| (x_k^s, x_k^p1) | γ_k^i |
|---|---|
| (+1, +1) | γ_k^3 * |
| (+1, -1) | γ_k^2 * |
| (-1, +1) | γ_k^1 = -γ_k^2 |
| (-1, -1) | γ_k^0 = -γ_k^3 |

Case (2) m = 2

| (x_k^s, x_k^p1, x_k^p2) | γ_k^i |
|---|---|
| (+1, +1, +1) | γ_k^7 * |
| (+1, +1, -1) | γ_k^6 |
| (+1, -1, +1) | γ_k^5 |
| (+1, -1, -1) | γ_k^4 * |
| (-1, +1, +1) | γ_k^3 = -γ_k^4 |
| (-1, +1, -1) | γ_k^2 = -γ_k^5 |
| (-1, -1, +1) | γ_k^1 = -γ_k^6 |
| (-1, -1, -1) | γ_k^0 = -γ_k^7 |

Besides, the a priori term, -(Λin,k(uk) + Lc·y_k^s) in (6), is also buffered until Λe,k(uk) is calculated. It can be further eliminated in our Log-MAP kernel design. Taking m = 1 as an example, (6) can be rewritten as follows:

$$\Lambda_{e,k} = \Lambda_k(u_k) - \Lambda_{in,k}(u_k) - L_c\, y_k^{s} = \Lambda_k(u_k) - (\gamma_k^{2} + \gamma_k^{3}). \tag{7}$$

The a priori term can be easily replaced by the sum of the two branch metrics marked * in Table I. In our kernel design, we only store the two branch metrics used to calculate Λk(uk) in registers instead of buffering the a priori term. Fig. 4 shows an example when m = 1.

Figure 3. Block diagram of the branch metrics generation. For m = 1 the diagram computes

$$\theta_1 = \frac{1}{2}\left( \Lambda_{in,k}(u_k) + L_c\, y_k^{s} \right), \qquad \theta_2 = \frac{1}{2} L_c\, y_k^{p1},$$

$$(x_k^{s}, x_k^{p1}) = (+1, +1):\ \gamma_k^{3} = \theta_1 + \theta_2; \qquad (x_k^{s}, x_k^{p1}) = (+1, -1):\ \gamma_k^{2} = \theta_1 - \theta_2.$$

Figure 4. Block diagram of the extrinsic values generation.

In general, the values of the forward (2) and backward (3) recursions are computed in chronologically reverse order. Both the forward (Ak) and backward (Bk) recursion state vectors are required for the computation of Λk(uk), so a large state metric memory (SMM) is needed to store the Ak (or Bk) values until the Bk (or Ak) values are generated. Each state vector is composed of 2^υ state metrics (the size of the trellis state), and each one is nsm-bit wide. The size of the SMM for each state vector is 2^υ*nsm.

B. Classical Architecture

Since SMM reduction is an important issue in facilitating the hardware implementation, the classical Log-MAP architecture has been proposed to reduce the memory usage [3]. The classical Log-MAP architecture is composed of three recursion processes (RPs) operated in parallel. Two of them are used for the backward recursions (RPB and ARPB, Acquisition RPB), and one is used for the forward recursions (RPA). Each RP contains 2^υ ACSO units that perform the MAX* operations in parallel, so that one recursion can be computed in one cycle.

The scheduling of the classical architecture is shown in Fig. 5. L refers to the acquisition depth used in the sliding window concept (i.e. L = (4 ~ 6)*(υ + 1), where υ denotes the number of delay elements in a convolutional encoder). The ARPB provides the reliable Bk to the RPB and does not need any SMM to store these state vectors of length L. However, the Ak is stored in an SMM of size L*2^υ*nsm in order to compute the Λk(uk) values. Note that the Λk(uk) is generated in reverse order. The scheduling also illustrates that the decoding latency of this architecture is 4L and that the computation cost is 3, which equals the total number of RPs. The branch metrics are buffered for 4L symbol times and the a priori term is buffered for 3L symbol times.

Figure 5. Scheduling of the classical Architecture [3].

C. Traceback Architecture

In this paper, we propose a traceback architecture to further reduce the SMM size. In our approach, the Bk values are traced back along with the forward recursions. The Bk values of length L are generated in the backward recursion and then traced back in the trace-back recursion while the Ak values are generated. The trellis evolution of the backward recursion and the trace-back recursion is shown in Fig. 6, and the scheduling of this architecture is illustrated in Fig. 7.

Figure 6. Trellis evolution of the backward and trace-back recursion (υ = 2).

Figure 7. Scheduling of the proposed Architecture.

In the trace-back recursion, the TRPB (trace-back RPB) is required to regenerate 2^υ backward state metrics with only nsm-bit comparison metrics (comp, as shown in Fig. 1). In Fig. 6, the 2^(υ-1) comps can regenerate 2^υ backward state metrics, so the TRPB contains 2^(υ-1) TBUs (trace-back units, as shown in Fig. 8). Thus the total size of the SMM which stores the comps is L*2^(υ-1)*nsm. The TRPB has half the complexity of the RPB because it only contains 2^(υ-1) TBUs, each with a complexity similar to that of an ACSO unit. The Λk(uk) is generated in natural order.

Figure 8. Architecture of the TBU.

IV. ANALYSIS RESULTS

The recomputed architecture (Section V.E in [5]) has been proposed as an efficient memory-reduced architecture. In order to reduce the SMM size, it uses a simplified ACSO to recompute the Bk values every P cycles from the stored one-bit decision and noff-bit offset values of length L. To describe the overall performance, the different configurations are evaluated in Tables II and III. Note that the computation cost in Table II means the total number of RPs, and 0.5 RPB means that either the Recompute-RPB or the TRPB is used. In Table III, the Artisan SRAM Generator in the UMC 0.18 um process is used to generate the dual-port SRAMs of the different configurations. The parameters υ = 3, L = 24, P = 4, noff = 3 bits and nsm = 13 bits follow the 3GPP standard [5]. Note that the state vectors and the decision and offset values of the recomputed configuration are stored and loaded with different widths, so two different sizes of SRAM are required. Although the recomputed configuration can reduce the SMM size efficiently, its total physical SRAM area is larger. In contrast, the SMM size and physical SRAM area of the proposed configuration are smaller because only one size of SRAM is needed to store the comps.

TABLE II. THE PERFORMANCE OF THE DIFFERENT ORGANIZATIONS.

| Organization | Classical [3] | Recomputed [4] | Proposed |
|---|---|---|---|
| Computation Cost: RPA | 1 | 1 | 1 |
| Computation Cost: RPB | 2 | 2.5 | 2.5 |
| Additional Address Generator | None | Needed | None |
| Controllability | Easy | Complex | Easy |
| Latency | 4L | 4L | 4L |
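The SMM sizes compared in this section follow from simple products, and the sketch below evaluates them for the quoted parameters (υ = 3, L = 24, P = 4, noff = 3 bits, nsm = 13 bits) using the size expressions listed in Table III. The function names are hypothetical, and treating L/P as an integer division is an assumption that happens to be exact for these values.

```python
def smm_banks(v, L, P, n_sm, n_off):
    """(words, bits) of each SRAM bank per organization, from the Table III expressions."""
    states = 2 ** v
    return {
        "classical": [(L * states, n_sm)],
        # two banks of different widths: state vectors, and decision + offset bits
        "recomputed": [((L // P + P - 1) * states, n_sm), (L * states, 1 + n_off)],
        # comparison metrics only: half as many words as the classical SMM
        "proposed": [(L * states // 2, n_sm)],
    }

def total_cells(banks):
    """Total number of memory cells over all banks of one organization."""
    return sum(words * bits for words, bits in banks)

if __name__ == "__main__":
    for name, banks in smm_banks(v=3, L=24, P=4, n_sm=13, n_off=3).items():
        print(name, banks, total_cells(banks))
```

With these parameters the sketch reproduces the 192x13, 72x13 plus 192x4, and 96x13 organizations and the 2496, 1704 and 1248 total memory cells reported in Table III.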

TABLE III. THE SMM AREA COMPARISON OF THE DIFFERENT ORGANIZATIONS. (υ = 3, L = 24, P = 4, noff = 3 bits and nsm = 13 bits. Generated by Artisan HS-SRAM-DP Dual Port SRAM Generator in UMC 0.18um Process)

| Organization | SMM Size | Words x Bits | Total Memory Cells | Single Memory Area (mm2) | Total Memory Area (mm2) |
|---|---|---|---|---|---|
| Classical [3] | L*2^υ*nsm | 192x13 | 2496 | 0.15 | 0.15 |
| Recomputed [4] | ([L/P]+P-1)*2^υ*nsm | 72x13 | 1704 | 0.08 | 0.16 |
| | L*2^υ*(1+noff) | 192x4 | | 0.08 | |
| Proposed | L*2^(υ-1)*nsm | 96x13 | 1248 | 0.13 | 0.13 |

The total memory size of the classical configuration is compared with our approach in Table IV. AP denotes the a priori term buffer, whose bit-length is nap bits. We take υ = 3, m = 1, L = 40, nbm = 10 bits, nsm = 13 bits and nap = 9 bits to comply with the 3GPP standard. Besides, a more advanced approach combines our method with the low-latency concept [6]. Since the ARPB can be substituted by a look-ahead RPB (LARPB) in the low-latency method, a decoding latency of 5L/2 can be achieved. In addition, the advanced architecture can further reduce the total memory size. The overall architecture of the advanced Log-MAP decoder is shown in Fig. 9.

TABLE IV. THE TOTAL MEMORY SIZE OF DIFFERENT ORGANIZATIONS.

| Organization | BM | SMM | AP | Total (Kb) | Memory Size Ratio |
|---|---|---|---|---|---|
| Classical [3] | 4L*2^(m+1)*nbm | L*2^υ*nsm | 3L*nap | 11.64 | 100% |
| Proposed | 4L*2^m*nbm | L*2^(υ-1)*nsm | 0 | 5.28 | 45% |
| Advanced | 5L/2*2^m*nbm | L/2*2^(υ-1)*nsm | 0 | 3.04 | 26% |

Figure 9. Architecture of the memory-reduced Log-MAP decoder.

V. EXPERIMENTAL RESULTS

A fast and quite accurate VLSI implementation approach is obtained by simply using cell-based synthesis tools to compile Verilog HDL cores. In Fig. 10, we implement the 3GPP turbo decoder to verify the advanced memory-reduced Log-MAP kernel. We summarize the characteristics of our design in Table V. The SRAM size of the Log-MAP kernel is 3.04Kb (listed in Table IV), and the total physical area of the SRAMs inside the Log-MAP kernel is almost half of the total physical area of the kernel. The overhead of the TRPB used to regenerate the Bk values is 0.06mm2, which is less than 4% of the total area of the Log-MAP kernel.

Figure 10. Chip layout of the 3GPP turbo decoder.

TABLE V. CHIP SUMMARY OF THE 3GPP TURBO DECODER.

| Item | Value |
|---|---|
| Technology | UMC 0.18um 1p6m CMOS process |
| Core Size | 3.04x3.04mm2 |
| Maximum Operating Frequency | 145MHz |
| Maximum Decoding Rate | 12Mb/s (6 iterations) |
| Supply Voltage | 1.8V |
| Total Dual-Port SRAM Size of The Turbo Decoder | 0.28Mb |
| Dual-Port SRAM Size inside The Log-MAP Kernel | 3.04Kb |
| Dual-Port SRAM area inside The Log-MAP Kernel | 0.95mm2 |
| Total Area of The Log-MAP Kernel | 1.85mm2 |

VI. CONCLUSIONS

In this paper, a memory-reduced Log-MAP kernel is proposed. In this kernel, the BM, the SMM, and the a priori term buffer are modified to reduce the memory size. The comparison result shows that it can efficiently reduce the memory usage to 26% of the classical architecture. A prototyping chip is implemented to verify the proposed Log-MAP kernel. The proposed approach ensures that the memory size of the memory-reduced Log-MAP kernel is much less than that of the classical architecture when the kernel is implemented in a system-on-chip for wireless communication applications.

REFERENCES

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo Codes,” in Proc. ICC’93, pp. 1064-1070, May 1993.
[2] J. P. Woodard, and L. Hanzo, “Comparative study of turbo decoding techniques: an overview,” IEEE Trans. Vehicular Technology, vol. 49, pp. 2208-2233, Nov. 2000.
[3] A. J. Viterbi, “An intuitive justification and simplified implementation of the MAP decoder for convolutional codes,” IEEE J. Selected Area in Commun., vol. 16, pp. 260-264, Feb. 1998.
[4] E. Boutillon, W. J. Gross, and P. G. Gulak, “VLSI architectures for the MAP algorithm,” IEEE Trans. on Commun., vol. 51, pp. 175-185, Feb. 2003.
[5] “Technical Specification Group Radio Access Network; Multiplexing and Channel Coding (FDD),” 3rd Generation Partnership Project, 3GPP Ts25.212 v5.1.0, 2002.
[6] A. Raghupathy, and K. J. R. Liu, “A transformation for computational latency reduction in turbo-MAP decoding,” in Proc. ISCAS '99, vol. 4, pp. 402-405, Jul 1999.
[7] J. Dielissen, and J. Huisken, “State vector reduction for initialization of sliding windows MAP,” in Proc. 2nd Int. Symp. Turbo Codes, pp. 387-390, Sept. 2000.