2013 IEEE Wireless Communications and Networking Conference (WCNC): PHY

A Fully Parallel Truncated Viterbi Decoder for Software Defined Radio on GPUs

Rongchun Li, Yong Dou, Yu Li
National Laboratory for Parallel and Distributed Processing
National University of Defense Technology
Changsha, China, 410073
Email: {rongchunli,yongdou,yuli}@nudt.edu.cn

Shi Wang
Wuhan Military Delegate Bureau, General Armament Ministry
Wuhan, China
Email: [email protected]

Abstract—In this paper, we propose a fully parallel truncated Viterbi decoder for Software-Defined Radio (SDR) on the Graphics Processing Unit (GPU) platform. We exploit a map-reduce strategy based on the three-point Viterbi decoding algorithm (TVDA) because of its high parallelization potential. In truncation, the trellis of the Viterbi decoding algorithm is divided into sub-trellises, which can perform independent forward metric computation and trace-back in parallel. The parallel Viterbi decoding algorithm is mapped onto an NVIDIA GTX580 GPU. Experiments show that our method achieves low BER and a 36.0x speedup over a C implementation on a 2.0 GHz CPU. Meanwhile, our method achieves a performance improvement of 1.2x-3.6x over existing GPU-based implementations.

Index Terms—Graphics Processing Unit (GPU), Software-Defined Radio (SDR), Viterbi Decoder, CUDA

I. Introduction

Software-defined radio (SDR) [1] technology is designed to support various communication standards through software configuration, without altering the hardware platform. The emergence of SDR raises a tradeoff between performance and development efficiency. Many SDR platforms are currently based on DSPs or FPGAs. However, these platforms have several drawbacks. Although DSPs offer good code flexibility, their arithmetic computation capability cannot fulfill real-time requirements. By comparison, FPGAs provide enough computation power to support the requirements of wireless communication, but their development process is complicated: developers need to learn hardware description languages and become familiar with the corresponding programming and debugging tools. Furthermore, such hardware platforms are expensive. These drawbacks of DSPs and FPGAs obstruct the advancement of SDR technology. By contrast, SDR platforms based on central processing units (CPUs) feature a contrasting tradeoff. Developers can use a familiar architecture and work in a sophisticated programming and debugging environment, which increases development efficiency. Furthermore, algorithms can be modified and updated easily in a software implementation. However, CPUs are not designed for wireless signal processing, and their performance cannot satisfy the requirements of real-time wireless communication.

The SDR platform based on GPUs can overcome the problem of having to compromise on either performance or efficiency. The development trend of GPUs matches Moore's Law, and peak performance can reach up to 4.58 tera floating-point operations per second. GPUs integrate large storage, up to 6 gigabytes. In signal processing, SDR algorithms require math-intensive vector operations, which suit the single-instruction multiple-data (SIMD) parallel execution mode of the GPU platform. Furthermore, GPUs integrate so many floating-point arithmetic units that converting floating-point algorithms into corresponding fixed-point algorithms is no longer necessary for real-time communication purposes. In recent years, many GPU manufacturers have proposed their own programming models. For example, the NVIDIA Corporation presents the Compute Unified Device Architecture (CUDA) [2], which uses C as a high-level programming language and offers a software environment that facilitates the development of high-performance applications with considerably greater flexibility. Beyond this, the acceleration method of GPUs is simple enough that it is extensively used in image processing, numeric computing, signal processing, and other fields. GPUs also cost less and can be integrated into a commodity PC.

A number of studies have been devoted to SDR applications implemented on GPUs in recent years; they can be separated into two categories. The first accelerates SDR applications on GPUs. GPU-based Viterbi decoders [3], Turbo decoders [4], MIMO detectors [5], and LDPC decoders [6] have been proposed to meet the requirement of high throughput. It is worth noting that Lin et al. [3] proposed a parallel GPU-based Viterbi decoding algorithm using a "divide-and-conquer" method, which exploits a merging approach to eliminate dependencies among trellises. Besides these, several GPU-based systems, such as WiMAX [7]-[8] and DVB-T2 [9], have also been presented to perform wireless communication. The second category studies the management of GPUs in SDR prototyping environments. Plishker et al. [10] presented a dataflow approach for prototyping a GPU-accelerated application and explored the design space. Horrein et al. [11] discussed in detail how to integrate GPU computing in the SDR environment.

In SDR applications, the Viterbi decoding algorithm is widely used to decode convolutional codes, providing a low bit error ratio with low complexity. However, analysis of the operations of each module in the physical layer of current wireless standards, such as 802.16 or 802.11a, shows that the Viterbi decoder is the most time-consuming module in the receiver. In this paper, we accelerate the Viterbi decoding process based on the three-point Viterbi decoding algorithm (TVDA) because of its high parallelization potential. The algorithm is mapped onto a GPU platform to perform the decoding procedure with high performance and efficiency. Different from [3][8], the trellis of the Viterbi decoding algorithm is divided into sub-trellises in truncation, which can perform both independent forward metric computation and trace-back in parallel. Compared to [3][8], the proposed Viterbi decoder achieves a 2-3x performance speedup and good bit error ratio (BER) performance.

II. Background

A. CUDA

CUDA is a GPU programming model whose framework comprises a logical hierarchy and a physical hierarchy. In the logical hierarchy, the CUDA organization includes kernels, grids, blocks, and threads. A CUDA kernel is a device program, executed on the GPU and hierarchically organized in thread-block-grid steps. The kernel code, written in a C-extension language, can be compiled in typical environments such as Microsoft Visual Studio. Once a kernel is called by the C program on the CPU, a grid is generated on the GPU and concurrently executed by thread blocks comprising multiple parallel threads. The GPU has a SIMD architecture in which multiple threads perform a single instruction on their own independent sets of data. Each thread is given an identifier to distinguish it from the others, which developers must control precisely when performing SIMD processing. In the physical hierarchy, the CUDA structure consists of several types of memory, as well as multiple stream-multiprocessors (SMs), each with several integrated stream-processors (SPs) and a few special-function units. Four types of memory are found in the GPU: constant memory, texture memory, global memory, and shared memory. Each kind of memory has its own hardware, access speed, and programming scope. Global memory, which connects the host computer and the GPU accelerator, can be read or written by all the blocks, with an access latency of more than 400 GPU clock cycles. Constant and texture memory are read-only memories with negligible latency. Within a thread-block, the most critical part is the shared memory, whose latency is four cycles. It enables threads to communicate with one another, reducing the overhead incurred by accessing global memory, since registers can only be used within a single thread. CUDA provides a programming interface that facilitates controlling GPUs as well as exploiting their parallelism. It bridges the logical and physical hierarchies by automatically mapping thread blocks to idle SMs, which helps developers focus on the logical hierarchy while ignoring the physical one.
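As a minimal illustration of this model (our sketch, not code from the paper), the following CUDA fragment launches a kernel as a grid of thread-blocks, with each thread deriving its own identifier from its block and thread indices:

```cuda
#include <cuda_runtime.h>

// Each thread computes a unique global identifier from its block and
// thread indices and processes one element in SIMD fashion.
__global__ void scaleKernel(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;                                  // lives in global memory
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Calling the kernel from host code generates a grid on the GPU; the
    // runtime maps the thread-blocks onto idle SMs automatically.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```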

B. Viterbi Decoder

The Viterbi decoding procedure can be expressed as a trellis, as described in Fig. 1. In the trellis, the state transition is repeated from left to right for a number of time stages equal to the length of the information bits. The Viterbi algorithm can be partitioned into two procedures: the forward procedure and the trace-back procedure, which together identify the maximum likelihood path through the trellis. A metric called the path metric (PM) is adopted to measure each possible path through the trellis. At each time stage, the branch metric (BM) between two states is calculated and added to the PM to update it. Only the path with the minimum PM is retained and delivered to its consumer in the next iteration. For each state in the trellis, an add-compare-select (ACS) operation performs this action, producing the survived bits (SBs) for all the paths. The ACS operation can be written as equation (1):

$$PM_{t,s} = \min\{PM_{t-1,2j} + BM_{2j,s},\ PM_{t-1,2j+1} + BM_{2j+1,s}\}$$
$$SB_{t,s} = (PM_{t-1,2j} + BM_{2j,s} > PM_{t-1,2j+1} + BM_{2j+1,s})\ ?\ 1 : 0$$
$$(s \in \{0, 1, \ldots, 2^{K-1}-1\};\quad j = s \bmod 2^{K-2}) \qquad (1)$$

where K is the constraint length of the convolutional code, and $BM_{i,j}$, $PM_{t,s}$, and $SB_{t,s}$ denote the BM for the transition from state i to state j, and the PM and SB at time t and state s, respectively. $BM_{i,j}$ is calculated as the product of the information bit c and the Hamming distance hd. After the forward procedure creates the whole trellis, the trace-back process searches backward for the maximum likelihood path; the SBs along this path are taken as the decoded bits (DBs).
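To make the ACS operation of equation (1) concrete, the following sketch implements one trellis stage; the branch-metric table layout and names are our illustrative assumptions, not the paper's code:

```cuda
#define K          7
#define NUM_STATES (1 << (K - 1))   // 64 states for K = 7

// One ACS (add-compare-select) stage per equation (1): for each state s,
// the two candidate paths through predecessors 2j and 2j+1 are compared,
// the smaller path metric survives, and the survived bit is recorded.
void acs_stage(const float pm_prev[NUM_STATES],
               const float bm[NUM_STATES][NUM_STATES],  // bm[i][s]: metric i -> s
               float pm_next[NUM_STATES],
               unsigned char sb[NUM_STATES]) {
    for (int s = 0; s < NUM_STATES; s++) {
        int j = s % (1 << (K - 2));                // predecessor pair index
        float p0 = pm_prev[2 * j]     + bm[2 * j][s];
        float p1 = pm_prev[2 * j + 1] + bm[2 * j + 1][s];
        sb[s]      = (p0 > p1) ? 1 : 0;            // which branch survived
        pm_next[s] = (p0 > p1) ? p1 : p0;          // keep the minimum PM
    }
}
```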

Fig. 1. Trellis diagram of Viterbi decoder: (a) forward computing trellis; (b) traceback and decoding computing trellis.

Various convolutional codes are adopted in different wireless communication standards, varying in constraint length and code rate. In this paper, we choose the coding scheme employed by the 802.16 standard, with a constraint length K of 7 and a coding rate R of 1/2.

III. Implementation of Parallel Truncated Viterbi Algorithm on GPU

A. Parallel Mode of TVDA

By analyzing the Viterbi decoding algorithm, we find that in the forward computing trellis depicted in Fig. 1(a), the ACS operations among the states have no dependence along the state axis; these operations can be processed in parallel. Along the time axis, however, $PM_{t,s}$ depends on $PM_{t-1,s}$, so the time stages must be executed serially, one by one. Fortunately, in the backward computing procedure, all survivor paths gather at one state after a suitable trace-back length. This merging procedure is defined as the trace-back phase. We can then trace back from the merging state and obtain the decoded sequences; this process is called the decoding phase. Fig. 1(b) describes the two phases in the trellis. This is the so-called TVDA [12]. In TVDA, there are three types of phases: the forward computing phase (FCP), the trace-back phase (TP), and the decoding phase (DP). The input bit sequence is truncated into n equal chunks, and each chunk goes through the three phases. The FCP produces the sub-trellis of each chunk. The TP finds the merging state, which is the beginning state of the decoding phase. The DP generates the final decoded output bits.

There are two execution modes of TVDA, a pipelined one and a parallel one, described in Fig. 2. The pipelined mode of TVDA was proposed previously in [12] and is shown in Fig. 2(a). The arrow indicates the execution direction of the algorithm. When the ith FCP, denoted $F_i$, is completed, the (i−1)th TP, denoted $T_{i-1}$, can be processed, after which the (i−2)th DP, denoted $D_{i-2}$, can be carried out. Once the pipeline is full, the three phases are processed simultaneously in each time stage. Although the pipelined mode can achieve high decoding throughput on FPGAs, it is not suitable for GPUs. In order to fully use the arithmetic units on GPUs, we propose the parallel mode of TVDA, shown in Fig. 2(b). First, the input bit sequence is truncated into n chunks, each of which has a tail that overlaps the next chunk; the tail is used to perform the TP and find the merging state. Then the FCPs of all the chunks are executed in parallel, after which the corresponding TP and DP of each chunk are processed independently.
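As a host-side sketch of this truncation (the descriptor type and names are ours, not the paper's), each chunk covers $L_d + L_t$ stages, with the $L_t$-stage tail overlapping the head of the next chunk:

```cuda
#include <vector>

// Illustrative chunk descriptor: start stage and stage count. The tail of
// Lt stages overlaps the next chunk and is consumed only by the trace-back
// phase (TP) to locate the merging state; the final chunk has no full tail.
struct Chunk { int start; int length; };

std::vector<Chunk> truncate_sequence(int totalStages, int Ld, int Lt) {
    std::vector<Chunk> chunks;
    for (int start = 0; start < totalStages; start += Ld) {
        int length = Ld + Lt;
        if (start + length > totalStages)
            length = totalStages - start;
        chunks.push_back({start, length});
    }
    return chunks;
}
```

With this partitioning, the FCPs of all chunks can run concurrently, and each chunk's TP and DP depend only on its own sub-trellis.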


Fig. 2. Pipelined and parallel modes of the three-point Viterbi decoding algorithm: (a) pipelined mode; (b) parallel mode. ($F_i$: ith forward computing phase; $T_i$: ith trace-back phase; $D_i$: ith decoding phase.)

B. GPU-based Truncated Viterbi Algorithm

On the basis of the parallel mode of TVDA, the Viterbi decoder can be implemented by the truncation method, which partitions the trellis along the time axis. Each sub-trellis is divided into two phases: the FCP and the trace-back and decoding (TD) phase, which comprises the TP and the DP. Fig. 3 depicts the conceptual diagram of the proposed parallel Viterbi decoder on a GPU. In the diagram, $F_i$ denotes the FCP of the ith truncated sub-trellis, $ACS_{t,s}$ indicates the ACS operation at time instance t and state s, and $TD_i$ represents the trace-back and decoding phase of the ith truncated sub-trellis. To decode the bits precisely, the truncation length of $F_i$ can be set to $2L_t + 2L_d$, where $L_t$ is the number of time stages for the TP to identify the merging state and $L_d$ is the number of time stages for the decoding phase. The input signal sequence is divided into sub-sequences of length $2L_d$; a tail sequence of length $2L_t$, which is part of the next sub-sequence, is appended to each sub-sequence to perform the trace-back phase.

In the FCP, each ACS operation of a time stage of the trellis is assigned to one thread, which calculates the BM, PM, and SB values. There are 1024 threads in each thread-block, which can perform p-way parallel computation of $F_i$, where p is 1024 divided by $2^{K-1}$. The network between successive time stages is realized through the shared memory of each thread-block. Each thread writes its PM and SB values to the corresponding addresses in the shared memory and the global memory, respectively; the SB values cover all the paths in the trellis, which is too large for shared memory. After that, a lightweight synchronization occurs among the threads to ensure that all threads have completed the computing and writing operations of a time stage in the trellis; this does not influence performance. After the synchronization, the correct input PM values can be fetched from the shared memory, and the computation of the next time stage proceeds in the same way. After the FCP is completed, the TD phase of the ith truncated sub-trellis is performed in the first thread of $TD_i$. A divide-and-conquer method is employed to select the state with the minimum PM value among the states of the $(L_t + L_d)$th stage of each $F_i$ as the trace-back starting state. After tracing $L_t$ bits, the merging state is revealed. Finally, the last $L_d$ bits are the decoded bits of the ith truncated sub-trellis. The pseudo code of the proposed GPU-based parallel Viterbi algorithm is shown in Algorithm 1.
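A condensed CUDA sketch of the FCP just described follows, assuming one sub-trellis per thread-block (p = 1), one thread per state, and a precomputed branch-metric array; the kernel and its data layout are our illustrative reconstruction, not the paper's code:

```cuda
#define K          7
#define NUM_STATES (1 << (K - 1))   // 64 states

// Forward computing phase for one sub-trellis per thread-block. Each thread
// owns one state; PM values are exchanged through shared memory between
// stages, and SB values go to global memory (too large for shared memory).
__global__ void fcpKernel(const float *bm,   // per stage: 2*NUM_STATES metrics
                          unsigned char *sb, // survivor bits [chunk][stage][state]
                          int stages) {      // Ld + Lt stages per chunk
    __shared__ float pm[NUM_STATES];
    __shared__ float pmNext[NUM_STATES];

    int s = threadIdx.x;               // this thread's trellis state
    int chunk = blockIdx.x;            // this block's sub-trellis
    int j = s % (1 << (K - 2));        // predecessor pair index
    int h = s >> (K - 2);              // newest information bit implied by s
    pm[s] = 0.0f;
    __syncthreads();

    for (int t = 0; t < stages; t++) {
        const float *bmT = bm + ((size_t)chunk * stages + t) * 2 * NUM_STATES;
        float p0 = pm[2 * j]     + bmT[(2 * j) * 2 + h];
        float p1 = pm[2 * j + 1] + bmT[(2 * j + 1) * 2 + h];
        unsigned char bit = (p0 > p1) ? 1 : 0;
        pmNext[s] = bit ? p1 : p0;
        // Consecutive states write consecutive addresses -> coalesced store.
        sb[((size_t)chunk * stages + t) * NUM_STATES + s] = bit;
        __syncthreads();               // all states must finish this stage
        pm[s] = pmNext[s];
        __syncthreads();               // before anyone reads pm at stage t+1
    }
}

// Illustrative launch: one block per chunk, one thread per state.
// fcpKernel<<<nChunks, NUM_STATES>>>(d_bm, d_sb, Ld + Lt);
```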

C. Coalesced Memory Access Pattern

As mentioned above, the access latency of global memory is more than 400 GPU clock cycles, which accounts for a large share of the execution time. In the GPU-based truncated Viterbi algorithm, the ACS operation of each thread must store its SBs to the global memory in every iteration, which produces a large memory access overhead. However, the total memory access time can be reduced by a factor of 16 through the coalesced memory access pattern. In CUDA, a half-warp of 16 threads can launch one unified coalesced access to memory instead of 16 individual memory accesses. This coalesced access pattern only occurs when the data to be fetched by the half-warp are stored in contiguous memory addresses. It is therefore essential to shuffle the SB organization in the global memory to enable coalesced memory accesses, as shown in Fig. 4. The SBs are permuted into a contiguous layout, with the SBs of the 64 states stored together. The half-warp of 16 threads, $t_0$ to $t_{15}$, can then exchange data with the global memory in a single coalesced memory access.
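As an illustration of this shuffled layout (our sketch, with hypothetical helper names), a state-major index places the 64 SBs of one time stage at consecutive addresses, so the 16 threads of a half-warp touch one contiguous segment:

```cuda
// State-major SB addressing: the survivor bits of one time stage are
// contiguous, so 16 consecutive states read/written by a half-warp fall in
// one coalesced transaction.
__device__ __host__ inline size_t sbIndex(int chunk, int stage, int state,
                                          int stagesPerChunk, int numStates) {
    return ((size_t)chunk * stagesPerChunk + stage) * (size_t)numStates + state;
}

// Stage-major alternative, shown for contrast: the same half-warp would
// scatter across addresses stagesPerChunk apart, forcing 16 transactions.
__device__ __host__ inline size_t sbIndexScattered(int chunk, int stage, int state,
                                                   int stagesPerChunk, int numStates) {
    return ((size_t)chunk * numStates + state) * (size_t)stagesPerChunk + stage;
}
```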

Fig. 3. Conceptual diagram of parallelizing the Viterbi algorithm

Fig. 4. Coalesced memory access by a half-warp

D. Performance Analytical Model

Consider the proposed decoder running on a GPU with frequency $Freq$, integrating $S_m$ SMs, each of which has $S_p$ SPs. Let $N_s$ denote the number of states and n the number of divided chunks. If we assume that the execution cycles and memory access latency of an ACS operation are $E_{ACS}$ and $M_{ACS}$, respectively, while $E_T$ and $M_T$ are the processing cycles and memory access latency of a trace-back operation, then the execution time of the proposed algorithm on the GPU can be analyzed as $T_{proc}$:

$$T_{proc} = \frac{n \times N_s \times \left(E_{ACS} + \frac{M_{ACS}}{16}\right) + (E_T + M_T) \times (L_d + L_t) \times n}{Freq \times S_p \times S_m} \qquad (2)$$

$M_{ACS}$ is divided by 16 because of the coalesced memory access in the FCP. Because an extra length $L_t$ is added to each chunk to perform the TP, the execution and memory access cycles of the trace-back operation, $E_T$ and $M_T$, include parts of both the TP and the DP. The total decoding time $T_{all}$ can then be calculated by adding the data transfer time between the host CPU and the GPU to $T_{proc}$, as depicted in equation (3):

$$T_{all} = T_{host \to gpu} + T_{proc} + T_{gpu \to host} \qquad (3)$$
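For reference, equations (2) and (3) translate directly into a small host-side estimator; all names below are ours and the parameter values must be measured or estimated for a given GPU, so this is only a sketch:

```cuda
// Numerical form of equations (2) and (3). E_acs/M_acs and E_t/M_t are
// per-operation cycle counts; Freq is the GPU clock in cycles per second.
double t_proc(double n, double Ns, double E_acs, double M_acs,
              double E_t, double M_t, double Ld, double Lt,
              double Freq, double Sp, double Sm) {
    double acsCycles   = n * Ns * (E_acs + M_acs / 16.0); // coalesced FCP
    double traceCycles = (E_t + M_t) * (Ld + Lt) * n;     // TP and DP
    return (acsCycles + traceCycles) / (Freq * Sp * Sm);
}

double t_all(double tHostToGpu, double tProc, double tGpuToHost) {
    return tHostToGpu + tProc + tGpuToHost;               // equation (3)
}
```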

IV. BER Performance and Throughput Results

A. Experiment Setup

To verify the practicability of the proposed GPU-based parallel Viterbi decoding algorithm, we lay out the testing scenario depicted in Fig. 5. At the beginning, binary data bits are randomly generated by the host computer, encoded by the convolutional encoder, and then passed through the AWGN channel. In the receiver, the signal is delivered to the global memory of the GPU to perform the Viterbi decoding procedure. The output bits are sent back to the host computer to calculate the BER. In the test scenario, the CPU in the host computer is an Intel Xeon E5405 with a frequency of 2.0 GHz, and the GPU is a GTX580. There are 16 SMs, each with 32 SPs, on the GTX580, forming an architecture of 512 CUDA cores. The frequency of the GTX580 is 1.54 GHz and the size of its global memory is 1.5 GB. Each thread-block can support 1024 threads for parallel execution.



Fig. 5. Block diagram of the verification system: a binary source is convolutionally encoded, passed through the AWGN channel, received, decoded by the truncated Viterbi decoder on the GTX580 GPU, and compared with the source bits to obtain the BER.

B. Decoder BER Performance

As mentioned above, in the TVDA, the trace-back length $L_t$ determines the beginning state of the decoding procedure and affects the bit error ratio of the Viterbi decoder. To evaluate the BER performance of our decoder, we measure its BER under the AWGN channel for three trace-back lengths $L_t$: two, four, and six times the constraint length. The decoding length $L_d$ is set to 64. The BER performance of the proposed decoder is shown in Fig. 6. As shown in the figure, the BER decreases as $L_t$ increases. The BER stabilizes when the trace-back length rises above 42, which is about six times the constraint length K. The BER of the decoder is close to the theoretical BER performance of the MATLAB simulation, which proves that our decoder is practical.


Algorithm 1 GPU-based parallel Viterbi decoding algorithm
Input: the input bits c and the Hamming distances hd
Output: the decoded bits DB
Initialization: in each block, allocate shared memory PM, nextPM; in each thread k, set PM_k = 0; nextPM_k = 0; p = 16; i = 0; BM1 = 0; BM2 = 0; s = 0
1:  for thread-block b = 0 to ⌊n/p⌋ parallel do
2:    for thread in x dimension tx = 0 to p − 1 parallel do
3:      i = b × p + tx
4:      for thread in y dimension ty = 0 to 63 parallel do
5:        j = ty mod 2^(K−2)
6:        for t = 0 to L_d + L_t − 1 do
7:          Load c_(i+2t) and c_(i+2t+1) from the global memory
8:          Load hd_(2j) and hd_(2j+1) from the constant memory
9:          k = tx × 64 + ty
10:         BM1 = c_(i+2t) × hd_(2j), BM2 = c_(i+2t+1) × hd_(2j+1)
11:         nextPM_k = min{PM_(2j) + BM1, PM_(2j+1) + BM2}
12:         SB_(t,ty) = (PM_(2j) + BM1 > PM_(2j+1) + BM2) ? 1 : 0
13:         Store SB_(t,ty) in the global memory
14:         Synchronize
15:         PM_k = nextPM_k
16:       end for
17:     end for
18:     Reduce to fetch the state s_min with the minimum PM among the final-stage states 0 to 63
19:     s = s_min
20:     for t = L_d + L_t − 1 down to 0 do
21:       j = s mod 2^(K−2)
22:       Load SB_(t,s) from the global memory
23:       if t < L_d then
24:         DB_t = SB_(t,s)
25:       end if
26:       s = (SB_(t,s) = 0) ? 2j : 2j + 1
27:     end for
28:   end for
29: end for


Fig. 6. BER performance of our decoder for various trace-back lengths under the AWGN channel

C. Decoder Throughput

To measure the processing time of the proposed Viterbi decoding algorithm on the GPU, we use the profiler provided by the NVIDIA Corporation. In the experiment, the total number of decoded bits is 16384 (32 × 512). The 16384 bits are divided into several truncated blocks, which are performed on $N_b$ thread-blocks on the GPU. For each value of $N_b$, various $L_d$ lengths are tried to find the minimum execution time. $L_d$ lengths of 32, 64, 128, 256, and 512 are evaluated, which produce 16-, 8-, 4-, 2-, and 1-way parallel computing of $F_i$ in a thread-block. $L_t$ is set to 14, which achieves good BER performance. Fig. 7 shows the processing time of the proposed Viterbi decoder for different values of $N_b$ and $L_d$. In Fig. 7, we find that the minimum execution time is achieved when $L_d$ is 64 and 8-way $F_i$ are performed in parallel. This is due to the balance between the computation capacity of the CUDA cores and the usage of registers and shared memory in the thread-blocks.

Fig. 7. Processing time of the proposed decoder for various decoding lengths

D. Performance Comparison

In order to show the advantage of our decoder, related GPU-based Viterbi decoding algorithms [3][8] are listed for comparison. The algorithm in [8] adopts only one thread-block to compute the FCP, in which each thread calculates one ACS operation; the inverse procedure is performed in a single thread. The algorithm in [3] divides the bit sequence into blocks, which are distributed to multiple thread-blocks in a way similar to ours. In the inverse phase, however, all the paths of the multiple thread-blocks are merged, and the decoded sequences are produced by one thread, similar to [8]. The algorithm we propose is a fully parallel truncated Viterbi algorithm: the entire trellis is partitioned into multiple sub-trellises, and the forward and trace-back phases of each sub-trellis are executed independently. To compare our decoder with the previous GPU implementations, we realize all three types of Viterbi decoders, including ours, on the GTX580 with the same optimizations, such as the coalesced memory access pattern and the use of shared memory in each thread-block. The resulting performance of the three algorithms on the GTX580 GPU platform is shown in Fig. 8. The CPU in the figure is an Intel Xeon E5405 with a frequency of 2.0 GHz. The peak throughput of our Viterbi decoding algorithm is 67.148 Mbps, a 36.0x speedup relative to the sequential C implementation on the 2.0 GHz CPU. The throughput of our decoder is on average 3x and 2x higher than that achieved in [8] and [3], respectively. The power consumption of the proposed Viterbi decoder on the GTX580 is 56 W, measured as the difference in power between running the decoding program and idling; the total power of the host computer with the GTX580 is 126 W. Compared to the CPU, the proposed GPU-based decoder achieves about a 36x speedup with about 80% additional power consumption.

Fig. 8. Performance comparison of various GPU-based Viterbi decoders (throughput in Mbps, and speedup versus the CPU, [8], and [3], over the number of thread-blocks)

V. Conclusion

This paper proposes a GPU-based fully parallel truncated Viterbi decoding algorithm. We divide the input bit sequence of the Viterbi decoder into several truncated blocks to perform TVDA. The truncated blocks are distributed to multiple thread-blocks on the GPU to perform the decoding procedure in parallel in SIMD mode. Each thread-block can decode several truncated blocks in parallel. In each truncated block, the forward path metric computation and the trace-back procedure are executed independently, which maximizes the parallelization capability of the GPU. The experiments show that the throughput of the proposed GPU-based truncated Viterbi decoder reaches up to 67.1 Mbps, a 36.0x speedup relative to a sequential C implementation on a 2.0 GHz CPU and a 1.2x-3.6x performance improvement over existing GPU-based implementations. The results show that the proposed Viterbi decoder can be used in base stations or mobile stations for 3G or 4G communication.

Acknowledgment

This work was supported by the National Science Foundation of China (61125201).

References

[1] W. Tuttlebee, "The software defined radio: enabling technologies," Chichester: Wiley, 2002.
[2] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture Programming Guide version 4.0," 2011.
[3] C. Lin, W. Liu, W. Yeh, L. Chang, W. Hwu, S. Chen, and P. Hsiung, "A Tiling-Scheme Viterbi Decoder in Software Defined Radio for GPUs," in WiCOM'11: 2011 7th International Conference on Wireless Communications, Networking and Mobile Computing, (Wuhan, Hubei, China), pp. 1-4, 2011.
[4] M. Wu, Y. Sun, G. Wang, and J. Cavallaro, "Implementation of a High Throughput 3GPP Turbo Decoder on GPU," Journal of Signal Processing Systems, vol. 65, no. 1, pp. 171-183, 2011.
[5] M. Wu, Y. Sun, S. Gupta, and J. Cavallaro, "Implementation of a High Throughput Soft MIMO Detector on GPU," Journal of Signal Processing Systems, vol. 64, no. 1, pp. 123-136, 2011.
[6] F. J. Martinez-Zaldivar, A. M. Vidal-Macia, A. Gonzalez, and V. Almenar, "Tridimensional block multiword LDPC decoding on GPUs," Journal of Supercomputing, vol. 58, no. 3, pp. 314-322, 2011.
[7] J. Kim, S. Hyeon, and S. Choi, "Implementation of an SDR system using graphics processing unit," IEEE Communications Magazine, vol. 48, no. 3, pp. 156-162, 2010.
[8] C. Ahn, J. Kim, J. Ju, J. Choi, B. Choi, and S. Choi, "Implementation of an SDR platform using GPU and its application to a 2x2 MIMO WiMAX system," Analog Integrated Circuits and Signal Processing, vol. 69, no. 2, pp. 107-117, 2011.
[9] S. Gronroos, K. Nybom, and J. Bjorkqvist, "Complexity analysis of software defined DVB-T2 physical layer," Analog Integrated Circuits and Signal Processing, vol. 69, no. 2-3, pp. 131-142, December 2011.
[10] W. Plishker, G. F. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall, "Applying graphics processor acceleration in a software defined radio prototyping environment," in RSP'11: Proceedings of the International Workshop on Rapid System Prototyping, (Karlsruhe, Germany), pp. 67-73, May 2011.
[11] P. Horrein, C. Hennebert, and F. Petrot, "Integration of GPU computing in a software radio environment," Journal of Signal Processing Systems, vol. 69, no. 1, pp. 55-65, October 2012.
[12] R. Li, Y. Dou, Y. Lei, S. Ni, and S. Guo, "Design and Implementation of the Parameterized Multi-Standard High-throughput Radix-4 Viterbi Decoder on FPGA," IEICE Transactions on Communications, vol. E95-B, no. 5, pp. 1602-1611, May 2012.
