J Sign Process Syst DOI 10.1007/s11265-011-0639-1

Integration of GPU Computing in a Software Radio Environment

Pierre-Henri Horrein · Christine Hennebert · Frédéric Pétrot

Received: 28 September 2011 / Revised: 24 October 2011 / Accepted: 7 November 2011 © Springer Science+Business Media, LLC 2011

Abstract This article studies the integration of Graphics Processing Units (GPU) in a Software Defined Radio (SDR) environment. Two main solutions are considered, based on two levels of parallelization granularity. First, a fine grain solution is proposed, which extends existing approaches but is adapted to operations of large computational complexity. Second, an original solution based on a coarse grain approach is described, allowing better usage of the computing resources and easier parallelism extraction. For both solutions, the scheduling and communication design, the implementation, and the integration in the environment are given. Both solutions have been implemented and compared on different operation types and on multi-operation sequences. It is clearly shown that the second solution can provide performance improvements, while the first one is not adapted to SDR applications.

Keywords GPGPU · OpenCL · GPU · SDR · SIMD parallelism · GNURadio

P.-H. Horrein (B) · C. Hennebert CEA, Leti, Minatec, 17 rue des martyrs, 38054 Grenoble Cedex 9, France e-mail: [email protected] C. Hennebert e-mail: [email protected] F. Pétrot TIMA Laboratory, CNRS/Grenoble INP/UJF, Grenoble, France e-mail: [email protected]

1 Introduction

The fast development of wireless networks in the last decade has led to the emergence of many standards, requiring new implementation strategies to support them all. While network devices designed to process the signal were traditionally implemented as hard-wired hardware components, the aim now is to have more and more flexible implementations, in order to cope with radio improvements at low cost. A possible solution to reach this flexibility is Software Defined Radio (SDR). While traditional devices implement all computationally intensive tasks in hardware, thus pushing the hardware/software interface somewhere in the protocol layers, SDR devices implement as few operations as possible in hardware, quickly sampling the signal and using programmable processors to process it digitally. As can be expected, the main limitation of this promising approach is the required processing power. While optimizations could be done by designing dedicated SDR applications, SDR environments reduce the development effort and are the preferred option.

To cope with the required processing power, General-Purpose computation on Graphics Processing Units (GPGPU) is a possible solution. It allows processing of data at a very high rate, providing very high computing power, well suited to image processing algorithms. This is achieved by using Graphics Processing Units (GPU) as conventional processing units. GPGPU can increase the attainable throughput, and it can ease the task of the associated General Purpose Processors (GPP), freeing them for other applications or for protocols.

But benefiting from GPUs is not a straightforward operation. SDR environments are usually designed for


GPP architectures. GPP platforms are task parallel platforms, meaning that several tasks can run simultaneously. GPU architectures are Single Instruction Multiple Data (SIMD) architectures, designed to run a single instruction simultaneously on a large set of data. This mismatch in the computing model, as well as the framework required for GPGPU, which leads to high memory transfer overhead and centralized management, means that deploying an SDR environment on a GPU platform may not improve performance. The work presented here aims at defining and comparing software architectures suitable for using GPGPU in SDR environments.

The paper is organized as follows. In Section 2, a presentation of the GPU architecture is given, as well as a presentation of the GNURadio environment, used as an example; we also mention some relevant previous studies there. In Section 3, a general view of both considered solutions for operation design is given. In Sections 4 and 5, two integration methods of GPGPU in the GNURadio environment are described. The distributed implementation presented in Section 4 keeps the per-block approach that is usually chosen, while the centralized solution in Section 5 proposes an innovative and more relevant approach. Results of both approaches for both integration methods are given in Section 6, and conclusions are finally drawn in Section 7.

2 OpenCL Environment and Previous Work

2.1 The OpenCL Environment

The GPU environment used in this work is Open Computing Language (OpenCL) [7]. It is a unified computing environment standard, published by the Khronos Group and designed to enable heterogeneous parallel computing. OpenCL is based on a hierarchical representation of the computing platform, as presented in Fig. 1. The platform is divided between a host (a GPP) and Computing Devices. These computing devices are made of Computing Units (CU), and each of these units is built from Processing Elements (PE).
The PE are the basic blocks of an OpenCL platform. Computing Units are based on the Single Instruction Multiple Data (SIMD) model, with all the processing elements executing the same stream of instructions on a data vector. Processing in OpenCL is divided in two parts, as represented in Fig. 2:

– the kernels, which are executed on the Computing Devices,
– the host program, which is executed on the host and controls the execution of kernels on the Computing Devices.

Figure 1 OpenCL platform representation.

The data set on which a kernel must be executed is divided using an index space. The kernel is instantiated for each single element of this space; these instantiations are the work-items. Each work-item of an index space executes the same program. Work-items are organized in work-groups. A work-item thus has a global identification number (ID) in the complete index space, and a local ID in its work-group. Work-groups are submitted for execution to the CUs, and the work-items of a work-group are executed on the PEs of the associated CU. All commands to the GPU, such as memory transfer requests or kernel executions, are issued through a command queue.
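To make the index-space bookkeeping concrete, the following plain-Python sketch (not OpenCL code; the function name is ours) shows how a global index space is split into work-groups and how each work-item gets both a global and a local ID:

```python
# Illustrative sketch (plain Python, not OpenCL): an index space of
# global_size work-items is split into work-groups of local_size items;
# each work-item has a global ID and a local ID within its group.

def index_space(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item."""
    assert global_size % local_size == 0, "groups must divide the space evenly"
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            yield group_id, local_id, group_id * local_size + local_id

# Example: 8 work-items in work-groups of 4 -> two work-groups.
items = list(index_space(8, 4))
```

Each work-group would be dispatched to one CU, its work-items running on that CU's PEs.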

Figure 2 OpenCL application execution.

Possible programming models for OpenCL programs are Data Parallel and Task Parallel. Data Parallel is the preferred model, since the OpenCL software architecture is designed with a SIMD processing platform in mind. Task parallelism is achieved using a single work-item, with parallelization over the different CUs.

2.2 GNURadio Architecture

The aim of this work is to use OpenCL and its GPGPU capabilities in an SDR environment. The environment chosen here as an example is GNURadio [3], a project offering a complete framework for Software Defined Radio and signal processing applications. However, this work is applicable to other SDR environments. An SDR application can be represented as a sequence of processing blocks, with lossless FIFO communication channels between the blocks. These flowgraphs, as they are called in GNURadio, can be built using predefined blocks. GNURadio provides several elements:

– basic processing blocks, with a hierarchical classification, and templates to easily develop new blocks (these blocks are developed in C++),
– a runtime part, comprising a scheduler, communication channels between the blocks, and a flowgraph interpreter,
– input/output (I/O) libraries, offering interfaces with possible sources and sinks of the platform, such as the sound card or the Universal Software Radio Peripheral (USRP) and its extensions.

Two scheduling methods are currently implemented in GNURadio:

– the first method uses a single process, and switches from block to block,
– the second method uses one process (thread) per block, with blocking communication channels.
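The second method can be sketched in a few lines of plain Python: one thread per block, with blocking queues standing in for GNURadio's FIFO channels. This is an illustrative model only (the block function, channel names, and end-of-stream marker are our assumptions, not GNURadio's actual C++ classes):

```python
# Sketch of the thread-per-block scheduling model: each block runs in its
# own thread and blocks on its input FIFO until data (or an end-of-stream
# marker) arrives.

import queue
import threading

def run_block(fn, inq, outq):
    """Pull items from the input channel, process them, push downstream."""
    while True:
        item = inq.get()          # blocks until the upstream block produces
        if item is None:          # end-of-stream marker (our convention)
            outq.put(None)
            return
        outq.put(fn(item))

# A toy two-channel pipeline: source -> amplifier block -> sink.
src_to_amp, amp_to_sink = queue.Queue(), queue.Queue()
t = threading.Thread(target=run_block,
                     args=(lambda x: 2 * x, src_to_amp, amp_to_sink))
t.start()
for v in [1, 2, 3]:
    src_to_amp.put(v)
src_to_amp.put(None)
t.join()

out = []
while True:
    v = amp_to_sink.get()
    if v is None:
        break
    out.append(v)
```

The blocking `get` is what makes the channels act as back-pressure between blocks.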

When using GNURadio, the runtime is hidden from the developer, who only needs to instantiate the different processing blocks using the proposed API. Once all blocks are instantiated, the resulting application can be executed. The execution calls the scheduler, which then processes data according to the given flowgraph.

2.3 Previous Work

The use of GPU computing for SDR applications is not a new field of research. SDR applications are demanding, with high computing power requirements, and the increase of the throughput offered by wireless standards means that classical CPU approaches are no longer able to sustain the data rate.

A first example of an existing solution is given by Kim et al. [8]. They present a hardware platform and software architecture designed to enable GPU computation of SDR applications. This study is interesting, as it proposes a complete WiMAX modem implementation, but its main goal is to present the global architecture, not to detail the GPU management. On the implementation side, studies can be divided according to two different aims. The first one is to design specific applications or operations using the GPU. A CUDA based implementation of polyphase channelization has been proposed by Harrison et al. [4]. MIMO detection and demodulation is studied by D. Wu et al., using a 4 element wide SIMD approach [15]; given the vector width, this approach is difficult to use on a GPU, which uses bigger vectors. The same application is studied by M. Wu et al. on a GPU, using the CUDA environment from NVidia [16]. More generally, it is possible to find GPU implementations of usual SDR operations, though not specifically designed for radio applications: the FFT [1], a Turbo Code decoder [14], Low Density Parity Check (LDPC) code decoders [6], and so on. Specialized libraries are also available, for example for CUDA [10].

The second aim of a study on GPU integration in SDR applications is the use of the GPU in an SDR environment. Integration in OSSIE [12], a Software Communication Architecture [5] implementation, has been considered [9]. The study focuses on simple operations, such as sample amplification, decimation, or amplitude and frequency demodulation. A study of GPU and GNURadio was recently proposed [11]; the authors propose a model of SDR applications fit for GPU execution, as well as a possible integration of GPU operations in the environment. The use of a GPU in GNURadio has also been discussed on the GNURadio mailing list.
3 Approaches for GPU Execution

In order to use the GPU in GNURadio, two different things must be studied:

– the design of a single operation,
– the integration of this operation in GNURadio.

This section describes two possible design approaches for a GPU operation.

3.1 Why are SDR Applications Different?

An SDR application, and more specifically a GNURadio application, is a set of processing blocks. The blocks are linked together using FIFO channels, and can be represented as in Fig. 3. Each block has a processing function, designed to process arrays of values and to produce values. An example of a processing block is an N-point Fast Fourier Transform (FFT): the processing function of the block takes N-value arrays as inputs, and outputs N-value arrays representing the FFT of the inputs. The function is encapsulated in an environment specific block which depends on the environment implementation method. In GNURadio, the encapsulation is done using C++ objects. The FIFO channels are managed by the runtime part of GNURadio, which is used to run the processing functions of the blocks. When a block is initialized, the estimated number of input values required to produce a given number of output values is computed. The runtime monitors the filling of the buffers in order to call the processing function when required.

SDR applications are streaming applications, as opposed, for example, to scientific processing in which an operation may be used only once on a single vector: as long as the radio terminal is used, there are data to process. Another important difference lies in the vector width. An FFT implementation for the GPU is efficient for vectors larger than 2^18 samples [1]. According to Woh et al., the SIMD width in new standard architectures is 1024 at most [13]. Given the architecture of a GPU, such a width is not sufficient to get a throughput improvement.
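The runtime behaviour just described (estimate required inputs, then call the processing function only when the buffer is full enough) can be sketched as follows. This is a hedged plain-Python model; the class and function names (`FftBlock`, `forecast`, `work`, `try_run`) are illustrative, not GNURadio's actual API:

```python
# Sketch of the runtime behaviour: a block declares how many input samples
# it needs to produce a given number of output vectors, and the runtime
# calls its processing function only once enough input has accumulated.

class FftBlock:
    def __init__(self, n):
        self.n = n                       # N-point FFT: N samples in, N out

    def forecast(self, noutput_vectors):
        """Estimated input samples needed for the requested output."""
        return noutput_vectors * self.n

    def work(self, samples):
        """Stand-in processing function: returns vectors produced."""
        assert len(samples) % self.n == 0
        return len(samples) // self.n

def try_run(block, input_buffer, wanted_vectors=1):
    """Runtime side: only invoke the block when the buffer is full enough."""
    need = block.forecast(wanted_vectors)
    if len(input_buffer) < need:
        return 0                         # not enough data yet, keep waiting
    return block.work(input_buffer[:need])

blk = FftBlock(4)
produced = try_run(blk, list(range(8)), wanted_vectors=2)
```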

3.2 Direct Mapping Solution

The first solution is an adaptation of the approach used in OSSIE, extended to operations that require complex computations and data manipulations instead of per-sample operations. We call this approach the Direct Mapping (DM) solution. In order to implement a block using the DM approach, an efficient GPU implementation of the considered operation is required. The vectors of SDR applications are usually not wide enough to use the GPU efficiently. In order to maximize GPU utilization, several vectors are processed simultaneously, thus increasing the number of threads being executed.

Algorithm 1 FFT example using Direct Mapping

procedure execute                          ▷ Host code
    compute it_max                         ▷ required for radix-2 FFT
    compute T_opt                          ▷ scheduling threshold
    wait for T_opt vectors
    copy N × T_opt samples to GPU memory
    while it < it_max do
        set up in, out and it arguments
        execute fft_r2_kernel(in, out, it) on T_opt vectors
        it ← it × 2
    end while
    copy N × T_opt resulting samples from GPU memory
end procedure

procedure fft_r2_kernel(in, out, it)       ▷ GPU code
    butterfly_radix2(in, out, it)
end procedure

This is represented in Algorithm 1 for the FFT operation. The algorithm used is a radix-2 FFT implementation for GPU [1]. It uses log2(N) iterations (N is the vector width), each iteration executing N/2 radix-2 butterflies. In the DM approach, the kernel for the operation is a radix-2 butterfly. Better utilization is obtained by increasing the T_opt parameter, which is the execution threshold: the number of vectors required in the input FIFO before the operation can be launched.
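For reference, the iteration/butterfly structure that Algorithm 1 maps onto the GPU looks like this in plain Python. This is a generic iterative radix-2 FFT (our own sketch, not the paper's GPU code): the outer `while` corresponds to the log2(N) kernel launches, and the inner loops together apply N/2 butterflies per iteration:

```python
import cmath

# Iterative radix-2 (Cooley-Tukey) FFT: log2(N) iterations, each applying
# N/2 butterflies. In the DM solution, one iteration = one kernel launch,
# executed over T_opt vectors at once.

def bit_reverse(x):
    """Reorder input into bit-reversed index order."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

def fft_radix2(x):
    x = bit_reverse([complex(v) for v in x])
    n = len(x)
    span = 1
    while span < n:                                  # log2(N) iterations
        w_span = cmath.exp(-1j * cmath.pi / span)
        for start in range(0, n, 2 * span):
            w = 1.0
            for k in range(start, start + span):     # N/2 butterflies total
                a, b = x[k], x[k + span] * w
                x[k], x[k + span] = a + b, a - b
                w *= w_span
        span *= 2
    return x
```

Since each butterfly is tiny, launching one kernel per iteration keeps the GPU busy only if many vectors are batched, which is exactly the role of T_opt.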

3.3 Extracting Coarse Grain Parallelism

The Direct Mapping solution, while attractive in its simplicity, suffers from the following drawbacks:

– a GPU-compatible algorithm must be found and implemented to benefit from the GPU, which is not always easy or possible,
– it uses small kernels, which increases the control overhead of the operation.

In order to provide a solution to these two drawbacks, a change in the granularity of the application

Figure 3 Example application representation for GNURadio.

Algorithm 2 Coarse Grain approach

procedure execute                          ▷ Host code
    compute T_opt
    wait for T_opt vectors
    copy T_opt vectors to the GPU memory
    set up in and out arguments
    execute fft_kernel(in, out) on the T_opt vectors
    copy T_opt vectors from the GPU memory
end procedure

procedure fft_kernel(input, output)        ▷ GPU code
    output ← fft(input)
end procedure

is proposed. Instead of working on small kernels, the Coarse Grain (CG) approach uses kernels representing the entire operation. Keeping the FFT operation as an example, instead of using the butterfly as a kernel, the CG approach uses a complete FFT as a kernel. This is represented in Algorithm 2. A threshold is still required in order to fully use the GPU. While this approach may increase the latency and the required memory (higher thresholds may be required), it has three main advantages:

– it is not necessary to find a GPU-optimized version of every operation to be efficient, since the kernels implement the same algorithms as on a CPU,
– even data dependent operations can be processed and accelerated, as long as there is no inter-vector dependency. As long as the data dependency is confined to specific known vectors, those vectors can be run concurrently; on the contrary, if the data dependency is a property of the stream, meaning that every operation always depends on previous results, the approach is no longer applicable,
– the coarse grain structure of the operation means that the application can adapt to the GPU width by adapting the threshold. The time required for a single core to process a complete vector can be expected to be the same as the time required to process as many vectors as there are PEs in the GPU.
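The CG idea, including the second advantage, can be sketched in plain Python: each "work-item" runs the whole operation on its own vector, so even a sample-by-sample recursive filter parallelizes across vectors. The names (`run_kernel`, `iir_block`) and the coefficient are illustrative assumptions, not the paper's code:

```python
# Sketch of the coarse grain approach: one work-item per vector, each
# running the *complete* operation, so T_opt independent vectors can be
# processed concurrently. run_kernel stands in for an OpenCL enqueue.

def run_kernel(kernel, vectors):
    """Pretend-GPU: apply the full kernel to every vector in the batch."""
    return [kernel(v) for v in vectors]

def iir_block(v, a=0.5):
    """A data-dependent operation (first-order IIR): fine grain SIMD fails
    here, but CG still works because the vectors are independent."""
    out, y = [], 0.0
    for s in v:
        y = s + a * y        # each sample depends on the previous output
        out.append(y)
    return out

batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # T_opt = 2 independent vectors
results = run_kernel(iir_block, batch)
```

The intra-vector recursion is untouched; only the inter-vector independence is exploited.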

The latency induced by the approach can be decreased by exploiting the independence of the OpenCL computing units. Using a pipeline structure, each of the units in the GPU can be used to process a different operation. This may reduce the number of vectors required for optimal processing, while keeping the same throughput for the whole application.

3.4 Scheduling

Scheduling the different operations on the GPU is done using the threshold T_opt in the algorithm. This threshold may vary from one operation to another, and it also depends on the vector size, as will be seen in Section 6. The aim of the scheduler is to define when an operation will be executed, according to performance, latency and memory criteria. The higher the threshold, the later the operation will be executed, thus increasing the latency and the required memory, but also improving the attainable throughput.

Counterbalancing the latency increase due to vector accumulation is possible using knowledge of the structure of a transmission. Given that increasing the number of vectors being processed does not dramatically increase the processing time, it is possible to synchronize the operation execution with highly constrained packets, such as beacons. Alternatively, synchronizing operation executions on higher layer packet boundaries may completely avoid latency problems: while there would be latency at the sample level, there would be none at the packet level. Memory constraints, while important, are generally less stringent than latency constraints, except in specific applications such as broadcasting applications with no strict timing constraints, or embedded systems with strict memory constraints.
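The latency cost of waiting for T_opt vectors is easy to quantify with a back-of-the-envelope model. The numbers below are purely illustrative assumptions (they are not measurements from the paper):

```python
# Back-of-the-envelope model of the scheduling trade-off: a higher T_opt
# improves GPU utilization but delays the first kernel launch by the time
# needed to accumulate T_opt vectors of N samples each.

def buffering_latency(t_opt, n, sample_rate):
    """Seconds spent accumulating T_opt vectors of N samples."""
    return t_opt * n / sample_rate

# e.g. waiting for 400 vectors of 1024 samples at 20 MS/s
lat = buffering_latency(400, 1024, 20e6)   # about 20 ms of buffering
```

Synchronizing launches with packet boundaries, as described above, hides this buffering time at the packet level.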

4 Distributed Integration

In order to use the GPU in a GNURadio application while keeping the benefits of the framework, operations on the GPU must be integrated in the environment, regardless of the selected approach. This integration is not straightforward. Two models are considered in this study, a distributed model and a centralized one. This section focuses on the distributed integration.

4.1 General View

The main idea behind the distributed integration is to keep the per-block approach of GNURadio, and thus to provide an integration which is as transparent as possible. All OpenCL specific structures and instantiations are handled by a dedicated OpenCL class added to GNURadio, whose role is to guarantee that OpenCL structures are instantiated only once. A GPU-executed block is similar to a classical CPU block: it provides the same interface, uses the same parameters and functions, and is based on the same FIFO channels. The only difference is the processing function. Instead of implementing the actual operation, the processing function acts as the scheduler for the block. The scheduler for the GPU is thus distributed, with almost no information on the state of the other blocks. This is represented in Fig. 4a. The local scheduler is responsible for data management and control of the GPU.

4.2 Data Communication

A straightforward solution to implement communications between CPU and GPU blocks is to keep the CPU FIFO between the blocks, and let the processing function do the required transfers. This solution has two advantages. First, it uses only as much GPU memory as required, since only T_opt vectors are transferred at a time. But the main advantage is that it is possible to design an application using a mixture of CPU and GPU blocks, benefiting from the GPU computing power for operations well suited to it while keeping the CPU for operations that cannot be implemented using a SIMD model. Since the distributed integration is transparent for the final user and for the GNURadio runtime environment, this is possible. It would also be possible to use different implementations concurrently for the same operation; this possibility has been considered by Dickens et al. for dynamic operation migration [2].

The problem is that this advantage is not really an advantage. Since data transfers are a bottleneck of GPU computing, transferring data to and from the GPU in order to perform each computation on the required unit is not advisable. In order to avoid these useless transfers when a sequence of GPU operations is executed, an OpenCL FIFO must be designed. This means that in order to take the presence of the GPU into account, and to avoid useless memory transfers, the runtime must be modified. In this modified runtime, the communication channels between the blocks, which were untouched in the straightforward implementation, are modified to accommodate the GPU. Different solutions have been envisioned to implement the actual transfer. More precisely, the OpenCL standard offers two functionalities that may be used to provide an OpenCL buffer:

– an OpenCL "buffer" (an allocated memory zone) can be created using a host buffer as a basis. This can be compared to mapping host memory into the GPU, with the OpenCL runtime performing the actual transfer when necessary; ideally, it does not transfer data when not necessary. The problem with this solution is that its timing behavior is erratic, with applications sometimes very efficient and sometimes completely inefficient,
– the opposite can also be done: mapping GPU memory into the addressable range of the CPU, thus making data transfers transparent and avoiding useless transfers when the CPU does not access the memory. Once again, varying results have been observed with this implementation.

Since the OpenCL based implementations are either inefficient or unstable from a performance point of view, we have designed a new buffer.

Figure 4 SDR application using distributed GPU integration: (a) data transfer in the processing function, (b) with a dedicated OpenCL buffer.

4.3 Dedicated FIFO

Three new elements are implemented to enable the OpenCL FIFO:

– CPU to GPU transfer blocks, used to transfer data to the GPU (when the next block is a GPU block),
– GPU to CPU transfer blocks, for the opposite transfer,
– a GPU to GPU FIFO, in which data remains located in the GPU memory.

FIFOs are implemented as mixed GPU/CPU entities. Actual data storage is done in the GPU memory, while all the control needed for a FIFO implementation is managed by the CPU. The resulting application is presented in Fig. 4b, where dotted lines represent the control path and plain lines the data path. During initialization of an OpenCL buffer, two memory zones are allocated:

– the OpenCL buffer, which will actually contain the data,
– a ghost FIFO in the CPU memory, which is used to manage the OpenCL buffer.

The ghost FIFO is an index FIFO; its size depends on the basic element size. When a processing block is activated by the GNURadio runtime, it reads data from its input FIFO, processes it, and then writes data to its output FIFO. For a GPU processing block using the dedicated FIFO, the difference is that it does not read or write data in the FIFO: it reads the index at which data is available in the OpenCL buffer. Nothing is actually read or written in this ghost FIFO; its size is only used by GNURadio to keep information on the application state up to date.

The main advantage of this implementation is that no more memory copies are required. Transfers can be done asynchronously to the GPU memory, and two consecutive GPU blocks can keep data in the GPU memory instead of transferring it twice. But with this solution, mixed execution of a single block by a CPU and a GPU is no longer possible, and the target of the block must be specified when instantiating the application, which rules out dynamically switching execution from one domain to the other [2].

For both solutions, another important point is that with OpenCL, the bigger the dataset, the more efficient the GPU becomes, as can be seen in Section 6. Taking the example of an FFT, it becomes truly efficient when run on 32768 points, which is not common in radio applications.
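The ghost-FIFO mechanism can be sketched as follows: the data stays in a device buffer, while a CPU-side FIFO only carries indices into that buffer, so the host runtime can track fill levels without copying samples. All names here are ours and the device buffer is simulated with a Python list; this is a model, not the actual GNURadio/OpenCL code:

```python
# Illustrative model of the ghost FIFO: samples live in a (simulated)
# device buffer; the CPU-side FIFO only queues offsets into that buffer.

from collections import deque

class GhostFifo:
    def __init__(self, capacity):
        self.device_buffer = [None] * capacity   # stands in for GPU memory
        self.indices = deque()                   # CPU-side index FIFO
        self.head = 0
        self.capacity = capacity

    def push(self, vector):
        """'Transfer' a vector to device memory, queue only its index."""
        idx = self.head % self.capacity
        self.device_buffer[idx] = vector
        self.indices.append(idx)
        self.head += 1
        return idx

    def pop_index(self):
        """A downstream GPU block consumes an index, not the data itself."""
        return self.indices.popleft()

f = GhostFifo(4)
f.push([1, 2])
f.push([3, 4])
i = f.pop_index()          # downstream kernel would read device_buffer[i]
```

Two consecutive GPU blocks sharing such a FIFO never move the samples off the device.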

5 Centralized Integration

5.1 General View

Using distributed integration is the preferred option with regard to the GNURadio philosophy. But using the OpenCL environment means using centralized control. In order to better fit the OpenCL approach, a centralized integration can be developed. The idea behind centralized integration is to use a single GNURadio block to implement a whole sequence of GPU operations, as presented in Fig. 5. All the kernels are configured and executed in a single global GNURadio block. In the figure, a single GPU block is presented, surrounded by two CPU blocks. It is possible to generate multiple GPU blocks for different reasons. A first example is an operation impossible to compute on the GPU, such as communication with I/O ports. Another possible example is the use of a multiple GPU platform; in this case, it could be best to use one centralized block for each GPU to fully benefit from the platform.

Using the centralized integration suppresses the need for FIFOs between GPU operations. Since the kernels are executed in a single GNURadio block, and since the input/output ratio is known for each operation, it is possible to compute a schedule during the initialization, based on the optimal thresholds and block sizes. The resulting schedule defines the number of blocks being processed by each execution of an operation, as well as the number of executions and their order for each operation in a block. The centralized approach then uses this static schedule, which can be dynamically altered to accommodate high priority samples.

Figure 5 Centralized integration.
Figure 6 Test protocol.

5.2 Comparison of Both Integration Methods


The centralized approach has multiple advantages compared to the distributed approach. Since the schedule is computed during the initialization of the application, the size of the buffers is known when they are initialized: it is not necessary to implement a complex and resource consuming OpenCL FIFO to avoid useless data transfers. The centralized approach thus reduces the required memory on the GPU. The use of global blocks also eases the use of multiple GPU platforms. In order to have multiple GPUs in the distributed integration, several OpenCL classes must be instantiated, one for each GPU, and the GPU operations are then instantiated using the OpenCL structures associated with the desired GPU. In the centralized approach, whole sequences of operations are assigned to different GPUs. The drawback of this approach is that if a sequence is made of several operations that are not really efficient on the GPU, it is not possible to counterbalance the loss with very efficient operations. However, once again, load balancing between different processing devices when working with OpenCL is not recommended, because of the cost of data transfers.

The last advantage lies in the multitasking capabilities of newer GPUs, which are able to process independent programs on their different units. Using the distributed approach, it is not possible to really benefit from this capability, since a local scheduler has no information on the state of the other GPU schedulers. Since the centralized integration uses a global scheduler, it can manage the execution of different kernels on the GPU.

6 Results

6.1 Method

In this section, performance results are presented for both the DM and coarse grain approaches, as well as for both the centralized and distributed integration methods. The test platform is a workstation based on:

– an Intel Core i5 processor with 8 GB of DDR3 memory,
– an NVidia GTS450 GPU with 1 GB of GDDR5 memory.

Figure 7 Results for FFT with threshold as a varying parameter: (a) vector size 128, (b) vector size 1024.
Figure 8 Results for demapping with threshold as a varying parameter, N = 1024.
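The static schedule described in Section 5.1 (derived from each operation's input/output ratio and the chosen block size) might be computed along these lines. The operation names, ratios, and block size below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch of the static schedule computed at initialization by
# the centralized integration: given each operation's output/input ratio
# and the input block size, derive how many vectors flow through each
# kernel per pass over one block.

def build_schedule(block_size, ops):
    """ops: list of (name, out_per_in) in pipeline order.
    Returns one (name, vectors_in, vectors_out) entry per operation."""
    schedule, n = [], block_size
    for name, ratio in ops:
        n_out = int(n * ratio)
        schedule.append((name, n, n_out))
        n = n_out                      # next operation consumes this output
    return schedule

# IIR and FFT are 1:1; a QPSK demapper emits roughly one byte per four
# complex samples (a 4:1 reduction, matching the decimation in Section 6.3).
sched = build_schedule(1024, [("iir", 1), ("fft", 1), ("demap", 0.25)])
```

The global scheduler would then replay this table for every input block, optionally reordering executions for high priority samples.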


The processor is quite powerful, while the GPU is a cheap graphics card; both devices have the same advertised power consumption. In order to study the performance of the different solutions, three basic operations are used:

– an FFT, which has already been described,
– a QPSK demapper, which is a low complexity operation,
– an untapped block IIR filter, which computes the filter on a vector, with data independence between the vectors.

The GNURadio test application is presented in Fig. 6. The test applications used are single operations, and a sequence of all three operations. Each time, sufficient data is available in the file to process 10,000 vectors. The studied indicator is the attainable sample throughput, i.e. the number of samples that can be processed in a given amount of time.

6.2 Single Operations

First, the results for single operation applications are presented in Figs. 7, 8, 9, 10, 11 and 12. For single operations, it is useless to study the distributed or centralized integration methods, since these only matter for multi-operation applications. For each operation, two parameters are varied: the scheduling threshold and the vector size. The efficiency is discussed in comparison with CPU performance.

Results for the FFT are shown in Fig. 7 for two different vector sizes (128 and 1024), with the scheduling threshold as the varying parameter. The first conclusion from these results is that there really is a threshold value above which no further improvement in performance is observed. The coarse grain approach requires a higher threshold to become efficient (meaning at least as efficient as the CPU implementation), but the attainable throughput is better than for the direct mapping approach. The required threshold is also lower for wider vectors. Looking at the exhaustive results, not presented here for practical reasons, it can be seen that the threshold can be given in number of samples instead of number of vectors, in which case it stays constant.

Figure 10 Results for FFT with optimal threshold.

Results for the demapping (Fig. 8) and IIR (Fig. 9) operations are similar. There is no efficient IIR implementation using the fine grain approach, due to the data dependencies. As can be seen in Fig. 8, the

100

CPU Coarse Grain Direct Mapping

60

0 7

100

Figure 11 Results for demapping with optimal threshold.

Sample throughput (MS / s)

Sample throughput (MS / s)

100

150

0 7

1000

Figure 9 Results for IIR with threshold as a varying parameter, N = 1024.

200

CPU Coarse Grain Direct Mapping

80

CPU Coarse Grain

60 40 20 0 7

8

9 10 log2 (N)

11

Figure 12 Results for IIR with optimal threshold.

12

Sample throughput (MS / s)

J Sign Process Syst CPU Distributed Centralized

140 120 100 80 60 40 20 0

7

8

9 10 log2 (N)

11

12

Figure 13 Results for operation sequence.

demapping operation is the only one for which the Direct Mapping method is efficient. Figures 10, 11 and 12 give throughput results for different vector width using optimal threshold. The decrease in performance observed for the last value is due to the buffer size, not big enough for vectors that big. 6.3 Multitasking The results for a sequence of operations are evaluated for the coarse grain approach only, using an IIR/FFT/Demapping sequence. They are presented for different input block sizes in Fig. 13. First of all it can be seen that the improvement compared to CPU execution is significant for multitask operation. The difference between the single operation improvement and the sequence improvement can be explained by the importance of data transfer time in the GPU efficiency. The other surprising element is that attainable throughput is higher for the sequence of operations than for IIR operation. This is due to the decimation effect of the demapping. In the IIR, N × 10,000 complex elements (8 bytes) had to be transfered from the GPU to the CPU and then written to file. In the sequence, N × 10,000/4 characters (1 byte) must be transfered, since the QPSK demapper decreases the amount of data. As expected, the centralized integration performs better than the distributed approach. Since this approach also reduces required memory and GNURadio runtime overhead, it is the logical choice for GPU integration in GNURadio.
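The transfer-volume argument above can be checked with a short calculation, taking the figures from the text (here with an illustrative N = 4096; the conclusion holds for any N since the ratio is independent of it):

```python
N, vectors = 4096, 10_000

# IIR output: N * 10,000 complex samples, 8 bytes each (two float32)
iir_bytes = N * vectors * 8

# Sequence output: the QPSK demapper packs 4 complex samples into 1 byte,
# so only N * 10,000 / 4 bytes cross the GPU -> CPU boundary
seq_bytes = N * vectors // 4

print(iir_bytes // seq_bytes)  # the sequence transfers 32x less data
```

This 32:1 reduction in GPU-to-CPU traffic explains why the three-operation sequence can outperform the IIR operation alone despite doing more computation.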

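The scheduling threshold studied in Section 6.2 can be illustrated with a small dispatcher sketch: input vectors are buffered until the threshold is reached, then the whole batch is handed to the accelerator in a single call, amortizing the per-launch overhead that otherwise makes the GPU slower than the CPU. The class and names below are a hypothetical illustration of this coarse grain policy, not the paper's implementation.

```python
class CoarseGrainScheduler:
    """Accumulate input vectors and dispatch them to the accelerator
    in one batch once a threshold (in vectors) is reached.
    Hypothetical sketch of the coarse grain policy."""

    def __init__(self, kernel, threshold):
        self.kernel = kernel        # callable processing a list of vectors at once
        self.threshold = threshold  # batching threshold, in vectors
        self.pending = []

    def push(self, vector):
        self.pending.append(vector)
        if len(self.pending) >= self.threshold:
            return self.flush()
        return []                   # below threshold: keep buffering

    def flush(self):
        batch, self.pending = self.pending, []
        return self.kernel(batch)   # one "launch" for the whole batch

# Example: a stand-in "kernel" that scales every vector, threshold of 4 vectors
sched = CoarseGrainScheduler(lambda vs: [[2 * x for x in v] for v in vs], 4)
results = []
for v in ([1, 2], [3, 4], [5, 6], [7, 8]):
    results.extend(sched.push(v))
```

Below the threshold the scheduler returns nothing and the runtime would fall back to buffering (or a CPU path); above it, one batched call processes many vectors, which is the regime where Figs. 7, 8 and 9 show the GPU overtaking the CPU.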
7 Conclusion

This study has shown two possible methods to include GPUs in an SDR environment. The direct mapping method, which uses the GPU as an efficient CPU with no real integration in the environment, is inefficient and yields worse results than the CPU version. The new method proposed in this article does not rely on GPU specific optimizations for the block implementation. Instead, it focuses on processing several arrays at the same time. Since radio applications are mainly streaming applications, with the same sequence of operations applied to all arriving samples, it is possible to extract data parallelism from the application. A centralized integration method has also been presented, which allows better control over the GPU operations. The resulting SDR environment performs between 3 and 4 times better than the CPU application, using a powerful CPU and a cheap GPU.

Several improvements are considered based on this study. First of all, the scheduling of operations in the centralized integration can be improved using results from studies on Synchronous Data Flow. The implementation of efficient multitask applications using the CU independence is currently under way, and is expected to yield the same results with smaller latency and memory usage. Next, future research will focus on integration in an embedded environment. Embedded systems present constraints different from those of the system used in this study, mainly regarding power efficiency and memory/performance trade-offs. Implementation of SDR applications on embedded systems, while possible, is still limited, and is usually based on Digital Signal Processors (DSP). This raises questions on the possible benefits of GPUs:

– How will the proposed approach perform on a GPGPU enabled embedded GPU such as the new Mali T604 from ARM? Due to power limitations in embedded systems, this GPU is expected to be less powerful than a "usual" GPU. Some elements may vary, such as the number of cores, their efficiency, and the memory available to the GPU. These modifications can affect the efficiency of the proposed solution. Early specifications of the T604 have been released, but it is not yet clear how the OpenCL environment will be implemented.
– Given the strict power efficiency constraints, will the proposed approach allow an improvement in performance or in power consumption compared to DSP based implementations?


Pierre-Henri Horrein received an engineering degree (master) from Grenoble Institute of Technology in 2008. He is a PhD student at CEA Grenoble. His research interests are in the field of flexible radio and reconfigurable computing, from a software point of view.

Christine Hennebert is currently a research engineer at CEA Grenoble, France. She obtained an engineering degree (master) from ESIGELEC, France in 1992 and a PhD degree from Institut National Polytechnique de Grenoble, France in 1996. Her research work is in the field of implementation of wireless communication applications on innovative and heterogeneous platforms.

Frédéric Pétrot received the DEA (master) and PhD degrees in computer science from Université Pierre et Marie Curie (Paris VI), Paris, France, in 1990 and 1994, respectively, where he was an assistant professor in computer science. He was one of the main contributors of the Alliance VLSI CAD System and the Disydent ESL environment. F. Pétrot joined TIMA in September 2004, and holds a professor position at Ensimag, Institut Polytechnique de Grenoble, France, where, since 2007, he heads the System Level Synthesis group. His main research interests are in system level design of integrated systems, and include computer-aided design of digital systems, and architecture and software for homogeneous and heterogeneous multiprocessor systems on chip.
