Optimizing Memory Access Latencies on a Reconfigurable Multimedia Accelerator: A Case of a Turbo Product Codes Decoder

Samar Yazdani¹, Thierry Goubier², Bernard Pottier¹, and Catherine Dezan¹

¹ Université de Bretagne Occidentale, Lab-STICC, UMR CNRS 3192, Brest 29200, France
[email protected]
² CEA, LIST, Embedded Real-Time System Foundations Laboratory, Mail Box 94, F-91191 Gif-sur-Yvette Cedex, France

Abstract. In this paper, we present an implementation of a turbo product codes (TPC) decoder achieved on a novel Reconfigurable Multimedia Accelerator (RMA). The RMA is based on the principle of hierarchical shared memory storage managed through a dedicated local controller, favoring high data throughput while squeezing round-trip memory latencies. The mapping methodology facilitates the characterization of the RMA for a TPC decoder in terms of the communication and computation resources.

1 Introduction

In this paper, we present the implementation of a Turbo Product Codes (TPC) decoder on a novel reconfigurable multimedia accelerator (RMA), weakly coupled to an application processor platform. The RMA relies on explicit multithreading and streaming intrinsics [6,7] to mitigate the effects of memory latency and bandwidth limitations in a unified manner. A mix of static and dynamic control reduces the overheads associated with purely dynamic control. Several stream processors with differing characteristics have emerged, including the Cell Broadband Engine [4] and Merrimac [2]. On the other hand, the Tera MTA [1] and Niagara [5] are multithreaded architectures. The RMA couples explicit multithreading and streaming intrinsics through a dedicated local controller in a unified manner. The dramatic increase in integrated circuit capacity makes sophisticated on-chip error control methods possible. Turbo product codes have good error correction performance and are promising candidates for the FEC (Forward Error Correction) scheme in 4G. We consider the implementation of the decoder algorithm presented in detail in [3]. The approach achieves block-level parallelism by pipelining prefetch/execute/poststore stages in a modular Reconfigurable Multimedia Accelerator (RMA). The RMA associates a process-level execution with a set of local memory banks fed from the off-chip storage. Multiple memory banks provide the bandwidth needed for the local computations. Given a set of communication and computation processes, shared data dependencies are managed using a synthesized local controller. This controller coordinates fine-grain atomic memory accesses, reducing computational latencies compared to architectures in which data circulate through communication channels. Furthermore, a complete mapping methodology is presented that facilitates the characterization of the RMA for the TPC decoder in terms of communication and computation resources.

The outline of the paper is as follows: Section 2 describes the SoC platform considered in this work. The target architecture, the RMA, is explained in Section 3. Section 4 describes the mapping of a parallel specification to the RMA. Section 5 is dedicated to the design space exploration and the results obtained.

2 System-on-Chip Architecture

The platform under consideration is a distributed multiprocessor architecture with shared memory and an interconnect network supporting out-of-order and split transactions. The host processor runs the operating system, and the sub-systems attached to the host perform the compute-intensive tasks of multimedia applications. Each sub-system has a DSP core with data and instruction caches, and a reconfigurable multimedia accelerator (RMA) coupled to the sub-system DSP by its internal bus, and to the system memory through a master port of the interconnect. There are two modes of communication with the RMA: one is memory-to-memory through the interconnect, and the other is through shared memory with the host DSP. The RMA, once configured to run an application, performs stand-alone processing of blocks of data.

3 Reconfigurable Multimedia Accelerator - RMA

In the RMA, the compute-intensive kernel is mapped spatially to the FPGA, while the application control, such as data sequencing and address generation logic, is mapped to the data-transport engine. The streams of data consumed and eventually produced by the FPGA through its primary I/O ports are handled by a DMA attached to each port, which reads from and writes to a local buffer memory. In a steady state, the FPGA can access a new data item on each primary port at every clock cycle, so the maximum throughput can be sustained to feed the FPGA. The streaming engine has been designed to meet these high throughput requirements. It is composed of a local memory and a set of address generation units that drive the DMAs attached to the FPGA ports. A multibanked, multiported memory block is used to store data prefetched from system memory, and to store temporary results of computations from the FPGA. This local buffer minimizes traffic on the system interconnect, masks bus latencies and keeps the compute efficiency of the RMA high.
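To make the latency-masking role of the local buffer concrete, the short sketch below estimates the sustained demand at the FPGA ports and the number of words that must already be buffered locally to hide one off-chip round trip. It is an illustration only: the port count and the DRAM round-trip latency are assumed values, not figures from the RMA design.

```c
/* Illustrative estimate of FPGA port demand and the local-buffer depth
 * needed to mask one DRAM round trip; the port count and latency are
 * assumptions, only the clock comes from the reference parameters. */
#include <stdio.h>

int main(void) {
    const double f_clk_hz    = 100e6; /* RMA clock (see Table 1)             */
    const int    ports       = 2;     /* assumed number of primary ports     */
    const int    word_bits   = 5;     /* one 5-bit symbol per port per cycle */
    const int    dram_rt_cyc = 200;   /* assumed DRAM round-trip latency     */

    double demand_bps  = (double)ports * word_bits * f_clk_hz;
    int    depth_words = ports * dram_rt_cyc;  /* words in flight to hide it */

    printf("sustained FPGA demand : %.0f Mbit/s\n", demand_bps / 1e6);
    printf("min. buffered words   : %d (to mask %d cycles)\n",
           depth_words, dram_rt_cyc);
    return 0;
}
```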


Fig. 1. An instance of Reconfigurable Multimedia Accelerator (task controller with context register files, AGUs and DMA in the data transport engine, multibanked local memory, DRAM banks, and the FPGA-based compute engine)

The DMAs of the FPGA ports are under the control of the address generation units (AGUs), whose task is to update the DMA burst descriptors. A burst descriptor is a structure that contains the start address, the increment and the last address of the memory bank under access, according to the application requirements. The AGUs are single-issue machines with an instruction memory and a data memory, and they communicate with each other through a set of shared registers that are read/write accessible to all the AGUs, for instance to implement synchronization protocols. [3] describes how to implement the Data Transport Engine (DTE) in an interleaved scheme with the right number of pipeline stages and hardware duplication, so as to allow one pipelined DTE to work at the same time on block i and block i-1.
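For illustration, such a burst descriptor and its update by an AGU could look as follows; the field and function names are hypothetical and are not taken from the RMA register map.

```c
/* Hypothetical sketch of a DMA burst descriptor and its update by an AGU. */
#include <stdint.h>

typedef struct {
    uint32_t start_addr;  /* first address of the burst in the memory bank */
    int32_t  increment;   /* address step between consecutive accesses     */
    uint32_t last_addr;   /* final address of the burst                    */
} burst_descriptor_t;

/* Shared registers through which the AGUs synchronize (conceptual). */
extern volatile uint32_t sync_regs[8];

/* Program the next burst, e.g. row-major (step 1) for a transmit or
 * column-major (step = row length) for a receive; assumes len >= 1. */
static void agu_update(burst_descriptor_t *bd,
                       uint32_t base, int32_t step, uint32_t len)
{
    bd->start_addr = base;
    bd->increment  = step;
    bd->last_addr  = base + (len - 1) * (uint32_t)step;
}
```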

3.1 Execution Model

The execution model is composed of distinct pipeline stages, described as follows:

Prefetch (PF): a packet-bus transaction to transfer data from DRAM to the local memory storage.
Transmit (TX): a burst write access to feed compute nodes inside the compute engine.
Receive (RX): a burst read access to rearrange processed data in the local memory storage.
Poststore (PS): a packet-bus transaction to write the processed data back to DRAM.
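The following self-contained sketch shows one possible software-pipelined schedule for these stages. In the RMA the stages of different blocks run concurrently on the data transport and compute engines, so the sequential calls only make the block-level dependencies explicit; the stage bodies are placeholders.

```c
#include <stdio.h>

/* Placeholder stage bodies; in the RMA these map to packet and burst
 * transactions driven by the data transport engine and the FPGA kernel. */
static void prefetch(int b)  { printf("PF  block %d\n", b); } /* DRAM -> local memory */
static void transmit(int b)  { printf("TX  block %d\n", b); } /* local memory -> FPGA */
static void receive(int b)   { printf("RX  block %d\n", b); } /* FPGA -> local memory */
static void poststore(int b) { printf("PS  block %d\n", b); } /* local memory -> DRAM */

int main(void)
{
    const int n_blocks = 4;
    /* Software-pipelined schedule: while block b is computed (TX/RX),
     * block b+1 is prefetched and block b-1 is written back. */
    for (int b = -1; b <= n_blocks; b++) {
        if (b + 1 < n_blocks)         prefetch(b + 1);
        if (b >= 0 && b < n_blocks) { transmit(b); receive(b); }
        if (b - 1 >= 0)               poststore(b - 1);
    }
    return 0;
}
```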

4 Mapping Decoder Algorithm to the RMA

The decoding scheme used to map Mini-Maxi to the RMA is shown in figure 2. The architecture makes full use of a single block decoder by interleaving the decoding of two blocks. While block i is being read from the input buffers and processed by the block decoder (one pair of decoded vector/received vector at a time), block i-1 is being decoded and written to the output buffers (one pair of decoded vector/received vector at a time).

Fig. 2. Decoding scheme (prefetch packet transaction into memory banks n, n+2, n+4, ...; row-major burst transmit to an elementary decoder (half-iteration); column-major burst receive into banks n+1, n+3, n+5, ...; iterations; poststore packet transaction for data image reconstruction)

Fig. 3. Abstractions in the form of concurrent processes that are mapped to an RMA template (communication threads as prefetch/execute/poststore µ-components, scratchpad memory banks, the FPGA computation thread, and the controller dispatching tasks on synchronization events)

Once the decoding of a block is finished, the output buffers are exchanged with the input buffers, and the block decoder is ready to start the next half-iteration for this block (see figure 2). In this scheme, we maintain a full pipeline while coping with the necessary rebuild of the matrix between two half-iterations. Figure 3 illustrates the virtualization of the underlying hardware in terms of communication and computation threads. Each thread is abstracted as a component that has an exact mapping to an underlying architectural constituent. A communication thread is a set of three atomic tasks (prefetch, execute and poststore). Each communication thread is associated with a memory bank. The execution context of each communication thread is maintained by the controller, which scoreboards the execution history. Furthermore, the controller program is synthesized automatically from a runtime system [6] that synchronizes the task dispatch, and hence the data dependencies, and allows concurrent shared memory accesses. The execution is traced for different data inputs, and the architecture configurations are downloaded into the RMA at boot time.
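As an illustration of the buffer exchange between half-iterations, the sketch below decodes a single eBCH(32, 26)² block. The types and the half_iteration helper are hypothetical; the interleaving of two blocks on one elementary decoder and the synchronization of the communication threads are left to the controller and are not modelled here.

```c
/* Illustrative sketch of the input/output buffer exchange between
 * half-iterations for one 32x32 block of 5-bit quantized symbols. */
#include <stddef.h>

#define N 32                                    /* 32x32 symbol matrix */
typedef struct { signed char sym[N][N]; } block_t;

/* One half-iteration: decode every row (or column) of `in` and write the
 * updated, transposed matrix into `out`. Body omitted in this sketch. */
void half_iteration(const block_t *in, block_t *out);

void decode_block(block_t buf[2], int iterations)
{
    block_t *in = &buf[0], *out = &buf[1];
    for (int h = 0; h < 2 * iterations; h++) {  /* two half-iterations per iteration */
        half_iteration(in, out);
        /* exchange roles: the freshly written buffer feeds the next half-iteration */
        block_t *tmp = in; in = out; out = tmp;
    }
}
```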

5 Results - Design Space Exploration

Different implementation alternatives have been considered for the TPC decoder algorithm. As the TPC decoder requires high computational capacity and data bandwidth, a tradeoff between the number of elementary decoder instances and memory banks has been considered. Configurations with 2 banks and 1 elementary decoder, 4 banks and 1 elementary decoder, 4 banks and 2 elementary decoders, and 22 banks and 11 elementary decoders have been studied. Table 1 lists the application and architecture parameters used for the exploration, while Table 2 shows the results obtained for the different architecture templates. The performance is measured according to three criteria, namely execution latency, communication/computation overlap and throughput. Due to its iterative nature, the algorithm is demanding not only in memory bandwidth but also in computational capacity. It can be seen from the results (see Table 2) that the performance (throughput) increases significantly with the number of memory banks and elementary decoders. The results are obtained for the reference algorithm explained in [3] for an eBCH(32, 26)² product code. We exploit the block-level parallelism of an input data image of 16×16 blocks. Each block represents 32×32 symbols. A symbol is coded with 5 bits.

Table 1. Application and architecture reference parameters

  Code                   eBCH(32, 26)²
  Reference algorithm    Mini-Maxi [3]
  Quantization bits      5
  Number of iterations   5
  Memory block size      1024×5 bits
  Clock frequency        100 MHz

Table 2. Results obtained for the different architecture configurations

                               2 banks,          4 banks,          4 banks,           22 banks,
                               1 elem. decoder   1 elem. decoder   2 elem. decoders   11 elem. decoders
  Execution latency (cycles)   9014.6K           8992.2K           4512.2K            851.4K
  Comm/Comp ratio              1.0626            1.06135           1.0602             1.0505
  Throughput (Mbit/s)          14                14                29                 153

Each memory block used in the architecture has a granularity of 1024×5 bits. The throughput is obtained for a clock frequency of 100 MHz.
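As a cross-check, the throughput figures of Table 2 can be reproduced from the execution latencies, assuming the throughput is counted over the full 5-bit-symbol data image (16×16 blocks of 32×32 symbols) at the 100 MHz clock; the small differences are due to rounding in the table.

```c
/* Back-of-the-envelope reproduction of the Table 2 throughput figures. */
#include <stdio.h>

int main(void) {
    const double bits     = 16.0 * 16 * 32 * 32 * 5;  /* ~1.31 Mbit per data image */
    const double f_clk    = 100e6;                    /* Hz                        */
    const double cycles[] = { 9014.6e3, 8992.2e3, 4512.2e3, 851.4e3 };

    for (int i = 0; i < 4; i++)
        printf("config %d: %.1f Mbit/s\n", i + 1,
               bits / (cycles[i] / f_clk) / 1e6);     /* 14.5, 14.6, 29.0, 153.9   */
    return 0;
}
```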

6 Conclusions

In this paper, we have shown that multiple shared memory buffers are useful to mask off-chip memory access latencies and to increase the data bandwidth. It is further shown that concurrent memory accesses improve the application performance while relying on optimal synchronization, i.e. one that follows the dataflow. For the case of the Turbo Product Codes decoder, it has been shown that by scaling the number of memory banks from 2 to 22, the processing bandwidth improves from 14 Mbit/s to 153 Mbit/s. This scaling also demonstrates that the RMA mapping method provides a design space exploration capability to adapt the RMA dimensions to a set of target applications.

References

1. Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., Alverson, R., Smith, B.: The Tera computer system. In: Proceedings of the 4th International Conference on Supercomputing (ICS 1990), Amsterdam, The Netherlands. ACM, New York (1990)
2. Dally, W.J., Labonte, F., Das, A., Hanrahan, P., Ahn, J.-H., Gummaraju, J., Erez, M., Jayasena, N., Buck, I., Knight, T.J., Kapasi, U.J.: Merrimac: Supercomputing with streams. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, pp. 35–35 (November 2003)
3. Goubier, T., Dezan, C., Pottier, B., Jégo, C.: Fine grain parallel decoding of turbo product codes: Algorithm and architecture. In: 5th International Symposium on Turbo Codes and Related Topics (September 2008)
4. Gschwind, M.: Chip multiprocessing and the Cell Broadband Engine. In: CF 2006: Proceedings of the 3rd Conference on Computing Frontiers, pp. 1–8. ACM, New York (2006)
5. Kongetira, P., Aingaran, K., Olukotun, K.: Niagara: A 32-way multithreaded SPARC processor. IEEE Micro 25(2), 21–29 (2005)
6. Yazdani, S., Cambonie, J., Pottier, B.: Coordinated concurrent shared memory access on a reconfigurable multimedia accelerator. Elsevier Journal of Microprocessors and Microsystems, Embedded Hardware Design (2008)
7. Yazdani, S., Cambonie, J., Pottier, B.: Reconfigurable multimedia accelerator for mobile systems. In: Proceedings of the 21st IEEE International SoC Conference, pp. 287–291 (2008)
