Optimizing SDRAM Bandwidth for Custom FPGA Loop Accelerators

Samuel Bayliss and George A. Constantinides
Department of Electrical and Electronic Engineering
Imperial College London
London SW7 2AZ, United Kingdom
{s.bayliss08, g.constantinides}@imperial.ac.uk

ABSTRACT
Memory bandwidth is critical to achieving high performance in many FPGA applications. The bandwidth of SDRAM memories is, however, highly dependent upon the order in which addresses are presented on the SDRAM interface. We present an automated tool for constructing an application-specific on-chip memory address sequencer which presents requests to the external memory in an order that optimizes off-chip memory bandwidth for a fixed on-chip memory resource. Within a class of algorithms described by affine loop nests, this approach can be shown to reduce both the number of requests made to external memory and the overhead associated with those requests. The data presented show a trade-off between the use of on-chip resources and achievable off-chip memory bandwidth: gains of 3.6× to 4× in the efficiency of the external memory interface can be achieved at a cost of up to a 1.4× increase in the ALUTs dedicated to address generation circuits in an Altera Stratix III device.

Categories and Subject Descriptors
B.5.2 [Design Aids]: Automatic Synthesis

Keywords
FPGA, Memory, SDRAM, Loop Transformations

1. INTRODUCTION

The number of pins available on semiconductor devices, and the data rate available on those pins, has not scaled with transistor density [1]. This, among other factors, has contributed to a 'memory wall', in which the parallelism achievable with computing devices is limited by the speed at which large off-chip memories can be accessed. On-chip memory hierarchies improve the performance of the memory system by exploiting reuse and reordering. Reusing data retained in an on-chip memory hierarchy reduces the number of requests made on the external memory interface. Where data requests can be safely reordered, the control overhead of servicing them can often be reduced, and hence the bandwidth of the external interface is used more efficiently.


The impact of this reordering can be seen when Dynamic Random Access Memories (DRAMs) are used as the external off-chip memory. DRAM devices have a hierarchical structure in which a small number of independent memory arrays (banks) are themselves composed of rows and columns. A row within a bank must be explicitly opened ('activated') before columns within it can be accessed, and closed ('precharged') before another row can be selected. While DRAM capacity has scaled in line with Moore's observations [19], increased cell density has been used to deliver increased capacity for a given silicon die area. This means that the wire lengths upon which 'precharge' and 'activation' times depend are not significantly shorter than they were ten years ago. With smaller transistors delivering shorter clock periods, the overall impact is that the timing delays associated with 'precharge' and 'activation' cycles take up an ever-increasing proportion of clock cycles on the external memory interface. The net effect is that the bandwidth efficiency of external memory interfaces has fallen with each generation of SDRAM device.

On general purpose computers, the sequence of memory requests generated by the CPU is typically unknown at design time, so the processor cache is designed to reuse and reorder data without a static compile-time analysis of the program to be executed. Complex dynamic memory controllers in modern CPUs buffer and dynamically reorder cache line-fill requests to external memory. Both the cache and the memory controller therefore contain memory and associative logic to buffer, select and service memory requests dynamically [15]. Beyond the direct area cost this imposes, caches and dynamic memory controllers make it very difficult to predict memory performance at compile time.

In this paper we ask the question: "What can be done in the memory controller to improve bandwidth efficiency without sacrificing predictability?" In essence, we aim to trade logic resources (whose availability increases with process scaling in line with Moore's observations [19]) for increased memory interface bandwidth (which does not). While this area has been explored in the context of dynamic memory controllers and caches which operate on random streams of data, we focus instead on opportunities for a statically scheduled reordering of memory requests. Such an approach means we can determine memory access latency and bandwidth exactly at compile time for each memory access. This knowledge might be exploited in a high-level synthesis flow to improve resource scheduling algorithms.

For a given application targeted for hardware acceleration on an FPGA, the structure of computation kernels often makes static analysis tractable. More specifically, for such kernels, the sequence of memory requests can often be determined at compile time.
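To make the DRAM hierarchy concrete before describing the methodology, the sketch below shows how a memory controller might split a flat byte address into bank, row and column fields. The field widths (4 banks, 8K rows, 1K columns) are assumptions for this illustration, not taken from the paper or any particular device.

#include <stdint.h>

/* Illustrative address split; field widths are assumptions.
 * Two accesses whose 'row' fields differ but whose 'bank'
 * fields match force a precharge/activate pair between them. */
typedef struct { unsigned bank, row, col; } sdram_addr;

static sdram_addr split_addr(uint32_t a)
{
    sdram_addr s;
    s.col  =  a        & 0x3FF;   /* bits  9:0  : column */
    s.bank = (a >> 10) & 0x3;     /* bits 11:10 : bank   */
    s.row  = (a >> 12) & 0x1FFF;  /* bits 24:12 : row    */
    return s;
}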

In this paper, we give a methodology grounded in a formal computation framework (the Polyhedral Model [16, 20]) for reordering memory requests for such kernels. We present results which are specific to hardware implementation, in that they exploit the reconfigurability of the device by creating a custom memory architecture for a specific application, and also take advantage of the abundance of arithmetic logic components available on modern FPGAs to build sophisticated controller structures. The key contributions are:

• A parametric representation of the set of SDRAM rows accessed by a memory reference within a loop nest as the integer points enclosed in a convex polytope, avoiding the need for enumeration.

• An efficient procedure to optimize the size of such a parametric representation, corresponding to optimizing the FPGA resources dedicated to constructing an address generator.

• A method of code generation from the optimized set description which minimizes the number of row swaps performed by, and the number of commands issued to, the SDRAM memory controller.

• A method for producing an efficient and flexible pipelined memory address sequencer.

• An evaluation of the SDRAM bandwidth efficiency improvements achieved by such an approach, considering separately the improvements attributable to the reuse of data and the reordering of data requests to external memory.

• An evaluation of the cost of that transformation in terms of additional logic required in address generation and on-chip data storage.

We have implemented our method as a complete design flow that reads in a subset of C, extracts a polyhedral representation and generates a Verilog implementation of an optimized SDRAM address sequencer. We would encourage readers to try the flow via the web interface, available at http://cas.ee.ic.ac.uk/AddrGen.
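As an illustration of the kind of input the flow accepts, the toy kernel below is an affine loop nest: every array subscript is an affine function of the loop counters, so the set of addresses touched can be described as the integer points of a polytope. The array sizes and loop bounds are invented for this sketch and do not come from the paper.

/* A toy affine loop nest. The writes to B stride by N bytes
 * and so scatter across SDRAM rows, making the access order
 * a first-class concern for bandwidth. */
#define N 64
char A[N][N], B[N][N];

void transpose(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[j][i] = A[i][j];
}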

2. BACKGROUND

2.1 SDRAM Memory

DRAMs are designed for high yield and low cost in manufacturing and have densities which exceed competing technologies. However, the design trade-offs needed to achieve these characteristics mean that much of the burden of controlling DRAM memory access falls on an external memory controller.

SDRAM memories store data as charged nodes in a dense array of memory cells. An explicit 'activate' command sent to the memory selects a row within the array before any reads or writes can take place. This is followed by a sequence of 'read' or 'write' commands to columns within the selected row, and finally by a 'precharge' command, which must be issued before any further row can be activated. The physical structure of the memory device determines the minimum time which must elapse between the issuing of 'precharge', 'activate' and 'read'/'write' operations. The on-chip memory controller is responsible for ensuring these timing constraints are met.

When reading consecutive columns within a row, DDR SDRAM memories can sustain a data rate of two words per clock cycle. For a 200 MHz DDR2 device with an 8-bit wide data bus, this means 400 MBytes/s. However, the overhead of 'precharge' and 'activate' commands, and the delays associated with their respective timing constraints, mean that this peak bandwidth is rarely achieved. In the worst case, where single bursts are requested from different rows within the same bank, memory bandwidth is reduced to 20% of its peak rate.
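The shape of this penalty can be reproduced with a back-of-envelope model, sketched below. The timing values are illustrative DDR2-like parameters in 200 MHz clock cycles (assumptions for this sketch, not the paper's measured figures); a burst of four words occupies two clock cycles on a DDR bus.

#include <stdio.h>

/* Toy efficiency model: each row change costs a precharge
 * period (tRP) plus an activate-to-read delay (tRCD); only
 * the burst cycles transfer data. Parameter values assumed. */
int main(void)
{
    const int tRP = 3, tRCD = 3, burst_cycles = 2;
    for (int bursts = 1; bursts <= 64; bursts *= 4) {
        int data = bursts * burst_cycles;
        double eff = (double)data / (tRP + tRCD + data);
        printf("%2d bursts per open row: %3.0f%% of peak\n",
               bursts, 100.0 * eff);
    }
    return 0;
}

The model understates the true worst case (it ignores constraints such as tRAS and tRC, which push the single-burst figure towards the 20% quoted above), but it captures the trend: clustering more bursts into each open row recovers most of the peak bandwidth.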

2.2 Previous Work

High performance memory systems are a goal across the spectrum of computing equipment, and a large body of work exists which seeks to improve cache performance ([13] provides a comprehensive review). Most of this work assumes the sequence of addresses is randomly (but not necessarily uniformly) distributed and describes optimizations of dynamic on-chip structures to exploit data locality.

Where scratchpad memories have been used within a memory hierarchy, there are examples of static analysis to determine which specific memory elements are reused. Of particular note is the work of Darte et al. [11] and Liu et al. [18]. These two works both explore data reuse using a polytope model. One develops a mathematical framework to study the storage reuse problem; the other applies the technique in the context of designing a custom memory system implemented on an FPGA, but without considering the impact that ordering has on performance. The 'Connected RAM' (CoRAM) methodology presented in [10] is notable in that, in common with our approach, it seeks to decouple communication and computation threads within an application. Their methodology lacks a compilation framework to extract communication threads from high-level descriptions and optimally schedule those threads to access external memory; our work presented here is a step towards the realization of that goal.

Prior work focused more specifically on SDRAM controllers can be divided into work that seeks to optimize the performance of random data streams using runtime on-chip structures, and work that uses static analysis techniques on kernels of application code. A review of different dynamic scheduling policies for SDRAM controllers can be found in [23]; the results show that none of the fixed scheduling policies is optimal across all benchmarks, providing a motivation for our approach of using FPGA programmability to pursue an application-specific approach. The method presented in [2] describes a memory controller that guarantees an allocation of bandwidth and bounded access latency to many requestors in a complex SoC. A set of short templates is defined which optimizes bursts of data, and time-slots are allocated using a dynamic credit-controlled priority arbiter. The static approach we demonstrate in this paper enables fine-grained scheduling of datapath operations that is difficult using a dynamic approach.

Other static compile-time approaches to improving SDRAM efficiency can be found in [17], where different data layouts are used to improve efficiency in an image processing application. A block-based layout of image data is proposed rather than a traditional row-major or column-major layout, and a Presburger arithmetic model is used to estimate the number of 'precharge'/'activate' operations required in the execution of a video benchmark. Their results show 70-80% accuracy compared to simulation results and achieve up to 50% energy savings. In [9], a strategy is proposed for allocating arrays to different memory banks to hide the latency of row activation. Their heuristic approach assumes each logical row of an allocated array fits within an SDRAM row, an assumption that is likely to be restrictive in handling large data-sets.
While our proposed methodology does not consider bank allocation directly, we believe it is complementary to the concept demonstrated in [9]: by reordering memory accesses to cluster together accesses to the same row, SDRAM rows which are accessed consecutively can be allocated to different banks with a simple permutation of address bits, as sketched below. To the best of our knowledge, our work is the first to propose static analysis of loop nests for developing hardware address generators for application-specific SDRAM-optimized memory reordering and data reuse.
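One possible form of such a permutation is sketched below; it is our own illustration, not a scheme specified in [9] or in this paper. Assuming 2 bank bits and 10 column bits, taking the bank select from the low bits of the row index maps consecutively accessed rows r and r+1 to different banks, so their activations can be overlapped.

#include <stdint.h>

/* Hypothetical bank-interleaving permutation (field widths
 * assumed): the low two row bits become the bank select. */
static uint32_t interleave_banks(uint32_t row, uint32_t col)
{
    uint32_t bank = row & 0x3;   /* low two row bits -> bank */
    uint32_t r    = row >> 2;    /* remaining row bits       */
    return (r << 12) | (bank << 10) | (col & 0x3FF);
}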

char A[56];
for (x1 = 0; x1
