Reconfigurable Memory Controller with Programmable Pattern Support

Tassadaq Hussain, Miquel Pericàs, and Eduard Ayguadé
Barcelona Supercomputing Center
{thussain, miquel.pericas, eduard.ayguade}@bsc.es

Abstract. Heterogeneous architectures are increasingly popular due to their flexibility and high performance per watt. One kind of heterogeneous architecture, the reconfigurable system-on-chip, offers high performance per watt through reconfigurable logic and flexibility through its multiprocessor cores. But in order to achieve these performance goals it is necessary to provide enough data to the accelerators. In this paper we describe a programmable, pattern-based memory controller (PMC) that aims at improving the performance of heterogeneous or reconfigurable SoC devices. The supported patterns include scatter-gather and strided 1D, 2D and 3D accesses. PMC can prefetch complete patterns into scratchpads that can then be accessed either by a microprocessor or by an accelerator. As a result, the microprocessors and accelerators can focus on computation and are relieved of having to perform address calculations. PMC has been implemented and tested on an ML505 evaluation board using the MicroBlaze softcore as the platform's microprocessor. While PMC adds some latency, it improves performance by offloading the processor and by making better use of the available bandwidth. PMC provides a 1.5x speed-up with the processor alone and a 27x speed-up with a hardware accelerator in a PMC-based SoC environment while executing a thresholding application.
1 Introduction
Multiprocessor systems-on-chip (MPSoCs) with accelerated architectures are increasingly popular due to their short design time, flexibility, computational performance and high performance per watt. While traditional microprocessor cores are considered easy to program, fixed-function units enable performance beyond what is possible with an ISA-programmed microarchitecture. Texas Instruments OMAP (Open Multimedia Application Platform)[1] is an application-processor SoC platform that integrates different cores, such as the ARM Cortex-A8 superscalar microprocessor core and cores developed by Texas Instruments. AMD Geode[2] is a series of x86-compatible SoC microprocessors and I/O companions produced by AMD and targeted at the embedded computing market. Besides MPSoC architectures, an architecture that is growing in popularity is the so-called Reconfigurable System-on-Chip, which combines processor cores with configurable logic, allowing the user to synthesize reconfigurable accelerators on the chip die. The Xilinx Extensible Processing Platform[3] is an example of a Reconfigurable System-on-Chip: it is based on ARM's dual-core Cortex-A9 MPCore processors and Xilinx's 28nm programmable logic, and it takes a processor-centric approach by defining a complete processor system. AMD Fusion[4] is a new approach to designing heterogeneous systems and their software. It delivers powerful CPU and GPU capabilities for high-performance applications in a single-die processor called an APU. In order to meet performance and power goals, integration of accelerators and microprocessors is required. However, availability of high-performance accelerators is of no use if the memory hierarchy is unable to provide the necessary bandwidth. Correctly managing memory access across the set of accelerators and microprocessors in such a scenario is thus performance-critical, but it is also very challenging. This paper introduces a memory controller based on high-level data patterns in order to simplify the programming of SoC applications while ensuring high performance and efficiency. Other high-level programmable memory controllers have been researched in the past. This field of research is strongly tied to that of memory prefetchers. Basic patterns that have been exploited by prefetchers include vectors with constant strides and linked-list traversals[5]. Dynamic prefetching[6][7] is a good approach when the processor is designed to run a wide variety of workloads that are unknown at design time. A tighter mapping between a memory controller and an application can be achieved by a software-managed memory controller attached to a scratchpad memory. The main proposal of this paper is a hardware implementation of a programmable memory controller (PMC) that takes a data access description (descriptor blocks) and provides streaming data. In order to reduce data access times and improve performance, PMC provides the following features:

– The PMC system accesses a Structure of Arrays (SoA) in Array of Structures (AoS) format with the help of strides.
– Multiple descriptors are used to access complex streams (AoS or SoA) without generating address-synchronization delay for non-contiguous memories.
– A high-speed source-synchronous interface is provided that can easily be interconnected with any generic hardware accelerator.
– A scratchpad-memory interface can be used by a microprocessor without modifying the existing system.
This paper discusses the architecture of the memory controller and its implementation on a Xilinx Virtex-5 (ML505 development board), along with the necessary glue to attach it to MicroBlaze processors or ROCCC-generated accelerators[8]. The programmable memory controller is based on a scatter-gather memory controller and can be programmed directly from C programs using a special-purpose interface, by first programming the controller and then issuing send() and receive() calls. To evaluate the architecture we implemented a thresholding algorithm and executed it on both variants of the architecture, as well as on a MicroBlaze-based version that lacks the programmable memory controller.
2 Proposal
To exemplify the functionality of the PMC, this section explains its data access patterns and architecture. A conventional memory controller's minimal descriptor contains the source and destination addresses and the size of the transfer, with unit-stride access, which is not efficient for accessing complex memory patterns. This is shown in Figure 1. A DMA engine manages one stream of data. If the access pattern is complex, then the task of the
Fig. 1: Generic DMA Data Access
Fig. 2: PMC Data Access
DMA becomes significantly more complex. Accessing non-contiguous memory locations generates delay while computing addresses. A system with a PMC unit can access non-contiguous memory locations with the help of stride and jump functionality. To do so, PMC uses multiple descriptor blocks to access the complex data stream. Streams can go from non-contiguous memory to a contiguous address space, and vice versa. Figure 2 shows the data access pattern of the PMC. Channel 0 performs a contiguous access (Data[n]) with unit stride and stream size m. A variable jump is used between channels to access contiguous data from different memory locations. Channel 1 accesses diagonal elements (Data[n+stride]) with stride n between two consecutive data locations and stream size m. Our initial PMC implementation, shown in Figure 3, accesses DDR2 memory [9]. This involves multiple descriptors. The minimum set of parameters for a single descriptor block is shown in Figure 4. Command specifies whether to read or write a single datum or a stream of data. The address parameters give the starting addresses of the source and destination memory locations. The PMC system contains four main units:

– The Front-End Interface
– The Pattern Controller
– The Stream Controller
– The Memory Controller
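The stream expansion performed by a channel can be sketched in C as follows. This is a minimal software model, not the hardware implementation; the function names and the word-granularity addressing are illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one PMC channel: expands a (start, stride,
 * stream-size) descriptor into the address sequence the channel issues.
 * Addresses are counted in 32-bit-word units for clarity. */
size_t pmc_channel_addresses(uint32_t start, uint32_t stride,
                             uint32_t stream, uint32_t *out)
{
    for (uint32_t i = 0; i < stream; i++)
        out[i] = start + i * stride;   /* next address = previous + stride */
    return stream;
}

/* A jump between channels re-bases the next channel's start address,
 * letting several channels together cover one non-contiguous pattern. */
uint32_t pmc_jump(uint32_t prev_start, uint32_t jump)
{
    return prev_start + jump;
}
```

With unit stride this reproduces the Channel 0 access of Figure 2; a stride of n on the next channel, reached via a jump, reproduces the Channel 1 diagonal access.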
Fig. 3: Internal Architecture of Controller
Command | Source address | Destination address | Transfer Size | Strides
Fig. 4: Descriptor Structure of the Memory Controller
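One way to picture the Figure 4 descriptor block in software is the following C structure. The field names and uniform 32-bit widths are illustrative assumptions, not the exact register layout:

```c
#include <stdint.h>

/* Illustrative C view of the Figure 4 descriptor block.
 * Field names mirror the figure; widths are assumptions. */
typedef struct {
    uint32_t command;   /* read/write, single or stream            */
    uint32_t src_addr;  /* starting address in source memory       */
    uint32_t dst_addr;  /* starting address in destination memory  */
    uint32_t size;      /* transfer (stream) size in words         */
    uint32_t stride;    /* distance between consecutive accesses   */
} pmc_desc_t;

/* Example: a descriptor that gathers every 4th word of a vector
 * (strides must be multiples of four, per Section 2.3). */
static pmc_desc_t make_strided_read(uint32_t src, uint32_t dst,
                                    uint32_t words)
{
    pmc_desc_t d = { /*command=*/0, src, dst, words, /*stride=*/4 };
    return d;
}
```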
2.1 The Front-End Interface
The Front-End Interface provides PMC support for different systems:

– High-Speed Source Synchronous Interface
– Processor Local Bus Interface

High-Speed Source Synchronous Interface The Source Synchronous Interface shown in Figure 3 is used to supply high-speed data to hardware accelerators. A synchronous handshaking protocol is applied to request and grant data, and data transfer is performed according to the physical memory clock.

Processor Local Bus Interface The Scratch Pad Controller shown in Figure 3 provides the interface between the LMB and the Scratch Pad Memory (a BRAM block). The scratchpad memory subsystem consists of the Descriptor memory and the Buffer memory. The Descriptor memory feeds descriptors to the Pattern Controller unit in linked-list fashion; this reduces descriptor request/grant time and eliminates the additional resynchronization time required to access non-contiguous memories. The Buffer memory temporarily holds data while it is being moved to/from physical memory.

2.2 Pattern Controller
The Pattern Controller is the top unit shown in Figure 3; it communicates with external processing units. The Pattern Controller takes a descriptor block from an external source and feeds it to the channel controller. The channel controller unit manages multiple descriptors. A single channel takes one descriptor from the channel controller and generates a stream. Figure 2 shows how different channels are combined to access a non-contiguous stream. Multiple descriptors are used when the application needs to fetch data in the form of complex patterns.

2.3 Stream Controller
The Stream Controller comes after the Pattern Controller, as shown in Figure 3. This unit is responsible for transferring data between physical memory and the hardware accelerator according to the programmed descriptor. Salient features of the Stream Controller are shown in Table 1. The Stream Controller contains two main units:

– Data Management Unit (DMU)
– Address Management Unit (AMU)

The Data Management Unit (DMU) The DMU comprises the Data-in and Data-out units shown in Figure 3. It enables the data stream to be written to the appropriate physical memory by generating the write-enable signal along with the write-data and mask-data signals. It supports data streams of up to 1024 elements, each a 64-bit word.
Table 1: Stream Controller Features

                        Register Width   Range Minimum   Range Maximum
Processing Units        8 bit            1               256
Channels                8 bit            1               256
Stride                  8 bit            4 addresses     256 addresses
Stream (32-bit word)    8 bit            32              2048
The Address Management Unit (AMU) The AMU manages the Start Address, Stream and Stride units shown in Figure 3. It takes two clocks to program the AMU. Strides between two consecutive accesses are handled by the AMU without generating delay or latency. In each stream, the first data transfer uses the address taken from the descriptor unit; for the rest of the transfers, the address is equal to the address of the previous transfer plus the stride. The AMU supports a stream size of up to 1024 contiguous memory elements with one descriptor. Supported strides must be multiples of four.

2.4 Memory Controller
A modular DDR2 SDRAM controller [9] is used to access data in physical memory. The DDR2 SDRAM controller provides a high-speed source-synchronous interface and transfers data on both edges of the clock. It allows designs to be ported easily and also makes it possible to share parts of the design across different types of memory interfaces.
3 Evaluation of Stand-alone PMC
For the evaluation of PMC we used the architecture shown in Figure 3 with the source-synchronous interface. Detailed examination of controller connectivity and functionality was done with hand-written test cases. To verify maximum bandwidth and speed, different access patterns were executed on the stand-alone PMC controller. The results of these patterns are shown in Tables 2 and 3.

3.1 Data Access Patterns (AoS, SoA)
It has been found that most HPC applications favor operating on the SoA format [10]. The PMC system provides a way to access SoA data in AoS format without generating any delay. AoS data access requires unit stride, whereas SoA access requires strided access.
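As an illustration, gathering one field of an AoS layout into a contiguous SoA buffer is exactly a strided access with stride equal to the record size. The following C sketch models in software what PMC does with one strided descriptor (the type and function names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* AoS record: three fields interleaved in memory. */
typedef struct { int32_t x, y, z; } point_t;

/* Software model of a strided PMC gather: pulls field `y` of every
 * record into a contiguous buffer, i.e. reads an SoA view out of
 * AoS storage.  The stride equals sizeof(point_t). */
void gather_y(const point_t *aos, size_t n, int32_t *soa_y)
{
    const uint8_t *base = (const uint8_t *)&aos[0].y;
    for (size_t i = 0; i < n; i++)
        soa_y[i] = *(const int32_t *)(base + i * sizeof(point_t));
}
```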
Fig. 5: Structure of Array Access Pattern
Table 2: Clocks taken by Different Write Streams

Transfer Type            Number of Words (32 bit)   Number of Clocks   Total Clocks (+ Latency)
Single                   1 byte - 8 bytes           4                  4-8
Minimum Stream           32                         11                 11
Stream-8 (AoS or SoA)    64                         19                 19
8x8 (matrix)             512                        131                131

Table 3: Clocks taken by Different Read Streams

Transfer Type            Number of Words (32 bit)   Number of Clocks   Total Clocks (+ Latency)
Minimum Stream           1 byte - 8 bytes           2                  32
Single Stream            64                         10                 42
2D Stream (AoS or SoA)   128                        18                 58
The stride is determined by the size of the working data structure. Figure 5 shows three different access patterns (x[n], y[0][n], z[n][n]) of an n × m matrix: x[n] is a contiguous row access with unit stride, y[0][n] is a column access with stride equal to the row width n, and z[n][n] is a diagonal SoA pattern whose stride is the row width plus the unit stride.
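The three Figure 5 patterns therefore reduce to three stride choices over a row-major matrix. A small sketch, with addresses in element units and helper names that are assumptions for illustration:

```c
#include <stdint.h>

/* Strides for the three Figure 5 patterns over a row-major n x m
 * matrix, with `width` the row width (addresses in element units). */
uint32_t row_stride(void)            { return 1; }          /* x[n]    */
uint32_t col_stride(uint32_t width)  { return width; }      /* y[0][n] */
uint32_t diag_stride(uint32_t width) { return width + 1; }  /* z[n][n]:
                                       row width plus unit stride      */

/* Element address of the i-th item of a strided pattern. */
uint32_t pattern_addr(uint32_t start, uint32_t stride, uint32_t i)
{
    return start + i * stride;
}
```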
3.2 Testing and Verification
To test the functionality of PMC, hand-written HDL test patterns are used to program it. These test patterns read and write data to/from physical memory in the different formats shown in Tables 2 and 3. PMC is synthesized using Xilinx ISE 11 [11] for the Virtex-5 ML505 board with a XC5VLX110T FPGA. On the Virtex-5 family, PMC can run with a 260 MHz clock and consumes 2457 flip-flops and 1602 LUTs.
4 Evaluation of a PMC-based SoC using a Test Application
To evaluate and analyze PMC functionality we use a simple thresholding algorithm. Thresholding is a straightforward image-segmentation algorithm. It takes a grayscale image with 8-bit pixel depth and converts it into a binary image. Our thresholding application takes individual 8-bit pixels from a 256 × 256 image, shown in Figure 6(a). If the value of a pixel is greater than the threshold value, binary 1 is saved to the new pixel; otherwise 0 is saved. The new image, shown in Figure 6(b), has the same 256 × 256 dimensions with 1-bit pixel depth. The thresholding application is executed on the following architectures:

– MicroBlaze Stand-Alone
– MicroBlaze with PMC
– MicroBlaze with PMC and Reconfigurable Hardware Accelerator
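The kernel itself is a one-comparison-per-pixel operation. A C sketch of the application follows; for clarity the 1-bit output is stored one byte per pixel here, whereas the paper packs 1-bit pixels:

```c
#include <stddef.h>
#include <stdint.h>

/* The thresholding kernel described in the text: an 8-bit grayscale
 * pixel becomes binary 1 if it exceeds the threshold, else 0. */
uint8_t threshold_pixel(uint8_t pixel, uint8_t threshold)
{
    return pixel > threshold ? 1 : 0;
}

/* Apply the kernel to a whole image buffer of n pixels. */
void threshold_image(const uint8_t *in, uint8_t *out, size_t n,
                     uint8_t threshold)
{
    for (size_t i = 0; i < n; i++)
        out[i] = threshold_pixel(in[i], threshold);
}
```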
Fig. 6: (a) 256x256 Gray Scale Image taken from [12] (b) 256x256 binary Image
Fig. 7: Microblaze Based System
4.1 MicroBlaze Stand-Alone
A generic MicroBlaze SoC system accesses physical memory using the Multi-Port Memory Controller (MPMC) and the Processor Local Bus (PLB). PLB is a multi-master bus that is ideal for connecting external peripherals to the MicroBlaze processor core. PLB peripherals encounter issues such as arbiter priority levels, traffic congestion on the buses, and bus-protocol translation. The resulting latency in accessing physical memory can have a negative impact on performance; this is removed by adding PMC.

4.2 MicroBlaze with PMC
To benefit from the embedded processor inside the FPGA, an architecture is proposed with PMC and the Scratch Pad Controller in a MicroBlaze system, shown in Figure 7 as Approach A. In this system, MicroBlaze first programs a descriptor block via the Scratch Pad Controller, which then reads/writes data between BRAM memory and physical memory. MicroBlaze reads data for computation from the BRAM using the Processor Local Bus (PLB) and, after computation, writes the data back to the BRAM to be saved in physical memory. The salient parts of the proposed architecture are:

– Dual-Port Memory Controller
– Scratch Pad Memory Controller
– Pattern-Based Memory Controller (PMC)
– Programmable Hardware Accelerator
Dual-Port Memory Controller The dual-port memory controller is employed to access the Scratch Pad memory. One port is dedicated to the MicroBlaze processor and the second port serves the Scratch Pad Controller. The dual-port memory architecture permits data accesses on the system side to occur in parallel with those on the PMC side. Additionally, this data memory may be used with a variety of cache-coherency techniques or policies.
Table 4: Memory-Mapped PMC Descriptor Registers

Number    Register Name   Type (Read/Write)   Offset Address   Description
0         dst_add         r/w                 0x0000           Buffer Memory start address
1         src_add         r/w                 0x0004           Physical memory start address
2         cmd             r/w                 0x0008           Single or stream read/write
3         stream          r/w                 0x000c           Stream size
4         stride          r/w                 0x0010           Stride between two consecutive addresses
5         ready           r                   0x0014           PMC ready
6         intr1           r                   0x0018           Soft Interrupt 1
7         intr2           r                   0x001c           Soft Interrupt 2
8 to 31   reserved        -                   -                Reserved for future use
Scratch Pad Memory Controller The purpose of the Scratch Pad Memory Controller is to program the descriptor blocks of PMC via MicroBlaze and to provide BRAM access to PMC. The Scratch Pad Controller is connected to MicroBlaze through the PLB and shares the descriptor registers of PMC. When the Scratch Pad Controller is ready, the programmer can read/write data to/from physical memory. After completion of a data transfer, an interrupt signal is generated, indicating that the PMC is ready for the next send/receive. MicroBlaze uses the following functions to configure the PMC registers:

pmc_descriptor(dst_add, src_add, cmd, stream, stride);
pmc_send();
pmc_receive();

The device-driver parameters are memory-mapped and are shown in Table 4.
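A driver-side sketch of pmc_descriptor() using the Table 4 offsets. The register-write body is an assumption; it is written against a caller-supplied base pointer so it can be exercised without hardware, whereas on the real system the base would be the memory-mapped PMC address:

```c
#include <stdint.h>

/* Memory-mapped PMC descriptor register offsets, from Table 4. */
enum {
    PMC_DST    = 0x0000, /* Buffer (BRAM) start address   */
    PMC_SRC    = 0x0004, /* Physical memory start address */
    PMC_CMD    = 0x0008, /* single/stream, read/write     */
    PMC_STREAM = 0x000c, /* stream size                   */
    PMC_STRIDE = 0x0010, /* stride between accesses       */
    PMC_READY  = 0x0014, /* PMC ready flag (read-only)    */
};

/* Sketch of the driver call from the text: fill the five descriptor
 * registers.  Offsets are byte offsets, hence the division by 4 when
 * indexing a 32-bit register array. */
void pmc_descriptor(volatile uint32_t *base, uint32_t dst_add,
                    uint32_t src_add, uint32_t cmd,
                    uint32_t stream, uint32_t stride)
{
    base[PMC_DST    / 4] = dst_add;
    base[PMC_SRC    / 4] = src_add;
    base[PMC_CMD    / 4] = cmd;
    base[PMC_STREAM / 4] = stream;
    base[PMC_STRIDE / 4] = stride;
}
```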
4.3 MicroBlaze with PMC and Reprogrammable Hardware Accelerator
A hardware-accelerator-based architecture is proposed, shown in Figure 7 as Approach B. In this approach the Scratch Pad Controller is programmed by MicroBlaze and the thresholding application runs in the hardware accelerator. The benefits of this approach are:

– The processor's computation load is reduced.
– The delay in accessing scratchpad memory via the PLB is removed.

4.4 Results and Comparison
The PMC is programmed to access a 2-D matrix (256x256x8). MicroBlaze programs 256 descriptors of PMC, each transferring 256 bytes of data. XPS (Xilinx Platform Studio) [13] is used to configure and build the hardware for the Virtex-5 ML505 board with a XC5VLX110T FPGA. The results of the three approaches, MicroBlaze Stand-Alone, MicroBlaze with PMC, and MicroBlaze with PMC and Reprogrammable Hardware Accelerator, are shown in Table 5. The column DDR2 Access contains the clocks taken by the MicroBlaze Stand-Alone architecture to read and write the 256 × 256 byte image from physical memory. MicroBlaze takes two cycles to compute the threshold for each pixel. For the MicroBlaze with PMC architecture, DDR2 Access is the number of clocks needed to read and write the 256 × 256 image between physical memory and BRAM, while BRAM to MicroBlaze is the number of clocks taken by MicroBlaze to access data in BRAM via the PLB bus. The PLB serves both instruction-memory and data-memory accesses, which increases the number of clocks needed to access BRAM. In the pipelined architecture, MicroBlaze programs PMC to work in parallel: after PMC writes the first stream (an image row from DDR2 to BRAM), MicroBlaze starts processing it. While the processor works, PMC prefetches the next stream, which hides the DDR2 memory access time for subsequent accesses. In this case the processor-to-BRAM access time is dominant. In the MicroBlaze with PMC and Reprogrammable Hardware Accelerator approach, the hardware accelerator accesses data from physical memory. Computation is pipelined with the input data stream, and computed data is saved in the BRAM buffer. The hardware accelerator has a direct connection to BRAM, so only the physical memory access time is dominant in this technique.
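The pipelined mode can be modeled in C as a double-buffered loop: while the processor consumes row i from one BRAM buffer, the PMC fills the other buffer with row i+1. A software-only sketch, in which the plain copies stand in for PMC streams and the 256-element buffer width is an assumption:

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of the pipelined PMC mode: double-buffered row
 * thresholding.  Requires width <= 256. */
void pipelined_threshold(const uint8_t *image, uint8_t *out,
                         size_t rows, size_t width, uint8_t thr)
{
    uint8_t buf[2][256];   /* two BRAM row buffers */
    size_t cur = 0;

    /* Models the first PMC stream: fetch row 0 before processing. */
    for (size_t j = 0; j < width; j++)
        buf[cur][j] = image[j];

    for (size_t i = 0; i < rows; i++) {
        size_t next = 1 - cur;
        /* The PMC would fetch row i+1 concurrently with processing. */
        if (i + 1 < rows)
            for (size_t j = 0; j < width; j++)
                buf[next][j] = image[(i + 1) * width + j];
        /* Processor (or accelerator) consumes the current row. */
        for (size_t j = 0; j < width; j++)
            out[i * width + j] = buf[cur][j] > thr;
        cur = next;
    }
}
```

In hardware the two inner loops overlap in time, which is exactly why the BRAM access (or the DDR2 access, for the accelerator) becomes the dominant term in Table 5.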
5 Related Work
A number of DMA memory controllers are available in research and industry. The XPS Channelized DMA Controller [14] provides simple Direct Memory Access (DMA) services to peripherals and memory devices on the PLB. Since the DMA engine resides on the PLB, peripherals working with it are forced to follow the PLB protocol. The Lattice Semiconductor Scatter-Gather Direct Memory Access Controller IP [15] and the Altera Scatter-Gather DMA Controller core [16] provide data transfers from one non-contiguous block of memory to another by means of a series of smaller contiguous transfers. Both cores read a series of descriptors that specify the data to be transferred. These transfers use unit strides, which are not suitable for accessing complex memory patterns. The Impulse memory controller [17] supports application-specific optimizations through configurable physical-address remapping. By remapping physical addresses, applications can control the data to be accessed and cached. The Impulse controller works under the authority of the operating system, which manages the physical addresses.
6 Conclusion
This work attacks the memory-processor data access bottleneck by proposing a programmable pattern-based memory controller (PMC). The PMC can work with any SoC architecture and stand-alone HPC kernel without modifications to the microprocessor system. The controller can be programmed from C programs using a special-purpose interface based on send() and receive() calls. Currently, in order to implement higher-level patterns, PMC uses scatter-gather commands, but as future work we are considering implementing more patterns directly in hardware. One kind of pattern that we will consider is automatic tiling. The PMC system provides support for strided access and scatter/gather, which eliminates the overhead of arranging and gathering data on the microprocessor. The PMC achieves a 27x speed-up with a programmable hardware accelerator in a PMC-based SoC environment while executing a thresholding application.

Table 5: System-on-Chip Results with Different Approaches

256 × 256 Bytes                  DDR2 Access         BRAM to MicroBlaze   Computation          Total
of Image                         Read/Write clocks   Read/Write clocks    (Threshold) clocks   clocks
Stand-alone Processor            1040882             -                    2                    1075882
PMC with MicroBlaze              38816               655375               2                    694191
Pipelined PMC                    38816               655375               2                    655475
PMC with Hardware Accelerator    38886               -                    2                    38988
This shows that the PMC-based architecture is useful for high-performance hardware accelerators. A wrapper module can be used to configure the descriptors of PMC so that it connects directly with hardware accelerators generated from high-level languages, such as ROCCC.
References

1. "Texas Instruments OMAP (Open Multimedia Application Platform)." [Online]. Available: http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigationId=11988&contentId=4638#omap4
2. Advanced Micro Devices, Inc., AMD Geode LX Processors Data Book, February 2007.
3. Keith DeHaven, "Extensible Processing Platform Ideal Solution for a Wide Range of Embedded Systems," April 27, 2010. [Online]. Available: www.xilinx.com/.../wp369_Extensible_Processing_Platform_Overview.pdf
4. "The AMD Fusion Family of APUs." [Online]. Available: http://www.fusion.amd.com/
5. Amir Roth and Gurindar S. Sohi, "Effective jump-pointer prefetching for linked data structures," in Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA '99), vol. 27, no. 2, May 1999. [Online]. Available: http://portal.acm.org/citation.cfm?id=300989
6. Keith I. Farkas, Norman P. Jouppi, and Paul Chow, "How Useful Are Non-blocking Loads, Stream Buffers, and Speculative Execution in Multiple Issue Processors?" in Proceedings of the First IEEE Symposium on High-Performance Computer Architecture (HPCA), pp. 78–89, 1995.
7. Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," pp. 364–373, May 1990. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=134547
8. "Riverside Optimizing Compiler for Configurable Computing (ROCCC)." [Online]. Available: http://roccc.cs.ucr.edu/index.php
9. Xilinx, Memory Interface Solutions, December 2, 2009.
10. C. Gou, G. Kuzmanov, and G. N. Gaydadjiev, "SAMS multi-layout memory: providing multiple views of data to boost SIMD performance," pp. 179–188, 2010. [Online]. Available: http://doi.acm.org/10.1145/1810085.1810111
11. "Xilinx Integrated Software Environment Design Suite." [Online]. Available: http://www.xilinx.com/support/techsup/tutorials/tutorials11.htm
12. "New York Criminal Defense Blawg." [Online]. Available: http://newyorkcriminaldefenseblawg.com/wp-content/uploads/2010/10/fingerprint.jpg
13. "Xilinx Platform Studio." [Online]. Available: http://www.xilinx.com/support/documentation/dt_edk_edk11-1.htm
14. Xilinx, Channelized Direct Memory Access and Scatter Gather, February 25, 2010. [Online]. Available: www.xilinx.com/support/documentation/ip.../chan_dma_sg.pdf
15. Lattice Semiconductor Corporation, Scatter-Gather Direct Memory Access Controller IP Core User's Guide, October 2010. [Online]. Available: www.latticesemi.com/dynamic/view_document.cfm?document_id=24824
16. Altera Corporation, Scatter-Gather DMA Controller Core, Quartus II 9.1, November 2009. [Online]. Available: www.altera.co.jp/literature/hb/nios2/qts_qii55003.pdf
17. John Carter, Wilson Hsieh, Leigh Stoller, Mark Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, Ravindra Kuramkote, Michael Parker, Lambert Schaelicke, and Terry Tateyama, "Impulse: Building a Smarter Memory Controller," in Fifth International Symposium on High Performance Computer Architecture (HPCA-5), pp. 70–79, January 1999. [Online]. Available: http://www.cse.psu.edu/hpca5/