Efficient Dynamic Heap Allocation of Scratch-Pad Memory

Ross McIlroy
Department of Computing Science, University of Glasgow, UK
[email protected]

Peter Dickman ∗
Google Inc., Zurich, Switzerland
[email protected]

Joe Sventek
Department of Computing Science, University of Glasgow, UK
[email protected]

Abstract

An increasing number of processor architectures support scratch-pad memory – software-managed on-chip memory. Scratch-pad memory provides low latency data storage, like on-chip caches, but under explicit software control. The simple design and predictable nature of scratch-pad memories have seen them incorporated into a number of embedded and real-time system processors. They are also employed by multi-core architectures to isolate processor-core-local data and act as low latency inter-core shared memory. Managing scratch-pad memory by hand is time consuming, error prone and potentially wasteful; tools that automatically manage this memory are essential for its use by general purpose software. While there has been promising work in compile-time allocation of scratch-pad memory, there will always be applications which require run-time allocation. Modern dynamic memory management techniques are too heavy-weight for scratch-pad management. This paper presents the Scratch-Pad Memory Allocator, a lightweight memory management algorithm specifically designed to manage small on-chip memories. This algorithm uses a variety of techniques to reduce its memory footprint while still remaining effective, including: representing memory both as fixed-sized blocks and variable-sized regions within these blocks; coding of memory state in bitmap structures; and exploiting the layout of adjacent regions to dispense with boundary tags for split and coalesce operations. We compare the performance of this allocator against Doug Lea's malloc implementation for the management of core-local and inter-core shared scratch-pad memories under real world memory traces. This algorithm manages small memories efficiently and scales well under load when multiple competing cores access shared memory.
Categories and Subject Descriptors D.4.2 [Operating Systems]: Storage Management—Allocation/Deallocation strategies; D.3.3 [Programming Languages]: Language Constructs and Features—Dynamic storage management; C.1.3 [Processor Architectures]: Other Architecture Styles—Heterogeneous (hybrid) systems

General Terms Algorithms, Experimentation, Performance

∗ Work carried out while at the University of Glasgow.

1. Introduction

It is well known that processor speeds are increasing at a significantly faster rate than external memory chips. This has led to increasing costs for memory accesses relative to processor clock rate, a phenomenon sometimes known as the memory wall [23]. Since

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISMM'08, June 7–8, 2008, Tucson, Arizona, USA.
Copyright © 2008 ACM 978-1-60558-134-7/08/06. . . $5.00

most applications exhibit temporal locality in data access patterns, memory access overheads can be significantly reduced by storing an application's working set in a smaller but much faster on-chip memory. Many architectures structure this on-chip memory as a cache, where movement of data between on- and off-chip memory is under hardware control. Caches perform very well under most situations, conveniently hiding the non-uniformity of the memory system from software, while decreasing average memory access latency. However, caches suffer from a number of penalties compared with non-associative memories, such as increased silicon area and energy requirements, increased access latency, complex coherency mechanisms for multi-core systems, and lack of real-time guarantees due to potential cache misses. These factors have led a number of processor architectures to provide explicit access to on-chip memory, in the form of scratch-pad memory.

Scratch-pad memories are small but very fast and tightly coupled memories, typically ranging from 4kB to 1MB in size. Unlike caches, transfer of data to and from these scratch-pad memories is under explicit software control. They can be found in embedded and real-time processor architectures [5, 16, 20], where they have been found to not only reduce power and silicon area requirements compared to an equivalent cache, but also potentially increase run-time performance [3]. Multi-core processors, especially heterogeneous architectures [1, 16, 19], have also begun to incorporate scratch-pad memories in their designs. Multi-core architectures can employ this scratch-pad memory either as local memory accessible by only a single core, or as a shared data store that can provide high speed inter-core communication. Explicitly partitioning local and shared data in this way leads to simpler hardware design and can increase overall performance by reducing contention.
Unfortunately, scratch-pad memory places an additional burden on the software developer by being located in an address space disjoint from main memory. While it may be possible to statically allocate scratch-pad memory by hand for small systems, automatic management of this memory is required for complex systems. Automatic management of scratch-pad memory as a dynamic heap is possible; however, existing dynamic memory management algorithms are too heavyweight, in terms of memory state requirements, code complexity and execution performance. Taking these constraints into account led us to design the Scratch-Pad Memory Allocator (SMA), an unconventional algorithm specifically targeted at the management of scratch-pad memories. This algorithm uses a variety of techniques to reduce its overall footprint while still remaining effective. These include: representation of memory both as fixed-size blocks and variable-sized regions within these blocks; use of bitmap based data-structures to code the current state of the memory being managed; and exploiting the ordering of adjacent region split and coalesce operations to dispense with wasteful boundary tags.

SMA is intended as a first step in the process of managing a complex memory hierarchy which contains multiple levels of explicitly accessible memory. We do not expect application developers to be forced to explicitly manage each memory hierarchy level

through algorithms such as SMA. Instead, we envisage a runtime system which abstracts the heterogeneous memory hierarchy from the typical user. This runtime system will use annotations, inserted into the code either by the programmer or a tool, as hints to choose the most appropriate memory-level heap for a particular data allocation request. If an allocation request on the preferred heap fails, the runtime will attempt to allocate on a less optimal heap, thereby gracefully handling exhaustion of a particular memory-level heap without burdening the user with explicitly dealing with multiple heaps. In this paper we present a mechanism which enables efficient allocation of the smallest and most closely coupled memory in such a system, but leave the runtime which supports the full memory hierarchy to future work.

The key contributions of this paper are:

• a simple and lightweight algorithm for managing small areas of memory with minimal overheads;
• a technique which employs a coded bitmap to manage region splitting and coalescing without employing boundary tags;
• a scalable and lightweight concurrent implementation of this algorithm for management of shared memory by multiple cores.

Section 2 overviews related work. The SMA algorithm is presented in Section 3, with some enhancements presented in Section 4. Experimental measurements of the performance of this algorithm and a more conventional allocator are compared in Section 5. Section 6 concludes.
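The preferred-heap fallback policy envisaged above can be sketched as follows. This is a hypothetical illustration: the function names, the `try_alloc_fn` hook interface, and the two stand-in heap levels are ours, not part of SMA or any described runtime.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-level allocation hooks, ordered fastest-first. */
typedef void *(*try_alloc_fn)(size_t size);

static void *scratchpad_alloc(size_t size) {
    (void)size;
    return NULL;            /* stand-in: pretend the scratch-pad heap is exhausted */
}

static void *main_mem_alloc(size_t size) {
    return malloc(size);    /* stand-in for the main-memory heap */
}

static try_alloc_fn heap_levels[] = { scratchpad_alloc, main_mem_alloc };

/* Try each memory level in preference order, falling back gracefully
 * when a faster heap cannot satisfy the request. */
void *hier_alloc(size_t size) {
    for (size_t i = 0; i < sizeof heap_levels / sizeof heap_levels[0]; i++) {
        void *p = heap_levels[i](size);
        if (p != NULL)
            return p;
    }
    return NULL;            /* every level exhausted */
}
```

The point of the design is that callers see a single allocation entry point; which physical memory backs a given request is a policy decision hidden inside the runtime.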

2. Related Work

An extensive range of dynamic memory management strategies has been investigated over the years, many of which are reviewed by Wilson et al [22]. Typically, memory is managed by creating one or more free-lists – linked lists in which every node is a free region of memory. The memory management algorithm removes a node from a free-list in order to service a memory request. If the free-list node is larger than the requested size, it is split, with the remainder being returned to the free-list. When memory is freed, it is coalesced with any neighbouring free regions and returned to the appropriate free-list. Most memory management algorithms differ primarily in their placement strategy (how they choose the most appropriate free region for a given allocation) and the techniques used to perform splitting and coalescing of memory regions. A number of these techniques, such as binary buddy systems [13], deferred coalescing and bitmap allocation schemes [6], have inspired aspects of SMA.

Much of the work in scratch-pad memory management has eschewed dynamic management techniques due to their associated overheads, and has instead concentrated on allocating memory at compile time. Some of the earliest work on automatic scratch-pad management involves the compiler using this memory to store spilled register values [7]. This work has been extended by others to allow global variables to be allocated to scratch-pad memory by the compiler [3, 18], as well as enabling compiler allocation of stack variables to scratch-pad memory [2, 10]. However, these approaches are limited in that the allocation of scratch-pad memory cannot be modified at run-time. Use of a software caching mechanism [9] enables automatic management of scratch-pad memory while allowing the set of data stored in scratch-pad memory to change at run-time.
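As a minimal sketch of the general free-list scheme described above (first-fit placement with node splitting; the names and layout are ours, not taken from any of the cited allocators):

```c
#include <stddef.h>

/* A free-list node: each node is itself a free region of memory,
 * with its size and link stored in the region's first bytes. */
typedef struct free_node {
    size_t size;                  /* size of this free region in bytes */
    struct free_node *next;
} free_node;

/* First-fit allocation: unlink the first large-enough node; if it is
 * bigger than needed, split it and return the remainder to the list. */
void *freelist_alloc(free_node **list, size_t size) {
    for (free_node **prev = list; *prev != NULL; prev = &(*prev)->next) {
        free_node *n = *prev;
        if (n->size < size)
            continue;
        if (n->size >= size + sizeof(free_node)) {
            /* Split: the tail of this region stays on the free-list. */
            free_node *rest = (free_node *)((char *)n + size);
            rest->size = n->size - size;
            rest->next = n->next;
            *prev = rest;
        } else {
            *prev = n->next;      /* consume the whole region */
        }
        return n;
    }
    return NULL;                  /* no region large enough */
}
```

A matching `free` would link the region back onto the list and coalesce it with adjacent free neighbours; the per-region header here is exactly the kind of overhead SMA's bitmap representation is designed to avoid on small memories.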
However, this approach introduces a run-time overhead for every memory access and leads to unpredictability in memory access times, which may be inappropriate for real-time systems. Kandemir et al [12] describe an approach where the compiler inserts routines to copy stack data from main memory to scratch-pad memory if it expects the data to be accessed frequently. This allows the stack data being held in scratch-pad memory to change depending upon run-time requirements, without the memory access overheads of software caches.

Figure 1. SMA large allocation example (an initial free-block bitmap, a sequence of allocation attempts, and the final free-block bitmap).

Dominguez et al [8] extend this approach to provide support for heap data variables as well as stack variables. While dynamically sized variables are supported, only a fixed-sized portion of each variable's data is stored in the scratch-pad, with any remainder relegated to main memory. Our approach of a truly dynamic run-time heap management algorithm allows full storage of dynamically created data-structures in scratch-pad memory. We see this as a complementary technique to compiler-based allocation of scratch-pad memory, allowing a choice between the flexibility of dynamic run-time memory management and the low overhead of compile-time allocation.

3. Scratch-Pad Memory Allocator

The SMA algorithm represents memory as a coarse-grained series of fixed-size blocks, within which memory can be split into finely-grained, variable-sized regions (referred to as small regions hereafter). Large-sized memory requests are serviced by allocating a large enough contiguous sequence of free blocks to meet the requested size. The last block of a large allocation is potentially split into small regions, such that this space can be reused by small-sized memory requests. Small-sized memory requests are allocated from a series of small-region free-lists. If the allocated small region is larger than the requested size, any excess is split off and returned to the appropriate free-list. By representing memory as a series of coarse-grained blocks, the data-structure size requirements of SMA are significantly reduced, while allowing these blocks to be split into small regions, if required, alleviates the potential fragmentation this coarse memory representation could otherwise cause.

Many of the algorithms used by SMA consist of simple bitwise operations. These bitwise algorithms lend themselves to hardware implementation; there is therefore potential for a reconfigurable or custom-built embedded processor to significantly accelerate the SMA algorithm by incorporating a small number of simple custom bitwise instructions. We leave hardware acceleration of SMA to future work.

3.1 Large Block Allocation

SMA partitions all available scratch-pad memory into fixed-sized blocks. Each block is represented by a single bit in a free-block bitmap, where a set bit represents a free block, and an unset bit represents an in-use block. The block size is fixed at run-time, but can be varied at compile-time based upon memory size.

To allocate a memory region which is larger than the block size, SMA allocates enough full blocks to satisfy the allocation request (i.e., ⌈request_size / block_size⌉ blocks). If the requested size is not a multiple of the block size, the last of these allocated blocks is split up to allow smaller-sized requests to reuse the remainder of this block. This section describes the technique SMA uses to allocate the complete blocks for a large-sized request. The process of splitting the final block is the same as that described in Section 3.2.2.
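The complete-block search can be illustrated with plain bitwise operations on a free-block bitmap. The sketch below is a naive linear scan rather than SMA's shift-based search, and the function name is ours; it shows only what the search must compute.

```c
/* Return the index of the first run of `no_blocks` consecutive set
 * (free) bits in `bitmap`, or -1 if no such run exists.  A set bit
 * means "block free", matching the free-block bitmap convention. */
int find_free_run(unsigned bitmap, int no_blocks) {
    unsigned mask = (1u << no_blocks) - 1u;   /* window of no_blocks ones */
    for (int shift = 0; shift <= 32 - no_blocks; shift++) {
        if (((bitmap >> shift) & mask) == mask)
            return shift;                     /* run found at this block index */
    }
    return -1;
}
```

SMA's actual search (Listings 1 and 2) avoids testing every position by computing, from the in-use bits under the window, how far the bitmap can safely be shifted in one step.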

void *sma_large_malloc(int no_blocks) {
    void *result = MEM_START;
    int blocks_bm = memstate.b_bm;
    int mask = (1 << no_blocks) - 1;
    while ((blocks_bm & mask) != mask) {
        int shift_amount = calc_shift(mask, blocks_bm);
        blocks_bm >>= shift_amount;
        if (!blocks_bm) { /* alloc failed */ }
        result += shift_amount;
    }
    /* ... mark the allocated blocks as in-use in memstate.b_bm
       and return result (elided in the extracted listing) */
}

int calc_shift(int mask, int blocks_bm) {
    int attempt = mask & blocks_bm;
    if (!(attempt & ~(mask >> 1))) { /* is the last bit set? */
        int rem_blocks = blocks_bm & ~mask;
        return ffs(rem_blocks);
    } else {
        int in_use = mask & ~attempt;
        return fls(in_use) + 1;
    }
}

Listing 2. Algorithm used to calculate bitmap shift.

memstate.b_bm &= ∼(mask
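The kind of shift calculation Listing 2 performs can be demonstrated with a self-contained simplified variant. This is our own sketch, not the paper's code: it handles only the skip-past-in-use-blocks case and uses the GCC/Clang `__builtin_clz` builtin in place of fls.

```c
/* Compute how far `bitmap` must be shifted so that its low `no_blocks`-bit
 * window moves past the highest in-use (zero) bit currently inside the
 * window.  Returns 0 when the window is already entirely free. */
int skip_shift(unsigned bitmap, int no_blocks) {
    unsigned mask = (1u << no_blocks) - 1u;
    unsigned attempt = bitmap & mask;    /* free bits inside the window   */
    unsigned in_use = mask & ~attempt;   /* in-use bits inside the window */
    if (in_use == 0)
        return 0;                        /* window already satisfies the request */
    /* Index of the highest in-use bit, plus one: shift just past it,
     * since no run of free blocks can start at or below that bit. */
    return 32 - __builtin_clz(in_use);
}
```

Because the shift is derived directly from the in-use bits rather than by probing one position at a time, the search advances several block positions per iteration, which is what makes the bitmap scan in Listing 1 cheap.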