A Post-Compiler Approach to Scratchpad Mapping of Code
Federico Angiolini†, Francesco Menichelli‡, Alberto Ferrero‡, Luca Benini†, Mauro Olivieri‡
† DEIS, Università di Bologna, Bologna 40136, Italy
‡ DIE, Università di Roma “La Sapienza”, Roma 00184, Italy
ABSTRACT
ScratchPad Memories (SPMs) are commonly used in embedded systems because they are more energy-efficient than caches and enable tighter application control on the memory hierarchy. Optimally mapping code and data to SPMs is, however, still a challenge. This paper proposes an optimal scratchpad mapping approach for code segments, which has the distinctive characteristic of working directly on application binaries, thus requiring no access to either the compiler or the application source code - a clear advantage for legacy or proprietary, IP-protected applications. The mapping problem is solved by means of a Dynamic Programming algorithm applied to the execution traces of the target application. The algorithm is able to find the optimal set of instruction blocks to be moved into a dedicated SPM, minimizing either energy consumption or execution time. A patching tool, which can use the output of the optimal mapper, modifies the binary of the application and moves the relevant portions of its code segments to memory locations inside the SPM.
1. INTRODUCTION
Advances in manufacturing processes are driving the semiconductor industry towards miniaturization and integration of chip designs. One side effect of this evolution is the growing relative cost of accessing off-chip components, among which external memory certainly takes one of the most prominent spots. In embedded systems and applications, where power consumption and cost play more important roles than versatility, ScratchPad Memories (SPMs) are a good alternative to caches. These are quite similar to caches in terms of size and speed (ideally one-cycle access time), but have no dedicated logic for dynamic swapping of contents. Instead, it is the designer's responsibility to explicitly map addresses of external memory to locations of the SPM. It is possible to implement both a scratchpad and a cache at the same time, exploiting the respective advantages.

When adopting SPMs, the design effort shifts to the problem of how to best map code and/or data onto them. Code and data pose very different challenges; typically, code exhibits more locality and predictability, while data may show more varied access patterns. As a consequence, data may be best handled by dynamic prefetching, while code is likely to experience noticeable improvements in performance and energy consumption even with static optimizations. Statically handled code optimizations will be the focus of this paper.

A key issue which may prevent SPM deployment is the inability to access the application source code and/or its compiler. This may happen in legacy architectures, or when porting third-party software. The approach described in this paper completely bypasses such issues by providing a post-compilation technique to automatically move code sections to a scratchpad memory area, without any modification to the application sources themselves. The design flow is based upon a first run of the target application, from which execution traces are collected; subsequently, traces are analyzed and the most frequently used code sections are flagged for SPM mapping. A polynomial-complexity algorithm is used to solve such a mapping problem. A patching utility then modifies the application binary, moving code sections to memory areas belonging to the target SPM.

The rest of this paper is structured as follows. Section 2 discusses previous work in the area of memory hierarchies, with emphasis upon scratchpad memories. Section 3 discusses the general design of our implementation and the ties to the underlying simulation platform. Section 4 describes the patching tool which edits the application binary. Section 5 focuses upon the mapping algorithm we developed. Section 6 details the experiments performed to validate the proposed approach, discussing benchmark data. Finally, Section 7 summarizes the results of our work.
Categories and Subject Descriptors C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; C.4 [Performance of Systems]: Design studies, Performance attributes; C.5.3 [Computer System Implementation]: Microcomputers; F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes
General Terms Algorithms, Performance, Design
Keywords Scratchpad Memory, Optimization Algorithm, Dynamic Programming, Embedded Design, Memory Hierarchy, Power Saving, Design Automation, Executable Patching, Post-Compiler Processing
Figure 1: A taxonomy of SPM mapping approaches.
Figure 2: Possible memory architecture with both a cache and an instruction SPM.
2. RELATED WORK
A significant amount of literature is available on the subject of memory hierarchies and, more specifically, of scratchpad memories. For example, in [1, 2, 3] some possible architectures for mid-to-high-end embedded processors with SPMs are described. Scratchpad memories, in these works, can act in cooperation with caches, either by taking the role of fast buffers for data transfers or by helping prevent cache pollution thanks to intelligent data management. Previous work on SPM mapping can mostly be split into two categories (see Fig. 1); one of them deals with the synthesis of optimal hardware to match a given application, while the second one maps applications to a given hardware platform. Examples of the first strategy are [4, 5, 6, 7], where partitioned SPMs are synthesized to optimally match target applications. While relevant energy savings can be achieved with such approaches, the need to synthesize specific hardware may constrain the applicability of these solutions. For this reason, we will take the second, software-only route in this paper. The software problem of optimally allocating critical data onto a fast but small memory has been thoroughly investigated since the early eighties, with the appearance of caches. Many software-based techniques for cache allocation have been developed in this time frame. A comprehensive review of compiler-based approaches may be found in [8]. More recently, this research has been specifically applied to SPMs. For example, the works described in [9, 10, 11] deal with allocation of data onto scratchpad memories and/or caches; such approaches are not aimed at code. Scalar and array variables are separately managed, and arrays are monolithic entities to be mapped in the addressing space. An extension of the above to take power estimation into account is made in [12], but this time with no focus upon scratchpads. Interesting ways of dealing with SPMs, mostly concerning data, are detailed in [13, 14, 15, 16, 17], which explore choices of banking, tiling, allocation and even sharing in multiprocessor environments. Power, area and speed optimization through the use of SPMs for data and also code are instead all explored in [18] and [19]. The authors report improvements above 20% in every respect at the same time: execution cycles, energy consumption and die area. In [20], the basic approach is extended towards dynamic copying of memory locations to and from the SPM; in [21], data arrays are more thoroughly taken into account; in [22], cache interactions are modeled. What must be noticed about all of these works is
that the authors rely on the support provided by a specific custom compiler (encc), and postulate the availability of source code for the target application, which are both unneeded by the methodology described in this paper. Additionally, the mapping algorithm used by the authors has exponential complexity and works with coarse-grained chunks of code and data, while we will propose an algorithm with polynomial complexity in its inputs and capable of automatically working with arbitrary grain. It is interesting to notice that, in principle, it might be possible to apply some of the above proposals as a post-compiler pass, fetching flow graphs directly from the application binary (assuming availability of symbolic information in the executable, or maybe choosing optimization strategies which do not require it). However, to the best of our knowledge, this has not been attempted before, and does not look trivial. Even if feasible, such an extension would likely have to resort to executable patching techniques very similar to the ones we are proposing in this paper, like instruction splitting. An interesting dynamic strategy to fetch data onto SPMs at runtime is discussed in [25]. The authors perform a compile-time application modification, inserting code to move variables should this be detected to be beneficial. While the paper proves that the overhead of dynamically moving blocks can be more than offset by the gains over a static mapping, this comparison revolves around data only (not code), and is done on a cacheless architecture. Additionally, the authors again assume availability of application source code. Dynamic approaches to SPM mapping of data are also outlined in [26], where software-programmable DMA engines are used at runtime to fetch memory sections.
3. DESIGN FLOW AND ARCHITECTURE 3.1 Target Hardware Architecture Our reference architecture for SPM synthesis is outlined in Fig. 2; in comparison to plain cache-based designs, all that is needed is a very simple decoder and an SRAM array. The SPM is mapped as a contiguous memory range. As we will better explain in Section 4, since jump instructions may use a compact offset in their opcode, the SPM must be located within a certain distance (32 MB for ARM) from regular code segments in the memory space. Different SPM sizes can be explored during the design stage.
3.2 Software Mapping Approach
The placing of program portions in SPM can either be statically decided once and for all, or dynamically adjusted by inserting additional functions in the application to transfer objects between the SPM and the external RAM at runtime. The second approach is potentially more scalable when taking into account huge applications, but is also more fragile - object transfers require time and an excessive number of them might actually lead to worse performance, both in energy and speed, than a static solution where the memory mapping is fixed. While data tiles may very well justify a dynamic solution, as [25, 26] among others show, the above considerations seem to retain some validity for code segments and small applications. While not ruling out the potential benefits of a dynamic solution, since our target is code segments only, in the present paper we chose to follow the simpler route of static mapping.
3.3 Simulation Platform Our work builds upon a fully parametric multiprocessor platform (see [23, 24]): a multi-core ARMv7 device, the interconnect of which is currently AMBA AHB or STBus. The number of processors is fully customizable. Every processor has access to a private memory, and an additional shared memory is available for interprocessor communication. The platform is written in SystemC, and is fully accurate at the signal and timing level. The ARM cores themselves are actually simulated via freely available C++ routines, but are transparently fitted to the SystemC infrastructure by the use of specific wrappers. The simulation platform is widely configurable in terms of memory hierarchies. Both caches (split or unified) and SPMs can be attached to system cores, with a configurable size, access latency and, for caches only, associativity. Additionally, power models are associated with every memory layer in the platform, including cache tags, based upon foundry datasheets by STMicroelectronics (see Section 6). This allows us to extract accurate energy figures for every simulation run.
3.4 Design Flow An outline of the system design flow is given in Fig. 3. Two main tools are adopted: an analysis algorithm and a patcher. First of all, if needed, the target application is compiled; an initial benchmark run of its binary is then performed on a system with ICache. Power and speed statistics are extracted, and an execution trace is collected. If the source code of the application is available, the optional use of markers allows trimming traces to span only the critical code routines instead of the entire application running time. The SPM analysis algorithm requires several inputs. Some of them are application-independent, like the size of the target SPM. In addition to those, the algorithm uses some of the results of the previous platform simulation: application traces, cache hit rates, cache refill average duration. The last two parameters are useful in establishing metrics to understand the actual profit of mapping code onto the SPM instead of using a conventional I-Cache. Eventually, an optimal set of code sections is found by the analysis algorithm and passed to the patching tool. This tool modifies the original application binary, inserting jumps, moving code to different address ranges, and possibly adapting some critical instructions. The binary is then fed to the simulation engine, and a second simulation run is launched. New power and speed metrics are collected and SPM accesses are detected, thus providing complete reporting on the efficiency of the implementation. All of the described steps can be fully scripted and automated, providing a powerful evaluation environment to identify the optimal parameter set.
Figure 3: Schematic optimization flow.
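As a concrete illustration of how the flow of Fig. 3 can be automated, the following Python sketch chains the individual steps; all tool names, configuration files and command-line options are hypothetical placeholders standing in for our simulator, trace filters, analysis algorithm and patcher, not their actual interfaces.

#!/usr/bin/env python3
# Illustrative driver for the flow of Fig. 3. Tool names, paths and flags
# below are hypothetical placeholders, not the real interfaces of the tools.
import subprocess, sys

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def explore(binary, spm_sizes=(1024, 4096)):
    # 1. Reference run on the cache-only system: collect trace and baseline stats.
    run(["./mpsim", "--config", "icache.cfg", "--trace", "ref.trc", binary])
    # 2. Filter the raw trace (jump annotation, merge by address, overhead flags).
    run(["./trace_filter", "ref.trc", "-o", "ref.flt", "--binary", binary])
    for size in spm_sizes:
        # 3. Solve the mapping problem for this SPM size.
        run(["./spm_map", "ref.flt", "--spm-size", str(size),
             "--metric", "energy", "-o", f"ranges_{size}.txt"])
        # 4. Patch the binary, moving the selected ranges into the SPM window.
        run(["./spm_patch", binary, f"ranges_{size}.txt", "-o", f"patched_{size}.elf"])
        # 5. Second simulation run to measure the actual energy/speed gains.
        run(["./mpsim", "--config", f"spm_{size}.cfg", f"patched_{size}.elf"])

if __name__ == "__main__":
    explore(sys.argv[1])

In practice the same loop can sweep SPM sizes, cache parameters and profit metrics, which is what makes the fully scripted flow a convenient evaluation environment.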
4. PATCHING TOOL The basic idea upon which our work is based is code relocation through executable code patching. When dealing with relocation of code over to a scratchpad memory, two different approaches can be followed to maintain code correctness: hardware level or software level. Hardware techniques are based on a sort of fixed logical-to-physical address translation, which re-maps some blocks of the main memory addressing space over to the scratchpad memory (e.g. see [31]). Software techniques ([18, 19]) are rather based on code manipulation to relocate code on the scratchpad memory. The pros and cons of a relocation technique can be classified using the subsequent metrics: • Minimum block size: it is the smallest size of code which can be relocated by the technique. • Space overhead: it can be assumed as the overhead of logical ports/memory locations with respect to the original configuration. • Time overhead: it is the overhead in execution time due to introduced delays or code expansion. Our software approach is based on the direct analysis and patching of the application, at the executable code level. Blocks of code are moved on instruction boundaries and branches are inserted before and at the end of the block. This strategy offers some unique advantages over existing software approaches based on compiler-level code relocation. First of all, the minimum block size is reduced to single instructions (memory words in case of 32-bit instructions). Such fine-grain block boundary results in a higher level of freedom for the optimization algorithm, which can more effectively tailor relocated code to the SPM. Another key advantage is that it does not require ad-hoc compilers and even source code availability: this makes it possible to leverage existing software development toolchains while optimizing applications in a straightforward manner, even when software developers want to protect their intellectual property by rendering the source code completely or partially (e.g. IP software libraries) unavailable.
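To make the mechanism concrete, the following Python sketch shows the basic relocation step under simplifying assumptions: the code segment is treated as a flat array of 32-bit ARM words, only the entry and exit jumps are handled (the branch and PC-relative fix-ups of Section 4.1 are omitted), and the helper names are illustrative rather than part of our actual tool.

# Minimal sketch of relocating one code block to the SPM, assuming the binary
# is a flat list of 32-bit ARM words; a real patcher works on ELF sections.

def arm_branch(source, target):
    """Encode an ARM 'B target' placed at address `source` (AL condition).
    The 24-bit field holds (target - source - 8) / 4, so the reachable
    range is about +/- 32 MB, which is why the SPM must sit near the code."""
    offset = (target - source - 8) >> 2
    return 0xEA000000 | (offset & 0x00FFFFFF)

def relocate_block(image, start, end, spm_addr, base=0x0):
    """Move words image[start:end] (word indices) to address spm_addr.
    Returns (patched main image, SPM image). `base` is the load address."""
    spm = list(image[start:end])
    # Jump back from the end of the SPM copy to the instruction after the block.
    ret_addr = base + 4 * end
    spm.append(arm_branch(spm_addr + 4 * len(spm), ret_addr))
    patched = list(image)
    # Jump-in: the first word of the original block now branches into the SPM.
    # The remaining original words become unreachable and are left in place here.
    patched[start] = arm_branch(base + 4 * start, spm_addr)
    return patched, spm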
We would like to underline why the ability to map code to SPM with an arbitrary granularity is a useful feature. Coarse code blocks do not represent good candidates; for example, when considering a function containing a loop, it is possible that the optimal SPM mapping might only take the loop into account. If the loop contains conditional statements, maybe only one portion of the loop code might be the actual useful candidate for SPM mapping. Basic blocks of code (i.e. sequential blocks without any internal incoming or outgoing branch or jump) can instead be conceptually treated as single objects, and therefore can be allocated monolithically, but such atomic blocks (taking into account loops, jumps, conditional statements and function calls) are likely to be quite small anyway. And even if basic blocks, or coarser code blocks up to the function level, proved to be an adequate approximation of the best entities to map onto an SPM, allocation with instruction granularity still has the advantage of allowing at least a partial mapping onto SPM should the available space not be sufficient to hold complete blocks. A minor downside of the proposed approach is that it is not possible to assume that such small candidate code chunks will always be introduced and followed by jumps in the original execution flow; such jumps will have to be manually introduced to link to other code. (Basic blocks do not guarantee this property, either; only functions do). Another possible downside is the increased amount of code sections to consider for SPM mapping with respect to a logical-block granularity; we believe, however, that the results in Section 6 show acceptable analysis times for a static approach.
4.1 Tool Implementation
Being based upon low-level software manipulation, the technique we used is strictly tied to the CPU instruction set. We have chosen the ARM 32-bit Instruction Set (IS) as the subject of our analysis ([32]), on the basis of its widespread adoption in embedded systems and of the availability of accurate simulators ([33]). Yet, the need to address a specific instruction set does not impact the general methodology, which can be applied with some modifications to other RISC processors. While apparently a straightforward technique, relocation by code patching conceals some non-trivial problems when applied to the real world (i.e. executable code produced by a compiler). We will briefly summarize the major steps followed by our code relocation tool:
1. Patching of branches. When a block of instructions is relocated, every branch reference in the code to an instruction in the block must be updated. Since the ARM IS uses an offset from the Program Counter (PC) value to indicate the branch destination, branches become position dependent; even branches from the relocated block which reference the outside, unmoved code must be updated. Moreover, the ARM IS, as in most RISC processors, uses a reduced offset to identify the destination; this limits code relocation to a restricted range, specifically 32 MB. (Other ISAs allow for even smaller offsets, e.g. 64 kB for MIPS; this requires workarounds resulting in more overhead for jumps. On the other hand, unless mapping is done at the function level, resulting in blocks without any references to the outside, such overhead would also be present in any compiler-based approach, though masked within the compiler.)
2. Patching of instructions with the PC register as source. When an instruction uses the PC as a source operand, it becomes position dependent and must be patched to render relocation completely transparent. There is a variety of subcases which
must be singularly resolved in order to maintain code correctness, and which can result in the insertion of new instructions or constants in the code. Three examples follow: • LDR Rx, [PC, #offset] This instruction loads in Rx a constant from the location at PC+offset. Inside of the patching tool, this location is calculated and the instruction is patched to LDR Rx, [PC, #new offset]. • MOV Rx, PC This instruction puts the value of PC in Rx, for subsequent operations. The patcher calculates the current PC and stores its value in a constant, inlining it with code (one-word space overhead). Finally it changes the instruction to LDR Rx, [PC, #offset], to load the constant. • AND Rx, PC, Ry This instruction puts in Rx the value of the bitwise AND operation between PC and Ry. Since Ry is not known, the result cannot be computed in advance. As in the previous example, the PC value is calculated and stored in a constant, then the original instruction is split into a couple of instructions (overall space overhead of two words): LDR Rx, [PC, #offset] (load in Rx the constant) followed by AND Rx, Rx, Ry.
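The following sketch illustrates how these three cases can be handled; it operates on a simplified symbolic instruction form rather than on raw ARM encodings, and the operand layout and bookkeeping are assumptions made purely for illustration.

# Sketch of the PC-relative fix-ups of Section 4.1, on a symbolic instruction
# form (mnemonic, operands). In ARM state the PC reads as the instruction
# address plus 8, which is used below when recomputing offsets.
def patch_pc_use(insn, old_pc, new_pc, lit_addr, literals):
    """Rewrite one instruction that reads the PC. `old_pc`/`new_pc` are its
    addresses before/after relocation, `lit_addr` is where the next inlined
    literal word will be placed, `literals` collects the inlined constants.
    Returns the replacement instruction list (symbolic, illustrative form)."""
    mnem, ops = insn
    if mnem == "LDR" and ops[1] == "PC":              # LDR Rx, [PC, #offset]
        target = old_pc + 8 + ops[2]                   # literal stays in place
        return [("LDR", (ops[0], "PC", target - (new_pc + 8)))]
    if mnem == "MOV" and ops[1] == "PC":              # MOV Rx, PC
        literals.append(old_pc + 8)                    # old PC value, one word
        return [("LDR", (ops[0], "PC", lit_addr - (new_pc + 8)))]
    if mnem == "AND" and ops[1] == "PC":              # AND Rx, PC, Ry
        literals.append(old_pc + 8)                    # two-word expansion
        return [("LDR", (ops[0], "PC", lit_addr - (new_pc + 8))),
                ("AND", (ops[0], ops[0], ops[2]))]
    return [insn]                                      # unaffected instruction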
4.2 Space and Time Overhead Space overhead is essentially measured by code size expansion. In our technique, code expansion is primarily due to the presence of new instructions at block boundaries (jump in and jump out instructions), to the need of splitting single instructions in two or more instructions, and to the insertion of constant values in the code section. It is important to notice here that some of the space overhead, e.g. the jump in instructions, does not directly affect SPM performance, as it concerns code located in main memory. Another interesting point is that the added instructions are jumps, not branches; this means that patching will not affect branch prediction logic, should the CPU provide any. Time overhead is intended as the increment in the number of execution cycles with respect to the original code. Time overhead is related to space overhead, since the insertion of an extra instruction brings a corresponding increment in the number of execution cycles. On the other hand, some modifications (instruction substitution) have no impact on space overhead, but may bring a time overhead due to the possibility of a difference between the execution latencies of the original and the new instruction. It must also be noticed that some of the added instructions could, in certain circumstances, have a space overhead but not a time overhead. This may be the case, for example, of remapped code blocks the first instruction of which is also the target of pre-existing jumps or branches. From the instruction flow point of view, newly added jumps to link the block could just be replacing, at least in a fraction of the executions, original branches, and therefore might involve no time penalty. Space and time overhead for the developed technique have been accurately characterized in order to form, in conjunction with adhoc processing of the application execution traces, a complete set of data used as the input of the optimization algorithm.
5. OPTIMAL SPM MAPPING ALGORITHM Most of the known software-based techniques for SPM mapping usually involve compiler support and formulate some variant of the
Knapsack Problem (KP), a well-known NP-complete problem dealing with optimally filling a fixed-size knapsack with objects all having different costs and profits (see [27]). Although NP-complete, the problem can be solved in a time which is polynomial in the number of objects and the size of the knapsack. Since the encoding length of the latter, which is the actual problem input size, is proportional to its logarithm, such a running time is exponential in the input size, and is also called pseudopolynomial. As long as the size is not huge, the problem can be solved effectively. Additionally, approximate solutions to KP can be found in polynomial time, where the order of the polynomial is inversely proportional to the quality of the approximation ([29, 30]). A key point to understand here is that the problem at hand (ScratchPad Memory Mapping Problem, SPMMP) shows at least one fundamental intrinsic difference with respect to KP: mapping a code section onto a different memory range has a space overhead due to the need for extra jumps (and possibly additional data, see Section 4), while purely assigning a memory block to an available memory space does not. This means that SPMMP involves a penalty every time ranges are mapped to the scratchpad, while a pure KP assignment does not. Some literature work ([19]) solves this issue by artificially adding such space overhead directly within the basic candidate code sections, and then managing contiguous candidates (which do not need to have a linking jump between them) by providing additional, fictitious superset candidates. The resulting problem, also being exponential in the number of candidate blocks, is more complex than KP (IP solvers are needed). To solve SPMMP, we adopt a post-compilation algorithm which manages to keep polynomial complexity in its input size. The proposed algorithm is based upon Dynamic Programming, and is directly derived from the one described in [7], to which we refer for a full description. The basic reasoning is that, based upon an execution trace of the target application, the algorithm analyzes every possible range of memory locations as a candidate for SPM allocation. For each of such ranges, a cost and a profit function are computed, quite similarly to what happens in KP. The cost function expresses the cost of mapping the range onto SPM (due to space requirements), while the profit is the advantage of moving it, and can be computed according to different metrics. Based upon such functions, the algorithm is then able to compute the optimal allocation of code onto the SPM. The algorithm has a worst-case complexity Θ(N · C²), N being the number of entries in the processed input trace (i.e. the memory footprint of the application) and C the size of the target SPM. While this polynomial complexity leads to worse execution times than KP in most cases, it is better than the exponential bound described for previous SPMMP solvers ([5]) or modified-KP solvers ([19]). The present work uses a version of the mentioned SPMMP solver in which cost and profit functions are replaced to reflect a completely different environment. While in [7] a methodology is described to physically partition an SPM and map code and data chunks onto SPM banks, the present work is aimed at allocating code sections onto a monolithic SPM by using software-only tools.
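For illustration, the sketch below shows one plausible Dynamic Programming formulation with the stated asymptotic behaviour; it is not the exact recurrence of [7]. Locations are the entries of the merged trace (Section 5.1), cost(i, j) and profit(i, j) are the functions defined in Section 5.2, and the candidate ranges tried at each location are bounded by the residual SPM capacity, which yields roughly N · C² elementary steps.

# A plausible DP sketch for the SPM mapping problem (not necessarily the exact
# recurrence of [7]): n merged-trace entries, `capacity` in SPM bytes,
# `cost(i, j)` and `profit(i, j)` as in Section 5.2.
def solve(n, capacity, cost, profit):
    # best[i][c]: maximum profit using locations i..n-1 with c bytes still free.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    choice = [[None] * (capacity + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for c in range(capacity + 1):
            best[i][c], choice[i][c] = best[i + 1][c], None   # skip location i
            j = i
            while j < n and cost(i, j) <= c:                  # try ranges [i, j]
                gain = profit(i, j) + best[j + 1][c - cost(i, j)]
                if gain > best[i][c]:
                    best[i][c], choice[i][c] = gain, j
                j += 1
    # Reconstruct the chosen set of ranges.
    ranges, i, c = [], 0, capacity
    while i < n:
        j = choice[i][c]
        if j is None:
            i += 1
        else:
            ranges.append((i, j))
            c -= cost(i, j)
            i = j + 1
    return best[0][capacity], ranges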
5.1 Trace Preprocessing
As outlined in Section 4, the analysis algorithm needs some very specific information about the target application beyond the bare execution trace. For this reason, small postprocessing tools were developed. A brief description of what is collectively labeled “Filters” in Fig. 3 is provided below.
• The raw execution trace is parsed, thus detecting any jumps in the execution flow. Two fields are added to every entry of the trace, annotating whether the corresponding instruction was accessed immediately after or before a jump in the program flow.
• The trace is sorted and entries pertaining to the same memory location are merged. After this step, the trace length is no longer proportional to the execution time, but to the application memory footprint. Every trace entry now contains aggregate data about the number of accesses and the inbound and outbound jumps.
• A tool parses the code segment of the application binary, detecting any instruction which would carry a space overhead in case of SPM mapping (please refer to Section 4). Any such instruction which is also found in the processed execution trace gets appropriately flagged in the latter.
After such processing, the trace can be fed as an input to the analysis algorithm.
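A minimal sketch of this preprocessing is shown below; it assumes the raw trace is simply the ordered sequence of fetched instruction addresses and that a prior binary scan produced a map of per-instruction patching overheads, with all field names chosen for illustration.

# Sketch of the trace "Filters": annotate jumps, then merge entries per address.
def preprocess(raw_trace, overhead_words):
    """raw_trace: ordered list of executed instruction addresses.
    overhead_words: dict address -> extra words needed if mapped to SPM
    (from the binary scan of Section 4). Returns merged, address-sorted entries."""
    merged = {}
    prev = None
    for addr in raw_trace:
        e = merged.setdefault(addr, {"accesses": 0, "jump_in": 0, "jump_out": 0,
                                     "extra_words": overhead_words.get(addr, 0)})
        e["accesses"] += 1
        if prev is not None and addr != prev + 4:        # non-sequential fetch
            e["jump_in"] += 1                             # entered via a jump
            merged[prev]["jump_out"] += 1                 # previous entry left via a jump
        prev = addr
    return [dict(address=a, **v) for a, v in sorted(merged.items())]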
5.2 Cost and Profit Metrics
One fundamental premise holds true in both the SPMMP and KP problems, and is at the heart of the solving algorithm. Every time a decision has to be made about whether a certain address range should belong to the optimal SPM mapping, two factors get evaluated: the profit it would bring, and its cost (expressed in terms of SPM space). Let α_i be the i-th address of the processed trace (i = 1, . . . , N, N being the number of trace entries), and ω_i the corresponding amount of extra words required by the instruction when mapped onto the SPM (see Section 4). We defined the cost function (expressed in bytes) for a code section spanning from the i-th to the j-th location as

w(i, j) := (α_j − α_i + 1) + Σ_{h=i}^{j} (ω_h · 4) + 4        (1)
Such a function is based upon the range length, to which an overhead is possibly added due to the presence of instructions in the range needing extra space; the return jump instruction (4 bytes) is finally accounted for. To achieve maximum flexibility, we implemented two different profit functions in our algorithm, only one of which can be selected at runtime:
• The first profit function is geared primarily towards maximum speed. In this case, the profit of a range is driven by the total number of accesses the range is subject to during application execution. The function is defined as
p(i, j) := (Σ_{h=i}^{j} τ_h) · (λ − 1) − Σ_{h=i}^{j} (τ_h · π_h) − δ_i − ∆_j        (2)
where τ_i is the number of accesses to the i-th location of the trace, λ is the average latency of an I-Cache memory (computed as a back-annotated weighted average of miss latencies and single-cycle hit latencies), π_i is the number of extra accesses associated with the i-th location of the trace, δ_i is the number of newly introduced jumps needed to reach the first location of the range, and ∆_j is the number of newly introduced jumps needed to leave the last location of the range.
• The second function is geared towards energy savings. If this option is selected, the profit of a range is described in terms of the lower energy consumption due to accessing a small on-chip SRAM bank instead of an off-chip memory. The function is

p(i, j) := (Σ_{h=i}^{j} τ_h) · (ρ − σ) − (Σ_{h=i}^{j} (τ_h · π_h)) · (σ + κ) − µ_{ij}        (3)

where

µ_{ij} := δ_i · (ρ + κ) + ∆_j · (σ + κ)        (4)

and ρ is the average energy to access an I-Cache (taking into account misses and the related energy for external RAM accesses), σ is the energy to access the SPM, and κ is the energy taken by a processor core cycle.

The two optimization strategies above, while both achieving improvements in speed and energy, may produce slightly different results depending on system parameters. As an example, a cache miss may be very expensive in latency but not in energy, while the opposite may be true of a jump. In this case, a speed-oriented solution would try to map the single most frequently accessed instructions to SPM, even accepting a high fragmentation of the code segment and the resulting cost in linking jumps, while an energy-oriented solution would try to map longer chunks of code onto the SPM, trading a higher number of I-Cache misses for a lower number of jumps.
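As an illustration, equations (1), (3) and (4) can be evaluated directly over the merged trace entries of Section 5.1; in the sketch below π_h is approximated by the number of extra words attached to a location and δ_i, ∆_j by its inbound/outbound jump counts, which is an assumption about the bookkeeping rather than a statement of our exact implementation.

# Cost (eq. 1) and energy profit (eq. 3-4) over a list `t` of merged trace
# entries (see Section 5.1); rho, sigma, kappa as defined above. Entry fields
# follow the earlier preprocessing sketch and are illustrative.
def cost(t, i, j):
    words = sum(t[h]["extra_words"] for h in range(i, j + 1))
    return (t[j]["address"] - t[i]["address"] + 1) + 4 * words + 4   # eq. (1)

def energy_profit(t, i, j, rho, sigma, kappa):
    accesses = sum(t[h]["accesses"] for h in range(i, j + 1))
    extra = sum(t[h]["accesses"] * t[h]["extra_words"] for h in range(i, j + 1))
    mu = t[i]["jump_in"] * (rho + kappa) + t[j]["jump_out"] * (sigma + kappa)  # eq. (4)
    return accesses * (rho - sigma) - extra * (sigma + kappa) - mu             # eq. (3)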
Configuration  Description
A              No SPM, 1 kB I-Cache, 1 kB D-Cache
B              No SPM, 4 kB I-Cache, 1 kB D-Cache
C              1 kB SPM, 1 kB I-Cache, 1 kB D-Cache
D              1 kB SPM, 1 kB U-Cache
E              4 kB SPM, 1 kB I-Cache, 1 kB D-Cache
F              4 kB SPM, 4 kB I-Cache, 1 kB D-Cache
G              4 kB SPM, 1 kB U-Cache
H              4 kB SPM, 4 kB U-Cache

Table 1: Tested system configurations.

Figure 4: Relative area and power of cache and scratchpad memories (SPM vs. cache read energy and area, for sizes from 128 B to 16 kB).
6. EXPERIMENTAL RESULTS
6.1 Static Results
Fig. 4 shows the relative power consumption and area cost of cache and scratchpad memories, with numbers taken from STMicroelectronics foundry datasheets (most literature work is based upon less reliable CACTI [34] estimates). Memory cuts were selected among the library ones according to two criteria: single-cycle latency at 200 MHz, and minimum power consumption. With a direct-mapped cache as the reference, scratchpad memories show huge improvements in both energy and silicon area. Energy savings span from 39% to 87%, while area savings are between 17% and 61%. This is due to the fact that SPMs do not need tags and tag comparators, and to the different layout of 128-bit and 32-bit memory banks. We first tested our approach by trying to understand the amount of involved code expansion and the mapping overhead. As a reference, we will describe the results for an FFT application. This
benchmark, compiled with a gcc cross-compiler with the optimization flag -O2, has a code segment which is 2541 words (around 10 kB) long. Out of this segment, only 37 instructions (1.5%) needed some kind of patching in case of relocation, and all of them had a space overhead of just one extra word. Another source of overhead was the jumps implicit in SPM mapping; for this particular benchmark, our solving algorithm computed an optimal mapping composed of 12 and 17 moved blocks for 1-kB and 4-kB target SPMs respectively. These numbers translate into twice as many extra jumps, half of which (the exit ones) constitute an SPM overhead. When mapped on a 1-kB SPM (256 locations), the overhead for FFT was 13 words (one jump per block plus one instruction needing specific patching), equal to 5.1%, while on a 4-kB SPM (1024 locations) it was 24 words (one per block plus seven instructions to be patched), or 2.3%.
6.2 Simulation Results and Algorithm Performance The mapping results produced by the solving algorithm were very similar for the two strategies discussed in Section 5, so, for the sake of brevity, numbers are reported here for the “lowest energy” policy only. However, it is important to notice that this may not be the case on different CPU architectures or with different foundry processes. We assumed the latency of cache and scratchpad memories to be of a single cycle, while external memory was supposed to be on-chip but attached via an AMBA interconnect and having one wait state per access. Fig. 5 and Fig. 6 show the results of our optimization approach for two reference benchmarks, JPEG and FFT. The hardware configurations reported in the chart are detailed in Table 1. The baseline 100% value is that of a system having 1 kB of I-Cache and 1 kB of D-Cache (I-Caches are direct mapped, while D-Caches and unified caches are supposed to be 4-way set-associative). We tested system performance after optimization under two main scenarios: by adding the SPM to the system for maximum energy and time savings, and by replacing the previous I-Cache to stay within area constraints. In the latter case, the D-Cache was instead used as a unified cache, also intercepting code accesses falling outside of the blocks mapped in SPM. Power figures are reported for both the entire system, including the CPU core, and just the memory subsystem. As an example, in FFT, adding 1 kB of dedicated SPM (configuration C) to the reference system (A) achieved a lower energy consumption by 36% (JPEG: 21%) and a better execution time by 32% (JPEG: 17%), while for a 4 kB SPM (E), these figures climbed to 48% and 47% (JPEG: 35% and 34%). Silicon real estate for the buffering subsystem increased by 36% for a 1 kB SPM (C) and by 110% for a 4 kB one (E). It is interesting to contrast these figures to those of a more traditional way of improving system performance,
i.e. increasing the I-Cache size to 4 kB (B), which yields 31% and 39% (JPEG: 14% and 23%) improvements in energy and time, with an area penalty equal to 89%. The comparison between (A) and (B) is also interesting because it shows decreasing power usage despite the presence of a larger cache: this can be explained by hit rate improvements. Replacing the I-Cache with a SPM and shifting the D-Cache to a unified cache yielded even better results. The system with 1 kB of SPM and 1 kB of unified cache (D) saved 24% (JPEG: 14%) in energy and 21% (JPEG: 12%) in time, while also using 11% less silicon than the baseline system with 1-kB dual caches (A). Moving from 4 kB of I-Cache and 1 kB of D-Cache (B) to 4 kB of SPM and 1 kB of unified cache (G), 23% (JPEG: 20%) of the energy was saved, execution times were 12% (JPEG: 10%) lower and area went down by 13%. The figures translate into better power/delay/area products of 47% and 41% (JPEG: 33% and 37%) respectively. The previous numerical power comparisons were all based upon overall system power; an exploded view for the JPEG benchmark, showing the energy consumption of single components, is available in Fig. 7. This chart shows that energy savings are due to improvements of comparable entity in the memory subsystem and the processing unit; the latter gain is almost linear in the decrease of execution time allowed by SPM addition.
Figure 7: Exploded view of energy consumption (mJ) for the JPEG benchmark, broken down into Core, I-Cache, D-Cache, U-Cache, SPM and RAM contributions for configurations A–H.
The execution time of the solving algorithm, the most critical tool among the ones adopted in our approach, proved to be more than acceptable. Execution times on an ordinary desktop (a Pentium 4 running at 2.26 GHz with 512 MB of RAM) are summarized in Table 2. We would like to point out that the application size (N) is only related to the code segment, and thus represents a fraction of the complete application footprint. Additionally, should the source code of one part or of the entire application be available, as mentioned in Section 3, the designer could add markers to the critical code sections, thus resulting in trimmed traces at the simulator level and therefore in much smaller analysis times.

                  1 kB SPM (C = 1024)   4 kB SPM (C = 4096)
JPEG (N = 4228)   4 s                   89 s
FFT (N = 1742)    2 s                   44 s

Table 2: Algorithm execution times.
Figure 5: Optimization results for the JPEG benchmark (relative execution time, area, total power and memory power for configurations A–H).

Figure 6: Optimization results for the FFT benchmark (relative execution time, area, total power and memory power for configurations A–H).
7. CONCLUSIONS
This work proves the viability of a post-compilation approach allowing application optimization even without access to source code of the target application and/or the compiler. Even by optimizing the code segment alone, savings of up to 47% could be achieved for the power/delay/area product with respect to a cache-only solution. Optimization times are moderate, requiring less than two minutes for our test benchmarks. The approach proved to scale polynomially. Future research may lead to the extension of the patching analysis to other architectures, to the investigation of similar approaches for data variables, or to the development of ways to dynamically load code onto the SPM at runtime.
8. REFERENCES
[1] Raam, F.M.; Agarwal, R.; Malik, K.; Landman, H.A.; Tago, H.; Teruyama, T.; Sakamoto, T.; Yoshida, T.; Yoshioka, S.; Fujimoto, Y.; Kobayashi, T.; Hiroi, T.; Oka, M.; Ohba, A.; Suzuoki, M.; Yutaka, T.; Yamamoto, Y., “A High Bandwidth Superscalar Microprocessor for Multimedia Applications”, Digest of Technical Papers of the 1999 IEEE International Solid-State Circuits Conference, pp. 258–259, 1999.
[2] Suzuoki, M.; Kutaragi, K.; Hiroi, T.; Magoshi, H.; Okamoto, S.; Oka, M.; Ohba, A.; Yamamoto, Y.; Furuhashi, M.; Tanaka, M.; Yutaka, T.; Okada, T.; Nagamatsu, M.; Urakawa, Y.; Funyu, M.; Kunimatsu, A.; Goto, H.; Hashimoto, K.; Ide, N.; Murakami, H.; Ohtaguro, Y.; Aono, A., “A Microprocessor with a 128-bit CPU, Ten Floating-Point MAC’s, Four Floating-Point Dividers, and an MPEG-2 Decoder”, IEEE Journal of Solid-State Circuits, Volume 34 Issue 11, Nov 1999, pp. 1608–1618, 1999. [3] Koyama, T.; Inoue, K.; Hanaki, H.; Yasue, M.; Iwata, E., “A 250-MHz Single-Chip Multiprocessor for Audio and Video Signal Processing”, IEEE Journal of Solid-State Circuits, Volume 36 Issue 11, Nov 2001, pp. 1768–1774, 2001. [4] Benini, L.; Macii, A.; Macii, E.; Poncino, M., “Increasing Energy Efficiency of Embedded Systems by Application-Specific Memory Hierarchy Generation”, IEEE Design and Test of Computers, Volume 17 Issue 2, Apr-Jun 2000, pp. 74–85, 2000. [5] Benini, L.; Macchiarulo, L.; Macii, A.; Poncino, M., “Layout-Driven Memory Synthesis for Embedded Systems-on-Chip”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 10 Issue 2, Apr 2002, pp. 96–105, 2002. [6] Benini, L.; Bertozzi, D.; Bruni, D.; Drago, N.; Fummi, F.; Poncino, M., “Legacy SystemC Co-Simulation of Multi-Processor Systems-on-Chip”, Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 494–499, 2002. [7] Angiolini, F.; Benini, L.; Caprara, A., “Polynomial-Time Algorithm for On-Chip Scratchpad Memory Partitioning”, Proceedings of the ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 318–326, 2003. [8] Kennedy, K.; Allen, J.R., “High-Performance Compilers”, Elsevier Science and Technology Books, 2001. [9] Panda, P.R.; Dutt, N.D.; Nicolau, A., “Efficient Utilization of Scratch-pad Memory in Embedded Processor Applications”, Proceedings of the European Design and Test Conference, pp. 7–11, 1997. [10] Panda, P.R.; Dutt, N.D.; Nicolau, A., “Local Memory Exploration and Optimization in Embedded Systems”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 18 Issue 1, Jan 1999, pp. 3–13, 1999. [11] Panda, P.R.; Dutt, N.D.; Nicolau, A.; Catthoor, F.; Vandecappelle, A.; Brockmeyer, E.; Kulkarni, C.; De Greef, E., “Data Memory Organization and Optimizations in Application-Specific Systems”, IEEE Design and Test of Computers, Volume 18 Issue 3, May 2001, pp. 56–68, 2001. [12] Shiue, W.-T.; Chakrabarti, C., “Memory Exploration for Low Power, Embedded Systems”, Proceedings of the 36th Design Automation Conference, pp. 140–145, 1999. [13] Kim, S.; Vijaykrishnan, N.; Kandemir, M.; Sivasubramaniam, A.; Irwin, M.J.; Geethanjali, E., “Power-Aware Partitioned Cache Architectures”, Proceedings of the International Symposium on Low Power Electronics and Design, pp. 64–67, 2001. [14] Kandemir, M.; Ramanujam, J.; Irwin, M.J.; Vijaykrishnan, N.; Kadayif, I.; Parikh, A., “Dynamic Management of Scratch-Pad Memory Space”, Proceedings of the Design Automation Conference, pp. 690–695, 2001. [15] Kandemir, M.; Choudhary, A., “Compiler-Directed Scratch
Pad Memory Hierarchy Design and Management”, Proceedings of the 39th Design Automation Conference, pp. 628–633, 2002.
[16] Kandemir, M.; Kadayif, I.; Sezer, U., “Exploiting Scratch-Pad Memory Using Presburger Formulas”, Proceedings of the 14th International Symposium on System Synthesis, pp. 7–12, 2001.
[17] Kandemir, M.; Ramanujam, J.; Choudhary, A., “Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems”, Proceedings of the 39th Design Automation Conference, pp. 219–224, 2002.
[18] Banakar, R.; Steinke, S.; Lee, B-S.; Balakrishnan, M.; Marwedel, P., “Scratchpad Memory: a Design Alternative for Cache On-Chip Memory in Embedded Systems”, Proceedings of the Tenth International Symposium on Hardware/Software Codesign, pp. 73–78, 2002.
[19] Steinke, S.; Wehmeyer, L.; Lee, B-S.; Marwedel, P., “Assigning Program and Data Objects to Scratchpad for Energy Reduction”, Proceedings of the IEEE Design and Test in Europe Conference (DATE), pp. 409–415, 2002.
[20] Steinke, S.; Grunwald, N.; Wehmeyer, L.; Banakar, R.; Balakrishnan, M.; Marwedel, P., “Reducing Energy Consumption by Dynamic Copying of Instructions onto Onchip Memory”, 15th International Symposium on System Synthesis, pp. 213–218, 2002.
[21] Verma, M.; Steinke, S.; Marwedel, P., “Data Partitioning for Maximal Scratchpad Usage”, Proceedings of the ASP-DAC 2003 Asia and South Pacific Design Automation Conference, pp. 77–83, 2003.
[22] Verma, M.; Wehmeyer, L.; Marwedel, P., “Cache-Aware Scratchpad Allocation Algorithm”, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, Vol. 2, pp. 1264–1269, 2004.
[23] Bertozzi, D.; Poletti, F.; Benini, L., “Performance Analysis of Arbitration Policies for SoC Communication Architectures”, Design Automation of Embedded Systems, Special Issue on Covalidation of Embedded Hardware/Software Systems, 2003.
[24] Loghi, M.; Angiolini, F.; Bertozzi, D.; Benini, L.; Zafalon, R., “Analyzing On-Chip Communication in a MPSoC Environment”, Proceedings of the IEEE Design and Test in Europe Conference (DATE), February 2004, pp. 752–757, 2004.
[25] Udayakumaran, S.; Barua, R., “Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems”, Proceedings of the ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2003.
[26] Poletti, F.; Marchal, P.; Atienza, D.; Benini, L.; Catthoor, F.; Mendias, J. M., “An Integrated Hardware/Software Approach for Run-Time Scratchpad Management”, Proceedings of the 41st Design Automation Conference, pp. 238–243, 2004.
[27] Martello, S.; Toth, P., “Knapsack Problems”, John Wiley & Sons, Chichester, 1990.
[28] Bellman, R.E., “Dynamic Programming”, Princeton University Press, Princeton, NJ, 1957.
[29] Ibarra, O. H.; Kim, C. E., “Fast Approximation Algorithms for the Knapsack and Sum of Subset Problems”, Journal of the ACM (JACM), Volume 22 Issue 4, Oct 1975, pp. 463–468, 1975.
[30] Sahni S., “Approximate Algorithms for the 0/1 Knapsack Problem”, Journal of the ACM (JACM), Volume 22 Issue 1,
Jan 1975, pp. 115–124, 1975. [31] Macii, A.; Macii, E.; Poncino, M., “Improving the Efficiency of Memory Partitioning by Address Clustering”, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pp. 18–23, 2003. [32] Jagger, D.; Seal, D., “ARM Architecture Reference Manual Second Edition”, Addison-Wesley, 2000.
[33] SWARM, http://www.g141.com/projects/swarm/
[34] CACTI, http://research.compaq.com/wrl/people/jouppi/CACTI.html