To appear in the 18th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS-18), April 2012.
Leveraging both Data Cache and Scratchpad Memory through Synergetic Data Allocation
Sangyeol Kang and Alexander G. Dean
Center for Efficient, Scalable and Reliable Computing, Dept. of Electrical and Computer Engineering
North Carolina State University, Raleigh, NC 27695, USA
{sykang, agdean}@ncsu.edu
Abstract—Although a data cache provides low access latency, it degrades the timing predictability of real-time embedded systems because its misses are difficult to predict. Scratchpad memory is accessed as fast as a data cache, but does not suffer from unpredictable misses thanks to its software-controlled management. This study presents how scratchpad memory can reduce data cache pollution and misses in preemptive real-time embedded systems, so that the two fast memory subsystems work together with synergy. First, by classifying data cache misses into intrinsic misses and interference misses, we reveal previously hidden characteristics of the interactions between data in the cache. Second, we propose a heuristic method of data allocation to scratchpad memory based on this new perspective, which reduces cache pollution and ultimately improves cache performance. Third, we examine these concepts with several tasks running on a real hardware platform and a preemptive real-time operating system. In addition, we perform a supplementary case study showing how sensitive the data cache is to small changes in data memory layout and in its dynamic contents. Our proposed scheme guides the synergetic process of using scratchpad memory alongside this sensitive data cache. In our experiments, the proposed data allocation scheme significantly reduces inter-task cache pollution as well as the intrinsic cache misses of the tasks themselves.
Keywords-Scratchpad Memory; Data Cache; Data Allocation; Inter-task Cache Pollution
I. INTRODUCTION

In real-time embedded systems, the timing variability of components is critical because it directly affects the schedulability of the system. Using cache memory introduces misses, which increase timing variability at the memory access level. Hence, there has been extensive research on improving the timing predictability of cache memory [1]–[5]. Reducing cache misses in the first place, however, fundamentally simplifies the creation and verification of hard real-time embedded systems. Scratchpad memory (SPM, such as on-chip SRAM) can provide a complementary way to improve the timing predictability of cache memory while still providing fast access.

*This work was funded in part by NSF CNS awards 0720797 and 0915503.
SPM differs from cache memory in aspects such as chip area, energy consumption, and access mechanism [6]. Much prior research compares cache memory against SPM [7]–[10], or utilizes only SPM [11]–[15] to reduce execution time or energy consumption. This study proposes a memory allocation method that actively exploits both fast memory subsystems together to improve cache performance and, ultimately, system performance as well as timing predictability. A new perspective for understanding cache misses is also proposed: the cache misses of a system are classified into two types, intrinsic misses and interference misses. The cache misses of a preemptive real-time system are analyzed with this concept, and particular data are selectively allocated into SPM. This data allocation removes cache misses due to the allocated data itself, and inter-data cache pollution as well.
The suggested approach is examined with several tasks running on a real hardware system and a preemptive real-time operating system. Our experiments show that many cache misses come from inter-task cache pollution, and that these can be reduced effectively by the proposed SPM allocation scheme. A supplementary experiment is also performed, which shows the data cache is sensitive not only to its dynamic contents but also to changes in data memory layout. In spite of this sensitivity, our scheme provides good guidance for leveraging the data cache and SPM synergetically.
The remainder of this paper is organized as follows. Section II introduces related studies. Section III discusses a motivating case study on the sensitivity of the data cache to memory layout changes and its dynamic contents. Section IV explains the new concept of cache miss decomposition proposed by this work, and presents a heuristic to allocate task data into SPM. Section V describes the experimental setup, followed by the results in Section VI. Section VII concludes this work.

II. RELATED WORK

The crucial role of cache memory and SPM in improving system performance has motivated much research
through diverse approaches. Some of this vast breadth of work is mentioned here.

A. Timing Predictability of Cache Memory

Improving the timing predictability of cache memory has been tackled by many researchers. One track of work [1]–[5] estimated the number of cache misses or the response time of tasks considering inter-task cache interference. Another track [16]–[20] reduced inter-task cache interference by partitioning or locking the cache.
Lee et al. [1] estimated the worst-case response time by bounding the cache-related preemption delay (CRPD) due to inter-task cache pollution with the useful cache block (UCB) concept. Negi et al. [2] extended this work with a program path analysis to estimate CRPD for both the preempted and the preempting task. Both works cover only the instruction cache, even though Lee et al. [1] mention that the concept can be extended to the data cache. Altmeyer et al. [5] suggested a more detailed concept named the definitely cached UCB (DC-UCB), which tightens the CRPD estimate of the earlier work, where some cache misses are counted twice: once in the timing analysis and once in the context-switching cost analysis.
A data cache miss pattern was obtained statically from the extended Cache Miss Equations (CME) [21] and fed into a static WCET analysis framework by Ramaprasad et al. [3]. This work was later extended to support a multitasking preemptive environment [4]; they analyzed preemption points, the corresponding preemption delay, and finally the worst-case response time.
Cache partitioning reduces cache interference between tasks by isolating cache space per task. Kirk et al. [16], [17] divided the cache space into one large shared partition and small segmented partitions. Tasks requiring predictable cache hits are allocated to the small partitions, while others are assigned to the shared pool. Mueller [18] focused on compiler support for this situation, i.e., non-linear addressing of instructions located in separate partitions.
Puaut et al. [19], [20] locked instruction cache contents. They first selected the contents of a statically locked cache to minimize CPU utilization or to reduce interference from task preemptions [19]. Second, they devised an off-line instruction selection algorithm for a dynamically-managed locked cache or SPM, considering differences such as addressing scheme and granularity of allocation [20].

B. Allocation into SPM

In prior research [6]–[15], SPM has been utilized mainly as an alternative to cache memory, and its performance is compared to that of cache memory rather than used to extract synergy by improving cache performance. Banakar et al. [6] showed by simulation that SPM outperforms cache memory in aspects such as chip area, energy consumption, and performance. Steinke et al. [7] allocated
instructions and data together to SPM using an integer linear programming (ILP) formulation, and evaluated energy consumption and execution time.
Some work reduced average-case or worst-case execution time with improved timing predictability by using SPM [11]–[15]. Avissar et al. [11] presented an ILP-based SPM allocation scheme for static and automatic data. Data on the worst-case path was allocated by Suhendra et al. [12], considering infeasible execution paths. Deverge et al. [13] dynamically allocated both static and automatic data to SPM in order to reduce a single task's WCET. Ghattas et al. [14] combined a data allocation algorithm with a preemptive real-time scheduling method to maximize SPM utilization, and Kang et al. [15] extended this work by providing a feasible method and toolchain using an assembly code rewriting technique to split a program's original stack frame into SPM.
Absar et al. [8], [9] focused on utilizing SPM for dynamic applications whose data access pattern is irregular or non-manifest. The first work [8] investigated a dynamically-managed SPM scheme. The second work [9] allocated the data of linked lists and trees into SPM according to their access probabilities, which must be known a priori by examining the target programs. Mamagkakis et al. [10] also allocated dynamic-type data into SPM: they grouped dynamic-type data into data pools based on access history derived from instrumentation, and computed an activity degree for each pool to determine its priority for allocation into SPM.
Several works investigated a hybrid model with both cache memory and SPM. Allocation of instructions to SPM considering their effect on the instruction cache was investigated by Verma et al. [22]. Partitioning of scalar and array variables between SPM and data cache was studied by Panda et al. [23], [24], and these works are the closest to our present study. While the first work [23] focused on a single-program system, the latter work [24] extended the concept so that SPM is partitioned among multiple programs. Both works aim to minimize cross-interference between different scalar and array variables by using several metrics reflecting each variable's lifetime, size, access frequency, and potential cache conflicts inside loops. The potential cache conflicts were estimated statically and loosely by examining the program source code for array accesses. All of this information is required a priori for the allocation process. The synergy between SPM and data cache was partially reflected in their experimental results, such as execution time and external memory access counts.
Despite the breadth of existing investigations, our work differs from all the work above in several aspects. (1) Our proposed scheme takes advantage of both cache and SPM, rather than one or the other, in order to mitigate inter-data cache pollution while still providing fast memory access for all data. (2) This study presents a new approach to cache miss analysis, which determines the priority of data to be allocated into SPM. (3) The allocation scheme is thus
based on the new cache miss analysis rather than on data access count or access frequency analysis. (4) Our work considers data allocation and the data cache, rather than instruction allocation and the instruction cache; data cache behavior is ever-changing, and analyzing it involves much more complexity. (5) A real hardware platform is used to evaluate the merits of this work.

III. MOTIVATION

We begin with a case study examining the impact of three kinds of variation (input workload, memory layout, and cache pollution by other programs) on the number of cache misses of the tasks. This experiment helps explain the difficulties of static cache analysis, providing motivation for our work in subsequent sections. The hardware and the test programs used in this experiment are presented in Table I and Table II respectively. (For more details, see Section V-A.) The control case is named Default, and three experimental factors are varied: the input workload to the programs, the memory layout of data, and the data cache contents before a task starts a new job. In this experiment, we find that the performance of a data cache can be surprisingly sensitive to small changes in memory layout and in its dynamic contents.
The test programs are fed with five different input workloads. Each program runs two jobs back-to-back on one identical input workload. Before the first job, the cache is flushed thoroughly to remove jitter (the cold cache case). To emulate the best case between the two neighboring jobs, the warm cache case does not flush the cache before starting the second job, which means the data cache is not polluted by other programs. The worst case is the same as the cold cache case, where the cache is flushed thoroughly just before performing any job.
To create different memory layouts of data, three memory layout variants are implemented: Reversed Statics, LRLA Dynamics, and 1.5 Cache Line Offset Stack. For the Reversed Statics configuration, the order of definition or declaration of static variables is reversed within each source code file. The LRLA Dynamics configuration adopts a last-returned-last-allocated algorithm for allocating heap memory chunks to dynamic-type data, where the most recently freed memory block is not allocated until previously freed blocks have been allocated. The Default configuration uses an LRFA (last-returned-first-allocated) algorithm, where the most recently freed memory block is allocated first on the next allocation. For the 1.5 Cache Line Offset Stack configuration, the stack pointer at the entry point of each program is intentionally offset by one and a half cache lines (48 bytes). All other factors which can affect cache misses are controlled as described in Section V-A.
Fig. 1 shows the sensitivity of data cache misses in the single-threaded system. Each number of cache misses is normalized to the average number of cache misses over the
five inputs running with the Default configuration. Each column represents the average number of cache misses over the five inputs, with a blue error bar showing the worst and the best number of cache misses for that configuration.
The input workload variation affects the cache miss count for some programs (Dijkstra and TinyJPEG), since different inputs can generate different control flows through the programs. In the Default configuration, Dijkstra varies by 11.6% and TinyJPEG by 0.2%. The other benchmarks show a constant or nearly constant number of cache misses, so different inputs do not generate different control flows.
Memory layout changes have an unpredictable impact. In the cold cache case, Dijkstra generates 77.8%, 105.2%, and 231.8% of the cache misses of the Default configuration on average for Reversed Statics, LRLA Dynamics, and 1.5 Cache Line Offset Stack respectively. The other benchmarks show from 99.1% to 102.4% of the cache misses of their Default cases on average across the memory layout variants.
In the warm cache case, data cache behavior changes more dramatically. The second jobs of ADPCM, MgSort, FFT2D, and Rotate never experience any cache misses for any input or memory layout, which implies their working data from the first job is still kept perfectly in the data cache. In the Default configuration, Dijkstra and TinyJPEG show 87.2% and 87.5% of their original cache misses respectively on average. This indicates that the second jobs of these programs still cause significant cache misses across the different memory layout variants.
Dijkstra deserves further discussion due to its quirky behavior. First, there is dramatic variation across memory layouts. Second, even though the data cache (8KB) is large enough to hold all of its data (5,564 bytes), the second job still incurs cache misses: the warm cache case still generates many misses. These behaviors are explained by cache contention. Dijkstra's static, automatic, and dynamic variables thrash the cache. In our system, the three types of data are located in separate memory regions, but in Dijkstra's case they map to the same cache lines, and this contentious mapping cannot be avoided because of where the data resides in data memory.
Consequently, memory layout changes lead to different patterns of cache misses, and cache pollution by the task itself causes many cache misses even in the warm cache case. This implies that static cache miss analysis techniques from prior work may be fragile and may be invalidated by minor memory layout changes. Moreover, multi-threaded real-time systems can suffer unnecessarily from cache pollution due to context switching. With this motivation, our work suggests a data allocation scheme in which cache misses are abstracted through simple measurements rather than deliberate and precise analysis, and inter-data cache pollution is minimized. This study examines how to use the advantages of both fast memory types in a complementary way.
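The contention just described can be made concrete with a short sketch (ours, for illustration only; the addresses below are hypothetical and not Dijkstra's actual layout) that computes cache set indices for the 8KB, 2-way, 32-byte-line cache of our platform (Table I):

    # Sketch: why data in separate memory regions can thrash the same cache sets.
    # Cache parameters follow Table I (8KB, 2-way, 32-byte lines -> 128 sets);
    # the addresses are hypothetical examples.
    LINE_SIZE = 32
    NUM_SETS = (8 * 1024) // LINE_SIZE // 2   # 8KB / 32B lines / 2 ways = 128 sets

    def cache_set(addr: int) -> int:
        """Set index = (address / line size) mod number of sets."""
        return (addr // LINE_SIZE) % NUM_SETS

    regions = {
        "static 0x80000100": 0x80000100,
        "stack  0x80005100": 0x80005100,
        "heap   0x8000A100": 0x8000A100,
    }
    for name, addr in regions.items():
        print(f"{name} -> set {cache_set(addr)}")   # all three map to set 8

Any two addresses that differ by a multiple of 4KB (number of sets times line size) compete for the same two ways, regardless of which memory region or storage class they belong to.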
[Figure 1: Sensitivity of cache misses in a single-threaded system. Normalized number of cache misses for Dijkstra, ADPCM, TinyJPEG, MgSort, FFT2D, and Rotate under the Default, Reversed Statics, LRLA Dynamics, and 1.5 Cache Line Offset Stack configurations. (a) Case of cold cache; (b) Case of warm cache (no cache pollution by other programs).]
To obtain synergy between the SPM and the data cache, each fast memory holds a different category of data. The SPM is used to hold data causing many cache misses and severe inter-data cache pollution, while the data cache covers the remaining data in main memory. This improves performance at the memory access level: the data stored in SPM is still accessed as fast as data in the data cache, cache pollution is reduced, and the total number of cache misses falls. This can also be interpreted as increasing the effective capacity of the data cache, since the data allocated to SPM is no longer cached. Taken together, allocating data into SPM brings both (1) a performance improvement at the memory access level and (2) a timing predictability improvement, synergetically and at the same time.
IV. ALLOCATION OF DATA INTO SPM

Our new cache miss analysis classifies the cache misses of a system into two groups: intrinsic misses and interference misses. Based on this concept, a heuristic to allocate data into SPM is proposed.

A. Intrinsic Misses and Interference Misses

Before looking into the details of the two types of cache misses, we must define a term: a data set is a set of data which has a unique and fixed memory address. This also implies that data sets are mutually exclusive; each data item must belong to only one data set. For instance, all static variables and all automatic variables of a program can form two different data sets, as they do not share the same addresses. In a real-time system, all data variables of a task may compose one set. However, particular subgroups of automatic variables of a program may not compose a data set, since their addresses may not be uniquely fixed: the program's stack grows and shrinks as procedure calls and returns occur. Similarly, a particular group of dynamic variables in a multi-threaded system may not make up a data set, since the same addresses in heap memory may be allocated to dynamic-type variables of different tasks during scheduling.
The intrinsic misses are the data cache misses of a data set when only that data set (and no other) is cacheable in a program or a system. Thus the intrinsic misses of a data set are a characteristic of the data set: they are determined by the program logic, the input workload to the program, a fixed data memory layout, and the hardware running the program, but not by other data sets.

Definition IV.1. The intrinsic miss count m(V) of a data set V is the total number of cache misses caused by V when only V is cacheable in a system.

The interference misses are cache misses due to cache contention and pollution over the same cache lines by multiple data sets. This type of miss is determined by the contending and interfering relationships among the data sets.

Definition IV.2. The interference miss count of a system is the difference between the total number of cache misses of the system and the sum of the intrinsic misses of all data sets.

In other words, the intrinsic misses are the cache misses of a data set observed when only that data set uses the data cache, while the interference misses are the unnecessary cache misses of the program or the system caused by all data sets contending for the limited cache resource. In our analysis, we need not detail how the misses are generated in the system. From these definitions, two characteristics are derived. The notation m(x) denotes the number of intrinsic misses of a data set x, and m(x#y) denotes the number of interference misses between two data sets x and y. When X is a super set of sub elementary sets, m(#X) denotes the number of interference misses inside the super set X when all the sub elementary sets of X are cacheable at the same time.

Theorem IV.1. In a system V, when only n data sets (V = V_1 ∪ ... ∪ V_n) are cacheable at the same time, the total number of cache misses of the system is the sum of the intrinsic misses of all n data sets and the system's interference misses, i.e.,

    m(V) = Σ_{i=1}^{n} m(V_i) + m(#V)    (1)

Proof: If some data sets share the same cache lines, so that extra misses exist in addition to the intrinsic misses of the n data sets, the total number of cache misses of the system is increased by a positive integer α, i.e.,

    m(V) = Σ_{i=1}^{n} m(V_i) + α  ⇔  α = m(V) − Σ_{i=1}^{n} m(V_i)

From Definition IV.2, α = m(#V). If no sets share any cache lines, no extra misses are observed and the total number of cache misses of the system is not increased; with m(#V) = 0,

    m(V) = Σ_{i=1}^{n} m(V_i)

From both cases,

    m(V) = Σ_{i=1}^{n} m(V_i) + m(#V)

Theorem IV.2. In a system V, when only n data sets (V = V_1 ∪ ... ∪ V_n) are cacheable at the same time, and they are grouped into two super sets V_Ω and V_Θ exclusively and exhaustively, i.e., V = V_Ω ∪ V_Θ with V_Ω ∩ V_Θ = ∅, then the following identity holds:

    m(V_Ω#V_Θ) = m(#V) − m(#V_Ω) − m(#V_Θ)    (2)

Proof: Since all data sets V_1 through V_n are exclusive to each other, the assumptions about V_Ω and V_Θ are satisfied, so Theorem IV.1 is applicable:

    m(V) = m(V_Ω ∪ V_Θ) = m(V_Ω) + m(V_Θ) + m(V_Ω#V_Θ)

Applying Theorem IV.1 again to m(V_Ω) and m(V_Θ),

    m(V_Ω) + m(V_Θ) = Σ_{i=1}^{n} m(V_i) + m(#V_Ω) + m(#V_Θ)

Therefore,

    m(V) = Σ_{i=1}^{n} m(V_i) + m(#V_Ω) + m(#V_Θ) + m(V_Ω#V_Θ)

Also,

    m(V) = Σ_{i=1}^{n} m(V_i) + m(#V)

Finally,

    m(V_Ω#V_Θ) = m(#V) − m(#V_Ω) − m(#V_Θ)
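As a minimal numerical sketch of how (1) and (2) are used (the miss counts below are invented for illustration; in our experiments they come from the hardware cache miss counter):

    # Sketch of the cache miss decomposition of Theorems IV.1 and IV.2.
    # All counts are hypothetical stand-ins for measured values.

    intrinsic = {"A": 1000, "B": 400, "C": 250}    # m(Vi), each Vi cacheable alone
    total_all = 2150                               # m(V), all data sets cacheable

    # Theorem IV.1: system-wide interference misses m(#V).
    interference_all = total_all - sum(intrinsic.values())              # 500

    # Group the sets as V_Omega = {A} (to be moved to SPM) and V_Theta = {B, C}.
    total_BC = 720                                 # measured with only B, C cacheable
    interference_Theta = total_BC - (intrinsic["B"] + intrinsic["C"])   # 70
    interference_Omega = 0                         # a single set cannot interfere with itself

    # Theorem IV.2: interference between the two super sets.
    cross = interference_all - interference_Omega - interference_Theta  # 430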
Notice that (2) is an identity originating from (1); it is simply a different perspective on the data sets.

B. Allocation Scheme – Exhaustive Search

Suppose we allocate some data sets to SPM (let V_Ω group the allocation targets). It is clearly beneficial to move the data sets causing the most cache contention with the other data sets (bundled into V_Θ). Theorem IV.2 suggests an approach to identify such data sets. The term m(V_Ω#V_Θ) in (2) is the number of interference misses between the two super sets V_Ω and V_Θ, and the right-hand side tells how to calculate it. We may then choose the V_Ω, among all possible combinations, that maximizes m(V_Ω#V_Θ). For example, when 2 tasks are to be chosen from 6 tasks, C(6,2) different combinations may be examined, and the combination maximizing the benefit is selected.
The term m(#V) on the right-hand side of (2) is the interference miss count of all n data sets. It can be computed by applying Theorem IV.1 after measuring the intrinsic misses of all n data sets and the total number of cache misses; the memory layout is, of course, kept fixed during these n + 1 measurements. For each candidate V_Ω, the other terms on the right-hand side of (2) are recalculated. The second term m(#V_Ω) is the interference between the data sets composing V_Ω. It can be computed from Theorem IV.1 and a measurement of the cache misses when only V_Ω is cacheable in the system. Notice that the intrinsic miss count of each member of V_Ω is already known from the first round of measurements used to compute m(#V); since the memory layout does not change during this analysis, those intrinsic miss counts can be reused. The third term m(#V_Θ) can be calculated in the same manner.
However, this calculation then becomes redundant, because we already know which combination is the most beneficial: we have already measured the total cache miss count for each possible V_Ω and the corresponding V_Θ, and those are the actual cache misses the system will experience after the allocation. Therefore the selection process becomes an exhaustive exploration even though Theorem IV.2 is applied, and it requires significantly more effort as the number of data sets increases. The complexity of this search when examining all combinations of n data sets is O(2^n).
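The exhaustive exploration can be sketched as follows; measure_total_misses is a hypothetical stand-in for running the whole workload with a candidate group of data sets placed in SPM and reading the hardware miss counter, and SPM capacity limits are ignored for brevity:

    # Sketch of the exhaustive allocation search: every subset of data sets is a
    # candidate V_Omega for SPM, each candidate needs one measurement run, and the
    # candidate leaving the fewest cache misses wins -> O(2^n) measurement runs.
    from itertools import combinations

    def exhaustive_search(data_sets, measure_total_misses):
        best_sets, best_misses = None, float("inf")
        for k in range(len(data_sets) + 1):
            for candidate in combinations(data_sets, k):
                misses = measure_total_misses(set(candidate))
                if misses < best_misses:
                    best_sets, best_misses = set(candidate), misses
        return best_sets, best_misses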
C. Allocation Scheme – Heuristic

We suggest a heuristic scheme to avoid the computational complexity of the exhaustive scheme presented above. First, an undirected graph called the cache interference degree graph is defined. The graph represents the intrinsic misses of each data set and the interference misses between each pair of data sets. A node of this graph represents one data set, and each node has edges to the other nodes annotated with the interference miss count between the two nodes; these interference miss counts can be obtained from Theorem IV.1. Each node is also annotated with its intrinsic miss count. The graph used in this work is presented in Fig. 2.
We calculate one metric from the graph for each node: the sum of the intrinsic miss count of a node N and the fluxes (edge weights) from N to all other nodes. This metric prioritizes the nodes, and allocation to SPM follows the order of priority until the SPM is filled. The scheme assumes that the interference miss count of a set with all other data sets is positively correlated with (even though not proportional to) the sum of the pairwise interference miss counts involving that set. The rationale is that the interference miss counts annotated in the graph account for the relationship between only two data sets at a time; the interference misses among a larger number of data sets are greater than or equal to the annotated pairwise interference misses, although the exact amount depends on the particular cache mapping. The complexity of the proposed heuristic is O(n^2), i.e., polynomial time, unlike the exponential complexity of the exhaustive approach.
This scheme can be applied iteratively at finer levels of data set granularity. Even when V_Ω is chosen appropriately, if the SPM capacity is not large enough to contain all data of V_Ω, a further sub-analysis of V_Ω may be conducted to identify which of its sub elementary sets will benefit most from allocation into SPM. For example, in Fig. 2, each task is composed of three sub sets representing different variable storage classes: static variables (S), dynamically-allocated variables (D), and automatic variables (A). For each of the sub sets chosen at the previous step, further sub-analysis can be applied to determine which data storage class of the set will be allocated into the limited capacity of SPM.
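A minimal sketch of the heuristic (illustrative only, not our actual toolchain) is given below; it assumes that the intrinsic miss counts, pairwise interference miss counts, and data set sizes have already been obtained as described above:

    # Sketch of the heuristic SPM allocation based on the cache interference
    # degree graph. Assumed, already-measured inputs:
    #   intrinsic[x]         = m(x), intrinsic miss count of data set x
    #   interference[(x, y)] = m(x#y), pairwise interference miss count
    #   size[x]              = bytes occupied by data set x
    def heuristic_allocation(intrinsic, interference, size, spm_capacity):
        def metric(node):
            # Node priority = intrinsic misses + interference "fluxes" to all
            # other nodes of the graph.
            flux = sum(count for (a, b), count in interference.items()
                       if node in (a, b))
            return intrinsic[node] + flux

        # Higher metric -> allocated into SPM earlier; here we simply skip data
        # sets that no longer fit in the remaining capacity.
        chosen, remaining = [], spm_capacity
        for node in sorted(intrinsic, key=metric, reverse=True):
            if size[node] <= remaining:
                chosen.append(node)
                remaining -= size[node]
        return chosen

Applied to the measurements described in Section V-B, this metric produces the allocation order reported there (Rotate first).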
Table I: Target system features

Microcontroller:    NXP LPC2888 (ARM7TDMI-S, 36MHz)
Int. SRAM (SPM):    64KB, 32-bit width, not cacheable; 2/2 clock cycles for read/write of 32 bits
Ext. SRAM:          2MB, 16-bit width, cacheable; 17/14 clock cycles for read/write of 32 bits
Unified Cache:      8KB, 2-way set associative, 32-byte cache line, LRU replacement, write-back, no allocation on write miss, only data cache enabled; 1/1 clock cycle for read/write of 32 bits
SDRAM:              32MB, 16-bit width, not cacheable
Flash ROM:          2MB, 128-bit width
RTOS:               FreeRTOS, 1 ms scheduling period
Toolchain:          GNU arm-elf-gcc 4.1.1, arm-elf-as 2.17, newlib 1.14.0
V. EXPERIMENTS

We carried out a set of experiments to evaluate the proposed scheme and the effect of SPM on data cache misses. We use a real hardware platform running a real-time operating system with several tasks.

A. Target System

The hardware system used for the experiments has an ARM7TDMI-S processor core running at 36MHz and a heterogeneous memory system: 64KB internal SRAM (SPM), 2MB external SRAM, 32MB SDRAM, 2MB internal flash ROM, and an 8KB unified cache. The flash ROM stores program code. Only the data cache is enabled, so the capacity of the data cache is 8KB as shown. The other details of the hardware are listed in Table I.
The SPM has lower read and write latency than the external SRAM. Unfortunately, in our system the SPM access time is longer than the data cache hit time, which weakens the performance impact of our approach relative to its potential; other common systems feature scratchpad memories which are as fast as the data cache. The gap in access latency between data cache and SPM affects the execution time of tasks, and consequently preemption, which can increase the number of cache misses, as discussed in Section III. However, the maximum difference in the measured number of preemptions is so small that this gap is negligible.
The test programs are collected from the open source domain and MiBench [25]: Dijkstra, ADPCM, TinyJPEG, MgSort, FFT2D, and Rotate. Dijkstra and TinyJPEG use dynamic-type data, and Dijkstra and MgSort contain recursive procedure calls. These programs are chosen to demonstrate different types of code and data behavior. As input workloads to the tasks except TinyJPEG, five different 8KB samples are taken from five classic texts: The Adventures of Huckleberry Finn, The Catcher in the Rye, The Lord of the Rings, The Odyssey, and The Three Musketeers. For TinyJPEG, five 64×48 pixel images are prepared. These five inputs add timing variability to the execution time of the programs. Once the system runs, the five inputs are fed in round-robin fashion.
Each program runs as one task in our real-time system. Table II presents the real-time properties and memory usage of the programs. The preemptive real-time operating system FreeRTOS [26] schedules these tasks. The priority of each task is assigned rate-monotonically; higher priority is represented by a higher number in Table II.
Table II: Details of test programs

Task      Priority  WCET (ms)  BCET (ms)  Period (ms)  Num. of Jobs  Memory Usage (Bytes): Static / Dynamic / Automatic
Dijkstra  1         681.3      605.5      6,500        30            4,388 / 1,020 / 156
ADPCM     2         623.1      598.7      6,500        30            948 / 0 / 332
TinyJPEG  3         555.0      356.1      5,000        39            260 / 51,972 / 1,580
MgSort    4         474.3      473.7      5,000        39            4,100 / 0 / 552
FFT2D     5         262.3      262.1      2,500        78            4,100 / 0 / 568
Rotate    6         164.1      163.9      1,500        130           4 / 0 / 4,540

The system is designed to have about 62.1% processor utilization, with about 10% per task. Overhead due to the scheduler and experiment-management code raises the actual utilization to 69.7%. The actual sizes of the input workloads used by the tasks are adjusted to make the hyperperiod of the system feasible; the hyperperiod is 195 seconds. Only memory accesses to the external SRAM are cacheable, and all factors which can affect cache misses are controlled except the data of the test programs. For example, internal data such as task management lists, the stacks of the RTOS itself, the dynamic memory management library, and the cache miss measurement library are allocated into a non-cacheable memory region (SDRAM). Only the data used by the programs themselves are cacheable.
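As a quick cross-check of Table II (an illustrative sketch added here, not part of our toolchain; math.lcm requires Python 3.9 or later), the quoted utilization, hyperperiod, and job counts follow directly from the WCETs and periods:

    # Cross-check of the real-time parameters in Table II: utilization from the
    # WCETs and periods, hyperperiod from the periods, and jobs per task.
    from math import lcm

    wcet_ms   = {"Dijkstra": 681.3, "ADPCM": 623.1, "TinyJPEG": 555.0,
                 "MgSort": 474.3, "FFT2D": 262.3, "Rotate": 164.1}
    period_ms = {"Dijkstra": 6500, "ADPCM": 6500, "TinyJPEG": 5000,
                 "MgSort": 5000, "FFT2D": 2500, "Rotate": 1500}

    utilization = sum(wcet_ms[t] / period_ms[t] for t in wcet_ms)
    hyperperiod = lcm(*period_ms.values())
    jobs = {t: hyperperiod // p for t, p in period_ms.items()}

    print(f"{utilization:.1%}")   # 62.1%
    print(hyperperiod)            # 195000 ms = 195 s
    print(jobs)                   # Dijkstra/ADPCM: 30, TinyJPEG/MgSort: 39, FFT2D: 78, Rotate: 130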
[Figure 2: Cache interference degree graph used in the experiments, drawn in a Rotate-centric view. Each task node is composed of sub sets for its static (S), dynamic (D), and automatic (A) data.]
B. Experiment Setup

In the experiments, the total number of cache misses during the hyperperiod is measured by a hardware cache miss counter inside the processor while exhaustively exploring all task combinations. We allocate the data of from 0 to 6 tasks into the SPM, exploring 64 combinations in all. If a task is selected for allocation into SPM, all of its static and automatic data are moved into SPM; the data of the other tasks remain in the external SRAM and are still cacheable. Dynamic-type data are excluded from this experiment because the exclusivity condition for data sets is not guaranteed, as mentioned previously.
The prior section shows that slightly different data memory layouts may unpredictably affect the number of cache misses. Hence, we use two different types of allocation for each combination of tasks, to evaluate our heuristic both excluding (Controlled-Allocation) and including (Real-Allocation) that noise. Every measurement is performed for both types of allocation (128 measurements in total).
Controlled-Allocation pins the data memory layout throughout the exploration. To keep the memory layout of the static data variables unchanged, each static variable has one dummy static variable of the same size. After the original variable is moved to SPM, its dummy occupies the same location in the external SRAM, so the other static variables keep their original locations. Of course, the dummy
variables are never accessed. In the case of automatic data variables, the task stack allocator of FreeRTOS is modified so that each task stack is always placed at the same memory location in the external SRAM. No artificial adjustment of the memory layout is applied in Real-Allocation. Controlled-Allocation is an idealized and wasteful memory allocation, whereas Real-Allocation is the default memory allocation actually performed by the linker and the RTOS stack allocator, although it does not fix a unique memory layout during the measurements.
A cache miss analysis is also performed to evaluate how well the proposed heuristic scheme works in the real world. The cache miss analysis uses Controlled-Allocation. First, the intrinsic misses of the tasks are measured, as well as the total number of cache misses. Then each pair of tasks is moved to SPM and measured, and we can compute the interference miss count between any pair of tasks. All of these measurements are already included in the exhaustive measurement explained above. When the suggested heuristic
is applied to real-world situations, only these measurements need to be performed. The cache misses can then be analyzed as explained in the prior section, and a new memory allocation is obtained. Finally, a cache interference degree graph is drawn, as in Fig. 2. Each node in Fig. 2 is annotated with its number of intrinsic cache misses (in parentheses) and its interference misses with the other tasks. The resulting priority of the tasks to be allocated into SPM follows the order Rotate → Dijkstra → FFT2D → MgSort → ADPCM → TinyJPEG, from higher to lower. This order means that as the number of tasks being allocated into SPM increases, it is most beneficial to select the tasks starting with Rotate. The graph is drawn to emphasize the Rotate task, which has the highest priority.

VI. RESULTS

The system's total number of cache misses during the hyperperiod is 618,174, shown in Table III as the case of zero tasks allocated into SPM. From Fig. 2, the interference misses are estimated to be 83.3% of the total cache misses, and the intrinsic misses only 16.7%, in Controlled-Allocation. It is from cache contention and the corresponding cache pollution that most cache misses arise.
The cache miss counts for all 64 task combinations are plotted in Fig. 3, sorted horizontally by the number of tasks with data allocated into SPM. The blue solid line shows the performance trajectory of a system using our heuristic, i.e., the tasks are added into SPM in the sequence Rotate → Dijkstra → FFT2D → MgSort → ADPCM → TinyJPEG. In both the Controlled-Allocation and Real-Allocation cases, our heuristic is best at reducing the total cache misses of the system, so it gives very good guidance in choosing appropriate data for SPM. The combinations showing the best result for each number of tasks with data in SPM are listed in Table III.
Prior research selects data to allocate to SPM based on memory access counts or frequency [9]–[12], [14], [15]. We next examine how well these approaches work in our experimental configuration in comparison with our method. Table IV presents the portion of total memory access counts and the portion of intrinsic cache miss counts contributed by each task during the hyperperiod. Using this data we examine the performance trajectories of three allocation sequences, shown by the three black lines (dotted, dot-dashed, and dashed) in Fig. 3. The first is ordered by the tasks' scheduling priority and results in the sequence Rotate → FFT2D → MgSort → TinyJPEG → ADPCM → Dijkstra. The second is ordered by the portion of total memory access counts and results in the sequence ADPCM → MgSort → FFT2D → Dijkstra → Rotate → TinyJPEG. The third is ordered by the portion of intrinsic cache miss counts and results in the sequence Dijkstra → Rotate → FFT2D → MgSort → TinyJPEG → ADPCM. None of these allocation schemes
Table IV: Portion of memory access counts and intrinsic cache miss counts by each task during the hyperperiod

Task      Mem. Access Count  Intrinsic Cache Miss Count
Dijkstra  12.8%              92.0%
ADPCM     34.3%              0.3%
TinyJPEG  1.2%               0.4%
MgSort    27.6%              0.5%
FFT2D     16.0%              1.3%
Rotate    8.1%               5.5%
works as well as our heuristic, although the scheme based on intrinsic cache miss counts (the dashed lines) comes close. These trajectories imply that inter-task cache pollution is not determined solely by the scheduling priority (or, equivalently, the number of jobs during the hyperperiod, since the target system adopts rate-monotonic priorities), the amount of memory accesses, or the number of cache misses; it results from the superposition of those factors. The proposed heuristic captures the combined effect of these complex interactions.
It is interesting to note that by moving only one task (Rotate), the total number of cache misses is reduced to 37.3% (Controlled-Allocation) and 37.2% (Real-Allocation) of the original (i.e., 618,174). This would be counterintuitive without our approach, since the intrinsic misses of Rotate account for only 0.9% of the total cache misses. Nevertheless, allocating the Rotate task into SPM reduces the total cache misses of the system dramatically.
Table III also lists how much the intrinsic and interference misses fall with the proposed scheme. This shows the synergy between data cache and SPM which we have proposed: the allocated task data are serviced by the fast SPM, so they no longer experience cache misses, and in addition data cache contention and pollution are reduced, helping the other tasks whose data still remains in the cacheable memory region. This behavior continues as more tasks are allocated into SPM.
The two allocation policies, Controlled-Allocation and Real-Allocation, show similar behavior but slightly different numbers of cache misses throughout the exploration. The maximum difference in the number of cache misses between the two policies is 3.2%, when normalized to the original number of cache misses. This implies that different memory layouts can still perturb the results of static cache miss analysis slightly. Nevertheless, our allocation reduces the total number of cache misses and consistently overcomes this obstacle.
In this work, we did not impose an SPM capacity budget, in order to examine the maximum performance of the heuristic. If SPM capacity is limited, the analysis may be conducted at a finer data set granularity, as discussed previously. Also, in this experiment the number of cache misses is easily measured with the hardware cache miss counter of the processor.
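Theorem IV.1 can also be checked directly against Table III; the short sketch below (added for illustration) reproduces the totals and the 37.3% / 37.2% figures quoted above:

    # Verify intrinsic + interference = total on Table III, and the reduction
    # obtained when only Rotate is moved into SPM.
    baseline = 618_174
    controlled = {0: (103_544, 514_630), 1: (97_821, 132_509), 2: (2_527, 75_506),
                  3: (1_242, 8_411), 4: (750, 2), 5: (324, 0), 6: (0, 0)}
    real       = {0: (102_119, 516_055), 1: (97_826, 132_128), 2: (2_522, 74_546),
                  3: (1_242, 6_069), 4: (754, 0), 5: (324, 0), 6: (0, 0)}

    for name, table in (("Controlled", controlled), ("Real", real)):
        totals = {k: i + x for k, (i, x) in table.items()}
        assert totals[0] == baseline
        print(name, f"{totals[1] / baseline:.1%}")   # 37.3% and 37.2%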
[Figure 3: Reduction of cache misses by data allocation into SPM using the heuristic (blue solid line) and other methods (black dotted, dot-dashed, and dashed lines), overlaid on the exhaustive search (grey dots). (a) Controlled-Allocation; (b) Real-Allocation. Each panel plots the normalized number of cache misses against the number of tasks in SPM (0–6) for orderings by task scheduling priority, total memory access counts, intrinsic cache miss counts, and the proposed heuristic.]
Table III: Best task combinations to be allocated into SPM and corresponding cache miss counts¹

Controlled-Allocation:
Num. of Tasks in SPM  Tasks in SPM        Intrinsic  Interference  Total
0                     -                   103,544    514,630       618,174
1                     R                   97,821     132,509       230,330
2                     R, D                2,527      75,506        78,033
3                     R, D, F             1,242      8,411         9,653
4                     R, D, F, M          750        2             752
5                     R, D, F, M, T       324        0             324
6                     R, D, F, M, T, A    0          0             0

Real-Allocation:
Num. of Tasks in SPM  Tasks in SPM        Intrinsic  Interference  Total
0                     -                   102,119    516,055       618,174
1                     R                   97,826     132,128       229,954
2                     R, D                2,522      74,546        77,068
3                     R, D, F             1,242      6,069         7,311
4                     R, D, F, M          754        0             754
5                     R, D, F, M, T       324        0             324
6                     R, D, F, M, T, A    0          0             0

¹ Analyzed priority order of tasks: Rotate (R) → Dijkstra (D) → FFT2D (F) → MgSort (M) → ADPCM (A) → TinyJPEG (T)
For systems lacking that feature, many existing cache miss analysis techniques could be accommodated instead, as Ramaprasad et al. [4] did. This does not necessarily degrade the applicability of our scheme, since the total number of cache misses can be reduced by using even a small amount of SPM. This work pursues multiple goals: reducing cache misses by using SPM and providing a cache miss analysis scheme to support it.

VII. CONCLUSION

Much research concentrates on tightly bounding data cache misses, or else on exploiting SPM in place of a data cache. However, cache miss analysis requires careful attention, since it is sensitive to many factors, as illustrated above, and using only SPM misses the synergy attainable when using the data cache and SPM simultaneously. This study suggests a heuristic scheme to improve data cache performance by using SPM to hold troublesome data.
We present a new perspective: classifying cache misses as intrinsic or interference misses. The intrinsic misses depend only on a data set itself, while the interference misses reflect
conflicts with other data sets. The proposed scheme identifies which data sets are best to allocate into SPM. We have revealed significant synergy between the data cache and SPM by examining inter-task cache pollution. The tasks whose data is allocated into SPM benefit from fewer intrinsic misses, while the other tasks also benefit from fewer interference misses due to reduced cache contention and pollution. Our experiments show the proposed heuristic provides good guidance for choosing these data sets. We leave iterative application of the proposed heuristic as future work.

REFERENCES

[1] C.-G. Lee, H. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee, and C. S. Kim, “Analysis of cache-related preemption delay in fixed-priority preemptive scheduling,” Computers, IEEE Transactions on, vol. 47, no. 6, pp. 700–713, Jun. 1998.
[2] H. S. Negi, T. Mitra, and A. Roychoudhury, “Accurate estimation of cache-related preemption delay,” in Proceedings of the 1st IEEE/ACM/IFIP international conference
on Hardware/software codesign and system synthesis, ser. CODES+ISSS ’03. New York, NY, USA: ACM, 2003, pp. 201–206. [3] H. Ramaprasad and F. Mueller, “Bounding worst-case data cache behavior by analytically deriving cache reference patterns,” in Real Time and Embedded Technology and Applications Symposium, 2005. RTAS 2005. 11th IEEE, 7-10 2005, pp. 148 – 157. [4] ——, “Bounding preemption delay within data cache reference patterns for real-time tasks,” in Real-Time and Embedded Technology and Applications Symposium, 2006. Proceedings of the 12th IEEE, april 2006, pp. 71 – 80. [5] S. Altmeyer and C. Burguiere, “A new notion of useful cache block to improve the bounds of cache-related preemption delay,” in Real-Time Systems, 2009. ECRTS ’09. 21st Euromicro Conference on, july 2009, pp. 109 –118. [6] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, “Scratchpad memory: design alternative for cache on-chip memory in embedded systems,” in CODES ’02: Proceedings of the tenth international symposium on Hardware/software codesign. New York, NY, USA: ACM, 2002, pp. 73–78. [7] S. Steinke, L. Wehmeyer, B.-S. Lee, and P. Marwedel, “Assigning program and data objects to scratchpad for energy reduction,” in Design, Automation and Test in Europe Conference and Exhibition, 2002. Proceedings, 2002, pp. 409 –415. [8] M. J. Absar and F. Catthoor, “Compiler-based approach for exploiting scratch-pad in presence of irregular array access,” in Proceedings of the conference on Design, Automation and Test in Europe - Volume 2, ser. DATE ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 1162–1167. [9] J. Absar and F. Catthoor, “Analysis of scratch-pad and datacache performance using statistical methods,” in Design Automation, 2006. Asia and South Pacific Conference on, 2006, pp. 820–825. [10] S. Mamagkakis, D. Atienza, C. Poucet, F. Catthoor, D. Soudris, and J. Mendias, “Custom design of multi-level dynamic memory management subsystem for embedded systems,” in Signal Processing Systems, 2004. SIPS 2004. IEEE Workshop on, oct. 2004, pp. 170 – 175. [11] O. Avissar, R. Barua, and D. Stewart, “Heterogeneous memory management for embedded systems,” in CASES ’01: Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems. New York, NY, USA: ACM, 2001, pp. 34–43. [12] V. Suhendra, T. Mitra, A. Roychoudhury, and T. Chen, “WCET centric data allocation to scratchpad memory,” in Real-Time Systems Symposium, 2005. RTSS 2005. 26th IEEE International, 8-8 2005, pp. 10 pp. –232. [13] J.-F. Deverge and I. Puaut, “WCET-directed dynamic scratchpad memory allocation of data,” in Real-Time Systems, 2007. ECRTS ’07. 19th Euromicro Conference on, 4-6 2007, pp. 179 –190.
[14] R. Ghattas, G. Parsons, and A. Dean, “Optimal unified data allocation and task scheduling for real-time multi-tasking systems,” in Real Time and Embedded Technology and Applications Symposium, 2007. RTAS ’07. 13th IEEE, 3-6 2007, pp. 168 –182. [15] S. Kang and A. G. Dean, “DARTS: Techniques and tools for predictably fast memory using integrated data allocation and real-time task scheduling,” in Real-Time and Embedded Technology and Applications Symposium (RTAS), 2010 16th IEEE, 12-15 2010, pp. 333 –342. [16] D. Kirk, “SMART (strategic memory allocation for realtime) cache design,” in Real Time Systems Symposium, 1989., Proceedings., 5-7 1989, pp. 229 –237. [17] D. Kirk and J. Strosnider, “SMART (strategic memory allocation for real-time) cache design using the mips r3000,” in Real-Time Systems Symposium, 1990. Proceedings., 11th, Dec 1990, pp. 322 –330. [18] F. Mueller, “Compiler support for software-based cache partitioning,” SIGPLAN Not., vol. 30, pp. 125–133, November 1995. [19] I. Puaut and D. Decotigny, “Low-complexity algorithms for static cache locking in multitasking hard real-time systems,” in Real-Time Systems Symposium, 2002. RTSS 2002. 23rd IEEE, 2002, pp. 114 – 123. [20] I. Puaut and C. Pais, “Scratchpad memories vs locked caches in hard real-time systems: a quantitative comparison,” in Design, Automation Test in Europe Conference Exhibition, 2007. DATE ’07, 16-20 2007, pp. 1 –6. [21] S. Ghosh, M. Martonosi, and S. Malik, “Cache miss equations: an analytical representation of cache misses,” in ICS ’97: Proceedings of the 11th international conference on Supercomputing. New York, NY, USA: ACM, 1997, pp. 317–324. [22] M. Verma, L. Wehmeyer, and P. Marwedel, “Cache-aware scratchpad allocation algorithm,” in Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, vol. 2, 16-20 2004, pp. 1264 – 1269 Vol.2. [23] P. Panda, N. Dutt, and A. Nicolau, “Efficient utilization of scratch-pad memory in embedded processor applications,” in European Design and Test Conference, 1997. ED TC 97. Proceedings, 17-20 1997, pp. 7 –11. [24] P. R. Panda, N. D. Dutt, and A. Nicolau, “On-chip vs. offchip memory: the data partitioning problem in embedded processor-based systems,” ACM Transactions on Design Automation and Electronic Systems, vol. 5, no. 3, pp. 682–704, 2000. [25] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, “Mibench: A free, commercially representative embedded benchmark suite,” in Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop on, dec. 2001, pp. 3 – 14. [26] The FreeRTOS Project, http://www.freertos.org.