A Dynamic Instruction Scratchpad Memory for Embedded Processors Managed by Hardware

Stefan Metzlaff¹, Irakli Guliashvili¹, Sascha Uhrig², and Theo Ungerer¹

¹ Department of Computer Science, University of Augsburg, Germany
{metzlaff,guliashvili,ungerer}@informatik.uni-augsburg.de
² Robotics Research Institute, TU Dortmund, Germany
[email protected]
Abstract. This paper proposes a hardware-managed instruction scratchpad working at the granularity of functions, designed for real-time systems. It guarantees that every instruction is fetched from the local, fast, and timing-predictable scratchpad memory. Thus, a predictable behavior is reached that eases a precise timing analysis of the system. We estimate the hardware resources required to implement the dynamic instruction scratchpad on an FPGA. An evaluation quantifies the impact of our scratchpad on average-case performance. It shows that, compared to standard instruction memories, the dynamic instruction scratchpad delivers reasonable performance while providing predictable behavior and easing timing analysis.
1 Introduction
Embedded systems in safety-critical application domains like automotive and avionics are subject to hard real-time (HRT) constraints. In HRT systems a missed deadline can cause serious system breakdowns that may harm the system or even human beings. Therefore, the timing of an HRT system has to be analyzed before it can be deployed. To be sure that the deadlines are met in all cases, independent of input values and system states, the worst-case execution time (WCET) has to be determined. To estimate the WCET of an application, two methods [1] can be applied: static or measurement-based WCET analysis. Both techniques have to consider the whole system, including the processor and the memory system. Therefore, predictable memory accesses are crucial for an HRT system. For memories like caches it is complex to determine whether a memory access will lead to a cache hit or a miss [2]. If a cache hit is uncertain, a cache miss has to be assumed to assure that the calculated WCET is not underestimated. Thus, an upper bound of the memory access time can be estimated for caches, but it will be pessimistic. In the worst case the WCET analysis has to ignore the cache entirely.
This work has been supported by the EC Grant Agreement n° 216415 (MERASA).
The content of a scratchpad is usually defined by the application designer at compile time. This simplifies the analysis of the timing behavior, because it is known what data is located in which memory. Hence, the usage of scratchpads with a static assignment of data or instructions allows exact timing estimations for the memory accesses and eases the computation of the WCET. The drawback of static assignment is that the scratchpad memory content cannot change during run-time. In contrast to caches, the memory utilization of statically assigned scratchpads is poor. Beside the type of memory, it is also important if and how memory operations can interfere with each other. If a memory request can be delayed by another resource (e.g. when the instruction fetch unit and the load/store unit access a common memory controller), the complexity of a precise WCET analysis increases considerably. And if such memory interferences are not handled correctly by the analysis, the calculated WCET might be underestimated. In this paper we propose the Dynamic Instruction ScratchPad (D–ISP), which loads functions dynamically on demand, provides a predictable instruction fetch, and addresses the problem of memory interferences. By a two-phased execution scheme the D–ISP prevents memory interferences between the instruction and data paths and thus eases a precise WCET analysis. The paper is structured as follows: The next section discusses related work. Section 3 describes the architectural characteristics of the D–ISP and its implementation. Section 4 evaluates the impact of the scratchpad on average-case performance and Section 5 discusses the hardware requirements of the D–ISP. The last section concludes the paper and presents an outlook on future work.
2 Related Work
Scratchpad memories are mainly used in embedded systems to reduce energy consumption or to lower the WCET of HRT applications. Both goals can be reached because scratchpads (usually SRAMs) have a low and constant memory latency and do not need an energy-consuming tag memory like caches. The content of scratchpads can be assigned statically, or it is managed during run-time by software or, as we propose, by hardware. To statically assign code to the scratchpad, e.g. the most frequently used instructions are selected to reduce the energy consumption [3] or the WCET [4]. In [5] the authors reduce the WCET by statically locating basic blocks in an instruction scratchpad at compile-time. In contrast to static scratchpads, software-managed scratchpads allow changing the content during run-time, as proposed in [6,7]. Egger et al. [8] combine the static and the software-managed approach: A function can be statically located in the scratchpad or in the external memory, or the function is loaded dynamically on demand into the scratchpad. For functions that are copied on demand, a page manager handles the content lookup and the function copying. The decision which functions are placed in the scratchpad is made by an optimization algorithm to reduce the energy consumption. Since the page manager is implemented in software, the performance drawback will be higher than for a hardware-controlled solution as we propose. Moreover, this overhead must be taken into account during the WCET analysis.
Janapsatya et al. [9] describe a managed scratchpad that is optimized with regard to memory utilization to reduce energy consumption. The scratchpad contains basic blocks that are selected based on a temporal proximity metric: basic blocks that are executed consecutively are assigned to the scratchpad together at the same time. A scratchpad controller implemented in hardware is triggered by special copy instructions and loads the selected blocks into the scratchpad. In contrast to the D–ISP, the scratchpad proposed in [9] is not supposed to hide the memory hierarchy from the processor's fetch path. Using the D–ISP, every fetch is directed to it. This ensures that fetches do not interfere with other memory accesses, which is important for a precise WCET estimation. Caches as hardware-managed memories are also used in real-time systems. But because a precise cache analysis is complicated for common caches, different techniques are used in real-time systems. To make caches predictable, the cache content can be locked [10]. Thus, the cache eviction can be controlled such that the analysis is simplified. In [11] a cache-based code partitioning for functions to decrease the WCET is proposed. A WCET-aware compiler decides which functions are put in a cacheable memory region and which are not. Thus, the miss rate of the cache on the worst-case path can be reduced and the WCET of the application is decreased. Another function-based approach like the proposed D–ISP is the predictable method cache by Schoeberl [12,13]. The method cache uses complete methods as replacement granularity. The proposed cache structure binds a memory block to its cache tag, so the usage of smaller memory blocks to improve memory density leads to a high number of cache tags, causing either a slow or a hardware-intensive hit detection. In the D–ISP the scratchpad content is decoupled from the lookup tables, so the complexity of the hit detection is restricted to the number of entries of the lookup tables only. Preußer et al. [14] address the complexity problem of a fine-grained method cache implementation and show an implementation with a stack-based replacement policy.
3 The Dynamic Instruction Scratchpad
The idea of the D–ISP is to bind all instructions of one function together and load them into a fast on-chip scratchpad at once. Thus, it is ensured that every instruction of the active function is held in the scratchpad before the function is executed. So while a function is executed, every instruction is fetched from the D–ISP and no instruction memory access on any other level of the memory hierarchy is needed. This fact is important, because any fetch that reaches a shared memory level, like the off-chip memory shown in Figure 1(a), not only disrupts the timing of the execution (by waiting on the instructions) but also interferes with data memory accesses. If memory interferences are possible, a complex and detailed integrated pipeline and memory system analysis has to be applied to obtain a safe WCET. Otherwise, for each memory access an additional delay caused by interferences has to be assumed. But this pessimistic approach impairs the tightness of the estimated WCET.
Fig. 1. D–ISP block diagrams: (a) D–ISP overview, (b) detailed D–ISP block diagram with memories
By eliminating memory interferences between instruction and data memory accesses, a precise WCET analysis with reduced effort becomes possible. The D–ISP precludes these interferences by guaranteeing that all instructions of the active function are located in the D–ISP. This leads to a two-phased execution behavior: either the pipeline is stalled because a function is loaded into the D–ISP, or the pipeline executes a function.
3.1 D–ISP Architecture
The D–ISP is located on-chip, close to the fetch stage of the processor pipeline. It handles fetch requests from the processor like a common scratchpad. The D–ISP requires control signals (see Figure 1(a)) from the pipeline to notice control flow changes on calls and returns. Therefore, the host processor needs minor changes in logic and signal routing: If the pipeline executes a call or return instruction, the D–ISP has to be informed. Furthermore, the call target address has to be routed to the D–ISP. With that information the D–ISP is capable of loading functions that are activated by calls or returns into the local memory before their execution. For functions that are stored in the D–ISP, every fetch request is answered with the constant low latency of an on-chip SRAM. The D–ISP consists of two parts: the fetch control, which is responsible for delivering the instructions to the pipeline, and the content management. For handling fetch requests, the D–ISP has to translate the native addresses requested by the pipeline into the local addresses used by the scratchpad. To allow a correct address translation, information from the content management about the stored functions is needed. The D–ISP content management has to perform several subtasks to assure that the currently active function is in the scratchpad memory: check the content of the scratchpad, copy functions into the scratchpad, evict functions on overwrite, and update the address mappings. On calls or returns, the content management first has to check whether the function is already contained in the D–ISP. If so (D–ISP hit), the function execution can be started without any delay. On a D–ISP miss, the content management has to load the complete function from the next memory level into the scratchpad.
If the scratchpad space is too small to load the function without overwriting others, those functions are evicted. After the function has been completely copied into the scratchpad, the address mapping is updated. Since loading a function may take several cycles, the function execution is stalled until the function is completely loaded. This is necessary because otherwise the two-phased execution scheme would be disrupted by ongoing function loads requested by the D–ISP controller, which may interfere with data memory accesses triggered by the function execution. The content management ensures that a function is either completely contained in the D–ISP or not at all. Therefore, the minimum D–ISP size is determined by the size of the largest function in the executed code that is intended to be loaded into the D–ISP.
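To illustrate the address handling, the following C++ fragment sketches the translation that the fetch control performs for every fetch request. It is a minimal sketch under stated assumptions: the type and field names are illustrative, not the actual hardware interface.

```cpp
// Sketch of the fetch-path address translation (detailed in Section 3.2):
// the context register holds the active function's start address in both
// address spaces; the fetch offset is rebased from the native address space
// into the scratchpad address space.
#include <cstdint>

struct ContextRegister {
    uint32_t func_native;  // start of the active function, native address space
    uint32_t func_spm;     // start of the active function in the scratchpad
};

uint32_t to_spm_address(const ContextRegister& ctx, uint32_t fetch_addr) {
    return ctx.func_spm + (fetch_addr - ctx.func_native);
}
```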
3.2 D–ISP Implementation Details
The D–ISP requires the length of a function when copying it into the scratchpad. In contrast to previous work [15], where the D–ISP controller detected the end of the function on the fly, we decided to use instrumentation to obtain the function's length. This reduces the complexity of the D–ISP, because no instruction parsing has to be performed. Therefore, we created an instrumentation tool that hooks into the compilation and linking process of the application and adds a special instruction at the beginning of every function in the application code to encode the function length. Using this instruction, the D–ISP obtains the function size and copies the appropriate number of bytes into the scratchpad. Functions larger than the scratchpad or without this special instruction are ignored by the D–ISP. As shown in Figure 1(b), the D–ISP controller consists of the two separate parts described in Section 3.1: fetch control and content management. Both are coupled by the context register. This register is written by the content management and used by the fetch control to determine the start address of the active function in the scratchpad. For simplicity of program execution, the D–ISP hides its address space from the pipeline by an address mapping in the fetch control. The fetch control calculates the offset of the fetch relative to the function's start address and adds it to the address where the function is located in the scratchpad memory. Both addresses are stored in the context register. Because this is implemented asynchronously, fetches can be handled by the D–ISP within one cycle. To prevent the fetch control from accessing invalid entries in the context register and delivering wrong instructions, it stalls all pending fetch requests while the content management is active. The content management is activated on calls or returns only. It uses a mapping table to check whether the activated function is in the scratchpad. For each mapped function, the mapping table holds the address in the native address space, the address in the scratchpad, and the function length. To find entries in the mapping table without inspecting the whole table (which takes one cycle per entry), an additional lookup table is used, see Figure 1(b). The lookup table delivers several addresses of mapped functions within one cycle to the D–ISP controller, which compares them to the address of the function that was called.
On a lookup table hit, the corresponding mapping table entry is selected, the context register is updated, and the fetch control is reactivated. If the called function is not found in the lookup table, the function has to be copied into the scratchpad. Therefore, a new mapping is created and the first fetch block is requested from the main memory. Using the function size obtained by decoding the special instruction, the content management requests the remaining fetch blocks of the function and copies them into the scratchpad. If the first block of another function in the scratchpad is overwritten, the content management deletes the corresponding mapping table and lookup table entries. Thus, each function is maintained by the scratchpad as a whole or not at all. The scratchpad addressing is cyclic, such that the replacement policy of the scratchpad is FIFO. After the last block of a function has been copied into the scratchpad, the context register is updated and the fetch control is reactivated. On return instructions, the content management has to determine the address of the caller function to check whether it is still in the scratchpad. To obtain this address without complex calculations or long latencies, the content management maintains its own stack memory (see Figure 1(b)). This memory works as a stack that contains the addresses of the called functions: on every call the function address is pushed onto the stack, and on return the top address is removed. Using the stack memory, the address of the reactivated function is determined on return without delay. With this address the content management is able to check the lookup table and proceed as described above for handling function calls.
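The call/return handling just described can be condensed into a short behavioral model. The following C++ sketch is an illustration only, not the actual VHDL controller: class and member names are assumptions, the function length is passed as a parameter instead of being decoded from the special instruction, and the lookup table acceleration as well as wrap-around corner cases of the cyclic addressing are simplified.

```cpp
// Behavioral sketch of the D-ISP content management on calls and returns.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Mapping {              // one mapping table entry
    uint32_t native_addr;     // function start in the native address space
    uint32_t spm_addr;        // function start in the scratchpad address space
    uint32_t length;          // function size in bytes
    bool     valid = false;
};

class DispContentManagement {
public:
    DispContentManagement(uint32_t spm_bytes, std::size_t n_of1)
        : table_(n_of1), spm_size_(spm_bytes) {}

    void on_call(uint32_t target, uint32_t length) {
        stack_.push_back({target, length});  // push the callee onto the stack memory
        activate(target, length);
    }

    void on_return() {
        stack_.pop_back();                   // drop the callee; the caller's address
        activate(stack_.back().addr,         // is then available without delay
                 stack_.back().length);
    }

    // Context register contents used by the fetch control for translation.
    uint32_t ctx_native() const { return ctx_native_; }
    uint32_t ctx_spm() const { return ctx_spm_; }

private:
    struct Frame { uint32_t addr, length; };

    void activate(uint32_t addr, uint32_t length) {
        for (const Mapping& m : table_)      // lookup; done in parallel in hardware
            if (m.valid && m.native_addr == addr) {
                ctx_native_ = addr;          // D-ISP hit: just update the context
                ctx_spm_ = m.spm_addr;
                return;
            }
        if (length <= spm_size_)             // functions larger than the scratchpad
            load(addr, length);              // are ignored; on a miss: stall and load
    }

    void load(uint32_t addr, uint32_t length) {
        if (write_ptr_ + length > spm_size_) // simplified cyclic addressing
            write_ptr_ = 0;
        for (Mapping& m : table_)            // evict every function whose first block
            if (m.valid && m.spm_addr >= write_ptr_ &&
                m.spm_addr < write_ptr_ + length)
                m.valid = false;             // gets overwritten (FIFO replacement)
        // ... the copy from off-chip memory happens here; the pipeline stalls ...
        for (Mapping& m : table_)            // create the new mapping
            if (!m.valid) { m = {addr, write_ptr_, length, true}; break; }
        ctx_native_ = addr;                  // update the context register and
        ctx_spm_ = write_ptr_;               // reactivate the fetch control
        write_ptr_ += length;
    }

    std::vector<Mapping> table_;             // mapping table (lookup table omitted)
    std::vector<Frame> stack_;               // stack memory for call/return handling
    uint32_t spm_size_, write_ptr_ = 0;
    uint32_t ctx_native_ = 0, ctx_spm_ = 0;
};
```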
4 Evaluation of Performance Impact
The main contribution of the D–ISP is an improved predictability of instruction fetches: The two-phased execution scheme enforces the absence of instruction and data memory interferences, which would impede a precise WCET estimation. The loading of a function into the D–ISP is timing-predictable, because the size of the function is known a priori. By using the D–ISP for instruction fetches during function execution, a constant memory access time for all fetches is achieved. Furthermore, because the content management works on functions, the complexity of a content analysis that determines the worst-case behavior for function loads and evictions is alleviated. Beside the predictability aspects of the D–ISP, we discuss its average-case execution performance. We will show that the cost in average performance is worth the gain in predictability that is inherent in the D–ISP usage. To classify the average performance of the D–ISP, we compare the fetch cost of several benchmarks for the D–ISP with commonly used memories in embedded systems: an instruction cache and a static instruction scratchpad. We implemented the D–ISP in a cycle-accurate SystemC simulator for architectural evaluation and in VHDL to estimate the hardware cost. The host processor for the D–ISP is the TriCore instruction set compatible CarCore [16] processor. The CarCore is an SMT processor with a predictable timing, designed to execute HRT threads. We also implemented the D–ISP as first-level instruction memory for HRT threads in the MERASA multi-core processor [17].
The D–ISP needs to be informed by the host pipeline if a call or return is executed. Thus, we had to apply minor changes to the pipeline, which are limited to the signaling of call or return processing and the routing of the call target address to the D–ISP controller. All memory types were evaluated in the cycle-accurate CarCore SystemC simulator. The instruction cache is direct-mapped and has a cache line size of 32 bytes, containing 4 fetch blocks each. The static instruction scratchpad (S–ISP) contains multiple functions of the benchmark code. For the selection of the functions that are put into the S–ISP we used a Knapsack optimization to fill the given scratchpad size as well as possible, using the maximum dynamic instruction count of a function as the selection metric. Functions that are not contained in the S–ISP are fetched directly from the off-chip memory. The memory access time for the S–ISP is one cycle. For fetches that go to the off-chip memory level, 4 cycles are needed. A cache hit also takes one cycle. On a cache miss, four 64 bit off-chip memory accesses are needed plus one extra cycle to detect the miss. For the D–ISP, the hit detection on a call or return costs 4 cycles for the table lookup and the context register write. This delay is hidden by the call or return processing of the pipeline. A D–ISP miss takes as many cycles as are needed to load the complete function from the off-chip memory, plus 5 cycles for internal processing like table lookup, context register write, and write latencies. Fetches are handled by the D–ISP within one cycle. For the performance evaluation we measured the fetch cost of applications from three different benchmark suites: Mälardalen WCET benchmarks [18], MiBench [19], and EEMBC AutoBench [20]. We selected 6 benchmarks with different function length and call characteristics. The fetch cost measured in the evaluation is the number of cycles for all fetches requested by the processor during the whole benchmark run. We normalized these numbers to the configuration where the complete code is located in the S–ISP, which represents the minimal fetch cost. A normalized fetch cost of 4 is reached if all instructions are fetched from the off-chip memory. To compare the three on-chip instruction memories, we varied their sizes from 128 bytes to the overall size of the actual benchmark in steps of 128 bytes. The results are shown in Figure 2.
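The latencies above can be summarized in a small per-event cost model. The following C++ sketch is our reading of the stated numbers, not the simulator code; in particular, it assumes that a D–ISP function load streams the function in 64 bit fetch blocks, each at the 4-cycle off-chip latency.

```cpp
// Back-of-the-envelope fetch-cost model (cycles per event), derived from the
// latencies stated in the text; the streaming assumption is ours.
#include <cstdint>

constexpr uint32_t kOffChipAccess = 4;  // one 64 bit off-chip memory access
constexpr uint32_t kFetchBlock    = 8;  // fetch block size: 64 bit = 8 bytes

constexpr uint32_t sisp_fetch()    { return 1; }                      // S-ISP access
constexpr uint32_t offchip_fetch() { return kOffChipAccess; }         // no on-chip copy
constexpr uint32_t cache_hit()     { return 1; }
constexpr uint32_t cache_miss()    { return 4 * kOffChipAccess + 1; } // 32 B line + detect
constexpr uint32_t disp_fetch()    { return 1; }  // during execution: always local

// D-ISP miss on a call/return: load the whole function, plus 5 cycles of
// internal processing (table lookup, context register write, write latencies).
constexpr uint32_t disp_miss(uint32_t func_bytes) {
    return ((func_bytes + kFetchBlock - 1) / kFetchBlock) * kOffChipAccess + 5;
}
// The 4-cycle hit detection on a call/return is hidden by the pipeline's own
// call/return processing and therefore adds no cost here.
```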
Fig. 2. Normalized fetch cost for S–ISP, cache, and D–ISP with different memory sizes: (a) ADPCM (Mälardalen), (b) Compress (Mälardalen), (c) EDN (Mälardalen), (d) Dijkstra (MiBench), (e) Rspeed (EEMBC), (f) Ttsprk (EEMBC)
Figures 2(a) to (f) show that, if the memory size is as large as the benchmark code, the S–ISP always performs slightly better than the cache and the D–ISP. This is caused by the fact that the content of the S–ISP is set up before the benchmark execution. The fetch cost of the S–ISP decreases with larger scratchpad sizes, because more functions can be assigned to the scratchpad. But there are exceptions from this behavior in Compress (b) and Rspeed (e). These are caused by the metric used by the static assignment algorithm, which takes the longest possible path of each function and the maximum number of function invocations into account, independent of the real execution behavior. The cache reduces the fetch cost to nearly the optimum even with a memory size lower than one third of the code size. In some configurations the cache shows a thrashing behavior, like Compress (b) at 384 bytes. This is caused by the direct-mapped cache organization. Also, the cache miss rate for the EEMBC benchmarks (e) and (f) is high. This behavior is due to the low spatial locality of the EEMBC benchmarks caused by code replication. In Figures 2(a) to (f) the vertical lines denote where the memory size is at least as large as the largest function in the benchmark code. This mark is important for the D–ISP, because it loads only complete functions into the scratchpad. Thus, for any measurement points on the left-hand side of the vertical line, the D–ISP has to ignore functions larger than its size. This is done automatically by the D–ISP controller. Furthermore, the D–ISP cannot enforce the two-phased execution scheme for the unmaintained functions. Hence, the timing analysis of these functions has to take memory interferences into account. For configurations with a scratchpad size larger than (or equal to) the largest function, every fetch request is directed to and handled by the D–ISP. Then no memory interferences can occur and the timing analysis can treat instruction and data memories independently. This assumption cannot be made for any cache or S–ISP configuration, except if the whole code is located in the S–ISP. For benchmarks (a), (b), and (f) some outliers appear for the D–ISP if the size of the scratchpad is only slightly larger than the largest function. This is caused by the fact that the largest function evicts nearly all other functions maintained in the scratchpad when it is loaded. So if the largest function is very active in calling or getting called, the fetch cost for the D–ISP increases, caused by frequently evicting and reloading functions. For Ttsprk (f) this behavior results in a fetch cost of 27.9, which is almost 7 times worse than using the off-chip memory instead of the D–ISP.
In Ttsprk the call hierarchy is flat and every function is called directly by the large main function. For scratchpad sizes that are not close to the size of the largest function, the D–ISP performs better than the S–ISP. This is explained by the dynamic content management of the D–ISP: the D–ISP maintains the functions that are used in the current phase of the application, whereas the S–ISP holds the same functions for the whole application execution. Comparing the D–ISP to the instruction cache, the cache mostly achieves a lower fetch cost. This is caused by the finer granularity of the cache lines, which hides the structure of the application. Thus, in many configurations the D–ISP cannot compete with the instruction cache, since the cache handles misses at a much finer granularity: on a miss only one line has to be loaded into the cache, in contrast to the D–ISP, which loads the complete function. In general, the average performance of the D–ISP is between the S–ISP and the cache performance. This is expected, since it uses a dynamic memory management, but not as fine-grained as a cache does. By the two-phased execution the D–ISP provides a predictable function execution behavior. From a timing analysis point of view, the D–ISP eases the analysis by allowing independent pipeline and memory analyses without decreasing the accuracy of the estimates. Furthermore, the WCET analysis of a function-based instruction memory is easier than for an instruction cache, because the content changes only on call/return instructions. These benefits of the D–ISP outweigh the moderate average-case performance, which is in general better than an S–ISP and in the same order of magnitude as the direct-mapped instruction cache.
5 D–ISP Hardware Effort Estimation
To estimate the hardware effort of the D–ISP we used an Altera Stratix II EP2S180F1020C3 FPGA. Table 1 shows the utilization of the D–ISP controller in ALUTs and registers, without the memories needed to store and maintain the content of the D–ISP; the maximum possible frequency of the controller is also provided.

Table 1. Utilization of the D–ISP controller on a Stratix II EP2S180F1020C3 FPGA

  Functions   ALUTs   Registers   Max. frequency
      8         986      653       100.44 MHz
     16        1159      658       101.41 MHz
     32        1505      665       102.60 MHz
     64        2185      672        85.34 MHz
    128        3512      677        72.90 MHz
    256        6218      682        50.97 MHz

The number of functions in the table defines how many functions can be checked on a lookup by the content management within one clock cycle. It is determined by the port width of the lookup table memory. This number does not represent the number of functions that can be handled by the D–ISP concurrently, which depends on the sizes of the mapping and lookup table memories.
As Table 1 shows, the number of comparisons for the function lookup is critical for the usage of ALUTs and the maximum possible frequency. This relation is expected, since the parallel comparisons are very costly in hardware. Therefore, we propose to compare at most 32 functions within one cycle. To support more than 32 functions, the lookup can be split into multiple cycles: e.g., if the D–ISP should handle 128 functions, the function lookup takes 4 cycles at maximum. This is an acceptable delay for the amount of hardware that is saved. Another aspect of the D–ISP hardware effort is the memory that is used. The overall memory is fragmented into the memory used to store the functions (size_func), the function mapping table (size_map), the lookup table (size_lookup), and the function stack (size_stack):

size_DISP   = size_func + size_stack + size_map + size_lookup
size_map    = (width_naddr + width_saddr + length_f) · n_of1
size_lookup = width_naddr · n_of1
size_stack  = width_naddr · n_of2

Three of these memories depend on the width of the function addresses stored in the tables (width_naddr, width_saddr), on the number of bytes needed to encode the function length (length_f), and on additional parameters (n_of1 and n_of2). The mapping table stores two addresses for each function: one in the native (width_naddr) and one in the scratchpad address space (width_saddr). Also the size in bytes of the mapped function (length_f) is stored in the table. When allowing a maximum scratchpad size of 512 kB, the function length (length_f) can be encoded in 2 bytes for a fetch block size of 64 bit. Then 16 bit can be used for the scratchpad addresses (width_saddr). For the native addresses (width_naddr), the 32 bit addresses can be reduced to 24 bit by taking the 5 bit segment address of the CarCore into account and aligning all functions to 64 bit addresses. So in sum 7 bytes are necessary per mapping table entry. The lookup table and the stack memory store only the address of functions in the native address space, thus each entry holds a 24 bit address and is 3 bytes long. The parameter n_of1 defines the number of functions that the D–ISP is able to maintain. Notice that this number need not correspond to the number of functions that can be looked up by the D–ISP controller within one cycle, which is reflected by the logic amounts in Table 1. The maximum allowed stack depth is defined by n_of2. Both parameters n_of1 and n_of2 have to be powers of two.
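As a concrete instance of these formulas, the following sketch computes the management overhead for the parameter values given above (24 bit native addresses, 16 bit scratchpad addresses, a 2 byte length field) and an assumed configuration of n_of1 = 32 and n_of2 = 16; the chosen configuration is one possibility, not a prescribed one.

```cpp
// Worked example of the D-ISP size formulas above.
#include <cstdio>

int main() {
    const unsigned width_naddr = 3;  // native address: 24 bit = 3 bytes
    const unsigned width_saddr = 2;  // scratchpad address: 16 bit = 2 bytes
    const unsigned length_f    = 2;  // function length field: 2 bytes
    const unsigned n_of1 = 32;       // functions the D-ISP can maintain (assumed)
    const unsigned n_of2 = 16;       // maximum call stack depth (assumed)

    const unsigned size_map    = (width_naddr + width_saddr + length_f) * n_of1;
    const unsigned size_lookup = width_naddr * n_of1;
    const unsigned size_stack  = width_naddr * n_of2;

    // Prints: map 224 B, lookup 96 B, stack 48 B -> 368 B overhead (+ size_func)
    std::printf("map %u B, lookup %u B, stack %u B -> %u B overhead\n",
                size_map, size_lookup, size_stack,
                size_map + size_lookup + size_stack);
    return 0;
}
```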
Fig. 3. Memory usage of D–ISP, direct-mapped cache, and static scratchpad: required memory (in bytes, including tag and management structures) over usable memory (in bytes), with D–ISP curves for 2 to 128 functions
To evaluate the memory amount used by the D–ISP, we compare it with a direct-mapped cache with a cache line size of 32 bytes. The cache uses 27 bit addresses, because the CarCore's 5 bit segment address is taken into account. For the D–ISP we used a stack size of 16 functions (n_of2), and the number of supported functions (n_of1) varies from 2 to 128. Figure 3 compares the memory size used by the D–ISP with the cache and a static scratchpad. It shows the overall required memory (including the cache tags and the additional memory structures of the D–ISP controller) in relation to the memory that is used to buffer instructions. As depicted in Figure 3, the D–ISP has a high memory overhead for small scratchpad sizes compared to the cache. This is caused by the fact that the tag memory size of a cache depends on the cache size (assuming a constant cache line size), whereas the additional memory used by the D–ISP is independent of the scratchpad size. But if the scratchpad size is increased above 2 kB, the memory amount used by the D–ISP and the cache is in the same order of magnitude, even for a rather large number of functions.
6 Conclusions
In this paper we proposed the Dynamic Instruction Scratchpad (D–ISP) as an alternative instruction memory for real-time systems. The D–ISP design is focused on predictability: it prevents memory interferences of data and instruction accesses on a shared memory level by its two-phased execution behavior and features a function-based content management. The D–ISP is able to outperform a statically managed scratchpad, but it cannot compete with a cache and its fine-grained content management. In contrast to a cache, however, the D–ISP allows a tight and separated timing analysis, owing to the absence of memory interferences. For hard real-time systems, this influence on the WCET analysis outweighs the moderate average-case performance. The hardware effort needed to implement the D–ISP depends strongly on the number of functions that can be looked up in parallel. But by splitting the function lookup into multiple cycles it is possible to support an arbitrary number of functions without an excessive amount of logic. We also found that the memory overhead of the D–ISP content management tables is larger than the tag memory of caches. But for larger scratchpad sizes, the memory overhead of the D–ISP is in the same order of magnitude as the tag memory of a cache.
For our future work we plan to model the timing of the D–ISP and calculate WCET estimates. Then the impact of the D–ISP on the WCET can be compared with caches and other scratchpad memories that suffer from memory interferences.
References

1. Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P., Staschulat, J., Stenström, P.: The Worst-Case Execution-Time Problem - Overview of Methods and Survey of Tools. ACM Trans. Embed. Comput. Syst. 7(3), 1-53 (2008)
2. Reineke, J., Grund, D., Berg, C., Wilhelm, R.: Timing Predictability of Cache Replacement Policies. Real-Time Systems 37(2), 99-122 (2007)
3. Banakar, R., Steinke, S., Lee, B.S., Balakrishnan, M., Marwedel, P.: Scratchpad memory: design alternative for cache on-chip memory in embedded systems. In: Proc. of the 10th Int. Symp. on Hardware/Software Codesign, pp. 73-78. ACM, New York (2002)
4. Wehmeyer, L., Marwedel, P.: Influence of Onchip Scratchpad Memories on WCET Prediction. In: Proc. of the 4th Int. Workshop on WCET Analysis (2004)
5. Falk, H., Kleinsorge, J.: Optimal static WCET-aware scratchpad allocation of program code. In: Proc. of the 46th Design Automation Conf., pp. 732-737 (2009)
6. Udayakumaran, S., Dominguez, A., Barua, R.: Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Trans. Embed. Comput. Syst. 5(2), 472-511 (2006)
7. Ravindran, R.A., Nagarkar, P.D., Dasika, G.S., Marsman, E.D., Senger, R.M., Mahlke, S.A., Brown, R.B.: Compiler managed dynamic instruction placement in a low-power code cache. In: Proc. of the Int. Symp. on Code Generation and Optimization, pp. 179-190. IEEE, Los Alamitos (2005)
8. Egger, B., Kim, C., Jang, C., Nam, Y., Lee, J., Min, S.L.: A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In: Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (2006)
9. Janapsatya, A., Ignjatović, A., Parameswaran, S.: A novel instruction scratchpad memory optimization method based on concomitance metric. In: Asia and South Pacific Conf. on Design Automation, pp. 612-617. IEEE, Los Alamitos (2006)
10. Puaut, I., Pais, C.: Scratchpad Memories vs Locked Caches in Hard Real-Time Systems: a Quantitative Comparison. In: Proc. of the Conf. on Design, Automation and Test in Europe, pp. 1484-1489 (2007)
11. Plazar, S., Lokuciejewski, P., Marwedel, P.: WCET-driven Cache-aware Memory Content Selection. In: Proc. of the 13th IEEE Int. Symp. on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 107-114 (2010)
12. Schoeberl, M.: A Time Predictable Instruction Cache for a Java Processor. In: Workshop on Java Technologies for Real-Time and Embedded Systems, pp. 371-382 (2004)
13. Kirner, R., Schoeberl, M.: Modeling the Function Cache for Worst-Case Execution Time Analysis. In: Proc. of the 44th Design Automation Conf., pp. 471-476 (2007)
14. Preußer, T., Zabel, M., Spallek, R.: Bump-pointer method caching for embedded Java processors. In: Proc. of the 5th Int. Workshop on Java Technologies for Real-Time and Embedded Systems, p. 210. ACM, New York (2007)
15. Metzlaff, S., Uhrig, S., Mische, J., Ungerer, T.: Predictable dynamic instruction scratchpad for simultaneous multithreaded processors. In: Proc. of the 9th Workshop on Memory Performance, pp. 38-45. ACM, New York (2008)
16. Mische, J., Guliashvili, I., Uhrig, S., Ungerer, T.: How to Enhance a Superscalar Processor to Provide Hard Real-Time Capable In-Order SMT. In: Müller-Schloer, C., Karl, W., Yehia, S. (eds.) ARCS 2010. LNCS, vol. 5974, pp. 2-14. Springer, Heidelberg (2010)
17. Ungerer, T., Cazorla, F., Sainrat, P., Bernat, G., Petrov, Z., Rochange, C., Quinones, E., Gerdes, M., Paolieri, M., Wolf, J., Casse, H., Uhrig, S., Guliashvili, I., Houston, M., Kluge, F., Metzlaff, S., Mische, J.: MERASA: Multicore execution of hard real-time applications supporting analyzability. IEEE Micro 30, 66-75 (2010)
18. Mälardalen Real-Time Research Center (MRTC): WCET Benchmark Suite, http://www.mrtc.mdh.se/projects/wcet/benchmarks.html
19. Guthaus, M., Ringenberg, J., Ernst, D., Austin, T., Mudge, T., Brown, R.: MiBench: A free, commercially representative embedded benchmark suite. In: 2001 IEEE Int. Workshop on Workload Characterization, pp. 3-14 (2001)
20. Embedded Microprocessor Benchmark Consortium: AutoBench 1.1 Software Benchmark Data Book, http://www.eembc.org/techlit/datasheets/autobench_db.pdf