WCET-aware Static Locking of Instruction Caches∗

Sascha Plazar
Computer Science 12, TU Dortmund University, D-44221 Dortmund, Germany
[email protected]

Jan C. Kleinsorge
Computer Science 12, TU Dortmund University, D-44221 Dortmund, Germany
[email protected]

Peter Marwedel
Computer Science 12, TU Dortmund University, D-44221 Dortmund, Germany
[email protected]

Heiko Falk
Institute of Embedded Systems/Real-Time Systems, Ulm University, D-89081 Ulm, Germany
[email protected]

ABSTRACT
In the past decades, embedded system designers moved from simple, predictable system designs towards complex systems equipped with caches. This step was necessary in order to bridge the increasingly growing gap between processor and memory system performance. Static analysis techniques had to be developed to allow the estimation of the cache behavior and an upper bound of the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems. In this paper, we propose a WCET-aware optimization technique for static I-cache locking which improves a program’s performance and predictability. To select the memory blocks to lock into the cache and avoid time consuming repetitive WCET analyses, we developed a new algorithm employing integer-linear programming (ILP). The ILP models the worst-case execution path (WCEP) of a program and takes the influence of locked cache contents into account. By modeling the effect of locked memory blocks on the runtime of basic blocks, the overall WCET of a program can be minimized. We show that our optimization is able to reduce the estimated WCET (abbr. WCETest) of real-life benchmarks by up to 40.8%. At the same time, our proposed approach is able to outperform a regular cache by up to 23.8% in terms of WCETest.

∗Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project A3. The research leading to these results was also partially funded by the European IST FP7 NoE ArtistDesign. And work on this publication has partially been supported by Deutsche Forschungsgemeinschaft (DFG) under grant FA 1017/1-1.
1. INTRODUCTION
Caches have become popular in the domain of embedded systems to bridge the growing gap between high processor and low memory performance. They are developed to work transparently from the programmer’s point of view by integrating a fully autonomous hardware controller. Recently used memory blocks are kept as copies in fast cache memories since they are likely to be accessed in the near future. If a block can be fetched from the cache, a so-called cache hit occurs. Otherwise, a cache miss occurs and the requested content has to be fetched from the slow main memory, resulting in penalty cycles due to pipeline stalls. Although caches effectively improve the average-case performance, they are a source of predictability problems due to their dynamic behavior. Static analysis techniques have been developed to compute conservative bounds on the cache behavior and its influence on the runtime of a program. The worst-case execution time (WCET) of a program is the upper bound of the execution time for all possible input data and all possible initial system states. The WCET is a key parameter for real-time scheduling and the development of hardware platforms which have to satisfy critical timing constraints. Since the real WCET of a system cannot be determined, static timing analyzers are employed to determine WCET estimations (WCETest).
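The hit/miss behavior described above, and the reason it depends on the access history, can be illustrated with a minimal sketch of a direct-mapped cache. The cache geometry, the cycle counts, and all names (cache_t, cache_access, run_trace) are assumptions chosen for illustration only; they do not describe the hardware evaluated in this paper.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LINES   4   /* hypothetical cache with 4 lines */
#define LINE_SIZE   16  /* hypothetical line size in bytes */
#define HIT_CYCLES  1   /* assumed cost of a cache hit */
#define MISS_CYCLES 6   /* assumed cost of a miss, including the penalty */

typedef struct {
    int valid[NUM_LINES];
    unsigned long tag[NUM_LINES];
} cache_t;

/* Invalidate every line so that the first access to each line misses. */
void cache_reset(cache_t *c) {
    for (size_t i = 0; i < NUM_LINES; i++)
        c->valid[i] = 0;
}

/* Access one address: returns 1 on a hit, 0 on a miss (the block is then loaded). */
int cache_access(cache_t *c, unsigned long addr) {
    unsigned long block = addr / LINE_SIZE;
    size_t index = (size_t)(block % NUM_LINES);
    unsigned long tag = block / NUM_LINES;
    if (c->valid[index] && c->tag[index] == tag)
        return 1;
    c->valid[index] = 1;   /* evict whatever occupied this line */
    c->tag[index] = tag;
    return 0;
}

/* Total cycles of a fetch sequence, starting from an empty cache. */
unsigned long run_trace(const unsigned long *addrs, size_t n) {
    cache_t c;
    unsigned long cycles = 0;
    assert(addrs != NULL || n == 0);
    cache_reset(&c);
    for (size_t i = 0; i < n; i++)
        cycles += cache_access(&c, addrs[i]) ? HIT_CYCLES : MISS_CYCLES;
    return cycles;
}
```

With these assumed parameters, four fetches from the same block cost 6 + 1 + 1 + 1 = 9 cycles, whereas alternating between two blocks that map to the same line (for example addresses 0 and 64) costs 4 × 6 = 24 cycles. The cost of a fetch thus depends on earlier accesses, which is the dynamic behavior a static analysis has to bound.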
The WCET of a program corresponds to the length of the worst-case execution path (WCEP), which is the path through the control flow graph (CFG) with the highest execution time. Optimization of elements like functions on the WCEP can shorten this longest path in such a way that another path becomes the new WCEP. Optimization of elements not lying on the WCEP will not result in a reduction of the WCET. Hence, possible switches of the WCEP have to be taken into account during optimizations.

In this paper, we present a novel WCET-aware optimization for static locking of instruction caches. The optimization aims at reducing the WCET of a program by statically locking parts of a program into the cache. Statically means that the cache content is loaded and locked in advance and does not change during the program’s execution. An integer-linear programming (ILP) approach is employed to select the memory blocks to lock. The ILP explicitly models the CFG of a program in order to cope with switching WCEPs. The impact of locked cache contents on the execution time of the contained basic blocks is modeled as well, which avoids repetitive WCET analyses. To enable such a WCET-centric optimization, the sophisticated static WCET analyzer aiT developed by AbsInt [1] is employed.

The main contributions of this paper are as follows:
• Cache content is statically locked based on its impact on the WCETest of a program.
• The presented ILP-based approach determines a set of memory blocks to lock into the cache. Its objective is the reduction of the WCETest by explicitly modeling the impact of locked cache contents on the WCEP of a program.
• We show that WCETest reductions of up to 40.8% can be achieved compared to a system without cache for a set of real-life programs.
• The practical relevance of our new optimization is attested by outperforming a regular cache by up to 23.8% w.r.t. WCETest reductions.

This paper is organized as follows: In the next section, an overview of related work considering memory and cache-based optimizations is provided. Section 3 presents our new WCET-aware algorithm for static I-cache locking. Section 4 introduces the WCET-aware C compiler WCC employed to develop our novel algorithm. An evaluation in Section 5 compares the performance of a system with a regular cache against our novel WCET-aware static I-cache locking technique. Finally, Section 6 concludes our work and presents ideas for future work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CGO ’12, March 31-April 4, San Jose, US
Copyright © 2012 ACM 978-1-4503-1206-6/12/03... $10.00
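The WCEP switching discussed in the introduction can be made concrete with a small sketch that computes the longest path through an acyclic CFG. This is not the ILP formulation of Section 3; it is a plain longest-path computation under simplifying assumptions: the CFG has no loops (loop bounds are assumed to be folded into the block costs), blocks are stored in topological order, and the type block_t, the function wcet, and all cycle counts are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUCC   2
#define MAX_BLOCKS 16

typedef struct {
    unsigned cost;        /* assumed worst-case cycles of this basic block */
    int succ[MAX_SUCC];   /* indices of successor blocks, -1 if unused */
} block_t;

/* Length of the longest path starting at block 0 of an acyclic CFG whose
 * blocks are stored in topological order. If on_wcep is not NULL,
 * on_wcep[i] is set to 1 for every block on one longest path, else 0. */
unsigned wcet(const block_t *cfg, size_t n, int *on_wcep) {
    unsigned w[MAX_BLOCKS];   /* w[k]: longest path length from block k */
    int next[MAX_BLOCKS];     /* successor of k on that longest path */
    assert(n > 0 && n <= MAX_BLOCKS);
    for (size_t k = n; k-- > 0; ) {       /* visit blocks in reverse order */
        unsigned best = 0;
        next[k] = -1;
        for (int s = 0; s < MAX_SUCC; s++) {
            int t = cfg[k].succ[s];
            if (t < 0)
                continue;
            assert((size_t)t > k && (size_t)t < n);  /* topological order */
            if (next[k] < 0 || w[t] > best) {
                best = w[t];
                next[k] = t;
            }
        }
        w[k] = cfg[k].cost + best;
    }
    if (on_wcep) {
        for (size_t i = 0; i < n; i++)
            on_wcep[i] = 0;
        for (int b = 0; b >= 0; b = next[b])
            on_wcep[b] = 1;
    }
    return w[0];
}
```

As a usage example, consider an if-then-else with assumed costs of 10 cycles for the entry block, 50 for the then-branch, 40 for the else-branch, and 5 for the exit block. The WCET is 10 + 50 + 5 = 65 and the WCEP runs through the then-branch. Speeding up the else-branch, which is not on the WCEP, leaves the WCET at 65. Reducing the then-branch by 30 cycles to 20 lowers the WCET only to 10 + 40 + 5 = 55, because the else-branch becomes the new WCEP. An optimization that evaluates gains only along the initial WCEP would overestimate the benefit by 20 cycles in this example.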
2. RELATED WORK
Falk et al. counteract possible predictability problems of caches with a static allocation of program code to so-called scratchpad memories (SPM) [5]. They employ integer-linear programming to select the optimal content of the SPM w.r.t. the program’s WCETest. The disadvantage of moving parts of the code to SPMs is the necessary correction of the control flow: far jumps have to be inserted to branch between different memories, leading to an unavoidable code and runtime overhead. In contrast, the cache locking based optimization presented in this work exploits the transparent behavior of a cache and performs a lockdown of cache content inside the startup code of a program.

Another work considering scratchpad allocation is presented in [18]. Suhendra et al. developed an ILP-based allocation of frequently accessed data objects to faster memories in order to decrease the overall WCETest. Their model of the program’s WCET and possible execution paths serves as the basis for the ILP-based algorithm presented in [5] and was also employed for the technique discussed in Section 3. Since only the intra-function control flow is modeled, a time-consuming branch-and-bound approach or a sub-optimal heuristic is employed to optimize along the WCEP. Furthermore, moving data objects to SPM is much easier than locking instruction blocks into the I-cache. Data elements can be moved around almost arbitrarily to fill the SPM without gaps. Thus, elements never compete for the same memory/cache lines, as happens during cache locking.

In [8], Gebhard et al. present a technique for rearranging the positions of tasks to improve the cache performance. The interdependency relation of tasks is evaluated in order to determine a memory layout which maximizes the number of persistent cache sets for each task.

Papers [16] and [3] present techniques for statically locked instruction caches which are very close to the work presented in this paper. In [16], Puaut et al. present two algorithms which try to minimize the CPU utilization and the interferences between different tasks, respectively. Although they consider the WCET as metric, they are not able to react to switching WCEPs since they always optimize along an initially determined WCEP. Compared to [16], [3] presents an additional genetic algorithm which has the disadvantage that a time-consuming WCET analysis has to be performed for each created individual.

Puaut et al. also present techniques for I-cache locking [15] which consider changing WCEPs. However, how WCEPs are recomputed is not detailed. The authors use a parameter N trading off accuracy of WCEP recomputation against runtime consumption.
Since runtimes for WCEP recomputation are still very high, the authors are unable to provide results for some of their benchmarks. In contrast, the techniques presented in this work scale much better so that results for very large benchmarks can be gathered.

Static I-cache locking is also proposed by Falk et al. in [7]. An Execution Flow Graph (EFG) is employed to model possible execution paths on function level. The WCEP is determined by applying a modified Dijkstra algorithm before the most promising function on the WCEP is locked into the cache. To consider paths which are not the initial WCEP, two analyses are required for each alternative path in order to compute the gain of the functions on such a path. As opposed to this, the approach presented in this paper only requires two WCET analyses in total.

An extension of [7] was developed by Liu et al. in [12]. There, an Execution Flow Tree is presented which is traversed to generate a simple ILP which selects the functions to lock into the cache. In contrast to [7] and [12], the approach proposed in this paper is able to model the intra-function WCEP including loops at basic block level. The influence of the memory layout on lockable memory blocks, and thereby on the runtime of basic blocks, is also taken into account; [7] and [12] ignore the fact that selected func-
void foo1 ( int x ) { if (x