Hybrid Instruction Cache Partitioning for Preemptive Real-Time Systems*

José Vicente Busquets-Mataix
Departamento de Ingeniería de Sistemas, Computadores y Automática
Universidad Politécnica de Valencia, P.O. Box 22012, 46071-Valencia (Spain)
e-mail: [email protected]

Andy Wellings
Real-Time Systems Research Group, Department of Computer Science
University of York, York, UK
e-mail: [email protected]

* This work was supported in part by the Spanish Ministry of Education and Science. It was done while José V. Busquets was on study leave with the Real-Time Systems Research Group at the University of York.

Abstract

Cache memories have historically been avoided in real-time systems because of their unpredictable behavior. In addition to the research focused on obtaining tighter bounds on the worst case execution time of cached programs (typically assuming no preemption), some techniques have been presented to deal with the cache interference due to preemptions (extrinsic, or inter-task, cache interference). These techniques either account for the extrinsic interference in the schedulability analysis, or annul it by partitioning the cache. This paper describes a new technique, hybrid partitioning, which is a mixture of the former two: it either provides a task with a private partition or accounts for the extrinsic interference that may arise. The hybrid technique outperforms the original two for any workload or hardware configuration, and is additionally less influenced by those factors. In conclusion, this technique represents a powerful yet general framework for dealing with extrinsic interference.

1 Introduction

Real-time systems are commonly considered very specific systems used in specific applications and implemented with ad-hoc hardware and software. Nowadays, the applications of real-time theory are not limited to life-critical systems (hard real-time systems), where the cost factor is of secondary importance. Emerging areas like car computers, multimedia computing, gesture recognition, voice interaction, CD-I and the like [20] demand real-time capabilities but at a relatively low cost, as all consumer products impose this requirement. Yet, some of these applications involve intensive computing tasks, and so require a cheap but powerful hardware platform to be competitive in the marketplace. Microprocessor manufacturers are adopting solutions based on statistical assumptions about the workload to improve the cost-performance ratio. Hennessy and Patterson write in [7]: "Making the frequent case fast is the inspiration for almost all inventions aimed at improving performance". A good example of this trend is the cache: running typical programs, the majority of memory references are cache hits, which is almost equivalent (from a performance point of view) to implementing the whole main memory with fast and expensive chips. This trend, jointly with large market volumes, makes contemporary processors achieve a very good cost-performance ratio. Moreover, since these processors are aimed at a large audience, they are widely tested, which increases their reliability. In conclusion, contemporary processors are good candidates for implementing cost-effective real-time applications because they are cheap, powerful and reliable. However, some techniques have to be applied to make these processors more predictable. With caches, a good first step towards increased processor utilization is to calculate the worst case execution time (WCET) of tasks taking into account the cache speed-up. Some tools can be used to estimate the WCET of cached programs [1][13][15]. These tools typically assume continuity of code execution, thus no preemption is allowed. This limitation restricts the application of such tools to non-preemptive scheduling policies like cyclic
executives or cooperative scheduling. However, two actions can be undertaken to make caches suitable for preemptive real-time systems: first, making the cache behavior more predictable (for example using partitioning, locking pieces of code, or simply using easy-to-model cache configurations); second, modelling the cache behavior to incorporate the cache effects in the schedulability analysis. Using these techniques we can take advantage of the huge research effort committed to date to preemptive systems. Without them, the inevitable use of caches (due to the exponentially widening memory-processor speed gap) may represent the end of the line for low cost preemptive real-time systems.

Since they were first introduced in 1969, cache memories have been successfully used in general purpose computing to bridge the gap between memory and processor speeds. However, since caches are based on statistical assumptions about program behavior, they have opened a gap between the worst and the average case behavior. This gap is a measure of the unpredictability introduced by the use of caches, and this unpredictability has historically precluded the use of caches in real-time systems, since such systems are based on the worst case execution time of tasks. The unpredictability of cache memories stems from two causes. Firstly, the multitasking nature of many real-time kernels gives rise to extrinsic cache interference† among tasks. Secondly, it is difficult to model the intrinsic‡ cache behavior, because it is automatic, dynamic, adaptive, history sensitive and program dependent. Both characteristics make it difficult to bound the unpredictability of the cache as a means of achieving deterministic speed-up.

† Extrinsic interference (inter-task interference) is produced by tasks competing for the cache.
‡ Intrinsic interference (intra-task interference) is produced by colliding lines during program execution, due to limited cache space or limited associativity.

However, the use of caches is becoming inevitable as a means of achieving higher cost-performance ratios; as proof of this, they are invariably incorporated in contemporary processors. Lately, some current architectural trends are putting even higher demands on the instruction memory path. First, the use of pipelining increases the number of instructions requested per clock cycle. Second, RISC processors exhibit lower code densities (larger instructions). Third, the number of bus lines is limited in embedded systems because of cost, design and reliability constraints. In addition, memory is one of the most expensive parts of the computer, and effective use of this expensive element will produce greater budget benefits [19]. Another significant reason, typically not considered, is that the use of on-chip caches increases reliability by means of reduced bus traffic, since off-chip buses are more exposed to external interference.

The objectives of this paper are two-fold:
1. Show the rationale behind the Hybrid technique, the motivation, and how to apply it.
2. Compare the scheme proposed in this paper to the schemes presented in previous works [5][9] on CRTA (Cached Response Time Analysis) and full partitioning.
2 Previous work

The use of caches in real-time systems is an emerging area which is being increasingly pursued. We are only concerned here with the extrinsic (inter-task) behavior of the instruction cache; a good discussion of the intrinsic (intra-task) behavior can be found in [13]. Hereinafter, we concentrate on aspects of the extrinsic behavior while assuming implicitly that the intrinsic behavior has already been taken into account, for example when calculating the WCET of tasks. The effects of extrinsic cache behavior on general multiprogramming systems have been studied in a number of papers; refer to [6] for a review. Basumallick and Nilsen [3] seeded the rationale for incorporating the extrinsic behavior in the schedulability analysis. They introduced CRMA (Cached Rate Monotonic Analysis), in which a constant cache refill§ penalty is added to every task's worst case execution time prior to performing the utilization-based test of Rate Monotonic Analysis (RMA). A comparison of both CRMA and partitioning (described in this section) was given in [6]. It used the same experiment model that we consider in this paper, and presented motivation for the use of caches in preemptive real-time systems. In [5] the CRTA (Cached Response Time Analysis) technique was introduced and compared to both CRMA and cache partitioning (in this case feasibility was tested using RTA instead of RMA). The latter paper also compared the refilling techniques to a system in which the cache is disabled, to get a perspective on the benefits of considering the cache in the analysis. The conclusion was that CRTA outperforms CRMA, but that cache partitioning has to be used for those system configurations in which CRTA performs worse.

§ We use the term "refill" throughout the paper to refer to the time wasted when a preempted task resumes execution to return the cache to the state it had before preemption. A cache refill does not necessarily imply a refill of the whole cache; depending on the model used, it may comprise only a set of lines or a partition. We also apply the term to the techniques that incorporate the extrinsic interference in the schedulability analysis, which we call "refilling techniques".
In the following subsections, we summarise the schemes on which the hybrid technique presented in this paper is grounded: CRTA and partitioning.

2.1 CRTA

CRTA [5] is the latest proposed technique that models the extrinsic cache interference in the schedulability analysis. CRTA incorporates the extrinsic behavior into conventional Response Time Analysis (RTA). RTA calculates the response time of tasks in the worst case scenario (which corresponds to the critical instant at which all tasks are released at the same time). Trivially comparing the worst case response time to the allowed deadline for each task provides an exact and efficient off-line analysis for fixed-priority preemptive scheduling. Since CRTA is derived from RTA, it inherits some characteristics of the latter analysis:
• It is independent of the fixed priority assignment policy used to schedule the tasks.
• The task-set model may include sporadic tasks without any further consideration.
• The deadline of any task can be shorter than its period.
CRTA considers that a refill penalty is paid for each preemption that may arise during the worst case response time window of the task under analysis. The number of preemptions in the worst case is equated to the number of releases of higher priority tasks. The preemption model cannot discern whether the preemptions are suffered by the task under analysis (direct interference) or by another higher priority task (indirect interference); the model only provides the total number of preemptions. We know which task may preempt another, but not which task suffers the preemption. Thus, the refill penalty can only be characterised based on the preempting task: we can either consider a refill of the lines displaced by the preempting task or a refill of the entire cache, depending on the level of detail available in the workload model. For this model of the worst case extrinsic interference, the following recursive equation was derived (based on the RTA equation):

$$w_i^{n+1} = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{w_i^n}{T_j} \right\rceil (C_j + \gamma_j)$$

where $hp(i)$ is the set of tasks with higher priority than task τi, and $w_i^n$ is the calculated response time for task τi at iteration n. $T_i$ is the period (for periodic tasks) or the minimum inter-arrival time (for sporadic tasks). $C_i$ is the worst case execution time of task τi, taking into account the speed-up of the cache and assuming continuous execution (no preemption). The refill penalty is $\gamma_j$: it can be either the time to reload the unique lines touched by the preempting task τj or the time to reload the entire cache; the latter is the worst case penalty to be paid, and it is constant for all tasks. To avoid other sources of interference (for example from lower priority tasks), the blocking time $B_i$ has to be calculated based on the Ceiling Semaphore Protocol (CSP) [21] or another synchronization technique that precludes a task from being blocked during its execution. CSP ensures that if a task is blocked, the blocking occurs at the beginning of its execution; this way, additional context switches due to blocking are avoided. Using the Priority Ceiling Protocol (PCP) [22], the task under consideration may yield the CPU to a lower priority task once the former reaches a resource held by the latter, thus allowing interference from lower priority tasks to occur. Using CSP, only the execution of higher priority tasks can disrupt the continuity of the task's execution. The equation is evaluated recursively until the time window stops growing (i.e. $w_i^{n+1} = w_i^n = R_i$) or the deadline has been violated (i.e. $w_i^{n+1} > D_i$). Since $R_i$ increases monotonically with decreasing priority, an initial value that can be used for $w_i^1$ is $R_{i-1}$, while $C_1$ can be used for $w_1^1$.
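As an illustration, a minimal sketch of this recursion follows (in Python). The task representation, the field names and the simple initial value are assumptions of the sketch, not part of the analysis in [5]:

```python
import math

def crta_response_time(i, tasks):
    """Worst case response time of tasks[i] under CRTA.

    tasks is ordered by decreasing priority; each entry is a dict with
    C (WCET with cache speed-up), B (blocking time), T (period or minimum
    inter-arrival time), D (deadline) and gamma (the refill penalty the
    task inflicts as a preempter).
    """
    C, B, D = tasks[i]["C"], tasks[i]["B"], tasks[i]["D"]
    w = C  # simple initial value; the paper suggests R_{i-1} to speed convergence
    while True:
        w_next = C + B + sum(
            math.ceil(w / tj["T"]) * (tj["C"] + tj["gamma"])
            for tj in tasks[:i])  # hp(i): every higher priority task
        if w_next > D:
            return None           # deadline violated: task i is unschedulable
        if w_next == w:
            return w              # fixed point reached: R_i
        w = w_next
```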
2.2 Cache partitioning

Cache partitioning (abbreviated PART) is aimed at improving cache predictability by annulling the extrinsic interference: each task is provided with a private cache partition. Additionally, a common partition may be used for data sharing and non-critical tasks. The cache partitioning approach may be implemented in software [25][18] or hardware [11][9][12][10]. The hardware scheme requires the cache to be off-chip, plus some additional external circuitry for cache control, and it introduces added latency to the processor cycle. The software solution requires compiler support: the application code is relocated to provide exclusive mappings on the cache for each task. This scheme also introduces delays, in this case due to the insertion of branches to interconnect the relocated pieces of code. The partition size can vary from one task to another, but both techniques impose that the partition size must be a power of two. In the hardware scheme, this requirement is due to the hardware mechanism for address mapping. In the software scheme, it is necessary for performing the pointer transformations to access data structures quickly; however, there is no such restriction in the case of an instruction cache. Both schemes point out that some shared partitions can be used for data
sharing and non-critical tasks. An optimal algorithm for allocating partitions to tasks is presented in [10]. A closely related technique is the cache lock [14], used in some current controllers. This mechanism allows the programmer to lock a piece of code in the cache (usually interrupt handlers), guaranteeing fast and predictable response times for that code. The main conclusion is that it increases schedulability.
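To make the power-of-two requirement concrete, the following hedged sketch shows the kind of index transformation a hardware partitioning scheme performs; the function and its parameters are illustrative assumptions, not the actual circuitry of SMART [9][12]:

```python
def partitioned_cache_set(addr, line_size, partition_lines, partition_no):
    """Map a byte address to a set of a direct mapped cache, confined to
    one partition of partition_lines cache lines (a power of two)."""
    block = addr // line_size                        # cache block number
    index = block % partition_lines                  # offset inside the partition
    return partition_no * partition_lines + index    # set within the whole cache
```

Because partition_lines is a power of two, the modulo reduces to masking the low index bits, which is why the restriction keeps the hardware (or the software pointer transformations) cheap.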
3 Motivation

Previous papers have shown that the partitioning technique suffers from intrinsic interference, due to the small partitions that result from providing each task with a private partition. On the other hand, the refilling schemes suffer from extrinsic interference, since a cache refill has to be accounted for at every context switch and the number of preemptions is characterised pessimistically. As will be shown in subsequent paragraphs, the two methods clearly perform better in opposite situations. The partitioning technique is constrained by the factors related to intrinsic cache behavior, but it is affected most strongly by the factors that settle the partition size (cache size and number of tasks). Thus, this technique is better suited to systems equipped with large caches and comprising a limited number of tasks. The refilling techniques (either CRMA or CRTA) are affected by the extrinsic behavior, which is characterised by two aspects: the number of context switches considered, and the refill penalty to be paid for each. The number of context switches is conditioned by both the system frequency and the number of tasks, while the cache refill penalty depends on the cache size and memory bandwidth. In contrast to the partitioning solutions, refilling techniques are better suited to systems with small caches. Factors such as memory bandwidth and number of tasks have the same effect on both approaches. Others, like system frequency, affect only one (system frequency is not relevant for partitioning). Finally, the cache size is the factor that divides the applicability domain of the two solutions: larger caches are better for partitioning while they are bad for refilling schemes, and vice-versa. But the problem is not to choose one technique among those available. The problem is that both techniques perform better for extreme values than for moderate ones; this is inconvenient, since moderate values are likely to be more abundant. This scenario motivates the development of a hybrid technique capable of performing well independently of the system configuration (especially the cache size): a method capable of performing at its best for conventional systems. The hybrid method takes maximum advantage of the speed-up provided by the cache (by using partitions of adequate size), while minimising the extrinsic interference. The hybrid technique proposed in this paper (hereafter called Hybrid) is derived from the partitioning approach, but the number of partitions is less than the number of tasks. This way, the partitions are larger than they are for (full) partitioning, allowing better intrinsic performance (by better exploiting temporal locality). On the other hand, since partitions are smaller than the entire cache, the refill penalty due to preemption is smaller than it is for the refilling techniques. However, since some tasks share a partition, they exhibit extrinsic interference, so a model of the preemption behavior is needed (i.e. the extrinsic behavior has to be included in the analysis). That is the reason this method is called hybrid: it is exactly the union of the partitioning and refilling schemes. In fact, the Hybrid technique includes the other two: it may become full partitioning (when the number of partitions allowed equals the number of tasks) or CRTA (when only one partition is allowed). It has higher complexity, but its behavior is better than either of the other two, independent of the characteristics of the system. The added complexity results in the following problems (the last two are new problems, inherent to the Hybrid technique):
• The extrinsic behavior has to be incorporated in the schedulability analysis.
• The number of partitions that gives the best results needs to be found.
• The best (optimal) assignment of partitions to tasks needs to be found.
Moreover, the problem of incorporating the extrinsic behavior in the analysis is more complex than it is for the refilling techniques, since not all preemptions that occur imply a partition refill. Each of the cited problems is addressed in subsequent sections.

4 Incorporating the extrinsic behavior in the schedulability analysis

Considering the model for CRTA, the Hybrid model can be derived by taking into account exactly when a collision is produced, since some tasks may have private use of a partition. A first approximation would be to consider only the preemptions produced by tasks that share a partition with the task under analysis. However, this is not accurate, since we also have to account for indirect preemptions. That is, a given task τj with higher priority than the task of interest τi may preempt another task τk with interme-
diate priority (i.e. Pj > Pk > Pi). If tasks τj and τk share the same partition, then task τj will inflict a refill penalty on task τk. Finally, this refill penalty paid by τk will delay the response time of τi, since task τk's computation time (plus any refill penalties that take place) is considered in the analysis as conventional interference to task τi. This scenario can be seen in Figure 1 (generated by the STRESS simulator [2]). Process execution is depicted as a rectangle. Circles represent task release (bottom corner) or task termination (upper corner). Preemption time is a solid line along the timeline of the task. Cache refill penalties are depicted as black rectangles, immediately following a restart of execution after preemption. In the depicted case, we assume that tasks 1 and 2 share the same partition while task 3 has a private one. It might seem that task 1 will not interfere with task 3, but in fact it does: task 2 has to refill the partition after the preemption due to the execution of task 1, and this delay is suffered indirectly by task 3.
Figure 1

In conclusion, every release of a task with higher priority than the task of interest implies a refill penalty delay if there is a task with intermediate priority that uses the same partition (this includes the task under consideration, in which case the interference is direct). This delivers the following recursive equation:

$$w_i^{n+1} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{w_i^n}{T_j} \right\rceil (C_j + \gamma_{i,j})$$

where $\gamma_{i,j}$ is the refill penalty given by the following formula, in which $\gamma_j$ is the time to refill the partition:

$$\gamma_{i,j} = \begin{cases} \gamma_j & \text{if } \exists k : P_j > P_k \ge P_i \wedge R_j = R_k \\ 0 & \text{otherwise} \end{cases}$$

Here $R_j$ is the partition assigned to task τj, and $P_j$ is the priority of task τj (higher values for higher priorities). With respect to the task-set corresponding to the timeline depicted in Figure 1, the analysis of the response time of task 3 considers that every execution of task 1 preempts an instance of task 2. This may seem pessimistic in relation to the phasing depicted, but since no assumptions can be made about the time relation between tasks 1 and 2, the worst case phasing must be considered. The depicted phasing is the worst case in terms of interference due to the computation times of higher priority tasks; however, it is not the worst case in terms of the number of preemptions. The analysis considers a scenario that is the worst case in both terms, even though it may not exist in reality.
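A minimal sketch of this analysis follows, combining the recurrence with the condition on $\gamma_{i,j}$ above. As before, tasks are ordered by decreasing priority (index 0 highest) and the field names are assumptions of the sketch:

```python
import math

def refill_penalty(i, j, tasks):
    """gamma_{i,j}: a release of tasks[j] costs a refill, as seen by tasks[i],
    only if some task k with P_j > P_k >= P_i shares tasks[j]'s partition R."""
    if any(tasks[k]["R"] == tasks[j]["R"] for k in range(j + 1, i + 1)):
        return tasks[j]["gamma"]
    return 0

def hybrid_response_time(i, tasks):
    """Worst case response time of tasks[i] under the Hybrid analysis."""
    C, D = tasks[i]["C"], tasks[i]["D"]
    w = C
    while True:
        w_next = C + sum(
            math.ceil(w / tasks[j]["T"]) *
            (tasks[j]["C"] + refill_penalty(i, j, tasks))
            for j in range(i))    # hp(i): every higher priority task
        if w_next > D:
            return None           # deadline violated
        if w_next == w:
            return w              # fixed point reached: R_i
        w = w_next
```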
5 Assigning partitions to tasks and finding the optimal number of partitions

We first introduce a generic framework for finding the optimal configurations. The second subsection describes a simplified procedure for Rate Monotonic schedulers.

5.1 Generic approach

The first problem to be addressed is to determine the best number of shared partitions. Assuming N tasks and S partitions, if only one partition is shared, then S-1 tasks benefit from a private partition. If two partitions are shared, then only S-2 tasks have a private partition, and the N-(S-2) lower priority tasks are distributed between two partitions. However, this does not help, because a task produces extrinsic cache interference as long as there is another task using the same partition. The tasks with lower priorities than both colliding tasks suffer the interference indirectly, irrespective of whether they are mapped to another partition (either private or shared). In conclusion, the best choice is to provide only one shared partition. All tasks with a private partition should hold higher priorities than tasks without one, to prevent the privileged tasks from suffering indirect interference (according to the rationale of the last paragraph). However, this assignment of partitions to priorities may not necessarily deliver the best results, since other factors like the period may greatly influence the number of preemptions, and therefore the extrinsic cache interference. For example, using Deadline Monotonic priority assignment, a task with a long period but a tight deadline is assigned a high priority; thus, a private partition may be assigned to such a task even though its frequency (and thus its potential to interfere) is low. This trade-off between priority and period poses a problem: the partition assignment is not trivial for those schedulers that are not Rate Monotonic (i.e. where lower priorities are assigned to tasks with longer periods). Therefore, an exhaustive search is proposed to cope with both problems: finding the optimal number of partitions and the optimal assignment of partitions to tasks.
An exhaustive search can be understood as asking each task whether it wants a private partition or not. The size of this problem is the number of permutations of 2 things taken N at a time with repetition, that is, P*(2,N) = 2^N: a task may or may not have a private partition, so there are two choices (two things), a choice has to be made for each task (taken N at a time), and the choice can be repeated (more than one task may have a private partition). The number of partitions has to be a power of two. There are several reasons for this requirement. Firstly, some are inherited from hardware or software partitioning (discussed in subsection 2.2). Secondly, the partitions must be of the same size to make the problem independent of the particular benefit that every task may obtain from the use of the cache. Thus the following requirement holds:

$$S \in \{\, s \mid \exists k : 2^k = s \,\}$$

Therefore the problem size is reduced to:

$$\binom{N}{0} + \binom{N}{1} + \binom{N}{3} + \dots + \binom{N}{S-1} = \sum_{i=0}^{k} \binom{N}{2^i - 1}$$

Each term corresponds to the combinations of N things taken 0, 1, ... at a time, that is, the combinations of having 0, 1, ... private partitions for N tasks. The problem is solved by testing schedulability for each leaf of the binary tree that has a permitted number of partitions. In other words, we count from 0 to 2^N and associate each bit of the counter with a particular task: if the bit is 1, the task has a private partition. Thus, all the assignments corresponding to numbers with 2^i - 1 ones are tested for schedulability (see the sketch below). The search may be stopped once a schedulable task-set is found, or it can be completed, searching for the assignment with the lowest utilization. If the search space is large, techniques like simulated annealing or genetic algorithms can be used; for example, simulated annealing has been used to find appropriate task allocations in distributed hard real-time systems [23]. The cache partitioning problem could be solved in this context.
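The counting scheme just described translates directly into code. In this hedged sketch, `is_schedulable` stands for the Hybrid analysis of Section 4 applied under the given assignment; it and the data representation are assumptions of the sketch:

```python
def search_partition_assignments(n_tasks, is_schedulable):
    """Try every subset of tasks as private-partition holders. A set bit in
    the mask means that task gets a private partition; only subsets of size
    2^i - 1 (so that, counting the shared partition, S is a power of two)
    are tested."""
    best = None
    for mask in range(2 ** n_tasks):
        s = bin(mask).count("1") + 1      # private partitions plus one shared
        if s & (s - 1):                   # keep only powers of two
            continue
        assignment = [bool(mask >> t & 1) for t in range(n_tasks)]
        if is_schedulable(assignment):
            best = assignment             # or stop at the first success
    return best
```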
5.2 Rate Monotonic schedulers

For the particular case in which the task-set complies with the Rate Monotonic requirements, and such a priority assignment is applied, the problem of assigning partitions to tasks is simpler. For such a scheduling policy, lower priorities are assigned to the tasks with longer periods, so there is no doubt about which tasks must be confined in a common partition: the ones with lower priorities coincide with the ones with larger periods. Thus, given S partitions, S-1 are used to provide private partitions to the S-1 highest priority tasks, while only one partition is assigned to the remaining N-(S-1) lower priority tasks. The problem is then reduced to finding the optimal number of partitions. The following describes an analytical way to determine such a number. However, if the hardware and software of the system are already available, it is also possible to determine the optimal number of partitions by brute force, since the size of the problem is limited (the number of partitions must be a power of two): the system can be exercised for each available number of partitions, and the best configuration chosen. Depending on the workload and hardware factors of the system, a different number of partitions may be more appropriate: there is a trade-off between the intrinsic and extrinsic cache behavior when calculating the optimal number of partitions. On one hand, larger partitions (due to a low number of partitions) produce benefits in terms of better intrinsic behavior. On the other hand, smaller partitions (due to a high number of partitions) imply shorter refill penalties and less extrinsic interference, since more tasks may hold a private partition. Fortunately, it is not necessary to have an accurate model of the workload to compare intrinsic and extrinsic behavior for a given number of partitions. We can calculate the speed-down ratios with respect to ideal cache performance for both intrinsic and extrinsic behavior; by multiplying both ratios, we obtain the overall ratio for every number of partitions to be considered. The optimal number of partitions achieves the highest score for the overall ratio. The following explains how to calculate the speed-down ratios. Regarding intrinsic behavior, we can assume that the tool used to calculate the WCET of cached programs achieves a typical hit rate for each of the cache (partition) sizes to be considered. Given the hit rate and some hardware parameters (memory bandwidth, cache miss penalty), we can calculate the slow-down produced by the intrinsic interference, assuming that the utilization is 100% and the code executes without preemption. The following formula calculates the speed-down (subject to scaling) for intrinsic cache behavior:

$$\alpha_{int} = \frac{t_c}{\mu \, t_c + (1-\mu) \, t_m}$$

where $t_c$ and $t_m$ represent the cache and memory read times, and $\mu$ is the cache hit rate delivered by the tool for a particular partition size. Applying this formula, the following table can be obtained (where the speed-down
ratio is calculated by scaling the slow-down with respect to the result obtained using the entire cache):

Partition size   Speed-down ratio
256 bytes        0.82
512 bytes        0.87
1024 bytes       0.90
2048 bytes       0.92
Entire cache     1.0

More accurate ratios can be obtained if the code that comprises the application is already available and can be analysed by the WCET tool for each of the partition sizes to be considered. Regarding extrinsic behavior, the speed-down ratio is the relation between the execution time with and without extrinsic cache interference. This ratio can be obtained by calculating the time wasted performing cache refills over a given period of time, for example the LCM (least common multiple) of the tasks' periods. We need to calculate the number of preemptions that cause a refill, and the refill penalty to be paid for each of them. The refill penalty depends on the partition size and the memory bandwidth. According to the model of preemptions used in CRTA and Hybrid, the number of preemptions suffered by a given task is the number of releases of higher priority tasks. Since we want the total number of preemptions, each release of any task other than the lowest-priority one comprises a preemption. The following formula can be used to calculate the speed-down ratio for extrinsic interference ($\alpha_{ext}$):

$$\alpha_{ext} = 1 - \frac{\sum_{j=1}^{N-1} \frac{T_{LCM}}{T_j}\,\gamma_j}{T_{LCM}}$$

where tasks are ordered by decreasing priority and $T_{LCM}$ is the least common multiple of the tasks' periods. Finally, the following table can be obtained:

Partition size   Speed-down ratio
256 bytes        0.96
512 bytes        0.91
1024 bytes       0.88
2048 bytes       0.81
No preemption    1.0

We obtain the composite ratios by multiplying the ratios in both tables for each partition size. The higher the composite ratio, the better the choice of number of partitions.
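The whole procedure of this section can be summarised in a short sketch. Here `hit_rate` and `gamma` stand for the tool-supplied hit rate and the partition refill time for a given partition size; all names are assumptions of the sketch:

```python
from math import lcm

def alpha_int(size, hit_rate, tc, tm):
    """Intrinsic speed-down ratio for a partition of the given size."""
    mu = hit_rate(size)
    return tc / (mu * tc + (1 - mu) * tm)

def alpha_ext(size, periods, gamma):
    """Extrinsic speed-down ratio over the LCM of the (integer) task periods,
    ordered by decreasing priority; every release of a task other than the
    lowest priority one counts as a preemption."""
    t_lcm = lcm(*periods)
    refills = sum(t_lcm // T * gamma(size) for T in periods[:-1])
    return 1 - refills / t_lcm

def best_partition_size(sizes, periods, hit_rate, gamma, tc, tm):
    """The optimal choice maximises the product of both speed-down ratios."""
    return max(sizes, key=lambda s:
               alpha_int(s, hit_rate, tc, tm) * alpha_ext(s, periods, gamma))
```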
6 The experiment

An experiment is designed to find the maximum schedulable utilization guaranteed by a given schedulability analysis for a variety of hardware and workload factors. To drive the experiments, we need a workload to be executed on a hardware platform. Since our experiments are simulations, both the hardware and workload models are outlined in the next sections; the details of both models are discussed in [6] and [5]. Given an initial utilization, we assign an execution time to each task. From the execution time, the number of memory references is estimated. From the number of references and the cache configuration, we calculate the intrinsic interference delay using the formulae of the miss rate model (refer to [6] or [5]); this delay is added to the initial execution time. Notice that this delay increases monotonically as the partition size shrinks. The resulting execution time is applied to the schedulability analysis to determine whether the task-set is schedulable at the given utilization. If no partitioning is used, then the extrinsic interference is also considered in the analysis, so RMA becomes CRMA and RTA becomes CRTA. We consider in the experiment that the refill penalty is the time to refill the entire cache (CRMA and CRTA) or the entire partition (Hybrid). No context switch cost other than the cache refill is considered in the experiments. This process is repeated until the maximum schedulable utilization is reached; a binary search was used.

6.1 Hardware model

This work is focused on contemporary cost-effective general purpose processors. The system modelled is a single processor with a one-level, physically mapped instruction cache (typically on-chip). Some cache parameters (cache size, line size, associativity) are varied to evaluate their impact on the schedulable utilization of the system. The values have been chosen according to current trends: separate caches for data and code, direct mapping, and 8-word lines are all very common. Since our model considers only the instruction cache, write policies are not considered. Refer to [6] or [5] for a detailed explanation of both the cache miss rate model and the cache miss penalty model. The hardware model of the system has the factors summarised in Table I, which shows the default, minimum and maximum values considered for the experiments. Some explanations follow. A cache associativity degree of 1 means one-way set associative, or direct mapped. The memory read time is the time in nanoseconds to read a word; no stalls due to refresh cycles are considered. The memory access time is the additional latency every time a line is read from memory; it includes bus contention and error detection delays, and is measured in processor clock cycles. For the sake of simplicity, we assume that only one instruction is delivered per processor clock cycle.
Table I: Hardware factors

Key  Def  Min  Max  Keyname    Name
-c   32   4    64   Cachesize  Cache size (Kbytes)
-k   8    4    16   Linesize   Cache block size (words)
-r   1    1    4    Assoc      Cache associativity degree
-m   80   20   150  MIPS       MIPS
-b   40   20   60   Readtime   Memory read time (nS)
-a   2    1    4    Acctime    Memory access time (cycles)
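As a sketch of the experiment's outer loop described at the start of this section (the binary search over utilization), where `build_taskset` and `schedulable` stand for the workload generator and the schedulability test under study, both assumptions of the sketch:

```python
def max_schedulable_utilization(build_taskset, schedulable, tol=0.001):
    """Binary search for the largest utilization the analysis accepts."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        u = (lo + hi) / 2
        # build_taskset scales execution times to utilization u and adds
        # the intrinsic interference delay for the cache configuration
        if schedulable(build_taskset(u)):
            lo = u
        else:
            hi = u
    return lo
```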
6.2 Workload model

Real examples of real-time workloads can be found in [17][16][4]. However, the workload used in this paper is inspired by the application-level synthetic benchmark Hartstone [24]. The workload used in the experiment is a set of periodic and independent tasks. The task periods and task loads are configured as a function of a variety of factors (see Table II):

-n  Total number of tasks. Depending on the available number of frequencies, the tasks can be clustered in frequencies (more than one task has the same frequency), but the priority is different for each task.
-s  Utilization or load distribution among tasks (percentage slope). For high values the load is balanced towards low frequency tasks.
-h  Harmonicity degree: the percentage ratio with respect to the maximum achievable harmonicity. The maximum harmonicity is obtained when all task frequencies are harmonic to each other without release offset.
-q  Max/Min task frequency ratio: the relation between the lowest and highest frequency. For example, a value of 260 means that if the lowest frequency is 1 Hz, then the highest is 260 Hz.
-g  Separation among frequencies. This factor sets the base of the power function used to generate the periods. For example, base 3 (generating 3, 9, 27, ...) generates more distant periods than base 2 (2, 4, 8, ...).
-v  Frequency (Hz). This factor sets the frequency of the system. All task frequencies are scaled in relation to this frequency, which is the frequency held by the highest priority task.

The number of releases must be the same for all task-sets exercised, independent of the factors mentioned above. This constraint is necessary since the extrinsic interference is highly related to the number of task releases. All tasks' deadlines are equated to their periods; thus, the Rate Monotonic priority assignment can be applied and is optimal. The default task-set is fully harmonic (i.e. harmonicity factor at 100%). Under these assumptions, a 100% schedulable utilization can be reached under ideal conditions (zero overhead). This allows us to isolate completely the effect of the cache interference from the timeline constraints: varying the factors, it is not possible to configure a task-set incapable of reaching 100% schedulable utilization (without overheads). Thus, the system is penalised by the cache interference by exactly the difference between the resulting maximum schedulable utilization and 100%. Although this approach may seem simplistic, it is sufficient, since some common characteristics of real-time task-sets can be reduced to periodic behavior (aperiodic load, precedence constraints, multi-deadline tasks), and are thus assumed implicitly in the model, or can even be ignored (release offsets) due to the abstraction level of the schedulability analysis used.

Table II: Workload factors
Key  Def  Min  Max   Keyname  Full name
-n   20   10   40    NumTask  Number of tasks
-h   100  60   100   Harmty   Harmonicity (%)
-s   0    -90  90    Distr    Utilization distribution (slope)
-q   260  260  2200  Band     Max/Min task frequency ratio
-g   2    2    3     Separat  Separation among frequencies
-v   200  100  1000  Freq     Frequency (Hz)
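As an illustration of how the -n, -g, -q and -v factors interact, the following hedged sketch generates a fully harmonic set of periods; the clustering rule for the lowest frequency is an assumption of the sketch:

```python
def harmonic_periods(n_tasks=20, base=2, band=260.0, freq=200.0):
    """Periods for a fully harmonic task-set (-h 100), shortest first.

    The highest priority task runs at `freq` Hz; successive frequencies are
    separated by the factor `base` until the max/min ratio `band` would be
    exceeded, after which the remaining tasks are clustered at the lowest
    frequency."""
    periods, f = [], freq
    for _ in range(n_tasks):
        periods.append(1.0 / f)
        if freq / (f / base) <= band:
            f = f / base   # move to the next, slower frequency
        # else: keep f, clustering further tasks at this frequency
    return periods
```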
7 Variation explained by the factors

We use the experimental 2^k factorial design [8] to evaluate the variation in the schedulable utilization introduced by each factor (and each combination of them). This method determines the change in the system response (schedulable utilization in this case) when a factor is varied. Instead of evaluating one factor while the others are kept at an average value, this method exercises all possible combinations of levels for the factors (two levels, thus 2^k combinations), and extracts from the results the portion of variation explained by each factor, as a percentage. Notice that the results are unsigned: we cannot deduce from this experiment whether the relation between a factor and the result is direct or inverse. We used the Min and Max levels shown in Table I and Table II. For clarity, only the factors that explain more than 2% of the variation are shown in the graphs. Combinations of factors are plotted as a concatenation of keys (e.g. cb means cache size and read time; see the tables). Notice that the larger the influence of a factor, the more sensitive the system is to that factor. Real-time systems designers have to pay special attention to the important factors; for example, using CRTA, the frequency of the system has a strong effect on performance (see Figure 2). Notice that the percentage of variation explained by every factor is relative to the other
factors in the same experiment (same figure). Comparing percentage values among different graphs is not correct. Furthermore, the results are neither general nor absolute; they are representative of the values used for each factor. They have to be judged with those values in mind.
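A hedged sketch of this design follows; `run` maps an assignment of factor levels to the measured schedulable utilization, and the sign-table arithmetic is the standard one for 2^k factorial designs:

```python
from itertools import product

def variation_explained(factors, run):
    """factors: name -> (min_level, max_level); returns, per factor, the
    fraction of the total variation in the response it explains."""
    names = list(factors)
    runs = list(product((0, 1), repeat=len(names)))          # all 2^k settings
    ys = [run({n: factors[n][lvl] for n, lvl in zip(names, r)}) for r in runs]
    mean = sum(ys) / len(ys)
    ss_total = sum((y - mean) ** 2 for y in ys)
    fractions = {}
    for idx, name in enumerate(names):
        # half-effect q_i from the sign table (-1 for min level, +1 for max)
        q = sum((2 * r[idx] - 1) * y for r, y in zip(runs, ys)) / len(ys)
        fractions[name] = len(ys) * q * q / ss_total          # SS_i / SS_total
    return fractions
```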
Figure 2: All factors using CRTA (y-axis: percent of variation in the schedulable utilization; factors shown: CacheSize, LineSize, ReadTime, AccTime, NumTask, Freq, cv, cbv, cnv).
7.1 Combined hardware and workload factors

As can be seen in Figure 2 and Figure 3, cache size plays an important role in both the CRTA and PART schemes. Using CRTA, it is directly related to the refilling cost paid for every preemption (extrinsic cache interference); in the partitioning scheme (PART) it is related to intrinsic interference. For the former the effect is negative as cache size increases, while for the latter the effect is positive (a larger cache means larger partitions, and thus better intrinsic behavior). Since partitioning completely voids the inter-task interference, that scheme is only affected by the intrinsic behavior of the cache. As can be seen in Figure 4, the Hybrid solution has met its first challenge: the relevance of cache size relative to the other factors is much lower than for the former two techniques (note that we compare based on scores relative to the other factors in the same figure). This means that whatever the cache size, a number of partitions can be found that makes the system perform at its best. The system frequency seriously jeopardises the performance of CRTA, while it is almost insignificant for partitioning. Higher frequencies mean more context switches, which affects the useful performance negatively; however, no cost is associated with the context switch in the partitioning scheme. The Hybrid solution inherits from the partitioning scheme the capability to deal with high frequencies, as can be seen in Figure 4. The memory bandwidth is of approximately the same importance for all schemes, because it matters for both extrinsic and intrinsic cache behavior; in fact, a technique insensitive to memory bandwidth would have solved the memory bottleneck. The number of tasks is moderately relevant to both the refilling and partitioning approaches. For CRTA it contributes to the total number of context switches in the system. For the partitioning scheme, the number of tasks jointly with the cache size settles the partition size, and therefore the cache-related speed-up of task code. The Hybrid solution is less affected by this factor; like frequency, it can almost be overlooked for this technique. The cache line size (block size), access time, and MIPS are associated with the intrinsic behavior of the cache, thus their score is higher for PART. Those factors achieve the highest scores for the Hybrid solution.
Figure 3: All factors using PART with RTA as the test (factors shown: CacheSize, LineSize, ReadTime, AccTime, MIPS, NumTask, Distr, am).
Figure 4: All factors using Hybrid with optimal partitions (factors shown: CacheSize, LineSize, Assoc, ReadTime, AccTime, MIPS, NumTask, Distr, Freq, bm, ns).
This provides an important conclusion: the Hybrid solution has managed to annul the factors that depend on the workload, being only mildly affected by those factors that are intrinsic to cache performance. Since those factors are intrinsic, no method is capable of circumventing them. Notice the difference between the factors that are related to the cache interference (speed-down) and the ones that are related to the cache performance (speed-up
offered by the cache). The effect of the latter factors can apparently be reduced by the greater effect of the former, but they always apply as long as there is a cache in the system. In conclusion, the Hybrid solution is rather independent of the factors that produce loss of performance, while it is only affected by those that are intrinsic to the use of a cache. The higher variation explained by the MIPS factor relative to the other two techniques is mainly due to the wide domain considered for it.
7.2 Density functions

As commented earlier, the 2^k factorial design evaluated the system for every possible combination of the factors (each factor having two possible values). The following graphs represent the density functions of schedulable utilization obtained in those experiments for each of the techniques considered. They illustrate the improvements achieved by the Hybrid solution. As can be seen in Figure 7, its performance is always above 15% schedulable utilization, and approximately fifty percent of the cases evaluated score above 65%. The shape of the function is interesting: the density increases for higher values of schedulable utilization, even though some of the situations are very demanding.
Figure 5: Density function of the schedulable utilization (0-100%), all factors, using CRTA.

Figure 6: Density function of the schedulable utilization, all factors, using PART with RTA as the test.

Figure 7: Density function of the schedulable utilization, all factors, using Hybrid with optimal partitions.
The shape for CRTA (see Figure 5) reflects that it is capable of obtaining good results. However, it is very sensitive to some factors: in the most demanding situations this scheme scores as low as 0% due to its high sensitivity. Depending on the factors, this solution can be very effective but can also be really bad. The shape for PART (Figure 6) is spread over the whole range, not concentrated in any particular area. This means that this solution performs moderately but manages the worst situations without falling below 5%: it is not very effective, but it performs quite consistently. In conclusion, once again, the Hybrid technique gets the best from both CRTA and PART: it is very effective, and capable of managing the worst situations.

8 Comparing the results to both CRTA and Partitioning

In this section we compare the effectiveness of the approach described in this paper to both CRTA and PART, using 3-D graphs as a function of selected factors. This comparison provides a view of the benefits of the method presented in this paper over the previously available ones. Only the factors shown to be relevant in the previous section are studied here (while the others are kept at their default values). These results show whether the relationship between a factor and the schedulable utilization is direct or inverse. Note that some workload factors, like harmonicity, do not affect the results, by the nature of the analysis.
8.1 Hardware factors
The first three figures illustrate how the Hybrid solution works. They use the cache size and memory read time factors, and include representations for CRTA and PART. The difference among the figures is the number of partitions used for the Hybrid solution. Figure 8 presents the effectiveness of using just 4 partitions. Figure 9 presents several numbers of partitions, showing that more partitions are needed as the cache grows in order to obtain the maximum utilization: each plane has its maximum for a different cache size, and it is interesting to contemplate the transition of the plane shape as the number of partitions changes. Finally, Figure 10 (and all subsequent graphs) presents the Hybrid method using the optimal number of partitions for each point in the plane. Looking at these graphs, two obvious questions arise: why use a large cache, and why use any solution other than CRTA? After all, with an 8 Kb cache and the CRTA technique, the obtained utilization is at least at the same level as that obtained with the other techniques for larger caches. The answer comes in the next subsection, where it will be shown that a small cache coupled with CRTA does not cope with all workload configurations that may arise. Furthermore, small caches do not cope with the fetch requirements when processor power increases. Obviously the best technique will always be the Hybrid, since it includes the other two. Refilling schemes are usually constrained by large caches, due to the higher penalty associated.
Figure 8: Schedulable utilization (%) as a function of cache size (4-64 Kbytes) and memory read time (20-60 nS), for CRTA, Hybrid with 4 partitions, and PART.

Figure 9: Schedulable utilization (%) as a function of cache size and memory read time, for CRTA, PART, and Hybrid with 2, 4 and 8 partitions.

Figure 10: Schedulable utilization (%) as a function of cache size and memory read time, for CRTA, PART, and Hybrid with the optimal number of partitions.
Figure 11: Schedulable utilization (%) as a function of cache size and memory access time (1-4 cycles), for CRTA, Hybrid and PART.

However, for very small caches the intrinsic interference becomes dominant even for those solutions. This trend can be observed in any of the figures at around a 4 Kb cache size, where the plane folds below its local maximum. Figure 12 shows two interesting effects. For high levels of processor power (the MIPS factor), small caches do not manage to provide the bandwidth requested by the processor. Inversely, a very large cache does not help for small processor power, since the extrinsic penalty becomes dominant over the intrinsic benefit of having a large cache.
Figure 12: Schedulable utilization (%) as a function of cache size and MIPS (20-150), for CRTA, Hybrid and PART.

Figure 13: Schedulable utilization (%) as a function of cache size and number of tasks (10-40), for CRTA, Hybrid and PART.
The Hybrid solution delivers the best performance for any configuration.
8.2 Workload factors
As can be seen in the graphs of this subsection, the performance of CRTA is highly conditioned by the number of tasks and the system frequency, while the partitioning technique is only slightly conditioned by the number of tasks. The Hybrid solution is always above the other two, even for the largest cache size. This means that even for the most demanding workloads (high frequencies and large task sets) it is not necessary to fully partition the cache, because the Hybrid technique with an optimal number of partitions performs better. Figure 13 shows that for a large number of tasks, CRTA obtains a good bound with the smallest cache; inversely, partitioning is better suited to the largest cache. The Hybrid solution performs properly independently of the cache size. This is important, since other application requirements may condition the choice of processor, and the processor may hold a cache of a size undesirable for the other techniques. Moreover, because of current levels of integration, a mid-size (i.e. 32 Kb) cache is common in contemporary processors. Figure 14 shows a similar effect for the system frequency. Note that the Hybrid solution is slightly affected by system frequency, while partitioning is unaffected; however, the performance loss for the former is less than 10% when moving from 100 Hz to 1000 Hz. It is important to note that the absolute maximum is obtained using the Hybrid technique with the largest cache. As can be seen in Figure 15, the best utilization is obtained for low frequencies and a low number of tasks. The important fact from this figure is that even though Hybrid performs as consistently as partitioning, the former can outperform the latter by 20%, which is a considerable gain in schedulable utilization.
Figure 14: Schedulable utilization (%) as a function of cache size and frequency (100-1000 Hz), for CRTA, Hybrid and PART.

Figure 15: Schedulable utilization (%) as a function of number of tasks and frequency, for CRTA, Hybrid and PART.
Figure 15 also presents evidence of the limitations of CRTA in adverse workload situations (high frequency and a high number of tasks). This delivers the important conclusion that a small cache plus CRTA is not appropriate for all system configurations; in such cases, only the partitioning techniques can cope with the system's demands.
9 Conclusions

This paper has presented a general yet powerful framework for dealing with the extrinsic cache behavior. It enables the use of cache memories in preemptive real-time systems, notably increasing the maximum achievable utilization. It has been shown that the schedulable utilization bound obtained is higher and more stable (quite independent of hardware and workload factors) than for earlier methods. The method can be applied optimally at an early stage of the system design, because of the limited workload information required. For simple schedulers, like Rate Monotonic ones, the assignment is straightforward. The framework is flexible enough to be successfully applied to real systems: it accommodates periodic and sporadic tasks, process synchronization, deadlines shorter than periods, and arbitrary fixed priority assignments. With this technique, both the performance and reliability benefits of caches can definitely be exploited in real-time systems.
Acknowledgements

The authors thank Dr. Kelvin Nilsen for his seminal ideas underpinning this strand of work. José V. Busquets thanks Prof. Alan Burns for hosting his study leave at York.

References

[1] R. Arnold, F. Mueller, D. Whalley and M. G. Harmon. "Bounding Worst-Case Instruction Cache Performance". IEEE Real-Time Systems Symposium, pages 172-181, 1994.
[2] N. C. Audsley, A. Burns, M. F. Richardson and A. J. Wellings. "The STRESS Hard Real-Time System Simulator". Software-Practice and Experience, Vol. 24, Num. 6, pages 543-564, June 1994.
[3] S. Basumallick and K. D. Nilsen. "Cache Issues in Real-Time Systems". ACM SIGPLAN Workshop on Language, Compiler, and Tool Support for Real-Time Systems, June 1994.
[4] A. Burns, A. Wellings, C. Bailey and E. Fyfe. "The Olympus Attitude and Orbital Control System: A Case Study in Hard Real-time System Design and Implementation". Ada sans frontieres, Proceedings of the 12th Ada-Europe Conference, Lecture Notes in Computer Science 688, pages 19-35, 1993.
[5] J. V. Busquets-Mataix and A. J. Wellings. "Adding Instruction Cache Effect to Schedulability Analysis of Preemptive Real-Time Systems". YCS 260, September 1995.
[6] J. V. Busquets-Mataix and J. J. Serrano-Martin. "The Impact of Extrinsic Cache Performance on Predictability of Real-Time Systems". Proceedings of the Second International Workshop on Real-Time Computing Systems and Applications, Tokyo, Japan, October 1995.
[7] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990.
[8] M. D. Hill and A. J. Smith. "Evaluating Associativity in CPU Caches". IEEE Transactions on Computers, Vol. 38, Num. 12, pages 1612-1630, December 1989.
[9] D. B. Kirk. "SMART (Strategic Memory Allocation for Real-Time) Cache Design". Proceedings of the IEEE Real-Time Systems Symposium, pages 229-237, December 1989.
[10] D. B. Kirk. "Allocating SMART Cache Segments for Schedulability". Foundations of Real-Time Computing: Scheduling and Resource Management, pages 251-275, 1991.
[11] D. B. Kirk. "Process Dependent Static Partitioning for Real-Time Systems". Proceedings of the IEEE Real-Time Systems Symposium, pages 181-190, 1988.
[12] D. B. Kirk and J. Strosnider. "SMART (Strategic Memory Allocation for Real-Time) Cache Design using the MIPS R3000". Proceedings of the IEEE Real-Time Systems Symposium, pages 322-330, 1990.
[13] S. Lim et al. "An Accurate Worst Case Timing Analysis Technique for RISC Processors". IEEE Real-Time Systems Symposium, pages 97-108, 1994.
[14] T. H. Lin and W. S. Liou. "Using Cache to Improve Task Scheduling in Hard Real-Time Systems". IEEE Workshop on Architecture Support for Real-Time Systems, pages 81-85, December 1991.
[15] J. Liu and H. Lee. "Deterministic Upperbounds of the Worst-Case Execution Time of Cached Programs". IEEE Real-Time Systems Symposium, pages 182-191, 1994.
[16] C. D. Locke, D. R. Vagel and T. J. Mesler. "Building a Predictable Avionics Platform in Ada: a Case Study". IEEE Real-Time Systems Symposium, pages 181-189, 1991.
[17] J. Molini, S. Maimon and P. Watson. "Real-Time Scenarios". IEEE Real-Time Systems Symposium, pages 214-225, 1990.
[18] F. Mueller. "Compiler Support for Software-Based Cache Partitioning". Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[19] K. D. Nilsen. "Keynote". Symposium on Real-Time Systems and Applications, Chicago, Illinois, May 1995.
[20] K. Nilsen. "Real-Time is No Longer a Small Specialized Niche". Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V), Orcas Island, Washington, 1995.
[21] R. Rajkumar, L. Sha, J. P. Lehoczky and K. Ramamritham. "An Optimal Priority Inheritance Protocol for Real-Time Synchronization". COINS Technical Report 88-98, October 1988.
[22] L. Sha, R. Rajkumar and J. P. Lehoczky. "Priority Inheritance Protocols: An Approach to Real-Time Synchronization". IEEE Transactions on Computers, Vol. 39, Num. 9, pages 1175-1185, September 1990.
[23] K. Tindell, A. Burns and A. Wellings. "Allocating Real-Time Tasks (An NP-Hard Problem made Easy)". Real-Time Systems, Vol. 4, Num. 2, pages 145-165, June 1992.
[24] N. H. Weiderman and N. I. Kamenoff. "Hartstone Uniprocessor Benchmark: Definitions and Experiments for Real-Time Systems". The Journal of Real-Time Systems, Num. 4, pages 353-382, 1992.
[25] A. Wolfe. "Software-Based Cache Partitioning for Real-Time Applications". Proceedings of the Third International Workshop on Responsive Computer Systems, September 1993.