Adding Instruction Cache Effect to Schedulability Analysis of Preemptive Real-Time Systems*

José Vicente Busquets-Mataix
Departamento de Ingeniería de Sistemas, Computadores y Automática
Universidad Politécnica de Valencia
P.O. Box 22012, 46071 Valencia (SPAIN)
e-mail: [email protected]

Andy Wellings
Real-Time Systems Research Group
Department of Computer Science
University of York, York, UK
e-mail: [email protected]

Abstract
Cache memories are commonly avoided in real-time systems because of their unpredictable behavior. Recently, some research has been done to obtain tighter bounds on the worst case execution time (WCET) of cached programs. These techniques usually assume a non-preemptive underlying system. However, some techniques can be applied to allow the use of caches in preemptive systems. This paper describes how to incorporate the effect of the instruction cache into Response Time schedulability Analysis (RTA), an efficient analysis for preemptive fixed priority schedulers. We also compare, through simulations, the results of this approach to both cache partitioning (which increases cache predictability by assigning private cache partitions to tasks) and CRMA (cached RMA: the cache effect is incorporated into the utilization-based Rate Monotonic schedulability analysis). The results show that the cached version of RTA (CRTA) clearly outperforms CRMA; however, the partitioning scheme may be better depending on the system configuration. The obtained results bound the applicability domain of each method for a variety of hardware and workload configurations, and can be used as design guidelines.

1 Motivation

Real-time systems are commonly considered very specific systems, used in specific applications and implemented with ad-hoc hardware and software. Nowadays, the applications of real-time theory are not limited to life-critical (hard real-time) systems, where the cost factor is of secondary importance. Emerging areas like car computers, multimedia computing, gesture recognition, voice interaction, CD-I and the like [33] demand real-time capabilities at relatively low cost, as all consumer products impose this requirement. Yet some of these applications involve computationally intensive tasks, so they require a cheap but powerful hardware platform to be competitive in the marketplace. Microprocessor manufacturers are adopting solutions based on statistical assumptions about the workload to improve the cost-performance ratio. Hennessy and Patterson write in [11]: “Making the frequent case fast is the inspiration for almost all inventions aimed at improving performance”. A good example of this trend is the cache: running typical programs, the majority of memory references are served quickly, which is almost equivalent (from a performance point of view) to implementing the whole main memory with fast and expensive chips. This trend, jointly with large market volumes, allows contemporary processors to achieve a very good cost-performance ratio. Moreover, since these processors are aimed at a large audience, they are widely tested, which increases their reliability (the Pentium divide flaw is an interesting case in point). In conclusion, contemporary processors are good candidates for implementing cost-effective real-time applications because they are cheap, powerful and reliable. However, some techniques have to be applied to make these processors more predictable.
* This work was supported in part by the Spanish Ministry of Education and Science. It was done whilst Jose V. Busquets was on Study Leave with the Real-Time Systems Research Group at the University of York.
Where caches are concerned, a good first step to increase processor utilization is to calculate the worst case execution time (WCET) of tasks taking the cache speed-up into account. Some tools can be used to estimate the WCET of cached programs. These tools typically assume continuous code execution, so no preemption is allowed. This limitation restricts such tools to non-preemptive scheduling policies like cyclic executives or cooperative scheduling. However, two actions can be undertaken to make caches suitable for preemptive real-time systems: first, making the cache behavior more predictable (for example by partitioning, by locking pieces of code, or simply by using easy-to-model cache configurations); second, modelling the cache behavior so as to incorporate the cache effects into the schedulability analysis. Using these techniques we can take advantage of the huge research effort committed to date to preemptive systems. Without them, the inevitable use of caches (due to the exponentially increasing memory-processor speed gap) may represent the end of the line for low cost preemptive real-time systems.

Since they were first introduced in the IBM 360/85 in 1969, cache memories have served to alleviate the constantly increasing processor-memory speed gap, the so-called von Neumann bottleneck. Baskett [5] estimated an annual rate of performance increase for microprocessors of 80%, in contrast to the 7% estimated for memory devices [11]. Lately, some architectural trends are putting even higher demands on the instruction memory path. First, pipelining increases the number of instructions requested per clock cycle. Second, RISC processors exhibit lower code densities (larger instructions). Third, the number of bus lines is limited in embedded systems because of cost, design and reliability constraints. In addition, the memory is one of the most expensive parts of the computer, and effective use of this expensive element produces greater budget benefits [32]. Another significant reason, typically not considered, is that the use of on-chip caches increases reliability by reducing bus traffic, since off-chip buses are more exposed to external interference.

The objectives of this paper are two-fold:
1. To show the effectiveness of the use of cache memories in preemptive real-time systems through the consideration of the inter-task cache interference in the schedulability analysis.
2. To compare the scheme proposed in this paper to the schemes presented earlier. Thus, the cached RTA (CRTA) is compared to both CRMA and partitioning.

2 Previous work
The use of caches in real-time systems is an emerging area which is being increasingly pursued. We are only concerned with the extrinsic (inter-task) behavior of the instruction cache. The intrinsic (intra-task) behavior will be briefly discussed in subsection 6.1; we concentrate on aspects of the extrinsic behavior while implicitly assuming that the intrinsic behavior has been taken into account where applicable, for example when calculating the WCET of tasks. Firstly, we briefly review some models and experimental results of the extrinsic behavior in general-purpose systems. Next, we present the CRMA and partitioning techniques, which have been previously compared in [9].
2.1 Extrinsic behavior

Some published works have explored cache performance in multitasking environments. The paper [2] shows the cache performance degradation due to both operating system and multiprogramming activity, using traces from the real world; the results are presented as miss rates for several cache sizes. The effect of context switching on the performance of a cached system is evaluated in [27] with real experiments. This study distinguishes between intrinsic and extrinsic cache misses, and measures the loss of performance at several time points after the context switch takes place. The analytical cache model given in [1] includes the effects of multitasking; it considers round robin scheduling with constant duration time slices. Another analytical model is given in [43][42]. In this case, the cache refill transients are calculated analytically from the task footprint over a set-associative cache, where the footprint is a statistical view of the cache lines touched by an executing task. A recent paper [26] addresses the question of which cache configuration is more appropriate for technical and for multi-user commercial workloads; it uses real traces and takes into account the operating system's impact on cache performance. All these papers fall in the field of exploring cache performance in common multitasking systems. This is far removed from considering a preemptive real-time system, where other factors arise, like timing requirements, scheduling algorithms, periodicity of tasks, high frequencies, etc. A real-time system requires the delay suffered by each task to be under a given bound, to ensure the timing correctness of the entire system; in typical multiprogramming workloads, like commercial ones, the delay imposed by cache thrashing is evaluated at the system level.
2.2 Cached RMA

Cached Rate Monotonic Analysis (abbreviated CRMA), presented in [6], incorporates the worst case extrinsic cache effect (the cache refill penalty) into the conventional RMA [24]. Although Basumallick and Nilsen point out in [6] other ways to bound the refill penalty, an entire cache refill is assumed. The cache refill penalty is added to the preempting task's execution time, assuming that in real execution the preempted task will use that time budget to restore its cache state. CRMA considers that each task will preempt a lower priority one. Thus, the constant cache refill penalty γ is added to the worst case execution time of each task except the lowest priority one. The utilization of each preempting task is calculated from C + γ and applied in the following feasibility test:

\sum_{i=1}^{N} U_i \le N (2^{1/N} - 1)

where N is the number of tasks and Ui is the utilization of task τi. RMA is a sufficient, utilization-based feasibility test for task-sets scheduled by Rate Monotonic (RM). RM is an optimal preemptive fixed priority scheduling policy under the following constraints:
1. Tasks are periodic and independent.
2. The deadline is equal to the period (i.e. Ci ≤ Di = Ti).
3. The worst case computation time of each task is known.
4. All task releases coincide in at least one time point (the critical instant assumption).
Rate Monotonic analysis has been widely accepted over the last decade owing to its strong theoretical foundations [7]. Its main advantage over cyclic executives is flexibility and separation of concerns: timing correctness is studied independently of functional correctness. It is also easy and fast, with computational complexity O(n) in the number of tasks. On the other hand, RMA suffers from a pessimistic utilization bound, since it is not an exact analysis (it is sufficient but not necessary). The bound converges to 0.69 (i.e. ln 2) as the number of tasks approaches infinity, although for a randomly chosen large task-set the likely bound is 88 percent [38]. Another time-domain technique is presented in the Spring system [31], where tasks execute without preemption, thus avoiding the extrinsic interference at its root.

2.3 Cache partitioning

Cache partitioning (abbreviated PART) aims to improve cache predictability by annulling the extrinsic interference: each task is provided with a private cache partition. Additionally, a common partition may be used for data sharing and for non-critical tasks. The cache partitioning approach may be implemented in software [46][30] or in hardware [16][19][17][18]. The hardware scheme requires the cache to be off-chip, plus some additional external circuitry for cache control; it adds latency to the processor cycle. The software solution requires compiler support: the application code is relocated to provide exclusive cache mappings for each task. This scheme also introduces delays, in this case due to the insertion of branches to interconnect the relocated pieces of code. A closely related technique is cache locking [22], used in some current controllers. This mechanism allows the programmer to lock a piece of code in cache (usually interrupt handlers), guaranteeing fast and predictable response times for that code. The main conclusion is that it increases schedulability.

3 Exact schedulability analysis

The unnecessarily pessimistic utilization bound given by the utilization-based RM analysis motivated the development of exact (sufficient and necessary) tests. An obvious exact analysis is to simulate the schedule over an interval equal to the least common multiple of the task periods (LCM) [20]. However, the LCM may be extremely large even for small task-sets. The following property gave the key to avoiding a complete simulation over the LCM: "For a set of periodic tasks, if each task meets its first deadline when all tasks are released at the same time, then all subsequent deadlines will be met". Thus it is not necessary to test over the entire LCM to guarantee timing correctness for all task phasings, since the critical zone is the worst case phasing. This theorem provides the basis for an exact schedulability test for preemptive fixed priority scheduling. Two analyses have been developed:
• Checking the scheduling points (the release times of higher priority tasks) for each task.
• Calculating the response time of each task by allocating the task's execution time and interference in a time window.
Both achieve equal results (because both are exact), but the latter is usually faster. For this reason, we concentrate on the second test. It was independently developed by Harter [10], by Joseph and Pandya [15], and by the Real-Time Research Group at York [7].

3.1 Calculating the response time for tasks
Several restrictions of RMA are inherently unnecessary for this method, which makes the analysis stronger:
1. The deadline may fall before the end of the period, Ci ≤ Di ≤ Ti (note that the runnability constraint still holds). However, without the D = T requirement the Rate Monotonic priority assignment policy cannot be used; the Deadline Monotonic priority assignment, which is also optimal, can be applied instead.
2. Non-periodic tasks are allowed provided that a minimum inter-arrival time can be guaranteed. For analysis purposes this time is equivalent to the period of a periodic task. Such tasks are called sporadic.
3. The analysis is applicable to any preemptive fixed priority scheduling policy; it is not limited to Deadline Monotonic or Rate Monotonic. Moreover, these policies become non-optimal when release offsets are allowed [4]. The only restriction is that priorities must be unique (i.e. tasks do not share the same priority level).
The worst case response time of the highest priority task is trivial, because it depends solely on the execution time of the task. However, the response time of the other tasks is affected by the execution of higher priority tasks. The worst case interference is produced when all tasks are released at the same time (the critical instant). A recursive approach is used to calculate the response time of a generic task: it tries to allocate, in a time window w, the computation time Ci of task τi, the task's blocking time Bi, and the interference produced by the execution of higher priority tasks. The blocking time is the maximum time that the task can be delayed by lower priority tasks due to resource contention. The process is iterative because at every step the interference is added to the current window w_i^n, resulting in a longer window w_i^{n+1} that may include further interference in the next step. The process finishes when the window stops growing (w_i^{n+1} = w_i^n). If the resulting response time of any task is greater than its deadline (w_i^{n+1} = w_i^n = Ri > Di), the task-set is not schedulable. The formula is:

w_i^{n+1} = C_i + B_i + \sum_{j \in hp(i)} \lceil w_i^n / T_j \rceil C_j

where hp(i) denotes the set of tasks with higher priority than task τi. Checking the worst case response time against the allowed deadline of each task completes the analysis; if no deadline is missed, the task-set is schedulable.
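As an illustration only (the paper gives no code), the recurrence can be sketched in Python; the task representation, the zero-based priority ordering and the deadline check are our assumptions:

    from math import ceil

    def response_time(i, C, B, T, D):
        # Worst case response time of task i (index 0 = highest priority).
        # C[j], T[j]: WCET and period of task j; B[i]: blocking time of
        # task i; D[i]: its deadline. Returns None if the window exceeds
        # the deadline, i.e. the task is not schedulable.
        w = C[i] + B[i]                                   # initial window w_i^0
        while True:
            # interference from every higher priority task released in w
            w_next = C[i] + B[i] + sum(ceil(w / T[j]) * C[j] for j in range(i))
            if w_next == w:                               # converged: w = R_i
                return w
            if w_next > D[i]:                             # deadline missed
                return None
            w = w_next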
4 Incorporating the extrinsic cache interference into the analysis

The extrinsic cache interference suffered by each task is simply another interference term to be added to the formula in the iterative process. The problem can be divided in two: how many times the interference is accounted for, and how long its time penalty is.

4.1 Accounting preemptions

To address the first problem, the following property bounds the interference: "The maximum number of preemptions suffered by a given task is bounded by the number of releases of higher priority tasks within its response time window". This property can be explained as follows: in the worst case, the task will be preempted by each of the higher priority task executions in its time window. This can be envisioned as the task resuming execution after each higher priority task execution, thus paying a refill penalty on every resumption. It may seem that the task under consideration could be prevented from executing while more than one task executes, thus in reality suffering fewer preemptions. However, the delay must be accounted for anyway, because the refill penalty affects the task of interest either directly or indirectly:
• Directly: the task pays the refill penalty when it resumes execution after a preemption.
• Indirectly: other tasks with higher priority pay the refill penalty, thus increasing their execution times; the task under analysis is affected by the larger execution times of the higher priority tasks.
The direct and indirect interference can be observed in Figure 1 (refer to the description of STRESS [37] for a detailed explanation of the notation used in the figure). The cache refill penalties are depicted as black rectangles, each immediately following a resumption of execution after a preemption. The schedule corresponds to all tasks being released simultaneously. The first releases of tasks 1 and 2 do not preempt task 3, because the latter has not yet started execution; however, the analysis also includes these two preemptions (despite the fact that we consider the first cold-start transient to be incorporated in the WCET of tasks). This is because, in a nearly identical task phasing, task 3 may start just before the release of tasks 1 and 2, thus suffering preemptions in addition to the cold-start penalty paid before the preemption (Figure 2). Note that the preemption suffered by the second instance of task 2 due to the execution of task 1 inflates task 2's execution time, which results in increased interference on task 3. This is an indirect cache penalty: in reality it is paid by task 2, but it has to be considered as well in the calculation of the response time of task 3. For the time schedule depicted
in Figure 1, the analysis of the response time of task 3 would consider five preemptions (two releases of task 2 plus three releases of task 1), while in the depicted case task 3 pays two refills and suffers one indirect refill (owed to task 2).

Figure 1 (worst case schedule with cache refill penalties, all tasks released simultaneously)

Figure 2 (similar phasing in which task 3 starts just before the releases of tasks 1 and 2)

To avoid other sources of interference (for example from lower priority tasks), the blocking time Bi has to be calculated under the Ceiling Semaphore Protocol (CSP) [36] or another synchronization technique that precludes a task from being blocked during its execution. CSP ensures that if a task is blocked, the blocking occurs at the beginning of its execution; in this way, additional context switches due to blocking are avoided. Under the Priority Ceiling Protocol (PCP) [39], the task under consideration may yield the CPU to a lower priority task once the former reaches a resource held by the latter, thus allowing interference from lower priority tasks to occur. Using CSP, the execution of higher priority tasks is the only cause of disruption of the task's execution continuity.

4.2 Assessing the refill penalty γ

The second problem is assessing the cache interference penalty associated with every preemption (the cache refill penalty). There are several ways of measuring this interference:
1. The time to refill the entire cache.
2. The time to refill the lines displaced by the preempting task.
3. The time to refill the lines used by the preempted task.
4. The time to refill the maximum number of useful lines that the preempted task may hold in cache at any time. Useful lines are those that are likely to be used again. We consider that the preemption arises in the worst case (i.e. when the number of useful lines is maximal). This rationale was introduced in [21].
5. The time to refill the intersection of lines between the preempting and preempted tasks.
Approaches 2, 3, 4 and 5 require knowledge of the number of lines touched by each task before performing the analysis. Additionally, approach 5 requires knowing exactly which lines are used by each task, not merely how many. Approach 1 is independent of the lines used by tasks. Approaches 3, 4 and 5 require an exact characterization of the direct preemptions suffered by each task, since the affected lines may differ from one preempting task to another. This is incompatible with the model of preemptions given in the previous subsection, which cannot distinguish between direct and indirect preemptions; we simply bounded the overall number of preemptions. Thus only approaches 1 and 2 can be used with such a characterization, since they concern the effect of the preempting tasks, which is the same whether the preemption is direct or indirect. We have to choose between approaches 1 and 2 for the experiments. The second is more accurate but requires knowledge of the lines touched by each task, which is not trivial to gather; from the point of view of schedulability analysis, partitioning the cache is easier than analysing the tasks' traces. For this reason, we consider the first approach unless partitioning is used, in which case only the time to refill the partition is considered. Obviously, the maximum number of cache lines that can be displaced by the execution of a preempting task is the entire cache, so the first approach is the worst case bound on the extrinsic interference penalty. In conclusion, the formula below adds the cache refill penalty γj (or γ if approach 1 is used) to the execution time of every preempting task τj, since this time will be lost, directly or indirectly, by the preempted task τi. An obvious side effect is that the time window is likely to be longer, allowing more tasks to be released within it from iteration to iteration (in the same way as the ordinary interference does).
w_i^{n+1} = C_i + B_i + \sum_{j \in hp(i)} \lceil w_i^n / T_j \rceil (C_j + \gamma_j)    (Equation 1)

In this case, the blocking time Bi has to be calculated under the Ceiling Semaphore Protocol (CSP), as explained earlier. This formula, contrary to the general one, explicitly acknowledges the presence of a cache (through the γj term). Thus Ci becomes the WCET computed taking the cache speed-up into account, by means of an appropriate tool and assuming that the code executes streamline, i.e. without preemption.
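Equation 1 only changes the interference term, so the sketch given earlier needs a one-line modification. The task values below are hypothetical, chosen by us merely to echo the Figure 1 scenario (three tasks, priorities in descending order):

    from math import ceil

    def crta_response_time(i, C, B, T, D, gamma):
        # Equation 1: every release of a higher priority task j charges
        # its execution time plus one refill penalty gamma[j], covering
        # both direct and indirect preemptions.
        w = C[i] + B[i]
        while True:
            w_next = C[i] + B[i] + sum(ceil(w / T[j]) * (C[j] + gamma[j])
                                       for j in range(i))
            if w_next == w:
                return w
            if w_next > D[i]:
                return None
            w = w_next

    # Hypothetical values (ms). Task 3 converges to R = 25: three releases
    # of task 1 and one release of task 2 fall in the window, so four
    # refill penalties are charged on top of the plain interference.
    C = [2, 4, 11]; B = [0, 0, 0]; T = [10, 25, 60]; D = T; gamma = [1, 1, 1]
    print(crta_response_time(2, C, B, T, D, gamma))       # -> 25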
4.3 Proof of the preemption model

In this subsection we provide a formal demonstration of the preemption model described in subsection 4.1, using the extrinsic cache penalty γj of a generic preempting task τj as given in subsection 4.2 (approach 2). We consider here that the refill penalty is the time to reload the lines displaced by the preempting task τj, because this assumption is more general (stronger) than considering a refill of the entire cache (which depends only on hardware factors). The latter, however, will be used in the experiments for simplicity.

Theorem: Given a set of real-time tasks scheduled by a fixed priority preemptive policy on a cached system (all tasks complying with the requirements of conventional Response Time Analysis), and given the worst case extrinsic interference bound γ that any task may impose, either there exists a value Ri that makes the following equation true, in which case Ri is the worst case response time of task τi, or the task is not schedulable:

R_i = C_i + B_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil (C_j + \gamma_j)

Proof: All the terms in this equation have been explained earlier for Equation 1. To simplify, we assume that the final response time of each task is known, and we use Ri to denote the final response time window w_i^n (i.e. Ri = w_i^{n+1} = w_i^n) in which the cache interference penalties are already included. We use the terms extrinsic cache interference, cache interference and interference interchangeably. We start by characterising the direct cache interference of each task, given its response time window Ri. We then allocate this interference in the response time equation, to show that it is in fact the equation that proves the theorem.

Ii is the cache interference directly suffered by task τi. We assume that tasks are ordered by priority, 1 being the highest. For the highest priority task τ1 no interference exists, since it is never preempted:

I_1 = 0

For the next priority level, the preemptions caused by task τ1 must be accounted for. We have to subtract the cache interference suffered by task τ1 itself (although it is zero), since preemptions that may (hypothetically) arise while task τ1 is executing cannot also preempt task τ2. Thus the cache interference paid by task τ1 is accounted for once, properly scaled by the number of executions of task τ1 in the response time window of task τ2:

I_2 = \lceil R_2 / T_1 \rceil \gamma_1 - \lceil R_2 / T_1 \rceil I_1

For priority 3 we follow the same rationale, but in this case two tasks (τ1 and τ2) may preempt task τ3, so the preemptions and deductions are accounted for each:

I_3 = \lceil R_3 / T_1 \rceil \gamma_1 - \lceil R_3 / T_1 \rceil I_1 + \lceil R_3 / T_2 \rceil \gamma_2 - \lceil R_3 / T_2 \rceil I_2

For a generic task τi, the corresponding formula is:

I_i = \sum_{j \in hp(i)} ( \lceil R_i / T_j \rceil \gamma_j - \lceil R_i / T_j \rceil I_j )    (Equation 2)

Now, we consider that the response time of a given task τi includes its execution time Ci, its blocking time Bi, its cache interference time Ii, plus the execution time and cache interference time of the higher priority tasks that may fit in the time window in the worst case:

R_i = C_i + B_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil (C_j + I_j) + I_i

Substituting the cache interference Ii of task τi using Equation 2, we obtain:
R_i = C_i + B_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil C_j + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil I_j + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil \gamma_j - \sum_{j \in hp(i)} \lceil R_i / T_j \rceil I_j

The second summation cancels with the fourth, and factoring the first and third we reach the following equation:

R_i = C_i + B_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil (C_j + \gamma_j)

which is the equation that proves the theorem. Notice that this equation is equivalent to the recursive Equation 1 once the time window w_i^{n+1} stops growing (assuming task τi is schedulable), thus becoming Ri. If the recursive process towards Ri does not converge, the task is not schedulable. The equation above is also valid for other window sizes; in particular it holds for the intermediate values of w_i^n (i.e. w_i^1, w_i^2, w_i^3, ... while calculating schedulability with the recursive process), because from the point of view of extrinsic cache interference there is no special difference between Ri and any other time window. Finally, if we consider that the refill penalty involves the entire cache, then γj becomes γ, a constant for all tasks.
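The algebra above can be checked mechanically. The following sketch (ours, not part of the original analysis) derives each I_i from Equation 2, using response times already converged under Equation 1, and asserts that the intermediate response time equation reproduces R_i exactly:

    from math import ceil

    def check_equation_2(R, C, B, T, gamma):
        # I_1 = 0; Equation 2 gives I_i as the sum over j in hp(i) of
        # ceil(R_i/T_j) * (gamma_j - I_j). Then verify that
        # R_i = C_i + B_i + sum ceil(R_i/T_j)(C_j + I_j) + I_i.
        n = len(R)
        I = [0] * n
        for i in range(1, n):
            I[i] = sum(ceil(R[i] / T[j]) * (gamma[j] - I[j]) for j in range(i))
        for i in range(n):
            rhs = (C[i] + B[i] + I[i]
                   + sum(ceil(R[i] / T[j]) * (C[j] + I[j]) for j in range(i)))
            assert rhs == R[i]
        return I

    # With the hypothetical task-set used earlier, R = [2, 7, 25] gives
    # I = [0, 1, 3], and the identity holds for every task.
    print(check_equation_2([2, 7, 25], [2, 4, 11], [0, 0, 0],
                           [10, 25, 60], [1, 1, 1]))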
5 The experiment

An experiment is designed to find the maximum schedulable utilization guaranteed by a given schedulability analysis for a variety of hardware and workload factors. To drive the experiments we need a workload to be executed on a hardware platform; since our experiments are simulations, both the hardware and the workload models are presented in the next sections. Given an initial utilization, we assign an execution time to each task. From the execution time, the number of memory references is estimated. From the number of references and the cache configuration, we calculate the intrinsic interference delay using the formulae of the miss rate model (see subsection 6.1). This delay is added to the initial execution time. Notice that this delay is larger for partitioning, because each task uses only its partition instead of the entire cache (no shared partitions are considered). The resulting execution time is applied to the schedulability analysis to determine whether the task-set is schedulable at the given utilization. If no partitioning is used, the extrinsic interference is also considered in the analysis, so RMA becomes CRMA and RTA becomes CRTA. As explained in subsection 4.2, in the experiment we consider the refill penalty to be the time to refill the entire cache. No context switch cost other than the cache refill is considered in the experiments. This process is repeated, using a binary search, until the maximum schedulable utilization is reached.
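The driver can be sketched as a simple binary search over the initial utilization; is_schedulable stands for whichever feasibility test (CRMA, CRTA, or RTA over partitioned WCETs) is under study, and the function name and tolerance below are our assumptions:

    def max_schedulable_utilization(is_schedulable, eps=1e-3):
        # Largest utilization whose generated task-set still passes the
        # test, assuming schedulability is monotonic in the utilization.
        lo, hi = 0.0, 1.0
        while hi - lo > eps:
            mid = (lo + hi) / 2
            if is_schedulable(mid):
                lo = mid            # feasible: search higher
            else:
                hi = mid            # infeasible: search lower
        return lo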
6 Hardware model

This work is focused on contemporary cost-effective general purpose processors. The system modelled is a single processor with a one-level, physically mapped instruction cache (typically on-chip). Some cache parameters (cache size, line size, associativity) are varied to evaluate their impact on the system's schedulable utilization. The values have been chosen according to current trends: as can be seen in Table I, separate caches for data and code, direct mapping, and 8-word lines are very common. Since our model considers only the instruction cache, write policies are not considered.

Table I: Cache parameters of actual cost-effective processors

Processor                  I-Cache size  Associativity  Line size
Intel i486DX               8-Kb (D+I)    4-way          4-word
Intel Pentium              8-Kb          2-way          8-word
DEC 21064 (Alpha)          8-Kb          1-way          8-word
Hitachi HARP-1 (PA-RISC)   8-Kb          1-way          8-word
MIPS R4000                 8-Kb          1-way          8-word
MIPS R4400                 16-Kb         1-way          8-word
SuperSPARC                 20-Kb         5-way          16-word
Micro & Tera SPARC         4-Kb          1-way          8-word
PowerPC 601                32-Kb         8-way          16-word
Intel 80960-JD             4-Kb          2-way          ??
Intel 80960-HT             16-Kb         4-way          ??
6.1 Cache miss rate model

Under multiprogramming workloads, cache misses are produced by inter-task (extrinsic) and intra-task (intrinsic) interference. Inter-task interference is caused by task preemption. Once a task is launched, the cache is gradually filled with the task's working set while it executes; once this cold-start transient finishes, the task is expected to maintain a reasonably low miss rate. A preemption causes the working set of the preempted task to be displaced by the working sets of the tasks that run in between. Once the preempted task is restarted, it pays the cold-start transient again (for all the displaced lines). Intra-task interference is caused by a task's own colliding lines, due to limited cache space (capacity misses) and/or limited cache associativity (conflict misses). Since both intrinsic and extrinsic behaviors are considered in the experiments, a model of each is needed to calculate the miss rate.

With respect to the extrinsic behavior, in [42][43] an analytical model of the miss ratio of two competing tasks is derived; in [1] the model is extended to any number of tasks. Those models are based on statistics and on knowledge of the unique lines touched by each task. A first approximation to the worst case extrinsic cache penalty, and the one considered in this experiment, is a refill of the entire cache every time a context switch is produced.

The papers [42] and [44] rely on a power function to model the miss ratio as a function of the size of a fully associative cache. In [41] and [40] the model is refined to take the line size into account. In [1] a model for set-associative caches is presented, and in [13] the relationship between associativity and miss ratio is also characterized. A recent paper [35] presents a purely statistical technique, based on a so-called gap model, to derive the expected number of misses of a given program trace over all possible address mappings. In [12] some formulas based on empirical evidence are presented for a variety of cache configurations. The following formula corresponds to an instruction cache. M is the miss rate; the cache size is 2^s bytes; the cache line is 2^r bytes; A is the architectural factor, usually close to 1; and α = 0 for a direct mapped cache, α = 0.5 for two-way set-associative and α = 1 for four-way set-associative:
M \approx A \times 3 \times 0.7^{r-3} \times 0.75^{s-3} \times (1 - \alpha \times 2^{(10 + r/2 - s)}) + 2.5
Recently some papers have been presented to estimate the WCET of cached programs [23][21][29][3]; the last two references propose a static method capable of conservatively predicting around 70% of cache hits. However, we have chosen the model presented in [12] for several reasons. Firstly, it takes into account, in an analytical way, the factors we intend to include in the experiments; in addition, it presents different formulas for instruction, data and mixed caches, which is especially desirable since we are only concerned with instruction caches. Secondly, WCET estimations are getting close to the average case and are expected to improve in the near future; at the level of abstraction of this study, the error introduced by considering the average case instead of the worst case is negligible.
6.2 Cache miss penalty model

The miss penalty is the processing time lost when a cache miss occurs: the time spent from the memory request responsible for the miss until the restart of normal operation. This term is not defined with precision throughout the bibliography. In [34] the miss penalty is divided into bus access time and other latencies (memory recovery state and error checking and recovery) plus the data transfer time for the whole block. In [11] the time to refill a cache block is considered the miss penalty, and additional techniques can be used to decrease the latency (early restart and out-of-order fetch); in [42], however, it seems that these techniques are assumed by default. After checking some processors' reference manuals, we assume that the miss penalty is the cost of reading a single word (not the whole block) plus the bus access time. This assumption supposes the use of the two techniques mentioned above. Moreover, some tactics can be applied at compile time to ensure that branch targets are line-leading words, making the out-of-order fetch strategy unnecessary at run time.

6.3 Hardware factors

As a conclusion of the former discussion, the hardware model of the system has the factors summarized in Table II. The default, minimum and maximum values considered for the experiments are shown in the table. Some explanations follow. A cache associativity degree of 1 means one-way set-associative, i.e. direct mapped. The memory read time is the time in nanoseconds to read a word; no stalls due to refresh cycles are considered. The memory access time is the additional latency paid every time a line is read from memory, including bus contention and error detection delays; the units for this factor are processor clock cycles. For the sake of simplicity, we assume that exactly one instruction is delivered per processor clock cycle.

Table II: Hardware factors

Key  Def  Min  Max  Keyname    Name
-c   32   4    64   Cachesize  Cache size (Kbytes)
-k   8    4    16   Linesize   Cache block size (words)
-r   1    1    4    Assoc      Cache associativity degree
-m   80   20   150  MIPS       MIPS
-b   40   20   60   Readtime   Memory read time (nS)
-a   2    1    4    Acctime    Memory access time (cycles)
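Under these assumptions, the worst case refill penalty γ of approach 1 follows directly from the hardware factors of Table II. A minimal sketch (ours), assuming 4-byte words and charging each displaced line the miss penalty defined in subsection 6.2:

    def refill_penalty_us(cachesize_kb=32, linesize_words=8, mips=80,
                          readtime_ns=40, acctime_cycles=2):
        # gamma = (number of cache lines) x (per-miss penalty), where the
        # per-miss penalty is one bus access latency plus the read of a
        # single word (early restart and out-of-order fetch assumed).
        cycle_ns = 1000.0 / mips                 # one instruction per cycle
        lines = cachesize_kb * 1024 // (linesize_words * 4)
        per_line_ns = acctime_cycles * cycle_ns + readtime_ns
        return lines * per_line_ns / 1000.0      # in microseconds

    # Default configuration of Table II: 1024 lines at 65 ns each, i.e.
    # about 66.6 us lost on every full refill.
    print(refill_penalty_us())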
7 Workload model

Real examples of real-time workloads can be found in [28][25][8]. However, the workload used in this paper is inspired by the application-level synthetic benchmark Hartstone [45]. The workload used in the experiment is a set of periodic and independent tasks. The task periods and task loads are configured as a function of a variety of factors (see Table III):
-n Total number of tasks. Depending on the available number of frequencies, the tasks can be clustered in frequencies (more than one task has the same frequency), but the priority is different for each task.
-s Utilization or load distribution among tasks (percentage slope). For high values the load is balanced towards the low frequency tasks.
-h Harmonicity degree, as a percentage of the maximum achievable harmonicity. The maximum harmonicity is obtained when all the task frequencies are harmonic with each other without release offsets.
-q Max/min task frequency ratio: the relation between the lowest and the highest frequency. For example, a value of 260 means that if the lowest frequency is 1 Hz, the highest is 260 Hz.
-g Separation among frequencies. This factor sets the base of the power function used to generate the periods. For example, base 3 (generating the series 3, 9, 27, ...) generates more distant periods than base 2 (2, 4, 8, ...).
-v Frequency (Hz). This factor sets the frequency of the system. All task frequencies are scaled relative to this frequency, which is the frequency of the highest priority task.
Since all these factors allow us to change the periodic nature of the tasks, the procedure that generates the periods has been built to comply with an important constraint: the number of releases must be the same for all the task-sets exercised, independently of the factors mentioned above. This constraint is necessary because the extrinsic interference is highly related to the number of task releases. All tasks' deadlines are equal to their periods, so the Rate Monotonic priority assignment can be applied and is optimal. The default task-set is fully harmonic (i.e. harmonicity factor at 100%). Under these assumptions, a 100% schedulable utilization can be reached under ideal conditions (zero overhead). This allows us to isolate completely the effect of the cache interference from the timeline constraints: whatever the factor settings, it is not possible to configure a task-set unable to reach 100% schedulable utilization (without overheads). Thus the system is penalized by the cache interference by exactly the difference between the resulting maximum schedulable utilization and 100%. Although this approach may seem simplistic, it is sufficient, since some common characteristics of real-time task-sets can be reduced to periodic behavior (aperiodic load, precedence constraints, multi-deadline tasks) and are thus implicitly assumed in the model, or can even be ignored (release offsets) at the abstraction level of the schedulability analysis used.
Table III: Workload factors

Key  Def  Min  Max   Keyname  Full name
-n   20   10   40    NumTask  Number of tasks
-h   100  60   100   Harmty   Harmonicity (%)
-s   0    -90  90    Distr    Utilization distribution (slope)
-q   260  260  2200  Band     Max/Min task frequency ratio
-g   2    2    3     Separat  Separation among frequencies
-v   200  100  1000  Freq     Frequency (Hz)
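A minimal sketch of such a period generator follows (ours; the real procedure additionally enforces the equal-number-of-releases constraint and harmonicity degrees below 100%): frequencies are spaced by powers of the base g down from the system frequency v, clipped to the band q, and tasks are dealt round-robin onto the frequency clusters.

    def generate_periods(n=20, g=2.0, v=200.0, q=260.0):
        # Frequencies v, v/g, v/g^2, ... not lower than v/q; deadlines
        # equal periods, so Rate Monotonic assignment remains optimal.
        freqs = []
        f = v
        while f >= v / q:
            freqs.append(f)
            f /= g
        # cluster the n tasks onto the available frequencies, round robin;
        # clustered tasks share a frequency but keep distinct priorities
        return [1.0 / freqs[i % len(freqs)] for i in range(n)]

    periods = generate_periods()    # 20 periods over 9 frequency clusters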
8 Variation explained by the factors

We use the experimental 2^k factorial design [14] to evaluate the variation in the schedulable utilization introduced by each factor (and by each combination of factors). This method determines the change in the system response (here, the schedulable utilization) when a factor is varied. Instead of evaluating one factor while the others are kept at an average value, the method exercises all possible combinations of levels of the factors (two levels, thus 2^k combinations) and extracts from the results the portion of variation explained by each factor, as a percentage. Notice that the results are unsigned: we cannot deduce from this experiment whether the relation between a factor and the result is direct or inverse. We used the Min and Max levels shown in Table II and Table III. For clarity, only the factors explaining more than 2% of the variation are shown in the graphs. Combinations of factors are plotted as concatenations of keys (e.g. cb means cache size and memory read time; see the tables). To interpret the following graphs properly, and as a design guideline: the larger the influence of a factor, the more sensitive the system is to that factor, so real-time system designers have to pay special attention to the important factors. For example, using CRTA, the frequency of the system has a strong effect on performance (see Figure 3). Notice that the percentage of variation explained by each factor is relative to the other factors in the same experiment (same figure); comparing percentage values across different graphs is not correct. Furthermore, the results are neither general nor absolute: they are representative of the values used for each factor and have to be judged with those values in mind.
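For two levels per factor, the 2^k design reduces to a sign table [14]. The sketch below (ours) computes the percentage of variation explained by every factor and interaction from the 2^k measured responses; the two-factor input values are hypothetical:

    from itertools import combinations, product
    from math import prod

    def variation_explained(factors, response):
        # response maps a tuple of levels in {-1, +1} (one per factor, in
        # order) to the measured schedulable utilization. The variation
        # explained by an effect is proportional to its squared estimate.
        k = len(factors)
        runs = list(product((-1, 1), repeat=k))
        squares = {}
        for size in range(1, k + 1):
            for combo in combinations(range(k), size):
                effect = sum(response[r] * prod(r[i] for i in combo)
                             for r in runs) / len(runs)
                squares["".join(factors[i] for i in combo)] = effect ** 2
        total = sum(squares.values())
        return {name: 100 * s / total for name, s in squares.items()}

    # Hypothetical example with cache size (c) and frequency (v):
    u = {(-1, -1): 60.0, (1, -1): 50.0, (-1, 1): 40.0, (1, 1): 10.0}
    print(variation_explained(["c", "v"], u))   # c, v and interaction cv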
8.1 Combined hardware and workload factors

As can be seen in Figure 3 and Figure 4, cache size plays an important role in both schemes. Using CRTA, it is directly related to the refill cost paid for every preemption (extrinsic cache behavior). In the partitioning scheme (PART) it is related to the intrinsic interference. Note that for the former the effect is negative as cache size increases, while for the latter it is positive (a larger cache means larger partitions, and thus better intrinsic behavior). Since partitioning completely voids the inter-task interference, that scheme is only affected by the intrinsic behavior of the cache. The most important contrast between the two approaches is clearly the effect of the system frequency: while it is noticeable for CRTA, it is insignificant for partitioning. A higher frequency means more context switches, which negatively affects the useful performance; however, no cost is associated with the context switch in the partitioning scheme, so whether the code executes streamline or scattered over time is unimportant to the performance of that approach. The memory bandwidth is of approximately the same importance for both schemes, because it matters for both the extrinsic and the intrinsic cache behavior. In fact, a technique insensitive to memory bandwidth would be the solution to the worst problem facing computer architects: the memory bottleneck. The number of tasks is moderately relevant to both approaches. For CRTA it contributes to the number of context switches in the system. For the partitioning scheme, the number of tasks jointly with the cache size settles the partition size, and therefore the cache-related speedup of the task code. The cache line size (block size), the access time and the MIPS are associated with the intrinsic behavior of the cache, so their scores are higher for PART; the higher variation explained by the MIPS factor compared to the other two is mainly due to the wide domain considered for it. The interaction between the cache size and the frequency is significant for CRTA (see the cv column in Figure 3).

Figure 3 (all factors using CRTA; percent of variation in the schedulable utilization explained by CacheSize, LineSize, ReadTime, AccTime, NumTask, Freq and the interactions cv, cbv, cnv)

Figure 4 (all factors using PART with RTA as the test; percent of variation explained by CacheSize, LineSize, ReadTime, AccTime, MIPS, NumTask, Distr and the interaction am)
9 Comparing the results to CRMA

In this section we compare the effectiveness of the approach described in this paper to CRMA. As described earlier, CRMA is the cached version of RMA, the utilization-based sufficient analysis for RM schedulers. This comparison provides a view of the benefits of the method presented in this paper over the previously available one. We also present the results obtained when the cache benefits are not accounted for at all: this means very poor streamline code execution, but no inter-task cache interference is accounted either. Note that not considering the cache in the analysis does not imply that the system has no cache; the system can be cached, with its performance benefits simply not used for the critical tasks. Those benefits can be exploited by non-critical tasks under some policy (for example, executing them in background). The plane called RTA in the graphs corresponds to a system analysed by RTA in which no cache is considered. Only the factors shown to be relevant in the previous section are studied here (while the others are kept at their default values). These results show whether the relationship between each factor and the schedulable utilization is direct or inverse. Note that some workload factors, like harmonicity, do not affect the results, by the nature of the analysis.
9.1 Hardware factors
As can be seen in the following graphs, the benefit of using CRTA over CRMA is invariant across different values of the factors; this can be deduced from the parallelism that the two planes exhibit. The difference between them is around 20%, with CRTA performing better. This behavior demonstrates that the level of characterization of the cache is equivalent in both methods; the improvement of CRTA is therefore owed to the exact nature of RTA, in contrast to the pessimistic bound
given by RMA. The merit of CRTA is characterising the cache with the same accuracy as CRMA. When the extrinsic overhead becomes critical, the cached approaches (CRTA and CRMA) may perform even worse than not using the cache at all; this happens approximately when the schedulable utilization falls below 20%, which is the level of the RTA plane. Since CRMA is penalized by the pessimistic bound inherited from RMA, the utilization guaranteed by that method falls below plain RTA when the cache is larger than approximately 40 Kb. CRTA, in contrast, performs better than RTA over almost the entire range considered. For the extreme values, the cached approaches even reach zero, due to the extremely high extrinsic overhead. Refilling schemes (CRMA and CRTA) are usually constrained by large caches, owing to the higher associated penalty; for very small caches, however, the intrinsic interference becomes dominant. This trend can be observed in any of the figures at around 4 Kb cache size. Figure 7 shows two interesting effects: for high processor power (MIPS factor), small caches cannot supply the bandwidth requested by the processor; inversely, a very large cache does not help when processor power is low, since the extrinsic penalty becomes dominant over the intrinsic benefit of having a large cache. Using large caches with slow memories is not recommended for refilling schemes (see Figure 6).
Figure 5 (schedulable utilization %: CRTA, CRMA and RTA planes over cache size and memory access time)

Figure 6 (schedulable utilization %: CRTA, CRMA and RTA planes over cache size and memory read time)

Figure 7 (schedulable utilization %: CRTA, CRMA and RTA planes over cache size and MIPS)

9.2 Workload factors
Figure 8 shows the relation between the cache size and the number of tasks. Systems comprising a large number of tasks are better suited to a processor with a small cache; larger caches can be used only if the system has a lower number of tasks. Note that CRTA can be used for a wide range of cache size and number-of-tasks combinations, since it provides greater performance than not using the cache at all (however, as will be seen in subsection 10.2, partitioning will be more appropriate for certain configurations).

Figure 8 (schedulable utilization %: CRTA, CRMA and RTA planes over cache size and number of tasks)

Figure 9 and Figure 10 present evidence of the quick loss of utilization as the system frequency increases. This is especially true for larger caches: the penalty associated with a large cache is not affordable at higher frequencies, where context switches are abundant. The frequency versus number of tasks trade-off is presented in Figure 9. It shows that the number of tasks is not as important a factor as might be suspected: even for 40 tasks at 300 Hz, CRTA presents benefits over the other techniques. The main conclusion from the graphs discussed in this section is that even though the exact analysis presents a substantial improvement over CRMA, it still lacks effectiveness in dealing with very high frequencies. Because the improvement of CRTA is owed to its exact nature, the extrinsic interference model remains the same as in CRMA. In spite of this, CRTA is appropriate for frequencies below 300 Hz, which can in fact accommodate a wide variety of applications. Moreover, beyond this frequency, other techniques like hardware-software co-synthesis are better solutions than conventional processors alone.

Figure 9 (schedulable utilization %: CRTA, CRMA and RTA planes over frequency and number of tasks)

Figure 10 (schedulable utilization %: CRTA, CRMA and RTA planes over frequency and cache size)
10 Time-domain versus space-domain solutions
Previous work [9] compared CRMA to partitioning for a variety of factors. Since CRTA is an improvement over CRMA, a new comparison is needed to gain a perspective on the competitiveness of the time-oriented method (now represented by CRTA instead of CRMA). Notice that, since the effectiveness of CRTA is greater than that of CRMA, the maximum cache size considered is now 64 Kb instead of the 32 Kb used in the previous study.
10.1 Hardware factors

Cache size plays an important role in these comparisons. Cache partitioning performs better as cache size increases, while the cache refilling scheme (CRTA in this case) performs worse. The intersection of the two planes forms a clear frontier that can be used as a design guideline: depending on the factor values (which correspond to a fixed point in the plane), one of the two schemes is clearly better than the other. For example, if the system is based on a processor with a 40 Kb instruction cache, the proper choice (partitioning or not) depends on the memory read time (see Figure 12), the MIPS (see Figure 13) and other factors. For extreme cache sizes the choice is clearer: refilling is appropriate for small caches, as partitioning is for large ones. Cache partitioning suffers from intrinsic interference because the partitions assigned to tasks are much smaller than the entire cache. This trend can be observed in Figure 13, where the maximum schedulable utilization decreases as a consequence of increasing processor performance (MIPS). The MIPS delivered by the processor are directly related to the intrinsic interference: the
more instructions issued, the more cache is needed to face the higher memory demand; if the cache size is kept constant, the schedulable utilization decreases as an effect of the interference. The effect of the memory read time is more important for the cache refill scheme: the plane is steeper across this axis than it is for partitioning (see Figure 12). The cause is that the cache refilling cost is highly dependent on the memory bandwidth; in subsection 8.1, Figures 3 and 4 showed that cache refilling schemes present a higher sensitivity to the memory read time than partitioning ones. The effect of the memory access time factor is, however, different. Figure 11 shows that the partitioning scheme is more sensitive to it, because this factor has a higher weight in the intrinsic behavior: each cache miss includes one memory access cycle, while in the cache refill scheme the same access latency is paid once for a whole line transfer. Notice as well the twisted baselines for partitioning in Figure 12: the memory read time factor has a greater effect for larger cache sizes. In conclusion, if a system uses a large cache it is important to pay attention to the memory bandwidth; in other words, faster memory chips produce more benefit when used with larger caches.
Figure 11 (schedulable utilization %: CRTA and PART planes over cache size and memory access time)

Figure 12 (schedulable utilization %: CRTA and PART planes over cache size and memory read time)

Figure 13 (schedulable utilization %: CRTA and PART planes over cache size and MIPS)

10.2 Workload factors

The figures confirm the previous results obtained in section 8. The cache partitioning scheme is quite independent of the workload factors, while the refilling scheme is highly dependent on both the frequency and the number of tasks. The effect of the number of tasks on the cache partitioning scheme depends on the cache size. The twisting shape of the baselines in Figure 14 shows that the slope varies along the plane: for an 8 Kb cache, the schedulable utilization is more sensitive to the cache size; for a 64 Kb cache, it is more sensitive to the number of tasks. However, none of these variations is as significant as the variation produced by the refill scheme. The schedulable utilization delivered by the cache partitioning approach is quite constant across the plane, though somewhat low. Figure 15 and Figure 16 show that the refilling scheme outperforms partitioning for lower frequencies, while the partitioning scheme performs rather constantly along the plane.

Figure 14 (schedulable utilization %: CRTA and PART planes over cache size and number of tasks)
Figure 15 (schedulable utilization %: CRTA and PART planes over frequency and number of tasks)

Figure 16 (schedulable utilization %: CRTA and PART planes over frequency and cache size)

11 Conclusions

It has been shown how the extrinsic effect of the instruction cache can be incorporated into an exact feasibility test for fixed priority scheduling. This enables the use of caches in preemptive real-time systems. The results obtained with the new scheme present a constant improvement over the CRMA presented earlier. This improvement is mainly owed to the higher utilization bound provided by the exact analysis, since the cache characterization is equivalent. It has been shown that cache size, memory bandwidth, number of tasks and frequency have a strong effect on the cache-refill scheme. On the other hand, the partitioning scheme is highly conditioned by both the cache size and the number of tasks. Since the behaviors of the two approaches are fundamentally converse, for a given system configuration one technique clearly outperforms the other. They are complementary: whatever the system, one of the two techniques will be able to profit from the use of cache memory. The results presented here comprise an important collection of design guidelines for real-time developers: they help the developer choose between refilling and partitioning schemes, and present evidence of the better workload and hardware configurations.

Acknowledgements

The authors thank Dr. Kelvin Nilsen for his seminal ideas underpinning this strand of work. We also thank Sasikumar Punnekkat for reviewing the manuscript. Jose V. Busquets thanks Prof. Alan Burns for hosting his Study Leave at York.

References
[1] A. Agarwal, M. Horowitz and J. Hennessy. "An Analytical Cache Model". ACM Transactions on Computer Systems, Vol. 7, No. 2, pages 184-215, May 1989.
[2] A. Agarwal, J. Hennessy and M. Horowitz. "Cache Performance of Operating System and Multiprogramming Workloads". ACM Transactions on Computer Systems, Vol. 6, No. 4, pages 393-431, November 1988.
[3] R. Arnold, F. Mueller, D. Whalley and M. G. Harmon. "Bounding Worst-Case Instruction Cache Performance". IEEE Real-Time Systems Symposium, pages 172-181, 1994.
[4] N. Audsley, K. W. Tindell and A. Burns. "The End of the Road for Static Cyclic Scheduling". Proceedings of the 5th Euromicro Workshop on Real-Time Systems, pages 36-41, Oulu, Finland, 1993.
[5] F. Baskett. "Keynote Address". International Symposium on Shared Memory Multiprocessing, April 1991.
[6] S. Basumallick and K. D. Nilsen. "Cache Issues in Real-Time Systems". ACM SIGPLAN Workshop on Language, Compiler, and Tool Support for Real-Time Systems, June 1994.
[7] A. Burns. "Preemptive Priority Based Scheduling: An Appropriate Engineering Approach". Advances in Real-Time Systems, pages 225-248, 1994.
[8] A. Burns, A. Wellings, C. Bailey and E. Fyfe. "The Olympus Attitude and Orbital Control System: A Case Study in Hard Real-Time System Design and Implementation". Ada sans frontieres: Proceedings of the 12th Ada-Europe Conference, Lecture Notes in Computer Science 688, pages 19-35, 1993.
[9] J. V. Busquets-Mataix and J. J. Serrano-Martin. "The Impact of Extrinsic Cache Performance on Predictability of Real-Time Systems". Proceedings of the Second International Workshop on Real-Time Computing Systems and Applications, Tokyo, Japan, October 1995.
[10] P. K. Harter. "Response Times in Level Structured Systems". Technical Report CU-CS-269-84, Department of Computer Science, 1984.
[11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990.
[12] L. Higbie. "Quick and Easy Cache Performance Analysis". ACM Computer Architecture News, Vol. 12, No. 2, pages 33-44, June 1990.
[13] M. D. Hill and A. J. Smith. "Evaluating Associativity in CPU Caches". IEEE Transactions on Computers, Vol. 38, No. 12, pages 1612-1630, December 1989.
[14] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, 1991.
[15] M. Joseph and P. Pandya. "Finding Response Times in a Real-Time System". The Computer Journal (British Computer Society), Vol. 29, No. 5, pages 390-395, 1986.
[16] D. B. Kirk. "Process Dependent Static Partitioning for Real-Time Systems". Proceedings of the IEEE Real-Time Systems Symposium, pages 181-190, 1988.
[17] D. B. Kirk and J. Strosnider. "SMART (Strategic Memory Allocation for Real-Time) Cache Design Using the MIPS R3000". Proceedings of the IEEE Real-Time Systems Symposium, pages 322-330, 1990.
[18] D. B. Kirk. "Allocating SMART Cache Segments for Schedulability". Foundations of Real-Time Computing: Scheduling and Resource Management, pages 251-275, 1991.
[19] D. B. Kirk. "SMART (Strategic Memory Allocation for Real-Time) Cache Design". Proceedings of the IEEE Real-Time Systems Symposium, pages 229-237, December 1989.
[20] J. Y. T. Leung and M. L. Merrill. "A Note on Preemptive Scheduling of Periodic, Real-Time Tasks". Information Processing Letters, Vol. 11, No. 3, pages 115-118, 1980.
[21] S. Lim et al. "An Accurate Worst Case Timing Analysis Technique for RISC Processors". IEEE Real-Time Systems Symposium, pages 97-108, 1994.
[22] T. H. Lin and W. S. Liou. "Using Cache to Improve Task Scheduling in Hard Real-Time Systems". IEEE Workshop on Architecture Support for Real-Time Systems, pages 81-85, December 1991.
[23] J. Liu and H. Lee. "Deterministic Upperbounds of the Worst-Case Execution Time of Cached Programs". IEEE Real-Time Systems Symposium, pages 182-191, 1994.
[24] C. Liu and J. W. Layland. "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment". Journal of the ACM, Vol. 20, No. 1, pages 46-61, 1973.
[25] C. D. Locke, D. R. Vogel and T. J. Mesler. "Building a Predictable Avionics Platform in Ada: A Case Study". IEEE Real-Time Systems Symposium, pages 181-189, 1991.
[26] A. M. G. Maynard, C. M. Donnelly and B. R. Olszewski. "Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads". Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 145-156, San Jose, CA, October 1994.
[27] J. C. Mogul and A. Borg. "The Effect of Context Switches on Cache Performance". Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 75-84, Santa Clara, CA, 1991.
[28] J. Molini, S. Maimon and P. Watson. "Real-Time Scenarios". IEEE Real-Time Systems Symposium, pages 214-225, 1990.
[29] F. Mueller, D. B. Whalley and M. Harmon. "Predicting Instruction Cache Behavior". Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1994.
[30] F. Mueller. "Compiler Support for Software-Based Cache Partitioning". Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[31] D. Niehaus, E. Nahum and J. A. Stankovic. "Predictable Real-Time Caching in the Spring System". Workshop on Systems and Software, Atlanta, 1991.
[32] K. D. Nilsen. "Keynote". Symposium on Real-Time Systems and Applications, Chicago, Illinois, May 1995.
[33] K. Nilsen. "Real-Time is no Longer a Small Specialized Niche". Pending publication, 1995.
[34] S. Przybylski, M. Horowitz and J. Hennessy. "Performance Tradeoffs in Cache Design". Proceedings of the 15th International Symposium on Computer Architecture, pages 290-298, May 1988.
[35] R. W. Quong. "Expected I-Cache Miss Rates via the Gap Model". Proceedings of the 21st International Symposium on Computer Architecture, pages 372-383, 1994.
[36] R. Rajkumar, L. Sha, J. P. Lehoczky and K. Ramamritham. "An Optimal Priority Inheritance Protocol for Real-Time Synchronization". COINS Technical Report 88-98, October 1988.
[37] N. C. Audsley, A. Burns, M. F. Richardson and A. J. Wellings. "The STRESS Hard Real-Time System Simulator". Software-Practice and Experience, Vol. 24, No. 6, pages 543-564, June 1994.
[38] L. Sha and J. B. Goodenough. "Real-Time Scheduling Theory and Ada". IEEE Computer, pages 53-62, April 1990.
[39] L. Sha, R. Rajkumar and J. P. Lehoczky. "Priority Inheritance Protocols: An Approach to Real-Time Synchronization". IEEE Transactions on Computers, Vol. 39, No. 9, pages 1175-1185, September 1990.
[40] J. Singh, H. S. Stone and D. F. Thiebaut. "An Analytical Model for Fully Associative Cache Memories". IEEE Transactions on Computers, Vol. 41, No. 7, pages 811-825, July 1992.
[41] A. J. Smith. "Line (Block) Size Choice for CPU Cache Memories". IEEE Transactions on Computers, Vol. 36, No. 9, pages 1063-1075, September 1987.
[42] H. S. Stone. High-Performance Computer Architecture. Pages 24-119, Addison-Wesley, June 1993.
[43] D. F. Thiebaut and H. S. Stone. "Footprints in the Cache". ACM Transactions on Computer Systems, Vol. 5, No. 4, pages 305-329, November 1987.
[44] D. F. Thiebaut. "On the Fractal Dimension of Computer Programs and its Application to the Prediction of the Cache Miss Ratio". IEEE Transactions on Computers, Vol. 38, No. 7, pages 1012-1026, July 1989.
[45] N. H. Weiderman and N. I. Kamenoff. "Hartstone Uniprocessor Benchmark: Definitions and Experiments for Real-Time Systems". The Journal of Real-Time Systems, No. 4, pages 353-382, 1992.
[46] A. Wolfe. "Software-Based Cache Partitioning for Real-Time Applications". Proceedings of the Third International Workshop on Responsive Computer Systems, September 1993.