Cache and Pipeline Sensitive Fixed Priority Scheduling for Preemptive Real-Time Systems

Jörn Schneider
Dept. of Computer Science, Saarland University
Postfach 15 11 50, D-66041 Saarbrücken, Germany
[email protected]

Abstract Current schedulability analyses for preemptive systems consider cache behaviour by adding preemption caused cache reload costs. Thereby, they ignore the fact that delays due to cache misses often have a reduced impact because of pipeline effects. In this paper, these methods are called isolated. Pipeline-related preemption costs are not considered at all in current schedulability analyses. This paper presents two cache and pipeline sensitive response time analysis methods for fixed priority preemptive scheduling. The first is an isolated method. The second method incorporates the preemption caused cache costs into the Worst-Case Execution Time (WCET) of the preempted task. This allows for the compensation of delays due to cache misses by pipeline effects. It is shown that the applicability of isolated approaches is limited to a certain class of CPUs. Practical experiments are used to compare both methods.

1. Introduction Schedulability analysis for real-time systems becomes harder when using modern hardware. On the one hand, microarchitectural features like caches and pipelines with parallel functional units increase the processor throughput. On the other hand, their behaviour is very hard to predict. If a response time guarantee is to be given, it may well be that much of the speed gain is swallowed by pessimistic assumptions. Research on real-time schedulability analysis has already started to focus on this problem. But while promising solutions for the estimation of Worst-Case Execution Times (WCETs) exist [6, 7, 17, 19], there is still a gap between the accuracy of WCET results and the ability to benefit from them in schedulability analysis.

 Partly supported by DFG (German Research Foundation), Transferbereich 14

When a task is preempted, its cache and pipeline behaviour in general changes. Cache entries loaded by the preempted task are displaced by the preempting task. The overlapping execution of instructions in the pipeline is cut off at the preemption point. The resulting increases of the execution time have to be considered in real-time schedulability analysis (unless the WCET was estimated in a very pessimistic way without considering cache and pipeline behaviour).

Current schedulability analyses incorporate cache-related preemption costs by adding fixed cache reload costs for each predicted cache miss. Thereby, the interaction between cache and pipeline behaviour is neglected. Therefore, throughout the paper these methods are referred to as isolated methods. They ignore the fact that delays due to cache misses often have a significantly reduced impact because of two different classes of pipeline effects. First, cache misses often coincide with pipeline stalls and thus increase the execution time only partially or not at all (during a pipeline stall the execution of the following instructions is stopped anyway, so a cache miss does no additional harm). Second, many instruction-cache misses have no impact because modern pipelines use instruction buffers (prefetch queues) that are loaded from memory ahead of time.

This paper presents two cache and pipeline sensitive response time analysis (RTA) methods for fixed priority preemptive scheduling. The first method is an isolated method. It determines the cache-related preemption costs as the reload costs for the intersection of cache lines displaced by the preempting task and cache lines used by the preempted task. Pipeline-related preemption costs are also individually determined for each pair of preempting and preempted tasks. The second method incorporates the cache-related preemption costs into the WCET of the preempted task.
This allows for the compensation of delays due to cache misses by the pipeline effects described above. To distinguish this method from isolated methods, it is called the integrated method in this paper. The pipeline-related preemption costs can be determined as in the first method. However, as shown later, this may restrict the type of supported processor architectures. Therefore, an alternative approach to incorporate pipeline-related preemption costs is described.

The structure of this paper is as follows. The next section gives a brief discussion of related work. Section 3 describes the presented schedulability analyses. First the used system model is given. Thereafter an overview of the involved analyses is given before the cache interference analysis and the pipeline preemption analysis are explained. This is followed by a description of possible limitations of the pipeline preemption analysis and the above mentioned alternative approach for incorporating pipeline-related preemption costs as a workaround. After that the common aspects of the two RTA-methods are presented. The first is an improved isolated method that considers cache-related as well as pipeline-related preemption costs. Subsection 3.6.2 shows that isolated methods suffer from some inherent restrictions, specifically that their applicability is limited to a certain class of CPUs. The second RTA-method is described in subsection 3.6.3. This integrated method incorporates cache-related preemption costs into the WCET of the preempted task. Thereafter a short theoretical comparison of the two described methods is presented. In section 4 the presented methods are compared with a simpler isolated method in practical experiments. The last section provides conclusions and describes future work.

2. Related Work The available work in schedulability analysis uses different ways to support advanced microarchitectural features of CPUs. There are approaches that limit the possible preemption points [12], or that are restricted to a fixed order of tasks and try to find an optimal order [8, 15]. It is not possible to combine these approaches with commonly used real-time operating systems. Other approaches restrict the abilities of microarchitectural features, e. g. by cache partitioning [9]. There is also work that incorporates cache behaviour into fixed priority schedulability analysis for preemptive systems [2, 4, 11]. However, all these approaches consider cache-related preemption costs in isolation from pipeline effects (if pipeline effects are considered at all).

Basumallick and Nilsen [2] describe an extension of the well-known utilization based rate monotonic analysis (RMA) of Liu and Layland [13] that considers cache-related preemption costs. The WCET of a task is increased by the worst-case cache refill costs that this task may impose on any preempted task. These additional costs can be either the size of the cache or the size of the memory area occupied by the preempted task.

In [4], Busquets-Mataix et al. show that the RMA-method is worse than a comparable RTA-method because of its pessimistic assumptions on the schedulable workload. The CRTA (cached version of RTA) [4] adds cache reload costs for each preemption of a task. The authors mention five ways to estimate this cache-related interference. Their CRTA supports two of these five ways: either the costs for refilling the entire cache or the time to refill all cache lines possibly displaced by the preempting task can be used.

Lee et al. [11] present a sophisticated isolated approach. For each program point of each task an upper bound of the worst-case cache-related preemption costs at this program point is calculated. These results are used in a second step to derive an upper bound of the overall worst-case cache-related preemption costs during the response time of the preempted task. The second step uses a linear programming technique that executes in each iteration of the response time analysis. As in the isolated approach of the present paper, the relationship between the preempting task and the set of tasks that is possibly running when the preemption occurs is considered. Additionally, the phasing of tasks is considered.

All three approaches [2, 4, 11] are isolated methods that do not consider pipeline-related preemption costs. In contrast to them, the isolated method presented in this paper considers pipeline-related preemption costs. No isolated approach is able to take into account the overlapping of preemption caused cache misses and pipeline effects. The integrated method presented here considers pipeline-related preemption costs and overcomes this drawback of all isolated approaches.

3. Schedulability Analysis 3.1. System Model This work considers reactive systems with periodic and sporadic tasks. Periodic tasks arrive repeatedly after a fixed period. Arrivals of sporadic tasks are irregular, but it is assumed that they are separated by a minimum inter-arrival time. The scope of this work does not (yet) include soft real-time processes. Therefore, every task has a hard deadline. The deadlines are less than or equal to the periods and minimum inter-arrival times, respectively. If tasks are allowed to communicate (via shared resources), the immediate ceiling priority protocol (ICPP) [3] has to be used.

3.1.1 Scheduling Scheme A fixed priority scheduling scheme where the priority is independent of the WCET (e. g. rate monotonic or deadline monotonic priority assignment) is assumed. Each task τ_i has a fixed unique base priority i; hp(τ_i) = {τ_j | j < i} and lp(τ_i) = {τ_l | l > i} are the sets of higher and lower priority tasks, respectively, i. e. smaller indices mean higher priorities. Let b(τ_i) = {τ_l | τ_l can block τ_i} be the set of tasks that are able to block τ_i. The immediate ceiling priority protocol (ICPP) implies that each resource has a fixed ceiling value that is the maximum priority of all tasks using it. Furthermore, a task has a dynamic priority that is the maximum of its base priority and the ceiling values of all resources it has currently locked. For the remainder of the paper only the fixed base priorities are used; the dynamic priorities can be ignored for our purpose. The ICPP guarantees that a task is blocked only once per execution and that this can happen only before the actual execution of the task. As a consequence there are no additional context switches due to blocking. However, for each critical section (a section which possibly blocks the execution of higher priority tasks) an additional scheduler execution occurs (see below). For further information on priority ceiling protocols and blocking behaviour see [3, 14]. Throughout the paper τ_i always has a lower priority than τ_j (j is always less than i), i. e. τ_i is usually the preempted task and τ_j is usually a preempting task.

3.1.2 Scheduler An event driven scheduler is assumed, but it is also possible to extend this work to the usage of tick schedulers (clock driven). Every preemption is done by the scheduler. The scheduler (dispatcher) τ_d is executed whenever a task arrives (at the start of each period, or when a sporadic request occurs), when a task suspends itself (when the task has finished work for the current invocation), and when a task leaves a critical section, to make sure that a possibly blocked task is executed immediately. The scheduler disables all kinds of interrupts while running, because an uninterruptible scheduler significantly simplifies the schedulability analysis. This means that any notification of a task arrival is delayed while the scheduler runs.
Therefore, schedulability analysis has to account for one additional scheduler WCET as release jitter.

3.2. Analyses Overview Many different analyses are involved in finding a schedule for a given task set. First of all, each individual task is analysed for its WCET. This involves three analyses. A cache analysis [5, 6, 7] computes a safe estimation of the behaviour of instruction- and data-caches. The result of the cache analysis step is a classification of each memory reference in its execution contexts as always miss (am), always hit (ah), or not classified (nc). The classifications are used by a pipeline analysis [17, 18], which supports pipelines with dynamic behaviour, like in superscalar processors. The pipeline analysis results are the numbers of clock cycles needed by each instruction to enter the pipeline, and the number of clock cycles needed by exit instructions (possible last instructions of that task) to leave the pipeline (pipeline flush costs). For nc-classified memory references the pipeline analyser chooses whichever is worse, cache miss or cache hit. Note that it has been the former choice for all pipeline analysers and CPUs considered so far. Based on the pipeline analysis results, the estimated WCET of each task is computed by a path analysis [19].

The next step is the cache interference analysis. It takes a set of competing tasks as input and computes a maximum set of conflicting cache sets, i. e. cache sets where cache misses occur because too many memory blocks map to them. A further step is the pipeline preemption analysis, which uses the results of the cache interference analysis and computes the worst-case costs for all possible interruptions of the pipelined execution of each task. For the integrated RTA-method another step is necessary before the actual RTA: a further pipeline analysis and a subsequent path analysis for each task, to determine the WCET including cache-related preemption costs. Now the (isolated or integrated) response time analysis uses the former results to find a schedule.

3.3. Cache Interference Analysis Cache misses due to preemptions come from displacements of cache lines. Displacements occur only in conflicting cache sets. A cache set is a conflicting cache set if more different memory blocks mapped to it are referenced than cache lines fit into the set. The cache interference analysis uses the results of the cache analysis to compute all possibly conflicting cache sets for a given set of interfering tasks.

The conflicting cache sets are computed as follows. Let M be the set of all memory blocks and S be the set of all cache sets (a cache set contains one or more cache lines, depending on the level of associativity). Let φ: M → S be the cache controller dependent mapping of memory blocks to cache sets. Let M′ ⊆ M be a set of memory blocks of concurrent tasks. The function σ: 2^M → 2^(S×ℕ) returns the cache sets used by these tasks and their maximum number of occupied cache lines:

    σ(M′) = {(s, u) ∈ S × ℕ | M″ = {m ∈ M′ | φ(m) = s}, u = |M″|}    (1)

Let A be the level of associativity of the cache, and Σ = σ(M′). The function χ: 2^(S×ℕ) → 2^S returns the set of conflicting cache sets in Σ:

    χ(Σ) = {s ∈ S | (s, u) ∈ Σ ∧ u > A}    (2)

Note that references to cache lines due to shared memory are not counted more than once. However, these references cannot be completely ignored: displacements can occur as soon as a conflicting cache set exists. Even if the new cache entry created by the preempting task is useful for the preempted task, this does not necessarily compensate for a later reference to a possibly displaced memory block.

The cache interference analysis is used in different ways by the isolated and integrated schedulability analysis presented later. The set of considered tasks (memory blocks) and the usage of information drawn from the results differ. For the isolated approach only the cardinality of the set returned by χ is important. The integrated approach uses the delivered information to update the classifications of the cache analysis before they are used by the second pipeline analysis step. Both approaches consider cache-related preemption costs only for memory references that are classified as cache hits by the cache analysis. Let γ(τ) ⊆ M be the set of memory blocks possibly referenced by the task τ. The function h_l(m) returns true for an m ∈ γ(τ_l) iff at least one of the references to memory block m in τ_l is classified as cache hit.

The cache interference analysis can be used for any cache replacement policy. However, in the description of the isolated and the integrated RTA-method below, a least-recently-used cache replacement policy is assumed. Thereby, lower priority tasks need not be considered when determining the conflicting cache sets. Nevertheless, when considering all lower priority tasks for the conflicting cache sets, any cache replacement policy can be used with the cache interference analysis. Of course the cache analysis would have to support the same cache replacement policy.
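As a concrete illustration, the block-to-set mapping, the usage function of Equation (1), and the conflict function of Equation (2) can be sketched in a few lines of Python. The cache geometry and task footprints below are hypothetical:

```python
# Minimal sketch of the cache interference analysis (Equations 1 and 2).
# Hypothetical geometry: NUM_SETS cache sets, associativity A; memory
# blocks are plain integers.

NUM_SETS = 4
A = 2

def phi(m):
    """Cache-controller mapping of memory block m to its cache set."""
    return m % NUM_SETS

def sigma(blocks):
    """Equation (1): the cache sets used by the given memory blocks,
    paired with the number of distinct blocks mapped to each set."""
    usage = {}
    for m in set(blocks):
        usage[phi(m)] = usage.get(phi(m), 0) + 1
    return set(usage.items())

def chi(usage):
    """Equation (2): sets where more blocks compete than lines fit."""
    return {s for (s, u) in usage if u > A}

# Two hypothetical tasks sharing the cache:
blocks_i = {0, 1, 4, 8}     # memory blocks of tau_i
blocks_j = {5, 9, 12, 13}   # memory blocks of tau_j
conflicts = chi(sigma(blocks_i | blocks_j))   # conflicts == {0, 1}
```

In this example, cache sets 0 and 1 each receive four competing blocks while only A = 2 lines fit, so both are reported as conflicting.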

3.4. Pipeline Preemption Analysis In contrast to a task-by-task pipeline analysis, the analysis in a preemptive scheduling environment has to deal with control flow potentially leaving and re-entering a task at any program point (program instruction). It is impractical to consider every possible (re-)entry pipeline state at each possible (re-)entry program point (in general each program point). The proposed solution is best explained by an example. Consider the simple preemption situation displayed in Figure 1: a task τ_i runs, (a) an event causes the scheduler τ_d to interrupt τ_i immediately before program point p; (b) τ_d invokes the preempting task τ_j; (c) τ_j finishes and returns control to τ_d; (d) τ_d resumes τ_i at p.

The different input pipeline states of the scheduler depend on the preemption point of τ_i (a) and on the preempting task τ_j (c). The input pipeline state of the scheduler can have an influence on its WCET. To remove this influence it is pretended that each task leaves an empty pipeline state behind, either at its end or when preempted. Therefore, the WCET of each task includes the maximum time needed to drain off (flush) the pipeline at the end of its execution. Furthermore, an upper bound for the pipeline-related preemption costs, F_i, is added to the response time for each preemption of a task. This solution is safe under the assumption that the time needed to flush the pipeline is at least as large as any prolongation of the execution time due to pipeline effects when using seamless execution. I. e. the worst-case additional pipeline delay without the simplifying assumption of a pipeline flush may never be larger than the time needed for the assumed flush.

The entry and re-entry pipeline states of τ_j (b) and τ_i (d) depend on the possible end pipeline states of the scheduler. To remove this dependence, the WCET of the scheduler includes the maximum time needed to let the pipeline drain off (flush) at the end of the scheduler execution. Thereby, the previous analysis result for uninterrupted execution of τ_i can be taken over. This is a conservative approximation provided that an empty pipeline can cause no longer delay than a nonempty one.

The upper bound for the worst-case pipeline flush costs, F_i, can be estimated in two ways. One way is to consider the pipeline states at each program point of the preempted task and determine the worst-case time needed to flush the pipeline at these points. This can easily be done by a modified version of the pipeline analysis described in [17]. The maximum of these results is used as F_i. The other (simpler and maybe coarser) way is to use a CPU dependent overall worst-case constant.

[Figure 1. Pipeline Preemption Analysis example: a timeline showing τ_i being interrupted by the scheduler τ_d at (a), τ_d invoking τ_j at (b), τ_j returning control to τ_d at (c), and τ_d resuming τ_i at (d).]

3.5 Limitations of the Pipeline Preemption Analysis and Workaround The above described solution is safe as long as the following two assumptions hold: (A1) an empty pipeline can cause no greater delays than a nonempty one; (A2) the time needed to flush the pipeline is, at any program point, at least as large as any prolongation of the execution time due to pipeline effects when using seamless execution.

Assumption (A1) can for instance be violated when a CPU uses a prefetch queue. Then it is possible that in the case of uninterrupted execution a cache miss has no impact because of a filled prefetch queue, while an empty pipeline state (and, therefore, an empty prefetch queue) leads to a larger delay due to the same cache miss. This can be circumvented by adding the prefetch queue fill time, i. e. one additional cache miss penalty, to F_i.

Assumption (A2) certainly holds for most pipelines. But one can imagine architectures, especially when considering out-of-order execution [16], that violate this condition. In such a case, an alternative approach is necessary. Three points have to be reconsidered. First, the WCET of the scheduler can be increased by more than the worst-case flush costs. This can either be eliminated by using special instructions at the very beginning of the scheduler code that invalidate the pipeline state artificially, or, if possible, be considered by finding a worst-case CPU dependent constant. Second, the input pipeline states of each task should consist of the possible end pipeline states of the scheduler. Note that in this case it is necessary to make sure that the end pipeline states of the scheduler are independent of its input pipeline states. Third, re-entering a task can increase the WCET of that task further (more than the previously assumed flush costs). The actual increase depends on the re-entry point. Perhaps this unwanted dependency can be eliminated by restoring the exact pipeline state before the preemption. If this is not possible, the end pipeline states of the scheduler have to be considered as additional input pipeline states for each program point.

3.6. Response Time Analysis Two invariants are assumed to make the analysis easier: (I1) the scheduler is never interrupted; (I2) every switch between tasks is done by the scheduler. Because of (I1) no preemption costs for the scheduler exist; instead, release jitter has to be considered in the schedulability analyses. (I2) allows to treat sporadic tasks in the same way as periodic tasks. The schedulability tests are performed through response time analyses, i. e. for each task τ_i they compare the worst-case response time, R_i, with the deadline, D_i, which is less than or equal to the task's period, T_i. For sporadic tasks the minimum inter-arrival time is used as period. The computation of the interference for τ_i in the isolated approach presented below makes it necessary that the R_j of all τ_j ∈ hp(τ_i) are computed before R_i is computed. Therefore, the RTA of the highest priority task is done first. The tests are extensions of a test proposed by Audsley et al. in [1].

Let Γ̂ = {τ_1, …, τ_n} be the set of all tasks (remember that τ_1 has the highest and τ_n the lowest priority). Let δ_i be the number of critical sections of τ_i. If

    ∀ τ_i ∈ Γ̂ : R_i ≤ D_i    (3)

holds, the task set Γ̂ is schedulable. Here R_i is as follows:

    R_i = C_i + B_i + δ_i C_d + I_i + 2C_d    (4)

The response time, R_i, consists of the WCET, C_i, of τ_i itself, the maximum blocking time, B_i, of τ_i (maximum length of critical sections of all tasks that can block τ_i), the additional scheduler run for each critical section, δ_i C_d, the interference by other tasks, I_i, and the initial delay due to the scheduler, 2C_d (two times to consider release jitter as described above). To compute R_i it is necessary to solve a recursive equation, since the number of preemptions (and, thereby, I_i) depends on the response time. The iterative algorithm stops when the response time converges or when R_i exceeds D_i.

The equation for I_i depends on the method used (isolated or integrated). Generally it can be said that I_i covers two sources of interference: costs due to preemptions by higher priority tasks, and interruptions by the scheduler alone due to arrivals of lower priority tasks. Each of these two parts covers direct costs (i. e. costs that increase the execution time of τ_i) and indirect costs (i. e. costs that increase the execution time of tasks that preempt τ_i). The first part always includes the WCET, C_j, of the preempting task, the WCET of the scheduler at the beginning and at the end of the preemption, 2C_d, the additional scheduler run for each critical section of τ_j, δ_j C_d, and the pipeline-related direct or indirect preemption costs, F_max^{j,i}. All tasks that might be interrupted due to the arrival of τ_j during the response time of τ_i can cause pipeline-related preemption costs:

    F_max^{j,i} = max({F_i, …, F_{j+1}, F_{j-1}, …, F_1} ∪ {F_l | τ_l ∈ b(τ_i)})    (5)

Cache-related preemption costs are included either explicitly (isolated approach) or implicitly (integrated approach). Note that for each preemption either direct or indirect pipeline-related costs occur, while cache-related costs may occur as direct and indirect costs together. The second part of I_i always includes the scheduler WCET, C_d, and the worst-case pipeline flush costs:

    F_max^i = max({F_k | τ_k ∈ Γ̂ ∧ k ≤ i} ∪ {F_l | τ_l ∈ b(τ_i)})    (6)
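The two pipeline-cost bounds just defined can be sketched in Python; the flush-cost values F and the blocking set are hypothetical, and tasks are identified by their priority index (1 = highest):

```python
# Sketch of Equations (5) and (6). F[k] is the pipeline flush cost
# bound of task k; b_i is the set of (indices of) tasks able to block
# tau_i. All values are hypothetical.

def f_max_ji(F, i, j, b_i):
    """Eq. (5): bound over all tasks the arrival of tau_j may interrupt
    during the response time of tau_i: tau_1..tau_i except tau_j itself,
    plus the tasks that can block tau_i."""
    return max([F[k] for k in range(1, i + 1) if k != j]
               + [F[l] for l in b_i])

def f_max_i(F, i, b_i):
    """Eq. (6): bound over tau_1..tau_i and the blockers of tau_i."""
    return max([F[k] for k in range(1, i + 1)] + [F[l] for l in b_i])

F = {1: 3, 2: 5, 3: 2, 4: 7}              # hypothetical flush costs
print(f_max_ji(F, i=3, j=2, b_i=set()))   # 3: max of F[1] and F[3]
print(f_max_i(F, i=3, b_i={4}))           # 7: blocker tau_4 dominates
```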

Again, cache-related preemption costs are considered either explicitly or implicitly. Since every context switch is done by the scheduler and the WCET of the scheduler is included in the RTA, the costs for context switches are included in R_i.
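The recursive computation of R_i described above can be sketched as a fixpoint iteration. The interference term I_i is abstracted into a callback, and all task parameters in the example are hypothetical:

```python
import math

# Sketch of the response time iteration for Equation (4):
# R_i = C_i + B_i + delta_i*C_d + I_i(R_i) + 2*C_d,
# iterated until the value converges or exceeds the deadline.

def response_time(C_i, B_i, delta_i, C_d, D_i, interference):
    """interference(R) returns I_i for a candidate response time R."""
    R = C_i + B_i + delta_i * C_d + 2 * C_d   # start without interference
    while True:
        R_next = C_i + B_i + delta_i * C_d + interference(R) + 2 * C_d
        if R_next > D_i:
            return None        # deadline exceeded: not schedulable
        if R_next == R:
            return R           # fixpoint reached
        R = R_next

# Hypothetical case: one higher priority task with WCET 2 and period 10;
# per-preemption overhead folded into a constant 1.
def I(R):
    return math.ceil(R / 10) * (2 + 1)

R = response_time(C_i=4, B_i=0, delta_i=0, C_d=0, D_i=20, interference=I)
# R == 7: the single preemption adds 2 + 1 time units to C_i = 4.
```

The iteration is guaranteed to terminate because the candidate response time never decreases and the loop aborts once it passes D_i.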

3.6.1 Isolated Method This paper assumes a pipeline for which an upper bound of the impact of a cache miss on the execution time of all possible instruction sequences can be guaranteed. The interference is computed as follows:

    I_i = Σ_{j=1}^{i-1} ( ⌈R_i/T_j⌉ (C_j + δ_j C_d + 2C_d + X_{i,j} + X_{i,d} + F_max^{j,i})
                        + Σ_{k=j+1}^{i-1} ⌈R_i/T_k⌉ ⌈R_k/T_j⌉ (X̄_{k,j,i} + X̄_{k,d,i}) )
        + Σ_{l=i+1}^{n} ( ⌈R_i/T_l⌉ (C_d + X_{i,d} + F_max^i)
                        + Σ_{k=1}^{i-1} ⌈R_i/T_k⌉ ⌈R_k/T_l⌉ X̄_{k,d,i} )    (7)

The sum over j covers the interference due to preemptions by higher priority tasks (τ_j). This includes costs because of direct and indirect preemptions. There are costs that are considered for each preemption by τ_j, i. e. ⌈R_i/T_j⌉ times. They consist of the worst-case execution time of τ_j, C_j, the additional scheduler run for each critical section of τ_j, δ_j C_d, the preempting and resuming scheduler run, 2C_d, as well as the cache-related direct preemption costs due to the preempting task τ_j, X_{i,j}, and the scheduler τ_d, X_{i,d}, and the pipeline preemption costs, F_max^{j,i}. The pipeline preemption costs can be either direct or indirect costs. The second part is considered for each preemption of a possibly running task τ_k ∈ hp(τ_i) ∩ lp(τ_j), i. e. ⌈R_i/T_k⌉ ⌈R_k/T_j⌉ times. This part consists of the indirect cache-related preemption costs due to the preempting task, X̄_{k,j,i}, and the scheduler, X̄_{k,d,i}, respectively.

The sum over l contains the costs for interruptions by the scheduler alone due to arrivals of lower priority tasks (τ_l). Besides the WCET of the scheduler, C_d, the cache-related direct preemption costs, X_{i,d}, and the pipeline-related preemption costs, F_max^i, are considered for each arrival of a lower priority task. Again, the pipeline-related costs can be either direct or indirect costs. The second part covers the indirect cache-related costs, X̄_{k,d,i}, due to scheduler runs because of arrivals of lower priority tasks.

X_{i,j}, X_{i,d}, X̄_{k,j,i} and X̄_{k,d,i} are precomputed by the cache interference analysis. How this precomputation is done is explained in the following. Among the tasks that fill cache sets used by τ_i, so that it may eventually come to a displacement of cache lines of τ_i by τ_j, are (besides τ_j) all tasks in {τ_i, …, τ_{j+1}}. Cache lines belonging to lower priority tasks cannot be younger than those of τ_i. Since a least-recently-used cache replacement policy is assumed, they cannot displace cache lines of τ_i.
Note that this holds also in the case of communicating tasks, since the immediate ceiling priority protocol guarantees that blocking times are always before the actual execution of the blocked task. However, the cache-related preemption costs inflicted on tasks in b(τ_i) during blocking periods of τ_i have to be considered somewhere. There are many ways to do so. One of the simplest (but also coarsest) ways is to enhance the set of tasks assumed to displace cache lines of τ_i. Cache lines originating from tasks in hp(τ_j) need not be considered for this set, because they are taken into account when the preemption of τ_i by the corresponding tasks is considered. Therefore, the set of memory blocks used to determine the conflicting cache lines during the cache interference analysis is as follows:

    M_{i,j} = γ(τ_i) ∪ γ(τ_{i-1}) ∪ … ∪ γ(τ_{j+1}) ∪ γ(τ_j) ∪ γ(τ_{l_1}) ∪ … ∪ γ(τ_{l_k})    (8)

where b(τ_i) = {τ_{l_1}, …, τ_{l_k}}. The set of conflicting cache sets is:

    S_{i,j} = {s ∈ S | s ∈ χ(σ(M_{i,j}))}    (9)

For each s ∈ S_{i,j}, the set of memory blocks of τ_j possibly displacing cache lines, γ_j(s), and the set of memory blocks of τ_i whose corresponding cache lines may be displaced, γ_i^ah(s), are computed:

    γ_j(s) = {m ∈ γ(τ_j) | φ(m) = s}    (10)

    γ_i^ah(s) = {m ∈ γ(τ_i) | φ(m) = s ∧ h_i(m)}    (11)

In the worst case, the number of cache misses for each cache set s can be as high as the size of γ_i^ah(s) or γ_j(s), whichever is smaller, but not higher than the cache set size (level of associativity). Therefore, X_{i,j} is precomputed as follows:

    X_{i,j} = P · Σ_{s ∈ S_{i,j}} min(|γ_i^ah(s)|, |γ_j(s)|, A)    (12)

where P is the worst-case penalty for a cache miss. X_{i,d} is computed in a similar way.

For the indirect cache-related preemption costs, only those cache misses are accounted for that are not already included in X_{i,j} or X_{i,d}. The set of conflicting cache sets is:

    S_{k,j} = {s ∈ S | s ∈ χ(σ(M_{k,j}))}    (13)

where M_{k,j} is computed according to Equation 8. The set of displaced memory blocks of τ_k, γ_{k\i}^ah(s), contains no blocks already included in γ_i^ah(s):

    γ_{k\i}^ah(s) = {m ∈ γ(τ_k) | φ(m) = s ∧ h_k(m) ∧ (m ∈ γ(τ_i) → h_i(m) = false)}    (14)

The indirect cache-related preemption costs due to the preempting task, X̄_{k,j,i}, are computed as follows:

    X̄_{k,j,i} = P · Σ_{s ∈ S_{k,j}} min(|γ_{k\i}^ah(s)|, |γ_j(s)|, A)    (15)
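The per-set counting of Equations (10)-(15) can be sketched directly; the footprints, hit classifications, cache geometry, and miss penalty P below are all hypothetical:

```python
# Sketch of the cost precomputation of Equations (10)-(15).

NUM_SETS, A, P = 4, 1, 10   # hypothetical geometry and miss penalty

def phi(m):
    return m % NUM_SETS      # block -> cache set

def in_set(blocks, s):
    return {m for m in blocks if phi(m) == s}

def x_direct(gamma_i, hit_i, gamma_j, conflict_sets):
    """X_{i,j}, Eq. (12): direct costs charged to the preempted tau_i."""
    return P * sum(
        min(len(in_set(gamma_i & hit_i, s)),   # Eq. (11): displaced hits
            len(in_set(gamma_j, s)),           # Eq. (10): displacing blocks
            A)
        for s in conflict_sets)

def x_indirect(gamma_k, hit_k, gamma_i, hit_i, gamma_j, conflict_sets):
    """X-bar_{k,j,i}, Eq. (15): indirect costs, counting only blocks of
    tau_k whose misses are not already charged to tau_i (Eq. 14)."""
    not_charged = {m for m in gamma_k & hit_k
                   if not (m in gamma_i and m in hit_i)}
    return P * sum(
        min(len(in_set(not_charged, s)), len(in_set(gamma_j, s)), A)
        for s in conflict_sets)

# Hypothetical scenario; conflict sets as reported by the cache
# interference analysis:
gamma_i, hit_i = {0, 1, 4}, {0, 4}
gamma_j = {5, 8, 12}
gamma_k, hit_k = {0, 9}, {0, 9}
direct = x_direct(gamma_i, hit_i, gamma_j, {0, 1})                      # 10
indirect = x_indirect(gamma_k, hit_k, gamma_i, hit_i, gamma_j, {0, 1})  # 10
```

In the example, block 0 is shared by τ_i and τ_k; since its hit is already charged to τ_i in the direct costs, the indirect computation excludes it and counts only block 9.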

The indirect cache-related preemption costs due to the scheduler, X̄_{k,d,i}, are computed in a similar way.

3.6.2 Restrictions of Isolated Approaches As mentioned before, overestimations occur because cache misses are often compensated by pipeline effects. Even worse than this is the fact that isolated methods can lead to underestimations when considering processors with dynamic pipeline decisions. As shown by Lundqvist and Stenström [16], processors with out-of-order execution can cause penalties for cache misses which are higher than expected, i. e. more than the transfer time of a memory block into the cache. But this is also true for in-order CPUs, like the SuperSPARC I, that take dynamic decisions, e. g. when grouping instructions for parallel execution [17]. In this case a higher worst-case miss penalty must be assumed (if possible, see below). The pessimism due to ignoring the interaction of cache and pipeline behaviour is thereby increased even further.

To circumvent this, it is not enough to know the mere quantity of cache misses. One must know at which program points the cache misses occur. With this knowledge it would theoretically be possible to selectively compute local effects of a cache miss on the pipeline behaviour. Using this approach one could consider combined cache and pipeline effects and still add the cache-related preemption costs separately during the schedulability analysis. Unfortunately, for processors with dynamic decisions the effects of a cache miss on the pipeline are no longer local. The instruction grouping or (in case of out-of-order execution) instruction scheduling decisions of the CPU are changed, and the pipeline behaviour of many following instructions can, therefore, be influenced. Lundqvist and Stenström showed in [16] that for dynamically scheduled processors, i. e. processors with out-of-order execution, it is possible that constant limits cannot be guaranteed due to a domino effect. Here patchwork does not help.
All potential cache misses must be known during the analysis of the pipeline behaviour [18]. Completely integrating WCET analysis and schedulability analysis does not seem to be a suitable solution for this problem: combining all these hard problems (analysis of arbitrary caches, complex pipelines, program paths, and schedulability) with their mutual influences in one solution space can hardly lead to acceptable computation times for real applications. Instead the USES (Universität des Saarlandes Embedded Systems) group proposes a way with minimal loss of accuracy that keeps the different analyses separated where possible. The integrated approach proposed in the following subsection keeps cache, pipeline, path, and schedulability analysis separated even in the case of pipelines with dynamic decisions.

3.6.3 Integrated Method

The isolated method has two main drawbacks. First, the accounted cache-related preemption costs increase unboundedly with the number of preemptions. This does not correspond to the true cache behaviour: in reality the cache-related preemption costs are bounded by the number of possible reuses of cache entries. Second, the isolated method does not allow for the integration of pipeline behaviour and preemption-caused changes in cache behaviour. Thus no compensation of preemption-caused cache reload costs by pipeline effects is possible. The separation of cache and pipeline behaviour can even lead to the above mentioned underestimations.

The natural way to eliminate these drawbacks is to no longer regard preemption-caused cache reload costs as something separate. The integrated method incorporates the preemption-caused cache reload costs into the WCET of the preempted task. At first sight this might look simplistic, but it is not; the true simplification is to add preemption-caused cache reload costs separately.

The integrated method is implemented in the following way. The cache interference analysis finds, for each task, every memory block that can be replaced in the cache by memory blocks of higher priority tasks. All always hit classifications (ah-classifications) of references to these memory blocks are updated to not classified (nc). The updated cache results are used by a pipeline analysis whose results are fed to a path analysis. The resulting new WCET, C̃_i, includes an upper bound for the cache-related direct preemption costs which is independent of the actual number of preemptions. As a symbolic representation of this procedure, a set of substitutions of classifications of memory references is appended to the interruption-free WCET. Thereby, [class^{ah}_i(m)/nc] means that all ah-classifications of references to m by τ_i are substituted by nc. The set of memory blocks of interfering tasks is as follows:

M′_i = M(τ_1) ∪ ⋯ ∪ M(τ_{i−1}) ∪ M(τ_d)   (16)

This is enhanced by M(τ_i) to compute the conflicting cache sets:

M_i = M(τ_i) ∪ M′_i   (17)
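As an illustration, the construction of M′_i and M_i above and the subsequent downgrade of ah-classifications can be sketched in a few lines. This is a hypothetical model (task indices, block ids, and the direct-mapped set function φ(m) = m mod number-of-sets are assumptions for illustration, not the analysis framework itself):

```python
def conflicting_blocks(task_blocks, i, d, num_sets):
    """task_blocks[k] is the set of memory block ids referenced by task k;
    phi(m) = m % num_sets models the set function of a direct-mapped cache."""
    # M'_i: blocks of all higher priority tasks and of the scheduler, cf. (16)
    m_prime = set().union(*(task_blocks[j] for j in range(1, i))) | task_blocks[d]
    # M_i: enhanced by the blocks of task i itself, cf. (17)
    m_i = task_blocks[i] | m_prime

    phi = lambda m: m % num_sets
    conflict_sets = {phi(m) for m in m_prime}
    # blocks of task i whose "always hit" classification must be
    # downgraded to "not classified" (nc) before the pipeline analysis
    return {m for m in task_blocks[i] if phi(m) in conflict_sets}
```

For example, with eight cache sets, a scheduler (task 0) touching block 0 and a higher priority task touching blocks 8 and 9, the blocks 1 and 16 of task 2 map to conflicting sets and lose their ah-classification.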

C̃_i is computed as follows:

C̃_i = C_i {[class^{ah}_i(m)/nc] | m ∈ M(τ_i) ∧ φ(m) ∈ φ(M_i) ∧ ∃m′ ∈ M′_i : φ(m′) = φ(m)}   (18)

The last line of the equation for C̃_i guarantees that cache misses due to the intrinsic cache behaviour of τ_i are ignored. The interference I_i is computed as follows:

I_i = Σ_{j=1}^{i−1} ⌈R_i/T_j⌉ (C̃_j + δC_d + 2C_d + F^{j,i}_max) + Σ_{l=i+1}^{n} ⌈R_i/T_l⌉ (C_d + F^{i}_max)   (19)

The costs due to preemptions by higher priority tasks are covered by the first sum; disruptions by the scheduler alone, due to arrivals of lower priority tasks, appear in the second sum of I_i.
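A response time R_i satisfying R_i = C̃_i + I_i(R_i) can be obtained by the usual fixed-point iteration. The sketch below uses a simplified interference term in the shape of eq. (19), with the δC_d term omitted and F treated as a constant bound (as in the experiments of section 4); all parameter names are illustrative assumptions:

```python
import math

def response_time(i, C, T, C_d, F, deadline, n):
    """Iterate R = C_i + I_i(R) until convergence or until the deadline
    is exceeded. Tasks 1..i-1 have higher priority, i+1..n lower.
    C[i] is the (inflated) WCET, T[j] the period of task j."""
    r = C[i]
    while True:
        # preemptions by higher priority tasks: WCET + two scheduler
        # activations + pipeline-related preemption cost per preemption
        interference = sum(math.ceil(r / T[j]) * (C[j] + 2 * C_d + F)
                           for j in range(1, i))
        # arrivals of lower priority tasks: scheduler disruption only
        interference += sum(math.ceil(r / T[l]) * (C_d + F)
                            for l in range(i + 1, n + 1))
        r_new = C[i] + interference
        if r_new > deadline:
            return None       # RTA terminated: deadline exceeded
        if r_new == r:
            return r          # fixed point reached
        r = r_new
```

For instance, a task with WCET 40 preempted by one higher priority task (WCET 10, period 100) with C_d = 1 and F = 2 converges to a response time of 54.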

3.7. Comparison of Isolated and Integrated Method

As explained in subsection 3.6.2, the applicability of isolated approaches is limited. Furthermore, the isolated method ignores the fact that the number of reuses of a cache line can be less than the number of preemptions. Therefore, overestimations occur when the number of preemption-caused displacements of a cache line exceeds the number of its reuses. A further source of overestimation is that no overlapping of preemption-caused cache misses with pipeline effects is considered; this holds for all isolated approaches. The presented integrated approach has none of these drawbacks. Instead, it comes with a different disadvantage: overestimations occur if the number of reuses of displaced cache lines exceeds the number of preemptions.

4. Practical experiments

In this section the two presented cache and pipeline sensitive RTA methods are compared for a sample task set. Additionally, a simpler RTA method is applied to the same task set. Blocking is not considered in the experiments. The simpler method is a modification of the one presented in [4]; the differences are that pipeline-related preemption costs are incorporated and the interference caused by the scheduler is explicitly considered. The equations used are as follows:

R_i = C_i + I_i + 2C_d   (20)

I_i = Σ_{j=1}^{i−1} ⌈R_i/T_j⌉ (C_j + 2C_d + Y_j + Y_s + F^{j,i}_max) + Σ_{l=i+1}^{n} ⌈R_i/T_l⌉ (C_d + Y_s + F^{i}_max)   (21)

[Figure 2: Response time of task 3 and 4 in dependence of the period of task 2. The plot shows Simple Isolated RTA, Isolated RTA, and Integrated RTA for τ_3 and τ_4, together with the deadline.]

Y_j and Y_s are the number of cache lines accessed by τ_j and by τ_d, respectively, times the miss penalty μ. The sample task set contains the (pseudo-)scheduler, two short tasks that arrive at a high frequency, and two larger tasks that arrive at a low frequency. The short tasks could, for instance, represent sporadic tasks. Tasks 1 and 2 perform matrix summations, task 3 is a bubble sort derivative, and task 4 performs matrix multiplications. CPU clock cycles are used as the time unit. The pipeline analyser for the SuperSPARC I CPU [17] is used. There is only one program path for each task, to make sure that the worst-case path is independent of cache and pipeline behaviour. Consequently a path analysis was not necessary; exact execution profiles were used instead. They were derived with the help of qpt2 (Quick program Profiler and Tracer) [10], which is part of the Wisconsin Architectural Research Tool Set (WARTS) distribution. The task set parameters are as follows:

Task   C_i          C̃_i         Deadline               Period
d      578          –            –                      –
1      341,096      341,545      411,420                411,420
2      341,096      341,096      2.06·10^6–2.5·10^6     2.06·10^6–2.5·10^6
3      40,976,838   45,487,636   99.92·10^9             99.92·10^9
4      51,638,057   76,811,181   99.92·10^9             100·10^9

For simplicity, only instruction cache behaviour was considered. The assumed cache has a size of 1k, is direct-mapped, and μ = 10.¹ The cache interference analysis delivers the following precomputed values for X_{i,j}, X̄_{k,j,i}, and Y_j: X_{1,0} = 30, X_{2,0} = 0, X_{2,1} = 0, X_{3,0} = 250, X_{3,1} = 640, X_{3,2} = 520, X_{4,0} = 390, X_{4,1} = 1140, X_{4,2} = 990, X_{4,3} = 950, X̄_{1,0,2} = 190, X̄_{1,0,3} = 30, X̄_{2,0,3} = 0, X̄_{2,1,3} = 0, X̄_{1,0,4} = 30, X̄_{2,0,4} = 0, X̄_{3,0,4} = 250, X̄_{2,1,4} = 0, X̄_{3,1,4} = 640, X̄_{3,2,4} = 520, Y_0 = 440, Y_1 = 1800, Y_2 = 1800, Y_3 = 1060, and Y_4 = 1190. For simplicity, a constant upper bound F = 20 is used for the pipeline-related preemption costs. The estimated response times are as follows:

Task   Simple RTA                     Isolated RTA                   Integrated RTA
1      345,366                        344,136                        344,495
2      D_2 exceeded                   2,056,708                      2,057,053
3      1,652,261,656 – D_3 exceeded   1,434,595,454 – D_3 exceeded   1,514,804,944 – 54,921,657,635
4      3,732,388,007 – D_4 exceeded   3,577,288,379 – D_4 exceeded   4,069,716,097 – D_4 exceeded

At the lower end of the range of T_2 (note that for the variable period the deadline was always equal to the period) the deadline of τ_4 is estimated to be exceeded by all three methods, the deadline of τ_3 by the isolated approaches, and the deadline of τ_2 by the simple isolated approach. Figure 2 shows the estimated response times for τ_3 and τ_4 in dependence of the period of τ_2. The behaviour is as expected: for lower values of T_2, i.e. more preemptions, the integrated method delivers more precise results. The proposed isolated method is always better than the simple isolated method. The response time table above shows that for higher values of T_2 the isolated methods are slightly better (this is not visible in Figure 2, because the differences are too small for this scale). The point of break-even between the isolated and the integrated method depends not only on the number of preemptions compared to the number of reuses of cache lines; considering the overlapping of cache and pipeline effects is also advantageous for the integrated method. Note that all values above the deadline are not valid response times, since the RTA is terminated as soon as the deadline is exceeded.
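The break-even behaviour can be made concrete with a toy cost model (all numbers are assumptions for illustration, not taken from the experiments): an isolated-style analysis charges a reload penalty on every preemption, while an integrated-style analysis pays a one-off WCET inflation that is independent of the preemption count.

```python
def isolated_cost(base, preemptions, penalty):
    # per-preemption reload penalty: grows linearly with preemptions
    return base + preemptions * penalty

def integrated_cost(inflated, preemptions):
    # the reload bound is already folded into the inflated WCET
    return inflated

base, inflated, penalty = 1000, 1300, 100
for p in (1, 3, 5):
    iso = isolated_cost(base, p, penalty)
    integ = integrated_cost(inflated, p)
    winner = "isolated" if iso < integ else "integrated" if integ < iso else "tie"
    print(p, iso, integ, winner)
```

In this toy setting the break-even point lies at three preemptions: below it the isolated accounting is tighter, above it the integrated accounting wins, mirroring the behaviour observed in Figure 2.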

¹ μ = 10 is more than the number of clock cycles needed to transfer a memory block into the cache (9 clock cycles). Since the SuperSPARC I takes dynamic pipeline decisions, this is the true worst-case miss penalty [18].

5. Conclusions and Future Work

Two cache and pipeline sensitive methods for schedulability analysis in fixed priority environments have been presented. The isolated version considers the actual preemption situation (as far as possible). For each preemption, direct and indirect cache-related as well as pipeline-related preemption costs are added. A new feature of this approach is that pipeline-related preemption costs are considered. The overlapping of cache misses due to preemptions with pipeline effects is ignored by all isolated approaches, and it has been shown that all isolated approaches are limited to a certain class of CPUs.

A novel integrated method has been proposed. There is no inherent limit to the applicability of this method; it depends only on the ability of the underlying WCET estimation to support modern microarchitectural features of CPUs. It incorporates all possible cache-related preemption costs into the WCET. Therefore, it is possible to reduce the pessimism for these costs significantly by considering the overlapping of cache and pipeline effects. The static priority assignment of tasks is used to estimate the cache-related preemption costs.

Both presented methods can be applied to communicating tasks, provided that the immediate ceiling priority protocol (ICPP) is used. However, the described technique to incorporate cache-related preemption costs for blocking periods is quite coarse. In future work the computation of worst-case blocking times (including cache-related preemption costs) will be made more sophisticated, for instance by determining them in a way similar to the response time computation.

The experimental results show that both methods have their strengths. The modeling of the dynamic preemption situations by the isolated method is better for low numbers of preemptions. The more static approach of the integrated method is better for a higher number of preemptions; its decline in precision for fewer preemptions does not even come close to the one suffered by the isolated methods with an increasing number of preemptions. Moreover, the integrated method benefits from considering the overlapping of preemption-caused cache misses with pipeline effects. It seems to be a good idea to use both approaches and pick the best response time for each task.
Unfortunately, it is not possible to use isolated approaches for every CPU type, as has been shown in subsection 3.6.2. Therefore, the integrated method should be improved, e.g. by supporting an upper bound for the number of repeated preemption-caused cache misses. Such a bound can, for instance, be derived from the maximum number of preemptions. Future work will concentrate on this.

Acknowledgements

Many members of the compiler design group at the Universität des Saarlandes, especially the members of the USES (Universität des Saarlandes Embedded Systems) group, deserve acknowledgement. Prof. Reinhard Wilhelm, Daniel Kästner, and Stephan Thesing carefully read draft versions of this work and provided many valuable hints and suggestions. Dr. Christian Ferdinand supplied his vast knowledge about cache analysis and gave valuable suggestions for improvement during fruitful discussions. I would like to thank Prof. Sang Lyul Min for his appreciated and helpful comments on the paper. I also have to thank Prof. Tomasz Müldner, who was so kind to give me his valuable advice in some stylistic questions. I would like to thank Mark D. Hill, James R. Larus, Alvin R. Lebeck, Madhusudhan Talluri, and David A. Wood for making available the Wisconsin Architectural Research Tool Set (WARTS), and the anonymous reviewers for their helpful comments.

References

[1] N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings. Applying New Scheduling Theory to Static Priority Pre-emptive Scheduling. Software Engineering Journal, 8(5):284–292, 1993.
[2] S. Basumallick and K. Nilsen. Cache Issues in Real-Time Systems. In ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, 1994.
[3] A. Burns and A. Wellings. Real-Time Systems and Programming Languages. Addison-Wesley, second edition, 1997.
[4] J. V. Busquets-Mataix, J. J. Serrano, R. Ors, P. Gil, and A. Wellings. Adding Instruction Cache Effect to Schedulability Analysis of Preemptive Real-Time Systems. In Proceedings of the IEEE Real-Time Technology and Applications Symposium, pages 204–212, June 1996.
[5] C. Ferdinand. Cache Behavior Prediction for Real-Time Systems. Dissertation, Universität des Saarlandes, Sept. 1997.
[6] C. Ferdinand, F. Martin, R. Wilhelm, and M. Alt. Cache Behavior Prediction by Abstract Interpretation. Science of Computer Programming, 35(1):163–189, 1999.
[7] C. Ferdinand and R. Wilhelm. Efficient and Precise Cache Behavior Prediction for Real-Time Systems. Real-Time Systems, 17(2–3):131–181, 1999.
[8] D. Kästner and S. Thesing. Cache Sensitive Pre-Runtime Scheduling. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, volume 1474 of Lecture Notes in Computer Science, pages 131–145. Springer, 1998.
[9] D. B. Kirk. SMART (Strategic Memory Allocation for Real-Time) Cache Design. In Proceedings of the 10th Real-Time Systems Symposium, pages 229–237, Dec. 1989.
[10] J. Larus. Abstract Execution: A Technique for Efficiently Tracing Programs. Software Practice and Experience, 20(12):1241–1258, Dec. 1990.
[11] C.-G. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee, and C. S. Kim. Enhanced Analysis of Cache-related Preemption Delay in Fixed-priority Preemptive Scheduling. In Proceedings of the IEEE Real-Time Systems Symposium, pages 187–198, Dec. 1997.
[12] S. Lee, C.-G. Lee, M. Lee, S. L. Min, and C. S. Kim. Limited Preemptible Scheduling to Embrace Cache Memory in Real-Time Systems. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, volume 1474 of Lecture Notes in Computer Science, pages 51–64. Springer, 1998.
[13] C. L. Liu and J. W. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM, 20(1):46–61, 1973.
[14] J. W. S. Liu. Real-Time Systems. Prentice Hall, 2000.
[15] G. Luculli and M. Di Natale. A Cache-Aware Scheduling Algorithm for Embedded Systems. In Proceedings of the IEEE Real-Time Systems Symposium, pages 199–209, Dec. 1997.
[16] T. Lundqvist and P. Stenström. Timing Anomalies in Dynamically Scheduled Microprocessors. In Proceedings of the IEEE Real-Time Systems Symposium, pages 12–21, Dec. 1999.
[17] J. Schneider and C. Ferdinand. Pipeline Behavior Prediction for Superscalar Processors by Abstract Interpretation. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, volume 34 of ACM SIGPLAN Notices, pages 35–44, May 1999.
[18] J. Schneider, C. Ferdinand, and R. Wilhelm. Pipeline Behavior Prediction for Superscalar Processors. Technical Report A/02/99, Computer Science Department, Saarland University, 1999.
[19] H. Theiling and C. Ferdinand. Combining Abstract Interpretation and ILP for Microarchitecture Modelling and Program Path Analysis. In Proceedings of the IEEE Real-Time Systems Symposium, pages 144–153, Dec. 1998.
