Performance-Based Design of Distributed Real-Time Systems

Dong-In Kang, Richard Gerber
Institute for Advanced Computer Studies
Department of Computer Science
University of Maryland
College Park, MD 20742
{dikang, [email protected]}

Abstract

This paper presents a design method for distributed systems with statistical, end-to-end real-time constraints, and with underlying stochastic resource requirements. A system is modeled as a set of chains, where each chain is a distributed pipeline of tasks, and a task can represent any activity requiring nonzero load from some CPU or network resource. Every chain has two end-to-end performance requirements. Its delay constraint denotes the maximum amount of time a computation can take to flow through the pipeline, from input to output. A chain's quality constraint mandates a minimum allowable success rate for outputs that meet their delay constraints. Our design method solves this problem by deriving (1) a fixed proportion of resource load to give each task; and (2) a deterministic processing rate for every chain, in which the objective is to optimize the output success rate (as determined by an analytical approximation). We demonstrate our technique on an example system, and compare the estimated success rates with those derived via simulated on-line behavior.

1 Introduction

Scheduling analysis can often be a valuable tool in the design of a hard real-time system. Traditionally, this kind of system is characterized by two key criteria. First, all execution times and communication delays can be deterministically bounded; the designer assumes that these worst-case bounds may be suffered each time a task is run, or whenever a packet is sent. Second, all real-time constraints are known ahead of

(This research is supported in part by ONR grant N0001494-10228, NSF Young Investigator Award CCR-9357850, ARL Cooperative Agreement DAAL01-96-2-0002 and NSERC Operating Grant OGPO170345.)

Manas Saksena
Dept. of Computer Science
Concordia University
Montreal, Quebec H3G 1M8, Canada
[email protected]

time, and are part of the system's requirements, or derive from them in some logical way. Such constraints might include an individual task's period, a message's deadline, or perhaps the rate at which a network driver is run. Given this type of system, real-time scheduling analysis can often be used to predict (and then ensure) that the performance objectives will be met.

Of course, while hard real-time applications do exist (e.g., reactor shutdown systems come to mind), many systems cannot be designed like this. First, while one could theoretically bound all execution times and network delays, these bounds would never be tight. Contemporary systems make this almost impossible, by including features like superscalar pipelining, hierarchies of cache memory, etc., not to mention the nondeterminism inherent in almost any network. Aside from these physical limitations, since a program's actual execution time is almost always data-dependent, its worst-case time may be several orders of magnitude greater than the mean. Hence, if one incorporates worst-case costs in a design, the result will be an extremely under-utilized system.

Second, most real-time constraints are not inherently part of the system's end-to-end requirements; rather, task and message constraints are artifacts of the design itself, and there is rarely an intrinsic need to meet each and every artifact deadline. For example, consider a video-conferencing system. It should be able to capture, process, transmit, receive, process, and display the data at some mean frame-rate, albeit with permissible deviations for jitter. Or consider a real-time control system, which might involve sampling, computation, and actuation; thus the control loop is typically implemented as a periodic task with a deadline. In both these cases, real-time performance requirements are usually intended to enforce an acceptable level of quality, and do not hinge on the system meeting all of its artifact deadlines. In other words, while hard real-time scheduling theory provides a sufficient way to build such systems, it is not strictly necessary, and is probably not the most cost-effective design framework.

In this paper we explore an alternative approach, by using statistical guarantees to generate cost-effective system designs. We model a real-time system as a set of task chains, with each task representing some activity requiring a specific CPU or network link. For example, a chain may correspond to the data path from a camera to a display in a video conferencing system, or may reflect a servo-loop in a distributed real-time control system. The chain's real-time performance requirements are specified in terms of (i) a maximum acceptable propagation delay from input to output, and (ii) a minimum acceptable average throughput. In designing the system, we treat the first requirement as a hard requirement; that is, any end-to-end computation that takes longer than the maximum delay is treated as a failure, and does not contribute to the overall throughput. In contrast, the second requirement is viewed in a statistical sense, and we design a system to exceed, on average, its minimal acceptable rate. We assume that a task's cost (i.e., execution time for a program, or delay for a network link) can be specified with any arbitrary (discrete) probability distribution function.

Problem and Solution Strategy. Given a set of task chains, each with its own real-time performance requirements, our objective is to design a system which will meet the requirements for all of the chains. Our solution uses several techniques to make the problem tractable:

- We assume a static partitioning of the system resources; in other words, a task has already been allocated to a specific resource.

- A task's load requirements are expressed stochastically, in terms of a discrete probability distribution function (or PDF), whose random variable characterizes the resource time needed for one execution instance of the task. All successive instances are modeled to be independent of each other.

While our objective is to achieve an overall, statistical level of real-time performance, we can still use the tools provided by hard real-time scheduling to help solve this problem. Our method involves (1) assigning a fixed period to a chain, so that (if possible) the chain's tasks are invoked at the periodic boundaries; and (2) assigning to each task a fixed proportion of its resource's load. Then, using the techniques provided by real-time CPU and network scheduling, we can guarantee that during any period, a task will get

at least its designated share of the resource's time. When that share fails to be sufficient for the task to finish executing, it runs over into the next period, and gets that period's share, etc.

Given this model, the design problem may be viewed as the following interrelated sub-problems: (1) Given a set of chains, how should the CPU and network load be partitioned among the tasks, so that every chain's performance requirements are met? (2) Given a load-assignment to the tasks in a chain, what is the optimal period to execute the chain such that the effective throughput is maximized? In this paper we present algorithms to solve these problems. The algorithm for problem (1) uses a heuristic to compare the relative needs of tasks from different chains competing for the same resources. The algorithm for problem (2) makes use of connecting Markov chains to estimate the effective throughput of a given chain. Since the analysis is approximate, we validate the generated solution using a simulation model.

Related Work. Like much of the work in real-time systems, our results extend from preemptive, uniprocessor scheduling analysis. There are many old and new solutions to this problem (e.g., [11, 1, 7, 10]); moreover, these methods come equipped with offline analysis tests, which determine a priori whether the underlying system is schedulable. Many of these tests are load-oriented sufficiency conditions: they predict that the tasks will always meet their deadlines, provided that the system utilization does not exceed a certain pre-defined threshold. The classical model has been generalized to a large degree, so that there now exist analogous results for distributed systems, network protocols, etc. For example, the model has been applied to distributed hard real-time systems in the following straightforward manner (e.g., see [14, 20]): each network connection is abstracted as a real-time task (sharing a network resource), and the scheduling analysis incorporates worst-case blocking times potentially suffered when high-priority packets have to wait for transmission of lower-priority packets. Then, to some extent, the resulting global scheduling problem can be solved as a set of interrelated local resource-scheduling problems.

In our previous work [4] we relaxed the precondition that period and deadline parameters are always known ahead of time. Rather, we used the system's end-to-end delay and jitter requirements to automatically derive each task's constraints; these, in turn, ensure that

the end-to-end requirements will be met on a uniprocessor system. This idea was also used recently in [15], where individual task constraints are derived from a set of control laws, with the objective being to optimize the system's performance index while satisfying schedulability. A similar approach has been used for "soft" transactions in distributed systems [8], where each transaction's deadline is partitioned between the system's resources.

As for modeling stochastic execution times, some recent results deal with uniprocessor systems, where the artifact real-time constraints are considered part of the system's requirements. For example, in [18] task completion time bounds are found for the case in which costs can vary within some range, and where arrivals can be sporadic. And in [19], scheduling analysis is extended to consider probabilistic execution times on uniprocessor systems. This is done by giving a nominal "hard" amount of execution time to each task instance, under the assumption that the task will usually complete within this time. But if the nominal time is exceeded, the excess requirement is treated like a sporadic arrival (via a method similar to that used in [9]). To our knowledge, however, ours is the first technique that achieves statistical real-time performance in a distributed system by using end-to-end requirements to assign both periods and execution time budgets. In this light, our method should be viewed less as a scheduling tool (it is not one), and more as an approach to the problem of real-time systems design.
To accomplish this goal, we assume an underlying runtime system that is capable of the following: (1) decreasing a task's completion time by proportionately increasing its resource share; (2) enforcing, for each resource, the proportional shares allocated for every task, up to some minimum, pre-defined window size; and (3) within these constraints, isolating a task's real-time behavior from the other activities sharing its resource. In this regard, we are building on many recent results that have been developed for providing OS-level reservation guarantees, and for fair-share queueing in network protocols.

In particular, we assume a network model in which per-connection real-time performance guarantees can be provided, assuming that a certain proportion of bandwidth has been allocated to each connection. Many policies have recently been proposed to accommodate this, including the "Virtual Clock Method" [24], "Fair-Share Queueing" [2], "Generalized Processor Sharing" (or GPS) [13], and "Rate Controlled Static Priority Queueing" (or RCSP) [22]. These models have also been extended to derive statistical guarantees; in particular, within the framework of RCSP (in [23]), and for GPS (in [25]). Related results for real-time traffic sources can be found in [3] (using a policy like "Virtual Clock" [24]), and in [21] (using a modified FCFS method, with a variety of traffic distributions). In [6], statistical service quality is achieved via proportional-share queueing, in conjunction with server-guided backoff; i.e., servers dynamically adjust their rates to help utilize the available bandwidth, and to avoid network congestion. Many of the proportional-sharing techniques will suffice for our purposes, since all we require are (1) the mutual independence of all sessions running through a switch, given proportional bandwidth allocations for each; and (2) delay distributions for all single-hop connections. With this abstraction, we are able to model a single-hop connection like a real-time task, which shares a network resource (as opposed to a CPU).

We also rely on a similar model at the CPU level, and there have been many policies suggested to achieve this end. (In fact, rate-monotonic and EDF scheduling can be considered two such policies.) Moreover, several recent schemes have been proposed to guarantee processor capacity shares for the system's real-time tasks, and to simultaneously isolate them from overruns caused by other tasks in the system. For example, Mercer et al. [12] proposed a processor capacity reservation mechanism to achieve this, a method which enforces each task's reserved share within its reservation period, under micro-kernel control. Also, Shin et al. [16] proposed a reservation-based algorithm to guarantee the performance of periodic real-time tasks, and also improve the schedulability of aperiodic tasks. Recently, many of the rate-based disciplines developed for network queueing have sprouted analogues for CPU scheduling. In particular, Goyal et al. [5] proposed a scheme which allows for hierarchical partitioning of CPU load, so multiple classes of tasks (e.g., hard and soft real-time applications) can coexist in the same system. Also, Stoica et al. [17] developed a proportional-share scheduling strategy, which multiplexes available CPU load according to relative weights of the constituent tasks. Like the above-mentioned schemes used for network queueing, this method maps a task's real-time state-changes onto a virtual time-line, and then schedules the tasks via a virtual clock.
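To make the virtual-clock idea behind these proportional-share disciplines concrete, here is a minimal sketch of our own (an illustration of the general technique, not the algorithm of any one of the cited papers): each task advances a private virtual time by 1/weight per quantum, and the scheduler always runs the task with the smallest virtual time, so long-run resource shares track the relative weights.

```python
import heapq

def proportional_share(weights, quanta):
    """Illustrative virtual-time proportional sharing: repeatedly run
    the task with the smallest virtual time, advancing it by 1/weight
    per quantum, so long-run usage approaches the relative weights."""
    heap = [(0.0, t) for t in sorted(weights)]   # (virtual time, task)
    heapq.heapify(heap)
    usage = {t: 0 for t in weights}
    for _ in range(quanta):
        v, t = heapq.heappop(heap)
        usage[t] += 1                            # t gets this quantum
        heapq.heappush(heap, (v + 1.0 / weights[t], t))
    return usage

# A task with twice the weight receives twice the quanta in the long run:
usage = proportional_share({"a": 2, "b": 1}, 300)
```

Here the 2:1 weights yield a 200/100 split of the 300 quanta, which is exactly the isolation property the runtime model of this paper relies on.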

2 Model and Solution Overview

As stated above, we model a system as a set of independent, pipelined task chains, with every task mapped to a designated CPU or network resource. The chain abstraction can, for example, capture the essential data path structure of a video conferencing system, or a distributed process control loop. Formally, a system possesses the following structure and constraints.

Bounded-Capacity Resources: There is a set of resources r1, r2, ..., rm, where a given resource ri corresponds to one of the system's CPUs or network links. Associated with ri is a maximum allowable capacity, or mi, which is the maximum load the resource can multiplex effectively. The parameter mi will typically be a function of its scheduling policy (as in the case of a workstation), or its switching and arbitration policies (in the case of a LAN).

Task Chains: A system has n task chains, denoted Γ1, Γ2, ..., Γn, where the jth task in a chain Γi is denoted τi,j. Each computation on Γi carries out an end-to-end transformation from its external input Xi to its output Yi. Also, a producer/consumer relationship exists between each connected pair of tasks τi,j−1 and τi,j, and we assume a one-slot buffer between each such pair. Hence a producer may overwrite buffered data which is not consumed.

Stochastic Processing Costs: A task's cost is modeled via a discrete probability distribution function, whose random variable characterizes the time it needs for one execution instance on its resource.

Maximum Delay Bounds: Γi's delay constraint, MDi, is an upper bound on the time it should take for a computation to flow through the system, and still produce a useful result. For example, if MDi = 500ms, it means that if Γi produces an output at time t, it will only be used if the corresponding input is sampled no earlier than t − 500ms. Computed results that exceed this propagation delay are dropped.

Minimum Success Rates: Γi's rate constraint, MSRi, specifies a minimum acceptable success rate for outputs which meet their delay constraints. For example, if MSRi = 10/sec, it means that the chain Γi must produce successful outputs at a mean rate of 10/sec.
Moreover, MSRi implicitly specifies the minimum possible rate at which we can run the tasks in Γi; e.g., if MSRi = 10/sec, then Γi's minimum execution rate is also 10/sec, which would suffice only if every output achieved success.

An Example. Consider the example shown in Figure 1, which possesses six chains, labeled Γ1-Γ6. Here, rectangles denote shared resources, black circles denote tasks, and the shaded boxes are external inputs and outputs. The end-to-end constraints and resource requirements are shown in Figure 2. As for the latter, note that any discrete probability distributions can be used. For the sake of this example we discretized two well-known continuous distributions, and re-used the results for multiple tasks.

Figure 1: Example System Topology.

In Figure 2(4), "Type" denotes a base continuous distribution generated and quantized using the parameters Min, Max, E[X] (mean), Var[X] (variance) and Q (number of intervals). In the case of an exponential distribution, the CDF curve is shifted up to "Min," and the probabilities given to values past "Max" are distributed to each interval proportionally. The granularity of discretization is controlled by Q, where we assume that the execution time associated with an interval is the maximum possible value within that interval. For example, the time requirement for τ1,1 follows a normal distribution, with a 10ms mean, 8ms standard deviation, a minimum time of 4ms, and a maximum of 35ms. The continuous distribution is chopped into 10 intervals; since the first interval contains all times within [4ms, 7ms], we associate 7ms with this interval, and give it the corresponding portion of the continuous CDF.

Note that if we attempted a hard real-time approach here, no solution would be possible. For example, τ3,1 requires up to 200ms of dedicated resource time to run, which means that τ3,1's period must be no less than 200ms. Yet MSR3 = 5, i.e., Γ3 must produce a successful, new output at least 5 times per second. This, in turn, means τ3,1's period can also be no greater than 200ms. But if the period is exactly 200ms, the task induces a utilization of 1.0 on resource r1, exceeding the resource's intrinsic 0.9 limit, and disallowing any capacity for other tasks hosted on it.
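The quantization step described above can be sketched as follows. The helper `discretize_normal` is our own illustration: it truncates a normal distribution to [Min, Max], renormalizes the enclosed mass, and represents each of the Q intervals by its upper endpoint, as the text prescribes (the mass-shifting rules for the exponential case are not reproduced here).

```python
import math

def discretize_normal(mean, std, lo, hi, q):
    """Quantize a normal distribution, truncated to [lo, hi], into q
    equal-width intervals; each interval is represented by its maximum
    value, a conservative choice matching the text."""
    def cdf(x):  # normal CDF via the error function
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))
    total = cdf(hi) - cdf(lo)           # probability mass inside [lo, hi]
    width = (hi - lo) / q
    pmf = {}
    for i in range(q):
        a, b = lo + i * width, lo + (i + 1) * width
        pmf[b] = (cdf(b) - cdf(a)) / total   # renormalized interval mass
    return pmf

# Parameters echoing tau_{1,1}: Normal, mean 10ms, std 8ms, [4,35]ms, Q=10.
pmf = discretize_normal(10.0, 8.0, 4.0, 35.0, 10)
```

With these parameters the first interval covers [4ms, 7.1ms] and is represented by its upper endpoint, mirroring the τ1,1 discussion (the text's interval boundaries differ slightly from this equal-width sketch).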

2.1 Run-Time Model

All tasks within a chain Γi are scheduled in a quasi-cyclic fashion, and all are assigned the same constant period Ti. In a hard real-time system, this typically means that a task τi,j is re-initiated every period, e.g., at times 0, Ti, 2Ti, etc.; moreover, each instance finishes within the Ti frame. In our problem domain the meaning of Ti is significantly different, though we still refer to it as Γi's "period." For us, Ti is Γi's service interval, i.e., the

1. Maximum Delay Constraints (ms):
   MD1 = 240, MD2 = 1000, MD3 = 1000, MD4 = 700, MD5 = 100, MD6 = 300

2. Minimum Success Rates (per second):
   MSR1 = 10, MSR2 = 5, MSR3 = 5, MSR4 = 5, MSR5 = 5, MSR6 = 5

3. Resource Time PDFs (ms):

   Type     E[X]  Var[X]  [Min,Max]  Q    Tasks
   Normal    10     64    [4,35]     10   τ1,1; τ3,2; τ3,5; τ4,1; τ6,1
   Normal    20    100    [10,50]    20   τ1,3; τ2,3; τ3,6; τ3,8; τ4,2
   Exp       10     -     [0,100]    30   τ2,1; τ3,4; τ3,7; τ4,3; τ5,1
   Exp       20     -     [0,200]    50   τ2,4; τ2,6; τ3,1; τ4,5; τ6,2
   Normal     8    144    [2,48]     20   τ1,2; τ2,2; τ2,5; τ3,3; τ4,4

4. Maximum Resource Utilizations:
   m1 = 0.9, m2 = 0.7, m3 = 0.9, m4 = 0.6, m5 = 0.9,
   m6 = 0.7, m7 = 0.7, m8 = 0.8, m9 = 0.9, m10 = 0.9

Figure 2: Chain Constraints and Task PDFs.

time quantum at which Γi's load guarantees are maintained by each of its constituent resources. Our design algorithm assigns each task τi,j a proportion of its resource's capacity, which we denote as ui,j. Given this assignment, τi,j's runtime behavior can be described as follows:

(1) Within every Ti interval, the runtime system ensures that τi,j can use up to ui,j of its resource's capacity. This is policed by assigning τi,j an execution-time budget Ei,j = ⌊ui,j · Ti⌋; that is, Ei,j is an upper bound on the amount of resource time provided within each Ti frame, truncated to discrete units. (We assume that the system cannot keep track of arbitrarily fine granularities of time.)

(2) A particular execution instance of τi,j may require multiple periods to complete, with Ei,j of its running time expended in each period.

(3) A new instance of τi,j will be started within a period if no previous instance of τi,j is still running, and if τi,j's input buffer contains new data that has not already exceeded MDi. This is policed by a timestamp mechanism, put on the computation when its input is sampled.

Due to a chain's pipeline structure, note that if there are ni tasks in Γi, then we must have that MDi ≥ ni · Ti, since data has to flow through all ni elements to produce a computation.
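The budget rule and the pipeline-depth condition above can be captured in a few lines; the helper names below are ours, and the example values echo the Γ6 parameters used later in Section 3 (T6 = 60ms, E6,1 = 6ms).

```python
def budget(u_ij, T_i):
    """Per-period execution-time budget E_ij = floor(u_ij * T_i),
    truncated to whole time units as described in the text."""
    return int(u_ij * T_i)

def period_feasible(MD_i, n_i, T_i):
    """A pipeline of n_i tasks, each advancing one period at a time,
    needs MD_i >= n_i * T_i for any computation to fit the delay bound."""
    return MD_i >= n_i * T_i
```

For instance, a 0.1 share of a 60ms period yields a 6ms budget, and a two-task chain with MD = 300ms can run at T = 60ms, since 300 ≥ 2 × 60.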

2.2 Solution Overview

The design process works by (1) partitioning the CPU and network capacity between the tasks; (2) selecting each chain's period to optimize its success rate; and (3) checking the solution via simulation, to verify the integrity of the approximations used, and to ensure that each chain's output profile is sufficiently smooth (i.e., not bursty).

The partitioning algorithm processes each chain Γi and finds a load assignment vector for it, denoted ui. This vector represents the minimal resource load sufficient for Γi to meet its performance objectives, where an element ui,j in ui contains the load allocated to τi,j on its resource. Now, given a load assignment for Γi, we attempt to find a period Ti which achieves Γi's success rate. This computation is done approximately: for a given period Ti, a success rate estimate is derived by (1) treating all of Γi's outputs uniformly, which yields an i.i.d. "success probability" σi for all outputs; and (2) then simply multiplying σi · (1/Ti) to approximate the chain's current success rate, SRi. If SRi is lower than the MSRi requirement, the load assignment vector is increased, and so on. Finally, if sufficient load is found for all chains, the resulting system is simulated to ensure that the approximations were sound. At this point, excess resource capacity can be given to any chain, with the hope of improving its overall success rate.

3 Throughput Analysis

In this section we describe how we approximate Γi's success rate, SRi, given a particular period Ti and utilization assignment ui. Then in Section 4, we show how we make use of this technique to help search for candidate instantiations of Ti and ui.

We carry out our analysis using a discrete-time model, in which the time units are in terms of chain periods; i.e., our discrete domain {0, 1, 2, ...} corresponds to the real times {0, Ti, 2Ti, ...}. Not only does this reduction make the analysis more tractable, it also corresponds to our stipulation that tasks can only be initiated at period boundaries. Moreover, since the underlying system may schedule a task execution at any time within a Ti frame, we have to assume that its input data may be read as early as the beginning of a frame, and output as late as the end of a frame. And with one exception, a chain's states of interest do occur at its period boundaries. That exception is in evaluating a computation's aggregate delay, which, in our discrete time domain, we treat as ⌊MDi/Ti⌋. Hence the fractional part of the last period is ignored, leading to a tighter notion of success (and consequently erring on the side of conservatism).

Theoretically, we could model a computation's delay by constructing a stochastic process for the chain as a whole, and solving it for all possible delay probabilities. But this would probably be a futile venture, even for smaller chains; after all, such a model would have

to simultaneously keep track of each task's local behavior. And since a chain may hold as few as 0 ongoing computations, and as many as one computation per task, it is easy to see how the state-space would quickly explode. Instead, we go about constructing a model in a compositional (albeit inexact) manner, by processing each task locally, and using the results for its successors. Now consider the flow of a computation at a single task.

Figure 3: Chain Xi,j(t), max(Ci,j) = 6.

The variable agei,j charts a computation's total accumulated time, from entering Γi's head, to leaving τi,j. The task reads an incoming input's time-stamp as its first action; if it exceeds the chain's maximum allowable delay, the input is dropped at that point. Now, there are two types of delay that τi,j can add to agei,j: (1) blocking time, or the time during which an output from τi,j−1 is blocked in τi,j's input buffer; and (2) processing time, i.e., the number of periods τi,j actually needs to perform its computation or data transfer. A computation suffers blocking time whenever its result arrives from τi,j−1, but τi,j is still unfinished with some previous sample. We use the random variable Bi,j to describe a tagged data's blocking time at task τi,j. As for processing time, let ti,j be a random variable whose values are generated via τi,j's execution time PDF. Then, letting Ci,j be the corresponding random variable expressing τi,j's execution time in units of periods, we have that Ci,j = ⌈ti,j / Ei,j⌉.

Given this model, and assuming data is always ready at the chain's head, agei,j can be calculated via the following recurrence relation:

  agei,1 = Ci,1
  agei,j = agei,j−1 + Bi,j + Ci,j

For j > 1, we approximate the agei,j distribution by assuming the three variables to be independent, i.e.,

  P[agei,j = k] = Σ_{k1+k2+k3=k} P[agei,j−1 = k1] · P[Bi,j = k2] · P[Ci,j = k3]
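Under the independence assumption, the age distribution is just a discrete convolution of the three PMFs. A sketch of that computation, using our own dict-based PMF representation:

```python
def convolve(p, q):
    """Convolution of two discrete PMFs ({value: prob} dicts): the
    PMF of the sum of two independent random variables."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * qb
    return out

def age_pmf(age_prev, B, C):
    """P[age = k] = sum over k1+k2+k3 = k of
    P[age_prev = k1] * P[B = k2] * P[C = k3], per the recurrence."""
    return convolve(convolve(age_prev, B), C)
```

For example, an upstream age of 1 or 2 periods (each with probability 0.5), no blocking, and a one-period execution shifts the whole distribution up by one period.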

We also require an approximation of τi,j's inter-output distribution, Outi,j, which measures the time between two successive outputs produced by the task. Note that τi,j's success probability, ρi,j, will then be 1/E[Outi,j], i.e., the probability that a (non-stale) output is produced during a random period. For the head task τi,1, we trivially have that P[Outi,1 = t] = P[Ci,1 = t], and that ρi,1 is 1/E[Ci,1]. Since τi,1 can retrieve fresh input whenever it is ready (and hence it incurs no blocking time), τi,1 can execute a new phase whenever it finishes with the previous phase.

Blocking Time Analysis. To help analyze Bi,j, we set up a stochastic process, Xi,j(t), which describes random states of τi,j. Specifically, we are interested in capturing (1) τi,j's remaining execution time until the next task instance; and (2) whether an input is to be processed or dropped. Xi,j(t)'s transitions can be described using a simple Markov chain, as shown in Figure 3 (where the maximum execution time is 6Ti). The transitions are event-based, i.e., they are triggered by new inputs from τi,j−1. On the other hand, states measure the remaining time left in a current execution, if there is any. When the process moves from state k to l, it means that (1) a new input is received; and (2) it will induce blocking time of l periodic units. For the sake of our analysis, we distinguish between three different outcomes which can occur when state k moves to state l. Here we define di as the end-to-end delay bound in periodic units, or di = ⌊MDi/Ti⌋.

(1) Dropping, defined by Pd[k → l]: The task is currently executing, and there is already another input queued up in its buffer, which was calculated to induce a blocking-time of k. The new input will overwrite it, and induce l blocking-time periods.

  Pd[k → l] = P[Outi,j−1 = k − l]   (k > l ≥ 0)

(2) Failure, defined by Pf[k → l]: A new input will not get dropped, but it will be too stale to get processed by the finish-time of the current execution.

  Pf[k → 0] = P[Outi,j−1 > k] · P[agei,j−1 > di − k]

(3) Success, defined by Ps[k → l]: A new event arrives, and will get processed with blocking time l. Figure 4 illustrates a case of a successful transition.

Figure 4: Transition k → l, with 0 < k, 0 < l.

Case A: Destination state is 0.

  Ps[k → 0] = Σ_{t>0} P[Ci,j = t] · P[Outi,j−1 ≥ t + k] · P[agei,j−1 ≤ di − k]

Case B: Destination state l > 0.

  Ps[k → l] = Σ_{t>0} P[Ci,j = t] · P[Outi,j−1 = t + k − l] · P[agei,j−1 ≤ di − k]

So, an untyped transition probability from k to l is:

  P[k → l] = Pf[k → l] + Ps[k → l] + Pd[k → l]

and hence, we can obtain Xi,j(t)'s steady-state probabilities in the usual manner. Formally, let πi,j(k) be

  πi,j(k) = lim_{t→∞} P[Xi,j(t) = k]

Now, we use the πi,j(k) distribution to derive τi,j's "success probability," i.e., the probability that it successfully processes a random input from τi,j−1:

  psi,j = Σ_{k≥0} πi,j(k) · Σ_{l≥0} Ps[k → l]

and so, task τi,j's mean output probability is just

  ρi,j = ρi,j−1 · psi,j

which denotes the chance that an output is produced during a given (random) time unit. The blocking-time PDF is conditioned on the fact that input is actually processed; hence we derive it by using the chain's steady-state distribution, the associated "successful" transition probabilities, as well as the task's overall success probability:

  P[Bi,j = k] = (1/psi,j) · πi,j(k) · Σ_{l≥0} Ps[k → l]

The final ingredient is to derive task τi,j's inter-output distribution, Outi,j. To do this we use a coarse mean-value analysis, as follows. After τi,j produces an output, we know that it goes through an idle phase (waiting for fresh input from its producer), followed by a busy phase, culminating in another output. Let Idlei,j be a random variable which counts the number of idle cycles before the busy phase. Then we have:

  E[Ci,j + Idlei,j] = E[Outi,j] = 1/ρi,j
  ⟹ E[Idlei,j] = (1 − ρi,j · E[Ci,j]) / ρi,j

Now assume that τi,j delivers a random output at time t1, and then starts a new computation at time t2. From the analysis above, it is clear that E[t2 − t1] = E[Idlei,j] + 1. Using this information, we model the event denoting "compute-start" as a pure Bernoulli decision with probability STi,j:

  STi,j = 1 / (E[Idlei,j] + 1)

i.e., after an output has been delivered, STi,j is the probability that a random cycle starts the next busy phase. We then approximate the idle durations via a modified geometric distribution, as follows:

  P[Idlei,j = l] = (1 − STi,j)^l · STi,j

In deriving the distribution for P[Outi,j], we capture a particular interval's idle phase using Idlei,j, and its busy phase using the (period-based) PDF for Ci,j:

  P[Outi,j = k] = Σ_{0 ≤ k1 ≤ k} P[Idlei,j = k1] · P[Ci,j = k − k1]

Finally, we calculate the end-to-end success probability, which is just τi,n's output probability, appropriately scaled by the probability of excessive delay injected during τi,n's execution:

  σi = ρi,n · P[agei,n ≤ di]
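Once the transition probabilities are tabulated, the steady-state distribution πi,j and the success probability psi,j follow from standard numerics. A sketch using plain power iteration; the 2-state matrix below is a hypothetical toy, not one derived from the paper's example.

```python
def steady_state(P, iters=2000):
    """Steady-state distribution of a finite, ergodic Markov chain
    given its transition matrix P (list of rows), by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[k] * P[k][l] for k in range(n)) for l in range(n)]
    return pi

def success_prob(pi, Ps):
    """p^s = sum_k pi(k) * sum_l Ps[k -> l], as in the text; Ps holds
    only the 'success'-typed transition probabilities."""
    return sum(pi[k] * sum(Ps[k]) for k in range(len(pi)))

# Hypothetical 2-state chain: its stationary vector is (5/6, 1/6).
P = [[0.9, 0.1], [0.5, 0.5]]
pi = steady_state(P)
```

For production use one would solve pi = pi P directly (a linear system), but power iteration keeps the sketch short and mirrors the "usual manner" the text appeals to.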

Example. As an example, we perform throughput estimation on Γ_6, assuming system parameters of T_6 = 60 ms, E_{6,1} = 6 ms, and E_{6,2} = 30 ms. Recall that the delay bound for the chain is MD_6 = 300; hence d_6 = ⌊MD_6/T_6⌋ = 5. Within our head-to-tail approach, we first have to consider task τ_{6,1}. Recall, however, that the distributions for both age_{6,1} and Out_{6,1} follow that for τ_{6,1}; moreover, we have that λ_{6,1} = 1/E[τ_{6,1}] = 0.3291. Next we consider the second (and last) task τ_{6,2}. The following table shows the PDFs for τ_{6,2}, for the Markov chain's steady states, for the blocking times, and for age_{6,2}:

  k                  0      1       2       3        4       5        6         7        8       9       10      11      12
  P[τ_{6,2} = k]     0.0    0.7543  0.1968  0.03751  0.0098  0.0019   0.0005    0.00008  0.0     0.0     0.0     0.0     0.0
  π_{6,2}(k)         0.975  0.019   0.0045  0.0009   0.0002  0.00003  0.000001  0.0      0.0     0.0     0.0     0.0     0.0
  P[B_{6,2} = k]     0.980  0.017   0.002   0.00009  0.0     0.0      0.0       0.0      0.0     0.0     0.0     0.0     0.0
  P[age_{6,2} = k]   0.0    0.0     0.0     0.2658   0.3400  0.2446   0.1153    0.0269   5.7e-3  1.3e-3  2.7e-4  5.3e-5  5.8e-6

Again, the success probability is derived as follows:

  ps_{6,2} = 0.9804
  λ_{6,2} = λ_{6,1} · ps_{6,2} = 0.3228
  P[age_{6,2} ≤ d_6] = 0.850
  λ_6 = λ_{6,2} · P[age_{6,2} ≤ d_6] = 0.2745

And hence, SR_6 = 0.2745 · (1000/60) = 4.574.
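The arithmetic in this example can be checked directly; the snippet below just replays the chain-Γ_6 numbers quoted above.

```python
# Reproduce the end-to-end numbers for chain Gamma_6 (values from the example).
T6 = 60                      # period, in ms
lam_61 = 0.3291              # output probability of task tau_{6,1}
ps_62 = 0.9804               # per-invocation success probability of tau_{6,2}
lam_62 = lam_61 * ps_62      # ~0.3228
p_age_ok = 0.850             # P[age_{6,2} <= d6]
lam_6 = lam_62 * p_age_ok    # ~0.2745, end-to-end per-cycle success probability
SR6 = lam_6 * 1000 / T6      # successful outputs per second, ~4.57
```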

4 System Design Process

We now revisit the "high-level" problem of determining the system's parameters, with the objective of satisfying each chain's performance requirements. As stated in the introduction, the design problem may be viewed as two inter-related sub-problems:

(1) Load Assignment. Given a set of chains, how should the CPU and network load be partitioned among the set of tasks, so that the performance requirements are met?

(2) Rate Assignment. Given a load assignment to the tasks in a chain, what is the optimal period at which to execute the chain, such that the effective throughput is maximized?

Our solution uses separate heuristics for each of these sub-problems. In particular, the top-level Load Assignment algorithm works on a chain-by-chain basis, iteratively adjusting the proportional load allocated to each of the chain's tasks. In doing this, it makes use of a Rate Assignment subroutine, which determines a suitable processing rate for a chain under the given load assignment, and computes the corresponding success rate.

Algorithm 4.1 Load-Assignment Algorithm
Output:
  {u_i : 1 ≤ i ≤ n}, where each u_i is the utilization vector for Γ_i
  {T_i : 1 ≤ i ≤ n}, where each T_i is the period of Γ_i
  {SR_i : 1 ≤ i ≤ n}, where each SR_i is the success rate for Γ_i at period T_i
Algorithm:
(1)  foreach (resource r_k)
(2)    μ_k ← m_k
(3)  foreach (Γ_i : 1 ≤ i ≤ n) {
(4)    T_i ← ⌈1000/MSR_i⌉
(5)    foreach (τ_{i,j} in Γ_i with r_k = resource(τ_{i,j})) {
(6)      u_{i,j} ← E[t_{i,j}]/T_i
(7)      μ_k ← μ_k − u_{i,j}
       }
(8)    (SR_i, T_i) ← Evaluate_Success_Rate(Γ_i, T_i, u_i)
     }
(9)  S ← {Γ_i : 1 ≤ i ≤ n, SR_i < MSR_i}
(10) while (S ≠ ∅) {
(11)   Find the τ_{i,j} in Γ_i ∈ S, with r_k = resource(τ_{i,j}),
(12)     that maximizes w_{i,j} = ((MSR_i − SR_i)/MSR_i) · μ_k · (1/u_{i,j})
(13)   if (μ_k < Δ) return Failure
(14)   u_{i,j} ← u_{i,j} + Δ
(15)   μ_k ← μ_k − Δ
(16)   (SR_i, T_i) ← Evaluate_Success_Rate(Γ_i, T_i, u_i)
(17)   if (SR_i ≥ MSR_i)
(18)     S ← S − {Γ_i}
(19) }
(20) return Feasible Solution

Figure 5: Top-Level Load-Assignment Algorithm.

Load Assignment Algorithm. The load assignment algorithm is presented in Figure 5. Consider a particular chain Γ_i, and note that T_i (expressed in milliseconds) is initialized to the highest period that could potentially achieve the desired success rate MSR_i (step (4)). Also note that task τ_{i,j}'s resource share u_{i,j} is initialized to accommodate its mean execution time E[t_{i,j}], and that its host resource's capacity is depleted by the same amount (steps (6)-(7)). (Note that the system could only be solved with these initial parameters if all execution times were constant.) The algorithm works by iteratively refining the load-assignment vectors (the u_i's), until a feasible solution is found. The entire algorithm terminates when the success rates for all chains meet their performance requirements, or when it discovers that no solution is possible. The load assignment problem is thus a search over the space of utilization vectors. In order to control the complexity of the search, two restrictions are made on the search procedure: (1) there is no backtracking, and (2) the load assignment of a task is never reduced. Hence, since the entire solution space is not searched, the algorithm is an approximate one; i.e., in some cases the algorithm will not find a feasible solution, even if one exists.

At each iteration step, the algorithm increases a selected task's utilization by an incremental amount Δ, which is a tunable parameter. (For the results presented in this paper, we set Δ = 0.05.) Clearly, a smaller Δ will lead to a greater likelihood of finding a feasible assignment, but it will also incur a higher computational cost. The heart of the algorithm can be found on lines (11)-(12), where all of the remaining unsolved chains are considered, with the objective of assigning additional load to the "most deserving" task in one of those chains. This selection is made using a heuristic weight w_{i,j}, reflecting the potential usefulness of increasing τ_{i,j}'s utilization in the quest toward finding a feasible solution. The weight actually combines three factors, each of which plays a part in achieving feasibility. The first factor is the relative increase in success rate required for the task's chain to meet its performance requirement, i.e., (MSR_i − SR_i)/MSR_i, where SR_i is the current success rate for chain Γ_i. The second factor is the remaining capacity μ_k of the resource on which the task is allocated. The final factor is the inverse of the current load assignment of the task; i.e., if the task already has a high load assignment, working on one of the other tasks in that chain would probably be more beneficial. The heuristic function is given below:

  w_{i,j} = ((MSR_i − SR_i)/MSR_i) · μ_k · (1/u_{i,j})

After additional load is given to the selected task, Evaluate_Success_Rate is again called to determine its chain's new period and success rate. And if the chain meets its minimum success rate, it is removed from further consideration.

Rate Assignment. Evaluate_Success_Rate derives a feasible period (if one exists) for a chain under its current load assignment. While the problem seems straightforward, there are non-linearities introduced due to two factors. First, in our analysis, we assume that the effective maximum delay is ⌊MD_i/T_i⌋ · T_i; i.e., this is an artifact of our discrete-time model. Secondly, the true, usable load for a task is given by ⌊u_{i,j} · T_i⌋/T_i, which assumes that the system cannot handle proportional-share queueing at arbitrarily fine granularities of time. The negative effect of the first factor is likely to be higher at larger periods; it results in neglecting the fractional part of the final period assessed to the overall delay. On the other hand, the negative effect due to the second factor is likely to be larger at smaller periods (although it decreases when a substantial load is given to every task in the chain). Due to these non-linearities, a search algorithm is needed to find a suitable period for the chain.

Our approximation utilizes a few simple rules. As we stated, when a task's load increases, the negative effects induced by a smaller period decrease; so under these circumstances, it is more likely that an optimal solution exists at a smaller period. We use this information, in conjunction with the fact that the load-assignments are monotonically increased by the top-level algorithm. Hence, we restrict the search to periods that are lower than the current period. We further restrict the search to periods for which the fractional lost delay Δ_lost is no more than α times the current period, where α < 1. Using these rules, the algorithm utilizes the throughput analysis presented in Section 3 to determine the success rate for a candidate period. The algorithm is presented in Figure 6.

Algorithm 4.2 Evaluate_Success_Rate(Γ_i, T_i, u_i)
Input:
  Chain Γ_i
  Utilization vector u_i of Γ_i
  Period T_i for Γ_i
Output: Success rate SR_i for Γ_i, and its associated period T_i
Algorithm:
(1) SR_i ← get_success_rate(Γ_i, T_i, u_i)
(2) for (t ← T_i − 1; t > 0; t ← t − 1)
(3)   if ((MD_i mod t) < (α · t)) {
(4)     SR′ ← get_success_rate(Γ_i, t, u_i)
(5)     if (SR′ > SR_i) {
(6)       SR_i ← SR′
(7)       T_i ← t
        }
      }
(8) return (SR_i, T_i)

Figure 6: Find period and success rate for Γ_i.

A. Synthesized Solutions for Chains.
  Γ_i   T_i   SR_i    u_i
  Γ_1   3     11.33   [u_{1,1} = 0.33, u_{1,2} = 0.33, u_{1,3} = 0.33]
  Γ_2   16    5.50    [u_{2,1} = 0.188, u_{2,2} = 0.188, u_{2,3} = 0.25, u_{2,4} = 0.188, u_{2,5} = 0.188, u_{2,6} = 0.25]
  Γ_3   5     5.26    [u_{3,1} = 0.2, u_{3,2} = 0.2, u_{3,3} = 0.2, u_{3,4} = 0.2, u_{3,5} = 0.2, u_{3,6} = 0.2, u_{3,7} = 0.2, u_{3,8} = 0.2]
  Γ_4   20    5.47    [u_{4,1} = 0.25, u_{4,2} = 0.2, u_{4,3} = 0.2, u_{4,4} = 0.15, u_{4,5} = 0.25]
  Γ_5   10    5.39    [u_{5,1} = 0.10]
  Γ_6   5     6.91    [u_{6,1} = 0.20, u_{6,2} = 0.20]

B. Resource Capacity Used by System.
  r_k:    1     2     3     4     5     6     7     8     9     10
  used:   0.72  0.39  0.45  0.4   0.65  0.52  0.35  0.62  0.65  0.65

Figure 7: Solution of the Example System.
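The two coupled heuristics can be sketched compactly. In the sketch below, `get_success_rate` is a placeholder returning a made-up monotone estimate (not the paper's Markov-chain analysis), the system is simplified to a single shared resource, and `ALPHA` is an assumed value for the fractional-lost-delay bound.

```python
import math

DELTA = 0.05      # utilization increment (the paper uses 0.05)
ALPHA = 0.5       # assumed bound on fractional lost delay (alpha < 1)

def get_success_rate(chain, T, u):
    # Placeholder for the Section 3 throughput analysis: rewards higher
    # utilization and shorter periods, capped at 1000/T outputs per second.
    p = min(1.0, sum(u.values()) / len(u))
    return p * (1000.0 / T)

def evaluate_success_rate(chain, T, u, MD):
    # Algorithm 4.2: search periods below T whose fractional lost delay
    # is at most ALPHA * t, keeping the best success rate found.
    best_sr, best_T = get_success_rate(chain, T, u), T
    for t in range(T - 1, 0, -1):
        if (MD % t) < ALPHA * t:
            sr = get_success_rate(chain, t, u)
            if sr > best_sr:
                best_sr, best_T = sr, t
    return best_sr, best_T

def load_assignment(chains, capacity):
    # Algorithm 4.1, simplified to one shared resource: greedy,
    # no backtracking, utilizations only ever increase.
    state, mu = {}, capacity
    for name, ch in chains.items():
        T = math.ceil(1000.0 / ch["MSR"])            # step (4)
        u = {j: ch["E"][j] / T for j in ch["E"]}     # steps (6)-(7)
        mu -= sum(u.values())
        sr, T = evaluate_success_rate(name, T, u, ch["MD"])
        state[name] = {"T": T, "u": u, "SR": sr}
    unsolved = {n for n in chains if state[n]["SR"] < chains[n]["MSR"]}
    while unsolved:
        if mu < DELTA:
            return None                              # Failure
        # Pick the task maximizing the heuristic weight w_{i,j}.
        n, j = max(((n, j) for n in unsolved for j in state[n]["u"]),
                   key=lambda nj: ((chains[nj[0]]["MSR"] - state[nj[0]]["SR"])
                                   / chains[nj[0]]["MSR"])
                                  * mu / state[nj[0]]["u"][nj[1]])
        state[n]["u"][j] += DELTA
        mu -= DELTA
        state[n]["SR"], state[n]["T"] = evaluate_success_rate(
            n, state[n]["T"], state[n]["u"], chains[n]["MD"])
        if state[n]["SR"] >= chains[n]["MSR"]:
            unsolved.discard(n)
    return state
```

Because μ shrinks by Δ on every iteration, the loop always terminates, either with a feasible assignment or with Failure.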

Design Process - Solution of the Example. We ran the algorithm in Figure 5 to find a feasible solution, and the result is presented in Figure 7. The tables show the utilizations given to all tasks, the periods and success rates for the chains, and the used capacity for all resources. Note that the spare resource capacity can be used to increase any chain's success rate, if desired.

5 Simulation

Since the throughput analysis uses some key simplifying approximations, we validate the resulting solution via a simulation model. Recall that the analysis injects imprecision for the following reasons. First, it tightens all (end-point) output delays by rounding up the fractional part of the final period. Analogously, it assumes that a chain's state-changes always occur at its period boundaries; hence, even intermediate output times are assumed to take place at the period's end. And note that while the runtime system has the discretion of scheduling a task as late as its period allows, this will not occur for all tasks running on a resource. A further approximation is inherent in our compositional data-age calculation; i.e., we assume the success ratios from predecessor tasks are i.i.d., allowing us to solve the resulting Markov chains in a quasi-independent fashion.

The simulation model discards these approximations, and keeps track of all tagged computations through the chain, as well as the "true" states they induce in their participating tasks. Also, the simulator's clock progresses along the real-time domain; i.e., if a task ends in the middle of a period, then its output is placed in its successor's input buffer at that time. Moreover, each resource is scheduled by a modified EDF-driven runtime system (where a task's deadline is considered the end of its period). Hence, higher-priority tasks will get to run earlier than the analytical method assumes; i.e., the analysis allows that computations may take place as late as possible within a given period. On the other hand, the simulator does inherit some other simplifications used in our design. For example, reading an input buffer is assumed to happen when a task gets released, i.e., at the start of a period. As in the analysis, context-switch overheads are not considered.

Verification of the Design. The following table compares the estimated success rates with those derived via simulation, where all chains are run for at least 10,000 periods, and success rates are evaluated every second.

  Output  Period  SR (Analysis)  SR (Simulation)
  Y1      3       11.33          11.50
  Y2      16      5.50           5.50
  Y3      5       5.26           5.39
  Y4      20      5.47           5.79
  Y5      10      5.39           5.46
  Y6      5       6.91           6.93
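A quick check of the relative error between the two columns (table values transcribed above) confirms that the analysis is consistently the lower of the two estimates:

```python
# Analysis vs. simulation success rates from the verification table.
rows = {"Y1": (11.33, 11.50), "Y2": (5.50, 5.50), "Y3": (5.26, 5.39),
        "Y4": (5.47, 5.79), "Y5": (5.39, 5.46), "Y6": (6.91, 6.93)}
rel_err = {k: (sim - ana) / ana for k, (ana, sim) in rows.items()}
worst = max(rel_err, key=lambda k: abs(rel_err[k]))
# Every simulated rate is at least the analytical one, so the analysis errs
# on the conservative side; the largest gap (~6%) is on chain Y4.
```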

Note that the resulting (simulated) system satisfies the minimum success rates of all chains, so we have a satisfactory solution. If desired, we can improve it by doling out the excess resource capacity, i.e., by simply iterating through the design algorithm again. Perhaps more interesting, however, is Figure 8, which compares the simulated and analytic results for multiple periods, assigned to three selected chains. Here, only the periods are changed; the system utilization is fixed throughout. Moreover, when we vary the periods for the chain under consideration, we keep the rest of the system fixed, and use the parameters from the solution given above. The simulation runs displayed here ran for 10,000 periods at the largest period under observation (e.g., on the graph, if the largest period tested is denoted as 200 ms, then that run lasted for 2000 seconds). Also, success rates were measured over every one-second interval, and the variance in the success rates (over one-second intervals vs. the mean) was calculated accordingly.

The combinatorial comparison allows us to make the following observations. First, note that a chain's success rate usually increases as its period decreases, up until the point where the system starts to inject a significant "multiplexing tax." Recall that we do not assume the runtime system can multiplex at infinitesimal granularities of execution time; rather, a task's utilization is allocated in integral units over each period. Second, the relationship between variance (measured over one-second intervals) and success is not a direct one. Note that for some chains (e.g., Γ_2), success-rate variance tends to increase at both higher and lower periods. This reflects two separate facts, the first of which is fairly obvious, and is an artifact of our measurement process. At lower periods (i.e., higher rates), output successes are sampled much more frequently.
And when a chain truly tends toward having a uniform output success probability (as longer chains do at higher output rates), taking more readings per second can arbitrarily lead to higher variances. In other words, the number of successes per second for chain Γ_2 converges on having a Binomial distribution, with a variance of λ_2(1 − λ_2)/T_2.


Figure 8: Analysis vs. simulation: SRi for ?2 ,?3,?6; simulated variance of SRi over 1 second intervals. But at large periods (i.e., low processing rates), another e ect comes into play. Recall that tasks are only dispatched periodic boundaries, and only when they have input waiting from their predecessors. Hence, if a producer overruns its period by even a slight amount, its consumer will have to wait for the next period to use the data { which consequently adds to the data's age. And in a longer chain, this time will start to accumulate, causing many outputs to exceed their maximum delay constraints. On the whole, however, we note that the simulated variances are not particularly high, especially considering the fact that they include the one-second intervals where measured success rates actually exceed the mean. Also, while success rates decrease as periods increase, the curves are not smooth, and small spikes can be observed in both the simulated and analytical results. Again, this e ect is due to the \multiplexing tax" discussed above, injected since executiontime budgets are integral. For example if a task's assigned utilization is 0.11, and if its period is 10, then its execution-time budget will be 1 { resulting in an actual, usable utilization of 0.1. However, the success rates do monotonically decrease for larger periods when we only consider candidates that result in integral execution-time budgets. Next, we note that the di erence between analysis and simulation is larger when a period does not divide

maximum permissible delay. Again, this is the result of rounding up the fractional part of a computation's nal period, which will cause us to overestimate the output's age.
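Both rounding effects are easy to quantify. The sketch below uses the utilization example from the text (0.11 at period 10) and chain Γ_6's delay bound; the period 70 in the last line is a hypothetical value chosen to show a non-dividing period.

```python
import math

def usable_load(u, T):
    # Execution-time budgets are integral: usable load = floor(u*T)/T.
    return math.floor(u * T) / T

def lost_delay(MD, T):
    # Effective maximum delay is floor(MD/T)*T; the remainder is "lost".
    return MD - math.floor(MD / T) * T

tax = 0.11 - usable_load(0.11, 10)    # budget 1 at period 10 -> usable 0.10
loss_exact = lost_delay(300, 60)      # T divides MD: no delay budget lost
loss_frac = lost_delay(300, 70)      # 300 = 4*70 + 20: 20 ms of budget lost
```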

6 Conclusion

We presented a design scheme to achieve stochastic, real-time performance guarantees by adjusting system utilizations and processing rates. Our solution exploits both discrete-time Markovian analysis and real-time scheduling theory. The solution strategy uses several approximations to avoid modeling the entire system; moreover, our search algorithm makes use of two heuristics, which help in significantly reducing the number of potential feasible solutions. In spite of the approximations, however, our simulation results are promising: they show that the approximated solutions are sufficiently close to be dependable, and that the error is typically on the conservative side. As we explained above, this is usually due to the per-period assumptions about when a task can end, and the amount of system load it can use.

Much work remains to be done. First, we plan on extending the model to include a more rigid characterization of system overhead, due to varying degrees of multiplexing. Currently we assume that execution time can be allocated in integral units; other than that, no specific penalty functions are included for context switching, or for network-switch overhead. We also plan on getting better approximations for the "handover time" between tasks in a chain, which will result in tighter analytical results. Finally, we plan on benchmarking "true" per-packet delay times on suitable LAN-based testbeds, and using the PDFs derived from those benchmarks.

