An Evaluation of the Dynamic and Static Multiprocessor Priority Ceiling Protocol and the Multiprocessor Stack Resource Policy in an SMP System

Jim Ras and Albert M. K. Cheng
Department of Computer Science, University of Houston

Abstract
There has been significant study of implementations of a variety of priority-inversion control algorithms in uniprocessor systems, but there has been far less work on multiprocessor implementations of these algorithms. Herein, we present such an evaluation of the Multiprocessor Priority Ceiling Protocol (MPCP) and the Multiprocessor Stack Resource Policy (MSRP). To our knowledge, no such empirical evaluation of these two policies has been conducted before. We show that the results differ from previous simulation-based studies and that both policies are more or less equally effective; the main difference is the MSRP's expense. We discuss the efficacy of Ada-2005 and C/POSIX, as well as the methods through which we have attempted to overcome Ada's weakness in mapping tasks to processors.

∗ Supported in part by the National Science Foundation under Award CNS-0720856, a grant from the Institute for Space Systems Operations, and GEAR Grant I092831-38963.

1. Introduction

There are implementations of a variety of priority-inversion control algorithms in single-processor systems running real-time operating systems (such as Quadros [27] and VxWorks [28]), and their performance is well known and understood. However, there has been little work on multiprocessor implementations of these algorithms. Thus, following the examples set by prior studies, we test two such algorithms, the MPCP [23] and the MSRP [15], by means of implementations rather than simulations alone. We test each protocol with both the Rate-Monotonic (RM) and the Earliest Deadline First (EDF) scheduling algorithms. We discuss the efficacy of using Ada and C/POSIX, some of the challenges faced, and the various methods by which we have attempted to overcome Ada-2005's weakness in mapping tasks to processors.
1.1. Motivations for Current Work

To our knowledge, no empirical multiprocessor-based evaluation of these two policies had been conducted prior to this work. However, a number of studies related to our own research are of relevance [3, 6, 12, 15, 20, 22, 23, 24]. We were also mindful of a study [15] that appeared to duplicate our intended research. In private communications with the authors, however, we learned that the study's results were not duplicative of ours, because they were based solely on simulations rather than on any actual implementation, a fact that is not generally known; what is missing is how the simulation was done (no code was actually written). Moreover, that paper addresses a situation quite different from ours, since it focuses on a very specific embedded platform, while our paper focuses on a general-purpose computer system. Our evaluation is based on an actual implementation. We first implemented and tested the protocols using C/POSIX. Although these implementations were new and different from prior work, we wanted to focus on the new real-time Ada-2005 language, so we implemented the protocols again in Ada. We used an Ada runtime environment and compiler running on a tightly coupled SMP architecture hosting a version of Linux 2.6.18; the x86_64 system has eight processors. Although the Ada tests were similar to the C/POSIX implementations, the work adds to the research community because we have designed an API using C/POSIX to map Ada tasks to processors; no previous work on Ada has shown how to use such processor affinity for maximum performance. Moreover, both of our results differ from those in [15]. This could be because their tests were based on RM scheduling for the MPCP but on EDF for the MSRP, which may have introduced some error into the results, as EDF is known to be the optimal preemptive scheduling method. On the other hand, RM can be as efficient for harmonic task sets (when EDF's overhead is considered). Accordingly, we test with both the RM and the EDF scheduling algorithms. In these respects, our work differs from other studies.
1.2. Organization

The paper is organized as follows. Section 2 explains the problem and the properties of the MPCP and MSRP protocols. Section 3 describes the various methods we attempted for mapping tasks to processors. Section 4 presents two case studies to support our analysis. Section 5 presents the results obtained. Finally, Section 6 offers our conclusions and perspectives on future work.
2. Preliminaries

In this section, we discuss the problem in depth: what it is and what others have done. Before we proceed further, however, we define some terms that we will use in this paper. Due to resource sharing among tasks, priority inversion can occur. Priority inversion occurs when a higher-priority task is blocked, waiting for a resource
being used by a lower-priority task, which itself has been preempted by an unrelated medium-priority task. In this situation, the higher-priority task's priority level has effectively been inverted to the lower-priority task's level. The Priority Ceiling Protocol (PCP) minimizes this blocking by guaranteeing that a high-priority task can be blocked by at most one critical region of any lower-priority task. It is helpful to summarize the different types of blocking that may occur:

1. Direct blocking: occurs when a high-priority task tries to acquire a resource already held by a lower-priority task.

2. Push-through blocking: occurs when a medium-priority task is blocked by a lower-priority task that has inherited a higher priority from a task it directly blocks.

3. Ceiling blocking: a task is prevented from entering a critical section by the ceiling of an active resource. This helps prevent deadlocks and chained blocking.

4. Remote blocking: in multiprocessor environments there is an additional type of blocking, which occurs when a task has to wait for the execution of another task, of any priority, assigned to another processor.

Accordingly, we must add remote blocking to the list of challenges in a multiprocessor environment.
2.1. Related work
A great deal of effort has been spent investigating extreme forms of unbounded blocking¹. In 1987, building upon the schedulability work originally developed by Liu and Layland [19], Sha et al formulated the concept of priority inheritance to solve the priority inversion problem [24]. One of the proposed protocols was the PCP. Rajkumar subsequently developed the MPCP [23]; if the number of processors equals one, the MPCP reduces to the PCP.
Chen and Tripathi [12] extended the MPCP so that it can be used with EDF scheduling. However, this version of the MPCP is no different from any other MPCP described in the literature: the authors simply extended the PCP so that resources that provide the same or similar services are grouped together. Thus, a limit is placed on the degree of sharing in order to reduce worst-case blocking.
In the fundamental work of Liu and Layland [19], a set of n periodic tasks can be scheduled by the EDF scheduling algorithm iff the total utilization U satisfies

$$U = \sum_{i=1}^{n} \frac{C_i}{P_i} \le 1,$$

where $C_i$ is the worst-case computation time and $P_i$ the period of task $i$.

The PCP defines the priority ceiling of a critical resource as the priority of the highest-priority task that can ever use it. Thus, if an application consists of a set $\Omega = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of real-time tasks, and each task $\tau_i$ is characterized by a priority $p_i$, then the ceiling attached to each resource $r_k$ is $ceil(r_k) = \max_i \{p_i \mid \tau_i \text{ uses } r_k\}$.
Chen and Lin [13] extended this result to the dynamic PCP:

$$\sum_{i=1}^{n} \frac{C_i + B_i}{P_i} \le 1,$$

where $B_i$ is the worst-case blocking time of task $i$.
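This admission test is easy to mechanize. The following is a minimal sketch in C; edf_feasible and its parameter names are ours, for illustration only, and are not part of the implementation described in this paper:

/* Sufficient EDF test with blocking, after Chen and Lin [13]:
 * the task set is accepted iff sum_i (C_i + B_i) / P_i <= 1.
 * c, b, and p hold the worst-case execution times, worst-case
 * blocking times, and periods of the n tasks. */
static int edf_feasible(const double *c, const double *b,
                        const double *p, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += (c[i] + b[i]) / p[i];
    return u <= 1.0;
}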
Since the periods of higher-priority tasks are shorter than the periods of lower-priority tasks, minimizing the remote blocking of high-priority tasks keeps schedulability loss to a minimum. For more explanation of the composition of the blocking factors, please refer to Chen and Lin [13].
¹ The direct application of a simple synchronization variable for sharing critical data between tasks may result in a high-priority task being blocked by a lower-priority task for an unbounded amount of time. This type of unbounded blocking is usually referred to as priority inversion.

2.2. The Multiprocessor Priority Ceiling Protocol

The Multiprocessor Priority Ceiling Protocol (MPCP) extends the PCP to multiprocessor systems and reduces the remote blocking problem. In order to extend the priority ceiling protocol to multiprocessor systems, the priority ceiling must be extended to include the key notion of a global priority ceiling. Global semaphores, those protecting a resource shared by tasks on different processors, must be assigned a ceiling that is higher than the priority of any task in the system. For example, if $P_H$ is the assigned priority of the highest-priority task among all the tasks in the system, then $P_H + 1 + \max_i \{p_i \mid \tau_i \text{ uses } r_k\}$ is the priority ceiling of the semaphore protecting the global resource $r_k$ (a sketch of this computation follows the list below). Further important design issues of the MPCP are given below:

1. The PCP is used for requests to local semaphores, i.e., semaphores protecting a resource shared by tasks on the same processor.

2. If a task T requests a global semaphore SG that has already been locked, T is added to a prioritized queue. T remains at its current active priority as it is inserted in the queue.

3. If a task T is blocked but has not locked any semaphores, then tasks with a priority lower than T's may run and try to complete.

4. When a task T releases a global semaphore SG, the highest-priority task TH waiting in the queue becomes eligible for execution and may lock SG.
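The global-ceiling rule above can be coded directly. Here is a minimal C sketch under our own naming (mpcp_global_ceiling, prio, and uses are illustrative, not from the paper's code):

/* Global ceiling of the semaphore guarding resource r_k under
 * the MPCP rule quoted above: P_H + 1 + max{ p_i : task i uses r_k }.
 * Larger numbers mean higher priority; uses[i] is nonzero if
 * task i accesses r_k. */
static int mpcp_global_ceiling(const int *prio, const int *uses,
                               int n, int p_h)
{
    int max_user = 0;
    for (int i = 0; i < n; i++)
        if (uses[i] && prio[i] > max_user)
            max_user = prio[i];
    return p_h + 1 + max_user;
}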
2.3. The Multiprocessor Stack Resource Policy

Baker's Stack Resource Policy (SRP) [2] is similar to the PCP. However, each task under the SRP is assigned a preemption level. Preemption levels reflect the relative deadlines of the tasks: the shorter the deadline, the higher the preemption level. At run time, resources are given ceiling values based on the maximum preemption level of the tasks that use them. When a task is released, it can preempt the currently executing task only if its absolute deadline is shorter and its preemption level is higher than the highest ceiling among the currently locked resources. The result of this protocol is almost identical to the PCP: tasks suffer only a single blocking, deadlocks are prevented, and a simple formula is available for calculating the blocking time.
Gai et al [15,16] extended the SRP to the Multiprocessor SRP (MSRP). The MSRP lets tasks use local critical resources under the SRP policy. If a task tries to access a global critical resource that is already locked by a task on another processor, the task performs a busy wait (called a spin lock). Further important design characteristics of the MSRP are given below:

1. The SRP is used for all requests to local semaphores.

2. When a task locks a global critical resource, the task becomes non-preemptable. If the critical resource is already locked, the task is inserted in an FCFS queue.

3. When a task releases the global resource, it once again becomes preemptable. The MSRP then checks the FCFS queue; if a task is waiting in the queue, it becomes eligible for execution and can lock the global resource.

The key idea is that when a task needs an unavailable resource, the task is blocked at the time of an attempted preemption. As a result, the SRP saves unnecessary context switches by blocking earlier.
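The FCFS busy-wait on a global resource is naturally expressed as a ticket lock. The C sketch below, using GCC/Clang atomic builtins, is our illustration of the idea rather than the implementation evaluated in this paper; making the task non-preemptable before spinning, as the MSRP requires, is left to the caller:

/* FCFS spin lock (ticket lock) for an MSRP global resource.
 * Initialize both fields to zero.  Each task takes a ticket and
 * busy-waits until its ticket is served, so tasks acquire the
 * resource strictly in arrival (FCFS) order. */
typedef struct { unsigned next; unsigned serving; } ticket_lock;

static void msrp_spin_lock(ticket_lock *l)
{
    unsigned t = __atomic_fetch_add(&l->next, 1, __ATOMIC_ACQ_REL);
    while (__atomic_load_n(&l->serving, __ATOMIC_ACQUIRE) != t)
        ;  /* spin: the task busy-waits, it never suspends */
}

static void msrp_spin_unlock(ticket_lock *l)
{
    __atomic_fetch_add(&l->serving, 1, __ATOMIC_RELEASE);
}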
3. Ada-2005

Ada-2005, in its annexes on systems programming and real-time systems, provides many of the facilities necessary for implementing the MPCP and the MSRP efficiently. For example, Ada's Immediate Ceiling Priority Protocol (ICPP) is like the SRP but without preemption levels. With the SRP, the ICPP, and the PCP, mutual exclusion is guaranteed. Unlike under the PCP, however, a task that locks a critical resource has its priority raised to the ceiling priority of that resource for the entire protected operation. Consider the following example and Figure 1:
1. Task T1 arrives and locks the shared resource S.

2. Task T2 then arrives. It has a priority higher than T1's, but lower than the ceiling priority of S.

Figure 1: Schedule for the example task set.

With the PCP, T2 can run, since T1 has not blocked any higher-priority task; the priority of T1 does not change even while it holds resource S. Thus, task T2 preempts T1 and executes. With the SRP/ICPP, however, T2 is blocked, since T1 inherits the ceiling priority of S; thus T2 needs more time to finish.
The problem of sharing resources between tasks was briefly discussed in the first section. Ada-2005 gives direct support to protected shared data through the protected object. Although protected objects are a high-level construct, they enable very efficient implementations of various semaphores and other similar paradigms. Protected objects serve their intended purpose well: they bring the speed of low-level primitives without the risks incurred by the use of such unstructured primitives. For example, on a Linux kernel compiled with SMP support, protected objects' locks are implemented with Linux pthreads (the Linux implementation of the POSIX threading API). It is possible to return immediately if a lock is busy. With an OS supporting the shared-memory SMP model, protected objects help with multiprocessor parallelism.
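For comparison with the Ada construct, ICPP-style ceiling locking is also available to a C/POSIX implementation through the PRIO_PROTECT mutex protocol. The sketch below only shows how such a mutex could be configured; it is our illustration, not code from the paper's implementation:

#include <pthread.h>

/* A mutex with the PRIO_PROTECT protocol raises the locking
 * thread to the given priority ceiling for the whole critical
 * section, the POSIX analogue of Ada's ICPP.  Error handling
 * is elided in this sketch. */
static int make_ceiling_mutex(pthread_mutex_t *m, int ceiling)
{
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setprotocol(&a, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&a, ceiling);
    return pthread_mutex_init(m, &a);
}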
It may seem that the logical choice is to build each critical resource using one protected object. However, before granting a lock, we must check the status of every other lock in the system to ensure that the locking conditions are satisfied. Thus, we use a server task (another process) for granting lock requests. The protected object Resource_Scheduler implements the concurrent server:

type Scheduler is protected interface;
procedure Request is abstract;
procedure Release is abstract;
protected type Resource_Scheduler is new Scheduler …

The scheduler provides only two operations in its public interface: Request and Release. It provides four other operations in its private interface: Queue1, Queue2, Check_Ceiling_Level, and Check_Nesting. If a task Ti wants to execute the critical section protected by the semaphore Sj, it executes:

Request(Sj);  -- P(Sj), or Requeue(Sj), or raise Ceiling_Error
Release(Sj);  -- V(Sj), or raise Release_Error on an unauthorized release
The exception Ceiling_Error is raised when the basic rules of the MPCP or the MSRP policy are violated. Release_Error is raised when a task attempts to release a semaphore that it did not lock. If the semaphore cannot be locked, the following two steps are performed in the given order: (1) with the MPCP, the active priority of the task that holds the semaphore is raised to the ceiling of the lock, but only if its active priority was previously lower (with the MSRP, a task that holds a lock always has its active priority elevated to the ceiling of the lock); and (2) the calling task is requeued for future completion. When a task releases a semaphore S, its priority is reset to its priority at lock time, and S becomes available for locking. If a task is waiting in the queue for S, the locking conditions may now be satisfied. Two parameters are always needed with each Request and Release operation: the id of the task that calls the protected operation, and the index of the semaphore the task wants to lock or release.
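In C-like form, the grant path just described might look as follows. The types and helpers (struct task, sem_rec, raise_to, requeue) are hypothetical names of ours, introduced only to make the two-step rule concrete:

struct task { int prio; };
typedef struct { int locked; int ceiling; struct task *holder; } sem_rec;

/* Helpers assumed to exist elsewhere in the runtime (hypothetical). */
void raise_to(struct task *t, int prio);
void requeue(struct task *t, sem_rec *s);

static void request(struct task *caller, sem_rec *s, int is_msrp)
{
    if (!s->locked) {               /* locking rules satisfied: grant */
        s->locked = 1;
        s->holder = caller;
        if (is_msrp)                /* MSRP: always run at the ceiling */
            raise_to(caller, s->ceiling);
        return;
    }
    /* Step (1): under the MPCP, lift the holder only if it is
     * currently below the ceiling of the lock. */
    if (!is_msrp && s->holder->prio < s->ceiling)
        raise_to(s->holder, s->ceiling);
    requeue(caller, s);             /* step (2): wait for a later grant */
}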
3.1. Allocating Tasks to Processors

The MPCP and the MSRP employ a distributed rather than a centralized scheduling model. Tasks can be interleaved on one processor or run in parallel on multiple processors. The Linux kernel scheduler enforces soft processor affinity, which simply means that tasks normally do not migrate between processors; tasks that seldom migrate incur less overhead. For this work, however, we need a mechanism that lets us programmatically enforce hard processor affinity, meaning that the MPCP and the MSRP can explicitly specify on which processor a given task may run.
Unfortunately, Ada-2005 has not added new facilities to support programs running on systems with more than one processor [4]. The Ada Reference Manual (ARM) permits an implementation to run a program on systems with more than one processor, but Ada provides no facilities that give programmers the ability to map a task onto a specific processor. The following ARM quote shows the approach:
“Concurrent task execution may be implemented on multicomputers, multiprocessors, or with interleaved execution on a single physical processor. On the other hand, whenever an implementation can determine that the required semantic effects can be achieved when parts of the execution of a given task are performed by different physical processors acting in parallel, it may choose to perform them in this way.” (ARM Section 9, paragraph 11 [5])

This means that a task may be executed in parallel only if the effect is as if it had executed sequentially. At the moment, there has been no standardization of support for SMP systems. However, we took an approach that uses the pragma feature to modify the Ada runtime system: we were able to interface Ada to C functions that work with the underlying system libraries.

package SMP is
   -- not all features shown
   function available_cpus return integer;
   pragma Import(C, available_cpus);
   -- sets the affinity for a processor
   function set_affinity(cpu_n: natural; pid: natural) return integer;
   pragma Import(C, set_affinity);
   -- returns the task's affinity
   function get_affinity return integer;
   pragma Import(C, get_affinity);
end SMP;

In the Linux kernel, processes have a data structure associated with them called the task_struct. This structure is important to our work because it contains the cpus_allowed bitmask, which consists of n bits, one for each of the n logical processors in the system. For example, a system with eight processors has eight bits. If a given bit is set for a given task, that task may run on the associated processor. The API allows programmers to change or view the current bitmask. For example, the set_affinity function is coded as follows:
int set_affinity(int cpu_n, pid_t pid)
{
    int num_procs = sysconf(_SC_NPROCESSORS_CONF);
    if ((cpu_n < 0) || (cpu_n > (num_procs - 1)))
        return Affinity_Error;
    cpu_set_t mask;
    // clear all the bits in the mask
    CPU_ZERO(&mask);
    // set only the bit corresponding to cpu_n
    CPU_SET(cpu_n, &mask);
    // set the CPU affinity mask of the task denoted by pid
    return sched_setaffinity(pid, sizeof(mask), &mask);
}
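A short usage sketch of ours: a process pins itself to one processor before starting its real-time work. The CPU number 2 is arbitrary; set_affinity passes through the result of sched_setaffinity, so zero indicates success:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

extern int set_affinity(int cpu_n, pid_t pid);

int main(void)
{
    /* pin the calling process to CPU 2 via the wrapper above */
    if (set_affinity(2, getpid()) != 0) {
        fprintf(stderr, "could not set affinity\n");
        return 1;
    }
    printf("pinned to CPU 2\n");
    return 0;
}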
There are two configuration models for the Ada tasking runtime system. One model uses a single process with its own multi-threading executive, based on the FSU (Florida State University) threads library [29]; in this model there is one process and one thread at the OS level. The other model uses native threads; in this model there is one process and several threads at the OS level. In Unix/POSIX-compliant systems, the thread id space and the process id space should be distinct, and all threads of a given program should have the same process id.
Our objective here was to find a way to partition threads among the processors without altering how Ada protects and serializes critical data updates. In the FSU model, all Ada tasks operate under one process id. This is often more efficient than operating with separate processes, since less context-switching overhead is incurred. Our first approach was to create an application based on the FSU model. The application was divided into partitions, as in a distributed application, and each partition was mapped to a different processor. The partitions ran independently, except when communicating. This approach can be successful, but it does not easily lend itself to tight scheduling analysis.
Our second approach was to map each task to a processor. However, the Ada compiler available for the SMP system did not support the native threads model, in which there could have been several processes, each with its own process id. So a decision was made to emulate its effects without depending on run-time support. The solution we formulated is based on the boss/worker model described by Hennessy and Patterson [17]. There are many variations, such as the work-crew model, where the workers cooperate on a single task, each performing a small piece; nearly every management arrangement used by humans can be employed by multithreaded programs. We call ours the lazy-boss (or lazy-task) and worker scheduling model. In this model, each task functions as a boss, because it creates worker threads to perform most of its work. Each boss assigns its clone (worker) either its entire workload or just the critical section. Each worker is then mapped to a processor. The worker performs its assigned duty until finished, and then notifies the boss that it is ready to accept more work. To ensure that workers are not heavily loaded, we assign each worker only the workload of its parent. An alternative approach to the load-balancing issue is for the worker simply to terminate, and for the boss to spawn the next worker.
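A minimal pthread sketch of this lazy-boss/worker arrangement, assuming the affinity wrapper shown earlier; struct job, worker, and boss are our illustrative names:

#include <pthread.h>
#include <sys/types.h>

extern int set_affinity(int cpu_n, pid_t pid);

struct job { int cpu; void (*work)(void); };

static void *worker(void *arg)
{
    struct job *j = arg;
    set_affinity(j->cpu, 0);   /* pid 0 = the calling thread */
    j->work();                 /* perform the boss's workload */
    return 0;                  /* "notify" the boss through join */
}

static void boss(struct job *j)
{
    pthread_t t;
    pthread_create(&t, 0, worker, j);  /* clone a worker, pinned to j->cpu */
    pthread_join(t, 0);                /* the lazy boss waits for the result */
}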
4. Implementation Notes and Case Studies

In this section and the next, we discuss the strengths and weaknesses of our solution. Before we proceed further, however, we discuss the correctness of the design. There are a number of correctness properties that must be met for software implementations such as this one; three are of interest in this effort:

• Freedom from deadlock
• Mutual exclusion
• Freedom from starvation
A deadlock can be formed only by a cycle of tasks waiting for each other; if a task does not hold a lock, it cannot contribute to a deadlock. The PCP and the SRP address and solve the deadlock problem. For each semaphore, a priority ceiling is defined: the priority of the highest-priority task that contains a call to the semaphore. A task cannot lock a semaphore unless its priority is greater than the priority ceilings of all the semaphores that are currently locked. As stated above and shown in [2,24], this completely prevents mutual deadlock, so the code is deadlock-free.
The existence of at most one lock server ensures mutual exclusion. To enter a critical section X, a task requests a lock on the semaphore SX. A task T can lock SX only if (a) SX is not locked, and (b) the priority of T is greater than the ceilings of all the semaphores locked by tasks other than T. If T satisfies both (a) and (b), it locks SX and retains it until done with the critical section. Therefore, if task T is in the critical section X, no other task can lock SX to enter X.

The MPCP and MSRP protocols have freedom from starvation as a requirement; without it, we cannot obtain a bounded blocking time. The highest-priority task gets the resource if it is available. If a task can be starved, that indicates a logic error in the application: either the priority assignment is wrong, or there is a coupling between tasks that needs to be broken. Thus, our design guarantees that every task that calls Request eventually enters the critical section. Two steps are performed in the following order, if necessary: (1) if the resource has already been locked, the active priority of the task that has locked it is updated (if not done already), and (2) the calling task is requeued to the current active queue for future completion of the request.
4.1. Case Study I

Like other authors, we like the GAP (Generic Avionics Platform) task set, a small avionics case study written by Locke et al [20]. There are seventeen tasks in the GAP set, of which all but one are strictly periodic. Task 11 is a sporadic task, whose arrival is assumed to be polled by the tick scheduler; we have assumed a tick scheduler with the same parameters as described by Tindell et al [26]. We chose this case study because our next case study is based on performance metrics such as the percentage of missed deadlines and the average number of context switches, whereas in hard real-time systems performance is usually judged by the worst-case response time. Therefore, in the first case study we want to evaluate the heart of the MPCP and the MSRP: by first looking at how the PCP and the SRP behave, some readers may draw different conclusions than those obtained in our second case study.

We have modified the GAP task set in a way that produces priority inversions when no protocol is used, since there will be task sets that make one protocol appear more effective and vice versa. Tables 1 and 2 summarize the GAP task set, where all times are given in microseconds and C is the worst-case execution time (WCET). The tasks are listed in priority order as described in [26]. When the task set is assigned a deadline-monotonic priority ordering, the average-case response times (ACRT) and worst-case response times (WCRT) are as shown in Tables 1 and 2.
Table 1: GAP task set and worst-case response times (all times in microseconds)

Task   Period T   Deadline D   WCET C   WCRT PCP   WCRT SRP
 1      200000       5000       3000       4180       4180
 2       25000      25000       2100       6480       6480
 3       25000      25000       4200      13180      14930
 4       40000      40000       1000      14280      16030
 5       50000      50000       3000      17580      19330
 6       50000      50000       5000      23344      24694
 7       59000      59000       8000      38848      40198
 8       80000      80000       9000      49648      50998
 9       80000     100000       2000      40780      43630
10       99995     115000       5000      98954     122604
11      200000     200000       1000     138794     140144
12      200000     200000       1000     138860     158510
13      200000     200000       1000     139926     159576
14      200000     200000       3000     144540     144540
15      199995     200000       3100     146488     146488
16     1000000    1000000       1000     147554     147554
17     1000000    1000000       1000     148620     148620
Total                                   1312056    1388506

Table 2: GAP task set and average-case response times (all times in microseconds)

Task   Period T   Deadline D   WCET C   ACRT PCP   ACRT SRP
 1      200000       5000       3000       3236       3236
 2       25000      25000       2100       2704       2704
 3       25000      25000       4200       7804       7834
 4       40000      40000       1000       4207       5214
 5       50000      50000       3000      11959      12018
 6       50000      50000       5000      17764      18199
 7       59000      59000       8000      18798      19681
 8       80000      80000       9000      29628      33424
 9       80000     100000       2000      19763      40260
10       99995     115000       5000      84008      89773
11      200000     200000       1000     115190     119060
12      200000     200000       1000     117776     122786
13      200000     200000       1000     118842     123852
14      200000     200000       3000     107736     115676
15      199995     200000       3100     122894     122894
16     1000000    1000000       1000     147554     147554
17     1000000    1000000       1000     148620     148620
Total                                   1078483    1132785
It is well known that priority inheritance and the PCP have better average-case response times than the stack resource policy (SRP) [2]. Note that in the GAP example, both the average-case and the worst-case response times of the PCP are better than those of the SRP. This does not imply that the PCP is always better, but for this particular task set the SRP did not perform as well. There are optimization schemes that perform a "lazy" priority change, changing the priority only if necessary, i.e., only if a higher-priority task preempts the critical section; this may reduce the average-case overhead of the SRP. Moreover, we heard from an anonymous reviewer of a very recent Linux kernel patch that performs such lazy boosting in its priority-inheritance implementation.
4.2. Case Study II

In order to evaluate the performance of the resource-sharing algorithms further, we present another case study (real-time acquisition and analysis of steel mill signals) that can benefit from our analysis. The system is driven by several computers that control both the motors driving the rollers and the input thickness variations. The acquisition and analysis system performs three main operations: (1) acquiring signal samples from the different embedded real-time sensors and storing them in files for off-line processing (semaphores are employed during the writing process);
(2) acquiring signal samples for immediate processing from three of the controlling computers, which also share a common memory; and (3) generating system messages to monitor the manufacturing process. Signals and messages include signal analysis and statistical data about the product and its quality trends, including system speeds and temperatures, the depth variations in the slabs, and the temperature of each slab. The system belt speed can reach up to 32 m/s. Human operators supervise the system input and output. The main objective of the signal-analysis system is to improve the system operations. Based on this application, we experimented extensively, testing from 10 to 100 tasks in different configurations. The tasks were distributed evenly among the eight processors. The task periods ranged from 50 to 1000 milliseconds. The tasks were periodic, with deadlines equal to their periods, and were divided into three classes:
1. High rate (highest priority): periods from 50 to 100 ms.
2. Medium rate: periods from 125 to 475 ms.
3. Low rate (lowest priority): periods from 500 to 1000 ms.

Tasks were distributed among the three classes as follows: 40% of the tasks belonged to the high-rate class, periodically writing new data to many shared variables; 35% belonged to the medium-rate class, periodically working on the calculations and updating critical data; and the remaining 25% stored data for off-line processing and generated messages and reports. The tasks run independently, except when reading or updating the critical data. Each task updates between 25 and 100 variables. The time a task spends in a critical section ranges from 0.02 to 0.2 milliseconds. The number of critical sections accessed by each task is a random value chosen from the interval (0,4), (1,6), or (2,8), depending on the experiment.
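For concreteness, the workload parameters just described can be collected in a small configuration record; the C struct below and its field names are ours, not from the implementation:

/* Parameters of the synthetic steel-mill workload described above. */
struct workload {
    int    tasks;            /* 10 .. 100 tasks over 8 processors   */
    double share[3];         /* class mix: high / medium / low rate */
    int    period_ms[3][2];  /* per-class period ranges, in ms      */
};

static const struct workload steel_mill = {
    .tasks     = 100,
    .share     = { 0.40, 0.35, 0.25 },
    .period_ms = { { 50, 100 }, { 125, 475 }, { 500, 1000 } },
};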
To illustrate the theory better, consider this simple example:

Task   Period T   WCET C   Blocked   Uses resource
T1        300        70       20         R1
T2        400       100       20         none
T3        790       120        0         R1
To check the schedulability of the task set under the RM algorithm, we apply the sufficient and necessary condition of [24]. With the following inequality tests, we can check whether each task is schedulable:

1) Task T1: check C1 + B1 ≤ 300. Since 70 + 20 ≤ 300, task T1 is schedulable.
2) Task T2: check 2C1 + C2 + B2 ≤ 400. Since 140 + 100 + 20 ≤ 400, task T2 is schedulable.
3) Task T3: check 3C1 + 2C2 + C3 ≤ 790. Since 210 + 200 + 120 ≤ 790, task T3 is schedulable.
Thus, the example satisfies the sufficient and necessary test: a scheduler can feasibly schedule these tasks. The task set also satisfies the simpler, sufficient condition [19]:

$$U = \sum_{i=1}^{3}\left(\frac{C_i}{P_i} + \frac{B_i}{P_i}\right) = 0.7519 \le n\,(2^{1/n} - 1) \approx 0.7798.$$
The tasks presented in our example are not independent: they synchronize with one another in a mutually exclusive manner to share a resource. This requires analyzing each task individually. For example, the worst-case response time of the highest-priority task equals its own computation time (that is, R = C). Other tasks not only suffer interference from higher-priority tasks, but also suffer blocking from lower-priority tasks when priority inheritance is used. Thus, finding the worst-case response time ri is not a trivial job when tasks are allowed to share resources. Contrary to intuition, the worst-case response time of a task is not always found in the first busy period. For example, consider the task set given previously. For simplicity, assume that these tasks execute on the same processor and other tasks on the other processors. The behavior of task T2 illustrates the problem. At time 0, a critical instant, T2 has interference only from T1 and has a response time of 170. At its next release (at time 400), T2 has a response time of only 100, since no other tasks are active. However, at its third release (at time 800), task T3 is still active and locks R1 just before T2 arrives. With the MPCP, T2 preempts T3 and executes, since its period is shorter than T3's; as a result, its response time is only 100, and it completes before T1 arrives. With the MSRP, on the other hand, T2 cannot execute, since T3's priority is raised to the ceiling level of R1 throughout the critical operation. T2 also gets interference from the next release of T1, and with the MSRP, T2's response time is about 190.
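For reference, this interference-plus-blocking computation is the standard fixed-point recurrence of the response-time literature (e.g., Tindell et al [26]):

$$R_i^{(k+1)} = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{P_j} \right\rceil C_j, \qquad R_i^{(0)} = C_i,$$

iterated until $R_i^{(k+1)} = R_i^{(k)}$ (or until the deadline is exceeded), where $hp(i)$ is the set of tasks with higher priority than task $i$. As the example above suggests, later releases within a busy period must also be examined.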
5. Experiments

In this section, we report the results of the experiments that were conducted. We tested the software with GNAT GPL on a Linux system with a tightly coupled SMP architecture, hosting a version of GNU/Linux 2.6.18. The x86_64 system has eight processors and 32 GB of RAM, with four 1 GB DIMMs directly managed by each processor. The Linux kernel was chosen because it was what was available. Moreover, comparing the 'overhead' of each protocol (memory, processor usage, etc.) would not be accurate, because an accurate comparison requires the construction of both algorithms from scratch, starting with the bare machine (machine instructions, kernel level).
Such a construction would be beyond the size and scope of this project. In any case, the intrinsic performance of the protocols is clear, as shown by the previous case study. Furthermore, if we were to build on a particular set of primitives, such as Linux pthreads or condition variables, the performance would be biased by the way those primitives are implemented, which would naturally make the implementation of one protocol on top of them more natural and easier than the other. Thus, to begin with, we coded the SRP ourselves instead of using the predefined ICPP, and then implemented both protocols using the same architecture as far as possible. To understand and evaluate their performance better, we also ran tests with the first-come-first-served (FCFS) policy, which treats all tasks equally: tasks are scheduled in their order of arrival, without preemption. We ran many experiments with different configurations. The performance measures were:
• Deadlines missed, with and without a protocol
• Lock waiting time, with and without ceiling changes
• Priority scheduling, with and without a protocol
• Context switches, with and without a protocol

We began by reducing priority inversion, using a layered implementation to compare the two protocols and determine how well each allowed higher-priority tasks to meet their deadlines. We looked at task sets that were structured so that priority inversions would occur without a protocol, since there will be task sets that make one protocol appear more effective and vice versa. We also ran a variety of task sets to find out what it is about a particular set (workload) that makes one protocol more suitable than the other. We ran each test more than 10 times and took the minimum average to reduce the noise in the collected data. In the first experiment, we varied the timings and looked for the worst case, since these protocols are concerned with minimizing worst-case rather than average-case priority inversion.
The X-axis in Figure 2 denotes the number of tasks, and the Y-axis the deadline-miss ratio. Our results are clearly different from those of the previous study [15]. We ran many tests with task sets that had randomly selected periods, and the results show that neither protocol had a significant advantage over the other. Figure 2 clearly shows that both protocols performed about equally well, with slightly more credit to the MPCP and to EDF scheduling. As expected, FCFS performed poorly: since it treats all tasks equally, tasks with short computation times missed their deadlines when a task with a long computation time preceded them. There were more missed deadlines under the MSRP than the MPCP because of the MSRP's additional blocking; the MPCP adhered to priority scheduling better, raising the priority of a task only when necessary.
Figure 2: Percentage of deadline misses -- task sets with randomly selected periods, and with heavy resource usage.
In the second experiment, we focused on the number and length of the critical sections. With fewer and shorter critical sections, the MSRP performed even better than before, as can be seen in Figure 3. Looking at the Y-axis, it is apparent that the MSRP and the MPCP were nearly equal. The MSRP's performance improved in particular when the number of critical sections used by each task was reduced (light usage) and their lengths were shorter (20 to 100 milliseconds). Even under heavy load, with 100 tasks executing, the number of deadline misses was almost the same as with the MPCP. FCFS still performed poorly, as high priority inversion still occurred.
Figure 3: Percentage of deadline misses -- task sets with randomly selected periods, and with lighter resource usage.
In the third experiment, we selected task sets whose utilization bounds were one and whose periods were harmonic. The results for this case (Figure 4) clearly show that EDF scheduling still delivered better performance. As expected, the MPCP and the MSRP performed equally well. However, the MSRP's early blocking policy caused more tasks to miss their deadlines, even tasks that required no shared resources to complete.
Figure 4: Percentage of deadline misses -- task sets with harmonic periods, and with light resource usage.
In all our experiments with random and harmonic task periods, both protocols had about the same results, which is strikingly different from the previous study. To better highlight the differences between the protocols, the next experiments concentrate on the architecture and behavior of each protocol.
In the fourth experiment, we focused on the total lock-waiting time. This is the total time taken by the resource server (scheduler) to grant a lock, including: (1) any time spent in the queue; (2) checking whether the locking rules are met; and (3) raising the priority of the task to the level of the lock, if necessary. Figures 5 and 6 clearly show that FCFS was the best performer. However, we are not measuring total response time (completion time); the previous experiments showed that many tasks miss their deadlines under FCFS. Even so, FCFS had the lowest total lock-waiting time because it treated all tasks equally; the queue wait time is its most significant overhead. In addition, there is potentially higher overhead with priority inheritance. However, even with priority inheritance, the MPCP performed slightly better than the MSRP; the level of resource usage makes no difference.
Figure 5: Lock waiting ratio -- task sets with randomly selected periods, and with heavy resource usage.
Figure 6: Lock waiting ratio -- task sets with randomly selected periods, and with lighter resource usage.
The fifth experiment presents the effects of priority changes. Figure 7 clearly shows that there are fewer priority changes with RMS; this is a side effect of the reduced schedulability of RMS (versus EDF). The experiment clearly shows that the MSRP's overhead was the highest: with the MSRP, a task that locks a critical resource always has its priority raised to the ceiling priority of the critical resource throughout the protected operation. The advantage of the MPCP is its closer adherence to priority scheduling; the priority of a task is raised only when necessary to reduce priority inversion. The MSRP's overhead is also higher because the POSIX mutex operations do not implement prio_protect efficiently, even though it has been in place since at least a couple of versions of the glibc library. There need not be high overhead in a custom kernel: in fact, with one processor the priority change replaces the mutex lock operation. Locking simply saves the old priority value and sets a new one; unlocking restores the priority from the saved value and calls the scheduler if the priority is thereby lowered. There are details that must be attended to atomically, but these are not overly difficult. Finally, with reference to FCFS, there are no locking rules, and thus no priority checking or changing.
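A minimal sketch of this uniprocessor locking-by-priority-change, under our own names (struct thread and reschedule are illustrative, not the paper's kernel code):

struct thread { int prio; };

void reschedule(void);  /* assumed kernel service (hypothetical) */

/* "Lock" = run at the ceiling; return the old priority. */
static int srp_lock(struct thread *self, int ceiling)
{
    int saved = self->prio;
    self->prio = ceiling;
    return saved;
}

/* "Unlock" = restore the saved priority; call the scheduler
 * only if the priority was thereby lowered. */
static void srp_unlock(struct thread *self, int saved)
{
    int lowered = saved < self->prio;
    self->prio = saved;
    if (lowered)
        reschedule();
}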
Figure 7: Percentage of priority changes by MPCP, MSRP, and FCFS.
As mentioned earlier, if the MSRP is implemented on top of an OS kernel, rather than in the kernel itself, the priority changes create significant overhead. Even inside an operating system's kernel, the MSRP can have significant overhead unless the kernel is designed specifically for it. Experiments with custom kernels that use the SRP exclusively for internal synchronization [25] showed lower overhead for the SRP than for conventional locks (semaphores) implemented with Solaris threads.
In our sixth experiment, we looked at the number of preemptions and context switches generated. As expected, Figure 8 shows that the number of context switches is lower with RMS. FCFS had the lowest number of context switches overall, followed by the MSRP and then the MPCP. FCFS had the fewest context switches because it treats tasks equally, without preemption; the reason for the MPCP's larger number is, again, its closer adherence to priority scheduling, whereas a positive consequence of the MSRP's early blocking policy is a reduction in unnecessary context switches. The cost of a context switch can be significant. As seen earlier, the MSRP finds priority changes problematic, but this depends entirely on how the dynamic priority change is implemented. A priority change is much simpler than a context switch, but only if implemented over kernel-based threads; with normal OS processes, it involves a trap into the kernel and a return, an operation similar to a thread context switch but still less complex than a process switch (since the memory mapping is not changed).
Figure 8: Percentage of context switches by MPCP, MSRP, and FCFS.
In contrast to prior work, we have provided a practical evaluation on a real working system instead of simulation alone. On random task sets, the MSRP did not always perform better, as was indicated in the previous study (page 7, column 2, first paragraph). When we tested each protocol with the same scheduling algorithm as the other, both protocols were about equally effective; the MSRP was a little more expensive because the thread library did not provide cheap, direct control over thread priority. If the objective is to compare performance, then we must compare apples with apples, or the results will be skewed, especially since EDF is optimal whereas RMS is not.
6. Conclusion and Future Work
We have presented a fair, realistic, rigorous, and systematic evaluation of the MPCP and the MSRP. Our conclusion is that both policies are more or less equally effective, the main difference being the MSRP's expense. We have provided a clear explanation of the protocols and of how to implement them in Ada/POSIX. The MSRP is a consistent extension of the SRP. Both protocols have their strengths and weaknesses. One strength of the MPCP is that it does not raise a task's priority until that is necessary to prevent priority inversion, so in many cases the priority need not change; for this reason, the MPCP adheres better to priority scheduling. However, there is still some overhead in determining whether a priority change is needed. The gain is that there are fewer cases of priority inversion. With the MSRP, whenever we raise the priority of a lock holder that is causing a temporary priority inversion, we gain over using no protocol in that the inversion is strictly bounded in duration; with the MPCP this happens less often, a gain on average. On the other hand, we have shown that in the worst-case scenario the number of context switches is lower with the MSRP. There are several important directions for future work. In the immediate future, we are interested in perfecting our API and integrating it with Ada, as well as in improving the thread library. We are also interested in testing optimization schemes that perform a "lazy" priority change, and in testing other approaches to resource sharing, such as abort-and-restart schemes in which preemption is a complete rollback action.

Acknowledgments

We would like to thank the anonymous referees, whose careful reading of the paper has led to significant improvements in the accuracy and clarity of the presentation. We especially thank Professors Burns and Wellings (University of York) and Professor Baker (Florida State University).
References

[1] Anderson J., Ramamurthy S., and Jeffay K. Real-Time Computing with Lock-Free Shared Objects. ACM Transactions on Computer Systems, May 1997.
[2] Baker T. A Stack-Based Scheduling of Real-Time Processes. Real-Time Systems Symposium, pages 191-200, 1990.
[3] Block A., Leontyev H., Brandenburg B., and Anderson J. A Flexible Real-Time Locking Protocol for Multiprocessors. 13th IEEE International Conference on RTCSA, 2007.
[4] Brandenburg B., Calandrino J., Block A., Leontyev H., and Anderson J. Real-Time Synchronization on Multiprocessors: To Block or Not to Block, to Suspend or Spin? RTAS 2008.
[5] Burns A. and Wellings A.J. Beyond Ada 2005: Allocating Tasks to Processors in SMP Systems. Ada Letters, 8/2007.
[6] Burns A. and Wellings A. Programming Execution-Time Servers in Ada-2005. RTSS 2006.
[7] Burns A., Wellings A.J., and Zerzelidis A. Correcting the EDF Protocol in Ada 2005. Ada Letters, August 2007.
[8] Calandrino J., Baumberger D., Li T., Hahn S., and Anderson J. Soft Real-Time Scheduling on Performance Asymmetric Multicore Platforms. RTAS 2007.
[9] Cheng A. M. K. Real-Time Systems: Scheduling, Analysis, and Verification. 2nd ed., Wiley & Sons, 2002, 2005.
[10] Cheng A. M. K. and Ras J. The Implementation of the Multiprocessor Priority Ceiling Protocol in Ada-2005 Using a Shared Memory Programming Model. RTAS, WIP, 4/2007.
[11] Cheng A. M. K. and Ras J. The Implementation of the Priority Ceiling Protocol in Ada-2005. Ada Letters, 4/2007.
[12] Chen C. M. and Tripathi S. Multiprocessor Priority Ceiling Based Protocols. Technical Report CS-TR-3252, University of Maryland, 1994.
[13] Chen M. I. and Lin K. J. Dynamic Priority Ceilings: A Concurrency Control Protocol for Real-Time Systems. Technical Report UIUCDCS-R-89-1511, Dept. of Computer Science, University of Illinois at Urbana-Champaign, 4/89.
[14] Cottet F., Delacroix J., Kaiser C., and Mammeri Z. Scheduling in Real-Time Systems. John Wiley & Sons Ltd, 2002.
[15] Gai P., Natale M., Lipari G., Ferrari A., Gabellini C., and Marceca P. A Comparison of MPCP and MSRP when Sharing Resources in the Janus Multiple-Processor on a Chip Platform. RTAS 2003.
[16] Gai P., Natale M., and Lipari G. Minimizing Memory Utilization of Real-Time Task Sets in Single and Multi-Processor Systems-on-a-Chip. RTSS 2001.
[17] Hennessy J. and Patterson D. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1995.
[18] Lee E. The Problem with Threads. Berkeley report, Jan/2006.
[19] Liu C. and Layland J. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. JACM, 1973.
[20] Locke D., Vogel D.R., and Mesher T.J. Building a Predictable Avionics Platform in Ada: A Case Study. Proc. of IEEE Real-Time Systems Symposium, 1991.
[21] Locke D., Sha L., Rajkumar R., Lehoczky J., and Burns G. Priority Inversion and Its Control: An Experimental Investigation. ACM Ada Letters 8(7):39-42, 1988.
[22] Lopez J.M., Diaz J.L., and Garcia D.F. Utilization Bounds for EDF Scheduling on Real-Time Multiprocessor Systems. Real-Time Systems, 28(1):39-68, 2004.
[23] Rajkumar R. Synchronization in Multiple Processor Systems. In Synchronization in Real-Time Systems: A Priority Inheritance Approach. Kluwer Publishing, 1991.
[24] Sha L., Rajkumar R., and Lehoczky J. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, 39(9), September 1990.
[25] Shen H., Baker T., and Charlet A. A Bare-Machine Implementation of Ada Multi-Tasking Beneath the Linux Kernel. Reliable Software Technologies, Ada-Europe 99, Lecture Notes in Computer Science, Springer Verlag, 1999.
[26] Tindell K., Burns A., and Wellings A. An Extendible Approach for Analyzing Fixed Priority Hard Real-Time Tasks. Real-Time Systems 6(2):133-151, 1994.
[27] http://www.quadros.com/
[28] http://www.windriver.com/
[29] http://www.cs.fsu.edu/~baker/florist.html