Applying Compiler Optimization in Distributed Real-Time Systems

Mohamed F. Younis    Thomas J. Marlowe    Grace Tsai    Alexander D. Stoyenko

Abstract

Compiler optimization techniques have been applied to facilitate development and performance tuning of non-real-time systems. Unfortunately, regular compiler optimization can complicate the analysis and destroy timing properties of real-time systems. In this paper, we discuss the difficulties associated with performing compiler optimization in distributed real-time systems. We present an algorithm to apply machine-independent compiler optimization safely to distributed real-time systems. The algorithm uses resources' busy-idle profiles to investigate the effects of optimizing one process on other processes. A restricted form of resource contention is assumed to simplify the analysis.

Younis and Stoyenko are with the Real-Time Computing Laboratory, Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark NJ 07012 USA, Tel: (201) 596-2989, Fax: (201) 596-5777, E-mail: [email protected]. Tsai is visiting the Real-Time Computing Laboratory from the Department of Computer Science, Fairleigh Dickinson University, Teaneck, NJ 07666 USA. Marlowe is visiting the Real-Time Computing Laboratory from the Department of Mathematics and Computer Science, Seton Hall University, South Orange, NJ 07079 USA. This work is supported in part by US ONR Grant Number N00014-92-J-1367, by US-NSWC Grants N00014-93-1-1047 and N60921-93-M1912, by Siemens Research Collaborative Grants, and by AT&T UEDP Grant 91-134. The authors are indebted to the many constructive comments by members of the Real-Time Computing Lab, and to B. Ghahyazi for conducting the simulation.

1 Introduction

Although compiler optimization techniques have been applied successfully to non-real-time systems [18], they can destroy safety guarantees and deadlines of real-time systems. For this reason, real-time systems developers have tended to avoid automatic compiler optimization of their code. However, real-time applications have been growing substantially in size and complexity in recent years. That growth makes it impossible for programmers to write optimal code, and consequently indicates a need for compiler optimization transformations to tune performance without degrading worst-case execution times. In addition, proper optimization can sometimes transform programs which may not meet constraints/deadlines, or which result in timeouts, into deadline-satisfying programs. New opportunities for parallelism can be detected, enhancing resource utilization and speeding up execution [27]. Moreover, optimization of hard real-time programs has benefits even for real-time programs which are running, and which can be proved to meet their timing constraints. For these programs, it is often preferable to reduce resource usage (time, space, or processors), especially in multiuser or multiprogramming environments. Not only do resources then become available to other system tasks or users, but this may also make the programs more robust in the face of faults or unpredictable system overload [2]. For example, achieving a speedup of 10% for an overall system of 100 processes through optimization may permit the system to accept an extra load of perhaps 10 processes (assuming all processes have similar requirements).

There has been an increase, during the past few years, in the use of distributed computer systems in real-time applications, such as patient monitoring, avionics, and air-traffic control. Thus, transformations must consider other complicated issues such as synchronization and shared resources. Although performing safe compiler optimization will not extend the deadline of a process, it can affect the timing behavior of other processes. Consider, for example, the code in Figure 1, which consists of a loop followed by a call to a critical region crit(R).

[Figure 1 panels: the original and the optimized version of a loop followed by a call to the critical region crit(R).]

Whenever the current time passes the resource's next idle time (Current Time > Next Idle Time), we reset the new release time for the incoming requests.

    // Algorithm to compute the maximum release time of a resource
    Current Time = 0
    Next Idle Time = 0
    While Current Time <= LCM
        Release Time = 0
        While Current Time <= Next Idle Time
            For i = 1 To Number Processes
                If Process i has a request to R at Current Time
                    Release Time = Release Time + Service Time
                    Next Idle Time = Next Idle Time + Service Time
                EndIf
            EndFor
            Current Time = Current Time + 1
        EndWhile
        Next Idle Time = Next Idle Time + 1
    EndWhile

Figure 4: Algorithm to Calculate the Maximum Release Time of a Shared Resource

If a process contains unresolved conditionals with resource requests on multiple branches, we must assume that all of them may occur. A post-pass can shorten busy-idle intervals in many cases in which a process makes non-coexecutable requests in the same busy-idle interval. Note that, in some cases, it may be statically possible to refine this information without scheduler support, e.g., when one group of requests demonstrably arrives after service has begun on the last of the previous group.
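As a minimal Python sketch of this computation under the paper's restricted-contention model (queued requests share one release time): the function names, the sorted-request input, and the uniform service time are our own illustrative assumptions, not the paper's code.

```python
# Sketch: build a resource's busy-idle profile under the restricted
# contention model. A request arriving while the resource is busy joins
# and extends the current busy interval.

def busy_idle_profile(request_times, service_time, lcm):
    """Return busy intervals as (start, release) pairs."""
    intervals = []
    next_idle = 0                      # first instant the resource is free
    for t in sorted(request_times):
        if t > lcm:
            break
        if not intervals or t >= next_idle:
            intervals.append([t, t + service_time])   # idle: open new interval
        else:
            intervals[-1][1] += service_time          # contention: extend it
        next_idle = intervals[-1][1]
    return [tuple(iv) for iv in intervals]

# Requests at times 0, 1, and 8 with service time 2: the first two contend
# and are released together at time 4; the request at 8 runs alone.
print(busy_idle_profile([0, 1, 8], 2, 20))   # → [(0, 4), (8, 10)]
```

Per the model above, both contending requests are released at the common release time 4, the maximum release time of their interval.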

5 Classification of Code-Improving Transformations

Program-improvement transformations can be classified according to the scope of the analysis into single process/processor, multiprocess, and multiprocessor. Opportunities for optimization within a single process depend on whether we have access requests for shared resources. In the absence of critical sections, it is always safe to apply compiler optimization techniques as long as we do not extend the execution time along the worst-case path. We call this property local safety. In the presence of critical sections, however, safety of the transformation depends on not worsening the shared resources' busy-idle profiles. We refer to that property as multiprocess safety. Thus, a transformation is safe if it is both locally safe and multiprocess safe. Consider the example in Figure 1: in the absence of the call crit(R), moving the loop invariant out of the loop is always safe. In the presence of the call, however, it is not guaranteed to be safe unless we prove that it is multiprocess safe. Removal of unreachable code is necessarily multiprocess safe, but essentially no

other transformation is always safe. The second type of code-improving transformation is multiprocess optimization. Optimizing interprocess communication is a common approach: compilers can detect useless messages, or redundant parameters in a message. Another common approach for asynchronous messages moves sends upward and receives downward [1, 7]. The removal of unnecessary messages and parameters, or optimization of asynchronous messages, is usually locally safe because it shortens execution time. However, it may affect other resources' busy-idle profiles. The third form is multiprocessor optimization. Migration and speculative execution, remote calls, and moving code into or out of a call are examples of this category [3, 4, 27]. Multiprocess and multiprocessor optimizations can also be unsafe with respect to other processes, processors, and resources. We are concerned with the results of transformations/optimizations, and with making them safe for a multiprocess real-time application. We address primarily the result of single-process optimizations, but the approach extends, with a few semantic safety checks, to the other types. We do not address how optimizations are identified or performed, but assume that implementations of the transformations are provided to the system. Our approach, as we illustrate, is to inject delay statements into the code rather than actually optimize during local analysis, to avoid changing the access profiles of the process with respect to each resource. Then, we perform multiprocess analysis to defer and eventually remove these delay statements without destroying the multiprocess safety property.

6 Performing Multiprocess Safety Analysis

The analysis of the effects of a single-process optimization on other processes can be performed using shared resources' busy-idle profiles. In this section, we discuss the multiprocess safety property of compiler transformations. We say a request lands in an idle interval if its service does not overlap any busy interval; it lands in a busy interval if it overlaps other services. Transformations may affect the resource busy-idle profile as follows:

1. Moving a request out of an idle interval into an idle interval,
2. Moving a request out of a busy interval into an idle interval,
3. Moving a request from one busy interval into the same busy interval,

4. Moving a request from one busy interval into another busy interval,
5. Moving a request from an idle interval into a busy interval,
6. Moving a request and merging two busy intervals.

In cases 1 and 5, "idle interval" should be taken to mean that the current request is the only one in the current busy interval. The first case is generally safe, while cases two and three can be applied safely with care. Moving a request from an idle interval to the same or a different idle interval is always multiprocess safe; such an optimization will not introduce contention. On the other hand, moving the request within the same busy interval will not affect the release time, according to our model, of the processes sharing that busy interval unless the entire interval is shifted forward. Shifting the busy interval forward will decrease the release time of all requests in this busy interval, which may affect other processes. In this situation, the amount of shift can be compensated by inserting exactly a compensating amount of delay past the resource request in each of those processes.

The second case is ideal. Moving a request from a busy interval into an idle one will not only accelerate the execution of the optimized process, but will also reduce shared-resource contention and speed up other processes competing for that resource, and may actually split the busy interval. Consider the example in Figure 5(a), where the busy-idle profile of resource R has been produced using the algorithm in Figure 4 by combining the access profiles of processes A, B, C, and D. Assume we have successfully optimized the code of process A so that its request at time 8 can be initiated at time 7. This will split the busy interval (8,15) into two intervals (7,9) and (10,15), as shown in Figure 5(b). Now the response time for the requests from processes A and B will be faster, due to reduced contention. However, the decrease in the release time for the other processes should (at least temporarily) be compensated by inserting delay after the request. In our example, a delay of 6 units should be inserted after the call made by B, and one of 7 units after the call in A.

The fourth and fifth cases are not in general multiprocess safe. Moving a request from a busy interval to another busy interval needs more careful analysis: this case can increase the contention in the target busy interval and extend the release time. On the other hand, it may be beneficial to reduce contention in one interval as long as the increase in the release time of the target busy interval is acceptable. In Figure 5(a), if the request of process A arrives at time 6 after optimization, this will extend the release time of the previous requests to 7 instead of 6, while improving the response time of the request from B at time 8. Such cases require checking all the processes that will suffer and gain from that optimization, which may lead to exponential analysis. However, there are obvious unsafe situations, such as linking two busy intervals

(case 6). For example, in Figure 5(a), moving the request of process C from time 10 to time 6 would link the two busy intervals. There are nonetheless two safe special cases: first, if all the requests in the target busy interval were originally in the source interval; second, when all processes participating in the target busy interval have a delay past the request large enough to accommodate the extension in the release time. The second subsumes the first, since shifting previous requests out of the current interval creates such delays. On the other hand, moving a request from an idle interval into a busy interval will introduce new contention while only one process gains; thus, case 5 can be avoided or handled similarly to case 4. Sometimes it may be highly desirable to decrease the execution time of a process to meet its timing constraints, as long as the other processes can tolerate delays due to increased contention. Applying multiprocess-safe transformations may split busy intervals. Moving one request from one of these subintervals to another should not be considered unsafe; even merging them again should be considered safe. Therefore, the resource busy intervals should be identified: splitting an interval by a multiprocess-safe transformation creates two subintervals with the original interval id. This allows application of the safe subcases of 4, 5, and 6.
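The six cases can be sketched as a classification function over a list of half-open (start, release) busy intervals. This is our own illustrative rendering of the case analysis, not the paper's implementation; the returned 'investigate' verdict mirrors the deeper analysis that cases 4 and 5 require.

```python
# Sketch: classify moving one request earlier against a resource's
# busy-idle profile, following the six cases above.

def find_interval(profile, t):
    """Return the busy interval containing time t, or None if t is idle."""
    for start, release in profile:
        if start <= t < release:
            return (start, release)
    return None

def classify_move(profile, old_time, new_time, service_time):
    """Return (case, verdict): True (safe), False (unsafe),
    or 'investigate' for the cases needing deeper analysis (4 and 5)."""
    cur = find_interval(profile, old_time)             # source interval
    new = find_interval(profile, new_time)             # where the call lands
    bound = find_interval(profile, new_time + service_time)
    if cur is not None and new == cur:
        return 3, True                 # case 3: stays in the same busy interval
    if cur is not None and bound == cur:
        if new is None:
            return 2, True             # case 2: same interval, shifted earlier
        return 6, False                # case 6: would merge two busy intervals
    if new is None and bound is None:
        return (2, True) if cur is not None else (1, True)   # idle target
    return (4, 'investigate') if cur is not None else (5, 'investigate')

# Request at time 8 (service 2) in busy interval (8, 15) moved to time 7:
print(classify_move([(8, 15)], 8, 7, 2))   # → (2, True)
```

The printed example corresponds to the interval-splitting move discussed above: the request leaves its busy interval for idle time, reducing contention for the remaining requests.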

[Figure 5 panels: the access profiles of processes 'A', 'B', 'C', and 'D' for resource 'R', and the busy-idle profile and maximum release time of 'R', (a) before and (b) after optimizing process A.]

Figure 5: Effects of Compiler Optimization on Response Time of Shared Resources

In the next section, we provide an algorithm to perform optimization without destroying either local or multiprocess safety.


7 The Optimization Algorithm

Our algorithm for applying machine-independent compiler optimization safely in distributed real-time systems is composed of two phases. In the first phase, we consider each individual process in isolation. Code-improving transformations are applied safely, in the sense of [10], to individual processes. However, in all cases gains in performance are compensated by injecting delays to preserve the shared-resource access profile. These delay statements are the seeds for delay propagation in the second phase. For simplicity, we assume all timing constraints are absolute max constraints, relative to the start of the process frame. Timing constraints can be applied to any statement. In our algorithm, we assume by default that precisely the set of shared-resource requests and the end of the process are observable. However, the algorithm can be extended easily to accommodate other timing constraints. Again, according to our restricted resource-contention model, whenever there is a queue, all processes involved (including the first, which claimed the resource without waiting) are released at the same time.

Multiprocess analysis is performed in the second phase of the algorithm. Delay statements are pushed toward the end of processes while consulting the busy-idle profiles of shared resources. A cleanup step is performed at the end of this phase to remove delays reaching the end of processes. However, we may need to perform different delay shift and removal optimizations in each instance of every process within the LCM: successfully removing delays in the first instance does not necessarily mean it is safe to remove them in another, due to differences in contention for resources. We may thus need to clone different instances of a process within the LCM on the same processor. Alternatively, a set of directives can be generated to report the difference between instances to the scheduler. The algorithm outline is shown in Figure 6.
In the following subsections, we explain the two phases of the algorithm in more detail.

7.1 Single Process Optimization

Most optimization techniques can be viewed as removing unnecessary code, moving code, or replacing it by faster equivalent computations. The common goal is to reduce the execution time of programs, and optimization usually results in a performance speedup. For example, moving loop-invariant code out of a loop shrinks the loop body for faster execution, and common-subexpression elimination removes useless computation. So we argue that optimization can be viewed as a transformation to wasteful code (or useless delay), followed by movement/elimination of that redundancy. Our approach is based on deferring the movement/elimination of delays until the multiprocess safety property is guaranteed. In the algorithm (Figure 6), we require that any time gained by the optimization be compensated by injecting delays, so as not to affect process access profiles with respect to shared resources.

    // Algorithm to perform compiler optimization
    // Extract process access profiles with respect to shared resources
    For i = 1 To Number Processes
        Extract shared-resource access profiles of Process i
    EndFor
    // Build shared resources' busy-idle profiles
    For i = 1 To Number Resources
        Combine the access profiles of Resource i for all processes
    EndFor
    // Single-process optimization
    For i = 1 To Number Processes
        Optimize Process i, injecting delays
    EndFor
    // Multiprocess safety analysis
    Apply algorithm Shift Delay
    // Single-process cloning if necessary
    For i = 1 To Number Processes
        Clone Process i if necessary on the same processor,
        or generate directives for the scheduler
    EndFor

Figure 6: Algorithm to Perform Safe Single and Multiprocess Compiler Optimization

After the first phase of the algorithm, the timing behavior and resource access of every optimized process remain the same. The time, relative to the beginning of the process, of all access requests for shared resources will not change; consequently, busy-idle profiles of resources will not be affected by these transformations. In the second phase, we try to move and eventually eliminate the injected delays in individual processes without affecting the timeliness of the other processes.

The first phase generates, for every process, an array of points of interest and their times relative to the start. Points of interest are locations in the code that contain a delay, a resource request, a split of a conditional, a merge of a conditional, or the end of the process. These points will be used by the second phase to correctly propagate delays towards the end of the process. We refer to this array as Test Points. Delay points include delays which have been introduced by other transformations, for example [23]. Resource-request points contain the id of the resource busy interval. We extend the Test Points list by replicating the points of the first period a number of times equal to the number of instances of the process in the LCM (LCM/period size). The idea is to work on this array and clone the process on the same processor, if necessary, at the end.
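As a sketch, the Test Points array might be represented as follows; the namedtuple fields, point kinds, and replication helper are our own illustrative assumptions, not the paper's data structures.

```python
# Sketch: a possible representation of the per-process Test Points array
# produced by phase one, replicated across all instances of the process
# within LCM.
from collections import namedtuple

# kind is one of "delay", "request", "split", "merge", "end"; time is
# relative to the start of the process; extra holds a delay size or a
# busy-interval id, depending on the kind.
Point = namedtuple("Point", "time kind extra")

def replicate_for_lcm(points, period, lcm):
    """Replicate one period's points once per instance of the process in LCM."""
    out = []
    for k in range(lcm // period):
        out += [p._replace(time=p.time + k * period) for p in points]
    return out

one_period = [Point(3, "delay", 2), Point(5, "request", "R0"),
              Point(10, "end", None)]
points = replicate_for_lcm(one_period, period=10, lcm=30)
print(len(points), points[3].time)   # → 9 13
```

With a period of 10 and an LCM of 30 the process has three instances, so the three points are replicated into nine, each instance offset by one period.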

7.2 Multiprocess Safety Analysis

In the multiprocess analysis phase, we try to defer and then remove the delays inserted in the code during the first phase. The outline of the second phase is shown in Figure 8. The idea in this phase is to shift the delay statement(s) in the process code towards the end, using a greedy approach. Delays reaching the end of the process may safely be removed. Shifting a delay is performed in steps: in every step we try to move a delay beyond a call to a shared resource, so that the call will be made earlier. We consult the resource's busy-idle profile(s) for the call (algorithm Propagate in Figure 9). If it is multiprocess safe to make the request earlier, we shift the delay. We continue until the end of the process code, removing delays reaching there. This operation is assisted by the array Test Points generated in the first phase; the function Next returns the location of the next point in the array Test Points.

Delays can be propagated toward the end using the rules in Figure 7: (1) at a split point (conditional), delays can be propagated into both branches; (2) at a merge, we propagate the minimum of the delays on the branches; (3) consecutive delays can be combined; (4) delays reaching the end can be safely removed. Propagating a delay beyond a resource request needs further multiprocess analysis, checking the busy-idle profile of that resource. If a request appears in a condition, delays that can be shifted successfully beyond this condition need to be propagated into both branches.

[Figure 7 panels, original vs. transformed: (1) a delay K before a split becomes a delay K in each branch; (2) delays K and K' on merging branches become K-L and K'-L, with a delay L = min(K, K') after the merge; (3) consecutive delays K and L become a single delay (K + L); (4) a delay before End is removed.]

Figure 7: Rules for Delay Propagation (easy cases)

The outline of the multiprocess safety check is split across Figures 8, 9, and 10. The idea is to set up a simulated clock and repeatedly consider the earliest delay point over all processes; each such delay is propagated if it is safe to do so. Note that we work only on the Test Points generated in the first phase. The procedure Check applies the appropriate rule from Figure 7 to propagate the delay and sets up the next trial in this process. The algorithm in Figure 9 outlines the propagation process in the case of a resource request; the idea is to move as much delay as possible from before the request to a compensating delay after it. The Propagate algorithm iteratively invokes the procedure in Figure 10 to check the safety of the shift.
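The four easy rules can be sketched as operations on delay sizes (an illustrative rendering; the function names and tuple conventions are our own):

```python
# Sketch of the four easy propagation rules, operating on delay sizes only.

def split(delay):
    """Rule 1: a delay before a conditional is copied into both branches."""
    return delay, delay

def merge(d_then, d_else):
    """Rule 2: only the common part min(K, K') moves past the merge; the
    remainder stays behind inside each branch."""
    moved = min(d_then, d_else)
    return moved, (d_then - moved, d_else - moved)

def consolidate(d1, d2):
    """Rule 3: consecutive delays combine into one."""
    return d1 + d2

def at_end(delay):
    """Rule 4: a delay reaching the end of the process is removed."""
    return 0

print(merge(5, 3))   # → (3, (2, 0))
```

In the printed merge example, only 3 units of delay continue past the merge, while 2 units remain inside the then-branch, matching rule (2) above.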
This test is based on the classification of transformations in Section 5. The test needs to recalculate the busy-idle profile of the resource; the algorithm relies on checking the location of the new request in the current busy-idle profile of the resource. The new request time and the service time are used to relate the new location to the current profile and to predict the effect on the response time of that request, as well as on other processes. The function get interval returns the interval within which a call lands. The classification of the effect on the resource busy-idle profile, discussed in Section 5, is performed according to the table in Figure 11 (some subcases must be handled with care: case 3 if the start of the interval shifts, cases 4 and 5 when there are covering delays). Every resource busy interval has a flag indicating whether all requests in the interval were originally located in a single resource busy interval. This flag is set to mixed when a request from another interval (an interval with a different id) is added. The flag enables handling the safe subcases of 4, 5, and 6 when the target interval is composed of requests previously moved from the current interval. The ideal release time is the response time in the absence of contention. Note that in cases 4 and 5 we are not performing an exhaustive analysis of all processes anticipating extra release time due to contention, which in fact may need exponential time; we only detect simple subcases which are proven safe by simple analysis.

    // Algorithm Shift Delays
    For i = 1 To Number Processes   // initialize the process code pointers
        point1(i) := Next(i);       // Next advances the pointer into Test Points
        point2(i) := Next(i);
    EndFor
    Current process := 1;
    Current time := Get min(Current process);   // initialize the simulated time
    While (Current time <= LCM)     // loop over the entire LCM
        Check(point1(Current process), point2(Current process));
        // Adjust the pointers to the next point in the code
        Current time := Get min(Current process);
    EndWhile

    // Procedure to get the earliest point of interest over all processes
    Proc Get min(process)
        Current time := LCM + 1;
        Not done := 1;              // indicator for completion
        While (Not done)
            For i = 1 To Number Processes
                If (Current time > point1(i).time) and (point1(i).type is delay)
                    Current time := point1(i).time;
                    Current process := i;
                    Not done := 0;
                EndIf
            EndFor
            If (Not done)
                counter := 0;       // counts the number of finished processes
                For i = 1 To Number Processes
                    If (point2(i).time = LCM)
                        counter := counter + 1;   // finished process
                    Else
                        point1(i) := point2(i);
                        point2(i) := Next(i);
                    EndIf
                EndFor
                If (counter = Number Processes)
                    Not done := 0;  // no more points to try in any process
                EndIf
            EndIf
        EndWhile
        return Current time
    EndProc

    // Procedure to propagate delays in a process
    Proc Check(point1(i), point2(i))
        Switch (point2(i).type)
            case simple split:  apply split rule;
            case merge:         apply merge rule;
            case two delays:    apply consolidate rule;
            case resource R use:
                // point1(i) is the corresponding release point, point2(i) the resource access
                delay := Get delay size(point1(i));
                Propagate(i, delay, R, point2(i).time, Service time);
            case end of process:
                point2(i) := Next(i);   // remove the delay
        EndSwitch
        point1(i) := point2(i);         // update the process pointers
        point2(i) := Next(i);
    EndProc

Figure 8: Algorithm to Shift/Remove Delays Preserving Multiprocess Safety

7.3 Complexity

The complexity of our algorithm in Figure 6 can be computed as the sum of the complexities of the compiler preparation phase that builds the resources' busy-idle profiles, of the first optimization phase, and of the second optimization phase. In this section, we use m for the number of shared resources and n for the number of processes.

The preparation phase starts by extracting the access profile of each process with respect to every shared resource; repeating that for all processes makes this step O(mn). The following step builds the busy-idle profile of every resource from the access profiles of the processes, which is also O(mn). (Any post-pass to account for conditionals can be halted at any time, and introduces at worst a factor of the number of improving steps.) In the first phase, we apply single-process optimization techniques; depending on the optimization level selected, the cost can range from O(n log n) upward, but is predictable.

In the Shift Delay algorithm (phase 2), we have one essential loop repeated a maximum of LCM times. This is an upper bound, assuming all code is transformed to delays, which of course is unrealistic; we do not expect to require that number of iterations. Every iteration applies one of the rules in Figure 7. The split, merge, consolidate, and end-of-process rules are very simple. Shifting a delay past a resource request involves the Propagate procedure in Figure 9; this analysis may need a number of trials equal to the delay size if the shift fails, and again the worst case, being unable to shift any delay, is bounded by LCM. Note that cases 4 and 5, discussed in Section 7.2, may in general lead to an exponential explosion in the analysis, but our algorithm handles only obvious safe subcases, whose costs are small; the detection of the safe subcases of cases 4 and 5 is O(n). Consequently, the complexity of phase 2 is O(n · LCM²).

    // Algorithm Propagate(P, delay, R, Current call time, Service time)
    Optimal shift := 0;
    Status := Unsafe;
    D := delay;
    While (Status = Unsafe and D >= 1)
        If Shift one instance(P, D, R, Current call time, Service time, Case) is Safe
            Status := Safe;
            Optimal shift := D;
        EndIf
        D := D - 1;
    EndWhile
    If (Optimal shift = 0)
        return                          // no success
    Else                                // success --> shift a delay of size Optimal shift
        Create a delay point for process P of size (delay - Optimal shift) before point1(P)
        Set delay size(point1(P), Optimal shift);   // split the delay point into two points
        Current interval := get interval(R, Current call time);
        If the request is in a condition
            Apply split rule            // propagate a delay of Optimal shift in all branches
        Else
            Swap(point1(P), point2(P));
        EndIf
        Switch (Case)
            case 1:
            case 2:
                Create a new resource busy interval with the current request's interval id
                Foreach Process i in Current interval
                    Insert a delay point after the resource call to extend the previous release point
                    // even for process P the release time is less
                EndFor
            case 3:
            case 4:
                Foreach Process i in S with Process i != P
                    Decrease the size of the delay in point1(i) by Service time
                EndFor
            case 5:
            case 6:
                Foreach Process i in S with Process i != P
                    Decrease the size of the delay in point1(i) by Service time
                    // point1(i) is a delay point, since requests in S were originally in the source interval
                EndFor
        EndSwitch
    EndIf

Figure 9: Algorithm to Propagate the Delay Without Destroying Multiprocess Safety


    // Procedure to propagate one delay instance in a process, called from Propagate
    Proc Shift one instance(P, D, R, Current call time, Service time, Case)
        Current interval := get interval(R, Current call time);
        New call time := Current call time - D;
        If New call time in Current interval
            Case := 3;                  // case 3
            return safe;
        EndIf
        Ideal release time := New call time + Service time;
        New interval := get interval(R, New call time);
        If Ideal release time in Current interval
            If New interval is empty
                // Same interval but shifted (case 3); the shift needs to be
                // compensated by delays past the resource request, similar to case 2
                Case := 2;
                return safe;
            Else                        // merging two intervals (case 6)
                If New interval.id = Current interval.id and New interval.flag != mixed
                    Case := 6;          // originally the same interval, split into two
                    return safe;
                Else
                    return unsafe;
                EndIf
            EndIf
        EndIf
        Bound interval := get interval(R, Ideal release time);
        If Bound interval is empty
            If New interval is empty
                // Target is an idle interval (case 1 or 2)
                If Current interval is not empty
                    Case := 2;
                Else
                    Case := 1;
                EndIf
                return safe;
            Else
                Extended release time := New interval.upper + Service time;
                define set S of all processes in New interval
            EndIf
        Else
            Extended release time := Bound interval.upper + Service time;
            define set S of all processes in Bound interval
        EndIf
        // Moving to another busy interval (case 4 or 5)
        If point2(P).id = New interval.id and New interval.flag != mixed
            Case := 5;                  // originally the same interval, split into two
            return safe;
        EndIf
        // Handling the simple safe case
        Foreach Process i in S with Process i != P
            If point1(i).type is not delay or delay size < Service time
                return unsafe;
            EndIf
        EndFor
        // There is delay to compensate the extension in release time (case 4)
        Case := 4;
        New interval.flag := mixed;     // mark the interval to avoid erroneous future safety analysis
        return safe;
    EndProc

Figure 10: Algorithm to Check the Safety of the Forward Motion of Delays

New Call Time        Ideal Release Time   Status        Case
Same interval        do not care          Safe          1, 3
Idle interval        Same interval        Safe          2, 3
Different interval   Same interval        Unsafe        6
Different interval   Idle interval        Investigate   4, 5

Figure 11: Multiprocess Safety Classification of Transformations

On average, however, delays are smaller than LCM, and phase 2 will not reach its worst-case bound. The success of applying our algorithm is application dependent: we may be able to remove all, some, or in the worst case none of the delays. In the next section, we examine the applicability of our algorithm using simulation.

8 Experiments

In the previous section, we showed that the multiprocess analysis algorithm can be applied efficiently. In this section, we address the applicability of our algorithm to real-time programs using simulation. Performing compiler optimization in distributed real-time systems can potentially be affected by a number of code parameters, themselves dependent upon application domain, module type, and individual and team coding styles and practices. Since it is difficult to obtain a suitable variety of real-world programs, we decided to investigate the effects of coding parameters through simulation. The experiment uses randomly generated programs. We discuss the simulation design, and present the experimental results of applying the multiprocess analysis, highlighting the impact of the various parameters affecting the overall success rate.

8.1 Design of Simulation

The experiment consists of the following steps:

1. Generating workloads,
2. Assigning execution times to statements,
3. Calculating the worst-case execution time (WCET) of processes,
4. Injecting DELAY statements,
5. Constructing resource access profiles for each process,
6. Building the busy-idle profile for each shared resource,
7. Performing multiprocess analysis to shift and remove the delay statements.

We illustrate each step, starting with program generation through a workload generator.

Generating Workloads: A program is a group of statements selected from the following: IF, WHILE, ASSIGNMENT, CALL, BLOCKING CALL, READ, and WRITE. The frequency of each type of statement is controlled by a probability density function. Based on our experience with the code of real-time systems [24], we assigned the probabilities in Figure 12 to each statement type.

IF    LOOP   ASSIGNMENT   CALL   BLOCKING CALL   READ/WRITE
10%   10%    35%          20%    5%              20%

Figure 12: Frequency of various statements in a generated program

Both READ and WRITE use buffers. Consequently they are considered non-blocking, although writes to a common location are ordered. Calls can be blocking (BLOCKING CALL) to access a shared resource, or non-blocking (CALL). Parameters to calls are randomly selected from the set of variables and classified randomly as in or out parameters. Loops and if statements are not primitive statements, in the sense that they contain more than one statement. Loops have an upper bound on the number of iterations, which will be used in the next step to compute the worst-case execution time.
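The generation step can be sketched as a weighted random draw over the statement types of Figure 12. This is a minimal sketch under our own naming; the paper's actual generator also builds the nested structure of loops and conditionals, which is omitted here:

```python
import random

# Statement types and their frequencies, taken from Figure 12.
FREQ = {'IF': 0.10, 'LOOP': 0.10, 'ASSIGNMENT': 0.35,
        'CALL': 0.20, 'BLOCKING_CALL': 0.05, 'READ_WRITE': 0.20}

def generate_program(n, seed=0):
    """Draw n statement types according to the Figure 12 distribution."""
    rng = random.Random(seed)
    kinds, weights = zip(*FREQ.items())
    return list(rng.choices(kinds, weights=weights, k=n))
```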

Assigning Times to Statements: After generating the program control flow graph, the next step is to associate with every statement its worst-case execution time. For a primitive statement, we assume that the execution time is proportional to the number of variables used or modified in that statement. Consequently, the execution time of a primitive statement can be computed by multiplying the number of variables involved by a constant. On the other hand, the execution time of a block is the sum of the execution times of the statements in that block.

Generation of WCET and deadline: Since in real-time systems the worst-case execution time is of greatest interest, the generated program control flow graph is analyzed to calculate an upper bound on the execution time. The worst-case execution time (WCET) is the sum of the worst-case times of all the statements of a program. For primitive statements, WCET is the assigned execution time, as discussed in the previous step. For conditional statements, the time of the longest path is used as WCET. The WCET for a loop is given by the sum of the times of the loop block statements times the upper bound of the loop index. In this simulation, we assume that a deadline is exactly WCET. Therefore, we use the WCETs of processes to calculate the least common multiple (LCM). Next, we inject delay statements, and the WCET needs to be adjusted.
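The WCET rules above can be sketched as a recursion over a statement tree. The node representation (tagged tuples) is our own assumption for illustration, not the paper's data structure:

```python
# WCET over a statement tree, following the rules in the text:
#   ('prim', t)                  primitive statement with assigned time t
#   ('seq', [stmts])             block: sum of member WCETs
#   ('if', then_blk, else_blk)   conditional: time of the longer branch
#   ('loop', bound, body)        loop: iteration bound * WCET(body)
def wcet(node):
    kind = node[0]
    if kind == 'prim':
        return node[1]
    if kind == 'seq':
        return sum(wcet(s) for s in node[1])
    if kind == 'if':
        return max(wcet(node[1]), wcet(node[2]))
    if kind == 'loop':
        return node[1] * wcet(node[2])
    raise ValueError(kind)
```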

Injection of DELAY Statements: We conservatively assume that single-process compiler optimization can speed programs up by a percentage of up to 10%. We randomly select a speedup percentage S in the range [0,10]%. DELAY statements are inserted with a total delay size of S * WCET. These DELAY statements are inserted randomly throughout the program. The worst-case execution time, calculated in the previous step, needs to be adjusted by multiplying it by a factor of (100 + S)%. After delay insertion, the program control flow graph is traversed and the array Test Points of points of interest is generated, as discussed in Subsection 7.1.
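A sketch of the injection step follows. The paper does not specify how the total delay S * WCET is split into individual DELAY statements, so the splitting scheme below (a random number of randomly sized pieces) is entirely our assumption:

```python
import random

def inject_delays(program, wcet, seed=0):
    """Pick S in [0, 10]%, scatter DELAY statements totalling S * WCET,
    and return the new program plus the adjusted WCET = WCET * (1 + S)."""
    rng = random.Random(seed)
    s = rng.uniform(0.0, 0.10)                       # speedup fraction S
    total_delay = s * wcet
    # Split the total delay into k randomly sized pieces (our assumption).
    k = rng.randint(1, 5)
    cuts = sorted(rng.uniform(0, total_delay) for _ in range(k - 1))
    pieces = [b - a for a, b in zip([0.0] + cuts, cuts + [total_delay])]
    out = list(program)
    for d in pieces:                                 # insert at random points
        out.insert(rng.randrange(len(out) + 1), ('DELAY', d))
    return out, wcet * (1.0 + s)
```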

Constructing Resource Access Profiles: As discussed in Section 4, to build the resource access profile for each process, we first have to apply the analysis of [23] to balance access to shared resources along branches of conditionals. Loops need to be unrolled (which may have been done in the previous step while calculating WCET) to calculate the call time to the shared resource. For each call to a resource from a process, we record the start time, service time, process id, and statement number of the call as an entry in an array indexed by process id and resource id. We repeat this for a number of periods up to the LCM. The contents of that array are used in the next step to build the busy-idle profiles of the shared resources.

Building the Busy-Idle Profile: Combining the resource access profiles of the individual processes, we construct the busy-idle profile of each shared resource. The restricted resource contention model, described in Subsection 4.2, is used to compute the boundaries of busy intervals. Note that a busy interval may contain several calls from different processes. The start time of an interval is the start time of the earliest request, and the release time is the sum of the start time and the service times of all requests in that interval. The busy-idle profiles generated in this step, as well as the list of points of interest built earlier, are used by the multiprocess analysis algorithm.
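The busy-interval construction described above can be sketched as a single pass over the requests sorted by start time. The boundary rule (a request starting exactly at a release time opens a new interval rather than joining the previous one) is our assumption:

```python
def busy_idle_profile(requests):
    """Merge (start_time, service_time) requests into busy intervals.
    A request whose start falls inside the current interval joins it and
    extends the release time by its service time; otherwise it opens a
    new interval at (start, start + service)."""
    intervals = []
    for start, service in sorted(requests):
        if intervals and start < intervals[-1][1]:
            s, r = intervals[-1]
            intervals[-1] = (s, r + service)   # join the busy interval
        else:
            intervals.append((start, start + service))
    return intervals
```

Gaps between consecutive intervals are the resource's idle intervals.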

Multiprocess Analysis: In this step, we perform multiprocess analysis to push delays to the ends of programs. Delays can be injected DELAY statements (compiler optimization), delays added while balancing conditionals, or delays inserted past requests moved out of a busy interval (case 2) during the execution of the multiprocess safety analysis algorithm. The results of our experiment are described in the next subsection.

8.2 Evaluation

We ran experiments with different numbers of processes and resources. The numbers of safe and unsafe cases are reported in Figure 13. The table lists sample runs showing the number of processes, the number of shared resources, the percentage of blocking calls in the programs (changes to the percentage of blocking calls are compensated by changes in the frequency of non-blocking calls), the size of the injected delays as a percentage of WCET, and the safe and unsafe cases found during the application of the multiprocess analysis. (Further experimentation is being carried out and will be reported in the final version.) As expected, an increase in the number of processes affects the number of successful cases: the busy intervals in the resource profiles become close to each other, and unsafe subcases of 4 and 5 appear more frequently. The table also indicates that an increase in the number of shared-resource accesses has the same effect. Case 2 happens at the highest rate, which indicates the effectiveness of our algorithm in reducing contention in distributed real-time systems. According to the algorithm, resource waiting time will be added as a post-access delay, which significantly reduces WCET (Figures 14 and 15). Figures 14 and 15 also show that the reduction in WCET scales with the size of the injected delays (single-process compiler optimization). WCET decreases with an increase in the number of requests. However, the amount of the decrease diminishes and eventually saturates, because the increase in resource requests brings the boundaries of the resource busy intervals closer together, making it harder to shift delays. On the other hand, a small number of resource requests shrinks the effect on WCET, since resource contention is less significant.

process  resource  request  delay | case 1  case 2  case 3  case 4+5  safe | case 6  case 4+5  unsafe
10       5         3%       5%    |   57      79      21       10      167 |    1       67        68
10       5         5%       5%    |   92     166      29       48      335 |    3       67        70
10       5         8%       5%    |  174     288      47       72      581 |    5      480       485
20       10        10%      10%   |  194     150      43       75      462 |   80      452       532

Figure 13: Summary of safe and unsafe cases

[Plots omitted: both figures show curves for 5 shared resources and 10 processes, plotting the percentage reduction against request frequency (3%, 5%, 8%, 10%) and injected delay size (5%, 10%).]

Figure 14: Reduction in WCET versus frequency of requests and delays

Figure 15: Reduction in resource contention versus frequency of requests and delays

9 Conclusion and Future Work

While single-process optimization can be performed safely, it may affect other processes by introducing contention for shared resources. We have developed an algorithm to apply code-improvement optimization safely in distributed real-time systems. The algorithm relies on a model of resource use and on scheduling with a restricted resource contention model. In this model, we show that we can perform the analysis in polynomial time. Our approach is based on applying machine-independent compiler optimization in two phases. In the first phase, we perform the transformations, compensating for the enhancement in performance with delays. In the second phase, we try to remove delays when this can be proved to be multiprocess safe. Experimental results indicate that we can not only improve the worst-case execution time of distributed real-time processes, but also reduce resource contention. We are currently working to extend our resource model to allow for resource optimization. We are also studying how to integrate multiprocess optimization safely in real-time systems. In the future, we intend to integrate our algorithm into a platform, currently under development, for complex real-time systems [25]. We are planning to test the applicability of compiler optimization techniques in realistic applications, and to investigate how to apply machine-dependent optimization safely to distributed real-time systems. We will also study whether repeating our algorithm achieves additional gains in performance.

References

[1] S. Amarasinghe, M. Lam. Communication Optimization and Code Generation for Distributed Memory Machines. Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, 126-138.
[2] S. Baruah, L. Rosier. Limitations concerning on-line scheduling algorithms for overloaded real-time systems. IEEE/IFAC Real-Time Operating Systems Workshop, Atlanta, Georgia, May 1991.
[3] V. Cherkassky. Redundant task-allocation in multicomputer systems. IEEE Transactions on Software Engineering, Vol. 3, No. 41, pp. 336-342, September 1992.
[4] K. Cooper, M. Hall, K. Kennedy. A methodology for procedure cloning. Proceedings of the IEEE International Conference on Computer Languages, April 1992.
[5] R. Gerber, S. Hong. Compiling Real-Time Programs with Timing Constraints Refinement and Structural Code Motion. CS-TR-3323, UMIACS-TR-94-90, Department of Computer Science, University of Maryland, July 1994.
[6] R. Gupta, M. Spezialetti. Busy-Idle Profiles and Compact Task Graphs: Compile-time Support for Interleaved and Overlapped Scheduling of Real-Time Tasks. Proceedings of the 15th IEEE Real-Time Systems Symposium, San Juan, Puerto Rico, December 1994.
[7] R. von Hanxleden, K. Kennedy. Give-N-Take: A Balanced Code Placement Framework. Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, 107-120.
[8] S. Hong, R. Gerber. Compiling Real-Time Programs into Schedulable Code. Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, June 1993, SIGPLAN Notices, 28(6):166-176.
[9] D. Leinbaugh, M. Yamini. Guaranteed Response Times in a Distributed Hard Real-Time Environment. IEEE Transactions on Software Engineering, Vol. SE-12, No. 12, pages 1139-1144, December 1986.
[10] T. Marlowe, S. Masticola. Safe Optimization for Hard Real-Time Programming. Special Session on Real-Time Programming, Second International Conference on Systems Integration, June 1992, 438-446.
[11] S. Midkiff, D. Padua. Issues in the optimization of parallel programs. Proceedings of the 1990 International Conference on Parallel Processing, Vol. II, 105-113.
[12] A. Mok, P. Amerasinghe, M. Chen, K. Tarntisivat. Evaluating Tight Execution Time Bounds of Programs by Annotations. IEEE Workshop on Real-Time Operating Systems and Software, Pittsburgh, PA, pp. 74-80, May 1989.
[13] F. Mueller, D. Whalley, M. Harmon. Predicting Instruction Cache Behavior. Proceedings of the ACM SIGPLAN Workshop on Language, Compiler, and Tool Support for Real-Time Systems, Orlando, Florida, June 1994.
[14] V. Nirkhe, W. Pugh. A partial evaluator for the Maruti Hard-Real-Time System. Real-Time Systems, 5(1), pp. 13-30, March 1993.
[15] C. Park. Predicting program execution times by analyzing static and dynamic program paths. Real-Time Systems, 5(1), pp. 31-62, March 1993.
[16] J. Snyder, D. Whalley, T. Baker. Fast Context Switches: Compiler and Architectural Support for Preemptive Scheduling. Journal of Microprocessors and Microsystems (to appear).
[17] M. Spezialetti, R. Gupta. Timed Perturbation Analysis: An Approach for Non-Intrusive Monitoring of Real-Time Computations. Proceedings of the ACM SIGPLAN Workshop on Language, Compiler, and Tool Support for Real-Time Systems, Orlando, Florida, June 1994.
[18] G. Steele Jr., et al. Fortran at ten gigaflops: the Connection Machine Convolution Compiler. Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, 145-156.
[19] A. Stoyenko. A Schedulability Analyzer for Real-Time Euclid. Proceedings of the IEEE 1987 Real-Time Systems Symposium, pages 218-225, December 1987.
[20] A. Stoyenko. A Real-Time Language with A Schedulability Analyzer. Ph.D. Thesis, Department of Computer Science, University of Toronto, 1987.
[21] A. D. Stoyenko, V. C. Hamacher, R. C. Holt. Analyzing Hard-Real-Time Programs for Guaranteed Schedulability. IEEE Transactions on Software Engineering, SE-17, No. 8, pages 737-750, August 1991.
[22] A. D. Stoyenko, T. J. Marlowe. Schedulability, program transformations and real-time programming. IEEE/IFAC Real-Time Operating Systems Workshop, Atlanta, Georgia, May 1991.
[23] A. D. Stoyenko, T. J. Marlowe. Polynomial-Time Transformations and Schedulability Analysis of Parallel Real-Time Programs with Restricted Resource Contention. Journal of Real-Time Systems, Vol. 4, No. 4, pages 307-329, November 1992.
[24] A. D. Stoyenko, T. J. Marlowe, W. A. Halang, M. Younis. Enabling Efficient Schedulability Analysis through Conditional Linking and Program Transformations. Control Engineering Practice, Vol. 1, No. 1, pages 85-105, January 1993.
[25] A. Stoyenko, T. Marlowe, M. Younis. A Language for Complex Real-Time Systems. To appear in The Computer Journal, Vol. 38, No. 5, November 1995.
[26] H. F. Wedde, B. Korel, D. M. Huizinga. Static analysis of timing properties for distributed real-time programs. IEEE Workshop on Real-Time Operating Systems and Software, May 1991, 88-95.
[27] M. Younis, T. Marlowe, A. Stoyenko. Compiler Transformations for Speculative Execution in a Real-Time System. Proceedings of the 15th Real-Time Systems Symposium, San Juan, Puerto Rico, December 1994.
[28] M. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA, 1989.

