Adaptive Resource Allocation for Embedded Parallel Applications

Rakesh Jha, Mustafa Muhammad
Honeywell Technology Center, Minneapolis, Minnesota, USA

Sudhakar Yalamanchili, Karsten Schwan, Daniela Ivan-Rosu, Chris de Castro
Georgia Institute of Technology, Atlanta, Georgia, USA

Abstract

Parallel and distributed computer architectures are increasingly being considered for application in a wide variety of computationally intensive embedded systems. Many such applications impose highly dynamic demands for resources (processors, memory, and the communication network), because their computations are data-dependent, because the applications must constantly interact with a rapidly changing physical environment, or because the applications themselves are adaptive. This paper presents a set of dynamic resource allocation techniques aimed at maintaining high levels of application performance in the presence of varying resource demands. It focuses on a class of applications structured as multiple pipelines of data-parallel stages, as this structure is common to many sensor-based applications. We discuss the issues involved in resource management for such applications, and present preliminary results from our implementations on the Intel Paragon. Our approach uses feedback control - a real-time monitoring system is used to detect significant performance shortfalls, and resources are reallocated among the application components in an attempt to improve performance. The main contribution of this work is that it combines real-time monitoring of an application's performance with dynamic resource allocation, and focuses on practical implementation rather than simulation and analysis.

1. Introduction

This paper addresses parallel applications whose computational needs and resource requirements vary significantly at run-time. For such applications to perform well over the entire range of their dynamic resource needs, it is necessary to allocate and reallocate resources continually during application execution. Our work is motivated by real-time defense applications that must constantly react to changes in an external physical environment and whose processing loads are heavily data-dependent. For example, in automatic target recognition (ATR) systems, processing loads can vary widely due to their heavy dependence on scene and algorithm parameters. As an aircraft moves closer to the targets of interest, the number of regions of interest to process changes, thereby changing the computational load on the aircraft's computers. Our goal is to develop ways to deliver the aggregate computing power of a parallel machine to such dynamic applications as effectively as possible. Our approach is based on the operational model shown in Figure 1.

Figure 1. Operational model of dynamic resource allocation (sensor input feeds the application on the HPC platform; a monitor compares monitored performance against desired performance, detects deviations, and triggers allocation and enactment)

We use a real-time instrumentation system to monitor application performance and to detect significant drops in performance. Detections trigger computation of a new resource allocation based on application execution profiles and the most recent performance history. The new allocation is enacted to close the control loop. This paper describes the application model, the real-time instrumentation system, and the algorithms we use for detection, resource allocation, and enactment. The implementation was tested on a synthetic parameterized application as well as a real ATR application using data from a Forward Looking Infra-Red sensor. Preliminary results are reported in terms of metrics for the quality, overhead, and stability of reallocation, and the perturbation to applications during and after reallocation. In the work reported here, our emphasis has been on the actual implementation of techniques that may be applied in practice, rather than on theoretical analysis and simulation. As described in the next section, much research has been done in dynamic resource allocation, but few practical implementations exist that can be used for the application systems of interest to us.

2. Related research

Existing research in dynamic resource allocation for parallel applications can be classified into the following categories - processor allocation [6,7,8,9,10], load balancing [2,3,4], program mapping [11,12], and dynamic partitioning of data-parallel computations [13]. In processor allocation research, the application model generally consists of a set of randomly arriving and departing independent applications. Allocation is triggered when applications arrive or depart, when a time quantum expires, when applications signal a change in their parallelism, or when processors fail.

Our application and performance models are different - target applications do not consist of independent tasks, their resource needs are data-dependent, applications may use both task- and data-parallelism, and performance metrics may be application-specific rather than measures such as mean response time commonly used in processor allocation research. No research known to us addresses automatic detection of current and impending performance problems to trigger a corrective reallocation.

Dynamic load balancing techniques that assume tasks are independent of each other are not applicable to our target applications (structured as multiple pipelines), because their tasks are interdependent. Other work on load balancing addresses one-time assignment - once a task is assigned to a processor, no further assignment is done for that task. There is little research in continual reallocation. However, the reported success of simple load balancing algorithms [3] encouraged us to believe that similarly simple reallocation algorithms would yield better performance than doing no dynamic reallocation at all.

Compiler support for resource allocation, as exemplified in data-parallel run-time libraries like PARTI and CHAOS, addresses dynamic partitioning of data and computations for mesh-structured SPMD (single program multiple data) problems [13]. In our target applications, changes in resource needs are driven by data dependence and external changes, and are not deducible from array access patterns in loops. Also, the applications employ both task- and data-parallelism. The Fx compiler at CMU [14] provides automatic mapping of integrated task- and data-parallel computations; however, the mapping is done statically. Static mapping algorithms are unsuitable for on-line invocation because of their high run-time cost, but they can be used to compute initial allocations if algorithms can be found to inexpensively modify the initial allocation at run-time. Only Nicol et al. [5] address dynamic remapping. They describe a scheme for dynamic remapping of data-parallel computations in response to unpredictable phase changes in the application. They use a user-written routine to detect phase changes, a Markov decision process to decide when to remap, and dynamic invocation of a static mapping algorithm to choose a new mapping. Our work applies to task-parallelism as well as data-parallelism, generalizes the triggers for remapping, and uses real-time instrumentation to detect the need for reallocation.

Operating systems such as Locus, Sprite, and GLUnix support processor allocation and remote program loading to harness available idle processor cycles on a network. They target general-purpose, multi-user environments where jobs are independent of one another, and support only one-time assignment.

3. System models

This section delineates the system models we address, including those for the target applications, machines, and resources. It also identifies the subset of the models that we have addressed in our work so far.

Applications of interest consist of tasks that must cooperate toward a shared goal. Tasks may be sequential or parallel. Applications may employ both task- and data-parallelism. The run-time behavior of tasks may be highly dynamic in terms of their computation loads, memory requirements, and communication loads. The variability arises because the applications are data-dependent, must react to an external physical environment, or are adaptive.

3.1. Target applications

Work reported in this paper addresses a class of sensor-based applications that can be structured as multiple pipelines of individually data-parallel stages, as shown in Figure 2. A sensor provides periodic batch input in the form of frames, which are processed by the application in sequence. Each pipeline stage is a parallel computation in its own right that must be carefully parallelized for it to perform well. A stage is also referred to as a task. The parallel version of a stage may involve significant communication between its subtasks.

Figure 2. Current target application model (sensor input data frames flow through data-parallel stages to the output)

Application performance metrics are often application-specific (for example, throughput or false-alarm ratio in ATR systems), rather than resource-oriented, such as efficiency, utilization, or mean response time. This is particularly true of embedded systems, as their performance is best viewed in terms of the quality of their interaction with the environment in which they are embedded.

3.2. Target machines

The target machines are multicomputers consisting of full-featured processors connected by an interconnection or communication network. Examples include machines such as Cray T3D, Intel Paragon, IBM SP2, and workstation clusters. In general, remote operations on these machines (e.g. message passing) are significantly more expensive than local operations (e.g. computation on local data). This implies that the relative mapping of an application’s components to the target machine can have an enormous effect on performance.

3.3. Resource allocation model

Resources consist of processors, memory, and the communication network. The mapping of an application to the target machine is a two-part problem - a) allocation, which concerns the allocation of processors to the stages of the pipeline, and b) assignment, which concerns the binding of specific processors to specific subtasks of the pipeline stages. Assignment affects the cost of intra-task communication (between subtasks) as well as of inter-task communication (between stages). We assume that the set of resources to be allocated is fixed - there is no dynamic addition or deletion of resources. Any change in mapping is effected only at frame boundaries; hence, the mapping when a frame enters the pipeline is the same as when its processing is completed. It is assumed also that a processor executes only one subtask at a time, that is, the processors are not multiprogrammed.
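As an illustration of this two-part model, the sketch below represents a mapping as a) an allocation (processors per stage) and b) an assignment (a processor bound to each subtask), and checks the constraints stated above. The stage names, data layout, and the check itself are our own illustration, not part of the paper's implementation:

```python
def check_mapping(allocation, assignment):
    """Verify that a two-part mapping is self-consistent.

    `allocation`: stage -> number of processors.
    `assignment`: (stage, subtask index) -> processor rank.
    Checks the paper's stated constraint that a processor executes only
    one subtask at a time (no multiprogramming), and that each stage is
    bound to exactly as many processors as its allocation grants."""
    procs = list(assignment.values())
    assert len(procs) == len(set(procs)), "a processor runs two subtasks"
    for stage, count in allocation.items():
        bound = [s for (s, _) in assignment if s == stage]
        assert len(bound) == count, f"{stage}: allocation/assignment mismatch"
    return True

# Hypothetical three-processor mapping for a two-stage pipeline.
allocation = {"detect": 2, "classify": 1}
assignment = {("detect", 0): 0, ("detect", 1): 1, ("classify", 0): 2}
check_mapping(allocation, assignment)
```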

4. Adaptive resource allocation (ARA)

As mentioned earlier, our approach is based on a four-step operational model for dynamic resource allocation:
• Monitor application performance using a real-time instrumentation system
• Detect significant deviation of performance from desired performance levels
• Compute a new resource allocation that is likely to improve performance significantly
• Effect the new resource allocation in a manner that minimizes the perturbation to the application during the transition

Figure 3 shows the run-time system architecture of our resource allocation system. One of its key design goals is that it should allow us to plug in a different realization of any component without affecting the other components.

Figure 3. ARA run-time system architecture (application probes feed monitored data into the instrumentation information repository; detection consults the performance history and the performance and allocation models to generate triggers; allocation algorithms compute a new map, which enactment applies to the application)

The rest of the paper presents our approach in more detail. We illustrate the issues involved and our solutions in terms of a generic application as shown in Figure 2.

4.1. Adaptation models

To effectively monitor performance, detect shortfalls, allocate resources, and enact reallocations, ARA must have access to information on - a) application characteristics, performance requirements, and resource attributes, and b) the policies according to which that information is to be used. Collectively, this information is referred to as the adaptation models. Figure 3 shows two such models - the performance model and the allocation model. For example, performance models describe how to measure and interpret application performance, and allocation models capture task execution profiles and allocation constraints.

4.2. Real-time monitoring and detection

The purpose of the real-time monitoring system is to continually evaluate user-specified performance measures to detect conditions that warrant resource reallocation. We use a real-time instrumentation system called the Honeywell Scalable Parallel Instrumentation system (SPI) [1]. SPI is based on an event-action model, in which all monitoring activity is viewed in terms of actions taking place in response to events. As actions can themselves generate new events, this permits hierarchical construction of monitoring functions. SPI has three main components - a) an event-action machine; b) a notation for specifying events, actions, and their relationships; and c) a library of predefined actions for real-time data analysis and display.

4.2.1 Monitoring

SPI permits the monitoring of application-specific performance measures, as well as the resource-oriented metrics used in traditional resource management techniques (e.g. processor loads in dynamic load balancing). Application data is signalled by probes inserted into the application. To minimize interference between the application and the instrumentation, we use dedicated resources for executing the instrumentation functions.

As every subtask of a pipeline stage completes its work on a frame, it sends data to SPI about its performance on that frame. This data consists of the frame identifier, the task and subtask identifiers, and the total time spent by the subtask in computation and intra-task communication. When all subtasks have reported data for a frame, a number of statistics are computed for the completed frame, including the performance of individual tasks and that of the pipeline as a whole on that frame. SPI maintains a repository of monitored data for use in detection and allocation decisions. It also allows us to continually update the execution profiles of individual tasks to capture their performance as a function of the resources allocated to them. The advantage of dynamic updates is that allocation decisions can be based on the most recent behavior of the application.
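A minimal sketch of this per-frame reporting and aggregation follows. The record fields mirror those listed above (frame, task, and subtask identifiers, plus the compute and intra-task communication time); the repository class, the choice of per-frame statistic, and all names are our own illustration, not SPI's actual interface:

```python
from dataclasses import dataclass


@dataclass
class ProbeReport:
    """One probe message: frame id, task and subtask ids, and the
    subtask's total computation + intra-task communication time."""
    frame: int
    task: str
    subtask: int
    busy_time: float


class FrameRepository:
    """Collect probe reports per frame; once every expected subtask has
    reported, compute per-task and whole-frame statistics. `expected`
    maps task -> number of subtasks. The statistic used here (slowest
    subtask per task, slowest task overall) is illustrative."""

    def __init__(self, expected):
        self.expected = expected
        self.pending = {}   # frame -> list of reports received so far
        self.stats = {}     # frame -> computed statistics

    def report(self, r: ProbeReport):
        reports = self.pending.setdefault(r.frame, [])
        reports.append(r)
        if len(reports) == sum(self.expected.values()):
            # All subtasks have reported: a task's frame time is its
            # slowest subtask; the frame time is the slowest task.
            per_task = {}
            for rep in reports:
                per_task[rep.task] = max(per_task.get(rep.task, 0.0),
                                         rep.busy_time)
            self.stats[r.frame] = {"tasks": per_task,
                                   "frame_time": max(per_task.values())}
            del self.pending[r.frame]
```

A detector can then read `stats` frame by frame to build the time-varying functions described in the next subsection.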

4.3. Detection

The monitoring data in the SPI repository is used to construct a set of time-varying functions that represent the behavior of the application, as shown in Figure 4. These functions are analyzed to detect performance deviations. The main difficulty in the analysis is that detection must be done in the presence of noise, transients, and the unpredictability of performance changes caused by a dynamic external physical environment. Currently, we use simple filters to eliminate noise and transient effects in the raw data; for example, moving averages are used instead of instantaneous raw performance values. Two types of mechanisms are used to decide whether reallocation should be triggered - detection of a performance deviation after it has already occurred, and prediction based on current trends. Prediction determines the current performance trend by fitting a line through the moving averages and then extrapolating that line into future frames.

Figure 4. Analyzing performance to detect shortfall (raw performance, e.g. throughput, and smoothed performance, e.g. moving average, plotted against frames, with an acceptable threshold)

A basic reallocation condition exists when either the instantaneous performance falls below acceptable levels or the performance is predicted to fall below acceptable levels in the near future. To prevent thrashing and to filter transient effects, a reallocation trigger is generated only if the basic conditions persist over a user-specified number of successive frames, and only if the previous remapping, if any, has completed its transit through the pipeline. In addition to the automatic detection described above, other sources may trigger resource reallocation. For example, reallocation may be requested directly by an application when it knows of an impending change in its resource needs (e.g. in multi-phase computations), requested by an interactive user to steer a long-running application, or triggered in response to faults.
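The detection mechanics described above - moving-average smoothing, a line fitted through the smoothed values and extrapolated into future frames, and a persistence requirement before triggering - can be sketched as follows. All window sizes, horizons, and thresholds here are illustrative, not the paper's settings:

```python
from collections import deque


class ShortfallDetector:
    """Trigger reallocation when smoothed performance is below an
    acceptable threshold, or a fitted trend predicts it will be,
    for `persist` successive frames."""

    def __init__(self, window=8, persist=3, horizon=5, threshold=1.0):
        self.persist = persist      # successive frames the condition must hold
        self.horizon = horizon      # frames to extrapolate into the future
        self.threshold = threshold  # acceptable performance level
        self.raw = deque(maxlen=window)
        self.smoothed = deque(maxlen=window)
        self.violations = 0

    def _trend(self):
        """Least-squares line through the smoothed samples, evaluated
        `horizon` frames past the latest sample."""
        n = len(self.smoothed)
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(self.smoothed) / n
        num = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(xs, self.smoothed))
        den = sum((x - mean_x) ** 2 for x in xs) or 1.0
        slope = num / den
        return mean_y + slope * (n - 1 - mean_x + self.horizon)

    def observe(self, perf):
        """Feed one frame's performance; return True to trigger."""
        self.raw.append(perf)
        self.smoothed.append(sum(self.raw) / len(self.raw))
        below_now = self.smoothed[-1] < self.threshold
        below_soon = (len(self.smoothed) >= 2
                      and self._trend() < self.threshold)
        self.violations = self.violations + 1 if (below_now or below_soon) else 0
        return self.violations >= self.persist
```

With a persistence of two frames, a sudden drop triggers only after the condition holds twice, which is the thrash-damping behavior described above.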

4.4. Resource allocation and assignment

Computation of a new allocation of resources is done when a reallocation trigger is received. Because algorithms that compute optimal solutions have high run-time cost, we use greedy techniques and incremental algorithms that compute a new allocation incrementally from a previous allocation and the changes that have taken place since that allocation. As has been found in dynamic load balancing and paging systems, we expect that even simple algorithms will lead to significantly better performance than attempting no dynamic reallocation at all.
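A minimal sketch of such a greedy, profile-driven scheme is shown below, assuming per-task execution profiles measured as per-frame times and a throughput estimate limited by the slowest stage. The data layout, donor criteria, and stopping rules are our simplifications, not the paper's exact algorithm:

```python
def greedy_reallocate(alloc, profiles, max_iters=10):
    """Repeatedly move one processor from a task with slack to the
    bottleneck task, keeping a donation only if the estimated frame
    time improves. `profiles[task][n]` is the measured per-frame time
    of `task` when run on `n` processors (its execution profile)."""

    def frame_time(a):
        # Pipeline throughput is limited by the slowest stage.
        return max(profiles[t][n] for t, n in a.items())

    alloc = dict(alloc)
    for _ in range(max_iters):
        bottleneck = max(alloc, key=lambda t: profiles[t][alloc[t]])
        # Candidate donors: other tasks that can spare a processor and
        # whose profiles cover the changed allocations.
        donors = [t for t in alloc
                  if t != bottleneck and alloc[t] > 1
                  and (alloc[t] - 1) in profiles[t]
                  and (alloc[bottleneck] + 1) in profiles[bottleneck]]
        best, best_time = None, frame_time(alloc)
        for d in donors:
            trial = dict(alloc)
            trial[d] -= 1
            trial[bottleneck] += 1
            t = frame_time(trial)
            if t < best_time:
                best, best_time = trial, t
        if best is None:   # no donation improves estimated performance
            break
        alloc = best
    return alloc
```

On a hypothetical two-stage profile where stage "A" dominates, the loop shifts processors toward "A" until further donations stop helping.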

4.4.1 Processor allocation

Performance statistics for the recently completed frames are examined to determine which tasks are becoming bottlenecks and which have slack. A new processor allocation is computed from the current allocation by evaluating potential donations by lightly loaded tasks to heavily loaded tasks. Each potential donation is evaluated for its effectiveness by estimating the application performance under the new allocation, using task execution profiles. An execution profile describes how a task's performance changes with the number of processors allocated to it. This process continues until a preset number of iterations have been completed or further donations do not improve estimated performance significantly.

4.4.2 Assignment

Given a new allocation, assignment computes which specific processors will execute which subtasks of the tasks in an application. The list of donors and acceptors from the allocation step is the main input to assignment. The computation is based on determining which donor tasks will donate processors to which acceptor tasks. Different assignments may have different effects on application performance during and after the transition from an old assignment. Recall that reassignment is effected at frame boundaries, that is, the system is frame-synchronous. Let the first frame that executes per a new assignment be called the effective remapping frame, F_r. Transition is defined as the period between the time that the effective frame enters the pipeline and the time that it leaves the pipeline. During transitions, the old and new maps coexist. We developed two algorithms for assignment - one based on minimizing the perturbation to the application during transition (called cascading), and the other based on an incremental branch-and-bound search.

4.4.2.1 Cascading

Figure 5(a) illustrates the perturbation in an application if care is not taken in matching donor tasks to acceptor tasks. If a donor task is downstream in the pipeline from the corresponding acceptor task, then the acceptor task must wait for the donor to complete processing of frame F_r - 1 before the acceptor can start processing frame F_r. This amounts to requiring that the pipeline segment between the donor and the acceptor be flushed. The delay is minimized by a scheme called cascading: donations to upstream tasks are effectively implemented as a series of donations to immediately upstream tasks, with the intermediate tasks acting as donors as well as acceptors. In Figure 5(b), the acceptor task B can start processing frame F_r as soon as task C has completed frame F_r - 1. This reduces the delay in the processing of frame F_r through the pipeline.

Figure 5. Cascading to reduce perturbation in transition (a: a downstream donor forces the pipeline segment between donor and acceptor to be flushed; b: cascaded donations through intermediate tasks avoid the flush)

Because an application may not be structured as a straight pipeline and may contain multiple paths, a generalization of the above scheme is actually used. Tasks are sorted in increasing order of their critical times, where the critical time is the estimated latest time at which a task must start executing frame F_r so as not to degrade the current throughput of the pipeline. The estimates are computed from the monitored performance of every task on the latest completed frame. Note that in a straight pipeline, upstream tasks are always more critical than downstream tasks, in that upstream tasks have lower critical times. Assignment is done by traversing the sorted list looking for a potential acceptor, starting from the most critical task, and for every acceptor looking for potential donors, also starting from the most critical task. A task is a valid donor only if it has processors, and is either less critical than the acceptor or has more processors than it needs in the new allocation.

4.4.2.2 Incremental space search

Incremental versions of the branch-and-bound algorithm correspond to a search of a reduced search space, and an example incremental mapping algorithm performs reasonably well. Two techniques are used to reduce the space - a selection policy that determines which tasks are to be reassigned, and a state-generating function that seeds the incremental search. The effective remapping frame is one frame after the frame at which the source task receives the new mapping.
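The critical-time-ordered donor/acceptor matching described for cascading can be sketched roughly as follows. The dict-based interface and the one-processor-at-a-time transfers are our simplifications of the paper's scheme:

```python
def match_donors(tasks, need, have, critical):
    """Match donors to acceptors in increasing order of critical time.

    `need`/`have` map task -> processor counts in the new/current
    allocation; `critical` maps task -> estimated critical time. A task
    may donate only if it has processors and is either less critical
    than the acceptor (larger critical time) or has surplus in the new
    allocation. Mutates `have` in place; returns the transfer list."""
    order = sorted(tasks, key=lambda t: critical[t])  # most critical first
    transfers = []  # (donor, acceptor, count) tuples
    for a in order:
        while have[a] < need[a]:
            donor = next((d for d in order
                          if d != a and have[d] > 0
                          and (critical[d] > critical[a]
                               or have[d] > need[d])),
                         None)
            if donor is None:
                break
            have[donor] -= 1
            have[a] += 1
            transfers.append((donor, a, 1))
    return transfers
```

On a hypothetical three-stage pipeline where the most critical task A needs a processor, the result exhibits the cascade: B donates to A, and C backfills B, so no task waits on a far-downstream donor.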

4.5. Enactment

Enactment refers to ensuring that signal frames, from the effective remapping frame onwards, are executed according to the new mapping. We chose to implement enactment at the application level because of its simplicity. The main issues in application-level enactment are - a) sending the effective remapping frame number and the new mapping to the application, and b) effecting the remapping in a frame-synchronous manner.

In our implementation, the entire application is written as a single SPMD program that contains the code for all tasks in the application. It is assumed that input signal frames arrive at a single processor; a source task running on that processor then sends each frame into the pipeline. On every processor, a local copy of the currently active task mapping is checked at the start of every frame to determine which subtask that processor will execute for that frame. New mappings, if any, are received from the resource allocation system at the start of a frame, to be effected starting at the effective remapping frame.

Rather than broadcasting a new task mapping to every processor, the resource allocation system sends the new map only to the processor executing the source task. The new map then travels through the pipeline, riding piggyback with the frame data sent from the source task and with the data output by every subtask to the subtasks of the next stage. Piggybacking a new map with application data reduces the number of messages sent by the resource allocation system and hence its interference with application messages. This scheme ensures that senders always send messages to the right set of destination processors and that receivers post receives from the right set of sending processors.

Because code for all tasks exists on every processor, no code migration is needed. Also, we have assumed so far that applications do not have memory between frames, i.e. the results of computations on one frame are not needed in the computation of subsequent frames. This implies that data migration is not needed either for frame-synchronous enactment.
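The frame-synchronous switching described above might look roughly like this on a single processor, with the piggybacked update modeled as an optional (new map, effective frame) pair. Every function parameter here is a stand-in for the application's own messaging and compute routines, not an actual API:

```python
def run_frames(rank, frames, initial_map, recv_new_map, exec_subtask, forward):
    """Per-processor main loop for frame-synchronous enactment.

    `initial_map` maps task name -> set of processor ranks. At each
    frame boundary, the local copy of the active map is consulted; a
    piggybacked (new_map, effective_frame) update, if received, takes
    effect only when the effective remapping frame arrives."""
    active_map = dict(initial_map)
    pending = None  # (new_map, effective_frame) or None
    for frame in frames:
        update = recv_new_map(frame)   # arrives piggybacked on frame data
        if update is not None:
            pending = update
        if pending is not None and frame >= pending[1]:
            active_map = dict(pending[0])   # switch at the effective frame
            pending = None
        # Which subtask does this processor run for this frame?
        my_task = next((t for t, ranks in active_map.items()
                        if rank in ranks), None)
        if my_task is not None:
            out = exec_subtask(my_task, frame)
            # Forward results downstream, carrying the update along if
            # it has not yet taken effect.
            forward(out, pending)
    return active_map
```

Because the update rides with the frame data, a processor that switches maps at frame F_r is guaranteed that its upstream senders switched at the same frame.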

5. Performance metrics

We identified the following measures to evaluate the effectiveness of ARA:
• How effective are the mechanisms for detecting reallocation triggers in the presence of noise, transients, and the unpredictability of external changes?
• What are the time delays between the detection of a reallocation trigger and the resulting reallocation?
• What is the nature of the perturbation to the observable behavior of applications due to reconfiguration?
• How effective are the reallocations computed by the allocation algorithms that we implemented?
• How stable is the control?

Simple metrics were defined to evaluate the measures listed above. ARA speedup S is the ratio of the total time taken to execute a given number of frames without ARA to the time taken with ARA. To evaluate the allocation and assignment algorithms, remapping gain G is the ratio of the application performance for the frame immediately following a remap to that for the frame immediately before the remap. The time delay between detection and the actual remap is evaluated in terms of the performance loss L due to the extra delay. Let F_t be the frame upon whose completion a reallocation was triggered. F_o, the earliest frame at which remapping could have been effected, is computed as:

    F_o = F_t + latency(F_t) / frame_period + 1                    (1)

The performance loss is the difference between the actual time taken to execute the frames from F_o to F_r and an estimate of the time that would have been taken if remapping had been effected at frame F_o. If T(F) denotes the time at which frame F was completed, then L = T_act - T_est, where

    T_act = T(F_r) - T(F_o - 1)                                    (2)
    T_est = T(F_r + (F_r - F_o)) - T(F_r - 1)

The total performance loss due to delays in remapping, over possibly multiple remappings in a long application run, is expressed in terms of efficiency E, where

    E = (T_run - Σ_{i=1..N_r} L_i) / T_run                         (3)

where T_run is the total run time of the application, N_r is the number of remaps, and L_i is the loss during the i-th remap.
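The three equations can be computed directly from frame completion times. In the sketch below, rounding the latency term of equation (1) up to a whole number of frames is our assumption (F_o must be an integer frame number); T is modeled as a mapping from frame number to completion time:

```python
import math


def earliest_effective_frame(F_t, latency, frame_period):
    """Equation (1): F_o = F_t + latency(F_t)/frame_period + 1.
    Rounding the latency term up to whole frames is our assumption."""
    return F_t + math.ceil(latency / frame_period) + 1


def remap_loss(T, F_o, F_r):
    """Equation (2): L = T_act - T_est, where T maps a frame number to
    its completion time (e.g. a dict of timestamps)."""
    T_act = T[F_r] - T[F_o - 1]
    T_est = T[F_r + (F_r - F_o)] - T[F_r - 1]
    return T_act - T_est


def efficiency(T_run, losses):
    """Equation (3): E = (T_run - sum of L_i over N_r remaps) / T_run."""
    return (T_run - sum(losses)) / T_run
```

For example, with the first synthetic run in Table I (T_run = 265 s, ΣL_i = 5.64 s), `efficiency` reproduces the tabulated E of roughly 97.8%.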

6. Preliminary results

ARA is being tested using synthetic as well as real applications. The synthetic application is a parameterized SPMD program that implements the application model described in section 3.1. Its parameters can be controlled so as to create user-defined pipeline structures and patterns of dynamic variation in the computation and communication loads of the various pipeline stages, as well as variations in inter-stage communication loads. The real application is an automatic target recognition algorithm in which the resource demands change continuously as the distance to the targets shrinks on approach. Table I gives the observed results for a synthetic application (multiple pipeline with 6 stages) over 50 frames, for two runs on a 16-node Intel Paragon.

    Run   T_run   S        N_r   Avg. G   ∑L_i     E
    1     265 s   67.12%   2     70%      5.64 s   97.8%
    2     264 s   7.98%    2     9%       8.78 s   96.6%

    Table I: Synthetic application results

In the first run the initial allocation was arbitrary, while in the second run it was based on equal allocation across all pipeline stages. The poor initial allocation explains the higher speedup obtained in the first run relative to no dynamic reallocation. Table II presents initial results from the ATR application over a 60-frame run on a 16-node Paragon.

    Run   T_run    S        N_r
    1     968 s    16.94%   3
    2     1097 s   3.19%    3

    Table II: ATR results

The overhead of dynamic remapping is the sum of the times spent in monitoring, updating the monitored-data repository, detecting a significant performance drop and generating the reallocation trigger, computing the new allocation and assignment, and enactment. The observed overhead for both the synthetic and ATR applications reported above was less than 1.0 second over application runs of more than 4 minutes each, and was dominated by the time spent in the mapping algorithms.

Prediction algorithms did not work well, since the load changes were often sudden rather than gradual. While there was some thrashing in the allocation algorithms, the overall control was stable, mainly due to the damping effect of allowing no further reallocation until the previous one has gone through the pipeline. The performance of cascading versus incremental branch-and-bound assignment has not yet been evaluated.

7. Conclusion

This paper described an approach and implementation for adaptive resource allocation for applications structured as multiple pipelines of data-parallel computations and characterized by highly dynamic resource needs. We are encouraged by the initial results, although the application and implementation models are based on a number of simplifying assumptions. Some of the assumptions must be removed for the techniques to apply in practice, including, for example, the assumptions of memoryless processing of frames and of centralized resource management.

It is interesting to note that even with centralized resource allocation, decisions about reallocation are based on outdated data, as the application continues to execute even while the monitored data from completed frames is used to make the allocation decisions. With decentralized allocation, the data used for making decisions will also be out of date - just more so.

Further work is in progress to evaluate the system more thoroughly. It is encouraging that simple algorithms for detection, remapping, and enactment show significant improvement in overall application performance in the tests so far. The use of application-specific instrumentation in dynamic resource allocation is a new and appealing technique for high-performance embedded systems.

References

1. D. Bhatt, R. Jha, T. Steeves, and R. Bhatt, "SPI - An instrumentation development environment for parallel/distributed systems", in Proceedings of the 9th International Parallel Processing Symposium, April 1995.

2. A. N. Tantawi and D. Towsley, "Optimal static load balancing in distributed computer systems", Journal of the ACM, 32(2), April 1985.

3. D. L. Eager, E. D. Lazowska, and J. Zahorjan, "Adaptive Load Sharing in Homogeneous Distributed Systems", IEEE Transactions on Software Engineering, Vol. 12, No. 5, May 1986.

4. G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors", Journal of Parallel and Distributed Computing, 7(2), October 1989.

5. D. M. Nicol and P. F. Reynolds, "Optimal dynamic remapping of data parallel computations", IEEE Transactions on Computers, 39(2), February 1990.

6. C. McCann and J. Zahorjan, "Processor Allocation Policies for Message-Passing Parallel Computers", in Proceedings of ACM SIGMETRICS, May 1994.

7. K. C. Sevcik, "Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems", Performance Evaluation, Vol. 19, 1994.

8. T. Brecht, X. Deng, and N. Gu, "Competitive Dynamic Multiprocessor Allocation for Parallel Applications", in Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing (SPDP'95), October 1995.

9. S. K. Setia, M. S. Squillante, and S. K. Tripathi, "Analysis of Processor Allocation in Multiprogrammed, Distributed-Memory Parallel Processing Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 4, April 1994.

10. E. Rosti, E. Smirni, L. W. Dowdy, G. Serazzi, and B. M. Carlson, "Robust Partitioning Policies of Multiprocessor Systems", Performance Evaluation, Vol. 19 (2-3), March 1994.

11. S. H. Bokhari, "On the mapping problem", IEEE Transactions on Computers, Vol. 30, No. 3, 1981.

12. V. M. Lo, S. Rajopadhye, S. Gupta, D. Keldsen, M. A. Mohamed, and J. A. Telle, "OREGAMI: software tools for mapping parallel computations to parallel architectures", in Proceedings of the 1990 International Conference on Parallel Processing, Vol. II, Software, August 1990.

13. R. Ponnusamy, J. Saltz, and A. Choudhary, "Runtime compilation techniques for data partitioning and communication schedule reuse", in Proceedings of Supercomputing '94, IEEE Press, pp. 97-106.

14. J. Subhlok, "Automatic mapping of task and data parallel programs for efficient execution on multicomputers", Technical Report CMU-CS-93-21, CMU, November 1993.