A Hardware/Software Codesign Approach to System-Level Design of Real-Time Applications
Jakob Axelsson
Dept. of Computer and Information Science, Linköping University, S-581 83 Linköping, SWEDEN
Email:
[email protected]
Abstract
The problem of validating that a completed implementation of a real-time system meets its deadlines has been extensively studied, but much less research has been done on how to find such an implementation from a given set of requirements. This paper proposes a hardware/software codesign methodology that makes it possible to evaluate early in the design process whether a proposed implementation will satisfy the timing constraints, and to use the constraints as a basis for implementation decisions. It also suggests a set of tools which can automatically investigate the design space. These features of the approach make the design process more efficient and predictable.
1 Introduction
When a hard real-time system is designed, the main concern is to guarantee that the final implementation will always meet its timing constraints. This means that it should be possible to prove, using a static validation procedure, that the performance of the implementation is sufficient to reach this goal. At the same time, it is desirable to reduce the implementation cost, so computational power has to be traded off against cost. Plenty of research has been done on how to validate that a final implementation meets its real-time deadlines, but as important as this may be, it is of little help when selecting which implementation to use. The problem is that the validation procedure applies to a completed system at the end of the design flow, whereas the important characteristics of the system—its performance and its cost—are to a large extent determined by early design decisions. The most important ones are those that involve the selection of a hardware architecture, and, in the case when there are
several processing elements, deciding how the functionality is distributed on that architecture. If an architecture is selected which is not powerful enough, this is not likely to be discovered until the final validation phase, when the designer attempts to prove that the deadlines are reached. This might mean that some of the fundamental design decisions have to be revised, leading to serious delays and a higher development cost. In other words, problems are created early and discovered late. Characteristic of embedded real-time systems is also that they are special-purpose systems, i.e. they are intended to be used only for a particular task in a certain environment. This means that the implementation can be carefully optimized to give the best compromise between performance and cost for that particular situation. It is not unusual that the best hardware architecture for a certain application is one which contains several processors of different kinds, or even application-specific integrated circuits (ASICs). We refer to an architecture with several processing elements of different types as a heterogeneous architecture. Of course, with heterogeneous architectures, the number of possible implementation combinations increases drastically, and it becomes very difficult to make a thorough design space exploration manually. Therefore, researchers in the area of hardware/software codesign (hereafter codesign, for short) have studied automatic techniques for the analysis and synthesis of heterogeneous systems. The key issue that needs to be analyzed is how hardware resource sharing between tasks influences the timing characteristics. This analysis is then used to improve the design by reducing task conflicts over resources or reducing the cost by selecting cheaper resources. In this paper, we present a codesign approach to early design exploration for real-time systems. It allows the timing constraints to influence the early design decisions, and increases the likelihood that the development
process will lead to a final result which reaches the deadlines. In this way, the real-time system design process becomes more predictable. In the next section, we give a brief introduction to some related research, and in Section 3 the proposed design environment is introduced. Then, in Section 4, the necessary system analysis models are discussed, and in Section 5, tools are presented that can be used to automatically or semi-automatically search for suitable implementations. Finally, in the last section, conclusions and directions for future research are given. This article provides an overview of several other papers [2, 4, 5], where further details can be found. It is also a summary of [3].
2 Related research
There is certainly no lack of methods for specifying and developing real-time systems. Many of these deal mostly with the specification and refinement of functional requirements, like Mellor and Ward's [11] or MASCOT [12], but a few, e.g. the HRT-HOOD method [6], also consider hardware resource usage and the relation to timing constraints and schedulability analysis, even though it is unclear how variations in the hardware architecture are treated in these methods. However, to our knowledge there is no method which allows early analysis of resource usage for heterogeneous hardware platforms, and we believe that this is a key capability for the predictable development of real-time systems. In the codesign area, many techniques and tools for developing heterogeneous embedded systems have been proposed (see e.g. [8, 13, 14]). However, these techniques do not deal with timing constraints of the kind that are normally found in real-time applications. Many researchers aim at accelerating a software program as much as possible by implementing parts of it using ASICs (see e.g. [7]), and in the cases where timing constraints are considered at all, these are usually very local (e.g. [10]), as opposed to the end-to-end deadlines that are common in real-time systems. Further, they do not treat the constraints rigorously using static analysis, but instead validate them through simulations, which is clearly not suitable for hard real-time applications. This article differs from earlier research in that it combines issues of software engineering, hardware/software codesign, and real-time systems analysis and modelling into one consistent methodology, supported by dedicated tools. Unlike previous work, it stresses the need to analyze and monitor timing constraints throughout the design process, and provides the necessary techniques for doing so on a wide range of hardware architectures that may be used to implement real-time systems.
Figure 1. A workbench for system-level codesign of real-time systems. (Figure omitted: it shows the behavioural description and component library feeding the virtual prototype, which is refined through transformations, analysis, and synthesis towards an implementation.)
3 A codesign workbench
In this section, we present a workbench, or design environment, which can be used to do implementation space exploration at the system level, during the early design phases. At the heart of the system is a data structure called a virtual prototype (VP), which captures the design decisions and is sufficiently detailed to allow a meaningful analysis of cost and timing constraints. The VPs can be manipulated via a set of transformations, thereby producing new, and hopefully better, designs, and these transformations form the basis for a set of synthesis tools, which can automatically improve different aspects of the design. Figure 1 shows an overview of the workbench. In the remainder of this section, we will describe the VPs and the transformations in more detail.
3.1 Virtual prototypes
The purpose of the VPs is to capture the most important design decisions in sufficient detail to allow a good prediction of the cost and performance that a real implementation with similar structure would have. We have worked with VPs containing the following information:
Behavioural description. This is the specification of the system's intended functionality, expressed as a graph of tasks and shared data objects. The tasks can be organized in producer/consumer chains, and deadlines may be placed over several tasks in such a chain. This problem formulation is similar to that used in e.g. [9]. The minimal time between task activations is assumed to be given. Accesses to the shared data objects constitute critical regions where a task may not be pre-empted.
Architecture. The hardware architecture which is used to execute the behaviour is described by a graph where the nodes are components and the edges their
logical interconnections. We have considered four kinds of components: micro-processors, ASICs, instruction caches, and memories. Buses are implicit in the architectures.
Component library. The components in the architecture are selected from a component library, which also contains analysis data that can be used when determining a system's characteristics.
Partitioning. The tasks and the shared data objects of the behavioural description have to be allocated to different components in the architecture, and this allocation is described by the partitioning. We require that a task is allocated to a processing element (a processor or an ASIC), and a data object to a storage element (a memory or an ASIC), and that the communication and data usage prescribed in the behavioural description are indeed possible on the architecture under this partitioning.
Scheduler. The tasks are scheduled according to a fixed-priority relation, so the scheduler is conveniently described by a total order over the set of tasks.
Figure 2 illustrates how a behavioural description, with tasks τi and data objects δj, can be partitioned on an architecture. There are certainly many ways the virtual prototypes can be organized, and different assumptions can be made depending on what class of systems is considered. But the overall content of the VPs has to be approximately as described above, because all these aspects are necessary for analyzing the timing characteristics of the complete system.
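To make this concrete, the following is a minimal sketch, in Python and purely for illustration, of how a virtual prototype could be represented as a data structure; all class and field names are assumptions and are not taken from the actual workbench.

```python
# Illustrative sketch of the five parts of a virtual prototype; the names are
# assumptions, not the workbench's actual representation.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    period: int                     # minimal time between activations
    deadline: int                   # may be an end-to-end deadline over a chain
    activator: str = None           # predecessor in a producer/consumer chain, if any
    data_objects: list = field(default_factory=list)  # shared data accessed (critical regions)

@dataclass
class Component:                    # processor, ASIC, instruction cache, or memory
    name: str
    kind: str
    cost: float

@dataclass
class VirtualPrototype:
    tasks: list                     # behavioural description: tasks ...
    data_objects: list              # ... and shared data objects
    components: list                # architecture: component nodes
    connections: list               # logical interconnections (pairs of component names)
    partitioning: dict              # task / data-object name -> component name
    priorities: list                # total order over task names = fixed-priority scheduler
```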
3.2 Tools and transformations
Out of the five parts in the virtual prototype, the behavioural description and the component library are assumed to be given, and not modified during the design. But for the other three parts there are a lot of possible choices that need to be considered, and to do this efficiently, automatic search procedures can be used. These procedures work iteratively, by modifying the VPs using a set of transformations, and then, for each generated VP, its cost and ability to reach the deadlines are estimated. It is beneficial to use a limited set of well-defined transformations, since it then becomes easy to formally verify that only correct VPs are generated by a tool which only uses these transformations. Another reason is that a lot of analysis of the VPs is performed during the iterative search, and when a transformation is applied, only the analysis results which are directly affected by this change need to be updated. By using a small set of simple
Figure 2. An example of how a behaviour can be partitioned on an architecture. (Figure omitted: tasks τ1–τ6 and data objects δ1, δ2 are mapped onto a processor, an ASIC, a memory, and a cache connected by a bus.)
transformations, it becomes easy to ensure that no unnecessary updating is carried out. The transformations we have used are:
Scheduler transformations. Any fixed-priority order can be constructed from any other by a series of swaps of the priorities of two tasks, so this is the only transformation needed for the scheduler.
Partitioning transformations. To change the partitioning, tasks and data objects must be moved between components, so we need two transformations for this. A pre-condition of these transformations is that the new allocation does not make it impossible for some task to communicate with other tasks or use data objects, all according to the structure of the behavioural graph.
Architecture transformations. The architecture is more complex than the scheduler or partitioning, and we need a larger set of transformations to restructure it. We have used transformations for adding a new component, removing a component, and replacing a component with an equivalent one (a processor by another processor, etc.). Further, we might reconnect components to a different memory or cache, and split and merge processing units, thereby influencing the partitioning at the same time. In addition, we have also used a cleaning transformation, which removes components that are not used from the architecture. For instance, if a processor only contains one task, and this is moved away by a partitioning
transformation, the processor can be removed, thus reducing the cost.
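As an illustration, the following Python fragment sketches three of these transformations operating on the VirtualPrototype structure sketched in Section 3.1; the pre-condition check is delegated to a caller-supplied predicate, since its exact form is an assumption here.

```python
# Sketches of a scheduler, a partitioning, and the cleaning transformation;
# the feasibility predicate and the VP structure are illustrative assumptions.

def swap_priorities(vp, task_a, task_b):
    """Scheduler transformation: swap the priorities of two tasks."""
    i, j = vp.priorities.index(task_a), vp.priorities.index(task_b)
    vp.priorities[i], vp.priorities[j] = vp.priorities[j], vp.priorities[i]

def move_task(vp, task, target, communication_feasible):
    """Partitioning transformation: re-allocate a task to another processing
    element, provided the behavioural graph's communication and data accesses
    remain possible (checked by the caller-supplied predicate)."""
    if not communication_feasible(vp, task, target):
        raise ValueError("move would violate the behavioural structure")
    vp.partitioning[task] = target

def clean(vp):
    """Architecture transformation: remove components that nothing is
    allocated to, thereby reducing the cost."""
    used = set(vp.partitioning.values())
    vp.components = [c for c in vp.components if c.name in used]
```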
4 Analysis models
The different VPs that are generated need to be evaluated to determine if they meet the timing constraints and if their cost is acceptable. In this paper, we concentrate on the former aspect, since it is the distinctive feature of real-time applications. As mentioned above, the goal is not only to design a system that meets the deadlines, but one for which it is possible to prove that they are always met. This is done using a validation procedure, which is applied to the final system. During the early stages of the design, we do not have sufficient knowledge about the implementation to apply this validation procedure directly, but we should try to use as much as possible of it, in order to get a sufficiently good performance prediction. It is customary to divide the timing analysis of a system into two separate stages: the intrinsic analysis calculates the resource needs, and thus the execution time, of a task in isolation, when it has all the resources of the system at its own disposal; and the extrinsic analysis accounts for the effects of resource sharing between several parallel tasks. In the next two subsections, we will discuss how these analysis aspects can be handled, and what the differences are between validation and early estimation. In the last subsection, we propose a constructive way of dealing with uncertainties in the intrinsic analysis.
4.1 Intrinsic analysis
The intrinsic analysis is used to determine, for each task, how much of each resource it uses during each activation, and also the amount of memory, ASIC area, etc., that is required. The actual values of these metrics depend on very detailed knowledge about the final implementation, and there is therefore a certain discrepancy between the validation and the early analysis. The approach we have taken is to use partial design estimators, by which the analyzer makes guesses about the outcome of the remaining design process, and thus assumes one possible implementation of the high-level behavioural description. For instance, for a task implemented on a micro-processor, the partial design might be arrived at by compiling the high-level behaviour into an intermediate form, close to machine code, from which much more detailed estimations of the execution time can be made than is possible directly from the high-level description. For ASICs, a similar translation can be performed, e.g. to the register-transfer level, thereby reducing the abstraction gap to the actual chip.
4.2 Extrinsic analysis
The extrinsic analysis requires much less information about the fine details of the implementation; instead it is based on the structure of the task graph and on what hardware resources are used. These aspects can be assumed to be the same for the early design and for the final implementation, which means that the same model can be used for both estimation and validation. The difference is that the values indicating how much of each resource the tasks use come from the intrinsic estimation in the first case, and from the intrinsic validation in the second. The extrinsic analysis of our VPs is based on schedulability models for fixed-priority schedulers, and we have extended the traditional theory (see [1] for an overview) in several respects:
Most of the fixed-priority scheduling theory is only concerned with single-processor systems, but our architectures are heterogeneous, involving several processing elements, perhaps with shared memories. We have therefore extended this theory by introducing the concept of computational resources (roughly corresponding to time-shared components in the architecture), and derived analysis procedures that can handle several such resources.
The traditional fixed-priority analysis is only meant to be used for validation, where it is interesting to get a yes/no answer as to whether the deadlines are met. But during the iterative system-level design we would also like to know how close a design comes to meeting the deadlines, since this allows a comparison between VPs that do not yet meet them. We have therefore introduced the metric minimal required speedup (MRS), which, intuitively, indicates how much the execution would have to be accelerated to exactly meet all timing constraints. Of course, if the MRS ≤ 1, this means that the deadlines are respected.
Often, it is assumed that the tasks in the behavioural description are independent, but our VPs may contain communication and shared data. We have therefore developed ways of calculating the MRS for such behaviours, and this can be done as efficiently as calculating the response times of independent tasks. This approach can also handle end-to-end timing constraints.
These analysis techniques are equally useful for the early design estimation and the final validation. The MRS calculation for independent tasks has previously been presented in [2].
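As a rough illustration of the MRS idea, the sketch below computes a minimal required speedup for independent tasks sharing a single processor under fixed priorities, by scaling all execution times by 1/s and bisecting on s. The paper's own derivation covers several computational resources and task chains, so this single-resource, bisection-based version is only an assumption of how such a figure could be obtained.

```python
import math

def response_time(C, T, i, speedup, horizon=1e9):
    # Classic fixed-priority response-time iteration with execution times C/speedup;
    # tasks 0..i-1 are assumed to have higher priority than task i.
    Ci = C[i] / speedup
    R = Ci
    while True:
        R_next = Ci + sum(math.ceil(R / T[j]) * (C[j] / speedup) for j in range(i))
        if R_next == R or R_next > horizon:   # fixed point reached, or clearly unschedulable
            return R_next
        R = R_next

def feasible(C, T, D, speedup):
    return all(response_time(C, T, i, speedup) <= D[i] for i in range(len(C)))

def mrs(C, T, D, lo=1e-3, hi=1e3, eps=1e-4):
    # Binary search for the smallest speedup that makes every deadline hold;
    # a result <= 1 means the design already meets its deadlines.
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(C, T, D, mid):
            hi = mid
        else:
            lo = mid
    return hi

# Example: two tasks with execution times 1 and 2, periods 10, deadlines 5.
print(mrs([1, 2], [10, 10], [5, 5]))   # approximately 0.6: there is slack
```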
4.3 Resource budgets
A problem with the partial design approach for intrinsic analysis is that the actual low-level design process might lead to different implementation decisions than were assumed by the estimator. It would then seem that the estimator is very inaccurate, but the correct interpretation is that there exists one possible implementation which has approximately the estimated characteristics; it does not mean that this is the one which will eventually result. Still, the estimation data can be used constructively to increase the predictability of the design process. Consider the situation where a VP has been found for which the extrinsic validation model, when supplied with estimation data from the intrinsic analysis, says that the deadlines are reached. Then it is very likely that an implementation exists which has roughly the characteristics given by the intrinsic analysis, and this implementation would be satisfactory. Therefore, we can use the intrinsic estimation results as constraints on the lower-level design, and we call these constraints resource budgets. The original end-to-end deadlines stretch over chains of tasks, which may be allocated in different processing elements, and when doing the detailed design of these tasks, it may be difficult to see what the consequences are on the overall system behaviour. The benefit of the resource budgets is that they localize the timing constraints to individual components, stating concretely how much of each resource may be used by each task. Thus, the individual components may be developed further separately, and as long as the resource budgets are respected, the complete system will still behave according to the specification.
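The following small sketch (an assumed form, not the workbench's actual interface) illustrates the intent: the intrinsic estimates on which the successful system-level validation was based are frozen as per-task budgets, and the detailed design of each component is later checked against them.

```python
def derive_budgets(intrinsic_estimates):
    # Freeze the per-task estimates (e.g. worst-case execution times) that the
    # successful extrinsic validation was based on; they become local constraints.
    return dict(intrinsic_estimates)

def budgets_respected(measured_usage, budgets):
    # If every task's actual resource usage stays within its budget, the
    # system-level timing argument carries over to the final implementation.
    return all(measured_usage[t] <= budgets[t] for t in budgets)

# Example with hypothetical estimates in milliseconds.
budgets = derive_budgets({"t1": 2.0, "t2": 4.0, "t3": 2.0})
print(budgets_respected({"t1": 1.8, "t2": 4.5, "t3": 1.9}, budgets))  # False: t2 overruns its budget
```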
5 Synthesis tools
The analysis models are very useful as a means of providing feedback during the manual design of a system, but they can also be used by optimization algorithms that automatically search the space of possible VPs. The goal of this search is to find the cheapest implementation that fulfils the timing constraints, i.e. which has MRS ≤ 1. There are three parts of the VP that are not given by the designer: the architecture, the partitioning, and the scheduler. Since the partitioning depends on the architecture, the architecture has to be selected before the partitioning can be done, and the scheduler depends on both the architecture and the partitioning, so it has to be derived after these. On the other hand, the interesting characteristics—MRS and cost—can only be established for a completed VP, which suggests that the three synthesis activities are in fact interrelated. Figure 3 illustrates
Figure 3. The relation between the synthesis tools in the workbench. (Figure omitted: architecture selection, partitioning, and scheduling successively refine the virtual prototype, drawing on the component library, towards an implementation.)
the relation between the synthesis tools. In the following subsections, we describe the synthesis tools that are currently included in the workbench.
5.1 Scheduling
Synthesizing a scheduler means assigning priorities to the tasks, thereby defining the total order which is the basis for fixed-priority schedulers, and the goal is to find an order which yields an MRS ≤ 1. We can define an optimal priority order to be one which gives a minimal MRS, since it will then meet the timing constraints whenever any other order does so. By calculating the optimal order, there is no need to consider any other orders, even if there might be others which also meet the deadlines. Plenty of research has been done on this problem, and some classical results are the rate-monotonic and deadline-monotonic priority assignments, which order the tasks according to increasing periods and deadlines, respectively. These priority orders are implementation-independent, in the sense that they only depend on data given in the specification, and they are known to be optimal for independent behaviours on single-processor architectures. However, it is easily shown that no implementation-independent algorithm can exist for heterogeneous architectures; instead, the optimal priority order must be derived with respect to a particular architecture and partitioning. We have therefore developed a special optimal priority assignment algorithm, which requires O(m²E) execution time, where m is the number of tasks and E is the complexity of the MRS calculation. However, it does not consider task chains, and for general behavioural descriptions we are not aware of any optimal priority assignment algorithm with polynomial complexity.
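For independent tasks on a single resource, the classical lowest-priority-first scheme gives a flavour of how such an assignment can be found with O(m²) analysis calls; the sketch below shows that scheme, with the per-level feasibility test left as a caller-supplied hook. The paper's own algorithm, which minimizes the MRS on heterogeneous architectures, is not reproduced here.

```python
def assign_priorities(tasks, feasible_at_lowest):
    # Lowest-priority-first assignment: repeatedly place at the lowest free
    # level some task that still meets its constraint when all remaining
    # tasks run at higher priority. feasible_at_lowest(task, higher) is an
    # assumed analysis hook (e.g. a response-time or MRS check).
    remaining = list(tasks)
    order_low_to_high = []
    while remaining:
        for t in remaining:
            if feasible_at_lowest(t, [x for x in remaining if x != t]):
                order_low_to_high.append(t)
                remaining.remove(t)
                break
        else:
            return None   # no fixed-priority order satisfies the constraints
    return list(reversed(order_low_to_high))   # highest priority first
```

With m priority levels and at most m candidates tried per level, each requiring one analysis call, this matches the O(m²E) bound mentioned above.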
5.2 Partitioning
To determine a partitioning, the tasks and data objects must be assigned to components in the hardware architecture, and this assignment must be done so that communication along the task chains is possible and the data objects can be accessed by the tasks. Thus the partitioning has to be done with respect to a particular hardware architecture. We have developed a branch-and-bound algorithm which finds the best partitioning. It assigns tasks by following the chains, only considering a task when its activator has already been allocated. In this way, it is easy to determine which processing elements may possibly be used for that task. In a similar way, data objects are handled by allocating each of them as soon as one of its users has been allocated. The branch-and-bound algorithm works with partial solutions, in our case partial partitionings, that are iteratively extended to become more and more complete. It is, however, not necessary to consider a partial solution further if it can be shown that any complete solution derivable from it is worse than the best complete solution found so far. Both the MRS and the cost of an implementation can only increase when more tasks and data objects are added to the system, and therefore all complete solutions derivable from a partial one have higher values for these metrics than the partial one. So if the best solution found has MRS ≤ 1, and a partial solution is found which already has a higher cost, or an MRS > 1, then there is no need to investigate this particular partial solution further. The partitioning problem is NP-complete, but by using these rules for eliminating unfruitful parts of the search space, the execution time still becomes acceptable, even for large problems. To further increase the efficiency, heuristics can be included to decide which partial solution to consider next, thereby attempting to guide the search towards good solutions quickly. In that way, it becomes possible to cut more branches. The scheduling and partitioning procedures for independent tasks are described in more detail in [2].
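The sketch below gives a branch-and-bound skeleton of the form described here: partial partitionings are extended task by task in chain order, and a branch is pruned as soon as its cost or MRS can no longer improve on the best complete solution. The hooks cost_of, mrs_of, and candidate_targets are assumptions standing in for the analysis models.

```python
def partition(tasks_in_chain_order, vp, candidate_targets, cost_of, mrs_of):
    # Branch-and-bound over partial partitionings. Since cost and MRS can only
    # grow as more tasks are allocated, a partial solution that is already too
    # expensive or already misses the deadlines cannot lead to a better
    # complete (feasible) solution and is pruned.
    best = {"partitioning": None, "cost": float("inf")}

    def extend(partial, remaining):
        if cost_of(vp, partial) >= best["cost"] or mrs_of(vp, partial) > 1.0:
            return                                    # prune this branch
        if not remaining:
            best["partitioning"] = dict(partial)      # new best complete solution
            best["cost"] = cost_of(vp, partial)
            return
        task, rest = remaining[0], remaining[1:]
        for target in candidate_targets(vp, task, partial):
            partial[task] = target
            extend(partial, rest)
            del partial[task]

    extend({}, list(tasks_in_chain_order))
    return best["partitioning"]
```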
5.3 Architecture selection
The remaining design problem for which synthesis techniques could be used is the selection of components in the hardware architecture, and the decision of how they are to be interconnected. However, since the MRS and cost metrics cannot be calculated without also having a partitioning, architecture selection and partitioning cannot be separated. One way would then be to generate architectures and, for each architecture, calculate its optimal partitioning. But the computational complexity of this is
prohibitive, and it is therefore better to use heuristic techniques which decide concurrently on the architecture and the partitioning. These cannot guarantee an optimal solution, but they can often come close to it. We have done a comparative study of three heuristic search algorithms: simulated annealing, genetic algorithms, and tabu search. All these algorithms move iteratively through the search space, and use different ways of trying to avoid local optima. Simulated annealing moves towards better solutions by default, but occasionally accepts worse solutions. Genetic algorithms keep a set of solutions which is updated in each iteration, and in this way it is not disastrous if one of the solutions is a local optimum, since the others may continue towards the global one. Tabu search forbids backward moves by keeping recently visited solutions on a tabu list for a certain amount of time. So in a local optimum, the algorithm can be forced to move away from it to a worse solution, because all other alternatives are tabu. We implemented concurrent architecture selection and partitioning using the three algorithms, based on the transformations described above, and conducted a number of experiments on six different applications. We only studied independent tasks, because the task chains are mostly a concern of the partitioning, and our emphasis in these particular experiments was on the architecture selection. The result of over 350 trials was that tabu search is the most suitable algorithm, because it is the easiest to implement, and it found the optimum more quickly and more often than the others. Simulated annealing is a possible alternative; its performance was close to that of tabu search. The genetic algorithm, however, was very difficult to implement, and performed clearly worse than the others. Architecture synthesis is presented in detail in [4].
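As an illustration of the winning strategy, the following is a compact tabu-search loop over the transformation set, as it might be used for concurrent architecture selection and partitioning; neighbours() (apply one transformation to obtain candidate VPs), quality() (combine cost and MRS into one figure of merit), and the parameter values are all assumptions.

```python
from collections import deque

def tabu_search(initial_vp, neighbours, quality, iterations=1000, tabu_size=20):
    # Iteratively move to the best non-tabu neighbour, even if it is worse than
    # the current solution; recently visited solutions are kept on a bounded
    # tabu list, which is what forces the search out of local optima.
    current = best = initial_vp
    tabu = deque([initial_vp], maxlen=tabu_size)
    for _ in range(iterations):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if not candidates:
            break
        current = min(candidates, key=quality)
        tabu.append(current)
        if quality(current) < quality(best):
            best = current
    return best
```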
6 Conclusions and future work
In this paper, we have presented a hardware/software codesign approach to real-time systems design, with the aim of improving the likelihood of meeting the timing constraints. The methodology consists of a thorough design space exploration during the early design phases to determine suitable architectures, partitionings, and schedulers to implement a specified functionality. This is accomplished using a set of tools that are essentially optimization procedures based on transformations of the design. The quality of a proposed design is evaluated using a set of analysis models that predict its cost and performance. This makes it possible to take the timing constraints into account from the beginning of the design flow, thereby increasing the chances that they will
be met by the final implementation, and thus making the process more predictable. There are many possible variations within the general scheme described in this paper. Among the most urgent improvements is a generalization of the behavioural descriptions accepted for scheduling and architecture selection, so that they also handle task chains. For the architecture selection, this will be done using the tabu search algorithm. Another issue is the models for intrinsic estimation. The ones we have used are fairly simple, and a lot of improvement is possible, especially for ASIC estimations.
Acknowledgements
Many thanks to Erik Stoy for his suggestions for improvements to this paper. The research has been funded by the Swedish National Board for Industrial and Technical Development (NUTEK).
References
[1] N. C. Audsley, A. Burns, R. I. Davis, K. W. Tindell, and A. J. Wellings. Fixed priority pre-emptive scheduling: An historical perspective. Real-Time Systems, 8(2/3):173–198, Mar. 1995.
[2] J. Axelsson. Hardware/software partitioning aiming at fulfilment of real-time constraints. Journal of Systems Architecture, 42(6–7):449–464, 1996.
[3] J. Axelsson. Analysis and Synthesis of Heterogeneous Real-Time Systems. Ph.D. thesis, Linköping University. To appear, 1997.
[4] J. Axelsson. Architecture synthesis and partitioning of real-time systems: A comparison of three heuristic search strategies. In Proc. 5th International Workshop on Hardware/Software Codesign, pages 161–165, 1997.
[5] J. Axelsson. A hardware/software codesign methodology and workbench for predictable development of hard real-time systems. In Proc. 9th Euromicro Workshop on Real-Time Systems, 1997.
[6] A. Burns and A. J. Wellings. HRT-HOOD: A structured design method for hard real-time systems. Real-Time Systems, 6(1):73–114, Jan. 1994.
[7] M. D. Edwards and J. Forrest. Software acceleration in a hardware/software codesign environment. In Proc. 21st Euromicro Conference, pages 727–733, 1995.
[8] D. D. Gajski, F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice Hall, 1994.
[9] R. Gerber, S. Hong, and M. Saksena. Guaranteeing real-time requirements with resource-based calibration of periodic processes. IEEE Transactions on Software Engineering, 21(7):579–592, July 1995.
[10] R. K. Gupta and G. De Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers, 10(3):29–40, Sept. 1993.
[11] S. J. Mellor and P. T. Ward. Structured Development for Real-Time Systems. Yourdon Press, 1986.
[12] H. Simpson. The Mascot method. Software Engineering Journal, 1(3):103–120, May 1986.
[13] M. B. Srivastava and R. W. Brodersen. SIERA: A unified framework for rapid-prototyping of system-level hardware and software. IEEE Transactions on Computer-Aided Design, 14(6):676–690, June 1995.
[14] W. H. Wolf. Hardware-software co-design of embedded systems. Proceedings of the IEEE, 82(7):967–989, July 1994.