Operation Serializability for Embedded Systems

Rajesh K. Gupta
Department of Computer Science
University of Illinois, Urbana-Champaign
Urbana, Illinois 61801

Abstract

We consider the problem of generating embedded software from input system descriptions in a hardware description language (HDL). Generation of software for embedded computing requires a total ordering of operations, or linearization, under constraints to ensure timely interaction with other system components. We show by example conditions under which no ordering of operations in an HDL can reproduce the modeled functionality in software. Therefore, the existence condition for software generation, or serializability, must be ensured before attempting any linearization. We present the conditions, based on variable definition and use analysis, under which operation linearization is possible. We then present our approach to operation serialization under timing constraints to produce efficient schedules for the embedded software.

ED&TC '96. 0-89791-821/96 $5.00 © 1996 IEEE

1 Introduction

The goal of our project is to develop compilation and synthesis (co-synthesis) techniques for the design of hardware and software in embedded computing systems [1]. Unlike a general-purpose computing system, an embedded computing system (ECS) is designed to deliver a specific functionality under constraints on the relative timing of its actions. Software design for embedded computing systems is an important and growing problem due to its complexity and the need for reliability [2, 3, 4, 5, 6]. Driven by the need to ensure timely performance of the software in an embedded system, recent research in this area has focused on timing characterization and optimization of the software [7, 8, 9]. Most solutions to the problem of software generation for embedded computing systems build on previous work in software code generation and operation scheduling in high-level synthesis. However, there are important differences. First, operation linearization is affected by constraints between operation execution times that are not currently used in software compilers. Second, unlike operation scheduling in high-level synthesis, operation linearization must take into account the delay from additional data-movement operations that are not explicitly modeled. While several linearization techniques have been developed in the past, the correctness of this linearization has not been examined thus far. In this paper, we address this issue and suggest tests to ensure correctness of a linearization.

This paper is organized as follows. We briefly describe the input for embedded system co-synthesis and the software design problem in Section 2. We consider conditions for feasible software generation in Section 3. In Section 4 we present a summary of the techniques for operation linearization, and present experimental results in Section 5. We summarize contributions and discuss open issues in Section 6.

The Input: A specification language for EC systems must provide mechanisms for synchronization and concurrency between parts of the system that may operate at different speeds of execution, and a notion of structural parts of a system design. Most current hardware description languages satisfy these requirements, hence our choice of input to system co-synthesis. In particular, we use HardwareC [10], though other languages such as Verilog and VHDL may be used as well [11]. In HardwareC the basic entity for system specification is a process. A process executes concurrently with other processes in the specification, and restarts itself on completion of the last operation in the process body. Multiple processes are used to capture parts of a system that operate at different speeds and have varying storage and implementation requirements. All communication between operations in a process body is based on shared storage that is declared as a part of the process body. All inter-process communication is specified by message-passing operations that use a blocking protocol for synchronization purposes.

There are three constructs for specification of synchronization and concurrency in HardwareC. We explain the syntax and semantics of these using Example 1.1 below.

Example 1.1. Consider the HDL fragment (described in HardwareC) below:

  process vars (inp1, inp2, outp)
    in port inp1[size], inp2[size];
    out port outp[size];
  {
    int a, b;
    boolean c;

    1:       a = read(inp1);
    2:       b = read(inp1);
    3:       <
    3.1:         c = fun(a, b);
    3.2:         if (c) [
    3.2.1:           write outp = a;
    3.2.2:           write outp = b;
                 ]
             >
    4:       c = b + c;
  }

The above HDL program has the following semantics:

- Steps 1, 2, 3, 4 are data parallel. These steps may execute in parallel, at different possible speeds, as long as no data dependencies are violated.
- Step 3 is forced parallel. This means that the sub-steps 3.1 and 3.2 must be initiated at the same time.
- Step 3.2 is sequential. The semantics of this statement is that sub-steps 3.2.1 and 3.2.2 must execute sequentially.

From the semantics of forced-parallel executions, note that the value of variable `c' tested in 3.2 may be different from the one computed by 3.1. ∎

2 The Problem Model

Software generation for embedded computing requires a total ordering of operations in the input HDL model. While it is possible to derive a linearization of operations from the HDL program, arbitrarily chosen linearizations may result in a violation of the timing constraints. Specifically, a linearization may introduce a dependency from a synchronization operation to other operations. This dependency makes the time of execution of an operation dependent on completion of operations whose delay is uncertain or unknown. Because of these additional dependencies,

timing constraint satisfaction may be impossible for a given linearization [12]. One solution is to map the HDL program into a set of concurrent program threads [1, 13]. This strategy avoids complete serialization of all operations. In our approach to thread generation, we first compile the HDL input into a hierarchical graph model. The graph model consists of a set of acyclic polar graphs G(V, E). The vertex set V(G) represents language-level operations and special link vertices that are used to encapsulate hierarchy. A link vertex induces (single or multiple) calls to another flow graph that may represent a shared resource or the body of a loop operation. The edge set E(G) represents dependencies between operation vertices.

Timing Constraints: An operation-level timing constraint specifies bounds on the interval of execution of operation pairs or on the rate of execution of an operation. The timing constraints are abstracted in an edge-weighted constraint graph G_T which is based on the flow graph G. An edge from v_i to v_j with positive weight induces an ordering relation from v_i to v_j with a minimum delay constraint on the interval from initiation of v_i to initiation of v_j. An edge with negative weight induces no particular ordering relation on the vertices but represents a maximum delay constraint on the interval between the start times of the two operations. Since an edge in the (acyclic) flow graph G represents a minimum delay constraint in G_T, edges with negative weight are also called backward edges.

The flow graph model is input to a set of partitioning transformations that generate sets of flow graphs to be implemented in hardware and software. As mentioned earlier, we generate software as a set of concurrent program threads. The task of synthesizing software from flow graphs is divided into the following three steps:

1. We first create a program thread as a linearized set of operations that may or may not begin with a synchronization operation (also called a non-deterministic delay, or ND, operation). Other than the beginning synchronization operation, a thread does not contain any other such operations. Depending on the ND operations in a flow graph G, multiple program threads may be created from G. Two threads are either hierarchically related or concurrent. For hierarchically related program threads, since the dependencies between threads are known, these are built into the threads as additional enabling operations.

2. Next, overhead operations are added to implement concurrency between program threads either by subroutine calling, or as coroutines [6].

3. Finally, routines are compiled into machine code using a compiler for the processor.

We assume that the processor is a predesigned component with available compiler and assembly tools. Therefore, the important issue in software synthesis is generation of the source-level program. The rest of this paper is devoted to the specific issue of finding a feasible linearization of operations that can be used to generate the program threads.
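The feasibility notion behind the timing constraints abstracted in G_T can be made concrete with a small sketch (our own illustration, not part of the paper's tool flow): encoding each constraint-graph edge (i, j, w) as t_j - t_i >= w, a positive minimum-delay edge orders i before j, while a backward edge (j, i, -u) encodes the maximum-delay bound t_j - t_i <= u. The system is satisfiable exactly when G_T contains no positive-weight cycle, which a Bellman-Ford-style longest-path relaxation detects.

```python
# Hedged sketch: feasibility check for an edge-weighted timing-constraint
# graph G_T, assuming edge (i, j, w) encodes t_j - t_i >= w.  Feasible iff
# there is no positive-weight cycle, detected by longest-path relaxation.

def feasible_schedule(n, edges):
    """Return start times t[0..n-1] satisfying t[j] - t[i] >= w for every
    edge (i, j, w), or None if a positive cycle makes them unsatisfiable."""
    t = [0] * n                      # longest-path estimates
    for _ in range(n):               # n full passes suffice for n vertices
        changed = False
        for i, j, w in edges:
            if t[i] + w > t[j]:
                t[j] = t[i] + w
                changed = True
        if not changed:
            return t                 # converged: a valid assignment
    return None                      # still relaxing: positive cycle

# Ops 0 -> 1 -> 2 with minimum delays 2 and 3; the backward edge (2, 0, -4)
# demands t2 - t0 <= 4, but the forward path forces t2 - t0 >= 5.
infeasible = [(0, 1, 2), (1, 2, 3), (2, 0, -4)]
feasible   = [(0, 1, 2), (1, 2, 3), (2, 0, -6)]   # t2 - t0 <= 6 is fine
```

The example names, the all-zero initialization, and the edge encoding are assumptions made for illustration.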

3 Operation Serializability

For concurrent program threads in a single-processor ECS implementation, concurrency between program threads is achieved by using an interleaved execution model. For a functionally correct execution in software it is essential that the execution of operations in a program thread yields the same result (modulo timing) as would the execution of the operations in the input flow graph model. We formulate this condition as a serializability condition on the flow graphs. Serializability is a necessary (but not sufficient) condition for generating single-processor ECS software. Serializability is a concern, for instance, when two operations in a program thread belong to two graphs or are specified within the same graph using the `forced parallel' semantics of HardwareC.

We consider the interleaved executions of two program threads speed independent if the actions performed by a program thread have no influence on the rate at which it proceeds. The reaction rate of a program thread depends on the presence of other threads and the concurrency structures. A sufficient condition for speed independence is that the program threads implement pure functions [14], that is, that program threads be functional. However, the requirement for functional behavior is only an abstraction and need not imply an implementation of program threads as functions. In practice, it is sufficient if the storage common to program threads is accessed sequentially. This introduces the notion of critical sections in program threads, that is, operations that access or modify the shared storage. One way to achieve the functional property for a thread is by blocking on critical sections, for example by using semaphores. From a practical point of view, the functionality property is not very useful due to the excessive overhead incurred in implementing blocking and the resulting loss of timing certainty in software. Therefore, we

focus on methods to achieve speed independence of program threads by a close examination of the thread side-effects on storage and input/output.

As mentioned earlier, the operational semantics of a flow graph, G, uses a set of variables, M(G), that are associated with the operations. An operation in V(G) can take the following actions on a given variable u in M(G):

- defines u: when the operation corresponds to an assignment statement where the variable u appears on the left-hand side (lhs) of the assignment, for instance u = ...;
- uses u: when u appears in an expression or on the right-hand side (rhs) of an assignment;
- tests u: when u is part of a conditional or loop exit condition.

Example 3.1 illustrates the difference.

Example 3.1. Variable def, use, and test. Consider the HDL example in 1.1. The first two assignments (statements 1 and 2) define variables a and b, whereas assignment 3.1 uses both these variables and defines variable c. Operation 3.2 tests variable c, whereas operation 4 uses and defines this variable. ∎

We note that the distinction between variable use and test operations is new and specific to the problem of hardware-software co-synthesis. A variable that is only tested (and not used) is considered a control variable. This distinction between use and test is made to model different strategies for control and data transfer across a hardware-software partition [15].

3.1 Testing Serializability

We assume that each operation (vertex) in G is implemented as an atomic operation in hardware or software. We explore conditions for operation serializability by examining the interactions of variables used in concurrent operations. We consider a variable private to an operation if it is used or tested by that operation alone. A variable that is used or tested in multiple (concurrent) operations is considered a shared variable. From our discussion of speed independence of program threads, a sufficient condition for serializability is to check whether the program threads are functional, that is, whether there are no shared variables. In the presence of shared variables the program threads will contain critical sections. Any interleaving of operations in the critical sections must ensure that the definition and use ordering relations on shared storage are not interfered with by competing concurrent operations. For concurrent operations that both define and use a variable, the ordering between operations will affect the output and thus make an arbitrary interleaving of operations impossible. Therefore, a flow graph is serializable if the storage M(G) can be partitioned into shared and private variables such that only the private variables can be both used and defined by the same operation. However, note that variables that are both used and tested can be shared between concurrent operations. Thus the serializability of a graph is checked by examining the definition and use operations for its variables, to ensure that no shared variables are both defined and used by the same operation.

As mentioned earlier, for concurrent operations across process graph models, the communication is only by message-passing operations. In the absence of any shared storage, program threads created from separate flow graphs are always functional. For program threads created from the same flow graph, the condition of serializability requires examination of concurrent operations (i.e., operations without any transitive dependencies) that may belong to the same or different program threads.
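The def/use serializability test above can be sketched directly. In the sketch below, each operation carries 'def', 'use', and 'test' variable sets, and `concurrent` lists the pairs of operations with no transitive dependency between them; the encoding and all names are our own illustration, not the Vulcan implementation.

```python
# Hedged sketch of the def/use serializability test on a flow graph.

def shared_variables(ops, concurrent):
    """A variable is shared when two concurrent operations access it."""
    shared = set()
    for a, b in concurrent:
        acc_a = ops[a]['def'] | ops[a]['use'] | ops[a]['test']
        acc_b = ops[b]['def'] | ops[b]['use'] | ops[b]['test']
        shared |= acc_a & acc_b
    return shared

def serializable(ops, concurrent):
    """Serializable iff no shared variable is both defined and used by the
    same operation; shared variables may still be used and tested."""
    shared = shared_variables(ops, concurrent)
    return all(not (op['def'] & op['use'] & shared) for op in ops)

# c = c + 1 running concurrently with a reader of c both defines and uses
# the shared variable c, so the graph is not serializable; an operation
# defining x from c while another operation merely tests c is fine.
rmw = [{'def': {'c'}, 'use': {'c'}, 'test': set()},
       {'def': set(), 'use': {'c'}, 'test': set()}]
ok_ops = [{'def': {'x'}, 'use': {'c'}, 'test': set()},
          {'def': set(), 'use': set(), 'test': {'c'}}]
```

The second pair illustrates the remark that variables that are only used and tested may safely be shared between concurrent operations.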

3.2 Unserializable Flow Graphs

There are two situations in which a flow graph is not serializable. The first arises when two operations are required to execute in parallel regardless of the operation dependencies, i.e., under the `forced-parallel' semantics of HardwareC. Forced-parallel semantics is often used to specify the behavior of hardware blocks declaratively, and for correct interaction with the environment assuming a particular hardware implementation. For instance, assuming a master-slave flip-flop for a storage variable, concurrent read-write operations on the variable may refer to operations on two separate values of the variable, corresponding to the master and slave portions of the flip-flop. Flow graphs with forced-parallel operations are made serializable using the following procedure:

1. Decompose concurrent operations into multiple simple operations using additional variables. A simple operation either uses, defines, or tests a shared variable.
2. Add dependencies between the simple operations that observe the polarization of the acyclic flow graph from its source to its sink.

The new flow graph can now be linearized into one or more program threads.

Example 3.2 below shows the transformation.

Example 3.2. Consider the following fragment representing a swap operation:

  static int a, b;
  <
    a = b;
    b = a;
  >

This is translated into two sets of simple operations: the operation a = b becomes the pair t1 = b; a = t1, and the operation b = a becomes t2 = a; b = t2, with dependencies added from each read (t1 = b, t2 = a) to the writes (a = t1, b = t2). For the new graph model, any of the four valid serializations leads to a correct implementation of the original concurrent operations. ∎
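The decomposition in Example 3.2 can be checked mechanically. The sketch below (our own scaffolding) encodes the simple operations and the added dependencies, enumerates the serializations that respect them, and executes each one:

```python
# Hedged sketch: the forced-parallel swap < a = b; b = a; > split into
# simple reads and writes through temporaries t1 and t2.
import itertools

ops = {'r1': lambda s: s.__setitem__('t1', s['b']),   # t1 = b
       'r2': lambda s: s.__setitem__('t2', s['a']),   # t2 = a
       'w1': lambda s: s.__setitem__('a', s['t1']),   # a = t1
       'w2': lambda s: s.__setitem__('b', s['t2'])}   # b = t2
# x must precede y: data dependencies (r1 -> w1, r2 -> w2) and
# anti-dependencies (r2 -> w1, r1 -> w2) force both reads first.
deps = {('r1', 'w1'), ('r2', 'w2'), ('r2', 'w1'), ('r1', 'w2')}

def valid_orders():
    for perm in itertools.permutations(ops):
        if all(perm.index(x) < perm.index(y) for x, y in deps):
            yield perm

def run(order):
    state = {'a': 1, 'b': 2}
    for name in order:
        ops[name](state)
    return state['a'], state['b']
```

Exactly four serializations respect the added dependencies, and each one swaps the values, matching the count stated in the example.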

The second situation arises in the case of dedicated loop operations. A hardware implementation of a loop operation uses shared memory [10]. The operations in the body of the loop have access to the storage defined in the calling graph. Here the program thread corresponding to the loop body may have access to storage in the program thread corresponding to the parent graph. We address this case by making explicit all data transfers between the loop graph and the parent graph, and then performing the serializability tests as in the case of concurrent processes. The rest of the paper examines the issue of achieving a linearization under timing constraints.

4 Operation Linearization

A linearization specifies the order in which data is transferred between operations. We assume that a variable resides in memory or in a register, where the latter provides faster access to data. Unlike memory, register storage is shared between several variables to improve utilization of the faster storage during program execution. Due to limited register storage, additional inter-storage operations are needed to reuse a register for different variables. A variable may be moved from register storage to memory by a spilling operation. We consider the following two problems:

Problem P1 [Feasible Linearization]: Given a timing constraint graph G_T = (V, E) and an upper bound R on the maximum number of registers available for arbitrary variable storage, find a linearization of the operations in V such that, for a given choice of the spill set S ⊆ V, all timing constraints are satisfied.

Problem P2 [Optimum Linearization]: Given a timing constraint graph G_T = (V, E) and an upper bound R on the maximum number of registers available for arbitrary variable storage, find a linearization of the operations in V and a spill set S ⊆ V such that all timing constraints are satisfied and the size |S| of the spill set is minimum.

Problems P1 and P2 are related: P2 is a storage optimization problem over possible solutions to P1. In the following we present a summary of approaches to solve the two problems. We first present linearization as a solution to Problem P1, and then consider methods to solve Problem P2 by incorporating spill-set estimation in pruning the linearization results.

Exact linearization methods always return a valid linearization if one exists. The most common exact method for operation linearization is to use simple back-tracking (SBT) to find a feasible solution for P1, or a solution that uses minimum spilling for P2. An SBT procedure selects one operation at a time from a set of available operations, adds it to a partially linearized set, and checks the feasibility of the linearization under timing constraints. In case of an infeasible linearization, the algorithm back-tracks to a previous (partial) solution and considers alternative choices for operation selection. The time complexity of SBT linearization is exponential in the number of operations. Since the algorithm constructs one solution at a time, the space complexity is linear in the number of operations. An exact search is computationally expensive and not a particularly good strategy for solving P1 since, in general, a valid linearization may not exist. We consider two ways to improve the average time taken by the SBT algorithm:

(a) Ordered Back-Tracking (OBT) maintains a count of the times the algorithm backtracks to an operation (vertex). Using this count, the set of available vertices for backtracking is ordered such that heavily backtracked vertices are avoided in favor of other vertices. The intuition behind this bias in selection is that heavily backtracked nodes are common to a large number of infeasible linearizations. Thus the search for a

feasible solution (Problem P1) is improved by avoiding potentially futile searches.

(b) Back-Tracking with Clustering (BTC) uses the structure of the input graph to cluster operations such that a feasible linearization consists of linearizations of the individual clusters. We partition the operations in the input constraint graph G_T into clusters C_1, ..., C_k such that each cluster forms a strongly connected component (SCC). Recall that in the constraint graph, forward edges E_f represent minimum delay binary constraints, whereas backward edges E_b represent maximum delay binary constraints. We assume that the imposed minimum delay constraints are consistent with the sequencing relation induced by the original (polar acyclic) flow graph. Therefore, the forward subgraph of G_T is acyclic and connected, and cycles in G_T are created only by the presence of backward (maximum delay) constraint edges. SCCs in G_T are defined as equivalence classes under the relation that there exists a two-way path between two operation vertices. The SCCs induce a partition of V(G_T) such that there can be no backward edge between two operations in different SCCs. For a constraint graph with only minimum delay constraints, any linearization that is topologically consistent with the graph edges is timing feasible, since a violated minimum delay constraint can always be satisfied by introducing additional delays without affecting satisfiability of the other minimum delay constraints. It follows, therefore, that a linearization of G_T consists of a linearization of the SCCs, which are in turn linearized into operations.

To improve runtime, a polynomial-time non-deterministic linearization can be built by maintaining two lists: Q for operations that have been linearized, and R for operations that are candidates for linearization. The linearization can be based on breadth- or depth-first search depending on how the data structure for R and its associated operations are implemented.
For breadth-based linearization, R is a queue with enqueue and dequeue operations. For depth-based search, R is a stack with push and pop operations. In this linearization, a non-deterministic choice is made to select a vertex from the list of successors of the first element in R. Note that neither the depth-based nor the breadth-based algorithm alone is capable of producing all possible linearizations for a given flow graph.

The choice of depth-based or breadth-based linearization depends on the structure of the flow graph. For graphs with a high degree of fanout nodes, a breadth-based linearization reduces the maximum number of live registers. In general, depth-based linearization is good for thin and tall graphs, whereas breadth-based linearization is good for short and obese graphs. Further, this choice is also affected by the structure of the timing constraints. For this purpose, we distinguish two types of maximum delay timing constraints (or backward edges): (a) sequencing-type timing constraints are represented by edges between two operations that are (transitively) related; (b) synchronization-type constraints are edges between two operations that are not related. A constraint graph model that is dominated by sequencing (synchronization) edges is likely to lead to a feasible solution for problem P1 by using a depth- (breadth-) based search strategy. This observation is used in developing our heuristic linearization procedure described below.

Note that a non-deterministic (ND) linearization algorithm can be made deterministic by allowing backtracking in case of an infeasible linearization (for example, violation of a maximum delay timing constraint). Such an algorithm is described in [3]. However, in the presence of backtracking, the worst-case time behavior of the algorithm may be exponential. The chief strength of an ND linearization algorithm is that it allows for efficient polynomial-time solutions by incorporating appropriate heuristics in place of random choice operations. We consider such a method for solving time-constrained linearization under a spilling cost criterion next.

Heuristic Methods: Our linearization heuristic is based on a vertex elimination scheme that repeatedly selects a zero in-degree vertex and outputs it. A heuristic selection of this vertex is used to steer the algorithm towards finding a feasible solution for problem P1. This selection is based on the criterion that the induced serialization does not create a positive-weight cycle in the constraint graph. Among the available zero in-degree vertices, we select a subset of vertices based on a measure of urgency associated with each source operation, and select the one with the least value of the urgency measure. This measure is derived from the intuition that a necessary condition for the existence of a feasible linearization (i.e., scheduling with a single resource) is that the set of operations have a schedule under timing constraints assuming unlimited resources. A feasible schedule under no resource constraints corresponds to an assignment of operation start times according to the length of the longest path to each operation from the source vertex. Use of the urgency measure to select the operation for linearization has the effect of moving the invocation of tightly bound sets of operations to a later time, in the interest of satisfying timing constraints on operations that have already been linearized.
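The vertex elimination scheme can be sketched as follows. We take the urgency of an operation to be its longest-path distance from the source in the unconstrained schedule, one plausible reading of the measure in the text, and repeatedly emit the available (zero in-degree) vertex with the least urgency; all names here are our own.

```python
# Hedged sketch of the urgency-driven heuristic linearization.

def longest_from_source(n, edges):
    dist = [0] * n
    for _ in range(n):                 # enough rounds for an acyclic graph
        for i, j, w in edges:
            dist[j] = max(dist[j], dist[i] + w)
    return dist

def heuristic_linearize(n, edges):
    urgency = longest_from_source(n, edges)
    indeg = [0] * n
    for _, j, _ in edges:
        indeg[j] += 1
    avail = {v for v in range(n) if indeg[v] == 0}
    order = []
    while avail:
        v = min(avail, key=lambda u: (urgency[u], u))   # least urgent first
        avail.discard(v)
        order.append(v)
        for i, j, _ in edges:
            if i == v:
                indeg[j] -= 1
                if indeg[j] == 0:
                    avail.add(j)
    return order

# Operation 2 lies on a shorter path than operation 1, so it is emitted
# first, postponing the tightly bound (high-urgency) operation 1.
dag = [(0, 1, 4), (0, 2, 1)]
```

The output is always a topological order; the urgency measure only biases which available vertex is emitted next.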

Constraints        Exact                    Heuristic   Optimum
(min, max)     SBT      OBT      BTC
 8, 4           39       38       13            6          78
14, 2          379      379       43            9         585
12, 2          125      125       41            9         266
19, 1          158      158       53           11         738
19, 3          643      175      109           11        2069
16, 3           25       25       30           12          31
20, 8          544      542      217            9         758
24, 2           23       23       27           13          42
34, 3        18168    16134      774           18      144735

Table 1: Efficiency of linearization algorithms.

5 Experimental Results

The serializability tests and operation linearization algorithms have been implemented in the C programming language as a part of the Vulcan co-synthesis system. The input to a linearization problem is characterized by the number of forward and backward timing constraints, and by the topology of the constraints. The last factor is quantified by identifying the two major types of backward constraints, sequencing- and synchronization-related, and an appropriate choice of depth- or breadth-based search strategy is used for the respective cases, as discussed earlier.

The quality of a linearization is determined by the size of the spill set S, that is, the number of operations that must be spilled to meet a given limit on the maximum number of registers available. In the event of a spill, the input constraint graph is modified to include additional delay for the definition and each use of the variable. Operations are selected for spilling based on the heuristic that the latest definitions in the partially linearized set of operations are spilled first. This choice reduces the likelihood of spilling variables that have been live over relatively long times, thus reducing fragmentation of a long live range into smaller ranges.

The algorithms were tried on a number of constraint graphs. Table 1 compares the efficiency of the various linearization algorithms by comparing the number of operation nodes that are visited in a particular search strategy. `SBT', `OBT' and `BTC' refer to simple back-tracking, ordered back-tracking and back-tracking with clustering respectively. `Heuristic' refers to the linearization heuristic presented here. `Optimum' explores all possible linear orders to generate a solution to the optimization problem P2. Note that exact methods guarantee a feasible linearization if one exists, whereas the `Optimum' method selects the best linearization from the set of all feasible linearizations. In contrast, a heuristic linearization may fail to find a feasible solution when one exists, though in practice this is rarely encountered.

Recall that ordered back-tracking prioritizes vertices for back-tracking so that frequently backtracked vertices are avoided in favor of other vertices. Use of ordered back-tracking reduces computation time on average by 12% over simple back-tracking. However, in some cases OBT may lead to increased search time because the feasible solution may lie on a search path through a node that is heavily backtracked, i.e., a linearization solution that lies in the neighborhood of several infeasible solutions. This condition is more likely to occur in examples that are tightly constrained. As shown in Table 1, the heuristic linearization provides substantial improvement in runtime over both the exact and optimum linearization methods. Comparing spill set sizes, the heuristic algorithm results in a spill set that is 8-13% larger than the optimum (|S| = 86 for heuristic linearization compared to 79 and 76 for SBT and OBT respectively). However, this reduction in spill set is obtained at a substantial increase in runtime over both exact and heuristic linearization methods, as indicated by the last column in Table 1.
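The spill-selection rule described above can be sketched in a few lines. In the sketch below (the walk over a fixed linearization and all names are illustrative assumptions), whenever the live set exceeds the register bound R, the most recently defined variable is spilled, preserving long-lived values in registers:

```python
# Hedged sketch of "latest definition spilled first" spill selection.

def linearize_spills(defs, last_use, R):
    """defs[i]: variable defined by operation i (or None); last_use:
    variable -> index of its final use.  Returns the spilled variables."""
    live, spilled = set(), set()
    for i, v in enumerate(defs):
        live = {u for u in live if last_use[u] >= i}    # expire dead values
        if v is None:
            continue
        live.add(v)
        if len(live) > R:
            # Latest definition spilled first.
            victim = max(live, key=lambda u: defs.index(u))
            live.discard(victim)
            spilled.add(victim)
    return spilled

# With two registers, defining c while a and b are still live forces a
# spill, and the latest definition (c itself) is chosen.
```

Spilling the newest definition leaves the older, longer-lived values untouched, which is the fragmentation-avoidance rationale given above.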

6 Summary and Future Work

There are two main problems in generating software from flow graphs. First, since program generation necessarily requires serialization of the operations in the flow graph, one must ensure that it is indeed possible to preserve the HDL-modeled behavior through such serialization. We have discussed sufficient conditions, based on variable definition, use, and test analysis on the flow graphs, to ensure that such a serialization is possible. For serializable graphs we have compared exact and heuristic linearization schemes. The key to a good linearization heuristic is to exploit the structure in the flow graph to reduce the amount of work needed to obtain a solution. Our heuristic is substantially faster than exact ordering search algorithms and leads to quality of results comparable to exact methods. The quality of the results can be improved further by an improved choice of the spill-set heuristic. Optimization of the spill set over a given linearization is a problem that needs further investigation to improve the quality of heuristic linearization.

7 Acknowledgments

The author would like to thank Sharad Mehrotra and Anupam Sharma for helpful discussions and contributions. This research was sponsored by a grant from the AT&T Foundation and NSF grant MIP 95-01615.

References

[1] R. K. Gupta and G. D. Micheli, "A Co-Synthesis Approach to Embedded System Design Automation," Design Automation for Embedded Systems, vol. 1, no. 1-2, Jan. 1996.
[2] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, A. Sangiovanni-Vincentelli, E. Sentovich, and K. Suzuki, "Synthesis of software programs for embedded control applications," in Proc. DAC, June 1995.
[3] P. Chou and G. Borriello, "Software Scheduling in the Co-Synthesis of Reactive Real-Time Systems," in Proc. DAC, June 1994.
[4] W. Wolf, "Hardware-Software Co-design of Embedded Systems," Proceedings of the IEEE, vol. 82, no. 7, pp. 965-989, July 1994.
[5] S. Kumar, J. H. Aylor, B. W. Johnson, and W. A. Wulf, "Object-oriented techniques in hardware design," IEEE Computer, vol. 27, no. 6, pp. 64-70, June 1994.
[6] R. K. Gupta, C. Coelho, and G. D. Micheli, "Program Implementation Schemes for Hardware-Software Systems," IEEE Computer, Jan. 1994.
[7] C. Y. Park, Predicting Deterministic Execution Times of Real-Time Programs. PhD thesis, University of Washington, Seattle, Aug. 1992.
[8] Y.-T. S. Li and S. Malik, "Performance analysis of embedded software using implicit path enumeration," in Proc. DAC, June 1995.
[9] W. Ye, R. Ernst, T. Benner, and J. Henkel, "Fast timing analysis for hardware-software co-synthesis," in Proc. ICCD, 1993.
[10] D. Ku and G. D. Micheli, High-Level Synthesis of ASICs under Timing and Synchronization Constraints. Kluwer Academic Publishers, 1992.
[11] D. D. Gajski et al., Specification and Design of Embedded Systems. Prentice-Hall, 1994.
[12] R. K. Gupta and G. D. Micheli, "Constraint Analysis and Propagation Techniques for Embedded Systems," technical report (submitted), University of Illinois, 1994.
[13] P. Marwedel and G. Goossens, Code Generation for Embedded Processors. Kluwer Academic, 1995.
[14] P. B. Hansen, Operating System Principles. Prentice-Hall, 1973.
[15] S. Agrawal and R. K. Gupta, "System Partitioning using Global Data-Flow," Memorandum UIUC DCS 1995, University of Illinois, Oct. 1995.