A survey on design quality estimation for embedded systems

Gianluca Cornetta and Jordi Cortadella
Department of Computer Architecture, Universitat Politècnica de Catalunya, 08071 Barcelona, Spain
E-mail: cornetta, [email protected]

March 6, 1996

Abstract: This paper is a brief survey of hardware/software codesign, its present features and its possible future developments. It is beyond the scope of this article to give a full explanation of all the steps that lead to a complete design. We will, instead, focus our attention on the set of evaluation criteria used to estimate the efficiency of a design in terms of time to market, cost, power consumption, area occupation and other design constraints.
1 The need for an integrated development environment

The great advances in microelectronic technology over the last decade have made it possible to increase enormously the number of devices integrated on a single chip. The drawback of the growth in the number of on-chip transistors has been the growth of the design complexity itself. As a result, neither a single designer nor a design team would be able to complete the design cycle in a reasonable time. This led the researchers involved in this area to change the design style and techniques drastically, introducing software tools, well known as CAD tools, intended to help the designer carry out his task as quickly and as efficiently as possible. A CAD tool is roughly divided into three parts: a schematic editor, that is, a tool by which the designer specifies, at different levels of abstraction (i.e. transistor level, gate level, functional block level), the circuit functionality; a timing analyzer and a test pattern generator for debugging the circuit and evaluating its performance; and a silicon compiler, that is, a tool that captures the schematic view, translates it into a layout and maps it onto silicon according to the target technology. A graphic schematic is not the only way to specify the circuit functionality; ad-hoc languages, known as HDLs, have been developed as well. These languages allow a circuit description by means of a highly-specialized instruction set. Several CAD tools have been developed, but what the designers and the scientific community had still not thought about was a set of rules to make the design process faster and more efficient. The need for a general and efficient design methodology is even stronger in the area of special-purpose applications, in which the designer must skim among several possible architectures and decide which tasks will be carried out by hardware and which ones by software routines, without knowing a priori which solution better matches the design constraints. Furthermore, the demands of the market must be satisfied: the lifetime of a product is shorter and shorter; as a consequence, the design cycle must be very fast and must lead to a competitive product in terms of cost and performance. Hardware-software codesign aims to extend the current CAD tool capabilities and design techniques, embedding them in a unique framework capable of generating an optimum design according to a given set of design constraints and integrating into the same tool the design of both the hardware and the software components of a complex system [1].
1.1 The describe-and-synthesize approach
The old-fashioned design styles were based on the capture-and-simulate methodology: once the circuit functionality has been specified, a schematic is produced with an ad-hoc tool. This schematic is afterwards captured with another tool and simulated in order to verify the correctness of the design, its timing and its faults. The captured schematic can also drive tools for placement, routing and layout synthesis, whether in full-custom or semi-custom technologies. The CAD tools of the new generation, instead, are based on a describe-and-synthesize methodology. This new design approach allows the designer to describe the circuit functionality in a merely behavioral form, without specifying any implementation detail. This technique can be applied at different levels of abstraction. At the gate level (i.e. for FSM and datapath synthesis) a circuit can be synthesized in two steps: the former consists in a logic minimization (state minimization in the case of an FSM) of the boolean expression that describes the circuit; the latter consists in a technology mapping using libraries of pre-defined cells. At the register level the whole system is described with high-level synthesis techniques, that is, modeling the various components with flowcharts, Petri nets, data-flow graphs or other models that better suit the system characteristics. To transform this behavioral description into a structural one, we must pass through three phases: 1. resource allocation, 2. scheduling, 3. binding. In the resource allocation phase the number of functional units and the operations they implement are determined, as well as the number of registers, buses and pins. In the scheduling phase, the behavior of the system is partitioned into time intervals, each lasting one clock cycle. To each partition is assigned a certain number of resources (i.e. functional units). The assignment of an operation can be performed as soon as the data dependencies among the various resources make it possible (ASAP) or in the last possible partition, without exceeding the depth of the data-flow graph critical path (ALAP) [2, 3, 4]. It is clear that a new variable cannot be stored in a register if the lifetime of the previously stored variable has not yet expired. Binding consists in the assignment of each variable of the system to a storage unit and of each function performed by the system to a particular functional unit. In conclusion, the great advantage of the describe-and-synthesize technique, which is leading it to overtake all the previous design styles, is that with this technique it is possible to synthesize any circuit simply starting from a behavioral description that is void of any technological information.
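To make the scheduling step concrete, the following Python sketch assigns the operations of a small data-flow graph to control steps using the ASAP policy mentioned above: each operation is placed in the earliest step in which all of its predecessors have completed. The dependence graph, operation names and the unit-delay assumption are hypothetical and serve only as an illustration.

    # Minimal ASAP scheduler for an acyclic data-flow graph (unit-delay operations).
    # 'deps' maps each operation to the operations whose results it consumes.
    def asap_schedule(deps):
        step = {}
        remaining = set(deps)
        while remaining:
            for op in sorted(remaining):
                preds = deps[op]
                if all(p in step for p in preds):
                    # Earliest step: one past the latest predecessor (0 if none).
                    step[op] = max((step[p] + 1 for p in preds), default=0)
                    remaining.discard(op)
                    break
        return step

    # Hypothetical behavior: o1 = (a+b)*(c+d); o2 = o1 - e
    deps = {"add1": [], "add2": [], "mul1": ["add1", "add2"], "sub1": ["mul1"]}
    print(asap_schedule(deps))   # {'add1': 0, 'add2': 0, 'mul1': 1, 'sub1': 2}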
1.2 The specify-explore-refine paradigm
This methodology consists in specifying all the system functionality in the early phase of the project instead of defining it gradually during the whole design cycle. This approach, in turn, leads to large savings in design cost. The specifications must then be partitioned into software and hardware parts. Each software routine will be executed on one or more of the allocated processors, and the hardware sub-system will be synthesized as one or more ASICs. Since each different partition and allocation generates a different architecture, a criterion is necessary to skim among the various solutions and to select the best one, i.e. the one that better fits the design constraints. Once the best solution has been found, the specifications need to be refined in order to reflect the decisions taken during the partitioning and allocation phases. The great advantage of such a design approach is a significant speedup of the design cycle and of the time to market: the designer is only required to select the technology, allocate components and specify requirements, after which the framework automatically explores all the possible design alternatives in order to find the best one and refines the specification by adding more structural details as soon as architectural and technological decisions are made.
1.3 Mapping software into hardware
One of the directions in hardware-software codesign is the mapping of software specifications into hardware [5, 6, 7, 8]; as a result, a large variety of CAD tools for special-purpose applications has been developed. Starting from a behavioral description of the circuit to be implemented, given in languages that are well suited for this task such as CSP [10] or OCCAM [11], the first step that leads us toward the physical implementation of our high-level description is resource allocation [12, 13, 14], i.e. the act of mapping the circuit functionality into functional blocks (ALUs, registers, etc.) built from library cells. Once the optimum number of resources has been allocated, the next step consists in resource scheduling [15, 16]. In this phase the behavior is divided into time intervals and the allocated resources are partitioned among the various intervals according to the design constraints. In [17] the authors, starting from a high-level specification in Scheme [18, 19], realize a 32-bit microprocessor. First the initial specification is decomposed, using factorization techniques, into a control component and a structural one; then the structural component is further refined in order to isolate the memory from the CPU specification. The communication process between CPU and memory is described by an ad-hoc table called the register transfer table (RTT). In this table are recorded all the register transfers that occur in parallel and their corresponding control code. On the RTT a series of successive refinements is carried out in order to transform the memory from a functional abstraction into a process abstraction, generating, according to the operation stored in memory, the set of signals necessary to communicate with the factored components. The behavioral description of the circuit is first factored into basic blocks, each of them carrying out a particular task (storage, calculus, etc.). As soon as a component is allocated, signals are generated to allow communication among the factored components. An extension of the RTT is the behavior table (BT) [20, 21]. This is an alternative way of representing the finite state machine that models the behavior of the system. Each row in the BT represents a transition in the machine described by the table. The columns are divided into two sections: the decision section and the action section. The decision section represents the state and the conditions that must hold for the transition to be executed. The action section represents the data flow through the functional units, the ports and the next state. Unlike the RTT, the BT is used to describe the whole design from both an architectural and a behavioral point of view, i.e. emphasis is given to the control and datapath architecture as well as to the communication protocol and data expansion. Some other recent CAD tools for integrated hardware-software synthesis and performance evaluation are described in [22, 23]. This is not intended to be an exhaustive list; the interested reader may refer to [25] for a detailed description of several other tools for embedded systems design.
2 Design quality estimation
In order to verify whether a particular partition matches the design constraints, it is necessary to develop a set of criteria to estimate which among all the possible implementations better suits the constraints. Naturally we want to obtain this estimation as soon as possible and we want it to be as faithful as possible to reality, i.e. the error margin between our estimation and the real system performance must be very small. Estimation techniques let the designer quickly explore, in each design step, the system performance, by providing quick feedback on all the design alternatives instead of evaluating each time a complete design implementation for the desired metric. In order to evaluate system performance, a design model must be created. The accuracy [25], fidelity [26] and speed of an evaluation depend upon the model we choose. The more complex the model is, the higher the accuracy and fidelity will be and the slower the evaluation will be. The accuracy of an estimate is a measure of how close the estimate of a particular metric is to its effective value, measured after the design implementation. If E(D) represents the estimate of a particular metric for a design D and M(D) the measured value for the same design, then the accuracy A is given by:

A = 1 - |E(D) - M(D)| / M(D)    (1)

Hence, a perfect estimate leads to A = 1. Let now D = {D1, D2, ..., Dn} be a set of implementations of a certain specification and let us define μ_ij as follows:

μ_ij = 1  if (E(Di) > E(Dj) and M(Di) > M(Dj)) or
          (E(Di) < E(Dj) and M(Di) < M(Dj)) or
          (E(Di) = E(Dj) and M(Di) = M(Dj));
μ_ij = 0  otherwise.    (2)

The fidelity F of an estimation method is defined as the percentage of correctly predicted comparisons between different design implementations:

F = 100 × (2 / (n(n-1))) × Σ_{i=1..n} Σ_{j=i+1..n} μ_ij    (3)

n being the number of possible implementations of the given specification. The higher the fidelity of the estimation, the more likely it is that correct design decisions will be taken on the basis of the comparison of the estimates of two different design implementations. Clearly, if estimated and measured values always match, the fidelity F is 100%.
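As an illustration, the following Python sketch evaluates equations (1)-(3) for a hypothetical set of implementations; the estimated and measured values are made up and only show how accuracy and fidelity would be computed.

    def accuracy(E, M):
        # Equation (1): A = 1 - |E - M| / M
        return 1.0 - abs(E - M) / M

    def fidelity(est, meas):
        # Equations (2)-(3): percentage of pairwise comparisons whose ordering
        # is predicted correctly by the estimates.
        n = len(est)
        correct = 0
        for i in range(n):
            for j in range(i + 1, n):
                same_order = ((est[i] > est[j]) == (meas[i] > meas[j]) and
                              (est[i] < est[j]) == (meas[i] < meas[j]))
                correct += 1 if same_order else 0
        return 100.0 * correct / (n * (n - 1) / 2)

    est  = [100, 120, 150]   # hypothetical estimated execution times
    meas = [110, 115, 160]   # hypothetical measured execution times
    print(accuracy(est[0], meas[0]))   # ~0.909
    print(fidelity(est, meas))         # 100.0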
2.1 Estimation metrics
There are several quality metrics to take into account when a design is implemented; among them, the two that play the main role are the cost and the performance of the hardware and software implementations. Although other estimation metrics are relevant as well, many high-level decisions are based entirely on these two metrics.
2.1.1 Hardware cost metrics
Hardware cost metrics include engineering and design costs as well as testing, manufacturing and packaging costs. The manufacturing cost depends on the size of the design, i.e. on the silicon area required by an implementation. Area estimates are useful in order to decide whether a given design can fit into a given chip area predicted to produce the maximum yield [27] during fabrication. Sometimes the size of an implementation is given not in terms of area but by related measures such as the number of transistors, gates or register-level components. Such size metrics offer very high fidelity because the size of these components is well known to the designer, and the area can be approximated as the sum of their sizes scaled by an appropriate constant to take wiring and padding into account. The packaging cost is a function of the number of pins of the implementation. A good design strategy consists in minimizing the number of pins of the design, because their number affects not only the packaging cost but also the board area resulting from interconnecting all the chips on the board.
2.1.2 Software cost metrics
Part of the behavior of a design is very often implemented as a software routine even if a hardware implementation offers better performance. This is due to several reasons: first of all, the cost of a software implementation is very low because it only requires compilation, whereas a hardware implementation requires several steps, including testing and fabrication; the design time of a software routine is usually shorter than that of its hardware implementation; in addition, a software implementation better accommodates specification changes during the design cycle. Furthermore, a large variety of low-price programmable components such as microprocessors and microcontrollers are available on the market. The two cost metrics associated with a software implementation are the program memory size, i.e. the memory required to store the compiled specification instructions, and the data memory size, i.e. the amount of memory required for the storage of the data values created and manipulated during the computation. The amount of memory required by a software routine to execute affects the cost and the performance of a system. If a program does not fit in the built-in processor memory, we are compelled to add an external memory chip, raising the implementation cost. In addition, we may want to limit the data memory size in order to store the data in the processor cache and registers, thereby increasing the performance of the system.
2.1.3 Performance metrics
Performance metrics can be divided into computation and communication metrics. The former give a measure of the time required to perform a computation within a behavior; the latter are a measure of the time spent by a behavior in interacting with other behaviors of the implementation. Among the computation metrics we can mention: clock cycle, control steps and execution time. Choosing the best clock cycle is very important in order to get an efficient design [25], since the clock cycle affects not only the execution time but also the number of resources to allocate for implementing a given behavior. The clock cycle also determines the technology used to implement the design; in fact, certain technology libraries specify the maximum operating frequency of their components. The number of control steps, i.e. the number of clock periods used to implement the behavior, affects the complexity of the control logic. For example, if we determine that a particular behavior can be divided into N control steps, the control unit will need a state register of log2 N bits. The execution time of a behavior is the average time that the behavior takes to carry out its task, and it is directly proportional to the number of control steps required to execute the behavior. Estimating the execution time is very important, first because a performance constraint may have been fixed for a certain behavior and, second, because the execution time affects the technology and the component libraries used for the design implementation. A design can also be implemented in a pipelined fashion in order to maximize the system throughput. Pipelining consists in splitting the design into smaller functional units, each carrying out a particular task. The output of each unit is latched and fed into the next pipeline stage. In this way the clock cycle is equal to the maximum stage delay and the execution time is given by the following formula:

exec_time = num_stages × stage_delay    (4)

Communication between concurrent behaviors is modeled as fixed-size messages sent through an abstract channel. The bandwidth of the channel is measured by its bit rate, i.e. the data transfer rate. It is possible to define two different types of data transfer rate. The average rate is defined as the rate at which data is sent during the entire lifetime of the two communicating behaviors:

av_rate = (n × s) / T    (5)

where n is the number of messages, each of size s bits, and T is the lifetime of the channel between the two behaviors. The peak rate is defined as the rate at which data is sent in a single message transfer through the channel:

peak_rate = s / t    (6)

where t is the time for which each message occupies the channel. Evaluating the data transfer rate is very important because it affects the bus width and the execution times of the two communicating behaviors. Communication rates also play an important role during the partitioning of behaviors into chips and of channels into buses. In order to avoid off-chip access delays, it is better to assign to the same chip those behaviors that communicate with each other over channels with high bit rates.
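The following sketch evaluates equations (4)-(6) for a hypothetical pipelined design and communication channel; the stage delays, message counts and sizes are invented placeholder values.

    # Equation (4): pipelined execution time = number of stages x slowest stage delay
    stage_delays_ns = [80, 150, 80, 120]          # hypothetical stage delays
    exec_time_ns = len(stage_delays_ns) * max(stage_delays_ns)

    # Equation (5): average rate over the channel lifetime
    n_messages, msg_bits, lifetime_ns = 1000, 32, 4.0e6
    av_rate_bits_per_ns = n_messages * msg_bits / lifetime_ns

    # Equation (6): peak rate of a single message transfer
    transfer_time_ns = 100.0
    peak_rate_bits_per_ns = msg_bits / transfer_time_ns

    print(exec_time_ns, av_rate_bits_per_ns, peak_rate_bits_per_ns)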
2.1.4 Other quality metrics
In the previous sections we have dealt with the most common quality metrics for evaluating system performance; nevertheless, they are not the only applicable evaluation criteria. Other evaluation metrics can be power dissipation, design for testability, design time, time to market and manufacturing cost as well. The estimation of power dissipation is particularly useful when designing battery-operated systems. An increased power consumption will result in adopting larger batteries, hence increasing the weight and the overall cost of the system. Power dissipation is affected by the clock cycle as well, i.e. the higher the clock frequency, the higher the power dissipation. Design for testability produces a design with minimal test cost. A drawback of this design approach can be the increase in production costs due to the increased complexity of the built-in test circuitry, the extra pins for fault controllability and observability, and the augmented power dissipation and packaging costs. Design time is defined as the time required to obtain a correct implementation from the functional specification. Design time depends on several factors such as designer experience, CAD tools and the abstraction level of the design. Design time can be reduced by using component libraries instead of custom components, because custom components require a significant design and testing overhead. Time to market is defined as the time elapsed between the design conceptualization and the delivery of the product to the customer. This includes the time for performing a market study, design, fabrication, testing, distribution and the development of supporting hardware and software. The manufacturing cost is a measure of the cost associated with a design and includes the cost of manpower, raw materials, fabrication, packaging, testing and maintaining facilities. In [22, 24] some new performance evaluation criteria are described, among which the most remarkable is the capacity of an embedded processor. This parameter is used to evaluate the load of a given processor in the system; it is, hence, an estimation of the average processor cycles available for each process running on that processor. Process scheduling is performed with Petri nets or queuing nets. Furthermore, the designer must be able to evaluate the total system capacity limits; to achieve this, for each hardware component of the system a maximum event rate can be defined, i.e. the maximum rate at which its input signals can change. The actual event rate is computed by simulation and compared with the fixed maximum event rate in order to detect constraint violations.
3 Estimating hardware performance
In the previous section we introduced a set of criteria for estimating design quality. In this section we will go into the estimation of hardware performance more thoroughly.
3.1 Clock cycle estimation
We have seen that the clock cycle can affect both the execution time and the number of resources required to implement the desired behavior; thus it is important to estimate the clock cycle before starting to synthesize the behavior.
In most CAD tools the clock cycle must be specified by the designer prior to developing the design. This strategy suits best the case in which the designed behavior is part of a larger system. In this case the clock cycles of the various blocks in the system are known and can be used to determine the clock cycle of the whole system. If the clock cycle is not specified by the designer, an evaluation criterion is necessary to estimate it. One estimation criterion, known as the maximum-operator-delay method [28, 29, 30], consists in equating the clock cycle to the delay of the slowest operation in the design. The advantage of this strategy is that the estimation is extremely fast and easily implemented; the disadvantage is that it leads to underutilization of the faster functional units and consequently to a slower design implementation. To improve performance it is necessary to minimize the time that the faster functional units are idle, i.e. the clock slack. The slack S(clk, ti) for a clock period clk and an operation type ti is computed by the following formula:

S(clk, ti) = ⌈d(ti) / clk⌉ × clk - d(ti)    (7)

where d(ti) is the delay of operation type ti. To improve the functional unit utilization, and hence enhance the overall system performance, it is necessary to minimize the slack. Algorithms for slack minimization are described in [25, 31, 32]. First, the range of clock cycles that will be examined by the algorithm is computed according to the delays of the functional units that implement the T distinct operation types of the behavior. Afterwards, the number of occurrences of each operation type ti in T is computed. For each clock cycle in the computed range, the slack associated with each functional unit that implements operation ti is computed according to (7). Then the average slack S_av(clk) is computed according to the following expression:

S_av(clk) = ( Σ_{i=1..T} R(ti) × S(clk, ti) ) / ( Σ_{i=1..T} R(ti) )    (8)

R(ti) being the number of occurrences of operation type ti. Finally, the utilization U(clk) is computed as:

U(clk) = 1 - S_av(clk) / clk    (9)
If the utilization computed at the current iteration is greater than the previously computed value, the utilization and its related clock cycle are updated with the newly computed values. Experimental results illustrated in [32] have shown that designs implemented with a clock estimated by the slack-minimization method show a 32% increase in performance with respect to the same designs implemented with the maximum-operator-delay method. As shown in Figure 1, the choice of the clock cycle can affect both the execution time and the number of resources required to implement a given behavior. Figure 1(a) shows the fastest possible implementation. The clock cycle is 380 ns, but four adders and two multipliers are needed. In order to reduce the overall number of resources required by the implementation, it is possible to partition the execution time into more clock cycles and to allocate for each cycle the required resources. Figure 1(b) shows a multi-cycle implementation with the maximum-operator-delay method. The clock cycle is chosen to be equal to the delay of the slowest resource in the behavior, i.e. the multiplier. This design approach results in an execution time of 600 ns but requires only two resources: a multiplier and an adder. Finally, in Figure 1(c) a multi-cycle implementation with minimum slack is illustrated. The clock cycle is 80 ns and the average slack time is only 3.3 ns, versus the 46.6 ns of the implementation of Figure 1(b). This results in a better resource utilization and hence in a faster execution time (400 ns). The resources required by this implementation are only an adder and a multiplier. In conclusion, the design approach that offers the best compromise between execution time and number of allocated resources is that of Figure 1(c).
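A minimal Python sketch of the slack-minimization procedure of equations (7)-(9), assuming the delays d(ti) and the occurrence counts R(ti) of the operation types are known; the candidate clock range and the delays used below are hypothetical.

    import math

    def best_clock(op_delay, op_count, candidate_clocks):
        """Return the clock period with the highest average utilization.

        op_delay[t] : delay of operation type t (e.g. in ns)
        op_count[t] : number of occurrences R(t) of type t in the behavior
        """
        best_clk, best_util = None, -1.0
        for clk in candidate_clocks:
            # Equation (7): slack of type t for clock period clk
            slack = {t: math.ceil(d / clk) * clk - d for t, d in op_delay.items()}
            # Equation (8): occurrence-weighted average slack
            total = sum(op_count.values())
            avg_slack = sum(op_count[t] * slack[t] for t in op_delay) / total
            # Equation (9): utilization
            util = 1.0 - avg_slack / clk
            if util > best_util:
                best_clk, best_util = clk, util
        return best_clk, best_util

    op_delay = {"add": 80, "mul": 150}     # hypothetical unit delays (ns)
    op_count = {"add": 5, "mul": 2}
    print(best_clock(op_delay, op_count, [80, 100, 120, 150]))   # (80, ~0.964)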
Figure 1: Clock cycle choice: (a) single-cycle implementation, (b) multi-cycle implementation, (c) multi-cycle implementation with minimum slack.
3.2 Control step estimation
The number of control steps required to execute a behavior can be estimated in several ways. A first method, known as the operator-use method [25], consists in partitioning a behavior into a set N of nodes while maintaining the dependencies among the various statements of the behavior. For each node nj, the number of control steps required to carry out each different operation type ti in T is then computed according to (10):

cstep(nj, ti) = ⌈R(ti) / n(ti)⌉ × d(ti)    (10)

n(ti) and d(ti) being, respectively, the number and the delay, expressed in clock cycles, of the functional units implementing operation ti, and R(ti) the number of occurrences of operation ti in node nj. The number of control steps assigned to the node is the one associated with the slowest operation of the node, i.e. the maximum number of control steps necessary to execute any operation in the node, as stated in (11):

cstep(nj) = max_{ti in T} cstep(nj, ti)    (11)

Finally, the numbers of control steps determined for each node are summed to obtain the total number of control steps for the whole behavior, according to (12):

cstep(B) = Σ_{nj in N} cstep(nj)    (12)
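A small sketch of the operator-use method of equations (10)-(12); the node decomposition, the allocated unit counts and the unit delays are hypothetical.

    import math

    def csteps_node(uses, units, delay):
        # Equations (10)-(11): the control steps needed by one node are set by
        # its slowest operation type, given n(t) units of delay d(t) clock cycles.
        return max(math.ceil(uses[t] / units[t]) * delay[t] for t in uses)

    def csteps_behavior(nodes, units, delay):
        # Equation (12): sum over all nodes of the behavior.
        return sum(csteps_node(uses, units, delay) for uses in nodes)

    units = {"add": 1, "mul": 1}            # allocated functional units n(t)
    delay = {"add": 1, "mul": 2}            # delays d(t) in clock cycles
    nodes = [{"add": 3, "mul": 1},          # occurrences R(t) per node
             {"add": 1, "mul": 2}]
    print(csteps_behavior(nodes, units, delay))   # 3 + 4 = 7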
The operator-use method provides a fairly rapid estimate of the number of control steps needed to execute a particular behavior. This method has a computational complexity of O(n), where n is the number of operations in the behavior. However, this method can produce estimation errors, since it operates at the statement level and ignores the dependencies among the operations within the same statement [25]; it is therefore better to use this method when each statement in the specification is restricted to one operation. The operations in a behavior may also be scheduled [4] under resource constraints in order to reduce the total number of control steps. List scheduling [25] keeps a priority list for every operation type ti from which operations are assigned to control steps. Each operation assigned to the current control step is deleted from the priority list and the schedule is updated to reflect this choice. The iterations go on until all the operations in the given behavior have been scheduled. This algorithm has O(n^2) complexity, where n is the number of operations in the behavior; thus it is computationally more expensive than the operator-use method, but it provides more accurate results. The use of the mobility-based list scheduler [33] (a list scheduler whose priority function is the mobility of an operation, i.e. the number of potential control steps to which the given operation can be assigned) with several high-level synthesis benchmarks has produced estimates that are on average 13% more accurate than those of the operator-use method [25]. Both the operator-use method and list scheduling do not take branching and iteration into account. Such a behavior must be divided into basic blocks [34], i.e. sets of consecutive HDL statements that neither halt the flow nor generate branches, except at the end. The number of control steps required by each basic block can be determined with the above methods. The total number of control steps for the whole behavior depends on whether some blocks share the same control steps or not (see Figure 2). In the case of shared control steps, the total number of control steps of the behavior is equal to the number of control steps along the longest path through the control-flow graph; in the case of separate control steps, the total number of control steps required by the behavior is equal to the sum of the control steps associated with each basic block. The first solution is more expensive from a hardware point of view because it requires a status register to skim between the concurrent basic blocks.
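The sketch below shows a minimal resource-constrained list scheduler; as a simplification, the ASAP level of an operation is used as the priority function in place of the mobility function of [33]. The dependence graph, operation types and resource limits are hypothetical, and every operation is assumed to take one control step.

    def list_schedule(deps, op_type, units):
        """Resource-constrained list scheduling with a simple priority.
        deps[o]    : predecessors of operation o (all unit delay)
        op_type[o] : operation type of o
        units[t]   : number of functional units available for type t
        """
        # ASAP levels (no resource limits) used as the priority function here:
        asap = {}
        while len(asap) < len(deps):
            for o in deps:
                if o not in asap and all(p in asap for p in deps[o]):
                    asap[o] = max((asap[p] + 1 for p in deps[o]), default=0)
        schedule, step = {}, 0
        while len(schedule) < len(deps):
            busy = {t: 0 for t in units}
            ready = [o for o in deps if o not in schedule
                     and all(p in schedule and schedule[p] < step for p in deps[o])]
            for o in sorted(ready, key=lambda o: asap[o]):   # lower level first
                if busy[op_type[o]] < units[op_type[o]]:
                    schedule[o], busy[op_type[o]] = step, busy[op_type[o]] + 1
            step += 1
        return schedule

    deps    = {"a1": [], "a2": [], "m1": ["a1"], "m2": ["a2"], "a3": ["m1", "m2"]}
    op_type = {"a1": "add", "a2": "add", "m1": "mul", "m2": "mul", "a3": "add"}
    print(list_schedule(deps, op_type, {"add": 1, "mul": 1}))
    # {'a1': 0, 'a2': 1, 'm1': 1, 'm2': 2, 'a3': 3}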
Figure 2: Estimating the number of control steps: (a) behavior with branching, (b) shared control-step scheduling, (c) separate control-step scheduling.
3.3 Execution time estimation
The execution time E(B) for a given behavior B is given by:

E(B) = cstep(B) × clk    (13)

where cstep(B) is the number of control steps required by B and clk is the clock cycle. This formula is not directly applicable to behaviors with loops and branching. In this case, we must first compute the execution time e(bi) of each basic block bi and the related execution frequency f(bi), i.e. the average number of times that bi will be executed during a single execution of the behavior to which bi belongs. According to these considerations, the execution time of the behavior B becomes:

E(B) = Σ_{bi in B} e(bi) × f(bi)    (14)

In order to evaluate the execution frequencies within a given behavior, we must map it onto a control-flow graph. Each vertex is associated with a particular basic block. An edge connects two vertices if and only if there exists a relation between the associated basic blocks. To each branch or loop edge a branching probability is associated, computed as follows: a probability of (n-1)/n is assigned to a loop edge and 1/n to the exit edge, n being the number of iterations performed. For if and case statements, equal probabilities are assigned to each branch. The execution frequencies of the individual nodes in the control-flow graph (Figure 3) are computed as follows:

1. First create a start node S, whose execution frequency is fixed to 1, since it is executed exactly once at the beginning.

2. For any other node nj in the control-flow graph, the execution frequency is given by:

   f(nj) = Σ_{ni predecessor of nj} f(ni) × p(eij)    (15)

   where p(eij) is the branching probability of the edge between node ni and node nj.

3. Solving the linear system defined by (15) gives the execution frequency of each node in the flow graph.
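A sketch of steps 2 and 3: the branching probabilities define a linear system in the node frequencies, which can be solved, for instance, by fixed-point iteration when the graph contains loops. The flow graph below is hypothetical (a single loop executed ten times on average).

    def execution_frequencies(prob, iterations=200):
        """prob[(i, j)] = branching probability of edge i -> j.
        Returns the execution frequency of every node (equation (15)),
        with the start node 'S' fixed to 1.
        """
        nodes = {n for e in prob for n in e}
        freq = {n: 0.0 for n in nodes}
        freq["S"] = 1.0
        for _ in range(iterations):                 # fixed-point iteration
            for n in nodes:
                if n == "S":
                    continue
                freq[n] = sum(freq[i] * p for (i, j), p in prob.items() if j == n)
        return freq

    # Hypothetical flow graph: S -> B1 -> B2 -> B1 (loop, 10 iterations) -> B3
    prob = {("S", "B1"): 1.0,
            ("B1", "B2"): 1.0,
            ("B2", "B1"): 0.9,   # loop-back edge: (n-1)/n with n = 10
            ("B2", "B3"): 0.1}   # exit edge: 1/n
    print(execution_frequencies(prob))   # B1 and B2 converge to ~10, B3 to ~1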
3.4 Communication rate estimation
A communication channel may be explicitly defined in a behavioral description or implicitly created when two communicating behaviors are assigned to different chips during system partitioning. The total execution time for a given behavior B consists of two components:

1. the computation time C(B), defined as the time required by the behavior B to perform its internal computations;

2. the communication time C'(B, c), defined as the time spent by the behavior in accessing external data through a communication channel c.

The computation time can be computed with the data-flow analysis described in the previous section. The communication time, instead, is given by the following formula:

C'(B, c) = A(B, c) × D(c)    (16)

where A(B, c) is the average number of accesses of behavior B to channel c and D(c) is the delay associated with the transfer of a single message along channel c. The average data transfer rate over channel c is defined as:

av_rate(c) = T(B, c) / ( C(B) + C'(B, c) )    (17)
Figure 3: Control-flow graph for a given behavior: (a) VHDL behavioral description, (b) basic blocks, (c) equivalent control-flow graph with branching probabilities.
where T(B, c) is the total number of bits sent over channel c during its lifetime. The peak rate of channel c is:

peak_rate(c) = M(c) / D(c)    (18)

where M(c) is the length (in bits) of a message sent over the channel.
3.5 Area estimation
In order to estimate the area occupied by a behavior, we must take two factors into account: the number of components and the technology that implements the behavior. Let us suppose that the behavior is implemented as an FSMD (finite state machine with datapath). The datapath consists of three kinds of components: storage units (registers and latches), functional units (such as ALUs and comparators) and interconnection units (buses and multiplexers). In order to minimize the overall area occupation we must find the optimum number of datapath components. Storage units are used to hold variable values. The simplest strategy consists in allocating a register for each variable. This approach can lead to an excessive number of storage units, since a variable is not needed for the whole execution of a behavior. A better strategy consists in evaluating the variable lifetimes according to the data-flow graph of the scheduled behavior. Variables used concurrently are assigned to different storage units. To determine whether two variables are concurrent or not, we must draw a graph with the lifetime intervals (Figure 4). Variables whose lifetimes overlap are concurrent and cannot be assigned to the same register. The total number of registers to be allocated is equal to the maximum number of concurrent variables. Another technique is based on clique partitioning [35]. First a graph is built, with each variable represented by a node. An edge eij between nodes vi and vj exists only if variables vi and vj can be assigned to the same storage unit, i.e. if their lifetimes do not overlap. The heuristic consists of several iterations. In each iteration the pair of nodes with the highest number of common neighbors is merged into a single node. The merge process creates a set of cliques. Each clique corresponds to a storage unit that will store the variables belonging to the clique (Figures 5(a), 5(b)). Another heuristic is based on the left-edge algorithm [36]. The variables are first sorted according to the start points of their lifetime intervals. The first step consists in allocating a register for the first variable in the list, which is then deleted from the list. The next step consists in searching the updated list for a variable whose lifetime interval does not overlap with that of the variable of the preceding step. If such a variable exists, it is first allocated to the same register as the previous variable and then deleted from the list. The algorithm (Figures 5(c), 5(d)) ends when the list is empty. The number of functional units required by a behavior can be explicitly defined by the designer or allocated by a heuristic algorithm during the scheduling phase. If a performance constraint is specified for the behavior, the minimal number of functional units required to implement the behavior can be determined using the force-directed algorithm [37]. The algorithm tries to distribute the operations of the same type uniformly over all the control steps. In each iteration, the algorithm assigns exactly one unscheduled operation to a control step so as to minimize the number of functional units required by the design. At the end of the iterative process, the maximum number of operations of a specific type assigned to any control step is the number of functional units of that type that are needed. A clique-partitioning algorithm [38] can be used as well in order to determine the number of functional units required, if the behavior has already been scheduled into control steps.
This approach implies the construction of a graph whose nodes represent the operations in the behavior. An edge between two nodes exists if and only if the corresponding operations have been assigned to different control steps and there are functional units that can perform both operations. Clique partitioning, unlike the force-directed algorithm, does not need an operation to be assigned to a specific functional unit; the latter therefore requires a functional-unit binding, i.e. a mapping of all the operations onto functional units, in order to allow us to determine the number of interconnections, such as buses and multiplexers, needed between the storage and the functional units. A bus interconnection has the advantage of easier routability, since the lines run along the set of components that communicate over it; a multiplexer, instead, can result in line congestion, since it requires all the lines to be routed to a single component, i.e. the multiplexer itself. The interconnection units can be estimated directly from the description of the behavior and from the mapping of variables and operations onto storage and functional units, respectively. A simple strategy consists in mapping onto the same bus or multiplexer all the lines directed to the same module. Figure 6(a) shows the connections between a set of registers and the two inputs of a functional unit. Since four lines go to each input of the functional unit, we can map them onto two 4-to-1 multiplexers (Figure 6(b)). In order to reduce the multiplexer cost, i.e. the total number of multiplexer inputs, we can factor the inputs common to both multiplexers, assigning them to the same multiplexer within a two-level multiplexer chain (Figure 6(c)); this reduces the total number of inputs from eight to seven, thus reducing the multiplexer cost. A third approach (Figure 6(d)) consists in applying a clique-partitioning technique. A graph model is built in which each vertex represents a connection between two units. An edge between two vertices exists if and only if the corresponding connections are not used concurrently for data transfer in the same control step. Each clique of the graph represents an interconnection unit: all the connections whose representative vertices are in the clique are assigned to the same bus or multiplexer. As shown in Figure 6(d), multiplexers may still be required in case two or more buses must be connected to the same input of a functional unit.

Figure 4: Variable lifetimes: (a) scheduled behavior, (b) variable lifetime intervals for the scheduled behavior.

Figure 5: Register allocation: (a) clique-partitioning graph model, (b) clique-partitioning solution, (c) overlapped lifetimes after the left-edge algorithm, (d) variable-to-storage-unit mapping.

Figure 6: Interconnect unit allocation: (a) connections between storage and functional units, (b) multiplexer implementation, (c) reducing mux inputs by factoring, (d) two-level interconnections after clique partitioning.

Usually, most designs are carried out in
a semi-custom fashion, i.e. using libraries of pre-defined cells. In the datapath layout generated by a standard-cell silicon compiler, all the cells are arranged in parallel rows separated by a routing channel. The cells have different widths, but all have the same height Hc. The total length L of each row is:

L = α × T(DP)    (19)

where α is the transistor pitch coefficient in µm/transistor. For a given library:

α = (1/N) × Σ_{i=1..N} ( wi / ni )    (20)

where N is the number of cells in the library, and wi and ni are, respectively, the width and the number of transistors of the i-th cell of the library. The number of transistors of the datapath, T(DP), is:

T(DP) = Σ_{i=1..nR} T(REGi) + Σ_{j=1..nF} T(FUj) + Σ_{k=1..nM} T(MUXk)    (21)

where nR, nF and nM are, respectively, the total number of registers, functional units and multiplexers, and T(REGi), T(FUj) and T(MUXk) are, respectively, the number of transistors of the i-th register, the j-th functional unit and the k-th multiplexer. Let now β be the wiring pitch, i.e. the minimal separation between two metal lines, γ the number of nets between the components in the datapath, and δ the average number of nets that can be implemented on the same track. The height Hrc of the routing channel is then:

Hrc = β × γ / δ    (22)

δ can be evaluated empirically; a better estimate can be obtained using simple routing algorithms such as the left-edge algorithm [36], which has O(n log n) complexity, where n is the number of nets. The area of each bit-slice is then:

Area(BS) = L × (Hc + Hrc)    (23)

Finally, the datapath area is:

Area(DP) = W(DP) × Area(BS)    (24)

where W(DP) is the number of bit-slices that compose the datapath. The control unit is composed of a state register, control logic and next-state logic. The width of the state register is a function of the number of control steps: if a design has N control steps, then the state register will have log2(N) bits. The control and next-state logic can be implemented with random logic, a ROM or a PLA [4]. In the case of a random logic implementation, the number and the size of each gate, buffer and register must be determined. This permits computing the total number T(CU) of transistors in the control unit. Let κ be the transistor area coefficient in µm²/transistor (κ may be determined experimentally for a given library). The total area of the control unit is then:

Area(CU) = κ × T(CU)    (25)

This formula gives only a rough estimate of the area, because it does not take into account the logic optimizations and technology mappings done by the silicon compiler. How to generate an estimate that considers these two factors as well is still an open problem [25]. When a ROM implementation is used, the area of the control unit can be computed as the sum of the areas of the state register and of the ROM.
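The sketch below strings together equations (19)-(25); all the coefficients (transistor pitch, wiring pitch, cell height, transistor counts) are hypothetical placeholder values and do not come from any real cell library.

    import math

    def datapath_area(reg_tr, fu_tr, mux_tr, alpha, cell_height,
                      wiring_pitch, nets, nets_per_track, bit_slices):
        t_dp = sum(reg_tr) + sum(fu_tr) + sum(mux_tr)        # equation (21)
        row_length = alpha * t_dp                            # equation (19)
        h_rc = wiring_pitch * nets / nets_per_track          # equation (22)
        area_bs = row_length * (cell_height + h_rc)          # equation (23)
        return bit_slices * area_bs                          # equation (24)

    def control_area(n_csteps, transistors_per_bit, other_logic_tr, kappa):
        state_reg_bits = math.ceil(math.log2(n_csteps))      # state register width
        t_cu = state_reg_bits * transistors_per_bit + other_logic_tr
        return kappa * t_cu                                   # equation (25)

    # Hypothetical numbers (um, transistors):
    print(datapath_area(reg_tr=[200, 200], fu_tr=[1500], mux_tr=[120],
                        alpha=3.0, cell_height=50.0,
                        wiring_pitch=2.0, nets=40, nets_per_track=4,
                        bit_slices=16))
    print(control_area(n_csteps=12, transistors_per_bit=20,
                       other_logic_tr=800, kappa=8.0))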
3.6 Pin number estimation
A given behavior can often communicate with different off-chip behaviors through communication channels. This implies that a certain number of pins of the chip must be reserved for communication purposes. An estimate of the number of pins of a given design is very important, because the pin number affects the design size, and hence the area occupation and the packaging. The number of pins of a design is also affected by the port declarations in the behavior specification, by global data accesses that require extra ports, and by procedure calls that require a handshaking protocol and hence a certain number of pins to implement it. For a given behavior B, let P be the set of port declarations, C the set of communication channels, V the set of global variables accessed and S the set of procedures called by the behavior. If L(x) represents the number of wires required to access object x, the total number of pins is:

Pins(B) = Σ_{pi in P} L(pi) + Σ_{ci in C} L(ci) + Σ_{vi in V} L(vi) + Σ_{si in S} L(si)    (26)
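A sketch of equation (26), summing the wires needed by ports, channels, global variables and procedure-call handshakes; the objects and wire widths below are hypothetical.

    def pin_count(ports, channels, global_vars, procedures):
        # Equation (26): each argument maps an object to the number of wires
        # L(x) required to access it.
        return (sum(ports.values()) + sum(channels.values()) +
                sum(global_vars.values()) + sum(procedures.values()))

    # Hypothetical behavior: one 16-bit data port, an 8-bit channel with 2 control
    # lines, one 16-bit global variable, one procedure needing a 2-wire handshake.
    print(pin_count(ports={"data_in": 16},
                    channels={"ch0": 10},
                    global_vars={"g_status": 16},
                    procedures={"read_sensor": 2}))   # 44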
4 Software estimation

For a software implementation the specified behavior must be compiled into the instruction set of a target processor. The variables of the behavior are mapped into the processor memory; consequently, all accesses to variables in the behavior are implemented as memory read/write operations. Concurrent behaviors may be scheduled onto one or more processors. In the case of a single-processor implementation, the behavior executions must be interleaved in order to satisfy data dependencies and timing constraints. Communication among the different behaviors is carried out through shared memory locations. If a multiple-processor implementation is chosen, communication can be achieved either through a shared memory location or through a physical communication channel. Two models are adopted for software quality estimation: 1. the processor-specific estimation model; 2. the generic estimation model.
4.1 Processor-specific estimation model
To evaluate software performance with a processor-specific estimation model, a behavior must first be compiled into the instruction set of the desired processor using a dedicated compiler; then, using timing and sizing information relative to that processor, a performance estimate can be drawn. Naturally, there exist several estimators, i.e. programs that evaluate software performance, each one targeted to a specific processor. Clearly, such a system is economically and computationally expensive, since we must have a compiler and an estimator for each different processor, but it produces very accurate performance metrics.
4.2 Generic estimation model
In this model, proposed in [39], instead of using specific compilers and estimators for each different target processor, a behavior is first compiled into a generic instruction set. The estimator then evaluates the software performance for the target processor according to the information made available by processor-specific technology files. These files contain information about the number of clock cycles and bytes that each type of generic instruction requires. This estimation model has several advantages with respect to a processor-specific one. First of all, it allows us to use only one compiler and one estimator for all processors; the processor-specific information is kept in the technology files. This also allows us to easily retarget the system to a new processor by simply adding a new technology file, whereas a processor-specific model would have required a new compiler and a new estimator. Furthermore, the generic instruction set is very simple and faster to compile to than a specific instruction set. Nevertheless, the disadvantage of such an estimation model is its lower estimation accuracy, mainly due to the fact that the generic instruction set represents only a small subset of the processor's entire instruction set.
4.3 Program execution time
The software execution time can be determined in two different ways: dynamic simulation and static estimation. Dynamic simulation executes the program several times and records the overall number of clock cycles required by each execution. The number of clock cycles necessary to execute a program may change due to data-dependent branching and looping. Static estimation, on the other hand, is insensitive to data dependencies and can lead to fairly good estimates if the number of loop iterations is known and the conditional branch probabilities can be predicted correctly. Static estimation is faster and requires less space than dynamic simulation.
4.4 Program memory size
In order to estimate the program size, the given behavior must first be compiled into a generic instruction set G; then, taking into account the information stored in the technology file for the target processor, the size S(B) of the compiled behavior B is:

S(B) = Σ_{g in G} s(g)    (27)

where s(g) represents the size of a generic instruction g in G. Equation (27) can be scaled by a correction factor in order to obtain a more accurate estimate that accounts for the size of the compiled and optimized behavior.
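A sketch of equation (27), assuming a compiled list of generic instructions and a technology file that records the size in bytes of each generic instruction on the target processor; the instruction names, sizes and correction factor are hypothetical.

    def program_size(generic_instrs, tech_file, correction=1.0):
        # Equation (27): sum the byte size of every generic instruction,
        # optionally scaled by a correction factor for compiler optimizations.
        return correction * sum(tech_file[g] for g in generic_instrs)

    tech_file = {"LOAD": 3, "STORE": 3, "ADD": 2, "BRANCH": 2}  # bytes/instruction
    compiled  = ["LOAD", "LOAD", "ADD", "STORE", "BRANCH"]
    print(program_size(compiled, tech_file, correction=1.1))    # ~14.3 bytes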
4.5 Data memory size
The data memory size is computed by examining the data declarations in the functional specification of the behavior. The data memory size m(d) of a declaration d is determined according to its type size and the number of elements in d. Let now D be the set of declarations in behavior B; then the data memory M(B) occupied by behavior B is:

M(B) = Σ_{d in D} m(d)    (28)

Equation (28) is valid under the assumption that all the variables have a lifetime equal to the execution time of the behavior, i.e. that two different variables cannot share a memory location. In case a behavior has a large number of variables with short lifetimes, the data memory size estimation must include a lifetime analysis as well.
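A matching sketch of equation (28), summing the size of every data declaration; the declarations and element sizes are hypothetical.

    def data_size(declarations):
        # Equation (28): sum of the memory required by every declaration, under
        # the assumption that no two variables can share a memory location.
        return sum(bytes_per_elem * elems
                   for bytes_per_elem, elems in declarations.values())

    decls = {"a": (2, 1), "buf": (2, 64)}   # (bytes per element, element count)
    print(data_size(decls))                  # 130 bytes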
5 Conclusive remarks and future directions

In this article we have presented a short survey of the state of the art in hardware-software codesign. In particular, we have described some of the quality metrics used to evaluate the performance of both software and hardware implementations of a given behavior: clock cycle, control steps, communication rates, hardware and software execution times, etc. We have also discussed some design approaches [4, 9, 16, 17, 20, 21, 22, 23, 25] and some new performance metrics [22, 24]. The research efforts in the area of quality estimation are directed mainly in three directions:

1. Optimization: estimation techniques must be improved to allow the evaluation of the optimizations performed by synthesis and compilation tools. If we do not take this factor into account, the result will be an overestimation of some quality metrics, such as design area and software execution time.

2. New metrics: system designs are more and more sophisticated, hence new evaluation criteria are necessary for power dissipation, testability, hardware/software integration and manufacturability.

3. New architectural features: software estimation must be enhanced to include quality metrics for more complex architectural features such as pipelining, caching and instruction prefetching. Hardware estimation must include complex features such as multi-phase clocking.
References

[1] P. Subrahmanyam, "Hardware-Software Codesign: Cautious optimism for the future (hot topics)", IEEE Computer, 26(1), pp. 84-85, 1993.
[2] W. Wolf, Modern VLSI design. A system approach, Prentice Hall, 1994.
[3] P. Michel, U. Lauther, P. Duzy, The synthesis approach to digital system design, Kluwer Academic Publishers, 1992.
[4] D. Gajski, N. Dutt, C. Wu, Y. Lin, High-level synthesis: Introduction to chip and system design, Kluwer Academic Publishers, 1991.
[5] F. Hanna, M. Longley, N. Daeche, "Formal synthesis of digital systems", in Formal VLSI specification and synthesis, L. Claesen ed., number 1 in VLSI design methods, pp. 153-169, Elsevier Science, 1990.
[6] K. van Berkel, J. Kesels, M. Roncken, R. Saejis, F. Schalij, "The VLSI-programming language Tangram and its translation into handshake circuits", Proc. of EDAC, 1991.
[7] D. May, C. Keane, "Compiling Occam into silicon", Communicating process architecture, Prentice Hall and Inmos, 1988.
[8] W. Luk, T. Wu, "Toward a declarative framework for hardware-software codesign", Proc. Third International Workshop on Hardware-Software Codesign, pp. 181-188, IEEE Computer Society Press, 1994.
[9] I. Page, "Constructing hardware-software systems from a single description", Colloquium: Structured methods for hardware systems design, IEE, 1994.
[10] C.A.R. Hoare, Communicating sequential processes, International series in computer science, Prentice Hall, 1985.
[11] Inmos, The occam2 programming manual, Prentice Hall, 1988.
[12] E.D. Lagnese, D.E. Thomas, "Architectural partitioning for system level synthesis of integrated circuits", IEEE Trans. on Computer-Aided Design, July 1991.
[13] C.H. Gebotys, "Optimal scheduling and allocation of embedded VLSI chips", Proc. of the Design Automation Conference, pp. 116-119, 1992.
[14] C.H. Gebotys, "Optimal synthesis of multichip architectures", Proc. of the Design Automation Conference, pp. 238-241, 1992.
[15] K. Kucukcakar, A.C. Parker, "CHOP: A constraint-driven system-level partitioner", Proc. of 28th DAC, 1991.
[16] F. Vahid, D. Gajski, "Specification partitioning for system design", Proc. of 29th DAC, 1992.
[17] B. Bose, M. Esen Tuna, S.D. Johnson, "System factorization in codesign. A case study of the use of formal techniques to achieve hardware-software decomposition", Proc. of IEEE ICCD 1993, October 1993.
[18] "The revised4 report on the algorithmic language Scheme", Lisp Pointers, vol. 4, n. 3, pp. 1-55, 1991.
[19] G. Springer, D.P. Friedman, Scheme and the art of programming, McGraw-Hill, 1990.
[20] K. Rath, M. Esen Tuna, S.D. Johnson, "Behavior tables: a basis for system representation and transformational system synthesis", Proc. of IEEE-ACM ICCAD 1993, November 1993.
[21] K. Rath, M. Esen Tuna, S.D. Johnson, "An introduction to behavior tables", Technical report, University of Utah, 1993.
[22] V. Krishnaswamy, P. Wilsey, "A framework for visualizing performance data in a graphical design environment", Proc. of 3rd MASCOT, 1995.
[23] N. Rethman, P. Wilsey, "RAPID: A tool for hardware/software tradeoff analysis", Proc. of ASP-DAC, 1995.
[24] S. Mohanty, V. Krishnaswamy, P. Wilsey, "System modeling, performance analysis, and evolutionary prototyping with hardware description languages", Proc. of EURO-DAC, 1995.
[25] D. Gajski, F. Vahid, S. Narayan, J. Gong, Specification and design of embedded systems, Prentice Hall, 1994.
[26] F.J. Kurdahi, D. Gajski, C. Ramachandran, V. Chaiyakul, "Linking register-transfer and physical levels of design", IEICE Trans. on Information and Systems, vol. E76-D, n. 9, September 1993.
[27] T.E. Price, Introduction to VLSI technology, Prentice Hall, 1994.
[28] N. Park, A.C. Parker, "Synthesis of optimal clocking schemes", Proc. of 23rd DAC, 1985.
[29] A.C. Parker, T. Pizzaro, M. Mlinar, "MAHA: A program for datapath synthesis", Proc. of 24th DAC, 1986.
[30] R. Jain, M. Mlinar, A. Parker, "Area-time model for synthesis of non-pipelined designs", Proc. of ICCAD, 1988.
[31] A. Gutierrez, P. Sanchez, E. Villar, "VHDL high-level silicon compilation: Synthesis methodology and teaching experience", Proc. of 3rd Eurochip Workshop on VLSI Design Training, 1992.
[32] S. Narayan, D. Gajski, "System clock estimation based on clock slack minimization", Proc. of EDAC, 1992.
[33] J. Lis, Behavioral synthesis from VHDL using structured modeling, PhD Thesis, University of California, Irvine, January 1992.
[34] A. Aho, R. Sethi, J. Ullman, Compilers: Principles, techniques and tools, Addison-Wesley, 1988.
[35] T. Cormen, C. Leiserson, R. Rivest, Introduction to algorithms, MIT Press, 1989.
[36] A. Hashimoto, J. Stevens, "Wire routing by optimizing channel assignments within large apertures", Proc. of 9th DAC, 1971.
[37] P. Paulin, J. Knight, "Force-directed scheduling for the behavioral synthesis of ASICs", IEEE Trans. on Computer-Aided Design, June 1989.
[38] C. Tseng, D. Siewiorek, "Automated synthesis of datapaths in digital systems", IEEE Trans. on Computer-Aided Design, pp. 379-395, July 1986.
[39] J. Gong, D. Gajski, S. Narayan, "Software estimation from executable specifications", Journal of Computer and Software Engineering, 1994.