A Cycle-Based Decomposition Method for Burst-Mode Asynchronous ...

A Cycle-Based Decomposition Method for Burst-Mode Asynchronous Controllers

Melinda Y. Agyekum, Steven M. Nowick∗
Department of Computer Science, Columbia University, New York, NY, 10027, USA
Email: {melinda, nowick}@cs.columbia.edu

Abstract

In this paper, a systematic and automated methodology is proposed for decomposing an asynchronous burst-mode (BM) controller into smaller sub-controllers, where each resulting sub-controller is activated on a communication channel. The proposed approach consists of a new decomposition algorithm, control micro-architecture and inter-controller communication protocol. This method has also been broadened to handle extended burst-mode (XBM) controllers. For both controller types, only a moderate amount of auxiliary hardware is required, and optimizations are proposed to eliminate or simplify this hardware. Initial runtime results for both burst-mode and extended burst-mode controllers are promising. Two of the largest BM benchmarks (dean-cache, scsi) were run using the Minimalist CAD tool and an optimized script. While the original controllers each timed out after 10 hours, the decomposition runs each completed in under 84 seconds. Further attempts to synthesize the original controllers using a suboptimal script succeeded, but with 16-200x greater runtime. Several XBM benchmarks were synthesized using the 3D CAD tool; synthesis of one large complex controller (cdp-p1) did not complete, while the decomposed run succeeded in under 197 seconds.

1 Introduction

Synthesis of large controllers is a challenging problem for asynchronous design. At times, runtime can be a major bottleneck and, in some cases, controllers may not be synthesizable. Decomposition is a technique used in sequential logic synthesis to divide a controller into multiple smaller controllers, thereby improving synthesis runtime and potentially providing other benefits. For asynchronous controllers, decomposition is a significantly more challenging problem than for synchronous ones, since the system has no regular clock or discrete schedule, and the decomposed controllers must interact concurrently with each other and the environment.

In this paper, a systematic and automated methodology is proposed for decomposing an asynchronous burst-mode (BM) controller into smaller sub-controllers, where each resulting sub-controller is activated on a communication channel. The proposed approach consists of a new decomposition algorithm, control micro-architecture and inter-controller communication protocol. This method has also been broadened to handle extended burst-mode (XBM) controllers. For both controller types, only a moderate amount of auxiliary hardware is required, and optimizations are proposed to eliminate or simplify this hardware.

The primary motivation of the proposed research is to decrease the synthesis runtime of larger asynchronous controllers. In general, runtime is a major bottleneck using several existing CAD tools such as Minimalist [9, 12], 3D [26] and Petrify [8, 12]. A secondary goal is reduction in next-state complexity. The next-state logic for each decomposed controller is generally less complex than that of the original monolithic controller; reduced next-state complexity for each controller tends to narrow the window for the fundamental mode timing constraint. Total power consumption may also be reduced using this decomposition method, since only a single controller is active at any time.

Finally, as an added benefit, the proposed method eases the designer's task. In particular, it relieves designers of the burden of manually splitting up complex specifications into smaller and more localized interacting units. The method also provides designers with a higher level of abstraction, where a monolithic controller specification can be used for simulation and validation, while the actual implementation can be subdivided into smaller components.

For the BM controllers, the monolithic and resulting decomposed controllers were run using the Minimalist CAD tool [9] and an optimized script. For two of the largest BM controllers (dean-cache, scsi), each of the original controllers timed out after 10 hours, while the decomposition runs each completed in under 84 seconds. Further attempts to synthesize the original controllers using a suboptimal script succeeded, but with 16-200x greater runtime. The decomposition method was also applied to several XBM benchmarks. The 3D CAD tool [26] was used to synthesize these controllers, and preliminary results show significant improvements on larger complex examples as well. The largest controller (cdp-p1) could not be synthesized, while the decomposed run succeeded in under 197 seconds. In some cases, the sub-controllers are reduced to simple combinational logic blocks.

∗ This work was partially supported by NSF ITR Award No. NSF-CCR0086036.

2 Background

2.1 Burst-Mode Asynchronous Controllers

Burst-mode is a commonly-used Mealy-type state machine specification for asynchronous controllers [9, 15]. A BM specification consists of a set of states and a set of arcs, as illustrated by the example in Figure 1a (see [12]). An arc connects two specification states and is labeled with an input burst (a set of input transitions, for example “BR+” in state s0 and “BR- CA-” in state s4) followed by an output burst (a set of output transitions, for example “CR+” in s0 and “BA-” in s4). Rising transitions are marked with a '+', and falling transitions with a '-'. A BM machine waits for an input burst to arrive; individual input transitions may come in any order and at any time, but only specified input changes are allowed. Once a complete input burst has arrived, the corresponding output burst is generated and the machine moves to the next specification state.

The implementation of a BM specification (Figure 1a) is in the form of a Huffman machine, as shown in Figure 1b (see [12]), consisting of combinational logic, inputs, outputs, and fed-back state variables. State is stored on the feedback loops. The implementation is fully robust in its forward input-to-output combinational paths: no logic hazards will occur, regardless of gate or wire delays. However, for correct sequential operation, there are two simple one-sided timing constraints [9, 12]: (a) fed-back state requirement: the feedback paths must be made slower than the worst-case forward output path, to ensure a correct state change; and (b) generalized fundamental mode requirement: no new input burst can arrive until the entire machine has stabilized from the previous input burst. The latter is effectively a “hold” time requirement. Under a generalized fundamental mode assumption, the environment separates the inputs received by the machine in each specification state by a sufficient time.

Burst-mode machines are most suitable for moderate-concurrency control applications where the forward latency in processing each input event or burst is critical (e.g., a cache or DRAM controller, or measurement systems), but where there are well-bounded minimum delays that separate one input burst from the next (e.g., waiting for cache or DRAM completion).¹

Burst-mode has been effectively used for the design of a large number of applications, including experimental asynchronous chips fabricated by Hewlett-Packard under the Mayfly [19] and Stetson [13] projects (where, in the latter, only minor delay padding was required), a recent chip being developed at NASA,² and a variety of substantial controllers such as a cache controller [16], differential equation solver control [24], and DRAM and SCSI controllers [17, 26]. Interestingly, several of these design projects (e.g., Mayfly, Stetson, NASA, and differential equation solver) chose to perform a manual decomposition of a complex specification into multiple interacting burst-mode controllers. The Minimalist CAD package [12, 9] is used in the proposed design flow to synthesize and optimize burst-mode controllers.

Figure 1. Burst-Mode Example (concur-mixer [12]): (a) Specification, with inputs AR, BR, CA and outputs AA, BA, CR, all initially 0; (b) Gate-Level Implementation, with state variables y1, y0 fed back through a delay element.

¹ For other applications requiring small asynchronous interface circuits that must robustly handle highly-concurrent environments (e.g., pipeline stage controllers), the Petrify tool and methodology is sometimes more suitable [12].
² The second author currently has a joint project with NASA Goddard Space Flight Center using Minimalist and BM controllers for a fabricated experimental chip for space instrumentation.

13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'07) 0-7695-2771-X/07 $20.00 © 2007

2.2 Related Work

Several asynchronous methods have been proposed to decompose a large system automatically into smaller components, including both datapath and control. These include handshake circuit synthesis methods [4, 1], a quasi-delay-insensitive flow [14] and a high-level synthesis flow [20]. However, none of these methods focuses on individual controllers, and each is either template-based (Tangram, Balsa) or fairly coarse-grained. Kudva et al. [11] have proposed a high-level synthesis flow which maps a control-dataflow graph into separate datapath and control. The control is then partitioned into burst-mode sub-controllers, where several may be simultaneously active. However, strict requirements on series-parallel structure are imposed on the original specification, so the method has limited applicability.

For individual controllers, a number of decomposition approaches have been proposed. In the synchronous domain, single controllers have been decomposed for low power [3]. The goal is to activate sub-controllers only when needed, and partitioning is based on locality of computation. Interestingly, though that approach is quite different from ours, both allow only one active controller at any time, with activation control passed on channels.

For single asynchronous controllers, decomposition techniques have been proposed both for quasi-delay-insensitive (QDI) circuits and for burst-mode circuits. For QDI circuits, tools such as Petrify [8] have been used to synthesize the individual controllers. Net contraction can be used to project an initial Petri net specification onto smaller interacting specifications which can be separately implemented [7, 22]. In [21, 18], Vogler and Wollowski presented a formal method for decomposition based on the net contraction approach of [7], which is then validated through bisimulation. Kapoor and Josephs have proposed a simple source language-level decomposition technique, equivalent to Petri net decomposition, which significantly improves runtime, but (as with [11]) can only decompose for fixed series-parallel constructs such as forks and joins [10]. A direct mapping approach by Bystrov and Yakovlev [5], targeted for low latency, is also a form of Petri net decomposition. The approach reduces synthesis runtime overheads and circuit latencies for large specifications by template-based mapping of individual places to specialized David cells. This approach is orthogonal to ours, which instead uses global optimization during sequential synthesis (rather than template-based mapping), cycle-based decomposition and standard cells.

For burst-mode controllers, there has been only limited exploration of decomposition. Beister et al. [2] propose a method that starts from a Petri net specification and produces a set of interacting XBM controllers, which is effectively a form of net contraction. Each controller handles a single primary output, and each resulting decomposed XBM specification is effectively a projection of the original one onto that output. However, only a basic method is presented (which is quite complex), and no benchmark results are reported.

3 Overview

This section provides a general overview of the decomposition algorithm and the micro-architecture that results from the proposed decomposition scheme. Section 3.1 introduces a working example which will be used to illustrate the basic decomposition method, and the target micro-architecture is outlined in Section 3.2. Further details on these topics are presented in later sections.

3.1 Decomposition Method

The proposed decomposition method is a simple graph-based algorithm. The algorithm accepts a single monolithic burst-mode controller and decomposes it into several smaller burst-mode controllers. Coordinating together, these decomposed controllers continue to maintain the behavior of the original specification. An example of an original BM specification is shown in Figure 2a. Figure 2b shows the result of applying the proposed algorithm to the specification. Four new machines are created; together, the interactive behavior of these machines is equivalent to that of the original BM machine. For two of the controllers, additional child monitoring arcs have been added to facilitate inter-controller communication.

Figure 2. Decomposition Method on BM Spec.: (a) Original Spec.; (b) Decomposed Specs. (Key: entry point arcs, child monitoring arcs.)

3.2 Target Micro-Architecture

An overview of the micro-architectural design of the decomposed system, as well as the basic idea of the inter-controller communication protocol, is now provided.

3.2.1 System Overview

Basic System Micro-Architecture. The system micro-architecture consists of several decomposed burst-mode controllers and an output generator for each primary output. Figure 3 depicts a block-level view of the target micro-architecture of the BM specification after decomposition. Each of the four machines in Figure 2b is mapped to a distinct controller, as shown in the block figure. Additional communication signals which allow these components to communicate are also depicted.

The system is structured to allow decomposed controllers to communicate efficiently through point-to-point communication. Each decomposed controller has a single activation channel [4] (REQ/ACK), which is used to enable the controller. A controller can only be enabled by a parent controller at a single designated point, called its entry point. All communication with the parent occurs on the activation channel. Once activated, controllers have the ability to both initiate and receive communication with other controllers. Controllers that receive communication but do not initiate communication are called leaf controllers. Within the system, there is a unique controller called the top-level controller. This top-level controller does not have a parent controller; it is activated externally by primary inputs.

In Figure 3, controller A is the top-level controller for the system. Observe that a parent/child relationship exists between controllers {A, B}, {A, C}, and {B, D}. Controller D does not have any children, and is therefore a leaf controller. There is a multi-way activation channel from A to B and C, and a single activation channel between B and D. The controllers sending a REQ are identified as parent controllers. The child controllers are directly connected to their parent with an ACK.

Operationally, each controller has three modes (active, inactive, suspend) and can only be in one of the three modes at any time. The operation of the controllers is mutually exclusive: only a single decomposed controller can be active at a time. When a controller is active, its parent (and ancestors) are in a “suspend” mode and all other controllers are inactive. After becoming active, a controller typically completes an entire cycle, returns to its entry (i.e., start) point and then passes control back to its parent (with the exception of a few corner cases, which are discussed in Section 4).

Additional Hardware. Additional hardware is needed to maintain correctness when producing primary outputs. Controllers no longer directly generate primary outputs. Instead, a distinct primary output generator unit is created for each of the primary outputs. Each unit is designed to correctly generate and maintain the global value for that output. An output generator combines output transitions from each of the decomposed controllers, using a transition-sensitive gate. Each controller that transitions an output has a single input to the corresponding primary output generator gate, which is used to change the primary output. Only a single input to the primary output generator can be active at a time. Figure 3 shows the two primary output generators, one for each of the primary outputs in the example. Each of the controllers that transitions a specific primary output has a corresponding outgoing signal which is an input to the corresponding primary output generator. Details on the construction of the output generator are presented in Section 5.

Figure 3. Block View of Decomposed System for BM Spec. (decomposed controllers BM CntrlA, CntrlB, CntrlC and CntrlD with their inputs; activation channels 1 and 2 with signals REQ 1, ACK 1a, ACK 1b, REQ 2, ACK 2; primary output generators for zout and yout, driven by A_to_zout, B_to_zout, D_to_zout, C_to_yout and D_to_yout).

3.2.2 Top-Level Communication Protocol

At the system level, controllers communicate using point-to-point communication through activation channels. After system initialization, only the top-level controller is active and awaits the arrival of primary inputs. This controller continues to maintain control by processing inputs and producing outputs (through the primary output generators) until it reaches an entry point for another controller. Upon reaching this entry point, the active controller will broadcast on the activation channel corresponding to that entry point. At that point, control can be passed to any candidate controller which is awaiting activation. Only a single controller will respond. If control is passed, the controller receiving control becomes active and the controller relinquishing control becomes suspended. Eventually, the active controller completes its operation and returns to its entry point. An in-depth presentation of the communication protocol can be found in Section 5.
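The mode discipline described above (one active controller, suspended ancestors, all others inactive) can be illustrated with a small behavioral model. The following Python sketch is only an illustration under stated assumptions: the class `DecomposedSystem`, its methods `activate` and `release`, and the `parent_of` map are hypothetical names introduced here, and the sketch models only the hand-off of control, not the REQ/ACK signalling or the hardware of Section 5.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    SUSPEND = "suspend"


class DecomposedSystem:
    """Toy model of controller modes; names are hypothetical, not from the paper."""

    def __init__(self, parent_of, top):
        # parent_of maps each child controller to its parent controller.
        self.parent_of = dict(parent_of)
        names = set(parent_of) | set(parent_of.values()) | {top}
        # After initialization, only the top-level controller is active.
        self.mode = {name: Mode.INACTIVE for name in names}
        self.mode[top] = Mode.ACTIVE

    def active(self):
        # Exactly one controller is active at any time; unpacking enforces this.
        (name,) = [n for n, m in self.mode.items() if m is Mode.ACTIVE]
        return name

    def activate(self, child):
        # The active parent reaches the child's entry point and passes control.
        parent = self.parent_of[child]
        if self.mode[parent] is not Mode.ACTIVE:
            raise ValueError(f"{parent} is not active, cannot activate {child}")
        self.mode[parent] = Mode.SUSPEND
        self.mode[child] = Mode.ACTIVE

    def release(self):
        # The active controller is back at its entry point and returns control.
        child = self.active()
        parent = self.parent_of[child]  # raises KeyError for the top-level controller
        self.mode[child] = Mode.INACTIVE
        self.mode[parent] = Mode.ACTIVE


# Hierarchy of Figure 3: A is the parent of B and C, and B is the parent of D.
system = DecomposedSystem({"B": "A", "C": "A", "D": "B"}, top="A")
system.activate("B")
system.activate("D")
print({name: mode.value for name, mode in sorted(system.mode.items())})
```

With D active, the model places A and B in suspend mode and leaves C inactive, matching the description of ancestors being suspended while a descendant runs.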


4 Decomposition Method

This section introduces a new algorithm for decomposing a single monolithic burst-mode specification into multiple smaller controller specifications.

4.1 Declarative Formulation

The proposed decomposition strategy is informally as follows. Consider a simple path from the root v (i.e., the burst-mode “start state”) to a decision point x. Suppose the decision point x has n outgoing arcs, A1, ..., An. Each outgoing arc is considered in turn, and potentially up to n subgraphs can be broken off, or decomposed, one for each outgoing arc. In particular, for each arc Ai in turn, the entire reachable subgraph from x on arc Ai is considered. If this entire subgraph is “closed” (defined below), the region is decomposed; if the subgraph is not closed, that is, if it includes certain “ancestor” points on the simple path from root v to x, then this region is not decomposed at arc Ai. This process is repeated for each of the n outgoing arcs at x, and a partitioning decision is made for each. Regardless of the outcomes, the same decision procedure is repeated at further decision points reachable on each arc Ai, while extending the ancestor path into each given subregion. Additional decompositions can thus be considered, allowing additional smaller subregions to be formed in a hierarchical manner.

More formally, the proposed decomposition criterion is based on the notion of closed reachability.

Definition 1. Basic Graph Properties. Given a rooted directed graph G = (V, E, v), with vertex set V, edge set E, and root vertex v ∈ V, a vertex x is called a decision point if its out-degree is two or higher. Vertex x is called a convergence point if its in-degree is two or higher. A path from a vertex w to a vertex y is a simple path if no vertex appears twice on the path. A cycle in the graph is a simple cycle if no vertex appears twice on the cycle.

Definition 2. Closed Reachability. Let G = (V, E, v) be a rooted directed graph, with vertex set V, edge set E, and root vertex v ∈ V. Let x be any decision point in the graph, with n outgoing arcs, A1, ..., An. Also, let P be any simple path leading from root v to decision point x, called an ancestor path to x. Then the graph has closed reachability on arc Ai, with respect to ancestor path P, if no simple cycle containing arc Ai also contains any other decision point on the ancestor path P.

Closed reachability is the precise criterion that is proposed in this paper to decompose a burst-mode specification. This property means that the entire reachable region from x on arc Ai is “self-contained”: once reached through ancestor path P, the only possible means of exiting this region is through decision point x. If this property holds, then there is a well-defined closed subregion below x on outgoing arc Ai which forms the specification for an independent controller (or set of controllers) that is activated by the parent region on a unique allocated communication channel at vertex x, and which signals its completion on this same channel. However, if the property is violated, then this reachable subregion can be exited at a decision point that is an ancestor of x; hence no unique activation channel from the ancestor region can be allocated to control the operation of the subregion, and no decomposition on arc Ai is performed.³ Note that in each case, decision points within this reachable subregion are still examined for further decomposability.

³ The closed reachability property is related to graph dominators, but with some differences: unlike dominators, closed reachability holds with respect to a given ancestor path P, and the property is determined separately for each outgoing arc Ai of decision point x.

Examples. Figure 4(a) shows a simple BM specification with a single decision point 2, reachable by ancestor path 0-1-2. Closed reachability trivially holds on both outgoing arcs, 2-3 and 2-4. For the former, the reachable subgraph can only be exited at decision point 2. For the latter, though there is an arc from 4 to an ancestor vertex 1, this vertex is not a decision point; hence the reachable region can never be exited at an earlier decision point (since there are no others). The resulting cut-points and decomposed specifications are shown in Figure 4(b) and (c), respectively.

Figure 4. Decomp. Example: Simple Case: (a) Original Spec.; (b) Spec. marked with cutoff points; (c) Decomposed Specs.

Figure 5(a) shows a more complex BM specification, which extends the previous example with additional vertices and arcs. There are two decision points, vertices 2 and 4. Given ancestor path 0-1-2, closed reachability still holds at decision point 2 on each outgoing arc (2-3 and 2-4), hence the corresponding reachable subregions are decomposed. Proceeding hierarchically down the latter arc on ancestor path 0-1-2-4, decision point 4 has two outgoing arcs: 4-5 and 4-1. Closed reachability holds on the former, but not on the latter. In particular, the path 4-1-2 hits the ancestor decision point 2, above the current decision point 4; hence the reachable subregion from 4-1 cannot be decomposed, and arc 4-1 remains merged with the ancestor path (0-1-2-4). The corresponding cut-points and decomposed specifications are shown in Figure 5(b) and (c), respectively.

Figure 5. Decomp. Example: Complex Case (Fig. 2 Spec.): (a) Original Spec.; (b) Spec. marked with cutoff points; (c) Decomposed Specs.

Figure 6 shows two corner cases: a dead-end segment (path 3-5-6) and a “decision-free” cycle (3-7-8-7). The latter is a trap cycle, which only has convergence points, and no decision points. By closed reachability, both such regions are decomposed.

Figure 6. Decomp. Example: Special Case: (a) Original Spec.; (b) Spec. marked with cutoff points (dead-end segment and trap cycle indicated).

Finally, there are rare cases where the same subgraph is cut multiple times (i.e., when reachable from multiple ancestor paths). In this case, a single sub-controller is allocated with activation from multiple parents. Details are omitted due to space limitations.
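Definition 2 can be checked directly on small graphs by searching for an offending simple cycle. The Python sketch below is one minimal way to do so, not the method used in the paper: the function names `decision_points` and `has_closed_reachability` are hypothetical, the graph is assumed to be a dictionary from each vertex to its list of successors, and the arc list for the Figure 5(a) specification is reconstructed from the paths quoted in the text (0-1-2-3-2, 0-1-2-4-5-4, 0-1-2-4-1-2). The search enumerates simple paths and is exponential in the worst case, so it is intended only for illustration.

```python
def decision_points(graph):
    """Vertices with out-degree two or higher (Definition 1)."""
    return {v for v, succs in graph.items() if len(succs) >= 2}


def has_closed_reachability(graph, ancestor_path, arc):
    """Check Definition 2 for arc = (x, w), where ancestor_path ends at x."""
    x, w = arc
    if ancestor_path[-1] != x:
        raise ValueError("ancestor path must end at the decision point of the arc")
    if w == x:
        return True  # a self-loop forms a simple cycle containing only x
    # Other decision points on the ancestor path must not lie on any
    # simple cycle through the arc (x, w).
    forbidden = (set(ancestor_path) & decision_points(graph)) - {x}

    def offending_cycle(v, visited, hit):
        # Extend a simple path w ... v; 'hit' records a forbidden vertex on it.
        hit = hit or v in forbidden
        for nxt in graph.get(v, ()):
            if nxt == x:
                if hit:
                    return True  # cycle x -> w ... -> x passes a forbidden vertex
            elif nxt not in visited and offending_cycle(nxt, visited | {nxt}, hit):
                return True
        return False

    return not offending_cycle(w, {x, w}, False)


# Figure 5(a), with arcs reconstructed from the paths quoted in the text.
fig5 = {0: [1], 1: [2], 2: [3, 4], 3: [2], 4: [5, 1], 5: [4]}
for path, arc in [([0, 1, 2], (2, 3)), ([0, 1, 2], (2, 4)),
                  ([0, 1, 2, 4], (4, 5)), ([0, 1, 2, 4], (4, 1))]:
    print(arc, has_closed_reachability(fig5, path, arc))
```

On this graph the check agrees with the discussion above: the property holds on arcs 2-3, 2-4 and 4-5, and fails on arc 4-1 because the cycle 4-1-2-4 passes through the ancestor decision point 2.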

4.2 Algorithm

A simple algorithm is now proposed to decompose a burst-mode specification into multiple sub-controllers, based on the closed reachability property. The algorithm uses a modified form of depthfirst search. In the forward direction, the search explores reachable regions; during backtracking, the decomposed burst-mode controllers are grown and cut. When backtracking reaches the root, all decomposed controllers have been generated. Pseudocode of Algorithm. The pseudocode of the basic algorithm is shown in Figure 7. It assumes a global variable ”FIN-BMSET”, which will contain the final decomposed controllers. There are two unusual features. First, segments in the original graph may be visited multiple times, once from each possible ancestor path. Hence, the classic DFS flow is altered: vertices are marked in the forward traversal (to identify bounds on reachable regions), but unmarked during backtracking (to allow the same regions to be re-explored from different ancestor paths). Hence, the algorithm is super-linear in terms of the original graph size, though in practice it runs very efficiently (see Results section). Second, decomposed burst-mode sub-controllers are constructively grown during backtracking. In forward traversal, when a visited decision point is reached, or a dead-end leaf or trap cycle, a “seed” burst-mode controller strand is created using this leaf vertex. If the leaf is a decision point, the vertex name is used in a label set to identify the termination condition; for the dead-end or trap case, an empty label set is used. During backtracking, the controller strand is grown, as new vertices are added. Finally, when a decision point is reached, if the strand’s label set contains only the same vertex as as the current decision point, a closed region has been identified; the controller strand is cut, and the strand is reinitialized to allow for new growth. 
Alternatively, if the strand’s label set is empty, the strand was a dead-end segment or trap path, and it is likewise cut. Finally, if the strand’s label set does not contain only the current decision point, then the strand includes a path to a leaf which must be an ancestor of the current decision point. In this case, the region is not yet closed, and no cut is performed; all such uncut child strands at this decision point are then joined together, to form a composite burst-mode controller, and passed up through the ancestor path for continued backtracking, along with the combined label sets, until the appropriate decision point is reached where the entire controller can be cut. Examples. Again, consider Figure 4(a), but now applying the algorithm. In the forward traversal, on path 0-1-2, the decision point 2 is marked. There is no need to mark non-decision points, since they are not examined for marks. When reaching 2 again on path 0-1-2-3-2, this mark is seen, and a burst-mode strand is created containing only the terminating node 2, and with a strand label of {2}. On backtracking, at node 3, the strand is grown to include 2-3. Finally, when backtracking to decision point 2, strand label {2} is compared to the decision point; since they are identical, a closed region has been formed, and the strand is cut, to form a simple cycle controller, 2-3-2. Using a similar approach, simple cycle controller 2-4-1-2 is formed during backtracking and cut at decision point 2. Finally, the remaining initial segment 0-1-2 becomes a separate controller. The example in Figure 5(a) is more complex. For two forward paths, the method is similar to the above example. On forward path

BM-DECOMP (root u) / ∗ top-level driver function ∗ / (strand, str-label-set) ← BM MOD DFS(u) ADD(strand, F IN -BM-SET)

FUNCTION

1 2 3

BM MOD DFS(current vertex v) / ∗ recursive search ∗ / tmp-BM-set ← [] tmp-label-set ← [] if D ECISION P T (v) then if M ARKED (v) then / ∗ CASE #1a : v ← marked decision point ∗ / strand ← INIT -STRAND (v) str-label-set ← [v] return (strand, str-label-set) else / ∗ CASE #1b : v ← unmarked decision point ∗ / mark v as visited for each child w i of v do (strand, str-label-set) ← BM MOD DFS(w i) if E MPTY (str label set) then / ∗ dead-end/trap segment ∗ / strand ← JOIN-BM-STRANDS ([strand], v) ADD(strand, F IN -BM -SET ) str-label-set ← str-label-set − [v] if E MPTY (str label set) then / ∗ success : closed region ∗ / strand ← JOIN-BM-STRANDS ([strand], v) ADD(strand, F IN -BM -SET ) else / ∗ failing match : open region ∗ / tmp-BM-set ← tmp-BM-set U strand tmp-label-set ← tmp-label-set U str-label-set Mark v as unvisited strand ← JOIN-BM-STRANDS (tmp-BM -set, v) str-label-set ← tmp-label-set return (strand, str-label-set) else / ∗ visit non-decision point ∗ / if L EAF (v) then / ∗ CASE #2a : v ← leaf (dead-end) ∗ / strand ← INIT -STRAND (v) str-label-set ← [] return (strand, str-label-set) else / ∗ CASE #2b : v is a non-terminal ∗ / (strand, str-label-set) ← BM MOD DFS(child(v)) strand ← JOIN-BM-STRANDS (strand, v) return (strand, str-label-set)

FUNCTION

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

Figure 7. Pseudocode for Decomposition Algorithm

0-1-2-3-2, the marked decision point 2 is seen, backtracking commences, sub-controller 2-3-2 is created, with label set {2}, and it is cut on revisiting decision point 2. Likewise, on forward path 0-1-2-4-5-4, the marked decision point 4 is seen, sub-controller 4-5-4 is created during backtracking, with label set {4}, and it is cut on revisiting decision point 4. Forward path 0-1-2-4-1-2 is more interesting. During forward traversal, decision points 2 and 4 are marked. When the traversal hits 2 again, it identifies the mark, creates an initial burst-mode strand (containing node 4) with label set {2}, and begins backtracking. When decision point 4 is reached, the strand's label set {2} is compared to the decision point. They are not equivalent, indicating that the current strand (4-1-2) is not closed: it reaches an ancestor decision point 2 of vertex 4. Hence no cut can be performed. Backtracking then continues to decision point 2. Now the label set {2} does match the vertex name, identifying a closed subregion, and the strand is cut, forming the simple cycle controller 2-4-1-2.

Pre- and Post-Processing. A small amount of pre- and post-processing is needed to finalize the decomposed sub-controllers. First, as pre-processing, if an initial specification has a decision point at its start point, a single dummy initial arc is added, with a global activation signal, so that the top-level controller has at least one arc before the decision point. Second, as post-processing, after the graph decomposition algorithm, appropriate inter-controller channel communication signals (inputs and outputs) must be added, as shown in Figure 2b (compare to the initial specification in Figure 2a). Finally, for parent controllers (i.e. controllers which activate other controllers), small self-loops are added at relevant decision points, to handle the processing protocol for initiating and monitoring inter-channel communication (see dotted cycles in Figure 2b and the next section for details).

13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'07) 0-7695-2771-X/07 $20.00 © 2007
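The backtracking cut rule traced in the three forward paths above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of Figure 7: it assumes a single forward path that ends at a revisited decision point and a label set with exactly one element, and the name `cut_strand` is hypothetical.

```python
def cut_strand(forward_path, decision_points):
    """Return the closed cycle cut from forward_path, or None.

    forward_path ends at a decision point that was already marked
    earlier on the same path (the revisit that triggers backtracking).
    """
    revisited = forward_path[-1]
    label_set = {revisited}  # labels carried by the strand while backtracking
    # Backtrack from the node just before the revisit toward the start.
    for i in range(len(forward_path) - 2, -1, -1):
        node = forward_path[i]
        if node not in decision_points:
            continue  # ordinary state: it simply joins the strand
        if label_set == {node}:
            # Label set matches the vertex name: closed subregion, cut here.
            return forward_path[i:]
        # Otherwise the strand reaches an ancestor decision point: no cut yet.
    return None


paths = [[0, 1, 2, 3, 2], [0, 1, 2, 4, 5, 4], [0, 1, 2, 4, 1, 2]]
cycles = [cut_strand(p, decision_points={2, 4}) for p in paths]
for cycle in cycles:
    print("-".join(str(n) for n in cycle))
```

On the third path the loop passes decision point 4 without cutting, because {2} does not match 4, and cuts only on reaching decision point 2, which mirrors the walk-through above.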

5 Details of Hardware Implementation

The key details of the hardware implementation are now presented.

5.1 Decomposed Controllers

Controller Structure. Structurally, a decomposed controller has two parts: input latches and the core BM controller. Input latches are added to control the input paths of the decomposed controller, and ensure safe operation in the decomposed system. The BM controllers must be modified to support communication with other controllers.

Input Latches. Unlike the original burst-mode machine, decomposed controllers no longer directly receive primary inputs. Instead, controllers receive inputs that have been filtered through input latches. These input latches are needed to prevent the decomposed controller from receiving spurious inputs (i.e. inputs intended for other controllers). A transparent D-latch is allocated for each input that a controller can receive. Each latch has an enabling unit which controls when the latch is transparent and when it is opaque. The latch enabling unit is created using OR gates and channel signals. At least one signal from a controller's own activation channel is used to generate its latch enable. In addition, if the controller can activate other controllers, signals from these latter activation channels are also used to generate the latch enables.

BM Controllers. To enable communication, each decomposed controller generates a new ACK output for its own activation channel. If a controller causes activation of other controllers, it also generates a REQ output for each decision point at which it can activate others. Additionally, the decomposed controller will receive ACK inputs from all controllers that it can activate.

Controller Operation. The simulation provided in this section outlines the sequence of events in the lifetime of a controller and shows how the input latches are used during this process. A typical sequence of events begins with a controller initially inactive and its input latches disabled. The controller remains inactive until it receives communication from its parent on its activation channel. When this communication begins, its latches become partially enabled, so that only a subset of its input latches is transparent while the others remain opaque. Following the partial enable phase, two scenarios can occur. In the first, the controller is successfully activated by its parent, which in turn causes its input latches to become fully enabled. When this occurs, the controller will receive all inputs without blocking. In the second, the controller fails to receive activation and again becomes inactive. At this time, the controller's input latches are disabled.

After a controller has become activated, one of two events will occur. If it does not activate any child controllers, the current controller continues its operation until it again reaches its own entry point. Its latches are then disabled and the controller becomes inactive. Alternatively, the controller may assume the role of the parent and activate other child controllers. As the parent, the controller's input latches are set to partial enable when it establishes communication with its children, and are fully disabled once a child becomes active.

Illustrative Examples. A complete view of the latch operation and each of the corresponding phases is presented next through a series of examples. Examples 1 and 2 are explained from the viewpoint of a leaf controller, and Examples 3 and 4 from the parent controller's viewpoint. As a byproduct, the actual latch enable hardware is also derived, thus explaining how and when a controller's input latches are enabled and disabled for safety.

Example 1. Activation of Leaf Controller: Child Accepts. In this example, the input latch control structure and controller operation for an accepting leaf controller are described. Figure 8 shows the specifications and latch enable hardware for a single parent controller (BM CntrlC) and two child controllers (BM CntrlA, BM CntrlB). For this example, CntrlA is the accepting child. In Figure 8, the details of the parent controller are left abstract and will be explained in subsequent examples.
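To make the filtering role of the input latches concrete, the following Python sketch models a transparent D-latch with an enable. The class name `TransparentLatch` and its `update` method are illustrative assumptions; the text specifies only that each latch is transparent when enabled and opaque otherwise.

```python
class TransparentLatch:
    """Level-sensitive D-latch: follows D while enabled, holds otherwise."""

    def __init__(self, initial=0):
        self.q = initial

    def update(self, d, enable):
        if enable:      # transparent: output follows the primary input
            self.q = d
        return self.q   # opaque: output keeps its last value


latch = TransparentLatch(initial=0)
# (primary input value, latch enable) pairs applied in sequence
trace = [(1, 0), (0, 0), (1, 1), (0, 1), (1, 0)]
seen = [latch.update(d, en) for d, en in trace]
print(seen)  # [0, 0, 1, 0, 0]
```

The first and last input changes arrive while the latch is opaque, so the controller never observes them. This is the sense in which spurious inputs intended for other controllers are blocked.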

Figure 8. Detailed Leaf Controller Implementation (panels: BM CntrlA Specification, BM CntrlB Specification, BM CntrlC Specification Fragment, and latch enable hardware for leaf controllers BM CntrlA and BM CntrlB; Case 1 latch enable: entry point input; Case 2 latch enable: non-entry point input)

The latch control structure for CntrlA is as follows. An input latch is allocated for each of CntrlA's primary inputs (U, Z). In CntrlA's specification, Z is an entry point input (an input that is visible on the arc exiting the entry point). Its latch is enabled by either REQ1 or ACK1a from the activation channel. Input U is not an entry point input, and its latch is enabled only by ACK1a from the activation channel. All input latches are initially disabled.

A complete simulation is now described. Initially, CntrlA is dormant, and it remains dormant until its parent CntrlC asserts REQ1 (at CntrlA's entry point). The assertion of REQ1 is the first level of enabling, which makes the latch for primary input Z transparent. (This latch is enabled through an OR2(REQ1, ACK1a) gate.) In contrast, the latch for primary input U remains opaque because this input is not visible on the entry point arc. (This latch is enabled through a single ACK1a wire.) This hybrid latch state is called partial enable: only the latches of the inputs on the entry point arc are enabled. CntrlA's latches remain in partial enable until primary input Z arrives.

After Z arrives, CntrlA asserts its individual output ACK, ACK1a. The assertion of ACK1a serves two purposes. First, ACK1a is sent to the parent CntrlC, to inform it of the child's success. The parent then de-asserts REQ1. (The parent's added dotted child monitoring arcs are used to handle this transaction.) Second, ACK1a is also used to open all of CntrlA's input latches, allowing all other inputs (e.g. input U) to be received without blocking. This second level of enabling of the input latches is called fully enable. After CntrlA cycles through all its transitions, it returns to its entry point and finally de-asserts ACK1a, thus completing the four-phase handshake. The de-asserted ACK1a disables both of CntrlA's input latches and also informs parent CntrlC that child CntrlA has completed its operation.

Example 2. Activation of a Leaf Controller: Child Does Not Accept. This example describes how a child controller (BM CntrlB) has the opportunity to become active, but loses to a sibling controller (BM CntrlA above). The simulation begins with CntrlB inactive (with its input latches disabled) at its entry point, where, again, parent CntrlC asserts REQ1. As with CntrlA, this REQ places the input latches in partial enable. Unlike CntrlA, all the inputs of CntrlB appear on the entry point input burst and must be immediately enabled; hence their latch enables are similar to that of input Z of CntrlA. In this simulation, suppose CntrlB never receives a complete input burst (X+ U+), and therefore will not succeed. Meanwhile, assume sibling controller CntrlA successfully receives a complete input burst and responds to the parent controller (as described in Example 1). As before, when the parent determines that a child (e.g. CntrlA) has responded, it de-asserts REQ1. The de-asserted REQ1 now closes both of CntrlB's input latches, causing the controller to become disabled. In short, the channel handshaking protocol for the losing controller is REQ1+ followed by REQ1-, with no ACK1b events.
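The latch enables derived in Examples 1 and 2 reduce to two Boolean expressions. The sketch below (Python; the function names are hypothetical) evaluates them along the winning and losing channel handshakes described above: an entry point input such as Z is enabled by OR2(REQ1, ACK1a), while a non-entry point input such as U is enabled by ACK1a alone.

```python
def entry_input_enable(req, ack):
    """Enable for an entry point input (e.g. Z of CntrlA): OR2(REQ, ACK)."""
    return req | ack


def non_entry_input_enable(ack):
    """Enable for a non-entry point input (e.g. U of CntrlA): ACK only."""
    return ack


def phase(req, ack):
    """Name the latch phase implied by the two enable signals."""
    entry = entry_input_enable(req, ack)
    other = non_entry_input_enable(ack)
    if entry and other:
        return "fully enabled"
    if entry:
        return "partial enable"
    return "disabled"


# (REQ1, ACK) levels after each channel event.
winner = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]  # REQ1+ ACK1a+ REQ1- ACK1a-
loser = [(0, 0), (1, 0), (0, 0)]                   # REQ1+ REQ1-, no ACK1b

winner_phases = [phase(r, a) for r, a in winner]
loser_phases = [phase(r, a) for r, a in loser]
print(winner_phases)
print(loser_phases)
```

The winning trace passes through partial enable into the fully enabled phase and stays there until ACK1a falls; the losing trace only ever reaches partial enable before REQ1- disables the latches again.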

Figure 9. Detailed Parent Controller Implementation

Example 3. Activation of Leaf Controller: Parent Suspends. This example shows how a parent controller (BM CntrlC) passes control to a child (BM CntrlA) and in turn becomes suspended while its child is active. Figure 9 is used to illustrate this example. In this figure, the details of the child controllers are now left abstract.

The latch control for parent CntrlC is divided into two parts (Figure 9). The left-hand part uses the controller's own activation channel. This part is similar to that of a leaf controller, and CntrlC is enabled by its parent as described in Example 1. The right-hand part of the latch control uses the activation channel for its child controllers (CntrlA, CntrlB). Working together, the left-hand part is used to enable, and the right-hand part is used to temporarily disable, the input latch.

The simulation for the parent controller is as follows. Prior to the events of Examples 1 and 2, parent CntrlC was activated by its own parent through REQ0+, and responded by asserting ACK0+. As a result, CntrlC's input latches entered the fully enable phase. On reaching state 2 (i.e. the entry point for children CntrlA and CntrlB), parent CntrlC asserts REQ1. CntrlA then responds by asserting its ACK1a. Note that the parent does not directly receive the child's ACK; instead, it receives the generated ACKInC (an OR of all of its children's individual ACKs), as shown in Figure 9. This generated ACK immediately disables the parent's latches and suspends the parent, while the winning child takes over control. The parent remains dormant until the child completes its operation and de-asserts its ACK. The parent then becomes reactivated and its latches become fully enabled.

Example 4. Activation of Leaf Controller: Parent Maintains Control. At a child's entry point, a parent can either relinquish control to a child or maintain control. In the above example, the parent relinquished control; here, the parent maintains control. The control structure for the parent remains as described in Example 3. In Figure 9, input U is disabled by REQ1 while input Y remains transparent. After REQ1 is asserted, parent CntrlC awaits the arrival of input Y. If input Y arrives before either child asserts an ACK, CntrlC de-asserts REQ1, which fully enables all of the input latches for CntrlC. At this point, CntrlC regains control.

5.2 Primary Output Generators

Figure 10 depicts the general structure of a primary output generator. A single output generator is allocated for each primary output. The primary output generator takes several inputs, one per controller, and produces a single output.

Figure 10. Block View of a Primary Output Generator

In general, the output gate is implemented using either an XOR or XNOR gate. The gate selected depends on the initial values of the primary output generator's inputs and output. There are two cases in which an XOR is assigned. These cases occur when the initial values of the decomposed controllers are a mixture of 0s and 1s. An XOR gate is assigned when (1) there is an even number of inputs set to 1 and the initial output value is 0, or (2) there is an odd number of inputs set to 1 and the initial output value is 1. Similarly, there are two conditions in which an XNOR gate is assigned: an even number of inputs set to 1 with an initial output value of 1, or an odd number of inputs set to 1 with an initial output value of 0. Each of these cases is listed in Table 1a.

In some cases, AND/OR gates or a single wire can be used for the output generator. An OR (AND) gate is assigned when all initial input and output values are set to a uniform value of 0 (1) (see Table 1a). There are two special cases in which AND/OR gates can also be used. These cases are quasi-uniform: the initial value of the top-level controller differs from the initial value of all other decomposed controllers. For these cases, with the exception of one decomposed controller, all decomposed controllers have the same initial value for a particular input. Finally, the output gate is reduced to a single wire when at most one controller transitions on the output. These special cases are listed in Table 1b.
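The gate-selection rules of this section can be collected into a small lookup function. The Python below is a sketch under the rules stated in the text only: `select_output_gate`, `evaluate`, and the `num_drivers` argument are hypothetical names, and the convention that the first list element belongs to the top-level controller is an assumption made for the illustration.

```python
from functools import reduce


def select_output_gate(init_inputs, init_output, num_drivers=None):
    """Pick the output gate from initial values (top-level controller first)."""
    if num_drivers is not None and num_drivers <= 1:
        return "WIRE"                      # at most one controller drives the output
    ones = sum(init_inputs)
    if ones == 0 and init_output == 0:
        return "OR"                        # uniform 0
    if ones == len(init_inputs) and init_output == 1:
        return "AND"                       # uniform 1
    top, rest = init_inputs[0], init_inputs[1:]
    if len(set(rest)) == 1 and top != rest[0] and init_output == top:
        return "OR" if top == 1 else "AND"  # quasi-uniform special cases
    # General mixed case: XOR output equals the parity of the 1s.
    return "XOR" if ones % 2 == init_output else "XNOR"


def evaluate(gate, values):
    """Evaluate the chosen gate on a list of 0/1 input values."""
    if gate == "OR":
        return int(any(values))
    if gate == "AND":
        return int(all(values))
    parity = reduce(lambda a, b: a ^ b, values)
    return parity if gate == "XOR" else 1 - parity


cases = [([0, 0, 0], 0), ([1, 1, 1], 1), ([1, 1, 0], 0),
         ([0, 1, 0], 0), ([1, 0, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 1], 0)]
gates = [select_output_gate(vals, out) for vals, out in cases]
print(gates)
```

In every listed case the selected gate, evaluated on the initial input values, reproduces the required initial output value, which is the consistency condition the rules are built around.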

Table 1. Primary Output Generator Look-up Tables

(a) General Cases

Decomposed Cntrlrs Values    # of 1s (Even/Odd)    Initial Output Value (Original Cntrlr)    Primary Output Gate
Uniform (all 0)              -                     0                                         OR
Uniform (all 1)              -                     1                                         AND
Mixed                        Even                  0                                         XOR
Mixed                        Odd                   0                                         XNOR
Mixed                        Even                  1                                         XNOR
Mixed                        Odd                   1                                         XOR

(b) Special Cases

Case             Initial Output Value (Original Cntrlr)    Top-level Cntrlr    All Other Cntrlrs    Primary Output Gate
Quasi-uniform    1                                         1                   0                    OR
Quasi-uniform    0                                         0                   1                    AND
Single Owner     0/1                                       -                   -                    WIRE

5.3 Timing Constraints

For correct sequential operation, the decomposed system (BM controllers and primary output generators) must continue to respect the two simple BM timing constraints: (a) the fedback state requirement and (b) the generalized fundamental mode requirement (see Section 2.1). Timing constraints are addressed for the burst-mode controllers as well as for the input latches and output generators, for each type of communication.

Parent-to-Child. A parent controller initiates communication with a child twice during the handshaking protocol. First, the parent sends a REQ+ to its disabled children. This transaction does not affect either BM timing constraint ((a) or (b)) since all children are dormant; therefore no additional timing constraints are needed. The parent controller also sends a REQ- to its children (both "winner" and "losers"). The children receiving this REQ are no longer dormant, and at least a subset of their input latches are enabled. For the winning child controller (the controller that asserts an ACK), neither (a) nor (b) is affected. However, an added constraint is needed to prevent the input latch hardware from malfunctioning: REQ- must arrive after ACK+ arrives. For losing child controllers (which did not assert an ACK), (b) is now involved: the delay path from the winning controller's output burst (in particular, ACK+ to the parent) to the receipt of the REQ- by the losing controller (from the parent) must complete before the next input burst arrives. In practice, this new path is short, and is evaluated as part of (b).

Child-to-Parent. A "winning" child controller responds to the parent twice during the handshaking protocol. First, the child responds with an ACK+. For this transaction, only constraint (b) of the parent is modified, by adding a one-sided delay on the ACK that is sent by the child. This precaution ensures that the parent has completely stabilized prior to receiving the child's ACK. A child also sends an ACK- to the parent. The parent receives this ACK while in "suspend" mode, and neither (a) nor (b) is affected.

Controller-to-Self. Both parent and child controllers perform self communication during four-phase handshaking. The parent first sends a REQ+ to itself when activating a child. This REQ+ output may be used to disable some of the parent's input latches. Here, (b) is affected, because this short disabling path must be completed before the next input burst arrives. However, REQ- simply re-enables these input latches, with no impact on (a) or (b). For a winning child controller, its ACK+ output fully enables its input latches without timing constraints; its final ACK- output disables its latches on a short path, which affects constraint (b).

Controller-to-Output-Generator. This communication effectively relaxes constraint (b). For BM controllers, the one-way controller-to-output communication only lengthens the path along which outputs are generated, thereby making the fundamental mode requirement easier to satisfy. Controller-to-output-generator communication does not have any effect on (a).
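The event ordering that these constraints protect can be checked mechanically on a channel trace. The following Python sketch is only an illustration: the event strings and the helper `classify_handshakes` are hypothetical, and the model covers ordering on a single child's activation channel, not delays.

```python
def classify_handshakes(events):
    """Split one child's channel trace into completed handshakes.

    A winner sees REQ+, ACK+, REQ-, ACK- (REQ- only after ACK+);
    a loser sees REQ+ followed directly by REQ-.
    """
    transitions = {
        ("idle", "REQ+"): "requested",
        ("requested", "ACK+"): "acked",     # child wins
        ("requested", "REQ-"): "idle",      # child loses: request withdrawn
        ("acked", "REQ-"): "released",      # parent releases after ACK+
        ("released", "ACK-"): "idle",       # child finished its cycle
    }
    outcomes, state = [], "idle"
    for event in events:
        key = (state, event)
        if key not in transitions:
            raise ValueError(f"{event} is not allowed in state {state}")
        if key == ("requested", "REQ-"):
            outcomes.append("loser")
        elif key == ("released", "ACK-"):
            outcomes.append("winner")
        state = transitions[key]
    return outcomes


trace = ["REQ+", "ACK+", "REQ-", "ACK-", "REQ+", "REQ-"]
outcomes = classify_handshakes(trace)
print(outcomes)  # ['winner', 'loser']
```

A trace in which ACK+ follows a withdrawn request, such as REQ+, REQ-, ACK+, is rejected with a `ValueError`, since in this model an ACK+ is only legal while the request is still asserted.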

6 Optimizations

Two classes of optimizations are now introduced: to remove input latches, and to simplify the output generators.

6.1 Input Latch Removal

Single Owner Latch Removal. This optimization removes input latches for those primary inputs which are only received by a single controller. In this case, the input never needs to be blocked. After applying this optimization, the primary input signal is no longer filtered and is directly visible to the controller; hence the critical input-to-output path has no latch.

Proper Subset Latch Removal. This optimization removes a proper subset of the input latches for those inputs that occur in a controller's entry point burst. With these latches removed, any toggling on the unlatched inputs will not affect the state or the outputs of the controller, since the controller is inactive and in its initial state. (In burst-mode implementations, any partial input burst can be received and then de-asserted without problems [9].)

Complete Set Latch Removal. Under certain conditions, all input latches for those inputs that occur in a controller's entry point input burst can be removed. In particular, if it can be guaranteed that, throughout the entire dynamic simulation of the system, the complete input vector of the controller's entry point will never be seen while the given controller is inactive, then all corresponding input latches can be removed.

Reduction in Strength: Replacing Latches with Gates. For leaf controllers, input latches can always be replaced by simple gates. For primary inputs with an initial value of 0, an AND2 gate is used to replace the latch. Similarly, for primary inputs with an initial value of 1, an OR2 gate is used to replace the latch. Each gate has two inputs: the primary input and the enable signal (formerly used in the latch). For the OR2 case, the enable is complemented.
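The gate replacement for leaf controllers can be written out directly from the rule above. In this Python sketch, the function name `gated_input` is a hypothetical label for the replacement gate; the logic is the AND2 and OR2-with-complemented-enable choice described in the text.

```python
def gated_input(primary_input, enable, initial_value):
    """Gate that replaces a leaf controller's input latch.

    initial_value 0 -> AND2(input, enable)
    initial_value 1 -> OR2(input, NOT enable)
    """
    if initial_value == 0:
        return primary_input & enable
    return primary_input | (1 - enable)


# Enumerate every combination of initial value, enable, and input level.
table = {
    (init, en, x): gated_input(x, en, init)
    for init in (0, 1) for en in (0, 1) for x in (0, 1)
}
for key in sorted(table):
    print(key, "->", table[key])
```

With the enable low, the gate output sits at the input's initial value regardless of the primary input; with the enable high, the gate passes the primary input through unchanged.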

6.2 Output Optimizations

Reduction in Strength. Reduction in strength occurs when the XOR/XNOR gate is replaced with either (1) a simple gate (AND/OR) or (2) a single wire. Both instances were already discussed when defining the output generator (see Section 5.2).

7 XBM Extension

The structural decomposition method proposed in Section 4 can also be used to decompose extended burst-mode (XBM) controllers. Given a single XBM controller, the decomposition algorithm is applied and several decomposed XBM controllers are produced. This simple XBM extension uses the same graph-based decomposition algorithm but applies a modified labeling scheme. XBM specifications are more expressive than BM specifications, and therefore can capture a wider range of behaviors.

7.1 XBM Background

XBM is a Mealy-type specification for asynchronous controllers [26, 27]. This type of machine was designed to overcome some limitations of burst-mode through the addition of two new features, called conditionals and directed-don't-cares (DDCs). Conditionals permit the level sampling of signals, and DDCs allow inputs to change concurrently with outputs. A brief description of these features is presented below (see also [26, 27]).

Directed-don't-cares. DDCs allow inputs to change concurrently with outputs. In XBM, there are two types of signal transitions: terminating and directed-don't-care. Terminating signals, which are also found in BM, are denoted with +/-, indicating that a signal changes monotonically (i.e. once) from 0 to 1 or from 1 to 0, respectively. In an XBM spec, an input burst must contain at least one terminating signal. DDCs are denoted with #/∼, indicating that an input signal eventually changes from 0 to 1 or from 1 to 0, respectively. The symbol * is a generic DDC symbol which does not indicate the direction of transition. Once a signal appears as a DDC in an input burst, it will continue to appear as a DDC in the subsequent series of input bursts until it appears as a terminating signal.

Conditionals. A conditional signal allows decisions to be based on the levels of designated input signals. A conditional for input signal s is denoted by ⟨s+⟩ / ⟨s-⟩, implying "if input signal s = 1" and "if input signal s = 0", respectively. Several restrictions are placed on conditional signals:

• If one arc exiting state X has a conditional signal, all other arcs exiting X must also have a conditional.

• Conditionals have setup and hold time requirements. Prior to setup time, conditionals are free to change values; however, all conditionals must be set up before the arrival of any of the terminating signals in the input burst. This new "hold time" requirement, between the arrival of the conditionals and the arrival of the first terminating input signal, is a fundamental mode timing requirement.

• A distinguishability constraint exists whereby, from a single state, the conditional clauses are either mutually exclusive or the set of inputs in an input burst is not a subset of another.

An example of an XBM specification for xscsi-biu-fifo2dma [23] is shown in Figure 11a. The result of the proposed decomposition method is a set of three controllers shown in Figure 11b.

Figure 11. Decomposed XBM Specification: (a) Original Spec., (b) Decomposed Specs.

7.2 XBM Decomposition Method

The same graph-based decomposition approach proposed for BM controllers can be used directly for XBM controllers. However, because of the additional XBM features (DDCs and conditionals), a series of post-processing transformations must be performed on the labeling of inputs and outputs in the resulting decomposed specifications, to ensure a correct final implementation. This section outlines each transformation applied to the decomposed XBM controllers. The transformations remove certain DDCs and conditionals from key points within a decomposed specification, and modify the implementation accordingly. All other labels remain unaffected. Effectively, the XBM specification is made to mimic a BM specification at three critical points.

Transformation on the Entry Point Arc. The first transformation removes both conditionals and DDCs from the entry point arc, the arc exiting the entry point. Prior to any modification, both DDCs and conditionals appear on the entry point arc (see Figure 12a). Once they are removed, the resulting arc appears as a BM arc. The conditionals are still retained, but are instead used to qualify the latch enable for entry point inputs. Effectively, a two-step enabling procedure is used: (i) the conditional first activates the latch enable, and then (ii) the BM controller evaluates the terminating signals. Figure 12b shows the result of this transformation. In this figure, inputs "[x-]" and "a*" have been removed from the specification arc. In the implementation for this controller (Figure 12c), DDC input a is now only enabled by an ACK, indicating that it no longer appears on the entry point arc. Input c, which remains an entry point input, now has its REQ qualified by conditional x-. This transformation allows for a further input optimization. The conditional appearing in the entry point input can now be entirely removed as an input to the controller, as shown in the figure, because it does not appear on any other arcs within the controller.

Figure 12. Transform #1: Entry Pt. Arc: (a) Before, (b) After, (c) Decomposed Controller Implementation (After)

Transformation on an Unterminated Series of DDCs. A transformation is also performed on arcs immediately preceding an internal activation point of a controller, i.e. a decision point where at least one sub-controller can become active. This is the point within a parent controller where it will activate a child. The transformation is applied if the arcs preceding this decision point have an unterminated series of DDCs. An unterminated series of DDCs is a series of DDCs that is not terminated prior to an internal activation point. Figure 13 shows the two types of unterminated series of DDCs within an original specification; decision point 2 is an internal activation point. Both series are shown prior to being partitioned into decomposed controllers; however, the cut-points have been added to show where the algorithm will make the partitions.

Figure 13. Types of Unterminated Series of DDCs: (a) Simple Series, (b) Complex Series

The unterminated series of DDCs shown in Figure 13(a) consists of a signal that appears exclusively as a DDC and does not appear as a terminal signal prior to the decision point. This type is referred to as the simple series. The second type of unterminated DDC series is the complex series, shown in Figure 13(b). This type of DDC series contains a signal that also appears as a terminal signal prior to the decision point.

On both types of DDC series, the transformation is to delete all the DDCs (Figure 14 (a1, b1)). These DDCs do not need to be observed before the activation point is reached; by eliminating these DDCs, the existing BM hardware solution can be used, as shown in the figure. For the complex series of DDCs, one small addition is required: since the input must be observed as a terminal, and masked as a DDC, a new output signal, "SeriesFlag", is added to disable the input during the DDC series. Note that, in some cases, the DDC input is no longer observed within the specification; in this case, the DDC is removed as an input to the controller. Figure 14(a2, b2) shows the controller implementation for both types of series after the transformation. In (a2), input a has been removed as an input to the controller. Figure 14(b2) shows the output SeriesFlag added to qualify the latch enable for input a.

Figure 14. Transform #2: Unterminated Series of DDCs: (a1) Simple Series, Before and After; (b1) Complex Series, Before and After; (a2), (b2) Decomposed Controller Implementation (After)

Transformation After an Internal Activation Point. A transformation is also applied to arcs exiting an internal activation point, called internal activation point arcs. These are the arcs at a decision point that are still "owned" by the controller; at the same decision point, other competing child controllers can also be activated. In Figure 15(a), decision point 6 is the internal activation point for the controller. Here, the controller, acting as a parent, attempts to activate any competing children. Either a child controller will win control, or the parent will resume control and continue.

The transformation on the internal activation point arc is to remove DDCs (Figure 15b) which appear on all the outgoing arcs. After being removed from the arc, the latch for the DDC remains disabled using the same hardware approach as for the BM controllers. While conditionals are not removed from these arcs in the specification, hardware modifications are still required. Conditional values from each child are now used as "inhibits" to the parent's latch enable. As the conditionals stabilize, if a child's conditional appears which is different from any of the parent's conditionals, it immediately disables the parent's latches, thus avoiding incorrect processing by the (losing) parent.

Figure 15. Transform #3: Internal Activation Pt. Arc: (a) Before, (b) After, (c) Decomposed Controller Implementation (After)

Figure 15c shows the resulting implementation. Input a is now disabled at the internal activation point by REQ_C. If the child (not shown) that is activated at the internal activation point has a conditional value of "x+", while the parent has "x-" (as shown), the latch enables for the input latches for x and d are now both qualified with this conditional value ("x+").
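The first transformation of Section 7.2, on the entry point arc, can be sketched as a small label-rewriting step. The Python below is an illustration only: the helper names `classify` and `transform_entry_arc` are hypothetical, and conditionals are assumed to be written in square brackets, as in the "[x-]" label quoted in the text.

```python
def classify(label):
    """Classify one XBM input label by its notation."""
    if label.startswith("[") and label.endswith("]"):
        return "conditional"          # e.g. [x-]: level test on x
    if label[-1] in "*#~∼":
        return "ddc"                  # directed-don't-care
    if label[-1] in "+-":
        return "terminating"          # monotonic rising or falling edge
    raise ValueError(f"unrecognized XBM label: {label}")


def transform_entry_arc(input_burst):
    """Return (BM-style input burst, conditionals that qualify the latch enable)."""
    bm_burst = [l for l in input_burst if classify(l) == "terminating"]
    qualifiers = [l.strip("[]") for l in input_burst
                  if classify(l) == "conditional"]
    if not bm_burst:
        raise ValueError("an input burst needs at least one terminating signal")
    return bm_burst, qualifiers


# Entry point arc of the Figure 12 example: [x-] a* c+ | z+ Ack+
bm_burst, qualifiers = transform_entry_arc(["[x-]", "a*", "c+"])
print(bm_burst, qualifiers)  # ['c+'] ['x-']
```

The returned burst is what remains on the specification arc, while the returned conditional is the signal that would qualify the REQ-based latch enable for the remaining entry point input.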

8 Results

The proposed decomposition algorithm was implemented in a new CAD tool called bm-decomp, and experiments were performed to evaluate its effectiveness. bm-decomp was written in roughly 2100 lines of C, and experiments were conducted on a 2.40GHz Pentium 4 machine with 512KB RAM running Fedora Linux 2.6.11-1.35 FC3. The program takes in an original (X)BM specification and produces several (X)BM specifications. All optimizations in Section 6 are implemented in the tool.

The controller benchmarks were obtained from a wide range of academic and industrial projects. The BM examples include controllers from Hewlett-Packard’s Mayfly [19] and Stetson [13] projects, as well as a cache controller [16], DRAM [17] and SCSI controllers [17, 25], and a distributed asynchronous FIFO controller (opt-token-distributor [6]). The collection of XBM specifications was obtained from a number of publications, including a controller for a differential equation solver [24] and a SCSI controller [25]. To focus on larger examples, cutoff criteria were used: 12 or more states for BM specifications, and 9 or more states for XBM specifications. The largest BM specification had 71 states, which is regarded as a very large number for a machine interacting in a concurrent environment, which must also ensure hazard-freedom and critical race-freedom.

Preliminary experimental results are reported in Table 2. This table presents a summary of the entire synthesis flow by comparing the original monolithic controller to a suite of decomposed controllers. Horizontally, the table is divided into two sections: (1) BM and (2) XBM. The BM specifications were synthesized using a soon-to-be-released version of Minimalist. This version improves the runtime of large monolithic controllers by using better data structures for the critical race-free state assignment step. The 3D CAD tool was used to synthesize the XBM specifications.
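The benchmark cutoff criteria stated above can be written as a simple filter. The sketch below is illustrative only: the function names are hypothetical, and the reading of the “I/S/O” column of Table 2 as inputs/states/outputs is an assumption inferred from the text (for example, scsi is listed as 9/71/5 and the largest BM specification is said to have 71 states).

```python
BM_MIN_STATES = 12   # cutoff stated in the text for BM specifications
XBM_MIN_STATES = 9   # cutoff stated in the text for XBM specifications


def parse_iso(iso):
    """Split an 'I/S/O' string such as '9/71/5' into three integers."""
    inputs, states, outputs = (int(field) for field in iso.split("/"))
    return inputs, states, outputs


def passes_cutoff(spec_type, iso):
    """Return True if a specification is large enough to be included."""
    _, states, _ = parse_iso(iso)
    minimum = BM_MIN_STATES if spec_type == "BM" else XBM_MIN_STATES
    return states >= minimum


# Two entries from Table 2, plus one made-up small BM example ("5/11/3").
print(passes_cutoff("BM", "9/71/5"))
print(passes_cutoff("XBM", "6/9/4"))
print(passes_cutoff("BM", "5/11/3"))
```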
The columns of this table are partitioned into three large sections: (1) Original Specification, (2) Monolithic Runs, and (3) Decomposition Runs. Section (1) provides general information that is relevant to both monolithic and decomposed controllers. In (2), details of the synthesis (Minimalist/3D) of the original monolithic controllers are given. Finally, (3) shows details of the complete decomposition runs, including both results of the decomposition algorithm and controller synthesis runs (Minimalist/3D).

Results of the decomposition algorithm are reported in the left of the Decomposition Runs column. The bm-decomp tool adds only a negligible amount of overhead to the total runtime: under 1 second for each benchmark. The tool successfully completed on all examples, including several substantial controllers with complicated structure. For BM, the decomposition method allocated between 3 and 20 smaller sub-controllers per benchmark. Between 2 and 5 sub-controllers were allocated for each of the XBM specifications.

The BM controllers were synthesized using a pre-existing speed script in Minimalist. The speed script performs three steps: (1) state minimization, (2) optimal (critical race-free) state assignment, and (3) hazard-free logic minimization, each without using any primary outputs as fed-back state variables [9, 15]. Minimalist also has a command-line feature that allows for scripts to be bypassed

13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'07) 0-7695-2771-X/07 $20.00 © 2007

Original Specification and Monolithic Runs

| Spec Type | Machine Name [with citations] | I/S/O | Runtime (sec) | Total Logic Area (Prods/Lits) | Output Logic Area (Prods/Lits) | # State Bits |
|---|---|---|---|---|---|---|
| BM | dram-ctrl [17] | 7/12/6 | 0.49 | 24/83 | 14/60 | 1 |
| BM | rf-control [13] | 6/12/5 | 0.53 | 15/60 | 7/32 | 3 |
| BM | opt-token-distributor [6] | 4/12/4 | 3.39 | 12/45 | 4/21 | 4 |
| BM | stetson-p2 [13] | 8/25/12 | 1.67 | 42/166 | 25/94 | 7 |
| BM | sd-control [13] | 8/27/12 | 10.03 | 49/205 | 26/104 | 8 |
| BM | sc-control [13] | 13/29/14 | {212.67}+ {514.32}~ | {93/359}+ {137/724}~ | {72/275}+ {101/542}~ | {3}+ {4}~ |
| BM | stetson-p1 [13] | 13/33/14 | {26.10}+ {8.80}~ | {61/232}+ {79/409}~ | {47/179}+ {55/288}~ | {3}+ {4}~ |
| BM | dean-cache [16] | 16/37/19 | {1340.23}+ {22137.32}~ | {171/671}+ {256/1121}~ | {159/620}+ {219/829}~ | {2}+ {4}~ |
| BM | pscsi [17] | 10/45/5 | {28.42}+ { }~ | {87/414}+ { }~ | {51/214}+ { }~ | {4}+ { }~ |
| BM | scsi [17] | 9/71/5 | {1237.11}+ {221.09}~ | {106/519}+ {112/544}~ | {60/243}+ {62/267}~ | {4}+ {4}~ |
| XBM | scsi-targ-send [26] | 6/9/4 | 0.58 | 13/83 | 8/59 | 3 |
| XBM | scsi-init-send [26] | 5/9/3 | 0.56 | 13/53 | 9/35 | 3 |
| XBM | diffeq-ALU2 [24] | 5/14/7 | 1.07 | 24/105 | 19/78 | 2 |
| XBM | barcode-xbm | 13/14/18 | 15.88 | 69/226 | 45/168 | 3 |
| XBM | fibonacci-xbm | 15/20/19 | 2988.89 | [*] | [*] | [*] |
| XBM | cdp-p1 | 21/28/24 | 3516.12 | [*] | [*] | [*] |

Decomposition Runs (Decomposition Algorithm, Combined Synthesis, and Total Runtime)

| Machine Name | # Decomp Cntrlrs | Total # Inputs | Avg # Cntrlr States | Decomp Runtime (sec) | Total Logic Area (Prods/Lits) | Output Logic Area (Prods/Lits) | Total # State Bits | Synthesis Runtime (sec) | Total Runtime (Decomp + Synthesis) (sec) |
|---|---|---|---|---|---|---|---|---|---|
| dram-ctrl | 4 | 12 | 4.25 | 0.12 | 44/120 | 39/104 | 2 | 0.92 | 1.04 |
| rf-control | 3 | 9 | 5.33 | 0.09 | 28/86 | 20/62 | 5 | 0.97 | 1.06 |
| opt-token-distributor | 1 | 4 | 12.00 | 0.02 | 12/50 | 4/25 | 4 | 3.39 | 3.41 |
| stetson-p2 | 5 | 15 | 6.8 | 0.14 | 86/256 | 63/193 | 13 | 0.45 | 0.59 |
| sd-control | 6 | 15 | 4.73 | 0.15 | 74/227 | 59/178 | 12 | 2.11 | 2.26 |
| sc-control | 3 | 24 | 19.5 | 0.96 | T{152/542}+ T{171/751}~ | T{122/422}+ T{135/574}~ | T{6}+ T{8}~ | T{203.97}+ {355.97}~ | T{204.93}+ {356.93}~ |
| stetson-p1 | 6 | 21 | 7.33 | 0.14 | 113/487 | 72/247 | 14 | 123.73 | 123.87 |
| dean-cache | 5 | 35 | 8.8 | 0.13 | 303/989 | 282/919 | 5 | 83.59 | 83.72 |
| pscsi | 15 | 36 | 4.73 | 0.27 | 222/688 | 151/462 | 29 | 37.61 | 37.88 |
| scsi | 20 | 57 | 5.55 | 0.35 | 329/981 | 250/700 | 41 | 2.23 | 2.58 |
| scsi-targ-send | 5 | 15 | 5.15 | 0.06 | 20/65 | 16/58 | 3 | 1.90 | 1.96 |
| scsi-init-send | 3 | 13 | 6.67 | 0.10 | 29/104 | 23/85 | 5 | 1.52 | 1.62 |
| diffeq-ALU2 | 4 | 35 | 6.00 | 0.11 | 52/162 | 47/142 | 3 | 2.3 | 2.41 |
| barcode-xbm | 3 | 23 | 8.00 | 0.10 | 64/237 | 54/219 | 2 | 3.49 | 3.59 |
| fibonacci-xbm | 3 | 30 | 11.33 | 0.10 | 83/350 | 76/316 | 2 | 2472.09 | 2472.19 |
| cdp-p1 | 5 | 53 | 14.8 | 0.13 | 153/634 | 135/559 | 3 | 289.56 | 289.69 |

Shaded Gray (BM) = Time-Out after 10 Hours
Shaded Gray (XBM) with [*] = 3D returned no solution
+ = Manual Minimalist run with fed-back outputs
~ = Manual Minimalist run without fed-back outputs

Table 2. Experimental Results: Decomposition Method

and steps (1-3) can be performed with various customized options. While synthesizing some of the BM specifications, the command-line feature was used to avoid optimal state assignment, and to use basic critical race-free state assignment only, in an attempt to reduce runtime when needed. Additionally, each manual step was performed both with and without fed-back outputs.

The BM decomposition runs showed significant benefits over the original runs. First, using the speed script on the original specifications, only 5 out of the 10 benchmarks could be synthesized before a 10 hour timeout. In contrast, using the speed script on the decomposed specifications, all were synthesizable except for one controller (for sc-control). To complete the synthesis runs on the timed-out examples, the manual command-line mode of Minimalist was next attempted. All benchmarks were finally synthesized, except for one run of the original pscsi specification.

There were large runtime improvements on several examples. For example, the dean-cache example timed out in the speed script, and was then synthesized in two manual runs in 1340 and 22,137 seconds; in contrast, the entire decomposition run (including partitioning and synthesis) took under 84 seconds using the speed script. Similarly, the scsi example timed out, and was then synthesized in 1,237 and 221 seconds in two manual runs; in contrast, the entire decomposition flow (including partitioning and synthesis) took under 3 seconds using the speed script.

Interestingly, the changes in the updated version of Minimalist were more favorable to synthesizing large monolithic BM controllers. Prior to this work, scsi had never been successfully synthesized using a burst-mode synthesis tool (Minimalist, 3D) targeting a clockless implementation. Only a locally-clocked burst-mode implementation has been previously reported.
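As a rough illustration of the runtime comparisons above, the ratio between each monolithic manual run and the complete decomposition flow can be computed directly from the runtimes listed in Table 2. The helper name `speedup` and the dictionary layout are illustrative assumptions, not part of the tool flow.

```python
def speedup(monolithic_sec, decomposed_sec):
    """Ratio of monolithic runtime to total decomposed runtime."""
    return monolithic_sec / decomposed_sec


# Runtimes in seconds, from Table 2:
# (manual run with fed-back outputs, manual run without), decomposed total
runs = {
    "dean-cache": ((1340.23, 22137.32), 83.72),
    "scsi": ((1237.11, 221.09), 2.58),
}

for name, (manual_runs, decomposed_total) in runs.items():
    ratios = [round(speedup(t, decomposed_total), 1) for t in manual_runs]
    print(name, ratios)
```

The ratios depend strongly on which manual run is taken as the baseline, so they are best read per run rather than as a single figure per benchmark.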
The new changes in Minimalist allowed the monolithic scsi controller to be synthesized using command-line options, though still not with the speed script.

The XBM specifications were synthesized using the 3D CAD tool. Due to a limitation in 3D, a modification was made to a few of the decomposed controllers to allow them to synthesize properly: in 3D, each output in an XBM specification must make at least two transitions or the program will terminate unsuccessfully. The affected specifications were therefore modified by adding a single new arc with an extra output transition. The new transition has an added “dummy signal” as its input and the (de)asserted output signal as its output.
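The check and the workaround described above can be sketched as follows. This is a minimal illustration under stated assumptions: the arc representation (dictionaries with `src`, `dst`, `in`, and `out` fields), the function names, and the signal name `dummy+` are all hypothetical, and do not reflect the input format of 3D or of bm-decomp.

```python
from collections import Counter


def outputs_needing_fix(arcs, outputs):
    """Return the outputs that make fewer than two transitions over all arcs."""
    counts = Counter()
    for arc in arcs:
        for transition in arc["out"]:
            counts[transition[:-1]] += 1  # strip the trailing '+' or '-'
    return [name for name in outputs if counts[name] < 2]


def dummy_arc(src, dst, last_transition):
    """Build an extra arc that toggles an output back on a dummy input."""
    signal, polarity = last_transition[:-1], last_transition[-1]
    opposite = "-" if polarity == "+" else "+"
    return {"src": src, "dst": dst, "in": ["dummy+"], "out": [signal + opposite]}


# Hypothetical two-arc specification: y toggles twice, z only once.
arcs = [
    {"src": 0, "dst": 1, "in": ["a+"], "out": ["y+", "z+"]},
    {"src": 1, "dst": 0, "in": ["a-"], "out": ["y-"]},
]
print(outputs_needing_fix(arcs, ["y", "z"]))
print(dummy_arc(1, 2, "z+"))
```

Here only `z` is flagged, and the added arc deasserts it in response to the dummy input, giving it the second transition.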

On average, this change only affected one decomposed controller per benchmark. The XBM synthesis results on the decomposed machines also show some significant improvements over the original runs. The two largest monolithic XBM controllers could not complete the synthesis runs (error messages in hazard-free logic minimization), while all decomposition runs successfully completed. As an example, one of the larger original controllers (cdp-p1) could not be synthesized, while the decomposed run succeeded in under 197 seconds. Unlike Minimalist, 3D does not have a command-line feature, so no alternative synthesis run could be used.

Total logic area and output logic area are also listed, for both BM and XBM runs. The total logic area for each of the decomposed controllers does not include input latches or optimized enabling gates, nor does it include the latch enable logic (which is not considered to be on the critical path). The input latches and enabling gates can be recovered from Table 2 using the “Total # Inputs” column, along with input optimization information from Table 4. Although our focus is on runtime, area results were reasonable. While, as expected, the original controllers tended to be smaller, there were several instances where the decomposed ensemble had lower or roughly comparable total logic area.

Next-state logic is also an important factor, because it can have a direct impact on fundamental mode timing requirements. Next-state logic in the decomposed controllers is significantly smaller, per state bit, than in the original controllers. For most examples, there was a 2-10x reduction in next-state logic per bit, and roughly a 5x reduction on average. These results validate that the decomposition method can greatly reduce the fundamental mode “hold time” window for each individual controller. (Results can be derived from Table 2 by subtracting output logic from total logic, and dividing by the number of state bits.)
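The derivation in the parenthetical note above can be worked through for one benchmark. The sketch below applies it to the dram-ctrl entries of Table 2; the function name `next_state_per_bit` and the (products, literals) tuple layout are illustrative assumptions.

```python
def next_state_per_bit(total, output, state_bits):
    """(total logic - output logic) / state bits, for products and literals.

    total and output are (products, literals) pairs from Table 2.
    """
    prods = (total[0] - output[0]) / state_bits
    lits = (total[1] - output[1]) / state_bits
    return prods, lits


# dram-ctrl, monolithic: total 24/83, output 14/60, 1 state bit
mono = next_state_per_bit((24, 83), (14, 60), 1)
# dram-ctrl, decomposed: total 44/120, output 39/104, 2 state bits
decomp = next_state_per_bit((44, 120), (39, 104), 2)

print("monolithic per bit:", mono)
print("decomposed per bit:", decomp)
print("reduction in products per bit:", mono[0] / decomp[0])
```

For this benchmark the monolithic controller has 10 products per state bit against 2.5 for the decomposed ensemble, a 4x reduction, which falls within the 2-10x range quoted above.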
Finally, several of the smaller decomposed controllers were reduced to simple combinational logic blocks (not indicated in the table). Combinational logic blocks were produced in 7 out of the 16 decomposed benchmark runs, with an average of 3 combinational controllers for each of the 7 benchmarks. For stetson-p1, 4 out of the 6 controllers synthesized were combinational blocks of logic.

Tables 3 and 4 show the results of applying the optimizations for output logic and input latches, as presented in Section 6. Overall, the output optimizations were highly effective, in some cases up to 100% effective. In Table 3, the trends show that AND/OR gates and single wires are most frequently allocated for the output gate logic, rather than the more complex XOR/XNOR gates. Table 4 shows that the input optimizations were fairly effective. After the optimizations were applied, 31% of the inputs were unlatched, 44% used simple 2-input gates, and 25% remained latched. For leaf controllers, the substitution of latches with gates can always be performed, and since leaf controllers are most common, this optimization was highly effective (applicable to 60% of input latches, after initial latch removal optimizations). In the dram-ctrl and dean-cache examples, after the substitution is performed, all of the input latches can be removed. Table 4 shows that, for each benchmark, at least two of the four input optimizations could be applied.

[Chart: percentage of output logic (0% to 100%) per benchmark, by category: single wire, AND/OR, XOR/XNOR. Benchmarks: cdp-p1, fibonacci-xbm, barcode-xbm, diffeq-ALU2, scsi-init-send, scsi-targ-send, scsi, pscsi, dean-cache, stetson-p1, sc-control, sd-control, stetson-p2, opt-token, rf-control, dram-ctrl.]
Table 3. Experimental Results: Output Optimizations

[Chart: percentage of inputs (0% to 100%) per benchmark, by category: Not Optimized, Gate Substitution, Single-Owned, Complete Set, Proper-Subset. Benchmarks: fibonacci-xbm, cdp-p1, barcode-xbm, diffeq-ALU2, scsi-init-send, scsi-targ-send, scsi, pscsi, dean-cache, stetson-p1, sc-control, sd-control, stetson-p2, opt-token, rf-control, dram-ctrl.]
Table 4. Experimental Results: Input Optimizations

9 Conclusions

In this paper, a new decomposition method for asynchronous burst-mode controllers was presented. A wide-ranging set of decomposed BM and XBM controllers was synthesized using the Minimalist and 3D CAD tools, respectively. Significant improvements in runtime were observed. On almost every benchmark, the decomposition method produced results in about 2 minutes or less, including on some of the largest existing BM/XBM benchmarks. In contrast, the monolithic runs timed out using an existing script (BM), or failed (XBM), on 7 out of 16 benchmarks. With the BM runs, resynthesis using alternative sub-optimal command-line options in several cases still resulted in runtime degradation of 16-200x over the decomposed runs.

Acknowledgments. We would like to thank Tiberiu Chelcea for creating the new version of Minimalist used to evaluate results.

References

[1] A. Bardsley and D. Edwards. Compiling the language Balsa to delay-insensitive hardware. In C. D. Kloos and E. Cerny, editors, Hardware Description Languages and their Applications (CHDL), pages 89–91, April 1997.
[2] J. Beister, G. Eckstein, and R. Wollowski. From STG to Extended-Burst-Mode Machines. In Async-99.
[3] L. Benini, F. Vermeulen, and G. D. Micheli. Finite-state machine partitioning for low power. In Int. Symp. on Circuits and Systems, 1998.
[4] K. v. Berkel. Handshake Circuits: An Intermediary between Communicating Processes and VLSI. PhD thesis, Eindhoven University of Technology, 1992.
[5] A. V. Bystrov and A. Yakovlev. Asynchronous circuit synthesis by direct mapping: Interfacing to environment. In Async-02.
[6] T. Chelcea and S. M. Nowick. Low-latency asynchronous fifo’s using token rings. In Async-00.
[7] T.-A. Chu. Synthesis of Self-timed VLSI Circuits from Graph-theoretic Specifications. PhD thesis, MIT, June 1987.
[8] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev. Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers. IEICE Trans. Inf. and Syst., E80-D(3):315–325, Mar. 1997.
[9] R. Fuhrer and S. Nowick. Sequential Optimization of Asynchronous and Synchronous Finite-State Machines: Algorithms and Tools. Kluwer Academic, 2001.
[10] H. K. Kapoor and M. B. Josephs. Decomposing specifications with concurrent outputs to resolve state coding conflicts in asynchronous logic synthesis. In DAC-04, pages 830–833.
[11] P. Kudva, G. Gopalakrishnan, and H. M. Jacobson. A technique for synthesizing distributed burst-mode circuits. In DAC-96.
[12] L. Lavagno and S. Nowick. Asynchronous control circuits. In S. Hassoun and T. Sasao, editors, Logic Synthesis and Verification, chapter 10. Kluwer Academic, 2002.
[13] A. Marshall, B. Coates, and P. Siegel. Designing an asynchronous communications chip. IEEE Design and Test of Computers, 11(2):8–21, 1994.
[14] A. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, 1:226–234, 1986.
[15] S. Nowick. Automatic Synthesis of Burst-Mode Asynchronous Controllers. PhD thesis, Stanford University, March 1993. (revised tech. report, Stanford Computer Systems Lab. CSL-TR-95-686, Dec. 1995).
[16] S. M. Nowick, M. E. Dean, D. L. Dill, and M. Horowitz. The design of a high-performance cache controller: a case study in asynchronous synthesis. Integration: the VLSI Journal, 15(3):241–262, Oct. 1993.
[17] S. M. Nowick, K. Y. Yun, and D. L. Dill. Practical asynchronous controller design. In ICCD-92.
[18] M. Schaefer, W. Vogler, R. Wollowski, and V. Khomenko. Strategies for optimised stg decomposition. In ACSD ’06: Proceedings of the Sixth International Conference on Application of Concurrency to System Design, pages 123–132, 2006.
[19] K. Stevens, S. Robison, and A. Davis. The post office - communication support for distributed ensemble architectures. In ICDCS-86.
[20] M. Theobald and S. Nowick. Transformations for the synthesis and optimization of asynchronous distributed control. In DAC-01.
[21] W. Vogler and R. Wollowski. Decomposition in asynchronous circuit design. In Concurrency and Hardware Design, pages 152–190, 2002.
[22] T. Yoneda, H. Onda, and C. J. Myers. Synthesis of speed independent circuits based on decomposition. In Async-04.
[23] K. Y. Yun. Automatic synthesis of extended burst-mode circuits using generalized C-elements. In EURODAC-96, pages 290–295.
[24] K. Y. Yun, P. A. Beerel, V. Vakilotojar, A. E. Dooply, and J. Arceo. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. In Async-97, pages 140–153.
[25] K. Y. Yun and D. L. Dill. A high-performance asynchronous SCSI controller. In ICCD-95, pages 44–49.
[26] K. Y. Yun and D. L. Dill. Unifying synchronous/asynchronous state machine synthesis. In ICCAD-93, pages 255–260.
[27] K. Y. Yun, D. L. Dill, and S. M. Nowick. Practical generalizations of asynchronous state machines. In EDAC-93, pages 525–530.
