Advanced Technology Mapping for Standard-Cell Generators

Vinícius Correia, André Reis
Universidade Federal do Rio Grande do Sul
P.B.: 15064, Av. Bento Gonçalves, 9500, Bloco IV
Porto Alegre, RS - 91501-970 - BRAZIL
+55 51 3316 6810

{vincor, andreis}@inf.ufrgs.br

The first papers on library-based technology mapping were published in the eighties, and two main lines of work have been pursued since. The first, the most widely studied and published, maps the circuit onto a given library of arbitrary cells [3,5,6,7,8,10,11], where each cell has been previously designed and characterized. The algorithm optimizes a cost function, generally timing or area, while covering the circuit description. The second approach uses virtual libraries instead, that is, a specification of the set of cells that can be used to map the network [1,2]. The library is usually defined as the set of cells that meet specified topological constraints. These approaches require a cell generator, which automatically generates the cells used in the mapped network. The most relevant approaches of this category in the available literature were proposed in [2] and in [1]. To the best of our knowledge, there are no other relevant works in the published literature.

ABSTRACT In this paper, a new algorithm for technology mapping targeting standard-cell generators is proposed. The proposed method explores several AND/OR circuit decompositions by using an n-ary tree representation of the circuit. In the covering step, the cell that leads to the smallest depth increase is chosen. Depth calculation is not limited to the subject tree and takes into account all previously mapped trees representing sub-expressions used as inputs. Experimental results show gains in circuit depth, measured by the number of gates in series, as well as in area, measured by transistor count, when compared to the SIS mapping approach using the same libraries. The gain in circuit depth translates to better timing, as verified by spice simulations.

Categories and Subject Descriptors B.6.3 [Hardware]: Logic Design – Design aids.

In [2], the pioneering work targeting cell generators, a mapper that uses cells constrained by their size was presented. However, the technique was limited in that it was not performance-aware, since it only attempts to map the largest cells possible. In addition, a poor covering strategy made it harder to decide which nodes should be grouped to form a complex gate without increasing the logic depth.

General Terms Algorithms, Design.

Keywords Technology mapping, logic synthesis, cell library, library-free, complex gates.

Another approach targeting cell generators was proposed in [1], where both the mapping method and the automatic cell generator were discussed. A new matching method and a different covering strategy were proposed there, but again without taking the logic depth of the circuit into account.

1. INTRODUCTION
Technology mapping is an important step in the synthesis of semicustom circuits, since the quality of the designs heavily relies on it. Traditional methods transform arbitrary logic representations of the combinational portions of a design into a functionally equivalent connected set of cells from a library, and this problem is known to be NP-complete. Thus, academic and industrial research efforts have developed heuristics to solve it within an acceptable time, and most works have formulated technology mapping as a tree covering problem.

This paper presents a new approach to the problem stated in [2], performed over a data structure similar to that of [1], but using a distinct covering strategy that exploits all the possible decompositions in the same subject description, taking into account criteria that optimize the logic depth of the nodes before gathering them into complex gates. The method is simple and efficient in terms of CPU time. The main contributions of this work are:

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SBCCI’04, September 7–11, 2004, Pernambuco, Brazil. Copyright 2004 ACM 1-58113-947-0/04/0009...$5.00.




• The decision about the order in which nodes are grouped to form a new complex gate must consider the actual logic depth of each node. The covering strategy of our approach bases this decision on a real depth calculation rather than on heuristics.

• The decision about which and how many nodes should be grouped into each gate must not hinge only on the size constraints of the largest mappable cell, but also on the logic depth of the nodes being selected. By trying several possible groups of nodes and taking their logic depths into account, our algorithm can choose the best cluster to use at each iteration.

• The quality of the final implementation heavily depends on the initial decomposition of the circuit description. Some works in the literature deal with this point by encoding several decompositions in the same subject description [7,10], while most others just rely on the decomposition provided by the previous step. The subject description in our approach enables the investigation of several decompositions encoded in the same data structure without modifying it, and we expect this contribution to represent the same advance for cell-generator-based technology mapping algorithms as [7] did for library-based methods.

The remainder of this paper is organized as follows. An overview of technology mapping is presented in section 2. Section 3 focuses on previous contributions to technology mapping targeting cell generators. Section 4 presents our proposal, including its data structure and main features. Section 5 presents experimental results against a reference logic synthesis tool [9]. Finally, conclusions and future work are discussed in section 6.

2. TECHNOLOGY MAPPING
Technology mapping is the choice of the elements from a technology (typically cells from a library) that will effectively be used to implement a given circuit. Traditional technology mapping algorithms are typically divided into three main steps: decomposition, matching and covering. The decomposition step transforms the initial description of the circuit (typically a DAG) into a forest of trees, as shown in Fig. 1. This decomposition into trees is common to most technology mapping approaches, and the description of the algorithms focuses on the mapping of individual trees [3,5,7,8]. For this reason, the description of the proposed algorithm will also treat individual trees.

Figure 1 – Partitioning a DAG into a forest of trees

After the circuit DAG is decomposed into trees, each algorithm decomposes the trees into base functions, such as NAND/NOT or AND/OR/NOT trees. The data structure representing a decomposed portion of the circuit, on which the mapping algorithm is performed, is called the subject tree. The quality of the final implementation effectively depends on this initial decomposition. A good preliminary decomposition can lead to better mappings, while a poor one may lead to a low-quality implementation with the same mapping algorithm.

Matching is a crucial step, which tries to determine which technology elements, such as a standard cell or a complex gate, may be used to implement a set of nodes in the decomposed network. The two major approaches to this problem are structural [3,5] and Boolean [8] matching. The algorithms targeting cell generators [1,2] also include a matching step, where sets of nodes in the decomposed description are picked to implement complex gates that obey a restriction on the number of series transistors.

Finally, the covering step is performed to choose some of the matches found to implement the entire subject description in terms of the target technology. The best set of matches that implements the circuit is selected, while the other matches are discarded.

The mapping algorithms targeting cell generators rely on the completeness of a given virtual library induced by some topological constraints, as desired or imposed by the technology limitations. The approaches studied in the literature [2,1], as well as ours, use the maximum number of series transistors in the n-plane and p-plane of CMOS complex gates as parameters to induce the virtual library. Therefore, any complex gate that respects these constraints is contained in the library. As a consequence, the cell generator must be able to generate the complete set of cells induced by these parameters. The limitation imposed by the technology on the number of series n-type (p-type) transistors is often denoted by s (p), and this is the notation followed in this paper.

The main advantage of using this kind of mapping is the completeness of the library of complex CMOS gates, since complete libraries are larger and this leads to better quality mappings, as experimentally verified by [3,11].

3. PREVIOUS WORK
This section discusses the previous works in technology mapping targeting standard-cell generators presented by Berkelaar [2] and Abouzeid [1].

3.1 Berkelaar 88 algorithm
The algorithm proposed in [2] was the pioneering work targeting standard-cell generators. The algorithm takes as input a complex Boolean expression representing a combinational portion of the previously partitioned and decomposed circuit. Most algorithms in the literature use a bottom-up approach to apply dynamic programming techniques [1,3,5,6,7,8,9,10,11]. That work, though, used a top-down covering strategy, unique in the known literature on technology mapping. This choice of a top-down approach is the root of three drawbacks. First, because the algorithm cuts the top of the tree first, the intermediate nodes are not yet mapped, so their delays cannot be taken into account. As a result, the decision of which sub-expressions to substitute at each iteration is based on heuristics instead of a dynamic programming strategy, which would allow the decision to be made precisely in terms of the previously covered sub-expressions and their delay costs. This will be discussed further in section 4.

Additionally, no extra decompositions are attempted. Hence, the first maximal cluster that matches the specified costs is cut, covering the description sub-optimally. It should be noted that the approach in [1] was not devoted to delay mapping and that no delay data were presented in that paper.

4. OUR PROPOSAL
This paper presents a new approach to the problem stated in [2], performed over a data structure similar to that of [1], but using a distinct covering strategy that exploits several decompositions in the same subject description, taking into account criteria that optimize the logic depth of the nodes before gathering them into complex gates. The mapping procedure operates on tree representations of Boolean expressions, which can be easily obtained by partitioning heuristics such as those proposed in [3] and [8]. In our implementation, each combinational portion of a given circuit is initially represented as a DAG, as shown in fig. 2. Afterward, each multiple-fanout node is taken as the root of a partitioned subtree. By applying this transformation, we get a disjoint forest of mappable trees. This process was shown in Fig. 1.
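The partitioning step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `partition_dag`, the dictionary encoding of the DAG, and the node names are all assumptions made for this example. Every multiple-fanout gate (and every primary output) becomes the root of its own tree, and appears as a leaf in the trees that use it.

```python
from collections import Counter


def partition_dag(fanins, outputs):
    """Split a DAG into trees rooted at outputs and multiple-fanout gates.

    fanins maps each gate name to the list of its input names; names that
    are not keys of fanins are treated as primary inputs (assumed encoding).
    """
    # Count how many gate inputs each signal drives.
    fanout = Counter(src for ins in fanins.values() for src in ins)
    roots = {n for n in fanins if fanout[n] > 1 or n in outputs}

    def build(node, is_root=False):
        # Primary inputs and the roots of other trees become leaves here.
        if node not in fanins or (node in roots and not is_root):
            return node
        return (node, [build(child) for child in fanins[node]])

    return {n: build(n, is_root=True) for n in fanins if n in roots}


# v1 feeds both v2 and v3, so it becomes the root of a separate tree.
dag = {
    "v1": ["a", "b"],
    "v2": ["v1", "c"],
    "v3": ["v1", "d"],
}
forest = partition_dag(dag, outputs={"v2", "v3"})
```

In this example the forest contains three trees; `v1` is a root in one of them and a leaf in the other two, so the trees share no internal nodes.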

Figure 2 – Bottom-up (a) vs. top-down (b) covering

The second drawback is shown in Fig. 2. The algorithm prefers to place cells of higher complexity at the root of the subject tree, and this may cause timing problems because the roots of the subject trees are heavily loaded nodes, as may be inferred from the decomposition step presented in Fig. 1. Notice that node V1 will have fanout greater than one. Therefore, some post-processing optimization will be necessary to add buffers to these multiple-fanout nodes, or the timing will be degraded. The third drawback is that a large number of small gates will end up at the bottom of the network description, each of them grouping small sub-expressions that had not been substituted earlier.

Figure 3 – N-ary tree transformation rules (a, b)

The subject tree used for the mapping is n-ary, as in [1], with multiple-input AND/OR operator nodes and leaf nodes representing the input literals. The subject tree is always modified by the following rules. First, every inverter (NOT) node in the description is propagated to the input variables, at the leaf boundary of the tree (fig. 3a). Afterward, every connected pair of nodes with the same operator label is merged into a single node (fig. 3b).
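The two transformation rules can be sketched in Python as follows. This is an illustrative sketch only: the tuple encoding `("AND", [...])`, `("OR", [...])`, `("NOT", child)`, the use of a `!` prefix for negated literals, and the function names are assumptions of this example, not part of the paper. Inverters are pushed to the leaves with De Morgan's laws, and then parent and child nodes with the same operator are merged.

```python
def push_not(tree, negate=False):
    """Rule of fig. 3a: propagate NOT nodes down to the literals."""
    if isinstance(tree, str):
        if not negate:
            return tree
        # Toggle the (assumed) "!" prefix that marks a negated literal.
        return tree[1:] if tree.startswith("!") else "!" + tree
    op, arg = tree
    if op == "NOT":
        return push_not(arg, not negate)
    if negate:
        op = "OR" if op == "AND" else "AND"  # De Morgan's law
    return (op, [push_not(child, negate) for child in arg])


def merge_same(tree):
    """Rule of fig. 3b: merge a child into its parent when labels match."""
    if isinstance(tree, str):
        return tree
    op, children = tree
    merged = []
    for child in map(merge_same, children):
        if not isinstance(child, str) and child[0] == op:
            merged.extend(child[1])  # absorb the grandchildren
        else:
            merged.append(child)
    return (op, merged)


def normalize(tree):
    return merge_same(push_not(tree))


# a AND NOT(b OR (c OR d))  becomes a single four-input AND of literals.
expr = ("AND", ["a", ("NOT", ("OR", ["b", ("OR", ["c", "d"])]))])
subject = normalize(expr)
```

After normalization, no NOT node remains inside the tree and no node has a child with its own operator label.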

It should be noted that the approach in [2] was not devoted to delay mapping and that no delay data were presented in that paper.

3.2 Abouzeid 92 algorithm

After applying the rules shown in fig. 3, any two consecutive nodes have different labels, and the root node of the description has the label OR or AND. Besides the label, each node has associated (s,p) costs and a logic depth (l). The series (parallel) cost of an AND (OR) node is given by the sum of the s (p) costs of its children. The parallel (series) cost of an AND (OR) node is given by the highest among the p (s) costs of its children. If the node is a leaf, its (s,p) costs are (1,1).
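The cost rules above translate directly into a short recursive function. The sketch below assumes a tuple encoding `("AND", [...])` / `("OR", [...])` with strings as leaves; the encoding and the name `sp_cost` are choices made for this example.

```python
def sp_cost(tree):
    """Return the (s, p) costs of a node in an AND/OR subject tree."""
    if isinstance(tree, str):
        return (1, 1)  # a leaf costs one series and one parallel transistor
    op, children = tree
    costs = [sp_cost(child) for child in children]
    s_values = [s for s, _ in costs]
    p_values = [p for _, p in costs]
    if op == "AND":
        # Series costs add up; the parallel cost is the worst child.
        return (sum(s_values), max(p_values))
    # OR: parallel costs add up; the series cost is the worst child.
    return (max(s_values), sum(p_values))


cost = sp_cost(("OR", [("AND", ["a", "b"]), "c"]))
```

For the expression a·b + c, the AND node costs (2,1) and the leaf c costs (1,1), so the OR root costs (max(2,1), 1+1) = (2,2) and fits a library with (s,p) = (2,2).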

In [1], a new method targeting cell generators was proposed. The method is applied to previously factorized Boolean expressions represented as trees. After a factorization step, each Boolean expression ends up as an n-ary subject tree with nodes labeled by OR or AND operations. The mapping method is applied directly to this structure. The arrangement of each tree composed of multiple-input AND/OR nodes is determined by the factorization step. Each node of the subject description has associated series and parallel costs. These values are used to identify sub-trees that match the constraints used in the mapping. The algorithm works bottom-up, but the covering is performed in a greedy manner. The main disadvantage of this covering tactic, which was modeled as a simple greedy clustering, is that little or no attention is given to criteria other than the (s,p) costs when choosing the nodes. Thus, the gathered nodes may be part of a critical path, degrading the logic depth of the subject node.

The logic depth (l) of an internal (non-leaf) node is given by the highest value among all its children. The l of the leaves is given by the initial partitioning of the network and by the previously mapped trees. All logic depths are calculated starting from the primary inputs of the entire Boolean network, instead of looking only at the local description. This information is obtained accurately by applying the technology mapping method to the tree descriptions of the network portions in an order that follows the precedence of the logic functions implemented.
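A small Python sketch may help to see how depths propagate across trees. It assumes that primary inputs have depth 0 and that each tree in the example is implemented as a single gate, which adds one level; both are simplifying assumptions of this example, as are the tuple encoding and the name `logic_depth`.

```python
def logic_depth(tree, leaf_depth):
    """Depth l of a subject-tree node: the highest depth among its children."""
    if isinstance(tree, str):
        # Leaves: 0 for primary inputs (assumed), otherwise the depth stored
        # for the output of a previously mapped tree.
        return leaf_depth.get(tree, 0)
    _op, children = tree
    return max(logic_depth(child, leaf_depth) for child in children)


# Trees are visited in precedence order, so v1 is mapped before v2 uses it.
leaf_depth = {}
trees = [("v1", ("AND", ["a", "b"])), ("v2", ("OR", ["v1", "c"]))]
for name, tree in trees:
    # Assumption: the whole tree becomes one gate, one level above its inputs.
    leaf_depth[name] = logic_depth(tree, leaf_depth) + 1
```

Because `v1` is processed first, the leaf `v1` inside the second tree carries its global depth, so the depth of `v2` reflects the whole network rather than only its local tree.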


The subject description proposed embeds all possible decompositions achieved by applying the associative transformation. The quality of the final implementation significantly depends on the provided decomposition of the network description, as stated earlier. An example of embedded decompositions made possible by this kind of description is shown in fig. 4. By exploring different decompositions, different results are achieved. Thus, the best result found may be kept as the best cover for the subject description.

Figure 4 – Some decompositions (b,c) embedded in an n-ary tree (a), depending on the chosen match

By dynamically exploring several decompositions embedded in the same tree, our method can achieve better results than simply relying on the initial structure provided. This improvement is accomplished at a lower computational cost than exhaustively trying all the possibilities over numerous instances of the same description.

The covering procedure is bottom-up, following a dynamic programming tactic. As a result, each node whose children have already been mapped has precise information about s, p and l. The mapping algorithm is as follows. Nodes are visited depth-first and their costs are analyzed. At each node where the series and parallel costs comply with the specified constraints, a match is established. The costs for a complete tree are shown in fig. 5a. If a node is found whose (s,p) costs equal the maximum values allowed, it is marked as an ideal cut. An example of an ideal cut made at a node for a (2,2) restriction is shown in fig. 5b. The sub-tree rooted at this node is then cut and directly associated with a CMOS complex gate. The cut node in the original tree is replaced by a new leaf node, with associated costs (1,1) and a logic depth given by the value of the root of the recently cut tree plus 1. The search then restarts at this node. If a node is found with any of its costs exceeding the limits, it is sliced further. Its set of children is scanned in terms of costs and depths. Several clusters grouping the nodes with lower depths are considered at this time. A new node with the same operator is added as its child and receives a set of children that maximizes the (s,p) costs and minimizes the depth of the new node. This new node is then cut and the cluster is directly associated with a CMOS complex gate. The set of nodes removed from the original tree is replaced by a single new leaf node, with associated costs (1,1) and a logic depth given by the worst value of this set plus 1. The search then restarts at the node that had some children removed. If any of its costs still exceeds the limits, the last step is repeated. After each iteration, the costs of the remaining nodes that root recently modified subtrees are recalculated.

Figure 5 – Initial costs, first cut

The main advantage of this method is that it dynamically considers several decompositions of the subject description at low or no cost and covers it optimally in a dynamic programming manner, following a bottom-up strategy. Additionally, each set of nodes that covers a portion of the description with a complex CMOS gate is chosen taking into account the locally stored logic depth information that reflects the whole network, hence reducing the number of series gates in the critical paths of the circuit, instead of merely minimizing the longest path of a narrow description. A final cut and its associated implementation are shown in fig. 6. After this, a dedicated process performs the gate polarization step, where inverters are used where necessary, prioritizing the critical paths. The results, shown in the next section, are mapped circuits with primary outputs at lower depths, as expected, thus reducing the number of gates in the critical paths.
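The bottom-up covering procedure of this section can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Node` class and all function names are hypothetical, both limits are assumed to be at least 2, clusters are formed greedily from the lowest-depth children rather than by comparing several candidate clusters, the explicit marking of ideal cuts is omitted (a fitting subtree is simply absorbed by its parent or cut later), and gate polarization is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                                   # "AND", "OR" or "LEAF"
    children: list = field(default_factory=list)
    name: str = ""
    depth: int = 0                            # used by leaves only


def sp_cost(node):
    if node.op == "LEAF":
        return (1, 1)
    costs = [sp_cost(c) for c in node.children]
    s = [c[0] for c in costs]
    p = [c[1] for c in costs]
    return (sum(s), max(p)) if node.op == "AND" else (max(s), sum(p))


def subtree_depth(node):
    if node.op == "LEAF":
        return node.depth
    return max(subtree_depth(c) for c in node.children)


def cut(subtree, gates):
    """Record subtree as one complex gate; return the leaf that replaces it."""
    gates.append(subtree)
    return Node("LEAF", name=f"g{len(gates)}", depth=subtree_depth(subtree) + 1)


def cover(node, limits, gates):
    """Reduce node bottom-up until its (s, p) costs fit within limits."""
    if node.op == "LEAF":
        return node
    node.children = [cover(c, limits, gates) for c in node.children]
    axis = 0 if node.op == "AND" else 1       # the cost that sums at this node
    while sp_cost(node)[axis] > limits[axis]:
        # Group the shallowest children first so deep inputs stay near the root.
        ordered = sorted(node.children, key=subtree_depth)
        cluster, total = [], 0
        for child in ordered:
            c = sp_cost(child)[axis]
            if total + c > limits[axis]:
                break
            cluster.append(child)
            total += c
        if total < 2:
            # A lone leaf would not shrink the node: cut a full child instead.
            cluster = [next(c for c in ordered if sp_cost(c)[axis] >= 2)]
        subtree = cluster[0] if len(cluster) == 1 else Node(node.op, cluster)
        picked = {id(c) for c in cluster}
        rest = [c for c in node.children if id(c) not in picked]
        node.children = rest + [cut(subtree, gates)]
    return node


def map_tree(root, limits):
    gates = []
    top = cover(root, limits, gates)
    out = cut(top, gates) if top.op != "LEAF" else top
    return gates, out.depth


# Four-input AND whose input d arrives late (depth 2), with (s,p) = (2,2).
leaves = [Node("LEAF", name=n, depth=d) for n, d in [("a", 0), ("b", 0), ("c", 0), ("d", 2)]]
gates, out_depth = map_tree(Node("AND", leaves), limits=(2, 2))
```

In this example the early inputs a, b and c are clustered first, and the late input d is connected to the last gate, so the output is at depth 3. Pairing (a,b) and (c,d) instead would place the output at depth 4, which is the kind of difference the depth-aware cluster choice is meant to capture.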


Figure 6 – Final cover and naïve implementation (before inverter minimization)

5. EXPERIMENTAL RESULTS
Our implementation [4] is fast and runs in linear time, O(tree_size). Typical execution times are a few seconds for circuits of up to 10k gates. Small benchmarks ran in less than 100 ms. The results available at present are very promising and compare favorably with SIS [9], which is still a reference tool for teaching and research at undergraduate and graduate levels. We expect that the approach will lead to even better results after applying post-processing heuristics for optimizing critical paths, and some experimental results support this expectation. Tab. 1 shows that the proposed method does not increase the area of the design, and even decreases the transistor count (Xtors column). This is achieved by choosing larger and fewer cells (Gates column). The number of inverters in the design relies on inverter minimization heuristics and is shown in column “Invs”.

Table 1 – Gate, transistor and inverter counts, Proposal vs. SIS Mappings for (s,p) = (2,1)

In tab. 2, some delay data (Delay column) obtained with spice simulations are compared to the level depth (LD column). These gains are limited to the tested library (2,1) and do not reflect the real capabilities of the method. The main reason for that is the size of the circuits and the library. Larger circuit benchmarks and libraries with higher (s,p) constraints lead to even better mappings, as will be shown for the final version of this paper, with many more circuits, libraries and comparisons.

Table 2 – Circuit logical depth and delay simulation, Proposal vs. SIS Mappings for (s,p) = (2,1)


6. CONCLUSIONS
In this paper, a new approach to technology mapping for standard-cell generators was proposed. The method exploits properties of an n-ary tree description that allow many embedded decompositions to be explored dynamically, leading to better results than relying only on the first decomposition given. Comparisons with classic methods and solutions to the problems identified were also described. The implementation was also presented, together with experimental results that compare favorably with our expectations.

References
[1] P. Abouzeid, R. Leveugle, G. Saucier, and R. Jamier, “Logic synthesis for automatic layout”, Proceedings of the Euro ASIC’92, pp. 146-151, 1992.

[2] M. Berkelaar, J. Jess, “Technology mapping for standard-cell generators”, IEEE ICCAD ’88, Santa Clara, pp. 470-473, 1988.

[3] E. Detjens, G. Gannot, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, “Technology mapping in MIS”, IEEE ICCAD ’87, pp. 116-119, 1987.

[4] ELIS Tool Home Page: http://www.inf.ufrgs.br/lagarto

[5] K. Keutzer, “DAGON: technology binding and local optimization by DAG matching”, 24th ACM/IEEE conference proceedings on Design automation conference, June 28-July 01, Miami Beach, Florida, United States, pp. 341-347, 1987.

[6] Y. Kukimoto, R. K. Brayton, P. Sawkar, “Delay-optimal technology mapping by DAG covering”, Proceedings of the 35th annual conference on Design automation conference, San Francisco, California, United States, pp. 348-351, June 1998.

[7] E. Lehman, Y. Watanabe, J. Grodstein, H. Harkness, “Logic Decomposition During Technology Mapping”, IEEE ICCAD ’95, pp. 264-271, November 1995.

[8] F. Mailhot, G. DeMicheli, “Algorithms for technology mapping based on binary decision diagrams and on Boolean operations”, IEEE Transactions on CAD for IC and Systems, vol. 12 nº 5, pp. 599-620, 1993.

[9] SIS: A System for Sequential Circuits Synthesis (Logic Synthesis Tool). University of California, Berkeley, 1992.

[10] L. Stok, M. A. Iyer, A. J. Sullivan, “Wavefront technology mapping”, Proceedings of the conference on Design, automation and test in Europe, Munich, Germany, pp. 108-es, January 1999.

[11] M. Zhao, S. S. Sapatnekar, “A new structural pattern matching algorithm for technology mapping”, Proceedings of the 38th Conference on Design Automation, ACM Press, New York, NY, United States, pp. 371-376, June 2001.
