A Novel Approach to Compute Spatial Reuse in the Design of Custom Instructions

Nagaraju Pothineni [email protected]
Anshul Kumar [email protected]
Kolin Paul [email protected]

Department of Computer Science & Engineering, Indian Institute of Technology, Delhi, New Delhi, INDIA - 110016

ABSTRACT

In the automatic design of custom instruction set processors, there can be a very large set of potential custom instructions, from which a few instructions must be chosen, taking into account their spatial as well as temporal reuse and their cost. Using existing pattern matching techniques, finding the complete reuse of every identified pattern in the entire application would be very slow and may even be computationally infeasible. Due to this, existing selection methods employ pattern matching techniques at a very late stage of the selection process, on a small set of patterns, compromising the quality of the selected candidates. In this paper, we propose a method by which each pattern's reuse information can be derived at an early stage of the selection process even when there is a very large number of potential patterns. The novel contributions of this paper include a simple and efficient algorithm for finding all the isomorphic convex subgraphs (termed Recurring Pattern Information (RPI)) of the given application's Control Data Flow Graph (CDFG). The proposed technique is integrated into the estimation phase of Instruction Set Extension (ISE) automation. Experimental results show the efficiency of the proposed algorithm and demonstrate its utility in generating high quality custom instructions.

1. INTRODUCTION

In recent years, several commercial extensible processor platforms have become available in the market. Examples of these platforms include Tensilica Xtensa [1], Altera NIOS [2], MIPS CorExtend [3], etc. Using these platforms, application specific functionality can be introduced into the base processor to speed up the critical parts of applications. The designer specifies the application specific functionality in a high level language, and the platforms automate the integration of custom hardware units into the processor pipeline and generate the software tool kit for the custom processor. Over the last decade, several heuristic techniques [4] [5] [6] [7] have been proposed for automating the identification of the best possible set of Custom Functional Units (CFUs). The approach followed is to identify a set of legal computation patterns and then select a few of the most beneficial patterns based on cost-benefit analysis. The gain of a pattern mainly depends on a) the efficiency of its hardware implementation and b) its reuse. Reuse of a pattern may be due to its multiple instances in different parts of the application and due to the repeated execution of the same pattern instance at run time. We call the former spatial reuse and the latter temporal reuse.

[Figure 1 shows two data flow graphs built from * and + operations: Basic block 1 (Freq = 50), which contains one instance of pattern P1 and two instances of pattern P2, and Basic block 2 (Freq = 80), which contains a third instance of P2, together with the following table.]

Pattern | Gain | Complete Reuse      | Area (um^2) | Total Gain
P1      | 2    | 50                  | 19          | 100
P2      | 1    | 50 + 50 + 80 = 180  | 12          | 180

Figure 1: Motivating example for considering complete reuse
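The total gains in the table follow directly from multiplying each pattern's per-execution gain by the frequencies of the basic blocks containing its instances; as a worked check:

\[
\mathrm{TotalGain}(P1) = 2 \times 50 = 100,
\qquad
\mathrm{TotalGain}(P2) = 1 \times (50 + 50 + 80) = 180 .
\]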

To ensure good results, the area and performance estimates should be as accurate as possible. This implies that, while computing the gain of a pattern, its complete reuse should be taken into account. However, the current techniques that exhaustively identify all legal patterns do not take complete spatial reuse into account while selecting the patterns. Due to this, the estimated gain is often an underestimate. This has two main disadvantages: (1) underestimates influence the pattern selection and thus the final quality of the solution, and (2) the true speedup potential of patterns remains hidden. Our experimental results (in Section 5) clearly show that these two problems indeed occur in real benchmarks. Consider the two patterns P1 and P2 shown in Figure 1. Pattern P1 occurs exactly once, whereas pattern P2 occurs at three different places. If spatial reuse is not taken into account, then P1 would incorrectly appear to have the higher performance benefit. But when spatial reuse is taken into account, the true benefits of patterns P1 (100) and P2 (180) are unveiled and P2 is preferred over P1. Selecting P2 instead of P1 is beneficial not only from the performance point of view; P2 also has a smaller area than P1. In this paper, we present an efficient method for computing the spatial reuse of each identified pattern. The custom instruction selection phase utilizes the spatial reuse information for judging the usefulness of each potential candidate. Experimental results indicate up to 25% improvement in the speedup achieved over state of the art techniques.

2. RELATED WORK

Instruction extension methods can be classified into two categories. In the first [8], new instructions are created to encode frequently occurring parallel micro-ops or micro-op dependency chains. This kind of ISE requires little or no modification to the data path of the processor, but the microprogram is augmented for the new instruction. In the second category [4] [5] [6] [7] [9], a Custom Functional Unit (CFU) is tightly integrated into the data path of a base processor of the RISC/VLIW class.

Automatic Instruction Set Extension (ISE) methods can be divided into three stages: identification, cost-benefit analysis and selection. Broadly, custom instruction identification methods can be grouped into two categories: exhaustive and non-exhaustive heuristic techniques. Atasu et al. [7] and Yu et al. [10] exhaustively enumerate all the patterns satisfying the microarchitectural constraints imposed by the base processor. Exhaustive identification has the advantage of not missing any potential candidate, but the number of legal candidates can be very large, so pattern evaluation and selection must be very efficient to handle such a large number of patterns. In contrast to the exhaustive techniques, the techniques proposed in [5, 6, 9, 11, 12] identify a few patterns through incremental growing and/or combination of generated patterns guided by a heuristic. Due to the inherent greedy nature of these heuristics, the patterns identified in this manner are likely to be suboptimal. The scope of identification is usually a basic block. However, Arnold et al. [11] identify patterns from traces, hence the identified patterns may span multiple basic blocks. Using such patterns may require a lot of bookkeeping code to preserve the semantics, but the paper does not discuss how such patterns are actually used.

In [5, 6], where the identification pass generates only a few patterns, graph matching based techniques are used for obtaining the spatial reuse. The approach of [7], which generates millions of patterns, does not consider the reuse factor while evaluating the patterns. Standard graph matching techniques [13] are quite slow, because each identified pattern is treated independently even if there is a strong correlation among the patterns. Our approach calculates the spatial reuse of each pattern very efficiently through the technique of recurring pattern computation and can be used in combination with exhaustive pattern identification.

Finally, a few patterns that maximize (or minimize) the objective function under the given constraints, such as area and power, are selected. The selection problem is shown to be NP-complete even for the simple formulation of selecting the N best patterns [14], and remains so when more constraints are included. The non-exhaustive identification techniques [5, 6] formulate an ILP for custom instruction selection, but ILP based methods do not scale to a large number of patterns. The approach in [7] is essentially a greedy approach that iteratively selects the best pattern without taking the reuse factor into account. Our selection strategy judiciously selects the patterns taking their reuse into account.

3. ISE METHODOLOGY

Our automatic application specific ISE methodology, shown in Fig. 2, consists of three steps: (1) custom instruction identification, (2) cost-benefit analysis, and (3) custom instruction selection.

[Figure 2 shows the flow: pattern identification extracts the candidate patterns {P} from the application; recurring pattern computation produces the RPI; gain and area estimation feeds the cost-benefit analysis; and pattern selection outputs the selected patterns and the resulting speedup.]

Figure 2: Our ISE methodology

In the first step, all subgraphs of the given application's CDFG that satisfy the micro-architectural constraints imposed by a typical RISC processor are identified. We employed an algorithm similar to [7] for enumerating all the legal candidates. In the second step, the isomorphic instances of each identified pattern are found with the aid of the Recurring Pattern Computation (RPC) algorithm. RPC preprocesses the application and classifies the convex subgraphs into isomorphic equivalence classes. In addition to the spatial reuse, the latency and area of the hardware implementation are found in the second step. The latency of the hardware implementation is obtained by dividing the critical path delay by the base processor's clock period, and the area is obtained by summing the areas of the individual operations. In the final step, a set of N custom instructions that maximize the performance gain is selected. When an instance of a pattern p1 overlaps with an instance of another pattern p2, only one of p1 and p2 can be utilized in that portion of the code. The problem at hand is more complex than the code covering problem, well known in compiler theory, because we not only have to decide which instances each pattern will cover but also decide the best set of patterns. The pattern selection problem is shown to be NP-complete [14]. We employ a greedy heuristic that iteratively selects the best of the remaining patterns. After selecting a pattern p, the instances of the remaining patterns that overlap with p are discarded. Let P1, P2, . . . , Pk be the patterns finally selected. The speedup of application A is calculated as follows:

Speedup = SW(A) / ( SW(A) − Σ_{1 ≤ i ≤ k} Gain(P_i) )    (1)

where SW(A) is the number of execution cycles of application A on the base RISC processor.
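To make the selection step concrete, the following is a minimal sketch of the greedy heuristic and the speedup computation of Equation (1). It assumes a hypothetical representation of a pattern as a per-execution gain together with its instances, where each instance is a set of CDFG node ids paired with the execution frequency of its basic block; it is not the authors' actual implementation.

def greedy_select(patterns, n_select, sw_cycles):
    # patterns: list of (gain_per_exec, [(node_id_set, exec_freq), ...]) tuples (assumed layout)
    covered = set()                    # CDFG nodes already claimed by a selected instance
    selected, total_gain = [], 0

    def live_gain(pat):
        gain, insts = pat
        # an instance survives only if none of its nodes overlap an already selected instance
        return sum(freq * gain for nodes, freq in insts if not (nodes & covered))

    remaining = list(patterns)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=live_gain)       # pick the best of the remaining patterns
        remaining.remove(best)
        gain, insts = best
        for nodes, freq in insts:
            if not (nodes & covered):              # surviving instance: claim its nodes
                covered |= nodes
                total_gain += freq * gain
        selected.append(best)
    speedup = sw_cycles / (sw_cycles - total_gain)  # Equation (1)
    return selected, speedup

Instances of unselected patterns that overlap the covered nodes simply contribute nothing to live_gain, which mirrors the discarding step described above.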

4. RECURRING PATTERN COMPUTATION

Let G_1, G_2, . . . , G_N be the Data Flow Graphs (DFGs) corresponding to the N basic blocks of the application. Each graph G_i is associated with the triple (V_i, E_i, t_i), where V_i is the set of operation nodes, E_i is the set of directed data flow edges and t_i specifies the operation class of each node. Let S_i denote the set of all convex subgraphs of G_i and let S be the set of all convex subgraphs of the given graphs, i.e., S = S_1 ∪ S_2 ∪ . . . ∪ S_N. Let S^i ⊂ S denote the set of all convex subgraphs having i nodes.

Problem definition: Find all the isomorphic equivalence classes of S (where S is the set of convex subgraphs of G_1, G_2, . . . , G_N) which consist of at least two elements. Let Υ(S^k) denote the set of all isomorphic equivalence classes of S^k and let Ψ(g) denote the equivalence class containing g. We are interested in finding all equivalence classes H ∈ Υ(S) such that |H| > 1.

4.1 Our Algorithm

We partition S^1, S^2, . . . in that order, and while partitioning S^k, the equivalence class information that has already been computed for S^{k-1} is effectively utilized. Our overall algorithm is depicted in Algorithm 1.

Algorithm 1 Recurring Pattern Computation
1: Let G_1, G_2, . . . , G_N be the N given DDGs
2: Partition S^1
3: R^1 = { g ∈ S^1 | |Ψ[g]| > 1 }
4: k = 2
5: while |R^{k-1}| > 1 do
6:   Step 1: PR^k = Potential-Recurring-Subgraphs(R^{k-1})
7:   Step 2: Compute-Equivalence-Classes(PR^k, Υ(R^{k-1}))
8:   R^k = { g ∈ PR^k | |Ψ[g]| > 1 }
9:   k = k + 1
10: end while

To start the process, S^1 is partitioned as follows. All nodes with the same type must be placed into the same equivalence class; hence, each possible operation type defines an equivalence class. We scan each node of the given DDGs and place it into the equivalence class defined by its type. After placing each node into its corresponding equivalence class, the set of single-node recurring subgraphs, R^1, is initialized to the set of nodes whose equivalence class size is greater than one.

After partitioning S^1, in each iteration of the loop (lines 5-10), S^k is partitioned, for k = 2, 3, . . . in that order. In our approach, Υ(R^k) is efficiently calculated from Υ(R^{k-1}). Our algorithm has two major steps for finding Υ(R^k). In the first step, instead of generating and evaluating all the k-node graphs, the function Potential-Recurring-Subgraphs avoids the generation and evaluation of non-convex and non-recurring subgraphs based on R^{k-1}. The inference mechanism employed in pruning these two categories is described in detail in the following subsection. In the second step, the function Compute-Equivalence-Classes finds the equivalence classes of the generated potential recurring subgraphs. It does not compare every pair of generated potential recurring subgraphs; rather, it compares two graphs g1 and g2 only if a (k-1)-node convex subgraph of g1 is isomorphic to a (k-1)-node convex subgraph of g2. A detailed description of which graphs are compared is presented in Section 4.1.2. After finding the equivalence classes of PR^k, the actual recurring subgraphs, R^k, whose equivalence class size is more than one, are preserved and the non-recurring subgraphs in PR^k are discarded. Since the set of potential recurring subgraphs PR^k is computed based on R^{k-1}, the loop terminates once there are no more recurring subgraphs of size (k-1). In the following subsections, we describe in detail the key ideas of partitioning S^k using the partitioning information about S^{k-1}.
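As an illustration only, the following is a minimal Python-style sketch of Algorithm 1. It assumes a hypothetical DDG interface with nodes() and a per-node op_type attribute; the two helper functions passed in stand in for Potential-Recurring-Subgraphs and Compute-Equivalence-Classes, which are described in Sections 4.1.1 and 4.1.2.

from collections import defaultdict

def partition_s1(ddgs):
    # Partition the single-node subgraphs by operation type (line 2 of Algorithm 1).
    classes = defaultdict(list)              # operation type -> single-node subgraphs
    for ddg in ddgs:
        for node in ddg.nodes():             # nodes assumed hashable
            classes[node.op_type].append(frozenset([node]))
    # R^1: keep only the classes with at least two members (line 3)
    return [members for members in classes.values() if len(members) > 1]

def recurring_pattern_computation(ddgs, potential_recurring, compute_classes):
    prev_classes = partition_s1(ddgs)                        # equivalence classes over R^1
    all_classes = list(prev_classes)
    while sum(len(c) for c in prev_classes) > 1:             # |R^{k-1}| > 1
        candidates = potential_recurring(prev_classes)       # Step 1: prune using R^{k-1}
        classes_k = compute_classes(candidates, prev_classes)  # Step 2: incremental isomorphism
        prev_classes = [c for c in classes_k if len(c) > 1]  # keep only recurring subgraphs R^k
        all_classes.extend(prev_classes)
    return all_classes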

4.1.1 Potential Recurring Subgraphs (PR^k)

Let us classify all the k-node graphs, denoted A^k, into three categories: non-convex, convex recurring (R^k) and convex non-recurring (NR^k). From practical examples, we observed that the majority of the graphs in A^k are non-convex and, among the convex ones, the majority are non-recurring. Since we are not interested in the non-convex and non-recurring subgraphs, our approach uses a heuristic which prunes most of these kinds of graphs based on the computed equivalence class information of S^{k-1}. The graphs that remain after pruning the above mentioned two categories from A^k are termed potential recurring subgraphs (PR^k). PR^k includes all the recurring subgraphs (i.e., R^k ⊂ PR^k) and may include some non-recurring subgraphs as well, which are eliminated after partitioning PR^k into equivalence classes.

A graph g^k ∈ A^k can be inferred to be non-recurring if it contains a non-recurring convex subgraph. This criterion can, in principle, be verified by checking whether every (k-1)-node convex subgraph of g^k is part of R^{k-1}. We are interested in diligently enumerating all the graphs in S^k that satisfy this criterion. Any potential recurring subgraph g^k ∈ PR^k can be seen as a single-node extension of a recurring subgraph in R^{k-1}. Our potential recurring subgraph generation method is therefore to take a recurring subgraph in R^{k-1} and extend it with an appropriate node such that the resulting graph qualifies under our definition of a potential recurring subgraph. Let g^{k-1} ∈ R^{k-1} be a subgraph of the DDG G_i. The single-node extension of g^{k-1} with a node v ∈ G_i, v ∉ g^{k-1}, is denoted as g^{k-1} ⊕ v.

Before proceeding further, we need to ensure that the proposed method does not generate the same graph more than once. This can be achieved by assigning unique labels to each node and extending each recurring subgraph only with nodes whose label is higher than the labels of all its nodes. However, arbitrarily assigning the labels may necessitate the generation of non-convex subgraphs on the way to the generation of convex subgraphs later. We assign the labels in such a way that the label of a node is higher than the labels of all its predecessors. This labeling is feasible since each DDG G_i is acyclic. This labeling strategy guarantees that no non-convex graph needs to be generated on the way to the generation of a convex subgraph later [7].

Consider a graph g^k = g^{k-1} ⊕ v, v ∉ g^{k-1}, such that Lab(v) > max_{u ∈ g^{k-1}} Lab(u), i.e., g^k is an extension of g^{k-1} along node v. We define all the graphs obtained by deleting a single primary input node (a node without any predecessor) of g^{k-1} from g^k as the descendants of g^{k-1} along the node v. Comparing g^{k-1} with its descendants along node v, we have the following two important observations: (1) node v is present in each descendant but is absent in g^{k-1}; (2) exactly one primary input node of g^{k-1} is absent in each descendant, and the descendants differ based on which primary input node of g^{k-1} is absent. Let Desc(g^{k-1}, v) denote the set of descendants of g^{k-1} along node v and let PI(g) denote the set of primary input nodes of graph g. It can be observed that, for any node v satisfying the labeling constraint, there are exactly |PI(g^{k-1})| descendants of g^{k-1} along node v.
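The following is a minimal sketch, under the assumption that a DDG exposes nodes(), preds(n) and succs(n) and that subgraphs are plain frozensets of nodes, of the two ingredients just described: a labeling in which every node's label exceeds those of its predecessors (a topological numbering, feasible because the DDG is acyclic), and the computation of Desc(g^{k-1}, v). The interpretation of a primary input as a node with no predecessor inside the subgraph is an assumption of this sketch.

from collections import deque

def assign_labels(ddg):
    # Topological numbering: every node's label is larger than all of its predecessors' labels.
    indeg = {n: len(ddg.preds(n)) for n in ddg.nodes()}
    ready = deque(n for n, d in indeg.items() if d == 0)
    labels, next_label = {}, 0
    while ready:
        n = ready.popleft()
        labels[n] = next_label
        next_label += 1
        for s in ddg.succs(n):
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return labels

def descendants(g_prev, v, preds):
    # Desc(g^{k-1}, v): delete one primary input node of g^{k-1} from g^k = g^{k-1} + {v}.
    g_k = g_prev | {v}
    # primary inputs of the subgraph: nodes with no predecessor inside g_prev (assumed interpretation)
    primary_inputs = {u for u in g_prev if not (set(preds(u)) & g_prev)}
    return {frozenset(g_k - {pi}) for pi in primary_inputs}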

Convexity of g^k: We provide the following theorem for checking whether g^k is convex or not. The theorem relates the convexity of g^k to the convexity of some of its subgraphs.

Theorem 1: Let g^k = g^{k-1} ⊕ v, where g^{k-1} ∈ S^{k-1} and Lab(v) > max_{u ∈ g^{k-1}} Lab(u). If |PI(g^{k-1})| > 1, then g^k ∈ S^k if and only if Desc(g^{k-1}, v) ⊂ S^{k-1}.

Proof: The proof of this theorem is outlined in [15].

Considering the criterion for a potential recurring subgraph, the following corollary is a straightforward consequence of the above theorem.

Corollary: Let g^k = g^{k-1} ⊕ v, where g^{k-1} ∈ R^{k-1} and Lab(v) > max_{u ∈ g^{k-1}} Lab(u). Then g^k ∈ R^k only if Desc(g^{k-1}, v) ⊂ R^{k-1}.

Two graphs g1^k, g2^k ∈ S^k are said to be mergable if g1^k ∪ g2^k ∈ S^{k+1}, and we call g1^k ∪ g2^k the child of g1^k and g2^k. A set of graphs is said to be mergable if every pair of graphs in the set is mergable and the child of each pair is identical. It can be proved that the set containing g^{k-1} and all the descendants of g^{k-1} along node v is a mergable set and that g^k = g^{k-1} ⊕ v is its child. In other words, a graph g and all its descendants along a node v form a mergable set, which we denote MSET(g, v). Graph g^k is a potential recurring subgraph only if MSET(g^{k-1}, v) ⊂ R^{k-1}. It can be observed that any two graphs in MSET(g^{k-1}, v) have k-2 nodes in common and that merging any two of them gives g^k. Another point to note is that a graph may be a member of several mergable sets and that, although the size of MSET(g^{k-1}, v) may be more than two, given any two members of MSET(g^{k-1}, v) the other members can be derived. This sets an upper bound on the number of potential recurring subgraphs: |PR^k| is at most |R^{k-1}|^2.

The potential recurring subgraph generation can be summarized as follows: for each graph g^{k-1} ∈ R^{k-1}, its extension along node v (i.e., g^k = g^{k-1} ⊕ v) such that Lab(v) > max_{u ∈ g^{k-1}} Lab(u) is potentially recurring iff all descendants of g^{k-1} along node v are in R^{k-1}.

Algorithm 2 Potential-Recurring-Subgraphs(R^{k-1})
1: PR^k = φ
2: for each g^{k-1} ∈ R^{k-1} loop
3:   Let g^{k-1} ⊂ G_i
4:   for each v ∈ G_i − g^{k-1} | Lab(v) > Lab(u), ∀u ∈ g^{k-1} loop
5:     if Desc(g^{k-1}, v) ⊂ R^{k-1} then
6:       PR^k = PR^k ∪ { g^{k-1} ⊕ v }
7:     end if
8:   end loop
9: end loop
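A minimal Python-style sketch of this generation step is given below. It assumes subgraphs are frozensets of hashable nodes and that labels, preds and a per-subgraph owning-DDG node set are available through hypothetical helpers; it omits the single-primary-input convexity check discussed next.

def potential_recurring_subgraphs(R_prev, ddg_nodes_of, labels, preds):
    # R_prev: set of frozensets, the recurring (k-1)-node subgraphs R^{k-1}
    def desc(g_prev, v):
        # Desc(g^{k-1}, v): drop one primary input (no predecessor inside g_prev) from g^{k-1} + {v}
        g_k = g_prev | {v}
        pis = [u for u in g_prev if not (set(preds(u)) & g_prev)]
        return [frozenset(g_k - {pi}) for pi in pis]

    PR_k = set()
    for g_prev in R_prev:
        nodes = ddg_nodes_of(g_prev)                 # node set of the DDG containing g_prev
        max_lab = max(labels[u] for u in g_prev)
        for v in nodes - g_prev:
            if labels[v] <= max_lab:                 # labeling constraint avoids duplicate generation
                continue
            if all(d in R_prev for d in desc(g_prev, v)):
                PR_k.add(frozenset(g_prev | {v}))    # child of the mergable set MSET(g^{k-1}, v)
    return PR_k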

Algorithm 2 shows the steps for the generation of the potential recurring subgraphs PR^k. The algorithm finds all the different mergable sets in R^{k-1} and adds the child of each mergable set to the set of potentially recurring subgraphs PR^k. In this algorithm, an explicit convexity check is necessary only when g^{k-1} has exactly one primary input node. In this case, it is sufficient to check whether there is a path from the primary input node of g^{k-1} to an immediate predecessor of v that is not in g^{k-1}; if such a path exists then g^{k-1} ⊕ v is not convex, otherwise it is convex. To simplify this task, we compute, a priori, the path matrix for each of the given DDGs, and since each node has at most two predecessors, checking for convexity requires a constant number of operations as opposed to the general convexity checking time of O(|V_i| + |E_i|). Hence, our algorithm almost completely eliminates the time spent in convexity checking.
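As an illustration of this check, the sketch below precomputes a reachability (path) matrix per DDG and then tests the single-primary-input case with a constant number of set lookups; the succs/preds accessors are assumptions of the sketch, not part of the paper's code.

def build_path_matrix(nodes, succs):
    # reach[n] = set of nodes reachable from n by a non-empty path (simple DFS per node)
    reach = {}
    for src in nodes:
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            for s in succs(n):
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
        reach[src] = seen
    return reach

def single_pi_extension_is_convex(g_prev, v, pi, preds, reach):
    # g^{k-1} has exactly one primary input `pi`; g^{k-1} + {v} is non-convex iff some
    # immediate predecessor of v outside g^{k-1} is reachable from `pi`.
    outside_preds = [p for p in preds(v) if p not in g_prev]
    return not any(p in reach[pi] for p in outside_preds)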

4.1.2 Incremental Isomorphism Checking

The function Compute-Equivalence-Classes computes the equivalence classes of the generated potential recurring subgraphs PR^k. In our algorithm, the isomorphism checking of graphs of size k is performed using the knowledge of isomorphism between subgraphs of size k-1. Consider a graph g1^k = g1^{k-1} ⊕ v. Graph g1^k can only be isomorphic to a graph containing an isomorphic image of g1^{k-1}. Let g2^{k-1} be a graph isomorphic to g1^{k-1}. A graph g2^k = g2^{k-1} ∪ {v'} ∈ PR^k is isomorphic to g1^k if type(v) = type(v') and there exists an isomorphic mapping (there can be many) between g1^{k-1} and g2^{k-1} such that the predecessors of v and v' match. The matching criterion depends on whether the operation type of node v is commutative or not. If v is commutative, no distinction is made between its predecessors and it is sufficient that all the predecessors of v are mapped onto all the predecessors of v'. If v is not a commutative operation, then the left predecessor of v must map onto the left predecessor of v', and similarly for the right predecessor. If, in the search for isomorphic graphs of g1^k, we find one such isomorphic graph g2^k, the potential subgraph g1^k is added to the equivalence class of which g2^k is part. Since isomorphism is an equivalence relation, g1^k is guaranteed to be isomorphic to every subgraph in the equivalence class of which g2^k is part. If no subgraph among the generated potential subgraphs is found to be isomorphic to g1^k, a new equivalence class is created which contains only the subgraph g1^k.
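The core of this test can be sketched as follows, assuming per-node accessors op_type(n), preds(n) (at most two, ordered left/right) and a predicate commutative(t), and assuming that `mapping` is one isomorphic mapping from g1^{k-1} onto g2^{k-1}; only predecessors lying inside the (k-1)-node subgraphs are compared in this simplified version.

def extensions_match(v1, v2, g1_prev, g2_prev, mapping, op_type, preds, commutative):
    # g1^{k-1} + {v1} and g2^{k-1} + {v2} can be isomorphic via `mapping` only if:
    if op_type(v1) != op_type(v2):
        return False
    # predecessors of v1 inside g1^{k-1}, translated into g2^{k-1} through the mapping
    p1 = [mapping[p] for p in preds(v1) if p in g1_prev]
    p2 = [p for p in preds(v2) if p in g2_prev]
    if commutative(op_type(v1)):
        # commutative op: predecessors must match as an unordered collection
        return len(p1) == len(p2) and set(p1) == set(p2)
    # non-commutative op: left must map to left and right to right
    return p1 == p2

In practice this check would be tried against every known mapping between the two (k-1)-node subgraphs, stopping at the first one that matches.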

5. EXPERIMENTS & RESULTS

The proposed ISE methodology was implemented in the Trimaran retargetable compiler framework. The front end of the compiler performs several classical optimizations on the given application program. The ISE pass is inserted in the back end of the compiler, immediately at the point where the control and data flow analysis is complete. We have taken the same approach as [7] for the estimation of a pattern's software and hardware latencies. The latency of the patterns is derived for the UMC 0.18 micron CMOS process using the Synopsys Design Compiler and is normalized by the latency of a 16-bit MAC unit to convert it into processor cycles. All the timings reported were collected on a PIV 3.0 GHz machine with 1 GB RAM.

The proposed ISE methodology was applied to five benchmark applications taken from the Mediabench [16] and Cryptography [17] suites. The selected benchmarks are representative of applications having a rich set of fixed point arithmetic/logical operations. The kernels of these benchmarks consist of basic blocks having more than 50 operations that can be safely implemented on a custom functional unit. For each benchmark, the number of patterns extracted in the identification step with the I/O constraint (4,3) and the time taken for identification are shown in Table 3(a). The number of identified patterns is of the order of 10^6.

A summary of the performance of our recurring pattern computation is shown in Table 3(b), and a detailed analysis of the performance of our algorithm is plotted in Fig. 3(c)-(g). For each benchmark, Table 3(b) shows (a) the number of recurring subgraphs (Nr), (b) the number of equivalence classes (Ne), (c) the average equivalence class size EC_SIZE = Nr/Ne, (d) the number of potential recurring subgraphs (Np), (e) the prediction accuracy Pa = Nr/Np, (f) the number of equivalence checks (Nec), (g) the average number of equivalence checks ECavg = Nec/Np, and (h) the time taken for the recurring pattern computation. The prediction accuracy Pa indicates the non-recurring subgraph pruning efficiency; a prediction accuracy of 1 indicates that every non-recurring subgraph is pruned. The values of prediction accuracy, average number of equivalence checks, recurring pattern distribution and average equivalence class size for all pattern sizes are plotted in Fig. 3(c)-(g). The most notable observations from these plots are (a) the prediction accuracy Pa is above 0.95 for all the studied benchmarks, (b) the number of equivalence checks for k-node patterns is less than the average equivalence class size of k-node patterns, and (c) the recurring pattern distribution follows a bell-shaped curve: the recurrence factor grows and falls steeply and then reaches a saturation level, peaking in the pattern size range of about 2 to 4. Interestingly, the time taken for the recurring pattern computation is much smaller (by around two orders of magnitude) than the pattern identification time. This can be attributed to the fact that the number of recurring patterns is smaller than the number of identified patterns and that the pattern identification algorithm is a brute-force technique whose pruning is based only on the number of inputs/outputs.

The impact of incorporating the spatial reuse information into the design space exploration is shown in Fig. 3(h)-(l). For the speedup studies, we finally selected four patterns that give the highest possible performance gain. For different I/O constraints ((m,n) indicates m read and n write register ports), the speedups achieved for the two cases, (a) RPI is used and (b) RPI is not used, are shown in Fig. 3(h)-(l). Introducing the spatial reuse information into the gain estimation changes which patterns are finally selected. When the spatial reuse information is not used, the greedy selection technique favours larger patterns. When the spatial reuse information is used, average-sized patterns with a high recurrence factor get selected. For instance, in the IDCT benchmark, a pattern with 3 inputs and 2 outputs has a reuse factor of 12 and a single-instance gain of 3. When the reuse information is not used, this pattern does not get selected. With the inclusion of its spatial reuse, the gain of this pattern dominates the gains of the other patterns and it finally gets selected. Selecting this pattern improves the speedup by around 15% at the (4,2) I/O constraint. The average speedup improvement using RPI is around 10%, and it is up to 25% for some I/O constraints.

6. CONCLUSION

The paper presents a simple and fast algorithm for identifying all the convex isomorphic subgraphs in the entire application, along with an efficient method of organizing the recurring patterns. Both of these enable the application of recurrence information at an early stage of the selection process, even when there is a very large number of identified candidates, thus achieving better quality solutions, as observed from the results. The proposed method can be extended to group patterns that match approximately. Approximately matched patterns can be synthesized, through resource sharing, into a circuit whose area is smaller than the sum of the individual areas of the patterns. We would like to perform area-conscious exploration that takes into account the possibility of resource sharing, as suggested by Brisk et al. [18] and several others.

7. REFERENCES

[1] R. E. Gonzalez, "Xtensa: A configurable and extensible processor," IEEE Micro, vol. 20, no. 2, pp. 60-70, 2000.
[2] D. Lau, O. Pritchard, and P. Molson, "Automated generation of hardware accelerators with direct memory access from ANSI/ISO standard C functions," in FCCM '06: Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 45-56, 2006.
[3] T. R. Halfhill, "MIPS embraces configurable technology," March 2003.
[4] D. Goodwin and D. Petkov, "Automatic generation of application specific processors," in CASES '03: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pp. 137-147, 2003.
[5] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, "Synthesis of custom processors based on extensible platforms," in ICCAD '02: Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design, pp. 641-648, 2002.
[6] N. Clark, H. Zhong, and S. Mahlke, "Processor acceleration through automated instruction set customization," in MICRO 36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, p. 129, 2003.
[7] K. Atasu, L. Pozzi, and P. Ienne, "Automatic application-specific instruction-set extensions under microarchitectural constraints," Int. J. Parallel Program., vol. 31, no. 6, pp. 411-428, 2003.
[8] H. Choi, S. H. Hwang, C.-M. Kyung, and I.-C. Park, "Synthesis of application specific instructions for embedded DSP software," in ICCAD '98: Proceedings of the 1998 IEEE/ACM International Conference on Computer-Aided Design, pp. 665-671, 1998.
[9] R. Kastner, A. Kaplan, S. O. Memik, and E. Bozorgzadeh, "Instruction generation for hybrid reconfigurable systems," ACM Trans. Des. Autom. Electron. Syst., vol. 7, no. 4, pp. 605-627, 2002.
[10] P. Yu and T. Mitra, "Scalable custom instructions identification for instruction-set extensible processors," in CASES '04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 69-78, 2004.
[11] M. Arnold and H. Corporaal, "Designing domain-specific processors," in CODES '01: Proceedings of the Ninth International Symposium on Hardware/Software Codesign, pp. 61-66, 2001.
[12] N. Pothineni, A. Kumar, and K. Paul, "Application specific datapath extension with distributed I/O functional units," in VLSI Design, pp. 551-558, 2007.
[13] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, "A (sub)graph isomorphism algorithm for matching large graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 10, pp. 1367-1372, 2004.
[14] Y. Guo, G. J. Smit, H. Broersma, and P. M. Heysters, "A graph covering algorithm for a coarse grain reconfigurable system," in LCTES '03: Proceedings of the 2003 ACM SIGPLAN Conference on Language, Compiler, and Tool for Embedded Systems, pp. 199-208, 2003.
[15] N. Pothineni, A. Kumar, and K. Paul, "Exhaustive enumeration of legal custom instructions for extensible processors," Tech. Rep. TR 07/2007, Department of Computer Science, IIT Delhi, July 2007.
[16] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "Mediabench: A tool for evaluating and synthesizing multimedia and communications systems," in International Symposium on Microarchitecture, pp. 330-335, 1997.
[17] T. R. Halfhill, "EEMBC releases first benchmarks," May 2000.
[18] P. Brisk, A. Kaplan, and M. Sarrafzadeh, "Area-efficient instruction set synthesis for reconfigurable system-on-chip designs," in DAC '04: Proceedings of the 41st Annual Conference on Design Automation, pp. 395-400, 2004.

Benchmark | No. of Patterns
ADPCM     | 7760
DES       | 722831
COMPRESS  | 76222
IDCT      | 289727
SHA       | 1266867
(a) Pattern identification step

Benchmark | Nr     | Ne   | EC_SIZE | Np     | Pa   | Nec      | ECavg | Time
ADPCM     | 7015   | 1044 | 7       | 7375   | 0.95 | 79755    | 11    | 0.33s
DES       | 28288  | 706  | 40      | 28353  | 0.97 | 135716   | 5     | 1.57s
COMPRESS  | 16032  | 1018 | 16      | 16095  | 0.99 | 221763   | 14    | 0.84s
IDCT      | 97613  | 2968 | 33      | 97828  | 0.99 | 1047373  | 11    | 7.50s
SHA       | 827232 | 6787 | 121     | 816931 | 0.98 | 21628741 | 26    | 89.2s
(b) Performance of our RPI algorithm

[Plots (c) ADPCM, (d) DES, (e) COMPRESS, (f) IDCT and (g) SHA show, versus pattern size, the prediction accuracy, the recurring pattern distribution, the average number of isomorphism checks and the average equivalence class size. Plots (h) ADPCM, (i) DES, (j) COMPRESS, (k) IDCT and (l) SHA show the speedups achieved with and without RPI for different I/O constraints.]

Figure 3: RPI generation algorithm's performance and its impact on ISE