
Page 551-556

Application Specific Datapath Extension with Distributed I/O Functional Units

Nagaraju Pothineni

Anshul Kumar

Kolin Paul

Department of Computer Science Indian Institute of Technology, Delhi Email : {nagaraju,anshul,kolin}@cse.iitd.ernet.in

Abstract— The performance of an application can be improved by augmenting the processor with Application specific Functional Units (AFUs). Usually a cluster of operations identified from the application forms the behavior of an AFU. Several researchers have studied the impact of Input and Output (I/O) constraints for a legal operation cluster on the overall achievable speedup. The general observation is that the speedup potential grows with the relaxation of I/O constraints. Going further, in this paper we investigate the speedup potential of AFUs in the absence of I/O constraints. The design challenge in the absence of I/O constraints is addressed in a very practical manner, through the identification of maximal convex subgraphs. Usually the available register ports are few, but the number of inputs/outputs of the identified patterns is likely to be large. We solve the register port limitation by the design of distributed I/O functional units, in which the operands are communicated in multiple cycles. The experimental results show that selection of maximal clusters achieves on average 50% higher speedup than selecting I/O constrained operation clusters. Also, our identification algorithm runs two to three orders of magnitude faster than an exhaustive identification approach.

I. INTRODUCTION

Nowadays it is common to find Application Specific Instruction Processors (ASIPs) in media as well as mobile appliances. In designing ASIPs it is beneficial to choose custom instructions that encode large clusters of operations (patterns) and execute on AFUs. Such AFUs can better exploit hardware optimizations such as operator chaining and operation-level parallelism, thus accelerating their patterns' computations by several factors. Instruction Set Extension (ISE) automation methods have been evolving for more than a decade, and state-of-the-art techniques [1] [2] [3] [4] [5] can speed up the application around 2 to 5 times when enough register ports are available. The typical steps in the automation of ISE are 1) base processor selection, 2) custom instruction identification, 3) custom instruction selection and 4) generation of the extended processor's hardware model and its associated software tool chain. In the recent past, several extensible platforms that automate step 4 of ISE have emerged. As manual engineering of step 4 is not practical, using an extensible platform for this step is inevitable. For instance, Tensilica [6] has a configurable/extendible RISC core; extensions can be specified in a Verilog-like proprietary language known as TIE, and the VHDL RTL model of the extended processor and the associated software tool chain are generated automatically. The designer can now focus fully on deciding the optimal set of custom instructions.

The micro-architecture of the chosen base processor invariably restricts the patterns that can be implemented on an AFU. Extending a standard RISC processor imposes the following constraints:

a) Load/Store: A RISC processor is based on a Load/Store architecture, and an FU cannot directly access memory. Hence, an AFU pattern should not contain load/store operations that need to access memory.

b) Convexity: In a legal pattern P, none of the input operands may depend on an output of the pattern, i.e. if all the operations of P are collapsed into a single operation in the basic block BB containing P, the resulting data dependence graph (DDG) of BB should not contain cycles.

c) Input/Output: In a standard RISC architecture, the inputs and outputs of an FU are communicated at the beginning and at the completion of its execution respectively. This directly implies that the numbers of inputs and outputs of an AFU are constrained by the numbers of register read and write ports respectively.

Among the above three constraints, the first two are immutable, whereas the Input/Output (I/O) constraint can be configured by the designer to suit the application needs. Several researchers have studied the impact of I/O constraints on the overall achievable speedup. The common observation is that the speedup potential grows with the relaxation of I/O constraints. We go further and investigate the speedup potential in the absence of I/O constraints. In the absence of I/O constraints, exhaustive identification of every legal pattern is computationally infeasible due to space and time limitations. We address this design challenge in a very practical manner, through the identification of maximal convex subgraphs (Convex MaxMIMOs). Experimental results show that customizing with Convex MaxMIMOs speeds up the application 50% (on average) more than customizing with I/O constrained patterns.
The convex MaxMIMOs are likely to have a large number of inputs and outputs (as many as 5 inputs and 5 outputs in some benchmarks). Integrating AFUs with such a large number of inputs and outputs into a Rigid I/O timeshape RISC requires that the processor's register file be equipped with that many read and write ports. Having a large number of register ports might benefit the custom instruction patterns

Conference Proceedings: 20th VLSI Design - 6th Embedded Systems, January 2007. Copyright © 2007 by The Institute of Electrical and Electronics Engineers, Inc. All rights Reserved.


but causes an increase in the delays and energy consumption of basic RISC instructions [7], which reduces the effectiveness of ISE. The extensible platform may not even support altering the register ports of the base processor. In that case, the full benefit of ISE that is achievable with convex MaxMIMOs cannot be realized. We propose the design of distributed I/O timeshape AFUs as a means of overcoming the above disadvantages and accommodating AFUs without I/O constraints. Our ISE methodology recognizes the limitation on the register ports and estimates the gain of each pattern while timeshaping its inputs and outputs. Although there is an overhead in transferring the operands, experimental results show that the speedup achieved with just (2,1) register ports and flexible I/O timeshape AFUs nearly matches the speedup achievable with enough ports to support any Rigid I/O AFU.

The rest of the paper is organized as follows. Related work is presented in section 2. The overall ISE methodology is described in section 3. Identification of maximal patterns is presented in sections 4 and 5. Selection of patterns is detailed in section 6. Finally, the paper concludes after showing the experimental results in section 7.

II. RELATED WORK

The instruction set of a processor is extended mainly in two different ways. In the first [8], new instructions are added to encode frequently occurring parallelizable micro-operations or micro-op chains. This method of ISE reduces code size as well as the execution time of the application. The datapath of the processor is either unchanged or modified slightly, but the microcode corresponding to the new instruction is added to the micro-program memory. The achievable speedup is limited by the available parallelism in the processor computations. This method is generally applied to CISC-like architectures.
In the second method of ISE [1]–[5], AFUs are attached to implement coarse grain operation clusters. As the critical sections of the application are implemented on a custom hardware unit, this method of ISE can better exploit the available parallelism in the application and can offer higher levels of speedup. This method of ISE is mostly applied to RISC/VLIW processors. Our work also falls in this category of ISE. Atasu et al. [4] exhaustively identify all convex patterns satisfying the given I/O constraints. Usually the number of legal patterns grows exponentially with the relaxation of I/O constraints, so this method does not scale to large numbers of inputs/outputs. Kastner et al. [5] identify patterns based on recurrence frequency without imposing I/O constraints. This method tends to be biased towards smaller patterns due to their high recurrence frequency. Arnold et al. [9] build the patterns iteratively by combining currently identified patterns. Biswas et al. [10] identify patterns that may include load/store operations from a read-only memory. All these methods invariably assume the base processor to be a standard RISC processor with Rigid I/O timeshape. The observation common to all these methods is that the speedup potential increases with the relaxation of I/O constraints.

A practical way of realizing the speedup potential at relaxed I/O constraints, without having to add a large number of register ports, is presented by Pozzi et al. [11]. They overcome the register port limitation by designing distributed I/O functional units. We follow the same technique and tune the ISE methodology to exploit and support this feature. When a flexible I/O timeshape of AFUs is supported, the number of inputs/outputs is not a constraint for a legal pattern. To handle the large number of design options for custom instructions, Pozzi et al. [11] suggest gradual relaxation of temporarily imposed I/O constraints. Although this approach can give optimal results, it does not scale to large benchmarks. Other identification techniques, e.g. [2]–[4], are fast and try to identify a few highly beneficial patterns by iteratively growing the pattern in the most promising direction. However, the generated patterns cannot be guaranteed to be locally best due to the greedy nature of the strategy. Our approach, in contrast, identifies all maximal gain patterns in an efficient manner.

The gain of a pattern can be estimated by scheduling the pattern under the resource constraints of register ports. Resource constrained scheduling is addressed in the context of high level synthesis [12], and optimal resource constrained scheduling is shown to be NP-complete. There exist rapid estimation methods [13] that approximate the schedule length to a reasonable accuracy. Optimal instruction selection is shown to be NP-complete [14]. In the literature, there exist greedy techniques [3], [4] and optimal techniques based on ILP [2] for instruction selection. The instructions are selected under constraints on area and/or maximum custom instruction count.
Some selection methods [3], [4] do not modify the clock of the base processor, whereas [2] considers the optimization of the processor clock to minimize the clock slack for AFUs.

III. METHODOLOGY

Given an application and an extendible base processor that supports FUs with flexible I/O timeshape, we are interested in finding the instruction extensions to the base processor that result in the minimal execution time of that application. The steps involved in arriving at the best set of custom instructions are 1) potential Custom Instruction (CI) pattern identification, 2) potential custom instruction pattern I/O timeshaping and 3) custom instruction selection.

A. Potential CI Identification

In the absence of I/O constraints, any convex subgraph extracted from a basic block of the application and consisting only of nodes of Allowed type is a legal pattern and can be implemented as a custom instruction. However, the number of such legal patterns is exponential in the number of nodes in the basic block. One pragmatic approach to identifying the custom instruction patterns is to run an I/O constrained pattern identifier such as [4] repeatedly, gradually relaxing the I/O constraints as long as the number of generated patterns is manageable in terms of both time and space. Another strategy is to identify only the potential patterns based on speedup potential, without imposing I/O constraints. We assume that the performance gain provided by a pattern P is higher than that of any pattern that is a subgraph of P. Although this is true in the majority of cases, some peculiar examples can be created to disprove it. Nevertheless, we work with this assumption and find all the maximal convex subgraphs of the application. In section 5, we present a simple and fast algorithm for identifying all the maximal convex subgraphs of the given application that runs several orders of magnitude faster than an exhaustive identification algorithm.

We evaluated both the gradual I/O relaxation based identification method and the convex maximal subgraph identification. Experimental results show that the gradual I/O constraint relaxation approach is not able to reach the stage of identifying all the convex maximal subgraphs due to time or space limitations, and that identifying the maximal convex subgraphs outperforms the gradual relaxation approach for all the benchmarks.
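The legality condition used above (only Allowed nodes, and convexity) can be illustrated with a short Python sketch. This is not the authors' implementation; the function names, the edge-list representation and the tiny example graph are assumptions made for illustration only.

```python
from collections import defaultdict


def descendants(succ, start):
    """Return every node reachable from `start` by following edges in `succ`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_legal_pattern(edges, allowed, pattern):
    """True if `pattern` uses only Allowed nodes and is convex in the DAG `edges`."""
    pattern = set(pattern)
    if not pattern or not pattern <= set(allowed):
        return False
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
    # Convexity: no path may leave the pattern and later re-enter it.
    for node in pattern:
        for outside in succ[node] - pattern:
            if (descendants(succ, outside) | {outside}) & pattern:
                return False
    return True


# Hypothetical basic block: node "x" stands for a load (Restricted type).
EDGES = [("a", "x"), ("x", "c"), ("a", "c"), ("c", "d")]
ALLOWED = {"a", "c", "d"}

NON_CONVEX = is_legal_pattern(EDGES, ALLOWED, {"a", "c"})    # a -> x -> c re-enters
CONVEX = is_legal_pattern(EDGES, ALLOWED, {"c", "d"})
HAS_RESTRICTED = is_legal_pattern(EDGES, ALLOWED, {"a", "x"})
```

Checking every subset this way is exactly the exhaustive enumeration that becomes infeasible for large basic blocks, which is why the maximal convex subgraphs are targeted directly instead.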

B. Timeshaping of Patterns

The patterns identified may have more input/output operands than the available register ports. In that case, operands have to be communicated to and from the register file in multiple cycles. Timeshaping of a pattern refers to the assignment of transfer cycles to the operands exchanged with the register file, honoring the maximum available register bandwidth. For a legal I/O timeshape, there are no constraints on the schedule cycles of input operands (even though some are imposed for optimization), but the timestep assignment of outputs is constrained because output ready times are a function of the input arrival times and of the internal computations of the pattern. The constraints on a legal timeshape are formally presented in [11]. An optimal timeshape that results in minimum latency of the pattern is found by evaluating all legal timeshapes.

C. CI Selection

Given the set of potential patterns along with their timeshapes, we are interested in a certain number of non-overlapping patterns that maximize the speedup of the application. Let Tsw(C) denote the software estimation for the component C and Thw(C) denote the hardware latency of the timeshaped component C. The speedup of an application A when the patterns PS = {P1, P2, ..., Pk} are selected is calculated according to Equ. 1.

$$\mathrm{Speedup} = \frac{T_{sw}(A)}{T_{sw}(A) - \sum_{i=1}^{k} \mathrm{Freq}(P_i) \cdot \left(T_{sw}(P_i) - T_{hw}(P_i)\right)} \qquad (1)$$

Overlapping of patterns poses a challenge in selecting an optimal set of patterns. Optimal selection is shown to be NP-complete [14]. We apply a greedy approach wherein the patterns are selected iteratively. Our approach to selection is similar to [14]. In [14], when a pattern is selected the gains of other overlapping patterns are nullified, whereas in our case the overlapping parts are chopped so as not to lose a potential pattern. The algorithm for selection is presented in section 6.

IV. MAXMIMO IDENTIFICATION

Let G(V,E) represent the dataflow graph (DFG) of a basic block BB of the given application, where V and E are the sets of nodes and edges representing the operations of BB and their dependencies as usual. Consider a subgraph of G with the set of nodes V′ ⊂ V and all the edges in E whose end points lie in V′; for the sake of brevity we refer to this subgraph simply as subgraph V′. For a subgraph V′,

Frontier(V′) = { n ∈ V − V′ | ∃ m ∈ V′ s.t. m, n are adjacent }

We classify each operation into one of two types, Allowed and Restricted, depending on whether or not it can become part of an AFU's behavior. Arithmetic/logical and comparison operations are classified as Allowed type operations and the rest are classified as Restricted type operations. A Multiple Input and Multiple Output pattern (MIMO) is a connected subgraph of G with multiple inputs and multiple outputs containing only operations of Allowed type. The inputs and outputs of a subgraph V′ of G(V,E) are defined as follows:

Input-Set(V′) = { n ∈ V − V′ | ∃ m ∈ V′ s.t. (n, m) ∈ E }

Output-Set(V′) = { n ∈ V′ | ∃ m ∈ V − V′ s.t. (n, m) ∈ E }

A MaxMIMO of a DFG G is a MIMO of G that is not contained in another MIMO subgraph of G. In other words, a MaxMIMO is a maximal connected subgraph of G with only nodes of Allowed type. The MaxMIMOs obey the following properties.

Property 1. The Frontier of a MaxMIMO consists only of nodes of Restricted type.

Property 2. Two MaxMIMOs M1 and M2 cannot overlap. This is so because overlapping requires that some node of M1 be a member of the Frontier of M2, which is not possible due to Property 1. In other words, a node of Allowed type uniquely determines the MaxMIMO to which it belongs.

Given a seed node, the MaxMIMO containing it can easily be constructed by iteratively expanding the partially formed subgraph along its frontier with only nodes of Allowed type, until the Frontier set consists only of nodes of Restricted type.

Example 1. An example DFG is shown in Fig. 1(a), with the label within the circle indicating the operation and a label adjacent to each node for unique identification. In the DFG shown in Fig. 1(a), the subgraph consisting of all the nodes except G and I forms a MaxMIMO. This MaxMIMO is non-convex due to the existence of paths between several pairs of nodes in the MaxMIMO through either G or I, which lie outside the MaxMIMO.

V. CONVEX MAXMIMO IDENTIFICATION
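The non-convex MaxMIMO that Section V starts from is produced by the seed expansion of Section IV. A minimal Python sketch of that expansion, together with the Input-Set and Output-Set definitions, is given here. The edge list is an assumption: it is one hypothetical DFG consistent with the textual description of Fig. 1(a) (nodes A to M, with G and I as loads), not a transcription of the figure, and the function names are illustrative.

```python
def max_mimo(edges, allowed, seed):
    """Grow the MaxMIMO containing `seed` (assumed Allowed) along its frontier."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    region, worklist = {seed}, [seed]
    while worklist:
        node = worklist.pop()
        for adj in neighbours.get(node, ()):
            # Only Allowed frontier nodes are absorbed; Restricted ones stop growth.
            if adj in allowed and adj not in region:
                region.add(adj)
                worklist.append(adj)
    return region


def input_set(edges, region):
    """Nodes outside `region` that feed a node inside it."""
    return {src for src, dst in edges if src not in region and dst in region}


def output_set(edges, region):
    """Nodes inside `region` whose value is consumed outside it (live-outs not modelled)."""
    return {src for src, dst in edges if src in region and dst not in region}


# Hypothetical edges; G and I are the Restricted (load) nodes.
EDGES = [("A", "E"), ("B", "E"), ("C", "F"), ("D", "F"), ("E", "H"), ("F", "H"),
         ("E", "G"), ("F", "I"), ("G", "J"), ("I", "K"), ("K", "L"),
         ("J", "M"), ("L", "M"), ("H", "M")]
ALLOWED = set("ABCDEFHJKLM")

MM = max_mimo(EDGES, ALLOWED, "A")
MM_INPUTS = input_set(EDGES, MM)
MM_OUTPUTS = output_set(EDGES, MM)
```

Because two MaxMIMOs cannot overlap (Property 2), calling `max_mimo` once per not-yet-covered Allowed node would enumerate all MaxMIMOs of a basic block.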

Given a DFG G(V,E) and a non-convex MaxMIMO MM, this section presents an efficient algorithm for identifying all maximal convex subgraphs of MM, termed Convex MaxMIMOs. Identifying convex subgraphs is a non-trivial process. The brute force method of checking every subgraph of MM for the convexity property could be infeasible, as the number of nodes in a typical MaxMIMO may be very large. First we analyze the causes of non-convexity and find conditions for maximality (Section 5.1), which helps in formulating an efficient algorithm (Section 5.2) for identifying all the maximal convex subgraphs.

Fig. 1. MaxMIMO and Convex MaxMIMO Example: (a) Data Flow Graph, (b) Convex MaxMIMO 1, (c) Convex MaxMIMO 2, (d) Convex MaxMIMO 3, (e) Convex MaxMIMO 4. Node labels and operations: A, B, C, D, M (+); E, F (|); H (&); J (*); K (/); L (^); G, I (LD).

A. Conditions for Convexity & Maximality

Definition: For a non-convex subgraph S of G, a pair of nodes a, b ∈ S is said to cause non-convexity iff at least one of the paths between a and b passes outside S. Such a pair of nodes is said to be incompatible w.r.t. S.

Property 3. Let a, b ∈ S form an incompatible pair (a, b) w.r.t. S and let a be an ancestor of b. Then for each node c ∈ S reachable from b, (a, c) is also an incompatible pair w.r.t. S.

Based on the above property, Algorithm 1 enumerates all the incompatible pairs of nodes w.r.t. MM.

Algorithm 1 Generate incompatible pairs of MM
 1: ∀ node n ∈ MM, ϑ(n) = ∅
 2: ∀ node n ∈ V − MM, ϑ(n) = MM ∩ Pred(n)
 3: For each node n in MM in topological order
 4: loop
 5:   ϑ(n) = ∪_{p ∈ preds(n)} ϑ(p)
 6:   ∀ p ∈ ϑ(n), list (p, n) as Incompatible
 7: end loop

Property 4. Let (a, b) be an incompatible pair of nodes w.r.t. a non-convex subgraph S of G. Then for any subgraph S′ of S containing both nodes a and b, the pair (a, b) remains a cause of non-convexity, i.e. S′ is also non-convex.

The above property implies that a convex subgraph of MM cannot contain both nodes of any incompatible pair of MM. In other words, every pair of nodes in a convex subgraph of MM is a compatible pair w.r.t. MM.

Property 5. For a non-convex subgraph S of G, every maximal set S′ ⊂ S of compatible nodes w.r.t. S forms a convex subgraph of S.

From the above two properties it can be seen that finding all the maximal compatible sets of nodes is equivalent to finding all the maximal independent sets of the Incompatibility graph of MM, whose node set is the same as the node set of MM and whose edges represent the incompatibility relation. Enumerating all maximal independent sets takes exponential time in the worst case. As the number of nodes in MM can be large, a solution based on direct computation of maximal independent sets is computationally very costly and may even be infeasible. We present below a more rigorous analysis of the origin of incompatibilities and define groupwise incompatibility between subsets of nodes, which leads to an algorithm that is exponential in the number of groupwise incompatibilities of MM. This number is much smaller than the number of nodes in MM.

B. Algorithm based on Groupwise Incompatibility

A cyclic dependency of MM is defined as a path from an output node of MM to an input node of MM that lies completely outside MM. The cause of non-convexity of a subgraph MM of G is the existence of a cyclic dependency of MM. Let a, b ∈ MM, where a is an output node of MM and b is an input node of MM, and let there be a path from a to b that lies completely outside MM. Then every node in Pred(MM, a) ∪ {a} is incompatible with every node in Succ(MM, b) ∪ {b}, where Pred(MM, a) is the set of ancestor nodes of a in MM and Succ(MM, b) is the set of descendant nodes of b in MM. We call this incompatibility between two sets of nodes a groupwise incompatibility. Corresponding to each cyclic dependency of MM, we create a groupwise incompatibility. Because every incompatibility is caused by a cyclic dependency, these groupwise incompatibilities capture all the incompatibilities of MM.

Let there be a total of n groupwise incompatibilities, denoted by (S_1^0, S_1^1), (S_2^0, S_2^1), ..., (S_n^0, S_n^1), where S_i^0, S_i^1 ⊂ MM. Any compatible subgraph CS of MM must not contain nodes from both groups of a pair, i.e. for each 1 ≤ i ≤ n all the nodes in either S_i^0 or S_i^1 must be absent from any compatibility set CS (including any maximal compatibility set).
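The derivation of groupwise incompatibilities from cyclic dependencies can be sketched in Python as follows. This is an illustrative reading of the definitions, not the authors' code: Pred(MM, a) and Succ(MM, b) are taken to be all ancestors and descendants in G intersected with MM, and the edge list is the same hypothetical DFG assumed to be consistent with the description of Fig. 1(a).

```python
from collections import defaultdict


def reach(adj, start, keep=lambda n: True):
    """Nodes reachable from `start` via `adj`, stepping only onto nodes where keep(n) holds."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen and keep(nxt):
                seen.add(nxt)
                stack.append(nxt)
    return seen


def groupwise_incompatibilities(edges, mm):
    """Return one (S0, S1) pair of node sets per cyclic dependency of `mm`."""
    mm = set(mm)
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
    pairs = set()
    for a in mm:
        # Outside nodes reachable from output node `a` without re-entering MM.
        escaped = set()
        for first in succ[a] - mm:
            escaped |= {first} | reach(succ, first, lambda n: n not in mm)
        # MM nodes fed by those outside nodes are the input nodes `b`.
        entries = {n for x in escaped for n in succ[x]} & mm
        for b in entries:
            s0 = (reach(pred, a) & mm) | {a}  # Pred(MM, a) plus a itself
            s1 = (reach(succ, b) & mm) | {b}  # Succ(MM, b) plus b itself
            pairs.add((frozenset(s0), frozenset(s1)))
    return pairs


# Hypothetical edges; G and I (loads) lie outside the MaxMIMO.
EDGES = [("A", "E"), ("B", "E"), ("C", "F"), ("D", "F"), ("E", "H"), ("F", "H"),
         ("E", "G"), ("F", "I"), ("G", "J"), ("I", "K"), ("K", "L"),
         ("J", "M"), ("L", "M"), ("H", "M")]
MM = set("ABCDEFHJKLM")

PAIRS = groupwise_incompatibilities(EDGES, MM)
```

On this graph the two cyclic dependencies (E to J through G, and F to K through I) yield two groupwise incompatibilities, so the number of groups is far smaller than the number of incompatible node pairs they summarise.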
Therefore, a potential maximal compatibility set can be represented by an n-bit vector P, whose j-th bit denotes which of the two sets S_j^0 and S_j^1 is absent from the compatibility set. There are 2^n different values P can take; for each value of P, the potential maximal compatibility set Ψ(P) is generated according to Equ. 2.

$$\Psi(P) = MM - \bigcup_{1 \le i \le n} S_i^{P[i]} \qquad (2)$$

Fig. 2. Maximal Compatibility Sets for MaxMIMO in Fig. 1(a)

(a) Groupwise Incompatibility

    {A, B, E}  is incompatible with  {J, M}
    {C, D, F}  is incompatible with  {K, L, M}

(b) All maximal compatibility sets

    No.  Complement of compatible set   Maximal compatible set
    1.   {J, M} ∪ {K, L, M}             {A, B, C, D, E, F, H}
    2.   {A, B, E} ∪ {K, L, M}          {C, D, F, H, J}
    3.   {J, M} ∪ {C, D, F}             {A, B, E, H, K, L}
    4.   {A, B, E} ∪ {C, D, F}          {H, J, K, L, M}
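Equ. 2 together with the subset-based elimination of non-maximal sets can be sketched in a few lines of Python. The node sets are the ones listed in Fig. 2; the function name is illustrative, and Python sets are used in place of the bit vector representation mentioned in the text.

```python
from itertools import product


def maximal_compatible_sets(mm, groups):
    """Enumerate Psi(P) for all 2^n bit vectors P and keep only the maximal sets."""
    mm = frozenset(mm)
    candidates = set()
    for bits in product((0, 1), repeat=len(groups)):
        removed = set()
        for pair, bit in zip(groups, bits):
            removed |= set(pair[bit])  # drop the group selected by P[i]
        candidates.add(mm - removed)
    # A candidate is maximal if no other candidate strictly contains it.
    return {c for c in candidates if not any(c < other for other in candidates)}


MM = set("ABCDEFHJKLM")
GROUPS = [(set("ABE"), set("JM")), (set("CDF"), set("KLM"))]

CONVEX_MAXMIMOS = maximal_compatible_sets(MM, GROUPS)
```

With n = 2 groupwise incompatibilities only four candidates are generated, compared with the 2^11 subsets of MM that a brute force convexity check would have to consider.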

It should be noted that not all compatibility sets generated by Equ. 2 are guaranteed to be maximal. Because all the maximal sets are among the generated compatibility sets, the non-maximal compatibility sets can be eliminated by checking each set for a subset relation with the remaining potential maximal compatible sets, leaving only the maximal convex subgraphs of MM. Subset relation checking can be performed efficiently by using the standard bit vector representation of sets.

Example 2. For the MaxMIMO shown in Fig. 1(a), there are two cyclic dependencies, one from E to J and a second from F to K. The groupwise incompatibilities originating from these two cyclic dependencies are shown in Fig. 2(a). The potential maximal compatibility sets derived from the groupwise incompatibilities are shown in Fig. 2(b). In this case, each potential maximal compatibility set happens to be a maximal compatibility set. We can observe that the four convex MaxMIMOs shown in Fig. 1(b)–(e) are exactly the same as the derived maximal compatibility sets.

VI. CI SELECTION

In this section we present the overall algorithm and briefly describe the selection step. Let MaxMIMOs(BB) denote the set of MaxMIMOs in basic block BB and ConvexMaxMIMOs(MM) denote the set of convex MaxMIMOs of MaxMIMO MM. Let Timeshape(P, Rr, Rw) denote the pattern P timeshaped with register ports (Rr, Rw). We refer to Gain(P) as the gain of pattern P, calculated as Freq(P) ∗ (Tsw(P) − Thw(P)). Our complete ISE methodology is shown in Algorithm 2. Initially, all the convex MaxMIMOs in the given application A are calculated, timeshaped and stored in the set PS (steps 1-12). Steps 13-30 then constitute the selection of the given number N of non-overlapping patterns. In each iteration, the pattern with the highest gain among the current set of potential patterns (BestP) is selected. For all the patterns in PS − {BestP}, the parts overlapping with BestP are chopped from them. After some nodes are removed from a convex pattern it might become non-convex; in that case all the maximal convex subgraphs of the pattern are added to the potential pattern set. Finally, the algorithm reports the speedup with the selected patterns.

Algorithm 2 ISE algorithm
 1: MMS = ∅
 2: for each basic block BB ∈ A do
 3:   MMS = MMS ∪ MaxMIMOs(BB)
 4: end for
 5: CMM = ∅
 6: for each MM ∈ MMS do
 7:   CMM = CMM ∪ ConvexMaxMIMOs(MM)
 8: end for
 9: PS = ∅
10: for each CM ∈ CMM do
11:   PS = PS ∪ Timeshape(CM, Rr, Rw)
12: end for
13: CI = ∅
14: Gain = 0
15: for I = 1, 2, ..., N do
16:   Let {P1, P2, ..., Pk} be PS
17:   BestP = Pi ∈ PS | Gain(Pi) ≥ max_{1≤j≤k} Gain(Pj)
18:   CI = CI ∪ BestP, PS = PS − BestP
19:   Gain = Gain + Gain(BestP)
20:   for each P ∈ PS do
21:     if Overlap(P, BestP) ≠ ∅ then
22:       PS = PS − P
23:       P = P − Overlap(P, BestP)
24:       Temp = ConvexMaxMIMOs(P)
25:       for each Q ∈ Temp do
26:         PS = PS ∪ Timeshape(Q, Rr, Rw)
27:       end for
28:     end if
29:   end for
30: end for
31: Speedup = Tsw(A) / (Tsw(A) − Gain)
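The selection loop (steps 13-31) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: patterns are plain node sets, the `gain` callable is a hypothetical stand-in for Freq(P) ∗ (Tsw(P) − Thw(P)), and the re-identification of convex subgraphs and the re-timeshaping of chopped patterns (steps 24-26) are omitted.

```python
def greedy_select(patterns, gain, count, t_sw_app):
    """Pick up to `count` non-overlapping patterns greedily and report the speedup."""
    pool = {frozenset(p) for p in patterns}
    chosen, total_gain = [], 0
    for _ in range(count):
        if not pool:
            break
        # Highest gain first; the sorted node list only breaks ties deterministically.
        best = max(pool, key=lambda p: (gain(p), sorted(p)))
        chosen.append(best)
        total_gain += gain(best)
        # Chop the overlap with `best` out of every remaining pattern.
        pool = {p - best for p in pool if p != best}
        pool.discard(frozenset())
    return chosen, t_sw_app / (t_sw_app - total_gain)


# Hypothetical candidates and gain model (10 cycles saved per node).
PATTERNS = [{"a", "b", "c"}, {"c", "d"}, {"e"}]
CHOSEN, SPEEDUP = greedy_select(PATTERNS, lambda p: 10 * len(p), count=2, t_sw_app=100)
```

In this example the second candidate loses node c to the first selection but survives as {d}, which is the difference from nullifying the gain of every overlapping pattern.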

VII. EXPERIMENTS

We have built the experimental framework based on our proposed methodology in the Trimaran retargetable compiler infrastructure. The front end of the compiler performs several classical optimizations. In the back end of the compiler, namely elcor, the ISE method is inserted immediately after the data flow analysis is performed. The operation latencies, formats, resource usage and I/O timeshapes are specified in a machine description file. The patterns are timeshaped assuming (2,1) register ports, with the hardware latency of a 16-bit MAC operation, derived from a 0.18 micron CMOS synthesis process, as the clock period. We evaluated our methodology on a set of four benchmarks taken from Mediabench and cryptography applications. For all the benchmarks, a total of four instructions are selected. The experiments investigate a) the performance potential of convex MaxMIMOs over I/O constrained patterns and b) the effect of the register port limitation on the speedup.

The speedups achieved with different I/O constraints for both Rigid I/O timeshape AFUs and Flexible I/O timeshape AFUs with (2,1) register ports are shown in Fig. 3. The label (m,n) indicates that patterns are identified imposing an (m,n) I/O constraint, and for Rigid I/O timeshape AFUs it also indicates that the base processor is equipped with (m,n) register ports. The label MAX indicates that no I/O constraint is imposed during identification. With an I/O constraint we used the exhaustive identification approach, and in the absence of an I/O constraint convex MaxMIMO identification is used. The results show that the performance potential with convex MaxMIMOs is 50% (on average) higher than with I/O constrained patterns. For flexible I/O AFUs with (2,1) register ports, the overhead of the extra cycles for transferring operands resulted in up to 15% reduction in the speedup. The performance analysis of convex MaxMIMO identification and exhaustive identification with (4,3) I/O constraints is shown in Table I. The identification of convex MaxMIMOs is two to three orders of magnitude faster than the exhaustive identification technique.

Fig. 3. Impact of I/O constraints on the speedup achieved from ISE: (a) Adpcm, (b) Compress, (c) Des, (d) Idct. Each panel plots Speedup against the I/O Constraint ((2,1), (3,1), (3,2), (4,1), (4,2), (4,3), MAX) for Rigid I/O and for Flexible I/O with (2,1) ports.

TABLE I
ANALYSIS OF CONVEX MAXMIMO AND EXHAUSTIVE IDENTIFICATION

    Benchmark   MaxMIMO Count   MaxMIMO Time   Exhaustive Count   Exhaustive Time
    Adpcm       4               1s             10936              1315s
    Compress    6               2s             56222              4833s
    Des         8               2s             314443             5280s
    Idct        1               1s             152884             5460s


VIII. CONCLUSION

In this paper, we have investigated the speedup potential of ISE in the absence of I/O constraints. The proposed identification technique is effective in capturing the most promising candidates and delivers higher levels of speedup than an I/O constrained identification technique. Distributed I/O functional units are shown to be a practical way of realizing the higher levels of speedup achievable with convex MaxMIMOs without having to add a large number of register ports. We plan to study the architectural and timing effects of flexible I/O functional units.

REFERENCES

[1] D. Goodwin and D. Petkov, “Automatic generation of application specific processors,” in CASES ’03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems, (New York, NY, USA), pp. 137–147, ACM Press, 2003.

[2] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “Synthesis of custom processors based on extensible platforms,” in ICCAD ’02: Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design, (New York, NY, USA), pp. 641–648, ACM Press, 2002.

[3] N. Clark, H. Zhong, and S. Mahlke, “Processor acceleration through automated instruction set customization,” in MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, (Washington, DC, USA), p. 129, IEEE Computer Society, 2003.

[4] K. Atasu, L. Pozzi, and P. Ienne, “Automatic application-specific instruction-set extensions under microarchitectural constraints,” Int. J. Parallel Program., vol. 31, no. 6, pp. 411–428, 2003.

[5] R. Kastner, A. Kaplan, S. O. Memik, and E. Bozorgzadeh, “Instruction generation for hybrid reconfigurable systems,” ACM Trans. Des. Autom. Electron. Syst., vol. 7, no. 4, pp. 605–627, 2002.

[6] R. E. Gonzalez, “Xtensa — A configurable and extensible processor,” IEEE Micro, vol. 20, pp. 60–70, 2000.

[7] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens, “Register Organization for Media Processing,” in International Symposium on High Performance Computer Architecture (HPCA), (Toulouse, France), January 2000.

[8] H. Choi, S. H. Hwang, C.-M. Kyung, and I.-C. Park, “Synthesis of application specific instructions for embedded dsp software,” in ICCAD ’98: Proceedings of the 1998 IEEE/ACM international conference on Computer-aided design, (New York, NY, USA), pp. 665–671, ACM Press, 1998.

[9] M. Arnold and H. Corporaal, “Designing domain-specific processors,” in CODES ’01: Proceedings of the ninth international symposium on Hardware/software codesign, (New York, NY, USA), pp. 61–66, ACM Press, 2001.

[10] P. Biswas, V. Choudhary, K. Atasu, L. Pozzi, P. Ienne, and N. Dutt, “Introduction of local memory elements in instruction set extensions,” in DAC ’04: Proceedings of the 41st annual conference on Design automation, (New York, NY, USA), pp. 729–734, ACM Press, 2004.

[11] L. Pozzi and P. Ienne, “Exploiting pipelining to relax register-file port constraints of instruction-set extensions,” in CASES ’05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, (New York, NY, USA), pp. 2–10, ACM Press, 2005.

[12] G. D. Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994.

[13] B. K. Dwivedi, A. Kejariwal, M. Balakrishnan, and A. Kumar, “Rapid resource-constrained hardware performance estimation,” in International Workshop on Rapid System Prototyping (RSP06), (Greece), June 2006.

[14] Y. Guo, G. J. Smit, H. Broersma, and P. M. Heysters, “A graph covering algorithm for a coarse grain reconfigurable system,” in LCTES ’03: Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems, (New York, NY, USA), pp. 199–208, ACM Press, 2003.

