J Supercomput DOI 10.1007/s11227-008-0174-4

An architecture framework for an adaptive extensible processor Hamid Noori · Farhad Mehdipour · Kazuaki Murakami · Koji Inoue · Morteza Saheb Zamani

© Springer Science+Business Media, LLC 2008

Abstract To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs as application-specific instruction set extensions and executing them on custom functional units. The problem with this approach is the immense cost and long time required to design a new processor for each application. As a solution, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units (FUs). A systematic quantitative approach is used for determining the appropriate structure of the reconfigurable functional unit (RFU). We also introduce an integrated framework for generating mappable CIs on the RFU. Using this architecture, performance is improved by a factor of up to 1.33, with an average improvement of 1.16, compared to a 4-issue in-order RISC processor. By partitioning the configuration memory, detecting similar/subset CIs, and merging small CIs, the size of the configuration memory is reduced by 40%.

Keywords Reconfigurable functional unit · Extensible processor · Custom instruction · Temporal partitioning · Similarity detection · Profiling

H. Noori (✉) · K. Murakami · K. Inoue
Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
e-mail: [email protected]

F. Mehdipour
Research Institute for Information Technology, Computing and Communication Center, Kyushu University, Fukuoka, Japan

M. Saheb Zamani
IT and Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran

H. Noori et al.

1 Introduction

In designing an embedded System-on-Chip (SoC), general and well-known approaches include Application-Specific Integrated Circuits (ASICs), General-Purpose Processors (GPPs), Application-Specific Instruction-set Processors (ASIPs), and extensible processors. Although ASICs offer much higher performance and lower power consumption, they are not flexible, and their design process is expensive and time-consuming. For GPPs, although the availability of tools, programmability, and the ability to be rapidly deployed in embedded systems are good reasons for their common use, they usually do not offer the necessary performance. ASIPs are more flexible than ASICs and have greater potential than GPPs to meet the challenging high-performance demands of embedded applications. However, the synthesis of ASIPs has traditionally involved generating a complete instruction set architecture (ISA) for the targeted application. This full-custom solution is too expensive and has a long design turnaround time.

Another method for providing enhanced performance is application-specific instruction set extension. By creating application-specific extensions to an instruction set, the critical portions of an application's dataflow graph (DFG) can be accelerated by mapping them to custom functional units. Custom instructions (CIs) reduce the latency of critical paths and the number of intermediate results read from and written to the register file. Though not as effective as ASICs, instruction set extension improves performance and reduces the energy consumption of processors due to the reduction of instruction memory accesses. Instruction set extensions also maintain a degree of system programmability, which enables them to be utilized with more flexibility. The main problem with this method is the significant nonrecurring engineering cost associated with its implementation.
The addition of instruction set extensions to a baseline processor for each application brings with it many of the issues associated with designing a brand new processor in the first place. The recent emergence of configurable and extensible processors offers a favorable tradeoff between efficiency and flexibility while keeping design turnaround times short. In the design of an embedded SoC, the success of product generation depends on efficiency and flexibility in accommodating future design changes. Flexibility allows system designs to be easily modified or enhanced in response to bugs, market shifts, evolving standards, or user requirements during the design cycle and even after production, which also means an increase in productivity. Efficiency is required to meet the tight cost, timing, and power constraints of embedded systems. Efficiency and flexibility are both critical, but they are usually conflicting design goals in embedded system design: while efficiency is obtained through custom hardwired features, flexibility is best provided through programmable features. The main motivation for specializing existing processors, rather than designing complete ASIPs, is to avoid the complexity of complete processor and toolset development. Xtensa from Tensilica [44], Jazz from Improv Systems [40], the Stretch S5 engine from Stretch [43], ARCtangent from ARC [38], SP-5flex from 3DSP [36], LISATek products from CoWare [39], Altera's NIOS [37], and Xilinx MicroBlaze [46] are commercial examples of these kinds of processors.


In our approach, an adaptive extensible processor (AMBER) is presented in which the CIs are generated and added after chip fabrication, fully transparently and automatically, according to the behavior of target applications. This approach drastically reduces the design time and cost of ASIPs and extensible processors and increases performance compared to a GPP, while still providing a similar level of flexibility. To support this feature, the custom functional units are replaced by a reconfigurable functional unit (RFU). Using an RFU also helps to support more CIs.

In proposing a proper architecture for the RFU, we follow a systematic quantitative approach. We first generate CIs without considering constraints on the number of inputs, outputs, operations, etc., for 19 applications of Mibench [41], and then, according to the features of these generated CIs, we suggest an architecture for the RFU. After fixing the RFU architecture, we apply the final constraints of the RFU to an integrated framework [20, 21] for generating mappable CIs. This framework partitions large or rejected CIs (those that cannot be mapped on the RFU due to its constraints, such as the number of inputs, outputs, FUs, etc.) into smaller mappable CIs. Mappable CIs are those that can be mapped on the RFU considering all of the final constraints of the proposed RFU.

In our method, CIs are generated by exploiting hot basic blocks (HBBs). An HBB is a basic block that is executed more than a given threshold number of times; a basic block is a sequence of instructions that terminates in a control instruction. Our RFU is a coarse-grain accelerator based on a matrix of FUs and is tightly coupled with the base processor. In this method, there is no need to add new opcodes for CIs, develop a new compiler, change the source code, or recompile it. Work in [25] introduced our basic idea.
The architecture of the RFU proposed in [25], including the network connections, was not fixed and was assumed to have an unlimited number of inputs and outputs and unlimited width and depth. A brief overview of our algorithm for generating custom instructions and of our design methodology for fixing a proper RFU architecture was given in [26]; however, it did not include synthesis results. In this paper, we present the details of our algorithm for generating custom instructions. Moreover, the details of our quantitative approach for designing the RFU, which was briefly described in [26], are discussed. This paper also includes the synthesis results for the RFU, the number of configuration bits required for each custom instruction, and speedup results that take the synthesis results into account. Moreover, the size of the configuration memory is determined, and two techniques (similarity detection and spatial merging) are presented for reducing it. In addition, the sequencer hardware proposed in [25] and [26], which was used for detecting custom instructions, is replaced with a new technique (Sects. 4.2 and 6) that has no hardware overhead. We also provide more experimental results (e.g., the effect of different network connection architectures on the speedup, the execution time of the proposed tool chain, etc.) and give a brief summary of our mapping algorithm and integrated framework, which were presented in [20] and [21].

The paper is organized as follows. In Sect. 2, we highlight some related work. A general overview of the AMBER architecture is given in Sect. 3. In Sect. 4, we discuss our quantitative approach and the proposed architecture for the RFU; Sect. 4 also includes the CI generation and mapping algorithms. An explanation of the integrated


framework is given in Sect. 5, and Sect. 6 covers the integration of the RFU and the base processor. The experimental results are presented in Sect. 7, and finally the paper closes with conclusions and future work.

2 Related work

The identification and generation of an optimal set of CIs is a topic that has recently received much attention [3, 4, 8, 11, 12, 15, 28, 32, 34, 35]. These approaches provide either exact formulations or heuristics to effectively identify those portions of an application's DFG that can be efficiently implemented in hardware. Most of these techniques generate CIs without considering all the details of the underlying custom functional units. Using an RFU enables us to implement more CIs, but it brings more constraints (e.g., connections and hardware resources) that should be considered while generating the CIs. To resolve this issue, we use an integrated framework that checks whether the generated CIs can be mapped on the RFU while generating them. This interaction between generating and mapping the CIs is done iteratively and guarantees that the generated CIs are mappable.

PRISC [29], Chimaera [13], OneChip [7], MOLEN [33], and XiRisc [18] are examples of tightly coupled integration of a GPP with fine-grain programmable hardware, and ADRES [22] is an example of a tightly coupled coarse-grain accelerator. The number of inputs/outputs and the method of integrating the RFU with the base processor differ for each design (e.g., PRISC uses an RFU with two inputs, while the RFU in Chimaera has nine). Fine-grain accelerators allow for very flexible computation, but their use has several drawbacks: they have long latency and reconfiguration times, and they require a large amount of memory for storing configuration bits. AMBER falls into the coarse-grain category; however, the RFU of AMBER is designed using a quantitative approach. For loosely coupled systems like MorphoSys [17], Garp [14], and REMARC [23], there is an overhead for transferring data between the base processor and the coprocessor.
When the RFU is tightly coupled, data is read from and written to the processor register file directly, making the RFU an additional functional unit in the processor pipeline. This keeps the control logic simple, as almost no overhead is required in transferring data to the programmable hardware unit; however, it increases the number of read/write ports of the register file. Chimaera adds a shadow register to solve this issue. In our case, we share the input/output resources between the RFU and the processor functional units (PFUs). All of these designs require a new programming model, new opcodes for CIs, a new compiler, source code modification, and recompilation. In our approach, however, we do not encounter these issues: the user simply runs the applications on the simulator, and the generation of CIs and the handling of their execution are done transparently and automatically. Consequently, our approach is applicable even when the source code is not available.

Adaptive dynamic optimization systems such as Turboscalar [5], rePLay [27], PARROT [30], and Warp Processors [45] select frequently executed regions of the code through dynamic profiling, optimize the selected regions, and cache/rewrite the


optimized version for future occurrences. The execution of the optimized version is carried out by extra tasks sharing the main processor and/or by extra hardware. The Warp Processor uses fine-grain hardware for accelerating whole loops, while Turboscalar and PARROT use a wide VLIW for accelerating hot paths and traces. In binary translation systems [1, 2, 16], code for a source instruction set architecture (ISA) is transparently translated to a different native instruction set, where it is executed on hardware (usually a VLIW processor core). In these systems, due to the atomic nature of traces, execution must roll back to the beginning of a trace on a branch misprediction; in our approach, there is no such penalty. Moreover, to overcome the overhead of dynamic optimization, we have defined two phases for our processor.

Sassone and Wills [31] propose the dynamic detection and speculative execution of linear chains of dependent instructions. They use a strand cache fill unit and a strand cache to find transient operands, connect them, and cache them for future use. They also utilize hardware called the dispatch engine to insert strands into the instruction stream and remove the individual instructions from the stream. Closed-loop ALUs are then used for executing these chains. In contrast, we use an offline approach to detect and generate CIs, which obviates the need for hardware such as the strand cache fill unit. Moreover, the execution and insertion of CIs in the instruction stream is accomplished by changing the object code, which does not require any hardware (Sassone et al. use the dispatch engine). Our CIs include both parallel instructions and chains of dependent instructions.

In [9], Clark et al. limit CIs to one basic block, while in [10] they try to extend CIs over basic blocks.
To support the execution of their CIs, they modify the compiler to insert call instructions to the CIs before or after the branch instruction. Some code motion must be applied to the object code to support branch mispredictions. They also need to extend the branch target address cache (BTAC, sometimes called the branch target buffer) to store additional information. Moreover, they use special hardware to generate control signals. In our approach, by contrast, no compiler modifications or code motions are needed. Furthermore, we generate the control signals (configuration data) for the RFU offline and, therefore, no additional hardware is needed. We support the execution of the CIs by binary rewriting (changing the object code) instead of extending the BTAC (a hardware modification), which is much simpler and cheaper. In addition, our approach uses a different method for profiling, and different algorithms are utilized for generating, mapping, and handling CIs. Our RFU is not as tightly integrated with the base processor as the other functional units, but shares the available read/write ports. Finally, by applying some modifications to the routing resources and the locations of input ports, our RFU can handle more CIs.

3 General overview of AMBER architecture

AMBER, targeted for embedded systems, is composed of four main components: (i) a base processor, (ii) a coarse-grain RFU with functions and connections controlled by configuration bits, (iii) a configuration memory for keeping the configuration bits of the RFU for each CI, and (iv) counters for controlling the read/write


Fig. 1 Integrating the RFU with the base processor

signals of the register file and selecting between the processor functional units and the RFU (Fig. 1). AMBER tunes its extended instructions to the target applications, which is why we call it adaptive.

The base processor is a 4-issue in-order RISC processor that supports the MIPS instruction set. The RFU is a matrix of functional units (FUs). Given the size of data in the processor, a matrix of FUs is efficient and reasonable hardware for accelerating dataflow subgraphs as CIs. The use of coarse-grain reconfigurable accelerators results in less demand for configuration memory. In addition, they are faster than fine-grain programmable hardware, and mapping instructions on them is easier. Each FU of the RFU supports all fixed-point instructions of the base processor except multiply, divide, and load. The RFU operates in parallel with the other functional units of the processor, and reads and writes data from and to the register file. After the execution of each CI on the RFU, the program counter (PC) is updated from the configuration memory according to the original execution sequence, so that the processor can continue from the correct address. The RFU is multi-cycle and has a variable delay that depends on the depth of the DFG of each CI and the clock frequency of the base processor. The number of clock cycles required for executing each CI on the RFU is kept in the configuration memory as part of the configuration data. The architecture of the RFU (number of inputs, outputs, and FUs, its width, depth, etc.) is determined using a quantitative approach at design time (Fig. 2).

The counters are used for controlling the read/write signals of the registers of the register file and for switching between the processor functional units and the RFU. They are activated as soon as a CI is detected. At this time, the number of clock cycles required for executing the corresponding CI is loaded from the configuration memory into the counters.
For the specified number of clock cycles, the counters select the configuration bits that choose the input and output registers of the CI for the RFU, and also select the RFU outputs (see Sects. 4.2 and 6 for more details).
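The per-CI contents of the configuration memory described above (configuration bit-stream, resume address, and cycle count) can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation; the field names and the `dispatch` helper are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ConfigEntry:
    """One configuration-memory entry for a custom instruction (illustrative layout)."""
    ci_id: int        # index used by the CI's mtc1 operand to select this entry
    bitstream: bytes  # configuration bits for FU functions and connections
    next_pc: int      # address of the instruction following the CI's last instruction
    cycles: int       # clock cycles the RFU needs to execute this CI

def dispatch(config_mem, ci_id):
    """Sketch of normal-phase dispatch: in hardware, the bit-stream is loaded
    onto the RFU in one clock cycle, while `cycles` initializes the counters
    that gate register-file read/write signals and the output multiplexers."""
    entry = config_mem[ci_id]
    return entry.cycles, entry.next_pc
```

After `cycles` clock cycles have elapsed, the returned `next_pc` would be loaded into the PC so execution resumes after the collapsed subgraph.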


Fig. 2 Different phases for designing and using AMBER

For using AMBER, there are two phases: a configuration phase and a normal phase (Fig. 2). The configuration phase is done offline. In this phase, target applications are run on an instruction set simulator (ISS) and profiled. Then the start addresses of hot basic blocks (HBBs) are detected [25] and read from the object code. CIs are generated from the HBBs, where each CI is limited to one HBB. Mapping CIs to generate the configuration bit-stream for the RFU is also performed in this phase. In the normal phase, the bit-streams from the configuration memory are loaded on the RFU for executing custom instructions.

Normally, the critical regions of a code are in loops. Therefore, our profiler looks for jumps and taken branches by monitoring the PC: it compares the previous PC and the current PC in each cycle, and if the difference between these two values is not equal to the instruction length, a taken branch or jump has occurred. The profiler has a table with two fields for each entry: (i) the start address of a basic block and (ii) a counter recording the execution frequency of that basic block. In the case of a taken branch/jump, the profiler table is checked; if the target address (the current PC) is in the table, the corresponding counter is incremented, otherwise the current PC is added as a new entry and its counter is initialized to one. Using the profiler table and a defined threshold, the start addresses of HBBs are detected. Figure 3 shows the profiler algorithm.
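The profiler loop just described can be rendered as a short sketch. The trace representation (a flat list of PC values), the fixed 4-byte instruction length, and the function name are assumptions for illustration only.

```python
INSTR_LEN = 4  # assumed fixed MIPS instruction width in bytes

def profile(pc_trace, threshold):
    """Detect start addresses of hot basic blocks from a PC trace.

    A taken branch or jump is inferred whenever consecutive PCs differ by
    anything other than one instruction length; the target address then
    starts a basic block whose execution count is accumulated in a table.
    """
    table = {}  # basic-block start address -> execution count
    prev_pc = None
    for pc in pc_trace:
        if prev_pc is not None and pc - prev_pc != INSTR_LEN:
            table[pc] = table.get(pc, 0) + 1  # taken branch/jump target
        prev_pc = pc
    # addresses executed more than the threshold are HBB start addresses
    return [addr for addr, count in table.items() if count > threshold]
```

For example, a four-instruction loop body at 0x100 that iterates five times produces four backward jumps to 0x100, so `profile(trace, 3)` reports 0x100 as an HBB start address.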

4 RFU architecture: a quantitative approach

In this section, using a quantitative approach, we aim to determine the details of the architecture of the RFU (i.e., the numbers of inputs, outputs, and FUs, as well as its width, depth, etc.). We explain the tool chain utilized for designing the RFU, as well as our CI generation method and mapping algorithm. Finally, the proposed architecture for the RFU is presented.

4.1 Utilized tool chain for the quantitative approach

Figure 4 shows the tool chain used for designing the RFU. It was applied to 19 applications of Mibench [41] in the design phase. Our simulation environment is based on Simplescalar [42], modified to generate a trace of taken branches and jumps. The trace file is used as input by the profiler for detecting the start addresses of HBBs.


Fig. 3 Profiler algorithm

Fig. 4 Tool chain of the quantitative approach

The corresponding basic blocks for all HBBs are read from the object code using these addresses; reading an HBB terminates when the first control instruction is seen. Then the instruction sequence (or instruction list) and the dataflow graph (DFG) are generated for each HBB and passed to the CI generator. The CI generator makes the CI and applies constant propagation for optimization. The mapping tool receives the optimized CIs and maps them on the RFU using the DFG. The results obtained from the mapping tool are used in our design process for proposing a proper architecture for the RFU. The same tool chain is used in the configuration phase (Fig. 2) to generate the configuration data for the configuration memory and the modified object code.

Our CI generator, mapping tool, and RFU were developed in two phases. In the first phase, we assumed some primary constraints for both the CIs and the RFU. The two primary constraints for CIs are: (i) supporting only fixed-point instructions except multiply, divide, and load; and (ii) including at most one store and at most one control instruction as the last instruction. In this phase, there are no constraints on the number


of inputs, outputs, operations, etc. Multiply and divide were excluded due to their low execution frequency and the large area required for hardware implementation; load is excluded because cache misses and long memory access times make its execution latency unpredictable. The preliminary version of the RFU is a matrix of FUs that supports only the fixed-point instructions of the base processor, without any limitations on the number of inputs, outputs, or FUs, or on its width and depth. The output of each FU in the RFU was assumed to be usable by the right and left neighbors of the same row and by all FUs in the lower rows. Then, in the second phase, the architecture of the RFU (number of inputs, outputs, and FUs, width, depth, connections, etc.) is fixed using the mapping results, and the final constraints are applied to the Integrated Framework (see Sect. 5).

4.2 Generating custom instructions

In this section, the algorithm for generating custom instructions is described. We first present some basic definitions:

Definition 1 The DFG of an HBB is assumed to be a directed acyclic graph (DAG) called G(V, E), where V is the set of nodes denoting primitive operations (i.e., instructions of the base processor) and the edges in E represent the flow-dependences between the operations. A node can have at most two inputs, and its single output can be used by multiple nodes.

Definition 2 The instruction sequence of an HBB is a sorted list IS(v1, v2, ..., vn), where each vi denotes an instruction in the HBB.

Definition 3 If V′ is a subset of V and E′ is a subset of E, then G′(V′, E′) is a subgraph of G(V, E).

Definition 4 If V′ is a subset of V, then IS′(V′) is a sub-list of IS(V).

Definition 5 A subgraph G′ ⊂ G is convex if there is no path between two nodes of G′ that involves a node of G\G′ (the complement of G′).
In other words, ∀v1, v2 ∈ V′, all nodes on the paths between v1 and v2 are contained in V′.

Definition 6 For a graph G(V, E), Level: V → N is the function defined as follows:
• level(v) = 0, if v is an input node of G;
• level(v) = α > 0, if there are α nodes on the longest path from the input nodes to v ∈ V.
The depth d of a graph G is the maximum level of its nodes.

Definition 7 A member v ∈ V of list IS(V) is valid if and only if v can be executed on the RFU; in other words, if v is not a floating-point, load, second store/control, multiply, or divide instruction. A sub-list IS′(V′) ⊂ IS(V) is valid if ∀v ∈ V′, v is valid.
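The level function of Definition 6 and the convexity condition of Definition 5 can both be checked directly on a DAG. The following sketch assumes a simple node/edge-list representation; the maximum of the returned levels gives the depth d.

```python
def levels(nodes, edges):
    """level(v) = 0 for an input node, else the number of nodes on the
    longest path from the input nodes to v (Definition 6)."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    memo = {}
    def lv(v):
        if v not in memo:
            memo[v] = 0 if not preds[v] else 1 + max(lv(u) for u in preds[v])
        return memo[v]
    return {v: lv(v) for v in nodes}

def reachable(start, succs):
    """All nodes reachable from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        for n in succs[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def is_convex(sub, nodes, edges):
    """Definition 5: no path between two subgraph nodes may pass through an
    outside node; equivalently, no outside node is both reachable from the
    subgraph and able to reach back into it."""
    succs = {v: [] for v in nodes}
    for u, v in edges:
        succs[u].append(v)
    for w in set(nodes) - set(sub):
        if any(w in reachable(s, succs) for s in sub) and reachable(w, succs) & set(sub):
            return False
    return True
```

For the chain a → b → c, the subgraph {a, b} is convex, while {a, c} is not, because the a-to-c path passes through the excluded node b.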


Definition 8 A data dependence from an assignment to a use of a variable is called a flow-dependence, and a data dependence from a use of a variable to a later reassignment of that variable is called an antidependence. Two consecutive assignments to the same destination are referred to as an output-dependence. Violating flow-, anti-, and output-dependences results in Read After Write (RAW), Write After Read (WAR), and Write After Write (WAW) hazards, respectively. In this paper, we refer to checking flow-, anti-, and output-dependences as data-dependence checking.

Our proposed CIs are single entry, and they do not cross loop boundaries. Therefore, we detect the HBBs of hot loops and generate CIs from the innermost loop to the outermost. The instruction sequence and the DFG are generated for each HBB; the DFG is used for detecting flow-dependences, and the instruction sequence is used for detecting anti- and output-dependences.

Our CI generator obtains the instruction sequence of each HBB as an input named IS(V) and looks for the largest valid sub-list IS′(V′) in it. Then the same set V′ in the corresponding DFG G(V, E) is specified and named G′(V′, E′). Next, by detecting and checking valid instructions in IS(V), we try to add more valid instructions to the head and tail of IS′(V′) using the following algorithm. In this method, an ordered sequence of instructions is always selected as a CI, which guarantees the convexity of the corresponding selected subgraph of the DFG.

The instruction preceding the first member of IS′(V′) is looked up in IS(V) and denoted vj. While there is such an instruction and it is not valid, it is added to the unlinkable_node_list, a list recording the instructions that cannot be linked to IS′(V′) and may be moved. This is repeated until a valid vj is detected. There are two cases for adding a valid vj to the head of IS′(V′) in the object code.
First, the data dependences between vj and ∀vk ∈ unlinkable_node_list are checked. vj and the unlinkable_node_list are swapped in the object code if there is no such dependence and the corresponding part of the object code, where the instructions are to be moved, is not the target of any branch instruction (steps 8, 9 in Fig. 5). Otherwise, in the second case, the data-dependences are checked for ∀vm ∈ V′ and ∀vk ∈ unlinkable_node_list; IS′(V′) and the unlinkable_node_list are exchanged if there are no data-dependences and the corresponding part of the object code is not the target of branches (steps 10, 11). This process continues for each new first member of the updated list IS′(V′) until reaching the first member of IS(V). Figure 5 shows the part of the algorithm that adds valid instructions to the head of IS′(V′).

A similar idea is applied at the tail. The instruction following the last member of IS′(V′) is looked up in IS(V) and denoted vj. While there is such an instruction and it is not valid, it is added to the unlinkable_node_list until a valid vj is detected. There are two cases for adding a valid vj to the tail of IS′(V′) in the object code. First, the data-dependences between vj and ∀vk ∈ unlinkable_node_list are checked. vj and the unlinkable_node_list are exchanged in the object code if there are no such dependences and the corresponding part of the object code, where the instructions are to be moved, is not the target of branch instructions. Otherwise, in the second case, the data-dependences are checked for ∀vm ∈ V′ and ∀vk ∈ unlinkable_node_list. In this case, IS′(V′)


1  find largest valid sub-list in IS(V) and name it IS′(V′);
2  unlinkable_node_list := ∅;
3  set vi as the first instruction of IS′(V′);
4  find the previous instruction of vi in IS(V) and name it vj;
5  if vj is not a valid instruction then
6      unlinkable_node_list := unlinkable_node_list ∪ {vj};
7  else
8      if there is no data dependency between vj and ∀vk ∈ unlinkable_node_list and the corresponding part of the object code is not the target of branches then
9          swap vj and unlinkable_node_list;
10     else if there is no data dependency between ∀vm ∈ V′ and ∀vk ∈ unlinkable_node_list and the corresponding part of the object code is not the target of branches then
11         swap IS′(V′) and unlinkable_node_list;
12     else
13         unlinkable_node_list := unlinkable_node_list ∪ {vj};
14     end if;
15 end if;
16 find the instruction prior to vj in IS(V) and iterate from step 5 if vj exists, else terminate;

Fig. 5 Algorithm for adding valid instructions to the head of sub-list IS′(V′)
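The head-extension loop of Fig. 5 can be rendered as a runnable sketch. The helper predicates `is_valid` (Definition 7), `depends` (data-dependence checking between two instruction groups), and `is_branch_target` are assumed to be supplied by the caller; for brevity, both swap cases simply absorb vj into the CI, with comments noting which object-code motion each case corresponds to.

```python
def extend_head(IS, ci, is_valid, depends, is_branch_target):
    """Grow custom instruction `ci` (a contiguous sub-list of IS) toward the
    head of the basic block, collecting immovable instructions in an
    unlinkable_node_list (sketch of Fig. 5; helper predicates are assumed)."""
    unlinkable = []                      # instructions that cannot join the CI
    i = IS.index(ci[0]) - 1              # instruction preceding the CI's first member
    while i >= 0:
        vj = IS[i]
        if not is_valid(vj):
            unlinkable.append(vj)        # steps 5-6
        elif not depends([vj], unlinkable) and not is_branch_target(vj):
            ci.insert(0, vj)             # steps 8-9: swap vj past the unlinkable block
        elif not depends(ci, unlinkable) and not is_branch_target(vj):
            ci.insert(0, vj)             # steps 10-11: move IS'(V') above the block
        else:
            unlinkable.append(vj)        # steps 12-13
        i -= 1                           # step 16: iterate toward the head of IS(V)
    return ci
```

With an instruction list mirroring the adpcm example of Sect. 4.2 (two invalid lbu instructions between subiu and the CI), the valid subiu is swapped past the unlinkable block and joins the CI head.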

and the unlinkable_node_list are swapped if there is no data-dependence and the corresponding part of the object code is not the target of branches. This process continues for each new last member of the updated list IS′(V′) until reaching the last member of IS(V).

Sometimes HBBs are so large that more than one CI can be extracted. Therefore, the same procedure is executed for the remaining nodes of V to generate new CIs. In our current implementation, CIs with |V′| ≤ 5 are ignored.

Figure 6 exemplifies the CI generation algorithm. Figure 6a shows the instruction sequence (IS(V)) of an HBB of adpcm generated by the Simplescalar compiler. According to the algorithm, first the largest valid instruction sequence (IS′(V′)) is detected, shown in bold in Fig. 6b. lbu and lw (load instructions) are invalid instructions and determine the initial top and bottom boundaries of IS′(V′), respectively. sll (at address 0x400698) is the first instruction of IS′(V′). The two lbu instructions are invalid and are added to the unlinkable_node_list. Since subiu is valid and there are no data-dependences between subiu and the instructions in the unlinkable_node_list, they are swapped. As there is no instruction before subiu in the HBB, the process of adding instructions to the head of the instruction sequence finishes. In Fig. 6, the DFG corresponding to the HBB is shown at different stages of the algorithm, with the instructions newly added to the CI shaded.

Fig. 6 Moving instructions to make a larger custom instruction

The function for adding instructions to the end of IS′(V′) is similar. Because lw is invalid, it is added to the unlinkable_node_list. The next valid instruction (xori) has no data-dependences with the instructions in the unlinkable_node_list; therefore, they are exchanged. addu and bgez have a flow-dependence with the instruction in the unlinkable_node_list, so they are added to it. Even if bgez had no data-dependence with the instructions in the unlinkable_node_list, it should not be moved, since it is a branch instruction. The algorithm terminates at this stage upon reaching the end of the HBB.

After collapsing the subgraph as a CI in the DFG and rewriting the object code for the moved instructions, the first instruction of the CI's sub-list is replaced by an mtc1 (move to coprocessor) instruction in the object code. In addition, the address of the instruction following the CI's last instruction (e.g., lw in the above example) is stored in the configuration memory along with the configuration bit-stream. When the execution of the CI finishes on the RFU, this address is loaded into the PC to resume the program from the correct point. In the example of Fig. 6, the subiu instruction is replaced with mtc1, and address 0x4006c8 is saved in the configuration memory as the point where execution should continue after executing the CI on the RFU. The operand of the mtc1 instruction specifies the index into the configuration memory for that CI, which can be considered the identification number (ID) of the CI.

In the normal mode, when an mtc1 is encountered, its operand is used as an index into the configuration memory: the counters for the multiplexers are loaded for the corresponding CI, and simultaneously the configuration bit-stream is loaded on the RFU. It is assumed that the width of the output data of the configuration memory equals the number of configuration bits required by the RFU, so that configuration loading can be done in one clock cycle. The RFU has a variable delay for each CI, which depends on the depth of the DFG of the CI, the base processor's clock frequency, and the specifications of the synthesis target technology. The counters force the multiplexers to enable the input and output registers for the RFU using the configuration bits and to switch to select the RFU outputs.
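The configuration-phase rewriting step described above can be sketched as follows. The symbolic instruction encoding, the field names, and the concrete addresses are illustrative assumptions (the resume address mirrors the 0x4006c8 example of Fig. 6).

```python
INSTR_LEN = 4  # assumed fixed MIPS instruction width in bytes

def install_ci(object_code, config_mem, ci_id, ci_start, ci_last, bitstream, cycles):
    """Sketch of installing one CI: the CI's first instruction is overwritten
    with an mtc1 whose operand is the CI's ID, and the resume address is
    stored with the configuration data (symbolic encoding is an assumption)."""
    config_mem[ci_id] = {
        "bitstream": bitstream,            # loaded onto the RFU in one cycle
        "cycles": cycles,                  # RFU latency for this CI
        "next_pc": ci_last + INSTR_LEN,    # where execution resumes afterwards
    }
    object_code[ci_start] = ("mtc1", ci_id)  # operand indexes the config memory
```

Because the rewriting touches only the object code and the configuration memory, no compiler changes or extra hardware are involved, consistent with the comparison in Sect. 2.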
When execution of a CI on the RFU finishes, the multiplexers' selectors switch back to the processor's functional units, and control of the register file's enable signals is returned to the signals generated by the decoder (Figs. 1 and 13). At the same time, the next PC is loaded from the configuration memory and execution continues from the exit node of the CI.

4.3 Mapping custom instructions on the RFU

Mapping DFG nodes onto the RFU is a placement problem: the mapping determines appropriate positions on the FUs of the RFU for the DFG nodes. Assigning CI instructions (DFG nodes) to FUs is done based on the priority of the nodes. The ASAP (As Soon As Possible) level of a node represents its execution order according to its dependencies, while its slack represents its criticality; for example, a node with zero slack lies on the critical path and should be scheduled with the highest priority. Therefore, in the first step, the ASAP and slack values of each node in the DFG are determined [6, 24]. In the second step, a preliminary mapping is produced by assigning nodes to FUs according to their ASAP levels; in other words, a one-to-one mapping of nodes to FUs is performed, starting from the nodes with lower ASAP levels. In this phase, there is no constraint on the number and type of FUs or on the connections between FUs of the RFU; the only restriction is that bottom-up connections are not supported. This simple algorithm does not guarantee the minimum connection length between nodes. Therefore, a complementary algorithm is presented, which attempts


Fig. 7 Mapping a custom instruction on the RFU

to minimize the total connection length between nodes on the RFU. After the initial mapping, nodes are moved to other FUs to achieve shorter connection lengths. The moving scope of a node is restricted to the bounding box of its connections: the positions of each node's ancestors and descendants determine the boundaries of movement, and some nodes may not be movable at all. A node with zero slack cannot be moved to a neighboring row; for example, in Fig. 7, node 5 can only move to the left column and its row cannot change, since its slack is zero. Such a node can only move to different columns of the same row. As another example, a node with a slack of one can move one row up or down; it must stay below or in the same row as its parents, and above or in the same row as its children. By moving nodes to unoccupied locations within their bounding boxes, locations that minimize the connection length can be found. Figure 7 shows the mapping of a CI and the moving scope of its nodes.

4.4 Proposed architecture for the RFU

In this paper, the mapping rate is defined as the percentage of generated CIs that can be mapped on the RFU for 19 applications of Mibench [40]. We considered the execution frequency of CIs while calculating the mapping rate, because a CI with a higher execution frequency should have more impact on the specifications of the architecture. All 19 applications were executed to completion. Since execution time varies across applications, for a fair evaluation a weight was assigned to each application so that the weighted execution times become equal for all of them. To determine proper numbers of RFU inputs and outputs, the generated CIs were mapped on the RFU without any constraints (infinite numbers of inputs, outputs, FUs, depth, width, etc.).
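The ASAP/slack priority computation that drives the mapper of Sect. 4.3 can be sketched as below. This is an assumed formulation (nodes given as ids, predecessors as a dict), not the authors' code:

```python
# Illustrative sketch of the node priorities used for placement:
# ASAP levels give the initial row assignment, and
# slack = ALAP - ASAP marks critical-path nodes (slack 0).

def asap_levels(nodes, preds):
    """ASAP level of a node = length of the longest path from a DFG input."""
    level = {}
    def asap(n):
        if n not in level:
            level[n] = 0 if not preds[n] else 1 + max(asap(p) for p in preds[n])
        return level[n]
    for n in nodes:
        asap(n)
    return level

def slacks(nodes, preds):
    """slack(n) = ALAP(n) - ASAP(n); zero slack means n is on the critical path."""
    asap = asap_levels(nodes, preds)
    succs = {n: [m for m in nodes if n in preds[m]] for n in nodes}
    depth = max(asap.values())
    alap = {}
    def walk(n):
        if n not in alap:
            alap[n] = depth if not succs[n] else min(walk(s) for s in succs[n]) - 1
        return alap[n]
    return {n: walk(n) - asap[n] for n in nodes}
```

A node with slack zero keeps its ASAP row during the move phase, while a node with slack one may move one row up or down, as in the Fig. 7 example.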
By examining the mapping rate for different numbers of inputs and outputs, we obtained the curves shown in Fig. 8. According to these curves, eight and six are good candidates for the numbers of inputs and outputs, respectively; the curves almost flatten beyond these values. To determine the appropriate number of FUs, the mapping rate was measured for various numbers of FUs. The measurement was done for two cases. First, we measured the mapping rate for CIs that meet the input and output constraints (inputs

Fig. 8 Mapping rate for different number of inputs and outputs for RFU

Fig. 9 Mapping rate for different number of FUs

are less than 9 and outputs are less than 7) obtained from the previous experiments; in the second case, no constraints were assumed (see Fig. 9). In both cases, 16 was a good candidate for the number of FUs. The curve marked by diamonds, which shows the mapping rate without any constraints, illustrates that most of the remaining CIs were so large that even 35 FUs were not enough to execute them. The curve marked by squares shows that with 16 FUs, almost all (99.84%) of the CIs that meet the input and output constraints can be handled by the RFU. Two other important parameters to determine for the RFU are its width and depth: the width (number of columns) bounds the number of instructions that can execute in parallel, and the depth (number of rows) bounds the length of the critical path of a CI (i.e., the depth of its DFG). A similar procedure was applied to determine these parameters. The experimental results show that a width of 6 and a depth of 5 are suitable (Fig. 10). Adding the width and depth constraints to the input, output, and FU constraints decreases the mapping rate from 94.74 to 93.51%. The 16 FUs should be laid out on this 6-by-5 matrix. We followed the same procedure to determine the number of FUs in each row. The results show that 6, 4, 3, 2, and 1 are appropriate candidates for the first to fifth rows, respectively. Adding these new constraints decreased the mapping rate to 92.28%. However, in


Fig. 10 Mapping rate for different widths and depths

this architecture we assumed that the inputs of the RFU could be accessed directly by any FU and that there were direct connections from each row to all rows below it. Moreover, each FU could output to its left and right neighbors and accept inputs from them. To make the design more realizable and to simplify the connections, the same approach described above was employed repeatedly to find the number of inputs connected to each row as well as the number of connections among rows. All connections to the left and right FU neighbors were eliminated. The final architecture of the RFU is shown in Fig. 11. The eight input ports are replicated and distributed among the rows to facilitate data access (7, 2, 2, 2, and 1 ports for Row1 to Row5, respectively). In the RFU, the output of each FU in a row can be used by all FUs in the subsequent row (connections of length one). Besides these, there are two connections of length two (Row1 → Row3, Row2 → Row4), one of length three (Row1 → Row4), and one of length four (Row1 → Row5). Each of these four connections is realized by multiplexing the outputs of row X and sending them to row Y as input. In the third and fourth rows, three unidirectional connections to neighboring FUs were added. Using these connections, CIs with critical path lengths greater than 5 and less than 9 can also be supported by the RFU. Experiments show that each FU of the RFU does not need to support all operations. We defined three types of operations: logical operations (type 1), add/sub/compare (type 2), and shift operations (type 3). The distribution and number of operations of each type in each row are given in Table 1. Considering all the constraints of the proposed architecture, the mapping rate decreased to 90.48%. We developed the VHDL code of the RFU and synthesized it using Synopsys tools [47] and a Hitachi 0.18 µm technology library. The area of the RFU was 1.15 mm².
For comparison, we measured the area of a 32K cache with a 32-byte line size and associativity 4 using the CACTI tool [48] in 0.18 µm technology; the reported area is 2.23 mm². We also developed a register file in VHDL with 64 registers, 8 read ports, and 4 write ports and synthesized it using the same Hitachi 0.18 µm library; its reported area is 2.092 mm². Since each FU output can be accessed directly via the output ports of the RFU, and since the DFG depth of each CI is known after mapping, we can have an

Fig. 11 Proposed architecture for the RFU


Table 1 Number of functions for each type for different rows

Row No.  Type 1  Type 2  Type 3
1        2       6       4
2        3       3       2
3        1       3       2
4        1       2       1
5        1       1       0

Table 2 RFU latency according to the DFG depth of a CI

Depth of DFG of CI  RFU Delay (ns)
1                   1.38
2                   2.28
3                   3.12
4                   4.89
5                   6.47
6                   7.57
7                   8.65
8                   9.66

RFU with variable delay, in which the latency (clock cycles required for execution) depends on both the DFG depth of the CI and the base processor's clock frequency. This value is kept for each CI as part of its configuration data in the configuration memory. Table 2 shows the latency of the RFU for CIs with depths from one to eight. For example, executing a CI of depth four (4.89 ns) on a base processor with a 300 MHz clock (3.33 ns period) requires two clock cycles. According to the architecture of the RFU, each CI configuration needs 308 bits for control signals and 204 bits for immediate values; therefore, each CI configuration requires 512 bits in total.
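The translation from the Table 2 delays to a per-CI cycle count is a simple ceiling division, sketched below (delay values taken from Table 2; the function name is ours):

```python
import math

# RFU combinational delay (ns) per CI DFG depth, from Table 2.
RFU_DELAY_NS = {1: 1.38, 2: 2.28, 3: 3.12, 4: 4.89,
                5: 6.47, 6: 7.57, 7: 8.65, 8: 9.66}

def rfu_cycles(dfg_depth, clock_mhz):
    """Clock cycles needed to execute a CI of the given DFG depth."""
    cycle_ns = 1000.0 / clock_mhz          # clock period in ns
    return math.ceil(RFU_DELAY_NS[dfg_depth] / cycle_ns)
```

For instance, a depth-4 CI at 300 MHz takes two cycles (4.89 ns / 3.33 ns), matching the example in the text; this per-CI cycle count is what is stored in the configuration memory and loaded into the counters.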

5 Integrated framework

After fixing the final architecture of the RFU, some of the generated CIs (9.5%) could not be mapped on the RFU due to limitations on hardware resources and connections. To solve this issue, an Integrated Framework is proposed, which performs an integrated temporal partitioning and mapping process to generate mappable CIs. The basic idea of this framework is inspired by our previous work [19]. The design flow of the Integrated Framework is shown in Fig. 12. This flow takes CIs that cannot be mapped (rejected CIs) as input and partitions them into CIs that can be mapped on the RFU. In our methodology, a DFG corresponds to a CI, and the partitions obtained from the integrated temporal partitioning process are exactly the CIs that are mappable on the RFU. In the first stage, the primary constraints of the RFU (8 inputs, 6 outputs, and

Fig. 12 Design flow of the Integrated Framework

16 FUs) are considered to generate initial partitions. Then, for each CI generated in the first step, the mapping process is performed, and the generated CIs are accepted and finalized if they can be mapped on the RFU. Otherwise, an incremental temporal partitioning algorithm modifies the CI (partition) by moving some of its nodes to the subsequent CI (partition), and the mapping process is repeated. This continues iteratively until all partitions are mapped successfully on the RFU. We attempted two different partitioning approaches: (i) Horizontally Traversing Temporal Partitioning (HTTP) and (ii) Vertically Traversing Temporal Partitioning (VTTP). HTTP traverses DFG nodes horizontally, according to their ASAP levels, and adds them to the current partition while the architectural constraints are satisfied. This algorithm usually exposes more parallelism in instruction execution and may therefore increase the required intermediate data size; the intermediate data size, in turn, affects the data transfer rate and the configuration memory size. VTTP traverses DFG nodes vertically. Although it creates partitions with longer critical paths, it reduces the intermediate data size. For more details, refer to [20, 21].
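The core loop of HTTP can be sketched as follows. This is a deliberately reduced sketch: for brevity it checks only the FU-count constraint, whereas the real algorithm in [20, 21] also checks the input, output, width, depth, and connectivity constraints before closing a partition:

```python
# Hedged sketch of Horizontally Traversing Temporal Partitioning (HTTP):
# traverse DFG nodes in ASAP order and close the current partition
# whenever an RFU constraint (here, only the FU count) would be violated.

def http_partition(nodes_by_asap, max_fus=16):
    """nodes_by_asap: list of node ids already sorted by ASAP level."""
    partitions, current = [], []
    for node in nodes_by_asap:
        if len(current) == max_fus:    # constraint violated: start a new CI
            partitions.append(current)
            current = []
        current.append(node)
    if current:
        partitions.append(current)
    return partitions
```

A rejected 20-node CI would thus be split into a mappable 16-node CI and a 4-node CI; VTTP differs only in traversal order, following paths downward instead of sweeping ASAP levels.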

6 Integrating the RFU with the base processor

As mentioned in Sect. 3, AMBER includes a reconfigurable functional unit that is tightly coupled to the base processor. The design details of the RFU were explained in Sect. 4. In this section, we first present the details of integrating the RFU with the base processor, then show how the overall system works and discuss the mechanisms for executing CIs on the RFU in more detail. To interface the application code running on the base processor with the RFU, a new instruction


Fig. 13 Integrating the RFU and the base processor

referred to as mtc1 (Sect. 4.2), is utilized. This instruction invokes a CI, loads the corresponding configuration bit-stream from the configuration memory, and triggers execution of the CI on the RFU. Figure 13 depicts how the RFU is integrated with the base processor. As can be seen in the figure, the input and output ports of the base processor's functional units are shared with the RFU. With this technique there is no need to add read/write ports to the register file; the disadvantage is that the RFU and the functional units cannot execute in parallel. On the other hand, because the RFU executes a sequence of instructions, it usually could not be used in parallel with the processor's functional units anyway. In a conventional RISC processor, the register read/write signals are generated by the decode stage. In the proposed design, two sets of signals control the read/write signals of the register file: (i) the signals from the decode stage and (ii) the configuration bits kept in the configuration memory. As mentioned in Sect. 4.2, an mtc1 instruction signals the start of a CI. At this point the counters are activated and loaded with the number of clock cycles specified for the CI in the configuration memory, along with the RFU control signals and immediate inputs. In addition, the configuration bits enabling the CI's input and output registers of the register file are selected, and the RFU output is selected simultaneously. Since the PFUs and the RFU share the input registers, the input data is ready for both; the multiplexer at their outputs selects which valid output is sent to the next stage. The processor waits for the specified number of cycles (determined by the content of the counter) until execution of the CI finishes on the RFU. The signal generated by the counter to control the multiplexer selectors is used as the stall signal as well.
When execution finishes, the counters become inactive, and the multiplexers select the decode signals for controlling the register file and the output of the base processor's functional units. As an example, Fig. 14 shows a scenario in which the mtc1 is seen in clock cycle n + 1. Execution of the CI is assumed to take four clock cycles; therefore, when the mtc1 is seen, the counters are loaded with four. Since the counters hold a nonzero value, they force the multiplexers to select the configuration bits for enabling


Fig. 14 Switching between PFUs and RFU

the required input/output registers and to select the output of the RFU. After four clock cycles, when the counters reach zero, the multiplexers select the decode control signals for enabling the needed input/output registers and select the output of the processor functional units (PFUs). The rest of the instructions are then executed on the PFUs as usual. Figure 13 shows the RFU with four outputs, whereas we mentioned earlier that the RFU has six outputs. To support six outputs without adding write ports to the register file, we added two registers to the RFU. For CIs with more than four outputs, the extra values are stored in these registers: four outputs are written in one cycle, and the remaining ones in the next cycle. Therefore, for these CIs, execution takes one more cycle.
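The counter-driven switching of Figs. 13 and 14 can be modeled behaviorally as below. This is an assumed abstraction, not the actual RTL: each trace entry is one clock cycle, `"-"` marks a stall slot, and the returned list records which unit's output the multiplexer selects:

```python
# Behavioral sketch of the switching in Figs. 13-14: while the counter is
# nonzero the multiplexers select the RFU and its configuration bits; when
# it returns to zero they fall back to the decode-stage signals and the
# processor functional units (PFUs).

def run_pipeline(trace, ci_cycles):
    """trace: per-cycle instruction tags; 'mtc1' starts a CI of ci_cycles."""
    counter, selected = 0, []
    for instr in trace:
        if counter > 0:
            selected.append("RFU")     # processor stalled, RFU executing
            counter -= 1
        elif instr == "mtc1":
            counter = ci_cycles        # counter loaded from config memory
            selected.append("RFU")
            counter -= 1
        else:
            selected.append("PFU")     # normal execution on the PFUs
    return selected
```

With the Fig. 14 scenario (mtc1 followed by a four-cycle CI), the multiplexer selects the RFU for four consecutive cycles and then returns to the PFUs.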

7 Experimental results

Experiment setup. Our simulation environment is based on Simplescalar. A 4-issue in-order RISC processor is employed as the base processor; more details of its configuration are given in Table 3. The experiments used 19 applications from the Mibench benchmark suite.

Effect of clock frequency on the speedup. As mentioned, the RFU has a variable latency that depends on the depth of the DFG of the mapped CI and the clock frequency of the base processor. Using the data in Table 2, we measured the speedup for five clock frequencies (200, 250, 300, 350, and 400 MHz). Figure 15 shows the speedup for selected applications as well as the average speedup. For a 300 MHz base processor, speedup ranges from 1.05 for adpcm and qsort to 1.33 for sha. For applications like qsort and adpcm, which have small HBBs, most of the generated CIs were rejected or were so small that no substantial performance improvement could be obtained. In susan and gsm, multiplication covered 13 and 7% of the dynamic instructions (instructions committed by the

Table 3 Base processor configuration

Issue                               4-way
L1-I cache                          32K, 2 way, 1 cycle latency
L1-D cache                          32K, 4 way, 1 cycle latency
Unified L2                          1M, 6 cycle latency
Execution units                     4 integer, 4 floating point
RUU size                            64
Fetch queue size                    64
Branch predictor                    bimodal
Branch prediction table size        2048
Extra branch misprediction latency  3

Fig. 15 Speedup for some of Mibench applications using the RFU

processor), respectively, so adding a multiplier to the RFU could further enhance performance for these applications. For the applications with higher speedups, such as sha, stringsearch, rijndael, and gsm, the CIs covered a high percentage of the dynamic instructions: 41, 51, 45, and 32%, respectively. The average speedup for a 300 MHz base processor is 1.16.

Execution time of the configuration phase. To measure the execution time of the configuration phase, the tool chain of Fig. 4 was run on a Pentium 4 (3.6 GHz CPU) with 1 GB of main memory. Table 4 contains the results. Profiling took longer for applications with larger numbers of HBBs and branches; this is why lame, which has the largest number of branches and HBBs, has the longest configuration phase. In the configuration phase, most of the time was spent running the simulator and profiling, whereas the other processes (e.g., generating CIs, mapping) take less than 2 seconds.

Size of configuration memory. Each CI configuration needs 512 bits. Therefore, for an application such as rijndael, with 117 generated CIs, around 7.4 KB of memory is needed to keep the configuration data. Two techniques were used to reduce the size of the configuration memory: (i) similarity detection and (ii) merging of CIs.

Table 4 Execution time of the configuration phase

Application  Exec. time (s)    Application   Exec. time (s)
adpcm        225               gsm           461
bitcounts    331               lame          526
blowfish     94                patricia      84
basicmath    34                qsort         233
cjpeg        132               rijndael      68
dijkstra     101               sha           29
djpeg        75                stringsearch  3
fft          9                 susan         122
crc          36                Average       150.8

Fig. 16 Dividing 512 bits into four parts: P1 = 155 bits (func + connect), P2 = 92 bits (inputs), P3 = 61 bits (outputs), P4 = 204 bits (immediates)
Two CIs can be similar in terms of: (i) their nodes (FUs) and connections, (ii) their inputs, (iii) their outputs, and (iv) their immediates. In most cases, the configuration bits for the FUs and connections of two CIs are the same while their inputs, outputs, or immediates differ. Detecting similarity of nodes (FUs) and connections requires a special similarity-detection process, so node and connection similarity is treated separately from the other kinds. Two types of similarity were defined for FUs and connections: complete similarity and subset similarity. Two CIs are completely similar if the functionality of their nodes (FUs) and their connections are the same; two CIs have subset similarity if one CI is completely similar to a subset of the other, considering node and connection similarity. A graph isomorphism algorithm was used to explore the node and connection similarity of two DFGs. To support similarity detection, the CI configuration data was divided into four parts; in other words, the RFU was provided with partial reconfiguration: (i) 155 bits for selecting the operations of the FUs and the selectors of the intermediate multiplexers (connections) (P1), (ii) 92 bits for the inputs (P2), (iii) 61 bits for the outputs (P3), and (iv) 204 bits for the immediate values (P4), as shown in Fig. 16. Two CIs have similar inputs, outputs, or immediates if their P2, P3, or P4, respectively, are exactly the same. By generating one configuration for similar P1, P2, P3, or P4 parts, the size of the configuration memory is decreased. We also try to reduce the configuration memory size by merging small CIs of each application, a process we call spatial merging. For example, suppose an application has two CIs, one of length 6 and the other of length 7, which have neither complete nor subset similarity; if these two CIs can be merged into one configuration, their two P1s are merged into one.
However, there are still separate P2s, P3s, and P4s for each CI. These two techniques were utilized to reduce the configuration memory size. Figure 17 shows the six applications with the largest numbers of CIs. For each application, the leftmost bar shows the number of initially generated CIs, and the second to fifth bars give the total numbers of required P1s, P2s, P3s, and P4s, respectively, after applying the similarity detection and merging techniques.

Fig. 17 Number of initial CIs and generated P1, P2, P3 and P4 for some applications

For example, for rijndael, 117 CIs were generated initially but only eight P1s were needed; the P2s of most of the CIs, however, were different. Using this method, the size of the configuration memory decreased from 7.4 to 4.4 KB, a 40% improvement.

Effect of long connections on the mapping rate. To determine the effect of the RFU connections with lengths greater than one, all of them were deleted and only the connections of length one were kept. In this architecture, the outputs of each row can be used as inputs only by the FUs in the subsequent row, and the input ports feed only the first row (similar to [9, 10]). To pass data from one FU to an FU in a nonsubsequent row, or from the input ports to FUs in rows other than the first, move instructions must be inserted in the intermediate FUs. We applied the CIs generated for the proposed RFU (Fig. 11) to this architecture. According to the results, 12.95% of the CIs could not be mapped due to hardware resource limitations. This shows that the longer connections allow the FUs to be employed more effectively.

Effect of more write ports on the speedup. In Sect. 6, it was mentioned that CIs with more than four outputs need one extra clock cycle for execution. To observe the effect of this extra cycle on speedup, we assumed that two more write ports were added to the register file so that the extra clock cycle would not be needed. Figure 18 compares the speedup for this case with the case requiring the extra clock cycle, assuming a 300 MHz base processor. The results in Fig. 18 show that the average speedup increases by only 3.1%. Supporting more than four outputs directly would incur considerable register-file area and energy overhead for a very small performance improvement; consequently, increasing the number of write ports of the register file is not worthwhile.
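The configuration-memory deduplication described earlier, splitting each 512-bit configuration into P1 through P4 and storing identical parts only once, can be sketched as below. The encoding is illustrative (parts represented as opaque strings), not the actual bit layout:

```python
# Sketch of configuration-memory deduplication: each 512-bit CI
# configuration is split into P1 (155 b), P2 (92 b), P3 (61 b), and
# P4 (204 b); only unique parts of each kind are stored.

def dedup_config_bits(cis):
    """cis: list of (p1, p2, p3, p4) part ids; returns total bits stored."""
    widths = {"p1": 155, "p2": 92, "p3": 61, "p4": 204}
    unique = {k: set() for k in widths}
    for p1, p2, p3, p4 in cis:
        unique["p1"].add(p1)
        unique["p2"].add(p2)
        unique["p3"].add(p3)
        unique["p4"].add(p4)
    return sum(len(unique[k]) * widths[k] for k in widths)
```

For three CIs of which two share the same P1, the sketch stores 1289 bits instead of the 1536 bits (3 × 512) of the unshared layout, which is the same effect that shrank rijndael's configuration memory from 7.4 to 4.4 KB.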
Comparing HTTP and VTTP. The speedup values presented in Figs. 15 and 18 were obtained using the Integrated Framework and the HTTP algorithm. Figure 19 compares the speedup achieved by HTTP and VTTP for a 300 MHz base processor, and also shows the speedup for the case


Fig. 18 Speedup for RFU with 4 output ports vs. RFU with 6 output ports

in which the Integrated Framework is not used. With both HTTP and VTTP, all CIs were mapped successfully on the RFU, but HTTP resulted in better speedup since it benefits from more instruction-level parallelism: the DFG depth of the CIs is smaller with HTTP, so the CIs require fewer clock cycles to execute on the RFU. Moreover, without the Integrated Framework the speedup is reduced, due to the effect of the rejected CIs.

8 Conclusions and future work

Using a quantitative approach, we proposed an RFU for an adaptive extensible processor called AMBER. AMBER has four main components: a base processor, a coarse-grain reconfigurable functional unit (RFU), a configuration memory, and counters. It operates in two phases: in the configuration phase it learns the custom instructions (CIs), and in the normal phase the CIs are executed on the RFU while the rest of the application runs on the base processor's functional units. The RFU has 8 inputs, 6 outputs, and 16 FUs. By adding four longer connections between rows and facilitating input access, we improved the mapping rate by 13%. Exploiting the proposed RFU, performance was improved by up to 33%, with an average of 16%. The experimental results show that speedup was higher when a high percentage of dynamic instructions was covered by CIs. We used an integrated temporal partitioning and mapping framework to partition large CIs and map them on the RFU. The results show that the Horizontally Traversing Temporal Partitioning method yields higher speedup than the Vertically Traversing Temporal Partitioning method. Furthermore, the configuration memory

Fig. 19 Obtained speedup for HTTP and VTTP

was reduced by 40% by making the RFU partially reconfigurable, generating one configuration for similar CIs, and merging small CIs. To increase the speedup for more applications, the CIs and the RFU can be extended to support multiplication and floating-point instructions. Another direction, as our next step, is relaxing CIs to span multiple HBBs. Moreover, in future work we plan to study the effect of the augmented RFU on the energy consumption of the processor.

References

1. Altman E, Ebcioglu K (200) DAISY dynamic binary translation software. In: Software manual for DAISY open source release
2. Altman E, Gschwind M (2004) BOA: a second generation DAISY architecture. In: 31st international symposium on computer architecture
3. Arnold M, Corporaal H (2001) Designing domain specific processors. In: Proceedings of the 9th international workshop on hardware/software codesign, Copenhagen, April 2001, pp 61–66
4. Atasu K, Pozzi L, Ienne P (2003) Automatic application-specific instruction-set extension under microarchitectural constraints. In: 40th design automation conference
5. Black B, Shen JP (2000) Turboscalar: a high frequency high IPC microarchitecture. In: 27th international symposium on computer architecture
6. Bobda C (2003) Synthesis of dataflow graphs for reconfigurable systems using temporal partitioning and temporal placement. Ph.D. thesis, Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn
7. Carrillo JE, Chow P (2002) The effect of reconfigurable units in superscalar processors. In: ACM/SIGDA symposium on field programmable gate arrays, pp 141–150
8. Clark N, Zhong H, Mahlke S (2003) Processor acceleration through automated instruction set customization. In: The 36th annual IEEE/ACM international symposium on microarchitecture
9. Clark N, Kudlur M, Park H, Mahlke S, Flautner K (2004) Application-specific processing on a general-purpose core via transparent instruction set customization. In: The 37th annual IEEE/ACM international symposium on microarchitecture
10. Clark N, Blome J, Chu M, Mahlke S, Biles S, Flautner K (2005) An architecture framework for transparent instruction set customization in embedded processors. In: International symposium on computer architecture
11. Cong J, Fan Y, Han G, Zhang Z (2004) Application-specific instruction generation for configurable processor architectures. In: ACM/SIGDA symposium on field programmable gate arrays
12. Goodwin D, Petkov D (2003) Automatic generation of application specific processors. In: International conference on compilers, architecture, and synthesis for embedded systems
13. Hauck S, Fry T, Hosler M, Kao J (1997) The Chimaera reconfigurable functional unit. In: IEEE symposium on FPGAs for custom computing machines, April 1997, pp 87–96
14. Hauser JR, Wawrzynek J (1997) Garp: a MIPS processor with a reconfigurable coprocessor. In: IEEE symposium on FPGAs for custom computing machines, April 1997
15. Kastner R, Kaplan A, Memik S, Bozorgzadeh E (2002) Instruction generation for hybrid reconfigurable systems. In: ACM transactions on design automation of embedded systems (TODAES), October 2002
16. Klaiber A (2000) The technology behind Crusoe processors. Transmeta Technical Report
17. Lee MH, Singh H, Lu G, Bagherzadeh N, Kurdahi FJ (2000) Design and implementation of the MorphoSys reconfigurable computing processor. J VLSI Signal Process Syst Signal Image Video Technol
18. Lodi A, Toma M, Campi F, Cappelli A, Canegallo R, Guerrieri R (2003) A VLIW processor with reconfigurable instruction set for embedded applications. IEEE J Solid-State Circuits 38:1876–1886
19. Mehdipour F, Saheb Zamani M, Sedighi M (2006) An integrated temporal partitioning and physical design framework for static compilation of reconfigurable computing systems. Microprocess Microsyst 30:52–62
20. Mehdipour F, Noori H, Saheb Zamani M, Murakami K, Inoue K, Sedighi M (2006) Custom instruction generation using temporal partitioning techniques for a reconfigurable functional unit. In: IFIP international conference on embedded and ubiquitous computing (EUC'06)
21. Mehdipour F, Noori H, Saheb Zamani M, Murakami K, Sedighi M, Inoue K (2006) An integrated temporal partitioning and mapping framework for handling custom instructions on a reconfigurable functional unit. In: 11th Asia-Pacific computer systems architecture conference (ACSAC 2006)
22. Mei B, Vernalde S, Verkest D, Lauwereinsg R (2004) Design methodology for a tightly coupled vliw/reconfigurable matrix architecture: a case study. In: Proc. design, automation and test in Europe
23. Miyamori T, Olukotun K (1998) A quantitative analysis of reconfigurable coprocessors for multimedia applications. In: Symposium on FPGAs for custom computing machines
24. Micheli GD (1994) Synthesis and optimization of digital circuits. McGraw–Hill, New York
25. Noori H, Murakami K, Inoue K (2006) A general overview of an adaptive dynamic extensible processor. In: Proc. workshop on introspective architecture
26. Noori H, Mehdipour F, Murakami K, Inoue K, Saheb Zamani M (2006) A reconfigurable functional unit for an adaptive dynamic extensible processor. In: 16th IEEE international conference on field programmable logic and applications (FPL 2006), pp 781–784
27. Patel S, Lumetta S (2000) rePLay: a hardware framework for dynamic optimization. IEEE Trans Comput 50:590–608
28. Peymandoust A et al (2003) Automatic instruction set extension and utilization for embedded processors. In: Application-specific systems, architectures, and processors
29. Razdan R, Smith M (1994) A high-performance microarchitecture with hardware-programmable functional units. In: The 27th annual IEEE/ACM international symposium on microarchitecture, pp 172–180
30. Rosner R, Almog Y, Moffie M, Schwartz N, Mendelson A (2004) Power awareness through selective dynamically optimized traces. In: 31st international symposium on computer architecture
31. Sassone PG, Scott Wills D (2004) Dynamic strand: collapsing speculative dependence chains for reducing pipeline communication. In: The 37th annual IEEE/ACM international symposium on microarchitecture
32. Sun F et al (2002) Synthesis of custom processors based on extensible platforms. In: International conference on computer aided design
33. Vassiliadis S, Wong S, Gaydadjiev G, Bertels K, Kuzmanov G, Panainte EM (2004) The MOLEN polymorphic processor. IEEE Trans Comput 53:1363–1375
34. Yu P, Mitra T (2004) Scalable custom instructions identification for instruction-set extensible processors. In: International conference on compilers, architecture, and synthesis for embedded systems
35. Yu P, Mitra T (2004) Characterizing embedded applications for instruction-set extensible processors. In: Design automation conference
36. 3DSP Corp. http://www.3dsp.com
37. Altera Corp. http://www.altera.com
38. ARC International http://www.arc.com
39. CoWare Inc. http://www.coware.com
40. Improv Systems Inc. http://www.improvsys.com
41. Mibench www.eecs.umich.edu/mibench
42. Simplescalar www.simplescalar.com
43. Stretch Inc. http://www.stretchinc.com
44. Tensilica Inc. http://www.tensilica.com
45. Warp Processors http://www.cs.ucr.edu/~vahid/warp/
46. Xilinx Inc. http://www.xilinx.com
47. Synopsys Inc. http://www.synopsys.com/products/logic/design_compiler.html
48. Tarjan D, Thoziyoor Sh, Jouppi NP (2006) Cacti 4.0. HP Laboratories, Technical Report