
CRISP: A Template for Reconfigurable Instruction Set Processors

Pieter Op de Beeck, Francisco Barat, Murali Jayapala and Rudy Lauwereins

KULeuven, Kasteelpark Arenberg 10, Leuven-Heverlee 3001, Belgium
{Pieter.OpdeBeeck, Francisco.Barat, Murali.Jayapala, Rudy.Lauwereins}@esat.kuleuven.ac.be

Abstract. A template for reconfigurable instruction set processors is described. This template defines a design space that enables the exploration of processors potentially suitable for flexible, power- and cost-efficient implementations of embedded multimedia applications, such as video compression in a handheld device. The template is based on a VLIW processor with a reconfigurable instruction set. In the future this template will be used for design space exploration, compiler retargeting and automatic hardware synthesis. Several existing reconfigurable and non-reconfigurable processors were mapped onto the template to assess its expressiveness.

1 Introduction

Current and future multimedia applications such as 3D rendering, video compression or object recognition are characterized by computationally intensive algorithms with deeply nested loop structures and hard real-time constraints. It has been demonstrated that implementing this type of application in embedded systems (e.g. multimedia terminals, notebook computers or cellular phones) leads to a power optimization problem with time and area constraints [1]. Although an application specific integrated circuit (ASIC) might give the best results in terms of power and speed, other cost factors have to be considered as well. One of them is design time, which for an ASIC is extremely high compared to writing software for a processor. Related to this are the higher non-recurring costs of ASIC designs (e.g. building prototypes) compared to software designs. Another cost is introduced when trying to cope with different applications, changing standards or different levels of quality of service, as in MPEG-4 [2].

We can therefore conclude that embedded implementations of multimedia applications should be power efficient while meeting the area and time budget. Furthermore, they should be flexible and economically viable to design in the first place. It is common practice that, in order to meet the design goals stated above,
− the embedded system should be processor based, and
− the processor should have many processing elements working simultaneously (temporally as in VLIW processors [3] and spatially as in FPGAs [4]).

Also, we believe that the instruction set should be modifiable at runtime by using a tightly coupled reconfigurable unit (i.e. a reconfigurable instruction set processor [6]). However, no methodology exists yet to decide whether this holds for a given embedded application. In order to make this decision we need a description of the design space that covers both the reconfigurable and the non-reconfigurable processors suitable for embedded multimedia systems. The goal of this paper is to describe a template processor architecture that covers this design space. We will call instances of this template “(C)onfigurable, (R)econfigurable (I)nstruction (S)et (P)rocessors” or CRISPs. Configuration is done at design time, while reconfiguration is done at runtime. Viewed as a configurable processor, CRISP is similar to parameterizable processors such as PlayDoh [14] and ARC Cores [15]. Its runtime reconfiguration capabilities are comparable to existing reconfigurable processors like Chimaera [11] and OneChip [5].

In the future this template will be used for design space exploration, compiler retargeting and automatic hardware synthesis. Indeed, to execute an application on a CRISP, three main steps are needed. First, a processor is instantiated from the CRISP template. This gathers all the information the compiler/synthesis tools need to map the source code to an executable, which is the second step. The final step is to synthesize the CRISP processor, again using the compiler/synthesis tools.

The reconfigurability of CRISP will be addressed in the next section. In section three we describe the CRISP template as a configurable processor architecture. In section four, a number of case studies validate that CRISP covers the processor architectures we are interested in. In the final section we present the conclusions of this paper and outline future work.

2 Reconfigurable Processor Architecture

A reconfigurable instruction set denotes a runtime modifiable instruction set. Several academic and commercial instruction set architectures with this characteristic exist [9, 10, 11]. The common thread among these is a RAM based memory that controls or even implements the instruction decoding. This memory is called decoder memory or configuration memory (the term used in the FPGA community). A few examples are shown in Figure 1. In the first example, the ARM Thumb [9] architecture, two different instruction set decoders are implemented. A bit in a special register decides which one is used (this bit is considered to be the decoder memory). The REAL DSP [10], as a second case, has a set of reconfigurable instructions. The decoder is not fixed for these, but is implemented as a RAM (decoder memory) indexed by the opcode of the reconfigurable instruction. If the instruction is not a reconfigurable one, a standard hardwired decoder is used. Finally, Chimaera [11] has a totally reconfigurable decoder for one of its functional units. This decoder is implemented with a single RAM, addressed by the opcode.
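The three decoding schemes described above can be contrasted with a minimal Python sketch. It only models which table or decoder an instruction is routed to. All function names and the toy decoder contents are illustrative assumptions and are not taken from the ARM, REAL DSP or Chimaera documentation.

```python
from typing import Callable, Dict, List

# All names and table contents here are hypothetical, chosen for illustration.


def make_mode_bit_decoder(decode_a: Callable[[int], str],
                          decode_b: Callable[[int], str]):
    """ARM Thumb style: a one-bit decoder memory selects one of two fixed decoders."""
    state = {"mode_bit": 0}

    def set_mode(bit: int) -> None:
        state["mode_bit"] = bit & 1

    def decode(instr: int) -> str:
        return decode_b(instr) if state["mode_bit"] else decode_a(instr)

    return set_mode, decode


def make_hybrid_decoder(hardwired: Callable[[int], str],
                        reconfigurable_opcodes: range,
                        decoder_memory: Dict[int, str]):
    """REAL DSP style: RAM lookup for reconfigurable opcodes, hardwired otherwise."""
    def decode(opcode: int) -> str:
        if opcode in reconfigurable_opcodes:
            return decoder_memory[opcode]   # entry can be rewritten at runtime
        return hardwired(opcode)
    return decode


def make_ram_decoder(decoder_memory: List[str]):
    """Chimaera style: the opcode addresses a single RAM that holds every decoding."""
    def decode(opcode: int) -> str:
        return decoder_memory[opcode]
    return decode
```

In this sketch, reconfiguring means writing to the mode bit or to an entry of the decoder memory. The instruction stream itself does not change.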

[Figure 1: three block diagrams, (a), (b) and (c). Block labels: instruction, decoder 32, decoder 16, decoder, instruction decoder, decoder memory, select.]

Fig. 1. Instruction decoding in the ARM Thumb (a), the REAL DSP (b) and the Chimaera (c)

The sizes of the decoder memories in Figure 1 naturally differ. The decoder memory of the ARM Thumb is only one bit, while for the REAL DSP it is 256x96 bits. The decoder memory of the Chimaera is actually the FPGA configuration memory, which contains an estimated 45056 bits. These examples show the continuum of decoder memory sizes for reconfigurable decoders.

There are power and speed considerations when using (very) wide decoder memories. Each time a different configuration is needed, a large number of bits has to be loaded from external memory, which is a time consuming process. That is why most current FPGAs are not runtime reconfigurable. In order to reduce these loading times, the decoder memory can contain several contexts. These are implemented as the addressable RAM in the REAL and Chimaera processors. The actual reconfigurable operation is then an address pointing to the desired entry in this decoder memory, which selects the required context. In effect the instruction is aggressively encoded: an address is fetched from instruction memory, and only later is this address ‘decoded’ into a context. In other words, the size of the Very Long Instruction Word (VLIW) grows only with the number of contexts, instead of with the number of configuration bits, which is itself proportional to the number of processing elements. As a result, there is no longer any need to transfer a huge number of configuration bits through the internal memory system. Instead, an external memory management unit, operating at a much lower bandwidth, can take over this transfer. However, for this to work the decoder memory should be distributed over all processing elements, otherwise there would still be a power and speed bottleneck between decoder memory and processing elements. Nevertheless, we consider the decoder memory a part of the memory hierarchy as described in section 3.3.

Loading times are further amortized by grouping processing elements into segments. These are defined as the smallest unit of reconfiguration. Having several segments leads to a scheme where some segments are being reconfigured while others are executing an operation. This effectively hides reloading times.
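The encoding argument can be made concrete with a short calculation. The sketch below assumes a plain binary context index (an assumption, since the paper does not specify the encoding) and uses the 256x96 decoder memory quoted above for the REAL DSP. The function names are hypothetical.

```python
def context_index_bits(num_contexts: int) -> int:
    """Bits an instruction needs to address one of num_contexts decoder memory entries.

    Assumes a plain binary index; other encodings are possible.
    """
    if num_contexts < 1:
        raise ValueError("need at least one context")
    return max(1, (num_contexts - 1).bit_length())


def decoder_memory_bits(num_contexts: int, config_word_bits: int) -> int:
    """Total storage held in the decoder memory."""
    return num_contexts * config_word_bits


# Figures taken from the text: 256 entries of 96 bits each.
index_bits = context_index_bits(256)          # bits carried in the instruction
stored_bits = decoder_memory_bits(256, 96)    # bits kept next to the data path
bits_saved_per_fetch = 96 - index_bits        # bits not fetched per instruction
```

Under these assumptions each reconfigurable operation carries an 8-bit index instead of a 96-bit configuration word, while the 24576 configuration bits stay in the decoder memory.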

3 Configurable Processor Architecture

The CRISP template is organized in a hierarchical way, as depicted in Figure 2. Fixing all the associated parameters instantiates a particular CRISP processor. As was outlined in the introduction, this step is termed design time configuration.

[Figure 2: hierarchy diagram. Node labels: CRISP, global interconnect, CPU core, memory cluster, data path, functional unit, register file, segment, processing element, memory cluster interconnect, inter-cluster interconnect, cluster, memory, decoder, decoder memory, intra-cluster interconnect, inter-segment interconnect, intra-segment interconnect.]

Fig. 2. CRISP hierarchy
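Design time configuration can be pictured as filling in a nested parameter record. The following Python sketch is one possible reading of the hierarchy in Figure 2. The class and field names, the nesting, and the choice to attach the decoder memory size to the functional unit are all assumptions made for illustration and not the authors' definition of the template. The example instance loosely follows the MIPS32 4Kp mapping of section 4.1 (one functional unit with four processing elements).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    processing_elements: int          # PEs that are reconfigured together


@dataclass
class FunctionalUnit:
    segments: List[Segment]
    decoder_memory_bits: int = 0      # assumption: 0 means a fixed decoder

    @property
    def reconfigurable(self) -> bool:
        return self.decoder_memory_bits > 0


@dataclass
class RegisterFile:
    registers: int
    width_bits: int


@dataclass
class Cluster:
    register_files: List[RegisterFile]
    functional_units: List[FunctionalUnit]


@dataclass
class CpuCore:
    clusters: List[Cluster]


@dataclass
class MemoryCluster:
    memories: List[str]
    has_decoder: bool = False


@dataclass
class Crisp:
    cores: List[CpuCore]
    memory_clusters: List[MemoryCluster] = field(default_factory=list)

    def _functional_units(self) -> List[FunctionalUnit]:
        return [fu for core in self.cores
                for cluster in core.clusters
                for fu in cluster.functional_units]

    def processing_elements(self) -> int:
        return sum(seg.processing_elements
                   for fu in self._functional_units()
                   for seg in fu.segments)

    def is_reconfigurable(self) -> bool:
        return any(fu.reconfigurable for fu in self._functional_units())


# Hypothetical instance: a single-cluster RISC-like core with a fixed decoder.
simple_risc = Crisp(
    cores=[CpuCore(clusters=[Cluster(
        register_files=[RegisterFile(registers=32, width_bits=32)],
        functional_units=[FunctionalUnit(segments=[Segment(processing_elements=4)])],
    )])],
    memory_clusters=[MemoryCluster(memories=["L1 I-cache"])],
)
```

A design space exploration tool could enumerate such records and evaluate each candidate. The sketch only shows the data structure.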

3.1 CRISP

At the top level in Figure 2, a CRISP consists of CPU cores, memory clusters and global interconnect. The template covers both single and multiprocessor architectures. Figure 3 shows such a CRISP multiprocessor with its memory hierarchy made out of a number of memory clusters, but for clarity only the complex one is indicated. Furthermore, no decoder memory is visible since it is distributed inside the CPU cores, as was motivated in section 2.

[Figure 3: block diagram. Block labels: main memory, cluster 1, L2 cache, L1 I-cache (twice), L1 Loop-cache, L1 D-cache, CPU core (twice).]

Fig. 3. A CRISP architecture at the highest level

3.2 CPU Core

At the level of the CPU core the characteristics of the instruction set are defined. In CRISP, an instruction is a group of operations that execute concurrently, in a manner similar to VLIW processors [8], which are a powerful way to exploit this parallelism [3]. In the case of a reconfigurable instruction set, a specific number of operations (not necessarily all of them) will change their behavior at runtime.

In order to reduce the effect of branch latencies, CRISP processors can include speculative execution, branch prediction and predication. The latter affects the instruction format and introduces one or more predicate registers. Predication is a technique typically used in VLIW processors to collapse two or more control flows into a single one. This reduces the number of branches and allows powerful scheduling techniques such as software pipelining.

The execution pipeline has N stages, which are grouped into fetch, decode, execute and write back phases, as in most processors. All instructions require a fixed number of pipeline stages for the fetch, decode and write back phases (assuming there are no instruction cache misses), but need a varying number of execute stages depending on the operation type.

3.2.1 Data Path and Clusters

The data path can be composed of one or more data path clusters. Each cluster contains one or more register files and one or more functional units. By clustering, the routing length and interconnect complexity inside a cluster are reduced (which is good for both power and speed), at the price of increased compilation complexity due to the additional cluster-to-cluster data transfers. Communication between clusters is still possible by means of the inter-cluster interconnect. This, and all other interconnect resources in the CRISP hierarchy, are modeled using an interconnect matrix, such as the one in Figure 4.
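As a minimal sketch, the interconnect matrix printed with Figure 4 (rows 1 and 2, columns A, B and C) can be held as a nested list and queried for connectivity. The reading that a 1 marks an available connection between a row port and a column port is an assumption, and the helper names are hypothetical.

```python
from typing import List

ROW_PORTS = ["1", "2"]          # row labels as printed with Figure 4
COL_PORTS = ["A", "B", "C"]     # column labels as printed with Figure 4
MATRIX = [
    [1, 1, 0],                  # port 1 connects to A and B
    [0, 0, 1],                  # port 2 connects to C
]


def connected(row_port: str, col_port: str) -> bool:
    """True if the matrix marks a connection between the two ports."""
    r = ROW_PORTS.index(row_port)
    c = COL_PORTS.index(col_port)
    return MATRIX[r][c] == 1


def reachable_from(row_port: str) -> List[str]:
    """All column ports connected to the given row port."""
    r = ROW_PORTS.index(row_port)
    return [col for col, bit in zip(COL_PORTS, MATRIX[r]) if bit]


def link_count() -> int:
    """Number of connections; a rough proxy for interconnect cost."""
    return sum(sum(row) for row in MATRIX)
```

A compiler could consult such a matrix to check whether a value produced in one cluster can be routed to a unit in another, and a synthesis tool could use the same matrix to instantiate only the marked links.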

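[Figure 4: two clusters, Cluster 1 (RF1, FU1) and Cluster 2 (RF2, FU2), with port labels A, B, C and 1, 2, shown next to the corresponding interconnect matrix with rows 1, 2 and columns A, B, C:]

$$\begin{array}{c|ccc} & A & B & C \\ \hline 1 & 1 & 1 & 0 \\ 2 & 0 & 0 & 1 \end{array}$$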

Fig. 4. Inter-cluster interconnect modeled as a matrix

In Figure 5 a functional unit is depicted. It is an element of the processor to which operations, generated by the compiler, are issued. Inside the functional unit an operation is further decoded and routed to the different resources the unit is composed of, namely processing elements and interconnect. Processing elements are the subunits of the functional units. This description unifies all types of reconfigurable logic, from the processing elements of fine-grained architectures, such as CLBs in an FPGA, to those of coarse-grained architectures built from ALUs. Furthermore, it allows us to present the functional unit at different levels of abstraction. Which level is needed depends on where it is used. For instance, during instruction scheduling only the notion of a functional unit is required, whether it is a reconfigurable or a fixed one. However, prior to that, some reconfigurable operations had to be mapped onto the reconfigurable functional unit, resulting in the configuration bits. This requires full knowledge of the internal processing elements. As motivated in section 2, the functional unit should be divided into segments because this can hide the long reloading times associated with large amounts of configuration bits.
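How segments can hide reloading times may be illustrated with a small timing model. The sketch below makes several simplifying assumptions that are not in the paper: operations execute strictly in order, every operation needs a fresh configuration, segments are used round-robin, and a single loader serves all segments one at a time. The function name and the cycle counts are hypothetical.

```python
def total_cycles(num_ops: int, exec_cycles: int,
                 reload_cycles: int, num_segments: int) -> int:
    """Cycles needed to run num_ops operations under the assumptions stated above."""
    seg_idle_at = [0] * num_segments   # cycle at which each segment last finished
    loader_idle_at = 0                 # one shared configuration loader
    prev_done = 0                      # operations execute in program order
    for op in range(num_ops):
        seg = op % num_segments
        # A reload can start once the segment is idle and the loader is free.
        load_start = max(seg_idle_at[seg], loader_idle_at)
        loaded = load_start + reload_cycles
        loader_idle_at = loaded
        # Execution waits for both the configuration and the previous operation.
        start = max(loaded, prev_done)
        prev_done = start + exec_cycles
        seg_idle_at[seg] = prev_done
    return prev_done


one_segment = total_cycles(num_ops=4, exec_cycles=10, reload_cycles=8, num_segments=1)
two_segments = total_cycles(num_ops=4, exec_cycles=10, reload_cycles=8, num_segments=2)
```

With these numbers a single segment pays the reload before every operation (72 cycles), while two segments expose only the first reload (48 cycles). If reloading takes longer than execution, the overlap hides only part of the reload time.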

[Figure 5: functional unit diagram. Labels: input ports, opcode, memory ports, output ports, segment 1, segment 2, segment 3, processing elements PE1, PE2, PE3, intra-segment interconnect.]

Fig. 5. Functional unit and segment model

The cycle-accurate timing of the functional unit is specified using a resource model. For each operation performed in the functional unit, this resource model indicates the processing elements used in each cycle of the operation. This information is of vital importance to the compiler/synthesis tools.

3.3 Memory Cluster and Decoder

A memory cluster is defined as a number of memory blocks with common input- and output ports. In Figure 3, for instance, CPU core 1 will receive instructions from either the loop cache or the L1 instruction cache. A memory selection unit decides which one to take. The memory blocks themselves could have a complex internal memory organization (e.g. interleaved memory banks) and local memory control (like loop control, cache replacement policy or FIFO control). Each cluster has a set of ports connecting to other memory clusters. It also has an optional decoder that is used to reduce the instruction traffic on the memory bus (which is power consuming) and to reduce the size of the required program memories (which are expensive). In most modern architectures, only the decoder closest to the data path is actually implemented (normally referred to as the instruction decoder). Having the possibility to place it at different levels of the instruction memory hierarchy extends the decoder concept. In theory, even the decoder memory could be clustered and hierarchical, but we will not focus on this aspect of the design space. Most work only focuses on placing the instruction decoders in the execution pipeline [17]. Some work has been done on placing them in the external memory interface to reduce external memory traffic by compressing the data stream [18].
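The memory cluster with its selection unit can be sketched as follows. The policy shown, in which the first block in priority order that holds the address supplies the word, is an assumption made for illustration because the paper does not specify how the selection unit decides. Class names and memory contents are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class MemoryBlock:
    """One memory block of a cluster, e.g. a loop cache or an L1 instruction cache."""

    def __init__(self, name: str, contents: Dict[int, str]):
        self.name = name
        self.contents = contents

    def lookup(self, address: int) -> Optional[str]:
        return self.contents.get(address)


class MemoryCluster:
    """Blocks sharing one output port; the selection unit picks which block answers."""

    def __init__(self, blocks: List[MemoryBlock]):
        self.blocks = blocks   # listed in priority order (assumed policy)

    def fetch(self, address: int) -> Tuple[str, str]:
        for block in self.blocks:
            word = block.lookup(address)
            if word is not None:
                return block.name, word
        raise KeyError(f"address {address:#x} not present in this cluster")


loop_cache = MemoryBlock("loop cache", {0x10: "loop body op"})
l1_icache = MemoryBlock("L1 I-cache", {0x10: "loop body op", 0x20: "other op"})
cluster = MemoryCluster([loop_cache, l1_icache])
```

Here `cluster.fetch(0x10)` is served by the loop cache and `cluster.fetch(0x20)` falls through to the L1 instruction cache. A miss in every block would be forwarded to another memory cluster through the cluster's ports, which this sketch does not model.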

Several decoder types exist, such as hardware (i.e. fixed) decoders, decoders with decoder memory or a mixture of these two. A decoder can even be software based [16] even though this is usually called instruction set translation or emulation. As indicated in section 2, RAM based decoders are the key point in the template where reconfigurability is accounted for.

4 Case Studies

In this section, several existing processors (both commercial and academic) are mapped onto the CRISP template. The goal behind this mapping is to show that the design space represented by the template covers most of the processor types we are interested in. This would validate CRISP as a design space, which will allow us to make a trade-off between non-reconfigurable (VLIW) and reconfigurable processors in an embedded context. The focus is on the data path of the processors and on the decoding characteristics of the instruction set. Table 1 summarizes the main characteristics of each of the processors.

4.1 MIPS32 4Kp

The MIPS32 4Kp core from MIPS Technologies [7] is a 32-bit MIPS RISC core designed for custom system-on-a-chip applications. The four execution units (ALU, shifter, multiply/divide and branch) can be considered as a functional unit with four distinct processing elements. In this processor, decoding is only performed in the last stage of the program memory hierarchy and is not reconfigurable. This processor represents one of the simplest processors that can be instantiated from the CRISP template.

4.2 TMS320C64x

The TMS320C64x core [8] is a VLIW processor core specifically designed to maximize channel density in communications infrastructure equipment. It contains two identical clusters with four functional units each, and represents a typical clustered processor. The functional units are specialized in multiplication, arithmetic and logical operations, branching and load/store operations. In CRISP terminology, this is a clustered processor without a reconfigurable instruction set.

4.3 ARM7TDMI

The ARM7TDMI [9] is a RISC core designed for simple, low-cost applications. The differentiating characteristic of this processor is that it can use two different instruction sets, the standard ARM instruction set and the Thumb instruction set. Thumb instructions are a subset of the most commonly used 32-bit ARM instructions that have been compressed into 16-bit opcodes. On execution, these 16-bit instructions are decompressed to 32-bit ARM instructions in real time. Designers can use both the 16-bit Thumb and the 32-bit ARM instruction set and therefore have complete flexibility to compile for maximum performance or minimum code size. However, the two instruction sets cannot be active at the same time. The instruction set used is selected through a bit in a status register. This is our first case of a reconfigurable processor that can be instantiated from the CRISP template. It uses the simplest reconfigurable decoder that can be placed in a processor.

4.4 R.E.A.L. DSP

The R.E.A.L. DSP [10] is a traditional DSP with two 16-bit data bus pathways to its Data Computation Unit (DCU). The instruction word length ranges from 16 to 32 bits. With the limited 32-bit instruction word length it is not possible to freely control all the available functional units (2 multipliers, 4 ALUs and 2 address generation units) inside the DCU. A reconfigurable decoder extends these 32 bits to a length of 96 bits (these are referred to as Application Specific Instructions or ASIs). In order to call these instructions from the R.E.A.L. DSP’s 16-bit program memory, selected ASIs are stored in a look-up table and pointed to by a special class of 16-bit opcodes. This look-up table is actually one of the reconfigurable decoders presented in section 2. Up to 256 ASIs can be stored concurrently in the R.E.A.L. DSP, which means that the depth of the decoder memory is 256.

4.5 Chimaera

The Chimaera [11] is a processor for high performance, general-purpose reconfigurable computing. It is based on a fine-grained logic functional unit connected to a MIPS R4000. The reconfigurable logic is organized in 32 rows of 32 logic blocks each (for a total of 1024 logic blocks or processing elements). The reconfigurable decoder is implemented inside one of the functional units, while other functional units have fixed decoding. This means that only one of the functional units is reconfigurable. It is clear that the Chimaera can be instantiated from the CRISP template.

4.6 Remarc

Remarc [12] is a reconfigurable coprocessor connected to a RISC processor. It consists of a global control unit and 64 programmable logic blocks called nano processors. These nano processors have a 16-bit data path that supports arithmetic, logical and shift operations. Each nano processor contains a local instruction memory of 32 entries. The address of the instruction to be executed in the nano processors is selected by the global controller. This address is the same for all the nano processors. It is possible to consider the Remarc coprocessor as a CRISP (without the controlling RISC). The processor now has one functional unit with 64 processing elements. The functional unit has a completely reconfigurable decoder. This case is the extreme, where all functional units are reconfigurable.

Table 1. CRISP parameters of some processors

Parameter                           MIPS32 4Kp  C64x       ARM7      REAL      Chimaera  Remarc
Clusters                            1           2          1         1         1         2
Register files                      1x32x32     2x32x32    1x32x32   1x8x16    1x32x32   ?
Instruction width                   32          32 to 256  16 or 32  16 to 32  32        32
#Independent parallel ops           1           8          1         1         1         1
Number of units active in parallel  1           8          1         8         1024      64
Number of units                     4           8          4         8         1024      64
Pre-execution stages                1           6          2         ?         ?         ?
Execution stages                    4           1 to 5     1         ?         4         ?
Predication                         No          All units  No        No        No        No
Memory ports                        1x32        2x64       1x32      ?         1         ?
Decoder memory                      0           0          1         256x96    45056     65536

5 Conclusions and Future Work

In this paper we have described a template processor architecture, named CRISP. CRISP covers the design space of reconfigurable and non-reconfigurable processors suitable for embedded multimedia systems. A CRISP instance is a VLIW processor with, optionally, reconfigurable functional units. It can be configured at design time and reconfigured at runtime. Several existing processors have been mapped onto the CRISP template to test it. The results have shown that the most interesting processors can be represented. In fact the template defines a continuum of reconfigurability levels going from zero to fully reconfigurable.

We are currently working on a methodology for implementing multimedia applications on a custom CRISP. This will include design space exploration tools, retargetable compilation and automatic hardware synthesis. Ultimately, given an application and a set of constraints, this will allow the best CRISP instance to be found. The outcome could be a reconfigurable or a totally fixed VLIW processor.

References

1. Catthoor F., et al., Custom Memory Management Methodology, Kluwer Academic Publishers (1998)
2. Overview of the MPEG-4 Standard, International Organization for Standardization, ISO/IEC JTC1/SC29/WG11 N4030 (March 2001)
3. Jacome M. F., de Veciana G., Design Challenges for New Application-Specific Processors, IEEE Design & Test of Computers (2000) 40-50
4. Seals R. C., Whapshott G. F., Programmable Logic: PLDs and FPGAs, Macmillan Press Ltd (1997)
5. Wittig R., Chow P., OneChip: An FPGA Processor With Reconfigurable Logic, Proc. IEEE Symp. FCCM (1996) 145-154
6. Barat F., Lauwereins R., Reconfigurable Instruction Set Processors: A Survey, IEEE Workshop in Rapid System Prototyping (2000) 168-173
7. MIPS32 4Kp™ Processor Core Datasheet, MIPS Technologies Inc. (June 2000)
8. TMS320C6000 CPU and Instruction Set Reference Guide, SPRU189F, Texas Instruments (October 2000)
9. ARM7TDMI Technical Reference Manual, Rev 3, ARM (2000)
10. Kievits P., Lambers E., Moerman C., Woudsma R., R.E.A.L. DSP Technology for Telecom Baseband Processing, ICSPAT (1998)
11. Hauck S., Fry T. W., Hosler M. M., Kao J. P., The Chimaera Reconfigurable Functional Unit, IEEE Symposium on FPGAs for Custom Computing Machines (1997)
12. Miyamori T., Olukotun K., REMARC: Reconfigurable Multimedia Array Coprocessor, FPGA'98 (1998)
13. Jacob J. A., Chow P., Memory Interfacing and Instruction Specification for Reconfigurable Processors, Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays (1999) 145-154
14. Kathail V., Schlansker M. S., Rau B. R., HPL-PD Architecture Specification, Version 1.1, Hewlett-Packard Labs Technical Report HPL-98-128 (February 2000)
15. Technical Summary of the ARC Core, www.arccores.com (2001)
16. Klaiber A., The Technology Behind Crusoe Processors, Transmeta Corporation (January 2000)
17. TM1000 Data Book, Philips Electronics North America Corporation (1997)
18. Benini L., et al., Selective Instruction Compression for Memory Energy Reduction in Embedded Systems, ISLPED ’99 (1999) 206-211