
Casablanca II: Implementation of a Real-Time RISC Core for Embedded Systems

Kiyofumi Tanaka
School of Information Science, Japan Advanced Institute of Science and Technology

Abstract

We extended a general-purpose RISC processor architecture and developed a new RISC core, Casablanca II, to support real-time processing in embedded systems. The processor core has multiple register-sets and achieves fast context switching by automatically changing the active register-set when exceptions or interrupts occur, which reduces the overhead of saving and restoring register contents. In addition, the core has mechanisms for explicit data cache control that enable data prefetching and fast DMA; these are invoked by executing extended instructions. In this paper, we describe the organization of Casablanca II, developed using an ASIC process, and present a preliminary evaluation of the processor.

1. Introduction

In recent years, PLDs/FPGAs and ASICs have come to offer a large capacity of logic gates. An LSI chip can therefore accommodate both a processor core acting as a controller and the application-specific peripheral circuits of an embedded system. This leads to faster processing, lower power consumption, and a smaller system area than designs that distribute the circuits over several chips. When designing and developing chips for embedded systems, the development time can be reduced by employing existing processor cores as controllers of newly designed peripheral circuits.

In this study, we developed a new RISC core, Casablanca II, which is suited for embedded systems and can support research and development by providing open source code. The logic of the core is described in VHDL. Therefore, the core can be used with design tools from many EDA vendors and can be easily reused.

An embedded processor must have enough processing power, must respond quickly to interrupts, and must have a convenient programming environment.

Real-time processing is often essential in embedded systems. For applications in which hard deadlines are specified, processors must have enough power. Commercial, general-purpose, high-performance processors are unsuitable for embedded systems since they require a large number of logic gates and consume a significant amount of power. On the other hand, low-power embedded processors are unsatisfactory for hard real-time applications or complicated objectives. Our RISC core has sufficient processing power to deal with hard real-time applications.

Peripheral circuits commonly notify processors of requests using interrupts. Quick response from the processor to interrupts is one of the keys to achieving real-time processing. Our core uses a multiple-context architecture [13] for fast responses to interrupts.

When a new instruction set is used in designing a processor core, an assembler must be developed as part of the programming environment. Although using a compiler is effective for shortening the development time of system programs, a new compiler must also be developed for the new instruction set, which can extend the development time. In contrast, using an existing instruction set allows existing assemblers and compilers to be used and reduces the burden on programmers. Our core uses the integer instructions of the SPARC architecture version 8 (SPARC V8) [11] as its instruction set. However, the organization of the core is different from that of commercial SPARC processors. Using the SPARC instruction set makes it possible to use a free compiler, for example the GNU Compiler Collection [5].

In this paper, we describe the organization of the new processor core, Casablanca II, which we developed to meet the requirements above, and show the effectiveness of the architecture through a preliminary evaluation of the processor.

2. Casablanca II

Casablanca II was developed using Fujitsu 0.18 µm ASIC technology. The die size is 5.0 × 1.0 mm² in HQFP240 plastic packaging. The total number of gates is about 900,000. The organization is shown in Table 1.

Proceedings of the 16th International Conference on Application-Specific Systems, Architecture and Processors (ASAP'05)

1063-6862/05 $20.00 © 2005 IEEE

Table 1. Organization of Casablanca II.

Instruction set                        SPARC V8 integer insts [11]
Extended instructions (See section 5)  Inter-Register-set Insts
                                       Cache Line Control Insts
                                       Processor Interrupt Level Insts
                                       Byte Twisting Insts
                                       Prefetch Insts
Execution Pipeline                     10 stages
Cache Memory (4-way set associative)   Instruction: 16KB
                                       Data: 16KB
TLB                                    Instruction: 256 entries
                                       Data: 256 entries
# of register-set (See section 3)      Normal mode: 1
                                       Internal trap mode: 1
                                       Interrupt mode: 6

Figure 1. Block diagram of Casablanca II. (Blocks: Integer Unit (IU), Instruction Cache, Data Cache, Memory Access Control Register (MACR), MMU (with TLB), Interrupt Request Controller (IRC) with 15 request inputs, and Bus Interface Unit (BIU) with ADDR/CNTL lines, connected by the 64-bit I-BUS, DI-BUS, and DO-BUS.)

A block diagram of Casablanca II is shown in Figure 1. The core consists of an integer unit (IU), instruction and data caches, a memory access control register (MACR), a bus interface unit (BIU), an interrupt request controller (IRC), and a memory management unit (MMU), which are connected by 64-bit internal buses.

• Integer Unit
The IU executes SPARC V8 integer instructions (footnote 1). The unit has supervisor and user execution modes. Privileged instructions can be executed only in supervisor mode. All instructions are executed through a ten-stage pipeline. Instead of the register windows of the SPARC architecture, the IU includes eight register-sets, each of which holds thirty-two 32-bit registers.

• Instruction/Data Cache
Casablanca II has separate instruction and data caches. Each 16KB cache is four-way set associative. The instruction and data caches have a lock mechanism to assist fast responses to interrupts; locked blocks are not replaced. Moreover, the data cache has mechanisms for prefetching and explicit control, described in section 4. The mechanisms are invoked by executing the corresponding extended instructions.

• Memory Access Control Register
The MACR is a 64-bit register that controls the MMU status, the instruction/data cache status, the cache auto-lock status, the cacheability of MMU-through memory accesses, and the boot mode.

• Bus Interface Unit
Casablanca II is connected to a 64-bit interlock bus through the BIU. The supported transfer sizes are 1, 2, 4, 8, and 32 bytes.

• Interrupt Request Controller
The IRC has 15 channels for interrupt request inputs. A register-set mode assignment register (RSMAR) and an interrupt request level assignment register (IRLAR) are included in the IRC. The RSMAR holds the correspondence between the request level, or priority, of an interrupt and the register-set used for that interrupt. The IRLAR holds the correspondence between physical channels and the request levels of interrupts. These two registers make it possible to change interrupt request levels after the wired connections in a system have been fixed.

• MMU
The MMU is based on the SPARC reference MMU architecture [11], except that the page size is fixed to 4 KB. The MMU includes a 512-entry TLB (256 entries for instructions and 256 for data).

Footnote 1: Some integer instructions that rarely appear in programs are not implemented as function units. When a running program encounters such an instruction, an “unimplement trap” occurs and the trap handler emulates the instruction.
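The two-level mapping performed by the IRC (physical channel to request level through the IRLAR, then request level to register-set through the RSMAR) can be sketched in C as follows. This is a minimal software model under stated assumptions and not the hardware's register layout: the array representation, the names `Irc` and `select_interrupt`, and the rule that the highest pending level wins are all illustrative.

```c
#include <stdint.h>

#define NUM_CHANNELS 15   /* interrupt request inputs of the IRC */
#define NUM_LEVELS   16   /* request levels 0..15 (illustrative range) */

/* Hypothetical software view of the two assignment registers. */
typedef struct {
    uint8_t irlar[NUM_CHANNELS]; /* physical channel -> request level */
    uint8_t rsmar[NUM_LEVELS];   /* request level -> register-set number */
} Irc;

/*
 * Pick the pending channel with the highest request level.
 * pending: bit i set means channel i is requesting service.
 * On success, stores the level and register-set and returns the channel.
 * Returns -1 if no channel is pending.
 */
int select_interrupt(const Irc *irc, uint16_t pending,
                     uint8_t *level, uint8_t *regset)
{
    int best = -1;
    uint8_t best_level = 0;

    for (int ch = 0; ch < NUM_CHANNELS; ch++) {
        if ((pending >> ch) & 1u) {
            uint8_t lv = irc->irlar[ch] & 0xFu; /* keep index inside rsmar[] */
            if (best < 0 || lv > best_level) {
                best = ch;
                best_level = lv;
            }
        }
    }
    if (best >= 0) {
        *level = best_level;
        *regset = irc->rsmar[best_level];
    }
    return best;
}
```

Because both tables are plain data in this model, changing the priority of a channel only requires rewriting an `irlar` entry, which mirrors the point that request levels can be changed after the wiring is fixed.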

3. Fast Interrupt Control

Generally, when an interrupt occurs, the interrupt handler must save a processor context through the execution of store instructions. This can be a bottleneck with respect to fast response to the interrupt and a barrier to real-time processing. Casablanca II has six register-sets dedicated to interrupt routines and automatically switches to one of the


register-sets when an interrupt occurs, eliminating the need to save contexts in software. When control returns from an interrupt routine, no load instructions are needed to restore the saved context, since the previous register-set is automatically reactivated by the execution of a return-trap instruction. This mechanism enables fast response to interrupts.

When an interrupt with higher priority occurs while another interrupt routine is already running, the higher-priority interrupt should preempt the lower one and start immediately. This is why there are multiple register-sets for interrupts. The multiple register-sets permit further interrupts to be accepted during the processing of an interrupt without context-switching overhead.

Casablanca II does not have register windows, which SPARC processors commonly have. When instructions that explicitly rotate the windows, that is, “save” and “restore”, appear in the program execution, an unimplement trap occurs. Although the trap handlers can emulate the rotation of register windows, this takes a long time. Therefore, we assume that a program does not include the save and restore instructions. GNU compilers [5] can generate code that does not include such instructions when the option “-mflat” is specified. Thus, the lack of register windows does not impede software development.

The usage of register-sets is shown in Table 2. The register-set that is currently active is pointed to by the current register-set pointer (CRSP) field in the processor status register (PSR). The register-set that was previously used is indicated by a previous register-set pointer (PRSP) field. A single CRSP is shared by all register-set modes, whereas each register-set mode has its own PRSP; the currently valid PRSP is the one selected by the CRSP.

Figure 2 illustrates how the active register-set is switched. In this example, while the processor is handling an interrupt that uses register-set 3, it receives a new interrupt request with higher priority that uses register-set 2. Solid lines in the figure indicate the state before the occurrence of the new interrupt: the value of the CRSP is 3, and the value of the active PRSP, which corresponds to register-set 3, is 0. When the processor accepts the interrupt, the value of the CRSP changes to 2, and the old value is stored in the PRSP that corresponds to register-set 2. This is indicated by the broken lines in the figure. At the end of processing the

Table 2. Register-set modes.

CRSP   Mode                 Purpose
0      Normal mode          Exec. of normal process
1      Internal trap mode   Exec. of exception handler
2–7    Interrupt mode       Exec. of interrupt handler

Figure 2. A switch of register-sets on occurrence of an interrupt. (Diagram: the CRSP changes from 3 to 2; Register Sets 0 to 3 are shown with their pointers PRSP0 to PRSP3, where PRSP3 holds 0 and PRSP2 receives 3.)

interrupt, the execution of the return-trap instruction, RETT, restores the value of the CRSP. That is, the value in the PRSP of register-set 2 is copied to the CRSP, and control returns to the previous interrupt routine.
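The CRSP/PRSP bookkeeping described in this section can be modelled in a few lines of C. The sketch below is a software illustration only: the struct layout and the function names `enter_interrupt` and `return_from_trap` are assumptions made for the example, and the real core performs these updates in hardware inside the PSR.

```c
#include <stdint.h>

#define NUM_REGISTER_SETS 8

/* Hypothetical model of the register-set pointers. */
typedef struct {
    uint8_t crsp;                     /* current register-set pointer */
    uint8_t prsp[NUM_REGISTER_SETS];  /* one previous pointer per register-set */
} RegSetState;

/* Interrupt entry: activate new_set and remember the interrupted set. */
void enter_interrupt(RegSetState *s, uint8_t new_set)
{
    s->prsp[new_set] = s->crsp;  /* old CRSP goes into the PRSP of the new set */
    s->crsp = new_set;
}

/* RETT: reactivate the set recorded in the PRSP of the active set. */
void return_from_trap(RegSetState *s)
{
    s->crsp = s->prsp[s->crsp];
}
```

Replaying the example of Figure 2 with this model, entering set 3 from set 0 and then set 2 from set 3 leaves `prsp[3] == 0` and `prsp[2] == 3`, and two calls to `return_from_trap` walk back through set 3 to set 0 without any register being stored to memory.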

4. Data Cache Mechanisms

Embedded systems, which should be inexpensive, generally use an interlock memory bus instead of an expensive split-phase bus. To mitigate the overhead of slow memory accesses, Casablanca II has a cache prefetching mechanism. The prefetching mechanism not only hides the latency of slow memory accesses but can also absorb fluctuations in latency caused by bus contention with other memory accesses, for example, DMA. Moreover, Casablanca II has mechanisms for maintaining consistency between the data cache and an external memory. These mechanisms enable applications to send and receive data efficiently through the external memory.

4.1. Data prefetching

A prefetching instruction forces the data cache to fill a cache block with the memory block that includes the location addressed by the instruction if and only if the block does not exist in the cache. The content of register 0 (%r0) is always zero in the SPARC architecture. A load instruction whose destination register is %r0 and whose size is 4 bytes or less has no effect on the processor context. The data cache interprets this otherwise ineffective load instruction as a prefetching instruction when the address of the load is in cacheable memory space. This instruction can be combined with other extended instructions. For example, an FF-load instruction whose destination is %r0 can forcibly fill a cache entry with a memory block before the succeeding, corresponding load instruction is executed (for FF-load, see section 4.2).

4.2. Explicit data cache control

When data are moved between peripheral circuits and a processor via memory by a DMA mechanism, generating the data in memory efficiently through a data cache is difficult, since a coherent cache requires a large amount of hardware. Therefore, a low-cost embedded processor generally does not have a coherent cache. Non-cacheable memory accesses, performed by the processor one byte or word at a time, can easily degrade performance. Here, we describe mechanisms for sending and receiving data through a data cache that do not require complicated hardware and that keep coherence by means of explicit instructions.

A processor invokes DMA when it gives data to peripheral circuits. The processor only needs a mechanism for updating memory with the data generated in the cache before it starts the DMA. To implement this, “forced-update store” (FU-store) instructions are provided. The execution of this instruction updates the memory if and only if the cache includes the addressed block and the block is in the “dirty” (or “modified”) state.

On the other hand, when the processor receives data generated by the peripheral circuits or by DMA, the peripheral circuits or the DMA element notifies the processor, for example by an interrupt, that the data are ready in memory. The processor then only has to inject the updated blocks into the data cache. At that time, it must inject them regardless of whether the address hits or misses in the cache, because the data that exist in the cache may already be stale. To implement this, “forced-fill load” (FF-load) instructions are provided. The execution of these instructions forces the data cache to fill an entry with the block in memory addressed by the instruction.

A data cache inherently has mechanisms for filling a block on a cache miss and for writing back a block on replacement, so the amount of additional hardware required for the above mechanisms is trivial. Furthermore, the two instructions can assist designers in building high-performance multiprocessors in embedded systems.
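The semantics of prefetching, FU-store, and FF-load can be made concrete with a small software model of a write-back cache. The sketch below is an illustration under several assumptions that do not come from the design itself: the cache is direct-mapped with four lines (the real data cache is 16KB and four-way set associative), all names are hypothetical, and an FF-load that hits a line simply discards the cached copy before refilling it. Addresses are assumed to be below `MEM_SIZE`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 32u    /* bytes per cache block */
#define NUM_LINES  4u     /* tiny direct-mapped cache, for illustration only */
#define MEM_SIZE   1024u  /* size of the modelled external memory */

typedef struct {
    bool     valid, dirty;
    uint32_t block;              /* block number = address / BLOCK_SIZE */
    uint8_t  data[BLOCK_SIZE];
} Line;

typedef struct {
    uint8_t mem[MEM_SIZE];       /* external memory, also written by DMA */
    Line    line[NUM_LINES];
} System;

static Line *line_for(System *s, uint32_t addr)
{
    return &s->line[(addr / BLOCK_SIZE) % NUM_LINES];
}

static bool line_hit(const Line *l, uint32_t addr)
{
    return l->valid && l->block == addr / BLOCK_SIZE;
}

static void write_back(System *s, Line *l)
{
    memcpy(&s->mem[l->block * BLOCK_SIZE], l->data, BLOCK_SIZE);
    l->dirty = false;            /* block is now clean */
}

static void fill(System *s, Line *l, uint32_t addr)
{
    if (l->valid && l->dirty)
        write_back(s, l);        /* ordinary write-back on replacement */
    l->block = addr / BLOCK_SIZE;
    memcpy(l->data, &s->mem[l->block * BLOCK_SIZE], BLOCK_SIZE);
    l->valid = true;
    l->dirty = false;
}

/* Ordinary cached load and store (write-back, write-allocate). */
uint8_t cache_load_byte(System *s, uint32_t addr)
{
    Line *l = line_for(s, addr);
    if (!line_hit(l, addr)) fill(s, l, addr);
    return l->data[addr % BLOCK_SIZE];
}

void cache_store_byte(System *s, uint32_t addr, uint8_t value)
{
    Line *l = line_for(s, addr);
    if (!line_hit(l, addr)) fill(s, l, addr);
    l->data[addr % BLOCK_SIZE] = value;
    l->dirty = true;
}

/* Prefetch: fill the block if and only if it is not already cached. */
void prefetch(System *s, uint32_t addr)
{
    Line *l = line_for(s, addr);
    if (!line_hit(l, addr)) fill(s, l, addr);
}

/* FU-store: update memory if and only if the block is cached and dirty. */
void fu_store(System *s, uint32_t addr)
{
    Line *l = line_for(s, addr);
    if (line_hit(l, addr) && l->dirty) write_back(s, l);
}

/* FF-load: refill from memory regardless of hit or miss, then load. */
uint8_t ff_load_byte(System *s, uint32_t addr)
{
    Line *l = line_for(s, addr);
    if (line_hit(l, addr)) l->valid = false;  /* drop stale copy (assumption) */
    fill(s, l, addr);
    return l->data[addr % BLOCK_SIZE];
}
```

In this model, a producer calls `fu_store` on each block of a buffer before starting an outgoing DMA, and a consumer calls `ff_load_byte` on the first access to each block after an incoming DMA completes; an ordinary `cache_load_byte` in the second case would return the stale cached value.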

5. Extended Instructions

Casablanca II implements its extended instructions with instructions that are implementation-dependent in SPARC V8, since importance is attached to the feasibility of using existing compilers and assemblers. The following extended instructions are implemented in Casablanca II.

5.1. Inter-Register-Sets Instructions (IRSI)

Instructions that make it possible to access any register-set from any register-set mode are provided to realize fast system calls. The following three instructions utilize SPARC load/store instructions from/into alternate space.

IRS-Load
This is a load instruction from alternate spaces whose address space identifier (ASI) encodes a destination register-set number. The memory address is calculated using register values in the register-set pointed to by the CRSP. The loaded value is written to a register in the destination register-set. All data sizes are supported.

IRS-Store
This is a store instruction into alternate spaces whose ASI encodes a source register-set number. The memory address is calculated using register values in the register-set pointed to by the CRSP. A register value in the source register-set is stored at that address. All data sizes are supported.

IRS-Add
This is a load instruction from alternate spaces whose ASI encodes a destination register-set number. The sum of two registers in the register-set pointed to by the CRSP is written to the destination register in the destination register-set. (In an ordinary load instruction this sum is the memory address; this instruction, however, does not invoke a memory access.) When the zero register, %r0, is specified as either of the two source registers, a simple register move is performed.
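The data movement performed by IRS-Add can be sketched as a software model of the register file. The names `RegFile`, `read_reg`, `write_reg`, and `irs_add` are hypothetical, the ASI encoding is left out entirely, and the model assumes that register 0 reads as zero in every register-set.

```c
#include <stdint.h>

#define NUM_SETS 8    /* register-sets in the IU */
#define NUM_REGS 32   /* 32-bit registers per set */

/* Hypothetical model of the multi-set register file. */
typedef struct {
    uint32_t r[NUM_SETS][NUM_REGS];
    uint8_t  crsp;                 /* currently active register-set */
} RegFile;

/* Register 0 always reads as zero (assumed to hold in every set). */
uint32_t read_reg(const RegFile *f, uint8_t set, uint8_t idx)
{
    return idx == 0 ? 0u : f->r[set][idx];
}

/* Writes to register 0 are discarded. */
void write_reg(RegFile *f, uint8_t set, uint8_t idx, uint32_t value)
{
    if (idx != 0) f->r[set][idx] = value;
}

/*
 * IRS-Add: sum two registers of the active set and write the result
 * into register rd of dest_set. No memory access takes place.
 */
void irs_add(RegFile *f, uint8_t dest_set, uint8_t rd,
             uint8_t rs1, uint8_t rs2)
{
    uint32_t sum = read_reg(f, f->crsp, rs1) + read_reg(f, f->crsp, rs2);
    write_reg(f, dest_set, rd, sum);
}
```

For instance, a system-call handler running in register-set 1 could hand a result back to the caller in register-set 0 with `irs_add(&f, 0, 8, 5, 0)`, which behaves as a plain move because the second source is register 0.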

5.2. Cache Line Forced Instructions (CLFI)

The following instructions implement explicit control of the data cache, as described in the previous section.

FF-load
The execution of this instruction forces the data cache to fill a block with the 32-byte memory block that includes the address specified by the instruction. At the same time, the addressed data, whose size is specified by the instruction, are written to the destination register. It utilizes a load instruction from alternate spaces (an ASI range ending at 0x6F).


FU-store
The execution of this instruction updates the memory with a cache block if and only if the block of the specified address exists in the cache and is in the “dirty” state. After the execution, the cache block is in the “clean” state. Although it utilizes store instructions into alternate spaces (an ASI range ending at 0x6F), the source register value that an ordinary store instruction would store is ignored and is written to neither the cache nor the memory. This instruction is applied in two ways according to the size of the target data.

• Update on address specifies a memory address. This is used when the target data sequence is smaller than the cache, since sweeping all cache entries is unnecessary.
• Update on set specifies an entry in the cache. This is used when the target data sequence is larger than the cache.

5.3. Byte Twisting Instructions (BTI)

These instructions reverse the byte order, from big endian to little endian and vice versa. They utilize loads from and stores into alternate spaces (footnote 2). The function is activated when a designated control bit is set to 1. For a load, the value read from the memory or the cache is twisted, and the twisted value is written to the destination register. On the other hand, for a store, the source register value is twisted and the twisted value is stored into the memory or the cache. These instructions can be combined with other extended instructions. When combined with IRS-Add, for example, the calculated result is twisted, and the twisted value is written to the destination register.
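In software terms, twisting a value means reversing the order of its bytes. The following C functions show the operation for 32-bit and 16-bit values; the names `twist32` and `twist16` are chosen for this example only, and the hardware applies the same transformation on the load or store path without extra instructions.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit word. */
uint32_t twist32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* Reverse the two bytes of a 16-bit halfword. */
uint16_t twist16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}
```

Twisting is its own inverse, so the same operation serves for both big-to-little and little-to-big conversion.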

5.4. Processor Interrupt Level Instructions (PILI)

A critical section among interrupt handlers can be controlled by regulating the interrupt level of the processor. In SPARC V8, the processor interrupt level (PIL) can be changed by executing the WRPSR instruction, which rewrites the processor status register (PSR). However, this procedure requires the execution of several instructions before the WRPSR to generate the value that is written to the PSR, and is therefore inefficient. Casablanca II has extended instructions for directly modifying the PIL field in the PSR. A critical section can be entered and left quickly and efficiently, since the execution of a single instruction can change the PIL.

Footnote 2: In SPARC version 9, data can be accessed in a little-endian format by using selected ASIs [12].

SAVE&WRITE PIL
This utilizes the WRASR instruction, which was originally for writing to ancillary state registers (ASRs) in SPARC V8. The execution of this instruction moves the value of the current PIL to a sheltering register and, simultaneously, writes an immediate value in the instruction to the PIL field in the PSR. A sheltering register is provided for each register-set mode.

RESTORE PIL
This utilizes the RDASR instruction, which was originally for reading from ASRs. The execution of this instruction restores the PIL field to the value in the corresponding sheltering register.
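The effect of the two PIL instructions on the processor state can be sketched as follows. The PIL occupies bits 11:8 of the PSR in SPARC V8; everything else in the sketch is an assumption made for illustration, including the names, the separate `crsp` field, and the choice to index the sheltering registers by register-set number.

```c
#include <stdint.h>

#define PSR_PIL_SHIFT 8
#define PSR_PIL_MASK  (0xFu << PSR_PIL_SHIFT)   /* PIL field, PSR bits 11:8 */
#define NUM_SHELTERS  8                         /* one per register-set (assumption) */

/* Hypothetical model of the state touched by the PIL instructions. */
typedef struct {
    uint32_t psr;
    uint8_t  crsp;                   /* selects the sheltering register */
    uint8_t  shelter[NUM_SHELTERS];
} Cpu;

uint8_t get_pil(const Cpu *c)
{
    return (uint8_t)((c->psr & PSR_PIL_MASK) >> PSR_PIL_SHIFT);
}

/* Replace only the PIL field; all other PSR bits are preserved. */
static void set_pil(Cpu *c, uint8_t pil)
{
    c->psr = (c->psr & ~PSR_PIL_MASK)
           | (((uint32_t)pil << PSR_PIL_SHIFT) & PSR_PIL_MASK);
}

/* SAVE&WRITE PIL: shelter the current PIL, then install the immediate. */
void save_and_write_pil(Cpu *c, uint8_t new_pil)
{
    c->shelter[c->crsp] = get_pil(c);
    set_pil(c, new_pil);
}

/* RESTORE PIL: put the sheltered value back into the PIL field. */
void restore_pil(Cpu *c)
{
    set_pil(c, c->shelter[c->crsp]);
}
```

A critical section is then bracketed by `save_and_write_pil(&cpu, 15)` and `restore_pil(&cpu)`, which corresponds to one instruction on each side instead of the read, mask, merge, and write sequence needed with WRPSR.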

6. Evaluation

In this section, we show a preliminary evaluation of Casablanca II in terms of fast interrupt control and explicit data cache control. We performed a simulation using the RTL description of the processor core. In the simulation, we assumed that the clock frequency was 100 MHz and that the memory access latency, that is, the cache miss penalty, was 10 clock cycles.

6.1. Evaluation of fast interrupt control

This section discusses the effects of the fast interrupt control mechanism. We prepared interrupt handler code that consisted of dozens of instructions. Three methods were simulated: (1) a conventional method, in which the processor context is saved when an interrupt occurs and restored when the interrupt processing finishes; (2) a method using SPARC register windows; and (3) the fast interrupt control method we propose. For the method using register windows, we wrote and used a handler that emulated the register windows, assuming that register windows were available. The execution times from the occurrence of an interrupt request to the end of the interrupt handler are shown in Table 3. The results show that saving and restoring a processor context takes about 2.6 µsec.

Table 3. Processing time for an interrupt.

Method                   µsec
Conventional             4.29
Register windows         3.57
Fast interrupt control   1.70

Next, we generated six interrupt requests in the simulation. Each request had a distinct priority. All interrupts occurred at almost the same time, in ascending order of priority. The processing time from the first interrupt occurrence to the end of the sixth interrupt's processing is shown in Table 4. Our fast interrupt control method used distinct register-set modes for the six interrupts and executed faster than the other methods in overlapped interrupt processing. We verified that the conventional and register windows methods generated cache misses when saving a context, which caused a longer processing time. On the other hand, our method generated no cache misses when saving a context.

Table 4. Processing time for overlapped six interrupts.

Method                   µsec
Conventional             29.23
Register windows         23.18
Fast interrupt control   10.22
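The measured times in Tables 3 and 4 can be converted to clock cycles with simple arithmetic, which makes the stated saving of about 2.6 microseconds easier to check. The sketch below stores each time in units of 0.01 microseconds; at the 100 MHz clock assumed in the simulation, one such unit equals one clock cycle. The constant and function names are chosen for this example.

```c
/* Times from Tables 3 and 4 in units of 0.01 microseconds.
 * At the assumed 100 MHz clock, one unit equals one clock cycle. */
enum {
    ONE_CONVENTIONAL = 429,  ONE_WINDOWS = 357,  ONE_FAST = 170,
    SIX_CONVENTIONAL = 2923, SIX_WINDOWS = 2318, SIX_FAST = 1022
};

/* Clock cycles saved by the faster method relative to a baseline. */
int cycles_saved(int baseline, int fast)
{
    return baseline - fast;
}

/* Ratio baseline/fast as a percentage, truncated toward zero. */
int ratio_percent(int baseline, int fast)
{
    return baseline * 100 / fast;
}
```

With these values, `cycles_saved(ONE_CONVENTIONAL, ONE_FAST)` is 259 cycles, that is, 2.59 microseconds for a single interrupt, and `cycles_saved(SIX_CONVENTIONAL, SIX_FAST)` is 1901 cycles for the six overlapped interrupts.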

int A[1024], x, y; for (i=0; i
