Performance Evaluations of a Multithreaded Java Microcontroller

J. Kreuzinger, M. Pfeffer, A. Schulz, Th. Ungerer
Institute for Computer Design and Fault Tolerance, University of Karlsruhe, Germany

Abstract: We propose handling of external real-time events through multithreading and describe the microarchitecture of our multithreaded Java microcontroller, called the Komodo microcontroller. Real-time Java threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed Komodo microcontroller supports multiple ISTs with zero-cycle context switching overhead. We evaluate the basic performance and optimize the internal structures of our microcontroller using real-time applications as benchmarks on a software simulator. We determine the correlation between the fetch bandwidth and the instruction window size. Our results show a performance gain of about 1.4 through latency reduction over the single-threaded model.

Keywords: multithreading, microcontroller, Java microprocessor, real-time, interrupt service threads

1 Introduction

A multithreaded processor is characterized by the ability to simultaneously execute instructions of different threads within the processor pipeline. Multiple on-chip register sets are employed to achieve an extremely fast context switch. Multithreading techniques have been proposed in processor architecture research to mask latencies of instructions of the currently executing thread by executing instructions of other threads [1].

U. Brinkschulte, C. Krakowski
Institute for Process Control, Automation and Robotics, University of Karlsruhe, Germany

Recently, multithreading has also been proposed for the handling of internal events ([2], [3]), taking advantage of its fast context switching ability. However, to date multithreading has never been explored in the context of microcontrollers for the handling of external hardware events. Contemporary microprocessors and microcontrollers activate interrupt service routines (ISRs) for event handling. A multithreaded processor core allows so-called interrupt service threads (ISTs) to be triggered instead of ISRs for event handling. In this case an occurring event activates an assigned thread. Using threads instead of ISRs in combination with a multithreaded processor core has several advantages:

- Thread activation is extremely fast due to the multiple register sets on the multithreaded processor chip, thus enabling single-cycle event reaction.
- Because of the parallel execution, ISTs are not blocked by ISTs with higher priority, in contrast to ISRs.
- Latencies within the pipeline can be utilized by other concurrently executing threads or ISTs.
- A flexible context switch between event-triggered ISTs and other threads is made possible by the IST scheme. In contrast, ISRs can only be interrupted by other ISRs, but not by threads.
- Threads allow flexible priority schemes.

Our project explores the suitability of multithreading techniques in embedded real-time systems. We propose multithreading as an event handling mechanism that allows efficient handling of simultaneous overlapping events with hard real-time requirements. Moreover, we explore our methods in the context of a real-time Java system. State-of-the-art embedded systems are programmed in assembly or C/C++. Java offers several advantages with respect to real-time systems: the object orientation of the Java language supports easy programming, reusability, and robustness; Java bytecode is portable, of small size, and secure due to the bytecode verifier in the Java Virtual Machine (JVM). However, to date the use of Java is hardly possible in embedded real-time systems because of its high hardware requirements and unpredictable real-time behavior. Both problems are solved by the design of a multithreaded Java microcontroller enhanced by an adapted JVM. A Java microprocessor, like the picoJava-I and II from Sun ([4], [5]), executes Java bytecode instructions directly in hardware, raising the performance of Java applications and decreasing the hardware requirements compared to a Java interpreter or just-in-time (JIT) compiler. Extending the picoJava approach, real-time operation is supported by applying the multithreading technique to reach a zero-cycle context switching overhead and by hardware support for advanced real-time scheduling schemes. The next section presents our Komodo microcontroller and section three our performance results.
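The IST idea described above can be illustrated with a small software analogue. This is a sketch only: on the Komodo microcontroller, activation happens in hardware with zero-cycle overhead, whereas here a plain Java `Semaphore` stands in for the external signal delivered by the signal unit; the class and field names are our own, not part of any Komodo API.

```java
import java.util.concurrent.Semaphore;

// Illustrative software analogue of an interrupt service thread (IST).
// A hypothetical semaphore stands in for the external hardware event
// line; on the real microcontroller the signal unit activates the IST.
public class IstSketch {
    static final Semaphore signal = new Semaphore(0); // stand-in event line
    static volatile int handledEvents = 0;

    public static void main(String[] args) {
        Thread ist = new Thread(() -> {
            // The IST blocks until the event is raised, then runs its
            // handler body; an ISR would instead be entered via the
            // interrupt vector and would block lower-priority handlers.
            try {
                for (int i = 0; i < 3; i++) {
                    signal.acquire();
                    handledEvents++; // handler body
                }
            } catch (InterruptedException e) { /* shutdown */ }
        });
        ist.start();
        for (int i = 0; i < 3; i++) signal.release(); // three external events
        try { ist.join(); } catch (InterruptedException e) { }
        System.out.println("events handled: " + handledEvents);
    }
}
```

Because the IST is an ordinary thread, it can be suspended and reactivated per event, which is what enables the flexible priority schemes listed above.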

2 The proposed Komodo microcontroller

The Komodo microcontroller is a multithreaded Java microcontroller which supports multiple ISTs with zero-cycle context switching overhead and several priority schemes. A high-priority IST may execute at full speed; lower-priority ISTs and non-real-time threads are restricted to the latency pipeline slots of the high-priority IST. In the current design, the microcontroller holds the contexts of up to four threads. If there are more than four threads to handle, our architecture allows up to three real-time threads, which are directly mapped to hardware threads. The remaining threads must be non-real-time and are scheduled within the fourth hardware thread. To scale up for larger systems with more than three real-time threads, we propose parallel execution on several microcontrollers connected by a middleware platform; this topic is covered in another paper [6].

Because of its application in embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple scalar processor. As shown in Fig. 1, the microcontroller core consists of an instruction fetch unit, a decode unit, a memory access unit (MEM) and an execution unit (ALU). Four stack register sets are provided on the processor chip. A signal unit triggers IST execution due to external signals.

Figure 1: Block diagram of the Komodo microcontroller (memory interface; instruction fetch unit with PC1-PC4; instruction windows IW1-IW4; priority manager with micro-ops ROM; instruction decode; MEM and ALU data path; four stack register sets; signal unit receiving external signals)

The instruction fetch unit holds four program counters (PC) with dedicated status bits (e.g. thread active/suspended); each PC is assigned to a different thread. Four-byte portions are fetched over the memory interface and put into the corresponding instruction window (IW). Several instructions may be contained in a fetch portion, because of the average bytecode length of 1.8 bytes. Instructions are fetched depending on the status bits and fill levels of the IWs.

The instruction decode unit contains the above-mentioned IWs, dedicated status bits (e.g. priority) and counters. A priority manager decides, subject to these bits and counters, from which IW the next instruction will be decoded. We defined several priority schemes [7] to handle real-time requirements. The priority manager applies one of the implemented thread priority schemes for IW selection. A bytecode instruction is decoded either to a single micro-op, to a sequence of micro-ops, or a trap routine is called. Each opcode is propagated through the pipeline together with its thread id. Opcodes from multiple threads can be simultaneously present in the different pipeline stages.

The instructions for memory access are executed by the MEM unit. If the memory interface only permits one access per cycle, an arbiter is needed for instruction fetch and data access. All other instructions are executed by the ALU unit. Both units (MEM and ALU) can take several cycles to complete an instruction execution. After that, the result is written back to the stack register set of the corresponding thread.

External signals are delivered to the signal unit from the peripheral components of the microcontroller core, e.g. timer, counter, or serial interface. On the occurrence of such a signal the corresponding IST is activated. As soon as an IST activation ends, its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again.

To avoid pipeline stalls, instructions from other threads can be fed into the pipeline. Idle times may result from branches or memory accesses.
The decode unit predicts the latency after such an instruction and proceeds with instructions from other IWs. There is no overhead for such a context switch: no save/restore of registers or removal of instructions from the pipeline is needed, because each thread has its own stack register set. Because of the unpredictability of cache accesses, non-cached memory access is preferred for real-time microcontrollers. The resulting load latencies are bridged by scheduling instructions of other threads via the priority manager. Therefore a cache is omitted from our Komodo microcontroller.
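The selection step of the priority manager can be sketched as a small software model. This is purely illustrative and assumes a fixed-priority scheme (one of several schemes defined in [7]); the names `selectWindow`, `iw`, and `active` are our own and do not correspond to the actual hardware signals.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Software model of fixed-priority IW selection: among active threads,
// decode from the highest-priority instruction window that holds a
// pending instruction. Illustrative only; the real selection is hardware.
public class PriorityManagerSketch {
    // iw[t] holds pending bytecodes of hardware thread t (0 = highest prio).
    @SuppressWarnings("unchecked")
    static final Deque<Integer>[] iw = new Deque[4];
    static final boolean[] active = new boolean[4];

    static {
        for (int t = 0; t < 4; t++) iw[t] = new ArrayDeque<>();
    }

    // Returns the thread id to decode from, or -1 if every window is
    // empty or suspended (a real pipeline would insert a bubble).
    static int selectWindow() {
        for (int t = 0; t < 4; t++)
            if (active[t] && !iw[t].isEmpty()) return t;
        return -1;
    }

    public static void main(String[] args) {
        active[1] = true; active[3] = true;
        iw[1].add(0x60); // iadd pending for thread 1
        iw[3].add(0x04); // iconst_1 pending for thread 3
        System.out.println(selectWindow()); // thread 1 wins (higher priority)
        iw[1].clear();
        System.out.println(selectWindow()); // falls back to thread 3
    }
}
```

Under this model a stalled high-priority thread (empty or suspended window) automatically yields its pipeline slots to lower-priority threads, which is exactly the latency-utilization behavior described above.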

3 Performance evaluation

For the evaluation of our design, we use an execution-based simulator of our Komodo microcontroller with an adapted JVM to run several benchmarks (e.g. CaffeineMark 3.0, an FFT algorithm, a PID element, and an impulse counter). This JVM is not yet optimized with quick opcodes or similar techniques. Therefore, the benchmarks are used to apply a realistic bytecode mix for microarchitecture optimization, not to compare the performance of the Komodo microcontroller with other approaches.

First of all, we want to verify the presumed performance gain from bridging latencies. Our first experiment assumes processor clock cycles of 100 MHz or less and fast SRAM storage for memory, thus eliminating load/store latencies. Under these assumptions the only latencies stem from branches, which take two cycles of overhead in the pipeline to resolve and fetch from the new PC. The results showed a gain of 1.28 with:

    gain = (cycles needed for sequential execution) / (cycles needed by multithreaded execution)
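The gain metric can be made concrete with a tiny worked example. The cycle counts below are invented for illustration and are not Komodo measurements; only the ratio matters.

```java
// Worked example of the gain metric:
// gain = sequential cycles / multithreaded cycles.
// The cycle counts are made-up illustrative values.
public class GainExample {
    static double gain(long sequentialCycles, long multithreadedCycles) {
        return (double) sequentialCycles / multithreadedCycles;
    }

    public static void main(String[] args) {
        // e.g. a workload taking 1280 cycles single-threaded but 1000
        // cycles multithreaded would show the reported gain of 1.28.
        System.out.println(gain(1280, 1000)); // 1.28
    }
}
```

A gain of 1.0 means multithreading bridged no latency at all; values above 1.0 come purely from filling latency slots, since the core issues at most one instruction per cycle.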

The second experiment additionally considers the latencies of data accesses. We assume a large data memory built from DRAM modules instead of costly SRAM chips. Due to the need for available instructions we assume a small instruction memory implemented with SRAM chips or as an on-chip memory; this yields instruction fetch without latency. We vary the share of load/store instructions and their latency. The results are shown in Figure 2. The different graphs show the gain with respect to memory latency (in cycles). Each graph represents a program with a different percentage of l/s-instructions (10, 20, 40, 50, 70, 90%), respectively the impulse counter (imp), the PID element (pid), and the fast Fourier transform algorithm (fft).

Figure 2: Benchmark with different share and latency of l/s-instructions

The results in Fig. 2 show a significant performance gain which is reached only by the latency hiding ability of multithreading and not by parallel processing.

As a third performance test, we determine the number of threads needed to bridge nearly all latencies in typical Java bytecode programs, subject to the latency of data access. The "zero" latency graph in Figure 3 shows the gain yielded by branch latency hiding only, while the other graphs concern branch and load/store latency hiding. The figure shows an increasing gain until the number of threads reaches the latency of load/store instructions. At this point, nearly all latencies are bridged and the processor is fully utilized.

Figure 3: Benchmark with different number of threads

After investigating the basic performance abilities of our architecture, we take a closer look at some hardware parameters. Henceforth, we assume a single memory interface which prefers data access over instruction fetch, and a bandwidth of 32 bits. We assume a latency of zero cycles for instruction fetch, one cycle for data access, and two cycles for branches. Four threads are active and scheduled by a fixed priority until they are

blocked. The share of load/store instructions is 20%, which is typical for many applications.

For a zero-cycle context switch, it is necessary to keep the instruction windows stuffed. To reach this goal, several instruction fetch strategies are investigated:

- only the fill level of the IWs is taken into account,
- the fill level and the priority of the thread are evaluated,
- the fill level and recently taken branches are taken into account, because the IW is empty after a branch, or
- a combination of all three techniques is used.
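The simplest of these strategies, fill level only, can be sketched as follows. This is an assumed software model of the selection rule (fetch for the emptiest window that can still accept a full portion); the constants and names are illustrative, with the window size chosen per the fetch-bandwidth-plus-six result discussed below.

```java
// Sketch of a fill-level-only fetch strategy: each cycle, fetch a
// portion for the instruction window with the most free space, provided
// a full portion still fits. Illustrative model, not the hardware logic.
public class FetchStrategySketch {
    static final int WINDOW_SIZE = 10; // e.g. fetch bandwidth (4) + 6 bytes
    static final int PORTION = 4;      // bytes fetched per memory access

    // fillLevel[t] = bytes currently buffered for hardware thread t.
    // Returns the thread to fetch for, or -1 if no window has room
    // for a complete portion.
    static int selectFetchThread(int[] fillLevel) {
        int best = -1, bestFree = PORTION - 1; // require room for a portion
        for (int t = 0; t < fillLevel.length; t++) {
            int free = WINDOW_SIZE - fillLevel[t];
            if (free > bestFree) { bestFree = free; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(selectFetchThread(new int[]{8, 2, 6, 7})); // 1
        System.out.println(selectFetchThread(new int[]{8, 8, 8, 8})); // -1
    }
}
```

Note that thread priority and branch history play no role here; Figure 4 indicates that this simple rule already performs within a small margin of the combined strategies.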

Figure 4 shows the difference between these strategies subject to the IW size. As can be seen, there is only a minimal discrepancy, and therefore the simplest fetch strategy (fill level only) is sufficient.

Figure 4: Benchmark with different fetch strategies

To find an optimal size of the instruction windows, we ran several tests with different sizes. We found that the window size correlates with the instruction fetch bandwidth, as shown in Figure 5. The minimum window size is about the fetch bandwidth plus two, which is necessary to avoid a deadlock that arises when no complete bytecode is in the buffer and there is not enough space to fetch a new portion (the maximum length of a bytecode instruction executed directly by the Komodo microcontroller is three bytes). A detailed view shows an optimum size of about the fetch bandwidth plus six, with regard to hardware costs and performance.

Figure 5: Benchmark with different window size and fetch bandwidth

4 Conclusions

This paper describes the basic idea of ISTs to handle external events and the corresponding multithreaded Java microcontroller. The microarchitecture is outlined and several benchmark results are shown. On the one hand, the performance gain possible through multithreading is proven; on the other hand, some hardware parameters are optimized. A complete system with hardware, operating system (JVM), middleware and application (automated guided vehicles) is under construction. We are working on a detailed hardware design in VHDL of the Komodo microcontroller and on further improvements of the adapted JVM.

References

[1] J. Silc, B. Robic, and T. Ungerer. Processor Architecture: From Dataflow to Superscalar and Beyond. Springer-Verlag, Heidelberg, 1999.

[2] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). Proceedings of the 26th International Symposium on Computer Architecture (ISCA '99), Atlanta, Georgia, Vol. 27, No. 2, pp. 186-195, May 1999.

[3] S. W. Keckler, A. Chang, W. S. Lee, and W. J. Dally. Concurrent Event Handling through Multithreading. IEEE Transactions on Computers, Vol. 48, No. 9, pp. 903-916, September 1999.

[4] J. O'Connor and M. Tremblay. PicoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, pp. 45-53, March/April 1997.

[5] Sun. PicoJava-II Microarchitecture Guide. Sun Microsystems, Part No. 960-1160-11, March 1999.

[6] U. Brinkschulte, C. Krakowski, J. Kreuzinger, R. Marston, and T. Ungerer. The Komodo Project: Thread-Based Event Handling Supported by a Multithreaded Java Microcontroller. 25th EUROMICRO Conference, Milano, September 1999.

[7] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and T. Ungerer. A Multithreaded Java Microcontroller for Thread-oriented Real-time Event Handling. 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, CA, pp. 34-39, October 1999.
