A Dualthreaded Java Processor for Java Multithreading

Chun-Mok Chung and Shin-Dug Kim
Parallel Processing System Laboratory
Department of Computer Science
Yonsei University, Seoul, 120-749, Korea
{chunmok, [email protected]}

Abstract

The Java-Web Computing paradigm has turned the Internet into a computing environment. For Java-Web Computing and many other Java applications, a new Java processor, called the simultaneous multithreaded (SMT) JavaChip, is proposed to improve on previous Java processors through hardware support for Java multithreading. SMT JavaChip is a modified architecture with enhanced mechanisms for the stack cache, instruction cache, and functional units. It executes two independent threads simultaneously and increases instruction-level parallelism. The performance of SMT JavaChip is evaluated through simulation with JavaSim, a Java processor simulator. This research focuses on improving Java processor performance by considering the characteristics of the Java language and its computing environment. Performance results show that SMT JavaChip provides execution speedups between 1.28 and 2.00 over single-threaded Java processors.

1. Introduction

The Internet has grown exponentially in use, and Java has become a focus for Internet applications. Major characteristics of Java, such as hardware independence, built-in security, and its network-oriented design, make the Java environment quite promising. The development of Java, combined with the previously established Web interface, has produced a new computing environment called the Java-Web Computing (JWC) environment [1, 2, 3]. It is based on the combination of a user-friendly Web interface and the computing power of the many Java-capable computers connected to the Internet. The JWC environment can be applied to many problems that are difficult to solve on a single system. (This work was partially supported by Korea Science Foundation interdisciplinary Grant 97-0102-02-01-3.)

In spite of its conceptual excellence, Java has a drawback: low execution speed. The software Java Virtual Machine (JVM) [6], Java's approach to hardware independence, relies on interpretation and therefore performs poorly compared with conventional computing environments. Several methods have been proposed to overcome this drawback. One is to design a Java processor that executes Java bytecode directly [8]. A Java processor has a small and simple architecture. Conventional Java processors, designed mainly around the structure of the JVM without considering the characteristics of the Java language, deliver only moderate performance on general Java programs [9, 10]. A Java processor is a specialized processor for executing Java programs efficiently; significant performance gains are possible when both the JVM specification and the primary characteristics of the Java language are considered in its design. The structural characteristics of Java are stack-intensive processing, the memory-reference behavior of an object-oriented language, and language-level support for multithreading. Conventional Java processors reflect only the stack-intensive characteristic.

In this research, an architectural model that exploits Java's multithreading feature is proposed. Because multithreading in Java must be specified explicitly by the programmer, it can be adapted readily to a processor architecture that supports multithreading, and programmers can convert conventional Java programs into multithreaded versions with compiler aids. Multithreading is well suited to parallel processing and offers another way to improve Java performance. A stack-oriented Java processor needs fewer functional units and registers to maintain the status of a context than a general-purpose processor does.
Because multithreading in Java was first implemented in the software JVM, context switching between threads is lightweight, which gives a multithreaded Java processor a cost advantage over general-purpose processors.

A new Java processor architecture, called the simultaneous multithreaded (SMT) JavaChip, is proposed in this research for JWC and many other applications. It is a modified architecture with enhanced mechanisms for the stack cache, instruction cache, and ALU units, so that it supports simultaneous multithreading of Java directly. The performance and efficiency of the proposed architecture are evaluated through simulation with JavaSim, a Java processor simulator. The proposed architecture can be applied to an embedded Java station containing an SMT JavaChip as an accelerator for fast Java execution, and SMT JavaChip is also shown to be efficient in the JWC environment. According to the simulation results, SMT JavaChip provides execution speedups between 1.28 and 2.00 over the previously proposed Java processor when multithreaded Java workloads are executed. This gain comes from two major characteristics of SMT JavaChip: the modified structure and management mechanism of the stack cache, and the added functional units. The results show that supporting Java threads directly in hardware is an effective way to improve Java performance, and that cooperation between hardware and software is an advantageous approach to enhancing overall performance.

Section 2 describes related work. Section 3 presents the organization and operational model of SMT JavaChip. Section 4 measures and evaluates the performance of the proposed architecture. Section 5 concludes.
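Java's explicit thread model, which the proposed architecture exploits, can be illustrated with a minimal sketch (the class, method names, and the split computation are hypothetical, not from the paper): a sequential summation divided across two Java threads, each corresponding to one independent context.

```java
// Hypothetical sketch: splitting an independent computation across two
// Java threads, as the multithreaded workloads in this paper assume.
public class DualThreadSum {
    static long partialSum(long from, long to) {
        long s = 0;
        for (long i = from; i < to; i++) s += i;
        return s;
    }

    public static void main(String[] args) throws InterruptedException {
        final long[] results = new long[2];
        // Each Runnable is one independent thread context, i.e., one
        // Thread Processor (TP) in SMT JavaChip terms.
        Thread t0 = new Thread(() -> results[0] = partialSum(0, 500_000));
        Thread t1 = new Thread(() -> results[1] = partialSum(500_000, 1_000_000));
        t0.start(); t1.start();   // the two threads run concurrently
        t0.join();  t1.join();    // wait for both contexts to finish
        System.out.println(results[0] + results[1]);  // prints 499999500000
    }
}
```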

2. Related work

This section describes previous research on Java processors, Java-Web Computing, and simultaneous multithreaded processors that is directly or indirectly related to SMT JavaChip.

O'Connor et al. implemented the previously software-based JVM as a hardware Java processor [8]. This small, flexible microprocessor core, named picoJava, improved Java execution performance by more than 20 times over conventional interpretation using a stack cache, dribbling, and an instruction-folding mechanism [9, 10]. Based on the fact that Java performs stack operations intensively, picoJava uses a hardware stack, called the stack cache, with a dribbling algorithm for its efficient management. To reduce the excessive number of stack operations and improve performance, picoJava adopted instruction folding. Ton et al. proposed a Java processor architecture with an enhanced instruction-folding mechanism [11], based on picoJava's, and compared its performance with that of picoJava. Their architecture extends folding to 3- and 4-instruction cases, where picoJava considers only the 2-instruction case.

An elementary multithreaded processor architecture was proposed in [5]. It supports multiple programs directly to improve overall performance, mainly through two mechanisms: concurrent multithreading and parallel multithreading. The authors argue that multithreading resembles a superscalar design in cost and architecture, and show through experiments that the former outperforms the latter. In [4, 13], simultaneous multithreading (SMT) was proposed as a means of increasing on-chip parallelism and throughput: SMT exploits both thread-level parallelism and instruction-level parallelism by issuing instructions from different threads in the same cycle. To address a fetch bottleneck in SMT, [12] proposes a processor architecture and several fetching mechanisms. Compiler-level optimization techniques are used in [7] to increase parallelism and SMT processor throughput; the performance of the SMT processor is compared with superscalar and multiprocessor systems through simulation.

As applications of the proposed architecture, ParaWeb [3] extends the Java programming environment and runtime system with multithreading facilities for parallel execution over the Internet. Charlotte [2] supports key functionality for harnessing the Web as a metacomputing resource for parallel computation using Java. ATLAS [1] executes parallel multithreaded programs on networked computing resources by integrating Java and Cilk technologies.

3. Simultaneous Multithreaded JavaChip

The Java-Web Computing environment is briefly introduced as one target application for SMT JavaChip, for which the processor is designed. The structure and operational mechanism of SMT JavaChip are then explained.

3.1. Java-Web Computing Environment

The Java-Web Computing (JWC) environment provides programmers with a global view of programming [1, 2, 3]. It is assumed to consist of a client computer, a set of worker computers, and a manager computer. In this model, the client computer requests that its program be executed through JWC by the manager computer. Worker computers are cooperating computers that join JWC during their idle time to offer themselves as computing resources. Figure 1 shows the basic JWC execution mechanism assumed in this research. It starts from

Figure 1. Java-Web Computing environment. [Diagram: a manager computer running an M-Main thread and M-Mng threads, connected to workers 1 through n, each running a W-Comm thread and a W-Comp thread.]

the first step, in which the client computer sends the manager computer the program to be executed through JWC. In the second step, the volunteer worker computers connected to the manager computer are identified, and in the third step the manager computer sends a series of executable programs (Java applets) to the worker computers. In the fourth step, all worker computers execute the downloaded program and return their results to the manager computer when they finish. In the last step, the manager computer gathers all results from the workers and returns them to the waiting client computer.

At a worker computer, two types of threads are created to execute the downloaded Java applet: a worker-computation (W-Comp) thread and a worker-communication (W-Comm) thread. The W-Comp thread executes everything in the program except communication. When the W-Comp thread of a worker wants to communicate with other workers, it calls a method of the W-Comm thread. This method opens a socket connection and sends the message to the M-Mng thread in charge of that worker, which then processes the communication request. Through Java's multithreading mechanism, computation and communication proceed independently and concurrently. All communication-related functions are handled by the M-Mng and W-Comm threads; thus the role of the manager computer is simplified and effective cooperation among workers is supported.

An embedded Java station is a computer system with high-speed Java execution capability. It can be used in a conventional computing environment and is also well suited to the new JWC environment. An embedded Java station is assumed to be a conventional computer system augmented with a Java processor module supporting one or more Java threads. The Java processor module acts as an accelerator for the central processor. An architecture supporting two simultaneous threads is therefore considered hereafter.
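The worker-side thread pair described above can be sketched in Java (names are hypothetical, and an in-memory queue stands in for the socket to the M-Mng thread so the sketch is self-contained):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the worker model: a W-Comp thread computes while a
// W-Comm thread handles messages, so communication latency does not stall
// computation. A blocking queue stands in for the socket to the M-Mng thread.
public class WorkerSketch {
    static final BlockingQueue<String> toManager = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        Thread wComm = new Thread(() -> {
            try {
                // W-Comm: wait for a message from W-Comp and forward it.
                String msg = toManager.take();
                System.out.println("W-Comm sent: " + msg);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread wComp = new Thread(() -> {
            long result = 0;
            for (int i = 1; i <= 100; i++) result += i;  // the applet's work
            toManager.add("result=" + result);           // hand off to W-Comm
        });
        wComm.start(); wComp.start();   // computation and communication overlap
        wComp.join();  wComm.join();
    }
}
```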

Figure 2. Organization of SMT JavaChip: (a) overall architecture (two Thread Processors sharing the instruction cache, stack cache, and data cache and, through selectors, the integer ALU, floating-point ALU, and Load/Store unit); (b) structure of a Thread Processor (PC, fetch unit, stack register, instruction buffer, decoder, execution controller, and memory interface).

3.2. Architecture of SMT JavaChip with dual Thread Processors

SMT JavaChip is designed as an architectural model that supports two threads simultaneously, as shown in Figure 2 (a). Each thread maintains its context in a thread processor (TP); to support two simultaneous threads, SMT JavaChip includes two TPs in its core. The two TPs share several functional units: an integer ALU, a floating-point ALU, and a Load/Store unit. For efficient support of dual threads, the architecture of SMT JavaChip is designed as an extension of the basic picoJava architecture. As shown in Figure 2, SMT JavaChip consists of two TPs, three functional units with their associated selectors, and three caches. An additional integer ALU and Load/Store unit can be added to support dual threads more efficiently, and the instruction cache is divided logically per thread to keep the hit ratio high. The stack cache can also be divided into two separate stack caches, either logically or physically; these two methods are evaluated via simulation in Section 4.

SMT JavaChip uses a 4-stage pipeline of fetch, decode, execute, and write back, like the basic picoJava specification. In the fetch stage, instructions are fetched from the instruction cache into a 12-byte-wide instruction buffer for each TP. In the decode stage, instructions are decoded and operands are loaded from the stack, with instruction folding applied at the same time when possible. In the execute stage, ALU operations or memory references are performed. All execution results are written to the stack cache in the write-back stage.

A TP is the primary structure for maintaining a single valid thread (context) and consists of a program counter (PC), a fetch unit, a stack register, an instruction buffer, and a decoding unit.
As shown in Figure 2 (b), a TP contains one PC pointing to the position of the next instruction to fetch. The fetch unit reads instructions from the instruction cache and writes them to the instruction buffer according to the PC. The contents of the instruction buffer are decoded by the decoder in the decode stage, and the corresponding control signals are generated by the execution controller. The stack register holds the starting address of the memory stack used by the thread on that TP; it exists because Java is stack-based and each thread uses its own separate memory stack. One thread at a time can be allocated to a TP. The two TPs in a single SMT JavaChip can execute a communication thread and a computation thread simultaneously; because the TPs are independent of each other, a worker can run both threads at once and avoid the computation stalls caused by communication latency.

The two TPs share the three functional units, which means both may want to use the integer ALU at the same time. In that case, one TP uses it while the other must wait until the unit becomes available; waiting for a shared resource can stall a TP's pipeline and delay execution. In SMT JavaChip, a selector is placed in front of each functional unit to prevent this delay. The selector is a small buffer that stores the operation code destined for the unit: when arbitration occurs, the waiting TP deposits its operation code in the selector and its other pipeline stages continue. A selector is sufficient for the floating-point ALU, which is used infrequently, but not for the integer ALU or Load/Store unit, which are used more often. The alternative is to add a second integer ALU and Load/Store unit; these added units reduce the pipeline stalls caused by conflicts and improve overall performance.
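The selector mechanism can be sketched as follows (a simplified software model with hypothetical names; the real selector is a hardware buffer in front of each shared functional unit):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the selector idea: when the shared integer ALU is
// busy, a TP deposits its opcode in the selector buffer instead of stalling.
public class SelectorSketch {
    static final Queue<String> selector = new ArrayDeque<>();
    static boolean aluBusy = false;

    static synchronized void issue(String op) {
        if (aluBusy) {
            selector.add(op);   // buffer the op; the TP's pipeline continues
        } else {
            aluBusy = true;     // ALU was free and accepted the op directly
        }
    }

    static synchronized String aluDone() {
        String next = selector.poll();  // drain one buffered op, if any
        aluBusy = (next != null);
        return next;
    }

    public static void main(String[] args) {
        issue("iadd");                   // TP0 grabs the free ALU
        issue("imul");                   // TP1 finds it busy, goes to the selector
        System.out.println(aluDone());   // prints "imul": buffered op issues next
    }
}
```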

3.3. Structure of Instruction and Stack Caches

Maintaining dual threads means that two separate programs execute simultaneously, so two different program codes are resident in memory. When both codes share one instruction cache, the hit ratio may decrease because locality degrades and the effective cache size per thread shrinks; as the number of threads increases, the hit ratio drops further. For this reason, multiprocessor and multiprogramming systems tend to use separate instruction caches per processor to reduce cache misses. In the fetch stage, 12 bytes of instructions are fetched from the instruction cache into each TP's instruction buffer. Because SMT JavaChip fetches twice 12 bytes (12 bytes per TP) over the same data-bus bandwidth, its instruction-fetch latency at the same hit ratio is twice that of a conventional Java processor.

Figure 3. Logical dual stack cache: (a) structure (n data registers with per-thread top registers, high and low watermark registers, and two I/O buffers connected to the data cache for dribbling); (b) management mechanism (thread frames start at entries 0 and n/2).

SMT JavaChip is designed so that each TP can maintain a separate instruction cache to preserve a reasonable hit ratio; hence dual instruction caches are used. Because the cache specified for picoJava is simply divided into two separate instruction caches, the total instruction-cache size is unchanged. Java programs use a stack-based instruction set and Java class files are usually small, so the spatial locality of Java program code is moderately high; if the instruction cache is large enough to contain a single class file, high cache performance can be achieved.

Because Java has a stack-based instruction set architecture, the picoJava specification includes a stack cache, which provides fast access to the memory stack and hides memory-reference delay. The stack cache of SMT JavaChip is based on picoJava's, but picoJava is designed to support a single thread, which is unsuitable for the dual threads of SMT JavaChip. If two threads used a single shared stack cache, the frame sequences between caller and callee frames of the two threads would be mixed, so another stack-cache management mechanism is needed. In this work, two related stack-cache architectures and their management methods are designed and evaluated to prevent frame mixing.

To support simultaneous stack-cache access by both threads, each thread holds and maintains its own starting position in a shared, two-ported stack cache. The logical dual stack cache (LDSC) consists of four special registers, 64 data registers, two I/O buffers, and two local buses, as shown in Figure 3 (a). Each top register points to the top-of-stack entry used by its thread, and the high and low watermark registers are used for dribbling.
Frame data are stored in the 64 data registers, which are connected to two independent local buses, so two different data registers can be accessed at the same time; the data of the two selected registers are transferred simultaneously through the two I/O buffers, one per local bus. LDSC uses a single shared

Figure 4. Physical dual stack cache. [Diagram: each Thread Processor drives an enable signal (EN) to its own stack cache; both stack caches connect to the shared ALUs over the system bus.]

data register set, differentiating the starting entry position of one thread from that of the other within the same data registers. As in Figure 3 (b), if the stack cache has n entries, thread 0 starts saving its frames at the first entry and thread 1 at the n/2-th entry, so the effective stack space per thread is n/2 on average. For example, if thread 0 and thread 1 share a stack cache of 64 entries, as picoJava specifies, the frames of thread 0 start at entry 0 and those of thread 1 at entry 32, the n/2-th entry. Because the stack cache is managed as a circular queue, the starting points of the two threads stay a fixed distance apart. In LDSC, the sequence of method frames between an invoking method and an invoked method is maintained independently for each thread. Since the two threads share one stack cache, the average utilization of the stack cache remains as high as picoJava's even if one thread performs frequent method invocations; in the best case, the effective utilization of a same-sized stack cache is twice picoJava's without any increase in size.

In the physical dual stack cache (PDSC), two physically separate stack caches support the two simultaneous threads. In SMT JavaChip, the basic structure of each stack cache in PDSC is the same as picoJava's. According to the enable signal (EN) from a TP, a PDSC stack cache starts working, and its data are transferred to the ALU or data cache through the system data bus, as shown in Figure 4. PDSC consists of two stack caches, and each thread uses its own dedicated stack cache independently, so the two threads access the stack caches simultaneously without conflict. Maintaining dual threads in the stack cache is simple: the unmodified picoJava mechanism suffices because each stack cache interacts with a single thread.
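The LDSC entry arithmetic can be sketched as follows (a hypothetical model assuming 0-based entries and the circular-queue wrap-around described in the text):

```java
// Hypothetical sketch of LDSC index arithmetic: one shared circular register
// file of n entries, with thread t's frames starting at offset t * n/2.
public class LdscIndex {
    static int entry(int threadId, int top, int n) {
        int base = threadId * (n / 2);  // thread 0 -> entry 0, thread 1 -> n/2
        return (base + top) % n;        // circular-queue wrap-around
    }

    public static void main(String[] args) {
        int n = 64;                          // picoJava-sized stack cache
        System.out.println(entry(0, 0, n));  // thread 0 first entry: 0
        System.out.println(entry(1, 0, n));  // thread 1 first entry: 32
        System.out.println(entry(1, 40, n)); // wraps past the end: (32+40) % 64 = 8
    }
}
```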
PDSC can outperform LDSC because it prevents the delay that can occur in LDSC when, under unbalanced method-invocation frequencies, the two threads try to access the same stack cache entry. On the other hand, if the method-invocation frequency is unbalanced between the two threads, PDSC can make poor use of the stack space provided by its two caches: the thread with frequent method invocations can suffer a high stack-cache miss ratio even though the other thread's stack cache has enough free entries to hold its frames without penalty.

4. Performance evaluation

In this section, the performance of SMT JavaChip is evaluated via simulation. The simulation environment and benchmarks are presented, and several design issues are analyzed from the simulation results: the effect of the size and type of the instruction cache and stack cache, and of the number of functional units. The overall performance of SMT JavaChip is also compared with that of the conventional approach.

4.1. Simulation Environment

To evaluate the effectiveness of the multithreading support, a Java processor simulator, JavaSim, is designed; all simulation results are collected from it. JavaSim is developed as an evaluation tool for Java processor core design: it simulates the behavior of the Java processor by executing Java programs directly and collects statistics from that execution. The system parameters used in the simulation are shown in Table 1.

Table 1. Functional units and execution cycles.

  Function unit         Execution cycles
  Integer ALU           1
  Floating-point ALU    4
  Cache miss penalty    10
  Cache reference       1
  Dribbling per word    1

As shown in Table 2, JavaSim can be configured in different ways by four basic system parameters: the type of instruction cache, the number of functional units, the structure of the stack cache, and the size of the stack cache. The instruction cache is configured as either single or dual: a single instruction cache is shared by the two TPs, while each of the dual instruction caches is used individually by its associated TP. For the functional units, single means one set of integer ALU and Load/Store unit, and double means two sets. Two

Figure 5. Instruction frequency of the three benchmarks (instruction groups: STACK, MEMORY, ALU_INT, ALU_FP, BRANCH, METHOD, OBJ_NEW, ETC).

Figure 6. Performance of SMT JavaChip with additional functional units (execution time in cycles): (a) conventional Java processor, (b) PDSC128 + 1 FU, (c) LDSC128 + 1 FU, (d) PDSC128 + 2 FUs, (e) LDSC128 + 2 FUs.
types of stack cache structure, PDSC and LDSC, are evaluated, with stack cache sizes of 64 and 128 entries. Using these four parameters, 17 cases per workload are simulated: one for the conventional Java processor and 16 for the different combinations of the four configuration parameters.
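The count of 17 simulated cases follows from the four two-valued parameters plus the conventional baseline; a quick sanity check:

```java
// Enumerating the simulated configurations: all combinations of four
// two-valued parameters, plus the conventional Java processor baseline.
public class ConfigCount {
    public static void main(String[] args) {
        int[] choices = {2, 2, 2, 2};  // i-cache, functional units, stack type, stack size
        int combos = 1;
        for (int c : choices) combos *= c;  // 2^4 = 16 SMT configurations
        int total = 1 + combos;             // plus the conventional baseline
        System.out.println(total);          // prints 17, matching the text
    }
}
```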

Table 2. Simulator parameters to evaluate SMT JavaChip.

  Parameter               Configuration
  Instruction cache       Single / Dual
  Functional unit         Single / Double
  Stack cache structure   PDSC / LDSC
  Stack cache size        64 entries / 128 entries

4.2. Benchmarks

In general, processor performance is evaluated with SPEC benchmarks. As no benchmark suite written in the Java language existed at the time, Java benchmarks implementing well-known algorithms were developed first. Considering the characteristics of the Java language, several benchmarks were designed as different combinations of three major Java features: the number of objects involved, the number of method invocations, and the amount of computation-related operations. These three characteristics are as follows. First, Java is an object-oriented language: unlike conventional languages, most operations involve objects (instances) of classes. Since objects are created and reside in the garbage-collected heap of the JVM, object operations cause frequent memory references; therefore, programs with frequent memory references are selected to evaluate the proposed architecture. Second, a typical Java operation pattern is frequent method invocation and return. In an object-oriented language such as Java, instructions are generally contained in the methods of a class, and the data fields of objects and classes are almost always accessed through method invocations, which causes Java programs to include frequent method invocations and returns; a benchmark reflecting this is developed by varying the number of method invocations. Third, a computation-intensive application is selected as a benchmark: Java applications include computation-intensive workloads such as encryption/decryption and multimedia, and the main reason for using a Java processor instead of an interpreter is to pursue high performance and overcome Java's drawback, so a computation-intensive workload is required to evaluate SMT JavaChip.

As representatives of these three workload groups, three benchmark programs are chosen, varying each feature. The instruction frequencies of the benchmarks are shown in Figure 5. CalFibonacci is a Fibonacci computation in which method-related instructions account for 37% of all executed instructions. FindPrimary is a prime-number computation with a large portion of integer-arithmetic instructions, around 20%. ProcData processes multimedia data and has considerable memory-related instructions, around 22%.
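The method-invocation-heavy behavior attributed to CalFibonacci can be illustrated with a minimal sketch (the paper does not give benchmark source; this recursive version is a hypothetical stand-in): every call issues invoke/return bytecodes, which is what makes such workloads method-intensive.

```java
// Minimal sketch of a method-invocation-heavy workload in the style of the
// CalFibonacci benchmark (the paper's actual benchmark source is not shown):
// every fib() call executes invoke* and return bytecodes.
public class CalFibonacciSketch {
    static long calls = 0;

    static long fib(int k) {
        calls++;  // count method invocations
        return (k < 2) ? k : fib(k - 1) + fib(k - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(20));  // prints 6765
        System.out.println(calls);    // number of method invocations performed
    }
}
```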

4.3. Effect of Added Functional Units Figure 6 shows the simulation result on three benchmarks. In Figure 6, different configurations are represented as cases (a) through (e). Case (a) represents the execution time for the conventional picoJava processor. Cases (b) and (c) show SMT JavaChip with a single integer ALU and Load/Store unit with different configurations of stack cache. Case (d) and (e) are the cases of SMT JavaChip with additional functional units for two different stack cache specifi-

;; ; ;; ;; ; ;; ;; ;; ; ; ;; ;; ;; ;; ;;; ; ;;; ;; ;; ;;; ;; ; ;; ; ;; ; ;; ; ;; ;; ;;;;;;; ;; ; ;; ; ;; ; ; ;; ;; ;; ;; ; ;;; ;; ; ;; Execution time - Various stack caches

Conventional PDSC64 LDSC64 PDSC128 LDSC128

40000000 35000000

ETC

OBJ_NEW

30000000

METHOD

Cycles

Dribbling ratio

Stack Cache - Dribbling ratio 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0

25000000

BRANCH

20000000

ALU_FP ALU_INT

15000000

MEMORY

10000000

STACK

5000000 0

CalFibonacci

FindPrimary

ProcData

(a)(b)(c)(d)(e) CalFibonacci

FindPrimary

(a) Conventional Java processor (c) LDSC64 (b) PDSC64 (d) PDSC128

Figure 7. Dribbling ratios of various stack cache structures.

cations, where LDSC128 means the LDSC with 128 cache entries and PDSC128 shows the PDSC with 128 cache entries. Adding an additional integer ALU can decrease integer execution time by 5  50% for three benchmarks. The time on memory references can also be reduced by 9  50% depending on the type of benchmarks. But the overall performance gain by additional functions is smaller than the individual enhancement; 1.09  1.25 speedup than single functional unit system. This is mainly because the effect by the corresponding functional unit is limited by given instruction frequencies for these benchmarks. Instructions related with method invocation group such as invokestatic, invokevirtual, and invokespecial and those related with object creation occupy a considerable amount of overall execution time and they are nothing to do with added functional units.

4.4. Effect of Modified Stack Caches As dribbling needs data transfer between the stack cache and data cache, processor is assumed to pause to execute instructions while dribbling happens in this work. Frequency of dribbling operations tends to decrease the overall processor performance. Stack cache performance is obtained over the frequency of dribbling. The frequency of dribbling is measured through the simulation of stack cache about three benchmarks. A serialized thread for the benchmarks, denoted by conventional in Figure 7, is used for previous picoJava’s stack cache to compare the performance of proposed stack cache. Also dual threaded benchmarks are simulated for logical and physical dual stack caches, denoted by PDSC and LDSC in Figure 7, respectively. PDSC64 denotes the physical dual stack caches with the total size equivalent to previous picoJava’s; one data register set is 32 entries as half of picoJava’s. It shows that higher dribbling frequency than picoJava is generated. For the analysis for method frames of benchmarks, the largest frame needs 15 entries which Is almost half size of each stack cache in PDSC64. So, each stack cache of

ProcData

(e) LDSC128

Figure 8. Performance of SMT JavaChip with various stack cache architectures.

PDSC64 cannot maintain more than two frames, because its high watermark becomes too small to reserve enough entries for the next frame. Each stack cache in PDSC128 has 64 entries, the same as picoJava, so the total stack cache of PDSC128 is 128 entries. When dual threads are executed simultaneously in SMT JavaChip, the dribbling frequency of PDSC128 equals that of picoJava. Since dribbling operations are processed independently in the two stack caches of PDSC, the dribbling overhead is half of the previous picoJava's, and 128 entries of stack cache are enough to support simultaneous dual threads.

In LDSC64, the single shared data register set is the same size as picoJava's. The dribbling frequency of LDSC64 is higher than picoJava's but lower than PDSC64's. If the watermark is defined as the number of used data register entries in the stack cache, the watermark in LDSC is computed over the one shared data register set, whereas PDSC maintains two watermark values, one per data register set in its separated stack caches. Thus the dribbling frequency of LDSC64 is lower than that of PDSC64.

Figure 8 shows the effect of the various stack caches on the overall performance of SMT JavaChip. Conventional denotes the performance of picoJava, and the number after a stack configuration name (PDSC or LDSC) gives the total number of stack cache entries. According to the simulation results, PDSC128 and LDSC128 show the same best performance among all stack cache organizations, and this gain increases especially for method-intensive workloads. Because the dribbling latency is too small to affect overall performance by itself, the performance improvement of the four stack cache organizations is similar as long as they handle dual threads well.
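The watermark mechanism discussed above can be sketched as a simple spill/fill model, assuming a picoJava-style stack cache where crossing the high watermark spills entries to the data cache and falling below the low watermark fills them back. Entry counts and watermarks here are illustrative, not the paper's configuration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of watermark-driven dribbling: pushes past the
// high watermark spill the oldest entries to the data cache; pops
// below the low watermark fill them back. Each spill or fill is one
// dribble event (a processor stall under the model above).
public class StackCacheDribbling {
    final int lowWatermark, highWatermark;
    final Deque<Integer> entries = new ArrayDeque<>();
    int spilled = 0;        // entries currently held in the data cache
    int dribbleEvents = 0;  // spill/fill operations performed

    StackCacheDribbling(int low, int high) {
        this.lowWatermark = low;
        this.highWatermark = high;
    }

    void push(int v) {
        entries.push(v);
        if (entries.size() > highWatermark) { // spill oldest entry
            entries.removeLast();
            spilled++;
            dribbleEvents++;
        }
    }

    int pop() {
        int v = entries.pop();
        if (entries.size() < lowWatermark && spilled > 0) { // fill back
            entries.addLast(0); // placeholder for the refetched value
            spilled--;
            dribbleEvents++;
        }
        return v;
    }
}
```

In this model, halving each register set (as in PDSC64) lowers the high watermark, so the same method frame sizes cross it more often, matching the higher dribbling frequency reported for PDSC64.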

4.5. Overall Performance

Figure 9 shows the overall performance of the best structure of SMT JavaChip.

Figure 9. Overall performance of SMT JavaChip. (Execution cycles for CalFibonacci, FindPrimary, and ProcData under (a) the conventional Java processor, (b) PDSC128 + 2FUs, and (c) LDSC128 + 2FUs, broken down by instruction group: STACK, MEMORY, ALU_INT, ALU_FP, BRANCH, METHOD, OBJ_NEW, and ETC.)

The speedup of overall performance ranges from 1.28 for FindPrimary, which is computation intensive, to 2.00 for CalFibonacci, which is method-invocation intensive. Similar performance enhancement is achieved with the LDSC128 and PDSC128 stack cache structures in all benchmark programs, because the two structures show the same dribbling ratios. This speedup is achieved by executing dual threads simultaneously and by decreasing resource conflicts between the two TPs. As shown in Figure 9, the dual stack caches decrease execution cycles, especially for method- and stack-related instructions, thanks to the low stack cache dribbling ratio, while the additional functional units improve performance for instructions related to integer arithmetic and memory references. Across the simulations of three Java benchmarks, SMT JavaChip supporting simultaneous dual threads gains a speedup of 1.28 to 2.00 over the previous Java processor.
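The reported speedups follow from the per-instruction-group cycle breakdown plotted in Figure 9: total the cycles of each configuration and divide. A sketch of this bookkeeping, using illustrative placeholder cycle counts rather than actual JavaSim output:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of computing overall speedup from a Figure
// 9-style cycle breakdown. All cycle counts below are illustrative.
public class OverallSpeedup {
    static long total(Map<String, Long> cyclesByGroup) {
        return cyclesByGroup.values().stream()
                .mapToLong(Long::longValue).sum();
    }

    static double speedup(Map<String, Long> baseline, Map<String, Long> smt) {
        return (double) total(baseline) / total(smt);
    }

    public static void main(String[] args) {
        Map<String, Long> conventional = new LinkedHashMap<>();
        conventional.put("STACK", 8_000_000L);
        conventional.put("METHOD", 12_000_000L);
        conventional.put("ALU_INT", 10_000_000L);

        Map<String, Long> smtJavaChip = new LinkedHashMap<>();
        smtJavaChip.put("STACK", 4_000_000L);   // dual stack caches
        smtJavaChip.put("METHOD", 6_000_000L);  // method-intensive gain
        smtJavaChip.put("ALU_INT", 5_000_000L); // second integer ALU

        System.out.printf("speedup = %.2f%n",
                speedup(conventional, smtJavaChip)); // 2.00
    }
}
```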

5. Conclusion

The JWC paradigm has changed the Internet from a mere information repository into a computing environment, but the slow execution speed of Java has been an obstacle to its development. To overcome this drawback of the Java language, Java processors were proposed. In this research a new Java processor, SMT JavaChip, is proposed to enhance the performance of the Java processor through hardware support for Java multithreading. SMT JavaChip executes two independent threads simultaneously and enhances instruction-level parallelism (ILP). Its instruction cache is designed to support multiple instruction issue, and the stack cache and its management algorithm support simultaneous multithreading. To further enhance ILP, frequently used functional units, i.e., an integer ALU and a load/store unit, are added. The performance of SMT JavaChip is evaluated through simulation with JavaSim. The simulation results show that, when utilizing dual threads, a speedup of 1.28 to 2.00 can be achieved over the previously proposed Java processor. SMT JavaChip can therefore be applied effectively to construct an embedded Java station for the JWC environment. This research focuses on enhancing the performance of Java by mapping the characteristics of the Java language and its computation environment onto the system architecture.

References

[1] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An Infrastructure for Global Computing. 7th ACM SIGOPS European Workshop, pages 160-167, Sep. 1996.
[2] A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the Web. 9th International Conference on Parallel and Distributed Computing Systems, pages 151-159, Sep. 1996.
[3] T. Brecht, H. Sandhu, M. Shan, and J. Talbot. ParaWeb: Towards World-Wide Supercomputing. 7th ACM SIGOPS European Workshop, pages 181-188, Sep. 1996.
[4] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5):12-19, Sep./Oct. 1997.
[5] H. Hirata et al. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. 19th Annual International Symposium on Computer Architecture, pages 136-145, May 1992.
[6] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison-Wesley, Reading, Massachusetts, 1996.
[7] J. L. Lo, S. J. Eggers, H. M. Levy, S. S. Parekh, and D. M. Tullsen. Tuning Compiler Optimizations for Simultaneous Multithreading. 30th Annual International Symposium on Microarchitecture, pages 114-124, Dec. 1997.
[8] J. O'Connor and M. Tremblay. picoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, 17(2):45-53, Mar./Apr. 1997.
[9] Sun. picoJava-I Microprocessor Core Architecture: Designed for the Embedded Market. Sun Microelectronics Technical Report WPR-0014-01, Nov. 1996.
[10] Sun. Sun Microelectronics' picoJava-I Posts Outstanding Performance: Preliminary Benchmarks Show the Power of Casting Java in Silicon. Sun Microelectronics Technical Report WPR-0015-01, Nov. 1996.
[11] L.-R. Ton et al. Instruction Folding in Java Processor. 1997 International Conference on Parallel and Distributed Systems, pages 138-143, Dec. 1997.
[12] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. 23rd Annual International Symposium on Computer Architecture, pages 191-202, May 1996.
[13] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Annual International Symposium on Computer Architecture, pages 392-403, Jun. 1995.