
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE

The Impact of Branch Prediction on Control Structures for Dynamic Dispatch in Java

Dayong Gu − Olivier Zendra − Karel Driesen

No. 4547 − September 2002 − THEME 2

ISSN 0249-6399

Research report


The Impact of Branch Prediction on Control Structures for Dynamic Dispatch in Java

Dayong Gu − Olivier Zendra − Karel Driesen

Theme 2 − Software Engineering and Symbolic Computation
Project MIRO
Research report no. 4547 − September 2002 − 17 pages

Abstract: Dynamic dispatch, or late binding of function calls, is a salient feature of object-oriented programming languages like C++ and Java. The target of a dispatched call changes according to the type of the object receiving the call. Due to inheritance the exact type is unknown at compile time, and therefore dispatch must in general occur at run time, implying a cost to the use of object-oriented programming languages. In previous work, we measured the performance of various equivalent non-object-oriented control structures to determine if dispatch cost can be reduced by translation. Measurements on a variety of virtual machines and hardware platforms show that alternative control structures are useful for a low number of expected types (low degrees of polymorphism). However, the gains differ substantially for different type patterns, even when the number of types is constant. The difference is likely to be caused by a processor's branch predictor, which guesses the outcome of branches involved in dynamic dispatch. In this paper, we simulate branch predictors of Athlon and Pentium in order to validate this insight. The results show that branch prediction accuracy is indeed responsible. For successful optimization it is therefore not sufficient to guess the number of types occurring in a call. The type pattern should also be taken into account.

Key-words: Java, dynamic dispatch, object-oriented, control structure, branch prediction

This research follows Olivier Zendra's INRIA postdoctoral fellowship at McGill University. Also published as Technical Report SOCS-02.6, School of Computer Science, McGill University.
Dayong Gu is a member of the Adaptive Computation Lab at McGill University, Montréal, Canada.
Karel Driesen is a professor at McGill University, Montréal, Canada, and head of the ACL laboratory.

INRIA Lorraine research unit, LORIA, Technopole de Nancy-Brabois, Campus Scientifique, 615 rue du Jardin Botanique, BP 101, 54602 Villers-Lès-Nancy (France). Telephone: +33 3 83 59 30 00 − Fax: +33 3 83 27 83 19


Impact of Branch Prediction on Control Structures for Dynamic Dispatch in Java

Summary: Dynamic dispatch, or message sending, is a salient concept in object-oriented languages like C++ and Java. The target of a message send changes according to the type of the receiver object. Because of inheritance, this exact type is not known at compile time; dynamic dispatch must therefore be performed at run time in the general case, which implies an additional cost when using object-oriented languages. In our previous work, we measured the performance of various equivalent conventional control structures, to determine whether the cost of dynamic dispatch can be reduced by using them. Our experiments, on various virtual machines and hardware platforms, show that these alternative control structures are useful when the number of expected types is low (low degree of polymorphism). However, the gains vary widely across type patterns, even when the number of types is constant. These differences are probably caused by the processors' branch predictors, which predict the outcome of the branches involved in dynamic dispatch. In this document, we simulate the branch predictors of the Athlon and the Pentium in order to validate this hypothesis. Our results show that prediction accuracy is indeed responsible. To optimize effectively, it is therefore not sufficient to predict the number of possible types for the receiver. The type pattern must also be taken into consideration.

Keywords: Java, dynamic dispatch, object-oriented, control structures, branch prediction


The Impact of Branch Prediction on Control Structures for Dynamic Dispatch in Java

Dayong Gu, Adaptive Computation Lab, McGill University, Canada
Olivier Zendra, INRIA-Lorraine / LORIA, France
Karel Driesen, Adaptive Computation Lab, McGill University, Canada

Abstract
Dynamic dispatch, or late binding of function calls, is a salient feature of object-oriented languages like C++ and Java. The target of a dispatched call changes according to the type of the object receiving the call. Due to inheritance the exact type is unknown at compile time, and therefore dispatch must in general occur at run time, implying a cost to the use of object-oriented programming languages. In previous work, we measured the performance of various equivalent non-object-oriented control structures to determine if dispatch cost can be reduced by translation. Measurements on a variety of virtual machines and hardware platforms show that alternative control structures are useful for a low number of expected types (low degrees of polymorphism). However, the gains differ substantially for different type patterns, even when the number of types is constant. The difference is likely to be caused by a processor’s branch predictor, which guesses the outcome of branches involved in dynamic dispatch. In this paper, we simulate branch predictors of Athlon and Pentium in order to validate this insight. The results show that branch prediction accuracy is indeed responsible. For successful optimization it is therefore not sufficient to guess the number of types occurring in a call. The type pattern should also be taken into account.

Keywords: dynamic dispatch, Java, object-oriented, branch prediction, control structure

1 Introduction

Dynamic dispatch is a salient feature of object-oriented programming languages like C++ and Java. When a virtual method call in Java is executed, the object that receives the call retrieves the class-specific method implementation and invokes it. This late binding of dispatch targets allows any object to play the role of the receiver object, as long as the new object implements the expected interface (it is substitutable à la Liskov [7]). Such type-substitutability enables better code abstraction and code reuse, and is therefore one of the main advantages of object-oriented programming languages. Consequently, dynamic dispatch occurs frequently. For instance, virtual method invocations in Java [4] occur every 12 to 40 byte codes [2].

Unfortunately, virtual method calls can be very time-consuming. The main cause of their inefficiency is the indirect branch instruction that resides at the core of a virtual method call. On modern, deeply pipelined processors, mispredicted indirect branches cause “pipeline bubbles” which stall the CPU [3]. One possible optimization translates the call to a non-object-oriented control structure (e.g. an if-sequence), in the expectation that it will be compiled into less expensive native code instructions. We explored this strategy in [11], using real time measurements of four groups of benchmarks running on various run time type patterns, on several virtual machines and different hardware platforms. We found that a translation to non-object-oriented control structures reduces execution time for dynamically dispatched calls with few types. However, the gain depends highly on the type pattern, which led to the hypothesis that mispredictions of the branch predictor are responsible for this difference. In this study, we simulate the branch predictors of an Athlon and a Pentium III processor in order to determine the influence of branch prediction on the efficiency of control structures for dynamic dispatch.

The paper is organized as follows. In section 2, we present the methodology, both of the original real time experiment and of the simulation executed in this paper. In section 3 we present and discuss the results. The last two sections conclude and mention future work.

2 Methodology

Figure 1 compares the structure of the real time experiment in [11] and the current experiment, based on simulation.

[Figure 1: Structure of the experiments. Labels: Benchmarks; Hotspot 1.3.1 on Pentium III; Hotspot 1.3.1 on Athlon; Real time result; Kaffe JVM; Byte code execution traces; Athlon branch predictor simulation in Plumber; Pentium III branch predictor simulation in Plumber; Simulated result.]

The real time experiment simply consists of running the benchmarks on Sun’s Hotspot JVM 1.3 and recording the execution time (see section 2.1 for benchmark descriptions). In this study we only consider one virtual machine and two platforms based on the same x86 instruction set architecture, to factor out differences in compilation and ISA. We found in [11] that results were determined primarily by the hardware platform and that execution trends differed little between different JVMs.

For the simulation experiment we first executed the benchmarks using an instrumented Kaffe (1.0.6) JVM, which provided byte code execution traces for the inner loop of the benchmarks. These traces were then fed to Plumber, a branch prediction simulator designed to count branch prediction misses for a variety of different architectures. The aim of the experiment was to obtain cost estimates close to the real time measurements, using a simple cost calculation model at the byte code level which takes into account branch prediction at the processor level.

2.1 Benchmark programs

We designed the micro benchmarks to emphasize the cost of one dynamically dispatched call under a variety of run time execution conditions. Figure 2 shows an overview of all benchmark programs. They share the same structure, executing a loop in which a static function is called on an object extracted from an array (the array is initialized to a particular type pattern from a file; we show the type IDs). The static function differs according to the control structure which we want to measure.

[Figure 2: Benchmark programs. Labels: Pattern array; 10000; 1|5|3|4|3|2|1|4|5|2|4|3|1…; Loop : i = 0 to loopNum; Func(array[i]).]
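As a rough illustration of the benchmark structure in Figure 2, the following Java sketch fills a pattern array from a sequence of type IDs and calls a static function on each element in a loop. The class names, the foo bodies, the helper names (buildPatternArray, func, run) and the wrap-around indexing are assumptions made for this sketch, not the authors' benchmark code; only the virtual-invocation variant of the static function is shown.

```java
public class DispatchBenchmarkSketch {
    // Hypothetical receiver hierarchy (illustrative names, not from the paper).
    static abstract class Base {
        final int typeID;
        Base(int typeID) { this.typeID = typeID; }
        abstract int foo(int x);
    }
    static final class Class1 extends Base {
        Class1() { super(1); }
        int foo(int x) { return x + 1; }
    }
    static final class Class2 extends Base {
        Class2() { super(2); }
        int foo(int x) { return x + 2; }
    }
    static final class Class3 extends Base {
        Class3() { super(3); }
        int foo(int x) { return x + 3; }
    }

    // Allocates one object per type ID, mirroring how a pattern file
    // would determine the contents of the pattern array.
    static Base[] buildPatternArray(int[] typeIds) {
        Base[] array = new Base[typeIds.length];
        for (int i = 0; i < typeIds.length; i++) {
            switch (typeIds[i]) {
                case 1: array[i] = new Class1(); break;
                case 2: array[i] = new Class2(); break;
                case 3: array[i] = new Class3(); break;
                default:
                    throw new IllegalArgumentException("unknown type ID " + typeIds[i]);
            }
        }
        return array;
    }

    // The static function under test; here it holds the single virtual call site.
    static int func(Base p, int x) {
        return p.foo(x);
    }

    // Main loop over loopNum iterations, wrapping around the pattern array
    // (an assumption of this sketch). Accumulating the results keeps the
    // calls from being trivially dead code.
    static long run(Base[] array, int loopNum) {
        long sum = 0;
        for (int i = 0; i < loopNum; i++) {
            sum += func(array[i % array.length], i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Base[] array = buildPatternArray(new int[] {1, 2, 3, 1});
        System.out.println(run(array, 4));
    }
}
```

Swapping the body of func for another control structure while leaving run untouched is what lets the benchmark isolate the cost of the dispatch itself.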


Benchmarks were generated for different static type sizes (number of types that can occur):

. "Virtual invocation": In this group, all the benchmark programs have one virtual method call site in a large loop. The number of possible types is different for different benchmark programs. However, the core code sequence remains the same:

    p.foo(x);

. "If-Sequence": If-sequence benchmarks use a sequence of 2-way conditional type checks to determine the call target. The size of this structure is determined by the number of possible types. The code below demonstrates the core code sequence for a call with 4 possible types:

    int localId = p.typeID;
    if (localId == ID_1)
        Class1.foo_static(x);
    else if (localId == ID_2)
        Class2.foo_static(x);
    else if (localId == ID_3)
        Class3.foo_static(x);
    else // localId == ID_4
        Class4.foo_static(x);

. "Binary Tree": Binary tree benchmarks are similar to if-sequences, but use inequality and are organized as a binary decision tree. Its structure depends on the static type size. The core code sequence for four types:

    int localId = p.typeID;
    if (localId

We therefore did not include them in this study, as their behavior in terms of branch prediction is explained by the corresponding if-sequence or virtual function call behavior.

. "NoCall": NoCall benchmarks are used as a baseline, to estimate benchmark overhead: the execution time of all byte codes except those involved in dynamic dispatch. We took care to ensure that the compiler was not able to optimize away this overhead.

In order to factor out differences in generated code sequences, we focus on benchmark programs which have 20 possible types. In [11] we measured the effect of different static type sizes.

2.2 Type patterns

In order to measure the effect of different type execution patterns while ensuring that a smart compiler cannot guess their occurrence, the array in figure 2 is initialized from a pattern file. Each number causes an object of a particular type to be allocated and stored in the pattern array. We use four sets of pattern files in the experiment:

. Const Patterns: A particular const pattern file is a sequence of one particular integer:

    cst_01: 1,1,1,1,1,1,1,1…
    cst_02: 2,2,2,2,2,2,2,2…
    cst_03: 3,3,3,3,3,3,3,3…
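To make the link between type patterns and branch prediction more concrete, the sketch below counts mispredictions for the type tests of an if-sequence under a very simple predictor: one 2-bit saturating counter per test. This is an assumption chosen for illustration only. It is not Plumber, and it does not model the Athlon or Pentium III predictors, which also use branch history. The class and method names are hypothetical.

```java
public class IfSequencePredictorSketch {
    /**
     * Counts mispredicted type tests of an if-sequence over one type pattern.
     * Test k stands for "localId == ID_k"; the last type falls into the final
     * else and needs no test. Each test has its own 2-bit saturating counter
     * (states 0..3, predicting "equal" when the state is 2 or 3), starting at 0.
     */
    static int countMispredictions(int[] pattern, int numTypes) {
        int[] counters = new int[numTypes]; // index 0 is unused
        int misses = 0;
        for (int type : pattern) {
            for (int k = 1; k < numTypes; k++) {
                boolean matches = (type == k);
                boolean predictedMatch = counters[k] >= 2;
                if (predictedMatch != matches) {
                    misses++;
                }
                // Move the counter toward the observed outcome, saturating at 0 and 3.
                if (matches) {
                    counters[k] = Math.min(3, counters[k] + 1);
                    break; // call target found, later tests are not executed
                } else {
                    counters[k] = Math.max(0, counters[k] - 1);
                }
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        int[] constPattern = {1, 1, 1, 1, 1, 1, 1, 1};
        int[] alternatingPattern = {1, 2, 1, 2, 1, 2, 1, 2};
        System.out.println("const:       " + countMispredictions(constPattern, 4));
        System.out.println("alternating: " + countMispredictions(alternatingPattern, 4));
    }
}
```

Under this toy model a constant pattern only mispredicts while the counter warms up, whereas a pattern that alternates between two types keeps mispredicting the first test, even though both patterns run the same code with the same static type size.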
