Complex Branch Profiling for Dynamic Conditional Execution

Rafael R. dos Santos (1), Tatiana G. S. dos Santos (1), Mauricio L. Pilla (1), Philippe O. A. Navaux (1), Sergio Bampi (1), Mario Nemirovsky (2)

(1) Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil, [rrsantos,tatiana,pilla,navaux,bampi]@inf.ufrgs.br
(2) Kayamba, San Jose, CA, USA, [email protected]

Abstract: Branch predictors are widely used to deal with conditional branches. Despite their high accuracy, misprediction penalties are still large in any superscalar pipeline. DCE, or Dynamic Conditional Execution, is an alternative that reduces the number of predicted branches by executing both paths of certain branches, thereby reducing the number of predictions and, consequently, the number of mispredictions. The goal of this work is to analyze the complexity of branch structures and to determine the number of branches that can be predicated in DCE, as well as the distribution of mispredictions according to the proposed classification. The proposed complex branch classification extends the classification presented by Klauser et al. [KLA98]. The results show that, on average, 35% of all branches can be predicated in DCE and around 32% of mispredictions fall into these branches.

Keywords: Superscalar architectures, Multipath execution, Branch prediction, Dynamic predication

I. INTRODUCTION

Conditional branches present a challenge to increasing performance in current processors. Many mechanisms have been proposed to mitigate their impact, such as branch prediction, speculative execution, trace caches, instruction prefetching, and multipath execution. These mechanisms rely on fetching more instructions to feed the starving functional units [UHT95, TYS97, HEI96, KLA98, SAN98, SKA99].

The most common technique, implemented in all modern processors, is branch prediction. Branch prediction is widely used to predict the next instruction fetch address when conditional and/or unconditional branches are fetched. Nevertheless, despite the accuracy of current predictors, mispredictions still degrade performance considerably.

Superscalar pipelines are getting deeper to support increasing demands for clock frequency. As more stages are added, the penalty imposed by branch mispredictions increases. The accuracy of state-of-the-art branch predictors, however, is not improving. Increasing the accuracy of current predictors implies an extraordinary increase in complexity. Since the predictor now has to be split among several stages of fetch, many levels of prediction are necessary, again due to the high clock frequencies required.

One can avoid branch predictions by executing both paths of a conditional branch. Multipath execution was extensively studied in the past, but simply executing all paths of all branches proved not to be efficient. DCE [SAN01, SAN03] presents an alternative where only certain conditional branches are predicated. DCE can perform dynamic predication of simple and complex conditional branches without requiring a special instruction set or special compiler optimizations; hence it can be applied to legacy code.

This paper presents an analysis of the behavior and patterns of direct conditional branches in order to better understand the dynamics of control structures and to quantify the number of branches that can be predicated. First, DCE is briefly presented in Section II. Then, Section III introduces a classification for hammocks that extends the classification presented by Klauser et al. [KLA98]. The simulation environment is presented in Section IV, and the results are discussed in Section V. The conclusions are drawn in the last section.

(Grants from CNPq, CAPES and UNISC.)

II. DYNAMIC CONDITIONAL EXECUTION

DCE combines dynamic predication and multipath execution to reduce the complexity and disruptions of the fetch. This is achieved by fetching sequentially through branches that qualify for predication. To determine whether a branch qualifies for predication, an extension of the selection mechanism proposed in [KLA98] was developed. In their selection mechanism, only simple branches qualify for predication. The selection scheme used in DCE also qualifies complex branches. The selection mechanism is static and runs at compilation time, marking branches that can be predicated according to the target locality. The compiler does not change the original code; it only marks instructions valid for predication [SAN03].

[Plot for Figure 1: "DCE Misprediction Reduction (Distance 64)". Vertical axis: Reduction (%), from 0 to 10. Horizontal axis: configurations from M16 W16 to M1024 W64. Series: Simple, Complex Pure, Complex Multiple Joins, Complex Multiple Targets, Complex Overlapped.]

At execution time, the fetch engine decides whether a selected branch will be predicated or not, based on the availability of resources. Therefore, DCE is a combination of a static selection mechanism and hardware support to execute certain branches eagerly. In DCE, selected branches are treated by the fetch engine as regular instructions, and they never disrupt fetching. As no prediction of control transfer is made at fetch time for these instructions, they cannot cause mispredictions.

The main difference between Klauser et al. [KLA98] and DCE [SAN03] is that DCE predicates both simple and complex branches and does not use conditional moves to satisfy data dependences at the join point of predicated branches. In Klauser et al., conditional moves are inserted dynamically at the join point to block the issue of instructions from the same data chain of the predicated paths. When the branch resolves, the conditional moves can be issued to copy the data from the correct physical register to the correct source register. Thus, consumer instructions that use the respective register become ready for issue only after the conditional move instruction has been executed.

In DCE, a register renaming technique derived from Chaves et al. [CHA99] generates replicas of instructions at the join point of predicated branches. Replicas are instructions that use data produced in one of the paths of a predicated branch. Therefore, there is one replica for each predicated path. As DCE does not use conditional moves, it does not block the issue of the dependent instructions. In DCE, replicated instructions can be issued as soon as the corresponding source physical registers are ready (i.e., the source data is available).
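The replica rule can be illustrated with a minimal sketch. It assumes a hypothetical model in which each predicated path is summarized by the set of architectural registers it writes; the function name `replica_count` and the register names are illustrative only and are not part of DCE. A join-point instruction that reads a register written on a predicated path gets one replica per path, while an instruction that reads only registers produced before the branch is issued once.

```python
def replica_count(source_regs, path_writes):
    """Return how many copies of a join-point instruction are needed.

    source_regs: registers read by the join-point instruction.
    path_writes: one set of written registers per predicated path
                 (a hypothetical summary of each path).
    """
    sources = set(source_regs)
    depends_on_path = any(sources & set(writes) for writes in path_writes)
    # One replica per predicated path if the instruction consumes path data;
    # otherwise the instruction does not depend on the paths and is issued once.
    return len(path_writes) if depends_on_path else 1


# Two predicated paths: the first writes r1, the second writes r1 and r2.
paths = [{"r1"}, {"r1", "r2"}]
print(replica_count(["r1", "r3"], paths))  # 2
print(replica_count(["r4"], paths))        # 1
```

The sketch only counts replicas; how the renaming hardware binds each replica to the physical registers of its path is described in [SAN03] and is not modeled here.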
If an instruction at the join point reads a register that is produced logically before a predicated branch, DCE will not introduce multiple replicas of that instruction, since the instruction is control independent [ROT99] and data independent with respect to the paths being predicated.

Figure 1 shows the misprediction reduction for the different classes of branches in DCE. Simple is the same class of branches predicated in the machine proposed by Klauser et al. The variations of Complex are the extended model proposed here, which includes Simple and four other combinations of nested hammocks. The horizontal axis presents the different configurations simulated, varying the number of mapping tables (M) and the architecture width (W). The number of mapping tables determines the maximum number of active paths supported, and the width determines the number of instructions that can be renamed/issued/executed/committed per cycle. Also, there is a maximum of 64 instructions between branches to be predicated and their join points (Distance 64).

Figure 1. Misprediction Reduction

Remarkably, predicating Simple branches allows a reduction of 3.3-5.3% of mispredictions, but the reduction remains almost constant when the number of mapping tables is increased, i.e., it does not improve when the ability to predicate more branches increases. In contrast, the misprediction reduction when predicating complex branches almost doubles when the number of mapping tables is increased, which shows the potential of the technique.

III. HAMMOCK CLASSIFICATION

Two important questions for the performance of DCE are: (i) how many branches have near targets and contain only valid nested branches; and (ii) how many mispredictions fall into these branches. To answer these questions, the behavior of branches found in general applications was studied and classified. This classification defines the predication limits in DCE. The goal is to find the best set of branches to be selected and to optimize the architecture accordingly.

The compiler was modified to analyze the static code and look for conditional forward branches. It then classifies the branches according to the number and type of nested branches that they contain and the distance to the taken target.

A conditional forward branch that has no nested branches may have one side (if-then) or two sides (if-then-else). These branches are called single-sided (a) or double-sided (b) simple hammocks, respectively (Figure 2). Conditional forward branches that contain other conditional forward branches may be classified as follows (Figure 3):

1. one or more nested conditional forward branches totally contained in the outer hammock are called pure complex hammocks (a)
2. one or more conditional forward hammocks whose target address coincides with the join address of the outer hammock are called multiple join complex hammocks (b)
3. one or more conditional forward branches whose target address is beyond the join address of the outer hammock are called multiple target complex hammocks (d)
4. one or more conditional forward branches whose target address lies in the body of the taken path are called overlap complex hammocks (c)

[Figure 2 diagrams: panels (a) and (b), each showing instructions 1 to 12 with arrows from branches to their targets.]

Figure 2. Example of simple hammocks

A hammock may not qualify for predication due to the occurrence of any of the following: backward branches, indirect branches, unconditional jumps that are not related to one or more conditional branches (then-path skip jump), subroutine calls or returns, and system calls.

Figures 2 and 3 present the five hammock classifications. In the diagrams, each number corresponds to an instruction and each arrow represents a branch to a given target, i.e., another instruction. The source of the arrow is the instruction that originates the branch (conditional or unconditional). The end of the arrow indicates the taken target instruction.

Figure 2 (a) presents the most basic hammock structure. It is a typical if-then, where a condition is evaluated in instruction 1 and, depending on the result, the instruction flow continues sequentially or is redirected to instruction 8. Note that instruction 8 is part of any flow path started in 1, taken or not, so it is called the join point. Because in this structure there are no nested branches and there is only one side (if-then), this category is called One-sided Simple.
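The four rules for complex hammocks can be sketched as a small classifier. This is only an illustration under stated assumptions, not the profiling algorithm used in this work: the names `Hammock`, `classify_nested` and `classify` are hypothetical, instructions are identified by their position in the code, a nested target counts as a multiple join when it equals either the taken target or the join of the outer hammock, and the category of the whole hammock is taken to be that of its most complex nested branch. Of the disqualifying conditions, only backward branches are checked.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Complex categories in increasing order of complexity.
COMPLEXITY = ["pure", "multiple joins", "multiple targets", "overlapped"]


@dataclass
class Hammock:
    target: int       # taken target of the outer conditional branch
    outer_join: int   # join of the outer if-then(-else), ignoring nested branches
    nested: List[Tuple[int, int]] = field(default_factory=list)  # (branch, target)


def classify_nested(h: Hammock, branch: int, target: int) -> Optional[str]:
    """Classify one nested conditional branch relative to the outer hammock."""
    if target <= branch:
        return None                  # backward branch: does not qualify
    if target > h.outer_join:
        return "multiple targets"    # lands beyond the outer join
    if target in (h.target, h.outer_join):
        return "multiple joins"      # shares a target or join with the outer branch
    if branch < h.target < target:
        return "overlapped"          # jumps from the then path into the else path
    return "pure"                    # totally contained in one side


def classify(h: Hammock) -> Optional[str]:
    """Return the hammock category, or None if the hammock does not qualify."""
    if not h.nested:
        return "simple"
    kinds = [classify_nested(h, b, t) for b, t in h.nested]
    if None in kinds:
        return None
    return max(kinds, key=COMPLEXITY.index)


# Outer if-then-else: branch 1 jumps to 8, the outer join is instruction 10.
print(classify(Hammock(8, 10)))             # simple
print(classify(Hammock(8, 10, [(3, 5)])))   # pure
print(classify(Hammock(8, 10, [(3, 8)])))   # multiple joins
print(classify(Hammock(8, 10, [(3, 12)])))  # multiple targets
print(classify(Hammock(8, 10, [(3, 9)])))   # overlapped (target 9 is a made-up example)
```

For a one-sided hammock, `target` and `outer_join` are the same instruction, so the overlapped case cannot occur.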

[Figure 3 diagrams: panels (a), (b), (c) and (d), each showing instructions 1 to 12 with arrows from branches to their targets.]

Figure 3. Example of complex hammocks

Figure 2 (b) shows a two-sided hammock. The example corresponds to an if-then-else, which has a condition evaluation in instruction 1 and an unconditional branch in a later instruction, represented by instruction 7. This unconditional branch is responsible for the flow redirection required by the else clause. Note that the unconditional branch is the instruction right before the target of 1, i.e., instruction 8. In this example, the join point is instruction 10 and the category is called Two-sided Simple. For a branch to fall into this category, it must have the unconditional jump right before the target of the first conditional branch (branch delay slots were not considered). If this condition does not hold, the branch does not qualify for this category.

If a hammock has any nested hammocks, it is called Complex. The classification for complex branches presented here extends the terminology and analysis presented in [KLA98], addressing in more detail the peculiarities of complex hammocks.

An example of a Complex pattern is shown in Figure 3 (a). In this case, there are nested hammocks within the outer hammock. The outer hammock is an if-then-else similar to the one presented earlier, where instruction 1 jumps to 8 if the condition evaluated is true. Furthermore, there is an unconditional jump in instruction 7, jumping to the join point, i.e., instruction 10. Inside this hammock, instruction 3 is a simple if-then, like the one in the first example of this section. In this example, the target of the nested branch, instruction 5, is totally contained within the outermost hammock. When all nested branches have their targets totally contained within the boundaries of the outermost branch, that branch is called Complex Pure.

Figure 3 (b) presents an if-then-else hammock with multiple join points.
This means that one of the sides of the outermost branch (then or else) has a nested branch whose target is the same as that of the outermost branch. In the example, instruction 8 is the target of two conditional branches (instructions 1 and 3). Since the outermost branch has the same target as the innermost branch, it is called Complex with multiple joins. The join point of this hammock is instruction 10, as this is the first instruction common to any path starting in 1. Observe, though, that branch 3 would have a join point at 8 if it were not nested in 1. When classifying complex hammocks, the join point is considered to be the first instruction common to all paths starting from the outer hammock.

In other cases, nested branches may not have targets that coincide with the target of the external branch. In these cases, the target may be inside the else path while the branch is within the then path, or it may be beyond the join point of the external branch (Figures 3 (c) and (d)). When the target of a nested branch located within the then path is actually in the else path of the outermost branch, as in example (c), the two branches overlap and the category in which they are included is called Complex overlapped. The join point is still the first instruction common to all paths, i.e., instruction 10. Example (d) shows the nested branch 3, which has a target, instruction 12, beyond the join point of the outermost branch 1. In this case, the join point is instruction 12, as it is the first instruction common to any path that starts in 1. This type of Complex branch is called Complex with multiple targets.

DCE is designed to handle any possible combination of the categories of simple and complex hammocks described above. The complexity level is given by the highest number of paths or instructions required to execute all paths according to the DCE architecture and is as follows, in increasing order of complexity: pure, multiple joins, multiple targets and overlapped.

IV. SIMULATION ENVIRONMENT

The simulator used in the experiments was developed using the SimpleScalar tool set [BUR96], which includes a very detailed cycle simulator. The simulator implements out-of-order issue using a Register Update Unit (RUU) [SOH90] and was extended with code provided by Klauser et al. [KLA98] for hammock profiling. The original hammock profiling algorithm was extended to handle complex hammocks and their variations.

The main goal of this work is to understand the behavior and patterns of short conditional branches in order to allow a better design of the DCE architecture. Although the performance of the architecture is not the goal of this work, the accuracy of the branch predictor employed in the experiments is relevant to the study of the results presented. A hybrid branch predictor was used, combining bimodal and 2-level adaptive branch predictors [MCF93]. The configuration adopted was the following: a meta-table with 2048 entries; a BTB with 2048 entries and a 2-bit saturating counter (bimodal portion); a single entry in the first level (gshare xor) and 8192 entries in the second level with 13 history bits (2-level adaptive predictor).

Each benchmark was compiled with three different optimization levels (-O1, -O2 and -O3) using gcc version 2.6.3. This methodology was used in order to observe the impact of each optimization level on the number of short branches. In addition, the option -funroll-loops was used at all three optimization levels. The benchmarks used in the simulations were ammp, equake, gcc, gzip, parser, vpr place, vpr route and vortex, all from the SPEC2000 suite. For each configuration, 2 billion instructions were simulated, except for the benchmarks vpr place and vpr route, which were simulated to completion.
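To make the predictor configuration concrete, the following is a minimal sketch of a combining predictor in the style of [MCF93], using the table sizes quoted above. It is not the SimpleScalar implementation: the class and function names are hypothetical, the bimodal table size of 2048 entries, the 4-byte instruction alignment used for indexing and the initial counter values are assumptions, and the BTB is not modeled.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class HybridPredictor:
    def __init__(self, meta_size=2048, bimodal_size=2048,
                 l2_size=8192, history_bits=13):
        # All counters start weakly taken (2); this initial state is an assumption.
        self.meta = [2] * meta_size        # >= 2 selects the 2-level component
        self.bimodal = [2] * bimodal_size
        self.l2 = [2] * l2_size
        self.history = 0                   # single first-level (global) history entry
        self.history_mask = (1 << history_bits) - 1

    def _indexes(self, pc):
        word = pc >> 2                     # assumes 4-byte instructions
        gshare = (word ^ self.history) % len(self.l2)
        return word % len(self.meta), word % len(self.bimodal), gshare

    def predict(self, pc):
        m, b, g = self._indexes(pc)
        if self.meta[m] >= 2:
            return self.l2[g] >= 2
        return self.bimodal[b] >= 2

    def update(self, pc, taken):
        m, b, g = self._indexes(pc)
        bimodal_ok = (self.bimodal[b] >= 2) == taken
        two_level_ok = (self.l2[g] >= 2) == taken
        if bimodal_ok != two_level_ok:     # train the chooser only on disagreement
            self.meta[m] = bump(self.meta[m], two_level_ok)
        self.bimodal[b] = bump(self.bimodal[b], taken)
        self.l2[g] = bump(self.l2[g], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = HybridPredictor()
for _ in range(4):
    predictor.update(0x400, False)   # a branch that is never taken
print(predictor.predict(0x400))      # False
```

The meta-table is trained only when the two components disagree, so it drifts toward whichever component has been right more often for a given branch.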

V. RESULTS

In this section, the results obtained through simulation are presented. The dynamic analysis is discussed and conditional branches are classified according to the taxonomy presented previously.

In DCE, any branch that falls into one of the categories presented can be predicated. Non-qualifying branches need to be predicted. The architecture of DCE was designed and optimized to handle any branch that qualifies according to the rules discussed earlier. Those are the branches that can reduce the number of mispredictions, as they do not need to be predicted.

Both the compiler (parser) and the simulator were modified to instrument the code. The parser provides information on whether a branch qualifies for predication according to the DCE methodology. This selection is static. During simulation, branches are monitored and counted according to their category. Mispredictions are then counted based on the category of the branch that produces the misprediction.

In the simulations performed in this section, a conventional superscalar simulator was used. No predication was actually applied in these simulations, as predication would change the ratio of branches predicted, thus changing the results and misprediction rates.

A. Total number of branches

Table I presents the number of branches executed in each category per benchmark, as defined by the static analysis of the source code. The first column indicates the optimization option and the number of sides (one or two). The second column defines the categories, while the remaining columns show the percentage of branches in each category, based on the number of dynamic branches counted in the simulations.

There is almost no difference in the average of qualified branches for the different optimization levels. However, the distribution of branch categories changes from one optimization level to another, due to the way that the compiler reorganizes the code.
The level of complexity tends to increase as the optimization level is increased. Most qualified branches are classified in either the Simple or Pure Complex categories, and most of them are also single-sided branches (if-then constructions). One-sided branches produce less overhead than two-sided branches during predication, as only one extra path is added to the execution in order to guarantee that no misprediction will happen. When a one-sided branch is predicted, the then path is executed only when the condition evaluated is true. However, the not-taken path is always executed regardless of the condition, as it is control independent. Therefore, when a one-sided branch is predicated, only one extra path (the then path) is added to the execution. When two-sided branches are predicated, both the not-taken path (else) and the taken path (then) need to be executed, and the instructions at the join point also need to be executed. In this case, there is more overhead to guarantee the same effect.

Table I: Classified hammocks (percentage of dynamic branches)

Sides/Opt     Hammocks        Ammp  Equake     Gcc    Gzip  Parser  Vortex   Place   Route
One side O1   Simple          8.90    9.91    9.29   15.62    6.89    3.79   33.17   13.95
              Compl           1.07    5.62    9.27    2.74    5.87   16.40   11.04    5.89
              Mult Join       0.19    1.58    0.54    8.45    0.03    0.14    0.00    0.02
              Mult Targ       0.67    4.29    3.95    0.08    1.68    1.12    0.13    1.15
              Mult Overl      0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00
Two sides O1  Simple         40.79    2.07    0.38    8.55    1.81    0.64    4.77    0.51
              Compl           0.44   10.78    0.85    4.34    0.29    0.93   14.21    0.45
              Mult Join       3.23    3.56    1.38    7.40    0.34    0.29    0.00    0.02
              Mult Targ       0.30    2.08    0.60    0.04    1.66    0.01    0.05    0.40
              Mult Overl      0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00
Total O1                     55.59   39.89   26.28   47.22   18.57   23.32   63.37   22.39
One side O2   Simple          8.90    9.90    8.07   15.47    7.69    3.72   33.18   14.00
              Compl           1.07    5.61    9.55    2.73    5.94   16.37   11.04    5.90
              Mult Join       1.53    1.58    0.56    8.36    0.03    0.14    0.00    0.02
              Mult Targ       0.67    4.29    3.92    0.08    1.68    0.67    0.13    1.15
              Mult Overl      0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00
Two sides O2  Simple         40.79    2.07    0.37    8.48    1.06    0.64    4.77    0.48
              Compl           0.44    1.11    0.88    4.29    0.29    0.94   14.21    0.45
              Mult Join       1.90    3.55    1.40    7.32    0.35    0.29    0.00    0.02
              Mult Targ       0.30   11.73    0.63    0.04    0.84    0.01    0.05    0.40
              Mult Overl      0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00
Total O2                     55.60   39.84   25.40   46.77   17.88   22.78   63.38   22.42
One side O3   Simple          8.92    9.68    7.55   13.64    7.22    3.72   32.74   13.98
              Compl           1.06    7.75   10.46    2.45    6.04   16.39   11.29    5.93
              Mult Join       1.53    1.55    0.58    7.37    0.03    0.50    0.00    0.02
              Mult Targ       0.67    4.19    3.90    0.12    1.50    9.41    0.13    1.15
              Mult Overl      0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00
Two sides O3  Simple         40.74    2.03    0.36    7.48    1.74    0.77    4.71    0.48
              Compl           0.31    1.08    0.87    3.80    0.85    0.95   14.03    0.45
              Mult Join       1.89    3.47    1.40    6.45    0.35    1.73    0.00    0.02
              Mult Targ       0.30   11.46    0.63    0.03    0.75    8.69    0.05    0.40
              Mult Overl      0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Total O3                     55.42   41.21   25.76   41.34   18.48   42.16   62.95   22.43

Figure 4 shows the averages over all benchmarks of both qualified and non-qualified branches, regardless of the number of sides. The horizontal axis depicts the optimization level, while the vertical axis presents the percentage of branches in each group. The average of qualified hammocks exceeds 40% of all conditional branches for the -O2 optimization level. This means that with -O2 the set of qualified branches is larger and more branches are going to be selected for predication. DCE will then dynamically decide whether the predication is really going to take place, based on the dynamic availability of resources. Although -O2 produces more qualifying branches, the difference is not large, which does not disqualify the other optimization levels for the application of DCE.

[Figure 4 plot: "Total Number of Branches (Average)". Vertical axis: % of branches (0 to 60). Horizontal axis: O1, O2, O3. Series: Classified, Unclassified.]

Figure 4. Qualified and non-qualified branches

[Figure 5 plot: "Average Hammock Size". Vertical axis: Instructions (0 to 15). Horizontal axis: benchmarks (ammp, equake, gcc, gzip, parser, vortex, vpr_place, vpr_route). Series: O1, O2, O3.]

Figure 5. Average size of qualified hammocks

B. Average size of hammocks

Another important aspect for DCE is the average length of each branch. This size can also be defined as the number of static instructions between the branch and its target. In DCE, the shorter the branch, the less expensive the predication. A summary of the average hammock length is shown in Table II. The first column presents the number of sides and the optimization options. The hammock category is indicated in the second column, and the remaining columns show the average distance in instructions from a branch to its target. For two-sided hammocks, the size is defined as the average between the then and the else sides. Thus, the average number of instructions for a two-sided hammock is twice the value reported in Table II. Figure 5 presents the average size of all qualified branches for the optimization options. The number of instructions appears on the vertical axis, while the benchmarks are shown on the horizontal axis. The average size of qualified hammocks is relatively small, ranging from 5 to 12 instructions per side.
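As a quick check of how the Table II values are read, the sketch below recomputes the Average entry for one column and converts the per-side size of a two-sided hammock into a total. The numbers are the Ammp entries of Table II for O1; the helper names are illustrative.

```python
def category_average(sizes):
    """Mean size over the five hammock categories of one table column."""
    return sum(sizes) / len(sizes)


def two_sided_total(per_side):
    """Per-side average of a two-sided hammock -> instructions on both sides."""
    return 2 * per_side


ammp_one_side_o1 = [1.34, 6.54, 2.33, 2.42, 0.00]
print(round(category_average(ammp_one_side_o1), 2))  # 2.53
print(round(two_sided_total(2.52), 2))               # 5.04
```

The first value matches the Average entry reported for Ammp, one side, O1; the second is the total size implied for the two-sided Simple hammocks of Ammp at O1.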

Table II: Average size of valid hammocks (instructions)

Sides/Opt     Hammocks        Ammp  Equake     Gcc    Gzip  Parser  Vortex   Place   Route
One side O1   Simple          1.34    1.69    2.97    4.09    1.15    2.65    1.27    1.08
              Compl           6.54   11.42   13.74    9.09    9.11    9.22    7.78    7.30
              Mult Join       2.33    2.33    7.02    2.00    3.36    2.23    7.29   12.04
              Mult Targ       2.42    2.47   11.24   20.90    1.66    2.13   20.82   19.15
              Mult Overl      0.00    0.00    7.83    0.00    0.00    0.00    2.50    2.50
              Average         2.53    3.58    8.56    7.22    3.06    3.25    7.93    8.41
Two sides O1  Simple          2.52    5.13    3.08    2.04    1.84    2.91    9.29    3.78
              Compl           6.72    9.79    9.25   30.89    5.05    7.87    7.45   13.17
              Mult Join       4.14    4.93    9.49   20.86    3.93    8.12    7.94   13.05
              Mult Targ       4.00    4.00   18.14    8.43    4.50   53.11   91.21   86.88
              Mult Overl      0.00    6.50   13.01    0.00    0.00    3.67    6.50    6.50
              Average         3.48    6.07   10.59   12.44    3.06   15.14   24.48   24.68
One side O2   Simple          1.35    1.66    2.95    4.09    1.14    2.69    1.27    1.08
              Compl           6.54   10.69   12.23    9.07    8.93    9.14    7.30    7.20
              Mult Join       5.30    2.33    6.70    2.00    3.36    2.20    6.49   11.34
              Mult Targ       4.00    2.47   10.98   20.91    1.66    1.98   20.82   18.99
              Mult Overl      0.00    0.00    3.49    0.00    0.00    0.00    2.50    2.50
              Average         2.99    3.43    7.27    7.21    3.02    3.20    7.68    8.22
Two sides O2  Simple          1.54    5.13    2.87    2.03    2.06    2.96    7.21    3.75
              Compl           6.54   10.69   12.23    9.07    5.05    6.90    7.30    7.20
              Mult Join       5.30    4.93    9.19   20.36    3.92    6.88    7.57   12.12
              Mult Targ       4.00    8.04   17.77    8.47    5.46   52.88   91.21   86.57
              Mult Overl      0.00    6.50    6.28    0.00    0.00    3.67    6.50    6.50
              Average         3.48    7.06    9.67    7.99    3.30   14.66   23.96   23.23
One side O3   Simple          1.40    1.66    2.94    4.09    1.23    3.72    1.27    1.08
              Compl           6.54   16.65   15.43    9.29    8.92   16.39    8.43    7.44
              Mult Join       4.66    2.33   11.14    2.00    3.47    0.50    6.50   11.36
              Mult Targ       2.42    2.48   12.71   19.07    1.66    9.41   20.82   18.99
              Mult Overl      0.00    0.00    2.33    0.00    0.00    0.00    2.50    2.50
              Average         3.00    4.62    8.91    6.89    3.06    6.00    7.90    8.27
Two sides O3  Simple          1.54    5.13    2.79    2.03    1.82    3.29    7.21    3.75
              Compl           7.48    3.59   10.06   30.94   17.61    7.09    6.82   12.75
              Mult Join       5.30    4.93    9.99   32.85    4.25    6.57    7.58   12.14
              Mult Targ       4.00    8.04   23.87    8.47    5.45   12.55   91.21   86.57
              Mult Overl      0.00    6.50    4.16    0.00    0.00    3.67    6.50    6.50
              Average         3.66    5.64   10.17   14.86    5.82    6.63   23.86   24.34

Averages over all benchmarks: One side O1 5.57; Two sides O1 12.49; One side O2 5.38; Two sides O2 11.67; One side O3 6.08; Two sides O3 11.87.

C. Branch mispredictions

Another important issue for DCE is whether the optimization options affect the misprediction rates. Benchmarks with large misprediction rates and short hammocks tend to benefit more from DCE predication. Figure 6 shows the mispredictions for the three optimization levels. For the benchmarks ammp, gcc and parser, most of the mispredictions occur in non-qualified branches, reducing the performance improvements that may be obtained with DCE. On the other hand, equake, gzip, vortex, vpr place and vpr route presented a significant number of mispredictions in branches that may be predicated by DCE, thus showing a significant potential for performance improvement.

[Figure 6 plots: three panels, "Misprediction distributions" for O1, O2 and O3. Vertical axis: % (0 to 100). Horizontal axis: benchmarks (ammp, equake, gcc, gzip, parser, vortex, vpr_place, vpr_route). Series: Qualified, Non-qualified.]

Figure 6. Branch misprediction for different optimizations

An interesting aspect is that even when using -O3, which turns on function inlining, the number of qualified branches remains almost the same. As most non-qualified branches are caused by function calls, it was expected that function inlining would provide more branches suitable for predication. However, simulations showed that the code inserted by inlining functions contains invalid, complex structures that cannot be exploited by DCE. This phenomenon will be studied in future research.

Figure 7 shows the average branch mispredictions for all benchmarks, grouped by optimization level. As mentioned before, the optimization levels do not significantly impact misprediction rates for the studied benchmarks. The fraction of mispredictions that could be avoided by DCE predicating all qualified hammocks, without considering resource limits, is about 32%. However, it is important to consider the size of a hammock when deciding whether to pursue predication, since the overhead of extra instructions may hide the benefits of avoiding the mispredictions.

[Figure 7 plot: "Misprediction rate (average)". Vertical axis: % of branches (0 to 60). Horizontal axis: O1, O2, O3. Series: Classified, Unclassified.]

Figure 7. Average branch misprediction

VI. CONCLUSIONS

This work studied the structure of branches in terms of hammocks for a set of benchmarks from SPEC CPU 2000 to identify patterns that are generated by the compiler and qualify for eager execution in DCE. Results have shown that 35% of all conditional branches can be predicated, and that 32% of all branch mispredictions occur in these branches. The chance to work around this large percentage of all mispredictions can represent a great improvement in overall performance.

Most of the non-qualified hammocks were caused by function calls, and even inlining these functions did not increase the number of selected hammocks. Thus, special code generation for these cases is interesting future work.

Although the study of DCE performance is not the main goal of this work, Figure 8 presents the DCE speedup, comparing the predication of complex branches against simple branches. The three SPEC95 benchmarks with the highest misprediction rates were used in these simulations. The IPC of DCE was compared against a very aggressive reference machine. The details of DCE are not discussed here due to space limitations but can be found in [SAN03].

The results show that the predication of complex branches, in particular Complex Pure, can provide as much as a 50% increase in speedup for very wide architectures. While simple branches provide constant speedups across the different configurations used, the potential for exploiting complex branches exists but is limited mainly by resource contention, as a large overhead is generated. The overhead introduced by multiple instructions from different paths decreases performance, but alternatives, such as instruction reuse, are being studied. In a preliminary study [SAT03], it was found that almost 30% of the instructions executed in DCE can be reused and that traces of reusable instructions are in some cases twice as big as traces of reusable instructions when only prediction is applied.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the support from the CNPq/CAPES Brazilian agencies for R&D and from UNISC in the form of scholarships and grants. The authors also thank Arthur Klauser for providing the simple hammock parser, which was the basis for the development of the complex parser and DCE.

[Figure 8 plots: "CIDI Speedup (Local vs. Local)", Distance 64. Four panels comparing Simple with Complex Pure, Complex Mult Joins, Complex Mult Targets and Complex Overlapped. Vertical axis: Speedup (%), from -2 to 14. Horizontal axis: Maps x Width, from m16w16 to m1024w64. (*) Only Go, Gcc, Ijpeg: highest misprediction rates.]

Figure 8: Speedup

REFERENCES

[BUR96] BURGER, D.; AUSTIN, T.; BENNETT, S. Evaluating future microprocessors: The Simplescalar toolset. Technical Report CS-TR-96-1308, University of Wisconsin, CS Department, July 1996.

[HEI96] HEIL, T.; SMITH, J. Selective Dual Path Execution. Technical Report, University of Wisconsin, CS Department, 1996.

[KLA98] KLAUSER, A.; AUSTIN, T.; GRUNWALD, D.; CALDER, B. Dynamic Hammock Predication for Nonpredicated Instruction Set Architectures. Proc. of the PACT'98. pp. 278-285. Paris, October 1998.

[MCF93] McFARLING, S. Combining Branch Predictors. WRL TN-36, Digital Western Research Lab. June 1993.

[PIL02] PILLA, M.; SANTOS, R.; SANTOS, T. Analyzing the Behavior Pattern and the Structures of Conditional Branches. Technical Report, PPGC/UFRGS, March 2002.

[SAN98] SANTOS, R.; NAVAUX, P. Analyzing a Multistreamed Superscalar Speculative Instruction Fetch Mechanism. Proc. of the EURO-PAR'98. Southampton, October 1998.

[SAN01] SANTOS, R.; NAVAUX, P.; NEMIROVSKY, M. DCE: The Dynamic Conditional Execution Approach. Work in Progress Session of 7th HPCA. Monterrey, Mexico, 2001.

[SAN03] SANTOS, R. DCE: The Dynamic Conditional Execution in a Multipath Control Independent Architecture. PPGC/UFRGS, 2003. PhD Thesis.

[SAT03] SANTOS, T.; SANTOS, R.; NAVAUX, P.; NEMIROVSKY, M.; BAMPI, S. Analyzing the Limits of Trace Reuse in a Dynamic Conditional Execution Architecture. Proc. of the Workshop on Traces at the 17th ICS. San Francisco, 2003.

[SKA99] SKADRON, K. Characterizing and Removing Branch Mispredictions. Princeton University, 1999. PhD Thesis.

[SOH90] SOHI, G. Instruction Issue Logic for High Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Transactions on Computers, Los Alamitos, v. 39, n. 3, p. 349-359, March 1990.

[TYS97] TYSON, G.; LICK, K.; FARRENS, M. Limited Dual Path Execution. Technical Report CSE-TR 346-97, University of Michigan.

[UHT95] UHT, A.; SINDAGI, V. Disjoint Eager Execution: An Optimal Form for Speculative Execution. Proc. of the 28th Micro. pp. 313-325, 1995.

[CHA99] CHAVES FILHO, E.; SANTOS, F.; SANTOS, A.; NAVAUX, P.; SANTOS, R. MULFLUX: A Microarchitecture with Multiple Flows of Control. Proc. of Int. Evaluation Protem-CC. Brasilia, 1999. p. 146-176.

[ROT99] ROTENBERG, E.; JACOBSON, Q.; SMITH, J. A Study of Control Independence in Superscalar Processors. Proc. of the 5th International Conference on High Performance Computer Architectures. January 1999.