Evaluating the Effects of Branch Prediction Accuracy on the Performance of SMT Architectures

Ronaldo Gonçalves 1, Maurício Pilla 2, Guilherme Pizzol 3, Tatiana Santos 4, Rafael Santos 5, Philippe Navaux 6

1 Departamento de Informática, Universidade Estadual de Maringá, Avenida Colombo 5790, CEP 87020-900, Maringá, PR, Brazil, {[email protected]}
2,3,4,6 Instituto de Informática, Universidade Federal do Rio Grande do Sul, Avenida Bento Gonçalves 9500, CEP 91501-970, Porto Alegre, RS, Brazil, {pilla, gpizzol, tatiana, [email protected]}
5 Departamento de Informática, Universidade de Santa Cruz do Sul, Av. Independência 2293, CEP 96815-900, Santa Cruz do Sul, RS, Brazil, {[email protected]}

Abstract

The occurrence of branch instructions reduces the parallelism that can be exploited from the source code of single-threaded applications. In order to reduce the branch penalty, several branch prediction techniques have been proposed. Branch predictors allow the fetch unit to continue fetching instructions along a predicted path after a conditional branch has been detected. Such techniques, when used in conventional superscalar architectures, can reach more than 95% accuracy. These same techniques are also used in SMT architectures. However, SMT architectures may behave differently due to the exploitation of parallelism across several threads; moreover, the effects experienced by one thread may also influence the performance of other threads. In this work, we vary the accuracy of the branch predictor in order to evaluate the impact on the performance of an SMT architecture. Even though SMT and superscalar architectures behave differently, we observed that the effect of improving the prediction accuracy is similar for both.

1 PhD student supported by CAPES. 2,5 PhD students supported by CNPq. 3 Undergraduate student supported by CNPq. 4 PhD student. 6 PhD advisor at the II/UFRGS.

1. Introduction

With the increase in the complexity of applications, much research has been done on designing faster microprocessors. The architectures currently being designed use aggressive hardware techniques to

exploit efficiently both instruction level parallelism (ILP) and application level parallelism (ALP). In this context, two approaches can be emphasized: the superscalar [22] and the SMT [23] architectures. Superscalar architectures are used in many state-of-the-art microprocessors, such as the Pentium [2], PowerPC [4], MIPS R10000 [16], and UltraSparc [26]. These architectures have several functional units and can exploit instruction level parallelism directly in hardware. SMT (Simultaneous MultiThreading) architectures have a larger amount of hardware in order to store different execution contexts, so they may also exploit application level parallelism by executing instructions from different threads. Unfortunately, this parallelism is limited for many reasons, such as control dependencies [12]. Branches disrupt the instruction flow, reducing the chances of parallelization. Nevertheless, many successful branch prediction techniques were developed and used in superscalar architectures, providing high accuracy rates; some sophisticated techniques can reach 98% accuracy. Although these accuracy rates are very high, even a few mispredicted branches make the processor execute instructions along wrong paths, reducing performance. Moreover, performance is affected for three further reasons. The first is pollution of the L1 I-cache: when a fetch along a wrong path is performed, useful instructions may be replaced in the cache and cache misses may increase. The second concerns the memory bus: while a wrong-path miss is being serviced, occupying the bus, a miss generated by the correct flow may be left waiting. Finally, the pipeline has to be flushed on each misprediction, so processing already done becomes useless.

Currently, these same techniques are used in SMT architectures, but there is no study evaluating whether they achieve the same effectiveness and whether mispredictions cause the same impact as in a superscalar architecture. Besides, SMT architectures may behave differently because the performance reached by one thread may influence the performance of other threads. Thus, this work addresses an analysis of branch prediction utilization in these architectures. We also evaluate the effect of different branch predictor accuracies on the performance of a superscalar and an SMT architecture, and we want to find out whether the behavior of SMT architectures is similar to that of superscalar architectures.

This work is organized as follows. Section 2 presents a review of previous work, approaching questions related to branch prediction and SMT architectures. Section 3 introduces the simulation environment, showing the configurations, benchmarks, and methodology used. Next, Section 4 explains the simulations and analyzes the obtained results. Finally, Section 5 presents conclusions and future work.

2. Related works

The performance of modern processors is limited by two well-known instruction dependencies [10, 22]: data and control dependencies. Many techniques have been proposed to overcome this limit. Data dependencies can be reduced by data prediction techniques [13], register renaming [16], and dynamic scheduling. Control dependencies can be reduced by branch prediction techniques, multi-path execution [1, 17, 19], and trace cache mechanisms [18].

Regarding branch prediction, several techniques have been proposed, achieving different levels of accuracy. Smith [21] analyzed several branch prediction techniques and found average accuracy rates between 76.68% and 92.55%. Later, Yeh and Patt [28] proposed the adaptive two-level branch predictor, in which the behavior of correlated branches can be captured and stored in two levels of tables. They reported 97% accuracy on SPEC benchmarks, against 93% for other mechanisms. McFarling [15] developed a branch prediction mechanism that combines other mechanisms, achieving accuracy rates of up to 98.1%. Young et al. [29] analyzed schemes for correlated branch prediction. Uht et al. [25] presented an overview of the branch prediction mechanisms used in some state-of-the-art microprocessors. Kessler [11] presented the complex prediction mechanism used in the Alpha 21264, an example of how complex the branch prediction mechanisms implemented in modern microprocessors have become.

Because they exploit instruction level parallelism from single-threaded applications, superscalar architectures require efficient mechanisms to overcome branch penalties. Research in this area has been trying to reach perfect prediction (100% accuracy), because a misprediction rate of 5% or less is enough to harm the performance of superscalar architectures. As SMT architectures can also extract parallelism from different threads, it is necessary to evaluate the impact of these branch predictors on the performance of such architectures.

SMT architectures [23] have been proposed to execute instructions from many threads simultaneously [9, 27, 23, 5]. These architectures are able to hide high latencies in some threads by scheduling instructions from other threads. Moreover, SMT architectures take advantage of idle resources, e.g., functional units. However, the cache system suffers from memory addressing conflicts among threads, generated by the compiler [14]. Consequently, many works on SMT have addressed this question. Tullsen [24] showed that instruction fetch must favor threads with fewer instructions in the pipeline. Hily [8] showed that caches require high associativity to sustain many threads. Gonçalves [6] showed analytically that the cache miss rate can be reduced if a prefetch scheme directed by process scheduling is used. Also in 1999, Sigmund [20] concluded that the cache replacement policy is a fundamental issue when memory bandwidth is restricted.
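As an illustration of the two-level idea in [28], a minimal global-history predictor can be sketched as follows. This is our simplified textbook-style sketch, not the simulators' code: a global history register indexes a pattern history table of 2-bit saturating counters.

```python
class TwoLevelPredictor:
    """Minimal two-level adaptive predictor (global history flavor)."""

    def __init__(self, history_bits=4):
        self.history = 0                      # global branch history register
        self.mask = (1 << history_bits) - 1
        self.pht = [1] * (1 << history_bits)  # 2-bit saturating counters

    def predict(self):
        # Predict taken when the counter for the current history is >= 2.
        return self.pht[self.history] >= 2

    def update(self, taken):
        # Saturating counter update, then shift the outcome into the history.
        ctr = self.pht[self.history]
        self.pht[self.history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

# A strongly biased branch is learned after a few outcomes:
p = TwoLevelPredictor()
for _ in range(20):
    p.update(True)
print(p.predict())  # True
```

Combined predictors such as McFarling's [15] pair a scheme like this with a simpler per-branch predictor and a chooser table that tracks which component has been more accurate.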

3. Simulation environment

The simulation environment used in this paper consists of two different simulators, both developed from the most detailed simulator in the SimpleScalar Tool Set [3], called sim-outorder. This simulator supports out-of-order issue and execution, based on the Register Update Unit (RUU). It also supports non-blocking caches and four types of branch predictors: two-level, perfect, bimodal, and combined (bimodal and two-level adaptive). Its pipeline is composed of six stages: fetch, dispatch, issue, execute, writeback, and commit, as shown in Figure 1.

Our first simulator is an extended sim-outorder that includes a mechanism allowing the user to set the branch prediction accuracy. When a branch instruction is reached, the mechanism draws a random real number. If this number is smaller than the predetermined accuracy, the predictor forces a correct prediction: if the predicted target or direction is wrong, the predictor adjusts them to the right values; otherwise it keeps the original prediction. On the other hand, if the random number is greater than the desired accuracy, the predictor forces a misprediction: if the target was correct, the predictor redirects the fetch to a wrong address, adding the corresponding penalty; otherwise it keeps the wrong prediction. It is important to note that this mechanism does not simulate wrong target addresses; the simulated misprediction concerns only the direction of the branch, taken or not taken. Whenever the branch is

predicted correctly as taken, the correct branch target address is used.
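The forced-accuracy mechanism just described can be sketched as follows. This is a minimal illustration, not the simulator's source; it assumes the true outcome of each branch is available from the perfect-prediction oracle that the simulator is based on.

```python
import random

def forced_prediction(actual_taken, target_accuracy, rng=random):
    """Direction the fetch unit follows for one branch.

    actual_taken    -- true branch outcome, supplied by the perfect-
                       prediction baseline of the simulator
    target_accuracy -- desired predictor accuracy in [0.0, 1.0]
    """
    if rng.random() < target_accuracy:
        return actual_taken        # force a correct direction prediction
    return not actual_taken        # force a direction misprediction

# Over many branches, the measured accuracy converges to the target:
rng = random.Random(0)
correct = sum(forced_prediction(True, 0.85, rng) is True
              for _ in range(100_000))
print(correct / 100_000)  # close to 0.85
```

Note that, as in the paper's mechanism, only the taken/not-taken direction is randomized; a forced misprediction stands in for both wrong-direction and wrong-target cases by charging the usual redirect penalty.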

Figure 1: Superscalar architecture (i-cache/i-TLB, fetch, i-queue, decode, RUU queue and LS queue, issue, functional units, d-TLB/d-cache, writeback, commit)

Using the branch predictor described above, our second simulator [7] implements an SMT architecture, part of the SEMPRE project [5]. This simulator was developed by replicating all resources of the sim-outorder simulator in order to maintain and execute many contexts, as shown in Figure 2. Note that in this architecture each thread is an independent application, so dependencies among threads are not considered. Also, all applications were executed under the same branch prediction accuracy, which does not reflect the impact of multiple threads on the prediction accuracy; that effect is outside the scope of this work.

Figure 2: SMT architecture (i-cache system, smt-fetch, i-queue system, smt-decode, smt-issue, shared functional units, shared or distributed RUU system, data cache system, smt-execution, smt-writeback, smt-commit, register frames)

The resource set containing the register file, tables, and queues used to maintain the context of one thread is called a slot. The fetch stage fetches up to one instruction block per cycle (or up to the first taken branch), composed of instructions from only one thread at a time, selected in round-robin fashion. The remaining stages schedule instructions from different slots in a round-robin sequence limited by the architecture width.

Eight benchmarks from SPEC 95 were simulated, mixing integer (gcc, ijpeg, li, and perl) and floating-point applications (fpppp, mgrid, swim, and wave5). We simulated four different combinations of four benchmarks each (two integer and two floating-point) in order to compute the average performance for the SMT with 4 threads (SMT-4), and one combination for the SMT with 8 threads (SMT-8). Each simulation runs until one of the benchmarks in its workload completes 250 million instructions, of which the initial 50 million are skipped to reduce the warm-up stage. Table 1 shows the reference latencies considered in our simulations. In our analysis we measured performance in IPC (instructions per cycle).

Table 1: Reference latencies

  Types of latencies                            Number of cycles
  L1 hit ; L2 hit ; TLB miss                    1 ; 6 ; 30
  L2 miss (for n+1 chunks)                      18 + n*2
  IA: int-alu functional unit                   1
  FA: fp-alu functional unit                    2
  LS: ld/st functional unit                     1
  IM: int-mult-div unit (Mult ; Div)            3 ; 20
  FM: fp-mult-div unit (Mult ; Div ; Sqrt)      4 ; 12 ; 24

Table 2: Hardware configurations

  Superscalar
    SMALL: w = 8,   l2-c = 512k, l1-c = 32k,  10 fus = 3 IA, 3 FA, 2 LS, 1 IM, 1 FM,       ruu = 32 e,  lsq = 16 e
    LARGE: w = 16,  l2-c = 1M,   l1-c = 64k,  17 fus = 5 IA, 5 FA, 3 LS, 2 IM, 2 FM,       ruu = 64 e,  lsq = 32 e
  SMT-4
    SMALL: w = 16,  l2-c = 1M,   l1-c = 64k,  17 fus = 5 IA, 5 FA, 3 LS, 2 IM, 2 FM,       ruu = 64 e,  lsq = 32 e
    LARGE: w = 64,  l2-c = 4M,   l1-c = 256k, 68 fus = 20 IA, 20 FA, 12 LS, 8 IM, 8 FM,    ruu = 256 e, lsq = 128 e
  SMT-8
    SMALL: w = 16,  l2-c = 1M,   l1-c = 64k,  17 fus = 5 IA, 5 FA, 3 LS, 2 IM, 2 FM,       ruu = 64 e,  lsq = 32 e
    LARGE: w = 128, l2-c = 8M,   l1-c = 512k, 136 fus = 40 IA, 40 FA, 24 LS, 16 IM, 16 FM, ruu = 512 e, lsq = 256 e
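The round-robin selection among slots described above can be sketched as follows. This is an illustrative model, not the simulator's code, and the slot names are hypothetical; it shows only the rotation order, one fetch decision per cycle.

```python
from itertools import cycle, islice

def fetch_schedule(slots, n_cycles):
    """Slot selected by the fetch stage on each of n_cycles cycles,
    rotating among the active slots in round-robin order."""
    return list(islice(cycle(slots), n_cycles))

# Four threads (SMT-4): each slot gets the fetch stage every 4th cycle.
print(fetch_schedule(["t0", "t1", "t2", "t3"], 6))
# ['t0', 't1', 't2', 't3', 't0', 't1']
```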

The experiments are divided into six groups, using as criteria the simulated architecture (superscalar or SMT) and the amount of allocated resources (small or large), as shown in Table 2. We also simulated two different workloads for the SMT architectures, with 4 and 8 threads, configuring the resources accordingly. The small hardware represents state-of-the-art configurations based on various existing microprocessors. The large hardware represents an increase in the available resources in order to foresee the performance of future configurations.

4. Evaluating the effects of branch prediction accuracy

Our interest is to analyze application behavior on SMT architectures, varying the branch prediction accuracy and comparing the results with those obtained on the superscalar architecture. Thus, our first simulations were made with the superscalar architecture: we varied the accuracy of the branch predictor and gathered the respective IPCs. Note that the predictor was specially designed to deliver the desired accuracy independently of the prediction scheme used. In fact, the simulator is based on perfect prediction, where the correct outcome is always available, and mispredictions are intentionally forced according to the required prediction accuracy. Figures 3 and 4 show the number of instructions per cycle (IPC), for each benchmark, as a function of the branch prediction accuracy (from 50% to 100%).

Figure 3: Performance of the small superscalar (IPC per benchmark vs. prediction accuracy)

Figure 4: Performance of the large superscalar (IPC per benchmark vs. prediction accuracy)

In both the small and the large architecture the benchmarks presented similar behavior, despite the better performance obtained by the larger architecture. The benchmarks mgrid and fpppp are not greatly affected by branch prediction improvements, presenting almost constant performance, although they do show a performance increase from the small to the large architecture even for the lowest prediction accuracies. The benchmarks ijpeg and li showed the most significant improvements as the prediction accuracy increased; they are more sensitive to branch prediction, so their performance is directly affected by its accuracy. We can also see that when the hardware is increased, mgrid and fpppp, which do not rely on branch prediction itself, take full advantage of the available resources, while ijpeg and li need a more efficient predictor to overcome the performance achieved by mgrid and fpppp. In the small superscalar architecture, ijpeg overtakes mgrid when the accuracy is about 75%, and li overtakes fpppp when the accuracy is close to 55%. In the large superscalar architecture these crossover points rise to 85% and 78%, respectively.

Figure 5 shows the average performance of both superscalar architectures. When the branch prediction accuracy is improved, the performance of the large architecture increases more intensively than that of the small one. This happens because a better predictor creates more opportunities to exploit instruction level parallelism, but this advantage is only realized when resources are available. On the other hand, when the hardware is enlarged there are more possibilities to exploit instruction level parallelism, so we also see performance improvements from the small to the large architecture at the same prediction accuracies, although this increase is more significant for higher prediction accuracies.

Figure 5: Average performance of the superscalar architectures (average IPC, small vs. large, vs. prediction accuracy)

Another important point in Figure 5 is that the small superscalar architecture with perfect accuracy achieves the same performance as the large one with accuracy near 75%. In those circumstances, the effort spent improving predictors could instead be directed to expanding the hardware, achieving the same results. The cost of each implementation should of course be considered in order to determine the best cost-benefit, but it is clear that performance is a combination of various aspects. For example, it is not possible to reach an IPC greater than 2 on the small architecture even with perfect branch prediction, whereas the large architecture can do so with an 85% accurate branch predictor. Furthermore, improving the accuracy from 95% to 100% increases global performance by about 7% and 9% for the small and large architectures, respectively, as shown in Figure 6. In that figure, each speed-up bar is the improvement of an architecture with a given predictor accuracy, in terms of average IPC, over the immediately previous accuracy simulated. For example, the speed-up of the large architecture with an 85% accurate branch predictor over the same architecture with an 80% accurate predictor is around 8%. No results are presented for accuracies below 50%.
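The speed-up metric used in Figures 6, 10, and 14 can be computed as follows. This is our sketch of the metric as defined above; the IPC values in the example are made up for illustration and are not the paper's measurements.

```python
def speedups(ipc_by_accuracy):
    """Speed-up of each accuracy point over the immediately previous
    (5% lower) accuracy, as a fractional improvement in average IPC."""
    accs = sorted(ipc_by_accuracy)
    return {a: ipc_by_accuracy[a] / ipc_by_accuracy[prev] - 1.0
            for prev, a in zip(accs, accs[1:])}

# Hypothetical average-IPC curve (not measured data):
ipc = {80: 2.00, 85: 2.16, 90: 2.30, 95: 2.40, 100: 2.45}
for acc, s in speedups(ipc).items():
    print(f"{acc}%: {s:+.1%}")   # e.g. the 85% bar is 2.16/2.00 - 1 = +8.0%
```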

We also analyzed the SMT-4 architecture. Figures 7 and 8 show the performance of each benchmark in the small and large SMT-4 architectures, respectively. In both architectures, the benchmarks ijpeg and li again reached the best performance. Also, on the large SMT-4, mgrid and fpppp kept the same behavior observed on the superscalar architecture, presenting flat performance throughout all configurations. Notice that on the small SMT-4, mgrid and fpppp were actually penalized by the branch prediction improvement. In the SMT architectures, the functional units are shared among all threads: even though threads (benchmarks) are scheduled one at a time, the pool of functional units is shared among them. Hence, when the predictor accuracy is increased there is a chance for more parallelism, which reduces the amount of available resources. Some benchmarks have more implicit parallelism, so the more accurate the predictor, the more instructions these benchmarks will produce to enter the functional units. As the predictor gets better, mispredictions are reduced and more instructions become available for dispatch; that is why the performance of one thread can interfere with the performance of other threads. On the SMT-8 there are enough resources to support all threads, even when more parallelism becomes available as the predictor accuracy increases.

Figure 6: Speed-up over the prediction accuracy improvements (superscalar, small vs. large)

Figure 7: Performance of the small SMT-4 (IPC per benchmark vs. prediction accuracy)

Figure 8: Performance of the large SMT-4 (IPC per benchmark vs. prediction accuracy)

Figure 9 shows the global performance of both SMT-4 architectures (small and large). The performance is the sum of the individual IPCs of the benchmarks used in each simulation (see Section 3). The biggest impact of the branch prediction accuracy increase was found in the larger architecture, as with the superscalar architectures. However, there is no intersection between the performance curves of the small and large SMT-4. Even though the small SMT-4 has resources similar to the large superscalar architecture, its performance is even better. The reason is that the SMT architecture runs multiple threads, in this case four: when multiple threads are executing, there is more room to exploit parallelism at either the thread level or the instruction level. Notice that the performance of the small SMT-4 is limited by resource conflicts, as improvements in the prediction accuracy have no impact on the global performance at all, whereas the large SMT-4 shows more significant improvement as the predictor accuracy increases. Although the small SMT-4 achieves better performance than the large superscalar architecture, its performance curve is flat. This means the predictor does not need to be as accurate as in the superscalar case, because the resources are essentially being used by multiple threads; instruction level parallelism does not play the key role in this scenario.

Figure 9: Global performance of the SMT-4 architectures (summed IPC, small vs. large, vs. prediction accuracy)

Figure 10 shows the speed-up over the improvement of branch prediction accuracy, as described previously. In this graph, we can see that perfect branch prediction in the large SMT-4 architecture yields a speed-up of about 18% over the same architecture with 95% accuracy. In the large SMT-4 the predictor affects the performance because the available resources also allow more instruction level parallelism to be exploited. In the small SMT-4 the resources are enough to exploit only the thread level parallelism. We conclude that when thread level parallelism is the main parallelism being exploited, the accuracy of the predictor hardly interferes at all.

Figure 10: Speed-up over the prediction accuracy improvements (SMT-4, small vs. large)

To measure the impact of the branch prediction accuracy on more aggressive SMT architectures, we ran simulations with 8 simultaneous tasks. Figure 11 shows the performance of each benchmark in the small SMT-8. In this graph, it is clear that the increase in the number of tasks generates an overhead that makes the performance decrease even as the prediction accuracy improves. Comparing with the small SMT-4 (Figure 7), we see that the lack of resources in this case harms the performance of some benchmarks.

Figure 11: Performance of the small SMT-8 (IPC per benchmark vs. prediction accuracy)

Figure 12 shows the results achieved by the large SMT-8 architecture. Once again the experiments confirm that the benchmarks mgrid and fpppp are not affected by prediction accuracy improvements. Analyzing their code, we verified that the number of instructions per branch is near 70~80 (the average basic block size). Thus the performance of these benchmarks is not limited mainly by the accuracy of the predictor; it depends basically on the data dependencies and the available resources.

Figure 12: Performance of the large SMT-8 (IPC per benchmark vs. prediction accuracy)

The comparison between the global performance of the small and large SMT-8 architectures is shown in Figure 13. We can note that the small SMT-8 architecture was almost unaffected by the accuracy increase: as in the small SMT-4 (Figure 9), the availability of resources relative to the number of threads does not allow instruction level parallelism to be exploited well. In contrast, the large SMT-8 presents a wide improvement as the prediction accuracy increases, because there are enough resources to exploit instruction level parallelism as well as thread level parallelism. In this case, we see the most significant improvement of all the cases analyzed in this work.

Figure 13: Global performance of the SMT-8 architectures (summed IPC, small vs. large, vs. prediction accuracy)

Figure 14 shows the speed-up over the prediction accuracy for the small and large SMT-8 architectures. The small SMT-8 architecture has its speed-up reduced as the prediction accuracy increases; as we can see in Figure 13, there is no improvement in performance as the accuracy increases. Thus, since the speed-up is measured as the ratio between the performance of one architecture and that of the same architecture with a 5% less accurate predictor, the speed-up shrinks as the predictor accuracy increases. On the other hand, the large SMT-8 architecture shows a speed-up of 18% when the accuracy is increased from 95% to 100%.

Figure 14: Speed-up over the prediction accuracy improvements (SMT-8, small vs. large)
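A back-of-the-envelope model (our illustration, not taken from the paper) shows why long basic blocks make a benchmark insensitive to predictor accuracy: with b instructions per branch, misprediction rate m, and a flush penalty of p cycles per misprediction, the penalty is amortized over many instructions when b is large. All parameter values below are hypothetical.

```python
def effective_ipc(ideal_ipc, instrs_per_branch, mispredict_rate, penalty):
    """Ideal IPC diluted by the average misprediction cost per branch."""
    cycles_per_branch = (instrs_per_branch / ideal_ipc
                         + mispredict_rate * penalty)
    return instrs_per_branch / cycles_per_branch

# Code with ~75 instructions per branch (as measured for mgrid and fpppp)
# barely notices even a 25% misprediction rate, while branch-dense code
# with short basic blocks suffers far more:
print(round(effective_ipc(4.0, 75, 0.25, 6), 2))  # 3.7
print(round(effective_ipc(4.0, 5, 0.25, 6), 2))   # 1.82
```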

5. Conclusions

In this work we presented a comparison between a superscalar and an SMT architecture based on the impact of branch prediction accuracy improvements. We analyzed both architectures under different scenarios, comprising different configurations (i.e., available resources) and different prediction accuracies.

For the superscalar architecture, we found that branch prediction improvements always presented some performance benefit: the better the predictor, the better the performance. Although we observed improvements across the whole range of predictor accuracies, the gain from 95% to 100% is in the range of 7% to 9%. The superscalar architecture used in these simulations has a six-stage pipeline; for deeper pipelines, improvements in branch prediction accuracy might yield larger gains, since misprediction penalties would affect performance more.

Regarding the SMT architectures, we observed two different behaviors. First, for the small SMT we saw a flat performance curve when varying the branch prediction accuracy: prediction did not affect performance. If the ratio between the number of threads and the amount of resources (threads/resources) is high, thread level parallelism is probably prioritized; hence instruction level parallelism is not the most important performance factor, and prediction accuracy does not affect it as in the superscalar architecture. Second, for the large SMT we observed improvement from increasing either the resources or the prediction accuracy. In these architectures the amount of available resources was enough to exploit both thread and instruction level parallelism; when instruction level parallelism directly affects performance, the prediction accuracy plays an important role as well.

In the large SMTs the ratio between the number of threads and the amount of resources tends to be lower than in the small SMTs. Although the SMT-8 architecture has twice as many resources as the SMT-4, it executes 8 applications while the SMT-4 executes only 4, so the resources available per application are the same. We conclude that when we increase the resources and, proportionally, the number of tasks on an SMT architecture, the global performance increases in the same ratio, and the effect of improving the branch prediction is similar in both architectures. If the "threads/resources" ratio decreases, the performance benefit of increasing the branch prediction accuracy can exceed 18%: as presented in Figure 14 for the large SMT-8, 18% was the highest speed-up obtained when increasing the prediction accuracy from 95% to 100%, and if more resources were available the ratio would be smaller and the speed-up could be higher. Even though SMT and superscalar architectures behave differently, we observed that the effect of improving the prediction accuracy is similar for both.

For future work, the design of a branch prediction mechanism with variable branch target address prediction would allow the study of the effects of branch target address misprediction in both superscalar and SMT architectures. Another important issue is that the performance of the branch predictor can be affected by the multiple thread contexts; this was not in the scope of this work, but a study of these effects could be done.

6. References

[1] Ando, H., et al., Speculative Execution and Reducing Branch Penalty in a Parallel Issue Machine, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers & Processors, Cambridge, Massachusetts, Oct. 1993.
[2] Anderson, D. and Shanley, T., Pentium Processor System Architecture, Second Edition, MindShare, Inc., Addison-Wesley, Massachusetts, Feb. 1995.
[3] Burger, D. and Austin, T. M., The SimpleScalar Tool Set, Version 2.0, Technical Report #1342, University of Wisconsin-Madison, June 1997.
[4] Diep, T. A., Nelson, C., and Shen, J. P., Performance Evaluation of the PowerPC 620 Microarchitecture, Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA), Santa Margherita Ligure, Italy, June 1995.
[5] Gonçalves, R. and Navaux, P., SEMPRE: Superscalar Architecture with Multiple Processes in Execution (in Portuguese), X Symposium on Computer Architecture and High Performance Computing, Búzios, Brazil, Sept. 1998.
[6] Gonçalves, R. A. L., Sagula, R. L., Divério, T. A., and Navaux, P. O. A., Process Prefetching for a Simultaneous Multithreaded Architecture, XI Symposium on Computer Architecture and High Performance Computing, Natal, Brazil, Sept./Oct. 1999.
[7] Gonçalves, R. A. L., Ayguadé, E., Valero, M., and Navaux, P. O. A., A Simulator for SMT Architectures: Evaluating Instruction Cache Topologies, XII Symposium on Computer Architecture and High Performance Computing, São Pedro, Brazil, Oct. 2000.
[8] Hily, S. and Seznec, A., Standard Memory Hierarchy Does Not Fit Simultaneous Multithreading, Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC), 1998.
[9] Hirata, H., et al., An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads, Proceedings of the 19th International Symposium on Computer Architecture (ISCA), May 1992.
[10] Johnson, M., Superscalar Microprocessor Design, Prentice Hall Series in Innovative Technology, PTR Prentice Hall, Englewood Cliffs, New Jersey, 1991.
[11] Kessler, R. E., The Alpha 21264 Microprocessor, IEEE Micro, v. 19, n. 2, March/April 1999.
[12] Lee, D., et al., Instruction Cache Fetch Policies for Speculative Execution, Proceedings of the 22nd ISCA, Italy, 1995.
[13] Lipasti, M. H. and Shen, J. P., Exceeding the Dataflow Limit via Value Prediction, Proceedings of the 29th MICRO, Paris, Dec. 1996.
[14] Lo, J., et al., An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors, Proceedings of the 25th ISCA, 1998.
[15] McFarling, S., Combining Branch Predictors, Technical Note TN-36, Digital Western Research Laboratory, Palo Alto, 1993.
[16] MIPS R10000 Microprocessor User's Manual, Version 1.0, MIPS Technologies, Inc., Mountain View, California, June 1995.
[17] Pierce, J. and Mudge, T., Wrong-Path Instruction Prefetching, Proceedings of the 29th MICRO, Dec. 1996.
[18] Rotenberg, E., Bennett, S., and Smith, J. E., Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching, Proceedings of the 29th MICRO, Paris, Dec. 1996.
[19] Santos, R. R. and Navaux, P. O. A., Speculative Fetch Mechanism of Multiple Instruction Streams (in Portuguese), IX SBAC-PAD, Campos do Jordão, Brazil, Oct. 1997.
[20] Sigmund, U. and Ungerer, T., Memory Hierarchy Studies of Multimedia-Enhanced Simultaneous Multithreaded Processors for MPEG-2 Video Decompression, Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC), Toulouse, Jan. 2000.
[21] Smith, J. E., A Study of Branch Prediction Strategies, Proceedings of the 8th International Symposium on Computer Architecture (ISCA), Minneapolis, May 1981.
[22] Smith, J. E. and Sohi, G. S., The Microarchitecture of Superscalar Processors, Proceedings of the IEEE, 83(12), pp. 1609-1624, Dec. 1995.
[23] Tullsen, D. M., et al., Simultaneous Multithreading: Maximizing On-Chip Parallelism, Proceedings of the 22nd ISCA, Santa Margherita Ligure, Italy, Computer Architecture News, v. 23, n. 2, 1995.
[24] Tullsen, D. M., et al., Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proceedings of the 23rd ISCA, Philadelphia, PA, May 1996.
[25] Uht, A. K., Sindagi, V., and Somanathan, S., Branch Effect Reduction Techniques, IEEE Computer, v. 30, n. 5, May 1997.
[26] UltraSPARC User's Manual, UltraSPARC-I/UltraSPARC-II, Revision 2.0, Sun Microsystems, Mountain View, CA, USA, May 1996.
[27] Yamamoto, W., et al., Performance Estimation of Multistreamed, Superscalar Processors, Hawaii International Conference on System Sciences, Jan. 1994.
[28] Yeh, T.-Y. and Patt, Y. N., Two-Level Adaptive Training Branch Prediction, Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO), New Mexico, Nov. 1991.
[29] Young, C., Gloy, N., and Smith, M. D., A Comparative Analysis of Schemes for Correlated Branch Prediction, Proceedings of the 22nd ISCA, Santa Margherita Ligure, Italy, 1995.