A Computation Saving Partial-Sum-Global-Update Scheme For Perceptron Branch Predictor

Liqiang He
College of Computer Science, Inner Mongolia University
Huhhot, Inner Mongolia, 010021, P. R. China

Abstract

As pipelines deepen and issue widths widen, the accuracy of the branch predictor becomes increasingly important to the performance of a microprocessor. State-of-the-art research has shown that perceptron branch predictors can achieve higher accuracy than the widely used table-based branch predictors. One shortcoming of perceptron branch predictors is their high prediction latency, most of which comes from the computation required by the prediction process. In this paper, we propose a Partial-Sum-Global-Update (PSGU) scheme that decreases the amount of computation in a perceptron predictor with only marginal accuracy loss. The scheme is orthogonal to other techniques such as ahead pipelining. Using the O-GEHL predictor as an example, simulation results show that, with the storage budget ranging from 32Kbits to 512Kbits, our scheme saves up to 18.2% of the computation per prediction on average while losing at most 1.75% accuracy. A further benefit of the saved computation is reduced power consumption on the predictor, which is also an important factor in today's microprocessors.

Keywords: perceptron branch predictor, O-GEHL, PSGU, partial sum, average computation number

1. Introduction

As pipelines deepen and issue widths widen, the accuracy of the branch predictor becomes increasingly important to the performance of an out-of-order microprocessor. State-of-the-art research in adapting machine learning algorithms to the branch prediction problem has been very successful in dramatically increasing branch prediction accuracy. Most of these algorithms use the perceptron as the base structure of the predictor. However, such perceptron predictors need a large number of small adders operating every cycle they make a prediction, increasing both the area of

the predictor and the energy per prediction, and they suffer from high prediction latency and hardware logic complexity. The latency issue has been addressed through ahead pipelining.

In this research, we propose a Partial-Sum-Global-Update (PSGU) scheme that accompanies ahead pipelining. The scheme resembles the pipeline gating techniques used in today's microprocessors. The idea behind PSGU is that, in most cases, the prediction result of a perceptron predictor for a branch instruction can be obtained by calculating only a partial sum of the input vector and the corresponding weights. In other words, in most cases the result from the partial sum is the same as the result from the total sum at the end of the prediction pipeline, because both depend only on the sign of the sum. Thus the prediction result can be sent to the fetch unit without waiting for the final output of the prediction pipeline, saving several adding operations per prediction. In addition, the power corresponding to the saved operations is no longer consumed. To determine how many parts make up the partial sum, we use an adaptive scheme that dynamically updates this number according to the prediction result and the final outcome of a branch instruction.

The PSGU scheme is orthogonal to other similar techniques [1, 2] and can be used directly with them. For perceptron predictors that need many adders, such as the path-based predictor and the piecewise linear predictor, we believe it can decrease the prediction computation and power consumption even more effectively than in our experiment. We use the O-GEHL predictor in our experiment as an example; it needs only 8 entries in the adder tree according to the rules of the 1st Championship Branch Prediction contest.
Simulation results show that, with the storage budget ranging from 32Kbits to 512Kbits, our scheme saves up to 18.2% of the computation per prediction operation on average, with only marginal accuracy loss (up to 1.75% when using misp/KI as the metric).

This paper is organized as follows. Section 2 surveys related work. Section 3 describes our PSGU scheme for perceptron predictors. Sections 4 and 5 describe the evaluation framework and analyze the simulation results. Finally, Section 6 concludes the paper.

2. Related Work

A basic perceptron predictor [3] is a simple neural network. It calculates a dot-product of the weights and an input vector, and the sign of the dot-product is used as the prediction result. Each weight represents the correlation of one bit of history with the branch to be predicted, and the input vector consists of 1's for taken and -1's for not-taken branches. The dot-product can then be calculated using a Wallace-tree [4] adder.

The latency caused by the adder tree is addressed by ahead pipelining [5], which uses older history or path information to start the branch prediction multiple cycles before the prediction is needed, with newer information being injected as it becomes available. Compared with the un-pipelined version of the same predictor, there is a small decrease in prediction accuracy. Accuracy can be improved in several ways: mixing local and global history [6], or using redundant history and skewing [7]. The hardware logic complexity is reduced through MAC representation [7] or hashing [8].

Considering accuracy alone, the predictors selected by the 1st Championship Branch Prediction give a clear illustration. The piecewise linear neural predictor [9] improves the prediction accuracy of the path-based neural predictor [5] by changing the mapping of each weight from the address of a previous branch to a hash of a previous and the current branch address; it uses local and global history to make predictions and dynamically adjusts the history length used. In addition, it uses a bias table that is larger than the other weight tables, along with other features that reduce aliasing. Its shortcoming is that it needs hundreds of adders, which may render the predictor infeasible from a latency, power, and complexity perspective. The O-GEHL predictor [10], contemporary with the piecewise linear predictor, achieves very high branch prediction accuracy for storage budgets in the 32Kbit-1Mbit range.
Figure 1. Percent of different partial sums in path-based perceptron predictor.

Figure 2. Percent of different partial sums in piecewise linear predictor.

The O-GEHL predictor can exploit very long global history lengths, in the range of hundreds of bits, and uses a medium number of predictor tables and limited hardware logic for the prediction computation. When making the final prediction, it uses an adder tree to compute the result. Furthermore, the TAGE predictor [11] uses O-GEHL as its base predictor and combines partially tagged components as in the PPM-like predictor [12]. It relies on partial hit/miss detection as the prediction computation function, thus avoiding the adder tree of the original O-GEHL predictor. Although TAGE is the most accurate branch predictor in the literature, we believe that perceptron branch predictors using an adder tree still have a place, and this research mainly focuses on this type of predictor.

3. A Partial-Sum-Global-Update Scheme

3.1. The basic idea

In this research, we want to decrease the number of computations needed for a prediction in a perceptron predictor, and correspondingly save power consumption. In a perceptron branch predictor, in most cases the prediction result can be obtained by calculating only a partial sum of the input vector and the weights. This is because the prediction result depends only on the sign of the sum. Take a real case in the 64Kbit O-GEHL predictor (proposed in the 1st CBP) as an example: the weights corresponding to a branch instruction I in the eight tables are {3, -5, 2, 4, -7, -9, -8, -3}, so the prediction result based on the total sum of the weights is not taken (-23 < 0). If the sum is calculated serially, the signs (+1 for positive and -1 for negative) of the eight partial sums are {1, -1, 1, 1, -1, -1, -1, -1}; from the 5th partial sum onward the sign is the same as the final result from the total sum, so the remaining partial sums are in fact not needed.

We constructed an experiment to count the number and proportion of these cases. If the predictor has N weight entries, we calculate the partial sums serially and get the prediction result from the final sum. If the sign of the i-th partial sum is the same as the sign of the final sum, we increment the i-th counter. Finally we divide the value of each counter by the total number of predictions to get the proportion in percent. The simulation traces come from the 1st CBP, which is introduced in Section 4, and the tested predictors are the path-based perceptron predictor, the piecewise linear predictor, and the O-GEHL predictor.

Figure 3. Percent of different partial sums in O-GEHL predictor.

Figure 1 shows the results for the path-based perceptron predictor. For all the traces, 87.69% of prediction results on average can be obtained by calculating only the 1st partial sum; that means such results can be read directly out of the 1st weight table and sent to the fetch unit. Furthermore, 91.55%, 93.14%, and 94.02% of prediction results can be obtained by calculating the 2nd, 3rd, and 4th partial sums respectively. For irregular program traces, INT-* and SERV-* for example, the prediction needs more parts of the partial sum than for the regular program traces (FP-* and MM-*).

Figure 2 shows the results for the piecewise linear predictor. It has the same trend as Fig. 1 with slightly different characteristics. It uses more information (history and path) to make a prediction, and these kinds of information contribute different parts of the final accuracy. As in Fig. 1, the path information contributes most of the accuracy, especially in the first several partial sums, while the history information contributes more and more accuracy as its length increases, though in smaller and smaller steps. From the results of SERV-*, we can see that when history information is combined with the path information, the accuracy from the first several partial sums decreases, and the lost accuracy needs to be compensated through further computation.

Figure 3 shows the results for the O-GEHL predictor.
Because the O-GEHL predictor calculates only 8 partial sums, the different contributions to accuracy can be seen clearly. For regular program traces, most of the prediction results can be obtained by calculating only the first or second partial sum, whereas for irregular program traces the results need more calculating operations: 4, 5, or more.

Based on the above results, we can place a gating mechanism in the computing process of the prediction sum. Suppose a prediction in an original perceptron predictor needs N adding computations. In order to capture most of the prediction results, the gating mechanism should be placed after at least n partial sums; the value of n varies across predictors. If the prediction result from the n-th partial sum is the same as the prediction result from the final sum, the prediction process saves N - n adding computations. If they differ but the prediction from the partial sum turns out to be right, that is another benefit of our scheme. But if the prediction from the partial sum is finally found to be wrong, whether or not it matches the result from the total sum, a misprediction is detected, and the gating mechanism uses an adaptive method during the predictor update process to add several adding operations for the next prediction of the same branch instruction.

3.2. PSGU scheme: Predicting through partial sum

int prediction(int pc /* and other parameters */) {
    /* use pc and the other parameters to index the step table */
    i = index_to_step_table(pc, ...);
    yout = bias_input;
    /* calculate step_table[i] partial sums */
    for (j = 0; j < step_table[i]; j++) {
        k = index_to_Weight_j;
        yout += Weight_j[k];
    }
    return (yout >= 0);  /* >= 0 for taken, < 0 for not taken */
}
