A Novel Heuristic for Selection of Hyperblock in If-Conversion

Rajendra Kumar
Department of CSE, Vidya College of Engineering, Meerut (UP), India
[email protected]

Abhishek Kumar Saxena
Department of IT, VIET, Dadri, G. B. Nagar (UP), India
[email protected]

Abstract - In this paper we present a novel heuristic for the selection of hyperblocks in if-conversion. If-conversion has proved to be a promising method for exploiting ILP in the presence of control flow. Under predication, if-conversion converts the control dependences between branches and the remaining instructions into data dependences between the predicate definitions and the predicated instructions of the program. As a result, control flow transformation becomes traditional data flow optimization, and branch scheduling becomes the reordering of serial instructions. The degree of ILP can be increased by overlapping the execution of multiple program paths. The main idea behind this concept is to go a step beyond common branch prediction and permit the architecture to use information about the components of the program's CFG (control flow graph) to make better branch decisions for ILP. The navigation bandwidth of the prediction mechanism depends upon the degree of ILP; it can be increased by strengthening control flow prediction in procedural languages at compile time. This enlarges the initiation window, which allows the overlapped execution of multiple independent flows of control. Multiple branch instructions can also be allowed as intermediate steps in order to increase the size of the dynamic window and achieve a high degree of ILP exploitation.

Keywords: ILP; basic block; hyperblock; CFG

P K Singh
Department of CSE, MMM Engineering College, Gorakhpur (UP), India
[email protected]

I. INTRODUCTION

Instruction level parallelism (ILP) is the basis of good performance in today's processors, which execute multiple instructions per cycle. ILP is constrained by branch instructions. Predication is the process by which control flow is converted into predicated instructions; this conversion is performed by if-conversion, which also turns the control dependences introduced by branch instructions into data dependences [4] between predicate definitions and predicated instructions. This increases ILP because multiple program paths execute overlapped with each other. If if-conversion is performed early in compilation, the predicated instructions can be optimized further; if it is performed after scheduling, the decision can reflect the performance characteristics of the target processor. A hyperblock [8] is a collection of basic blocks in which control flow enters only at the first basic block, and if-conversion is used to handle control flow internally. The ability to execute multiple instructions per cycle is necessary for good performance in modern processors [5], and it requires exploiting ILP. ILP is greatly constrained by branch instructions, so branch prediction is often employed together with speculative execution [6].

II. RELATED WORK

There are two major questions regarding if-conversion: (i) when to if-convert, and (ii) what to if-convert. [6] indicates that performing if-conversion early in the compilation process has the benefit of enabling classical optimization of predicated instructions, and that an effective heuristic can be designed for the selection of basic blocks. Although if-conversion and predication have many benefits, overlapped paths sometimes create high resource requirements, and if these requirements exceed the limits of the processor, if-conversion causes a performance loss. As a solution, a compilation framework [6] was introduced that allows the compiler to maximize the benefits of predication by delaying the choice between control flow and predication. 'Partial reverse if-conversion' is the process of replacing predication with control flow within the if-conversion framework. To form hyperblocks, if-conversion is performed at the start of compilation so that ILP optimizations can work on the predicated representation. Afterwards, using the information the scheduler provides about code performance characteristics and the target processor, reverse if-conversion inserts branches to remove parts of the hyperblock; this is done for paths that must be executed separately. Predicate flow graphs are analyzed to find the schedule length with and without reverse if-conversion. A loop can be encapsulated in a multiblock; [2] presents the role of multiblocks in control flow prediction (CFP) in the parallel register sharing architecture to achieve a high degree of ILP.
The parallel register sharing architecture for code compilation is presented in [1]. [3] introduces control flow prediction (CFP) in the parallel register sharing architecture. According to [7], the results of performing if-conversion in the compiler were quite disappointing: on average, 7% of CPU cycles are spent servicing branch mispredictions on an Intel processor, and when if-conversion was applied, only about half of these mispredictions were eliminated, whereas approaches other than [6] experienced much more benefit. [6] has worked on this drawback.


The more optimization the compiler performs, the better the performance of if-conversion. The speedup from if-conversion comes from overlapping the control paths of separate instruction streams and from reducing branch mispredictions. With the help of the predication process, the compiler achieves higher ILP by overlapping separate execution paths, and it is able to rewrite the program's control structure [4]. A technique to enhance the ability of dynamic ILP processors to exploit parallelism is introduced in [9]. A performance metric is presented in [10] to guide nested loop optimization; in this way the effect of ILP is combined with loop optimization.

III. THE BASE HEURISTIC

Innermost loops are loops that contain no other loop. For each loop that has more than one back edge, all these back edges are redirected into a new basic block containing a branch back to the loop header. To select basic blocks, the main path of the innermost loop is identified first; this is the path that is executed most often in comparison to the other paths. Starting from the loop header, the block in the loop body that has multiple back edges is identified; it is treated as the basic block that terminates the initially identified main path. The technique used here differs in that a long main path is broken into many short paths using a trace-stop threshold; this is the main heuristic technique used in the compiler. A hyperblock [8] consists of all the basic blocks that fall on the main path. In the second step of the heuristic, all paths other than the main path are analyzed to determine whether they are useful for hyperblock creation.
For this, the BSV (basic block selection value) is calculated as:

BSV = issue rate × hazard × (bb_freq / bb_size) × (main path size / main path frequency)

BSV is directly proportional to the size of the main path and to bb_freq, the number of times the block is executed according to profiling. It is inversely proportional to the size of the block and to the frequency with which the main path is traced. The size of a basic block reflects how much further ILP can be exploited, measured by resource length or dependence height. The degree of hazard is calculated from instructions such as memory stores and subroutine calls. BSV also depends upon the issue rate of the target processor, which is a fixed value for all basic blocks on a given architecture. A high BSV indicates that a higher degree of parallelism can be obtained from a basic block that does not lie on the main path but has a high execution frequency. The compiler sets a threshold value to identify suitable basic blocks: if the calculated BSV is greater than the predefined threshold, the basic block is considered beneficial for increasing the degree of parallelism by overlapping more execution paths, and it is included in the main path with the other basic blocks. While following this heuristic approach, the following factors are to be considered:

1. If a basic block is not reachable from any block on the main path, it is not included in the main path even if its BSV value is greater than the threshold.

2. The order in which basic blocks off the main path are examined must be decided. All blocks that are neighbors of basic blocks on the main path are analyzed first. The execution frequency of a basic block not on the main path is its execution probability with respect to the main path. As an example, for basic block BB7 the absolute execution probability (bb_freq) can be calculated as:

bb_freq = 10/45 × 45/100

where 10/45 is the relative frequency of BB7 with respect to BB5, and 45/100 is the relative frequency of BB5 with respect to BB4. The absolute execution frequency is the product of the relative frequencies of all basic blocks until a basic block on the main path is reached. In this bottom-up traversal there is no danger of infinite looping once a basic block on the main path is reached, because after back-edge coalescing the back edges are reachable only from the back-edge basic block on the main path.

3. Information about the main path must be updated whenever a new basic block is added to the chain, since the resource constraints, dependency constraints, and BSV values change. Loops that are rarely executed are ignored and discarded from hyperblock creation and the basic block selection process.
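As a minimal sketch of the selection criterion above: the following Python fragment computes the absolute execution frequency as a product of relative edge frequencies and then the BSV formula. The function names, the example issue rate, hazard degree, block sizes, and the threshold value are all illustrative assumptions; only the two formulas come from the text.

```python
# Sketch of the BSV (basic block selection value) computation.
# Formulas are from the text; all concrete numbers except the Figure 1
# frequencies (10/45 and 45/100 for BB7) are assumed for illustration.

def absolute_frequency(relative_freqs):
    """Absolute execution frequency of an off-main-path block: the
    product of relative edge frequencies until the main path is reached
    (e.g. BB7: 10/45 relative to BB5, times 45/100 relative to BB4)."""
    freq = 1.0
    for f in relative_freqs:
        freq *= f
    return freq

def bsv(issue_rate, hazard, bb_freq, bb_size,
        main_path_size, main_path_freq):
    """BSV = issue_rate * hazard * (bb_freq / bb_size)
           * (main_path_size / main_path_freq)."""
    return (issue_rate * hazard * (bb_freq / bb_size)
            * (main_path_size / main_path_freq))

# Example: BB7 from Figure 1, with assumed size and hazard values.
bb_freq = absolute_frequency([10 / 45, 45 / 100])  # = 0.1
value = bsv(issue_rate=4, hazard=0.8, bb_freq=bb_freq, bb_size=5,
            main_path_size=20, main_path_freq=0.44)
THRESHOLD = 1.0  # assumed; the paper leaves the value to the compiler
include_in_hyperblock = value > THRESHOLD
```

A block off the main path is admitted only when its BSV clears the compiler-defined threshold, subject to the reachability condition in factor 1 above.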

[Figure omitted: a control flow graph over basic blocks BB1-BB9 with edge execution frequencies; among others, BB1→BB3 = 80, BB4→BB5 = 45, BB4→BB6 = 55, and BB5→BB7 = 10.]
Figure 1. Innermost loop sample.
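The trace formation with a trace-stop threshold described in Sections III and IV might be sketched as follows. The CFG dictionary encodes the branch probabilities of Figure 1 (single-successor edges are taken with probability 1); the function name and data layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch of trace formation with a trace-stop threshold: starting from
# a given block, follow the most probable successor while the
# cumulative path probability stays at or above the threshold.
# Edge probabilities below are taken from Figure 1; the encoding and
# helper names are assumptions.

CFG = {
    "BB1": [("BB2", 0.20), ("BB3", 0.80)],
    "BB2": [("BB4", 1.00)],
    "BB3": [("BB4", 1.00)],
    "BB4": [("BB5", 0.45), ("BB6", 0.55)],
    "BB5": [("BB7", 10 / 45), ("BB8", 35 / 45)],
    "BB6": [("BB9", 1.00)],
    "BB7": [("BB9", 1.00)],
    "BB8": [("BB9", 1.00)],
}

def form_trace(start, threshold):
    """Grow a trace greedily; stop before the cumulative execution
    probability would fall below the trace-stop threshold."""
    trace, prob = [start], 1.0
    node = start
    while node in CFG:
        succ, p = max(CFG[node], key=lambda edge: edge[1])
        if prob * p < threshold:  # trace-stop: do not extend the path
            break
        prob *= p
        trace.append(succ)
        node = succ
    return trace, prob

# With a 50% trace-stop threshold the first trace is BB1-BB3-BB4:
# extending past BB4 would drop the probability to 0.80 * 0.55 = 0.44.
trace, prob = form_trace("BB1", threshold=0.50)
```

This reproduces the example discussed in the trace-stop threshold extension: the long main path BB1-BB3-BB4-BB6-BB9 (44%) is cut at BB4, leaving a first trace with 80% execution probability.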


IV. EXTENSIONS TO THE BASE HEURISTIC

The following extensions are applied to the base heuristic to obtain the novel heuristic:
A. Eliminating long paths by using a trace-stop threshold.
B. Calculating resource and dependency constraints by combining basic blocks.
C. Performing if-conversion on non-innermost loops and sequential code.

A. Trace-stop Threshold

If the main path contains many branches, it will have a low overall execution probability. In Figure 1, the main path BB1-BB3-BB4-BB6-BB9 has an execution frequency of 44% (80% × 55%) as a result of two branches. Even though each branch alone has an execution probability greater than the threshold, the many branches on the main path lead to a low overall probability, so longer traces can sometimes lead to poorly chosen hyperblocks. To solve this problem a trace-stop threshold is introduced, and long paths are broken into short ones. Assume the trace-stop threshold for the figure is 50%; the compiler will then select the first main path as BB1-BB3-BB4. The procedure starts with the basic block that is executed most often, and traces are grown until every path found has a total execution probability above the defined trace-stop threshold. Once a path is found, the basic blocks not yet assigned to any path are analyzed, and the process is repeated until each basic block ultimately falls in exactly one path or trace. In this way several paths are found for every loop. If a path contains only one basic block, if-conversion is not required for that path. The basic blocks that are neighbors of the blocks on each constructed path are also analyzed.

B. Multi-Basicblock Combination

The accuracy of the main path size estimate calculated by the heuristic varies. The size is estimated by counting the number of operations the main path contains.
If a path contains a large number of operations, the available ILP allows these operations to be arranged into fewer cycles; but if a path contains few operations, higher resource and dependency constraints result, leading to a long schedule. The base heuristic adds up the resource lengths of the individual blocks and takes the sum as the resource length of the whole main path. To improve the accuracy of this estimate, the basic blocks on the main path are merged into one large basic block, and the larger of its resource length and dependence height is taken as the 'size' of the main path.

C. If-conversion of Non-innermost Loops and Sequential Code

In our heuristic approach we considered only the basic blocks inside innermost loops, so that the risk of hyperblocks crossing loop borders is avoided. Because of this restriction, however, if-conversion is not applied to all basic blocks. To overcome this problem, we also apply if-conversion to the basic blocks of non-innermost loops and to sequential basic blocks outside loops. The following steps are used to find these basic blocks:

1. Loop regions are identified first. A loop region is formed by partitioning a larger region such that each small region consists of at least one basic block and these basic blocks do not belong to any inner loop. The loop regions are disjoint and exhaustive. This selection process is repeated until all the non-innermost loops are covered at least once.

2. Basic blocks are selected such that they belong to a non-innermost loop and contain no back edge; loop region decomposition and basic block selection are then applied to them. If-conversion applies only to the first basic blocks of such regions. If such a block is not part of a loop (here, a loop means that its first basic block has at least one outgoing edge), it is rejected by the heuristic; otherwise if-conversion is applied to it.

V. THE NOVEL HEURISTIC

The basic difference between the novel heuristic and the base heuristic alone is that the novel heuristic compiles benchmark source code into Lcode, and when this is fed into Elcor [4], hyperblocks are created; register allocation, scalar scheduling, etc. are performed along with it. The base heuristic alone cannot create hyperblocks this early. When Elcor is run, hyperblocks are formed by the novel heuristic using if-conversion, and if Elcor is run again on the created hyperblocks, they are optimized further.
This novel approach gives the heuristic its full strength and reflects the benefit of running Elcor twice on the hyperblocks, rather than running the base heuristic a second time.

VI. EMPIRICAL RESULTS

Here the results of the extended heuristic are compared with those of the base heuristic. The branch misprediction penalty is assumed to be 3 cycles, and the machine has one unit each for integer, float, branch, and memory operations. Both heuristics have been compared on 12 benchmarks on the basis of total dynamic scheduling cycles. Most of the benchmarks are media related, because they are loop intensive and benefit most from the formation of hyperblocks by if-converting loops.

Comparison of base and novel heuristics: we compared the base heuristic to the novel heuristic on 12 benchmarks; on only 3 of them does the novel heuristic produce fewer cycles than the base heuristic, and on the remaining 9 both approaches perform equally.


TABLE 1. COMPARISON OF NOVEL AND BASE HEURISTICS (DYNAMIC TOTAL CYCLES)

Benchmark     Novel Heuristic   Base Heuristic
gcc           8×10^8            8×10^8
gsmencode     3×10^8            3×10^8
gsmdecode     9×10^7            9×10^7
rawcaudio     9×10^6            9.5×10^6
rawdaudio     8×10^6            8×10^6
583hw1c       1.3×10^6          1.34×10^6
unepic        9×10^6            9.5×10^6
rasta         8×10^6            8×10^6
g721encode    5×10^8            5×10^8
g721decode    5×10^8            5×10^8
grep          4×10^6            3×10^6
mpeg2enc      1.5×10^4          1.5×10^4

VII. CONCLUSION

In this paper we have discussed how to break long traces or paths into shorter ones using if-conversion and a trace-stop threshold. We have focused on the right selection of hyperblocks, because wrong hyperblocks lead to performance loss. Resource constraints and dependency constraints have also been taken into consideration while choosing the right paths for the hyperblocks. The comparison shows how our heuristic approach proves beneficial compared with the base approach. We have also tried to apply the basic block selection approach to the whole program rather than to innermost loops only.

VIII. FUTURE WORK

Choosing the best combination of thresholds - The two threshold values of the extended approach should be chosen optimally: one limits the trace length, and the other decides whether a basic block can be used to create a hyperblock. The thresholds should be robust and simple enough that they can be determined dynamically from benchmark properties; for this, control flow information and profiling information can be used to determine loop intensity.

Correlation among hyperblocks - The present heuristic finds hyperblocks that are independent of each other and ignores potential relationships between them. When new hyperblocks are considered, existing hyperblocks could be taken into account, including the possibility of including one hyperblock inside another.

Merging with other heuristics - Heuristic approaches can be merged to analyze which approach is beneficial in which part of a program. Our base heuristic approach will be more beneficial on innermost loops, whereas the novel approach will be more beneficial on other loops.

REFERENCES

[1] Rajendra Kumar and P K Singh, "A Modern Parallel Register Sharing Architecture for Code Compilation", IJCA, Vol. 1, No. 16, 2010.
[2] Rajendra Kumar and P K Singh, "Role of Multiblocks in Control Flow Prediction using Parallel Register Sharing Architecture", IJCA, Vol. 4, No. 4, July 2010.
[3] Rajendra Kumar and P K Singh, "Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture", IJCSE, Vol. 02, No. 04, 2010, pp. 1179-1183.
[4] David I. August, Daniel A. Connors, Scott A. Mahlke, John W. Sias, Kevin M. Crozier, Ben-Chung Cheng, Patrick R. Eaton, Qudus B. Olaniran, and Wen-mei W. Hwu, "Integrated Predication and Speculative Execution in the IMPACT EPIC Architecture", in Proceedings of the 25th International Symposium on Computer Architecture, June 1998, pp. 227-237.
[5] Guilin Chen and Mahmut Kandemir, "Compiler-Directed Code Restructuring for Improving Performance of MPSoCs", IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 9, 2008.

[6] David I. August, Wen-mei W. Hwu, and Scott A. Mahlke, "A Framework for Balancing Control Flow and Predication", in Proceedings of the 30th International Symposium on Microarchitecture, December 1997.
[7] Youngsoo Choi, Allan Knies, Luke Gerke, and Tin-Fook Ngai, "The Impact of If-Conversion and Branch Prediction on Program Execution on the Intel Itanium Processor", in Proceedings of the 34th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-34), December 2001.
[8] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann, "Effective Compiler Support for Predicated Execution Using the Hyperblock", in Proceedings of the 25th International Symposium on Microarchitecture, pp. 45-54, December 1992.
[9] Scott A. Mahlke, Richard E. Hank, James E. McCormick, David I. August, and Wen-mei W. Hwu, "A Comparison of Full and Partial Predicated Execution Support for ILP Processors", in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 138-150, 1995.
[10] Steve Carr, "Combining Optimization for Cache and Instruction-Level Parallelism", in Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, 1996.

Rajendra Kumar is a McGraw-Hill author through the book "Theory of Automata, Languages & Computation". He is Associate Professor in the Computer Science and Engineering department at Vidya College of Engineering, Meerut (India). He has twelve years of teaching experience. His current area of research is instruction level parallelism.

Abhishek Kumar Saxena is Lecturer in the Information Technology department at VIET, Dadri, Gautam Budh Nagar (India). He has four years of teaching experience.

Dr. P K Singh is Associate Professor in the Computer Science and Engineering department at MMM Engineering College, Gorakhpur. His research areas are parallelizing compilers and computer architecture.
