An Approach for Compiler Optimization to Exploit Instruction Level Parallelism

Rajendra Kumar1 and P.K. Singh2

1 Uttar Pradesh Technical University, India
2 Madan Mohan Malviya University of Technology, Uttar Pradesh, India
[email protected], [email protected]
Abstract. Instruction Level Parallelism (ILP) is not a new idea. Unfortunately, ILP architectures are not well suited to all conventional high-level language compilers and compiler optimization techniques. Instruction Level Parallelism is a technique that allows a sequence of instructions derived from a sequential program (without rewriting it) to be parallelized for execution on multiple pipelined functional units. As a result, performance increases when working with current software. At the implicit level this is initiated by modifying the compiler, and at the explicit level it is done by exploiting the parallelism available in the hardware. To achieve a high degree of instruction level parallelism, it is necessary to analyze and evaluate the techniques of speculative execution and control dependence analysis, and to follow multiple flows of control. Researchers continue to seek ways to increase parallelism by an order of magnitude beyond current approaches. In this paper we present the impact of control flow support on highly parallel 2-core and 4-core architectures. We also investigate the scope of parallelism both explicitly and implicitly. For our experiments we used the Trimaran simulator. The benchmarks are tested on abstract machine models created through the Trimaran simulator.

Keywords: Control Flow Graph (CFG), Edition Based Redefinition (EBR), Intermediate Representation (IR), Very Long Instruction Word (VLIW).
1   Introduction
Instruction Level Parallelism [1] represents a typical example that redefines the traditional field of compilation. It raises issues and challenges that are not addressed in traditional compilers. To scale up the amount of parallelism at the hardware level, the compiler takes on increasingly complex responsibilities to ensure efficient utilization of hardware resources [3]. New strategies may result in long compilation times; to speed up compilation, two things must be considered:
1. Careful partitioning of the application.
2. Selection of better algorithms for branch prediction analysis and optimization.
M.K. Kundu et al. (eds.), Advanced Computing, Networking and Informatics - Volume 2, Smart Innovation, Systems and Technologies 28, DOI: 10.1007/978-3-319-07350-7_56, © Springer International Publishing Switzerland 2014
The major outcome of ILP compilers is to enhance performance by eliminating the complex processing needed to parallelize the program. ILP compilers accelerate the non-looping code widespread in most applications. For analysis purposes, we need statistical information (extracted through the Trimaran simulator). Statistical compilation [4] improves program optimization and scheduling. The improvement in performance of frequently taken paths is also supported by statistical compilation. Conventional compilers and optimizers do not produce optimal code for ILP processors. Therefore the designers of processors and compilers have to find useful methods for ILP compiler optimization that produce maximally efficient ILP processor code, for example when processing references to subscripted array variables. To achieve high ILP performance, the compiler must jointly schedule multiple basic blocks [5]. The compiler optimization includes:
1. Basic block formation and optimization
2. Superblock optimization
3. Hyperblock optimization
A superblock [6] is a control-flow structure with a single entry and multiple exits, and no side entrances. A hyperblock [10] is a predicated region of code that contains a straight-line sequence of instructions with a single entry point and possibly multiple exit points. A hyperblock is formed through a modified if-conversion; hyperblock optimization adds if-conversion to superblock optimization. If-conversion is the process of replacing branch statements with compare operations and guarding the associated operations with predicates defined by the comparisons. The exploitation of ILP increases the earlier branches are predicted. The Control Flow Graph (CFG) and the predicated hyperblock initiate this process. Fig. 1 shows a Control Flow Graph. The predicated hyperblock of Fig. 1 is as follows:

v = rand()   if true
v = q        if c1   (if c1 is true, v = q, else nullify)
w = v + 3    if c2   (if c2 is true, w = v + 3, else nullify)
x = v * 3    if true

Fig. 1. Control Flow Graph (v = rand(); if (v > a) then branch C1 executes v = q, else branch C2 executes w = v + 3; the paths rejoin at x = v * 3)
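The nullification semantics of the predicated hyperblock can be mimicked in ordinary code. The following sketch simulates predicated execution of Fig. 1; the function and parameter names are illustrative, not part of any real toolset:

```python
import random

def predicated_hyperblock(a, q):
    # Straight-line, predicated form of the CFG in Fig. 1 (sketch).
    v = random.random()          # v = rand()   if true
    c1 = v > a                   # the compare defines predicate c1
    c2 = not c1                  # complementary predicate
    if c1:                       # v = q        if c1 (else nullified)
        v = q
    w = v + 3 if c2 else None    # w = v + 3    if c2 (else nullified)
    x = v * 3                    # x = v * 3    if true
    return v, w, x
```

With a = -1 the predicate c1 is always true (rand() yields a value in [0, 1)), so v becomes q, w stays nullified, and x = 3q; both branch outcomes thus execute on the same straight-line path with no control transfer.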
2   Related Work
Instruction-level parallel processing has established itself as the only viable approach to providing continuously increasing performance without fundamentally rewriting applications. Code generation for a parallel register sharing architecture involves issues that are not present in sequential code compilation and is inherently complex. To resolve such issues, a consistency contract between the code and the machine can be defined, and the compiler is required to preserve the contract during code transformation. [7] proposed a parallel register sharing architecture for code compilation. The navigation bandwidth of the prediction mechanism depends upon the degree of ILP; it can be increased by increasing control flow prediction [2] at compile time. In [8], the authors presented the role of multiblocks in control flow prediction using the parallel register sharing architecture. There are two major questions regarding if-conversion: (i) when to if-convert, and (ii) what to if-convert. [11] indicates that performing if-conversion early in the compilation process has the benefit of enabling classical optimization of predicated instructions. As control flow prediction increases, the initiation size increases, which permits the overlapped execution of multiple independent flows of control. [9] presented control flow prediction through multiblock formation in the parallel register sharing architecture. The impact of ILP processors on the performance of shared memory multiprocessors, with and without latency-hiding software prefetching, is presented in [12].
3   Our Approach

Our work aims to explore parallelism at the compiler (software) and hardware (architecture) levels.

3.1   At Software Level
For our purpose we have modified a compiler that uses an Intermediate Representation (IR). Our compiler performs three basic steps:
1. Record information about the Control Flow Graph so that step 2 can work with the Control Flow Graph, and compute the Dominator Tree (DT).
2. Introduce functions into the CFG to modify the compiler.
3. Map the functions to the basic-block nodes that are handled by the code generator.

Construction of CFG and DT. The construction of the Control Flow Graph and the Dominator Tree are put in separate phases of the compiler. This task is performed after semantic analysis and includes the following sequence of steps:
1. Build a CFG for each function. The graph is stored in a new data structure separate from the abstract syntax tree (AST). The package from the Trimaran simulator in the framework contains the classes for representing control flow graphs. The framework includes the code to determine the set of variables that are modified or used in each basic block.
2. Construct the Dominator Tree.
3. Compute the dominance frontier.
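The dominator computation behind steps 2 and 3 can be sketched with the classic iterative data-flow algorithm. The CFG encoding below is an assumption for illustration, not the Trimaran data structure:

```python
def dominators(cfg, entry):
    # cfg maps each node to its successor list; returns, for every node,
    # the set of nodes that dominate it (iterative data-flow sketch).
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue                   # unreachable node: skip
            # a node's dominators = itself plus what dominates all predecessors
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom
```

On the diamond of Fig. 1 (entry splitting into two branches that rejoin), the join node is dominated only by itself and the entry, which is exactly the information the dominance frontier computation of step 3 builds on.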
Formation of SSA (Static Single Assignment). The following steps are applied to each function:
1. For each simple source-language variable, determine the set of nodes where compiler-inserted functions are placed.
2. Ensure the allocation of space for each newly inserted variable.
3. Keep track of variable versions using a stack data structure.
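The version tracking of step 3 can be illustrated with a minimal renaming pass over straight-line code; the tuple encoding of statements is an assumption made for this sketch:

```python
from collections import defaultdict

def rename(stmts):
    # stmts: list of (target, operands); returns the SSA-renamed list.
    # A per-variable counter assigns version numbers; a per-variable
    # stack records the current (topmost) version for operand lookup.
    counter = defaultdict(int)
    stack = defaultdict(list)
    out = []
    for tgt, ops in stmts:
        # rewrite each operand to its current version, if it has one
        new_ops = [f"{o}{stack[o][-1]}" if stack[o] else o for o in ops]
        counter[tgt] += 1
        stack[tgt].append(counter[tgt])
        out.append((f"{tgt}{counter[tgt]}", new_ops))
    return out
```

For example, rename([("v", []), ("v", ["q"]), ("x", ["v"])]) yields v1, v2 and x1 = v2: every definition of v gets a unique name, and the later use of v refers to its latest version.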
The above steps convert the intermediate representation into SSA form. SSA form is optimal and has no unnecessary terms. As the next task, we exploit the SSA form to implement the code optimization phase.

Modify the Compiler's Backend. To produce executable code, it is necessary to modify the backend of the compiler. The easiest way to get a working backend is to use the code generation phase already provided. Prior to code generation, we translate each compiler-inserted function into a sequence of copy (assignment) statements. These assignment statements are placed at the end of the predecessor blocks.

3.2   At Hardware Level (ILP Processor)
A processor supporting ILP [13] is known as an ILP processor. Its performance can be enhanced through compiler optimization. In an ILP processor, the basic unit of computation is a processor instruction performing an operation such as add, multiply, load, or store. Instructions without interdependences can be loaded and executed in parallel. With an ILP processor, instruction scheduling [3] need not be done during program execution; it can be done during compilation. One way to optimize an ILP processor's operation is to create a compiler that generates effective code on the assumption that no run-time decisions are possible; the compiler alone takes all scheduling and synchronization decisions. The processor then has very little code reordering to do at run time. Multi-core systems provide remarkable efficiency compared to a single core. For our experiments, we compared the speed-up of 2-core and 4-core systems against a single core.
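Compile-time scheduling of independent instructions can be sketched as a greedy list scheduler. The dependence encoding and the issue width of two are assumptions for illustration, not the scheduler actually used by the toolset:

```python
def list_schedule(instrs, deps, width=2):
    # instrs: instruction ids in program order.
    # deps[i]: set of instructions that must complete before i may issue.
    # Greedily packs ready (dependence-free) instructions into issue
    # slots of `width` instructions per cycle.
    done, cycles = set(), []
    remaining = list(instrs)
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        assert ready, "cyclic dependences"
        slot = ready[:width]
        cycles.append(slot)
        done |= set(slot)
        remaining = [i for i in remaining if i not in done]
    return cycles
```

Because all of this runs in the compiler, the processor issues each cycle's slot as-is and performs no run-time reordering, which is exactly the division of labor described above.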
4   How the Optimized Compiler Helps to Exploit ILP
The optimized compiler determines whether a branch statement is affected by the result of the statement that precedes its if-condition. If the branch statement is not affected by the execution result of the preceding statement, the branch statement is shifted in front of it, suppressing the execution of unwanted statements [11]. In this way branch statements are shifted or copied by the optimizing compiler to minimize the execution time of the object code. The setup we have from the Trimaran simulator [14] simulates a computer system with parallel processors capable of executing two or more procedures in parallel. The assumed (optimized) compiler, as shown in Fig. 2, comprises:
1. A syntax analysis unit (to interpret the statements of the source code and translate them into the Intermediate Representation (IR)).
2. An optimization unit (to optimize the use of the parallel processors at the level of the intermediate representation).
3. A unit for producing the object program.
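The branch-shifting transformation described above can be sketched on a toy IR; the tuple encoding of statements and branches is hypothetical:

```python
def hoist_branches(block):
    # block: list of ("stmt", defs, uses) and ("branch", uses) tuples.
    # Moves a branch ahead of the preceding statement when the branch's
    # condition does not use anything that statement defines, so the
    # statement is skipped on the not-taken path.
    out = list(block)
    for i in range(1, len(out)):
        if out[i][0] == "branch" and out[i - 1][0] == "stmt":
            prev_defs = out[i - 1][1]
            branch_uses = out[i][1]
            if not (prev_defs & branch_uses):   # branch independent of stmt
                out[i - 1], out[i] = out[i], out[i - 1]
    return out
```

When the preceding statement defines a variable the branch condition reads, the pair is left untouched; only provably independent statements are hoisted past.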
Fig. 2. Model of Optimizing Compiler (Source Code → Syntax Analyzer → Intermediate Code → Optimization Unit containing the Automatic Parallelizing Unit → Intermediate Code → Code Generation → Object Code)
The automatic parallelization unit (shown in Fig. 3) consists of:
1. A detection unit for detecting and recording the IR corresponding to the source code.
2. A conversion unit for intermediate code conversion, adding a different intermediate code that produces the same result as the code found by the detection unit.
In Fig. 3, the broad arrow represents control flow while the normal arrows represent data flow.
Fig. 3. Automatic Parallelization Unit (the IR feeds both the detection unit and the intermediate code parallelizing and conversion unit, which emit the parallelizing intermediate code)
5   Experiments
For our experiments we extended our compiler with an EBR (Edition Based Redefinition) operation. It allows upgrading the database component (symbol table and library) of the compiler while it is in use. To evaluate ILP exploitation on DVLIW (a distributed control path architecture for VLIW) [13] with the modified ILP compiler, we used the Trimaran toolset [14]. We used 17 benchmarks for our experiments. We measured speedup on two-core and four-core VLIW processors against a one-core processor. Each core was considered to have two integer units, one floating point unit, one memory unit, and one branch unit. We assumed operation latencies similar to
Table 1. Summary of Multi-core Speedup

Name of Benchmark   Speedup for 2-core   Speedup for 4-core
SPEC                1.30                 1.45
JetBench            1.10                 1.05
CloudSuite          1.01                 1.01
Bitarray            1.27                 1.27
Bitcnt              1.03                 1.03
Cjpeg               1.50                 1.50
Jcapistd            1.05                 1.05
Rdbmp               1.06                 1.06
Rdgif               1.07                 1.07
Wrbmp               1.15                 1.15
Wrppm               1.07                 1.07
Correct             1.32                 1.32
Dump                1.35                 1.35
Hash                1.15                 1.15
Gsmdecode           1.66                 2.13
Gsmencode           1.58                 2.07
Xgets               1.33                 1.39
Table 2. Impact of Hardware and Software on Parallelism

Name of Benchmark   Hardware style model   Software style model
SPEC                7.0                    7.0
JetBench            5.0                    5.0
CloudSuite          6.0                    6.0
Bitarray            5.0                    4.5
Bitcnt              5.5                    5.0
Cjpeg               6.5                    6.0
Jcapistd            6.0                    6.5
Rdbmp               5.0                    4.0
Rdgif               5.5                    5.0
Wrbmp               5.0                    4.5
Wrppm               11.0                   13.5
Correct             5.0                    5.0
Dump                6.0                    5.0
Fpppp               19.0                   27.0
Gsmdecode           7.0                    8.0
Gsmencode           5.0                    5.0
Tomcat              21.0                   44.0
those of the Intel Itanium. We compare the DVLIW processor with two or four cores to a multicluster VLIW machine with a centralized control path. The compiler employed hyperblock region formation. Table 1 shows the speedup of ILP execution on the two- and four-core systems. The average speedup measured for the 2-core system was 1.24 and for the 4-core system 1.30. The speedup achieved in our experiments is closely related to the amount of ILP in the benchmarks that the ILP compiler can exploit. Benchmarks like gsmdecode and gsmencode, which expose high ILP, achieve high speedup, while benchmarks such as CloudSuite and bitcnt show low ILP. To achieve ILP, there must be no dependences among the instructions executing in parallel. Taking the selected 17 benchmarks, we compared the parallelism achieved by hardware- and software-oriented models. For the hardware model we considered zero-conflict branch and jump prediction, and for the software style model we considered static branch and jump prediction. Table 2 summarizes the experiments. The comparison of the averages shows better results for the software style model, indicating more scope for ILP exploitation at the software level. The average speedup for the hardware style system was measured at 7.68 and for the software style system at 9.47.
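The averages quoted above can be reproduced directly from Table 1; a quick arithmetic-mean check over the 17 benchmarks:

```python
# Speedup columns transcribed from Table 1.
two_core = [1.30, 1.10, 1.01, 1.27, 1.03, 1.50, 1.05, 1.06, 1.07,
            1.15, 1.07, 1.32, 1.35, 1.15, 1.66, 1.58, 1.33]
four_core = [1.45, 1.05, 1.01, 1.27, 1.03, 1.50, 1.05, 1.06, 1.07,
             1.15, 1.07, 1.32, 1.35, 1.15, 2.13, 2.07, 1.39]

avg2 = sum(two_core) / len(two_core)    # ≈ 1.24
avg4 = sum(four_core) / len(four_core)  # ≈ 1.30
```

The gap between the columns is concentrated in the high-ILP benchmarks (gsmdecode, gsmencode); most others see no additional benefit from the third and fourth cores.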
6   Conclusions
In our experiments we noticed that some benchmarks suffer a slight slowdown to expose ILP. This is due to the EBR operation inserted by the ILP compiler to maintain the correct control flow.
It increases pressure on the I-cache and causes more I-cache misses. The ILP compiler is not aware of this phenomenon and can thus slow down execution. The results show that the VLIW architecture provides the mechanism for a multi-core system to let an existing ILP compiler exploit ILP in applications. We applied the Bottom-Up Greedy (BUG) algorithm for partitioning the operations among multiple cores. The ILP compiler ensured the control flow in the multiple cores for synchronization and operation insertion. The experiments conducted for hardware and software style models showed that much of the scope for ILP exploitation is at the compiler level.
References

1. Carr, S.: Combining Optimization for Cache and Instruction Level Parallelism. In: Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (1996)
2. Pnevmatikatos, D.N., Franklin, M., Sohi, G.S.: Control Flow Prediction for Dynamic ILP Processors. In: Proceedings of the 26th Annual International Symposium on Microarchitecture, pp. 153–163 (1993)
3. Lo, J., Eggers, S.: Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism. In: Proceedings of the Conference on Programming Language Design and Implementation (1995)
4. Zhong, H., Lieberman, S.A., Mahlke, S.A.: Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications. In: IEEE 13th International Symposium on High Performance Computer Architecture, pp. 25–36 (2007)
5. Postiff, M.A., Greene, D.A., Tyson, G.S., Mudge, T.N.: The Limits of Instruction Level Parallelism in SPEC95 Applications. Advanced Computer Architecture Lab (2000)
6. Hwu, W.-M.W., Mahlke, S.A., Chen, W.Y., Chang, P.P.: The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing 7, 227–248 (1993)
7. Kumar, R., Singh, P.K.: A Modern Parallel Register Sharing Architecture for Code Compilation. International Journal of Computer Applications 1(16) (2010)
8. Kumar, R., Singh, P.K.: Role of Multiblocks in Control Flow Prediction using Parallel Register Sharing Architecture. International Journal of Computer Applications 4(4), 28–31 (2010)
9. Kumar, R., Singh, P.K.: Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. Journal on Computer Science and Engineering 2(4), 1179–1183 (2010)
10. Kumar, R., Saxena, A., Singh, P.K.: A Novel Heuristic for Selection of Hyperblock in If-Conversion. In: 2011 3rd International Conference on Electronics Computer Technology, pp. 232–235 (2011)
11. August, D.I., Hwu, W.-M.W., Mahlke, S.A.: A Framework for Balancing Control Flow and Predication. In: Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 92–103 (1997)
12. Pai, V.S., Ranganathan, P., Abdel-Shafi, H., Adve, S.: The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors. IEEE Transactions on Computers 48(2), 218–226 (1999)
13. Zhong, H.: Architectural and Compiler Mechanisms for Accelerating Single Thread Applications on Multicore Processors. PhD thesis, The University of Michigan (2008)
14. http://www.trimaran.org