A GCC-based Compiler for the Queue Register Processor (QRP-GCC)

Arquimedes Canedo, Ben Abderazek, Masahiro Sowa
Graduate School of Information Systems
University of Electro-Communications of Tokyo
{canedo, sowa, ben}@sowa.is.uec.ac.jp

ABSTRACT
Queue processors are a novel alternative to superscalar processors, designed for the execution of highly parallel programs. The queue processor uses a first-in-first-out data structure for expression evaluation; therefore, the instruction set has no register references. This research presents a new queue compiler, named QRP-GCC, for a Parallel Queue Processor enhanced with a set of random access registers. QRP-GCC uses the GCC infrastructure for queue program generation. GCC was developed for conventional register architectures: its intermediate representation has register references, and expression evaluation is done through a depth-first pre-order traversal over the abstract syntax trees (ASTs). The queue compiler must instead perform expression evaluation and code generation using a breadth-first traversal over the ASTs. We also propose a new algorithm to extract maximum parallelism within basic blocks by merging expression ASTs. The QRP-GCC compiler has been successfully completed. Static evaluation of the QRP-GCC compiler is made in terms of parallelism, program size, and execution time.

Keywords: Optimizing Compiler, ILP Extraction, Queue Processor
1 INTRODUCTION
Queue computers were introduced several decades ago as an alternative to random access register machines and stack machines. The main goal of the queue processor is the exploitation of instruction level parallelism (ILP) in scientific and multimedia programs. The queue processor has been developed using a simple hardware configuration that makes it suitable as an embedded processor. Different types of queue computers have been presented and evaluated in our earlier work [1, 2, 3, 4]. To overcome the limitations imposed by the plain queue computer hardware (a Produced and Consumed Model), we developed a new type of parallel queue computer that effectively combines the use of the queue register with a set of random access registers.

To take advantage of parallel queue hardware, the compiler must schedule instructions in a particular order imposed by the rules of the queue computational model. A highly optimizing compiler is crucial for achieving high performance in modern microprocessors. In our previous research [1] we presented a parallel queue processor (PQP). The custom compiler written for the PQP processor was the first compiler tailored for queue processors, and its implementation provides the starting point for the development of new compiler algorithms targeting all flavors of queue processors. A conventional compiler is not suitable for the queue processor because of the fundamental differences between the computation models. We have therefore developed compiler techniques for queue computation. In this paper we present an optimizing compiler for the QRP microprocessor.

The organization of this paper is as follows: an overview of the steps required to generate programs for parallel queue processors is given in Section 2. The details of the implementation of the QRP-GCC compiler are provided in Section 3. Section 4 compares the QRP-GCC compiler with the PQP-GCC compiler. Section 5 concludes.
2 QUEUE PROGRAMS GENERATION
The first-in-first-out (FIFO) structure used by queue computers to perform operations demands a special order of execution for operands and operations [3]. This order is essential to keep expressions and programs correct, since operands are taken from the head of the queue and the results of operations are stored back at the tail of the queue. The mechanism that transforms a given expression into a suitable queue program is named the Queue Computation Model. The general rule for converting any given program into a queue program is to analyze the program's data flow graph (DFG) and make a breadth-first traversal over its elements. Figure 1 shows how the data flow graph of the expression x = (a + b)/(c − d) is traversed in breadth-first order to obtain a correct queue program. The queue computational model assures correct evaluation of any given expression.
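[Figure 1 diagram: data flow graph of x = (a + b)/(c − d), with operands a, b, c, d, operations +, - and /, and result x, traversed from "start" to "finish". Queue Program: ld a, ld b, ld c, ld d, add, sub, div, st x]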
Figure 1: Breadth-first traversal of the expression x = (a + b)/(c − d)

The Queue Computation Model is the fundamental idea behind the Parallel Queue Processor (PQP) and the Queue Register Processor (QRP). In our earlier research on PQP we found some situations where the queue computation model creates long programs [4]. In the QRP processor we have implemented a Produced Order Queue Computation Model, which is an improvement over our earlier PQP architecture. The Produced Order Queue Computation Model addresses the limitations of the plain Queue Computation Model.
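The breadth-first rule illustrated in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration and not the QRP-GCC implementation: the names `Node`, `queue_program` and `run` are hypothetical, the `mul` mnemonic is assumed (only ld, st, add, sub and div appear in Figure 1), and the sketch handles expression trees only, not general data flow graphs with shared nodes. It emits the tree level by level, deepest level first, and then executes the result on a plain FIFO to check that the expression is evaluated correctly.

```python
from collections import deque
import operator

class Node:
    """Expression-tree node: an operator with children, or a leaf variable."""
    def __init__(self, op, *children):
        self.op = op
        self.children = children

# Assumed mnemonics; "mul" does not appear in the source figure.
MNEMONIC = {"+": "add", "-": "sub", "*": "mul", "/": "div"}
SEMANTICS = {"add": operator.add, "sub": operator.sub,
             "mul": operator.mul, "div": operator.truediv}

def queue_program(target, root):
    """Emit instructions level by level, deepest level first, left to right."""
    levels, frontier = [], [root]
    while frontier:                       # breadth-first: one level per iteration
        levels.append(frontier)
        frontier = [c for n in frontier for c in n.children]
    program = []
    for level in reversed(levels):        # operands are emitted before consumers
        for n in level:
            program.append(MNEMONIC[n.op] if n.children else "ld " + n.op)
    program.append("st " + target)
    return program

def run(program, memory):
    """Execute a queue program: read at the head, write at the tail."""
    queue = deque()
    for instruction in program:
        name, _, arg = instruction.partition(" ")
        if name == "ld":
            queue.append(memory[arg])
        elif name == "st":
            memory[arg] = queue.popleft()
        else:
            left, right = queue.popleft(), queue.popleft()
            queue.append(SEMANTICS[name](left, right))
    return memory

# x = (a + b) / (c - d)
tree = Node("/", Node("+", Node("a"), Node("b")), Node("-", Node("c"), Node("d")))
program = queue_program("x", tree)
result = run(program, {"a": 6, "b": 2, "c": 5, "d": 1})
```

For this tree, `program` is the instruction sequence listed in Figure 1, and no instruction names a register: every operand location is implied by the queue order.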
3 IMPLEMENTATION DETAILS OF QRP-GCC COMPILER
The PQP-GCC and QRP-GCC compilers we have developed follow the standard porting mechanism recommended by the GCC documentation [8, 9]. The QRP-GCC compiler consists of three files: the machine description file, qrp.md; the main file implementing QRP-GCC functionality, qrp.c; and the header file for QRP-GCC, qrp.h. The machine description file (MD file) contains a detailed description of the QRP hardware that is used by GCC to generate the proper instructions. The MD file for the QRP-GCC compiler defines an abstract virtual QRP processor that includes the QRP instruction set architecture (ISA), the memory model, the supported data types, the Application Binary Interface (ABI), the procedure calling conventions, and the classes and number of QRP registers.

QRP-GCC defines 1040 registers: 16 normal random access registers and 1024 queue registers. The 16 random access registers have a special role in the compiler. They include the frame pointer, the stack pointer, the registers for argument passing, the return value register, and registers with other special functions. The 1024 queue registers are defined as normal random access registers. GCC uses these queue registers to perform register allocation as if the target were a random access register processor. QRP-GCC has the task of converting these registers into real queue registers.

GCC first takes the source program, and the corresponding front end transforms it into abstract syntax trees (ASTs). The abstract syntax trees are then transformed into a lower-level, machine-independent representation called register transfer language (RTL). Most optimizations performed in GCC 3.3.3 are done at the RTL level. The RTL is then transformed into a machine-dependent optimized RTL under the constraints of the target architecture described in the machine description file. As Figure 2 shows, the QRP-GCC implementation runs after the machine-dependent optimized RTL pass has been completed. The bold pass is where the QRP-dependent reorganization of RTL instructions and the code generation are accomplished; from now on this pass will be referred to as the QRP pass.
[Figure 2 diagram: C Source, Trees, RTL, Optimized RTL, QRP Reorganization, ASM. The GCC Core stages use the QRP Machine Description File; QRP Reorganization is the QRP-GCC Implementation.]
Figure 2: QRP-GCC Implementation

The QRP pass relies on four sub-phases to transform the RTL instruction stream into a suitable queue program, as depicted in Figure 3: reading and decoding, reordering, fixing, and code generation. The QRP pass is responsible for scanning all basic blocks inside functions; for each basic block it reads, analyzes, and transforms the instructions, and generates assembly code as the final stage.

The Reading and Decoding sub-phase reads each instruction within a basic block and extracts all the information needed for code generation. The RTL representation is transformed into Internal QRP Trees, and the QRP-GCC internal data structures are filled in this sub-phase. Once the Reading and Decoding sub-phase is completed, all following sub-phases use the Internal QRP Trees instead of the RTL representation to perform any needed transformation or information gathering.

In the Reordering sub-phase the instructions are analyzed for data dependences and reordered according to the queue computational model to achieve high parallelism. The Internal QRP Trees are transformed into Internal QRP Parallel Trees. The dependence testing checks for true dependences, antidependences, output dependences, and dependences introduced by the target queue architecture [6, 5]. This sub-phase contains the core algorithm for transforming a program to the queue model.

The Fixing sub-phase adapts those instructions that are restricted by the hardware properties of the queue processor. It takes the Internal QRP Parallel Trees and generates a QRP Tree, a representation of the whole basic block that has been compiled.

The final assembly code is generated by the last sub-phase, the Code Generation sub-phase, which produces the assembly code from the QRP Tree.
[Figure 3 diagram: Optimized RTL, Reading and Decoding, Internal QRP Tree, Reordering, Internal QRP Parallel Tree, Fixing, QRP Tree, Code Generation, QRP Assembly Program]
Figure 3: QRP Pass Sub-phases
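The three classical dependence tests named in the Reordering sub-phase can be sketched with read and write sets. This is a generic textbook formulation under stated assumptions, not the code in qrp.c: the `Stmt` tuple and the `dependences` function are hypothetical names, the example statements are the ones used in the basic block of Figure 4, and the dependences introduced by the target queue architecture are not modeled.

```python
from collections import namedtuple

# Hypothetical statement summary: the variables a statement writes and reads.
Stmt = namedtuple("Stmt", "name writes reads")

def dependences(earlier, later):
    """Return the kinds of dependence that force `later` to follow `earlier`."""
    kinds = set()
    if earlier.writes & later.reads:
        kinds.add("true")      # read after write
    if earlier.reads & later.writes:
        kinds.add("anti")      # write after read
    if earlier.writes & later.writes:
        kinds.add("output")    # write after write
    return kinds

s1 = Stmt("S1", {"a"}, {"c", "d"})   # a = c + d
s2 = Stmt("S2", {"b"}, {"d", "e"})   # b = d - e
s3 = Stmt("S3", {"d"}, {"a", "b"})   # d = a * b / 3

independent = dependences(s1, s2)    # empty set: S1 and S2 may run in parallel
ordered = dependences(s1, s3)        # S3 reads a (true) and overwrites d (anti)
```

An empty result means the two statements can be reordered or overlapped freely; any non-empty result constrains the later statement to stay behind the earlier one.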
3.1 Extraction of ILP
The extraction of instruction level parallelism (ILP) is accomplished by traversing the expression trees in a breadth-first manner. The GCC infrastructure used in our compiler generates an independent tree for each expression. A breadth-first traversal over each of the expression trees guarantees the correctness of the program. Nevertheless, higher parallelism can be extracted by merging different expression trees into the same depth levels. Figure 4(a) shows the original trees given by the GCC core. Trees are merged using Boytchev's algorithm [6], which looks for data-independent operands from top to bottom across all statements of a basic block. Figure 4(b) illustrates the case in which two data-independent statements are merged into the same levels to extract maximum parallelism. Statements S1 and S2 are merged together in Level 2. All operands and operations at the same level can be executed in parallel without changing the meaning of the computation. Whenever a data dependence is found between two operands, the dependent operand is merged to level n + 1, where n is the level at which the data dependence is found. In the given example, statement S3 depends on S2 and S1. From the previous tree merging, statements S1 and S2 have been merged to Level 2. Figure 4(b) shows that statement S3 cannot be merged beyond Level 3, since it depends on S2 and S1.
[Figure 4 diagram: expression trees for the Basic Block S1: a = c + d; S2: b = d - e; S3: d = a * b / 3; drawn against depth levels labeled Level 0 to Level 9]
(a) Independent trees for expressions in the Basic Block
(b) Merged Trees
Figure 4: Tree Merging Algorithm
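The level placement described for Figure 4 can be approximated at statement granularity. The sketch below is a simplification and not Boytchev's algorithm [6], which works on individual operands: `Stmt`, `depends` and `merge_levels` are hypothetical names, and the number of levels charged to each statement (one for the loads, one per layer of operations, one for the store) is an assumption made for illustration. A statement with no dependence on earlier statements starts at Level 0; a dependent statement starts at level n + 1, where n is the last level of the statements it depends on.

```python
from collections import namedtuple

# levels: assumed number of tree levels the statement occupies
# (loads, each layer of operations, and the final store).
Stmt = namedtuple("Stmt", "name writes reads levels")

def depends(earlier, later):
    """True if any true, anti, or output dependence orders the two statements."""
    return bool(earlier.writes & later.reads
                or earlier.reads & later.writes
                or earlier.writes & later.writes)

def merge_levels(block):
    """Map each statement name to the (first, last) level it occupies."""
    placed = {}
    for i, stmt in enumerate(block):
        first = 0
        for prev in block[:i]:
            if depends(prev, stmt):
                # Start one level below the statement this one depends on.
                first = max(first, placed[prev.name][1] + 1)
        placed[stmt.name] = (first, first + stmt.levels - 1)
    return placed

block = [
    Stmt("S1", {"a"}, {"c", "d"}, 3),   # a = c + d
    Stmt("S2", {"b"}, {"d", "e"}, 3),   # b = d - e
    Stmt("S3", {"d"}, {"a", "b"}, 4),   # d = a * b / 3
]
placement = merge_levels(block)
merged_depth = 1 + max(last for _, last in placement.values())
unmerged_depth = sum(stmt.levels for stmt in block)
```

Under these assumptions S1 and S2 share Levels 0 to 2 and S3 begins at Level 3, so the merged block needs fewer levels than the three trees laid end to end.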
4 QRP-GCC AND PQP-GCC
The QRP-GCC compiler presents several improvements over the PQP-GCC compiler. In the PQP-GCC compiler all instructions are generated from the instruction templates described in the machine description file, leading to complex code and obscure transformations. QRP-GCC relies on the assembler to perform the transformations from the virtual abstract QRP to the real QRP processor, which reduces the complexity of the compiler. Another significant difference is that the PQP-GCC compiler performs all transformations that depend on the queue computation model in GCC's internal RTL representation. QRP-GCC takes a different approach: it adds the Reading and Decoding sub-phase to fill its own data structures, and this implementation feature provides a simpler interface for the code transformation.

A set of small test programs has been used for testing specific features of the compiler. Figure 5 shows the number of instructions compiled with the PQP-GCC compiler and the QRP-GCC compiler without optimizations on a set of six test programs. For all compiled programs, the number of instructions generated by the QRP-GCC compiler is less than the number of instructions generated by the PQP-GCC compiler. The QRP-GCC compiler generates 25% shorter programs than PQP-GCC when optimizations are disabled.

For the same set of test programs the maximum optimization level available has been activated (-O3). Figure 6 shows the number of instructions generated with the PQP-GCC compiler and with QRP-GCC, respectively. QRP-GCC generates 40% shorter programs when the maximum optimization level is enabled.

The reduction in the number of instructions generated by the QRP-GCC compiler compared to the PQP-GCC compiler is explained by the architectural differences between the two processors. The QRP processor provides a flexible instruction set and a flexible architecture composed of the queue registers and random access registers. The PQP processor, on the other hand, gives less freedom to specify operand locations, which makes instruction selection and scheduling awkward.
[Figure 5 chart: number of instructions per test program for PQP-GCC and QRP-GCC; programs shown: t1float.c, t1func.c, t1indy.c, t1nested.c, t1shift.c, t1switch.c, t1xor.c]
Figure 5: Number of instructions compiled by PQP-GCC and QRP-GCC with optimizations disabled.
[Figure 6 chart: number of instructions per test program for PQP-GCC and QRP-GCC; programs shown: t1float.c, t1func.c, t1indy.c, t1nested.c, t1shift.c, t1switch.c, t1xor.c]
Figure 6: Number of instructions compiled by PQP-GCC and QRP-GCC with maximum optimization level enabled (-O3).

Using the same GCC compiler (GCC 3.3.3) on which QRP-GCC is based, we compiled simple programs for the QRP and Sparc 64 architectures. There were two things we wanted to compare: (1) the number of assembly instructions generated by GCC for these two architectures, and (2) the text segment size in bytes. Figure 7(a) shows the length of the compiled programs for the Sparc 64 and QRP processors, respectively. Figure 7(b) shows the text segment size of the compiled programs for the same architectures. On average, for the seven compiled programs, GCC 3.3.3 generates programs for QRP that are longer by a factor of 1.56, as shown in Figure 7(a). The text segment for QRP is smaller, at 0.78 times the size of the Sparc 64 text segment. The QRP-GCC compiler generates longer assembly programs because the QRP hardware demands special operations for branch instructions and procedure calls, for which we currently use a simple temporary method. After the assembly file is assembled and the final object code is generated, QRP-GCC produces smaller programs because the QRP processor has a 16-bit instruction set architecture.
[Figure 7 charts: (a) Program Length (number of instructions) and (b) Text Segment Size (bytes) for Sparc 64 and QRP; programs shown: args.s, calc_pi.s, expr.s, fork.s, if.s, loops.s, while.s, and Total]
Figure 7: Set of programs compiled by GCC 3.3.3 compiler for Sparc 64 and QRP architectures.
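The two factors reported for Figure 7 are consistent with each other under a simple model. Assuming that the text segment holds only instructions, that every QRP instruction occupies 16 bits (as stated above) and that every Sparc 64 instruction occupies 32 bits (the fixed SPARC instruction width, which the paper itself does not state), the ratio of text segment sizes follows from the ratio of instruction counts. The symbols N and T are introduced here for illustration only.

```latex
\[
\frac{T_{\mathrm{QRP}}}{T_{\mathrm{Sparc}}}
  = \frac{N_{\mathrm{QRP}} \times 16\ \text{bits}}{N_{\mathrm{Sparc}} \times 32\ \text{bits}}
  = 1.56 \times \frac{16}{32}
  = 0.78
\]
```

Here N denotes the number of instructions and T the text segment size. Under these assumptions, a QRP program that is 1.56 times longer in instructions still occupies only 0.78 times the bytes.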
5 CONCLUSIONS
This paper has described the implementation details and key concepts for compiling programs for the QRP processor and for queue processors in general. GCC was chosen because it is a freely available, retargetable compiler infrastructure that provides a full set of high-quality optimizations suitable for queue-based computers. The queue computation model has been explained and reviewed. Our research on parallel queue processors shows that an optimizing compiler capable of extracting ILP for the queue computation model is critical for preserving program semantics and achieving high performance. The QRP-GCC compiler has been compared with the PQP-GCC compiler and shows a significant improvement in the quality and size of the compiled code. Key implementation decisions in QRP-GCC simplify the compiler internals, making debugging, maintenance, and the embedding of new algorithms easier. The QRP processor is a suitable microarchitecture for embedded applications. It has been designed to exploit the maximum parallelism available in scientific and multimedia programs.
References
[1] Sowa M., Abderazek B., Yoshinaga T., Parallel Queue Processor Architecture Based on Produced Order Computation Model, The Journal of Supercomputing, June 2005, pp. 217-229.
[2] Abderazek B., Markovsij A., Sowa M., Queue Processor for Novel Queue Computing Paradigm Based on Produced Order Scheme, Proceedings of the HPC, IEEE, July 2004, pp. 169-177.
[3] Sowa M., Abderazek B., Nikolova K., Yoshinaga T., Proposal and Design of a Parallel Queue Processor Architecture (PQP), 14th IASTED Int. Conference on Parallel and Distributed Computing and Systems, USA, 2002, pp. 554-560.
[4] Okamoto S., Suzuki H., Maeda A., Sowa M., Design of a Superscalar Processor Based on Queue Machine Computation Model, IEEE Pacific Rim Conferences (PACRIM), August 22-24, 1999.
[5] Allen R., Kennedy K., Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002.
[6] Boytchev P., QRP-GCC a GCC-based C compiler for QRP, Technical Report, Graduate School of Information Systems, UEC, Japan, March 2005.
[7] Canedo A., Abderazek B., Sowa M., Yoshinaga T., A General Purpose Assembler for Queue Computers, 67th IPSJ Conference, Tokyo, March 2005, pp. 295-296.
[8] GNU Compiler Collection (GCC) Internals, http://gcc.gnu.org/onlinedocs/gccint
[9] Using and Porting the GNU Compiler Collection (GCC), http://gcc.gnu.org/onlinedocs/gcc-2.953/gcc.html