VL-PO1
Code generation for embedded processors with complex instructions
Jone-Yeol Lee, Hyun-Dhong Yoon, Jin-Hyuk Yang, In-Cheol Park, and Chong-Min Kyung
Dept. of EE, Korea Advanced Institute of Science and Technology
Dept. of EE, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, Korea
E-Mail: [email protected], Fax: +82-42-866-0702
0-7803-5727-2/99/$10.00 © 1999 IEEE

Abstract

Code generation for embedded processors often encounters the problem of using complex instructions. The problem comes from the heterogeneous register architecture of embedded processors, the small number of registers, and instructions with complex behaviors. In this paper we propose some techniques for using complex instructions. One of them is a simple technique for using the MAC instruction (Modified Pattern Matching). The other two techniques are implemented in the Postpass Optimizer, which optimizes the generated code with hardware loop instructions and post-increment or post-decrement addressing modes. Experimental results are also presented.

1 Introduction

The increasing use of embedded software, often implemented on a core processor in a single-chip system, is a clear trend in telecommunications, multimedia, and consumer electronics. Compilation in embedded DSP processor development environments often encounters the problem of using complex instructions such as MAC and hardware loop instructions [6]. However, the code generation techniques employed in compilers for general microprocessors cannot use the complex instructions efficiently, and hence new techniques are needed. In this paper, two techniques that enable the use of complex instructions in the code generation of embedded processors are proposed.

2 Modified Pattern Matching

One of the techniques used in our MetaCore C Compiler (MCC) is modified pattern matching with operation reordering. The whole compilation process of MCC is divided into two steps. In the first step, which is a machine-independent process, MCC translates the source program into an intermediate representation (the RTL representation). In the second step, MCC translates the RTL representation into assembly code. The two major jobs in the second step are code selection and register allocation [1][2]. In code selection, the nodes of the dataflow graph represented in RTL are covered with patterns that correspond to instructions. An instruction pattern is described as a sequence of primitive operations such as addition, subtraction, and multiplication. For example, a MAC instruction is composed of a multiplication immediately followed by an addition that depends on the multiplication. In MCC, the proposed modified pattern matching technique is used instead of traditional pattern matching.

It was shown that the problem of generating optimal coverings, i.e., pattern matching, is NP-complete [3]. Furthermore, the result of pattern matching is highly dependent on the coding style of the
source program. In an FIR filter example from DSPstone [4], shown in Fig. 1 (a), operation 6 "p = ph[ph_index] * px[px_index]" and operation 11 "y = y + p" together correspond to a MAC instruction. In the original intermediate representation in Fig. 1 (c), the multiplication and the addition are separated from each other. Hence, MCC cannot directly match the operations to the MAC instruction pattern in Fig. 1 (b). As a result, the multiplication and the addition are individually mapped to MUL and ADD instructions rather than to a MAC instruction.

To reduce the effect of coding style and enhance the quality of code selection, node reordering is used. In Fig. 1 (a), operation 11 does not depend on any operation between operation 6 and operation 11, so operation 11 can be moved next to operation 6 such that the two operations can be mapped to a MAC instruction. To reorder the original RTL description, MCC performs ASAP (As Soon As Possible) scheduling [5] for all operations in the intermediate representation. From the ASAP-scheduled code, MCC can efficiently find the target operation to be reordered, since the data dependency between operations can be easily recognized and the search depth is limited to the depth of the target instruction pattern graph. The depth of an instruction pattern graph is defined as the length of the longest path from a primary input to a primary output of the instruction pattern. Using the ASAP-scheduled code in Fig. 1 (d), a MAC pattern can be found by searching two consecutive control steps for a multiplication followed by an addition, because the depth of the MAC instruction pattern graph is two. In the ASAP-scheduled code, "(set p (mul T1, T2))" and "(set y (add y, p))" are scheduled in consecutive control steps and can be mapped to a MAC instruction. Then MCC reorders the RTL description such that the two operations are located consecutively, as shown in Fig. 1 (e). Finally, MCC performs the pattern matching between the reordered RTL and the instruction patterns. The final code containing a MAC operation is shown in Fig. 1 (f).

(a) Source program

    1: i = 0;
    2: ph_index = L - 1;
    3: px_index = L - 1;
    4: px2_index = L - 2;
    5: for (i = 0; i < L; i++) {
    6: p = ph[ph_index] * px[px_index];
    7: ph_index = ph_index - 1;
    8: ph[ph_index] = px2[px2_index];
    9: px_index = px_index - 1;
    10: px2_index = px2_index - 1;
    11: y = y + p;
    12: }

(b) MAC instruction pattern

    (set R3 (mul R1, R2));
    (set RA (add RA, R3));

(c) RTL representation for loop body of source code (a)

    (set T1 (ref (ph+ph_index)));
    (set T2 (ref (px+px_index)));
    (set p (mul T1, T2));
    (set ph_index (sub ph_index, 1));
    (set T3 (ref (px2+px2_index)));
    (set (ref (ph+ph_index)), T3);
    (set px_index (sub px_index, 1));
    (set px2_index (sub px2_index, 1));
    (set y (add y, p));

(d) ASAP schedule of (c)

    (set T1 (ref (ph+ph_index)));
    (set T2 (ref (px+px_index)));
    (set p (mul T1, T2)); /* other codes */
    (set y (add y, p)); /* other codes */

(e) Reordering of (d)

    (set T1 (ref (ph+ph_index)));
    (set T2 (ref (px+px_index)));
    (set p (mul T1, T2));
    (set y (add y, p));
    (set ph_index (sub ph_index, 1));
    (set T3 (ref (px2+px2_index)));
    (set (ref (ph+ph_index)), T3);
    (set px_index (sub px_index, 1));
    (set px2_index (sub px2_index, 1));

(f) Final code

    (set T1 (ref (ph+ph_index)));
    (set T2 (ref (px+px_index)));
    (set y (mac T1, T2));
    (set ph_index (sub ph_index, 1));
    (set T3 (ref (px2+px2_index)));
    (set (ref (ph+ph_index)), T3);
    (set px_index (sub px_index, 1));
    (set px2_index (sub px2_index, 1));

Fig. 1. The procedure of the code selection performed by MCC. To enhance the quality of the code selection, a code reordering technique is used.
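The reordering step of Section 2 can be illustrated with a minimal Python sketch. It is not the MCC implementation: the `Op` record, the function names, and the `mem_*` tokens that stand in for array contents are all hypothetical, and the sketch assumes that every hazard (read after write, write after read, write after write) forces the later operation into a later control step. It also assumes that the multiplication result is not needed after the fusion, which the paper's final code suggests for `p`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str     # "ref", "mul", "add", "sub" or "store"
    dest: str     # location written by the operation
    srcs: tuple   # locations read by the operation


def depends(earlier, later):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW hazard)."""
    return (earlier.dest in later.srcs      # read after write
            or later.dest in earlier.srcs   # write after read
            or earlier.dest == later.dest)  # write after write


def asap_schedule(ops):
    """Give each operation the earliest control step its dependencies allow."""
    steps = []
    for i, op in enumerate(ops):
        preds = [steps[j] for j in range(i) if depends(ops[j], op)]
        steps.append(max(preds) + 1 if preds else 0)
    return steps


def find_mac(ops, steps):
    """Return (mul_index, add_index) of the first MAC candidate, or None.

    A candidate is a multiplication whose result feeds an addition scheduled
    in the next control step, where the addition does not depend on any
    operation lying between the two in the original order.
    """
    for m, mul in enumerate(ops):
        if mul.kind != "mul":
            continue
        for a in range(m + 1, len(ops)):
            add = ops[a]
            if (add.kind == "add" and mul.dest in add.srcs
                    and steps[a] == steps[m] + 1
                    and not any(depends(ops[k], add) for k in range(m + 1, a))):
                return m, a
    return None


def reorder(ops, pair):
    """Move the addition directly behind the multiplication."""
    m, a = pair
    moved = ops[:a] + ops[a + 1:]
    moved.insert(m + 1, ops[a])
    return moved


# Loop body of Fig. 1 (c); "mem_ph" etc. model the array contents (assumption).
LOOP_BODY = [
    Op("ref", "T1", ("ph", "ph_index", "mem_ph")),
    Op("ref", "T2", ("px", "px_index", "mem_px")),
    Op("mul", "p", ("T1", "T2")),
    Op("sub", "ph_index", ("ph_index",)),
    Op("ref", "T3", ("px2", "px2_index", "mem_px2")),
    Op("store", "mem_ph", ("ph", "ph_index", "T3")),
    Op("sub", "px_index", ("px_index",)),
    Op("sub", "px2_index", ("px2_index",)),
    Op("add", "y", ("y", "p")),
]

steps = asap_schedule(LOOP_BODY)
pair = find_mac(LOOP_BODY, steps)
reordered = reorder(LOOP_BODY, pair) if pair else LOOP_BODY
print([op.kind for op in reordered])
```

In this sketch the multiplication and the addition end up adjacent, as in Fig. 1 (e), so an ordinary pattern matcher can then cover them with the MAC pattern of Fig. 1 (b).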
3 The Postpass Optimizer

The other technique used in MCC is postpass optimization, in which the assembly code is optimized. As shown in Fig. 2, the assembly code is input to the Postpass Optimizer. In the Postpass Optimizer, the code is analyzed and transformed to contain special complex instructions.

Fig. 2. The MCC compilation procedure is shown.

Currently, hardware loop instructions and memory operands with post-increment or post-decrement addressing modes are used to optimize the input assembly code in the postpass optimizer. In Fig. 3, a hardware loop optimization example is shown. The postpass optimizer first searches the code for candidate loops whose iteration count can be determined at compile time. For each loop, the loop condition variables used in the exit condition test are identified. When the loop condition variable of a loop is incremented or decremented by the same amount at each iteration, the number of iterations of the loop is calculated from the initial value and the final value of the variable. Finally, the instructions that set the repetition counter (RC) to the calculated value are inserted, and the compare and branch instructions are replaced by a hardware loop instruction.

(a) Before transformation (b) After transformation

Fig. 3. An example of hardware loop optimization. Before the for instruction, the repetition counter (RC) must be set to the iteration count.

In most DSPs, memory access is completed in one cycle and instructions can have both memory and register operands. In MCC-generated code, most operands are registers and initially all data are in memory, so additional instructions are needed for data movement from memory to operand registers and for the increment or decrement of addresses to the next data. The additional instructions can be removed when memory operands with post-increment or post-decrement addressing modes are used.

(a) Cycle times (b) Code size

Fig. 4. Cycle times and code size of various benchmarks from DSPstone. The results are compared with the DSP56K compiler because MCC and the DSP56K compiler are both based on GCC. The results of the DSP56K compiler are from the DSPstone benchmarks.

4 Experiments

In Fig. 4, some experimental results are shown. The code size and cycle times are compared with those of the DSP56K compiler. In most cases, our MCC outperforms the DSP56K compiler. This result shows that the proposed techniques are quite effective, since both MCC and the DSP56K compiler are based on GCC and the result of MCC without the postpass optimizer is comparable to that of the DSP56K compiler.
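The iteration-count calculation that the postpass optimizer performs for hardware loops (Section 3) can be sketched as follows. This is an illustrative sketch only, not the MCC code: the function name `trip_count` and its parameters are hypothetical, and it assumes a loop of the form `for (v = initial; v < bound; v += step)` for a positive step, or with `v > bound` as the continuation test for a negative step.

```python
def trip_count(initial, bound, step):
    """Number of iterations of a counted loop with a constant step.

    Assumes the loop continues while v < bound (step > 0)
    or while v > bound (step < 0); both forms are assumptions.
    """
    if step == 0:
        raise ValueError("loop variable must change by a non-zero constant")
    # Distance the loop variable has to travel before the exit test fails.
    distance = bound - initial if step > 0 else initial - bound
    if distance <= 0:
        return 0  # the loop body is never entered
    # Ceiling division: a partial last stride still costs one iteration.
    return -(-distance // abs(step))


# The FIR loop of Fig. 1 (a) runs from i = 0 while i < L in steps of 1.
L = 16  # example filter length, chosen only for illustration
rc_value = trip_count(0, L, 1)
print(rc_value)
```

The returned value is what would be loaded into the repetition counter (RC) before the hardware loop instruction. How a target handles a count of zero is not stated in the paper, so a real optimizer would need to treat that case separately.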
5 Conclusions

We proposed some simple techniques for code generation with complex instructions. One of them is a modified pattern matching technique in which ASAP scheduling is used to reveal the dependencies between instructions. The other two techniques are implemented in the Postpass Optimizer and use hardware loop instructions and post-increment (post-decrement) addressing modes. Experimental results show that our techniques are effective.
References

[1] R. M. Stallman, Using and Porting GNU CC for Version 2.6, Free Software Foundation Inc., Sep. 1996.
[2] A. V. Aho, M. R. Sethi, and J. D. Ullman, Compilers - Principles, Techniques, and Tools, Addison-Wesley, 1986.
[3] A. V. Aho and S. C. Johnson, "Optimal code generation for expression trees", J. of the ACM, Vol. 23, No. 3, July 1976.
[4] V. Zivojnovic, J. Martinez, C. Schlager, and H. Meyr, "DSPstone: A DSP-Oriented Benchmarking Methodology", in Proc. ICSPAT'94, Oct. 1994.
[5] G. D. Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.
[6] Phil Lapsley, et al., DSP Processor Fundamentals, IEEE Press, 1997.