
Code generation for embedded processors with complex instructions

Jong-Yeol Lee, Hyun-Dhong Yoon, Jin-Hyuk Yang, In-Cheol Park, and Chong-Min Kyung
Dept. of EE, Korea Advanced Institute of Science and Technology (KAIST), 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, Korea
E-mail: [email protected], Fax: +82-42-866-0702

Abstract

Code generation for embedded processors often encounters the problem of using complex instructions. The problems come from the heterogeneous register architecture of the embedded processors, the small number of registers, and instructions with complex behaviors. In this paper we propose some techniques for using complex instructions. One of them is a simple technique to use the MAC instruction (Modified Pattern Matching). The other two techniques are implemented in the Postpass Optimizer, which optimizes the generated code with hardware loop instructions and post-increment or post-decrement addressing modes. Experimental results are also presented.

1 Introduction

The increasing use of embedded software, often implemented on a core processor in a single-chip system, is a clear trend in telecommunications, multimedia, and consumer electronics. Compilation in embedded DSP processor development environments often encounters the problem of using complex instructions such as MAC and hardware loop instructions [6]. However, the code generation techniques employed in compilers for general microprocessors cannot efficiently use the complex instructions, and hence new techniques are needed. In this paper, two techniques that enable the use of complex instructions in the code generation of embedded processors are proposed.

2 Modified Pattern Matching

One of the techniques used in our MetaCore C Compiler (MCC) is the modified pattern matching with operation reordering. The whole compilation process of MCC is divided into two steps. In the first step, which is a machine-independent process, MCC translates the source program into an intermediate representation (RTL representation). In the second step, MCC translates the RTL representation into assembly code. The two major jobs in the second step are code selection and register allocation [1][2]. In the code selection, nodes of the dataflow graph represented using the RTL are covered with patterns that correspond to instructions. An instruction pattern is described as a sequence of primitive operations such as addition, subtraction, and multiplication. For example, a MAC instruction is composed of a multiplication immediately followed by an addition that depends on the multiplication. In MCC, the proposed modified pattern matching technique is used instead of the traditional pattern matching.
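To make this concrete, the following is a minimal sketch in Python; the RTLOp class and the matches_mac check are illustrative names invented for the example, not MCC's actual data structures. It encodes the relation the code selector has to recognize: a multiplication whose result feeds an immediately following addition.

from dataclasses import dataclass
from typing import List

@dataclass
class RTLOp:
    """One RTL-level primitive operation, e.g. (set p (mul T1, T2))."""
    dest: str          # destination register or temporary
    opcode: str        # primitive operation: 'mul', 'add', 'sub', 'ref', ...
    srcs: List[str]    # source operands

def matches_mac(first: RTLOp, second: RTLOp) -> bool:
    """True if the two operations can be covered by a single MAC instruction:
    a multiplication immediately followed by an addition using its result."""
    return (first.opcode == "mul"
            and second.opcode == "add"
            and first.dest in second.srcs)

# The two operations of the FIR example that should become one MAC.
mul_op = RTLOp(dest="p", opcode="mul", srcs=["T1", "T2"])
add_op = RTLOp(dest="y", opcode="add", srcs=["y", "p"])
print(matches_mac(mul_op, add_op))   # True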

It was shown that the problem of generating optimal coverings, i.e., pattern matching, is NP-complete [3]. Furthermore, the result of pattern matching is highly dependent on the coding style of the source program.


In the FIR filter example from DSPstone [4], shown in Fig. 1(a), operation 6, "p = ph[ph_index] * px[px_index]", and operation 11, "y = y + p", together correspond to a MAC instruction. In the original intermediate representation in Fig. 1(c), however, the multiplication and the addition are separated from each other. Hence, MCC cannot directly match the operations to the MAC instruction pattern in Fig. 1(b). As a result, the multiplication and the addition are individually mapped to MUL and ADD instructions rather than to a MAC instruction.

To reduce the effect of coding style and enhance the quality of the code selection, node reordering is used. In Fig. 1(a), operation 11 does not depend on any operation between operation 6 and operation 11, so operation 11 can be moved next to operation 6 such that the two operations can be mapped to a MAC instruction. To reorder the original RTL description, MCC performs ASAP (As Soon As Possible) scheduling [5] for all operations in the intermediate representation. From the ASAP-scheduled code, MCC can efficiently find the target operation to be reordered, since the data dependency between operations can be easily recognized and the search depth is limited to the depth of the target instruction pattern graph. The depth of an instruction pattern graph is defined as the length of the longest path from a primary input to a primary output of the instruction pattern. Using the ASAP-scheduled code in Fig. 1(d), a MAC pattern can be found by searching two consecutive control steps for a multiplication followed by an addition, because the depth of the MAC instruction pattern graph is two. In the ASAP-scheduled code, "(set p (mul T1, T2))" and "(set y (add y, p))" are scheduled in consecutive control steps and can be mapped to a MAC instruction. Then MCC reorders the RTL description such that the two operations are consecutively located, as shown in Fig. 1(e). Finally, MCC performs the pattern matching between the reordered RTL and the instruction patterns. The final code containing a MAC operation is shown in Fig. 1(f).

[Fig. 1: (a) source program; (b) MAC instruction pattern, "(set R3 (mul R1, R2)); (set RA (add RA, R3))"; (c) RTL representation of the loop body of the source code in (a); (d) ASAP schedule of (c); (e) reordering of (d); (f) final code.]
Fig. 1. The procedure of the code selection performed by MCC. To enhance the quality of the code selection, a code reordering technique is used.
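The reordering step can be summarized by the sketch below. It is a simplified illustration in Python, not MCC's implementation; the tuple encoding of operations, the asap_levels and fuse_mac helpers, and the depth parameter are assumptions made for the example. ASAP control steps are derived from the data dependences, a multiplication followed by a dependent addition is searched for within the depth of the MAC pattern graph, and the pair is then fused into a single MAC.

# Each operation is (dest, opcode, [sources]); anti- and output dependences
# are ignored here, although a real reordering pass must respect them.

def asap_levels(ops):
    """Assign every operation an ASAP control step from its data dependences."""
    level = {}
    for i, (_, _, srcs) in enumerate(ops):
        preds = [level[j] for j, (d, _, _) in enumerate(ops[:i]) if d in srcs]
        level[i] = max(preds, default=-1) + 1
    return level

def fuse_mac(ops, depth=2):
    """Search `depth` consecutive control steps for a multiplication followed by
    a dependent addition, reorder the addition next to it, and emit one MAC."""
    level = asap_levels(ops)
    fused, used = [], set()
    for i, (dest, opc, srcs) in enumerate(ops):
        if i in used:
            continue
        if opc == "mul":
            for j in range(i + 1, len(ops)):
                d2, opc2, srcs2 = ops[j]
                if opc2 == "add" and dest in srcs2 and level[j] - level[i] < depth:
                    fused.append((d2, "mac", srcs))   # MAC replaces the mul/add pair
                    used.add(j)
                    break
            else:
                fused.append((dest, opc, srcs))
        else:
            fused.append((dest, opc, srcs))
    return fused

# Simplified loop body of the FIR example (compare Fig. 1(c) and Fig. 1(f)).
ops = [
    ("T1", "ref", ["ph", "ph_index"]),
    ("T2", "ref", ["px", "px_index"]),
    ("p",  "mul", ["T1", "T2"]),
    ("ph_index", "sub", ["ph_index", "1"]),
    ("px_index", "sub", ["px_index", "1"]),
    ("y",  "add", ["y", "p"]),
]
print(fuse_mac(ops))   # the mul/add pair collapses into ('y', 'mac', ['T1', 'T2'])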

3 The Postpass Optimizer

The other technique used in MCC is the postpass optimizing, in which the assembly code is optimized.


As shown in Fig. 2, the assembly code is fed to the Postpass Optimizer, where it is analyzed and transformed to contain special complex instructions.

Fig. 2. The MCC compilation procedure. Currently, hardware loop instructions and memory operands with post-increment or post-decrement addressing modes are used to optimize the input assembly code in the Postpass Optimizer.

In Fig. 3, a hardware loop optimization example is shown. The Postpass Optimizer first searches the code for candidate loops whose iteration count can be determined at compile time. For each loop, the loop condition variables used in the exit condition test are identified. When the loop condition variable of a loop is incremented or decremented by the same amount at each iteration, the number of iterations of the loop is calculated from the initial and final values of the variable. Finally, the instructions that set the repetition counter (RC) to the calculated value are inserted, and the compare and branch instructions are replaced by a hardware loop instruction.

[Fig. 3: (a) before transformation; (b) after transformation.]
Fig. 3. An example of hardware loop optimization. Before the loop is entered, the repetition counter (RC) must be set to the iteration count.
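A rough sketch of this transformation is given below. It is only an illustration under assumed DSP-style mnemonics (a "move #n, RC" that loads the repetition counter and a "do" instruction that opens a hardware loop); the trip_count and optimize_counted_loop helpers are invented for the example and do not reproduce MCC's actual pass.

import re

def trip_count(init, final, step):
    """Iteration count of a loop whose condition variable runs from init to
    final in constant steps; all three values must be known at compile time."""
    return abs(final - init) // abs(step)

def optimize_counted_loop(code, init, final, step):
    """Replace the explicit decrement/compare/branch of a counted loop
    (delimited by 'LOOP:' and a 'bne LOOP') with a hardware loop."""
    n = trip_count(init, final, step)
    out = []
    for line in code:
        if line.strip() == "LOOP:":
            out.append(f"    move #{n}, RC     ; set repetition counter")
            out.append("    do   LOOP_END     ; hardware loop over the body")
            continue
        # drop the software loop bookkeeping: counter update, compare, branch
        if re.search(r"\b(sub|cmp)\b.*\bA1\b", line) or re.search(r"\bbne\s+LOOP\b", line):
            continue
        out.append(line)
    out.append("LOOP_END:")
    return out

before = [
    "LOOP:",
    "    mac  X0, Y0, A      ; loop body",
    "    sub  #1, A1         ; decrement loop counter",
    "    cmp  #0, A1         ; test exit condition",
    "    bne  LOOP           ; branch back",
]
for line in optimize_counted_loop(before, init=15, final=0, step=-1):
    print(line)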

In most DSPs, a memory access completes in one cycle and instructions can take both memory and register operands. In MCC-generated code, most operands are registers while all data initially reside in memory, so additional instructions are needed to move data from memory to the operand registers and to increment or decrement the addresses to the next data. These additional instructions can be removed when memory operands with post-increment or post-decrement addressing modes are used.
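As a small illustration of this peephole rule (again a sketch: the mnemonics, register names, and the fold_post_update helper are assumed for the example, not taken from MCC), a register-indirect load followed by an explicit increment or decrement of the same address register is rewritten to a single load with a post-update memory operand.

import re

LOAD   = re.compile(r"move\s+\((?P<addr>R\d+)\),\s*(?P<dst>\w+)")
UPDATE = re.compile(r"(?P<op>add|sub)\s+#1,\s*(?P<addr>R\d+)")

def fold_post_update(code):
    """Fold 'load via (Rn)' + 'add/sub #1, Rn' into one post-update operand."""
    out, i = [], 0
    while i < len(code):
        if i + 1 < len(code):
            m_load = LOAD.search(code[i])
            m_upd = UPDATE.search(code[i + 1])
            if m_load and m_upd and m_load["addr"] == m_upd["addr"]:
                sign = "+" if m_upd["op"] == "add" else "-"
                out.append(f"    move ({m_load['addr']}){sign}, {m_load['dst']}")
                i += 2            # consume both instructions
                continue
        out.append(code[i])
        i += 1
    return out

before = [
    "    move (R1), X0       ; load operand",
    "    sub  #1, R1         ; step the address to the next data",
    "    move (R2), Y0",
    "    sub  #1, R2",
]
for line in fold_post_update(before):
    print(line)               # e.g. '    move (R1)-, X0'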

4 Experiments

In Fig. 4, some experimental results are shown. The code size and cycle times are compared with those of the DSP56K compiler. In most cases, our MCC outperforms the DSP56K compiler. This result shows that the proposed techniques are quite effective, since both MCC and the DSP56K compiler are based on GCC and the result of MCC without the Postpass Optimizer is comparable to that of the DSP56K compiler.

[Fig. 4: (a) cycle times; (b) code size.]
Fig. 4. Cycle times and code size of various benchmarks from DSPstone. The results are compared with the DSP56K compiler because both MCC and the DSP56K compiler are based on GCC. The results of the DSP56K compiler are taken from the DSPstone benchmarks.

5 Conclusions

We proposed some simple techniques for code generation with complex instructions. One of them is the modified pattern matching technique, in which ASAP scheduling is used to reveal the dependencies between instructions. The other two techniques are implemented in the Postpass Optimizer; they use hardware loop instructions and post-increment (post-decrement) addressing modes. We also presented experimental results showing that our techniques are effective.

References

[1] R. M. Stallman, Using and Porting GNU CC for Version 2.6, Free Software Foundation Inc., Sep. 1996.
[2] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1986.
[3] A. V. Aho and S. C. Johnson, "Optimal code generation for expression trees", Journal of the ACM, Vol. 23, No. 3, July 1976.
[4] V. Zivojnovic, J. Martinez, C. Schlager, and H. Meyr, "DSPstone: A DSP-Oriented Benchmarking Methodology", in Proc. ICSPAT '94, Oct. 1994.
[5] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.
[6] P. Lapsley et al., DSP Processor Fundamentals, IEEE Press, 1997.
