Stanford [I] we can follow the development of the RISC idea to the current ... The greatest number of clock periods is needed for ... evaluation of expressions as usual.The user .... organized by help of a method for coloring graphs [2].The last.
Code g e n e r a t i o n
Petr
for a R I S C M a c h i n e
Kroha
i. I n t r o d u c t i o n An important part of the whole d e v e l o p m e n t in c o m p u t e r s comes from the idea of a new computer architecture based on processors with a simplified instruction set.To date, two p r o m i n e n t a p p r o a c h e s to d e s i g n i n g CPUs and their i n s t r u c t i o n sets have emerged - reduced instruction set computer ( R I S C ) and c o m p l e x i n s t r u c t i o n set c o m p u t e r ( CISC ) designs.The primary difference between these strategies is whether to use a small number of fast-executing primitive i n s t r u c t i o n s or to use more c o m p l i c a t e d i n s t r u c t i o n s that make w r i t i n g programs easier. F r o m the first p r o j e c t s at IBM (IBM 801) and at B e r k e l e y and S t a n f o r d [I] we can follow the d e v e l o p m e n t of the R I S C idea to the c u r r e n t R I S C machines. From the software point of view we can see the d i f f e r e n c e in number of machine instruction (CISC about 200, RISC about 30).Because of the s i m p l i c i t y of a hardware c o n s t r u c t i o n we can a c h i e v e the same p e r f o r m a n c e m u c h c h e e p e r or we can r e a l i z e the chip on GaAs t e c h n o l o g y and increase the speed of c o m p u t i n g c o n s i d e r a b l y . In the following paragraphs we shall briefly show the s t r u c t u r e of our R I S C machine, general p r i n c i p l e s and i n t e r e s t i n g programming techniques used for e n g i n e e r i n g of a R I S C compiler and using them for m o d i f y i n g of the UNIX C c o m p i l e r for purposes of our R I S C machine. 2. S t r u c t u r e
of the used R I S C m a c h i n e
The d e t a i l e d d e s c r i p t i o n of our RISC machine is given in [5].We shall present here only information needed for u n d e r s t a n d i n g of the f o l l o w i n g parts of our c o n t r i b u t i o n . From the same reason we shall not describe the possibilities of p a r a l l e l i t y and r e c o n f i g u r a b i l i t y . The most i n s t r u c t i o n s of our R P - R I S C machine ( R e c o n f i g u r a b l e Parallel RISC machine) have three operands. E v e r y i n s t r u c t i o n is 32 bits long. The most significant 8 bits are used for the operation code, but only 5 bits really serve as code of operation, w h e r e a s the other 3 bits contain the i n f o r m a t i o n about the number of clock cycles needed for the i n s t r u c t i o n execution. The greatest number of clock periods is needed for CALLinstruction (6 cycles).Due to the fast execution of this instruction was used the windowing on a r e g i s t e r field.The u s u a l l y used stack concept for the CALL and R E T U R N i n s t r u c t i o n s d o e s n ' t s a t i s f y the c o n d i t i o n of the LOAD-STORE access as the only a c c e s s to the m e m o r y . U s i n g p r o c e d u r e s involves two groups of t i m e - c o n s u m i n g o p e r a t i o n s : s a v i n g or r e s t o r i n g r e g i s t e r s on each call and return, and p a s s i n g p a r a m e t e r s and results to and from the p r o c e d u r e . l n the w i n d o w c o n c e p t each p r o c e d u r e call a l l o c a t e s a new "window" w h i c h is a subset of the large r e g i s t e r f i e l d . T h i s w i n d o w s are o v e r l a p p i n g for nested procedure c a l l i n g ( s e e Fig.l).
205
Global
Obtained Local
from m a i n
program
P a s s e d - o n to P2 Local
Fig.
1
window
of P2
Passed-on
ii
w i n d o w of Pi
of Pl
to P3
of P2
Register windowing
The CALL and R E T U R N i n s t r u c t i o n s are then o n l y m o v i n g the window pointer on the field of registers.To the obvious programming techniques in R I S C p r o g r a m s we can c o u n t the p l a c i n g of the m o s t l y used c o n s t a n t s (e.g. zero, one) into the global registers.From the 32 global r e g i s t e r s in our R P - R I S C m a c h i n e there are 12 r e g i s t e r s used for saving constants and for some optimization purposes and 20 r e g i s t e r s are used for p u r p o s e s of e v a l u a t i o n of e x p r e s s i o n s as u s u a l . T h e user can a c c e s s in e v e r y moment only 56 registers 32 global and 24 r e g i s t e r s of the window. The total a m o u n t of r e g i s t e r s is 288.The w i n d o w pointer is incremented by CALL instruction and decremented by RET i n s t r u c t i o n . E a c h w i n d o w has t h r e e parts: - 8 r e g i s t e r s for p a r a m e t e r s o b t a i n e d from p r e v i o u s s u b r o u t i n e 8 r e g i s t e r s for local v a r i a b l e s 8 r e g i s t e r s for p a s s e d - o n p a r a m e t e r s . Stack in m e m o r y and stack pointer are used a u t o m a t i c a l l y by h a r d w a r e m e a n s o n l y if s u b r o u t i n e n e s t i n g is too deep. The r e l e v a n t i n s t r u c t i o n s are s h o w n in the f o l l o w i n g table. -
-
Machine
instruction
for the R P - R I S C m a c h i n e
Instruction
Function
ADD ADC SUB SBC
RI,R2,R3 RI,R2,R3 RI,R2,R3 RI,R2,R3
R1 R1 R1 Ri
AND OR XOR
RI,R2,R3 RI,R2,R3 RI,R2,R3
R1 and R2 R1 or R2 R1 xor R2
SRL
R
shift
ICR
R,n,const
insert constant of r e g i s t e r R
ICA
addr.
in page
+ + -
R2 R2 + C R2 R2 - C
right
insert
--> --> --> -->
R3 R3 R3 R3
--> R3 --> R3 --> R3 logical
with c a r r y
into nth cell
16-bit constant
into the
206
address ICP
number
insert prefix
of page
register 8-bit constant register
into
the
LMR
[direct
address]
load contents of given address into the register RP (one of the global r e g i s t e r s ) . I f no direct address is given, a d d r e s s i n g is i n d i r e c t , i . e . b y means of address register
SMR
[direct
address]
store contents of the register RP into the given a d d r e s s (as in LMR)
CJZ CJC CJN
[direct
address]
subroutine call a d d r e s s i n g like LMR,SMR return from s u b r o u t i n e
CALL [direct
address]
RET
Tab.
1 Instruction
3. Special
set of R P - R I S C
programming
c o n d i t i o n a l jump in accordance with flags Z E R O , C A R R Y , N E G A T I V E (if set), a d d r e s s i n g identical with LMR, SMR
techniques
machine
used
(the relevant
subset)
in a RISC compiler
A R I S C m a c h i n e provides primitive hardware functions rather than bundling functionality into complex instruction, allowing the compiler to optimize below the level of other a r c h i t e c t u r e s . T h e compiler can generate optimal code for a simple machine more easily since the architecture provides fewer alternatives in performing a given function.This allows the compiler to focus its a t t e n t i o n on other aspects of optimization, including global optimization and an efficient run-time e n v i r o n m e n t . l t must, of course, support a reduced instruction set by providing for function not present in the h a r d w a r e . E a c h compiler phase usually performs optimizations.After local optimization will be done the peep-hole optimizations and architecture-dependent pipeline scheduling. P r o c e d u r e call, entry and return must be d e s i g n e d as fast and simple as possible, because there are often called the procedures and the functions of the run-time support standing for the c o m p l e x instructions of CISC.It was d i s c o v e r e d that 95% of the calls pass fewer than four p a r a m e t e r s [3]. Using of p i p e l i n i n g improves performance by 20% on average [4], but brings problems because the next instruction will be started before the current instruction has completed its activity. We speak about data c o n f l i c t s and branch conflicts and solve this by load d e l a y and branch d e l a y m e t h o d s . A load d e l a y
207
occurs when one instruction loads a value into a register and the next instruction uses that register. It is necessary 4o change the instruction order so that the instruction following the load does not depend on the load.This permits one instruction to access memory in parallel while executing another instruction. An other effect in pipeline reorganization is called a delayed branch.When a branch instruction is executing the logical order and the physical order of instructions are different.The logically next instruction should be prepared but the physically next instruction is.The basic solution is to insert so many NOPinstruction how many are needed for elimination of the branch conflict.
intermediate
language
global o p t i m i z a t i o ~ I
intermediate language
code generation ..... and local optimization
i J
"
I
J
target code with conflicts and symbolic registers
[ register .......allocatiin~ I
target code with conflicts
I' p!~e line re°rgan!zat!in I
Fig.2
Scheme of a back-end of a RISC compiler
The first phase of the code generation is the global optimization and substitution of small procedures by sequences of instructions.In the following process a naive code will be generated which uses unlimited set of registers and doesn't worry about data and branch conflicts caused by pipelining. During this process are complex instructions of the intermediate language substituted by sequences of the RISC primitive instructions and a local optimization is provided.Register allocation will be organized by help of a method for coloring graphs [2].The last phase of the code generator deletes data and branch conflicts.Some algorithms are given in [4]. 4. Features of the UNIX C compiler The version of the UNIX C compiler [6] we have got in the source form doesn't use a machine-independent intermediate language, but it is not important for our purposes.Each intermediate file is represented as a sequence of binary numbers without any explicit delimiters. It consists of a sequence of logical lines, each headed by an operator, and possibly containing various operands (numbers or strings).Expressions are transmitted in a reverse-Polish notation; as they are being read,
208
a tree is built which is i s o m o r p h i c to the tree c o n s t r u c t e d in the last phase. The reader maintains a stack; each leaf of the e x p r e s s i o n tree (name, constant) is pushed on the s t a c k ; e a c h u n a r y operator r e p l a c e s the top of the stack by a node whose o p e r a n d is the old top-of -stack; each b i n a r y operator replaces the top pair on the stack w i t h a s i n g l e entry. When the e x p r e s s i o n is c o m p l e t e there is e x a c t l y one item on the s t a c k . F o l l o w i n g each e x p r e s s i o n is a special operator w h i c h passes the unique p r e v i o u s e x p r e s s i o n to the o p t i m i z e r and to the code g e n e r a t o r . The o p t i m i z e r tries to m o d i f y and s i m p l i f y the tree so its value m a y be c o m p u t e d more e f f i c i e n t l y and c o n v e n i e n t l y by the code generator.Each interior node will be marked with an estimation of the number of registers required to evaluate it.This r e g i s t e r count is needed to guide the code g e n e r a t i o n a l g o r i t h m . The common subexpression are not d i s c o v e r e d and used for o p t i m i z a t i o n . The basic algorithms the depth-first scan of the t r e e , b u t the code g e n e r a t i o n is in p r i n c i p l e i n d e p e n d e n t of a n y p a r t i c u l a r machine. It d e p e n d s l a r g e l y on a set of tables. It could seem to be easy to m o d i f y this tables for p r o d u c i n g code for other m a c h i n e s , but: ° - s t r u c t u r e of these tables is m a c h i n e - d e p e n d e n t - it is n o n - t r i v i a l to p r e p a r e such tables. The basic code g e n e r a t i o n p r o c e d u r e works with a pointer to a tree representing an expression, with the name of a code g e n e r a t i o n table, and r e t u r n s the number of the r e g i s t e r in w h i c h the value s h o u l d be p l a c e d . T h e r e are four code g e n e r a t i o n tables: - REGTAB actually produces code which causes the value r e p r e s e n t e d by the e x p r e s s i o n tree to be placed in a register. - C C T A B is used in the case, w h e n we need not the value of the expression, but instead the value of the c o n d i t i o n code r e s u l t i n g from e v a l u a t i o n of the e x p r e s s i o n . This table is used to e v a l u a t e the e x p r e s s i o n after IF. S P T A B is used when the value of an e x p r e s s i o n is to be pushed on the stack(e.g, an a c t u a l a r g u m e n t ) . E F F T A B is used when an e x p r e s s i o n is to be e v a l u a t e d for its side effects, not its value.This occurs mostly for e x p r e s s i o n s w h i c h are s t a t e m e n t s , w h i c h have no value. It doesn't need to leave the value in a register. All of the tables b e s i d e s R E G T A B handle a r e l a t i v e l y few s p e c i a l c a s e s . l f one of these tables does not c o n t a i n an e n t r y a p p l i c a b l e to the given e x p r e s s i o n tree, the table R E G T A B will be used and p e r h a p s a r e d u n d a n t code will be g e n e r a t e d . P a r t s of the items in the R E G T A B can be macros w h i c h are e x p a n d e d d u r i n g the generation. The processing of register allocation during the code g e n e r a t i o n is based on S e t h i - U l l m a n a l g o r i t h m . The main code generation tables c o n s i s t of entries each c o n t a i n i n g an o p e r a t o r number and a pointer to a s u b t a b l e for the corresponding operator.A subtable consists of a s e q u e n c e of entries, each with a key describing certain properties of the o p e r a n d s of the o p e r a t o r involved;associated with the k e y is a code s t r i n g . O n c e the s u b t a b l e c o r r e s p o n d i n g to the operator is found, the s u b t a b l e is searched linearly until a key is found such that the p r o p e r t i e s d e m a n d e d by the k e y are c o m p a t i b l e with the operands of the tree n o d e . A s u c c e s s f u l match returns the code string; an u n s u c c e s s f u l s e a r c h r e t u r n s a failure indication. -
-
209
5. M o d i f i c a t i o n
for the RISC
target
code
B e c a u s e the global o p t i m i z a t i o n is running on the p r o g r a m in the i n t e r m e d i a t e language it w a s n ' t changed. This means the ~irst we have to c h a n g e is the t a r g e t code to be g e n e r a t e d . T h e main part of it is g e n e r a t e d by help of the table R E G T A B . T h e table CCTAB isn't relevant for our purposes because we have the operands evaluated in registers for t e s t i n g a n y w a y (the LOADSTORE access to the memory).The SPTABLE isn't r e l e v a n t too because we have no instructions w o r k i n g with s t a c k . T h e table E F F T A B we n e e d n ' t for the same r e a s o n as the table CCTAB, because we a l w a y s o b t a i n a value in a register. The R E G T A B table is o r g a n i z e d h i e r a r c h i c a l l y . The first ievel is built on the operator of the tree node r e p r e s e n t i n g the operation to be translated (has a b o u t 50 e n t r i e s ) . T h e second level is built on the types of operands in the c o m p i l e stack (total about 120 entries).Because of many possibilities of a d d r e s s i n g in the PDP ii i n s t r u c t i o n set there are m a n y cases to distinguish.For the PDP ii code will be the operator of i n d i r e c t i o n t r a n s l a t e d not immediately. Its operand in the compile stack will be o n l y s i g n e d and the o p e r a t i o n of i n d i r e c t i o n will be g e n e r a t e d in the most cases by help of p d s s i b i l i t i e s of the a d d r e s s i n g modes of the PDP ll. In operands in the R E G T A B for the PDP !I is d i s t i n g u i s h e d b e t w e e n o p e r a n d *X and *(X+c) because of addressing modes (R) and X ( R ) . F o r the R P - R I S C m a c h i n e we have only one way, how to translate the indirect a d d r e s s i n g . It simplifies the tables because the number of cases is therefor c o n s i d e r a b l e s m a l l e r (about 4 times). The power of e x p r e s s i o n of the R P - R I S C code is i l l u s t r a t e d in the f o l l o w i n g table. Instruction
set of PDP I!
CLR COM INC DEC NEG ASR ASL
R R R R R R R
MOV ADD SUB
Ra,Rb Ra,Rb Ra,Rb
(Rb B[~] i=~+l i < 50
We c a n s e e t h a t we o b t a i n e d : instructions a 4 b y t e for R P - R I S C ... 76 b y t e i n s t r u c t i o n of v a r i a b l e l e n g t h for P D P Ii ... 60 byte. If we make a summa of the time n e e d e d for e x e c u t i o n of instructions in one s e q u e n c e we o b t a i n : -for R P - R I S C : 20 m i c r o s e c . ( a b o u t i00 c y c l e s for 5 M H z c l o c k ) - f o r P D P ii : 7 1 . 9 4 m i c r o s e c . The t a b l e s are not the o n l y one D l a c e , w h e r e the m a c h i n e dependent code o c c u r s . F o r e x a m p l e for a c o n d i t i o n a l b r a n c h t h e r e is in t h e U N I X C c o m p i l e r a p r o c e d u r e : branch (label,aop,c) 19 18
{ r e g i s t e r op; if ( o p = a o p ) P R I N S ( o p , c , b z a n c h t a b ) ; else printf("jbr"); printf('\tL%d\n',label);
} The d e p e n d e n t code occurs on about 70 places.The n u m b e r of registers used for e v a l u a t i o n is in the U N I X C c o m p i l e r d e f i n e d as 3, we e x t e n d e d to 20 ( o n l y t h r o u g h c h a n g i n g a c o n s t a n t ) .
6.
Conclusions
The a l g o r i t h m s of the U N I X C c o m p i l e r was u s e d to p r o v e s o m e p r o p e r t i e s of a Small C compiler for a R I S C m a c h i n e . B e c a u s e of lack of the m o s t a d d r e s s i n g m o d e s m a n y t a b l e s a n d marly p r o c e d u r e s were becoming much more s i m p l e . T h e p r o b l e m of d a t a and b r a n c h c o n f l i c t s we h a d n ' t to solve because our machine d o e s n ' t use pipelining. This is due to the motivation of the m a c h i n e in c o n t r o l of p a r a l l e l p r o c e s s e s , w h e r e the greatest part of the processor activity c a n be spent by w a i t i n g for a n a s y n c h r o n o u s event.That's why it didn't seem to be important to use pipelining.On examples we s h o w e d that the i n c r e a s e of n u m b e r of instructions of the g e n e r a t e d c o d e w h e n c o m p i l i n g a p r o g r a m isn't so c o n s i d e r a b l e as will be u s u a l l y p r o p o s e d .
214
References [i] Patterson,D.A.,Sequin,C.: A VLSI RISC. Computer,15,No 9,1982. [2] Chaitin,G.J.: Register Allocation and spilling via graph coloring.Proceedings of the SIGPLAN'82 Symposium of Compiler Construction, SIGPLAN Not. 17, 1982. [3] Chow,F.: Engineering RISC compiler system. Compcon Spring 1983. [4] Gross,T.R.: Code optimization technique for pipelined architecture. Compcon Spring 1983. [5] Blazek. Z.,Kroha,P.: Design of a reconfigurable parallel RISC machine. EUROMICRO'87,Portsmouth. reprinted in: Microprocessing and microprogramming 21(1987), North-Holland. [6] Ritchie,D.M.: A Tour through the UNI~ C Compiler. Bell Laboratories,documentation of UNIX. [7] Blazek,Z.:A reconfigurable processor with RISC features. Thesis,Czech Technical University, (to appear on Czech) [8] PDP ii Processor Handbook.Digital Equipment Corporation,1978.