Code generation for a RISC machine - Springer Link

Code g e n e r a t i o n

Petr

for a R I S C M a c h i n e

Kroha

i. I n t r o d u c t i o n An important part of the whole d e v e l o p m e n t in c o m p u t e r s comes from the idea of a new computer architecture based on processors with a simplified instruction set.To date, two p r o m i n e n t a p p r o a c h e s to d e s i g n i n g CPUs and their i n s t r u c t i o n sets have emerged - reduced instruction set computer ( R I S C ) and c o m p l e x i n s t r u c t i o n set c o m p u t e r ( CISC ) designs.The primary difference between these strategies is whether to use a small number of fast-executing primitive i n s t r u c t i o n s or to use more c o m p l i c a t e d i n s t r u c t i o n s that make w r i t i n g programs easier. F r o m the first p r o j e c t s at IBM (IBM 801) and at B e r k e l e y and S t a n f o r d [I] we can follow the d e v e l o p m e n t of the R I S C idea to the c u r r e n t R I S C machines. From the software point of view we can see the d i f f e r e n c e in number of machine instruction (CISC about 200, RISC about 30).Because of the s i m p l i c i t y of a hardware c o n s t r u c t i o n we can a c h i e v e the same p e r f o r m a n c e m u c h c h e e p e r or we can r e a l i z e the chip on GaAs t e c h n o l o g y and increase the speed of c o m p u t i n g c o n s i d e r a b l y . In the following paragraphs we shall briefly show the s t r u c t u r e of our R I S C machine, general p r i n c i p l e s and i n t e r e s t i n g programming techniques used for e n g i n e e r i n g of a R I S C compiler and using them for m o d i f y i n g of the UNIX C c o m p i l e r for purposes of our R I S C machine. 2. S t r u c t u r e

of the used R I S C m a c h i n e

The d e t a i l e d d e s c r i p t i o n of our RISC machine is given in [5].We shall present here only information needed for u n d e r s t a n d i n g of the f o l l o w i n g parts of our c o n t r i b u t i o n . From the same reason we shall not describe the possibilities of p a r a l l e l i t y and r e c o n f i g u r a b i l i t y . The most i n s t r u c t i o n s of our R P - R I S C machine ( R e c o n f i g u r a b l e Parallel RISC machine) have three operands. E v e r y i n s t r u c t i o n is 32 bits long. The most significant 8 bits are used for the operation code, but only 5 bits really serve as code of operation, w h e r e a s the other 3 bits contain the i n f o r m a t i o n about the number of clock cycles needed for the i n s t r u c t i o n execution. The greatest number of clock periods is needed for CALLinstruction (6 cycles).Due to the fast execution of this instruction was used the windowing on a r e g i s t e r field.The u s u a l l y used stack concept for the CALL and R E T U R N i n s t r u c t i o n s d o e s n ' t s a t i s f y the c o n d i t i o n of the LOAD-STORE access as the only a c c e s s to the m e m o r y . U s i n g p r o c e d u r e s involves two groups of t i m e - c o n s u m i n g o p e r a t i o n s : s a v i n g or r e s t o r i n g r e g i s t e r s on each call and return, and p a s s i n g p a r a m e t e r s and results to and from the p r o c e d u r e . l n the w i n d o w c o n c e p t each p r o c e d u r e call a l l o c a t e s a new "window" w h i c h is a subset of the large r e g i s t e r f i e l d . T h i s w i n d o w s are o v e r l a p p i n g for nested procedure c a l l i n g ( s e e Fig.l).

205

Global

Obtained Local

from m a i n

program

P a s s e d - o n to P2 Local

Fig.

1

window

of P2

Passed-on

ii

w i n d o w of Pi

of Pl

to P3

of P2

Register windowing

The CALL and R E T U R N i n s t r u c t i o n s are then o n l y m o v i n g the window pointer on the field of registers.To the obvious programming techniques in R I S C p r o g r a m s we can c o u n t the p l a c i n g of the m o s t l y used c o n s t a n t s (e.g. zero, one) into the global registers.From the 32 global r e g i s t e r s in our R P - R I S C m a c h i n e there are 12 r e g i s t e r s used for saving constants and for some optimization purposes and 20 r e g i s t e r s are used for p u r p o s e s of e v a l u a t i o n of e x p r e s s i o n s as u s u a l . T h e user can a c c e s s in e v e r y moment only 56 registers 32 global and 24 r e g i s t e r s of the window. The total a m o u n t of r e g i s t e r s is 288.The w i n d o w pointer is incremented by CALL instruction and decremented by RET i n s t r u c t i o n . E a c h w i n d o w has t h r e e parts: - 8 r e g i s t e r s for p a r a m e t e r s o b t a i n e d from p r e v i o u s s u b r o u t i n e 8 r e g i s t e r s for local v a r i a b l e s 8 r e g i s t e r s for p a s s e d - o n p a r a m e t e r s . Stack in m e m o r y and stack pointer are used a u t o m a t i c a l l y by h a r d w a r e m e a n s o n l y if s u b r o u t i n e n e s t i n g is too deep. The r e l e v a n t i n s t r u c t i o n s are s h o w n in the f o l l o w i n g table. -

-

Machine

instruction

for the R P - R I S C m a c h i n e

Instruction

Function

ADD ADC SUB SBC

RI,R2,R3 RI,R2,R3 RI,R2,R3 RI,R2,R3

R1 R1 R1 Ri

AND OR XOR

RI,R2,R3 RI,R2,R3 RI,R2,R3

R1 and R2 R1 or R2 R1 xor R2

SRL

R

shift

ICR

R,n,const

insert constant of r e g i s t e r R

ICA

addr.

in page

+ + -

R2 R2 + C R2 R2 - C

right

insert

--> --> --> -->

R3 R3 R3 R3

--> R3 --> R3 --> R3 logical

with c a r r y

into nth cell

16-bit constant

into the

206

address ICP

number

insert prefix

of page

register 8-bit constant register

into

the

LMR

[direct

address]

load contents of given address into the register RP (one of the global r e g i s t e r s ) . I f no direct address is given, a d d r e s s i n g is i n d i r e c t , i . e . b y means of address register

SMR

[direct

address]

store contents of the register RP into the given a d d r e s s (as in LMR)

CJZ CJC CJN

[direct

address]

subroutine call a d d r e s s i n g like LMR,SMR return from s u b r o u t i n e

CALL [direct

address]

RET

Tab.

1 Instruction

3. Special

set of R P - R I S C

programming

c o n d i t i o n a l jump in accordance with flags Z E R O , C A R R Y , N E G A T I V E (if set), a d d r e s s i n g identical with LMR, SMR

techniques

machine

used

(the relevant

subset)

in a RISC compiler

A R I S C m a c h i n e provides primitive hardware functions rather than bundling functionality into complex instruction, allowing the compiler to optimize below the level of other a r c h i t e c t u r e s . T h e compiler can generate optimal code for a simple machine more easily since the architecture provides fewer alternatives in performing a given function.This allows the compiler to focus its a t t e n t i o n on other aspects of optimization, including global optimization and an efficient run-time e n v i r o n m e n t . l t must, of course, support a reduced instruction set by providing for function not present in the h a r d w a r e . E a c h compiler phase usually performs optimizations.After local optimization will be done the peep-hole optimizations and architecture-dependent pipeline scheduling. P r o c e d u r e call, entry and return must be d e s i g n e d as fast and simple as possible, because there are often called the procedures and the functions of the run-time support standing for the c o m p l e x instructions of CISC.It was d i s c o v e r e d that 95% of the calls pass fewer than four p a r a m e t e r s [3]. Using of p i p e l i n i n g improves performance by 20% on average [4], but brings problems because the next instruction will be started before the current instruction has completed its activity. We speak about data c o n f l i c t s and branch conflicts and solve this by load d e l a y and branch d e l a y m e t h o d s . A load d e l a y

207

occurs when one instruction loads a value into a register and the next instruction uses that register. It is necessary 4o change the instruction order so that the instruction following the load does not depend on the load.This permits one instruction to access memory in parallel while executing another instruction. An other effect in pipeline reorganization is called a delayed branch.When a branch instruction is executing the logical order and the physical order of instructions are different.The logically next instruction should be prepared but the physically next instruction is.The basic solution is to insert so many NOPinstruction how many are needed for elimination of the branch conflict.

intermediate

language

global o p t i m i z a t i o ~ I

intermediate language

code generation ..... and local optimization

i J

"

I

J

target code with conflicts and symbolic registers

[ register .......allocatiin~ I

target code with conflicts

I' p!~e line re°rgan!zat!in I

Fig.2

Scheme of a back-end of a RISC compiler

The first phase of the code generation is the global optimization and substitution of small procedures by sequences of instructions.In the following process a naive code will be generated which uses unlimited set of registers and doesn't worry about data and branch conflicts caused by pipelining. During this process are complex instructions of the intermediate language substituted by sequences of the RISC primitive instructions and a local optimization is provided.Register allocation will be organized by help of a method for coloring graphs [2].The last phase of the code generator deletes data and branch conflicts.Some algorithms are given in [4]. 4. Features of the UNIX C compiler The version of the UNIX C compiler [6] we have got in the source form doesn't use a machine-independent intermediate language, but it is not important for our purposes.Each intermediate file is represented as a sequence of binary numbers without any explicit delimiters. It consists of a sequence of logical lines, each headed by an operator, and possibly containing various operands (numbers or strings).Expressions are transmitted in a reverse-Polish notation; as they are being read,

208

a tree is built which is i s o m o r p h i c to the tree c o n s t r u c t e d in the last phase. The reader maintains a stack; each leaf of the e x p r e s s i o n tree (name, constant) is pushed on the s t a c k ; e a c h u n a r y operator r e p l a c e s the top of the stack by a node whose o p e r a n d is the old top-of -stack; each b i n a r y operator replaces the top pair on the stack w i t h a s i n g l e entry. When the e x p r e s s i o n is c o m p l e t e there is e x a c t l y one item on the s t a c k . F o l l o w i n g each e x p r e s s i o n is a special operator w h i c h passes the unique p r e v i o u s e x p r e s s i o n to the o p t i m i z e r and to the code g e n e r a t o r . The o p t i m i z e r tries to m o d i f y and s i m p l i f y the tree so its value m a y be c o m p u t e d more e f f i c i e n t l y and c o n v e n i e n t l y by the code generator.Each interior node will be marked with an estimation of the number of registers required to evaluate it.This r e g i s t e r count is needed to guide the code g e n e r a t i o n a l g o r i t h m . The common subexpression are not d i s c o v e r e d and used for o p t i m i z a t i o n . The basic algorithms the depth-first scan of the t r e e , b u t the code g e n e r a t i o n is in p r i n c i p l e i n d e p e n d e n t of a n y p a r t i c u l a r machine. It d e p e n d s l a r g e l y on a set of tables. It could seem to be easy to m o d i f y this tables for p r o d u c i n g code for other m a c h i n e s , but: ° - s t r u c t u r e of these tables is m a c h i n e - d e p e n d e n t - it is n o n - t r i v i a l to p r e p a r e such tables. The basic code g e n e r a t i o n p r o c e d u r e works with a pointer to a tree representing an expression, with the name of a code g e n e r a t i o n table, and r e t u r n s the number of the r e g i s t e r in w h i c h the value s h o u l d be p l a c e d . T h e r e are four code g e n e r a t i o n tables: - REGTAB actually produces code which causes the value r e p r e s e n t e d by the e x p r e s s i o n tree to be placed in a register. - C C T A B is used in the case, w h e n we need not the value of the expression, but instead the value of the c o n d i t i o n code r e s u l t i n g from e v a l u a t i o n of the e x p r e s s i o n . This table is used to e v a l u a t e the e x p r e s s i o n after IF. S P T A B is used when the value of an e x p r e s s i o n is to be pushed on the stack(e.g, an a c t u a l a r g u m e n t ) . E F F T A B is used when an e x p r e s s i o n is to be e v a l u a t e d for its side effects, not its value.This occurs mostly for e x p r e s s i o n s w h i c h are s t a t e m e n t s , w h i c h have no value. It doesn't need to leave the value in a register. All of the tables b e s i d e s R E G T A B handle a r e l a t i v e l y few s p e c i a l c a s e s . l f one of these tables does not c o n t a i n an e n t r y a p p l i c a b l e to the given e x p r e s s i o n tree, the table R E G T A B will be used and p e r h a p s a r e d u n d a n t code will be g e n e r a t e d . P a r t s of the items in the R E G T A B can be macros w h i c h are e x p a n d e d d u r i n g the generation. The processing of register allocation during the code g e n e r a t i o n is based on S e t h i - U l l m a n a l g o r i t h m . The main code generation tables c o n s i s t of entries each c o n t a i n i n g an o p e r a t o r number and a pointer to a s u b t a b l e for the corresponding operator.A subtable consists of a s e q u e n c e of entries, each with a key describing certain properties of the o p e r a n d s of the o p e r a t o r involved;associated with the k e y is a code s t r i n g . O n c e the s u b t a b l e c o r r e s p o n d i n g to the operator is found, the s u b t a b l e is searched linearly until a key is found such that the p r o p e r t i e s d e m a n d e d by the k e y are c o m p a t i b l e with the operands of the tree n o d e . A s u c c e s s f u l match returns the code string; an u n s u c c e s s f u l s e a r c h r e t u r n s a failure indication. -

-

209

5. M o d i f i c a t i o n

for the RISC

target

code

B e c a u s e the global o p t i m i z a t i o n is running on the p r o g r a m in the i n t e r m e d i a t e language it w a s n ' t changed. This means the ~irst we have to c h a n g e is the t a r g e t code to be g e n e r a t e d . T h e main part of it is g e n e r a t e d by help of the table R E G T A B . T h e table CCTAB isn't relevant for our purposes because we have the operands evaluated in registers for t e s t i n g a n y w a y (the LOADSTORE access to the memory).The SPTABLE isn't r e l e v a n t too because we have no instructions w o r k i n g with s t a c k . T h e table E F F T A B we n e e d n ' t for the same r e a s o n as the table CCTAB, because we a l w a y s o b t a i n a value in a register. The R E G T A B table is o r g a n i z e d h i e r a r c h i c a l l y . The first ievel is built on the operator of the tree node r e p r e s e n t i n g the operation to be translated (has a b o u t 50 e n t r i e s ) . T h e second level is built on the types of operands in the c o m p i l e stack (total about 120 entries).Because of many possibilities of a d d r e s s i n g in the PDP ii i n s t r u c t i o n set there are m a n y cases to distinguish.For the PDP ii code will be the operator of i n d i r e c t i o n t r a n s l a t e d not immediately. Its operand in the compile stack will be o n l y s i g n e d and the o p e r a t i o n of i n d i r e c t i o n will be g e n e r a t e d in the most cases by help of p d s s i b i l i t i e s of the a d d r e s s i n g modes of the PDP ll. In operands in the R E G T A B for the PDP !I is d i s t i n g u i s h e d b e t w e e n o p e r a n d *X and *(X+c) because of addressing modes (R) and X ( R ) . F o r the R P - R I S C m a c h i n e we have only one way, how to translate the indirect a d d r e s s i n g . It simplifies the tables because the number of cases is therefor c o n s i d e r a b l e s m a l l e r (about 4 times). The power of e x p r e s s i o n of the R P - R I S C code is i l l u s t r a t e d in the f o l l o w i n g table. Instruction

set of PDP I!

CLR COM INC DEC NEG ASR ASL

R R R R R R R

MOV ADD SUB

Ra,Rb Ra,Rb Ra,Rb

(Rb B[~] i=~+l i < 50

We c a n s e e t h a t we o b t a i n e d : instructions a 4 b y t e for R P - R I S C ... 76 b y t e i n s t r u c t i o n of v a r i a b l e l e n g t h for P D P Ii ... 60 byte. If we make a summa of the time n e e d e d for e x e c u t i o n of instructions in one s e q u e n c e we o b t a i n : -for R P - R I S C : 20 m i c r o s e c . ( a b o u t i00 c y c l e s for 5 M H z c l o c k ) - f o r P D P ii : 7 1 . 9 4 m i c r o s e c . The t a b l e s are not the o n l y one D l a c e , w h e r e the m a c h i n e dependent code o c c u r s . F o r e x a m p l e for a c o n d i t i o n a l b r a n c h t h e r e is in t h e U N I X C c o m p i l e r a p r o c e d u r e : branch (label,aop,c) 19 18

{ r e g i s t e r op; if ( o p = a o p ) P R I N S ( o p , c , b z a n c h t a b ) ; else printf("jbr"); printf('\tL%d\n',label);

} The d e p e n d e n t code occurs on about 70 places.The n u m b e r of registers used for e v a l u a t i o n is in the U N I X C c o m p i l e r d e f i n e d as 3, we e x t e n d e d to 20 ( o n l y t h r o u g h c h a n g i n g a c o n s t a n t ) .

6.

Conclusions

The a l g o r i t h m s of the U N I X C c o m p i l e r was u s e d to p r o v e s o m e p r o p e r t i e s of a Small C compiler for a R I S C m a c h i n e . B e c a u s e of lack of the m o s t a d d r e s s i n g m o d e s m a n y t a b l e s a n d marly p r o c e d u r e s were becoming much more s i m p l e . T h e p r o b l e m of d a t a and b r a n c h c o n f l i c t s we h a d n ' t to solve because our machine d o e s n ' t use pipelining. This is due to the motivation of the m a c h i n e in c o n t r o l of p a r a l l e l p r o c e s s e s , w h e r e the greatest part of the processor activity c a n be spent by w a i t i n g for a n a s y n c h r o n o u s event.That's why it didn't seem to be important to use pipelining.On examples we s h o w e d that the i n c r e a s e of n u m b e r of instructions of the g e n e r a t e d c o d e w h e n c o m p i l i n g a p r o g r a m isn't so c o n s i d e r a b l e as will be u s u a l l y p r o p o s e d .

214

References [i] Patterson,D.A.,Sequin,C.: A VLSI RISC. Computer,15,No 9,1982. [2] Chaitin,G.J.: Register Allocation and spilling via graph coloring.Proceedings of the SIGPLAN'82 Symposium of Compiler Construction, SIGPLAN Not. 17, 1982. [3] Chow,F.: Engineering RISC compiler system. Compcon Spring 1983. [4] Gross,T.R.: Code optimization technique for pipelined architecture. Compcon Spring 1983. [5] Blazek. Z.,Kroha,P.: Design of a reconfigurable parallel RISC machine. EUROMICRO'87,Portsmouth. reprinted in: Microprocessing and microprogramming 21(1987), North-Holland. [6] Ritchie,D.M.: A Tour through the UNI~ C Compiler. Bell Laboratories,documentation of UNIX. [7] Blazek,Z.:A reconfigurable processor with RISC features. Thesis,Czech Technical University, (to appear on Czech) [8] PDP ii Processor Handbook.Digital Equipment Corporation,1978.

Code generation for a RISC machine - Springer Link

Code generation for a RISC machine - Springer Link

Suggest Documents

Low Power Code Generation for a RISC Processor by Register ...

Code Machine Code Assembleur Processeurs RISC (Byte code).

An Infrastructure for UML-Based Code Generation Tools - Springer Link

A new generation - Springer Link

A novel rotometer based on a RISC microcontroller - Springer Link

Low Power Code Generation for a RISC Processor by Register ... - LS12

Directed Proof Generation for Machine Code - University of Wisconsin ...

Code Generation = A +BURS

A Proof Theory for Machine Code

A Column-Generation Approach for a Short-Term ... - Springer Link

A Code Generation Approach for Auto ... - People.csail.mit.edu

A Machine Learning Method for Power Prediction on ... - Springer Link

A machine vision system for quantifying velocity fields ... - Springer Link

A tabu search algorithm for unrelated parallel machine ... - Springer Link

Machine Learning for Survival Analysis: A Case Study ... - Springer Link

PsychLib: A library of machine language routines for ... - Springer Link

A survey of methods for distributed machine learning - Springer Link

Algorithms for parallel machine scheduling: a case ... - Springer Link

Machine Learning for Survival Analysis: A Case Study ... - Springer Link

A Methodology for Evaluating Arabic Machine ... - Springer Link

A Support Vector Machine Approach for Video Shot ... - Springer Link

PredicT-ML: a tool for automating machine learning ... - Springer Link

AUTOMATIC CODE GENERATION FOR ... - CiteSeerX

Deciphering the secret code: A new methodology for ... - Springer Link