A Portable Model for Predicting the Size and Execution Time of Programs

Jakob Axelsson
Dept. of Computer and Information Science
Linköping University
S-581 83 Linköping, Sweden
Email: [email protected]
Phone: +46-13-28 13 96
Fax: +46-13-28 26 66

Abstract
This paper presents a highly portable model for estimating the size and execution time of software on different microprocessors. The model is based on a small number of easily obtained characteristics of the processor to be evaluated, and it is intended to be used by embedded system designers in early phases of the design, to support design decisions such as choosing which microprocessor to use, or whether a certain function must be implemented in application-specific hardware instead of in software.
1 Introduction

When an engineer starts the design of an embedded computer system, one of the first design decisions to be made is what the hardware architecture should look like, and in particular which microprocessor(s) to use. This choice is governed by the performance required of the system, and therefore the designer must estimate the execution time of the program on each of the candidate processors. One way of doing this is to obtain a compiler for every processor, use it to compile the program, and calculate or measure the execution time based on the generated machine code. But this is a very tedious task, which drastically reduces the number of processors that can be evaluated, and it would therefore be very useful for a designer to have a more general software estimation tool. Several methods have been suggested for these kinds of estimations, at different levels of abstraction and with different portability and accuracy:
- Ein-Dor and Feldmesser [1] present a model for predicting the relative performance of a computer, which is a measure of the capacity of the machine in general, without relation to any specific program. The model is based on a small number of measurements, such as clock frequency and cache memory size, which can be obtained without studying the processor in detail. A drawback of the approach is that the relative performance of a computer is not necessarily closely related to the execution time of a particular program.

- Park and Shaw [6, 7] discuss a source-level analysis which is based on timing schemas for the different constructs of the source language. It can be ported to a different processor by calculating new timing schemas, but since it works directly on the source code, it has difficulty taking into account compiler optimizations made at lower levels.

- Gong et al. [3] compile a source program to an intermediate three-address code. The analyzer is then provided with a processor description file, which gives the expected execution time of each intermediate code instruction on the particular processor. A total of over 300 values are needed. The analyzer can be ported to a different processor by writing a new processor description file, but this requires detailed knowledge of the processor architecture and instruction set.

- Harmon et al. [4] do the analysis at machine-code level. The estimator is portable, but only with a large effort, since it requires information about the tiniest details of the processor's instruction set. The technique can handle advanced processor features, such as cache memories and pipelines, with high accuracy.

In this paper we present an approach similar to Gong's, in that we compile the source code to an intermediate format on which the analysis is done, but instead of explicitly giving the execution times of each intermediate-level instruction on each processor, we estimate these times using ideas similar to Ein-Dor's. In this way, we get predictions with accuracy similar to Gong's, but we need to provide much less information.
2 Intermediate Code Format

The input specification, which is assumed to be given in some high-level programming language, is first compiled into an intermediate-level code which is in principle a graph of basic blocks, where the contents of the blocks are as shown by the grammar in Figure 1. Each basic block is a straight sequence of simple instructions, finishing with a transfer of control to another basic block. This transfer can be made by unconditional jumps (the goto instruction) or conditional jumps (the if ... goto instruction), or by putting one block after another in the linearized version of the intermediate code graph (this is marked by the pseudo-instruction continue). The only simple instruction in this version is assignment (←), but others can be added at will. Some instructions can contain expressions, which are built from:
bblock     →  label : seq
seq        →  lastinstr  |  instr ; seq
lastinstr  →  stop
            |  goto label
            |  continue label
            |  if_b exp goto label ; lastinstr
instr      →  ref ←_b exp
ref        →  loc  |  (loc)
loc        →  var  |  $temp
exp        →  val  |  f(val)  |  f(val1, val2)
val        →  #const  |  ref

Figure 1: Intermediate code format.
- Constants, representing absolute numbers.
- Variables, indicating absolute addresses of memory locations.
- Temporary variables, generated by the compiler, e.g. to hold intermediate values in arithmetic expressions. The temporary variables are assumed to be stored in registers.
- References, where the address of a memory location is given by the value of a variable or temporary variable.
- Functions, for which we can expect the processor to have special instructions. These are almost always simple arithmetic and logic operations.

The data size of the expressions is given explicitly in the instructions in which they occur. For instance, an assignment of the 16-bit constant 23 to a variable would be written var ←_16 #23. A possible representation of this intermediate code is sketched below.
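The paper leaves the concrete representation of the intermediate code open; as an illustration, the expression and assignment parts of the Figure 1 grammar could be encoded along the following lines in C. All type and field names here are our own sketch, not part of the paper's tool, and the basic block and lastinstr productions are omitted for brevity.

/* Sketch of a possible encoding of the Figure 1 grammar (our own
 * naming, not the paper's implementation). */
#include <stdio.h>

typedef enum { VAL_CONST, VAL_VAR, VAL_TEMP, VAL_DEREF } ValKind;
typedef enum { EXP_VAL, EXP_UNOP, EXP_BINOP } ExpKind;

typedef struct Val {          /* val -> #const | ref; ref -> loc | (loc)  */
    ValKind kind;
    long constant;            /* VAL_CONST: the literal value             */
    const char *name;         /* VAL_VAR / VAL_TEMP: identifier           */
    struct Val *loc;          /* VAL_DEREF: location holding the address  */
} Val;

typedef struct Exp {          /* exp -> val | f(val) | f(val1, val2)      */
    ExpKind kind;
    const char *fn;           /* built-in function, e.g. "+" or "*"       */
    Val *arg1, *arg2;         /* arg2 is unused for EXP_VAL / EXP_UNOP    */
} Exp;

typedef struct Instr {        /* the assignment instruction ref <-_b exp  */
    Val *ref;                 /* left-hand side                           */
    int bits;                 /* data size b                              */
    Exp *rhs;                 /* right-hand side                          */
} Instr;

int main(void)
{
    /* The example from the text: var <-_16 #23 */
    Val var = { VAL_VAR, 0, "var", NULL };
    Val c23 = { VAL_CONST, 23, NULL, NULL };
    Exp rhs = { EXP_VAL, NULL, &c23, NULL };
    Instr assign = { &var, 16, &rhs };
    printf("%s <-_%d #%ld\n", assign.ref->name, assign.bits,
           assign.rhs->arg1->constant);
    return 0;
}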
3 The Model

We will now describe a model for estimating the size and execution time of intermediate code programs. In the model, we use the following characteristics of the microprocessor:

- Memory access time, mem [clock cycles]
- Width of the data bus, data [bits]
- Width of the address bus, addr [bits]
- Size of instruction codes, op [bits], including information about which registers are used, which function is calculated, etc.
- The execution time T_f of each built-in function f used in the program [clock cycles]
- The execution time of jumps, T_jump [clock cycles]

Some of the instructions might take variable time; conditional jumps, for example, often have different durations depending on whether the jump is made or not. In these cases, one should use an average or a worst-case value, to suit the kind of analysis one wants to do. In addition to the above parameters, we also need to know the number of registers when translating the source code into the intermediate language, to determine whether a certain value should be stored in a temporary variable or in memory. A possible grouping of these parameters is sketched below.
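To make the parameter list concrete, the characterization of a processor could be grouped into a single record. The field names are ours, and the example values are the 68000 figures from Table 1 in Section 4 (the clock frequency f is only needed to convert cycle counts into wall-clock time).

/* Sketch of the processor characterization used by the model;
 * example values are the 68000 column of Table 1. */
#include <stdio.h>

typedef struct {
    double f;       /* clock frequency [MHz], to convert cycles to time */
    int mem;        /* memory access time [clock cycles]                */
    int data;       /* data bus width [bits]                            */
    int addr;       /* address bus width [bits]                         */
    int op;         /* instruction code size [bits]                     */
    int t_add;      /* T_+, execution time of addition [cycles]         */
    int t_mul;      /* T_*, execution time of multiplication [cycles]   */
    int t_jump;     /* T_jump [cycles]                                  */
} Processor;

static const Processor m68000 = { 8.0, 4, 16, 24, 16, 0, 66, 2 };

int main(void)
{
    /* One memory access: mem / f = 4 cycles / 8 MHz = 0.5 microseconds */
    printf("one memory access: %.2f us\n", m68000.mem / m68000.f);
    return 0;
}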
3.1 Estimating Size
Figure 2 defines the functions S_B, S_I and S_E, which are used to estimate the sizes of basic blocks, instructions, and expressions, respectively. The sizes of basic blocks are measured in bytes, while the sizes of instructions and expressions are measured in bits. In the definitions of these functions, we use double brackets [[...]] to enclose intermediate code arguments. The function S_E takes an extra argument b, which is the data size of the expression. The definitions of the functions are relatively straightforward. As an example, consider the size of the assignment instruction, ref ←_b exp: First we have an instruction code, which has the size op (in bits). This figure includes the actual operation code, the addressing mode information, etc. Then we have the size of the expression denoting the left-hand side of the assignment, and since this expression should result in an address, its data size is addr bits, hence S_E[[ref]](addr). Finally we have the size of the right-hand side expression, whose data size is given by b. For the expressions, we have estimated the size of a temporary variable (register) to be 0, since this information is normally included in the instruction code. The total size of a program is equal to the sum of the sizes of all the basic blocks. A C sketch of the size functions follows Figure 2.
S_B[[label : seq]]         =  S_B[[seq]]
S_B[[instr ; seq]]         =  ⌈S_I[[instr]]/8⌉ + S_B[[seq]]

S_I[[stop]]                =  0
S_I[[continue label]]      =  0
S_I[[goto label]]          =  op + addr
S_I[[if_b exp goto label]] =  op + S_E[[exp]](b) + addr
S_I[[ref ←_b exp]]         =  op + S_E[[ref]](addr) + S_E[[exp]](b)

S_E[[#const]](b)           =  b
S_E[[var]](b)              =  addr
S_E[[$temp]](b)            =  0
S_E[[(loc)]](b)            =  S_E[[loc]](b)
S_E[[f(val)]](b)           =  S_E[[val]](b)
S_E[[f(val1, val2)]](b)    =  S_E[[val1]](b) + S_E[[val2]](b)

Figure 2: Calculating the size of basic blocks.
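As an illustration, the functions S_E and S_I (for the assignment case) might be realized as follows in C, hard-coding the 68000 parameters op = 16 and addr = 24 bits from Table 1. The node encoding is our own and, for compactness, folds the grammar's val/ref/exp levels into a single tagged type.

/* Sketch of S_E and S_I (assignment case) from Figure 2. */
#include <stdio.h>

enum { OP = 16, ADDR = 24 };   /* 68000: op, addr [bits] */

typedef enum { CONST, VAR, TEMP, DEREF, UNOP, BINOP } Tag;

typedef struct Exp {
    Tag tag;
    struct Exp *a, *b;     /* children: DEREF/UNOP use a; BINOP uses a, b */
} Exp;

/* S_E[[exp]](b): size contribution of an expression of data size b */
static int size_exp(const Exp *e, int b)
{
    switch (e->tag) {
    case CONST: return b;                 /* #const occupies b bits     */
    case VAR:   return ADDR;              /* an absolute address        */
    case TEMP:  return 0;                 /* encoded in the op code     */
    case DEREF: return size_exp(e->a, b);
    case UNOP:  return size_exp(e->a, b);
    case BINOP: return size_exp(e->a, b) + size_exp(e->b, b);
    }
    return 0;
}

/* S_I[[ref <-_b exp]] = op + S_E[[ref]](addr) + S_E[[exp]](b) */
static int size_assign(const Exp *ref, const Exp *exp, int b)
{
    return OP + size_exp(ref, ADDR) + size_exp(exp, b);
}

int main(void)
{
    /* var <-_16 #23: 16 + 24 + 16 = 56 bits, i.e. ceil(56/8) = 7 bytes */
    Exp var = { VAR, 0, 0 }, c23 = { CONST, 0, 0 };
    printf("%d bits\n", size_assign(&var, &c23, 16));
    return 0;
}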
3.2 Estimating Execution Time

The timing estimation is done in two steps:

1. First we calculate, for each basic block, the execution time from start to end. This is discussed further below.

2. We then calculate the number of times each basic block is entered during the execution of the program. This can be done either by finding the longest path through the graph (if we want worst-case performance) or by running the program on a set of typical inputs and counting the number of times each block is visited (if we want average-case performance).

Figure 3 defines the function T_B used for calculating the execution time of a basic block. We assume that the execution of an instruction proceeds according to the following steps:

1. Fetch instruction and operands: requires ⌈S_I[[instr]]/data⌉ reads from memory, since all the bits of the instruction have to be read over the data bus.

2. Read operand values: gives a certain number of memory reads, depending on the number of operands and where they are stored.

3. Execute instruction: the time depends on the function to be calculated. Normally, addition and similar operations are very quick, while multiplication and division take much more time.

4. Write result: requires a certain number of memory writes, depending on the size of the data to be written.

A C sketch of these equations follows Figure 3.
T_B[[label : seq]]         =  T_B[[seq]]
T_B[[instr ; seq]]         =  read(S_I[[instr]]) + T_I[[instr]] + T_B[[seq]]

T_I[[stop]]                =  0
T_I[[continue label]]      =  0
T_I[[goto label]]          =  T_jump
T_I[[if_b exp goto label]] =  T_E[[exp]](b) + T_jump
T_I[[ref ←_b exp]]         =  T_E[[ref]](b) + T_E[[exp]](b)

T_E[[#const]](b)           =  0
T_E[[var]](b)              =  read(b)
T_E[[$temp]](b)            =  0
T_E[[(loc)]](b)            =  T_E[[loc]](addr) + read(b)
T_E[[f(val)]](b)           =  T_E[[val]](b) + T_f
T_E[[f(val1, val2)]](b)    =  T_E[[val1]](b) + T_E[[val2]](b) + T_f

read(b)                    =  mem · ⌈b/data⌉

Figure 3: Calculating the execution time of basic blocks.
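Correspondingly, read(b) and the T_E / T_I equations might be sketched as follows, using the 68000 figures mem = 4 cycles, data = 16 bits, addr = 24 bits, T_jump = 2 cycles and T_* = 66 cycles from Table 1; the node encoding matches the size sketch above and is again our own invention.

/* Sketch of read(b) and the T_E / T_I equations of Figure 3. */
#include <stdio.h>

enum { MEM = 4, DATA = 16, ADDR = 24, T_JUMP = 2 };

typedef enum { CONST, VAR, TEMP, DEREF, UNOP, BINOP } Tag;

typedef struct Exp {
    Tag tag;
    int t_f;               /* T_f of the function node, e.g. 66 for '*'  */
    struct Exp *a, *b;
} Exp;

/* read(b) = mem * ceil(b / data) */
static int read_time(int b) { return MEM * ((b + DATA - 1) / DATA); }

/* T_E[[exp]](b) */
static int time_exp(const Exp *e, int b)
{
    switch (e->tag) {
    case CONST: return 0;                 /* fetched with the instruction */
    case VAR:   return read_time(b);
    case TEMP:  return 0;                 /* held in a register           */
    case DEREF: return time_exp(e->a, ADDR) + read_time(b);
    case UNOP:  return time_exp(e->a, b) + e->t_f;
    case BINOP: return time_exp(e->a, b) + time_exp(e->b, b) + e->t_f;
    }
    return 0;
}

/* T_I[[ref <-_b exp]]; the instruction fetch, read(S_I[[instr]]),
 * is added separately per Figure 3 */
static int time_assign(const Exp *ref, const Exp *exp, int b)
{
    return time_exp(ref, b) + time_exp(exp, b);
}

int main(void)
{
    /* var <-_16 var * var: three 16-bit accesses (4 cycles each)
     * plus T_* = 66, giving 78 cycles, excluding instruction fetch */
    Exp v = { VAR, 0, 0, 0 };
    Exp mul = { BINOP, 66, &v, &v };
    printf("%d cycles\n", time_assign(&v, &mul, 16));
    /* a conditional jump on var: T_E[[var]](16) + T_jump = 4 + 2 = 6 */
    printf("%d cycles\n", time_exp(&v, 16) + T_JUMP);
    return 0;
}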
The total execution time C, in clock cycles, of a program is calculated by:

    C(prog) = Σ_{bb ∈ prog} n_bb · T_B(bb)
where bb is a basic block of the program prog, and n_bb is the number of times the block is entered. A small numeric sketch of this sum is given below.
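The three block times and visit counts in the following illustration are invented values, not taken from the paper's benchmarks.

/* Sketch of the top-level sum C(prog). */
#include <stdio.h>

int main(void)
{
    long t_b[] = { 12, 78, 9 };    /* T_B(bb) [cycles] per basic block  */
    long n_b[] = { 1, 100, 1 };    /* n_bb: times each block is entered */
    long total = 0;
    for (int i = 0; i < 3; i++)
        total += n_b[i] * t_b[i];
    printf("C(prog) = %ld cycles\n", total);   /* 12 + 7800 + 9 = 7821 */
    return 0;
}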
4 Model Validation

To validate the model, we wrote a test set of five small C programs, which were compiled for two different microprocessors, the Motorola 68000 and the Intel 80386. We used the same compiler front-end, but different back-ends, for both processors.¹ Data about the processors' characteristics were collected from [2] and [5], respectively, and are summarized in Table 1. The example programs were:
- fak: calculates the factorial of 10;
- bubble: sorts a vector of 20 elements using the bubblesort algorithm;
- reverse: reverses a linked list of 10 elements;
- diffeq: solves a differential equation numerically;
- elliptic: calculates the equation of a digital filter.
¹ The compiler used is available (at the time of writing) by anonymous ftp from bugs.nosc.mil/pub/Minix/common-pkgs/c386-4.2b.tar.Z.
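The paper does not list the benchmark sources; purely as an illustration of their scale, the first one might look something like this.

/* A guess at the shape of the fak benchmark; illustrative only. */
#include <stdio.h>

int main(void)
{
    long f = 1;
    for (int i = 2; i <= 10; i++)   /* factorial of 10 */
        f *= i;
    printf("10! = %ld\n", f);       /* 3628800 */
    return 0;
}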
Factor    Unit      68000   80386
f         MHz           8      20
mem       cycles        4       2
data      bits         16      32
addr      bits         24      32
op        bits         16      16
T_+       cycles        0       0
T_*       cycles       66      20
T_jump    cycles        2       3

Table 1: Characteristics of the Motorola 68000 and Intel 80386 processors.

                Motorola 68000            Intel 80386
Program      Actual  Estim   Error    Actual  Estim   Error
fak              38     41   +7.9%        47     46   -2.1%
bubble          116    103  -11.2%       114    110   -3.5%
reverse          60     65   +8.3%        86     72  -16.3%
diffeq           86     81   -5.8%        96     98   +2.1%
elliptic        286    266   -7.0%       334    332   -0.6%

Table 2: Actual and estimated size (in bytes) of the example programs on the Motorola 68000 and Intel 80386.
The actual size and execution time of the programs were measured by inspecting the instructions in the assembler code generated by the compiler, and these figures were then compared to the numbers estimated using our model. (We could also have executed the programs and measured the time, but the measured times would then have included initialization routines that were not taken into consideration by the estimator.) The results of this validation are summarized in Table 2, for program size, and in Table 3, for execution time. The average estimation error for the example programs is a modest 5.6% for the execution time, and 6.5% for the program size.
                Motorola 68000               Intel 80386
Program       Actual   Estim   Error     Actual   Estim   Error
fak            1,430   1,476   +3.2%        512     518   +1.2%
bubble        47,608  45,368   -4.7%     19,024  19,140   +0.6%
reverse        1,594   1,688   +5.9%        490     530   +8.2%
diffeq         4,952   4,788   -3.3%      1,590   1,652   +3.9%
elliptic         924     812  -12.1%        294     332  +12.9%

Table 3: Actual and estimated execution time (in clock cycles) of the example programs on the Motorola 68000 and Intel 80386.
It is interesting to compare the accuracy of our model with that of Gong's [3], which is the model most similar to ours; they report average errors of 5.7% and 6.1% for time and size, respectively, i.e. our estimator and Gong's have almost identical quality. But with our model, the user only needs to provide a small fraction of the information needed in Gong's model.
5 Advanced Topics

The model described in this paper has been targeted mainly at traditional CISC-type processors, but many modern processors have RISC architectures. We believe that our model can be extended to capture many of the advanced RISC features in an adequate way, and we discuss some of them below.
- Cache memories can be handled by replacing the fixed time for memory accesses by an expression like P_hit · cache + (1 - P_hit) · mem, which gives the expected duration of a memory access. Here, P_hit is the probability of a cache hit, which is a function of the cache size, and cache is the number of clock cycles necessary to access the cache. Even better estimations are possible if the code is analyzed to find out whether the same data or instruction is likely to have been referenced recently, and thus has an increased probability of being in the cache. (A small numeric sketch of the expected access time follows this list.)

- Pipelines can be estimated by dividing the calculated instruction time by the expected speedup, which is a function of the pipeline length. Some of the overhead caused by branch instructions can be captured by analyzing the control flow through the basic blocks.

- Superscalarity can also be estimated if we consider a small region of instructions together, instead of just a single one. If some of the instructions in such a region are independent, and can be executed by separate units simultaneously, we can take the maximum instead of the sum of their execution times.

Including these features makes the model more complicated, but a more severe problem is the validation. Simply counting instructions in the actual program, as we did for the original model, is not enough, since the execution time of the same instruction will differ depending on whether the instruction and data are in the cache or not. Probably, the only solution is to execute the program and measure the actual time, but this is also non-trivial, especially if one is interested in estimating worst-case performance.
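As a small numeric sketch of the cache extension: the hit probability and access times below are assumed placeholder values, not measurements from any processor.

/* Expected memory access time under the cache extension:
 * P_hit * cache + (1 - P_hit) * mem; all values are assumptions. */
#include <stdio.h>

int main(void)
{
    double p_hit = 0.90;   /* assumed probability of a cache hit  */
    double cache = 1.0;    /* cache access time [cycles]          */
    double mem   = 4.0;    /* main memory access time [cycles]    */
    printf("expected access: %.2f cycles\n",
           p_hit * cache + (1.0 - p_hit) * mem);  /* 1.30 cycles  */
    return 0;
}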
6 Conclusions

In this paper, we have presented a general model for predicting the size and execution time of programs on different microprocessors. The model requires a minimum of information about the processor, and hence it is very easy to port. The empirical results presented in Section 4 indicate that we can use it to get very accurate estimations. There are, however, situations where it will not provide good predictions:

- If the compiler makes code improvements that are very different from those foreseen by the timing analysis system, we can expect a large estimation error. But this problem is inherent when one tries to do an analysis at a language level higher than the actual assembler language generated by the actual compiler for the actual microprocessor. We believe, however, that we can capture many of the optimizations by doing a thorough analysis at the intermediate level, and by using the same front-end for the actual compiler and the analysis system.

- If the processor has a structure which is very different from the model's anticipations, e.g. if it has some very specialized instructions or uses advanced features like superscalarity, we will also get a large error. We should then instead look for other models that are more suitable.

Despite these problems, which are common to all high-level timing analysis methods, we believe that our model can be useful for a wide range of microprocessors that are common in the embedded systems field, and can thus provide good support for rapid evaluation of the consequences of many crucial design decisions.
References

[1] P. Ein-Dor and J. Feldmesser. Attributes of the performance of central processing units: a relative performance prediction model. Communications of the ACM, 30(4):308-317, Apr. 1987.

[2] W. Ford and W. Topp. The MC68000 Assembly Language and Systems Programming. Heath, 1988.

[3] J. Gong, D. D. Gajski, and S. Narayan. Software estimations from executable specifications. Journal of Computer & Software Engineering, 2(3):239-258, 1994.

[4] M. G. Harmon, T. P. Baker, and D. B. Whalley. A retargetable technique for predicting execution time of code segments. Real-Time Systems, 7(2):159-182, Sept. 1994.

[5] R. P. Nelson. Microsoft's 80386/80486 Programming Guide. Microsoft Press, second edition, 1991.

[6] C. Y. Park and A. C. Shaw. Experiments with a program timing tool based on source-level timing schema. IEEE Computer, 24(5):48-57, May 1991.

[7] A. C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15(7):875-889, July 1989.