Hardware/Software Partitioning and Minimizing Memory Interface Traffic

Axel Jantsch, Peeter Ellervee, Johnny Öberg, Ahmed Hemani and Hannu Tenhunen
Electronic System Design Laboratory, Royal Institute of Technology, ESDLab, Electrum 229, S-16440 Kista, Sweden
Email: {axel|lrv|johnny|ahmed|hannu}@ele.kth.se

Abstract

We present a fully automatic approach to hardware/software partitioning and memory allocation by applying compiler techniques to the hardware/software partitioning problem and linking a compiler to a behavioural VHDL generator and high level hardware synthesis tools. Our approach is based on a hierarchical candidate preselection technique and allows (a) efficient collection of profiling data, (b) fast partitioning, and (c) high complexity of the hardware partition.

1 Introduction

A critical issue in complex embedded system design is to quickly find an effective hardware/software partitioning with good complexity and performance estimates. This early system partitioning must also result in more refined specifications which can be used directly for automatic hardware and software generation. The advent of mature high level and logic level synthesis tools for hardware design allows for automation of HW/SW codesign and for systematic exploration of trade-offs in HW/SW partitioning at the system level, because these tools bridge the gap between an algorithmic specification and its implementation at the layout level. The objective of this research is to develop methods and tools for hardware/software codesign of embedded systems, in particular for the partitioning problem, with a focus on telecommunication applications. Our target architecture is a board with a microprocessor, one or more ASICs, FPGAs, or standard components, and memory. We have chosen C and C++ as specification languages and adopt compiler based techniques. We are building a prototyping environment based on the Sparc microprocessor, Xilinx FPGAs, and a logic emulator system to investigate all phases of the codesign cycle.

In this paper we focus on the partitioning problem and suggest a memory allocation technique and a novel partitioning algorithm based on a hierarchical candidate selection scheme which, together with the use of profilers and estimators, allows us to solve the partitioning problem efficiently with a dynamic programming technique. In particular, memory allocation has not been addressed thoroughly in previous work, but it is one of the key issues in hardware/software partitioning since both hardware and software parts frequently have high memory requirements and communicate via memory. In the next sections we discuss related work, present our tool and prototyping environment, describe algorithms for the memory allocation and partitioning problems, and finally present some results.

2 Related Work

Athanas' ambition [1] to automatically partition a C program into HW and SW parts and implement the HW parts in FPGAs is similar to ours but less general. He considers only pure combinational circuits without using high level synthesis techniques, he does not allow main memory access from the FPGA board, and the partitioning process is guided by a human user. Ernst and Henkel [6] use an extension of C as the specification language for embedded controller design. Their objective is to extract code segments for implementation in hardware when timing constraints are violated. Their partitioner is based on simulated annealing and considers every segment of C statements as a possible candidate, which is inefficient since most segments are not good candidates when the program includes loops. The profiling data is collected by a simulator, which is much slower than compiling and executing a C program; this makes it hard to run long programs with extensive test data.

Martin Edwards [4] describes a system, written in C, which collects profiling information by compiling and executing the program. Since he considers only functions as candidate regions, he may miss more promising larger or smaller regions such as inner loops. As in Athanas' work the regions are selected by a human user, and no automatic partitioning has been reported yet. Gupta and De Micheli [7] propose a hardware oriented approach and use HardwareC, a limited subset of C, as the system specification language. Their design system gradually moves functions into software while considering timing constraints and synchronization. The hardware oriented approach and the use of HardwareC severely limit the overall system complexity.

3 System Overview

[Figure 1. Rapid prototyping system: a workstation with a Sparc CPU connected via the SBus to user configurable logic with local memory; the compiler generates object code for the CPU and behavioural VHDL for the configurable logic.]

Figure 1 shows the rapid prototyping environment. The specification is partitioned by an extended version of the GNU CC compiler [12] into hardware and software parts. The GNU CC compiler was chosen because it (a) provides access to state-of-the-art compiler technology, (b) is highly portable and available for many popular microprocessors, including those often used in embedded systems, and (c) is well documented and easily extendable. The hardware partition is derived from one or more source code regions and automatically translated into behavioural VHDL.

[Figure 2. Partitioning tool flow: C or C++ specification; the compiler inserts profiling marks; the program is executed to collect profiling information; local candidate preselection; estimation of the hardware implementation and calculation of the speedup; partitioning into behavioural VHDL (processed by SYNT, Synopsys, and ASIC vendor tools) and assembly code.]

Figure 2 gives an overview of the partitioning method. The compiler has to be invoked three times: first to insert the profiling marks, second to identify code regions suitable for a hardware implementation, which are called candidates, and finally to generate the assembly and VHDL code according to the selected partitioning. The hierarchical candidate selection scheme identifies inner loops, leaf functions, and loops and functions which include already identified candidates. It is described in more detail in [10]. Our method allows any C or C++ statement to be implemented in hardware, with three exceptions: floating point data types, floating point operations, and library functions whose source code is not available cannot be implemented in hardware.

After local preselection, when candidates have been identified, a separate program invokes the estimators to estimate the properties of the hardware implementations of possible candidates and calculates the speedup factor for each individual selection. When all these data have been collected and calculated, the partitioner takes a global view of the whole program, which may consist of many functions and files, and selects those candidate regions for hardware implementation that fit in the available hardware and together provide the highest speedup.

Each region selected for hardware implementation is translated into a VHDL process. This translation is straightforward since all simple data types and operators in C can be expressed in behavioural VHDL when the appropriate arithmetic packages are used. Accesses to complex data types are transformed into address calculations and primitive memory access operations. Pointers are regarded as memory addresses. For the main program the invocation of such a region is similar to a function call: values that reside in registers at the time of the invocation and are used or updated inside the region are passed as parameters.

The generated VHDL code is processed by SYNT [9], a high level synthesis system which generates a dedicated datapath and a separate controller. Synopsys tools perform logic synthesis and low level optimizations, and the ASIC vendor specific tools place and route the design and generate configuration files. The architecture of our prototyping system consists of a conventional workstation with a user configurable add-on board connected to the workstation via the system bus [8]. An on-board interface chip implements direct virtual memory access and allows reasonably short memory access times. User configurable logic can be provided either by field programmable gate arrays from Xilinx or by a logic emulator system with up to 100,000 gates from Quickturn [13]. The remaining software part of the program is augmented with code that performs the communication between the software and the candidates implemented on the board, and with a startup routine that downloads the configuration files for the user configurable logic onto the board.

4 Memory Allocation

High memory traffic and memory access time are among the major obstacles to better system performance. Introducing a memory hierarchy is a common technique to take the edge off the problem. Our proposed board architecture has a three level hierarchy: main memory, local on-board memory, and registers on the FPGA. On-chip registers are the fastest storage, and high flexibility in allocating registers is a main advantage of field programmable devices compared to microprocessors. On-board memory is the second best choice; due to fast static RAMs it can usually be accessed in one FPGA clock cycle. Main memory access from the board is expected to be very slow, probably slower than main memory access from the CPU, and should be avoided whenever possible. As a whole, the coprocessor board memory hierarchy has to compete with the fast and efficient caches of the CPU, which achieve an average memory access time close to one CPU clock cycle in small, computation intensive inner loops or leaf functions.

The GNU C compiler selects many variables for allocation to CPU registers; these are called virtual CPU registers. Only some of them can be allocated to real CPU registers. Virtual CPU registers are usually of a simple data type such as int, char, or a pointer. We allocate FPGA registers for all virtual CPU registers. In this respect we adopt the compiler's strategy and results but exploit the higher flexibility of field programmable hardware compared to a microprocessor. We never consider allocating these data items in local memory when the FPGA area limit is hit, because (a) we feel it would usually not be competitive with a software implementation, and (b) FPGA area is in most cases not a critical resource.

Data for which the compiler allocates main memory is usually of compound types such as arrays and records. Algorithm 1 investigates whether such data could and should be allocated in on-board memory. The mapping of virtual CPU registers onto FPGA registers has proved to be very successful: in over 2000 candidates we found an average speedup improvement by a factor of 11.9 due to this technique alone. The local memory is not as successful; it results in an average speedup improvement by a factor well below 2.
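As a small illustration, the benefit factor λ used in step 4 of Algorithm 1 below can be computed directly from the quantities the analysis collects. The region names, access counts, and transfer counts in this sketch are invented for illustration; the access times are the A1 architecture parameters from Section 6.

```python
def benefit(a, t_m, t_l, m):
    """Benefit factor lambda: a accesses served from main memory (a * t_m)
    versus a accesses served locally plus m region transfers (a * t_l + m * t_m).
    m is 0, 1, or 2: the number of transfers between local and main memory."""
    return (a * t_m) / (a * t_l + m * t_m)

# Hypothetical regions: (name, number of accesses a, transfers m).
regions = [("coeffs", 1000, 1), ("scratch", 50, 0), ("stream", 200, 2)]
t_m, t_l = 500, 100  # ns; close to the paper's A1 parameters

# Step 4 of Algorithm 1: rank regions by decreasing benefit.
ranked = sorted(regions, key=lambda r: benefit(r[1], t_m, t_l, r[2]),
                reverse=True)
```

A region that is never transferred (m = 0) reaches the raw ratio t_m / t_l, while every required transfer between local and main memory lowers the benefit.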

Algorithm 1. Memory allocation

1. Find all memory references inside a candidate.
2. Inspect their addresses and consider the following cases:
   a. The candidate is a function and the memory reference is the access of a local variable whose size is known at compile time.
   b. The candidate is a loop, the memory reference is the access of a local variable whose size is known at compile time, and this variable is not accessed outside the candidate.
   c. The candidate is a loop, the memory reference is the access of a local variable whose size is known at compile time, and this variable is accessed outside the candidate.
   d. The memory reference is the access of a global variable whose size is known at compile time.
   e. The address is known at run time at the beginning of the candidate's execution. It can be a constant, an input parameter to the function (if the candidate is a function), a virtual CPU register which has not yet been modified inside the candidate, or a memory location if that memory has not yet been written inside the candidate.
   f. The candidate is a loop, the number of its iterations is known, and the address is an induction variable of the loop. An induction variable is a variable which is multiplied with and incremented by constant factors in each loop iteration. Thus, if the initial value of the induction variable at loop start time and the iteration count of the loop are known, all values the induction variable assumes during loop execution can be calculated in advance. In this case the memory reference represents a set of memory locations, called a memory region, whose size is equal to the number of loop iterations. The region is known at the time the loop starts execution.
   g. In all other cases the address is not known.
   Memory references of type a through d represent the memory regions where the accessed variable is allocated. In cases a and b the region is allocated in local memory only; no transfer of data to and from main memory is necessary. In cases c through f the region has to be read from main to local memory before execution of the candidate and written back after execution. If the region is not modified by the candidate, no write back is necessary. If none of the memory elements of the region is used before it is modified and all elements are modified, no transfer from main memory is necessary.
3. Merge regions whenever they overlap partially.
4. Sort the regions according to a factor λ which reflects the benefit gained from allocating the region in local memory:

   λ = (a × t_m) / (a × t_l + m × t_m)

   where a is the number of times the region is accessed, t_m is the access time of main memory, t_l is the access time of local memory, and m indicates whether the region must be read from main memory before the execution of the candidate or written back after the execution; m can assume the values 0, 1, or 2, indicating the number of transfers between local and main memory that have to take place. For variables of type a or b, m is 0. a and m are calculated for each region based on the analysis described above. The greater λ is, the higher the benefit of moving the region to local memory.
5. Select the regions to be allocated in local memory by applying a simplified version of the knapsack stuffing algorithm described below.

5 Partitioning

The GNU CC compiler [12] consists of more than 20 passes. System partitioning is performed by a separate pass before data flow analysis and instruction scheduling. It replaces code designated for hardware implementation with code that passes parameters and invokes the design in the user configurable hardware. Figure 2 shows the consecutive steps of the partitioning process. After the profiling phase a set of candidates is preselected and their hardware implementations are estimated. The candidate preselection is described in more detail in [10]. Here we concentrate on the partitioning algorithm, which is based on dynamic programming.

After the preselection phase we have a set of candidates that should be considered for hardware implementation. With each candidate a cost and a speedup factor are associated. The cost is a natural number and reflects the estimated cost of the hardware implementation in terms of gates. The speedup is a real number representing the program speedup that would be achieved if the candidate were implemented in hardware. Furthermore, we have a cost limit in terms of the number of gates, which represents the available area in the FPGA. All selected candidates have a cost less than the cost limit and a speedup greater than 1. Candidates can include other candidates. Two candidates are called mutually exclusive if one includes the other. Since a hardware implementation of the larger candidate implies a hardware implementation of the smaller one, they must not be selected together, to avoid implementing the same code twice.

We formulate the partitioning problem as finding a subset of the candidates that gains the greatest speedup while still fitting in the available hardware. This is similar to the well known knapsack problem, which can usually be solved efficiently by dynamic programming [3]. The only difference in our problem is that we have mutually exclusive candidates. The knapsack stuffing algorithm, shown as Algorithm 2, solves the problem. If we had no mutually exclusive candidates we would only need to store the best solution to each subproblem rather than all solutions. Storing all solutions means that the algorithm has exponential memory requirements. However, the preselection phase reduces the number of candidates so efficiently that the knapsack stuffing algorithm can be applied even to large programs.

Algorithm 2. Knapsack stuffing

1. Make a list of candidates c_1 … c_n. The order is not relevant.
2. Create an n × n matrix C, where n is the number of candidates; C[i,j] = 1 if candidates c_i and c_j are mutually exclusive.
3. Calculate the greatest common divisor gcd of the cost factors of all candidates and of the cost limit.
4. Create a k × n matrix M, where k = cost limit / gcd. Each element of the matrix is a list of solutions, and each solution is a list of candidates. For i = 1 … k: if cost(c_1) ≤ gcd × i, set M[i,1] to a list of two solutions, one being the empty solution and the other consisting of c_1; otherwise set M[i,1] to a list containing only the empty solution.
5. For i = 1 … k and j = 2 … n, set M[i,j] = M[i,j−1] and then perform step 6.
6. If i − cost(c_j)/gcd < 1, set L to a list containing only the empty solution; otherwise set L = copy(M[i − cost(c_j)/gcd, j−1]). Inspect all solutions of L and for each of them generate a new solution by adding candidate c_j, provided that the sum of the costs of the candidates in the new solution does not exceed the cost limit and that c_j is not mutually exclusive with any of the candidates in the solution; matrix C is used for this check. Add each generated solution to the list M[i,j].
7. Take the solution with the highest speedup from the list M[k,n]; this is the highest achievable speedup for the given cost limit k × gcd.

As indicated above, a simplified version of the knapsack stuffing algorithm is applied to the memory allocation problem described in the previous section. A similar matrix M is constructed. Instead of candidates we have memory regions; the cost factors are replaced by the region sizes, and the speedup factors by λ, which was defined in the previous section. The cost limit is the size of the local memory. The difference is that we do not have to deal with mutually exclusive memory regions. Hence we do not have to store all solutions to subproblems, and the algorithm does not have exponential memory requirements.

6 Results

We used eight C programs as test cases: filter and filter2, two small FIR filter examples; hamming, a small program that calculates the hamming distance of one million pairs of numbers; sed, a stream editor; grep and egrep, two pattern matchers; gzip, a file compression program; and ghostscript (gs), a program that visualizes postscript files on the workstation screen. These examples represent a wide range of applications. filter, filter2, and hamming are data flow oriented, computationally intensive programs; grep, egrep, and gzip operate on a stream of data, have computationally intensive inner loops, and moderate memory requirements; sed and gs exhibit both low control and data locality with high memory access requirements. gs in particular has been selected to demonstrate the applicability of the partitioner to large and complex specifications. Although we can process C++ specifications, we have chosen C programs because we do not yet exploit any of the C++ properties which distinguish it from C.

Table 1 shows the estimated speedups and gate counts of the solutions found by the partitioner for two different sets of architectural parameters, A1 and A2. They differ mainly in the memory access times: A1 has a 500 ns access time to main memory and 100 ns to local memory, while A2 has 20 ns to both main and local memory. The values of A1 are close to what we expect for our prototype board. A2 resembles a coprocessor architecture where the FPGA has access to memory as fast as the CPU. The speedup values indicate the performance improvement compared with a pure software implementation on a SPARCstation IPX running at 20 MHz.

For architecture A1 the class of programs that can be accelerated is restricted by the slow memory access. Thus the programs sed, grep, egrep, gzip, and gs cannot expect any performance improvement because of their frequent memory accesses. The second severe limitation to higher speedup factors is that the chosen candidates do not account for a high enough portion of the execution time. In all cases the candidate region could not be enlarged because of library function calls or floating point operations. Although the gate count numbers are only rough estimates, they demonstrate that one XC4005 FPGA is certainly insufficient. But the complexity of the hardware partitions is in no case very high and probably well within reach of a larger FPGA. All candidates can definitely be implemented in one ASIC. Hence, the limit to enlarging the candidate regions comes from unimplementable program code rather than from size restrictions of the hardware. filter shows only a moderate performance improvement compared with filter2 because a sine function generator is implemented in software by a call to a library function, which results in a high communication overhead between the hardware and software parts. A more detailed discussion of the results of applying the partitioner to various C programs can be found in [11].

Table 2 shows the performance of the partitioning procedure on a SPARCstation IPX. Note that the partitioner has been optimized neither for speed nor for memory usage. The first two rows show the execution time of the candidate preselection phase. Local preselection analyses each function separately and generates all possible candidates, while global preselection considers all candidates at the same time and estimates their cost, performance, and speedup factors. The third row gives the time of the partitioning phase, in essence the knapsack stuffing algorithm. Row four, labeled "total partitioning time", is the sum of rows one through three. Row five gives the time of the rest of the compilation: parsing, various optimizations, and assembly code generation. Row six gives the total CPU time, essentially the sum of rows four and five. Row seven shows the ratio of the partitioning time to the total compilation time. VHDL code generation, program assembly, and linkage are not included in these figures. Row eight gives the number of candidates generated by the local preselection, and rows nine and ten show the number of input and output candidates of the partitioning, respectively. The last row shows the number of source code lines of each program. The partitioning time is dominated by the local analysis and preselection phase. It is in all cases less than half the total compilation time. Note that the number of candidates after the preselection phase is very small for all programs, because only few candidates meet the criterion of speeding up the whole program by a factor greater than 1.0. Usually only candidates that account for at least 10% of the program's run time are able to speed up the program at all, and the number of such candidates is always limited. Thus, we do not expect more than ten candidates after preselection for any program.
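As a concrete sketch of the knapsack stuffing procedure of Section 5, the following Python function keeps, for every (cost bound, candidate prefix) subproblem, the full list of feasible solutions, as the mutual exclusion constraint requires. The candidate values are invented, and the assumption that individual speedup factors combine multiplicatively is a simplification introduced here, not something the paper states.

```python
from functools import reduce
from math import gcd

def knapsack_stuffing(candidates, cost_limit, exclusive):
    """candidates: list of (cost, speedup) pairs; exclusive: set of
    frozenset({i, j}) index pairs that are mutually exclusive.
    Returns the selected candidate indices with the highest speedup."""
    n = len(candidates)
    g = reduce(gcd, [c for c, _ in candidates] + [cost_limit])
    k = cost_limit // g
    # M[i][j]: all feasible solutions drawn from candidates 0..j whose cost
    # fits in (i + 1) * g gates.  All solutions are kept, not just the best,
    # because mutual exclusion can make a locally worse solution extendable
    # where the locally best one is not.
    M = [[None] * n for _ in range(k)]
    for i in range(k):
        M[i][0] = [(), (0,)] if candidates[0][0] <= g * (i + 1) else [()]
    for j in range(1, n):
        cost_j = candidates[j][0]
        for i in range(k):
            sols = list(M[i][j - 1])
            r = i - cost_j // g  # row that leaves room for candidate j
            L = [()] if r < 0 else M[r][j - 1]
            for s in L:
                if (sum(candidates[x][0] for x in s) + cost_j <= cost_limit
                        and all(frozenset((j, x)) not in exclusive for x in s)):
                    sols.append(s + (j,))
            M[i][j] = sols
    def speedup(s):
        # Simplifying assumption: selected candidates speed the program up
        # independently, so their factors multiply.
        p = 1.0
        for x in s:
            p *= candidates[x][1]
        return p
    return max(M[k - 1][n - 1], key=speedup)
```

With three candidates of costs 4, 6, and 2 gates, a cost limit of 8, and the first two mutually exclusive, the function selects the second and third candidates; the exponential worst case is harmless here precisely because preselection leaves so few candidates.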

Table 1. Candidate cost and program speedup

                  filter  filter2  hamming  sed    grep   egrep  gzip   gs
speedup (A1)      1.08    4.11     14.4     1.0    1.0    1.0    1.0    1.0
speedup (A2)      1.17    5.01     14.4     1.15   1.12   1.45   2.93   1.03
gate count (A1)   5120    5920     12416    0      0      0      0      0
gate count (A2)   5120    5920     12416    1120   5504   5984   11712  960

Table 2. Performance of the partitioning process

                          filter  filter2  hamming  sed       grep     egrep    gzip      gs
local preselection        0.03 s  0.02 s   0.07 s   11.23 s   12.65 s  12.32 s  32.61 s   323.49 s
global preselection       0.00 s  0.00 s   0.00 s   2.8 s     2.3 s    2.3 s    1.9 s     20.7 s
partitioning              0.04 s  0.06 s   0.27 s   0.37 s    1.24 s   2.32 s   3.12 s    17.92 s
total partitioning time   0.07 s  0.08 s   0.34 s   14.4 s    16.19 s  16.94 s  37.63 s   362.11 s
compilation               0.87 s  0.82 s   0.41 s   118.38 s  76.47 s  74.82 s  75.00 s   789.28 s
total CPU time            0.94 s  0.90 s   0.75 s   132.78 s  92.66 s  91.76 s  112.63 s  1151.39 s
fraction                  7.5%    8.9%     43.3%    10.9%     17.5%    18.5%    33.4%     31.45%
generated candidates      1       3        5        326       246      246      225       1471
considered candidates     1       2        4        4         8        9        8         5
selected candidates       1       1        1        2         4        5        5         2
lines of code             49      56       26       12855     6832     6832     7399      78083

7 Conclusions

We have presented a partitioning algorithm which is based on a candidate preselection technique. It is fast enough to be applied to very large specifications. A memory allocation method allocates memory accessed by the hardware partition as close as possible to the hardware. It minimizes the interface traffic between the software and hardware partitions and reduces the average memory access time. It has been demonstrated that the partitioner can partition large programs and can be used to analyse the main limitations to performance improvement and, by varying architectural parameters, how a program would benefit from different architectures. Future work will concentrate on setting up the hardware design tool chain by linking together high level, logic level, and layout level synthesis tools. We will also exploit properties of abstract data types in C++ by considering complete class objects as candidates. This would allow all class functions to be implemented in hardware and all data private to the class to be assigned either to on-chip registers or to on-board local memory. In addition, we intend to develop data transfer profilers that provide information about the access frequency of memory regions and about the communication between candidates. The latter information would indicate which candidates should preferably be implemented together in hardware.

8 Bibliography

[1] Peter M. Athanas, An Adaptive Hardware Machine Architecture and Compiler for Dynamic Processor Reconfiguration, PhD thesis, Division of Engineering, Brown University, Providence, Rhode Island, September 1991.
[2] Th. Benner, J. Henkel, and R. Ernst, "Internal Representation of Embedded HW/SW Systems", 2nd IFIP International Workshop on HW-SW Codesign, May 1993.
[3] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, Massachusetts, 1992.
[4] Martin Edwards, "A Development System for Hardware/Software Co-Synthesis using FPGAs", 2nd IFIP International Workshop on HW-SW Codesign, May 1993.
[5] Peeter Ellervee, Axel Jantsch, Johnny Öberg, and Ahmed Hemani, "Nestimator - A Neural Network Based Estimator", accepted for publication at the 7th Annual IEEE International ASIC Conference, ASIC'94, September 1994.
[6] R. Ernst and J. Henkel, "Hardware-Software Codesign of Embedded Controllers Based on Hardware Extraction", Int. Workshop on HW-SW Codesign, September 1992.
[7] Rajesh K. Gupta and Giovanni De Micheli, "System-level Synthesis using Re-programmable Components", Proceedings of the European Design Automation Conference, 1992.
[8] Shousheng He and Mats Torkelson, "FPGA Application in an SBus Based Rapid Prototyping System for ASDP", ICCD 1991.
[9] Ahmed Hemani and Adam Postula, "A Neural Net Based Self Organizing Scheduling Algorithm", European Design Automation Conference, pp. 136-140, 1990.
[10] Axel Jantsch, Peeter Ellervee, Johnny Öberg, Ahmed Hemani, and Hannu Tenhunen, "A Software Oriented Approach to Hardware/Software Codesign", Proceedings of the Poster Session of the International Conference on Compiler Construction, Edinburgh, April 1994.
[11] Axel Jantsch, Peeter Ellervee, Johnny Öberg, Ahmed Hemani, and Hannu Tenhunen, "A Case Study on Hardware/Software Partitioning", Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1994.
[12] Richard M. Stallman, Using and Porting GNU CC, version 2.3, Free Software Foundation, December 1992.
[13] J. Varghese, M. Butts, and J. Batcheller, "An Efficient Logic Emulation System", IEEE Trans. VLSI Systems, vol. 1, no. 2, pp. 171-174, June 1993.