Design Space Exploration with Automatic Selection of SW and HW for Embedded Applications Júlio C. B. Mattos1, Antônio C. S. Beck1, Luigi Carro1,2, Flávio R. Wagner1 1 Federal University of Rio Grande do Sul, Computer Science Institute Av. Bento Gonçalves, 9500 - Campus do Vale - Porto Alegre, Brasil 2
Federal University of Rio Grande do Sul, Electrical Engineering Av. Oswaldo Aranha 103 - Porto Alegre, Brasil {julius, caco, carro, flavio}@inf.ufrgs.br
Abstract. This paper presents a methodology for the automatic selection of software and hardware IP components for embedded applications. Design space exploration is achieved by the correct selection of a SW-IP block to be executed on a platform containing different implementations of the same ISA. These different implementations of the same ISA dissipate different amounts of power and require different numbers of cycles to execute the same program. Thus, by choosing the processor on which the software executes so as to achieve the required power or energy figures of an application, one can fine-tune the platform for the desired market.
1 Introduction
Embedded systems are everywhere: in consumer electronics, entertainment, communication systems, and so on. Embedded applications demand varying combinations of requirements such as performance, power, and cost. Since different platforms and IP cores are available, precise technical metrics regarding these factors are essential for a correct comparison among alternative architectural solutions when running a particular application. To these physically related metrics one should add another dimension, which is software development cost. In platform-based design, design derivatives are mainly configured by software, and software development is where most of the design time is spent. But the quality of the software development also directly impacts the physical metrics mentioned above. Presently, there is a wide variety of Intellectual Property (IP) blocks, such as processor cores with several architectural styles (RISC, DSP, VLIW). There is also an increasing number of software IPs that can be used in a complex embedded system design. Thus, with this wide range of SW and HW IP solutions, the designer has several possibilities and needs methodologies and tools for efficient design space exploration.
Presently, the software designer writes the application code and relies on a compiler to optimize it [1][2][3]. However, it is widely known that design decisions taken at higher abstraction levels can lead to substantially larger improvements. Software engineers involved with the software configuration of embedded platforms, however, do not have enough experience to measure the impact of their algorithmic decisions on issues such as performance and power. Moreover, different applications have different resource requirements during their execution. Some applications may have a large amount of instruction-level parallelism (ILP), which can be exploited by a processor that issues many instructions per cycle. Other applications have little ILP and can be executed by a simpler single-issue processor. Software design exploration is done mainly through software optimizations, as in [1][2][4]. Hardware design exploration is done by selecting the right or best architecture for a given application, as in [5][6][7]. This paper proposes a pragmatic approach, consisting of the use of a software library, a set of different processor cores (all with the same instruction set), and a design space exploration tool that allows automatic software and hardware IP selection. The software IP library contains alternative algorithmic implementations of routines commonly found in embedded applications, previously characterized regarding performance, power, and memory requirements on each processor core. For the execution of an application, each different implementation of the ISA can be turned on or off, so that the platform can achieve the desired execution time within limits for power dissipation or energy consumption.
Thus there are different levels of optimization: one at the software level and another at the hardware level, selecting the best core among alternatives that provide different performance levels and consume different amounts of power. By exploring design alternatives at the algorithmic and architectural levels, which offer a much wider range of power and performance figures, the designer is able to automatically find corner cases through the exploration tool. As a very important side effect, one obtains an algorithm and an architecture that exactly fit the requirements of the application, without unnecessarily wasting resources. This paper is organized as follows. Section 2 discusses related work in the field of embedded software optimization and hardware exploration. Section 3 presents our approach to design space exploration, introducing the software IP library and the platform (HW IP cores). Section 4 presents the design exploration tool and experimental results, and Section 5 draws conclusions and introduces future work.
2 Related Work
In software design exploration there are several works, mainly based on software optimizations. Power-aware software optimization has gained attention in recent years. It has been shown [8] that each instruction of a processor has a different power cost. Instruction power costs may vary in a very wide range and are also strongly affected by addressing modes. By taking these costs into consideration, a 40% power improvement obtained by code optimizations is reported [8]. Reordering of instructions in the source code has also been proposed [9], considering that power consumption depends on the switching activity and thus also on the particular sequence of instructions; improvements of up to 30% are reported. In [10], an energy profiler is used to identify critical arithmetic functions and replace them using polynomial approximations and floating-point to fixed-point conversions. In [4], a library of alternative hardware and software parameterized implementations of Simulink blocks, presenting different performance, power, and area figures, is characterized. Although some code optimizations may improve both performance and power, many others improve one factor while worsening the other. Recent efforts are oriented towards automatic exploration tools that identify several points in the design space corresponding to different tradeoffs between performance and power. In [5], Pareto-optimal configurations are found for a parameterized architecture running a given application. Among the solutions, the performance range varies by a factor of 10, while the power range varies by a factor of 7.5. In [6], the best performance/power figures are selected among various application-to-architecture mappings. These solutions exploit the architecture to obtain the best option for a particular application. In [7], a methodology that automates the design of application-specific computer systems is proposed. This approach takes an application written in C as input and generates a VLIW processor and an optional nonprogrammable accelerator (one or more units) to execute compute-intensive loops. Kumar [11] presents an interesting idea: a heterogeneous multi-core architecture with a single ISA, used to reduce power dissipation. During application execution, system software evaluates the resource requirements of the application and chooses the most appropriate core.
In [12], a methodology for architecture exploration of heterogeneous signal processing systems is proposed. This exploration starts from executable specifications modeled as Kahn Process Networks, and the result is an architecture capable of executing the applications within system constraints. [13] also presents a technique for the exploration of architectural alternatives, mapping an application onto the best architecture solution.
3 The proposed approach
In the proposed design space exploration approach, illustrated in Figure 1, the design starts with an application specification at a high level of abstraction, and the application code is automatically generated by a tool. This tool supports design space exploration based on a software IP library and a set of cores that are different implementations of the same ISA. Note that, in this approach, the designer does not need to know the target platform, because this information is used only for the library characterization. This differs from the traditional design flow, where the designer must know the target platform and all optimizations are entrusted to the compiler.
Fig. 1. The design space exploration approach
The software IP library contains different algorithmic versions of the same function (such as sine, table search, square root, and the IMDCT, Inverse Modified Discrete Cosine Transform), thus supporting design space exploration. For a given core (HW IP) and for each algorithmic implementation of the library functions, the performance, memory usage, and energy and power dissipation are measured. In this way, the characterization of the software library is performed according to physically related aspects that can be changed at the algorithmic abstraction level. At the hardware level, this approach uses different implementations of the same Instruction Set Architecture, providing a range of solutions in performance, power, and so on. The architecture is capable of supporting native Java instructions, since Java is a widespread language in the arena of embedded systems. Thus, this methodology allows the automatic selection of software and hardware IPs to better match the application requirements. Moreover, if the application constraints change, for example with tighter energy demands or a smaller memory footprint, a different set of SW and/or HW IPs may be selected. Using this methodology, the design space exploration has several options for a final solution, combining SW IPs and HW IPs in different ways. Using only a single core and different algorithmic versions of the same function, the designer already has a good set of alternatives; when multiple cores are used, the range of solutions increases enormously.
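To make the characterization concrete, a library entry can be pictured as a small record per (routine version, core) pair holding the measured figures. The following Python sketch is only illustrative; the names and data layout are assumptions, not the tool's actual implementation (the numbers are taken from the sine results reported later in Tables 1 and 2):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Characterization:
    """Measured figures for one algorithmic version on one core."""
    cycles: int          # performance, in clock cycles
    power_mw: float      # average power dissipation, in mW
    program_bytes: int   # instruction memory footprint
    data_bytes: int      # data memory footprint

# Hypothetical library layout: routine -> version -> core -> figures
# (sine figures on the multicycle core, from Tables 1 and 2)
library = {
    "sine": {
        "cordic":       {"multicycle": Characterization(2447, 11.8092, 206, 88)},
        "table_lookup": {"multicycle": Characterization(136, 13.4431, 184, 220)},
    },
}
```

Selecting an IP then reduces to looking up these pre-measured figures, rather than re-simulating the application for every candidate.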
3.1 Software Exploration
As already mentioned, the library contains different algorithmic versions of the same function, thus supporting design space exploration. Each algorithmic implementation of the library functions is measured in terms of performance (in cycles), memory usage (for data and instruction memories), and energy and power dissipation. Since embedded systems are found in many different application domains, this investigation was started with classical functions: Sine – two ways to compute the sine of an angle are provided: a simple table search, and the CORDIC (Coordinate Rotation Digital Computer) algorithm [14]; IMDCT – the Inverse Modified Discrete Cosine Transform is a critical step in decompression algorithms like those found in MP3 players; together with windowing, it takes roughly 70% of the processing time [15]. Other functions, such as Table Search, Square Root, and Sort, are also implemented.
3.2 Hardware Exploration
For hardware exploration we use a platform composed of different core implementations of the same ISA. The platform is based on different implementations of a Java microcontroller called FemtoJava [16]. The FemtoJava microcontroller implements an execution engine for Java in hardware, through a stack machine compatible with the Java Virtual Machine (JVM) specification. A CAD environment that automatically synthesizes an Application-Specific Instruction-Set Processor (ASIP) version of the Java microcontroller for a target application [16] is available, using only the subset of instructions critical to the specific application. This paper presents three different versions of the FemtoJava processor: a multicycle, a pipelined, and a VLIW one. The multicycle version supports stack operations through stack emulation on its register file. This approach reduces the memory access bottleneck of the stack machine, improving performance. The second architecture is the pipelined version [17], which has five stages: instruction fetch, instruction decoding, operand fetch, execution, and write-back. The first stage, instruction fetch, is composed of an instruction queue of 9 registers. The first instruction in the queue is sent to the instruction decoder stage.
The decoder has four functions: to generate the control word for the instruction, to handle data dependencies, to analyze forwarding possibilities, and to inform the instruction queue of the size of the current instruction, in order to put the next instruction of the stream in the first place of the queue. The stack and the local variable pool of the methods are kept in the register bank. Once the operands are fetched, they are sent to the fourth stage, where the operation is executed. There is no branch prediction, in order to save area; all branches are assumed not taken, and if a branch is taken, a penalty of three cycles is paid. The write-back stage saves, if necessary, the result of the execution stage back to the register bank, again using the SP or VARS as base. There is a unified register bank for the stack and the local variable pool, because this facilitates the call and return of methods, where each method is located by a frame pointer in the stack. Thanks to the forwarding technique in the stack architecture, the write-back stage is not always executed, and hence there are meaningful energy savings when compared to a regular 5-stage pipeline of a RISC CPU.
The VLIW processor is an extension of the pipelined one. Basically, it has its functional units and instruction decoders replicated. The additional decoders do not support the instructions for call and return of methods, since these are always in the main flow. The local variable storage is placed only in the first register file; when instructions of the other flows need the value of a local variable, they must fetch it from there. Each instruction flow has its own operand stack, with fewer registers than the main operand stack, since the operand stacks of the secondary flows do not grow as much as the main one. The VLIW packet has a variable size, avoiding unnecessary memory accesses. A header in the first instruction of the word informs the instruction fetch controller how many instructions the current packet has. The search for ILP in the Java program is done at the bytecode level. The algorithm works as follows: all the instructions that depend on the result of the previous one are grouped in an operand block. The entire Java program is divided into these groups, which can then be parallelized respecting the functional-unit constraints. For example, if there is just one multiplier, two instructions that use this functional unit cannot execute in parallel.
3.3 Results on Hardware and Software Exploration
To illustrate the results of the library characterization using different algorithmic versions of the same function and different cores (multicycle, pipelined, and a VLIW version with 2 words), two routines were selected: the sine and the Inverse Modified Discrete Cosine Transform. The results in terms of performance, power, and energy were obtained using the CACO-PS simulator [18]. Tables 1 and 2 illustrate the characterization of the alternative implementations of the sine function. Table 1 shows the software results that do not depend on the hardware, and Table 2 shows the results that do. Some entries are straightforward.
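The bytecode-level ILP search described for the VLIW version above can be sketched as a greedy packing of operand blocks under functional-unit limits. This is a simplified, hypothetical Python model (block and unit names are illustrative, the blocks are assumed already independent of each other, and the single-multiplier limit follows the example in the text):

```python
def pack_vliw(operand_blocks, unit_limits=None):
    """Greedily pack independent operand blocks into VLIW packets.

    Each operand block is a chain of dependent bytecodes that must stay
    together; blocks share a packet only while the functional-unit limits
    (e.g. a single multiplier) are respected.
    """
    if unit_limits is None:
        unit_limits = {"mul": 1, "alu": 2}  # illustrative resource counts
    packets, current, used = [], [], {}
    for block in operand_blocks:
        need = {}
        for unit in block["units"]:
            need[unit] = need.get(unit, 0) + 1
        fits = all(used.get(u, 0) + n <= unit_limits.get(u, 0)
                   for u, n in need.items())
        if current and not fits:
            packets.append(current)   # close this packet, start a new one
            current, used = [], {}
        current.append(block["name"])
        for u, n in need.items():
            used[u] = used.get(u, 0) + n
    if current:
        packets.append(current)
    return packets
```

With a single multiplier, two multiply-using blocks end up in different packets, while a third, ALU-only block can join the second one.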
Since CORDIC is a more complex algorithm, its program memory size is larger than with Table Look-up (Table 1), as is the number of cycles required for computation on all the cores (Table 2). It is interesting to notice, however, that when the resolution increases, the amount of data memory increases exponentially for the Table Look-up algorithm, but only sublinearly for the CORDIC algorithm. The increase in memory is reflected not only in the required amount of memory, but also in the power dissipation of a larger memory.

Table 1. Sine characterization (software-dependent)

Characteristic          Cordic    Table 1°   Table 0.5°   Table 0.1°
Program size (bytes)    206       184        184          184
Data mem (bytes)        88        220        400          1840
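For reference, the two algorithmic versions compared above can be sketched in a few lines of Python. This is a behavioral model only; the actual library routines are Java code for FemtoJava, and the resolution and iteration counts here are illustrative:

```python
import math

# Table look-up: precompute the sine at a fixed resolution (here 1 degree);
# a finer resolution means a larger data memory
RES_DEG = 1.0
SIN_TABLE = [math.sin(math.radians(i * RES_DEG))
             for i in range(int(90 / RES_DEG) + 1)]

def table_sin(deg):
    """Nearest-entry look-up, valid for 0..90 degrees."""
    return SIN_TABLE[round(deg / RES_DEG)]

# CORDIC in rotation mode: only shifts and adds, plus a small constant
# table of arctangents -- hence the modest, resolution-independent data memory
ITER = 16
ATAN = [math.atan(2.0 ** -i) for i in range(ITER)]
K = math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(ITER))

def cordic_sin(rad):
    """Rotate (K, 0) by +/- atan(2^-i) steps until the residual angle is ~0."""
    x, y, z = K, 0.0, rad
    for i in range(ITER):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN[i]
    return y
```

With 16 iterations the CORDIC result is accurate to roughly 2^-15, while the look-up accuracy is bounded by the table resolution, mirroring the memory/cycles trade-off characterized above.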
Table 2 presents the results in terms of performance, power, and energy, at a clock frequency of 50 MHz and a Vdd of 3.3 V. As expected, the pipelined and VLIW architectures provide the best results in terms of cycles. The best result in terms of performance is the combination of the simple table-search sine with the VLIW architecture, which is, however, the worst in terms of power. The best combination in terms of power, but the worst in terms of energy, is the CORDIC sine on the multicycle core.

Table 2. Sine characterization (hardware-dependent)

                        Cordic                               Table
Characteristic          Multi     Pipeline   VLIW2           Multi     Pipeline   VLIW2
Performance (cycles)    2447      755        599             136       65         55
Power (mW)              11.8092   16.1626    22.9019         13.4431   17.8235    20.1606
Energy (nJ)             577.9421  244.0559   274.4414        36.5652   23.1705    22.1779
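The hardware-dependent figures in Table 2 are tied together by E = P · t, with execution time t = cycles / f at f = 50 MHz. A quick Python cross-check over a few rows (table values reproduced here; the small residuals come from rounding of the published power figures):

```python
FREQ_HZ = 50e6  # all cores run at 50 MHz

def energy_nj(cycles, power_mw):
    # E = P * t = P * (cycles / f), converted from joules to nanojoules
    return power_mw * 1e-3 * cycles / FREQ_HZ * 1e9

# (cycles, power in mW, reported energy in nJ) -- rows from Table 2
rows = [
    (2447, 11.8092, 577.9421),  # Cordic, multicycle
    (755, 16.1626, 244.0559),   # Cordic, pipelined
    (136, 13.4431, 36.5652),    # Table look-up, multicycle
    (65, 17.8235, 23.1705),     # Table look-up, pipelined
    (55, 20.1606, 22.1779),     # Table look-up, VLIW2
]
for cycles, power, reported in rows:
    assert abs(energy_nj(cycles, power) - reported) < 0.01
```

This also explains why the lowest-power option (CORDIC on the multicycle core) is the worst in energy: its long cycle count dominates the product.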
Tables 3, 4, and 5 show the main results of the characterization of the four different implementations of the IMDCT function. The IMDCT3 implementation has the best performance on all architectures, but its program memory size increases significantly. The opposite happens with the IMDCT implementation, which has far better results in terms of program memory but consumes more cycles. In terms of power (at 50 MHz and Vdd of 3.3 V), the best combinations are IMDCT1 and IMDCT2 on the multicycle core, but these combinations do not achieve good results in terms of energy. Tables 4 and 5 show that the best results in terms of performance are obtained on the VLIW core, since the IMDCT routine has a lot of parallelism.

Table 3. IMDCT characterization (software-dependent)

Characteristic          IMDCT     IMDCT1     IMDCT2     IMDCT3
Program size (bytes)    344       2137       4260       15294
Data mem (bytes)        3546      3546       3546       3546
Table 4. IMDCT characterization (hardware-dependent)

                        IMDCT                                IMDCT1
Characteristic          Multi     Pipeline   VLIW2           Multi     Pipeline   VLIW2
Performance (cycles)    140300    40306      33051           97354     31500      19325
Power (mW)              8.8533    20.0533    24.9227         8.6595    17.8944    25.3609
Energy (µJ)             24.8424   16.1654    16.4744         16.8607   11.2735    9.8021
Table 5. IMDCT characterization (hardware-dependent) (cont.)

                        IMDCT2                               IMDCT3
Characteristic          Multi     Pipeline   VLIW2           Multi     Pipeline   VLIW2
Performance (cycles)    92882     30369      17329           51345     18858      9306
Power (mW)              8.6483    17.7355    27.2193         9.1435    17.3849    34.5890
Energy (µJ)             16.0654   10.7722    9.4334          9.3894    6.5569     6.4380
4 Design space exploration tool
In the examples presented above, the design space concerning performance, power, energy, and memory footprint was large. However, the availability of different alternatives for the same routine is just a first step in the design space exploration of the application software. One must notice that embedded applications are seldom implemented with a single routine. There is another level of optimization, which concerns finding the best mix of routines among all possible combinations that may exist in an embedded application. Currently, we are adapting a design exploration tool that maps the routines of an embedded program to an implementation using instances of the software IP library, so as to fulfil given system requirements. The user program is modeled as a graph, where the nodes represent the routines and the arcs determine the program sequence. The weight of an arc represents the number of times a certain routine is instantiated. To generate the application graph representing the dynamic behavior of the application, an instrumentation tool was developed. This tool performs dynamic analysis of Java class files, generating a list of invoked methods with their corresponding numbers of calls, which can be mapped to the application graph. In the exploration tool, before the search begins, the user may determine weights for power, delay, and memory optimization. It is also possible to set maximum values for each of these variables. The tool automatically explores the design space and finds the optimal or near-optimal mapping for that configuration. The cost function of the search is based on a trade-off between power, timing, and area. Each library option is characterized by these three factors. The exploration tool normalizes these parameters by the maximum power, timing, and area found in the library. The user can then select weights for the three variables.
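Under this scheme, the cost of one candidate mapping amounts to a weighted sum of normalized factors, with per-routine cycle counts scaled by the call counts taken from the application graph. The following Python sketch is hedged: parameter names and the exact normalization are illustrative, not the tool's actual cost function:

```python
def mapping_cost(mapping, weights, maxima):
    """Cost of mapping each routine to one library implementation.

    mapping: per routine, a tuple (calls, cycles, power_mw, mem_bytes);
             'calls' comes from the arc weights of the application graph.
    weights: user-chosen (w_power, w_time, w_mem).
    maxima:  (max_power, max_total_cycles, max_mem) over the library,
             used to normalize the three factors to comparable scales.
    """
    w_p, w_t, w_m = weights
    max_p, max_t, max_m = maxima
    cost = 0.0
    for calls, cycles, power_mw, mem in mapping:
        cost += (w_p * power_mw / max_p           # average power of this version
                 + w_t * calls * cycles / max_t   # execution time, weighted by calls
                 + w_m * mem / max_m)             # static memory footprint
    return cost
```

Raising one weight biases the search toward that factor while the other two still contribute, matching the prioritization behavior described in the text.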
In this way, the search can be directed according to the application requirements. If area cost, for example, must be prioritized because of a small memory space, the user may increase the area weight. Although one characteristic might be prioritized, the others are still considered in the search mechanisms. There are two search mechanisms. The first one is an exhaustive search; for small programs, this search runs in acceptable times. For larger problems, since we are dealing with an optimization problem with multiple variables, we implemented a genetic algorithm. Usually, an embedded system runs several different applications. Currently, the design exploration tool is not able to manipulate more than one application on different architectures. To overcome this restriction, each application is modeled as a graph and the design exploration tool is executed several times (once for each application/processor core pair), so that the tool selects the correct combination of routines for each processor. Finally, the result is a set of applications, each with its own set of routines and the processor core on which it will execute, minimizing either power or energy while maintaining the desired system performance. The methodology can be better understood through an example. Let us take, for instance, an embedded system that must execute the IMDCT (present in MP3 players, for example) and the Crane algorithm (a control algorithm for a crane).
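The exhaustive mechanism simply enumerates the cross product of implementation choices per routine and keeps the cheapest combination; its exponential growth is what motivates the genetic algorithm for larger programs. A minimal illustrative sketch (the cost function is supplied by the caller):

```python
from itertools import product

def exhaustive_search(options_per_routine, cost_fn):
    """Return (best_combination, best_cost) over all combinations.

    options_per_routine: one list of candidate implementations per routine.
    cost_fn: maps a combination (one choice per routine) to a scalar cost.
    Feasible for small programs only: the search space is the product of
    the per-routine option counts.
    """
    best, best_cost = None, float("inf")
    for combo in product(*options_per_routine):
        c = cost_fn(combo)
        if c < best_cost:
            best, best_cost = combo, c
    return best, best_cost
```

For instance, with two routines having two candidate costs each and a simple additive cost, the search returns the pair of cheapest candidates.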
These applications must run in parallel, and power has the highest priority. All processor cores run at 50 MHz. The Crane algorithm must execute in less than 5 milliseconds, while the IMDCT function must use less than 7 Kbytes of memory space and must respond in at most 0.5 milliseconds. Table 6 presents the results for one instance of the Crane and one instance of the IMDCT implementation. The table shows that the best option for the Crane algorithm is the multicycle core, since it provides the best result in terms of power and satisfies the execution time. The IMDCT application includes a cosine function, which has a small impact on overall performance. The IMDCT results in Table 6 are as follows: the multicycle implementation achieves its best performance with the IMDCT3 routine and a table look-up cosine calculation; the pipelined implementation likewise performs best with IMDCT3 and a table look-up cosine; and the VLIW implementation achieves the best combination in terms of performance and power with IMDCT1 and a table look-up cosine. The IMDCT results show that the only combination fulfilling the above restrictions is the VLIW implementation: the best multicycle implementation does not satisfy the performance requirement, and the pipelined one satisfies the performance but not the memory space. Hence, the VLIW core must be selected to satisfy both the execution time and the memory space requirements. This example thus shows how the correct/best combination of SW routines and processor cores is chosen to satisfy the system requirements.

Table 6. Crane and IMDCT results for different processor cores
                        Crane                                IMDCT
Characteristic          Multi     Pipeline   VLIW2           Multi     Pipeline   VLIW2
Performance (cycles)    179022    78287      68008           56241     21198      21305
Power (mW)              13.6235   15.9823    16.6761         9.5177    17.4333    24.8776
Energy (µJ)             48.7781   25.0241    22.6821         10.7057   7.3910     10.6005
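The core selection just described can be replayed as a small feasibility check over the figures above: keep only the cores meeting the deadline (and, for the IMDCT, the 7-Kbyte memory limit), then pick the lowest power. The memory figures below combine the program and data sizes of the IMDCT variant each core runs fastest (Table 3); this pairing and the helper names are assumptions for illustration:

```python
FREQ_HZ = 50e6  # all cores run at 50 MHz

def pick_core(candidates, deadline_s, mem_limit_bytes=None):
    """Among cores meeting the deadline (and memory limit, if given),
    choose the one dissipating the least power; None if none qualifies."""
    feasible = [c for c in candidates
                if c["cycles"] / FREQ_HZ <= deadline_s
                and (mem_limit_bytes is None or c["mem"] <= mem_limit_bytes)]
    return min(feasible, key=lambda c: c["power_mw"])["core"] if feasible else None

crane = [  # memory is not constrained for the crane application
    {"core": "multicycle", "cycles": 179022, "power_mw": 13.6235, "mem": 0},
    {"core": "pipeline",   "cycles": 78287,  "power_mw": 15.9823, "mem": 0},
    {"core": "vliw2",      "cycles": 68008,  "power_mw": 16.6761, "mem": 0},
]
imdct = [  # mem = program + data bytes of each core's fastest IMDCT variant
    {"core": "multicycle", "cycles": 56241, "power_mw": 9.5177,  "mem": 15294 + 3546},
    {"core": "pipeline",   "cycles": 21198, "power_mw": 17.4333, "mem": 15294 + 3546},
    {"core": "vliw2",      "cycles": 21305, "power_mw": 24.8776, "mem": 2137 + 3546},
]
```

Here pick_core(crane, 5e-3) selects the multicycle core (all three meet the 5 ms deadline, and it has the lowest power), while pick_core(imdct, 0.5e-3, 7 * 1024) selects the VLIW core (the multicycle core misses the 0.5 ms deadline and the pipelined core exceeds the memory limit), matching the choices in the text.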
5 Conclusions
This paper proposed a new methodology for design space exploration using automatic selection of software and hardware IP components for embedded applications. It is based on a software IP library and a set of cores with the same ISA; the library is previously characterized for each core, and a tool performs automatic design space exploration and SW and HW IP selection. Experimental results have confirmed the hypothesis that there is a large space to be explored based on algorithmic decisions taken at higher levels of abstraction, well before compiler intervention. Selecting the right algorithm and the right architecture may give orders of magnitude of gain in physical characteristics like memory usage, performance, and power dissipation. As future work, we plan to modify the design space exploration tool so that routines of the same application can be mapped onto different processor cores.
References
1. Dutt, N., Nicolau, A., Tomiyama, H., Halambi, A.: New Directions in Compiler Technology for Embedded Systems. In: Asia-Pacific Design Automation Conference, Jan. 2001. Proceedings, IEEE Computer Society Press (2001).
2. Dalal, V., Ravikumar, C.P.: Software Power Optimizations in an Embedded System. In: VLSI Design Conference, Jan. 2001. Proceedings, IEEE Computer Science Press (2001).
3. Kandemir, M., Vijaykrishnan, V., Irwin, M.J., Ye, W.: Influence of Compiler Optimizations on System Power. In: IEEE Transactions on VLSI Systems, vol. 9, n. 6 (2001).
4. Reyneri, L.M., Cucinotta, F., Serra, A., Lavagno, L.: A Hardware/Software Co-design Flow and IP Library Based on Simulink. In: Design Automation Conference 2001, Las Vegas, June 2001. Proceedings, ACM (2001).
5. Givargis, T., Vahid, F., Henkel, J.: System-Level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip. In: International Conference on Computer-Aided Design, San Jose, Nov. (2001).
6. Nandi, A., Marculescu, R.: System-Level Power/Performance Analysis for Embedded Systems Design. In: Design Automation Conference, June 2001. Proceedings, ACM (2001).
7. Kathail, V., Aditya, S., Schreiber, R., Rau, B.R., Cronquist, D.C., Sivaraman, M.: PICO: Automatically Designing Custom Computers. In: IEEE Computer, September (2001).
8. Tiwari, V., Malik, S., Wolfe, A.: Power Analysis of Embedded Software: a First Step Towards Software Power Minimization. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, n. 4, Dec. (1994).
9. Choi, K., Chatterjee, A.: Efficient Instruction-Level Optimization Methodology for Low-Power Embedded Systems. In: International Symposium on System Synthesis, Montréal, Oct. 2001. Proceedings, ACM (2001).
10. Peymandoust, A., Simunic, T., De Micheli, G.: Complex Library Mapping for Embedded Software Using Symbolic Algebra. In: Design Automation Conference, New Orleans, June 2002. Proceedings, ACM (2002).
11. Kumar, R., Farkas, K.I., Jouppi, N.P., Ranganathan, P., Tullsen, D.M.: Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In: International Symposium on Microarchitecture (2003).
12. Lieverse, P., van der Wolf, P., Deprettere, E., Vissers, K.: A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems. In: Journal of VLSI Signal Processing, vol. 29 (2001) 197-206.
13. Mariatos, E.P., Birbas, A.N., Birbas, M.K.: A Mapping Algorithm for Computer-Assisted Exploration in the Design of Embedded Systems. In: ACM Transactions on Design Automation of Electronic Systems, vol. 6, n. 1, Jan. 2001, ACM (2001).
14. Omondi, A.: Computer Arithmetic Systems: Algorithms, Architecture and Implementation. Prentice Hall (1994).
15. Salomonsen, K., Søgaard, S., Larsen, E.P.: Design and Implementation of an MPEG/Audio Layer III Bitstream Processor. Master Thesis, Aalborg University (1997).
16. Ito, S., Carro, L., Jacobi, R.: Making Java Work for Microcontroller Applications. In: IEEE Design & Test of Computers, vol. 18, n. 5, Sept-Oct (2001).
17. Beck, A.C., Carro, L.: Low Power Java Processor for Embedded Applications. In: IFIP 12th International Conference on Very Large Scale Integration, Germany, December (2003).
18. Beck, A.C., Wagner, F.R., Carro, L.: CACO-PS: A General Purpose Cycle-Accurate Configurable Power Simulator. In: 16th Symposium on Integrated Circuits and Systems Design, São Paulo, Brazil, Sept. 2003. Proceedings, IEEE Computer Society Press (2003).