Instruction-driven Timing CPU Model for Efficient Embedded Software Development using OVP
Felipe Rosa1,2, Luciano Ost1, Ricardo Reis2, Gilles Sassatelli1
1 LIRMM (CNRS-University of Montpellier II) 161 rue Ada, Cedex 05 - 34095 Montpellier - France {ost, sassatelli}@lirmm.fr
2 UFRGS - Instituto de Informática - PGMicro/PPGC Av. Bento Gonçalves 9500 Porto Alegre, RS - Brazil {frdarosa,reis}@inf.ufrgs.br
performance comes at the expense of accuracy; OVPsim provides instruction accuracy only, which results in inaccurate software performance estimation (e.g. application execution time).
Abstract - The software complexity of MPSoCs is increasing dramatically, resulting in new design challenges, such as improving the system’s performance and programmability by porting parallel programming APIs. Such challenges impose more time and cost on the system’s software development. This has led to the adoption of virtual platform frameworks aimed at functional verification, such as OVP, which are capable of simulating embedded systems running real application code at speeds of hundreds of MIPS. This work focuses on enhancing OVP by including a quasi-cycle accurate timing CPU model, making it suitable for performance analysis. This paper also evaluates the accuracy of the proposed timing CPU model when compared to a real system. Results show that the error of our model varies from 0.06% to 10.56% depending on the benchmark profile.
This paper contributes by including a quasi-cycle accurate timing CPU model in the OVP framework. The proposed approach broadens the OVP design space exploration spectrum, since software engineers can choose between faster simulation (original OVP) and more accurate simulation (proposed OVP model) within the same simulator. In this direction, we claim that software engineers can easily implement or port C applications and execute them in the original OVP until their functionality is validated. Applications can then be executed in a still fast but quasi-cycle accurate OVP model, which allows estimating the execution time of a program running on a given CPU architecture. In summary, this paper contributes in the following aspects:
Keywords: OVP simulation, modeling, design space exploration of MPSoCs, software validation.
I.
INTRODUCTION
Software development is an important issue in today’s MPSoC design. The increasing software complexity makes functional verification more difficult, resulting in increased development cost [1][2]. In this context, software engineers are investigating alternatives to scale up system performance, while dealing with new challenges in MPSoC software development, such as defining inter-CPU communication protocol stacks, as well as porting APIs and operating systems (OSs) [3]. To handle such scenarios, virtual platforms are being employed. Virtual platforms emulate hardware behavior at the instruction level, making target software believe that it is running on real physical hardware. While accelerating software development, such simulators usually offer a set of CPU models and memory system models, allowing the analysis of different applications/OSs executing on multiprocessor architectures without modification.
(i) the implementation and integration of a quasi-cycle accurate timing model into a JIT-based simulator;
(ii) the extensive model evaluation by using several benchmarks, while comparing it to a real hardware platform.
II.
STATE OF THE ART
Due to the limited simulation speed of event-driven cycle-accurate frameworks, simulators based on binary translation have become decisive in dealing with today’s application challenges, as well as in enabling the evaluation of large scenarios. Simics [7], QEMU [8] and the adopted OVPsim are examples of virtual platform frameworks that rely on dynamic binary translation, i.e. dynamic translation and optimization of target machine code to host machine code. Such simulators/emulators vary in modeling flexibility, simulation speed and accuracy. The lack of accuracy inherent to JIT-based simulators is motivating research into alternative performance/accuracy tradeoffs.
Event-driven and quasi-cycle accurate virtual platform frameworks like GEM5 target microarchitecture exploration, since specific modeling details are provided (e.g. instruction pipeline details, cache coherence protocols, etc.) [4]. Such simulators do not scale to a large number of CPUs, specifically when it comes to usability, ease of modeling and simulation time (around 200 KIPS [5]). In contrast, simulators such as the Open Virtual Platforms (OVP) OVPsim, which rely on just-in-time (JIT) dynamic binary translation, can achieve simulation speeds of up to 100 MIPS [5]. This simulation
978-1-4799-2452-3/13/$31.00 ©2013 IEEE
In this direction, Chiang et al. [9] propose the integration of QEMU and SystemC, allowing faster clock-accurate evaluation than RTL-based approaches, but at the cost of a simulation speed that is inadequate for today’s software complexity, since the simulation is performed in the SystemC environment. A pipeline model was included into QEMU in [10], where the authors propose a two-phase approach (offline and online phases) to estimate application performance. In the offline phase a cycle pre-estimation of the application
execution time is performed. Precomputed information is then used in a dynamic adaptation phase, in which CPU status and the execution time of critical instructions are also taken into account, improving the accuracy of the approach (mismatch around 10%). A similar approach is presented in [11], where worst-case execution time (WCET) analysis and QEMU are combined. In this work the offline phase is composed of 4 steps, which produce a timing database that is used during the QEMU simulation. The drawback of such approaches is that they rely on prior application profiling phases, which restricts their use when exploring large scenarios composed of diverse applications. Another disadvantage is that any software modification implies re-running the offline phases.
information. Once the cycle count is computed, each instruction is executed in the CPU (4).
Unlike the reviewed work, the proposed approach relies on OVP and operates at run time, eliminating huge trace files, as well as pre- or post-processing software/application profiling. Due to its low memory usage, the proposed approach can be easily configured to observe as many CPUs as desired. The user only needs to set flags indicating which CPUs of the system must be accurately evaluated. This feature can be used to reduce the simulation time of large MPSoCs.
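The per-CPU selection described above can be pictured as a small bitmask of observed CPUs. The following is only a minimal sketch under stated assumptions: the names `timing_config_t`, `timing_enable`, `timing_enabled` and `timing_count`, as well as the 64-CPU limit, are hypothetical and are not part of the OVP API or of the proposed model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit, chosen so that one 64-bit word holds all flags. */
#define MAX_CPUS 64u

/* One bit per CPU: 1 = evaluate with the timing model, 0 = plain fast mode. */
typedef struct {
    uint64_t timed_mask;
} timing_config_t;

/* Turn on accurate timing evaluation for one CPU. */
void timing_enable(timing_config_t *cfg, unsigned cpu)
{
    assert(cfg != NULL && cpu < MAX_CPUS);
    cfg->timed_mask |= (UINT64_C(1) << cpu);
}

/* Query whether a CPU is being observed by the timing model. */
bool timing_enabled(const timing_config_t *cfg, unsigned cpu)
{
    assert(cfg != NULL && cpu < MAX_CPUS);
    return (cfg->timed_mask >> cpu) & UINT64_C(1);
}

/* Count the observed CPUs by clearing the lowest set bit repeatedly. */
unsigned timing_count(const timing_config_t *cfg)
{
    assert(cfg != NULL);
    unsigned n = 0;
    for (uint64_t m = cfg->timed_mask; m != 0; m &= m - 1)
        n++;
    return n;
}
```

In such a scheme, CPUs whose flag is clear would skip the cycle bookkeeping entirely, which is one plausible way the simulation time of large MPSoCs could be reduced.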
The number of cycles needed to simulate each instruction can be affected by several conditions, such as the contents of the registers, the last instructions executed and the address accessed, among others. Single loads and stores are examples of operations whose cycle timing may be affected by such conditions. In such cases, a load that normally takes 2 cycles can be executed in a single cycle, since its address and data phases may be pipelined [14]. To deal with these conditions, internal logic and data structures were implemented, making it possible to determine precise cycle counts even under such circumstances.
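To make the load/store case concrete, the sketch below charges a per-class base cost and discounts a memory access that directly follows another one. It is an illustration only: the instruction classes, the base costs other than the 2-cycle load, and the single "previous instruction was a memory access" rule are assumptions, and the actual model also considers register contents and accessed addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction classes; not OVP identifiers. */
typedef enum {
    INSN_ALU,
    INSN_LOAD,
    INSN_STORE,
    INSN_BRANCH,
    INSN_CLASS_COUNT
} insn_class_t;

/* Placeholder base costs. Only the 2-cycle load comes from the text. */
static const uint32_t base_cycles[INSN_CLASS_COUNT] = { 1, 2, 2, 3 };

/* Run-time state kept per observed CPU. */
typedef struct {
    uint64_t cycles;       /* accumulated cycle estimate */
    bool     prev_was_mem; /* previous instruction was a load or store */
} timing_state_t;

static bool is_mem(insn_class_t cls)
{
    return cls == INSN_LOAD || cls == INSN_STORE;
}

/* Charge one instruction and return the number of cycles charged. */
uint32_t charge_instruction(timing_state_t *st, insn_class_t cls)
{
    assert(st != NULL && cls < INSN_CLASS_COUNT);
    uint32_t cost = base_cycles[cls];

    /* Assumed rule: a memory access right after another one overlaps its
     * address phase with the previous data phase, saving one cycle. */
    if (is_mem(cls) && st->prev_was_mem && cost > 1)
        cost -= 1;

    st->prev_was_mem = is_mem(cls);
    st->cycles += cost;
    return cost;
}
```

With these placeholder costs, the sequence load, load, ALU, load is charged 2, 1, 1 and 2 cycles: the second load is pipelined behind the first, while the last load pays the full cost because an ALU instruction precedes it.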
III.
[Figure labels: Watchdog, CPU, Memory]
Hash table 4 1. int main () { 2. int a,b,c,i; 3. a = 1; 4. b = 1 + a; 5. 6. for(i=0;i