Application Mapping to a Hardware Platform through Automated Code Generation Targeting a RTOS: a Design Case Study
Monica Besana and Michele Borgatti STMicroelectronics, Central R&D - Agrate Brianza (MI), Italy {monica.besana, michele.borgatti}@st.com
Abstract

Consistency, accuracy and efficiency are key aspects for the practical usability of a system design flow featuring automatic code generation. Consistency is the property of maintaining the same behaviour at different levels of abstraction through synthesis and refinement, leading to a functionally correct implementation. Accuracy is the property of having a good estimation of system performance while evaluating a high-level representation of the system. Efficiency is the property of introducing low overheads and preserving performance at the implementation level. The RTOS is a key element of the link-to-implementation flow. In this paper we capture the relevant high-level RTOS parameters that allow consistency, accuracy and efficiency to be verified in a top-down approach. Results from performance estimation are compared against measurements on the actual implementation. Experimental results on automatically generated code show design flow consistency, an accuracy error of about 0.66% and an overhead of about 11.8% in terms of speed.
1. Introduction

Nowadays, embedded systems are continuously increasing their hardware and software complexity while moving to single-chip solutions. At the same time, market demand for System-on-Chip (SoC) designs is growing rapidly, with strict time-to-market constraints. As a result of these emerging trends, semiconductor companies are adopting hardware/software co-design flows [1][2], in which the target system is represented at a high level of abstraction as a set of reusable hardware and software macro-blocks. In this scenario, where application complexity is also scaling up, real-time operating systems (RTOS) are
This work is partially supported by the Medea+ A502 MESA European Project.
playing an increasingly important role. In fact, by simplifying the control code required to coordinate processes, an RTOS provides a very useful abstraction interface between applications with hard real-time requirements and the target system architecture. As a consequence, the availability of RTOS models is becoming strategic inside hardware/software co-design environments.

This work, based on the Cadence Virtual Component Codesign (VCC) environment [3], shows a design flow to automatically generate and evaluate software, including an RTOS layer, for a target architecture. Starting from executable specifications, an untimed model of an existing SoC is defined and validated by functional simulations. At the same time an architectural model of the target system is defined, providing a platform for the next design phase, in which system functionalities are associated with hardware or software architecture elements. During this mapping phase, each high-level communication between functions has to be refined by choosing the correct protocol from a set of predefined communication patterns. The glue necessary to connect hardware and software blocks is generated by the interface synthesis process. At the end of mapping, software performance estimations were carried out before simulating and validating the generated code directly on a board-level prototype including our target chip. Experimental results show link-to-implementation consistency with an overhead of about 11.8% in terms of code execution time. Performance estimations compared against the actual measured performance of the target system show an accuracy error of about 0.66%.
2. Speech recognition system description

A single-chip, processor-based system with embedded built-in speech recognition capabilities has been used as the target in this project. The functional block diagram of the
speech recognition system is shown in Fig. 1. It is basically composed of two hardware/software macro-blocks. The first one, simply called front-end (FE), implements the speech acquisition chain. Digital samples, acquired from an external microphone, are processed (Preproc) frame by frame to provide subsampled and filtered speech data to the EPD and ACF blocks. While ACF computes the auto-correlation function, EPD performs an end-point detection algorithm to obtain silence/speech discrimination. The concatenation of ACF with the linear predictive cepstrum block (LPC) translates each incoming word (i.e. a sequence of speech samples) into a variable-length sequence of cepstrum feature vectors [4]. These vectors are then compressed (Compress) and transformed (Format) into a memory structure suitable to be finally stored in RAM (WordRAM).

Figure 1. System Data Flow (front-end: mic, Preproc, EPD, ACF, LPC, Compress, Format, WordRAM; back-end: DTW Outloop, DTW Innerloop, L1/L2 distance, Norm. and Voting Rule, Flash Memory word database)

The other hardware/software macro-block, called back-end (BE), is the SoC recognition engine, where the acquired word (WordRAM) is classified by comparing it with a previously stored database of different words (Flash Memory). This engine, based on a single-word pattern-matching algorithm, is built around two nested loops (DTW Outloop and DTW Innerloop) that compute the L1 or L2 distance between frames of all the reference words and the unknown one. The obtained results are then normalized (Norm-and-Voting-Rule) and the best distance is supplied to the application according to a chosen voting rule.

The ARM7TDMI processor-based chip architecture is shown in Fig. 2. The whole system is built around an AMBA bus architecture, where a bus bridge connects the high-speed (AHB) and peripheral (APB) buses. The main targets on the AHB system bus are:
- a 2 Mbit embedded flash memory (e-Flash), which stores both programs and the word-template database;
- the main processor embedded static RAM (RAM);
- a static RAM buffer (WORDRAM) to store intermediate data during the recognition phase.

Figure 2. System architecture (ARM7TDMI core with 4 KB cache, 256 KB e-FLASH, 16 KB RAM and 1 KB WORDRAM on the AHB AMBA bus; an AHB/APB bridge connects the APB AMBA bus hosting the two timers, clock and reset controller, general purpose registers, I2C, USB, microphone interface, external bus interface, hand-held bus interface, and the Feature Extractor and Recognition Engine)
The configurable hardwired logic that implements speech recognition functionalities (Feature Extractor and Recognition Engine) is directly connected to the APB bus.
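The FE chain described above maps naturally onto a per-frame processing loop. The sketch below is purely illustrative, not the actual firmware: `FRAME_LEN`, `EPD_THRESHOLD`, `fe_process_frame` and the simple energy-threshold end-point rule are hypothetical stand-ins for the EPD and ACF stages.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_LEN 64          /* samples per frame: hypothetical value */
#define EPD_THRESHOLD 1000LL  /* energy threshold: hypothetical value  */

/* End-point detection stand-in: classify a frame as speech when its
 * energy exceeds a fixed threshold. */
static int epd_is_speech(const int16_t *frame)
{
    int64_t energy = 0;
    for (int i = 0; i < FRAME_LEN; i++)
        energy += (int64_t)frame[i] * frame[i];
    return energy > EPD_THRESHOLD;
}

/* Auto-correlation of one frame at lag k (ACF stand-in). */
static int64_t acf(const int16_t *frame, int k)
{
    int64_t r = 0;
    for (int i = 0; i + k < FRAME_LEN; i++)
        r += (int64_t)frame[i] * frame[i + k];
    return r;
}

/* One pass of the FE chain for a single frame: returns 1 when the frame
 * is classified as speech; r0_out receives the lag-0 autocorrelation
 * (the frame energy later stages would turn into cepstrum features). */
int fe_process_frame(const int16_t *frame, int64_t *r0_out)
{
    *r0_out = acf(frame, 0);
    return epd_is_speech(frame);
}
```

In the real system these stages are hardwired logic; the sketch only mirrors the data flow of Fig. 1.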
3. Design flow description

In this project a top-down design flow has been adopted to automatically generate code for a target architecture. Fig. 3 illustrates the chosen approach.

Figure 3. Project Design Flow (system behaviour tasks TASK1-TASK5 mapped in VCC to software running on MicroC/OS-II on the ARM7TDMI, then exported)

Starting from a system behaviour description, hardware and software tasks have been mapped to the target speech
recognition platform and to MicroC/OS-II (a well-known open-source, royalty-free pre-emptive real-time kernel [5]), respectively. The mapping and automatic code generation phases then allow the exported software to be simulated and validated directly on a target board. In the next sections a detailed description of the design flow is presented.
3.1 Modelling and mapping phases

At first, starting from the available executable specifications, a behavioural description of the whole speech recognition system was carried out. In this step of the project the FE and BE macro-blocks (Fig. 1) were split into 21 tasks, each one representing a basic system functionality at the untimed level, and the obtained model was refined and validated by functional simulations. Behaviour memories have been included in the final model to implement speech recognition data flow storage and retrieval. At the same time, a high-level architectural model of the ARM7-based platform presented above (Fig. 2) has been described. Fig. 4 shows the result of this phase, where the ARM7TDMI core is connected to a MicroC/OS-II model that specifies the task scheduling policy and the delays associated with task switching. This RTOS block is also connected to a single-task scheduler (Task), which allows a sequence of tasks to be transformed into a single task, reducing software execution time.

Figure 4. Architecture Model

When both descriptions were completed, the mapping phase was started. During this step of the design flow, each task has been mapped to a hardware or software implementation (Fig. 5), matching all speech recognition platform requirements in order to obtain code that can be executed directly on the target system. To reach this goal, the appropriate communication protocol between modelled blocks had to be selected from the available communication patterns. Unavailable communication patterns have been implemented to fit the requirements of the existing hardware platform.
Figure 5. FE Blocks Mapping
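The single-task scheduler mentioned in Section 3.1 collapses a fixed sequence of reactions into one task body, so the RTOS pays one scheduling decision instead of one per block. A minimal sketch of the idea (the names `reaction_fn`, `single_task_run` and `step` are ours, not VCC's):

```c
#include <assert.h>
#include <stddef.h>

/* A "reaction" is one generated task body (e.g. Preproc, EPD, ...). */
typedef void (*reaction_fn)(void *ctx);

/* Run a fixed sequence of reactions back-to-back inside one task:
 * plain function calls replace suspend/resume context switches. */
void single_task_run(reaction_fn *seq, size_t n, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        seq[i](ctx);
}

/* Example reaction: counts how many stages have run. */
int steps_done = 0;
void step(void *ctx) { (void)ctx; steps_done++; }
```

Under MicroC/OS-II the whole sequence would then be registered with a single OSTaskCreate() call instead of one per block, avoiding the suspend/resume overheads of Table 1 between consecutive reactions.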
3.2 Software performance estimation

At the end of the mapping phase, performance estimations have been carried out to verify whether the obtained system model meets our system requirements; the strictest constraints are in terms of software execution time. These simulations have been performed setting the clock frequency to 16 MHz and using the high-level MicroC/OS-II parameter values obtained via RTL-ISS simulation (Table 1), which describe the RTOS context-switching and interrupt-latency overheads. In this scenario the ARM7TDMI CPU architectural element has been modelled with a processor basis file tuned on automotive application code [6].
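The cycle counts and times in Table 1 are related simply by the 16 MHz clock (time_ms = cycles / 16000; e.g. ~220 cycles gives 0.014 ms). A one-line helper, for illustration only:

```c
#include <assert.h>
#include <math.h>

#define CORE_CLK_HZ 16000000.0 /* 16 MHz ARM7TDMI core clock */

/* Convert an overhead measured in CPU cycles into milliseconds
 * at the 16 MHz core clock used throughout the paper. */
double cycles_to_ms(double cycles)
{
    return cycles / CORE_CLK_HZ * 1000.0;
}
```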
Table 1. RTOS Parameters

  Parameter                                                 Cycles   Time at 16 MHz
  Start_overhead   (delay to start a reaction)              ~220     0.014 ms
  Finish_overhead  (delay to finish a reaction)             ~250     0.016 ms
  Suspend_overhead (delay to suspend a reaction)            ~520     0.033 ms
  Resume_overhead  (delay to resume a preempted reaction)   ~230     0.014 ms

Performance results show that all front-end blocks, which are the system blocks with hard real-time constraints, require 6.71 ms to complete their execution. This time does not include the RTOS timer overhead, which has
been estimated via RTL-ISS simulations at 1000 cycles (0.0633 ms at 16 MHz). Setting the MicroC/OS-II timer to a frequency of one tick per ms, all front-end blocks present an overall execution time of 7.153 ms. Since a frame of speech (the basic unit of work for the speech recognition platform) is 8 ms long, the performance simulations show that the generated code, including the RTOS layer, meets the hard real-time requirements of the target speech recognition system.
3.3 Code generation and measured results

Besides evaluating system performance, the VCC environment allows code to be generated automatically from the system blocks mapped to software. This code, however, does not include the low-level platform-dependent software. Therefore, to execute it directly on the target chip, we had to port MicroC/OS-II to the target platform; this porting was then compiled and linked with the software generated at the end of the mapping phase. The resulting image has been executed directly on a board prototype including our speech recognition chip in order to prove design flow consistency. The execution of all FE blocks, including an operating system tick every 1 ms, results in an execution time of 7.2 ms on the target board (core set to 16 MHz). This result shows that the obtained software performance estimation presents an accuracy error of about 0.66% compared with the on-SoC execution time. To evaluate design flow efficiency we use previously developed C code that, without comprising an RTOS layer, takes 6.44 ms to process a frame of speech at 16 MHz. Comparing this value with the obtained one of 7.2 ms, we get an overall link-to-implementation overhead, including MicroC/OS-II execution time, of 11.8%.
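Both headline figures follow directly from the raw times reported above; the sketch below reproduces them (our choice of the measured time as the error denominator is an assumption, but either denominator gives roughly 0.65-0.66%):

```c
#include <assert.h>
#include <math.h>

/* Estimation accuracy: relative error of the 7.153 ms estimate
 * against the 7.2 ms measured on the board, in percent. */
double pct_error(double estimated_ms, double measured_ms)
{
    return fabs(measured_ms - estimated_ms) / measured_ms * 100.0;
}

/* Link-to-implementation overhead: slowdown of the generated code
 * (with RTOS) versus the hand-written reference code, in percent. */
double pct_overhead(double with_rtos_ms, double reference_ms)
{
    return (with_rtos_ms - reference_ms) / reference_ms * 100.0;
}
```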
4. Conclusions

In this paper we have shown that the process of capturing system functionalities at a high level of abstraction for automatic code generation is consistent: the high-level system descriptions exhibit the same behaviour as the code automatically generated from them.
This link to implementation is a key productivity improvement, as it allows implementation code to be derived directly from the models used for system-level exploration and performance evaluation. In particular, an accuracy error of about 0.66% and a maximum execution speed reduction of about 11.8% have been reported. We consider this overhead acceptable for the implementation code of our system. Starting from these results, the presented design flow can be adopted to develop and evaluate software on a high-level model of the architecture before the target chip is available from the foundry. At present this methodology is being used to compare the software performance of different RTOSs on our speech recognition platform, in order to evaluate which one best fits the constraints of different target speech applications.
5. Acknowledgements

The authors thank M. Selmi, L. Calì, F. Lertora, G. Mastrorocco and A. Ferrari for their helpful support on system modelling. Special thanks to P.L. Rolandi for his support and encouragement.
References

[1] G. De Micheli, R.K. Gupta, "Hardware/Software Co-Design", Proceedings of the IEEE, Vol. 85, pages 349-365, March 1997.
[2] W. Wolf, "Computers as Components: Principles of Embedded Computing System Design", Morgan Kaufmann, 2001.
[3] S.J. Krolikoski, F. Schirrmeister, B. Salefski, J. Rowson, G. Martin, "Methodology and Technology for Virtual Component Driven Hardware/Software Co-Design on the System-Level", Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 6, 1999.
[4] J.W. Picone, "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, Vol. 81, pages 1215-1247, Sept. 1993.
[5] J.J. Labrosse, "MicroC/OS-II: The Real-Time Kernel", R&D Books, Lawrence, KS, 1999.
[6] M. Baleani, A. Ferrari, A. Sangiovanni-Vincentelli, C. Turchetti, "HW/SW Codesign of an Engine Management System", Proceedings of Design, Automation and Test in Europe, Mar. 2000.