Draft version
Systematic Design Flow For Fast Hardware/Software Prototype Generation From Bus Functional Model For MPSoC

Ivan Petkov (1,2), Paul Amblard (1), Marin Hristov (2), Ahmed Jerraya (1)
(1) TIMA Laboratory, 46 avenue Felix Viallet, 38031 Grenoble, France
(2) ECAD Laboratory, 8 bul. Kliment Ohridski, 1797 Sofia, Bulgaria
{Ivan.Petkov, Paul.Amblard, Ahmed.Jerraya}@imag.fr
{Petkov, Mhristov}@ecad.tu-sofia.bg

Abstract

System design at a higher level of abstraction is a promising technique to deal with the increasing complexity of modern embedded systems. Current MPSoCs are designed at Register Transfer Level. The Bus Functional Model is a higher level of abstraction that allows the integration of heterogeneous hardware, software components and sophisticated communication interconnects to adapt different description models. This system abstraction model makes it possible to accelerate the simulation but ignores the accuracy of the developed circuit. This paper studies an example of system design transformation from a high level of abstraction to the physical prototype of a multiprocessor system on chip. With this work we propose a systematic and efficient design flow for system on chip integration from a Bus Functional Level of abstraction towards physical prototyping of embedded systems. The flow is applied to accelerate an MPSoC example design.
1 Introduction
Recent progress in microelectronic technologies has enabled the integration of more functionality on a single chip. It is now possible to create complex embedded systems called Multiprocessor Systems on Chip (MPSoC), containing several microprocessors, memories, shared buses and peripheral circuits on a single die. Implementing all functionality on a single chip increases performance and reduces power consumption, but also leads to new challenges and difficulties for system designers. MPSoC designs have become large and complex, and involve heterogeneous components such as software parts and hardware devices. To deal with this heterogeneity, frameworks have focused on speeding up the hardware/software design process. Three directions have been studied: hardware/software interface refinement, design architecture exploration from a system specification, and validation by simulation. New design approaches such as System Level Design and Platform Based Design have appeared.
These design approaches are based on abstraction models of embedded systems. These models, developed at a high level of abstraction with event-driven modelling languages, are used to accelerate the validation of the functionality and the evaluation of the performance of the various software and hardware components implementing a given system. In this work we focus on the gap between a system modelled at the Bus Functional Level and the same system at the Register Transfer Level. We contribute to the definition of system design automation by pointing out the important tasks in the prototyping design flow of MPSoC. The motivation of this work is based on the different types of components used at different levels of abstraction and the relations between them during prototyping. We start in Section 2 with the presentation of the different abstraction levels used in the MPSoC design flow. In Section 3 we describe the SoC integration flow proposed by our work. Section 4 presents the experimental part: the simulation and the generated execution models, at different levels of abstraction, of a multiprocessor video encoding application developed during this work. Finally, Section 5 provides the conclusions.
2 SoC Design: Different Abstraction Levels
The MPSoC design flow starts from a system specification and ends with the hardware prototype of the system. The MPSoC design flow consists of several abstraction levels, including the successive refinement steps after each level [1] [2].
2.1 System Specification
The system specification is an informal model that must contain all of the application's functionality and requirements. Based on this informal model, the system designers have to create a formal system model, built from one or more functional subsystems. Each subsystem may have several tasks that model the functionality of the subsystem. Communications among subsystems or tasks are realized through high level communication primitives. The simulation model represents a number of functional subsystems that communicate through high level communication services provided by the execution environment. Our study is concerned with the next three abstraction levels, shown in Figure 1.

[Figure 1 diagram: (a) Virtual Architecture Model at System level: SW component (SW task 1, SW task 2), HW component (HW block 1, HW block 2), execution environment; HW/SW interface refinement (ROSES) leads to (b) Bus Functional Model at Bus Functional level: SW component (file on an ISS simulator), HW component (HW block), Bus Functional Model; SoC integration leads to (c) RT-level Model: SW component (ARM), HW components, HW memory, HW block, bus, physical wires.]
Figure 1: SoC Design: Different Abstraction Levels

2.2 System Level

At System level (see Figure 1a), the designers build an executable abstraction model and iterate it through a performance analysis loop to decide the task partitioning for the MPSoC architecture. This executable specification uses abstract models for the hardware and software components that will implement the tasks. For example, an abstract software model can be a set of software tasks. An abstract hardware model can be a behavioural component described using a transaction-level model. The communications are realized through abstract communication channels using TLM [3] primitives.

The simulation model at this level is a virtual architecture model [1] (see Figure 2). The virtual architecture model is a set of virtual modules interconnected using point-to-point transaction communication channels. The internal component contains a set of software tasks or represents a hardware function without any precise characteristics for the type of the processor or the communication topology. The system level synthesis is the refinement step from this level. The hardware refinement is performed from an extensible library of reusable components, developed for a given set of protocols and components, while the software refinement is part of an automatic generation of an application-specific OS. Hardware refinement transforms the virtual architecture model, containing abstract software and hardware modules, into an Instruction Set Architecture model containing software and hardware subsystems associated with selected IP components, while software generation also produces a custom OS for each processor on the target platform. More details about the HW/SW refinement using the ROSES environment can be found in [1].

[Figure 2 diagram: Virtual Architecture Model with SW component (SW task 1, SW task 2), HW component (HW block 1, HW block 2) and execution environment.]
Figure 2: System Level: Simulation Model

2.3 Bus Functional Level

Systems at this level have some information about the implementation of the communication protocols and the hardware components used in the HW subsystems. At this phase the HW subsystems and the communication are refined to RTL. The SW subsystems are represented by application software tasks and an OS running on Instruction Set Simulators (ISSs). To achieve hardware/software co-simulation at this level we need to adapt the software execution model to the hardware simulation model. A Bus Functional Model (BFM) is used for this purpose. A BFM [4] is a simulator adapter that enables communication between ISSs and other simulators. The simulation model of the system (see Figure 3) consists of software subsystems, represented by ISSs and software code executed directly as a file on the processor simulator; hardware subsystems, represented by hardware behavioural models executed on HDL simulators; and a co-simulation bus such as a BFM. The communication between the software subsystems and the hardware subsystems is interpreted by the bus functional model, but it is not yet a physical communication bus.

[Figure 3 diagram: Bus Functional Model with SW component (file on an ISS simulator), HW component (HW block) and Bus Functional Model.]
Figure 3: Bus Functional Level: Simulation Model

Refining the system to RTL consists of efficiently integrating the software subsystems with the already existing hardware subsystems. The main difficulties in this process are the differences between the two models. The bus functional model does not take into account the additional logic needed to realize the communication with the processor. At the bus functional level the memory map is a simple hypothesis and the initialization code of the processor boot is hidden by the simulator. Thus the implementation of the application software program as read-only code and read-write data is not taken into account during simulation. To manage these difficulties during prototyping, we propose in Section 3 a SoC integration flow from the Bus Functional Level to the Register Transfer Level.
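The role of a bus functional model as an adapter can be illustrated with a minimal Python sketch. It routes the memory accesses issued by a software model to hardware models according to an assumed memory map. The class names (BusFunctionalModel, RamModel) and the base addresses are hypothetical and are not taken from the flow described in this paper; a real BFM also models bus timing and signalling, which this sketch omits.

```python
class RamModel:
    """Stand-in for a hardware behavioural model (hypothetical)."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, offset):
        return self.cells[offset]

    def write(self, offset, value):
        self.cells[offset] = value


class BusFunctionalModel:
    """Routes accesses issued by a software model to hardware models.

    The memory map is only a hypothesis at this level: a list of
    (base, size, device) entries chosen by the designer.
    """

    def __init__(self):
        self.regions = []
        self.log = []  # transaction trace, useful for debugging

    def attach(self, base, size, device):
        self.regions.append((base, size, device))

    def _decode(self, address):
        for base, size, device in self.regions:
            if base <= address < base + size:
                return device, address - base
        raise ValueError("no device mapped at 0x%X" % address)

    def write(self, address, value):
        device, offset = self._decode(address)
        self.log.append(("W", address, value))
        device.write(offset, value)

    def read(self, address):
        device, offset = self._decode(address)
        value = device.read(offset)
        self.log.append(("R", address, value))
        return value


bfm = BusFunctionalModel()
bfm.attach(0x0000, 0x100, RamModel(0x100))  # assumed local memory
bfm.attach(0x8000, 0x10, RamModel(0x10))    # assumed peripheral registers
bfm.write(0x8004, 0xAB)
print(hex(bfm.read(0x8004)))
```

Because the map is just a table inside the adapter, nothing in this model forces the software to be linked at real addresses, which is exactly the gap that has to be closed when moving to RTL.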
2.4 Register Transfer Level

At the Register Transfer Level, the system designer maps the SW subsystems onto hardware IP components and implements the application software program in hardware memories. The RTL architecture simulation model (see Figure 4) consists of processor(s), hardware IP(s), a communication network, and processor IP interfaces. At this level the software code is placed in hardware memory blocks as bit vectors in hexadecimal format. The software mapping is realized with "real" address decoding logic and the communication is based on hardware components connected through wires.

[Figure 4 diagram: Register Transfer Level model with SW component (ARM), HW components, HW memory, arbiter, HW block, bus and physical wires.]
Figure 4: RT Level: Simulation Model

The hardware logic simulation and synthesis are widely described in the literature and are not the main scope of this paper. However, we show some of their aspects in the next section.

3 SoC Integration Flow

The intended contribution of this paper is to describe a SoC integration flow (see Figure 5) starting with the abstraction model of an MPSoC at the Bus Functional Level described in the previous section, corresponding to the architecture in Figure 3. The goal is to produce an accurate synthesizable RTL hardware prototype using a systematic method for the transformation from a higher abstraction model of the MPSoC to an RTL model.

The SoC integration flow is based on three key pieces of information extracted from the higher level abstraction model of the MPSoC:
(1) Specific Architecture Description: describes the communication topology between the hardware IP models, the processor(s) and the memories. This description is an important choice made by the system designer, based on the information extracted from the higher level abstraction model of the system and on the hardware IP models that are technically available.
(2) Hardware IP Models: a hardware IP library at RTL with the design simulation models of the processor(s), memories, basic bus components and peripherals.
(3) Memory implementation description file: a scatter loading file gives the necessary information about the memory addresses at which each read-only or read-write part of the code and data of the application software program is implemented. It is used for the realization of the address decoding logic.

[Figure 5 diagram: Bus Functional Level components (SW, HW, generic C library) with execution model generation and simulation; SoC integration, made of the SoC SW transformation (scatter loading file, address decoder) and the SoC HW transformation (specific architecture; ARM, AMBA, SRAM, ROM); Register Transfer Level components (SW, HW, specific C library) with execution model generation and simulation.]
Figure 5: SoC Integration: Design Flow

The first important step in the SoC integration flow is the transformation of the software subsystems by assembling hardware IP components according to the specific architecture description. This hardware IP assembly replaces the ISS (Instruction Set Simulator) with a time-accurate simulation model of the processor that can be included in the target HDL simulators, and includes the design of the processor hardware subsystem. The second important step is the memory implementation. A separate relocation must be performed to create the executable binary image of the software, taking the addresses into account. This is obtained by using a scatter loading description file generated from the initial model of the application software and the specific architecture description.
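As an illustration of how a memory implementation description can drive the address decoding logic, the following Python sketch builds a decode function from a table of regions and rejects overlapping regions. The region names, bases and sizes are hypothetical and do not reproduce the memory map of the application studied here; routing unmapped accesses to a default slave is likewise an assumption of the sketch.

```python
def build_decoder(regions):
    """regions: list of (name, base, size). Returns a decode function.

    Raises ValueError if two regions overlap, since the resulting
    select signals would then be ambiguous.
    """
    ordered = sorted(regions, key=lambda r: r[1])
    for (n1, b1, s1), (n2, b2, _) in zip(ordered, ordered[1:]):
        if b1 + s1 > b2:
            raise ValueError("%s overlaps %s" % (n1, n2))

    def decode(address):
        for name, base, size in ordered:
            if base <= address < base + size:
                return name
        return "DEFAULT_SLAVE"  # unmapped accesses go to a default slave

    return decode


# Hypothetical memory implementation description (not the paper's map).
MEMORY_MAP = [
    ("ROM", 0x10000000, 0x10000),
    ("RAM0", 0x20000000, 0x8000),
    ("RAM1", 0x20008000, 0x8000),
]

decode = build_decoder(MEMORY_MAP)
for addr in (0x10000004, 0x20008000, 0x30000000):
    print(hex(addr), "->", decode(addr))
```

In hardware the same table would translate into one select signal per region; the overlap check is the kind of consistency test that is cheap to run before any HDL is generated.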
3.1 Bus Functional Level MPSoC Architecture Model

The MPSoC architecture model at Bus Functional Level has three general components: (1) application software code, containing the application program and operating system, usually written in C/C++; (2) hardware peripheral models written in VHDL, Verilog or SystemC [10]; and (3) a generic C library. These three components are used for the generation of an execution model at Bus Functional Level. This execution model is used for the simulation and debugging of the MPSoC application. Figure 6 shows the design flow for the software and hardware parts of the generation of this model.

[Figure 6 diagram: software design flow (C and C++ software code, compiler, object code, link with the generic C library, with no fixed address yet assigned to the code/data, executable program); hardware design flow (VHDL and SystemC peripheral HW blocks, hardware assembling, hardware implementation, hardware verification); hardware/software merging into the Bus Functional Model (SW component on an ISS simulator, BFM, HW component with HW blocks).]
Figure 6: Bus Functional-level: Execution Model Generation

The software generation flow is a classic design flow for embedded software. It begins with the compilation of the source code to object files and the linking of these files, using the generic C library and some user libraries, to obtain the executable program. An important particularity is that at this level no fixed addresses are assigned to the code and data parts of the application software. We use an abstract memory map of the MPSoC application. The hardware design flow consists only of the assembling and verification of the hardware subsystems of the application.

3.2 MPSoC Architecture Model at Register Transfer Level

The target model at Register Transfer Level is composed of the same three components: (1) application software code, but at this level represented as bit vectors in hexadecimal format convenient for implementation in the memories; (2) hardware peripheral models and processor simulation models, with the relevant interface hardware components used to establish the communication between the processors and the local memories; and (3) a specific C library with application-specific initialization code and the memory implementation description file. As in the previously described flow, the software is compiled to obtain the object files. The object files are then located according to the memory map implementation described in the scatter loading file. Compared to the higher level generation flow there is one more stage at this level: the executable program must be translated to a binary code image suitable for integration in read-only memory. The RTL execution model generation flow is shown in Figure 7.

[Figure 7 diagram: software design flow (C and C++ software code, compiler, object code, locate with the linker using the specific C library and the scatter loading file, which gives the memory map implementation used to locate the application object files, executable code, fromELF, binary code); hardware design flow (VHDL and SystemC HW blocks for processor, memories and peripherals; IPs such as ARM, AMBA, SRAM and ROM from the IP HDL library describing the processor, the memories and the bus components; specific architecture describing the connection between the HW models, the processor and the memories; hardware assembling, hardware implementation, hardware verification); hardware/software merging (system = hardware + ROM software), leading to the hardware prototype.]
Figure 7: Register Transfer-level: Execution Model Generation

The hardware design flow starts with the assembling of the entire hardware prototype of the application, respecting the specific architecture description and using the hardware IP library. The verification of the resulting hardware implementation is performed in two steps. The first step is the verification of the processor interface, to test the integration of the hardware components of the MPSoC application. This simulation tests the functionality of the hardware using a bus transfer generator such as the File Read Bus Master (FRBM) in the AMBA Design Kit [8]. An FRBM makes it possible to simulate an AMBA design quickly by generating explicit transfers on the address, data and control buses. The second step is the validation of the application in a co-simulation, to test the merging of the hardware and software. This simulation is performed using a processor Design Simulation Model (DSM).

Separating the verification phase into two stages simplifies the overall validation process and accelerates the debugging of the MPSoC application.
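The extra translation stage at RTL, from an executable image to hexadecimal bit vectors that a ROM model can load, can be sketched in Python as follows. The output format (one 32-bit little-endian word per line, zero padding) is an assumption made for illustration; the actual format depends on the memory simulation model and on the conversion tool used.

```python
def to_hex_words(image, word_bytes=4, pad=0x00):
    """Split a binary image into fixed-width words rendered as hex text.

    Bytes are assumed little-endian within a word; the image is padded
    with `pad` up to a whole number of words.
    """
    remainder = len(image) % word_bytes
    if remainder:
        image = image + bytes([pad]) * (word_bytes - remainder)
    lines = []
    for i in range(0, len(image), word_bytes):
        word = int.from_bytes(image[i:i + word_bytes], "little")
        lines.append("%0*X" % (2 * word_bytes, word))
    return lines


# Six illustrative bytes; a real image would come from the located executable.
rom_image = bytes([0x18, 0xF0, 0x9F, 0xE5, 0x01, 0x02])
for line in to_hex_words(rom_image):
    print(line)
```

Each printed line corresponds to one word of the memory initialization file, so the number of lines gives a direct check against the ROM depth available in the hardware model.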
4 MPEG-4 Application Example
This section presents an example of system design transformation from a high level of abstraction to the physical prototype of a multiprocessor system on chip. We developed a DivX encoder handling MPEG-4 [7] video at QCIF resolution (176x144 pixels) and 25 frames/s, running at 60 MHz, to achieve real-time video encoding. The DivX encoder application is based on the OpenDivX algorithm [9], an open-source implementation of the MPEG-4 video compression standard. The compression technique is based on removing spatial and temporal redundancy from the input video frames. More details about the compression technology, the architecture exploration and the realization of the software part of the application can be found in [5] and [6].
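The real-time constraint can be turned into a rough cycle budget. The short Python calculation below uses only the figures quoted above (QCIF, 25 frames/s, 60 MHz) plus the standard 16x16 MPEG-4 macroblock size; it assumes a single 60 MHz clock and ignores how the work is shared between processors, so it is an order-of-magnitude sketch rather than a measured result.

```python
WIDTH, HEIGHT = 176, 144      # QCIF resolution from the text
FRAME_RATE = 25               # frames per second
CLOCK_HZ = 60_000_000         # 60 MHz
MACROBLOCK = 16               # MPEG-4 macroblock edge in pixels

macroblocks_per_frame = (WIDTH // MACROBLOCK) * (HEIGHT // MACROBLOCK)
cycles_per_frame = CLOCK_HZ // FRAME_RATE
cycles_per_macroblock = cycles_per_frame / macroblocks_per_frame

print(macroblocks_per_frame)          # 99
print(cycles_per_frame)               # 2400000
print(round(cycles_per_macroblock))   # 24242
```

A budget of roughly 24 000 clock cycles per macroblock on one clock domain helps explain why the encoding work is distributed over several processors.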
4.1 DivX Application Architecture

The block diagram of the architecture generated using the ROSES environment is shown in Figure 8. It consists of three DivX front-end cores implementing the motion estimation and compensation, the DCT transformation and the quantization; a VLC back-end core implementing the entropy coding; a hardware Direct Memory Access (DMA) engine establishing the communication between all modules; and hardware I/O interface blocks.

All data transfers between the subsystems are routed through the DMA engine with a point-to-point connection scheme. This block is application specific; it is not a standard Direct Memory Access device adapted for the purposes of the system. All custom hardware blocks were designed in an application-specific fashion using the SystemC language in the SLS group at TIMA Laboratory. The software is intended to be executed on ARM processors [8].

[Figure 8 diagram: three DivX subsystems, each with CPU, bus, RAM0 and RAM1; a VLC subsystem with CPU, bus and RAM; DMA; Splitter on the input (IN) and Combiner on the output (OUT).]
Figure 8: DivX Encoder Application Architecture

The functional flow of the DivX encoder application is as follows. Each incoming frame of the input video stream is divided into three parts by the Splitter, and each part is sent to one of the three DivX subsystems: DivX0, DivX1 and DivX2. The DivX cores process the incoming image data and prepare them for compression. The prepared data are transferred to the VLC subsystem, where the compression of the entire frame is finalized, and the whole image data is passed to the Combiner to adjust the compression parameters and transfer them to the system output.

4.2 DivX Encoder Subsystems

Each DivX encoder subsystem is based on a 32-bit ARM RISC processor and a dedicated subsystem architecture. The DivX subsystem architecture has a specific double-banked memory to enable concurrent data transfers from the DMA and the processor. This functionality is realized with a 2-to-2 interconnection bus matrix such as the one described for the multi-layer AMBA bus [8]. The VLC subsystem has a basic architecture including a single memory bank for data buffering.

[Figure 9 diagram: SW component (software on an ISS simulator) connected through a co-simulation bus to the HW component: DMA slave interface, master interface, arbiter, decoder, AHB default slave, AHB2APB bridge, IRQ controller, APB timer, APB remap/pause, watchdog, and a 2-to-2 bus matrix (inports 0 and 1, outports 0 and 1) with two memory controllers and two SRAMs.]
Figure 9: DivX Subsystem: Bus Functional Model

In the initial model at the Bus Functional Level (see Figure 9), the hardware components of the subsystems are described in hardware description languages such as SystemC or VHDL, and the software component and all details of the processors are abstracted by Instruction Set Simulators. The DivX encoder subsystems were initially a mixed-level model, where the bus components and the double-bank memory feature were described at RTL, while the application software code and the abstract processors were represented at the bus functional level (ISA level). At this level the application software code is implemented with an abstract memory map, without any system initialization code. The hardware/software interface was realized through a Bus Functional Model and physical wires. The next step of the design flow is the SoC integration described in Section 3, that is, the transformation of the software component in Figure 9 into a lower level abstraction subsystem based entirely on hardware IP components. This lower level subsystem includes the application software, containing the software program and the OS, integrated into a fixed hardware memory system. This application software is executed on a time-accurate processor simulation model designed for RTL. The communication protocol is also refined to physical wires at RTL. The low level model is shown in Figure 10.
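The benefit of the double-banked memory behind a 2-to-2 bus matrix can be sketched with a one-cycle arbitration model in Python. The fixed priority order (CPU before DMA) and the function name are assumptions made for illustration; the arbitration policy of the actual design is not described here.

```python
def arbitrate(requests, priority=("CPU", "DMA")):
    """Grant bank accesses for one cycle of a 2-to-2 bus matrix.

    requests: dict mapping master name -> requested bank (0 or 1), or None.
    Returns dict mapping master name -> True (granted) / False (stalled).
    Masters targeting different banks proceed concurrently; when both
    target the same bank, the earlier name in `priority` wins.
    """
    granted = {}
    busy_banks = set()
    for master in priority:
        bank = requests.get(master)
        if bank is None:
            continue  # this master is idle in the current cycle
        if bank in busy_banks:
            granted[master] = False
        else:
            granted[master] = True
            busy_banks.add(bank)
    return granted


print(arbitrate({"CPU": 0, "DMA": 1}))  # different banks: both proceed
print(arbitrate({"CPU": 0, "DMA": 0}))  # same bank: DMA has to wait
```

With this organisation the DMA can fill one bank while the processor works on the other, and the two only contend when they address the same bank in the same cycle.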
[Figure 10 diagram: SW component and HW component at RTL: ROM and RAM, DMA slave interface, master interface, decoder, default slave, AHB2APB bridge, IRQ/interrupt controller, APB peripherals (timer, remap/pause, watchdog), and a 2-to-2 interconnection bus matrix (inports 0 and 1, outports 0 and 1) with two AHB memory controllers and two SRAMs (RAM0, RAM1).]
Figure 10: DivX Subsystem: RTL Model

4.3 Hardware IP Models

The hardware architecture of the DivX encoder at RTL was realized through the reuse of existing hardware IP components, such as the design simulation model of the ARM946E-S processor, an AMBA design kit providing a generic environment for rapid design, and a library of synchronous SRAM. This part of the design flow is already automated: we have an in-house tool for hardware assembling from the ROSES [1] environment, using a component-based approach. There are also commercial tools for rapid hardware assembling, such as Platform Express [12] from Mentor Graphics.

4.4 Software Implementation

A major difficulty in the design of the application software is the memory map layout, which unfortunately is known only after the final hardware implementation. To simplify the MPSoC design flow we propose, at the SoC integration phase, to use mechanisms that make it possible to specify the final memory map during the link process of the software image. This allows us to describe every region in the image that has different load and execution addresses in the memory system. One such mechanism is the scatter loading description file [8]. The scatter loading technique gives complete control over the grouping and placement of memory image components, and it is capable of describing any complex software image map. We apply this technique during the generation of the RTL simulation model of our application, and thus easily obtain the software/hardware merging of the execution model. Figure 11 gives an example of this method: a Tightly Coupled Memory (TCM) is located at address 0x0 on power-up, but it does not contain a valid instruction. Therefore, during the initialization of the system, we have to relocate to 0x0 the ROM containing the valid instructions placed at 0x3 addresses, but also to allow the processor to locate the TCM at 0x0 during run-time execution.

[Figure 11 diagram: load view and execution view of the memory map between 0x00000000 and 0xFFFFFFFF, with markers at 0x8000, 0x18000, 0x30000000, 0x40000000, 0xC0000000 and 0xF0000000. Labels: program ROM/RAM, ROM with BOOT code (+RO +RW +ZI), code in ROM, code in fast TCM, I-TCM SRAM with code (+RO), D-TCM SRAM with data (+RW +ZI), heap and stack, interrupt controller, APB peripherals.]

Scatter Loading File:

    ROM_LOAD 0x10000000 0xFFFF
    {
        ROM_EXEC 0x10000000 FIXED 0xFFFF
        {
            init.o (INIT, +FIRST)
            __main.o
            * (Region$$Table)
            * (ZISection$$Table)
        }
        ITCM 0x00000000 0x8000
        {
            vectors.o (VECTORS, +FIRST)
            * (+RO)
        }
        DTCM 0x10000 0x18000
        {
            * (+RW)
        }
    }

Figure 11: Memory Map Implementation

The scatter loading file describes the location of code and data at both reset time and run time in a text file format. It assigns to each region a load address at reset time and an execution address at run time. The copying of the software from the load region to an execution region is done by an ARM C library initialization function, which is part of the boot code described in the next section. We are currently working on adapting this technique to the design automation process of our flow.
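The difference between load addresses and execution addresses can be made concrete with a small Python model of the copy that takes place before the application starts. It is only a sketch of the idea: the execution addresses follow the scatter loading listing (ITCM at 0x0, DTCM at 0x10000), while the load offsets, sizes and byte values are invented for the example, and the function name scatter_load is hypothetical.

```python
def scatter_load(memory, regions):
    """Model the copy performed before main() is entered.

    memory: dict address -> byte value (sparse memory model).
    regions: list of dicts with keys
      load (address in ROM), exec (run-time address),
      size (bytes to copy), zi (bytes to zero after the copied data).
    """
    for region in regions:
        for i in range(region["size"]):
            memory[region["exec"] + i] = memory.get(region["load"] + i, 0)
        zi_base = region["exec"] + region["size"]
        for i in range(region["zi"]):
            memory[zi_base + i] = 0
    return memory


# Assumed layout: the exec addresses follow the listing (ITCM at 0x0,
# DTCM at 0x10000); the load addresses and sizes are made up.
ROM = 0x10000000
memory = {ROM + 0x100 + i: b for i, b in enumerate(b"\xAA\xBB\xCC\xDD")}
memory.update({ROM + 0x200 + i: b for i, b in enumerate(b"\x01\x02")})
regions = [
    {"load": ROM + 0x100, "exec": 0x00000000, "size": 4, "zi": 0},  # code
    {"load": ROM + 0x200, "exec": 0x00010000, "size": 2, "zi": 2},  # data
]
scatter_load(memory, regions)
print([hex(memory[a]) for a in range(0x0, 0x4)])
print([memory[a] for a in range(0x10000, 0x10004)])
```

After the copy, the code bytes are visible at their run-time addresses in the instruction TCM, and the data region holds its initial values followed by zeroed storage, while the ROM content at the load addresses is left untouched.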
4.5 System Initialization

Another main difficulty in the design of the software for an embedded MPSoC application is the initialization sequence after system reset. The application software must provide some initialization itself. In this section, we want to point out that at a lower level of abstraction of the MPSoC, such as RTL, the initialization code is specific to the application and the target processor. The boot code sequence provides the system initialization. It usually carries out the processor initialization and configuration, the memory remap, the initialization of the memory system and of the memory required by the C code, the enabling of the caches and the interrupts, and the entry into the C code. Figure 12 shows an example block diagram of the boot code of an ARM processor and gives an overview of the two parts of the application software initialization sequence. The first part, the application-specific code, is still developed manually by the system designer and is a difficult task to automate. The second part is composed of predefined initialization functions provided by the target processor C library.
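The ordering constraints of such an initialization sequence can be written down as data and checked mechanically, which is one possible starting point for automating it. The Python sketch below is an assumption-laden illustration: the step names are hypothetical labels for the activities listed above, and the strictly linear dependency chain is a simplification.

```python
BOOT_STEPS = [
    # (step, prerequisites), in the order suggested by the boot sequence
    ("reset_handler", []),
    ("remap_memory", ["reset_handler"]),
    ("init_stack_pointers", ["remap_memory"]),
    ("configure_mpu_tcm_caches", ["init_stack_pointers"]),
    ("copy_code_and_data", ["configure_mpu_tcm_caches"]),
    ("zero_init_data", ["copy_code_and_data"]),
    ("setup_stack_and_heap", ["zero_init_data"]),
    ("init_c_library", ["setup_stack_and_heap"]),
    ("enable_caches_and_interrupts", ["init_c_library"]),
    ("main", ["enable_caches_and_interrupts"]),
]


def check_boot_order(steps):
    """Return the step names in order, or raise if a prerequisite is missing."""
    done = set()
    for name, prerequisites in steps:
        missing = [p for p in prerequisites if p not in done]
        if missing:
            raise RuntimeError("%s needs %s first" % (name, missing))
        done.add(name)
    return [name for name, _ in steps]


print(" -> ".join(check_boot_order(BOOT_STEPS)))
```

Swapping two entries, for example entering main before the data has been copied, makes the check fail, which mirrors the kind of error that is otherwise only discovered in simulation.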
[Figure 12 diagram: application-specific code (reset handler at the image entry point; Init: remap, initialize stack pointers, configure MPU, enable TCM, set up caches), followed by the ARM C library (__main: copy code and data, zero initialized data; __rt_entry: set up stack and heap through __user_initial_stackheap(), initialize library functions, call top-level constructors; $Sub$$main(): enable caches and interrupts) and finally the application main(), which links in the library initialization code.]
Figure 12: ARM946E-S boot code example

4.6 Results Analysis

To pass from the Bus Functional Level model to the RTL model, we spent only a few days refining our application and tying the software to the hardware design. The productivity gain is that more time is spent on simulation to debug the entire functionality of the system than on building and fitting an accurate simulation model of the system.

    Boot Code
    =========================================================
    Total ROM Size (Code + RO Data)       4 320 ( 4.22 Kbytes)
    Total RAM Size (RW Data + ZI Data)      132 ( 0.13 Kbytes)
    =========================================================
    ARM946E-S Configuration
    =========================================================
    Operating frequency:  60 MHz
    Instr CACHE size:      4 096 bytes
    Data CACHE size:       4 096 bytes
    Instr TCM size:       32 768 bytes
    Data TCM size:        32 768 bytes
    =========================================================
    Simulation
    =========================================================
    Execution time: 83286051 PS + 1 ( 83us )
    Memory Usage:   19.7M program + 26.5M data = 46.2M total
    CPU Usage:      3.6s system + 43.3s user = 46.9s total (229.2s, 20.5%)

Figure 13: Simulation results

Thus, we increased the quality and the reliability of our system using a simple and illustrative method for hardware/software merging. The resulting architecture model was verified against the initial BFM model using the Cadence NC-SIM [11] simulator for the hardware part and ARM ADS 1.1 [8] for the software design flow. Figure 13 shows some results from the execution of the simulation with an ARM946E-S DSM running at 60 MHz with 4 Kbytes instruction and data caches and 32 Kbytes TCM.

5 Conclusions

In this paper we presented a systematic design flow for fast hardware/software prototype generation from a Bus Functional Model for MPSoC. This approach promises to be an effective way to perform the SoC integration refinement step, providing several advantages in the MPSoC design flow: an accelerated systematic design process, reduced time for software integration, and an accurate transformation from a level of abstraction higher than RTL. The weak point of this approach is the lack of automation for the generation of the scatter loading file and the initialization sequence, which is a goal for future work. With the presented example we illustrated the feasibility of this approach, demonstrated its effectiveness, and showed its scalability to complex abstraction models of MPSoC.

References

[1] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava, A.A. Jerraya, "Multiprocessor SoC Platforms: A Component-Based Design Approach", IEEE Design & Test of Computers, Vol. 19, Nov-Dec 2002.
[2] F.R. Wagner, L. Carro, W.O. Cesario, A.A. Jerraya, "Strategies for the Integration of Hardware and Software IP Components in Embedded Systems-on-Chip", Integration, the VLSI Journal, Elsevier, Volume 37, Issue 4, pp. 223-252, September 2004.
[3] A. Haverinen, M. Leclercq, N. Weyrich, D. Wingard, "White Paper for SystemC based SoC Communication Modelling for the OCP Protocol", available at http://www.ocpip.org
[4] J. Rawson, "Hardware/Software Co-Simulation", Proc. Design Automation Conference, 1994, pp. 439-440.
[5] A. Sarmento, W. Cesario, A.A. Jerraya, "Automatic Building of Executable Models from Abstract SoC Architectures", RSP'04, Geneva, Switzerland, June 2004.
[6] M.-W. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A.A. Jerraya, "Debugging HW/SW Interface for MPSoC: Video Encoder System Design Case Study", DAC'04, San Diego, USA, June 2004.
[7] V. Bhaskaran et al., "Image and Video Compression Standards: Algorithms and Architecture", Kluwer, 1995.
[8] ARM Documentation, available at http://www.arm.com
[9] Open DivX, available at http://www.projectmayo.com/
[10] SystemC 2.0, available at http://www.systemc.org/
[11] Cadence, available at http://www.cadence.com/