Draft version

Systematic Design Flow For Fast Hardware/Software Prototype Generation From Bus Functional Model For MPSoC

Ivan Petkov1,2, Paul Amblard1, Marin Hristov2, Ahmed Jerraya1

1 TIMA Laboratory, 46 avenue Felix Viallet, 38031 Grenoble, France
2 ECAD Laboratory, 8 bul. Kliment Ohridski, 1797 Sofia, Bulgaria

{Ivan.Petkov, Paul.Amblard, Ahmed.Jerraya}@imag.fr
{Petkov, Mhristov}@ecad.tu-sofia.bg

Abstract

System design at a higher level of abstraction is a promising technique for dealing with the increasing complexity of modern embedded systems. Current MPSoCs are designed at Register Transfer Level. The Bus Functional Model is a higher level of abstraction that allows the integration of heterogeneous hardware and software components and sophisticated communication interconnects, and adapts between different description models. This system abstraction model accelerates the simulation but ignores the accuracy of the developed circuit. This paper studies an example of system design transformation from a high level of abstraction to the physical prototype of a multiprocessor system on chip. We propose a systematic and efficient design flow for system on chip integration from the Bus Functional Level of abstraction towards physical prototyping of embedded systems. The flow is applied to accelerate the design of an example MPSoC.

1 Introduction

Recent progress in microelectronic technologies has enabled the integration of more functionality in a single chip. It is now possible to create a complex embedded system called a Multiprocessor System on Chip (MPSoC), containing several microprocessors, memories, shared buses and peripheral circuits on a single die. Implementing all functionality on a single chip increased performance and reduced power consumption, but also led to new challenges and difficulties for system designers. MPSoC designs became large and complex, and involved heterogeneous components such as software parts and hardware devices. To deal with this heterogeneity, frameworks focused on speeding up the hardware/software design process. Three directions were studied: hardware/software interface refinement, design architecture exploration from a system specification, and validation by simulation. New design approaches such as System Level Design and Platform Based Design have appeared.

These design approaches are based on abstraction models of the embedded systems. These models, developed at a high level of abstraction with event-driven modelling languages, are used to accelerate the validation of the functionality and the evaluation of the performance of the various software and hardware components implementing a given system. In this work we focus on the gap between the system modelled at the Bus Functional Level and at the Register Transfer Level. We contribute to the definition of system design automation by pointing out the important tasks in the prototyping design flow of MPSoC. The motivation of this work lies in the different types of components used at different levels of abstraction and the relations between them during prototyping. We start in Section 2 with the presentation of the different abstraction levels used in the MPSoC design flow. In Section 3 we describe the SoC integration flow proposed by our work. Section 4 presents the experimental part: the simulation and the execution models generated at different levels of abstraction for a multiprocessor video encoding application developed during this work. Finally, Section 5 gives the conclusions.

2 SoC Design: Different Abstraction Levels

The MPSoC design flow starts from a system specification and ends with the hardware prototype of the system. It consists of several abstraction levels, with a refinement step after each level [1] [2].

2.1 System Specification

The system specification is an informal model that must contain all of the application's functionality and requirements. Based on this informal model, the system designers have to create a formal system model, built from one or more functional subsystems. Each subsystem may have several tasks that model the functionality of the subsystem. Communications among subsystems or tasks are realized through high level communication primitives. The simulation model represents a number of functional subsystems that communicate through high level communication services provided by the execution environment. Our study focuses on the next three abstraction levels, shown in Figure 1.

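[Figure 1 (a), system level: Virtual Architecture Model with a SW component (SW task 1, SW task 2) and a HW component (HW block 1, HW block 2) on top of the execution environment; followed by HW/SW interface refinement (ROSES).]

[Figure 1 (b), bus functional level: Bus Functional Model with a SW component (file executed on an ISS simulator) and a HW component (HW block); followed by SoC integration.]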

[Figure 1 (c), RT level: RT-level Model with a SW component (ARM processor, HW memory) and HW components (HW block) connected through a bus and physical wires.]

Figure 1: SoC Design: Different Abstraction Levels

2.2 System Level

At system level (see Figure 1a), the designers build an executable abstraction model and iterate it through a performance analysis loop to decide the task partitioning for the MPSoC architecture. This executable specification uses abstract models for the hardware and software components that will implement the tasks. For example, an abstract software model can be a set of software tasks, and an abstract hardware model can be a behavioural component described using a transaction-level model. The communications are realized through abstract communication channels using TLM [3] primitives.

Figure 2: System Level: Simulation Model

The simulation model at this level is a virtual architecture model [1] (see Figure 2). The virtual architecture model is a set of virtual modules interconnected using point-to-point transaction communication channels. The internal component contains a set of software tasks or represents a hardware function without any precise characteristics for the type of the processor or the communication topology.

The system level synthesis is the refinement step from this level. The hardware refinement is performed from an extensible library of reusable components, developed for a given set of protocols and components, while the software refinement is part of an automatic generation of an application-specific OS. Hardware refinement transforms the virtual architecture model, containing abstract software and hardware modules, into an Instruction Set Architecture model containing software and hardware subsystems associated with selected IP components, while software generation also produces a custom OS for each processor on the target platform. More details about HW/SW refinement using the ROSES environment can be found in [1].

2.3 Bus Functional Level

Systems at this level have some information about the implementation of the communication protocols and the hardware components used in the HW subsystems. At this phase the HW subsystems and the communication are refined to RTL. The SW subsystems are represented by the application software tasks and the OS running on Instruction Set Simulators (ISSs). To achieve hardware/software co-simulation at this level we need to adapt the software execution model to the hardware simulation model. A bus functional model (BFM) is used for this purpose. The BFM [4] is a simulator adapter that enables communication between ISSs and other simulators. The simulation model of the system (see Figure 3) consists of software subsystems, represented by ISSs and software code executed directly as a file on the processor simulator; hardware subsystems, represented by hardware behavioural models executed on HDL simulators; and a co-simulation bus such as the BFM. The communication between the software subsystems and the hardware subsystems is interpreted by the bus functional model, but it is not yet a physical communication bus.

Figure 3: Bus Functional Level: Simulation Model

Refining the system to RTL consists of the efficient integration of software subsystems with already existing hardware subsystems. The main difficulties in this process are the differences between the two models. The bus functional model does not take into account the additional logic needed to realize the communication with the processor. At bus functional level the memory map is a simple hypothesis and the initialization code of the processor boot is hidden by the simulator. Thus the implementation of the application software program as read-only code and read-write data is not taken into account during simulation. To manage these difficulties during prototyping we propose in Section 3 a SoC integration flow from Bus Functional Level to Register Transfer Level.
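The adapter role of the bus functional model described in this section can be illustrated with a minimal sketch. It assumes a hypothetical two-phase bus (an address phase followed by a data phase) and a single slave modelled as a dictionary; the names `BusCycle` and `SimpleBfm` are illustrative only and do not come from any co-simulation tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BusCycle:
    """One cycle of the hypothetical two-phase bus."""
    phase: str                  # "address" or "data"
    address: int
    write: bool
    data: Optional[int] = None  # only meaningful in the data phase


class SimpleBfm:
    """Turns abstract ISS memory accesses into bus cycles for a slave model."""

    def __init__(self) -> None:
        self.trace: List[BusCycle] = []        # cycles seen on the bus side
        self.slave_memory: Dict[int, int] = {}  # stand-in for a hardware model

    def write(self, address: int, value: int) -> None:
        # The ISS issues one abstract store; the BFM expands it into cycles.
        self.trace.append(BusCycle("address", address, True))
        self.trace.append(BusCycle("data", address, True, value))
        self.slave_memory[address] = value

    def read(self, address: int) -> int:
        # The ISS issues one abstract load; the slave answers in the data phase.
        self.trace.append(BusCycle("address", address, False))
        value = self.slave_memory.get(address, 0)
        self.trace.append(BusCycle("data", address, False, value))
        return value
```

In this sketch the software side only ever calls `read` and `write`, which is why details such as boot code and the final memory map can stay hidden at this level.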

2.4 Register Transfer Level

At the Register Transfer Level, the system designer maps the SW subsystems onto hardware IP components and implements the application software program in hardware memories. The RTL architecture simulation model (see Figure 4) consists of processor(s), hardware IP(s), a communication network, and processor IP interfaces. At this level the software code is placed in hardware memory blocks as bit vectors in hexadecimal format. The software mapping is realized with "real" address decoding logic and the communication is based on hardware components connected through wires.

[Figure 4: SW component (ARM processor, HW memory) and HW components (arbiter, HW block) connected through a bus and physical wires.]

Figure 4: RT Level: Simulation Model

Hardware logic simulation and synthesis are widely described in the literature and are not the main scope of this paper. However, we show some of their aspects in the next section.

3 SoC Integration Flow

The intended contribution of this paper is to describe a SoC integration flow (see Figure 5) starting with the abstraction model of an MPSoC at Bus Functional Level described in the previous section, corresponding to the architecture in Figure 3. The goal is to produce an accurate synthesizable RTL hardware prototype using a systematic method for the transformation from a higher abstraction model of the MPSoC to an RTL model.

The SoC integration flow is based on three key pieces of information extracted from the higher level abstraction model of the MPSoC:
(1) Specific architecture description: describes the communication topology between the hardware IP models, the processor(s) and the memories. This description is an important choice made by the system designer, based on the information extracted from the higher level abstraction model of the system and on the hardware IP models that are technically available.
(2) Hardware IP models: a hardware IP library at RTL with the design simulation models of the processor(s), memories, basic bus components and peripherals.
(3) Memory implementation description file: a scatter loading file gives the necessary information about the memory addresses at which each read-only or read-write part of the code and data of the application software program is implemented. It is used for the realization of the address decoding logic.

[Figure 5: Bus Functional Level (SW with generic C library, HW components; execution model generation and simulation), then SoC integration (SoC SW transformation using the scatter loading file and address decoder; SoC HW transformation using the specific architecture and the ARM, AMBA, SRAM and ROM IP models), then Register Transfer Level (SW with specific C library, HW components; execution model generation and simulation).]

Figure 5: SoC Integration: Design Flow

The first important step in the SoC integration flow is the transformation of the software subsystems by assembling hardware IP components according to the specific architecture description. The hardware IP component assembly replaces the ISS (Instruction Set Simulator) with a time accurate simulation model of the processor that can be included in the target HDL simulators, and designs the processor hardware subsystem. The second important step is the memory implementation. A separate relocation must be performed to create the executable binary image of the software, taking the addresses into account. This is obtained by using a scatter loading description file generated from the initial model of the application software and the specific architecture description.
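Point (3) above states that the memory implementation description is used to realize the address decoding logic. A minimal behavioural sketch of that relationship follows; the region names, base addresses and sizes in `MEMORY_MAP` are hypothetical placeholders, not the map of the design studied here.

```python
from typing import List, NamedTuple, Optional


class Region(NamedTuple):
    name: str
    base: int
    size: int


# Hypothetical memory map, as it might be derived from a scatter loading file.
MEMORY_MAP: List[Region] = [
    Region("ROM", 0x00000000, 0x00010000),
    Region("SRAM", 0x20000000, 0x00008000),
    Region("PERIPH", 0x40000000, 0x00001000),
]


def decode(address: int, regions: List[Region]) -> Optional[str]:
    """Return the name of the slave selected for an address, or None.

    None corresponds to an unmapped access, which a real bus would
    typically route to a default slave.
    """
    for region in regions:
        if region.base <= address < region.base + region.size:
            return region.name
    return None
```

For example, `decode(0x20000010, MEMORY_MAP)` selects `"SRAM"`, while an address outside every region returns `None`.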


3.1 Bus Functional Level MPSoC Architecture Model

The MPSoC architecture model at Bus Functional Level has three general components: (1) application software code, containing the application program and operating system, usually written in C/C++; (2) hardware peripheral models written in VHDL, Verilog or SystemC [10]; and (3) a generic C library. These three components are used to generate an execution model at Bus Functional Level. This execution model is used for the simulation and debugging of the MPSoC application. Figure 6 shows the design flow for the software and hardware parts of the generation of this model.

[Figure 6: Software design flow (C and C++ software code; compiler; object code; link with the generic C library, with no fixed address yet assigned to the code/data; executable program). Hardware design flow (VHDL and SystemC peripherals; hardware assembling; hardware implementation; hardware verification). Hardware/software merging produces the bus functional model: SW component on an ISS simulator, BFM, HW components with HW blocks.]

Figure 6: Bus Functional-level: Execution Model Generation

The software generation flow is a classic design flow for embedded software. It begins with the compilation of the source code to object files, followed by the linking of these files with the generic C library and any user libraries to obtain the executable program. An important particularity is that at this level no fixed addresses are assigned to the code and data parts of the application software at link time. We use an abstract memory map of the MPSoC application. The hardware design flow consists only of the assembly and verification of the hardware subsystems of the application.

3.2 MPSoC Architecture Model at Register Transfer Level

The target model at Register Transfer Level is composed of the same three components: (1) application software code, at this level represented as bit vectors in hexadecimal format convenient for implementation in the memories; (2) hardware peripheral models and processor simulation models with the relevant interface hardware components used to establish the communication between the processors and the local memories; and (3) a specific C library with application specific initialization code and a memory implementation description file. As in the previously described flow, the software is compiled to obtain the object files. Then the object files are located according to the memory map implementation described in the scatter loading file. Compared to the higher level generation flow there is one more stage at this level: the executable program has to be translated into a binary code image convenient for integration in read-only memory. The RTL execution model generation flow is shown in Figure 7.

[Figure 7: Software design flow (C and C++ software code; compiler; object code; locate with the linker, using the specific C library and the scatter loading file, i.e. the memory map implementation used to locate the application object files; executable code; fromELF; binary code). Hardware design flow (VHDL and SystemC processor, memories and peripherals; hardware assembling from the specific architecture, i.e. the description of the connection between the HW models, the processor and the memories, and from the IP HDL library description of the processor, the memories and the bus components: ARM, AMBA, SRAM, ROM; hardware implementation; hardware verification). Hardware/software merging: System = Hardware + ROM, giving the hardware prototype.]

Figure 7: Register Transfer-level: Execution Model Generation

The hardware design flow starts with the assembly of the entire hardware application prototype, according to the specific architecture description and using the hardware IP library. The resulting hardware implementation is verified in two steps. The first step is the verification of the processor interface, to test the integration of the hardware components of the MPSoC application. This simulation tests the functionality of the hardware using a bus transfer generator such as the File Read Bus Master (FRBM) in the AMBA Design Kit [8]. An FRBM makes it possible to simulate an AMBA design quickly by generating explicit transfers on the address, data and control buses. The second step is the validation of the application in a co-simulation, to test the merging of the hardware and software. This simulation is performed using a processor Design Simulation Model.

This separation of the verification phase into two stages simplifies the overall validation process and accelerates the debugging of the MPSoC application.
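The extra stage of the RTL flow, translating the executable into an image of hexadecimal bit vectors for a memory model, can be sketched as follows. This is only an illustration of the idea, not the output format of fromELF: the 32-bit word width, little-endian byte order and zero padding are assumptions, and `image_to_hex_words` is a hypothetical helper name.

```python
from typing import List


def image_to_hex_words(image: bytes, word_bytes: int = 4) -> List[str]:
    """Split a binary image into fixed-width hexadecimal words.

    Assumptions: little-endian byte order within a word, and zero padding
    of the last word when the image length is not a multiple of the width.
    """
    padding = -len(image) % word_bytes
    padded = image + b"\x00" * padding
    width = 2 * word_bytes  # two hex digits per byte
    return [
        format(int.from_bytes(padded[i:i + word_bytes], "little"), f"0{width}X")
        for i in range(0, len(padded), word_bytes)
    ]
```

Each returned string corresponds to one memory word, so the list can be written out line by line as an initialization file for a ROM simulation model.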


4 MPEG-4 Application Example

This section presents an example of system design transformation from a high level of abstraction to the physical prototype of a multiprocessor system on chip. We developed a DivX Encoder handling MPEG-4 [7] at QCIF resolution (176x144 pixels) and 25 frames/sec to achieve real time video encoding while running at 60 MHz. The DivX Encoder application is based on the OpenDivX algorithm [9], an open source implementation of the MPEG-4 video compression standard. The compression technique is based on removing spatial and temporal redundancy from the input video frames. More details about the compression technology, the architecture exploration and the realization of the software part of the application can be found in [5] and [6].
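The figures quoted above imply a rough cycle budget, which the following sketch works out. It assumes 16x16-pixel macroblocks, an even split of each frame over the three DivX front-end cores of Section 4.1, and that the whole 60 MHz clock is available to each core for encoding; these assumptions give an upper bound only and are not results reported by this work.

```python
FRAME_WIDTH = 176        # QCIF width in pixels
FRAME_HEIGHT = 144       # QCIF height in pixels
FRAMES_PER_SECOND = 25
CLOCK_HZ = 60_000_000    # 60 MHz processor clock
MACROBLOCK_SIZE = 16     # assumption: 16x16-pixel macroblocks
FRONT_END_CORES = 3      # assumption: frame split evenly over three cores


def frame_budget() -> dict:
    """Return the macroblock counts and cycle budgets implied by the figures."""
    macroblocks_per_frame = (FRAME_WIDTH // MACROBLOCK_SIZE) * (
        FRAME_HEIGHT // MACROBLOCK_SIZE
    )
    cycles_per_frame = CLOCK_HZ // FRAMES_PER_SECOND
    macroblocks_per_core = macroblocks_per_frame // FRONT_END_CORES
    return {
        "macroblocks_per_frame": macroblocks_per_frame,
        "macroblocks_per_second": macroblocks_per_frame * FRAMES_PER_SECOND,
        "cycles_per_frame": cycles_per_frame,
        "macroblocks_per_core": macroblocks_per_core,
        "cycles_per_macroblock": cycles_per_frame // macroblocks_per_core,
    }
```

Under these assumptions a frame contains 99 macroblocks, each core has 2,400,000 cycles per frame period, and the budget is roughly 72,700 cycles per macroblock before any communication overhead is counted.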

4.1 DivX Application Architecture

The block diagram of the architecture generated using the ROSES environment is shown in Figure 8. It consists of three DivX front-end cores implementing motion estimation and compensation, DCT transformation and quantization; a VLC back-end core implementing the entropy coder; a hardware Direct Memory Access engine establishing the communication between all modules; and hardware I/O interface blocks.

[Figure 8: IN and Splitter; three DivX subsystems, each with a CPU, RAM0, RAM1 and a bus; a VLC subsystem with a CPU, RAM and a bus; DMA; Combiner and OUT.]

Figure 8: DivX Encoder Application Architecture

The functional flow of the DivX Encoder application is as follows. Each incoming frame of the input video stream is divided into 3 parts by the Splitter and each frame part is sent to one of the three DivX subsystems: DivX0, DivX1 and DivX2. The DivX cores process the incoming image data and prepare it for compression. The prepared data is transferred to the VLC subsystem, where the compression of the entire frame is finalized, and the whole image data is passed to the Combiner, which adjusts the compression parameters and transfers the data to the system output.

All data transfers between the subsystems are routed through a DMA engine with a point to point connection scheme. This block is application specific; it is not a standard Direct Memory Access device adapted for the system. All custom hardware blocks were designed in an application specific fashion using the SystemC language in the SLS group at TIMA Laboratory. The software is intended to be executed on ARM processors [8].

4.2 DivX Encoder Subsystems

Each DivX Encoder subsystem is based on a 32 bit ARM RISC processor and a dedicated subsystem architecture. The DivX subsystem architecture has a specific double banked memory to enable concurrent data transfers from the DMA and the processor. This functionality is realized with a 2-to-2 interconnection bus matrix such as the one described for the multi-layered AMBA bus [8]. The VLC subsystem has a basic architecture including a single memory bank for data buffering.

[Figure 9: SW component (software on an ISS simulator) attached through the co-simulation bus to the HW component: DMA slave interface, master interface, decoder, AHB default slave, AHB2APB bridge, IRQ controller, APB timer, APB remap/pause, watchdog, arbiter, 2-to-2 bus matrix (inports 0 and 1, outports 0 and 1), two memory controllers and two SRAMs.]

Figure 9: DivX Subsystem: Bus Functional Model

In the initial model at Bus Functional Level (see Figure 9), the hardware components of the subsystems are described in hardware description languages such as SystemC or VHDL, and the software component and all processor details are abstracted by Instruction Set Simulators. The DivX encoder subsystems were initially a mixed level model, where the bus components and the double bank memory feature were described at RTL, while the application software code and the abstract processors were represented at bus functional level (ISA level). At this level the application software code is implemented with an abstract memory map without any system initialization code. The hardware/software interface was realized through a Bus Functional Model and physical wires.

The next step of the design flow is the SoC integration described in Section 3, that is, the transformation of the software component in Figure 9 into a lower level abstraction subsystem based entirely on hardware IP components. This lower level abstraction subsystem includes the application software, containing the software program and OS, integrated into a fixed hardware memory system. This application software is executed on a time accurate processor simulation model designed for RTL. The communication protocol is also refined to physical wires at RTL. The low level model is shown in Figure 10.
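The benefit of the double banked memory behind the 2-to-2 bus matrix in the DivX subsystems can be illustrated with a small behavioural sketch. It models a single cycle in which the processor and the DMA each request at most one bank. The fixed-priority rule used on a conflict is an assumption for illustration; the arbitration policy of the actual design is not described here.

```python
from typing import Dict, Optional


def arbitrate(cpu_bank: Optional[int], dma_bank: Optional[int],
              priority: str = "dma") -> Dict[str, int]:
    """Return the masters granted in this cycle and the bank each accesses.

    A bank of None means the master issues no request in this cycle.
    Requests to different banks proceed concurrently; on a conflict only
    the master named by `priority` is granted (assumed policy).
    """
    requests = {"cpu": cpu_bank, "dma": dma_bank}
    active = {master: bank for master, bank in requests.items() if bank is not None}
    if len(active) == 2 and active["cpu"] == active["dma"]:
        return {priority: active[priority]}
    return active
```

With `arbitrate(0, 1)` both masters are granted in the same cycle, which is the concurrency that the two memory banks are meant to provide, whereas `arbitrate(1, 1)` grants only one of them.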

[Figure 10: SW component (program ROM/RAM) and HW component: DMA slave interface, master interface, decoder, default slave, AHB2APB bridge, IRQ controller, APB peripherals (timer, remap/pause, watchdog), 2-to-2 interconnection bus matrix (AHB inports 0 and 1, outports 0 and 1), two AHB memory controllers and two SRAMs.]

Figure 10: DivX Subsystem: RTL Model

4.3 Hardware IP Models

The hardware architecture of the DivX Encoder at RTL was designed by reusing existing hardware IP components: a design simulation model of the ARM946E-S processor, an AMBA design kit providing a generic environment for rapid design, and a library of synchronous SRAMs. This part of the design flow is already automated. We have an in-house tool for hardware assembly from the ROSES [1] environment, using a component based approach. There are also commercial tools for rapid hardware assembly, such as Platform Express [12] from Mentor Graphics.

4.4 Software Implementation

A major difficulty in the design of the application software is the memory map layout, which unfortunately is known only after the final hardware implementation. To simplify the MPSoC design flow we propose, at the SoC integration phase, to use mechanisms that make it possible to specify the final memory map during the link process of the software image. We are then able to describe every region of the image that has different load and execution addresses in the memory system. One of the mechanisms allowing this is the scatter loading description file [8]. The scatter loading technique gives complete control over the grouping and placement of memory image components, and can describe any complex software image map. We apply this technique during the generation of the RTL simulation model of our application, which makes the software/hardware merging of the execution model easy to obtain. Figure 11 gives an example of this method: a Tightly Coupled Memory is located at address 0x0 on power-up, but it does not contain a valid instruction. Therefore, during the initialization of the system we have to relocate the ROM containing the valid instructions placed at 0x3 addresses to 0x0, but also allow the processor to locate the TCM at 0x0 during run-time execution.

[Figure 11: load view and execution view of the memory map between 0x00000000 and 0xFFFFFFFF, showing the boot code in ROM, the code placed in the fast I-TCM SRAM, the D-TCM SRAM, RAM0 and RAM1, stack and heap, the APB peripherals and the interrupt controller, together with the following scatter loading file.]

ROM_LOAD 0x10000000 0xFFFF
{
    ROM_EXEC 0x10000000 FIXED 0xFFFF
    {
        init.o (INIT, +FIRST)
        __main.o
        * (Region$$Table)
        * (ZISection$$Table)
    }
    ITCM 0x00000000 0x8000
    {
        vectors.o (VECTORS, +FIRST)
        * (+RO)
    }
    DTCM 0x10000 0x18000
    {
        * (+RW)
    }
}

Figure 11: Memory Map Implementation

The scatter loading file is a text file that describes the location of code and data at both reset time and run time. It assigns to each region a load address at reset time and an execution address at run time. The copying of the software from its load region to an execution region is done by an ARM C library initialization function, which is part of the boot code described in the next section. We are currently working on adapting this technique to the design automation process of our flow.
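The load address versus execution address distinction can be made concrete with a small sketch that uses the region names, base addresses and maximum sizes from the scatter loading listing of Figure 11. The two checks are simplifications written for illustration, not the behaviour of the ARM linker: a region is treated as needing a copy at start-up whenever its execution base differs from the load region base.

```python
from typing import List, NamedTuple, Tuple


class ExecRegion(NamedTuple):
    name: str
    exec_base: int
    max_size: int


LOAD_BASE = 0x10000000  # base of the ROM_LOAD load region
EXEC_REGIONS: List[ExecRegion] = [
    ExecRegion("ROM_EXEC", 0x10000000, 0xFFFF),
    ExecRegion("ITCM", 0x00000000, 0x8000),
    ExecRegion("DTCM", 0x00010000, 0x18000),
]


def overlapping_pairs(regions: List[ExecRegion]) -> List[Tuple[str, str]]:
    """Return address-adjacent execution regions whose ranges overlap."""
    ordered = sorted(regions, key=lambda region: region.exec_base)
    return [
        (low.name, high.name)
        for low, high in zip(ordered, ordered[1:])
        if low.exec_base + low.max_size > high.exec_base
    ]


def regions_to_copy(regions: List[ExecRegion], load_base: int) -> List[str]:
    """Return regions that do not execute in place (simplified rule)."""
    return [region.name for region in regions if region.exec_base != load_base]
```

For this listing no execution regions overlap, and ITCM and DTCM are the regions that the initialization code has to fill from ROM, while ROM_EXEC executes in place.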

4.5 System Initialization

Another main difficulty in the design of the software for an embedded MPSoC application is the initialization sequence after system reset. The application software must provide some of the initialization itself. In this section we want to point out that at a lower level of abstraction of the MPSoC, such as RTL, the initialization code is specific to the application and the target processor. The boot code sequence provides the system initialization. It usually carries out processor initialization and configuration, memory remap, initialization of the memory system and of the memory required by the C code, enabling of the caches and the interrupts, and entry into the C code. Figure 12 shows an example block diagram of the boot code of an ARM processor and gives an overview of the two parts of the application software initialization sequence. The first part, the application specific code, is still developed manually by the system designer and is difficult to automate. The second part is composed of predefined initialization functions provided by the target processor C library.

[Figure 12: Application specific code: reset handler (image entry point); Init (remap, initialize stack pointers, configure MPU, enable TCM, set up caches). ARM C library: __main (copy code and data, zero initialized data); __rt_entry (set up stack and heap through __user_initial_stackheap(), initialize library functions, call top-level constructors); $Sub$$main() (enable caches and interrupts); main() (application).]

Figure 12: ARM946E-S boot code example

4.6 Results Analysis

To pass from the Bus Functional Level model to the RTL model, we spent only a few days refining our application and tying the software to the hardware design. The productivity gain is that more time is spent on simulation to debug the entire functionality of the system than on building and fitting an accurate simulation model of the system.

Thus, we increased the quality and the reliability of our system using a simple and illustrative method for hardware/software merging. The obtained architecture model was verified against the initial BFM model using the Cadence NC-SIM [11] simulator for the hardware part and ARM ADS 1.1 [8] for the software design flow. Figure 13 shows some results from the execution of the simulation with an ARM946E-S DSM running at 60 MHz with 4 Kbytes instruction and data caches and 32 Kbytes TCM.

Boot Code
=========================================================
Total ROM Size (Code + RO Data)       4 320 (4.22 Kbytes)
Total RAM Size (RW Data + ZI Data)      132 (0.13 Kbytes)
=========================================================

ARM946E-S Configuration
=========================================================
Operating frequency:   60 MHz
Instr CACHE size:       4 096 bytes
Data CACHE size:        4 096 bytes
Instr TCM size:        32 768 bytes
Data TCM size:         32 768 bytes
=========================================================

Simulation
=========================================================
Execution time:  83286051 PS + 1 (83us)
Memory Usage:    19.7M program + 26.5M data = 46.2M total
CPU Usage:       3.6s system + 43.3s user = 46.9s total (229.2s, 20.5%)

Figure 13: Simulation results

5 Conclusions

In this paper we presented a systematic design flow for fast hardware/software prototype generation from a Bus Functional Model for MPSoC. This approach promises to be an effective way to perform the SoC integration refinement step, providing several advantages in the MPSoC design flow: an accelerated systematic design process, reduced time for software integration, and accurate transformation from a level of abstraction higher than RTL. The weak point of this approach is the lack of automation for the generation of the scatter loading file and the initialization sequence, which is a goal for future work. With the presented example we illustrated the feasibility and the effectiveness of this approach and its scalability to complex abstraction models of MPSoC.

References

[1] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava, A.A. Jerraya, "Multiprocessor SoC Platforms: A Component-Based Design Approach", IEEE Design & Test of Computers, Vol. 19, Nov-Dec 2002.
[2] F.R. Wagner, L. Carro, W.O. Cesario, A.A. Jerraya, "Strategies for the Integration of Hardware and Software IP Components in Embedded Systems-on-Chip", Integration, the VLSI Journal, Elsevier, Volume 37, Issue 4, pp. 223-252, September 2004.
[3] A. Haverinen, M. Leclercq, N. Weyrich, D. Wingard, "White Paper for SystemC based SoC Communication Modelling for the OCP Protocol", available at http://www.ocpip.org
[4] J. Rawson, "Hardware/Software Co-Simulation", Proc. Design Automation Conference, 1994, pp. 439-440.
[5] A. Sarmento, W. Cesario, A.A. Jerraya, "Automatic Building of Executable Models from Abstract SoC Architectures", RSP'04, Geneva, Switzerland, June 2004.
[6] M.-W. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A.A. Jerraya, "Debugging HW/SW Interface for MPSoC: Video Encoder System Design Case Study", DAC'04, San Diego, USA, June 2004.
[7] V. Bhaskaran et al., "Image and Video Compression Standards: Algorithms and Architecture", Kluwer, 1995.
[8] ARM Documentation, available at http://www.arm.com
[9] Open DivX, available at http://www.projectmayo.com/
[10] SystemC 2.0, available at http://www.systemc.org/
[11] Cadence, available at http://www.cadence.com/
