A Framework for Modelling and Performance Analysis of Multiprocessor Embedded Systems: Models and Benefits

Ismail Assayad* — Sergio Yovine*

* Laboratoire VERIMAG (CNRS)
Centre Equation - 2 avenue de Vignate, F-38610 Gières
[email protected]
[email protected]

ABSTRACT. We present a framework and tools for modelling and performance analysis of multiprocessor embedded systems. Our framework consists of component-based models for parallel software and multiprocessor hardware, and of tools for code generation and performance analysis. The framework allows analyzing software and hardware performance jointly rather than evaluating each in isolation. This joint evaluation makes it possible to predict, at design time, the impact of the hardware on software performance and the ability of the hardware to accommodate future services. The component model relies on a transaction-level description of the hardware and a programmer-level description of the software. Experiments carried out on several real-life, industrial-size applications show that our framework is scalable and delivers precise performance results at fast simulation speeds.

RÉSUMÉ. We present a framework for the modelling and performance analysis of multiprocessor embedded systems. Our framework consists of components modelling parallel software and multiprocessor hardware, and of tools for code generation and performance analysis. It permits a joint analysis of software and hardware, which makes it possible to predict, at design time, the impact of the hardware on software performance and the capacity of the hardware to accommodate future services. The component model is based on a transaction-level description of the hardware and a programmer-level description of the software. The experiments carried out show that our methodology scales and delivers precise performance results at fast simulation speeds.

KEYWORDS: Multiprocessor Embedded Systems, Components, Transactions, Performance.

MOTS-CLÉS: Systèmes Embarqués Multiprocesseurs, Composants, Transactions, Performance.

NOTERE 2007, pages 1 to 12


1. Introduction

Video compression, HDTV, packet routing and other high-performance embedded applications motivate the use of multiple processors. However, the complexity of such multiprocessor embedded systems (MES) makes software programming and analysis difficult, leading to sub-optimal performance. Furthermore, as new applications and services are continuously developed, the main challenge is to provide design frameworks capable of supporting accurate performance analysis and of integrating future applications (Moonen et al., 2005). This requires joint, rather than separate, software and hardware modelling and simulation at design time.

Current practices Several techniques have been proposed to address this issue. Software-based design approaches for MES provide no means of analyzing the impact of software implementation choices on hardware micro-architecture performance, such as bus bandwidth or bank conflicts. It is therefore impossible to measure the capability of hardware configurations to accommodate future applications. Hardware-based design approaches do not consider application programming and analysis as part of the design, so there is no way to evaluate the impact of changing a hardware configuration parameter on software performance. In contrast, platform-based design (PBD) (Sangiovanni-Vincentelli, 2002) provides an adequate level of abstraction for addressing this problem, something which, as our experiments will show, is not possible with the two preceding approaches.

Related work Most PBD approaches found in the literature do not provide complete solutions for MES performance modelling and analysis. Analytical models have been proposed in (Thiele et al., 2002; Gries et al., 2003) for modelling applications in the specific domain of packet processing.
ARTS (Mahadevan et al., 2005) aims at porting the communication-concurrency modelling of micro-architectures to system level by describing them with an abstract RTOS model. This approach uses annotated DAGs of tasks for modelling software and reduces the hardware model to communication latencies; it therefore cannot relate software timing to hardware performance such as, for instance, the available bandwidth. (Paul et al., 2005) proposes a model composed of a set of event sequences and their scheduling functions. MESH (Paul et al., 2005) and METROPOLIS (Balarin et al., 2003) provide general-purpose frameworks in the sense that they make no assumption about the functional and timed models of micro-architectures. This has the advantage of broadening the applicability of these frameworks to modelling concurrency at arbitrary abstraction levels. Nevertheless, none of these approaches proposes a methodology for modelling and analysing micro-architecture performance (and, thus, concurrency) that draws on the specific skills of the designer. Industrial applications of SystemC/TLM (Ghenassia, 2006) use the wrapper-based timing model called PVT as a method for performance modelling. Here, we propose an original PBD approach for not only modelling concurrency, but also handling performance modelling and prediction of MES, using simulation meta-models based on timing annotations.

Our approach We use transaction-level modelling (TLM) (Ghenassia, 2006) as a basis for modelling micro-architectures and software behaviors. We chose TLM because it provides an abstraction level compatible with both wrapper-based and annotated time modelling. To avoid complicating the system design with semantic-less, hardware-dependent wrappers, i.e., wrappers for which no general construction rules exist, and secondarily to avoid slowing down performance analysis by adding another layer, we adopted the annotation method.
Besides, this method keeps open the possibility of leveraging existing academic theory based on annotated models for the verification of timed MES, such as BIP (Basu et al., 2006).

Contribution In addition to the generality of its application to MES concurrency modelling, our PBD approach makes three main contributions. First, it provides component meta-models for assembling and binding micro-architecture and software components, and for composing software and hardware layers, without the need for interfacing wrappers. Second, components are self-contained, with a well-defined structure composed of a set of blocks whose features prove pertinent not only for capturing MES concurrency but also for precise performance prediction of MES. Third, our framework includes compilation and simulation tools that provide a joint, scalable, and fast modelling and analysis environment for concurrent embedded software and multiprocessor hardware.

Benefits We put forward the interest of our framework for analyzing MES by experimenting with complex industrial applications and answering the following questions: (1) what are the benefits of our component-based approach in the context of MES? (2) what are the benefits of a joint software and hardware modelling and performance prediction environment for MES? (3) what is the impact of our componentization on prediction time?

The remainder of the paper is structured as follows. Section 2 summarizes the framework while section 3 focuses on its component meta-models. Section 4 shows an example of a simple system. Section 5 overviews the results of using the framework for predicting the performance of MPEG-4 video encoding on the Philips Wasabi/Cake NoC and of IPv4 packet forwarding on the dual Intel IXP2800 NP. Section 6 discusses the benefits of using the framework to cope with complex MES and, finally, section 7 concludes the paper.

2. Framework overview

We present a PBD framework which supports MES modelling at transaction level for hardware components and at task level for software components.
It offers component-based modelling and binding capabilities and a methodology for composing software and hardware. To support this framework, we developed compilation and simulation tools that enable automated setting of the components, fast performance prediction, and automated code generation. The first tool is a compilation chain, called JAHUEL, for a formal language, called FXML, for modelling software hierarchical task graphs. The purpose of the language is threefold. First, it provides simple, platform-independent constructs to specify the behavior of the application using an abstract execution model. Second, it provides semantic and syntactic support for correctly refining the abstract execution model into a concrete one. Third, the language and the compilation chain are extensible, so that new concrete execution models can easily be supported without semantic breakdowns. Besides, the language can be used by the programmer to express program structure, functionality, requirements and constraints, and by the compiler as a representation to be directly manipulated when performing program analyses and transformations, so as to generate executable code that achieves the application requirements and complies with the platform constraints. The second tool is a simulation platform, called P-WARE, for joint modelling and performance prediction of software and hardware components. The hardware view decomposes software tasks into flows of components' transactions (the term transaction is defined below). P-WARE combines several hardware transaction-level timed components and programmer-level software components, and allows for their composition and for performance observation and profiling.

Setting up a system with this framework proceeds as follows. First, JAHUEL is used to define the model of the application, the architecture, and their mapping. Then, it generates software and hardware component blocks and bindings as input to P-WARE, which performs joint performance prediction of hardware and software. If the performance objectives are met, JAHUEL synthesizes the executable code. A detailed presentation of the framework semantics and tools can be found in (Assayad et al., 2006; Assayad et al., 2005). In the following sections, we focus on describing the P-WARE component meta-models, and show through industrial experiments the benefits of the framework for performance prediction of MES.

3. Components structure description

3.1. Hardware component meta-model

Before giving the meta-model of hardware, we first define the notion of transaction. A transaction (T) is a set of sequences of operations to be executed by the component for a given transaction request (TR). T is represented by an automaton which is initially "enabled", i.e., may be fired. Once fired, the temporal behavior of the transaction is described by an operations graph. This graph is hardware-dependent and is provided by the designer. When T is fired it goes to the "executing" state. Several transactions can be executing at a given time inside a component. Two transactions T1 and T2 which are about to perform operations o1 and o2 may be in conflict if o1 and o2 use a shared resource r. We call the set of operations which are in conflict over a shared resource ri a "conflict group" of ri. The resolution of conflicts depends on the design of the component.
Formally, a transaction is a pair ⟨T, (C, L)⟩ where T is an automaton over the two possible states of the transaction, namely "enabled" and "executing"; C is either a single chain or a multi-chain with one conditional source node prolonged by at least two chains; and L is a function giving, for each operation of C, the corresponding latency. The timed-level transaction behavior (TLTB) is a pair (T, R) composed of T, the set of transactions, and R, the set of conflict groups.

The hardware component meta-model is depicted in figure 1. It is a meta-model because it exhibits the concurrency and timed behaviors that must be specified by the designer of the component in order to set up the final model. It is composed of five blocks whose interactions are fixed but whose behaviors are given by the designer. The buffer stores input TR, and signals whether the component accepts or refuses new TR. The arbiter is to be instantiated with a scheduling policy over the buffer; both the buffer and the arbiter's scheduling incur latencies. The controller is to be instantiated with a control policy in charge of deciding which transaction to launch and when, and of resolving operation conflicts: an operation starts executing when it receives a "fire" message from the controller (!fire), and notifies the latter of the operation's completion by sending the "end" message (?end). The transactor sends outputs to recipient input buffers. It is also used to control transaction granularity, i.e., the size of the physical data on which transactions operate in the real micro-architecture (e.g., data of the bus). The TLTB is to be instantiated with the set of timed-level behaviors of transactions. The controller's actions are twofold: TLTB-control actions and output-control actions. When it receives a TR from the arbiter, it decides which transaction of the TLTB to fire and when. This decision making is hardware-specific.
If the transaction is enabled, the controller fires it immediately; otherwise, the firing is delayed until the transaction becomes enabled.
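The transaction structure ⟨T, (C, L)⟩ and its two-state automaton can be sketched in code. The following is an illustrative Python model; all class and field names are our own, not P-WARE's actual API, and the operation chain is a simple list rather than a general multi-chain.

```python
from dataclasses import dataclass

# Illustrative sketch of a transaction <T, (C, L)>: a two-state automaton T
# ("enabled"/"executing"), an operation chain C, and a latency function L.
@dataclass
class Transaction:
    name: str
    chain: list            # operation names, e.g. ["A", "B", "C"]
    latency: dict          # operation -> latency in cycles (the function L)
    state: str = "enabled"

    def fire(self):
        # only enabled transactions may be fired by the controller
        assert self.state == "enabled"
        self.state = "executing"

    def complete(self):
        self.state = "enabled"

    def duration(self):
        # total latency of the operation chain
        return sum(self.latency[op] for op in self.chain)

# A TLTB is the pair (set of transactions, set of conflict groups).
@dataclass
class TLTB:
    transactions: dict     # name -> Transaction
    conflict_groups: list  # each group: set of (transaction, operation) pairs

t11 = Transaction("T11", ["A", "B", "C"], {"A": 3, "B": 3, "C": 1})
t11.fire()
print(t11.state, t11.duration())   # executing 7
```

The latencies used here (3, 3, 1 cycles) anticipate the memory example of section 4.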

Figure 1. Hardware component (blocks: Buffer, Arbiter, Controller, Transactor, TLTB)

Figure 2. Software component (blocks: HTG, Scheduler (S), Dispatcher (D); ET: enabled tasks; NTj: completed tasks in other components)

The controller also resolves inter-transaction operation conflicts. When a transaction completes, the controller is notified and sends the results to the transactor. The latter outputs the results, i.e., generates TR to the recipient buffers.

3.2. Software component meta-model

The software component meta-model is composed of a hierarchical task graph (HTG) describing the application logic, and of a scheduler and a dispatcher which control task concurrency through the mapping and ordering of (TR of) computations and communications (figure 2). The HTG models the application logic. Roughly speaking, it is composed of three types of nodes. Single nodes, called tasks, have no sub-nodes and are used to encapsulate TR. For instance, a task "x=x+1" incrementing a variable x may be decomposed, depending on the compiler, into remote-memory read and write TR with, between the two, a processor TR locally incrementing the read value. Compound nodes contain other nodes; they are used to represent control structures and data dependencies between nodes. Loop nodes represent sequential and parallel loops; a loop body is a compound node.

The scheduler (S) receives scheduling requests from tasks through "sched" messages and puts these tasks into the set of waiting tasks (WT). It computes the set of enabled tasks (ET) from WT; this computation corresponds to an instantiable, application-specific scheduling function. The set ET is then sent to the dispatcher. The scheduler also observes the completion of tasks: when a task completes, it notifies the scheduler through the "notify" message. The set of notified tasks (NT) may be used by the schedulers and dispatchers of components to implement their scheduling and mapping policies. The dispatcher (D) is in charge of mapping data to memory spaces, i.e., assigning addresses to variables, and tasks to hardware, i.e., assigning TR (e.g., read, write, add, load, etc.) to (buffers of) processors. The HTG thus includes tasks with "sched" and "notify" primitives for communicating with the scheduler, and TR for composing software with hardware.
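The scheduler's handling of WT, ET and NT can be sketched as follows. This is a minimal illustrative model, not P-WARE's code: the names and the enabling-function signature are our own, and the example rule anticipates the producer/consumer dependency of section 4.

```python
# Minimal sketch of the scheduler block: tasks enter the waiting set WT via
# sched(), an application-specific enabling function computes ET from WT and
# NT, and notify() records completions in NT.

class Scheduler:
    def __init__(self, enabling_fn):
        self.WT = set()                  # waiting tasks
        self.NT = set()                  # notified (completed) tasks
        self.enabling_fn = enabling_fn   # application-specific: (WT, NT) -> ET

    def sched(self, task):
        self.WT.add(task)

    def notify(self, task):
        self.NT.add(task)

    def enabled_tasks(self):
        ET = self.enabling_fn(self.WT, self.NT)
        self.WT -= ET                    # enabled tasks leave the waiting set
        return ET

# Producer/consumer-style rule: producer iteration ("P", i) is enabled iff
# consumer iteration ("C", i - 1) has completed (or i is the first iteration).
def rule(WT, NT):
    return {t for t in WT if t[1] == 0 or ("C", t[1] - 1) in NT}

s = Scheduler(rule)
s.sched(("P", 0)); s.sched(("P", 1))
print(s.enabled_tasks())                 # {('P', 0)}: ('P', 1) waits on ('C', 0)
s.notify(("C", 0))
print(s.enabled_tasks())                 # {('P', 1)}
```

The dispatcher would then map each enabled task's TR to processor buffers; that step is omitted here.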
Note that "sched" and "notify" may be implemented using untimed events, or using TR so as to take their performance overhead into account.

Given a set of hardware and software components, automated behavior monitoring is achieved using meta-model performance observers in charge of monitoring the states of component sub-blocks over execution time. Given an observation time interval, the monitored performance figures include, for example, available bandwidth, conflicts and cache misses for hardware components, and task communication, synchronization and execution times for software components. Even though the profiled hardware performance figures are heterogeneous by nature, they are all deduced from the following sampled values: the number of received TR, the idle time of the component and/or of the transactions, and the number of generated TR.

4. Example: Modelling a producer/consumer system

Let us consider a producer and a consumer component (P and C) deployed on a multiprocessor hardware platform. The hardware is composed of two processors (P1 and P2), one command bus, one data bus, and a channel controlling memory accesses (Figure 3).

Figure 3. P and C example (SW and HW models; TR flows, dotted OKout back-pressure arrows)

Figure 4. "P1" (Buffer with instruction slots, Arbiter, Controller, Transactor, TLTB; outputs TR to the buses)

Figure 5. "DATA BUS" (round-robin arbitration over Pi FIFOs; outputs TR to the memory banks' FIFOs)

Figure 6. "CMD BUS" (round-robin arbitration over Pi FIFOs; outputs TR to the channel's CMD FIFO)

Figure 7. "CHANNEL" (four bank FIFOs, round-robin selection; hand-shake output of TR to the memory)

Figure 8. "MEMORY" (two banks b1 and b2; operations A (3 cycles), B (3 cycles), C (1 cycle); conflict group at C)

The producer is an infinite loop writing a variable and the consumer is an infinite loop reading it (Figure 9). The ith iteration of the consumer may start only after the completion of the ith iteration of the producer. The two loops are concurrent. We denote by S1, S2, D1 and D2 the schedulers and dispatchers of P and C, respectively. We also denote by Pi and Ci the ith occurrence of P and C. The schedulers Si are called by P and C using the sched() and notify() primitives. Let us suppose a views-map where the variable x is mapped to a memory word and P and C are mapped to processors P1 and P2. That is, D1 maps x to that word, and Pi to (an instruction slot in) the buffer of P1; D2 maps x to the same word, and Ci to (an instruction slot in) the buffer of P2. The HTGs of P and C are thus in the form shown in figure 9. When Pi executes S1.wait(), S1 moves it to the set WTP. S1 enables Pi iff Ci−1 is in the set NTC. At the completion of Pi, the producer executes S1.notify(); S1 then inserts Pi into NTP, sends the sets NTP and ETP to D1, and sends the set NTP to S2, which then moves Ci+1 to its set of enabled tasks ETC. Likewise, when Ci executes S2.wait(), S2 moves it to the set WTC. When Ci completes, it notifies S2, which then inserts Ci into NTC, sends the sets NTC and ETC to D2, and sends the set NTC to S1, which then moves Pi+1 to its set of enabled tasks ETP. Notice that this implementation introduces a

Figure 9. Simple producer/consumer (the HTGs of P and C; their node contents are not recoverable from the extraction)

dependency between Ci−1 and Pi. This is due to the mutually exclusive access to the variable x (without data losses). The TR flows generated on the hardware architecture by a global write TR (write(@, 32B, data)) issued from the software layer are shown in figure 3. Flow directions are guided by the designer through the transactor implementations and component bindings. Note that the flow granularity, i.e., the granularity of transactions, can also be varied from bus size (e.g., one or a few pixels) to application packet (e.g., a frame of pixels) through bus packet (e.g., a whole line or macroblock of pixels), according to the designer's implementation of the software dispatchers and hardware transactors. In this example it is bus size, fixed to 32B. The component behaviors accompanying these flows are as follows. The global write TR is stored in an instruction slot inside the buffer of P1 (figure 4). After handling (i.e., arbitration, then transaction firing, then results output), a command TR is sent to the buffer of the command bus while the data TR is stored inside the buffer of the data bus. The command and data TR remain there until the bus arbiters note a change in the state of the variable OK of the channel, indicating that there is room inside the channel for the TR (dotted arrows in figure 3). Then, the command TR is sent by the command bus to the channel. The data TR is executed after having won arbitration in the data bus (physically, this corresponds to a data transfer), and a new data TR is then sent to one of the four FIFOs of the channel. The chosen FIFO depends on the transactor algorithm, which is usually based on a mapping between the TR address value and the FIFO number (figure 5). Finally, the channel performs a round-robin selection of write-data TR and sends them to the memory (figure 7).
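The address-based FIFO mapping and the round-robin selection can be sketched as follows. This is an illustrative model only: the paper states that the mapping is address-based but does not give it, so the modulo mapping below is hypothetical, as are all names.

```python
from collections import deque

# Illustrative sketch of the channel's four bank FIFOs: an address-based
# dispatch (hypothetical modulo mapping) and a round-robin selection that
# resumes scanning after the last-served FIFO.

NUM_FIFOS = 4
fifos = [deque() for _ in range(NUM_FIFOS)]

def dispatch(tr_address, tr):
    # hypothetical mapping: low-order bits of the TR address pick the FIFO
    fifos[tr_address % NUM_FIFOS].append(tr)

def round_robin(last):
    # scan FIFOs starting after the last-served one; return (fifo, TR) or None
    for k in range(1, NUM_FIFOS + 1):
        i = (last + k) % NUM_FIFOS
        if fifos[i]:
            return i, fifos[i].popleft()
    return None

dispatch(0, "w0"); dispatch(1, "w1"); dispatch(5, "w5")
last = -1
order = []
while (pick := round_robin(last)) is not None:
    last, tr = pick
    order.append(tr)
print(order)   # ['w0', 'w1', 'w5']
```

Addresses 1 and 5 map to the same FIFO here, so "w5" is served only after the round-robin pointer returns to that FIFO.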
The memory model instantiated by the designer includes two independent banks b1 and b2, with a precharge (A) latency of 3 cycles, a row-access (B) latency of 3 cycles and a column-access (C) latency of 1 cycle (figure 8). Operations A and B are common to sequential access patterns: when successive sequential TR are executed, the latter reduce to the single operation C. There are thus four TLTB transactions: T11 and T12 for bank b1, and T21 and T22 for bank b2, where the Tk1 are composed of operations A, B and C while the Tk2 are composed of operation C only. All transactions are in conflict at operation C. The memory controller operates as follows. Upon receiving a transaction request TR = (i, j, k), i.e., a memory access for the data in column i, row j and bank k, the controller tests whether transactions Tk1 and Tk2 are enabled. If so, it fires either Tk1 or Tk2 according to the last preceding transaction request (TRlast_k) of bank bk: if TRlast_k.j = TR.j then it fires Tk2, otherwise it fires Tk1. The channel communicates with the memory using a hand-shake protocol: the channel first sends the TR to the memory controller and blocks waiting for an acknowledgement from the latter, indicating that the TR was taken into account by firing the corresponding bank transaction.
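The controller's firing rule above amounts to an open-row policy, which can be sketched directly. The code is illustrative (names are ours); the latencies are those of the example: A = 3, B = 3, C = 1 cycle.

```python
# Sketch of the bank controller's firing rule: on a request TR = (i, j, k)
# (column i, row j, bank k), fire the short transaction Tk2 when the row
# matches the last request to bank k (TRlast_k.j == TR.j), else the full Tk1.

LAT_FULL = 3 + 3 + 1    # Tk1: operations A (precharge), B (row), C (column)
LAT_SHORT = 1           # Tk2: operation C only

last_row = {}           # bank k -> row of the last fired request (TRlast_k.j)

def fire(i, j, k):
    """Return (transaction name, latency in cycles) for request (i, j, k)."""
    if last_row.get(k) == j:
        name, lat = f"T{k}2", LAT_SHORT
    else:
        name, lat = f"T{k}1", LAT_FULL
    last_row[k] = j
    return name, lat

# A sequential access pattern on bank 1: the same row pays A and B only once.
print(fire(0, 8, 1))    # ('T11', 7)
print(fire(1, 8, 1))    # ('T12', 1)
print(fire(0, 9, 1))    # ('T11', 7)
```

This history dependence is precisely the kind of dynamic effect that, as section 5.2 notes, purely analytical models fail to capture.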


In addition to the operations of the TLTB transactions, the buffer, arbiter, transactor and controller functions may be annotated with timing models. These latencies are hardware-specific and can be found in product reference manuals. For instance, a DRAM component is usually characterized by the latencies of operations such as A, B and C, among others.

5. Real-life applications

We assessed the effectiveness of our framework by using P-WARE for modelling and performance prediction of an MPEG-4 video encoder (MPEG-4, 2001) and an IPv4 packet forwarder (IPv4, 2005) on two hardware architectures: the Wasabi/Cake network-on-chip (Stravers et al., 2001) and the dual IXP2800 network processor (Adiletta et al., 2002). These applications also show the need for joint software and hardware analysis in order to identify limiting factors and improve MES performance. Hereinafter, we present the experimental results.

5.1. MPEG-4/Cake

For MPEG-4, we considered several algorithms and several configurations of the Cake network: we programmed five parallel implementations of MPEG-4 using different motion estimation algorithms, i.e., full search (F), the 16-, 32- and 64-range hierarchical algorithms (H16, H32 and H64), and an implementation without motion estimation (N), and we evaluated task- and frame-level mappings of them onto Cake configurations with up to four tiles of up to eight processors each. With task-level parallelization, the estimated speedup is under two for all Cake configurations. Profiling of the encoder tasks shows that this weak speedup is due to the disparity in task synchronization times, which results in significant under-use of the corresponding processors. For instance, figure 10 shows this disparity when using one tile. As a consequence, hardware extensions alone are not enough here; we hence considered the second type of MPEG-4 parallelization.
With frame-level parallelization, which consists of encoding different frames on several tiles, better speedups were achieved: speedups between 6 and 8 for the implementations F, H64, H32 and H16, and a speedup near 4 for the implementation N, were obtained when using 4 tiles of at least 4 processors each.

Figure 10. MPEG-4 Synchronization (H16 MPEG-4 synchronization time on a 1-tile Cake, per processor, as a percentage of task execution time; tasks include MVP, ME, Choice, MVD, PRD, DIFF, DCT, QUANT, ZIGZAG, RUNLEVEL, QUANTI, DCTI, ADD and LAYER)

Figure 11. IPv4 Speedup (speedup vs. number of banks, from 0 to 768, for populations of 1, 2 and 3 channels and tile sizes of 1, 2, 4 and 8 processors)


To point out the interest of our approach of jointly modelling and predicting the performance of MPEG-4 and Cake, it is also worth noting that for MPEG-2 decoding, which is recognized to be less complex than encoding by a factor of 3 to 5 [1], only a speedup under 7 was obtained using a 4-tile, 4-processor configuration [2].

5.2. IPv4/IXP

To further validate the effectiveness of our framework, we carried out a second study, namely IPv4 over the dual IXP2800. For simplicity, we fixed the IPv4 algorithm and mapping, and we explored and assessed several IXP configurations consisting of different channel/bank memory populations. For all of them, the average packet-throughput speedup increases until the number of banks reaches sixty-four, then stabilizes around twelve, nine and seven when populating three, two and one channels, respectively. The speedup results for all the populations are depicted in figure 11. By looking at the average consumed bandwidths inside the IXP, i.e., the bandwidth of the pull/push buses, we found that the speedup curves have the same trends as those of the average used bus bandwidths. For instance, when using three channels, starting from sixty-four banks the average used bandwidths stabilize around 3.7 GB/s. The speedup curves can thus be explained by the buses' bandwidth limitation of 2.8 GB/s on the one hand, and the read/write transaction-request mix ratios on the other hand: non-unit read/write mix ratios may result, at the extremes, in a full use of one internal bus and an under-use of the other, in which case one of the buses becomes a bottleneck of the system and, hence, increasing the memory population has no further impact on data throughput. Inspection of the instantaneous used bandwidths, rather than the averages, confirms this hypothesis and shows that the push bus is saturated over some observation intervals.
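The bottleneck argument can be made concrete with a back-of-envelope model. The sketch below is illustrative only: it assumes the 2.8 GB/s limit quoted in the text applies to each bus separately (reads on the pull bus, writes on the push bus), and the demand figures are invented for the example.

```python
# Back-of-envelope sketch of the bus-saturation argument: with a fixed
# per-bus limit, a skewed read/write mix saturates one internal bus while
# under-using the other, capping total sustainable throughput.

BUS_LIMIT = 2.8   # GB/s per internal bus (assumption: pull = reads, push = writes)

def sustainable_throughput(demand, read_fraction):
    """Portion of a total demand (GB/s) that the two buses can actually carry."""
    reads = demand * read_fraction
    writes = demand * (1 - read_fraction)
    # scale the demand down until neither bus exceeds its limit
    scale = min(1.0,
                BUS_LIMIT / max(reads, 1e-9),
                BUS_LIMIT / max(writes, 1e-9))
    return demand * scale

print(sustainable_throughput(8.0, 0.5))   # balanced mix: both buses usable, 5.6 total
print(sustainable_throughput(8.0, 0.9))   # read-heavy mix: the pull bus caps throughput
```

Once one bus saturates, raising memory capacity (more banks) adds nothing, which matches the observed speedup stagnation.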
This study shows the interest of our approach for predicting the performance of IPv4, compared to those used in the literature: the approach of (Fu et al., 2005) does not support the evaluation of hardware performance metrics; the analytical model of (Gries et al., 2003) does not capture components with dynamic effects such as history-dependent memory-bank latencies or bus arbitration; and the Petri-net model of (Govind et al., 2005) takes hardware performance figures such as memory conflicts as inputs rather than evaluating them, which is a drawback with respect to modelling scope, expressiveness, and performance prediction accuracy.

[1] http://www.videsignline.com/showArticle.jhtml?articleID=188500425
[2] ce-serv.et.tudelft.nl/~pyrrhos/CEColloquium/presentations/cake-talk.ppt

6. Benefits of the approach

The experimental results presented in the previous sections put forward some arguments validating the suitability of our framework for performance prediction and analysis of MES.

6.1. Benefits of our component-based approach in the context of MES

Our component meta-models support system modelling at different data granularities while still yielding fast predictions. In our experiments, we considered task graphs, the characteristics of tasks' transaction requests (e.g., memory read/write vs


bus read/write transfer requests, transfer length, etc.), and the scheduling and dispatching policies for transaction requests, as the main rationale for software component modelling. Operating-system effects (e.g., real-time scheduling) may also be captured when instantiating component blocks (e.g., the scheduler block). On the hardware side, we identified the arbitration policies for transaction requests, the latencies of transaction behaviors, and the transaction granularities and their flow over the hardware architecture, which are controlled by transactors, as the pertinent characteristics to be instantiated by the designer when modelling micro-architectures. Our experiments using P-WARE have demonstrated the contribution of these features to modelling expressiveness and to fast performance prediction of MES. Indeed, for the Cake models we used a bus packet as the data granularity, which is suitable for high data-locality applications such as MPEG-4, while for the IXP models we used the bus size, which is more suitable for IPv4. Both resulted in fast predictions, with averages of 490,000 cycles/s and 285,000 cycles/s for large architectures.

In exchange for these expressiveness and prediction-speed advantages, our approach is expected to introduce some imprecision in performance metric values. To evaluate the trustworthiness of our results, we showed that the speedup results obtained using P-WARE are within 5% of those of the IXP simulator on the common test benches provided in Intel's IXA SDK framework (IXP2800, 2004). Moreover, the tests showed that the curves of the predicted values have shapes consistent with, and lie below, those of the values measured by the cycle-accurate simulator. This is crucial for validating the accuracy of our model of concurrency and performance with respect to measurement.
Indeed, it shows, on the one hand, that analyses carried out on the predictions yield conservative measurements (e.g., predicted IXP speedups below those given by the IXP simulator), and, on the other hand, that in spite of their inaccuracy the predictions still make it possible to identify the limiting factors (e.g., speedup stagnation due to push-bus saturation), thanks in particular to the correct trends.

6.2. Impact on prediction times

Experiments showed that our approach achieves high speeds for the prediction of complex MES. For Cake, simulation speeds vary between 227,000 and 644,000 cycles/second, with an average of 400,000 cycles/s; the fastest speeds were achieved for the most complex algorithm, which has the highest encoding time. For IXP, the average simulation speed is around 285,000 cycles/s, with values reaching 702,000 cycles/s. Furthermore, the correlation coefficient between encoding or forwarding times and simulation times over all experiments is almost one, which means that the ratio between prediction times and simulated times is almost constant. These results show that our approach scales very well with respect to (a) the size of the inputs (e.g., pictures with resolutions up to 720x576), (b) the complexity of the software (e.g., an encoding algorithm with full-search motion estimation), and (c) the size of the hardware (e.g., a dual IXP with more than 700 banks in each unit). We believe the reason our framework scales is twofold: (a) the use of wrapper-less annotated models and, more importantly, (b) the operational semantics of the components, based on event-triggered (automata) behaviors, which results in better simulation performance than clock-driven behaviors. Indeed, in the latter the state of the whole system must be updated at every cycle, while in the former only local states are updated, either simultaneously (e.g., in a SystemC-based implementation) or not, at the occurrence of

Modelling and performance analysis

11

each event in the former. These events are intra- or inter-component events and concern either messages for action requests or signals of changes in the states of the components. As examples, an intra-component event is “controller fired a new transition”, and an intercomponents one is “target buffer state reached a back pressure level” which prevents, for instance, the arbiter of the source component from selecting requests targeted to the buffer who reached the back pressure level.
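To make the event-triggered semantics concrete, here is a minimal Python sketch of such a loop, including the back-pressure interaction just described. All names (Buffer, Arbiter, the latency value) are illustrative and do not reflect P-WARE's actual components or API:

```python
import heapq
from collections import deque

class Buffer:
    """Target buffer; signals back pressure once a fill level is reached."""
    def __init__(self, capacity, backpressure_level):
        self.items = deque()
        self.capacity = capacity
        self.backpressure_level = backpressure_level

    def back_pressured(self):
        return len(self.items) >= self.backpressure_level

class Arbiter:
    """Grants pending transaction requests; a request whose target buffer
    signalled back pressure is deferred (an inter-component event)."""
    def __init__(self, buffers):
        self.buffers = buffers
        self.pending = deque()

    def select(self):
        for _ in range(len(self.pending)):
            tgt, payload = self.pending.popleft()
            if self.buffers[tgt].back_pressured():
                self.pending.append((tgt, payload))  # defer, try another
            else:
                return tgt, payload
        return None  # every pending target is back-pressured

def simulate(arbiter, requests):
    """Event-triggered loop: time jumps from event to event, and only the
    components named by the event update their local state (no per-cycle
    sweep over the whole system, as a clock-driven model would do)."""
    LATENCY = 2  # illustrative transfer latency, in cycles
    events = [(t, i, ("request", tgt, p)) for i, (t, tgt, p) in enumerate(requests)]
    heapq.heapify(events)
    seq, trace = len(events), []
    while events:
        t, _, (kind, tgt, payload) = heapq.heappop(events)
        if kind == "request":
            arbiter.pending.append((tgt, payload))
            granted = arbiter.select()
            if granted is not None:
                g_tgt, g_payload = granted
                heapq.heappush(events, (t + LATENCY, seq, ("deliver", g_tgt, g_payload)))
                seq += 1
        else:  # "deliver": only the target buffer's local state changes
            arbiter.buffers[tgt].items.append(payload)
            trace.append((t, tgt, payload))
    return trace

bufs = {"B0": Buffer(capacity=4, backpressure_level=2),
        "B1": Buffer(capacity=4, backpressure_level=2)}
bufs["B1"].items.extend(["x", "y"])  # B1 already at its back-pressure level
arb = Arbiter(bufs)
trace = simulate(arb, [(0, "B0", "a"), (1, "B1", "b")])
# "a" reaches B0; the request to B1 stays deferred in arb.pending
```

Because the loop advances directly from event to event, idle cycles cost nothing; a clock-driven model would instead re-evaluate every component at each simulated cycle, which is what the event-triggered semantics avoids.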

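The link made in Section 6.2 between a correlation coefficient of almost one and a nearly constant prediction-time/simulated-time quotient can be illustrated numerically. The sketch below uses made-up figures around the 400000 cycles/s average, not the actual measurements:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative data: simulated cycles per run, and wall-clock prediction
# time at a roughly constant ~400000 cycles/s rate (small noise added).
cycles = [1.2e9, 2.5e9, 4.0e9, 6.8e9]
noise = [0.02, -0.01, 0.015, -0.005]
seconds = [c / 400000 * (1 + e) for c, e in zip(cycles, noise)]

r = pearson(cycles, seconds)
assert r > 0.999  # near-proportional data gives a correlation close to one
```

When the cycles/second rate is almost constant across runs, prediction time is almost proportional to the simulated time, and the correlation coefficient approaches one, as observed in the experiments.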
6.3. Benefits of a joint software and hardware modelling environment

The use of a uniform component-based meta-modelling environment for both software and hardware description eases the design of software applications and hardware architectures, as well as their later modification through high-level manipulations of system components (instantiation/extension of component blocks, binding relationships). In the context of MES this contribution is very important, since software designers often need to modify the mapping between the software application and the underlying hardware architecture in order to find the best configuration with respect to their joint performance (eg. are application deadlines met? is the available bus bandwidth sufficient?). As in our case studies, these modifications may also involve the architecture of the software (eg. the communication model) or the architecture of the hardware (eg. the memory hierarchy). Using the JAHUEL language, they are reduced to high-level component manipulations, offering a considerable gain in implementation time, which is essential for lowering the development costs of embedded product lines.

Furthermore, beyond this automation aspect, we believe that joint software and hardware prediction, conducted with user-driven system-exploration approaches, results in faster and far more efficient designs than separate analyses, even fully automated ones. The former allows rich performance feedback in a short time and leads to efficient designer-driven performance tracking by combining this fast exploration with the designer's experience in the analysis. In the IXP experiments, for instance, hardware analysis shows that beyond a threshold of 64 banks, performance is almost the same for all channel configurations due to the physical limitation of the internal buses.
The designer was thus able to see that solving the problem requires buses with larger bandwidth rather than a further increase in the memory population, and that developing the software alone is not enough to improve IPv4 throughput. Similarly, but on the software side this time, in the Cake experiments the designer was naturally led to try new software mappings based on data-level parallelization, once he had noticed that hardware development alone was of no interest: the poor MPEG-4 speedup observed with task-level parallelization was due solely to the huge disparity in the synchronization times of the tasks.

7. Conclusion

We have proposed a framework and a simulation tool for analyzing software and hardware performance jointly rather than evaluating each one in isolation. This joint evaluation enables predicting, at design time, (1) the impact of hardware on software performance and (2) the ability of the hardware to accommodate future services. The tool relies on an annotated component-based modelling framework which combines transaction-level hardware models and programmer-level software models. Moreover, the experiments carried out on complex MES show that our simulation prototype is scalable and delivers precise performance results with fast simulation speed.


NOTERE 2007

8. References

Adiletta M., Rosenbluth M., Bernstein D., Wolrich G., Wilkinson H., "The Next Generation of Intel IXP Network Processors", Intel Technology Journal, vol. 6, Intel Communications Group, Intel Corporation, Aug. 2002.

Assayad I., Bertin V., Defaut F.-X., Gerner P., Quévreux O., Yovine S., "JAHUEL: A formal framework for software synthesis", ICFEM'05, 2005.

Assayad I., Yovine S., "System Platform Simulation Model Applied to Multiprocessor Video Encoding", IEEE Symposium on Industrial Embedded Systems, 2006.

Balarin F., Watanabe Y., Hsieh H., Lavagno L., Passerone C., Sangiovanni-Vincentelli A., "Metropolis: An Integrated Electronic System Design Environment", Computer, vol. 36, num. 4, pp. 45-52, 2003.

Basu A., Bozga M., Sifakis J., "Modeling Heterogeneous Real-time Components in BIP", SEFM'06, IEEE Computer Society, Washington, DC, USA, pp. 3-12, 2006.

Fu J., Hagsand O., "Designing and Evaluating Network Processor Applications", Workshop on High Performance Switching and Routing, 2005.

Ghenassia F., Transaction Level Modelling with SystemC. TLM Concepts and Applications for Embedded Systems, Springer, 2006.

Govind S., Govindarajan R., "Performance Modeling and Architecture Exploration of Network Processors", QEST, 2005.

Gries M., Kulkarni C., Sauer C., Keutzer K., "Comparing Analytical Modeling with Simulation for Network Processors: A Case Study", DATE, 2003.

IPv4, "Teja NP IPv4 Forwarding Foundation Application with DiffServ and NAT", http://www.teja.com/content/4980_Teja_IPv4.pdf, 2005.

IXP2800, Intel IXP2800 Network Processor. Optimizing RDRAM Performance: Analysis and Recommendations, Application Note, Technical report, http://www.intel.com/design/network/products/npfamily/docs/ixp2805_docs.htm, 2004.

Mahadevan S., Storgaard M., Madsen J., Virk K. M., "ARTS: A System-Level Framework for Modeling MPSoC Components and Analysis of their Causality", MASCOTS'05, IEEE, 2005.

Moonen A., van den Berg R., Bekooij M., Bhullar H., van Meerbergen J., "A Multi-Core Architecture for In-Car Digital Entertainment", Proceedings of the GSPx Conference, 2005.

MPEG-4, Information technology – Coding of audio-visual objects – Part 2: Visual, ISO/IEC 14496-2, Prentice Hall, 2001.

Paul J. M., Thomas D. E., Cassidy A. S., "High-level modeling and simulation of single-chip programmable heterogeneous multiprocessors", ACM Trans. Des. Autom. Electron. Syst., vol. 10, num. 3, pp. 431-461, 2005.

Sangiovanni-Vincentelli A., "Defining platform-based design", EEdesign, EETimes, February 2002.

Stravers P., Hoogerbrugge J., "Homogeneous multiprocessing and the future of silicon design paradigms", International Symposium on VLSI Technology, Systems, and Applications (VLSITAS), pp. 184-187, 2001.

Thiele L., Chakraborty S., Gries M., Künzli S., "Design Space Exploration of Network Processor Architectures", HPCA Workshop, February 2002.
