2008 International Symposium on Parallel and Distributed Processing with Applications

Relaxing Synchronization in a Parallel SystemC Kernel

Philippe Combes, Eddy Caron, Frédéric Desprez
University of Lyon, LIP Laboratory, UMR CNRS - ENS Lyon - INRIA - UCB Lyon 5668

Bastien Chopard
CUI, University of Geneva

Julien Zory
STMicroelectronics (now Honeywell)

Abstract

SystemC has become a very popular standardized language for the modeling of System-on-Chip (SoC) devices. However, due to the ever increasing complexity of SoC designs, the ever longer simulation times affect SoC exploration potential and time-to-market. In order to reduce these times, we have developed a parallel SystemC kernel. Because the SystemC semantics require a high level of synchronization, which can dramatically affect performance, we investigate in this paper some ways to reduce the synchronization overheads. We then validate our approaches against an academic design model and a real, industrial application.

1. Introduction

A typical System-on-Chip (SoC) development flow includes various activities that are most often handled by different groups with different profiles. Algorithm exploration, functional modeling, hardware-software integration, Register-Transfer Level (RTL) design and verification are all part of that overall process. The SystemC [9] language was developed by the Open SystemC Initiative (OSCI) to enable system-level design and ease model exchange. It is promoted by several major EDA vendors and microelectronics companies, and it is today a very popular modeling solution in large-scale programs. Engineers use a C++ based modeling platform to represent the functionality, communications, software, and hardware components at multiple levels of abstraction with a single common language. Note that in the rest of this paper, the term SystemC will refer to the system-level modeling capabilities of SystemC 2.0.1.

The ability to develop more and more complex systems with completely virtual platforms and reasonably fast simulations is a key enabler for tomorrow's successful SoC developments. More specifically, the convergence of multimedia technologies (MPEG4, MP4, etc.) with wireless connectivity (UMTS, WiMax, etc.) into new-generation home and handheld consumer devices typically results in very complex systems. Time-to-market is particularly critical for device manufacturers to build a true competitive advantage. One promising approach to reduce simulation times is parallel simulation. SystemC is most often used to describe the behavior of a complete system in which several hardware and/or software components concurrently perform some tasks. There are several possible approaches to exploit this inherent concurrency in parallel simulation. They range from the basic use of OS threading on SMPs [11] to complex distributed platforms [3, 7], which aim at providing IP (Intellectual Property) products for testing rather than at increasing the simulation speed. However, to achieve fast simulation on cheap clusters of workstations, the best tracks come from the field of PDES (Parallel Discrete Event Simulation), which was widely studied in the 1990s [4, 5, 8]. Some simulators with specific custom languages have arisen from these concepts [10, 12, 14], but none has been applied so far to a SystemC kernel¹. We have developed a parallel version of the OSCI kernel based on the conservative approach to PDES; it proved to be efficient in some specific cases, but often suffers from high synchronization overheads [1, 2]. This paper explores some ways to reduce them.

Section 2 first briefly describes SystemC and the algorithm of the public-domain OSCI reference simulator. Then, using the same formalism, a model is given of our parallel simulator prototype as it was published in [1]. Section 3 targets the reduction of the synchronization costs, focusing on data exchanges and on the synchronization of the simulation time. The impact of our approaches is finally evaluated against a straightforward pipeline application in Section 4 and a real-world industrial application in Section 5.

1 SimCluster, which is commercial software, is said to support SystemC: http://www.avery-design.com/web/avery_hdlcon02.pdf


2. Parallelization of the OSCI Kernel

2.1. The OSCI SystemC Kernel

A typical SystemC application is characterized by both a structural part and a behavioral part. The structure basically describes the way (hardware and software) components are connected to each other; this translates into various sc_modules connected through channels. Hierarchical structures are supported as well: sc_modules may instantiate other (sub-)modules. The behavioral part of the system is captured in SystemC with the notion of process. Each module may contain one or several processes. The execution of a given process is driven by events (such as a value change on the channels connected to the module) that may wake up or even restart the process.

The user describes his/her application with the proper structure and behavior implementations using the appropriate SystemC constructs. Then, the simulation of the system is directly handled by the SystemC kernel, given certain input stimuli. The task of the SystemC kernel is thus to schedule, on a sequential processor, the multiple processes of the system in response to the events generated by the application. To make sure that concurrent processes use coherent inputs/outputs, the scheduler first runs ("evaluates") all the processes i that are ready to be executed, and only then updates their changes on the channels (see the figure of Algorithm 1). Together, the evaluate and update phases constitute a δ-cycle, referenced by a global counter δ all along the simulation. Multiple δ-cycles can occur at the same simulation time θ, which is very useful for modeling fully-distributed, time-synchronized computation (as in RTL) [9].

In the OSCI open-source kernel, the scheduler relies on a stack of "runnable processes", Runnable, that is emptied during the evaluate phase and then filled in again with the processes that are sensitive to any event triggered during the update phase (such as a "value-changed" event on a channel). If the stack is empty at the end of the update phase, the simulation time can be advanced to the earliest trigger time of all pending events. All the events with that same time are then triggered, and the processes that were sensitive to them are pushed onto the Runnable stack for another iteration. The simulation stops when the simulation time reaches the end time ΘE defined by the designer, or if there is no runnable process left at all (ΘE = ∞).

The figure of Algorithm 1 shows how this is implemented in the OSCI kernel: a loop on δ-cycles inside a loop on simulation time. Yet Algorithm 1 models this with a single loop, because an iteration of the δ-cycle loop is no more than an iteration of the outer loop where the simulation time has not changed. We base our parallel algorithms below on this second version.

[Figure of Algorithm 1: flow chart of the sequential OSCI scheduler. Runnable processes are popped from the Runnable stack and evaluated; the channel update triggers δ-events, which push newly sensitive processes back onto the stack; when no process is runnable any more, the simulation time is updated and timed events are triggered; if still no process is runnable, the simulation ends.]

Algorithm 1: OSCI SystemC kernel.

1   (δ ; θ) ← (0 ; 0.0)
2   while true do
        /* Evaluate phase */
3       forall i ∈ Runnable do evaluate(i)
        /* Update phase */
4       update()
5       δ ← δ + 1
        /* Termination of both δ-cycle and time loops, merged into one single loop */
6       θ ← compute_next_simulation_time()
7       if θ = ΘE then exit()
8       else trigger_events()        /* fill Runnable */
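
For readers who prefer code to pseudocode, here is a minimal C++ sketch of such a two-phase scheduler loop. It is only an illustration of Algorithm 1, not the OSCI sources; the types and helpers (Process, Kernel, compute_next_simulation_time(), trigger_events(), ...) are invented for this sketch.

#include <vector>

struct Process { virtual void evaluate() = 0; virtual ~Process() {} };

struct Kernel {
    std::vector<Process*> runnable;    // the Runnable stack
    unsigned long         delta = 0;   // delta-cycle counter
    double                theta = 0;   // simulation time
    double                theta_end;   // Theta_E, chosen by the designer

    void simulate() {
        while (true) {
            // Evaluate phase: run every process that is currently runnable.
            std::vector<Process*> ready;
            ready.swap(runnable);
            for (Process* p : ready) p->evaluate();

            // Update phase: commit channel writes and trigger delta-events,
            // which may push processes back onto 'runnable'.
            update();
            ++delta;

            // Merged termination test of the delta-cycle and time loops:
            // if delta-events made processes runnable again, the next time
            // is the current time and the loop iterates at the same theta.
            theta = compute_next_simulation_time();
            if (theta == theta_end) return;
            trigger_events();          // fills 'runnable' for the next iteration
        }
    }

    void   update()                        { /* commit pending channel values, notify delta-events */ }
    double compute_next_simulation_time()  { /* earliest pending event time, or theta_end */ return theta_end; }
    void   trigger_events()                { /* wake processes sensitive to events at 'theta' */ }
};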

2.2. Parallel SystemC Kernel

There are several approaches to SystemC parallelization. For instance, Savoiu et al. describe in [11] how to let the underlying OS manage the SystemC processes by creating a POSIX thread for each of them. But the context switches drastically reduce the simulation speed, even on powerful SMPs, where complex reorganizations of the users' designs are necessary to achieve acceptable speedups. Thimmapuram and Kumar propose in [13] a solution independent of any SystemC kernel, but it is only applicable to multiprocessor SoC designs whose partitioning over the computational nodes is fixed.

Our own choices for SystemC parallelization are fully discussed in [1]. They are mainly driven by the tight constraints of the SystemC semantics and a desire to be as user-friendly as possible (the target platforms are cheap clusters of workstations and the changes in the API are minimal). In our model, every computational node p runs a copy of the sequential scheduler, each simulating a subset of the application modules. Every local scheduler thus has its own δ-cycle counter δp, simulation time θp, and stack of runnable processes Runnablep. The partitioning of the processes amongst the N nodes is achieved statically by the user, through a new kind of sc_module: sc_node_module. An sc_node_module gathers all the sc_modules that must be simulated on one node. This hand-made static partitioning remains achievable because, as explained further in this paper, the high level of synchronization limits the number of computational nodes to a few tens.

The sc_modules have directed ports: they can be in, out, or sometimes both. A channel that connects two sc_modules (Section 2.1) is actually bound to some of their ports. When these sc_modules are assigned to two different nodes, the channel is duplicated on both nodes, and the copies are synchronized according to the directions of the ports. Based on these bindings, we define:
- Outputp, the set of all channels bound to an out port of the sc_node_module p: the data that p writes on these channels must be sent over the network;
- Toc, where c ∈ Outputp, the set² of recipient nodes for the data written on c;
- Inputp, the set of all channels bound to an in port of the sc_node_module p: the data that p needs to read must be received from the network;
- Fromc, where c ∈ Inputp, the set of nodes from which the data of c will be received.

With regard to semantics, the execution order of the processes is not pre-defined [9] at the δ-cycle level. But as explained in Section 2.1, the processes to be executed at δ-cycle δp depend on what happened at δp − 1. Hence all local schedulers have to synchronize at the end of every δ-cycle. In practice, as shown in the figure of Algorithm 2, this consists of a channel synchronization and a time synchronization.

From node p, the channel synchronization simply consists in sending the new data written on Outputp and receiving the data from Inputp. We insist on the fact that, most of the time, δ-cycle δp cannot start before p has received all the Inputp data written at δp − 1, because these data may push some processes into Runnablep and are likely to be read at δp. If they are not waited for, the simulation could be erroneous. In Algorithm 2, we use the following formalism for communications (the tag CH stamps channel data messages):
- send⟨CH,c,δ,dest,value⟩: send the value of channel c at δ-cycle δ to node dest;
- nb_recv⟨CH,c,δ,exp,value⟩: receive the value of channel c at δ-cycle δ from node exp; note that we use non-blocking receives for the correctness of the algorithm: they are completed at line 8 or cancelled at line 32.

As shown in Algorithm 2, the simulation time synchronization is based on a master/slave principle. At the end of every δ-cycle, every slave node sends to the master its next simulation time next_θp, the time of the next event to be triggered (l.16). The master waits for all these messages and computes the minimum, next_θ0. The result is sent back to all nodes that sent a next time different from θ0, the current time: since θ0 = θp ∀p and θ0 is a lower bound of next_θ0, a node p where next_θp = θp goes on running the current δ-cycle loop without needing any message from the master. Note that the master can save a little time if ever next_θ0 = θ0: in that case, after line 26, it sends θ0 to all waiting nodes, and at line 28, it can answer directly instead of stacking the requests.

2 A channel can connect more than 3 modules.
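
As an illustration of the Outputp, Toc, Inputp and Fromc bindings defined above, a node could represent them with two simple maps. This is a sketch with invented names, not the kernel's actual data structures.

#include <map>
#include <set>
#include <string>

using NodeId    = int;
using ChannelId = std::string;   // identifies a channel duplicated on several nodes

struct NodeBindings {
    // Output_p: for each channel written locally, To_c, the set of recipient nodes.
    std::map<ChannelId, std::set<NodeId>> output_to;
    // Input_p: for each channel read locally, From_c, the set of sender nodes.
    std::map<ChannelId, std::set<NodeId>> input_from;
};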

[Figure of Algorithm 2: flow chart of one local scheduler of the parallel kernel. Compared to the sequential kernel, a channel synchronization precedes the triggering of δ-events and the channel update, and a time synchronization precedes the triggering of timed events and the simulation-time update; both synchronizations exchange messages with all the other local schedulers on all the other computational nodes, and the simulation ends only when no node has any runnable process left.]

Algorithm 2: Centralized synchronous time computation.

1   (δp ; θp) ← (0 ; 0.0)
2   forall c ∈ Inputp do
3       forall q ∈ Fromc do nb_recv⟨CH,c,0,q,value⟩
4   while θp ≠ ΘE do
        /* Evaluate phase */
5       forall i ∈ Runnablep do evaluate(i)
        /* Update phase */
6       forall c ∈ Outputp do
7           forall q ∈ Toc do send⟨CH,c,δp,q,value⟩
8       waitall⟨CH⟩                                  /* complete all nb_recv posted */
9       update()
10      forall c ∈ Inputp do
11          forall q ∈ Fromc do
12              nb_recv⟨CH,c,δp+1,q,value⟩
        /* Termination of δ-cycle loop */
13      δp ← δp + 1
14      next_θp ← compute_next_simulation_time()
15      if p ≠ 0 then
16          send⟨END,δp,0,next_θp⟩
17          if next_θp ≠ θp then
18              recv⟨END,δp,0,next_θ0⟩
19              next_θp ← next_θ0
20      else                                         /* p = 0 */
21          nb_msg ← N − 1
22          Waiting ← ∅
23          while nb_msg > 0 do
24              recv⟨END,δ0,q,next_θq⟩
25              nb_msg ← nb_msg − 1
26              next_θ0 ← min(next_θ0, next_θq)
27              if next_θq ≠ θ0 then
28                  Waiting ← Waiting ∪ {q}
29          forall q ∈ Waiting do send⟨END,δ0,q,next_θ0⟩
30      θp ← next_θp
31      if θp ≠ ΘE then trigger_events()             /* fill Runnablep */
32      cancelall⟨CH⟩                                /* cancel all nb_recv uselessly posted */
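
To make the master/slave exchange of lines 15-29 concrete, here is a C++ sketch assuming an MPI communication layer and node 0 as the master; the δ-cycle stamp carried by the END messages is omitted, and the function names are invented.

#include <mpi.h>
#include <algorithm>
#include <vector>

static const int TAG_END = 1;

// Slave side (p != 0): send the local next time; wait for the master's answer
// only if the local next time differs from the current time (lines 15-19).
double slave_time_sync(double theta_p, double next_theta_p) {
    MPI_Send(&next_theta_p, 1, MPI_DOUBLE, 0, TAG_END, MPI_COMM_WORLD);
    if (next_theta_p != theta_p)
        MPI_Recv(&next_theta_p, 1, MPI_DOUBLE, 0, TAG_END, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    return next_theta_p;
}

// Master side (p == 0): gather every slave's next time, compute the minimum,
// and answer only the nodes whose next time differed from the current time
// (lines 20-29).
double master_time_sync(double theta_0, double next_theta_0, int nb_nodes) {
    std::vector<int> waiting;
    for (int i = 1; i < nb_nodes; ++i) {
        double     next_theta_q;
        MPI_Status st;
        MPI_Recv(&next_theta_q, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_END,
                 MPI_COMM_WORLD, &st);
        next_theta_0 = std::min(next_theta_0, next_theta_q);
        if (next_theta_q != theta_0) waiting.push_back(st.MPI_SOURCE);
    }
    for (int q : waiting)
        MPI_Send(&next_theta_0, 1, MPI_DOUBLE, q, TAG_END, MPI_COMM_WORLD);
    return next_theta_0;
}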

The end of the simulation is reached when the next time equals ΘE: all local schedulers know this value statically and can thus stop. Hence, the synchronization of the simulation time is sufficient to detect both the end of a δ-cycle loop and the end of the simulation. That is why we have historically used the tag END for the time synchronization messages:
- send/recv⟨END,δ,dest/exp,θ⟩: send/receive to/from dest/exp the local next time θ at δ-cycle δ.

The most obvious drawback of this algorithm is that the master node is a bottleneck. So far, the number of computational nodes is actually limited by the high synchronization level of the parallel simulation, but this should change in the medium term, and the centralized design could then become a major drawback. Furthermore, some nodes are likely to depend on simulation data computed by the master node. In real applications, the computation load can be extremely unbalanced over the processes, but also over the δ-cycles of a single node. Having the master node wait for the time synchronization to complete before sending new data can then waste a great part of the parallelism potential.

3. Relaxing Synchronization

3.1. Channel Synchronization

Algorithm 2 synchronizes all channels in the update phase. But the values of Outputp can be sent earlier, as soon as it is certain that they will not change within the same δ-cycle. Likewise, the values of a channel in Inputp need not be received at the update phase if no process is sensitive to changes on that channel: we can then delay the completion of the receive operation to the next read, or even to the next update phase if no process needs to read the value either. Unfortunately, the implementation is more complex than it sounds, because a node cannot send the data of a channel while the previous send operation on this channel has not yet completed: the receiver must read the message to free the sender. The relaxation is thus more or less limited to one δ-cycle. A cleverer implementation would use a message buffering policy on top of the underlying communication layer, but in practice this means a lot of development to integrate it into the existing kernel, and we do not plan to do so in the near future.
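
The lazy completion of channel receives can be pictured as follows. This is only a sketch of the idea under an assumed MPI-like layer; it ignores the send-side buffering constraint discussed above, and the type and member names are invented.

#include <mpi.h>

// One remote input channel on node p. The non-blocking receive posted for the
// next delta-cycle is completed as late as possible.
struct RemoteInChannel {
    double      value       = 0.0;               // payload (a double for simplicity)
    MPI_Request pending     = MPI_REQUEST_NULL;
    bool        has_pending = false;
    bool        has_sensitive_process = false;   // can a value change wake a local process?

    // Called when a process reads the channel during the evaluate phase.
    double read() { complete(); return value; }

    // Called at the update phase: only channels that may wake a process must be
    // up to date before the next delta-cycle; the others keep their receive
    // pending until the next read (or a later update phase).
    void on_update_phase() { if (has_sensitive_process) complete(); }

    void complete() {
        if (has_pending) { MPI_Wait(&pending, MPI_STATUS_IGNORE); has_pending = false; }
    }
};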

3.2. Decentralized Simulation Time Synchronization

The centralized design of our first algorithm is its major drawback. A simple decentralization approach would be a synchronization barrier with a "reduce" operation, but it would be the contrary of a relaxation. We propose here a decentralized version, where every node computes the next time θ locally.
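
For reference, the "barrier + reduce" variant dismissed above amounts to a single collective call per δ-cycle with an MPI-like layer; every node blocks until all nodes have contributed, which is precisely the global synchronization we want to relax.

#include <mpi.h>

// Fully synchronous alternative (not the approach of Algorithm 3):
// every node blocks at the end of every delta-cycle until the global
// minimum next time is known.
double naive_time_sync(double next_theta_local) {
    double next_theta_global;
    MPI_Allreduce(&next_theta_local, &next_theta_global, 1,
                  MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    return next_theta_global;
}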

Algorithm 3: Decentralized, partially synchronous time computation.

/*** Task 1: main flow ***/
1   (δp ; θp) ← (0 ; 0.0)
2   forall c ∈ Inputp do
3       forall q ∈ Fromc do nb_recv⟨CH,c,0,q,value⟩
4   Sp ← R
5   Waitingp ← ∅
6   for r = 0 to N − 1 do awaited_ansp[r] ← 0
7   while Sp ≠ F do
        /* Evaluate phase */
8       forall i ∈ Runnablep do evaluate(i)
        /* Update phase */
9       forall c ∈ Outputp do
10          forall q ∈ Toc do send⟨CH,c,δp,q,value⟩
11      waitall⟨CH⟩                                  /* complete all nb_recv previously posted */
12      update()
13      forall c ∈ Inputp do
14          forall q ∈ Fromc do nb_recv⟨CH,c,δp+1,q,value⟩
        /* Termination of δ-cycle loop */
15      Sp ← T                                                   /* CS1 */
16      δp ← δp + 1                                              /* CS2 ∧ CS1 */
17      next_θp ← compute_next_simulation_time()                 /* CS1 */
18      forall node q ≠ p do awaited_ansp[q] ← 1                 /* CS1 */
19      forall node q : (q, δp, next_θq) ∈ Waitingp do
20          send⟨END_ANS,δp,q,next_θp⟩
21          Waitingp ← Waitingp \ {(q, δp, next_θq)}             /* CS3 */
22          next_θp ← min(next_θp, next_θq)                      /* CS4 */
23          awaited_ansp[q] ← 0                                  /* CS4 */
24      if next_θp ≠ θp then
25          forall node q : awaited_ansp[q] = 1 do
26              send⟨END_REQ,δp,q,next_θp⟩
27          recv⟨END,me⟩
28      θp ← next_θp
29      if θp = ΘE then Sp ← F else Sp ← R
30      if Sp = R then trigger_events()                          /* fill Runnablep */
31      cancelall⟨CH⟩                                /* cancel all nb_recv uselessly posted */

/*** Task 2.1: END_ANS messages processing ***/
32  upon recv⟨END_ANS,δ,q,next_θq⟩                               /* δ ≤ δp, δ ≤ δq */
33      if (δq = δp) ∧ (Sp = T) ∧ (awaited_ansp[q] = 1) then     /* CS1 */
34          awaited_ansp[q] ← 0                                  /* CS4 ∧ CS1 */
35          next_θp ← min(next_θp, next_θq)                      /* CS4 ∧ CS1 */
36          if (next_θp = θp) ∨ (Σ_{r≠p} awaited_ansp[r] = 0) then send⟨END,me⟩   /* CS1 */

/*** Task 2.2: END_REQ messages processing ***/
37  upon recv⟨END_REQ,δq,q,next_θq⟩
38      if δp < δq then                                          /* CS3 ∧ CS2 */
39          Waitingp ← Waitingp ∪ {(q, δq, next_θq)}             /* id. */
40      else if (Sp = R) ∨ (δp > δq) then                        /* CS1 */
41          send⟨END_ANS,δq,q,θp⟩                                /* CS1 */
42      else
43          send⟨END_ANS,δp,q,next_θp⟩                           /* CS1 */
44          awaited_ansp[q] ← 0                                  /* CS4 ∧ CS1 */
45          next_θp ← min(next_θp, next_θq)                      /* CS4 ∧ CS1 */
46          if (next_θp = θp) ∨ (Σ_{r≠p} awaited_ansp[r] = 0) then send⟨END,me⟩   /* CS1 */

In Algorithm 3, a node goes on running its δ-cycle loop as long as it still has runnable processes. At the end of the loop, it computes its next time and sends it as a request END_REQ to all other nodes, which in turn answer their own value with an END_ANS message. To be able to answer requests while it is running the simulation, a node needs concurrent tasks for the processing of incoming messages. The coordination between the various tasks is ensured by two means: message passing and shared variables with critical sections. When a node with no more runnable processes has sent all its requests, it has to wait for tasks 2.1 and 2.2 to signal that all the necessary messages have arrived. This is modeled in Algorithm 3 by a self-message:
- send/recv⟨END,me⟩: send to / receive from another task the signal that the next time has been computed.

On top of that, tasks 2.1 and 2.2 (so called because they can be merged into one single flow of execution) use several shared state variables: δp, θp, Sp and Waitingp. Sp ∈ {R, T, F} marks the current phase of the simulation with regard to time synchronization: Running, Time synchronizing, or Finished. The set Waitingp stores all the requests that p cannot process at reception time, i.e. those stamped with a future δ-cycle δq (see l.39 and 19-23). Any change to such a variable can change the global state of the node. For the consistency of the algorithm, some entire blocks of instructions that must execute within a given state have to be protected by critical sections. They are mentioned as CS1 to CS4 in the comments of Algorithm 3: for instance, the sections from l.22 to 23, from l.34 to 35 and from l.44 to 45 are mutually exclusive.

Note that embedding the next time value into the request messages allows two minor optimizations. First, while answering a waiting request from node q, node p gets the value and does not need to send any request to q (l.22-23). Likewise, when a request is received from node q while we are waiting for an answer from it, it is useless to go on waiting (l.44-45); the actual answer will be dropped later, at l.33. These optimizations require an array awaited_ansp to keep track of which answers are awaited. It is another shared variable that the critical sections take care of.
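
The shared state manipulated by the concurrent tasks can be sketched as below. The names mirror the variables of Algorithm 3, but the code is only an illustration: a single mutex stands in for the finer-grained critical sections CS1-CS4, and the message handling is reduced to the Task 2.1 guard.

#include <algorithm>
#include <mutex>
#include <set>
#include <tuple>
#include <vector>

enum class Phase { R /*Running*/, T /*Time synchronizing*/, F /*Finished*/ };

struct TimeSyncState {
    std::mutex    cs;                  // stands for CS1..CS4, merged for simplicity
    unsigned long delta_p      = 0;    // local delta-cycle counter
    double        theta_p      = 0.0;  // local simulation time
    double        next_theta_p = 0.0;  // candidate next time, min-reduced on the fly
    Phase         S_p          = Phase::R;
    std::vector<int> awaited_ans;      // awaited_ans_p[q] == 1 while an END_ANS from q is expected
    // Waiting_p: requests stamped with a future delta-cycle, processed later (l.19-23).
    std::set<std::tuple<int, unsigned long, double>> waiting;
};

// Task 2.1: processing an END_ANS received from node q (lines 32-36).
void on_end_ans(TimeSyncState& s, int q, unsigned long delta, double next_theta_q) {
    std::lock_guard<std::mutex> lock(s.cs);
    if (delta == s.delta_p && s.S_p == Phase::T && s.awaited_ans[q] == 1) {
        s.awaited_ans[q] = 0;
        s.next_theta_p   = std::min(s.next_theta_p, next_theta_q);
        // If next_theta_p == theta_p, or no more answers are awaited,
        // signal the main flow (the send<END,me> of line 36).
    }
}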

[...] ports are connected to sc_signals of L-byte-long arrays of char. Each time the process is awakened (actually, at every δ-cycle), its behavior is to 1) perform nflop dummy floating-point divisions to emulate CPU load and 2) "increment" and copy the data from input to output. The order of the latter two operations can be inverted so that the computations overlap the channel synchronization. Indeed, the relaxation of the channel synchronization discussed in Section 3.1 would not allow this overlapping on its own. First, the write operation to the output channel occurs just before the update phase. Second, the recipient cannot delay the receive until the next read operation, because its process is sensitive to value changes on its input port: it must complete the receive at the update phase.

SC_NODE_MODULE( stage ) {
    sc_in< L-byte long array >  in;
    sc_out< L-byte long array > out;

    void dummy() {
        /* nflop divisions ... */
        out = in++;
    }

    SC_CTOR( stage ) {
        SC_METHOD( dummy );
        sensitive << in;
    }
};
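
For completeness, a hypothetical top level for such a pipeline could chain the stages with sc_signal channels as sketched below; payload stands for the L-byte array type of the ports, and all other names are invented.

#include <systemc.h>
#include <vector>

// 'payload' is a stand-in for the L-byte array type carried by the ports.
// SC_NODE_MODULE(stage) is assumed to be declared as in the listing above.

int sc_main(int, char**) {
    const int N_STAGES = 4;
    std::vector<stage*> stages;
    sc_signal<payload>  wire[N_STAGES + 1];   // one channel between consecutive stages

    for (int i = 0; i < N_STAGES; ++i) {
        stages.push_back(new stage(sc_gen_unique_name("stage")));
        stages.back()->in(wire[i]);           // input channel of stage i
        stages.back()->out(wire[i + 1]);      // output channel of stage i
    }
    sc_start(1000, SC_NS);                    // run the simulation for a bounded duration
    return 0;
}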