Compilation for heterogeneous SoCs: bridging the gap between software and target-specific mechanisms. Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, Jérôme Martin, Henri-Pierre Charles

To cite this version: Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, Jérôme Martin, Henri-Pierre Charles. Compilation for heterogeneous SoCs: bridging the gap between software and target-specific mechanisms. Workshop on High Performance Energy Efficient Embedded Systems - HiPEAC, Jan 2014, Vienna, Austria. 2014.

HAL Id: hal-00936924 https://hal.inria.fr/hal-00936924 Submitted on 27 Jan 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Compilation for heterogeneous SoCs: bridging the gap between software and target-specific mechanisms

Mickaël Dardaillon, Kevin Marquet, Tanguy Risset
Université de Lyon, Inria, INSA-Lyon, CITI-Inria, F-69621, Villeurbanne, France
[email protected]

Jérôme Martin
CEA, LETI, Minatec campus, F-38054, Grenoble, France
[email protected]

ABSTRACT
Current application constraints push for higher computation power at reduced energy consumption, driving the development of increasingly specialized SoCs. In the meantime, these SoCs are still programmed in assembly language to exploit their specific hardware mechanisms. Since hardware development constraints bring specialization, and hence heterogeneity, it is essential to support these new mechanisms from high-level programming languages. In this work, we use a parametric data flow formalism to abstract the application from any hardware platform. On this basis, we propose to contribute to the compilation of target-independent programs on heterogeneous platforms. These developments are threefold, with 1) the support of hardware accelerators for computation using actor fusion, 2) the automatic generation of communications on complex memory layouts, and 3) the synchronization of distributed cores using hardware mechanisms for scheduling. The code generation is illustrated on a heterogeneous SoC dedicated to telecommunications.

Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications—Data-flow languages; D.3.3 [Programming Languages]: Processors—Retargetable compilers

Henri-Pierre Charles
CEA, LIST, Minatec campus, F-38054, Grenoble, France
[email protected]

1. INTRODUCTION
Throughout its history, processor development has been driven by Moore's law, but also by technology constraints. The power wall, whereby increasing transistor counts no longer improved instruction-level parallelism effectively and frequency growth halted, was answered by the advent of multi-core processors. We now enter the dark silicon era [8], which fixes a limit on the number of transistors that can be powered on at a given time. To answer this new limitation, chip designers specialize parts of their designs for application-specific constraints (e.g. video decoding, voice recognition), which leads to more and more heterogeneous Systems on Chip (SoC) [3]. We now find hardware acceleration not only in computation, but also in communication and control, making it ubiquitous; all of it must be programmed efficiently.

From the software point of view, programming each new kind of architecture efficiently is a renewed challenge. The multiplication of cores requires explicit parallelism in the application, with solutions ranging from imperative concurrent models (e.g. Pthreads, MPI) to approaches like the data flow model of computation (MoC). The emergence of heterogeneity brings a new constraint: each platform requires specific instructions in order to take advantage of its specialized parts.

Our problem is to program these new heterogeneous platforms efficiently. Efficiency must be understood both in terms of raw computing performance and in terms of the time and difficulty needed to write new programs or adapt existing programs to new platforms. To reach this objective, a high-level language is necessary, current targets being too complex to be programmed in assembly language. Another constraint is platform abstraction, both to obtain a portable implementation and to reduce the knowledge of platform mechanisms required from the programmer. A further problem is that the applications targeted by these platforms add both performance and timing constraints. As an example, the LTE-Advanced telecommunication protocol requires 40 GOPS and a latency of less than 2 µs [5]. To achieve this performance target, software optimizations can be applied both at compilation time and at runtime. Runtime optimizations can reach high throughput by leveraging runtime knowledge, at the cost of an initial overhead. This overhead is usually amortized when time constraints allow it, but that is clearly not possible within our latency requirements.
To answer the high-level representation and parallelism constraints, we propose to use a parametric data flow MoC. In this model, data and task parallelism are exposed to the compiler. Moreover, this MoC permits many static analyses and associated optimizations to match the constraints of our target application.

In this paper, we propose compilation methods that do not exist in current state-of-the-art data flow compilers. Our contribution is threefold:

• Platform-independent language primitives thanks to the use of actor fusion.


• Compilation of parametric inter-core communications.


• Code generation for distributed scheduling targeting specialized controllers.

Figure 1: Data and processing flows example in Magali (from [7])

The remainder of the paper is organized as follows: we describe our target demonstrator as well as related work on heterogeneous SoC compilation in Section 2 to motivate our work; the compilation MoC and language are introduced in Section 3; contributions are described in Section 4; results sustaining our approach are presented in Section 5, before the conclusion.

2. MOTIVATION AND RELATED WORK
To better understand the characteristics of heterogeneous SoCs targeted by our compilation flow, we look at the CEA Magali platform [6], before reviewing related work in heterogeneous programming.

2.1 Heterogeneous SoC example: Magali

The Magali chip [6] is a System on Chip dedicated to physical layer processing of OFDMA radio protocols, with 3GPP LTE-Advanced as its reference application. It includes heterogeneous computation hardware with very different degrees of programmability, from configurable blocks to DSPs programmable in C. As an example, one reconfigurable block performs both an FFT and a deframing (removing some of the resulting data). This operator preserves only the significant data, thus reducing the data transmission time, but it is platform specific and needs to be abstracted away.

Communications between blocks use a 2D-mesh Network-on-Chip (NoC). All data communications are programmed statically using a credit/data mechanism between source and destination, called ICC on the input side and OCC on the output side. One major difficulty when programming this platform is to guarantee the consistency of all communications between all blocks, which makes manual writing a daunting task for non-trivial applications. The example in Fig. 1 illustrates this on a toy application with 4 cores. In this example, core A sends 30 data samples to core C using OCC 0. The OCC configuration for sending data needs to know which ICC it addresses and how many data samples it sends. Likewise, ICC 0 on core C needs to know how many credits to send and to which OCC. On this simple application, the programmer has to guarantee the coherency of 10 configurations, which is error prone.

The main configuration and control of the chip is done by an ARM CPU. Magali also offers distributed control features, enabling sequences of computations to be programmed at core level. Distributed Configuration and Communication Controllers (CCC) support program sequences, with two-level loops and automatic program memory caching. More details can be found in [7]. These controllers are limited in scope and platform specific, but essential to use Magali efficiently. They are illustrated in Fig. 1: the distributed controller of core C is programmed to repeat a first configuration twice, before switching to a second configuration. Using the same distributed control mechanism, communication sequences are also programmed for each input/output.

Looking at this example platform, we see that computation, communication and control are all hardware accelerated and must be supported by the programming flow. In the next section, we review some significant works in the domain of heterogeneous SoC programming from the target perspective, starting with general solutions before focusing on data flow programming.

2.2 Related work

For an embedded software programmer, the easiest way to program a heterogeneous platform is to use an imperative language (generally C) together with threads to express parallelism. This approach has been used to program both heterogeneous and homogeneous parallel platforms. For instance, the different units of the BEAR SDR platform [15] are programmed using C and Matlab code. The ExoCHI [19] programming environment and the Merge [13] framework (based on ExoCHI) aim at easing the programming of heterogeneous platforms while achieving good performance. The proposed solution extends OpenMP with intrinsic functions and dynamically maps the software onto available resources. Similarly, OpenCL [12] can be used to support heterogeneous platforms. The main limitation of these approaches is their lack of abstraction, as all hardware mechanisms must be programmed explicitly. Even with a high-level representation, the programmer still needs precise knowledge of the platform's specific resources in order to write the required platform-specific program. Numerous research works argue for a paradigm shift and propose to program waveforms using data flow languages. These languages rely on a data flow Model of Computation (MoC) where a program is represented as a directed graph G = (V, E). An actor v ∈ V represents a computational module or a hierarchically nested subgraph. A directed edge e ∈ E represents a FIFO buffer from its source actor S to its destination actor D. Data flow graphs follow a data-driven execution: an actor v can be

executed (fired) only when enough data samples are available on its input edges. When firing, v consumes a certain number of samples from its input edges and produces a certain number of samples on its output edges.


Figure 2: Representation of the balance between provability and expressivity in data flow computation models.

Many data flow-compliant programming models have been proposed for specific applications; they are illustrated in Fig. 2. Synchronous Data Flow (SDF) means that the number of tokens necessary for an actor to fire is known at compile time. In this case, static scheduling of actors can be performed and the size of the buffers between actors can be bounded. In Dynamic Data Flow (DDF), the numbers of data samples consumed and produced by an actor at each firing can vary dynamically at runtime, and can even be 0 in order to provide more programming flexibility. As a drawback, theoretical analysis capabilities are reduced. Between synchronous and dynamic data flow formalisms, a wide range of models has been proposed, e.g. Cyclo-Static Data Flow (CSDF) [2] and Schedulable Parametric Data Flow (SPDF) [9], looking for a trade-off between the ability to statically analyze programs and the expressivity of the language. ΣC [11] is a proposal to program waveforms using an extension of C. The corresponding MoC is more expressive than SDF thanks to non-deterministic extensions, but still allows some static analyses to be performed, such as bounding memory usage. However, it does not allow dynamic behavior of actors. MAPS [4] is also based on a C extension and uses a dynamic data flow MoC. Dynamicity is supported through dynamic mapping, with high-level operators mapped onto accelerators at runtime. We can also mention work on the Magali platform using the KPN MoC [14], where the mapping is done on an abstract architecture and platform to reduce platform dependency. One common pattern in these approaches is the use of the C language or a derivative to represent computation in a familiar manner, while using high-level concepts such as threads or actors to convey parallelism.
Platform-specific operators are supported by libraries and APIs, with some environments such as ExoCHI allowing operators to be split to fit the platform granularity. The recent progress of significant works such as ExoCHI and MAPS testifies to the importance of research in heterogeneous SoC programming. However, these approaches use dynamic MoCs, which are well supported by multi-core processors but fall outside the scope of our platform. We propose to work with a parametric data flow MoC to obtain the desired expressivity while keeping analyzability. Starting from this premise, we put forward a new compilation flow and runtime for this MoC.

3. FRAMEWORK

3.1 Model of computation

In this work, we chose to start from a data flow MoC to harness its inherent parallelism and analyzability. Moreover, the need for verifiable yet flexible data flow MoCs recently led to the appearance of two new MoCs: Scenario-Aware Data Flow [18] and Parametric Data Flow [9]. Fradet and Girault identify a subclass of the latter called Schedulable Parametric Data Flow (SPDF), for which the schedulability of the data flow graph can still be assessed statically. We chose to experiment with this MoC, but most of our contribution remains valid for other analyzable data flow models. Moreover, SDF being the subset of SPDF without parameters, our work and the resulting compiler also apply to SDF. To introduce SPDF, a program is represented in Fig. 3 with four data flow actors named A, B, C and D. As in a classical data flow graph, the integers on the arcs represent the number of samples produced or consumed by the actor at each execution. In SPDF, this number can be a symbolic parameter, whose value is produced by an actor (e.g. the set p[1] in the left actor of Fig. 3).


Figure 3: A simple schedulable parametric data flow graph [9]; p and q are parameters instantiated at execution.

One interesting feature of SPDF is the ability to generate one schedule per actor, which is repeated at each iteration of the graph. The data flow graph being parametric, so is the number of firings, and the generated schedule is a quasi-static schedule. As an example, the schedule for actor B in Fig. 3 is (pop p; (B^p; push q)^2), which reads as: get parameter p, then repeat twice the following: execute the core code of B p times, then provide parameter q (computed in the core code of B). The reader is invited to refer to [9] for implementation details.

3.2 Language

The language we propose is based on C++. It consists of a set of classes used to describe a parametric data flow graph. Fig. 4 illustrates the language on a small example. Each actor has a single compute method that represents the execution of one iteration of the actor. The code of this method is written in C++ and uses push/pop intrinsics to send and receive data, as well as set/get for parameters. An intrinsic API is also provided for each application domain; it includes common, platform-agnostic operations such as the FFT. Choosing the C/C++ language for the core code of the actors presents many advantages: it allows the reuse of legacy code and of highly optimized tools such as C compilers, does not require learning a new language, and permits easy simulation and functional validation. Moreover, the support of a general-purpose language for describing the graph structure greatly simplifies the development of complex applications.

void MIMO::compute() {
  [...]
  for (i = 0; i