Distributed Operation Layer: Optimal Mapping of Parallel Applications onto Heterogeneous Multiprocessor Tile-Based Architectures
Iuliana Bacivarov1, Michael Beckinger2, Wolfgang Haid1, Kai Huang1, Lothar Thiele1
1 Computer Engineering and Networks Lab., Swiss Federal Institute of Technology (ETH) Zurich, Switzerland {bacivarov, haid, huang, thiele}@tik.ee.ethz.ch
2 Fraunhofer Institut fuer Digitale Medientechnologie, Ilmenau, Germany
[email protected]
Abstract
Modern multiprocessor embedded systems execute dozens of tasks on shared processors and handle their complex communications on shared communication networks. Traditional practices from the HW/SW codesign or general purpose computing domains can no longer cope with these complex systems. To overcome this problem, a framework called Distributed Operation Layer (DOL) is proposed that enables the efficient execution of parallel applications on multiprocessor HW platforms. The DOL offers two main services: high-level performance analysis and multi-objective mapping optimization. Moreover, the DOL defines the routines that automate the SW flow through automatic optimization of the mapping and automatic refinement of the SW layers. This paper presents the basic principles of the DOL and illustrates its efficiency for the design of a Wave Field Synthesis (WFS) algorithm.
1. Introduction
Multiprocessor embedded systems execute multiple tasks on shared processors and multiple communications between these tasks on shared communication networks. The demands on modern systems are continuously increasing: dozens of tasks have to run in parallel, causing complex interactions. The challenge is how to map the tasks and inter-task communications onto the available HW resources, and how to schedule them so that the underlying architecture is used efficiently. New applications are so complex that they cannot be designed without an automatic mapping strategy for tasks and inter-task communication. In addition, detailed knowledge about the performance of the underlying HW architecture is required. Current practices in HW/SW codesign or general purpose computing are no longer suitable. Numerous frameworks exist that try to fill the gap between application SW and HW architecture design, e.g. [3]-[5]. However, they are limited in scope, either targeting FPGA platforms or relying on their own specific programming languages. A significant effort in the SHAPES project [1][2] investigates the efficient execution of parallel applications on the multiprocessor HW while minimizing the effort required of the application programmer. In this context, the Distributed Operation Layer (DOL) is proposed. The DOL is a framework designed to map applications onto the underlying HW/SW environment. The key tasks of the DOL are: (1) minimization of the effort to map parallel applications onto the SHAPES multiprocessor HW platform and (2) integration of fast performance evaluation methods, enabling design decisions at a high level of abstraction. In the following sections, the basic concepts of the DOL are presented and the efficiency of the approach is demonstrated by using it for the design of a Wave Field Synthesis algorithm.
2. The DOL Environment
To achieve the goals stated above, appropriate models were chosen to describe the application on the one hand and the HW architecture on the other.
Application. The key requirements for the applications are: (a) support for parallel execution using coarse-grained task-level parallelism, (b) scalability with respect to user demands, (c) easy retargeting to different platforms, and (d) easy remapping to improve certain performance figures. Finally, standard imperative languages should be used for programming the applications. To meet these objectives, the process network model of computation was chosen, consisting of processes and SW channels between these processes (Figure 1).
Figure 1. Application model (a process network of processes T1 to T6 connected by SW channels c1 to c6)
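The process network model described above can be illustrated with a minimal sketch. It is not the DOL API: the `Channel` and `Process` classes, the round-robin `run` function, and the three-process pipeline in `demo` are hypothetical names chosen for illustration, and the bounded FIFO semantics are an assumption about how SW channels behave.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical SW channel: a bounded FIFO between two processes."""
    name: str
    capacity: int
    fifo: deque = field(default_factory=deque)

    def can_read(self) -> bool:
        return len(self.fifo) > 0

    def can_write(self) -> bool:
        return len(self.fifo) < self.capacity

    def read(self):
        return self.fifo.popleft()

    def write(self, token) -> None:
        self.fifo.append(token)


class Process:
    """Hypothetical process: fires when inputs hold data and outputs have room."""

    def __init__(self, name, fire, inputs=(), outputs=(), guard=lambda: True):
        self.name = name
        self.fire = fire          # computation: tokens in -> tuple of tokens out
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.guard = guard        # extra firing condition (e.g. source not empty)

    def ready(self) -> bool:
        return (self.guard()
                and all(c.can_read() for c in self.inputs)
                and all(c.can_write() for c in self.outputs))

    def step(self) -> None:
        tokens = [c.read() for c in self.inputs]
        results = self.fire(*tokens)
        for channel, token in zip(self.outputs, results):
            channel.write(token)


def run(processes) -> None:
    """Round-robin execution: fire ready processes until none can fire."""
    fired = True
    while fired:
        fired = False
        for process in processes:
            if process.ready():
                process.step()
                fired = True


def demo(samples):
    """Pipeline T1 -> c1 -> T2 -> c2 -> T3; T2 doubles every token."""
    c1, c2 = Channel("c1", 2), Channel("c2", 2)
    pending, received = deque(samples), []

    def produce():
        return (pending.popleft(),)

    def double(x):
        return (2 * x,)

    def consume(y):
        received.append(y)
        return ()

    t1 = Process("T1", produce, outputs=[c1], guard=lambda: bool(pending))
    t2 = Process("T2", double, inputs=[c1], outputs=[c2])
    t3 = Process("T3", consume, inputs=[c2])
    run([t1, t2, t3])
    return received
```

Calling `demo([1, 2, 3])` pushes three tokens through the pipeline. Computation lives in the `fire` functions and communication goes through the channel objects only, which is the separation that lets the same process code be bound to different resources.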
For scalability reasons, the DOL programming model separates the programming of application functionality (i.e., the application C code) from the application structure. There is also a clear separation between computation and communication in the application specification. These separations are handled by dedicated communication primitives. Another aspect of scalability is the use of repetitive structures to describe applications that have a high degree of regularity and must cope with increasing workloads. The repetitive structures can be described by using so-called iterators. This opens the possibility of parameterizable and hierarchical process networks.
Additionally, the DOL provides a set of tools for the functional simulation of the application. The simulator is written in SystemC and is generated automatically from the application code and the process network description. The simulator also includes an automatic profiling feature. This enables application programmers to implement, test, and check the performance of their applications.
Architecture. The DOL offers SW system designers a high-level view of the SHAPES HW architecture, relying on (1) execution resources (e.g. ARM RISC and mAgic DSP processors), (2) storage resources (e.g. on-tile memory, distributed external memory, shared external memory, or a FIFO memory implemented in HW), and (3) communication paths including different elements that physically interconnect processing resources (memories, peripherals and interconnects). The level of detail in the architecture performance specification also defines the level of detail of the mapping optimization.
Mapping. The relation between an application and the architecture is defined by the mapping (Figure 2). The mapping consists of (1) the binding of each process to an execution resource and of each SW channel to a communication path, and (2) scheduling strategies for resource sharing. During the mapping process, the DOL takes the parallelism of the process network into account in order to decide automatically on the optimal mapping.
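The notion of a binding and of automatically choosing a good one can be sketched as follows. This is a toy illustration under stated assumptions, not the DOL optimizer: the cost model (largest processor load plus weighted inter-processor traffic), the exhaustive search, and all names and numbers (`estimate`, `best_binding`, the workloads and speeds) are hypothetical. Scheduling strategies and the binding of SW channels to communication paths are left out.

```python
from itertools import product


def estimate(binding, workload, traffic, speed, link_cost):
    """Toy performance estimate for a binding (dict: process -> processor).

    Returns (largest processor load, cost of inter-processor traffic).
    Both metrics are illustrative assumptions, not the DOL analysis model.
    """
    load = {cpu: 0.0 for cpu in speed}
    for process, cpu in binding.items():
        load[cpu] += workload[process] / speed[cpu]
    comm = sum(volume * link_cost
               for (src, dst), volume in traffic.items()
               if binding[src] != binding[dst])
    return max(load.values()), comm


def best_binding(workload, traffic, speed, link_cost=1.0, weight=1.0):
    """Exhaustively search all bindings and return (cost, binding).

    Exhaustive search is only feasible for tiny examples; it stands in
    for whatever optimization strategy a real tool would use.
    """
    processes = sorted(workload)
    processors = sorted(speed)
    best = None
    for choice in product(processors, repeat=len(processes)):
        binding = dict(zip(processes, choice))
        load, comm = estimate(binding, workload, traffic, speed, link_cost)
        cost = load + weight * comm
        if best is None or cost < best[0]:
            best = (cost, binding)
    return best


# Hypothetical example: three processes, two processors.
WORKLOAD = {"T1": 6.0, "T2": 4.0, "T3": 2.0}      # abstract work units
TRAFFIC = {("T1", "T2"): 1.0, ("T2", "T3"): 1.0}  # data volume per channel
SPEED = {"ARM": 1.0, "DSP": 2.0}                  # work units per time unit

cost, binding = best_binding(WORKLOAD, TRAFFIC, SPEED, link_cost=0.5)
print(cost, binding)
```

With these numbers the search places T2 on the slower processor and T1 and T3 on the faster one: the load balance outweighs the cost of sending both channels across the interconnect. Changing `weight` shifts that trade-off, which hints at why mapping is naturally a multi-objective problem.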
Figure 2. Mapping specification (the application, a functional model, is mapped by binding and scheduling onto the architecture, an abstract description comprising processor #1 (e.g. ARM), processor #2 (e.g. DSP), and an interconnect (bus or NoC or …))
The DOL mapping optimization is an iterative process including (1) estimation and (2) optimization phases. To estimate the performance of a mapping, the DOL uses an internal analysis model whose parameters are obtained either from a low-level simulation or from an analytic model. For the mapping optimization, the DOL takes as input the application and architecture descriptions and, if applicable, additional mapping constraints. The DOL is then able to evaluate various combinations of processes running on different processors and communication structures, and to choose the optimal solution. The mapping information is finally handed over as input to the underlying SW layers, enabling the generation of the refined code for the target architecture. For more information about the SHAPES HW and SW architecture, please refer to [1][2].
3. Experiments: Implementing the Wave Field Synthesis Algorithm in DOL
Application. Wave Field Synthesis (WFS) [7] is a high-quality spatial sound reproduction technology. Sound sources can be simulated virtually by recreating their sound fields with an array of loudspeakers. Figure 3 shows an overview of the processes running in parallel on the SHAPES platform. On the left side, m sound sources SRC are connected to appropriate clusters of n WFS processes. Each output of a process cluster is connected to one summing node, which creates the loudspeaker (LS) signal. Each process is directly controlled by a control module over a point-to-point connection. Control data packets must be received simultaneously by the processes.
Figure 3. WFS application specification
Implementation using iterators. More than 500 processes were generated in a WFS example application by DOL iterators. The application contains two sound sources SRC, one control module, 256 WFS processes, 128 summing modules, and 128 loudspeaker output channels (LS). Two SRC processes generate two sine waves, which are then rendered by the WFS process network. Loudspeaker channel signals of 0.66 s length were generated and evaluated correctly using a functional SystemC simulation.
Profiling. The automatic profiler is part of the DOL design space exploration engine. It reports the buffer usage and the number of communications per communication channel; future versions will also report the estimated runtimes of processes on the underlying architecture. For example, buffer utilization is only 50% (a data filling level of 128 bytes). The data throughput from each SRC to WFS process is 192 Kbyte/s.
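The way repetitive structures produce the large WFS process network can be sketched with ordinary loops standing in for DOL iterators. This is an illustration only: the function `build_wfs_network`, the process and channel names, and the exact channel topology (one WFS process per source and loudspeaker pair, one control channel per WFS process) are assumptions derived from the description above, not the DOL iterator syntax.

```python
def build_wfs_network(num_sources=2, num_speakers=128):
    """Generate process and channel lists for a WFS-like network.

    Assumed topology: every source feeds one WFS process per loudspeaker,
    the WFS processes for loudspeaker j feed summing node SUM_j, and the
    control module has a point-to-point channel to every WFS process.
    """
    processes = ["CONTROL"]
    channels = []  # tuples of (channel name, writer, reader)

    # One iteration per sound source.
    for i in range(num_sources):
        processes.append(f"SRC_{i}")

    # One iteration per loudspeaker: summing node plus output process.
    for j in range(num_speakers):
        processes += [f"SUM_{j}", f"LS_{j}"]
        channels.append((f"sum_ls_{j}", f"SUM_{j}", f"LS_{j}"))

    # Nested iteration: one WFS process per (source, loudspeaker) pair.
    for i in range(num_sources):
        for j in range(num_speakers):
            wfs = f"WFS_{i}_{j}"
            processes.append(wfs)
            channels.append((f"src_wfs_{i}_{j}", f"SRC_{i}", wfs))
            channels.append((f"ctl_wfs_{i}_{j}", "CONTROL", wfs))
            channels.append((f"wfs_sum_{i}_{j}", wfs, f"SUM_{j}"))

    return processes, channels


processes, channels = build_wfs_network()
print(len(processes), "processes,", len(channels), "channels")
```

With the default parameters the process count is 1 + 2 + 128 + 128 + 256 = 515, which matches the "more than 500 processes" reported for the example application. The channel count depends on the assumed topology. Changing the two parameters rescales the whole network, which is the benefit of describing regular applications with iterators.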
4. Summary
This extended abstract gives an overview of the DOL design methodology. The main aim was to find an optimal mapping of an application onto a target parallel architecture. More than 500 processes of a WFS application were generated by DOL iterators. Loudspeaker channel signals of 0.66 s length were generated and evaluated correctly using a functional SystemC simulation. Application data were profiled: buffer utilization (50%), data throughput (192 Kbyte/s), and, in future versions, the estimated runtimes of processes. These figures serve as decision inputs for the mapping optimization tool.
5. References
[1] Pier Paolucci et al., “SHAPES: a tiled scalable SW/HW architecture platform for embedded systems”, CODES+ISSS, Seoul, Korea, Nov. 2006.
[2] SHAPES web site, www.shapes-p.org
[3] S. A. Edwards and O. Tardieu, “Shim: a deterministic model for heterogeneous embedded systems”, EMSOFT, New York, USA, pp. 264-272, ACM Press, 2005.
[4] F. Balarin et al., “HW/SW Co-Design of Embedded Systems. The POLIS Approach”, MA, USA: Kluwer Academic Publishers, 1997.
[5] C. Brooks et al., “Heterogeneous Concurrent Modeling and Design in Java”, vol. 1, Tech. Rep. UCB/EECS-2007-7, Univ. of California, Berkeley, Jan. 2007.
[6] Y. Jin et al., “An Automated Exploration Framework for FPGA-Based Soft Multiprocessor Systems”, CODES+ISSS’05, ACM Press, pp. 273-278, Sept. 2005.
[7] M. Boone et al., “Spatial sound-field reproduction by wave-field synthesis”, Journal of the Audio Engineering Society, 43(12), pp. 1003-1012, 1995.