Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators

Alessandro Capotondi, University of Bologna ([email protected])
Germain Haugou, STMicroelectronics ([email protected])
Andrea Marongiu and Luca Benini, University of Bologna / ETH Zurich ({a.marongiu, luca.benini}@unibo.it)
Abstract
Many modern high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a powerful general-purpose multicore host processor is coupled to a manycore accelerator. The host executes legacy applications on top of standard operating systems, while the accelerator runs highly parallel code kernels within those applications. Several programming models are currently being proposed to program such accelerator-based systems, OpenCL and OpenMP being the most relevant examples. In the near future it will be common to have multiple applications, coded with different programming models, concurrently requiring the use of the manycore accelerator. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of OpenMP and OpenCL kernels. The runtime supports spatial partitioning of the manycore, where clusters can be grouped into several “virtual” accelerator instances. Our runtime design is modular and relies on a “generic” component for resource (cluster) scheduling, plus “specialized” components which efficiently deploy generic offload requests into an implementation of the target programming model’s semantics. We evaluate the proposed runtime system on a real heterogeneous system, the STMicroelectronics STHORM development board.

Categories and Subject Descriptors D.1 [Programming Techniques]: Parallel Programming

Keywords Parallel Programming Models, Embedded Heterogeneous Computing, OpenMP, OpenCL

1. Introduction

With the advent of general-purpose graphical processing units (GPGPUs), almost a decade ago, heterogeneous computing has become pervasive in the high-performance computing (HPC) domain. This has implied the rise of a plethora of offload-based parallel programming models, from initial vendor-specific ones such as NVIDIA CUDA, to the standardization efforts of consortia such as Khronos, with their Open Computing Language, or, more recently, the Heterogeneous System Architecture (HSA) Foundation and its associated programming model.

We are currently witnessing a phenomenon of convergence between the HPC and the embedded domain. High-end embedded systems are increasingly being designed as heterogeneous systems-on-chip (SoCs), where a powerful general-purpose multicore host processor is coupled to a manycore accelerator. The host executes legacy applications on top of full-fledged, virtualization-ready operating systems such as Linux or Android, while the accelerator can be exploited for highly parallel code kernels within those applications. While general consensus on a “standard” programming methodology for heterogeneous embedded systems is far from being reached, programming models and tools similar to those proposed in the HPC domain are currently being adopted to program embedded manycores. OpenCL can be considered a de-facto standard, as it offers both an environment for specifying kernel offloading from host programs and a model for parallelism specification within the manycore. However, OpenCL provides a rather low-level programming style, which is difficult to understand and manage efficiently for non-expert developers. As a consequence, higher-level programming interfaces are being actively researched. Probably the most relevant example is the latest OpenMP v4.0 specification, which addresses the former lack of support for heterogeneous platforms. OpenMP now supports a kernel-based offload mechanism which enables targeting devices other than the classical SMP. Accelerator-based program development thus benefits from the familiar and easy-to-use OpenMP directives.

In this paper we envision a near-future scenario, where multiple applications may concurrently require the use of the manycore accelerator (e.g., audio and video coding, graphics, etc.). Such a diverse workload can derive from a realistic usage pattern of a next-generation smartphone or tablet device, where the user enables numerous tasks at the same time, and where the underlying software stack on the host processor leverages virtual machines for efficient resource partitioning. From this scenario two observations derive: i) applications may come from very different developers or legacy code-bases and may have different requirements, so it is extremely likely that they will be coded using different programming models; ii) virtualization techniques imply that a single application is agnostic to HW resource sharing. As a consequence, methodologies that enable efficient sharing of accelerator resources, supporting multiple programming abstractions and the associated execution models, will be increasingly important for future heterogeneous SoCs.

In this paper we present a runtime system for a cluster-based embedded manycore accelerator, optimized for the concurrent execution of OpenMP and OpenCL kernels. Due to the lack of full-fledged operating system services, and to the constrained execution environment, enabling multiple programming models to live inside the same accelerator requires dedicated middleware on top of native hardware abstraction layers. The fine-grained data parallelism typically found in embedded applications requires such dedicated
middleware to provide streamlined, low-cost primitives for the implementation of programming model semantics, as well as a fast mechanism to context-switch from one programming model to another. This makes it possible to fully exploit the massive HW parallelism provided by embedded manycore accelerators, without losing efficiency in the multi-layered software stacks typically required to support sophisticated programming models. Our runtime environment supports spatial partitioning of the manycore, where clusters can be grouped into several “virtual” accelerator instances. The design is modular and relies on a low-level runtime component for resource (cluster) scheduling, plus “specialized” components which efficiently deploy offload requests into specific programming model execution semantics. We evaluate the proposed runtime system on a real heterogeneous system, the STMicroelectronics STHORM development board, characterizing both the cost of the proposed runtime system and the efficiency achieved in exploiting the available HW parallelism when multiple applications – coded with OpenMP and OpenCL – are concurrently deployed on the manycore. We compare our findings with the OpenMP and OpenCL implementations available on another high-end heterogeneous SoC, the TI Keystone II platform. Results show that our runtime offers an almost two orders of magnitude smaller kernel offload cost for OpenMP, and one order of magnitude smaller kernel offload cost for OpenCL. Using our runtime we can reach up to 93% performance efficiency compared to the theoretical optimal solution.

The rest of the paper is organized as follows. In Section 2 we describe STHORM, our target platform. In Section 3 we describe the main components of our hybrid runtime. In Section 4 we provide an experimental evaluation of our heterogeneous parallel programming model support. Section 5 discusses related work, and Section 6 concludes the paper.
Figure 1. 4-cluster STHORM SoC.
Figure 2. STHORM heterogeneous system.

2. Evaluation Platform
As a concrete embodiment of the heterogeneous manycore SoC template targeted in this work we use the first STMicroelectronics STHORM development board. STHORM, previously known as Platform 2012 [11], is a manycore organized as a globally asynchronous, locally synchronous (GALS) fabric of multi-core clusters (see Figure 1). A STHORM cluster contains (up to) 16 STxP70-v4 Processing Elements (PEs), each of which has a 32-bit RISC load-store architecture, with dual issue and a 7-stage pipeline, plus a private instruction cache (16KB). PEs communicate through a shared multi-ported, multi-bank tightly-coupled data memory (TCDM, a scratchpad memory). The interconnection between the PEs and the TCDM was explicitly designed to be ultra-low latency. It supports up to 16 concurrent processor-to-memory transactions within a single clock cycle, provided that the target addresses belong to different banks (one port per bank). The ENCore also provides a hardware synchronizer and an interrupt generator. As in other cluster-based manycore systems (e.g., the Kalray MPPA [6]), each STHORM cluster has an additional core used as cluster controller (CC). The STHORM fabric is composed of four clusters, plus a fabric controller (FC), responsible for global coordination of the clusters. The FC and the clusters are interconnected via two asynchronous networks-on-chip (ANoC). The first ANoC is used for accessing a multi-banked, multi-ported L2 memory. The second ANoC is used for inter-cluster communication via the L1 TCDMs and to access the off-chip main memory (L3 DRAM). Note that all the memories are mapped into a global address space, visible from every PE. L3 access requests are transported off-chip via a synchronous NoC link (SNoC). The first STHORM-based heterogeneous system is a prototype board based on the Xilinx Zynq 7000 FPGA device (see Figure 2), which features an ARM Cortex A9 dual-core host processor running at 699 MHz, main (L3) DDR3 memory,
plus programmable logic (FPGA). The ARM subsystem on the ZYNQ is connected to an AMBA AXI matrix, through which it accesses the DRAM controller. To grant STHORM access to the L3 memory, and the ARM system access to the STHORM L1/L2 memories, a bridge is implemented in the FPGA, which has three main functions. First, it translates STHORM transactions from the SNoC protocol to the AXI protocol (and ARM transactions from AXI to SNoC). Second, it implements address translation logic in the remap address block (RAB). This is required to translate addresses generated by STHORM into virtual addresses as seen by the host application, and vice versa. Indeed, the host system features paged virtual memory and MMU support, while STHORM operates on physical addresses; thus, the RAB acts as a very simple IO-MMU. Third, it implements a synchronization control channel by conveying interrupts in the two directions through the FPGA logic and dedicated off-chip wires. The FPGA bridge is clocked conservatively at 40 MHz in this first board. This is currently the main system bottleneck, limiting the bandwidth from the SoC to the L3 DDR to around 300 MB/s. (Like any realistic heterogeneous SoC design, STHORM is clearly intended for same-die integration with the host, with an orders-of-magnitude faster bridge and larger memory bandwidth.)
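To make the RAB's role concrete, the following is a minimal C sketch of the kind of window-based address remapping described above. The structure layout, field names, and window count are our own assumptions for illustration; they do not reflect the actual RAB hardware interface.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical remap window: one contiguous host-virtual range
 * mapped onto one contiguous STHORM physical range. */
typedef struct {
    uint32_t phys_base;   /* base address as seen by STHORM   */
    uint32_t virt_base;   /* base address as seen by the host */
    uint32_t size;        /* window size in bytes             */
} rab_window_t;

#define RAB_NB_WINDOWS 8  /* assumption: small, fixed window table */

/* Translate a STHORM physical address into a host virtual address.
 * Returns 0 if no window matches (a real RAB would raise an error). */
static uint32_t rab_phys_to_virt(const rab_window_t win[RAB_NB_WINDOWS],
                                 uint32_t phys)
{
    for (size_t i = 0; i < RAB_NB_WINDOWS; i++) {
        if (phys >= win[i].phys_base &&
            phys <  win[i].phys_base + win[i].size)
            return win[i].virt_base + (phys - win[i].phys_base);
    }
    return 0;
}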
3. Multi-Programming Model Runtime System

In this section we describe the design of our multi-programming model runtime system (MPM-RTS), capable of supporting concurrent offloads to the manycore accelerator from OpenCL and OpenMP programs.
Figure 3. Heterogeneous multi parallel programming model software stack overview.
The key idea is to provide spatial partitioning of the clusters, enabling the execution of OpenMP and OpenCL kernels on a subset of clusters, which we call a virtual accelerator instance. The MPM-RTS is organized into two logical layers: the generic runtime and the specific runtimes for the supported parallel programming models. Figure 3 shows the MPM-RTS organization. The specific runtimes are built on top of the generic runtime system. This hierarchical approach enables the two programming models to share all the features provided by the generic runtime. The generic runtime also plays the role of platform access interface, and provides all the features to create, destroy, and manage specific runtime instances among clusters. Since the targeted manycore template does not execute an operating system, all our code is developed on top of a hardware abstraction layer (HAL). On top of that, the generic runtime implements basic services such as thread management, memory allocation, and dynamic code loading. The following sections provide a detailed description of the different components of our MPM-RTS. Note that even if in some places we discuss the deployment of these layers on top of STHORM-specific features (cluster controllers - CCs - and the fabric controller - FC), the structure of our runtime is not specific to such an architecture, and every SW component can be executed by any processor in a multi-core cluster.

3.1 Generic runtime

The generic runtime provides all the shared services to schedule and execute the offloads. It is organized in two main components, or controllers. a) A global controller, which is in charge of i) enqueuing and collecting new offload commands from/to the host side; ii) scheduling offloads to the clusters; iii) switching from one programming model to another on a target cluster (i.e., loading a different specific runtime); iv) collecting the offload command terminations. It also offers a memory allocator and dynamic linking services. The global controller can run on any core (including the host); on the STHORM prototype we exploit the available hierarchical core organization and map this global controller on the Fabric Controller processor. b) A local (per-cluster) controller, which is in charge of implementing a run-to-completion task scheduler. Each cluster has its own instance of the scheduler, managed by its local controller; any core within a cluster can act as controller, and on the STHORM prototype we map this local controller on the cluster controller processor. The scheduler provides the basic abstraction of a thread, on top of which the specific OpenMP and OpenCL runtimes map parallel teams and work-groups, respectively. Both controllers can be implemented with a simple FIFO command queue. On the STHORM prototype we use the available hardware mailboxes: there is one mailbox for each cluster, and one for the fabric controller. Each mailbox is responsible for propagating
a command (using a lightweight IRQ). The associated controller interprets the command and executes the appropriate service routines. Note that this communication model does not imply moving data around; each message is usually very small (a few words), as we only pass a pointer to a shared data structure that can be accessed by any core in the platform.

Support data structures. To provide these features, the generic runtime needs to hold a persistent view of the platform. This view is stored in two support data structures: fabric_rt_pm_desc_t and dev_cluster_t.
typedef struct {
  void *data;
  void (*entry[FABRIC_RT_PM_NB_REQ])(void *, int, void *);
  dev_cluster_t *firstFreeCluster;
  void (*startCluster)(void *context, int cid);
  void (*stopCluster)(void *context, int cid);
} fabric_rt_pm_desc_t;

Listing 1. Fabric controller programming model support data structure.

The first structure is specific to each programming model (see Listing 1) and contains the main information used to manage it. The generic runtime maintains one of these data structures for each programming model. The structure contains a data pointer to the private programming model configuration, and an array of entry function pointers. This array is used as a dynamic jump table by the generic runtime to dispatch the task callbacks related to each command supported by the programming model. The structure also contains two additional function callbacks, startCluster and stopCluster, used to start and stop a programming model runtime on a given cluster. These fields are initialized with model-specific callback pointers during the boot phase of the OpenCL and OpenMP runtimes. The second structure is a linked list containing one element for each physical cluster, used to track the current cluster status (in particular, whether the cluster is already associated to a virtual accelerator of some programming model, or whether it is ready to execute). The structure also tracks clusters previously allocated to the current programming model and ready to execute (firstFreeCluster). Using those clusters is more efficient than initializing the environment on a free cluster, as no boot or switch cost is paid. At boot time no virtual accelerator has been created yet, so no cluster is allocated (all clusters are considered free).
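The paper does not show the per-cluster bookkeeping structure; the following is a minimal sketch of what dev_cluster_t might look like, consistent with the description above. Field and enumerator names are our assumptions, not the actual STHORM runtime API.

/* Hypothetical per-cluster state, as described in the text. */
typedef enum {
    CLUSTER_FREE,       /* not bound to any programming model        */
    CLUSTER_READY,      /* initialized for a programming model, idle */
    CLUSTER_RUNNING     /* currently executing an offloaded kernel   */
} dev_cluster_state_t;

typedef struct dev_cluster_s {
    int                   cid;     /* physical cluster id                        */
    dev_cluster_state_t   state;
    void                 *owner;   /* fabric_rt_pm_desc_t it is bound to,
                                      or NULL when CLUSTER_FREE                  */
    struct dev_cluster_s *next;    /* next element of the linked list            */
} dev_cluster_t;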
Command management and scheduling. Among the various MPM-RTS features, we unified the data structures used for analogous semantics in the two programming models. This is particularly visible in the kernel offload procedure (command enqueue, in OpenCL nomenclature). Kernel enqueue
commands, both in OpenCL and in OpenMP, share the same message header. The header contains all the generic information for a single offload, such as: i) the PID of the host thread, used as identifier for the reply callbacks to the host; ii) the number of requested resources, in terms of clusters; iii) the programming model type; iv) the kernel binary pointer. The generic runtime processes each command (OpenCL or OpenMP) in order, with first-come-first-served scheduling, according to the arrival time on the command queue. Before pushing a ready command to the specific runtime queue, the generic runtime also performs the virtual accelerator creation, allocating a subset of clusters for the offload. The virtual accelerator is created using a best-effort policy, and the allocation is based on the following criteria:

1. the list of free clusters already allocated to the requesting programming model is checked;

2. if the number of ready clusters (previously initialized for the current programming model) is smaller than the request, the list of non-allocated clusters is traversed;

3. if the number of clusters is still smaller than the request, the remaining clusters are stolen from the list of free clusters allocated to another programming model.

The most interesting cases are the last two. Before starting the new specific programming model runtime, the generic runtime triggers the stopCluster callback on all the clusters stolen from other programming models' lists, and the startCluster callback of the new programming model on all the non-initialized clusters. This enables concurrent execution, in time and in space, of multiple kernels from different programming models. Note that the partitioning unit is a cluster: only a full cluster, or integer multiples thereof, can be associated to a specific programming model. A C sketch of the allocation policy described above is given below.
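The following is a minimal C sketch of the generic offload command header and of the best-effort cluster allocation policy just described. All names (structure fields, helper functions, constants) are our own assumptions for illustration; they are not the actual MPM-RTS interface.

#define MAX_CLUSTERS 4   /* assumption: STHORM fabric size */

typedef enum { PM_OPENCL, PM_OPENMP } pm_type_t;

/* Generic offload command header shared by OpenCL and OpenMP (sketch). */
typedef struct {
    int        host_pid;        /* host thread PID, used for reply callbacks */
    int        nb_clusters;     /* number of clusters requested              */
    pm_type_t  pm;              /* programming model type                    */
    void      *kernel_binary;   /* pointer to the kernel binary              */
} offload_cmd_t;

/* Hypothetical helpers over the dev_cluster_t list (not shown here). */
int take_free_clusters_of_pm(pm_type_t pm, int n, int *cids);
int take_unallocated_clusters(int n, int *cids);
int steal_free_clusters_from_other_pm(pm_type_t pm, int n, int *cids);

/* Best-effort allocation: returns the number of clusters actually granted,
 * filling cids[] with their ids. Mirrors criteria 1-3 in the text. */
int alloc_virtual_accelerator(const offload_cmd_t *cmd, int cids[MAX_CLUSTERS])
{
    int granted = 0;

    /* 1. clusters already initialized for this programming model and free */
    granted += take_free_clusters_of_pm(cmd->pm, cmd->nb_clusters - granted,
                                        cids + granted);

    /* 2. non-allocated clusters (pay the startCluster cost) */
    if (granted < cmd->nb_clusters)
        granted += take_unallocated_clusters(cmd->nb_clusters - granted,
                                             cids + granted);

    /* 3. steal free clusters from another programming model (pay stopCluster
     *    on the old model plus startCluster on the new one) */
    if (granted < cmd->nb_clusters)
        granted += steal_free_clusters_from_other_pm(cmd->pm,
                                                     cmd->nb_clusters - granted,
                                                     cids + granted);

    /* The command may be granted fewer clusters than requested;
     * it is delayed only when granted == 0 (see the bullets below). */
    return granted;
}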
Figure 5. Accelerator execution trace for an OpenCL kernel offload on the STHORM prototype, from the global controller wake-up to the beginning of kernel execution.
Also note that:

• no preemption is supported: clusters can be stolen by other virtual accelerators only when they have terminated their previous command;

• the number of clusters allocated to a command can be smaller than the requested one. If the number of free clusters is smaller than the requested one, the kernel is executed anyway, with fewer clusters. This behavior is well supported by both the OpenCL and OpenMP RTEs;

• only if no clusters at all are available is the command delayed.

At this point the generic runtime can provide the specific runtime with the mapping of the virtual accelerator associated to the command.

Dynamic linker. Each kernel binary must be dynamically linked, at run time, to all the specific programming model function calls invoked inside the kernel. This operation is done only once for each binary, using the dynamic linker module integrated in the generic runtime. The dynamic linker resolves all the RTE calls inside the binary with the physical address of the associated function. The role of the dynamic linker is particularly relevant for OpenMP kernels: the OpenMP API is based on function calls to the OpenMP runtime inserted inside the kernel (e.g., parallel start, and so on), which cannot be resolved at compile time, but only when the kernel is dynamically loaded.
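As an illustration of the linking step just described, here is a minimal C sketch that patches unresolved RTE calls in a loaded kernel binary using a symbol table. The relocation record format and all names are assumptions made for this example, not the format actually used by the MPM-RTS.

#include <stdint.h>
#include <string.h>

/* Hypothetical symbol table entry: RTE function name -> physical address. */
typedef struct {
    const char *name;
    uint32_t    addr;
} rte_symbol_t;

/* Hypothetical relocation record for each unresolved RTE call site. */
typedef struct {
    const char *name;     /* referenced RTE symbol              */
    uint32_t    offset;   /* call-site offset inside the binary */
} kernel_reloc_t;

/* Resolve every RTE call site once, right after the binary is loaded. */
int link_kernel(uint8_t *binary,
                const kernel_reloc_t *relocs, int nb_relocs,
                const rte_symbol_t *symtab, int nb_syms)
{
    for (int r = 0; r < nb_relocs; r++) {
        int resolved = 0;
        for (int s = 0; s < nb_syms; s++) {
            if (strcmp(relocs[r].name, symtab[s].name) == 0) {
                /* Patch the call site with the physical address. */
                memcpy(binary + relocs[r].offset, &symtab[s].addr,
                       sizeof(uint32_t));
                resolved = 1;
                break;
            }
        }
        if (!resolved)
            return -1;   /* unresolved RTE symbol */
    }
    return 0;
}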
3.2 The OpenCL runtime

The OCL runtime is distributed between the host and the accelerator. On the host, the OCL host part implements the OpenCL API and manages the overall scheduling of the command graph (NDRanges and tasks). On the fabric controller, the OpenCL global controller parses the ready commands and schedules the work-groups (WGs) on the available clusters. On the local controller, the OpenCL runtime manages the cluster resources and schedules the work-items on the available cores. Finally, on the processing elements (PEs), the OpenCL runtime executes the work-items (WIs) and implements the kernel primitives (barriers, asynchronous copies, etc.). The main difference between a GPGPU OpenCL implementation and the STHORM prototype OpenCL implementation is the number of threads managed by the runtime. GPGPUs usually run a massive number of threads, in the order of thousands. In this implementation of OpenCL the number of threads is limited to one per PE, in order to avoid context switch overhead; thus the runtime does not support more than 16 WIs within a WG. Moreover, the cores do not run in lock-step fashion, which is the common behavior on GPGPUs: each PE can execute a different instruction flow without affecting application performance. This characteristic enables the execution of more complex OpenCL task graphs than the typical use of OpenCL on GPGPUs. The OpenCL memory model in this implementation is mapped as follows: stacks and local data are mapped in the TCDM; global data is mapped in L3. Figure 4 provides a graphical comparison between the standard OpenCL device model and our STHORM implementation.

Figure 4. STHORM OpenCL memory hierarchy compared to the standard OpenCL memory model.

3.2.1 OpenCL cluster start and stop

The OCL specific runtime running on the cluster contains two main data structures: one for the local controller, which contains information about work-item scheduling and resource management; and another one for the processing elements, which contains all the data structures needed by the kernel primitives (the WI descriptors), such as the addresses used to access the hardware for barriers and asynchronous copies. Starting an OCL cluster mainly consists in allocating and initializing these two structures. In particular, the part
for the PEs contains the structures describing the WI being executed. The rest of the cluster start procedure consists in a few other hardware allocations and initializations (such as atomic counters) needed by the specific runtime. During this phase the PEs are left free in their original state and do not execute any code, as their start and stop is fully managed by the kernel execution. The stop operation only consists in releasing all the allocated resources. Nothing is required to free the PEs, since the stop only happens when nothing is being executed on the cluster, and the PEs are always fully freed after a kernel execution.

3.2.2 OpenCL kernel execution

When the OCL global controller receives from the host a kernel command ready for execution, the command is parsed and enqueued to the generic runtime, together with scheduling information (e.g., the number of work-groups). Once a cluster is available and started, the OCL global controller is asked to schedule a work-group on it. Even if several work-groups can be scheduled, they are sent one by one, as work-groups are independent in the OpenCL execution model. In order to pipeline the cluster start and the WG preparation, the WG structure is permanently allocated during the boot of each cluster. This wastes memory when OCL is not used on the cluster, but decreases the offload latency when switching from another runtime to the OpenCL runtime. Once the WG has been received on the cluster controller, the controller parses the WG information and allocates the resources. This operation consists in splitting the memory space between the stacks, the kernel local and private data, and other structures, depending on the WG characteristics. The final stage consists in starting the PEs on the kernel entry point. The PEs are allocated and woken up using the generic runtime. Once they are woken up, they read the entry point information (stack, function and arguments), unpack the kernel arguments and start the kernel function. The execution timeline for a full execution of OpenCL kernels on our prototype, also involving the OCL cluster start, is shown in Figure 5. Once all the WIs terminate, a notification task is triggered on the local cluster controller. This task handles the end of the WG by releasing the resources, and then sends a notification back to the OCL global controller, which handles the WG termination and notifies the generic runtime. If this was the last WG of the command, a notification is sent back to the OpenCL host; otherwise the generic runtime can trigger the execution of another WG on the cluster. An execution trace example is provided in Figure 6.

Figure 6. Accelerator execution trace for an OpenCL kernel execution termination (at left) and OpenCL stop (at right).

3.3 The OpenMP runtime

Like its OpenCL counterpart, the OpenMP (OMP) runtime is implemented on the accelerator in a distributed fashion. The OpenMP heterogeneous support is based on our previous work [10]; we use the same toolchain, but we completely reworked the runtime in order to share the accelerator with other programming models, using the generic low-level runtime as middleware. On the host side, the runtime implements all the extensions to the standard runtime which are needed to support the heterogeneous system, such as the primitives to enqueue kernels and wait for kernel termination, the allocation of kernel binaries, and the allocation of contiguous memory regions shared between the host and the accelerator. The OMP global controller implements the logic to parse OpenMP kernels and enqueue them to the lower generic runtime. It also controls the start and stop of the clusters, which is needed because OpenMP implements a shared-memory parallel model, and the global controller takes care of the synchronization of the OMP cluster instances. Each OMP local cluster controller allocates all the shared support variables and is in charge of booting the local thread pool. The multithreading support is based on a fixed, non-preemptive thread pool, mapped 1:1 on the PEs, created by the cluster when the OpenMP runtime is started [10]. Each cluster has its own pool of threads, but the pools of different clusters can cooperate. In our evaluation prototype the pool of threads is docked using optimized hardware barriers based on the STHORM hardware synchronizer, and it stays alive as long as the cluster is associated to OpenMP. Since a single kernel execution can use a subset of the clusters, we developed a mechanism that remaps the thread pools coming from different clusters onto a single virtual global thread pool. We call it Virtual Shared Memory Accelerator (VACC). Through the VACC, the OpenMP runtime is able to map any cluster partition onto a plain thread pool. Each VACC is univocally associated to an offload and represents the multithreading environment where the OpenMP kernel will be executed (see Figure 7). A sketch of this remapping is given after the figure caption.

Figure 7. A Virtual Shared Memory Accelerator mapped on physical clusters 3, 2, and 0, with a flat thread pool distributed among all clusters.
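To make the VACC remapping concrete, the following is a minimal C sketch of how a flat, global OpenMP thread id could be derived from (and mapped back to) the per-cluster thread pools of a virtual accelerator. The structure and function names are our own assumptions for illustration.

#define PES_PER_CLUSTER 16   /* assumption: STHORM cluster size */

/* Hypothetical VACC descriptor: the ordered list of physical clusters
 * assigned to one offload, e.g. {3, 2, 0} as in Figure 7. */
typedef struct {
    int nb_clusters;
    int cluster_id[4];        /* physical ids, in VACC order */
} vacc_t;

/* Flat (global) thread id of thread 'local_tid' running on the
 * 'vcluster'-th cluster of the VACC. */
static inline int vacc_global_tid(const vacc_t *v, int vcluster, int local_tid)
{
    (void)v;
    return vcluster * PES_PER_CLUSTER + local_tid;
}

/* Inverse mapping: from a flat thread id back to the physical cluster
 * and the local thread id inside that cluster's pool. */
static inline void vacc_locate(const vacc_t *v, int global_tid,
                               int *phys_cluster, int *local_tid)
{
    *phys_cluster = v->cluster_id[global_tid / PES_PER_CLUSTER];
    *local_tid    = global_tid % PES_PER_CLUSTER;
}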
3.3.1 OpenMP cluster start and stop

When the generic runtime invokes a cluster start, the OMP local cluster controller allocates in the local TCDM the shared structures needed by the thread pool. In particular, it allocates the stacks for each thread and all the OpenMP-specific descriptors, such as team and workshare descriptors. Figure 8 shows the full timeline for a kernel execution in OpenMP which also involves a cluster start (OpenMP cluster start). The OMP local controller also boots each of the processing elements within the cluster using the PE-level task creation. The first PE becomes the local master thread, which is in charge of waiting for the complete boot of all the local PEs. Only at this point does the local master notify its local cluster
controller that the thread pool has booted, so that the local cluster controller can notify the OMP global controller that it is ready to execute. The stop procedure consists in the deallocation of all the data structures created by the CC and the destruction of the pool. Figure 9 shows the execution timeline of the stop procedure. Since the OpenMP local controller scheduler is implemented as a run-to-termination task, it ensures that no other programming model can execute until the stop is concluded, with no need for synchronization.

3.3.2 OpenMP kernel execution

The OpenMP kernel execution starts with a GOMP_offtask() command issued by the host. The OMP global controller, as depicted in Figure 8, enqueues the command to the generic runtime, which is in charge of allocating the clusters requested by the offload and starting them according to the platform status. The cluster partitioning returned by the generic runtime defines the VACC for the current OpenMP kernel. The VACC thread master synchronizes itself with all the other unlocked local masters, creating the root OpenMP team for the current kernel. After this handshake phase, the VACC thread master is free to jump into the body of the kernel and execute all the OpenMP directives present inside the kernel, such as PARALLEL team creation, FOR loops, etc. The kernel termination is triggered by the termination of all the VACC local masters. When a kernel terminates, the VACC thread master destroys the root team associated to the OpenMP kernel and releases the involved local masters, to notify the termination to the OpenMP FC. Each cluster termination also releases the cluster from the VACC, enabling the execution of another kernel or the stop of the programming model. When the last VACC local master thread terminates, the FC sends a message to the OpenMP host runtime, which redirects the termination notification to the host OpenMP application. At the host application level, the termination is collected through a GOMP_offtask_finish(). Figure 9 shows an example timeline for a kernel termination.

Figure 8. Execution trace example for an OpenMP kernel offload which implies a cluster start/stop and uses a proc_bind construct and nested parallelism. (Our OpenMP implementation supports nested parallelism and affinity directives for efficient cluster-aware parallel region mapping.)

Figure 9. Execution trace for an OpenMP kernel execution termination (at left) and OpenMP stop (at right).

4. Experimental Evaluation

Our experimental evaluation is organized in two sections. The first presents a detailed characterization of our runtime costs, while the second presents the performance results enabled by our solution on a realistic computer vision use case.

4.1 Runtime Cost Characterization

To measure the net cost of the kernel offload support we enqueue a set of computationally “empty” kernels to the accelerator. The cost of an “empty” kernel is measured and subtracted from the full offload cost measured from the host side. The resulting cost is the net kernel offload overhead. Rather than enqueueing a single kernel and waiting for its completion, we offload 24 asynchronous empty kernels. Figure 10 shows the average round-trip time of these kernels, i.e., the time (measured from the host) needed to enqueue a kernel and wait for its completion. Since in this experiment there is no useful computation, this time represents the mere overhead for offloading a kernel and synchronizing upon its completion with the accelerator. Figure 10 shows the offload overheads for both supported programming models: OpenMP (OMP) and OpenCL (OCL). The round-trip cost is plotted for one to four clusters. For each programming model we also provide a worst-case analysis (WCA), which is the cost of the same number of offloads when a switch from one programming model to the other is implied at each request. A complete (round-trip) OpenMP offload costs from 20 µs (for single-cluster execution) to 25.5 µs (for four-cluster execution). The same numbers grow to 25 µs and 32 µs in the worst case. The cost to start and stop the OpenMP RTE upon a programming model switch is 6 µs when all four clusters are involved. The cost of an OpenCL round-trip offload varies from 38 to 45 µs when increasing the number of involved clusters from one to four. The same numbers grow to 39 and 47 µs in the worst case. OpenCL round-trip offloads are costlier than OpenMP ones. This is a consequence of the OpenCL parallel execution model.

Figure 10. Average kernel round-trip time in microseconds for the OpenMP (OMP) and OpenCL (OCL) RTEs, for an increasing number of requested clusters. The chart also reports the worst-case kernel round-trip time (WCA) for OMP and OCL.
Figure 11. Runtime offload overhead percentage with respect to kernel granularity (in cycles). The RTE overhead percentage is computed as overhead = Offload_CPUcycles / (Offload_CPUcycles + Kernel_CPUcycles).

Figure 12. Hybrid OpenMP runtime directive costs measured with the EPCC micro-benchmark suite (v3.0) on our target platform (STHORM) and on the TI Keystone II.
During the kernel enqueue procedure, the host is responsible for setting up the entire parallel environment for the work-items executing on the accelerator, even when launching an empty kernel. On the contrary, not all parallelization costs are paid at offload time in OpenMP. Indeed, OpenMP supports dynamic parallelism creation during kernel execution. For this reason, while OpenMP offloads are cheaper, additional overheads are encountered depending on the use of OpenMP directives within an offloaded kernel. To assess the importance of such additional costs we discuss further characterization experiments in Section 4.1.1. Focusing on the difference between the “optimal” and the worst case (WCA), the overheads introduced by our hybrid programming model support among clusters are smaller for OpenCL. The OpenCL RTE needs less synchronization and fewer shared support variables, which in turn results in more lightweight start and stop routines compared to OpenMP. As a second experiment we aim at measuring how these absolute overhead numbers impact performance depending on the granularity of the offloaded kernel. As a term of comparison we consider the Texas Instruments Keystone II EVMK2H platform, one of the few currently available heterogeneous platforms that support both the OpenMP v4.0 and the OpenCL programming models [17]. The architectural differences between this platform and STHORM in terms of accelerator core count and clock frequency are significant; thus, for the comparison to be meaningful, we use effective CPU cycles as a metric. Figure 11 shows the impact of an offload for increasing kernel granularity. The plot shows four trend lines: the two solid lines represent our OCL and OMP RTEs running on STHORM, the two dashed lines represent the TI Keystone II counterparts. Focusing on our solution, for a kernel larger than 1 millisecond the overhead introduced by the OpenMP offload is less than 3% of the total execution time, and for OpenCL this cost is less than 5%. For kernels larger than 10 milliseconds the offload overheads are fully amortized (below 1%). Focusing on the comparison with the TI platform, our hybrid runtime has an almost two orders of magnitude smaller offload cost for OpenMP, and one order of magnitude smaller offload cost for OpenCL. A worked instance of the overhead formula from Figure 11 is given below.
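As a sanity check of the percentages quoted above, here is the overhead formula from Figure 11 instantiated for a hypothetical single-cluster OpenMP offload. The 20 µs round-trip cost is taken from Section 4.1, while the 1 ms kernel duration (and hence the resulting percentage) is an illustrative number, not a measurement from the paper.

% Net offload overhead as a fraction of total execution time
\mathrm{overhead} = \frac{T_{\mathrm{offload}}}{T_{\mathrm{offload}} + T_{\mathrm{kernel}}}
                  = \frac{20\,\mu\mathrm{s}}{20\,\mu\mathrm{s} + 1000\,\mu\mathrm{s}}
                  \approx 1.96\% < 3\%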
4.1.1 OpenMP runtime cost characterization

To provide a standard evaluation of the cost of our OpenMP support we use the EPCC OpenMP micro-benchmark suite, version 3.0 [20]. The EPCC benchmark provides a full set of micro-tests that stress all the OpenMP directives in order to analyze their cost. The EPCC suite is provided as a set of standalone applications, which we modify to wrap the various micro-tests into an offload block to the accelerator. The charts in Figure 12 show the measured overheads, expressed in thousands of cycles, of four of the most used OpenMP directives. On the X-axis we report the number of threads. For the STHORM implementation we consider 16 to 64 threads (where 64 is the maximum number of available accelerator cores). For the TI implementation we consider 2 to 8 threads (where 8 is the maximum number of available accelerator cores). Considering our implementation, the cost to spawn a parallel region ranges from 6000 cycles (for 16 threads) to 13000 cycles (for 64 threads). The same numbers for the TI implementation range from 13000 cycles (for 2 threads) to 16000 cycles (for 8 threads). Linear regression shows that the cost of the TI construct grows 2.72 times faster than ours. The other constructs that we tested are the combined PARALLEL FOR directive, FOR, and BARRIER. Similar results hold for the comparison between our implementation and TI's: PARALLEL FOR for the TI implementation grows 2.70 times faster than ours, FOR 10.84 times, and BARRIER 11.79 times.

4.2 Runtime Efficiency: Computer Vision Use-Case

In this section we test our runtime efficiency on a realistic use case. For this experiment we run a set of computer vision applications on the same 640x480 (8-bit) input video stream sampled from a camera. The applications are listed in Table 1. Two of the applications (FAST, ROD) are parallelized with OpenMP, the remaining two (FD, ORB) are parallelized with OpenCL. We measure the execution time for 60 frames and provide the per-frame average execution time in three different configurations:
• Isolation: each application runs alone, without interference from other applications. This scenario is representative of a runtime that supports only a single offload at a time.

• Static: the applications that are parallelized with the same programming model are executed concurrently on the accelerator. This scenario is representative of a runtime that supports only a single programming model, but concurrent applications.

• Hybrid: all the applications are executed concurrently on the accelerator. This scenario is representative of our hybrid solution, where the runtime supports multiple programming models and multiple applications at the same time.

Table 1. Benchmark application set summary.
Mnemonic | Application Name | PM | #clusters | Description
FAST | FAST | OpenMP | 1 | Corner detection (FAST), mainly used for feature extraction
ROD | Removal Object Detection | OpenMP | 4 | Removed-object detection based on the NCC algorithm
FD | Face Detection | OpenCL | 1 | Face detection based on the Viola-Jones algorithm
ORB | Object Recognition | OpenCL | 4 | ORB object recognition

Table 2. EPCC configuration.
Parameter | Configuration
Repetitions | 20
Tests Time | 1 millisecond
Delay Time | 0.1 microsecond

The results are presented in Table 3. Due to the dynamic allocation of the clusters, our Hybrid runtime outperforms both the Static and the Isolation solutions. With the Hybrid configuration the time to compute all four FAST, ROD, FD, and ORB kernels (per-frame average, over 60 frames) is 116.70 ms, which is far less than the sum of the execution times of all the kernels in the Isolation and Static configurations. In the Isolation configuration the platform is underutilized every time a kernel requires less than the total number of clusters (no other kernels can be scheduled at the same time). In the Static solution any number of kernels from the same programming model can be scheduled at any time on the platform, thus improving the overall occupation of the hardware. The Hybrid solution further increases the benefits of this approach: the system is capable of scheduling any kernel from any programming model concurrently on the platform, maximizing the occupation of the available resources (clusters). If an application requires more clusters than are available, the runtime immediately allocates the maximum number of available clusters, rather than stalling the offload request until the required resources are freed. This behavior is well supported both in OpenMP and OpenCL.

Table 3. Benchmark applications: measured execution time.
Runtime | Applications | Per-Frame Time
Isolation | FAST | 56.42 ms
Isolation | ROD | 37.40 ms
Isolation | FD | 33.14 ms
Isolation | ORB | 91.76 ms
Static | FAST+ROD | 61.46 ms
Static | FD+ORB | 94.90 ms
Hybrid | FAST+ROD+FD+ORB | 116.70 ms

Based on these measurements we can calculate the ideal execution time for this use case and check how far our runtime performance is from the optimum. Let:

Isolation := set of isolation execution times    (1)
Static := set of static execution times    (2)
nbFrames := number of frames to be computed    (3)
T_{boot} := single runtime boot time    (4)

We can define three execution time baselines:

• Isolation baseline: the average execution time per frame is equal to the sum of all the application execution times in isolation:

T_{isolation} = \frac{\sum_{T_i \in Isolation} (T_i \times nbFrames)}{nbFrames}    (5)

• Static baseline: the execution time per frame is based on the sum of the static execution times for each programming model, multiplied by the number of frames to be computed, plus the overhead to boot a different programming model runtime, which depends on a switch rate:

T_{static} = \frac{\sum_{S_i \in Static} (S_i \times nbFrames) + T_{boot} \times nbFrames \times switch\%}{nbFrames}    (6)

This baseline refers to a case where the platform supports multiple programming models, but the associated runtimes can only execute in a mutually exclusive manner on the accelerator. In addition, switching from one programming model to another implies the full runtime boot cost. This is the most typical case for existing embedded programming model runtime implementations (e.g., the TI Keystone II).

• Ideal baseline: the optimal execution time, leading to the maximum utilization of the platform without any restrictions on the number of clusters requested. To compute the optimal cluster allocation we use a standard linear programming formulation of the problem:

\min f(x_{FAST}, x_{ROD}, x_{FD}, x_{ORB}) = \frac{T_{FAST}}{x_{FAST}} + \frac{T_{ROD}}{x_{ROD}} + \frac{T_{FD}}{x_{FD}} + \frac{T_{ORB}}{x_{ORB}}    (7)

subject to
x_{FAST} + x_{ROD} + x_{FD} + x_{ORB} \le 16    (8)
x_{FAST} \ge 1    (9)
x_{ROD} \ge 1    (10)
x_{FD} \ge 1    (11)
x_{ORB} \ge 1    (12)

We want to minimize the sum of the execution times of all the applications, each computing 60 frames. Under the hypothesis of ideal speedup, this is given by the execution time using a single cluster divided by the number of allocated clusters. We consider 16 resource units, given by the number of clusters (four) multiplied by the number of applications (four); theoretically, we could execute the applications one by one, allocating four clusters to each. The minimum number of clusters assigned to an application is one. A small verification of the resulting optimum is sketched below.
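The continuous relaxation of the problem above has a closed-form optimum (by Lagrange multipliers, x_i is proportional to sqrt(T_i)), which can be used to cross-check the solver output reported in Table 4. The sketch below is our own verification code, not part of the MPM-RTS; the T_i values are those reported in Table 4.

#include <stdio.h>
#include <math.h>

/* Single-cluster execution times (ms), as reported in Table 4. */
static const char  *name[] = { "FAST", "ROD", "FD", "ORB" };
static const double T[]    = { 56.42, 138.38, 33.14, 275.28 };

int main(void)
{
    const double budget = 16.0;   /* 4 clusters x 4 applications */
    double sum_sqrt = 0.0, fmin = 0.0;

    for (int i = 0; i < 4; i++)
        sum_sqrt += sqrt(T[i]);

    /* Optimal continuous allocation: x_i = budget * sqrt(T_i) / sum_sqrt */
    for (int i = 0; i < 4; i++) {
        double x = budget * sqrt(T[i]) / sum_sqrt;
        printf("%-4s x = %.2f\n", name[i], x);
        fmin += T[i] / x;
    }
    /* Equivalently, fmin = sum_sqrt^2 / budget, about 108.3 ms,
     * close to the 108.32 ms reported in Table 4. */
    printf("f_min = %.2f ms\n", fmin);
    return 0;
}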
full-fledged, posix-compliant OSes, which makes this approach not applicable to other manycores. In the embedded domain one of the most complete supports for acceleration sharing between multiple programming models is the one used by TI on the heterogeneous SoC Keystone II [19]. The SoC integrates a quad core ARM A15 with 8 DSPs C66x. The SoC fully support the new OpenMP v4.0 specification and the OpenCL programming model [17]. Our experimental results target this platform as a reference for comparison. Similar to our approach, on the accelerator side a bare-metal runtime supports both OpenMP and OpenCL. However, compared to our solution the current implementation by TI lacks the capability for concurrent execution of applications. Multiple host applications cannot use the accelerator at the same time, even if they use the same programming model. Our hybrid runtime, in contrast, supports multiple offloads from multiple host applications at same time, coded using different programming models. In literature other solutions exist to allow multiple programming models to use a programmable accelerator. In some cases source to source compilation is used to transform applications that use different programming model APIs to a unique runtime system supported by the architecture. This is the typical approach used to support OpenMP on GPGPUS. An example is the support for OpenMP v4.0 on Nvidia GPUs by Liao et al.[8]. The authors use the ROSE source to source compiler [14] to transform the offload OpenMP API to Nvida CUDA. Another similar approach is used by Elangovan et al. [3], which provide a full framework based on OmpSS [2] that can incorporate OpenCL and CUDA kernels to target GPGPUs devices. Other examples are provided by Seyong Lee et al. [7] which propose a compiler and an extended OpenMP interface used to generate CUDA code. Becchi et al. [1] developed a software runtime for GPU sharing among different processes on distributed machines, allowing the GPU to access the virtual memory of the system. Ravi et al. [15] studied a technique to share GPGPUs among different virtual machines in cloud environments. Other optimizations that improve the dynamic management of GPU programming interface are presented by Pai et al. [12] and Sun et al.[18], but they consider only the native programming model interface, while our approach enable the utilization of multiple programming models. Moreover, the context is very different, as all these works target high performance systems, where the size of the considered parallel workloads is such that very high overheads can be tolerated, unlike the fine-grained parallelism typically available on the embedded manycores targeted in this work. MERGE is a heterogeneous programming model from Linderman et al. [9]. The MERGE framework replaces current ad-hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores. Compared to our solution, MERGE does not use a standard parallel programming model interface, nor allows the co-existence of multiple runtime systems to improve the resource utilization of the underlying HW platform. Lithe [13] is a runtime system for parallel programming models, working as resource dispatcher. Compared to our solution, Lithe works on top an Operating System, thus supporting preemption and global dynamic scheduling of all the resources among the programming models. 
This kind of scheduling requires standard OS support for shared memory systems, which are typically lacking in embedded manycore accelerators. Moreover, the composition of several legacy SW layers (OS, middleware, threading libraries) implies a cost in time and space (i.e., memory footprint) that is not affordable in the embedded domain.
Table 4. LP ideal execution time solution (min f = 108.32 ms).
i | T_i | X_i
FAST | 56.42 ms | 2.85
ROD | 138.38 ms | 4.64
FD | 33.14 ms | 2.18
ORB | 275.28 ms | 6.30

Figure 13. Use-case efficiency of the different baselines and of our hybrid runtime with respect to the ideal solution, for an increasing number of frames and for different frequencies of static runtime switches.

Finally, Figure 13 presents the efficiency of the designed use case for all the baseline scenarios and for our multi-programming model runtime with respect to the ideal solution. Our runtime reaches an efficiency of 93% with respect to the ideal solution. It outperforms the efficiency of the best-case static baseline (static 0%, where there is a single runtime switch from OpenMP to OpenCL) by 1.3x, and the isolation baseline by 1.8x. Table 4 shows the resources allocated to each application and the optimal execution time.
5. Related Work
Concurrent support for multiple parallel programming models is traditionally supported in general-purpose symmetric multiprocessors (SMP), based on the standard POSIX multithreading environment. The POSIX standard and the operating system (OS) scheduler provide resource and time sharing of the hardware for general purpose CPUs. Focusing on heterogeneous architectures, this approach is also supported in the latest manycore architecture from Intel, the Xeon Phi [4]. Conceptually, at initialization time, the OpenCL driver creates 240 SW threads and pins them to the HW threads (for a 60-core configuration). Then, following a clEnqueueNDRange call, the driver schedules the work groups (WG) of the current NDRange on the 240 threads [5]. Since an OS runs on Xeon Phi, another application can concurrently use OpenMP on the same plaftorm. Compared to our solution this environment is at least one order of magnitude more costly than our multithreading implementation, in particular due the OS scheduler and POSIX layers overheads (30 microseconds to spawn 240 threads at 1GHz) [16]. Moreover, typically manycore accelerators do not support
6. Conclusions
[12] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving gpgpu concurrency with elastic kernels. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 407–418, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1870-9. . URL http://doi.acm.org/10.1145/2451116.2451160. [13] H. Pan, B. Hindman, and K. Asanovi´c. Composing parallel software efficiently with lithe. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pages 376–387, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0019-3. . URL http://doi.acm.org/10.1145/ 1806596.1806639. [14] D. Quinlan, C. Liao, J. Too, R. Matzke, and M. Schordan. Rose compiler infrastructure, 2013.
In this paper we have presented a hybrid runtime which supports multiple offload-based parallel programming models on a heterogeneous manycore accelerator. Our runtime design is modular and relies on a generic component for resource (cluster) scheduling, plus specific components which efficiently deploy generic offload requests into an implementation of the semantics of the two target programming models, OpenCL and OpenMP. The results show that our runtime offers smaller kernel offload costs, and better scalability, than other comparable heterogeneous programming model supports.
[15] V. T. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. Supporting gpu sharing in cloud environments with a transparent runtime consolidation framework. In Proceedings of the 20th International Symposium on High Performance Distributed Computing, HPDC ’11, pages 217– 228, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0552-5. . URL http://doi.acm.org/10.1145/1996130.1996160.
References [1] M. Becchi, K. Sajjapongse, I. Graves, A. Procter, V. Ravi, and S. Chakradhar. A virtual memory based runtime to support multitenancy in clusters with gpus. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’12, pages 97–108, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-0805-2. . URL http://doi.acm.org/10.1145/ 2287076.2287090.
[16] S. Saini, H. Jin, D. Jespersen, H. Feng, J. Djomehri, W. Arasin, R. Hood, P. Mehrotra, and R. Biswas. An early performance evaluation of many integrated core architecture based sgi rackable computing system. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 94:1–94:12, New York, NY, USA, 2013. ACM. ISBN 978-1-45032378-9. . URL http://doi.acm.org/10.1145/2503210.2503272. [17] E. Stotzer, A. Jayaraj, M. Ali, A. Friedmann, G. Mitra, A. P. Rendell, and I. Lintault. Openmp on the low-power ti keystone ii arm/dsp system-on-chip. In IWOMP, pages 114–127, 2013. [18] E. Sun, D. Schaa, R. Bagley, N. Rubin, and D. Kaeli. Enabling task-level scheduling on heterogeneous platforms. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pages 84–93, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1233-2. . URL http:// doi.acm.org/10.1145/2159430.2159440. [19] Texas Instruments. Arm keystone ii system-on-chip data manual. [20] University of Edinburgh. Epcc openmp microbenchmarks v3.0. URL http://www2.epcc.ed.ac.uk/computing/ research activities/openmpbench/openmp index.html.
[2] A. Duran, E. Ayguad´e, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02): 173–193, 2011. [3] V. K. Elangovan, R. Badia, and E. A. Parra. Ompss-opencl programming model for heterogeneous systems. In H. Kasahara and K. Kimura, editors, Languages and Compilers for Parallel Computing, volume 7760 of Lecture Notes in Computer Science, pages 96– 111. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-37657-3. . URL http://dx.doi.org/10.1007/978-3-642-37658-0 7. [4] C. George. Knights corner, intels first many integrated core (mic) architecture product. In Hot Chips, volume 24, 2012. [5] Intel Inc. Opencl design and programming guide for the intel xeon phi coprocessor, 2013. URL https://software.intel.com/en-us/ articles/opencl-design-and-programming-guide-forthe-intel-xeon-phi-coprocessor. [6] Kalray S.A. Kalray mppa manycore, 2014. URL http:// www.kalray.eu/IMG/pdf/FLYER MPPA MANYCORE-4.pdf. [7] S. Lee and R. Eigenmann. Openmpc: Extended openmp programming and tuning for gpus. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–11, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-7559-9. . URL http://dx.doi.org/10.1109/SC.2010.36. [8] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. M. Chapman. Early experiences with the openmp accelerator model. In IWOMP, pages 84–98, 2013. [9] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: A programming model for heterogeneous multi-core systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 287–296, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-958-6. . URL http://doi.acm.org/10.1145/ 1346281.1346318. [10] A. Marongiu, A. Capotondi, G. Tagliavini, and L. Benini. Improving the programmability of sthorm-based heterogeneous systems with offload-enabled openmp. In Proceedings of the First International Workshop on Many-core Embedded Systems, MES ’13, pages 1–8, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2063-4. . URL http://doi.acm.org/10.1145/2489068.2489069. [11] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, and D. Dutoit. Platform 2012, a many-core computing accelerator for embedded socs: performance evaluation of visual analytics applications. In DAC, pages 1137–1142, 2012.