Implementation of Multicore Communications API

Janne Virtanen, Lauri Matilainen, Erno Salminen and Timo D. Hämäläinen
Tampere University of Technology, Department of Pervasive Computing
P.O. Box 553, 33101 Tampere, Finland
{janne.m.virtanen, lauri.matilainen, erno.salminen, timo.d.hamalainen}@tut.fi

Abstract—This paper presents an implementation of the Multicore Communications API (MCAPI) with a focus on portability, stability, and simplicity of design. The main motivation for the implementation is the instability of other publicly available implementations. The developed implementation, PMQ-MCAPI, utilizes POSIX message queues, an easily portable interface that is readily compatible with MCAPI. Performance was measured as the latency and transfer rate of the API. The measurement platforms were an x86-64 PC and a development board featuring an ARM processor. The MCAPI reference implementation was used for comparison. PMQ-MCAPI is much more stable and easier to use than the other MCAPI implementations publicly available for PC. When the transfer size was between 1-8 KiB, the latency of transfers between cores was 9-15 µs and the transfer rate 500-5000 MBps. This translates to 27 000-45 000 cycles and 0.16-1.67 bytes per cycle. The CPU, and especially the performance of its caches, was found to be the most important factor contributing to the performance. In comparison to the reference, the latency of the implementation was 1/8 at best, while the transfer rate was up to 35x.

I. INTRODUCTION

The number of multiprocessor systems is increasing [1], and thus there is a growing need for solutions for communication between processors and cores. From the perspective of an application, the solution should be reusable and portable: a consistent API across various platforms and environments. Such portability allows developing the application on a PC before the final platform is complete or even specified. Furthermore, a PC may have more advanced debugging facilities, such as IO operations unavailable on a platform with limited resources.

However, APIs tend to introduce additional costs, such as development time, CPU time, memory footprint, and power consumption. As such, many APIs and libraries are too heavy for embedded systems. The Multicore Communications API (MCAPI) is intended to be a lightweight solution [2]. Other options, such as MPI, are considered either too heavy, too complex, or not portable enough. On the other hand, the publicly available MCAPI implementations [3], [4] are unsatisfactory, and we have had difficulties in using them in the past [5].

This paper presents an MCAPI implementation for Linux platforms, where each node is a process. Since processes may run on separate processor cores, it is effectively also a library for inter-core communication.

II. RELATED WORKS

Portable Operating System Interface (POSIX) provides multiple ways for inter-process communication [6]. Many of them, like message queues and shared memory, are relatively lightweight. However, POSIX is largely limited to cases where the communicating tasks are controlled by a single operating system instance, which usually occupies a single processor. On the other hand, MCAPI may be used in multiprocessor systems with or without shared memory or an operating system, if the implementation allows it.

Message-Passing Interface (MPI) is a standardized message passing system between processes executed on separate devices, such as networks of workstations or multiprocessor systems [7]. However, the standard does not exclude the possibility of using MPI on shared-memory implementations. In comparison with MCAPI, MPI is a much larger standard, with features such as broadcasts and collective communication, which are the responsibility of the application in MCAPI. The size reflects on the feasibility of implementation in environments with limited resources. There are also embedded OpenCL implementations [8], [9], but they are oriented more towards parallel computing than low-latency communication between individual computing units.

A. Multicore Communications API (MCAPI)

MCAPI consists of about 50 functions, featuring various setups, tear-downs, supporting functions, and transfers. A three-level hierarchy is used to organize the communication and the communicating entities. Moreover, three types of transfers are available, and both blocking and non-blocking calls are supported.

Domains are collections of nodes [2, p. 10]. They do not have explicit functionality and their scope is implementation specific. For example, a domain could include all processor cores within the same processor, or a domain could serve some routing purpose. Nodes may not be shared with other domains.

Figure 1 shows that nodes are collections of endpoints. The specification allows the implementation to define the purpose and function of nodes [2, p. 10]. For example, they may be threads, processes, or processor cores. They may or may not share memory with other nodes. Furthermore, nodes may have their own attributes set by the application, but this is implied to be a placeholder feature [2, p. 38].

Endpoints are the lowest level of communicating entities, organized into nodes [2, pp. 14-15]. They may be used to communicate with connectionless messages or channel-oriented packets and scalars. Since the API is intended to be portable, communication is not tied to any specific transfer technology. When using channels, an endpoint may only communicate with the other endpoint connected to the same channel. Each channel is unidirectional, so one endpoint may only receive and the other may only send. Moreover, channels require additional setup before use, but may have better performance than messages. On the other hand, messages may be sent to and received from any endpoint, and the same endpoint may both receive and send messages. As an exception, a message may not be sent to an endpoint that is connected to a channel. Furthermore, messages may have more overhead, and thus lower performance, than channels.

B. MCAPI implementations

The Multicore Association distributed a reference implementation along with the MCAPI specification in 2011 [3]. This implementation, also known as the shared memory implementation, came with an MRAPI (Multicore Resource API) reference implementation and several example test cases. It is stated that the reference implementation was intended to be a prototype rather than a high-performance implementation. On the other hand, it implements all MCAPI functions except endpoint attributes. In this implementation the nodes are separate processes.

Function-based platform MCAPI (FUNCAPI) was developed at Tampere University of Technology [10, p. 289]. The example system had one FPGA board connected to a PC via PCIe. The processing elements on the FPGA were two soft cores and a DCT IP block. The HIBI bus was used on the FPGA for connecting the blocks and for using the PCIe to communicate with the PC. All processors used MCAPI for communication.
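As a concrete illustration of the connectionless message API described in Section II-A, the sketch below shows how two nodes might exchange one message. The function names follow the MCAPI v2.015 specification [2], but the snippet is only a simplified sketch: the domain, node, and port identifiers are arbitrary, constants such as MCAPI_PORT_ANY and MCAPI_TIMEOUT_INFINITE are used as named in the specification, and exact parameter lists should be verified against the mcapi.h header of the implementation at hand.

```c
#include <mcapi.h>   /* MCAPI v2.015 API; header name per implementation */
#include <stdio.h>

#define DOMAIN         0     /* illustrative identifiers */
#define SENDER_NODE    1
#define RECEIVER_NODE  2
#define RECEIVER_PORT  0

/* Run on the sending node. Messages need no channel setup: look up the
 * remote endpoint and send directly to it. */
static void sender(void)
{
    mcapi_status_t status;
    mcapi_info_t   info;

    mcapi_initialize(DOMAIN, SENDER_NODE, NULL, NULL, &info, &status);

    mcapi_endpoint_t local  = mcapi_endpoint_create(MCAPI_PORT_ANY, &status);
    mcapi_endpoint_t remote = mcapi_endpoint_get(DOMAIN, RECEIVER_NODE,
                                                 RECEIVER_PORT,
                                                 MCAPI_TIMEOUT_INFINITE,
                                                 &status);
    const char payload[] = "hello";
    mcapi_msg_send(local, remote, payload, sizeof payload,
                   /* priority */ 1, &status);
    if (status != MCAPI_SUCCESS)
        fprintf(stderr, "send failed\n");

    mcapi_finalize(&status);
}

/* Run on the receiving node. The blocking receive presumably maps onto a
 * POSIX mq_receive() in PMQ-MCAPI. */
static void receiver(void)
{
    mcapi_status_t status;
    mcapi_info_t   info;
    char           buf[64];
    size_t         received = 0;

    mcapi_initialize(DOMAIN, RECEIVER_NODE, NULL, NULL, &info, &status);
    mcapi_endpoint_t local = mcapi_endpoint_create(RECEIVER_PORT, &status);

    mcapi_msg_recv(local, buf, sizeof buf, &received, &status);
    if (status == MCAPI_SUCCESS)
        printf("received %zu bytes\n", received);

    mcapi_finalize(&status);
}
```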

Fig. 1. Communication between endpoints illustrated in a case study.

Fig. 2. PMQ-MCAPI stack and basic operation: (a) the position of PMQ-MCAPI between the other layers, with the HW used in the PC experiments shown at the bottom; (b) interfacing with POSIX message queues is restricted to one module.

III. PROPOSED PMQ-MCAPI IMPLEMENTATION

A. Main decisions

The developed MCAPI implementation is called POSIX message queue MCAPI, abbreviated PMQ-MCAPI. The structure of the implementation is intended to be a simple, flat layer between the application and the operating system (Figure 2(a)). The goal is to have reasonably small overhead and a simple design that makes debugging easier, and thus makes the implementation more reliable.

An operating system is mandatory, as the implementation uses POSIX message queues to transfer data between nodes, and POSIX is an interface provided by operating systems. Each node is a process with a private memory space, and domains are arbitrary collections of nodes. All MCAPI entities, including domains, nodes, endpoints, and channels, are defined at design time. This simplifies the design and makes design tool integration easier.

POSIX was chosen as the communication interface because portability was the key goal for the implementation. As such, not only the application but also the MCAPI implementation itself is portable. POSIX is one of the most commonly used inter-process communication interfaces, available across multiple platforms and operating systems, yet it is also relatively light. POSIX also offers alternative IPC options to message queues, most importantly pipes and shared memory. However, using these would have made the implementation more complex. For example, using shared memory would have required implementing in PMQ-MCAPI features such as FIFO order and blocking transfers. On the other hand, pipes and other byte streams lack priority ordering and would require an additional mechanism to divide the stream into messages and packets, as MCAPI relies on discrete transfers rather than continuous streams.
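The properties that make POSIX message queues a natural fit for MCAPI, namely discrete messages, per-message priorities, and FIFO order within a priority, are visible directly in the interface. The sketch below uses only standard POSIX calls (mq_open, mq_send, mq_receive, mq_unlink); the queue name and attribute values are illustrative assumptions, not the ones used by PMQ-MCAPI.

```c
#include <mqueue.h>      /* POSIX message queues; link with -lrt on Linux */
#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative name and sizing; PMQ-MCAPI derives such values from its
       design-time endpoint definitions. */
    struct mq_attr attr = {
        .mq_maxmsg  = 10,    /* queue depth */
        .mq_msgsize = 1024   /* maximum size of a single message */
    };
    mqd_t q = mq_open("/example_endpoint", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    /* Each mq_send() transfers one discrete message with a priority;
       messages of equal priority are delivered in FIFO order. */
    const char msg[] = "hello";
    if (mq_send(q, msg, sizeof msg, /* priority */ 1) == -1)
        perror("mq_send");

    char buf[1024];                 /* must be >= mq_msgsize of the queue */
    unsigned prio;
    ssize_t n = mq_receive(q, buf, sizeof buf, &prio);  /* blocks until data */
    if (n >= 0)
        printf("got %zd bytes at priority %u\n", n, prio);

    mq_close(q);
    mq_unlink("/example_endpoint");  /* what a cleaner utility would do */
    return 0;
}
```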

TABLE I
COMPARISON OF THE PRESENTED PMQ-MCAPI TO THE MCAPI STANDARD

                      PMQ-MCAPI        Specification
Shared memory         required         not required
Operating system      required         not required
Inter-processor       no               yes
Intended scope        inter-process    closely distributed
Topology              point-to-point   undefined
Non-blocking          no               yes
Thread-safety         optional         yes

B. Details

PMQ-MCAPI is implemented in the C programming language with Linux as the target operating system. There are 11 header files and 10 source files with 4 367 lines of code, and the development time was about 4 person-months. Make is used for the compilation process and the compiler is GCC. Furthermore, the compilation needs access to some system libraries and headers, such as mqueue.h and time.h. These provide access to features and definitions supplied by the system, such as POSIX message queues, time functions, and file permission constants.

Most of the modules implementing API functions contain the internal logic of the implementation, including error checks, data defined at design time, and the state changes of endpoints and channels. These modules in turn use the module pmq_layer as an additional layer between them and POSIX (Figure 2(b)). In principle, most of those calls would need only one function from POSIX. In practice, however, the calls need some formatting, which is mostly similar between modules. For example, sending MCAPI messages and packets uses the same POSIX function and the same function from module pmq_layer, but with different parameters (a sketch of such a wrapper is given at the end of this subsection).

Table I summarizes the PMQ-MCAPI features that differ from the MCAPI specification. The features shared with the specification are connectionless and connection-oriented communication, blocking communication, FIFO data order, priority order, timeouts, and inter-core communication. The current implementation does not support non-blocking receive and send functions, as these would have made the internal structure more complex. However, one feasible solution could be parallel structures for blocking and non-blocking calls. Furthermore, most endpoint attributes and attribute functions are unsupported; the only exception is setting a timeout for an endpoint. The node attributes are unsupported as well, since they are a placeholder feature [2, p. 38].
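The paper does not list the actual functions of pmq_layer, so the following is only a sketch of how such a wrapper might look: a single timed-send helper that both the message and the packet modules could call with different parameters. It uses only standard POSIX calls (mq_timedsend with an absolute CLOCK_REALTIME deadline, as POSIX specifies); the function name and its parameters are hypothetical.

```c
#include <mqueue.h>
#include <time.h>
#include <errno.h>

/* Hypothetical pmq_layer-style helper: send one buffer to an already opened
 * queue, blocking for at most timeout_ms milliseconds. MCAPI messages and
 * packets could both funnel through a function like this, differing only in
 * the queue, priority, and size they pass in. */
int pmq_send_timed(mqd_t q, const void *buf, size_t len,
                   unsigned prio, unsigned timeout_ms)
{
    struct timespec deadline;

    /* mq_timedsend() takes an absolute deadline measured on CLOCK_REALTIME. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    if (mq_timedsend(q, buf, len, prio, &deadline) == -1)
        return (errno == ETIMEDOUT) ? 1 : -1;   /* 1 = timeout, -1 = error */
    return 0;                                   /* 0 = sent */
}
```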

The implementation also relies on design-time definitions, such as the maximum packet size and the number of receiving packet buffers. More importantly, endpoints and channels must also be defined at design time; no endpoint or channel may function in PMQ-MCAPI if it is not defined at design time. Thus the implementation should be statically linked with the application, and both the implementation and the application must be rebuilt whenever the design-time definitions are changed. Failing to do so will result in undefined behaviour. If one application process needs a restart, all of them must be restarted. Before the restart, a cleaner utility provided by the implementation must be executed; it unlinks all POSIX message queues associated with the defined endpoints and channels. Thus the cleaner utility must also be rebuilt after changing the design-time definitions. Despite these limitations, PMQ-MCAPI is easy to use and very reliable.

IV. EVALUATION

A. Setup

There were two measurement platforms. A regular PC with an x86-64 processor was the implementation platform, and a platform with an ARM Cortex-A9 MPCore processor [11] was chosen for comparison. The x86-64 platform had a quad-core CPU with a clock frequency of 2992 MHz, 3.6 GiB of memory, 64 KiB of level 1 cache, and 12 MiB of level 2 cache. The ARM platform had a dual-core CPU with a clock frequency of 925 MHz, 1.0 GiB of memory, 64 KiB of level 1 cache, and 512 KiB of level 2 cache. CPU frequency scaling on the x86-64 platform was turned off.

On both platforms data is transferred between cores and processes via memory, as illustrated in Figure 2(a). The level 1 cache has separate instruction and data memory for each core, while the level 2 cache is unified and shared by two cores. Since the x86-64 platform has four cores, a core may access only one other core via the level 2 cache; the other two cores must be accessed via main memory, as they have their own level 2 cache. To keep the results comparable, most measurements were executed using only one or two processor cores.

To measure performance, the system call clock_gettime was used with CLOCK_MONOTONIC as the clock identifier. The calls themselves take some time and thus may have influenced the results slightly, but within these measurements it is more important that the results are comparable. The three communication modes provided by MCAPI were measured. Furthermore, the performance of plain POSIX message queues with no wrapper was measured for comparison.

Most of the measurements involved 2 nodes, each with 2 endpoints. In the single-core measurements, the node processes were forced onto the same processor core using the system call sched_setaffinity, so that the system scheduler would not place them on separate cores and thus influence the data transfers. The same call was also used in the dual-core measurements to force the processes onto separate cores and thus make them use an inter-core connection to communicate. The measured variables were latency and transfer rate as a function of data transfer size from 256 B to 8 KiB.
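A minimal sketch of the measurement harness just described: pin the process to a chosen core with sched_setaffinity and time a round trip with clock_gettime(CLOCK_MONOTONIC). Only standard Linux/POSIX calls are used; the core number and the send/receive step are placeholders.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <time.h>
#include <stdio.h>

/* Pin the calling process to one core so the scheduler cannot migrate it. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0 /* this process */, sizeof set, &set);
}

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void)
{
    pin_to_core(0);                        /* illustrative core number */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... send to the peer and block until its reply arrives, e.g. with
       mcapi_msg_send()/mcapi_msg_recv() or plain mq_send()/mq_receive() ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("round trip: %.2f us\n", elapsed_us(t0, t1));
    return 0;
}
```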

Fig. 3. Round-trip latency and throughput results on the PC using 1 and 2 cores: (a) single-core latency, (b) dual-core latency, (c) single-core transfer rate, (d) dual-core transfer rate. Each plot shows PMQ, Packet, and Message as a function of transfer size (KiB); latency is in µs and transfer rate in MBps.

To measure latency, a round-trip setup consisting of two processes, and thus two nodes, was used. The first node sent data to the other, waited for the response, and then logged the elapsed time; meanwhile, the other node received the data and sent it back. As such, a single round-trip iteration consists of 4 calls. The latency measurements yielded the minimum, average, and maximum latency over a preset number of iterations. Repeating 100 000 iterations proved to be sufficient, as more iterations did not influence the results. Background load influenced the measurements, especially on the x86-64 platform; sometimes there was so much noise that the iterations had to be repeated.

The transfer rate, in contrast, was measured by having two nodes transfer 10 MiB of data from sender to receiver. The duration of the whole transfer was measured, and the amount of data transferred was then divided by the time spent, yielding the transfer rate.

B. Results

Figure 3 shows the measurement results of the x86-64 platform. All data points in the figures are averages, rather than minimums or maximums. The red dotted line is the POSIX message queue, the black solid line is MCAPI packet, and the blue dashed line is MCAPI message. Circular markers denote latency, while squares denote transfer rate.

The single-core round-trip latencies (Figure 3(a)) were about 3.0-8.0 µs. MCAPI packet stayed only about 15% slower than the POSIX message queue, while MCAPI message was up to 27% slower. The dual-core round-trip measurements (Figure 3(b)) resulted in an even higher latency of about 8.0-14.5 µs. Thus, dual-core latency is 2-3x the single-core latency, with larger packets having a smaller relative penalty. When data is transferred between two processor cores, it must pass through the level 2 cache, as it is the first cache level shared between the cores. This may also explain why latency appears more erratic in dual-core transfers. Furthermore, the difference between the communication types is less pronounced, as relatively more time is spent in the transfer itself rather than in executing overhead logic.

The transfer rate measurements (Figures 3(c) and 3(d)) produced a logarithmic-looking curve, with a transfer rate of about 250-1450 MBps when the transfer size was between 1-8 KiB. The form is logarithmic rather than linear because the increase in transfer rate results from increasing the transfer size in relation to the transfer overhead; when the transfer size is already large in relation to the overhead, increasing it further yields a smaller benefit. Furthermore, increasing cache misses could also limit the performance increase.
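The shape of these curves follows from a simple first-order model (an illustration of the argument above, not a model given in the paper): if each transfer pays a roughly constant per-call overhead t_ovh plus a size-proportional copy time s/B_peak, the achieved rate for transfer size s is approximately R(s) ≈ s / (t_ovh + s/B_peak). The rate grows almost linearly while the overhead dominates and saturates towards the copy bandwidth B_peak for large transfers, which matches the flattening seen in Figures 3(c) and 3(d).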

TABLE II
SUMMARY OF MEASUREMENT RESULTS BY PLATFORM. TRANSFER SIZE IS 2 KiB. PMQ STANDS FOR POSIX MESSAGE QUEUE.

                                 Single core                 Dual core
                            PMQ     Packet  Message     PMQ     Packet  Message
x86
  Avg. latency (µs)         3.50    4.21    9.90        9.32    8.77    10.58
  Avg. latency (cycles)     10 472  12 596  29 621      27 885  26 240  31 655
  Transfer rate (MBps)      1 832   1 013   872         3 130   1 692   1 666
  Transfer rate (B/cycle)   0.61    0.34    0.29        1.05    0.57    0.56
ARM
  Avg. latency (µs)         16.72   18.80   22.59       19.27   20.59   22.78
  Avg. latency (cycles)     15 466  17 390  20 896      17 825  19 046  21 071
  Transfer rate (MBps)      343     272     235         707     594     459
  Transfer rate (B/cycle)   0.37    0.29    0.26        0.76    0.64    0.50

When the transfer size is 8 KiB, the single-core transfer rate (Figure 3(c)) with plain POSIX is at best about one third higher than MCAPI packet, and MCAPI packet is about one third higher than MCAPI message. Smaller transfer sizes do not show such large differences. This is caused by the overhead of the MCAPI implementation, since the overhead accumulates over the whole duration of the measurement rather than over single transfers, as in the latency measurements.

Dual-core transfer rates (Figure 3(d)) are almost 2x the single-core transfer rates. This is most likely due to parallel execution, as the receiving and sending processes do not need to compete for the same resource, CPU time. Furthermore, the latency between cores does not matter as long as the transfers can keep up with the throughput; naturally, this requires that not too many cache misses occur. Moreover, the difference between plain POSIX and MCAPI is smaller than with the single-core transfer rate: plain POSIX has about a 20% higher transfer rate than MCAPI packet and a third more than MCAPI message. This is also explained by the reduced competition between the processes.

C. Discussion

The results are summarized in Table II. In addition to transfer rates and average latencies, it includes latency and transfer rate expressed in CPU cycles. On the x86-64 platform the memory footprint of the implementation was 28.0 KiB on disk and 540 KiB in memory. On the ARM platform the footprint was 20.4 KiB on disk and 448 KiB in memory. However, the memory footprint increases with the defined resources, such as endpoints.

Latency on the ARM platform was 2-3x that of the x86-64 platform, while the difference in transfer rate was 2-6x. This can be partially explained by the clock frequency of the ARM platform, which is less than one third of that of the x86-64 platform. However, the smaller cache of the ARM platform may also be part of the explanation, as it may cause more cache misses with larger transfer sizes. On the other hand, the difference between the dual-core and single-core results was narrower on the ARM platform than on the x86 platform.
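The cycle figures in Table II correspond to the time and rate figures scaled by each platform's clock frequency: for the single-core PMQ column on x86, for instance, 3.50 µs × 2992 MHz ≈ 10 472 cycles and 1 832 MBps / 2992 MHz ≈ 0.61 B/cycle, while on ARM 16.72 µs × 925 MHz ≈ 15 466 cycles.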

When comparing CPU cycles between the two platforms, ARM still has higher latency with a single core but, interestingly, lower latency with two cores. This is not unexpected, as the ARM has relatively low latency in transfers between its two cores. Furthermore, these cycle counts also include the cycles spent in the wait state. This could mean that the level 2 cache of the ARM platform is relatively fast in comparison to its CPU speed.

Plain POSIX message queues perform better than PMQ-MCAPI in both latency and transfer rate, as the implementation introduces overhead. The penalty in transfer rate can be as large as half, with plain POSIX reaching twice the transfer rate of PMQ-MCAPI, while the latency is up to 30% higher. The MCAPI scalar performs well in terms of latency (7.69 µs dual-core round-trip latency), but its transfer rate is very poor (14.02 MBps dual core). This is expected, as the earlier measurements demonstrated that large transfer sizes are good for transfer rate, whereas small transfer sizes favour latency. On the other hand, MCAPI scalars may also be useful in terms of memory footprint, since the maximum size of a single scalar transfer is small and thus a scalar channel does not need a large POSIX message queue.

PMQ-MCAPI was also verified with more than 130 unit tests [12]. Based on them, the implementation appears to be very stable. The tests covered not only ideal situations but also ones resulting in errors or from incorrect use. Furthermore, the measurements included a high number of iterations; since the measuring processes stayed stable, the implementation does not immediately fail under load.

D. Comparison to reference implementation

Figure 4 presents the measurement results of the MCAPI reference implementation on the x86-64 platform. The measurements are presented relative to PMQ-MCAPI: the reference latency is about 5-8x the corresponding dual-core latency of PMQ-MCAPI, and the transfer rate of PMQ-MCAPI is about 20-35x higher, with the difference increasing with the transfer size.

Fig. 4. Dual-core performance of the MCAPI reference implementation compared to PMQ-MCAPI: (a) round-trip latency (ref./PMQ-MCAPI), (b) transfer rate (PMQ-MCAPI/ref.), both shown for Packet and Message as a function of transfer size (KiB).

While Figure 4 represents only dual-core performance on the x86-64 platform, the results relative to PMQ-MCAPI were similar in the single-core measurements and on the ARM platform. To measure the performance of the reference implementation, slight changes to its API had to be made to make it compatible with the measurement setups used for PMQ-MCAPI. These changes included renaming variables and functions, which were named differently in a more recent version of the MCAPI specification.

V. CONCLUSIONS

PMQ-MCAPI is a stable implementation for application development, although only the most important functions of the MCAPI specification were implemented. The source code of the implementation is available at GitHub [12].

Use of PMQ-MCAPI is constrained by its need for an operating system, most likely Linux, and its need for POSIX message queues. Furthermore, it requires MCAPI entities, such as endpoints and channels, to be defined at design time. In addition, a single node is always tied to a single process, and launching those processes is not trivial, although it is simplified by a launcher script.

When compared to some other APIs and MCAPI implementations [10, p. 293], PMQ-MCAPI has quite high latency, but it performs much better than the reference implementation. No measurement results were available for OpenMCAPI, but in the past it was not considered satisfactory in exercise works [5]: the implementation was difficult to initialize and did not stay stable either. There are no other publicly available MCAPI implementations. Furthermore, PMQ-MCAPI comes with an example application that is run with a single command-line instruction, whereas both OpenMCAPI and the reference implementation are more difficult to set up.

Plain POSIX message queues performed better than PMQ-MCAPI in all situations. Furthermore, using POSIX directly is easier than using PMQ-MCAPI, as the setup is less complex and the interface of POSIX message queues consists of only 10 functions. On the other hand, applications built on POSIX are not portable to bare-metal environments. PMQ-MCAPI is designed to function on top of Linux, and as such it is ideal for early development: having a functional PC implementation means that application development may start before the platform is decided. In conclusion, PMQ-MCAPI is stable, freely available, and easy to port.

REFERENCES

[1] W. Wolf, A. A. Jerraya, and G. Martin, "Multiprocessor system-on-chip (MPSoC) technology," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008.
[2] Multicore Communications API Specification, Version 2.015, Multicore Association, 2011, 169 p. Available (accessed 6.1.2014): https://www.multicore-association.org/request_mcapi.php?what=MCAPI
[3] Multicore Communications API Reference Implementation source code, Multicore Association, 2010.
[4] OpenMCAPI Bitbucket repository. Available (accessed 16.6.2014): https://bitbucket.org/hollisb/openmcapi/wiki/Home
[5] E. Salminen and T. D. Hämäläinen, "Experiences from System-on-Chip design courses," in FPGA World, 2014, 6 pages.
[6] The Open Group Base Specifications Issue 7, 2013 Edition, The IEEE and The Open Group. Available (accessed 10.6.2014): http://pubs.opengroup.org/onlinepubs/9699919799/
[7] MPI: A Message-Passing Interface Standard, Version 3.0, Message Passing Interface Forum, 2012, 852 p. Available (accessed 2.6.2014): http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
[8] J.-J. Li, C.-B. Kuan, T.-Y. Wu, and J. K. Lee, "Enabling an OpenCL compiler for embedded multicore DSP systems," in Parallel Processing Workshops (ICPPW), 41st International Conference on, Pittsburgh, 2012, pp. 545-552.
[9] J.-H. Hong, Y.-H. Ahn, B.-J. Kim, and K.-S. Chung, "Design of OpenCL framework for embedded multi-core processors," IEEE Transactions on Consumer Electronics, vol. 60, no. 2, pp. 233-241.
[10] L. Matilainen, E. Salminen, T. D. Hämäläinen, and M. Hännikäinen, "Multicore Communications API (MCAPI) implementation on an FPGA multiprocessor," in Embedded Computer Systems (SAMOS), International Conference on, Samos, 2011, pp. 286-293.
[11] Cyclone V SoC Hard Processor System, Altera Corporation, 2013, 35 p. Available (accessed 18.6.2014): http://www.altera.com/devices/fpga/cyclone-v-fpgas/hard-processor-system/cyv-soc-hps.html
[12] PMQ-MCAPI source code. Available (accessed 1.6.2014): https://github.com/BlackFairy/PMQ-MCAPI
