This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON COMPUTERS
Differentiated Communication Services for NoC-Based MPSoCs
Everton Alceu Carara, Ney Calazans, Fernando Gehm Moraes
FACIN - PUCRS - Av. Ipiranga 6681 - Porto Alegre - 90619-900 - Brazil
[email protected], [email protected], [email protected]

Abstract— The adoption of Networks-on-Chip (NoCs) as the communication infrastructure for complex integrated systems is a fact, promoted by the growing number of processing elements integrated in current MPSoCs. These are designed to execute several applications in parallel, with different communication requirements and distinct levels of required quality of service. To meet these constraints, most designs customize the MPSoC at design time, using specific NoC communication services such as adaptive routing algorithms, priorities, and connections. However, MPSoCs are increasingly used in embedded systems, where new applications may be added at run-time, characterizing dynamic workload scenarios. Such scenarios require adaptability at runtime, with applications having the possibility to select the most appropriate communication service according to their respective requirements. The goal of the present work is to link the hardware level of NoCs to the MPSoC application level, proposing the development of a communication API that exposes the communication services offered by the NoC to the application developer. Executing real and synthetic applications in two different MPSoCs, using four different NoC communication services, demonstrates the efficiency of the proposed approach in meeting application requirements.

Keywords— MPSoC; NoC; QoS; Composability; API.
I. INTRODUCTION

An MPSoC is a complete computational system integrated in a single chip, combining multiple processing elements (PEs) as its main components [1]. At a higher abstraction level, MPSoCs can be seen as reconfigurable platforms, where reconfigurability is achieved by programming the processors. The growing interest in MPSoCs lies in their ability to combine high performance and flexibility. Thanks to their programmability, the same MPSoC can be used in different products, reducing their time-to-market and extending their time-in-market. This device reusability also reduces the cost for the final consumer.

The availability of several PEs in MPSoCs enables the exploration of parallelism at the task level. To cope with this programming paradigm, applications are decomposed into tasks that can execute in parallel. In addition, several applications can run simultaneously, characterizing the use of parallelism at the application level. In general, applications are independent and do not require communication among themselves. However, the tasks of an application communicate with each other for data exchange and synchronization. With the increasing number of simultaneously running applications, the volume of data exchanged between tasks may compromise the overall system performance. Therefore, it is essential that the communication infrastructure support high communication rates and a high degree of parallelism. NoCs [2][3][4] are a natural solution to this problem, which in addition to supplying the needs for bandwidth and parallelism, provide scalability to larger systems.
Digital Object Identifier 10.1109/TC.2012.123
NoCs are roughly composed of a set of routers and point-to-point interconnection channels [5]. The MPSoC is built by interconnecting Intellectual Property (IP) cores to local ports of NoC routers. The prevalent communication model in NoCs is message passing, due to the distributed structure of NoCs and the absence of a common communication medium interconnecting all IP cores simultaneously, as in buses. Messages are transmitted between IP cores wrapped in NoC packets. Recently, many studies have proposed different communication services that go beyond basic message passing in NoCs. Such communication services include priorities in the allocation of communication resources, establishment of connections, and support for collective communication such as multicast and broadcast. The main goal of these services is to provide differentiated treatment to application flows with constraints like minimum throughput and maximum latency. They offer different levels of quality of service (QoS), providing means to meet application requirements. The adoption of NoCs as communication infrastructure is a fact, given the number of PEs integrated in current MPSoCs [6]. NoC-based MPSoCs allow the development of scalable platforms with huge computational power, enabling the simultaneous execution of multiple complex applications. Nonetheless, the integration of multiple applications on the same platform must be done carefully, since each application has a distinct set of communication requirements and patterns, each of which may benefit from distinct network communication services. Therefore, the need to expose the communication services offered by the NoC to the programming level, through an Application Programming Interface (API), is evident, in order to optimize the use of the available NoC resources from higher abstraction levels.
This API provides the application programmer with access to the communication services offered by the NoC, favoring design space exploration and platform optimization. Such an approach is suggested in [7], where low-level communication primitives are offered as a software component of the STNoC. In order to access the low-level services from higher abstraction levels, it is the designer's task to implement the programming model (e.g. shared memory) and create an API. This work presents mechanisms implemented at the NoC level, and the corresponding API responsible for exposing these as differentiated communication services at the application level. In this work, the term differentiated services refers to communication services with a quality level higher than basic message passing. Communication services are evaluated in two different MPSoC architectures, with the execution of applications programmed using the API. The main contribution of this work is to show the importance of concomitant NoC and MPSoC design in an effort to create reusable platforms serving applications with different communication requirements. The development of NoC architectures must use real traffic generated by PEs, and the development of MPSoC architectures cannot simply abstract the NoC infrastructure by using simple send/receive primitives. To meet the requirements of actual applications, designers must use a complete environment, including hardware, operating system, communication API and applications partitioned into tasks.

This paper is organized as follows. Section II presents related works in NoC-based MPSoCs that claim some mechanism to ensure QoS. The core of the work is presented in Sections III, IV and V, in a bottom-up fashion. Section III details the hardware mechanisms supporting differentiated communication services in the adopted NoC. Section IV briefly presents the MPSoC architectures and application modeling used in this work, supporting the communication services. Next, Section V describes the primitives composing the communication API, executing in the MPSoCs. Section VI presents the results, and Section VII concludes this article and points out directions for future work.

0018-9340/12/$31.00 © 2012 IEEE
II. RELATED WORK

This section presents examples of NoC-based MPSoCs using different strategies to offer QoS, enabling applications to meet their performance requirements. Tilera devices [8] are homogeneous two-dimensional arrays of PEs (64-100) interconnected by the iMesh NoC [9]. In addition to the PEs, the NoC also connects memory controllers and input/output cores. The iMesh NoC contains 5 bi-dimensional mesh networks: UDN (User Dynamic Network), IDN (I/O Dynamic Network), STN (Static Network), MDN (Memory Dynamic Network) and TDN (Tile Dynamic Network). The dynamic networks (UDN, IDN, MDN and TDN) adopt packet switching (wormhole), while the static network (STN) implements circuit switching. The UDN, IDN and STN networks are integrated into the processor pipeline, being quickly accessed through processor registers. Thus, any processor instruction can directly access these networks. Communication is based on message passing, similar to the MPI standard, supporting the sending of messages at any time to any destination without requiring a connection. Tilera adopts hardware redundancy (multiple networks) to improve performance, offering circuit switching (STN) to ensure QoS. Kumar et al. [10] propose an integrated design flow for NoC-based MPSoC generation, targeting FPGA devices. The architecture is based on the Silicon Hive processor [11] and the Æthereal NoC [12]. The tool generates an RTL VHDL description of the system and simulation models for each of its components. The connection-based communication service implemented by Æthereal allows the specification of connection parameters (e.g. bandwidth and latency) at design time. Murillo et al. [13] modified the Xpipes NoC [14] to provide communication services based on priorities and connections. The NoC is associated with Altera Nios-II soft core PEs to produce a homogeneous MPSoC.
A priority mechanism was implemented in the router arbiter, responsible for setting the order in which it serves requests coming from input ports. When two or more input ports request the same output port, the arbiter serves the one containing the packet with the highest priority. The priority of a packet can range from 0 to 7 (highest) and is part of the packet header. This approach offers low QoS guarantees, because priorities are verified only when there are simultaneous requests for the same output port. Support for physical connections is based on circuit switching. A connection is established with a BE packet sent from source to destination. This packet reserves the entire bandwidth of the path for the duration of the communication. Connections can be unidirectional or
bidirectional. The connection is closed with another BE packet sent by the source, which deallocates the path resources. Singh et al. [15] present a design flow for NoC-based MPSoCs, also targeting FPGA devices. Initially, the NoC is generated using the NoC Generator tool [16]. This tool allows the generation of NoCs with guaranteed throughput, using spatial division multiplexing (SDM). Next, MicroBlaze processor-based PEs are connected to the network interfaces through Fast Simplex Link (FSL) ports, completing the generation of the hardware infrastructure. Communication among PEs is based on connections, which are established by a central processor that controls the communication of the MPSoC. Connection configurations are generated at design time by the NoC Generator, and depend on the communication requirements of the applications to be executed. The processor responsible for the MPSoC communication control stores configurations and establishes connections before starting application execution. The HS-Scale MPSoC [17] explores task migration to restore or improve the performance of running applications. HS-Scale is a homogeneous bi-dimensional array of NPUs (Network Processing Units) that communicate via the Hermes NoC [18]. Monitoring of processor load and communication queue occupation triggers task migration. Each NPU executes a multitask kernel, with an API for communication between tasks. This API is based on standard MPI message passing and has only two primitives: (i) MPISend() and (ii) MPIRecv(). Yang et al. [19] propose a method for adaptive traffic admission control to keep the communication infrastructure operating below the saturation point. The control is distributed and based on a closed feedback loop that regulates the injection rate of the IPs. The technique is applied to the SpiNNaker MPSoC. Ahmad et al.
[20] describe an MPSoC that dynamically changes some of its NoC features (routing, switching mode and packet size) to adapt to changes in the applications' communication requirements. The GENESYS MPSoC [21] allows the creation of platforms for different application domains (e.g. aviation and automotive) from an infrastructure that provides a set of basic core services. These core services include: (i) synchronization; (ii) communication; (iii) configuration; (iv) execution control. A "time-triggered NoC" (controlled by a TDMA scheme) implements the core services. On top of the core services, higher-level services can be implemented by domain-independent and domain-specific system components that customize the platform to the needs of the specific application domain. A prototype was implemented using a Stratix III FPGA to execute a racing car simulator [22]. The results show the adequacy of the implementation for systems with rigid latency and bandwidth constraints. All the above steps are design-time actions defined according to the target application. Motakis et al. [23] present an approach to access the services of the Spidergon NoC from the software level, targeting an abstract MPSoC description. The approach uses a set of drivers and a library of functions (libstnoc) that implements an API. The Spidergon NoC offers services related to power management, routing, security, and QoS. These are accessible at runtime and exposed to the software level via memory-mapped registers. A dedicated CAD tool generates different instances of the Spidergon NoC (RTL description) and the Linux drivers, which can be inserted into the kernel or compiled as separate modules. SIMPLE (Simple Multiprocessor Platform Environment) [24] is a platform described in SystemC, targeting the validation of real-time applications. The system adopts the SoCIN network [25], and each PE contains a Java processor, local memory, network interface, and a
real-time clock. The processor runs the Real-Time Specification for Java (RTSJ) [26], thus supporting multithreaded and real-time applications. The exchange of messages between threads is supported by the COM-API, which allows the establishment of communication channels and the assignment of different priorities to messages. However, the NoC does not support communication services with QoS guarantees. QoS is delegated to the application level, omitting any connection to the physical level. xENoC [27] is an environment to generate NoC-based MPSoCs. It employs the NoCWizard tool, which allows the generation of NoCs described in Verilog RTL from the specification of various parameters, such as topology, flow control, switching mode and routing algorithm. The environment has an IP library with different processors and accelerators, which enables the generation of heterogeneous MPSoCs. Besides the hardware infrastructure, xENoC also includes a software library for message passing and synchronization between tasks, named on-chip MPI (ocMPI). Since the implemented NoC only supports unicast transmission, collective communication services such as broadcast (ocMPI_Broadcast()) are implemented from the basic primitives ocMPI_Send() and ocMPI_Recv(). The integration of the NoC services within the MPSoC design is still a gap observed in most reviewed works. At the NoC level, three mechanisms are adopted to ensure some level of differentiated services: (i) circuit switching [8], [10], [13], [15]; (ii) priorities [13]; (iii) reconfigurability [20]. In addition, there are propositions of two higher-level services: (i) task migration [17]; (ii) traffic admission control [19]. In all these works, the APIs commonly implement only the basic message passing communication service (send/receive), offering little or no control over network resources. Other works propose dedicated APIs, customized at design time.
The approach proposed in [21] generates hardware platforms using a TDMA NoC. References [23], [24], [27] also present APIs to guarantee QoS. Another approach observed in some works [28], [29] is the centralized management of the communication infrastructure. In complex MPSoCs, communication management must be distributed to avoid bottlenecks in the NoC. The goal of the present work is to bridge the gap between the hardware and the application level, proposing the development of a communication API that exposes the hardware level to the application developer. The proposed NoC, derived from Hermes [18], offers the following services: (i) priorities; (ii) connections (circuit switching); (iii) differentiated routing, where the application may select deterministic or adaptive routing - no reviewed work enables this feature; (iv) collective communication (multicast). All these services may be accessed at the software level, at run-time, without the need for a centralized controller, to ensure that applications meet their performance requirements.
III. COMMUNICATION SERVICES SUPPORT

The basic router architecture adopted in the present work is illustrated in Figure 1, an adaptation of the generic router architecture presented in [30], where the major components of any router are clearly identified, that is, buffers, the crossbar, and the arbitration and routing unit. The adopted NoC has the following features:
• 2D mesh topology - the topology found in most NoCs [31], with a straightforward physical implementation. Agarwal et al. [32] suggest that 2D meshes are more efficient in terms of latency and power consumption when compared to most other topologies;
• Input buffering for temporary flit storage;
• Simultaneous packet (wormhole) and circuit switching;
• Stop & Go flow control between adjacent routers;
• Distributed routing;
• 5 bi-directional ports with 2 unidirectional links per port;
• 2 physical channels per link (no virtual channels), enabling an increase in router bandwidth without incurring the cost of virtual channels [33].

[Figure 1: a 10x10 crossbar connecting ports North0/North1, East0/East1, South0/South1, West0/West1 and Local0/Local1, controlled by the arbitration and routing unit.]

Figure 1 – Generic router architecture with duplicated physical channels.
Usually, the communication services supported by NoCs can be classified as best-effort (BE) or guaranteed service (GS). BE services forward packets as soon as possible, but no guarantees are given (regarding e.g. latency or throughput), as opposed to guaranteed services. Due to their simplicity, BE services are the most commonly found approach in NoCs. BE produces traffic where only correctness and completion are guaranteed, while GS provides additional guarantees. A combination of BE and GS services is desirable, since GS provides predictability, a quality that is important in real-time systems, while BE improves average resource utilization. Quality of service (QoS) refers to the levels of guarantees given for data transfers in order to satisfy communication requirements. This section presents four resource allocation mechanisms to support differentiated communication services: (i) flow oriented routing; (ii) priorities; (iii) connections (circuit switching); (iv) multicast. Flow oriented routing and priorities offer soft guarantees, while connections offer hard guarantees. Multicast supports collective communication, which is required to optimize parallel algorithms and control operations (like those found e.g. in cache coherence protocols). The QoS mechanisms, soft and hard, allow controlling resource allocation, providing better isolation of flows from different applications. Isolation among applications running simultaneously on a platform favors system composability [28]. This property aims to ensure application performance independent of whether other applications are present and how they behave. Ideally, an application should perform in the same way when running alone or in the presence of other concurrent applications. Since in NoC-based systems the network is the main shared resource, allocation mechanisms are required to isolate flows belonging to applications with QoS requirements, in this way achieving composability at the network level.
A. Flow Oriented Routing (soft guarantee)

Several NoC routing proposals targeting overall performance optimization are available in the literature [34]-[40]. These works propose routing schemes to combine the advantages of deterministic/adaptive and/or minimal/non-minimal routing algorithms. However, the proposals tend to handle all application
flows in the same way, still offering only the basic message passing communication service. Since the interconnection infrastructure is shared by the communicating IPs, application flows may compete for resources, creating inter-application interference. The ability of a system to group application flows in different traffic classes provides total or partial isolation among them. NoC-based SoCs natively provide some degree of flow isolation, due to the use of several independent links interconnecting routers. However, flow collisions cannot be avoided when the same path is to be taken by two flows simultaneously. Collisions can be reduced, although not completely avoided, by exploring alternative paths. Adaptive routing algorithms, allowing a better traffic distribution over the network, can find such alternative paths. Non-minimal adaptive routing may slightly increase packet latency, but may consider a wider spectrum of path choices. Flow oriented routing [39] is a feature that can be added to any adaptive routing algorithm, such as odd-even [41] or those based on the turn model (e.g. west first or north last) [42]. The basic condition is that there exists a deterministic version of the employed adaptive algorithm. It is possible to show that such a version always exists, by fixing one single path between each source/target pair from all paths usable by the original adaptive algorithm. In the proposed approach, flows with temporal constraints, requiring soft QoS, are routed adaptively, creating more opportunities for efficient path exploration. Flows without temporal constraints, such as BE flows, are routed deterministically, using a deterministic version of the adaptive routing. Such routing differentiation leads to a flow oriented routing scheme that enables the NoC to handle distinct traffic classes differently.
Flow oriented routing allows the NoC to simultaneously provide adaptive and deterministic routing, which means that at the same time some packets are routed deterministically while others are routed adaptively. Routing schemes available in the literature do not offer the possibility to make this choice at runtime, i.e., all flows are routed either deterministically or adaptively. Since adaptive routing has more resources available to offer alternative paths, it can be applied to high priority flows, while low priority flows are routed deterministically. Routers are responsible for selecting the routing version to be applied to a given packet at runtime. An important issue in adaptive routing is the policy to select the router output port to use. The usual policy is to use the neighborhood congestion level as the decision metric [35][40]. However, such a metric does not ensure a congestion-free path, since it uses only local information and may lead packets into congested areas that are not locally visible. To reduce area overhead and keep the implementation as simple as possible, this work does not adopt any kind of congestion detection to select the output port. When more than one output port is available, the selected port is the one leading to one of the shortest paths to the target (even though the underlying algorithm allows non-minimal routing). If all available output ports lead to shortest paths, the first free port found is selected. The area overhead of the proposed scheme is very small, less than 1% of the router area. This work suggests as case study the Hamiltonian routing algorithm [43], due to the simplicity of implementing its adaptive and deterministic versions, and the possibility of using it for deadlock-free multicasting. The two versions of the Hamiltonian routing algorithm used are (i) non-minimal partially adaptive and (ii) minimal deterministic. A Hamiltonian path for a graph is any path that visits every graph vertex exactly once.
In the Hamiltonian routing algorithm, each NoC router receives a label. In an NoC with N routers, label assignment is based on the router position on a
Hamiltonian path, where the first router in the path is labeled 0 and the last one is labeled N-1. Figure 2 illustrates a possible label assignment to routers based on a Hamiltonian path in a 4x4 mesh NoC. The non-minimal partially adaptive version of the Hamiltonian routing algorithm works as follows. A packet in a router with a label smaller than the destination router label is forwarded to any neighbor router with a larger label, as long as that label is smaller than or equal to the destination router label. In a similar way, when a packet is in a router with a label larger than that of the destination router, it is forwarded to any neighbor router with a smaller label that is larger than or equal to the destination router label. To create a minimal deterministic version from the partially adaptive Hamiltonian routing algorithm, the forwarding condition can be restricted to "forward to the neighbor router with larger/smaller label that has a label smaller/larger than or equal to that of the destination router" (depending on the source and target labels). Hamiltonian routing is livelock-free because packets always follow a path of increasing (or decreasing) labels.

[Figure 2: 4x4 mesh labeled along a Hamiltonian path, rows from top to bottom: 15 14 13 12 / 8 9 10 11 / 7 6 5 4 / 0 1 2 3.]

Figure 2 - Example of label assignment based on a Hamiltonian path in a 4x4 mesh.
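As an illustration, the label assignment of Figure 2 and the two routing versions can be sketched in software. This is a behavioral model, not the authors' RTL; the function names and the particular deterministic tie-breaking choice are ours:

```python
def hamiltonian_labels(n):
    """Label an n x n mesh along a serpentine Hamiltonian path (as in Figure 2)."""
    return {(x, y): y * n + (x if y % 2 == 0 else n - 1 - x)
            for y in range(n) for x in range(n)}

def neighbors(pos, n):
    x, y = pos
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < n and 0 <= y + dy < n]

def adaptive_hops(cur, dst, labels, n):
    """Neighbors admissible under the non-minimal partially adaptive rule."""
    lc, ld = labels[cur], labels[dst]
    if lc < ld:   # ascend: larger label, never beyond the destination label
        return [p for p in neighbors(cur, n) if lc < labels[p] <= ld]
    return [p for p in neighbors(cur, n) if ld <= labels[p] < lc]  # descend

def deterministic_hop(cur, dst, labels, n):
    """One simple deterministic restriction: among the admissible neighbors,
    take the one whose label is closest to the destination label."""
    return min(adaptive_hops(cur, dst, labels, n),
               key=lambda p: abs(labels[p] - labels[dst]))
```

On the 4x4 labeling of Figure 2, a packet from router 0 to router 15 then visits labels 0, 7, 8 and 15 under this deterministic choice, while the adaptive version may also start through router 1.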
To allow the routing engine to select the routing version at runtime, one available bit of the packet header is defined as the routing selection bit. During packet transmission, the routing selection bit is verified at each router in the path (assuming the use of distributed routing) and the corresponding routing version is executed. The PE sets the routing selection bit before injecting each packet in the NoC.
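The per-hop check can be pictured as a simple dispatch on that bit. This is an illustrative software model; the bit position and function names are assumptions, not the actual header layout:

```python
ROUTING_SELECTION_BIT = 15  # assumed position of the routing selection bit

def route(header, adaptive_route, deterministic_route):
    """At each hop, run the routing version selected by the PE at injection."""
    if header & (1 << ROUTING_SELECTION_BIT):
        return adaptive_route(header)   # e.g. flows with soft QoS needs
    return deterministic_route(header)  # e.g. best-effort flows
```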
B. Priority Based Communication (soft guarantee)

The main router characteristic supporting priorities is the duplication of physical channels (each channel being a bi-directional link), as Figure 1 illustrates. Physical channel replication is an increasingly employed approach [44][45][46]. It is an attractive alternative to virtual channels, due to its smaller area overhead and increased bandwidth. Both approaches imply area overheads, related to input buffer control and crossbar size. Virtual channels have an extra cost due to physical channel multiplexing (TDM), which makes their area overhead larger than that of physical channel replication, considering the same number of virtual and physical channels per direction [33]. Furthermore, the aggregated router bandwidth is proportional to the physical channel replication factor, whilst virtual channels do not increase router bandwidth. If the amount of storage resources on the router is comparable, both approaches offer similar power efficiency [45]. Here, support for soft QoS relies on fixed priorities [47]. Two prioritized traffic classes are distinguished inside the NoC: (i) high priority packets and (ii) low priority packets. One physical channel (channels with suffix 0 in Figure 1) is reserved to transmit exclusively high priority packets, whereas the other one (channels
with suffix 1 in Figure 1) may transmit both packet classes. Sharing one of the two physical channels between traffic classes increases support for high priority traffic, because two high priority flows can be transmitted simultaneously in the same direction. Priorities offer only soft guaranteed service to high priority traffic with regard to latency and bandwidth. When two or more high priority flows compete for common paths inside the NoC, QoS may be affected due to collisions between high priority packets. In fact, NoCs employing priority mechanisms to provide some QoS tend to perform like BE NoCs as the amount of high priority traffic increases [48]. This mechanism may be the answer for applications tolerating some deadline misses. Similar to flow oriented routing, the physical channel allocation proposed here uses a priority bit, defined in the header of each packet belonging to a flow. During packet transmission, the priority bit is verified at each router in the path and channel allocation is performed. This priority bit is set by the PE before injecting each packet in the NoC.
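From the PE's perspective, both the routing selection bit (Section III.A) and the priority bit are just header bits set before injection. A sketch of such an encoding follows; the bit positions and the 14-bit target field are assumptions for illustration, not the paper's actual header layout:

```python
ROUTING_BIT = 15    # assumed: 1 = adaptive routing, 0 = deterministic
PRIORITY_BIT = 14   # assumed: 1 = high priority (may use channel 0)

def make_header(target, adaptive=False, high_priority=False):
    """Build a header flit carrying the target address and the two service bits."""
    header = target & 0x3FFF            # assumed 14-bit target field
    if adaptive:
        header |= 1 << ROUTING_BIT
    if high_priority:
        header |= 1 << PRIORITY_BIT
    return header
```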
C. Connection Oriented Communication (hard guarantee)

Hard QoS support employs circuit switching [47]. A connection is established between a source/target pair and the resources in the path stay allocated throughout the entire duration of the communication. In the proposed implementation, this switching mode coexists with packet switching, i.e. the NoC simultaneously supports both modes. All connections are unidirectional and established/released by the source through control packets transmitted by packet switching. The approach requires minimal changes in the router architecture and presents low area overhead. This simplicity derives from the underlying use of packet switching. A control packet is sent in packet switching mode to the target, allocating resources in the path. At the end of the communication, resources are released by sending another control packet. The approach is similar to the PCC (Packet Connected Circuit) implemented in the SoCBUS NoC [49]. Resources are reserved for worst-case situations, which allows applications to achieve maximum throughput without any interference from other communications. Bounds for performance parameters such as latency, jitter and throughput are fully guaranteed. Since worst-case allocation can be a drawback when the QoS application throughput is low and/or when paths are blocked for a long time, connections can only take place in channels with suffix 0. As the NoC combines circuit and packet switching modes, one physical channel is always available for packet switching. Router local ports (Local0 and Local1) are allocated depending on the switching mode. Local0 is used for connection-oriented communication (when packet switching is not being used), whereas Local1 is used for packet switched transmission. An IP is able to simultaneously send packets through both local ports. In this way, IPs can communicate with other IPs using Local1, while keeping an active connection on Local0.
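The PE-side lifecycle of such a connection can be modeled as below. This is a behavioral sketch; the control packet format and the names ConnectionPort, connect and disconnect are ours, not the actual API:

```python
CTRL_OPEN, CTRL_CLOSE = "OPEN", "CLOSE"

class ConnectionPort:
    """Models the Local0 port: control packets travel packet-switched,
    data then flows over the reserved circuit-switched path."""

    def __init__(self, inject):
        self.inject = inject      # inject(packet) models injection into the NoC
        self.target = None

    def connect(self, target):
        self.inject((CTRL_OPEN, target, None))   # reserves path resources
        self.target = target

    def send(self, payload):
        assert self.target is not None, "no active connection"
        self.inject(("DATA", self.target, payload))  # no per-hop arbitration

    def disconnect(self):
        self.inject((CTRL_CLOSE, self.target, None))  # releases path resources
        self.target = None
```

Meanwhile, packet-switched traffic can keep using Local1, so a PE may hold a connection open and still exchange best-effort messages.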
D. Multicast

Collective communication services such as multicast and broadcast are basic operations in several parallel applications, and their unoptimized execution can significantly increase the communication load. Example applications include search algorithms, graph algorithms, and matrix operations (e.g. inversion and blocked multiplication). In addition, multicast is part of system-wide actions like management and network configuration, synchronization, and cache coherence protocols. Although bus-based interconnection architectures are not scalable and provide little support for parallel transactions, they naturally support collective communication services (e.g. broadcast). In NoCs
and specifically in 2D mesh topologies, the efficient implementation of multicast depends on specific algorithms. Commonly, multicast support is implemented in a non-scalable way, sending individual message copies to each target (called unicast–based multicast) [27] [50] [51]. This solution often increases network traffic load and energy consumption. Several multicast algorithms have been proposed for multicomputers. Since they are designed considering 2D mesh networks, its implementation in 2D mesh NoCs does not incur difficulties [52][53]. This work adopts the implementation of the dual-path multicast algorithm to support a collective communication service [43]. The dual-path algorithm establishes the number of message copies produced to develop any multicast communication to a maximum of two, independent on the number and position of targets. The dual-path multicast algorithm is based on the Hamiltonian routing algorithm. The first step in this algorithm is to divide the target set labels in two subsets, one containing labels larger than the source label and the other containing labels smaller than the source label. The second step consists in sorting the label subsets. The higher subset is sorted in ascending order and the lower one is sorted in descending order. Two NoC packets are created, each one wrapping a copy of the multicast message. The sorted label subsets are added to the header of each respective packet. In the last step, each packet is sent to the corresponding subset of nodes. If the message targets are all higher or lower than the source, only one packet is sent. To support multicast messages, the NoC packet header is extended to include any number of targets, as Figure 3 shows. Each flit in the packet header indicates a message target, and each has a 1– bit field setting it as target. The last target flit is set indicating it is at the end of header. The packet header size depends on the number of message targets. 
Figure 4 illustrates an example of the two packets wrapping a multicast message addressed to the targets {1, 3, 5, 8, 10, 12, 14}, considering 6 as the message source.
Figure 3 – NoC packet header with extended header. Each header flit indicates a target; the payload follows the last target flit.
Figure 4 – Multicast message wrapped in two packets. (a) packet sent to the higher subset (targets > 6); (b) packet sent to the lower subset (targets < 6).
The message reaches the targets in the order specified in the packet header. When the message reaches a target, it allocates the local port and removes the corresponding target flit from the packet header. Then, the Hamiltonian routing algorithm is executed again, based on the next target, allocating a port towards it (North, South, East or West). At each target, the message is forwarded simultaneously to both allocated ports. Figure 5 illustrates the multicast routing process in a 4x4 mesh, considering the packets presented in Figure 4. The figures present only the packet headers, abstracting the attached payloads.
Figure 5 – (a) message transmitted to the larger label subset {8,10,12,14}; (b) message transmitted to the smaller label subset {1,3,5}.

IV. CASE STUDY MPSOCS
This Section presents two MPSoC platforms used as case studies for evaluating the mechanisms described in Section III. Both platforms employ homogeneous multiprocessing and modified versions of the Hermes NoC [18] and the Plasma processor [54]. In addition, both platforms use a message passing programming model and a NORMA memory architecture. The main differences between the platforms are the PE architecture and the kernel (OS) implementation, which was modified to include the communication API. The original Hermes NoC was modified to implement the mechanisms presented in Section III, supporting differentiated communication services. The Plasma processor [54] implements a simple and compact RISC architecture, with an instruction set compatible with the MIPS-I architecture. To keep the processor as small as possible, the Plasma implementation has only 3 pipeline stages, no cache, no Memory Management Unit (MMU), and no memory protection support. Typical applications running in MPSoCs, such as multimedia and networking, are composed of several communicating tasks. Accordingly, in both platforms applications are represented as task graphs, where vertices represent tasks and edges represent the communication (data dependence) between pairs of tasks. Figure 6 shows a graph representing a synthetic application composed of nine tasks.

Figure 6 – Task graph of a synthetic application with 9 tasks.

As applications are represented using task graphs, which explicitly expose inter-task communications, the message passing protocol suits them better than shared memory. Although communication through shared memory is widely used in bus-based systems (NUMA), in NoC-based MPSoCs this method would require one of the following organizations: (i) allowing a given PE to access the memory space of other PEs; (ii) a dedicated shared memory block attached to some router, resulting in a communication hotspot due to the number of accesses to the same NoC region; (iii) several dedicated shared memory blocks distributed over the MPSoC. Several academic NoC-based MPSoCs adopt MPI-like protocols to enable communication [17][27], while industrial NoC-based MPSoCs adopt shared memory communication (such as the TI OMAP platform [55]). The message passing programming model allows a more efficient exploration of inter-task communications.

A. HeMPS
HeMPS [56] (Hermes Multiprocessor System) is a parameterizable MPSoC. Instances of HeMPS are created by interconnecting PEs to NoC routers. Each HeMPS PE, called Plasma-IP, contains a Plasma processor, a private memory (RAM), a network interface (NI), and a DMA module. Besides communication and computation components, the system also has an external memory, which acts as a task repository, storing the object code of all executable tasks. Figure 7 illustrates a 2x3 HeMPS instance.

Figure 7 – A 2x3 HeMPS instance block diagram, showing the modified Hermes NoC, supporting differentiated communication services.

The system contains a single master processor (Plasma-IP MP), responsible for managing resources, allocating tasks and interacting with the external world. This is the only processor with access to the task repository. When HeMPS starts execution, the Plasma-IP MP allocates application tasks to slave processors. The system supports static and dynamic mapping; the latter is fired on demand at runtime. Each slave processor (Plasma-IP SL) runs a small proprietary OS kernel, which enables multitasking and message passing. The original C-based HeMPS API provides two MPI-like communication primitives to allow inter-task message passing through the NoC: Send() and Receive(). Both primitives provide BE communication services. Their prototypes are:

Send (Message *msg, int target_task);
Receive (Message *msg, int source_task);

Here, Message *msg points to a message structure in the task memory; int target_task is the target task identifier; and int source_task is the source task identifier.

B. HS-Scale
The original HS-Scale [17] is an MPSoC that aims at runtime auto-adaptability by means of task migration services. HS-Scale is formed by an array of homogeneous NPUs (Network Processing Units). Each NPU wraps a Plasma processor, a Hermes router, a private memory, a network interface and a few peripherals (including a timer, an interrupt controller and a UART). Figure 8 presents the block diagram of a 4x4 HS-Scale instance and the details of one NPU.

Figure 8 – A 4x4 HS-Scale instance and the NPU internal organization.
Incoming messages are read from the network interface (NI) and stored in the target task FIFOs (implemented in software). A target task has one FIFO for each source task it receives data from. Outgoing messages are written to the NI, which is responsible for packetizing and injecting the packet into the NoC. Each NPU has multitasking capabilities, enabling time-sliced execution of multiple tasks. A preemptive multitasking OS kernel running on each NPU supports this feature. This kernel provides usual operating system services and structures such as queues, threads, semaphores and mutexes. The message passing programming model is also supported by the kernel, which provides two MPI-like communication primitives to allow inter-task message passing through the NoC: MPISend() and MPIRcv(). As in HeMPS, the communication primitives provide only BE services. Their prototypes are:

MPISend (int target, int fifo, void *data, int size);
MPIRcv (int fifo, void *data, int size);

For MPISend(), int target is the target task identifier; int fifo is the FIFO where the target task must store received messages; void *data points to the message to be sent; and int size is the message size, in bytes. For MPIRcv(), int fifo is the input FIFO to be read; void *data points to the location where to store the message; and int size is the message size, in bytes.
V. COMMUNICATION SERVICES AT THE SOFTWARE LEVEL To support differentiated communication services at higher abstraction levels, the four mechanisms presented in Section III were exposed through the communication API, implemented in the kernel of both MPSoCs. All four mechanisms were implemented in the HeMPS communication API, whereas the HS-Scale platform includes only connections and priorities. The respective OS kernels link the mechanisms and the communication API, bridging the gap between the network and application levels in NoC-based MPSoCs. To keep the system structured in layers (network, operating system and application), the kernels implement the API communication primitives as a set of system calls. Using these high-level primitives, a programmer can develop applications that achieve their communication requirements. This approach gives runtime autonomy to applications, allowing them to control NoC resources in a distributed way. The communication primitives are called in the application code with high-level parameters, and their processing generates specific NoC packets and control bits. Inside the kernel, device drivers controlling the DMA and NI modules, which inject/receive packets into/from the NoC, support the API primitives. The packet headers contain control bits related to each of the aforementioned mechanisms, which are read at each router and handled accordingly. The OS kernel can only offer differentiated communication services to the applications if it has access to the mechanisms implemented at the network level. If the OS kernel abstracts away the mechanisms offered by the NoC, it only injects data into the network without any guarantees. Without such mechanisms, optimizations in the OS kernel can offer differentiated communication services only to applications running on the same processor.
A. Communication service with differentiated routing (HeMPS) Flow oriented routing supports this service. In HeMPS, a new parameter, namely priority, was added to the Send() primitive. Through this parameter, the application programmer is able to specify the version of the routing algorithm used to route the message: setting the priority parameter to HIGH selects the adaptive version of the Hamiltonian routing algorithm, and setting it to LOW selects the deterministic version. During processing of the Send() primitive, the priority parameter value is used to set the routing selection bit in the header of the packet. An important issue when using adaptive routing algorithms is out-of-order message delivery. The Hermes NoC does not implement any mechanism to ensure message ordering, but in the HeMPS platform message ordering is ensured by software, using the read request protocol implemented in the kernel [56]. In this protocol the source task locally stores the message to be transmitted in the kernel memory area and then continues execution (non-blocking send). The target task requests a message from the source task through a read request control packet (blocking read), and if a message is available it is transmitted. Therefore, according to this protocol, the source transmits a new message n if and only if message n-1 was already read by the target. In this way, it is not possible to have two or more messages being simultaneously transmitted from a source to a target, which is the precondition for out-of-order message delivery.
B. Communication service based on priorities (HeMPS/HS-Scale) This service is supported by resource allocation based on priorities and was included in the HeMPS and HS-Scale communication APIs. The priority parameter, already added to the HeMPS Send() primitive, was also added to the HS-Scale MPISend() primitive:

/* Message transmission with priority for HeMPS */
void Send(Message *msg, int target, int priority);

/* Message transmission with priority for HS-Scale */
void MPISend(int target, int fifo, void *data, int size, int priority);
Through this parameter, an application programmer is able to set message priorities. During processing of Send()/MPISend() primitives, the priority parameter value is used to set the priority bit in the header of the packet. The priority bit may receive the values HIGH or LOW. In the case of HeMPS, the priority parameter is also used to choose the routing algorithm. When priority is set to HIGH the packet is transmitted with high priority and routed adaptively.
C. Communication service based on connections (HeMPS/HS-Scale) This service is supported by the circuit switching mechanism and was included in both the HeMPS and HS-Scale communication APIs. To allow the establishment and release of connections, two new primitives were created: (i) Connect() and (ii) Close():

/* Connection establishment */
void Connect(int target);

/* Connection release */
void Close();
Before starting a connection-based communication, the Connect() primitive must be called by the source task, specifying the target task. This primitive sends a control packet to the PE where the target task is allocated, reserving resources in the path (along channel 0). The kernel performs the target task address resolution. Once the connection is established, the source task sends messages using the Send()/MPISend() primitives, setting the priority parameter to GT (Guaranteed Throughput). The GT value indicates that the message must be sent through the connection (local port Local 0). It is also possible to send messages in packet switching mode (local port
Local1) even if a connection is established, by setting the priority parameter to HIGH or LOW. At the end of the communication, the source task releases the connection by calling the Close() primitive. This primitive sends a BE control packet, which releases the reserved resources. The GT connection is established at the software level. Each PE can handle one GT connection; therefore, only one task per PE can establish a connection at a time. This connection stays open until the Close() primitive is called, even if the task owning the connection is not scheduled. If another task in the same PE tries to establish a connection, it stays blocked until the connection has been released. This limitation allows one task per PE to communicate with hard QoS guarantees, while the remaining tasks communicate with soft QoS guarantees. Nonetheless, it should be clear that intermediate routers may support multiple connections simultaneously, e.g. North→South and West→East.
D. Collective communication service (HeMPS) This service is supported by the dual-path multicast algorithm. To provide collective communication, a new primitive, namely Multicast(), was included in the HeMPS communication API. This primitive allows a communication involving a source task and several target tasks with a single call. The target tasks receive a multicast message through the Receive() primitive, in the same way as unicast messages. The multicast primitive prototype is:

Multicast (Message *msg, int *target_list, int targets);

Here, Message *msg is the message to be sent; int *target_list is the set of message targets; and int targets is the number of targets. The Multicast() primitive is executed by the kernel and includes the resolution of the target tasks' addresses, the creation of one or two multicast packets (see Section III.D), and address sorting before injecting the multicast packet(s) into the network. The bubble sort algorithm is used to sort the subsets. Since this sorting algorithm has O(n2) complexity, the Multicast() primitive processing time increases with the number of targets. Besides user applications, in HeMPS multicast messages are also used by the Plasma-IP MP OS kernel to spread control messages, such as the location of newly allocated tasks and notifications of task deallocation.

VI. RESULTS
The results were obtained from cycle-accurate simulations, using the Modelsim simulator. Both platforms are fully described in synthesizable VHDL RTL. For fast simulation, a cycle-accurate SystemC description of the Plasma processor is employed. The evaluation employs real and synthetic applications described in C. Inter-task communication is explicitly described in the C code using communication primitives from the API presented in the previous Section. The memory footprint of the HeMPS kernel, where the proposed API is fully implemented, increased from 15 to 22 Kbytes. Despite the 46.6% increase, the final memory footprint can still be considered small for embedded processors. Differentiated routing and collective communication services are evaluated in the HeMPS MPSoC. Priorities and connection-based services are evaluated in the HS-Scale MPSoC. This approach, using different MPSoCs, demonstrates the effectiveness of the hardware exposure through the API implemented in different kernels.

A. Communication service with differentiated routing
This evaluation was conducted on a 5x5 HeMPS instance. The target application is an MJPEG decoder divided into 9 tasks (Start, IVLC1, IVLC2, IVLC3, IVLC4, IVLC5, IQUANT, IDCT and Print), which communicate in a pipeline fashion. The Start task continuously sends compressed blocks to task IVLC1, which starts decoding. Due to the execution time of the CPU-bound software tasks, the used link bandwidth is very small, namely less than 3%. The MJPEG task mapping, along with the flow spatial distribution, is illustrated in Figure 9, considering the deterministic version of the Hamiltonian routing algorithm. The Plasma-IPs connected to routers 3 to 7 execute only system debug, generating messages to the master processor, which corresponds to disturbing traffic. This scenario characterizes a hot spot region around the Plasma-IP MP, which disturbs the communication between tasks IVLC2 and IVLC3.

Figure 9 – MJPEG mapping and flow spatial distribution.
The MJPEG performance (latency, throughput and jitter) was evaluated in three routing scenarios:
1. Deterministic – all flows routed deterministically.
2. Adaptive – all flows routed adaptively.
3. Flow oriented – only MJPEG flows routed adaptively, disturbing flows routed deterministically.
Table 1 presents the average packet latency and throughput results for all MJPEG flows considering the three routing scenarios. Latency results are in clock cycles and throughput corresponds to the used percentage of the total link bandwidth. Total time corresponds to the time spent to finish the application execution, in clock cycles. The total execution time without disturbing traffic (MJPEG running alone) is 2,051,939 clock cycles. The last line presents the execution time overhead with regard to MJPEG running alone in the platform.

Table 1 – Latency and throughput results for MJPEG running on the HeMPS platform. Reference time (MJPEG alone): 2,051,939 cycles.

Flow                 | Deterministic       | Adaptive            | Flow Oriented
                     | Latency   Thr.      | Latency   Thr.      | Latency   Thr.
Start → IVLC1        | 477.57    2.37%     | 479.6     2.64%     | 477.4     2.7%
IVLC1 → IVLC2        | 349.5     1.23%     | 349.4     1.28%     | 349.5     1.41%
IVLC2 → IVLC3        | 1621.2    1.23%     | 1668.66   1.28%     | 377       1.41%
IVLC3 → IVLC4        | 350       1.23%     | 349.9     1.28%     | 349       1.41%
IVLC4 → IVLC5        | 349.5     1.23%     | 349.4     1.28%     | 350       1.41%
IVLC5 → IQUANT       | 349       1.23%     | 349       1.28%     | 349       1.41%
IQUANT → IDCT        | 349.5     1.23%     | 349.5     1.28%     | 349.5     1.41%
IDCT → PRINT         | 350       1.23%     | 350       1.28%     | 350       1.41%
Total time (cycles)  | 3,932,709           | 2,209,991           | 2,078,633
Exec. time overhead  | 91.66%              | 7.7%                | 1.3%
Comparing deterministic and adaptive routing, the average packet latency of flow IVLC2→IVLC3 is similar, while the total execution times are quite different. In the deterministic scenario, the high competition for the router 7 North port leads to contention in IVLC2. Therefore, its packets traverse the hot spot only after some debug tasks finish, which is why execution takes long even though the average latency is similar to adaptive routing. When only the MJPEG flows are routed adaptively (flow oriented columns), the flow IVLC2→IVLC3 is able to avoid the hot spot by following the non-minimal path {2, 7, 8, 11, 18, 21, 22}. This non-minimal path is free of contention, since the disturbing flows follow the minimal path toward the master processor (12). In this scenario, the flow oriented routing allows only the flow IVLC2→IVLC3 to follow the non-minimal path, whereas the disturbing flows compete for the minimal one. Note that even using a non-minimal path, the average latency was reduced, since packets were able to avoid congested areas that could incur packet stalling. Thanks to flow oriented routing, the execution time overhead is only 1.3%. These results show that system composability is increased by flow oriented routing, since application performance is close to that obtained when the application runs alone in the platform. Figure 10 presents the application jitter (latency variation). The application jitter is minimal when only the MJPEG flows are routed adaptively, because the adaptive routing found an alternative path, free of contention, around the hot spot. When all flows are routed in the same way, a remarkable packet latency variation can be observed.
Figure 10 – MJPEG jitter in clock cycles.

The good results obtained in the hot spot scenario obviously capitalize on the existence of uncongested areas available to the adaptive routing algorithm. A quite different situation arises when using traffic distributions where the load is evenly distributed over the network (e.g. complement), without hot spot regions. In experiments conducted in this situation, the three routing scenarios presented similar results. Such experiments highlight the benefits of differentiating flows when hot spots are identifiable during application execution. Meanwhile, they also expose the limitations of flow oriented routing when the search space for non-congested regions is restricted, as is the case for complement traffic distributions.

B. Communication service based on priorities and connections
This evaluation was conducted on a 4x4 HS-Scale instance. In this Section, the communication services based on priorities and connections are evaluated in a scenario mixing real and synthetic applications. The real application is a synchronized audio/video decoder composed of 7 tasks (Figure 11) that requires specific timing constraints. The video decoding pipeline is executed by an MJPEG decoder split into 3 tasks (MJ1, MJ2 and MJ3), and the audio decoding pipeline is composed of an ADPCM decoder (AD) and a finite impulse response filter (FIR). An initial task called SPLIT demultiplexes the compressed audio/video streams and sends them to the respective decoder pipelines, whereas the task JOIN synchronizes the decoded streams. The required minimum throughput is 30 frames/s for video and 32,000 audio samples/s, with image-level audio/video synchronization. The synthetic applications employ BE communication. These are dummy tasks executing memory accesses or accessing some output device. Memory and output devices are emulated by software tasks.

Figure 11 – Audio/video decoder task graph.

The mapping shown in Figure 12(a) was manually defined to be an optimal mapping, i.e., communicating tasks are at a 1-hop distance in all cases. Considering this optimal mapping, the measured throughput at the output of the JOIN task was 31.13 frames/s, used as the reference value. Soft QoS is used in this experiment. Recall that in dynamic systems new applications are frequently allocated and removed, which may result in a dispersion of the available resources, the so-called fragmentation. Therefore, an optimal mapping is rarely achievable at runtime, unless a remapping of currently running applications is acceptable. Newly allocated applications commonly share system resources with already allocated ones. Figure 12(b) shows an application mapping representative of such a situation, i.e. a non-optimal mapping reflecting inter-application mapping interference. Four new applications were added, each with two tasks: T1MEM, T2MEM, T3OUT and T4OUT. Dashed lines show the audio/video decoder data flows and solid lines show the new application flows.
Figure 12 – Mappings for the audio/video decoder application: (a) optimal mapping; (b) non-optimal mapping.
Table 2 presents the measured audio/video decoder throughput varying the number of high priority flows, considering the mapping presented in Figure 12(b), for six different scenarios. When only the audio/video decoder flows have high priority (scenario S1), the obtained throughput is 1.6% smaller than the reference, but still able to guarantee the application requirement (30 frames/s). The priorities mechanism isolates the real time application flows from the BE flows, avoiding interference. Scenarios S2 to S5 make explicit the drawback of the priorities mechanism, leading to throughput degradation as the number of high priority flows increases. Throughput reduces by 52% when all application flows
have high priority (scenario S5). In this situation, the NoC works in BE mode and the high router bandwidth is not enough to ensure the application requirements. Starting from this scenario, where priorities no longer ensure the minimum application requirements, GT connections are employed. Since the video decoder generates traffic rates higher than the audio one, the former was chosen to use GT connections. This constitutes scenario S6, where the video decoder tasks communicate through GT connections whereas all other tasks use high priority. Note that the GT connections extended the guarantees offered by the platform, and the audio/video decoder throughput increased by 3.5% with regard to the reference. The results obtained in scenarios S1 and S6 show the effectiveness of software-level management of QoS mechanisms to achieve composability in scenarios with multiple application flows competing for NoC resources.

Table 2 – Throughput results for 6 different scenarios corresponding to the mapping presented in Figure 12(b). GT stands for Guaranteed Throughput.

Scenario | High Priority               | Low Priority  | GT Connection | Throughput
S1       | Audio, Video                | T1,T2,T3,T4   | -             | 30.62 frames/s
S2       | Audio, Video, T4            | T1,T2,T3      | -             | 23.8 frames/s
S3       | Audio, Video, T3, T4        | T1,T2         | -             | 16.76 frames/s
S4       | Audio, Video, T1, T3, T4    | T2            | -             | 16.08 frames/s
S5       | Audio, Video, T1,T2,T3,T4   | -             | -             | 14.81 frames/s
S6       | Audio, T1,T2,T3,T4          | -             | Video         | 32.24 frames/s

Figure 13 shows the video decoder pipeline jitter. The X-axis represents the time interval between decoded blocks arriving at the JOIN task and the Y-axis represents the number of decoded blocks arriving within each time interval.

Figure 13 – Jitter for scenarios (a) guaranteeing and (b) not guaranteeing the application requirement.

Figure 13(a) presents the jitter for the scenarios of (i) optimal mapping; (ii) high priority for audio/video tasks (scenario S1); and (iii) GT connections for video tasks (scenario S6). In these three scenarios, most of the decoded blocks arrive at the JOIN task within the intervals (average and standard deviation): 191 ± 55, 192 ± 59, and 189 ± 57 Kclock cycles for optimal mapping, S1 and S6, respectively. Besides, the experiments display equivalently distributed throughput values. The similarity between the three plots shows that the proposed mechanisms efficiently isolate the QoS application, ensuring equivalent latency and jitter in different scenarios, even with disturbing traffic. Figure 13(b) shows the jitter for scenarios S2 to S5. The decoded blocks arrive at the JOIN task within the intervals 254 ± 212, 416 ± 1042, 366 ± 1119, and 382 ± 1132 Kclock cycles, respectively. The disturbing traffic flows generated by tasks T1, T2, T3 and T4 increase the average time between blocks, as well as the jitter, being responsible for the throughput degradation observed in Table 2.

C. Collective communication service
This evaluation was conducted on a 5x5 HeMPS instance. The goal is to compare the dual-path multicast algorithm with a multicast implementation based on multiple unicasts, created by a software loop of Send() primitives. The target application is a set of simple producer/consumers. The producer sends a multicast message to several consumers and waits for the consumers' acknowledgements. After receiving the message, each consumer acknowledges the reception. The experiments evaluate: (i) t_send, the time the producer spends to process message sending; (ii) t_ack, the time for the producer to send the messages and receive all acknowledgements. The producer is mapped at the center of the NoC, and the consumers are equally distributed around it. The first experiment evaluates multicast as a function of message size, with a fixed number of targets, namely 23. Figure 14 presents the results, where DP corresponds to dual-path multicast and Uni corresponds to multiple unicasts. For short messages, of up to 128 bytes, both approaches have similar performance, due to the complexity of the Multicast() primitive with regard to the Send() primitive. It is important to observe the small impact on dual-path performance when varying the message size.

Figure 14 – Multicast as a function of the message size.
The second experiment evaluates multicast as a function of the number of targets, considering three message sizes. Figure 15 corroborates the previous statement that both multicast approaches perform similarly when messages are small, independently of the number of targets. However, the dual-path multicast generates lower traffic, resulting in smaller energy consumption with regard to the multiple unicasts approach [57]. Figure 16 and Figure 17 show that, from medium-size messages upward, the dual-path multicast algorithm reduces the total time to deliver the multicast messages compared to multiple unicasts. At first glance, it seems obvious that dual-path multicast performs better than multiple unicasts. Nevertheless, as the experiments showed, the preparation of message copies may have a significant impact on overall dual-path multicast performance. In the current implementation, message copy preparation is fully performed in software by the kernel. The authors believe that a hardware multicast implementation could present better results than the multiple unicasts approach even for small messages.
Figure 15 – Multicast as a function of the number of targets. Small message size (32 bytes).
Figure 16 – Multicast as a function of the number of targets. Medium message size (128 bytes).
Figure 17 – Multicast as a function of the number of targets. Large message size (512 bytes).
VII. CONCLUSIONS AND FUTURE WORK
NoCs are becoming commonplace in the design of complex MPSoCs for several reasons, including the orthogonalization of computation and communication. This work agrees with this orthogonalization, but demonstrates that communication still needs to be tightly controlled to ensure the overall efficiency of computation. The significant enhancement of overall communication bandwidth brought about by NoCs does not by itself guarantee the fulfillment of application requirements. Effects such as packet collisions and concurrency for communication resources may dynamically change NoC packet delivery characteristics. The results described here show how access by applications to low-level resource allocation mechanisms (flow oriented routing, priorities and connections) can be exploited to achieve isolation between traffic classes such as BE and GS. The dual-path multicast algorithm is not directly related to GS, but its use reduces the overall NoC traffic, freeing resources so that other flows may meet their performance requirements. In addition, the use of an API that exposes the communication services to the software level, together with the modification of NoCs to support the API, has the benefit of largely obviating the design of network interfaces: control of the mechanisms is embedded in the NoC packet headers and access to it is left to software, so the developed network interfaces are simple communication protocol adapters.
The results described here investigated each proposed communication service separately. Ongoing work analyzes the possible interactions between multiple services used simultaneously; no further development is necessary, since the API already supports all services. One open issue in flow oriented routing is congestion detection: even with non-minimal routing, packets may be sent to congested regions. Research on distributed approaches, with processing elements responsible for monitoring a given region of the MPSoC, could be a starting point to solve this issue. In addition, as future work, runtime adaptation of the communication services in use can be achieved by integrating the proposed API with system and NoC performance monitoring.
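At the application level, such service selection can be imagined as a single send primitive parameterized by the desired service. The sketch below is hypothetical: all names, signatures and the enumeration are illustrative assumptions, not the paper's actual API, and the stub merely records the requested service where a real kernel would encode it into the NoC packet header.

```c
#include <assert.h>

/* Hypothetical application-level view of the NoC communication
 * services (illustrative names, not the paper's API). */
typedef enum {
    SVC_BEST_EFFORT,    /* plain best-effort packets                 */
    SVC_FLOW_ROUTING,   /* flow oriented (non-minimal) routing       */
    SVC_PRIORITY,       /* high-priority arbitration                 */
    SVC_CONNECTION      /* guaranteed-throughput connection          */
} service_t;

static service_t last_service;  /* stub state, for illustration only */

static int noc_send(int target_task, const void *buf, int size,
                    service_t svc)
{
    (void)target_task; (void)buf; (void)size;
    last_service = svc;  /* real kernel: write svc into the header */
    return 0;
}

/* A soft real-time task (e.g. a video decoder) isolates itself from
 * best-effort traffic by selecting the priority service: */
static void send_frame(int join_task, const void *frame, int size)
{
    noc_send(join_task, frame, size, SVC_PRIORITY);
}
```

The design point this illustrates is the one argued above: the application chooses a service per message, and the kernel, not a dedicated network interface, translates that choice into packet-header fields.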
REFERENCES
[1] Wolf, W.; Jerraya, A.; Martin, G. "Multiprocessor System-on-Chip (MPSoC) Technology". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, v.27(10), 2008, pp. 1701-1713.
[2] Benini, L.; Micheli, G. "Networks on chips: a new SoC paradigm". Computer, v.35(1), 2002, pp. 70-78.
[3] Dally, W. J.; Towles, B. "Route packets, not wires: on-chip interconnection networks". In: DAC, 2001, pp. 684-689.
[4] Guerrier, P.; Greiner, A. "A generic architecture for on-chip packet-switched interconnections". In: DATE, 2000, pp. 250-256.
[5] Pasricha, S.; Dutt, N. "On-Chip Communication Architectures – System on Chip Interconnect". Morgan Kaufmann Publishers, 2008, 522p.
[6] International Technology Roadmap for Semiconductors. "2009 Update System Drivers". Available at: http://www.itrs.net/Links/2010ITRS/Home2010.htm.
[7] Coppola, M.; Grammatikakis, M.; Locatelli, R.; Maruccia, G.; Pieralisi, L. "Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC". CRC Press, 2008, 288p.
[8] Tilera Corporation. "TILE64TM Processor". Product Brief Description, 2007.
[9] Wentzlaff, D.; Griffin, P.; Hoffmann, H.; Liewei, B.; Edwards, B.; Ramey, C.; Mattina, M.; Chyi-Chang, M.; Brown, J. F.; Agarwal, A. "On-Chip Interconnection Architecture of the Tile Processor". IEEE Micro, v.27(5), 2007, pp. 15-31.
[10] Kumar, A.; Hansson, A.; Huisken, J.; Corporaal, H. "An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor Systems on Chip". In: DATE, 2007, pp. 117-122.
[11] "SiliconHive". Available at: http://www.silicon-hive.com, 2011.
[12] Goossens, K.; Dielissen, J.; Radulescu, A. "Æthereal Network-on-Chip: Concepts, Architectures, and Implementations". IEEE Design & Test of Computers, v.22(5), 2005, pp. 414-421.
[13] Murilo, J. "HW-SW Components for Parallel Embedded Computing on NoC-Based MPSoCs". PhD Thesis, Universitat Autònoma de Barcelona, 2009, 218p.
[14] Bertozzi, D.; Benini, L. "Xpipes: A Network-on-Chip Architecture for Gigascale Systems-on-Chip". Circuits and Systems Magazine, v.4(2), 2004, pp. 18-31.
[15] Singh, A. K.; Kumar, A.; Srikanthan, T.; Ha, Y. "Mapping Real-life Applications on Run-time Reconfigurable NoC-based MPSoC on FPGA". In: FPT, 2010, pp. 365-368.
[16] Yang, Z. J.; Kumar, A.; Ha, Y. "An area-efficient dynamically reconfigurable spatial division multiplexing network-on-chip with static throughput guarantee". In: FPT, 2010, pp. 389-392.
[17] Almeida, G. M.; Sassatelli, G.; Benoit, P.; Saint-Jean, N.; Varyani, S.; Torres, L.; Robert, M. "An Adaptive Message Passing MPSoC Framework". International Journal of Reconfigurable Computing, vol. 1, 2009, 20p.
[18] Moraes, F.; Calazans, N.; Mello, A.; Moller, L.; Ost, L. "HERMES: an Infrastructure for Low Area Overhead Packet-switching Networks on Chip". Integration, the VLSI Journal, v.38(1), 2004, pp. 69-93.
[19] Yang, S.; Furber, S. B.; Plana, L. A. "Adaptive Admission Control on the SpiNNaker MPSoC". In: SOCC, 2009, pp. 243-246.
[20] Ahmad, B.; Erdogan, A. T.; Khawam, S. "Architecture of a Dynamically Reconfigurable NoC for Adaptive Reconfigurable MPSoC". In: AHS, 2006, pp. 405-411.
[21] Obermaisser, R.; Kopetz, H.; Paukovits, C. "A Cross-Domain Multiprocessor System-on-a-Chip for Embedded Real-Time Systems". IEEE Transactions on Industrial Informatics, v.6(4), 2010, pp. 548-567.
[22] Onieva, E.; Pelta, D. A.; Alonso, J.; Milanés, V.; Pérez, J. "A Modular Parametric Architecture for the TORCS Racing Engine". In: CIG, 2009, pp. 256-262.
[23] Motakis, A.; Kornaros, G.; Coppola, M. "Dynamic Resource Management in Modern Multicore SoCs by Exposing NoC Services". In: ReCoSoC, 2011, 6p.
[24] Silva, E.; Barcelos, D.; Wagner, F.; Pereira, C. "A Virtual Platform for Multiprocessor Real-Time Embedded Systems". In: JTRES, 2008, 6p.
[25] Zeferino, C. A.; Susin, A. A. "SoCIN: a Parametric and Scalable Network-on-Chip". In: SBCCI, 2003, pp. 169-174.
[26] Bollella, G.; Brosgol, B.; Furr, S.; Hardin, D.; Dibble, P.; Gosling, J.; Turnbull, M. "The Real-Time Specification for Java". www.rtsj.org, 2010.
[27] Joven, J.; Font-Bach, O.; Castells-Rufas, D.; Martinez, R.; Teres, L.; Carrabina, J. "xENoC – An eXperimental Network-on-Chip Environment for Parallel Distributed Computing on NoC-based MPSoC Architectures". In: PDP, 2008, pp. 141-148.
[28] Kumar, A.; Mesman, B.; Theelen, B.; Corporaal, H.; Ha, Y. "Analyzing Composability of Applications on MPSoC Platforms". Journal of Systems Architecture, v.54(3-4), 2008, pp. 369-383.
[29] Radulescu, A.; Dielissen, J.; Goossens, K.; Rijpkema, E.; Wielage, P. "An Efficient On-Chip Network Interface Offering Guaranteed Services, Shared-Memory Abstraction, and Flexible Network Configuration". In: DATE, 2004, pp. 878-883.
[30] Duato, J.; Yalamanchili, S.; Ni, L. "Interconnection Networks: An Engineering Approach". Morgan Kaufmann Publishers, 2003, 600p.
[31] Salminen, E.; Kulmala, A.; Hämäläinen, T. "Survey of Network-on-chip Proposals". White Paper, OCP-IP, 2008.
[32] Agarwal, A.; Iskander, C.; Shankar, R. "Survey of Network on Chip (NoC) Architectures & Contributions". Journal of Engineering, Computing and Architecture, v.2(1), 2009.
[33] Carara, E.; Calazans, N.; Moraes, F. "A New Router Architecture for High-Performance Intrachip Networks". Journal of Integrated Circuits and Systems, v.3(1), 2008, pp. 23-31.
[34] Nilsson, E.; Millberg, M.; Oberg, J.; Jantsch, A. "Load distribution with the proximity congestion awareness in a network on chip". In: DATE, 2003, pp. 1126-1127.
[35] Hu, J.; Marculescu, R. "DyAD – Smart Routing for Networks-on-Chip". In: DAC, 2004, pp. 260-263.
[36] Ye, T.; Benini, L.; Micheli, G. "Packetization and routing analysis of on-chip multiprocessor networks". Journal of Systems Architecture, v.50(2-3), 2004, pp. 81-104.
[37] Li, M.; Zeng, Q.; Jone, W. "DyXY – A Proximity Congestion-Aware Deadlock-Free Dynamic Routing Method for Networks on Chip". In: DAC, 2006, pp. 849-852.
[38] Sobhani, A.; Daneshtalab, M.; Neishaburi, M.; Mottaghi, M.; Afzali-Kusha, A.; Fatemi, O.; Navabi, Z. "Dynamic Routing Algorithm for Avoiding Hot Spots in On-chip Networks". In: DTIS, 2006, pp. 179-183.
[39] Carara, E.; Moraes, F. "Flow oriented routing for NoCs". In: SOCC, 2010, pp. 367-370.
[40] Ascia, G.; Catania, V.; Palesi, M.; Patti, D. "Implementation and Analysis of a New Selection Strategy for Adaptive Routing in Networks-on-Chip". IEEE Transactions on Computers, v.57(6), 2008, pp. 809-820.
[41] Chiu, G. "The Odd-Even Turn Model for Adaptive Routing". IEEE Transactions on Parallel and Distributed Systems, v.11(7), 2000, pp. 729-738.
[42] Glass, C. J.; Ni, L. M. "The Turn Model for Adaptive Routing". Journal of the Association for Computing Machinery, v.41(5), 1994, pp. 874-902.
[43] Lin, X.; McKinley, P. K.; Ni, L. M. "Deadlock-free Multicast Wormhole Routing in 2-D Mesh Multicomputers". IEEE Transactions on Parallel and Distributed Systems, v.5(8), 1994, pp. 793-804.
[44] Gilabert, F.; Gómez, M. E.; Medardoni, S.; Bertozzi, D. "Improved Utilization of NoC Channel Bandwidth by Switch Replication for Cost-Effective Multi-Processor Systems-on-Chip". In: NOCS, 2010, pp. 165-172.
[45] Yoon, Y. J.; Concer, N.; Petracca, M.; Carloni, L. "Virtual Channels vs. Multiple Physical Networks: A Comparative Analysis". In: DAC, 2010, pp. 162-165.
[46] Kakoee, M. R.; Bertacco, V.; Benini, L. "ReliNoC: A Reliable Network for Priority-Based On-Chip Communication". In: DATE, 2011, pp. 491-496.
[47] Carara, E.; Almeida, G.; Sassatelli, G.; Moraes, F. "Achieving composability in NoC-based MPSoCs through QoS management at software level". In: DATE, 2011, pp. 407-412.
[48] Mello, A.; Tedesco, L.; Calazans, N.; Moraes, F. "Evaluation of Current QoS Mechanisms in Networks on Chip". In: SOC, 2006.
[49] Wiklund, D.; Liu, D. "SoCBUS: Switched Network on Chip for Hard Real Time Embedded Systems". In: IPDPS, 2003.
[50] Fernandez-Alonso, E.; Castells-Rufas, D.; Risueno, S.; Carrabina, J.; Joven, J. "A NoC-based Multi-{soft}core with 16 cores". In: ICECS, 2010, pp. 259-262.
[51] Fu, F.; Sun, S.; Hu, X.; Song, J.; Wang, J.; Yu, M. "MMPI: A Flexible and Efficient Multiprocessor Message Passing Interface for NoC-Based MPSoC". In: SOCC, 2010, pp. 359-362.
[52] Carara, E.; Moraes, F. "Deadlock-Free Multicast Routing Algorithm for Wormhole-Switched Mesh Networks-on-Chip". In: ISVLSI, 2008, pp. 341-346.
[53] Daneshtalab, M.; Ebrahimi, M.; Xu, T. C.; Liljeberg, P.; Tenhunen, H. "A generic adaptive path-based routing method for MPSoCs". Journal of Systems Architecture, v.57(1), 2011, pp. 109-120.
[54] Rhoads, S. "Plasma - most MIPS-I(TM) opcodes". Available at: http://www.opencores.org/project,plasma.
[55] Texas Instruments. "Wireless Terminals Solutions Guide". Available at: http://focus.ti.com/pdfs/vf/wireless/ti_wireless_solutions_guide.pdf, 2012.
[56] Carara, E.; Oliveira, R.; Calazans, N.; Moraes, F. "HeMPS – A Framework for NoC-Based MPSoC Generation". In: ISCAS, 2009, pp. 1345-1348.
[57] Chaves, T. M.; Carara, E. A.; Moraes, F. G. "Energy-Efficient Cache Coherence Protocol for NoC-based MPSoCs". In: SBCCI, 2011, pp. 215-220.
AUTHORS' BIOGRAPHIES
Everton Alceu Carara received the B.S., M.Sc. and Ph.D. degrees in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Porto Alegre, Brazil, in 2005, 2008, and 2011, respectively. He is currently an Associate Professor at the Department of Electronics and Computer Science of the Federal University of Santa Maria (UFSM). His main research interests include systems on chip (SoCs), focusing on multiprocessor systems on chip (MPSoCs) and networks on chip (NoCs).
Ney Laert Vilar Calazans received a PhD degree in Microelectronics in 1993 from the Université Catholique de Louvain (UCL), Belgium, and CS M.Sc. and EE bachelor degrees from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 1988 and 1985, respectively. He is currently a Professor at the Pontifical Catholic University of Rio Grande do Sul (PUCRS). His research interests include intrachip communication networks, non-synchronous circuit design and implementation, and computer-aided design techniques and tools. Professor Calazans is a member of the IEEE, the Brazilian Computer Society (SBC), and the Brazilian Society of Microelectronics (SBMICRO).
Fernando Gehm Moraes received the Electrical Engineering and M.Sc. degrees from the Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1987 and 1990, respectively. In 1994 he received the Ph.D. degree from the Laboratoire d'Informatique, Robotique et Microélectronique de Montpellier (LIRMM), France. He is currently a Professor at the Pontifical Catholic University of Rio Grande do Sul (PUCRS). He has authored and co-authored 16 peer-reviewed journal articles in the field of VLSI design, comprising the development of networks on chip and telecommunication circuits. He has also authored and co-authored more than 150 conference papers on these topics. His research interests include intrachip communication networks (NoCs) and MPSoC design. Professor Moraes is a member of the IEEE, the Brazilian Computer Society (SBC), and the Brazilian Society of Microelectronics (SBMICRO).