ocMPI - An efficient-lightweight communication stack for NoC-based MPSoC architectures
Jaume Joven, David Castells-Rufas, Jordi Carrabina
Universitat Autònoma de Barcelona
{jaume.joven, david.castells, jordi.carrabina}@uab.es

Abstract
In the emerging Network-on-Chip-based (NoC-based) Multiprocessor-System-on-Chip (MPSoC) environments, an efficient mapping of custom programming models and languages is a key challenge to exploit the HW resources while avoiding performance and power consumption degradation. The main target of this work is to demonstrate that a custom message passing programming model is a feasible way to program future homogeneous or heterogeneous MPSoCs. Thus, in this work we describe the set of components we built to create the whole software stack, starting from the low-level middleware routines and their hardware support, up to the high-level message passing microtask communication library, which is our on-chip Message Passing Interface (ocMPI). This work proposes a HW-SW approach to show how to improve the communication interfaces according to the SW libraries included in our MPSoC platforms. Finally, we execute and analyze several low-level synthetic benchmarks and parallel applications on a real NoC-based MPSoC prototyping platform in order to show the practical viability of our SW stack.

1. Introduction
Nowadays, IC manufacturing technology provides us with a few billion transistors on a single chip. The design of future homogeneous (e.g. many-core) or heterogeneous MPSoCs [1] (e.g. embedded SoCs for mobile applications) introduces complex problems: the communication architecture, the HW/SW interfaces and their interoperability through high-level programming models. The communication problem, mainly caused by deep-sub-micron (DSM) effects such as the lack of global synchrony in large chips, and also by scalability concerns, has been solved by adopting the Network-on-Chip (NoC) paradigm. Thus, in order to communicate all processing elements efficiently, a GALS (Globally Asynchronous Locally Synchronous) design style has

been adopted. In addition, NoCs have better modularity and design predictability compared to bus-based systems. For all these reasons, the Network-on-Chip (NoC) paradigm [2][3] has emerged as the solution for designing scalable communication architectures for Systems-on-Chip (SoCs), and obviously for large MPSoCs. In heterogeneous NoC-based MPSoCs, mixing standard processors with different kinds of processing units may be essential to perform custom functionalities (e.g. DSPs or VLIWs, GPUs, sets of ASIPs or custom hardware accelerators), so the interoperability and the HW/SW interfaces [4] between HW and SW components can also become a key challenge. Therefore, in order to compile and program SW applications effortlessly on the new generation of multiprocessor systems, two parallel programming models have been introduced and standardized by the software communities according to the memory organization used for high-level programming: shared memory (e.g. OpenMP [5]) and message passing (e.g. MPI [6][7]). Both programming models and API libraries abstract the hardware and software interface components and let the designer execute concurrent or parallel applications on a multiprocessor system. Because of the requirements of the next generation of NoC-based MPSoCs, where heterogeneity and highly interoperable flows will dominate the traffic (e.g. video and audio streams), we believe that a custom mapping of a message passing programming model onto NoC-based MPSoCs is the best well-known standard solution to face the programmability of these embedded systems. In addition, using ocMPI it is easy to achieve fine-grained (i.e. data partition), coarse-grained (i.e. task level) and streaming (i.e. scatter, gather) parallelism, offering at the same time better system scalability when the number of IP cores increases, due to the intrinsic distributed nature of these systems. Thus, because the mapping of these programming models onto these fully-constrained systems (i.e. memory, performance and real-time, resource availability) is not straightforward, in this paper we face a complete design

of a message passing stack by following a bottom-up approach (from a SW-HW point of view), in order to design an efficient embedded message passing SW communication stack according to the underlying architecture. An additional target is to let software designers program the future multi- or many-core MPSoC architectures in the same way as in large traditional cluster systems, turning HPC (High-Performance Computing) into HPEC (High-Performance Embedded Computing).

2. Related work
Some non-standard and domain-specific programming model initiatives have been designed for MPSoCs during the last years. In [8] Forsell presents a PRAM (Parallel Random Access Machine) paradigm using multithreaded processors on the Eclipse NoC architecture. In [9] the MultiFlex MPSoC programming environment focuses on a distributed system object component (DSOC) model based on message passing and a symmetrical multi-processing (SMP) model on shared-memory systems. Other domain-specific programming model initiatives reviewed in [9] are MESCAL and NP-Click. Another relevant work related to HW/SW interfaces and programming models is the conceptual point of view presented in [4]. A similar approach to this work can be found in [10][12], but limited by the scalability of the bus-based system or by the interconnect and the number of PEs of the Cell processor. Recently, an interesting work on mapping MPI onto embedded systems was presented in [13], but it lacks interoperability since the HW-SW interfaces are not presented, and it includes some MPI functions which are not relevant in the domain of NoC-based MPSoC systems. Thus, the main contribution of this work is to show that a minimal, efficient and lightweight custom implementation of MPI (i.e. ocMPI), targeted specially at the emerging NoC-based MPSoCs, is a feasible way to exploit the HW capabilities efficiently. At the same time, it boosts the development of parallel SW in embedded systems by using a microtask communication message passing scheme. To achieve these targets we put special emphasis on optimizing the low-level middleware routines, making the message passing communication infrastructure more efficient by designing low-latency HW/SW accelerator or DMA-like support interfaces in these resource-constrained systems. Therefore, through this approach (i.e. the ocMPI SW API) we are able to program effortlessly and

efficiently homogeneous or heterogeneous NoC-based MPSoC systems by using a well-known high-level programming model.

3. System architecture: NoC-based MPSoC platform
Essentially, our SW stack can be put on top of any NoC-based MPSoC platform generated by the existing EDA tools (e.g. Xpipes/SUNMAP [17][16], MAIA [19] or NoCGEN [18]), only adapting the HW/SW interfaces (i.e. NICs). Despite this fact, as shown in Figure 1, throughout this work we use our own low-latency buffer-less circuit switching NoC with 2D-Mesh topology, which uses a 4-phase handshake flow control protocol. Each tile of this NoC is an independent NIOS-II based sub-system, i.e. the soft-core processor, the Avalon bus with all required peripherals (including those used to debug and profile the communication between HW/SW interfaces), the scratchpad memory and the Network Interface Controller (NIC). Finally, the whole SW communication stack (i.e. the middleware and drivers and the ocMPI) is loaded on top of each NIOS-II soft-core processor together with an optional uKernel (e.g. eCos RTOS).

Figure 1. NoC-based MPSoC and on-chip network micro-architecture

The 2D-Mesh circuit switching router architecture is simple and is based on the block diagram shown in Figure 1. Basically, the router consists of two basic blocks for each input/output communication channel.

These are the XY Routing block and the PathSwitchMatrix, which is in charge of arbitrating possible requests and establishing/releasing the output channel. The flit latency to traverse the router is 2 clock cycles. Thanks to the fact that the RTL Verilog code of our NoC architecture is highly parameterized, we can effortlessly build two different instances, a 2x2 and a 3x3 Mesh NoC-based MPSoC, as our testbench platforms.
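Although the routing block is implemented in RTL Verilog, its behaviour can be summarized by the classic dimension-ordered routing decision sketched below in C; the port names and coordinate encoding are illustrative assumptions, not the actual RTL interface.

/* Behavioural sketch of dimension-ordered XY routing, as performed by the
 * XY Routing block of each router (illustrative only). */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } out_port_t;

out_port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;   /* first move along X ...        */
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;  /* ... then along Y              */
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;                     /* arrived: eject to the tile    */
}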

3.1. HW/SW interface: the Network Interface Controller
It is quite common in NoC-based systems to combine packet switched or circuit switched architectures with a shared medium (usually on-chip busses) within the tiles, and also custom HW core tiles with SW ones. Thus, in order to adapt all these communication architectures and HW/SW interfaces, as done in computer networks, it is usual to embed a Network Interface in every tile. The main task of this component is adaptation, decoupling computation elements from communication ones. From the HW point of view the NIC acts as a data translation element between routers and tiles (to pack/unpack flits), but from the software point of view it manages the injection rate from the SW stack. For these reasons, the design of this HW/SW interface is one of the key challenges to exploit efficiently all the hardware capabilities from the middleware and the high-level SW APIs, hiding architecture dependent aspects. Next, as shown in Figures 2 and 3, we illustrate two ways of designing the NIC inside a NIOS-II based tile (Figure 1): (i) as a Bus-Based NIC or (ii) as a Custom Instruction NIC (CI-based NIC). In (i) the NIC is attached directly to the on-chip bus as a master-slave DMA-like module, whereas in (ii) the NIC acts as a NIOS-II custom instruction inside the processor datapath. Both NIC implementations (described in Verilog HDL) work similarly, using local bus-side polling with IRQ notification, and therefore no overhead is added onto the network during the process. As shown in Figures 2 and 3, both NICs contain a TX_REGISTER and an RX_REGISTER that can be parameterized with different FIFO depths, a 32-bit STATUS_REGISTER and CTRL_REGISTER, and finally two FSMs (i.e. txFSM and rxFSM) to perform the 4-phase handshake flow control protocol.

Figure 2. Bus-based Network Interface Controller

Figure 3. Custom Instruction Network Interface Controller

Furthermore, to ease NIC re-use, the bus-based NIC slave interface has been designed to follow the VSIA-VCI [20] rules for compatibility with other on-chip busses (e.g. AMBA, Avalon, CoreConnect, etc.), which lets us communicate with many different types of tiles. Obviously, this fact allows us to implement more heterogeneous NoC-based MPSoCs. On the other hand, the CI-based NIC is especially interesting because it puts into practice the abstract concept of the NoC assembly language [21]. Moreover, a priori our CI-based NIC implementation shows many advantages in terms of lower decoding and bus arbitration latency overhead than the bus-based NIC, and depending on the system we will get better injection rates using it.
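As an illustration of how software drives the bus-based NIC, the following C sketch polls the STATUS_REGISTER before pushing a flit into the TX_REGISTER. The register names come from the description above, but the base address, register offsets and status bits are assumptions made only for illustration.

#include <stdint.h>

/* Hypothetical memory map of the bus-based NIC slave interface. */
#define NIC_BASE            0x80001000u          /* assumed Avalon base address  */
#define NIC_TX_REGISTER     (NIC_BASE + 0x0)     /* flit to inject               */
#define NIC_RX_REGISTER     (NIC_BASE + 0x4)     /* received flit                */
#define NIC_STATUS_REGISTER (NIC_BASE + 0x8)     /* 32-bit status                */
#define NIC_CTRL_REGISTER   (NIC_BASE + 0xC)     /* 32-bit control               */
#define NIC_STATUS_TX_READY (1u << 0)            /* assumed "TX FIFO not full"   */
#define NIC_STATUS_RX_VALID (1u << 1)            /* assumed "RX FIFO not empty"  */

static inline uint32_t nic_read(uint32_t addr)            { return *(volatile uint32_t *)addr; }
static inline void     nic_write(uint32_t addr, uint32_t v) { *(volatile uint32_t *)addr = v; }

/* Bus-side polling as described in the text: wait until the txFSM can accept
 * a flit, then push it into the TX FIFO. No traffic is added onto the NoC. */
static void nic_put_flit(uint32_t flit)
{
    while (!(nic_read(NIC_STATUS_REGISTER) & NIC_STATUS_TX_READY))
        ;                                        /* local polling only */
    nic_write(NIC_TX_REGISTER, flit);
}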

4. Software stack for message passing
This section describes all the software components needed to map the message passing programming model onto heterogeneous NoC-based MPSoC architectures, allowing us to program parallel applications effortlessly through a high-level SW API. Basically, our lightweight SW stack is composed of the low-level middleware routines and our custom implementation of the standard message passing interface for microtask communication in on-chip environments (i.e. ocMPI). To avoid large memory footprints and in order to achieve maximum runtime performance, we designed both components using ANSI C code.

4.1. Middleware for the Network Interface Controller
The low-level middleware interacts with the NIC HW components, marshalling and unmarshalling all data messages from the upper SW layers into the NoC communication channel. Thus, we designed high-performance optimized middleware routines for both NIC implementations. The most important are:
• int nicStatus(void) – checks the NIC status register. It returns the status as a 32-bit integer.
• int nicSend(byte* buffer, int length, int address) – blocking send used to transmit an array of bytes. It returns 1 on success, otherwise it returns < 0.
• int nicRecv(byte* buffer, int length) – blocking receive used to obtain an array of bytes. It returns a positive number according to the number of bytes read, otherwise it returns < 0.
Cooperatively, other control routines related to the circuit switching scheme (e.g. openChannel, closeChannel) and/or specific data exchanging routines for particular datatypes can be included in this low-level library (e.g. nicSendChar(), nicSendInt()).
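A usage sketch of these middleware routines for a simple one-way transfer between two tiles is shown below. The nicSend()/nicRecv() prototypes are the ones listed above; the openChannel()/closeChannel() signatures and the destination tile address are assumptions.

typedef unsigned char byte;

int nicStatus(void);
int nicSend(byte *buffer, int length, int address);
int nicRecv(byte *buffer, int length);
int openChannel(int address);     /* assumed: reserve a path to the tile */
int closeChannel(int address);    /* assumed: release the reserved path  */

#define DST_TILE 3                /* illustrative destination tile address */

void producer(void)
{
    byte msg[16] = "hello from NIC";
    openChannel(DST_TILE);                        /* circuit switching setup */
    if (nicSend(msg, sizeof(msg), DST_TILE) < 0) {
        /* transmission error: retry or report */
    }
    closeChannel(DST_TILE);
}

void consumer(void)
{
    byte buf[16];
    int n = nicRecv(buf, sizeof(buf));            /* blocking receive        */
    (void)n;  /* n > 0: number of bytes actually read, n < 0: error          */
}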

4.2. Programming Model: A lightweight On-chip Message Passing Interface (ocMPI)
In this section, we explain the software design used to create our lightweight on-chip message passing interface (ocMPI). Two main phases have been performed, following a bottom-up approach: (i) we select a minimal working subset of standard MPI functions. This task is necessary because MPI contains more than a hundred functions, most of them not useful in NoC architectures. (ii) We make an accurate porting process (shown in Figure 4) by changing the lower layers of each selected function, i.e. overwriting the low-level layers in order to exchange data through the NIC to the on-chip network, and vice versa.

Figure 4. Layer mapping between standard MPI and our on-chip environment (ocMPI)

Thus, for the porting process we start from scratch, taking as a starting point the source code of the Open MPI initiative [7] and MPICH [6] (concretely MPI-2). In addition, we take into account all the previous work and concepts explained in [11]. Thus, we can face the mapping by using an incremental bottom-up approach with many refinement phases. Mainly, in order to design our lightweight ocMPI we follow the next considerations:
• All MPI data structures have been simplified in order to get a reduced memory footprint. The main structures that have been changed are the communicator, the data types and the operations.
• No user-defined data types and operations are supported in the basic library. Only the basic MPI datatypes and operations are defined (e.g. MPI_INT, MPI_DOUBLE, MPI_CHAR, or MPI_MAX, MPI_SUM, MPI_AND, MPI_LXOR, and so on).
• The support for virtual topologies has been removed completely, since real application-specific or custom NoC designs can be modeled by using the tools presented in [14][15][16][17].
• No Fortran bindings and no MPI I/O functions have been included.
• All ocMPI functions follow the prototypes defined in standard MPI-2 in order to keep portability.
Table 1 shows the 19 standard MPI functions ported to our NoC-based MPSoC platform.

Table 1. Classification of supported functions in our ocMPI library
Management: MPI_Init, MPI_Finalize, MPI_Finalized, MPI_Initialized, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, MPI_Get_version
Profiling: MPI_Wtick, MPI_Wtime
Point-to-point communication: MPI_Send, MPI_Recv, MPI_SendRecv
Collective communication: MPI_Broadcast, MPI_Gather, MPI_Scatter, MPI_Barrier, MPI_Reduce, MPI_Scan

Each ocMPI message has the following envelope: (i) source rank, (ii) destination rank, (iii) message tag, (iv) packet datatype, (v) communicator, (vi) payload length, and finally (vii) the payload data.
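The envelope can be pictured as the following C structure. The field widths and their packing are assumptions (the text only fixes the seven fields), and ocmpi_send_sketch() is a hypothetical illustration of how a point-to-point send could be layered on top of the nicSend() middleware routine, assuming the rank maps directly to the tile address.

#include <stdint.h>
#include <string.h>

/* Illustrative layout of the ocMPI message envelope; widths are assumptions. */
typedef struct {
    uint16_t src_rank;        /* (i)   source rank                  */
    uint16_t dst_rank;        /* (ii)  destination rank             */
    uint16_t tag;             /* (iii) message tag                  */
    uint16_t datatype;        /* (iv)  packet datatype              */
    uint16_t communicator;    /* (v)   communicator                 */
    uint16_t payload_length;  /* (vi)  payload length (in bytes)    */
    /* (vii) the payload data follows the envelope                  */
} ocmpi_envelope_t;

typedef unsigned char byte;
int nicSend(byte *buffer, int length, int address);   /* middleware routine */

#define OCMPI_MAX_PAYLOAD 256                          /* assumed limit */

/* Hypothetical sketch: build the envelope, append the payload and push the
 * whole frame through the NIC middleware (rank == tile address assumed). */
int ocmpi_send_sketch(const void *payload, uint16_t len, uint16_t dst,
                      uint16_t tag, uint16_t datatype, uint16_t comm,
                      uint16_t my_rank)
{
    byte frame[sizeof(ocmpi_envelope_t) + OCMPI_MAX_PAYLOAD];
    ocmpi_envelope_t env = { my_rank, dst, tag, datatype, comm, len };

    if (len > OCMPI_MAX_PAYLOAD)
        return -1;
    memcpy(frame, &env, sizeof(env));
    memcpy(frame + sizeof(env), payload, len);
    return nicSend(frame, (int)(sizeof(env) + len), dst);
}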

4.3. SW stack configurations
Depending on the application requirements we can select different SW stack configurations. Figure 5 lists four typical configurations of the ocMPI library and their stripped code size in bytes. The minimal working subset of our ocMPI library contains the NIC middleware routines and the following six functions:
• ocMPI_Init() & ocMPI_Finalize(): initialize/end the ocMPI software library.
• ocMPI_Comm_size(): gets the number of all concurrent processes that run over the NoC-based MPSoC.
• ocMPI_Comm_rank(): gets the rank of a process in the ocMPI software library.





• ocMPI_Send() & ocMPI_Recv(): blocking send/receive functions.
The size of this basic library is only 4,942 bytes.

Figure 5. SW stack configurations (ocMPI library) and their code size in bytes. The four configurations are: Basic stack; Basic stack + Management; Basic stack + Profiling; Basic stack + Profiling + Adv. communication; each is broken down into NIC Middleware, ocMPI Management, ocMPI Profiling, Basic ocMPI and ocMPI Adv. Communication.

Another interesting SW configuration is the basic stack plus the ocMPI profiling functions. As one can see in Figure 5, these profiling functions take a large code size because both must include many NIOS-II HAL API function calls and libraries to access the required peripherals (i.e. the timers and/or the performance counter). Thus, this is an important point to be optimized in the future by creating our custom API and peripherals. Finally, Figure 5 shows that a complete ocMPI library with all the advanced collective communication primitives only takes 12,970 bytes. It is important to remark that all results shown in Figure 5 have been extracted using the nios2-elf-gcc compiler with -O2 optimization.
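As an illustration of the basic stack, the following sketch uses only the six functions of the minimal subset. The MPI-2-style prototypes follow from the statement that all ocMPI functions keep the standard MPI-2 prototypes, but the header name and the ocMPI_COMM_WORLD/ocMPI_INT constants are assumptions.

#include "ocmpi.h"          /* assumed header name */

int main(int argc, char *argv[])
{
    int rank, size, token = 42;

    ocMPI_Init(&argc, &argv);
    ocMPI_Comm_size(ocMPI_COMM_WORLD, &size);
    ocMPI_Comm_rank(ocMPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 sends a token to every other tile */
        for (int dst = 1; dst < size; dst++)
            ocMPI_Send(&token, 1, ocMPI_INT, dst, /*tag=*/0, ocMPI_COMM_WORLD);
    } else {
        /* passing 0 as the status argument is assumed to be accepted */
        ocMPI_Recv(&token, 1, ocMPI_INT, /*src=*/0, /*tag=*/0,
                   ocMPI_COMM_WORLD, /*status=*/0);
    }

    ocMPI_Finalize();
    return 0;
}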

5. Evaluation of the ocMPI programming model on our NoC-based MPSoC
This section shows the evaluation and profiling of all the SW layers of our lightweight message passing stack. Thus, in this section we characterize our NIC middleware routines and our ocMPI library by executing several benchmarks in order to extract the most important communication metrics.

5.1. Evaluation of the middleware routines
In order to profile our middleware routines and evaluate them in terms of bandwidth and injection rate, we run COMMS (a low-level synthetic benchmark included in PARKBENCH [22]) on our NoC-based MPSoC. Thus, unidirectional and ping-pong traffic patterns have been injected through both NIC implementations and their middleware routines as fast as possible from each NIOS-II processor. The results show that our CI-based NIC, integrated in the processor datapath, is faster than our bus-based NIC. On the other side, the custom HW core can achieve the maximum bandwidth because no SW overhead is added to exchange data through the NoC. Looking at the results, the use of the CI-based NIC and its middleware routines is clearly the best choice to communicate any NIOS-II based tiles in our NoC-based MPSoC. It offers around 80 Mbps of bandwidth and an injection rate close to 1 flit every 40 cycles under unidirectional traffic patterns at only 100 MHz. Despite this fact, in both NIC implementations the injection rate (as shown in Figure 6) is degraded according to the number of hops, due to the end-to-end flow control protocol of our circuit switching NoC. However, this degradation is not dramatic when the number of hops increases, and therefore the system remains scalable. The zero latency to send 1 byte using the BB-NIC and the CI-NIC is 1 us and 890 ns, respectively. The achieved results are doubled under bidirectional traffic benchmarks since the NIC-to-Router and Router-to-Router communication channels are full duplex.

Figure 6. Injection rate (flits/cycle) of each NIC implementation on NIOS-II based tiles vs. the number of hops (1 to 4), for the bus-based and CI-based NICs under unidirectional and ping-pong traffic
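The COMMS-style measurements above can be reproduced with a simple ping-pong loop over the middleware routines, as sketched below. read_cycles() stands for any cycle counter peripheral (e.g. a timer or the performance counter) and, like the peer address and message size, is a placeholder.

typedef unsigned char byte;
int nicSend(byte *buffer, int length, int address);
int nicRecv(byte *buffer, int length);
unsigned int read_cycles(void);            /* hypothetical cycle counter */

#define PEER    1                          /* illustrative peer tile     */
#define MSG_LEN 64
#define REPS    100

double pingpong_latency_cycles(int i_am_master)
{
    byte buf[MSG_LEN];
    unsigned int start = read_cycles();

    for (int i = 0; i < REPS; i++) {
        if (i_am_master) {                 /* send first, then wait echo */
            nicSend(buf, MSG_LEN, PEER);
            nicRecv(buf, MSG_LEN);
        } else {                           /* echo everything back       */
            nicRecv(buf, MSG_LEN);
            nicSend(buf, MSG_LEN, PEER);
        }
    }
    /* one-way latency = round-trip / 2, averaged over REPS iterations */
    return (read_cycles() - start) / (2.0 * REPS);
}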

5.2. Profiling of the ocMPI library


This section presents the profiling of several ocMPI functions to demonstrate their minimal execution overhead. First, in Figure 7 we present the results acquired by monitoring the execution of each parallel program with external (non-ocMPI) profiling functions. The results show that our ocMPI management functions consume less than 75 cycles, with the exception of ocMPI_Get_processor_name(), which takes around 210 clock cycles.



Figure 7. Profile (execution time in clock cycles) of the management ocMPI functions: ocMPI_Initialized, ocMPI_Finalize, ocMPI_Finalized, ocMPI_Comm_size, ocMPI_Comm_rank, ocMPI_Get_processor_name, ocMPI_Get_version, ocMPI_Wtick and ocMPI_Wtime

Other important primitives to be profiled are ocMPI_Init() and ocMPI_Barrier(). In both cases their execution depends on the underlying network because of their synchronization nature. Thus, we obtained different execution times by running hundreds of calls to these functions on 2x2 and 3x3 Mesh-based multi-core embedded systems. The results are presented in Table 2.

Table 2. Profiling of ocMPI_Init() and ocMPI_Barrier() in NoC-based MPSoCs
ocMPI function      2x2 Mesh        3x3 Mesh
ocMPI_Init()        496 clocks      1196 clocks
ocMPI_Barrier()     8638 clocks     19538 clocks

In order to monitor the latency of our main ocMPI high-level communication functions (i.e. ocMPI_Send/ocMPI_Recv and ocMPI_Bcast), we executed the COMMS benchmark several times, increasing the packet size from 4 bytes up to 1 MB. Figure 8 shows the average latency extracted from these tests. For packets smaller than 64 KB the latency does not increase exponentially, but when the packet exceeds this threshold it does, due to the limited on-chip resources related to the scratchpad memory. The zero latency is 4 us for ocMPI_Send(), which means around 4 times more overhead than the nicSend() routine, because of the overhead of sending/receiving the ocMPI packet envelope and of the ocMPI function parameter checking. On the other side, for ocMPI_Bcast() the zero latency is 14 us and 35 us in our 2x2 and 3x3 multi-core systems, respectively.

Figure 8. Profile (latency in us vs. packet size in bytes) of the ocMPI communication functions: ocMPI_Send (unidirectional and ping-pong) and ocMPI_Bcast (2x2 and 3x3)
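A latency measurement like the one in Figure 8 can also be taken at the application level with the ported profiling functions, as in the sketch below; the prototypes follow standard MPI-2, while the header name and the ocMPI_CHAR/ocMPI_COMM_WORLD constants are assumptions.

#include "ocmpi.h"   /* assumed header name */

/* Time a single blocking send with the ported wall-clock routine. */
double profile_send(void *buf, int count, int dst)
{
    double t0 = ocMPI_Wtime();
    ocMPI_Send(buf, count, ocMPI_CHAR, dst, /*tag=*/0, ocMPI_COMM_WORLD);
    return ocMPI_Wtime() - t0;   /* elapsed time in seconds */
}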

6. Design of parallel applications through our ocMPI library
In order to verify our ocMPI library, a great variety of basic experiments were performed on our NoC-based MPSoC architectures, creating synthetic traffic patterns. However, because ocMPI is intended for the design of high-level message passing SW applications, next we show the results of two well-known parallel applications.
1. Parallel calculation of pi: this pi calculation algorithm is based on the following approximation:

$\frac{\pi}{4} = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1}$

2.

Parallel generation of the Mandelbrot set: its generation is based on the parallel calculation of a set of points in the complex plane.

Both applications have been parallelized through ocMPI. In addition, the source code of both applications has been compiled and executed without any change with respect to the source code used in a traditional supercomputing system. Because both applications have a high computation-to-communication ratio, and all the calculated data are independent of each other, we get speed-ups close to the number of processors. This is also true because, as we show, the latency of the SW stack is negligible compared to the computational time.
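A sketch of the parallel pi calculation over ocMPI is shown below; it uses the ported ocMPI_Reduce() from Table 1, and as before the ocMPI_DOUBLE, ocMPI_SUM and ocMPI_COMM_WORLD names, as well as the header name, are assumptions.

#include "ocmpi.h"                      /* assumed header name */

int main(int argc, char *argv[])
{
    const long N = 1000000;             /* problem size, as in Test 3 */
    int rank, size;
    double partial = 0.0, pi = 0.0;

    ocMPI_Init(&argc, &argv);
    ocMPI_Comm_rank(ocMPI_COMM_WORLD, &rank);
    ocMPI_Comm_size(ocMPI_COMM_WORLD, &size);

    /* each tile accumulates a strided slice of the series (-1)^n / (2n+1) */
    for (long n = rank; n < N; n += size)
        partial += (n % 2 ? -1.0 : 1.0) / (2.0 * n + 1.0);

    /* rank 0 gathers the global sum; pi = 4 * sum */
    ocMPI_Reduce(&partial, &pi, 1, ocMPI_DOUBLE, ocMPI_SUM, 0, ocMPI_COMM_WORLD);
    if (rank == 0)
        pi *= 4.0;

    ocMPI_Finalize();
    return 0;
}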

Table 3. cpi and Mandelbrot problem size parameters
Test 1: N=10000; QCIF (176x144 pixels) with MAX_ITERATIONS=64
Test 2: N=100000; QCIF (176x144 pixels) with MAX_ITERATIONS=128
Test 3: N=1000000; QCIF (176x144 pixels) with MAX_ITERATIONS=256
Test 4: N=10000000; QCIF (176x144 pixels) with MAX_ITERATIONS=512

As shown in Table 3, for each parallel application we ran four different tests, changing the problem size parameters. Figure 9 shows the time in seconds obtained from ocMPI_Wtime() during the execution of each test. It is easy to observe that using the parallel implementations we can achieve speed-ups close to the number of CPUs. For instance, the serial calculation of pi using "Test 3" takes 40.6 s, whereas the parallel calculation on the 2x2 and 3x3 Mesh takes 10 s and 4.5 s, respectively. On the other side, the Mandelbrot set in "Test 4" takes around 410 s, whereas its parallel generation in our 2x2 or 3x3 Mesh takes 110 s or 46 s, respectively.

Figure 9. Measured time (in seconds) of each parallel application for Tests 1 to 4: cpi_serial, cpi_parallel (2x2), cpi_parallel (3x3), mandelbrot_serial, mandelbrot_parallel (2x2) and mandelbrot_parallel (3x3)

7. Conclusions
We present a complete SW stack (i.e. the middleware routines and our ocMPI library) in order to use effectively all the available computational resources

included in our on-chip platform, by using a high-level message passing programming model. Thus, we customize the standard MPI to map high-level message passing parallel applications onto generic NoC-based MPSoC architectures which include several NIOS-II soft-core processors and/or custom IP blocks. Because the standard MPI is not suitable for on-chip systems, due to its huge memory footprint and high execution latencies, in this work we face an extremely accurate porting process based on the design of a lightweight message passing interface library. This leads to a compact implementation of the standard MPI, minimizing internal data structures and supporting only the capabilities really necessary to communicate on-chip multi-core systems. Besides, it is important to remark that the ocMPI functions do not depend on any micro-kernel or operating system. Because our library has a tiny footprint, we can fit the whole SW stack in the on-chip memory space, allowing fast execution and extremely low latency message passing, and improving the power efficiency. The results obtained in this paper demonstrate four main issues: (i) mixing well-known modern on-chip architectures (i.e. NoC-based MPSoCs) with a traditional standard message passing programming model (i.e. MPI) produces reusable and robust platform-based designs [23]. (ii) Our tiny ocMPI SW library is a custom instance of the message passing standard and it is a viable solution to program NoC-based MPSoC architectures through a high-level programming model. (iii) Because the standard MPI has intrinsic support for conversion and data type management (i.e. big and little endianness and internal representation of data types), our ocMPI is a suitable way to program the future heterogeneous NoC-based MPSoCs. (iv) Our experiments reveal that, by increasing the frequency of our system, we could achieve better bandwidth, reaching almost the performance of traditional supercomputing communication infrastructures. In addition, this study demonstrates that the use of custom HW/SW interfaces (such as our CI-based NIC) and custom instructions improves the message passing performance. In our research work, the use of soft-core processors allows us to extend their Instruction Set Architecture (ISA) by adding new NoC-specific assembly instructions (and their corresponding accelerator hardware in the processor data-path). This fact lets us predict that in the near future all hard-core

processors in NoC-based MPSoC systems should also extend their ISA to perform an efficient message passing protocol. Finally, our ocMPI eases the design of coarse-grained or fine-grained parallel on-chip applications using the explicit parallelism and synchronization inherent in all MPI source code. Moreover, SW designers that program parallel applications using MPI do not need re-education to write and run parallel applications in these NoC-based MPSoC environments. This maximizes the productivity of SW developers and the reuse of parallel applications. Another important advantage of ocMPI is that, thanks to its inherent support for scatter, gather, reduce and scan functions, it lets the SW programmer perform streaming programming on embedded systems. For all of these reasons, the result of this work is a feasible cluster-on-chip environment for high-performance embedded computing.

References
[1] Ahmed Amine Jerraya, Wayne Wolf, "Multiprocessor Systems-on-Chips", The Morgan Kaufmann Series in Systems on Silicon, ISBN: 0-12385251-X, 2005
[2] Axel Jantsch, Hannu Tenhunen, "Networks on Chip", Kluwer Academic Publishers, Hingham, MA, 2003
[3] Benini, L., De Micheli, G., "Networks on chips: a new SoC paradigm", IEEE Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002
[4] Ahmed A. Jerraya, Aimen Bouchhima, Frédéric Pétrot, "Programming models and HW-SW interfaces abstraction for multiprocessor SoC", DAC '06: Proceedings of the 43rd annual conference on Design automation, 2006
[5] OpenMP, available from http://www.openmp.org
[6] The Message Passing Interface (MPI) standard, http://www-unix.mcs.anl.gov/mpi/
[7] Open MPI: Open Source High Performance Computing, available from http://www.openmpi.org/
[8] Forsell, M., "A scalable high-performance computing solution for networks on chips", IEEE Micro, vol. 22, no. 5, pp. 46-55, Sep/Oct 2002
[9] Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, Gabriela Nicolescu, "Parallel programming models for a multiprocessor SoC platform applied to high-speed traffic management", CODES+ISSS '04: Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, 2004
[10] J. A. Williams, I. Syed, J. Wu, and N. W. Bergmann, "A Reconfigurable Cluster-on-Chip Architecture with MPI Communication Layer", in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), California, USA, 2006
[11] T.P. McMahon, A. Skjellum, "eMPI/eMPICH: Embedding MPI", Second MPI Developers Conference, p. 0180, 1996
[12] M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani, "MPI microtask for programming the Cell Broadband Engine processor", IBM Journal of Research and Development, 45, 2006
[13] James Psota, Anant Agarwal, "rMPI: Message Passing on Multicore Processors with On-Chip Interconnect", HiPEAC 2008, pp. 22-37
[14] D. Bertozzi et al., "NoC Synthesis Flow for Customized Domain Specific Multi-Processor Systems-on-Chip", IEEE TPDS, Feb 2005
[15] S. Murali et al., "Designing application-specific networks on chips with floorplan information", in Proc. ICCAD, 2006
[16] Srinivasan Murali, Giovanni De Micheli, "SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs", Proceedings of the 41st Design Automation Conference (DAC'04), pp. 914-919, 2004
[17] Bertozzi, D., Benini, L., "Xpipes: A Network-on-chip Architecture for Gigascale Systems-on-chip", IEEE Circuits and Systems Magazine, 4(2), pp. 18-31, 2004
[18] Chan, J., Parameswaran, S., "NoCGEN: A Template Based Reuse Methodology for Networks on Chip Architecture", Proceedings of the 17th International Conference on VLSI Design, 2004
[19] Luciano Ost, Aline Mello, José Palma, Fernando Moraes and Ney Calazans, "MAIA: a framework for networks on chip generation and verification", ASP-DAC '05: Proceedings of the 2005 conference on Asia South Pacific design automation, pp. 49-52, 2005
[20] Virtual Socket Interface Alliance, Virtual Component Interface Standard, OCB Specification 2, Version 1.0, March 2000, available from http://www.vsi.org
[21] A. Jantsch, "NoCs: A new contract between hardware and software", Proc. of DSD, pp. 10-16, September 2003
[22] Hockney, R., M. Berry, "Public International Benchmarks for Parallel Computers", PARKBENCH Committee: Report-1, February 1994, http://www.netlib.org/parkbench/
[23] Soininen, J.P., Jantsch, A., Forsell, M., Pelkonen, A., Kreku, J., Kumar, S., "Extending platform-based design to network on chip systems", Proceedings of the 16th International Conference on VLSI Design 2003, pp. 401-408, 4-8 Jan. 2003