Customized Network-on-Chip for Message Reduction

Hongwei Wang1, Siyu Lu1, Youhui Zhang1,2,*, Guangwen Yang1, and Weimin Zheng1,2

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Zhejiang
[email protected]

Abstract. This paper proposes a network-on-chip (NoC) design customized for message reduction, which enhances common routers with a special Reduce Processing Unit (RPU) so that reduce computations are completed hop by hop and the transmission paths of reduction messages are learned adaptively. More specifically, for a reduction on a small data set, the corresponding data is transmitted through the NoC directly. Thus, along the transmission path, enhanced routers can complete the reduction in place, which not only speeds up the processing but also coalesces messages. An adaptive method for the deterministic routing algorithm is also introduced that enables these routers to learn transmission paths accurately, further improving processing efficiency. We present the detailed micro-architecture design and evaluate the corresponding performance, power consumption and chip area. Test results show that this design improves reduce / all_reduce performance by 2.67 to 11.76 times, while the power and chip-area overheads are both limited.

Keywords: Network on chip, CMP, message reduction.

1

Introduction

The general trend in processor development has moved from dual- and quad-core processor chips to ones with tens or even hundreds of cores [1][2] that are connected by a Network-on-Chip (NoC). In addition, future chip multiprocessors (CMPs) are expected to have an amount of local on-chip memory assigned to each core [3]. The architectural similarities of such NoC-based CMPs and computer clusters have led to the adoption of the message-passing mechanism for on-chip programming, as in RCCE for Intel's SCC [4] or the Multicore Communication API [5]. MPI is a standardized and portable message-passing system designed to function on a wide variety of parallel computers. It provides a large number of collective operations [6]. MPI_Reduce and MPI_Allreduce are widely used because they simplify the complex task of writing scalable, parallel programs [7].

* Corresponding author.

[8] showed that the execution time of such operations can account for up to 40% of the total execution time of MPI routines. Accordingly, comparatively high performance can be achieved by improving the implementation of message reduction.

A naive implementation of reduction is based on point-to-point primitives [9]: when a reduction operation is initiated, all nodes send their data to the rank root node, which receives the data and handles all the computation. To improve efficiency, two types of optimization can be employed: (1) parallelizing computation and communication, and (2) offloading overheads to specific hardware modules.

Software-based solutions usually belong to the first category. Binomial Tree Reduce (BTR) [10] is an algorithm that takes log2(P) steps (P is the number of processors) to obtain the final result, and it is one of the most efficient software algorithms. Its essential idea is that the associativity of the computation allows parallel reduction; each level of the tree corresponds to one round of the reduction algorithm. From the transmission aspect, it can be regarded as an in-transmission optimization: the original and intermediate reduction operands are transmitted across the given communication paths and combined by intermediate nodes until the root obtains the final result. However, the application-level topology may not correspond to the actual physical positions, which may lead to extra overheads.

The second category, hardware-assisted techniques, is widely used to minimize software overheads. The Sun Clint network [12], Voltaire FCA [13], the Intelligent Network Interface Card (INIC) [14], BlueGene/L [15], reconfigurable hardware for accelerating MPI_Reduce [16] and so on are such solutions in the high-performance computing or reconfigurable computing fields. Some works have provided hardware support in CMP chips, such as [17] and the TILE processor [18]. Others have implemented software optimizations based on specific hardware features, such as [19][20] on Cell and [21] on the SCC.

Some work has used the two types of optimization together. For example, the SCC work [21] completes the reduction step by step on the on-chip MPBs (message-passing buffers) along the transmission paths, while the message processing is still performed in software; [16] on FPGA also adopts some specific computation patterns for the optimization.

Compared with existing work, we propose a network-on-chip (NoC) design customized for message reduction, which enhances common NoC routers to complete reduction hop by hop. Its main characteristic is that all enhancements are completely integrated into the NoC router's architecture; namely, the in-transmission optimization is united with hardware offloading at the network layer.

Considering a CMP architecture connected by a NoC, message packets generated by reduction calls can be transmitted through the NoC directly: they are delivered between nodes, across intermediate routers from the source node to the destination, i.e., hop-by-hop transmission. Under this condition, a node may act as an intermediate communication node for more than one packet. This kind of transmission gives us an opportunity to do some optimization along the transmission path, such as computation and message coalescing. Accordingly, we have enhanced routers so that they can not only complete the reduction computation but also learn the transmission paths of reduction messages adaptively.

Namely, computation and communication are combined. Unlike current in-transmission optimizations that have to construct a communication pattern in advance, our work adapts to the NoC context inherently, because any message packet of a reduction (called a "reduction packet" in this paper for short) will be transmitted by the NoC anyway. To the best of our knowledge, such an approach on NoC has not received much attention yet. One similar piece of research is [22]: it presents a method to provide network-hardware support for broadcast and reduction operations. However, that work is implemented in FPGAs and limited to a specific FPGA interconnection architecture (the BEE3 platform) and topology. With this design, the following contributions have been accomplished:



1. We enhance a common NoC router by integrating a processing unit into the router pipeline, so that along the transmission paths of reduction packets, reductions can be completed by enhanced routers at the network layer.
2. A SW/HW hybrid method is proposed to enable routers to learn transmission paths adaptively. Thus a router on the path knows the number of reduction packets it should process, as well as the direction of each packet, which further improves processing efficiency.
3. We present an optimal layout of enhanced routers in the NoC, which strikes a balance between performance enhancement and extra overhead.

Testing results show that this design improves reduce / all_reduce performance by 2.67 to 11.76 times (up to 11.76 for reduction and 10.2 for all_reduce), while the power and chip-area overheads are both very limited.

2

Related Work

2.1

Software Approaches

[10] uses the Binomial Tree Reduce (BTR) algorithm for reduction. It follows a special traffic pattern, called Recursive Distance Doubling (RDD), which also forms the basis for many collective operations. The idea of Recursive Vector Halving is to split the input vectors in half each round. Based on this mechanism, for long messages, [11] introduced an approach to compute different portions of the result in parallel on all cores.
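For concreteness, the sketch below is a minimal illustration of such a binomial-tree reduction to rank 0 built only on point-to-point primitives; it assumes the process count is a power of two and an ADD reduction on doubles, and the code and names are ours, not taken from [10] or any MPI library.

```cpp
#include <cstddef>
#include <mpi.h>
#include <vector>

// Binomial-tree reduction of a double vector onto rank 0, using only
// point-to-point primitives; it finishes in ceil(log2(P)) rounds.
void binomial_tree_reduce(std::vector<double>& accum, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    std::vector<double> incoming(accum.size());

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            // Forward the partial result to the partner below, then retire.
            MPI_Send(accum.data(), static_cast<int>(accum.size()), MPI_DOUBLE,
                     rank - mask, 0, comm);
            return;
        }
        int partner = rank + mask;
        if (partner < size) {
            MPI_Recv(incoming.data(), static_cast<int>(incoming.size()),
                     MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
            for (std::size_t i = 0; i < accum.size(); ++i)
                accum[i] += incoming[i];   // associative ADD reduction
        }
    }
    // Only rank 0 reaches this point with the complete result in accum.
}
```

Each loop iteration corresponds to one level of the binomial tree, which is why the software baseline used later scales with log2(P).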

2.2

Hardware Approach

Hardware-assisted techniques are widely used to enhance MPI implementations and minimize software overhead. In the high-performance computing field, an FPGA-based implementation of Sun MPI-2 APIs for the Sun Clint network is reported in [12]; the Voltaire FCA [13] goes further, offloading collective communication and operations to the network by adding CPUs to the switches. Similarly, [14] implemented MPI_Reduce in the FPGA fabric of a network interface card. BlueGene/L [15] has a dedicated network to handle collective communications, specifically broadcast and reduction. In [16], the semantics of the MPI_Reduce call have been implemented in the reconfigurable resources of an FPGA device across a cluster of all-FPGA compute nodes.

A recent work on CMPs is [17]: it presents an underlying customized NoC that incorporates buses to achieve high performance for both point-to-point and broadcast data transmission; an MPI engine attached to each core implements basic MPI primitives to relieve the processor core (but offers no reduce-specific support). In addition, the well-known TILE processor [18] supports passing messages between cores without system-software intervention.

One similar work is [22], which enhances the NoC in an FPGA to make it MPI-aware by adding hardware support for broadcast and reduction. The major difference is that this work is limited to a specific FPGA interconnection architecture and topology, and is therefore not as efficient as our design, which enables routers to learn transmission paths adaptively.

3

Proposed Architecture

This section describes the proposed design for accelerating reduction with NoC-level support. There are two kinds of enhanced routers in this work. One of them is referred to as a Class_1 router: a common router integrated with a specific Reduce Processing Unit (RPU), which is responsible for computing and coalescing reduction packets as well as for learning transmission paths. The other is referred to as Class_2, integrated with a simplified RPU that is only capable of learning transmission paths.

3.1

Architecture Overview

Before presenting the detailed design, we outline the basic architecture of the target CMP and some message-passing primitives. There are N*N computing nodes connected by a 2D-mesh NoC; each node is composed of a CPU core with its local cache and an enhanced router (Class_1 or Class_2). A router has five ports: four connect to its neighbor routers through independent network channels (North, East, South, and West) and the last one connects to the local core. A deterministic routing algorithm (Y-X routing) is applied, together with virtual-channel flow control.

Without loss of generality, the L1 cache line is set to 64 bytes and we limit the maximum payload size of a reduction packet to one line. This means that a reduction on a small data set (the result size is not larger than 64 bytes) is directly supported by our design. In addition, the NoC flit size is set to 128 bits, so five flits construct such a maximum packet.

From the message-passing aspect, a core sends out an MPI packet (or more accurately, sends out the in-order flits of the packet) to its local router; the routers then deliver the flits to the destination hop by hop. In a common NoC, routers are MPI-unaware, which means they just deliver packets while any other work is handled by upper-level modules. In contrast, our design can complete the reduction at the network layer; the details are presented in the following subsections.
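As a rough illustration of these assumptions (the constants and function below are ours, not the authors' RTL), the packet sizing and the deterministic Y-X routing decision can be sketched as:

```cpp
#include <cstdint>

constexpr int FLIT_BYTES  = 16;                            // 128-bit flits
constexpr int MAX_PAYLOAD = 64;                            // one L1 cache line
constexpr int MAX_FLITS   = MAX_PAYLOAD / FLIT_BYTES + 1;  // 4 payload flits + 1 head flit = 5

enum class Port : std::uint8_t { North, East, South, West, Local };

// Deterministic Y-X routing on an N x N mesh: move along the Y dimension
// until the destination row is reached, then along X, then eject locally.
// (The sign conventions for "north"/"south" are our assumption.)
Port yx_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (cur_y < dst_y) return Port::South;
    if (cur_y > dst_y) return Port::North;
    if (cur_x < dst_x) return Port::East;
    if (cur_x > dst_x) return Port::West;
    return Port::Local;   // arrived: deliver to the local core
}
```

Because the route between any source/destination pair is fixed, every reduction packet of a given communicator always reaches a router from the same direction, which is what makes the path learning in the next subsection possible.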

3.2

The Learning Method

We propose a SW/HW hybrid method to enable routers to learn the transmission paths of reduction packets adaptively under deterministic routing. More specifically, this method enables every enhanced router to know the number of reduction packets it should process, as well as the incoming direction of each packet.

In our design, when a parallel MPI task begins, the runtime launches a fake reduction on the default MPI communicator: all involved cores (here we assume that each core executes just one MPI process) send a special reduction packet without any data payload to the rank root node. During the transmission, if a Class_1 router receives such a packet from a neighbor or the local core, it increases the count of packets received from the corresponding direction and records this value in an internal bit-array, called the learned status-bit-array (abbreviated as LSBA).

Figure 1 shows an example of a Class_1 router's LSBA. It contains five fields; each field records the number of reduction packets received from the corresponding direction. It can be inferred from this figure that the router has received 3 reduction packets from its west neighbor and 5 from the north, as well as 1 from the local core. Obviously, for an NxN mesh NoC, the length of each LSBA field is not more than log2(NxN) bits. The exception is the local field: it is only one bit long because we assume no more than one reduction packet of a given communicator will be delivered from the local core. If such a case does occur (for example, one core runs two or more processes of a single MPI task), we assume that the runtime software coalesces the messages by completing the reduction locally.

East 0000 | South 0000 | West 0011 | North 0101 | Local Core 1

Fig. 1. An example of LSBA

If a Class_2 router receives such a packet, it just records the incoming direction, not the count; thus the bit length of a Class_2 router's LSBA is only 5. Another issue is how to forward the empty reduction packet, which also depends on the class of the router: (1) a Class_1 router, which completes the reduction of all incoming reduction packets in place and sends out one result packet, only forwards the first reduction packet and discards the others (if any); (2) a Class_2 router forwards the packet as normal.

During a common reduction, a Class_1 router records the information of received reduction packets in another bit-array, called the current status-bit-array (abbreviated as CSBA). The structure of a CSBA is the same as that of an LSBA. Once the values of these two bit-arrays become identical (meaning that the router has received all packets it should handle, because deterministic routing is used), the router can complete the reduction and send out a new reduction packet containing the result.
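The behaviour of the two bit-arrays can be summarised by the following sketch (ours, in software form; in the real router they are CAM entries): the LSBA counters are filled during the fake reduction, the CSBA counters during a real reduction, and the reduction completes at this router once the two match.

```cpp
#include <array>
#include <cstdint>
#include <map>

// Per-direction packet counters, in the field order of Fig. 1:
// East, South, West, North, Local.
using StatusArray = std::array<std::uint16_t, 5>;

struct Class1LearningState {
    std::map<std::uint32_t, StatusArray> lsba;  // keyed by communicator ID
    std::map<std::uint64_t, StatusArray> csba;  // keyed by reduction ID

    // Learning phase: an empty reduction packet of communicator `com`
    // arrives from direction `dir`.
    void learn(std::uint32_t com, int dir) { lsba[com][dir]++; }

    // Normal reduction: a packet of `reduction_id` (communicator + sequence
    // number) arrives from `dir`. Returns true once every expected packet of
    // this reduction has been seen, i.e. the combined result may be emitted.
    bool record_and_check(std::uint64_t reduction_id, std::uint32_t com, int dir) {
        csba[reduction_id][dir]++;
        return csba[reduction_id] == lsba[com];
    }
};
```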

Moreover, every time an MPI communicator is created or modified, a fake reduction is carried out again to reflect the latest transmission-path information of the given communicator. Accordingly, a router with m LSBA structures can support m communicators at the same time; each LSBA is identified by its communicator. For the CSBA, the case is a little more complicated because one communicator may launch more than one reduction simultaneously, so they must be differentiated. The solution is straightforward: the runtime allocates an ID for each reduction, which contains the value of the corresponding communicator and a unique number of fixed length. The router can then use this ID to locate the corresponding CSBA and LSBA.

Another function of the learning method is to speed up the broadcast of all_reduce at the NoC layer. From the transmission aspect, the broadcast is the inverse procedure of the reduction. Once the rank root node has obtained the final reduction result, it sends out result packet(s) according to the LSBA information: if a field of the LSBA is non-zero, the router transmits one result packet in the corresponding direction. This step is repeated by each router (whether Class_1 or Class_2) along the transmission path, until all involved nodes have received the result.
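A correspondingly simple sketch of the broadcast leg (again ours, for illustration only): each router forwards one copy of the result to every direction whose LSBA field is non-zero.

```cpp
#include <array>
#include <cstdint>
#include <vector>

using StatusArray = std::array<std::uint16_t, 5>;  // East, South, West, North, Local

// Returns the directions to which the all_reduce result must be re-sent.
std::vector<int> broadcast_directions(const StatusArray& lsba) {
    std::vector<int> dirs;
    for (int dir = 0; dir < 5; ++dir) {
        if (lsba[dir] != 0)       // a reduction packet arrived from here,
            dirs.push_back(dir);  // so the result travels back the same way
    }
    return dirs;
}
```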

3.3

Architecture of the Enhanced Router and the Packet Format

The micro-architecture of an enhanced router is shown in Figure 2. A Class_1 router is integrated with a Reduce Processing Unit (RPU) that is responsible for performing the reduction computation (MAX, AND, ADD, etc.) on the payloads of incoming reduction packets, leading to a highly efficient mechanism in which messages are processed during their transmission. A Class_2 router, in contrast, is integrated with a simplified RPU (sRPU) that contains no FPU and has fewer status-bit-arrays.

Fig. 2. Block diagram of enhanced router architecture

In addition, a specific format for reduction packets is introduced, which carries extra information in the head flit, including the reduction type, tag, data type, etc.

The head-flit format of a reduction packet is shown in Figure 3. It provides information such as Src, the source ID in the network layer, and Drc, the destination ID. Com is used to identify different communicators and the different reductions of one communicator. Op refers to the corresponding reduction type; if all bits of Op are 0, the packet is a special reduction packet for the learning step.

Fig. 3. Head-flit format of Reduction packet
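Figure 3 itself is not reproduced here; the sketch below merely names the fields mentioned in the text, with bit widths chosen by us for illustration (the real layout must fit, together with routing information, into one 128-bit head flit).

```cpp
#include <cstdint>

// Illustrative head-flit contents of a reduction packet. Only the field
// names come from the paper; the widths are our assumption.
struct ReductionHeadFlit {
    std::uint16_t src;        // source node ID in the network layer
    std::uint16_t drc;        // destination node ID
    std::uint16_t com;        // communicator ID plus reduction sequence number
    std::uint8_t  op;         // reduction type (ADD, MAX, ...); all zeros marks a learning packet
    std::uint8_t  data_type;  // element type of the payload (integer, double, ...)
    std::uint16_t tag;        // message tag
    // ... remaining bits of the 128-bit flit: routing / flow-control fields
};
```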

The original input unit has been modified accordingly: it can judge whether an incoming flit belongs to a reduction packet or not. If so, it assembles all such flits in the virtual-channel flit buffers.

• RPU

Figure 4 shows the block diagram of the RPU, which contains the following main modules:

─ A control unit. It is in charge of the reduction management; it also contains registers to store the reduction type, tag, data type, etc.
─ A simplified floating-point unit (FPU), which can perform a reduction of two double-precision floating-point values at a time, including MAX, MIN, ADD, MULTIPLY, etc. It can process integer data as well. Division is not supported because it is not associative.
─ Three FIFOs. Two FIFOs (FIFO_A and FIFO_B) are used as a double buffer for incoming packets, as well as the input source of the FPU. The last FIFO is the result buffer, which can also feed the result back to the FPU.
─ LSBA & CSBA structures. They are implemented as content-addressable memory (CAM) modules. Both are indexed by the "Com" field of the reduction packet, as mentioned above. The CAM width is the sum of the lengths of the LSBA and the "Com" field, which is no more than (log2(NxN) + 10) bits. The depth of the LSBA-CAM is the maximum number of communicators that can be supported at the same time; the depth of the CSBA-CAM is the maximum number of simultaneous reduction tasks that can be supported.

Fig. 4. Block Diagram of RPU

After the RPU has completed all computations, the result packet is injected into the router pipeline again; the forwarding direction is determined by the deterministic Y-X routing algorithm.

• sRPU

As mentioned before, a Class_2 router is integrated with a simplified RPU that has smaller status-bit-arrays, containing only an LSBA and no CSBA. Each LSBA has a fixed length of 5 bits.

3.4

A Processing Example

• Reduction

Take the ADD operation as an example. When the first reduction packet has been buffered at one of the five input ports of a router, it is sent to the RPU. The data is stored in one of the RPU's FIFOs and the result FIFO is initialized to the "safe" value (for instance, 0 for addition). The FPU then executes ADD on the incoming data and the result data. At the same time, the packet's virtual channel is released and the original packet is not forwarded.

When any following packet arrives, the extracted data is likewise stored in the FIFOs and combined by the FPU with the buffered value; the sum is stored in the result FIFO, replacing the older value. Because of the double-buffering mechanism, data transmission and computation can be overlapped. This procedure repeats until the LSBA and CSBA indicate that all reduction packets have been received, as described in the previous subsections. Finally, the newly generated data is packed into a reduction packet and sent out.

• All_reduce

All_reduce is regarded as a reduction followed by a broadcast of the result from the root; the latter is the inverse process of the reduction. On reception of such a packet, it is sent to the RPU/sRPU. The control logic checks the LSBA to obtain the forwarding destinations, which are the neighbors and/or the local core that participated in the reduction of this communicator. For each destination, the RPU/sRPU generates one corresponding packet and sends it out. This process runs iteratively until the broadcast finishes.
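The ADD example can be summarised by the following behavioural sketch (our pseudocode of the RPU datapath, not the authors' RTL):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t MAX_ELEMS = 8;   // a 64-byte payload of doubles

struct RpuAddState {
    // Result FIFO, initialised to the "safe" value 0.0 for ADD.
    std::array<double, MAX_ELEMS> result{};

    // Fold one incoming reduction payload into the running result
    // (one FPU operation per element pair).
    void accumulate(const std::array<double, MAX_ELEMS>& payload,
                    std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            result[i] += payload[i];
    }
};

// Per-packet handling in a Class_1 router, schematically:
//   1. buffer the payload in FIFO_A / FIFO_B (double buffering overlaps
//      transfer and computation),
//   2. accumulate it into the result FIFO,
//   3. once the CSBA equals the LSBA for this reduction, pack `result` into
//      a new reduction packet and inject it back into the router pipeline,
//      forwarded towards the root by Y-X routing.
```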

3.5

Layout of Class_1 and Class_2 Routers

The FPU in the RPU is responsible for the corresponding computation; it is the main component in terms of power consumption and occupies a comparatively large area. As a result, Class_1 routers should be used prudently. On the other hand, too few Class_1 routers are not enough to achieve high performance. Consequently, the layout of Class_1 and Class_2 routers in the NoC is crucial to both performance and power consumption.

4

Evaluation

4.1

Methodology

To evaluate the effectiveness of the proposed design, we have implemented a cycle-accurate CMP simulator using the Xtensa Xplorer toolkit [23]. The toolkit also contains an energy estimator, so for a given design we can obtain the running cycles / frequency, power consumption and chip area under a given CMOS process. Our simulator models multiple Xtensa LX4 cores connected by a NoC module; the Xtensa LX4 can achieve 1.4 GHz in 45 nm GS process technology, and the NoC models a detailed pipeline structure for the enhanced routers. Table 1 lists the network configurations applied to all our experiments. In the experiments, the network scale varies from a 4x4 mesh to a 16x16 mesh in order to measure scalability.

Table 1. Network Configurations

Network Configuration
  Topology: 2D mesh
  Routing algorithm: Deterministic Y-X routing
  Channel width / flit size: 128 bits (16 bytes)
  Maximum packet size: 5 flits
  Number of ports: 5 ports
  VCs per port: 2 VCs
  Buffers per VC: 5 flits

Enhanced Router
  FIFO_A / FIFO_B / FIFO_Result: 5 flits
  Length of LSBA/CSBA: log2(N x N) + 10 bits
  Number of LSBA: 5
  Number of CSBA: 10
  FPU latency (multiplication): 6 cycles

The time taken to perform a complete reduction / all_reduce on our customized NoC is referred to as the hardware latency. Three typical layout strategies are measured in the experiments, as shown in Figure 5. Strategy 1 can be considered a naive implementation of reduction without any optimization: only one router, nearest to the center and also designated as the rank root node, is a Class_1 router. In Strategy 2, only one row (1/N of all nodes), lying in the middle of the network, is equipped with Class_1 routers. Strategy 3 is the example shown in Section 3.5, where the number of Class_1 routers is 2/N of all nodes.

Because we focus on small data sets and the enhanced routers can intercept in-transmission reduction packets, a one-pass communication protocol is employed: any core involved in the reduction sends out its reduction packet directly, without a handshake. To be fair, the same protocol is also applied to the software method.

Fig. 5. Three layout strategies

The corresponding time for the software Binomial Tree Reduce method is referred to as the software latency. It is estimated using the Log2P model, while all software overheads involved, such as the packet start-up time, per-datum transmission time and computation time, are obtained from the real CMP simulation. Moreover, we assume that there is no network contention; thus it can be considered a highly optimized software reduction. All_reduce is treated as a regular reduce followed by a broadcast of the result from the root to all nodes; its software latency is estimated with the Log2P model as well.

We perform several experiments with different reduction-packet payload sizes as well as different network scales. In all cases, the packet payload has an upper limit of 64 bytes, the size of an L1 cache line. Both the hardware designs of Strategies 1, 2 and 3 and the software method are evaluated in all experiments.
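The paper does not spell the model out as a formula; one plausible reading, used here only to make the baseline concrete (parameter names are ours, and the per-term values would come from the simulator), is that each of the log2(P) rounds costs a start-up term, a per-byte transfer term and a computation term:

```cpp
#include <cmath>

// Illustrative estimate of the software BTR latency in cycles.
double software_reduce_latency(int num_procs, int payload_bytes,
                               double t_startup, double t_per_byte,
                               double t_compute) {
    double rounds = std::ceil(std::log2(static_cast<double>(num_procs)));
    return rounds * (t_startup + payload_bytes * t_per_byte + t_compute);
}
```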

4.2

Results

• Reduction performance

The hardware latencies of the different strategies and the software latency are shown in Figure 6 for different NoC scales and reduction-packet payload sizes. It can be seen that all three strategies achieve a large performance improvement over the software implementation, of at least 2.67 times. With the 16x16 mesh NoC and Strategy 3, the hardware-based approach achieves its peak speedup of 11.76 times over the software-only approach; across all cases, the average is about 8 times. Analysis shows that the software approach takes thousands of cycles to deal with one packet. In contrast, although the transmission paths of the hardware method are suboptimal, the high efficiency of the hardware implementation more than compensates for this.

Fig. 6. Reduction latency of hardware and software (latency in cycles vs. network scale and payload size)

Regarding the hardware optimization, Strategy 3 is slightly better than Strategy 2, and both Strategy 2 and Strategy 3 are clearly better than Strategy 1. Compared with Strategy 1, the smallest improvement of Strategies 2 and 3 occurs on the 4x4 mesh network, where they take about 84.4% and 82.9% of Strategy 1's latency, respectively. This is reasonable for a small-scale network, because one Class_1 router has more influence in a small network than in a large one. As the network scale increases, the advantage of Strategies 2 and 3 becomes more and more obvious: for the 16x16 mesh topology, Strategies 2 and 3 take only 24.8% and 22.7% of Strategy 1's latency, respectively.

Furthermore, the number of transmission hops is decreased by about 50% by Strategies 2 and 3 (compared with Strategy 1), as shown in Figure 7. Packet transmission contributes most of the power consumption of a NoC, so this is another benefit of our design.

Fig. 7. Number of transmission hops of the three strategies

Considering the hardware cost, the number of Class_1 routers in Strategy 3 is twice that of Strategy 2. Therefore, Strategy 2 is the best design strategy, keeping a balance between performance and power consumption.

All cores are involved in the above experiments. In addition, if a reduction task includes fewer nodes and does not contain any Class_1 router, as in the situation described in Section 3.5, the experimental results show that it costs only slightly more latency (about 1%~2%) than a case that does include some Class_1 routers, such as nodes No.18~21 & No.26~29.

• All_reduce performance

The speedup of all_reduce is shown in Figure 8: compared with the software implementation, it achieves a speedup of 3.35~10.2 times, which is almost the same as for reduction. For the hardware strategies, the same conclusions as for reduction still hold.

Fig. 8. All_reduce result (latency in cycles vs. network scale and payload size)

• Implementation overhead

To obtain the implementation overhead, the enhanced routers have been described in HDL and synthesized in a 90 nm GS process under typical operating conditions. The area and power have been measured at the maximum supported frequency. The results show that one Class_1 router is about 90.8% larger than a common router, and the simplified FPU occupies 38% of the total area of a Class_1 router. In terms of power, such a router consumes 1.84 times the power of a common router during one complete reduction. For the Class_2 router, the extra cost is limited: only a 1.4% area overhead is observed.

In the best layout (Strategy 2) mentioned above, only 1/N of the routers are Class_1 and the others are Class_2, which greatly decreases the hardware overhead. More specifically, from the perspective of the resource consumption of the routers (not including the links and cores), our design increases the chip area by 24%, 12.6% and 7% for the 4x4, 8x8 and 16x16 mesh NoCs, respectively.
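As a rough cross-check (our arithmetic, under the assumption that Strategy 2 uses one row of N Class_1 routers at +90.8% area each while the remaining N² − N routers are Class_2 at +1.4% each), the relative router-area overhead is about (N × 0.908 + (N² − N) × 0.014) / N², which evaluates to roughly 23.8%, 12.6% and 7.0% for N = 4, 8 and 16, consistent with the reported figures.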

5

Conclusion

The most important contribution of this paper is the two kinds of enhanced routers for message reduction, integrated either with a Reduce Processing Unit (RPU) that can complete reductions and learn transmission paths adaptively, or with a simplified Reduce Processing Unit (sRPU) that is only capable of learning transmission paths. Three layout strategies for the customized network-on-chip are presented and evaluated to prove its effectiveness. Compared to the most efficient software-based implementation of the reduce / all_reduce operation, we improve the performance by 2.67 to 11.76 times (up to 11.76 for reduction and 10.2 for all_reduce). In addition, the extra chip area and power dissipation are kept small by the optimal layout strategy.

Acknowledgment. The work is supported by the High Tech. R&D Program of China under Grant No. 2013AA01A215.

References

1. Timothy, M.: The Future of Many Core Computing, http://i2pc.cs.illinois.edu/presentations/2010_05_06_Mattson_Slides.pdf
2. Rakesh, K., Timothy, G.M., Gilles, P., Rob, V.D.W.: The Case for Message Passing on Many-Core Chips. Multiprocessor System-on-Chip, pp. 115–123 (2011)
3. Jie, M., Daniel, R., Ayse, K.C.: 3D Systems with On-Chip DRAM for Enabling Low-Power High-Performance Computing. In: Proceedings of Fifteenth HPEC Workshop, Massachusetts, USA (September 2011)
4. Timothy, G.M., Rob, F.V.D.W., Michael, R., Thomas, L., Paul, B., Werner, H., Patrick, K., Jason, H., Sriram, V., Nitin, B., Greg, R., Saurabh, D.: The 48-core SCC processor: the programmer's view. In: Proceedings of 2010 International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA (2010)
5. Multicore Communications API Working Group, http://www.multicore-association.org/workgroup/mcapi.php
6. Dong, Y., Chen, J., Yang, X., Yang, C., Peng, L.: Low power optimization for MPI collective operations. In: The 9th International Conference for Young Computer Scientists, ICYCS 2008. IEEE (2008)
7. Rabenseifner, R.: Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512. In: Message Passing Interface Developer's and User's Conference (1999)
8. Rabenseifner, R.: Optimization of collective reduction operations. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3036, pp. 1–9. Springer, Heidelberg (2004)
9. Open MPI Development Team: Open MPI: open source high-performance computing, http://www.open-mpi.org/
10. Thakur, R., Rabenseifner, R., Gropp, W.: Optimization of collective communication operations in MPICH. High Performance Computing Applications 19(1), 49–66 (2005)
11. Rabenseifner, R.: Optimization of collective reduction operations. In: Proceedings of Int'l Conference on Computational Science (ICCS), Krakow, Poland (2004)
12. Nicolas, F., Marc, H., Eric, L., Bernard, T.: MPI for the Clint Gb/s Interconnect. In: Proceedings of the 10th European PVM/MPI User's Group Meeting, pp. 395–403 (2003)
13. Maximize Platform MPI Performance with Voltaire Fabric Collective Accelerator (FCA) and HP, http://www.mellanox.com/related-docs/voltaire_acceleration_software/FCA-Voltaire-Platform-HP-WEB111110.pdf
14. Underwood, K.D., Ligon, W.B., Sass, R.R.: Analysis of a prototype intelligent network interface. Concurrency and Computation: Practice and Experience 15(7-8), 751–777 (2003)
15. Almási, G.S., et al.: Implementing MPI on the BlueGene/L supercomputer. In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds.) Euro-Par 2004. LNCS, vol. 3149, pp. 833–845. Springer, Heidelberg (2004)
16. Gao, S., Schmidt, A.G., Sass, R.: Impact of reconfigurable hardware on accelerating MPI_Reduce. In: 2010 International Conference on Field-Programmable Technology (FPT), pp. 29–36 (2010)
17. Libo, H., Zhiying, W., Nong, X.: Accelerating NoC-based MPI Primitives via Communication Architecture Customization. In: Proceedings of the IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors, Delft, pp. 141–148. IEEE (July 2012)
18. David, W., Patrick, G., Henry, H., Liewei, B., Bruce, E., Carl, R., Matthew, M., Chyi-Chang, M., John, F.B., John III, F.B., Anant, A.: On-chip Interconnection Architecture of the Tile Processor. IEEE Computer Society (September-October 2007)
19. Velamati, M.K., Kumar, A., Jayam, N., Senthilkumar, G., Baruah, P.K., Sharma, R., Kapoor, S., Srinivasan, A.: Optimization of collective communication in intra-Cell MPI. In: Aluru, S., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) HiPC 2007. LNCS, vol. 4873, pp. 488–499. Springer, Heidelberg (2007)
20. Ali, Q., Midkiff, S.P., Pai, V.S.: Efficient high performance collective communication for the Cell blade. In: Proceedings of the 23rd International Conference on Supercomputing, pp. 193–203. ACM (2009)
21. Kohler, A., Radetzki, M., Gschwandtner, P., Fahringer, T.: Low-latency collectives for the Intel SCC. In: 2012 IEEE International Conference on Cluster Computing (CLUSTER), pp. 346–354. IEEE (2012)
22. Peng, Y., Saldaña, M., Chow, P.: Hardware support for broadcast and reduce in MPSoC. In: 2011 International Conference on Field Programmable Logic and Applications (FPL), pp. 144–150. IEEE (2011)
23. Gonzalez, R.E.: Xtensa: A configurable and extensible processor. IEEE Micro 20(2), 60–70 (2000)
