
AP1000+: Architectural Support of PUT/GET Interface for Parallelizing Compiler

Kenichi Hayashi, Tunehisa Doi, Takeshi Horie, Yoichi Koyanagi, Osamu Shiraki, Nobutaka Imamura, Toshiyuki Shimizu, Hiroaki Ishihata, and Tatsuya Shindo
Parallel Computing Research Center, Fujitsu Laboratories Ltd.
1015 Kamikodanaka, Nakahara-ku, Kawasaki 211, Japan
woods@flab.fujitsu.co.jp

Abstract

The scalability of distributed-memory parallel computers makes them attractive candidates for solving large-scale problems. New languages, such as HPF, FortranD, and VPP Fortran, have been developed to enable existing software to be easily ported to such machines. Many distributed-memory parallel computers have been built, but none of them support the mechanisms required by such languages. We studied the mechanisms required by parallelizing compilers and proposed a new architecture to support them. Based on this proposed architecture, we developed a new distributed-memory parallel computer, the AP1000+, which is an enhanced version of the AP1000. Using scientific applications in VPP Fortran and C, such as the NAS parallel benchmarks, we simulated the performance of the AP1000+.

1 Introduction

The scalability of distributed-memory parallel computers makes them attractive candidates for solving large-scale problems. Since the memory model of the distributed-memory architecture is radically different from that of conventional sequential machines, we believe the compiler is critical to the success of distributed-memory architecture machines in supercomputing. The support of a global address space is important for easy programming and for porting software from uniprocessor supercomputers. New languages such as HPF [6], FortranD [7], and VPP Fortran [16] have been proposed for this purpose. Many distributed-memory parallel computers have been built; however, they do not support the mechanisms required by such languages. To use these languages on distributed-memory parallel computers requires a new architecture that takes into account the communication mechanisms used by parallelizing compilers.

1.1 Communication Mechanisms Required to Support Parallelizing Compilers

We implemented VPP Fortran and HPF on the AP1000 [4, 22] and studied other language specifications [3, 7, 12, 19]. Based on this, we concluded that the mechanisms required to implement these languages on distributed-memory parallel computers are: direct remote data access, bulk data transfer, stride data transfer, barrier synchronization, and global reduction. In code generation that uses SEND and RECEIVE primitives to access distributed data [7, 12, 19], the processor that owns the distributed data to be read sends the data to each requesting node. Use of SEND/RECEIVE is inefficient if the SEND/RECEIVE pair cannot be established at compile time. If it is difficult to determine SEND/RECEIVE correspondence at compile time, communication among all processors is required. Use of direct remote data access for the interprocessor communication interface enables data to be written directly to memory in selected processors, which eliminates the need for setting SEND/RECEIVE correspondence at compile time. The communication overhead on a distributed-memory parallel computer depends heavily on the number of transfers. Hence, bulk and stride data transfers, which are used for tasks like transposing or redistributing large matrices, are indispensable for enhancing sustained performance. Bulk and stride data transfers reduce the number of transfers by blocking small data transfers together. Parallelizing compilers generate code that spreads to node groups, which execute programs independently. Barrier synchronization and global reductions are performed in specific groups of nodes. Hence, the architecture must support fast barrier synchronization and global reduction for all nodes, as well as for specific groups of nodes.

To appear in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), October 5-7, 1994, San Jose


1.2 Architectural Support for Parallelizing Compilers

The architecture we propose includes two functions to support the mechanisms required by parallelizing compilers: the PUT/GET operation and a combined flag update mechanism and data transfer.

PUT/GET operation  PUT and GET are examples of active messages [23]. The PUT/GET operation is the same as direct remote data access except for the flag update. Split-C, an extension of the C language, allows programmers to write PUT/GET operations directly [3]. The basic PUT and GET operations are as follows:

PUT copies a local memory block to remote memory at an address specified by the sender and updates flags on both the sending and receiving nodes to notify the completion of data transfer.

GET retrieves a remote memory block from an address specified by the sender and makes a local copy. It updates flags on both the sending and receiving nodes to notify the completion of data transfer.

Since PUT/GET is a split-phase remote memory operation, communication and computation can overlap. PUT/GET increments separately specified flags on the nodes that send and receive the data. This allows simple synchronization by checking the flags, and detection of communication completion. PUT/GET has been implemented on conventional message-passing parallel computers, such as the CM-5, nCUBE/2 [23], and the AP1000 [5]. Such computers need to invoke handlers for message arrival using polling or interrupts. The handlers analyze the message header and set DMA parameters. The handlers require a large amount of processing time and stall messages on the network. Hence, the handler for PUT/GET should be supported by hardware.

Combined flag update mechanism and data transfer  To implement the handler for PUT/GET in hardware, the flag update and data transfer must be combined. Since the flags are used to detect the completion of data transfer, flag updates are essential for PUT/GET. The PUT receive-complete flag, for example, must be updated when data receipt is complete. When the data packet includes the address of the flag to update, the receiving node must be able to temporarily store the address and update the flag when it receives the DMA completion notification. A flag packet could instead be sent to the destination node after a data packet. Other messages, however, may enter the network between the two messages and delay the flag update. In this case, even though the data has been received, the program cannot use it, and idle time is introduced because the flag has not been updated. Sending flags separately also doubles the number of messages and, therefore, increases the sending overhead. The data and flag should, therefore, be sent together, and flag updating should be combined with the completion of data transfer. Usually one flag is used for detection of received messages. The flag update mechanism increments the flag value.

1.3 PUT/GET vs SEND/RECEIVE

The SEND/RECEIVE communication model is a popular programming paradigm in parallel computing. In the SEND/RECEIVE model, messages are buffered at the receiving node and copied to the user area by explicit receive functions. This is an intrinsic overhead, and one reason for idle time. If the sender does not know either (1) the address at which the data is to be stored or (2) the status of the receive area, the SEND/RECEIVE model is useful. If the sender knows both (1) and (2), buffering is wasteful and degrades sustained performance [9]. SEND/RECEIVE operations are suitable for reliable parallel programming with explicit communications because they never destroy data on remote nodes. PUT/GET operations are, however, efficient for code generation in a parallelizing compiler because the addresses to store to are calculated by the compiler, synchronization is managed by the compiler, and buffering overhead is reduced. PUT/GET operations were reported to improve sustained performance on the CM-5, nCUBE/2, and AP1000, although those machines were not well suited to implementing PUT/GET operations because of the large overhead caused by polling or interrupts [5, 21, 23]. From these observations, we adopted PUT/GET operations as the low-level communications for our parallelizing compilers.

1.4 New Highly Parallel Computer AP1000+

After studying the mechanisms required by parallelizing compilers, we developed the AP1000+, a new distributed-memory parallel computer. The AP1000+ is an enhanced AP1000 [10, 20] that supports the architecture required for parallelizing compilers. The AP1000+ uses PUT/GET for basic data transfer and supports the other mechanisms required for parallelizing compilers.


2 Communication Mechanisms Required for Parallelizing Compilers

2.1 VPP Fortran and HPF

VPP Fortran [16] is a parallel programming language that enables high-performance programming on distributed-memory parallel computers, providing a global address space and a single program multiple data (SPMD) execution model. High Performance Fortran (HPF) [6], which has been proposed as a standard language for distributed-memory parallel computers, and VPP Fortran are based on similar programming models. Both models include global memory space, block and cyclic decomposition, and SPMD program execution. The major difference is that VPP Fortran allows users to customize their programs. VPP Fortran has both global and local memory areas to localize data and minimize communication overhead, and provides asynchronous directives for collective data transfers. The HPF compiler must generate code to specify collective data transfer. VPP Fortran uses a hierarchical logical programming model with local and global memory (Figure 1). Processors share global memory space, making it easier to port programs from uniprocessor supercomputers to distributed-memory parallel computers. Because objects in global memory space are accessible to all processors, the programmer can use a memory model similar to that of conventional uniprocessor machines. Objects in local memory space can be accessed by the owner without interprocessor communication, ensuring high-speed local-data access.

Figure 1: VPP Fortran memory model (a global memory space shared by all processors, plus a local memory attached to each processor)

VPP Fortran provides directives to specify collective data transfer for program performance tuning, since communication overhead on a distributed-memory parallel computer depends heavily on the number of transfers executed. Two collective data communication directives, SPREAD MOVE and OVERLAP FIX, are provided. SPREAD MOVE transfers a "chunk" of elements from one array to another in one operation. A SPREAD MOVE directive is inserted into an inter-array assignment contained in a DO loop to direct the compiler. List 1 shows a SPREAD MOVE example. Each line beginning with !XOCL is a VPP Fortran directive. SPREAD MOVE transfers data collectively from global array B to local array A.

List 1: Example of SPREAD MOVE
  1  !XOCL SPREAD MOVE
  2        DO 200 J=1,M
  3          A(J)=B(J,K)
  4  200   CONTINUE
  5  !XOCL END SPREAD (X)
  6  !XOCL MOVEWAIT (X)

An overlap area, that is, a boundary data area replicated in adjacent processors (Figure 2), can be defined with an index partition when arrays are decomposed in blocks across processors. An overlap area is effective in a program where an adjacent processor references boundary data. A copy of the overlap area is maintained in each processor and is efficiently updated with OVERLAP FIX during collective data communication.

Figure 2: Example of overlap area (array C(1000,1000) decomposed in blocks across PE0-PE3, with boundary data replicated in adjacent processors)

VPP Fortran is used as an intermediate form for compiling HPF programs. HPF is converted into VPP Fortran with explicit data transfers. SPREAD MOVE and OVERLAP FIX constructs are generated by a communication aggregation pass in our HPF compiler. The VPP Fortran compiler consists of a translator and run-time system for the VPP500 [16] and the AP1000. The translator translates a VPP Fortran program into FORTRAN77 sequential code with run-time system calls for each processing element. The run-time system consists of machine-dependent libraries which include an interprocessor communication mechanism. The translator inserts index calculation code which converts global addresses to local addresses. It also inserts communication library calls for accessing remote data.
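The paper does not show the code the translator emits; the following is a minimal sketch of the kind of global-to-local index translation it describes, for a one-dimensional block distribution. All names here (LocalRef, global_to_local) are our own illustrative assumptions, not part of the VPP Fortran run-time interface.

/* Sketch: global-to-local index translation for a block distribution
 * of n_elements over n_procs processors (illustrative names only). */
#include <stdio.h>

typedef struct {
    int owner;  /* node that owns the global index            */
    int local;  /* index within that node's local block       */
} LocalRef;

static LocalRef global_to_local(int global_index, int n_elements, int n_procs)
{
    /* block size rounded up, so the last block may be shorter */
    int block = (n_elements + n_procs - 1) / n_procs;
    LocalRef r;
    r.owner = global_index / block;
    r.local = global_index % block;
    return r;
}

int main(void)
{
    /* e.g. element 2500 of a 10000-element array on 16 nodes */
    LocalRef r = global_to_local(2500, 10000, 16);
    printf("owner=%d local=%d\n", r.owner, r.local);  /* owner=4 local=0 */
    return 0;
}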

2.2 Interprocessor Communications

Direct remote data access  The VPP Fortran translator inserts explicit nonblocking direct remote data accesses based on analysis of array references, and the run-time system uses them as the interprocessor data communication interface, enabling data to be written directly to memory in selected processors. The following interfaces are used:

readRemote(node_id, raddr, laddr, size)
writeRemote(node_id, raddr, laddr, size)

where node_id is the node ID, raddr the start address on the remote node, laddr the start address on the local node, and size the amount of data to be transferred. The readRemote function reads size bytes of data from node node_id starting at raddr and writes the data to the local processor starting at address laddr. Similarly, the writeRemote function writes size bytes of data from the local processor, starting at laddr, to an area in the node specified by node_id, starting at raddr. This eliminates the need for setting SEND/RECEIVE correspondence at compile time.
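As a rough illustration of how generated code might call these interfaces, here is a minimal sketch. The C prototypes, the array names, and the assumption that the remote block's address is already known are all ours; the paper gives only the parameter lists.

/* Sketch only: assumed C prototypes for the interfaces above. */
#include <stddef.h>

extern void readRemote(int node_id, void *raddr, void *laddr, size_t size);
extern void writeRemote(int node_id, void *raddr, void *laddr, size_t size);

#define N 1000
double local_a[N];        /* local block of a distributed array            */
extern double *remote_b;  /* address of the peer's block (assumed known)   */

void exchange_with(int peer)
{
    /* write our block into the peer's memory; no RECEIVE is needed there */
    writeRemote(peer, remote_b, local_a, N * sizeof(double));

    /* fetch the peer's block into our own buffer */
    readRemote(peer, remote_b, local_a, N * sizeof(double));
}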

Stride Data Transfer  Supporting stride data access is important for SPREAD MOVE and OVERLAP FIX block transfers. In List 1, line 3, if loop index J indexed the second dimension of global array B, as in B(K,J), stride data transfer would be required because local array A is contiguous but global array B is strided. Stride data transfer is also necessary if the overlap area is allocated along the second dimension, as shown in Figure 2.

Detecting communication completion  With code generated using direct remote data access, access requests to a given node are asynchronous with processor operation. To synchronize execution, each node must be able to detect the completion of communication for a given parallel DO loop. Otherwise, data may be overwritten before it is read, or old data may be read before the correct data is available. The end of communication is detected using flags on each node, which are incremented when data receipt completes, and by barrier synchronization. Detecting the completion of readRemote is easy, because the reply data returns and updates the flag, but detecting writeRemote completion is more difficult. One solution is to use an acknowledgment. In this case, the receiving node returns an acknowledge packet to notify completion and to increment the acknowledge flag, which is an implicit flag on each node. The requesting node checks the flag value, determines whether all writeRemotes have completed, and then enters barrier synchronization. (This type of operation is common in data parallel programming, so we call it the Ack & Barrier model.)
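The Ack & Barrier pattern just described can be sketched as follows. The prototypes, the ack_flag variable, and the barrier() call are illustrative assumptions; the real run-time interface is not shown in the paper.

/* Sketch of the Ack & Barrier model (illustrative names only). */
extern void writeRemote(int node_id, void *raddr, void *laddr, unsigned size);
extern volatile int ack_flag;   /* implicit per-node acknowledge counter  */
extern void barrier(void);      /* all-node barrier synchronization       */

void scatter_updates(int n, int target[], void *raddr[], void *laddr[],
                     unsigned size[])
{
    int start = ack_flag;       /* acknowledgments seen so far            */

    /* issue all remote writes generated for this parallel loop */
    for (int i = 0; i < n; i++)
        writeRemote(target[i], raddr[i], laddr[i], size[i]);

    /* wait until every write has been acknowledged ...                   */
    while (ack_flag - start < n)
        ;  /* each acknowledge packet increments ack_flag                 */

    /* ... and only then enter the barrier, so no node proceeds while
     * data for this loop is still in flight */
    barrier();
}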

2.3 Barrier Synchronization and Global Reduction

Barrier synchronization is important for detecting the start and end of communications. The architecture must support barrier synchronization both for all nodes and for specific groups of nodes. An example of group barrier synchronization in VPP Fortran is the use of index partition directives for decomposition of data (arrays) and computation (DO loops). In this case, every node in a partitioned group must use group barrier synchronization. The index partition directive corresponds to a combination of the ALIGN and DISTRIBUTE directives in HPF [6]. Parallelizing compilers also require global reduction over all nodes and over specific groups of nodes. The elements of a global reduction may be scalar or vector. The required architecture must support reduction within groups of nodes and for vector data, as well as for all nodes and for scalar data.

3 Architectural Support for Parallelizing Compilers

In this section, we discuss the architecture required to support parallelizing compilers and describe the mechanisms implemented in hardware.

3.1 Bulk and Stride Data Transfer

Bulk and stride data transfer can be performed using PUT/GET operations with low overhead and high throughput. Since the PUT/GET operation, unlike SEND/RECEIVE, does not have explicit receive functions, a method of detecting when data is written is required. In addition, to send data, a method of detecting when data transfer has completed is required to keep communication overhead low and to protect the sending data area while sending. Flags are useful for synchronizing PUT/GET operations. PUT/GET operations increment flags which are specified by the program, and the program checks the values of these flags to detect the completion of communications. An acknowledge packet is needed for the requesting node to detect the completion of a PUT operation. Not all PUT operations need an acknowledgment, however, so the requirement must be specified by the program. The PUT and GET functions are specified as follows:

put(node_id, raddr, laddr, size, send_flag, recv_flag, ack)
get(node_id, raddr, laddr, size, send_flag, recv_flag)

The specifications of node_id, laddr, raddr, and size are the same as for writeRemote and readRemote. For the PUT/GET operation, send_flag and recv_flag are also specified. PUT transfers size bytes of data from laddr in local memory to raddr in the remote node's memory. At the sending node, send_flag is incremented when sending is completed, and at the receiving node, recv_flag is incremented when receiving is completed. ack specifies whether the sender expects an acknowledge packet. Since the PUT operation is nonblocking, in other words, it does not wait for the completion of data transfer and does not copy the sending data, programs can access the sending area during sending. send_flag is used to protect these areas. GET transfers data from raddr in destination node node_id to laddr in local memory. send_flag and recv_flag are used in the same way as for PUT. Stride data transfers are also important for parallelizing compilers. The one-dimensional stride functions, for example, are specified as follows:

put_stride(node_id, raddr, laddr, ack, send_flag, recv_flag,
           send_item_size, send_cnt, send_skip,
           recv_item_size, recv_cnt, recv_skip)
get_stride(node_id, raddr, laddr, send_flag, recv_flag,
           send_item_size, send_cnt, send_skip,
           recv_item_size, recv_cnt, recv_skip)

send_item_size is the item size at the sending node, send_cnt is the number of items, and send_skip is the skip size between items. recv_item_size, recv_cnt, and recv_skip are the item size, number of items, and skip size between items at the receiving node. Figure 3 shows an example of put_stride().
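The following sketch shows how a program might use put() with flag-based completion detection, following the parameter lists above. The exact C types, the flag-address convention, and the buffer names are assumptions made for the illustration.

/* Sketch: nonblocking PUT with flag-based completion detection.
 * The prototype's C types are assumed; 0 for a flag or ack argument is
 * taken to mean "not used", as described for the AP1000+ later on. */
extern void put(int node_id, void *raddr, void *laddr, unsigned size,
                volatile int *send_flag, volatile int *recv_flag, int ack);

#define N 4096
static double buf[N];
static volatile int send_done;  /* incremented when the send completes */

void put_block(int dest, void *remote_addr)
{
    int before = send_done;

    /* nonblocking: returns immediately and does not copy buf */
    put(dest, remote_addr, buf, sizeof(buf), &send_done, 0, 0);

    /* ... overlap computation here ... */

    /* buf may be reused only after the send flag has been incremented */
    while (send_done == before)
        ;
}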

Figure 3: PUT stride parameters (a PUT gathering send_cnt = 3 items of send_item_size separated by send_skip at the sender and delivering them as recv_cnt = 2 items of recv_item_size separated by recv_skip at the receiver)

3.2 Mechanisms Implemented in Hardware

The following mechanisms should be implemented in hardware: a low-overhead interface, nonblocking command issue, protection, independent message handling, flag updating combined with data transfer, and fast and flexible barrier synchronization and global reduction.

To keep the PUT/GET overhead low, the PUT/GET parameters for DMA must be set at the user level. If DMA cannot be activated by the user, a program must use a system call. User access to DMA is important, because the system call overhead is large. A program may specify an illegal address, and if commands are issued at the user level, the operating system cannot detect the error. Hence, the hardware must check for illegal addresses.

PUT/GET communications must be nonblocking, i.e., the processor must be able to execute the next instruction without waiting for sending to complete. To achieve this, command queues, which buffer PUT/GET requests, are required. Since a program may issue too many PUT/GET requests for a queue to handle, a mechanism to handle queue overflow is required. Not only the send queue but also the reply queue must support overflow handling, because many GET requests may arrive. Message handling must be independent of processor execution. For sending messages, the message handler must extract a request from the send queue, activate the send DMA, and update the flag when sending is complete. For receiving a PUT message, the message handler must analyze the header, activate the receive DMA, and update the flag when receiving is complete. For a GET message, the message handler must reply to the GET request automatically. Flag synchronization combines detection of the completion of communication with data transfer. Hence, the synchronization mechanism updates the flag when data sending and receiving are complete. Furthermore, to check the arrival of multiple messages, the flag value is incremented; the flag update mechanism, therefore, must be realized by a "fetch and increment." The mechanism must also support barrier synchronization and global reduction both for all nodes and for specific groups of nodes.

4 AP1000+ Architecture

The AP1000+, an enhancement of the AP1000 [10, 20], is a distributed-memory highly parallel computer that supports the communication mechanisms required by parallelizing compilers. Figure 4 shows the AP1000+ system configuration and Figure 5 shows the processing element (cell) configuration. The AP1000+ system consists of 4 to 1024 processing elements, called cells. The host is a Sun workstation. Cells are connected by three independent networks: a two-dimensional torus network, or T-net, for point-to-point communication between cells; a broadcast network, or B-net, for broadcast communication and data distribution and collection; and a synchronization network, or S-net, for barrier synchronization. Each cell consists of a SuperSPARC, DRAM, and network devices. The SuperSPARC operates at 50 MHz and has a 36-kilobyte write-through cache. Each cell has 16 or 64 megabytes of DRAM, configured as single inline memory modules (SIMMs). The routing controller (RTC) was developed to implement the T-net. The bus interface (BIF) was developed to implement the S-net and B-net. The MSC+ and MC were developed for the AP1000+. The message controller (MSC+) provides an interface between cells and networks to support PUT/GET operations. The memory controller (MC) controls the V-Bus between the SuperSPARC and memory. The RTC, BIF, cabinets, and three networks are the same as those used by the AP1000. Options such as distributed disk and video (DDV) and the numerical computation accelerator (NCA), which were developed for the AP1000, can be used with the AP1000+. Table 1 shows the AP1000+ specifications.

Table 1: AP1000+ specifications
Processor:              SuperSPARC (50 MHz)
Processor performance:  50 MFLOPS
Memory per cell:        16 or 64 megabytes
Cache per cell:         36 kilobytes, write-through
System configuration:   4 - 1024 cells
System performance:     0.2 - 51.2 GFLOPS

Figure 4: System configuration (host SPARCstation, BIF, B-net, S-net, T-net, cells with RTCs, and DDV/HDTV and NCA options)

Figure 5: Cell configuration and data path of MSC+ and MC (components labeled in the figure include the SuperSPARC with D$/I$, the MSC+ send and reply queues, send and receive controllers, and DMACs, the MC with TLB, walker, flag incrementer, and DRAM controller for SIMM memory on the LBus, and the BIF and RTC; link rates are 25 MB/s x 4 for the T-net and 50 MB/s for the B/S-net)

4.1 Implementation of PUT/GET

User interface  PUT/GET operations are invoked by writing parameters to the send queue in the MSC+. When a program uses PUT/GET, it writes the parameters one by one to a special address. The MSC+ reads the command from this address. Since the MSC+ knows the number of parameters for each command, it activates the send DMA controller after the last parameter is written. The program can, therefore, issue a PUT/GET operation by executing only a few store instructions, and the PUT/GET overhead is small. Since PUT/GET operations require 8-word parameters, the overhead of a PUT/GET is the time for 8 store instructions, in other words, 8 clock cycles. The send DMA controller can send from 1 word (4 bytes) to 1 megaword (4 megabytes) of data in a single operation.
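To make the user-level issue mechanism concrete, the following sketch stores the eight parameter words of a PUT to a memory-mapped address decoded by the MSC+. The address 0x90000000, the word order, and the command word are invented for the illustration; the paper does not give the MSC+ register layout.

/* Illustration only: issuing a PUT by storing its 8 parameter words to a
 * special address (the address and word order below are hypothetical). */
#include <stdint.h>

#define MSC_SEND_QUEUE ((volatile uint32_t *)0x90000000u)  /* hypothetical */

static inline void issue_put(uint32_t command, uint32_t dest_cell,
                             uint32_t raddr, uint32_t laddr, uint32_t size,
                             uint32_t send_flag_addr, uint32_t recv_flag_addr,
                             uint32_t ack)
{
    /* eight stores; after the last one the MSC+ starts the send DMA */
    *MSC_SEND_QUEUE = command;
    *MSC_SEND_QUEUE = dest_cell;
    *MSC_SEND_QUEUE = raddr;
    *MSC_SEND_QUEUE = laddr;
    *MSC_SEND_QUEUE = size;
    *MSC_SEND_QUEUE = send_flag_addr;
    *MSC_SEND_QUEUE = recv_flag_addr;
    *MSC_SEND_QUEUE = ack;
}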

Stride data transfer  The AP1000+ supports one-dimensional stride data transfer as a compromise between the hardware cost of implementing higher-dimensional stride data transfer and the processing overhead of one-dimensional stride data transfer. We expect that higher-dimensional stride data transfers can be done efficiently by repeating one-dimensional stride data transfers, as long as the overhead of each one-dimensional stride transfer is very small. Stride data transfers are performed in the same way as normal PUT/GET operations. This means the overhead of a stride data transfer is also the cost of a few store instructions.

MMU and protection  The program specifies a logical address for the PUT/GET operation. Using the MMU in the MC, the MSC+ converts the logical address to a physical address. The MC has a translation lookaside buffer (TLB), which is direct-mapped and has 256 entries for 4-kilobyte pages and 64 entries for 256-kilobyte pages. The MSC+ can, therefore, quickly obtain the converted address from the MMU. Since the program can specify an illegal address for the PUT/GET operation, a protection mechanism is required. If an illegal address is specified, a page fault occurs, which causes a program interrupt. If a page fault happens in a remote cell during message transfer, the MSC+ interrupts the operating system and pulls the remaining message from the network.

Queues and queue overflows  The MSC+ contains five queues in its own RAM. There are three sending queues: one for PUT and GET requests issued by the user, one for PUT and GET requests from the system, and one for remote access. PUT and GET requests require two queues so that when the system uses PUT/GET operations, the MSC+ does not need to save and restore the user's entries. Remote access uses its own queue because the processor waits for a remote load, so remote accesses must be given priority over PUT/GET operations. There are two reply queues, one for GET replies and one for remote load replies. Remote load replies precede GET replies.

Since the maximum queue size is 64 words, it is possible that an MSC+ queue may become full. In this case, the MSC+ is able to automatically write the data directly to a previously allocated buffer in DRAM. All data written by the processor after the queue becomes full is written into the buffer in DRAM. When the queue empties, the MSC+ interrupts the operating system, which then loads data from the buffer in DRAM back into the queue in the MSC+. If the buffer in DRAM becomes full, the MSC+ interrupts the operating system, which then allocates a new buffer. Reply queues have the same mechanism.

Message handling  When sending a PUT, the send controller in the MSC+ reads parameters from the send queue and activates the send DMA. The MSC+ of the receiving cell analyzes the header of the message and activates the receive DMA to write the data directly. Cache invalidation is done at the time of message reception. This means that data reception from the network does not prevent user program execution. For GET operations, request messages are transferred in the same way. At the destination cell, the MSC+ analyzes the GET request message and enters it into the reply queue. Afterward, the same mechanisms as for PUT transfer the data to the requesting cell.

Flag update combined with data transfer  Updating flags is combined with the completion of sending or receiving. Flags are normal variables specified in the user program, and their addresses are logical. During sending, the MSC+ requests that the MC increment a flag, whose address is held in the queue, when the send DMA operation is completed. The MC converts the flag address from logical to physical using its own MMU and increments the flag value. The MC has an incrementer, which can fetch and increment. During reception, the MSC+ requests that the MC increment a flag when the receive DMA operation is completed. In both cases, if a flag address is specified as 0, the MSC+ does not update the flag. This function is useful because some PUT/GET operations use either the send or the receive flag, or neither.

Acknowledge packet  If a PUT requires an acknowledgment from the receiving cell, the program issues a GET operation after the PUT operation and uses the GET reply packet as the acknowledgment. This method exploits a characteristic of the T-net, which uses static routing and delivers messages in order. If address 0 is specified as the destination address, the GET packet goes and comes back without copying any data in remote memory. The acknowledge flag is updated when the GET reply packet arrives. By checking the value of this flag, the sending cell can detect the completion of PUT operations in remote cells. The AP1000+ does not use direct acknowledgment by the actual PUT packet, as a compromise between hardware cost and the usefulness of acknowledgment. Direct acknowledgment would require one more address field for the acknowledge flag in the PUT message format, buffers in receiving nodes to store the acknowledge flag address, origin cell ID, and context ID, and a mechanism to send the acknowledge packet when the receive DMA is completed. Usually, the data parallel programming model uses acknowledgment and barrier synchronization (Ack & Barrier) to detect the completion of all data transfers in all nodes. In this case, even if other messages enter the network between the PUT and the GET used for acknowledgment, delaying the arrival of the acknowledgment, the time to reach the barrier point is almost the same. This is because no node can reach the barrier point until all data transfers and acknowledge packets are completed. In other words, the order in the network may change, but the time to reach the barrier point does not change.

4.2 Distributed Shared Memory

The SuperSPARC supports 64 gigabytes of physical address space (36-bit addresses). Each cell uses half of this address space for local memory space and the other half for distributed shared memory space. The 32 gigabytes of shared memory space is divided equally into blocks corresponding to each cell. For example, if the system consists of 1024 cells and the local memory size is 64 megabytes, the block size becomes 32 megabytes and half of the local memory is mapped into the shared space, where it can be accessed by normal loads and stores from other cells. To access the shared memory space, the MSC+ generates parameters for remote load/store and writes them to the remote access queue. The MSC+ translates the upper 10 bits of physical addresses accessed by the processor to destination cell IDs and the remaining bits to local addresses at the destination cell. The differences between PUT/GET and remote load/store are the issuer, the data addresses, the size, and the blocking behavior. PUT/GET is issued by software and any data size can be specified. Remote load/store is issued by hardware, and data is transferred when the shared memory space of a cell is accessed by SuperSPARC instructions such as LOAD and STORE. PUT/GET and remote store are nonblocking, while remote load is blocking. To detect the completion of remote stores, acknowledge packets are used, which are sent automatically by the MSC+.
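The address split described above can be sketched as follows. The paper gives the bit counts (36-bit physical addresses, 10 bits of cell ID, 32-megabyte blocks) but not the exact layout, so the assumption here is that the top bit selects the shared half and the next 10 bits select the cell.

/* Sketch of the shared-memory address decomposition (layout assumed). */
#include <stdint.h>
#include <stdio.h>

#define SHARED_BIT  ((uint64_t)1 << 35)            /* upper half of 64 GB   */
#define CELL_SHIFT  25                             /* 32-MB block per cell  */
#define CELL_MASK   ((uint64_t)0x3FF)              /* 10 bits -> 1024 cells */
#define OFFSET_MASK (((uint64_t)1 << CELL_SHIFT) - 1)

int main(void)
{
    uint64_t paddr = SHARED_BIT | ((uint64_t)42 << CELL_SHIFT) | 0x1234;
    if (paddr & SHARED_BIT) {
        unsigned cell   = (unsigned)((paddr >> CELL_SHIFT) & CELL_MASK);
        uint64_t offset = paddr & OFFSET_MASK;
        printf("remote access: cell %u, local offset 0x%llx\n",
               cell, (unsigned long long)offset);  /* cell 42, offset 0x1234 */
    }
    return 0;
}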


Write through page  The AP1000+ supports a so-called write-through page to efficiently support shared memory programming. This mechanism uses part of local memory as a cache for the distributed shared memory space, and enables the replacement of remote accesses with local accesses. A more detailed discussion of write-through pages is beyond the scope of this paper.

4.3 SEND/RECEIVE Model

The AP1000+ supports the SEND/RECEIVE communication model in addition to the PUT/GET model, because there are many programs which use SEND/RECEIVE communications. The AP1000+ has receive buffers, called ring buffers, in main memory. A ring buffer is a circular buffer which stores messages from the interconnection network. SEND is executed using the same hardware mechanism as PUT, specifying a ring buffer for the destination address instead of a specific address in a remote cell. RECEIVE functions search the ring buffer and copy the message into the user's memory area. If the ring buffer becomes full, the MSC+ interrupts the operating system, which then allocates a new buffer.
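A minimal sketch of a RECEIVE that scans such a ring buffer is shown below. The slot layout and the matching rule (by sender and tag) are assumptions made for the illustration; the paper does not define the message header format.

/* Sketch: RECEIVE scanning a ring buffer and copying a matching message. */
#include <string.h>
#include <stddef.h>

typedef struct {
    int    valid;      /* slot holds an undelivered message   */
    int    sender;
    int    tag;
    size_t length;
    char   payload[1024];
} RingSlot;

#define RING_SLOTS 256
extern RingSlot ring[RING_SLOTS];  /* filled by the MSC+, like a PUT target */

/* returns bytes copied, or -1 if no matching message is present */
long receive(int sender, int tag, void *user_buf, size_t max_len)
{
    for (int i = 0; i < RING_SLOTS; i++) {
        RingSlot *s = &ring[i];
        if (s->valid && s->sender == sender && s->tag == tag) {
            size_t n = s->length < max_len ? s->length : max_len;
            memcpy(user_buf, s->payload, n);  /* the extra copy SEND/RECEIVE pays */
            s->valid = 0;                     /* free the slot */
            return (long)n;
        }
    }
    return -1;
}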

4.4 Communication Registers

The AP1000+ has special registers exclusively for communication. 128 4-byte communication registers for each MC are allocated in shared memory space. The AP1000+ uses these registers for fast barrier synchronization and global reduction. Communication registers can be accessed in 4- or 8-byte blocks. Each communication register has a present bit (p-bit). The p-bit is set to 1 when data is stored and to 0 when data is read. If the p-bit is 0, the processor automatically retries loading the communication register until the p-bit becomes 1. This mechanism is useful for keeping the overhead low when waiting for data, because the processor does not need to use software polling.

4.5 Barrier Synchronization and Global Reduction

The AP1000+ uses the synchronization network (S-net) in hardware and the communication registers in software for barrier synchronization. Barrier synchronization in software can be done in the same way as a global summation. Software synchronization can be used for barrier synchronization of specific groups of cells. The AP1000+ supports global reduction for both scalar and vector data. Global reductions for scalar data are performed using communication registers. Since communication registers are allocated in shared memory space, sending data from one communication register to another can be performed with a simple store instruction to the appropriate address. If the sending addresses are calculated in advance using an algorithm such as a binary tree or crossover exchange, a global reduction can be achieved simply by repeating store, execute, and load instructions. Global reductions for vector data use a ring buffer with SEND/RECEIVE. Each cell sends vector data to the ring buffer of the destination cell. The receiving cell operates on the data in the ring buffer directly, and sends the data to the next cell. For global reduction, the received data is used only once, so the receiving cell does not need to copy this data out of the ring buffer. This eliminates the message copy overhead, which is the main problem associated with the SEND/RECEIVE model. Another benefit is that the ring buffer can handle any vector size.
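As one possible instance of the store/execute/load pattern mentioned above, the following sketches a scalar global sum using a recursive-doubling exchange over communication registers. The helper functions, the choice of recursive doubling, and the power-of-two cell count are all assumptions; the paper does not define a communication-register API.

/* Sketch: scalar global sum over communication registers (assumed helpers). */
extern void   comm_reg_write(int cell, int reg, double value); /* plain store  */
extern double comm_reg_read(int reg);  /* load; hardware waits on the p-bit    */
extern int    my_cell_id(void);
extern int    num_cells(void);         /* assumed to be a power of two here    */

double global_sum(double local_value)
{
    int me = my_cell_id();
    double sum = local_value;

    for (int step = 1; step < num_cells(); step <<= 1) {
        int partner = me ^ step;                    /* exchange partner        */
        comm_reg_write(partner, /*reg=*/step, sum); /* store partial sum       */
        sum += comm_reg_read(/*reg=*/step);         /* load partner's partial  */
    }
    return sum;  /* every cell ends with the full sum */
}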

5 Simulation

A trace-driven simulator for a message-passing parallel computer, the message level simulator (MLSim), has been developed to study communication behavior [9]. MLSim was implemented on the AP1000. The execution traces of real applications on the AP1000 are input to MLSim. The information was collected from probes inserted at various points in the operating and run-time systems. Events are monitored and stored in a trace buffer along with time and message information. We inserted probe points at the entries and exits of the communication and synchronization library and the interrupt service routine. MLSim simulates communication behavior based on the trace information and a parameter file, as shown in Figure 6, preserving the order of message communications and barrier synchronization between processors with a delay parameter. The computation parameter is given as a ratio to SPARC performance and the communication parameters are given in microseconds. MLSim calculates the time needed for message handling, barrier synchronization, and computation from the input parameters. MLSim can calculate such statistics as user time, idle time, communication overhead time, transferred message size, communication distance, and the number of communication events. MLSim can be tuned to match the performance of real machines by varying the communication parameters.

Figure 6: Some MLSim parameters
                       AP1000 model   AP1000+ model
computation_factor     1.00           0.125
network_prolog_time    0.16           0.16
network_delay_time     0.16           0.16
put_prolog_time        20.0           1.00
put_epilog_time        15.0           0.00
put_msg_time           0.05           0.05
put_dma_set_time       15.0           0.50
put_msg_post_time      0.04           0.00
intr_rtc_time          20.0           0.00
recv_msg_flush_time    0.04           0.00
recv_dma_set_time      15.0           0.50
(computation_factor is relative to the SPARC for the AP1000 and the SuperSPARC for the AP1000+; communication parameters are in microseconds)

5.1 PUT Communication Model

Figure 7 shows the PUT communication model for MLSim on the AP1000, which is based on interrupts. The PUT communication overhead on the AP1000 is as follows:

Send overhead = put_prolog_time + put_enqueue_time + put_msg_post_time × msg_size + put_dma_set_time + put_epilog_time

(Here, "post" mirrors cache data to the memory.)

Interrupt reception overhead = intr_rtc_time + recv_msg_invalid_time × msg_size + recv_dma_set_time

On the AP1000+, all a program has to do is write parameters into the MSC+ queue. The MSC+ sends the data automatically, without blocking processor execution. Since the AP1000+ uses a write-through cache, the MSC+ does not need to reflect the data in the cache to memory. Therefore, the overhead of PUT communication on the AP1000+ is only put_enqueue_time on sending. The MSC+ of the receiving cell analyzes the message header and activates the receive DMA to write the data. Since the cache is invalidated at the time of message reception, the AP1000+ does not need to invalidate the cache before data reception. This means that data reception from the network does not prevent user program execution. We set the parameters of MLSim based on the hardware characteristics of both the AP1000 and the AP1000+. The simulation model does not include a mechanism to treat queue overflow; MLSim assumes that queues are long enough.

5.2 Applications

We present simulation results for a collection of scientific programs. This collection includes EP, SP, CG, and FT from the NAS parallel benchmarks [1] and TOMCATV from the SPEC benchmarks [24], all in VPP Fortran [11], and matrix multiplication and a scaled conjugate gradient (SCG) solver in C. The five applications in VPP Fortran are compiled and their run-time system uses PUT/GET. The two applications in C use PUT/GET primitives directly in the source code.

EP generates 2^28 pseudo-random numbers and has no communication.

SP computes the solution of scalar pentadiagonal equations. A total of 400 iterations are performed on the 64 × 64 × 64 input array. MLSim simulated the first 10 iterations because of trace buffer limitations.

CG uses the conjugate gradient method to solve a linear system of equations. The order of the input matrix is 1400, with 78184 nonzero elements.

FT is a 3-D Fourier transform. The input array size is 256 × 256 × 128. Six iterations of the FFT were calculated for this benchmark.

TOMCATV is a vectorized mesh generation program. For this program, two types of simulations were done: one with stride data transfers, the other without stride data transfers, meaning each item was sent one by one. MLSim simulated the first 10 iterations because of trace buffer limitations.

MatMul calculates A × B = C. The matrix to be calculated is a dense 800 × 800 matrix.

SCG solves Poisson's differential equation using the scaled conjugate gradient method, in which the coefficient matrix is scaled by its diagonal elements. The matrix to be solved is a sparse 40000 × 40000 matrix.

5.3 Simulation Results

The computation power of the SuperSPARC is assumed to be eight times that of the SPARC (we estimated this figure from a comparison using a simple matrix multiplication program on a SPARCstation 1+ and a SPARCstation 10; the processor in the SPARCstation 1+ is the same as that in the AP1000, and that in the SPARCstation 10 is the same as that in the AP1000+), and the other communication parameters were estimated from hardware specifications for MLSim. Table 2 shows the performance predictions compared to the AP1000. The two models simulated are the AP1000+ and an AP1000 model whose processor speed is eight times faster and whose message handling is done by software. Table 3 lists the application statistics for MLSim. Figure 8 shows the percentage of execution time, communication overhead, and idle time, normalized to the execution time of the AP1000+. For applications in VPP Fortran, the run-time system time is included.

Figure 7: PUT communication model on the AP1000 (user, system, and network activity for one PUT, with interrupt-driven send completion and receive handling). The model parameters are: (1) put_prolog_time, (2) put_enqueue_time, (3) put_msg_post_time × msg_size, (4) put_dma_set_time, (5) put_epilog_time, (6) send_complete_time, (7) send_complete_flag_time, (8) intr_rtc_time, (9) recv_msg_invalid_time × msg_size, (10) recv_dma_set_time, (11) recv_complete_time, (12) recv_complete_flag_time, (13) flag_check_prolog_time, (14) flag_check_epilog_time, (15) network_prolog_time, (16) network_delay_time × distance, (17) network_msg_time × msg_size, (18) network_epilog_time.

Execution time is processor execution time, excluding run-time system time, library overhead, and idle time.

Run-time system is the time for the VPP Fortran run-time system to calculate addresses for PUT/GET operations, find stride data patterns, and so on. This time does not include communications.

Overhead is the time spent executing communication library routines, excluding idle time. While time is spent in communication libraries, processor execution is blocked.

Idle time is the time spent waiting for messages in the receive function of SEND/RECEIVE, waiting for flag updates in the flag check function of PUT/GET, and waiting for the establishment of barrier synchronization.

Table 2: Performance simulation compared to the AP1000
              Application   AP1000+   AP1000*
VPP Fortran   EP            8.00      8.00
              CG            4.78      3.42
              FT            7.12      4.14
              SP            7.62      6.05
              TC st         7.83      6.42
              TC no st      11.55     2.20
C language    MatMul        8.27      6.22
              SCG           7.96      5.17
* AP1000 with the SPARC replaced by a SuperSPARC

For the TOMCATV program, "TC st" indicates TOMCATV using stride data transfers and "TC no st" indicates TOMCATV without stride data transfers. Figure 8 shows four bars for TOMCATV, normalized to the AP1000+ model with stride data transfer.

5.4 Analysis of the Simulation Results

Sustained performance  The average improvement in performance for the AP1000+ is almost the same as the rate of the processor improvement, but that for the second model is only 70% of the processor improvement. EP has no communication, so both models achieve a rate equal to the processor improvement. CG shows the worst-case improvement and has high overhead, because large vector global summations dominate its execution. SEND operations are used for global reduction of vector data. SEND operations are blocking, that is, they wait in the SEND library for the data transfer to complete, so a large overhead is introduced. CG performs 390 vector global summations of an array whose vector size is 11200 bytes (1400 × 8). Since communication and computation cannot overlap during global reductions, idle time is introduced. FT and SP use many communication operations (Table 3), but the overhead on the AP1000+ is very small. TOMCATV with stride data transfers uses the fewest communication operations among the applications in VPP Fortran, and many barrier synchronizations. This results in the smallest difference between the two models. For TOMCATV without stride, the two models show the largest difference. The number of communications becomes 257 times larger and the message size one 257th that of TOMCATV using stride. The two C language applications use PUT/GET directly and overlap communication with computation, and therefore almost achieve peak processor performance. The second model, however, introduces very large overhead and idle times. This is because large overhead and interruption times prevent communication and computation from overlapping, although the programmer wrote the programs carefully to overlap them.



Figure 8: Effect of PUT/GET hardware support. For each application (EP, FT, CG, SP, TOMCATV with and without stride, MatMul, and SCG), the left bar shows the AP1000+ and the right bar the AP1000 with the SPARC replaced by a SuperSPARC; each bar is normalized to the AP1000+ execution time and broken down into execution time, run-time system time, overhead, and idle time.

Table 3: Application statistics
Application  PE   SEND     Gop      V Gop    Sync     PUT      PUTS     GET      GETS    Size of
                  per PE   per PE   per PE   per PE   per PE   per PE   per PE   per PE  Msg.
EP           64   0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0     0.0
CG           16   365.6    810.0    390.0    3135.0   390.0    0.0      0.0      0.0     700.0
FT           128  0.0      24.0     0.0      51.0     2048.0   7680.0   9652.0   512.0   1638.4
SP           64   1.0      0.0      1.0      42.0     10880.0  0.0      10710.0  0.0     1355.3
TC st        16   0.0      20.0     0.0      80.0     0.0      37.5     37.5     0.0     2056.0
TC no st     16   0.0      20.0     0.0      80.0     9637.5   0.0      9637.5   0.0     8.0
MatMul       64   0.0      0.0      0.0      64.0     64.0     0.0      0.0      0.0     76800.0
SCG          64   878.1    893.0    0.0      1.0      878.1    0.0      0.0      0.0     1600.0

SEND: point-to-point SEND message; Gop: global operation for a scalar; V Gop: global operation for a vector; Sync: barrier synchronization; PUT: point-to-point PUT message; PUTS: PUT with stride data transfer; GET: point-to-point GET message; GETS: GET with stride data transfer; Size of Msg.: average message length for each PUT/GET (bytes), not counting GETs used for acknowledgment.

Idle time  The AP1000+ model shows smaller idle times. This is because all applications are based on the data parallel programming model and the load balance is good. In other words, communication and computation overlap, and the waiting time for the establishment of barrier synchronization is small.

Run-time system  The run-time system overhead is 2-3% for CG, FT, and SP, 7% for TOMCATV with stride, and 24% for TOMCATV without stride. The variation in the number of communication operations causes the differences in run-time system overhead.

Bulk data transfer  As Table 3 shows, the average message size of PUT/GET is very large. It is important, therefore, to send such large data with one transfer command, without dividing it into small packets.

Communication overhead  The communication overhead of the AP1000+ is less than 5% of that of the second model, except for CG. This is because the AP1000+ has message handling hardware, while the AP1000 uses software based on interrupts. The AP1000+ also improves the user interface, enabling commands to be issued simply by writing data into the queue at the user level, while the AP1000 must use system calls to activate the DMA.

Stride data transfer  Many stride data transfers are generated by the VPP Fortran compiler in FT and TOMCATV. If the hardware does not support stride data transfer, the number of times put() or get() is called is much larger than the number of put_stride() and get_stride() calls, and performance deteriorates, as shown by TOMCATV. (We cannot simulate FT without stride data transfers because it would use too many PUT/GET operations, which cause a trace buffer overflow.) TOMCATV with stride data transfers is about 50% faster than TOMCATV without stride data transfers on the AP1000+ model.

Barrier synchronization  VPP Fortran compiler code uses a lot of barrier synchronization. Although barrier synchronization introduces both library overhead and idle time, Figure 8 shows that the AP1000+ model has small overhead and idle times. This means the overhead of synchronization is small and the load balance is good.

Global reduction  Both scalar and vector global reductions are used in the VPP Fortran applications. The number of global operations for scalar data is large, so using the communication registers for such reductions is important. Since all applications in VPP Fortran are parallelized by one-dimensional partitioning, they do not use barrier synchronization or global reduction for specific groups of nodes. Group barrier synchronization and group global reductions will be needed if higher-dimensional partitioning is used for optimization.

Acknowledge packet for PUT  The current implementation of the VPP Fortran run-time system requires an acknowledgment for every put() and put_stride(), except for PUTs to the local cell. In this simulation, get() is always used for this purpose. Communication overhead remains small, although this requirement doubles the number of communications. The current implementation of MLSim, however, does not include a queue overflow model. Hence, MLSim cannot detect whether overflow occurs, and if so, how it affects performance. Since no PUT operations except the last PUT to each destination cell need an acknowledgment, the number of get() operations can be decreased dramatically. The VPP Fortran run-time system is now being improved for this purpose.

6 Related Work

Strategies for direct remote data access have been presented previously. Some of these approaches use software, while others use hardware.

Software implementations include the work of von Eicken and others, who reported experiments conducted on the CM-5 and nCUBE/2 with active messages [23]. Split-C, an extension of the C language based on active messages, allows programmers to write PUT/GET operations directly using distributed addresses and to overlap communication and computation [3].

Hardware implementations include distributed shared memory parallel computers. Some are based on message passing, such as the CRAY T3D and Meiko CS-2, while others are based on cache coherence, such as Alewife, FLASH, SHRIMP, and Typhoon. To our knowledge, each switch node of the T3D [17] contains a "block-transfer engine" in addition to a message passing mechanism, which allows it to transfer large blocks of data between local and remote memory, independently of and asynchronously with the PEs. The T3D, however, does not have a mechanism to update flags to confirm the completion of data transfers. The CS-2 [8] realizes a low-overhead interface by writing send commands to the communication co-processor (ELAN) at the user level. The CS-2 supports virtual memory, and address translation is done in ELAN using its own TLB. The unit of transfer, however, is 32 bytes, so overhead becomes larger for larger messages. Some machines have been proposed to integrate cache coherent hardware and bulk data transfer mechanisms, for applications which require high-throughput communications. The MIT Alewife [13, 14] integrates fine-grain communication using cache coherent hardware and bulk data transfer using message passing. The Alewife machine, however, does not support virtual memory. Alewife can send stride data by writing many address and size pairs into message headers, but this type of stride data transfer is unsuitable for large numerical calculations, which use many iterations of stride data, so the overhead becomes large. The goal of the Stanford FLASH [15] is also the integration of cache coherent hardware and message passing mechanisms. Since FLASH uses a programmable protocol processor, various cache coherence protocols and message passing protocols can be used. In message passing, user messages are sent as a series of independent, cache-line-sized messages. It is difficult, therefore, to perform stride data transfers. The SHRIMP from Princeton [2] provides communication by mapping local writes to remote addresses. It is difficult to send messages larger than the page size and impossible to fetch remote data. The Typhoon proposed by Wisconsin [18] can change the cache coherence protocol from user-level software and choose the optimal protocol for each application. All message handling for both cache coherence and message passing is done by software; the handling overhead, therefore, cannot be ignored.

7 Concluding Remarks

We discussed the communication mechanisms required by parallelizing compilers and showed the features of the PUT/GET interface. We described the mechanisms needed to implement the PUT/GET interface in hardware and presented an architecture. A new distributed-memory highly parallel computer, the AP1000+, was developed to support this architecture. We simulated the performance of the AP1000+ with a message level simulator using scientific applications such as the NAS parallel benchmarks. We think the PUT/GET model is suitable for data parallel applications such as those studied in this paper. The PUT/GET interface eliminates buffering and overlaps communication and computation, which improves sustained performance. To exploit the effect of the PUT/GET interface, it is important to minimize the PUT/GET overhead. This means that hardware support of the PUT/GET interface is the appropriate choice. The AP1000+ has sufficient communication mechanisms to extract the SuperSPARC processor performance and strikes a good balance between execution and communication. The AP1000+ does not use caching of remote data because cache coherent hardware is expensive, and we believe that message-passing-based machines with added software cache coherence, such as the AP1000+, have better cost-performance than cache-coherence-based machines with added message passing mechanisms, such as FLASH and Alewife. In the future, we will show that this is true.

8 Acknowledgments

We thank Dr. Ishii, Mr. Shiraishi, Mr. Sato, and Mr. Ikesaka for their helpful suggestions, and Ms. Kaneshiro for the use of the NAS parallel benchmark programs. We are also grateful to our colleagues for their participation in many discussions. We also thank the referees for their helpful comments and suggestions.

References

[1] Bailey, D., Barton, J., Lasinski, T., and Simon, H. The NAS parallel benchmark. Tech. Rep. RNR-91-002 Revision 2, NASA Ames Research Center, Moffett Field, CA 94035, August 1991.
[2] Blumrich, M. A., Li, K., Alpert, R., Dubnicki, C., Felten, E. W., and Sandberg, J. Virtual memory mapped network interface for the SHRIMP multicomputer. In the 21st Annual International Symposium on Computer Architecture (April 1994).
[3] Culler, D. E., Dusseau, A., Goldstein, S., Krishnamurthy, A., Lumetta, S., Eicken, T., and Yelick, K. Parallel programming in Split-C. In Supercomputing '93 (Nov. 1993).
[4] Hagiwara, J., Kaneshiro, S., Doi, T., Iwashita, H., and Shindo, T. An implementation of HPF compiler and evaluation on AP1000. In Summer Workshop on Parallel Processing '94 (July 1994).
[5] Hayashi, K., and Horie, T. Improvement of parallel programs performance by Active Messages. In Summer Workshop on Parallel Processing '93 (Aug. 1993), pp. 129-136. 93-PRG13-17.
[6] High Performance Fortran Forum. High Performance Fortran Language Specification Version 1.0, May 1993.
[7] Hiranandani, S., Kennedy, K., and Tseng, C. Compiler optimizations for Fortran D on MIMD distributed-memory machines. In Supercomputing '91 (1991), pp. 86-100.
[8] Homewood, M., and McLaren, M. Meiko CS-2 interconnect Elan - Elite design. In Hot Interconnects '93 (August 1993), pp. 2.1.1-4.
[9] Horie, T., Hayashi, K., Shimizu, T., and Ishihata, H. Improving AP1000 parallel computer performance with message communication. In the 20th Annual International Symposium on Computer Architecture (May 1993), pp. 314-325.
[10] Ishihata, H., Horie, T., Inano, S., Shimizu, T., and Kato, S. An architecture of highly parallel computer AP1000. In IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing (May 1991), pp. 13-16.
[11] Kaneshiro, S., and Shindo, T. NAS Parallel Benchmark implementation and evaluation using VPP Fortran on the AP1000. In Summer Workshop on Parallel Processing '94 (July 1994).
[12] Koelbel, C., and Mehrotra, P. Compiling global name-space parallel loops for distributed execution. In IEEE Transactions on Parallel and Distributed Systems (1991), pp. 440-451.
[13] Kranz, D., Johnson, K., Agarwal, A., Kubiatowicz, J., and Lim, B. Integrating message-passing and shared-memory: Early experience. In Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (1993), ACM, pp. 54-63.
[14] Kubiatowicz, J., and Agarwal, A. Anatomy of a message in the Alewife multiprocessor. In International Conference on Supercomputing (1993), pp. 195-206.
[15] Kuskin, J., Ofelt, D., Heinrich, M., Heinlein, J., Simoni, R., Gharachorloo, K., Chapin, J., Nakahira, D., Baxter, J., Horowitz, M., Gupta, A., Rosenblum, M., and Hennessy, J. The Stanford FLASH multiprocessor. In the 21st Annual International Symposium on Computer Architecture (April 1994).
[16] Miura, K., Takamura, M., Sakamoto, Y., and Okada, S. Overview of the Fujitsu VPP500 supercomputer. In COMPCON '93 (Feb. 1993), pp. 128-130.
[17] Oed, W. The Cray Research massively parallel processor system CRAY T3D. Available through ftp from ftp.cray.com (Nov. 1993).
[18] Reinhardt, S. K., Larus, J. R., and Wood, D. A. Tempest and Typhoon: User-level shared memory. In the 21st Annual International Symposium on Computer Architecture (April 1994).
[19] Ruhl, R., and Annaratone, M. Parallelization of FORTRAN code on distributed-memory parallel processors. In International Conference on Supercomputing (1990), pp. 342-353.
[20] Shimizu, T., Horie, T., and Ishihata, H. Low-latency message communication support for the AP1000. In the 19th Annual International Symposium on Computer Architecture (May 1992), pp. 288-297.
[21] Shindo, T., Iwashita, H., Doi, T., and Hagiwara, J. An implementation and evaluation of a VPP Fortran compiler for AP1000. In Summer Workshop on Parallel Processing '93 (Aug. 1993), pp. 9-16. 93-HPC-48-2.
[22] Shindo, T., Iwashita, H., and Okada, S. VPP Fortran for distributed memory parallel computers. Parallel Language and Compiler Research in Japan (in press) (1994).
[23] von Eicken, T., Culler, D. E., et al. Active Messages: a mechanism for integrated communication and computation. In the 19th International Symposium on Computer Architecture (1992), pp. 256-266.
[24] Waterside Associates. The SPEC Benchmark Report. Fremont, CA, Jan. 1990.
