Implementing Communication Latency Hiding in High-Latency Computer Networks

Volker Strumpen¹ and Thomas L. Casavant²

¹ Institute for Scientific Computing, ETH Zentrum, CH-8092 Zurich
² Dept. of Electrical and Computer Engr., University of Iowa, Iowa City, IA 52242

Abstract. We present a latency hiding protocol for asynchronous message passing in UNIX environments. With this protocol, distributed parallel computing can be used to solve applications that can be structured so that useful computation overlaps communication, more efficiently than is possible with current standard technologies. To maintain portability, our protocol is layered on top of the Berkeley socket interface and the TCP/IP protocol. We present experimental data that validate our model of latency hiding and demonstrate the capability of our implementation.

1 Introduction

Distributed parallel computing is currently growing in acceptance among engineers and scientists. We denote parallel computing in heterogeneous computer networks as distributed parallel computing. In this paper, we explore techniques for implementing and using software platforms that organize such configurations, which may consist of large numbers of workstations located not only within a single local area network but also distributed over long-distance networks. Usually, such a software platform provides the user with a parallel virtual machine. This is also the name of a project, PVM [8], which is often considered a de-facto standard in this field. A newly emerging standard, MPI [1], is targeted to replace PVM.

However, solving very large problems with high resource requirements for both computation and (especially) communication is generally considered inappropriate for networked computers, because high-latency networks restrict their use to coarse-grained applications with relatively low communication volume. To increase the range of communication-intensive applications that can be solved efficiently by network computing, we have developed a technique for implementing communication latency hiding in UNIX environments. We informally define communication latency hiding as a technique for increasing processor utilization by transferring data via the network while continuing with the computation at the same time.

Latency hiding is exploited at the hardware level in several recent parallel computer architectures. With workstations, network interfaces are separate pieces of hardware in addition to the CPU, having direct memory access and allowing for concurrent communication and computation. The hardware underlying current workstation technologies thus allows for higher throughput and lower latency than vendor-supplied interfaces provide, but exploiting it requires a new software protocol. We analyze the opportunities and limitations of this approach by:

1. Briefly reviewing our model for the analysis of communication latency hiding by overlapping computation and communication in high-latency networks. A more detailed description is given in [6, 7].
2. Introducing a protocol on top of TCP/IP and the Berkeley socket interface that implements efficient and deadlock-free message passing as well as communication latency hiding in UNIX environments.
3. Presenting experimental results that show that our protocol implements latency hiding efficiently, in accordance with our model.

Previously, we have developed a model of communication latency hiding [6, 7], which shows that hardware can be utilized more efficiently, in particular by applications where communication and calculation can be overlapped.

2 Latency Hiding in UNIX Networks

In this section, we introduce the basic ideas of a generic model for communication latency hiding. Then, we discuss some properties of streams, as implemented in the TCP/IP protocol suite and the Berkeley socket interface [5]. These properties are then exploited in our proposal for implementing portable communication latency hiding in UNIX networks.

2.1 A Latency Hiding Model

Consider a parallel application comprising a loop that contains a calculation phase followed by a communication phase. Such a parallelization structure typically arises in grid computations where the iteration models some time dependency. Usually, within each iteration new data are calculated, and then exchanged among cooperating tasks. With the runtime $t_{seq}$ of one loop iteration of the sequential program and equal distribution of work, each task of the parallelized program with $p$ processors is assumed to require calculation time $t_{calc} = t_{seq}/p$ plus communication time $t_{com}$ for one iteration. Now, if only the fraction $(1-f)$, $0 \le f \le 1$, of the calculation time is required to calculate those data that are to be sent to another task, the loop body can be restructured to allow for overlapping communication and calculation. First, the data to be communicated are calculated in time $(1-f)\,t_{calc}$. Then, these data are launched into the network. Next, the remainder of the task is calculated in time $f\,t_{calc}$. Finally, the pending messages are received. The runtimes of the parallel implementations without latency hiding, $t$, and with latency hiding, $t_{lh}$, can be approximated by:

$$t(p) = t_{calc} + t_{com},$$
$$t_{lh}(p) = (1-f)\,t_{calc} + \max\bigl(f\,t_{calc},\, t_{com}\bigr).$$

We characterize communication latency hiding by means of the gain $G$, which we define as the ratio of the speedups $S_{lh}$ of the implementation with latency hiding and $S$ without latency hiding:

$$S_{lh}(p) = \frac{p}{(1-f) + \max(f,\, t_{com}/t_{calc})}, \qquad S(p) = \frac{p}{1 + t_{com}/t_{calc}}.$$

Introducing the granularity $\gamma = t_{calc}/t_{com}$, we obtain the gain

$$G = \frac{S_{lh}(p)}{S(p)} = \frac{1 + 1/\gamma}{(1-f) + \max(f,\, 1/\gamma)}.$$

In this representation, $f = 0$ corresponds to the case without latency hiding, and $f = 1$ to ideal latency hiding. Figure 1 illustrates speedup and corresponding gain for varying latency hiding degree $f$.
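As a concrete reading of this formula (a numeric example of our own, not taken from the measurements): for $f = 0.5$ and $\gamma = 2$, i.e., $t_{com} = t_{calc}/2$, the communication is completely hidden behind the second half of the calculation, and

$$G = \frac{1 + 1/2}{(1 - 0.5) + \max(0.5,\, 1/2)} = \frac{1.5}{1.0} = 1.5,$$

so the latency hiding version attains the ideal speedup $S_{lh}(p) = p$, while the blocking version reaches only $S(p) = p/1.5$.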

Fig. 1. Dependency of communication latency hiding on granularity: (a) speedup and (b) gain as functions of $\gamma$ (0 to 10) for latency hiding degrees $f$ = 0, 0.1, 0.25, 0.5, and 1

Gain $G$ is bounded by the constant 2, independent of the number of processors employed. Furthermore, with the definition of efficiency $E(p) = S(p)/p$, the gain equals the quotient of the efficiencies of the versions with and without latency hiding, $G = E_{lh}/E$. It should be pointed out that without a high value of gain (i.e., close to 2), employing larger numbers of processors will in general lead to low efficiency when communication times are relatively large. Gain is therefore a secondary objective: maximizing it leads to higher efficiency, and hence to a solution that allows a given number of processors to be used with maximum efficiency or speedup.

2.2 Buffering, Fragmentation and Deadlocks

UNIX provides the system calls read and write to communicate across connection-oriented communication channels by means of streams [5]. For message passing built on top of streams, structured data or different values scattered over the memory have to be marshalled into a contiguous buffer. On the receiving side, a buffer is needed to receive the message. This marshalling becomes a bottleneck for communication if memory bandwidth is too small. Measurements show [6] that relatively fast machines like a Sun SPARCstation10 perform the marshalling in almost negligible time compared to the transmission time of an Ethernet. In contrast, a slower Sun SPARCstation1+ needs up to three times the transmission time for copying the message. In principle, the faster the network relative to processor performance, the more essential an efficient buffer management scheme becomes for communication performance.
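As an illustration of this marshalling step, consider the following minimal sketch (our own code with a hypothetical data layout, not the protocol's implementation): values scattered over memory are copied into one contiguous buffer so that a single write covers the whole message.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch: marshal two scattered arrays into one
 * contiguous buffer so a single write() transmits the message. */
size_t marshal(char *buf, const double *row, size_t n,
               const int *flags, size_t m)
{
    size_t off = 0;
    memcpy(buf + off, row, n * sizeof(double));   /* copy grid row     */
    off += n * sizeof(double);
    memcpy(buf + off, flags, m * sizeof(int));    /* copy status flags */
    off += m * sizeof(int);
    return off;                                   /* message length    */
}

/* Sender side: one system call for the whole marshalled message. */
void send_message(int sock, const char *buf, size_t len)
{
    (void)write(sock, buf, len);  /* blocks until passed to TCP layer */
}
```

The copy in marshal() is exactly the memory-to-memory traffic that dominates on slow machines; on a fast CPU it is negligible next to the Ethernet transmission time.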

The TCP/IP protocol suite relieves the programmer from implementation details such as fragmenting messages into network-dependent maximum transmission units. Surprisingly, fragmenting messages above the TCP/IP transport layer can reduce communication latency: there exists a local minimum of the transfer time for a fragment length of 4 Kbyte. The reason for this optimum fragment size is an implementation detail of the dynamic buffer memory management of UNIX [3]. An mbuf memory buffer stores 112 bytes of data; to buffer larger messages, multiple mbuf data structures are connected as a linked list. If the message length exceeds half of the machine's page size, however, a whole page is mapped into an mbuf structure. Compared with managing a linked list of mbufs, the page-table entry is cheaper in terms of runtime overhead: data are moved without memory-to-memory copies by simply remapping pages.

Asynchronous communication requires buffering of data at the system level. Because these buffers are bounded in UNIX, deadlocks may occur if the communication system calls are blocking. Suppose two processes of an SPMD-styled program send a message to each other before invoking the corresponding receive operation. If both messages are larger than the capacities of the send and receive buffers, both processes block in their send calls: the processes deadlock. Interfaces like Intel's NX message passing system [4] or IBM's External User Interface [2] leave the problem of deadlock avoidance to the programmer. We solve this problem by fragmenting messages above the transport layer into sizes no larger than the send or receive buffers, and by reading pending data into a receive buffer before writing. This frees receive buffer space that the sending process can use to complete its send operation.
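A minimal sketch of this deadlock avoidance rule (our own illustration under simplified assumptions: a single connection, blocking I/O, and a hypothetical helper enqueue() standing in for the protocol's receive bookkeeping): the message is cut into fragments no larger than the socket buffers, and pending inbound data are drained before every write.

```c
#include <sys/select.h>
#include <unistd.h>

#define FRAG 4096  /* fragment size <= socket buffer size */

/* Hypothetical helper: stores arrived fragments until the user
 * posts a matching receive. */
extern void enqueue(const char *data, ssize_t len);

/* Poll the socket and read everything already pending, freeing
 * receive buffer space so the peer's blocked send can complete. */
static void drain_pending(int sock)
{
    fd_set r;
    struct timeval tv = {0, 0};   /* non-blocking poll */
    char tmp[FRAG];
    for (;;) {
        FD_ZERO(&r); FD_SET(sock, &r);
        if (select(sock + 1, &r, 0, 0, &tv) <= 0) return;
        ssize_t n = read(sock, tmp, sizeof tmp);
        if (n <= 0) return;
        enqueue(tmp, n);
    }
}

/* Fragmenting send: never writes more than FRAG bytes at once,
 * and drains inbound data before each write. */
void frag_send(int sock, const char *msg, size_t len)
{
    while (len > 0) {
        size_t k = len < FRAG ? len : FRAG;
        drain_pending(sock);
        (void)write(sock, msg, k);  /* at most one buffer's worth */
        msg += k;
        len -= k;
    }
}
```

If both SPMD processes call frag_send simultaneously, each drains the other's fragments between writes, so neither send buffer can stay full indefinitely and the mutual blocking described above cannot arise.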

2.3 Protocol Support for Latency Hiding

The basic idea underlying our latency hiding implementation is the separation of communication and calculation into two processes. A generic communication process contains the message passing protocol, and the calculation process executes the user program. The effect is an interleaving of communication and calculation scheduled by the UNIX kernel, which can switch the CPU to the user process whenever the transport layer idles. Two mechanisms are necessary to implement latency hiding: both processes must be able to access the message, and they have to synchronize mutually. We utilize a physically shared memory segment [5] for fast message access. Synchronization can be implemented with System V semaphores, signals or synchronizing messages. Experiments showed [6] that on most workstation models it is faster to transfer a short message via a pipe than to use semaphores or signals.

Figure 2 illustrates the steps of our communication protocol. Consider calculation (user) process U1 sending a message to user process U2. Step 1 indicates that U1 has written values into the shared memory array A, or has marshalled a message into array A. U1 initiates the send operation by forwarding a send request via the pipe to communication process K1 (step 2). K1 transfers the message to K2 (step 3), while both user processes can continue their computation. At some point during program execution (step 4), U1 synchronizes the pending send call by sending a sync request to K1; U1 blocks until K1 has signaled that the message has been delivered completely to the TCP protocol layer. At some time, U2 issues a receive request to K2 (step 5). Two cases are distinguished here. First, if this request is issued after the message transfer has already started, K2 allocates buffer space to store the received message. Buffer space (array C) is allocated according to the size of a message fragment (maximum 4 Kbyte), and linked into lists as needed. When U2 sends a sync request to its communication process (step 6), the message has to be copied into the requested shared memory region. In the second case, the receive request (step 5) reaches K2 before the message arrives, which avoids copying: the message is stored directly in user space (array B). The sync request of U2 (step 6) then blocks until the entire message is stored in B. Finally, U2 can access the message in the requested memory region (step 7).
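The skeleton of this two-process arrangement can be sketched as follows (our own simplified rendering: the request format, segment size and reply pipe are assumptions, and the actual TCP transfer inside the communication process is omitted):

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

enum req { REQ_SEND, REQ_RECV, REQ_SYNC };
struct request { enum req kind; size_t offset, length; };

int main(void)
{
    /* Shared segment holding the message data (arrays A/B in Fig. 2). */
    int shmid = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
    char *shm = (char *)shmat(shmid, 0, 0);
    (void)shm;  /* step 1: the user marshals its message here */

    int to_comm[2], to_user[2];   /* request and reply pipes */
    pipe(to_comm);
    pipe(to_user);

    if (fork() == 0) {
        /* Communication process: loop over requests, moving data
         * between shm and the TCP connection (transfer omitted). */
        struct request rq;
        while (read(to_comm[0], &rq, sizeof rq) == sizeof rq) {
            /* ... perform send/recv on shm + rq.offset ... */
            if (rq.kind == REQ_SYNC)
                (void)write(to_user[1], "k", 1);  /* unblock the user */
        }
        _exit(0);
    }

    /* User (calculation) process: post a send, keep computing,
     * then block in sync until the message left the user buffer. */
    struct request rq = { REQ_SEND, 0, 4000 };
    (void)write(to_comm[1], &rq, sizeof rq);   /* step 2: send request */
    /* ... useful computation overlaps the transfer (step 3) ... */
    rq.kind = REQ_SYNC;
    (void)write(to_comm[1], &rq, sizeof rq);   /* step 4: sync request */
    char ack;
    (void)read(to_user[0], &ack, 1);           /* blocks until done    */
    return 0;
}
```

The kernel's scheduler does the rest: while the communication process waits in the transport layer, the CPU is free for the calculation process.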

Fig. 2. Data exchange via shared memory: user processes U1 and U2 issue send, receive and sync requests to communication processes K1 and K2 via pipes; message data reside in shared memory arrays A, B and C, and K1 transfers the message to K2 via TCP/IP (steps 1 through 7)

Three primitives are added to the program notation: non-blocking send and recv calls, and a blocking sync call. The send and recv calls initiate the corresponding message passing by registering message buffers with the system. Before such a buffer area is accessed again, the sync call ensures either that the message has been submitted and the buffer may be overwritten, or that the message has arrived and the buffer contains the expected data. The idea of splitting the send and receive primitives into two calls is not new; it was introduced to utilize hardware support for latency hiding. Intel's NX asynchronous message passing primitives isend, irecv and msgwait are an example [4].
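Rendered as code, the three primitives might look as follows (the names lh_send, lh_recv and lh_sync and their signatures are our hypothetical notation; the paper does not spell them out). The pattern mirrors isend/irecv/msgwait.

```c
#include <stddef.h>

/* Hypothetical signatures; msgid_t is an opaque handle returned
 * by the non-blocking calls and consumed by the blocking sync. */
typedef int msgid_t;
extern msgid_t lh_send(int peer, void *buf, size_t len);  /* non-blocking */
extern msgid_t lh_recv(int peer, void *buf, size_t len);  /* non-blocking */
extern void    lh_sync(msgid_t id);                       /* blocking     */

void loop_body(double *out, double *in, size_t n, int peer)
{
    msgid_t s = lh_send(peer, out, n * sizeof *out);  /* launch data     */
    msgid_t r = lh_recv(peer, in,  n * sizeof *in);   /* register buffer */

    /* ... the remaining fraction f of the calculation runs here,
     * overlapping the transfer ... */

    lh_sync(s);   /* out may now be overwritten          */
    lh_sync(r);   /* in now holds the arrived message    */
}
```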

3 Experimental Results

To illustrate the benefit of our latency hiding implementation, we present experiments performed on two-processor configurations. These experiments investigate the dependency of gain $G$ on granularity. An explicit finite difference solver based on a five-point stencil has been run on a two-dimensional rectangular domain, partitioned into two equal parts in the x-direction and having a fixed y-dimension. Granularity is varied by changing the mesh size in the x-dimension; a sketch of the restructured solver loop is given below.

3.1 Internet Performance

The UNIX send operation (system call write) blocks until the message has been passed to the transport layer (TCP). We illustrate the effect of message size and socket buffer size on performance with two different y-dimensions of 500 and 1000 grid points. Using an 8-byte double precision number per grid point, the resulting message sizes are 4000 Byte and 8000 Byte, respectively. With the default socket buffer size of 4 Kbyte, the send operation terminates immediately after copying the 4000 Byte message into the send buffer. With the 8000 Byte message, the send operation blocks until the first 4 Kbyte part of the message has been transferred via the network and the remaining part has been copied into the send buffer.
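As announced above, the restructured solver loop might look like the following sketch (our own illustration: update_column and the lh_* primitives from the previous sketch are assumed, columns 0 and nx-1 are halo columns holding the neighbours' data, and a two-processor split has only one live neighbour per side). The boundary columns are computed first in time $(1-f)\,t_{calc}$, their transfer is launched, and the interior is computed while the messages are in flight.

```c
#include <stddef.h>

typedef int msgid_t;                         /* from the sketch in Sect. 2.3 */
extern msgid_t lh_send(int peer, void *buf, size_t len);
extern msgid_t lh_recv(int peer, void *buf, size_t len);
extern void    lh_sync(msgid_t id);
extern void    update_column(double *u, double *unew, int i, int ny); /* assumed */

/* One overlapped iteration on a strip of nx columns of ny points. */
void relax(double *u, double *unew, int nx, int ny, int left, int right)
{
    update_column(u, unew, 1, ny);           /* boundary: the (1-f) part */
    update_column(u, unew, nx - 2, ny);

    msgid_t sl = lh_send(left,  &unew[ny],            ny * sizeof(double));
    msgid_t sr = lh_send(right, &unew[(nx - 2) * ny], ny * sizeof(double));
    msgid_t rl = lh_recv(left,  &unew[0],             ny * sizeof(double));
    msgid_t rr = lh_recv(right, &unew[(nx - 1) * ny], ny * sizeof(double));

    for (int i = 2; i < nx - 2; i++)         /* interior: the f part,    */
        update_column(u, unew, i, ny);       /* overlaps the transfer    */

    lh_sync(sl); lh_sync(sr);                /* boundary buffers reusable */
    lh_sync(rl); lh_sync(rr);                /* halo columns now valid    */
}
```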

Fig. 3. Two-processor gain and speedup on the Internet: (a) 500 x-grid points, (b) 1000 x-grid points. Each part plots gain and speedup versus domain size ($\gamma$, 0 to 120) for the latency hiding, blocking send and naive algorithm versions

Figure 3 shows the gain of two implementations, one based on blocking stream socket communication (blocking send), and the other using our latency hiding protocol. The reference measurements are based on a naive implementation that neither separates the send and receive operations nor performs useful computation during the transfer (naive algorithm). As expected, stream sockets and our latency hiding version perform similarly for messages shorter than the socket buffer size (Fig. 3(a)). In the case of 8000 Byte messages, the CPU can be utilized on the sender side primarily during the transfer of the first 4 Kbyte part of the message, and on the receiver side during the phase of the receive call in which the second part of the message is transferred (Fig. 3(b)). Figure 3 also shows the corresponding speedup curves. $\gamma = 1$ corresponds to a domain size of 10 grid points on the x-axis; accordingly, $\gamma = 120$ denotes an x-range of 1200 points. All measurements were performed with one SPARCstation at ETH, Zürich, connected to another SPARCstation at MIT, Cambridge, via the Internet.

3.2 Ethernet Performance

The previous experiments shed light on the dependency of latency hiding performance on message size and socket buffer size. Another suite of experiments has been performed between two SPARCstation10 workstations on an Ethernet. The finite difference solver was run with 100,000 grid points in the y-direction, corresponding to 800,000 bytes per message. This size greatly exceeds the maximum socket buffer size.
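The socket buffer size that governs the blocking behaviour discussed above can be inspected and enlarged with the standard SO_SNDBUF socket option; a small sketch (the 8192-byte value is merely an example, and the maximum a system accepts varies):

```c
#include <sys/socket.h>
#include <stdio.h>

/* Query and enlarge the send buffer: with SO_SNDBUF >= message size,
 * write() returns right after the copy into the kernel buffer,
 * before the data have reached the network. */
void tune_sndbuf(int sock)
{
    int size;
    socklen_t len = sizeof size;
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, &len);
    printf("default send buffer: %d bytes\n", size);

    size = 8192;  /* example: room for an 8000-byte message */
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof size);
}
```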

Fig. 4. Two-processor speedup and gain on the Ethernet: (a) speedup and (b) gain versus domain size ($\gamma$, 0 to 10) for the latency hiding, simultaneous send and ping-pong versions

The reference implementation (ping-pong) operates with pure stream socket communication. To avoid deadlock, this version is programmed with ping-pong styled communication, i.e., two programs are required: one that first sends a message and then receives, and another that first receives a message and then sends. To illustrate the performance gain of the developed protocol without actually utilizing the latency hiding feature, a version in which both processes first send and then receive has been implemented (simultaneous send). In contrast to pure stream socket communication, the processes do not deadlock with our protocol, but achieve higher network utilization. The third version is the latency hiding implementation (latency hiding). The resulting speedup and gain values are presented in Fig. 4. The domain size of $3 \times 100{,}000$ grid points per processor corresponds to $\gamma = 1$. This extreme communication volume demonstrates the performance capabilities of the Ethernet, although it is clearly an unrealistic partitioning of the problem domain. For $\gamma \ge 5$, the memory requirements of the sequential measurements exceed physical memory. Consequently, swapping activities lead to speedup values greater than 10, which are therefore omitted in Fig. 4(a).

4 Discussion

The Internet experiments show that latency hiding increases speedup and efficiency. The speedup reaches its maximum value at smaller granularities than without latency hiding; thus, with latency hiding, applications with finer granularities can be executed more efficiently. Furthermore, almost optimal speedup and efficiency are obtained with latency hiding for $\gamma > \gamma_{opt}$, where $\gamma_{opt}$ is the granularity with maximum gain. This shows that certain applications can run as efficiently on parallel systems with high-latency networks as they potentially could on low-latency parallel machines. It remains to investigate congestion effects in LANs when larger numbers of workstations are employed.

In the Ethernet experiments, the time to calculate the messages cannot be neglected, especially at smaller granularities; here, the approximation of ideal latency hiding does not hold. Furthermore, the ratio of machine performance to network performance is smaller than in the Internet configuration, so the influence of the overhead of the latency hiding implementation increases. Both effects explain the smaller maximum value of gain $G$.

The simplicity and elegance of the presented implementation stems from the use of the UNIX kernel scheduler to switch between the calculation process and the communication process. Although the overhead of process switching is relatively large, especially on the Ethernet, we have demonstrated the validity of the approach for high-latency networks such as the Internet. It should therefore be relatively easy to incorporate this technique into existing systems. PVM [8], for example, already operates with the necessary two-process setting; the PVM daemon could take the part of the communication process.

References

1. MPI: A Message-Passing Interface Standard. Message Passing Interface Forum, April 1994. (Distribution: netlib).
2. V. Bala et al. The IBM External User Interface for Scalable Parallel Systems. Parallel Computing, 20(4):445–462, 1994.
3. S. J. Leffler, M. K. McKusick, M. J. Karels, and J. S. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, Reading, 1989.
4. P. Pierce. The NX Message Passing Interface. Parallel Computing, 20(4):463–480, 1994.
5. R. W. Stevens. UNIX Network Programming. Prentice-Hall, Englewood Cliffs, 1990.
6. V. Strumpen. Communication Latency Hiding — Model and Implementation in High-Latency Computer Networks. Technical Report 216, Departement Informatik, ETH Zürich, June 1994. (WWW: ftp://ftp.inf.ethz.ch/doc/tech-reports/1994/216.ps.Z).
7. V. Strumpen and T. L. Casavant. Exploiting Communication Latency Hiding for Parallel Network Computing: Model and Analysis. In International Conference on Parallel and Distributed Systems, pages 622–627, Hsinchu, Taiwan, December 1994. IEEE.
8. V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315–339, 1990.