Overlapping Communication with Computation in Distributed Object Systems

Françoise Baude, Denis Caromel, Nathalie Furmento, and David Sagnol
SLOOP - Joint Project CNRS / INRIA / University of Nice Sophia Antipolis
INRIA - 2004 route des Lucioles - B.P. 93 - 06902 Valbonne Cedex, France
[email protected]
Abstract. In the framework of distributed object systems, this paper presents the concepts and an implementation of a mechanism for overlapping communication with computation. This mechanism makes it possible to decrease the execution time of a remote method invocation.
1 Introduction

The idea of overlapping communication with computation is attractive but not new. As far as we know, however, it has never been investigated in the area of distributed object-oriented languages based on remote service invocations through method calls, such as RMI [12] in Java or RPC in C/C++ [1]. Optimization of the parameter copying process, as in [15], is a different but complementary approach.

The general idea is that, during a remote service dealing with large data requiring transmission, communication and computation are automatically split into steps with a smaller data volume; it is then only a question of pipelining these steps in order to overlap the current step of the remote computation with the data transmission related to the next step. This requires executing a computation step and a transmission step at the same time; one way to achieve this is to use asynchronous communications. Several problems have to be solved, including:
1. design and implement the elementary mechanisms, mainly data splitting and computation with partial data;
2. make the mechanism as transparent as possible for programmers, while giving them the possibility to guide the data splitting if they wish;
3. try to determine in an automatic way the appropriate size for data packets (i.e. try to estimate the duration of the different steps).

Implementations of this technique have generally been restricted to the field of compiling data-parallel languages for parallel architectures with distributed memory: HPF [2], Fortran D [14], but also LOCCS [6], a library of communication and computation routines. Our contribution is to design, implement, and evaluate the technique within the context of an object-oriented language extended with mechanisms for parallelism and distribution, C++// [4]. Only points 1 and 2 are resolved in this paper; dynamically solving point 3 would require more precise information about the computation (a strategy is developed in [6]).
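To make the expected gain concrete, a rough pipelining model can be written down (the notation below is ours, not taken from the paper's text). With the request split into $n$ pieces, $t_{comm}(i)$ the transmission time of piece $i$, and $t_{comp}(i)$ the computation time on piece $i$:

\[ T_{\mathrm{no\ overlap}} \;=\; \sum_{i=1}^{n} t_{comm}(i) \;+\; \sum_{i=1}^{n} t_{comp}(i), \]
\[ T_{\mathrm{overlap}} \;\approx\; t_{comm}(1) \;+\; \sum_{i=1}^{n-1} \max\bigl(t_{comp}(i),\, t_{comm}(i+1)\bigr) \;+\; t_{comp}(n). \]

When $t_{comp}(i) \ge t_{comm}(i+1)$ at every step, every transmission except the first is hidden behind computation.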
In Sect. 2, various solutions for point 1 are proposed, followed by solutions for splitting requests (point 2). Sect. 3 introduces an implementation of this technique using the C++// language on top of the Schooner library. In Sect. 4, we present some benchmarks. Finally, we conclude in Sect. 5 with the benefits and applications of this technique.
2 Communication/Computation Overlap

This section presents our overlapping technique and the necessary requirements for its implementation.
2.1 Elementary Mechanisms
These mechanisms should resolve point 1: send a request in pieces (independently of the strategy used for splitting); rebuild a partial request in such a way that service execution can start; update a missing part when it arrives, even if service execution has already started; and block the computation if it tries to use a missing piece.

Step for Request Creation. In every system that proposes an RPC mechanism, the remote service request has to contain the method identifier and the different parameters of the call, which are marshalled using a deep copy of the object graph¹. The request is then sent asynchronously. To obtain overlapping, we dissociate the splitting/flattening operations from the sending ones.

Requirement 1. Gain access to the runtime code that sends requests, in order to be able to decide when to send a request piece.

Step for Request Rebuilding. Once it has arrived in the remote system, the request is rebuilt: each parameter is reconstructed with the corresponding data and then the service can start. To implement the overlapping technique, we must be able to put a mark for the missing data. This mark informs the service that the data are temporarily unavailable.

Requirement 2. Gain access to the runtime code that deals with the unmarshalling of the request, in order to manage the marks of missing pieces.

When the remote context receives a new part of a request that has already been partially rebuilt, it has to handle it in a way that is automatic and transparent with respect to the service that is already executing.

Requirement 3. A mechanism that receives and manages messages transparently.
¹ If a field of an object is a reference to a remote object, i.e. a proxy, we just flatten a copy of this proxy.
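As a rough illustration of Requirements 1-3, the following self-contained C++ sketch shows one way a runtime could represent a request split into independently delivered pieces; all names (Fragment, PartialRequest, deliver) are ours, not Schooner's actual interface.

// Minimal sketch: a request is flattened into independently sent pieces,
// and the receiver rebuilds it incrementally, marking pieces that have
// not arrived yet.
#include <cstdio>
#include <map>
#include <vector>

struct Fragment {
    int piece_index;                 // position of this piece in the request
    std::vector<int> payload;        // flattened data (ints for simplicity)
};

class PartialRequest {
    std::map<int, std::vector<int>> pieces_;  // arrived pieces, by index
    int expected_;                            // total number of pieces
public:
    explicit PartialRequest(int expected) : expected_(expected) {}

    // Requirement 3: an incoming fragment is merged transparently,
    // even if the service has already started executing.
    void deliver(const Fragment &f) { pieces_[f.piece_index] = f.payload; }

    // Requirement 2: a missing piece is visible as a mark (here: nullptr).
    const std::vector<int> *piece(int i) const {
        auto it = pieces_.find(i);
        return it == pieces_.end() ? nullptr : &it->second;
    }
    bool complete() const { return (int)pieces_.size() == expected_; }
};

int main() {
    PartialRequest req(2);
    req.deliver({0, {1, 2, 3}});     // first piece arrives: service may start
    std::printf("piece 1 present? %s\n", req.piece(1) ? "yes" : "no (marked)");
    req.deliver({1, {4, 5, 6}});     // late piece fills the mark
    std::printf("complete? %s\n", req.complete() ? "yes" : "no");
}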
Step for Service Execution. The service can run without any problem as long as it does not attempt to access missing data. An automatic and transparent blocking mechanism is required when the service tries to use missing data; in the same way, resumption has to be transparent and automatic. This requires a wait-by-necessity mechanism [3]. Such a mechanism is provided by the classical future mechanism, so our solution is the following: each piece of data that is missing when the request object is instantiated is replaced with a future data type. In order to perform this replacement, preferably in an automatic way, we also require:
Requirement 4. A way of transparently adding future semantics to a class data type.
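The blocking behaviour behind Requirement 4 can be sketched with standard C++ threads rather than the C++// MOP the paper relies on; FutureData and its members below are our own illustrative names. Reading the value blocks the service only until the missing piece has been delivered.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <thread>

template <typename T>
class FutureData {
    std::mutex m_;
    std::condition_variable cv_;
    std::optional<T> value_;
public:
    // Called by the runtime when the late fragment arrives.
    void set(T v) {
        { std::lock_guard<std::mutex> lk(m_); value_ = std::move(v); }
        cv_.notify_all();
    }
    // Called by the service; blocks only if the data is still missing
    // (wait-by-necessity), and resumes automatically on arrival.
    const T &get() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return value_.has_value(); });
        return *value_;
    }
};

int main() {
    FutureData<int> late_parameter;
    std::thread receiver([&] {                 // simulates the late fragment
        late_parameter.set(42);
    });
    std::printf("value = %d\n", late_parameter.get());  // waits if necessary
    receiver.join();
}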
2.2 Strategies for Splitting a Request

This section deals with point 2 presented in the introduction. The crucial idea is to break up the request parameters (by inserting breakpoints) in the way that is most transparent for the programmer. It requires a modification of the marshalling/unmarshalling routines of objects; whether these routines are generic or not, we have to be able to overload them.
Requirement 5. Be able to change the default marshalling/unmarshalling routines.
Strategies can be split into two groups according to whether or not they modify the class of the objects involved in a request.

Without Class Modification. We define a new marshalling routine which, for example, inserts a breakpoint: (1) between each data member of the object graph we want to flatten (given the recursive nature of the flattening, this means that each data item of a fundamental type is separated from the others); or (2) between each level of the object graph (if the marshalling algorithm runs through this graph breadth first). On the other hand, if the language allows a member function to be used for the flatten operation instead of the standard one, a class can define a suitable routine. This can be used to customize the insertion of breakpoints, for example by inserting fewer breakpoints than strategy 1 does (see Code 1).

Example of code 1 (flatten() function with breakpoints inserted manually; the original listing was truncated, so the body below is a reconstruction assuming a Buffer interface with stream insertion and a breakpoint() primitive).

Buffer *Matrix::flatten(Buffer *buff) {
  *buff << this->n << this->m;        // matrix dimensions first
  for (int i = 0; i < this->n; i++) {
    *buff << this->row[i];            // flatten one row of the matrix
    buff->breakpoint();               // manual breakpoint between rows
  }
  return buff;
}
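The rebuild side that such breakpoints enable can be sketched as follows (self-contained C++ with our own Buffer and placeholder names, not the Schooner API): unflattening consumes whatever has arrived, and any field whose bytes are still in transit is left as a "missing" placeholder so the service can start with the pieces already available.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Buffer {                       // hypothetical receive buffer
    std::vector<int> data;
    std::size_t pos = 0;
    bool has_more() const { return pos < data.size(); }
    int read() { return data[pos++]; }
};

struct MaybeInt {                     // stand-in for a future-typed field
    bool present = false;
    int value = 0;
};

// Rebuild n fields; fields past the received prefix stay marked missing.
std::vector<MaybeInt> unflatten(Buffer &b, std::size_t n) {
    std::vector<MaybeInt> fields(n);
    for (std::size_t i = 0; i < n && b.has_more(); ++i)
        fields[i] = {true, b.read()};            // data already arrived
    return fields;                               // the rest become futures
}

int main() {
    Buffer b{{10, 20}};                          // only 2 of 3 fields arrived
    auto fields = unflatten(b, 3);
    for (auto &f : fields)
        std::printf("%s ", f.present ? "ok" : "future");
    std::printf("\n");
}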
[Plot omitted: execution time curves "With Overlapping" and "Without Overlapping" (left axis), and benefit between 0 and 0.9 (right axis), for matrix sizes from 0 to 20000 integers.]
Fig. 2. Execution of the remote service (caller side, s) and obtained benefit as a function of matrix size (number of integers). Direct task-to-task communication between clusters through a LAN.

The overhead of the mechanism can be subdivided into:
- a fixed cost, independent of the data size. It mainly consists of additional function calls (for Schooner and for communication) to manage the delayed parameters (2 in our experiment, m1 and m2). Since the number of parameters of a service invocation is generally small, this cost can be seen as a constant;
- a variable cost, dependent on the data size. This cost is not due to Schooner (there is no additional work dependent on the data size), but to an increase of the transmission time of the matrix m2, because the receiving cluster has to receive parts of m2 and perform computation on m1 at the same time.

We can explain this overhead by watching carefully what happens in Pvm:
- if the Pvm daemon is in charge of the communication: the sender daemon does not send a packet until the previous one has been acknowledged (by default, the acknowledgment window is 1). So, the longer the sending of an acknowledgment is delayed, the later the sending of the subsequent packet of m2 occurs, thus increasing the whole transmission time of m2. In this case, the time overhead depends on the size of m2 (i.e. the number of packets used for its transmission);
- if there is a direct task-to-task link between clusters: the transmission of m2 is stopped because the receiving cluster has not acknowledged the bytes written on its socket by the sending cluster. This behaviour can be observed with tcpdump [13].

To overcome this kind of problem, a solution would be to modify the window size (e.g. by using pvm_setopt() or setsockopt(), as the case may be), but this is not desirable since the solution would then no longer be portable.
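For illustration only, such a window adjustment on a plain BSD socket could look like the sketch below; SO_SNDBUF bounds how much unacknowledged data the kernel will buffer, and the PVM-side equivalent through pvm_setopt() would differ. This is precisely the non-portable tuning the text advises against.

// Illustrative only: enlarging the socket send buffer so that more of m2
// can be written before the receiver acknowledges. Plain POSIX sockets,
// not the PVM interface; rejected above as non-portable.
#include <cstdio>
#include <sys/socket.h>

bool widen_send_window(int sockfd, int bytes) {
    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0) {
        std::perror("setsockopt(SO_SNDBUF)");
        return false;
    }
    return true;
}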
[Plot omitted: execution time curves "With Overlapping" and "Without Overlapping" (left axis, up to 4.5e+07), and benefit between -0.8 and 1.4 (right axis), for matrix sizes from 0 to 20000 integers.]
Fig. 3. Execution of the remote service (caller side, s) and obtained benefit as a function of matrix size (number of integers). Direct task-to-task communication between clusters through a WAN.
5 Conclusion

In both cases (LAN and WAN), we obtain a speedup, since the remote computation and the transmission of the future parameters are overlapped. But the overhead of the mechanism is not insignificant, in particular because of the variable costs. The situations for which the overlapping mechanism is relevant have the following characteristics: (1) the computation time using the already received parameters must be similar to or greater than the transmission time of the subsequent ones; (2) this transmission time must be high enough that a potential increase of it remains insignificant. As for the second point, the speed and the distance of the network link between the two clusters are also significant: whereas in the case of a WAN (see Fig. 3) the gain is more variable, due to a high and variable transmission time, in the case of a LAN (see Fig. 2) it is almost constant.

The main requirement for implementing the overlapping technique in an object-oriented distributed language is to have free access to the transport layer and a MOP for the language. If so, essentially only the flatten and rebuild phases of the remote procedure call need to be modified: the object representing the remote call has to be fragmented into several independently managed pieces, and those phases only need to use a future mechanism. Automatic message processing is required at the runtime support level; its aim is to transparently receive and manage late fragments. Such a mechanism is in widespread use, and is in particular available in Schooner [8], in Nexus [7], and in PM2 [11], all of them acting as low-level runtime supports for parallel and distributed computations.

Future Work. We plan to test other fragmentation policies. At the moment, only the simplest one, which requires late fragments to be objects of future type, has been tested. Other policies might expose new overlapping opportunities in the progress of the remote service.
References

1. A.D. Birrell and B.J. Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems, 2(1):39-59, Feb. 1984.
2. T. Brandes and F. Desprez. Implementing Pipelined Computation and Communication in an HPF Compiler. In Euro-Par'96, J:459-462, Aug. 1996.
3. D. Caromel. Towards a Method of Object-Oriented Concurrent Programming. Communications of the ACM, 36(9):90-102, Sep. 1993.
4. D. Caromel, F. Belloncle, and Y. Roudier. Parallel Programming Using C++, chapter The C++// System, pages 257-296. MIT Press, 1996. ISBN 0-262-73118-5.
5. D. Caromel and D. Sagnol. C++// home page. http://www.inria.fr/sloop/c++ll/
6. F. Desprez, P. Ramet, and J. Roman. Optimal Grain Size Computation for Pipelined Algorithms. In Euro-Par'96, T:165-172, Aug. 1996.
7. I. Foster, C. Kesselman, and S. Tuecke. The Nexus Approach to Integrating Multithreading and Communication. Journal of Parallel and Distributed Computing, 37:70-82, 1996.
8. N. Furmento and F. Baude. Schooner: An Object-Oriented Runtime Support for Distributed Applications. In PDCS'96, 1:31-36, Dijon, France, Sep. 1996. ISBN 1-880843-17-X.
9. A. Geist et al. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
10. G. Kiczales, J. des Rivières, and D.G. Bobrow. The Art of the Metaobject Protocol. MIT Press, 1991.
11. R. Namyst and J-F. Méhaut. PM2: Parallel Multithreaded Machine. A Computing Environment for Distributed Architectures. In ParCo'95, Gent, Belgium, Sep. 1995.
12. Sun Microsystems. Java RMI Tutorial, Nov. 1996. http://java.sun.com.
13. W.R. Stevens. Advanced Programming in the UNIX Environment. Addison-Wesley, 1992.
14. C.W. Tseng. An Optimizing Fortran D Compiler for MIMD Distributed-Memory Machines. PhD thesis, Rice University, Jan. 1993.
15. C. Videira Lopes. Adaptive Parameter Passing. In ISOTAS'96, Kanazawa, Japan, Mar. 1996.