Use of the Parallel Port to Measure MPI Intertask Communication Costs in COTS PC Clusters

M. Haridasan and G.H. Pfitscher
Department of Computer Science, University of Brasilia, Brazil
{maya, gerson}@cic.unb.br
Abstract

Performance analysis of system time parameters is important for the development of parallel and distributed programs, because it provides a means of estimating program execution times and supports the scheduling of tasks on processors. Measuring time intervals between events occurring in different nodes of COTS clusters of workstations is not a trivial task due to the absence of a unified clock view. In this work we propose a different approach to measuring system time parameters and program performance in clusters, with the aid of the parallel port present in every machine of a COTS cluster. Some experimental values of communication delays using the MPI library in a Linux PC cluster are presented, and the efficiency and precision of the proposed mechanism are analyzed.
1. Introduction

The use of low cost processors and memory devices has been increasing in small, medium and large scale parallel computers. COTS clusters use cost-effective commodity off-the-shelf components to deliver scientific and engineering computational cycles at a low price. One of the major factors contributing to the increasing use of these clusters is the growth of public domain software such as Linux, PVM and MPI, which permits parallel programs to be executed on a group of machines connected by a local area network.

Due to the use of low performance networks, not all distributed algorithms yield satisfactory results on Beowulf clusters. In such systems communication is a bottleneck, and special care must be taken when modeling applications. The role of time measurement is therefore clear, both for obtaining system parameters, especially those related to intertask communication costs, and for distributed program tracing. Communication cost is especially important because the granularity of parallel applications is strongly determined by the communication latency of a system.
One of the problems that occur in Beowulf-class clusters of computers is the absence of perfect synchronization between machine clocks. Each node of the cluster has its own local clock, sensitive to external conditions such as temperature. These individual clocks suffer the effect of different constant drifts, making it difficult to maintain a unified clock view. Many software synchronization algorithms have been proposed [1], but the network delays incurred in propagating synchronization messages do not permit very precise clock synchronization. Hardware and hybrid approaches, which permit more precise synchronization, have also been proposed [2, 3, 4], but are likewise not able to provide a perfect view of a reference clock.

Due to the difficulties in measuring time intervals between distributed events, in this work we propose the use of the parallel port for centralized time measurement in small and medium scale PC clusters. The parallel port is a very attractive solution for communicating small quantities of information, since it is present in all machines and almost never used in clusters. Parallel ports have previously been considered for different uses in clusters, such as collective communication, barrier synchronization and clock synchronization. The TTL_PAPERS project [5] proposed an approach to provide low latency barrier synchronization and aggregate communication using parallel ports and minimal extra hardware. Hardware clock synchronization using the parallel port has also been proposed and proved to be efficient [4]. These approaches have shown that the use of the parallel port for auxiliary tasks in a cluster can yield satisfactory results. Here we use the port's lines to exchange information useful for monitoring the execution of distributed programs and the communication between machines in small clusters.

Some aspects which must be considered to evaluate the efficiency of using the parallel port for time measurements are the time spent reading from and writing to the ports and the capacity of the front-end to handle multiple timing signals. We also present measurements, collected with the proposed system, of MPI peer-to-peer and collective
communication functions on a Beowulf cluster of eight machines.

This paper is organized as follows. Section 2 presents the system parameters used by some parallel computation models to represent communication costs. Section 3 describes our proposed parallel port approach to measuring communication costs in clusters. Section 4 presents our methodology for measuring communication times in a COTS PC cluster. Finally, Section 5 presents some communication time measurements obtained with the proposed approach.

2. Communication cost in models of parallel computation

Several models of parallel computation have been proposed, but there is still no consensus about which model best describes parallel computation on different machines. In this section we briefly present how two of these models handle communication between processing nodes.

The postal model [6] is strictly a communication model and does not describe computation. It focuses on three aspects of communication in message passing systems: total connectivity, simultaneous input/output and communication latencies. In the postal model, any processor p may send a point-to-point message to any other processor in the system, and each processor can simultaneously send a message to a processor q and receive a message from another processor r. If a processor p starts sending a message m to a processor q at an instant of time t, then processor p is busy sending the message during the time interval [t, t + 1], and processor q is busy receiving the message during the time interval [t + λ − 1, t + λ], where λ is the communication latency of the system (Figure 1). The emission event of a message is separate from the reception event, so a processor can send a message and does not need to wait for its reception before sending a new one.
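As a small worked instance, added here for illustration with an assumed latency of λ = 3 time units, the busy intervals of the postal model are:

\begin{align*}
  \text{sender busy:}   &\quad [t,\; t+1] = [0,\,1],\\
  \text{receiver busy:} &\quad [t+\lambda-1,\; t+\lambda] = [2,\,3],
\end{align*}

so a message sent at t = 0 is fully received λ = 3 time units after the send began.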
Figure 1. Emission of a message from a processor p to processor q under the postal model.

The delivery of a message includes several costs, such as the time to prepare the message, the time to write to the output buffer, the delay in the output port, the network propagation delay, the delay in the input port, the time to write to the input buffer and the time to interpret the message. The communication parameter λ must reflect both hardware and software costs. It actually depends on the pair of communicating processors and the load on the communication network. Under normal operating conditions, the value of λ can be considered relatively uniform across the system.

The LogP model [7] is a distributed memory parallel computation model in which processors exchange information through point-to-point communication. The model specifies performance characteristics without tying itself to a specific kind of network. The LogP model uses four parameters to describe a system: L, an upper bound on the latency incurred in communicating a message containing a word or a small group of words from the source node to the destination node; o, the overhead, defined as the time interval during which a processor is sending or receiving a message and is not able to perform other operations; g, the gap, the minimum time interval between two consecutive transmissions or receptions of messages at a processor; and P, the number of processor-memory modules. The LogP model assumes a network of limited capacity, through which at most ⌈L/g⌉ messages from any processor to any processor can be in transit at any instant. If a processor tries to send a message when this limit has been reached, it is stalled until conditions for transmission of the message are satisfied. Parameters L, o and g are measured in processor cycles. Processors are assumed to work asynchronously, so the latency experienced by a message cannot be predicted. The model also assumes that messages are small and that they may not reach their destination in the order in which they were sent. Under LogP, a message requires o cycles to be prepared and put into the network, L cycles to reach the destination node (the communication latency) and o additional cycles to be received at the other end (Figure 2). Thus, a message can only be received after L + 2o cycles.

Figure 2. Transmission time of a message under the LogP model.
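For concreteness, the point-to-point costs implied by the LogP parameters can be written as follows; the pipelined k-message expression is the standard LogP result (assuming g ≥ o) and is added here only as an illustration:

\begin{align*}
  T_{\text{1 message}}   &= o + L + o \;=\; L + 2o,\\
  T_{k\ \text{messages}} &= (k-1)\,g + L + 2o .
\end{align*}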
The LogGP model is an extension of the LogP model that does not limit the size of messages. It provides a linear model for long messages, adding an additional parameter G, which represents the bandwidth obtained for long messages [8]. The three parameters o, g and G represent three different types of bottleneck: o represents the interval of time during which the processor is involved in the emission or reception of a message; g represents the network initialization bottleneck; and
parameter G reflects the network bandwidth per processor for long messages.

Other models, such as the BSP model [9], which is a bridging model, have been proposed, and each contains its own specific model of computation. There is still no consensus about which model, if any, best describes communication in a cluster of computers. A specific model can be derived through observation of the behavior of several communication routines.
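For reference, the cost of a single long message of k bytes under LogGP is commonly written as below; the formula comes from [8] and is reproduced here only as an illustration:

\begin{equation*}
  T(k) \;=\; o \;+\; (k-1)\,G \;+\; L \;+\; o .
\end{equation*}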
3. Using the parallel port for measuring communication cost

Figure 3. Parallel port connection between cluster nodes and the front-end.

In our approach, each machine participating in the cluster is connected to a front-end machine through a single parallel port line (Figure 3). Performance monitoring duties are assigned to the front-end machine, which captures event timing signals sent by the other machines through the parallel port. Network latency does not affect the monitoring of the system, since no network communication is involved in the process. The major advantage of this approach is that all event times are read from the front-end's clock, so the correct ordering and precise measurement of timing events generated in different machines does not rely on clock synchronization among the nodes.

The original parallel port consists of eight output lines, five input lines and four bi-directional lines. More recent modes have been defined, such as the PS/2, EPP and ECP modes, which allow the eight lines initially used only for output to be used for input as well. Three contiguous physical addresses are assigned to the standard parallel port, usually starting from the base address 3BCh, 378h or 278h. The first of the three contiguous addresses corresponds to the data register, which contains the eight bits used for output but which can be read or written. The second address contains the status register, through which the five input bits can be accessed. Finally, the third address corresponds to the control register, containing the four bi-directional bits, which can be configured either for output or for input.

Our proposed model consists of two kinds of role players: the worker (slave) machines and the front-end machine. In the slave machines, we use only one bit of the data register to output a signal to the front-end every time we want to mark the time of an event. All other lines are free to be used for other tasks such as clock synchronization. Our work was developed on a cluster of 8 machines, so each line was directly connected to one of the possible input lines of the front-end's parallel port. A simple macro in C containing the code to write a value to the data register of a parallel port has been written. Whenever a timing mark should be added at some point of a program, the programmer just needs to insert a call to the macro, which reads the previous value of the first data bit of the parallel port and writes it back inverted.
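For illustration, a minimal sketch of such a timing-mark macro is given below. It assumes a Linux/x86 worker using the inb/outb interface of <sys/io.h> and the common base address 0x378; the names LPT_DATA and TIME_MARK are ours, and the paper's actual macro is not reproduced here.

/* Hypothetical sketch of a worker-side timing-mark macro. Assumes a
 * Linux/x86 node, parallel port base address 0x378 and I/O permission
 * granted via ioperm(); not the authors' actual code. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>      /* ioperm(), inb(), outb() */

#define LPT_DATA 0x378   /* data register of the first parallel port */

/* Toggle data bit 0: the front-end detects the level change and
 * timestamps it with its own clock. */
#define TIME_MARK() outb(inb(LPT_DATA) ^ 0x01, LPT_DATA)

int main(void)
{
    /* Request access to the three parallel port registers. */
    if (ioperm(LPT_DATA, 3, 1) != 0) {
        perror("ioperm (run as root)");
        return EXIT_FAILURE;
    }

    TIME_MARK();    /* e.g. immediately before an MPI_Send() */
    /* ... code being timed ... */
    TIME_MARK();    /* e.g. immediately after an MPI_Recv()  */

    return EXIT_SUCCESS;
}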
Since each machine is connected to a different pin of the front-end's parallel port connector, the front-end can identify which machine sent a signal from the pin whose value has changed. By using this inversion of bits to indicate time marks, instead of generating pulses, the risk of the front-end machine losing information is much smaller: even if a node changes its state while the front-end is busy handling another node's change of state, the front-end will still detect the change afterwards, within a small tolerance interval.

At the front-end, a monitoring program, also written in C, keeps listening to the parallel port, waiting for any change. When a value changes, it stores the value read together with the current time for later interpretation. From the sequence of values read from its parallel port, the front-end can later determine which machine generated each time mark. The program terminates and writes the obtained values to a file whenever it waits longer than a specified time for a change of value. To read the front-end's clock, the rdtsc instruction, implemented on Intel processors from the Pentium onwards, is used [10]. This instruction returns the number of clock cycles elapsed since the machine's last reboot. Besides having a minimal execution time, rdtsc provides nanosecond precision, better than the commonly used gettimeofday C function, which provides only microsecond precision.

To automate the interpretation of the readings obtained at the front-end, a monitoring tool is being developed in Java. This tool allows a Task Precedence Graph (TPG) representing a specific parallel program to be entered graphically. A TPG models a parallel program using nodes to represent tasks and arcs to represent communication and dependence requirements. Besides specifying the program's TPG, the user can add timing events to each task and indicate the order of these timing events. The user should also map the tasks onto the processors of the cluster.
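Returning to the front-end monitoring program, the following is a minimal sketch of the polling loop described above, assuming a Linux/x86 front-end reading the status register at 0x379; the timeout constant and buffer size are arbitrary placeholders, not values from the paper.

/* Hypothetical sketch of the front-end polling loop; not the authors'
 * actual program. */
#include <stdio.h>
#include <stdint.h>
#include <sys/io.h>

#define LPT_BASE   0x378
#define LPT_STATUS (LPT_BASE + 1)     /* the five input bits live here */
#define MAX_EVENTS 100000

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    static unsigned char value[MAX_EVENTS];
    static uint64_t      stamp[MAX_EVENTS];
    int n = 0;

    if (ioperm(LPT_BASE, 3, 1) != 0) { perror("ioperm"); return 1; }

    unsigned char last = inb(LPT_STATUS);
    uint64_t idle_since = rdtsc();

    /* Poll until no change has been seen for an arbitrary number of
     * cycles (standing in for the paper's "specific time"). */
    while (n < MAX_EVENTS && rdtsc() - idle_since < 10000000000ULL) {
        unsigned char cur = inb(LPT_STATUS);
        if (cur != last) {            /* some worker toggled its line */
            value[n] = cur;
            stamp[n] = rdtsc();
            idle_since = stamp[n];
            last = cur;
            n++;
        }
    }

    /* Dump the raw readings for later interpretation. */
    for (int i = 0; i < n; i++)
        printf("%llu %u\n", (unsigned long long)stamp[i], value[i]);
    return 0;
}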
The tool then executes the program and maps the timing events read to the events specified by the user. The results may be visualized through Gantt charts and tables.

4. Methodology

Because of the time differences between the local clocks of different machines, communication times between machines cannot be measured directly as the difference between a clock reading performed by the receiving machine and a clock reading performed by the sending machine. A simple approach commonly used to measure communication times is the round-trip method, also known as ping-pong, which consists in measuring the total time a message takes to be transmitted from a computer A to a computer B and retransmitted back from B to A [11, 12]. Machine A reads clock value t_i before sending the message to computer B, and reads clock value t_f after receiving an equal message back from computer B (Figure 4). The communication time from A to B is then taken as half the difference between t_f and t_i. A disadvantage of this approach is that it assumes that the time for the message to go from A to B is equal to the time for the same message to go from B to A, which may not be the case in heterogeneous systems. Moreover, the ping-pong approach is intended only for measuring point-to-point communication times and cannot be used for measuring the time of collective communication functions.
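A minimal sketch of the ping-pong measurement between two MPI ranks is shown below; it uses MPI_Wtime() for brevity, whereas the measurements in this paper are based on rdtsc timestamps, and the message length and repetition count are illustrative only.

/* Hypothetical ping-pong sketch between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

#define N    100000      /* message length in 32-bit integers */
#define REPS 10

int main(int argc, char **argv)
{
    static int buf[N];   /* zero-initialized payload */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int r = 0; r < REPS; r++) {
        if (rank == 0) {
            double t_i = MPI_Wtime();
            MPI_Send(buf, N, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double t_f = MPI_Wtime();
            /* The one-way time is estimated as half the round trip. */
            printf("estimated A->B time: %g s\n", (t_f - t_i) / 2.0);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}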
Figure 4. Ping-pong technique for measuring message transmission delays.

To measure MPI point-to-point communication delays, we compared the ping-pong approach with our approach using the parallel port. In the parallel port approach, one process sends a signal to the front-end immediately before sending the message and another process sends a signal to the front-end immediately after receiving the message. Tests were performed for messages of lengths varying from 1 to 100000 32-bit integers, in a randomly chosen order, and each test was repeated 10 times. Initially, times were measured without any communication traffic in the network. Three types of traffic were then simulated and the tests were repeated under these different traffic conditions. Traffic type 1 consists of pairs of processors exchanging variable size messages between themselves. The second type of traffic involves a third processor which permanently broadcasts messages to the first two processors and waits for them to send the same message back. These two kinds of traffic are illustrated in Figure 5. Traffic type 3 is similar to type 2, with the difference that the third processor sends individual messages instead of broadcast messages.

Figure 5. Traffic of types 1 and 2.

Times for broadcast communication were also measured using the parallel port approach. The process which sends the broadcast message signals the front-end machine immediately before sending the message, and all other processors signal the front-end immediately after receiving the message. Finally, to measure the execution time of barriers in MPI, we measured the time interval between the arrival of the last process at the barrier and the departure of the last process from the barrier. In the execution of a barrier with n processes, n − 1 processes have to wait until the last process reaches the barrier.
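The following sketch illustrates how the broadcast and barrier measurements described above could be instrumented; TIME_MARK repeats the assumed worker-side macro from Section 3 (data bit 0 of port 0x378, with ioperm() already granted) and is not part of MPI.

/* Hypothetical instrumentation of MPI collectives with parallel port
 * timing marks; not the authors' actual code. */
#include <mpi.h>
#include <sys/io.h>

#define LPT_DATA 0x378
#define TIME_MARK() outb(inb(LPT_DATA) ^ 0x01, LPT_DATA)

/* Broadcast: the root marks just before sending and every receiver
 * marks on arrival, so the front-end's timestamps give the time for
 * all processes to receive the message. */
void timed_bcast(int *buf, int count, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == root)
        TIME_MARK();
    MPI_Bcast(buf, count, MPI_INT, root, comm);
    if (rank != root)
        TIME_MARK();
}

/* Barrier: every process marks on arrival and on departure; the
 * interval of interest is last arrival to last departure. */
void timed_barrier(MPI_Comm comm)
{
    TIME_MARK();
    MPI_Barrier(comm);
    TIME_MARK();
}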
5. Results

To analyze the efficiency and precision of our proposed approach, we measured the time taken by the front-end machine to handle each timing signal generated by the worker machines and the time taken by the worker machines to send a timing signal to the front-end. Operations accessing the parallel port usually take approximately one microsecond. The time taken by a slave machine to send a signal is longer than the time taken by the front-end to handle a signal, so there is no possibility of a worker sending two consecutive signals that the front-end is unable to handle. The precision grain of the parallel port approach is approximately one microsecond. This precision is satisfactory for measuring communication costs in clusters, since communication costs are in the range of hundreds of microseconds. To measure the execution time of events with better precision, a hybrid approach using both parallel port signaling and local clock readings might be used.

We present results obtained for some communication measurements performed on a cluster of eight nodes, each with a Pentium II 350 processor and
128 MB of RAM. The machines communicate through a 100 Mb/s switch. To compare our parallel port approach with the ping-pong approach for measuring point-to-point communication time, we measured the time to send messages of varying size between two machines of the cluster. In Figure 6 we present the times measured using the ping-pong and the parallel port approaches. It can be noticed that the send time increases linearly with the size of the message, and that the values measured with both techniques are very close to each other.
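As a rough cross-check added here (not stated in the paper's text), the linear fit annotated in Figure 6 can be read as a startup-plus-per-word cost model:

\begin{align*}
  T(x) &\approx t_0 + x\,t_w, \qquad t_0 \approx 318\ \mu\mathrm{s},\quad
        t_w \approx 0.372\ \mu\mathrm{s}\ \text{per 32-bit integer},\\
  \text{effective bandwidth} &\approx \frac{4\ \text{bytes}}{0.372\ \mu\mathrm{s}}
        \approx 10.8\ \mathrm{MB/s} \approx 86\ \mathrm{Mbit/s},
\end{align*}

which is consistent with the cluster's 100 Mb/s switch.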
Figure 6. Point-to-point communication times of varying length messages without presence of any kind of traffic – measures obtained with the ping-pong and parallel port approaches. [Plot annotation: linear fit y = 0.3717x + 317.64, R² = 0.9999.]
Figure 7. Influence of different kinds of traffic on point-to-point communication times – measures obtained with the parallel port approach.

Figure 7 presents the communication times of messages in the presence of different kinds of traffic in the network, measured using the parallel port approach. Tests with the ping-pong method were also performed, and the values obtained were again very close to those obtained with the parallel port approach, so they are not presented here. Different kinds of traffic influence the communication cost in different ways, which makes it difficult to take traffic into account in a communication model. Communication times for increasing amounts of traffic are presented in Figure 8. As the amount of traffic in the network increases, transmission times and their dispersion increase, as would be expected. The additional communication time is due to two different factors: the presence of traffic in the network and the processing cost of the additional tasks being executed. The influence of each of these two factors was evaluated with traffic programs similar to the ones used before, but with each process communicating with itself. The processing cost was thereby isolated, and we verified that its influence on the communication cost is insignificant. Thus, we could conclude that the increase in communication time is essentially due to the presence of traffic in the network.
Figure 8. Influence of increasing amounts of traffic on point-to-point communication times – measures obtained with the parallel port approach.

Figure 9 shows communication times for the MPI broadcast routine. The times measured indicate the time taken by all the processes to receive the broadcast message. The advantage of the broadcast routine, which uses a tree algorithm to propagate a message, could be verified: the process which broadcasts the message first sends it to a second process; in the next stage, the two processes which already know the value transmit it to two others, then the four which already know it transmit to another four, and so on. The graph reflects this behaviour, with times for broadcasting a message to two and three processes being very close, as well as for four, five, six and seven processes.

We also present, in Figure 10, the times measured for the execution of an MPI barrier involving increasing numbers of processes. As with the broadcast measurements, the graph shows the time taken for all the processors involved to leave the barrier. We have noticed that the processes require different times to leave a barrier, as can be observed in Figure 11.
Figure 9. Broadcast communication time of varying length messages without presence of any kind of traffic – measures obtained using the parallel port approach.
Figure 10. Barrier execution time without presence of any kind of traffic – measures obtained using the parallel port approach.
Figure 11. Time taken by each of eight processes on eight different processors to leave a barrier after the last process arrives at the barrier.

6. Conclusions

The use of the parallel port in clusters has proved to be efficient in several previous works. We have presented an approach for MPI intertask communication cost evaluation in COTS clusters of computers using a low cost mechanism based on the parallel port present in all cluster machines. Our approach permits these measurements without the influence of message delays. It also permits the performance evaluation of task precedence graphs representing specific parallel programs, through Gantt charts and tables. The precision of the measurements is limited only by the access time to the parallel port, currently about one microsecond.

7. References
[1] Anceaume, E., Puaut, I. Performance Evaluation of Clock Synchronization Algorithms. Technical Report 3526, INRIA, France, October 1998.
[2] Ramanathan, P., Kandlur, D. D., Shin, K. G. Hardware-Assisted Software Clock Synchronization for Homogeneous Distributed Systems. IEEE Transactions on Computers, 39(4):514-524, April 1990.
[3] Horauer, M. Hardware Support for Clock Synchronization in Distributed Systems. Proceedings of the International Conference on Dependable Systems and Networks (DSN'01), Sweden, 2001.
[4] Nonaka, J., Pfitscher, G. H., Nakano, H., Onisi, K. Low-Cost Hybrid Internal Clock Synchronization Mechanism for COTS PC Cluster. EuroPar 2002, Paderborn, Germany, Lecture Notes in Computer Science, Springer.
[5] Dietz, H. G., Muhammad, T., Sponaugle, J. B., Mattox, T. PAPERS: Purdue's Adapter for Parallel Execution and Rapid Synchronization. Purdue University School of Electrical Engineering, Technical Report TR-EE 94-11, March 1994.
[6] Bar-Noy, A., Kipnis, S. Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems. Proc. 4th Symp. on Parallel Algorithms and Architectures, pp. 13-22, 1992.
[7] Culler, D. E., Karp, R. M., Patterson, D. A., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., von Eicken, T. LogP: Towards a Realistic Model of Parallel Computation. 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993.
[8] Alexandrov, A., Ionescu, M., Schauser, K. E., Scheiman, C. LogGP: Incorporating Long Messages into the LogP Model – One Step Closer Towards a Realistic Model for Parallel Computation. 7th Annual Symposium on Parallel Algorithms and Architectures (SPAA'95), July 1995.
[9] Valiant, L. G. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103-111, 1990.
[10] Using the RDTSC Instruction for Performance Monitoring. Intel Corporation, 1997.
[11] Donaldson, S. R., Hill, J. M. D., Skillicorn, D. B. Performance Results for a Reliable Low-Latency Cluster Communication Protocol. IPPS/SPDP Workshops 1999, pp. 1097-1114.
[12] Luecke, G. R., Raffin, B., Coyle, J. J. Comparing the Communication Performance and Scalability of a Linux and a NT Cluster of PCs, a Cray Origin 2000, an IBM SP and a Cray T3E-600. Proceedings of the IEEE Computer Society International Workshop on Cluster Computing, pp. 26-35, December 2-4, 1999, Melbourne, Australia.