Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844 Vol. III (2008), Suppl. issue: Proceedings of ICCCC 2008, pp. 265-270

Examination of Fault Tolerance in MMPI Daniel C. Doolan, Sabin Tabirca

Abstract: The Mobile Message Passing Interface (MMPI) provides the developer with a set of functions similar to those found in the Message Passing Interface (MPI) used on high end parallel machines and clusters. Unlike these specially built machines, which feature high speed interconnects, the MMPI system is built on top of Bluetooth technology, allowing parallel applications to be developed within the realm of mobile computing. MMPI is a Java based library built upon Java Micro Edition (JME) and JSR-82 Bluetooth. Fault tolerance is especially important in MPI, as the probability of failure within the system increases as the number of nodes increases. This paper looks at fault tolerance from the mobile perspective through the use of the MMPI library. Mobile fault tolerance opens up a whole other dimension of possible faults that need to be handled when dealing with a wireless communications technology such as Bluetooth, and this paper discusses several of them within the MMPI world.

Keywords: MMPI, Bluetooth, Fault Tolerance, Mobile Parallel Computing

1 Introduction

True fault tolerance is something that we can only aspire to. No matter how well a system may be designed, there is always the possibility of a complete failure of all nodes within the parallel world. Fault tolerance within the world of mobile parallel computing is far more complex. Murphy's law succinctly states that if something can go wrong, it will go wrong.

It is only within the past few years that the concept of mobile grid / parallel computing has come into its own. Until recently it was held back by the lack of system resources inherent in mobile devices. The mobiles of today, however, sport microprocessors of 220 MHz and better [4] [12], with tens of megabytes of system memory available for application usage. One can expect the mobiles of tomorrow to have gigahertz level processors, as a 1 GHz system was announced by ARM [2] [3] [13] [14] in October 2005. Mobiles are advancing at such a rate that in October 2007 Sun Microsystems announced [16] that they would no longer be supporting JME development, as mobile devices are on the verge of being capable of running full blown JVMs. It is expected that within ten years most devices will be running such VMs, although lower end devices may still be running the JME VM.

The number of mobile devices currently active is in excess of 3.3 billion (50% of the world's population). For the last few years sales of mobile devices have remained constant at approximately one billion units per year [10]. The combined computing power of all these mobile devices represents a significant computing resource.

1.1 Bluetooth

Bluetooth operates within the range of 2.40 GHz to 2.48 GHz of the Industrial, Scientific and Medical (ISM) RF band. This band is divided into 79 channels at 1 MHz intervals, with a maximum frequency hop rate of 1,600 hops/s. The Bluetooth 1.2 specification allows for transmission speeds of up to 721 kbit/s, whereas the Bluetooth 2.0 specification allows for an Enhanced Data Rate (EDR) of 3.0 Mbit/s, giving an effective rate of up to 2.1 Mbit/s. The majority of Bluetooth enabled phones support only the lower data rate, but some of the more recent ones, such as the Nokia N810 (announced October 2007), provide Bluetooth 2.0 + EDR as well as 802.11b/g WLAN support. With the adoption of version 2.1 + EDR on 1st August 2007 by the Bluetooth Special Interest Group (SIG), together with Ultra Wide Band (UWB) technologies, Bluetooth is clearly here to stay for the near to medium future.

Three main classes of Bluetooth network exist: Point to Point, Piconet and Scatternet. The majority of Bluetooth networks presently formed are of the Point to Point variety, hands free wireless headsets being one of the main application sectors. The Piconet allows for the formation of a network consisting of several devices, with an upper limit of eight (a server connected with seven clients). In reality some devices that support Bluetooth do not fully comply with the specification, and therefore the upper limit can be lower in practice. Networks formed using the Piconet configuration have a star topology, so all inter-client communication must be routed through the server device.

Copyright © 2006-2008 by CCC Publications - Agora University Ed. House. All rights reserved.

1.2 Mobile Message Passing Interface

MMPI [7] allows for the creation of a parallel world through the creation of a fully interconnected mesh network. The mesh is established in a phased manner. Firstly, a selection of devices are started as client nodes and one other is started as the master / root node. The master carries out device and service discovery to find all of the other devices that have advertised themselves as "mmpiNode" services. At this stage the network topology is that of the standard star network typical of Bluetooth Piconets. According to the Bluetooth specification the inquiry phase must last for 10.24 seconds [5]; in reality this figure is usually several seconds more. The second phase of the formation algorithm creates communication channels between all of the "Client" devices. This requires the Client designated devices to create Server connections. When a Client instantiates such a connection, a message is relayed to the master, which in turn relays it to the appropriate Client node. On receipt of the relayed message that Client can then establish a connection with the Server object of the initiating Client node.

The MMPI library provides a selection of point-to-point and global communication methods. A developer using the library has no need to write any Bluetooth code: one must simply instantiate an MMPI object and then call the required communication routines upon it. The ability to develop a single application rather than separate Client and Server applications helps to reduce development time and cost as well as simplifying the system architecture. The library is suitable not only for mobile parallel processing applications but also for multi-player gaming, mLearning and mobile parallel graphics.
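The two formation phases can be illustrated by counting links. The following is a minimal Python sketch, not MMPI code: the function names are hypothetical, and node 0 is assumed to be the master. It shows that the star formed in phase one plus the client-to-client channels added in phase two together make up the full mesh.

```python
def star_links(n_nodes):
    """Links after phase one: the master connected to every client."""
    return max(n_nodes - 1, 0)


def mesh_links(n_nodes):
    """Links in a fully interconnected mesh of n_nodes devices."""
    return n_nodes * (n_nodes - 1) // 2


def phase_two_pairs(n_nodes):
    """Client-to-client pairs that still need a channel in phase two.

    Assumption for illustration: node 0 is the master and the clients
    are numbered 1 .. n_nodes - 1.
    """
    clients = range(1, n_nodes)
    return [(a, b) for a in clients for b in clients if a < b]


if __name__ == "__main__":
    n = 8  # Piconet upper limit quoted above: one server and seven clients
    print("phase one links:", star_links(n))
    print("phase two links:", len(phase_two_pairs(n)))
    print("full mesh links:", mesh_links(n))
```

For a full Piconet of eight devices, most of the channels in the mesh are created in the second phase, which is why the relay through the master matters during formation.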

2 The Costs for Fault Tolerance in MPI

The Message Passing Interface in itself is simply a standard that specifies how a correctly written parallel program may be achieved. Many say that MPI is not fault tolerant, but this is neither true nor false. Most believe that when a process dies, all MPI nodes that are part of the world must die too; this, however, is not the case. The default behaviour when a process becomes unavailable is indeed to kill all remaining processes in the world, which is achieved through the built in MPI_ERRORS_ARE_FATAL error handler. Therefore, if the developer carries out no error handling for this type of error, all other nodes in the world will die before the process would normally terminate by calling the MPI_Finalize function. This default was decided by the MPI Forum to be the most useful default behaviour. Fault tolerance is not an actual property of MPI itself. In general it is assumed that the parallel program will execute on reliable hardware. How hardware faults are handled is implementation specific.

Gropp and Lusk [9] investigated fault tolerance in MPI by dealing with only one probability for a failure of the system. It is assumed that at most one failure may occur between checkpoints, with the probability $\alpha$. Therefore the total run time may be defined by

$$E_T = \frac{T}{t_0}\left(k_0 + t_0 + \alpha\left(k_1 t_0 + \frac{1}{2}t_0^2\right)\right),$$

where $k_0$ is the time to create a checkpoint and $k_1$ is the time to read / restore a checkpoint. Accordingly, the optimal time between checkpoints is given by $t_0 = \sqrt{2k_0/\alpha}$, and therefore the expected computation time is $T\left(1 + \alpha k_1 + \sqrt{2\alpha k_0}\right)$.

In general a fault tolerant program and its underlying infrastructure should be capable of surviving failures such as system crashes and network failures. At the highest level the MPI program should be capable of automatically recovering from a set of faults without any change to the apparent behaviour of the program.
The next level is that notifications of failures should be posted to the MPI program so that appropriate action may be undertaken. At the third level, certain operations may become invalid; for example, the failure of a node may rule out collective communication routines, while standard point-to-point communication may still continue between unaffected nodes. The fourth level makes use of checkpointing, thereby allowing a program to save its state to persistent storage, abort, and restart from the checkpoint. The final level of survival may use a combination of the previous approaches.

Most approaches to fault tolerance have a set of three distinct requirements: detection of failures, the maintenance of state information to continue the computation, and the ability to restart. Any implementation conforming to the MPI standard is responsible for detecting and handling network faults; this may include message retransmission or informing the application through the use of an error code. Essentially, the contents of a message transmitted from one node should be identical to the message received on another node.

Several fault tolerant MPI implementations are currently in existence. MPICH-V [6] is considered to be one of the most complete, featuring checkpointing and message logs to allow aborted processes to be replaced. This implementation comes at a cost of approximately doubling the communication times in return for the provision of full recovery. The LAM based MPI-FT [11] uses a similar approach to fault tolerance, while FT-MPI [8] modifies some of the standard MPI semantics.

The first attempt at the development of fault tolerant MPI applications made use of checkpointing and roll back. Co-Check MPI [15] was the first MPI implementation built that used the Condor [17] library for checkpointing. All processes would checkpoint synchronously, which proved to be a drawback, as in large systems the procedure could become expensive in terms of time. The result of this work was the creation of a new version of MPI called tuMPI, as the modification of the original MPICH implementation was considered too complex. Another similar implementation is Starfish MPI [1], which uses its own system to achieve checkpointing. Its use of atomic group communication calls removed the need, present in the tuMPI implementation, to flush the message queues to avoid messages being lost.
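The single-failure model of Gropp and Lusk [9] quoted earlier in this section is easy to check numerically. The following Python sketch is illustrative only: the function names and the sample values for $T$, $\alpha$, $k_0$ and $k_1$ are assumptions, not figures from the paper. It evaluates the expected run time, the optimal checkpoint interval $\sqrt{2k_0/\alpha}$, and the minimal run time $T(1 + \alpha k_1 + \sqrt{2\alpha k_0})$.

```python
import math


def expected_runtime(T, t0, alpha, k0, k1):
    """Expected run time with a checkpoint every t0 time units."""
    return (T / t0) * (k0 + t0 + alpha * (k1 * t0 + 0.5 * t0 ** 2))


def optimal_interval(alpha, k0):
    """Checkpoint interval that minimises expected_runtime."""
    return math.sqrt(2.0 * k0 / alpha)


def minimal_runtime(T, alpha, k0, k1):
    """Closed form of expected_runtime at the optimal interval."""
    return T * (1.0 + alpha * k1 + math.sqrt(2.0 * alpha * k0))


if __name__ == "__main__":
    # Hypothetical values: one hour of work, 2 s to write a checkpoint,
    # 5 s to restore one, failure rate 1e-4 per second.
    T, alpha, k0, k1 = 3600.0, 1e-4, 2.0, 5.0
    t_opt = optimal_interval(alpha, k0)
    print("optimal interval:", t_opt)
    print("run time at optimum:", expected_runtime(T, t_opt, alpha, k0, k1))
    print("closed form:", minimal_runtime(T, alpha, k0, k1))
```

Checkpointing more or less often than the optimal interval increases the expected run time, which is the trade-off that Section 3 extends to the mobile case.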

3 Quantizing the Costs for Mobile Fault Tolerance

Checkpoints may be created at regular intervals to save program state. If $T$ is the total execution time without checkpoints, and $t_0$ is the time between checkpoints, then the number of checkpoints is given by $T/t_0$. Under standard MPI characteristics a node may fail due to errors on the device itself, or due to errors related to inter-device communications. In the mobile world these failures have been classified into three distinct categories.

1. Normal failure, with the probability $\alpha_0$.
2. A device fails because it exceeded the Bluetooth range. This occurs with the probability $\alpha_1$.
3. A device terminates the application, with the probability $\alpha_2$.

The accurate detection of errors is only one half of the fault tolerance equation; the other is to ensure that applications can carry on from a previously valid system state. This may be achieved through the use of checkpointing. In the case of errors caused by devices moving in and out of the Bluetooth range (10 meters for a typical phone), one may attempt to restore the original connections (DataInput / DataOutput streams), as the physical addresses of the devices in question remain the same. In other cases it may be necessary to re-initialise the world. This requires carrying out device and service discovery once again, and re-forming the network from the active nodes detected by the discovery process. This can be an expensive operation, as the combined discovery process can last on the order of eighteen to twenty seconds.

The following costs are used:

$k_0$ = the time to create a checkpoint
$k_1$ = the time to read / restore a checkpoint
$k_2$ = the time to restore the communication channels
$k_3$ = the time to initialise the world
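The classification above can be written down as a small lookup from failure category to recovery steps. The Python sketch below is one possible reading of the text, not MMPI code: the names `Failure`, `recovery_steps` and `recovery_overhead` are hypothetical, and mapping a terminated application to re-initialising the world ($k_3$) is an assumption. It counts recovery overhead only and ignores the work lost since the last checkpoint.

```python
from enum import Enum


class Failure(Enum):
    NORMAL = 0        # occurs with probability alpha_0
    OUT_OF_RANGE = 1  # occurs with probability alpha_1
    TERMINATED = 2    # occurs with probability alpha_2


def recovery_steps(failure):
    """Recovery actions for each failure category (illustrative reading)."""
    if failure is Failure.NORMAL:
        return ["restore_checkpoint"]
    if failure is Failure.OUT_OF_RANGE:
        # The device addresses are unchanged, so reconnect, then roll back.
        return ["restore_channels", "restore_checkpoint"]
    # Assumption: a terminated application forces a fresh world.
    return ["initialise_world"]


def recovery_overhead(failure, k1, k2, k3):
    """Sum of the k-costs of the recovery steps, excluding lost work."""
    cost = {
        "restore_checkpoint": k1,
        "restore_channels": k2,
        "initialise_world": k3,
    }
    return sum(cost[step] for step in recovery_steps(failure))


def checkpoint_count(T, t0):
    """Number of checkpoints over a run of length T with interval t0."""
    return T / t0


if __name__ == "__main__":
    # Hypothetical timings in seconds; k3 is taken from the eighteen to
    # twenty second discovery figure quoted above.
    for failure in Failure:
        print(failure.name, recovery_overhead(failure, k1=5.0, k2=3.0, k3=19.0))
```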

3.1 Evaluating the Costs

There are three distinct cases to be evaluated, as mentioned previously with regard to possible failures. The overall cost of fault tolerance is a combination of these three together.

Case 1. This is when a normal failure occurs. The costs involved are to create a checkpoint, read the previous checkpoint and restore the state; hence the equation is exactly as in Gropp & Lusk [9]. For one checkpoint we have

$$E_1 = (1 - \alpha_0 t_0)(k_0 + t_0) + \alpha_0 t_0\left(k_0 + t_0 + k_1 + \frac{1}{2}t_0\right) = k_0 + t_0 + \alpha_0\left(k_1 t_0 + \frac{1}{2}t_0^2\right) = k_0 + t_0(1 + \alpha_0 k_1) + \frac{1}{2}\alpha_0 t_0^2$$

which gives the following cost over $T/t_0$ checkpoints

$$E_1^t(t_0) = \frac{T}{t_0}\left[k_0 + t_0(1 + \alpha_0 k_1) + \frac{1}{2}\alpha_0 t_0^2\right]. \qquad (1)$$

Case 2. This is when a device is outside of the Bluetooth network range. The device should return to within the network range and consequently re-establish the I/O connections with the MMPI world. The device will then read the previous checkpoint and restore the system state. Therefore, the total cost for one checkpoint is

$$E_2 = (1 - \alpha_1 t_0)(k_0 + t_0) + \alpha_1 t_0\left(k_0 + t_0 + k_2 + k_1 + \frac{1}{2}t_0\right) = k_0 + t_0 + \alpha_1 t_0(k_2 + k_1) + \frac{1}{2}\alpha_1 t_0^2 = k_0 + t_0\left[1 + \alpha_1(k_2 + k_1)\right] + \frac{1}{2}\alpha_1 t_0^2$$

with the total cost given by

$$E_2^t(t_0) = \frac{T}{t_0}\left[k_0 + t_0\left[1 + \alpha_1(k_2 + k_1)\right] + \frac{1}{2}\alpha_1 t_0^2\right]. \qquad (2)$$

Case 3. This is when the device terminates the application for some reason. In this case all the devices should start the application from scratch, which gives the following total cost.

$$E_3 = (1 - \alpha_2 t_0)(k_0 + t_0) + \alpha_2 t_0(k_0 + t_0 + T) = k_0 + t_0 + \alpha_2 t_0 T$$

$$\Rightarrow E_3^t(t_0) = \frac{T}{t_0}\left[k_0 + t_0(1 + \alpha_2 T)\right]. \qquad (3)$$
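The three per-checkpoint costs share one pattern: with probability $1 - \alpha t_0$ the interval costs $k_0 + t_0$, and with probability $\alpha t_0$ a case-specific recovery cost is added. The Python sketch below (function names and sample values are assumptions for illustration) writes that pattern once and checks it against the simplified right-hand sides of $E_1$, $E_2$ and $E_3$.

```python
import math


def interval_cost(t0, k0, alpha, recovery):
    """Expected cost of one checkpoint interval.

    No failure with probability 1 - alpha*t0; otherwise the additional
    `recovery` cost is paid on top of k0 + t0.
    """
    p_fail = alpha * t0
    return (1.0 - p_fail) * (k0 + t0) + p_fail * (k0 + t0 + recovery)


def e1(t0, k0, k1, a0):
    """Case 1: restore the checkpoint and redo half an interval on average."""
    return interval_cost(t0, k0, a0, k1 + 0.5 * t0)


def e2(t0, k0, k1, k2, a1):
    """Case 2: additionally restore the communication channels."""
    return interval_cost(t0, k0, a1, k2 + k1 + 0.5 * t0)


def e3(t0, k0, T, a2):
    """Case 3: restart the whole computation of length T."""
    return interval_cost(t0, k0, a2, T)


if __name__ == "__main__":
    # Hypothetical values, times in seconds and rates per second.
    t0, k0, k1, k2, T = 60.0, 2.0, 5.0, 3.0, 3600.0
    a0, a1, a2 = 1e-5, 1e-3, 1e-5
    print(e1(t0, k0, k1, a0), k0 + t0 * (1 + a0 * k1) + 0.5 * a0 * t0 ** 2)
    print(e2(t0, k0, k1, k2, a1), k0 + t0 * (1 + a1 * (k2 + k1)) + 0.5 * a1 * t0 ** 2)
    print(e3(t0, k0, T, a2), k0 + t0 * (1 + a2 * T))
```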

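The total cost over these three cases is

$$E^t(t_0) = E_1^t(t_0) + E_2^t(t_0) + E_3^t(t_0) = \frac{T}{t_0}\left[3k_0 + t_0\left[3 + \alpha_0 k_1 + \alpha_1(k_2 + k_1) + \alpha_2 T\right] + \frac{1}{2}(\alpha_0 + \alpha_1)t_0^2\right]$$

$$= \frac{3Tk_0}{t_0} + \frac{T}{2}(\alpha_0 + \alpha_1)t_0 + \left[3 + \alpha_0 k_1 + \alpha_1(k_2 + k_1) + \alpha_2 T\right]T.$$

The optimal time $t_0$ between checkpoints can be calculated as follows

$$\frac{dE^t}{dt_0} = -\frac{3Tk_0}{t_0^2} + \frac{T}{2}(\alpha_0 + \alpha_1) = 0 \Rightarrow \frac{3Tk_0}{t_0^2} = \frac{T}{2}(\alpha_0 + \alpha_1) \Rightarrow t_0^2 = \frac{6k_0}{\alpha_0 + \alpha_1} \Rightarrow t_0 = \sqrt{\frac{6k_0}{\alpha_0 + \alpha_1}}$$

which gives the following optimal run time

$$E^t(t_0) = \frac{3Tk_0}{\sqrt{\dfrac{6k_0}{\alpha_0 + \alpha_1}}} + \frac{T}{2}(\alpha_0 + \alpha_1)\sqrt{\frac{6k_0}{\alpha_0 + \alpha_1}} + T\left[3 + \alpha_0 k_1 + \alpha_1(k_2 + k_1) + \alpha_2 T\right]$$

$$= \sqrt{\tfrac{3}{2}}\sqrt{k_0(\alpha_0 + \alpha_1)}\,T + \sqrt{\tfrac{3}{2}}\sqrt{k_0(\alpha_0 + \alpha_1)}\,T + T\left[3 + \alpha_0 k_1 + \alpha_1(k_2 + k_1) + \alpha_2 T\right]$$

$$= \left[\sqrt{6k_0(\alpha_0 + \alpha_1)} + 3 + \alpha_0 k_1 + \alpha_1(k_1 + k_2) + \alpha_2 T\right]T.$$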

$E^t(t_0)$ represents the minimal time that may be achieved with mobile fault tolerance. Consequently, as $\alpha_0, \alpha_2 \simeq 0$, the highest probability for error is that of a device moving outside of the range, $\alpha_1$; therefore the optimal time in this case may be given by $E^t(t_0) = \left[\sqrt{6k_0\alpha_1} + 3 + \alpha_1(k_1 + k_2)\right]T$.
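The optimum can be checked numerically. The Python sketch below is illustrative only: the function names and sample values are assumptions. It evaluates the sum of equations (1) to (3), whose $t_0^2$ terms carry $\alpha_0$ and $\alpha_1$, so that the stationary point is $t_0 = \sqrt{6k_0/(\alpha_0 + \alpha_1)}$. With $\alpha_0, \alpha_2 \simeq 0$ the result reduces to the range-only expression $\left[\sqrt{6k_0\alpha_1} + 3 + \alpha_1(k_1 + k_2)\right]T$.

```python
import math


def total_cost(t0, T, k0, k1, k2, a0, a1, a2):
    """Sum of the three total costs, equations (1) + (2) + (3)."""
    constant = 3.0 + a0 * k1 + a1 * (k1 + k2) + a2 * T
    return 3.0 * T * k0 / t0 + 0.5 * T * (a0 + a1) * t0 + constant * T


def optimal_t0(k0, a0, a1):
    """Stationary point of total_cost with respect to t0."""
    return math.sqrt(6.0 * k0 / (a0 + a1))


def minimal_cost(T, k0, k1, k2, a0, a1, a2):
    """Closed form of total_cost evaluated at optimal_t0."""
    root = math.sqrt(6.0 * k0 * (a0 + a1))
    return (root + 3.0 + a0 * k1 + a1 * (k1 + k2) + a2 * T) * T


def range_only_cost(T, k0, k1, k2, a1):
    """Special case alpha_0 = alpha_2 = 0: only range failures occur."""
    return (math.sqrt(6.0 * k0 * a1) + 3.0 + a1 * (k1 + k2)) * T


if __name__ == "__main__":
    # Hypothetical values, times in seconds and rates per second.
    T, k0, k1, k2 = 3600.0, 2.0, 5.0, 3.0
    a0, a1, a2 = 0.0, 1e-3, 0.0
    t_opt = optimal_t0(k0, a0, a1)
    print("optimal interval:", t_opt)
    print("cost at optimum:", total_cost(t_opt, T, k0, k1, k2, a0, a1, a2))
    print("range-only closed form:", range_only_cost(T, k0, k1, k2, a1))
```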

3.2 Saving System State

The saving of checkpoints can be carried out in several different ways, depending on the type of application and the amount of data required for checkpointing. Firstly, the checkpoint data can be saved to the file system using JSR-75. This, however, requires the Midlet to be signed, as unsigned Midlets require significant user intervention to give the application permission. This means of checkpointing is necessary for large and complex applications where a large amount of system state must be saved. The alternative is to store the state data using the Record Management System (RMS). No user intervention is required for this, but the RMS is only suitable for applications where a small amount of state information is necessary. The final checkpoint process saves the system state to the program's internal memory, which provides the fastest checkpointing facility. This checkpoint is volatile, but it can be used in conjunction with the persistent forms to speed up the recovery time.
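The combination of a volatile checkpoint with a persistent one can be sketched as a two-tier store. The following is a Python analogue for illustration only: it does not use JSR-75 or the RMS, and the class and method names are hypothetical. Every checkpoint is kept in memory; only some are also written to a file. Recovery prefers the in-memory copy and falls back to the file when the memory copy has been lost.

```python
import os
import pickle
import tempfile


class TwoTierCheckpoint:
    """Volatile in-memory checkpoint backed by a persistent file."""

    def __init__(self, path):
        self.path = path
        self._memory = None  # serialised copy of the latest state

    def save(self, state, persist=True):
        """Always keep a memory copy; optionally write it to disk too."""
        self._memory = pickle.dumps(state)
        if persist:
            tmp_path = self.path + ".tmp"
            with open(tmp_path, "wb") as handle:
                handle.write(self._memory)
            # Replace the old file in one step so a crash during the
            # write cannot leave a half-written checkpoint behind.
            os.replace(tmp_path, self.path)

    def drop_memory(self):
        """Simulate losing the volatile copy, e.g. after a restart."""
        self._memory = None

    def restore(self):
        """Fast path from memory, slow path from the persistent file."""
        if self._memory is not None:
            return pickle.loads(self._memory)
        with open(self.path, "rb") as handle:
            return pickle.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        store = TwoTierCheckpoint(os.path.join(folder, "state.ckpt"))
        store.save({"iteration": 10})                 # memory and file
        store.save({"iteration": 15}, persist=False)  # memory only
        print(store.restore())
        store.drop_memory()
        print(store.restore())
```

The memory-only checkpoints are cheap but are lost with the application, so the persistent copy bounds how far back a recovery may have to go.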

4 Summary and Conclusions

Parallel computing within the mobile domain opens the door to a plethora of new possibilities for inter-device communications failure. True fault tolerance within a parallel system may never be achieved, as there is always the possibility of the complete failure of all nodes. Checkpointing is one useful mechanism to reduce the cost of system failure by allowing the computation to continue from the last known valid state. This paper has shown that fault tolerance within the mobile parallel domain is far more complex than in the wired case, owing to the numerous possibilities for failure inherent in a wireless communications medium. In the case of Bluetooth these additional failure cases include communications errors, devices moving in and out of the Bluetooth range, and nodes being terminated by user intervention. Fault tolerance by the use of checkpointing within the mobile parallel domain is nonetheless viable, although at a higher cost than on dedicated high end cabled parallel clusters because of the higher probability of failure.

5 Acknowledgements

The development of this body of work was funded under the “Irish Research Council for Science, Engineering and Technology” funded by the “National Development Plan”.

References

[1] A. Agbaria and R. Friedman, "Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations", The Eighth International Symposium on High Performance Distributed Computing, pp. 167–176, 1999.
[2] ARM, "Arm introduces industry's fastest processor for low-power mobile and consumer applications", http://www.arm.com/news/10548.html, October 2005.
[3] ARM, "Arm neon technology fuels consumer electronics growth with next-generation mobile multimedia acceleration", http://www.arm.com/news/6540.html, October 2005.
[4] A. Baker, "Mini Review - Enhancements in Nokia 6680", http://www.i-symbian.com/forum/images/articles/43/Mini_Review-Nokia_6680_Enhancements.pdf, 2005.
[5] Bluetooth SIG, "Annex A (Normative): Timers and Constraints", Bluetooth Specification version 1.1, 2001.
[6] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov, "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes", Supercomputing, ACM/IEEE Conference, pp. 29–29, 2002.
[7] D. C. Doolan, S. Tabirca, L. T. Yang, "Mobile Parallel Computing", 5th International Symposium on Parallel and Distributed Computing (ISPDC06), pp. 161–167, 2006.
[8] G. E. Fagg, J. J. Dongarra, "FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World", Lecture Notes in Computer Science, vol. 1908, pp. 346–353, 2000.
[9] W. Gropp, E. Lusk, "Fault Tolerance in MPI Programs", Cluster Computing and Grid Systems Conference, http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/mpi-fault.pdf, December 2002.
[10] R. Jaques, "One billion mobile phones shipped in 2006", http://www.itweek.co.uk/vnunet/news/2173516/billion-mobile-phones-shipped.
[11] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou, "MPI-FT: Portable fault tolerance scheme for MPI", Parallel Processing Letters, vol. 10, no. 4, pp. 371–382, 2000.
[12] Nokia, "Nokia N70 technical specs", http://www.forum.nokia.com/main/0,,018-2578,00.html?model=N70, April 2005.
[13] F. Pilato, "Arm reveals 1GHz mobile phone processors", http://www.mobilemag.com/content/100/102/C4788/, October 2005.
[14] D. Robinson, "Arm chips to power 1GHz mobiles", http://www.vnunet.com/itweek/news/2143741/arm-chips-power-1ghz-mobiles, October 2005.
[15] G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI", Parallel Processing Symposium, Honolulu, Hawaii, pp. 526–531, 1996.
[16] S. Shankland, "Sun starts bidding adieu to mobile-specific Java", Cnet News, http://www.news.com/8301-13580_3-9800679-39.html.
[17] T. Tannenbaum and M. Litzkow, "Checkpoint and Migration of Unix Processes in the Condor Distributed Processing System", Dr. Dobbs Journal, Vol. 227, pp. 40–48, 1995.

Daniel C. Doolan, Sabin Tabirca University College Cork Department of Computer Science College Road, Cork, Ireland E-mail: {d.doolan, s.tabirca} @cs.ucc.ie
