Collaborative Fault Diagnosis in Grids through Automated Tests

Alexandre Nóbrega Duarte, Francisco Brasileiro, Walfredo Cirne, José Salatiel de Alencar Filho
Universidade Federal de Campina Grande, Departamento de Sistemas e Computação
Avenida Aprígio Veloso, s/n. Bodocongó, 58.109-970, Campina Grande, PB, Brazil
{alex,fubica,walfredo,salatiel}@dsc.ufcg.edu.br

Abstract. Grids have the potential to revolutionize computing by providing ubiquitous, on-demand access to computational services and resources. However, grid systems are extremely large, complex and prone to failures. A survey we have conducted reveals that fault diagnosis is still a major problem for grid users. When a failure appears on the user's screen, it is very difficult for the user to identify whether the problem is in his application, somewhere in the grid middleware, or even lower in the fabric that comprises the grid. To overcome this problem, we argue that current grid platforms must be augmented with a collaborative diagnosis mechanism. We propose that such a mechanism use automated tests to identify the root cause of a failure and propose the appropriate fix. We also present a Java-based implementation of the proposed mechanism, which provides a simple and flexible framework that eases the development and maintenance of the automated tests.

Keywords: Grid computing, fault treatment, collaborative fault diagnosis, failure management, automated tests.
1. Introduction

Grids have the potential to revolutionize computing by providing ubiquitous, on-demand access to computational services and resources. They promise to allow for on-demand access to and composition of computational services provided by multiple independent sources. Grids can also provide unprecedented levels of parallelism for high-performance applications. These possibilities create fertile ground for entirely new applications, much more ubiquitous and adaptive, possibly with huge computational and storage requirements.

On the other hand, grid characteristics such as high heterogeneity, complexity and distribution (traversing multiple administrative domains) create many new technical challenges that need to be addressed. Among these technical challenges, failure management is a key area that demands much progress [11][13][16]. Even fault diagnosis, a basic step in any failure management strategy, needs to see great improvement if we are to realize the grid vision.

Today, when a grid user sees a failure on his screen, he has a very hard time pinpointing the root cause of the failure¹ and handing over the responsibility for fixing the problem to the appropriate support staff (an application developer, a middleware administrator, a system administrator, etc.). It may be the user's own application that has a bug. It may be that the user has requested a certificate whose lifetime was too short. It may be a configuration problem in some site that was used by the application for the first time. It may be that a disk on a machine next door has crashed. It may be a very large number of things. To start with: how does the user know whether the problem is with his own application, or somewhere in the infrastructure used by the application?

Since the grid encompasses services and resources in multiple administrative domains, there may be restrictions to obtaining logs and information about the grid components. Even when such information is available, one would have to know what should be happening, but is not, in order to come up with the correct diagnosis. The problem here is a cognitive one. To further complicate things, error messages may be misleading.
¹ In many situations, the task of diagnosing a fault is carried out by the support staff, instead of the grid user; to make the writing more fluid, we will assume that the user performs this task. We note that, due to the complexity of the grid, this is a complex task even for specialized support staff.
A recent study shows that even specialized users, such as system administrators, can spend as much as 25% of their time following wrong paths suggested by unclear error messages [3].

To put it another way, grids make heavy use of complexity-hiding abstractions. However, when a failure occurs, abstractions lose their ability to hide complexity. One must understand how they should work (instead of merely what they should do) to diagnose faults and fix them. In a grid context, this means understanding the functioning of the many different technologies that comprise it. This requires expertise in many different technologies in terms of middleware, operating systems and hardware. It is just too much for any single human being!

Note that this state of affairs is largely due to the fact that grids are much larger and more complex (and thus more prone to failures) than traditional computing platforms. Machines may be disconnected from the grid due to crashes or network partitioning, remote machines may have a version of libc that causes the user application to crash, hackers may bring sites down through denial of service attacks, and so on. Moreover, in a grid environment there are potentially thousands of resources, services and applications that need to interact in order to make possible the use of the grid as an execution platform. Since these components are extremely heterogeneous, there are many failure possibilities, including not only those caused by independent faults on each component, but also those resulting from interactions between components.

Solutions for grid monitoring have been proposed [1][2][18][19][20][21][23]. They are certainly useful, since they allow for failure detection and also facilitate the collection of data describing the failure. However, they do not provide mechanisms for fault diagnosis. Failure recovery mechanisms have also been proposed [10][11][13][14][17][24]. They allow recovery from omission and crash failures. Although they are able to mask some failures, these mechanisms cannot diagnose faults and therefore cannot fully recover the system to its best operational state.

Consequently, dealing with failures in grids is still a serious problem at this time. No wonder that, in a recent survey we conducted, grid users said that: i) the complexity of the failure treatment abstractions as well as the long time to recover from failures continues to be a problem for users; ii) automated fault treatment is an important user requirement; and iii) appropriate support for fault diagnosis is a feature that is still missing in most systems.

In this paper we propose a collaborative approach for diagnosing faults in grids. Our approach uses automated tests to determine whether a failure signaled by component C originated from a fault at C, or at some component that provides service to C. When the source of a failure is determined, we can also automatically propose appropriate fixes.

We start by presenting, in Section 2, a survey that exposes the difficulties users currently experience while using their grid platforms. The aim of this survey was to capture the actual experience regarding fault treatment of those who have been using grids. Then, in Section 3, we propose a collaborative approach for fault diagnosis in grids. In Section 4, we present a Java-based framework that can be used to implement this approach.
We also describe how it has been successfully applied to develop diagnostic tools for some components of two distinct grid middleware infrastructures (Globus [9] and MyGrid [5]). Further, in Section 5, we discuss related work and argue why the available solutions are not sufficient to diagnose faults in grid environments in an effective manner. Finally, Section 6 concludes the paper with our final remarks.
2. Fault Treatment in Grids

In order to identify the status quo of fault treatment in grids, we have consulted grid users spread throughout the world. The following questions were posed to them:

1) What are the most frequent kinds of faults you face when using a grid?
2) What are the mechanisms used for detecting and/or correcting and/or tolerating faults?
3) What are the greatest problems you encounter when you need to recover from a failure?
4) To what degree is the user involved during the failure recovery process?
5) What are the greatest users' complaints?
6) Are there mechanisms for application debugging in your grid environment?

A questionnaire containing these questions was made available on the Web (http://www.ourgrid.org/surveys/ft.html) and advertised in several grid discussion lists.
Answers were received via the Web form as well as by e-mail. We conducted two advertisement campaigns. The first one was carried out during April 2003 and resulted in 22 responses [16]. The second one was conducted during April 2005 and resulted in 13 responses. Although the number of responses was small, they provide good anecdotal evidence of the main problems that the grid community is facing regarding fault treatment. It is interesting to note that a similar survey (i.e. a self-selected survey conducted on-line) with users of parallel supercomputers resulted in 214 responses [4], an order of magnitude more than the surveys presented in this paper. Also, many respondents demonstrated a high level of interest in the results of our research, signaling their hope for better ways to deal with faults in grids. These facts highlight the infancy of grid computing and suggest that better fault treatment is an important factor in bringing grids to maturity.
2.1. The Surveys

Kinds of Faults

From the data presented in Figure 1, we can state that the situation regarding the types of faults that are most frequent in 2005 remains almost the same as in 2003. The main kinds of faults are related to the environment configuration. In 2003, a little more than 75% of the responses pointed this out, while in 2005 this was the main complaint of a little more than 60% of the respondents. Following this, we have middleware faults with a little less than 50% and application faults with around 40%. Note that, in the majority of the responses, more than one kind of fault was indicated.
Figure 1: Kinds of failures (a) 2003; (b) 2005
Fault Treatment Mechanisms

Regarding fault treatment mechanisms, we can see differences in the data collected in the two surveys. In 2003, in addition to ad hoc mechanisms – based on users' complaints and analysis of log files – grid users used automated ways to deal with faults in their systems (see Figure 2(a)). Nevertheless, 57% of them were application-dependent. Even when monitoring systems were used (29% of the cases) they were proprietary ones (in fact, standards such as GMA [21] and ReGS [1] were very recent specifications at that time and had few implementations). Checkpointing was used in 29% of the systems and fault-tolerant scheduling in 19%. In some cases, different mechanisms were combined.

In 2005 (see Figure 2(b)), fault-tolerant scheduling is the most prevalent mechanism for fault treatment. Application-dependent mechanisms and fault monitoring tools are also used by a considerable number of respondents. Moreover, several respondents reported the use of visualization tools to support the process of fault treatment. This can be seen as an indication that fault treatment is indeed an important issue for grid users. Further, users would like to have more support from the grid infrastructure (see the decrease in application-dependent mechanisms as the use of monitoring and visualization tools as well as fault-tolerant scheduling increases).
Figure 2: Fault Treatment Mechanisms in Current Use
Application Debugging

The results collected suggest that this is also an area that has greatly improved in the past two years. In 2003, less than 5% of the respondents (just one response) had good mechanisms that allowed them to influence the application execution (e.g. change a variable value); 14% had "passive mechanisms" that only allowed them to watch the application execution; 19% had mechanisms that did not give them a grid-wide view of their application (i.e. the mechanism scope was limited to a single resource that comprised the grid); and 62% of the grid users had no application debugging mechanism available to them (see Figure 3(a)). On the other hand, in 2005, all respondents reported that they had access to some sort of debugging tool (see Figure 3(b)). Moreover, only 15% of them use mechanisms with limited scope. The majority of users have access to passive mechanisms, and the number of users with access to good mechanisms increased from less than 5% to more than 20%.
Figure 3: Application Debugging Mechanisms
Degree of User Involvement

As the previous results suggest, user involvement in fault treatment has been reduced over the last two years. In 2003, users needed to be highly involved during the failure recovery process (see Figure 4(a)). About 58% of them needed to define exactly what should be done when failures occur, 29% of them were somewhat involved, while only 13% of the users were involved to a low degree and could rely on the mechanisms provided by the system. In 2005 the situation seems to be better (see Figure 4(b)), with only about 1/3 of the users requiring a high involvement in the fault treatment activities.
Figure 4: Degree of User Involvement
The Greatest Users' Complaints

Regarding the greatest users' complaints, in 2003 these were mainly related to the complexity of the fault treatment abstractions/mechanisms (71% – see Figure 5(a)). Users were more concerned with the ability to recover from failures than with the failure occurrence rate (33%) or the time to recover from them (10%). In 2005, the failure occurrence rate accounts for only 15% of the complaints (see Figure 5(b)), indicating that the reliability of grid technology has increased in the past two years. On the other hand, the complexity of the failure treatment abstractions continues to be a problem for users. Moreover, the long time to recover from failures has also become an important issue. These facts further suggest that there is a lot to be advanced in the area of fault treatment in grids.
Figure 5: The Greatest Users’ Complaints
The Greatest Problems for Recovering from a Failure

In 2003, the greatest problem reported was the lack of support in the process of diagnosing the fault that originated a failure, i.e. the identification of the root cause of a failure. About 70% of the responses pointed this out (see Figure 6(a)). The difficulty of implementing the application-dependent failure recovery behavior was present in 48% of the responses (the user does not know what to do to recover from a failure). These two were also the main sources of complaints among the respondents of the 2005 survey (see Figure 6(b)), with a small decrease in the first problem and an increase in the second. The impossibility of gaining authorization to correct the faulty component is a problem reported in both surveys; however, its impact is far smaller than that of the other two problems. It appears that in the past two years there has been little improvement in the support for fault diagnosis in grids.
Figure 6: Problems when recovering from a failure
2.2. Survey Lessons

From the responses presented in the previous subsection we can infer that, although progress has been made, failure management is still a major issue when using grids. In particular, fault diagnosis remains an open problem (see Figure 6). We believe this is closely related to how grid software is organized: as a collection of software layers, each providing a particular abstraction [7]. Grid application developers use the abstractions provided by the grid middleware to simplify the development of application software. Similarly, grid middleware developers use the abstractions provided by the operating system to ease their job. This is an excellent way to deal with complexity and heterogeneity, except when things go wrong.

When a software component fails, it typically affects the components that use it. This propagates up to the user, who perceives the failure. To understand what is going on, one has to drill down through the abstraction layers to find the original fault. In summary, the problem is that, when everything works, one has to know only what a software component does, but when things break, one has also to know how the component works. In Figure 7 we can see an example of a multi-layer grid architecture [7]. The figure also shows the application user and the administrators of each software layer.
Figure 7: Layered Grid Architecture versus Internet Protocol Architecture [7]
Suppose that an instantiation of this architecture has a fault at its Fabric layer. It is possible that this fault will cause the failure of a user application that runs on the grid. When the user receives an error message indicating the failure, he does not know whether the cause of the failure is in his application or in the grid. He may be able to access logs from several layers to analyze and figure out what has happened. First the user will check his application logs for some useful information. Then, after spending some time checking his application code without finding anything wrong, he will start to suspect that the grid infrastructure is broken. At this point he faces a painful task. He needs to analyze all available logs to diagnose a fault in a system that is extremely complex and runs in a very heterogeneous environment. If the user is able to read the logs, he may see several error messages from the Collective layer, Resource layer, Connectivity layer, Fabric layer, or lower operating system layers. However, how will he check the software to find out which of the layers is (are) faulty? Since the Fabric layer fault may have propagated to higher layers, the user may never discover the fault that caused the failure of his application. Therefore, he cannot contact the appropriate personnel in order to fix the problem.

Although not exclusive to grids, this problem is much bigger in grids than in traditional systems. This is because grids are much more complex and heterogeneous, encompassing a much greater number of technologies than traditional computing systems. For instance, a grid may be composed of processing resources whose architecture is completely unknown to the user. Thus, he knows nothing about it. He does not know how it should work. He does not know where its logs are stored. Therefore, there is a huge cognitive barrier between failure detection and fault diagnosis. Most of the time the logs are available, indicating a problem, but the person who reads them is not able to adequately interpret them. Consequently, grid fault treatment depends on intensive user collaboration, involving not only system administrators but also application developers. In this way, the focus of the application developer is lost, when he would probably prefer to concentrate on application functionality rather than on diagnosing middleware or configuration faults.

To overcome this cognitive barrier, fault diagnosis should be carried out in an autonomous way, with minimum user intervention. To achieve this, the software must possess the knowledge needed to diagnose its own faults. Unfortunately, current software is not prepared to provide such a level of information [15]. All it provides to the user are failure symptoms in the form of error messages, making fault diagnosis a user-intensive task. To make things worse, in some situations it is possible for a failure to happen without exposing any symptom at all to the user [11]. Therefore, a solution to manage the complexity involved in grid fault treatment is to add fault diagnosis mechanisms to currently available software. This is not an easy task, since there are some important aspects that must be addressed in order to provide a generic fault diagnosis solution. We emphasize the following:
• Intrusiveness: the solution must not require changes to the current software source code, since the source may not be available or may be too complex to be adapted to diagnose its own faults;
• Simplicity: complex solutions may add another source of uncertainty, further complicating the problem instead of solving it; a way to make a solution more reliable is to ensure that it is so simple that any programmer can completely understand it;
• Maintainability: as all software has a life cycle, the fault diagnosis solution must be prepared to evolve with its associated application in order to remain useful; and
• Flexibility: to be generic, the fault diagnosis solution must be flexible enough to be usable by different users with different applications.
3. Collaborative Fault Diagnosis

As previously discussed, in a production environment each grid software layer is an abstraction level for which appropriate personnel (e.g. application developer, middleware administrator and system support staff) are responsible for, among other things, dealing with faults. Thus, if a failure is detected at a higher layer, but its root cause is at a lower one, the corresponding staff should be informed so that they can solve the problem. The challenge is to identify the right levels for this handover, allowing collaborative drilling-down in a controlled and effective manner.
Our approach to overcoming this problem is to use automated tests to verify the correctness of software components in a production environment. Although software components may be exhaustively tested before going into production [12], we believe that the ability to run tests in production is very useful. This is because the environment and its configuration change from the development to the production environment. In fact, due to the complexity of systems, nowadays each environment is unique.

Automated production tests are the key to enabling users of the software at a given layer to determine, without understanding how the other layers work, whether a software fault occurred at their own layer or at an underlying one. They allow for finding configuration errors as well as bugs that were not detected while the software was tested within the developers' environment. Additionally, automated tests ease not only problem hand-over (by identifying which layer is broken) but also the identification of the fault within a particular layer.

The functionality of a layer is provided by a set of software components. We propose the use of diagnosis components (named Doctors) able to run the automated tests of the software components that implement the functionality of the layers. Following this approach, the developer of a software component should provide appropriate automated tests along with the software component itself. We note that these tests are very similar to (if not the same as) the functional tests that developers normally run before releasing the software components to their clients. Moreover, the diagnosis components use and implement a well-known interface to execute these tests in a collaborative way, allowing the diagnosis of faults in the whole system.

When the Doctor of a particular type of component is activated, it performs tests on a component C that it receives as a parameter and returns a diagnostic for C. If C passes all tests, then the Doctor infers that C is correct and returns this information in the diagnostic. Otherwise, if at least one test fails, a diagnostic indicating the causes of the failure and prescriptions for fixing the problem is returned. However, in this case, the Doctor also has to figure out whether C's failure is due to a fault in C or in an underlying component that C uses. To discover this, the Doctor tests all the components C' that C uses, by activating their associated Doctors. If at least one of these activations reports that some underlying component C' is faulty, then the diagnostic returned by the Doctor that is testing C will also contain information about failure causes and associated prescriptions for all underlying components C' that are faulty. Thus, when the diagnostic returned informs that only C is faulty, the fault in C is probably the root cause of the failure detected.

Referring to the layered architecture depicted in Figure 7, ideally there should be a Doctor for the user application, as well as a Doctor for each component that implements part of the functionality of a layer. As soon as a user is notified of a failure in his application, he can ask the Doctor associated with the application to diagnose the cause of the failure. This action will fire a cascade effect, involving the underlying Doctors, until one of them possibly gets a diagnosis for the fault that originated the failure.
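To make the drill-down concrete, the sketch below illustrates the behavior just described. It is a minimal illustration only: the Component and Doctor types are hypothetical, not part of any particular middleware, and the per-component Doctors are collapsed into a single recursive procedure for brevity.

// Minimal, self-contained sketch of the collaborative diagnosis described above.
// The Component interface and the Doctor class are hypothetical; they only
// illustrate the drill-down logic.
import java.util.ArrayList;
import java.util.List;

interface Component {
    void runAutomatedTest() throws Exception;   // an automated test: throws an exception on failure
    List<Component> usedComponents();           // the underlying components this one depends on
}

final class Doctor {
    // Tests a component and, if it fails, drills down into the components it uses.
    static List<String> diagnose(Component c) {
        List<String> diagnostic = new ArrayList<String>();
        try {
            c.runAutomatedTest();
        } catch (Exception failure) {
            diagnostic.add("Failure detected: " + failure.getMessage());
            boolean underlyingFault = false;
            for (Component used : c.usedComponents()) {
                List<String> lower = diagnose(used);   // activate the Doctor of the underlying component
                if (!lower.isEmpty()) {
                    underlyingFault = true;
                    diagnostic.addAll(lower);          // report causes found in underlying components
                }
            }
            if (!underlyingFault) {
                // No underlying component is faulty, so the fault in this component
                // is probably the root cause of the failure observed by the user.
                diagnostic.add("Probable root cause: fault in the tested component itself");
            }
        }
        return diagnostic;
    }
}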
This approach is non-intrusive, since it is based on external software components – the Doctors – requiring no source code modification in the target software. However, as discussed before, a suitable diagnosis tool should also be simple, maintainable and flexible. The approach implies having a Doctor for each software component. This keeps each Doctor simple, since it is responsible for fault diagnosis in a single component. One could argue, however, that this could aggravate the maintainability problem, since there will be several pieces of software to be updated. We recall that the automated tests are part of the software provided by the developers, and each developer has to worry only about testing his own software. Thus, the complexity of the maintenance problem is distributed among several developers, making it manageable even when the number of components is huge. Also, this approach could make the Doctor a less flexible tool, since each module is specialized in diagnosing faults for a particular software component. Nevertheless, the existence of a common well-known interface allows for the development of highly independent Doctor modules while still allowing them to cooperate.
4. The JavaDoctor Framework

To validate the approach proposed in the previous section we have implemented JavaDoctor, a framework for fault diagnosis of Java-based grid middleware. A collection of collaborative JavaDoctors provides a simple mechanism to test Java-based components.
It captures and diagnoses the problems that may happen in a grid environment, proposing possible solutions to fix them. JavaDoctor assumes that a grid component is a remote Java object accessed via a local Java object, which acts as its proxy, and that failures are indicated by Java exceptions.
4.1. Implementation Details

The framework core is composed of four Java entities (classes and interfaces): JavaDoctorInterface, AbstractJavaDoctor, Diagnostic and Cause. The simplified UML class diagram presented in Figure 8 illustrates the JavaDoctor framework architecture. Its main operations are:

• JavaDoctorInterface: Diagnostic test(Object aComponent); boolean isAbleToTest(Object aComponent)
• AbstractJavaDoctor: void addDoctor(JavaDoctor anotherDoctor)
• Diagnostic: void addCause(Cause aCause); Collection getCauses(); void addPrescription(String aPrescription); Collection getPrescriptions()
• Cause: void setDescription(String aDescription); String getDescription()
Figure 8: The JavaDoctor framework architecture

JavaDoctorInterface has two methods: test and isAbleToTest. The test method is used to test the Java object that plays the role of a proxy for a component, returning to the invoker a Diagnostic object that contains a detailed description of the status of the component just tested. The isAbleToTest method verifies whether a JavaDoctor implementation is able to test objects of a given class. A class that implements JavaDoctorInterface for testing objects of a particular class should be able to execute the appropriate automated tests for these objects and to emit the corresponding diagnostics.

For our purposes, an automated test is defined as a Java method that throws Java exceptions for each failure it can detect. This method usually tests a given component by using it to run a computation whose correct output is already known. If the component does not produce an output, or outputs something different from the expected result, then it is considered faulty; otherwise it is considered correct.

As described in the previous section, when a Doctor is used to test a component C, it runs C's automated tests. Moreover, if a failure is detected, the Doctor should test all components used by C in order to figure out whether the failure was caused by a fault in C or by a fault in some of the underlying components C uses. The AbstractJavaDoctor class provides a simple way to achieve this; it is a composite [8] of Doctors that implement the JavaDoctorInterface. When an object of a class that extends AbstractJavaDoctor detects a failure while testing a component C, it searches its collection of auxiliary Doctors for Doctors able to test each of the components used by C, in order to discover the fault that is causing the test to fail².

The Diagnostic is a Java class that encapsulates a textual description of a failure along with its possible causes. It also encapsulates a prescription. A prescription is a detailed textual description of what a user should do to fix the faults that have been diagnosed.
² Alternatively, one could follow a more sophisticated approach and use a discovery service to dynamically discover appropriate Doctor components in the grid.
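As an illustration of how a concrete Doctor might be written against JavaDoctorInterface, consider the sketch below. The EchoService type, its echo method and the no-argument constructors of Diagnostic and Cause are assumptions made for the sake of the example; only the method signatures of JavaDoctorInterface, Diagnostic and Cause follow Figure 8.

// Sketch of a hypothetical concrete Doctor; EchoService and the constructors used
// here are assumptions, while the framework signatures follow Figure 8.
public class EchoServiceDoctor implements JavaDoctorInterface {

    public boolean isAbleToTest(Object aComponent) {
        return aComponent instanceof EchoService;   // this Doctor only knows how to test EchoService proxies
    }

    public Diagnostic test(Object aComponent) {
        Diagnostic diagnostic = new Diagnostic();
        try {
            // Automated test: run a computation whose correct output is already known.
            testEcho((EchoService) aComponent);
        } catch (Exception failure) {
            Cause cause = new Cause();
            cause.setDescription("The echo service did not return the expected output: " + failure.getMessage());
            diagnostic.addCause(cause);
            diagnostic.addPrescription("Check that the echo service is deployed and reachable");
        }
        return diagnostic;
    }

    // An automated test is a Java method that throws an exception for each failure it can detect.
    private void testEcho(EchoService service) throws Exception {
        String reply = service.echo("ping");
        if (!"ping".equals(reply)) {
            throw new Exception("expected \"ping\" but received \"" + reply + "\"");
        }
    }
}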
The Cause class encapsulates the textual description of the fault that originated a failure. Notice that, since a test of a component C may spawn the execution of tests in several components used by C, the diagnostic of a failure in C may include as its cause the diagnostic of a failure in a component used by C. This creates a "stack trace" of the diagnosis process.

A typical implementation of the JavaDoctorInterface has several tests, each one throwing several Java exceptions. It must be able to interpret every exception that may be thrown by any of the tests and provide the appropriate diagnostic. This could result in a complex piece of software. As we argued before, a diagnosis tool should be as simple as possible. In order to simplify the implementation of Doctors we decided to detach the test execution code from the test interpretation knowledge. With this separation one can modify the test execution code without caring about how the results will be interpreted, and vice-versa. We have exercised this solution with two AbstractJavaDoctor extensions. We codified only the test execution code inside the classes that implement the JavaDoctorInterface and used an XML file to provide the test interpretation knowledge.

The XML file has four sections. The Tests section defines the collection of tests that will be executed by the Doctor. It also defines, for each test, all known exceptions that may be thrown by the execution of that particular test. For each exception, it is possible to associate the identification of a corresponding diagnostic. The Diagnostic section gives a detailed description of the diagnostics whose identifications appear in the Tests section. It contains, for each diagnostic, a textual description, a list of identifications of possible causes and a list of identifications of possible prescriptions. A cause represents a fault that may originate an exception. A prescription gives the user a hint on how to correct a fault. We note that, as discussed before, a diagnostic may contain another diagnostic as its cause. The last two sections (Causes and Prescriptions) provide textual descriptions for all causes and prescriptions that appear in the file.
Figure 9: Sample JavaDoctor XML file

Figure 9 shows a sample XML file that provides the knowledge needed to interpret one of the Java exceptions of a test named testRun. This example describes a test for the run service of the Globus GRAM. The knowledge codified in this
sample file specifies that if the testRun method has thrown an org.ietf.jgss.GSSException, the Doctor that is executing the test should emit a diagnosis stating that the ExpiredProxy failure was caused by an ExpiredCredentials fault. It must also emit RestartProxy and CheckProxyCredentials as prescriptions to solve the problem. Further down in the XML file, the meanings of ExpiredProxy, ExpiredCredentials, RestartProxy and CheckProxyCredentials are described.
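Since the original figure is not reproduced here, the fragment below is a reconstruction of the kind of knowledge file the text describes. The element and attribute names are assumptions, not the framework's actual schema; only the identifiers (testRun, org.ietf.jgss.GSSException, ExpiredProxy, ExpiredCredentials, RestartProxy, CheckProxyCredentials) and the textual descriptions follow the example given in the text and in Table 1.

<!-- Illustrative reconstruction of a JavaDoctor knowledge file; tag and attribute
     names are assumptions, only the identifiers and descriptions come from the text. -->
<javadoctor>
  <tests>
    <test name="testRun">
      <exception class="org.ietf.jgss.GSSException" diagnostic="ExpiredProxy"/>
    </test>
  </tests>
  <diagnostics>
    <diagnostic id="ExpiredProxy">
      <description>Expired credentials detected</description>
      <cause ref="ExpiredCredentials"/>
      <prescription ref="RestartProxy"/>
      <prescription ref="CheckProxyCredentials"/>
    </diagnostic>
  </diagnostics>
  <causes>
    <cause id="ExpiredCredentials">The proxy lifetime has been reached</cause>
  </causes>
  <prescriptions>
    <prescription id="RestartProxy">Try to restart the proxy</prescription>
    <prescription id="CheckProxyCredentials">Check the proxy credentials</prescription>
  </prescriptions>
</javadoctor>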
4.2. Experiments

To validate the JavaDoctor framework we have instantiated it to test particular software components of two distinct grid middleware infrastructures. We extended AbstractJavaDoctor, creating two new classes: GRAMJavaDoctor and GUMJavaDoctor. The first one is able to test some services of the Globus GRAM [9], while the second one can test the MyGrid UserAgent [5]. Since the JavaDoctor framework can only be used to diagnose faults in Java objects, we relied on the Java CoG Kit³ [6] to encapsulate the communication with the Globus GRAM. Further, we have also developed an ApplicationDoctor for the application we have used in the experiments with both grid middleware infrastructures. Both the GRAMJavaDoctor and the GUMJavaDoctor are auxiliary Doctors for the ApplicationDoctor. Once a failure is detected, the user invokes the ApplicationDoctor to diagnose the fault, be it in the application itself or in the underlying middleware.

In these experiments we have focused on configuration and deployment faults, since they have been pointed out by the respondents of our survey as the main causes of failures (see Figure 1). We created automated tests to diagnose some common faults that can be made during the configuration and deployment phases of both systems. The experiments consisted of submitting the application for execution in the grid and injecting the selected configuration or deployment faults. After the failure was reported, we ran the ApplicationDoctor to diagnose the fault. We could then compare the information the user receives with and without the assistance of the ApplicationDoctor.

On the Globus 2.4 middleware [9], we need to use GridFTP services to transfer files and globus-job-run to submit jobs to the remote machines. We created automated tests for submitting a job, sending a file, and getting a file from a remote machine. We injected the following faults in this experiment:
• Offline Proxy – We tried to run an application without starting the Globus proxy.
• Globus Services Offline – In this test we tried to send a job to a machine that was not running any Globus service.
• Expired Proxy – We started a proxy with a very small lifetime and then waited until it expired before executing the application.
• Application – In this test we introduced a fault in the application. We sent the job to be executed on a machine that was running Globus services but that didn't have some files required by the application.
³ Commodity Grid (CoG) Kits are high-level frameworks used to encapsulate the access to grid services.
Table 1 shows the failure symptoms and the GRAMJavaDoctor diagnostics for the above failure scenarios.

Table 1: ApplicationDoctor diagnostics for Globus failure scenarios

• Offline Proxy
  Failure symptoms: Defective credential detected [Root error message: Proxy file (/tmp/x509up_u1140) not found]
  Diagnostics: Invalid or expired Proxy
  Prescriptions: Try to restart the proxy

• Globus Services Offline
  Failure symptoms: Cannot start destination service
  Diagnostics: The gatekeeper is not running; The host is unreachable; The gatekeeper is listening to a non-standard port
  Prescriptions: Make sure you can ping the host; Make sure the gatekeeper is being launched by inetd; Make sure the gatekeeper is listening to the standard port

• Expired Proxy
  Failure symptoms: Expired credentials detected
  Diagnostics: The proxy lifetime has been reached
  Prescriptions: Try to restart the proxy; Check the proxy credentials

• Application
  Failure symptoms: Exception in thread "main" java.io.FileNotFoundException: input1.dat (No such file or directory)
  Diagnostics: An application file was not found in the grid machine file system
  Prescriptions: Verify your job description file
As can be seen, a single fault can have multiple diagnostics, as is the case, for example, of the "Globus Services Offline" fault. Also, we can observe that the message "Invalid or expired Proxy" is easier to understand than the message "Defective credential detected [Root error message: Proxy file (/tmp/x509up_u1140) not found]". The experiment shows how the use of the collaborative fault diagnosis tool facilitates the hand-over problem. The diagnostic and prescription messages shown in Table 1 make clear that the first three faults are in the middleware, while the last fault is in the application.

We have also conducted an experiment using the MyGrid middleware [5]. It provides a resource virtualization service named UserAgent. This service can be used to transfer files to and from a remote machine, as well as to execute and kill remote tasks. GUMJavaDoctor is able to test MyGrid's grid machine abstractions. Again we chose common configuration and deployment faults and codified automated tests to diagnose them. The faults injected were:
• Offline Machine – We used a grid created with only offline machines. Since there were no available machines to execute the tasks of the application, the application never ended.
• Offline UserAgent – We created a grid with online machines, but we didn't start the UserAgents on all of them. When MyGrid sent a task to be executed on a machine without a running UserAgent, it failed due to communication problems.
• Incorrect Java Version – We added to the previous scenarios some machines with older Java versions, still without starting the UserAgents. Again, MyGrid failed due to communication problems.
• Incorrect Port – This is yet another flavor of the previous scenarios. We included grid machines with other applications using the ports specified to be used by the UserAgents.
• Wrong File Permissions – We configured the UserAgents to use read-only directories for file storage. When the application tried to transfer some file, it failed because the remote machine disk could not be written.
Table 2 summarizes the output of the experiment.

Table 2: ApplicationDoctor diagnostics for MyGrid failure scenarios

• Offline GridMachine
  Failure symptoms: GridMachine is unavailable
  Diagnostics: Host is currently down; Host does not exist
  Prescriptions: Make sure you can ping the host

• Offline UserAgent
  Failure symptoms: GridMachine is unavailable
  Diagnostics: The UserAgent has not been started
  Prescriptions: Verify if the UserAgent is running

• Incorrect Java Version
  Failure symptoms: GridMachine is unavailable
  Diagnostics: Incorrect Java version
  Prescriptions: Verify the Java version of your grid machines

• Incorrect Port
  Failure symptoms: GridMachine is unavailable
  Diagnostics: The UserAgent port is already in use
  Prescriptions: Verify your UserAgent port in the configuration file

• Wrong File Permissions
  Failure symptoms: Permission denied
  Diagnostics: The UserAgent could not write to disk
  Prescriptions: Verify your UserAgent output directory in the configuration file
Another difficulty in fault diagnosis is that different faults may appear to the user as the same failure. For instance, in the experiments with MyGrid, four of the five injected faults appeared to the user as a "GridMachine is unavailable" failure. This experiment shows how the use of the ApplicationDoctor solves this problem. The diagnosis provided by the ApplicationDoctor through its auxiliary GUMJavaDoctor gives the user a much clearer view of what is happening with his application. Further, the prescriptions provided indicate what should be done (and, if necessary, which personnel should be contacted) to solve the problem.
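From the user's perspective, the workflow described above could look roughly like the following snippet. It is only a hypothetical sketch: the no-argument constructors, the addDoctor wiring and the myApplication proxy object are assumptions, while the Doctor and Diagnostic operations follow Figure 8.

// Hypothetical usage sketch: after a failure is observed, the user asks the
// ApplicationDoctor (and, through it, the auxiliary Doctors) for a diagnosis.
// Constructors and the myApplication proxy object are assumptions for illustration.
ApplicationDoctor applicationDoctor = new ApplicationDoctor();
applicationDoctor.addDoctor(new GRAMJavaDoctor());   // auxiliary Doctor for the Globus GRAM services
applicationDoctor.addDoctor(new GUMJavaDoctor());    // auxiliary Doctor for the MyGrid UserAgent

Diagnostic diagnostic = applicationDoctor.test(myApplication);
for (Object cause : diagnostic.getCauses()) {
    System.out.println("Cause: " + ((Cause) cause).getDescription());
}
for (Object prescription : diagnostic.getPrescriptions()) {
    System.out.println("Prescription: " + prescription);
}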
5. Related Work

Many systems have been proposed for grid monitoring [1][2][18][19][20][21][23]. The GMA (Grid Monitoring Architecture) [21], for instance, is an open standard for grid monitoring that is being developed by the Global Grid Forum Performance Working Group. Its architecture consists of four types of components, namely: directory service, producer, consumer and intermediary. The directory service supports information publication and discovery. Producers make management information available in the form of events. Consumers receive events containing management information and process them. Intermediaries implement both the producer and the consumer interfaces to provide specialized services. For instance, an intermediary may consume events from several producers and produce new events derived from the received events. The Reporting Grid Services (ReGS) system [1] specifies two kinds of intermediaries for the monitoring of OGSA applications [22]: an intermediary for filtering events and another for logging them. The Failure Detection Service (FDS), described in [11], is a notification service able to provide information regarding the status of a task submitted for execution in a grid. In summary, grid monitoring solutions are only concerned with gathering information across grid nodes. Although they are useful for failure detection, they are not able to solve the cognitive problem associated with fault diagnosis.

There are also solutions focusing on fault tolerance [10][11][13][14][17][24]. Such solutions strive to make the application run correctly even in the presence of failures. Solutions such as GALLOP [24] and WQR [17], for instance, use replication to provide fault tolerance. GALLOP replicates SPMD (single-program-multiple-data) applications in different sites within the virtual organization, while WQR is an efficient fault-tolerant scheduler for bag-of-tasks applications that automatically reassigns tasks that failed. Grid-WFS [11] provides fault tolerance through a flexible recovery scheme. Built on top of a failure detection service, it provides techniques to mask failures through replication, retries and rollback-recovery. Legion [10] and Condor [14] also provide fault tolerance through rollback-recovery. In Legion, this is provided at the application level, while in Condor it is embedded into the middleware. Phoenix [13], on the other hand, is a solution able to detect and classify failures in data-intensive grid applications. It uses a probabilistic strategy to detect failures in file transfers. Once a failure is detected, the Phoenix Failure Agent is used to classify faults that affect file transfers as either transient or permanent faults. When the fault is transient, the failed transfer may be retried. Similarly to our approach, the classification process is based on the execution of automated tests. The Failure Agent tests specific network (e.g. DNS correctness, connectivity, authentication) and file permission conditions. Although the classification of
faults provided by Phoenix is useful for avoiding the retry of an operation over a component that is experiencing a permanent fault, it is not useful for indicating which actions should be taken to fix the system.

Mechanisms for automated fault tolerance are, without any doubt, very useful and are becoming more popular, as shown in Figure 2. However, in all cases discussed in the previous paragraph, failures are masked and the user is not aware of them⁴. Fault diagnosis is not a requirement for these mechanisms. Nevertheless, if the fault is permanent, it is not possible to completely recover the faulty component without proper fault diagnosis.

⁴ In the case of Phoenix, the user may be signaled that a permanent fault has occurred, but no detailed diagnostic is provided.
6. Conclusions

We have seen many advances regarding failure recovery in grid platforms. Failure recovery mechanisms are very useful, in the sense that they are able to mask from the user the occurrence of failures. However, not all failures can be masked. Moreover, even when failures are masked, if the faulty components are not properly fixed, the performance and usability of the grid may degrade to unacceptable levels. Thus, fault diagnosis is a basic step towards the provision of an adequate execution environment. Unfortunately, this is an area that has been largely overlooked. As a consequence, grid users still do not have appropriate support for fault diagnosis. When the system breaks, the user has to figure out by himself why the system is not working. Unfortunately, there are too many different components in a grid, and it is not reasonable to expect that the user will master all details of a grid. Furthermore, since the user does not know the cause of the failure, he cannot contact the appropriate personnel to fix the problem.

To solve this problem, we have proposed a collaborative fault diagnosis approach. In this approach we use automated tests to verify the correctness of software components in a production environment. Automated tests are the key to enabling users of a software component to determine, without understanding how the other components of the system work, whether a software fault occurred at that component or at an underlying component it uses. Accurate fault diagnosis allows for effective hand-over of responsibilities and a reduction in the downtime of faulty components, since the appropriate personnel can be immediately contacted to solve the problem. We propose the use of Doctors – diagnosis components able to run the automated tests of the software components in a collaborative way, through a common well-known interface, allowing the diagnosis of faults in the whole system.

To validate our fault diagnosis approach we have developed JavaDoctor, a framework for fault diagnosis of Java-based grid middleware. JavaDoctor captures and diagnoses the problems that may happen in a grid environment, proposing possible solutions to fix them. Our preliminary experiments, diagnosing faults in components of two distinct grid middleware infrastructures (Globus [9] and MyGrid [5]), showed the flexibility of the framework and the applicability of our approach as a generic grid fault diagnosis solution. JavaDoctor is open source and available for download at www.ourgrid.org.
Acknowledgements

This work has been partially developed in collaboration with HP Brazil R&D. We are indebted to Katia Saikoski for her valuable comments on earlier versions of this manuscript. The authors would also like to thank CNPq/Brazil for its financial support.
References

1. Y. Aridor, D. Lorenz, B. Rochwerger, B. Horn and H. Salem, Reporting Grid Services (ReGS) Specification, Haifa Research Lab, 2003.
2. M. Baker and G. Smith, GridRM: A Resource Monitoring Architecture for the Grid, in: Proceedings of GRID 2002, Lecture Notes in Computer Science, Vol. 2536 (Springer, Berlin, 2002) 268-273.
3. R. Barrett, E. Haber, E. Kandogan, P. P. Maglio, M. Prabaker and L. A. Takayama, Field Studies of Computer System Administrators: Analysis of System Management Tools and Practices, in: Proceedings of Computer-Supported Cooperative Work 2004 (ACM Press, Chicago, 2004) 388-395.
4. W. Cirne and F. Berman, A Model for Moldable Supercomputer Jobs, in: Proceedings of IPDPS 2001: International Parallel and Distributed Processing Symposium, 2001, 59.
5. W. Cirne, D. Paranhos, L. Costa, E. Santos-Neto, F. Brasileiro, J. Sauvé, F. A. B. Silva, C. O. Barros and C. Silveira, Running Bag-of-Tasks Applications on Computational Grids: The MyGrid Approach, in: Proceedings of ICPP 2003 - International Conference on Parallel Processing, 2003, 407.
6. The CogKit Team, CogKit, http://www.cogkit.org/, 2005.
7. I. Foster, C. Kesselman and S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International Journal of High Performance Computing Applications 15 (2001) 200-222.
8. E. Gamma, R. Helm, R. Johnson and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley, Reading, 1995).
9. The Globus Alliance, Globus, http://www.globus.org, 2005.
10. A. S. Grimshaw, A. Ferrari, F. Knabe and M. Humphrey, Wide-Area Computing: Resource Sharing on a Large Scale, IEEE Computer 35 (1999) 29-37.
11. S. Hwang and C. Kesselman, A Flexible Framework for Fault Tolerance in the Grid, in: Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, 2004, 251-258.
12. R. E. Jeffries, A. Anderson and C. Hendrickson, Extreme Programming Installed (Addison-Wesley, Boston, 2000).
13. G. Kola, T. Kosar and M. Livny, Phoenix: Making Data-intensive Grid Applications Fault-tolerant, in: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, 2004.
14. M. Litzkow, M. Livny and M. Mutka, Condor – A Hunter of Idle Workstations, in: Proceedings of the 8th International Conference on Distributed Computing Systems, 1988, 104-111.
15. P. P. Maglio and E. Kandogan, Error Messages: What's the Problem? Real-world Tales of Woe Shed Some Light, ACM Queue 2 (2004) 51-55.
16. R. Medeiros, W. Cirne, F. Brasileiro and J. Sauvé, Faults in Grids: Why Are They So Bad and What Can Be Done About It?, in: Proceedings of the Fourth International Workshop on Grid Computing, 2003, 18-24.
17. D. Paranhos, W. Cirne and F. Brasileiro, Trading Information for Cycles: Using Replication to Schedule Bag-of-Tasks Applications on Computational Grids, in: Proceedings of Euro-Par 2003: International Conference on Parallel and Distributed Computing, 2003.
18. W. Smith, A Framework for Control and Observation in Distributed Environments, NASA Advanced Supercomputing Division, 2001.
19. P. Stelling, I. Foster, C. Kesselman, C. Lee and G. Laszewski, A Fault Detection Service for Wide Area Distributed Computations, in: Proceedings of the 7th IEEE Symposium on High Performance Distributed Computing, 1998, 268-278.
20. B. Tierney, B. Crowley, D. Gunter, M. Holding, J. Lee and M. Thompson, A Monitoring Sensor Management System for Grid Environments, in: Proceedings of the IEEE High Performance Distributed Computing Conference, 2000, 97-104.
21. B. Tierney, R. Aydt, D. Gunter, W. Smith, V. Taylor, R. Wolski and M. Swany, A Grid Monitoring Architecture, Working Document, Global Grid Forum Performance Working Group, 2002.
22. S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham and C. Kesselman, Grid Service Specification, Draft 3, Global Grid Forum, 2002.
23. A. Waheed, W. Smith, J. George and J. Yan, An Infrastructure for Monitoring and Management in Computational Grids, in: Proceedings of the 5th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, 2000, 235-245.
24. J. Weissman, Fault Tolerant Computing on the Grid: What are My Options?, Technical Report, University of Texas at San Antonio, 1998.