Differential Diagnosis for Software Systems

Differential Diagnosis for Software Systems: A Meta-Methodology for In-Field Root Cause Analysis Eli M. Dow Test and Integration Center for Linux IBM Systems and Technology Group Poughkeepsie NY [email protected]

ABSTRACT Enterprise software and hardware customers rely on vendor-supplied diagnosticians to perform in-field root cause analysis and isolation in the presence of simultaneous faults. These diagnosticians routinely operate without sufficiently detailed user guides or complete domain-specific knowledge. Teaching software problem diagnosis as performed in-the-field is underrepresented in literature and practice. In-situ problem diagnosis suffers from inconsistent tooling and terminology, and from a lack of formalized, generalized best practices. This paper examines problems that should be considered by systems diagnosticians and developers. We identify inhibiting factors to the success of in-field diagnosis and offer a critique of the current state of affairs in the field. We compare and contrast the diagnosis of computing systems to traditional medical differential diagnostic practices and demonstrate specific areas where we should learn from medical practitioners to enhance our field. Ultimately we propose a meta-methodology and adopt terminology to enable computer systems diagnosticians to work and communicate more effectively.

General Terms Management, Documentation, Human Factors, Standardization

Keywords Debugging, Diagnosis, Systems Thinking, Bugs, Faults.

Erin M. Farr z/OS Development IBM Systems and Technology Group Poughkeepsie NY [email protected]

1. INTRODUCTION Current diagnostic techniques as practiced in IBM, while mature in their specific area, are not formally taught. At no time is this more evident than when a customer opens a critical problem and time is of the essence. A generalized representation of the diagnostic state of a problem could set a firm foundation for getting additional experts on board quickly. Our empirical observations have shown that current inconsistent terminologies hinder communication and delay results when experts from various areas are pulled together for critical customer situations. Also, the lack of a formalized methodology impedes the sharing of common best practices and techniques across diagnostic areas. Critical situation systems diagnosis is often performed in a computing environment which contains a plurality of software products, operating systems, and hardware. While subject matter experts can work together to rule out innocent components, often domain-specific knowledge is not readily available until diagnosis has indicated a potential root cause component. To complicate matters, many critical situations are the result of an interaction between two or more technology products causing a previously unknown malfunction. Pinpointing suspect components within large scale computing systems becomes especially challenging when so few high-level generalists exist. It is rare to find a single individual with sufficient breadth and depth of technical domain knowledge about the multiple components making up such an environment. The field of medicine has developed methodologies, such as differential diagnosis, to diagnose health problems in human patients. We contend that if we approach a fault like a doctor does a disease, and treat errors as symptoms, we can utilize differential diagnosis for fault diagnosis.

1.1 Introduction to Differential Diagnosis Like problems in humans, problems in computer systems are complex and often difficult to diagnose. Medical diagnosticians have readily available, portable, universally agreed-upon compilations of symptoms (errors) at their disposal. It is our observation that no comparable compendium exists for the software systems diagnostician. In addition, we lack rigorous scientific study and evaluation in the face of adverse in-the-field critical situations with customers who may be quite irate. Borrowing terminology guidelines, practices, and inspiration from medical diagnosis, we assert that more structure and scientific emphasis in diagnosis methodology will ultimately lead to more productive in-situ diagnosis of complex computing environments.

1.1.1 What is Differential Diagnosis? The term diagnosis derives from the Greek words, dia meaning "apart", and gnosi meaning "to learn" [1]. This particular definition symbolically points out the action of breaking apart a problem into smaller problems, which might also need further diagnosis by employing analytical methods or the process of elimination. Diagnosis is used in a variety of disciplines by varying the application of logic and experience to determine the cause and effect relationships. Differential diagnosis is the process of weighing the probability of one disease against other diseases, to attempt to account for a patient's illness [2]. Specifically, the recognition of a disease or condition occurs through contrasting the signs, symptoms, and laboratory findings of two or more diseases to systematically perform an analysis of the underlying causes. The differential diagnosis process begins when the patient presents a given condition or circumstance, referred to in medical literature as the presenting problem or chief complaint [3]. During this process doctors attempt to identify underlying causal factors and concurrent phenomena. Some goals of medical differential diagnosis are to:

• Clearly understand a patient's condition.

• Determine a reasonable prognosis, which is a statistical assessment of a patient's chance of recovery. Note that a prognosis is only a prediction. In computing, this notion aligns with the subjective probability that the fix will address the failure.

• Determine appropriate treatment or intervention for the condition. Treatment is the management and care of a patient to combat disease, which may include a cure. Note that medical treatment does not include conducting diagnostic tests. In computing, a cure is analogous to a permanent vendor-supported fix. When a cure is unknown, treatment indicates the need for a workaround which may be an intermediate or transient solution.

• Provide a means for a patient to integrate the condition or circumstance into their lives, until such time as the condition can be ameliorated. In computing, this form of treatment is analogous to a workaround.

To aid in diagnosis, several medical texts contain grouped listings of the most common causes of a given symptom or listings of disorders similar to a given disorder [4] [5]. These texts are often annotated with diagnostic suggestions for narrowing the search [6]. In computer science, some work is being done to create a collection of diagnostic footprints, such as in [7]. However, this footprint introduces performance degradation which is especially undesirable for enterprise systems.

1.1.2 Diagnostic Criteria The combination of signs, symptoms, and test results that allows the medical diagnostician to ascertain the diagnosis of a disease is sometimes referred to as diagnostic criteria. Candidate diagnoses are removed from the list of potential diagnoses by making observations and using scientific tests. The tests are designed to have different results depending on which diagnosis is correct. A diagnosis is confirmed by tests which have positive results or ruled out by tests with negative indicators. Our survey of medical literature indicates that medical professionals often rely on four factors when performing diagnosis:

• Anatomy – describes the structure and components of the human body [8]. In systems development, much care has been taken to break the components into larger groups of major subsystems, which facilitates communication between diagnosticians.

• Physiology – describes the function of an organism's biological system and the way in which different parts function as an integrated system [9]. In systems development, physiology is analogous to the designed behavior (intended function) of a component.

• Pathology – describes the scientific study into the nature of disease, its causes, origins, method of action, development, and impact of that disease in both functional and material terms [10]. Pathology is also used to describe the anatomical or functional manifestations of a given disease. Another way of looking at the definition is to consider pathology to describe a departure or deviation from a normal condition. In computing, a pathology can represent the set of errors for a given service failure, or the service failure itself.

• Psychology – describes behavioral and thought patterns of the individual under diagnosis. This is not directly analogous to computing but we include it for completeness.

Medical practitioners are trained to know the human body and its associated functions in terms of normality. This is referred to as homeostasis [11]. The concept of normalcy for a system should be considered when dealing with computer systems diagnosis.
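The ruling in and ruling out of candidate diagnoses described in this section can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class `DiagnosticTest`, the function `eliminate`, and the example fault and test names are hypothetical and do not come from any existing tool.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class DiagnosticTest:
    """A test designed to give different results depending on the true cause."""
    name: str
    # Hypothetical mapping: candidate fault -> result expected if it is the cause.
    expected: Dict[str, bool]


def eliminate(candidates: Set[str], test: DiagnosticTest, observed: bool) -> Set[str]:
    """Keep only the candidates that are consistent with the observed result.

    Candidates the test says nothing about are retained, because the test
    neither confirms them nor rules them out.
    """
    return {
        c for c in candidates
        if c not in test.expected or test.expected[c] == observed
    }


# Illustrative data only (assumed fault names).
candidates = {"disk_full", "memory_leak", "network_partition"}
free_space_test = DiagnosticTest(
    name="free_space_below_threshold",
    expected={"disk_full": True, "memory_leak": False},
)
remaining = eliminate(candidates, free_space_test, observed=False)
```

In this sketch a negative result removes `disk_full` from the list, while `network_partition` stays because the test does not discriminate it.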

1.2 Characteristics of Differential Diagnosis in Medicine An important aspect of differential diagnosis is the fluid process which is used in response to new information. Hypotheses are created, tested, and eliminated as new information becomes available. Physicians are agile and respond to a variety of information sources such as lab tests, empirical observations, and remarks from the patients themselves. During empirical observation, the doctor considers the patient in their normal, or 'well', context rather than as an animate collection of medical conditions. Medical professionals are often forced to assess the greater context of the patient, such as family and stress level. In medicine, diagnosis often benefits from these added insights. Notably, differential diagnosis can be performed on an unconscious or unresponsive patient. Typically a cognizant patient will present their set of symptoms and chief complaints to the doctor. However, in the case where the patient is unable to do so, this unresponsive or unconscious state is considered to be the de facto complaint. In general, however, the medical practitioner obtains further information directly from the cognizant patient about that patient's symptoms and general state of health prior to the currently presented malady. Given the incredible number of potential afflictions of the human body, it is simply impossible to consider exhaustively the myriad diseases which at any one time may be troubling a patient. Physicians excel at quickly and efficiently paring down the set of possible illnesses which are likely to account for the observed symptoms. The doctor must then perform the mental gymnastics necessary to rank, in order of probability, the diseases which are most likely causing those symptoms. To accomplish this, doctors make use of diagnostic tools, both practical and educational, such as charts and reference books.
In some advanced medical situations, a physician is unable to diagnose with certainty until they have performed further medical testing to confirm or disprove the diagnosis. A benefit to running these additional diagnostics is that this data can be used to document a patient's status and ensure their medical history and records are current. This documentation becomes vital as medical consultation with other physicians and specialists is routinely sought when a scenario is out of the ordinary. In many ways doctors can exemplify teamwork under pressure and in complex problem solving, in part based on the meticulous record keeping and use of common terminology. Another defining characteristic of medical diagnosis is the scientific rigor and laboratory examination that plays such a critical component in complex diagnosis. In some cases a doctor may use a laboratory recreation as a substitution for direct examination of a patient. More commonly, the laboratory recreation is a complementary factor done in conjunction with a physical patient examination. For example, in an infectious disease situation, the examination of the patient often leads to a determination that the patient is suffering from some class of infection, while a laboratory examination determines which pathogen is causing that infection. We will return later to the concept of short-depth decision trees for in-the-field problem classification, with supporting laboratory examination.

In some situations when presented with multiple symptoms, a doctor may prescribe concurrent treatments. We assert that it is more important for system diagnosticians than for medical doctors to pin down the underlying fault. While a patient may recover because both suspected conditions were treated, we need to be concerned with prevention of the actual fault in other systems. Unlike a disease in a patient, a specific fault which is isolated in a computer system can usually be prevented from occurring in other computer systems. In essence it is conceptually similar to the model of vaccination used in medicine.

The last major characteristic of differential diagnosis is the brevity of interaction between patients and doctors. Many doctors can diagnose the majority of illnesses in a short period of time because of the general obviousness of the illness or the vast experience of the physician. According to some, a contributing factor is that the diagnostic decision trees that physicians use have a shallow depth [12].

William Osler suggested in the early 1900s that any medical doctor should be capable of recognizing known manifestations of diseases previously described by medicine [18]. It was also realized that a good physician must understand the underlying mechanism by which a disease operates, in addition to its prevention and cure.

1.3 Occam’s razor and Hickam’s Dictum In computer science, efficient programmatic searches often make use of tight algorithmic restraints. In medicine, an implicit algorithmic restriction exists in the form of Occam’s razor.

1.3.1 Reciting Occam's razor William of Occam in the fourteenth century stated, "The assumptions introduced to explain a thing must not be multiplied beyond necessity" [13]. This concept is often referred to as "diagnostic parsimony" or "unitarianism" [14]. Clinicians attempt to determine a unifying diagnosis that explains all of the patient's problems. A commonly recited generalization dictates that, in situations where there are a number of possible explanations for observed phenomena, the simplest explanation is preferred. Medical students are taught the adage that when you hear hoofbeats in Manhattan, think horse, not zebra [15]. Usually, less likely diagnoses are considered only after the most probable diagnoses have been excluded as candidates.

1.3.2 Learning Hickam's Dictum Hickam's dictum challenges the use of Occam's razor in the medical profession [16]. The principle is attributed to John Hickam, M.D., who is purported to have said, "Patients can have as many diseases as they damn well please." [17] The common process which occurs when diagnosing a patient is a recursive stream of hypothesis development, modification, and subsequent testing of that hypothesis. Hickam's dictum asserts that at no stage should a particular diagnosis be excluded solely because it does not conform to Occam's razor. In other words, diagnostic parsimony does not demand that the diagnostician necessarily opt for the simplest explanation. Instead it suggests that a medical practitioner should seek explanations without unnecessary additional assumptions which account for all of the relevant diagnostic evidence. Consider Hickam's dictum as a limiting principle to that of Occam's razor. In practice we have seen that computer systems often have multiple root causes for different user-reported chief complaints. This situation may arise as a result of discrete events or exposure to a combination of situations. Thus it should remain the practice of the diagnostician never to exclude a fault solely because it doesn’t account for all presented errors.
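One way to picture the interplay of the two principles is as a search for the smallest set of faults that accounts for every observed error. The Python sketch below is an illustration only, not a method proposed by the paper: the function `smallest_explanations` and the fault and error names in `KNOWN` are hypothetical, and a real fault catalogue would be far less tidy.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def smallest_explanations(observed: Set[str],
                          fault_errors: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Return every smallest set of faults whose combined errors cover `observed`.

    A result of size one is a unifying diagnosis (Occam's razor). Larger sets
    are returned only when no single fault accounts for all observed errors,
    which is the situation Hickam's dictum warns about.
    """
    faults = sorted(fault_errors)
    for size in range(1, len(faults) + 1):
        hits = [
            frozenset(combo)
            for combo in combinations(faults, size)
            if observed <= set().union(*(fault_errors[f] for f in combo))
        ]
        if hits:
            return hits
    return []  # no combination of known faults explains the observations


# Assumed catalogue of known faults and the errors each one produces.
KNOWN = {
    "expired_certificate": {"tls_handshake_failure"},
    "full_log_volume": {"write_error", "slow_response"},
    "bad_dns_entry": {"connection_timeout"},
}

single = smallest_explanations({"write_error", "slow_response"}, KNOWN)
multiple = smallest_explanations({"tls_handshake_failure", "write_error"}, KNOWN)
```

Here `single` contains one unifying fault, while `multiple` needs two faults together because no single entry in the assumed catalogue produces both errors.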

1.3.3 The Oslerian ideal

Osler placed emphasis on the identification of a disease (or on a broader classification of related diseases if no perfect classification could be made) such that the appropriate known remedies could be applied. By using a detailed examination of previously seen cases of pathology and morbid anatomy one could make better judgments about disease. In addition to classification of diseases based on previous science, it was also observed that any individual patient is actually representative of a class of people who have the same disease. This has led to a general trend in thinking that the biological individuality of that patient is not considered ultimately important when compared to the population trends. This emphasis on categorization for diagnosis as it relates to computing will be highlighted in later sections. The dominant medical philosophy today views the patient as a collection of symptoms to be characterized and analyzed algorithmically in order to draw a diagnosis and subsequently produce a strategy of treatment.

2. DIFFERENTIAL DIAGNOSIS FOR COMPUTING SYSTEMS It is probable that many expert computer systems diagnosticians already utilize techniques that align with differential diagnosis. Formalizing these techniques can further assist with facilitating communication during critical customer problems, sharing best practices, and sustaining our talent as we train the next generation of software diagnosticians. We posit that if such a meta-methodology was described and socialized across the industry, related tools could then be created for use in the design phase of software development to ensure in-field diagnostics and root cause analysis can be accomplished more easily than today. Differential diagnosis is a combination of logic and experience. In medicine this act is referred to as “practice.” The medical practitioners have a formalized methodology, to the great benefit of their science. Likewise the precedent has been set by software engineers who have already adapted concepts rooted in disciplines such as engineering for the benefit of the computing community at large. We feel the time is right to coalesce around a common set of best practices and industry specifications derived from medical diagnosis to perform a similar service.

2.1 Comparison of Field Maturity A major contributor to the difference in maturity of the two disciplines is that the field of computer science is much younger than the field of modern medicine [19] [20]. Correspondingly, the level of diagnostics available to practitioners of each discipline varies widely. Medicine began using a myriad of techniques which have since been refined and replaced. In comparison, computer science is also refining and replacing its own diagnostic toolset, but at a much faster pace than was shown in early medicine. It should be appreciated that both disciplines are accelerating their rate of change due to advancing technology. In addition to the practical science of medicine, the concept of diagnostic medicine has been socialized to most patients receiving care. We have all become quite familiar with the process of doctor's visits: a basic triage when you arrive, a panel of background questions, and ultimately time spent with the medical professional performing the diagnosis to yield an effective treatment plan. In fact, doctors used to order specific blood tests when they suspected an issue with a given patient, but now routinely run full blood chemistry profiles as part of a regular checkup. This practice is now commonplace because the tests have become societally accepted and cost effective. This new standardization has offered medicine the opportunity to speed up the process of diagnosis. This includes performing non-invasive diagnoses using radiological technologies such as Magnetic Resonance Imaging. These technologies provide noninvasive yet high-resolution glimpses into the patient during the course of their illness, optimally while the patient is experiencing their chief complaint.

2.2 Rationale for a Formal Methodology Current techniques for diagnosis in IBM are ad-hoc, component-specific, and not formalized across products, platforms, and services. We have empirically observed that knowledge transfer during customer critical situations and service engagements is time consuming. Diagnosticians working in shifts around the clock spend a substantial portion of time communicating results of earlier tests to one another before passing the work on to the next crew. This communication is necessary, but perhaps the time is right to provide a common set of diagnostic terminology, along with a representation of the current diagnostic state. For example, such a representation could include previously executed tests, the current active hypothesis, and reasons the remaining hypotheses have been excluded or reprioritized. Such record keeping would not only serve to assist those doing the diagnosis, it would serve as a valuable training aid to those who wish to learn the thought process of the diagnostic engineers. Additionally, such records could become a valuable asset for performing post-mortems (note that the commonly used term post-mortem itself was borrowed from medical literature). Much work has been done to classify code defects during the development phase [21]. This paper focuses specifically on operational faults during the use phase [22], which we refer to as “in-the-field.” While it is important to focus on fault reduction during the development phase, most accept the reality that some faults will escape to the field. It should also be appreciated that root-cause in-field diagnosis of computer systems can be more difficult than diagnosis in the development phase because:

• Diagnosticians have additional restrictions, such as time constraints and maintaining system integrity.

• The defect has escaped in-place development precautions such as testing and code reviews. This could imply either an inherent difficulty to reproduce or a faulty assumption about the customer environment.

• In our experience, fewer people are assigned to an in-situ critical situation than were involved in the development and test phases of a product life-cycle.

• It is likely that a customer-deployed computing system environment is much more complex than the environment in which the affected component was tested. In the extreme case, systems diagnosticians may have to trace root cause problems across a plethora of systems architectures and software solutions which span a plurality of vendors and products that the diagnostician may not have any special knowledge of directly.
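The representation of the current diagnostic state suggested earlier in this section (previously executed tests, the active hypothesis, and the reasons other hypotheses were excluded) could take a form like the following Python sketch. All class, field, and method names here are hypothetical; this is one possible shape for such a record under our own assumptions, not an existing IBM format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExecutedTest:
    """One diagnostic test that has already been run, and what it showed."""
    name: str
    result: str


@dataclass
class DiagnosticState:
    """Hypothetical hand-off record passed from one shift to the next."""
    chief_complaint: str
    executed_tests: List[ExecutedTest] = field(default_factory=list)
    active_hypothesis: Optional[str] = None
    remaining: List[str] = field(default_factory=list)      # priority order
    excluded: Dict[str, str] = field(default_factory=dict)  # hypothesis -> reason

    def record_test(self, name: str, result: str) -> None:
        self.executed_tests.append(ExecutedTest(name, result))

    def exclude(self, hypothesis: str, reason: str) -> None:
        """Rule a hypothesis out and keep the reason for the next crew."""
        if hypothesis in self.remaining:
            self.remaining.remove(hypothesis)
        if self.active_hypothesis == hypothesis:
            # Promote the next-highest-priority hypothesis, if any remain.
            self.active_hypothesis = self.remaining[0] if self.remaining else None
        self.excluded[hypothesis] = reason

    def summary(self) -> str:
        lines = [f"Chief complaint: {self.chief_complaint}",
                 f"Active hypothesis: {self.active_hypothesis or 'none'}"]
        lines += [f"Ran {t.name}: {t.result}" for t in self.executed_tests]
        lines += [f"Excluded {h}: {why}" for h, why in self.excluded.items()]
        return "\n".join(lines)


# Illustrative usage with made-up data.
state = DiagnosticState(
    chief_complaint="nightly batch job hangs",
    active_hypothesis="lock contention",
    remaining=["lock contention", "storage latency"],
)
state.record_test("lock trace", "no waiters observed")
state.exclude("lock contention", "lock trace shows no waiters")
```

Calling `state.summary()` yields a short plain-text hand-off note; the same record could later feed a post-mortem.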

3. SIMILARITIES WITH MEDICAL DIFFERENTIAL DIAGNOSIS For areas in which we are aligned with medical diagnosis, we should study their methods to see what methodologies we can implement. This section enumerates a non-exhaustive list of similarities. First, both humans and computing systems are complex, with interaction scenarios that are often not completely understood. Just as humans can have more than one underlying causal pathology for a symptom, complex computer systems can have different root causes for an error. Second, in both diagnostic disciplines, recreating a symptom is sometimes possible, though often at a different scale. For a human, recreation may occur in a lab at a cellular level using a physical sample from the patient, or in some cases the doctor may simply ask the patient to recreate their pain point. There is also a cost to the patient for recreation, though it manifests differently, through physical discomfort in contrast to a solely monetary cost. It is likely that in both cases the human and the system are not performing their intended service delivery during this time. Certainly neither patients nor customers want to be impacted by an outage, such as a hospital stay or time spent undergoing lengthy diagnostic tests. In both diagnostic disciplines, if the symptom does not occur when the patient is in the doctor’s office, it may be more difficult to form a diagnosis. Third, both medicine and IT have a hierarchy of specialists and generalists. Just as a general practitioner may send a patient to a specialist, the typical hierarchical information technology support structure moves the customer through increasing levels of expertise as warranted by the problem complexity and components affected. As with medicine, complex and high-severity problems (e.g. crit-sits) often involve multiple specialists to find the root cause.
Furthermore, medicine enumerates four approaches to organizing and prioritizing the differential diagnosis for a given problem [23]:

• Possibilistic – consider all known causes equally likely, and concurrently test for them.

• Probabilistic – consider first the disorders which are most likely (those with the highest pretest probability).

• Prognostic – consider the most serious diagnoses first.

• Pragmatic – consider the diseases most responsive to treatment first.

According to Scott, et al. [24], experienced physicians often integrate the latter three approaches when constructing a differential diagnosis. System diagnosticians implicitly take a similar approach when diagnosing faults, driven by the nature and severity of the reported problem. IBM has already done work towards the system itself acting as the diagnostician and implementing a pragmatic approach [25], with the intent to quickly and automatically treat the sick system before it degrades further. This pragmatic approach can be viewed as a form of error handling that may be performed concurrently with fault handling [26]. Finally, systems diagnosticians already practice a basic notion of triage, as evidenced by the hierarchical support structure.
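The integration of the probabilistic, prognostic, and pragmatic approaches can be sketched as a simple ordering of candidate faults. The weighted sum below is purely our illustrative assumption, as are the `Hypothesis` fields, the `order` function, and the example scores; neither the medical literature cited here nor any IBM tool is claimed to combine the approaches this way.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hypothesis:
    name: str
    likelihood: float    # estimated pretest probability, 0..1 (probabilistic)
    severity: float      # how serious the fault would be, 0..1 (prognostic)
    treatability: float  # how readily it can be fixed, 0..1 (pragmatic)


def order(hypotheses: Iterable[Hypothesis],
          w_prob: float = 1.0, w_prog: float = 1.0, w_prag: float = 1.0) -> List[Hypothesis]:
    """Sort hypotheses by a weighted blend of the three approaches.

    Setting two of the weights to zero reduces the ordering to a single
    approach, e.g. purely probabilistic or purely prognostic.
    """
    def score(h: Hypothesis) -> float:
        return w_prob * h.likelihood + w_prog * h.severity + w_prag * h.treatability

    return sorted(hypotheses, key=score, reverse=True)


# Made-up estimates for three candidate faults.
candidates = [
    Hypothesis("config_drift", likelihood=0.6, severity=0.2, treatability=0.9),
    Hypothesis("data_corruption", likelihood=0.1, severity=0.9, treatability=0.3),
    Hypothesis("driver_bug", likelihood=0.3, severity=0.5, treatability=0.2),
]

blended = [h.name for h in order(candidates)]
most_likely_first = [h.name for h in order(candidates, w_prog=0.0, w_prag=0.0)]
most_serious_first = [h.name for h in order(candidates, w_prob=0.0, w_prag=0.0)]
```

The three resulting orderings differ, which shows how the nature and severity of a reported problem could change which fault a diagnostician investigates first.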

4. DIFFERENCES WITH MEDICAL DIFFERENTIAL DIAGNOSIS Medical differential diagnosis and computer systems diagnosis are not always comparable. In fact, in some ways the two fields are quite divergent. We include a discussion of these areas for completeness. The first prevalent difference is that medical doctors are required to undertake a very specific form of education and are required to be licensed. In computing, classes are not taught on diagnosis as they are on programming or testing, the latter of which was also underrepresented in curricula until recent years. Computer systems are often diagnosed by people with substantially varied educational backgrounds, such as computer science, software engineering, mathematics, or information technology. Others have even more varied educational backgrounds. Medical practitioners are trained with a residency procedure which ensures they have time spent in the field under the mentorship of competent practitioners before practicing on their own. Systems diagnosticians operate under some interesting sociological pressures. It is rarely the case that a patient will blame a doctor or other medical professional for the illness they are suffering (with the notable but comparatively infrequent exception of malpractice). In general, medical practitioners do not cause the diseases and therefore have no cause to be defensive. In contrast, systems diagnosticians are often on the defensive end of a confrontation with an angry customer who equates the person solving the problem in the field with the developer or system designer who constructed the system which is not operating correctly. Systems diagnosticians are disadvantaged because they do not yet have the mature, comprehensive symptom databases that the medical field has compiled over decades of research and collaborative effort. While some databases exist, such as RETAIN within IBM, they do not yet take advantage of classifications of errors, faults, and their relationships.
Another critical variation between disciplines is that the data which we as computer professionals obtain when diagnosing systems software is not as consumable. If only we in the computer systems diagnostics area could create documentation and notation standards similar to those well established and enforced in the field of medicine [27]. IT has an inherent advantage over medicine with regard to record-keeping. Much effort has been made in the medical field to convert hardcopy records to digital format. Our diagnostic records are already soft-copy, though turning that data into consumable knowledge is often cumbersome. A poignant example of the differences between diagnostic domains is that virtually all medical practitioners are familiar with MRI, X-ray, and common blood tests. Their profession enforces a certain base level of understanding of the available tools and data that can be produced in differing circumstances and with different levels of invasiveness and complexity. We as systems diagnosticians need a comparable set of tools that are common across computing systems and environments. At minimum, tools with common naming conventions would be beneficial, even if the output or implementation varied from system to system. Imagine having a common lightweight dynamic tracing infrastructure such as Sun Microsystems' DTrace [28] common to all computing platforms. Unlike medical patients, who may choose another doctor to resolve the problem at any time if they aren't getting good care, enterprise software customers cannot easily go to some other vendor for immediate care. This is due to a variety of factors (mainly a lack of mission critical understanding of heterogeneous complex systems spanning multiple vendors as well as nonstandard means of documenting root cause analysis). Usually, an enterprise customer’s only option is to work with the vendor whose product is causing the immediate issues, until resolution. As a consequence, they may then opt to move off that vendor’s platform after the immediate needs have been met.
This is often at great additional expense to the customer, and certainly at a cost to the vendor who lost a customer. Additionally, doctors routinely schedule a follow-up, even after diagnosis and treatment are confirmed. Not all computer systems diagnosticians revalidate a system’s health after a fix has been confirmed. It is possible that there is more to be learned by performing such a simple action. The notion of a preventive checkup is well established in the general medical industry. In computer systems, the individuals responsible for any preventive maintenance (if done at all) are distinct from the group of in-field systems diagnosticians who are dispatched when critical customer situations occur. Some computing platforms, software, and operating systems do have the notion of a software-implemented checkup, e.g. Healthchecker [29]. Other computing platforms, software, and operating systems may implement self-diagnostics and preventive analytics [30], but there is little systemic view of the health of an enterprise computing environment as a cohesive single system or entity. An area where these fields are different but perhaps converging is in preparation for diagnosis. Some doctors are embracing technology to help streamline their diagnostic process by suggesting patients utilize websites [31] that provide preparation checklists for specific appointments or symptoms. Many of our IT components document the data to collect before reporting a problem to IBM, if that data was not already collected as part of first failure data capture.

Given that we have seen examples of differences and similarities between the computer and medical diagnostic practices, how can we reconcile this information into something that can help us be more proficient at in-field diagnosis?

what you were doing when the pain occurred or how much pain you are experiencing. Similarly, we should adopt the generalized approach, invoking Occam's razor, tempered with our knowledge of Hickam’s dictum to understand the customer’s chief complaint. In essence, we should recognize the possibility that a customer is experiencing more than one error.

5. META-METHODOLOGY FOR DIAGNOSIS OF SYSTEMS ERRORS In any discipline, diagnostic decision trees can find a solution only for a previously-discovered fault. Because we have already shown that diagnosis is the iterative art of knowing what questions to ask, data to collect, and tests to run, we now look to see how precisely what we can learn from medicine and the current state of diagnostic research in computer systems. We have synthesized a meta-methodology intended as a generalized starting point from which specialists can generalize their own best practices and share commonalities among other areas. It is modeled after the technique of differential diagnosis, which is the process of weighing the probability of one disease versus that of other diseases, combined with the process of elimination, to eventually determine the patient's illness. Our proposed meta-methodology is additionally influenced by the Oslerian ideal by using classifications. Our meta-methodology distinguishes between classes of errors and faults just as medicine does with symptoms and pathologies, respectively. It encapsulates finding root cause, as well as documenting what failed for purposes of both educating others and efficiently communicating the findings. We will demonstrate parallels between our meta-methodology and some of the current research into problem diagnosis.

5.1 Information Collection Before we can diagnose a fault, we must understand the customer’s environment. This involves collecting information about the system environment, along with information about the error. In practice, the two steps below may be performed concurrently.

5.1.1 Collect History Just as a doctor takes your pulse, blood pressure, and family history, we collect operating system, software, and hardware component data such as version levels, maintenance levels, change control levels and current logging levels along with corresponding log data. We might also inquire about how the component normally behaves, i.e. its “homeostasis.” In this phase, we collect everything that helps us understand the operating environment before performing any diagnostic tests. Remember that during history collection, a medical practitioner is often simultaneously conducting a physical examination of a patient, while studying the patient's existing medical record, and routinely asks further questions to probe the nature of the illness, and potentially rule out some conditions.

5.1.2 Identify Errors (Symptoms) In practice, during computer systems diagnosis, the customer contacts us to report one or more errors. Computer systems diagnosticians need to identify these errors by understanding the type of error, the impact (severity level) to the customer, and what activity was being performed when the error occurred. This aligns with medical diagnosis in that a doctor might ask you

5.1.3 Type of Error Once a doctor has performed some basic level of triage and conducted a cursory interview with a patient, the doctor will begin to classify the observed symptoms. This includes discerning which anatomical area is affected, and the degree of deviation from homeostasis in major physiological subsystems. In computer systems diagnosis, for each reported error the diagnostician will identify the affected components. This may be seen as a parallel to determining which anatomical components are at fault: for example, whether hardware is affected, or which major software components are suffering (similar to observing a cardiac or pulmonary issue). Experts in these components may be called upon to help identify the type of error that is occurring for the symptom. Just as doctors have done work to classify the types of issues that cause malaise to the human body, much work has been done to classify errors in computer systems [32] [33]. The research by Avizienis et al. [34] is based on a notion of service delivery, which is akin to normal body operation. Their classification of faults in computer systems is based on deviation from that service delivery (i.e. no longer maintaining our medical notion of homeostasis). That deviation is identified as an error. Note that the error, analogous to a symptom, represents the indication to the customer that there is a problem. We enumerate a subset of the referenced errors for purposes of demonstration:

• Abend/Crash/Outage - System or application failure.
• Hang/Wait/Loop - System or application is hung, in a wait state, or is looping.
• Error Message - An error message is generated to the user and/or to the product/system log.
• Unexpected Results - Function does not complete successfully, resulting in incorrect output, corrupt data, or data which does not adhere to specification; however, no error message is observed.
• Performance - System or application does not meet expectations in terms of processing speed.

These are generic error types, and can be modeled into more granular subtypes by specific components, as appropriate. In medical pathology, there is a similar notion of classifying disorders into families, and then into further subtypes such as the strain of a virus. It should be noted that at this stage we are merely identifying broad classifications of observable or reported errors which will be used as input into our diagnostic decision making.

5.1.4 Impact of Error (Pain Level of Symptom) The pain level of the patient helps direct which approach (e.g. probabilistic, pragmatic) of differential diagnosis is pursued. For example, a patient in severe pain may need one or more treatments before a diagnosis is actually made. Pain scales are based on self-report and observational data [35]. The severity of the error is subjective, and this should be remembered when selecting which tests should be performed. During diagnosis of in-situ computing environments, this self-report may influence a diagnostician’s ability to act, as typically the ability to make changes and run tests is dictated by the customer. For systems diagnosticians, treatment of a symptom is analogous to a workaround. We get the system running again, and subsequently focus on diagnosis. Therefore, it is important to get customer feedback about the impact of the failed service delivery. Specifically:

• What is the severity of the problem?
• Was there a system outage? (i.e. service delivery failure caused the system to go down or not recover.)
• Is a workaround needed?
• What is the priority of the customer? In reality, some errors may need to receive higher priority, especially in cases where a particular customer has been subjected to a number of unrelated errors.

5.1.4.1 Activity during Discovery Just as a doctor will often ask a patient what they were doing when the symptom first occurred, we need to inquire about the activity being performed when the error presented itself, understanding also that the error could be related to an unidentified previous activity. OPC [36] has also done work to identify customer activities. A subset of these includes:

• Planning - Customer is gathering information about the product or system.
• System Administration/Configuration - Customer is administering or configuring the system or product.
• Installation - Customer is installing or uninstalling the product and encounters a problem.
• Migration/Upgrade - Customer is upgrading or applying a service upgrade and encounters a problem.
• Other HW/SW installation/change - Customer is installing or upgrading another IBM or OEM hardware/software product and encounters a problem that does not appear to be caused by that product.
• Product running then problem occurs - Customer has installed, configured, deployed and now has product running when a problem is encountered.
• Attempting to Run Diagnostic Aids - Customer encountered problem when attempting to diagnose or gather information to resolve a hardware or software issue.

See Figure 1 for a UML-modeled representation of an error class.

Figure 1 – UML Class Representation of Errors, Faults and Diagnostic Tests
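To make the error portion of the model concrete, the following is a minimal Python sketch of an error record built from the error types, impact questions, and customer activities enumerated above. It is not the class from Figure 1; the names `ErrorReport`, `severity`, and `needs_treatment_first`, and the numeric severity scale, are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    ABEND_CRASH_OUTAGE = "Abend/Crash/Outage"
    HANG_WAIT_LOOP = "Hang/Wait/Loop"
    ERROR_MESSAGE = "Error Message"
    UNEXPECTED_RESULTS = "Unexpected Results"
    PERFORMANCE = "Performance"


class Activity(Enum):
    PLANNING = "Planning"
    ADMIN_CONFIG = "System Administration/Configuration"
    INSTALLATION = "Installation"
    MIGRATION_UPGRADE = "Migration/Upgrade"
    OTHER_CHANGE = "Other HW/SW installation/change"
    RUNNING = "Product running then problem occurs"
    DIAGNOSTIC_AIDS = "Attempting to Run Diagnostic Aids"


@dataclass
class ErrorReport:
    component: str            # affected component, as reported
    error_type: ErrorType     # the symptom class
    activity: Activity        # what the customer was doing at discovery
    severity: int             # hypothetical scale: 1 = most severe
    system_outage: bool
    workaround_needed: bool

    def needs_treatment_first(self) -> bool:
        """True when a workaround should precede further diagnosis."""
        return self.system_outage or self.workaround_needed
```

A record of this kind could be filled in during information collection and handed to whichever expert joins the diagnosis next.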

5.2 Diagnosis

5.2.1 Hypothesis Generation Once the previous inputs to the diagnostic decision making functions have been defined, we can hypothesize about which fault is causing the error. Recalling our medical introduction earlier, we know that doctors believe that an underlying pathology causes a symptom. In a similar fashion, Avizienis’ work [37] notes that errors (read: symptoms) are caused by some underlying fault, much in the way that the Oslerian ideal has shown us that some medical cause underlies maladies in human patients. Typically during this phase, computer systems diagnosticians will hypothesize about possible faults. In addition to identifying the component suspected of owning the fault, we usually implicitly assign a fault type. Work has been done to classify fault types, such as by Avizienis and in ODC research by IBM [38], laying a groundwork for computing pathology. A subset of the referenced ODC generic fault types includes:

• Configuration
• Code
• Design
• Documentation

These faults can also be qualified with attributes such as “missing”, “incorrect”, or “extraneous.” Additionally, as modeled after medicine, each fault could have associated with it the following information:

• Priority – indicates when to attempt to confirm or exclude this fault.
• Probability – the likelihood that this is the fault causing the error.
• State – whether, during or after diagnostic tests, this fault has been confirmed or excluded, or is in the process of being confirmed or excluded.

Faults and their relationship to errors are modeled in Figure 1. Additionally, for each fault, we may identify one or more workarounds.
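The fault attributes described in this section can be sketched as a small Python data structure. This is an illustrative assumption rather than the class from Figure 1: the field names, the convention that a lower priority number is tested sooner, and the helper `open_faults` are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FaultType(Enum):
    CONFIGURATION = "Configuration"
    CODE = "Code"
    DESIGN = "Design"
    DOCUMENTATION = "Documentation"


class Qualifier(Enum):
    MISSING = "missing"
    INCORRECT = "incorrect"
    EXTRANEOUS = "extraneous"


class State(Enum):
    IN_PROCESS = "in process"
    CONFIRMED = "confirmed"
    EXCLUDED = "excluded"


@dataclass
class Fault:
    component: str
    fault_type: FaultType
    qualifier: Qualifier
    priority: int             # assumed convention: lower number = test sooner
    probability: float        # 0.0 .. 1.0, estimated by the diagnostician
    state: State = State.IN_PROCESS
    workarounds: List[str] = field(default_factory=list)


def open_faults(faults):
    """Return faults still being confirmed or excluded, in priority order."""
    pending = [f for f in faults if f.state is State.IN_PROCESS]
    return sorted(pending, key=lambda f: f.priority)
```

Keeping the state and workarounds on the fault itself means the list of hypotheses doubles as a record of what has already been ruled out.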

5.2.2 Selection of a Working Hypothesis Now that a hypothesized list of faults exists, consider which are most probable, severe, or responsive to treatment. If a problem is severely impacting the customer or, in contrast, has a straightforward and non-disruptive treatment, one may want to begin treatment before pursuing a diagnosis. In other words, consider the prognostic and pragmatic approaches instead of solely a probabilistic approach. When taking a probabilistic approach, we should identify a “leading hypothesis.” In medicine, this is a doctor’s single best explanation for a patient’s clinical problem(s) and is also known as a “working diagnosis” [39]. Additional diagnoses known as “active alternatives” may be worth considering later depending on their likelihood, seriousness if untreated, or probable response to treatment. “Other hypotheses” may also be identified: those that seem unlikely but are retained in case the initial hypotheses are ruled out or further data collection points to them. Finally, once a hypothesis is disproved, it is considered an “excluded hypothesis.” As with medicine, when ranking hypotheses, one should supply the evidence (rationale) for the ranking [40] and preserve it for posterity in some artifact. As with medicine, if a test can definitively confirm a hypothesis, then alternate hypotheses do not need to be considered. Otherwise, alternate hypotheses should be considered. Just as the probability of correctly diagnosing a particular disease depends on the experience of the attending doctor (the memory of what he or she has seen before), fault probability determination relies on the experience of the system diagnostician. This is not only prone to error, but also unavailable to somebody new to the field. With the exception of the RETAIN database, we do not have statistics about prior faults to work with, as evidence-based medicine does [41].

An additional concern regarding symptoms is the disconcerting question, "When is a symptom not a symptom?" Sometimes things that seem important at the onset of a diagnostic interview turn out to be a false path. One study concluded that up to 25 percent of a system administrator’s time may be spent following blind alleys suggested by poorly constructed and unclear messages [42]. In medicine, a symptom may stem from a previous injury or illness and have little bearing on the current cause of a patient’s distress. Likewise, when diagnosing computer systems in the field, certain error indicators may lead you down a false path.
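The four hypothesis categories borrowed from medicine (leading, active alternatives, other, excluded) can be illustrated with a short Python sketch. It is a simplification under stated assumptions: ranking here uses probability alone, whereas the text also weighs seriousness and response to treatment, and the `alternative_threshold` cut-off is a hypothetical value, not one proposed in this paper.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    fault: str
    probability: float    # diagnostician's estimate, 0.0 .. 1.0
    rationale: str        # evidence for the ranking, kept for posterity
    excluded: bool = False


def rank(hypotheses, alternative_threshold=0.2):
    """Split hypotheses into the four categories described in the text.

    alternative_threshold is an assumed cut-off used only for illustration.
    """
    live = sorted((h for h in hypotheses if not h.excluded),
                  key=lambda h: h.probability, reverse=True)
    return {
        "leading": live[:1],
        "active_alternatives": [h for h in live[1:]
                                if h.probability >= alternative_threshold],
        "other": [h for h in live[1:]
                  if h.probability < alternative_threshold],
        "excluded": [h for h in hypotheses if h.excluded],
    }
```

Because each hypothesis carries its rationale, the ranked result can be stored as the artifact that preserves the evidence behind the ranking.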

Lastly, when ranking your list of suspected faults, some attention should be paid to how difficult and how costly it is to perform a hypothesis verification test.

5.2.3 Diagnostic Test Selection A system diagnostician’s objective is to gather data (test results) that confirm or exclude a hypothesis. For each of the fault hypotheses to pursue, identify tests that can fully or partially prove each hypothesis or exclude it.

To decide which tests to pursue for in-situ diagnosis, consider:

• Location of the test – Is it to be run at the customer site or can it be recreated in a laboratory setting?
• Invasiveness of the test – If the test is invasive and can only be performed at the customer site, you may want to consider one or more alternate tests.
• Precision Rate – This represents the probability or likelihood that the test can confirm or exclude a fault hypothesis.
• Time constraints – A laboratory recreation may take time to set up. A customer might be willing to run tests at their own site, possibly on their own production systems, for a quicker diagnosis.
• Resource constraints – For complex environments, we may not have the human or system resources to perform an in-lab recreate. As an example, we often do not perform integration testing with competing vendor products.
• Cost – Each of the above factors has an associated cost to IBM or the customer.

The above factors are represented in a diagnostic test class in Figure 1.

5.2.3.1 Location of Test

Testing at Customer Site Examples of non-invasive tests performed at the customer site involve further data collection, such as verifying the latest service is applied for relevant components, determining if recent service was applied that may have introduced or exposed a fault, verifying the customer’s configuration, and identifying if recent configuration changes were made.

We also may ask the customer to send along First Failure Data Capture, if it exists for the affected components. If the above data does not lead to a diagnosis, we may ask for more invasive tests. At this point we often consider the possibility of a local laboratory recreation, rather than impacting the customer production deployment.

Testing at the Lab Diagnostic testing in a lab typically includes checking if a known fix already exists, verifying documentation, and ensuring that the design of the computing system aligns with the customer’s expected and actual usage. In cases where local lab recreation of an error is possible, the next logical step is to verify the code is behaving as it should and the configuration is correct. Laboratory recreations may have a significant system and resource cost to the laboratory providing the testing. This is most apparent when the recreation involves expensive systems and is performed at scale. However, laboratory recreations allow more flexibility in the diagnostic tests which may be run. It should also be appreciated that testing in the lab while performing concurrent diagnostics in the field is a means to provide expedient resolution at the expense of substantial cost (via two teams operating in parallel).

5.2.3.2 Invasiveness

When considering an examination (test) in medicine or in computer systems diagnosis, the practitioner often attempts to select the least invasive procedure possible. While medical terminology defines levels of invasiveness, no such analogous definitions have been put forth in the computer systems problem domain. In addition, the subjective nature of the environment in which the component is operating makes judgment calls difficult. This is where in-situ diagnosis has an effect, as invasiveness is subjective to the customer. Get customer feedback about tests which will be permitted.

Non-Invasive Medical procedures are strictly defined as non-invasive when no break in the skin is created [43]. For example, an examination of the ear-drum falls outside the strict definition of "non-invasive procedure". For the IT industry, non-invasive tests should not be a function of the customer’s interpretation and should instead be standardized to mean any tracing, output, or logging which is enabled by default at installation time. Any vendor-supplied default configuration settings that were disabled by the customer are considered non-invasive when re-enabled.

Marginally Invasive In medicine, this does not include procedures such as a biopsy, which do not require an open incision. We define marginally invasive to mean the run-time activation of latent function which may enable additional "lightweight" tracing and/or logging. The performance overhead of such additional requested function should be measured in laboratory settings. The increase in overhead should be minimal, e.g. no more than 10% over baseline testing. In addition, this testing should be strictly defined as a run-time reconfiguration that requires no stopping or restarting of services or processes.

Invasive In medicine, invasive is defined as using incision or insertion of an instrument into a living body [44]. Exploratory surgery is an extreme example of an invasive diagnostic procedure. It is used as a last resort when other symptoms indicate an underlying disease that cannot be detected by radiology or other noninvasive tests. In computing, invasiveness is analogous to binary replacement or altering the environment in such a way that requires restarting of applications. In addition, invasive may mean enabling run-time diagnostics which consume, or can be reasonably expected to consume, more than 10% overhead compared to the baseline.

Use an appropriate tool for the stage of diagnosis. Use noninvasive technologies when possible (lightweight instrumentation before heavy tracing or other solutions that impact normal operation). When you have to cut (i.e. be invasive), have a plan. If you attempt haphazard surgery with an imprecise tool, you will likely kill the patient. Figure 2 shows examples of varying levels of invasiveness.

[Figure 2 content, arranged from less invasive to more invasive. Lab-run tests: check if a fix already exists; verify documentation; verify design; if lab recreate possible, verify code. Customer-run tests: latest service applied?; recent service applied?; configuration ok?; recent configuration changes?; request FFDC. If recreate feasible…: enable logging, tracing, debug.]

Figure 2 - Invasiveness Considerations
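The invasiveness definitions and test-selection factors above lend themselves to a small Python sketch. It is one possible reading, not a standard: the function and field names are hypothetical, default-enabled settings are treated as non-invasive first (following the definition given above), and the tie-breaking order in `pick_next` (invasiveness, then precision, then cost) is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Invasiveness(IntEnum):
    NON_INVASIVE = 0
    MARGINALLY_INVASIVE = 1
    INVASIVE = 2


def classify(enabled_by_default: bool, needs_restart: bool,
             overhead_pct: float) -> Invasiveness:
    """Apply the section's definitions; overhead_pct is measured in the lab."""
    if enabled_by_default:
        return Invasiveness.NON_INVASIVE
    if needs_restart or overhead_pct > 10.0:
        return Invasiveness.INVASIVE
    return Invasiveness.MARGINALLY_INVASIVE


@dataclass
class DiagnosticTest:
    name: str
    at_customer_site: bool
    invasiveness: Invasiveness
    precision: float    # probability the test confirms or excludes the fault
    hours: float        # time needed to set up and run
    cost: float         # hypothetical cost units


def pick_next(tests, customer_limit: Invasiveness):
    """Least invasive permitted test first; ties broken by precision, then cost."""
    allowed = [t for t in tests
               if not t.at_customer_site or t.invasiveness <= customer_limit]
    return min(allowed, key=lambda t: (t.invasiveness, -t.precision, t.cost),
               default=None)
```

Here `customer_limit` stands for the customer feedback about which tests will be permitted on site; lab-run tests are not restricted by it.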

5.2.4 Run Diagnostic Tests Test results may lead to identification of additional fault hypotheses, requiring additional diagnostic tests. If the probability of the leading hypothesis is high enough, consider whether or not more tests are needed. Figure 3 represents the described meta-methodology, which can be adapted for specific areas.
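The iterative character of this step can be sketched in Python as a loop that tests the leading hypothesis, folds in any newly identified hypotheses, and stops once the leading hypothesis is probable enough. This is an illustration only: `run_test` is a hypothetical callback standing in for a real diagnostic test, and the thresholds and round limit are assumed values.

```python
def diagnose(hypotheses, run_test, confirm_at=0.95, exclude_at=0.05,
             max_rounds=20):
    """Iterate until one hypothesis is probable enough or none remain.

    hypotheses: dict mapping fault name -> estimated probability.
    run_test:   callable(fault, live) -> dict of updated or newly
                discovered hypotheses (a stand-in for a diagnostic test).
    Returns (confirmed fault or None, list of excluded faults).
    """
    live = dict(hypotheses)
    excluded = []
    for _ in range(max_rounds):
        if not live:
            break
        leading = max(live, key=live.get)
        if live[leading] >= confirm_at:
            return leading, excluded           # no further tests needed
        live.update(run_test(leading, live))   # may add new hypotheses
        for fault in [f for f, p in live.items() if p <= exclude_at]:
            excluded.append(fault)             # keep a record of what was ruled out
            del live[fault]
    return None, excluded
```

Returning the excluded faults alongside the result keeps the diagnostic state communicable to experts who join later.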

6. FUTURE WORK This section intends to outline some of the work that we feel would benefit in-field systems diagnosis. While not an exhaustive list, we have attempted to describe some of the more interesting possibilities for related research and development.

6.1 An Application Embodiment The meta-methodology proposed here is a high level framework set to guide in-field diagnostic practices (specifically for those operating in a critical situation type environment). After some socialization of these concepts, it is entirely possible to build software to help manage and maintain a specific methodology embodiment (and its derivatives). We envision an application that is both aware of the high level meta-methodology and knows about situationally specific, expert crowd-sourced derivatives which are tailored to concrete hardware environments and software solutions. We feel that such an application should include some context awareness functionality to perform searches for subject matter experts or contributors to the meta-methodology derivative.

6.2 Cloud Computing and Systems Differential Diagnosis Each of the issues we have discussed with respect to in-field diagnosis becomes compounded when portions of an enterprise architecture are hosted in the cloud. It is our position that diagnosing systemic issues in this setting is akin to trying to perform a patient diagnosis when you do not have access to the patient. Diagnosis of complex systems will become more challenging with cloud computing. As a cloud consumer, one may only diagnose up to the borders of the cloud, as the cloud is by definition a black box. From the provider side of cloud computing, diagnosing issues may be more difficult because it is precisely the consumer’s interaction with your cloud offering which causes the error (for instance, the particular data they send into your servers at a particular time, which may not be easily replicated stand-alone).

[Figure 3 labels: Lab (IBM); Customer]

Figure 3. Meta-Methodology for In-field Systems Diagnosis

6.3 Advances for a Common Framework A common framework for diagnostic terminology and methodology could help advance the science of diagnosis. An IBM Academy study in collaboration with experts of differential diagnosis, perhaps from the Mayo clinic, could help advance this effort.

6.4 Tools and Methods for Diagnosis Can we take the best of breed diagnostic tools and standardize them across platforms? Clearly it would be in the best interest of the in-field systems diagnostician to convince the manufacturers of computer systems to select best of breed diagnostic tools for standardization across platforms. Open source code gives the in-field diagnostician more flexibility and power in some situations. While we realize that it will not always be practical for source code to be made freely available, we do think that it offers a clear advantage for in-field systems diagnosis. An empirical study comparing the effectiveness of in-field debuggers operating in comparable environments consisting of open source software versus closed source, black box software may prove interesting. Doctors are turning towards more DNA-based scans and diagnostics, and although software and hardware systems do not have DNA, there is a very evident analogy in source code. We also advocate a common educational process for in-field systems diagnosticians. Modeling it after the medical residency program could set a level of foundational competency across vendors and platforms.

7. CONCLUSION We have shown that computer science and medical diagnosticians have some commonalities and some substantial differences. We have illuminated how the computer diagnostic industry could learn from the situational experience and methodology of differential diagnosis through the application of our meta-methodology (and domain specific derivatives). We have demonstrated a need for common terminology and thought process when performing differential diagnosis of computer systems in-situ. We hope we have also made the case for the enhancement and standardization of: non-invasive diagnostic tools; modern, flexible, consumable, and interchangeable diagnostic records; and common educational training for use by in-situ diagnosticians.

8. ACKNOWLEDGEMENTS Thanks to the Test and Integration Center for Linux Virtual Server Platform Evaluation Test Department for allowing me the time to work on this paper. Special thanks to the IBM Academy of Technology for providing a venue for the discussion of problem prediction, avoidance, and diagnosis. For reviews and/or advice we thank James Caffrey, Shannon Farr O.D., Scott Loveland, Brent Miller, Geoffrey Miller, William Scott, Lisa Spainhower, David Thornley, and Jessie Yu.

REFERENCES [5] Online Etymology Dictionary. Definition of Diagnosis. Nov. 2001. [Online]. Available: http://www.etymonline.com [6] MedicineNet.com. June 2002. Definition of Differential Diagnosis. [Online]. Available: http://www.medterms.com/script/main/art.asp?articlekey=2991 [7] Harold C. Sox, Marshall A. Blatt, Michael C. Higgins, Keith I. Marton, M.D., Medical Decision Making. Philadelphia: American College of Physicians, 2007. p. 25. [8] French, Herbert. Kinirons, Mark T. Ellis, Harold. French's index of differential diagnosis. 14 ed. London: Hodder Arnold, 2005.

[9] Porter, Robert. The Merck Manual of Patient Symptoms : a Concise, Practical Guide to Etiology, Evaluation, and Treatment. Whitehouse Station, NJ: Merck Research Laboratories, 2008. [10] French, Herbert. Kinirons, Mark T. Ellis, Harold. French's index of differential diagnosis. 14 ed. London : Hodder Arnold, 2005. [11] Jungwoo Ha , Christopher J. Rossbach , Jason V. Davis , Indrajit Roy , Hany E. Ramadan , Donald E. Porter , David L. Chen , Emmett Witchel, Improved error reporting for software that uses black-box components, Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, June 10-13, 2007, San Diego, California, USA [12] Merriam-Webster Medline Plus. Definition of anatomy. 2005. [Online]. Available: http://www2.merriamwebster.com/cgi-bin/mwmednlm [13] Merriam-Webster Medline Plus. Definition of physiology. 2005. [Online]. Available: http://www2.merriamwebster.com/cgi-bin/mwmednlm [14] Merriam-Webster Medline Plus. Definition of pathology. 2005. [Online]. Available: http://www2.merriamwebster.com/cgi-bin/mwmednlm [15] United States. National Library of Medicine. Homeostasis. Jan. 1999. [Online]. Available: http://www.nlm.nih.gov/mesh/2009/mesh_browser/MBrow ser.html [16] T. Tang, G. Zheng, Y. Huang, G. Shu, P. Wang,, "A Comparative Study of Medical Data Classification Methods Based on Decision Tree and System Reconstruction Analysis", IEMS Vol. 4, No. 1, pp. 102-108, June 2005. [17] MedicineNet..com. Feb. 2004. Definition of Occam's razor. [Online]. Available: http://www.medterms.com/script/main/art.asp?articlekey=2 6739 [18] Bio--Medicine.org. Feb. 2004. Hickam's dictum. [Online]. Available: http://www.bio-medicine.org/medicinedefinition/Hickams_dictum/ [19] Bio-Medicine.org. Feb. 2004. Hickam's dictum. [Online]. Available: http://www.bio-medicine.org/medicinedefinition/Hickams_dictum/ [20] Bio--Medicine.org. Feb. 2004. Hickam's dictum. [Online]. 
Available: http://www.bio-medicine.org/medicinedefinition/Hickams_dictum/ [21] Bio--Medicine.org. Feb. 2004. Hickam's dictum. [Online]. Available: http://www.bio-medicine.org/medicinedefinition/Hickams_dictum/ [22] Geyman, John P., M.D. “The Oslerian Tradition and Changing Medical Education: A Reappraisal”, Western Journal of Medecine 1983 June; 138(6): 884–888.

[25] International Business Machines Corporation. ODC Guide. [Online]. http://w3-03.ibm.com/qse/page/2574 [26] A. Avizienis,, J.-C. Laprie, B. Randell and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, p. 14, Jan.-Mar. 2004. [27] Stern, Scott D. C., Diane Altkorn, and Adam Cifu. Symptom to Diagnosis (Lange). New York: McGraw-Hill Medical, 2005. p. 2. [28] Stern, Scott D. C., Diane Altkorn, and Adam Cifu. Symptom to Diagnosis (Lange). New York: McGraw-Hill Medical, 2005. p. 2. [29] Caffrey,, J.M., The resiliency challenge presented by soft failure incidents, Continuously Available Systems Infrastructures, IBM Systems Journal 47, No. 4 (Oct 2008)) [30] A. Avizienis,, J.-C. Laprie, B. Randell and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, p. 14, Jan.-Mar. 2004. [31] "Questionable hospital chart documentation practices by physicians." J Gen Intern Med. 2008 Nov;23(11):1865-70. Epub 2008 Aug 27. - Sharma R, Kostis WJ, Wilson AC, Cosgrove NM, Hassett AL, Moreyra AE, Delnevo CD, Kostis JB. [32] Sun Microsystems, Inc. An Overview of DTrace. 2009. [Online]. Available:http://opensolaris.org/os/community/dtrace/. [33] International Business Machines Corporation. (2009, April). z/OS V1R10.0 Problem Management. [Online]. [34] Available: http://publibz.boulder.ibm.com/cgibin/bookmgr_OS390/Shelves/EZ2ZBK0G [35] International Business Machines Corporation. OA27165: NEW FUNCTION - PREDICTIVE FAILURE ANALYZER. Apr. 2009. [Online]. Available: http://www01.ibm.com/support/docview.wss?uid=isg1OA27165 [36] WebMD. Healthwise Self-Care Checklist. June 2007. [Online]. Available: http://www.webmd.com/a-to-zguides/healthwise-self-care-checklist-step-1-observe-theproblem [37] A. Avizienis, J.-C. Laprie, B. Randell and C. 
Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, p. 14, Jan.-Mar. 2004. [38] International Business Machines Corporation. Orthogonal Problem Classification Collection Tool Schema. Version 6.0. Oct. 2008 .[Online]. Available: http://w3.cqat.ibm.com/uxd/gendl.jsp?tab=CY19271,CY17 969

[23] Computer History Museum. [Online]. Available: http://www.computerhistory.org/babbage/history/

[39] United States. National Institutes of Health. NIH Pain Consortium. Pain Intensity Scales. Jan. 2007. [Online]. Available: http://painconsortium.nih.gov/pain_scales/index.html

[24] sciencemuseum.org [Online]. Available: http://www.sciencemuseum.org.uk/broughttolife/timeline.a spx

[40] International Business Machines Corporation. Orthogonal Problem Classification Collection Tool Schema. Version 6.0. Oct. 2008. [Online]. Available: http://w3.cqat.ibm.com/uxd/gendl.jsp?tab=CY19271,CY17969 [41] A. Avizienis, J.-C. Laprie, B. Randell and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, p. 14, Jan.-Mar. 2004. [42] International Business Machines Corporation. ODC Guide. [Online]. Available: http://w3-03.ibm.com/qse/page/2574 [43] W. Scott Richardson, Mark C. Wilson, Gordon H. Guyatt, Deborah J. Cook, James Nishikawa, and the Evidence Based Medicine Working Group. How to Use an Article About Disease Probability for Differential Diagnosis. Aug. 2007. Centre for Health Evidence. [44] Harold C. Sox, Marshall A. Blatt, Michael C. Higgins, Keith I. Marton, M.D., Medical Decision Making. Philadelphia: American College of Physicians, 2007. p. 19

[45] W. Scott Richardson, Mark C. Wilson, Gordon H. Guyatt, Deborah J. Cook, James Nishikawa, and the Evidence Based Medicine Working Group. How to Use an Article About Disease Probability for Differential Diagnosis. Aug. 2007. Centre for Health Evidence. [46] R. Barrett, E. Haber, E. Kandogan, P. P. Maglio, M. Prabaker, and L. A. Takayama. Field studies of computer system administrators: Analysis of system management tools and practices. In ACM CSCW (Computer-supported Cooperative Work), 2004. [47] The American Heritage® Medical Dictionary. Houghton Mifflin Company. 2007, [Online]. Available: http://medicaldictionary.thefreedictionary.com/noninvasive [48] Merriam-Webster Medline Plus. Definition of invasive. 2009. [Online]. Available: www2.merriamwebster.com/cgibin/mwmednlm?book=Medical&va=invasive.
