PROBABILISTIC FAULT MANAGEMENT IN DISTRIBUTED SYSTEMS

DISSERTATION submitted for the academic degree of Doktor-Ingenieur (Dr.-Ing.) at the Faculty of Mathematics and Computer Science of the FernUniversität in Hagen

by JIANGUO DING

Hagen, Germany 2007

Submitted: March 5, 2007
Date of oral examination: September 23, 2008
First reviewer: Prof. Dr.-Ing. Bernd J. Krämer
Second reviewer: Prof. Dr.-Ing. Herwig Unger


Acknowledgements

I would like to acknowledge my supervisors, Prof. Bernd J. Krämer, Prof. Hansheng Chen and Prof. Yingcai Bai. To Prof. Krämer I am truly grateful for the opportunity to pursue my research under his supervision. I am even more impressed by his detailed help with my research proposal and my paper writing, and especially by his encouragement and assistance in publishing the research results. I sincerely thank Prof. Hansheng Chen for all his help during my studies and for his recommendation, which gave me the opportunity to do my thesis research at the FernUniversität in Hagen. To Prof. Yingcai Bai I am grateful for his guidance and support throughout my studies at Shanghai Jiao Tong University.

I would like to express my thanks to the DAAD (the German Academic Exchange Service). This research would not have been possible without its financial support.

Many thanks are due to Mr. Carsten Schippang for intensive discussions on distributed systems management and for providing the sample management data of the campus network of the FernUniversität in Hagen for the year 2003. I thank Ms. Renate Zielinski for her extensive work on the text editing of this book.

I express my acknowledgements to Prof. Firoz Kaderali, Prof. Wolfgang A. Halang, Dr. Klaus Gotthardt, Jun.-Prof. Dr. Jens Krinke, Prof. Guanrong Chen, Mr. Hans-Friedrich Kötter, Dr. Eugen Grycko, Prof. Otto Moeschlin, Dr. Shanqing Gu, Mr. Quankai Lee, Ms. Elfriede Seitter, Dr. Liu Du, Dr. Ningjiang Chen, Dr. Zhong Li, Prof. Weinong Wang, Prof. Jiabin Li, Prof. Franco Davoli, Dr. Lihong Ma, Mr. Stephan Wunderer, Dr. Ramesh Bharadwaj, Dr. Yuhong Liu, Dr. Dongxi Liu, Mr. Shahriar Fakher, Ms. Gerda Wessel, Mr. Marc Jelitto, Ms. Xia Wang, Mr. Thorsten Blum, Ms. Shourong Lu, Dr. Ping Li, Dr. Wei Zhang, Mr. Leonard Kwek, Mr. John Shou, Mr. John Gordon and Ms. Grace Hu for their kind help and support during my thesis research.

I owe a great deal to all my colleagues at DVT (Datenverarbeitungstechnik) at the FernUniversität in Hagen. They spent a large amount of time discussing my research, provided an excellent working environment, and helped with administrative issues. I must not fail to mention all the Chinese colleagues at the FernUniversität in Hagen, who stood by me during the hard times and shared the fun times. Many thanks are due to all my colleagues at Shanghai Jiao Tong University for their support and help during my thesis research. I thank Miss Shihao Xu for her cooperation in implementing the simulation experiment.

Many thanks are due to the anonymous reviewers of the conferences and journals in the area of network and system management, such as DSOM, IM, IPOM, ECUMN and ISPA, and of the Journal of Network and System Management. Their valuable comments helped me a lot in improving my research and the related publications.

Finally, I would like to dedicate this book to my parents and my sister. I thank them for their support and encouragement in many ways during my long period of study.

Jianguo Ding
September 2008
Luxembourg

"Curiosity, or love of the knowledge of causes, draws a man from consideration of the effect to seek the cause; and again, the cause of that cause; till of necessity he must come to this thought at last, that there is some cause whereof there is no former cause, but is eternal; which is it men call God. So that it is impossible to make any profound inquiry into natural causes without being inclined to believe there is one God eternal."

Thomas Hobbes (1588-1679), LEVIATHAN, Part I: Of Man, Chapter XI: Of the Difference of Manners

Contents

1 Introduction
  1.1 Motivation of the Research
  1.2 Objective of the Book
  1.3 Summary of Contributions
  1.4 Overview of the Book

2 Fault Management in Distributed Systems
  2.1 Distributed Systems Management: State of the Art
    2.1.1 Distributed Systems Management: Objectives and Challenges
    2.1.2 Relevant Work towards Distributed Systems Management
    2.1.3 Enabling Technologies
  2.2 Fault Management in Distributed Systems
    2.2.1 Basic Concepts in Fault Management
    2.2.2 Challenges in Fault Management
  2.3 Relevant Research in Fault Management for Distributed Systems
    2.3.1 Expert Systems
    2.3.2 Finite-State Machines
    2.3.3 Graph-theoretic Techniques
    2.3.4 Probabilistic Models
    2.3.5 Model Traversing Techniques

3 Probability Reasoning and Bayesian Networks
  3.1 Background of Probability Theory
    3.1.1 Probability Calculus
    3.1.2 Why Uncertain Reasoning?
  3.2 Models of Bayesian Networks
    3.2.1 Basic Concepts of Bayesian Networks
    3.2.2 An Example of Bayesian Networks
    3.2.3 The Semantics of Bayesian Networks
    3.2.4 d-Separation in Bayesian Networks

4 Probabilistic Inference in Fault Management
  4.1 Bayesian Networks for Fault Management
    4.1.1 The Characteristics of the Faults in Distributed Systems
    4.1.2 Why Use Bayesian Networks for Distributed Fault Management?
    4.1.3 Mapping Distributed Systems to Bayesian Networks
  4.2 Probabilistic Inference for Distributed Fault Management
    4.2.1 Basic Model of Backward Inference in Bayesian Networks
    4.2.2 Strongest Dependency Route Algorithm for Backward Inference
    4.2.3 Related Algorithms for Probabilistic Inference

5 Prediction Strategies in Fault Management
  5.1 Dynamic Bayesian Networks for Fault Management
    5.1.1 Dynamic Characteristics in Distributed Systems
    5.1.2 Dynamic Bayesian Networks for Fault Management
  5.2 Prediction Strategies for Distributed Systems Management
    5.2.1 Prediction Methods in Dynamic Systems
    5.2.2 Prediction in Dynamic Bayesian Networks
    5.2.3 Analysis of the Prediction Strategies

6 Simulation Measurement
  6.1 Construction of the Simulation for Bayesian Networks
    6.1.1 Generation of Pseudo-Random Numbers
    6.1.2 Random Generation of Bayesian Networks
    6.1.3 Implementation of the Simulation Program
  6.2 Simulation Measurement for Probabilistic Inference
    6.2.1 A Simulation of Backward Inference
    6.2.2 Evaluation of the SDR Algorithm
  6.3 Simulation Measurement in Dynamic Bayesian Networks
  6.4 Analysis of the Simulation Program

7 Application Investigation
  7.1 Architecture for Distributed Systems Management
    7.1.1 Components of Distributed Management System
    7.1.2 Protocols for Distributed Systems Management
    7.1.3 Extended Architecture for Probabilistic Fault Management
  7.2 The Structure and Function of Fault Diagnosis Agent
    7.2.1 Data Collection and Analysis for Fault Management
    7.2.2 Dependency Analysis for Events
    7.2.3 Bayesian Networks for Fault Management
  7.3 Discussion of Application Issues

8 Conclusions
  8.1 Summary of Results
  8.2 Open Problems
  8.3 Future Works

A Pseudo-Random Numbers Generation
B Generation of Bayesian Networks
C Code for SDR Algorithm
Abbreviations
References
Index

Abstract

With the growth of distributed systems in size, heterogeneity, pervasiveness and complexity of applications and network services, the effective management of distributed systems becomes more important and more difficult. Managers have to live with unstable, uncertain and incomplete management information. Individual hardware defects or software errors, or combinations of such defects and errors in different system components, may cause the degradation of services of other (remote) components in the network or even their complete failure due to functional dependencies between managed objects. Meanwhile, dynamic changes are unavoidable in complex distributed systems. This brings additional challenges to the fault management of distributed systems. Hence an effective mechanism for distributed fault detection is needed to support rapid decision making in the management of distributed systems and to allow for partial automation of fault correction. Currently available fault detection techniques and commercial management software are unable to provide effective solutions for fault detection in the face of uncertain and incomplete management information and dynamic changes in distributed systems.

In this book Bayesian networks are applied to model the dependencies among managed objects and to provide efficient methods for locating the root causes of failures in the presence of inaccurate management information. A Strongest Dependence Route (SDR) algorithm for backward inference is offered based on Bayesian networks. The SDR algorithm allows system managers to trace the strongest dependency route from effects and to identify the sequence of possible causes, so that the most probable causes can be investigated first.

Unavoidable dynamic changes caused by system degeneration or improvement are investigated by introducing a temporal factor into standard Bayesian networks, so that time-related changes can be integrated into the Bayesian network model. Dynamic Bayesian Networks (DBNs) are applied in distributed systems management in order to address time factors and to model the dynamic changes of managed entities and the dependencies between them. Furthermore, prediction capabilities are investigated by means of the inference techniques when imprecise and dynamic management information occurs in distributed systems.

Implementation details of the simulation of Bayesian networks and Dynamic Bayesian networks are provided for the measurement of backward inference and prediction strategies in distributed fault management. Application issues of probabilistic fault management are investigated and a software architecture is designed to demonstrate how the probabilistic inference and prediction strategies can be brought into practice.


Chapter 1
Introduction

1.1 Motivation of the Research

With the growth in size, heterogeneity, pervasiveness, and complexity of applications and network services, the effective management of distributed systems has become more important and more difficult. Individual hardware defects, software errors, or combinations of such defects and errors in different system components may cause the service degradation of other (remote) components in the distributed system or even their complete failure due to functional dependencies between managed objects. Hence, an effective distributed fault detection method is needed to support quick fault detection in distributed systems management and to allow for the automation of fault management.

Although the Open Systems Interconnection (OSI) management standard provides a framework for managing faults in heterogeneous open systems, it does not address methodological issues of fault detection and diagnosis. In order to fill this gap, a great deal of research effort in the past decade has been focused on improving the fault detection and diagnosis capabilities of management systems. Rule-based expert systems have so far been the major approach to alarm correlation in fault detection [FGT91] [HS91] [OK99] [ZM03]. This approach suits well-defined problems in environments that are not very dynamic, but it does not adapt well to evolving distributed system environments [FKW96]. Case-based reasoning [Lew93] [PNM99] and coding-based methods [YKM+96] [LCL00] offer potential solutions for fault identification and isolation, but they cannot deal with uncertain or unstable situations in distributed systems. Finite State Machines (FSMs) have been used to model fault propagation behavior and duration [WS93b] [RH95] [CYL01] [MA01], but this approach has difficulties in scaling up to large and dynamic distributed systems. Kätker and Geihs provide model traversing techniques for fault isolation in distributed systems [KG97], but these lack flexibility, especially when fault propagation is complex and not well structured. Most of these solutions are based on deterministic mechanisms and are very sensitive to "noise" (such as loss of management information, delays in information collection and response, or misinterpreted alarms). That means they are unable to deal effectively with incomplete and imprecise management information in uncertain and dynamic environments. Probabilistic reasoning is another effective approach to fault detection in fault management [DLW93] [HJ97b] [SS01] [SB02] [DKB+04a].

Most of the current commercial management software, such as IBM Tivoli, HP OpenView, SunNet Manager, Cabletron Spectrum and CiscoWorks network management software, supports the integration of different management domains, collects information, performs remote monitoring, generates fault alarms, and provides statistics on management information. However, it lacks facilities for exact fault localization or for the automatic execution of appropriate fault recovery actions. From experience in distributed systems management, a typical metric for on-line fault identification indicates 95% fault location accuracy, while 5% of faults cannot be located and recovered in due time [Hil01]. Hence, for large distributed systems with thousands of managed components it may be rather time-consuming and difficult to resolve problems in a short time by an exhaustive search for the root causes of a failure, and such exhaustive detection may interrupt important services in the systems.

Due to the complexity of distributed systems, it is not always possible to build precise models for fault management. During severe failures a number of entities might be unreachable, causing unavailability, loss, or delay of network fault management messages. Also, a transient entity failure may result in a number of unreliable alarms. A well-designed strategy for the fault management of distributed systems should therefore operate efficiently in the case of redundant, incomplete and unreliable information.

In distributed systems, because of losses or delays in data collection, it is difficult to obtain full and precise management information. As the complex dependency relationships between managed objects and the cause-effect relationships among faults and alarms are generally incomplete, it is impossible to obtain a full and exact understanding of the managed system from the viewpoint of systems management.

In daily management, specialist or expert knowledge is very important and useful. But quantitative expert knowledge is often expressed in imprecise ways, such as "very high", "normal" or "sometimes". When expert knowledge is incorporated into a management system, probabilistic approaches are needed for the quantitative expression of this kind of expert knowledge.

In real-life distributed systems, dynamic changes are unavoidable because of degeneration or improvement in system performance. Hence, understanding unavoidable changes and "catching the trend" of changes in a distributed system is very important for fault management.

This book investigates strategies to improve fault management in distributed systems in the face of uncertain and imprecise management information, and presents novel prediction facilities coping with the dynamic changes in distributed systems.

1.2 Objective of the Book

In this book, Bayesian networks are employed to model the dependencies among managed objects and to provide efficient methods to locate the root causes of failure situations in the presence of imprecise management information. Our ultimate goal is to automate part of the daily management business. Bayesian networks are an effective means of modeling probabilistic knowledge by representing cause-and-effect relationships among the key entities of a managed system. They can be used to generate useful predictions about future faults and decisions even in the presence of uncertain or incomplete information.

One basic operation in distributed systems management is to infer particular causes from the observation of effects, particularly in fault diagnosis and malfunction recovery. Hence, backward inference techniques are identified as the basic processes for further decision making in implementing a self-managing system. In this book a backward inference approach based on Bayesian networks is developed to locate the root causes from effects in the uncertain environments of distributed systems.

Most distributed systems dynamically update their structures, topologies and the dependency relationships between managed objects. We need to accommodate such continual changes and maintain a healthy management system based on learning strategies, which correspondingly allows us to modify the cause-effect structure and also the dependencies between the nodes of a Bayesian network. However, the Bayesian paradigm does not provide a direct mechanism for modeling temporal dependencies in dynamic systems [AC96] [YS96]. In this book, Bayesian networks are therefore extended to Dynamic Bayesian Networks (DBNs) by employing temporal factors. We model the dynamic changes in managed objects and in the dependencies between them over time, and investigate the prediction capabilities based on the inference techniques in fault management in the presence of imprecise and dynamic management information. Further, regression theory [WG94] is employed to capture the trend of changes and to give a reasonable prediction of individual components and of the trends of the changes in the dependencies between managed objects in a distributed system.
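To make the notion of backward inference concrete, the following minimal Python sketch applies Bayes' rule to a single cause-effect pair. The event names and all probability values are invented for illustration; the SDR algorithm developed later in the book operates on full Bayesian networks rather than on this two-node special case.

# A minimal sketch of backward inference with Bayes' rule on a
# cause -> effect pair (hypothetical events and probabilities).

# Prior probability that a router's link card is faulty (assumed value).
p_fault = 0.02
# Likelihood of observing a "service unreachable" alarm given fault / no fault.
p_alarm_given_fault = 0.95
p_alarm_given_ok = 0.05

# Backward inference: P(fault | alarm) by Bayes' rule.
p_alarm = p_alarm_given_fault * p_fault + p_alarm_given_ok * (1 - p_fault)
p_fault_given_alarm = p_alarm_given_fault * p_fault / p_alarm

print(f"P(fault | alarm) = {p_fault_given_alarm:.3f}")  # about 0.279

Even with a highly indicative alarm, the low prior keeps the posterior modest here; this is why ranking candidate causes by their posterior dependency strength, as the SDR algorithm does, matters in practice.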

1.3 Summary of Contributions

The major research topic of this book is probabilistic fault management in distributed systems. The research provides strategies to deal with uncertain and incomplete management information, as well as with dynamic changes in distributed systems. The contributions of the book can be summarized as follows:

1. The book provides a detailed analysis of the state of the art in the management of distributed systems and in the fault management of distributed systems.

2. To cope with uncertain and incomplete management information in distributed systems, Bayesian networks are applied to model the probabilistic management information and the dependencies between the managed objects in distributed systems. Further, the book investigates the probabilistic characteristics of fault management and the construction of Bayesian networks for fault management.

3. A pruning algorithm is proposed to remove the unrelated (redundant) nodes of the Bayesian network with respect to the considered effects. It provides a simplified method for performing backward inference in a sub-BN embedded in a large and complex Bayesian network (see the sketch after this list).

4. To locate the root cause in the presence of a management problem, the book examines the backward inference technique based on Bayesian networks and presents a Strongest Dependence Route (SDR) algorithm for backward inference. The SDR algorithm allows users to trace the strongest dependency route from malicious effects to their potential causes, so that the most probable causes are investigated first. A formal proof of the SDR algorithm's core property, computing the strongest dependency route, is presented.

5. After an analysis of the dynamic changes in distributed systems, dynamic Bayesian networks, which integrate a temporal factor to model dynamic changes in distributed systems, are introduced to deal with fault management in dynamic environments.

6. Prediction strategies based on dynamic Bayesian networks are presented for fault management. These can catch the trend of the changes in individual managed objects and in the dependencies between managed objects, and further infer the potential faults which may occur in the future.

7. A simulation model of Bayesian networks is developed to verify the effectiveness and efficiency of the proposed approaches to the backward inference and prediction issues of fault management.

8. Application issues of probabilistic fault management are investigated and the related design of a software architecture is described to demonstrate how the probabilistic inference and prediction strategies can be brought into practice.
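A minimal sketch of the general idea behind such pruning is given below, assuming that, for backward inference from observed effects, only the ancestors (potential causes) of the effect nodes need to be retained; the pruning algorithm proposed in the book may differ in its details. The graph and node names are hypothetical.

# Sketch of ancestor-restriction pruning for backward inference:
# keep an effect node and everything that can cause it, drop the rest.

def prune_to_ancestors(parents, effects):
    """parents maps each node to the list of its parent (cause) nodes."""
    keep = set()
    stack = list(effects)
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(parents.get(node, []))  # walk towards the causes
    return keep

# Hypothetical dependency graph, encoded as child -> parents.
parents = {
    "ServiceDown": ["RouterFault", "ServerCrash"],
    "RouterFault": ["PowerFailure"],
    "PrinterJam": [],  # unrelated to the observed effect, will be pruned away
}
print(prune_to_ancestors(parents, ["ServiceDown"]))
# contains ServiceDown, RouterFault, ServerCrash, PowerFailure, but not PrinterJam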

1.4 Overview of the Book

The book is organized as follows:

Chapter 1 gives a brief introduction to the problem domain, the proposed work, and an overview of the whole book.

Chapter 2 presents a literature survey on distributed systems management, particularly the relevant research in fault management. Current challenges in distributed systems which inspired the research topics of this book are also investigated.

Chapter 3 provides the background of probability theory, which is the basis for probabilistic reasoning in uncertain environments. The basic concepts related to Bayesian networks are introduced as well.

Chapter 4 presents the initial study of the probabilistic inference scheme based on Bayesian networks for fault management in distributed systems. The rationale for applying Bayesian networks to fault management is established, and the construction of Bayesian networks from distributed systems based on the fault management task is investigated. The proposed Strongest Dependence Route (SDR) algorithm for backward inference can efficiently deal with the inference task from effects to causes. A formal proof, as well as a simulation of the algorithm, is presented in detail.

Chapter 5 investigates issues related to dynamic changes in distributed systems; a temporal factor is integrated into the Bayesian network model to analyze dynamic changes. Prediction strategies based on dynamic Bayesian networks are provided to ascertain the trends of the changes and further to implement proactive fault management before a failure occurs.

Chapter 6 establishes a simulation system based on Bayesian networks to evaluate the backward inference and prediction tasks. The process and a detailed analysis of the simulation are provided to demonstrate that the simulation measurement is reliable.

Chapter 7 investigates application issues of probabilistic fault management and discusses some practical topics in distributed fault management.

Chapter 8 summarizes the whole book and addresses open problems and future research in the probabilistic management of distributed systems.

Finally, the appendix contains part of the code of the simulation program for Bayesian networks.


Chapter 2
Fault Management in Distributed Systems

Fault management consists of a set of functions that enable the detection, isolation, and correction of abnormal behavior in the monitored system [ANS94]. It is an important part of distributed systems management, which comprises fault management, performance management, security management, accounting management, and configuration management.

Due to increasing demands for access to information, today's distributed systems have grown in both size and complexity, and numerous new services are expected of them. In addition, distributed systems are becoming an integral part of our daily activities. Such an environment requires that systems be highly reliable and easily available. In the future, distributed systems will become even larger, more complex and more error-prone. Hence, effective fault management is key to maintaining high availability and reliability in distributed systems.

2.1 Distributed Systems Management: State of the Art

Over a decade ago, the classical agent-manager centralized paradigm was the prevalent network management architecture, exemplified by the Open Systems Interconnection (OSI) reference model, the Simple Network Management Protocol (SNMP), and the Telecommunications Management Network (TMN) management framework [HAN99]. With the increasing size, management complexity, and service requirements of today's distributed systems, such management paradigms are no longer adequate and should be replaced with distributed management paradigms. This trend is extensively discussed in [Mar99].

On the other hand, in the IT industry, the increasing maintenance cost of distributed systems management is directly related to the increase in complexity of computing technologies, which have become so advanced that traditional manual management techniques are as apt to harm systems as to enhance them. Controls intended to improve capability actually weaken systems because the IT systems become unwieldy from the sheer number of adjustments, and the interactions between the adjustments are unknown and virtually unknowable. As this increased complexity is coupled with the trend towards heterogeneous distributed systems, the complexity of computing environments increases dramatically [Stu03]. To bring these complex and unmanageable systems under control, it is necessary for the IT industry to move to self-managing systems, where technology itself is used to manage technology.


2.1.1 Distributed Systems Management: Objectives and Challenges

Hegering [HAN99] defines distributed systems management as all measures ensuring the effective and efficient operation of a system within its resources in accordance with corporate goals. To achieve this, distributed systems management is tasked with controlling network resources, coordinating network services, monitoring network states, and reporting network status and anomalies. The objectives of distributed systems management are:

• Managing system resources and services: this includes the control, monitoring, updating, and reporting of system states, device configurations, and distributed system services.

• Simplifying systems management complexity: management systems must distill systems management information into a humanly manageable form. Conversely, management systems should also have the ability to interpret high-level management objectives.

• Providing reliable services: distributed systems must be provided with a high quality of service and system downtime must be minimized. Distributed management systems should detect and fix distributed system faults and errors, and must safeguard against all security threats.

• Maintaining cost consciousness: this requires keeping track of system resources and network users. All distributed system resource and service usage should be tracked and reported.

OSI has a well-defined reference model [HAN99] pertinent to the design of current distributed systems management architectures. The OSI model breaks distributed systems management functions into the following five functional areas:

• Fault Management: includes the detection, isolation and correction of faults in distributed systems.

• Configuration Management: aims at maintaining an inventory of network resources and ensuring that network configuration changes are reflected in the configuration-management database. Configuration management also provides information that is used by problem-management and change-management systems. Problem-management systems use this information to compare version differences and to locate, identify, and check the characteristics of network resources. Change-management systems use the information to analyze the effect of changes and to schedule changes at times of minimal network impact.

• Accounting Management: aims at identifying the costs of the network features utilized in order to meet the needs of a given communication objective, so that these costs may be communicated to the users and the corresponding tariffs may be applied.

• Performance Management: provides reliable and high-quality network performance. This involves provisioning quality of service (QoS) and regulating crucial performance parameters such as network throughput, resource utilization, delay, congestion level and packet loss.

• Security Management: aims at the application of the security and surveillance policies defined for the system. It includes the creation, elimination and control of security services and mechanisms, and also the distribution of information related to security.


In recent years, distributed systems infrastructure has been shifting towards service-centric distributed systems. The dependencies between service components are extensible, dynamic and unpredictable. Besides the above management objectives and OSI functional areas, distributed systems management must also fulfill additional requirements, similar to today's business service models: service differentiation, service customization, more features, and flexibility. The future distributed systems infrastructure will drastically change the way distributed systems are managed and presents new challenges to distributed systems management.

Firstly, as the size of distributed systems continues to grow, more and more network devices will need to be managed efficiently, demanding better scalability. As a result of such an increase, human directives can only be given at a very high level of abstraction and generalization. The underlying management system must take care of the interpretation of these high-level directives into realizable network configurations and oversee their enforcement.

Secondly, as distributed systems infrastructures from various sectors converge, heterogeneous network technologies must co-exist and inter-work. Management systems must provide such seamless integration via common service interfaces, and hide the underlying technological heterogeneity from network users.

Thirdly, the competitive nature of current system services demands the economical operation of distributed systems. Distributed systems management must also be more self-regulating and self-governing in order to be economically beneficial. At the same time, distributed systems management solutions must be kept simple and elegant, as the development of the Internet has demonstrated: only simple and elegant solutions prevail in large-scale heterogeneous distributed systems.

Lastly, as the devices in distributed systems become more and more powerful, there is increasing pressure to utilize their processing capabilities. This increases the need for distributed systems management at the device level.

2.1.2 Relevant Work towards Distributed Systems Management

In the traditional manager-agent architecture of distributed systems management, such as SNMP, the agent is kept as simple as possible and only tasked with device status reports and updates, while the burden of management and data processing resides with the manager. Researchers realized the inadequacy of such a design in the early 1990s, as the size of managed distributed systems grew rapidly and produced increasing demands on performance and reliability. This prompted a complete rethinking of the distributed systems management paradigm.

SNMPv2 was the first major installment towards distributed systems management. The initial set of Requests for Comments (RFCs 1441-1452) was published in 1992. SNMPv2 introduces the concept of the intermediary manager, which can be viewed as a "middle manager". The manager communicates directly with the intermediary managers and exchanges command information, while the intermediary managers handle the data exchange with agents. In this fashion, the intermediary managers take over some of the data processing from the manager side and are capable of performing simple tasks, such as periodic status polling of agents, without the manager's direct intervention.

In 1995, the Internet Engineering Task Force (IETF) took a further step towards management distribution with the proposal of Remote MONitoring (RMON) [Wal95]. RMON uses the concept of monitors or probes. Probes can be implemented as device-embedded applications or as separate devices. The task of a probe is to monitor the network traffic in its local region and to report anomalies in the form of alarms to its manager. By defining alarm types and alarm thresholds, the manager is able to offload some data gathering and decision-making (mainly event filtering) to the probes. Furthermore, the probes can also perform some data pre-processing before forwarding data to the manager.

In general, earlier work towards distributed systems management can be considered a weak form of distribution. The management tasks still reside heavily on the manager side, and only some rudimentary management duties are delegated to intermediary entities, in the form of event filtering, notification, and data pre-processing. Due to the lack of a self-managing mechanism, distributed systems management still suffers from demanding tasks.
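As an illustration of the kind of threshold-based event filtering a probe performs before alarming its manager, the following Python sketch raises an alarm only on an upward threshold crossing. The metric name, threshold value and callback are invented for illustration; real RMON probes are configured through their MIB variables, not through an API like this.

# Sketch of a monitoring probe that filters events locally and only
# forwards alarms to its manager on a rising-threshold crossing.

class Probe:
    def __init__(self, rising_threshold, notify_manager):
        self.rising_threshold = rising_threshold
        self.notify_manager = notify_manager
        self.armed = True  # re-armed only after the value falls back below

    def observe(self, metric, value):
        # Alarm once per excursion above the threshold, so the manager is
        # not flooded with one event per polling interval.
        if value >= self.rising_threshold and self.armed:
            self.notify_manager({"metric": metric, "value": value})
            self.armed = False
        elif value < self.rising_threshold:
            self.armed = True

probe = Probe(rising_threshold=0.9, notify_manager=print)
for load in (0.4, 0.95, 0.97, 0.5, 0.93):  # simulated link utilization samples
    probe.observe("ifUtilization", load)   # alarms at 0.95 and 0.93 only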

2.1.3 Enabling Technologies

A set of enabling technologies is commonly recognized as potential candidates for distributed systems management. Their potential benefits, drawbacks, and prospects for distributed systems management are examined and discussed below.

1. Policy-based Management Strategies

Policy-based distributed systems management started in the early 1990s [MS93] [KKR95]. Although the idea of policies appeared even earlier, they were used primarily as a representation of information in one specific area of distributed systems management: security management [DM89]. The idea of policy comes quite naturally to any large management structure. In policy-based distributed systems management, policies are defined as the rules that govern the states and behaviors of a distributed system. The management system is tasked with:

• the transformation of human-readable specifications of management goals into machine-readable and verifiable rules governing the function and status of a distributed system,

• the translation of such rules into mechanical, device-dependent configurations, and

• the distribution and enforcement of these configurations by management entities.

The reference model for policy-based distributed systems management is largely a manager-agent model, consisting of Policy Decision Points (PDP) and Policy Enforcement Points (PEP) [IET99] [IET00a]. The first two tasks are handled by the PDP, while the last task is handled by the PEP. The IETF's Resource Allocation Protocol (RAP) plays a key role in policy-based distributed systems management with its Common Open Policy Services (COPS) [IET00b] and its extension COPS-PR [CDG+00]. Some recent work has been done on the translation of business directives to network-level policies [CBG00] and on the resolution of policy conflicts [LS99].

The most significant benefit of policy-based distributed systems management is that it promotes the automation of establishing management-level objectives over a wide range of system devices. The system administrator interacts with the distributed system by providing high-level abstract policies. Such policies are device-independent and are stated in a human-friendly manner; a simple example of such a rule is sketched below. Policy-based distributed systems management can adapt rapidly to changes in management requirements via run-time reconfiguration, rather than re-engineering new object modules for deployment. The introduction of new policies does not invalidate the correct operation of a distributed system, provided the newly introduced policies do not conflict with existing policies. In comparison, a newly engineered object module must be tested thoroughly in order to obtain the same assurance. For large distributed systems with frequent changes in operational directives, policy-based distributed systems management offers an attractive solution, which can dynamically translate and update high-level business objectives into realizable network configurations.

However, one of the key weaknesses of policy-based distributed systems management lies in its functional rigidity. After the development and deployment of a policy-based distributed management system, the service primitives are fixed. By altering management policies and modifying constraints, we have a certain degree of flexibility in coping with changing management directives. However, we cannot modify or add new management services to a running system, unlike with mobile code or software agents.
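The following minimal Python sketch shows what such a condition/action rule might look like on the PDP side; the rule content, attribute names and decision function are hypothetical and are not tied to any particular policy framework such as COPS.

# Sketch of a condition/action policy rule and its evaluation at a PDP.

policy = {
    "name": "limit-backup-traffic",
    # Condition over the observed system state (hypothetical attributes).
    "condition": lambda state: state["time_of_day"] < 18
                               and state["link_utilization"] > 0.8,
    # Action to be handed to a PEP for enforcement.
    "action": {"traffic_class": "backup", "max_bandwidth_mbps": 10},
}

def decide(policy, state):
    """PDP side: evaluate the condition and return the action for the PEP."""
    return policy["action"] if policy["condition"](state) else None

print(decide(policy, {"time_of_day": 14, "link_utilization": 0.9}))
# {'traffic_class': 'backup', 'max_bandwidth_mbps': 10}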

2. Distributed Object Computing Technology

Distributed object computing (DOC) uses an object-oriented (OO) methodology to construct distributed applications. Its adaptation to distributed systems management aims at providing support for a distributed systems management architecture, integration with existing solutions for heterogeneous distributed systems management, and development tools for distributed systems management components. Distributed object computing enables the distribution of services and applications in a seamless and location-transparent way by separating object distribution complexity from network management functionality concerns. Another advantage of this separation of concerns is the ability to provide multiple management communication protocols accessed via a generalized abstract programming interface (API), fostering inter-operability of heterogeneous network management protocols, such as SNMP for IP networks and the Common Management Information Protocol (CMIP) for telecommunication networks. In addition, DOC provides a distributed development platform for the rapid implementation of robust, unified, and reusable services and applications.

Distributed object computing in general, and the Common Object Request Broker Architecture (CORBA) in particular, are well-received technologies for developing integrated distributed systems management architectures with object distribution. The success of CORBA as an enabling distributed object technology can be attributed to the fact that CORBA specifies a well-established support environment for efficient run-time object distribution and a set of support services. On this basis DOC is useful as an integration tool for heterogeneous distributed systems management domains and for extending deployed distributed systems management architectures. However, DOC still uses static object distribution. It does not have the flexibility that code mobility offers. Furthermore, DOC requires dedicated and heavy run-time support, which may not always be feasible on every device in a distributed system. These issues restrict its range of application.

3. Mobile Agents

Yemini et al. first introduced the concept of Management by Delegation (MbD) in 1991 and further refined it in [GY95]. They suggested pushing management tasks to the agent side. This can be achieved by dynamically transporting programs from managers to agents and performing the delegated management tasks locally.
Three immediate advantages of the Management by Delegation approach are apparent:

• Firstly, the manager is no longer a centralized processing entity in the distributed system. Much of its processing can be transferred to agents via delegated programs.

• Secondly, a considerable amount of network resources is saved. For instance, data gathering can be performed locally.

• Lastly, it is possible to augment the functionality of agents by providing them with dedicated programs at runtime. In this fashion, some decision making and network monitoring duties can be performed locally, allowing faster response to management requests and better fault tolerance (in case of a manager crash).

As mobile code is transported across the components of a distributed system, it must be loaded at the destination for execution. It may take quite a long time to suspend the execution of a component, pack its code and data, transport the packed code and data across the distributed system, restore the component and execute it. Hence, code mobility is not a good candidate for distributed systems with simple but frequent service requests. Furthermore, to prevent mobile agents from adversely affecting system resources, protective measures are often in place which either restrict the operations a mobile agent can perform on local resources or provide some type of access gateway. Neither solution is satisfactory, as access restrictions constrain the operational capacity of mobile agents, while access gateways add unnecessary processing overhead.

4. Intelligent Agents

Intelligent agents exhibit the following characteristics: autonomy, social ability, reactivity, pro-activeness, mobility, learning, and beliefs. An intelligent agent is an independent entity capable of performing complex actions and resolving management problems on its own. Unlike mobile code, an intelligent agent does not need to be given task instructions to function, but only high-level objectives. The use of intelligent agents completely negates the need for dedicated manager entities, as intelligent agents can perform distributed systems management tasks in a distributed and coordinated fashion via inter-agent communication. Many researchers believe intelligent agents are the future of distributed systems management, since there are some significant advantages in using them:

• Firstly, intelligent agents would provide a fully scalable solution to most areas of distributed systems management. Hierarchies of intelligent agents could each assume a small task in their local environment and coordinate their efforts globally to achieve some common goal, such as maximizing the overall network utilization.

• Secondly, data processing and decision making are completely distributed, which alleviates the management bottlenecks seen in centralized distributed systems management solutions. In addition, the resulting management system is more robust and fault-tolerant, as the malfunction of a small number of agents would have no significant impact on the overall management function.

• Thirdly, the entire management system is autonomous; network administrators would only need to provide service-level directives to the system.

• Lastly, intelligent agents are self-configuring, self-managing, and self-motivating. It is ultimately possible to construct a management system that is completely self-governed and self-maintained. Such a system would largely ease the burden of the distributed systems management routines that a systems administrator currently has to struggle with.


The application of intelligent agents to distributed systems management is still in its infancy, and many difficult issues remain unsolved. As applications utilizing intelligent agents arise in distributed systems management, the problem of managing these intelligent agents will also become increasingly important. These self-governing agents cannot simply be allowed to roam around the network freely and access vital resources. Currently, it is still very difficult to design and develop intelligent agent platforms, mostly because very little real-life practice with intelligent agents exists today. As intelligent agents are empowered with more intelligence and capabilities, their size will become an increasing concern for network transport. Furthermore, agent-to-agent communication typically uses the Knowledge Query and Manipulation Language (KQML). KQML wastes a substantial amount of network resources, as its messages are very bulky. Lastly, as with mobile agents, the security of intelligent agents is a major barrier to their application. In particular, the following questions remain unanswered: Who takes care of agent authentication? Can agents protect themselves against security attacks? Can agents keep their knowledge secret? How many access rights should agents have over network resources?

5. Active Networks Technology

According to Tennenhouse et al. [TSS+97], an active network is a new approach to distributed systems architecture in which the system nodes, such as routers and switches, perform customized computation on the messages flowing through them. In active networks, routers and switches run customized services that are uploaded dynamically from remote code servers or from active packets. The characteristic of activeness is threefold. From a device view, the device's services and operations can be dynamically updated and extended at run-time. From a network provider view, the entire network's resources can be provisioned and customized actively on a per-customer basis. From the view of the users of distributed systems, the allocated resources can be configured actively based on user application needs.

Active networks combined with code mobility present an effective enabling technology for distributing management tasks to the device level. Not only can management tasks be offloaded to individual network devices, but the suppliers of management tasks need no longer be manager entities. Such a solution provides full customization, device-wise, service-provider-wise, and user-wise; it provides the means for a distributed process across all network devices; it is inter-operable across platforms via device-independent active code; it fosters user innovation and user-based service customization; it accelerates the deployment of new services and network technologies, bypassing standardization processes and vendor consensus; and it allows for fine-grained resource allocation based on individual service characteristics.

In the literature, there are two general approaches to realizing active networks: the programmable switch approach and the capsule approach. The programmable switch approach uses an out-of-band channel for code distribution. The transportation of active code is completely separated from regular data traffic. This approach is easier to manage and secure, as the active code is distributed via private and secure channels. It is suitable for network administrators configuring network components. The capsule approach, on the other hand, packages active code into regular data packets. The active code is sent to active nodes via regular data channels. This approach allows open customization of user-specified services. Its downside, however, is that it is prone to security threats. The benefits of active networks for enterprise distributed systems management are analyzed in [BP02].


Some recent work has been done on exploring active networks for distributed systems management, such as the Virtual Active Network (VAN) proposal [Bru00] and the agent-based active network architecture [HC02]. However, security remains a major roadblock to the practical application of active networks. Not only must the integrity of network resources and user data be preserved, but the content of user data must also remain confidential. Besides security, resource provisioning and fault tolerance are the other two major issues that need to be addressed in active networks.

6. Probability Theory

With the increase in size, heterogeneity, pervasiveness, and complexity of applications and services in distributed systems, the effective management of such distributed systems is becoming more important and more difficult. Uncertainty is an unavoidable characteristic of complex systems; it comes from unexpected hardware defects, unavoidable software errors, incomplete management information and the dependency relationships between the managed entities. An effective management system for distributed systems should deal with this uncertainty and support probabilistic reasoning for daily management tasks [DKB+04a].

Probability theory provides a sound mathematical basis for reasoning under uncertainty, and it has a number of attractive properties. It allows one to envision multiple possible states of the world with different degrees of likelihood. It allows one to incorporate multiple pieces of evidence obtained from various sensors in a coherent way, using the principle of Bayesian conditioning. In combination with decision theory, it allows one to make optimal decisions that lead to the maximum expected utility. Also, statistics provides excellent tools for learning probabilistic models from data. Probability theory is used as the foundation of our work in this book.

7. Artificial Neural Networks

An Artificial Neural Network (ANN) is a system composed of elements (neurons) interconnected according to a model that tries to reproduce the functioning of the neural networks in the human brain. Conceptually, each neuron may be considered an autonomous processing unit, provided with local memory and with unidirectional channels for communication with other neurons. The functioning of an input channel in an ANN is inspired by the operation of a dendrite in biological neurons; analogously, an output channel is modeled on an axon. A neuron has only one axon, but it may have an arbitrary number of dendrites (a biological neuron has around ten thousand dendrites). The output "signal" of a neuron may be utilized as the input of an arbitrary number of other neurons [MK94b] [Mea89]. In its simplest form, the processing carried out in a neuron consists of computing the weighted sum of the signals present at its inputs and generating an output signal if the sum surpasses a certain threshold. In the most general case, the processing may include any type of mathematical operation on the input signals, also taking into consideration the values stored in the neuron's local memory.

One of the main motivations for the development of ANNs is the utilization of computers to deal with a class of problems that are easily solved by the human brain but are not treated effectively by conventional programming paradigms alone. The distributed control and storage of data, and parallelism, are remarkable features of ANNs. Besides that, an ANN does not require previous knowledge of the mathematical relationship between inputs and outputs, which may be learned automatically during the system's normal operation. This makes ANNs, at first sight, a good alternative for applications (such as alarm correlation and fault diagnosis) where the relationships between faults and alarms are not always well defined or understood and where the available data is sometimes ambiguous or inconsistent [CMP89]. ANNs have been used for the purpose of fault localization [GH98] [Wie02]. The disadvantage of neural network systems is that they require long training periods [GH96] [WBE+98] and that their behavior outside their area of training is difficult to predict [WBE+98].
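As a concrete illustration of the neuron model just described (a weighted sum of the inputs followed by a threshold decision), here is a minimal Python sketch; the weights, inputs and threshold value are invented for illustration.

# Sketch of the simplest neuron: weighted sum of inputs, then a threshold.

def neuron(inputs, weights, threshold):
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0

# Three alarm indicators feeding one diagnostic neuron (hypothetical values).
print(neuron(inputs=[1, 0, 1], weights=[0.6, 0.3, 0.5], threshold=1.0))  # prints 1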


8. Economic Theory

Distributed systems management using economic theory proposes to model distributed system services as an open market. The resulting distributed system is self-regulating and self-adjusting, without the presence of any formal distributed systems management infrastructure. Distributed systems administrators would indirectly control the system dynamics by introducing incentives and defining aggregate economic policies. Such an approach may seem very bold, but it draws its theory from the well-established economic sciences. The premises for applying economic theories are the existence of open and heterogeneous distributed systems, multi-vendor orientation, and competitive services. Very little work has been done on this subject, and most of it focuses on using economic theory as an agent coordination model [BKR99] [BMI+00].

However, the application of economic theories to distributed systems management is only at an early experimental stage. Many critical issues brought out by these experiments cast doubt on the applicability of economic theory to distributed systems management. Using a market model for managing distributed systems is a novel idea, but some important design issues must be carefully considered.

• Firstly, the driving force of a market model is the authenticity of its currency. The currency in distributed systems management denotes the authority of a certain component or service to use system resources. Hence, the currency values and the transaction processes used in a market model must be secure. Furthermore, such secure transactions must be performed very efficiently, as they will be very frequent operations.

• Secondly, the economic policy of the market model must be designed in such a way that it encourages fair competition and strongly relates resource contention to its associated price.

• Lastly, the market model would operate on a wide scale, requiring standardization of its elements and operations. Such standardization may be a very slow process and would require full consensus from all participating vendors.

2.2 Fault Management in Distributed Systems

Distributed systems management aims to improve the efficiency and reliability of distributed systems and to provide a high degree of performance in an efficient and cost-effective manner. One aspect of distributed systems management is fault management, and its central component is fault diagnosis. Although failures are unavoidable, their quick detection, isolation and recovery are essential for the robustness, reliability, and accessibility of a system.


Fault management tasks can be characterized as detecting when network behavior deviates from normal and formulating a corrective course of action. Fault management has the ability to locate faults, determine their causes and set appropriate corrections. This entails identifying and finding problems or irregularities within the whole distributed system, pinpointing the origin of the setbacks, and troubleshooting the problem elements at their source and immediate location. On an even higher plane, it also espouses prevention: a proactive approach of stopping negative conditions in their tracks even before they cause a real failure in the system.

Fault management can typically be broken down into three basic steps [ANS94]:

1. Fault Detection [BCF94]: the process of capturing on-line indications of disorder in distributed systems, provided by malfunctioning devices in the form of alarms. Fault detection determines the root cause of a failure. In addition to the initial failure information, it may use failure information from other entities in order to correlate and localize the fault.

2. Fault Isolation [BCF+95] [KG97] (also referred to as fault localization, event correlation, or root cause analysis): a set of observed fault indications is analyzed to find an explanation of the alarms. This stage includes identifying the cause that led to the detected failure in the case of fault propagation, and the determination of the root cause. The fault localization process is of significant importance because the speed and accuracy of the overall fault management process depend heavily on it.

3. Fault Correction: responsible for the repair of a fault and for the control of procedures that use redundant resources to replace equipment or facilities that have failed.

Ideally, fault management includes all three steps starting with detection, but most often only fault detection is implemented, owing to the complexity of providing general fault isolation and fault correction procedures. As mentioned above, fault management is a complex process consisting of two diagnosis steps (fault detection and fault isolation) and one planning step (fault correction).
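The three-step process can be pictured as a simple event-driven pipeline. The following minimal Python skeleton is only an illustration of that control flow; the class and method names are invented here, and the architecture actually proposed in this book (Chapter 7) is considerably richer.

# Skeleton of the detection -> isolation -> correction pipeline.

class FaultManager:
    def handle_alarms(self, alarms):
        failures = self.detect(alarms)           # step 1: fault detection
        for failure in failures:
            root_cause = self.isolate(failure)   # step 2: fault isolation
            self.correct(root_cause)             # step 3: fault correction

    def detect(self, alarms):
        # Correlate raw alarms into failure reports (placeholder logic).
        return [a for a in alarms if a.get("severity") == "critical"]

    def isolate(self, failure):
        # Locate the most probable root cause (placeholder logic).
        return failure.get("source")

    def correct(self, root_cause):
        print(f"dispatching repair action for {root_cause}")

FaultManager().handle_alarms([{"severity": "critical", "source": "router-7"}])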

2.2.1 Basic Concepts in Fault Management

There is a collection of basic terms associated with the investigation of fault management. Some of these terms have no standard definition yet. The most common definition of these terms is as follows: • Fault: In document ISO/CD 10303-226, a fault is defined as an abnormal condition or defect at the component, equipment, or sub-system level which may lead to a failure. According to Federal Standard 1037C [Fed00], the term fault has the following meanings: (1) An accidental condition that causes a functional unit to fail to perform its required function. (2) A defect that causes a reproducible or catastrophic malfunction. A malfunction is considered reproducible if it occurs consistently under the same circumstances. Failures in hardware can be caused by random faults or systematic faults, but failures in software are mostly systematic. [JW93] defines a fault as ”a disorder occurring in the hardware or software of the managed network. A fault happens within the managed network or its components”. [HSV99] defines a fault as ”an event that is associated with abnormal network state, i.e., network behavior that deviates from expectation”. [WS93a] defines faults as any abnormal operations that disrupt communications or degrade performance significantly. In the context of


distributed systems management, a fault is defined as a cause for malfunctioning. Faults may prevent the normal functioning of a system. They manifest themselves through errors, i.e. deviations from the normal operation of the system. In distributed systems management faults may be caused, e.g., by incorrect design, incorrect coding or hardware component wear-out. Faults constitute a class of network events that can cause other events but are not themselves caused by other events [JW93] [LC99] [YKM+96]. In the literature, failure is often used as an alternative word for fault. According to their duration, faults may be classified as [Wan89]:
– Permanent: describes a failure, fault, or error that is continuous and stable [SS82]. In distributed systems management, a permanent failure reflects an irreversible change in the distributed system. The word ”hard” is used interchangeably with permanent.
– Intermittent: describes a fault or error that occurs on a discontinuous and periodic basis due to unstable hardware or varying hardware or software states (for example, as a function of load or activity). An intermittent fault can cause a degradation of service for short periods of time. Frequently re-occurring intermittent faults significantly jeopardize service performance [SS82].
– Transient: describes a fault or error resulting from temporary environmental conditions. The word ”soft” is used interchangeably with transient [SS82]. Transient faults cause a temporary and minor degradation of service. They are usually automatically repaired by error recovery procedures [Wan89].
• Error: a discrepancy between a computed, observed or measured value or condition and a true, specified, theoretically correct value or condition [Wan89]. Errors may cause deviation of a delivered service from the specified service, which is visible to the outside world; the term failure is used to denote this type of error. Errors do not need to be directly corrected, and in many cases they are not visible externally. However, an error in a distributed system device or software may cause a malfunctioning of dependent system devices or software. Thus, errors may propagate within the distributed system causing failures of faultless hardware or software [Wan89]. A failure (fault) is the occurrence of one or more errors in a row.
• Alarm (Alert): alarms are defined by vendors and generated by entities in distributed systems when they sense an abnormality [JW93]. [BCF94] references an IBM internal report that describes an ideal alarm (alert). Generally alarms (alerts) try to answer the following questions about a fault: who, what, where, when, and why. However, answers to the questions where and why are usually not provided. The main requirement for performing fault management in an integrated way is the availability, in a management center, of real-time information about the functioning of the distributed system. The abnormalities that occur during the operation of the distributed system cause the automatic emission of alarm notifications, which are received at the distributed system management center. From the received alarm notifications, the human operator must attempt to identify the fault that occurred and, if necessary, to issue a trouble ticket, which is used as a reference for dispatching maintenance teams. Once the problem is solved, the trouble ticket is closed and remains available only for consultation.
• Event: [JW93] [LC99] [YKM+96] define an event as an ”occurrence of an exceptional condition in the operation of the network”. Events are often the result of underlying

problems such as hardware or software failures, performance bottlenecks, configuration inconsistencies, or intrusion attempts. Events are classified as problem events, symptom events, both problem and symptom events, or neither. A symptom event is defined as an event that is observable [YKM+96]. This leads to the conclusion that a problem event is non-observable and must be inferred from its symptom events. [Inf89] defines an event as a change in the status of a performance index in a distributed system, or a message passed between entities in a distributed system. [HSV99] defines an event as an instantaneous occurrence at a time point. Typically, an event is associated with an object in which it occurs; the objects are called Managed Objects (MO). Events are classified as being primitive or composite. Events that are directly generated in, or correspond to a change of state of, a managed object can be considered primitive since they are directly observable.
• Event Correlation: [JW93] defines event correlation as ”a conceptual interpretation of multiple alarms such that a new meaning is assigned to those alarms. It is a generic process that underlies different management tasks in distributed systems such as context-dependent alarm filtering, alarm generalization, network fault diagnosis, generation of corrective actions, proactive maintenance, and network behavior trend analysis.” [HSV99] describes event correlation as follows: ”event correlation works by establishing relationships between network events”. An Event Correlation System (ECS) correlates events to diagnose faults. Moreover, an ECS aids in
– detection and isolation of faults;
– filtering events;
– performance tuning.
To be useful, an ECS must be
– correct: the outcome must reflect what has occurred in the distributed system with high probability,
– optimal: the ECS should infer a small set of root causes that explain all detected events.
[YKM+96] gives a general definition: ”correlation is concerned with analysis of causal relations among events”. Also, causality is defined as a partial order relation between events.
• Symptom: an external manifestation of failures [JW93] (the observable events). For example, degraded application performance is a symptom of a faulty interface. Symptoms are observed as alarms/notifications of a potential failure [JW93] [KG97] [LC99] [YKM+96]. These notifications may originate
– from management agents via management protocol messages (e.g., SNMP trap [CMR+96] and CMIP EVENT-REPORT),
– from management systems that monitor the distributed system status, e.g., using the command ping [Ste95],
– from system log-files or character streams sent by external equipment [Sch96].


The symptoms of failures in distributed systems might be general (such as clients being unable to access specific servers) or more specific (routes not listed in the routing table). Each symptom can be traced to one or more problems or causes by using specific troubleshooting tools and techniques. In distributed systems management, the terms ”symptom” and ”alarm” are sometimes used interchangeably.
• Fault Injection: the deliberate insertion of faults into an operational system to determine its response [CP95]. Fault and error injection have been recognized as powerful approaches to validate the fault tolerance mechanisms of a system and obtain statistics on parameters such as coverage and latencies [KKA95]. Fault injection tests fault detection, fault isolation, re-configuration and recovery capabilities [HTI97]. (A small illustrative sketch follows at the end of this list.)
• Fault Diagnosis: a stage in the fault management process which consists of finding out the original cause of the received symptoms (represented by the alarms). Before getting to the original cause, it may be necessary to formulate a set of fault hypotheses, which then need to be validated by means of tests. It is desirable that a system for fault diagnosis has a model of the managed configuration, processes the flow of alarms in real time and is capable of working in the presence of incomplete data. Besides this, it is expected to be able to identify changes in the appearance and in the importance of problems related to time (for example, hour, day of the week, season of the year), to separate causes from effects and to solve the problems according to their seriousness (i.e., the most serious problems must have the highest priority). In the selection of tests to be applied, the system must choose the most cost-effective one. As much as possible, the diagnostic tests should be automated. Finally, it is desirable that the system is somehow able to interpret the results of the tests.
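The fault injection concept above can be approximated in software by deliberately perturbing an operation and checking whether the detection mechanism notices. The following Python sketch is purely illustrative: the injected fault (message loss), the detector and the coverage statistic are invented stand-ins for the techniques surveyed in [KKA95] and [HTI97].

import random

def send(message):
    # Normal operation: pretend to deliver a message and return an acknowledgement.
    return {"delivered": True, "payload": message}

def send_with_fault_injection(message, loss_rate=0.3, rng=random.Random(42)):
    # Deliberately insert a transient fault: drop the message with some probability.
    if rng.random() < loss_rate:
        return {"delivered": False, "payload": None}
    return send(message)

def detector(result):
    # A trivial fault-detection check; a real system would raise an alarm here.
    return not result["delivered"]

injected = detected = 0
for i in range(1000):
    result = send_with_fault_injection(f"msg-{i}")
    if not result["delivered"]:
        injected += 1
    if detector(result):
        detected += 1

# Coverage: fraction of injected faults the detector caught (trivially 1.0 here).
print(f"injected={injected}, detected={detected}, coverage={detected / injected:.2f}")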

2.2.2 Challenges in Fault Management

With the growth of managed systems, it is estimated that the management center of a medium-size regional operator will receive tens of thousands of alarm notifications per day, which renders the ”manual” processing of all of them practically unfeasible [Nyg95]. Traditional manual management techniques are as apt to harm systems as to enhance them [Stu03]. In addition, many of the received notifications do not contain original information. In fact, the occurrence of a single fault in the supervised distributed system sometimes results in the reception of multiple notifications. Also, a single notification may be generated by multiple faults. The dependency relationships between managed objects contribute to this situation. The fault of a given component may affect several other components, causing the fault's propagation. Hence, in a distributed system it is difficult to identify the root cause directly from the notification of a fault [DKB+04a]. The problem of fault localization is NP-complete [KS95], which means that no polynomial-time algorithm for solving it is known. Nevertheless, by means of heuristics [Pea88], in some cases it is possible to develop polynomial algorithms which give approximate solutions or, in other cases, which give a correct solution with a given probability [KS95]. In fault management, besides the complexity inherent in any NP-complete problem, the following additional aspects must be taken into account.
1. Fault evidence may be ambiguous, inconsistent, redundant, noisy and incomplete [DSY+89] [HS91] [KS95] [DKB+04a].

In general, it is presumed that all information necessary to handle the management task is spontaneously sent by the elements of the distributed system. In a complex environment, locating and diagnosing faults can be difficult. There are specific problems associated with fault observation, as indicated by the following distinction:
• Unobservable faults: Certain faults are inherently unobservable locally. For example, the existence of a deadlock between cooperating distributed processes may not be observable locally. Other faults may not be observable because vendor equipment is not instrumented to record the occurrence of a fault or an interruption in a link without an alternative path.
• Partially observable faults: A node failure may be observable but the observation may be insufficient to pinpoint the problem. A node may not be responding because of the failure of some low-level protocol in an attached device.
• Uncertainty in observation: Ambiguity in the observed set of alarms stems from the fact that the same alarm may be generated as an indication of many different faults. Even when detailed observations of faults are possible, there may be uncertainty and even inconsistencies associated with the observations. A lack of response from a remote device may mean that the device is locked up, the network is partitioned, congestion caused the response to be delayed, or the local timer is faulty.
Noise can be made up of meaningless information, redundant information, streaming alarms, occasional spikes, frequent oscillations and repeated occurrences. Incompleteness is a consequence of alarm loss or delay [HS91]. It is essential that a fault management system be able to create a consistent view of network operations even in the presence of ambiguous, inconsistent, or incomplete information [DSY+89].
2. A fault management system should provide means to represent and interpret uncertain data within the system knowledge and fault evidence [DLW93] [HS91] [KYY+95] [DKB+04a]. A set of alarms generated by a fault may depend on many factors such as dependencies among distributed system devices, current configurations, services in use since fault occurrence, presence of other faults, values of other distributed system parameters, etc. Due to these factors the system knowledge may be subject to inaccuracy and inconsistency. Fault evidence may also be inaccurate because of spurious alarms, which are generated by transient problems or as a result of overly sensitive fault detection mechanisms. When spurious symptoms occur, the management system may not be sure which observed alarms should be taken into account in the fault localization process.
3. Fault management should deal with hidden and complex dependencies [DKB+04b]. Once faults are observed, it is necessary to isolate the fault to a particular component. The following is a list of problems that can arise:
• Multiple potential causes: When multiple technologies are involved, the potential points of failure and the types of failure increase. This makes it harder to locate the source of a fault.
• Too many related observations: A single failure can affect many active communication paths. A failure in one layer of the communications architecture can cause degradation or failures in all dependent higher layers. A failure in a T1 line will be detected in the routers as a link failure and in the workstations as transport and


application failures. Because a single failure may generate many secondary failures, the proliferation of fault monitoring data that can be generated in this manner can obscure the single underlying problem. The model of dependencies adopted very often presumes that, when a support feature fails, all the elements that depend on this feature will also fail. Very often the strategy adopted in the correlation demands the construction of a model of the managed distributed system. The simplifications adopted in this model may render some elements of the managed system ”invisible” to the correlation process. This allows a fault that occurred in an ”invisible” system element to appear as the occurrence of a fault in another system element.
4. Fault management in large distributed systems should be performed in a distributed fashion [BCF+95] [KBC95] [YKM+96]. Distributed systems are becoming more and more advanced in terms of their size, complexity, speed, and level of heterogeneity. It would be computationally prohibitive to process the large volumes of information necessary to perform fault localization in such systems. It is also impractical to assume that the fault localization process has access to the information of the entire system. Many researchers [BCF+95] [KBC95] [YKM+96] have concluded that the fault localization process in large distributed systems should be performed in a distributed fashion by a group of event management nodes, which divide data and processing complexity among them. Each of the managers governs a subset of system hardware and software components within boundaries marked by protocol layers or network domains. Errors propagate horizontally, between peer devices within the same layer, and vertically, from upper layers to lower layers and vice versa between related services [Wan89]. They may cross the boundaries of management domains. As a result, the fault management system may be provided with indications of faults that did not happen in its management domain or be unable to detect all symptoms of faults existing in its management domain [KBC95] [Wan89].
5. Fault management has to take into account temporal factors and dynamic changes in distributed systems [DKX+04]. In real-life distributed systems dynamic changes are unavoidable because of the evolution of hardware, software and distributed applications. Static management strategies sometimes cannot work efficiently in dynamic environments. Distributed dynamic systems are apt to become unmanageable if the management system cannot identify and deal with dynamic changes in due time. In dynamic environments, the changes are pertinent to a time factor. Fault management should take the temporal factor of dynamic changes into account in order to understand the changes and to capture their trend. Hence a management system for distributed systems has to live with dynamic updates of the distributed system and to carry out its management tasks even in dynamic environments.
Each of the tasks listed above presents challenges in fault management. Solving the whole ensemble of problems efficiently remains an open research area. These problems make it hard for a human system administrator to manage and understand all of the tasks that go into the smooth operation of the distributed system.
The skills learned from any one distributed system may prove insufficient for managing a different distributed system, thus making it difficult to generalize the knowledge gained from any given distributed system.

2.3 Relevant Research in Fault Management for Distributed Systems

In the past, numerous paradigms were proposed upon which fault management was based. These paradigms derive from different areas of computer science, including expert systems, graph theory, finite-state machines and probabilistic reasoning. In Figure 2.1, a classification of the existing solutions is presented. These solutions include techniques derived from the fields of expert systems (rule-based, case-based, and model-based reasoning tools), finite-state machines, graph-theoretic techniques, model-traversing techniques, and the probabilistic approach.

Fault management techniques (cf. Figure 2.1):
– Expert systems techniques: rule-based systems, case-based systems, model-based systems
– Finite-state machines
– Graph-theoretic techniques: causality graphs, dependency graphs, decision trees
– Model traversing techniques
– Probabilistic models: fuzzy logic, Bayesian networks

Figure 2.1: Classification of Fault Management Technologies.
In the past, fault diagnosis and recovery were based entirely on the human operator's expertise, which has proved difficult to acquire and maintain. Hence, automation of fault management of large distributed systems is a natural goal. The complexity of fault management makes the application of knowledge-based expert systems very appealing. Several experimental management systems use expert systems techniques for the purpose of fault management. Some examples are presented in [FGT91]. Although the effort in the area is still growing, most of the developed expert fault management systems are built on an ad-hoc and unstructured basis to simply transfer the knowledge of the human expert into an automated system. However, to face future challenges, a theoretical foundation for fault management must be established to bridge the gap between working systems and research and to provide a general structured model easily adaptable to future distributed systems. Researchers investigating the problem of fault management have applied a variety of techniques that span many fields. Section 2.3.1 covers the area of expert systems, which is part of artificial intelligence. Section 2.3.2 covers the area of finite-state machines for fault management. Section 2.3.3 covers the area of graph theory, in particular causality graphs and dependency graphs for fault diagnosis. Section 2.3.4 covers probabilistic models for fault management. Finally, Section 2.3.5 covers model traversing techniques in fault localization.

2.3.1 Expert Systems

The most widely used techniques in fault management are based on expert systems [PMM89]. The term ”expert system” refers to a system that uses contemporary technology to store and interpret the knowledge and experience of a human expert, sometimes several experts, in a specific area of interest. By accessing this computer-based knowledge, an individual is able to get the benefit of ”expert advice” about that particular area [LG91]. Figure 2.2 below is a high-level diagram of the components of an expert system.

(Figure 2.2 depicts the following components and roles: User Interface, Inference Engine, Knowledge Base, Knowledge Acquisition Facility, Domain Expert, Knowledge Engineer.)

Figure 2.2: Expert System Functional Diagram.
Expert systems try to reflect the actions of a human expert when solving problems in a particular domain. The knowledge base is where the knowledge of human experts in a specific field or task is represented and stored. It contains a set of rules or cases with the knowledge about a specific task, that is, an instance of that class of problems. The inference engine of an expert system contains the strategy to solve a given class of problems using, e.g., the rules in a particular sequence. It is usually set up to mimic the reasoning or problem-solving ability that the human expert would use to arrive at a conclusion. Rule-based, case-based and model-based approaches are popular expert system approaches for fault management in distributed systems.
• Rule-based systems [Lie88] [Nyg95].
Most expert systems use a rule-based representation of their knowledge base. In the rule-based approach, the general knowledge of a certain area is contained in a set of rules

and the specific knowledge, relevant for a particular situation, consists of facts, expressed through assertions and stored in a database. There are two operation modes in a rule-based system. One is the forward mode, which departs from an initial state and constructs a sequence of steps that leads to the solution of the problem (the ”goal”). When it comes to a fault diagnosis system, the rules would be applied to a database containing all the alarms received, until a termination condition involving one fault is reached. The other is the backward mode, which starts from a configuration corresponding to the solution of the problem and constructs a sequence of steps that leads to a configuration corresponding to the initial state. The same set of rules may be used for the two operation modes [Ric83]. In the domain of fault localization, the inference engine usually uses a forward-chaining inference mechanism, which operates in a sequence of rule-firing cycles. In each cycle the system chooses rules for execution whose antecedents (conditions) match the content of the working memory. (A minimal sketch of such a forward-chaining cycle is given below.) Expert systems applied to the fault localization problem differ with respect to the structure of the knowledge they use. The approaches that rely solely on surface knowledge are referred to as rule-based reasoning systems. The research on rule-based fault localization systems addresses the structure of the knowledge base and the design of the rule-definition language. [Lor93] organizes the system of rules by distinguishing between core and customized knowledge. The core knowledge may be understood as generic or reusable knowledge. It is useful for identifying an approximate location of a fault in a large distributed system. Rule-based systems, which rely solely on surface knowledge, do not require a profound understanding of the underlying system architectural and operational principles and, for small systems, may provide a powerful tool for eliminating the least likely hypotheses [Kat96]. However, rule-based systems possess a number of disadvantages that limit their usability for fault isolation in more complex systems. The downsides of rule-based systems include the inability to learn from experience, the inability to deal with unseen problems, and the difficulty of updating the system knowledge [Lew93]. Rule-based systems are difficult to maintain because the rules frequently contain hard-coded distributed system configuration information. Although approaches have been proposed to automatically derive correlation rules based on the observation of statistical data [KMT99], it is still necessary to regenerate a large portion of the correlation rules when the system configuration changes. Rule-based systems are also inefficient and unable to deal with inaccurate information [SMA01]. The lack of structure in the system of rules typically makes it very difficult to allow the reusability of rules that seems so intuitive in hierarchically built distributed systems. Another problem is that rule-based systems get convoluted if timing constraints are included in the reasoning process. Also, rule interactions may result in unwanted side-effects, and are difficult to verify and change [WBE+98]. [KK98] extends a production rule interpreter to enable a systematic prediction of the effects of policy executions and to allow for a better impact analysis in case of policy changes. Rule-based systems rely heavily on the expertise of the system manager.
The rules depend on the prior knowledge about the fault conditions on the distributed system, and do not adapt well to the evolving distributed system environment [FKW96]. Thus it is possible that entirely new faults may escape detection. Furthermore, even for a stable distributed system there are no guarantees that an exhaustive database has been created.
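The forward-chaining cycle described above can be sketched in a few lines of Python. The rules and facts below are invented examples (a real rule language and working memory are far richer); the sketch only shows rules firing repeatedly until no antecedent set matches any new fact.

# Each rule: (name, set of antecedent facts, fact added when the rule fires).
rules = [
    ("link-down",   {"no ping to router", "interface alarm"}, "link failure"),
    ("server-down", {"no ping to server", "ping to router ok"}, "server failure"),
    ("root-cause",  {"link failure"}, "dispatch technician to link"),
]

def forward_chain(facts, rules):
    # Working memory starts with the observed alarms; rules fire in cycles
    # until no rule adds a new fact (a fixed point is reached).
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for name, antecedents, consequent in rules:
            if antecedents <= facts and consequent not in facts:
                facts.add(consequent)
                changed = True
    return facts

observed = {"no ping to router", "interface alarm"}
print(forward_chain(observed, rules))
# -> also contains 'link failure' and 'dispatch technician to link'

The sketch also makes the maintenance problem visible: the alarm names and topology assumptions are hard-coded into the rules, so any configuration change forces the rules to be rewritten.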


[WEW99] introduced a system called ANSWER (Automatic Network Surveillance with Expert Rules), an expert system used in monitoring and maintaining the 4ESS switches in the AT&T long distance network. The knowledge base uses object-oriented programming (C++) to model the 4ESS switch, leading to a flexible and efficient design, i.e., only components with abnormal activities get instantiated, thus realizing reductions in time and space. The knowledge base interacts with the actual switch in two ways: 1) it receives events as input; and 2) it issues commands (e.g., to request diagnostics to be run) as output. The rule-based component of the system has been implemented using an extension to the C++ language called R++. This has led to a tight integration of the knowledge base with the rest of the system. [She96] presents the HP OpenView Event Correlation Service (ECS). This package is a commercial product integrated into the HP OpenView Distributed Management (DM) postmaster. It consists of two components: the ECS Designer and the ECS engine. The ECS Designer is a graphical user interface (GUI) development environment where correlation rules can be developed interactively by selecting, connecting, and configuring logical processing blocks (nodes). The process of combining different nodes creates a correlation circuit where events flow from a source node through the path(s) of the defined circuit and exit through a sink node. The ECS engine is a run-time correlation engine. It executes a set of downloaded correlation rules that control the processing of event streams.
• Case-based systems [LD93] [DV95] [MT99].
As an alternative to the rule-based approach, some researchers propose a technique called case-based reasoning (CBR) [Sla91] [WTM95] [MT99]. Here, the basic unit of knowledge is not a rule but a case. Cases contain records of the most relevant characteristics of past episodes and are stored, retrieved, adapted and utilized in the solution of new problems. The experience obtained from the solution of these new problems constitutes new cases, which are added to the database for future use. Thus, the system is able to acquire knowledge through its own means, and it is not necessary to interview human experts. Another relevant characteristic of CBR systems is their ability to modify their future behavior in response to current mistakes. In addition, a case-based system may build solutions to previously unseen problems through the adaptation of past cases to new situations. Since the development of CBR systems started in the 1980s, several challenges have stimulated the researchers' creativity: how to represent the cases; how to index them to allow their retrieval when necessary; how to adapt an old case to a new situation to generate an original solution; how to test a proposed solution and identify it as either a success or a failure; and how to explain and repair the failure of a suggested solution in order to generate a new proposal. The problem of case adaptation is studied in [LD93]. It describes a technique named parameterized adaptation, which is based on the existence, in a trouble ticket, of a certain relationship between the variables that describe a problem and the variables that specify the corresponding solution. A CBR system takes into account the parameters of this relationship when proposing a solution for the case under analysis.
To represent the parameters, the use of linguistic variables (i.e., the ones that assume linguistic values, instead of numeric values) and the provision of functions is proposed, so that the parameters’ numeric values are translated into grades of membership in a fuzzy set. To store and retrieve the knowledge on the solution of past problems, [DV95] presents the concept of master ticket, which contains a generalization of information on the faults

instead of information on a single fault (case). Thus, when a master ticket is retrieved, it must be instantiated before the information it contains may be applied to a particular case. The goal of this procedure is to facilitate access to the information and to the addresses of the nodes involved. Instantiating a master ticket consists of substituting the real values of the case under consideration for its parameters.
• Model-based systems [Gru98] [KG97] [KYY+95].
Model-based reasoning (MBR) is a paradigm that arose from artificial intelligence and has several applications in alarm correlation. MBR represents a system through a structural model and a functional model, whereas the rules in traditional rule-based systems are based on empirical associations. In a management system for a distributed system, the structural representation includes a description of the distributed system elements and of the topology (i.e., connectivity and containment). The representation of functional behavior describes the processes of event propagation and event correlation [JW95]. Model-based reasoning systems play an important role in device-level fault management and in the execution of escalation rules or fault management policies, such as which trouble tickets have to be created under which circumstances. In this role the expert system serves as an integrator between the different fault management techniques. In model-based expert systems, conditions associated with the rules usually include predicates referring to the system model. The predicates test the existence of a relationship among system components. The model is usually defined by an object-oriented paradigm [DSW+91] [FJM+98] [HS91] [JW93] [SZ88] and it frequently has the form of a graph of dependencies among system components. A different model is proposed in SINERGIA [BBM+93], which represents structural knowledge as a set of network topology templates selected in such a way that any distributed system topology may be expressed as instances of these templates. All possible alarm patterns for each template are listed along with fault diagnosis functions.

2.3.2 Finite-State Machines

A finite-state machine (FSM) is an abstract model describing a synchronous sequential machine and its spatial counterpart, the iterative network [Koh78]. In [WS93a] the authors address the problem of fault detection, which is the first step in fault management. Given a system that can be modeled by an FSM, its behavior can be described by a word w, which is a concatenation of events. The fault detection problem can then be defined in terms of the difference between normal and faulty words. Faults are modeled as abnormal words that violate the specification. A system under observation is given by an FSM, i.e., a 4-tuple G = (Σ, Q, δ, ϱ), where Σ is the set of all possible events (the alphabet), Q is a finite nonempty set of states, δ : Q × Σ → Q is the state transition function, and ϱ ⊆ Q is the set of initial states of the FSM. L(G) denotes the language generated by the FSM G. The goal of [WS93a] is to build an observer, i.e., a machine A that recognizes faulty words violating the specification. Since L(G) is the set of correct words generated by G, its complement denotes the set of faulty words. There are several objectives in the design of fault detection systems. One is to minimize the information fed to the observer to detect a given class of faults. Another is to optimize the structure of the observer. However, it is inevitable that not all faults can be detected if the observer is simpler than the original system. In [BHS93] the authors investigate the problem of fault identification in communication networks. Fault identification is the process of analyzing the data or output from the malfunctioning component in order to propose possible hypotheses for the faults. Therefore, this work


assumes that fault localization methods have already isolated a fault in a single process of a communication system. The faulty process is modeled as an FSM, and the possible faults are represented as additions and changes to the arcs in that FSM. The trace history of the faulty process is used in order to propose possible hypotheses about multiple faults. The trace history is assumed to be partially observed (i.e., it may include deletions, additions, and changes of events). Therefore, the problem becomes one of inferring finite-state structures from an unreliable and partially observed trace history. [RH95] addresses the problem of alarm correlation using probabilistic FSMs (PFSMs). A fault is modeled as a PFSM so that the possible output sequences correspond to the possible alarm sequences resulting from the fault. The data model includes a noise description, which allows the machine to handle noisy data. The approach is divided into two phases: a learning phase, which acquires the model of each fault from possibly incomplete or incorrect history data, and a correlation phase, which uses those models to interpret the online data and identify faults. Both phases call for heuristic algorithms. This approach can be viewed as a centralized approach where the distributed system is considered a single entity. Here, the events are collected in a centralized location and then correlated with each other. This approach has difficulty in scaling up to large distributed systems.
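A toy version of the observer idea from [WS93a] can be written as follows. The protocol alphabet and transition relation are invented for illustration; the observer simply replays an event sequence on the specification FSM and flags as faulty any word that leaves the specified language.

# Specification FSM G = (Sigma, Q, delta, initial) for a toy request/response protocol.
SIGMA = {"request", "response", "timeout"}
DELTA = {                                   # delta: (state, event) -> next state
    ("idle", "request"): "waiting",
    ("waiting", "response"): "idle",
    ("waiting", "timeout"): "idle",
}
INITIAL = "idle"

def is_faulty(word, delta=DELTA, initial=INITIAL):
    # The observer replays the word on the specification; a missing transition
    # means the word is not in L(G), i.e. it violates the specification.
    state = initial
    for event in word:
        if (state, event) not in delta:
            return True
        state = delta[(state, event)]
    return False

print(is_faulty(["request", "response", "request", "timeout"]))  # False: normal behavior
print(is_faulty(["request", "request"]))                         # True: second request violates spec
print(is_faulty(["response"]))                                   # True: response without request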

2.3.3 Graph-theoretic Techniques

Graph-theoretic techniques rely on a graph model of the system, called a Fault Propagation Model (FPM), which describes which symptoms may be observed if a specific fault occurs [KP97]. The FPM represents all the faults and symptoms that occur in the system. The observed symptoms are mapped into FPM nodes. The fault localization algorithm analyzes the FPM to identify the best explanation for the observed symptoms. Graph-theoretic techniques require an a priori specification of how a failure condition or alarm in one component is related to failure conditions or alarms in other components [HCF95]. The creation of the FPM requires an accurate knowledge of the current dependencies among abstract and physical system components. The efficiency and accuracy of the fault localization algorithm depend on the accuracy of this a priori specification. The FPM takes the form of a causality or dependency graph.
• Causality Graph Model [YKM+96]
Distributed systems management consists mainly of monitoring, interpreting and handling events. In management, an event is defined as an exceptional condition in the operation of distributed systems. Some problems are directly observable, while others can only be observed indirectly from their symptoms. Symptoms are defined as the observable events. However, a symptom cannot be directly handled; instead, its root cause problem needs to be handled to make it go away. Relationships are essential components of the correlation, because problems and symptoms propagate from one object to another via relationships. A natural candidate for representing the problem domain is the causality graph model. Causality induces a partial order relationship between events. A causality graph is a directed acyclic graph Gc (E, C) whose nodes E correspond to events and whose edges C describe cause-effect relationships between events. An edge (ei , ej ) ∈ C represents the fact that event ei causes event ej , which is denoted as ei → ej [HSV99]. The nodes of a causality graph may be marked as problems or symptoms. Some nodes are neither problems nor symptoms, while others may be marked as problems and

symptoms at the same time. Causality graph edges may be labeled with a probability of the causal implication. Similarly, it is possible to assign a probability of independent occurrence to all nodes labelled as problems.


Figure 2.3: (a) Causality Graph. (b) Labelling of the Graph.
For example, Figure 2.3(a) depicts the causality graph of a distributed system consisting of 7 nodes. Certain symptoms are caused directly by a particular problem or indirectly by other symptoms; for instance, symptom 7 in Figure 2.3(a) is generated by node 1 or by node 6. To proceed with correlation analysis, it is necessary to identify the nodes in the causality graph corresponding to symptoms and those corresponding to problems. A problem is an event that may require handling, while a symptom (alarm) is an event that is observable. Nodes of a causality graph may be marked as problems (P) or symptoms (S) as in Figure 2.3(b).
• Dependency Graph Model
A dependency graph is a directed graph G = (O, D), where O is a finite, non-empty set of objects and D is a set of edges between the objects. Each object may be associated with a probability of its failure independent of other objects. The directed edge (oi , oj ) ∈ D denotes the fact that an error or fault in oi may cause an error in oj . Every directed edge is labeled with the conditional probability that the object at the end of the edge fails, provided that the object at the beginning of the edge fails [Kat96] [KS95]. Note that the presented dependency graph models all possible dependencies between managed objects in distributed systems. In reality the graph could be reduced based on information on currently open connections. Many approaches using dependency graph models assume that an object may fail in only one way. If this is the case in a real system, then a failure of an object may be represented as an event. In this case, the two representations, causality graph and dependency graph, are equivalent. When multiple failures may be associated with a single object, they can typically be enumerated into a small set of failure modes, such as complete failure, abnormal transmission delay, high packet loss rate and so on. The dependency graph then associates multiple failure modes with each object, whereas dependency edges between objects are weighted with probability matrices rather than with single probability values, where each matrix cell indicates the probability with which a particular failure of an antecedent object causes a particular failure of a dependent object [SS02]. The dependency graph may still be mapped into a causality graph by creating a separate causality


graph node for each object and each of its failure modes, and then connecting the nodes accordingly. While the use of a dependency graph as a system model has many advantages (e.g., it is more natural and easier to build), a causality graph is more suitable for the task of fault localization as it provides a more detailed view of the system and is able to deal with a simple notion of an event rather than potential multi-state system objects. A dependency graph provides generality because different types of distributed systems can be easily modeled. It is flexible in the sense that we can easily modify the representation as the active picture of the observed system changes. It is simple because it gives a manageable model and it has the necessary complexity to represent real systems. It also has the property of similarity, which means that similar systems have similar representations. For example, if we add another entity in the graph, the representation changes only slightly. Most graph-theoretic techniques reported in the literature allow the FPM to be nondeterministic by modeling prior and conditional failure probabilities. However, many of these techniques require the probability models to be restricted to canonical models such as OR and AND models [KS95] [KYY+95] [SS02]. An OR model combines possible causes of an event using logical operator OR, meaning that at least one of the possible causes has to exist for the considered event to occur. An AND model uses the logical operator AND, instead. Some techniques may be extended to work with hybrid models that allow the relationship between an event and its possible causes to have the form of an arbitrary logical expression.
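The following Python sketch illustrates how a fault propagation model of the kind just described can be represented and queried. The graph, the prior and conditional probabilities, and the scoring rule (prior times the probability of each observed symptom given the cause, with unexplained symptoms scoring zero) are all invented for illustration; this is a naive ranking heuristic, not the algorithms of [KS95] or [SS02].

# A toy fault propagation model: directed edges "cause -> symptom" with the
# probability that the cause produces the symptom (all numbers are invented).
FPM = {
    "router-fault":  {"link-alarm": 0.9, "client-timeout": 0.8},
    "server-fault":  {"client-timeout": 0.9, "http-500": 0.7},
    "overload":      {"client-timeout": 0.6},
}
PRIOR = {"router-fault": 0.01, "server-fault": 0.02, "overload": 0.05}

def rank_causes(observed, fpm=FPM, prior=PRIOR):
    # Score each candidate problem by prior * product of the probabilities of
    # the observed symptoms it can cause (symptoms it cannot cause score zero).
    scores = {}
    for cause, effects in fpm.items():
        score = prior[cause]
        for symptom in observed:
            score *= effects.get(symptom, 0.0)
        scores[cause] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_causes({"client-timeout", "http-500"}))
# server-fault ranks highest, since it is the only cause explaining both symptoms.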

2.3.4 Probabilistic Models

Due to the complexity of managed distributed systems, it is not always possible to build precise models in which it is evident that the occurrence of a given set of alarms indicates a fault in a given element (object) [Din07]. The knowledge of the cause-effect relations among faults and alarms is generally incomplete. In addition, some of the alarms generated by a fault are frequently not made available to the correlation system in due time because of losses or delays on the route from the system element that originates them. Finally, due to the fact that the configuration frequently changes, the more detailed a model is, the faster it will become outdated. The imprecision of the information supplied by specialists very often causes great difficulties. Expressions such as ”very high”, ”normal” and ”sometimes” are inherently imprecise and may not be directly incorporated into the knowledge base of a conventional rule-based system. Fuzzy logic and Bayesian networks are popular models for probabilistic fault management.
• Fuzzy Logic
Fuzzy logic provides one way to deal with uncertainty and imprecision, which characterize some applications of distributed systems management. Fuzzy logic is related to traditional logic, multi-valued logic, probability theory and probabilistic logic, which it contains as special cases. The basic concept underlying fuzzy logic is the fuzzy set. Conventional binary logic is based on two possible values: true or false. In a fuzzy set, an element has a certain grade of membership, which may assume any value between 0 (when the element definitely does not belong to the set) and 1 (when the element certainly is a member of the set). The concept of a fuzzy set brings in the novelty that any given proposition does not have to be only true or false, but may be partially true to a degree in the interval between 0 and 1.

Although it is possible to empirically attest that a given fuzzy logic system operates according to what is expected, there are still no tools that allow one to prove this [MK94b]. Fuzzy expert systems allow the rules to be formulated directly using linguistic variables such as ”very high” or ”normal”, which rather simplifies the development of the system. Some applications of fuzzy systems have been implemented in fault management [CH96] [AD99]. [AD99] proposes a fuzzy-logic model for the temporal reasoning phase of the alarm correlation process in network fault management. [CH96] presents a fuzzy expert system to simplify management within communication networks. Some researchers argue that all problems that can be solved by means of fuzzy logic may be equally well solved by means of probabilistic models like, for example, Bayesian networks (see Section 3.2), which have the advantage of relying on a solid mathematical basis.
• Bayesian Networks
A Bayesian network, also called a belief network or a causal network, is a model for reasoning with probabilities. It is a directed acyclic graph (DAG) whose nodes are random variables and in which certain independence assumptions hold [Pea88] [Nea90] [Cha91]. A more detailed study of Bayesian networks is presented in Section 3.2. To give an example, consider a client-server communication problem where client stations set up connections with a server station via an Ethernet. Suppose a user complains that his client station cannot access the server. This access failure (AF) might be due to a server fault (SF), network congestion (NC), or a link fault (LF). Network congestion is usually caused by high network load (HL) and evidenced by low network throughput, long packet delays and even losses. We also assume that a link fault (LF) will sometimes trigger the link alarm (LA). This example is illustrated in Figure 2.4.


Figure 2.4: An Example of a BN in Distributed Systems Management.
Assume for simplicity that all the random variables are binary valued. For a random variable, say HL, we use hl to denote that the value is true and ¬hl to denote that the value is false. If the systems manager receives a user complaint that the access has failed (af ), and the link alarm (la) was issued, she can calculate the conditional probabilities of a link fault p(lf |af, la), network congestion p(nc|af, la) and a server fault p(sf |af, la), respectively.
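The computation can be made concrete with a small Python sketch that performs inference by enumeration over the network of Figure 2.4. All conditional probability values below are invented purely to make the calculation runnable; they are not taken from this thesis or from any measurement.

from itertools import product

# Assumed conditional probability tables for the network of Figure 2.4.
P_HL = 0.10                                   # P(high load)
P_LF = 0.02                                   # P(link fault)
P_SF = 0.05                                   # P(server fault)
P_NC = {True: 0.70, False: 0.05}              # P(congestion | high load)
P_LA = {True: 0.90, False: 0.01}              # P(link alarm | link fault)

def p_af(nc, lf, sf):
    # Assumed table for P(access failure | NC, LF, SF).
    if lf or sf:
        return 0.95
    return 0.60 if nc else 0.01

def joint(hl, nc, lf, sf, la, af):
    def b(p, v):                              # P(var = v) when P(var = True) = p
        return p if v else 1.0 - p
    return (b(P_HL, hl) * b(P_NC[hl], nc) * b(P_LF, lf) * b(P_SF, sf)
            * b(P_LA[lf], la) * b(p_af(nc, lf, sf), af))

# p(lf | af, la): sum the joint distribution over the hidden variables.
num = den = 0.0
for hl, nc, lf, sf in product([True, False], repeat=4):
    p = joint(hl, nc, lf, sf, la=True, af=True)
    den += p
    if lf:
        num += p
print(f"p(lf | af, la) = {num / den:.3f}")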


We can see that the basic computation in Bayesian networks is the computation of every node's belief given all the evidence that has been observed so far. In general, such computation is NP-hard [Coo90]. Some applications of Bayesian networks for fault management have been reported. However, they are limited to rather specific fault diagnosis problems, which use simplified Bayesian network models. [DLW93] provides a probabilistic approach to fault diagnosis in linear lightwave networks based on a Bayesian network model. [WS93b] uses Bayesian networks for diagnosing connectivity problems in communication systems. [HJ97a] utilizes Bayesian network models to accomplish early detection by recognizing deviations from normal behavior in each of the measurement variables in different network layers. However, this approach can only deal with abnormality detection in a single node and a single network layer in distributed systems. It is hard to extend to other layers (such as the physical or application layers) and cannot deal with multiple nodes in distributed systems, which are involved when faults propagate.

2.3.5 Model Traversing Techniques

Model traversing techniques use a formal representation of a distributed system with clearly marked relationships among system entities [Gru98] [HCF95] [JP93] [Kat96] [KG97] [KP97]. By exploring these relationships, starting from the system entity that reported an alarm, the fault identification process is able to determine which alarms are correlated and to locate faulty elements of the distributed system. Model traversing techniques reported in the literature use an object-oriented representation of the system [HCF95]. One approach [JP93] exploits the OSI management framework. The approach described in [Kat96] uses the Guidelines for the Definition of Managed Objects (GDMO) (with non-standard extensions) to model services and dependencies between services in a distributed system. The proposed refinements of this model include the possibility of testing operational and quality-of-service states of managed services [KG97]. During the model traversal, managed objects are tested to determine their operational status. The root cause is found when the currently explored malfunctioning object does not depend on any other malfunctioning object. In multi-layer models, first a horizontal search is performed in the layer in which the failure has been reported [KP97]. When a failing component is located, a vertical search carries the fault localization process to the next layer down. In the lower layer the horizontal search is started again. In NetFACT [HCF95], the fault localization process is performed in two phases. Firstly, in the horizontal search, votes are assigned to the potentially faulty elements based on the number of symptoms pertaining to these elements. Secondly, in a tree search, the root cause is determined. This search determines whether the device that received the most votes in the first step is itself at fault or whether it fails because one of the components it depends upon is faulty. Model traversing techniques are robust against the frequent configuration changes in distributed systems [KP97]. They are particularly attractive when the automatic testing of a managed object may be done as a part of the fault localization process. Model traversing techniques seem natural when relationships between objects are graph-like and easy to obtain. These models naturally enable the design of distributed fault localization algorithms. Their strengths are high performance, potential parallelism and robustness against changes in the distributed system. Their disadvantage is a lack of flexibility, especially if fault propagation is complex and not well-structured. In particular, they are unable to model situations in which the failure of a device may depend on a logical combination of other device failures [HCF95]. In conclusion, there is not a unique solution that is the ”best”, in terms of precision and


complexity, to solve a generic problem in fault management. In fact, a comprehensive solution to the problem may require combinations of different approaches in complex distributed fault management.
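Before moving on, a minimal sketch may help make the model-traversing search of Section 2.3.5 concrete. The managed objects, the dependency model and the simulated test below are all invented; the code only illustrates the traversal rule that the root cause is a malfunctioning object none of whose own dependencies is malfunctioning.

# Toy managed-object model: each object lists the objects it depends on,
# and a (simulated) test tells us whether an object is currently operational.
DEPENDS_ON = {
    "web-service": ["app-server"],
    "app-server":  ["database", "switch-1"],
    "database":    ["switch-1"],
    "switch-1":    [],
}
FAULTY = {"switch-1", "database", "app-server", "web-service"}  # simulated test results

def test_ok(obj):
    return obj not in FAULTY

def find_root_cause(alarming_object):
    # Traverse the dependency model from the object that raised the alarm:
    # stop at a malfunctioning object whose dependencies all test as operational.
    current = alarming_object
    while True:
        bad_antecedents = [d for d in DEPENDS_ON[current] if not test_ok(d)]
        if not bad_antecedents:
            return current
        current = bad_antecedents[0]     # follow one malfunctioning dependency

print(find_root_cause("web-service"))    # -> switch-1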


Chapter 3 Probability Reasoning and Bayesian Networks
Certainty refers to knowledge in a particular domain of expertise that is secure and reliable. Certain knowledge can be expressed simply as conditional rules of the form: ”If A, then B”. Uncertainty arises when certainty is either impossible, unnecessary or insufficient. As more and more knowledge-based systems are being developed for a large variety of problems, it becomes apparent that the knowledge required to solve these problems is often not precisely defined; it is of an imprecise nature instead. In fact, many real-life problem domains are fraught with uncertainty. Human experts in these domains are typically able to form judgments and make decisions based on uncertain, incomplete, and sometimes even contradictory information. To be of practical use, a knowledge-based system must deal with such information at least equally well. Uncertainty is not only an integral part of human decision-making, but also a fundamental element of the world. Reasoning under uncertainty is a common issue in our daily life and a central one in Artificial Intelligence (AI). Humans can handle uncertain situations mostly with ease, but computers cannot. Many techniques, such as rule-based reasoning, fuzzy logic, probability theory, artificial neural networks (ANN), and various heuristic search methods in AI have been investigated to solve the problem of reasoning under uncertainty. However, the decision mechanisms of the human brain remain undiscovered. Thus it is difficult to implement uncertainty reasoning efficiently with current AI techniques. Probability theory provides a sound mathematical basis for reasoning under uncertainty. It has a number of nice properties. It allows one to envision multiple possible states of the world with different degrees of likelihood. It allows one to incorporate multiple pieces of evidence obtained from various sensors in a coherent way, using the principle of Bayesian conditioning. In combination with decision theory, it allows one to make optimal decisions that lead to the maximum expected utility. Also, statistics provides excellent tools for learning probabilistic models from data.

3.1 Background of Probability Theory

As probability theory is a mathematically well-founded theory about uncertainty, it is not surprising that this theory takes a prominent place in research on reasoning with uncertainty in knowledge-based systems. Some basic concepts of probability theory are defined as follows.


3.1.1 Probability Calculus

• Basic axioms
The probability P (a) of an event (random variable) a is a number in the unit interval [0,1].
(i) P (a) = 1 if and only if a is certain.
(ii) If a and b are mutually exclusive, then P ({a} ∪ {b}) = P (a) + P (b).
• Conditional probabilities
A conditional probability statement is of the following kind: ”Given the event b, the probability of the event a is x.” The notation for the preceding statement is P (a|b) = x. It should be stressed that P (a|b) = x does not mean that whenever b is true, the probability of a is x. It means that if b is true, and everything else known is irrelevant for a, then the probability of a is x (”everything else” may be separated from a given b).
• The fundamental rule
The fundamental rule of probability calculus, stated informally, is
P (a|b)P (b) = P (b|a)P (a) = P (a, b)

(3.1)

where P (a, b) is the probability of the joint event a ∩ b. The fundamental rule says that the probability of a joint event (a, b) is equal to the product of the probability of one event (a or b) and the probability of the other event conditioned on it (P (b|a) or P (a|b)), regardless of the order of the events. Remembering that probabilities should always be conditioned by a context c, the formula should read
P (a|b, c)P (b|c) = P (a, b|c)

(3.2)

For readability, P (a|b, c) is shorthand for P (a|{b, c}). If A is a variable with states a1 , . . . , an , then P (A) is a probability distribution over these states:
P (A) = (x1 , . . . , xn ),   xi ≥ 0,   ∑_{i=1}^{n} xi = 1
(3.3)

where xi is the probability of A being in state ai . Note that if A and B are variables, then P (A, B) is a table of probabilities P (ai , bj ) for the possible pairs of states of A and B. From a table P (A, B) the probability distribution P (A) can be calculated. Let ai be a state of A. There are exactly m different events for which A is in state ai , namely the mutually exclusive events (ai , b1 ), . . . , (ai , bm ). Therefore P (ai ) = P (ai , b1 ) + P (ai , b2 ) + . . . + P (ai , bm ).

(3.4)

This calculation is called marginalization, and we say that the variable B is marginalized out of P (A, B) (resulting in P (A)). The notation is
P (A) = ∑_B P (A, B)
(3.5)


• Bayes' Theorem
From Equation 3.1 it follows that P (a|b)P (b) = P (a, b) = P (b|a)P (a); assuming P (a) ≠ 0, this yields the well-known Bayes' rule in symbols:
P (b|a) = P (a|b)P (b) / P (a)

(3.6)

or mathematically, Bayes' rule states
posterior = (likelihood × prior) / (marginal likelihood)

(3.7)

Bayes’ Theorem, developed by Rev. Thomas Bayes, an 18th century mathematician and theologian, was first published in 1763. Mathematically it is expressed as: P (H|E, c) =

P (E|H, c)P (H|c) P (E|c)

(3.8)

where we can update our belief in hypothesis H given the additional evidence E and the background context c. The left-hand term, P (H|E, c), is known as the ”posterior probability”, or the probability of H after considering the effect of E on c. The term P (H|c) is called the ”prior probability” of H given c alone. The term P (E|H, c) is called the ”likelihood” and gives the probability of the evidence assuming that hypothesis H and the background information c are true. Finally, the last term P (E|c) is independent of H and can be regarded as a normalizing or scaling factor. The heart of Bayesian inference lies in the famous inversion formula
P (H|E) = P (E|H)P (H) / P (E)

(3.9)

which states that the belief we accord a hypothesis H upon obtaining evidence E can be computed by multiplying our previous belief P (H) by the likelihood P (E|H) that E will materialize if H is true. This P (H|E) is sometimes called the posterior probability (or simply posterior), and P (H) is called the prior probability (or prior). The denominator P (E) of Equation 3.9 hardly enters into consideration because it is merely a normalizing constant P (E) = P (E|H)P (H) + P (E|H)P (H), which can be computed by requiring that P (H|E) and P (H|E) sum up to unity 1. • Likelihood Sometimes P (a|b) is called the likelihood of b given a, and it is denoted L(b|a). The reason for this is the following: Assume b1 , . . . , bn are possible scenarios with an effect on the event a, and we know a. Then, P (a|bi ) is a measure of how likely it is that bi is the cause. In particular, if all bi ’s have the same prior probability, Bayes’ rule yields P (bi |a) =

P (a|bi )P (bi ) = kP (a|bi ) P (a)

(3.10)

where k is independent of i.

• Conditional independence
Variables A and C are independent given the variable B if

P(ai|bj) = P(ai|bj, ck)    (3.11)

for all i, j, k. This means that if the state of B is known, then no knowledge of C will alter the probability of A. (We will write P(A|B) = P(A|B, C), although the two tables do not have the same dimensions.) Remark: if the condition B is empty, we simply say that A and C are independent. Equation 3.11 may look asymmetric; however, if it holds, then by the conditioned version of Bayes' rule we get

P(C|B, A) = P(A|C, B)P(C|B) / P(A|B) = P(A|B)P(C|B) / P(A|B) = P(C|B)    (3.12)

The proof requires that P(A|B) > 0; that is, for events A = a and B = b with P(A = a|B = b) = 0, the calculation is not valid. However, for our considerations it does not matter: if B = b, then the evidence A = a is impossible and will never appear, so its transmission need not be considered.

• Joint Probability Distribution (JPD)
Let X be a set of random variables x1, x2, ..., xn. Each combination of the values of these variables defines a configuration or scenario. The number of possible configurations is given by the product m1 × m2 × ... × mn, where mi (i = 1, 2, ..., n) is the number of possible states of the variable xi. In the particular case in which x1, x2, ..., xn are binary variables, the number of possible configurations is 2^n. The joint probability distribution of X is defined as P(x1, x2, ..., xn) for all the possible configurations [Cha91]. For example, if X = {x1, x2}, where x1 and x2 are binary variables, the joint distribution contains the following probabilities, corresponding to the four possible configurations: P(x1, x2), P(x1, ¬x2), P(¬x1, x2), P(¬x1, ¬x2). The probabilities of a joint distribution are exhaustive and mutually exclusive. Therefore, the sum of all these probabilities must equal 1. Moreover, the probability associated with the occurrence of at least one of two possible configurations A and B is equal to the sum of the probabilities associated with A and B, that is, P(A ∪ B) = P(A) + P(B) and P(A ∩ B) = 0.
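To make these operations concrete, the following small Python sketch (not taken from the management data used later in this book; all numbers are invented for illustration) stores the joint distribution of two binary variables, Fault and Alarm, and computes a marginal probability by summing the joint (Equation 3.5) and a posterior via the fundamental rule (Equations 3.1 and 3.6):

# Joint distribution P(Fault, Alarm) over two binary variables;
# the four entries sum to 1 (illustrative values only).
joint = {
    (True,  True):  0.008,
    (True,  False): 0.002,
    (False, True):  0.049,
    (False, False): 0.941,
}

def marginal(var_index, value):
    """P(variable = value), obtained by marginalizing the other variable out (Eq. 3.5)."""
    return sum(p for states, p in joint.items() if states[var_index] == value)

def posterior_fault_given_alarm():
    """P(Fault = true | Alarm = true) = P(Fault, Alarm) / P(Alarm)  (Eqs. 3.1 and 3.6)."""
    return joint[(True, True)] / marginal(1, True)

print(marginal(0, True))              # P(Fault = true) = 0.010
print(posterior_fault_given_alarm())  # approximately 0.14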

3.1.2 Why Uncertain Reasoning?

Uncertainty arises in a great variety of situations and confuses system designers in a variety of ways. From the engineering view of the world, uncertainty arises when experts are uncertain about their own knowledge and when engineers have no exact knowledge of the environment in which their system is supposed to function; uncertainty is also inherent in the domains engineers normally act in, because their models often abstract away details or unknown facts of the real world that turn out to be relevant to their application later on.


As uncertainty permeates everywhere in the world, any model offering an explicit representation of it will be more accurate and realistic than one in which uncertainty is disregarded. By modelling uncertainties, certain structural and behavioral aspects of the target system become more visible and understandable, thereby enabling future development steps to be carried out more efficiently and effectively. Uncertainty comes from the following sources in engineering:

1. Observability: some aspects of the domain are often only partially observable and therefore must be estimated indirectly through observation. For example, in the management of distributed systems, managers often use their expert knowledge to infer hidden factors from the observed evidence when no exact or direct management information is available from the managed system model.

2. Event correlation: the relations among the main events are often uncertain. In particular, the relationship (dependency) between the observables and the non-observables is often uncertain. Therefore, the non-observables cannot be inferred from the observables with certainty. The uncertain relation prevents a reasoner from arriving at a unique domain state from observation, which accounts for the guesswork. For example, in a local distributed system an interrupted e-mail service can be caused by a fault in the e-mail server, the e-mail client, or a disconnection in the network, which makes it difficult to determine exactly the location of the root cause when the fault in the e-mail service is reported.

3. Imperfect observations: the observations themselves may be imprecise, ambiguous, vague, noisy, and unreliable. They introduce additional uncertainty into inferring the state of the domain. For complex systems in real life, it is difficult to obtain all the required information exactly; there is always noise and redundancy in the observations.

4. Interpretation: even though many relevant events are observable, very often we only understand them partially. Therefore, the state of the domain must be guessed based on incomplete information. Even though relations are certain in some domains, very often it is impractical to analyze all of them explicitly. Consequently, the state of the domain is estimated by computationally more efficient but less reliable means.

5. Reduction of complexity: another source of uncertainty comes from the most fundamental level of our universe. Heisenberg's uncertainty principle holds that "the more precisely the position is determined, the less precisely the momentum is known in this instant, and vice versa" [Hei27]. Therefore, uncertainty exists within the building blocks of our universe.

In the light of these factors and others, the reasoner's task is not one of deterministic deduction but rather of uncertain reasoning. That is, the reasoner must infer the state of the domain based on incomplete and uncertain knowledge (observations) about the domain.

3.2 Models of Bayesian Networks

Probabilistic models based on directed acyclic graphs (DAGs) have a long and rich tradition, which began with the geneticist Sewall Wright (1921). Variants have appeared in many fields; within cognitive science and artificial intelligence, such models are known as Bayesian networks. Their initial development in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based schemes [Pea88] [SP90] [HMW95] [Jen96]. Bayesian networks are an effective means to model probabilistic knowledge by representing cause-and-effect relationships among the key entities of a managed system. They can be used to generate useful predictions about future faults and to support decisions even in the presence of uncertain or incomplete information.
Bayesian networks have been applied in various areas. [Nik00] [WKT+99] [SCH+04] use Bayesian networks in medical diagnosis. [KK94] describes the application of Bayesian networks to fault diagnosis in Diesel engines. [CSK01] presents methods for distributed Web mining from multiple data streams based on the model of Bayesian networks. [HBR95] describes an application of Bayesian networks to the retrieval of information according to users' areas of interest. [BH95] presents a system that utilizes a Bayesian network for debugging very complex computer programs. [Bun96] approaches the utilization of Bayesian networks in coding, representing and discovering knowledge, through processes that seek new knowledge on a given domain based on inferences on new data or on the knowledge already available [KZ96]. [BDV97] uses Bayesian networks for map learning, and [SJB00] applies the model of Bayesian networks to image sensor fusion. [Leu02] [GF01] and [Nej00] use Bayesian networks to model adaptive e-learning environments.

3.2.1 Basic Concepts of Bayesian Networks

The technology with which a system handles uncertain information forms a crucial component of its overall performance. Technologies for modeling uncertainty include Bayesian probability, Dempster-Shafer theory, fuzzy logic, and certainty factors. Bayesian probability uses probability theory to manage uncertainty by explicitly representing the conditional dependencies between the different knowledge components. It offers a language and calculus for reasoning about beliefs in the presence of uncertainty. Prior probabilities are updated after new events are observed to produce posterior probabilities. By repeating this process, the implications of multiple sources of evidence can be calculated in a consistent way, and the uncertainties are exploited explicitly to reach an objective conclusion. Bayesian networks provide an intuitive graphical visualization of the knowledge, including the interactions among the various sources of uncertainty.
Bayesian networks, also known as Bayesian belief networks, probabilistic networks or causal networks, are an important knowledge representation technique in Artificial Intelligence [Pea88] [CDL+99]. Bayesian networks use directed acyclic graphs (DAGs) with probability labels to represent probabilistic knowledge. A Bayesian network can be defined as a triplet (V, L, P), where V is a set of variables (nodes of the DAG) which represent propositional variables of interest, L is the set of causal links among the variables (the directed arcs between nodes of the DAG) which represent informational or causal dependencies among the variables, and P is a set of probability distributions defined by P = {p(v | π(v)) | v ∈ V}, where π(v) denotes the parents of node v. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the probabilities of any subset of variables given evidence about another subset. In a Bayesian network, the information included in a node depends on the information of its predecessor nodes; such a node represents an effect, while its direct predecessors represent its causes. Note that an effect node can also act as a causal node of other nodes, where it then plays the role of a cause node. Causal relations also have a quantitative side, namely their strength. This is expressed by
attaching numbers (probabilities) to the links. Let A be a parent of B. Using probability calculus, it would be natural to let P(B|A) be the strength of the link. However, if C is also a parent of B, then the two conditional probabilities P(B|A) and P(B|C) alone do not give any clue about how the impacts from A and C interact: as Figure 3.1 shows, they may cooperate or counteract in various ways, so we need a specification of P(B|A, C). To each variable B with parents A1, ..., An, there is attached the potential table P(B|A1, ..., An). Note that if A has no parents, then the table reduces to the unconditional probabilities P(A).

Figure 3.1: Basic Model of a Bayesian Network (nodes A and C, with priors P(A) and P(C), are the parents of node B, which carries the table P(B|A, C)).

A Bayesian network provides a complete description and a very compact representation of the domain. It encodes the Joint Probability Distribution (JPD) in a compact manner. An important advantage of Bayesian networks is the avoidance of building huge JPD tables that include permutations of all the nodes in the network. Rather, for an effect node, only the states of its immediate predecessors need to be examined. A complete joint probability distribution over n binary-valued attributes requires 2^n − 1 independent parameters to be specified. In contrast, a Bayesian network over n binary-valued attributes, in which each node has at most k parents, requires at most 2^k · n independent parameters. To make this concrete, suppose we have 20 nodes (n = 20) and each node has at most 5 parent nodes (k = 5). Then the Bayesian network requires only 640 numbers, but the full joint probability distribution requires about a million. Clearly, such a network can encode only a very small fraction of the possible distributions over these attributes, since it has relatively few parameters. The fact that the structure of a BN eliminates the vast majority of distributions from consideration indicates that the network structure itself encodes information about the domain. This information takes the form of the conditional independence relationships that hold between attributes in the network.
To this end, a Bayesian network comprises two parts: a qualitative part and a quantitative part. The qualitative part of a Bayesian network is a graphical representation of the independencies that hold among the variables in the distribution that is being represented. This part takes the form of an acyclic directed graph. In this graph, each vertex represents a statistical variable that can take one of a finite set of values. Informally speaking, we take an arc vi → vj in the graph to represent a direct influential or causal relationship between the linked variables vi and vj; the direction of the arc vi → vj designates vj as the effect or consequence of the cause vi. Absence of an arc between two vertices means that the corresponding variables do not influence each other directly and, hence, are independent. Associated with the qualitative part of a Bayesian network is a set of functions representing numerical quantities from the distribution at hand. With each vertex of the graph is associated a probability assessment function, which basically is a set of (conditional) probabilities describing the influence of the values of the vertex's predecessors on the probabilities of the values of this vertex itself.
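As a quick check of the parameter-count comparison above, the following Python sketch (illustrative only) computes both counts for the example of 20 binary nodes with at most 5 parents each:

def full_jpd_parameters(n):
    """A full joint distribution over n binary variables needs 2**n - 1 numbers."""
    return 2 ** n - 1

def bn_parameters(n, k):
    """A Bayesian network over n binary variables, each node with at most k parents,
    needs at most n * 2**k numbers."""
    return n * 2 ** k

print(full_jpd_parameters(20))  # 1048575 ("about a million")
print(bn_parameters(20, 5))     # 640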

3.2.2 An Example of Bayesian Networks

Figure 3.2: Causal Influences in a Bayesian Network (X1 Season, X2 Rain, X3 Sprinkler, X4 Wet, X5 Slippery).

Figure 3.2 illustrates a simple yet typical Bayesian network. It describes the causal relationships among the season of the year (X1), whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here, the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness; the influence is mediated by the wetness of the pavement. (If freezing is a possibility, then a direct link could be added.) The most important aspect of Bayesian networks may be that they are a direct representation of the world, not of reasoning processes. The arrows in the diagram represent real causal connections, not the flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet (prediction); if someone slips on the pavement, this event also provides evidence that it is wet (abduction). On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler was on or that it had rained (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that it is raining (explaining away). It is difficult to model this kind of reasoning in rule-based systems and neural networks in any natural way.

3.2.3 The Semantics of Bayesian Networks

The key feature of Bayesian networks is the fact that they provide a method for decomposing a probability distribution into a set of local distributions. The independence semantics associated with the network topology specifies how to combine these local distributions to obtain the complete joint probability distribution over all the random variables represented by the nodes in the network. This has three important consequences. Firstly, naively specifying a joint probability distribution with a table requires a number of values exponential in the number of variables. For systems in which interactions among the random variables are sparse, Bayesian networks drastically reduce the number of required values.


Secondly, efficient inference algorithms can be formulated that work by transmitting information between the local distributions rather than working with the full joint distribution. Thirdly, the separation of the qualitative representation of the influences between variables from the numeric quantification of the strength of the influences has a significant advantage for knowledge engineering. When building a Bayesian network model, one can focus first on specifying the qualitative structure of the domain and then on quantifying the influences. When the model is built, one is guaranteed to have a complete specification of the joint probability distribution.
The most common computation performed on Bayesian networks is the determination of the posterior probability of some random variables, given the values of other variables in the network. Because of the symmetric nature of conditional probability, this computation can be used to perform both diagnosis and prediction. Other common computations are: the computation of the probability of the conjunction of a set of random variables, the computation of the most likely combination of values of the random variables in the network, and the computation of the piece of evidence that has or will have the most influence on a given hypothesis. A detailed discussion of inference techniques in Bayesian networks can be found in the book by Pearl [Pea00].

• Probabilistic semantics. Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint distribution, i.e., the probability of every possible event as defined by the values of all the variables. There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the joint distribution into local, conditional distributions for each variable given its parents. If xi denotes some value of the variable Xi and π(xi) denotes some set of values for Xi's parents, then P(xi|π(xi)) denotes this conditional distribution. For example, P(x4|x2, x3) is the probability of wetness given the values of sprinkler and rain. Here P(x4|x2, x3) is shorthand for P(x4|{x2, x3}); the set parentheses are omitted for the sake of readability, and we use the same convention throughout this book. The global semantics of Bayesian networks specifies that the full joint distribution is given by the product

P(x1, ..., xn) = Π_i P(xi|π(xi))    (3.13)

Equation 3.13 is also called the chain rule for Bayesian networks. In the sample Bayesian network in Figure 3.2, we have

P(x1, x2, x3, x4, x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2, x3) P(x5|x4)    (3.14)

Provided the number of parents of each node is bounded, it is easy to see that the number of parameters required grows only linearly with the size of the network, whereas the joint distribution itself grows exponentially. Further savings can be achieved using compact parametric representations, such as noisy-OR models, decision trees, or neural networks, for the conditional distributions [Pea00]. There are also entirely equivalent local semantics, which assert that each variable is independent of its non-descendants in the network given its parents. For example, the parents of X4 in Figure 3.2 are X2 and X3, and they render X4 independent of the remaining non-descendant X1. That is,

P(x4|x1, x2, x3) = P(x4|x2, x3)    (3.15)

The collection of independence assertions formed in this way suffices to derive the global assertion in Equation 3.14, and vice versa. The local semantics are most useful in constructing Bayesian networks, because selecting as parents the direct causes of a given variable automatically satisfies the local conditional independence conditions. The global semantics lead directly to a variety of algorithms for reasoning.

• Evidential reasoning. From the product specification in Equation 3.14, one can express the probability of any desired proposition in terms of the conditional probabilities specified in the network. For example, the probability that the sprinkler was on, given that the pavement is slippery, is

P(X3 = on | X5 = true)
  = P(X3 = on, X5 = true) / P(X5 = true)
  = [ Σ_{x1,x2,x4} P(x1, x2, X3 = on, x4, X5 = true) ] / [ Σ_{x1,x2,x3,x4} P(x1, x2, x3, x4, X5 = true) ]
  = [ Σ_{x1,x2,x4} P(x1) P(x2|x1) P(X3 = on|x1) P(x4|x2, X3 = on) P(X5 = true|x4) ] / [ Σ_{x1,x2,x3,x4} P(x1) P(x2|x1) P(x3|x1) P(x4|x2, x3) P(X5 = true|x4) ]    (3.16)

The above computation is based on Equations 3.1, 3.5 and 3.14. These expressions can often be simplified in ways that reflect the structure of the network itself. (A small enumeration sketch of this computation is given at the end of this section.) It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and hence that reasoning is NP-hard [Coo90]. Monte Carlo simulation methods can be used for approximate inference [Pea87], given that estimates are gradually improved as the sampling proceeds. (Unlike join-tree methods, these methods use local message propagation on the original network structure.) Alternatively, variational methods [JGJ+98] provide bounds on the true probability.

• Functional Bayesian networks. The networks discussed so far are capable of supporting reasoning about evidence and about actions. Additional refinement is necessary in order to process counterfactual information. For example, the probability that "the pavement would not have been slippery had the sprinkler been OFF, given that the sprinkler is in fact ON and that the pavement is in fact slippery" cannot be computed from the information provided in Figure 3.2 and Equation 3.14. Such counterfactual probabilities require a specification in the form of functional networks, where each conditional probability P(xi|π(xi)) is replaced by a functional relationship xi = fi(π(xi), εi), where εi is a stochastic (unobserved) error term. When the functions fi and the distributions of εi are known, all counterfactual statements can be assigned unique probabilities, using evidence propagation in a structure called a "twin network". When only partial knowledge about the functional form of fi is available, bounds can be computed on the probabilities of counterfactual sentences [BP95] [Pea00].

• Causal discovery. One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to discover causal structures in raw statistical data [PV91] [SGS93] [Pea00], a task previously considered impossible without controlled experiments. Consider, for example, the following pattern of dependencies among three events: A and B are dependent, B and C are dependent, yet A and C are independent. If you ask a person to supply an example of three such events, the example
would invariably portray A and C as two independent causes and B as their common effect, namely A → B ← C. Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is mathematically feasible but very unnatural, because it must entail fine tuning of the probabilities involved; the desired dependence pattern will be destroyed as soon as the probabilities undergo a slight change. One of the most common approaches towards reasoning in Bayesian networks is backward inference, which traces the causes from the effects. The dependency structure and the reasoning model in Bayesian networks make causal discovery possible [DZB+04].

• Plain beliefs. In mundane decision making, beliefs are revised not by adjusting numerical probabilities but by tentatively accepting some sentences as "true for all practical purposes". Such sentences, called plain beliefs, exhibit both logical and probabilistic character. As in classical logic, they are propositional and deductively closed; as in probability, they are subject to retraction and to varying degrees of entrenchment. Bayesian networks can be adopted to model the dynamics of plain beliefs by replacing ordinary probabilities with non-standard probabilities, that is, probabilities that are infinitesimally close to either 0 or 1 [GP96].

• Models of cognition. Bayesian networks may be viewed as normative cognitive models of propositional reasoning under uncertainty [Pea00]. They handle noise and partial information by using local, distributed algorithms for inference and learning. Unlike feedforward neural networks, they facilitate local representations in which nodes correspond to propositions of interest. Recent experiments [TG01] suggest that they capture accurately the causal inferences made by both children and adults. Moreover, they capture patterns of reasoning that are not easily handled by any competing computational model. They appear to have many of the advantages of both the "symbolic" and the "subsymbolic" approaches to cognitive modelling. Two major questions arise when we postulate Bayesian networks as potential models of actual human cognition. Firstly, does an architecture resembling that of Bayesian networks exist anywhere in the human brain? No specific work has been done to design neurally plausible models that implement the required functionality, although no obvious obstacles exist. Secondly, how could Bayesian networks, which are purely propositional in their expressive power, handle the kinds of reasoning about individuals, relations, properties, and universals that pervade human thought? One plausible answer is that Bayesian networks containing propositions relevant to the current context are constantly being assembled, as needed, from a more permanent store of knowledge. For example, the network in Figure 3.2 may be assembled to help explain why this particular pavement is slippery right now, and to decide whether this can be prevented. The background store of knowledge includes general models of pavements, sprinklers, slipping, rain, and so on; these must be accessed and supplied with instance data to construct the specific Bayesian network structure. The store of background knowledge must utilize some representation that combines the expressive power of first-order logical languages (such as semantic networks) with the ability to handle uncertain information.
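As announced in the discussion of evidential reasoning above, the following Python sketch enumerates the joint distribution of the network in Figure 3.2 with the chain rule (Equation 3.14) and answers queries such as Equation 3.16 by brute force. The network structure is the one shown in the text, but all conditional probability values below are invented for illustration; the last two queries reproduce the explaining-away pattern qualitatively:

from itertools import product

P_season   = {'dry': 0.5, 'wet': 0.5}                   # P(x1)
P_rain     = {'dry': 0.1, 'wet': 0.7}                   # P(x2 = true | x1)
P_sprink   = {'dry': 0.6, 'wet': 0.1}                   # P(x3 = on   | x1)
P_wet      = {(True, True): 0.99, (True, False): 0.90,  # P(x4 = true | x2, x3)
              (False, True): 0.90, (False, False): 0.01}
P_slippery = {True: 0.9, False: 0.05}                   # P(x5 = true | x4)

def joint(x1, x2, x3, x4, x5):
    """Chain rule (Eq. 3.14): P(x1)P(x2|x1)P(x3|x1)P(x4|x2,x3)P(x5|x4)."""
    p = P_season[x1]
    p *= P_rain[x1] if x2 else 1 - P_rain[x1]
    p *= P_sprink[x1] if x3 else 1 - P_sprink[x1]
    p *= P_wet[(x2, x3)] if x4 else 1 - P_wet[(x2, x3)]
    p *= P_slippery[x4] if x5 else 1 - P_slippery[x4]
    return p

def prob(query, evidence):
    """P(query | evidence), computed by summing the joint over all configurations."""
    num = den = 0.0
    for x1 in P_season:
        for x2, x3, x4, x5 in product([True, False], repeat=4):
            world = {'x1': x1, 'x2': x2, 'x3': x3, 'x4': x4, 'x5': x5}
            if any(world[v] != val for v, val in evidence.items()):
                continue
            p = joint(x1, x2, x3, x4, x5)
            den += p
            if all(world[v] == val for v, val in query.items()):
                num += p
    return num / den

print(prob({'x3': True}, {'x5': True}))              # P(sprinkler on | slippery), as in Eq. 3.16
print(prob({'x3': True}, {'x4': True}))              # P(sprinkler on | wet)
print(prob({'x3': True}, {'x4': True, 'x2': True}))  # lower: the rain explains the wetness away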

3.2.4 d-Separation in Bayesian Networks

d-Separation is one important property of Bayesian networks for inference. Before we define d-separation, we first look at the way that evidence is transmitted in Bayesian networks. There are two types of evidence:

• Hard evidence (instantiation) for a node A is evidence that the state of A is definitely a particular value.

• Soft evidence for a node A is any evidence that enables us to update the prior probability values for the states of A.

d-Separation (Definition): Two distinct variables X and Z in a causal network are d-separated if, for all paths between X and Z, there is an intermediate variable V (distinct from X and Z) such that either
• the connection is serial or diverging and V is instantiated, or
• the connection is converging, and neither V nor any of V's descendants have received evidence.
If X and Z are not d-separated, we call them d-connected.

The following are the three cases of d-separation in Bayesian networks in detail:

1. Serial connections
Consider the situation in Figure 3.3. X has an influence on Y, which in turn has an influence on Z. Obviously, evidence on X will influence the certainty of Y, which then influences the certainty of Z. Similarly, evidence on Z will influence the certainty of X through Y. On the other hand, if the state of Y is known, then the channel is blocked, and X and Z become independent. We say that X and Z are d-separated given Y, and when the state of a variable is known, we say that it is instantiated. We conclude that evidence may be transmitted through a serial connection unless the state of the variable in the connection is known.
Figure 3.3: Serial Connection. If Y is instantiated, it blocks the communication between X and Z.

2. Diverging connections
The situation in Figure 3.4 is called a diverging connection. Influence can pass between all the children of X unless the state of X is known. We say that Y1, Y2, ..., Yn are d-separated given X. Evidence may be transmitted through a diverging connection unless it is instantiated.

Figure 3.4: Diverging Connection. If X is instantiated, it blocks the communication between its children Y1, ..., Yn.

3. Converging connections
A description of the situation in Figure 3.5 requires a little more care. If nothing is known about Y except what may be inferred from knowledge of its parents X1, ..., Xn, then the parents are independent: evidence on one of the possible causes of an event does not tell us anything about the other possible causes. However, if anything is known about the consequences, then information on one possible cause may tell us something about the other causes. This is the explaining-away effect illustrated in Figure 3.2: X4 (the pavement is wet) has occurred, and X3 (the sprinkler is on) as well as X2 (it is raining) may cause X4. If we then get the information that X2 has occurred, the certainty of X3 will decrease. Likewise, if we get the information that X2 has not occurred, then the certainty of X3 will increase.

Figure 3.5: Converging Connection. If Y changes certainty, it opens the communication between its parents X1, ..., Xn.

The three preceding cases cover all ways in which evidence may be transmitted through a variable.
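The definition of d-separation translates almost literally into a small path-blocking check. The following Python sketch is illustrative only; the edge set and the precomputed descendant sets are assumptions made for this example, using the node names of Figure 3.2:

def path_blocked(path, edges, evidence, descendants):
    """Return True if the given path is blocked by the evidence set."""
    for i in range(1, len(path) - 1):
        prev, v, nxt = path[i - 1], path[i], path[i + 1]
        converging = (prev, v) in edges and (nxt, v) in edges
        if converging:
            # Converging connection: blocked unless v or one of its descendants is observed.
            if v not in evidence and not (descendants[v] & evidence):
                return True
        else:
            # Serial or diverging connection: blocked if v itself is instantiated.
            if v in evidence:
                return True
    return False

edges = {('X1', 'X2'), ('X1', 'X3'), ('X2', 'X4'), ('X3', 'X4'), ('X4', 'X5')}
descendants = {'X1': {'X2', 'X3', 'X4', 'X5'}, 'X2': {'X4', 'X5'},
               'X3': {'X4', 'X5'}, 'X4': {'X5'}, 'X5': set()}

# The converging path X2 -> X4 <- X3 is blocked when only X1 is observed ...
print(path_blocked(['X2', 'X4', 'X3'], edges, {'X1'}, descendants))        # True
# ... but opens once the common effect X4 receives evidence (explaining away).
print(path_blocked(['X2', 'X4', 'X3'], edges, {'X1', 'X4'}, descendants))  # False

Since the other path between X2 and X3, namely X2 ← X1 → X3, is blocked whenever X1 is instantiated, the two calls above correspond to X2 and X3 being d-separated given X1 and d-connected given {X1, X4}, respectively.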


Chapter 4
Probabilistic Inference in Fault Management

Efficient fault management requires an appropriate level of automation and self-management. Knowledge-based expert systems, as examples of automated systems, have been very appealing for distributed systems fault diagnosis [KBW+91]. Usually, such systems are based on deterministic network models. A serious problem with deterministic models is their inability to isolate the primary sources of faults from uncoordinated network events. As discussed in Section 3.1.2, uncertainty is unavoidable in distributed systems. Observing that the cause-and-effect relationship between symptoms and possible causes is inherently nondeterministic, probabilistic models can be considered to gain a more accurate representation. Bayesian networks are appropriate models for probabilistic fault management. From the management point of view, a normal operation is to trace the root causes from detected symptoms (alarms). Hence backward inference in the probabilistic model, from effects to causes, is the basis of distributed fault management.

4.1 Bayesian Networks for Fault Management

Bayesian networks are appropriate for automated diagnosis because of their deep representations and precise calculations. A concise and direct way to represent a system's diagnostic model is as a Bayesian network constructed from the relationships between failure symptoms and underlying problems. A Bayesian network represents cause and effect between observable symptoms and the unobserved problems, so that when a set of symptoms is observed, the problems most likely to be the cause can be determined. In practice, the fault management system is built from descriptions of the likely effects of a chosen fault. The development of a diagnostic Bayesian network requires a deep understanding of the cause-and-effect relationships in a domain, provided by domain experts. One advantage of Bayesian networks is that the encoded knowledge makes the relationships between effects and causes explicit; it is not represented as a black box, as in artificial neural networks. In addition, compared with purely logical formalisms, Bayesian networks also provide a more fine-grained quantitative evaluation in a probabilistic model. Thus, humanly understandable explanations of diagnoses can be given.

4.1.1 The Characteristics of the Faults in Distributed Systems

Due to their extended topology, distributed applications and services, and the complex dependencies between managed components, the faults in a distributed system differ from those in a
centralized system. In general, the characteristics of faults in distributed systems can be identified as follows:

• The sources of faults are distributed. In distributed systems, the topology and the services are spread over a distributed environment. The managed objects are distributed geographically or logically, and the faults generated in a distributed system are likewise scattered across the whole system. An efficient fault management system should therefore support fault detection within distributed environments.

• Fault propagation results from the dependency relationships between managed objects. In distributed systems, managed objects cooperate with each other to complete certain applications or services. In this cooperation, one managed object often depends on another object in a transaction. This kind of dependency is denoted by a cause-effect relationship. For example, a disconnection in a switch may interrupt the connection services of other devices that are connected to the switch and depend on it. Hence the dependency relationships between managed objects are an important factor in fault detection and recovery.

• The sources of faults are often hidden. In complex distributed systems, because of fault propagation, the symptoms of a hardware or software fault are often caused by a remote or hidden factor. That means the root cause of the faults is sometimes difficult to identify directly from the detected symptoms. Consider the following simple scenario: a physical link failure occurs in one of several interconnected networks. The failure is detected and reported by the management component monitoring the physical resource in question. The failure also has side effects on other resources in the network; e.g., connections on the various layers which use the link will experience timeouts. The management components of resources such as protocol stacks will therefore report failures. The result is that the operator's console is literally flooded with reports indicating the existence of some abnormal network condition, making it extremely difficult to determine the real cause of the problem.

• A managed system holds enough information for fault management. In current distributed systems, many devices and application modules keep records of the numerous states of their operations and services. Most of the information related to faults is also recorded in the managed systems. Hence, it is possible to mine new knowledge from all the recorded information and the statistics of the historical data. For example, the system's event log consists of a large number of individual events sent by all nodes in the system that have event generation capabilities. On a typical day, the log can reach tens of thousands of events [Nyg95]. This number is a function of parameters such as the size of the system, the configuration, the type and the amount of traffic being carried, and the number and the type of faults that have occurred during that day. Events are collected at a centralized operation and maintenance center (OMC) where the event log is assembled. It is known that the manual processing of this mass of data tends to become infeasible as the number of high-speed systems in the network increases. However, the statistics over the large volume of historical data can be used to retrieve new knowledge, which is a potential reference for fault management.

4.1.2 Why Use Bayesian Networks for Distributed Fault Management?

In distributed systems, some faults may be directly observable, i.e., they are problems and symptoms at the same time. However, many types of faults are unobservable due to their intrinsically unobservable nature, or to the lack of the management functionality necessary to provide indications of their existence. Some faults may be partially observable: the management system provides indications of a fault occurrence, but the indications are not sufficient to precisely locate the fault.

Figure 4.1: A Model of Fault Propagation (a database client and a database server connected via routers A and B; a router interface fault causes bursts of bit errors, discarded IP packets, and finally a "transaction aborted" alarm at the client).

In Figure 4.1, a simple distributed system is presented in which a client accesses a remote database server. An interface of one of the routers between the client and the server gets intermittently out of sync, causing bursts of bit errors in transmitted IP datagrams. As a result, many IP datagrams passing through the router are rejected by the next router because of header errors, or by the server because of the corrupted datagram body. The client does not receive any response to its query and times out. This example illustrates how a seemingly invisible fault manifests itself through a failure at a location distant from the location of the fault. Since most faults are not directly observable, the management system has to infer their existence from the information provided by the received alarms. The information carried within reported alarms may include the following: the identity of the object that generated the alarm, the type of failure condition, a time stamp, an alarm identifier, a measure of the severity of the failure condition, a textual description of the failure, etc. [HCF95] [Sch96]. Hence, in a distributed system, a single fault may cause a number of alarms to be delivered to the network management center. Multiple alarms may be a result of (1) fault re-occurrence, (2) multiple invocations of a service provided by a faulty component, (3) generation of multiple alarms by a device for a single fault, (4) detection of and issuing a notification about the
same network fault by many devices simultaneously, and (5) error propagation to other devices causing them to fail and, as a result, generate additional alarms [HCF95]. It may be argued that typical distributed systems provide plenty of the information necessary to infer the existence of faults [YKM+96]. Due to their dense knowledge representation, Bayesian networks can represent large amounts of interconnected and causally linked data as they occur in distributed systems. Generally speaking:

• Bayesian networks can represent knowledge in depth by modeling the functionality of the transmission network in terms of the cause-and-effect relationships among distributed system components and distributed system services.

• Bayesian networks are easy to extend in design because they are graph-based models. Hence Bayesian networks are appropriate to model the problem domain in distributed systems, particularly the domain of fault propagation.

• Bayesian networks come with a very compact representation. A complete Joint Probability Distribution (JPD) over n binary-valued attributes requires 2^n − 1 independent parameters to be specified. In contrast, a Bayesian network over n binary-valued attributes, in which each node has at most k parents, requires at most 2^k · n independent parameters. It is clear that such a network can encode only a very small fraction of the possible distributions over these attributes, since it has relatively few parameters. The fact that the structure of a Bayesian network eliminates the vast majority of distributions indicates that the network structure itself encodes information about the domain. This information takes the form of the conditional independence relationships that hold between attributes in the network.

• Bayesian networks have the capability of handling noisy, transient, and ambiguous data, which are unavoidable in complex distributed systems, due to their grounding in probability theory.

• Bayesian networks have the capacity to carry out inference about the state of a distributed system from the combination of:
  – statistical data empirically surveyed during network operation;
  – subjective probabilities supplied by specialists; and
  – information (that is, "evidence" or "alarms") received from the distributed systems.

• Bayesian networks provide a compact and well-defined problem space because they use an exact solution method for any combination of evidence or set of faults. Through the evaluation of a Bayesian network, it is possible to obtain approximate answers even when the existing information is incomplete or imprecise; as new information becomes available, Bayesian networks allow a corresponding improvement in the precision of the correlation results.

• Bayesian networks are abstract mathematical models. In distributed systems management, Bayesian networks can be designed at different levels, for different management intentions, or based on a particular application or service. For example, when a connection service is considered, the physical topology is the basis for the construction of a Bayesian network, while when a distributed service is taken into account, the logical dependencies between managed objects based on that service act as the foundation for the construction of a Bayesian network.

4.1.3 Mapping Distributed Systems to Bayesian Networks

The Construction of a Bayesian Network
Building a Bayesian network for an application domain involves various tasks. The first is to identify the variables that are of importance in the domain at hand, along with their possible values. The identification of the important domain variables is typically performed with the help of one or more domain experts. This task is not specific to building a Bayesian network, but is quite common in engineering knowledge-based systems. A knowledge engineer can therefore make use of the various elicitation techniques designed for engineering knowledge-based systems in general [Kin01]. Once the important domain variables have been identified, each of them needs to be expressed as a statistical variable, which is characterized by its values being mutually exclusive and collectively exhaustive, and which has to take its value from a finite set of discrete values to allow for inclusion in a Bayesian network. Only if the set of values of a domain variable exhibits these properties can the variable be included in the network as it is. For example, a single-valued domain variable taking its value from an infinite set of values cannot be expressed directly as a statistical variable for inclusion in a Bayesian network. The variable's set of values has to be discretized, that is, split up into a finite number of mutually exclusive subsets of values, which subsequently are taken as the variable's new values. Also, a domain variable taking multiple values from a finite set of values cannot be expressed directly as a statistical variable, since its values are not mutually exclusive. The variable's values then have to be redefined to render them mutually exclusive and collectively exhaustive; alternatively, the domain variable can be decomposed into several variables, which subsequently are modeled separately in the Bayesian network in the making.
Formally, for constructing the qualitative part of the network, the independence relation of the JPD on the discerned variables has to be identified and represented in an acyclic digraph. In general practice, however, the graph is constructed directly without explicitly identifying all relevant independencies. For most application domains, the qualitative part of a Bayesian network has to be handcrafted with the help of one or more domain experts. For eliciting the topology of the DAG of the network, the concept of causality is often used as a heuristic guiding rule during the interview with a domain expert; typical questions are "What could cause this effect?" and "What manifestations could this cause have?" [Hen89]. The elicited causal relations among the discerned variables are thus easily expressed in graphical terms by taking the direction of causality for directing the arcs between related variables; this graphical representation can then be taken as a basis for feedback to the domain expert for further refinement. Building on the concept of causality has the advantage that domain experts are allowed to express their knowledge in either the causal or the diagnostic direction. For some application domains, the construction of the qualitative part of the Bayesian network can be performed automatically by carefully exploiting collected data. After the qualitative part of the Bayesian network has been constructed, its quantitative part is to be specified. Specifying the quantitative aspect partly amounts to defining the probability assessment functions for the variables that are modeled in the distributed system.
The task of assessing all the required probabilities tends to be by far the hardest task in Bayesian network building. In most domains, at least some information is already available for this task, coming from the literature or from domain experts. Although the literature on the application domain often provides abundant probabilistic information, it is unfortunately seldom directly amenable to encoding in a Bayesian network: the typical information is incomplete, or it concerns variables which are not causally related or which are mixed with noise.


A large number of probabilities will have to be assessed by domain experts. The field of decision analysis offers various methods for the elicitation of judgmental probabilities from experts. These methods are designed to avert, at least to some extent, the problems of bias and poor calibration typically found in human probability assessment. In practice, building a Bayesian network is a cyclic process that iterates over the various tasks until the resulting network is deemed satisfactory for the application domain at hand. In conclusion, we would like to stress that building a Bayesian network requires a careful tradeoff between the desire for a large and rich model to obtain accurate results on the one hand, and the cost of construction and maintenance and the complexity of probabilistic inference on the other hand.

Building Bayesian Networks for Distributed Fault Management
As discussed before, three steps are necessary to construct a Bayesian network [Hec96]:
• To choose the variables and the states of each one of them;
• To construct the structure of the Bayesian network, that is, the directed acyclic graph containing the information on the independence among the variables;
• To assign probability values, that is, to specify the distribution of local probabilities for each variable.
In the case of a distributed system, a great part of the work needed for the construction of the Bayesian network, namely the choice of the variables and the construction of the structure, will already be done as soon as the network model is available. We represent the uncertainty in the dependencies among the entities of a distributed system by assigning probabilities to the links in the dependency or causality graph [KS95] [KYY+95]. Some commonly accepted assumptions in this context are: (a) given fault A, the occurrences of faults B and C that may be caused by A are independent; (b) given occurrences of faults A and B that may cause event C, whether A actually causes C is independent of whether B causes C (the OR relationship among alternative causes of the same event); and (c) root faults are independent of one another. This dependency graph can be transformed into a Bayesian network, which is a DAG with certain special properties [HW95]. To the best of our knowledge, no approximation has been proposed that fits all types of networks. This book focuses on a class of Bayesian networks representing a simplified model of conditional probabilities called noisy-OR models [Pea88]. The simplified model contains binary-valued random variables. The noisy-OR model associates an inhibitory factor with every cause of a single effect and assumes that they are all independent. The effect is absent only if all inhibitors corresponding to the present causes are activated. In distributed systems, an object may be successively subdivided until each one of the resulting objects is considered indivisible. An indivisible object is denominated a terminal object. In the model of Bayesian networks, a distributed system is represented by a directed graph G = (V, L, P), where V is a non-empty finite set of terminal objects vi (∈ V), and L is a set of directed edges (vi, vj) ∈ V × V. The existence of an edge (vi, vj) indicates that a fault in vi causes side effects in vj. When a distributed system is modelled as a Bayesian network, two important processes need to be resolved:

1. Ascertain the Dependency Relationship between Managed Objects.

A distributed system consists of a number of managed objects. An object is a "part" of the distributed system that has a separate and distinct existence. At the physical level, an object can be a network, a node, a switch, a layer in a protocol stack, a virtual link, a physical element like an optical fiber, a piece of cable, a hardware component, etc. At the logical level, an object can be a software service, such as a process, a piece of code, a URL, a servlet or a service request. Objects in a distributed system consist of other objects, down to the level of the smallest objects that are considered indivisible. An indivisible object is defined as a terminal object. The concept of division and the appropriate level of division are system-dependent and application-dependent. Objects in a distributed system depend upon each other in rather complex ways. These dependencies are very important for the alarm correlation and fault identification process. In most cases a failure in one object has side effects on other objects that depend on it. For example, a link failure has an effect on other resources in the network; e.g., connections on the various layers that use the link will experience timeouts. The knowledge of these dependencies gives us valuable information for the purpose of alarm correlation and fault localization.
Dependencies: when one object requires a service performed by another object in order to execute its own function, this relationship between the two objects is called a dependency. Consider any two objects, say A (such as a service, or an application component in software or hardware) and B. A is said to be dependent on B if B's services are required for A to complete its own service. A weight may also be attached to the directed edge from A to B, which may be interpreted in various ways, such as a quantitative measure of the extent to which A depends on B, or of how much A may be affected by the non-availability or poor performance of B. Any dependency between A and B thus arises from an invocation of B from A, which may be synchronous or asynchronous. Dependency analysis explores causal dependencies among objects and data items, with the goal of tracing a fault symptom back to its cause. This is an often-used trouble-shooting technique, applicable to any system that is based on the collaboration of independent or distributed entities. For instance, deadlocks in databases may be diagnosed by following transactions that are blocked waiting for other transactions. In computing, there exist many different kinds of dependencies. However, not all references and interactions actually represent causal dependencies that are relevant for diagnosis. Hence only the dependencies which are pertinent to the purpose of the management are taken into account. The dependencies among distributed entities can be quantified by assigning probabilities to the links in the dependency or causality graph [KS95] [KYY+95]. This dependency graph can be transformed into a Bayesian network with certain special properties [HW95]. In distributed systems, the notion of dependency can be applied at various levels of granularity. Sometimes the dependencies that occur between different system components should be defined carefully.
For example, the maintenance of an e-mail server obviously affects the service 'email' and thus all the users whose user agents have a client-server relationship with this specific server; however, other services (news, WWW, ftp) are still usable because they do not depend on a functioning e-mail service. So the inter-system dependencies are always confined to the components of the same service. Two models are useful for obtaining the dependencies between cooperating entities in distributed systems.


• Functional model (from the view of users). The functional model defines generic service dependencies and establishes the principal constraints to which the other models are bound. A functional dependency is an association between two entities, typically captured first at design time, which says that one component requires some services from another. The functional dependencies between logical objects are determined by the implementation and functional support relationships and give rise to a graph, from which it is possible to correlate a set of state changes (which may be considered as a "signature" of a problem) to the original cause of the problem. The functional model is utilized by a "network state estimator" to correlate the changes in the network state. The state changes are reported by the received alarms, to which information exogenous to the network (such as information related to climatic situations) is added.

• Structural model (from the view of system implementers). The structural model contains the detailed descriptions of the software and hardware components that realize the service. A structural dependency contains detailed information and is typically captured first at deployment or installation time.

2. Obtaining the Measurement of the Dependency. The faults and anomalies in distributed systems can be identified based on the statistical behavior of the Management Information Base (MIB) variables and the recordings in log files. When Bayesian networks are used to model distributed systems, they represent the causes and effects between observable symptoms and unobserved problems, so that when a set of evidence is observed, the most likely causes can be determined by inference techniques. Single-cause (fault) and multi-cause (fault) are the two general assumptions for considering the dependencies between managed entities in distributed systems management. In Bayesian networks, a non-root node may have one or several parents (causal nodes). Single-cause means that any of the causes alone leads to the effect. The dependencies between the causes and the effect for the single-cause case are therefore denoted as

P(e | c_1, ..., c_n) = 100%, if ∃ i ∈ [1, n] such that c_i = F (False); 0, otherwise.   (4.1)

e denotes the effect node and c_i denotes a cause of e, i = 1, ..., n. The existence of multiple causes means that the effect is generated only when more than one cause occurs simultaneously. The measurement of the dependencies therefore has various possibilities, depending on the particular problem domain. In the above description, the states of the objects are identified as T (True) or F (False). In complex systems, managed objects may hold more than two states. In distributed systems, the measurement of the dependencies between managed objects can be obtained by the following methods:

• Management information statistics are the main source for obtaining the dependencies between the managed objects in distributed systems.

• The empirical knowledge of experts is another important source for determining the dependency between managed objects.

• For particular dependencies, an experiment provides a way to retrieve the dependencies between the managed objects.

Hasselmeyer [Has01] argues that the dependencies among distributed cooperating components should be maintained and published by the services themselves, and he proposes a schema that allows these dependencies to be obtained. Some researchers have performed useful work on discovering dependencies from the application view in distributed systems [GNA+03] [KBK00] [GKK04]. Despite all the methods cited in this section, it has to be observed that obtaining dependency information in an automatic fashion is still an open research problem. To obtain dependency information, one needs to use available and suitable techniques to deal with every system, layer or type of device separately.
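To make the single-cause semantics of Equation (4.1) concrete, the following minimal Python sketch (the helper name is illustrative) enumerates the resulting deterministic conditional probability table for an effect with n causes, where the state False stands for 'faulty':

from itertools import product

def single_cause_cpt(n):
    # P(e | c_1..c_n) is 1.0 as soon as at least one cause is faulty (False),
    # and 0.0 when all causes are in order, as stated in Equation (4.1).
    cpt = {}
    for states in product([True, False], repeat=n):  # True = in order, False = faulty
        cpt[states] = 1.0 if not all(states) else 0.0
    return cpt

print(single_cause_cpt(2))  # an effect with two causes

Multi-cause dependencies, by contrast, require a table whose entries are estimated from the sources listed above rather than being fixed to 0 or 1.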

Figure 4.2: Example of a Campus Network.

Figure 4.2 shows an example of the campus network of FernUniversität in Hagen. When examining the connection service for end users, Figure 4.3 illustrates the relevant Bayesian network corresponding to the communication network presented in Figure 4.2. The JPDs which measure the dependencies between the nodes in Figure 4.3 are listed in Table 4.1.

Table 4.1: The JPD to measure the dependencies between nodes in Figure 4.3 (an overbar denotes the complementary state, i.e. the component is in order).

P(A) = 0.0092%     P(B) = 0.0097%     P(C) = 0.0097%     P(D) = 0.376%
P(E) = 0.063%      P(F) = 0.035%
P(A|B̄E) = 100%     P(A|BĒ) = 100%     P(A|B̄Ē) = 0%      P(A|BE) = 100%
P(D|ĒF) = 20%      P(D|EF̄) = 100%     P(D|ĒF̄) = 0%      P(D|EF) = 100%
P(C|E) = 100%      P(C|Ē) = 0%        P(B|E) = 100%      P(B|Ē) = 0%


Figure 4.3: Example of Bayesian Network for Figure 4.2.

Here the dependency probability for the connection service is derived from the record of connection failures between the objects; load-balancing mechanisms within a router do not interfere with this model. The arrows in the Bayesian network denote the dependency from causes to effects. The weights of the links denote the probability of the dependency between the objects. When one node has several parents (causes), the dependency between the parents and their child can be denoted by a joint probability distribution. In this example, component F and component E are the causes for component D. The annotation P(D|EF̄) = 100% denotes the probability of the non-availability of component D, which is 100% when component F is in order but component E is not. The other annotations can be read similarly. In this example, some evidence, such as the status of component D, is easily detected through regular monitoring, but the causes of a failure of component D are not always obvious. One important task in management is to infer hidden factors from the available evidence. In terms of precision, the behavior of a Bayesian network reflects the quality and the level of detail of its structure, which stems from the object system model. Another factor which affects the Bayesian network model is the precision of the values of the conditional probabilities.

4.2 Probabilistic Inference for Distributed Fault Management

The semantics of a Bayesian network determines the conditional probability of any event given any other event. When computing such a conditional probability, the conditioning event is called the evidence, while the event whose conditional probability given the evidence we want to determine is called the query. The general capability of a Bayesian network to compute conditional probabilities allows it to exhibit many particular patterns of reasoning (inference).

• Causal reasoning is the pattern of reasoning that reasons from a cause to its effects.
• Evidential reasoning is the reasoning from effects to their possible causes.
• Mixed reasoning combines both causal and evidential reasoning.
• Intercausal reasoning involves reasoning between two different causes that have an effect in common.

In the case of fault management in distributed systems, we only consider backward inference (evidential reasoning), which is the basic operation of fault diagnosis.

4.2.1 Basic Model of Backward Inference in Bayesian Networks

The fault localization algorithm based on the fault propagation model (FPM), which we are going to introduce now, should return a number of fault hypotheses that best explain the set of observed symptoms. The most common approach towards reasoning with uncertain information about dependencies in distributed systems is probabilistic inference, which traces the causes from the effects. The task of backward inference amounts to finding the most probable instances of some hypothesis variables, given the observed evidence. We define E as the set of effects (evidence) which we can observe, and C as the set of causes. The inference from effects to causes is denoted by P(c_j|E_i), E_i ⊆ E, c_j ∈ C. Before discussing the complex backward inference in Bayesian networks, a simplified model will be examined. In Bayesian networks, one node may have one or several parents (if it is not a root node), and we denote the dependency between the parents and their child by a Joint Probability Distribution (JPD).

Figure 4.4: Basic Model for Backward Inference in Bayesian Networks (causes x_1, x_2, ..., x_n with common effect Y).

Figure 4.4 shows the basic model for backward inference in Bayesian networks. Let X = (x_1, x_2, ..., x_n) be the set of causes and Y be the effect of X. A solid arrow denotes the causal relationship from a cause to an effect; a dashed arrow denotes the relationship from an effect to a cause. According to the definition of Bayesian networks, the following variables are known: P(x_1), P(x_2), ..., P(x_n), and P(Y|x_1, x_2, ..., x_n) = P(Y|X). Here x_1, x_2, ..., x_n are mutually independent, so

P(X) = P(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} P(x_i)   (4.2)

P(Y) = Σ_X [P(Y|X) P(X)] = Σ_X [P(Y|X) ∏_{i=1}^{n} P(x_i)]   (4.3)

By Bayes' theorem,

P(X|Y) = P(Y|X) P(X) / P(Y) = [P(Y|X) ∏_{i=1}^{n} P(x_i)] / (Σ_X [P(Y|X) ∏_{i=1}^{n} P(x_i)])   (4.4)

which computes to

P(x_i|Y) = Σ_{X\x_i} P(X|Y)   (4.5)

In Equation (4.5), X\x_i = X − {x_i}. According to Equations (4.2)-(4.5), the individual conditional probability P(x_i|Y) can be obtained from the JPD P(Y|X), X = (x_1, x_2, ..., x_n). The backward dependency from effects to causes is thus obtained from Equation (4.5); the dashed arrows in Figure 4.4 denote the backward inference P(x_i|Y) from the effect Y to the individual causes x_i, i ∈ [1, n]. In Figure 4.3, E and F are the parents of D. The calculation of the backward inference from D to E and F is shown in Table 4.2.

Table 4.2: Calculation of backward inference in Bayesian networks (an overbar denotes the complementary state, i.e. the component is in order).

              (E, F)      (E, F̄)      (Ē, F)      (Ē, F̄)      Sum
P(E) / P(Ē)   0.063%      0.063%      99.937%     99.937%
P(F) / P(F̄)   0.035%      99.965%     0.035%      99.965%
P(E, F)       0.000%      0.063%      0.035%      99.900%     ≈ 1
P(D|E, F)     100.000%    100.000%    20.000%     0.000%
P(D, E, F)    0.000%      0.063%      0.007%      0.000%      0.0007 (= P(D))
P(E, F|D)     0.000%      90.000%     10.000%     0.000%      1

P(E, F|D) = P(D|E, F) P(E, F) / P(D)   (4.6)

P(E|D) = P(E, F̄|D) + P(E, F|D) = 90.000% + 0.000% = 90.000%   (4.7)

P(F|D) = P(Ē, F|D) + P(E, F|D) = 10.000% + 0.000% = 10.000%   (4.8)

In Figure 4.3, when a fault is detected in component D, then based on the above calculation (Equations 4.7 and 4.8) we obtain P(E|D) = 90.000% and P(F|D) = 10.000%. This can be interpreted as follows: when component D is not available, the probability of a fault in component E is 90.00% and the probability of a fault in component F is 10.00%. In this situation, we can advance the hypothesis that E is more likely to be the cause of the fault in D. Here only faults related to the connection service are considered.
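For illustration, the calculation of Table 4.2 and Equations (4.6)-(4.8) can be reproduced by enumerating the joint distribution; the following Python sketch uses the priors and the conditional probabilities for D from Table 4.1 (the variable names are illustrative, and True stands for 'component failed'):

from itertools import product

p_fail = {"E": 0.00063, "F": 0.00035}          # priors of failure (Table 4.1)
p_d_given = {                                   # P(D fails | E state, F state)
    (True, True): 1.00,                         # E failed, F failed
    (True, False): 1.00,                        # E failed, F in order
    (False, True): 0.20,                        # E in order, F failed
    (False, False): 0.00,                       # both in order
}

joint = {}                                      # P(D, E, F) for every (E, F) combination
for e, f in product([True, False], repeat=2):
    p_ef = (p_fail["E"] if e else 1 - p_fail["E"]) * \
           (p_fail["F"] if f else 1 - p_fail["F"])
    joint[(e, f)] = p_d_given[(e, f)] * p_ef    # E and F assumed independent

p_d = sum(joint.values())                       # P(D)
posterior = {s: p / p_d for s, p in joint.items()}                 # Equation (4.6)
p_e_given_d = posterior[(True, False)] + posterior[(True, True)]   # Equation (4.7)
p_f_given_d = posterior[(False, True)] + posterior[(True, True)]   # Equation (4.8)
print(p_d, p_e_given_d, p_f_given_d)

Running this sketch yields P(E|D) ≈ 0.90 and P(F|D) ≈ 0.10, in agreement with Equations (4.7) and (4.8).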


4.2.2 Strongest Dependency Route Algorithm for Backward Inference

The objective of any fault management system is to minimize the time to localize and repair a fault. The time to localize a fault is the sum of the time to propose possible fault hypotheses (fault identification) and the time to perform testing in order to verify these hypotheses. The time required for testing is affected by the number of managed objects that must be tested. Thus, if the network management system is to identify the source of a fault, it is desirable that the minimum number of tests be performed. Hence, any fault localization process has two main aspects subject to optimization: the accuracy of the hypotheses it provides and the time complexity of the fault identification algorithm it uses. In order to optimize the time to localize the fault, we should maximize the accuracy of the proposed hypotheses and minimize the time complexity of the fault identification process.

In distributed systems management, when faults are reported, the most urgent task is normally to locate the causes, to bring faulty states back to normal, and possibly to improve the performance of the system. The key factors that are related to the defect in the system should be identified. The Strongest Dependency Route (SDR) algorithm is proposed to resolve these tasks based on probabilistic inference. Before we describe the SDR algorithm, the definitions of strongest cause and strongest dependency route are given as follows:

• Strongest Cause. In a Bayesian network let C be the set of causes and E be the set of effects. For e_i ∈ E, let C_i be the set of causes based on the effect e_i. Then c_k is the strongest cause for effect e_i iff P(c_k|e_i) = Max[P(c_j|e_i), c_j ∈ C_i].

• Strongest Dependency Route. In a Bayesian network, let C be the set of causes and E be the set of effects, and let R = (R_1, R_2, ..., R_m) be the set of routes from an effect e_i ∈ E to one of its causes c_j ∈ C. Let M_k be the set of transition nodes, i.e. the necessary transit nodes between e_i and c_j in route R_k ∈ R. Iff P(c_j|M_k, e_i) = Max[P(c_j|M_t, e_i), t = 1, 2, ..., m], then R_k is the strongest route between e_i and c_j.

The detailed description of the SDR algorithm is as follows.

Pruning of Bayesian networks. When a concrete problem domain is considered, a common strategy is to omit the variables that are not related to the problem domain. To start the pruning operation on a Bayesian network BN = (V, L, P), the set of initial nodes (effect nodes) E has to be determined first. In distributed systems management, E is obtained from the fault report of a management system. Generally speaking, multiple effects (symptoms) may be observed at a single moment, so E_k = {e_1, e_2, ..., e_k} (e_i ∈ V) is defined as the set of initial effects. A pruning operation is defined as follows:

Pruning Algorithm Prune(BN = (V, L, P), E_k = {e_1, e_2, ..., e_k}, e_i ∈ V)
    create BN' = (V', L', P');
    V' = E_k;                          // initialize V' with E_k
    L' = ∅;                            // ∅ denotes the empty set
    for e_i ∈ E_k (i = 1, ..., k)
        v_i = e_i;
        while π(v_i) ≠ NIL do
            V' = V' ∪ {π(v_i)};        // add vertex π(v_i) to V'
            L' = L' + ⟨π(v_i), v_i⟩;   // add edge ⟨π(v_i), v_i⟩ to L'
            v_i ← π(v_i);
    return BN';

In the pruning operation, every step just integrates the current node's parents into the sub-BN (BN') and omits its sibling nodes, because the sibling nodes are independent of each other. The pruned graph is composed of the effect nodes E_k and all their ancestors, and its end nodes constitute the set of causes based on the effect nodes. The pruning algorithm cuts the unrelated (redundant) nodes of the Bayesian network with respect to the considered effects. It provides a simplified way to deal with backward inference in a sub-BN that is embedded in a large and complex Bayesian network.

Strongest Dependency Route (SDR) Algorithm

After the pruning algorithm has been applied to a Bayesian network, a simplified sub-BN is obtained. Between every cause and effect, there may be more than one dependency route. The questions now are: which route is the strongest dependency route, and which of all the causes is the strongest cause? The SDR algorithm is used to trace the strongest dependency route, to locate the causes and to generate the dependency sequence among the causes based on particular effects in the Bayesian network. The SDR algorithm uses a product calculation to measure the strongest serial dependencies between effect nodes and causal nodes. In the SDR algorithm, multiple effects can be considered. Suppose E_k ⊆ E, E_k = {e_1, e_2, ..., e_k}. If k = 1, the graph degenerates to a single-effect model. The backward dependency calculation is based on Equations 4.2 - 4.5 in Section 4.2.1. When several effects (symptoms) E_k are observed at the same time, the effects are instantiated, i.e. P(e_i) = 1 or P(ē_i) = 1, i = 1, 2, ..., k. Here only the state P(e_i) = 1 is considered, where e_i denotes a fault evidence (symptom).

SDR Algorithm:

Input: BN = (V, L, P); V: the set of nodes (variables) in the Bayesian network, L: the set of links in the Bayesian network, P: the dependency probability distribution of every node in the Bayesian network; E_k = {e_1, e_2, ..., e_k}: the set of initial effect nodes in the BN, E_k ⊆ V.

Output: T: a spanning tree of the Bayesian network, rooted at E_k = {e_1, e_2, ..., e_k}, whose path from e_i to each causal node is a strongest dependency route, and a vertex labelling giving the dependency probability from e_i to each causal node.

Variables: depend[v]: the strongest probability dependency between v and all its descendants; ψ(l): the conditional probability P(v|u), where v is a parent of u and both share the link l; ψ(l) can be calculated from the JPD P(u|π(u)) based on Equations (4.2)-(4.5); ϕ(l): a temporary variable which records the strongest dependency between nodes.

    Initialize the SDR tree T as E_k;   // the nodes e_i ∈ E_k are added as root nodes of T
    Initialize the set of frontier edges of T (edges with one endpoint (the child) within T and the other (the parent) outside T) as empty;
    Write label 1 on e_i, i ∈ [1, k];   // e_i ∈ E_k
    While the SDR tree T does not yet span the BN
        For each frontier edge l of T
            Let u be the labelled endpoint of edge l;
            Let v be the unlabelled endpoint of edge l (v is one parent of u);
            Let ψ(l) = P(v|u);
            Set ϕ(l) = depend[u] * ψ(l) = depend[u] * P(v|u);
        Let l be the frontier edge of the BN that has the maximum ϕ-value;
        Add edge l (and vertex v) to tree T;
        depend[v] = ϕ(l);
        Write label depend[v] on vertex v;
    Return the SDR tree T and its vertex labels;

The result of the SDR algorithm is a spanning tree T. Every cause node c_j ∈ C is labelled with depend[c_j] = P(c_j|M_k, e_i), e_i ∈ E_k, where M_k is the set of transition nodes between e_i and c_j in route R_k ∈ R. According to the values of the labels of the cause nodes, a cause sequence can be obtained. This sequence is important for the primary diagnosis of faults and the related maintenance operations. Meanwhile, using a depth-first search on the spanning tree, the strongest route between effect nodes and cause nodes can also be obtained. Suppose one reasoning chain of a series of variables (from cause to effect) is δ_0 → δ_1 → ... → δ_n; then the JPD P(δ_0|δ_n, δ_{n-1}, ..., δ_1) is considered as the backward reasoning value based on the backward serial variables (from effect to cause): δ_n → δ_{n-1} → ... → δ_0.

Proof of the SDR Algorithm. Now we prove the core property of the SDR algorithm [DZB+04], which is that it computes the strongest route among all possible routes from an effect node to a possible cause node. Suppose the route ⟨e_i, u_1, u_2, ..., u_n, c_j⟩ is the strongest dependency route for a given BN and effect node e_i, and ⟨e_i, δ_1, δ_2, ..., δ_m, c_j⟩ is any route from e_i to c_j. Then

P(u_1|e_i) · P(u_2|u_1) · ... · P(c_j|u_n) ≥ P(δ_1|e_i) · P(δ_2|δ_1) · ... · P(c_j|δ_m)   (4.9)

This formula states that the product of all backward dependencies along the strongest dependency route is maximal among all routes. Defining weight(u, π(u)) = −lg(P(π(u)|u)), Equation 4.9 is transformed into

weight(e_i, u_1) + weight(u_1, u_2) + ... + weight(u_n, c_j) ≤ weight(e_i, δ_1) + weight(δ_1, δ_2) + ... + weight(δ_m, c_j)   (4.10)

When a vertex u is added to the spanning tree T, define d[u] = −lg(depend[u]); the claim is that d[u] equals weight(e_i, u), the weight of the strongest dependency route from e_i to u. Because 0 < depend[δ_j] ≤ 1, we have d[δ_j] ≥ 0. Note that depend[δ_j] ≠ 0, or else there would be no dependency relationship between δ_j and its offspring.


Proof: Suppose, to the contrary, that at some point the SDR algorithm first attempts to add a vertex u to T for which d[u] ≠ weight(e_i, u).

Figure 4.5: Proof of the SDR Algorithm.

See Figure 4.5. Consider the situation just prior to the insertion of u and the true strongest dependency route from e_i to u. Because e_i ∈ T and u ∈ V\T, at some point this route must first take a jump out of T. Let (x, y) be the edge taken by the route, where x ∈ T and y ∈ V\T (it may be that x = e_i or y = u). We now prove that d[y] = weight(e_i, y). Since edge (x, y) was relaxed when x was processed, we have

d[y] ≤ d[x] + weight(x, y)   (4.11)

Since x was added to T earlier, by hypothesis,

d[x] = weight(e_i, x)   (4.12)

Since ⟨e_i, ..., x, y⟩ is a sub-path of a strongest dependency route, by Equation 4.12,

weight(e_i, y) = weight(e_i, x) + weight(x, y) = d[x] + weight(x, y)   (4.13)

By Equations 4.11 and 4.13, we get

d[y] ≤ weight(e_i, y)   (4.14)

Since d-values always correspond to actual routes and therefore cannot be smaller than the true route weight, it follows that

d[y] = weight(e_i, y)   (4.15)

Now note that y appears midway on the route from e_i to u and all subsequent edge weights are positive, so we have

weight(e_i, y) < weight(e_i, u)   (4.16)

and thus

d[y] = weight(e_i, y) < weight(e_i, u) ≤ d[u]   (4.17)

Therefore y would have been added to T before u, in contradiction to our assumption that u is the next vertex to be added to T. So the algorithm must work. Since the calculation is correct for every effect node, it is also true for multiple effect nodes in tracing the strongest dependency route. When the algorithm terminates, all vertices are in T and thus all dependency (weight) estimates are correct.


Complexity analysis of the SDR Algorithm

To determine the complexity of the SDR algorithm, we observe that every link (edge) in a Bayesian network is processed only once, so the complexity is governed by the number of links in the Bayesian network. In a complete directed acyclic graph the number of edges is n(n − 1)/2 = (n² − n)/2, where n is the number of nodes in the pruned spanning tree of the Bayesian network. Normally a Bayesian network is an incomplete directed graph, so the number of calculation steps of the SDR is less than (n² − n)/2, and the complexity of the SDR is O(n²). In Figure 4.3, if the effect nodes D and C are detected, the calculation cost of cause detection with the SDR algorithm is 3 (< 2²). According to the SDR algorithm, multiple effect nodes can be observed. From the spanning tree, the strongest routes between effects and causes can be obtained by a depth-first search. Meanwhile the value of depend[c_j] (c_j ∈ C) generates a dependency ranking of the causes based on E_k. This dependency sequence is a useful reference for fault diagnosis and system recovery. In terms of precision, the behavior of a Bayesian network reflects the quality and the level of detail of its structure, which stems from the object network model. Another factor that affects the precision of the Bayesian alarm correlation process is the quality and level of detail of the alarms to be correlated. These two factors affect the precision of any alarm correlation process, independently of the adopted approach. A third factor, the precision of the values of the conditional probabilities, also contributes to the precision of the correlation process.
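As an illustration only, the core of the SDR spanning-tree construction can be sketched in a few lines of Python. The sketch assumes that the Bayesian network has already been pruned and that the backward dependency P(parent|child) of every link has been pre-computed from the JPDs via Equations (4.2)-(4.5); the data structures and numbers below are illustrative assumptions:

import heapq
from typing import Dict, List, Optional, Tuple

# child -> [(parent, P(parent|child)), ...] for the pruned sub-BN
Backward = Dict[str, List[Tuple[str, float]]]

def sdr(backward: Backward, effects: List[str]) -> Dict[str, Tuple[float, Optional[str]]]:
    # Returns depend[v] and the predecessor of v on the strongest dependency route.
    depend = {e: 1.0 for e in effects}           # observed effects are instantiated: label 1
    tree: Dict[str, Tuple[float, Optional[str]]] = {e: (1.0, None) for e in effects}
    heap = [(-1.0, e) for e in effects]          # max-heap simulated with negated values
    heapq.heapify(heap)
    while heap:
        neg, u = heapq.heappop(heap)
        if -neg < depend.get(u, 0.0):            # outdated queue entry
            continue
        for parent, psi in backward.get(u, []):  # psi(l) = P(parent | u)
            phi = depend[u] * psi                # phi(l) = depend[u] * psi(l)
            if phi > depend.get(parent, 0.0):
                depend[parent] = phi
                tree[parent] = (phi, u)
                heapq.heappush(heap, (-phi, parent))
    return tree

example = {"D": [("E", 0.90), ("F", 0.10)], "C": [("E", 1.00)]}  # toy numbers only
print(sdr(example, ["D", "C"]))

Because all link probabilities are at most 1, extending the tree greedily by the maximum ϕ-value is safe; with the priority queue the sketch runs in O(|L| log |V|) time, consistent with the O(n²) bound discussed above.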

4.2.3 Related Algorithms for Probabilistic Inference

Various types of inference algorithms exist for Bayesian networks. All in all, they can be classified into two types of inference: exact inference [LS88] [Pea88] [Pea00] and approximate inference [Nea93]. Each class offers different properties and works better on different classes of problems, but it is very unlikely that a single algorithm can solve all possible problem instances effectively. Every resolution is always based on a particular requirement. Almost all computational problems of probabilistic inference using general Bayesian networks have been shown to be NP-hard by Cooper [Coo90]. In the early 1980s, Pearl published an efficient message propagation inference algorithm for polytrees [KP83] [Pea86]. The algorithm is exact and has polynomial complexity in the number of nodes, but works only for singly connected networks. Pearl also presented an exact inference algorithm for multiply connected networks called the loop cutset conditioning algorithm [Pea86]. The loop cutset conditioning algorithm changes the connectivity of a network and renders it singly connected by instantiating a selected subset of nodes referred to as a loop cutset. The resulting singly connected network is solved by the polytree algorithm, and then the results of each instantiation are weighted by their prior probabilities. The complexity of this algorithm results from the number of different instantiations that must be considered. This implies that the complexity grows exponentially with the size of the loop cutset, being O(d^c), where d is the number of values that the random variables can take and c is the size of the loop cutset. It is thus important to minimize the size of the loop cutset for a multiply connected network. Unfortunately, the loop cutset minimization problem is NP-hard. A straightforward application of Pearl's algorithm to an acyclic digraph comprising one or more loops invariably leads to insuperable problems [KW01] [Nea93]. Another popular exact Bayesian network inference algorithm is Lauritzen and Spiegelhalter's clique-tree propagation algorithm [LS88]. It is also called a "clustering" algorithm. It first transforms a multiply connected network into a clique tree by clustering the triangulated moral graph of the underlying undirected graph and then performs message propagation over the clique tree. The clique propagation algorithm works efficiently for sparse networks, but can still be extremely slow for dense networks. Its complexity is exponential in the size of the largest clique of the transformed undirected graph. In general, the existing exact Bayesian network inference algorithms share the property of a run time exponential in the size of the largest clique of the triangulated moral graph, which is also called the induced width of the graph [LS88]. It is also difficult to record the internal nodes and the dependency routes between particular effect nodes and causes. In distributed systems management, however, the states of internal nodes and the key routes connecting effects and causes are important for management decisions. Moreover, the sequence of localization of potential faults can be a reference for system managers and is thus very useful. It is also important for system performance management to identify the relevant key factors. Few algorithms give a satisfactory resolution for this case. Compared with other algorithms, the SDR algorithm belongs to the class of exact inference and provides an efficient method to trace the strongest dependency routes from effects to causes and to track the dependency sequences of the causes. It is useful for fault localization and beneficial for performance management. Moreover, it can treat multiply connected networks modelled as DAGs. A simulation model was developed to verify the effectiveness and efficiency of the proposed approach; a detailed description of the simulation is presented in Chapter 6. From the simulation results, we notice that the proposed SDR algorithm not only identifies multiple problems at once but also has a much lower calculation complexity than related exact inference algorithms.


Chapter 5 Prediction Strategies in Fault Management

For complex distributed systems, it is important that fault management be proactive, that is, that it detect, diagnose, and mitigate problems before they result in a severe degradation of system performance. Proactive fault management depends on monitoring distributed systems to obtain the data on which a manager's decisions are based. Fault prediction means predicting a failure in advance on the basis of the current information about the system. This is especially relevant for large systems, which have some components failing all the time, because such a prediction can be made by an analysis of the historical information. Dynamic changes in distributed systems raise higher barriers for exact fault localization. Hence, for large distributed systems with thousands of managed components, it may be rather time-consuming and difficult to locate the unknown causes of faults in due time by an exhaustive search for the root causes of a failure, and this process may interrupt or impair important system services. Dynamic updates bring even more challenges to fault detection. Systems whose behavior is not fully understood are often modeled by Bayesian networks (BNs). However, the BN paradigm does not provide direct mechanisms for modeling temporal dependencies in dynamic systems [AC96] [YS96]. In this chapter we apply Dynamic Bayesian Networks (DBNs) to address temporal factors and to model the dynamic changes of managed entities and the dependencies between them. Based on the related inference techniques we further investigate the predictive capabilities of fault management in the presence of imprecise and dynamic management information. The application of DBNs to fault management is discussed in Section 5.1. The prediction mechanisms reflecting the dynamic changes in distributed systems are presented in Section 5.2.

5.1 Dynamic Bayesian Networks for Fault Management

In real-life distributed systems, dynamic changes are unavoidable because of the evolution of hardware, software and distributed applications. Hence it is very important for fault management to understand these changes and to capture their trend in distributed systems.

5.1.1 Dynamic Characteristics in Distributed Systems

Dynamic changes in distributed systems are related to hardware, software and the dependencies between those components in implementing certain functions. Hence changes in distributed systems exhibit some particular characteristics.


Hard Changes and Soft Changes in Distributed Systems

Dynamic updates in distributed systems can be classified into either hard or soft changes [DKX+04]. A hard change refers to a change that happens abruptly and is most of the time generated on purpose by the system owner. This kind of change does not depend on the system's history. For example, a router being added to or removed from the distributed system may cause an abrupt change in the system topology and behavior. Some intended operations also generate this kind of hard change, such as a change of the configuration of a distributed system. Generally, a hard change does not happen very often, but this depends on the intentions of the system manager.

A soft change, in contrast, refers to a change that happens gradually and depends on the system history. A soft change typically results from changes of system properties such as performance degradation, application degeneration, dependency modifications and so on. A soft change may bring some potential problems, such as network traffic becoming slower or certain network services decreasing in efficiency. The roots of these kinds of problems are aging devices, conflicting applications, or unknown hidden factors in the systems. Experience in system management shows that many unknown or unlocated causes of faults are triggered by a soft change, which is related to the latent changes and updates of the system. Compared with a hard change, a soft change keeps going on all the time in distributed systems and is hard to predict using a straightforward approach. In our research we focus on soft changes in distributed systems.

When distributed systems are modelled as graph structures, hard changes can be treated by structure modifications in the models, based on the abrupt changes in the systems or on the intentions of the system managers. Soft changes (such as the improvement or degradation in performance of a hardware or software component) do not modify the topology of the network but update the weights (dependencies) between the components in the network. Considering soft changes in distributed systems, one kind of change comes from individual distributed system entities; another arises from updating dependencies between managed entities. From the viewpoint of management, an entity can be a hardware device, a software component or a certain application. Real-life dynamic distributed systems are often rife with nonlinearities, many of which are expressed as discrete failure modes that can produce discontinuous jumps in system behavior.

Characteristics of Dynamic Changes in Distributed Systems

In distributed systems, dynamic changes are often identified as a discrete nonlinear time series. A time series is a chronological sequence of observations on a particular variable. Time series data are often examined in the hope of discovering a historical pattern that can be exploited in the preparation of a forecast. In order to identify this pattern, it is often convenient to think of a time series as consisting of several components: trend, cycle and irregular fluctuations.

• Trend: Trend refers to an upward or downward movement that characterizes a time series over a period of time. Thus trend reflects the long-run growth or decline in a time series. Trend movements can represent a variety of factors. For example, in distributed systems the trend movements may come from long-run movements in the performance of distributed systems, which might be determined by the improvement or degeneration of hardware and software, or by increasing application services.

• Cycle: Cycle refers to the recurrence of upward and downward movements around the trend level. These fluctuations have a duration or period that can be measured either from peak to peak or from trough to trough. For example, in distributed systems, one of the common cyclic fluctuations in time series data is the "traffic cycle". The traffic cycle is represented by the fluctuations in the time series caused by the recurrent periods of user services. Since there is no single explanation, cyclic fluctuations vary greatly in both length and magnitude.

• Irregular Fluctuations: Irregular fluctuations are erratic movements in a time series that follow no recognizable or regular pattern. Such movements represent what is "left over" in a time series after trend, cycle and other variations have been accounted for. Many irregular fluctuations in a time series are caused by 'unusual' events that cannot be predicted. Irregular fluctuations can also be caused by errors on the part of the time series analyst. Normally it is hard to predict irregular fluctuations.

Figure 5.1: Time Series that Exhibit Trend, Cycle and Irregular Fluctuations. (a) Trend; (b) Cycle; (c) Irregular Fluctuation.
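For illustration, a series exhibiting these three components, similar to those sketched in Figure 5.1, can be synthesized with a few lines of numpy; all parameters below are illustrative assumptions rather than measurements:

import numpy as np

t = np.arange(52)                                               # e.g. the weeks of one year
trend = 0.05 * t                                                # long-run upward movement
cycle = 2.0 * np.sin(2 * np.pi * t / 12.0)                      # recurring fluctuation around the trend
irregular = np.random.default_rng(0).normal(0.0, 0.5, t.size)   # erratic, unpredictable part
series = trend + cycle + irregular                              # in practice the components rarely occur alone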


Figures 5.1(a)-(c) plot trend, cycle and irregular fluctuations in time series. All these components of time series are common patterns for describing dynamic changes in distributed systems. Figure 5.1(a) may demonstrate the trend in performance improvement in distributed systems. Figure 5.1(b) portrays traffic measurements with a cyclic time series in a distributed system. Figure 5.1(c) exhibits irregular fluctuations in the resource utilization of an unstable distributed system. It should be pointed out that the time series components we have discussed do not always occur alone; they can occur in any combination, or all can occur together. For this reason, no single best prediction model exists. A prediction model that can be used to predict a time series characterized by trend alone may not be appropriate for the prediction of a time series characterized by a combination of trend and cyclical variations. Thus one of the most important problems to be solved in prediction is that of trying to match the appropriate prediction model to the pattern of the available time series data. Once an appropriate model has been selected, the methodology usually involves estimating the time series components (model parameters). The estimates are then used to compute a prediction. For example, if a time series is characterized by a combination of trend and cyclic components, the appropriate prediction technique would first estimate these two components. Predictions would then be obtained by combining the estimate of the trend component with the estimate of the cycle component. Again, however, it should be emphasized that the key to this methodology is finding a model that matches the pattern of the historical data that is available.

Fault Management in Dynamic Environments

Today distributed systems are in a constant state of change. The size and complexity as well as the topology and configuration change frequently to meet users' demands. At the same time, system administrators are under constant pressure to update system equipment to reduce costs and to provide more reliable systems. This has led to totally heterogeneous systems, including equipment and software packages manufactured by different vendors with different technologies. These facts about today's systems pose major challenges to the old ad-hoc techniques for the effective management of a system. These challenges are as follows:

• The number of events generated by these dynamic systems is no longer manageable. An average size system can easily generate tens of thousands of events on a daily basis [Nyg95]. This volume of information makes it extremely difficult for the system operator to determine the state of the system from these raw data.

• The increase in size and complexity of these systems leads to an increase in the volume of events generated on a daily basis. Therefore, the volume of daily events can be seen as a time-dependent function that makes any static management tool obsolete very quickly.

• The larger the system and its event log, the higher the probability that events will become intermingled. The intermingling of events, or event addition, is defined as the overlap of unrelated events within a set of events that form a known pattern of behavior of some node(s). Dynamic changes in the system make the situation more complex.

• As a system's topology and configuration change frequently, so do the event patterns representing nodes and their normal/abnormal behavior.
How to deal with the dynamic changes is an important task in a fault management system. Again, static tools will break down very quickly in such an environment.


System managers have been trying to deploy novel tools and techniques to overcome such obstacles in fault management. However, management systems still face pressures attributed to two factors. Firstly, already strained knowledgeable system personnel are spending a large portion of their time analyzing huge amounts of information semi-manually. Secondly, the accuracy of these techniques is questionable. From experience in distributed systems management, a typical metric for on-line fault identification is 95% fault location accuracy [Hil01]. This leads to a delay in the detection and identification of faults and other important system information. Such delays decrease the system's availability. The prediction of system faults, anomalies and performance degradation forms an important component of distributed systems management. The advent of real-time services on distributed systems creates a need for continuous monitoring and prediction of system performance and reliability. Although faults are rare events, they have enormous consequences when they do occur. Yet the rareness of faults in distributed systems makes their study difficult. Performance problems occur more often and in some cases may be considered as indicators of an impending fault [MF90]. Efficient handling of these performance issues may help eliminate the occurrence of severe faults.

5.1.2 Dynamic Bayesian Networks for Fault Management

When a dynamic distributed system is modeled, a time dimension has to be considered. Because observations and evidence can be updated over time, a management system should capture the evolution of the system as it changes over time. A problem with the standard theory of BNs is that there is no natural mechanism for representing time [AC96] [YS96]. DBNs provide a way to model a dynamic system, i.e. a system that is dynamically changing or evolving over time [KKR95]. DBNs enable users to monitor and update the system as time proceeds, and even to predict its further behavior. As there is no standard definition of DBNs, researchers may use different descriptions to accommodate their research requirements. The current literature tends to use the terms "dynamic" and "temporal" interchangeably. The temporal approaches can be divided into two main categories of time representation models:

• as points or instances, or
• as time intervals.

DBNs have various definitions in different application areas. In the fault management of distributed systems, DBNs possess a time-related function:

BN(t) = (V(t), L(t), P(t))   (5.1)

For a soft change in distributed systems, dynamic changes only happen in individual components and in the dependencies between components. Under these kinds of changes, the topology of the Bayesian network remains stable; hence the time parameter can be omitted for the nodes and edges:

BN(t) = (V, L, P(t))   (5.2)
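In code, Equation (5.2) amounts to keeping one shared graph and indexing only the probability tables by time; the following minimal Python sketch (the type and variable names are made up for illustration) captures this idea:

from typing import Dict, List, Tuple

Parents = Dict[str, List[str]]                       # the fixed structure (V, L)
CPTs = Dict[str, Dict[Tuple[bool, ...], float]]      # P(t): node -> table over parent states

structure: Parents = {"D": ["E", "F"], "B": ["E"], "C": ["E"]}   # e.g. the graph of Figure 4.3

def bn_at(t: int, cpt_history: Dict[int, CPTs]) -> Tuple[Parents, CPTs]:
    # BN(t) = (V, L, P(t)): under soft changes only P(t) is re-estimated per time slice.
    return structure, cpt_history[t]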

DBNs can represent large amounts of interconnected and causally linked data as well as the dynamic properties when they occur in distributed systems. Thus DBNs can model time-related changes in the dependencies between managed objects in distributed systems.


A DBN is an extension of a BN that models time series [Fri98]; it extends Bayesian networks to model the possible distributions over a time series. We only consider discrete-time stochastic processes, so we increase the index t by one every time a new observation arrives. The observation could indicate that something has changed in the distributed system. Note that the term 'dynamic' does not mean that the topology of the distributed system changes over time, but that a dynamic system is modeled.

5.2 Prediction Strategies for Distributed Systems Management

Predictive management plays a crucial role in distributed systems. The ability to predict service problems in distributed systems, and to respond to those warnings by applying corrective actions, brings multiple benefits. Firstly, the early detection of system failures on a few servers can prevent those failures from spreading to the entire distributed system. For example, slow response time on a server may gradually escalate into technical difficulties on all nodes that attempt to communicate with that server. Secondly, a prediction can be used to ensure the continuous provision of distributed system services through the automatic implementation of corrective actions.

5.2.1 Prediction Methods in Dynamic Systems

Qualitative methods and quantitative methods

Qualitative prediction methods generally use the opinions of experts to subjectively predict future events. Such methods are often required when historical data are either not available at all or scarce. Qualitative prediction techniques are also used to predict changes in historical data patterns. Since the use of historical data to predict future events is based on the assumption that the patterns of the historical data will persist, changes in the data patterns cannot be predicted on the basis of the historical data. Thus qualitative methods are often used to predict such changes. Two commonly used techniques for a qualitative prediction are:

• Subjective curve fitting. The forecaster first determines subjectively the form of the curve to be used; the subjective construction of such a curve is very difficult and requires a great deal of expertise and judgment.

• Time-dependent technological comparison. In this method, which is often used to predict technological changes, changes in one area are predicted by monitoring changes that take place in another area. That is, the forecaster tries to determine a pattern of change (often a primary trend) in one area, which can be expected to result in new developments in some other areas. A prediction of the developments in the second area can then be made by monitoring the developments in the first area.

In distributed systems management, qualitative prediction depends mostly on the expertise of the manager for a subjective curve fitting.

Quantitative prediction methods can be classified into two types:

• One common type of quantitative prediction method is called a univariate model. Such a model predicts future values of a time series solely on the basis of the past values of the time series. When a univariate model is used, historical data are analyzed in an attempt to identify a data pattern. Then, assuming that it will continue in the future, this data pattern is extrapolated in order to produce predictions. Univariate prediction models are, therefore, most useful when conditions are expected to remain the same; they are not very useful for predicting the impact of changes in management policies.

• The use of causal prediction models involves the identification of other variables that are related to the variable to be predicted. Once these related variables have been identified, a statistical model is developed to describe the relationship between these variables and the variable to be predicted. The derived statistical relationship is then used to predict the variable of interest. However, causal models have several disadvantages as well. First of all, they are quite difficult to develop. In addition, they require historical data on all the variables to be included in the model. Moreover, the ability to predict the dependent variable depends on the ability of the forecaster to accurately predict future values of the independent variables.

Quantitative prediction methods are used when historical data are available. To be more specific, univariate models predict the future values of the variable of interest solely on the basis of the historical pattern of that variable, assuming that the historical pattern will continue; causal models predict the future values of the variable of interest based on the relationship between that variable and other variables. Qualitative prediction techniques, in contrast, are used when the historical data are scarce or not available at all; they depend on the opinions of experts who subjectively predict future events. In actual practice most prediction systems employ both quantitative and qualitative methods. For example, quantitative methods are used when the existing data pattern is expected to persist, while qualitative methods are used to predict when the existing data pattern might change. Thus the predictions generated by quantitative methods are almost always subjectively evaluated by management.

Construction of a Prediction

One task of the prediction is to look for a technique matching the characteristics of the problem. Important factors are the discrete or continuous nature of the data, whether or not observations are taken at equal time intervals, and whether the data are aggregated over time intervals or correspond to instantaneous values. For example, most techniques based on time series analysis deal with discrete observations taken at equal time intervals. The fact that prediction techniques often produce predictions that are somewhat in error has a bearing on the form of the predictions we require. Two types of predictions are considered:

• the point prediction and
• the interval prediction.

A point prediction is a single number that represents our best prediction (or guess) of the actual value of the variable being predicted, whereas an interval prediction is an interval (or range) of numbers that is calculated so that we are very confident (for instance, with 95% confidence) that the actual value will be contained in the interval. When choosing a prediction technique, the forecaster must consider the following factors.

• The desirable prediction form. In some situations a point prediction may be sufficient; in other situations an interval prediction may be required.

• The time frame. Predictions are generated for points in time that may be a number of minutes, days, weeks, months, or years in the future. This length of time is called the time frame or time horizon. The length of the time frame is usually categorized as immediate term, short term, medium term and long term.

• The cost of prediction. Several costs are relevant to the choice of a prediction technique.
  – Firstly, the cost of developing the model must be considered. The development of a prediction model requires that a set of procedures be followed. The complexity, and hence the cost, of these procedures varies from technique to technique.
  – Secondly, the cost of storing the necessary data must be considered. Some prediction methods require the storage of large amounts of data.
  – Thirdly, the cost of the actual operation of the prediction technique is obviously very important. Some prediction methods are simple to operate, while others are very complex. The degree of complexity can have a definite influence on the total cost of prediction.

• The desirable accuracy. In some situations, a prediction that is in error by as much as 20% may be acceptable; in other situations a prediction that is in error by 1% might be disastrous. The accuracy that can be obtained by using any particular prediction method is always an important consideration.

• The availability of data. Historical data on the variable of interest are used when quantitative prediction methods are employed. The availability of this information is a factor that may determine the prediction method to be used. Since various prediction methods require different amounts of historical data, the quantity of the available data is important. Beyond this, the accuracy and the timeliness of the available data must be examined, since the use of inaccurate or outdated historical data will obviously yield inaccurate predictions. If the needed historical data are not available, special data-collection procedures may be necessary.

• The ease of operation and understanding. The last important factor is the ease with which the prediction method is operated and understood. Managers are responsible for the decisions they make. If they are expected to base their decisions on predictions, they must be able to understand the techniques used to obtain these predictions. A manager will not have confidence in the predictions obtained from a prediction technique he or she does not understand; and if the manager does not have confidence in these predictions, they will not be employed in the decision-making process. Thus the manager's understanding of the prediction system is of crucial importance.

The choice of the prediction method for a particular situation involves the choice of a technique that balances the factors just discussed. It is obvious that the "best" prediction method for a given situation is not always the "most accurate" one. Instead, the prediction method for the given situation should meet its need at the least cost and with the least inconvenience. To make predictions, one needs access to historical data. We define historical data as an ordered collection of data, H_i, that starts at a point in time t_1 and covers events up to a final time t_i. Specifically, H_j = {h_1, h_2, ..., h_j}, 1 ≤ j ≤ i, where the jth element is a pair h_j = (v_j, t_j). The first element of the pair, v_j, indicates the value of one or more variables of interest, whereas the second element, t_j, indicates its occurrence time.
The elements of Hi are ordered in time, that is, tj ≤ tk for any j < k.


A prediction is an estimate of the value of a variable v_{i+k} occurring at time t_{i+k} in the future, conditioned on the historical information H_i. Hence a prediction is the output of a generic function conditioned on H_i, v_{i+k} = g(H_i) + ε_i, in which g(·) is a function capturing the predictable component and ε_i models the possible noise.
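A sketch of this formulation in Python terms is given below; the mean-based g and the ±1.96σ interval are only meant to illustrate the distinction between point and interval predictions made above, not to propose a concrete predictor:

from statistics import mean, stdev
from typing import List, Tuple

History = List[Tuple[float, float]]                  # ordered pairs h_j = (v_j, t_j)

def point_prediction(history: History) -> float:
    # A trivial g(H_i): predict the mean of the observed values.
    return mean(v for v, _ in history)

def interval_prediction(history: History) -> Tuple[float, float]:
    # Approximate 95% interval around the point prediction (needs at least two observations).
    values = [v for v, _ in history]
    centre, spread = mean(values), stdev(values)
    return centre - 1.96 * spread, centre + 1.96 * spread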

5.2.2 Prediction in Dynamic Bayesian Networks

Correlation serves to diminish the number of alarms presented to the operator in distributed systems management; ideally, however, the approach should also facilitate fault prediction, which predicts impending faults from the alarms that have occurred and warns the operator before severe faults happen.

Figure 5.2: Model of Dynamic Bayesian Network (a sequence of time slices BN(1), BN(2), ..., BN(t), BN(t+1), each containing the nodes a, b and c).

Considering the model of a DBN in Figure 5.2, two possible kinds of updates over time are present in DBNs:

• the possible updates of the nodes (variables), and
• the possible updates of the links (dependencies between nodes).

When a distributed system is modeled as a DBN, one important task is to capture the trends of the evolution of the distributed system. This amounts to obtaining BN(t + 1) based on the data set BN(1), BN(2), ..., BN(t). Here BN(t) denotes the updated BN at time t. In a DBN, the prediction can be denoted as

BN(1), BN(2), ..., BN(t) ⇒ BN(t + 1)   (5.3)

In DBNs, the following prediction tasks are to be considered as a result of the management requirements.

• Prediction per individual component. The state of an individual component in a distributed system can change over time because of the degradation or improvement of the component. The prediction of the individual component's change of state can be denoted as

P(v(1)), P(v(2)), ..., P(v(t)) → P(v(t + 1)),  v ∈ V   (5.4)

P(v(t)) represents the probability of the state of component v at time t.

• Prediction of the dependency between components. The modification of dependencies between managed objects derives from updates of the system performance and changes in the correlation between objects. This can be denoted as

P(v(1)|π(v(1))), P(v(2)|π(v(2))), ..., P(v(t)|π(v(t))) → P(v(t + 1)|π(v(t + 1))),  v ∈ V   (5.5)

P(v(t)|π(v(t))) represents the probability of the dependency between node v and its parent π(v) at time t.

• Prediction of potential faults based on backward inference. When the future state of the effect nodes has been estimated, a promising prediction is to trace the causal nodes based on the estimated state of the effect nodes. The prediction from effects to causes is treated as backward inference:

E(t + 1) → C(t + 1)   (5.6)

E(t) denotes the set of effects at time t, and C(t) denotes the set of causes at time t.

Single Factor (Variable) Prediction in DBNs

In distributed systems management, the prediction of an individual managed component and the prediction of the dependencies between managed components belong to single factor prediction. Dynamic changes in distributed systems are identified as nonlinear time series. For a prediction in nonlinear time series, there are some commonly used methods, such as ANN (Artificial Neural Network), Markov prediction, and linear or nonlinear regression [TH98]. An ANN uses learning strategies for prediction by constantly training the network model on a large data set. It has a large computing complexity and is infeasible for the prediction of individual nodes or links in large Bayesian networks, which hold large numbers of nodes and links. A Markov process has no memory of historical data; it can only make a one-step prediction. Hence a Markov process cannot efficiently make use of historical data. Regression analysis aims to discover the statistical principles underlying a stochastic process. In regression analysis, one important preliminary task is to establish a regression equation. In real-life systems, it is not easy to give the exact type of the equation. A polynomial equation is often used as the regression equation when the type of the equation cannot be determined, because any kind of curve can be approximated by a polynomial. In distributed systems, the changes are often identified as depending on a single variable (time). Least Squares Fit (LSF) is an appropriate approach for single variable polynomial regression [WG94]. Suppose that we want to fit a polynomial

y = a_0 + a_1 x + ... + a_m x^m   (5.7)

with the data points (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n). In (x_i, y_i), x_i denotes the time dimension and y_i denotes the time-series data. Then we have

y_1 = a_0 + a_1 x_1 + ... + a_m x_1^m
y_2 = a_0 + a_1 x_2 + ... + a_m x_2^m
...
y_n = a_0 + a_1 x_n + ... + a_m x_n^m   (5.8)

We use a matrix to model the equations above:

y = A v,  where
y = [y_1, y_2, ..., y_n]^T,
A = [[1, x_1, x_1^2, ..., x_1^m], [1, x_2, x_2^2, ..., x_2^m], [1, x_3, x_3^2, ..., x_3^m], ..., [1, x_n, x_n^2, ..., x_n^m]],
v = [a_0, a_1, ..., a_m]^T.   (5.9)

Multiplying both sides by A^T gives the normal equations A^T A v = A^T y, hence

v = (A^T A)^{-1} A^T y   (5.10)

From the calculation above, the coefficients of the polynomial

y = a_0 + a_1 x + ... + a_m x^m   (5.11)

is obtained. The prediction of the value y can then be obtained for a given x (time variable). As an example, dynamic changes are considered in the failure time of component D illustrated in Figure 4.3. We use the statistics recorded in the system log to calculate the failure time of component D. The weekly failure rates of component D are presented in Table 5.1.

Table 5.1: The Data Set of Failure Ratio of Component D in the Campus Network.

Week  Failure rate (%)   Week  Failure rate (%)   Week  Failure rate (%)   Week  Failure rate (%)   Week  Failure rate (%)
 1    0.000              12    0.000              23    0.000              34    0.298              45    0.000
 2    6.860              13    0.000              24    0.000              35    0.000              46    0.000
 3    0.000              14    0.000              25    6.220              36    1.617              47    0.000
 4    0.000              15    0.000              26    0.030              37    0.000              48    0.050
 5    0.000              16    0.000              27    1.964              38    0.000              49    0.000
 6    0.000              17    0.000              28    0.000              39    0.000              50    0.000
 7    0.000              18    0.000              29    0.893              40    0.000              51    0.000
 8    0.000              19    0.000              30    1.081              41    0.000
 9    0.000              20    0.000              31    0.000              42    0.000
10    0.248              21    0.000              32    0.000              43    0.000
11    0.000              22    0.000              33    0.000              44    0.000

For the sake of simplification, a fourth order polynomial is used to perform the nonlinear regression:

Y_t = A + B·X_t + C·X_t^2 + D·X_t^3 + E·X_t^4   (5.12)

Based on the data set shown in Table 5.1 (the failure rate is derived from the component failure time), the best-fit values are obtained: A = 3.046, B = −0.7170, C = 0.05246, D = −0.001407, E = 1.2430e−005.
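The fit just described, and the extrapolation to week 52 derived below, can be reproduced with a standard least-squares routine; the following Python sketch uses numpy.polyfit instead of the explicit normal equations (5.10) and takes the 51 weekly failure rates directly from Table 5.1:

import numpy as np

weeks = np.arange(1, 52)                                  # weeks 1..51
rates = np.array(
    [0.000, 6.860, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.248, 0.000]      # weeks 1-11
    + [0.000] * 11                                                                      # weeks 12-22
    + [0.000, 0.000, 6.220, 0.030, 1.964, 0.000, 0.893, 1.081, 0.000, 0.000, 0.000]     # weeks 23-33
    + [0.298, 0.000, 1.617, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000]     # weeks 34-44
    + [0.000, 0.000, 0.000, 0.050, 0.000, 0.000, 0.000])                                # weeks 45-51

coeffs = np.polyfit(weeks, rates, deg=4)                  # least-squares fit of Equation (5.12)
poly = np.poly1d(coeffs)
rate_52 = poly(52)                                        # point prediction for week 52 (%)
downtime_minutes = rate_52 / 100.0 * 7 * 24 * 60          # failure rate applied to one week
print(rate_52, downtime_minutes)

The coefficients and the week-52 value produced this way can be compared with those quoted in the text; small differences may arise from rounding.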


Now the value of the failure state for week 52 in component D can be predicted by

Y_t = A + B·X_t + C·X_t^2 + D·X_t^3 + E·X_t^4 = 3.046 − 0.7170·52 + 0.05246·52^2 − 0.001407·52^3 + 1.2430e−005·52^4 = 0.6648   (5.13)

Hence in the 52nd week, the estimated non-availability time of component D is 0.6648% × 10080 minutes (7 days) ≈ 67.01 minutes. In real-life prediction, deviations are unavoidable; the deviation between the predicted and the observed down-time is due to prediction errors. These prediction errors can be used to correct a future estimation. From a practical point of view, the alerts or reminders issued by the management system provide the manager with helpful references for a better system protection when the estimated failure time exceeds a threshold. The nonlinear prediction for the dynamic update of dependencies in DBNs can be processed by the same procedure. From the application point of view, the order of the polynomial used in the regression can be determined by balancing the complexity of the computation against the expected precision of the prediction. In real-life distributed systems, dynamic changes may not spread to every object and every dependency link within a management period. Most of the time only partial updates are ongoing, or else the system would be totally unstable and it would be hard to provide applicable predictions of system behavior. The precision of the prediction in time series is related to the size of the data set: the prediction will be more precise if larger data sets are available. Hence this type of prediction is advisable for the long-term behavior of distributed systems. Meanwhile, noise in the data is another factor that affects the result of the prediction.

Prediction Based on Probabilistic Inferences in Distributed Systems

In principle, many of the issues that have challenged traditional approaches to diagnosis (ranking possible failures, handling multiple simultaneous failures, and robustness) can be addressed within a probabilistic tracking framework. Suppose the final estimated values of the dynamic changes in the whole DBN have already been obtained from the polynomial regression stated in Section 5.2.2. Now consider backward inference in a static BN. E is defined as the set of effects (evidence) which we can observe, and C as the set of causes.

Figure 5.3: Basic Model for Backward Inference in Dynamic Bayesian Networks (causes x1(t), x2(t), ..., xn(t) with common effect Y(t)).


In BNs, one node may have one or several parents (if it is not a root node), and the dependency between the parents and their child is denoted by a JPD (Joint Probability Distribution). Figure 5.3 shows the basic model for backward inference in DBNs. Let X^(t) = (x_1^(t), x_2^(t), ..., x_n^(t)) be the set of causes of effect Y^(t) at time t. According to the definition of BNs, the following variables are known: P(x_1^(t)), P(x_2^(t)), ..., P(x_n^(t)), and P(Y^(t) | x_1^(t), x_2^(t), ..., x_n^(t)) = P(Y^(t) | X^(t)). Here x_1^(t), x_2^(t), ..., x_n^(t) are mutually independent. By Bayes' theorem it follows that:

    P(X^(t) | Y^(t)) = P(Y^(t) | X^(t)) P(X^(t)) / P(Y^(t))
                     = [ P(Y^(t) | X^(t)) ∏_{i=1}^{n} P(x_i^(t)) ] / ( ∑_{X^(t)} [ P(Y^(t) | X^(t)) ∏_{i=1}^{n} P(x_i^(t)) ] )        (5.14)

The posterior probability of an individual cause x_i^(t) is obtained by marginalizing over the remaining causes:

    P(x_i^(t) | Y^(t)) = ∑_{X^(t) \ x_i^(t)} P(X^(t) | Y^(t))                                                                        (5.15)

Then we can obtain:

    P(x_i^(t) | Y^(t)) = ∑_{X^(t) \ x_i^(t)} [ P(Y^(t) | X^(t)) ∏_{j=1}^{n} P(x_j^(t)) ] / ( ∑_{X^(t)} [ P(Y^(t) | X^(t)) ∏_{j=1}^{n} P(x_j^(t)) ] )        (5.16)

In Equation 5.16, X^(t) \ x_i^(t) = X^(t) − {x_i^(t)}. According to Equation 5.16, the individual conditional probability P(x_i^(t) | Y^(t)) can be obtained from the JPD P(Y^(t) | X^(t)), where X^(t) = (x_1^(t), x_2^(t), ..., x_n^(t)). The backward dependency can be obtained from Equation 5.14. The dashed arrows in Figure 5.3 denote the backward inference from the effect Y^(t) to the individual causes x_i^(t), i ∈ [1, 2, ..., n]. When backward inference is considered in DBNs, dynamic changes in individual nodes or individual dependencies may propagate through the whole DBN and thus modify the strongest dependency routes and the ranking of the dependent sequence of causal nodes. The simulation results in Section 6.3 show more details of this kind of dynamic change in DBNs.
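As a purely illustrative sketch of Equation 5.16 for binary variables, the following hypothetical C++ fragment enumerates all configurations of the causes; the priors and the conditional probability function used here are invented values, not data from the thesis.

#include <cstdio>
#include <vector>

/* Hypothetical illustration of Equation 5.16 for binary variables: the posterior
   P(x_i = 1 | Y = 1) is the sum of P(Y = 1 | X) * prod_j P(x_j) over all
   configurations X with x_i = 1, normalized by the sum over all configurations. */
double posterior(int i, int n,
                 const std::vector<double>& prior,            /* prior[j] = P(x_j = 1) */
                 double (*cpt)(const std::vector<int>&)) {    /* returns P(Y = 1 | X) */
    double numer = 0.0, denom = 0.0;
    for (int mask = 0; mask < (1 << n); ++mask) {
        std::vector<int> x(n);
        double pX = 1.0;
        for (int j = 0; j < n; ++j) {
            x[j] = (mask >> j) & 1;
            pX *= x[j] ? prior[j] : 1.0 - prior[j];
        }
        double joint = cpt(x) * pX;                           /* P(Y = 1 | X) * P(X) */
        denom += joint;
        if (x[i] == 1) numer += joint;
    }
    return numer / denom;
}

/* Example CPT (invented, noisy-OR-like): Y stays normal only if all causes are normal. */
double exampleCPT(const std::vector<int>& x) {
    double masked = 1.0;
    int abnormal = 0;
    for (int v : x) { if (v) { abnormal++; masked *= 0.3; } }
    return abnormal ? 1.0 - masked : 0.0;
}

int main() {
    std::vector<double> prior = {0.01, 0.02, 0.005};          /* assumed priors P(x_j = 1) */
    for (int i = 0; i < 3; ++i)
        std::printf("P(x%d=1 | Y=1) = %.4f\n", i + 1, posterior(i, 3, prior, exampleCPT));
    return 0;
}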

5.2.3  Analysis of the Prediction Strategies

For dynamic systems there are numerous prediction strategies. Different strategies have different areas of application; there is no single approach that is best in all situations. To achieve a precise prediction, the selected approach should take the character of the data into account. Nearly all prediction strategies focus on a particular prediction task, hence the identification of the change pattern in the data is the preliminary step. The Artificial Neural Network (ANN) is a common approach for addressing prediction tasks in computer science. An ANN uses a black-box model and improves its predictions by continually feeding new input to train the network; however, it is hard to establish an ANN for every dynamic node or link in a large distributed system. A Markov process can be used for one-step inference in a stochastic process, but it has no memory of history and cannot make full use of the large amount of historical data in the prediction. Dynamic Bayesian networks are appropriate for the management of dynamic distributed systems because they can perform the prediction both for a single factor (variable) and for dependent factors (dependent variables).


The approach demonstrates its efficiency in modeling dynamic changes in distributed systems management by employing the time factor. LSF is a simple approach: it is easy to establish a stable model for automatic computation, which makes it feasible even in large-scale distributed systems with thousands of nodes and links. Unfortunately, all prediction situations involve uncertainty to some degree. The presence of an irregular component, which represents unexplained or unpredictable fluctuations in the data, means that some errors in the prediction cannot be ignored. If the effect of the irregular component is small, the determination of the appropriate trend and of seasonal or cyclic patterns should allow us to predict with higher accuracy. Real-life systems, however, are highly complex, and how to select the best approach in terms of precision and efficiency remains an open research topic. In distributed systems, dynamic changes occur only on part of the nodes and links (variables), so only the dynamic factors are considered in the DBN model; thus the complexity of the computation is decreased greatly [DBK05]. One of the important uses of prediction in a dynamic system is to produce an alert or alarm for the manager before a potential failure becomes reality, namely when the predicted values violate a threshold. In real applications, the thresholds are set according to the management requirements.
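A minimal sketch of such a threshold check is given below; the threshold value, the component name and the message format are assumptions chosen for illustration, since in practice they follow from the management requirements.

#include <cstdio>

/* Minimal sketch of threshold-based alerting on a predicted downtime value;
   the threshold and the message format are illustrative assumptions. */
void checkDowntimeThreshold(const char* component, double predictedMinutes,
                            double thresholdMinutes) {
    if (predictedMinutes > thresholdMinutes)
        std::printf("ALERT: predicted downtime of %s (%.1f min) exceeds threshold (%.1f min)\n",
                    component, predictedMinutes, thresholdMinutes);
}

int main() {
    /* week-52 prediction for component D from Section 5.2.2, hypothetical 30-minute threshold */
    checkDowntimeThreshold("component D", 67.01, 30.0);
    return 0;
}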


Chapter 6

Simulation Measurement

Simulation schemes for probabilistic inference in Bayesian networks have many advantages over exact inference algorithms. The use of randomly generated Bayesian networks is a good way to test the robustness and convergence of simulation schemes [DJL+06]. In this chapter the detailed scheme of a simulation implementation for distributed systems is provided to evaluate the probabilistic inference based on Bayesian networks.

6.1  Construction of the Simulation for Bayesian Networks

6.1.1  Generation of Pseudo-Random Numbers

Random numbers play a central role in simulation algorithms for probabilistic inference with Bayesian networks. In practice, a pseudo-random number generator is a deterministic program that produces a sequence of numbers which merely appears random; such numbers are called pseudo-random numbers. It is known that the traditional UNIX-supplied rand() is flawed, and quite a number of implementations based on ANSI C libraries belong to the category "totally botched". Based on the properties of the simulation for distributed systems, one type of shift-register generator, the Twisted Generalized Feedback Shift Register (TGFSR) algorithm, is selected as the pseudo-random number generator. Basically, there is no single quantitative measure of merit for random number generators (RNGs). However, some criteria do exist to keep the risk of incorrect simulation results as small as possible. Some desirable properties of random number generators are as follows [Knu97]:

• Randomness
• Reproducibility
• Speed
• Large cycle length

Well-designed algorithms for random number generation allow us to find conditions on their parameters that guarantee a certain period length of the output sequence. Furthermore, certain weaknesses of an algorithm can be detected in advance by theoretical analysis.


The Generalized Feedback Shift Register Generator

The generalized feedback shift register (GFSR) generator [LP73] is a widely used pseudo-random number generator based on the linear recurring equation

    X_{l+n} := X_{l+m} ⊕ X_l,   (l = 0, 1, ...)                                          (6.1)

where each X_l is a word of size w with components 0 or 1, and ⊕ denotes the bitwise exclusive-or operation. Thereby, the algorithm generates in parallel as many m-sequences as the word length. By regarding one word as a real number between 0 and 1, the algorithm generates [0, 1] uniform pseudo-random real number sequences. This algorithm has the following properties [MK92]:

• Advantages of GFSR
  – The generation is very fast. Generation of one pseudo-random number requires only three memory references and one exclusive-or operation.
  – The sequence has an arbitrarily long period, independent of the word size of the machine.
  – The implementation is independent of the word size of the machine.

• Disadvantages of GFSR
  – The selection of initial seeds is very critical and influences the randomness; good initialization is rather involved and time consuming.
  – Each bit of a GFSR sequence can be regarded as an m-sequence based on the trinomial t^n + t^m + 1, which is known to have poor randomness.
  – The period of a GFSR sequence, 2^n − 1, is far smaller than the theoretical upper bound, i.e., the number of possible states 2^{nw}.
  – The algorithm requires n words of working area, which is memory consuming if a large number of generators is implemented simultaneously.

Twisted Generalized Feedback Shift Register Generator

The twisted GFSR generator (TGFSR for short) [MK92], designed by M. Matsumoto and Y. Kurita, solves the above four drawbacks of the original GFSR generator. The TGFSR generator is the same as the GFSR generator, except that it is based on the linear recurrence

    X_{l+n} := X_{l+m} ⊕ X_l A,   (l = 0, 1, ...)                                        (6.2)

where A is a w × w matrix with 0-1 components and X_l is regarded as a row vector over GF(2). With a suitable choice of n, m, and A, the TGFSR generator attains the maximal period 2^{nw} − 1; i.e., it assumes all possible states except the zero state in one period. Because of this maximality, the TGFSR generator improves on the above four drawbacks of the GFSR generator as follows:

• The initialization is simple. Any initial seed except all zeros produces the identical sequence disregarding the phase, and hence no special initialization technique (as shown in [FT83]) is necessary. Moreover, since the working area is far smaller, the initialization is very fast.

• The recurrence is based not on a trinomial, but on a primitive polynomial with many terms.

• The period of the generated sequence attains the theoretical upper bound, 2^{nw} − 1.

• The working area is 1/w of the size that the GFSR requires to attain the same period. For example, assume that the word size of the machine is 32. Then the TGFSR generator requires only 25 words of working area to attain the period 2^{25*32} − 1, while the GFSR generator requires 800 words of working area for the same period.

• The sequence is n-equidistributed, which is the best possible. [MK92] provides an efficient algorithm to obtain the order of equidistribution, together with a tight upper bound on the order.

TGFSR generators are most suitable for the simulation of a large distributed system, which requires a large number of mutually independent pseudo-random number generators, for the following two reasons: first, the TGFSR generators consume a far smaller amount of working area than the GFSR does; second, a large number of distinct TGFSR generators can be implemented by merely changing the parameters m and A. The algorithm description and theoretical discussion of TGFSR generators can be found in [MK92] [MK94a].
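The following C++ sketch illustrates the recurrence of Equation 6.2 with a circular buffer; multiplication by the companion-form matrix A reduces to a one-bit shift and a conditional exclusive-or with a constant row vector a. The parameters resemble those of the TT800 generator from [MK94a], but the tempering step is omitted, so this is a sketch of the recurrence only and not a drop-in replacement for a published generator.

#include <cstdint>
#include <cstdio>

/* Sketch of the TGFSR recurrence X_{l+n} = X_{l+m} XOR X_l * A (Equation 6.2).
   Multiplying by the companion-form matrix A amounts to a right shift by one bit
   and, if the lowest bit of X_l is set, an exclusive-or with the constant 'a'. */
class TGFSR {
    static const int N = 25, M = 7;                   /* (n, m) as in TT800 */
    static const uint32_t A_CONST = 0x8ebfd028u;      /* last row 'a' of the matrix A */
    uint32_t x[N];
    int idx;
public:
    explicit TGFSR(uint32_t seed) : idx(0) {
        /* simple non-zero initialization; any state except all zeros is acceptable */
        for (int i = 0; i < N; ++i)
            x[i] = seed = seed * 1664525u + 1013904223u;
    }
    uint32_t next() {
        uint32_t xi = x[idx];
        uint32_t xA = (xi >> 1) ^ ((xi & 1u) ? A_CONST : 0u);    /* X_l * A */
        uint32_t xnew = x[(idx + M) % N] ^ xA;                   /* X_{l+m} XOR X_l * A */
        x[idx] = xnew;
        idx = (idx + 1) % N;
        return xnew;
    }
    double uniform01() { return next() * (1.0 / 4294967296.0); } /* map to [0, 1) */
};

int main() {
    TGFSR rng(20070305u);
    for (int i = 0; i < 5; ++i)
        std::printf("%.6f\n", rng.uniform01());
    return 0;
}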

6.1.2  Random Generation of Bayesian Networks

The simulation follows the guidelines in Pearl's book [Pea88] to create a method which randomly generates Bayesian networks with a random graph structure and random conditional probability tables. Part of our code is included in Appendices A, B, C. The algorithm is controlled by two parameters, N and M: N is the number of variables, and M is the maximum number of parents any variable may have. As for the (conditional) probabilities of the variables, there are two ways to produce the probability tables for the network. One is to assign the probabilities by hand based on an existing network; the other is to generate them with random number generators. When testing a simulation method, many possible probability distributions are tested, especially those with high variability between small and large node probabilities. One advantage of randomly generated probabilities is that not only uniform distributions but also more extreme probability distributions can be produced. The simulation program can generate distributions within a specified interval, such as [0.9, 1] or [0, 0.1]. In many real-life distributed systems, the diagnosis model, which has the states "normal" and "abnormal", indeed follows such an extreme probability distribution. In this way, the simulation can test the convergence behavior of simulation algorithms on extreme probability distributions. Theoretically speaking, our simulation program can produce a Bayesian network with any number of nodes and any probability distributions of the nodes. Running a simulation algorithm on a network with extreme distributions is a good test to verify the convergence rate of the simulation.
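As a small illustration of drawing probabilities restricted to such an interval, the following hypothetical C++ fragment uses std::mt19937 as a stand-in for the TGFSR generator; the interval bounds and the seed are arbitrary example values.

#include <cstdio>
#include <random>

/* Draw a probability restricted to a specified interval, e.g. [0.9, 1] or [0, 0.1],
   to obtain extreme probability distributions for the generated network.
   std::mt19937 is used here only as a stand-in for the TGFSR generator. */
double randomProbInInterval(std::mt19937& rng, double lo, double hi) {
    std::uniform_real_distribution<double> dist(lo, hi);
    return dist(rng);
}

int main() {
    std::mt19937 rng(2007);
    for (int i = 0; i < 3; ++i)
        std::printf("extreme prior: %.4f\n", randomProbInInterval(rng, 0.9, 1.0));
    return 0;
}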

6.1.3  Implementation of the Simulation Program

Data Structure

In the simulation, there are two major data structures: the data structure for an individual node and the data structure for Bayesian networks. Below is the data structure for an individual node:

class nodeInfo {
public:
    int parentNum;                 /* number of parent nodes */
    int state;                     /* state of the node: 0-normal, 1-abnormal */
    int childNum;                  /* number of children */
    int parent[MAXPARENT];         /* index numbers of the parent nodes */
    double antiProb[MAXPARENT];    /* the probability of the states of parent nodes */
    double selfProb;               /* the probability of the state of the node */
private:
    double* p_condiProb;           /* address of the CPT */
public:
    nodeInfo();
    ~nodeInfo();
    double prob_X(int stateX);     /* the probability of state = stateX */
    void set_prob_X(double prob);  /* set the probability of the node */
    double prob_X_case_A(int stateX, int stateA);
        /* the probability P(X=stateX|A=stateA) when only one parent exists,
           A is the parent of X */
    void set_prob_X_case_A(double prob, int stateA);
        /* set the probability P(X=1|A=stateA) of the node X,
           A is the only parent of X */
    double prob_X_case_AB(int stateX, int stateA, int stateB);
        /* the probability P(X=stateX|A=stateA,B=stateB) of X which has two
           parents, A and B are the parent nodes of X */
    void set_prob_X_case_AB(double prob, int stateA, int stateB);
        /* set the probability P(X=1|A=stateA,B=stateB) of the node X,
           A and B are the parent nodes of X */
    double prob_X_case_ABC(int stateX, int stateA, int stateB, int stateC);
        /* the probability P(X=stateX|A=stateA,B=stateB,C=stateC) of X,
           A, B and C are the parent nodes of X */
    void set_prob_X_case_ABC(double prob, int stateA, int stateB, int stateC);
        /* set the probability P(X=1|A=stateA,B=stateB,C=stateC),
           A, B and C are the parent nodes of X */
    double prob_A_case_X(int indexA, int stateA, int stateX);
        /* the probability P(A=stateA|X=stateX), A is one parent node of X */
    void set_prob_A_case_X(double prob, int indexA);
        /* set P(A=1|X=1), A is one parent node of X */
};

The data structure of a Bayesian network is defined as follows:

class beliefNet {
private:
    int maxNodeNum;                /* number of total nodes */
    int sourceNum;                 /* number of causal nodes in the Bayesian network */
    int resultNum;                 /* number of effect nodes in the Bayesian network */
    nodeInfo Node[MAXNODE];        /* all the nodes */
    double *realRecordPtr;
    double *realRecordPtr2;
    double *realRecordPtrA;
    int effect[MAXNODE];           /* record the number of times state = 1 */
public:
    beliefNet();                                    /* generation of the BN */
    beliefNet(int maxNum, int srcNum, int rstNum);  /* generation of the BN */
    beliefNet(const beliefNet& bn);
    void Create(BYSDATA bysdata);
    void initProb();               /* initialize the probabilities */
    void CaculateProb();           /* calculation of the probabilities */
    void CaculateAntiProb();       /* calculation of the backward probabilities */
    void SpanTree();
    double StrongestDepend2(int beginNum, int *begin, int end);
    void PredictSource();          /* inference from one effect node */
    void PredictSource2();         /* inference from two effect nodes */
    void PredictSourceA();         /* inference from multiple effect nodes */
    void ProduceEffect();          /* RN as input, simulate the output */
    int Correlation(int nodeNum, int* nodeptr);
        /* the correlation of nodes: number of occurrences for multiple nodes
           when their states = 1; nodeNum: the number of nodes considered,
           nodeptr: pointer to the array of the corresponding nodes */
private:
    void SetParentNum(int nodeIndex);   /* set the number of parents */
    bool ComparePreParent(int nodeIndex, int curParent, int possibleParent);
    void SetParentID(int nodeIndex);    /* set the label of the parent node */
    bool SetChild(int parentID);
};

In the simulation program, the number of total nodes, the number of source nodes, the number of effect nodes and the number of valid test runs are provided by the user who generates the BN.

Major Tasks in Building a Simulation in Bayesian Networks

1. To use the TGFSR to generate the pseudo-random numbers for the simulation in Bayesian networks. To generate a Bayesian network, the number of total nodes, causal nodes and effect nodes is provided to the simulation program. The number of valid test runs is also provided to control the scale and the accuracy of the simulation. During the generation of parent nodes, we keep the rule that the index number of a parent node is always less than that of its children, so that no cycle occurs in the Bayesian network, i.e., a DAG is generated. Another factor considered in the simulation is to avoid an extreme distribution of the nodes in the simulated network.

2. To generate the Conditional Probability Table (CPT) for the Bayesian network. In the simulated Bayesian network there is a CPT for every node except the source (causal) nodes. The probability of a node (except for the source nodes) is obtained from its parents and the attached CPT. In fault management, the states of the managed objects are identified as "normal" (0) or "abnormal" (1), so the state of every node is 0 or 1. The managed objects hold dependencies by which they keep cause-effect relationships with each other. Hence, in the simulation, the default probabilities obey the following rule:

       P(e = 0 | c_1, ..., c_n) = { 0,     if c_i = F (abnormal, 1), ∀i ∈ [1, n];
                                    100%,  if c_i = T (normal, 0), ∀i ∈ [1, n].          (6.3)

   Here c = {c_1, ..., c_n} is the set of causal nodes and e is their immediate child. The rule can be read as follows: if the states of all parent nodes are 0 (normal), the state of the child node must be 0 (normal); if the states of all parent nodes are 1 (abnormal), the state of the child node must be 1 (abnormal). Another principle is that the more "abnormal" parent nodes there are, the larger the probability of "abnormal" is for their child; likewise, the more "normal" parent nodes there are, the larger the probability of "normal" is for their child. Suppose C_B ⊆ C_A ⊆ C_I, C_{I−A} = C_I − C_A, C_{I−B} = C_I − C_B, and C_I = π(e) = (c_1, c_2, ..., c_n); then

       P(e = 1 | C_A = 1, C_{I−A} = 0) ≥ P(e = 1 | C_B = 1, C_{I−B} = 0)
       P(e = 0 | C_A = 0, C_{I−A} = 1) ≥ P(e = 0 | C_B = 0, C_{I−B} = 1)                 (6.4)

   Here C_A = 1 is shorthand for ∀c_i ∈ C_A: c_i = 1, and C_{I−A} = 0 for ∀c_j ∈ C_{I−A}: c_j = 0; the remaining symbols are read analogously. This principle keeps the simulation close to the situation in real-life distributed systems and makes the simulation more realistic for fault management.

3. Based on the calculations in Equations 4.2-4.5, the backward probability

       P(x_i | Y) = ∑_{X \ x_i} P(X | Y)                                                  (6.5)

   can be obtained, where x_i is one of the parents of Y and X \ x_i = X − {x_i}. For a node with n parent nodes, there are n backward probabilities. This one-step backward probability is the basis for the backward inference in the Bayesian network.

4. To work out the backward strongest dependency route step by step from effects to causes by the SDR algorithm (see Section 4.2.2), given the backward probability for every link in the Bayesian network. The result of the SDR algorithm is a spanning tree rooted at the effect nodes. After the unitization of the probabilities of the causal nodes, the simulation also generates a ranked sequence of probabilities of the causal nodes. The simulation program takes into account both single-effect inference and multiple-effect inference.

5. To produce the sample data for the simulation (a simplified sketch of this step follows after this list). A random number between 0 and 1 is generated by the TGFSR algorithm, and the Bayesian network model is the input for the test. The process produces a random number for every source node, compares it with the assigned probability of that source node, and decides whether the state of the node is 1 or 0. For the other nodes (non-cause nodes), the probability of "abnormal" is worked out from the states of their parent nodes and the local CPT; then a random number is generated and compared with this probability of "abnormal", which finally determines the state of the node. After all nodes have been processed, one simulation run is complete. If the states of all nodes besides the source nodes are 0 (normal), no "abnormal" node occurred in the simulated network; such a run contributes nothing to the simulation result, and the invalid run is simply discarded. After K valid simulation runs, the program has produced K groups of data recording the state of every node, and the statistics over all the data can be used further in the simulation.
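The sketch referred to in item 5 is given below. It uses a tiny three-node network (two causal nodes A and B with one effect node C) whose priors and CPT are invented for illustration and chosen to respect the rule of Equation 6.3; std::mt19937 again stands in for the TGFSR generator.

#include <cstdio>
#include <random>

/* Simplified sketch of the sampling step of item 5 on a three-node network:
   draw the causal nodes from their priors, draw the effect node from its CPT,
   discard runs in which all non-source nodes are normal, and collect statistics. */
int main() {
    std::mt19937 rng(2007);                        /* stand-in for the TGFSR generator */
    std::uniform_real_distribution<double> u(0.0, 1.0);

    const double pA = 0.02, pB = 0.05;             /* assumed priors P(A=1), P(B=1) */
    /* CPT for C indexed by (A,B); the corner cases follow Equation 6.3 */
    const double cpt[2][2] = { {0.0, 0.6}, {0.7, 1.0} };

    const int K = 50000;                           /* requested number of valid runs */
    int valid = 0, aAbnormal = 0;
    while (valid < K) {
        int a = (u(rng) < pA) ? 1 : 0;             /* draw the causal nodes */
        int b = (u(rng) < pB) ? 1 : 0;
        int c = (u(rng) < cpt[a][b]) ? 1 : 0;      /* draw C from its parents and CPT */
        if (c == 0) continue;                      /* all non-source nodes normal: discard */
        ++valid;
        aAbnormal += a;                            /* statistics over the valid runs */
    }
    std::printf("statistic P(A=1 | C=1) ~= %.4f\n", (double)aAbnormal / K);
    return 0;
}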

6.2  Simulation Measurement for Probabilistic Inference

6.2.1  A Simulation of Backward Inference

The most common approach to reasoning with uncertain information about dependencies in distributed systems is probabilistic inference, which traces the causes from the effects. The task of backward inference is to find the most probable instances and the key factors related to the defects in distributed systems. To start an example simulation, Figure 6.1 presents the topology of the simulated Bayesian network, which is generated by the simulation program. The Bayesian network has 22 nodes in total, 5 of which are source nodes and 3 of which are effect nodes. The cause set is C = {A, B, C, D, E}, the effect set is E = {T, U, V}. In the production of the JPD of the Bayesian network, the total number of test runs is 34850089, of which 50000 are valid. This simulates a managed distributed system with 22 managed objects, 5 nodes for causes (sources of faults) and 3 nodes for effects (evidence of faults); 1 denotes abnormal, 0 denotes normal.

Figure 6.1: The Simulation of Bayesian Network with 22 Nodes.

The joint probability table is available:

Node A: P(A=1) = 0.009349
Node B: P(B=1) = 0.007087
Node C: P(C=1) = 0.011781
Node D: P(D=1) = 0.019556
Node E: P(E=1) = 0.006696
Node F:
P(F=1|D=0) = 0.000000
P(F=1|D=1) = 0.217880
Node G:

P(G=1|D=0) = 0.000000
P(G=1|D=1) = 0.003320
Node H:
P(H=1|E=1,B=1,F=1) = 0.840243
P(H=1|E=1,B=1,F=0) = 0.711140
P(H=1|E=1,B=0,F=1) = 0.815905
P(H=1|E=1,B=0,F=0) = 0.654529
P(H=1|E=0,B=1,F=1) = 0.788202
P(H=1|E=0,B=1,F=0) = 0.474936
P(H=1|E=0,B=0,F=1) = 0.164596
P(H=1|E=0,B=0,F=0) = 0.000000
Node I:
P(I=1|C=0) = 0.000000
P(I=1|C=1) = 0.111561
Node J:
P(J=1|C=1,A=1) = 0.999575
P(J=1|C=1,A=0) = 0.215495
P(J=1|C=0,A=1) = 0.616537
P(J=1|C=0,A=0) = 0.000000
Node K:
P(K=1|E=1,H=1,I=1) = 0.815616
P(K=1|E=1,H=1,I=0) = 0.060184
P(K=1|E=1,H=0,I=1) = 0.627452
P(K=1|E=1,H=0,I=0) = 0.051157
P(K=1|E=0,H=1,I=1) = 0.778522
P(K=1|E=0,H=1,I=0) = 0.016252
P(K=1|E=0,H=0,I=1) = 0.546357
P(K=1|E=0,H=0,I=0) = 0.000000
Node L:
P(L=1|B=0) = 0.000000
P(L=1|B=1) = 0.330844
Node M:
P(M=1|G=0) = 0.000000
P(M=1|G=1) = 0.456754
Node N:
P(N=1|L=1,J=1) = 0.887628
P(N=1|L=1,J=0) = 0.763248
P(N=1|L=0,J=1) = 0.554206
P(N=1|L=0,J=0) = 0.000000
Node O:
P(O=1|D=0) = 0.000000
P(O=1|D=1) = 0.006831
Node P:
P(P=1|L=1,M=1,N=1) = 0.857328
P(P=1|L=1,M=1,N=0) = 0.210025
P(P=1|L=1,M=0,N=1) = 0.179157
P(P=1|L=1,M=0,N=0) = 0.168073
P(P=1|L=0,M=1,N=1) = 0.558186
P(P=1|L=0,M=1,N=0) = 0.022194
P(P=1|L=0,M=0,N=1) = 0.108027
P(P=1|L=0,M=0,N=0) = 0.000000
Node Q:
P(Q=1|O=1,M=1,H=1) = 0.975930
P(Q=1|O=1,M=1,H=0) = 0.661973
P(Q=1|O=1,M=0,H=1) = 0.021971
P(Q=1|O=1,M=0,H=0) = 0.004367
P(Q=1|O=0,M=1,H=1) = 0.196155
P(Q=1|O=0,M=1,H=0) = 0.072069
P(Q=1|O=0,M=0,H=1) = 0.012076
P(Q=1|O=0,M=0,H=0) = 0.000000
Node R:
P(R=1|O=0) = 0.000000
P(R=1|O=1) = 0.063444

Node S:
P(S=1|F=1,K=1,A=1) = 0.983189
P(S=1|F=1,K=1,A=0) = 0.207990
P(S=1|F=1,K=0,A=1) = 0.122115
P(S=1|F=1,K=0,A=0) = 0.032312
P(S=1|F=0,K=1,A=1) = 0.271526
P(S=1|F=0,K=1,A=0) = 0.017029
P(S=1|F=0,K=0,A=1) = 0.095369
P(S=1|F=0,K=0,A=0) = 0.000000
Node T:
P(T=1|O=1,S=1) = 0.875635
P(T=1|O=1,S=0) = 0.372772
P(T=1|O=0,S=1) = 0.363458
P(T=1|O=0,S=0) = 0.000000
Node U:
P(U=1|S=1,P=1) = 0.819306
P(U=1|S=1,P=0) = 0.614213
P(U=1|S=0,P=1) = 0.548586
P(U=1|S=0,P=0) = 0.000000
Node V:
P(V=1|S=1,Q=1,R=1) = 0.936333
P(V=1|S=1,Q=1,R=0) = 0.641712
P(V=1|S=1,Q=0,R=1) = 0.618077
P(V=1|S=1,Q=0,R=0) = 0.352443
P(V=1|S=0,Q=1,R=1) = 0.451359
P(V=1|S=0,Q=1,R=0) = 0.252283
P(V=1|S=0,Q=0,R=1) = 0.152592
P(V=1|S=0,Q=0,R=0) = 0.000000

Based on the SDR algorithm, the backward probability from every child to each of its immediate parents is as follows:

P(S=1|V=1) = 0.930602
P(Q=1|V=1) = 0.066333
P(R=1|V=1) = 0.003256
P(S=1|U=1) = 0.521130
P(P=1|U=1) = 0.479624
P(O=1|T=1) = 0.115376
P(S=1|T=1) = 0.884909
P(F=1|S=1) = 0.135138
P(K=1|S=1) = 0.023085
P(A=1|S=1) = 0.850368
P(O=1|R=1) = 1.000000
P(O=1|Q=1) = 0.005791
P(M=1|Q=1) = 0.020751
P(H=1|Q=1) = 0.974188
P(L=1|P=1) = 0.363716
P(M=1|P=1) = 0.000713
P(N=1|P=1) = 0.638160
P(D=1|O=1) = 1.000000
P(L=1|N=1) = 0.280322
P(J=1|N=1) = 0.722387
P(G=1|M=1) = 1.000000
P(B=1|L=1) = 1.000000
P(E=1|K=1) = 0.290119
P(H=1|K=1) = 0.123345
P(I=1|K=1) = 0.601172
P(C=1|J=1) = 0.315461
P(A=1|J=1) = 0.697769
P(C=1|I=1) = 1.000000
P(E=1|H=1) = 0.520457
P(B=1|H=1) = 0.401539
P(F=1|H=1) = 0.087572
P(D=1|G=1) = 1.000000
P(D=1|F=1) = 1.000000
Node E: no conditional prob
Node D: no conditional prob
Node C: no conditional prob
Node B: no conditional prob
Node A: no conditional prob

Now consider the single-effect inference in Bayesian networks.

1. Suppose the effect node T = 1 is detected. A pruned tree rooted at T is obtained after the pruning operation on the BN in Figure 6.1; see Figure 6.2.

Figure 6.2: The Pruned Tree Rooted on T.

After the SDR algorithm is applied, the spanning tree shown in Figure 6.3 is produced. It is rooted at node T and contains the entire strongest dependency route, which can be read off from the following results of the simulation program.

Figure 6.3: The Spanning Tree after the SDR on Figure 6.2, rooted on T.

P(A=1|S,T=1) statistic prob = 11379/15140 = 0.751585; SDR prob = 0.844268; SDR: A
