HIDDEN MARKOV MODELS FOR ANOMALY DETECTION AND FAULT DIAGNOSIS

Satnam Singh, Ph.D.
University of Connecticut, 2007

In this thesis, we utilize hidden Markov model-based algorithms to address the problems of anomaly detection and dynamic multiple fault diagnosis. In the first part of the thesis, we address the problem of detecting an anomaly (e.g., intrusions, fraud and unusual business activities) with minimum delay and fewest false alarms. In our application, an anomaly is a sequence of very few transactions of interest embedded in a large number of noise (benign) transactions. We propose a sequential detection-based approach to detect HMMs, which are used to model anomalies (asymmetric threats). A transaction-based probabilistic model is developed to combine hidden Markov models and feature-aided tracking. A detailed performance analysis of the proposed anomaly detection algorithm is performed, along with a comparison with the maximum likelihood-based data mining method.

In the second part of the thesis, we develop near-optimal algorithms for dynamic multiple fault diagnosis (DMFD) problems in the presence of imperfect test outcomes. The dynamic diagnostic inference problem is to determine the most likely evolution of component states, the one that best explains the observed test outcomes. Here, we discuss four formulations of the DMFD problem. These range from the deterministic situation, corresponding to a perfectly observed coupled Markov decision process, to several partially observed factorial hidden Markov models: from the case where the imperfect test outcomes are functions of tests only, to the case where the test outcomes are functions of faults and tests, as well as the case where the false alarms are associated with the nominal (fault-free) case only. All these formulations are intractable NP-hard combinatorial optimization problems. We solve each of the DMFD problems by combining Lagrangian relaxation and the Viterbi decoding algorithm in an iterative way. Computational results on real-world problems are presented. A detailed performance analysis of the proposed algorithm is also discussed.

HIDDEN MARKOV MODELS FOR ANOMALY DETECTION AND FAULT DIAGNOSIS

Satnam Singh

B.E. (Hons.), Indian Institute of Technology (IIT) Roorkee, India, 1996
M.S., University of Wyoming, USA, 2003

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at University of Connecticut 2007

Copyright by

Satnam Singh

2007

APPROVAL PAGE

Doctor of Philosophy Dissertation

HIDDEN MARKOV MODELS FOR ANOMALY DETECTION AND FAULT DIAGNOSIS

Presented by Satnam Singh, M.S., B.E.(Hons.)

Major Advisor Krishna Pattipati

Associate Advisor Peter Willett

Associate Advisor Shengli Zhou

University of Connecticut
2007


ACKNOWLEDGEMENTS

First and foremost, I thank my major advisor, Dr. Krishna Pattipati, who has patiently advised me throughout my PhD studies and always found time to discuss new ideas and research approaches. Besides learning how to perform research, I learned presentation and writing skills from him. I am thankful to Dr. Pattipati for giving me the opportunities to participate in conferences, as well as for sponsoring my travel to exotic ski locations. Next, I would like to express my deep gratitude to my co-advisor, Dr. Peter Willett. He taught me how to think deeply and supervised me in formulating the statistical models of the problem. His humor always made me laugh, and he always put me at ease during tough moments. He always found time to discuss my research ideas, as well as to edit my papers. I am fortunate to have Dr. Pattipati and Dr. Willett as my advisors. Dr. Shengli Zhou taught me several communication courses, and also advised me regarding my thesis. I am indebted to him for his edits on my thesis and his ongoing advice. He has always encouraged me to think outside the box and to work on problems that span multiple fields. A special thanks to Dr. Yaakov Bar-Shalom, who taught me several courses on probability and estimation theory. I will never forget his witty remarks during ECE 311 seminars, which helped me to improve my presentation skills. Thanks to Dr. Peter Luh for teaching me the concepts of linear systems and nonlinear optimization in his graduate classes.


My PhD would have been a bumpy road without the humor and support of my colleagues in the Cyberlab. I always felt that my Cyberlab labmates were part of my USA family. During my first project at UConn, I worked with Haiying Tu, Jeffrey Allanach, Jijun Lu, and William Donat. I am fortunate to have had Haiying as my senior in the lab; she has been a caring and friendly mentor. Thanks to Jeff for making my Washington, DC workshop trip enjoyable, as well as for teaching me how to stay cool during rough situations. I am thankful to Jijun for rescuing me from software coding problems. I can't forget my Orlando conference visit with Bill; he taught me how to work hard as well as how to play hard! I would like to extend my gratitude to Candra Meirina, who always advised me during my rough moments at UConn. Thanks also to Jianhui Luo, Tinku and Setu Madhavi Namburu for wonderful discussions and talks in the Cyberlab. My lunch-time discussions were made interesting by Madhavi and Anuradha Kodali; both of them helped me to seek knowledge and learn new concepts, and I am thankful for their witty questions. My second project in the Cyberlab was with the fault diagnosis group. In this project, I got a chance to work with Kihoon Choi, Madhavi, Anuradha and Bharath. I would like to thank the entire fault diagnosis group for their comments throughout the course of my stay. Thanks to the ONR group members, Feili Yu, Sui Ruan, and Woosun An, for their discussions, and for giving me opportunities to work with them.


I would like to thank the ECE department staff members at UConn: Tina, Sharon, Mary, Barbara and Dee, for their support and timely help with almost anything. Special thanks to Tina, who prepared all the paperwork for me throughout my stay at UConn. I see my PhD as "running a marathon," and it is nearly impossible to run a marathon without the support of your family and friends. I would like to express my warmest thanks to my parents and my wife for their unswerving understanding, love, patience and support. They were there when I needed them most. They have fully supported my plans, and motivated me to achieve unreachable goals. My actual journey toward a PhD started in 2000, when I decided to pursue a PhD degree and enrich myself with in-depth technical knowledge. It was my brother, Jaspal, who encouraged me to come to the USA. He motivated me to study electrical engineering. I am thankful for his advice, and I have always appreciated his vision and thoughts. Special thanks to Linda, who has always motivated me to finish my PhD. To my younger brother, Daljeet, and his family, thank you for your support and love. Special thanks to my grandfather, Khajan Singh; he is my inspiration for hard work and persistence.


TABLE OF CONTENTS

Chapter 1: Introduction
  1.1 Anomaly detection via Feature-aided Tracking and HMMs
  1.2 An Advanced System for Modeling Asymmetric Threats
  1.3 Dynamic Multiple Fault Diagnosis Problem: Mathematical formulations and solution techniques
  1.4 Dynamic Fusion of Classifiers for Fault Diagnosis
  1.5 Impact of Research
  1.6 Publications

Chapter 2: Anomaly Detection via Feature-Aided Tracking and Hidden Markov Models
  2.1 Introduction
    2.1.1 Related Work
    2.1.2 Organization
  2.2 A Transaction-based Probabilistic Model
    2.2.1 Null Hypothesis ("Noise Only")
    2.2.2 Alternative Hypothesis "HMM in the Presence of Noise"
  2.3 Example: Development of a Nuclear Weapons Program (DNWP)
  2.4 Modeling HMMs via TEAMS
  2.5 Algorithm Details
    2.5.1 Modified Forward, Backward and Termination Steps
    2.5.2 HMM Detection Scheme
    2.5.3 Cusum Update with Skipped Observations
  2.6 Simulations and Results
    2.6.1 Results under Null Hypothesis
    2.6.2 Results under Alternative Hypotheses
    2.6.3 What-if Analysis under Alternative Hypothesis
    2.6.4 Performance Analysis
  2.7 Summary

Chapter 3: Stochastic Modeling of a Terrorist Event via the ASAM System
  3.1 Introduction
    3.1.1 Organization
  3.2 The ASAM System
  3.3 Modeling of a Terrorist Event
    3.3.1 Modeling Aspects
    3.3.2 BN Model of a Terrorist Event
    3.3.3 Truck Bombing (HMM1)
    3.3.4 Deadly Chemical Cloud (HMM2)
  3.4 Advanced Methods for Tracking Terrorist Activities
    3.4.1 Multiple Target Tracking
    3.4.2 Multiple Hypothesis Tracking
    3.4.3 Attribute-Aided Tracking
  3.5 Simulations and Results
  3.6 Summary

Chapter 4: Dynamic Multiple Fault Diagnosis: Mathematical Formulations and Solution Techniques
  4.1 Introduction
    4.1.1 Previous Work
    4.1.2 Organization
  4.2 DMFD Problem Formulations
  4.3 DMFD Problem 1
    4.3.1 Primal-Dual Optimization Framework
    4.3.2 Approximate and Exact Duality Gap
  4.4 DMFD Problem 2
  4.5 DMFD Problem 3
  4.6 DMFD Problem 4
  4.7 Sliding Window DMFD Method
  4.8 Algorithm Details
    4.8.1 Solving Subproblems using the Viterbi Algorithm
    4.8.2 Updating Lagrange Multipliers via the Subgradient Method
  4.9 Simulations and Results
    4.9.1 Small-Scale System
    4.9.2 Real World Data Sets
    4.9.3 Sliding Window DMFD Results
    4.9.4 Complexity
  4.10 Summary

Chapter 5: Dynamic Fusion of Classifiers for Fault Diagnosis
  5.1 Introduction
    5.1.1 Previous Work
  5.2 Dynamic Fusion Process Overview
    5.2.1 Feature Extraction or Data Pre-processing
    5.2.2 Error Correcting Codes (ECC) Matrix
    5.2.3 Fault Detection using the Support Vector Machine (SVM) Classifiers
    5.2.4 Dynamic Fusion
  5.3 Dynamic Multiple Fault Diagnosis (DMFD) Problem
  5.4 Simulations and Results
  5.5 Summary

Chapter 6: Conclusion and Future Work

LIST OF TABLES

3-1 Transactions for the truck bombing HMM
4-1 Small-scale scenario for simulations
4-2 Results for small-scale scenario
4-3 Real world models
4-4 Results on real world models
4-5 Type of faults
5-1 Error correcting code (ECC) matrix
5-2 Confusion matrix
5-3 Results on CRAMAS® data

LIST OF FIGURES

2-1 A simplified intelligence observation space
2-2 An example of a transaction
2-3 A HMM combined with feature-aided tracking
2-4 Transaction space for hypothesis testing
2-5 Markov chain of RD HMM
2-6 Markov chain of PWGM HMM
2-7 Markov chain of FTI HMM
2-8 A HMM state consists of a transaction
2-9 Features associated with Nuclear Scientist node
2-10 Cusum statistic of PWGM, RD and FTI HMMs under the null hypothesis
2-11 PWGM HMM: cusum statistic of HMMFA and Naïve methods
2-12 RD HMM: cusum statistic of HMMFA and Naïve methods
2-13 FTI HMM: cusum statistic of HMMFA and Naïve methods
2-14 Cusum statistic of truncated RD HMM under what-if scenario no. 1
2-15 Cusum statistic of truncated RD HMM under what-if scenario no. 2
2-16 Effect of complexity of HMM on the performance
2-17 Performance comparison between HMMFA and Naïve methods
3-1 BN model of terrorist attack threat in the Athens 2004 Olympics
3-2 Markov chain for the truck bombing HMM
3-3 HMM states (S1-S5) of the truck bombing HMM
3-4 HMM states (S6-S9) of the truck bombing HMM
3-5 Multiple hypothesis tracking for two HMMs
3-6 Detection of modeled HMM1 at t = 25
3-7 Detection of HMM1 and HMM2 in the presence of HMM1
4-1 DMFD problem viewed as a factorial hidden Markov model (FHMM)
4-2 Tri-partite digraph for DMFD problem
4-3 Bi-partite graph for the DMFD problem
4-4 Detection and false alarm probabilities for problem 1
4-5 Decomposition of the original DMFD problem
4-6 Flow chart of the algorithm
4-7 Detection and false alarm probabilities for problem 2
4-8 Detection and false alarm probabilities for problem 3
4-9 Approximate duality gap
4-10 Boxplots of CI and FI for automotive and document matching system
4-11 Boxplots of CI and FI for power distribution and UH-60 helicopter transmission system
4-12 Boxplots of CI and FI for engine simulator system
4-13 Correct isolation rate for various fault behaviors
4-14 False isolation rate for various fault behaviors
5-1 Overview of dynamic fusion process
5-2 Tri-partite graph for dynamic fusion problem
5-3 Fault appearance and disappearance probabilities
5-4 Detection and false alarm probabilities
5-5 Parameter optimization in dynamic fusion
5-6 Comparison of classification error among various methods
5-7 Effect of window size on the classification error
5-8 Effect of window size on the false isolation rate

Chapter 1

Introduction

The hidden Markov model (HMM) is a principal method for modeling partially observed stochastic processes. The premise behind a HMM is that the true underlying process, represented as a Markov chain depicting the evolution of true transactions (events) as a function of time, is not directly observable (the hidden states), but it can be probabilistically inferred through another set of stochastic processes (the observations). HMMs can solve three problems: (1) evaluation of the probability of a sequence of observed events given a specific model; (2) decoding the most likely evolution of an abnormal activity (state sequence) represented by the HMM; and (3) estimation of the HMM parameters that produce the best representation of the most likely state sequence. HMMs have been successfully applied in speech recognition, DNA sequence analysis, robot control and signal detection, to name a few areas.

In this thesis, we solve the evaluation and decoding problems using hidden Markov models in two areas: anomaly detection and fault diagnosis. We describe the anomaly detection research in Chapters II and III; Chapters IV and V describe the research in the fault diagnosis area. In the area of anomaly detection, we combine hidden Markov models with feature-aided tracking to detect asymmetric threats. The algorithms are implemented in a web-based software system, termed the Adaptive Safety Analysis and Monitoring (ASAM) system. In a second application of HMMs, we investigate the dynamic multiple fault diagnosis (DMFD) problem in safety-critical systems, such as aircraft, automobiles, nuclear power plants and space vehicles. The DMFD problem arises when multiple faults evolve over time and the objective is to infer them from a set of partial and unreliable test outcomes observed over time. Various formulations of the DMFD problem are discussed. In one of the DMFD formulations, where the uncertainty is associated with the tests only, the problem reduces to a dynamic fusion problem of combining evolving classifier outputs. The dynamic fusion algorithm discussed herein is tested on real-world datasets from an automotive system.
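To make the evaluation and decoding problems concrete, the brute-force sketch below enumerates every hidden state path of a tiny two-state HMM. The model and its numbers are hypothetical and chosen only for illustration; the thesis itself relies on the efficient forward and Viterbi recursions rather than enumeration.

```python
# Toy illustration of the evaluation and decoding problems for a standard
# (state-emission) HMM, solved by brute-force enumeration over state paths.
from itertools import product

A  = {("S1", "S1"): 0.7, ("S1", "S2"): 0.3,   # state transition probabilities
      ("S2", "S1"): 0.4, ("S2", "S2"): 0.6}
B  = {("S1", "x"): 0.9, ("S1", "y"): 0.1,     # emission probabilities
      ("S2", "x"): 0.2, ("S2", "y"): 0.8}
Pi = {"S1": 0.5, "S2": 0.5}                   # initial state probabilities

obs = ["x", "y", "y"]                         # observed sequence

def path_prob(path, obs):
    """Joint probability of one hidden state path and the observations."""
    p = Pi[path[0]] * B[(path[0], obs[0])]
    for k in range(1, len(obs)):
        p *= A[(path[k - 1], path[k])] * B[(path[k], obs[k])]
    return p

paths = list(product(["S1", "S2"], repeat=len(obs)))
# Problem 1 (evaluation): total probability of the observations under the model.
likelihood = sum(path_prob(p, obs) for p in paths)
# Problem 2 (decoding): the single most likely hidden state sequence.
best_path = max(paths, key=lambda p: path_prob(p, obs))

print(f"P(obs | model) = {likelihood:.4f}")
print("Most likely state sequence:", best_path)
```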

1.1 Anomaly detection via Feature-aided Tracking and HMMs

Anomaly detection research is carried out in disparate domains, such as monitoring business news, epidemic or bioterrorism detection, intrusion detection, hardware fault detection, network alarm monitoring and fraud detection. The anomaly detection problem involves large volumes of time series data containing a significant number of entities and activities. The main goal of anomaly detection is to identify interesting and rare events (e.g., intrusions, fraudulent and/or unusual business activities) with minimum delay and fewest false alarms. The problem of detecting an anomaly (or abnormal event) is one in which the distribution of observations differs before and after an unknown onset time of the event, and the objective is to detect the change by statistically matching the observed pattern with that predicted by a model. In this work, we propose to employ the well-known Page's test [6] (an efficient scheme for the quickest detection of a change in a distribution) to detect an abnormal event when the pattern to be detected is modeled as a hidden Markov model (HMM).

HMMs are a natural choice to detect an anomaly (e.g., a pattern of suspicious activities). Real-world adversarial actions or events, such as terrorist attacks, are characterized as partially observable and uncertain signals. Their signals, or electronic signatures, are a series of observations. HMMs provide a systematic way to make inferences about the evolution of such partially observable asymmetric threats. Here, we illustrate the capability of feature-aided tracking and HMMs to solve the evaluation problem, i.e., evaluate the probability of a sequence of noisy observations given a model of an asymmetric threat. To solve the evaluation problem, we need to develop a signal model that can be used to distinguish between suspicious and real instances of abnormal activities. In doing so, the model must be able to: (1) detect potential abnormal activities (e.g., asymmetric threats) in a highly cluttered environment; (2) efficiently analyze large amounts of data; and (3) generate hypotheses with only partial and imperfect information. The signal model is a transaction-based model that identifies relationships among nodes in a network to describe its structure and functionality. If we can identify the types of activities (or observations) that an adversary may be involved in, then we can construct a model based solely on these.

1.2 An Advanced System for Modeling Asymmetric Threats

The ability of terrorist networks to conduct sophisticated and simultaneous attacks suggests that there is a significant need for developing information technology tools for counter-terrorism analysis. These technologies could empower intelligence analysts to find information faster, share and collaborate across agencies, "connect the dots" better, and conduct quicker and better analyses. Here, we describe one such technology, the Adaptive Safety Analysis and Monitoring (ASAM) system. The ASAM system is a semi-automated, model-based system, which has the ability to detect and track terrorist activities and to perform what-if analyses that enable an analyst to gain deeper insights into a potential terrorist activity. The ASAM system provides a means to develop models based on real-world events. Using the ASAM system, potential threat scenarios can be built and used to suggest priorities for efforts to reduce the overall threats. Being semi-automated, the system helps the analyst spend more time on analysis rather than on collecting and reporting information.

The premise of the ASAM system is that terrorists leave detectable clues about their enabling events in the information space, and that these clues can be related, linked, and tracked over time. We denote the enabling events associated with terrorist attacks, such as financing, acquisition of weapons and explosives, travel, and communications among suspicious people, as transactions. A pattern of these transactions and its dynamic evolution over time is a potential realization of a terrorist activity. The ASAM system employs a novel combination of HMMs and Bayesian networks (BNs) to compute the likelihood that a certain terrorist activity exists. This likelihood is an important indicator of terrorist threat. The ASAM system utilizes attribute-aided tracking and hidden Markov models to identify suspicious activity consistent with an a priori terrorist template model. A probabilistic matching of the modeled attributes with the observed attributes provides the ability to identify a suspicious person, place, or object (item). The ASAM system thus provides efficient and effective methods for counter-terrorism analysis. In this work, we focus on the application of the ASAM system to real-world examples. We discuss some of the modeling aspects of terrorist events and illustrate the modeling process via two examples of hypothetical terrorist activities.

1.3 Dynamic Multiple Fault Diagnosis Problem: Mathematical formulations and solution techniques

On-line vehicle health monitoring and fault diagnosis is essential to improve vehicle availability via condition-based and opportunistic maintenance, and to reduce maintenance and operational costs by seamlessly integrating on-board and off-line diagnosis, thereby reducing troubleshooting time. During on-line (dynamic) fault diagnosis, the test outcomes are obtained over time, as compared to static fault diagnosis, where the observed test outcomes are available as a block. On-line vehicle health monitoring relies heavily on extensive processing of data in real time, which is made possible by smart on-board sensors. Using these intelligent sensors, the system parameters that are essential for vehicle fault diagnosis can be transmitted to an on-board diagnostic inference engine.

A significant technical challenge in on-board vehicle health monitoring is the quality of tests. Generally, the tests are imperfect due to unreliable sensors, electromagnetic interference, environmental conditions, or aliasing inherent in the signature analysis of on-board tests. The imperfect tests introduce additional elements of uncertainty into the diagnostic process: the pass outcome of a test does not guarantee the integrity of the components under test, because the test may have missed a fault; on the other hand, a fail outcome of a test does not mean that one or more of the implicated components are faulty, because the test outcome may have been a false alarm. Hence, it is desired that an on-board diagnostic algorithm be able to accommodate missed detections and false alarms in test outcomes. The performance of on-board diagnosis can be improved by incorporating knowledge of the reliabilities of tests and the temporal correlations of test outcomes.

The hidden Markov model (HMM) is a natural choice here to represent the individual fault states of the system. The HMM is a doubly-embedded stochastic process with an underlying unobservable (hidden) stochastic process (the individual fault state evolution), which can be observed through another set of stochastic processes (i.e., the uncertain test outcome sequences). The individual fault state HMMs are coupled through the observation process. Consequently, the fault diagnosis problem corresponds to a factorial HMM, where each HMM characterizes an individual fault state of the system. The sequences of uncertain test outcomes are probabilistic functions of the underlying Markov chains characterizing the evolution of the system states. Here, we investigate the problem of determining the most likely fault states of components, given a set of partial and unreliable test outcomes over time. Dynamic multiple fault diagnosis (DMFD) is a challenging and difficult problem due to the coupling effects of the component states and the imperfect test outcomes that manifest themselves as missed detections and false alarms. The objective of the DMFD problem is to determine the most likely temporal evolution of fault states, the one that best explains the observed test outcomes over time.

Here, we discuss four formulations of the DMFD problem. These range from the deterministic situation, corresponding to a perfectly observed coupled Markov decision process, to several partially observed factorial hidden Markov models: from the case where the imperfect test outcomes are functions of tests only, to the case where the test outcomes are functions of faults and tests, as well as the case where the false alarms are associated with the nominal (fault-free) case only. All these formulations are intractable NP-hard combinatorial optimization problems. We solve each of the DMFD problems by decomposing it into separable subproblems, one for each component state sequence. Our solution scheme can be viewed as a two-level coordinated solution framework for the DMFD problem. At the top (coordination) level, we update the Lagrange multipliers (coordination variables, dual variables) using the surrogate subgradient method, which does not require solving all the subproblems before the dual variables are updated. The top level facilitates coordination among the subproblems, and can thus reside in a vehicle-level diagnostic control unit. At the bottom level, we use a dynamic programming technique (specifically, the Viterbi decoding or max-sum algorithm) to solve each of the subproblems. The key advantage of our approach is that it provides an approximate duality gap, which is a measure of the suboptimality of the DMFD solution. Interestingly, the perfectly-observed DMFD problem leads to a dynamic set covering problem, which can be approximately solved via Lagrangian relaxation and Viterbi decoding.
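As a minimal sketch of the bottom-level computation in this two-level scheme, the snippet below runs a standard Viterbi recursion over the two fault states (absent/present) of a single component. The prior, transition and per-epoch score arrays are hypothetical stand-ins; in the actual DMFD algorithm, the per-epoch scores would fold in the test-outcome likelihoods and the current Lagrange multipliers supplied by the top-level coordinator.

```python
import numpy as np

def viterbi_binary_fault(log_prior, log_trans, log_score):
    """
    Most likely fault-state sequence for ONE component over T epochs.
    States: 0 = fault absent, 1 = fault present.
    log_prior:  shape (2,)      log prior over the initial state
    log_trans:  shape (2, 2)    log transition probabilities
    log_score:  shape (T, 2)    per-epoch log score of each state (a stand-in here;
                                in the DMFD scheme this term would combine the test
                                likelihoods with the Lagrange multipliers)
    """
    T = log_score.shape[0]
    delta = np.zeros((T, 2))           # best log score ending in each state
    psi = np.zeros((T, 2), dtype=int)  # argmax back-pointers
    delta[0] = log_prior + log_score[0]
    for t in range(1, T):
        for j in range(2):
            cand = delta[t - 1] + log_trans[:, j]
            psi[t, j] = np.argmax(cand)
            delta[t, j] = cand[psi[t, j]] + log_score[t, j]
    # Backtrack the optimal state sequence.
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states

# Hypothetical numbers: a fault that appears rarely and tends to persist.
log_prior = np.log([0.95, 0.05])
log_trans = np.log([[0.9, 0.1], [0.2, 0.8]])
log_score = np.log([[0.8, 0.2], [0.3, 0.7], [0.2, 0.8], [0.6, 0.4]])
print(viterbi_binary_fault(log_prior, log_trans, log_score))
```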

1.4 Dynamic Fusion of Classifiers for Fault Diagnosis

Classifier fusion has been widely investigated in diverse fields such as image segmentation, data mining from noisy data streams, credit card fraud detection, sensor networks, image, speech and handwriting recognition, and fault diagnosis, to name a few. In the literature, classifier fusion is variously referred to as classifier ensembles, consensus aggregation, decision fusion, committee machines, and classifier selection or mixture of experts. The objective of classifier fusion is to achieve better classification accuracy by combining the results of individual classifiers. Our focus here is on combining class labels from multiple classifiers over time. In this work, we formulate the dynamic classifier fusion problem as one of maximizing the a posteriori probability of a hidden state sequence given uncertain classifier outcomes over time. For simplicity of classifier fusion, we transform the data into binary classes by selecting the individual classifiers to correspond to the columns of an error correcting code (ECC) matrix. In the fault diagnosis area, we refer to classes as components and classifiers as tests. Thus, the binary classifiers (binary tests) correspond to the columns of the ECC matrix, and the components correspond to the rows of the ECC matrix. The ECC matrix may therefore be viewed as a diagnostic matrix (D-matrix, diagnostic dictionary, reachability matrix), which defines the cause-effect relationships among components (rows) and tests (columns).

Our approach to dynamic fusion involves four key steps: (1) data pre-processing (noise suppression, data reduction and feature selection) using signal processing techniques, such as wavelets, FFTs, principal component analysis (PCA), partial least squares (PLS), computing statistical moments, etc.; (2) error correcting codes to transform the multiclass data into dichotomous choice situations (binary classification); (3) fault detection using pattern recognition techniques (e.g., support vector machines (SVM), probabilistic neural networks, k-nearest neighbor); and (4) fault isolation via dynamic fusion of classifier output labels over time using the DMFD algorithm.
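The role of the ECC matrix as a D-matrix can be illustrated with a small hypothetical example: rows are component fault signatures, columns are binary tests, and a single-epoch nearest-codeword (Hamming-distance) rule isolates a component. This static rule is only meant to show the matrix structure; the thesis fuses such binary outcomes over time with the DMFD algorithm rather than epoch by epoch.

```python
import numpy as np

# Hypothetical 4-component x 5-test D-matrix (ECC matrix): rows are component
# fault signatures, columns are the binary classifiers (tests).
D = np.array([[1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 0, 1, 1]])
components = ["c1", "c2", "c3", "c4"]

# Observed binary test outcomes at one epoch (one classifier flipped by noise).
observed = np.array([1, 0, 1, 1, 1])

# ECC-style decoding: pick the row whose codeword is closest in Hamming distance.
distances = (D != observed).sum(axis=1)
print(dict(zip(components, distances)))
print("Isolated component:", components[int(np.argmin(distances))])
```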

1.5 Impact of Research

The research on anomaly detection has applications in many areas, such as intrusion detection in networks, fraudulent and unusual business activities, monitoring business news, epidemic or bioterrorism detection, and hardware fault detection. We primarily focused on anomaly detection of asymmetric threats, such as terrorist activities, which is vital to the nation's security. Our work on the dynamic multiple fault diagnosis (DMFD) problem is salient for on-board diagnosis in safety-critical systems, such as aircraft, automobiles, nuclear power plants and space vehicles. Dynamic classifier fusion can further improve on-board diagnostic accuracy. An accurate on-board diagnostic process will ensure the performability, maintainability and survivability of safety-critical systems.

1.6 Publications

A. Journal Papers

1. S. Singh, A. Kodali, K. Choi, K. Pattipati, S. M. Namburu, S. Chigusa, D. V. Prokhorov, and L. Qiao, "Dynamic Multiple Fault Diagnosis Problem Formulations and Solution Techniques," IEEE Trans. on SMC: Part A, August 2007 (under review).
2. S. Singh, H. Tu, W. Donat, K. Pattipati and P. Willett, "Anomaly Detection via Feature-aided Tracking and Hidden Markov Models," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, August 2006 (under review).
3. S. Singh, H. Tu, J. Allanach, K. Pattipati and P. Willett, "Modeling Threats," IEEE Potentials, August-September 2004.
4. H. Tu, J. Allanach, S. Singh, P. Willett and K. Pattipati, "Information Integration via Hierarchical and Hybrid Bayesian Networks," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, special issue on "Advances in Heterogeneous and Complex System Integration," vol. 1, no. 1, pp. 19-34, January 2006.
5. L. Grymek, S. Singh, and K. Pattipati, "Vehicular Dependence Adds to Telematics Allure," IEEE Potentials, March 2007.
6. W. An, S. Singh, S. Gokhale, K. Pattipati and D. Kleinman, "Dynamic Scheduling of Multiple Hidden Markov Model-based Sensors," Journal of Advances in Information Fusion, October 2007 (under review).
7. W. Donat, K. Choi, W. An, S. Singh, and K. Pattipati, "Data Visualization, Data Reduction and Classifier Fusion for Intelligent Fault Detection and Diagnosis in Gas Turbine Engines," ASME Journal of Engineering for Gas Turbines and Power, 2007; also published in Turbo Expo 2007: Power for Land, Sea and Air, Proceedings of GT 2007, Montreal, Canada, May 2007.

B. Book Chapters

1. K. Pattipati, P. Willett, J. Allanach, H. Tu and S. Singh, "Hidden Markov Models and Bayesian Networks for Counter-terrorism," in R. Popp and J. Yen (editors), Emergent Information Technologies and Enabling Policies for Counter Terrorism, Wiley-IEEE Press, May 2006, pp. 27-50.
2. G. Levchuk, C. Meirina, S. Singh, K. Pattipati, P. Willett and K. Chopra, "Learning from the Enemy: Approaches to Identifying and Modeling the Hidden Enemy Organization," in A. Kott (editor), Information Warfare and Organizational Decision-Making, Artech House, Inc., Norwood, MA, December 2006.

C. Conference Papers

1. S. Singh, K. Choi, A. Kodali, K. Pattipati, S. M. Namburu, S. Chigusa, D. V. Prokhorov, and L. Qiao, "Dynamic Fusion of Classifiers for Fault Diagnosis," IEEE SMC Conference, Montreal, Canada, October 2007.
2. S. Singh, K. Choi, A. Kodali, K. Pattipati, J. Sheppard, S. M. Namburu, S. Chigusa, D. V. Prokhorov and L. Qiao, "Dynamic Multiple Fault Diagnosis Problem Formulations and Solution Techniques," DX-07 International Workshop on Principles of Fault Diagnosis, Nashville, TN, May 2007.
3. S. Singh, S. Ruan, K. Choi, K. Pattipati, P. Willett, S. M. Namburu, S. Chigusa, D. V. Prokhorov and L. Qiao, "An Optimization-Based Method for Dynamic Multiple Fault Diagnosis Problem," IEEE Aerospace Conference, Big Sky, Montana, March 2007.
4. S. Singh, W. Donat, H. Tu, K. Pattipati and P. Willett, "Anomaly Detection via Feature-Aided Tracking and Hidden Markov Models," IEEE Aerospace Conference, Big Sky, Montana, March 2007.
5. S. Singh, W. Blanding, V. Ravindra and K. Pattipati, "Communication Channel Equalization: Pattern Recognition or Neural Networks?," IEEE International Conference on Communication Technology, November 2006.
6. S. Singh, W. Donat, H. Tu, J. Lu, K. Pattipati and P. Willett, "An Advanced System for Modeling Asymmetric Threats," IEEE International Conference on Systems, Man, and Cybernetics, October 2006.
7. S. Singh, J. Allanach, H. Tu, K. Pattipati and P. Willett, "Stochastic Modeling of a Terrorist Event via the ASAM System," IEEE Conference on Systems, Man and Cybernetics, The Hague, The Netherlands, October 2004.
8. A. Kodali, S. Singh, K. Choi, K. Pattipati, S. M. Namburu, S. Chigusa, D. V. Prokhorov, and L. Qiao, "Dynamic Set Covering for Real-Time Multiple Fault Diagnosis," to be published in IEEE Aerospace Conference, Big Sky, Montana, March 2008.
9. A. Kodali, W. Donat, S. Singh, K. Choi and K. Pattipati, "Dynamic Fusion and Parameter Optimization of Multiple Classifier Systems," to be submitted to Turbo Expo 2008: Power for Land, Sea and Air, Berlin, Germany, June 2008.
10. J. Allanach, H. Tu, S. Singh, K. Pattipati and P. Willett, "Detecting, Tracking and Counteracting Terrorist Networks via Hidden Markov Models," IEEE Aerospace Conference, Big Sky, MT, March 2004.
11. H. Lee, S. Singh, W. An, S. Gokhale, K. Pattipati and D. Kleinman, "Rollout Strategies for Hidden Markov Model-based Dynamic Sensor Scheduling," IEEE SMC Conference, Montreal, Canada, October 2007.
12. H. Tu, S. Singh, J. Allanach, K. Pattipati and P. Willett, "On Detection Networks and Iterated Influence Diagrams: Application to a Parallel Distributed Structure," IEEE Aerospace Conference, Big Sky, MT, March 2006.
13. H. Tu, J. Allanach, S. Singh, K. Pattipati and P. Willett, "The Adaptive Safety Analysis and Monitoring System," SPIE Defense and Security Symposium, April 2004.
14. K. Choi, S. Singh, K. Pattipati, S. M. Namburu, S. Chigusa, D. V. Prokhorov, and L. Qiao, "Novel Classifier Fusion Approaches for Fault Diagnosis in Automotive Systems," Proceedings of IEEE AUTOTESTCON, Baltimore, MD, September 2007.
15. R. Popp, K. Pattipati, P. Willett, D. Serfaty, W. Stacy, K. Carley, J. Allanach, H. Tu and S. Singh, "Collaborative Tools for Counter-Terrorism Analysis," IEEE Aerospace Conference, Big Sky, MT, March 2005.
16. R. Popp, K. Pattipati, P. Willett, D. Serfaty, W. Stacy, K. Carley, J. Allanach, H. Tu and S. Singh, "Collaboration and Modeling Tools for Counter-Terrorism Analysis," CIHSPS 2004 - IEEE International Conference on Computational Intelligence for Homeland Security and Personal Safety, Venice, Italy, July 2004.

Chapter 2

Anomaly Detection via Feature-Aided Tracking and Hidden Markov Models

2.1 Introduction

Anomaly detection research is carried out in disparate domains, such as monitoring business news, epidemic or bioterrorism detection, intrusion detection, hardware fault detection, network alarm monitoring and fraud detection (Fawcett, May 2004). The anomaly detection problem involves large volumes of time series data, which has a significant number of entities and activities. The main goal of anomaly detection is to identify as many interesting and rare events (e.g., intrusions, frauds and unusual business activities) as possible with minimum delay and fewest false alarms.


In this chapter, we propose to employ the well-known Page's test (Page, 1954) (an efficient scheme for the quickest detection of a change in a distribution) to detect an anomaly (or abnormal event) when the pattern to be detected is modeled as a hidden Markov model (HMM). There are two basic ways to detect an anomaly: first, show that the observation process is similar to an adversary pattern; second, show that the observation process is dissimilar to a benign (or normal) pattern. An intuitive approach to detect an abnormal situation is to use the likelihood ratio, i.e., the ratio of the probability density (or mass) function (pdf or pmf) of the observations under the assumption of abnormality to the pdf (or pmf) of the same observations under benign (or normal) conditions. Both Bayesian and Neyman-Pearson optimal hypothesis testing use the likelihood ratio. However, in the case of sequential testing, the best procedure is the sequential likelihood ratio test, while for quickest detection of a change in distribution, Page's scheme uses the cumulative sum (cusum) statistic. Basically, the cusum statistic is a log likelihood ratio (LR) clamped so that it cannot fall below zero. If the log likelihood ratio is sufficiently large, an abnormality is declared. Similarity to an adversary pattern amounts to an increase in the numerator quantity within the LR under the hypothesis of abnormality. Likewise, dissimilarity to a normal pattern means a decrease in the denominator quantity of the LR under the assumption that no adversary is active. Both cause the LR to rise, which indicates the onset of an abnormal event.
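A minimal sketch of the cusum recursion just described is given below: the statistic accumulates per-observation log likelihood ratios, is clamped at zero, and an abnormality is declared once it crosses a threshold. The log likelihood ratios and the threshold here are synthetic and purely illustrative.

```python
import numpy as np

def page_cusum(log_lr, threshold):
    """
    Page's test: cumulative sum of log likelihood ratios, clamped at zero.
    log_lr:    per-observation log likelihood ratios,
               log p(x_k | abnormal) - log p(x_k | normal)
    threshold: declare an abnormality once the statistic exceeds this value
    Returns the cusum trajectory and the first alarm time (or None).
    """
    w, trajectory, alarm = 0.0, [], None
    for k, llr in enumerate(log_lr):
        w = max(0.0, w + llr)          # clamp at zero, as described above
        trajectory.append(w)
        if alarm is None and w > threshold:
            alarm = k
    return np.array(trajectory), alarm

# Synthetic data: noise-only at first, then an active pattern.
rng = np.random.default_rng(0)
log_lr = np.concatenate([rng.normal(-0.2, 0.5, 50),   # benign: drifts down
                         rng.normal(+0.4, 0.5, 30)])  # abnormal: drifts up
_, alarm_time = page_cusum(log_lr, threshold=5.0)
print("Alarm raised at index:", alarm_time)
```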

As discussed above, our approach to abnormality detection focuses on the "numerator" of the LR: the degree of likelihood to which the observation sequence matches a HMM that depicts abnormal activity. We use a library of available HMMs to model adversarial activities. The HMM framework is used to compute the posterior probabilities of the hidden states, given a sequence of noisy and partial observations.

Hidden Markov models (HMMs) constitute a principal method for modeling partially-observed stochastic processes. The premise behind a HMM is that the true underlying process, represented as a Markov chain depicting the evolution of true transactions as a function of time, is not directly observable (hidden), but it can be probabilistically inferred through another set of stochastic processes (observed transactions, for example). HMMs are a natural choice to detect an anomaly (e.g., a pattern of suspicious activities). Real-world adversarial actions or events, such as terrorist attacks, are characterized as partially observable and uncertain signals. Their signals, or electronic signatures, are a series of observations. HMMs provide a systematic way to make inferences about the evolution of such partially observable asymmetric threats. HMMs can solve three problems: (1) evaluate the probability of a sequence of observed events given a specific model; (2) determine the most likely evolution of an abnormal activity (state sequence) represented by the HMM; and (3) estimate the HMM parameters that produce the best representation of the most likely state sequence. Here, we illustrate the capability of feature-aided tracking and HMMs to solve the first problem, i.e., to evaluate the probability of a sequence of noisy observations given a model of an asymmetric threat.

To solve the evaluation problem, we need to develop a signal model that can be used to distinguish between suspicious and real instances of abnormal activities. In doing so, the model must be able to: (1) detect potential abnormal activities (e.g., asymmetric threats) in a highly-cluttered environment; (2) efficiently analyze large amounts of data; and (3) generate hypotheses with only partial and imperfect information. The signal model is a transaction-based model that identifies relationships among nodes in a network to describe its structure and functionality. If we can identify the types of activities (or observations) that an adversary may be involved in, then we can construct a model based solely on these.

Feature-aided tracking is the process of collecting data about the features of suspicious entities from one or more sources to enhance the knowledge about them, such as age, citizenship or other details, which are catalogued a priori by expert analysts generating hypotheses. For example, it is generally considered that suicide bombers are young; so, if an analyst creating a model of a suicide bombing needs to describe a transaction involving the suicide bomber, he can describe the features of the bomber, such as age and skills. Overall, the HMMs describe the dynamics of a terrorist network by including a priori information about the people involved, the temporal characteristics of the transactions, the geographical location, etc. These features are directly embedded within the underlying states of the HMM, and can be used to distinguish the targets of interest from the ambient background noise. Next, we discuss the research work related to anomaly detection of asymmetric threats.

2.1.1 Related Work

HMMs are well-known and powerful statistical techniques, and they have been widely applied in various fields such as speech recognition, DNA sequence analysis, robot control, fault diagnosis (Ying et al., November 2000) and signal detection (Chen and Willett, December 2000), to name a few. Excellent tutorials on HMMs can be found in (Rabiner and Juang, January 1986; Rabiner, February 1989). In (Smyth, December 1994), Smyth described a method for extending HMMs to allow for unknown states, which cannot be accounted for when the model is being designed.

The anomaly detection problem is widely studied in the machine learning literature. In (Fawcett, May 2004), Fawcett considered anomaly detection as an on-line stream classification problem. The author argued that diverse domains such as intrusion detection, news story tracking, etc., can be naturally expressed in a framework whose central theme is to develop various evaluation metrics that can account for the temporal nature of the problem. Joshi et al. [8] used HMMs to build an anomaly detection system to discriminate between normal and abnormal behavior of network traffic. The authors used the 1999 knowledge discovery in databases (KDD) data set as an example. KDD is defined as an integrated approach to discover knowledge by combining ideas drawn from fields such as databases, machine learning, statistics, visualization, and parallel and distributed computing. The authors used the standard Baum-Welch (Baum et al., 1970) (Expectation Maximization (Moon, November 1996)) procedure for HMM parameter training, and performed hypothesis testing using maximum likelihood (ML) principles to rate the traffic as either normal or having originated from an attack during the recognition phase of the algorithm. Salvador et al. (Salvador et al., 2004) considered the anomaly detection problem using segmentation or clustering techniques to dynamically divide the time series and to determine a reasonable number of clusters. Further, they considered these clusters as states of a finite state machine to track normal behavior and detect anomalies. The method was applied to data obtained from the NASA space shuttle. Agarwal et al. (Agarwal et al., January 2006) presented a holistic approach for simultaneously monitoring a large number of time series (or streams). Their method detects anomalies by applying control chart methodology to normal scores of p-values. The authors considered an adjustable five-parameter empirical Bayesian model for multiple comparisons at each time point. The procedure was illustrated on a bio-surveillance problem. Bay et al. (Bay et al., May 2004) proposed a general solution for time-series data to discover anomalous regimes, which they defined as a change in the functional relationships between the variables, or as the introduction of a previously unseen causal effect. The key idea is to transform the time series data into a set of local models, where each model is trained on a small, time-bounded set of data. The framework was used to compare models from the test set to those from the training set in the parameter space to detect anomalies.

Detection of a pattern of abnormal activity is also of significant interest to the national security community, and there are several research groups working on this problem. Godfrey et al. (Godfrey, 2003) have developed a software tool, termed TerrAlert, which can generate a large number of potential operational schedules via Monte Carlo simulations. It uses Bayesian likelihood theory to adjust the weight on each schedule based on evidence. Another research effort in modeling asymmetric threats was pursued by Rosen et al. (Rosen, 2003) using influence networks (a variant of a Bayesian network) to model suspicious events. The authors developed an influence network-based software tool, called the situational influence assessment module (SIAM), which provides the ability to model causal relationships among seemingly unconnected events and determine their effect on outcomes. In the context of nation-state stability analysis, Schrodt (Schrodt, 2000) employed HMMs to develop conflict measures based on observed event similarities to historical conflicts. Schrodt used a machine-coding program to perform linguistic parsing of historical and current news reports. The machine-coded event sequences were generated using a large set of verbs commonly found in international conflicts. A combination of these machine-coded events was used to represent the states of the HMM. Schrodt employed the Baum-Welch algorithm to learn the HMM parameters from the historical conflict data. After learning the parameters, Schrodt used the forward HMM algorithm to compute the probability of observing a sequence of events given the model. The results were presented for several international conflicts, including the Israeli-Palestinian conflict.

The method proposed in this chapter is one of the modules of a software tool, termed the Adaptive Safety Analysis and Monitoring (ASAM) system. Please refer to (Singh et al., October 2006, 2004; Tu et al., January 2006) for details on the software architecture, concept of operations, etc., of the ASAM system. Note that the focus of this chapter is entirely different from our previous work discussed in (Tu et al., January 2006). In (Tu et al., January 2006), we focused on information integration using hierarchical and hybrid Bayesian networks (HHBNs), a hierarchical combination of regular HMMs with no features and Bayesian networks (BNs). In the HHBN structure, HMMs function in the bottom (observation) layer to report processed evidence to the upper-layer BN based on local information. In this chapter, we propose a sequential detection-based approach to detect HMMs, which are used to model asymmetric threats (e.g., terrorist events). As far as we are aware, this is the only work in the literature that proposes a rigorous statistical framework to detect asymmetric threats.

Our work is quite different from the existing literature on anomaly detection in both its application context and its representation details. Most of the previous work on anomaly detection is focused on finding outliers in a time series; however, in our application, an anomaly is a sequence of intelligence transactions. In addition, our HMM state representation is significantly different from that of Schrodt (Schrodt, 2000). Our HMM state is depicted using an intelligence transaction, which contains information about the transaction type, the entities (people, places, etc.) involved in it, and their features. Our probabilistic transaction model also allows for missed detections and false alarms. In our application, an anomaly is a sequence of very few interesting transactions embedded in a large number of noise (benign) transactions. We cannot apply existing data mining techniques such as on-line classification (Fawcett, May 2004) or clustering methods (Salvador et al., 2004), because there is not adequate data available for supervised learning of the distribution of interesting transactions. In addition, the anomalies (i.e., asymmetric threats) tend to be sparse, and they do not tend to form clusters. A similar conclusion is made by Schrodt (Schrodt, 2000) in the context of nation-state conflict analysis, where the author acknowledges that analyzing event sequences using clustering techniques (Schrodt and Gerner, December 2000) has several drawbacks. For example, clustering requires aggregated data, whereas HMMs do not require any temporal aggregation. Hence, clustering techniques are infeasible and unrealistic for modeling threats, which can unfold over a few days to a few years. Using HMMs, however, we can process the data sequentially; hence HMMs provide a viable framework to model highly adaptive and abrupt threats. Another major drawback of clustering techniques is determining the start time of a crisis, whereas with HMMs this can be easily modeled by prefixing an HMM with a background state which represents the events of no crisis (Schrodt, 2000). Schrodt further proposed HMMs for international conflict analysis (Schrodt, 2000).

The superiority of our proposed HMMFA-based anomaly detection method is demonstrated by comparing it with a maximum likelihood-based data mining method, termed the Naïve method. The Naïve method models the asymmetric threat as an ergodic HMM with a doubly stochastic transition matrix, which considers the transitions among all the states as equally likely. In this case, all inferences are based on the data only. In summary, the contributions of this chapter are: (1) a rigorous statistical framework for the detection of asymmetric threats modeled using HMMs; (2) a novel HMM state representation and a method to compute the likelihood of modeled activity using concepts from feature-aided tracking; (3) an algorithm for updating the likelihood after skipping a known number of missing transactions; and (4) a detailed performance analysis of the proposed anomaly detection algorithm and a comparison with the Naïve method.
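For reference, the transition matrix assumed by the Naïve baseline is simply the uniform (and hence doubly stochastic) matrix, as sketched below for a hypothetical four-state model.

```python
import numpy as np

def naive_transition_matrix(n_states):
    """Uniform transition matrix used by the Naive baseline: every transition
    between the N states is equally likely, so each row and column sums to 1."""
    return np.full((n_states, n_states), 1.0 / n_states)

A_naive = naive_transition_matrix(4)
print(A_naive)
```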

2.1.2 Organization

We use a HMM variant where the observations are associated with the arcs of the model instead of with the states of the model (as in regular HMMs). The structure of such HMMs, along with feature-aided tracking, is discussed in Section 2.2. A transaction-based probabilistic model is also discussed in Section 2.2. Section 2.3 shows an application of our techniques to a hypothetical model of the development of a nuclear weapons program (DNWP) by a hostile country. A detailed description of the modeling process using the Testability Engineering and Maintenance System (TEAMS®) (QSI, 1994) software is provided in Section 2.4. The details of our algorithm are discussed in Section 2.5. Section 2.6 describes the simulation results for the DNWP model; some performance analysis is also presented in Section 2.6. Finally, we conclude the chapter with a summary and future research directions in Section 2.7.

2.2 A Transaction-based Probabilistic Model

In this section, we first discuss a variant of regular HMMs, where the observations are associated with arcs of the model instead of its hidden states. The state transition matrix of the underlying Markov chain associated with a discrete HMM, parameterized by $\Lambda = (A, B, \Pi)$, is given by

$$A = [a_{ij}] = \big[\, p\big(s(k+1) = S_j \mid s(k) = S_i\big) \,\big], \qquad i, j \in \{1, 2, \cdots, N\}, \tag{2.1}$$

where $s(k)$ is the state at time $k$, and $N$ is the number of states in the HMM. The observation process is represented via the emission matrix

$$B = [b_{ij x_k}] = \big[\, p\big(x_k = X_l \mid s(k) = S_j,\ s(k-1) = S_i\big) \,\big], \qquad i, j \in \{1, 2, \cdots, N\},\ l \in \{1, 2, \cdots, N_X\}, \tag{2.2}$$

where $x_k$ is the observation at time $k$, and $N_X$ is the number of observation types. The prior probabilities of the Markov states at time $k = 1$ are given by

$$\Pi = [\pi_i] = \big[\, p\big(s(1) = S_i\big) \,\big], \qquad i \in \{1, 2, \cdots, N\}. \tag{2.3}$$


[Figure 2-1: A simplified intelligence observation space. Entities (person, place, object) are connected by observations of various types (communication, trust, travel, money & resources).]

Note in particular that the emission probabilities are slightly different from those of regular HMMs; here the observation is conditioned on both the current and previous states, whereas in regular HMMs the observation is conditioned merely upon the current state. The HMMs can be generalized to allow for continuous emissions, implying that $b_{ijl}$ in (2.2) could be a probability density function. A convenient choice of the initial probability is the stationary distribution of the underlying Markov chain. The joint probability of a HMM state-observation sequence is

$$p(s_1, \ldots, s_n, x_1, \ldots, x_{n-1}) = \pi_{s_1} \left[\, \prod_{k=1}^{n-1} a_{s_k s_{k+1}} \right] \cdot \left[\, \prod_{k=1}^{n-1} b_{s_k s_{k+1} x_k} \right], \tag{2.4}$$

and this can be considered as its defining property. In the context of anomaly detection, $A$, $B$, and $\pi$ represent, respectively, the probability of moving from the current state of abnormal activity to another (usually denoting an increase in threat), the probability of observing a new suspicious transaction given the current and previous states, and the initial probability. The forward variable is used to evaluate the probability of abnormal activity, because it is an efficient way to compute the probability of a sequence of observations. The forward, backward and termination steps are modified to handle the dependence of the observation on both the previous and current states; the details of these modified steps are included in Subsection 2.5.1.
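A minimal sketch of the evaluation computation for this arc-emission variant is given below: the forward recursion follows directly from (2.4), with the emission term indexed by the transition (current and next state) rather than by a single state. The two-state model and its numbers are hypothetical and used only for illustration.

```python
import numpy as np

def forward_arc_emission(Pi, A, B, obs):
    """
    Evaluation problem for the arc-emission HMM variant of (2.1)-(2.4):
    the symbol emitted at step k depends on the transition s_k -> s_{k+1}.
    Pi:  (N,)        initial state probabilities
    A:   (N, N)      A[i, j]    = P(s_{k+1}=j | s_k=i)
    B:   (N, N, M)   B[i, j, l] = P(x_k = l | s_k=i, s_{k+1}=j)
    obs: length n-1  observed symbol indices x_1 ... x_{n-1}
    Returns P(x_1, ..., x_{n-1} | model).
    """
    alpha = Pi.copy()                       # alpha_1(i) = pi_i
    for x in obs:
        # alpha_{k+1}(j) = sum_i alpha_k(i) * a_ij * b_{ij x_k}
        alpha = alpha @ (A * B[:, :, x])
    return alpha.sum()

# Hypothetical two-state, two-symbol model.
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.zeros((2, 2, 2))
B[:, :, 0] = [[0.9, 0.5], [0.3, 0.2]]       # P(x=0 | i -> j)
B[:, :, 1] = 1.0 - B[:, :, 0]               # P(x=1 | i -> j)
print(forward_arc_emission(Pi, A, B, obs=[0, 1, 0]))
```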


f2 (skills)

Entities

Suspicious person

f3 (age)

f1 (country)

Travel place Transaction type (travel) f3 (city) f2 (state) Features

Figure 2-2: An example of a transaction handle the observation dependence on the previous state and current state. The details of these modified steps are included in Subsection 2.5.1. In the context of asymmetric threats, observations are represented by transactions, such as communication, travel, and financing between various entities (people, objects, activities, or places). The features associated with these entities are directly embedded within the underlying states of the hidden Markov models (HMMs). The HMM framework along with feature-aided tracking provides the capability to detect suspicious entities. The observations are collected from an intelligence space as shown in Fig. 2-1. Each observation (or transaction) is represented as a line connecting two shapes. The shapes (diamond, rectangle and triangle) represent the entities such as people, places and objects. The observations are of various types, such as communication, trust, travel, money and resources. The observation types are shown by different line types (e.g. solid, dotted, dashed, etc.) connecting the two entities.

Figure 2-3: A HMM combined with feature-aided tracking

To understand our notation for an observation (or transaction), let us consider a simple example. An intelligence analyst receives information that a suspicious person has entered the USA and has plans to conduct surveillance of potential targets to carry out a terrorist attack. As shown in Fig. 2-2, this example is represented by a dashed line (travel transaction) connected with a diamond (suspicious person) and a rectangle (USA). Each entity has generic features (e.g., age and citizenship of a person), which are denoted by f_1, f_2 and f_3. The features are shown by lines with circled ends.

Fig. 2-3 shows a HMM combined with feature-aided tracking. The shaded circles represent the states (s_1, s_2, s_3, s_4, etc.) of a true process. Each state consists of a transaction, which represents only "new" information. A sequence of states (e.g., s_1 s_2 s_3 s_4) represents a pattern (a sequence of true transactions), which describes the activities of an adversary. For example, a sequence of true transactions of communication, travel, trust and money type describes the HMM states s_1, s_2, s_3 and s_4, respectively, as shown in Fig. 2-3. The dotted transactions inside the HMM states represent previous transactions. For example, s_2 contains a solid line (communication transaction) connecting a dotted triangle (object) and a dotted diamond (person), which represents the transaction corresponding to the previous state s_1. The current and previous states together show the evolution of a true pattern of modeled activity (asymmetric threat). This true "hidden process" is observed through an imperfect and partial observation process (an intelligence database containing noisy information), i.e., a series of observed transactions (o_1, o_2, o_3, etc.), which is shown inside the unshaded circles in Fig. 2-3. The observations can be of many different types. For example, an observation may be somewhat similar to a true state (e.g., o_2 has the same transaction type (travel) as s_2, but it has different features). Another observation may have a transaction type different from that of a true state; e.g., o_3 has a money and resource type transaction, whereas the true state s_3 has a trust type. Our objective is to detect the hidden "true" pattern, which is a sequence of transactions (shown inside the shaded circles), via the observed process (observations in the unshaded circles). We can infer the existence of a true pattern based upon a set of observations, because the HMM states are statistically related to the noisy observation process. Next, we explain a transaction-based model which utilizes intelligence information to detect the modeled asymmetric threat. This model is also used to generate the observations for simulation. Fig. 2-4 shows the discretized transaction space of intelligence information, along with various types of transactions under two hypotheses, the null H (noise only) and the alternative K (HMM in the presence of noise).

Figure 2-4: Transaction space for hypothesis testing

It is assumed that the HMM becomes active at discrete time index n_0. For simplicity, we consider a single observation per scan. However, our formalism can be extended to multiple observations per scan mutatis mutandis. Each cell represents a transaction (intelligence) event in the database. The transactions are assumed to be stored in the repository at a specified sampling rate (e.g., one transaction per minute). While referring to the transactions, we distinguish among true transactions, irrelevant (or null) transactions, gated false alarms and clutter. The true transactions represent a pattern of threat, which is defined by an analyst in the template (hypothesized) model. The irrelevant or null type is defined as those transactions whose details (transaction type, etc.) do not match the template model. A gated false alarm is a transaction which has transaction details (transaction type, entity types and features) similar to those of true transactions; the number of gated false alarms in the scenario is determined by the probability of gated false alarms (P_fa). To understand the irrelevant and gated false alarm transaction types, let us consider the example discussed in Fig. 2-2. In that example, the observation is of travel type. If we get any observation other than travel, such as money or communication, then it will be an irrelevant type transaction. A gated false alarm for this example would be any travel transaction involving a "benign" person whose features are entirely different from those of a suspicious person. Let us say that the suspicious person is between 20 and 30 years of age and is a citizen of a country with high terrorist activity (say country X). The gated false alarm type observation could involve a benign person of the same age group as the suspicious person who is, however, a citizen of a different country (say country Y). The clutter (or false transaction set) consists of irrelevant transactions and gated false alarms. The objective of a HMM is to detect a pattern of true transactions (a graph inside the shaded circles in Fig. 2-3) embedded in a large number of extraneous transactions (irrelevant transactions and gated false alarms). Fig. 2-4 shows the transaction space, where "i" represents an irrelevant event (or null transaction), i.e., a transaction not related to the modeled scenario. The symbol "f" indicates a gated false alarm, "$" indicates a true transaction which is detected by the HMM, and "@" shows a transaction which is not detected (or is missed) by the HMM observation mechanism. In this chapter, we consider the problem of tracking multiple independent HMMs. In the independent case, a binary hypothesis test can be constructed for each HMM. Specifically, instead of evaluating the probability of a sequence of observations at a specified discrete-time index k, given a particular HMM, as in the usual evaluation problem, we are interested in a hypothesis testing problem with null hypothesis H as pure noise ("benign transactions") and alternative hypothesis K as a HMM of interest (viz., "asymmetric threat") being detected at a specified discrete time index. The details of a single HMM detection scheme based on Page's test are given in Subsection 2.5.2. We also propose an algorithm to reduce the inference computations by skipping over irrelevant type transactions, which is presented in Subsection 2.5.3. Here, we discuss a transaction-based model for the null and alternative hypotheses that is used to compute the inference.

2.2.1 Null Hypothesis ("Noise Only")

The null hypothesis is made up of gated false alarms and irrelevant (or null) transactions. Note that we assume high-threshold false alarms, i.e., they look very much like true signal transactions. Each gated false alarm has a probability of transaction type τ(k), denoted by g_τ(τ(k)), and its entities have features with specified probabilities g_j^{(l)}(f_j(k)); the subscript j indexes the features and the superscript l indexes the entities. We assume that the probability of a gated false alarm at each epoch is P_{fa} (Fig. 2-4), which can be estimated using the large number of benign transactions; it need not be a constant. The probability of transaction type and features for gated false alarms is given by

g_{fa} = g_\tau(\tau(k)) \prod_j g_j^{(l)}(f_j(k)).   (2.5)

The emission probabilities under the null hypothesis are

b^H_{ij x_k} = p(x(k) \mid H, s(k), s(k-1)) =
\begin{cases}
P_{fa}\, g_{fa}, & x(k) = \times \ \text{(gated false alarm)} \\
1 - P_{fa}, & x(k) = \phi \ \text{(irrelevant transaction)}.
\end{cases}   (2.6)

2.2.2 Alternative Hypothesis ("HMM in the Presence of Noise")

When a HMM is present in the noise, we model the transactions according to (2.7) and (2.8). Equation (2.8) covers two situations. In the first, the HMM remains in its current state and behaves as if it were under the null hypothesis ("noise only"). In the second, the state changes, and the HMM is under the alternative hypothesis (HMM in the presence of noise). We assume that the probability of detection for a true transaction is P_d. The probability of the features of true transactions is denoted by p_T, as follows:

p_T = \prod_j p_j^{(l)}(f_j(k)).   (2.7)

The emission probabilities under the alternative hypothesis are given by

b^K_{ij x_k} = p(x(k) \mid K, s(k), s(k-1)) =
\begin{cases}
1 - P_{fa}, & x(k) = \phi,\ s(k) = s(k-1) \\
P_{fa}\, g_{fa}, & x(k) = \times,\ s(k) = s(k-1) \\
(1 - P_{fa}) P_d\, p_T + (1 - P_d) P_{fa}\, g_{fa}, & \tau(x(k)) = \tau(k),\ s(k) \neq s(k-1) \\
(1 - P_{fa})(1 - P_d), & x(k) = \phi,\ s(k) \neq s(k-1) \\
(1 - P_d) P_{fa}\, g_{fa}, & \tau(x(k)) \neq \tau(k),\ s(k) \neq s(k-1) \\
P_d\, P_{fa}\, g_{fa}, & |x(k)| = 2 \ \text{(not possible)},
\end{cases}   (2.8)

where τ(x(k)) denotes the transaction type of the observation and τ(k) the transaction type of the true transaction associated with the new state.
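The emission terms in (2.5)-(2.8) can be assembled programmatically from P_fa, P_d, g_fa and p_T. The following Python sketch is a hedged illustration, not the thesis implementation: the symbol PHI, the convention that a non-irrelevant observation is identified by its transaction type, and the helper names are assumptions made for the example.

import numpy as np

PHI = None   # the phi symbol: an irrelevant (null) observation

def feature_product(feature_probs, features):
    """Product of per-feature probabilities, as in (2.5) and (2.7)."""
    return float(np.prod([feature_probs[name](value) for name, value in features.items()]))

def emission_null(obs_type, P_fa, g_fa):
    """b^H in (2.6): any non-irrelevant observation is a gated false alarm."""
    return 1.0 - P_fa if obs_type is PHI else P_fa * g_fa

def emission_alt(obs_type, true_type, state_changed, P_fa, P_d, g_fa, p_T):
    """b^K in (2.8), assuming a single observation per scan."""
    if not state_changed:                          # first situation: behaves as under H
        return 1.0 - P_fa if obs_type is PHI else P_fa * g_fa
    if obs_type is PHI:                            # new true transaction missed, no false alarm
        return (1.0 - P_fa) * (1.0 - P_d)
    if obs_type == true_type:                      # observation matches the modeled transaction type
        return (1.0 - P_fa) * P_d * p_T + (1.0 - P_d) * P_fa * g_fa
    return (1.0 - P_d) * P_fa * g_fa               # gated false alarm of a different type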

2.3 Example: Development of a Nuclear Weapons Program (DNWP)

There are many reasons a country may seek to develop nuclear weapons, but whatever the reasons, the development of nuclear capability by a country has vast implications for the US and its allies. The intelligence community's ability to detect, analyze, and monitor the development of these programs is essential. The purpose of this model is to describe the evolution of a nuclear weapons program by a hostile country. The model developed herein describes the progress of a nuclear weapons program as a pattern of events grouped into three HMMs: the research and design (RD) HMM; the production of weapons grade material (PWGM) HMM; and the fabrication, test, and integration (FTI) HMM. This model does not attempt to enumerate every possible step or observation, but rather to capture key observable events in the process of developing nuclear weapons. The model is gleaned from open sources (Barnaby, 2004; Settle, 2005; Spector and Smith, 1990; Paternoster, December 1992; Congress, December 1993).

Figure 2-5: Markov chain of RD HMM (eighteen states, from "start research and design activities" through "initial military tactics and strategy discussions")

On the road to the development of nuclear weapons, a country must perform several tasks. The research and design (RD) HMM involves the processes and equipment used to enrich nuclear material, planning for integration of the weapons into the military portfolio, as well as the design of the weapons themselves. The activities of key scientists within a country would reveal that it is developing nuclear weapons. We consider information requests and studies related to nuclear engineering as key indicators. Computer simulations are an integral part of the initial design process of both enrichment and weapons technologies. In addition, a country must gain proficiency with the development and detonation of chemically pure high explosives. The research, experimentation, and testing of these high explosives are observable. Experimentation requires the acquisition of many pieces of specialized test equipment, which, we assume, might be detected. Finally, the unsafeguarded experimentation with refinement of nuclear material would be a strong indicator that a country was developing a nuclear weapons program. Fig. 2-5 shows a Markov chain diagram of the RD HMM. A country wishing to develop nuclear weapons must acquire or produce weapons grade nuclear material.

In the production of weapons grade material (PWGM) HMM, we assume that, in order to remain clandestine and have adequate material to create even a meager arsenal, the country chooses to mine, mill, and refine its own uranium. However, the ultimate refinement to weapons grade is allowed to follow either the uranium or the plutonium enrichment technologies. The enrichment of uranium is assumed to be by means of gas diffusion or centrifuge. A gas diffusion facility would likely be co-located with a large power plant. Meanwhile, a centrifuge enrichment facility would be very large, and its construction as well as its heat signature would likely be observable. The enrichment of plutonium from uranium would require a nuclear reactor and a processing plant. The reactor would have no output power when it was being used to create plutonium, and this would likely be observable. Throughout the progress of this group of events, we assume that the acquisition of specialized equipment and resources is observable. Finally, if a country has its own nuclear energy program, its conduct (e.g., lack of cooperation) with respect to International Atomic Energy Agency (IAEA) regulations and inspections is also considered an indicator of an active nuclear weapons program. The Markov chain diagram of the PWGM HMM is shown in Fig. 2-6.

Figure 2-6: Markov chain of PWGM HMM

Once the nuclear material has been refined to weapons grade, it must be assembled together with the non-nuclear components into the final nuclear weapon. The weapons grade material must be formed into a shape according to the particular design. This requires lathes, furnaces, an inert gas environment, and other special equipment. A mature nuclear weapons program will have tested, and will be developing, significant quantities of high explosives. It is also assumed that country X will be experimenting with boost technology to increase their weapon's yield. Deployment of a nuclear weapon, once it has been developed, requires a delivery system. Therefore, we assume that signs of the development of certain delivery systems are indicators of a nuclear weapons program. At some point before the weapon is actually assembled, a political decision must be made to do so. This event is likely to coincide with the implementation of policy, strategic, and military integration of the nuclear weapon program. The test of a nuclear weapon is the final state of the fabrication, test, and integration (FTI) HMM. The Markov chain depicting the FTI HMM is shown in Fig. 2-7.

Figure 2-7: Markov chain of FTI HMM

Figure 2-8: A HMM state consists of a transaction

2.4 Modeling HMMs via TEAMS

We have developed a graphical modeling tool using TEAMS® (Testability Engineering and Maintenance Systems). Using TEAMS®, an analyst can specify the transition probabilities, prior probabilities, and the nodes associated with an HMM state. Each HMM state consists of a transaction (signal), and a transaction is made of two nodes (or entities), as explained in Section 2.2. Fig. 2-8 shows a HMM state which has two nodes, Nuclear Scientist and Conference. Different features are available for each type of node. For example, a person type node may have features such as citizenship, age, and skills. Each feature must be given a confidence (a number between 0 and 1), which is the probability that the feature information is available in the observation: a confidence close to 1 indicates that the feature information is likely to be available in the observations, whereas a value close to 0 indicates that it is unlikely to be available. Fig. 2-9 shows the features associated with the Nuclear Scientist node. After inputting all the nodes, the analyst can add the transactions associated with each state of the HMM. Finally, the model information is exported to the database for further analysis by the HMM and feature-aided tracking algorithm.

2.5 Algorithm Details

2.5.1 Modified Forward, Backward and Termination Steps

In this subsection, we discuss the forward, backward and termination steps when the observations are tied to the arcs of a HMM instead of the states of a HMM. Here, the forward and backward variables at discrete time epoch k are denoted by α_k and β_k, respectively. A discrete HMM parameterized by Λ = (A, B, Π) is given by (2.1), (2.2) and (2.3). The total number of states in a HMM is denoted by N; the state and observation at discrete time epoch k are denoted by s_k and x_k, respectively.

Figure 2-9: Features associated with Nuclear Scientist node

Forward equations:

\alpha_k(j) = p(x_1, \ldots, x_k, s_k = j \mid \Lambda)
            = \sum_{i=1}^{N} p(x_1, \ldots, x_k, s_k = j, s_{k-1} = i \mid \Lambda)
            = \sum_{i=1}^{N} p(x_k \mid s_{k-1} = i, s_k = j)\, p(s_k = j \mid s_{k-1} = i)\, p(x_1, x_2, \ldots, x_{k-1}, s_{k-1} = i \mid \Lambda)
            = \sum_{i=1}^{N} b_{ij x_k}\, a_{ij}\, \alpha_{k-1}(i), \qquad 1 \le j \le N,\ 1 \le k \le n.   (2.9)

Termination:

\alpha_n(j) = p(x_1, \ldots, x_n, s_n = j \mid \Lambda), \qquad p(x_1, \ldots, x_n \mid \Lambda) = \sum_{j=1}^{N} \alpha_n(j).   (2.10)

Backward equations:

\beta_k(i) = p(x_{k+1}, \ldots, x_n \mid \Lambda, s_k = i)
           = \sum_{j=1}^{N} p(x_{k+1}, \ldots, x_n, s_{k+1} = j \mid \Lambda, s_k = i)
           = \sum_{j=1}^{N} p(x_{k+1} \mid s_k = i, s_{k+1} = j)\, p(s_{k+1} = j \mid s_k = i)\, p(x_{k+2}, \ldots, x_n \mid \Lambda, s_{k+1} = j)
           = \sum_{j=1}^{N} b_{ij x_{k+1}}\, a_{ij}\, \beta_{k+1}(j), \qquad 1 \le i \le N,\ 0 \le k \le n-1,   (2.11)

where β_n(i) = 1.
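A compact Python sketch of the modified forward recursion (2.9)-(2.10) is given below. It is an illustration under the same array conventions as the earlier sketch (a_ij and b_ijx stored as NumPy arrays), not the thesis code; initializing with the prior at the first state is an assumption made for the example.

import numpy as np

def forward(A, B, pi, observations):
    """Arc-emission forward variables and likelihood, eqs. (2.9)-(2.10)."""
    N = A.shape[0]
    alpha = pi.copy()                    # no arc has been traversed at the first state
    alphas = [alpha]
    for x in observations:
        # alpha_k(j) = sum_i b_{ij x_k} a_{ij} alpha_{k-1}(i)
        alpha = np.array([np.sum(B[:, j, x] * A[:, j] * alpha) for j in range(N)])
        alphas.append(alpha)
    likelihood = alphas[-1].sum()        # termination step, eq. (2.10)
    return alphas, likelihood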

2.5.2 HMM Detection Scheme

Page's test (Page, 1954), also known as the cumulative sum or cusum procedure, is an efficient change detection scheme. A change detection problem is one in which the distribution of the observations is different before and after an unknown time n_0, and we want to detect the change, if it exists, as soon as possible. Casting it into a standard inference framework, we have the following hypothesis testing problem:

H: x(k) = v(k), \quad 1 \le k \le n
K: x(k) = v(k), \quad 1 \le k < n_0; \qquad x(k) = z(k), \quad n_0 \le k \le n,   (2.12)

where the x(k) are observations and v(k) and z(k) are independent identically distributed (i.i.d.) stochastic processes, with probability density functions (pdf) denoted by f_H and f_K, respectively. Note that under the K hypothesis, the observations are no longer a stationary random sequence: their distribution switches at n_0 from f_H to f_K. The Page decision rule, which can be derived from the generalized likelihood ratio test (GLRT), amounts to finding the stopping time N_T. When the observations are i.i.d., the standard recursion for Page's test can be written as

N_T = \arg\min_n \{S_n \ge h\},   (2.13)

in which

S_n = \max\{0, S_{n-1} + g(x_n)\}   (2.14)

and

g(x_n) = \ln\!\left(\frac{f_K(x_n)}{f_H(x_n)}\right)   (2.15)

is the update nonlinearity. Page's recursion ensures that the test statistic S_n is "clamped" at zero; i.e., whenever the log likelihood ratio (LLR) of the current observation would make the test statistic negative (which happens more often when H is true), Page's test resets it to zero. Thus, operationally, Page's test is equivalent to a series of sequential probability ratio tests (SPRTs) with upper and lower thresholds h and 0. Whenever the lower threshold 0 is crossed, a new SPRT is initiated from the next sample, until the upper threshold h is crossed. Consider the case when f_H and f_K are general non-i.i.d. probability measures. In compact form, we can write, in a manner similar to the standard Page's recursion (2.14),

S_n = \max\{0, S_{n-1} + g(n; k)\},   (2.16)

where

g(n; k) = \ln\!\left(\frac{f_K(x_n \mid x_{n-1}, \ldots, x_k)}{f_H(x_n \mid x_{n-1}, \ldots, x_k)}\right)   (2.17)

and x_k is the first sample after the last reset, i.e., S_{k-1} = 0.
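As a concrete, self-contained illustration of (2.13)-(2.15) (an assumed i.i.d. Gaussian mean-shift example, not taken from the thesis), the sketch below runs Page's test and reports the stopping time; the normalizing constants of the densities cancel in the likelihood ratio.

import numpy as np

def page_test(x, f_H, f_K, h):
    """Page's cusum test, eqs. (2.13)-(2.15); returns the stopping time N_T or None."""
    S = 0.0
    for n, xn in enumerate(x, start=1):
        g = np.log(f_K(xn) / f_H(xn))    # update nonlinearity, eq. (2.15)
        S = max(0.0, S + g)              # clamped recursion, eq. (2.14)
        if S >= h:
            return n                     # first n with S_n >= h, eq. (2.13)
    return None

# Example: unit-variance Gaussian mean shift from 0 to 1 at epoch 200 (illustrative).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
f_H = lambda v: np.exp(-0.5 * v**2)           # unnormalized N(0,1) density
f_K = lambda v: np.exp(-0.5 * (v - 1.0)**2)   # unnormalized N(1,1) density
print(page_test(x, f_H, f_K, h=5.0))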

For the hidden Markov model, the existence of the forward variable, together with its recursive formula as discussed here, enables efficient computation of the likelihood function of an HMM. Specifically, the likelihood function of an HMM with parameter triple Λ can be written as

f(x_1, x_2, \ldots, x_k \mid \Lambda) = \sum_{i=1}^{N} \alpha_k(i),   (2.18)

where N is the total number of states and the α_k are the forward variables defined in (2.9). Now the conditional probability is readily obtained as

f_j(x_k \mid x_{k-1}, \ldots, x_1) = f(x_k \mid x_{k-1}, x_{k-2}, \ldots, x_1, \Lambda_j) = \frac{\sum_{i=1}^{N} \alpha_k(i)}{\sum_{i=1}^{N} \alpha_{k-1}(i)},   (2.19)

where j = H, K. In practice, it is found that the direct use of the likelihood function as defined in (2.18) will cause numerical underflow as the number of observations increases. For discrete HMMs, it is easily seen from the definition of the forward variable that the likelihood decreases monotonically (and generally geometrically) with the number of observations. The conditional likelihood function defined in (2.19) does not suffer from such a numerical problem. We therefore need to recursively compute the conditional likelihood function in (2.19) without the direct use of the forward variable. This can be achieved by scaling. Define α'_k such that α'_1(i) = α_1(i), but for k > 1

\alpha'_{k+1}(j) = \frac{\sum_{i=1}^{N} \alpha'_k(i)\, a_{ij}\, p(x(k) \mid i, j)}{\sum_{i=1}^{N} \alpha'_k(i)}.   (2.20)

To summarize, for the quickest detection of HMMs, we propose the following procedure:

1. Set k = 1 and l_0 = 0, where l_k denotes the LLR at time k.

2. Initialize the (scaled) forward variable α'_k using

\alpha'_k(j \mid K) = \pi_j, \qquad \alpha'_k(j \mid H) = \alpha'_k(1 \mid H) = p(x(k) \mid H, i, j).   (2.21)

(Under the "noise only" hypothesis H, we model a one-state HMM with a^H_{ij} = 1 and N_H = 1.)

3. For each possible state j and for both hypotheses H and K, update the log likelihood ratio

l_k = l_{k-1} + \ln\!\left(\frac{\sum_{i=1}^{N_K} \alpha'_k(i \mid K)}{\sum_{i=1}^{N_H = 1} \alpha'_k(i \mid H)}\right).   (2.22)

4. If l_k > h, declare detection of a change and stop. If l_k < 0, set l_k = 0, set k = k + 1, and go to step 2. If 0 < l_k < h, continue.

5. Set k = k + 1 and update the scaled forward variables using (2.23) and (2.24):

\alpha'_{k+1}(j \mid K) = \frac{\sum_{i=1}^{N_K} \alpha'_k(i \mid K)\, a^K_{ij}\, p(x(k) \mid K, i, j)}{\sum_{i=1}^{N_K} \alpha'_k(i \mid K)},   (2.23)

\alpha'_{k+1}(j \mid H) = \frac{\sum_{i=1}^{N_H = 1} \alpha'_k(i \mid H)\, a^H_{ij}\, p(x(k) \mid H, i, j)}{\sum_{i=1}^{N_H = 1} \alpha'_k(i \mid H)} = \frac{\alpha'_k(i \mid H)\, p(x(k) \mid H, i, j)}{\alpha'_k(i \mid H)},   (2.24)

then go to step 3. Here, N_K and N_H are the numbers of states in the HMM under the K and H hypotheses, respectively.
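The procedure above can be coded directly. The following Python sketch is one possible realization under stated assumptions (it is not the thesis implementation): the emission functions emit_K(x, i, j) = p(x(k) | K, i, j) and emit_H(x) = p(x(k) | H) are supplied by the caller, and the single noise state under H has a^H = 1.

import numpy as np

def hmm_page_detector(observations, A_K, pi_K, emit_K, emit_H, h):
    """Quickest detection of a HMM via the cusum procedure of (2.21)-(2.24)."""
    N = len(pi_K)
    l = 0.0
    alpha_K = alpha_H = None                     # forces step 2 at the next epoch
    for k, x in enumerate(observations, start=1):
        if alpha_K is None:                      # step 2: (re)initialize, eq. (2.21)
            alpha_K = pi_K.copy()
            alpha_H = np.array([emit_H(x)])
        else:                                    # step 5: scaled updates, eqs. (2.23)-(2.24)
            new_K = np.array([sum(alpha_K[i] * A_K[i, j] * emit_K(x, i, j) for i in range(N))
                              for j in range(N)])
            alpha_K = new_K / alpha_K.sum()
            alpha_H = np.array([emit_H(x)])      # one noise state with a^H = 1
        l += np.log(alpha_K.sum() / alpha_H.sum())   # step 3: LLR update, eq. (2.22)
        if l > h:
            return k                             # step 4: change declared
        if l < 0.0:
            l, alpha_K, alpha_H = 0.0, None, None    # reset and restart the SPRT
    return None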

2.5.3 Cusum Update with Skipped Observations

In Subsection 2.5.2, we discussed the inference algorithm which updates the cusum statistic for every observation. This kind of cusum update is slow and computationally intensive. We can accelerate the cusum update by skipping irrelevant (or null) observations, i.e., we perform the update only when the observations are pertinent to the model, while taking into account the number of irrelevant (or null) transactions among them. To derive the cusum update for the skipped-observations case, we first need to differentiate between the forward variable α_k(i) and the scaled forward variable α'_k(i). Recall from the forward-backward algorithm (2.9) that the forward variable is given by

\alpha_k(i) = p(s(k) = i, X_1^k),   (2.25)

and the scaled forward variable is given by ((2.20) in Subsection 2.5.2)

\alpha'_k(i) = \left( \sum_j \alpha'_{k-1}(j) \right)^{-1} \left( \sum_j \alpha'_{k-1}(j)\, a_{ji}\, p(x(k) \mid s(k) = i, s(k-1) = j) \right).   (2.26)

Here, X_1^k denotes the superimposed observations x_1 through x_k. Let us propose a scaled forward variable of the form

\alpha'_k(i) = p(s(k) = i, x(k) \mid X_1^{k-1}).   (2.27)

Then, using (2.26), the scaled forward variable can be written as

\alpha'_{k+1}(i) = \left( \sum_j p(s(k) = j, x(k) \mid X_1^{k-1}) \right)^{-1} \sum_j p(s(k) = j, x(k) \mid X_1^{k-1})\, p(s(k+1) = i \mid s(k) = j)\, p(x(k+1) \mid s(k+1) = i, s(k) = j)
                 = \left( p(x(k) \mid X_1^{k-1}) \right)^{-1} \sum_j p(x(k+1), s(k+1) = i, s(k) = j, x(k) \mid X_1^{k-1})
                 = \sum_j p(x(k+1), s(k+1) = i, s(k) = j \mid X_1^k)
                 = p(x(k+1), s(k+1) = i \mid X_1^k),   (2.28)

so the conjectured scaled forward variable given by (2.27) is proven, by induction, to be exactly what is desired. Further, (2.27) can be used to check the cusum expression:

\prod_{\tau=1}^{k} \left( \sum_i \alpha'_\tau(i) \right) = \prod_{\tau=1}^{k} p(x(\tau) \mid X_1^{\tau-1}) = p(X_1^k).   (2.29)

Now let us consider that we have a sequence of Q "null" observation epochs. At the beginning, we have (α'_k(i))_i and p(X_1^k) for both hypotheses (the latter is available from the cusum). We need to compute (α'_{k+Q}(i))_i and p(X_1^{k+Q}). Consider

p(s(k+Q), X_{k+1}^{k+Q} \mid X_1^k) = \left( p(x(k) \mid X_1^{k-1}) \right)^{-1} p(s(k+Q), X_k^{k+Q} \mid X_1^{k-1})
 = \left( p(x(k) \mid X_1^{k-1}) \right)^{-1} \sum_{s_{k+1}^{k+Q-1}} p(s_{k+1}^{k+Q}, X_{k+1}^{k+Q}, x(k) \mid X_1^{k-1})
 = \left( p(x(k) \mid X_1^{k-1}) \right)^{-1} p(s(k), x(k) \mid X_1^{k-1}) \sum_{s_{k+1}^{k+Q-1}} \prod_{n=1}^{Q} p(x(k+n) \mid s(k+n))\, p(s_{k+1}^{k+Q} \mid x(k), s(k), X_1^{k-1}),   (2.30)

and we also have

p(x(k) \mid X_1^{k-1}) = \sum_j \alpha'_k(j),   (2.31)

\prod_{n=1}^{Q} p(x(k+n) \mid s(k+n)) = (1 - P_{fa})^Q (1 - P_d)^B,   (2.32)

where B is the difference in the number of states between s(k) and s(k+Q). For example, if states s(k) and s(k+Q) contain three and five transactions, respectively, then to get from s(k) to s(k+Q) in (say) eleven null observations, we would need 11 non-false-alarms and 2 missed detections. If the probability of detection P_d is different for different types of transactions, this can easily be accommodated. Note also that (2.32) does not depend on s_{k+1}^{k+Q}; it depends only on s(k) and s(k+Q). Further,

\sum_{s_{k+1}^{k+Q-1}} p(s_{k+1}^{k+Q} \mid x(k), s(k), X_1^{k-1}) = p(s(k+Q) \mid x(k), s(k), X_1^{k-1}) = p(s(k+Q) \mid s(k)),   (2.33)

which involves the transition matrix raised to the Qth power and can easily be precomputed. Overall, we have

p(s(k+Q) = i, X_{k+1}^{k+Q} \mid X_1^k) = \left( \sum_j \alpha'_k(j) \right)^{-1} \sum_j (1 - P_d)^{i-j} (A^Q)_{ji}\, \alpha'_k(j)\, (1 - P_{fa})^Q,   (2.34)

where (i - j) is the difference in the number of transactions between states i and j. We use (2.34) in

p(X_1^{k+Q}) = p(X_1^k)\, p(X_{k+1}^{k+Q} \mid X_1^k) = p(X_1^k) \sum_i p(s(k+Q) = i, X_{k+1}^{k+Q} \mid X_1^k).   (2.35)

For the Q-step-at-a-time cusum update,

\alpha''_{k+Q}(i) = \left( \sum_j p(s(k+Q) = j, X_{k+1}^{k+Q} \mid X_1^k) \right)^{-1} p(s(k+Q) = i, X_{k+1}^{k+Q} \mid X_1^k);   (2.36)

note also that

\alpha''_{k+Q}(i) \propto \alpha'_{k+Q}(i) = p(s(k+Q) = i, x(k+Q) \mid X_1^k),   (2.37)

because they differ by a normalization constant. However, since the term (\sum_j \alpha'_k(j))^{-1} is used in all updates, the normalization to α''_k(j) causes no problems.
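A minimal Python sketch of the Q-step-at-a-time update (2.32)-(2.36) is shown below. It is an illustration under assumptions, not the thesis code: n_trans[i] (the number of true transactions accumulated in state i) and a uniform P_d are assumed inputs.

import numpy as np

def q_step_update(alpha, A, n_trans, Q, P_fa, P_d):
    """Propagate the scaled forward variable over Q irrelevant ("null") epochs.

    alpha   : scaled forward variable alpha'_k(i) at the last pertinent epoch
    A       : HMM state transition matrix
    n_trans : number of true transactions associated with each state
    Returns (alpha'', block_likelihood), with block_likelihood = p(X_{k+1}^{k+Q} | X_1^k).
    """
    N = len(alpha)
    A_Q = np.linalg.matrix_power(A, Q)           # p(s(k+Q) | s(k)), eq. (2.33), precomputable
    unnorm = np.zeros(N)
    for i in range(N):                           # eq. (2.34)
        for j in range(N):
            B = n_trans[i] - n_trans[j]          # true transactions missed during the gap
            if B >= 0:
                unnorm[i] += (1.0 - P_d) ** B * A_Q[j, i] * alpha[j]
    unnorm *= (1.0 - P_fa) ** Q / alpha.sum()
    block_likelihood = unnorm.sum()              # eq. (2.35): p(X_{k+1}^{k+Q} | X_1^k)
    alpha_next = unnorm / block_likelihood       # eq. (2.36): renormalized forward variable
    return alpha_next, block_likelihood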

2.6 Simulations and Results

For the simulations, the data is a combination of the underlying hidden states of the three HMMs of the DNWP model embedded in background noise from a benign source. The feature-aided tracking and HMM inference model's parameters (viz., transition probabilities, emission matrix, prior probabilities and feature probabilities) could be estimated using a learning algorithm such as EM (Moon, November 1996) (or the Baum-Welch algorithm (Baum et al., 1970) in the classical HMM phraseology). In a data-scarce environment, such as asymmetric threats, it is doubtful that one can obtain enough training data to learn the model (including the structure and model parameters). Our approach has been to develop an initial model based on our understanding of the domain, and to seek review and feedback from subject matter experts (SMEs). We assumed a probability of detection of 80% and a probability of gated false alarm of 3% for generating the observations. In the real world, these probabilities could be estimated from data. The dataset contained 6000 observations for each HMM, for a total of 18000 observations; it included about 58 true transactions, about 17406 irrelevant/null transactions, and about 536 gated false alarms. The following are typical results obtained using the simulations.

Figure 2-10: Cusum statistic of PWGM, RD and FTI HMMs under the null hypothesis

Figure 2-11: PWGM HMM: cusum statistic of HMMFA and Naïve methods

2.6.1 Results under Null Hypothesis

Fig. 2-10 shows the cusum statistic under the null hypothesis. The cusum plot mostly remains close to zero; however, peaks occur when there is any kind of match between the pattern defined by the HMM and the observations. The PWGM HMM shows more peaks with high values, because its structure contains many parallel branches (Fig. 2-6), which increase the probability of matches between a pattern and the observations. The cusum statistic for the RD HMM (dash-dotted curve) shows higher peaks in the range from 6000 to 12000, because in this range gated false alarms were randomly generated using the true states of the RD HMM. Similarly, the cusum statistic for the FTI HMM (solid curve) shows higher peaks in the range 12000-18000.


Figure 2-12: RD HMM: cusum statistic of HMMFA and Naïve methods


Figure 2-13: FTI HMM: cusum statistic of HMMFA and Naïve methods

2.6.2 Results under Alternative Hypotheses

Figs. 2-11, 2-12 and 2-13 show the cusum statistic under the alternative hypothesis. The observations were created by embedding the true states of the HMM in the noise. The observation range corresponding to the PWGM HMM is 1 to 6000, to the RD HMM 6001 to 12000, and to the FTI HMM 12001 to 18000. The starting point of each HMM detection curve corresponds to the first time the HMM is detected; thus, we believe with certain probability that the modeled suspicious activity is in progress. A peak usually results when the pattern evolves into the absorbing state of the concomitant HMM and we have obtained the maximum number of signal transactions for this HMM. All the cusum plots eventually level off because the HMM has most likely reached its terminal state. Further, the false alarms change the numerator and denominator of the likelihood ratio by nearly the same amount; this results in the leveling-off behavior of the cusum statistic plot. We also compared the results of the proposed hidden Markov model and feature-aided tracking (HMMFA) method with the Naïve method. The Naïve method models the threat as a HMM with a doubly stochastic transition matrix. Hence, the Naïve method may be viewed as a maximum likelihood-based data mining method. Figs. 2-11, 2-12, and 2-13 compare the cusum statistics of the HMMFA and Naïve methods. The results demonstrate that the HMMFA method is able to detect the modeled threat activity, whereas the Naïve method completely misses the pattern.


Figure 2-14: Cusum statistic of truncated RD HMM under what-if scenario no. 1


Figure 2-15: Cusum statistic of truncated RD HMM under what-if scenario no. 2

2.6.3 What-if Analysis under Alternative Hypothesis

Next, we perform a what-if analysis for the Research and Design (RD) HMM. In this analysis, we assume that the intelligence data contains transactions corresponding to only the first twelve states of the HMM and that the rest of the pattern (i.e., the last six states of the RD HMM) is missing from the observed data. Fig. 2-14 shows the cusum plot for this scenario. The cusum plot differs from the plot (the solid curve in Fig. 2-12) obtained when the entire pattern was present in the dataset. However, the HMM still detects the pattern associated with the research and design activity of the DNWP model. The fall in the cusum statistic occurs due to an increase in the number of gated false alarms. Next, we also removed the transactions corresponding to the first two states of the RD HMM, in addition to removing the six transactions from the end. Fig. 2-15 shows the cusum plot for the new what-if scenario. The plot demonstrates that HMMs combined with feature-aided tracking are still able to detect the pattern. These what-if scenarios illustrate that our method is robust to missing data, and that it can detect the pattern even if only a part of the pattern is embedded in the noisy intelligence data.

2.6.4 Performance Analysis

The performance of Page's test is measured in terms of the average run length (ARL, the average number of observations it takes before declaring a detection) under the K and H hypotheses. It is always desired to have a small average delay to detection ($\bar{D}$) while keeping the average time interval between false alarms ($\bar{T}$) as large as possible. Analogous to the conventional hypothesis testing problem, where we wish to maximize the probability of detection while keeping the false alarm rate under a fixed level, the trade-off amounts to the choice of the decision threshold h. The relationship between h and the ARL is often calculated in an asymptotic sense using first or second order approximations, usually credited to Wald and Siegmund (Wald, 1947; Siegmund, 1985).

Figure 2-16: Effect of complexity of HMM on the performance

We conducted the performance analysis by varying several parameters, such as the Page's test threshold (h) and the number of states in a HMM, to compute the average time interval between false alarms ($\bar{T}$) and the average delay to detection ($\bar{D}$). We considered the RD HMM, which contains eighteen states, for the performance analysis. In order to compare the performance of a HMM with different numbers of states, we truncated the RD HMM to construct new HMMs containing twelve and fifteen states. The numbers of observations and Monte Carlo simulations were chosen so as to achieve low $\bar{D}$ and high $\bar{T}$. The observations were created using a P_fa of 3% and a P_d of 80%. The average delay to detection $\bar{D}$ was obtained by varying the Page's test threshold (h), and we used 100 Monte Carlo runs, each with 4000 observations, under the alternative hypothesis (when the RD HMM is present in noise) to compute the statistic. The average time interval between false alarms ($\bar{T}$) was obtained by using the same threshold (h) range as was used to get $\bar{D}$; we performed 50 Monte Carlo simulations, each with 50000 observations, under the null hypothesis (noise only) to get $\bar{T}$. The transaction feature probabilities were kept between 0.65 and 0.95.

Figure 2-17: Performance comparison between HMMFA and Naïve methods

Fig. 2-16 shows the plot of $\bar{T}$ vs. $\bar{D}$, which illustrates the exponential relationship between $\bar{T}$ and $\bar{D}$. The large values of $\bar{D}$ indicate that the activities take a long time to unfold; they were detected once the HMMs had observed some transactions corresponding to the HMM states. Fig. 2-16 shows that the performance of the RD HMM improves when the number of states is increased. The shifting of the $\bar{T}$ vs. $\bar{D}$ plot towards the upper-left corner for various numbers of states indicates an improvement in performance, and the improvement is significant when the number of states is increased from 12 to 18. We also compared the performance of the HMMFA and Naïve methods by plotting $\bar{T}$ vs. $\bar{D}$, as shown in Figs. 2-16 and 2-17. The $\bar{T}$ and $\bar{D}$ values were obtained using the above-mentioned procedure. The results demonstrate that the HMMFA method achieves higher values of $\bar{T}$ than the Naïve method for given $\bar{D}$ values. The performance results demonstrate that the HMMFA method is superior and robust to false alarms as compared to the Naïve method.
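The $\bar{D}$ and $\bar{T}$ statistics above were estimated by Monte Carlo simulation; the sketch below shows one simple way such an estimate can be organized. It is only an assumed illustration: generate_obs and run_page_test (which returns the cusum stopping time, e.g., from the procedure of Subsection 2.5.2) are hypothetical helpers supplied by the caller.

import numpy as np

def average_run_length(generate_obs, run_page_test, h, n_runs, n_obs):
    """Average stopping time of the detector over Monte Carlo runs."""
    stops = []
    for _ in range(n_runs):
        obs = generate_obs(n_obs)          # one simulated observation sequence
        stop = run_page_test(obs, h)       # epoch at which the cusum crosses h (or None)
        stops.append(stop if stop is not None else n_obs)
    return float(np.mean(stops))

# Runs generated under K (HMM embedded in noise) give the average delay to detection;
# runs generated under H (noise only) give the average time between false alarms.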

2.7 Summary

In this chapter, we introduced feature-aided tracking combined with HMMs for analyzing asymmetric threats. HMMs can detect, track, and predict potential threat activities in the presence of partial and imperfect sequential data. The proposed approach can also serve as a what-if analysis tool by allowing users to modify models (i.e., states in the HMMs) and/or transaction sequences. We utilized a transaction-based probabilistic method to detect and track a pattern consistent with the development of a nuclear weapons program (DNWP). The results associated with the DNWP model were presented using plots of the cusum statistic under the null and alternative hypotheses. The simulation results demonstrate that a HMM combined with feature-aided tracking (HMMFA) is an effective method to track asymmetric threats with high accuracy. Performance analysis shows that the detection of HMMs improves with an increase in the number of states in a HMM. We have also provided a detailed performance comparison between the hidden Markov model and feature-aided tracking (HMMFA) method and the maximum likelihood-based data mining method for all the HMMs in the DNWP model. This analysis shows that the HMMFA method is superior to the Naïve method in terms of lower false alarms.

Chapter 3

Stochastic Modeling of a Terrorist Event via the ASAM System

3.1 Introduction

Terrorist groups are elusive, secretive, amorphously structured and decentralized entities that often appear unconnected. Analysis of prior terrorist attacks suggests that a high-magnitude terrorist attack requires certain enabling events to take place. For example, terrorists planned for nearly four years to execute the 9/11 attacks, and for a year to execute the bombing attacks in Bali and Madrid. During this time, terrorists were embedded in the target country and accomplished tasks such as identifying targets, reconnaissance, gathering money, recruiting team members, acquiring weapons, arranging transportation, travelling to the target place, and so on. The premise of the ASAM system is that terrorists leave detectable clues about these enabling events in the information space, which can be related, linked, and tracked over time. We denote the enabling events associated with terrorist attacks, such as financing, acquisition of weapons and explosives, travel, and communications among suspicious people, as transactions. A pattern of these transactions and its dynamic evolution over time is a potential realization of a terrorist activity. For example, a suspicious person withdraws money from a bank account, uses the money to purchase chemicals that could be used to make a deadly chemical weapon, and then buys a plane ticket destined for the United States. This sequence of events suggests a reason to be concerned; it may or may not arise from terrorist activity, but ought to be flagged for careful scrutiny. A probabilistic model of the evolution of these types of events can be achieved using a HMM, where the underlying states of the HMM represent the pattern of enabling events.

3.1.1 Organization

In this chapter, we focus on the application of the ASAM system to real-world examples. In Section 3.2, we briefly introduce the ASAM system and the analysis process. Detailed discussions about the ASAM system are available in our earlier publications (Tu et al., January 2006) and (Allanach et al., March 2004). In Section 3.3, we discuss some of the modeling aspects of terrorist events and illustrate the modeling process via two examples of hypothetical terrorist activities. Section 3.4 shows the application of multiple target tracking and attribute-aided tracking in analyzing terrorist activities. Section 3.5 explains the simulations and results. In Section 3.6, a summary of the chapter, along with a brief description of future research work, is presented.

3.2 The ASAM System

The ASAM system is an advanced counter-terrorism analysis tool designed to have the following capabilities:

1) Predicting intent and future states of terrorist activities: The ASAM system employs a novel combination of HMMs and BNs to compute the likelihood that a certain terrorist activity exists. This likelihood is an important indicator of terrorist threat.

2) Identifying threats: The ASAM system utilizes attribute-aided tracking and hidden Markov models to identify suspicious activity consistent with an a priori terrorist template model. A probabilistic matching of modeled attributes with the observed attributes provides an ability to identify the suspicious person, place, or object (item).

3) Options analysis: The ASAM system can suggest actions to prevent terrorist activities. Using optimization techniques, effective action sequences can be suggested. Therefore, the ASAM system increases the range of options and early alarms to facilitate preemption.

4) "Inverting the bath tub" and automation: The ASAM system provides efficient and effective methods for counter-terrorism analysis. Inverting the bath tub refers to upending the plot of the time an intelligence analyst spends on the functions of collecting, analyzing, and reporting information. That is, currently intelligence analysts spend the majority of their time on collecting and reporting information, when it should ideally be spent on analysis. The ASAM system is a semi-automated system, which has the ability to detect and track terrorist activity and to perform what-if analyses to enable an analyst to gain deeper insights into a potential terrorist activity.

5) Model and scenario generation: The ASAM system provides a means to develop models based on real-world events. We have developed an Indian Airlines hijacking model (Tu et al., January 2006) and an Athens 2004 Olympics threat model. Using the ASAM system, potential threat scenarios can be built and used to suggest priorities for efforts to reduce the overall threats.

The ASAM system has a hierarchical process, where the lower levels correspond to hidden Markov models (HMMs), and the higher levels are modeled via Bayesian networks (BNs; these, in turn, can be hierarchical as well). Briefly, a HMM is a stochastic model used to evaluate the probability of a sequence of events, determine the most likely state transition path, and estimate parameters which produce the best representation of the most likely path. Mathematically, a discrete HMM is described by three parameters: λ = {A, B, π}. Here, A represents the transition matrix of the underlying Markov chain, B denotes the probability of emission of a certain symbol from a particular state, and π represents the initial probability distribution of the underlying Markov states. The BN is a directed acyclic graph (DAG) that consists of nodes and links. It represents an intuitive and modular representation of knowledge through causal links among nodes. In this chapter, it is assumed that the observed data (a series of transactions) is available from an intelligence database; it represents any kind of travel, task, trust, or communication between any person, place, or item of suspicious origin. As more transactions are detected, more links representing the transactions are made in the transaction space (Allanach et al., March 2004). The idea behind using an HMM is that we can represent its underlying states as snapshots of the growing transaction space, and it is the evolution of these snapshots that provides the most valuable clue. Note that within each of the states of the HMM is a graphical representation of the terrorist network's activity. HMMs function in the transaction space on a fast time-scale, while BNs operate in the strategy space on relatively slow time-scales. Each HMM can be viewed as a detailed stochastic time-evolution of a particular node state represented in the BN. The HMMs send soft evidence to BN nodes, and the BN inference algorithms integrate the soft evidence from multiple HMMs into an overall assessment of terrorist threat. In other words, the BN represents the overarching terrorist plot, and the HMMs, which are related to each BN node, represent detailed terrorist subplots. We limit the discussion of the ASAM system in this chapter; a detailed discussion of the ASAM system, along with its architecture, is provided in (Tu et al., January 2006).

3.3 Modeling of a Terrorist Event

In order to detect terrorist activities, the ASAM system must be given a priori information about the potential terrorist activities ("template models") which are to be monitored. This a priori information is provided in the form of HMM and BN models of the terrorist activities. Examples of these models are discussed in Subsections 3.3.3 and 3.3.4.

3.3.1 Modeling Aspects

Predicting a terrorist event out of a vast amount of information is analogous to finding a needle in a haystack. While developing a model of a specific terrorist event from the available information, one key question is: how much a priori information is needed to develop a good model? In analogy with the needle-in-a-haystack problem, the question can be asked how big the magnifying lens should be in order to find the needle; the correct amount of a priori information in the model ensures a good design of the magnifying lens. Another issue which arises is the estimation of model parameters. In this case, a relevant question to ask is: how do we specify the HMM and BN parameters? One approach to obtaining the HMM parameters is to estimate them using the Baum-Welch algorithm and maximum-likelihood estimation on historical data. When historical data are not available, the parameters can be specified according to the model and the state description. For example, if the number of transactions in a state is high, then it is highly probable that the HMM stays in that state for a long time. Similarly, if the transactions related to a state are few, then the probability of remaining in that state is low.


Figure 3-1: BN model of terrorist attack threat in the Athens 2004 Olympics (nodes include: attack to cause panic in the western world; attack as a part of Islamic Jihad; strategic reasons by al-Qaeda to attack; truck bombing attack; deadly chemical attack; terrorist attack at the 2004 Olympics; Greece-based terrorist networks; anti-American sentiments)

Transition probabilities do affect the detection scheme; hence, the probabilities which best fit the scenario should be specified. The ASAM system requires that the model be generic so that it can be easily instantiated for any specific name, place, or item related to terrorist activities. In the next subsection, we present an example to analyze the vulnerabilities of the Athens 2004 Olympics.

3.3.2 BN Model of a Terrorist Event

The Athens Olympics is one of the biggest events of this year. In this event, nearly 16000 athletes from 202 countries will be participating, along with millions of visitors, volunteers, state officials, and dignitaries. Given such a mega event, any terrorist attack will be able to capture media attention across the world. The BN model of vulnerabilities at the Athens 2004 Olympics is a collection of diverse potential terrorist targets and scenarios. One of the keys to the scenario is the geographical location of Greece and its proximity to the Middle East and Europe, which could be an advantage for terrorist groups seeking to penetrate and execute an attack. Greece is also prone to attacks from home-grown terrorist groups. The ongoing conflict in Iraq, Israel, and Palestine also generates a vulnerable environment that could cause a significant threat to athletes and visitors from the USA, UK, Israel, and their allies. The construction delays in the Olympics sports complex are another problem that could leave many loopholes for terrorists to execute an attack. The BN model assimilates all the above-discussed scenarios and threats. Fig. 3-1 shows an abridged version of the BN of the terrorist attack threat in the Athens 2004 Olympics. The Bayesian node 'Strategic reasons by al-Qaeda to attack' depicts reasons such as getting attention throughout the world during the Olympics and causing panic in the western world. Another BN node depicts the threat of terrorist attack due to home-grown terrorist networks in Greece. The states of the BN nodes 'Truck bombing attack' and 'Deadly chemical cloud attack' are modeled by the underlying truck bombing HMM and deadly chemical cloud HMM, respectively.


Figure 3-2: Markov chain for the truck bombing HMM

3.3.3 Truck Bombing (HMM1)

This model presents a fictitious story in which AQ and its affiliated terrorist groups are planning a truck bombing in Athens during the 2004 Olympics. Fig. 3-2 shows the Markov chain of HMM1, which consists of 9 states; the transition probabilities are shown next to the transitions. A detailed description of the HMM states and transactions is given in Table 3-1, whose bulleted items show the transactions that characterize each state. Details of these transactions are also shown in Figs. 3-3 and 3-4.

3.3.4 Deadly Chemical Cloud (HMM2)

This example depicts a hypothetical deadly chemical cloud attack. AQ plans a chemical cloud attack in a closed place, through the ventilation system of a subway, or in an open crowded place, such as a downtown area. The attack involves the mixing of lethal chemicals, including blistering agents to cause third-degree burns, nerve gas, and choking agents. The Markov chain depicting HMM2 is similar to that of HMM1 shown in Fig. 3-2. However, the transactions defining the states of HMM2 are different from those discussed for HMM1. This is a consequence of the fact that terrorists employ different tactics in order to carry out a deadly chemical cloud attack. A detailed description of the transactions and features related to HMM2 is not given here due to space limitations. While analyzing the vulnerabilities at the Athens 2004 Olympics, we hypothesized that terrorists might plan and execute multiple attacks at the same time. In order to detect these multiple attacks, we need to adopt advanced target tracking methods. In the next section, we discuss such methods and illustrate their functionality based on the two examples discussed in this section.


Figure 3-3: HMM states (S1-S5) of the truck bombing HMM


Figure 3-4: HMM states (S6-S9) of the truck bombing HMM

Table 3-1: Transactions for the truck bombing HMM

State 1: AQ announces attack on western targets.
• Spiritual leader gives inflammatory preachings in the Middle East.
• Al-Jazeera, a Middle East-based media outlet, reports that an AQ website announces an attack on western targets.

State 2: Recruitment/training of new members.
• The ring leader in AQ recruits terrorists to carry out the truck bombing attack.
• The AQ cell recruits operators to execute the attack and drive the vehicle.

State 3: Set up AQ cell.
• The terrorists are embedded in Greece a few months or a year before the Olympics and set up the cell.
• The AQ ring leader assigns the operators, planners, and facilitators for the attack. The facilitator provides driving licenses, passports, etc. to the operators.
• AQ cell members rent two or three apartments and pay the rent in cash.

State 4: Money for operation.
• The AQ ring leader sends money to the AQ cell members via messengers.

State 5: Planning for attack.
• The terrorists reconnoiter the target location multiple times.
• The terrorist cell members communicate with the ring leader.

State 6: Gather resources.
• Terrorists purchase or steal chemicals, blasting caps, and fuses for explosives in Turkey and transfer them via trucks to Greece.
• Terrorists purchase or steal respirators and chemical mixing devices in Greece.
• Terrorists purchase electronic parts, such as satellite cellular phones, from illegal sources.
• Terrorists rent a truck.

State 7: Target reconnaissance.
• Suspicious persons (bomb-building experts, persons on watch lists) reconnoiter the potential targets.
• Terrorists perform dry runs of routes to identify speed traps, road hazards, etc.

State 8: Weapons installed.
• Terrorists modify the truck to handle heavy loads and to neutralize any security arrangements at the target.

State 9: Attack.
• The terrorists drive the truck into the target and detonate the bomb.

3.4 Advanced Methods for Tracking Terrorist Activities

One of the key capabilities of the ASAM system is its ability to continually track many instantiations of terrorist activity in a cluttered environment.

3.4 Advanced Methods for Tracking Terrorist Activities One of the key capabilities of the ASAM system is its ability to continually track many instantiations of terrorist activity in a cluttered environment. While the detection and tracking of a single terrorist activity using an HMM involves the forward or forward-backward algorithm, the competition amongst HMMs for the observations (i.e., the association of transaction observations to the HMMs

76 whence they come) suggests that inference becomes essentially a multiple-target tracking (MTT) problem (Reid, 1979). Naturally, traditional methods of tracking such as the MHT and the Joint Probabilistic Data Association Filter (JPDAF) are not directly applicable to tracking terrorist activities due both to the models and to the nature of the observations. In this case, the observations appear to be superimposed: for example, the observations associated with HM M1 overlap the observations associated with HM M2 . Superposition of observations related to both HMMs can be linear, as in power, or nonlinear, as in the case of an OR combination of the observations. A technique developed in (Chen and Willett, December 2000) provides a solution to this problem by making a new target tracker that can infer the state sequence of multiple HMMs with overlapped observations.

3.4.1 Multiple Target Tracking
As discussed in Section 2, the model for a particular terrorist network can be represented as an HMM. Suppose we want to detect the presence of either of the HMMs discussed in Subsections 3.3 and 3.4. The problem is complicated because it requires checking the existence of both HMMs. While we can assume that the HMMs describing these two terrorist activities are conditionally independent, we must however consider that their observation processes are strongly dependent. In order to compute the likelihood of multiple HMMs, we invoke a target

tracking algorithm that assumes the HMM state sequences to be conditionally independent and their likelihoods to be conditionally dependent. After evaluating the likelihood of each HMM (or combination of HMMs) given the observations, we can then determine the validity of a hypothesis using a sequential probability ratio test (SPRT) to update its track score. In this Page-like test (Page, 1954), the track score of each hypothesis is compared to a threshold; if it rises above the threshold, the hypothesis is confirmed and a new hypothesis is formed. The following section highlights the logic behind hypothesis maintenance and formation.

Figure 3-5: Multiple hypothesis tracking for two HMMs (hypothesis pairs: H: Null vs. K: HMM1 OR HMM2; H: HMM1 vs. K: HMM1 AND HMM2; H: HMM2 vs. K: HMM2 AND HMM1)

3.4.2 Multiple Hypothesis Tracking
When there is data association uncertainty (i.e., the observations are not labeled, and it is not known from which source, if any, a given transaction emanates), correct statistical inference requires the evaluation of all possibilities. An MHT (in the kinematic target context) is a type of target tracking system that

forms alternative data association hypotheses every time an observation-to-track conflict arises; a special case of this, also known as Reid's algorithm, is presented in (Reid, 1979). After a new observation is made, a new set of hypotheses is created and then propagated to the next scan. It is important to properly form and maintain track hypotheses, since their number can increase exponentially with each additional observation. In this chapter, we present an algorithm similar to Reid's, but from a track-oriented approach, and we naturally adapt it from tracking targets to tracking transaction patterns. For example, consider only two HMMs that describe the activities associated with HMM1 and HMM2. As shown in the hypothesis tree in Fig. 3-5, the MHT begins by assuming independence: H represents the null hypothesis and K the active hypothesis of a conventional detection problem, and the combinations of the HMMs via OR and AND operations represent the hypotheses generated by the MHT. If our detection algorithm receives a few transactions that strongly imply that terrorists are purchasing explosives to make a bomb, then HMM1 will be confirmed (statistically) and our new hypothesis test becomes: HMM1 AND HMM2 are active versus HMM1 only active. This implementation of an MHT is not susceptible to exponential complexity because the number of hypotheses is limited by the number of HMMs that need to be tracked, and hypothesis generation is based on a logical combination of previous knowledge.
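As a toy illustration of this hypothesis-generation logic (a sketch only; the names are hypothetical and this is not the ASAM implementation), the next hypothesis test can be formed from the set of already-confirmed HMMs:

def next_hypotheses(all_hmms, confirmed):
    """Generate the next MHT hypothesis pairs: H (confirmed HMMs only)
    versus K (confirmed HMMs AND one additional candidate). The number of
    hypotheses stays bounded by the number of modeled HMMs."""
    candidates = [h for h in all_hmms if h not in confirmed]
    tests = []
    for extra in candidates:
        H = set(confirmed)            # null: only the already-confirmed activity
        K = set(confirmed) | {extra}  # alternative: one more HMM is also active
        tests.append((H, K))
    return tests

# example: HMM1 has just been confirmed, HMM2 has not
print(next_hypotheses(["HMM1", "HMM2"], confirmed=["HMM1"]))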

3.4.3 Attribute-Aided Tracking
Attribute-aided tracking is the process of collecting data about the features of a target from one or more sources to enhance the knowledge about the dynamics and class of the target. HMMs describe the dynamics of a terrorist network by including a priori information that describes the people involved, the temporal characteristics of the transactions, the event place, and other characteristics. These attributes are directly embedded within the underlying states of the HMM, and can be used to distinguish the targets of interest from ambient background noise. For example, suppose that we are tracking HMM1 as described in Table 4-1. The third state of HMM1 contains a transaction related to the terrorists' arrival in Greece. In order to distinguish the terrorists from the millions of visitors arriving in Greece, we must consider their attributes. If these men turn out to be around the ages of 50-60 and are citizens of a friendly country, then they are certainly less likely to be a threat than men around the ages of 30-40 from nations with ties to terrorism. It is for this reason that attribute fusion plays a pivotal role in the ASAM system. There is a great deal of literature on this type of tracking, and we do not attempt to cover it all. The main purpose of attribute-aided tracking is to refine our knowledge about a group or groups of terrorist cells.


Figure 3-6: Detection of modeled HMM1 at t = 25

3.5 Simulations and Results
The ASAM system provides both real-time and what-if-scenario analysis from the HMM and BN software. It provides the following types of results to an intelligence analyst via the ASAM website (ASAM, 2003):
1) Likelihood of observations: The likelihood of the observations is a quantitative measure of the confidence of the match between the observed events and the template models. The HMM determines whether the monitored activity exists. If the activity is consistent with the models derived in the first step, then it is detected and the related soft evidence is reported back to the BNs for further analysis.

2) Evidence from the observations: The evidence is a description of actors, transaction type, transaction description, transaction time, etc.
3) Probability of a terrorist attack: The BN software uses the soft evidence from the HMMs to produce a belief about the global terrorist threat level. Note that the HMM software detects the local activity and measures local threat levels, whereas the BN inference is a culmination of all reported activities.
In this chapter, we discuss only the first type of result, i.e., the likelihood of observations, using the examples discussed in Subsections 3.2 and 3.3. The second and third types of results for this specific example are still in progress. However, all types of results can be analyzed using the Indian Airlines hijacking example via the ASAM website (ASAM, 2003). The likelihood of the observations is shown in Figs. 3-6 and 3-7 in the form of a CuSum statistic. The starting point of each HMM detection curve is associated with the first time this HMM is detected; thus, we believe with certain probability that the modeled terrorist activity is in progress. A peak probability usually results when this pattern evolves into the absorbing state of the HMM, and we obtain the maximum number of signal transactions for this HMM. The vertical lines indicate the start and end of HMM1 and HMM2, and the dashed horizontal line is the arbitrarily chosen threshold for the test statistic.


Figure 3-7: Detection of HMM1 AND HMM2 in the presence of HMM1

For the simulations, the data is a combination of the underlying hidden states of HMM1 and HMM2 embedded in background noise from a benign source. The ground truth is:
• HMM1 is active starting from t = 1 and ending at t = 150.
• HMM2 is active starting from t = 50 and ending at t = 92.
• HMM1 and HMM2 are both active starting from t = 50 and ending at t = 92.
• The probability of false alarm and the probability of missed detection are both 20%.
As discussed in Subsection 3.4.2, we begin the MHT by considering the two HMMs to be independent. Fig. 3-6 shows the test statistic, which determines that HMM1 becomes active at t = 25; the tracker then immediately begins testing for the

existence of both HMM1 AND HMM2 given that HMM1 is already active, as shown in Fig. 3-7. Figs. 3-6 and 3-7 reflect the ground truth because their test statistics rapidly increase during the times when HMM1 and HMM2 are known to be active.

3.6 Summary
In this chapter, we introduced the ASAM system as an advanced information technology tool for counter-terrorism analysis. The ASAM system can detect, track, and predict potential terrorist activities in real time. It can also work as a what-if analysis tool by allowing users to modify models (i.e., states in the HMMs, conditional probabilities in BNs) and/or transaction sequences. We utilized the ASAM system to analyze the vulnerabilities at the Athens 2004 Olympics. We developed two HMMs to depict a hypothetical truck bombing attack and a deadly chemical cloud attack. We discussed multiple target tracking, multiple hypothesis tracking, and attribute-aided tracking and illustrated their applications using examples.

Chapter 4

Dynamic Multiple Fault Diagnosis: Mathematical Formulations and Solution Techniques

4.1 Introduction
Online vehicle health monitoring and fault diagnosis is essential to improve vehicle availability via condition-based and opportunistic maintenance, and to reduce maintenance and operational costs by seamlessly integrating on-board and off-line diagnosis, thereby reducing troubleshooting time. During on-line (dynamic) fault diagnosis, the test outcomes are obtained over time, as compared to static fault diagnosis, where the observed test outcomes are available as a block. On-line vehicle health monitoring relies heavily on extensive processing of data in real time, which is made possible by smart on-board sensors. Using these intelligent sensors, the system parameters that are essential to vehicle fault diagnosis can be transmitted to an on-board diagnostic inference engine.


A significant technical challenge in on-board vehicle health monitoring is the quality of tests. Generally, the tests are imperfect due to unreliable sensors, electromagnetic interference, environmental conditions, or aliasing inherent in the signature analysis of on-board tests. The imperfect tests introduce additional elements of uncertainty into the diagnostic process: the pass outcome of a test does not guarantee the integrity of components under test, because the test may have missed a fault; on the other hand, a fail outcome of a test does not mean that one or more of the implicated components are faulty, because the test outcome may have been a false alarm. Hence, it is desirable that an on-board diagnostic algorithm be able to accommodate missed detections and false alarms in test outcomes. The performance of on-board diagnosis can be improved by incorporating knowledge of the reliabilities of tests and the temporal correlations of test outcomes. The hidden Markov model (HMM) is a natural choice here to represent the individual component states of the system. The HMM is a doubly-embedded stochastic process with an underlying unobservable (hidden) stochastic process (individual component state evolution), which can be observed through another set of stochastic processes (i.e., uncertain test outcome sequences). The individual component state HMMs are coupled through the observation process. Consequently, the fault diagnosis problem corresponds to a factorial HMM, where each HMM characterizes the individual component states of the system. The sequences of uncertain test outcomes are probabilistic functions of the underlying

Markov chains characterizing the evolution of system states. Here, we investigate the problem of determining the most likely states of components, given a set of partial and unreliable test outcomes over time.

4.1.1 Previous Work
The multiple fault diagnosis (MFD) problem originates in several fields, such as medical diagnosis (Yu et al., 2007), error correcting codes, speech recognition, and distributed computer systems and networks (Odintsova et al., 2005). The MFD problem in large-scale systems with unreliable tests was first considered by Shakeri et al. in (Shakeri et al., 1998). They proposed near-optimal algorithms using Lagrangian relaxation and subgradient optimization methods for the static MFD problem. In the area of distributed system management, the MFD problem was studied by Odintsova et al. in (Odintsova et al., 2005). They utilized an adaptive diagnostic technique, termed active probing, for fault diagnosis and isolation. A probe can be viewed as a test in our terminology; the purpose of a probe is to check the set of system components on the probed path. The probe outcomes determine whether one or more of the components on the probed path are faulty or normal. Given the probe outcomes, a diagnostic matrix (D-matrix, diagnostic dictionary, reachability matrix) defining the relationship among the probes and component faults, as well as the initial system state, they developed a sequential multi-fault algorithm to diagnose the system state. They considered the probe outcomes as being deterministic, which is analogous to the assumptions made in our Problem

4, and in the work described in (Tu et al., 2003; Raghavan et al., 1999; Pattipati and Alexandridis, 1990). In (Le and Hadjicostis, December 2006), Le et al. applied graphical model-based decoding algorithms to the MFD problem in the presence of unreliable tests. They proposed a suboptimal belief propagation algorithm of the kind used to decode low-density parity check codes. They considered a fault model, where tests are asymmetric, i.e., the D-matrix is not binary and the test outcomes are also unreliable, and they termed it the Y model. Their formulation parallels our Problem formulation 1; however, they considered only the static case. The dynamic single-fault diagnosis problem using the HMM formalism was first proposed by Ying et al. (Ying et al., November 2000), where it is assumed that, at any time, the system has at most one faulty component state present. This modeling is somewhat unrealistic for most real-world systems. Another version of the dynamic fault diagnosis problem was studied in (Erdinc et al., April 2003): unknown probabilities of sensor error, incompletely-populated sensor observations, and multiple faults were allowed, but the faults could only occur or clear once per sampling interval. In the dynamic single-fault framework (Ying et al., November 2000), a hidden Markov modeling framework was adopted, and a moving-window Viterbi algorithm was used to infer the evolution of component states. In the multiple fault case, the state space of the hidden Markov model increases exponentially

from (m + 1) to 2^m, where m is the number of possible component states. Consequently, the HMM-based method would be viable only for small-sized systems. The solution method proposed in (Erdinc et al., April 2003) is a multiple hypothesis tracking approach, where, at each observation epoch, the k best component state configurations are stored. In that paper, the missed-detection/false-alarm process was a property of the sensor rather than the fault, with the effect that the underlying inference process could not be decoupled into a factorial HMM. In (Erdinc et al., April 2003), at each epoch, all candidate fault sets, derived from the previously identified faults, are listed, based on the assumption of at most one change per epoch. Then each of the k(m + 1) possible candidate sets has its score calculated, the candidate set with the highest score is selected as the inference result at that epoch, and the candidates with the k best scores are retained. The method is equivalent to enumeration in a limited search space; consequently, it is either computationally expensive or far from optimal. A major contribution of this chapter is that the missed-detection/false-alarm process is modeled as being a property of the component state: the model is perhaps less realistic, but the computational benefit of a factorial HMM is large. Another approach, developed by Ruan et al. (Ruan et al., 2006a), decomposes the original DMFD problem into a series of decoupled subproblems, one for each epoch. For a single epoch MFD, they developed a deterministic simulated annealing (DSA) method, which is inspired by its sibling, stochastic simulated annealing, and the approximate belief revision (ABR) heuristic algorithm (Yu et al.,

2007). The single epoch MFD was extended to incorporate component states of multiple consecutive epochs. In addition, they applied a local search and update scheme to further smooth the "noisy" diagnoses stemming from imperfect test results and, consequently, increase the accuracy of fault diagnosis.

Figure 4-1: DMFD problem viewed as a factorial hidden Markov model (FHMM). Each component i (i = 1, ..., m) is a hidden two-state HMM with state xi(k); the component HMMs are coupled through the passed test outcomes Op(k) and failed test outcomes Of(k) observed at each time epoch k.

The DMFD problem can be viewed as a factorial HMM (FHMM), which is discussed in the machine learning literature (Ghahramani and Jordan, 1996). Here, the HMM state is factored into multiple state variables and represented in a distributed manner. The authors discussed an exact algorithm for inference computations in the FHMM framework. In this framework, inference and learning involve computing the posterior probabilities of multiple hidden layers (or states),

given the test outcomes. However, due to the combinatorial nature of the hidden state representation, the exact algorithm is intractable. They presented approximate inference algorithms based on Gibbs sampling and variational methods. The latter methods are similar to Lagrangian relaxation, although motivated from a Fenchel duality perspective (Yu et al., 2007; Shakeri et al., 1998; Tu et al., 2003; Bertsekas, 2003). Here, we extend the work of Ruan et al. (Ruan et al., 2006a), Shakeri et al. (Shakeri et al., 1998) and Tu et al. (Tu et al., 2003) on MFD to solve the DMFD problem by combining the Viterbi algorithm and Lagrangian relaxation in an iterative way. Depending on the probabilistic assumptions on fault-test relationships and test outcomes, one obtains various DMFD formulations. In summary, the contributions of this chapter are: (1) a primal-dual optimization framework to solve the DMFD problem; (2) four formulations of the DMFD problem along with their solutions; (3) simulation results on several real-world systems for the first and most general formulation of the DMFD problem; (4) a comparison of the results between the subgradient and the deterministic simulated annealing methods (Ruan et al., 2006a); and (5) simulation results, along with performance analysis, of the on-line DMFD problem using a sliding window method.

4.1.2 Organization
The chapter is organized as follows. We formulate the NP-hard dynamic multiple fault diagnosis (DMFD) problem with imperfect test outcomes in Section

4.2. Four formulations of the DMFD problem are also discussed in Section 4.2. In Section 4.3, we decompose DMFD problem formulation 1 using a Lagrangian relaxation algorithm. The DMFD problem is decoupled into a set of parallel subproblems (involving dynamic single-HMM state estimation problems) using Lagrange multipliers. A dynamic programming technique (the Viterbi algorithm) is used to solve each of the subproblems, and their solutions are used to update the Lagrange multipliers via the subgradient method. Feasible (primal) solutions are constructed from the dual solutions. In Sections 4.4 to 4.6, we discuss the details of DMFD problems 2, 3 and 4, respectively. The on-line DMFD problem is solved using a sliding window method, which is presented in Section 4.7. The details of the Viterbi algorithm and the subgradient optimization method are provided in Section 4.8. Simulations of DMFD problem 1 are performed on several real-world datasets to validate our approach; Section 4.9 discusses the simulation results of both the block and on-line DMFD problems. Finally, the chapter concludes with a summary and future research directions in Section 4.10.

4.2 DMFD Problem Formulations
The dynamic multiple fault diagnosis problem consists of a set of possible component states in a system, and a set of binary test outcomes that are observed at each sample (observation, decision) epoch. Component states are assumed to be independent. Each test outcome provides information on a subset of the component states. At each sample epoch, a subset of test outcomes is available. Tests

are imperfect in the sense that the outcomes of some of the tests could be missing, and tests have missed-detection/false-alarm processes associated with them. The observations consist of imperfect binary test outcomes, and are characterized by the sets of passed test outcomes, Op, and failed test outcomes, Of. Formally, we represent the DMFD problem as DM = {S, κ, T, O, D, P, Π}, where S = {s1, ..., sm} is a finite set of m components (failure sources) associated with the system. The state of component si is denoted by xi(k) at epoch k, where xi(k) = 1 if failure source si is present, and xi(k) = 0 otherwise. Here, κ = {0, 1, ..., k, ..., K} is the set of discretized observation epochs. The status of all component states at epoch k is denoted by x(k) = {x1(k), x2(k), ..., xm(k)}. We assume that the initial state x(0) is known (or its probability distribution is known).

Figure 4-2: Tri-partite digraph for the DMFD problem at epoch k: hidden component states x1(k), ..., xm(k), hidden tests t1(k), ..., tn(k), and observed test outcomes o1(k), ..., on(k).

Our problem is to determine the time evolution of component states based on imperfect test outcomes observed over time. Fig. 4-1 shows the DMFD problem viewed as an FHMM. The hidden component state of the ith HMM at time epoch k is denoted by xi(k). Each component state

xi(k) is modeled as a two-state HMM. The observations at each epoch are subsets of binary outcomes of tests O = {o1, o2, ..., on}, i.e., oj ∈ {pass, fail} = {0, 1} (extension to multi-valued component states and test outcomes is straightforward). Fig. 4-2 shows the DMFD problem as a tri-partite digraph at epoch k. Component states, tests and test outcomes represent the nodes of the digraph. Here, the true states of the components and tests are hidden. P = {Pd, Pf} represents a set of probabilities of detection and false alarm, which is defined differently for each of the DMFD problem formulations. We also define the matrix D = [dij] as the dependency matrix (D-matrix), which represents the full-order dependency among failure sources and tests. Each component state is modeled as a two-state non-homogeneous Markov chain.

For each component state, e.g., for component si at epoch k, Π = (Pai(k), Pvi(k)) denotes the set of fault appearance probability Pai(k) and fault disappearance probability Pvi(k), defined as Pai(k) = Pr(xi(k) = 1 | xi(k-1) = 0) and Pvi(k) = Pr(xi(k) = 0 | xi(k-1) = 1). These probabilities are required to model intermittent faults. Here, T = {t1, t2, ..., tn} is a finite set of n available binary tests with which the integrity of the system can be ascertained. We denote the set of passed tests by Tp and the set of failed tests by Tf. At each observation epoch k, k ∈ κ, test outcomes up to and including epoch k are available, i.e., we let O^k = {O(b) = (Op(b), Of(b))}, b = 1, ..., k, where O^k is the set of observed test outcomes up to epoch k, with Op(b) ⊆ O and Of(b) ⊆ O as the corresponding sets of passed and failed test outcomes at epoch b, respectively. The tests are partially observed in the sense that the outcomes of some tests may not be available,

i.e., (Op(b) ∪ Of(b)) ⊂ O. In addition, tests exhibit missed detections and false alarms. Here, we also make the noisy-OR ("causal independence") assumption (Pearl, 1988).

Figure 4-3: Bi-partite graph for the DMFD problem: hidden component states x1(k), ..., xm(k) connected directly to the observed test outcomes o1(k), ..., on(k).

The DMFD problem can be formulated in the following ways, arranged from the most general to the most simplified:

Problem 1: When the probability of detection (Pdij) and false alarm probability (Pfij) are associated with each failed test and each failure source, i.e., Pdij = Pr(oj(k) = 1 | xi(k) = 1) and Pfij = Pr(oj(k) = 1 | xi(k) = 0) for a failure source si and test tj. For notational convenience, when si does not affect the outcome of test tj, we let the corresponding Pdij = Pfij = 0. This problem scenario frequently arises in medical fault diagnosis. For example, the QMR-DT (Quick Medical Reference, Decision-Theoretic) database used in the domain of internal medicine contains approximately 600 disease nodes (faults or failure sources) and 4000 symptoms (tests) (Yu et al., 2007). Each of the symptoms could have a probability pair (Pdij, Pfij) associated with the symptom and the disease node. Fig. 4-3 shows the bi-partite graph, where the edges represent the probability pair (Pdij, Pfij). These probabilities

can be obtained from the tri-partite digraph (Fig. 4-2) using the total probability theorem as follows:

\Pr(o_j(k) \mid x_i(k)) = \sum_{t_j(k) \in \{0,1\}} \Pr(o_j(k), t_j(k) \mid x_i(k)) = \sum_{t_j(k) \in \{0,1\}} \Pr(o_j(k) \mid t_j(k)) \Pr(t_j(k) \mid x_i(k))    (4.1)

Problem 2: In situations where the probability of detection (Pdij) is associated with each failure source-test pair, but the false alarm probability is specified only for the normal system state, i.e., Pfj = Pr(oj(k) = 1 | x1(k) = 0, ..., xm(k) = 0), we obtain a slightly more complicated variation of Problem formulation 1 (in terms of computational complexity, but not in terms of parameterization). This type of scenario arises when we design class-specific classifiers that distinguish between normal system operation and failure source si only, or when the false alarms are defined on an overall system basis. Here, the probability pair (Pdij, Pfj) is associated with test outcomes to model imperfect test outcomes (Shakeri et al., 1998). This model is also called the Z model in (Le and Hadjicostis, December 2006). Similar to Problem 1, the probability pair (Pdij, Pfj) is shown as edges between the hidden component states and test outcomes in Fig. 4-3, and these probabilities can be obtained from the tri-partite digraph (Fig. 4-2) using the total probability theorem on the nodes of the test layer.
Problem 3: When the detection probability (Pdj) and false alarm probability (Pfj) are associated with each test tj only. The probability pair (Pdj, Pfj) is shown as the edges between the tests and test outcomes in the tri-partite digraph (Fig. 4-2). This formulation is quite useful in classifier fusion using error correcting

codes. In the error correcting code (ECC) matrix, each column corresponds to a binary classifier with an associated (Pdj, Pfj) pair, which is learned during training and validation. In this case, the fault-test relationships are deterministic, but the test outcomes are unreliable and depend on the concomitant test only. This type of formulation is also considered in (Erdinc et al., April 2003). This formulation provides a nice vehicle for the dynamic fusion of classifiers, where each column of the ECC matrix is a classifier, and the associated probability pairs (Pdj, Pfj) are the uncertainties associated with the classifier outcomes. When the learned parameters and the ECC matrix are fed as inputs to the DMFD algorithm, it performs dynamic fusion of the classifier outputs over time. Note that the sampling interval of the dynamic fusion algorithm can be different from the sampling interval of the raw sensor data.
Problem 4: This is the deterministic case when tests are perfect, i.e., Pdij = 1 and Pfij = 0 (Tu et al., 2003). This formulation reduces the tri-partite digraph in Fig. 4-2 to a bi-partite graph between the components and tests. This scenario is useful in situations where the tests are highly reliable (e.g., automated testing of electronic cards), and leads to a novel dynamic set covering problem. Next, we discuss the DMFD formulations in detail.

Figure 4-4: Detection and false alarm probabilities for problem 1 (the component state xi(k) ∈ {0, 1} is connected to the test outcome oj(k) ∈ {0, 1} through Pdij, 1 - Pdij, Pfij, and 1 - Pfij)

4.3 DMFD Problem 1
In this problem, we assume that the detection and false alarm probabilities (Pdij, Pfij) are associated with each failure source and each test. Fig. 4-4 illustrates these probabilities. The DMFD problem is one of finding, at each decision epoch k, the most likely fault state candidates x(k) ∈ {0, 1}^m, i.e., the fault state evolution over time, X^K = {x(1), ..., x(K)}, that best explains the observed test outcome sequence O^K. We formulate this as one of finding the maximum a posteriori (MAP) configuration:

\hat{X}^K = \arg\max_{X^K} \Pr(X^K \mid O^K, x(0)).    (4.2)

Applying the Bayes rule in (4.2), the objective function is equivalent to

\hat{X}^K = \arg\max_{X^K} \Pr(O^K \mid X^K, x(0)) \Pr(X^K \mid x(0)).

With passed and failed test outcomes being conditionally independent given the status of component states ("the noisy-OR assumption"), and the Markov property of component state evolution, the problem is equivalent to:

\hat{X}^K = \arg\max_{X^K} \prod_{k=1}^{K} \{\Pr(O_p(k) \mid x(k)) \cdot \Pr(O_f(k) \mid x(k)) \cdot \Pr(x(k) \mid x(k-1))\},    (4.3)

where Op(k) ⊆ O and Of(k) ⊆ O denote the sets of passed and failed test outcomes at epoch k, respectively. We define a new function fk(x(k), x(k-1)) as

f_k(x(k), x(k-1)) = \ln\{\Pr(O_p(k) \mid x(k)) \cdot \Pr(O_f(k) \mid x(k)) \cdot \Pr(x(k) \mid x(k-1))\}.    (4.4)

Given the component state status x(k), the test outcomes are independent. Consequently,

\Pr(O_p(k) \mid x(k)) = \prod_{o_j(k) \in O_p(k)} \Pr(o_j(k) = 0 \mid x(k)),    (4.5)

and

\Pr(O_f(k) \mid x(k)) = \prod_{o_j(k) \in O_f(k)} \Pr(o_j(k) = 1 \mid x(k)).    (4.6)

For test tj to pass at epoch k, it shall pass on all its associated component states, so that

\Pr(o_j(k) = 0 \mid x(k)) = \prod_{i=1}^{m} \Pr(o_j(k) = 0 \mid x_i(k)),    (4.7)

where

\Pr(o_j(k) = 0 \mid x_i(k)) = \begin{cases} 1 - Pf_{ij}, & x_i(k) = 0 \\ 1 - Pd_{ij}, & x_i(k) = 1 \end{cases} = (1 - Pd_{ij})^{x_i(k)} (1 - Pf_{ij})^{1 - x_i(k)}, \quad x_i(k) \in \{0,1\}.    (4.8)

Evidently,

\Pr(o_j(k) = 1 \mid x(k)) = 1 - \Pr(o_j(k) = 0 \mid x(k)).    (4.9)
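To illustrate (4.7)-(4.9), a minimal Python sketch is given below (illustrative only; the thesis implementation was in MATLAB). Here Pd and Pf are assumed to be m x n arrays holding the pair probabilities (Pdij, Pfij), with zeros for unrelated pairs as noted above.

def prob_test_passes(x, Pd, Pf, j):
    """Probability that test j passes given the component state vector x.
    Under the noisy-OR model, the test passes only if it 'passes' on every
    associated component state (Eqs. 4.7-4.8)."""
    p = 1.0
    for i, xi in enumerate(x):
        p *= (1.0 - Pd[i][j]) if xi == 1 else (1.0 - Pf[i][j])
    return p

def prob_test_fails(x, Pd, Pf, j):
    # Eq. (4.9): complement of the pass probability
    return 1.0 - prob_test_passes(x, Pd, Pf, j)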

In the same vein, the assumption of independent evolution of component states leads to

\Pr(x(k) \mid x(k-1)) = \prod_{i=1}^{m} \Pr(x_i(k) \mid x_i(k-1)),    (4.10)

where

\Pr(x_i(k) \mid x_i(k-1)) = \begin{cases} 1 - Pa_i(k), & x_i(k-1) = 0,\ x_i(k) = 0 \\ Pa_i(k), & x_i(k-1) = 0,\ x_i(k) = 1 \\ Pv_i(k), & x_i(k-1) = 1,\ x_i(k) = 0 \\ 1 - Pv_i(k), & x_i(k-1) = 1,\ x_i(k) = 1 \end{cases}

Equivalently,

\Pr(x_i(k) \mid x_i(k-1)) = (1 - Pa_i(k))^{(1-x_i(k-1))(1-x_i(k))} \cdot Pa_i(k)^{(1-x_i(k-1))x_i(k)} \cdot Pv_i(k)^{x_i(k-1)(1-x_i(k))} \cdot (1 - Pv_i(k))^{x_i(k-1)x_i(k)}; \quad x_i(k-1), x_i(k) \in \{0,1\}.    (4.11)
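A corresponding Python sketch of the per-component transition probability in (4.11) follows (illustrative only; Pa and Pv denote the epoch's appearance and disappearance probabilities for one component):

def prob_transition(x_prev, x_curr, Pa, Pv):
    """Two-state Markov transition probability of Eq. (4.11)."""
    if x_prev == 0:
        return Pa if x_curr == 1 else 1.0 - Pa
    return 1.0 - Pv if x_curr == 1 else Pv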

So, the problem that is equivalent to (4.3) is as follows:

\hat{X}^K = \arg\max_{X^K} \sum_{k=1}^{K} f_k(x(k), x(k-1)),    (4.12)

where

f_k(x(k), x(k-1)) = \sum_{o_j \in O_p(k)} \sum_{i=1}^{m} [x_i(k)\ln(1 - Pd_{ij}) + (1 - x_i(k))\ln(1 - Pf_{ij})] + \sum_{o_j \in O_f(k)} \ln\Big[1 - \prod_{i=1}^{m} (1 - Pd_{ij})^{x_i(k)} (1 - Pf_{ij})^{(1 - x_i(k))}\Big] + \sum_{i=1}^{m} \{(1 - x_i(k-1))(1 - x_i(k))\ln(1 - Pa_i(k)) + (1 - x_i(k-1))x_i(k)\ln(Pa_i(k)) + x_i(k-1)(1 - x_i(k))\ln(Pv_i(k)) + x_i(k-1)x_i(k)\ln(1 - Pv_i(k))\}; \quad x(k), x(k-1) \in \{0,1\}^m.    (4.13)

The primal DMFD problem posed in (4.12) and (4.13) is NP-hard. Indeed, even the single epoch problem, i.e., \hat{x}(k) = \arg\max_{x(k)} f_k(x(k), \hat{x}(k-1)), is NP-hard (Shakeri et al., 1998), which, for all practical purposes, means that it cannot be solved to optimality within a polynomially bounded computation time.

4.3.1 Primal-Dual Optimization Framework
The NP-hard nature of the primal DMFD problem motivates us to decompose it into a primal-dual problem using a Lagrangian relaxation approach. By defining new variables and constraints, the DMFD problem reduces to a combinatorial optimization problem with a set of equality constraints. The constraints

are relaxed via Lagrange multipliers. The relaxation procedure generates an upper bound for the objective function. The procedure of minimizing the upper bound via a subgradient optimization produces a sequence of dual feasible and the concomitant primal feasible solutions to the DMFD problem. If the objective function value for the best feasible solution and the upper bound are the same, the feasible solution is the optimal solution. Otherwise, the difference between the upper bound and the feasible solution, termed the approximate duality gap, provides a measure of suboptimality of the DMFD solution; this is a key advantage of our approach. Another advantage of the primal-dual method is that, although the primal DMFD problem is not concave, the dual DMFD problem is a piecewise convex function, which can be optimized via the subgradient method. In order to write the primal DMFD problem, we define new variables Y^K = {y(1), y(2), ..., y(K)} and y(k) = {yj(k), for all oj ∈ Of(k)} such that

\ln y_j(k) = \sum_{i=1}^{m} c_{ij} x_i(k) + \eta_j, \quad \forall o_j \in O_f(k),    (4.14)

where

c_{ij} = \ln\left(\frac{1 - Pd_{ij}}{1 - Pf_{ij}}\right), \quad \eta_j = \sum_{i=1}^{m} \ln(1 - Pf_{ij}).    (4.15)

After simple algebraic manipulations of (4.13) and using (4.12) and (4.14), the primal problem can be written as

\max_{X^K, Y^K} J(X, Y) = \max_{X^K, Y^K} \sum_{k=1}^{K} f_k(x(k), x(k-1), y(k)),    (4.16)

where the component state sequence is X^K = {x(1), x(2), ..., x(K)}. Here, the primal objective function at epoch k, i.e., fk(x(k), x(k-1), y(k)), is defined as

f_k(x(k), x(k-1), y(k)) = \sum_{o_j \in O_p(k)} \sum_{i=1}^{m} c_{ij} x_i(k) + \sum_{o_j \in O_f(k)} \ln(1 - y_j(k)) + \sum_{i=1}^{m} \mu_i(k) x_i(k) + \sum_{i=1}^{m} \sigma_i(k) x_i(k-1) + \gamma(k) + g(k) + \sum_{i=1}^{m} h_i(k) x_i(k) x_i(k-1),    (4.17)

where

\gamma(k) = \sum_{o_j \in O_p(k)} \eta_j, \quad \mu_i(k) = \ln\left(\frac{Pa_i(k)}{1 - Pa_i(k)}\right), \quad \sigma_i(k) = \ln\left(\frac{Pv_i(k)}{1 - Pa_i(k)}\right), \quad h_i(k) = \ln\left(\frac{(1 - Pa_i(k))(1 - Pv_i(k))}{Pa_i(k) Pv_i(k)}\right), \quad g(k) = \sum_{i=1}^{m} \ln(1 - Pa_i(k)).

Note that the multiple HMMs are coupled here because their states are observed only via a set of test outcomes. In equation (4.17), the terms involving yj (k) and hi (k) show the coupling effects. Appending

constraints

(4.14)

to

(4.16)

via

Lagrange multipliers {λj (k)}oj ∈Of (k) , the Lagrangian function L(X, Y, Λ) can be written as L(X, Y, Λ) =

K P

fk (x(k), x(k − 1), y(k))

k=1

+

P ∀oj ∈Of (k)

λj (k)(ln yj (k) −

m P i=1

cij xi (k) − η j ),

(4.18)

where Λ = {λj (k) ≥ 0, k ∈ (1, K), oj ∈ Of (k)} is the set of Lagrange multipliers. In (4.18), Lagrange multipliers {λj (k)} are nonnegative despite equality

103

Original DMFD problem

Two-level Lagrangian relaxation approach

Update Lagrange multipliers using subgradient method

... Solve subproblem 1 using binary Viterbi algorithm

...

Solve subproblem m using binary Viterbi algorithm

Figure 4-5: Decomposition of the original DMFD problem constraints (4.14), because the yj (k) need to be nonnegative. Using the Lagrange multiplier theorem, we optimize the Lagrangian function in (4.17) w.r.t. yj (k) to obtain optimal yj∗ (k) as yj∗ (k) =

λj (k) . 1 + λj (k)

(4.19)

The dual of the primal DMFD problem posed in (4.16)-(4.17) can be written as

\min_{\Lambda} Q(\Lambda)    (4.20)

subject to Λ = {λj(k) ≥ 0, k ∈ (1, K), oj ∈ Of(k)}, where the dual function Q(Λ) is defined by

Q(\Lambda) = \max_{X^K, Y^K} L(X, Y, \Lambda).    (4.21)

Substituting (4.19) into (4.20) and simplifying further by rearranging and combining the terms, we obtain the dual function as

Q(\Lambda) = \max_{X^K} \sum_{i=1}^{m} Q_i(\Lambda).    (4.22)

Here,

Q_i(\Lambda) = \sum_{k=1}^{K} \Big[ \xi_i(x_i(k), x_i(k-1), \lambda_j(k)) + \frac{1}{m} w_k(\Lambda) \Big],    (4.23)

\xi_i(x_i(k), x_i(k-1), \lambda_j(k)) = \Big( \sum_{o_j \in O_p(k)} c_{ij} + \mu_i(k) - \sum_{o_j \in O_f(k)} c_{ij}\lambda_j(k) \Big) x_i(k) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1),    (4.24)

and

w_k(\Lambda) = \gamma(k) + g(k) + \sum_{\forall o_j \in O_f(k)} [\lambda_j(k)\ln\lambda_j(k) - \lambda_j(k)\eta_j] - \sum_{\forall o_j \in O_f(k)} (1 + \lambda_j(k))\ln(1 + \lambda_j(k)).    (4.25)

Here, Qi(Λ) represents the dual function for the ith component. The main benefit of (4.22) is that the original problem is now separable. As shown in Fig. 4-5, we employ the Lagrangian relaxation method to decompose the original DMFD problem into m separable subproblems, one for each component state sequence xi, where xi = (xi(1), xi(2), ..., xi(K)), xi(k) ∈ {0, 1} and i ∈ {1, ..., m}. This scheme can be viewed as a two-level coordinated solution framework for the DMFD problem. At the top (coordination) level, we update the Lagrange multipliers Λ = {λj(k), k ∈ (1, K), oj ∈ Of(k)} using the subgradient method based on the decoupled solutions of the individual subproblems. This level facilitates coordination among the subproblems, and can thus reside in a diagnostic control unit. At the bottom level, we use a dynamic programming technique (the

Viterbi algorithm) to solve each of the subproblems with computational complexity O(K), i.e., we optimize the ξi function in (4.24) to obtain the optimal state sequence xi* for each component state, given a fixed set of Lagrange multipliers Λ = {λj(k), k ∈ (1, K), oj ∈ Of(k)}. The Viterbi algorithm is a dynamic programming technique to find the most likely fault sequence (Forney, 1973). It finds a recursive optimal solution to the problem of estimating the state sequence of a finite-state Markov chain observed in memoryless noise. The key feature of the Viterbi algorithm is that the objective function can be written as a sum of merit functions, each depending on one state and its preceding one. We obtain the optimal state sequence for each component state, i.e., X* = {x1*, x2*, ..., xm*}, using a binary Viterbi algorithm. The key steps of the Viterbi algorithm are described in Subsection 4.8.1.

4.3.2 Approximate and Exact Duality Gap
After evaluating the optimum state sequence X* for fixed Λ, the problem reduces to one of minimizing the dual function value Q^l(Λ) = Q(X, Y, Λ) at iteration l, which is computed using (4.22)-(4.25). Denote by Q* the optimal dual function value, i.e., Q* = Q(Λ*) = min_Λ Q(Λ), where the dual problem is given by (4.22)-(4.25). The optimal primal solution is denoted by J* = J(X*, Y*) = max_{X^K, Y^K} J(X, Y), where the primal problem is given by (4.16)-(4.17). The difference between the optimal dual and primal function values, i.e., (Q* - J*), is termed the exact duality gap. Since the DMFD problem is NP-hard, it is difficult

to obtain the global optimal solution J*. However, we can obtain several feasible solutions from the dual solution and select the best feasible solution from the set. If Jf = J(Λ*, X^f, Y^f) is the best feasible value, then we have

J_f \le J^* \le Q^* \le Q^l.    (4.26)

Using this method, we can obtain an approximate duality gap Q* - Jf = (Q* - J*) + (J* - Jf) ≥ 0, which provides an overestimate of the error between the global optimal solution and the best feasible solution. To summarize, we update the feasible solutions X^f, Y^f and the lower bound Q_lb as follows: if J(X*(Λ^l), Y(X*(Λ^l))) ≥ Q_lb, then

X^f = X^*(\Lambda^l), \quad Y^f = Y(X^*(\Lambda^l)), \quad Q_{lb} = J_f = J(X^*(\Lambda^l), Y(X^*(\Lambda^l))).    (4.27)

The upper bound Q_ub is obtained using the current dual value Q^l as follows:

Q_{ub} = Q_{min} = \min(Q_{min}, Q^l).    (4.28)

Figure 4-6: Flow chart of the algorithm (Step 1: initialize the Lagrange multipliers, and input the fault universe and the test outcomes for K epochs; Step 2: find the optimal sequence for each of the m component states using the Viterbi algorithm for fixed Λ^l; Step 3: compute the current primal and dual values, and the lower and upper bounds; Step 4: update the Lagrange multipliers via the subgradient method; Step 5: if the stopping criteria are met, output the feasible solution, i.e., the most likely state sequence; otherwise return to Step 2)
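The bound bookkeeping in (4.26)-(4.28) can be summarized by a small Python sketch (illustrative only; function and variable names are not from the thesis). The percentage gap follows the definition used later in the results, i.e., (Q - J) divided by |J|.

def update_bounds(J_current, Q_current, Q_lb, Q_ub):
    """Keep the best feasible (primal) value as a lower bound and the
    smallest dual value as an upper bound, per (4.26)-(4.28)."""
    Q_lb = max(Q_lb, J_current)   # best feasible primal value so far (4.27)
    Q_ub = min(Q_ub, Q_current)   # best (smallest) dual value so far (4.28)
    gap_percent = 100.0 * (Q_ub - Q_lb) / abs(Q_lb) if Q_lb != 0 else float("inf")
    return Q_lb, Q_ub, gap_percent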

Since the dual function Q^l(Λ) is a piecewise differentiable function of the Lagrange multipliers Λ, this problem cannot be solved using differentiable optimization algorithms. We use a subgradient algorithm to compute a sequence of upper bounds for Q^l(Λ) (Bertsekas, 2003). The details of the subgradient method are described in Subsection 4.8.2. Fig. 4-6 shows the flow chart of our algorithm. There are five major steps. In step 1, we initialize the Lagrange multipliers and input the fault universe, i.e., the fault and test information along with the associated probabilities (Pdij, Pfij, Pai

and Pvi). We also input the test outcomes for the K epochs. In step 2, we run m binary Viterbi algorithms to obtain the optimal state sequences corresponding to the m faults. In step 3, we update the feasible solutions, i.e., X^f, Y^f, and the lower and upper bounds, i.e., Q_lb and Q_ub, using (4.24)-(4.26). Next, the Lagrange multipliers are updated using the subgradient method, which is described in Subsection 4.8.2. If the stopping criteria, defined in Subsection 4.8.2, are met, then the algorithm outputs the most likely component state sequences for the m components.

Figure 4-7: Detection and false alarm probabilities for problem 2 (the detection probability Pdij connects each faulty component state xi(k) = 1 to the failed test outcome oj(k) = 1, while the false alarm probability Pfj connects the all-normal state x1(k) = 0, ..., xm(k) = 0 to oj(k) = 1)

4.4 DMFD Problem 2
In this formulation, we define Pdij as Pdij = Pr(oj(k) = 1 | xi(k) = 1) and Pfj = Pr(oj(k) = 1 | x1(k) = 0, x2(k) = 0, ..., xm(k) = 0). This scenario is depicted in Fig. 4-7. Here,

\Pr(o_j(k) = 0 \mid x(k)) = (1 - Pf_j)^{(1 - x_1(k)) \cdots (1 - x_m(k))} \prod_{i=1}^{m} (1 - Pd_{ij})^{x_i(k)}.    (4.29)

Using a new variable z(k) = \prod_{i=1}^{m} (1 - x_i(k)), we have

\Pr(o_j(k) = 0 \mid x(k)) = (1 - Pf_j)^{z(k)} \prod_{i=1}^{m} (1 - Pd_{ij})^{x_i(k)}.

Taking logarithms,

\ln z(k) = \sum_{i=1}^{m} \ln(1 - x_i(k)),    (4.30)

\ln(\Pr(o_j(k) = 0 \mid x(k))) = z(k)\ln(1 - Pf_j) + \sum_{i=1}^{m} x_i(k)\ln(1 - Pd_{ij}).    (4.31)

Following steps similar to those in Problem 1, we have

\ln(\Pr(O_p(k) \mid x(k))) = \sum_{o_j(k) \in O_p(k)} \ln(\Pr(o_j(k) = 0 \mid x(k))) = \sum_{o_j(k) \in O_p(k)} z(k)\ln(1 - Pf_j) + \sum_{o_j(k) \in O_p(k)} \sum_{i=1}^{m} x_i(k)\ln(1 - Pd_{ij}) = z(k)\eta_j(k) + \sum_{i=1}^{m} \sum_{o_j(k) \in O_p(k)} x_i(k)\ln(1 - Pd_{ij}),    (4.32)

where \eta_j(k) = \sum_{o_j(k) \in O_p(k)} \ln(1 - Pf_j), and z(k) is defined in (4.30). For failed tests,

\ln(\Pr(O_f(k) \mid x(k))) = \sum_{o_j(k) \in O_f(k)} \ln(\Pr(o_j(k) = 1 \mid x(k))) = \sum_{o_j(k) \in O_f(k)} \ln(1 - y_j(k)),

where yj(k) = Pr(oj(k) = 0 | x(k)) and, using (4.31),

\ln(y_j(k)) = z(k)\ln(1 - Pf_j) + \sum_{i=1}^{m} x_i(k)\ln(1 - Pd_{ij}).    (4.33)

Here the DMFD problem is equivalent to

\hat{X}^K = \arg\max_{X^K} \sum_{k=1}^{K} f_k(x(k), x(k-1), y(k), z(k)),    (4.34)

where the primal objective function at epoch k, i.e., fk(x(k), x(k-1), y(k), z(k)), is defined as

f_k(x(k), x(k-1), y(k), z(k)) = z(k)\eta_j(k) + \sum_{i=1}^{m} \sum_{o_j(k) \in O_p(k)} x_i(k)\ln(1 - Pd_{ij}) + \sum_{o_j(k) \in O_f(k)} \ln(1 - y_j(k)) + \sum_{i=1}^{m} \tau_i(k) x_i(k) + \sum_{i=1}^{m} \sigma_i(k) x_i(k-1) + \sum_{i=1}^{m} h_i(k) x_i(k) x_i(k-1) + g(k),

where

\ln z(k) = \sum_{i=1}^{m} \ln(1 - x_i(k)), \quad \eta_j(k) = \sum_{o_j(k) \in O_p(k)} \ln(1 - Pf_j), \quad \ln(y_j(k)) = z(k)\ln(1 - Pf_j) + \sum_{i=1}^{m} x_i(k)\ln(1 - Pd_{ij}),    (4.35)

\tau_i(k) = \ln\left(\frac{Pa_i(k)}{1 - Pa_i(k)}\right), \quad \sigma_i(k) = \ln\left(\frac{Pv_i(k)}{1 - Pa_i(k)}\right), \quad h_i(k) = \ln\left(\frac{(1 - Pa_i(k))(1 - Pv_i(k))}{Pa_i(k) Pv_i(k)}\right), \quad g(k) = \sum_{i=1}^{m} \ln(1 - Pa_i(k)).    (4.36)

Appending constraints (4.30) and (4.33) via Lagrange multipliers µ(k) and {λj(k)}, oj ∈ Of(k), the Lagrangian function L(X, Y, z, Λ) can be written as

L(X, Y, z, \Lambda) = \sum_{k=1}^{K} \Big[ f_k(x(k), x(k-1), y(k), z(k)) + \mu(k)\Big(\ln z(k) - \sum_{i=1}^{m} \ln(1 - x_i(k))\Big) + \sum_{\forall o_j \in O_f(k)} \lambda_j(k)\big(\ln y_j(k) - z(k)\ln(1 - Pf_j)\big) - \sum_{\forall o_j \in O_f(k)} \sum_{i=1}^{m} x_i(k)\lambda_j(k)\ln(1 - Pd_{ij}) \Big],    (4.37)

where Λ = {µ(k), λj(k) ≥ 0, k ∈ (1, K), oj ∈ Of(k)} is the set of Lagrange multipliers. Using the Lagrange multiplier theorem, we optimize the Lagrangian function in (4.37) w.r.t. yj(k) to obtain the optimal yj*(k) as

y_j^*(k) = \frac{\lambda_j(k)}{1 + \lambda_j(k)},    (4.38)

and optimizing w.r.t. z(k), we obtain the optimal z*(k) as

z^*(k) = \frac{\mu(k)}{-\eta_j(k) + \sum_{\forall o_j \in O_f(k)} \lambda_j(k)\ln(1 - Pf_j)}.    (4.39)

The dual function Q(Λ) of Problem 2 is defined by

Q(\Lambda) = \max_{X^K, Y^K, z} L(X, Y, z, \Lambda).    (4.40)

Substituting (4.38) and (4.39) into (4.37) and simplifying further by rearranging and combining the terms, we obtain the dual function as

Q(\Lambda) = \max_{X^K} \sum_{i=1}^{m} Q_i(\Lambda),    (4.41)

where

Q_i(\Lambda) = \sum_{k=1}^{K} \Big[ \xi_i(x_i(k), x_i(k-1), \lambda_j(k), \mu(k)) + \frac{1}{m} w_k(\lambda_j(k), \mu(k)) \Big],    (4.42)

\xi_i(x_i(k), x_i(k-1), \lambda_j(k), \mu(k)) = \sum_{o_j(k) \in O_p(k)} x_i(k)\ln(1 - Pd_{ij}) + \tau_i(k) x_i(k) - \sum_{o_j(k) \in O_f(k)} \lambda_j(k) x_i(k)\ln(1 - Pd_{ij}) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1) - \mu(k)\ln(1 - x_i(k)),    (4.43)

and

w_k(\lambda_j(k), \mu(k)) = \mu(k)\left[\frac{\eta_j(k)}{-\eta_j(k) + \sum_{\forall o_j \in O_f(k)} \lambda_j(k)\ln(1 - Pf_j)} + \ln\left(\frac{\mu(k)}{-\eta_j(k) + \sum_{\forall o_j \in O_f(k)} \lambda_j(k)\ln(1 - Pf_j)}\right)\right] + g(k) + \sum_{\forall o_j \in O_f(k)} \big[\lambda_j(k)\ln\lambda_j(k) - (1 + \lambda_j(k))\ln(1 + \lambda_j(k))\big] - \mu(k)\frac{\sum_{\forall o_j \in O_f(k)} \lambda_j(k)\ln(1 - Pf_j)}{-\eta_j(k) + \sum_{\forall o_j \in O_f(k)} \lambda_j(k)\ln(1 - Pf_j)}.    (4.44)

The dual problem posed in (4.40)-(4.44) is separable, and it can be solved by following a procedure similar to that used for solving Problem 1. The only difference is that we also need to update the Lagrange multiplier µ(k) using a subgradient method.

Figure 4-8: Detection and false alarm probabilities for problem 3 (the hidden test state tj(k) ∈ {0, 1} is connected to the observed test outcome oj(k) ∈ {0, 1} through Pdj, 1 - Pdj, Pfj, and 1 - Pfj)

4.5 DMFD Problem 3
In this formulation, we consider the case where the probabilities of detection and false alarm (Pdj, Pfj) are associated only with each test tj (see Fig. 4-8). Formally, Pdj = Pr(oj(k) = 1 | tj(k) = 1) and Pfj = Pr(oj(k) = 1 | tj(k) = 0). We can convert these probabilities into a special case of Problem 1 by computing (Pdij, Pfij) using (4.1):

Pd_{ij} = d_{ij} Pd_j + (1 - d_{ij}) Pf_j.    (4.45)

Similarly,

Pf_{ij} = d_{ij} Pf_j + (1 - d_{ij}) Pd_j.    (4.46)

Here, D = [dij] is the dependency matrix (D-matrix). The solution of Problem 3 can be obtained by substituting Pdij and Pfij from (4.45)-(4.46) into the solution of Problem 1.
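As an illustration (a sketch only, not the thesis's MATLAB code), the conversion in (4.45)-(4.46) can be written as follows, assuming D is an m x n 0/1 list-of-lists and Pd, Pf are per-test lists:

def expand_test_probabilities(D, Pd, Pf):
    """Convert per-test probabilities (Pdj, Pfj) and the D-matrix into the
    per-pair probabilities (Pdij, Pfij) of Problem 1, per (4.45)-(4.46)."""
    m, n = len(D), len(D[0])
    Pd_ij = [[D[i][j] * Pd[j] + (1 - D[i][j]) * Pf[j] for j in range(n)] for i in range(m)]
    Pf_ij = [[D[i][j] * Pf[j] + (1 - D[i][j]) * Pd[j] for j in range(n)] for i in range(m)]
    return Pd_ij, Pf_ij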

4.6 DMFD Problem 4
Next, we consider the case when the system consists of reliable tests and the fault-test relationships are deterministic, i.e., Pdij = 1 and Pfij = 0 for i = 1, ..., m and j = 1, ..., n; equivalently, the D-matrix completely characterizes the fault-test relationships (Tu et al., 2003). This formulation can be represented as a bipartite graph between the components and tests. In this case, if some tests have passed, then we can infer that all the failure sources covered by these tests are good components. Thus, we need to infer failed components from those covered by the failed tests only, i.e., by excluding those components covered by the passed tests. Consequently, the size of the DMFD problem can be reduced by removing all failure sources {si | Pdij = 1, Pfij = 0, and tj(k) ∈ Tp(k)}. For each failed test tj(k) ∈ Tf(k), the optimal solution contains at least one component state xi(k) = 1 that satisfies dij = 1. Thus, there must be one or more failure sources that cover the failed tests. Let us consider a matrix A, in which each row represents the list of failure sources covered by a failed test. After excluding the failure sources covered by the passed tests, the resulting matrix A is a binary matrix such that aij = dji. After substituting Pdij = 1 and Pfij = 0 in (4.13), the reliable test

scenario with a binary D-matrix simplifies to a dynamic set covering problem with the following objective function term at epoch k:

f_k(x(k), x(k-1)) = \sum_{i=1}^{m} \{\mu_i(k) x_i(k) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1)\} + g(k),    (4.47)

subject to the following constraints: A(k)x(k) ≥ e for tj(k) ∈ Tf(k), where e is a vector of ones. Appending the constraints to (4.47) via Lagrange multipliers Λ = {λj(k) ≤ 0, k ∈ (1, K), tj ∈ Tf(k)}, the Lagrangian function L(X, Λ) can be written as

L(X, \Lambda) = \sum_{k=1}^{K} \Big[ f_k(x(k), x(k-1)) + \sum_{\forall t_j \in T_f(k)} \lambda_j(k)\Big(1 - \sum_{i=1}^{m} a_{ji} x_i(k)\Big) \Big].    (4.48)

After rearranging the terms, the Lagrangian function of the original problem can be written as a sum of the Lagrangian functions of the subproblems:

L(X, \Lambda) = \sum_{i=1}^{m} L_i(x_i(k), \Lambda),    (4.49)

where

L_i(x_i(k), \Lambda) = \sum_{k=1}^{K} \Big[ \mu_i(k) x_i(k) - \sum_{t_j \in T_f(k)} \lambda_j(k) a_{ji} x_i(k) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1) + \frac{1}{m}\Big(g(k) + \sum_{\forall t_j \in T_f(k)} \lambda_j(k)\Big) \Big].    (4.50)

The dual function Q(Λ) is defined by

Q(\Lambda) = \max_{X^K} L(X, \Lambda).    (4.51)

Simplifying further by rearranging and combining the terms, we obtain the dual function as

Q(\Lambda) = \max_{X^K} \sum_{i=1}^{m} Q_i(\Lambda),    (4.52)

where

Q_i(\Lambda) = \sum_{k=1}^{K} \Big[ \xi_i(x_i(k), x_i(k-1), \lambda_j(k)) + \frac{1}{m} w_k(\Lambda) \Big],    (4.53)

\xi_i(x_i(k), x_i(k-1), \lambda_j(k)) = \mu_i(k) x_i(k) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1) - \sum_{t_j \in T_f(k)} \lambda_j(k) a_{ji}(k) x_i(k),    (4.54)

and

w_k(\Lambda) = g(k) + \sum_{\forall t_j \in T_f(k)} \lambda_j(k).    (4.55)

The dual problem defined in (4.53)-(4.55) is separable. The Viterbi algorithm is used to solve each subproblem corresponding to each component state sequence {xi(k)}, k = 1, ..., K. This problem can be viewed as a dynamic set covering problem, which is NP-hard. Thus, the dynamic set covering problem is solved by combining the Viterbi algorithm and Lagrangian relaxation. This generalizes Beasley's Lagrangian relaxation algorithm for the static set covering problem (Tu et al., 2003; Beasley, 1987) to dynamic settings. We will explore the applications of this algorithm in our future work (Kodali et al., March 2008).
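For illustration, a minimal Python sketch of the Problem 4 preprocessing described above (removing candidates covered by passed tests under the reliable-test assumption) is given below; it is not part of the thesis implementation, and the function name and data layout are illustrative only.

def candidate_faults(D, passed, failed):
    """With perfect tests (Pdij = 1, Pfij = 0), any component covered by a
    passed test is good; the remaining candidates are the components covered
    by the failed tests only."""
    m = len(D)
    cleared = {i for i in range(m) for j in passed if D[i][j] == 1}
    candidates = {i for i in range(m) for j in failed if D[i][j] == 1} - cleared
    return sorted(candidates)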


Table 4-1: Small-scale scenario for simulations

Fault model, entries (si, tj, Pdij, Pfij) with Pai and Pvi per component:
s1 | (1, 3, 0.80, 0.01), (1, 13, 0.75, 0), (1, 19, 0.74, 0) | Pa = 0.0050 | Pv = 0.0002
s2 | (2, 10, 0.86, 0.01), (2, 12, 0.88, 0) | Pa = 0.0049 | Pv = 0.0002
s3 | (3, 1, 0.80, 0.01), (3, 16, 0.72, 0.015) | Pa = 0.0050 | Pv = 0.0002
... | ... | ... | ...
s19 | (19, 12, 0.88, 0.015), (19, 18, 0.85, 0.016) | Pa = 0.0051 | Pv = 0.0003
s20 | (20, 8, 0.73, 0.011), (20, 11, 0.82, 0) | Pa = 0.0051 | Pv = 0.0002

Test outcomes:
k | No. of failed tests | Failed test outcomes Of(k) | No. of passed tests | Passed test outcomes Op(k)
1 | 1 | 3 | 19 | 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
2 | 0 | φ | 20 | S
3 | 2 | 11, 18 | 18 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20
... | ... | ... | ... | ...
19 | 5 | 3, 10, 12, 16, 17 | 15 | 1, 2, 4, 5, 6, 7, 8, 9, 11, 13, 14, 15, 18, 19, 20
20 | 6 | 3, 4, 10, 12, 16, 17 | 14 | 1, 2, 5, 6, 7, 8, 9, 11, 13, 14, 15, 18, 19, 20

4.7 Sliding Window DMFD Method
During the online monitoring of a system, the observation and potential fault sequences are usually very long. Hence, in order to reduce the amount of computation and storage, the DMFD problem is solved using a sliding window method. The window size W is selected based on performance criteria, such as a low classification error and a low false isolation rate. One of the key advantages of the sliding window method is that the Lagrange multipliers for W - 1 of the samples are already available from the previous window, which improves the speed of the dual optimization. The sliding window method involves the following steps:
Step 1: Solve the DMFD problem for the window size W (W ≤ K). Make a decision at epoch k = 1.
Step 2: Move the window by one time epoch, i.e., from k = 2 to k = W + 1.
• Initialize (W - 1) Lagrange multipliers using the previous window.
• Initialize the component states at k = 2 using the results of the previous window.
• Solve the online DMFD problem using the data from k = 2 to k = W + 1.
• Make a decision at epoch k = 2.
Step 3: Continue sliding the window until k = K - W + 1.
The selection of the window size is a key issue, and it depends on the system and fault behavior, i.e., permanent or intermittent faults.
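A schematic Python sketch of the sliding-window driver follows (illustrative only; the per-window solver below is a trivial stand-in for the Lagrangian-relaxation/Viterbi solver of Section 4.8, and the data structures are hypothetical):

def solve_window(outcomes, warm_multipliers, initial_state):
    """Stand-in for the per-window DMFD solver: it simply reports the tests
    that failed in the window's first epoch as the 'decision'. A real solver
    would run the Lagrangian relaxation / Viterbi iterations, warm-started
    with the multipliers and states from the previous window."""
    decision = sorted(outcomes[0]["failed"])
    return decision, warm_multipliers

def sliding_window_dmfd(outcomes, W):
    """Slide a window of length W over the K epochs, warm-starting each
    window with the previous window's multipliers and decisions."""
    K = len(outcomes)
    multipliers, state, decisions = {}, None, []
    for start in range(0, K - W + 1):
        window = outcomes[start:start + W]
        decision, multipliers = solve_window(window, multipliers, state)
        state = decision            # initialize the next window's states
        decisions.append(decision)  # decision made for epoch 'start'
    return decisions

# tiny usage example with hypothetical outcomes
outcomes = [{"failed": {3}, "passed": set(range(1, 21)) - {3}} for _ in range(10)]
print(sliding_window_dmfd(outcomes, W=5))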

4.8 Algorithm Details

4.8.1 Solving Subproblems Using the Viterbi Algorithm
In this subsection, we discuss the key steps of the Viterbi algorithm, which is used to solve each subproblem corresponding to each component state sequence xi.

Initialization: In this step, the objective function is computed at k = 1 for each node (component state). It is assumed that the initial state x(0) is known for all the component states. The maximum function value of ξi in (4.24) at time k is denoted by δk(xi(k)), and the value of xi at which the function value is maximum is denoted by ψk(xi(k)). For the binary case, we use the notation δk(0) = δk(xi(k) = 0) and δk(1) = δk(xi(k) = 1). At time k = 1,

\delta_1(x_i(1)) = \xi_i(x_i(1), x_i(0), \{\lambda_j(1)\}) = \Big\{\mu_i(1) - \sum_{o_j \in O_f(1)} c_{ij}\lambda_j(1) + \sum_{o_j \in O_p(1)} c_{ij}\Big\} x_i(1) + \sigma_i(1) x_i(0) + h_i(1) x_i(1) x_i(0), \quad \psi_1(x_i(1)) = \phi; \quad x_i(0) \in \{0,1\}.    (4.56)

Recursion: The recursion step involves maximizing the objective function at each epoch k:

\delta_k(x_i(k)) = \Big\{\sum_{o_j \in O_p(k)} c_{ij} + \mu_i(k)\Big\} x_i(k) - \sum_{o_j \in O_f(k)} c_{ij}\lambda_j(k) x_i(k) + \max_{x_i(k-1) \in \{0,1\}} [\delta_{k-1}(x_i(k-1)) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1)],    (4.57)

for 2 ≤ k ≤ K, xi(k) ∈ {0, 1}, and

\psi_k(x_i(k)) = \arg\max_{x_i(k-1) \in \{0,1\}} [\delta_{k-1}(x_i(k-1)) + \sigma_i(k) x_i(k-1) + h_i(k) x_i(k) x_i(k-1)].    (4.58)

Termination: This step computes the objective function at time epoch k = K:

F^* = \max_{x_i(K) \in \{0,1\}} [\delta_K(x_i(K))], \quad x_i^*(K) = \arg\max_{x_i(K) \in \{0,1\}} [\delta_K(x_i(K))].    (4.59)

Optimal state sequence backtracking: The backtracking step computes the optimal state sequence by tracing the path backwards. The optimal state xi*(k) of the ith fault at time epoch k is given by

x_i^*(k) = \psi_{k+1}(x_i^*(k+1)), \quad k = K-1, \ldots, 1.    (4.60)

Similar to the recursion step, we can further simplify termination and backtracking for the binary case.
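A minimal Python sketch of this binary Viterbi recursion is given below (illustrative only, not the thesis's MATLAB code). It assumes the per-epoch coefficients have already been assembled from (4.24) for a fixed set of multipliers: a[k] is the coefficient of xi(k), and sigma[k], h[k] are the transition-related coefficients; x0 is the known initial state.

def binary_viterbi(a, sigma, h, x0):
    """Binary Viterbi for one component subproblem (Eqs. 4.56-4.60)."""
    K = len(a)
    delta = [{0: 0.0, 1: 0.0} for _ in range(K)]
    psi = [{0: None, 1: None} for _ in range(K)]
    for x1 in (0, 1):  # initialization, epoch 1
        delta[0][x1] = a[0] * x1 + sigma[0] * x0 + h[0] * x1 * x0
    for k in range(1, K):  # recursion
        for xk in (0, 1):
            scores = {xp: delta[k - 1][xp] + sigma[k] * xp + h[k] * xk * xp
                      for xp in (0, 1)}
            best_prev = max(scores, key=scores.get)
            psi[k][xk] = best_prev
            delta[k][xk] = a[k] * xk + scores[best_prev]
    xK = max((0, 1), key=lambda s: delta[K - 1][s])  # termination
    path = [xK]
    for k in range(K - 1, 0, -1):  # backtracking
        path.append(psi[k][path[-1]])
    return list(reversed(path)), delta[K - 1][xK]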

4.8.2 Updating Lagrange Multipliers via the Subgradient Method
The Lagrange multipliers are updated via

\lambda_j^{l+1}(k) = \max(0, \lambda_j^l(k) + \beta^l(k) d_j^l(k))    (4.61)

for j ∈ Of(k) and k ∈ (1, K), where the subgradients d_j^l(k) at iteration l and epoch k are

d_j^l(k) = \ln(y_j^*(k)) - \sum_{i=1}^{m} c_{ij} x_i^*(k) - \eta_j,    (4.62)

and the step size β^l(k) is

\beta^l(k) = -\upsilon \frac{(Q^l - Q^*)}{\sum_{j=1}^{T_f} (d_j^l(k))^2}.    (4.63)

Since the optimal dual function value is not available, it is estimated using the primal feasible solution Jf and the best current dual value Qmin obtained via (4.27) and (4.28), respectively. We estimate the optimal dual function value as

\hat{Q}^* = \omega \frac{(J_f + Q_{min})}{2},    (4.64)

and an initial value υ = 0.01 is used. If the best current dual value Qmin does not decrease over the previous 20 iterations of the subgradient procedure with the current value of υ, then υ is reduced by a factor. To improve the subgradient convergence, we also vary ω, which is increased or decreased based on whether the dual function value is decreasing or not (Bertsekas, 2003). We used the following stopping criteria for the subgradient method:
• Stop if Σ_{j} (d_j^l(k))^2 = 0 over the failed tests, since we cannot define a suitable step size in this case;
• Stop if υ ≤ 10^{-4}, because the step sizes become too small;
• Stop if the number of iterations has crossed the maximum number of iterations, i.e., l ≥ 100.
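A minimal Python sketch of the multiplier update (4.61)-(4.63) for one epoch follows (illustrative only; names are not from the thesis, and the subgradients are assumed to have been computed from (4.62) for the failed tests of that epoch):

def update_multipliers(lambdas, subgrads, Q_current, Q_star_est, upsilon=0.01):
    """Projected subgradient step of (4.61)-(4.63) for one epoch."""
    denom = sum(d * d for d in subgrads)
    if denom == 0.0:
        return lambdas  # stopping condition: no usable step size
    beta = -upsilon * (Q_current - Q_star_est) / denom   # Eq. (4.63)
    # projection onto the nonnegative orthant keeps the multipliers feasible
    return [max(0.0, lam + beta * d) for lam, d in zip(lambdas, subgrads)]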

4.9 Simulations and Results
We implemented the solution of problem 1, the most general version of the DMFD problem formulation, and applied it to a small-scale system and a few real-world models.

4.9.1 Small-Scale System
We randomly generated a small-scale system to illustrate the inputs and outputs of our algorithm. The model was constructed for a system with 20 components, 20 tests and 20 observation epochs. Each component can have binary states, i.e., normal and faulty. The detection probabilities were set between 0.7-0.9 and the false alarm probabilities between 0-0.02, and the tests uniformly cover the component states. The fault appearance and disappearance probabilities were varied between 0.0049-0.0051 and 0.00025-0.00033, respectively. These probabilities were chosen such that the average number of faults was 2 over a span of 20 epochs. The true fault state set and, accordingly, the test outcomes at each epoch were generated using the above model parameters. The stopping criteria defined in Subsection 4.8.2 were used for the subgradient method. We used the following metrics to evaluate the performance of our algorithm:
Correct isolation rate (CI): CI is the percentage of true fault states which are detected by the algorithm at epoch k. Let x̂(k) be the fault state set at epoch k

Table 4-2: Results for small-scale scenario
Approximate duality gap (D) (%): 13.2
Correct isolation rate (CI): 100.0
False isolation rate (FI): 0.0
Primal function value (J): -68
Dual function value (Q): -59
Computation time per epoch (sec) (t): 0.12

detected by the algorithm, and r(k) the true fault state set at epoch k. Then CI at epoch k and the average CI over all epochs are obtained as follows:

CI(k) = \frac{|\hat{x}(k) \cap r(k)|}{|r(k)|},    (4.65)

CI = \frac{\sum_{k=1}^{K} CI(k)}{K}.    (4.66)

False isolation rate (FI): FI is the percentage of fault states which are falsely detected by the algorithm as fault states at epoch k. FI and the average FI are computed as

FI(k) = \frac{|\hat{x}(k) \cap \neg r(k)|}{|S| - |r(k)|},    (4.67)

FI = \frac{\sum_{k=1}^{K} FI(k)}{K}.    (4.68)
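A small Python sketch of the metrics (4.65)-(4.68) is given below (illustrative only; it assumes the true and estimated fault sets are given per epoch, and, as an assumption not spelled out in the text, it skips epochs with no true faults when averaging CI):

def isolation_rates(estimated, truth, m):
    """Average correct/false isolation rates per (4.65)-(4.68).
    estimated, truth: lists (over epochs) of sets of fault indices; m = |S|."""
    ci_vals, fi_vals = [], []
    for x_hat, r in zip(estimated, truth):
        if r:                        # CI(k) is defined only when faults are present
            ci_vals.append(len(x_hat & r) / len(r))
        if m > len(r):
            fi_vals.append(len(x_hat - r) / (m - len(r)))
    ci = 100.0 * sum(ci_vals) / len(ci_vals) if ci_vals else 100.0
    fi = 100.0 * sum(fi_vals) / len(fi_vals) if fi_vals else 0.0
    return ci, fi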

Table I shows only partial data (5 rows) of a small-scale example. Any component state can be detected by several tests. For example, state of component 1 can be detected by t3 , t13 and t19 with (P d1,3 , P f1,3 ) = (0.80, 0.01), (P d1,13 , P f1,13 ) = (0.75, 0), and (P d1,19 , P f1,19 ) = (0.74, 0), respectively. This implies that when component state s1 occurs, test t3 detects it with probability 0.80, and if component state s1 does not occur, test t3 has 0.01 probability of falsely implicating it. Similarly, test t13 detects s1 with probability 0.75, and has no false

Figure 4-9: Approximate duality gap (lower and upper bounds and the current dual function value versus subgradient iterations)

If s1 is not present at epoch k, then at epoch k + 1 the probability of s1 being present is 0.005, while if s1 is present at epoch k, it has a probability of 0.00026 of disappearing at epoch k + 1. The test outcomes at epochs k = 1, 2, 3, 19 and 20 are shown in the table; test outcomes at the other epochs are omitted to save space. For example, at k = 1 the outcome of test t3 is observed as failed, while the outcomes of all other tests are observed as passed. The DMFD problem here is to identify the evolution of the component states over the 20 epochs. The results for this model are shown in Table 4-2. Here, J, Q, D, CI, FI and t denote the primal function value, the dual function value, the approximate duality gap, the correct isolation rate, the false isolation rate and the computation time per epoch. The primal and dual function values are computed using (4.16)-(4.17) and (4.22)-(4.25), respectively.

Table 4-2: Results for small-scale scenario
Approximate duality gap (D) (%)     13.2
Correct isolation rate (CI)         100.0
False isolation rate (FI)           0.0
Primal function value (J)           -68
Dual function value (Q)             -59
Computation time (sec) (t)          0.12

The approximate duality gap (D) is computed as the ratio of the difference between Q and J to the absolute value of the primal feasible value J. The algorithms were implemented in MATLAB on a standard PC with a 3.0 GHz Pentium 4 processor and 512 MB of RAM. The approximate duality gap, also shown in Fig. 4-9, is 13.2%. The duality gap decreases as the number of iterations increases, and the subgradient method converges to the minimum dual function value.

Table 4-3: Real world models
System       m     n     Pdij, Pfij              c, Pai
Automotive   22    60    (0.85-0.95), 0-0.02     3, 9.13e-04
Docmatch     257   180   (0.6-1), 0              9, 3.12e-04
Powerdist    96    98    (0.6-1), 0              3, 3.13e-04
Helitrans    34    51    (0.6-1), 0              2, 2.95e-04
EngineSim    53    30    (0.6-1), 0              2, 5.68e-04

4.9.2 Real World Data Sets

Table 4-3 lists the model parameters of an automotive system, a document matching system (Docmatch), a power distribution system (Powerdist), a UH-60 helicopter transmission system (Helitrans), and an engine simulator (EngineSim). Details of these models are provided in (Tu et al., 2003). Here, m, n, and c denote the number of components (failure sources), the number of tests, and the average number of intermittent faults that can occur over a span of 100 epochs. The fault appearance probabilities (Pai) were computed based on the average number of intermittent faults (c). These real-world systems are not ideal because they have

fewer tests than failure sources; hence, some failure sources are not covered by any tests. The fault disappearance probabilities (Pvi) were varied between 0.0025-0.0049 to allow c intermittent faults, on average. The probabilities of detection and false alarm were varied as shown in Table 4-3. Here, J, Q, D, CI, FI and t denote the average primal function value, the average dual function value, the average approximate duality gap, the average correct isolation rate, the average false isolation rate and the average computation time per epoch. The maximum number of subgradient iterations was set at 80, and 100 Monte Carlo runs were used to generate the test outcomes.

Table 4-4: Results on real world models
System       Method   J      Q      D (%)   CI     FI     t
Automotive   S        -658   -481   27      99.5   0.05   0.43
             DSA      -775   –      –       75     1.30   0.01
             HS       -658   -481   27      –      –      –
Docmatch     S        -541   -311   42.5    88.2   0.36   4.53
             DSA      -405   –      –       69     0.70   0.24
             HS       -405   -311   23.2    –      –      –
Powerdist    S        -232   -125   46.1    91.6   0.75   1.56
             DSA      -157   –      –       84     0.30   0.05
             HS       -157   -125   20.3    –      –      –
Helitrans    S        -15    -14    6.7     94.8   0.31   0.47
             DSA      -15    –      –       100    0.0    0.02
             HS       -15    -14    6.7     –      –      –
EngineSim    S        -85    -33    61.1    95.1   2.28   0.64
             DSA      -51    –      –       86     0.3    0.02
             HS       -51    -33    35.3    –      –      –

Table 4-4 shows the results obtained using the subgradient (S) and the deterministic simulated annealing (DSA) (Ruan et al., 2006a) methods. The subgradient method (S) achieves higher correct isolation rates than DSA for all systems except Helitrans. However,

the DSA method achieves a better primal function value and is also effective in reducing the computation time (t). Also, note that we can obtain a hybrid duality gap by taking the maximum primal solution from the subgradient (S) and deterministic simulated annealing (DSA) methods and the dual function value from the subgradient (S) method. The hybrid DSA-subgradient (HS) duality gaps are also shown in Table 4-4. The average computation time (t) is measured in seconds. These numbers are attractive in practice, and they can be reduced significantly further by a careful implementation in the C language. We also showed an application of the DMFD Problem 3 formulation in our recent paper (Singh et al., October 2007), where we performed dynamic fusion of classifiers over time for automotive engine fault diagnosis. The temporal correlations exploited by dynamic fusion improve the classification accuracy over a variety of static fusion techniques (based on batch data).

Figure 4-10: Boxplots of CI and FI for the automotive and document matching systems
Figure 4-11: Boxplots of CI and FI for the power distribution and UH-60 helicopter transmission systems
Figure 4-12: Boxplots of CI and FI for the engine simulator system


4.9.3 Sliding Window DMFD Results

Figs. 4-10, 4-11 and 4-12 show the boxplots of the correct isolation (CI) and false isolation (FI) rates for the various real-world models. These plots were obtained using the sliding-window DMFD method. Each boxplot shows the dispersion of the data, with lines at the lower quartile (25%), the median and the upper quartile (75%). The whiskers extend from each end of the box, and the maximum whisker length is a function of the inter-quartile range. Outliers are shown as 'o' in the figures. The window size is selected such that it gives a high CI with a small box size and a low FI with a small box size. The most suitable window sizes under this criterion for the Automotive, Docmatch, Powerdist, Helitrans, and EngineSim systems are 15, 10, 20, 15 and 15, respectively.
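One way to operationalize this selection rule is to score each candidate window size by its median CI and FI and their inter-quartile ranges over the Monte Carlo runs. The scoring function below is an illustrative assumption, not the procedure used in the thesis.

    import numpy as np

    def pick_window_size(ci_runs, fi_runs):
        """ci_runs, fi_runs: dicts mapping window size -> array of per-run CI / FI
        values (percent). Prefers high median CI, low median FI and small spreads."""
        def iqr(x):
            q75, q25 = np.percentile(x, [75, 25])
            return q75 - q25
        best_w, best_score = None, -np.inf
        for w in ci_runs:
            ci, fi = np.asarray(ci_runs[w]), np.asarray(fi_runs[w])
            score = np.median(ci) - np.median(fi) - iqr(ci) - iqr(fi)
            if score > best_score:
                best_w, best_score = w, score
        return best_w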

Next, we perform simulations to study the effect of intermittent faults on the performance of the sliding-window DMFD method. The automotive system is used for these simulations. The fault appearance probability was set so that, on average, 3 faults occur over a span of 100 epochs. The three fault behaviors considered and their fault disappearance probabilities are listed in Table 4-5. Fig. 4-13 illustrates the correct isolation rate for the various fault behaviors. The results show the mean values, and the vertical lines on the data points indicate the standard deviations. These results were obtained using 1000 Monte Carlo runs. They demonstrate that the algorithm achieves low variance when the fault behavior is highly intermittent.

Table 4-5: Type of faults
Case     Fault disappearance probabilities    Fault behavior
Case 1   0.0247 to 0.0484                     Intermittent
Case 2   6.8089e-04 to 9.2327e-04             Permanent
Case 3   0.2235 to 0.3926                     Highly intermittent

Figure 4-13: Correct isolation rate for various fault behaviors
Figure 4-14: False isolation rate for various fault behaviors

The false isolation rate plot (Fig. 4-14) exhibits the same behavior as the CI plot, i.e., the algorithm achieves the least variance for the highly intermittent fault types. This indicates that the DMFD algorithm is well suited to intermittent faults.

4.9.4 Complexity

The algorithm presented here reduces the overall complexity from O(K·2^m) to O(K(m + |Of|)), where m is the number of component states, K is the number of epochs and Of is the set of failed tests. More specifically, the complexities of the binary Viterbi algorithm over all component states and of the subgradient method are O(Km) and O(K|Of|), respectively, per iteration; this is a substantial improvement over exhaustive approaches based on exact inference.
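To make the O(Km) per-fault subproblem concrete, the sketch below decodes the most likely binary state sequence of a single component with a two-state Viterbi recursion. The per-epoch scores log_b stand in for the Lagrange-multiplier-weighted test-outcome terms and are abstracted as an input; this is an illustrative sketch, not the thesis implementation.

    import numpy as np

    def binary_viterbi(pa, pv, log_b, p0=0.0):
        """Most likely 0/1 state sequence for one component over K epochs.

        pa, pv : fault appearance / disappearance probabilities (Pa_i, Pv_i)
        log_b  : (K, 2) array; log_b[k, s] is the per-epoch score of state s
                 at epoch k (e.g., the multiplier-weighted observation term)
        p0     : probability that the component starts in the faulty state
        """
        K = log_b.shape[0]
        # log transition matrix: rows = state at k-1, columns = state at k
        logA = np.log(np.array([[1 - pa, pa], [pv, 1 - pv]]))
        logpi = np.log(np.array([1 - p0, p0]) + 1e-300)

        delta = np.zeros((K, 2))
        psi = np.zeros((K, 2), dtype=int)
        delta[0] = logpi + log_b[0]
        for k in range(1, K):
            for s in (0, 1):
                cand = delta[k - 1] + logA[:, s]
                psi[k, s] = int(np.argmax(cand))
                delta[k, s] = cand[psi[k, s]] + log_b[k, s]

        states = np.zeros(K, dtype=int)
        states[-1] = int(np.argmax(delta[-1]))
        for k in range(K - 2, -1, -1):
            states[k] = psi[k + 1, states[k + 1]]
        return states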

4.10 Summary

In this chapter, we discussed the problem of dynamic multiple fault diagnosis (DMFD) with imperfect tests. The original DMFD problem is an intractable NP-hard combinatorial optimization problem. Using a Lagrangian relaxation-based coordination framework, we decomposed the original DMFD problem into parallel decoupled subproblems coordinated via Lagrange multipliers. Each subproblem corresponds to finding the optimal state sequence of a fault with fixed

Lagrange multipliers. The subproblems were solved using a binary Viterbi decoding algorithm. The coordination among the subproblems was facilitated by the Lagrange multipliers, which were updated using a subgradient method.

Chapter 5

Dynamic Fusion of Classifiers for Fault Diagnosis

5.1 Introduction

Classifier fusion has been widely investigated in diverse fields, such as image segmentation, data mining from noisy data streams, credit card fraud detection, sensor networks, image, speech and handwriting recognition, and fault diagnosis, to name a few. In the literature, classifier fusion is variously referred to as classifier ensembles, consensus aggregation, decision fusion, committee machines, classifier selection, mixture of experts, etc. The objective of classifier fusion is to achieve better classification accuracy by combining the results of individual classifiers. Our focus here is on combining class labels from multiple classifiers over time. The key motivation for performing dynamic classifier fusion in our application context is to improve the on-board diagnostic accuracy of safety-critical systems,


such as aircraft, automobiles, nuclear power plants and space vehicles. An accurate on-board diagnostic process will ensure the performability, maintainability and survivability of safety-critical systems. It is generally believed that classifier fusion may enhance the diagnostic accuracy in situations where the constituent classifiers have low correlations among their classification errors or, equivalently, more diversity among their outcomes (Dietterich, 2000). Multiple classifiers can avoid the risk of picking the output of a single classifier and, consequently, overcome the weaknesses of individual classifiers. If the individual classifiers are already performing well, then the fusion accuracy is not expected to increase significantly, but the variability in the classification performance decreases. In this chapter, we formulate the dynamic classifier fusion problem as one of maximizing the a posteriori probability of a hidden state sequence given uncertain classifier outcomes over time. For simplicity of classifier fusion, we transform the data into binary classes by selecting the individual classifiers to correspond to the columns of an error correcting code (ECC) matrix (Dietterich and Bakiri, January 1995). In the fault diagnosis area, we refer to classes as components and to classifiers as tests. Thus, the binary classifiers (binary tests) correspond to the columns of the ECC matrix, and the components correspond to the rows of the ECC matrix. The ECC matrix may be viewed as a diagnostic matrix (D-matrix, diagnostic dictionary, reachability matrix), which defines the cause-effect relationships among fault sources/components (rows) and tests (columns).

Though not demonstrated experimentally in this chapter, a major advantage of our multistage fusion architecture is that it allows the use of a heterogeneous set of classifiers over time. For example, in a scenario with 30 time epochs, the first 10 epochs may employ a data-driven classifier such as the SVM, a knowledge-based classifier (e.g., TEAMS-RT (QSI, 1994)) for the next 10 epochs, and a model-based classifier (e.g., (Luo et al., 2007)) for the last 10 epochs. Our fusion approach provides a flexible framework to optimize diagnostic systems with respect to data pre-processing, the number and type of classifiers, the ECC matrix, as well as the temporal complexity measured in terms of the time epochs used for fusion.

5.1.1 Previous Work

In the literature, many techniques have been proposed for classifier fusion. They can be divided into two categories: classifier combination and classifier selection. Classifier combination is an effective technique for combining independent classifiers with high accuracy and high diversity. Classifier combination can be applied to class labels, class rankings or confidence estimates on class labels. In (Ruta and Gabrys, 2000; Kuncheva et al., 2001; Reiter and Rigoll, 2004), several methods are proposed for classifier combination, such as the hierarchical mixture of experts, voting methods, the behavior-knowledge space method, the Borda count method, Bayesian fusion, fuzzy integrals, Dempster-Shafer combination, and artificial neural networks, to name a few.

Classifier selection, as its name implies, chooses the best classifier for each test sample. A static classifier selection method decides on the best classifier a priori during training: the input patterns are partitioned, and the best classifier is nominated for each partition. While a static fusion method employs constant weights for each classifier based on training, a dynamic fusion method changes the weights of each classifier based on the observed test pattern (Vadovinos et al., 2005). For example, the distance of a test pattern to its nearest neighbor for each individual classifier may be used to compute the dynamic weights. A prototypical classifier selection method is the decision templates approach. Decision templates are the estimated averages of the decision profiles (DP) of the samples of each class in the training set. The DP is a matrix of classifier outcomes in which the ith row contains the output of the ith classifier and the jth column refers to the support from all the classifiers for class j. The fusion is performed by comparing the decision profile of a test set with the stored decision templates of the classes. A classifier is selected based on the elements of the DP matrix: the process involves selecting the minimum value in each column of the DP matrix, and then declaring the classifier with the maximum of these column minima as the best classifier. This method is used in (Vadovinos et al., 2005) to classify meeting events. Genetic algorithms are employed to select features in multiple classifier systems in (Kuncheva and Jain, September 2000). Another approach is to estimate

a local regression model for each partition of the input data, and to dynamically decide on the combination function (Ruta and Gabrys, 2000). In (Giacinto and Roli, 2000), classification is performed by selecting the classifier with the highest classifier local accuracy (CLA) in a local region of the feature space. A priori and a posteriori selection methods are proposed to estimate the CLA. In the a priori selection method, the CLA is estimated as the ratio of the number of patterns correctly classified in the neighborhood of the unknown test pattern to the total number of patterns in that neighborhood. In the a posteriori selection method, the CLA is estimated as the probability that a classifier assigns the test pattern to a particular class (e.g., as in a k-nearest neighbor method). A dynamic classifier selection (DCS) algorithm is proposed by selecting a CLA threshold and rejecting classifiers below the threshold. Kim et al. (Kim and Ko, 2005) proposed a dynamic integration system, which selects the best classifier from multiple base classifiers; the system focuses on learning the local region in which each classifier is the best. In (Zhu et al., 2004), a feature-oriented dynamic classifier selection method is proposed for noisy data streams. Here, the evaluation set is split into subsets based on the feature values of each pattern. Then, the classification accuracy of each base classifier is evaluated on these subsets during training. During testing, the feature values of a test pattern are used to select a subset and the concomitant best classifier for that subset. Next, we describe our dynamic fusion method.


Figure 5-1: Overview of dynamic fusion process

5.2 Dynamic Fusion Process Overview

Our approach to dynamic fusion is shown in Fig. 5-1. It involves four key steps: (1) data preprocessing (noise suppression, data reduction and feature selection) using data-driven techniques, such as multi-way partial least squares (MPLS) for data reduction, computation of statistical moments, etc.; (2) error correcting codes to transform the multiclass data into dichotomous choice situations (binary classification); (3) fault detection using pattern recognition techniques (e.g., support vector machines); and (4) fault isolation via dynamic fusion of classifier output labels over time using the DMFD algorithm. Next, we discuss each step of the dynamic fusion process in detail.

Table 5-1: Error correcting code (ECC) matrix
      C1   C2   C3   C4
x1    0    1    0    0
x2    1    1    1    1
x3    0    0    1    0
x4    0    1    1    0
x5    1    1    0    1

5.2.1 Feature Extraction or Data Pre-processing

Feature extraction involves signal processing methods, such as wavelets and fast Fourier transforms (FFT), and statistical techniques to extract information relevant for diagnosing faults. In our experiments, we perform pre-processing using data reduction techniques, such as MPLS, to transform the data into low-dimensional structures for implementation in the limited-memory electronic control units (ECUs) of an automotive system (Choi et al., September 2006).

5.2.2 Error Correcting Codes (ECC) Matrix

The next step involves fault detection using binary classifiers corresponding to the columns of an ECC matrix. Error correcting codes are widely used in communications to decode messages sent over noisy channels by exploiting the redundancy in the transmitted code. We use an ECC matrix to project the data into a binary orthogonal space. Each column of the ECC matrix represents a classifier, and each row depicts a component or class. In the context of fault diagnosis, the ECC matrix can be viewed as a diagnostic matrix (D-matrix), which provides the cause-effect relationships between the faults (rows) and tests (columns). The

ECC matrix provides a flexible and robust framework for combining the classifiers. For example, we can choose the first column to be any model-based classifier, the second column to be any data-driven classifier, the third column to be any knowledge-based classifier, and so on. Table 5-1 shows an example of the ECC matrix. A column and a row of the ECC matrix represent a classifier (Cj) and a component (xi), respectively. For example, classifier C1 considers the data corresponding to faults in components 2 and 5 as class 1, and the data for the other faulty components as class 0.
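As an illustration of how an ECC matrix induces the binary classifiers, the sketch below relabels multiclass training data for each column; the array names are hypothetical and the code is a sketch, not the exact thesis implementation.

    import numpy as np

    def binary_labels_from_ecc(ecc, class_labels):
        """ecc: (m, n) 0/1 matrix; rows = components/classes, columns = classifiers.
        class_labels: multiclass labels in {0, ..., m-1}, one per pattern.
        Returns an (n_patterns, n) array of binary labels, one column per classifier."""
        ecc = np.asarray(ecc)
        return ecc[np.asarray(class_labels), :]   # classifier j's label for a pattern is ecc[class, j]

    # Example with the 5x4 ECC matrix of Table 5-1: classifier C1 treats faults in
    # components x2 and x5 as class 1 and all other faulty components as class 0.
    ecc = np.array([[0, 1, 0, 0],
                    [1, 1, 1, 1],
                    [0, 0, 1, 0],
                    [0, 1, 1, 0],
                    [1, 1, 0, 1]])
    labels = np.array([1, 4, 0, 2])                      # patterns from components x2, x5, x1, x3
    print(binary_labels_from_ecc(ecc, labels)[:, 0])     # -> [1 1 0 0] for classifier C1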

5.2.3 Fault Detection using Support Vector Machine (SVM) Classifiers

The SVM, introduced in (Boser et al., 1992; Vapnik, 1999), has been applied in areas such as handwritten digit recognition, anomaly detection in computers, text classification, fault diagnosis, etc. The SVM principle is to find a hyperplane that maximizes the separation between classes. The SVM uses nonlinear pre-processing techniques ("kernels") to project the data from a low-dimensional space (input space) to a high-dimensional space (feature space); a linear operation in the feature space is equivalent to a nonlinear operation in the input space. It finds an optimal hyperplane in the high-dimensional space using quadratic programming. To obtain the SVM, we need to specify the kernel parameter γ and the cost relaxation parameter C; in this chapter, these parameters are computed empirically. The probabilities of detection and false alarm (Pdj, Pfj) for all the SVM classifiers are learned from the training data by constructing the confusion matrix. Table 5-2 shows the confusion matrix, where Nab is the number of patterns having true class a ∈ {0, 1} and estimated class b ∈ {0, 1}. The probabilities of detection and false alarm are computed from the sample statistics as Pdj = N11/(N10 + N11) and Pfj = N01/(N00 + N01).

Table 5-2: Confusion matrix
                 Estimated class 0    Estimated class 1
True class 0     N00                  N01
True class 1     N10                  N11
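A small sketch of this estimate (with hypothetical variable names) is:

    import numpy as np

    def detection_and_false_alarm(y_true, y_pred):
        """Estimate (Pd_j, Pf_j) for one binary classifier from labeled data,
        using the confusion-matrix counts of Table 5-2."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        n11 = np.sum((y_true == 1) & (y_pred == 1))
        n10 = np.sum((y_true == 1) & (y_pred == 0))
        n01 = np.sum((y_true == 0) & (y_pred == 1))
        n00 = np.sum((y_true == 0) & (y_pred == 0))
        pd = n11 / (n10 + n11) if (n10 + n11) else 0.0
        pf = n01 / (n00 + n01) if (n00 + n01) else 0.0
        return pd, pf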

5.2.4 Dynamic Fusion

During the testing phase, we generate fault scenarios and the corresponding test patterns according to component failure and recovery rates. We assume that each fault is intermittent. The testing data is processed through the classifiers (columns of the ECC matrix) to obtain the classifier outcomes, i.e., the sets of passed and failed classifier outcomes (Op(k), Of(k)). These test outcomes are fed to the dynamic fusion block, along with the probability pairs (Pdj, Pfj), to obtain the fault isolation decisions.

5.3 Dynamic Multiple Fault Diagnosis (DMFD) Problem

Our dynamic fusion process is based on an optimization framework that computes the most likely fault sequence over time. The dynamic fusion problem is a specific formulation of the dynamic multiple fault diagnosis (DMFD) problem (Ruan et al., 2006b; Singh et al., May 2007).


Figure 5-2: Tri-partite graph for the dynamic fusion problem

In the DMFD problem, the objective is to isolate multiple faults based on the test (classifier) outcomes observed over time. The dynamic fusion problem consists of a set of possible fault states in a system (component states, as in Fig. 5-2) and a set of binary classifier outcomes that are observed at each sample (observation, decision) epoch. The evolution of each component state is assumed to be independent. Each classifier outcome provides information on a subset of the fault states (the entries with ones in the corresponding column of the ECC matrix). At each sample epoch, a subset of classifier outcomes is available. The classifiers are imperfect in the sense that the outcomes of some of the classifiers could be missing, and the classifiers have missed-detection and false-alarm probabilities associated with them.

Formally, we represent the dynamic fusion problem as DF = {S, κ, C, O, ECC, P, A}, where S = {s1, ..., sm} is a finite set of m components (failure sources) associated with the system. The state of component si at epoch k is denoted by xi(k), where xi(k) = 1 if failure source si is present and xi(k) = 0 otherwise. Here, κ = {0, 1, ..., k, ..., K} is the set of discretized observation epochs. The status of all component states at epoch k is denoted by x(k) = {x1(k), x2(k), ..., xm(k)}. We assume that the initial state x(0) is known (or that its probability distribution is known). The observations at each epoch are subsets of the binary classifier outcomes O = {O1, O2, ..., On}, i.e., Oj(k) ∈ {pass, fail} = {0, 1}. Fig. 5-2 shows the dynamic fusion problem as a tri-partite graph at epoch k; the component states, classifiers and classifier outcomes represent the nodes of the digraph. Here, the true component states are hidden. The true states of the classifiers are also hidden because the classifiers are imperfect. We also define the ECC matrix ECC = [eij] as the diagnostic matrix (D-matrix), which represents the full-order dependency among failure sources and classifiers. Each component state is modeled as a two-state non-homogeneous Markov chain. For each component si at epoch k, A = (Pai(k), Pvi(k)) denotes the pair of fault appearance probability Pai(k) and fault disappearance probability Pvi(k), defined as Pai(k) = Pr(xi(k) = 1 | xi(k−1) = 0) and Pvi(k) = Pr(xi(k) = 0 | xi(k−1) = 1). Fig. 5-3 shows the fault appearance and disappearance mechanisms of this two-state HMM.

Figure 5-3: Fault appearance and disappearance probabilities of the two-state component model

Figure 5-4: Detection and false alarm probabilities of a classifier

Here, C = {C1, C2, ..., Cn} is a finite set of n available binary classifiers, by means of which the integrity of the system can be ascertained. At each observation epoch k, k ∈ κ, the classifier outcomes up to and including epoch K are available, i.e., we let O^K = {O(k) = (Op(k), Of(k)) : k = 1, ..., K}, where O^K is the set of observed classifier outcomes up to and including epoch K, with Op(k) (⊆ O(k)) and Of(k) (⊆ O(k)) the sets of passed and failed classifier outcomes at epoch k, respectively. The classifiers are partially observed in the sense that the outcomes of some classifiers may not be available, i.e., (Op(k) ∪ Of(k)) ⊂ O(k). In addition, the classifiers exhibit missed detections and false alarms. P = {Pdj, Pfj} represents the set of detection and false alarm probabilities associated with each classifier Cj. Formally, Pdj = Pr(Oj(k) = 1 | Cj(k) = 1) and Pfj = Pr(Oj(k) = 1 | Cj(k) = 0). Fig. 5-4 illustrates these probabilities. The dynamic fusion problem is one of finding, at each decision epoch k, the most likely fault state candidates x(k) ∈ {0, 1}^m, i.e., the fault state evolution

over time, X^K = {x(1), ..., x(K)}, that best explains the observed classifier outcome sequence O^K. We formulate this as one of finding the maximum a posteriori (MAP) configuration:

    X̂^K = arg max_{X^K} Pr(X^K | O^K)
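To make the generative assumptions above concrete, the sketch below draws a fault-state trajectory from the two-state Markov chains and then generates classifier outcomes through the ECC matrix and the (Pdj, Pfj) pairs. The "classifier fires if any covered fault is present" rule, and all variable names, are assumptions made for this illustration; it is not code from the thesis.

    import numpy as np

    def simulate_dmfd(ecc, pa, pv, pd, pf, K, seed=None):
        """ecc: (m, n) D-matrix; pa, pv: length-m appearance/disappearance probs;
        pd, pf: length-n detection/false-alarm probs; K: number of epochs.
        Returns X (K, m) hidden fault states and O (K, n) classifier outcomes."""
        rng = np.random.default_rng(seed)
        ecc = np.asarray(ecc)
        m, n = ecc.shape
        X = np.zeros((K, m), dtype=int)
        O = np.zeros((K, n), dtype=int)
        x = np.zeros(m, dtype=int)                        # assume x(0) = all normal
        for k in range(K):
            flip = rng.random(m)
            x = np.where(x == 0,
                         (flip < pa).astype(int),         # 0 -> 1 with probability Pa_i
                         (flip >= pv).astype(int))        # 1 -> 0 with probability Pv_i
            X[k] = x
            c = (ecc.T @ x > 0).astype(int)               # assumed: C_j fires if any covered fault is present
            p_fail = np.where(c == 1, pd, pf)             # Pr(O_j = 1 | C_j)
            O[k] = (rng.random(n) < p_fail).astype(int)
        return X, O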

The NP-hard nature of the primal dynamic fusion problem motivates us to decompose it into a primal-dual problem using a Lagrangian relaxation approach. By defining new variables and constraints, the dynamic fusion problem reduces to a combinatorial optimization problem with a set of equality constraints. The constraints are relaxed via Lagrange multipliers. The relaxation procedure generates an upper bound for the objective function. The procedure of minimizing the upper bound via subgradient or surrogate subgradient optimization produces a sequence of dual feasible solutions and the concomitant primal feasible solutions to the dynamic fusion problem. Details of the DMFD algorithm, the subgradient method and the dynamic programming recursion are provided in our previous papers (Singh et al., May 2007). During on-line monitoring of the system, the observation and potential fault sequences are usually very long; hence, in order to reduce the amount of computation and storage, the DMFD problem is solved using a sliding window method. The sliding window DMFD method solves the diagnostic problem over a set of recent observations. The window size is selected based on performance criteria such as a low classification error and a low false isolation rate.
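A schematic of the sliding-window solution is sketched below. Here solve_window stands in for the Lagrangian-relaxation/Viterbi solver, and warm-starting each window with the previous window's Lagrange multipliers reflects the reuse described later in Section 5.4; both the function name and this structure are assumptions for illustration.

    def sliding_window_dmfd(outcomes, window, solve_window):
        """outcomes: list of per-epoch classifier outcome sets, length K.
        window: number of epochs per window.
        solve_window(chunk, lagrange_init) -> (fault_states, lagrange_final).
        Returns the concatenated fault-state decisions over all epochs."""
        decisions, multipliers = [], None
        for start in range(0, len(outcomes), window):
            chunk = outcomes[start:start + window]
            states, multipliers = solve_window(chunk, multipliers)  # warm start
            decisions.extend(states)
        return decisions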


Figure 5-5: Parameter optimization in dynamic fusion

5.4 Simulations and Results

A realistic model of a Toyota Camry 4-cylinder engine is simulated under various fault conditions in a custom-built ComputeR Aided Multi-Analysis System (CRAMAS®) simulator and controlled via a prototype ECU (R-type) (Choi et al., September 2006). We simulated the engine model under eight fault conditions, inserted manually via the CRAMAS® control panel. The eight faults inserted were: air flow sensor fault, leakage in the air intake system, blockage of the air filter, throttle angle sensor fault, reduced fuel injection, added engine friction, air/fuel sensor fault and engine speed sensor fault. We collected measurements from five sensors: air flow meter, air/fuel ratio, vehicle speed, turbine speed and engine speed. For each fault component, we performed simulations


for 40 different severity levels (0.5% to 20%); each run is sampled at a 5 ms sampling interval with 2,000 time points (10 s of data). The fault severity level refers to the deviation of a sensor value from its nominal value; e.g., a 10% severity level of the air flow sensor fault refers to a 10% change in the air flow from its nominal value. Each fault class contains 40 patterns over a period of 2,000 time epochs. This data is divided into training and testing data using 10 randomized datasets of 2-fold cross-validation. The training and testing data are reshaped into two dimensions (2D) so that a support vector machine (SVM) classifier can be used. In order to suppress the noise in the data, we run the fusion algorithm with a sampling interval of 0.5 seconds. Thus, we use a downsampling rate of 100 and obtain 20 time epochs for the dynamic fusion process. We employed the multi-way partial least squares (MPLS) method to perform data reduction over a window

of 100 samples. The MPLS-based data reduction technique achieves high classification accuracy on high-dimensional datasets and is also computationally efficient [4]. This reduced data set was used as the features to train the SVM classifiers. In the SVMs, we empirically computed the kernel parameter γ and the cost relaxation parameter C as 9×10⁻⁵ and 8×10⁶, respectively. We used 30 and 15 classifiers in static and dynamic fusion, respectively, represented by the columns of the ECC matrix. The ECC matrix was generated using the Hamming code generation method (Hamming, 1950). In the dynamic fusion, Pdj and Pfj were learned using a coarse optimization technique with the classifiers operating as part of the dynamic fusion. Fig. 5-5 shows the mean and standard deviation values of the classification error (%) and the false isolation rate (%) at various Pdj values. Here, the classification error is computed as the average percentage of true fault states that are not isolated by the algorithm over a span of K epochs, and the false isolation rate is computed as the average percentage of fault states that are falsely isolated by the algorithm over a span of K epochs. We found that the classification error is significantly reduced when the parameters were learned as part of the fusion architecture, as compared to the standalone case. This is consistent with the finding in distributed detection that individual sensors (classifiers in our case) operate at a different operating point when part of a team (fusion in our case) than when they operate alone (Pete et al., 1993). The optimal parameters were Pdj ∼ 0.5-0.6 and Pfj ∼ 0.0-0.02.
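One simple way to realize the Hamming-code construction mentioned above is to assign each of the eight fault classes a distinct nonzero 4-bit message and use its Hamming(7,4) codeword as the corresponding row of the ECC matrix; the generator matrix and class-to-message assignment below are illustrative assumptions, since the thesis does not spell out the exact construction.

    import numpy as np

    # Hamming(7,4) generator matrix in systematic form [I | P]
    G = np.array([[1, 0, 0, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])

    def hamming_ecc(num_classes=8):
        """ECC matrix whose rows are Hamming(7,4) codewords (one per fault class);
        the 7 columns define 7 binary classifiers."""
        rows = []
        for c in range(1, num_classes + 1):       # nonzero messages 1..8 so no column is constant
            msg = np.array([(c >> b) & 1 for b in range(3, -1, -1)])   # 4-bit message
            rows.append(msg @ G % 2)                                   # codeword
        return np.array(rows)

    print(hamming_ecc())   # 8 x 7 binary matrix: rows = fault classes, columns = classifiers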

Table 5-3 shows the results of the single classifier, static fusion and dynamic fusion methods. In all the methods, the SVM was used as the base classifier. The static fusion method was performed using the ECC matrix, and the final decision was made using the Hamming distance between the classifier outputs and the rows of the ECC matrix. The dynamic fusion algorithm achieves the lowest classification error and the lowest standard deviation of the classification error on the CRAMAS® data. The static fusion achieves a slightly higher classification error than a single classifier because the individual classifiers are already performing very well and they are not diverse. Fig. 5-6 shows the box plot of the dispersion of the classification error over the 10 datasets. The whiskers extend from each end of the box, and the maximum length of the whisker is a function of the inter-quartile range.

Table 5-3: Results on CRAMAS® data
                                        Single classifier   Static fusion   Dynamic fusion
Classification error ± std dev in %     8.19±2.52           9.0±2.85
Computation time ± std dev in sec       0.02±0.01           0.04±0.005      0.12±0.001
False isolation rate ± std dev in %     4.5±1.6             4.8±0.6

Figure 5-6: Comparison of classification error among the various methods (box plots for single classifier, static fusion and dynamic fusion)

Next, we discuss the results of the sliding window dynamic fusion method. Figs. 5-7 and 5-8 show the effect of the window size on the classification error and the false isolation rate. A window size of 9 is a good candidate for on-line dynamic fusion because it not only achieves a classification error similar to that of a window size of 12, but also achieves a lower false isolation rate than a window size of 12. A window size of 6 yields a high classification error, so it is not a good choice for on-line dynamic fusion. Here, the maximum window size is the number of epochs (i.e., 20) used in the dynamic fusion. The sliding window method improves the computation time of the dynamic fusion algorithm by reusing the Lagrange multipliers from the previous window computation in the DMFD algorithm. The on-line dynamic fusion reduces the computation time (in MATLAB on a machine with a 3.0 GHz clock speed) to 0.016 sec per window, as compared to 0.12 sec per epoch for off-line dynamic fusion. These numbers are attractive in practice, and they can be reduced significantly further by a careful implementation in the C language.

Figure 5-7: Effect of window size on the classification error
Figure 5-8: Effect of window size on the false isolation rate

5.5 Summary

This chapter presented a systematic process for performing temporal fusion of classifier outputs. We presented the dynamic fusion algorithm, which is implemented as a special case of the dynamic multiple fault diagnosis method. We validated the algorithm using an automotive system dataset.

Chapter 6

Conclusion and Future Work

In this thesis, we applied hidden Markov model-based algorithms to address the problems of anomaly detection and dynamic multiple fault diagnosis. In Chapter 2, we introduced feature-aided tracking combined with HMMs for analyzing asymmetric threats. A transaction-based probabilistic model was proposed to combine hidden Markov models and feature-aided tracking. A procedure analogous to Page's test is used for the quickest detection of abnormal events. The simulation results show that our method is able to detect the modeled pattern of an asymmetric threat with higher performance than a maximum likelihood-based data mining technique. Performance analysis shows that the detection of HMMs improves as the complexity of the HMMs (i.e., the number of states in an HMM) increases. In Chapter 2, we assumed that the HMMs are independent. The next challenge is to implement our techniques when multiple HMMs share a data source.


In this case, the inference problem becomes essentially a multiple target tracking problem, meaning there is competition among the HMMs for observations. In the target tracking arena, this is referred to as the problem of data association, or of measurement-origin uncertainty (Blackman, 1986). Here, we assumed that the model parameters are derived from interviews with subject matter experts (SMEs). We are currently extending the proposed framework to other applications where data are available (e.g., fault diagnosis) and the parameters can be learned from the data. Another extension could be to use a factorial hidden Markov model (FHMM) framework (Ghahramani and Jordan, 1996) to track specific entities involved in the threat activities. The FHMM framework provides the capability to factorize the hidden state into multiple layers, and it therefore represents the hidden state in a distributed form. In this framework, inference and learning involve computing the posterior probabilities of multiple hidden layers (or states) given the observations. Exact inference in the FHMM framework is intractable; however, approximate inference can be computed using Gibbs sampling and structured approximation techniques (Ghahramani and Jordan, 1996). In the context of asymmetric threats, the suspicious activities and the various entities present in an activity could be represented as different layers of an FHMM. The lowest layer could denote a suspicious activity, and the upper layers could represent the presence of entities in that specific activity. The FHMM framework would allow features such as people's identities to become part of the model in real time instead of needing to be pre-specified.

In Chapter 3, we described a semi-automated model-based tool (the ASAM system) to detect and track terrorist activity and to perform what-if analyses that enable an analyst to gain deeper insights into a potential terrorist activity. These methods provide a means to develop models based on real-world events; hence, they are efficient and effective for counter-terrorism analysis. We modeled terrorist threats by combining two probabilistic methods: hidden Markov models (HMMs) and Bayesian networks (BNs). The HMMs detect the monitored terrorist activity and measure local threat levels, while the BNs combine the likelihoods from many different HMMs to evaluate the cumulative probability of terrorist activity. In other words, the BN represents the overarching terrorist plot, and the HMMs, which are related to each BN node, represent detailed terrorist subplots. In that chapter, we introduced these probabilistic methods and used them to analyze the threat level of potential terrorist attacks on the 2004 Olympics. We constructed a global threat model to assimilate the threats from diverse scenarios. An HMM was developed to depict the threat from truck bombing. The models and scenarios were developed based on information gleaned from open sources. The model-based methods suggested here could be utilized as templates by agencies involved in counter-terrorism. Further research includes the incorporation of the actions of counter-terrorist networks via influence diagrams to obtain a better picture of the real possibility of an attack. By assigning costs to the counter-terrorist actions and to the terrorist threats, optimization techniques can be used to allocate counter-terrorism resources. Another extension to improve

our software is to provide the ability to track multiple terrorist activities simultaneously using multi-target tracking algorithms. The final extension is related to the issue of software usability by intelligence analysts. In Chapter 4, we discussed four formulations of the DMFD problem. Analogous forms of these formulations have been studied widely in the fault diagnosis community in a static context, and applied in various fields. Here, we provided a unified formulation of all the MFD formulations in a dynamic context. The first formulation refers to a generalized version of the DMFD problem in which the detection and false alarm probabilities are associated with each test and fault. In the second formulation, the false alarm probability is associated with the fault-free case only. The solution to the second formulation was shown to be quite similar to that of problem formulation 1, except for the need to update an additional Lagrange multiplier. The third formulation considered the case where the uncertainties are associated with the test outcomes only; this models dynamic fusion of classifier outputs. In the fourth formulation, we considered the deterministic case, which led to a novel dynamic set covering problem. We implemented the algorithm on several real-world data sets, and the results validated the theory. The key advantage of our approach is that the method provides an approximate duality gap, which is a worst-case indicator of the difference between the feasible solution and the optimal solution. Our results demonstrate that our algorithm achieves a higher isolation rate than the deterministic simulated annealing method proposed earlier for this problem. In that chapter,

we assumed that the DMFD model parameters are known and that the faults evolve independently and are coupled through the test outcomes via the diagnostic matrix (D-matrix). Future work should implement techniques to learn the DMFD model parameters from the observed test outcome sequences and relax the independence assumption so as to solve the DMFD problem when the faults are dependent. Coupled hidden Markov models offer a promising platform for solving the dependent-fault problem (Brand, 1996). Future work should also focus on improving the primal solution using a soft Viterbi algorithm. In Chapter 5, we presented a dynamic fusion algorithm, which is implemented as a special case of the dynamic multiple fault diagnosis method. The capabilities of the dynamic fusion algorithm were demonstrated by way of its application to an automotive system dataset. The dynamic fusion algorithm achieves the lowest classification error and the lowest standard deviation in the classification error estimate as compared to a single classifier and to static fusion of classifiers, which verifies that fusing classifier outputs over time improves the diagnostic accuracy. An on-line version of the dynamic fusion algorithm was implemented using a sliding window method to illustrate a significant reduction in computation time without much sacrifice in accuracy. In our future work, we plan to test the dynamic fusion algorithm on a gas turbine engine dataset.

Bibliography

Agarwal, D., Feng, J., Torres, V., January 2006. Monitoring massive streams simultaneously: a holistic approach. Interface.
Allanach, J., Singh, S., Tu, H., Pattipati, K., Willett, P., March 2004. Detecting, tracking and counteracting terrorist networks via hidden Markov models. In: IEEE Aerospace Conference. Big Sky, Montana.
ASAM, 2003. Adaptive safety analysis and monitoring system. URL http://servery.engr.uconn.edu/asam/about.aspx
Barnaby, F., 2004. How to build a nuclear bomb and other weapons of mass destruction. Nation Books.
Baum, L., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. of Math. Stat. 41, 164–171.
Bay, S., Saito, K., Ueda, N., Langley, P., May 2004. A framework for discovering anomalous regimes in multivariate time-series data with local models. In: Symposium on Machine Learning for Anomaly Detection. Palo Alto, CA.
Beasley, J. E., 1987. An algorithm for set covering problems. European Journal of Operational Research 31, 85–93.
Bertsekas, D., 2003. Nonlinear programming. Athena Scientific.
Blackman, S. S., 1986. Multiple-target tracking with radar applications. Artech House.
Boser, B. E., Guyon, I. M., Vapnik, V. N., 1992. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory.
Brand, M., November 1996. Coupled hidden Markov models for modeling interacting processes. Neural Computation.
Chen, B., Willett, P., December 2000. Detection of hidden Markov model transient signals. IEEE Transactions on Aerospace and Electronic Systems 36 (4), 1253–1268.
Choi, K., Luo, J., Pattipati, K. R., Namburu, S. M., Qiao, L., Chigusa, S., September 2006. Data reduction techniques for intelligent fault diagnosis in automotive systems. In: Proc. of IEEE Autotestcon. Anaheim, CA.
Congress, U. S., December 1993. Technologies underlying weapons of mass destruction. Office of Technology Assessment, OTA-BP-ISC-115.
Dietterich, T., 2000. Ensemble methods in machine learning. In: Multiple Classifier Systems. Cagliari, Italy.
Dietterich, T. G., Bakiri, G., January 1995. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, 263–286.
Erdinc, O., Raghavendra, C., Willett, P., April 2003. Real-time diagnosis with sensors of uncertain quality. In: SPIE Conference Proceedings. Orlando.
Fawcett, T., May 2004. Activity monitoring: anomaly detection as on-line classification. In: Symposium on Machine Learning for Anomaly Detection. Palo Alto, CA.
Forney, D., 1973. The Viterbi algorithm. Proceedings of the IEEE 61, 268–278.
Ghahramani, Z., Jordan, M. I., 1996. Factorial hidden Markov model. In: Advances in Neural Information Processing Systems.
Giacinto, G., Roli, F., 2000. Dynamic classifier selection. Springer.
Godfrey, G., 2003. TerrorAlert system, Metron.
Hamming, R. W., 1950. Error detecting and error correcting codes. Journal of Bell Sys. Tech. 29.
Kim, E., Ko, J., 2005. Dynamic classifier integration method. Multiple Classifier Systems, 97–107.
Kodali, A., Singh, S., Choi, K., Pattipati, K., Namburu, S. M., Chigusa, S., Prokhorov, D. V., Qiao, L., March 2008. Diagnostic inference with nearly perfect tests. In: IEEE Aerospace Conference (to appear). Montana.
Kuncheva, L. I., Bezdek, J. C., Duin, R. P. W., 2001. Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition 34, 213–244.
Kuncheva, L. I., Jain, L. C., September 2000. Designing classifier fusion systems by genetic algorithms. IEEE Trans. on Evolutionary Computation 4 (4).
Le, T., Hadjicostis, C. N., December 2006. Graphical inference methods for fault diagnosis based on information from unreliable sensors. In: Proceedings of Intl. Conf. on Control, Automation, Robotics and Vision. Singapore.
Luo, J., Pattipati, K. R., Qiao, L., Chigusa, S., 2007. An integrated diagnostic development process for automotive engine control systems. To appear in IEEE Systems, Man, and Cybernetics: Part C.
Moon, T., November 1996. The expectation-maximization algorithm. IEEE Signal Processing Magazine 13, 47–60.
Odintsova, N., Rish, I., Ma, S., 2005. Multifault diagnosis in dynamic systems. In: Proceedings of IM.
Page, E., 1954. Continuous inspection schemes. Biometrika 41, 100–115.
Paternoster, R., December 1992. Nuclear weapon proliferation indicators and observables. Tech. rep.
Pattipati, K. R., Alexandridis, M. G., January 1990. A heuristic search and information theory approach to sequential fault diagnosis. IEEE Transactions on Systems, Man and Cybernetics 20 (4), 872–887.
Pearl, J., 1988. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc.
Pete, A., Pattipati, K., Kleinman, D., 1993. Optimal team and individual decision rules in uncertain dichotomous situations. Public Choice 75, 205–230.
QSI, 1994. TEAMS®. URL http://www.teamqsi.com
Rabiner, L. R., February 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), 257–286.
Rabiner, L. R., Juang, B. H., January 1986. An introduction to hidden Markov models. IEEE ASSP Magazine 3 (1), 4–16.
Raghavan, V., Shakeri, M., Pattipati, K. R., January 1999. Optimal and near-optimal test sequencing algorithms with realistic test models. IEEE Transactions on Systems, Man and Cybernetics: Part A - Systems and Humans 29 (1), 11–27.
Reid, D. B., December 1979. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control AC-24, 843–854.
Reiter, S., Rigoll, G., 2004. Segmentation and classification of meeting events using multiple classifier fusion and dynamic programming. In: Proceedings of the International Conference on Pattern Recognition.
Rosen, J. A., 2003. Influence net modeling.
Ruan, S., Zhou, Y., Yu, F., Pattipati, K. R., Willett, P., Patterson-Hine, A., 2006a. Dynamic multiple fault diagnosis and imperfect tests. IEEE Trans. on Systems, Man and Cybernetics: Part A (under review).
Ruan, S., Zhou, Y., Yu, F., Pattipati, K. R., Willett, P., Patterson-Hine, A., 2006b. Dynamic multiple fault diagnosis and imperfect tests. IEEE Trans. on Systems, Man and Cybernetics: Part A (under review).
Ruta, D., Gabrys, B., 2000. An overview of classifier fusion methods. Computing and Information Systems, 1–10.
Salvador, S., Chan, P., Brodie, J., 2004. Learning states and rules for time series anomaly detection. In: Proc. 17th Intl. FLAIRS Conf. Miami Beach, Florida.
Schrodt, P., 2000. Political complexity: Nonlinear models of politics. University of Michigan Press.
Schrodt, P. A., Gerner, D. J., December 2000. Using cluster analysis to derive early warning indicators for political change in the Middle East, 1979-1996. American Political Science Review, 803–818.
Settle, F., 2005. Nuclear chemistry, nuclear proliferation. URL http://www.chemcases.com/2003version/nuclear/nc-12.htm
Shakeri, M., Pattipati, K. R., Raghavan, V., Patterson-Hine, A., August 1998. Optimal and near-optimal algorithms for multiple fault diagnosis with unreliable tests. IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews 28 (3).
Siegmund, D., 1985. Sequential analysis: tests and confidence intervals. Springer-Verlag.
Singh, S., Allanach, J., Tu, H., Pattipati, K., Willett, P., 2004. Stochastic modeling of a terrorist event via the ASAM system. In: IEEE International Conference on Systems, Man and Cybernetics.
Singh, S., Choi, K., Kodali, A., Pattipati, K., Namburu, S. M., Chigusa, S., Prokhorov, D. V., Qiao, L., October 2007. Dynamic fusion of classifiers for fault diagnosis. In: IEEE SMC Conference. Canada.
Singh, S., Choi, K., Kodali, A., Pattipati, K., Sheppard, J., Namburu, S. M., Chigusa, S., Prokhorov, D. V., Qiao, L., March 2007. An optimization-based method for dynamic multiple fault diagnosis problem. In: IEEE Aerospace Conference. Big Sky, Montana.
Singh, S., Choi, K., Kodali, A., Pattipati, K., Sheppard, J., Namburu, S. M., Chigusa, S., Prokhorov, D. V., Qiao, L., May 2007. Dynamic multiple fault diagnosis problem formulations and solution techniques. In: DX-07 Workshop. Nashville, TN.
Singh, S., Donat, W., Tu, H., Lu, J., Pattipati, K., Willett, P., October 2006. An advanced system for modeling asymmetric threats. In: IEEE International Conference on Systems, Man and Cybernetics.
Smyth, P., December 1994. Markov monitoring with unknown states. Journal on Selected Areas in Communications 12 (9).
Spector, L., Smith, J., 1990. Nuclear ambitions: The spread of nuclear weapons 1989-1990. Westview Press.
Tu, F., Pattipati, K. R., Deb, S., Malepati, V. N., January 2003. Computationally efficient algorithms for multiple fault diagnosis in large graph-based systems. IEEE Transactions on Systems, Man and Cybernetics 33 (1), 73–85.
Tu, H., Allanach, J., Singh, S., Willett, P., Pattipati, K., January 2006. Information integration via hierarchical and hybrid Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, special issue on Advances in Heterogeneous and Complex System Integration.
Vadovinos, R. M., Sanchez, J. S., Barandela, R., 2005. Dynamic and static weighting in classifier fusion. In: Iberian Conference on Pattern Recognition and Image Analysis.
Vapnik, V. N., 1999. The nature of statistical learning theory. Springer.
Wald, A., 1947. Sequential analysis. Wiley.
Ying, J., Kirubarajan, T., Pattipati, K. R., Patterson-Hine, A., November 2000. A hidden Markov model-based algorithm for fault diagnosis with partial and imperfect tests. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews 30 (4), 463–473.
Yu, F., Tu, F., Tu, H., Pattipati, K. R., September 2007. Multiple disease (fault) diagnosis with applications to the QMR-DT problem. IEEE Trans. on SMC: Part A, 746–757.
Zhu, X., Wu, X., Yang, Y., 2004. Dynamic classifier selection for effective mining from noisy data streams. In: 4th IEEE Conference on Data Mining.