PROCESS MONITORING, DIAGNOSTICS AND PROGNOSTICS USING SUPPORT VECTOR MACHINES AND HIDDEN MARKOV MODELS by FATIH CAMCI DISSERTATION

Submitted to the Graduate School of Wayne State University, Detroit, Michigan in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY 2005

MAJOR: INDUSTRIAL ENGINEERING

Approved by: ______________________________ Advisor

Date

______________________________ ______________________________ ______________________________ ______________________________

DEDICATION

To my parents, for letting me study so far from home,
To my advisor, Dr. Ratna Babu Chinnam, for his support,
To my wife, for her patience, and
To my daughter,


TABLE OF CONTENTS

INTRODUCTION
  1.1 MAINTENANCE TECHNIQUES
  1.2 CONDITION-BASED MAINTENANCE
    1.2.1 Benefits of Condition-Based Maintenance
    1.2.2 Steps in Condition-Based Maintenance
  1.3 PROBLEM STATEMENT
  1.4 ORGANIZATION OF THE DISSERTATION
LITERATURE REVIEW
  2.1 DIAGNOSTICS
    2.1.1 Process Monitoring and Novelty Detection
    2.1.2 Incipient Failure Diagnostics
  2.2 PROGNOSTICS
GENERAL SUPPORT VECTOR REPRESENTATION MACHINE (GSVRM) FOR STATIONARY AND NON-STATIONARY CLASSES
  3.1 INTRODUCTION
  3.2 PREVIOUS WORK
  3.3 GENERAL SUPPORT VECTOR REPRESENTATION MACHINE
  3.4 WEIGHTED-GSVRM FOR NON-STATIONARY CLASSES
  3.5 ONLINE TRAINING
  3.6 EXPERIMENTS
  3.7 CONCLUSION
PROCESS MONITORING USING GENERAL SUPPORT VECTOR REPRESENTATION MACHINE
  4.1 INTRODUCTION
  4.2 SUPPORT VECTOR MACHINE
  4.3 GENERAL SUPPORT VECTOR REPRESENTATION MACHINE
  4.4 ON-LINE TRAINING
  4.5 EXPERIMENTAL RESULTS
    GSVRM for uncorrelated processes
    GSVRM for correlated manufacturing processes
  4.6 CONCLUSION
HEALTH-STATE ESTIMATION AND DIAGNOSTICS USING HIDDEN MARKOV MODEL COMMITTEES
  5.1 INTRODUCTION
  5.2 BACKGROUND: HIDDEN MARKOV MODELS
  5.3 BAYESIAN NETWORK AND DYNAMIC BAYESIAN NETWORK
    5.3.1 Inference in Bayesian Network
    5.3.2 Learning in Bayesian Network
    5.3.3 Dynamic Bayesian Network
    5.3.4 Dynamic Bayesian Network as Hidden Markov Model
    5.3.5 Auto-regressive Hidden Markov Models
    5.3.6 Hierarchical Hidden Markov Models
  5.4 IMPLEMENTATION AND RESULTS
    5.4.1 Competitive Learning for Regular and Auto-regressive HMM
    5.4.2 Hierarchical Hidden Markov Model
  5.5 CONCLUSION AND FUTURE RESEARCH
MACHINE PROGNOSTICS USING HIDDEN MARKOV MODELS
  6.1 INTRODUCTION
  6.2 BACKGROUND: HIDDEN MARKOV MODELS
  6.3 DYNAMIC BAYESIAN NETWORK
    6.3.1 Dynamic Bayesian Network as Hidden Markov Model
  6.4 HIERARCHICAL HIDDEN MARKOV MODELS
  6.5 IMPLEMENTATION AND RESULTS
    6.5.1 RUL Calculation
  6.6 FUTURE RESEARCH
  6.7 CONCLUSION
CONCLUSION & FUTURE RESEARCH
REFERENCES
ABSTRACT
AUTOBIOGRAPHICAL STATEMENT


LIST OF TABLES

Table 1: Properties of experimental datasets
Table 2: Parameters for building WGSVRM
Table 3: Scalability of the WGSVRM method to higher dimensions, measured in terms of number of support vectors
Table 4: Parameters of distributions
Table 5: Format of Type I and Type II errors
Table 6: Classification accuracy of GSVRM
Table 7: Parameters of testing datasets
Table 8: Classification accuracy of SVM, MLP, and Shewhart chart for the Smith dataset
Table 9: Type I and Type II errors for non-correlated data using GSVRM
Table 10: Type I and Type II errors for non-correlated data with a limited number of in-control and out-of-control samples
Table 11: Type I and Type II errors for the Papermaking dataset using RBF, MLP, and GSVRM
Table 12: Type I and Type II errors for the Viscosity dataset using RBF, SVM, and GSVRM
Table 13: Type I and Type II errors for the Papermaking dataset using GSVRM
Table 14: Type I and Type II errors for the Viscosity dataset using GSVRM
Table 15: Type I and Type II errors with limited in-control and out-of-control data
Table 16: Parameters used in GSVRM implementation on datasets
Table 17: Conditional probabilities of nodes for the given Bayesian network
Table 18: Learning methods for different problems [MM99]
Table 19: Health state estimation results for "winner-takes-all" and "topological learning" approaches
Table 20: Health state estimation for all drill bits using competitive learning
Table 21: Health state estimation using 5 HMMs
Table 22: Health state estimation using competitive learning
Table 23: Ar-HMM with error minimization
Table 24: Computational times of regular and auto-regressive HMMs
Table 25: Comparison of regular HMM and Ar-HMM
Table 26: Health state estimation using HHMM with 4 top-level states
Table 27: Health state estimation of all drill bits using a HHMM


LIST OF FIGURES

Figure 1: Illustration of the concept of optimal maintenance with failure and maintenance cost
Figure 2: Illustration of the concept of optimal maintenance with total cost and equipment availability
Figure 3: Maintenance techniques and their usage with equipment criticality
Figure 4: Generic and point solutions for CBM
Figure 5: Process steps in Condition-Based Monitoring
Figure 6: Separation of classes by hyperplane
Figure 7: A Markov chain with 6 states and state transition probabilities
Figure 8: Determination of class boundary: data points on the boundary will be rejected by local inner GSVRMs
Figure 9: Influence of ς on GSVRM representation
Figure 10: Influence of GSVRM hyper-sphere radius on boundary representation (ς = 0)
Figure 11: Non-stationary behavior of a mechanical pump
Figure 12: Contour plots of the two most dominant principal components of vibration sensor data collected from a pump: a) temporal domain, b) spectral domain, and c) energy domain
Figure 13: GSVRM algorithm
Figure 14: Results from GSVRM for dataset #1
Figure 15: Results from WGSVRM for dataset #2
Figure 16: Results from WGSVRM for dataset #3
Figure 17: Results from WGSVRM for dataset #4
Figure 18: GSVRM computational time (in seconds) versus dataset size for two- and three-dimensional data
Figure 19: On-line WGSVRM using second dataset
Figure 20: Online WSND using C-shape dataset
Figure 21: Separation of classes by hyperplane
Figure 22: Determination of class boundary: data points on the boundary will be rejected by local inner GSVRMs
Figure 23: Influence of ς on GSVRM representation
Figure 24: Influence of GSVRM hyper-sphere radius on boundary representation
Figure 25: Example time series
Figure 26: Viscosity data
Figure 27: Papermaking data
Figure 28: A Markov chain with 6 states and state transition probabilities
Figure 29: Bayesian network: directed acyclic graphical model
Figure 30: Example of Dynamic Bayesian Network: a) prior network, b) transition network, c) dynamic Bayesian network (combination of prior and transition networks)
Figure 31: Representation of Dynamic Bayesian Network
Figure 32: DBN representation of Auto-regressive Hidden Markov Model
Figure 33: Hierarchical representation of states. A: top states, B: sub-states
Figure 34: Hierarchical Hidden Markov Model representation
Figure 35: Thrust and torque data for drill bit #5
Figure 36: Illustration of competitive learning
Figure 37: State mean and covariance plots of HMM1 and HMM3 with normalized thrust and torque scatter plots
Figure 38: Illustration of log-likelihood values of regular HMMs
Figure 39: Illustration of mean and covariance of states of HMMs for drill bit #1
Figure 40: Illustration of log-likelihood values of HMMs for drill bits #1 and #3
Figure 41: Illustration of log-likelihood values of Ar-HMMs for drill bits #3 and #1
Figure 42: Likelihood values for 3 and 5 top-level state HHMMs for drill bit #1
Figure 43: Illustration of sub-state means and covariances of health states in HHMM
Figure 44: Equipment availability and total cost for physics-based and empirical methods
Figure 45: A Markov chain with 6 states and state transition probabilities
Figure 46: Example of Dynamic Bayesian Network
Figure 47: Representation of Dynamic Bayesian Network
Figure 48: Hierarchical representation of states. A: top states, B: sub-states
Figure 49: Hierarchical Hidden Markov Model representation
Figure 50: Thrust and torque data for drill bit #5
Figure 51: Illustration of equipment health states
Figure 52: Illustration of a RUL probability distribution
Figure 53: Illustration of RUL for 12 drill bits with a) 95%, b) 75% confidence intervals
Figure 54: Illustration of estimation accuracy for 12 drill bits
Figure 55: RUL accuracy given confidence interval for 12 drill bits
Figure 56: RUL precision given confidence interval for 12 drill bits
Figure 57: Illustration of transition probabilities
Figure 58: Illustration of Hidden Semi-Markov Model
Figure 59: Illustration of Hierarchical Hidden Semi-Markov Model


CHAPTER I

INTRODUCTION

This chapter provides a brief introduction to maintenance techniques, describes Condition-Based Maintenance (CBM) and its associated benefits, and outlines the research problem statement. The chapter is organized as follows: maintenance techniques in section 1.1, CBM in section 1.2, the research problem statement in section 1.3, and the organization of the overall dissertation in section 1.4.

1.1 MAINTENANCE TECHNIQUES

The importance of the availability of a machine or system is obvious to industry. The capacity of a plant is directly related to the availability of its components: a system that is idle because of a down machine, or that spends most of its time in maintenance, loses a tremendous amount of capacity. For high-risk machines such as helicopters, aircraft, or marine vessels, safety and reliability are also crucial. Maintenance plays an important role in a system's safety, reliability, and availability, especially for complex systems [SK97].

There are typically three maintenance models for a system: improvement maintenance (IM), preventive maintenance (PM), and corrective maintenance (CM) [Pat02]. The goal of IM is to eliminate the need for maintenance, which should be achieved in the production phase of a machine. IM faces many restrictions, such as budget and market requirements [Yan02]. Producing a machine with zero possibility of failure is almost impossible, and even approaching it would be too expensive. Thus, IM is not a practical solution for many systems.

Corrective maintenance is maintenance performed after a failure occurs. Failure of a critical machine may cause tremendous loss in the process or in the line. In addition, failure of one component may destroy other components and cause a catastrophic failure. Failure is not acceptable for high-risk machines such as helicopters and aircraft. Thus, the effectiveness of a corrective maintenance program depends on the criticality of the machine.

Preventive maintenance is the effort of preventing a failure before it occurs, and is an effective way of establishing a reliable system [Rao92]. Time-based preventive maintenance, the traditional approach, maintains the system at time intervals defined using historical failure information. This approach decreases the downtime caused by failures, although it increases the downtime caused by maintenance. In time-based PM, a machine can be stopped for maintenance even though it does not need it, and a properly working part might be replaced with a brand new one; time, money, and labor are thus spent maintaining a machine that could have run longer without maintenance. Deciding the maintenance period is also difficult. Taking the mean time to failure of similar machines in the past as the maintenance period means expecting half of the failures to occur before maintenance. The higher the frequency of maintenance, the more time, money, and labor are spent on maintenance; the lower the frequency, the more failures occur before scheduled maintenance. The minimum time to failure can be chosen as the maintenance period [LT97], but this again increases the unnecessary maintenance cost of healthy machines. The maintenance period should be chosen so that the cost of maintenance and failure is minimized and equipment availability is maximized; Figures 1 and 2 illustrate this concept.

The effectiveness of time-based PM depends on the criticality and maintenance cost of the machine. Time-based PM may be effective for non-critical machines with low maintenance expenses; however, neither time-based PM nor the other methods mentioned above may be effective for critical machines with high maintenance expenses. Another approach to preventive maintenance is Condition-based Maintenance (CBM), also called predictive maintenance [Yan02]. Condition-Based Maintenance is the philosophy of monitoring the condition of machinery and performing maintenance only when there is objective evidence of impending failure [BG01, OSL02]. Figure 2 illustrates equipment availability and operating cost for the maintenance techniques mentioned above. CM, time-based PM, and CBM each have a role to play in industry, depending on the criticality of the machinery. Some 70-80% of the machinery in industry may not require any maintenance activity until a failure occurs and can be maintained through corrective maintenance. Some 15-25% of machinery is critical enough to require time-based PM. The remaining 1-5% of machinery is deemed very critical, and the risk of failure combined with high maintenance cost is best managed through CBM. Figure 3 illustrates the optimal maintenance strategy for industrial machinery as a function of equipment criticality.

Figure 1: Illustration of the concept of optimal maintenance with failure and maintenance cost
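The trade-off sketched in Figures 1 and 2 can be illustrated numerically. The sketch below uses the standard age-replacement cost model from reliability theory, not a method developed in this dissertation, and every number in it (the Weibull failure-distribution parameters and the preventive and failure costs) is hypothetical:

```python
import math

# Hypothetical age-replacement model: a part fails according to a Weibull
# distribution; preventive replacement at age T costs C_P, while a failure
# before T costs C_F (C_F >> C_P).  The long-run cost per unit time for
# replacement interval T is
#   g(T) = (C_P * R(T) + C_F * (1 - R(T))) / E[min(X, T)]
# where R(T) is the survival function and E[min(X, T)] the mean cycle length.

BETA, ETA = 2.5, 1000.0   # hypothetical Weibull shape / scale (hours)
C_P, C_F = 1.0, 10.0      # hypothetical preventive vs. failure cost

def survival(t):
    return math.exp(-(t / ETA) ** BETA)

def mean_cycle_length(T, steps=2000):
    # E[min(X, T)] equals the integral of R(t) from 0 to T (trapezoid rule).
    dt = T / steps
    return sum(0.5 * (survival(i * dt) + survival((i + 1) * dt)) * dt
               for i in range(steps))

def cost_rate(T):
    rT = survival(T)
    return (C_P * rT + C_F * (1.0 - rT)) / mean_cycle_length(T)

# Scan candidate intervals: too-frequent maintenance wastes C_P on healthy
# parts, too-rare maintenance incurs C_F; the optimum lies in between.
best_T = min(range(50, 2001, 50), key=cost_rate)
print(best_T, round(cost_rate(best_T), 5))
```

With these illustrative numbers the cost rate falls, bottoms out at an intermediate interval, and rises again, which is exactly the U-shaped total-cost curve of Figure 1.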

This research focuses on condition-based maintenance, which will be discussed in the next section.

1.2 CONDITION-BASED MAINTENANCE

Optimizing the maintenance time can be achieved by tracking the health of the machine. Sensors are deployed on the machine to collect signals that contain information about its health, and these signals are processed to extract that information. Condition-based maintenance can thus be defined as "the philosophy of monitoring the health of a machine by analyzing various signals collected from different sensors in order to minimize maintenance and failure costs."

Figure 2: Illustration of the concept of optimal maintenance with total cost and equipment availability

Figure 3: Maintenance techniques and their usage with equipment criticality

Both customized solutions (also called point solutions) and generic technology might be offered for CBM. Point solutions are developed for specific machinery and often require an in-depth characterization of equipment behavior in terms of fundamental principles. Generic solutions, on the other hand, are applicable to many systems with slight modifications and do not require deep knowledge of the system. Even though customized solutions may give better equipment availability, they are not desirable in some cases due to their cost. Figure 4 illustrates equipment availability and operating cost for point and generic solutions.

Figure 4: Generic and Point Solutions for CBM

1.2.1 Benefits of Condition-Based Maintenance

The desires that lead to a Condition-Based Maintenance system are catastrophic failure reduction, cost minimization, system security and availability maximization, and platform reliability improvement [BG01]. Catastrophic failure reduction and platform reliability improvement are self-evident; the following subsections expand the discussion on cost reduction, system security, and availability.

1.2.1.1 Cost Reduction

Cost minimization can be analyzed in four groups [All90, Har03]:

1. Machine downtime reduction increases the capacity, machine utilization, and throughput of the system. Jay Lee and Jun Ni, the Co-Directors of the National Science Foundation (NSF) Center for Intelligent Maintenance Systems (IMS), estimate that $5 billion per year would be saved in the US alone through improved equipment uptime if CBM were implemented. In the example of an aircraft diversion, off-loading passengers onto a second aircraft, rescuing the original plane, hotel bills, guarantee payments, and compensation to customers cost more than recovering the engine [Fri00]. Avoiding the failure removes these costs.

2. Maintenance cost reduction recovers the amount spent on unnecessary maintenance under a time-based maintenance approach. Resources dedicated to maintenance have grown very fast in the last few decades; for example, the maintenance portion of direct operating costs at nuclear power plants increased by more than 120% [Dou95]. Maintenance cost reduction (through improved resource scheduling) from effective deployment of CBM technology is estimated at $9 billion per year.

3. Inventory reduction can be achieved by learning the failure time before the failure occurs, which is the goal of CBM. Machines and their components can be purchased close to the expected failure time instead of being held in inventory for a long time. It is estimated that $6 billion can be saved annually through inventory reduction enabled by effective utilization of CBM technology.

4. Enhanced logistics and supply chain management can be achieved if the health of the system is known. Necessary supplies can be obtained from other companies if failures and their effects can be identified in advance. It is estimated that $15 billion can be saved annually from effective deployment of CBM technology in logistics and supply chains.

As a result, implementing CBM has a great effect on decreasing cost: the four estimates above ($5B + $9B + $6B + $15B) total the $35 billion in annual savings estimated for the US alone [Har03].

1.2.1.2 Availability and Safety

Although reducing costs and increasing profits are the primary interests of industry, the availability and safety of a system sometimes become more important than cost minimization and profit maximization. For example, the Joint Strike Fighter (JSF) project at the US Department of Defense (DoD) aims to reduce the costs of engineering, manufacturing, and maintenance while providing more available and safer equipment [BBFN98]. Availability and safety are especially important for DoD equipment such as helicopters, submarines, and aircraft. CBM not only reduces cost but can also save lives.

1.2.2 Steps in Condition-Based Maintenance

There are three fundamental steps in CBM: feature extraction, diagnostics, and prognostics. Feature extraction is the process of extracting useful information from raw data. Diagnostics is the process of diagnosing the failure of a machine. Prognostics is the process of estimating the remaining useful life (RUL) of a machine. Figure 5 illustrates the process steps in CBM. Optimizing the time intervals between maintenance actions is the primary goal of Condition-Based Maintenance, and a generic, machine-independent method for prognostics is the dream of researchers in this field. The feature extraction, failure diagnostics, and prognostics steps are discussed in the following sections.

1.2.2.1 Feature Extraction and Selection

Various sensors are mounted onto a machine or one of its components to collect data about the health of the machine or component. These sensors acquire digital signals that encode various quantities as a function of time, such as velocity, acceleration, strain, pressure, and vibration. The vibration signal is the most common signal type collected for CBM.

Figure 5: Process steps in Condition-Based Monitoring

Data collected from the various signals contain information about the health of the machine, but it is hidden in the data and may not be directly understandable. The relevant information about the health of the machine is expressed as a set of measured quantities [GZ03]. Several values that describe the overall operation, called features, are extracted from the signal, which may consist of thousands of values [TK98, TG74]. Feature extraction is the process of extracting those features that carry understandable information about the health of the component; the quality of machine condition monitoring techniques is determined by the effectiveness and efficiency of the features.

Features can be grouped into time domain, frequency domain, and joint frequency-time domain (or mixed domain) features [QC96]. Time domain features are features that change in time, such as amplitude, crest factor, kurtosis, and RMS values. Frequency domain features are those that carry only frequency and amplitude information.
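As a minimal illustration (not part of this dissertation's own tooling), the time-domain features just named (RMS, crest factor, kurtosis) can be computed from one window of samples; the sine-wave "signal" below is purely synthetic:

```python
import math

def time_domain_features(signal):
    """Compute a few common time-domain features (RMS, crest factor,
    kurtosis) for one window of samples."""
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    var = sum((x - mean) ** 2 for x in signal) / n
    # Normalized 4th central moment; roughly 3 for Gaussian noise,
    # and larger when impulsive faults appear in a vibration signal.
    kurt = (sum((x - mean) ** 4 for x in signal) / n) / (var ** 2)
    return {"rms": rms, "crest_factor": peak / rms, "kurtosis": kurt}

# Synthetic 'vibration' window: a pure sine wave, whose crest factor
# is sqrt(2) and whose kurtosis is 1.5.
window = [math.sin(2 * math.pi * k / 64) for k in range(640)]
feats = time_domain_features(window)
print(feats)
```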

9

Joint frequency-time domain features are the ones that have frequency and amplitude information in time. Feature extraction and selection is out of the scope of this research. 1.2.2.2 Failure Detection Isermann defines failure as “a non-permitted deviation from characteristic property” [Ise84]. Machine diagnosis can be defined as the process of identifying, localizing and determining severity of a machine failure [MBND99]. Failure can be categorized as abrupt failures and incipient failures. Abrupt failures happen in a very short time and their development might not be tracked. They can be detected by departure of the features from the normal operating mode. Incipient failures occur in time, their effects can be seen in the features in advance, and their severities increase in time. Early stages of incipient failures can often be detected when the machine is working in normal mode. Abrupt failures can be detected by process monitoring techniques, which include statistical process control (SPC) charts and other novelty detection methods. Statistical process control detects an abrupt failure by comparing data with control limits, which are set by statistical methods. Novelty Detection has broader definition employing not only statistical methods but also pattern recognition techniques. Novelty detection can be defined as the process of finding abnormal behavior by learning the normal behavior of a system especially in case of lack of sufficient failure data. In case of incipient failure, even though machine may behave normally in the early stages of the failure, its severity (i.e. health state) can be detected when the machine is operating in the normal mode. Severity level of a failure and its identification will be referred as health state of the machine and health state estimation in the rest of the dissertation, respectively. Health state estimation is in essence a classification problem. 
Support vector machine and hidden Markov model based methods will be employed for novelty detection and health state estimation, respectively.
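The control-limit idea described above for abrupt-failure detection can be sketched as follows; the three-sigma limits and function names are illustrative assumptions, not the dissertation's method:

```python
import statistics

def control_limits(in_control, k=3.0):
    """Set lower/upper control limits from in-control (healthy) feature data,
    as in a Shewhart-style chart with mean +/- k standard deviations."""
    mu = statistics.mean(in_control)
    sigma = statistics.stdev(in_control)
    return mu - k * sigma, mu + k * sigma

def is_abnormal(x, limits):
    """Flag a new observation that departs from the normal operating mode."""
    lo, hi = limits
    return x < lo or x > hi
```

A large, sudden departure of a monitored feature from these limits is the signature of an abrupt failure; incipient failures instead require tracking severity over time.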


1.2.2.3 Prognostics

Prognostics is built upon diagnostics; it is the process of estimating the Remaining Useful Life (RUL) of a machine by predicting the progression of a diagnosed failure. Prognostics can be achieved in various ways, such as simple trending algorithms based on recursive curve fitting, artificial-intelligence-based prediction, state-space tracking algorithms, and higher-fidelity physics-of-failure algorithms. In this research, dynamic Bayesian network based methods, using the information obtained from diagnostics, will be employed for prognostics. Prognostics is a dynamic process that evolves in time from the moment the machine is first used until it fails. RUL should not be confused with the estimate of life expectancy, which is the "mean time to failure" of an average component [EBGH00]. Estimated life expectancy is the average life of similar machines or a family of machines, while RUL is the time to failure of the specific machine under monitoring. Prognostics is more difficult to formulate than diagnostics because it deals with a stochastic process that has not happened yet: diagnostics models an existing stochastic system, while prognostics forecasts the future behavior of a stochastic system. Consequently, prognostic methods must have three outputs: accuracy, precision, and confidence [EBGH00]. Accuracy is the measure of closeness of the estimated failure time to the actual failure time (high accuracy = close estimate). Precision is the length of the interval in which the estimated RUL falls. Confidence is the probability that the actual RUL falls within the given precision. High accuracy, narrow precision, and high confidence are desired.

1.3 PROBLEM STATEMENT

The objective of this research is the development and implementation of generic methods to monitor processes, diagnose failures, and estimate the RUL of machinery or components.


Support vector machine (SVM) and hidden Markov model (HMM) based methods will be developed and employed for diagnostics and prognostics in this research. The support vector machine is a very powerful classification tool. It is generic to any kind of data, not necessarily following any particular distribution. SVM is often associated with physical meaning, requires a small amount of training samples [GDZX04], and has a sparse representation. HMM is a very powerful method with a strong mathematical theory and has been implemented successfully in different fields, especially automatic speech recognition (ASR). The similarity between speech signals and machine sensory signals such as vibration is often discussed in the literature [BMA00]. Implementation of HMM as a dynamic Bayesian network (DBN) reduces the number of parameters and gives more flexibility in model structure design. The dynamic Bayesian network is the underlying theory of various methods that have been used widely in the literature, such as the hidden Markov model (HMM), the Kalman filter, and principal component analysis (PCA), and can be defined as a probabilistic graphical model that deals with time information [Smy98, RG99]. This research focuses on the DBN as a hidden Markov model. In this research, the process monitoring method is non-parametric, with no assumption about the data distribution, whereas health state estimation assumes that the data follow a Gaussian distribution that depends on the current health state of the machine. Selection of sensors for extracting the necessary health information, data fusion of multiple methods and/or sensors, extraction and selection of useful features, and physical and historical data analysis of the system for decision support are beyond the scope of this research. The research will conclude with a report of a framework for non-parametric process monitoring, health state estimation for diagnostics, and RUL estimation for prognostics.
The framework will be validated by implementation on a real world example and/or benchmark datasets from the literature. The efficiency of the process monitoring and


diagnostics methods will be identified by the classification accuracy of failures. The efficiency of the prognostics method will be identified by RUL estimation effectiveness, which is a function of accuracy, precision, and confidence.

1.4 ORGANIZATION OF THE DISSERTATION

Existing literature on novelty detection, failure diagnostics, and prognostics is reviewed in Chapter 2. Modeling structure and implementation of the models are presented in Chapters 3 through 6: the proposed process monitoring method, the General Support Vector Representation Machine (GSVRM), in Chapter 3; implementation of GSVRM on benchmark process control datasets from the literature in Chapter 4; health state estimation using hidden Markov models in Chapter 5; and prognostics using hierarchical HMM in Chapter 6. Conclusions are given in Chapter 7. Chapters 3 to 6 are presented as modularly as possible, and some material is repeated to achieve this modularity. Process monitoring techniques in the literature share some general difficulties: inability to handle non-stationary processes, assumptions about data distribution and independence, the need for enough examples of abnormalities, and inability to learn from abnormal data where available. The process monitoring technique (GSVRM) presented in Chapter 3 targets these difficulties, and the method is applied to synthetic datasets in that chapter. The main contribution of GSVRM is its ability to overcome many of the difficulties mentioned above. Novelty detection is a broader formulation of the process monitoring problem and is the term used in that chapter. In Chapter 4, GSVRM is applied to benchmark datasets for process monitoring and compared with methods in the literature such as the Shewhart control chart, multi-layer perceptron, support vector machines, radial basis networks, etc. That chapter presents the results and analysis of the method.


Chapter 5 presents various hidden Markov models (i.e., regular, auto-regressive, and hierarchical HMM) for health state estimation. The auto-regressive HMM (Ar-HMM) removes the assumption of data independence made by the regular HMM. The hierarchical HMM (HHMM) can represent complex systems, which leads to better health state representations. Health state estimation results and analysis of the methods are reported in the chapter. Chapter 6 presents a prognostics method based on the HHMM. Remaining useful life (RUL) calculation is achieved by Monte-Carlo simulation employing the transition probabilities between health states obtained from the HHMM. The results reported in that chapter are very promising. The chapter also presents future research ideas on prognostics.
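The Monte-Carlo RUL idea just described — simulate the health-state Markov chain with the estimated transition probabilities until the failure state is reached, then average the simulated lifetimes — can be sketched as below. The three-state transition matrix, the absorbing last state, and all names are illustrative assumptions, not the dissertation's actual model:

```python
import random

def simulate_rul(transition, state, n_runs=2000, failure_state=None, rng=None):
    """Estimate remaining useful life (in time steps) by Monte-Carlo
    simulation of a Markov chain over health states, starting from the
    current (diagnosed) state and running until the absorbing failure state."""
    rng = rng or random.Random(0)
    n_states = len(transition)
    failure_state = n_states - 1 if failure_state is None else failure_state
    lifetimes = []
    for _ in range(n_runs):
        s, t = state, 0
        while s != failure_state:
            # Sample the next health state from row s of the transition matrix
            r, cum = rng.random(), 0.0
            for nxt, p in enumerate(transition[s]):
                cum += p
                if r < cum:
                    s = nxt
                    break
            t += 1
        lifetimes.append(t)
    return sum(lifetimes) / n_runs
```

For the example matrix [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]], the expected time to failure from state 0 is 1/0.1 + 1/0.2 = 15 steps, which the simulation approximates; the spread of the simulated lifetimes is what yields precision and confidence estimates.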


CHAPTER II LITERATURE REVIEW

This chapter discusses previous work on diagnostics, including abrupt and incipient failure detection, and on prognostics in Sections 2.1 and 2.2, respectively.

2.1 DIAGNOSTICS

Failures can be categorized as abrupt (sudden) and incipient (slowly developing) failures. Abrupt failures are step-like deviations and occur in a very short time. Incipient failures develop over time, their effects can be seen in the features in advance, and their severity increases in time. Abrupt failures can be detected by the departure of features from the normal operating mode; an abrupt failure can also be called a novel event. Process monitoring and novelty detection essentially target the same problem. Novelty detection is the process of finding abnormal behavior by learning the normal behavior of a system, especially when sufficient abnormal-behavior data are lacking. It is very difficult, if not impossible, to track an abrupt failure in advance. On the other hand, the early stage of an incipient failure may be detected using classification methods [Elv97], even though the machine behaves normally. Process monitoring (novelty detection) and incipient failure diagnosis (health state estimation) will be discussed in Sections 2.1.1 and 2.1.2, respectively.

2.1.1 Process Monitoring and Novelty Detection

There are several approaches to novelty detection. The basic approach involves estimation of the probability density of an assumed distribution and calculation of the outlier probability [WM03, SM03]. The effectiveness of this method is limited by the degree to which the assumption regarding the distribution is satisfied.


Statistical process control methods (i.e., control charts) set control limits to detect abrupt failures using statistical techniques. Conventional SPC charts such as Shewhart control charts, cumulative sum (CUSUM) control charts, and exponentially weighted moving average (EWMA) control charts were developed for univariate processes and do not work well for multi-variable processes [MSIH04]. Multivariate statistical process control (MSPC) is employed to monitor processes that have correlated variables. Unfortunately, SPC and MSPC methods were developed under the assumption that process variables are normally distributed [ROS91], which does not necessarily hold in many real industrial processes. In addition, SPC techniques (including MSPC) for control charting often assume that data from subsequent samples are independent, which also does not hold in many industrial processes [WM99, Chi02, CC98]. Several attempts have been made in the literature to monitor auto-correlated parameters by extending traditional SPC techniques [AR88, HR91, WMP94, RWP95, BL97] as well as by applying pattern recognition techniques such as radial basis function networks (RBF) [CC98], multi-layer perceptron networks (MLP) [Pug91, Smi94], and support vector machines (SVM) [Chi02]. In the first case, a time series model is fitted to the auto-correlated data to obtain residuals, and SPC techniques are then applied to the residuals. The performance of these time-series modeling techniques is not very good, especially for detecting small shifts [Chi02]. Furthermore, SPC focuses only on data collected from an in-control process for characterization and hence cannot take advantage of historical data available from out-of-control conditions and failures. On the other hand, almost all the machine learning methods proposed in the literature for process control strictly require example data from all out-of-control states of interest. In addition, the results from using RBF and MLP methods for process monitoring are marginal in that they lack adaptability and make no provision for trading off Type-I errors (false alarms) against Type-II errors (inability to detect shifts in process condition). While application of SVM as a classifier for process


monitoring has yielded better success, it is often not practical to apply SVMs directly to many real world process monitoring problems, for two reasons:

1) Availability of out-of-control data: The training procedure for SVM (as for RBF and MLP) requires a large amount of failure data from in-control states as well as out-of-control states, which may not be available. In many cases, obtaining failure data is difficult, expensive, or even impossible.

2) The necessity of modeling and training for each specific failure type: A model that is developed for a specific type of abnormal event (out-of-control state) cannot give good classification accuracy for another type of abnormal event. For example, an SVM that is trained to classify in-control and large mean-shift samples gives poor classification accuracy for detecting small mean-shift samples, although it works well for large mean shifts. Thus, with MLP and SVM it is necessary to create and train a distinct model for each type of abnormal event in order to obtain the reported classification accuracy. This barrier could prove insurmountable in many real world applications.
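For reference, the residual-based approach to auto-correlated data mentioned earlier — fit a time-series model, then chart the residuals — might look like the following AR(1) sketch. The Yule-Walker-style estimate of phi and all names are assumptions for illustration:

```python
import statistics

def ar1_residuals(series):
    """Fit an AR(1) model x_t = phi * x_{t-1} + e_t, with phi estimated
    from the lag-1 autocorrelation, and return the residuals e_t that a
    conventional control chart can then be applied to."""
    mean = statistics.mean(series)
    dev = [x - mean for x in series]
    # Lag-1 autocorrelation as the AR(1) coefficient estimate
    num = sum(dev[t] * dev[t - 1] for t in range(1, len(dev)))
    den = sum(d * d for d in dev)
    phi = num / den
    # Residuals: the part of each observation the AR(1) model cannot explain
    residuals = [dev[t] - phi * dev[t - 1] for t in range(1, len(dev))]
    return residuals, phi
```

If the AR(1) model captures the autocorrelation, the residuals are approximately independent, so Shewhart-style limits (as sketched earlier) can be applied to them; the cited weakness is that small process shifts are largely absorbed by the fitted model.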

Dasgupta proposes a method inspired by the immune system [DF95], which encodes the normal behavior in binary form and generates all the possible abnormalities in binary form. This method has the weakness that the abnormality set may be null. Support vector regression has also been used for novelty detection [MP03]. In general, the assumption of independent occurrence of novel events, made by these methods, does not hold in real life datasets. SmartSifter, developed by K. Yamanishi, is an outlier detection method that can handle non-stationary data [YTW00]. It is a probabilistic method and may not work in cases where the data do not follow any particular parametric distribution.


D. Tax proposed a novelty detection method based on one-class classification using the support vector machine and labeled it Support Vector Data Description (SVDD) [TD01]. This approach assumes a complete representation of the normal behavior of the system in order to estimate the sigma parameter of the Gaussian kernel, which is not the case for many real life applications, such as condition monitoring of durable machines with the possibility of several failure modes. SVDD will be discussed in detail in Section 2.1.1.2. The Support Vector Representation Machine (SVRM) has the same essence as SVDD but a different way of estimating the sigma (σ) parameter of the Gaussian kernel [YC03]. However, this method is computationally inefficient and does not handle non-stationary data. SVRM will be discussed in detail in Section 2.1.1.3. Several other methods have been proposed in the literature; almost all of them exhibit some weakness in terms of the ability to set detection thresholds, the generality of the method, the ability to support incremental training, or the ability to handle non-stationary data. The next subsection gives brief information about the support vector machine, which is the method underlying the proposed novelty detection method.

2.1.1.1 Support Vector Machine

The Support Vector Machine (SVM) is a two-class classification method that separates the classes with the hyperplane giving the largest margin between the classes. SVM was first introduced by Vapnik in 1995 and has been widely used in the literature in various areas1 [GDZX04]. Training data are represented as (x_i, y_i), where x_i and y_i represent the i-th multi-dimensional data point and the class it belongs to, respectively (x_i ∈ R^d, i = 1, 2, …, n, and y_i ∈ {−1, 1}). The goal is to separate unseen data from different classes with minimum classification error. SVM reduces the structural risk by separating the classes with the largest margin. The points on the separating hyperplane satisfy w ⋅ x + b = 0, where w is the normal to the hyperplane, |b| / ‖w‖ is the perpendicular distance from the hyperplane to the origin, and ‖w‖ is the Euclidean norm of w [Vap98]. Figure 6 illustrates separation of classes using SVM.

1 http://www.kernel-machines.org/ is a great source for tutorials, applications, software, etc. about SVM.

Figure 6: Separation of classes by hyperplane.

The separating margin is 2 / ‖w‖ and can be maximized by minimizing wᵀw. Thus, maximum separation of the classes becomes a quadratic optimization problem. The training data are also required to satisfy the constraint in (2.1) in order to have all the data from different classes on different sides of the hyperplane.

y_i (x_i ⋅ w + b) − 1 ≥ 0, ∀i   (2.1)
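As a quick numerical illustration of the margin 2 / ‖w‖ and constraint (2.1), for a hypothetical hyperplane with w = (1, 0) and b = 0 (the points and function names are assumptions for illustration):

```python
import math

def satisfies_constraint(w, b, x, y):
    """Check the SVM training constraint y_i (x_i . w + b) - 1 >= 0."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) - 1 >= 0

def margin(w):
    """Separating margin 2 / ||w|| of the hyperplane w . x + b = 0."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

For w = (1, 0), b = 0, the point (2, 0) with label +1 and the point (−2, 0) with label −1 both satisfy (2.1), and the margin is 2 / ‖w‖ = 2.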

The Lagrangian formulation of the problem, as given in (2.2), makes it easier to solve, with the advantages of having the Lagrange multipliers in the constraints and the training data in the form of dot products between vectors [Bur98, Sch00]:

L_P = (1/2) ‖w‖² − Σ_{i=1..l} α_i y_i (x_i ⋅ w + b) + Σ_{i=1..l} α_i   (2.2)

By taking the gradient of L_P with respect to w and b, we obtain:

∂L_P/∂w = w − Σ_i α_i y_i x_i = 0   (2.3)

w = Σ_i α_i y_i x_i   (2.4)

∂L_P/∂b = −Σ_i α_i y_i = 0   (2.5)

Σ_i α_i y_i = 0   (2.6)

We can substitute these equality constraints in (2.2) and obtain the dual formulation:

L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i ⋅ x_j)   (2.7)

A cost function can also be added to the objective function in order to handle the non-separable case, as in (2.8), where C is a constant that penalizes misclassification. The constraints can be modified as in (2.9) to give flexibility for outliers in the case of non-separable classes. The primal and dual Lagrangian formulations and constraints are given in (2.10) through (2.13), respectively.

L_P = (1/2) ‖w‖² + C Σ_i ξ_i   (2.8)

y_i (x_i ⋅ w + b) ≥ 1 − ξ_i   (2.9)

L_P = (1/2) ‖w‖² + C Σ_i ξ_i − Σ_{i=1..l} α_i [y_i (x_i ⋅ w + b) − 1 + ξ_i] − Σ_{i=1..l} μ_i ξ_i   (2.10)

L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i ⋅ x_j)   (2.11)

Subject to the constraints:

0 ≤ α_i ≤ C   (2.12)

Σ_i α_i y_i = 0   (2.13)

Nonzero α values correspond to data points that are either misclassified or on the hyperplane; these are called support vectors. Support vectors play an essential role in the sparseness of SVM. The dot product in (2.11) can be replaced by a kernel function in order to handle nonlinearly separable cases. The idea is that a nonlinearly separable case can be converted to a linearly separable one by transforming the feature space into a higher dimensional space. The kernel function gives the values of the dot products in the higher dimensional space, avoiding the need to actually go to that space and calculate the dot products. This is called the kernel trick [Str88]. The support vector machine is often associated with physical meaning and requires a small amount of training samples. SVM has been successfully applied in many areas, such as pattern recognition, multi-regression, nonlinear fitting, etc. [GDZX04]. A great amount of resources, including articles, software, and tutorials, can be found at http://www.kernel-machines.org/. In the next two subsections, SVM based novelty detection methods will be discussed.

2.1.1.2 Support Vector Data Description

Support Vector Data Description (SVDD) finds the sphere with minimum volume that contains all normal data [TD99]. The closed boundary is assumed to be a sphere, which is represented by


center c and radius r. Minimization of the volume of the sphere is achieved by minimizing r², which can also be defined as the structural error [Vap98]:

Min r²   (2.14)

Subject to: ‖x_i − c‖² ≤ r², ∀i   (2.15)

The above equations do not allow any data to fall outside of the sphere. In order to make a provision within the model for handling outliers within the training set, a penalty cost function is introduced for data that lie outside of the sphere, as in (2.16) and (2.17), where C is the coefficient of penalty for each outlier:

Min r² + C Σ_i ξ_i   (2.16)

Subject to: ‖x_i − c‖² ≤ r² + ξ_i, ξ_i ≥ 0, ∀i   (2.17)

This is a quadratic optimization problem and can be solved by introducing Lagrange multipliers as shown in (2.18), where γ_i ≥ 0, α_i ≥ 0, and x_i ⋅ x_j denotes the inner product of x_i and x_j [Str88]:

L(r, c, ξ, α, γ) = r² + C Σ_i ξ_i − Σ_i α_i {r² + ξ_i − (x_i ⋅ x_i − 2c ⋅ x_i + c ⋅ c)} − Σ_i γ_i ξ_i   (2.18)

Taking the derivatives of (2.18), we obtain the following constraints in (2.19), (2.20), and (2.21):

c = Σ_i α_i x_i   (2.19)

C − α_i − γ_i = 0, ∀i   (2.20)

Σ_i α_i = 1   (2.21)


The following quadratic programming problem is obtained by substituting (2.19), (2.20), and (2.21) in (2.18):

Max Σ_i α_i (x_i ⋅ x_i) − Σ_{i,j} α_i α_j (x_i ⋅ x_j)   (2.22)

Subject to: 0 ≤ α_i ≤ C, ∀i   (2.23)
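Once the α_i are available from (2.22)-(2.23), the SVDD decision rule follows from (2.19): compute the center c = Σ_i α_i x_i and accept a test point as normal if it lies within the sphere's radius. A linear-kernel sketch with assumed α values (the helper names and example points are illustrative, not from the dissertation):

```python
import math

def svdd_center(alphas, points):
    """Sphere center c = sum_i alpha_i x_i (eq. 2.19); the alphas sum to 1."""
    dim = len(points[0])
    return [sum(a * p[d] for a, p in zip(alphas, points)) for d in range(dim)]

def svdd_accepts(z, center, radius):
    """Accept a test point as 'normal' if it lies inside the sphere."""
    dist = math.sqrt(sum((zi - ci) ** 2 for zi, ci in zip(z, center)))
    return dist <= radius
```

In practice the radius is taken as the distance from any boundary support vector (a point with 0 < α_i < C) to the center; here it is simply supplied by the caller.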

One of the good attributes of the support vector machine is its sparse representation: the support vectors alone can sufficiently represent the data. Data points can be classified as follows:

• Inside the hyper-sphere when α_i = 0
• On the boundary (i.e., support vectors) when 0 < α_i < C
• Outside the hyper-sphere when α_i = C

When ξ_i > 0, γ_i will be zero, resulting in α_i being equal to C. When the data point is on the boundary, α_i and γ_i will be between zero and C to satisfy (3.8). The quadratic programming solution often yields a few data points with a non-zero α_i value, called support vectors. What is of particular interest is that support vectors can effectively represent the data while remaining sparse. In general, a sphere in the original input space may not represent the dataset well enough.

Hence, data ought to be transformed to a higher-dimensional feature space where it can be effectively represented using a hyper-sphere. By employing the so-called kernel trick, one may use the inner-product kernel K(x_i, x_j) to construct the optimal hyper-sphere in the higher-dimensional feature space without having to consider the feature space itself (which can be extremely large) in explicit form [Vap98]. This kernel trick makes SVMs computationally efficient. The inner-product kernel is a special case of Mercer's theorem and is defined as follows:

K(x, x_i) = φᵀ(x) φ(x_i) = Σ_{j=0..m} φ_j(x) φ_j(x_i), ∀i   (3.12)

where {φ_j(x)}_{j=1..m} denote a set of nonlinear transformations from the input space to the feature space and m is the dimensionality of the feature space. Thus, the dot product in (3.10) is replaced by a kernel function, leading us once again to the following quadratic programming problem:

Max Σ_i α_i K(x_i, x_i) − Σ_{i,j} α_i α_j K(x_i, x_j)   (3.13)

Subject to: 0 ≤ α_i ≤ C ∀i, Σ_i α_i = 1   (3.14)

One of the popular kernels in the literature is the Gaussian kernel [TD99, YC03]:

K(x_i, x_j) = exp(−‖x_i − x_j‖² / σ²), ∀i ≠ j   (3.15)
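The Gaussian kernel of (3.15) is a one-liner; note that a larger σ makes distant points look more similar, which is why the choice of σ governs how tight the data description is:

```python
import math

def gaussian_kernel(xi, xj, sigma):
    """Gaussian kernel of eq. (3.15): exp(-||xi - xj||^2 / sigma^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / sigma ** 2)
```

Identical points give K = 1, and the value decays toward 0 as the squared distance grows relative to σ².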

The proposed GSVRM method strictly employs the Gaussian kernel. The value of σ could be provided by the user or optimized iteratively [MMRTS01]. As stated earlier, SVDD and SVRM employ different techniques to optimize σ. GSVRM employs a method similar to that employed by SVRM. The procedure is as follows:

− Calculate the average nearest neighbor distance, denoted by nd, between all the data points in the data set.
− For each data point, construct a local GSVRM utilizing data within a sphere of radius 2 × nd. In building the local GSVRM, the average distance of the data within the local sphere to their mean is employed as σ for the Gaussian kernel.
− For each local GSVRM, construct an inner local GSVRM (hyper-sphere) by employing a radius smaller than that suggested by the quadratic programming solution for the local GSVRM. The parameter that controls the reduction in radius (i.e., reduction %) is pre-specified by the user.
− If the data point is rejected by the inner local GSVRM (meaning it lies outside), it is added to the boundary list. Figure 8 illustrates this procedure for determination of the boundary list. If there is a single data point within the local GSVRM hyper-sphere, it is not added to the boundary list, for it might be an outlier.3
− Global GSVRMs (i.e., GSVRMs that represent all data) are constructed using different σ values, the range spanning from the smallest nearest neighbor distance (σ_min) to the largest nearest neighbor distance (σ_max) in the data set. The value that gives the "best fit" is chosen to be the optimal σ. What constitutes best is described below.

3 For computational savings, data points that lie within the boundary of any other data point's inner local GSVRM are not considered for entry into the boundary list. Experimental results reveal that this yields computational savings of around 50%.
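The first step of the procedure above — the average nearest neighbor distance nd and the σ_min/σ_max search range — can be sketched as follows (function names are illustrative):

```python
import math

def nn_distances(points):
    """Nearest-neighbor distance of every point in the data set."""
    dists = []
    for i, p in enumerate(points):
        # Distance to the closest other point
        best = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(best)
    return dists

def sigma_search_range(points):
    """nd (average NN distance, used for the 2*nd local spheres) plus the
    [sigma_min, sigma_max] span scanned when fitting global GSVRMs."""
    d = nn_distances(points)
    return sum(d) / len(d), min(d), max(d)
```

This brute-force scan is O(n²); for larger datasets a spatial index (e.g., a k-d tree) would be the natural substitute.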


Figure 8: Determination of class boundary: Data points on the boundary will be rejected by local inner GSVRMs. (Figure labels: boundary point, interior point, local sphere of radius 2 × nd, local GSVRM, inner local GSVRM.)

In the context of a global GSVRM, smaller σ values yield more representing points (i.e., support vectors) and a tighter hyper-sphere, whereas larger values give fewer support vectors and result in a bigger hyper-sphere. The goal is to identify a value of σ that results in good agreement between the support vector list of the global GSVRM and the boundary list resulting from the local GSVRMs. In general, smaller σ values result in a global support vector list that is a "superset" of the boundary list, with some points that are not part of the boundary list; on the contrary, larger σ values result in a global support vector list that is a "subset" of the boundary list. The key is to achieve agreement between the two sets. In assessing this agreement, GSVRM computes the fitness of a σ value by employing a two-part strategy: effective representation and compactness. Effective representation is achieved by ensuring that the global support vector list "best matches" the boundary list. Compactness, on the contrary, emphasizes a smaller support vector list and is managed through a user-defined parameter 0 ≤ ς ≤ 1. The higher the value of ς, the more compact the support vector list and the higher the Type II error (i.e., inability to detect novel conditions), resulting in a larger hyper-sphere.


There is typically a σ value, denoted by σ_c, that results in near perfect agreement between the support vector list and the boundary list. As σ exceeds σ_c, the support vector list gets smaller and smaller. The actual value of σ employed by the proposed GSVRM is:

σ = σ_c + ς (σ_max − σ_c)   (3.16)

Figure 9 illustrates the influence of different compactness levels on the quality of representation, using an example dataset. The innermost GSVRM provides effective representation but with 20 support vectors, all of which are in the boundary list (ς = 0), whereas the outermost GSVRM achieves compactness with just 2 support vectors (ς = 1). The parameters σ_min, σ_max, and σ_c are calculated empirically, whereas ς and the reduction percentage require repeated trials.

Figure 9: Influence of ς on GSVRM representation (contours shown for ς = 0, 0.25, 0.5, 0.75, and 1).

Once the optimal σ value is calculated based on the desired degree of compactness, one can construct ‘inner’ and ‘outer’ boundary representations by correspondingly changing the radius of the GSVRM hyper-sphere. Figure 10 illustrates this procedure for the same dataset from Figure 9. It is clear that as the radius is changed the overall geometric shape is maintained while the scale changes. Experimental results from GSVRM implementations are presented in Section V.



Figure 10: Influence of GSVRM hyper-sphere radius on boundary representation (ς = 0).

Figure 11: Non-stationary behavior of a mechanical pump captured in the space of the two dominant principal spectral components of vibration sensor data collected at 12.5kHz for 0.5 sec at 4 hr intervals over a time span of 3 months.

3.4 WEIGHTED-GSVRM FOR NON-STATIONARY CLASSES

As stated earlier, most data domain description methods assume a stationary process, including SVDD [TD99] and SVRM [YC03]. This may not be the case for many real world applications. Some examples of non-stationary processes include catalyst deactivation, equipment behavior with age, sensor and process drifting, and fault conditions [YCO00, GWBWB97]. For example, Figure 11 illustrates the non-stationary vibratory behavior of a mechanical pump in the space of the two most dominant principal spectral components. It should also be pointed out that while this pump shows evolution or trajectory in the principal spectral space, there were no mechanical faults; hence, this represents 'normal' evolutionary behavior of the pump. Figure 12 shows contour plots of the two most dominant principal components of vibration sensor data collected from a pump using temporal, spectral, and energy domain features. It is evident from the figures that the data do not follow any particular parametric distribution. In this section, we propose a weighted-GSVRM (WGSVRM) for representation of such non-stationary processes.


Figure 12: Contour plots of the two most dominant principal components of vibration sensor data collected from a pump: a) temporal domain, b) spectral domain, and c) energy domain.

To explicitly support the non-stationary nature of the data, we introduce the notion of 'weight' or 'importance' of a data point in determining the boundary representation based on its 'age'. These weights are defined as follows:

ω_i = (1 − λ)^(t_c − t_i), ∀i   (3.17)

where λ is the forgetting factor (0 ≤ λ < 1), t_c is the time of the "latest" data point, and t_i is the time of data point i. The WGSVRM procedure incorporates these weights by modifying (3.3) as follows:

Min r² + C Σ_i ω_i ξ_i   (3.18)

Subject to: ‖x_i − c‖² ≤ r² + ξ_i, ξ_i ≥ 0, ∀i   (3.19)

The dual of the Lagrangian results in the following:

Max Σ_i α_i K(x_i, x_i) − Σ_{i,j} α_i α_j K(x_i, x_j)   (3.20)

Subject to: 0 ≤ α_i ≤ C ω_i, Σ_i α_i = 1   (3.21)
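The forgetting-factor weights of (3.17) are straightforward to compute; note that λ = 0 gives every point weight 1, recovering plain GSVRM:

```python
def data_weights(timestamps, lam):
    """Age-based weights omega_i = (1 - lam)^(t_c - t_i) of eq. (3.17);
    t_c is the time of the latest data point, lam the forgetting factor."""
    tc = max(timestamps)
    return [(1.0 - lam) ** (tc - t) for t in timestamps]
```

The weights then cap each Lagrange multiplier via 0 ≤ α_i ≤ C ω_i in (3.21), so older points are progressively less able to act as outliers or support vectors.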

WGSVRM is built on GSVRM and differs from it in formulation and in the optimal σ calculation. As seen from (3.13), (3.14), (3.20), and (3.21), the only formulation difference between GSVRM and WGSVRM lies in the upper bound of α_i in (3.21); thus, each α_i is limited by a different upper bound. Similar to the discussion for GSVRM, α is interpreted as follows: a data point lies inside the WGSVRM boundary if α_i = 0, outside the WGSVRM boundary if α_i = C ω_i, and on the boundary if 0 < α_i < C ω_i. Note that (3.4) and (3.19) are identical, resulting in the same constraint terms in the Lagrange formulation; thus, WGSVRM also satisfies the Karush-Kuhn-Tucker (KKT) conditions. With respect to calculation of the optimal σ, the critical difference between WGSVRM and GSVRM lies in the fitness function that assesses the degree of agreement between the support vectors and the boundary list. The fitness function employed by WGSVRM consists of two sub-functions: one that assesses the "closeness of support vectors to the boundary list" and another that assesses the "impact of the age of the support vectors on the overall solution". The closer the support vectors to the boundary and/or the younger the support vectors, the higher the fitness value. The closeness measure cm, calculated by the first sub-function, is in essence a function of the ratio of the distance from a support vector to its closest boundary list point and the maximum of the shortest distances between data points and their closest boundary list points:

cm_i = 1 − d_i / D   (3.22)

D = max(d), d_i = min(dis(x_i, v))   (3.23)

where v denotes the vector of data points in the boundary list. The closeness measure reaches a maximum when the individual support vectors are in the boundary list or near the boundary. The concept of an age measure of a data point, am, is also introduced here for developing the second sub-function of the fitness function:

am_i = β^(t_c − t_i)   (3.24)

Thus, the second sub-function forces the support vectors to be as young as possible so as to be able to effectively track the non-stationary process. Note that overlap between new data and previous data during evolution is allowed, with the possibility of having two age values for overlapping data points. The overall fitness, F_∂, is calculated by combining the age and closeness measures of all support vectors as follows:

F_∂ = (1 / N_SV) Σ_{i=1..N_SV} (w_1 am_i + w_2 cm_i)   (3.25)

where w_1 and w_2 denote the weighting factors for the age and closeness measures, respectively, and N_SV is the number of support vectors. As mentioned earlier, one-class classification is important in cases where there are not enough examples from abnormal classes. The question "how can the model benefit from the existing limited abnormal examples?" is very logical. SVDD incorporates these examples in the model [Tax01]. Existing examples from abnormal classes are particularly important when they are "within normal data". The objective function can incorporate these examples as follows:

Min r² + C Σ_i ω_i ξ_i + C_o Σ_j ω_j ξ_j   (3.26)

Subject to: ‖x_i − c‖² ≤ r² + ξ_i   (3.27)

‖x_j − c‖² ≥ r² − ξ_j   (3.28)

where C_o is the penalty value for an abnormal data point inside the representation boundary. The Lagrange formulation of the problem is as follows:

L(r, c, ξ, α, γ) = r² + C Σ_i ω_i ξ_i + C_o Σ_j ω_j ξ_j − Σ_i γ_i ξ_i − Σ_j γ_j ξ_j − Σ_i α_i {r² + ξ_i − (x_i ⋅ x_i − 2c ⋅ x_i + c ⋅ c)}   (3.29)

After taking the derivatives, we obtain the following equations:

Σ_i α_i − Σ_j α_j = 1    (3.30)

c = Σ_i α_i x_i − Σ_j α_j x_j    (3.31)

Substituting (3.30) and (3.31) into (3.29) leads to the following Lagrange formulation:

L = Σ_{i∈I} α_i (x_i·x_i) − Σ_{i∈J} α_i (x_i·x_i)
    − Σ_{i,j∈I} α_i α_j (x_i·x_j) + Σ_{i∈I, j∈J} α_i α_j (x_i·x_j)
    + Σ_{i∈J, j∈I} α_i α_j (x_i·x_j) − Σ_{i,j∈J} α_i α_j (x_i·x_j)    (3.32)

where I is the class of normal examples and J is the class of abnormal examples. This formulation can be simplified by labeling the abnormal classes as "−1" and the normal class as "+1":

y_i = +1 if x_i ∈ I,  y_i = −1 if x_i ∈ J    (3.33)

α'_i = y_i α_i    (3.34)

Substituting (3.33) and (3.34) into (3.32) results in the following:

L = Σ_{i∈{I,J}} α'_i (x_i·x_i) − Σ_{i,j∈{I,J}} α'_i α'_j (x_i·x_j)    (3.35)

0 ≤ α_i ≤ C ω_i    (3.36)

0 ≤ α_j ≤ C_o ω_j    (3.37)

As can be seen from equation (3.35), the formulation essentially remains the same, with an additional constraint on the Lagrange multiplier of each abnormal example. The formulation also clearly shows that GSVRM is a special case of WGSVRM. The next section describes the online training process.

3.5 ONLINE TRAINING

Provision for online training is an important attribute for many models. The need for an online training procedure is obvious, especially for machine-monitoring applications where response time is of the essence. Creating a local GSVRM for each data point to choose the optimum σ value is the most time-consuming part of the proposed method. To improve computational efficiency, the boundary list is updated, rather than re-calculated from scratch, when new data become available. There are two tasks to be carried out when a new data point becomes available. First, we need to check whether the new data point is a boundary point. Second, we need to check whether the new data point forces existing boundary-list points to leave the boundary. For the first task, a local GSVRM is created for the new data point; if the new point lies outside its inner local GSVRM, it is added to the boundary list. For the second task, an inner local GSVRM is created for each boundary-list data point within the vicinity of the new data point; if such a data point falls inside the boundary of its inner local GSVRM, it is removed from the boundary list. Once the boundary list is updated, different σ values obtained by increasing and decreasing the previous optimal σ value are tried, and the σ value that gives the highest fitness is chosen as the new optimal σ value. The complete GSVRM algorithm is outlined in Figure 13.

Step 1: Determination of data points close to the boundary (i.e., creation of the boundary list).
  1.1 Calculate the average nearest-neighbor distance (nd).
  1.2 For each data point i, construct a local GSVRM with the data points inside the sphere of radius 2 × nd centered at data point i.
  1.3 Calculate an inner GSVRM with a reduced threshold.
  1.4 If data point i is rejected by the inner GSVRM (meaning it lies outside the threshold boundary), add it to the boundary list.
  1.5 Choose the next data point and go to Step 1.2.
Step 2: Calculate the global GSVRM using an optimal σ. The σ value that gives the best agreement between the support vectors of the global GSVRM and the boundary list from Step 1 is chosen as the optimal σ.
Step 3: When a new data point becomes available, update the boundary points as follows:
  3.1 Construct a local GSVRM for the new data point as in Step 1.
  3.2 Calculate an inner GSVRM with a reduced threshold.
  3.3 Add the new data point to the boundary list if it is rejected by the inner GSVRM.
  3.4 Implement Steps 3.1 and 3.2 for all data points that are within the local GSVRM sphere of the new data point and in the boundary list.
  3.5 Remove from the boundary list any such data point that is not rejected (i.e., falls inside its inner GSVRM).
  3.6 Check for better fitness of the global GSVRM by increasing and decreasing σ. If a different σ leads to better fitness, update σ.
  3.7 Calculate the minimum-volume hyper-sphere using the updated boundary points.

Figure 13: GSVRM algorithm.
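Step 1 of the algorithm can be sketched in code. The sketch below is a simplified stand-in, not the dissertation's implementation: where the actual method fits a small local GSVRM (a QP) around each point and keeps the points rejected by a shrunken inner model, this sketch substitutes a cheap directional-extremality test on the local neighbourhood; all names are illustrative.

```python
import numpy as np

def boundary_list(X):
    """Sketch of Step 1 (boundary-list construction).

    Stand-in only: the real method builds a local GSVRM per point and keeps
    points rejected by its inner model.  Here a point is kept when no
    neighbour lies further out along the point's own outward direction.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nd = d.min(axis=1).mean()              # Step 1.1: average NN distance
    boundary = []
    for i in range(len(X)):
        mask = d[i] <= 2 * nd              # Step 1.2: local sphere, radius 2*nd
        if not mask.any():                 # isolated point: possible outlier
            continue
        local = np.vstack([X[mask], X[i]])
        centre = local.mean(axis=0)
        out = X[i] - centre                # outward direction of x_i
        norm = np.linalg.norm(out)
        if norm < 1e-12:                   # sits at the local centre: interior
            continue
        proj = (local - centre) @ (out / norm)
        if proj.max() <= norm + 1e-9:      # nothing sticks out past x_i
            boundary.append(i)
    return boundary
```

On a regular 5 × 5 grid, this flags the perimeter points and skips the interior ones, mirroring the role the boundary list plays in Step 2's σ search.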

Experimental results from the implementation of WGSVRM are discussed in the next section.

3.6 EXPERIMENTS

The proposed one-class classification methods (both GSVRM and WGSVRM) were tested using multiple datasets in two- and three-dimensional spaces. The properties of the datasets are summarized in Table 1, and the parameter levels employed for constructing the WGSVRMs are reported in Table 2. All experiments dealing with non-stationary datasets employed a forgetting factor of λ = 0.03. Obviously, the results are sensitive to this factor; in general, the forgetting factor should be based on subject-matter experience and/or data analysis. The more non-stationary the data, the larger the forgetting factor should be. In all our experiments, the C parameter is set at 1.03 so that the penalty values for the most recent data come close to unity, ensuring that recent data are well represented. The age coefficient (β) is chosen to be the inverse of the number of data points. The weighting factors for the age and closeness measures (w_1, w_2) are both set to 0.5. The reduction percentage for creating the inner GSVRM is set to 0.7% for all datasets (i.e., the radius of the inner local GSVRM is 0.993 times the radius of the GSVRM). All experiments were conducted on a PC with a Pentium III processor running at 700 MHz and employed a Gaussian kernel.

Table 1: Properties of experimental datasets.

Dataset #1: 2-D, stationary; Gaussian with µ = [0 0], Σ = [1 0.8; 0.8 1]
Dataset #2: 2-D, non-stationary (straight-line movement); Gaussian with µ = [0 0], Σ = [1 0.8; 0.8 1]
Dataset #3: 2-D, non-stationary (arc movement); Gaussian with µ = [0 0], Σ = [1 0.8; 0.8 1]
Dataset #4: 2-D, non-stationary (straight-line movement); letter "C" geometry
Dataset #5: 3-D, stationary; helical "C" geometry

Table 2: Parameters for building WGSVRM.

Parameter   #1     #2     #3     #4     #5
C           1.03   1.03   1.03   1.03   1.03
ς           0.01   0.01   0.01   0.01   0.01
λ           0      0.03   0.03   0.03   0
β           0      1/N    1/N    1/N    0
w_1         –      0.5    0.5    0.5    –
w_2         –      0.5    0.5    0.5    –

Figure 14 illustrates the estimated GSVRM boundary and the actual 99th-percentile contour for dataset #1. Given the geometric similarity between the GSVRM boundary and the probability contour, one can conclude that GSVRM is reasonably effective in describing the data. In the case of dataset #2, the class moves along a straight line. If one were to ignore the non-stationary nature of the process, all data would receive equal importance and the GSVRM would try to include all data within the hyper-sphere. As can be seen from Figure 15, the resulting WGSVRM boundary does a good job of representing the non-stationary process, and old data are allowed to fall outside the boundary. As can be seen from Figure 16, in the case of dataset #3 the mean moves along an arc of a circle; WGSVRM once again does an effective job of representing the dynamics of the non-stationary process. Figure 17 presents results for dataset #4. WGSVRM is once again effective in representing the non-stationary process in spite of the complexity of the class boundary. The primary objective in building the last dataset (i.e., the helical "C" dataset #5) was to assess the scalability of the proposed WGSVRM to higher-dimensional data in terms of computational efficiency. This dataset in three-dimensional space is constructed by adding a dimension to dataset #4 in which the new variable increases linearly from the first data point to the last, resulting in a helical "C".
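A dataset-#2-style class — a correlated Gaussian whose mean drifts along a straight line — can be generated as follows. The covariance matches Table 1; the number of points, drift endpoints, and random seed are illustrative assumptions, not values from the dissertation.

```python
import numpy as np

# Non-stationary Gaussian class: the mean moves along a straight line while
# the covariance (Table 1) stays fixed.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
L = np.linalg.cholesky(cov)                         # colour the noise
n = 200
drift = np.linspace([0.0, 0.0], [15.0, 15.0], n)    # mean drifts on a line
X = drift + rng.standard_normal((n, 2)) @ L.T
```

Early rows cluster near the origin and late rows near the drift endpoint, which is exactly the behaviour the forgetting factor is meant to track.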

Figure 14: Results from GSVRM for dataset #1.

Figure 15: Results from WGSVRM for dataset #2.


Figure 16: Results from WGSVRM for dataset #3.

Figure 17: Results from WGSVRM for dataset #4.

Given the curse of dimensionality, one would expect more support vectors for higher-dimensional datasets. Since capacity control in a support vector machine is obtained by controlling the number of support vectors [CT00], a model whose number of support vectors equals or approaches the number of data points is susceptible to a high misclassification probability. Several experiments were carried out to assess these properties for the proposed WGSVRM method. Table 3 reports the number of support vectors for the two-dimensional dataset (i.e., the letter "C") and the three-dimensional dataset (i.e., the helical "C") as a function of dataset size. While there is evidence that the method can handle higher-dimensional datasets, there is also clear evidence that the number of support vectors increases with data dimensionality. Given the difficulty of producing three-dimensional visualizations, we report only the number of support vectors and the computational time for constructing the WGSVRM.

Figure 18: GSVRM computational time (in seconds) versus dataset size for the two- and three-dimensional data.

Table 3: Scalability of the WGSVRM method to higher dimensions, measured in terms of the number of support vectors.

Dataset size                    124   174   224   274   324
Letter "C" (2 dimensions)         9    12    12    10    12
Helical "C" (3 dimensions)       23    17    28    26    33

Figure 18 reports the computation time for the two- and three-dimensional datasets from Table 3. It is clear that the computation time increases with dataset size and dimensionality. The procedure for online training was discussed in Section 3.5. Checking for a better σ value every time a new data point becomes available may not be necessary; we chose to check for a better σ value once every 5 new data points become available. The optimal batch size depends on how fast the process is moving and on how much the class density distribution is changing. Figure 19 shows results from the implementation of on-line WGSVRM for dataset #2. It is evident that the WGSVRM representation tracks the data: "+" represents the data points used initially for training the WGSVRM, and "·" represents data points subsequently added to the dataset one by one. The WGSVRM is updated after every five new data points become available.
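The batching policy just described — refresh the boundary list with every point, but re-check σ only once per batch — can be sketched as follows. The two helpers are deliberately simplified stand-ins (a sliding window and an average nearest-neighbour distance), not the local-GSVRM test and fitness-driven σ search of the actual method; all names are illustrative.

```python
import numpy as np

def update_boundary(boundary, x, keep=30):
    """Stand-in for the boundary-list test: keep a sliding window of points."""
    boundary.append(np.asarray(x, dtype=float))
    return boundary[-keep:]

def refit_sigma(boundary):
    """Stand-in for the sigma search: average nearest-neighbour distance."""
    B = np.vstack(boundary)
    d = np.linalg.norm(B[:, None] - B[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return float(d.min(axis=1).mean())

def online_update(stream, batch=5):
    """Boundary list refreshed per point; sigma re-checked once per batch."""
    boundary, sigma, pending = [], None, 0
    for x in stream:
        boundary = update_boundary(boundary, x)
        pending += 1
        if pending == batch:          # cheaper than re-tuning on every point
            sigma = refit_sigma(boundary)
            pending = 0
    return boundary, sigma
```

The design choice is the same trade-off the text describes: a larger `batch` is cheaper but tracks a fast-moving class less closely.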

Figure 19: On-line WGSVRM for dataset #2.

Figure 20: On-line WGSVRM for the "C"-shaped dataset #4.

The on-line WGSVRM method was also tested using dataset #4. Initially, the first half of the "C"-shaped dataset (represented by "+") is used to initialize the process. The second half (represented by "·") is then added one point at a time. The boundary list is updated after each data point is added, and the σ value is updated after every five data points are added. Figure 20 shows the moving hyper-sphere of the data. As seen from the graph, the hyper-sphere does an effective job of tracking the data.

3.7 CONCLUSION

One-class classification methods have been of particular interest to researchers in domains where it is difficult or expensive to find examples of abnormal behavior (such as medical/machine diagnostics and IT network surveillance). This chapter proposes a novel one-class classification method named the Weighted General Support Vector Representation Machine (WGSVRM) for stationary as well as non-stationary classes. The method does not make any strong assumptions regarding the cluster data density. In representing the 'normal' class, GSVRM essentially minimizes the volume of the hyper-sphere in the Gaussian kernel space that encapsulates the normal data, while making an explicit provision for incorporating any data available from 'abnormal' classes. The Weighted GSVRM offers the ability to represent non-stationary classes by making a provision for assigning different weights (or degrees of importance) to the data points as a function of their 'age'. The WGSVRM formulation remains a quadratic programming formulation and meets the KKT conditions, allowing the use of existing solvers to arrive at a global optimal solution. Experimental evaluation reveals that the proposed method can effectively represent both stationary and non-stationary classes. An efficient on-line version of the WGSVRM is also proposed.


CHAPTER IV
PROCESS MONITORING USING GENERAL SUPPORT VECTOR REPRESENTATION MACHINE

4.1 INTRODUCTION

In order to improve product quality and reduce production cost, it is important to detect equipment malfunctions, failures, and other special events as early as possible. For example, according to a survey conducted by Nimmo (1995), the US-based petrochemical industry could save up to $10 billion annually if abnormal process behavior could be detected, diagnosed, and appropriately dealt with [CKML04]. By monitoring the performance of a process over time, statistical process control (SPC) attempts to distinguish process variation attributed to common causes from variation attributed to special causes, and hence forms a basis for process monitoring and equipment malfunction detection [MMZ96]. It is also the most commonly used tool for analyzing and monitoring processes [EA03]. Conventional SPC charts such as Shewhart control charts, cumulative sum (CUSUM) control charts, and exponentially weighted moving average (EWMA) control charts were developed for univariate processes and do not work well for multi-variable processes [MSIH04]. Multivariate statistical process control (MSPC) is employed to monitor processes with correlated variables. Unfortunately, SPC and MSPC methods were developed under the assumption that process variables are normally distributed [ROS91], which does not necessarily hold in many real industrial processes. In addition, SPC techniques (including MSPC) for control charting often assume that data from subsequent samples are independent, which does not hold in many industrial processes [WM99, Chi02, CC98]. Several attempts have been made in the literature to monitor auto-correlated parameters by extending traditional SPC techniques [AR88, HR91, WMP94, RWP95, BL97] as well as by applying pattern-recognition techniques such as radial basis function (RBF) networks [CC98], multi-layer perceptron (MLP) networks [Pug91, Smi94], and support vector machines (SVMs) [Chi02]. In the first case, a time-series model is fitted to the auto-correlated data to obtain residuals, and SPC techniques are then applied to the residuals. The performance of these time-series modeling techniques is not very good, especially for detecting small shifts [Chi02]. Furthermore, SPC focuses only on data collected from an in-control process for characterization and hence cannot take advantage of historical data available from out-of-control conditions and failures. On the other hand, almost all the machine learning methods proposed in the literature for process control strictly require example data from all out-of-control states of interest. In addition, the results from using RBF and MLP methods for process monitoring are marginal in that they lack adaptability and make no provision for trading off Type I errors (false alarms) against Type II errors (inability to detect shifts in process condition). While the application of SVMs as classifiers for process monitoring has yielded better success, it is often not practical to apply SVMs directly to many real-world process-monitoring problems, for two reasons: 1) availability of out-of-control data: the training procedure for an SVM (as for RBF and MLP networks) requires large amounts of data from in-control as well as out-of-control states, which may not be available — in many cases, obtaining failure data is difficult, expensive, or even impossible; 2) the necessity of modeling and training for each specific failure type.
A model that is developed for a specific type of abnormal event (out-of-control state) cannot give good classification accuracy for another type of abnormal event. For example, an SVM trained to separate in-control samples from large-mean-shift samples gives poor classification accuracy for detecting small mean shifts, although it works well for large mean shifts. Thus, for MLP and SVM it is necessary to create and train a distinct model for each type of abnormal event in order to obtain the reported classification accuracy. This barrier could prove insurmountable in many real-world applications. In this chapter, we present a new control-chart method based on the support vector machine, called the General Support Vector Representation Machine (GSVRM), that is non-parametric and supports multivariate and auto-correlated processes. In addition, GSVRM requires only in-control data for training, while retaining a provision to learn from out-of-control examples when available. The rest of this chapter is organized as follows: Section 2 provides background information on SVMs, the theory behind GSVRM is presented in Section 3, on-line GSVRM in Section 4, experimental results in Section 5, and our conclusions in Section 6.

4.2 SUPPORT VECTOR MACHINE

This section provides a brief background on SVM, which was first introduced by Vapnik in 1995 [Vap98]. The fundamentals of SVM arise from statistical learning theory. SVM has strong mathematical foundations and performs well as a tool in many practical applications [Chi02]. In recent years, SVM has been used for solving a variety of problems including pattern recognition, classification, and regression. The interested reader is referred to the collection of articles, references, and software about SVM available at http://kernel-machines.org. SVM was initially developed as a two-class (binary pattern) classification method that classifies an input vector into one of two possible classes. The input vectors consist of

patterns x_i ∈ ℝ^d and labels y_i ∈ {1, −1}, and are represented as (x_1, y_1), (x_2, y_2), …, (x_m, y_m). We want to predict the label y of an unseen x by learning from the training set. SVM separates the classes with the hyper-plane that gives the largest margin between the classes. Margin maximization is achieved by minimizing ‖w‖² subject to the constraints y_i (x_i·w + b) − 1 ≥ 0, where w is the normal to the hyper-plane. Figure 21 illustrates such a hyper-plane for the separable case. In order to provide for non-separable cases, the formulation is modified as follows:

Minimize ‖w‖² + C Σ_i ξ_i    (4.1)

Subject to: y_i (x_i·w + b) − 1 + ξ_i ≥ 0    (4.2)

This is a quadratic optimization problem and can be solved using a Lagrangian dual formulation as follows:

Maximize Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j),  α_i: Lagrange multiplier    (4.3)

Subject to: 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0    (4.4)

The Lagrangian formulation of the problem has the advantage that the constraints involve only the Lagrange multipliers and the training data appear only in the form of dot products between pattern vectors [MMRTS01]. In the solution, the data points with non-zero α_i values are called support vectors; for the separable case they lie on the margin hyper-planes, satisfying y_i (w·x_i + b) = 1.

Figure 21: Separation of classes by a hyper-plane (margin = 2/‖w‖).


In most cases, classes are not linearly separable. The dot product of pattern vectors in the formulation is therefore replaced by a kernel function, which transforms the data into a higher-dimensional space and yields a non-linear separating surface. The kernel function can take various forms, including Gaussian, polynomial, and others. For more detailed information about support vector machines, see [MMRTS01]. The next section discusses the General Support Vector Representation Machine (GSVRM).

4.3 GENERAL SUPPORT VECTOR REPRESENTATION MACHINE

GSVRM is inspired by Support Vector Data Description (SVDD) [TD99] and the Support Vector Representation Machine (SVRM) [YC03], and gives the minimum-volume closed spherical boundary around the data, represented by center c and radius r. Minimization of the volume is achieved by minimizing r², which represents the structural error [MMRTS01]:

Min r²    (4.5)

Subject to: ‖x_i − c‖² ≤ r² ∀i,  x_i: i-th data point    (4.6)

The above formulation does not allow any data to fall outside the sphere. In order to make provision within the model for potential outliers in the training set, a penalty cost function is introduced for data that lie outside the sphere:

Min r² + C Σ_i ξ_i    (4.7)

Subject to: ‖x_i − c‖² ≤ r² + ξ_i,  ξ_i ≥ 0 ∀i    (4.8)

where C is the coefficient of penalty for each outlier (also referred to as the regularization parameter) and ξi is the distance between the i th data point and the hyper-sphere. Once again, this is a quadratic optimization problem and can be solved efficiently by introducing Lagrange multipliers for constraints [Vap98].


L(r, c, ξ, α, γ) = r² + C Σ_i ξ_i − Σ_i α_i {r² + ξ_i − (x_i·x_i − 2 c·x_i + c·c)} − Σ_i γ_i ξ_i    (4.9)

where γ_i and α_i are Lagrange multipliers, γ_i ≥ 0, α_i ≥ 0, and x_i·x_i is the inner product of x_i with itself. Note that for each training data point x_i, a corresponding α_i and γ_i are defined. L is minimized with respect to r, c, and ξ, and maximized with respect to α and γ. Taking the derivatives of (4.9) with respect to r, c, and ξ and equating them to zero, we obtain the following constraints:

c = Σ_i α_i x_i    (4.10)

C − α_i − γ_i = 0 ∀i    (4.11)

Σ_i α_i = 1    (4.12)

Given that γ_i ≥ 0 and α_i ≥ 0, constraint (4.11) can be rewritten as:

0 ≤ α_i ≤ C ∀i    (4.13)

The following quadratic programming problem is obtained by substituting (4.10), (4.11), and (4.12) into (4.9), with (4.13) as the box constraint:

Max Σ_i α_i (x_i·x_i) − Σ_{i,j} α_i α_j (x_i·x_j)    (4.14)

Subject to: 0 ≤ α_i ≤ C ∀i,  Σ_i α_i = 1    (4.15)

Standard algorithms exist for solving this problem [CT00].
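As a concrete illustration, the dual (4.14)–(4.15) for the linear kernel can be handed to a general-purpose constrained optimizer; production implementations use specialised QP solvers instead. This sketch assumes SciPy is available, and the function name is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def svdd_alphas(X, C=1.0):
    """Solve the GSVRM/SVDD dual (4.14)-(4.15) for the linear kernel.

    Sketch only: a generic solver stands in for the specialised QP
    algorithms used in practice.
    """
    K = X @ X.T                        # linear-kernel Gram matrix
    n = len(X)

    def neg_dual(a):                   # maximise  =>  minimise the negative
        return -(a @ np.diag(K) - a @ K @ a)

    cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}
    res = minimize(neg_dual, np.full(n, 1.0 / n),
                   bounds=[(0.0, C)] * n, constraints=[cons])
    return res.x

# The centre then follows from (4.10): c = sum_i alpha_i * x_i.
```

For four points at the corners of a square, the recovered centre c = Σ_i α_i x_i is the centre of the minimum enclosing sphere, as the formulation predicts.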

The above Lagrange formulation also allows further interpretation of the values of α. By complementary slackness, a Lagrange multiplier (α_i or γ_i) takes the value zero whenever the corresponding constraint term in (4.9) is inactive. Thus, the GSVRM formulation satisfies the Karush-Kuhn-Tucker (KKT) conditions for achieving a global optimal solution. Noting that C = α_i + γ_i, if one of the two multipliers is zero, the other takes the value C. When a data point x_i is inside the sphere, the

corresponding α_i will be equal to zero. If it is outside the sphere, i.e., ξ_i > 0, γ_i will be zero, forcing α_i to equal C. When the data point is on the boundary, α_i and γ_i both lie strictly between zero and C. The quadratic programming solution often yields only a few data points with a non-zero α_i value, called support vectors. What is of particular interest is that the support vectors can effectively represent the data while remaining sparse. In general, a sphere in the original input space may not represent the dataset well enough. Hence, the data ought to be transformed to a higher-dimensional feature space where they can be effectively represented using a hyper-sphere. By employing the so-called kernel trick, one may use an inner-product kernel K(x_i, x_j) to construct the optimal hyper-sphere in the higher-dimensional feature space without having to consider the feature space itself (which can be extremely large) in explicit form [Vap98]. This kernel trick makes SVMs computationally efficient. The inner-product kernel is a special case of Mercer's theorem [Vap98] and is defined as follows:

K(x_i, x_j) = φ(x_i)ᵀ φ(x_j) = Σ_{k=1}^{m} φ_k(x_i) φ_k(x_j)    (4.16)

where {φ_k(x)}_{k=1}^{m} denotes a set of nonlinear transformations from the input space to the feature space and m is the dimensionality of the feature space. Thus, the dot product in (4.14) is replaced by a kernel function, leading once again to the following quadratic programming problem:

Max Σ_i α_i K(x_i, x_i) − Σ_{i,j} α_i α_j K(x_i, x_j)    (4.17)

Subject to: 0 ≤ α_i ≤ C ∀i,  Σ_i α_i = 1    (4.18)

One of the popular kernels in the literature is the Gaussian kernel [YC03, TD99]:

K(x_i, x_j) = exp(−‖x_i − x_j‖² / σ²) ∀ i ≠ j    (4.19)
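The kernel in (4.19) can be evaluated for whole batches at once. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / sigma^2),
    matching (4.19).  Works both for the Gram matrix (A is B) and for
    evaluating new points against the training data.
    """
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma ** 2)
```

Note that K(x, x) = 1 for this kernel, so the diagonal of the Gram matrix is always a vector of ones — a property the dual (4.17) implicitly relies on.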

The proposed GSVRM method employs the Gaussian kernel. The value of σ could be provided by the user or optimized iteratively [Vap98]. GSVRM optimizes σ as follows:

- Calculate the average nearest-neighbor distance, denoted by nd, between all the data points in the dataset.
- For each data point, construct a local GSVRM utilizing the data within a sphere of radius 2 × nd. In building the local GSVRM, the average distance of the data within the local sphere to their mean is employed as σ for the Gaussian kernel.
- For each local GSVRM, construct an inner local GSVRM (hyper-sphere) by employing a radius smaller than that suggested by the quadratic programming solution for the local GSVRM. The parameter that controls the reduction in radius (i.e., the reduction percentage) is pre-specified by the user.
- If a data point is rejected by its inner local GSVRM (meaning it lies outside), it is added to the boundary list. Figure 22 illustrates this procedure for determining the boundary list. If there is a single data point within a local GSVRM hyper-sphere, it is not added to the boundary list, for it might be an outlier. (For computational savings, data points that lie within the boundary of any other data point's inner local GSVRM are not considered for entry into the boundary list; experimental results reveal that this yields computational savings of around 50%.)
- Global GSVRMs (i.e., GSVRMs that represent all the data) are constructed using different σ values, the range spanning from the smallest nearest-neighbor distance (σ_min) to the largest nearest-neighbor distance (σ_max) in the dataset. The value that gives the "best fit" is chosen as the optimal σ; what constitutes the best fit is described below.
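The first two bullets can be sketched as follows, assuming a small in-memory dataset; the names are illustrative and the local GSVRM fit itself is omitted — only the radius (2 × nd) and the local σ choice are shown.

```python
import numpy as np

def local_sigmas(X):
    """Compute nd (average nearest-neighbour distance) and, for each point,
    the local Gaussian-kernel sigma: the average distance of the points in
    its 2*nd sphere to their mean."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nd = d.min(axis=1).mean()                       # average NN distance
    sigmas = []
    for i in range(len(X)):
        idx = np.append(np.where(d[i] <= 2 * nd)[0], i)
        local = X[idx]                              # points in the local sphere
        m = local.mean(axis=0)
        sigmas.append(np.linalg.norm(local - m, axis=1).mean())
    return nd, np.array(sigmas)
```

Each returned σ would then parameterize that point's local GSVRM before the inner-model rejection test is applied.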


Figure 22: Determination of class boundary: Data points on the boundary will be rejected by local inner GSVRMs.

In the context of a global GSVRM, smaller σ values yield more representing points (i.e., support vectors) and a tighter hyper-sphere, whereas larger values give fewer support vectors and a bigger hyper-sphere. The goal is to identify a value of σ that results in good agreement between the support vector list of the global GSVRM and the boundary list resulting from the local GSVRMs. In general, smaller σ values result in a global support vector list that is a "superset" of the boundary list, containing some points that are not part of the boundary list; larger σ values, on the contrary, result in a global support vector list that is a "subset" of the boundary list. The key is to achieve agreement between the two sets. In assessing this agreement, GSVRM computes the fitness of a σ value by employing a two-part strategy: effective representation and compactness. Effective representation is achieved by ensuring that the global support vector list best matches the boundary list. Compactness, on the contrary, emphasizes a smaller support vector list and is managed through a user-defined parameter 0 ≤ ς ≤ 1. The higher the value of ς, the more compact the support vector list, the larger the hyper-sphere, and the higher the Type II error (i.e., inability to detect novel conditions).


There is typically a σ value, denoted by σ_c, that results in near-perfect agreement between the support vector list and the boundary list. As σ exceeds σ_c, the support vector list gets smaller and smaller. The actual value of σ employed by the proposed GSVRM is:

σ = σ_c + ς (σ_max − σ_c)    (4.20)

Figure 23 illustrates the influence of different compactness levels on the quality of representation using an example dataset. The innermost GSVRM provides effective representation with 20 support vectors, all of which are in the boundary list (ς = 0), whereas the outermost GSVRM achieves compactness with just 2 support vectors (ς = 1). The parameters σ_min, σ_max, and σ_c are calculated empirically, whereas ς and the reduction percentage require repeated trials.

Figure 23: Influence of ς on GSVRM representation.

Once the optimal σ value is calculated based on the desired degree of compactness, one can construct ‘inner’ and ‘outer’ boundary representations by correspondingly changing the radius of the GSVRM hyper-sphere. Figure 24 illustrates this procedure for the same dataset from Figure 23. It is clear that as the radius is changed the overall geometric shape is maintained while the scale changes.


Figure 24: Influence of GSVRM hyper-sphere radius on boundary representation.

Learning from abnormal data

The question "how can the model benefit from the limited abnormal examples that do exist?" is a natural one. SVDD incorporates such examples in the model [Tax01]. Existing examples from out-of-control states can be incorporated into the formulation as follows:

Min r² + C Σ_i ξ_i + C_o Σ_j ξ_j    (4.21)

Subject to:

‖x_i − c‖² ≤ r² + ξ_i    (4.22)

‖x_j − c‖² ≥ r² − ξ_j    (4.23)

where C_o is the penalty value for an out-of-control data point inside the representation boundary. The Lagrange formulation of the problem is as follows:

L(r, c, ξ, α, γ) = r² + C Σ_i ξ_i + C_o Σ_j ξ_j − Σ_i γ_i ξ_i − Σ_j γ_j ξ_j
    − Σ_i α_i {r² + ξ_i − (x_i·x_i − 2 c·x_i + c·c)}
    − Σ_j α_j {(x_j·x_j − 2 c·x_j + c·c) − r² + ξ_j}    (4.24)

After taking the derivatives, we obtain the following equations:

Σ_i α_i − Σ_j α_j = 1    (4.25)

c = Σ_i α_i x_i − Σ_j α_j x_j    (4.26)

Substituting (4.25) and (4.26) into (4.24) leads to the following Lagrange formulation:

L = Σ_{i∈I} α_i (x_i·x_i) − Σ_{i∈J} α_i (x_i·x_i)
    − Σ_{i,j∈I} α_i α_j (x_i·x_j) + Σ_{i∈I, j∈J} α_i α_j (x_i·x_j)
    + Σ_{i∈J, j∈I} α_i α_j (x_i·x_j) − Σ_{i,j∈J} α_i α_j (x_i·x_j)    (4.27)

where I is the class of in-control examples and J is the class of out-of-control examples. This formulation can be simplified by labeling the out-of-control classes as "−1" and the in-control class as "+1":

y_i = +1 if x_i ∈ I,  y_i = −1 if x_i ∈ J    (4.28)

α'_i = y_i α_i    (4.29)

Substituting (4.28) and (4.29) into (4.27) results in the following:

L = Σ_{i∈{I,J}} α'_i (x_i·x_i) − Σ_{i,j∈{I,J}} α'_i α'_j (x_i·x_j)    (4.30)

0 ≤ α_i ≤ C    (4.31)

0 ≤ α_j ≤ C_o    (4.32)

As can be seen from equations (4.30)–(4.32), the formulation essentially remains the same, with an additional constraint on the Lagrange multiplier of each out-of-control example. The next section describes the online training process.

4.4 ON-LINE TRAINING

Provision for online training is an important attribute for many models. Creating a local GSVRM for each data point to choose the optimal σ value is the most time-consuming part of the proposed method. When new data become available, the boundary list is updated instead of re-calculating all boundary points, in order to improve computational efficiency. There are two


tasks to be carried out when a new data point becomes available: First, we need to check whether the new data point is a boundary point. Second, we need to check whether the new data point is forcing existing boundary data points to leave the boundary list. For the first task, a local GSVRM is created for the new data point. If the new point lies outside its inner local GSVRM, it is added to the boundary list. For the second task, an inner local GSVRM is created for data points within the vicinity of the new data point and data points within the boundary list. If the new data point falls inside the boundary of any of these inner local GSVRMs, it is removed from the boundary list. Once the boundary list is updated, different σ values are evaluated by increasing and decreasing the previous optimal σ value. The σ value that gives the support vector list that best fits the boundary list is chosen to be the new optimal σ value. The algorithm is given in the appendix. 4.5 EXPERIMENTAL RESULTS This section presents results from applying GSVRM in two different datasets (i.e. uncorrelated and correlated datasets) and is organized as follows: subsection 1 presents results from uncorrelated processes, subsection 2 presents results for correlated manufacturing processes. GSVRM for uncorrelated processes The application of GSVRM to uncorrelated processes will be discussed in two subsections: First, GSVRM is evaluated using datasets that follow common probability distributions (i.e. normal, lognormal and exponential).

We then evaluate GSVRM using a

benchmarking dataset (i.e., the Smith dataset) [Smi94] that has been used extensively for methodological development and evaluation in the process control literature [Chi02].


Table 4: Parameters of distributions

                             Normal       Lognormal    Exponential
State                        µ     σ      µ     σ      µ
Normal Behavior (Nor)        0     1      0     1      1
Small Mean Shift (SM)        1     1      1     1      2
Large Mean Shift (LM)        3     1      3     1      4
Small Variance Shift (SV)    0     2      0     2      -
Large Variance Shift (LV)    0     3      0     3      -

Figure 25: Example Time Series: a) Exponential Distribution (µ = 1), b) Normal Distribution (µ = 0, σ = 1), c) Lognormal Distribution (µ = 0, σ = 1)

In the first part, normal, lognormal and exponential distributions are employed to generate data. The distribution parameters and sample time series are given in Table 4 and


Figure 25, respectively. For a typical x̄ chart, it is recommended to have at least 20-25 subgroups [WM99]. In our experiment, 250 samples are generated for each distribution. The samples are grouped into subgroups of size 10, and two features (i.e., mean and standard deviation) are calculated for each subgroup, resulting in 25 two-dimensional patterns. We trained GSVRM with only in-control data as well as with in-control and limited out-of-control data. In the latter case, we used 10 abnormal patterns (mean and standard deviation), each obtained from 10 data points. As mentioned before, GSVRM does not require out-of-control data, but having some helps improve the accuracy of the method. Type I and Type II errors are defined as rejecting a true hypothesis and accepting a false hypothesis, respectively. In the statistical process control context, a Type I error refers to rejecting an in-control process as if it were out-of-control, and a Type II error refers to accepting an out-of-control process as if it were in-control. The format for reporting Type I and Type II errors is given in Table 5, and the results of applying GSVRM to processes that follow normal, lognormal, and exponential distributions are summarized in Table 6. Table 5: Format of Type I and Type II errors

                                Estimated
                                In-control       Out-of-control
Actual   In-control             Correct          Type I error
         Out-of-control         Type II error    Correct
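The subgrouping step described above can be sketched as follows (an illustrative helper, not code from the dissertation; Gaussian data stands in for the three distributions):

```python
import random

def subgroup_features(samples, size=10):
    """Split a stream of samples into consecutive subgroups of `size` and
    reduce each subgroup to a (mean, standard deviation) pattern."""
    patterns = []
    for start in range(0, len(samples) - size + 1, size):
        group = samples[start:start + size]
        mean = sum(group) / size
        var = sum((x - mean) ** 2 for x in group) / size   # population variance
        patterns.append((mean, var ** 0.5))
    return patterns

random.seed(0)
stream = [random.gauss(0, 1) for _ in range(250)]   # 250 in-control samples
patterns = subgroup_features(stream)                # 25 two-dimensional patterns
```

Each of the 25 resulting patterns is one point in the two-dimensional (mean, standard deviation) feature space on which GSVRM is trained.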

As seen from Table 6, GSVRM is non-parametric and is able to detect out-of-control conditions with Type I errors ranging from 11.2% to 26.8% and Type II errors from 0% to 23.9% when no out-of-control data are available, and with Type I errors from 13.4% to 19.2% and Type II errors from 0% to 16.9% when trained with in-control and limited out-of-control data.


Table 6: Classification accuracy of GSVRM. Left side of each panel: 1) training only with 25 in-control patterns, each obtained from 10 samples. Right side: 2) training with 25 in-control and 10 out-of-control patterns, each obtained from 10 samples. SM: small mean shift, SV: small variance shift, LM: large mean shift, LV: large variance shift.

a) Normal Distribution
                  1) In-control data only        2) In-control + limited out-of-control data
                  Estimated                      Estimated
Actual            In-control    Out-of-control   In-control    Out-of-control
In-control        88.30%        11.70%           86.60%        13.40%
SM                9.00%         91.00%           7.40%         92.60%
LM                0.00%         100.00%          0.00%         100.00%
SV                4.10%         95.90%           6.40%         93.60%
LV                0.10%         99.90%           0.20%         99.80%

b) Lognormal Distribution
Actual            In-control    Out-of-control   In-control    Out-of-control
In-control        73.20%        26.80%           81.30%        18.70%
SM                4.50%         95.50%           9.50%         90.50%
LM                0.00%         100.00%          0.00%         100.00%
SV                14.80%        85.20%           16.90%        83.10%
LV                4.10%         95.90%           3.00%         97.00%

c) Exponential Distribution
Actual            In-control    Out-of-control   In-control    Out-of-control
In-control        88.80%        11.20%           80.80%        19.20%
SM                23.90%        76.10%           16.00%        84.00%
LM                0.70%         99.30%           0.40%         99.60%
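Given a confusion matrix in the format of Table 5, the two error rates reduce to simple ratios (a small helper; the counts below are illustrative, not from the tables):

```python
def error_rates(in_in, in_out, out_in, out_out):
    """Counts named actual_estimated: e.g. in_out = actual in-control
    patterns that were estimated as out-of-control."""
    type_i = in_out / (in_in + in_out)      # false alarm on an in-control process
    type_ii = out_in / (out_in + out_out)   # missed out-of-control condition
    return type_i, type_ii

# e.g. 25 in-control patterns with 3 false alarms,
#      10 out-of-control patterns with 1 miss
t1, t2 = error_rates(22, 3, 1, 9)
```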


We also demonstrate the effectiveness of GSVRM for detecting mean and variance shifts in non-correlated data using the dataset generated by Smith [Smi94], and compare the results with those of SVM [Chi02] and MLP [Smi94]. The dataset has 300 samples from the in-control state and from each of the out-of-control states of large mean shift (LM), small mean shift (SM), large variance shift (LV), and small variance shift (SV); the parameters are given in Table 7. The results are reported in three categories. In the first category, the results of GSVRM are compared with those of the Shewhart chart, MLP and SVM. Note that MLP and SVM create a distinct model for each abnormality type (e.g., small mean shift, large variance shift); thus, a distinct GSVRM model is created for each type of abnormality in order to achieve a fair comparison. In the second category, two different GSVRM results are reported: GSVRM(1), which is trained with only in-control data, and GSVRM(2), which is trained with in-control and limited out-of-control data. Note that MLP and SVM cannot be implemented in the absence of out-of-control samples. In the third category, the results of GSVRM trained with very limited in-control (i.e., 25 samples) and out-of-control data (i.e., 10 samples) are reported. Table 7: Parameters of testing datasets

State Label                    µ    σ
Normal Behavior (Nor)          0    1
Small Mean Shift (SM)          1    1
Large Mean Shift (LM)          3    1
Small Variance Shift (SV)      0    2
Large Variance Shift (LV)      0    3

Table 8: Classification accuracy of the Shewhart chart, MLP, SVM and GSVRM for the Smith dataset


in: in-control data, out: out-of-control data. GSVRM (1): training only with 300 normal samples. GSVRM (2): training with 300 normal and 200 abnormal samples (100 small mean shift, 100 small variance shift).

               Shewhart Chart   MLP     SVM               GSVRM
               Test             Test    Train    Test     Train    Test
Small Shift    72%              73%     93%      91%      92%      91%
Large Shift    N/A              100%    100%     100%     99%      96%

In the first category, classification accuracy (rather than Type I and Type II errors) is the metric reported for MLP, the Shewhart chart and SVM; thus, the classification accuracy of GSVRM is calculated in order to compare against these methods. It is computed as the weighted average of the per-class accuracies (one minus the Type I and Type II errors), weighted by the number of patterns used for in-control and out-of-control data. As seen from Table 8, GSVRM gives better results than MLP and the Shewhart chart, and SVM is only 1% to 4% better than GSVRM. Even though SVM gives somewhat better results than GSVRM, there are two difficulties in implementing SVM and MLP in many real-world problems: 1. A distinct model is developed for each out-of-control state for SVM and MLP. There are four out-of-control states (SM, LM, SV, LV), resulting in four SVM and four MLP models. Even though each model works well for the out-of-control state it was developed for, it cannot work effectively for the other out-of-control states. In addition, SVM and MLP may not be sensitive to an undefined out-of-control state. In contrast, GSVRM characterizes the in-control state of the process and has the ability to detect any undefined and unseen type of out-of-control state. 2. SVM and MLP are trained with 300 samples from the in-control state and 300 samples from each modeled out-of-control state, resulting in four models using a total of 2400 samples. In contrast, GSVRM uses only 300 samples from the in-control state and 200 samples from out-of-control states.
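The conversion from Type I/II errors to a single classification accuracy works out as a pattern-count-weighted average (a sketch; the counts below are illustrative):

```python
def weighted_accuracy(type_i, n_in, type_ii, n_out):
    """Overall classification accuracy as a pattern-count-weighted average
    of the per-class accuracies (1 - Type I) and (1 - Type II)."""
    correct = (1 - type_i) * n_in + (1 - type_ii) * n_out
    return correct / (n_in + n_out)

# e.g. 300 in-control patterns at 16% Type I error and
#      200 out-of-control patterns at 15% Type II error
acc = weighted_accuracy(0.16, 300, 0.15, 200)
```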


Table 9: Type I and Type II errors for non-correlated data using GSVRM

In*: in-control state, Out*: out-of-control state

Smith Data     GSVRM (1)                          GSVRM (2)
               Train            Test              Train            Test
Actual         In*     Out*     In*     Out*      In*     Out*     In*     Out*
In-control     100%    0%       84%     16%       88%     12%      90%     10%
SM             N/A     N/A      5%      95%       1%      99%      7%      93%
LM             N/A     N/A      0%      100%      N/A     N/A      0%      100%
SV             N/A     N/A      15%     85%       4%      96%      9%      91%
LV             N/A     N/A      15%     85%       N/A     N/A      8%      92%

In the second category, we report GSVRM results with no and with limited out-of-control data. In the former case (GSVRM(1)), only 300 samples of in-control data are used for training; in the latter case (GSVRM(2)), 300 samples of in-control data plus 100 samples of small mean shift and 100 samples of small variance shift are employed. The Type I and Type II errors given in Table 9 are promising, with 16% Type I error and at most 15% Type II error for GSVRM(1), and 10% Type I error and at most 9% Type II error for GSVRM(2). In the third category, GSVRM is implemented with very limited in-control data: 25 training patterns from the Smith [Smi94] dataset. The results are shown in Table 10. Table 10: Type I and Type II errors for non-correlated data with a limited number of in-control and out-of-control samples (i.e., 25 in-control and 10 out-of-control patterns)

               GSVRM
Actual         in*     out*
in             88%     12%
SM             9%      91%
LM             1%      99%
SV             8%      92%
LV             4%      96%

As seen from Table 10, GSVRM detects out-of-control conditions at least 91% of the time (at most 9% Type II error), with a false alarm rate of only 12% (Type I error), even with this limited data size.


GSVRM for correlated manufacturing processes

One of the major problems in applying conventional SPC techniques in many industries is auto-correlated data. As mentioned earlier, several attempts have been made in the literature to monitor auto-correlated data by extending traditional SPC techniques as well as by applying pattern recognition techniques. These techniques suffer from one or more difficulties: high Type I or Type II error, and/or the necessity of out-of-control data for training. In this section, the papermaking dataset from Pandit and Wu [PW83] and the viscosity dataset from Box and Jenkins [BL97] are used as examples of correlated manufacturing data; they are shown in Figures 26 and 27. These data have been used in methodological benchmarking studies by researchers including Cook [CC98] and Chinnam [Chi02]. Cook used radial basis function (RBF) networks; however, the results were not impressive, with around 50% Type I error on the viscosity dataset. Chinnam achieved good classification accuracy with SVM; however, SVM has the aforementioned implementation difficulties: the need for out-of-control data and the need to develop a model for each type of out-of-control state. Cook and Chiu reported that an autoregressive model of lag 1 (i.e., AR(1)) is appropriate for both the papermaking and viscosity datasets illustrated in Figures 26 and 27. Auto-regression is a form of regression in which the dependent variable is related to its own past values at different time lags, and an AR(1) process can be represented as follows:

X_t = µ + φ (X_{t-1} - µ) + ε_t    (4.32)

where X_t is the value at time t, µ is the mean of the time series, X_{t-1} is the value at time t - 1, ε_t is a normally, independently distributed error term, and φ is the autoregressive coefficient, between -1 and +1.
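Equation (4.32) can be simulated directly; the sketch below uses the papermaking parameters from equation (4.33) (the seed and series length are illustrative):

```python
import random

def simulate_ar1(mu, phi, sigma_eps, n, seed=42):
    """Generate X_t = mu + phi * (X_{t-1} - mu) + eps_t with eps_t ~ N(0, sigma_eps)."""
    rng = random.Random(seed)
    x, series = mu, []
    for _ in range(n):
        x = mu + phi * (x - mu) + rng.gauss(0, sigma_eps)
        series.append(x)
    return series

# Papermaking-like process, cf. equation (4.33)
series = simulate_ar1(mu=32.02, phi=0.90, sigma_eps=0.4359, n=160)
```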


Figure 26: Viscosity Data

Figure 27: Papermaking Data

The reported models for the papermaking and viscosity data are as follows:

Papermaking:

X_{P,t} = 32.02 + 0.90 (X_{P,t-1} - 32.02) + ε_{P,t}    (4.33)

ε_{P,t} ~ N(0, σ²_{P,ε})    (4.34)

Viscosity:

X_{V,t} = 9.10 + 0.86 (X_{V,t-1} - 9.10) + ε_{V,t}    (4.35)

ε_{V,t} ~ N(0, σ²_{V,ε})    (4.36)

The raw data are standardized to have mean 0 and standard deviation 1. The resulting time series models are as follows:

Papermaking:

X̃_{P,t} = 0.90 X̃_{P,t-1} + ε̃_{P,t}    (4.37)

ε̃_{P,t} ~ N(0, σ̃²_{P,ε})    (4.38)

Viscosity:

X̃_{V,t} = 0.86 X̃_{V,t-1} + ε̃_{V,t}    (4.39)

ε̃_{V,t} ~ N(0, σ̃²_{V,ε})    (4.40)

The standard deviation of the new error term can be calculated as σ̃_ε = σ̃_x √(1 - φ²), which gives 0.4359 and 0.5103 for the papermaking and viscosity datasets, respectively. The sizes of the papermaking and viscosity datasets used in [Chi02, CC98] are 123 and 227 patterns for training and 35 and 80 for testing, respectively. The process mean is shifted by one, one and a half, and two standard deviations in order to obtain three different out-of-control datasets. Three inputs are used for GSVRM training: the previous value at time t - 1 (i.e., X̃_{t-1}), the current value at time t (i.e., X̃_t), and the time series prediction based on the previous value (i.e., X̂_t = φ X̃_{t-1}). As with the Smith dataset, we report the results in three categories. In the first category, a distinct GSVRM model is developed for each abnormality type in order to have a fair comparison with RBF and SVM. In the second category, two GSVRM models are trained: GSVRM(1) with only in-control data (no shifted data is used for training) and GSVRM(2) with in-control and limited out-of-control data (i.e., 35 out-of-control samples for papermaking and 80 for viscosity). In the third category, GSVRM is trained with very limited in-control and out-of-control samples. The results of the first category are displayed in Tables 11 and 12 for the papermaking and viscosity datasets, respectively. As seen from the tables, GSVRM gives better results than RBF in all cases. In both datasets, SVM is 1-3% better than GSVRM for the 1.5 and 2 standard deviation shifts. For the 1 standard deviation shift, the two are roughly equivalent on the viscosity dataset, whereas SVM is 9% better than GSVRM on the papermaking dataset.
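The three GSVRM inputs can be assembled from a standardized series as below (a sketch; the short example series is illustrative, and the one-step AR(1) prediction used for the third input is X̂_t = φ X̃_{t-1}):

```python
def gsvrm_inputs(series, phi):
    """Build (x_{t-1}, x_t, xhat_t) feature vectors from a standardized AR(1)
    series, where xhat_t = phi * x_{t-1} is the one-step prediction."""
    return [(series[t - 1], series[t], phi * series[t - 1])
            for t in range(1, len(series))]

# Standard deviation of the standardized error term: sigma_eps = sqrt(1 - phi^2)
sigma_eps_paper = (1 - 0.90 ** 2) ** 0.5   # papermaking, phi = 0.90
sigma_eps_visc = (1 - 0.86 ** 2) ** 0.5    # viscosity,   phi = 0.86

feats = gsvrm_inputs([0.0, 0.5, -0.2, 0.1], phi=0.90)   # toy standardized series
```

Note that the two computed standard deviations reproduce the 0.4359 and 0.5103 values quoted above.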


Table 11: Type I and Type II errors for the papermaking dataset using RBF, SVM, and GSVRM for 1, 1.5, and 2 standard deviation shifts.

Std. Dev.                     RBF             SVM                             GSVRM
Shift     Actual              Test            Train          Test            Train          Test
                              in*    out*     in*    out*    in*    out*     in*    out*    in*    out*
1         In control          91%    9%       91%    9%      95%    5%       97%    3%      98%    2%
1         Out-of-control      15%    85%      0%     100%    0%     100%     11%    89%     9%     91%
1.5       In control          100%   0%       97%    3%      96%    4%       98%    2%      88%    12%
1.5       Out-of-control      9%     91%      0%     100%    0%     100%     0%     100%    3%     97%
2         In control          100%   0%       97%    3%      98%    2%       98%    2%      88%    12%
2         Out-of-control      0%     100%     0%     100%    0%     100%     0%     100%    0%     100%

The results of the second category are displayed in Tables 13 and 14, which show the strength of GSVRM with limited or no out-of-control samples. GSVRM gives better results than RBF in all cases. GSVRM(2) can be implemented with limited out-of-control data at the price of roughly 1% to 5% lower accuracy than SVM (e.g., 91% vs. 85%) for the papermaking and viscosity datasets, respectively. GSVRM(1) shows reasonable results when no out-of-control data are available at all, at the price of 10% to 13% lower accuracy (e.g., 91% vs. 80% and 88% vs. 75%). For the reasons discussed earlier, GSVRM is therefore more practical than SVM for process control while maintaining reasonably close classification accuracy. Table 12: Type I and Type II errors for the viscosity dataset using RBF, SVM, and GSVRM for 1, 1.5, and 2 standard deviation shifts.

Std. Dev.                     RBF             SVM                             GSVRM
Shift     Actual              Test            Train          Test            Train          Test
                              in*    out*     in*    out*    in*    out*     in*    out*    in*    out*
1         In control          46%    54%      97%    3%      88%    12%      92%    8%      98%    2%
1         Out-of-control      4%     96%      0%     100%    0%     100%     15%    85%     8%     92%
1.5       In control          89%    11%      97%    3%      100%   0%       98%    2%      97%    3%
1.5       Out-of-control      2%     98%      0%     100%    0%     100%     1%     99%     3%     97%
2         In control          99%    1%       96%    4%      100%   0%       99%    1%      99%    1%
2         Out-of-control      2%     98%      0%     100%    0%     100%     0%     100%    1%     99%


In the third category, GSVRM is applied to the viscosity and papermaking datasets using limited data (i.e., 25 patterns from the in-control state and 10 patterns from the out-of-control state). The results are shown in Table 15. Table 13: Type I and Type II errors for the papermaking dataset using GSVRM for 1, 1.5, and 2 standard deviation shifts. GSVRM (1): training with only 123 in-control patterns. GSVRM (2): training with 123 in-control and 35 out-of-control patterns.

               GSVRM (2)                        GSVRM (1)
               Train          Test              Train          Test
Actual         in*    out*    in*    out*       in*    out*    in*    out*
Nor            98%    2%      97%    3%         88%    12%     80%    20%
Sh1            11%    89%     15%    85%        N/A    N/A     2%     98%
Sh1.5          N/A    N/A     0%     100%       N/A    N/A     0%     100%
Sh2            N/A    N/A     0%     100%       N/A    N/A     0%     100%

Table 14: Type I and Type II errors for the viscosity dataset using GSVRM for 1, 1.5, and 2 standard deviation shifts. GSVRM (1): training with only 227 in-control patterns. GSVRM (2): training with 227 in-control and 80 out-of-control patterns.

               GSVRM (2)                        GSVRM (1)
               Train          Test              Train          Test
Actual         in*    out*    in*    out*       in*    out*    in*    out*
Nor            97%    3%      98%    2%         99%    1%      97%    3%
Sh1            15%    85%     8%     92%        N/A    N/A     25%    75%
Sh1.5          N/A    N/A     4%     96%        N/A    N/A     7%     93%
Sh2            N/A    N/A     0%     100%       N/A    N/A     1%     99%

Table 15: Type I and Type II errors with limited in-control and out-of-control data

a) Viscosity Dataset                  b) Papermaking Dataset
           GSVRM                                 GSVRM
Actual     in*    out*                Actual     in*    out*
In         72%    28%                 in         71%    29%
1Sh        1%     99%                 1Sh        12%    88%
1.5Sh      0%     100%                1.5Sh      0%     100%
2Sh        0%     100%                2Sh        0%     100%

As seen from Table 15, GSVRM is able to detect out-of-control states with 29% (28%) Type I error and at most 12% (1%) Type II error for the papermaking (viscosity) dataset.


The optimal parameters used in GSVRM for all of the implemented datasets are given in Table 16. Training only with in-control data requires two parameters: a penalty value for misclassification of in-control data (C) and a compactness parameter (ς). The penalty value for misclassification of out-of-control data (C_o) is additionally required when out-of-control data are available. Table 16: Parameters used in the GSVRM implementation on the datasets

               Training only with     Training with in-control and
               in-control data        limited out-of-control data
Dataset        C       ς              C       C_o     ς
Normal         0.5     1              0.3     0.1     0.8
Lognormal      0.1     0.9            0.1     0.1     0
Exponential    0.1     0.4            0.1     0.1     0.2
Smith Alice    0.1     1              0.1     0.9     0.1
Viscosity      0.4     0              0.9     0.9     0
Papermaking    1       0.9            0.8     0.8     0.9

4.6 CONCLUSION

A new process control technique based on support vector machines, GSVRM, has been presented. GSVRM has several advantages over conventional SPC techniques and the pattern recognition methods in the literature. GSVRM makes no assumption about the data distribution, which is a fundamental restriction of conventional SPC techniques. GSVRM also supports multivariate and auto-correlated processes, which violate the independence assumption of conventional SPC techniques. In addition, conventional SPC techniques cannot benefit from available out-of-control data, whereas GSVRM can learn from out-of-control samples where applicable. Pattern recognition methods used for process control in the literature, such as SVM and RBF, require excessive amounts of in-control as well as out-of-control data. Furthermore, with RBF and SVM, a distinct model needs to be created for each type of out-of-control process and trained using both in-control and out-of-control data. These models are not sensitive to out-of-control conditions other than those


on which they are trained. In contrast, GSVRM characterizes in-control processes and requires only in-control data; thus, it is sensitive to all types of out-of-control processes. It has also been shown that GSVRM gives reasonable results with limited in-control and out-of-control data.

APPENDIX

GSVRM Algorithm

Step 1: Determine the data points close to the boundary (i.e., create the boundary list).
1.1 Calculate the average nearest-neighbor distance (nd).
1.2 For each data point i, construct a local GSVRM with those data points inside the sphere of radius 2 × nd centered at the current data point i.
1.3 Calculate an inner GSVRM with a reduced threshold.
1.4 If the data point is rejected by the inner GSVRM (meaning it lies outside the threshold boundary), add the point to the boundary list.
1.5 Choose the next data point and go to step 1.2.

Step 2: Calculate the global GSVRM using an optimal σ. The σ value that gives the best fit or agreement between the support vectors of the global GSVRM and the boundary list from Step 1 is chosen as the optimal σ.

Step 3: When a new data point becomes available, update the boundary points as follows:
3.1 Construct a local GSVRM for the new data point as in Step 1.
3.2 Calculate an inner GSVRM with a reduced threshold.
3.3 Add the new data point to the boundary list if it is rejected by the inner GSVRM.
3.4 Repeat steps 3.1 and 3.2 for all data points that are within the local GSVRM sphere of the new data point and in the boundary list.
3.5 Remove any such data point from the boundary list if it falls inside its inner GSVRM.


3.6 Check for a better fit of the global GSVRM by increasing and decreasing σ; if a different σ leads to a better fit, update σ.
3.7 Calculate the minimum-volume hyper-sphere using the updated boundary points.
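Step 1 of the algorithm can be sketched as follows. This is a simplified stand-in, not the actual GSVRM fit: the 2 × nd neighborhood is kept from the algorithm, but the inner-GSVRM rejection test is replaced by an illustrative proxy that flags a point when it lies far from the centroid of its local neighborhood:

```python
def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def avg_nearest_neighbor_distance(points):
    """Step 1.1: average distance from each point to its nearest neighbor (nd)."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def boundary_list(points, ratio=0.5):
    """Steps 1.2-1.5 with a proxy test: a point is flagged as near the
    boundary when it sits far from the centroid of its 2*nd neighborhood."""
    nd = avg_nearest_neighbor_distance(points)
    boundary = []
    for p in points:
        local = [q for q in points if dist(p, q) <= 2 * nd]
        centroid = tuple(sum(c) / len(local) for c in zip(*local))
        spread = max(dist(centroid, q) for q in local)
        if spread > 0 and dist(centroid, p) > ratio * spread:
            boundary.append(p)
    return boundary

# On a 5 x 5 grid, only the extreme (corner) points get flagged
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
edge = boundary_list(grid)
```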


CHAPTER V

HEALTH-STATE ESTIMATION AND DIAGNOSTICS USING HIDDEN MARKOV MODEL COMMITTEES

5.1 INTRODUCTION

Condition-Based Maintenance (CBM) technology increases system availability and safety while reducing costs, attributable to reduced maintenance and inventory, increased capacity, and enhanced logistics and supply chain performance [Cam05]. Unlike time-based preventive maintenance and corrective maintenance practices, CBM aims to avoid both unnecessary maintenance actions and machine failures. Jay Lee and Jun Ni, the Co-Directors of the National Science Foundation (NSF) center for Intelligent Manufacturing Systems, estimate that $35 billion per year would be saved in the US alone if CBM technology were widely employed [Har03]. Employing effective diagnostic and prognostic algorithms/methods is an important prerequisite for widespread deployment of CBM [Nist98]. Diagnostics is the process of identifying, localizing and determining the severity of a machine failure, whereas prognostics is the process of estimating the remaining-useful-life (RUL) [MBND99]. Diagnostics is, in essence, a classification problem, and many methods have been proposed and implemented in the literature; this is in stark contrast to prognostics. However, most diagnostic algorithms have limited potential in that they cannot detect failure modes in a timely manner. See [Cam05] for a good review of popular diagnostic algorithms and methods. The failure mechanisms of mechanical systems usually involve several degraded health states. For example, a tiny change in a bearing's position could cause a small nick in the bearing, which could cause scratches in the bearing race over time, which then could cause additional nicks, which could finally lead to complete bearing failure [KZX03]. Tracking the


health state of a machine is critical for detecting, identifying, and localizing a failure as well as for carrying out proper maintenance. Effectively diagnosing the earliest stages of a failure (i.e., the health state), even while the machine is still serving its intended function, is not only important but is a prerequisite for prognostics. Since real health states cannot be observed directly, it is logical to estimate them through their effects on observed sensor signals. The primary challenge within diagnostics is then to achieve high classification accuracy in identifying health states given sensory signals (such as vibration, current, temperature, etc.). Hidden Markov Models (HMMs) characterize doubly embedded stochastic processes: an underlying stochastic process that can be observed only through another stochastic process. They have been successful in tackling such difficult tasks as automatic speech recognition (ASR) [San00, WH99]. The tasks of ASR and equipment diagnostics have many commonalities. Speech signals are quasi-stationary, and so are the sensory signals collected from machines, such as vibration [BMA00]. Quasi-stationary signals in some sense terminate in an absorbing state and show stationary behavior on any reasonable time scale [DV02]. In addition, in ASR the same words must be recognized even though they are spoken by different speakers; likewise, in diagnostics the same health states must be recognized even though machine behavior can differ considerably due to differences in machining, part-size variation, fastener tightness, wear variation, and aging. Beyond these commonalities, implementing HMMs in diagnostics is more difficult than in speech recognition. For example, the number of phonemes in ASR is a relatively small finite set (resulting in sound and word libraries), a notion that is neither observed nor justified in diagnostics.
In addition, S/N ratios tend to be far better in speech signals than in machine sensor signals. On the other hand, a speech signal remains stationary only over intervals of about 10 ms, whereas machine vibration signals remain stationary on time scales of many seconds or even minutes [BMA00].


Regular HMMs have been implemented for health-state estimation in the literature [KZX03, CB03]. However, the standard assumption behind regular HMMs, namely the independence of an observation from previous observations within a health state, may not hold in many monitoring applications that involve high data acquisition rates. Auto-regressive HMMs (AR-HMMs) allow us to relax this assumption. Furthermore, regular HMMs and AR-HMMs tend to be limited in their ability to represent complex systems, and their training process (i.e., competitive learning) is computationally tedious. The hierarchical HMM (HHMM), a variant of the HMM composed of several sub-HMMs in a hierarchical fashion, strengthens the ability of an HMM to jointly represent multiple health states along with their state-transition properties. In this chapter, we present the implementation of regular, auto-regressive, and hierarchical HMMs as dynamic Bayesian networks for health-state estimation. The chapter is organized as follows: Section 5.2 briefly describes HMMs; Section 5.3 discusses dynamic Bayesian networks (DBNs) and the implementation of an HMM as a DBN; Section 5.4 presents results from the implementation of regular HMMs, AR-HMMs and HHMMs for health-state estimation; finally, Section 5.5 offers some concluding remarks.

5.2 BACKGROUND: HIDDEN MARKOV MODELS

A stochastic system can be described as being in one of a finite number of states at any time. The system evolves through the states according to a set of probabilities associated with each state, as demonstrated in Figure 28. The model is called a Hidden Markov Model if the states are not observable (hidden) and are assumed to be causing the observations. In general, the system behavior depends on the current state and its predecessor states; a first-order HMM assumes that only the current state is responsible for producing the observations. In the rest of this chapter, HMM implies a first-order HMM.


To better understand HMMs, consider an urn-and-ball system with 3 urns, each containing a different number of colored balls [Law89]. An urn is selected randomly, a ball is chosen from this urn, its color is recorded, and the ball is replaced in the urn it was chosen from. In the next step, a new urn is selected, and a ball is chosen from it and recorded. This process is repeated a finite number of times, producing a finite observation sequence (of colors). Now assume that the process takes place in a different room and is carried out by somebody else, so that we do not see which urns are selected; the only events observable to us are the colors of the selected balls. The simplest HMM representation of this system is one in which the states are the urns, each with its own color probabilities.


Figure 28: A Markov chain with 6 states and state transition probabilities (arrows represent non-zero state transition probabilities)

There are several elements to an HMM: the number of states (N), the observations, the state transition probability distribution, the observation probability distribution, and the initial state distribution. X_t denotes the state at time t, and O_t denotes the observation at time t, which may be either a discrete symbol, O_t ∈ {1, ..., L}, or a feature vector from an L-dimensional space, O_t ∈ R^L. The state transition probability distribution models the probability of being in state i at time t given that the system was in state j at time t - 1, denoted A = {a_ij} = P(X_t = i | X_{t-1} = j). The observation probability distribution defines the probability of observing k at time t given state i, denoted B = {b_i(k)} = P(O_t = k | X_t = i). These distributions are either probability mass functions, in the case of discrete observations, or specified using a parametric model family (commonly Gaussian) in the case


of continuous observations. The initial state distribution is the probability of being in state i at t = 1, denoted π(i) = P(X_1 = i). Generally, λ = (A, B, π) is used to denote an HMM. There are three basic problems of interest given the above model specification:

• How to compute the probability of the observation sequence O = O_1 O_2 ... O_T given the model λ (i.e., P(O_1 O_2 ... O_T | λ))?

• How to identify the most likely state sequence that might have produced the observation sequence?

• How to adjust the parameters of λ in order to maximize the likelihood of the given observation sequence?

These three problems are tightly linked and have been studied extensively in the literature.
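The first of these problems is solved by the forward algorithm in O(TN²) time for a single state chain; a minimal discrete-observation sketch (the toy parameters below are illustrative, not from the dissertation):

```python
def forward(obs, A, B, pi):
    """Compute P(O | lambda) for a discrete-observation HMM.
    A[j][i] = P(X_t = i | X_{t-1} = j); B[i][k] = P(O_t = k | X_t = i)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]           # initialization
    for o in obs[1:]:                                          # induction
        alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][o]
                 for i in range(N)]
    return sum(alpha)                                          # termination

A = [[0.7, 0.3], [0.4, 0.6]]   # state transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # observation probabilities
pi = [0.5, 0.5]                # initial state distribution
p = forward([0, 1, 0], A, B, pi)
```

Summing the result over all possible observation sequences of a fixed length returns 1, a quick sanity check on the recursion.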

When several such state variables are modeled jointly, the standard HMM solution to these problems requires an exponential number of parameters to specify the transition and observation models (since it works over the Cartesian product of the state spaces of the individual variables). This means requiring excessive amounts of data to learn the model (high sample complexity) and exponential time for inference; for example, the forwards-backwards cycle takes O(T k^(2N)) operations for N variables of k states each (high computational complexity). For more detailed information about HMMs, see [Law89]. Dynamic Bayesian Networks (DBNs) can represent HMMs more efficiently and alleviate some of these problems, with the added flexibility of implementing different variants of the HMM. Bayesian networks are discussed in the next section.


variables as nodes and their relationships as edges between nodes; the conditional probabilities of the variables are the third element of such graphical models. There are two types of graphical models: directed and undirected. In directed models, edges have a direction representing a cause-effect relationship (i.e., the starting node of an edge causes the ending node). Undirected graphical models, on the other hand, have undirected edges that represent correlation between variables. A Bayesian network is a directed acyclic graph (DAG), in which all edges are directed and there are no cycles within the network (i.e., there is no way to start from a node, travel along the directed edges, and come back to the starting node) [Mur01,

Ste00]. The conditional probability of a node depends only on its parents; given its parents, a node is independent of its remaining ancestors (e.g., its parents' parents). This is called the factorization property, and it dramatically reduces the number of parameters. The graphical model in Figure 29 is a Bayesian network, since all the edges are directed and there are no cycles in the model. As seen from the model, A causes B and C (i.e., A is the parent of B and C), and B and C are the parents of D. Nodes with no edge between them are conditionally independent given their parents, which is what causes the dramatic reduction in the number of parameters. For example, nodes B and C are conditionally independent given A; thus P(B | A, C) = P(B | A) and P(C | A, B) = P(C | A). Also note that P(D | A, B, C) = P(D | B, C) for the given graphical model, since A is an ancestor, not a parent, of D. The joint distribution of all the variables can be represented as follows:

P(A, B, C, D) = P(A) · P(B | A) · P(C | A, B) · P(D | A, B, C)    (5.1)

This equation can be simplified by factorization as follows:

P(A, B, C, D) = P(A) · P(B | A) · P(C | A) · P(D | B, C)    (5.2)

In general, the joint probability function of any Bayesian network can be written as

88

P ( X ) = ∏ P ( X i | Parents ( X i ) ) , where X = X , X 2 , X 3 ,... X n n

(5.3)

i =1

Figure 29: Bayesian Network: Directed acyclic graphical model

In the next sub-section, inference in Bayesian networks is discussed.

5.3.1 Inference in Bayesian Network

Inference is the calculation of the probabilities of all possible values of a node when the values of some other nodes become known. To illustrate inference, assume that all the nodes in Figure 29 represent binary variables and that their conditional probabilities are as given in Table 17.

Table 17: Conditional probabilities of nodes for the given Bayesian network.

Node A:  P(A=0) = 0.2,   P(A=1) = 0.8

Node B:  A=0:  P(B=0) = 0.2,   P(B=1) = 0.8
         A=1:  P(B=0) = 0.9,   P(B=1) = 0.1

Node C:  A=0:  P(C=0) = 0.8,   P(C=1) = 0.2
         A=1:  P(C=0) = 0.25,  P(C=1) = 0.75

Node D:  B=0, C=0:  P(D=0) = 0.65,  P(D=1) = 0.35
         B=1, C=0:  P(D=0) = 0.1,   P(D=1) = 0.9
         B=0, C=1:  P(D=0) = 0.15,  P(D=1) = 0.85
         B=1, C=1:  P(D=0) = 0.4,   P(D=1) = 0.6

Suppose that the values of some variables become available and we need to update our belief about a node given this evidence. For example, to calculate the probability that variable A is 0 given B=1, C=0, and D=0, we write:

P(A=0 | B=1, C=0, D=0) = P(A=0, B=1, C=0, D=0) / P(B=1, C=0, D=0)   (5.4)

The numerator and denominator can be transformed to:

= [P(A=0) ⋅ P(B=1|A=0) ⋅ P(C=0|A=0) ⋅ P(D=0|B=1, C=0)] / [P(B=1, C=0, D=0|A=0) ⋅ P(A=0) + P(B=1, C=0, D=0|A=1) ⋅ P(A=1)]   (5.5)

The two terms in the denominator are calculated as:

P(B=1, C=0, D=0 | A=0) = P(B=1|A=0) ⋅ P(C=0|A=0) ⋅ P(D=0|B=1, C=0) = 0.8 × 0.8 × 0.1 = 0.064   (5.6)

P(B=1, C=0, D=0 | A=1) = P(B=1|A=1) ⋅ P(C=0|A=1) ⋅ P(D=0|B=1, C=0) = 0.1 × 0.25 × 0.1 = 0.0025   (5.7)

Thus,

P(A=0 | B=1, C=0, D=0) = (0.2 × 0.8 × 0.8 × 0.1) / (0.064 × 0.2 + 0.0025 × 0.8) = 0.0128 / 0.0148 ≈ 0.865   (5.8)
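The inference above amounts to evaluating the factorized joint distribution and normalizing over the query variable. The following is a minimal Python sketch of inference by enumeration for this four-node network, using the conditional probability values from the worked example (P(A=0) = 0.2 assumed); a real application would use a Bayesian network library rather than hand-coded tables.

```python
# Exact inference by enumeration in the four-node network of Figure 29.
# CPT values follow the worked example; P(A=0) = 0.2 is assumed.
pA = {0: 0.2, 1: 0.8}                      # P(A)
pB = {0: {0: 0.2, 1: 0.8},                 # P(B | A)
      1: {0: 0.9, 1: 0.1}}
pC = {0: {0: 0.8, 1: 0.2},                 # P(C | A)
      1: {0: 0.25, 1: 0.75}}
pD = {(0, 0): {0: 0.65, 1: 0.35},          # P(D | B, C)
      (1, 0): {0: 0.1, 1: 0.9},
      (0, 1): {0: 0.15, 1: 0.85},
      (1, 1): {0: 0.4, 1: 0.6}}

def joint(a, b, c, d):
    # Factorized joint distribution of the network
    return pA[a] * pB[a][b] * pC[a][c] * pD[(b, c)][d]

def posterior_A(b, c, d):
    # P(A | B=b, C=c, D=d): normalize the joint over the values of A
    evidence = sum(joint(a, b, c, d) for a in (0, 1))
    return {a: joint(a, b, c, d) / evidence for a in (0, 1)}

belief = posterior_A(b=1, c=0, d=0)
print(round(belief[0], 3))  # posterior probability of A = 0
```

Enumeration is exponential in the number of variables; for larger networks, junction-tree or variable-elimination algorithms exploit the factorization instead.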

This process is applied to all possible values of variable A to obtain the belief over A. In the next section, learning in Bayesian networks is discussed.

5.3.2 Learning in Bayesian Network

The structure (i.e., nodes and edges, also called the topology) and the conditional probability distributions of all nodes need to be defined for a Bayesian network. Both can be learned from data, although structure learning is much harder than learning only the conditional probabilities. Models with only observable nodes and models that include hidden nodes need different learning approaches; thus, there are four combinations for learning in Bayesian networks. Table 18 displays these combinations and the methods used for each. Bayesian learning can be defined as the optimization of the conditional probability distribution parameters and the structure (in case of unknown structure) to maximize the likelihood of the data.

Table 18: Learning methods for different problems [MM99].

Structure   Observability   Method
Known       Full            Maximum Likelihood Estimation
Known       Partial         Expectation Maximization or gradient ascent
Unknown     Full            Search through model space
Unknown     Partial         Expectation Maximization + search through model space


Structure known and full observability:

In the case of known topology and full observability, the conditional probabilities are simply statistics calculated from the data. For example, the probability of D = d1 given B = b1 and C = c1 can be approximated as the ratio of the number of times D = d1, B = b1, C = c1 occur together to the number of times B = b1 and C = c1 occur together. This is written formally in (5.9) and (5.10):

P(D = d1 | B = b1, C = c1) = P(D = d1, B = b1, C = c1) / P(B = b1, C = c1)   (5.9)

≈ (number of times D = d1, B = b1, C = c1) / (number of times B = b1, C = c1)   (5.10)
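As a concrete sketch of the counting estimator in (5.10), the conditional probability table for D can be filled in directly from fully observed samples; the data tuples below are hypothetical.

```python
from collections import Counter

# Maximum-likelihood CPT estimation from fully observed samples,
# per equation (5.10): P(D=d | B=b, C=c) ~ count(b, c, d) / count(b, c).
# Toy data: (b, c, d) observations (hypothetical values).
data = [(1, 0, 0), (1, 0, 1), (1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 0)]

joint_counts = Counter((b, c, d) for b, c, d in data)   # count(b, c, d)
parent_counts = Counter((b, c) for b, c, _ in data)     # count(b, c)

def p_d_given_bc(d, b, c):
    return joint_counts[(b, c, d)] / parent_counts[(b, c)]

print(p_d_given_bc(1, 1, 0))  # → 0.75: 3 of 4 samples with B=1, C=0 have D=1
```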

Structure known and partial observability:

In the case of known topology with partial observability, which may occur because of hidden states or missing data, expectation maximization (EM) or gradient ascent methods are used to estimate the conditional probability distributions. EM calculates the statistics using expected values for the hidden nodes; the expected values are calculated using the inference procedure of the previous section. EM is preferred here because gradient ascent additionally requires learning-rate and step-size parameters. Thus, we use EM for learning.

P(D = d1 | B = b1, C = c1) ≈ (expected number of times D = d1, B = b1, C = c1) / (expected number of times B = b1, C = c1)   (5.11)

In the case of unknown topology, the model space is searched and candidate models are evaluated by their likelihood given the data. Readers are referred to [Mur02] for detailed information. In the next section, dynamic Bayesian networks are discussed.

5.3.3 Dynamic Bayesian Network

Standard Bayesian networks do not deal with time; dynamic Bayesian networks address how the variables change over time. A Dynamic Bayesian Network (DBN) models probability distributions over sequences of random variables in order to handle sequential observations generated by underlying hidden states that evolve in time [DK89]. A DBN consists of two networks: a prior network and a transition network. The prior network represents the prior probabilities of all variables in the initial time slice (i.e., t = 0). The transition network represents the probabilities of all variables in all other time slices (i.e., t = 1, 2, ..., n) conditioned on the variables in the previous slice. The prior network, the transition network, and their combination are illustrated in Figures 30a, 30b, and 30c, respectively. For the rest of this work, shaded nodes represent observable variables, whereas blank nodes represent hidden variables. HMMs, Kalman filters, principal component analysis, vector quantization, etc. are all variants of dynamic Bayesian networks [Smy98, RG99]. This research involves the implementation of HMMs as DBNs, which is discussed in the next sub-section. See [Mur02] for more detailed information about DBNs.

Figure 30: Example of Dynamic Bayesian Network a) Prior Network b) Transition Network c) Dynamic Bayesian Network: Combination of prior and transition networks

5.3.4 Dynamic Bayesian Network as Hidden Markov Model

The goal of a DBN acting as an HMM is to infer the hidden state given the observation sequence, i.e., to compute P(X_t = i | O_1:t). First, the structure of the DBN needs to be defined. The DBN structure consists of different levels, such as the observation level (O nodes, shaded) and the hidden state level (X nodes, not shaded), as illustrated in Figure 31. The observation at time t is generated by hidden state X_t and represented as node O_t. Note that every level needs to be represented at every time t. Instead of representing the DBN with all time slices as in Figure 31.a, the transition network can be represented by one time slice, leading to the more compact representation of the DBN seen in Figure 31.b. This compact (rolled) representation reduces the number of distinct nodes to be defined. For example, there are two distinct nodes (X_1 and X_t) in the hidden state level in Figure 31.b instead of M nodes (M: total number of slices). In the observation level, all observations can be represented by one node (O), since all of them have the same parent (X_t). As a result, the DBN has three distinct nodes (X_1, X_t, and O), as shown in Figure 31.b. In the second step, the conditional probability distribution of each node given its parents needs to be defined. These include the initial probability distribution P(X_1 = i), the state transition distribution P(X_t | X_t−1), and the observation distribution P(O_t | X_t). It is also assumed here that the transition and observation functions do not change over time. In this work, the observation distribution is assumed to be continuous and Gaussian, as represented in (5.12):

P(O_t | X_t = i) ~ N(µ_i, σ_i²)   (5.12)

Note that there is only one hidden layer and one observation variable in the HMM discussed so far. The following sections discuss different types of HMMs: the auto-regressive HMM and the hierarchical HMM, respectively.

Figure 31: Representation of Dynamic Bayesian Network: a) unrolled; b) rolled (compact). Observed nodes are shaded, whereas hidden nodes are not shaded.
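The state estimate P(X_t = i | O_1:t) described above can be computed with the standard normalized forward recursion. The sketch below uses a two-state model (a "healthy" and a "worn" state with Gaussian observations as in (5.12)); all parameter values are illustrative, not taken from the experiments.

```python
import math

# Forward filtering for an HMM with Gaussian observations:
# computes P(X_t = i | O_1:t) recursively (illustrative parameters).
pi = [0.9, 0.1]                       # P(X_1 = i): healthy vs. worn
A = [[0.95, 0.05], [0.0, 1.0]]        # P(X_t = j | X_t-1 = i): wear is absorbing
mu, sigma = [0.0, 2.0], [1.0, 1.0]    # Gaussian observation model, eq. (5.12)

def gauss(o, m, s):
    return math.exp(-0.5 * ((o - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def filter_states(observations):
    # Initialize with the prior network, then roll the transition network
    belief = [pi[i] * gauss(observations[0], mu[i], sigma[i]) for i in range(2)]
    belief = [b / sum(belief) for b in belief]
    for o in observations[1:]:
        pred = [sum(belief[i] * A[i][j] for i in range(2)) for j in range(2)]
        belief = [pred[j] * gauss(o, mu[j], sigma[j]) for j in range(2)]
        belief = [b / sum(belief) for b in belief]
    return belief

print([round(b, 3) for b in filter_states([0.1, 0.3, 1.9, 2.2])])
```

As the observations drift toward the "worn" mean, the belief mass shifts to the second state, which is exactly the health-state tracking behavior exploited later for the drill bits.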


5.3.5 Auto-regressive Hidden Markov Models

In a regular HMM, it is assumed that observations are conditionally independent given the states, as illustrated by (5.12). This assumption often does not hold in real-world applications, especially when a trend exists in the observations and/or sensors collect information at high data-acquisition rates. The auto-regressive HMM (Ar-HMM) relaxes this assumption by accommodating a link between consecutive observations, as seen in Figure 32. The model allows O_t−1 to help predict O_t and generally leads to higher likelihoods. The observed node O_t now has two parents: the current hidden state and the previous observed node. The observation distribution can be represented as in (5.13).

Figure 32: DBN representation of Auto-regressive Hidden Markov Model

P(O_t | X_t = i, O_t−1 = o) ~ N(µ_i + o·w_i, σ_i²)   (5.13)

where o ∈ ℝ denotes the previous observed value and w_i denotes the weight of the previous observation's effect. Note that O_1 has only one parent (i.e., X_1), and P(O_1 | X_1 = i) ~ N(µ_i, σ_i²). Hierarchical HMMs are discussed in the next sub-section.

5.3.6
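The auto-regressive emission density is a small modification of the Gaussian emission: the mean is shifted by the weighted previous observation. A sketch, with illustrative parameter values:

```python
import math

# Auto-regressive emission density in the spirit of equation (5.13):
# the current observation depends on the hidden state i and on the
# previous observation o through the mean mu_i + o * w_i.
def ar_emission_pdf(o_t, state_mu, state_sigma, state_w, o_prev):
    mean = state_mu + o_prev * state_w
    z = (o_t - mean) / state_sigma
    return math.exp(-0.5 * z * z) / (state_sigma * math.sqrt(2 * math.pi))

# With w = 0 this reduces to the regular HMM emission of equation (5.12).
p_ar = ar_emission_pdf(1.0, state_mu=0.5, state_sigma=1.0, state_w=0.5, o_prev=1.0)
p_plain = ar_emission_pdf(1.0, state_mu=0.5, state_sigma=1.0, state_w=0.0, o_prev=1.0)
print(p_ar > p_plain)  # here the AR term pulls the mean toward the observation
```

This illustrates why the Ar-HMM "generally leads to higher likelihoods" when consecutive observations are correlated: the previous sample carries information about the next one.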

Hierarchical Hidden Markov Models

The hierarchical HMM (HHMM) is an extension of the HMM designed to model hierarchical structure in sequential data. In an HHMM, states consist of sub-states, and both (states and sub-states) cause the observation. This model is especially important when the data are non-stationary within a state. The hierarchical structure can be maintained in such a way that the top level cannot change state unless the lower level reaches its last possible state. This is achieved through a binary control variable F: a top-level state change is allowed only if F = 1, and F is allowed to take the value 1 only if the lower level reaches its last possible state. Figures 33 and 34 illustrate the hierarchical structure and the rolled DBN representation of the hierarchical HMM, respectively. The conditional probabilities in the prior network (i.e., the initial slice) are initial distributions and are represented as in (5.14) and (5.15).

P(X_1^1 = j) = π^1(j),   where X_t^d denotes node X at time t in level d   (5.14)

P(X_1^d = i | X_1^{d−1} = j) = π_j^d(i)   (5.15)

Figure 33: Hierarchical representation of states. A: Top states, B: Sub states

In the transition network, the conditional probabilities can be discussed in three categories: bottom level, intermediate level, and top level. At the bottom level, the conditional probability is drawn from either the initial distribution or the transition distribution, depending on the state of the binary control variable F. If F is 'on' (i.e., F = 1), the transition is vertical (i.e., an upper-level state change is allowed) and the conditional probability distribution is the initial distribution; otherwise, the transition is horizontal (i.e., a transition to another bottom-level state under the same upper-level state) and the conditional probability distribution is the transition distribution. This is represented in (5.16). The probability of turning the control variable F on is equal to the probability of a transition to the last state, as stated in (5.17).

Figure 34: Hierarchical Hidden Markov Model Representation

P(X_t^D = j | X_t−1^D = i, F_t−1^D = f, X_t^{D−1} = k) =
    A_k^D(i, j)   if f = 0
    π_k^D(j)      if f = 1
                                   (5.16)

P(F_t^D = 1 | X_t^{D−1} = k, X_t^D = i) = A_k^D(i, last)   (5.17)

Similar to the bottom level, at the intermediate level the conditional probability distribution can be either the initial distribution or the transition distribution. The difference is that not only the control variable of the same level but also the control variables of the lower levels need to be turned on for the conditional probability to be the initial distribution. In other words, all the lower levels should be in their last possible states for a higher-level state transition. This is represented in (5.18). The conditional probabilities at the top level are the same as in the previous equations, except that no higher-level parent exists. Note that we implement a two-level HHMM, as represented in Figure 34.

P(X_t^d = j | X_t−1^d = i, F_t−1^{d+1} = f_b, F_t−1^d = f, X_t^{d−1} = k) =
    δ(i, j)        if f_b = 0
    A_k^d(i, j)    if f_b = 1, f = 0
    π_k^d(j)       if f_b = 1, f = 1
                                   (5.18)

where δ(i, j) is 1 if i = j and 0 otherwise, and F_t−1^{d+1} is the control variable at time t−1 in level d+1 (d+1 being the level below d).
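The bottom-level rules (5.16) and (5.17) can be sketched directly: while F is off, the sub-state follows the transition matrix, and when F turns on it resets from the initial distribution. The three-sub-state transition matrix and initial distribution below are hypothetical.

```python
# Sketch of the bottom-level HHMM transition rules (5.16)-(5.17).
# A_k: sub-state transition matrix under top-level state k; pi_k: its
# initial distribution. Values are hypothetical; the last sub-state is
# the "exit" state that enables the control variable F.
A_k = [[0.7, 0.3, 0.0],
       [0.0, 0.8, 0.2],
       [0.0, 0.0, 1.0]]
pi_k = [1.0, 0.0, 0.0]

def p_next_substate(j, i, f):
    # Equation (5.16): horizontal transition if f = 0, reset if f = 1
    return pi_k[j] if f == 1 else A_k[i][j]

def p_F_on(i):
    # Equation (5.17): P(F = 1) equals the transition to the last state
    return A_k[i][-1]

print(p_next_substate(0, 2, f=1))  # vertical transition resets to pi_k → 1.0
print(p_F_on(1))                   # → 0.2
```

The same pattern repeats at the intermediate level, with the extra condition of (5.18) that all lower-level control variables must also be on.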

See [Mur02] for more detailed information about DBNs. The following section discusses the implementation and results of the methods.


5.4 IMPLEMENTATION AND RESULTS

The proposed health state estimation methods are applied to a drilling process, one of the most common industrial machining processes. The intent is to estimate the health state of the drill bit so as to facilitate timely replacement (avoiding both failures within the workpiece and premature replacement). Drill bits are normally subject to gradual wear along the cutting lips and the chisel edge, which leads to a series of transitions in health states from a 'brand new' state through a 'failed' state [CB03]. The objective is to estimate these health states of the drill bit.

The experimental setup consists of a HAAS VF-1 CNC machine, a workstation with LabVIEW software for signal processing, a Kistler 9257B piezo-dynamometer for measuring thrust-force and torque, and an NI PCI-MIO-16XE-10 card for data acquisition. Twelve drill bits are used for the experiment, and each is operated until it reaches a state of physical failure. Thrust-force and torque sensor signals are employed for health state estimation given their strong correlation with the condition of the bit. Stainless steel bars with a thickness of 0.25 inches are used as test specimens. The drill bits are high-speed twist drill bits with two flutes, operated without coolant at a feed rate of 4.5 inches per minute (ipm) and a spindle speed of 800 revolutions per minute (rpm). The thrust-force and torque data are collected for each hole from the instant the drill bit penetrates the workpiece until it protrudes from the other side. The data are collected at 250 Hz, considered adequate to capture cutting-tool dynamics in terms of thrust-force and torque. The number of data points collected for a hole varies between 380 and 460, and each hole is reduced to 24 RMS (root mean square) values. Data collected from drill bit #5 are depicted in Figure 35.
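The feature-extraction step (reducing each hole's raw signal to 24 RMS values) can be sketched as follows. The equal-chunk segmentation and the synthetic input signal are assumptions for illustration; the text does not specify how the windows are chosen.

```python
import math

# Reduce a hole's raw sensor sequence to a fixed-length vector of RMS
# features, as described for the drilling data (24 values per hole).
# Assumption: the signal is split into equal, non-overlapping chunks.
def rms_features(signal, n_features=24):
    chunk = len(signal) // n_features
    return [math.sqrt(sum(x * x for x in signal[i * chunk:(i + 1) * chunk]) / chunk)
            for i in range(n_features)]

samples = [math.sin(0.05 * k) for k in range(408)]  # stand-in for thrust data
feats = rms_features(samples)
print(len(feats))  # → 24
```

Fixing the feature count per hole is what allows holes of 380 to 460 raw samples to be compared by the same HMMs.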
Thrust-force and torque signals need to be normalized, since their amplitudes are quite different. Regular, auto-regressive, and hierarchical HMMs are implemented using nine drill bits for training and three for testing, and are discussed in the following sub-sections. The methods are implemented in MATLAB using Kevin Murphy's Bayes Net Toolbox, available at http://www.ai.mit.edu/~murphyk/Software/BNT/bnt.html. A PC with a 3 GHz processor and 512 MB of memory is used to run the methods.
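One common way to put the two channels on a comparable scale is to z-score each one; this is a sketch under that assumption, as the text does not state which normalization scheme was used.

```python
# Z-score normalization of a sensor channel so that thrust-force and
# torque features become comparable in amplitude (one common choice;
# the normalization scheme used in the experiments is not specified).
def zscore(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

thrust = [120.0, 150.0, 180.0, 210.0]  # hypothetical RMS values
print([round(v, 2) for v in zscore(thrust)])  # → [-1.34, -0.45, 0.45, 1.34]
```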

Figure 35: Thrust and torque data for drill bit #5

5.4.1 Competitive Learning for Regular and Auto-regressive HMMs

In competitive learning, each health state is represented by a distinct HMM. A data sequence for a hole from the training drill bits is randomly selected, and the likelihood values of the competing HMMs are calculated. The HMM with the highest likelihood is declared the winner for this hole and is trained with the corresponding data sequence. There are two approaches to training the HMMs: one can train only the winner (i.e., 'winner takes all') or resort to a topological learning process that involves training all the HMMs to the degree of their ability to represent the training data sequence (measured by the likelihood value). The results, advantages, and disadvantages of both are discussed later. Competitive learning is illustrated in Figure 36.


Figure 36: Illustration of Competitive Learning
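The loop in Figure 36 can be sketched as follows. To keep the sketch self-contained, each "HMM" is replaced by a toy one-dimensional Gaussian model scored by an unnormalized log-likelihood and updated with a running mean; a real implementation would evaluate HMM likelihoods and re-train the winner with Baum-Welch.

```python
import random

# Winner-takes-all competitive learning sketch (Figure 36). Toy stand-in:
# each "model" is a 1-D Gaussian scored by an unnormalized log-likelihood,
# and training the winner is a running-mean update.
random.seed(0)

models = [{"mu": m, "n": 1} for m in (0.0, 5.0, 10.0)]  # initialized pool

def loglik(model, x):
    return -(x - model["mu"]) ** 2       # unnormalized Gaussian log-score

def train(model, x):
    model["n"] += 1
    model["mu"] += (x - model["mu"]) / model["n"]

data = [0.2, 0.5, 4.8, 5.1, 9.7, 10.3] * 5  # toy "holes" from three regimes
for epoch in range(3):
    random.shuffle(data)                 # holes presented in random order
    for x in data:
        winner = max(models, key=lambda m: loglik(m, x))  # likelihood contest
        train(winner, x)                 # only the winner is trained

print(sorted(round(m["mu"], 1) for m in models))  # → [0.3, 5.0, 10.0]
```

Each surviving model ends up specialized to one regime of the data, which is exactly how the HMM pool comes to represent distinct health states.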

In order to evaluate the effectiveness of the method, classification accuracy needs to be calculated, and ground-truth information is essential for calculating true classification accuracy. For the given experiment, the ground truth is the actual health state of the drill bits, which might be identified as a function of wear on the cutting lips or the chisel edge at the end of each drilling cycle, an extremely tedious proposition. Moreover, this process would allow the drill bit to cool down and might lead to a weak representation of real degradation under actual operating conditions. In addition, many real-world applications lack ground-truth information about the health of the equipment. In the absence of ground truth, we employ the following three criteria for judging the quality of the models: number of reverse jumps, uniformity, and health state resolution. A reverse jump is defined as revisiting a previously visited health state after going through some other health state(s). Uniformity measures the consistent presence of the different health states across all drill bits; each health state needs to be visited by most, if not all, drill bits for the health state estimation to generalize. This is a reasonable measure for this experiment but may not be appropriate for systems that fail due to multiple failure mechanisms. The last criterion, health state resolution, ensures adequate resolution between a brand-new drill bit and a complete failure; it is measured here by counting the number of HMMs that survive the competitive learning process.

Initialization is also important when building HMMs for health state estimation, both to reduce training time and to improve diagnostic performance. In this investigation, the first and last holes of all drill bits are used to initialize the 'first' and 'last' HMMs in the pool, respectively. Then, one hole from each drill bit is selected such that the selected holes are as far apart as possible, to initialize the remaining HMMs. For example, with four HMMs in the competitive learning pool and 22 holes of data available from drill bit #1, holes 1, 8, 15, and 22 of drill bit #1 are used to initialize HMM1, HMM2, HMM3, and HMM4, respectively. After initialization, the HMMs are trained by competition.

Competition can be terminated in two different ways: convergence or classification-error minimization. In the former case, competition ends when convergence is achieved, i.e., when the same classification results are obtained for two consecutive epochs; in the latter case, competition ends when the classification error is minimized. In both cases, a maximum number of epochs is set to guarantee termination. Data sequences from all training holes are used in random order in each epoch.

Under the convergence criterion, a learning rate (ℓ_i) is introduced to assure convergence. The learning rate can be defined as the importance of new information in the current iteration relative to the information accumulated in previous iterations; as accumulated information increases, the importance of new information is decreased for faster convergence.
The amount of reduction in the learning rate is defined by a reduction factor (ℏ). A higher reduction factor leads to slower convergence but generally more robust results. The learning rate in the 'winner takes all' approach is implemented here as a function of the starting learning rate (ℓ_st) and the reduction factor, and is specified in (5.19) for iteration i. In the topological learning approach, the closeness of the HMM to the data sequence (c) affects the learning rate in addition to the starting learning rate and reduction factor, as written formally in (5.20). The closeness value c ranges from 1, if the HMM has the highest likelihood, to N (the number of HMMs), if the HMM has the lowest likelihood.

ℓ_i = ℓ_st × ℏ^{i−1},   ℏ: reduction factor, ℓ_st: starting learning rate   (5.19)

ℓ_{i,c} = ℓ_st × ℏ^{i+c−2},   c: closeness value   (5.20)

Under the second termination criterion, training continues until zero reverse jumps are obtained for all training drill bits or the maximum number of epochs is reached. The results of both termination criteria are discussed in the next section for the regular HMM.

5.4.1.1 Regular Hidden Markov Models

We first implemented the convergence termination criterion for the 'winner takes all' and 'topological' learning strategies with regular HMMs. The health state estimation results for both strategies are given in Table 19, with four regular HMMs in the pool and three states within each HMM. Numbers in the table represent the winning HMM for the corresponding hole (column) and drill bit (row). As can be seen in Table 19, the sequence of health states from brand new to failure corresponds to HMMs 1, 3, 2, and 4, respectively. Surprisingly, both strategies produce very similar results. However, training all HMMs for each hole in the topological learning approach dramatically increases computation time. The other important difference between the two approaches is the effect of the random ordering of holes in the training process, which is discussed in detail below.


Table 19: Health state estimation results for the a) 'winner takes all' and b) 'topological learning' approaches, with 4 regular HMMs and 3 states within each HMM. Numbers in the table correspond to the winning HMM for that drill bit (row) and hole (column).

The parameter optimization of an HMM (i.e., learning) depends strongly on the initial observation probability distributions and transition probabilities, since the problem is in essence a non-linear optimization problem. Thus, different starting points may lead to different local optima. Another factor that affects the solution is the order in which the holes are fed to the HMMs for training. In the topological learning approach, models that use the same initial points might produce totally different results because of the randomly ordered training holes; however, random ordering is important for good generalization. In the 'winner takes all' approach, on the other hand, the result is not sensitive to the random ordering of the training data: the same initial points lead to the same results. Thus, the topological learning approach is less appropriate because of its computation time and its sensitivity to the ordering of the training data. In the rest of this work, we refer only to the 'winner takes all' approach, which is naturally sensitive to the initial distributions; initialization is handled as described above.

It is also important to understand the effect of the learning rate (ℓ_i) and the reduction factor (ℏ). A starting learning rate of 0.01 is recommended: higher values might result in overly slow convergence and a result far from the initial health state estimates, whereas lower values might not provide enough tuning. The reduction factor is set to 0.9; in general, low values lead to faster convergence whereas high values give more robust results.

For illustrative purposes, the states within different HMMs (e.g., HMM1 and HMM3), along with scatter plots of normalized thrust and torque data, are shown in Figure 37 for the trained HMMs. Note that there are three states within each HMM in this example. After training, the log-likelihood values of the HMMs are calculated for the holes of all training and testing drill bits; the maximum log-likelihood value defines the health state of the drill bit during that hole. Figure 38 illustrates the log-likelihood values of a drill bit for all of its holes. The classification of health states can also be seen from the log-likelihood graphs.

Figure 37: State mean and covariance plots of HMM1 and HMM3 with normalized thrust and torque scatter plots

Figure 38: Illustration of log-likelihood values of regular HMMs

Under the second termination criterion, namely error minimization, learning continues until there is no reverse jump in any drill bit or the maximum number of epochs is reached. The results of this strategy are given in Table 20, where each drill bit is represented in a row with its number of holes. An empty cell means that the corresponding drill bit failed before reaching that hole. As seen from the table, the HMMs represent the health states with zero reverse jumps, and each health state is represented by many of the drill bits. Figure 39 illustrates the learned means and covariance matrices of the states within the HMMs. As seen from the figure, the data move from inner to outer regions as the drill bit wears, and the sub-states of the HMMs effectively represent this movement. For illustrative purposes, the log-likelihood values of the four HMMs for drill bit #1 are plotted in Figure 40. As the drill bit goes from brand new to failure, the highest log-likelihood values come from HMM1, HMM4, HMM3, and HMM2, respectively; in other words, the health states from brand new to failure are represented by HMM1, HMM4, HMM3, and HMM2. This was also largely evident with the other training drill bits.

Table 20: Health state estimation for all drill bits using competitive learning

Figure 39: Illustration of mean and covariance of states of HMMs for drill bit #1


In both learning termination strategies, the number of HMMs within the pool and the number of hidden states allowed within each HMM ought to be optimized. In this experiment, the maximum number of health states that could be estimated is four: when more than four HMMs are used for training, some of the HMMs either do not win any competition or tend to represent a sub-part of a health state. For example, in Table 21, both HMM 3 and HMM 5 represent the failure state (i.e., the last hole of all drill bits); in other words, HMMs 3 and 5 represent sub-states of the failure state. Experimental investigation revealed that the optimal number of states within a regular HMM is approximately four: too many hidden states within an HMM led to data over-fitting (and hence poor performance on the testing set), whereas too few states led to poor performance even on the training set [CB03].

Figure 40: Illustration of log-likelihood values of HMMs for drill bits #1 and #3

Table 21: Health state estimation using 5 HMMs


We evaluate the results of the two learning termination strategies by comparing the number of parameters and the computation time, as well as the aforementioned criteria: reverse jumps, uniformity, and the number of identified health states. In both cases, the number of reverse jumps is zero and four health states are identified. In the latter case (error minimization), uniformity is higher, since the health states are visited by most of the drill bits. In addition, the former case (convergence) requires two extra parameters (learning rate and reduction factor) to be defined, and its computation time is higher. Thus, the second termination criterion (error minimization) is the better strategy for this experiment. The next section discusses the implementation of the Ar-HMM.

5.4.1.2 Auto-regressive Hidden Markov Model

As mentioned before, the difference between the Ar-HMM and the regular HMM is that the former removes the independence assumption on the observations. The Ar-HMM also employs competitive learning, and the implementation details are the same as for the regular HMM. Both learning termination criteria (convergence and error minimization) have been employed for the Ar-HMM.

In the case of termination by convergence, several models are created with different initial points and the best one is chosen among them. Parameter tuning is done in the next step with a learning rate of 10^−5. The learning rate is kept low in order to decrease the number of epochs before convergence; if too many epochs are allowed, one HMM might come to represent the whole lifetime (i.e., win all the competitions), which is meaningless. The reduction factor is set to 0.9. Table 22 gives the health state estimation results using convergence as the termination criterion with three Ar-HMMs. As seen from the table, there are no reverse jumps, leading to 100% classification accuracy. We can define the health states as brand new (HMM #3), normal (HMM #1), and ready to fail (HMM #2). Having the normal health state last a very long time might make it difficult to estimate the remaining useful life (RUL) accurately in prognostics. Figure 41 illustrates the log-likelihood values of the Ar-HMMs for drill bits #1 and #3.


Table 22: Health state estimation using competitive learning with 3 Ar-HMMs, each with 3 states. Learning rate is 10^−5.

In the case of the error minimization termination criterion, the competition continues until there is no reverse jump (or the maximum number of epochs is reached). Table 23 displays the results of health state estimation with the Ar-HMM. As seen from the table, HMM #1 and HMM #4 together represent the 'ready to fail' health state. Thus, the number of health states identified by the Ar-HMM is three. Compared to the previous results, this is more appropriate for estimating RUL in prognostics, since the brand new health state is represented longer here than in the previous case. In addition, given the other difficulties of the convergence termination criterion mentioned earlier (number of parameters, computational time, etc.), the error minimization termination criterion is better for the Ar-HMM implementation.

Table 23: Ar-HMM with error minimization

(Rows index the holes drilled (1-24) and columns the twelve drill bits; each entry is the index of the winning Ar-HMM. Table body not recoverable from the extracted text.)

Three and four Ar-HMMs were used in the competition. In all cases, the number of identified health states was limited to three: any extra Ar-HMMs either did not win any competitions or represented a sub-part of a health state, as shown in Table 23. Thus, the number of health states that can be identified using the Ar-HMM is three. The reason is that the Ar-HMM gives higher likelihood values, since it models the dependency within the state as well as the dependence on the state. The higher likelihood value increases the HMM's representation ability, leading to a need for fewer health states. The Ar-HMM has one extra parameter (i.e., the weight w) to be optimized, as given in (5.13); w is the weight of the previous observation's effect on the present observation. This parameter gives the Ar-HMM the ability to represent time series data. Therefore, only one state within an Ar-HMM can represent the whole data sequence within a hole. As mentioned earlier, states within the HMM represent the states within a hole. Various numbers of states were used in the Ar-HMM; however, at the end of training, only one state represented the whole hole. The reason behind this is the Ar-HMM's ability to represent the data sequence of a hole as a time series. In order to force the Ar-HMM to have several states within a hole, we initially trained it as a regular HMM and introduced the weight parameter later. However, whenever the weight parameter is introduced, one state represents the whole data sequence within a hole and all extra states become ineffective.

Figure 41: Illustration of the log-likelihood values of Ar-HMMs for drill bits #3 and #1

Table 24: Computational times of regular and auto-regressive HMMs

                One Epoch      Overall Training
Regular HMM     41 sec.        212.08 sec.
Ar-HMM          416.72 sec.    979.94 sec.


We compare the results of regular HMMs and Ar-HMMs based on computational time, classification accuracy, uniformity, and number of identified health states. Training the Ar-HMM for one iteration is almost 10 times slower than training the regular HMM; Table 24 reports the computational times involved in training regular and Ar-HMMs. The regular HMM identifies more health states than the Ar-HMM, is computationally more efficient, and, contrary to the Ar-HMM, its identified health states are visited by most of the drill bits. The comparison of the regular HMM with the Ar-HMM is summarized in Table 25. According to these criteria, the regular HMM is more appropriate for health state estimation in this experiment, although the Ar-HMM removes the independence assumption made by the regular HMM. The implementation of the hierarchical HMM is discussed in the next section.

Table 25: Comparison of Regular HMM and Ar-HMM

                                     Regular HMM          Auto-regressive HMM
Classification accuracy              100%                 100%
Uniformity                           Better               Worse
Number of identified health states   4                    3
Computational time                   Shorter              Longer
Chosen termination criterion         Error minimization   Error minimization

5.4.2 Hierarchical Hidden Markov Model

Hierarchical HMM (HHMM) is designed to handle complex systems. We implemented a two-level HHMM with top-level states that represent health states and sub-states under them; both cause the observation, as seen in Figure 34. The HHMM gives us the opportunity to model all health states using a single overall model. Thus, it is enough to train one HHMM instead of employing competitive learning over several HMMs. The top-level states within the HHMM represent the health states of the drill bits. A left-to-right HHMM, which has appropriate initial transition probabilities to represent health state progression, is implemented.


Likelihood values for the data sequences are calculated, and the top-level state with the highest likelihood is assumed to represent the health state of the data sequence. Figure 42 displays the likelihood values for HHMMs with three and five top-level states.

Figure 42: Likelihood values for 3 and 5 top-level-state HHMMs for drill bit #1

We tried different numbers of health states, from three through six. As the number of top-level states increases, some of the states might not represent a health state by themselves. For example, top-level state 3 could not represent any data sequence in the HHMM with four top-level states, as seen in Table 26, although there exist four possible health states.

Table 26: Health state estimation using HHMM with 4 top-level states

(Rows index the holes drilled and columns the twelve drill bits; each entry is the winning top-level state. Table body not recoverable from the extracted text.)

The second parameter to be optimized is the number of sub-states, which identifies the number of states within a hole. In our application, five sub-states are enough to represent the data in a hole.


Figure 43: Illustration of the sub-state means and covariances of the four health states in the HHMM

Table 27: Health state estimation of all drill bits using a HHMM

(Rows index the holes drilled (1-24) and columns the twelve drill bits; each entry is the winning top-level health state, 1-5. Table body not recoverable from the extracted text.)

Table 27 gives the health state estimation of all drill bits for the HHMM with five health states; however, state #4 is represented in only three drill bits, so states #4 and #5 can be combined into one health state. As seen from the table, health state estimation seems better with the HHMM than with a committee of HMMs, since almost all drill bits visit all the health states, and 100% classification accuracy is achieved. Figure 43 gives the means and covariance matrices of the sub-states for the four health states. The computational time for the HHMM varied between 174 and 255 seconds depending on the number of states used in the top and lower levels; in our experiments, this was comparable to regular HMM committees. Careful comparison of HMMs trained using competitive learning for health state estimation versus a single hierarchical HMM revealed the following advantages for using HHMMs:


1.) Better classification: The HHMM gives better health state estimation than the regular HMM.

2.) Training: There is no need for competitive learning; it is enough to train only one HHMM.

3.) Computational time: Training time is in the same range as the regular HMM and less than the Ar-HMM.

4.) Topology: It naturally enforces a topological structure, which minimizes reverse jumps.

5.) Transition probability: The HHMM automatically calculates the transition probabilities between health states (i.e., top-level states), whereas the regular HMM has transition probabilities only between sub-states and not between health states. Transition probabilities between health states are important for prognostics.

6.) Parameters: There is no need to define parameters such as the learning rate, reduction factor, etc., that are necessary for competitive learning under the convergence termination criterion. The number of parameters to define is the same as with the 'error minimization' termination criterion in competitive learning.

5.5 CONCLUSION AND FUTURE RESEARCH

We implemented regular, auto-regressive, and hierarchical HMMs as dynamic Bayesian networks for health state estimation. Competitive learning is employed for training pools of standard and auto-regressive HMMs, each of which represents a distinct health state. Although the regular HMM and the Ar-HMM give 100% accuracy if classification error is defined simply as a function of the number of reverse jumps among the health states, they have some implementation difficulties, such as computational time and extra parameters to define. On the other hand, a single hierarchical HMM can naturally represent all health states and offers several advantages over pools of standard HMMs, such as better classification and ease of implementation and training. In addition, the hierarchical HMM can be extended for prognostics in order to estimate remaining useful life (RUL). Future research can consider the implementation of an HHMM for estimating RUL.


CHAPTER VI

MACHINE PROGNOSTICS USING HIDDEN MARKOV MODELS

6.1 INTRODUCTION

Condition-Based Maintenance (CBM) aims to avoid both unnecessary maintenance actions and machine failures, which can be achieved neither by time-based preventive maintenance nor by corrective maintenance. It minimizes failure and maintenance costs by optimizing the maintenance time periods. CBM technology increases system availability and safety as well as reducing cost, through reduced maintenance and inventory, increased capacity, and enhanced logistics and supply chains [Cam05]. Jay Lee and Jun Ni, the Co-Directors of the National Science Foundation (NSF) Center for Intelligent Maintenance Systems (IMS), estimate that $35 billion per year would be saved in the US alone if CBM were implemented [Har03]. Diagnostics is the process of identifying, localizing, and determining the severity of a machine failure. Prognostics is built upon diagnostics and is the process of estimating the Remaining Useful Life (RUL) of a machine by predicting the progression of a diagnosed failure [MBND99]. Prognostics is receiving the most attention for systems consisting of mechanical and structural components because, unlike electronic or electrical systems, mechanical systems typically fail slowly, as structural faults in mechanical parts progress gradually to a critical level. Monitoring the growth of these faults provides the opportunity to estimate RUL [Mar01]. Therefore, incipient failures that develop slowly are in the scope of prognostics. Prognostics is a dynamic process that evolves in time from the moment the machine is first used until it fails. RUL should not be confused with an estimate of life expectancy, which is the "mean time to failure" of an average component [EBGH00]. Life expectancy is the average life


of a similar machine or a family of machines, while RUL is the time to failure of the specific machine under monitoring. Prognostics is more difficult to formulate than diagnostics, attributable to the fact that prognostics concerns a stochastic process that has not happened yet: diagnostics models an existing stochastic system, while prognostics forecasts a stochastic system that will unfold in the future. Consequently, prognostic methods should have three outputs: accuracy, precision, and confidence [EBGH00]. Accuracy is the measure of closeness of the estimated failure time to the actual failure time (high accuracy = close estimate). Precision is the length of the interval in which the estimated RUL falls. Confidence is the probability that the actual RUL falls within the given precision. High accuracy, narrow precision, and high confidence are desired. A study by NIST concluded that the availability of generic methods for effective diagnostics and prognostics is a prerequisite for widespread deployment of CBM [Nist98]. However, the literature on prognostics is extremely sparse. Prognostic methods can be broadly grouped into two categories: physics-based and empirical. Even though physics-based prognostic models have been attempted for a variety of mechanical components with some success [BMBM99, Kac02] and might give better results than empirical models, they are much more expensive to implement. Figure 44 illustrates the total cost, which consists of maintenance and failure cost, and the equipment availability for physics-based and empirical methods. In addition, replicating a physics-based method on slightly different equipment is prohibitive and intractable; physics-based methods also have scalability problems.

Figure 44: Equipment availability and total cost for physics-based and empirical-based methods


Empirical prognostic methods can be grouped into three categories. The first approach, evolutionary prognostics, involves trending of features combined with simplistic thresholds set from past experience, and analysis of the rate of change from the current condition to the known failure in feature space. However, many systems are not simple enough to allow setting thresholds on the features for failure states. The second approach [BRG02] is to utilize statistical regression models and/or computational intelligence methods, such as neural networks, to model known failure degradation paths in feature space. However, these models are not promising for forecasting, and are especially disappointing for long-term forecasting, which is a necessity for estimating RUL. The third approach, future state estimation, estimates a state vector that represents the equipment health condition from brand new to failure by employing subspace and non-linear dynamic methods [BG01]. These methods forecast the progression of the machine's health states, from the current state estimated by the diagnostician to the failure state, by employing transition probabilities between states and the time spent in each state [BMBM99, BG01, CCH97]. HMMs are used in [CB03, BC03] for state-estimator prognostics, where each health state that a machine evolves through is represented by a distinct HMM in an HMM pool. Kalman and Alpha-Beta-Gamma tracking filters are other examples of this approach [RT96]. S. Engel presents the idea of estimating the probability distribution of machine life, but notes its practical difficulty [EBGH00]. Other methods used in prognostics include wavelet analysis [LJW02], recurrent networks [KP02], wavelet neural networks [WV99], and reliability-based techniques [SK97]. In summary, current applications of prognostics primarily employ physics-based methods in specific applications or case studies [e.g., RK00, TFM01, HHA01]. To date, there exist no robust generic prognostic methods that apply across many electromechanical systems. In this chapter, a Hierarchical Hidden Markov Model (HHMM) implemented as a dynamic Bayesian network will be employed to estimate Remaining Useful Life (RUL).


This chapter is organized as follows: Section 6.2 gives background information about HMMs; Section 6.3 discusses dynamic Bayesian networks (DBNs) and the implementation of HMMs as DBNs; Section 6.4 presents the Hierarchical Hidden Markov Model (HHMM); Section 6.5 presents the implementation and results of the HHMM for RUL estimation; future research and conclusions follow.

6.2 BACKGROUND: HIDDEN MARKOV MODELS

A stochastic system can be described as being in one of a finite number of states at any time. The system evolves through the states according to a set of probabilities associated with each state, as demonstrated in Figure 45. The model is called a Hidden Markov Model if the states are not observable (hidden) and are assumed to be causing the observations. In general, the system behavior depends on the current state and its predecessor states. A special case, the first-order HMM, assumes that only the current state is responsible for producing the observations; in the rest of this chapter, HMM implies a first-order HMM. To better understand HMMs, consider an urn-and-ball system with 3 urns and a different number of colored balls in each urn [Law89]. An urn is selected randomly, then a ball is chosen from this urn, its color is recorded, and it is replaced into the urn it was chosen from. In the next step, a new urn is selected, and a ball is chosen from it and recorded. This process is repeated a finite number of times, and a finite observation sequence (of colors) is obtained. Now, assume that the process takes place in a different room and is carried out by somebody else, so that we do not see the selected urns; the only events observable to us are the colors of the selected balls. The simplest HMM representation of this system corresponds to the states being urns, each with different color probabilities.
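The urn-and-ball process can be sketched as a small generative simulation; the urn contents and transition probabilities below are hypothetical, chosen only to make the example concrete:

```python
import random

random.seed(0)

# Hypothetical urn-and-ball HMM: 3 urns (hidden states), each with its own
# colour proportions, plus a random walk over urns.
urns = {
    0: {"red": 0.6, "green": 0.3, "blue": 0.1},
    1: {"red": 0.2, "green": 0.5, "blue": 0.3},
    2: {"red": 0.1, "green": 0.2, "blue": 0.7},
}
transition = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.7, 0.2], 2: [0.1, 0.2, 0.7]}

def sample_sequence(length, start=0):
    """Generate a (hidden urn sequence, observed colour sequence) pair."""
    state, states, colours = start, [], []
    for _ in range(length):
        states.append(state)
        names = list(urns[state])
        weights = [urns[state][c] for c in names]
        colours.append(random.choices(names, weights=weights)[0])
        state = random.choices([0, 1, 2], weights=transition[state])[0]
    return states, colours

states, colours = sample_sequence(10)
# Only `colours` would be visible to an observer; the urns remain hidden.
```

Inferring the urn sequence from the colours alone is exactly the hidden-state inference problem the HMM formalism addresses.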


Figure 45: A Markov chain with 6 states and state transition probabilities (arrows represent non-zero state transition probabilities)

There are several elements to an HMM: the number of states (N), the observations, the state transition probability distribution, the observation probability distribution, and the initial state distribution. X_t denotes the state at time t and O_t denotes the observation at time t, which may be either a discrete symbol, O_t ∈ {1,...,L}, or a feature vector from an L-dimensional space, O_t ∈ R^L. The state transition probability distribution models the probability of being in state i at time t, given that the system is in state j at time t-1, denoted A = {a_ij} = P(X_t = i | X_(t-1) = j). The observation probability distribution defines the probability of observing k at time t given state i, denoted B = {b_i(k)} = P(O_t = k | X_t = i). These distributions are either mass functions, in the case of discrete observations, or specified using a parametric model family (commonly Gaussian) in the case of continuous observations. The initial state distribution is the probability of being in state i at the initial time, denoted π(i) = P(X_1 = i). Generally, λ = (A, B, π) is used to denote an HMM. There are three basic problems of interest to be solved given the above model specification:

• How to compute the probability of obtaining the observation sequence O = O_1 O_2 ... O_T given the model λ (i.e., P(O_1 O_2 ... O_T | λ))?

• How to identify the most likely state sequence that might produce the observation sequence?

• How to adjust the parameters of λ in order to maximize the likelihood of the given observation sequence?

These three problems are tightly linked together and have been studied extensively in the literature.
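The first of these problems is typically solved with the forward algorithm. A minimal sketch, with a hypothetical two-state, two-symbol model (the numbers are illustrative, not from the experiments):

```python
def forward(obs, A, B, pi):
    """Forward algorithm: P(O_1 ... O_T | lambda) for a discrete-observation HMM.

    A[i][j] = P(X_t = j | X_{t-1} = i), B[i][k] = P(O_t = k | X_t = i),
    pi[i] = P(X_1 = i).
    """
    n = len(pi)
    # Initialisation: alpha_1(i) = pi(i) * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(O_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Hypothetical 2-state model over 2 observation symbols (0 and 1)
A = [[0.8, 0.2], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
likelihood = forward([0, 1, 0], A, B, pi)  # 0.08212
```

The forward recursion avoids summing over all N^T state sequences explicitly, reducing the cost to O(T N^2) for a single flat HMM.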

The standard HMM solution to these problems requires an exponential number of parameters to specify the transition and observation models when the hidden state is itself composed of several variables (for it enumerates the Cartesian product of the state spaces of the variables). This means requiring excessive amounts of data to learn the model (high sample complexity) and exponential time for inference (for example, the forwards-backwards cycle takes O(TK^(2N)) operations; high computational complexity). For more detailed information about HMMs, see [Law89]. Dynamic Bayesian Networks (DBNs) can represent HMMs more efficiently and alleviate some of these problems, with the added flexibility of implementing different variants of the HMM. Dynamic Bayesian networks are discussed in the next section.

6.3 DYNAMIC BAYESIAN NETWORK

Standard Bayesian networks do not deal with time; dynamic Bayesian networks address how the variables change over time. A Dynamic Bayesian Network (DBN) is designed to model probability distributions over sequences of random variables, in order to handle sequenced observations generated by underlying hidden states that evolve in time [DK89]. A dynamic Bayesian network consists of two networks: the prior network and the transition network. The prior network represents the prior probabilities of all variables in the network in the initial time slice (i.e., t = 0). The transition network represents the probabilities of all the variables in all other time slices (i.e., t = 1, 2, ..., n) conditioned on the variables in the previous slices. The prior network, the transition network, and their combination are illustrated in Figures 46a, 46b, and 46c, respectively. For the rest of this chapter, shaded nodes represent observable variables, whereas blank nodes represent hidden variables.

HMMs, Kalman filters, Principal Component Analysis, Vector Quantization, etc., are all variants of dynamic Bayesian networks [Smy98, RG99]. This research involves the implementation of HMMs as DBNs, which is discussed in the next sub-section. See [Mur02] for more detailed information about DBNs.

Figure 46: Example of a Dynamic Bayesian Network: a) prior network, b) transition network, c) dynamic Bayesian network (combination of the prior and transition networks)

6.3.1 Dynamic Bayesian Network as Hidden Markov Model

The goal of a DBN acting as an HMM is to infer the hidden state given the observation sequence, which can be represented more precisely as P(X_t = i | O_1:t). Initially, the structure of the DBN needs to be defined. The DBN structure consists of different levels, such as the observation level (O nodes, shaded) and the hidden state level (X nodes, not shaded), as illustrated in Figure 47. The observation at time t is generated by hidden state X_t and represented as node O_t. Note that every level needs to be represented at every time t. Instead of representing the DBN with all time slices as in Figure 47a, the transition network can be represented by one time slice, which leads to the more compact representation of the DBN seen in Figure 47b. This compact (rolled) representation reduces the number of different nodes to be defined. For example, there are two different nodes (i.e., X_1 and X_t) in the hidden state level in Figure 47b instead of M nodes (M: total number of slices). In the observation level, all the observations can


be represented by one node (O), since all of them have the same parent (X_t). As a result, the DBN has three different nodes (X_1, X_t, and O), as shown in Figure 47b. In the second step, the conditional probability distribution of each node given its parents needs to be defined. These include the initial probability distribution, P(X_1 = i), the state transition distribution, P(X_t | X_(t-1)), and the observation distribution, P(O_t | X_t). It is also assumed here that the transition and observation functions do not change over time. In this work, the observation distribution is assumed continuous and Gaussian, as represented in (6.1).

P(O_t | X_t = i) ~ N(µ_i, σ_i²)        (6.1)

Note that there is only one hidden level and one observation variable defined in the HMM discussed so far. The following section discusses the hierarchical HMM.

Figure 47: Representation of a Dynamic Bayesian Network: a) unrolled over the time sequence, b) rolled (compact). Observed nodes are shaded, whereas hidden nodes are not shaded.

6.4 HIERARCHICAL HIDDEN MARKOV MODELS

The Hierarchical HMM (HHMM) is an extension of the HMM designed to model hierarchical structures in sequential data. In an HHMM, states consist of sub-states, and both (sub-states and states) cause the observation. This model is especially important in cases where the data is non-stationary within a state. The hierarchical structure can be maintained in such a way that the 'top level' cannot change state unless the 'lower level' reaches its last possible state. This can be achieved through a binary control variable F, allowing a top-level state


change only if the control variable F = 1. In turn, the control variable F is allowed to take the value 1 only if the lower level reaches its last possible state. Figures 48 and 49 illustrate the hierarchical structure and the rolled DBN representation of the hierarchical HMM, respectively. The conditional probabilities in the prior network (i.e., the initial slice) are the initial distributions, represented as in (6.2) and (6.3).

P(X_1^1 = j) = π^1(j),  where X_t^d denotes node X at time t in level d        (6.2)

P(X_1^d = i | X_1^(d-1) = j) = π_j^d(i)        (6.3)

Figure 48: Hierarchical representation of states. A: top-level states, B: sub-states

In the transition network, the conditional probabilities can be discussed in three categories: bottom level, intermediate level, and top level. At the bottom level, the conditional probability can be drawn from either the initial distribution or the transition distribution, depending on the state of the binary control variable F. If F is 'on' (i.e., F = 1), the transition is vertical (i.e., it lets an upper-level state change) and the conditional probability distribution is the initial distribution; otherwise, the transition is horizontal (i.e., a transition to a state at the bottom level under the same upper-level state) and the conditional probability distribution is the transition distribution. This is represented in (6.4). The probability of turning the control variable F on is equal to the probability of transitioning to the last state, as stated in (6.5).


Figure 49: Hierarchical Hidden Markov Model representation (top state level, sub-state level, control variable F, and observed node)

P(X_t^D = j | X_(t-1)^D = i, F_(t-1)^D = f, X_t^(D-1) = k) =
    A_k^D(i, j)   if f = 0
    π_k^D(j)      if f = 1        (6.4)

P(F_t^D = 1 | X_t^(D-1) = k, X_t^D = i) = A_k^D(i, last)        (6.5)

Similar to the bottom level, at the intermediate level the conditional probability distribution can be either the initial distribution or the transition distribution. The difference is that not only the control variable of the same level but also the control variables of the lower levels need to be turned on for the conditional probability to be the initial distribution at the intermediate level. In other words, all the lower levels should be in their last possible states for a higher-level state transition. This is represented in (6.6). The conditional probabilities at the top level are the same as in the previous equations, except that no higher-level parents exist. Note that we implement a two-level HHMM, as represented in Figure 49.

P(X_t^d = j | X_(t-1)^d = i, F_(t-1)^(d+1) = f_b, F_(t-1)^d = f, X_t^(d-1) = k) =
    δ(i, j)        if f_b = 0
    A_k^d(i, j)    if f_b = 1, f = 0
    π_k^d(j)       if f_b = 1, f = 1        (6.6)

where δ(i, j) is 1 if i = j and 0 otherwise, and F_(t-1)^(d+1) is the control variable at time t-1 in level d+1 (d+1 being the lower level compared to d).
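The control-variable mechanics of (6.4)-(6.6) can be sketched for a two-level, left-to-right model. All probabilities here are hypothetical, and F is simplified to fire deterministically once the sub-level sits in its last state:

```python
import random

random.seed(1)

# Hypothetical two-level left-to-right HHMM: 3 top states, 3 sub-states each.
N_TOP, N_SUB = 3, 3
A_top = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]  # top-level transitions
A_sub = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]  # sub-level (shared by all tops)
pi_sub = [1.0, 0.0, 0.0]                                     # vertical entry distribution

def step(top, sub):
    """One time slice: F is on when the sub-level is in its last state
    (a simplification of eq. (6.5)); only then may the top level move,
    and the sub-level re-enters via pi_sub, per eq. (6.4)."""
    f = (sub == N_SUB - 1)  # control variable F
    if f:
        top = random.choices(range(N_TOP), weights=A_top[top])[0]  # top may move
        sub = random.choices(range(N_SUB), weights=pi_sub)[0]      # vertical transition
    else:
        sub = random.choices(range(N_SUB), weights=A_sub[sub])[0]  # horizontal transition
    return top, sub

top, sub = 0, 0
path = [(top, sub)]
for _ in range(50):
    top, sub = step(top, sub)
    path.append((top, sub))
tops = [t for t, _ in path]
# Left-to-right structure: the top-level state index never decreases.
```

Because A_top places no mass below its diagonal, the simulated top-level trajectory is monotone, mirroring the left-to-right health state progression assumed for the drill bits.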

The next section discusses the implementation and results of the HHMM for RUL estimation.


6.5 IMPLEMENTATION AND RESULTS

The HHMM is applied to the drilling process, one of the most common industrial machining processes, in order to estimate the remaining useful life (RUL), defined here as the number of holes that can be drilled before failure occurs. Drill bits are normally subject to gradual wear along the cutting lips and chisel edge, which leads to a progression of health states from the brand new state through the failure state [CC05]. Thrust-force and torque signals are used since they have a strong correlation with the condition of the drill bit. The experimental setup consists of a HAAS VF-1 CNC machine, a workstation with LabVIEW software for signal processing, a Kistler 9257B piezo-dynamometer for measuring thrust-force and torque, and an NI PCI-MIO-16XE-10 card for data acquisition. Twelve drill bits are used in the experiment, and each is operated until it reaches a state of physical failure. Stainless steel bars with a thickness of 0.25 inches are used as test specimens. The drill bits are high-speed twist drill bits with two flutes, operated without any coolant at a feed-rate of 4.5 inches per minute (ipm) and a spindle speed of 800 revolutions per minute (rpm). The thrust-force and torque data are collected for each hole from the time instant the drill bit penetrates the work piece through the time the drill bit protrudes from the other side. The data are collected at 250 Hz, considered adequate to capture cutting-tool dynamics in terms of thrust-force and torque. The number of data points collected for a hole varies between 380 and 460, and each hole is reduced to 24 RMS (root mean square) values. Data collected from drill bit #5 are depicted in Figure 50. The thrust-force and torque signals need to be normalized, since their amplitudes are quite different.
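The reduction of each hole's raw signal to 24 RMS values, followed by normalization, might look like the following sketch (the equal-width windowing and the min-max normalization are assumptions, not necessarily the dissertation's exact preprocessing):

```python
import math
import random

random.seed(2)

def rms_features(signal, n_windows=24):
    """Reduce a raw sensor trace for one hole to n_windows RMS values.

    The trace is split into n_windows equal-width windows; samples beyond
    an even multiple of the window width are dropped.
    """
    window = len(signal) // n_windows
    return [math.sqrt(sum(x * x for x in signal[i * window:(i + 1) * window])
                      / window)
            for i in range(n_windows)]

def min_max_normalise(values):
    """Scale features to [0, 1] so thrust-force and torque are comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Stand-in for one hole's thrust-force trace (380-460 samples at 250 Hz)
thrust = [10.0 + random.gauss(0, 1) for _ in range(420)]
features = min_max_normalise(rms_features(thrust))
```

Each hole thus becomes a fixed-length 24-point sequence regardless of how many raw samples were captured, which is what the HMMs consume.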
The application is implemented using MATLAB, based on Kevin Murphy's Bayesian network toolbox, which is available at http://www.ai.mit.edu/~murphyk/Software/BNT/bnt.html. A

123

PC with a 3 GHz processor and 512 MB of memory is used for the implementation of the methods. The next sub-section discusses the implementation of the HHMM for RUL calculation.

Figure 50: Thrust-force and torque data for drill bit #5

6.5.1 RUL Calculation

Machinery equipment generally progresses through several health states from brand new through failure [CC05]. Prognostics obtains the current health state of the equipment from the diagnostics module. For illustration purposes, health states and their non-zero transition probabilities are displayed in Figure 51. Our implementation of the HHMM to estimate the equipment's current health state for the diagnostics module was reported in [CC05]. In this section, we discuss the calculation of the remaining useful life of the equipment given its current health state. Remaining useful life (RUL) is simply the answer to the question: "how many transitions need to be made to go from the current state to the failure state?" There is no single answer to this question, because this is a stochastic process that depends on the transition probabilities between states. Therefore, the best answer is a characterization of RUL as a probability distribution. We employed Monte-Carlo simulation in order to characterize the RUL. During the


simulation, the next health state, which can be either the same as the current health state or a different one, is selected probabilistically using the transition probabilities of the current state. This process is repeated for each successive health state until the failure state is reached; the number of transitions made on the way from the current state to the failure state is then recorded as one RUL value. In order to characterize the RUL distribution, a large number of RUL values needs to be obtained in the same way; we used 10,000 RUL values.
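The Monte-Carlo characterization described above can be sketched as follows; the transition matrix is hypothetical, with the last state absorbing to play the role of the failure state:

```python
import random
from collections import Counter

random.seed(3)

# Hypothetical health-state transition probabilities; state 4 = failure.
P = [
    [0.90, 0.10, 0.00, 0.00, 0.00],
    [0.00, 0.85, 0.15, 0.00, 0.00],
    [0.00, 0.00, 0.80, 0.20, 0.00],
    [0.00, 0.00, 0.00, 0.70, 0.30],
    [0.00, 0.00, 0.00, 0.00, 1.00],
]
FAILURE = 4

def simulate_rul(current_state):
    """Count transitions until the failure state is reached."""
    state, steps = current_state, 0
    while state != FAILURE:
        state = random.choices(range(len(P)), weights=P[state])[0]
        steps += 1
    return steps

def rul_distribution(current_state, n_runs=10000):
    """Empirical RUL probability distribution from n_runs simulated paths."""
    counts = Counter(simulate_rul(current_state) for _ in range(n_runs))
    return {rul: c / n_runs for rul, c in sorted(counts.items())}

dist = rul_distribution(current_state=2)
expected_rul = sum(rul * p for rul, p in dist.items())
```

The resulting dictionary is the empirical RUL probability mass function; its mean gives the expected RUL, and its quantiles give confidence limits of the kind plotted later.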

Figure 51: Illustration of equipment health states: HS1: brand new equipment, HS5: failure state

125

(b)

(c) Figure 52: Illustration of a RUL probability distribution for a) drill bit #2, b) drill bit #4, c) drill bit #9


The RUL probability distributions for drill bits #2, #4, and #9 are illustrated in Figure 52 for several holes; these bits fail at the 17th, 19th, and 8th holes, respectively. The x-axis displays the overall life (i.e., number of holes) of the drill bit, and the y-axis displays the probability of the lifetime given on the x-axis. The arrow within each graph marks the actual life of the drill bit for the given hole. Drill bits #2 and #9 are examples of good RUL estimation, whereas drill bit #4 is the worst case among all drill bits. As seen from the graphs, the accuracy of the RUL estimate increases as the machine approaches failure. In all cases, we are able to estimate the RUL as 0 with 100% probability in the hole just before the failure. Figure 53 displays the RUL estimates for all 12 drill bits with 95% and 75% confidence. In the graphs, the dashed line represents the actual RUL, the outer solid lines represent the confidence limits, and the middle solid line represents the expected RUL.

Figure 53: Illustration of RUL for 12 drill bits with a) 95% and b) 75% confidence intervals

The RUL estimates can also be evaluated by the probability that the estimated RUL distribution assigns to the actual RUL at a given time. The estimate is considered accurate when this probability, which we call the estimation accuracy, is high. Figure 54 illustrates the RUL estimation accuracy for all drill bits, with the current state represented on the x-axis. As seen from the graph, the estimation accuracy increases as the drill bit approaches failure, with some fluctuation before failure occurs, except in two cases (the 4th and 10th drill bits, which last only 7 and 9 holes, respectively). Their short lifetimes may indicate a failure that did not develop through gradual degradation. Note, however, that the RUL is estimated as 0 with 100% confidence just before the failure in all cases.


Figure 54: Illustration of estimation accuracy for 12 drill bits
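The estimation-accuracy measure described above is simply the probability mass that the empirical RUL distribution places on the actual RUL. A minimal sketch, using a hypothetical empirical distribution:

```python
def estimation_accuracy(rul_dist, actual_rul):
    """Probability that the estimated RUL distribution (a dict mapping
    RUL value -> probability) assigns to the actual RUL."""
    return rul_dist.get(actual_rul, 0.0)

# Hypothetical empirical RUL distribution from the Monte-Carlo simulation.
dist = {3: 0.1, 4: 0.3, 5: 0.4, 6: 0.2}
acc = estimation_accuracy(dist, actual_rul=5)  # -> 0.4
```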

Estimation accuracy, along with the RUL estimation graphs with confidence intervals, is used to optimize the number of health states and sub-states in the HHMM. The best results are obtained with four health states, each with four sub-states. After selecting the best model, the expected outcome of the prognostician is calculated. As mentioned earlier, the expected outcome of a prognostician includes three values: precision, accuracy, and confidence. It is more convenient to report the RUL as a range rather than a single value. Thus, the expected outcome of a prognostician may take the form "the RUL is between 4 and 6 holes with 75% confidence and 90% accuracy", meaning the method is 75% confident that 90% of the time the RUL is between 4 and 6 holes. The accuracy for each confidence level is displayed in Figure 55. As seen from the figures, the accuracy increases as the drill bit approaches failure. The average accuracy is 84.59% and 95.69% given 75% and 95% confidence, respectively. As mentioned earlier, precision is the length of the RUL estimation range. For example, if the RUL is estimated to be between 4 and 6 holes with 75% confidence, the precision is 2 (i.e., the difference between 4 and 6). Note that as the confidence increases, the width of the RUL estimation range, and hence the precision value, increases. The average precision is 9.2775 and 17.3931 for 75% and 95% confidence, respectively. The precision for the 12 drill bits is displayed in Figure 56. As seen from the figures, the precision value decreases as the drill bit approaches failure.
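Accuracy, precision, and confidence can all be read off the Monte-Carlo samples: take an equal-tailed interval covering the desired confidence, report its width as the precision, and check whether the actual RUL falls inside. A sketch with hypothetical sample values (Python rather than the MATLAB used in the implementation):

```python
import numpy as np

def rul_interval(samples, confidence=0.75):
    """Equal-tailed interval covering `confidence` of the sampled RUL values.
    Returns (lower bound, upper bound, precision), where precision is the
    interval width; the interval widens (precision value grows) as the
    requested confidence increases."""
    tail = (1.0 - confidence) / 2.0
    lo, hi = np.quantile(samples, [tail, 1.0 - tail])
    return lo, hi, hi - lo

# Hypothetical Monte-Carlo RUL samples for one drill bit at one hole.
samples = np.array([4, 4, 5, 5, 5, 6, 6, 7, 8, 12])
lo, hi, precision = rul_interval(samples, confidence=0.95)
```

Whether the actual RUL lies within `[lo, hi]` over repeated holes gives the reported accuracy for that confidence level.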

Figure 55: RUL accuracy given confidence interval for 12 drill bits, with the current hole on the x-axis: a) 95% and b) 75% confidence

Figure 56: RUL precision given confidence interval for 12 drill bits, with the current hole on the x-axis: a) 95% and b) 75% confidence

6.6 FUTURE RESEARCH

The Hierarchical Hidden Markov Model (HHMM) has been successfully applied to RUL estimation. However, in an HHMM the transition probability of a state does not incorporate the time (i.e., state duration) already spent in that state: the probability of leaving a state is the same no matter how long the system has been in it. It is natural, however, to expect the probability of leaving a state to be higher (lower) if the time spent in the state is long (short). Incorporating state duration into the model therefore leads to a better representation of real-world systems, which in turn yields better classification accuracy. In addition, state duration information obviously improves the forecasting of future states, which is directly related to RUL estimation. In Markov models, state durations are implicitly assumed to follow a geometric probability distribution, which is not adequate in many cases [Mur02], since the transition probability is constant for a given state. For example, let the probability of staying in the same state be p and of leaving the state be 1-p. The probability of staying exactly d steps in the state is given in (6.7). A basic approach to modeling the state duration is to replicate the state with n states, whose combination represents the original state, each with the same transition probabilities, as illustrated in Figure 57. The probability of staying in the original state for d steps can then be written as in (6.8). This is a negative binomial distribution, and different duration distributions can be modeled by changing n and p. However, this approach has limitations and is not flexible to implement in an HMM.

Figure 57: Illustration of transition probabilities: a) a single state X, b) the state replicated as sub-states X1, X2, X3

P_i(d) = (1-p)\, p^{\,d-1}, \qquad p = A(i,i) \qquad (6.7)

p(d) = \binom{d-1}{n-1}\, p^{\,d-n} (1-p)^{n} \qquad (6.8)

State duration information can be incorporated into an HMM by creating a hidden layer whose nodes represent duration: the state then depends not only on the previous state but also on the time spent in that state. The resulting model is called a semi-Markov model [Mur02]. Figure 58 illustrates a Hidden Semi-Markov Model.


Figure 58: Illustration of a Hidden Semi-Markov Model

The idea behind the Hidden Semi-Markov Model (HSMM) is similar to that of the HHMM, but the second hidden layer contains duration states instead of sub-states. When the model enters a state, a duration is generated from a probability distribution and is decremented at each step until it reaches 0. The top layer cannot change state until the duration counter reaches 0; thus, the transition probabilities of the top-level states depend on the lower hidden layer. The state duration probability and the transition probability of the top-level states can be written formally as in (6.9) and (6.10), respectively.

P(Q_t = d' \mid Q_{t-1} = d,\; X_t = k,\; F_{t-1} = 1) = p_k(d') \qquad (6.9)

P(X_t = j \mid X_{t-1} = i,\; F_{t-1} = f) = \begin{cases} \delta(i,j) & \text{if } f = 0 \text{ (remain in same state)} \\ A(i,j) & \text{if } f = 1 \text{ (transition)} \end{cases} \qquad (6.10)

where \delta(i,j) = 1 if i = j (and 0 otherwise), and p_k(d') may be a parametric function, a non-parametric function, or a table.
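The duration mechanism of (6.9)-(6.10) can be sketched as a generative step: while the duration counter is positive the top-level state is frozen; when it expires, a transition is drawn from A and a fresh duration is sampled for the new state. The 3-state model and the per-state duration tables below are hypothetical:

```python
import random

def hsmm_step(state, duration, A, sample_duration, rng):
    """One step of the HSMM top layer (eqs. 6.9-6.10): while the duration
    counter is positive the state cannot change; when it expires, a
    transition is drawn from A and a new duration is sampled."""
    if duration > 1:
        return state, duration - 1                             # F = 0
    nxt = rng.choices(range(len(A)), weights=A[state])[0]      # F = 1
    return nxt, sample_duration(nxt, rng)

# Hypothetical 3-state model; state 2 is absorbing.
A = [[0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
durations = {0: [2, 3], 1: [1, 2], 2: [1]}   # p_k(d') given here as tables

def sample_dur(s, rng):
    return rng.choice(durations[s])

rng = random.Random(1)
state, dur = 0, sample_dur(0, rng)
path = [state]
for _ in range(10):
    state, dur = hsmm_step(state, dur, A, sample_dur, rng)
    path.append(state)
```

Because A allows only forward transitions here, the generated path is monotone and eventually absorbed in state 2.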

Hidden semi-Markov models could be employed for health state estimation through competitive learning for the drilling machine, similar to the HMM approach in [CC05]. In that approach, however, each health state is represented by a distinct HMM, and it is difficult to model the transition probabilities. Therefore, in order to model complex systems and state durations together, we propose the Hierarchical Hidden Semi-Markov Model (HHSMM), which combines the ideas of the HHMM and the HSMM. The HHSMM has three hidden layers: health states at the top, duration states in the middle, and sub-states at the bottom, as illustrated in Figure 59.


Figure 59: Illustration of a Hierarchical Hidden Semi-Markov Model

As seen in Figure 59, the observation is generated by the health state and sub-state, while the duration affects the health states, leading to dynamic transition probabilities for the health states. The health state also affects the duration of the next health state. Incorporating duration information is critical for prognostics. Future research will implement this state-duration model for RUL estimation.

6.7 CONCLUSION

The Hierarchical Hidden Markov Model (HHMM) is employed here to estimate the remaining useful life. An HHMM is composed of sub-HMMs in a hierarchical fashion, providing functionality beyond a plain HMM for modeling complex systems. In our previous work, HHMMs were employed to estimate machine health states, which are represented as distinct nodes at the top of the hierarchy. The transition probabilities between health states in the HHMM give us the opportunity to estimate the RUL. In this chapter, the HHMM is applied to a drilling machine, and Monte-Carlo simulation is employed to estimate the RUL of the drill bits. The RUL estimation results are very promising and are reported in detail above.


CHAPTER VII
CONCLUSION & FUTURE RESEARCH

Condition-Based Maintenance (CBM) is the philosophy of monitoring the health of a machine by analyzing various signals collected from different sensors, in order to minimize maintenance and failure costs while maximizing equipment availability. Diagnostics and prognostics are the main elements of CBM technology. Process monitoring, diagnostics, and prognostics have been studied in this research.

Process monitoring targets abrupt failures that develop rapidly. Novelty detection is a broader formulation of process monitoring: the process of finding abnormal behavior by learning the normal behavior of a system, especially when sufficient abnormal data are lacking. Existing methods in the literature lack the ability to handle non-stationary and non-parametric processes, yet industrial processes may be both. In addition, the data-independence assumption and the inability to learn from available abnormal data limit the performance of most existing methods. We developed a non-parametric novelty detection method that can handle non-stationary processes. The proposed method does not assume data independence, requires only normal data, and can learn from abnormal examples when they are available. The support vector machine, a powerful method that has been successfully applied in many areas, forms the basis of the method. The method was applied to synthetic data as well as benchmark datasets from the literature. The results are promising and are reported in the dissertation.

Health state estimation is the process of identifying the state of an incipient failure, even while the machine is working properly. The diagnostic methods in the literature can identify a failure only after it reaches a certain severity level. However, in contrast to abrupt failures, incipient failures develop slowly, and it is important to identify the health state in advance for effective and timely diagnosis. For prognostics, health state estimation is not only important but a prerequisite. Hidden Markov model based methods are employed for health state estimation. Variants of hidden Markov models (regular, auto-regressive, and hierarchical) have been implemented as health state estimators. The auto-regressive HMM removes the data-independence assumption made in the regular HMM, and the hierarchical HMM can represent complex systems more effectively. Implementing the HMM as a dynamic Bayesian network dramatically reduces the number of parameters and gives more flexibility in the model structure.

Prognostic methods acquire health state information from diagnostics and estimate the remaining useful life (RUL) based on an incipient failure. Unfortunately, no robust generic prognostic methods exist in the literature. Prognostics essentially estimates the time to be spent on the way from the estimated current health state to the failure state. Monte-Carlo simulation is employed here, with the transition probabilities between states in a hierarchical HMM, to characterize the RUL probability distribution. The results obtained from each module of the dissertation are very promising and are reported in the dissertation.

Modeling the least-squares SVM (LS-SVM) for novelty detection is recommended as future research. The LS-SVM solves a system of linear equations instead of a quadratic programming problem and has not yet been used for one-class classification. Future research for prognostics is the development of HMM variants that incorporate state duration information directly into the model. An HMM with state duration information can give better RUL estimates. In Markov models, state duration information does not exist explicitly, and state transition probabilities are constant for a given state. Hidden semi-Markov models can incorporate state duration information and can facilitate dynamic state transition probabilities. Modeling details are given in Chapter 6; implementation of the hierarchical hidden semi-Markov model is left for future research.


REFERENCES

[All90] G. Allenby, Condition Based Maintenance, Proceedings of COMADEM 90, The 2nd International Congress on Condition Monitoring and Diagnostic Engineering Management, (1990), 155-162.

[AR88]

Alwan, L. C., Roberts, H. V. Time Series Modeling for Statistical Process Control, Journal of Business and Economic Statistics, 6, (1988), p. 87-95.

[BBFN98] K. Becker, C. Byington. N. Forbes, G. W. Nickerson, Predicting and Preventing Machine Failures, American Institute of Physics, (1998). [BC03]

P. Baruah, R.B Chinnam, HMMs for Diagnostics and Prognostics in Machining Processes, Proceedings of the 57th Society for Machine Failure Prevention Technology Conference, Virginia Beach, VA, April, (2003).

[BG01]

C.S. Byington and A.K. Garga, Data Fusion for Developing Predictive Diagnostics for Electromechanical Systems, Handbook of Multisensor Data Fusion, D.L. Hall and J. Llinas eds., CRC Press, FL: Boca Raton, 2001.

[BL97]

Box, G. E. P, Luceno, A., Statistical Control by Monitoring Feedback Adjustment (New York: Wiley), 1997.

[BMA00]

C. Bunks, D. McCarthy, T. Ani, Condition Based Maintenance of Machines Using Hidden Markov Models, Mechanical Systems and Signal Processing, 14(4), 2000, 597-612

[BMBM99]

C. Begg, T. Merdes, C.S. Byington, and K.P. Maynard, Mechanical System Modeling for Failure Diagnostics and Prognosis, Maintainability and Reliability Conference (MARCON 99), Gatlinburg, May (1999)


[BRG02]

C.S. Byington, M.J. Roemer, and T. Galie, “Prognostic Enhancements to Diagnostic Systems for Improved Condition-Based Maintenance,” Proceedings of the 2000 IEEE Aerospace Conference, Big Sky, MT, March 2002.

[Bur98]

C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition Data Mining and Knowledge Discovery, 2, (1998) 121-167

[Cam05]

F. Camci, Autonomous Diagnostics and Prognostic Framework, PHD Dissertation, Wayne State University, 2005

[CB03]

R. B. Chinnam, P. Baruah, Autonomous diagnostics and prognostics through competitive learning driven HMM-based clustering Proceedings of the International Joint Conference on Neural Networks, July 2003, 2466- 2471

[CC05]

F. Camci, R. B. Chinnam, Dynamic Bayesian Networks for Machine Diagnostics: Hierarchical Hidden Markov Models vs. Competitive Learning, Proceedings of the International Joint Conference on Neural Networks, 2005

[CC98]

Cook, D. F.; Chiu, C.; Using Radial Basis Function Neural Networks to Recognize in Correlated Manufacturing Process Parameters, IIE Transactions, Vol.30, No.3 (1998), p. 227-234.

[CCH97] J.P. Cusumano, D. Chelidze, and N.K. Hecht, “Using phase space reconstruction of tract parameter drift in a nonlinear system,” Proceedings of the ASME 16th Biennial Conference on Mechanical Vibrations and Noise, Symposium on Time-Varying Systems and Structures, September 14-17, 1997. [Chi02]

Chinnam, R. B.; Support Vector Machines for Recognizing Shifts in Correlated and Other Manufacturing Processes, International Journal of Production Research, Vol.40, No.17, (2002), p. 4449-4466.


[CKML04]

Chen, Q.; Kruger, U.; Meronk M.; Leung A. Y. T.; Synthetic of T2 and Q Statistics for Process Monitoring, Control Engineering Practice, Vol.12, No.6, (2004) p. 745755.

[CM95]

M. Costa and L. Moura, “Automatic assessment of scintmammographic images using a novelty filter”, in Proc. 19th Annual Symposium on Computer Applications in Medical Care, PA, 1995, pp. 537-541.

[CT00]

N. Cristianini, J. S. Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, 2000, 122-125

[CV95]

C. Cortes, V. N. Vapnik, Support vector networks, Machine Learning, 20(3), (1995) 273-297.

[DF95]

D. Dasgupta and S. Forrest, “Novelty detection in time series data using ideas from immunology”, in Proc. of the International Conference on Intelligent Systems, Nevada, 1995, pp. 82-87.

[DK89]

T. Dean and K. Kanazawa, A model for reasoning about persistence and causation. Artificial Intelligence, 93(1–2), 1–27, 1989.

[Dou95]

J. Douglas, The Maintenance Revolution, EPRI J. 6-15 May/June 1995

[DSBR03] L. Dennis S. Sun; L. J. Brian. M. E. Rob “New Fault diagnosis of Circuit Breakers” in IEEE Transactions on Power Delivery 18(2), 2003,454-459 [Dui76]

P. W. Duin: “On the Choice of Smoothing Parameters for Parzen Estimators of Probability Density Functions”. IEEE Trans. Computers, vol.25, no.11, pp.1175-1179, 1976.

[DV02]

R. Dickman, R. Vidigal, Quasi-stationary distributions for stochastic processes with an absorbing state, Journal of Physics A: Mathematical and General, (35), 2002, 1147-1166


[EA03]

Eickelmann, N.; Anant, A.; Statistical Process Control: What you don’t measure can hurt you!, Software, IEEE, Vol.20, No.2, (2003), p. 49-51.

[EFKAE01] A. Elmitwally,S. Farghal, M. Kandil, S. Abdelkader, M. Elkateb “Proposed Waveletneurofuzzy Combined System for Power Quality Violations Detection and Diagnosis” in IEE Proceedings-Generation, Transmission and Distribution, Vol.148,No.1,2001, pp.15-20 [EGBH00] S. J. Engel, B. J. Gilmartin, K. Bongort, A. Hess, Prognostics, The Real Issues Involved With Predicting Life Remaining, IEEE Aerospace Conference Proceedings, 6, (2000) 457-469. [Elv97]

B. Elverson, Machinery fault diagnosis and prognosis, MS Thesis, The Pennsylvania State University, (1997).

[FC04a]

F. Camci, R. B. Chinnam, Online Novelty Detections for Non-stationary Classes using a Modified SVM Proceedings of Neural Networks and Computational Intelligence, Switzerland, 2004

[FC04b]

F. Camci, R. B. Chinnam, Non-stationary Data Domain Description using Support Vector Novelty Detector, International Joint Conference on Neural Networks, Budapest, 2004

[Fri00]

S. R. Friend, A Probabilistic, Diagnostic and Prognostic System for Engine Health and Usage Management, IEEE Aerospace Conference, 6, (2000).

[Gar01]

A.K. Garga et al., “Hybrid reasoning for prognostic learning in CBM systems,” Proceedings of the 2000 IEEE Aerospace Conference, Big Sky, Montana, March 1017, 2001.

[GCMC00] H. Gao,J. Crossman,Y. Murphey, M. Coleman, “Automotive Signal Diagnosis Using Wavelet and Machine Learning” in IEEE transactions on vehicular technology vol 49, No:5 2000


[GDZX04] M. Ge, R. Du, G. Zhang, Y. Xu, Fault Diagnosis Using Support Vector Machine with an Application in Sheet Metal Stamping Operations, Mechanical Systems and Signal Processing, 18, (2004), 143-159 [GWBWB97] V. B. Gallagher, R. M. Wise, S. W. Butler, D. D. White, and G. G Barna, “Development and benchmarking of multivariate statistical process control tools for a semiconductor etch process; improving robustness through model updating”, in Proc. International Symposium on Advanced Control of Chemical Processes, Banff, Canada 1997, pp.149-161. [GZ03]

S. Goumas, M. Zervakis, Classification of Washing Machine Vibration Signals Using Discrete Wavelet Analysis for Feature Extraction, IEEE Transactions on Instrumentation and Measurement, 51(3) (2003)

[Har03]

Harbor Research Pervasive Internet Report, “Approaching Zero Downtime: The Center for Intelligent Maintenance Systems” April 2003

[HBVBD00] G. D. Hadden, P. Bergstrom, G. Vachtsevanos,B. H. Bennett, J. V. Dyke, Shipboard Machinery Diagnostics and Prognostics/ Condition Based Maintenance: A Progress Report, 2000 IEEE Aerospace Conference. Proceedings,6, 2000, 272-296 [Her95]

C. Herringshaw, “Detecting attacks on networks”, Computer, vol.30, no.12, pp.16-17, 1997.

[HHA01]

W. Hardman, A. Hess, and R. Ahne, “USN Development Strategy, Fault Testing Results, and Future Plans for Daignostics, Prognostics, and Health Management of Helicopter Drive Train Systems,” Proceedings of the DSTO International Conference on Health and Usage Monitoring, Melbourne, February 19-20, 2001.

[HR91]

Harris, T.J., Ross, W. H., Statistical Process Control Procedures for Correlated Observations, Canadian Journal of Chemical Engineering, Vol. 69, (1991) p. 48-57.


[Ise84]

R. Isermann, Process Fault Detection Based on Modeling and Estimation Methods-A survey, Automatica, 20(2) (1984) 387-404

[Kac02]

G. J. Kacprzynski et al., “Enhancement of Physics-of-Failure Prognostic Models with System Level Features, Proceedings of the 2000 IEEE Aerospace Conference, Big Sky, MT, March 2002.

[KP02]

K. Kim, A. Parlos, “Induction Motor Fault Diagnosis Based on Neuropredictors and Wavelet Signal Processing” in IEEE/ASME Transaction on Mechatronics vol.7,No 2, 2002

[KZX03]

C. Kwan, X. Zhang, R. Xu, L. Haynes, A novel approach to fault diagnostics and prognostics, Proceedings. ICRA '03. IEEE International Conference on Robotics and Automation. 1(3), September 2003, 604-609

[Law89]

R. Lawrence, A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of IEEE 77(2), (1989) 257-286

[LJW02]

E.K. Lada, L. Jye-Chyi, J.R Wilson “A Wavelet-based Procedure for Process Fault Detection” in IEEE Transactions on Semiconductor Manufacturing vol.15, no.1 pp.79-90 2002

[LM00]

F. LeGland, L. Mevel, Fault Detection in Hidden Markov Models: A Local Asymptotic Approach, Proceedings of the 39th IEEE Conference on Decision and Control, 5, (2000) 4686-90

[LMVCS01] K. Van Leemput, F. Maes, D. Vandermeulen, A. Colchester, and P. Suetens, “Automated segmentation of multiple sclerosis lesions by model outlier detection”, Medical Imaging IEEE Transactions, vol.20, no.8 pp.677-688, 2001. [LT97]

R. Logenrand, D. Talkington, Analysis of cellular and functional manufacturing systems in the presence of machine breakdown, International Journal of Production Economics, 53(3), (1997) 239-256


[Mar01]

A. Mathur et al., “Reasoning and Modeling Systems in Diagnosis and Prognosis,” Proceedings of the SPIE AeroSense Conference, Orlando, FL, April 16-20, 2001.

[MBND99] K. Maynard, C. S. Byington, G. W. Nickerson, and M. V. Dyke, Validation of Helicopter Nominal and Faulted Conditions Using Fleet Data sets, Proceedings of the International Conference on Condition Monitoring, UK, (1999) 129 –141 [MM99]

Murphy, K., Mian S., Modeling gene expression data using dynamic Bayesian networks, Technical report, Computer Science Division, University of California, Berkeley, CA, 1999

[MMRTS01] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. “An introduction to kernel-based learning algorithms”,. IEEE Neural Networks, vol.12, no.2, pp.181-201, 2001. [MMZ96] Martin, E.B.; Morris, A.J.; Zhang, J., Process performance monitoring using multivariate statistical process control, Control Theory and Applications, IEE Proceedings-, Vol.143, Iss.2, Mar 1996 p.132-144 [MP03]

J. Ma and S. Perkins, “Online novelty detection on temporal sequences”, in Proc. of International Conference on Knowledge Discovery and Data Mining, Washington DC, 2003, pp. 417-423.

[MSIH04] Manabu, K.; Shinji, H.; Iori, H.; Hiromu, O.; Evolution of Multivariate Statistical Process Control: Application of Independent Component Analysis and External Analysis, Computers & Chemical Engineering, Vol. 28, No. 6-7, (2004), p. 11571166. [Mur01]

K. P. Murphy, The Bayes Net Toolbox for Matlab, Computing Science and Statistics, 33 2001 /I2001Proceedings/KMurphy/KMurphy.pdf

[Mur02]

K. Murphy, Dynamic Bayesian Network: Representation, Inference, and Learning, PhD. Dissertation, University of California, Berkeley, 2002


[NCRTT97] T. Nairac, C. Corbet, R. Ripley, N. Townsend, and L. Tarassenko, “Choosing an appropriate model for novelty detection”, in Proc. of 5th International Conference on Artificial Neural Networks, UK, 1997, pp.117-122. [Nist98]

NIST-ATP CBM Workshop Report, NIST-ATP Workshop on Condition-Based Maintenance, Atlanta, GA, November (1998) (http://www.atp.nist.gov/www/cbm/cbm_wp1.htm)

[OSL02]

R. Orsagh, C. Savage, M. Lebold, Development of Performance and Effectiveness Metrics for Gas Turbine Diagnostics Technologies IEEE Aerospace Conference Proceedings, 6 (2002) 2825-2835

[Pat02]

Jr. JD. Patton, Preventive Maintenance, New York: Instrument of Society of America, 1983

[Pug91]

G. A. Pugh, A Comparison of Neural Networks to SPC Charts, Computers and Industrial Engineering, 21, 253-255, 1991.

[PW83]

S.M Pandit, S. Wu, Time Series and System Analysis with Applications (New York, Wiley), (1983).

[QC96]

S. Qian, D. Chen, Joint Time-Frequency Analysis: Methods and Applications, Englewood Cliffs, NJ: Prentice-Hall, 1996.

[Rao92]

SS Rao, Reliability Based Design New York: McGraw-Hill,1992

[RG99]

S. Roweis & Z. Ghahramani, A Unifying Review of Linear Gaussian Models, Neural Computation 11(2) (1999) 305-345

[RJ91]

S. J. Raudys and A. K. Jain, “Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners”, Pattern Analysis and Machine Intelligence, IEEE Transactions, vol.13, no.3, pp.252-264, 1991.


[RK00]

M.J. Roemer and G.J. Kacprzynski, “Advanced diagnostics and prognostics for gas turbine risk assessment,” Proceedings of the 2000 IEEE Aerospace Conference, Big Sky, Montana, March 18-25, 2000.

[ROS91]

Rose, K.; Mathematics of Success and Failure, Circuits and Devices, IEEE, Vol.7, No.6, (1991), p.26-30

[RT96]

A. Ray, S. Tangirala, Stochastic Modeling of Fatigue Crack Dynamics for Online Failure Prognostics in “IEEE Transactions on Control Systems Technology” Vol. 4, No. 4, pp. 443-451 July 1996

[RWP95] Runger, G.C, Willemain, T.R, Prabhu, S. Average Run Lengths for CUSUM Control Charts Applied to Residuals, Communications in Statistics Theory and Methods, 24, 273-282, 1995. [San00]

I. Sanches, Noise-compensated hidden Markov models, Speech and Audio Processing, IEEE Transactions on, 8(5), Sep 2000, 533-540

[Sch00]

B. Schölkopf, Statistical Learning and Kernel Methods, Technical report, Microsoft Research, Microsoft Corporation, February (2000).

[SK97]

H. Saranga and J. Knezevic, “Reliability prediction for condition-based maintained systems,” Reliability Engineering and System Safety, Vol. 71, pp. 219-224, 2001.

[SM03]

S. Singh and M. Markou, An Approach To Novelty Detection Applied To The Classification Of Image Regions, IEEE Transactions on Knowledge And Data Engineering, 15, (in press, 2003).

[Smi94]

E. Smith, X-bar and R Control Chart Interpretation Using Neural Computing, International Journal of Production Research, 32, 309-320, 1994.

[Smy98]

P. Smyth,. Belief networks, hidden Markov models, and Markov random fields: a unifying view, Pattern Recognition Letters 18(11-13) (1998) 1261-1268


[Ste00]

Stephenson, T., An introduction to Bayesian Network Theory and Usage, IDIAP-RR 00-03, 2000

[Str88]

G. Strang, (1988). Linear algebra and its applications. Harcourt Brace Jovanovich College Publishers.

[SWST99] B. Schölkopf, R. Williamson, A. Smola, and J. S. Taylor, “SV Estimation of a Distribution’s Support”, in Proc. NIPS’99, 1999.

[Tax01]

D. M. Tax, “One Class Classification”, Ph.D. dissertation, Delft Technical University, 2001.

[TD01]

D. Tax, R. Duin, Combining one-class classifiers, Proceedings of the Second International Workshop Multiple Classifier systems, MCS 2001, (Cambridge, UK, July 2001), Lecture Notes in Computer Science, 2096, Springer Verlag, Berlin, 2001, 299-308.

[TD99]

D. M. Tax and R. Duin, “Support vector domain description”, Pattern Recognition Letters, vol.20, no.11-13, pp 1191-1199, 1999.

[TFM01]

P. Tappert, A. von Flotow, and M. Mercadal, “Autonomous PHM with blade tip sensors: algorithms and seeded fault experience,” Proceedings of the 2000 IEEE Aerospace Conference, Big Sky, Montana, March 10-17, 2001.

[TG74]

J. Tou, R. Gonzalez, Pattern Recognition Principles, Reading, MA: Addison-Wesley, 1974.

[THC95]

L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, “Novelty detection for the identification of masses in mammograms”, in Proc. 4th International Conference on Artificial Neural Networks, London, UK, 1995, pp. 442-447.

[TK98]

S. Theodoridis, K. Koutroumbas, Pattern Recognition, NY Academic 1998

[Vap98]

V. Vapnik, Statistical learning theory, Wiley, 1998, pp.401-440.


[VPE01]

G. Vachtsevanos, W. Peng, J. Echauz A Wavelet Neural Network Framework for Diagnostics of Complex Engineered Systems, Proceeding of the 2001 IEEE International Symposium on Intelligent Control (ISIC '01) (2001) 79-84

[VW99]

G. Vachtsevanos and P. Wang, An intelligent approach to fault diagnosis and prognosis, MFPT’53, 231-241 ,1999

[VWK99] G. Vachtsevanos, P. Wang and N. Khiripet, “Prognostication: Algorithms and Performance Assessment Methodologies”, ATP Fall National Meeting Condition-Based Maintenance Workshop, San Jose, California, November 15-17, 1999.

[WH99]

B.D. Womack, J.H.L. Hansen, N-channel hidden Markov models for combined stressed speech classification and recognition Speech and Audio Processing, IEEE Transactions on, 7(6), Nov 1999, 668-677

[WM03]

K. Worden and G. Manson, “Experimental validation of structural health monitoring methodology”, Journal of Sound Vibration, vol.259, no.2, pp.345-363, 2003.

[WM99]

Woodall, W.H., Montgomery, D.C., Research Issues and Ideas in Statistical Process Control, Journal of Quality Technology, 31 (4), (1999) 376-386.

[WMA03] K. Worden, G. Manson, and D. J. Allman, “Experimental validation of structural health monitoring methodology I: novelty detection on a laboratory structure”, Journal of Sound and Vibration, vol.259, no.2, pp.323-343, 2003. [WMP94] Wardell, D. G., Moskowitz H., Plante, R. D., Run-length Distributions of Specialcause Control Charts for Correlated Processes (with discussion), Technometrics, 36, 3-27, 1994. [WPKLV99] P. Wang,N. Propes, N. Khiripet, Y. Li, G. Vachtsevanos, An Integrated Approach to Machine Fault Diagnosis, IEEE Annual Textile, Fiber and Film Industry Technical Conference (1999) 7


[WV99]

P. Wang and G. Vachtsevanos, “Fault prognosis using dynamic wavelet neural networks”, Maintenance and Reliability Conference, MARCON 99, Gatlinburg, May 10-12, 1999.

[Yan02]

S. K. Yang, "An Experiment of State Estimation for Predictive Maintenance Using Kalman Filter on a DC Motor", Reliability Engineering and System Safety, vol. 75, no. 1, pp. 103-111, 2002.

[YC03]

Yuan and D. Casasent, "Support vector machines for class representation and discrimination", in Proc. International Joint Conference on Neural Networks, Portland, OR, 2003, pp. 1610-1615.

[YCO00]

W. Li, H. Yue, S. Valle-Cervantes, and J. Qin, "Recursive PCA for adaptive process monitoring", Journal of Process Control, vol. 10, no. 5, pp. 471-486, 2000.

[YCY03]

D. Yu, J. Cheng, and Y. Yang, "Application of EMD Method and Hilbert Spectrum to the Fault Diagnosis of Roller Bearings", Mechanical Systems and Signal Processing, October 2003.

[YTW00]

K. Yamanishi, J. Takeuchi, and G. Williams, “Online unsupervised outlier detection using finite mixtures with discounting learning algorithms”, in Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000, pp.250-254.

[ZLZ98]

Y. Zhang, X. R. Li, and K. Zhou, "A Fault Detection and Diagnosis Approach Based on Hidden Markov Chain Model", in Proc. 1998 American Control Conference (ACC), vol. 4, 1998, pp. 2012-2016.

[ZY01]

J. Zhang and Y. Yan, "Wavelet Based Approach to Abrupt Fault Detection and Diagnosis of Sensors", IEEE Transactions on Instrumentation and Measurement, vol. 50, no. 5, pp. 1389-1396, 2001.


ABSTRACT

PROCESS MONITORING, DIAGNOSTICS AND PROGNOSTICS USING SUPPORT VECTOR MACHINES AND HIDDEN MARKOV MODELS

by FATIH CAMCI March 2005

Advisor:

Assoc. Prof. Ratna Babu Chinnam

Major:

Industrial Engineering

Degree:

Doctor of Philosophy

Condition-Based Maintenance (CBM) technology increases system availability and safety while reducing costs, through reduced maintenance and inventory, increased capacity, and enhanced logistics and supply chain performance. Employing effective generic process monitoring methods for abrupt failures, together with diagnostic and prognostic algorithms for incipient failures, is an important prerequisite for widespread deployment of CBM. Diagnostics is the process of identifying, localizing, and determining the severity of a machine failure, whereas prognostics is the process of estimating the remaining useful life (RUL). In contrast to prognostics, many methods exist in the diagnostics literature. However, most generic diagnostic algorithms cannot detect failure modes in a timely manner. A generic, machine-independent method for diagnostics and prognostics is the dream of researchers in this field and the focus of this research.


This work presents methods based on support vector machines and hidden Markov models to diagnose abrupt and incipient failures and to estimate the RUL. The presented methods can handle non-stationary processes. This dissertation has three major goals: detecting abrupt failures (i.e., process monitoring), identifying the state of incipient failures in advance (i.e., health state estimation), and estimating the RUL of the machine (i.e., prognostics). A General Support Vector Representation Machine (GSVRM), based on novelty-detection principles, is proposed for process monitoring. GSVRM is a non-parametric method and makes no strong assumptions about the auto-correlation structure of the data. In addition, it requires only 'normal' data for training, although it can learn from failure data when available. GSVRM is evaluated on benchmark datasets from the literature as well as on synthetic datasets. Variants of hidden Markov models (regular, auto-regressive, and hierarchical HMMs) are implemented for health state estimation and prognostics. Implementing the HMM as a dynamic Bayesian network dramatically reduces the number of parameters and provides more flexibility in model-structure design. For prognostics, Monte-Carlo simulation using Markov models is employed to estimate the RUL distribution. The proposed health state estimation and prognostics methods are applied to a drilling process, one of the most common industrial machining processes. The results of all three modules are promising and are reported in the dissertation.
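The Monte-Carlo RUL idea described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the 4-state health model, its transition probabilities, and the starting state are all hypothetical assumptions chosen for the example, not values estimated from the drilling data.

```python
import random

# Hypothetical 4-state health model: states 0-2 are progressively worse
# degradation stages, state 3 is failure (absorbing). The transition
# probabilities below are illustrative only.
P = [
    [0.95, 0.05, 0.00, 0.00],
    [0.00, 0.90, 0.10, 0.00],
    [0.00, 0.00, 0.85, 0.15],
    [0.00, 0.00, 0.00, 1.00],
]
FAILURE = 3

def simulate_rul(current_state, n_runs=10000, seed=42):
    """Monte-Carlo estimate of the RUL distribution: repeatedly walk the
    Markov chain from the current health state until the absorbing
    failure state is reached, counting the number of steps survived."""
    rng = random.Random(seed)
    ruls = []
    for _ in range(n_runs):
        state, steps = current_state, 0
        while state != FAILURE:
            state = rng.choices(range(len(P)), weights=P[state])[0]
            steps += 1
        ruls.append(steps)
    return ruls

# Summarize the simulated RUL distribution for a machine currently
# estimated (e.g., by an HMM) to be in health state 1.
ruls = sorted(simulate_rul(current_state=1))
mean_rul = sum(ruls) / len(ruls)
p10, p90 = ruls[len(ruls) // 10], ruls[9 * len(ruls) // 10]
print(f"mean RUL = {mean_rul:.1f} steps, 10th-90th percentile = [{p10}, {p90}]")
```

Because the chain's failure state is absorbing, every simulated run terminates, and the empirical distribution of step counts approximates the RUL distribution conditioned on the current health state; in practice the current state would come from the health state estimation module rather than being assumed.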


AUTOBIOGRAPHICAL STATEMENT Fatih Camci is a Doctoral Candidate in the Department of Industrial & Manufacturing Engineering at Wayne State University (U.S.A.). He received his B.S. degree in Computer Engineering from Istanbul University (Turkey) and M.S. degree in Computer Engineering from Fatih University (Turkey). His research interests include Computational Intelligence, Diagnostics, Prognostics, Novelty Detection, and Condition-Based Maintenance. He is a member of the Computational Intelligence Society.
