HANDBOOK OF NEURAL COMPUTATION
Pijush Samui, Sanjiban Sekhar Roy, and Valentina E. Balas

CHAPTER 23

UNSUPERVISED DEEP LEARNING FOR DATA-DRIVEN RELIABILITY AND RISK ANALYSIS OF ENGINEERED SYSTEMS

Peng Jiang∗, Mojtaba Maghrebi†, Alan Crosky∗, Serkan Saydam‡
∗School of Materials Science and Engineering, UNSW Australia, Sydney, NSW, Australia
†Ferdowsi University of Mashhad, Khorasan Razavi, Iran
‡School of Mining Engineering, UNSW Australia, Sydney, NSW, Australia

23.1 INTRODUCTION

Reliability and risk analysis is an engineering discipline that studies the stability and risk of a system or component, emphasizing the prediction of the evolving state of the system under given conditions. For example, product or maintenance engineers need to estimate the ability of a product or facility to function so that its service life can be accurately predicted. By studying the historic trends of stocks, financial engineers make trading decisions to maximize returns. Despite the different domains, reliability and risk analysis involves the analysis of historic data, from which a quantified prediction or decision is then made: the probability of failure of a system [1], the remaining useful life of a structure [2], or a trading strategy [3]. Whether the predictions are acceptable therefore depends heavily on how much useful information can be extracted from the original data and on how the analysis is conducted.

Thanks to the development of various data collection devices, the available data have grown to a degree that makes manual processing difficult. Decades ago a civil engineer might have had to inspect structures every week and combine the observations with previous experience to estimate a safety factor. Nowadays one can easily obtain different sources of data from sensors at a sampling frequency of 1 MHz. Traders are not limited to the open and close prices provided by the stock market; they also use mass media and social websites to analyze others' attitudes towards future economic conditions and make better decisions. With this explosion of available data, data-driven methods, particularly machine learning based methods, have been developed and have achieved considerable success.

Reliability analysis is one of the most widely studied topics in civil and mechanical engineering. Many case studies in this domain involve complex failure mechanisms and unpredictable failure processes. For example, stress corrosion cracking (SCC) is one of the most dangerous failure modes because the crack propagates rapidly without any obvious signs. In addition, the mechanism and the variables affecting the cracking process remain debatable, adding difficulty to analyzing and predicting this failure with physics-based models.


Shi et al. [4] used an artificial neural network (ANN) model to predict the crack growth rate in Type 304 stainless steel. The predicted crack growth rates were in good agreement with the experimental values, and a sensitivity analysis showed that temperature and conductivity contributed the most to the crack growth rate. Kamrunnahar et al. [5] developed an ANN model using actual measurements of corrosion weight loss of Alloy 22 to predict the future corrosion weight loss of this material; the results demonstrated good agreement between predicted and measured values under similar environmental and sampling conditions. Jiménez-Come et al. [6] presented an automatic ANN-based model to predict the pitting corrosion behavior of austenitic stainless steel, and the results revealed excellent classification performance compared with K-Nearest Neighbors (KNN) and classification trees (CT).

For industries requiring real-time quality monitoring, reliability and risk analysis can be conducted for early defect identification to avoid potential losses. For example, semiconductor foundries often use hundreds of processes, and an undetected failure in one process can lead to a considerable reduction in yield. Kim et al. [7] developed a support vector machine based novelty detection method for early fault detection, which can classify novel and minority products successfully. Malhotra et al. [8] used stacked long short-term memory (LSTM) networks for anomaly detection, showing that this model can learn higher-level temporal correlations in a time-series data set without any prior knowledge. Celaya et al. [9] developed Gaussian process regression (GPR) to estimate the remaining useful life (RUL) of metal oxide field effect transistors (MOSFETs) from historic run-to-failure data, and their model provided valid results compared with a physics-based model.

As mentioned above, reliability engineering now collects data from various sources, and engineers are expected to integrate and process multi-dimensional information. One popular type is time-series data, also known as temporal data. Time-series data are sampled from a sequential process over time and differ from other types of data in two respects. First, the information is significantly noisy: a sensor can generate billions of samples, most of which may be contaminated with environmental noise, complicating further analysis. Second, time-series data are often non-stationary, meaning that statistics such as the mean, variance, and frequency content change over time. It is not easy to find an invariant or a single distribution that describes the process with conventional approaches; only after a series of filtering and transformation steps can the signals be analyzed and classified.

For some reliability engineering problems, particularly structures whose failure or error patterns have a spatial distribution, the related spatial information is more useful. For example, an underground tunnel might have areas where failure occurs frequently, and a mine site may have areas where the chance of collapse is clearly higher than elsewhere. Owing to the limitations of conducting in situ experiments, most of the available information can only be obtained from observations. In this case, spatial data can be a valuable source of information. Spatial data are often topological or distance information conveying quantified observations or events such as failure frequency, stress level, or environmental variables.

Cross-sectional data are also a common type, collected by recording many features of a system or its components at the same point in time. The opportunity to observe or conduct experiments is scarce in some situations, so engineers have to collect enough data at one time. This type of data can lack useful information, since reliability problems are often time dependent. However, a reasonable reliability and risk analysis can still be conducted when enough observations with remarkable variance in features are provided [10].

Reliability engineering problems often involve two or more types of data. For instance, spatial-temporal databases are widely used in georisk assessment [11]. Multi-dimensional data of various types contribute to a better understanding of the mechanism; on the other hand, they pose a challenge in finding proper information extraction methods.

Deep learning, a class of artificial neural networks with multiple layers of feature transformation, has achieved record-breaking results in various machine learning problems [12]. In this study, we review one popular unsupervised deep learning framework, the autoencoder, and discuss its application to reliability and risk analysis. It should be noted that reliability analysis covers a wide range of research objectives and approaches. In this study, we focus only on data-driven techniques, which do not require prior knowledge of the complex mechanisms behind the evolving patterns of a system. Conventional model-based methods [13,14] and the First- and Second-Order Reliability Methods (FORM and SORM) [15] are not within our scope.

23.2 RELIABILITY AND RISK ANALYSIS OF ENGINEERED SYSTEMS

A number of challenges exist in reliability and risk analysis of engineered systems. First, complex working mechanisms make it difficult for an inexperienced engineer to monitor engineered systems. To address this problem, automatic monitoring systems, e.g., distributed sensor networks, have been applied to real-time monitoring and diagnosis, collecting health information about systems at relatively low cost. Secondly, it is difficult to interpret the raw data collected from monitoring instruments. In general, domain-dependent feature engineering is needed to reconstruct the raw monitoring information. If distributed sensor networks are employed, signal processing techniques are required to denoise the electronic monitoring signals. Moreover, some threshold values need to be set based on domain-dependent knowledge [16]; these thresholds are used to build rule-based diagnostic methods for reliability and risk analysis of an engineered system. Another challenge lies in developing algorithms for conducting the analysis itself. In general, a real value obtained from an algorithm, e.g., the remaining useful life (RUL) or a health state, is needed for planning maintenance, and algorithms play a critical role in mapping monitoring information to such values. To summarize, monitoring instruments, feature engineering, and algorithms are the three challenges in reliability and risk analysis of engineered systems.

With the development of smart sensors, the available information has increased to a degree at which traditional domain-dependent feature engineering no longer works [17]. The complex failure mechanisms also require algorithms with greater learning capacity to capture the hidden patterns [10]. These challenges motivate the application of deep learning to reliability and risk analysis. Unlike many algorithms that require domain-dependent feature engineering, deep learning offers a class of unsupervised algorithms, e.g., autoencoders, for feature reconstruction. More importantly, deep learning shows an encouraging capability to capture complex hidden patterns in data, and has achieved a number of record-breaking successes in image processing [18], speech recognition [19], and natural language processing [20].

23.3 DEEP LEARNING: THEORETICAL BACKGROUND

Just as a healthy human being uses the eyes, ears, and skin to perceive the world, a machine can collect information through various smart sensors, historic records, or even prior knowledge (experience) from humans, and then use algorithms to process the data. The main difference between a human being and a machine lies in the learning and feature extraction approaches, i.e., how to extract useful features that can serve as important predictors for further analysis or prediction. The machine learning community has developed a considerable number of methods, such as support vector machines (SVM), random forests, and artificial neural networks. Limited by their shallow feature extraction structures, most machine learning methods need domain-dependent feature engineering to extract information from raw data [21,22]. As a representation-learning method, deep learning consists of multiple layers of non-linear representation, freeing one from complex, hand-engineered features.

Generally, machine learning problems can be divided into supervised learning and unsupervised learning. The former requires that the data include targets or labels, which can be predicted by mapping the n-dimensional input data x to a real value y: R^n → R. Unsupervised learning, the main focus of this study, emphasizes a reconstruction of the original input data: rather than predicting y from x, we are only concerned with some important properties of the original data. It should be noted that it is sometimes unnecessary to formally distinguish the two categories. For example, in semi-supervised learning only a few data points are provided with labels, and clustering, an unsupervised learning method, can also divide a data set into different groups. In fact, unsupervised learning is often conducted to pretrain a supervised learning architecture in order to boost its prediction performance.

In reliability engineering, it is often expensive to obtain labeled data. For example, to estimate the remaining useful life of a product, run-to-failure experiments are needed; in some cases these experiments are so time-consuming that labels are lacking in the available data. Currently, smart sensors are employed to collect information about a system. The sensory information is sometimes quite noisy, which means that a data reconstruction step should be performed before further analysis. In this section, we introduce autoencoders, a popular unsupervised deep learning framework, aiming to describe the data reconstruction procedure that is often needed in reliability and risk analysis.

23.3.1 AUTOENCODERS

An autoencoder is an artificial neural network that attempts to reproduce its original input by encoding and decoding. A simple autoencoder consists of an encoder and a decoder, as shown in Fig. 23.1. The former transforms the original input into a hidden representation h = f(x), and the latter maps the hidden representation from the encoder back onto the original input, r = g(f(x)). Since the final output of the decoder is nothing but an approximation of the input, we are only concerned with the feature transformation occurring during encoding, which is expected to extract the essential information from the original representation at the lowest possible reconstruction error L(x, r).

FIGURE 23.1 Schematic view of an autoencoder consisting of one encoder and one decoder.

The encoder and decoder are often implemented as multi-layer perceptrons (MLPs), with linear or non-linear activation functions a:

f(x) = a_f(b_f + W_f x)    (23.1)

g(h) = a_g(b_g + W_g h)    (23.2)

where a_f and a_g denote the activation functions for the encoder and decoder, respectively, and (W_f, b_f) and (W_g, b_g) are the corresponding weight matrices and bias terms. Autoencoders can therefore be viewed as supervised learning with MLPs whose target is the input itself, the aim being to obtain the network parameters (W_f, b_f) and (W_g, b_g) that give the lowest reconstruction error L(x, r). As mentioned, autoencoders attempt to obtain a nonlinear transformation, so the activation function for the encoder can be a sigmoid or tanh function. For the decoder, the activation function should be linear if the input domain is unbounded and non-linear otherwise. The loss function is a squared error ||x − r||^2 for an unbounded domain, or a binary cross-entropy loss if the input takes a binary form. By adding a penalty or changing the architectures of the encoder and the decoder, one can obtain regularized autoencoders that cope with different tasks. It is noted that the hidden representations produced by the encoder can be used as the input to other machine learning methods, helping to initialize a network. In addition, autoencoders are often used to compress the input data, also known as dimension reduction, by restricting the number of output neurons of the encoder.

Since MLPs are often employed as the basic framework of autoencoders, it is natural to treat encoding and decoding as function estimation describing the relationship between the original input and the latent variables. We can also define autoencoders from a probabilistic perspective using distributions p_encoder(h | x) and p_decoder(x | h). In this case, the model optimizes the parameters determining the two distributions through maximum likelihood estimation (MLE), i.e., by minimizing the negative log-likelihood −log p_decoder(x | h). For an unbounded input domain, minimizing the negative log-likelihood gives rise to the same parameter estimates W as minimizing the mean squared error.
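To make Eqs. (23.1)–(23.2) concrete, the sketch below implements a minimal one-hidden-layer autoencoder in PyTorch (chosen here purely for illustration; the chapter itself only names Tensorflow, Torch, and Theano as suitable libraries). The layer sizes, learning rate, and random stand-in data are assumptions, not values from this chapter.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: h = a_f(W_f x + b_f), r = a_g(W_g h + b_g)."""
    def __init__(self, n_in=24, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Tanh())  # Eq. (23.1), tanh activation
        self.decoder = nn.Linear(n_hidden, n_in)                            # Eq. (23.2), linear for unbounded inputs

    def forward(self, x):
        h = self.encoder(x)                     # hidden representation h = f(x)
        return self.decoder(h), h               # reconstruction r = g(f(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 24)                        # stand-in for 256 scaled sensor feature vectors
for epoch in range(200):
    r, h = model(x)
    loss = ((r - x) ** 2).mean()                # squared-error reconstruction loss L(x, r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The hidden representation h produced by the encoder can then serve as the input to a downstream classifier, as done in the case study later in this chapter.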

23.3.1.1 Sparse Autoencoder

Though the encoder is often designed to compress the data by reducing the dimensions of the input, assigning a large number of output neurons to the encoder (possibly larger than the input dimension) can also extract useful information, particularly spatial information. This type of autoencoder is known as a sparse autoencoder (SAE), originally implemented by imposing a sparsity constraint on the encoder:

L(x, r) + Ω(h)    (23.3)

FIGURE 23.2 Schematic view of a sparse autoencoder.

Ω(h) is the penalty that introduces sparsity into the encoded representation, as seen in Fig. 23.2. This objective consists of a reconstruction error and a regularizer, a common form shared by the other regularized autoencoders introduced in the following sections. The L1 penalty, or L1 norm, the sum of the absolute values of the encoder outputs h_j, can be used as the constraint Ω(h) in the loss function. Useful though the L1 penalty is in introducing sparsity, studies have shown [23,24] that it underperforms Bayesian methods with sparsity-inducing priors, such as spike-and-slab, Laplace, and Student-t distributions, in terms of prediction performance. Other approaches can also be used to introduce sparsity: Glorot et al. [25] used rectified linear units (ReLU) as the activation function, achieving sparsity in the encoded representations. The main routes to sparse encodings involve activation functions, penalties, and sampling methods. Makhzani and Frey [26] proposed the k-sparse autoencoder, which enforces sparsity in the encoded representation without non-linear activation functions or penalties; k-sparse autoencoders keep only the k highest activities, contributing to a fast encoding stage. With the sparsity penalty, the encoder is forced to generate representations that contain only a few non-zero elements, and the corresponding neurons are viewed as active.
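As a sketch of the objective in Eq. (23.3), the snippet below trains an overcomplete encoder with an L1 penalty on the hidden activations (PyTorch; the sizes, penalty weight, and random stand-in data are illustrative assumptions).

import torch
import torch.nn as nn

# Overcomplete encoder: more hidden units than inputs, kept sparse by an L1 penalty (Eq. (23.3)).
n_in, n_hidden, lam = 24, 64, 1e-3          # illustrative sizes and penalty weight
encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())   # ReLU also encourages sparsity [25]
decoder = nn.Linear(n_hidden, n_in)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(256, n_in)                  # stand-in for scaled sensor features
for epoch in range(200):
    h = encoder(x)
    r = decoder(h)
    loss = ((x - r) ** 2).mean() + lam * h.abs().sum(dim=1).mean()   # L(x, r) + Omega(h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Replacing the L1 term with a hard selection of the k largest activities would give a k-sparse autoencoder in the spirit of Makhzani and Frey [26].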

23.3.1.2 Denoising Autoencoder

Proposed by Vincent et al. [27], the denoising autoencoder (DAE) attempts to obtain robust latent representations by introducing stochastic noise into the original data, corrupting x into x′. The denoising autoencoder then reconstructs the original data from the corrupted input, which helps it discover robust representations and prevents it from simply learning the identity mapping. In fact, Gallinari et al. [28] had already used MLPs to denoise images before the development of the DAE, but their work concerned only denoising, while Vincent's work aimed at producing a robust representation as a by-product of denoising; the noise in a DAE is therefore introduced artificially. During the denoising procedure, the model samples an observation x from the training set and generates a corresponding corrupted x′ according to the corruption process P(x′ | x). Then x′ is encoded so that a hidden representation h is obtained, as seen in Fig. 23.3. It is noted that in a DAE the loss function should be L(x, r) rather than L(x′, r): the core idea is that, in order for the decoder to reconstruct the original uncorrupted input from the corrupted one, the encoder has to generate robust representations.

FIGURE 23.3 Schematic view of a denoising autoencoder.

The stochastic noise can be generated by randomly setting some of the input features to zero [27], or by more complex corruption processes such as additive isotropic Gaussian noise, masking noise, and salt-and-pepper noise [29].

For many unsupervised learning methods, probabilistic modeling and maximum likelihood estimation (MLE) are employed because of their efficiency and consistency. Recently there has been increasing interest in interpreting learning as the construction of an unnormalized energy surface; under this view, encoding is a procedure of finding local minima of that energy surface. Score matching was introduced to learn the parameters θ of a probability distribution p(x; θ) over the configurations of the variables of interest:

p(x; θ) = exp(−E(x, θ)) / Z    (23.4)

where Z is the partition function, a normalizing constant ensuring that p(x; θ) integrates to 1, and E is an energy function, a concept borrowed from physics that lets the model learn desirable properties of a probability distribution. The score is then the gradient field of the log density:

ψ(x; θ) = ∂ log p(x; θ) / ∂x    (23.5)

Similarly to MLE, score matching attempts to learn the parameters θ so that ψ(x; θ) best matches the score of the true distribution. A DAE approximates this score by training with the squared error criterion ∥g(f(x′)) − x∥^2 under the corruption process.
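A minimal sketch of DAE training with masking noise in PyTorch; the corruption probability, layer sizes, and stand-in data are assumed for illustration. Note that the loss compares the reconstruction with the clean input x, not with the corrupted x′.

import torch
import torch.nn as nn

n_in, n_hidden, corrupt_p = 24, 12, 0.3     # illustrative sizes and corruption probability
encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
decoder = nn.Linear(n_hidden, n_in)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(256, n_in)                  # stand-in for clean sensor features
for epoch in range(200):
    mask = (torch.rand_like(x) > corrupt_p).float()
    x_tilde = x * mask                      # masking noise: randomly zero out input features [27]
    r = decoder(encoder(x_tilde))           # reconstruct from the corrupted input
    loss = ((r - x) ** 2).mean()            # L(x, r): the target is the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()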

23.3.1.3 Variational Autoencoder

Based on the more general definition of autoencoders introduced with the DAE, the encoding procedure can be viewed as an inference procedure p_encoder(h | x) and the decoding as a stochastic mapping p_decoder(x | h). This reinterpretation allows autoencoders to produce a probabilistic description of the data through inference (calculating the latent-space variables) and learning (optimizing the parameters), so that many algorithms which can sample from the implicitly learned density function, such as Markov chain Monte Carlo (MCMC) [30] and variational inference [31], can be employed. Proposed by Kingma and Welling [32], the variational autoencoder (VAE) is a deep generative architecture that constructs a probability distribution modeling the latent variables.

One interesting property of the VAE is that it can generate new data by sampling from this distribution, enabling us to observe unknown data points similar to the input data points. A generative model is in fact a quite useful tool for reliability and risk analysis. One reason is that labeling observations might be expensive, and a generative model makes it possible to obtain similar data from the labeled ones. The second reason is that, in a generative model, the modeling, the calculation of the latent distribution, and the optimization of the parameters are modularized, which means that prior knowledge can easily be integrated.

As mentioned, the VAE learns a probabilistic description of the input data through the encoder's inference, which produces an approximation qφ(z | x) to the true posterior pθ(z | x). A metric, the Kullback–Leibler (KL) divergence DKL, is introduced to measure the similarity between qφ(z | x) and pθ(z | x). In order to optimize the whole model (the encoder and the decoder), we need to maximize the marginal likelihood, which is a sum of the log-likelihoods across all data points:

log pθ(x) = L(θ, φ; x) + DKL(qφ(z | x) ∥ pθ(z | x))    (23.6)

Here L(θ, φ; x) is the variational lower bound. Since DKL is non-negative, the variational lower bound can be rewritten as:

L(θ, φ; x) = E_{qφ(z|x)}[log pθ(x | z)] − DKL(qφ(z | x) ∥ pθ(z))    (23.7)

The first term on the right-hand side is the expected reconstruction error, forcing the decoder to reconstruct the data. The second term, the KL divergence between the approximate posterior qφ(z | x) and the prior pθ(z), can be viewed as a regularizer, like those introduced for the SAE and DAE. For each input data point x, the generative model obtains a set of hidden representations z by sampling from qφ(z | x), which is defined by the encoder. The VAE uses a clever approach, the reparameterization trick, to sample z:

z = µ + σ ⊙ ϵ    (23.8)

where µ and σ are the mean and standard deviation of the approximate posterior produced by the encoder, ⊙ is the element-wise product, and ϵ is Gaussian noise, ϵ ∼ N(0, I). This trick rewrites z, a randomly drawn value, as a deterministic function of µ and σ plus external noise. Since the mean and standard deviation are obtained from the encoder's inference, we can backpropagate through the variational lower bound with respect to the parameters θ and φ. In addition to generating new samples and reducing dimensions, the VAE allows us to visualize high-dimensional data in a low-dimensional latent space. We will use this property to discover the structural similarity of the data in the case study section below.
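The sketch below shows the reparameterization trick of Eq. (23.8) and a loss built from the two terms of Eq. (23.7), using the standard closed-form KL divergence between a diagonal Gaussian qφ(z | x) and a N(0, I) prior. The 2D latent space mirrors the visualization used later in this chapter, while the layer sizes and stand-in data are assumptions.

import torch
import torch.nn as nn

class VAE(nn.Module):
    """Gaussian-latent VAE with a 2D latent space for visualization."""
    def __init__(self, n_in=24, n_hidden=16, n_latent=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Tanh())
        self.mu = nn.Linear(n_hidden, n_latent)        # mean of q(z | x)
        self.logvar = nn.Linear(n_hidden, n_latent)    # log variance of q(z | x)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.Tanh(), nn.Linear(n_hidden, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                     # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps         # reparameterization trick, Eq. (23.8)
        return self.dec(z), mu, logvar

def vae_loss(x, r, mu, logvar):
    recon = ((r - x) ** 2).sum(dim=1)                  # expected reconstruction error (Gaussian decoder)
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)   # D_KL(q(z|x) || N(0, I))
    return (recon + kl).mean()                         # minimizing this maximizes the lower bound

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 24)                               # stand-in for scaled sensor features
for epoch in range(200):
    r, mu, logvar = model(x)
    loss = vae_loss(x, r, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

After training, the 2D means µ can be plotted directly to visualize the latent structure of the input data, as in Fig. 23.5.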

23.3.1.4 Advanced Autoencoders

For many real-world problems, the data may exhibit strong spatial or temporal correlations. For example, image processing and object recognition tasks involve spatial information, but conventional autoencoders are not designed to capture this 2D image structure, or to represent dynamic temporal relationships. For this reason, advanced autoencoders that combine with other deep architectures, such as the convolutional neural network (CNN) [33,34] and the recurrent neural network (RNN) [35,36], have been developed to deal with such data sets. Masci et al. [34] developed convolutional autoencoders, whose weights are shared among all locations in the input so that spatial locality is preserved. In addition, a max-pooling layer, inspired by recent biological investigations, was implemented to obtain translation-invariant features. The stacked convolutional autoencoders were then used to pretrain a CNN, showing superior performance on benchmark data sets such as MNIST and CIFAR-10. Srivastava et al. [37] used a long short-term memory (LSTM) autoencoder to learn fixed-length encoded representations from video sequences. These representations can improve classification accuracy with only a small number of samples, and the LSTM decoder can continue to produce motion beyond the time scales it was trained on. The applicability of autoencoders can therefore be greatly extended if a corresponding strategy for the input data is employed.
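As an illustration of the convolutional variant (a generic sketch, not the exact architecture of Masci et al. [34]), the PyTorch snippet below pairs a small convolutional encoder with max-pooling and a transposed-convolution decoder; the channel counts and 28x28 input size are assumptions.

import torch
import torch.nn as nn

# Toy convolutional autoencoder for 28x28 single-channel inputs (illustrative).
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                            # 28x28 -> 14x14, translation-invariant pooling
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                            # 14x14 -> 7x7
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2), nn.ReLU(),   # 7x7 -> 14x14
    nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2),               # 14x14 -> 28x28
)

x = torch.randn(32, 1, 28, 28)                  # stand-in for a batch of images
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for epoch in range(10):
    r = decoder(encoder(x))
    loss = ((r - x) ** 2).mean()                # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()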

23.3.1.5 Stacked Autoencoders

The autoencoders described above contain only one encoder and one decoder. However, it is possible to build a deep autoencoder, which brings several advantages. By stacking multiple layers for encoding and a final output layer for decoding, a stacked autoencoder, or deep autoencoder, is obtained; Fig. 23.4 shows a stacked autoencoder with three encoding layers and one decoding layer.

FIGURE 23.4 Schematic view of a deep autoencoder.

Experiments [17] have shown that a deep architecture can exponentially reduce the computational cost and the amount of training data required, and the representations generated by a deep autoencoder are relatively robust and useful compared with those of a shallow autoencoder. Paul and Venkatasubramanian [38] studied this from the perspective of group theory, establishing a connection between deeper representations and the orbit-stabilizer interplay in group actions. Recently, a study [39] suggested that deep architectures may implement a generalized scheme resembling the renormalization group, a technique from theoretical physics for extracting relevant features of a system across scales. Although the development of deep architectures is fast and encouraging, the understanding of the mechanism behind their success is limited, and more theoretical and conceptual work is still needed.
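A minimal deep (stacked) autoencoder in the spirit of Fig. 23.4, with three encoding layers and a single decoding layer; the layer widths are illustrative assumptions, and the network is trained with the same reconstruction loss as the shallow autoencoder above.

import torch.nn as nn

# Three encoding layers progressively compress the input; one linear layer decodes (cf. Fig. 23.4).
stacked_autoencoder = nn.Sequential(
    nn.Linear(24, 16), nn.Tanh(),   # encoding layer 1
    nn.Linear(16, 8), nn.Tanh(),    # encoding layer 2
    nn.Linear(8, 4), nn.Tanh(),     # encoding layer 3 (bottleneck representation)
    nn.Linear(4, 24),               # decoding layer
)

In practice, each encoding layer can also be pretrained greedily as a shallow autoencoder and the whole stack fine-tuned end to end afterwards.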

23.3.2 HYPERPARAMETER OPTIMIZATION

So far there is no general rule guiding the tuning of hyperparameters for the deep learning models discussed above. In addition, some deep learning models involve many hyperparameters, making tuning difficult because of the large computational cost. In general, the adjustment of hyperparameters depends on the corresponding prediction performance on out-of-sample data, so cross-validation is often recommended for estimating the generalization error. Bengio [40] has introduced some tricks for adjusting the hyperparameters; however, for a considerable number of hyperparameters, the optimal values are data-dependent [41]. In addition, one can consider automated search methods such as Bayesian optimization [42], or nature-inspired algorithms such as the genetic algorithm [43] and particle swarm optimization [44]. Recently, deep reinforcement learning has also been reported to accelerate the training of deep neural networks [45]. Whatever the algorithm, GPUs are highly recommended for large deep learning frameworks: they can speed up training significantly thanks to their ability to process vector and matrix operations in parallel. Hence many deep learning libraries, such as Tensorflow, Torch, and Theano, support GPU-accelerated computation.
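A minimal random-search sketch over two hyperparameters of the shallow autoencoder (hidden size and learning rate), scored by reconstruction error on a held-out split; the search space, budget, and stand-in data are assumptions, and k-fold cross-validation or Bayesian optimization [42] could replace the simple hold-out used here.

import random
import torch
import torch.nn as nn

def train_and_score(x_train, x_val, n_hidden, lr, epochs=100):
    """Train a one-layer autoencoder and return its validation reconstruction error."""
    n_in = x_train.shape[1]
    model = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Tanh(), nn.Linear(n_hidden, n_in))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((model(x_train) - x_train) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return ((model(x_val) - x_val) ** 2).mean().item()

x = torch.randn(500, 24)                        # stand-in for scaled sensor features
x_train, x_val = x[:350], x[350:]
best = None
for _ in range(20):                             # random-search budget (assumed)
    n_hidden = random.choice([4, 8, 16, 32])
    lr = 10 ** random.uniform(-4, -2)
    score = train_and_score(x_train, x_val, n_hidden, lr)
    if best is None or score < best[0]:
        best = (score, n_hidden, lr)
print("best validation error %.4f with n_hidden=%d, lr=%.1e" % best)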

23.4 CASE STUDY

The three autoencoders introduced above, SAE, DAE, and VAE, were applied to four data sets consisting of multivariate time-series signals obtained from an aircraft engine run-to-failure simulation [46]. Each time series can be viewed as the signals from a different engine, beginning with an unknown degree of initial wear and manufacturing variation. This initial wear and variation is considered normal and only affects the length of the run-to-failure series for each engine. The engine operates normally at the beginning of each series; at some point a fault develops and grows in magnitude until the engine stops running. The failure time of each simulated engine is unique and is recorded as the total number of operating cycles. The four data sets, FD001, FD002, FD003, and FD004, have different operating conditions and fault modes. Each time series includes three operational settings and measurements from 21 sensors; the 21 sensor signals are contaminated with measurement noise.

The aim of this case study is to predict the state of the engine given the operational settings and sensory information. The three autoencoders, SAE, DAE, and VAE, are employed for feature reconstruction. The original features comprise multiple sensor variables, and domain-dependent knowledge would be needed to conduct reliability and risk analysis with physical models [47]. SAE and DAE have shown encouraging feature learning ability in various machine learning tasks [48]; however, both need high-dimensional hidden layers to boost reconstruction performance [40]. To visualize the hidden representations in a low-dimensional space, the VAE is used: it allows the reconstruction process to be conducted in a two-dimensional latent space via the reparameterization trick, which can help us gain insight into the inner structure of the original data. Each of the reconstructed representations is then fed into a two-layer fully-connected neural network to classify the health state of the engines. To compare the prediction performance with and without feature reconstruction, a three-layer fully-connected ANN is also employed; this ANN uses the scaled original data as its input, and no feature reconstruction is involved.
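The sketch below shows how such a data set might be loaded and scaled, assuming the commonly distributed format of the turbofan simulation data [46]: whitespace-separated text files with one row per cycle and columns for the unit id, cycle number, three operational settings, and 21 sensor readings. The file name and column layout are assumptions, not taken from this chapter.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed column layout: unit id, cycle number, 3 settings, 21 sensors (26 columns in total).
cols = ["unit", "cycle", "set1", "set2", "set3"] + ["s%d" % i for i in range(1, 22)]
df = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None)
df = df.iloc[:, :26]                    # keep the 26 expected columns in case of trailing whitespace
df.columns = cols

# Scale the settings and sensor channels to [0, 1] before feeding them to an autoencoder.
features = cols[2:]
df[features] = MinMaxScaler().fit_transform(df[features])

print(df.groupby("unit")["cycle"].max().describe())   # run-to-failure length per engine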

23.4.1 STATE DEFINITION

The four data sets are multivariate time series without labels. Some authors [49,50] have proposed that the first X data points of each series be labeled as healthy states and the last Y data points as degraded states. Manual labeling seems a reasonable choice, since supervised learning can then be conducted. However, it is difficult for a less experienced engineer to label the data (i.e., to select values or ratios for X and Y). In addition, complex temporal correlations exist in time-series data, adding difficulty to the classification task. Moreover, only two states (healthy and degraded) may be too simplistic for reliability and risk analysis problems, since some cases involve more complex states that vary in extent. A physically correct state definition should take the degradation mechanism into account, which is not within the scope of this study.

The purpose of this case study is to classify the state, especially the degraded state, in order to avoid potential catastrophic failure. We assume that some action should be taken once the degraded state of a system is detected, incurring maintenance costs and a decrease in yield; if no action is taken, an unexpected failure may happen, resulting in a bigger loss. The engineer is expected to take all these costs into account and obtain optimized values of X and Y, or an even more complex state labeling. In this study, we are only concerned with classification tasks that involve two states (healthy and degraded) or three states (healthy, degrading, and degraded).

23.4.2 DATA PREPROCESSING

Two methods were adopted to label the data. Method 1 assumes that the first 20% of observations in each series represent a healthy state and the last 20% a degraded state. Method 2 adopts the same ratios for the healthy and degraded states, except that the middle 20% (between 40% and 60%) is labeled as a degrading state. Each data set was split into a training set (70%) and a test set (30%); the former was used for constructing the model and the latter for testing its performance. In addition, 10-fold cross-validation was conducted to obtain the optimal parameters during training: nine folds were used to build a model and the remaining fold was used to evaluate its predictive performance. Bayesian optimization [42] was adopted to adjust the hyperparameters.
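A sketch of the two labeling schemes, applied per engine; the state encoding (0 = healthy, 1 = degrading, 2 = degraded) and the toy data frame are assumptions for illustration, and in practice df would be the scaled sensor frame from the loading sketch above.

import numpy as np
import pandas as pd

def label_states(df, three_states=False):
    """Label each engine's cycles: first 20% healthy (0), last 20% degraded (2),
    and, for Method 2, the middle 20% (40%-60%) degrading (1); other cycles are dropped."""
    df = df.sort_values(["unit", "cycle"]).reset_index(drop=True)
    labels = np.full(len(df), -1)                        # -1 marks unlabeled cycles
    for _, pos in df.groupby("unit").groups.items():
        pos = np.asarray(pos)                            # positional indices for this engine
        n = len(pos)
        labels[pos[: int(0.2 * n)]] = 0                  # healthy state
        labels[pos[int(0.8 * n):]] = 2                   # degraded state
        if three_states:
            labels[pos[int(0.4 * n): int(0.6 * n)]] = 1  # degrading state
    df["state"] = labels
    return df[df["state"] >= 0]

# Toy stand-in: two engines with 100 cycles each.
df = pd.DataFrame({"unit": np.repeat([1, 2], 100), "cycle": np.tile(np.arange(1, 101), 2)})
labeled_2 = label_states(df)                     # Method 1: two states
labeled_3 = label_states(df, three_states=True)  # Method 2: three states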

23.5 RESULTS AND DISCUSSION

Three classification metrics, accuracy, precision, and the area under the receiver operating characteristic curve (AUC), were calculated for the four test sets, as shown in Tables 23.1–23.4. All the deep learning models achieved better results on the 2-state data sets than on the 3-state data sets. For SAE and DAE, the overall accuracy dropped from around 0.9 to around 0.7 when the number of states increased from 2 to 3, while the precision remained basically at the same level. This means that the models were sensitive to the degraded states, which is desirable, since we hope that degraded states can be detected as early as possible to avoid the catastrophic failures they may lead to. Both SAE and DAE show better classification performance than VAE and ANN, which suggests that a high-dimensional latent space is useful for feature reconstruction.

To explain the drop in accuracy on the 3-state data sets, we applied the VAE to the FD001 data set, obtained the mean (µ) and log standard deviation (log σ) of the 2D hidden distribution, and plotted the 2D representations of the input features, as seen in Fig. 23.5.

Table 23.1 FD001 Classification Results

Methods   Two States                      Three States
          Accuracy  Precision  AUC        Accuracy  Precision  AUC
SAE       98.92     99.14      99.39      76.56     96.97      90.78
DAE       98.96     99.22      99.62      76.30     96.15      90.38
VAE       98.44     99.13      99.89      70.07     95.21      85.98
ANN       98.53     99.21      99.32      68.54     95.34      90.11

Table 23.2 FD002 Classification Results

Methods   Two States                      Three States
          Accuracy  Precision  AUC        Accuracy  Precision  AUC
SAE       98.42     99.17      99.89      74.50     93.82      89.75
DAE       96.20     96.62      99.71      71.28     90.42      88.03
VAE       89.56     91.40      89.54      66.69     85.43      85.67
ANN       95.34     96.13      95.76      70.65     89.54      88.03

Table 23.3 FD003 Classification Results

Methods   Two States                      Three States
          Accuracy  Precision  AUC        Accuracy  Precision  AUC
SAE       99.47     99.74      99.89      78.54     98.69      92.37
DAE       99.57     99.67      99.99      80.00     97.79      92.67
VAE       93.30     96.12      97.27      73.66     97.61      88.76
ANN       93.12     96.45      96.87      78.27     96.57      89.25

It is clear that in the latent space of the 2-state data set the two states are disentangled, whereas for the 3-state data set the degrading state is entangled with the healthy state, which means the two states are structurally similar. One important result the VAE uncovered is that the states are not distributed evenly in the time-series data. By defining the first 20% of data points as one state (healthy) and the last 20% as the other state (degraded), our models can classify them well; but adding the middle 20% of observations as a third state (degrading) weakened the classification performance of the model, because the degrading state and the healthy state appear to come from the same cluster. Based on the latent feature representations produced by the VAE, degradation might occur and accumulate quickly, leading to a sudden failure of an engine. By showing the latent distribution of the input data, the VAE can thus help us gain insight into the failure mechanism of this system.
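For reference, the three reported metrics can be computed with scikit-learn as in the sketch below (binary, two-state case; the label and probability arrays are hypothetical placeholders for the test-set outputs of the classifier).

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

# Hypothetical test-set states (0 = healthy, 1 = degraded) and predicted degraded-class probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.7, 0.6, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))

For the three-state case, precision_score takes an averaging option (e.g., average='weighted') and roc_auc_score accepts per-class probabilities with multi_class='ovr'.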

Table 23.4 FD004 Classification Results

Methods   Two States                      Three States
          Accuracy  Precision  AUC        Accuracy  Precision  AUC
SAE       96.74     98.11      99.53      78.34     99.33      91.93
DAE       97.18     96.58      99.70      79.16     99.01      92.16
VAE       91.67     92.83      91.67      61.30     82.93      81.04
ANN       96.38     97.17      98.42      76.84     89.29      90.16

FIGURE 23.5 The 2D visualization of the VAE-encoded representations of (A) the 2-label data set and (B) the 3-label data set. The color of each cluster denotes one state.

23.6 CONCLUSIONS

In this study, we reviewed the application of machine learning to reliability and risk analysis. Autoencoders, an unsupervised deep learning framework producing latent feature representations, are believed to be a useful tool in this domain. We introduced three types of autoencoders, SAE, DAE, and VAE, mainly from a probabilistic perspective. We then applied the three autoencoders to four data sets consisting of multivariate time-series signals obtained from an aircraft engine run-to-failure simulation. We trained the three autoencoders to classify the state of an engine from the corresponding sensory information and obtained good classification performance on the 2-state data sets. From the VAE's visualization of the latent feature representations, we conclude that the states are not distributed evenly in the time series. We believe that autoencoders can be extended by combining them with different deep learning architectures, and that applying unsupervised learning could help extract valuable information about the true failure mechanism. In our future work, we plan to combine autoencoders with CNNs and RNNs to discover more complex spatial and temporal relationships in reliability and risk analysis.

REFERENCES

[1] J.F. Murray, G.F. Hughes, K. Kreutz-Delgado, Machine learning methods for predicting failures in hard drives: a multiple-instance application, J. Mach. Learn. Res. 6 (2005) 783–816.
[2] B. Saha, et al., A Bayesian framework for remaining useful life estimation, in: Proceedings Fall AAAI Symposium: AI for Prognostics, Arlington, 2007.


[3] M.A.H. Dempster, V. Leemans, An automated FX trading system using adaptive reinforcement learning, Expert Syst. Appl. 30 (3) (2006) 543–552.
[4] J. Shi, J. Wang, D.D. Macdonald, Prediction of crack growth rate in type 304 stainless steel using artificial neural networks and the coupled environment fracture model, Corros. Sci. 89 (2014) 69–80.
[5] M. Kamrunnahar, M. Urquidi-Macdonald, Prediction of corrosion behaviour of alloy 22 using neural network as a data mining tool, Corros. Sci. 53 (3) (2011) 961–967.
[6] M. Jiménez-Come, I. Turias, F. Trujillo, An automatic pitting corrosion detection approach for 316L stainless steel, Mater. Des. 56 (2014) 642–648.
[7] D. Kim, et al., Machine learning-based novelty detection for faulty wafer detection in semiconductor manufacturing, Expert Syst. Appl. 39 (4) (2012) 4075–4083.
[8] P. Malhotra, et al., Long short term memory networks for anomaly detection in time series, in: Proceedings, 2015.
[9] J.R. Celaya, et al., Prognostics approach for power MOSFET under thermal-stress aging, in: Reliability and Maintainability Symposium (RAMS), 2012 Proceedings-Annual, 2012.
[10] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[11] M. Rezaeian, A. Gruen, Automatic 3D building extraction from aerial and space images for earthquake risk management, Georisk 5 (1) (2011) 77–96.
[12] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[13] S. Nešic, et al., A mechanistic model for carbon dioxide corrosion of mild steel in the presence of protective iron carbonate films–Part 2: a numerical experiment, Corrosion 59 (6) (2003) 489–497.
[14] S. Nesic, J. Postlethwaite, S. Olsen, An electrochemical model for prediction of corrosion of mild steel in aqueous carbon dioxide solutions, Corrosion 52 (4) (1996) 280–294.
[15] A. Der Kiureghian, First- and second-order reliability methods, in: Engineering Design Reliability Handbook, 2005, p. 14-1.
[16] M.A. Kramer, B. Palowitch, A rule-based approach to fault diagnosis using the signed directed graph, AIChE J. 33 (7) (1987) 1067–1078.
[17] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, 2016.
[18] D. Cireşan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image classification, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[19] L. Deng, et al., Recent advances in deep learning for speech research at Microsoft, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[20] R. Collobert, J. Weston, A unified architecture for natural language processing: deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, 2008.
[21] C.R. Turner, et al., A conceptual basis for feature engineering, J. Syst. Softw. 49 (1) (1999) 3–15.
[22] Y. Xu, et al., Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries, J. Am. Med. Inform. Assoc. 19 (5) (2012) 824–832.
[23] S. Mohamed, K. Heller, Z. Ghahramani, Bayesian and L1 approaches to sparse unsupervised learning, arXiv:1106.1157, 2011.
[24] B.A. Olshausen, D.J. Field, Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vis. Res. 37 (23) (1997) 3311–3325.
[25] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: International Conference on Artificial Intelligence and Statistics, 2011.
[26] A. Makhzani, B. Frey, k-sparse autoencoders, arXiv:1312.5663, 2013.
[27] P. Vincent, et al., Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008.
[28] P. Gallinari, et al., Mémoires associatives distribuées: une comparaison (distributed associative memories: a comparison), in: Cesta-Afcet, 1987.
[29] P. Vincent, et al., Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (2010) 3371–3408.
[30] G. Alain, Y. Bengio, What regularized auto-encoders learn from the data-generating distribution, J. Mach. Learn. Res. 15 (1) (2014) 3563–3593.
[31] D.J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv:1401.4082, 2014.
[32] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv:1312.6114, 2013.
[33] B. Leng, et al., 3D object retrieval with stacked local convolutional autoencoder, Signal Process. 112 (2015) 119–128.


[34] J. Masci, et al., Stacked convolutional auto-encoders for hierarchical feature extraction, in: International Conference on Artificial Neural Networks, 2011.
[35] J.T. Rolfe, Y. LeCun, Discriminative recurrent sparse auto-encoders, arXiv:1301.3775, 2013.
[36] F. Weninger, et al., Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[37] N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, arXiv:1502.04681, 2015.
[38] A. Paul, S. Venkatasubramanian, Why does deep learning work? A perspective from group theory, arXiv:1412.6621, 2014.
[39] P. Mehta, D.J. Schwab, An exact mapping between the variational renormalization group and deep learning, arXiv:1410.3831, 2014.
[40] Y. Bengio, Practical recommendations for gradient-based training of deep architectures, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 437–478.
[41] H. Larochelle, et al., Exploring strategies for training deep neural networks, J. Mach. Learn. Res. 10 (2009) 1–40.
[42] G.C. Cawley, N.L.C. Talbot, Preventing over-fitting during model selection via Bayesian regularisation of the hyperparameters, J. Mach. Learn. Res. 8 (2007) 841–861.
[43] F. Friedrichs, C. Igel, Evolutionary tuning of multiple SVM parameters, Neurocomputing 64 (2005) 107–117.
[44] R. Liao, et al., Particle swarm optimization-least squares support vector regression based forecasting model on dissolved gases in oil-filled power transformers, Electr. Power Syst. Res. 81 (12) (2011) 2074–2080.
[45] J. Fu, et al., Deep Q-networks for accelerating the training of deep neural networks, arXiv:1606.01467, 2016.
[46] A. Saxena, K. Goebel, Turbofan Engine Degradation Simulation Data Set, NASA Ames Prognostics Data Repository, 2008.
[47] A. Saxena, et al., Damage propagation modeling for aircraft engine run-to-failure simulation, in: International Conference on Prognostics and Health Management, PHM 2008, IEEE, 2008.
[48] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[49] P. Tamilselvan, P. Wang, Failure diagnosis using deep belief learning based health state classification, Reliab. Eng. Syst. Saf. 115 (2013) 124–135.
[50] F.O. Heimes, Recurrent neural networks for remaining useful life estimation, in: International Conference on Prognostics and Health Management, PHM 2008, 2008.