Bayesian Modeling of External Corrosion in Underground Pipelines ...

Computer-Aided Civil and Infrastructure Engineering 30 (2015) 300–316

Bayesian Modeling of External Corrosion in Underground Pipelines Based on the Integration of Markov Chain Monte Carlo Techniques and Clustered Inspection Data

Hui Wang, Ayako Yajima & Robert Y. Liang*
Department of Civil Engineering, The University of Akron, Akron, OH, USA

& Homero Castaneda
Department of Chemical and Biomolecular Engineering, National Center for Research in Corrosion and Materials Performance, The University of Akron, Akron, OH, USA

Abstract: In this study, a model is developed to assess external corrosion in buried pipelines by unifying a Bayesian inferential structure, built on Markov chain Monte Carlo techniques, with clustered inspection data. The proposed stochastic model combines clustering algorithms, which ascertain the similarity of corrosion defects, with Monte Carlo simulation, which gives an accurate estimate of the probability density function of the corrosion rate. The metal loss rate is chosen as the indicator of corrosion damage propagation and is assumed to obey a generalized extreme value (GEV) distribution. Bayesian theory is employed to update the probability distribution of the metal loss rate as well as the GEV parameters in order to account for model uncertainty. The proposed model was validated with direct and indirect inspection data extracted from a 110-km buried pipeline system.

*To whom correspondence should be addressed. E-mail: rliang@uakron.edu.
© 2014 Computer-Aided Civil and Infrastructure Engineering. DOI: 10.1111/mice.12096

1 INTRODUCTION

Steel pipelines are widely used for oil transportation in the petroleum industry. Usually, these metallic longitudinal structures are buried underground. Because they are directly exposed to aggressive soil and underwater environments, different protection systems are added to the structure: one system uses direct protection by incorporating physical barriers (such as coatings), and a second employs an external source (such as cathodic protection). These systems can be classified as preventative or as mitigation measures. Unfortunately, neither of these protection systems is completely reliable for safeguarding an operating pipeline, as various circumstances can decrease their performance and produce risk factors leading to failure conditions (Abes et al., 1985). Excavation damage to the pipeline, fouling of coatings, or electrochemically triggered processes that cause coating failure can initiate corrosion. Cathodic protection is an external source that is a function of the environment (soil) and can contribute to damage evolution of the coating at any particular location. The corrosion process can be activated under protection conditions depending on the environment and the degree of damage. Hence, external localized corrosion is one of the most common defects that occur under normal operating conditions of a pipeline (Peabody et al., 2001). The severity of the external corrosion process can be evaluated by quantifying the rate of local damage propagation. A number of studies (DeWaard
and Milliams, 1975; Engelhardt et al., 1999; Anderko et al., 2001) have developed models to estimate the corrosion rate using basic variables such as pH, pressure, temperature, and concentration of ions (i.e., Cl⁻, SO₄²⁻, OH⁻). The limitation of these models is that they are only valid under laboratory conditions. Melchers (2003) developed a phenomenological model for general corrosion of mild and low-alloy steels under fully aerated conditions in a marine environment. For in situ or noncontrolled situations, however, there is a lack of knowledge about all the factors contributing to corrosion propagation. In addition, the pipeline is exposed to a dynamic aggressive environment during the years of its operation, and the factors affecting the corrosion process change continuously over time. From a dynamic (time-dependent) perspective, the corrosion rate must be considered as a random variable under field conditions, and it can be predicted from the perspective of probability (Shibata, 1996; Alamilla and Sosa, 2008; Caleyo et al., 2009b). On the other hand, at sites where the inspection or monitoring tools indicate that the corrosion rate is extremely high, the parameters associated with the corrosion process are difficult to quantify deterministically. Furthermore, it is relatively difficult (and sometimes impossible) to predict the extreme value using a traditional physical model. Hence, uncertainties should be given significant attention when estimating the corrosion propagation. To the best of the authors' knowledge, only a few stochastic models can be used to predict the external corrosion rate in underground pipelines in terms of probability and to perform risk analysis (Hong, 1999; Sinha and Pandey, 2002; Kiefner and Kolovich, 2007; Race et al., 2007; Alamilla and Sosa, 2008; Caleyo et al., 2009a).

Hong (1999) took the generation of new defects into consideration in the reliability estimation and modeled the corrosion effect on the pipeline by a Markov process. Alamilla and Sosa (2008) proposed a stochastic method that contains four models to estimate the corrosion propagation rate. The probability distributions of the corrosion depth and corrosion rate are derived analytically, based on the empirical distribution of the corrosion depth, the number of corrosion defects, and the distribution of the starting time. Limitations of these models include the lack of consideration of changes in soil properties due to environmental changes and the lack of evidence for the distribution of the starting time. Kiefner and Kolovich (2007) developed a Monte Carlo approach for estimating the probability distribution of the corrosion rate. In order to produce the probability density function (PDF) of the corrosion rate, samples were drawn from the depth and starting time distributions to perform the Monte Carlo simulation. The results were very sensitive to the distribution of the corrosion
starting time. Caleyo et al. (2009a) developed the probability distribution of pitting corrosion depth and rate in underground pipelines using Monte Carlo simulation. The investigation was based on a field study (Velazquez et al., 2009). The corrosion rate formula derived from the field study is a power law formulation of local environmental factors. The statistical characteristics of the environmental factors, which were used as the inputs for the Monte Carlo simulation, were obtained from soil determinations. The limitation of this model is that the assumed linear relation between the soil properties and the parameters in the power law formulation could be too strong.

Technically, the pit depth or metallic indications at a given time can be measured by in-line inspection (ILI) devices or can be estimated from direct observation of a high number of excavations. The measurement of corrosion defects may be imprecise due to imperfect performance of the operator and device as well as noise affecting the measured signals (Maes et al., 2009; Sahraoui et al., 2013). The soil properties (half-cell potential, resistivity, pH, concentration of ions, and soil moisture) can be measured via analytical and titration methods by taking in situ samples along the pipeline at equal intervals. Hence, the information on the soil environment is spatially discrete.

In this work, we employ sampling, characterization, and analysis of soil properties along the right-of-way of an oil pipeline. For purposes of data analysis, the pipeline is divided into equal segments so that the physical and chemical properties are position-dependent. The corrosion rate is considered to be under steady-state conditions. Each segment has a set of variables, referred to as features, that contain soil properties and operation conditions. All the features form a high-dimensional feature space; within the feature space, each segment is represented by a vector.

Clustering techniques should be employed to classify the segments, as the segments within each cluster will have high similarity in feature space. The similarity can be interpreted by a measure of distance (Ghosh-Dastidar and Adeli, 2003; Jain, 2010), a probability density (Fraley and Raftery, 1998; Ahmadlou and Adeli, 2010; Kodogiannis et al., 2013), or a graph (Bello-Orgaz et al., 2012; Menéndez et al., 2014). Because of the high similarity, we can easily find the statistical properties of each cluster and estimate the probability distribution of the corrosion rate within each cluster; this is the practical advantage of the present work. Some of the most recent clustering techniques are well reported in the literature (Kodogiannis et al., 2013; Peng and Ouyang, 2013; Rizzi et al., 2013; Menéndez et al., 2014). In this article, we propose a methodology based on clustering of the segments followed by the estimation
of the probability distribution parameters of the corrosion rate within each cluster using a Bayesian inferential framework. We obtain the boundary distribution of the metal loss rate for each cluster, then generalize the Bayesian approach to estimate the probability distribution of the corrosion rate at a specific site. The proposed method is able to interpret model uncertainty, which allows the upper and lower boundary distributions of the corrosion rate to be estimated. By integrating a clustering algorithm with Markov chain Monte Carlo (MCMC) techniques, the proposed method can estimate the corrosion rate at a specified location along the pipeline, which is its methodological advantage. The proposed method is developed using validated in-line inspection data and in situ soil property sampling and quantification every 200 m along a 110-km pipeline. The modeling flowchart is shown in Figure 1.

Fig. 1. Flowchart of modeling procedure. [Flowchart boxes: Start; Clustering Process; Bayesian Inferential Framework (MCMC), GEV parameter estimation of each cluster; Site feature; PDF of metal loss rate at a certain location along pipeline; End.]

2 MODEL FRAMEWORK

2.1 Indicator of corrosion damage propagation

As is well known from both experiment (Szklarska-Smialowska, 1986) and theory (Engelhardt et al., 1999, 2003, 2004) for localized corrosion, the defect characteristic dimension (e.g., the depth of the corrosion pitting) can be expressed by Equation (1)

$$D = k t^{\alpha} \tag{1}$$

where k and α are empirical constants (usually, α < 1) and t is the time. Based on Equation (1), the corrosion velocity is given by

$$V = \frac{dD}{dt} = k \alpha t^{\alpha - 1} \tag{2}$$

In general, this expression shows the nonlinear trend of the corrosion damage propagation: the corrosion rate is relatively high at the beginning of the damage process and attenuates gradually as the service age increases. However, this function has a nonphysical limit:

$$V = \frac{dD}{dt} = k \alpha t^{\alpha - 1} \to \infty, \quad \text{as } t \to 0 \tag{3}$$

Such a high corrosion rate is impossible. Instead of Equation (2), an equation proposed by Engelhardt and Macdonald (2004) can be applied to estimate the corrosion rate:

$$V = \frac{dD}{dt} = V_0 \left(1 + \frac{t}{t_0}\right)^{\alpha - 1} \tag{4}$$

where V0 is the initial corrosion rate, and t0 is a constant that affects the time interval for reaching a stable corrosion rate. The corrosion behavior reflected in Equation (4) considers the early stages of corrosion, when the metal is exposed to a relatively high concentration of corrosive species. As the exposure time increases, an oxide film forms on the corrosion defect surface, which can serve as a protective barrier against further corrosion (Bockris and Reddy, 1970). By assuming the corrosion product to be a stable passive layer, we can consider the corrosion rate to be constant following the formation of this layer.

Under operating conditions, the depth and location of an indication of possible corrosion can be measured and determined via magnetic flux leakage (MFL) or ultrasonic tools (UTs). It is highly impractical to apply this technology to obtain the initial corrosion velocity V0, since it is very costly to perform a direct inspection, and inspections are typically scheduled at different time intervals. Theoretically, by using the measured corrosion depth D at inspection interval t, the average corrosion velocity can be estimated as

$$V = \frac{D}{t} \tag{5}$$

One of the problems in using Equation (5) to obtain the corrosion rate is that we need to identify the same defect in two consecutive inspection results. This identification is relatively difficult, as current technology is not accurate enough to give an exact match of a corrosion defect in two consecutive inspections (Westwood and Hopkins, 2004).
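The difference between the power-law rate of Equation (2), the bounded rate of Equation (4), and the average rate of Equation (5) can be checked numerically. The following is a minimal sketch; the function names and the parameter values (K, ALPHA, V0, T0) are illustrative assumptions and are not taken from the paper.

```python
def power_law_rate(t, k, alpha):
    """Equation (2): V = k * alpha * t**(alpha - 1); diverges as t -> 0 when alpha < 1."""
    return k * alpha * t ** (alpha - 1.0)


def bounded_rate(t, v0, t0, alpha):
    """Equation (4): V = V0 * (1 + t / t0)**(alpha - 1); equals V0 at t = 0."""
    return v0 * (1.0 + t / t0) ** (alpha - 1.0)


def average_rate(depth, interval):
    """Equation (5): average velocity from a measured depth and an inspection interval."""
    return depth / interval


# Hypothetical parameter values, chosen only to illustrate the shapes of the curves.
K, ALPHA, V0, T0 = 0.5, 0.5, 0.4, 2.0

for t in (0.01, 1.0, 10.0, 30.0):
    print(t, power_law_rate(t, K, ALPHA), bounded_rate(t, V0, T0, ALPHA))
```

Both curves decay with time, but only the form of Equation (4) stays finite at t = 0, which is the property the text relies on.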

A logical way to obtain the corrosion rate is to use the inspection data and the service age, since corrosion is a dynamically stable process if there are no sudden changes that radically affect the factors influencing process kinetics. Also, the localized critical pit depth (e.g., the wall thickness of a pipeline) and the typical service life impose a significant restriction on the initial and average corrosion current (Engelhardt and Macdonald, 2004). Hence, considering the corrosion process as stable and relatively slow, we can use the metal loss rate as an indicator of corrosion damage propagation, which is defined by Equation (6)

$$r = \frac{D(t)}{t} \tag{6}$$

where D(t) is the inspection data (the corrosion depth of the defect) and t is the service life (from installation to the time of inspection before replacement) of the pipeline segment. Based on the measurement profile of the defects from in-line inspections and the service life obtained from the maintenance history along a 110-km pipeline, we calculated the metal loss rate at each defect using Equation (6) and plotted the trend in metal loss rate as the pipeline ages in Figure 2. Figure 2 shows the correlation between metal loss rate and service life. From the inspection profile, we notice that the trend agrees with Equation (4), with a high corrosion rate at the beginning and a slowing due to the formation of a passive oxide film. As can be seen in Figure 2, after 29 years, the metal loss rate is nearly constant. For the convenience of statistical analysis and to incorporate as many samples as possible into the stochastic analysis, we consider the metal loss rate to be constant after 29 years.

Fig. 2. The metal loss rate calculated by Equation (6) for entire 110-km length of pipeline. [Plot of metal loss rate (mm/year) against service life (year), with a fitted mean line.]

2.2 Clustering algorithm

The inspection data (metal loss rate) are clustered into groups based on similarity in environmental information, which contains several critical parameters, by using the Gaussian mixture model (GMM) to extract underlying patterns in the feature space (where all the critical parameters form a feature space) and to improve the accuracy of the estimation and the prediction of the rate of metal loss. To avoid a natural clustering problem (Pang-Ning et al., 2006), we adopted expert opinions to assist in choosing the relevant corrosion factors. In this work, there are nine corrosion factors: distance from the nearest cathodic protection station to the defect, half-cell potential, soil resistivity, pH, four different ion concentrations (CO₃²⁻, HCO₃⁻, Cl⁻, SO₄²⁻), and soil moisture. The GMM is fitted with the expectation-maximization (EM) algorithm, which is a method for finding maximum likelihood estimates of the parameters in the presence of hidden variables (Ben-Hur et al., 2002; Fraley and Raftery, 2002). The result is compared with clusters from the K-means algorithm described by Jain (2010).

The difficulty in using clustering models lies in choosing a number of clusters that gives a large dissimilarity between clusters while achieving a large similarity within each cluster. Several approaches to model selection have been studied, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) (Fraley and Raftery, 1998), neural network methods (Lam et al., 2006; Adeli and Panakkat, 2009), and a Bayesian approach combined with Monte Carlo integration (Yuen and Mu, 2011). The BIC is more suitable for high-dimensional data because it considers the feature dimension by including it as a penalty (Fraley and Raftery, 1998, 2002). Therefore, in this study, BIC is used because of the multidimensional data structure.

In the GMM, the environmental information of a segment can be considered as a feature vector x_i = (x_1, x_2, ..., x_d) in feature space. All the data are assumed to be generated by a mixture of normal probability densities. Thus, we can calculate the likelihood of the observations:

$$L(\alpha, \theta \mid x) = \prod_{i=1}^{n} \sum_{j=1}^{K} \alpha_j f_j(x_i \mid \theta_j) \tag{7}$$

where α_j is the mixing probability of the jth cluster such that $\sum_{j=1}^{K} \alpha_j = 1$, n is the number of observations, K is the number of components, and f_j(x_i | θ_j) is a d-dimensional multivariate normal density with parameters θ_j = (μ_j, Σ_j), where μ_j is a vector that contains the mean of every feature and Σ_j is the covariance structure of all the features, which is defined in Appendix A.

$$f_j(x_i \mid \theta_j) = \frac{1}{(2\pi)^{d/2} \left|\Sigma_j\right|^{1/2}} \exp\left(-\frac{1}{2} (x_i - \mu_j)^{T} \Sigma_j^{-1} (x_i - \mu_j)\right) \tag{8}$$

To obtain the mixture density, f(x), the parameters θ_j = (μ_j, Σ_j) and α_j are optimized by maximizing the log-likelihood:

$$\log L(\alpha, \theta \mid x) = \sum_{i=1}^{n} \log \left( \sum_{j=1}^{K} \alpha_j f_j(x_i \mid \theta_j) \right) \tag{9}$$

with the EM algorithm (Fraley and Raftery, 1998, 2002). The EM algorithm consists of two steps, the E-step and the M-step. After all parameters are initialized, the expectation of x_i belonging to each cluster is calculated in the E-step, and then the parameters are updated by maximum likelihood estimation in the M-step. This process is repeated until all parameters converge. Each observation x_i is assigned to the cluster with the maximum probability calculated with the optimized parameters.

In order to apply the GMM clustering, we first choose the number of components and the covariance structure of the mixture model. We adopted a published code named mclust, an R package for normal mixture modeling (Fraley et al., 2012), to evaluate the covariance models based on the BIC. The BIC is a widely used measure to choose the best covariance structure and the optimal number of clusters (Fraley and Raftery, 2007), and it has the form

$$\mathrm{BIC} = 2\,\mathrm{loglike}(x, \theta) - T \log(n) \tag{10}$$

where loglike(x, θ) is the maximized log-likelihood, T is the number of free parameters to be estimated, and n is the number of observations. Previous studies suggest that choosing the model with the maximum BIC is a good approach (McLachlan and Basford, 1988; Fraley and Raftery, 1998, 2002; McLachlan and Krishnan, 2007); however, the number of components might be physically meaningless when the clusters are too many or too few. Moreover, too many components lead to overfitting and too few to underfitting (Tsai and Huang, 2010). Some clusters may contain only one observation when the number of clusters is too large. Fraley and Raftery (1998) suggested selecting the number of components where the BICs for all covariance models seem to converge. We selected the optimal number of clusters and the best model at the point where the BIC no longer improves significantly.
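The E-step, M-step, and BIC comparison described above can be illustrated on a toy problem. The sketch below is not the mclust workflow used in the study: it is a pure-Python, one-dimensional mixture fitted to synthetic data, and every name in it (fit_gmm_1d, bic, the chunk-based starting values, the variance floor) is an assumption made for illustration. It follows the sign convention of Equation (10), in which a larger BIC is better.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Univariate normal density (the d = 1 case of Equation (8))."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_loglik(data, weights, means, variances):
    """Log-likelihood of Equation (9) for a 1-D mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def fit_gmm_1d(data, k, n_iter=100, var_floor=1e-6):
    """Fit a k-component 1-D Gaussian mixture by EM; returns (weights, means, variances, loglik)."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start (an assumption): split the sorted data into k equal chunks.
    chunks = [ordered[j * n // k:(j + 1) * n // k] for j in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    grand_mean = sum(data) / n
    pooled_var = max(sum((x - grand_mean) ** 2 for x in data) / n, var_floor)
    variances = [pooled_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted maximum likelihood updates.
        for j in range(k):
            nj = sum(row[j] for row in resp)
            if nj < 1e-12:
                continue  # leave an effectively empty component unchanged
            weights[j] = nj / n
            means[j] = sum(row[j] * x for row, x in zip(resp, data)) / nj
            var_j = sum(row[j] * (x - means[j]) ** 2 for row, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)
    return weights, means, variances, mixture_loglik(data, weights, means, variances)


def bic(loglik, k, n):
    """Equation (10): BIC = 2 * loglik - T * log(n), larger is better."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    return 2.0 * loglik - n_params * math.log(n)


# Synthetic two-group data (illustrative only, not pipeline measurements).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]

scores = {}
for k in (1, 2, 3):
    _, _, _, loglik = fit_gmm_1d(data, k)
    scores[k] = bic(loglik, k, len(data))
    print(k, round(scores[k], 1))
```

Note that some libraries report BIC with the opposite sign (smaller is better), so the convention should be checked before comparing values across tools.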

2.3 Bayesian model of metal loss rate

The first stage in modeling the metal loss rate is to find a proper probability distribution for it. Velazquez et al. (2009) suggested, based on site data, that the best-fitting density function is the generalized extreme value (GEV) distribution. As shown in Figure 3, using data extracted from indirect and direct inspections of a 110-km pipeline, we chose three candidate distributions to fit the data. The fitting result also shows that GEV is the best distribution to describe the uncertainty of the metal loss rate, with the lowest Kolmogorov–Smirnov statistic. Data are scarce at the most extreme levels of the GEV distribution, yet for design and maintenance purposes it is the extreme level (tail behavior) that is of interest. Because observations of extreme metal loss rates are scarce, it is difficult to use the limited data within each cluster to form a relatively accurate empirical PDF of the metal loss rate, and especially to build a PDF with an exact tail behavior. Therefore, we have to develop an inferential process to maximize the use of the available data. The Bayesian inferential framework is suitable as a basis for the extreme value analysis (Coles and Tawn, 1996). Within each cluster, the metal loss rate has a GEV distribution:

$$r \sim \mathrm{GEV}(\mu, \sigma, \xi) \tag{11}$$

where r is the metal loss rate, and μ, σ, and ξ are the location, scale, and shape parameters, respectively. Let us consider that each observation of the metal loss rate is independent of the others. The likelihood function is

$$L(\mu, \sigma, \xi \mid r) = \frac{1}{\sigma^{n}} \exp\left\{-\sum_{i=1}^{n} \left[1 + \xi \left(\frac{r_i - \mu}{\sigma}\right)\right]^{-1/\xi}\right\} \prod_{i=1}^{n} \left[1 + \xi \left(\frac{r_i - \mu}{\sigma}\right)\right]^{-1/\xi - 1} \tag{12}$$

where r = (r_1, r_2, r_3, ..., r_n), 1 + ξ(r_i − μ)/σ > 0, and n is the number of observations. Notice that for σ > 0, the parameterization φ = ln(σ) is easier to work with, since σ is constrained to be positive. Thus, if we substitute σ = exp(φ) into Equation (12) and multiply by the Jacobian, we obtain the likelihood function in terms of μ, φ, ξ:

$$L(\mu, \phi, \xi \mid r) = \exp\left\{(1 - n)\phi - \sum_{i=1}^{n} \left[1 + \xi \left(\frac{r_i - \mu}{\exp(\phi)}\right)\right]^{-1/\xi}\right\} \prod_{i=1}^{n} \left[1 + \xi \left(\frac{r_i - \mu}{\exp(\phi)}\right)\right]^{-1/\xi - 1} \tag{13}$$
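In practice Equation (13) is usually evaluated on the log scale to avoid numerical underflow. The following is a minimal sketch of that evaluation; the function name gev_loglik, the overflow guard, and the sample values are assumptions for illustration, and the Gumbel limit ξ → 0 is deliberately left out.

```python
import math


def gev_loglik(r, mu, phi, xi):
    """Log of Equation (13), with sigma = exp(phi).

    Returns -inf when any observation violates 1 + xi * (r_i - mu) / sigma > 0.
    The Gumbel limit xi -> 0 is not handled in this sketch.
    """
    if abs(xi) < 1e-12:
        raise ValueError("xi = 0 (Gumbel limit) is not covered by this sketch")
    sigma = math.exp(phi)
    total = (1 - len(r)) * phi  # -n*phi from sigma**(-n), +phi from the Jacobian
    for ri in r:
        z = 1.0 + xi * (ri - mu) / sigma
        if z <= 0.0:
            return float("-inf")
        log_z = math.log(z)
        exponent = -log_z / xi  # log of z**(-1/xi)
        if exponent > 700.0:  # exp() would overflow; the likelihood is effectively zero
            return float("-inf")
        total += -math.exp(exponent) - (1.0 / xi + 1.0) * log_z
    return total


# Illustrative metal loss rates in mm/year (assumed values, not the paper's data).
rates = [0.021, 0.034, 0.027, 0.052, 0.040]
print(gev_loglik(rates, mu=0.03, phi=math.log(0.01), xi=0.2))
```

Returning negative infinity outside the support lets a sampler reject such candidates without special-case code.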

Fig. 3. Probability plot of metal loss rate based on data obtained from direct inspection along the 110-km pipeline. The Kolmogorov–Smirnov (K–S) test was performed, and Dn is the K–S statistic. [Fitted distributions: GEV (K–S Dn = 0.17), lognormal (K–S Dn = 0.19), gamma (K–S Dn = 0.21). Axes: metal loss rate (mm/year) versus probability.]

If we consider that no prior information is available about the parameters of the GEV distribution (although we have observations of the metal loss rate, we still do not have enough prior information about the distribution of μ, φ, ξ), a normal assumption is introduced to build the prior distribution of the GEV parameters:

$$\pi(\mu, \phi, \xi) = \pi(\mu) \cdot \pi(\phi) \cdot \pi(\xi) \tag{14}$$

where μ ∼ N(0, σ_μ), φ ∼ N(0, σ_φ), ξ ∼ N(0, σ_ξ). We use π(·) to denote a prior distribution and π*(·) to denote a posterior distribution hereafter. To make the prior distribution almost flat, we have arbitrarily chosen a large variance, since we do not have enough information regarding the prior parameter distribution. According to the Bayesian inferential framework,

$$\pi^{*}(\mu, \phi, \xi) = \frac{\pi(\mu, \phi, \xi)\, L(\mu, \phi, \xi \mid r)}{\iiint \pi(\mu, \phi, \xi)\, L(\mu, \phi, \xi \mid r)\, d\mu\, d\phi\, d\xi} \tag{15}$$

Because there are three parameters in the GEV distribution, we need a sampling process that can generate samples from a multivariate distribution. We use a combination of Gibbs sampling (Ching et al., 2006) and the Metropolis–Hastings (M–H) algorithm. The key to the Gibbs sampler is that only the univariate conditional distributions (Equation (18)) need to be considered. Such conditional distributions are practical to simulate, whereas no analytical conjugate posterior exists for a normal prior updated by a GEV likelihood function. The scheme comes from work by Geman and Geman (1984), Metropolis et al. (1953), and Hastings (1970).
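A combined Gibbs and M–H scheme of this kind is often implemented as "Metropolis within Gibbs": each parameter is updated in turn with a Gaussian random-walk proposal while the other two are held fixed. The sketch below is one such implementation under stated assumptions and is not the authors' code: the prior standard deviation, jump lengths, chain length, synthetic data, and all function names are hypothetical, and candidates with ξ = 0 are simply rejected rather than handled through the Gumbel limit.

```python
import math
import random


def gev_loglik(r, mu, phi, xi):
    """Log of Equation (13); -inf outside the support or at xi = 0 (not handled here)."""
    if abs(xi) < 1e-12:
        return float("-inf")
    sigma = math.exp(phi)
    total = (1 - len(r)) * phi
    for ri in r:
        z = 1.0 + xi * (ri - mu) / sigma
        if z <= 0.0:
            return float("-inf")
        log_z = math.log(z)
        if -log_z / xi > 700.0:
            return float("-inf")
        total += -math.exp(-log_z / xi) - (1.0 / xi + 1.0) * log_z
    return total


def log_posterior(r, theta, prior_sd):
    """Log posterior up to a constant, with independent N(0, prior_sd) priors."""
    loglik = gev_loglik(r, *theta)
    if loglik == float("-inf"):
        return loglik
    return loglik - 0.5 * sum((t / prior_sd) ** 2 for t in theta)


def metropolis_within_gibbs(r, init, jump, n_iter, prior_sd=100.0, seed=1):
    """Update mu, phi, xi one at a time with Gaussian random-walk proposals."""
    rng = random.Random(seed)
    theta = list(init)
    current = log_posterior(r, theta, prior_sd)
    if current == float("-inf"):
        raise ValueError("initial state lies outside the GEV support")
    chain, accepted = [], [0, 0, 0]
    for _ in range(n_iter):
        for j in range(3):
            cand = list(theta)
            cand[j] += rng.gauss(0.0, jump[j])  # jump[j] is the jump length
            cand_lp = log_posterior(r, cand, prior_sd)
            # Symmetric proposal, so the M-H ratio reduces to the posterior ratio.
            if rng.random() < math.exp(min(0.0, cand_lp - current)):
                theta, current = cand, cand_lp
                accepted[j] += 1
        chain.append(tuple(theta))
    return chain, [a / n_iter for a in accepted]


def sample_gev(rng, mu, sigma, xi):
    """Draw one GEV variate by inverting the distribution function (xi != 0)."""
    u = rng.uniform(1e-12, 1.0 - 1e-12)
    return mu + sigma * ((-math.log(u)) ** (-xi) - 1.0) / xi


# Synthetic data with assumed true parameters; not the pipeline data.
rng = random.Random(0)
data = [sample_gev(rng, 0.03, 0.01, 0.1) for _ in range(150)]
mean = sum(data) / len(data)
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
init = (mean, math.log(sd), 0.1)
chain, accept_rates = metropolis_within_gibbs(data, init, jump=(0.002, 0.15, 0.1), n_iter=2000)
print("acceptance rates (mu, phi, xi):", [round(a, 2) for a in accept_rates])
```

The jump lengths would normally be tuned by inspecting the printed acceptance rates and rerunning, and an initial portion of the chain would be discarded as burn-in before summarizing the posterior.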

According to Equation (15), the posterior density function can be written in the form:

$$\pi^{*}(\mu, \phi, \xi) \propto \pi(\mu, \phi, \xi)\, L(\mu, \phi, \xi \mid r) \tag{16}$$

With the prior parameter distribution given in Equation (14),

$$\pi(\mu, \phi, \xi)\, L(\mu, \phi, \xi \mid r) = \pi(\mu)\, \pi(\phi)\, \pi(\xi)\, L(\mu, \phi, \xi \mid r) \tag{17}$$

we can form the full conditional functions for Gibbs sampling:

$$\begin{cases} \pi(\mu^{*} \mid \phi, \xi) = \pi(\mu)\, L(\mu, \phi, \xi \mid r) \\ \pi(\phi^{*} \mid \mu^{*}, \xi) = \pi(\phi)\, L(\mu^{*}, \phi, \xi \mid r) \\ \pi(\xi^{*} \mid \mu^{*}, \phi^{*}) = \pi(\xi)\, L(\mu^{*}, \phi^{*}, \xi \mid r) \end{cases} \tag{18}$$

where μ*, φ*, and ξ* are new candidates of the GEV parameters for sampling. To run the M–H algorithm, a proposal distribution is needed to draw μ*, φ*, ξ* given the current state. This proposal distribution and its parameters may have a strong influence on the robustness of the simulation process (Dai et al., 2012). In this work, three Gaussian proposal functions are chosen, as shown below:

$$\mu^{*} = \mu + \omega_{\mu}; \quad \phi^{*} = \phi + \omega_{\phi}; \quad \xi^{*} = \xi + \omega_{\xi}$$

where ω_μ, ω_φ, and ω_ξ are normally distributed with zero means and a certain standard deviation (STD) that we call the jump length. The proposal distributions determine the correlation between two successive states within the Markov chain. Here, the jump length is chosen carefully so that the acceptance probability will be in the approximate range of 40% to 60%. This is because a
reasonable acceptance probability helps the M–H algorithm avoid the local optimization problem, and a suitable acceptance probability guarantees that the samples generated by the Gibbs sampling process represent the posterior distributions (Zhang et al., 2013). By employing the M–H algorithm within each cluster, the realizations of the GEV parameters were used to estimate the posterior distributions of μ, φ, ξ by using maximum likelihood estimation. Since the uncertainty of the distribution parameters is taken into consideration, the model uncertainty can be reflected by estimating the upper and lower bounds (i.e., the 95% credibility interval) of the PDF for the parameters (μ, φ, ξ). Then, we develop the upper and lower bound distributions of the metal loss rate by plugging the boundary values of the parameters (μ, φ, ξ) into the distribution function for the metal loss rate.

The distributions of the metal loss rate describe the random property at an average level within each cluster. However, we are more interested in the distribution at a specific location where the corrosion event occurs. Since we developed the distributions within the clusters, we have some prior information regarding the metal loss rate at a certain point, although at an average level. In addition, we also have the feature vector (location and soil properties), which allows the Bayesian inferential framework to be used for estimating the distribution of the metal loss rate.

In order to estimate the distribution of the metal loss rate at a specific location, we must first build a training matrix using soil measurement data to evaluate the likelihood function value. In the training matrix [X,R], each row represents all the features of one defect point X_i and the corresponding metal loss rate R_i. The critical parameters for corrosive soils are included:

$$X_i = \left[\mathrm{CP},\ E_h,\ \mathrm{Resistivity},\ \mathrm{pH},\ [\mathrm{CO_3^{2-}}],\ [\mathrm{HCO_3^{-}}],\ [\mathrm{Cl^{-}}],\ [\mathrm{SO_4^{2-}}],\ \mathrm{Soil\ Moisture}\right]$$

where CP is the distance from the nearest cathodic protection station to the defect, Eh is the half-cell potential, [·] denotes the concentration of each ionic species, and pH indicates the activity of H+ ions. A multivariate normal assumption is introduced to form the joint distribution of all the features and the metal loss rate. The likelihood function is given by:

$$L(r \mid x) = \frac{f([x, r])}{P(r)} \tag{19}$$

where x is the feature vector of a certain location, r is the corresponding metal loss rate, and P(r) is the marginal probability distribution of the metal loss rate.

$$f([x, r]) = \frac{1}{(2\pi)^{5} \left|\Sigma\right|^{1/2}} \exp\left(-\frac{1}{2} ([x, r] - \mu)^{T} \Sigma^{-1} ([x, r] - \mu)\right) \tag{20}$$

where μ is the mean vector of [X,R], Σ is the covariance matrix of [X,R], and |Σ| is the determinant of Σ. The distribution of the metal loss rate in each cluster can serve as the prior distribution. Therefore, the updated expression is

$$\pi^{*}(r) \propto \pi(r)\, L(r \mid x) \tag{21}$$

where π*(r) is the posterior and π(r) is the prior. Using the M–H algorithm to estimate the posterior, an updated distribution can be formed from the MCMC sampling process. The complete algorithm of the proposed methodology is shown in Appendix B.
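One way to read Equation (19) is through the standard conditioning result for a joint normal distribution. The partitioned notation below (μ_x, μ_r, Σ_xx, Σ_xr, σ_rr) is introduced here only for illustration and does not appear in the original text, and the identity assumes that P(r) is taken to be the normal marginal implied by the joint density of Equation (20).

```latex
\[
[x, r] \sim N\!\left(
\begin{pmatrix} \mu_x \\ \mu_r \end{pmatrix},
\begin{pmatrix} \Sigma_{xx} & \Sigma_{xr} \\ \Sigma_{xr}^{T} & \sigma_{rr} \end{pmatrix}
\right),
\qquad
L(r \mid x) = \frac{f([x, r])}{P(r)} = f(x \mid r)
\]
\[
x \mid r \sim N\!\left(
\mu_x + \Sigma_{xr}\, \sigma_{rr}^{-1} (r - \mu_r),\;
\Sigma_{xx} - \Sigma_{xr}\, \sigma_{rr}^{-1}\, \Sigma_{xr}^{T}
\right)
\]
```

Under that reading, the likelihood in Equation (21) measures how plausible the observed soil features at a location are for a candidate metal loss rate r, and the nine features plus the rate give the ten dimensions behind the factor (2π)^5 in Equation (20).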

3 MODEL APPLICATIONS

The proposed model is applied to predict the external corrosion of a pipeline section that is 110 km in length and is used for transmission of oil product to downstream operations. The diameter of the pipeline is 457.2 mm, and the wall thickness varies between sections along the length of the pipeline, with wall thicknesses of 6.4, 9.5, 10.3, and 12.7 mm. According to the database for the pipeline section, measurements from in-line inspections performed in 2005, 2010, and 2011 are available. The 2005 and 2010 inspections contain information about defects along the entire length of the pipeline; however, the 2011 inspection was only performed on a portion of the pipeline. The operation times of the sections of pipeline also vary, with most sections having been in service for more than 29 years. In this analysis, we use the results obtained in the 2010 in-line inspection, the most recent inspection that includes data from the full length of the pipeline. During the 2010 pipeline inspection, a total of 1,924 external events related to wall thickness metal loss were detected.

3.1 Clustering results

Prior to clustering, it was necessary to create a new data set based on the physical distance along the pipeline, since the environmental and soil condition measurements were collected every 200 m during the site measurement. Locations where measurements were taken are referred to as "stations" and the section of pipeline between two stations is called a "zone." The physical properties of a given zone are considered to be similar to the mean values for the two nearest stations. We counted a total of 599 zones along approximately 110 km of the pipeline; however, 98 of these zones did not contain any observations, and these sections were eliminated from our data set. The remaining 501 zones were identified for clustering.

The zones are categorized into clusters based on the features; all observations were labeled with the corresponding clusters. For example, the observations between Stations j and j + 1 (say, the ith defect in zone j) are labeled as cluster 3 if zone j is located in cluster 3. To examine the effect of cathodic protection, we also measured the distance between a holiday (a discontinuity in the protective coating) and the nearest cathodic protection station (indicated as CPi in Figure 4).

Fig. 4. Schematic diagram of pipeline inspection. [Schematic: soil and environment survey points bound Zones 1 to 599, each 200 m long, along the inspection direction; CPi is the distance from the ith holiday in zone j to the nearest cathodic protection station; total inspected pipeline ≈ 110 km.]

Fig. 5. BIC plot for eight covariance models for the environmental properties data. Spherical models (1–2); diagonal models (3–6); ellipsoidal models (7–8). [Axes: number of components versus BIC.]

From the new data set, 80% of the data were randomly chosen as the training set for building the model, and the remaining 20% were assigned to the testing set; the training set was grouped into clusters via the GMM algorithm. The clustering result was verified using the testing set, and the BICs for various models (Fraley
[Figure 6 axes: BIC (about −10400 to −8800) versus number of components (0 to 12); curves labeled 1, 2, K1, and K2.]

Fig. 6. BIC plot for GMM (1 and 2) compared with K-means (K1 and K2) clustering for the environmental properties data: spherical covariance structure (1 and K1); diagonal covariance structure (2 and K2).

and Raftery, 2002) were compared. This process was repeated for 100 iterations to obtain the average optimum number of components. The clustering result shows that diagonal covariance model 4 (shown in Figure 5) with eight components gives the best BIC. However, the gain in BIC levels off at around four components, and the BICs for all models appear to converge at four components. Therefore, we chose to use four clusters. We also compared the GMM results with clusters determined through the K-means algorithm, as shown in Figure 6. As expected, GMM performed better than K-means for the same covariance structures, because GMM chooses the parameters that maximize the log-likelihood, whereas K-means evaluates the log-likelihood at fixed centers after the data are classified into clusters. By employing the Gaussian clustering technique, the observations (i.e., data points in feature space) were classified into four clusters: cluster 1 contains 727 events, cluster 2 contains 383 events, cluster 3 contains

[Figure 7 panels: Eh, pH, [Cl−], resistivity, and cluster membership (C1 to C4), each plotted against Defect ID (0 to 1400).]

Fig. 7. The clustering results (e.g., a defect appearing in C1 belongs to cluster 1) and the corresponding information of soil features (Eh, pH, [Cl−], resistivity). Defects are ordered consecutively from the starting point to the end point of the inspected pipeline.

339 events, and cluster 4 contains 86 events. The soil features of all the defects follow the multivariate normal distribution, but the typical values (means) and the divergence (variances) of the observations differ among the clusters. This is because the soil corrosivity is affected by many features (resistivity, pH, [Cl−], etc.) that are correlated with each other. In addition, the distribution-based clustering model treats the features as random variables and provides, for a given observation, the probabilities of belonging to each cluster based on the mixture density in the feature space. The clustering results and some corresponding soil features are shown in Figure 7. Within each cluster, we are interested in finding the distribution of the metal loss rate.

3.2 Estimation of distribution parameters within each cluster

By employing the Bayesian approach, not only can we obtain the distribution of the metal loss rate, but we can also estimate the distribution of the parameters; that is, we obtain a family of GEV distributions indexed by the distribution of the GEV parameters. Because of this feature of the Bayesian inferential framework, we can take model uncertainty and its influencing factors into consideration. The algorithm described in Section 2.3 is applied to all clusters. For each cluster, 50,000 Markov chain samples were generated, of which the first 10,000 samples were discarded because the chain is not yet stationary. After the burn-in time (about 3,000 to 6,000 samples), the chain converged and mixed very well. Because the Markov chain is a short-memory random process, successive samples may be correlated; hence, we keep one sample out of every 10 to form approximately independent realizations. The subset of uncorrelated realizations after burn-in can be used as a stationary MCMC trace to estimate the distribution of the parameters.

(1) The result of parameter distribution. The simulation steps are shown in Appendix B, and the posterior parameter distributions of the four clusters are shown in Figure 8.

(2) Model uncertainty analysis. Based on the estimation results, an uncertainty evaluation should be performed to measure the model error. Table 1 shows the statistical characteristics of the parameters within each cluster after the GMM clustering process was completed. Based on these statistical characteristics, the upper-bound and lower-bound (95% credibility interval) cumulative distribution functions (CDFs) are shown in Figure 9, which presents the distribution uncertainty of the metal loss rate.
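The sampling, burn-in, and thinning steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes flat priors on all three parameters, assumes that the transformed scale parameter is φ = log σ, and uses hypothetical function names and jump lengths.

```python
import math
import random


def gev_logpdf(x, mu, phi, xi):
    """Log-density of a GEV with location mu, scale exp(phi), and shape xi.

    Assumption: phi is the log of the scale parameter.
    """
    z = (x - mu) / math.exp(phi)
    try:
        if abs(xi) < 1e-8:  # Gumbel limit as xi -> 0
            return -phi - z - math.exp(-z)
        t = 1.0 + xi * z
        if t <= 0.0:
            return -math.inf  # x lies outside the support
        return -phi - (1.0 + 1.0 / xi) * math.log(t) - t ** (-1.0 / xi)
    except OverflowError:
        return -math.inf


def log_posterior(theta, data):
    """Log-posterior up to a constant; flat priors are an assumption."""
    total = 0.0
    for x in data:
        lp = gev_logpdf(x, theta[0], theta[1], theta[2])
        if lp == -math.inf:
            return -math.inf
        total += lp
    return total


def metropolis_within_gibbs(data, init, jump, n_runs, rng):
    """Update mu, phi, xi one at a time with Gaussian random-walk proposals."""
    theta = list(init)
    current = log_posterior(theta, data)
    if current == -math.inf:
        raise ValueError("initial point has zero posterior density")
    chain = []
    for _ in range(n_runs):
        for k in range(3):
            proposal = list(theta)
            proposal[k] += rng.gauss(0.0, jump[k])
            candidate = log_posterior(proposal, data)
            # Accept with probability min(1, posterior ratio), on the log scale.
            if math.log(1.0 - rng.random()) < candidate - current:
                theta, current = proposal, candidate
        chain.append(tuple(theta))
    return chain


def burn_and_thin(chain, burn_in, every):
    """Drop the first burn_in states, then keep one state out of each `every`."""
    return chain[burn_in::every]
```

With the settings quoted in the text, `burn_and_thin(chain, 10000, 10)` applied to a 50,000-state chain would retain 4,000 states per cluster.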

[Figure 8 panels: (a) density versus μ (about 0.024 to 0.03); (b) density versus φ (about −6 to −4.6); (c) density versus ξ (about −0.2 to 0.6); one curve per cluster (Cluster 1 to Cluster 4).]

Fig. 8. The estimation of parameter distributions based on Gaussian mixture model (GMM) clustering technique: (a) the posterior distribution of location parameter μ; (b) the posterior distribution of transformed scale parameter φ; (c) the posterior distribution of shape parameter ξ.

Table 1
Statistical characteristics of the parameters of each cluster (GMM)

Cluster(a)   μ Mean    μ STD(b)    φ Mean    φ STD     ξ Mean     ξ STD
C1           0.0275    9.764E-5    −5.954    0.0398    0.6268     0.0170
C2           0.0248    5.154E-4    −4.722    0.0446    0.1226     0.0426
C3           0.0252    3.896E-4    −5.006    0.0418    −0.2607    0.0159
C4           0.0284    5.683E-4    −5.358    0.0995    0.2352     0.0876

(a) C = cluster. (b) STD = standard deviation.

Model error comes from the lack of information (the sample size within one cluster may not be sufficient for a precise estimation of the GEV parameters). The uncertainty area reflects the possible region of the CDF curves of metal loss rate, since we do not know which curve is the actual one. The larger the uncertainty area, the less confidence we have in the accuracy of the parameter estimation. Model error is used to quantify the plausibility of using the estimated GEV parameters to describe the metal loss rate. We still hold the GEV assumption but may not obtain a narrow interval for the distribution parameters when the available information (i.e., the observations) is inadequate. The number of observations varies among clusters. When the GMM algorithm is applied, cluster 4 is found to contain only 86 observations, which is far fewer than the sample size of the other clusters (339 or more observations). From Table 1, we notice that the STDs of φ and ξ are significantly greater in cluster 4 than in the other clusters, which can result in a larger area of uncertainty. Figure 9 shows the varying regions of the CDF curves. Because of the large uncertainty area obtained for cluster 4, we have relatively low confidence in the CDF estimation for cluster 4.
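One way to obtain an uncertainty area of this kind is to evaluate the GEV CDF for every retained posterior draw and take pointwise quantiles. The sketch below is an illustration under assumptions, not the authors' procedure: it assumes φ = log σ, and the function names and the linear-interpolation quantile rule are hypothetical choices.

```python
import math


def gev_cdf(x, mu, phi, xi):
    """GEV cumulative distribution function; assumes scale = exp(phi)."""
    z = (x - mu) / math.exp(phi)
    try:
        if abs(xi) < 1e-8:  # Gumbel limit as xi -> 0
            return math.exp(-math.exp(-z))
        t = 1.0 + xi * z
        if t <= 0.0:
            # Below the lower end point (xi > 0) or above the upper one (xi < 0).
            return 0.0 if xi > 0 else 1.0
        return math.exp(-t ** (-1.0 / xi))
    except OverflowError:
        return 0.0  # the exponent is so large that the CDF underflows to zero


def empirical_quantile(values, p):
    """Quantile of a sample using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def cdf_band(x_grid, draws, lower=0.025, upper=0.975):
    """Pointwise credibility band of the CDF over posterior draws (mu, phi, xi)."""
    band = []
    for x in x_grid:
        values = [gev_cdf(x, mu, phi, xi) for mu, phi, xi in draws]
        band.append((empirical_quantile(values, lower),
                     empirical_quantile(values, upper)))
    return band
```

The vertical gap between the two returned curves at each grid point is a simple numerical measure of the uncertainty area; wider posterior spreads in φ and ξ, as in cluster 4, widen the gap.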

3.3 Distribution estimation of metal loss rate within each cluster

Figure 10 shows the upper-bound distributions of metal loss rate within the four GMM clusters. As may be expected, the density functions belonging to different clusters differ from each other, which demonstrates the statistical difference between the clusters. Once the clustering process was completed, we observed different patterns of PDF among the four clusters and gained additional knowledge regarding the tail behavior of the various clusters. From the perspective of probability, the severity of corrosion can be reflected by a specified upper-tail probability. In addition, the failure probability of a pipeline structure is strongly related to the corrosion rate, since the limit state functions contain the corrosion rate as a variable (Caleyo et al., 2002). From the inspection data, the empirical histograms of the metal loss rate in each cluster can be generated. The resulting density histograms are shown in Figure 11. By comparing Figures 10 and 11, it can be seen that the Bayesian inferential framework makes full use of the available data and gives an approximate distribution based on the limited observations of extreme metal loss rate.

3.4 Predicted distribution of metal loss rate at a specific location

Having established the PDF of metal loss rate within each cluster, we take it as the prior distribution at a specific site, according to the cluster to which the site belongs. According to Equations (19) to (21), the likelihood function value is estimated by processing the training data, and the posterior comes from the realization generated via the M–H algorithm using the site-specific features at the testing site as well as the likelihood of the observations. Figure 12a shows one example of the sampling result at a site in the testing set using the MCMC method.
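The site-level update can be sketched as a random-walk M–H sampler whose target is the product of the cluster prior and the site likelihood. The sketch below leaves both as user-supplied callables, because Equations (19) to (21) are not reproduced in this section; the function name, argument names, and jump length are hypothetical.

```python
import math
import random


def site_metropolis_hastings(log_prior, log_likelihood, r0, jump, n_runs, rng):
    """Random-walk M-H trace for the metal loss rate r at one site.

    log_prior(r) and log_likelihood(r) are placeholders supplied by the caller;
    they stand in for the cluster PDF and the site likelihood respectively.
    """
    def log_target(r):
        return log_prior(r) + log_likelihood(r)

    r = r0
    current = log_target(r)
    if current == -math.inf:
        raise ValueError("initial point has zero posterior density")
    trace = []
    for _ in range(n_runs):
        proposal = r + rng.gauss(0.0, jump)
        candidate = log_target(proposal)
        # Accept with probability min(1, A*), evaluated on the log scale.
        if math.log(1.0 - rng.random()) < candidate - current:
            r, current = proposal, candidate
        trace.append(r)
    return trace
```

Working on the log scale avoids underflow when the prior or the likelihood is very small in the tail, which is where the extreme metal loss rates of interest lie.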

[Figure 9 panels (a) to (d): CDF (0 to 1) versus Metal Loss Rate (mm/year) (0 to 0.1); each panel shows the upper bound, the lower bound, and the shaded uncertainty area.]

Fig. 9. The credibility bound (0.025 and 0.975) of CDF based on GMM clustering technique and shaded area shows the uncertainty of CDF within each cluster: (a) cluster 1, (b) cluster 2, (c) cluster 3, and (d) cluster 4.

[Figure 10 axes: PDF (0 to 140) versus Metal Loss Rate (mm/year) (0.01 to 0.1); legend: GMM Cluster 1, GMM Cluster 2, GMM Cluster 3, GMM Cluster 4, Before clustering.]

Fig. 10. The upper bound distributions (0.975 credibility bound) of metal loss rate based on the GMM clustering technique compared with the empirical PDF of metal loss rate before clustering.

We notice that the extreme value (tail behavior) of the GEV distribution can be simulated well by using the M–H algorithm, and the simulation result converges well and is stationary. Figure 12b shows the estimated PDF of metal loss rate. Based on the estimated PDFs of metal loss rate at different sites, we can obtain the quantile at a chosen significance level α and thereby produce different upper-bound predictions of metal loss rate; that is, the metal loss rate may be greater than the prediction with probability α. Figure 13 shows the prediction result of metal loss rate in the testing set along the pipeline system using the proposed Bayesian model. Two kinds of predicted metal loss rate, one at the median and the other at the α = 0.05 significance level, are shown. It can be seen that most of the inspection data in the testing set lie around the median of the predicted PDF. Hence, the median can be taken as an indicator reflecting the corrosion rate at a high probability. The extreme points, in contrast, are sparse. Although it is difficult (sometimes even impossible) to predict the extreme value using a traditional physical model, this issue can be overcome by using a high-percentage quantile of the PDF at a specific site. If the chosen percentage is very high (say, 0.9999), the corresponding prediction is extremely rare, and such a small-probability event will be nearly impossible to observe. Hence, it is necessary to find a proper α that leads to predictions matching the site observations. After several trials, we chose α = 0.05 and calculated the corresponding quantile as the predicted extreme metal loss rate, which is shown in Figure 13. The observed extreme points are all around the prediction curve (indicated by the dash-dot line). From the perspective of probability, the prediction result of the extreme value agrees with the inspection result.
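For a single GEV distribution, the upper-bound prediction at level α has a closed form, which can help when checking numerical quantiles taken from an MCMC trace. The sketch below is illustrative only: it assumes φ = log σ, uses hypothetical function names, and plugs in fixed parameter values rather than the site-specific posterior used in the paper.

```python
import math


def gev_quantile(p, mu, phi, xi):
    """Value x with P(R <= x) = p for a GEV with scale exp(phi) (assumption)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    sigma = math.exp(phi)
    y = -math.log(p)
    if abs(xi) < 1e-8:  # Gumbel limit as xi -> 0
        return mu - sigma * math.log(y)
    return mu + sigma * (y ** (-xi) - 1.0) / xi


def upper_bound_prediction(alpha, mu, phi, xi):
    """Rate exceeded with probability alpha, i.e. the (1 - alpha) quantile."""
    return gev_quantile(1.0 - alpha, mu, phi, xi)
```

For example, `upper_bound_prediction(0.05, 0.0275, -5.954, 0.6268)` evaluates the α = 0.05 bound at the cluster 1 posterior means from Table 1, and `gev_quantile(0.5, ...)` gives the corresponding median.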

[Figure 11 panels (a) to (d): density (0 to 200) versus Metal Loss Rate (mm/year) (0 to 0.09).]

Fig. 11. Density histograms of metal loss rate from inspection data within each GMM cluster: (a) cluster 1, (b) cluster 2, (c) cluster 3, and (d) cluster 4.

4 DISCUSSIONS

4.1 Impact of imperfect inspections

In industry practice, in-line inspection (ILI) cannot be perfect, and its performance can be defined in terms of the probability of detection (PoD) and the probability of false alarm (PFA) (Schoefs et al., 2009; Sahraoui et al., 2013). Both PoD and PFA affect the likelihood function of the proposed methodology. The ideal state is PoD = 1 and PFA = 0 (no detection error and no false alarm); however, such a state cannot be achieved in practice because of the limited detectability and the intrinsic measurement error of the ILI tools. If the detectability of the device is low (e.g., the identifiable defect depth is 20% of wall thickness at PoD = 0.9) while the measurement noise is relatively small, defects with small corrosion depth are not likely to be detected. As a result, the likelihood function will miss this part of the information and yield a biased estimation. On the other hand, if the device detectability is high (e.g., the identifiable defect depth is 5% of wall thickness at PoD = 0.9) but the measurement noise is considerable, then the likelihood function may incorporate many false alarms and the estimation might be inaccurate. In the present work, the inspection was conducted with a high-resolution UT tool. The identifiable defect depth is 0.5 mm at PoD = 0.9, so the tool can achieve a high PoD and a low PFA. Therefore, we assume that PoD = 1 and PFA = 0. In future work, we will consider the probability of defect existence and give a conditional estimation of the likelihood function so that imperfect inspections can be taken into consideration.

4.2 Some issues about data

The data set is heterogeneous before clustering, i.e., one single model cannot describe the data. Therefore, the clustering approach partitions the data into more homogeneous groups and improves the significance of the statistical characteristics. However, the result of clustering depends on the data. In other words, the number of observations and the quality of the data might influence the clustering results. As the number


of observations increases, the accuracy of the clustering increases. The data must be preprocessed properly, including such tasks as normalizing and detecting outliers, because these procedures can improve the separability of the data set. In this work, the Gaussian mixture assumption for the feature space might be too strong. However, for simplicity, we assumed that the data are generated from a Gaussian mixture density. In future work, we will implement a general mixture model to make the assumption more realistic.
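The preprocessing mentioned above is not specified in detail in this section. As one common possibility, the sketch below standardizes each feature column and flags rows containing an extreme standardized value. The function names and the threshold of 3 are assumptions made for illustration, not the authors' settings.

```python
import statistics


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit (population) STD."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column carries no information, so map it to zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


def flag_outliers(scaled_rows, threshold=3.0):
    """Mark rows in which any standardized feature exceeds the threshold."""
    return [any(abs(v) > threshold for v in row) for row in scaled_rows]
```

Standardizing matters for soil features in particular because quantities such as resistivity and pH differ by orders of magnitude, and an unscaled feature with a large range would otherwise dominate the mixture fit.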

[Figure 12 panels: (a) Metal Loss Rate (mm/year) (0.02 to 0.11) versus Iteration (0 to 5000); (b) density (0 to 140) versus Metal Loss Rate (mm/year) (0.03 to 0.09).]

Fig. 12. A realization of metal loss rate via MCMC sampling process at a specific location in the testing set: (a) MCMC trace of metal loss rate from a sampling process, (b) probability fit of the MCMC trace.

4.3 The number of components and the covariance model

The optimum number of components and the best covariance structure for GMM were selected at the converging point of the BICs for all models, as shown in Figure 5. We identified the convergence point as the number of components at which the sum of the squared BICs reached its minimum. The selection process was repeated with 100 cross-validations of 401 training samples and 50 testing samples, and the final optimum number was taken as the nearest integer to the average optimum number of components; in this case, the average was 3.8, giving 4 components. This Monte Carlo approach is able to achieve a reasonable estimate of the true optimum number of components based on the central limit theorem (Kuehl, 1994).

4.4 Parameter variability to the number of observations in one cluster

The number of observations varies between clusters. We are particularly interested in whether the parameter

[Figure 13 axes: Metal Loss Rate (mm/year) (0 to 0.18) versus Defect ID (0 to 350); legend: GMM Cluster 1, GMM Cluster 2, GMM Cluster 3, GMM Cluster 4, Median of Prediction PDF, α = 0.05 Upper Bound of Prediction PDF.]

Fig. 13. Evaluation of the metal loss rate via MCMC sampling process at every defect location in the testing data set; the scatter points are calculated from inspection data. Defects are numbered consecutively from the start station to the end station along the inspected pipeline interval.

estimation is sensitive to the number of observations; if so, we need to ascertain which parameter is more sensitive for the GMM clustering algorithm. We use the STD of parameters as the indicator of sensitivity. When the GMM algorithm is processed, cluster 4 is found to contain 86 observations; this is far fewer than other clusters, which have about 340 or more observations. From Table 1, we notice that the STD of φ and ξ are significantly greater in cluster 4, which can result in a bigger area of uncertainty. This demonstrates that the variability of φ and ξ is considerable when the size of the sample space is small.

4.5 The model uncertainty of metal loss rate

The uncertainty area in Figure 9 can be used to evaluate the model error. A large uncertainty area indicates a low level of confidence, which means that the accuracy of the estimated PDF is in doubt. For the GMM clustering algorithm, the transformed scale parameter φ and the shape parameter ξ mainly govern the area of uncertainty. Therefore, based on the statistical characteristics of the parameters (Table 1) and the uncertainty area (Figure 9), we can perform tests to determine whether the clustering algorithm performs well and whether the number of components is appropriate.

5 CONCLUSION

In this work, a probabilistic model for the prediction of metal loss rate in underground pipeline structures is proposed. It contains two parts: (1) a clustering method using GMM is applied to each segment based on the environmental features along the pipeline; (2) within each cluster, a Bayesian model using the M–H algorithm with Gibbs sampling is established for the prediction of metal loss rate based on the in-line inspection results. The proposed model is able to account for the model error and estimate the PDF of metal loss rate at a specified location. Our results demonstrate a new methodology to classify and analyze external corrosion defects using in-line inspection results and the local soil environmental information. Because this model is based on site experience, it is well suited to field conditions. The novelty of the methodology can be described as follows: we divided the pipeline into many segments so that the feature differences and similarities of the defects can be taken into consideration by the clustering method; going a step further, we established a position-dependent probabilistic model along the pipeline, which not only can predict the metal loss rate at a high probability but also can predict the extreme value, which occurs infrequently but under some circumstances can present a dangerous condition while the pipeline system is in operation.

ACKNOWLEDGMENTS

The authors would like to acknowledge the Secretaría de Energía (SENER) and the Consejo Nacional de Ciencia y Tecnología (CONACYT) in Mexico for the financial support to perform the work in this project (under SENER-CONACYT contract number 159913). The constructive comments of the anonymous reviewers are also appreciated.

REFERENCES

Abes, J. A., Salinas, J. J. & Rogers, J. (1985), Risk assessment methodology for pipeline systems, Structural Safety, 2(3), 225–37.
Adeli, H. & Panakkat, A. (2009), A probabilistic neural network for earthquake magnitude prediction, Neural Networks, 22(7), 1018–24.
Ahmadlou, M. & Adeli, H. (2010), Enhanced probabilistic neural network with local decision circles: a robust classifier, Integrated Computer-Aided Engineering, 17(3), 197–210.
Alamilla, J. L. & Sosa, E. (2008), Stochastic modelling of corrosion damage propagation in active sites from field inspection data, Corrosion Science, 50(7), 1811–19.
Anderko, A., McKenzie, P. & Young, R. D. (2001), Computation of rates of general corrosion using electrochemical and thermodynamic models, Corrosion, 57(3), 202–13.
Bello-Orgaz, G., Menendez, H. D. & Camacho, D. (2012), Adaptive K-means algorithm for overlapped graph clustering, International Journal of Neural Systems, 22(05), 1250018.
Ben-Hur, A., Horn, D., Siegelmann, H. T. & Vapnik, V. (2002), Support vector clustering, The Journal of Machine Learning Research, 2, 125–37.
Bockris, J. O. M. & Reddy, A. K. N. (1970), Modern Electrochemistry, Vols. 1 & 2, Macdonald, London.
Caleyo, F., Gonzalez, J. & Hallen, J. (2002), A study on the reliability assessment methodology for pipelines with active corrosion defects, International Journal of Pressure Vessels and Piping, 79(1), 77–86.
Caleyo, F., Velazquez, J. C., Valor, A. & Hallen, J. M. (2009a), Probability distribution of pitting corrosion depth and rate in underground pipelines: a Monte Carlo study, Corrosion Science, 51(9), 1925–34.
Caleyo, F., Velázquez, J. C., Valor, A. & Hallen, J. M. (2009b), Markov chain modelling of pitting corrosion in underground pipelines, Corrosion Science, 51(9), 2197–207.
Ching, J., Muto, M. & Beck, J. L. (2006), Structural model updating and health monitoring with incomplete modal data using Gibbs sampler, Computer-Aided Civil and Infrastructure Engineering, 21(4), 242–57.
Coles, S. G. & Tawn, J. A. (1996), A Bayesian analysis of extreme rainfall data, Applied Statistics-Journal of the Royal Statistical Society Series C, 45(4), 463–78.
Dai, H., Zhang, H., Wang, W. & Xue, G. (2012), Structural reliability assessment by local approximation of limit state functions using adaptive Markov chain simulation and support vector regression, Computer-Aided Civil and Infrastructure Engineering, 27(9), 676–86.
DeWaard, C. & Milliams, D. (1975), Carbonic acid corrosion of steel, Corrosion, 31(5), 177–81.
Engelhardt, G., Macdonald, D. & Urquidi-Macdonald, M. (1999), Development of fast algorithms for estimating stress corrosion crack growth rate, Corrosion Science, 41(12), 2267–302.
Engelhardt, G. & Macdonald, D. D. (2004), Unification of the deterministic and statistical approaches for predicting localized corrosion damage. I. Theoretical foundation, Corrosion Science, 46(11), 2755–80.
Fraley, C. & Raftery, A. E. (1998), How many clusters? Which clustering method? Answers via model-based cluster analysis, The Computer Journal, 41(8), 578–88.
Fraley, C. & Raftery, A. E. (2002), Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97(458), 611–31.
Fraley, C. & Raftery, A. E. (2007), Bayesian regularization for normal mixture estimation and model-based clustering, Journal of Classification, 24(2), 155–81.
Fraley, C., Raftery, A. E., Murphy, T. B. & Scrucca, L. (2012), MCLUST Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation, Technical Report no. 597, Department of Statistics, University of Washington, June 2012.
Geman, S. & Geman, D. (1984), Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6), 721–41.
Ghosh-Dastidar, S. & Adeli, H. (2003), Wavelet-clustering-neural network model for freeway incident detection, Computer-Aided Civil and Infrastructure Engineering, 18(5), 325–38.
Hastings, W. K. (1970), Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57(1), 97–109.
Hong, H. P. (1999), Inspection and maintenance planning of pipeline under external corrosion considering generation of new defects, Structural Safety, 21(3), 203–22.
Jain, A. K. (2010), Data clustering: 50 years beyond K-means, Pattern Recognition Letters, 31(8), 651–66.
Kiefner, J. F. & Kolovich, K. M. (2007), Calculation of a corrosion rate using Monte Carlo simulation, CORROSION 2007, NACE International, Nashville, TN, March 11–15, 2007.
Kodogiannis, V. S., Amina, M. & Petrounias, I. (2013), A clustering-based fuzzy wavelet neural network model for short-term load forecasting, International Journal of Neural Systems, 23(05), 1350024.
Kuehl, R. O. (1994), Statistical Principles of Research Design and Analysis, Duxbury Press, Belmont, CA.
Lam, H. F., Yuen, K. V. & Beck, J. L. (2006), Structural health monitoring via measured ritz vectors utilizing artificial neural networks, Computer-Aided Civil and Infrastructure Engineering, 21(4), 232–41.
Maes, M. A., Faber, M. H. & Dann, M. R. (2009), Hierarchical modeling of pipeline defect growth subject to ILI uncertainty, in Proceedings of the ASME 28th International Conference on Ocean, Offshore and Arctic Engineering, ASME, Honolulu, HI, USA, May 31–June 5, 2009.
McLachlan, G. J. & Basford, K. E. (1988), Mixture Models: Inference and Applications to Clustering, Statistics: Textbooks and Monographs, Vol. 1, Marcel Dekker, New York.

McLachlan, G. J. & Krishnan, T. (2007), The EM Algorithm and Extensions, John Wiley & Sons, New York.
Melchers, R. (2003), Modeling of marine immersion corrosion for mild and low-alloy steels-part 1: phenomenological model, Corrosion, 59(4), 319–34.
Melchers, R. (2004), Pitting corrosion of mild steel in marine immersion environment-part 1: maximum pit depth, Corrosion, 60(9), 824–36.
Menéndez, H. D., Barrero, D. F. & Camacho, D. (2014), A genetic graph-based approach to the partitional clustering, International Journal of Neural Systems, 24(03), 1430008.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953), Equation of state calculations by fast computing machines, The Journal of Chemical Physics, 21, 1087–92.
Pang-Ning, T., Steinbach, M. & Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Boston.
Peabody, A. W., Bianchetti, R. L. & National Association of Corrosion Engineers (2001), Control of Pipeline Corrosion, National Association of Corrosion Engineers, Houston, TX.
Peng, F. & Ouyang, Y. (2013), Optimal clustering of railroad track maintenance jobs, Computer-Aided Civil and Infrastructure Engineering, 29(4), 235–47.
Race, J. M., Dawson, S. J., Stanley, L. M. & Kariyawasam, S. (2007), Development of a predictive model for pipeline external corrosion rates, Journal of Pipeline Engineering, 6(1), 15–29.
Rizzi, M., D’Aloia, M. & Castagnolo, B. (2013), A supervised method for microcalcification cluster diagnosis, Integrated Computer-Aided Engineering, 20(2), 157–67.
Sahraoui, Y., Khelif, R. & Chateauneuf, A. (2013), Maintenance planning under imperfect inspections of corroded pipelines, International Journal of Pressure Vessels and Piping, 104(2013), 76–82.
Schoefs, F., Clément, A. & Nouy, A. (2009), Assessment of roc curves for inspection of random fields, Structural Safety, 31(5), 409–19.
Shibata, T. (1996), 1996 W.R. Whitney award lecture: statistical and stochastic approaches to localized corrosion, Corrosion, 52(11), 813–30.
Sinha, S. K. & Pandey, M. D. (2002), Probabilistic neural network for reliability assessment of oil and gas pipelines, Computer-Aided Civil and Infrastructure Engineering, 17(5), 320–29.
Szklarska-Smialowska, Z. (1986), Pitting Corrosion of Metals, National Association of Corrosion Engineers, Houston, TX.
Tsai, Y. & Huang, Y. (2010), Automatic detection of deficient video log images using a histogram equity index and an adaptive gaussian mixture model, Computer-Aided Civil and Infrastructure Engineering, 25(7), 479–93.
Velazquez, J. C., Caleyo, F., Valor, A. & Hallen, J. M. (2009), Predictive model for pitting corrosion in buried oil and gas pipelines, Corrosion, 65(5), 332–42.
Westwood, S. & Hopkins, P. (2004), Smart pigs and defect assessment codes: completing the circle, CORROSION 2004, NACE International, New Orleans, LA, USA, March 28–April 1, 2004.
Yuen, K. V. & Mu, H. Q. (2011), Peak ground acceleration estimation by linear and nonlinear models with reduced order Monte Carlo simulation, Computer-Aided Civil and Infrastructure Engineering, 26(1), 30–47.
Zhang, J., Wan, C. & Sato, T. (2013), Advanced Markov Chain Monte Carlo approach for finite element calibration under uncertainty, Computer-Aided Civil and Infrastructure Engineering, 28(2013), 522–30.

APPENDIX A: THE DEFINITION OF GMM COVARIANCE STRUCTURE

We assume that the observed values are clustered into K components. The geometric characteristics (volume, shape, and orientation) of the multivariate Gaussian distribution representing the kth (k = 1, 2, ..., K) component are defined by its covariance matrix $\Sigma_k$ (Fraley and Raftery, 1998), which is factorized by eigenvalue decomposition:

\[ \Sigma_k = \lambda_k D_k A_k D_k^T , \]

where $\lambda_k$ is a scalar, $D_k$ is the orthogonal matrix of eigenvectors, and $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$. The volume and shape are determined by $\lambda_k$ and $A_k$, respectively, while the orientation is determined by $D_k$. Eight different GMMs are defined in Table A1 (Fraley et al., 2012).

Table A1
Model definition

Model ID   $\Sigma_k$                    Geometric characteristic   Volume     Shape      Orientation
1          $\lambda I$                   Spherical                  Equal      Equal      –
2          $\lambda_k I$                 Spherical                  Variable   Equal      –
3          $\lambda D A D^T$             Diagonal                   Equal      Equal      Equal
4          $\lambda_k D A D^T$           Diagonal                   Variable   Equal      Equal
5          $\lambda D A_k D^T$           Diagonal                   Equal      Variable   Equal
6          $\lambda_k D A_k D^T$         Diagonal                   Variable   Variable   Equal
7          $\lambda_k D_k A D_k^T$       Ellipsoidal                Variable   Equal      Variable
8          $\lambda_k D_k A_k D_k^T$     Ellipsoidal                Variable   Variable   Variable

APPENDIX B: THE ALGORITHM OF THE PROPOSED METHODOLOGY

1) Clustering process:

Maximize the likelihood function

\[ \log L(\alpha, \mu, \Sigma \mid x) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} \alpha_j f_j(x_i \mid \mu_j, \Sigma_j) \]

via the EM algorithm. Denote the probability that observation i belongs to cluster j by

\[ z_{ij} = P[M_i = j \mid x_i, \mu_j, \Sigma_j] = \frac{\alpha_j f_j(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} \alpha_l f_l(x_i \mid \mu_l, \Sigma_l)} , \]

where $M_i$ is the cluster label of observation i.

EM algorithm

Step 0: Initialize $\hat{z}_{ij}$, $\alpha_j$, $\mu_j$, $\Sigma_j$.

Step 1: M-step:

\[ n_j^{(t+1)} \leftarrow \sum_{i=1}^{n} \hat{z}_{ij}^{(t)} ; \qquad \hat{\alpha}_j^{(t+1)} \leftarrow n_j^{(t+1)} / n ; \qquad \hat{\mu}_j^{(t+1)} \leftarrow \sum_{i=1}^{n} \hat{z}_{ij}^{(t)} x_i / n_j^{(t+1)} ; \]

$\hat{\Sigma}_j^{(t+1)}$: depends on the covariance structure of choice (Table A1).

Step 2: E-step:

\[ \hat{z}_{ij}^{(t+1)} \leftarrow \frac{\hat{\alpha}_j^{(t+1)} f_j(x_i \mid \hat{\mu}_j^{(t+1)}, \hat{\Sigma}_j^{(t+1)})}{\sum_{k=1}^{K} \hat{\alpha}_k^{(t+1)} f_k(x_i \mid \hat{\mu}_k^{(t+1)}, \hat{\Sigma}_k^{(t+1)})} ; \]

Repeat steps 1–2 until the convergence criteria are satisfied.

2) Within each cluster, using M–H algorithm with Gibbs sampling:

Step 0: initialize $(\mu^{(0)}, \phi^{(0)}, \xi^{(0)})$ and the number of runs N; choose the jump lengths $\sigma_{\omega.\mu}$, $\sigma_{\omega.\phi}$, $\sigma_{\omega.\xi}$;

Step 1: while $i < N$, {

$\omega_\mu^{(i)} \leftarrow N(0, \sigma_{\omega.\mu})$, $\mu^* \leftarrow \mu^{(i)} + \omega_\mu^{(i)}$;
$A^* \leftarrow \pi(\mu^* \mid \phi^{(i)}, \xi^{(i)}) / \pi(\mu^{(i)} \mid \phi^{(i)}, \xi^{(i)})$;
accept $\mu^{(i+1)} \leftarrow \mu^*$ with the probability $\alpha = \min(1, A^*)$, otherwise $\mu^{(i+1)} \leftarrow \mu^{(i)}$;

$\omega_\phi^{(i)} \leftarrow N(0, \sigma_{\omega.\phi})$, $\phi^* \leftarrow \phi^{(i)} + \omega_\phi^{(i)}$;
$A^{**} \leftarrow \pi(\phi^* \mid \mu^{(i+1)}, \xi^{(i)}) / \pi(\phi^{(i)} \mid \mu^{(i+1)}, \xi^{(i)})$;
accept $\phi^{(i+1)} \leftarrow \phi^*$ with the probability $\alpha = \min(1, A^{**})$, otherwise $\phi^{(i+1)} \leftarrow \phi^{(i)}$;

$\omega_\xi^{(i)} \leftarrow N(0, \sigma_{\omega.\xi})$, $\xi^* \leftarrow \xi^{(i)} + \omega_\xi^{(i)}$;
$A^{***} \leftarrow \pi(\xi^* \mid \mu^{(i+1)}, \phi^{(i+1)}) / \pi(\xi^{(i)} \mid \mu^{(i+1)}, \phi^{(i+1)})$;
accept $\xi^{(i+1)} \leftarrow \xi^*$ with the probability $\alpha = \min(1, A^{***})$, otherwise $\xi^{(i+1)} \leftarrow \xi^{(i)}$ }.

If converged, go to step 2; otherwise, $(\mu^{(0)}, \phi^{(0)}, \xi^{(0)}) \leftarrow (\mu^{(N)}, \phi^{(N)}, \xi^{(N)})$, go to step 1.

Step 2: pick a subset (1 from each 10 runs) from the stationary part of the generated realization and estimate the posterior distribution of the three GEV parameters; construct the PDF of metal loss rate $\pi(r)$.

3) At a specific location, using M–H algorithm:

Step 0: initialize $r^{(0)}$ and the number of runs N; choose the jump length $\sigma_{\omega.r}$;

Step 1: while $i < N$, {

$\omega_r^{(i)} \leftarrow N(0, \sigma_{\omega.r})$, $r^* \leftarrow r^{(i)} + \omega_r^{(i)}$;
$A^* \leftarrow [\pi(r^*) L(r^* \mid x)] / [\pi(r^{(i)}) L(r^{(i)} \mid x)]$;
accept $r^{(i+1)} \leftarrow r^*$ with the probability $\alpha = \min(1, A^*)$, otherwise $r^{(i+1)} \leftarrow r^{(i)}$ }.

If converged, go to step 2; otherwise, $r^{(0)} \leftarrow r^{(N)}$, go to step 1.

Step 2: pick a subset (1 from each 10 runs) from the stationary part of the generated realization and estimate the posterior distribution $\pi^*(r)$.
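The EM iteration in part 1 of this appendix can be illustrated with a one-dimensional mixture in which each component has its own variance. This is a simplified sketch rather than the multivariate models of Table A1: the function names are hypothetical, the loop starts from initial parameters with an E-step instead of from initial memberships, and the BIC helper assumes the convention of Fraley and Raftery, in which larger values are better.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, weights, means, variances, n_iter=200, tol=1e-10):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log-likelihood."""
    n, k = len(data), len(weights)
    weights, means, variances = list(weights), list(means), list(variances)
    previous, loglik = -math.inf, -math.inf
    for _ in range(n_iter):
        # E-step: membership probabilities z_ij of each observation.
        resp, loglik = [], 0.0
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update n_j, alpha_j, mu_j, and the component variance.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-12)
        if abs(loglik - previous) < tol:
            break
        previous = loglik
    return weights, means, variances, loglik


def bic(loglik, n_params, n_obs):
    """BIC as 2 log L minus a complexity penalty (larger is better)."""
    return 2.0 * loglik - n_params * math.log(n_obs)
```

For a one-dimensional mixture with K components and free variances, the parameter count passed to `bic` would be 3K − 1 (K means, K variances, and K − 1 free weights).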
