Sensors — Article

Outlier-Detection Methodology for Structural Identification Using Sparse Static Measurements

Marco Proverbio 1,2,*, Numa J. Bertola 1,2 and Ian F. C. Smith 2

1 ETH Zurich, Future Cities Laboratory, Singapore-ETH Centre, 1 CREATE Way, CREATE Tower, Singapore 138602, Singapore; [email protected]
2 Applied Computing and Mechanics Laboratory (IMAC), School of Architecture, Civil and Environmental Engineering (ENAC), Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland; [email protected]
* Correspondence: [email protected]; Tel.: +65-9109-2080

Received: 3 May 2018; Accepted: 22 May 2018; Published: 24 May 2018
Abstract: The aim of structural identification is to provide accurate knowledge of the behaviour of existing structures. In most situations, finite-element models are updated using behaviour measurements and field observations. Error-domain model falsification (EDMF) is a multi-model approach that compares finite-element model predictions with sensor measurements while taking into account epistemic and stochastic uncertainties—including the systematic bias that is inherent in the assumptions behind structural models. Compared with alternative model-updating strategies such as residual minimization and traditional Bayesian methodologies, EDMF is easy to use for practising engineers and does not require precise knowledge of values for uncertainty correlations. However, wrong parameter identification and flawed extrapolation may result when undetected outliers occur in the dataset. Moreover, when datasets consist of a limited number of static measurements rather than continuous monitoring data, the existing signal-processing and statistics-based algorithms provide little support for outlier detection. This paper introduces a new model-population methodology for outlier detection that is based on the expected performance of the as-designed sensor network. Thus, suspicious measurements are identified even when few measurements, collected with a range of sensors, are available. The structural identification of a full-scale bridge in Exeter (UK) is used to demonstrate the applicability of the proposed methodology and to compare its performance with existing algorithms. The results show that outliers, capable of compromising EDMF accuracy, are detected. Moreover, a metric that separates the impact of powerful sensors from the effects of measurement outliers has been included in the framework. Finally, the impact of outlier occurrence on parameter identification and model extrapolation (for example, reserve capacity assessment) is evaluated.
Keywords: structural identification; model falsification; outlier detection; static measurements; bridge load tests; reserve capacity
1. Introduction

Sensing in the built environment has shown the potential to improve asset management by revealing intrinsic resources that can be exploited to extend the service life of infrastructure [1]. However, sensors on infrastructure often provide indirect information since effects, rather than causes, are measured. Physics-based models are necessary to convert this information into useful knowledge of as-built structure behaviour. Nonetheless, civil-engineering models involve uncertainties and systematic biases due to their conservative, rather than precise, objectives. Therefore, great care is
Sensors 2018, 18, 1702; doi:10.3390/s18061702
required when measurements are used to improve the accuracy of model predictions, especially when the same models have been used for design.

In the field of structural identification, measurements are employed to update parameter values affecting structure behaviour. Residual minimization, also known as model calibration, consists of adjusting model parameters to minimize the difference between predicted and measured values. This approach is the most common data-interpretation technique and has been a research topic for several decades [2–4]. However, several authors [5–7] have noted that while calibrated parameter values may be useful for interpolation, they are usually inappropriate for extrapolation. Unfortunately, extrapolation is needed for asset-management tasks such as the widening of bridge decks, retrofitting, and the comparison of competing future-proofing scenarios.

Bayesian model updating is the most common population-based structural-identification approach. It updates the initial knowledge, expressed as probabilistic distributions of parameters, by including the probabilistic distribution of measurement observations. Accurate identification of parameter values can be obtained using non-traditional implementations of Bayesian model updating [8]. However, methodologies that require deep knowledge of conditional probability and precise knowledge of statistical distributions, as well as their correlations, are neither familiar nor easy to use for practising engineers, who ultimately assume professional responsibility for decisions. Therefore, black-box approaches are not appropriate in such contexts. Other population-based methodologies, including falsification approaches, which sacrifice precision for accuracy, have provided accurate results when dealing with ill-posed problems and systematic modelling uncertainties [9].
Error-domain model falsification (EDMF) [10] is an engineering-oriented methodology that helps identify candidate models—models that are compatible with behaviour measurements—among an initial model population. EDMF provides parameter identification without requiring precise knowledge of uncertainty correlations. Engineers simply define model uncertainties by providing upper and lower bounds for parameter values and model accuracy. First, the initial model population is generated; then, falsification methods are employed to detect and reject wrong models whose predictions are not compatible with measurements. Models that represent the real behaviour with a defined confidence are accepted and stored in the candidate model set (CMS). The EDMF performance in identifying parameter values depends on factors such as the sampling approach adopted to generate the initial model population, the selection of relevant parameters, and the sensor configuration. An adaptive EDMF-compatible sampling approach that outperforms traditional sampling techniques has recently been proposed in Reference [11]. Algorithms have also been proposed to maximise the identification performance of sensor configurations by reducing the number of candidate models [12–14].

A basic hypothesis of all structural-identification methodologies is that measurement datasets do not include wrong data. Anomalous values in measurements, often called outliers, can occur due to faulty sensors or unexpected events during the monitoring process [15]. From a statistical point of view, there are several methods to identify outliers. If outliers are taken to be observations that lie very far from other observations, data-mining techniques can be employed for their detection [16,17]. However, outliers may also appear because of deficiencies in the model classes [18].
In the field of damage detection, the presence of even a small number of wrong measurements or missing data reduces the performance of most algorithms [19,20]. Moreover, environmental variability and operational influences can affect model features in different ways, thus leading to incorrect damage detection [21]. A comparison of methods to identify outliers and replace wrong measurement values in signal processing was proposed in Reference [22]. Solutions applicable to measurement datasets affected by missing data can be found in References [23,24]. Additionally, a methodology that unifies data normalization and damage detection through the identification of measurement outliers has been proposed in Reference [25]. However, methodologies that are designed for detecting anomalies in continuous measurement contexts, such as those described in References [26,27], have not been found
to be suitable for examining datasets that consist of non-time-dependent measurements (for example, measurements of changes in stress, rotation, and displacement under static load testing). Such static measurements have been the most commonly used measurement strategy for large civil infrastructure since they are informative, easily comparable with code requirements, and the least costly. Effective support for analysing and validating sparse static measurements for outliers is currently unavailable.

In structural identification, the presence of outliers reduces the performance of current methods in terms of identification accuracy and prediction reliability. In Bayesian model updating, two main classes of methods for outlier detection have been proposed, based respectively on probabilistic measures such as the posterior probability density function of errors [28] and on L1 or Chi-square divergences [29]. Heavy-tailed likelihood functions, such as Student's t distribution or a combination of Normal and Student's t distributions [30], have been employed for robust parametric estimation. Another class of methods treats outliers by assuming an outlier-generation model, although, in practical applications, the information required to build such a model has often been unavailable [31].

Pasquier and Smith [32] proposed an outlier-detection framework for EDMF based on a sensitivity analysis of the CMS with respect to sensor removal from the initial set. Model falsification was carried out iteratively while measurements provided by sensors were removed one at a time for each load case, and the corresponding variations in CMS populations were noted. If anomalously high values of variation were obtained, the measurement data were removed from the dataset. This framework represents only a semi-quantitative method for performing the outlier-detection task since no rational definition of limits for CMS variations was proposed.
As a result, sensors with the capability to falsify several model instances risked detection as outliers.

This paper presents a new outlier-detection framework that is compatible with population approaches such as EDMF. The proposed strategy is based on a metric, often employed to optimize sensor placement, that evaluates the expected performance of sensor configurations. Additionally, a metric that separates the impact of powerful sensors from the effects of measurement outliers has been included in the framework. The new approach is, therefore, suitable for analysing datasets that consist of sparse non-time-dependent measurements and overcomes limitations that characterise existing outlier-detection methodologies.

The remainder of the paper is organised as follows. Section 2 contains background information on EDMF and the proposed framework for outlier detection. In Section 3, the results of a full-scale case study are presented. Finally, the advantages and limitations of the proposed method are discussed.

2. Materials and Methods

2.1. Background—EDMF

Error-domain model falsification (EDMF) [10] is a recently developed methodology for structural identification in which finite-element (FE) model predictions are compared with measurement data in order to identify plausible model instances of a parameterized model class. A model instance is generated by assigning a unique combination of parameter values to a model class g(·), which consists of an FE parametric model including characteristics such as material properties, geometry, boundary conditions, and actions. Let Ri be the real response of a structure—unknown in practice—at a sensor location i, and yi be the measured value at the same location. The model predictions at location i, gi(θ), are generated by assigning a vector of parameter values θ to the selected FE model class.
The model uncertainty Ui,g and the measurement uncertainty Ui,y are estimated and linked to the real behaviour using the following equation:

gi(θ) + Ui,g = Ri = yi + Ui,y  ∀i ∈ {1, . . . , ny},  (1)

where ny is the number of measurement locations. The terms in Equation (1) can be rearranged and the two sources of uncertainty (Ui,g and Ui,y) can be merged into a unique term Ui,c, thus leading to the following relationship:
gi(θ) − yi = Ui,c.  (2)
In Equation (2), the difference between a model prediction and a measured value at location i is referred to as the residual ri = gi(θ) − yi. Measurement errors Uy include sensor accuracy—based on the manufacturing specifications and site conditions—and measurement repeatability, which is usually estimated by conducting multiple series of tests on site. The model-class uncertainty source Ug, which is often dominant over Uy, is estimated using engineering judgment, technical literature, and local knowledge. Since a limited number of parameters can be sampled to generate the model class, an additional error—estimated using stochastic simulations—is often included in Ug. Plausible behaviour models are selected by falsifying those for which residuals exceed the threshold bounds that are defined in the uncertainty domain (that is, the error domain). Being a falsification approach, EDMF initially requires that a set of model instances is generated by assigning parameter values to the model class. Then, the threshold bounds are defined at each sensor location as the shortest interval [ui,low, ui,high] that contains a probability equal to Φd^(1/ny), using the following equation:

Φd^(1/ny) = ∫_{ui,low}^{ui,high} fUc,i(uc,i) duc,i  ∀i ∈ {1, . . . , ny},  (3)
where fUc(uc) is the combined probability density function at each sensor location, while the confidence level Φd is adjusted using the Šidák correction to take into account the simultaneous use of multiple measurements to falsify model instances. Models for which residuals are within the threshold bounds (ulow, uhigh) at each sensor location are included in the candidate model set (CMS). Models for which residuals exceed these bounds at one or more sensor locations are falsified and, therefore, rejected. Once a candidate model set is identified, prediction tasks involve using the CMS to assess the reserve capacity of the structure. Predictions Qj at locations j are given by

Qj = gj(θ″) + Uj,g,  (4)
where θ″ is a set of combinations of parameter values representing the CMS and Ug is the model uncertainty. When all initial model instances generated are falsified, the entire model class is falsified. This means that no model is compatible with the observations given the current estimation of model and measurement uncertainties. Thus, it is usually a sign of incorrect assumptions in the model-class definition and uncertainty assumptions. Complete falsification helps avoid the wrong identification of parameter values and detects wrong initial assumptions, highlighting one of the main advantages of EDMF compared with other methodologies [5]. However, wrong falsification of the entire CMS can occur because of the presence of outliers in the measurement dataset.

The precision and accuracy of EDMF are highly sensitive to the sensor configuration, which is designed according to the behaviour measurements to be collected. The approach described in Reference [13], and extended in Reference [14], used simulated measurements to provide probabilistic estimations of the expected number of candidate models obtained with a sensor configuration. The aim was to find the sensor configuration that minimizes the expected number of candidate models. The simulated measurements are generated from the model instances by adding a random value taken from the combined uncertainties. Sensor locations were evaluated using respectively the 95% and 50% quantiles of the expected candidate-model-set size. However, the procedure is computationally costly [33] because it requires the execution of the falsification procedure for a large number of simulated measurements and sensor locations. This issue has been acknowledged in References [34,35], where the expected identification performance is used as a metric to evaluate the information gain of a sensor configuration rather than as an objective function to be optimised.
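The threshold definition of Equation (3) and the falsification step can be illustrated with a short numerical sketch. All values below are assumptions for illustration (a Gaussian combined uncertainty, synthetic model predictions, and a 95% target confidence); this is not the authors' implementation, only a minimal reading of the procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: all numbers below are assumptions, not values from the paper
n_y = 4            # number of sensor locations
n_models = 5000    # size of the initial model population
phi_d = 0.95       # target confidence level

# Sidak correction for the simultaneous use of n_y measurements
phi_target = phi_d ** (1.0 / n_y)

# Monte Carlo samples of the combined uncertainty U_c at each sensor location
# (Gaussian here for simplicity; in practice the distribution is estimated from
# engineering judgment, technical literature, and on-site repeatability tests)
u_c = rng.normal(0.0, 1.0, size=(100_000, n_y))

def shortest_interval(samples, prob):
    """Shortest interval [u_low, u_high] containing the given probability mass."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.ceil(prob * n))
    widths = s[k - 1:] - s[: n - k + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

bounds = np.array([shortest_interval(u_c[:, i], phi_target) for i in range(n_y)])

# Hypothetical model predictions g_i(theta) and measured values y_i
predictions = rng.normal(0.0, 2.0, size=(n_models, n_y))
measurements = np.zeros(n_y)

# A model instance is falsified if any residual r_i = g_i(theta) - y_i
# falls outside the threshold bounds at any sensor location
residuals = predictions - measurements
accepted = np.all((residuals >= bounds[:, 0]) & (residuals <= bounds[:, 1]), axis=1)
print(f"candidate models: {int(accepted.sum())} of {n_models}")
```

The shortest-interval search corresponds to the definition of [ui,low, ui,high] in Equation (3), and the exponent 1/ny implements the Šidák-corrected confidence so that the joint probability over all ny sensors equals Φd.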
2.2. Methodology

This paper proposes a new framework to improve the robustness of EDMF against the presence of anomalous values in measurement datasets and to detect flaws in the definition of the FE model classes. Figure 1 shows the general EDMF framework for structural identification, in which specific contributions introduced in this paper are highlighted by the shaded boxes.

The model–class validation is carried out by comparing predictions of the initial model population with measurements of real behaviour. This check is performed before updating the parameter values since flawed model classes usually lead to wrong parameter identification. When an accurate model class is used, the measurement data can be compared with model predictions and EDMF identifies the ranges of the parameter values that explain the real behaviour. However, the presence of outliers in the measurement datasets may lead to incorrect results.

Outlier detection is particularly challenging when the measurement data consist of unique values collected under static conditions, rather than signals obtained from continuous monitoring. The proposed methodology takes advantage of simulated measurements to compute the expected performance of a sensor network. Anomalous situations can be detected by comparing the expected and actual performance of (i) each sensor individually, and (ii) the entire sensor configuration. Sensors that are deemed to be suspicious are removed. Finally, the CMS is computed using only reliable measurements. The proposed model–class validation and outlier-detection methodology are described in detail in the next sections.

Interpolation tasks (that is, predicting at unmeasured locations) and, mostly, extrapolation tasks (that is, assessment of reserve capacity) represent the ultimate aims of structural identification. The extrapolation tasks are intrinsically more demanding since the fictitious parameter values do not compensate for model–class errors [9]. The outlier occurrence and inaccurate model classes can lead to wrong reserve-capacity assessments, thus reiterating the importance of ensuring the robustness of identification methodologies.
Figure 1. The general framework for structural identification using error-domain model falsification (EDMF). The contributions of this paper are highlighted by the shaded boxes.
2.2.1. Model–Class Validation

Model–class accuracy is checked by comparing prediction ranges that are computed using the initial model population with measured values at each sensor location. A qualitative comparison between the two model classes, namely MC1 and MC2, is shown in Figure 2.
Figure 2. The model–class validation methodology. The model classes for which the prediction intervals of the initial population do not include measured values at several locations may reveal flaws in the model-class definition, rather than outliers in the measurement dataset.
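The check depicted in Figure 2 can be sketched numerically. The model populations and measured values below are hypothetical and chosen so that one class (MC1) fails to cover the measured values at sensors S1, S2, and S4 while the other (MC2) covers all of them; only the coverage test itself reflects the methodology.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical initial model populations for two model classes MC1 and MC2:
# rows are model instances, columns are predictions at sensors S1..S4
mc1 = rng.normal(loc=[5.0, 3.0, 1.0, 2.0], scale=0.15, size=(2000, 4))
mc2 = rng.normal(loc=[4.0, 2.0, 1.0, 1.5], scale=0.6, size=(2000, 4))

measured = np.array([4.1, 2.2, 1.0, 1.1])  # illustrative measured values y_i

def validate_model_class(population, y):
    """Per sensor, check whether the measured value lies inside the
    prediction interval spanned by the initial model population."""
    lo, hi = population.min(axis=0), population.max(axis=0)
    return (y >= lo) & (y <= hi)

for name, pop in [("MC1", mc1), ("MC2", mc2)]:
    inside = validate_model_class(pop, measured)
    print(name, "covers the measured value at sensors:", inside)
```

A model class whose intervals miss the measured values at several sensors would prompt a revision of the model-class assumptions, as discussed below.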
Each vertical axis represents a sensor and the prediction ranges are depicted using interval bounds. Measured values yi are included in the prediction intervals obtained using model class MC2 for all locations i, while predictions of MC1 do not include the measured values for sensors S1, S2, and S4. As a result, MC1 is unlikely to provide accurate explanations of the measured behaviour. In this situation, engineers should revise the model-class assumptions, for example, through collecting further information during the inspection of the site. This iterative approach to structural identification is described in Reference [29].

However, the situation depicted in Figure 2 may have alternative explanations. For example, the measured value of sensor S4 is close to the lower bound of the prediction ranges for both model classes. This suggests verifying that the initial ranges of behaviour parameters are sufficiently wide and that an appropriate sample density has been achieved. Alternatively, the measurements can be far from the prediction ranges due to the presence of many outliers in the dataset. The situation presumed in this paper involves a limited number of sensors since outliers typically amount to less than 20% of the entire dataset [21].

2.2.2. Outlier Detection

Unlike continuous monitoring, in which a large amount of data is collected over time, datasets obtained during static tests often consist of a few measurements that are related to specific static configurations. Even when the same test is performed multiple times—usually to assess the measurement repeatability under site conditions—the amount of values collected from each measurement is insufficient to carry out statistical analyses. Therefore, anomaly detection cannot be performed by uniquely analysing the dataset.

Figure 3 outlines the framework that is proposed in this paper for outlier detection. In order to detect suspicious measurements, first, a vector of simulated measurements yi^s (that is, 100,000
measurements) is generated for each sensor location i by adding a random value of the combined uncertainty Ui,c to each model prediction gi(θ) in Equation (2), according to Equation (5):

yi^s = gi(θ) − rand(Ui,c)  ∀i ∈ {1, . . . , ny}.  (5)

Then, EDMF is carried out for each set of simulated measurements, and the corresponding number of candidate models in the CMS (that is, the candidate-model-set population #CMs) is recorded. This number represents the expected size of the CMS population if a specific set of yi^s were used. Assuming that an accurate model class g(·) is used and no outlier affects the dataset, the distribution of the expected #CMs includes the value that is obtained when the real measurement yk is used.
Figure 3. The outlier-detection steps.
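The generation of simulated measurements in Equation (5) and the evaluation of the resulting #CMs distribution can be sketched under simplifying assumptions: a single sensor k, a one-dimensional toy model class, a Gaussian combined uncertainty, and a fixed residual threshold. The numbers are illustrative, and the falsification at one sensor is reduced to a simple residual test.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: one sensor k, a one-dimensional model class, illustrative numbers
n_models = 2000
n_sim = 1000          # simulated-measurement sets (the paper uses 100,000)
predictions = rng.uniform(0.0, 10.0, size=n_models)  # g(theta) at sensor k
threshold = (-1.5, 1.5)                              # residual bounds at sensor k

# Equation (5): y_s = g(theta) - rand(U_c), with U_c drawn here from a
# standard normal for simplicity
theta_idx = rng.integers(0, n_models, size=n_sim)
u_c = rng.normal(0.0, 1.0, size=n_sim)
y_sim = predictions[theta_idx] - u_c

def cms_size(y, preds, bounds):
    """Candidate-model-set population #CMs for a single measurement y:
    the number of models whose residual g(theta) - y stays within bounds."""
    r = preds - y
    return int(np.sum((r >= bounds[0]) & (r <= bounds[1])))

# Distribution of the expected #CMs over the simulated measurement sets
expected_cms = np.array([cms_size(y, predictions, threshold) for y in y_sim])

# Step 1: cumulative probability of the #CMs obtained with the real measurement
y_k = 5.0  # hypothetical real measurement at sensor k
observed = cms_size(y_k, predictions, threshold)
cum_prob = float(np.mean(expected_cms <= observed))
print(f"observed #CMs = {observed}, cumulative probability = {cum_prob:.3f}")
```

An unusually low cumulative probability would flag the measured value as suspicious, which is the criterion developed in the step-by-step procedure below.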
Given a sensor k, three steps should be performed to ensure that the measured value yk is plausible. In step 1, the cumulative density function (CDF) of #CMs, obtained using only the simulated measurements for this sensor (yi=k^s), is plotted. The CDF is used to compute the cumulative probability to observe the CMS population given by yk. A low probability value (for example,