Universität Stuttgart - Institut für Wasserbau
Lehrstuhl für Hydromechanik und Hydrosystemmodellierung
Prof. Dr.-Ing. Rainer Helmig
Diplomarbeit
Parameter Estimation by Ensemble Kalman Filters with Transformed Data
Submitted by
Anneli Schöniger
Matrikelnummer 2221942
Stuttgart, March 31st, 2010
Examiners:
Jun.-Prof. Dr.-Ing. W. Nowak
Prof. Dr. rer.nat. Dr.-Ing. A. Bárdossy
External Advisor:
Prof. Dr. Harrie-Jan Hendricks Franssen
I hereby certify that I have prepared this thesis independently, and that only those sources, aids and advisors that are duly noted herein have been used and/or consulted.

Stuttgart, March 31st, 2010
(Anneli Schöniger)
Universität Stuttgart - Institut für Wasserbau
Lehrstuhl für Hydromechanik und Hydrosystemmodellierung
Jungwissenschaftlergruppe Stochastic Modelling of Hydrosystems
Prof. (jun.) Dr.-Ing. Wolfgang Nowak, M.Sc.
Thesis Topic: "Parameter Estimation by Ensemble Kalman Filters with Transformed Data"

Spatial variability in conjunction with scarce data introduces parameter and prediction uncertainty in simulations of flow and transport in heterogeneous porous media. A very recent development is the use of Ensemble Kalman Filters (EnKFs) to condition random heterogeneous parameter fields on measurement data. This way, one obtains an ensemble of conditional parameter fields along with their respective model predictions, allowing for a relatively rigorous uncertainty quantification at very low computational costs. The largest remaining drawback of EnKFs is that they are optimal (i.e., accurate in the sense of Bayesian updating) only for multi-Gaussian dependence between data and parameters. This thesis will investigate non-linear data transformations to move data closer to Gaussianity. For example, water saturation is physically bounded between zero and unity, such that a beta-to-normal transformation can improve the situation, leading to a higher degree of EnKF accuracy. Similar techniques are promising for hydraulic heads between two Dirichlet boundaries or concentrations between zero and the solubility limit. Other data types may be non-negative and treatable with log transforms. The expected gain of such transformations is a more accurate processing of information, and hence a more accurate quantification of parameter and prediction uncertainty. The expected improvement is substantial, but not complete, because univariate normality is not sufficient to ensure multivariate normality.

Individual work steps:
• Developing an automatic tool to determine adequate Gaussian anamorphosis transforms (parametric or empirical) for arbitrary data types
• Implementation of EnKF and numerical test cases (MATLAB-based FEM code)
• Comparing EnKF performance with and without transformed data
• Testing the robustness of the transform with respect to:
  ◦ Sampling error (too small ensemble sizes)
  ◦ Conceptual error (e.g., assessed for inaccurate boundary conditions)
  ◦ Extreme data values (at the uncertain margins of the transform)
• Investigating the remaining degree of non-multi-normality in the multivariate dependence after transformation, e.g., by copula analysis

Collaborations and connections:
• This thesis is a cooperation with Prof. Harrie-Jan Hendricks Franssen (FZ Jülich), mirrored by similar work on soil moisture with remote sensing data in his group.
• For the copula analysis, cooperation with Prof. Bárdossy is intended.
• The EnKF is black-box compatible with arbitrary simulation codes and data types. At the same time, the proposed improvement will move EnKFs closer to parameter estimation in highly non-linear multiphase flow and transport problems. This offers an excellent opportunity for exchange or collaboration with, e.g., the IRTG NUPUS.

Contact:
Jun.-Prof. Dr.-Ing. Wolfgang Nowak, M.Sc.
Institut für Wasserbau/SimTech
Pfaffenwaldring 7a, 70569 Stuttgart
Email: [email protected]
Phone: 0711/685-60113
Parameter Estimation by Ensemble Kalman Filters with Transformed Data

Uncertain hydrogeological parameters compromise the reliability of predictions of contaminant spreading in the subsurface. In this work, an inverse stochastic modeling framework is used for parameter estimation. It makes it possible to include available measurement data, to quantify the uncertainty of model prognoses, and to determine exceedance probabilities as a basis for decision-making. Assimilation of available data by Ensemble Kalman Filters (EnKFs) has been successfully applied to state variable estimation in atmospheric and oceanic sciences [Evensen, 2007]. Recent research has prepared the use of EnKFs for parameter estimation in groundwater applications [Nowak, 2009]. The largest remaining drawback of EnKFs is their optimality only for multivariate Gaussian distributed variables. This is a major limitation to their application in subsurface parameter estimation, since flow and transport variables generally do not show Gaussian dependence on hydraulic conductivity. This study investigates the use of non-linear, monotonic transformations that render arbitrary marginal distributions of state variables Gaussian. This transformation step is included in the EnKF without interfering with its traditional analysis scheme. Transformation approaches have recently been presented by Béal et al. [2010] and Simon and Bertino [2009] in the context of state estimation; this study transfers the methodology to parameter estimation. Moreover, critical issues like non-stationarity of state variables, implementation of physical bounds on state variable values, and clustering of distributions at these bounds are addressed. Results show that (1) an implicit pseudo-linearization is achieved by Gaussian anamorphosis, and (2) the linearized dependence of the transformed state variables on the parameters increases the efficiency of the updating step. This yields a more accurate prognosis of flow and transport in heterogeneous aquifers. The proposed approach, combining numerically efficient EnKFs for parameter estimation with Gaussian anamorphosis of data, is an attractive alternative for handling strongly non-linear model behavior, given that existing linearization-free methods are computationally demanding.
Parameterschätzung mit Ensemble Kalman Filtern angewandt auf transformierte Daten

Uncertain hydrogeological parameters impair the predictive quality of simulations of contaminant spreading in the subsurface. In this work, parameters are estimated by means of inverse stochastic modeling. This methodological framework makes it possible to include available measurement data, to quantify the uncertainty of model prognoses, and to determine exceedance probabilities that serve as a basis for decision-making. The assimilation of available data with the Ensemble Kalman Filter (EnKF) has already been applied successfully to state estimation in oceanic and atmospheric sciences [Evensen, 2007]. Recently published research has paved the way for the use of the EnKF for parameter estimation in groundwater models [Nowak, 2009]. The greatest weakness of the EnKF in this context is that optimal results can only be expected for multivariate Gaussian distributed variables. This constitutes a decisive restriction of its applicability to the estimation of subsurface parameters, since flow and transport variables generally do not exhibit Gaussian dependence on hydraulic conductivity. This diploma thesis investigates the application of non-linear, monotonic transformations that convert arbitrary marginal distributions of state variables into the normal distribution. This transformation step is embedded in the filter without modifying its fundamental procedure. Transformation approaches were recently presented by Béal et al. [2010] and Simon and Bertino [2009] in the context of state estimation; the present work transfers the methodology to parameter estimation. Furthermore, critical issues such as the non-stationarity of state variables, the implementation of physical value bounds, and the clustering of distribution functions at these bounds are investigated. The results show that (1) an implicit pseudo-linearization is achieved by Gaussian anamorphosis, and (2) the linearized dependence of the transformed state variables increases the efficiency of the updating step. This leads to a more accurate prediction of flow and transport in heterogeneous aquifers. The proposed method, which combines the numerically efficient EnKF for parameter estimation with Gaussian anamorphosis of the data, is an attractive alternative for dealing with strongly non-linear model behavior, since existing simulation techniques without linearization entail great computational effort.
Acknowledgements

I hereby thank the German Research Foundation (DFG) for the funding within the International Research Training Group "Non-Linearities and Upscaling in Porous Media" (NUPUS).
Contents

1 Motivation
2 Approach
3 Flow and Transport in the Subsurface
   3.1 Conceptual Model
   3.2 Mathematical Model
      3.2.1 Assumptions
      3.2.2 Balance Equations
4 Geostatistics
   4.1 Probability Functions
      4.1.1 Univariate Probability Functions
      4.1.2 Multivariate Probability Functions
   4.2 Statistical Moments
   4.3 Statistics of Random Fields
   4.4 Spatial Dependence
   4.5 Spatial Interpolation and Simulation
5 Data Assimilation
   5.1 History of Kalman Filters
   5.2 Ensemble Kalman Filter
      5.2.1 Analysis Scheme
      5.2.2 Ensemble Kalman Filter for Parameter Estimation
   5.3 Particle Filter
6 Gaussian Anamorphosis in the Ensemble Kalman Filter
   6.1 Classification of Transformation Techniques
      6.1.1 Direct Transformation
      6.1.2 Indirect Transformation Techniques
   6.2 Anamorphosis Function Fitting
      6.2.1 Interpolation Techniques
      6.2.2 Regression Techniques
   6.3 Definition of Tails
      6.3.1 Handling of Clustered Data
      6.3.2 Extrapolation towards Population Bounds
      6.3.3 Fitting the Tails
   6.4 Exactness of Transformations
      6.4.1 Performance Test Procedure
      6.4.2 Performance of Interpolation Techniques
      6.4.3 Performance of Expansion in Hermite Polynomials
      6.4.4 Performance of Other Regression Techniques
      6.4.5 Methods of Choice Based on Performance Tests
   6.5 Implementation in Ensemble Kalman Filter
      6.5.1 Accounting for Properties of State Variables
      6.5.2 Comparability of Observations and Simulated Measurements
      6.5.3 Transformation of Measurement Error
      6.5.4 Parameter Updating Step
      6.5.5 Model Bias
7 Application to Synthetic Test Case
   7.1 Numerical Implementation
   7.2 Description of Test Case
   7.3 Test Procedure
8 Results and Discussion
   8.1 Filtering Procedure with Transformed Data
   8.2 Effects of Transformation
      8.2.1 Pseudo-Linearized Dependence of States on Parameters
      8.2.2 Bivariate Dependence Structures of State Variables
      8.2.3 Qualitative Differences in the Updating Step
   8.3 Transformation of Different Data Types
      8.3.1 Drawdown
      8.3.2 Hydraulic Head
      8.3.3 Solute Concentration
      8.3.4 Suitability of State Variable Types for Gaussian Anamorphosis
   8.4 Comparison with Particle Filter as Reference Solution
9 Summary, Conclusion and Outlook
Bibliography
Nomenclature

α_l / α_t   Longitudinal / transverse dispersivity
C   Copula
c   Solute concentration
D   Diffusion tensor / dispersion tensor
ε   Vector of measurement errors
F   Cumulative distribution function (CDF)
F(x)   Cumulative distribution function of the original variable
G(z)   Standard Gaussian cumulative distribution function
h   Hydraulic head
K_f   Hydraulic conductivity
λ   Correlation length
µ   Mean
N   Sample size / ensemble size
n_meas   Number of observation points
p   Fluid pressure
f   Probability density function (PDF)
φ   Porosity
ψ   Anamorphosis function
q   Specific discharge
Q_ϑϕ   Covariance / covariance matrix
R   Measurement error covariance matrix
ρ   Fluid density
r   Rank correlation
s   Vector of parameters
σ²   Variance
Θ, Φ   Random variables
ϑ, ϕ   Random variable values
t   Time
v   Seepage velocity
x   Vector of coordinates
x   Original variable
y_o   Vector of observations
y_u   Vector of simulated state variables
z   Transformed variable
List of Figures

1.1 Histograms of relevant variables for groundwater models: Bars represent relative frequency, solid line shows normal distribution that corresponds to mean and variance calculated from the sample. Data sets taken from two different measurement locations
6.1 Building empirical CDF. Intervals of non-exceedance probability assigned to each sample data point designated by double arrows for an exemplary sample size N = 10
6.2 Graphical Gaussian Transformation: For any value x, the corresponding Gaussian value z can be found by F(x) = G(z)
6.3 Empirical anamorphosis function
6.4 Histograms of untransformed and transformed variable and normal probability plot for transformed variable
6.5 Dealing with clustered data at the lower bound of the fictitious sample. Ellipsis highlights the discontinuity at the clustered data point
6.6 Defining minimum and maximum values for the Gaussian transform
6.7 Beta-distributions that shall represent different data types. Parameters a, b are given in parentheses
6.8 Perfect anamorphosis functions, depending on parameters of beta distribution function that the sample is taken from
6.9 Deviations from perfect transformation, depending on sample size. Note that the lowest and highest value of the original variable depends on the randomly drawn sample
6.10 Performance of linear interpolation techniques, depending on sample size
6.11 Performance of interpolation and regression techniques, depending on sample size
6.12 Oscillations of Hermite polynomials
6.13 Properties of expansion in Hermite polynomials depending on sample size
6.14 Performance of CDF smoothing techniques, depending on sample size
6.15 Regression techniques to smooth empirical CDF
6.16 Anamorphosis function. Illustration of clustering, ensemble and physical bounds and extension towards ± infinity (for any x < xmin: z = zmin; for any x > xmax: z = zmax)
8.1 Synthetic truth: Log-conductivity field and drawdown field
8.2 A priori ensemble statistics of log-conductivity and drawdown
8.3 Gaussian anamorphosis of drawdown data. Upper row shows transformation of the ensemble at the measurement location closest to the well, lower row summarizes the transformation at the other measurement locations
8.4 Drawdown ensembles before (top) and after (bottom) updating at the measurement location closest to the pumping well. The observed value is marked by the thick, red line
8.5 A posteriori ensemble statistics of log-conductivity and drawdown
8.6 Synthetic fields with marked measurement locations: Numbers indicate the pairs of strongly correlated state variable and log-conductivity
8.7 Dependence of drawdown on log-conductivity (Locations 1 and 2)
8.8 Dependence of head on log-conductivity (Locations 3 and 4)
8.9 Dependence of concentration on log-conductivity (Locations 5 and 6)
8.10 Empirical copula density for drawdown at locations 1 and 2 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.11 Empirical copula density for heads at locations 3 and 4 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.12 Empirical copula density for concentration at locations 5 and 6 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.13 Influence function of measurement 1 (drawdown) on the parameter field
8.14 Influence function of measurement 3 (head) on the parameter field
8.15 Influence function of measurement 5 (concentration) on the parameter field
8.16 Ratio of diagonal of measurement covariance matrix and measurement error variance
8.17 Synthetic log-conductivity and drawdown field and best estimates resulting from different transformation methods in the EnKF
8.18 A priori ensemble variance of log-conductivity and drawdown field and conditional variances resulting from different transformation methods in the EnKF
8.19 Statistics of drawdown residuals from different assimilation methods
8.20 Synthetic log-conductivity and head field and best estimates resulting from different transformation methods in the EnKF
8.21 A priori ensemble variance of log-conductivity and head field and conditional variances resulting from different transformation methods in the EnKF
8.22 Statistics of head residuals from different assimilation methods
8.23 Prior and conditioned ensemble in Gaussian space with data clustering at the lower bound
8.24 Synthetic log-conductivity and concentration field and best estimates resulting from different transformation methods in the EnKF
8.25 A priori ensemble variance of log-conductivity and concentration field and conditional variances resulting from different transformation methods in the EnKF
8.26 Statistics of concentration residuals from different assimilation methods
8.27 Synthetic log-conductivity and head field (upper row) and best estimates resulting from different transformation methods in the EnKF and the particle filter
8.28 A priori ensemble variance of log-conductivity and drawdown field (upper row) and conditional variances resulting from different transformation methods in the EnKF and the particle filter
8.29 Statistics of drawdown residuals resulting from different transformation methods in the EnKF and the particle filter (PF)
List of Tables

6.1 Overview of suggested transformation methods. CDF stands for cumulative distribution function of the original data; ANA represents the anamorphosis function that links original data with Gaussian transformed data
6.2 Statistics of untransformed and transformed variable in comparison with theoretical values for a Gaussian variable
7.1 Model parameters used for the synthetic test case. K, log K stand for conductivity and log-conductivity, respectively. h, d, c represent the state variables head, drawdown and concentration. ε symbolizes measurement error. For concentration data, the measurement error standard deviation is composed of an absolute and a relative part and results in a measurement-specific standard deviation
8.1 RMSE of updated fields with regard to synthetic truth. Comparison between assimilation of untransformed data and updating with transformed data and scaled measurement error variance. Note that a negative percentage of reduction means an increase in RMSE in the transformed run compared with the untransformed one
8.2 RMSE of fields obtained from the three different EnKF assimilation methods with regard to synthetic truth and particle filter (PF)
1 Motivation

The prediction of contaminant transport in the subsurface has gained much importance, as groundwater represents an essential part of urban water supply and contamination sources multiply with the ongoing development of industry and urbanization. Possible health risks resulting from contaminated wells have to be prevented. To make responsible prognoses, the underlying processes of flow and transport in an aquifer have to be identified and implemented in numerical models.

Soil structures can be very heterogeneous due to their geological formation over time (e.g. sedimentation, erosion, fracturing). Although the spatial variability of parameters like conductivity is commonly accepted, modelers tend to assume homogeneous (upscaled) conditions to reduce computational effort as well as measurement costs. As opposed to the deterministic approach, where parameters as well as boundary conditions are assumed to be well known, stochastic models are able to quantify the uncertainty that comes along with the reduction of complexity from reality to the upscaled model.

In order to reduce the uncertainty inherent to a prognosis, all available data have to be included in the model, while unknown parameters should be estimated. Two main approaches to inverse modeling have emerged: inverse parameter estimation and stochastic simulation. Hendricks Franssen et al. [2009] provide a detailed comparison of representative methods of both approaches. Classical inverse estimation has been discussed extensively in the literature [e.g., Poeter and Hill, 1997, Carrera et al., 2005]: an objective function built of partial derivatives of state variable measurements with respect to the parameters is minimized. As a result, the configuration of parameters that best fits the given observation data is obtained.

The second approach to parameter estimation has been developed within the framework of Monte Carlo simulations [Robert and Casella, 2004]. An ensemble of parameter fields is used as equally likely input for a numerical model. The fields are conditioned on the available observations, and simulations on those conditioned parameter fields then produce an ensemble of state variable predictions. The mean of the ensemble prediction is a good estimator for the expected contamination spreading; as an additional benefit of this method, the uncertainty attached to this prognosis is obtained as well. With these results, health risks can be quantified by exceedance probabilities of legal limits, which forms a basis for monetary decision-making.
For data assimilation within the context of inverse modeling, the Ensemble Kalman Filter (EnKF) [Evensen, 2007] has attracted attention. It has been successfully applied to state variable estimation in atmospheric and oceanic sciences [Bertino et al., 2003, Béal et al., 2010] and has become popular because of its ease of implementation and comparatively low computational costs. Recent research has modified the procedure of EnKFs to make them suitable for parameter estimation and thus for application to stochastic modeling of subsurface flow [Nowak, 2009].

The largest remaining drawback of EnKFs is their theoretical derivation for multivariate Gaussian distributed variables: EnKFs do not perform optimal Bayesian updates for data of arbitrary, non-Gaussian distribution [Evensen, 2007]. State variables in subsurface flow and transport do not generally show Gaussian dependence on the parameter conductivity. The type of distribution and dependence is governed by physical processes and imposed boundary conditions [e.g., Nowak et al., 2008, Bellin and Tonina, 2007]. Figure 1.1 shows histograms of flow and transport variables at two different, arbitrarily chosen measurement locations. It can be clearly seen that different histogram shapes result from the different data types and locations. Only a fraction of the data approximately follows a normal distribution, indicated by the solid lines.
[Figure 1.1 appears here: panels "Histogram of Heads", "Histogram of Drawdowns" and "Histogram of Concentrations" for measurement locations 1 and 2; y-axis: relative frequency.]

Figure 1.1: Histograms of relevant variables for groundwater models: Bars represent relative frequency, solid line shows the normal distribution that corresponds to mean and variance calculated from the sample. Data sets taken from two different measurement locations.
Therefore, the non-optimality of the updating step is a major limitation to the use of EnKFs in subsurface parameter estimation. The aim of this thesis is to analyze and mitigate the effects of non-Gaussian data dependence on the performance of parameter estimation EnKFs. Non-linear, monotonic transformations of marginal distributions will be investigated to move arbitrarily distributed data closer to Gaussianity in the univariate sense. This transformation step will be included in the EnKF without interfering with its usual procedure. Data transformations have been examined in previous works [Béal et al., 2010, Simon and Bertino, 2009], but have only been applied to state variable estimation so far, and critical issues like non-stationarity of state variables and physical boundaries have not been satisfactorily discussed yet; the reader is referred to Chapter 2 for an overview of the issues tackled within this study.

It is expected that the linearized univariate dependence of the transformed state variables on the subsurface parameters will be exploited more efficiently when performing the conditioning step as suggested in this study. This would result in a more accurate prognosis of flow and transport in heterogeneous soils. It shall be demonstrated that parameter estimation by ensemble Kalman filters with transformed data can rightly be considered an attractive, computationally efficient alternative to existing conditional simulation techniques that are able to handle strongly non-linear model behavior.
2 Approach

The focus of this study is on finding appropriate transformations that render arbitrary data almost univariate Gaussian. The development of a generally applicable transformation function is presented in the main part of the study. Subsequently, a numerical test case illustrates the improvement that can be achieved by running an ensemble Kalman filter on transformed data. The approach for both parts is described in the following.

Data of arbitrary distributions are transferred into a non-parametric space by rank transformation [Conover and Iman, 1981]. Using the inverse Gaussian cumulative distribution function to transform a ranked data set into an almost-Gaussian one is known in the literature as Gaussian anamorphosis [e.g., Chilès and Delfiner, 1999]. Different methods to obtain the empirical anamorphosis function are presented in Section 6.1. In the next step, a continuous anamorphosis function is fitted to cover any possible measurement value (Section 6.2), even outside the empirical anamorphosis function (Section 6.3). The different transformation techniques are examined and evaluated with regard to their benefits and drawbacks. Their exactness is assessed via performance tests with synthetic data that follow known distributions (Section 6.4). Depending on the distribution of the data and the ensemble size, the most suitable technique (i.e., the most accurate and most stable one) is identified.

The additional step of data transformation in the procedure of the EnKF allows for the inclusion of additional a priori knowledge about state variables, e.g. physical bounds. This information can be included in the transformation and thus guarantees that the simulated measurements will take on physical values. Bounded state variables are prone to data clustering at the bounds and to non-unique transformation, which has not yet been addressed in detail in the literature, but is discussed and taken care of in Section 6.3.

Section 6.5 focuses on the implementation of data transformation in the EnKF. It has to be decided whether the transformation functions shall be valid over the whole domain (global approach, Simon and Bertino [2009]) or whether they are constructed to be valid at a specific measurement location (local approach, Béal et al. [2010]). The latter approach is used here because it accounts for non-stationarity of flow and transport state variables: different, location-specific anamorphosis functions are built from the ensemble of realizations at each measurement location.

The transformation step is included in an existing parameter estimation EnKF code, written in MATLAB.
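As a minimal illustration of the rank-based Gaussian anamorphosis described above, consider the following MATLAB sketch. It is not the thesis implementation: it assumes a tie-free sample, omits the fitting and tail treatment of Chapter 6, and all variable names are chosen here for illustration.

```matlab
% Empirical Gaussian anamorphosis of one ensemble (minimal sketch).
% x: [N x 1] ensemble of an arbitrarily distributed state variable (assumed input).
N = numel(x);
[~, order] = sort(x);
ranks = zeros(1, N);
ranks(order) = 1:N;                  % rank transformation [Conover and Iman, 1981]
p = (ranks - 0.5) / N;               % empirical non-exceedance probabilities in (0,1)
z = sqrt(2) * erfinv(2*p - 1);       % inverse standard Gaussian CDF (base MATLAB)
% The pairs (x, z) define the empirical anamorphosis function; a continuous fit
% and extrapolation into the tails (Sections 6.2 and 6.3) are needed for the EnKF.
```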
This transformation module is designed as an automatic tool that chooses an appropriate transformation for each type of variable according to the user's specifications, or default values depending on ensemble size and recommended type of transformation.

Chapter 7 presents an application that allows for an assessment of the improvement achieved by Gaussian anamorphosis. A groundwater flow and transport model is chosen as test case. State variables like hydraulic heads, drawdowns and concentrations are simulated by a finite element code with a random, heterogeneous conductivity field as input parameter and certain imposed boundary conditions. "Real" measurements are taken from this synthetic truth to cancel out the influence of model bias. By running an ensemble Kalman filter code, the measurements of a specific type are assimilated and the stochastic parameter field is calibrated on these available data.

The filter's performance with untransformed and transformed data is compared with regard to the quality of the prognosis in Chapter 8. Criteria for the latter are the reduction of the prediction error as well as the increase in prediction confidence. Statistics of the residuals at the measurement locations resulting from the different EnKF assimilation methods will be evaluated. The effects of transformation on the dependence structure of state variables, and consequently on the updating procedure, are explained in Section 8.2. Gaussian anamorphosis applied to the variables drawdown, head and concentration is presented and discussed in Section 8.3. Finally, the prediction accuracy of the EnKF applied to transformed data is verified in a comparison with results obtained from the particle filter, which is considered the reference solution for stochastic parameter estimation (Section 8.4).

A summary, conclusions and an outlook are given in Chapter 9, with special attention to the question of how the performance of the EnKF could be further improved significantly, e.g. by multivariate transformations using copulas to ensure multi-Gaussian dependence between flow and transport state variables and subsurface parameters. Transforming marginal distributions is expected to be a substantial, but not yet complete, improvement towards optimality of the filter, because multi-Gaussianity does not necessarily result from Gaussian marginal distributions [Gómez-Hernández and Wen, 1998].

The following Chapters 3, 4 and 5 present the basic physical, (geo-)statistical and mathematical concepts assumed for this investigation of non-linear data transformations towards Gaussianity with application to subsurface flow and transport.
3 Flow and Transport in the Subsurface

Gaussian anamorphosis in the ensemble Kalman filter shall be demonstrated and tested with transformed groundwater flow and transport data. To clarify the nomenclature and model concepts applied in Chapter 7, some basic definitions are given here.
3.1 Conceptual Model

The domain chosen for our test case is a horizontal 2D segment of an aquifer with heterogeneous conductivity. For simplicity, recharge through rainfall and percolation is not taken into account. A fully saturated domain is assumed; thus, we are modeling a one-phase (groundwater), two-component (water, solute) problem. The following transport processes are considered:

• Diffusion - elliptical spreading of a solute motivated by Brownian motion
• Advection - directional transport induced by the flow field
• Dispersion - spreading of a plume in a flow field because of heterogeneities and upscaling effects (from pore scale to representative elementary volume scale, Bear [1972])

Reactions as well as adsorption/absorption are neglected.
3.2 Mathematical Model

In the following, a mathematical model is presented that translates the conceptual model into a set of differential equations, which can be discretized and solved numerically (this is tackled in Section 7.1). For simplicity of illustration, the complexity of the flow and transport model is reduced: a few assumptions on the properties of both fluid and soil matrix are made to obtain a stationary, depth-averaged model. The general methodology is not affected by these simplifications.
3.2.1 Assumptions

Assumptions on the behavior of the fluid (groundwater):
• Incompressible, spatially constant density
• Creeping flow, thus inertial forces can be neglected

Assumptions with regard to the soil matrix:
• Incompressible, with constant porosity
• Locally isotropic conductivity
• Isothermal system, no heat balance required
• No external forces, depth-averaged approach

Assumptions concerning the solute:
• Isotropic diffusion
• Conservative tracer: no reaction/adsorption/absorption
3.2.2 Balance Equations

Mass Balance

The continuity equation can be derived from balancing the mass fluxes that enter or leave a representative elementary volume and their changes over time. In differential form, it reads:

\frac{\partial (\phi \rho)}{\partial t} + \nabla \cdot (\rho q) = \phi \, (q_{in} - q_{out})    (3.1)

with porosity φ, fluid density ρ, time t, specific discharge q and sources/sinks q_{in/out}; ∇ · (...) represents the divergence. Respecting the assumptions made previously, Equation 3.1 can be reformulated into a steady-state, two-dimensional flow equation:

\nabla \cdot q = q_{in} - q_{out}    (3.2)
Momentum Balance - Darcy's Law

In 1856, Henry Darcy found from experiments an empirical relation between the flow rate and the pressure gradient present in an aquifer [Darcy, 1856]. It has been shown by Neuman [1977] that this equation can also be derived analytically from the Navier-Stokes equations by picking the relevant assumptions from the list in Section 3.2.1 and assuming a Newtonian fluid. Darcy's law implies that the flow rate in porous media is proportional to the prevailing hydraulic head gradient ∇h = ∇(p + ρgz), with p being the fluid pressure; the conductivity K_f acts as constant of proportionality:

q = -K_f \nabla h    (3.3)

With the assumption of zero gravity, the head gradient is equivalent to the pressure gradient. Combined with the continuity equation (Equation 3.2), we obtain the simplified groundwater flow equation used to mathematically describe hydrostatic (∇h = 0) and hydrodynamic conditions in the subsurface:

\nabla \cdot (K_f \nabla h) = q_{in} - q_{out}    (3.4)
Transport Equation

In an analogous way to the continuity equation, a mass balance for the solute can be expressed. It is first shown in its general form, known as the advection-diffusion equation (ADE):

\phi \frac{\partial c}{\partial t} - \frac{\phi}{\rho} \frac{\partial \rho}{\partial t} c + \nabla \cdot (\phi v c) - \nabla \cdot (\phi D \nabla c) = \phi r + q_{in} (c_{in} - c)    (3.5)

with c being the concentration of the solute to be balanced, ∇(...) representing the gradient, D being the diffusion tensor, r being a reaction term, and q_in, c_in describing a fluid flux entering the system with a certain solute concentration. For porous media, the seepage velocity v = q/φ is used to describe the effective velocity in the pore space. Macroscopic dispersion is an implicit part of the model, since conductivity is chosen to be element-wise (locally) constant, but varying from element to element, which represents heterogeneities in the soil. To account for hydrodynamic dispersion, the diffusion tensor is replaced by a dispersion tensor [Scheidegger, 1961] that parameterizes the influence of both diffusion and dispersion in any spatial direction (with I symbolizing the identity matrix, D_e being the effective diffusion coefficient, and α_l, α_t being the longitudinal and transverse dispersivities, respectively):

D = (\alpha_l - \alpha_t) \frac{v v^T}{\|v\|} + I \left( D_e + \alpha_t \|v\| \right)    (3.6)

Again including our assumptions, Equation 3.5 can be reduced to

\frac{\partial c}{\partial t} + v \cdot \nabla c - \nabla \cdot (D \nabla c) = 0    (3.7)

if only a sink (e.g., a well) is present in the domain, but no source. By solving the groundwater flow equation for h, calculating q according to Equation 3.3 and dividing by φ, v can be determined, and subsequently Equation 3.7 can be solved. A continuous solute injection is chosen as boundary condition to create a stationary model: the time derivative vanishes and we can predict the shape of the stationary plume in our domain.
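To illustrate Equation 3.6, a minimal MATLAB sketch for a single 2D velocity vector follows; the numerical values are arbitrary examples chosen here, not parameters of the test case.

```matlab
% Scheidegger dispersion tensor (Eq. 3.6) for one 2D velocity vector (sketch).
v       = [1.0e-4; 5.0e-5];   % seepage velocity components [m/s], arbitrary example
alpha_l = 2.5;                % longitudinal dispersivity [m], arbitrary example
alpha_t = 0.25;               % transverse dispersivity [m], arbitrary example
De      = 1.0e-9;             % effective diffusion coefficient [m^2/s]

vn = norm(v);                                     % ||v||
D  = (alpha_l - alpha_t) * (v * v') / vn ...
     + (De + alpha_t * vn) * eye(2);              % 2x2 dispersion tensor
```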
4 Geostatistics

Uncertainty inherent to a model prediction can be quantified if the probability of parameters, state variables or events is known. To describe the probabilistic behavior of these variables, probability functions have to be found. General definitions and the most popular examples are given in the following section.
4.1 Probability Functions

4.1.1 Univariate Probability Functions

A distribution function or cumulative distribution function (CDF) is defined by

F(\vartheta) = \int_{-\infty}^{\vartheta} f(\vartheta') \, d\vartheta'    (4.1)

where f(ϑ) is the probability density function (PDF) with the properties f(ϑ) ≥ 0 for all ϑ and \int_{-\infty}^{\infty} f(\vartheta) \, d\vartheta = 1. While F yields the non-exceedance probability of the random variable Θ for a value ϑ, f can be interpreted as the probability density with which the value ϑ is drawn from the total population.
Gaussian Distribution

The univariate Gaussian distribution is most often used to describe a random variable's probability because of its "mathematical attractiveness" [Johnson and Wichern, 1988]: it satisfactorily describes many natural and technical processes, and it shows convenient statistical properties, e.g. symmetry and maximum entropy if only the first and second moment are known [Cover and Thomas, 2006]. The Gaussian probability function is defined as

f(\vartheta) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{\vartheta - \mu}{\sigma} \right)^2 \right)    (4.2)

with mean µ and variance σ². The Gaussian probability function with µ = 0 and σ² = 1 is referred to as standard Gaussian.
Log-normal Distribution

If the logarithm of a variable follows a Gaussian distribution, this variable can be considered log-normally distributed. The probability function

f(\vartheta) = \frac{1}{\sqrt{2\pi} \, \sigma \vartheta} \exp\left( -\frac{(\ln(\vartheta) - \mu)^2}{2\sigma^2} \right)    (4.3)

is defined only for values ϑ > 0.
Beta Distribution

Another popular univariate distribution is the beta distribution, which proves to be very flexible in its shape, depending on the parameters α, β. As opposed to the distributions mentioned above, the beta distribution is bounded on both sides, with support on the interval [0, 1]:

f(\vartheta) = \frac{1}{B(\alpha, \beta)} \, \vartheta^{\alpha-1} (1 - \vartheta)^{\beta-1},    (4.4)

with B being the beta function [Abramowitz and Stegun, 1964]. Note that the beta distribution can be scaled to cover any finite interval [a, b].
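For illustration, Equation 4.4 can be evaluated in base MATLAB via the log-gamma function (avoiding toolbox dependencies); the parameter values below are arbitrary examples.

```matlab
% Beta PDF (Eq. 4.4) without toolbox functions (sketch); alpha, beta are examples.
alpha = 2; beta = 5;
theta = linspace(0.001, 0.999, 999);                          % interior of [0, 1]
logB  = gammaln(alpha) + gammaln(beta) - gammaln(alpha+beta); % log of B(alpha, beta)
f     = exp((alpha-1)*log(theta) + (beta-1)*log(1-theta) - logB);
```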
4.1.2 Multivariate Probability Functions

In most natural processes, state variables do not depend on one single parameter, but on several different ones. The probability of a certain combination of parameters that leads to a certain variable value can be described by multivariate distributions. The probability of two events happening together is given by the joint PDF f(ϑ, ϕ). The conditional PDF or posterior PDF f(ϑ|ϕ) is used to determine the probability of event ϑ, assuming that event ϕ has happened. If any information about event ϕ is ignored, we obtain the marginal PDF or prior PDF of ϑ: f(\vartheta) = \int_{-\infty}^{\infty} f(\vartheta, \varphi) \, d\varphi. This leads to Bayes' theorem:

f(\vartheta \mid \varphi) = \frac{f(\vartheta) \, f(\varphi \mid \vartheta)}{f(\varphi)}    (4.5)

The reader is referred to Berger [1985] for details on Bayesian analysis. In the context of parameter estimation, the conditional probability of a particular unobserved state vector y_u, given known observations y_o, can be written as

f(y_u \mid y_o) = \frac{f(y_u) \, f(y_o \mid y_u)}{f(y_o)}    (4.6)
4.2 Statistical Moments

For high-dimensional systems, complete distribution functions are too much data to handle. Information can instead be reduced to characteristics of the distributions, namely statistical moments. Commonly used univariate moments are:

• Expected value (first moment): the average value if an infinite number of samples is drawn from the random variable Θ:

\mu = E[\Theta] = \int_{-\infty}^{\infty} \vartheta \, f(\vartheta) \, d\vartheta    (4.7)

• Variance (second central moment): the mean squared deviation from the expected value:

\sigma^2 = E\left[(\Theta - E[\Theta])^2\right] = \int_{-\infty}^{\infty} (\vartheta - E[\Theta])^2 f(\vartheta) \, d\vartheta,    (4.8)

with standard deviation σ = \sqrt{\sigma^2}.

To characterize bivariate distributions, the covariance of two random variables Θ and Φ can be drawn on:

Q_{\vartheta\varphi} = E\left[(\Theta - E[\Theta])(\Phi - E[\Phi])\right] = \iint_{-\infty}^{\infty} (\vartheta - E[\Theta])(\varphi - E[\Phi]) \, f(\vartheta, \varphi) \, d\vartheta \, d\varphi    (4.9)
If the random variables Θ and Φ are independent, the joint probability becomes f(ϑ, ϕ) = f(ϑ) f(ϕ) and the covariance equals zero. This does not hold the other way around: a covariance of zero does not imply independence; that implication holds only for a bivariate Gaussian distribution.
4.3 Statistics of Random Fields

Random fields are functions of random variables that depend on the location x = (x, y, z). A conductivity field can be seen as one realization of a random field f(ϑ(x)). To be able to derive moments at a location x, several realizations of the random field need to be available (e.g., from stochastic simulations):

• Sample mean: average value of N independent realizations from the given random field:

\mu(x) \simeq \bar{\vartheta}(x) = \frac{1}{N} \sum_{i=1}^{N} \vartheta_i(x)    (4.10)

• Sample variance: variance of the sample of realizations:

\sigma^2(x) \simeq \frac{1}{N-1} \sum_{i=1}^{N} \left( \vartheta_i(x) - \bar{\vartheta}(x) \right)^2    (4.11)

• Sample covariance: covariance between the random variables at two different locations x_1, x_2 of the random field:

Q_{\vartheta\vartheta}(x_1, x_2) \simeq \frac{1}{N-1} \sum_{i=1}^{N} \left( \vartheta_i(x_1) - \bar{\vartheta}(x_1) \right) \left( \vartheta_i(x_2) - \bar{\vartheta}(x_2) \right)    (4.12)

• Sample correlation: normalized covariance:

Cor\left( \vartheta(x_1), \vartheta(x_2) \right) = \frac{Q_{\vartheta\vartheta}(x_1, x_2)}{\sigma(x_1) \, \sigma(x_2)}    (4.13)

For multivariate analysis, the variance-covariance matrix [Wackernagel, 2003] can be constructed. This matrix is filled with bivariate covariances (auto-covariances along the main diagonal):

Q_{\vartheta\vartheta,ij} = \frac{1}{N-1} \left( \vartheta(x_i) - \bar{\vartheta}(x_i) \right) \left( \vartheta(x_j) - \bar{\vartheta}(x_j) \right)    (4.14)
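A minimal MATLAB sketch of Equations 4.10 to 4.14 for an ensemble stored column-wise follows; the matrix name `Theta` and its layout are assumptions of this illustration.

```matlab
% Sample statistics of a random field from N realizations (Eqs. 4.10-4.14), sketch.
% Theta: [n_points x N] ensemble matrix, one realization per column (assumed input).
N      = size(Theta, 2);
mu     = mean(Theta, 2);                        % sample mean (Eq. 4.10)
A      = Theta - mu;                            % deviations from the ensemble mean
sigma2 = sum(A.^2, 2) / (N - 1);                % sample variance (Eq. 4.11)
Q      = (A * A') / (N - 1);                    % variance-covariance matrix (Eq. 4.14)
Cor    = Q ./ (sqrt(sigma2) * sqrt(sigma2)');   % sample correlation (Eq. 4.13)
```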
4.4 Spatial Dependence

Subsurface parameter fields cannot be sufficiently characterized by marginal distributions because of strong spatial dependence: within the correlation length λ of a parameter, the random variable at a specific location depends to a certain degree on the random variables at surrounding locations. A tool to capture spatial dependence is the variogram [e.g., Chilès and Delfiner, 1999]. For varying separation distances h, the variance of the difference between values at two locations separated by h is determined and plotted as the experimental semi-variogram:

\gamma(h) = \frac{1}{2} Var\left[ \vartheta(x + h) - \vartheta(x) \right] = \frac{1}{2} E\left[ \left( \vartheta(x + h) - \vartheta(x) \right)^2 \right]    (4.15)
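For illustration, a minimal MATLAB sketch of Equation 4.15 for equally spaced data along a 1D transect is given here; the unit spacing and variable names are assumptions of the sketch.

```matlab
% Experimental semi-variogram (Eq. 4.15) for a 1D transect with unit spacing, sketch.
% v: [n x 1] values sampled along a line (assumed input).
n       = numel(v);
max_lag = floor(n / 2);
gamma_h = zeros(max_lag, 1);
for h = 1:max_lag
    dv         = v(1+h:n) - v(1:n-h);    % increments at separation distance h
    gamma_h(h) = 0.5 * mean(dv.^2);      % semi-variance for this lag
end
```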
The variogram is founded on the intrinsic hypothesis, which assumes that the increment ϑ_h = ϑ(x + h) − ϑ(x) is a second-order stationary random function: the mean of the increment is constant over the whole domain or shows a linear drift, and the variance of the increment depends only on the separation vector h. Theoretical variogram models can be fitted to data points experimentally determined from Equation 4.15. They are a useful tool to describe an averaged correlation depending on the separation distance. This expected value is sufficient to fully describe Gaussian spatial dependence, but if a random field exhibits a different type of dependence, other tools need to be found to exploit information on the spatial behavior of different quantiles of the field's distribution. An example of such an alternative tool are copulas [Bárdossy and Li, 2008]: independent of the respective marginal distributions, they reveal multivariate spatial structures in all quantiles. Their statistical meaning is expressed by Sklar's theorem [Sklar, 1959]:

f(\vartheta_1, \vartheta_2, \ldots, \vartheta_n) = C\left( F(\vartheta_1), F(\vartheta_2), \ldots, F(\vartheta_n) \right)    (4.16)
The copula C joins together the marginal distributions F(ϑ_i) to a multivariate distribution function of random variables f(ϑ_1, ..., ϑ_n); or, put the other way around, the copula itself is "cleaned" of the influence of the marginal distributions on the actual spatial structure and is thus able to reveal structures that might be hidden in a variogram.
4.5 Spatial Interpolation and Simulation

If parameters or state variables are of interest at locations other than measurement points, the available data have to be interpolated. Two different approaches can be chosen from: either the data are directly interpolated, e.g. by kriging, or parameters are inversely estimated and state variables are subsequently determined from simulations based on the conditioned parameter field.

Kriging originates from mining statistics and is known to be a Best Linear Unbiased Estimator (BLUE) for the conditional mean value of a random field at a location x* and its covariance [e.g., Kitanidis, 1997]. It considers the spatial configuration of observations in the vicinity of x* and the observed values ϑ_α at the n surrounding locations:

\vartheta^* = \sum_{\alpha=1}^{n} \lambda_\alpha \, \vartheta(x_\alpha)    (4.17)

The weights λ_α depend on the location x* and are evaluated with the help of covariance functions (constructed from the variogram model) between the different observations and the point to be estimated. Not only single points can be interpolated with the help of covariance functions; an entire domain can also be simulated based on the chosen geostatistical model. There is a variety of simulation methods applied in the geostatistics community, among them the spectral methods [Dietrich and Newsam, 1993] that will be implemented in this study to efficiently generate random but spatially correlated conductivity fields. If variables depend on a primary parameter and the physical relationship can be well approximated by simulation, inverse methods might be preferred over pure geostatistical interpolation: simulations implicitly account for physical processes, non-linearities and physical boundaries, while kriging only reflects the spatial configuration [Schwede and Cirpka, 2010]. Chapter 5 will present data assimilation techniques for inverse modeling that take advantage of both simulation and interpolation.
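A minimal MATLAB sketch of the weight computation behind Equation 4.17 follows, assuming a simple exponential covariance model and using the sample mean as a stand-in for the known mean of simple kriging; all names and parameter values are illustrative.

```matlab
% Simple kriging estimate (Eq. 4.17) with an exponential covariance model, sketch.
% Assumed inputs: X_obs [n x 2] observation coordinates, v_obs [n x 1] values,
% x_star [1 x 2] estimation point. sigma2, lambda are illustrative model parameters.
sigma2  = 1.0;  lambda = 10.0;
cov_fun = @(h) sigma2 * exp(-h / lambda);

n = size(X_obs, 1);
D = zeros(n);                                   % distances between observations
for i = 1:n
    D(:, i) = sqrt(sum((X_obs - X_obs(i, :)).^2, 2));
end
d0 = sqrt(sum((X_obs - x_star).^2, 2));         % distances to the target point

w      = cov_fun(D) \ cov_fun(d0);              % kriging weights lambda_alpha
m      = mean(v_obs);                           % stand-in for the known mean
v_star = m + w' * (v_obs - m);                  % kriging estimate at x_star
```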
5 Data Assimilation

The term data assimilation refers to the process of combining an uncertain model prediction of a spatially dependent state variable with a set of discrete measurements [Evensen, 2007]. The challenge of giving the best possible estimate of the variable field consists in extracting (or filtering) as much information as possible from typically noisy observations. Measurements naturally include an unknown measurement error with assumed variance, which shall be taken into consideration during data assimilation. Therefore, the optimal assignment of weights to the individual measurements, depending on their spatial configuration and measurement error, has to be found, which basically resembles the process of kriging (Section 4.5).

The procedure of performing inverse modeling in order to estimate a random field of state variables can be summarized as follows:
1. A discrete, stochastic state space model is formulated to predict a model state.
2. The predicted model state is improved by assimilating observed measurements.
3. The prediction confidence is determined as a basis for risk management.

In data assimilation, a weight of zero for an observation is equivalent to the assumption of a perfect simulation, while a maximum weight fully accepts the noisy observation. Filtering methods have to be found that choose an appropriate value in between to reasonably combine the imperfect simulation and the imperfect observation.

The steps named above are derived for time-independent models. For dynamical models, the assimilation method needs to include observations whenever they become available (e.g., in weather forecasting). Sequential data assimilation estimates the unobserved variables in sequential time steps; information is thus propagated forward in time, while backward integration is not necessary. The filtering process can be divided into a predictive step and a conditioning step: the state at time t_k is integrated according to the dynamic model to obtain a model state prediction for the next time step t_{k+1}, at which observations are available; the predicted state is then conditioned on these observations.

The Kalman filter [Kalman, 1960] makes it possible to evaluate the evolution of the prediction error over time and has been developed further until today for different fields of application, e.g. weather forecasting. A short overview of the history of Kalman filters is provided in the following section.
5.1 History of Kalman Filters

The Kalman filter is a set of equations to compute the best conditional estimate of the state variable, given an a priori estimate and observed measurements (compare Bayes' theorem, Equation 4.6). These equations represent a least squares problem [Sorenson, 1970] and are solved by evaluating cross- and auto-covariances between estimated state variables and observations. As the covariance is an optimal measure only for linear dependence, the Kalman filter is an optimal, unbiased estimator for linear models of multivariate Gaussian variables [Evensen, 2007]. Non-Gaussian marginal distributions or multivariate dependence structures can originate from strongly non-linear model behavior and compromise the results of the filter. The extended Kalman filter was developed to address this problem by linearly approximating the error propagation and applying a closure assumption. Both the Kalman filter and the extended Kalman filter are very costly in computation and storage for high-dimensional dynamical models. An additional drawback remains the questionable applicability to non-linear models due to divergence issues [Pearson et al., 1997]. As a Monte Carlo alternative to these deterministic filters, the ensemble Kalman filter (EnKF) [Evensen, 1994] was intended to overcome those two drawbacks: it provides a more appropriate closure scheme for non-linear models and is computationally more efficient.
5.2 Ensemble Kalman Filter

Within the framework of Monte Carlo simulations, the forecasted probability density function of the state variable is approximated by a large ensemble of simulated model states. The analysis scheme for a dynamical model is discussed below; in Section 5.2.2, differences between state estimation and parameter estimation are pointed out to prepare the ground for the application of the EnKF to conductivity estimation as performed in Chapter 7.
5.2.1 Analysis Scheme

The set of equations needed to update an ensemble of model states is presented here. An error-free model is assumed, whose prediction quality is affected by uncertain initial conditions and parameters as well as by measurement errors. The deviation of the simulated state at a certain location from the given observation shall be corrected by the filter not only exactly at this location; the information about the observed value shall also be spread into the surrounding area.
Note that the observed state variable does not necessarily have to be of the same variable type as the variable to be estimated. It could as well be a different state variable that shows a strong (preferably linear) relationship to the estimated state variable (e.g., surface phytoplankton and nitrate concentration in the ocean [Béal et al., 2010]). The operator H maps the estimated state ϑ^k at time t_k onto the observed state d^k, which can be expressed in matrix notation:

d_{sim}^k = H \vartheta^k    (5.1)
Perturbed measurements are generated by adding a Gaussian distributed measurement error ε with zero mean and a prescribed variance to the unconditioned simulated state vector ϑ_u, in order to ensure comparability with the real, noisy observations d. The deviation of simulated states from observed states for every realization i at time t_k can then be written as

\Delta_i^k = d^k - H \vartheta_{u,i}^k + \varepsilon_i    (5.2)

In the next step, this deviation (or innovation) is rated with regard to the amount of trust that shall be put into the observation. This clearly depends on the size of the measurement error: measurements with a high measurement error shall not be taken too seriously. The covariance of the measurements also has an impact on the worth of the data: measurements that strongly depend on each other shall not be overestimated in their individual importance. Therefore, Equation 5.2 is divided by the measurement error covariance matrix R and the estimated measurement covariance matrix H Q_{ϑϑ} H^T, with Q_{ϑϑ} being the estimated ensemble covariance matrix:

Q_{\vartheta\vartheta}^k = E\left[ \left( \vartheta_u^k - E[\vartheta_u^k] \right) \left( \vartheta_u^k - E[\vartheta_u^k] \right)^T \right]    (5.3)
R consists of the measurement error variances on the main diagonal (ensemble variance determined according to Equation 4.11) and off-diagonal entries of zero, because measurement errors are assumed to be uncorrelated. Finally, this normalized innovation has to be translated into the state space of the random variable that shall be estimated. The relationship between the observed variable at each measurement location and the estimated state variable at each point of the domain is given by the cross-covariance matrix:

Q_{\vartheta d}^k = Q_{\vartheta\vartheta}^k H^T    (5.4)
The ensemble Kalman filtering step can then be formulated and performed for each ensemble member to obtain a conditioned state variable vector ϑ_{c,i}:

\vartheta_{c,i}^k = \vartheta_{u,i}^k + Q_{\vartheta d}^k \left( Q_{\vartheta\vartheta}^k + R \right)^{-1} \left[ d^k - H \vartheta_{u,i}^k + \varepsilon_i \right]    (5.5)

The influence function K = Q_{ϑd} (Q_{ϑϑ} + R)^{-1} is called the Kalman gain and is responsible for the importance that is assigned to the measurements and for the spatial range of influence of the innovation term. Note that the Kalman gain is formally equivalent to simple co-kriging. The EnKF converges to the result of the classical Kalman filter with increasing ensemble size and is derived to be an error covariance minimizing scheme for multi-Gaussian state variables and a linear observation model H [Burgers et al., 1998]. The error covariance matrix is defined with regard to the ensemble mean, as the true state is not known:

Q_{\vartheta\vartheta,c}^k = E\left[ \left( \vartheta_c^k - E[\vartheta_c^k] \right) \left( \vartheta_c^k - E[\vartheta_c^k] \right)^T \right]    (5.6)
Thus, the ensemble mean is interpreted as the best estimate and its variance is taken as a measure for the a posteriori prediction uncertainty.
5.2.2 Ensemble Kalman Filter for Parameter Estimation Because of its ease of implementation and low computational costs [Evensen, 2003], the EnKF has gained much popularity in state variable estimation and has been recently prepared for its use in parameter estimation (Quasi-linear Kalman ensemble generator [Nowak, 2009]). Instead of conditioning a time variant state variable on state variable observations, timeinvariant uncertain parameters are estimated with the help of state observations. Equation 5.5 is transferred into
s_{c,i} = s_{u,i} + Q_{sy}\left(Q_{yy} + R\right)^{-1}\left(y_o - y_{u,i} + \varepsilon_i\right), \qquad (5.7)
with s_u representing the parameter vector to be conditioned and y_o, y_u being the observed and simulated state variable vectors, respectively. Note that, in the case of parameter estimation, the operator H does not establish a relationship between different types of state variables that suffer from uncertain initial conditions, but maps uncertain parameters onto dependent states, e.g., in the form of a flow and transport model.
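To make the update step concrete, the following minimal MATLAB sketch implements Equation 5.7 for a given ensemble; all variable names (S, Y, y_obs, sigma_eps) are chosen for illustration only and are not the actual filter code used in this study.

% Sketch of the EnKF parameter update (Equation 5.7).
% S: n_par x N ensemble of parameter vectors; Y: n_obs x N simulated
% observations; y_obs: n_obs x 1 measurements; sigma_eps: n_obs x 1
% measurement error standard deviations (R assumed diagonal).
N = size(S, 2);
R = diag(sigma_eps.^2);
dS = S - repmat(mean(S, 2), 1, N);         % parameter anomalies
dY = Y - repmat(mean(Y, 2), 1, N);         % simulated observation anomalies
Q_sy = dS * dY' / (N - 1);                 % cross-covariance Q_sy
Q_yy = dY * dY' / (N - 1);                 % observation covariance Q_yy
for i = 1:N
    eps_i = sigma_eps .* randn(size(y_obs));  % perturbed measurement noise
    S(:, i) = S(:, i) + Q_sy * ((Q_yy + R) \ (y_obs - Y(:, i) + eps_i));
end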
5.3 Particle Filter

The particle filter [Gordon et al., 1993] is an ensemble-based alternative to the EnKF that is computationally very costly, but comes with advantageous characteristics: no assumptions on the shape of the prior PDFs are made, and the filter is able to handle arbitrarily non-linear models. The particle filter is a direct numerical implementation of Bayes' theorem (Equation 4.6) and is therefore optimal and accurate in the limit of an infinite ensemble size.
In contrast to the EnKF, the particle filtering step does not introduce an innovation term and condition the parameter field accordingly; instead, it assigns a weight to each realization and returns the weighted mean of the ensemble as the best estimate. Hence, the particle filter is a resampling method that does not perform local changes in the parameter fields. The normalized weights are determined as the Bayesian probability that the values d are observed given the simulated state y_u:

w_i = \frac{f\left(d \mid y_{u,i}\right)}{\sum_{j=1}^{N} f\left(d \mid y_{u,j}\right)} \qquad (5.8)
The weighted mean as best estimate at a specific location in the domain is calculated according to

\mu_{\mathrm{weighted}} \simeq \vartheta_{\mathrm{weighted}} = \sum_{i=1}^{N} w_i \vartheta_i \qquad (5.9)
and the weighted variance results in

\sigma_{\mathrm{weighted}}^2 \simeq \sum_{i=1}^{N} w_i \left(\vartheta_i - \vartheta_{\mathrm{weighted}}\right)^2 \qquad (5.10)
where ϑ stands for either the parameter s to be estimated or the state y simulated to match the given observation d. In this study, the particle filter will be used as reference solution with regard to data assimilation for nonlinear models because it does not rely on the assumptions of univariate or multivariate Gaussianity or linearity.
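For comparison with the EnKF sketch above, a minimal MATLAB sketch of the weighting step in Equations 5.8 to 5.10 is given below; it assumes a Gaussian likelihood with measurement error covariance R, which is one common (but not the only possible) choice for f(d | y_u), and all variable names are purely illustrative.

% Sketch of particle filter weighting (Equations 5.8 - 5.10), assuming a
% Gaussian likelihood. Y: n_obs x N simulated states; y_obs: n_obs x 1
% observations; theta: 1 x N values of the quantity of interest (given).
N = size(Y, 2);
logw = zeros(1, N);
for i = 1:N
    res = y_obs - Y(:, i);
    logw(i) = -0.5 * (res' * (R \ res));   % log-likelihood up to a constant
end
w = exp(logw - max(logw));                 % subtract max to avoid underflow
w = w / sum(w);                            % normalized weights, Eq. 5.8
theta_w = sum(w .* theta);                 % weighted mean, Eq. 5.9
var_w = sum(w .* (theta - theta_w).^2);    % weighted variance, Eq. 5.10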
6 Gaussian Anamorphosis in the Ensemble Kalman Filter

As explained above, the Ensemble Kalman Filter is only optimal for multivariate Gaussian state and parameter variables. Yet, it has been widely used as a reasonably accurate method in cases with non-multi-Gaussian variables [Evensen, 2007]. The objective of this study is to show that the Ensemble Kalman Filter can be moved closer to optimality by applying it to transformed variables that follow at least a Gaussian marginal distribution. Therefore, appropriate transformation methods have to be found that turn arbitrary variables into Gaussian variables in the univariate sense. The following sections present different transformation types and point out their major benefits and drawbacks.
6.1 Classification of Transformation Techniques

6.1.1 Direct Transformation

The most direct and rather empirical approach to turn a skewed distribution into an approximately Gaussian one is to treat the variable with an appropriate mathematical expression, e.g., apply the natural logarithm to a positively skewed variable or square a negatively skewed variable. With regard to groundwater flow, a direct transformation might prove useful for handling drawdown data. These data are positively skewed, as many values equal or close to zero occur and only a few high values build the right tail towards infinity. To transform drawdown data d into a Gaussian sample, the natural logarithm could be applied:

z = \log(d) \qquad (6.1)

This log-transformation is a special case of a family of transformations introduced by Box and Cox [1964], which can be parameterized as

z = \begin{cases} \dfrac{(d+\lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \lambda_1 \neq 0 \\ \log(d + \lambda_2), & \lambda_1 = 0. \end{cases}
Such a transformation is able to produce acceptable results if the input data are bounded on one side (the parameter λ_2 > −d_min ensures positiveness for any boundary value d_min). For variables with two physical boundaries, this class of transformations is insufficient. Bertino et al. [2003] tried to transform concentration data into Gaussian values by applying the natural logarithm, but the properties of the transformed data are not close enough to Gaussianity for our needs. Therefore, it will be shown that indirect transformations are more flexible with regard to the input data and produce output data that can even be considered standard normally distributed.
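As a brief illustration, the Box-Cox family can be coded in a few lines of MATLAB; the drawdown values and parameter choices below are purely exemplary.

% Sketch of the Box-Cox transformation applied to exemplary drawdown data.
d = [0.01; 0.05; 0.1; 0.4; 1.2; 3.5];   % positively skewed sample (fictitious)
lambda2 = 0;                            % shift parameter, requires lambda2 > -min(d)
lambda1 = 0.3;                          % power parameter
if lambda1 ~= 0
    z = ((d + lambda2).^lambda1 - 1) / lambda1;
else
    z = log(d + lambda2);               % limiting case; Equation 6.1 for lambda2 = 0
end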
6.1.2 Indirect Transformation Techniques

An arbitrarily distributed variable and its Gaussian transform are linked by their cumulative distribution functions (CDFs). Van der Waerden [1965] used this fact to examine test statistics, and Krzysztofowicz [1997] gave a detailed description of the intuitive analytic relationship

z = G^{-1}\left[F(x)\right] \qquad (6.2)

for a random variable x, its cumulative distribution function F(x), the Gaussian variable z and the standard normal distribution function G(z). As G(z) is by definition monotonically increasing, the inverse G^{-1} exists. The operation

\psi(x) = G^{-1}\left[F(x)\right] \qquad (6.3)
is called Gaussian transformation or Gaussian anamorphosis function [e.g., Chilès and Delfiner, 1999]. In order to find the Gaussian anamorphosis function, the cumulative distribution function F has to be determined in a first step. Subsequently, the Gaussian inverse distribution function has to be evaluated. This stepwise procedure will be referred to as indirect transformation in this study. There are different approaches to indirectly build the Gaussian anamorphosis function; they can be divided into parametric and non-parametric methods.

Parametric Methods

For state variables like concentrations, pressure heads or drawdowns, theoretical distribution functions can often be inferred from groundwater flow and transport processes and imposed boundary conditions. This has been reviewed in Chapter 1. With the knowledge that our variable theoretically follows a certain distribution, we can estimate the parameters of that distribution function with the help of the Maximum Likelihood Method. By applying the inverse Gaussian CDF to the theoretical cumulative frequencies of our sample according to Equation 6.2, we can obtain the anamorphosis function ψ(x) quite easily and with negligible computational effort.
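For a variable that is believed to be beta-distributed (a case that will become relevant in Section 6.4.1), the parametric procedure can be sketched in MATLAB as follows; the sketch assumes the Statistics Toolbox and uses a synthetic sample for illustration.

% Sketch of a parametric anamorphosis for data assumed beta-distributed.
x = betarnd(2, 2, 500, 1);            % exemplary sample on [0, 1]
phat = betafit(x);                    % maximum likelihood estimates of a, b
F = betacdf(x, phat(1), phat(2));     % theoretical cumulative frequencies
z = norminv(F, 0, 1);                 % anamorphosis according to Equation 6.2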
Non-Parametric Methods

If fitting a parametric distribution function is not appropriate because the data do not seem to follow any specific theoretical distribution, applying non-parametric methods is an alternative. These distribution-free methods produce more robust results in the sense that fewer assumptions are made: information about the shape of the underlying distribution is not needed. Instead, characteristics of the data are drawn from the sample itself, which requires a large sample size. When applied to a Monte Carlo process, this should not be a hurdle. With the rising popularity of Monte Carlo methods, non-parametric statistics have become promising tools because of their wide applicability. Examples of distribution-free methods include histograms and indicator kriging [Journel, 1983].

With regard to Gaussian transformation, non-parametric methods can be used to build the anamorphosis function. This procedure will be divided into three main steps, according to Simon and Bertino [2009]:

1. Find Gaussian values z corresponding to each of the data points x: construction of the empirical anamorphosis function (discussed in this section)
2. Fit a continuous function to the empirical anamorphosis function: interpolation or regression of the empirical anamorphosis function (see Section 6.2)
3. Define the tails of the continuous anamorphosis function: extrapolation and dealing with clustered data (see Section 6.3)

Let us now focus on the first step. According to Equation 6.2, z can be determined with the help of the cumulative frequency of x. For a sample from the original variable x, the empirical CDF has to be built. The following estimator after Johnson and Wichern [1988] is used:

F_j = \frac{j - \frac{1}{2}}{N} \qquad (6.4)

with F_j being the estimated cumulative frequency for sorted data points with ranks j = 1...N and N being the sample size. With this estimator, each data point of the sample is assigned the same non-exceedance probability interval length below and above, see Figure 6.1. With the knowledge of the empirical CDF of our variable x, we can determine the values of the Gaussian transform z by rank transformation [Conover and Iman, 1981]. Figure 6.2 visualizes the procedure for an exemplary case. After having found the Gaussian equivalent to each of our data points, we can plot the empirical anamorphosis function, see Figure 6.3. Some characteristics of both the untransformed and the transformed variable can be read from the histograms in Figure 6.4.
[Figure 6.1: Building the empirical CDF. Intervals of non-exceedance probability assigned to each sample data point, marked by double arrows, for an exemplary sample size N = 10]

[Figure 6.2: Graphical Gaussian transformation: for any value x, the corresponding Gaussian value z is found via F(x) = G(z)]

[Figure 6.3: Empirical anamorphosis function]

[Figure 6.4: Histograms of untransformed and transformed variable and normal probability plot for the transformed variable]
While, in the exemplary case, the histogram of the original variable is highly positively skewed, the histogram of the transformed variable strongly resembles a standard Gaussian distribution, being symmetric about its zero mean and characteristically bell-shaped. The normal probability plot in Figure 6.4 forms a straight line, which is another indicator that the transformed variable very likely follows a Gaussian distribution.
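This first step can be summarized in a short MATLAB sketch; the beta sample below is only a stand-in for an arbitrary data vector.

% Sketch of the empirical CDF (Equation 6.4) and the rank transformation.
x = betarnd(0.5, 0.7, 200, 1);        % exemplary sample; any data vector works
N = numel(x);
[x_sorted, order] = sort(x);
F = ((1:N)' - 0.5) / N;               % cumulative frequencies, Eq. 6.4
z_sorted = norminv(F, 0, 1);          % Gaussian equivalents, Eq. 6.2
z = zeros(N, 1);
z(order) = z_sorted;                  % restore the original sample order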
6.2 Anamorphosis Function Fitting

There are several possible interpolation and regression methods to fit a continuous function to the discrete values of the empirical anamorphosis function. They differ in their theoretical background and underlying assumptions, which will be examined in detail in the following paragraphs. Furthermore, they can be characterized by their degree of exactness, robustness and computational effort, which will be investigated in Section 6.4 by means of representative performance tests. Before discussing these techniques, it shall be mentioned that they can be applied at two successive stages in the process of defining an anamorphosis function: either to the data points of the empirical cumulative distribution function, or to those of the empirical anamorphosis function. Inverting the cumulative frequency to obtain the corresponding Gaussian value is a non-linear transformation, thus the two procedures differ in the resulting anamorphosis curve. It will be shown later that both procedures have their justification, depending on the chosen technique. Let us now consider these interpolation and regression techniques with their advantages as well as disadvantages.
6.2.1 Interpolation Techniques

This subsection presents techniques that can be summarized under the concept of interpolation, which is a special type of curve fitting. Interpolation honors all empirical data points, thus the fitted curve passes exactly through all data points of our empirical anamorphosis function. Due to this property, interpolation is clearly recommended if we trust that the sampled data represent reality well, i.e., if the data have a very low degree of uncertainty. Interpolation techniques can be classified based on the degree of the polynomial that is used to interpolate between the data points. Linear interpolation connects two adjacent data points with a linear function and results in a non-differentiable, piecewise linear fit to the empirical data. To obtain a differentiable function, a higher-order polynomial can be fitted to the sample. Usually, only a polynomial of order n − 1 is able to satisfy the constraint of honoring all n data points. This can yield oscillations at the outer data points. Whether such a high-order fit is actually a good fit has to be decided based on the prior knowledge about the sample's population. A different type of differentiable function can be built by spline interpolation: higher-order polynomials are used to connect two data points, and the transitions between the piecewise functions are smooth. Spline interpolation can be considered a local method in the sense that if one data point is changed, this will only affect close neighbors, not the whole fit. In contrast, a polynomial fit is a global method, as one exchanged data point alters the entire fitted function.
Direct Gaussian Anamorphosis

Linear interpolation can be done between all available data points, either of the empirical CDF or of the empirical anamorphosis function. The latter procedure is known as direct Gaussian anamorphosis [e.g., Chilès and Delfiner, 1999]. Any value x within the range of our untransformed variable can be transformed according to

z = z_i + \frac{z_{i+1} - z_i}{x_{i+1} - x_i} \, (x - x_i) \qquad (6.5)
for any x_i < x < x_{i+1}, with x_i, x_{i+1} the two points of the empirical anamorphosis function that enclose x. This method guarantees that the information of all data points is included in the transformation, which is advantageous if the sample size is representative of the whole underlying unknown distribution. For small sample sizes, outliers could strongly reduce the quality of the transformation, as their importance is overestimated. To overcome this problem, Béal et al. [2010] suggest defining a number of equidistant intervals and then linearly interpolating between the edges of the intervals. This procedure smooths the continuous anamorphosis function and attenuates the impact of outliers, while using only a fraction of the available information. A more detailed consideration of the impact of sample size on the performance of the different fitting techniques will be presented in Section 6.4. Coming back to the question of which of the two empirical functions interpolation should be applied to, the support for the assumption of linear behavior between data points should be considered. On the one hand, linearly interpolating the empirical CDF is a commonly used method and seems justified, as there is no information available other than the sample data points to suggest a shape of the empirical cumulative distribution function different from the linear connection of those points. On the other hand, linear interpolation of the empirical anamorphosis function automatically puts an assumption of linearity into the process of anamorphosis that is not supported by any knowledge. Within the scope of this study, interpolation without any justifying support will be applied at the earliest possible stage of the whole transformation procedure - this can be summarized as "rather interpolate the input than the output of an operation". In this way, it is possible to distinguish the effects of interpolation (e.g., by varying the sample size) from those of the non-linear transformation. Hence, I suggest interpolating the empirical CDF rather than the empirical anamorphosis function.
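In code, transforming arbitrary new values then amounts to one linear interpolation of the empirical CDF followed by the inverse Gaussian CDF, as in this minimal MATLAB sketch (exemplary sample; ties in the sorted data are assumed absent):

% Sketch of the 'CDF - pointwise' transformation of new values within the
% sample range.
x = betarnd(2, 2, 500, 1);                       % exemplary sample
x_sorted = sort(x);
F = ((1:numel(x))' - 0.5) / numel(x);            % empirical CDF, Eq. 6.4
x_new = [0.2; 0.5; 0.9];                         % exemplary values to transform
F_new = interp1(x_sorted, F, x_new, 'linear');   % linear interpolation of CDF
z_new = norminv(F_new, 0, 1);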
Cubic Spline Interpolation

It can be highly advantageous for further applications if a continuous and differentiable anamorphosis function is built. As opposed to linear interpolation, spline interpolation can fulfill this demand. Additionally, we can introduce a requirement of monotonicity, because both our empirical functions are monotonically increasing. The method of choice is then monotone piecewise cubic interpolation [Fritsch and Carlson, 1980]. This method preserves monotonicity while respecting all data points; horizontal line segments would also be preserved. The interpolation uses the cubic Hermite basis functions

H_1(x) = 3\left(\frac{x_{i+1}-x}{x_{i+1}-x_i}\right)^2 - 2\left(\frac{x_{i+1}-x}{x_{i+1}-x_i}\right)^3,
H_2(x) = 3\left(\frac{x-x_i}{x_{i+1}-x_i}\right)^2 - 2\left(\frac{x-x_i}{x_{i+1}-x_i}\right)^3,
H_3(x) = -(x_{i+1}-x_i)\left[\left(\frac{x_{i+1}-x}{x_{i+1}-x_i}\right)^3 - \left(\frac{x_{i+1}-x}{x_{i+1}-x_i}\right)^2\right],
H_4(x) = (x_{i+1}-x_i)\left[\left(\frac{x-x_i}{x_{i+1}-x_i}\right)^3 - \left(\frac{x-x_i}{x_{i+1}-x_i}\right)^2\right]

and constructs a piecewise function F(x) to interpolate each interval x_i < x < x_{i+1}:

F(x) = F_i H_1(x) + F_{i+1} H_2(x) + F'_i H_3(x) + F'_{i+1} H_4(x). \qquad (6.6)
When applied to the empirical CDF, F_i, F_{i+1} denote the Gaussian cumulative frequencies corresponding to x_i, x_{i+1}; F'_i, F'_{i+1} are the derivatives of the function F(x) at the edges of the interval. In order to find an interpolated value F(x), the derivatives in Equation 6.6 have to be determined. An algorithm for this purpose has been developed by Fritsch and Carlson [1980]; please be referred to their work for a detailed derivation. Cubic interpolation of the CDF will be preferred to linear interpolation if differentiability is required for further use of the transformation function (a differentiable CDF automatically results in a differentiable anamorphosis function).
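As a side note, MATLAB's built-in pchip follows the Fritsch-Carlson shape-preserving construction, so a monotone, differentiable CDF interpolant can be obtained by replacing the interp1 call of the previous sketch (x_sorted, F and x_new as defined there):

% Sketch of monotone piecewise cubic interpolation of the empirical CDF.
F_new = pchip(x_sorted, F, x_new);   % monotone, differentiable CDF estimate
z_new = norminv(F_new, 0, 1);        % differentiable anamorphosis follows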
6.2.2 Regression Techniques

If we do not trust that our sample represents reality in a sufficient manner, which can be the case for small sample sizes, I recommend using a regression technique instead of interpolation. The concept of regression takes into account that data can have errors or can be misinterpreted due to a lack of neighboring data points. A best-fit function is obtained by minimizing the deviations from the sample points (for example with the least squares method, Gauss [1963]). This way, all data points still have an influence on the shape of the function, but outliers have somewhat less impact than in interpolation. Compared to a polynomial interpolant, regression can be done with a polynomial of much lower degree, which might be preferred, for example, to reduce the risk of oscillations. Finally, it shall be mentioned that the functions fitted by the regression methods presented in the following paragraphs are differentiable.

Cubic Spline Regression

Piecewise regression is a combination of interpolation using splines and regression. As opposed to interpolation, only a certain number of data points will be honored by this technique; in contrast to usual regression, however, these few data points have to be met exactly. The selected data points of the sample that shall be respected are called knots. Between the knots, least squares regression is used to fit a polynomial of the desired order; transitions at the knots are smooth, as expected from spline techniques. In this study, cubic spline regression will be used as one of several regression techniques. Please be referred to Poirier [1973] for a derivation.

Expansion in Hermite Polynomials

Whenever fitting a function to sparse data, one is usually groping in the dark. To shed some light on the fitting process, all available information on the shape of the function should be considered. Within the scope of this study, the function to estimate - the anamorphosis function - relates a Gaussian equivalent to any type of input variable. This fact should be reflected by the fitting process in order to use the information we have. This can be done by an expansion in Hermite polynomials, which are related to the Gaussian probability density. Their definition is given by Rodrigues's formula

H_n(z) = \frac{1}{\sqrt{n!}} \, \frac{1}{g(z)} \, \frac{d^n g(z)}{dz^n} \qquad (6.7)

with H_n the polynomial of order n, the standard normal variable z and its Gaussian pdf g(z). The polynomial of order n + 1 is related to the one of order n, so they can be built according to a recurrence relation:

H_{n+1}(z) = -\frac{1}{\sqrt{n+1}} \, z \, H_n(z) - \sqrt{\frac{n}{n+1}} \, H_{n-1}(z), \quad n > 0. \qquad (6.8)
Please be referred to Rivoirard [1994] for derivations and details on various applications of expansion in Hermite polynomials.
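For later use, the recurrence in Equation 6.8 can be evaluated numerically as in the following MATLAB sketch, which assumes the normalization used here (H_0 = 1 and H_1(z) = −z):

% Sketch of the normalized Hermite polynomials via the recurrence (Eq. 6.8).
p = 13;                               % highest order of the expansion
z = linspace(-3, 3, 101)';            % evaluation points
H = zeros(numel(z), p + 1);
H(:, 1) = 1;                          % H_0(z) = 1
H(:, 2) = -z;                         % H_1(z) = -z
for n = 1:p-1                         % column n+2 holds H_{n+1}(z)
    H(:, n + 2) = -z .* H(:, n + 1) / sqrt(n + 1) - sqrt(n / (n + 1)) * H(:, n);
end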
Because of their relation to the normal distribution, Hermite polynomials have the property of orthogonality: the inner product (which corresponds here to the covariance) of two polynomials of different order equals zero. Because of this convenience and the possibility to establish a direct analytical relationship between any random variable x and its Gaussian equivalent z, the expansion in Hermite polynomials has been widely used in geostatistics, for example in disjunctive kriging [Rivoirard, 1994]. To fit the empirical anamorphosis function with Hermite polynomials, we take advantage of the fact that any function f(z) of a standard normal random variable can be expanded into Hermite polynomials [e.g., Rivoirard, 1994] following the formula

f(z) = \sum_{n=0}^{\infty} f_n H_n(z). \qquad (6.9)
In our case, the original variable x of unknown distribution can be written as x = f(z). Once we have expanded our variable in Hermite polynomials, we can invert the equation and solve for z = f^{-1}(x) to find the Gaussian transform for any value of the sample. Because of the direct relation that can be established between x and z, the expansion in Hermite polynomials as a curve fitting technique only makes sense for the empirical anamorphosis function, not for the empirical CDF. The coefficients of the expansion, f_n, can be calculated with the following expression:

f_n = E\left[f(z) H_n(z)\right] = \int f(z) H_n(z) g(z) \, dz \qquad (6.10)

As the integral can only be solved numerically and we do not know the continuous function x = f(z), but only discrete sample points i = 1...N (N representing the sample size or the number of interval edges), the integral is approximated by the sum

f_n = \sum_{i=1}^{N} \int_{z_i}^{z_{i+1}} x_i H_n(z) g(z) \, dz \qquad (6.11)

which can be simplified to

f_n = \sum_{i=2}^{N} (x_{i-1} - x_i) \frac{1}{\sqrt{n}} H_{n-1}(z_i) g(z_i), \qquad (6.12)
with g(z_0) = 0 and g(z_{N+1}) = 0 at the lower and upper bound of the Gaussian pdf [Rivoirard, 1994]. The coefficients are evaluated using the data points of the empirical anamorphosis function; thus, the quality of the fit directly depends on the sample size. Another constraint on the goodness-of-fit is the considered order of polynomials: it has to be decided when to truncate the expansion. As a measure of quality, the properties of the Hermite polynomials can be used [Wackernagel, 2003, Ortiz et al., 2005]:

\mathrm{Var}\left[f(z)\right] = \sum_{n=1}^{\infty} (f_n)^2 \approx \sum_{n=1}^{p} (f_n)^2, \qquad (6.13)
thus the statistics of the coefficients determine the order p to which the Hermite polynomials should be expanded. Equation 6.13 should be satisfied to a certain degree of exactness, that is, to a certain number of significant digits. Despite their advantageous theoretical properties, Hermite polynomials also bear problems that can occur when fitting data that cluster close to or at the sample bounds. Oscillations cannot be prevented, as polynomials of high order are not able to produce tangential behavior. It has to be decided for the individual case whether the expansion in Hermite polynomials still proves useful.

Methods to Smooth the Empirical CDF

In cases where the expansion in Hermite polynomials is not convenient or insufficiently accurate, alternative regression methods can be drawn on. Here, the idea of fitting the empirical CDF rather than the empirical anamorphosis function (as discussed in Section 6.2.1) shall be pursued. A supposedly simple approach consists in fitting a polynomial to the empirical CDF. Depending on the sample, finding a satisfying fit can be more or less tedious. To obtain a well-founded fit, care should also be taken of the slopes at the end points, such that the function approaches asymptotes of zero and one, respectively. Another constraint has to be satisfied, namely monotonicity. These requirements make it hard to find a polynomial fit of low order and acceptable quality, i.e., with reasonably low deviations from the sample data points. If we assume that our empirical CDF deviates significantly from the distribution of a "perfect" sample due to a small sample size, we might want to redistribute the assigned cumulative frequencies a little. The steps of the CDF curve can be smoothed at each data point by applying a kernel smoother [Bowman and Azzalini, 1997]. Kernel smoothing can be interpreted as a smooth version of a histogram: instead of assigning cumulative frequencies to bars, a kernel smoothing density estimate assigns a polynomial to an interval of a specified bandwidth. The resulting CDF curve is a superposition of the smoothed interval densities and is therefore differentiable. There are many other possibilities to build a continuous CDF that might reflect reality better than the empirical CDF of a small sample. These shall not be deepened within this study; instead, the methods exposed so far are summarized in Table 6.1 and will be further examined and assessed with regard to their exactness and computational effort in Section 6.4.
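Before moving on, the kernel smoothing variant can be illustrated with a short MATLAB sketch; it assumes the Statistics Toolbox function ksdensity, which can return a CDF estimate directly, and clamps the result to avoid cumulative frequencies of exactly zero or one (which would map to infinite Gaussian values).

% Sketch of a kernel smoothing CDF estimate followed by the anamorphosis step.
x = betarnd(0.5, 1, 500, 1);                       % exemplary sample
[F_ks, x_grid] = ksdensity(x, 'function', 'cdf');  % default bandwidth
F_ks = min(max(F_ks, 1e-6), 1 - 1e-6);             % keep away from 0 and 1
z_grid = norminv(F_ks, 0, 1);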
Table 6.1: Overview of suggested transformation methods. CDF stands for the cumulative distribution function of the original data; ANA represents the anamorphosis function that links the original data with the Gaussian transformed data

Classification           | Identifier                     | Fitting Technique                 | Fitting Applied to...
Direct, parametric       | Log transformation             | Parameter estimation by MLM       | Original data
Indirect, parametric     | Distribution function fitting  | Parameter estimation by MLM       | Original data
Indirect, non-parametric | ANA - pointwise                | Linear pointwise interpolation    | Empirical ANA
Indirect, non-parametric | CDF - pointwise                | Linear pointwise interpolation    | Empirical CDF
Indirect, non-parametric | ANA - intervals                | Linear interpolation of intervals | Empirical ANA
Indirect, non-parametric | CDF - intervals                | Linear interpolation of intervals | Empirical CDF
Indirect, non-parametric | CDF - cubic splines            | Cubic spline interpolation        | Empirical CDF
Indirect, non-parametric | ANA - Hermite polynomials      | Expansion in Hermite polynomials  | Empirical ANA
Indirect, non-parametric | CDF - spline regression        | Cubic spline regression           | Empirical CDF
Indirect, non-parametric | CDF - kernel smoothing         | Kernel smoothing                  | Empirical CDF
Indirect, non-parametric | CDF - polynomial fit           | Polynomial fit                    | Empirical CDF

6.3 Definition of Tails

The third and last step towards a continuous anamorphosis function consists in the definition of the tails, to the left of the lower sample bound and to the right of the upper sample bound. Several assumptions have to be made at this point. First of all, we need to decide whether the smallest value of the sample should be treated as the smallest possible value, or whether there is reason to believe that even smaller values could occur. If the latter is true, we have to extend our anamorphosis function to the smallest possibly occurring value. Then the next question arises: which Gaussian value should correspond to the minimum value of the original variable? In Gaussian theory, minus infinity is the smallest "value", so turning our variable x into a perfect Gaussian random variable z would require taking z_min = −∞ into account. For numerical evaluation, this is not an option; therefore, we have to find an appropriate value for z_min that can be numerically processed. Analogous considerations apply to the highest value of the sample, with a theoretical z_max equal to plus infinity. If it is known that the lowest and highest possible values of x are part of the sample, there will be no need to extrapolate the empirical anamorphosis function. In this case, the challenge consists in the transformation of clustered data at the bounds, because if the actual bounds of the population are represented in the sample, those values will most likely occur multiple times. (If the lowest and highest possible values of x occur only once each, the sample will be treated exactly as before; no special consideration is necessary.)
6.3.1 Handling of Clustered Data

In this subsection, I will introduce a way of dealing with clustered data at the bounds and demonstrate my suggestion with the help of an example. Imagine a sample that consists of 100 values between zero and one; 97% of the data are distinct, 3% are assumed to be equal to the minimum of the population, zero. When building the empirical CDF, it has to be decided which cumulative frequency should be assigned to the lower bound. As 3% of the sample take the value zero, we know that the cumulative frequency (determined according to Equation 6.4) of zero lies somewhere between

F_1(0) = \frac{1 - \frac{1}{2}}{100} = 0.005

(which accounts for the fact that zero is the first value of the sorted sample) and

F_3(0) = \frac{3 - \frac{1}{2}}{100} = 0.025

(which corresponds to the value with rank j = 3); this leads to a jump in the empirical CDF. One possibility to avoid this difficulty would be to extend the infinitesimally small interval around the bound to an interval of numerically processable small length and to interpolate linearly between both cumulative frequencies. This method is not advisable because it produces different transformed values for the same original value. It will be explained later (see Section 6.5.2) why this property is not acceptable for use in Ensemble Kalman Filter applications. Searching for an alternative to deal with clustered data, an easy and stable method is to define cases: if x = 0, a discrete cumulative frequency is used, and for any value x > 0 there is a continuous CDF (it is assumed that - except for clustering due to boundary effects - sample values drawn randomly from a continuous distribution are distinct). For the cumulative frequency at x = 0, I suggest taking the mean CDF value of the clustered values:

F(0) = \frac{1}{2}\left(F_3 + F_1\right) = 0.015.

The lower bound of the transformed variable can then be derived as z_min = G^{-1}(F(0)) ≈ −2.17. If we took F(0) = F_3, the tail of the Gaussian variable would be cut off at an unnecessarily close value. Figure 6.5 illustrates this method of choice for handling clustered data at a bound.
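The example can be reproduced with the following MATLAB sketch, which assigns the mean cumulative frequency to the values clustered at zero:

% Sketch of the suggested handling of data clustered at the lower bound.
x = [zeros(3, 1); sort(rand(97, 1))];   % fictitious sample from the example
N = numel(x);
F = ((1:N)' - 0.5) / N;                 % Equation 6.4 for the sorted sample
n0 = sum(x == 0);                       % number of values clustered at zero
F(x == 0) = 0.5 * (F(1) + F(n0));       % F(0) = (F_1 + F_3)/2 = 0.015 here
z_min = norminv(F(n0));                 % approx. -2.17, as derived above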
6.3.2 Extrapolation towards Population Bounds

Extrapolation of the empirical anamorphosis function is required if the minimum and maximum values of the original variable are not part of the sample, which is generally assumed if no information to the contrary is available. To be able to determine the Gaussian transform for any value x, the anamorphosis function needs to be extrapolated past the sample bounds. I suggest defining the empirical anamorphosis function in the same way as described before, because the information about the sample itself has not changed. Therefore, the minimum and maximum values of the population will be treated separately from the sample. Let us denote the lower and upper sample bounds x_1 and x_n, respectively. The theoretical minimum and maximum values of the random variable x are named x_min and x_max, to be consistent with the notation used above. Now we can determine z_1, z_n by finding the Gaussian values corresponding to the cumulative frequencies of the lowest and highest sample values, as explained above. In addition, we have to define z_min and z_max within the interval (−∞; +∞). A literature review has not revealed a commonly adopted method, but rather showed arbitrary choices of values for the Gaussian bounds [Simon and Bertino, 2009, Chilès and Delfiner, 1999]. To proceed in a more systematic and satisfying manner, I will define a meaningful condition that shall be satisfied by the Gaussian bounds. This condition, called the sample range condition from now on, is set up to determine the proportion to which the sample covers the actual data range and to reflect that proportion in the Gaussian transformed variable. In other words, if the bounds of the sample lie close to the real bounds, the transforms of both the sample bounds and the real bounds should lie close together as well. If the opposite is the case and the range of the sample is very small in comparison to the actual range the data could have, then the transforms should show a large difference between z_min and z_1 or z_n and z_max. To put this idea into a mathematical equation, I have formulated the sample range condition:

\frac{x_1 - x_{min}}{x_n - x_1} \overset{!}{=} \frac{z_1 - z_{min}}{z_n - z_1}; \qquad (6.14)

\frac{x_{max} - x_n}{x_n - x_1} \overset{!}{=} \frac{z_{max} - z_n}{z_n - z_1}. \qquad (6.15)
[Figure 6.5: Dealing with clustered data at the lower bound of the fictitious sample. (a) Definition of the CDF at the lower bound; (b) interpolation of the empirical anamorphosis function at the lower bound. The ellipse highlights the discontinuity at the clustered data point]
Solving the sample range condition, unique values for z_min and z_max are found:

z_{min} = z_1 - \frac{x_1 - x_{min}}{x_n - x_1} \left(z_n - z_1\right); \qquad (6.16)

z_{max} = z_n + \frac{x_{max} - x_n}{x_n - x_1} \left(z_n - z_1\right). \qquad (6.17)
This is illustrated in Figure 6.6. Please note that the slope of the extrapolation is equal to the average slope of the anamorphosis segment (in the sense that the straight line is characterized by the starting and ending points (x_1, z_1) and (x_n, z_n), respectively), which can be interpreted as simple linear scaling in the absence of any information about the correct transformation of the tails and is consistent with the principle of parsimony.
[Figure 6.6: Defining minimum and maximum values for the Gaussian transform; sample bounds and population bounds are marked on the empirical anamorphosis function]
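In code, the sample range condition reduces to two lines; the sketch below assumes known physical bounds of zero and one and a synthetic sample for illustration.

% Sketch of the sample range condition (Equations 6.16 and 6.17).
xmin = 0;  xmax = 1;                    % assumed physical bounds of x
x = betarnd(2, 2, 200, 1);              % exemplary sample within (0, 1)
N = numel(x);
x1 = min(x);  xn = max(x);              % sample bounds
z1 = norminv(0.5 / N);                  % transform of the lowest sample value
zn = norminv((N - 0.5) / N);            % transform of the highest sample value
z_min = z1 - (x1 - xmin) / (xn - x1) * (zn - z1);   % Eq. 6.16
z_max = zn + (xmax - xn) / (xn - x1) * (zn - z1);   % Eq. 6.17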
Once the minimum and maximum Gaussian values have been defined, the transformed variable will show properties similar to those of a standard-normally distributed variable, but not exactly the same, because it is cut off at z_min and z_max (or z_1 and z_n) instead of covering the interval [−∞; +∞]. The effects of the cut-off can be seen in the properties of the transformed variable that was presented in Section 6.1.2. The statistics of the untransformed variable x, the transformed variable z and the theoretical Gaussian moments are listed in Table 6.2.

Table 6.2: Statistics of untransformed and transformed variable in comparison with theoretical values for a Gaussian variable

Variable                  | Mean  | Variance | Skewness | Kurtosis
Untransformed variable x  | 0.452 | 0.119    | 0.139    | 1.543
Transformed variable z    | 0.000 | 0.997    | 0.000    | 2.834
Gaussian random variable  | 0     | 1        | 0        | 3

While the odd moments mean and skewness of the transformed variable match the Gaussian statistics, the even moments variance and kurtosis deviate slightly from the properties of a standard Gaussian distribution. This can be attributed to the missing tails towards infinity: for the odd moments, symmetry is the crucial factor, so truncation of the tails will not have a major influence, and the moments will be close to the Gaussian statistics as long as the remaining distribution is symmetric. For the even moments, however, the sum of the deviations from the mean grows asymptotically with z going towards plus and minus infinity. Truncation long before infinity therefore yields smaller values for the even moments. The definition of the tails directly determines the degree of asymmetry of the transformed variable. This is especially dramatic if one bound of the sample represents the actual population bound while the other one does not even come close to the actual limit - in this case, extrapolation on only one side of the anamorphosis function will render the transformed variable fairly asymmetric and thus violate one of the main assumptions for Gaussian variables. Values of both the odd and even moments will then deviate from the theoretical ones. At this point, it could be worthwhile to think about defining the tails in such a way that symmetry is still ensured for the transformed variable. Due to time constraints, I will focus on the sample range condition and implement only this method in the ensemble Kalman filter.
6.3.3 Fitting the Tails

We have now determined the points of the empirical anamorphosis function within the range of the sample, at the bounds of the sample and at the theoretical bounds of the original variable's population. These data points have to be fitted to create a continuous anamorphosis function that covers the whole range of possible values to be transformed. The thoughts on appropriate interpolation or regression techniques presented previously remain valid at this point, too, although they now lead to a different strategy. The difference is that we do not know any details on how to transform data located outside the sample bounds. The rank transformation is based on assigning a rank to each of the sample points and cannot be extrapolated to values that are not part of the sample. We can choose any kind of fitting, as we are not able to judge which one represents the truth better - we simply do not know what the truth looks like. If no elaborate fitting technique improves on linear interpolation, we will pick the latter for simplicity.
Once again, it has to be decided whether to interpolate linearly between the points of the empirical CDF or of the empirical anamorphosis function. To include the information about the minimum and maximum Gaussian transforms obtained from the sample range condition in the empirical CDF, we have to evaluate their cumulative frequencies and assign those to the minimum and maximum values of the population of the original variable. These two data points are added to the empirical CDF and linearly connected to the points of the sample minimum and maximum, respectively. By adding data points to an already built empirical CDF, however, we bias their transformation; we still do not have any clue how to transform these values correctly. Both stages, the empirical CDF and the empirical anamorphosis function, are equally uncertain. Thus, the reasons given above for preferring to interpolate the empirical CDF do not apply here. Performance tests have been carried out and revealed difficulties when linearly interpolating the empirical CDF: the cumulative frequency can take values very close to zero or one, which numerically leads to bounds of the transformed variable of minus or plus infinity. For these practical reasons, the linear interpolation of the tails will be done on the anamorphosis function, as opposed to the approach chosen for the interpolation of the anamorphosis function itself (Section 6.2.1).
6.4 Exactness of Transformations

The methods described in Section 6.2 differ in their theoretical foundation as well as in their ease of implementation. In order to choose those methods that are appropriate for a specific problem and result in the most accurate and robust transformation, it is necessary to examine the behavior of the different suggested methods depending on the sample size and the type of data. This investigation is carried out in this section.
6.4.1 Performance Test Procedure

To be able to assess the quality of the transformation, a known transformation shall be performed: we draw a sample from a beta-distributed random variable and transform this sample into an almost Gaussian sample. The beta distribution family is chosen because it accounts for physical bounds of data values and because of its relevance for groundwater flow and transport variables (see Section 6.5.1). The correct transformation would be

z = G^{-1}\left(F_{Beta}(x)\right), \qquad (6.18)

with F_{Beta} being the beta distribution function as defined in Section 4.1.1. The transformation would be perfect if we took an infinite number of sample values into account; as we have a limited sample size, our transformation will deviate from the perfect one. The magnitude of this deviation will be taken as an indicator of the accuracy of the different methods.
As most theoretical distribution functions naturally do not consider clustering, I will focus on the basic data transformation within the range of the sample in this section. The justification of linear extrapolation towards population bounds will not be verified either, because this would require a different testing methodology and exceed the time frame given to this section. The performance tests will be carried out as follows: four different distributions will be considered, namely a quite good-natured random variable that could correspond to a head distribution, a uniformly distributed random variable, a positively skewed random variable and, finally, a random variable with a bimodal distribution that could approximate the distribution of concentration data as obtained when implementing simple source geometries like the one described in Chapter 7. All four distribution types belong to the family of beta distributions and are plotted in Figure 6.7. The corresponding perfect anamorphosis functions (according to Equation 6.18) are displayed in Figure 6.8; obviously, the shape of the anamorphosis function depends on the parameters of the beta distribution the random sample is taken from. Note that the beta distribution is chosen because it allows including boundaries of state variable values; for more complex concentration source geometries or other state variables, other distribution functions might be more appropriate and could be assessed by tests similar to those presented in this section. Besides investigating the performance of the different direct and indirect transformation techniques depending on the type of data, the sample size will be varied to quantify the impact of small ensembles on the quality of transformation. To make results comparable, a standardized error will be calculated, namely the root mean square error (RMSE):

\mathrm{RMSE} = \sqrt{ \frac{1}{n_{runs} \cdot n_{plot}} \sum_{i=1}^{n_{runs}} \sum_{j=1}^{n_{plot}} \left( z_{ji} - \hat{z}_j \right)^2 } \qquad (6.19)
with transformed values z_{ji}, the perfect transform \hat{z}_j, n_plot being the number of test values transformed during one run (equal to 1000 for all performance tests) and n_runs being the number of runs (each with a newly drawn random sample). The number of runs was set to 200, which seemed to produce reasonably representative results. Figure 6.9 shows the evolution of the deviations from the perfect transformation depending on the sample size; the plots are obtained from linear interpolation of the empirical CDF. It can be observed that, with increasing sample size, the transformation becomes more exact and the plots of all 200 transformations gather more closely around the perfect anamorphosis line.
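The test loop itself is compact; the following MATLAB sketch reproduces the procedure for the 'CDF - pointwise' method and a beta(2,2) sample (the parameter values are those used in this section, while the implementation details are illustrative):

% Sketch of the performance test (Equation 6.19).
a = 2; b = 2;  N = 200;  n_runs = 200;  n_plot = 1000;
sq_err = 0;
for i = 1:n_runs
    x_s = sort(betarnd(a, b, N, 1));                % new random sample
    F_s = ((1:N)' - 0.5) / N;                       % empirical CDF, Eq. 6.4
    x_t = x_s(1) + (x_s(end) - x_s(1)) * rand(n_plot, 1);   % test values in range
    z_ji = norminv(interp1(x_s, F_s, x_t, 'linear'));       % transformed values
    z_hat = norminv(betacdf(x_t, a, b));            % perfect transform, Eq. 6.18
    sq_err = sq_err + sum((z_ji - z_hat).^2);
end
RMSE = sqrt(sq_err / (n_runs * n_plot));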
[Figure 6.7: Probability density functions of the beta distributions representing the different data types: beta(2,2), beta(1,1), beta(0.5,1) and beta(0.5,0.7); parameters a, b are given in parentheses]
[Figure 6.8: Perfect anamorphosis functions, depending on the parameters of the beta distribution function that the sample is taken from]
[Figure 6.9: Deviations from the perfect transformation for sample sizes N = 50, 200 and 1000. Note that the lowest and highest values of the original variable depend on the randomly drawn sample]
6.4.2 Performance of Interpolation Techniques

The performance of the different transformation methods shall now be assessed. First, I compare the linear interpolation techniques among one another. Linear interpolation was applied to all data points of the empirical CDF (labeled 'CDF - pointwise') or to intervals of it (labeled 'CDF - intervals'), and the obtained piecewise linear functions were transformed according to Equation 6.2. These two resulting transformations are contrasted with linear interpolation of all data points of the empirical anamorphosis function (labeled 'ANA - pointwise') or intervals of it (labeled 'ANA - intervals'). Figure 6.10 shows the results for these four techniques, applied to samples from two of the four beta distributions mentioned above (with parameters (1,1) and (0.5,0.7)). Obviously, all methods improve in exactness with increasing sample size, and the rates of improvement are similar. Thus, our first finding is that considerable improvement towards the perfect transformation can be achieved by using a larger sample (for example, increasing the sample size by a factor of 2 - 2.5), especially in the region between 50 and 500 ensemble members. The small additional improvement from 500 to 1000 ensemble members is disproportionate to the large additional amount of data to be processed; such an increase is therefore not recommended. Looking at the four different linear interpolation methods, it can be stated that pointwise interpolation of both the empirical CDF and the empirical anamorphosis function, as well as linear interpolation of CDF intervals, perform almost equally well, independently of the chosen type of beta distribution. Linear interpolation of empirical anamorphosis intervals, however, scores worst and shows a significantly higher RMSE for the more complicated beta distribution with parameters 0.5 and 0.7, supporting the decision in favor of interpolating the empirical CDF rather than the empirical anamorphosis function (see Section 6.2.1).
[Figure 6.10: Performance of linear interpolation techniques, depending on sample size. (a) Sample drawn from beta(1,1) distribution; (b) sample drawn from beta(0.5,0.7) distribution]
The resulting transformations for the beta(2,2) distribution behave very similarly to those for beta(1,1), and the transformations for beta(0.5,1) and beta(0.5,0.7) likewise behave very much alike; their performance plots are therefore not shown here. It can be concluded that linear interpolation of empirical anamorphosis function intervals does not perform well for any type of data dealt with in this study and therefore should not be implemented in the ensemble Kalman filter. Béal et al. [2010] have used this method, but they did not assess the accuracy of their transformation. The authors of that paper suggest using a number of intervals much smaller than the number of data points, with equidistant intervals if possible. Based on my investigations, I cannot support these recommendations. Firstly, it has been shown that linear interpolation of all data points of the empirical anamorphosis function scores dramatically better than the interpolation of intervals. Secondly, I found that adjusting interval lengths depending on the type of distribution of the original variable can yield better results than equidistant intervals (results not shown here). This is true for the interpolation of CDF intervals as well, but has not been examined in depth, because I believe that the definition of the intervals should be made individually for a specific data set. It seems logical that for data of a more complex shape, like samples from the beta(0.5,0.7) distribution, one should choose flexible interval lengths, for example with a constant number of data points in each interval. In this way, one can capture any peculiar shape of the CDF or of the anamorphosis function. Please note that, for simplicity, I only used equidistant intervals for the calculations that underlie the performance plots shown in this section. Nevertheless, the number of intervals has been increased for larger ensemble sizes. It was also tested whether cubic spline interpolation of the empirical CDF would lead to an improvement compared to linear interpolation of all points, which was not the case: the deviations from the perfect anamorphosis function were almost the same. Apart from the fact that spline interpolation produces a differentiable anamorphosis function, there is no benefit in applying this interpolation technique. To summarize the first part of my performance assessment, I would like to point out that pointwise interpolation of the empirical CDF and of the empirical anamorphosis function perform practically equally well and - given the theoretical considerations presented in Section 6.2.1 - linear interpolation of all data points of the empirical CDF will be preferred. Linear interpolation of CDF intervals seems to be a good choice for any type of data. Some experimentation showed that flexible handling of interval lengths can improve performance and can outperform pointwise interpolation for almost all data types and sample sizes. In stark contrast, linear interpolation of anamorphosis function intervals is strongly discouraged, based on both the theoretical concerns and the disappointing performance.
6.4.3 Performance of Expansion in Hermite Polynomials

For further examination, I will compare different regression methods with pointwise linear interpolation of the empirical CDF as a reference. The performance plot in Figure 6.11 displays the root mean square error of the reference method as well as of a method that ensures a differentiable anamorphosis function: expansion in Hermite polynomials of order 13 (labeled 'ANA - Hermite polynomials'). While scoring badly for a sample drawn from a beta(0.5,0.7) distribution, the expansion in Hermite polynomials gives a slightly better estimate of the exact transformation for small samples taken from a beta(2,2) distribution. Yet, we have to consider that higher-order polynomials tend to oscillate. This is a major problem, because we need to invert the expansion x = f(z) in order to obtain the transformed variable z corresponding to x. The inverse only exists where the expansion is one-to-one; thus, we can only use the part of the expansion that does not oscillate. Figure 6.12 gives an example of the expansion for a sample of size N = 100, drawn from a beta(2,2) distribution. Please note that the transformed variable z is now plotted on the abscissa. The plots of the expansion for all of the 200 runs show that, depending on the random sample, the range of values of the original variable that can be transformed is significantly reduced: only the range in which the expansion is monotonic can be used. The attentive reader may have noticed that the RMSE of the reference case in Figure 6.11b is slightly lower than the one plotted in Figure 6.10b, although it is based on the same sample distribution. This can be explained by missing values that could not be transformed due to oscillations of the Hermite polynomials - those values were not transformed linearly either, in order to keep the results comparable. The proportion of untransformed values is quantified in Figure 6.13a. This graph compares the percentage of sample values that could not be transformed because the expansion in Hermite polynomials could not be inverted for them. It becomes clear that the fraction of untransformed values increases with decreasing sample size. This undoes the positive performance result mentioned above, which only applied to small sample sizes. Also, the percentage of untransformed values is by far higher for distributions with a more complex shape, like the beta(0.5,1) or beta(0.5,0.7) distributions, represented by the pink and green bars. To prevent oscillations, the expansion could be truncated at a lower polynomial order, but this would reduce the accuracy of the regression. Equation 6.13 determines which value the sum of the squared coefficients f_n would take on for an infinite expansion; the deviation from that value is a measure of the loss of accuracy due to truncation. The maxima of the absolute deviations are plotted in Figure 6.13b. For small sample sizes and expansion in polynomials up to order 13, the variance of the original variable can be reproduced up to the third digit. It has to be decided individually for specific cases whether polynomials of lower order can still be considered a reasonably good fit.
[Figure 6.11: Performance of interpolation and regression techniques ('CDF - pointwise' reference vs. 'ANA - Hermite polynomials'), depending on sample size. (a) Sample drawn from beta(2,2) distribution; (b) sample drawn from beta(0.5,0.7) distribution]
[Figure 6.12: Oscillations of Hermite polynomials (p = 13, N = 100)]

It has been shown that the expansion in Hermite polynomials only improves the transformation towards the perfect one for less complex distribution shapes and small samples; at the same time, for small samples, the expansion causes difficulties, as a high fraction of the sample cannot be transformed due to oscillations and the variance of the sample cannot be satisfyingly reproduced. Additionally, it has to be remarked that the inversion of the expansion in Hermite polynomials is computationally very costly in comparison to all other fitting techniques. Consequently, I dismiss expansion in Hermite polynomials as a regression technique for the types of data that are of interest within this study.
6.4.4 Performance of Other Regression Techniques

The performance of the CDF smoothing techniques presented in Section 6.2.2 shall now be examined. Figure 6.14 visualizes the RMSE of the reference case, of a cubic spline regression fit to the CDF, of a kernel smoothing CDF estimate and of a polynomial fit of third order to the CDF. Monotone cubic spline regression was implemented with the help of the code for "Shape Prescriptive Modeling" provided by John D'Errico [2008] on the MATLAB CENTRAL File Exchange platform. As constraints, a monotonic increase was prescribed, and the lowest and highest values of the sample had to be respected by the fit.
[Figure 6.13: Properties of expansion in Hermite polynomials depending on sample size. (a) Percentage of untransformed values due to oscillations of Hermite polynomials; (b) maximum absolute deviations from the variance of the original variable; bars for beta(2,2), beta(1,1), beta(0.5,1) and beta(0.5,0.7)]
[Figure 6.14: Performance of CDF smoothing techniques, depending on sample size. (a) Sample drawn from beta(2,2) distribution; (b) sample drawn from beta(0.5,1) distribution]
For a sample from a good-natured distribution like the beta(2,2) distribution, it can be observed that spline regression performs slightly better than the reference case if the sample size is small, i.e., for N ≤ 200. This positive effect does not occur for more complex distribution shapes, as shown in Figure 6.14b: here, spline regression performs worse than the reference case even for small sample sizes. Both of the other suggested smoothing techniques produced disappointing results, as their deviations exceeded those of the reference case by far. This can be attributed to a bad fit of the CDF, which is illustrated by Figure 6.15.

[Figure 6.15: Regression techniques (polynomial fit and kernel smoothing) to smooth the empirical CDF, for samples from the beta(2,2) and beta(0.5,1) distributions]

While the kernel smoothing estimate of the CDF does not seem to be that bad, it fails to correctly represent the shape at the bounds of the distribution and thus leads to large deviations from the perfect transformation. The polynomial fit is obviously not a good choice; it cannot provide a good estimate of the CDF, because it is highly constrained (horizontal slope at the bounds) while being of low order. Implementing a higher-order fit might help, although the results of the performance tests do not give reason to believe that this would significantly improve the estimation of the transformation function (consider, for example, the performance of spline regression applied to the CDF).
6.4.5 Methods of Choice Based on Performance Tests

Summarizing my findings on the performance of non-parametric methods to build a continuous anamorphosis function, it can be stated that linear interpolation of all points of the empirical CDF, or of intervals of it, leads to stable and reasonably accurate estimates of the perfect transformation. Accuracy can be improved significantly with increasing sample size. For small samples, however, cubic spline regression could be an alternative to create a continuous input for the anamorphosis process; its applicability depends on the complexity of the shape of the variable's distribution. In general, it is recommended to fit a parametric distribution function to the empirical CDF whenever there is reason to believe that the original variable follows a specific type of theoretical distribution. This can be determined from the CDF itself or from a theoretical approach, if the physical properties of the variable suggest a certain distribution, as will be discussed in Section 6.5.1. To complete the performance tests, I fitted a beta distribution to the randomly generated sample (the fitted parameters then will not be exactly equal to the parameters specified to draw the sample) and calculated the deviations from the transformation with the correct beta distribution parameters. Those deviations were about 20 % smaller than the error made by linear interpolation of the CDF, independently of the sample size. Of course, such a test is bound to show good results, because the same type of distribution function is fitted that the sample has been drawn from. In practice, we can expect such good performance if there is reliable knowledge about the underlying distribution of a variable. In conclusion, I suggest proceeding according to the following scheme (a code sketch for the most common branch follows the list):

1. Choice of technique to build a continuous anamorphosis function:
   a) If a theoretical distribution function can be inferred from the physical properties of the variable, estimate the corresponding parameters and use the distribution function as continuous CDF.
   b) If non-parametric methods shall be applied instead, check for clustering:
      i. If there is clustering, calculate the cumulative frequencies according to Section 6.3.1.
      ii. If there is no clustering, go on to c).
   c) Choose an appropriate fitting technique based on sample size and empirical distribution shape:
      i. If the sample size is smaller than or equal to 200, apply cubic spline regression to the empirical CDF or linearly interpolate the intervals of the CDF; consider using flexible interval lengths depending on distribution shape.
      ii. If the sample size is larger than 200, linearly interpolate the empirical CDF or its intervals; again, consider using flexible interval lengths.
2. Definition of the tails of the anamorphosis function:
   a) If there is clustering, define the jumps at the clustering bounds according to Section 6.3.1.
   b) If the population bounds are not covered by the sample, linearly extrapolate the anamorphosis function according to the sample range condition (Equations 6.16 and 6.17).
   c) If the population bounds are equal to the sample bounds, but there is no clustering: no special definition of the tails is necessary.
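As a minimal illustration of branch 1c-ii, the following MATLAB sketch builds an empirical anamorphosis function by linear interpolation of the CDF and maps a sample to normal scores. The function name and the plotting positions (i − 0.5)/N are illustrative assumptions, not taken from the thesis code; norminv requires the Statistics Toolbox.

```matlab
% Hypothetical helper: empirical Gaussian anamorphosis via linear
% interpolation of the CDF (branch 1c-ii of the scheme above).
function [z, psi] = empirical_anamorphosis(x)
    N  = numel(x);
    xs = sort(x(:));                      % order statistics
    F  = ((1:N)' - 0.5) / N;              % assumed plotting positions
    zs = norminv(F);                      % standard-normal quantiles
    % collapse ties so that equal x-values get one common z-value
    [xu, ~, ic] = unique(xs);
    zu = accumarray(ic, zs, [], @mean);
    % piecewise-linear anamorphosis function psi: x -> z
    psi = @(q) interp1(xu, zu, q, 'linear', 'extrap');
    z   = psi(x(:));                      % transformed sample
end
```

Averaging the normal scores within a tie group is one simple way to keep psi single-valued; where clustering is substantial, the dedicated treatment of Section 6.3.1 replaces this step.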
6.5 Implementation in Ensemble Kalman Filter

Now that we have found transformations that render arbitrary data almost univariate Gaussian, we have to rewrite the Ensemble Kalman Filter to handle the transformed variables adequately. A few pitfalls have to be considered, namely typical properties of the variables to be transformed and the scaling of the measurement error.
6.5.1 Accounting for Properties of State Variables

The transformation of variables gives the opportunity to introduce knowledge about non-stationarity of the variable's properties or about physical constraints. This will be shown in this subsection.
Non-Stationarity of State Variable Statistics

For state variables relevant in groundwater modeling, stationarity can generally not be assumed, because they do not behave in the same way at different locations. This is a direct consequence of the boundary conditions imposed on a state variable and has been touched on before in Chapter 1. An illustrative example is concentration: it has a much smoother distribution within the plume than at its edge, where jumps between zero outside the plume and a relatively high value within the plume can occur. Another factor that influences the degree of non-stationarity is the time frame considered by the model: due to dispersion, the concentration distribution becomes smoother the farther the solute has traveled. See Figure 1.1 once again.

In the following, applying a Gaussian transformation to a sample of a variable x that consists of the simulated values of one realization at all grid points will be called the global approach. This procedure requires the assumption of "full stationarity": not even the commonly assumed second-order stationarity of spatial data [Gelhar, 1993], which refers to constant mean and variance, would be sufficient to justify the definition of a complete distribution function that describes the behavior of the variable all over the domain. As an alternative to the global approach, which is supported by Simon and Bertino [2009], the local approach has been introduced by Béal et al. [2010]: to draw a sample, all simulated values of the N realizations at one grid point are collected. There have been concerns about the continuity of state variables when using the local approach [Béal et al., 2010]. Those concerns do not apply to data assimilation techniques that aim at parameter estimation, because the state variables will be adjusted by a new simulation with updated parameters rather than being updated themselves, as would be the case for state estimation. This procedure ensures the physical consistency of the states, because they are forced by the partial differential equation solved during the simulation. Also, the spatial correlation of the state variables is preserved through the new simulation; it is not affected by the local point-wise transformation. As opposed to other data assimilation techniques, the ensemble Kalman filter offers the chance to build an empirical distribution function based on the ensemble of realizations at one grid point. Any method that does not work with a Monte Carlo ensemble does not have this opportunity and can only draw a sample from spatially distributed data, and therefore implicitly has to assume stationarity. In this study, we will put the local approach into practice to avoid (obviously unjustified) assumptions of full stationarity. As a consequence, we will need to build an anamorphosis function for every observation point where measurement data are available. The number of observations shall be denoted by n_meas. Each of the n_meas empirical anamorphosis functions will be built from an ensemble of N data points taken from the realizations.

Physical Bounds

The Ensemble Kalman Filter with transformed data shall be applied to a groundwater model in this study. This implies that we have to deal with an ensemble of simulations and a number of "real" observations for assimilation. The parameter to be updated will be log-conductivity; the state variables that contribute to the updating process will be heads, concentration and drawdown. The local empirical anamorphosis functions will be constructed with the data of the whole ensemble at each measurement location. As it cannot be guaranteed that the range of the ensemble values of one of the variables contains the actual observation at this grid point, it is necessary to extend the transformation function up to the physical bounds, as explained above. Furthermore, it might occur that the "real" measured value lies outside the physically possible range of measurement values; this could happen due to measurement errors. Thus, we have to extend the transformation function even further to take these values into account, but at the same time correct them based on our a priori knowledge about the physical range of values. Figure 6.16 shows a mixture of clustering at the lower bound, zero, and extrapolation to the physical upper bound, one, as well as the extension towards minus or plus infinity to cover any input value x. The anamorphosis function is a convenient tool to introduce such corrections. So far, unphysical observations have been corrected manually [e.g., Clark et al., 2008], which compromises the random character of the measurement error. It will be shown in Section 6.5.3 how the measurement error is transformed without the need for a pre-processing step to ensure physical measurement values. Let us now have a look at the physical bounds inherent to the variables that will be examined in the groundwater model. The types of bounds can be categorized into positiveness, one-sided limitation and double-sided limitation, and will be discussed in the following.
Figure 6.16: Anamorphosis function. Illustration of clustering, ensemble and physical bounds and extension towards ± infinity (for any x < x_min: z = z_min; for any x > x_max: z = z_max)

Positiveness Physical state variables like drawdown, concentration or discharge take on positive values by definition. The required non-negativity is ensured by setting the lower bound to zero. Non-negative variables can be represented, e.g., by a log-normal distribution with a tail towards plus infinity.

Bound on one side If we encounter one head Dirichlet boundary condition in combination with a Neumann boundary condition, the head distribution will be bounded on one side by the Dirichlet value. A similar scenario is created by two head Dirichlet boundary conditions if recharge is taken into account; here, the head will never fall below the lower of the two boundary values. This holds analogously for domains with sources and sinks.

Bounds on both sides Two Dirichlet boundary conditions defining the main flow direction lead to hydraulic head distributions with two bounds. If no sources or sinks are present, all of the head values will lie between the two boundary values. It has been shown by Nowak et al. [2008] that head distributions in boundary-influenced domains can be well represented by fitted beta distributions. Similarly, solute concentrations cover values between zero and the initial condition for concentration and follow a beta distribution function as derived by Bellin and Tonina [2007].
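The tail extension of Figure 6.16 can be sketched as follows. This is a minimal illustration assuming known physical bounds [xlo, xhi]; the Gaussian quantiles assigned to the bounds are an assumption standing in for the exact sample range condition (Equations 6.16 and 6.17), and the function name is hypothetical.

```matlab
% Hypothetical helper: extend an empirical anamorphosis (support points
% xu, zu from the previous sketch) linearly up to physical bounds and
% clamp unphysical inputs, as illustrated in Figure 6.16.
function psi = extend_to_bounds(xu, zu, xlo, xhi, N)
    % assumed quantiles for the bounds, chosen just outside the range
    % of the plotting positions (i - 0.5)/N so monotonicity is kept
    zlo = norminv(0.25 / N);
    zhi = norminv(1 - 0.25 / N);
    keep = xu > xlo & xu < xhi;           % avoid duplicate abscissae
    xs = [xlo; xu(keep); xhi];
    zs = [zlo; zu(keep); zhi];
    % inputs outside [xlo, xhi] are clamped to the bound values, which
    % implicitly corrects unphysical observations
    psi = @(q) interp1(xs, zs, min(max(q, xlo), xhi), 'linear');
end
```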
6.5.2 Comparability of Observations and Simulated Measurements

Updating a parameter field by applying a correction term requires that this term be very sensitive to "meaningful" differences between the values that are subtracted, but at the same time it must not introduce any differences that are not caused by physical processes. This implies that the data transformation must not introduce differences that were not present before. A special case of a difference between two values is a difference equal to zero, i.e., the simulated observation equals the observed measurement. This is the optimal outcome of a simulation, as the observation has been perfectly reproduced; consequently, no correction should be performed in the updating step. Both the observed measurements and the simulated measurements from all realizations are transformed with the same anamorphosis function, because we cannot determine an independent anamorphosis function for the observations: firstly, we only have one data point per variable and measurement location, and secondly, we assume that our model is error-free and thus, implicitly, that the real observation originates from the same population as the simulated values of our ensemble, only corrupted by measurement error. Transforming observations and simulations with the same function ensures comparability between both transformed variables, but it is also a source of clustering. Special attention has to be paid here, as clustering is an accumulation of equal values that have to be transformed equally. The method for handling clustering suggested in Section 6.3.1 has been developed to fulfill the requirement of a zero error for measurements that are exactly the same.
6.5.3 Transformation of Measurement Error

To determine the discrepancy between reality and the simulation based on the parameters of realization i, a random error ε_i has to be added to the simulated measurements in order to create conditions comparable to the observed measurements y_o, which consist of the actual value of the state variable and a measurement error that is unknown, but assumed to be Gaussian distributed with zero mean. To be consistent with realistic data collection, it cannot be distinguished which fraction of the measured value is due to measurement error. Yet, to maintain the traditional analysis scheme of the ensemble Kalman filter, the simulated measurements and their error have to be transformed separately. In order to still obtain comparable transformed values ŷ_{u,i} + ε̂_i and ŷ_o (which is crucial for the effectiveness of the update, as discussed in the previous section), there are different approaches to transforming the measurement error ε_i.
Scaling of Measurement Error Variance

A first possibility would be to scale ε with the same factor by which the simulated measurement has been transformed:

\hat{\epsilon}_i = \epsilon_i \, \frac{\hat{y}_{u,i}}{y_{u,i}}   (6.20)

This would maintain the fraction of measurement error with respect to the total measured value and turn ε̂ into a normally distributed variable with zero mean and standard deviation equal to σ_ε multiplied by the scaling factor ŷ_{u,i}/y_{u,i}. The drawback of this intuitive method is the fact that, for nonlinear anamorphosis functions, this scaling will not yield the same result as the transformation of the real measurements, which already include a measurement error. Consider a short, arbitrary example to point this out: our actual measurement reads 3, and the measurement error is assumed to be equal to 1. Separate nonlinear transformation according to a function ψ(x) = x² and scaling would yield a transformed total measurement of 3² + 1 · 3²/3 = 12; transformation of the original total measurement results in ψ(3 + 1) = 16. If strict comparability between simulated perturbed measurements and real observations is demanded, the following condition can be formulated:

\psi(y^* + \epsilon^*) \stackrel{!}{=} \psi(y^*) + \hat{\epsilon}^*   (6.21)

This ensures, for a specific value y* and an ε* assigned to it, that transformation of the sum of both yields the same result as separate transformation.

Transforming Measurement Error According to Anamorphosis Function

Equation 6.21 can be used directly as a scaling method by rewriting it:

\hat{\epsilon}^* = \psi(y^* + \epsilon^*) - \psi(y^*)   (6.22)

In words: choose ε̂ such that the transformation of the sum corresponds to the sum of the transformed value plus the scaled error. This can be interpreted as determining the local secant of the anamorphosis function. The procedure guarantees that the observed measurements and the simulated measurements plus their random error are treated equally, but the error then no longer follows a Gaussian distribution. Both methods will be implemented within this study and assessed with regard to their quality of prognosis in Chapter 8.
6.5.4 Parameter Updating Step

Equation 5.7, presented within the context of the ensemble Kalman filter analysis scheme (Section 5.2.2), will be modified in order to incorporate Gaussian anamorphosed variables:

s_{c,i} = s_{u,i} + \tilde{Q}_{s\hat{y}} \left( \tilde{Q}_{\hat{y}\hat{y}} + \hat{R} \right)^{-1} \left( \hat{y}_o - \left( \hat{y}_{u,i} + \hat{\epsilon}_i \right) \right)   (6.23)

with the hat symbolizing anamorphosed variables. The Gaussian anamorphosis function links the transformed variables with the original ones: for each realization i, the vector of the transformed values at the k = 1 ... n_meas measurement locations is obtained by evaluating the individual anamorphosis functions ψ_k. Methods to obtain and evaluate the anamorphosis function have been discussed extensively in Sections 6.1 - 6.3. The covariance matrix Q̃_ŷŷ can be calculated from the transformed measurements ŷ_{u,i}, as can the cross-covariance matrix Q̃_sŷ.

The determination of the transformed measurement error variance R̂ depends on the chosen approach to transform the measurement error ε. If transformation technique 1 (transformation of the measurement error according to the anamorphosis function) is chosen, the error variance is calculated numerically rather than determined analytically, to be consistent with ε̂:

\tilde{Q}_{\hat{y}\hat{y}} + \hat{R} = \mathrm{Cov}\left( \hat{y}_{u,i} + \hat{\epsilon}_i \right) = \mathrm{Cov}\left( \hat{y}_{u,i} \right) + \mathrm{Cov}\left( \hat{\epsilon}_i \right) \;\Rightarrow\; \hat{R} = \mathrm{Cov}\left( \hat{\epsilon}_i \right)   (6.24)

The off-diagonal elements will be set to zero, since independence of the measurement errors is assumed. Note that, within this approach, the stability and accuracy of the transformed measurement error variance R̂ is strongly influenced by the ensemble size N. Therefore, it is suggested to generate an additional ensemble of perturbed measurements: to the existing ensemble of simulated measurements, new randomly drawn measurement errors are added. The transformation of both the original perturbed ensemble y_u + ε and the additional perturbed ensemble y_u + ε_add will double the amount of transformed measurement errors available for the calculation of the variance. Of course, any multiplying factor n_add can be chosen to obtain an ensemble of transformed errors of size n_ε:

n_\epsilon = (n_{add} + 1) \, N   (6.25)

For transformation technique 2 (scaling of the measurement error variance), the transformed variance is analytically defined by

\hat{R} = R \, \frac{\mathrm{diag}\left( \tilde{Q}_{\hat{y}\hat{y}} \right)}{\mathrm{diag}\left( Q_{yy} \right)}   (6.26)

which also results in non-zero elements only on the main diagonal. Here, no additional perturbed measurements are required, which is computationally more efficient.
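A minimal MATLAB sketch of Equation 6.23 for the whole ensemble, assuming the transformed quantities are already available as matrices; the matrix names and the helper function are illustrative, and the error covariance follows the numerical variant of Equation 6.24.

```matlab
% S (npar x N): parameter ensemble; Yhat, Ehat (nmeas x N): transformed
% simulated measurements and transformed errors; yo_hat (nmeas x 1):
% transformed observations.
Qsy  = ens_cov(S, Yhat);              % cross-covariance, npar x nmeas
Qyy  = ens_cov(Yhat, Yhat);           % nmeas x nmeas
Rhat = diag(var(Ehat, 0, 2));         % Eq. 6.24, off-diagonals zero
K    = Qsy / (Qyy + Rhat);            % Kalman gain
% update all realizations at once (implicit expansion, R2016b or newer)
Sc   = S + K * (yo_hat - (Yhat + Ehat));   % Eq. 6.23

function C = ens_cov(A, B)
    % ensemble (cross-)covariance of row-wise variables
    N = size(A, 2);
    C = (A - mean(A, 2)) * (B - mean(B, 2))' / (N - 1);
end
```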
6.5.5 Model Bias

The decision to transform the simulated measurements y_u and the synthetic observations y_o with the same anamorphosis function is justified by the assumption that we are working with an error-free model. Otherwise, it could not be claimed that the real observation stems from the same population as the simulated measurements. Despite this assumption, the anamorphosis function is extrapolated beyond the ensemble bounds, because an ensemble is just a sample which cannot cover the total range of the underlying population. Thus, statistically, it is possible that a real observation falls outside the range of the simulated ensemble. This effect is even strengthened by including a random measurement error, which can widen or narrow the range covered by the simulated ensemble.

In general, model bias could be introduced by adopting an inappropriate geostatistical model with a too small variance of log-conductivity or an inadequate spatial structure (e.g., assuming multi-Gaussianity although it is not justified). The flow and transport model could be inaccurate as well, if influential governing processes are neglected or unknown. A third source of model bias is the assumption of incorrect flow boundary conditions. Wrong model boundary conditions lead to two potential outcomes. Firstly, a significant number of realizations could produce measurement values that deviate strongly from the true measurement, i.e., the true measurement lies close to the ensemble bounds. This scenario will force the filter to apply heavy corrections to the parameters that are statistically inconsistent and will most likely not result in a reasonable prognosis. For future research, it is suggested to test whether the ensemble Kalman filter with transformed data performs significantly differently in this case than the conventional filter with untransformed data. Please note that correcting wrong boundary conditions is not the main motivation for applying Gaussian anamorphosis; thus, this should not be a criterion to evaluate the worth of the transformation, although it is an indicator of its robustness. Secondly, wrong boundary conditions could result in a prior ensemble that clearly does not represent a parent distribution of which the true physical system might be a legitimate member, because most of the true observations lie well outside the ensemble range or even the predefined physical range. Then the underlying physical or conceptual a priori model should be questioned, and the preprocessing step of defining physical limits should be repeated. This is clearly not a task to be solved by the anamorphosis procedure, but lies within the responsibility of the modeler. If boundary conditions are known to be uncertain, it could be an option to parameterize them: they could be added to the list of parameters to be estimated and be corrected with the help of the filter. Instead of harming the prediction, the issue of specifying boundary conditions could then contribute to an accurate estimate of the uncertainty of the prognosis.
7 Application to Synthetic Test Case

The performance of the ensemble Kalman filter with transformed data shall be evaluated by identifying and quantifying the impact of the anamorphosis on the resulting prognosis. This will be analyzed for different data types and two different methods of transforming measurement errors. The numerical implementation and the testing strategy are discussed in the following; results will be presented in Chapter 8.
7.1 Numerical Implementation

The ensemble Kalman filter applied to transformed data is implemented in MATLAB and coupled with a MATLAB-based FEM code to solve the flow and transport model. The standard Galerkin FEM is used to numerically approximate the groundwater flow equation 3.4; for the transport equation 3.7, the streamline-upwind Petrov-Galerkin FEM is applied [Hughes, 1987]. The resulting systems of equations are solved by the UMFPACK solver [Davis, 2004]. Dirichlet and Neumann boundary conditions are prescribed to pose a well-defined problem. Conductivity values are assigned elementwise, while the flow and transport model returns state variable values at the nodes of the grid; for simplicity, this will not be distinguished in the further course of this study. The random parameter fields for the different realizations are generated with the spectral method of Dietrich and Newsam [1993]. If not indicated otherwise, an ensemble of 1000 realizations is used to obtain satisfying statistics [Zhang et al., 2005], while at the same time keeping the computational effort moderate enough to perform the several scenarios presented in the following section.
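The following sketch illustrates the FFT-based idea behind such spectral field generators. It is a simplified periodic variant of circulant embedding under the parameter values used later in Table 7.1, not the exact algorithm of Dietrich and Newsam [1993] or the thesis code; the microscale smoothing parameter is omitted.

```matlab
% Simplified periodic FFT sampling of a zero-mean Gaussian field with
% isotropic exponential covariance (a sketch, not the thesis code).
nx = 100; ny = 100; dx = 1; dy = 1;       % domain and grid spacing
lambda = 20; sigma2 = 1;                  % correlation length, variance

[X, Y] = meshgrid((0:nx-1)*dx, (0:ny-1)*dy);
hx = min(X, nx*dx - X);                   % torus distances make the
hy = min(Y, ny*dy - Y);                   % covariance matrix circulant
C  = sigma2 * exp(-sqrt(hx.^2 + hy.^2) / lambda);

S = real(fft2(C));                        % eigenvalues of the circulant
S(S < 0) = 0;                             % truncate embedding error
W = randn(ny, nx) + 1i*randn(ny, nx);     % complex white noise
field = sqrt(nx*ny/2) * real(ifft2(sqrt(S) .* W));

logK = log(1e-5) + field;                 % add mean ln-conductivity
```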
7.2 Description of Test Case

A steady-state, depth-averaged groundwater flow and transport model is chosen as the test case. The rectangular section of a confined aquifer extends over an area of 100 m x 100 m. Flow is induced by a head gradient from west to east (Dirichlet boundary conditions h = 1 m at the western boundary and h = 0 m at the eastern boundary). Impermeable Neumann boundaries in the north and south maintain the main flow direction from west to east. A tracer plume enters the domain in the middle of the western boundary with a width of 30 m. The state variables head and concentration will be simulated based on the underlying parameter field. The test case setting is completed by a well for pumping test analysis in the center of the domain; drawdown will be simulated, while concentration is not considered within the pumping test scenario.

Log-conductivity is assumed to follow a multi-Gaussian distribution; therefore, variogram models to describe the covariance of log-conductivity are justified. The isotropic exponential model is used with a microscale smoothing parameter [Kitanidis, 1997] to generate random conductivity fields. Each realization consists of such a random field and the corresponding simulated heads, drawdowns and concentrations. 17 measurement locations are spread regularly over the domain. Measurements of head, drawdown or concentration will be taken at these locations to update the unconditional ensemble. Measurement errors are assumed to be Gaussian distributed and uncorrelated; the standard deviation depends on the data type. A summary of the parameter values adopted in the test case is provided by Table 7.1.

Parameter                      Symbol        Value          Unit

Domain Discretization
  Domain size                  [Lx, Ly]      [100, 100]     m
  Grid spacing                 [Δx, Δy]      [1, 1]         m

Geostatistical Model
  Geometric mean of K          Kg            10^-5          m/s
  Variance of logK             σ²_K          1              −
  Correlation length of logK   [λx, λy]      [20, 20]       m
  Microscale smoothing         d             2.5            m

Transport Model
  Porosity                     φ             0.35           −
  Dispersivities               [αl, αt]      [2.5, 0.25]    m
  Diffusion coefficient        De            10^-9          m²/s

Measurement Error
  Standard deviation of h      σ_ε,h         0.02           m
  Standard deviation of d      σ_ε,d         0.01           m
  Standard deviation of c      σ_ε,c         0.02 + 20%     −

Table 7.1: Model parameters used for the synthetic test case. K and logK stand for conductivity and log-conductivity, respectively; h, d, c represent the state variables head, drawdown and concentration; ε symbolizes the measurement error. For concentration data, the measurement error standard deviation is composed of an absolute and a relative part and results in a measurement-specific standard deviation.
7.3 Test Procedure

A synthetic truth is generated as reference field: this random log-conductivity field, together with its simulated heads, concentrations and drawdowns, will be referred to as the "true" field. Any parameter's or state variable's value is known everywhere in the domain here, but only the measurements at the designated locations are used in the updating step. An unconditioned ensemble is produced by generating random log-conductivity fields as previously described and running flow and transport simulations on these fields. The updating step is then performed based on the observations of one of the three data types; this will be clarified in the respective paragraph of Chapter 8. Not only one synthetic truth is used to assess the accuracy of the filter with transformed data, but several randomly generated "truths", because results obtained for a single synthetic field could be compromised by numerical or statistical artefacts. This shall be avoided by performing data assimilation for true fields with different characteristic features. Statistical analysis of the ensembles includes the mean and the variance of the conductivity, head, concentration and drawdown fields, as well as different measures that account for the deviation from the true fields (a code sketch of these measures follows the list):

• Root mean square error:

RMS_\vartheta = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \vartheta_{true,i} - \bar{\vartheta}_{c,i} \right)^2 }   (7.1)

with ϑ being the parameter logK or any of the state variables h, d, c; n stands for the number of elements or nodes, respectively. ϑ̄ is obtained by averaging over all realizations.

• Prediction bias:

\vartheta_{bias,j} = \frac{ \bar{\vartheta}_{c,j} - d_j }{ \sqrt{R_{j,j}} }   (7.2)

for a measurement location j is defined as the deviation of the simulated measurements from the observed one, normalized by the measurement error standard deviation. The prediction bias can be evaluated both in the transformed and in the untransformed space.

• Statistics of residuals: for the ensemble of simulated measurements at location j, the standard deviation is calculated and normalized by the measurement error standard deviation; the higher-order moments skewness and excess-kurtosis are also determined.

It might be noticed that for the a posteriori statistics, the pure simulated measurements ϑ_c are used instead of a perturbed ensemble ϑ_c + ε; the reason is that the mean of the measurement errors is set to zero and therefore does not have an influence on the quality measures.
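A minimal MATLAB sketch of Equations 7.1 and 7.2 and the residual statistics, assuming illustrative array names (theta_true for the true field, Theta_c for the conditioned ensemble, obs_idx for the indices of the measurement locations); skewness and kurtosis come from the Statistics Toolbox.

```matlab
% theta_true (n x 1), Theta_c (n x N), d (nmeas x 1) observations,
% R (nmeas x nmeas) measurement error covariance, obs_idx (nmeas x 1).
theta_bar = mean(Theta_c, 2);                        % ensemble mean
RMS  = sqrt(mean((theta_true - theta_bar).^2));      % Eq. 7.1
bias = (theta_bar(obs_idx) - d) ./ sqrt(diag(R));    % Eq. 7.2

% normalized residuals and their higher moments at each location
res = (Theta_c(obs_idx, :) - d) ./ sqrt(diag(R));    % nmeas x N
sd  = std(res, 0, 2);
sk  = skewness(res, 1, 2);
ku  = kurtosis(res, 1, 2) - 3;                       % excess-kurtosis
```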
8 Results and Discussion

8.1 Filtering Procedure with Transformed Data

The procedure of ensemble Kalman filtering applied to transformed data shall be discussed exemplarily, step by step. Parameter estimation by conditioning on drawdown data will be considered. Observations are assumed to be gathered at 17 locations, indicated by black rings around white crosses in the plots of the arbitrarily chosen synthetic truth in Figure 8.1.

Figure 8.1: Synthetic truth: log-conductivity field and drawdown field
Ensemble Generation First of all, an ensemble of 1000 parameter fields is generated according to the geostatistical model for log-conductivity as described in Chapter 7. Flow simulations on these log-conductivity fields yield an ensemble of 1000 drawdown fields. The ensemble of log-conductivity fields shall be corrected with regard to the observed data.

Prior Statistics The prior statistics of the log-conductivity and drawdown ensembles are plotted in Figure 8.2. The mean of log-K lies around -11.5 and the variance around 1, corresponding to the statistics prescribed for the generation of the fields. Mean and variance of the drawdown fields result from solving the groundwater flow equation with the boundary conditions specified in Section 7.2 and represent unconditioned statistics. As expected, the unconditioned ensemble mean is not able to satisfyingly reconstruct the synthetic truth (the scales of the colorbars are maintained throughout this section).

Figure 8.2: A priori ensemble statistics of log-conductivity and drawdown. (a) A priori ensemble mean; (b) a priori ensemble variance

Gaussian Anamorphosis Before the EnKF conditioning step, the drawdown data at the measurement locations are transformed by linearly interpolating the empirical CDF. This is the method of choice, since no theoretical distribution function is available for drawdown data and we are working with a relatively large ensemble (see the recommendations in Section 6.4.5). The histogram of the highly skewed original data, the empirical anamorphosis function and the resulting histogram of the transformed data can be seen in Figure 8.3. The upper row shows exemplary plots for the measurement location at the pumping well; the lower row summarizes all of the other measurement locations (different color shades represent different locations). The synthetic observations are marked with black circles in the anamorphosis function plots.
Figure 8.3: Gaussian anamorphosis of drawdown data (untransformed ensemble, anamorphosis function, transformed ensemble). The upper row shows the transformation of the ensemble at the measurement location closest to the well; the lower row summarizes the transformation at the other measurement locations

Extrapolation towards the physical bound of zero or beyond is not required in this specific case, because the unconditioned ensemble of drawdown values encloses the observed data values. Nevertheless, the a priori ensemble has a large variability, and we can conclude that the prediction confidence will profit from an updating step that narrows the prediction ensemble. This transformation step is the only modification of the traditional EnKF analysis scheme, and it takes up only a fraction of the computational time needed to process the flow model runs; from here on, the usual procedure is resumed.

Updating Step With these transformed data, the Kalman gain is computed and each of the realizations is updated according to Equation 6.23; measurement errors are transformed based on the individual anamorphosis functions, as expressed by Equation 6.22. After this conditioning step, the log-conductivity fields have been corrected based on anamorphosed drawdown data. The new conductivity values are used as input for another run of the flow model. The back-transformed, updated drawdown values are determined by this subsequent model run, which replaces the inverse anamorphosis needed in the case of state estimation [Simon and Bertino, 2009].
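Put together, one assimilation cycle can be sketched as below; the helper names (build_anamorphosis, enkf_update, run_flow_model) are hypothetical stand-ins for the sketches given earlier and for the thesis' FEM code.

```matlab
% One cycle of the EnKF with transformed data (conceptual sketch).
% Y, E (nmeas x N): simulated measurements and drawn errors; yo: data.
Yhat = zeros(nmeas, N); Ehat = zeros(nmeas, N); yo_hat = zeros(nmeas, 1);
for k = 1:nmeas                         % one anamorphosis per location
    psi          = build_anamorphosis(Y(k, :));      % Sections 6.3-6.4
    Yhat(k, :)   = psi(Y(k, :));
    yo_hat(k, 1) = psi(yo(k));          % same function for observation
    Ehat(k, :)   = psi(Y(k, :) + E(k, :)) - Yhat(k, :);   % Eq. 6.22
end
S = enkf_update(S, Yhat, Ehat, yo_hat); % Eq. 6.23
Y = run_flow_model(S);                  % rerun replaces back-transform
```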
To illustrate the success of the updating step, the simulated ensemble of drawdown values close to the pumping well before and after updating is shown in Figure 8.4. The horizontal line indicates the observed value that shall be reproduced. Obviously, conditioning has narrowed the ensemble considerably toward the true measurement value.

Figure 8.4: Drawdown ensembles before (top) and after (bottom) updating at the measurement location closest to the pumping well. The observed value is marked by the thick red line
A Posteriori Statistics Now that we have conditioned the log-K fields on drawdown data, a posteriori statistics can be calculated. Figure 8.5 shows the estimated mean and the prediction variance of the updated fields. The best estimate of drawdown has been improved significantly and reasonably reproduces the shape of the drawdown cone in the synthetic truth. The parameter field is not matched that well, but it has to be recalled that only limited data are available and the realizations are not conditioned on direct log-conductivity values. Thus, the best estimate of the parameter field can only reproduce large-scale structures and lacks variability compared to any individual realization, such as the synthetic truth: local differences between the realizations are evened out, and this averaged parameter field cannot be used as input to simulate the expected drawdown field, because of the non-linearity of the flow model. Instead, the best estimate of the drawdown field is obtained by model runs on the whole ensemble of parameter fields and offers a better foundation for interpretation. In this context, it shall be mentioned that the updated parameter field realizations still follow the prescribed geostatistical model, which is preserved by the consistent use of covariance matrices in the updating step. The mean parameter field, however, does not share these geostatistical properties, since it shows a much smoother spatial structure.

Figure 8.5: A posteriori ensemble statistics of log-conductivity and drawdown. (a) A posteriori ensemble mean; (b) a posteriori ensemble variance

In the following section, the effects of the transformation shall be examined in detail in order to understand and efficiently use its beneficial properties compared to the assimilation of untransformed data.
8.2 Effects of Transformation

8.2.1 Pseudo-Linearized Dependence of States on Parameters

Univariate transformation techniques have been developed in this study to obtain univariate, approximately Gaussian-distributed state variables. The implicit working hypothesis is that this will install a more linear dependence on their Gaussian-distributed primary parameter, log-conductivity. The state-parameter dependence will now be investigated for untransformed and transformed state variables of different types (drawdown, hydraulic head and solute concentration) at different measurement locations, as marked in Figure 8.6. Log-conductivity locations and the corresponding state variable locations are assigned the same identifying number. Pairs with high (positive) correlations between state variable and log-conductivity were chosen to demonstrate the change in dependence.
Figure 8.6: Synthetic fields with marked measurement locations: numbers indicate the pairs of strongly correlated state variable and log-conductivity

Figure 8.7: Dependence of drawdown on log-conductivity (locations 1 and 2): untransformed ensemble, anamorphosis and transformed ensemble
Drawdown Data Figure 8.7 displays scatter plots of the simulated drawdown ensemble at locations 1 and 2 versus the log-conductivity values at locations 1' and 2'. The untransformed ensemble at the pumping well (location 1) shows a strong, but highly non-linear dependence on the values of log-conductivity at location 1'. This dependence structure can be well linearized by the transformation plotted in the middle panel. The resulting dependence of the transformed drawdown on log-conductivity is almost perfectly linear and can be exploited more efficiently in the EnKF analysis scheme. The farther away from the pumping well, the more linear the dependence, but at the same time the more scattered, as shown in the lower row of plots.

Head Data Compared to drawdown data, heads depend on log-conductivity in a relatively linear manner. Therefore, such a strong improvement in the linearity of the dependence cannot be expected. Figure 8.8 displays the scatter plots for locations 3/3' and 4/4', respectively. The lower row, for the measurement close to the western boundary, shows the most non-linear behavior that might occur for head data; it is attributed to the influence of the boundary condition (see Section 6.5.1). For measurement locations that are considerably influenced by boundary conditions, transformation might improve the efficiency of data assimilation.
Figure 8.8: Dependence of head on log-conductivity (locations 3 and 4): untransformed ensemble, anamorphosis and transformed ensemble
Concentration Data The dependence of concentration values on log-conductivity is visualized in Figure 8.9. Obviously, untransformed concentration data plotted over log-conductivity scatter much more than the other state variables. The higher fraction of independent log-K values is caused by a strongly non-local dependence structure, and it is difficult to identify a clear pattern within the scatter plots. Nevertheless, the transformation has a linearizing effect on the dependence, although not as impressive as for drawdown data. Notice that the concentration value at location 5 does not depend mostly on the log-conductivity in its vicinity, but on the values at the source of the plume. It can be concluded that the hydraulic properties at the source of the contaminant considerably determine the propagation of the plume [de Barros and Nowak, 2010].

Figure 8.9: Dependence of concentration on log-conductivity (locations 5 and 6): untransformed ensemble, anamorphosis and transformed ensemble

Summary It has been demonstrated that the Gaussian anamorphosis has a linearizing impact on the model which links parameters and states; an especially strong effect is visible for drawdown data. Thereby, the implicit assumption and the suggested benefit of filtering anamorphosed data have been affirmed.
8.2.2 Bivariate Dependence Structures of State Variables

As explained in the motivation for this study, Gaussian marginals are only a first step toward meeting the assumption of multi-Gaussianity as a prerequisite for the optimal performance of the EnKF. Johnson and Wichern [1988] offer a selection of tests of whether two variables with Gaussian marginals can be considered at least bi-Gaussian. In order to assess the remaining non-multi-Gaussianity after transformation, bivariate scatter plots are produced to visualize the dependence structures between the state variables at the locations defined above. Bivariate test plots can only provide a first hint of whether a variable is multi-Gaussian distributed; even if all involved bivariate data sets have been found to be bi-Gaussian, the multi-dimensional distribution does not necessarily have to be multi-Gaussian. Yet, if the bivariate plots already show non-Gaussian behavior, the assumption of multi-Gaussianity can be instantly dismissed. Note that the spatial dependence structure captured by copulas is not changed by the transformation of the marginals, since monotone, rank-preserving transformation techniques are applied [Bárdossy and Li, 2008]. Thus, the degree of non-multi-Gaussianity after transformation depends only on the multivariate behavior of the different variable types.

Drawdown Data Figure 8.10 contrasts the empirical copula density obtained from the drawdown ensemble at locations 1 and 2 with the theoretical Gaussian copula density that corresponds to the rank correlation r = 0.49 calculated for the empirical data sets. The empirical copula has been determined from an ensemble of N = 100,000 realizations and shows features similar to the Gaussian one, although a slight non-symmetrical curvature is noticeable. This indicates that the bivariate behavior of drawdown data is close to bi-Gaussian, but there is still a non-Gaussian influence on the dependence structure that should be further investigated and, if possible, transformed to multi-Gaussian dependence in future work.
Figure 8.10: Empirical copula density for drawdown at locations 1 and 2 (left) and theoretical Gaussian copula density (right) with same rank correlation
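Such empirical copula densities can be estimated from ranks; the sketch below is a minimal, assumed implementation (the bin count and the function name are illustrative; tiedrank requires the Statistics Toolbox).

```matlab
% Hypothetical helper: bivariate empirical copula density from two
% ensembles u, v (e.g., transformed drawdown at locations 1 and 2).
function c = empirical_copula_density(u, v, nbins)
    N  = numel(u);
    ru = tiedrank(u(:)) / (N + 1);       % pseudo-observations in (0,1)
    rv = tiedrank(v(:)) / (N + 1);
    edges = linspace(0, 1, nbins + 1);
    c = histcounts2(ru, rv, edges, edges);   % joint rank frequencies
    c = c * nbins^2 / N;                 % normalize to a density
end
```

The corresponding rank correlation can be obtained, e.g., via corr(u(:), v(:), 'type', 'Spearman').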
Head Data For heads, a dependence structure close to bi-Gaussian is expected from the known quasi-linearity of the head-conductivity relation, and it is indeed found, as plotted in Figure 8.11. Here, the empirical copula density of the data sets at locations 3 and 4, with a rank correlation r = 0.34, shows the typical symmetric features of Gaussian dependence; thus, head data can be assumed to be at least bi-Gaussian distributed after transformation of the marginals.
Figure 8.11: Empirical copula density for heads at locations 3 and 4 (left) and theoretical Gaussian copula density (right) with same rank correlation

Concentration Data Different results are expected for concentration data, since the state-parameter dependence already showed characteristics that could only be explained by various sources of influence, not by a single local influence that could be addressed by univariate transformation. The bivariate behavior of the concentration data sets, with a rank correlation r = 0.30, is plotted in Figure 8.12. The empirical copula density is characterized by a stronger asymmetry than the dependence structures of the other variable types. Its shape is reminiscent of a "coffee bean", which results from measurements that either belong to the plume or lie outside of it. The complex spatial dependence of concentration data limits the effectiveness of univariate transformation of this data type; multivariate transformations will be necessary in future work in order to move it closer to multi-Gaussianity.
Figure 8.12: Empirical copula density for concentration at locations 5 and 6 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.2.3 Qualitative Differences in the Updating Step

As illustrated in the previous sections, the transformation has no influence on the multivariate dependence among the state variables, but it has a direct impact on the state-parameter dependence. Therefore, qualitative differences are expected in the updating step, which translates the information obtained from the observations into information on the parameter field. Differences in the assignment of weights to the observations and in the spatial radius of influence (as discussed in Chapter 5) shall be investigated for state variable and parameter ensembles with N = 1000 realizations.

Drawdown Data Figure 8.13 displays the influence of the drawdown observation at location 1 on the parameter field to be updated. The scale is normalized by the largest absolute correlation; positive values signify positive correlation, while negative values represent negative correlation. As claimed before, drawdown data depend rather locally on the hydraulic conditions, and therefore measurement 1 has a major influence on the log-conductivity values in the close vicinity of the well. The influence decreases smoothly with increasing distance from the well; negative correlations hardly occur. Remember that drawdown is defined as a negative value; thus, a positive correlation corresponds to an inversely proportional relationship between drawdown and conductivity: high absolute drawdown results from low conductivity. This spatial behavior is not significantly altered by the transformation; only a slight tendency towards even smoother transitions is visible in the plot on the right. The area of influence seems to be slightly larger and more symmetric, which would be expected from the relationship between drawdown and log-conductivity, given this specific spatial configuration.
Figure 8.13: Influence function of measurement 1 (drawdown) on the parameter field
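A hedged sketch of how such an influence map can be computed from the ensemble, assuming illustrative variable names (S for the parameter ensemble, yk for the, possibly transformed, ensemble of the k-th simulated measurement):

```matlab
% Correlation of one measurement ensemble with every parameter cell,
% normalized by its largest absolute value as in Figures 8.13-8.15.
Sc  = S - mean(S, 2);                   % S: npar x N
yc  = yk - mean(yk);                    % yk: 1 x N
rho = (Sc * yc') ./ (sqrt(sum(Sc.^2, 2)) * norm(yc));
infl = rho / max(abs(rho));             % normalized influence function
map  = reshape(infl, ny, nx);           % back to the grid for plotting
```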
Head Data Head observations, however, show not only positive but also negative correlations with the log-conductivity values. This results in a less local dependency on the hydraulic conditions, as visualized in Figure 8.14: the head measurement at location 3 has a major positive correlation with the log-conductivity values in its close vicinity, but it is also negatively correlated with the parameters close to the boundary of the domain. With regard to the transformed influence function, again only small smoothing effects are visible.
Figure 8.14: Influence function of measurement 3 (head) on the parameter field
Concentration Data The complex dependence structure of concentration data on log-conductivity values yields a more complex influence function: as visualized in Figure 8.15, the observation at location 5 still has a large influence on the conductivity in its vicinity, but the correlations quickly change to large negative values, and even the eastern part of the domain is significantly influenced by this observation. In line with the findings from the previous sections, it is concluded that concentration data depend in a complex and global rather than local manner on log-conductivity; this type of non-linear dependence is difficult to exploit with a linear estimator. The problem can only be mitigated to a certain degree by univariate transformation: the global dependence is a combination and interplay of various local dependencies that cannot be pseudo-linearized by a univariate anamorphosis acting on the sum of all dependencies. The shape of the influence function depends strongly on the measurement location; even more spatially variant correlations were found that are not plotted here.
Figure 8.15: Influence function of measurement 5 (concentration) on the parameter field
8.3 Transformation of Different Data Types

Now that the procedure of updating and the effects of transformation have been clarified, the benefits of applying the EnKF to transformed data shall be quantified. The performance will be investigated for the three different state variables and varying synthetic data sets for log-conductivity.
8.3.1 Drawdown

The exemplary discussion of the synthetic data set in Section 8.1 will be resumed here. In addition to the results obtained with transformation technique 1 (transforming the measurement error according to the anamorphosis function, labeled "transf. data"), the performance of transformation technique 2 (determining the transformed measurement error by scaling the measurement error variance according to Equation 6.26, labeled "transf. 2 data") will be analyzed. Figure 8.16 contrasts, for both techniques, the ratio of the main diagonal of the measurement covariance matrix to the measurement error variance at each measurement location. Transforming the measurement error according to the anamorphosis function produces transformed perturbed measurements that include a higher portion of measurement error, with respect to the total perturbed value, than in the case of untransformed measurements.
Figure 8.16: Ratio of the diagonal of the measurement covariance matrix and the measurement error variance

Best Estimate and Prediction Variance The two differing methods of transforming measurement errors lead to slightly different estimates of the log-conductivity field (Figure 8.17), but to very similar mean drawdown fields. Both transformation techniques are able to reproduce the synthetic truth better than the traditional EnKF applied to untransformed data, which underestimates the conductivity in the center of the domain and therefore overestimates the drawdown at the pumping well. The prediction variance (Figure 8.18) of the log-conductivity ensemble is partly reduced when scaling the measurement error variance; again, both transformation methods achieve a higher reduction of the a priori ensemble variance than the traditional filter. Be aware that the performance of transformation technique 1 depends on the ensemble size and on the number of additionally generated perturbed realizations (Equation 6.25), because the measurement error variance R̂ is computed directly from the transformed measurement errors (Equation 6.24); thus, the number of transformed measurement errors influences the stability and accuracy of the transformed measurement error statistics. For the test applications presented here, n_add = 10 additional ensembles were used to derive the error statistics.
Figure 8.17: Synthetic log-conductivity and drawdown field and best estimates resulting from different transformation methods in the EnKF
Figure 8.18: A priori ensemble variance of log-conductivity and drawdown field and conditional variances resulting from different transformation methods in the EnKF
Prediction Error To summarize what is indicated by the plots shown above, the RMSE of the estimates with regard to the true fields is computed. When assimilating transformed data and scaling the measurement error variance, the deviation from this specific true field in the sense of RMSE is reduced by 8 % for the estimated conductivity field and by 24 % for the estimated drawdown field. An overview of the reductions in RMSE obtained in tests with varying synthetic data sets is given in Table 8.1. For drawdown data, transformation method 2 (scaling the measurement error variance) scored slightly better than transformation method 1 in that it produced smaller RMSEs with regard to the true drawdown field. Due to time constraints, a statistically representative number of test cases could not be set up, but the results for these 10 different fields suggest that Gaussian anamorphosis of drawdown data is a promising and successful method to increase the accuracy of updating.

          RMSE Log-Conductivity                 RMSE Drawdown
Field     Untransf.  Transf. 2  Reduction       Untransf.  Transf. 2  Reduction
1         0.8123     0.7478       7.9 %         0.0103     0.0078      24.4 %
2         0.8747     0.8772      -0.3 %         0.0162     0.0098      39.3 %
3         0.7815     0.8010      -2.5 %         0.0093     0.0078      15.8 %
4         0.8437     0.8145       3.5 %         0.0132     0.0091      31.0 %
5         0.7130     0.7085       0.6 %         0.0093     0.0070      24.7 %
6         0.7212     0.6887       4.5 %         0.0333     0.0388     -16.6 %
7         0.6907     0.6799       1.6 %         0.0150     0.0100      33.3 %
8         0.6249     0.6320      -1.1 %         0.1539     0.0375      75.6 %
9         0.7715     0.7481       3.0 %         0.0119     0.0113       5.6 %
10        0.9282     0.9307      -0.3 %         0.0202     0.0235     -16.4 %
Average   0.7762     0.7628       1.7 %         0.0293     0.0163      21.7 %

Table 8.1: RMSE of the updated fields with regard to the synthetic truth. Comparison between assimilation of untransformed data and updating with transformed data and scaled measurement error variance. Note that a negative reduction percentage means an increase in RMSE in the transformed run compared with the untransformed one
Evaluation of Residuals Another criterion for judging the performance of the filter are the residuals at the measurement locations, although it has to be kept in mind that deviations from the observed value are justified by measurement error and by adherence to the prior (geo-)statistics. The mean of the residuals at one location over all realizations does not necessarily have to be zero, as the filter has to decide on the weight assigned to each observation based on a compromise between prior statistics and data error. Nevertheless, the average of the mean residuals over a sufficiently large number of filter applications to different true fields should boil down to zero. Because of the limited time frame of this study, these effects cannot be proved here, but are assumed to be valid. The first and second moments of the residuals are normalized by division by the corresponding measurement error standard deviation. Consequently, the standard deviation of the residuals should lie around 1; values above 1 signify excess uncertainty in the reproduction of the data set, while values below 1 indicate that the ensemble of simulated values at this location is relatively narrow, i.e., narrower than the measurement error variance would suggest. Besides the mentioned statistics, the skewness and excess-kurtosis of the residuals have been determined; all of these moments are shown in Figure 8.19.
Figure 8.19: Statistics of drawdown residuals (mean, standard deviation, skewness and kurtosis over the measurement number) from the different assimilation methods

Overall, the residuals are reduced by applying the transformation techniques, compared to the result for the assimilation of untransformed data. Especially the deviation from the observation at the pumping well and the standard deviation at this location are decreased considerably. This is a valuable finding, since the traditional EnKF does not provide reasonable results for drawdown measurements directly at the well and requires the specification of an unrealistically high measurement error at this measurement location in order to achieve acceptable accuracy. With the application of the transformation, all available drawdown data can be treated equally. The standard deviation of the residuals for transformed data lies around 1, which corresponds to the measurement error standard deviation; assimilating untransformed data leads to a higher standard deviation of about 1.3. The absolute skewness is reduced by applying the transformation; both transformation methods result in a slightly negatively skewed updated data set, while the traditional filter leads to a higher positive skewness. With regard to excess-kurtosis, huge differences are noticeable at the well location: the transformed data sets show a much higher excess-kurtosis than the untransformed one. If the residuals were Gaussian distributed, as implicitly assumed when multi-Gaussian variables are assimilated by the EnKF, a skewness and excess-kurtosis of 0 would be expected. With regard to skewness, this is at least approximated by the transformed data assimilation methods.

Summary It has been demonstrated that the assimilation of transformed drawdown data improves the performance of the filter in terms of smaller deviations from the true field in comparison to the traditional filter with untransformed state variables.
8.3.2 Hydraulic Head

The same synthetic truth as presented above shall now be estimated with the help of head observations.

Best Estimate and Prediction Variance

Figure 8.20 shows the best estimates of log-conductivity and hydraulic head after updating with the different assimilation methods. Since transformation methods 1 and 2 show quite similar results, they will be discussed exemplarily for method 1 only. Although the curvature of the head isolines seems to be reconstructed better without transformation, the overall RMSE leads to a different conclusion: with untransformed data, the RMSE of the estimated head field amounts to 0.0475, which is reduced by 24% when transformation is applied. The prediction variance of the ensembles is presented in Figure 8.21. While the variance of the log-conductivity ensembles is very similar, a slight decrease in the variance of the head ensemble is visible for transformed data.

Evaluation of Residuals

The statistics of the residuals (Figure 8.22) convey a clearer message: the mean residuals resulting from the assimilation of transformed data are much closer to 0, while the filter applied to untransformed data seems to consistently underestimate the true head field in this specific case.
Figure 8.20: Synthetic log-conductivity and head field and best estimates resulting from different transformation methods in the EnKF
Figure 8.21: A priori ensemble variance of log-conductivity and head field and conditional variances resulting from different transformation methods in the EnKF
Figure 8.22: Statistics (mean, standard deviation, skewness and excess-kurtosis over the measurement number) of head residuals from the different assimilation methods

Prediction Error

Analysis of the filter performance for the other 9 synthetic data sets indicates that Gaussian anamorphosis yields only minor improvement of updating based on head data, for both the estimated parameter field and the estimated state field, while a degradation of performance with regard to the head field is observed for a significant fraction of the test cases. This compromising effect has also been observed by Béal et al. [2010]. A possible explanation for the increase in RMSE is that the efficiency of the filter might suffer from the empirical transformation of an already quite Gaussian distributed variable, which introduces unnecessary numerical inaccuracy, especially in the case of small ensemble sizes.

Summary

Gaussian anamorphosis of head data is not generally recommended, since it requires an extra (computationally not very demanding, but still additional) step in the filtering procedure that does not promise stable and improved results in return. Yet, transformation might improve the assimilation of highly boundary-influenced measurement data, under the assumption that Gaussian anamorphosis does not compromise the (at least partially) existing multi-Gaussianity within the head field; this could be further investigated in future research.
8.3.3 Solute Concentration

Tracer tests are an alternative to hydraulic pumping tests for obtaining measurements that allow inference of the underlying hydraulic conductivity field. As discussed above, the dependence structure of concentration data on log-conductivity, as well as the bivariate dependence within the data set, is of complex shape and cannot be directly exploited by a linear updating process. Nevertheless, a mitigating effect of univariate transformation of concentration data can be demonstrated for the synthetic truth described above.

Handling of Clustered Data

Physical bounds on concentration values and data clustering effects play an important role for this data type. The methods used to handle these issues have been introduced in Section 6.3.1. A large fraction of the ensemble takes on a value of zero, e.g., at the most south-western measurement location in the domain. This observation point can hardly be reached by the concentration plume; therefore, only a few realizations of parameter fields yield a concentration greater than zero. A histogram of the transformed, clustered data at this location is plotted in Figure 8.23.
Figure 8.23: Prior and conditioned (a posteriori) ensemble in Gaussian space with data clustering at the lower bound, together with the observed value

All values smaller than 10^-6 are assigned a value of 0, which corresponds to a certain minimum transformed value (depending on ensemble size and the number of clustered values; see the definition of the tails in Section 6.3). The conditioned ensemble (orange) exhibits a narrowed range, and its mean has moved closer to the observed value. Note that, without clustering, the transformed ensemble would cover the value range from -3.2905 to +3.2905 and show typical Gaussian symmetry, as plotted in Figure 8.3. Handling clustered data is a trade-off between ensuring univariate Gaussianity at this measurement location and ensuring comparability of simulated measurements and synthetic observations, as discussed in Section 6.5.2; a minimal sketch of this treatment is given below.
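The following sketch illustrates one plausible form of this treatment; the threshold, the choice of the common minimum value and all names are assumptions, and the actual tail definition of Section 6.3 may differ.

    % Minimal MATLAB sketch (assumed threshold and names): near-zero
    % concentrations are lumped at the lower bound and share one common
    % minimum value in Gaussian space, while the remaining values receive
    % empirical normal scores.
    N = 1000;
    c = max(randn(N, 1) - 0.5, 0);       % stand-in ensemble clustered at zero
    c(c < 1e-6) = 0;                     % lump values below the threshold
    isClust = (c == 0);
    nc = sum(isClust);                   % number of clustered members
    [~, idx] = sort(c);                  % clustered zeros occupy the lowest ranks
    r = zeros(N, 1); r(idx) = (1:N)';    % ranks of all ensemble members
    p = (r - 0.5)/N;                     % empirical CDF values in (0, 1)
    z = sqrt(2)*erfinv(2*p - 1);         % normal scores (base MATLAB, no toolbox)
    z(isClust) = sqrt(2)*erfinv(nc/N - 1);  % one common value at the cluster midpoint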
Best Estimate and Prediction Variance

The best estimate of the log-conductivity field and the concentration field is shown in Figure 8.24. Despite the difficulties exposed above, the assimilation of transformed data performs notably better in reconstructing the absence of the concentration plume at the affected observation locations in the eastern part of the domain. Also, the prediction variance of the concentration ensemble, plotted in Figure 8.25, is significantly reduced compared to the traditional filter with untransformed data. These are unexpectedly positive results; for other synthetic data sets, however, degradation of the prediction accuracy can also occur.

Evaluation of Residuals

The statistics of the residuals support the positive effect of Gaussian anamorphosis (Figure 8.26): residuals resulting from the assimilation of transformed concentration values show a mean around 0 with smaller amplitudes and a slightly smaller standard deviation.

Prediction Error

Based on the analysis of reconstructing the different synthetic data sets, it can be stated that concentration updating with transformed data can yield significantly improved results (RMSE of the estimated concentration field reduced by 20 to 40%), but, depending on the field to be reconstructed, results can also be deteriorated by up to 60%. Applying transformation method 2 yields slightly better results overall: here, the RMSE can be reduced by up to 25%, while degradation is limited to about 35%. With regard to the estimated parameter field, neither of the methods is able to significantly reduce the RMSE; on the contrary, for most synthetic data sets, the estimated field scores slightly worse than the field conditioned on untransformed data.

Summary

There may be various causes for the unstable success of transformation; e.g., the way of handling clustering effects or the definition of the tails of the anamorphosis function could have a major influence on the updating step, in either a desired or an objectionable manner. Additionally, the degree of remaining non-multi-Gaussianity might vary from field to field and might have a non-negligible impact on the linear filtering procedure. These issues should be addressed in further research in order to exploit the dependence of concentration on log-conductivity more efficiently and thus turn tracer tests into a reliable source of data for estimating hydraulic conditions with the EnKF.
Figure 8.24: Synthetic log-conductivity and concentration field and best estimates resulting from different transformation methods in the EnKF
Figure 8.25: A priori ensemble variance of log-conductivity and concentration field and conditional variances resulting from different transformation methods in the EnKF
Figure 8.26: Statistics (mean, standard deviation, skewness and excess-kurtosis over the measurement number) of concentration residuals from the different assimilation methods
8.3.4 Suitability of State Variable Types for Gaussian Anamorphosis

All three data types were used to reconstruct the same synthetic log-conductivity field and its flow and transport variables. In summary, the best estimate of log-conductivity based on head observations matched the true parameter field best, which is attributed to the almost linear dependence between state and parameter. This naturally linear relationship should not be altered by transformation, because the inherent Gaussian structures would more likely be disturbed than improved towards multi-Gaussianity.

Concentration observations provided the smallest information gain with regard to the parameter field. This is traced back to their complex, non-local dependence on log-conductivity, which cannot be satisfyingly exploited by a linear filter. The degree of mitigation achieved by Gaussian anamorphosis of concentration data depends on the individual true field; a general recommendation for or against transformation cannot be given at this point, and further investigation, especially considering multivariate transformations, should follow to allow well-founded statements.

The assimilation of drawdown observations yields relatively confident estimates that are close to reality. This state variable shows a strongly non-linear, but local, dependence on its primary parameter that can be exploited even more efficiently after transformation. To verify the positive effect of Gaussian anamorphosis, the performance of the EnKF with untransformed as well as with transformed drawdown data shall be compared with the performance of the particle filter, which is considered a reference solution for Monte Carlo data assimilation.
8.4 Comparison with Particle Filter as Reference Solution

The accuracy of conductivity estimation by assimilating anamorphosed drawdown data with the EnKF shall be verified by a comparison with the solution of the particle filter, applied to the same synthetic data set as presented in Section 8.1. The particle filter requires a large ensemble in order to obtain reasonably reliable weighted statistics. The number of parameter fields is thus chosen to be N = 100,000. To improve convergence of the filter, the number of included drawdown measurements is reduced, and the measurement error standard deviation is now composed of the previously defined absolute part of 0.01 m and an additional relative part of 10%, which results in a measurement-specific standard deviation (a minimal sketch of the particle-filter weighting follows below). The EnKF runs to be compared with the particle filter perform the updating step on a sub-ensemble of N_EnKF = 1,000 realizations to reduce computational time and to imitate realistic applications. Nine drawdown measurement locations are installed around the well within a radius of approximately 20 m, which is equivalent to the correlation length of the log-conductivity field.

Best Estimate and Prediction Variance

Figure 8.27 displays the synthetic truth (the same field as presented in Section 8.1) and the best estimates of log-conductivity and drawdown. It is clearly visible that both transformation methods result in estimated fields very similar to the result obtained by particle filtering. The synthetic drawdown measurement at the pumping well has been reconstructed successfully by applying the EnKF to transformed data, while conditioning on untransformed data yields an overestimated drawdown and, correspondingly, underestimated log-conductivity values in the vicinity of the well. The prediction variance (Figure 8.28) of the estimated ensembles underlines the positive impact of transformation on the conditioning step: filtering with anamorphosed variables reduces the prediction variance significantly in comparison to traditional updating. The variances of the estimated log-conductivity fields resulting from both transformation methods share similar features with the weighted variance determined by the particle filter. With regard to the prediction variance of the drawdown field, differences between the two transformation techniques are visible: both transformation methods yield much lower variances than the assimilation of untransformed data, but transformation method 2 achieves an even higher reduction in variance and resembles the reference solution best.
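As referenced above, the weighting scheme of the particle filter can be sketched as follows; the names and stand-in data are assumptions, and only the error model (absolute part of 0.01 m plus a relative part of 10%) is taken from the text.

    % Minimal MATLAB sketch (stand-in data): Gaussian likelihood weights of
    % the particle filter with a measurement-specific error standard
    % deviation, followed by weighted statistics.
    N = 100000; nMeas = 9;
    Ysim = 0.3 + 0.05*randn(nMeas, N);    % stand-in simulated drawdowns
    yObs = 0.3 + 0.05*randn(nMeas, 1);    % stand-in observed drawdowns
    sigma  = 0.01 + 0.10*abs(yObs);       % absolute plus relative error part
    loglik = -0.5*sum(((Ysim - yObs)./sigma).^2, 1);
    w = exp(loglik - max(loglik));        % subtract max to avoid underflow
    w = w/sum(w);                         % normalized particle weights
    q = Ysim(1, :);                       % any predicted quantity, e.g. drawdown
    qMean = sum(w.*q);                    % weighted mean (best estimate)
    qVar  = sum(w.*(q - qMean).^2);       % weighted prediction variance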
Figure 8.27: Synthetic log-conductivity and head field (upper row) and best estimates resulting from different transformation methods in the EnKF and the particle filter
Figure 8.28: A priori ensemble variance of log-conductivity and drawdown field (upper row) and conditional variances resulting from different transformation methods in the EnKF and the particle filter
Evaluation of Residuals

Analyzing the first and second moments of the residuals plotted in Figure 8.29, it can be stated that EnKF assimilation of transformed data yields residuals that follow the trends of the particle filter results. As emphasized before, the assimilation of measurements directly at the well is tremendously improved by Gaussian anamorphosis; the prediction confidence has also been increased.
Figure 8.29: Statistics (mean and standard deviation over the measurement number) of drawdown residuals resulting from different transformation methods in the EnKF and the particle filter (PF)
Prediction Error and Deviation from Reference Solution

Table 8.2 lists the RMSE of the best estimates resulting from the different assimilation methods with regard to both the synthetic truth and the reference solution. The results clearly demonstrate that Gaussian anamorphosis of drawdown data significantly improves the prediction accuracy of both the estimated parameter field and the estimated drawdown field, while at the same time being consistent with the reference solution for stochastic parameter estimation.

    EnKF Assimilated Data    RMSE Log-Conductivity       RMSE Drawdown
                             Synth. Truth    PF          Synth. Truth    PF
    Untransformed            0.9326          0.1785      0.0121          0.0051
    Transformed              0.9014          0.0995      0.0095          0.0022
    Transformed 2            0.9149          0.1082      0.0093          0.0019

Table 8.2: RMSE of the fields obtained from the three different EnKF assimilation methods with regard to the synthetic truth and the particle filter (PF)
Summary

The findings from the comparison with the reference solution confirm that EnKF assimilation of transformed data is a computationally efficient and reasonably accurate alternative to particle filtering, since it requires only a fraction of the ensemble size (in this test case, the EnKF ensemble was 100 times smaller than the PF ensemble). Gaussian anamorphosis proved useful for inverse conductivity estimation based on drawdown measurements, which could be applied, e.g., to hydraulic tomography. In future research, a similar comparison of particle filtering with EnKF assimilation of transformed concentration data might help to clarify the ambiguous results from Section 8.3.3. Moreover, the factors influencing the degree of multi-Gaussianity of transformed concentration data should be analyzed in detail, e.g., the handling of data clustering and the definition of the tails towards the physical bounds. The proposed methods are meant to be a first approach that could be developed further toward a successful anamorphosis, given that, in the individual case, a pseudo-linearization of the complex dependence structure can be expected at all. With regard to updating based on head measurements, it might be worthwhile to investigate transformations that only affect boundary-influenced measurements but preserve the existing (almost) linear dependence structure in the middle of the domain. This approach should also be verified with the help of the particle filter.
9 Summary, Conclusion and Outlook

Summary

The procedure of subsurface parameter estimation by EnKFs applied to transformed data has been discussed in this study. It has been pointed out that groundwater flow and transport variables most often violate the assumption of multi-Gaussianity, and therefore optimal behavior of the linear update by an EnKF cannot be expected. To mitigate the effects of non-Gaussian distributions on the performance of filtering, univariate transformation has been suggested to render arbitrarily distributed state variables Gaussian. It is implicitly assumed that Gaussian anamorphosis results in a pseudo-linearization of dependence that can be more efficiently exploited by the EnKF updating step. Different parametric and non-parametric methods have been presented to construct an appropriate anamorphosis function. Moreover, a possibility to implement physical bounds of the state variable values has been introduced, and data clustering at these bounds has been addressed. The practical implementation of Gaussian anamorphosis in the EnKF analysis scheme has been demonstrated, including an extensive discussion of the transformation of the measurement error, which is crucial for both theoretical coherence and practical success of the update. The impact of anamorphosis on different variable types (drawdown, head and concentration) has been analyzed in detail with regard to changes in the dependence on the parameter field and changes in the influence function that controls updating. The performance of EnKFs applied to these transformed flow and transport state variables has been assessed with numerical test cases. Finally, the substantial improvement in the quality of the prognosis achieved by Gaussian anamorphosis of drawdown data has been verified by a comparison with the particle filter solution, which is considered the reference solution for stochastic parameter estimation.

Conclusion

The following effects have been detected:

• The implicit assumption that non-linear dependence can be pseudo-linearized by Gaussian anamorphosis is valid.
• Gaussian anamorphosis of state variables is able to improve the performance of EnKFs for parameter estimation; the degree of improvement depends on the type of variable, the spatial configuration of observations and the true field to be reconstructed.
• Head data naturally show a relatively linear dependence on conductivity; not much improvement can be achieved by Gaussian anamorphosis.
• The dependence structure of concentration is more complex and of non-local nature, impeding a successful interpretation by the updating procedure even after transformation.
• The transformation is most effective for variables that show a strongly non-linear, but mostly local, dependence on the parameters.
• Drawdown data show this type of dependence and are more accurately assimilated by EnKFs when transformation is applied than without transformation: the prediction error can be reduced by more than 20%.
• This success suggests estimating conductivity with EnKFs using transformed drawdown data in hydraulic tomography studies.
• Conductivity estimation by EnKFs with transformed drawdown data is an attractive alternative to particle filtering because it is computationally less demanding and similarly accurate.
Further Work

Steps that could not yet be carried out, but are of interest for the assessment of the proposed approach:

• Studying the effects of applying transformation only to head measurements that are strongly influenced by boundary conditions and therefore show strongly non-Gaussian behavior, while preserving the existing (almost) linear dependence structure farther away from the boundaries.
• Evaluating the reaction of the proposed method to model bias: a promising approach would be the joint inference of uncertain boundary conditions together with conductivity.
• Investigating the possible further improvement of the EnKF for parameter estimation by transformation and assimilation of combined data types.
Outlook

Toward possible future research:

• Transfer of the proposed methodology back to state estimation; this includes inverse Gaussian anamorphosis and ensuring spatially continuous back-transformed state variables.
• Investigation of multivariate transformations that guarantee multi-Gaussian dependence structures in order to fully exploit the filter's potential.
Bibliography

M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, 1964.
A. Bárdossy and J. Li. Geostatistical interpolation using copulas. Water Resources Research, 44(7), 2008.
D. Béal, P. Brasseur, J.-M. Brankart, Y. Ourmières, and J. Verron. Characterization of mixing errors in a coupled physical biogeochemical model of the North Atlantic: implications for nonlinear estimation using Gaussian anamorphosis. Ocean Science, 6, 2010.
J. Bear. Dynamics of fluids in porous media. American Elsevier, New York, 1972.
A. Bellin and D. Tonina. Probability density function of non-reactive solute concentration in heterogeneous porous formations. Journal of Contaminant Hydrology, 94(1-2), 2007.
J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985.
L. Bertino, G. Evensen, and H. Wackernagel. Sequential data assimilation techniques in oceanography. International Statistical Review, 71(2), 2003.
A. W. Bowman and A. Azzalini. Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. Oxford University Press, USA, 1997.
G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26(2), 1964.
G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6), 1998.
J. Carrera, A. Alcolea, A. Medina, J. Hidalgo, and L. J. Slooten. Inverse problem in hydrogeology. Hydrogeology Journal, 13(1), 2005.
J. P. Chilès and P. Delfiner. Geostatistics: modeling spatial uncertainty. Wiley-Interscience, 1999.
M. P. Clark, D. E. Rupp, R. A. Woods, X. Zheng, R. P. Ibbitt, A. G. Slater, J. Schmidt, and M. J. Uddstrom. Hydrological data assimilation with the ensemble Kalman filter: Use of streamflow observations to update states in a distributed hydrological model. Advances in Water Resources, 31(10), 2008.
W. J. Conover and R. L. Iman. Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35(3), 1981.
T. M. Cover and J. A. Thomas. Elements of information theory. Wiley, 2006.
H. Darcy. Les fontaines publiques de la ville de Dijon. Victor Dalmont, 1856.
T. A. Davis. Algorithm 832: UMFPACK V4.3, an unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS), 30(2), 2004.
F. P. J. de Barros and W. Nowak. On the link between contaminant source release conditions and plume prediction uncertainty. Journal of Contaminant Hydrology, 2010 (submitted).
C. R. Dietrich and G. N. Newsam. A fast and exact method for multidimensional Gaussian stochastic simulations. Water Resources Research, 29(8), 1993.
G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research, 99, 1994.
G. Evensen. The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53(4), 2003.
G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, 2007.
F. N. Fritsch and R. E. Carlson. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2), 1980.
K. F. Gauss. Theoria motus corporum celestium. English translation: Theory of the Motion of the Heavenly Bodies, 1963.
L. W. Gelhar. Stochastic subsurface hydrology. Prentice-Hall, Englewood Cliffs, NJ, 1993.
J. J. Gómez-Hernández and X. H. Wen. To be or not to be multi-Gaussian? A reflection on stochastic hydrogeology. Advances in Water Resources, 21(1), 1998.
N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings Part F: Radar and Signal Processing, 140(2), 1993.
H. J. Hendricks Franssen, A. Alcolea, M. Riva, M. Bakr, N. van der Wiel, F. Stauffer, and A. Guadagnini. A comparison of seven methods for the inverse modelling of groundwater flow. Application to the characterisation of well catchments. Advances in Water Resources, 32(6), 2009.
T. J. R. Hughes. The finite element method. Prentice-Hall, Englewood Cliffs, NJ, 1987.
R. A. Johnson and D. W. Wichern. Applied multivariate statistical analysis. Prentice-Hall, Englewood Cliffs, NJ, 1988.
A. G. Journel. Nonparametric estimation of spatial distributions. Mathematical Geology, 15(3), 1983.
R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 1960.
P. K. Kitanidis. Introduction to geostatistics: applications to hydrogeology. Cambridge University Press, 1997.
R. Krzysztofowicz. Transformation and normalization of variates with specified distributions. Journal of Hydrology, 197(1-4), 1997.
S. P. Neuman. Theoretical derivation of Darcy's law. Acta Mechanica, 25(3), 1977.
W. Nowak. Best unbiased ensemble linearization and the quasi-linear Kalman ensemble generator. Water Resources Research, 45, 2009.
W. Nowak, R. L. Schwede, O. A. Cirpka, and I. Neuweiler. Probability density functions of hydraulic head and velocity in three-dimensional heterogeneous porous media. Water Resources Research, 44(8), 2008.
J. M. Ortiz, B. Oz, and C. V. Deutsch. A step by step guide to bi-Gaussian disjunctive kriging, 2005.
J. Pearson, R. Goodall, M. Eastham, and C. MacLeod. Investigation of Kalman filter divergence using robust stability techniques. In IEEE Conference on Decision and Control, volume 5, 1997.
E. P. Poeter and M. C. Hill. Inverse models: A necessary next step in ground-water modeling. Ground Water, 35(2), 1997.
D. J. Poirier. Piecewise regression using cubic splines. Journal of the American Statistical Association, 68(343), 1973.
J. Rivoirard. Introduction to disjunctive kriging and non-linear geostatistics. Oxford University Press, USA, 1994.
C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2004.
A. E. Scheidegger. General theory of dispersion in porous media. Journal of Geophysical Research, 66, 1961.
R. L. Schwede and O. A. Cirpka. Interpolation of steady-state concentration data by inverse modeling. Ground Water, 2010.
E. Simon and L. Bertino. Application of the Gaussian anamorphosis to assimilation in a 3-D coupled physical-ecosystem model of the North Atlantic with the EnKF: a twin experiment. Ocean Science Discussions, 6(1), 2009.
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8, 1959.
H. W. Sorenson. Least-squares estimation: from Gauss to Kalman. IEEE Spectrum, 7, 1970.
B. L. van der Waerden. Mathematische Statistik. Springer, Heidelberg, 1965.
H. Wackernagel. Multivariate geostatistics: an introduction with applications. Springer, 2003.
Y. Zhang, G. F. Pinder, and G. S. Herrera. Least cost design of groundwater quality monitoring networks. Water Resources Research, 41(8), 2005.