Geophysical Inverse Theory

Notes by Germán A. Prieto
Universidad de los Andes

March 11, 2011
© 2009

Contents

1 Introduction to inverse theory
  1.1 Why is the inverse problem more difficult?
      1.1.1 Example: Non-uniqueness
  1.2 So, what can we do?
      1.2.1 Example: Instability
      1.2.2 Example: Null space
  1.3 Some terms

2 Review of Linear Algebra
  2.1 Matrix operations
      2.1.1 The condition number
      2.1.2 Matrix inverses
  2.2 Solving systems of equations
      2.2.1 Some notes on Gaussian elimination
      2.2.2 Some examples
  2.3 Linear vector spaces
  2.4 Functionals
      2.4.1 Linear functionals
  2.5 Norms
      2.5.1 Norms and the inverse problem
      2.5.2 Matrix norms and the condition number

3 Least Squares & Normal Equations
  3.1 Linear regression
  3.2 The simple least squares problem
      3.2.1 General LS solution
      3.2.2 Geometrical interpretation of the normal equations
      3.2.3 Maximum likelihood
  3.3 Why LS and the effect of the norm
  3.4 The L2 problem from 3 perspectives
  3.5 Full example: Line fit

4 Tikhonov Regularization
  4.1 Tikhonov regularization
  4.2 SVD implementation
  4.3 Resolution vs variance, the choice of α or p
      4.3.1 Example 1: Shaw's problem
  4.4 Smoothing norms or higher-order Tikhonov
      4.4.1 The discrete case
  4.5 Fitting within tolerance
      4.5.1 Example 2

Chapter 1

Introduction to inverse theory

In geophysics we are often faced with the following situation: we have measurements at the surface of the Earth of some quantity (magnetic field, seismic waveforms) and we want to know some property of the ground beneath the place where we made the measurements. Inverse theory is a method to infer the unknown physical properties (the model) from these measurements (the data).

This class is called Geophysical Inverse Theory (GIT) because it is assumed that we understand the physics of the system. That is, if we knew the properties accurately, we would be able to reconstruct the observations that we have taken. First, we need to be able to solve the forward problem

d_i = G_i(m)    (1.1)

where from a known field m(x, t, ...) we can predict the observations d_i. We assume there is a finite number M of observations, so d_i is an M-dimensional data vector. G is the theory that predicts the data from the model m; this theory is based on physics. Mathematically, G(m) is a functional, a rule that unambiguously assigns a single real number to an element of a vector space.

As its name suggests, the inverse problem reverses the process of predicting the values of the measurements. It tries to invert the operator G to get an estimate of the model,

m = F(d)    (1.2)

Some examples of properties inside the Earth (the model) and the surface observations used to make inferences about them are shown in Table 1.1.

The inverse problem is usually more difficult than the forward problem. To start, we assume that the physics is completely under control before even thinking about the inverse problem. There are plenty of geophysical systems where the forward problem is still incompletely understood, such as the geodynamo problem or earthquake fault dynamics.

Table 1.1: Example properties and measurements for inverse problems

Model                      | Data
---------------------------|-----------------------------------
Topography                 | Altitude / bathymetry measurements
Magnetic field at the CMB  | Magnetic field at the surface
Mass distribution          | Gravity measurements
Fault slip                 | Waveforms / geodetic motion
Seismic velocity           | Arrival times / waveforms

1.1 Why is the inverse problem more difficult?

A simple reason is that we have a finite number of measurements, and of limited precision. The unknown property we are after is a function of position or time and in principle requires infinitely many parameters to describe it. This leads to the problem that in many cases the inverse problem is non-unique. Non-uniqueness means that more than one solution can reproduce the data in hand. A finite data set d_i, i = 1, ..., M, does not allow us to estimate a function that would take an infinite number of coefficients to describe.

1.1.1 Example: Non-uniqueness

Imagine we want to describe the Earth's velocity structure. The forward problem could be written as

α(θ, φ, r) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \sum_{n=0}^{\infty} Y_l^m(θ, φ) Z_n(r) a_{lmn}    (1.3)

where α is the P-wave velocity at position (θ, φ, r), Z_n(r) are the basis functions that control the radial dependence, Y_l^m are the basis functions that describe the angular dependence (latitude, longitude), and a_{lmn} are the unknown model coefficients. Note that even if we had thousands of exact measurements of velocity, the discretized forward problem is

α_i(θ_i, φ_i, r_i) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \sum_{n=0}^{\infty} Y_l^m(θ_i, φ_i) Z_n(r_i) a_{lmn}    (1.4)

where i = 1, ..., M. We still have an infinite number of parameters a_{lmn} to determine, leading to the non-uniqueness problem. A commonly used strategy is to drastically oversimplify the model,

α_i(θ_i, φ_i, r_i) = \sum_{l=0}^{6} \sum_{m=-l}^{l} \sum_{n=0}^{6} Y_l^m(θ_i, φ_i) Z_n(r_i) a_{lmn}    (1.5)

or to assume a 1D velocity model with radial dependence only,

α_i(r_i) = \sum_{n=0}^{20} Z_n(r_i) a_n    (1.6)

In these cases the number of data points is larger than the number of model parameters, M > N, so the problem is overdetermined. If the oversimplification (i.e., radial dependence only) is justified by observations this may be a fine approach, but when there is no evidence for this assumption, even if the data are fit, we will be uncertain of the significance of the result. Another problem is that this may unreasonably limit the solution.

1.2 So, what can we do?

Imagine we could interpolate between measurements to obtain a complete data set. In a few cases that would be enough, but in most cases geophysical inverse problems are ill-posed. In this sense they are unstable: an infinitesimal perturbation in the data can result in a finite change in the model. So, how you interpolate may control the features of the predicted model. The forward problem, on the other hand, is unique (remember the term functional), and it is stable too.

1.2.1 Example: Instability

Consider the anti-plane problem for an infinitely long strike-slip fault.

Figure 1.1: Anti-plane slip for an infinitely long strike-slip fault (coordinates x_1 along strike, x_2 across strike, x_3 depth).

The displacement at the Earth's surface u(x_1, x_2, x_3) is in the \hat{x}_1 direction, due to slip S(ξ) as a function of depth ξ,

u_1(x_2, x_3 = 0) = \frac{1}{\pi} \int_0^{\infty} S(\xi) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.7)

where S(ξ) is the slip along x_1 and varies only with depth \hat{x}_3. If we had only discrete measurements,

d_i = u_1(x_2^{(i)}) = \int_0^{\infty} S(\xi) \, g_i(\xi) \, d\xi    (1.8)

where

g_i(\xi) = \frac{1}{\pi} \, \frac{x_2^{(i)}}{\left(x_2^{(i)}\right)^2 + \xi^2}

Now, let's assume that slip occurs only at some depth c, so that S(ξ) = δ(ξ − c):

d(x_2) = \frac{1}{\pi} \int_0^{\infty} S(\xi) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.9)

u_1(x_2) = \frac{1}{\pi} \, \frac{x_2}{x_2^2 + c^2}    (1.10)

Figure 1.2: Observations at the surface due to concentrated slip at depth c.

The results (Figure 1.2) show that

1. the effect of concentrated slip is spread widely at the surface, and
2. this will lead to trouble (instability) in the inverse problem,

so that even if we did have data at every point on the surface of the Earth, the inverse problem would be unstable. The kernel of the functional, g(ξ), smooths the focused deformation. The problem lies in the physical model, not in how you solve it.
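As a quick numerical illustration (a minimal NumPy sketch, not part of the original notes; the grid spacing, depths and the narrow Gaussian used as an approximate delta function are all assumptions made here), one can discretize the kernel of (1.8) and apply it to slip concentrated at depth c to see how widely the surface signal is spread:

import numpy as np

# Sketch: discretize g_i(xi) = (1/pi) * x2_i / (x2_i**2 + xi**2) and apply it
# to a slip distribution concentrated at depth c (hypothetical units).
x2 = np.linspace(-20.0, 20.0, 81)            # surface positions
xi = np.linspace(0.0, 30.0, 301)             # depths
dxi = xi[1] - xi[0]

c = 5.0                                      # depth of concentrated slip
slip = np.exp(-0.5 * ((xi - c) / 0.2) ** 2)  # narrow Gaussian ~ delta(xi - c)
slip /= np.sum(slip) * dxi                   # unit total slip

# Kernel matrix G[i, j] = (1/pi) * x2_i / (x2_i^2 + xi_j^2) * dxi
G = (1.0 / np.pi) * x2[:, None] / (x2[:, None] ** 2 + xi[None, :] ** 2) * dxi
u1 = G @ slip                                # predicted surface displacement

# Although the slip is concentrated at one depth, the surface response is
# spread over several times that depth: the forward operator smooths.
above = x2[np.abs(u1) > 0.5 * np.abs(u1).max()]
print("width of |u1| above half its maximum:", above.max() - above.min())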

1.2.2 Example: Null space

We consider data for a vertical gravity anomaly observed at some height h, used to estimate the unknown buried line mass density distribution m(x) = Δρ(x). The forward problem is described by

d(s) = \Gamma \int_{-\infty}^{\infty} \frac{h}{\left[ (x - s)^2 + h^2 \right]^{3/2}} \, m(x) \, dx    (1.11)

     = \int_{-\infty}^{\infty} g(x - s) \, m(x) \, dx    (1.12)

Suppose now we can find a smooth function m_+(x) such that the integral in (1.12) vanishes, i.e., d(s) = 0. Because of the symmetry of the kernel g(x − s), if we choose m_+(x) to be a line with a given slope, the observed anomaly d(s) will be zero. The consequence is that we can add such a function m_+ to the true density distribution, m = m_true + m_+, and the new gravity anomaly profile will match the data just as well as m_true:

d(s) = \int_{-\infty}^{\infty} g(x - s) \left[ m_{true}(x) + m_+(x) \right] dx    (1.13)

     = \int_{-\infty}^{\infty} g(x - s) \, m_{true}(x) \, dx + \int_{-\infty}^{\infty} g(x - s) \, m_+(x) \, dx    (1.14)

     = \int_{-\infty}^{\infty} g(x - s) \, m_{true}(x) \, dx + 0    (1.15)

From the field observations, even if error free and infinitely sampled, there is no way to distinguish between the real anomaly and any member of an infinitely large family of alternatives. Models m_+(x) that lie in the null space of g(x − s) are solutions to

\int g(x - s) \, m(x) \, dx = 0

By superposition, any linear combination of these null-space models can be added to a particular model without changing the fit to the data. This kind of problem does not have a unique answer even with perfect data.
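A minimal numerical sketch (not from the notes; the discretization, height h and grid sizes are assumptions chosen only for illustration) shows how one can probe the null space of a discretized version of this kernel with the SVD:

import numpy as np

# Sketch: discretize g(x - s) = h / ((x - s)**2 + h**2)**1.5 and look at the
# right singular vectors associated with (numerically) zero singular values.
h = 2.0
s = np.linspace(-10, 10, 20)           # observation points
x = np.linspace(-10, 10, 200)          # model discretization
dx = x[1] - x[0]

G = h / ((x[None, :] - s[:, None]) ** 2 + h ** 2) ** 1.5 * dx   # 20 x 200

U, lam, Vt = np.linalg.svd(G, full_matrices=True)
# With 200 unknowns and only 20 data, at least 180 right singular vectors span
# an exact null space; more are "effectively" null because the corresponding
# singular values are tiny.
m_plus = Vt[-1, :]                     # one model in the (numerical) null space
print("||G m_plus|| =", np.linalg.norm(G @ m_plus))   # ~ 0: invisible to the data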

Table 1.2: Examples of inverse problems

Model      | Theory    | Determinacy     | Example
-----------|-----------|-----------------|---------------------
Discrete   | Linear    | Overdetermined  | Line fit
Discrete   | Linear    | Underdetermined | Interpolation
Discrete   | Nonlinear | Overdetermined  | Earthquake location
Continuous | Linear    | Underdetermined | Fault slip
Continuous | Nonlinear | Underdetermined | Tomography

1.3 Some terms

The inverse problem is not just simple linear algebra.

1. For the continuous case, you cannot invert a matrix with infinitely many rows.
2. Even in the discrete case d = Gm, you might think you could simply multiply by the inverse of the matrix,

   G^{-1} d = G^{-1} G m = m

   but this is only possible for square (nonsingular) matrices, so it does not work for the under- or overdetermined cases.

Overdetermined
• More observations than unknowns, M > N.
• Due to errors, you are never able to fit all data points.
• Getting rid of data is not ideal (why?).
• Find a compromise by fitting all data simultaneously (in a least-squares sense).

Underdetermined
• More unknowns than equations, M < N.
• The data could be fit exactly, but we could vary some components of the model arbitrarily.
• Add additional constraints, such as smoothness or positivity.

Chapter 2

Review of Linear Algebra

A matrix is a rectangular array of real (or complex) numbers arranged in m rows with n entries each. The set of such m by n matrices is called R^{m×n} (or C^{m×n} for complex ones). A vector is simply a matrix consisting of a single column. Notice that we will use the notation R^m rather than R^{m×1} or R^{1×m}. Also, be careful, since Matlab does distinguish between a row vector and a column vector.

Notation is important. We will use boldface capital letters (A, B, ...) for matrices, lowercase bold letters (a, b, ...) for vectors, and lowercase roman and Greek letters (m, n, α, β, ...) to denote scalars. When referring to specific entries of the array A ∈ R^{m×n}, I use the indices a_{ij}, which means the entry on the ith row and the jth column. If we have a vector x, x_j refers to its jth entry.

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & \vdots \\ \vdots & & \ddots & \\ a_{m1} & \cdots & & a_{mn} \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}

We can also think of a matrix as an ordered collection of column vectors,

A = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix}

There are a number of special matrices to keep in mind. These are useful since some of them are used to obtain matrix inverses.

• Square matrix: m = n
• Diagonal matrix: a_{ij} = 0 whenever i ≠ j
• Tridiagonal matrix: a_{ij} = 0 whenever |i − j| > 1
• Upper triangular matrix: a_{ij} = 0 whenever i > j
• Lower triangular matrix: a_{ij} = 0 whenever i < j
• Sparse matrix: most entries are zero

Note that the definition of upper and lower triangular matrices may apply to non-square matrices as well as square ones.

A zero matrix is a matrix composed of all zero elements. It plays the same role in matrix algebra as the scalar 0:

A + 0 = A = 0 + A

The unit (identity) matrix is the square, diagonal matrix with ones on the diagonal and zeros elsewhere, usually denoted I. Assuming the matrix sizes are compatible,

AI = A = IA

2.1 Matrix operations

For matrices in R^{m×n}, addition A = B + C means a_{ij} = b_{ij} + c_{ij}, and scalar multiplication A = αB means a_{ij} = α b_{ij}, where α is a scalar. Another basic manipulation is transposition: B = A^T means b_{ij} = a_{ji}.

More important is matrix multiplication, where R^{m×n} × R^{n×p} → R^{m×p}:

C = AB    means    c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

Notice we can only multiply two matrices when the number of columns (n) of the first one equals the number of rows of the second. The other dimensions are not important, so non-square matrices can be multiplied. Other standard arithmetic rules are valid, such as distributivity, A(B + C) = AB + AC. Less obviously, associativity of multiplication holds, A(BC) = (AB)C, as long as the matrix sizes permit. But multiplication is not commutative,

AB ≠ BA

unless some special properties exist.

When one multiplies a matrix into a vector, there are a number of useful ways of interpreting the operation

y = Ax    (2.1)

1. If x and y are in the same space R^m, A is providing a linear mapping or linear transformation of one vector into another.

   Example 1: m = 3, A represents the components of a tensor: x → angular velocity, A → inertia tensor, y → angular momentum.

   Example 2: rigid-body rotation, used in plate tectonic reconstructions.

2. We can think of A as a collection of column vectors, so that

   y = Ax = [a_1, a_2, ..., a_n] x = x_1 a_1 + x_2 a_2 + ... + x_n a_n

   and the new vector y is simply a linear combination of the column vectors of A, with expansion coefficients given by the elements of x. Note that this is the way we think about matrix multiplication when fitting a model: y contains the data values, A contains the predictions of the theory, which includes some unknown weights (the model) given by the entries of x, i.e., d = Gm.

There are two ways of multiplying two vectors. For two vectors in R^p and R^q, the outer product is

x y^T = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} \begin{pmatrix} y_1 & y_2 & \cdots & y_q \end{pmatrix} = \begin{pmatrix} x_1 y_1 & \cdots & x_1 y_q \\ \vdots & \ddots & \vdots \\ x_p y_1 & \cdots & x_p y_q \end{pmatrix}

and the inner product of two vectors of the same length is

x^T y = \begin{pmatrix} x_1 & x_2 & \cdots & x_p \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix} = x_1 y_1 + x_2 y_2 + \cdots + x_p y_p

The inner product is just the dot product of vector analysis.

If A is a square matrix and there is a matrix B such that AB = I, the matrix B is called the inverse of A and is written A^{-1}. Square matrices that possess no inverse are called singular; when the inverse exists, A is called nonsingular. The inverse of the transpose is the transpose of the inverse:

(A^T)^{-1} = (A^{-1})^T

The inverse is useful for solving linear systems of algebraic equations. Starting with equation (2.1),

y = Ax
A^{-1} y = A^{-1} A x = I x = x

so if we know y and A, and A is square and has an inverse, we can recover the unknown vector x. As you will see later, calculating the inverse and then multiplying it into y is a poor way to solve for x numerically. A final pair of rules about transposes and inverses:

(AB)^T = B^T A^T
(AB)^{-1} = B^{-1} A^{-1}

2.1.1 The condition number

The key to understanding the accuracy of the solution of y = Ax is the condition number of the matrix A,

κ(A) = ||A|| \, ||A^{-1}||

which estimates the factor by which small errors in y or A are magnified in the solution x. This can sometimes be very large (> 10^{10}). It can be shown that the condition number relevant when solving the normal equations (to be studied later) is the square of the condition number obtained using a QR decomposition, which can sometimes lead to catastrophic error build-up.
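A short numerical sketch (not from the notes; the Hilbert matrix is a standard ill-conditioned example assumed here only for illustration) shows the squared condition number of the normal equations in action:

import numpy as np

# Sketch: condition number, and why forming A^T A can be dangerous.
n = 8
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)   # Hilbert matrix
x_true = np.ones(n)
y = A @ x_true

print("cond(A)     =", np.linalg.cond(A))
print("cond(A^T A) =", np.linalg.cond(A.T @ A))      # ~ cond(A)**2

x_direct = np.linalg.solve(A, y)                      # LU-based solve
x_normal = np.linalg.solve(A.T @ A, A.T @ y)          # via "normal equations"
print("error (direct):", np.linalg.norm(x_direct - x_true))
print("error (normal):", np.linalg.norm(x_normal - x_true))   # much larger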

2.1.2 Matrix inverses

Remember our definition: a matrix A ∈ R^{n×n} is invertible if there exists A^{-1} such that

A^{-1} A = I    and    A A^{-1} = I

Some examples of inverses:

D = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix}, \qquad D^{-1} = \begin{pmatrix} 1/d_1 & 0 & 0 \\ 0 & 1/d_2 & 0 \\ 0 & 0 & 1/d_3 \end{pmatrix}

The inverse of a diagonal matrix is a diagonal matrix with the diagonal elements raised to the power −1.

P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad P^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}

P exchanges rows 2 and 3 of a matrix; exchanging them again restores the original, so P has a simple inverse (here P^{-1} = P).

E = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

In this case the matrix is not diagonal, but we can use Gaussian elimination, which we will go through next.

2.2 Solving systems of equations

Consider the system of equations

(1)   2x +  y +  z =  1
(2)   4x +  y      = −2
(3)  −2x + 2y +  z =  7

and solve it using Gaussian elimination. The first step is to end up with zeros in the first column for all rows except the first one:

Subtract 2 × (1) from (2); the factor 2 is called the pivot.
Subtract −1 × (1) from (3); the factor −1 is called the pivot.

(1)   2x +  y +  z =  1
(2)        −y − 2z = −4
(3)        3y + 2z =  8

The next step is to subtract −3 × (2) from (3):

(1)   2x +  y +  z =  1
(2)        −y − 2z = −4
(3)            −4z = −4

and now solve each equation from bottom to top by the process called back-substitution:

(3)  −4z = −4        →  z = 1
(2)  −y − 2 = −4     →  y = 2
(1)  2x + 2 + 1 = 1  →  x = −1

In solving this system of equations we have used elementary row operations, namely adding a multiple of one equation to another, multiplying by a constant, or swapping two equations. This process can be extended to solve systems with an arbitrary number of equations.

Another way to think of Gaussian elimination is as a matrix factorization (triangular factorization). Rewrite the system of equations in matrix form, Ax = b, or in index notation A_{ij} x_j = b_i:

\begin{pmatrix} 2 & 1 & 1 \\ 4 & 1 & 0 \\ -2 & 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \\ 7 \end{pmatrix}

We are going to try to get A = LU, where L is a lower triangular matrix and U is upper triangular, using the same Gaussian elimination steps.

1. Subtract 2 times the first equation from the second:

\begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 1 & 1 \\ 4 & 1 & 0 \\ -2 & 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \\ 7 \end{pmatrix}
\quad \Rightarrow \quad \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ -2 & 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -4 \\ 7 \end{pmatrix}

or, for short, E_1 A x = E_1 b, i.e., A_1 x = b_1.

2. Subtract −1 times the first equation from the third (i.e., add the first equation to the third):

\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ -2 & 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -4 \\ 7 \end{pmatrix}
\quad \Rightarrow \quad \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 3 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -4 \\ 8 \end{pmatrix}

or, for short, E_2 A_1 x = E_2 b_1, i.e., A_2 x = b_2.

3. Subtract −3 times the second equation from the third (i.e., add 3 times the second equation to the third):

\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{pmatrix} \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 3 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -4 \\ 8 \end{pmatrix}
\quad \Rightarrow \quad \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 0 & -4 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ -4 \\ -4 \end{pmatrix}

or, for short, E_3 A_2 x = E_3 b_2, i.e., A_3 x = b_3.

This last matrix will be given a new name, so the system now reads

E_3 E_2 E_1 A x = E_3 E_2 E_1 b
U x = c

and since U = E_3 E_2 E_1 A,

A = E_1^{-1} E_2^{-1} E_3^{-1} U = LU

where our matrix L is

L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -3 & 1 \end{pmatrix}

which is a lower triangular matrix. Notice that the off-diagonal components of L are the pivots (the elimination multipliers). This result suggests that once we have A = LU, we only need to change Ax = b to Ux = c and back-substitute. Easy, right?

Some notes on Gaussian Elimination

Basic steps • Uses multiples of first equation to eliminate first coefficient of subsequent equations. • Repeat for coefficients n − 1. • Back substitute in reverse order.

14

Chapter 2. Review of Linear Algebra

Problems • Zero in first column • Linearly dependent equations • Inconsistent equations Efficiency If we count division, multiplication, sum as one operation and assume we have a matrix A ∈ Rn×n • n operations to get zero in first coefficient • n − 1 for the # of rows to do • n2 − n operations so far • N = (12 + · · · + n2 ) − (1 + · · · + n) =

n3 −n 3

to do remaining coefficients.

• For large n → N ≈ n3 /3 • Back-substitution part N = n2 /2 There are other more efficient ways to solve systems of equations.

2.2.2

Some examples

We want to solve systems of equations with m equations and n unknowns. Square matrices There are three possible outcomes for square matrices, with A ∈ Rm×m 1. A 6= 0 → x = A−1 b This is a non-singular case where A is an invertible metrix and the solution x is unique. 2. A = 0, b = 0 → 0x = 0 and x could be anything. This is the underdetermined case, the solution x in non-unique. 3. A = 0, b 6= 0 → 0x = b. There is no solution. this is an inconsistent case for which there is no solution.

2.2 Solving systems of equations

15

Non-square matrices An example of a system with 3 equations and four unknowns (overdetermined, underdetermined?) is as follows     x1   1 3 3 2  0  x 2 = 0   2 6 9 5   x3  −1 −3 3 0 0 x4 We can use Gaussian elimination by setting to zero first 2 and 3      x1 1 0 0 1 3 3 2  x2    −2 1 0   2 6 9 5   x3  = 1 0 1 −1 −3 3 0 x4     x1 1 3 3 2    0 0 3 1   x2  =  x3  0 0 6 2 x4 and for the third coefficient for the last   1 0 0 1 3 3  0 1 0  0 0 3 0 −2 1 0 0 6

row   2  1   2    1 3 3 2   0 0 3 1   0 0 0 0

 x1 x2   = x3  x4  x1 x2   = x3  x4

coefficients in rows 

 0  0  0 

 0  0  0



 0  0  0 

 0  0  0

The underlined values are the pivots. The pivots have a column of zeros below and are to the right and below other pivots. Now we can try and solve the equations, but note that the last row has not information, xj could get any value. 0 0

= x1 + 3x2 + 3x3 + 2x4 = 3x3 + x4

and solving by steps

we have the solution 

x3

= −x4 /3

x1

= −3x2 − x4

    −x4 −3x2 − x4 −3x2      0 x x 2 2 = + x=  −x4 /3   0   − 13 x4 x4 0 x4

   

16

Chapter 2. Review of Linear Algebra

which means that all solutions to our initial problem Ax = b are combinations of this two vectors and form an infinite set of possible solutions. You can choose ANY value of x2 or x4 and you will get always the correct answer.

2.3

Linear Vector Spaces

A vector space is an abstraction of ordinary space and its members can be loosely be regarded as ordinary vectors. To define a linear vector space (LVS) it involves two types of objects, the elements of the space (f, g) and scalars (α, β ∈ R, although sometimes ∈ C is useful). A real linear vector space is a set V containing elements which can be related by two operations f +g

and

addition

αf scalar multiplication

where f, g ∈ V and α ∈ R. In addition, for any f, g, h ∈ V and any scalar α, β the following set of nine relations must be valid f +g



V

(2.2)

αf



V

(2.3)

f +g = g+f f + (g + h) = (f + g) + h

(2.4) (2.5)

f + g = f + h, if and only ifg = h α(f + g) = αf + αg

(2.6) (2.7)

(α + β)f

(2.8)

= αf + βf

α(βf ) = (αβ)f 1f = f

(2.9) (2.10)

An important consequence of these laws is that every vector space contains a unique zero element 0 f +0=f ∈V and whenever αf = 0

either α = 0 or f = 0

Some examples The most obvious space is Rn , so x = [x1 , x2 , . . . , xn ] is an element of Rn . Perhaps less familiar are spaces whose elements are functions, not just a finite set of numbers. One could define a vector space C N [a, b], a space of all

2.3 Linear Vector Spaces

17

N -differentiable functions on the interval [a, b]. Or solutions to PDE’s (∇2 = 0) with homogeneous boundary conditions. You can check some of the laws. For example, in the vector space C N [a, b] it should be easy to proof that adding two N -differentiable functions the resultant function is also N -differentiable. Linear combinations In a linear vector space you can add together a collection of elements to form a linear combination g = α1 f1 + α2 f2 + . . . where fj ∈ V , αj ∈ R and obviously g ∈ V . Now, a set of elements in a linear vector space a1 , a2 , . . . , an is said to be linearly independent if n X

only if β1 = β2 = · · · = βn = 0

βj aj = 0

j=1

in words, the only linear combination of the elements that equals zero is the one in which all the scalars vanish. Subspaces A subspace of a linear vector space V is a subset of V that is itself a LVS, meaning all the laws apply. For example Rn

is a subset of Rn+1

or C n+1 [a, b]

is a subset of C n [a, b]

since all (N + 1)-differentiable functions are themselves N -differentiable. Other terms • span the spanning set of a collection of vectors is the LVS that can be nuilt from linear combinations of the vectors. • basis a set of linearly independent vectors that form or span the LVS. • range written R(A) of a matrix Rm×n , it is simply the linear vector space that can be formed by taking linear combinations of the column vectors. Ax ∈ R(A) Ax = b is the set of ALL vectors b that can be build by linear combinations of the elements in A by using x with all possible scalar coefficients.

18

Chapter 2. Review of Linear Algebra • rank: The rank represents the number of linearly independent rows in A. rank((A)) = dim[R(A)] A matrix is said to be full rank if rank(A ∈ Rm×n ) = min(m, n) or to be rank deficient otherwise. • Null space: This is the other side of the coin of the rank. This is the set of xi ’s that cause Ax = 0 and it can be shown that dim[N (A)] = min(m, n) − rank(A)

2.4

Functionals

In geophysics we usually have a collection of real numbers (could be complex numbers in for example EM) as our observations. An observation or measurement will be a single real number. The forward problem is Z dj = gj (x)m(x)dx (2.11) where gj (x) is the mathematical model and will be treated as an element in the vector space V . We thus need something, a rule that unambigously assigns a real number to an element gj (x) and this is where the term functional comes in. A functional is a rule that unambigously assigns a single real number to an element in V . Note that every element in V will not necessarily be connected with a real number (remember the terms range and null space). Some examples of functionals include Zb Ii [m]

=

gi (x)m(x)dx

m ∈ C 0 [a, b]

a

d2 f D2 [f ] = f ∈ C 2 [a, b] dx2 x=0 N1 [x] = |x1 | + |x2 | + · · · + |xn | x ∈ Rn There are two kinds of functionals that will be relevant to our work, linear functionals and norms. We will devote a section to the second one later.

2.5 Norms

2.4.1

19

Linear functionals

For f, g ∈ D and α, β ∈ R a linear functional L obeys L[αf + βg] = αL[f ] + βL[g] and in general αf + βg ∈ D so that a combination of elements in space D, lies in space D. It is a subspace of D. The most general linear functions in RN is the dot product Y [x] = x1 · y1 + x2 · y2 + · · · + xN · yN =

X

xi yi

i

which is an example of an inner product For finite models and data, the general relationship is d = gj m or for multiple data di = Gij mj and is some way our forward problem is an inner product between the model and the mathematical theory to generate the data.

2.5

Norms

The norm provides a mean of attributing sizes to elements of a vector space. It should be recognized that there are many ways to define the size of an element. This leads to some level of arbitrariness, but it turns out that one can choose a norm with the right behavior to suit a particular problem. A norm, denoted k · k is a real-valued functional and satisfies the following conditions kf k kαf k kf + gk kf k

> 0

(2.12)

= |α|kf k 6 |f | + |g| = 0

(2.13) the triangle inequality only iff = 0

(2.14) (2.15)

If we omit the last condition, the functional is called a seminorm. Using the norms, in a linear vector space equipped with such a norm the distance between two elements d(f, g) = kf − gk

20

Chapter 2. Review of Linear Algebra

Some norms in finite dimensional space Here we define some of the common used norms L1 L2 L∞ Lp

kxk1 = |x1 | + |x2 | + · · · + |xN | 1/2 kxk2 = x1 2 + x2 2 + · · · + xN 2 max|xi | 1/p kxkp = (|x1 |p + |x2 |p + · · · + |xN |p )

x ∈ RN Euclidean norm p61

The areas for which the so called p-norms are less that unit (kxk 6 1) are shown p=1

p=2

p=3

p=∞

Figure 2.1: The unit circle p-norms in Figure 2.1. For the Euclidean norm the area is called the unit ball. Note that for large values of p, the larger vectors will tend to dominate the norm. Some norms in infinite dimensional space For the infinite dimensional space we work with functions rather than vectors Zb kf k1

=

|f (x)|dx

kf k2

 b 1/2 Z =  |f (x)|2 dx

kf k∞

max a6x6b(|f (x)|)

a

a

=

2.5 Norms

21

and other norms can be designed to measure some aspect of the roughness of the functions  00

0

2

1/2

Zb

00

2

kf k

= f 2 (a) + [f (a)] +

kf kS

 b 1/2 Z 0 =  (w0 (x)f 2 (x) + w1 (x)f (x)2 )dx

[f (x)] dx a

Sobolev norm

a

This last set of norms are going to be useful when we try to solve underdetermined problems. They are typically applied to the model rather than the data.

2.5.1 Norms and the inverse problem

Remembering our simple inverse problem

d = Gm    (2.16)

we form the residual

r = d − Gm̂ = d − d̂

where from our physics we can make data predictions d̂, and we want our predictions to be as close as possible to the acquired data. What do we mean by small? We use the norm to define how small is small, by making the length of r, namely the norm ||r||, as small as possible: minimizing

L1:  ||d − d̂||_1

or minimizing the Euclidean or 2-norm,

L2:  ||d − d̂||_2

leading in the second case to the least squares solution.

2.5.2 Matrix norms and the condition number

We return to the question of the condition number. Imagine we have a discrete inverse problem for the unperturbed system

y = Ax    (2.17)

and the perturbed case is

y_0 = A x_0    (2.18)

Here, assume the perturbation is small. Note that in real life we have uncertainties in our observations, and we wish to know whether these small errors in the observations severely affect our end result. Using a norm, we can quantify the effect of the small perturbations. From the relations above,

A(x − x_0) = y − y_0
(x − x_0) = A^{-1}(y − y_0)
||x − x_0|| ≤ ||A^{-1}|| \, ||y − y_0||

where in the last step we use the defining property of the matrix norm, ||Bz|| ≤ ||B|| ||z||. To get an idea of the relative effect of the perturbations on our result,

\frac{||x − x_0||}{||Ax||} ≤ ||A^{-1}|| \frac{||y − y_0||}{||y||}

||x − x_0|| ≤ ||Ax|| \, ||A^{-1}|| \frac{||y − y_0||}{||y||}

\frac{||x − x_0||}{||x||} ≤ ||A|| \, ||A^{-1}|| \frac{||y − y_0||}{||y||}

where we defined the condition number

κ(A) = ||A|| \, ||A^{-1}||    (2.19)

which shows the amount by which a small relative perturbation in the observations (y) is reflected in perturbations of the resultant estimated model x. For the L2 norm, the condition number of a matrix is κ = λ_max/λ_min, where the λ_i are the singular values of the matrix in question (for a symmetric matrix these coincide with the absolute values of the eigenvalues).
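A brief sketch (not from the notes; the random test matrix and the size of the perturbation are arbitrary choices) verifies the relative-error bound associated with (2.19):

import numpy as np

# Sketch: check ||x - x0|| / ||x|| <= kappa(A) * ||y - y0|| / ||y||.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
x = rng.normal(size=5)
y = A @ x

y0 = y + 1e-6 * rng.normal(size=5)      # slightly perturbed data
x0 = np.linalg.solve(A, y0)

lhs = np.linalg.norm(x - x0) / np.linalg.norm(x)
rhs = np.linalg.cond(A) * np.linalg.norm(y - y0) / np.linalg.norm(y)
print(lhs <= rhs, lhs, rhs)             # bound holds; it is usually not tight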

Chapter 3

Linear regression, least squares and normal equations

3.1 Linear regression

Sometimes we will talk about an inverse problem, while other people will prefer the term regression. What is the difference? In practice, none. When we are dealing with a function-fitting procedure that can be cast as an inverse problem, the procedure is often referred to as a regression. In fact, economists use regressions quite extensively. Finding a parameterized curve that approximately fits a set of data points is referred to as regression.

For example, the parabolic trajectory problem is defined by

y(t) = m_1 + m_2 t − (1/2) m_3 t^2

where y(t) represents the altitude of the object at time t, and the three (N = 3) model parameters m_i are associated with the constant, slope and quadratic terms. Note that even though the function is quadratic in time, the problem is linear in the three parameters. If we have M discrete measurements y_i at times t_i, the linear regression problem or inverse problem can be written in the form

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix} =
\begin{pmatrix} 1 & t_1 & -\tfrac{1}{2} t_1^2 \\ 1 & t_2 & -\tfrac{1}{2} t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_M & -\tfrac{1}{2} t_M^2 \end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \\ m_3 \end{pmatrix}

When the regression model is linear in the unknown parameters, we call this a linear regression or linear inverse problem.
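A minimal sketch (not from the notes; the observation times, noise level and "true" parameters are invented for illustration) sets up this parabolic regression as d = Gm and solves it by least squares:

import numpy as np

# Sketch: the parabolic-trajectory regression y(t) = m1 + m2*t - 0.5*m3*t**2.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 30)                    # hypothetical observation times
m_true = np.array([10.0, 20.0, 9.8])             # constant, slope, quadratic term
G = np.column_stack([np.ones_like(t), t, -0.5 * t**2])
y = G @ m_true + rng.normal(0.0, 0.5, t.size)    # noisy synthetic data

m_hat, *_ = np.linalg.lstsq(G, y, rcond=None)    # least-squares estimate
print(m_hat)                                     # close to m_true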

3.2 The simple least squares problem

We start applying the terms we have learned above by looking at an overdetermined linear problem (more equations than unknowns) involving the simplest of norms, the L2 or Euclidean norm. Suppose we are given a collection of M measurements of a property, forming a vector d ∈ R^M. From our geophysics we know the forward problem, such that we can predict the data from a known model m ∈ R^N. That is, we know the N vectors g_k ∈ R^M such that

d = \sum_{k=1}^{N} g_k m_k = Gm    (3.1)

where G = [g_1, g_2, ..., g_N]. We are looking for a model m̂ that minimizes the size of the residual vector, defined as

r = d − \sum_{k=1}^{N} g_k m̂_k

We do not expect an exact fit, so there will be some error, and we use a norm to measure the size of the residual, ||r|| = ||d − Gm||. For the least squares problem we use the L2 or Euclidean norm,

||r||_2 = \left( \sum_{k=1}^{M} r_k^2 \right)^{1/2}

Example 1: the mean value

Suppose we have M measurements of the same quantity, so our data vector is

d = [d_1, d_2, ..., d_M]^T

The residual is defined as the distance between each individual measurement and the predicted value m̂:

r_i = d_i − m̂

Using the L2 norm,

||r||_2^2 = \sum_{k=1}^{M} r_k^2 = \sum_{k=1}^{M} (d_k − m̂)^2 = \sum_{k=1}^{M} \left( d_k^2 − 2 m̂ d_k + m̂^2 \right) = \sum_{k=1}^{M} d_k^2 − 2 m̂ \sum_{k=1}^{M} d_k + M m̂^2

Now, to minimize the residual, we take the derivative with respect to the model m̂ and set it to zero,

\frac{d}{d m̂} ||r||_2^2 = −2 \sum_{k=1}^{M} d_k + 2 M m̂ = 0

and by solving for m̂ we have

m̂ = \frac{1}{M} \sum_{k=1}^{M} d_k

which shows that the sample mean is the least squares solution for the measurements. The corresponding estimate that minimizes the L1 norm is the median. Note that the median is not found by a linear operation on the data, which is a general feature of L1-norm estimates.
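A two-line numerical check (not from the notes; the measurement values, including the outlier, are invented) contrasts the L2 estimate (the mean) with the L1 estimate (the median):

import numpy as np

# Sketch: repeated measurements of one quantity, with one outlier.
d = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 25.0])   # last value is an outlier

print("L2 (mean)  :", d.mean())        # minimizes sum((d - m)**2); pulled by the outlier
print("L1 (median):", np.median(d))    # minimizes sum(|d - m|); stays near 10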

3.2.1 General LS solution

Going back to our general problem, we have

d̂ = G m̂

and the predicted data is a linear combination of the g_k's. Using linear vector space theory, we can show that the predicted data d̂ must lie in the estimation space, the set of ALL possible results that G can produce (the range of G). Writing the L2 norm of the residuals between the data and the prediction,

||r||_2^2 = ||d − d̂||_2^2 = r^T r = (d − Gm̂)^T (d − Gm̂) = d^T d − 2 m̂^T G^T d + m̂^T G^T G m̂

we now take the derivative with respect to m̂ and set it to zero,

\frac{d}{d m̂} ||r||_2^2 = \frac{d}{d m̂} \left[ d^T d − 2 m̂^T G^T d + m̂^T G^T G m̂ \right] = 0

0 = 0 − 2 G^T d + 2 G^T G m̂

It is worth pointing out that this is the derivative of a scalar with respect to a vector; we will show below that it works as simply as it appears, by writing out all the components. Simplifying a bit more,

G^T d = G^T G m̂    (3.2)

which is called the normal equations. Assuming the inverse of (G^T G) exists, we can isolate m̂ to end up with

m̂ = (G^T G)^{-1} G^T d

Note that the matrix (G^T G) is a square N × N matrix and G^T d is an N-dimensional column vector.

Derivation with another notation

Starting with the L2 norm of the residuals,

||r||_2^2 = \sum_{j=1}^{M} r_j^2 = \sum_{j=1}^{M} \left( d_j − \sum_{i=1}^{N} g_{ji} m_i \right)^2

we take the derivative with respect to m̂_k and set it to zero:

0 = \frac{d}{d m̂_k} ||r||_2^2
  = \frac{d}{d m̂_k} \sum_{j=1}^{M} \left( d_j − \sum_{i=1}^{N} g_{ji} m_i \right) \left( d_j − \sum_{l=1}^{N} g_{jl} m_l \right)
  = \frac{d}{d m̂_k} \sum_{j=1}^{M} \left[ d_j d_j − \sum_{l=1}^{N} d_j g_{jl} m_l − \sum_{i=1}^{N} d_j g_{ji} m_i + \sum_{i=1}^{N} \sum_{l=1}^{N} g_{ji} g_{jl} m_i m_l \right]

We may look at each of these terms independently. The first term is

\frac{d}{d m̂_k} \sum_{j=1}^{M} d_j d_j = 0

The second and third terms are similar,

\frac{d}{d m̂_k} \sum_{j=1}^{M} \left[ −2 d_j \sum_{l=1}^{N} g_{jl} m_l \right] = −2 \sum_{j=1}^{M} \sum_{l=1}^{N} d_j g_{jl} \delta_{lk} = −2 \sum_{j=1}^{M} d_j g_{jk} \;\rightarrow\; −2\, G^T d

and the last term is

\frac{d}{d m̂_k} \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{N} g_{ji} g_{jl} m_i m_l = \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{N} \left( \delta_{ik} g_{ji} g_{jl} m_l + \delta_{lk} g_{ji} g_{jl} m_i \right) = \sum_{j=1}^{M} \left[ \sum_{l=1}^{N} g_{jk} g_{jl} m_l + \sum_{i=1}^{N} g_{ji} g_{jk} m_i \right]

Noting that the two sums are identical (g_{jk} g_{jl} m_l and g_{ji} g_{jk} m_i differ only in the name of the dummy index), in the end we have

2 \sum_{j=1}^{M} \sum_{i=1}^{N} g_{jk} g_{ji} m_i \;\rightarrow\; 2\, G^T G m̂

and, setting the sum of all the terms to zero, we have derived the same result as before: the normal equations.
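As a quick numerical confirmation (a sketch with invented data, not part of the notes), the solution of the normal equations coincides with the output of a standard least-squares solver when G^T G is well conditioned:

import numpy as np

# Sketch: normal equations vs. a QR/SVD-based least-squares solver.
rng = np.random.default_rng(1)
G = rng.normal(size=(50, 3))                     # M = 50 data, N = 3 parameters
m_true = np.array([1.0, -2.0, 0.5])
d = G @ m_true + rng.normal(0.0, 0.1, 50)

m_normal = np.linalg.solve(G.T @ G, G.T @ d)     # normal equations (3.2)
m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)  # library least squares
print(np.allclose(m_normal, m_lstsq))            # True for this well-posed case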

3.2.2 Geometrical interpretation of the normal equations

The normal equations seem to have no intuitive content:

m̂ = (G^T G)^{-1} G^T d

which was derived from

(G^T G) m̂ = G^T d    (3.3)

Let's consider the data prediction d̂ as a linear combination of the g_k vectors, assumed linearly independent,

d̂ = G m̂ = g_1 m̂_1 + g_2 m̂_2 + ... + g_N m̂_N

where g_k is the kth column vector of the G matrix. Recall that the set of g_k's spans a subspace of the entire R^M data space, sometimes called the estimation space or model space. Starting with (3.3) we have

(G^T G) m̂ = G^T d
G^T (G m̂) = G^T d
G^T (G m̂ − d) = 0

and recalling the definition of the residual, r = d − G m̂, this is

G^T (G m̂ − d) = −G^T r = 0

So, in other words, the normal equations in the least squares sense mean that

G^T r = \begin{pmatrix} g_1 \cdot r \\ g_2 \cdot r \\ \vdots \\ g_N \cdot r \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}

suggesting that the residual vector is orthogonal to every one of the column vectors of the G matrix. The key point is that making the residual perpendicular to the estimation subspace minimizes the length of r.

Figure 3.1: Geometrical interpretation of the LS and normal equations: the data d splits into G m̂, which lies in the subspace of G, and the residual r, orthogonal to that subspace.

We are basically projecting the data d ∈ R^M onto the column space of G. This is called the orthogonal projection of d onto the subspace R(G), such that the actual measurements d can be expressed as

d = d̂ + r

We have created a vector G m̂ = d̂, where d̂ is called the orthogonal projection of d onto the subspace of G. The idea of this projection relies on the Projection Theorem for Hilbert spaces, but we are not going to go too deeply into this. The theorem says that, given the subspace of G, every vector can be written uniquely as the sum of two parts: one part lies in the subspace of G and the other is orthogonal to the first (see Figure 3.1). The part lying in this subspace of G is the orthogonal projection of the vector d onto G, d = d̂ + r. There is a linear operator P_G, the projection matrix, that acts on d to generate d̂:

P_G = G (G^T G)^{-1} G^T


This projection matrix has particularly interesting properties. For example, P² = P, meaning that if we apply the projection matrix twice to a vector d we get the same result as if we apply it only once, namely d̂. The matrix P is also symmetric.

Example: Straight line fit

Assume we have 3 measurements, d ∈ R^M with M = 3. For a straight line we only need 2 coefficients, the intercept (zero crossing) and the slope, so m ∈ R^N with N = 2. The data predictions are then

d̂ = G m̂

\begin{pmatrix} d̂_1 \\ d̂_2 \\ d̂_3 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{pmatrix} \begin{pmatrix} m̂_1 \\ m̂_2 \end{pmatrix}

or

d̂_1 = g_{11} m̂_1 + g_{12} m̂_2
d̂_2 = g_{21} m̂_1 + g_{22} m̂_2
d̂_3 = g_{31} m̂_1 + g_{32} m̂_2

and, as we have said, the residual vector would be r = d − d̂. Reorganizing, we have

d = d̂ + r,    r ⊥ d̂

which is described in the figure below.
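A short sketch (not from the notes; the three data values are invented) verifies the properties of the projection matrix for this straight-line case:

import numpy as np

# Sketch: P = G (G^T G)^{-1} G^T projects data onto the column space of G.
x = np.array([0.0, 1.0, 2.0])
G = np.column_stack([np.ones_like(x), x])        # straight-line design matrix
d = np.array([1.1, 2.9, 5.2])

P = G @ np.linalg.inv(G.T @ G) @ G.T
m_hat = np.linalg.solve(G.T @ G, G.T @ d)

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.allclose(G @ m_hat, P @ d))                # predicted data = projection of d
print(np.allclose(G.T @ (d - P @ d), 0))            # residual orthogonal to columns of G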

3.2.3 Maximum likelihood

We can also use the maximum likelihood method to interpret the least squares method and the normal equations. This technique was developed by R. A. Fisher in the 1920's and has dominated the field of statistical inference since then. Its power is that it can (in principle) be applied to any type of estimation problem, provided that one can write down the joint probability distribution of the random variables which we assume model the observations. Maximum likelihood looks for the optimum values of the unknown model parameters as those that maximize the probability that the observed data are produced by the model.

Suppose we have a random sample of M observations x = x_1, x_2, ..., x_M drawn from a probability distribution (PDF) f(x_i, θ), where the parameter θ is unknown. We can extend this to a set of model parameters, f(x_i, m). The joint probability for all M observations is

f(x, m) = f(x_1, m) f(x_2, m) ··· f(x_M, m) = L(x, m)

Figure 3.2: The LS fit for a straight line. The estimation space is the straight line given by G m̂; this is where all predictions lie. The real measurements d_k lie above or below this line, and are projected onto the line via the residual.

We call L(x, m) = f(x, m) the likelihood function of m. If L(x, m_0) > L(x, m_1), we can say that m_0 is a more plausible value for the model vector m than m_1, because m_0 ascribes a larger probability to the observed values in vector x than m_1 does. In practice we are given a particular data vector and we wish to find the most plausible model that "generated" these data, by finding the model that gives the largest likelihood.

Example 1: The mean value

Assume we are given M measurements of the same quantity and that the data contain normally distributed errors, so d ~ N(μ, σ²), where μ is the mean value and σ² is the variance. The probability density for a single datum is

f(d_i, μ) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{-(d_i - \mu)^2}{2\sigma^2} \right)

and the joint distribution or likelihood function is

L(d, μ) = (2\pi)^{-M/2} \sigma^{-M} \exp\left( \frac{ -\sum_{i=1}^{M} (d_i - \mu)^2 }{ 2\sigma^2 } \right)

Maximizing the likelihood function is equivalent to maximizing its logarithm,

\max_\mu L(d, \mu) = \max_\mu \ln\{ L(d, \mu) \}

so we let L now denote the log-likelihood function to be maximized,

L = -\frac{M}{2} \ln(2\pi) - M \ln(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{M} (d_i - \mu)^2

Taking the derivative with respect to μ,

0 = \frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{M} (d_i - \mu) = \sum_{i=1}^{M} d_i - M \mu

and as expected we obtain the arithmetic mean,

\mu = \frac{1}{M} \sum_{i=1}^{M} d_i

We can also look for the maximum likelihood estimate of the variance σ²:

0 = \frac{\partial L}{\partial \sigma} = -\frac{M}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{M} (d_i - \mu)^2

which gives

\sigma^2 = \frac{1}{M} \sum_{i=1}^{M} (d_i - \mu)^2

The least squares problem with maximum likelihood

We return to the linear inverse problem we had before,

d = Gm + ε

where we assume the errors are normally distributed, ε_i ~ N(0, σ_i²). The joint probability distribution or likelihood function in this case is

L(d, m) = \frac{1}{(2\pi)^{M/2} \prod_{i=1}^{M} \sigma_i} \prod_{i=1}^{M} \exp\left[ -\left(d_i - (Gm)_i\right)^2 / 2\sigma_i^2 \right]

We want to maximize the function above; the constant factor has no effect, leading to

\max_m L = \max_m \exp\left[ -\sum_{i=1}^{M} \left(d_i - (Gm)_i\right)^2 / 2\sigma_i^2 \right]

Taking the logarithm of this likelihood function,

\max_m L = \max_m \left[ -\sum_{i=1}^{M} \left(d_i - (Gm)_i\right)^2 / 2\sigma_i^2 \right]

and switching to a minimization instead,

\min_m \sum_{i=1}^{M} \left(d_i - (Gm)_i\right)^2 / 2\sigma_i^2

In matrix form this can be expressed as

\min_m \left[ \frac{1}{2} (d - Gm)^T \Sigma^{-1} (d - Gm) \right]

where Σ is the data covariance matrix. To minimize, we take the derivative with respect to the model parameter vector and set it to zero,

0 = \frac{\partial}{\partial m} \left[ (d - Gm)^T \Sigma^{-1} (d - Gm) \right]
  = \frac{\partial}{\partial m} \left[ d^T \Sigma^{-1} d - 2 m^T G^T \Sigma^{-1} d + m^T G^T \Sigma^{-1} G m \right]
  = -2 G^T \Sigma^{-1} d + 2 G^T \Sigma^{-1} G m

finally leading to

m̂ = (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1} d

which comes from what are sometimes called the generalized normal equations,

(G^T \Sigma^{-1} G) m̂ = G^T \Sigma^{-1} d    (3.4)

or the weighted least squares solution for the overdetermined case.
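A small sketch (not from the notes; the geometry, noise levels and "true" parameters are assumptions) implements the weighted least squares solution (3.4) for data with unequal error bars:

import numpy as np

# Sketch: weighted LS, m_hat = (G^T W G)^{-1} G^T W d with W = Sigma^{-1}.
rng = np.random.default_rng(2)
M = 40
x = np.linspace(0, 1, M)
G = np.column_stack([np.ones(M), x])
m_true = np.array([1.0, 3.0])

sigma = np.where(np.arange(M) < 20, 0.05, 0.5)   # first half is more precise
d = G @ m_true + rng.normal(0.0, sigma)

W = np.diag(1.0 / sigma**2)                      # Sigma^{-1} for independent errors
m_hat = np.linalg.solve(G.T @ W @ G, G.T @ W @ d)
print(m_hat)                                     # close to m_true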

3.3 Why LS and the effect of the norm

As you might have expected, the choice of norm is somewhat arbitrary. So why is the use of least squares so popular?

1. Least squares estimates are linear in the data and easy to program.
2. They correspond to the maximum likelihood estimate for normally distributed errors. The normal distribution arises from the central limit theorem: add up random effects and you get a Gaussian.
3. The estimate is a linear mapping of the data, so errors propagate linearly from the input statistics (the data) to the model.
4. Well-known statistical tests and confidence intervals can be obtained.

Figure 3.3: Schematic of a straight-line fit to (x, d) data points under the L1, L2 and L∞ norms. The L1 fit is not as affected by the single outlier.

Least squares has some disadvantages too. The main one is that the result is sensitive to outliers (see Figure 3.3). Another popular norm is the L1 norm. Some of its characteristics:

1. The estimate is non-linear in the data and is solved by linear programming (to be seen later).
2. It is less sensitive to outliers.
3. Confidence intervals and hypothesis testing are somewhat more difficult, but can be done.

3.4 The L2 problem from 3 perspectives

1. Geometry: orthogonality of the residual and the predicted data,

   d̂ · r = 0
   (G m̂)^T (d − G m̂) = 0
   G^T (d − G m̂) = G^T r = 0

   which leads to m̂ = (G^T G)^{-1} G^T d.

2. Calculus: we want to minimize ||r||_2,

   r^T r = (d − G m̂)^T (d − G m̂)
   \frac{\partial}{\partial m̂} (r^T r) = 0  ⟹  G^T (d − G m̂) = 0

   leading to m̂ = (G^T G)^{-1} G^T d.

3. Maximum likelihood for a multivariate normal distribution:

   • Maximize: exp[ −(d − G m̂)^T Σ^{-1} (d − G m̂) ]
   • Minimize: (d − G m̂)^T Σ^{-1} (d − G m̂)
   • Leading to: m̂ = (G^T Σ^{-1} G)^{-1} G^T Σ^{-1} d

   which comes from the generalized normal equations.

3.5 Full example: Line fit

We come back to the general line-fit problem, where we have two unknowns, the intercept m_1 and the slope m_2, and M observations d_i. The inverse problem is d = Gm, which written out is

\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{pmatrix} \begin{pmatrix} m_1 \\ m_2 \end{pmatrix}

As you are already aware, the least squares solution of this problem is m̂ = (G^T G)^{-1} G^T d, which we now write out explicitly. The last term is

G^T d = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{pmatrix} \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{M} d_i \\ \sum_{i=1}^{M} x_i d_i \end{pmatrix}

The first term is (note the typo in the book)

(G^T G)^{-1} = \begin{pmatrix} M & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}^{-1} = \frac{1}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix}

leading to our final result,

m̂ = \frac{1}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix} \begin{pmatrix} \sum_i d_i \\ \sum_i x_i d_i \end{pmatrix}

Using the covariance of the model parameters,

cov(m̂) = \sigma^2 (G^T G)^{-1} = \frac{\sigma^2}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix}

where σ² is the variance of the individual measurements. This equation shows that even if the data d_i are uncorrelated, the model parameters can be correlated; up to the positive factor in front,

cov(m_1, m_2) \propto -\sum_{i=1}^{M} x_i

A number of important observations:

• There is a negative correlation between intercept and slope (when the x_i are predominantly positive).
• The magnitude of the correlation depends on the spread and location of the x-axis values.

How can we reduce the covariance between the model parameters? We define a new axis,

y_i = x_i - \frac{1}{M} \sum_{j=1}^{M} x_j

which is basically equivalent to shifting the origin of the x axis. The covariance is now

cov(m̂) = \sigma^2 \begin{pmatrix} M & 0 \\ 0 & \sum_i y_i^2 \end{pmatrix}^{-1} = \sigma^2 \begin{pmatrix} 1/M & 0 \\ 0 & 1/\sum_i y_i^2 \end{pmatrix}

This new relation shows independent intercept and slope estimates, and if σ is the standard error of the observed data then:

• The standard error of the intercept is σ/\sqrt{M}: with more data you reduce the variance of the intercept.
• The standard error of the slope is σ / \sqrt{\sum_{i=1}^{M} y_i^2}, showing that if the observation points on the x axis are closely clustered, the uncertainties in the slope estimate are greater.
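A numerical sketch (not from the notes; the x range, noise level and "true" line are invented) shows the intercept–slope correlation and how centering the x axis removes it:

import numpy as np

# Sketch: explicit line fit, its parameter covariance, and the effect of
# centering the x axis (y_i = x_i - mean(x)).
rng = np.random.default_rng(3)
M = 25
x = np.linspace(5.0, 10.0, M)                # x values far from the origin
m_true = np.array([2.0, 0.7])                # intercept, slope
sigma = 0.3
d = m_true[0] + m_true[1] * x + rng.normal(0.0, sigma, M)

def fit_line(xv, dv):
    G = np.column_stack([np.ones_like(xv), xv])
    GtG_inv = np.linalg.inv(G.T @ G)
    return GtG_inv @ (G.T @ dv), sigma**2 * GtG_inv

m_hat, cov = fit_line(x, d)
m_hat_c, cov_c = fit_line(x - x.mean(), d)   # centered axis

print("correlation (raw x)     :", cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1]))
print("correlation (centered x):", cov_c[0, 1] / np.sqrt(cov_c[0, 0] * cov_c[1, 1]))
print("slope standard error    :", np.sqrt(cov[1, 1]))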

Chapter 4

Tikhonov regularization, variance and resolution

4.1 Tikhonov regularization

Tikhonov regularization is one of the most common methods used for regularizing an inverse problem. The reason to do this is that in many cases the inverse problem is ill-posed and small errors in the data will give very large errors in the resultant model. Another possible reason for using this method is that we have a mixed-determined problem, where for example we might have a model null space. For the overdetermined part we would like to minimize the residual vector,

\min ||r||    ⟹    m̂ = (G^T G)^{-1} G^T d

while for the underdetermined part we actually minimize the model norm,

\min ||m||    ⟹    m̂ = G^T (G G^T)^{-1} d

and of course, for the mixed-determined case, we will try something in between,

\Phi_m = ||d - Gm||_2^2 + \alpha^2 ||m||_2^2

As we have seen before, we want to minimize

\min_m \Phi_m = \min_m \left\| \begin{pmatrix} G \\ \alpha I \end{pmatrix} m - \begin{pmatrix} d \\ 0 \end{pmatrix} \right\|_2^2

or, equivalently,

\min_m \Phi_m    ⟹    m̂ = (G^T G + \alpha^2 I)^{-1} G^T d

So the question is: what do we choose for α? If we choose α very large, we focus our attention on minimizing the model norm ||m|| while neglecting the residual norm. If we choose α too small, we are in the opposite situation and we try to fit the data perfectly, which is probably not what we want. A graphical way to see how the two norms interact depending on the choice of α is shown in Figure 5.1 of our textbook (Aster's book). The idea is that as the residual norm increases, the model norm decreases, leading to the so-called L-curve. This is because ||m||_2 is a strictly decreasing function of α, while ||d − Gm||_2 is a strictly increasing function of α.

Our job now is to find an optimal value of α. There are a few methods we are going to see that find the optimal α, including the discrepancy criterion, the L-curve criterion and cross-validation. Before going there, we want to understand the effect of the choice of α on the resolution of the estimate as well as on the covariance of the model parameters. Similarly, we want to understand the choice of the number of singular values used in the generalized inverse built from the SVD, and the SVD implementation of Tikhonov regularization. Finally, we will see how other norms can be chosen in order to penalize models with excessive roughness or curvature.

SVD Implementation

Using our previous expression, but using the SVD of the G nmatrix, namely G = UΛVT and from above ˆ = GT d (GT G + α2 I)m we can replace ˆ (VΛUT UΛVT + α2 I)m 2 T 2 ˆ (VΛ V + α I)m and the solution is ˆα = m

X i

where fi =

λ2i

= VΛUT d = VΛUT d

uTi d λ2i vi + α 2 λi

λ2i λ2i + α2

are called the filter factors. The filter factors have an obvious effect on the resultant model, such that for λi  α, the factor fi ≈ 1 and the result would be like T ˆ = Vp Λ−1 m p Up d

where we had chosen the value of p for all singular values that are large. In contrast, for λi  α, the factor fi ≈ 0 and this part of the solution will be damped out, or downweighted.

4.3 Resolution vs variance, the choice of α, or p

39

In matrix form we can write the expression as ˆ α = VFΛ−1 UT d m where Fii =

λ2i

λ2i + α2

and zero elsewhere. Unlike what we saw earlier, the truncation achieved by choosing an integer value p, for the number of singular values and singular vectors to use, there is going to be a smooth transition from the included and excluded singular values. Other filter factors have been suggested fi =

4.3

λi λi + α

Resolution vs variance, the choice of α, or p

From previous lectures, we can now discuss the resolution and variance of our resultant model using the generalized inverse, and in this case, the Tikhonov regularization. We had ˆ m

=

(GT G + α2 I)−1 GT d

= G# d = VFΛ−1 UT d T = Vp Λ−1 p Up d

where the first and second equations use the general Tikhonov regularization, the third equation is the SVD using filter factors and the last one is the result if we choose a p number of singular values. The model resolution matrix Rm was defined via ˆ = Ggen d = Ggen Gmtrue = Rmtrue m is then defined for the three cases as Rm,α

= G# G

Rm,α

= VFVT

Rm,p

= Vp VpT

ˆ 6= mtrue . The In all regularizations R 6= I, the estimate will be biased and m bias introduced by regularizing is ˆ − mtrue = [R − I]mtrue m but since we don’t know mtrue , we don’t know the sense of the bias. We can’t even bound the bias, since it depends on the true m as well.

40

Chapter 4. Tikhonov Regularazation

Finally, we also have to deal with uncertainties, so as we have seen before the model covariance matrix Σm is

ˆm ˆT Σm = m D E T = G# ddT G# = G# Σd G#

T

and assuming Σd = σd2 I, our three cases lead to T

Σm,α Σm,α

= σ 2 G# G# = σ 2 VF2 Λ−2 VT

Σm,p

T = σ 2 Vp Λ−2 p Vp

We could use this to evaluate confidence intervals, ellipses on the model, but since the model is biased by an unknown amount, the confidence intervals might not be representative of the true deviation of the estimated model.

4.3.1

Example 1: Shaw’s problem

In this example I would like to use some practical application of the Tikhonov regularazation using both the general approach (generalized matrix explicitly) and using the SVD. I take the examples from Aster’s book directly. In the Shaw problem, the data that is measured is diffracted light intensity as a function of outgoing angle d(s), where the angle is −π/2 6 s 6 π/2. We use the discretized version of the problem as outlined in the book, namely the mathematical model relating the data observed d and the model vector m is d = Gm where d ∈ RM and m ∈ RN , but in our example we will have M = N . The G matrix is defined for the discrete case 2  π 2 sin (π(sin(si ) + sin(θj ))) Gij = (cos(si ) + cos(θj )) N π(sin(si ) + sin(θj )) Note that the part inside the large brackets is the sinc function. We discretize the model and data vectors at the same angles si = θi =

(i − 0.5)π π − N 2

i = 1, 2, . . . , N.

which in theory would give us an even-determined linear inverse problem, but as we will see, the problem is very ill-conditioned. Similar to what was done in the book, we use a simple delta function for the true model  1 i = 10 mi = 0 otherwise

4.3 Resolution vs variance, the choice of α, or p

41

and generate synthetic data by using d = Gm +  where the errors are i ∼ N (0, σ 2 ) with a σ = 10−6 . Note that the errors are quite small, but nevertheless due to the ill-posed inverse problem, will have a significant effect on the resultant models. In this section we will focus on two main ways to estimate an appropriate ˆ model m, ˆ m ˆ m

(GT G + α2 I)−1 GT d

=

T = Vp Λ−1 p Up d

where in the first case we need to choose a value of α, while in the second case (SVD) we need to choose a value of p, the num,ber of singular values and vectors to use. Since the singular values are rarely exactly zero, the choice is not so easy to make. In addition to making a particular choice, we need to understand what the effect of our choice has on our model resolution and model covariance. In the next figures I present the results graphically in order to get an intuitive understanding of our choices. 2.5

4

10

real model α = 0.001 α = 3.1623e−06 p = 8e−08

2

3

10

1.5

1

2 Intensity

||m||

10

1

10

0.5

0

−0.5

0

10

−1

−1.5

−1

10 −6 10

−5

10

−4

||d−Gm||

−3

10

10

−2 −1.5

−1

−0.5

Resolution 0.0017783

0 θ

0.5

1

Resolution 1e−05

1.5

Resolution 8e−08

20

20

20

15

15

15

10

10

10

5

5

0.7 real model α = 0.001 α = 3.1623e−06 p = 8e−08

0.6

Observed Intensity

0.5

5

10

15

20

5 5

10

15

20

5

10

15

20

0.4

Covariance 0.0017783

Covariance 1e−05

Covariance 8e−08

0.3

20

20

20

0.2

15

15

15

10

10

10

5

5

0.1

0 −1.5

−1

−0.5

0 Outgoing angle θ

0.5

1

5

1.5

5

10

15

20

5

10

15

20

5

10

15

20

Figure 4.1: Some models using the Generalized inverse. Top-Left: The L-curve for the residual norm and model norm. Various choices of α are used, and the colored dots are three choices made. Top-Right: True model (circles) and the estimated models for the three choices on the left. Bottom-Left: The synthetic data (circles) and the three predicted data. Bottom-Right: The resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.


Figure 4.2: Some models using the truncated SVD. Top-Left: The residual norm as a function of the number of retained singular values p, together with the singular values s_i and the Picard ratios |u_i^T d| and |u_i^T d|/s_i; the colored dots mark the three choices of p. Top-Right: True model (circles) and the estimated models for the three choices (p = 14, 8, 2). Bottom-Left: The synthetic data (circles) and the three predicted data. Bottom-Right: The resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.
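For reference, the resolution and covariance matrices displayed in Figures 4.1 and 4.2 can be computed from the corresponding generalized inverses. A minimal sketch, assuming the standard definitions $R_m = G^{\dagger} G$ and $\mathrm{Cov}(\hat{m}) = \sigma^2 G^{\dagger} G^{\dagger T}$ for independent data errors (function names are mine; G reuses the earlier sketch):

```python
import numpy as np

def gdagger_dls(G, alpha):
    """Generalized inverse of the damped least squares estimator."""
    return np.linalg.solve(G.T @ G + alpha**2 * np.eye(G.shape[1]), G.T)

def gdagger_tsvd(G, p):
    """Generalized inverse of the truncated SVD estimator."""
    U, s, Vt = np.linalg.svd(G)
    return Vt[:p].T @ np.diag(1.0 / s[:p]) @ U[:, :p].T

def resolution_and_covariance(Gdag, G, sigma):
    Rm = Gdag @ G                     # model resolution matrix
    Cm = sigma**2 * Gdag @ Gdag.T     # model covariance for iid data errors
    return Rm, Cm

Rm_dls, Cm_dls = resolution_and_covariance(gdagger_dls(G, 1e-5), G, 1e-6)
Rm_svd, Cm_svd = resolution_and_covariance(gdagger_tsvd(G, 8), G, 1e-6)
```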

4.4 Smoothing Norms or Higher-Order Tikhonov

Very often we seek solutions that not only minimize the misfit but also some measure of the roughness of the solution. In some cases, when we minimize the simple minimum-norm functional
$$\|f\|^2 = \int_a^b f(x)^2 \, dx$$
we may get the unwanted consequence of putting structure in the estimated model only where we happen to have data. Instead, our geophysical intuition might suggest that the solution should not be very rough, so we minimize instead
$$\|f\|^2 = \int_a^b f'(x)^2 \, dx, \qquad f(a) = 0,$$

where we need to add a boundary condition (the condition on the right-hand side above). The boundary condition is needed because the derivative norm is insensitive to constants, that is, $\|f + b\| = \|f\|$ for any constant $b$; without it we really have a semi-norm.


4.4.1 The Discrete Case

Assuming the model parameters are ordered in physical space (e.g., with depth, or with lateral distance), we can define differential operators of the form
$$D_1 = \begin{pmatrix} -1 & 1 & 0 & 0 & \cdots \\ 0 & -1 & 1 & 0 & \cdots \\ 0 & 0 & -1 & 1 & \cdots \\ & & & & \ddots \end{pmatrix}$$
and the second derivative
$$D_2 = \begin{pmatrix} -2 & 1 & 0 & 0 & \cdots \\ 1 & -2 & 1 & 0 & \cdots \\ 0 & 1 & -2 & 1 & \cdots \\ & & & & \ddots \end{pmatrix}$$
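A short sketch of how such operators can be built with NumPy (the exact dimensions and the treatment of the boundary rows are a modeling choice; the function names are mine):

```python
import numpy as np

def first_difference(N):
    """(N-1) x N first-difference operator with rows (-1, 1, 0, ...)."""
    D1 = np.zeros((N - 1, N))
    idx = np.arange(N - 1)
    D1[idx, idx] = -1.0
    D1[idx, idx + 1] = 1.0
    return D1

def second_difference(N):
    """N x N second-difference operator with the (1, -2, 1) stencil."""
    return -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)

D1 = first_difference(20)
D2 = second_difference(20)
```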

There are a few ways to implement this in the discrete case, namely:

1. Minimize a functional of the form
$$\Phi_m = \|d - Gm\|_2^2 + \alpha^2 \|Dm\|_2^2,$$
which leads to
$$\hat{m} = \left(G^T G + \alpha^2 D^T D\right)^{-1} G^T d.$$
Note the similarity with our previous results, where instead of the matrix $D^T D$ we had the identity matrix $I$.

2. Alternatively, we can solve the coupled (stacked) system of equations
$$\begin{pmatrix} d \\ 0 \end{pmatrix} = \begin{pmatrix} G \\ \alpha D \end{pmatrix} m + \epsilon,$$
which can be rewritten in a simplified way as
$$d' = Hm + \epsilon,$$
so that we recover the standard expression for the inverse problem to be solved (a code sketch of this stacked system is given at the end of this subsection). Due to the effect of the $D$ matrix, the ill-posedness of the original expression can be significantly reduced (depending on the chosen value of $\alpha$). The advantage of this approach is that one can impose additional constraints, like non-negativity.

3. We can also transform the system of equations in a similar way:
$$d = Gm + \epsilon$$
$$d = GD^{-1}Dm + \epsilon$$
$$d = G'm' + \epsilon$$


where
$$G' = GD^{-1}, \qquad m' = Dm.$$
As you can see, we have not changed the condition of fitting the data, so that
$$\|d - G'm'\|^2 = \|d - Gm\|^2,$$
but we have also added a model norm of the form
$$\|m'\|^2 = \|Dm\|^2.$$

Note that for this to actually work, the matrix $D$ needs to be invertible; sometimes this inverse can be obtained analytically. We can also use the SVD at this stage. As a cautionary note, it is important to keep in mind that Tikhonov regularization will recover the true model only to the extent that the assumption behind the additional norm (be it $\|m\|$ or $\|Dm\|$) is correct. We would not expect to get the right answer in the previous examples, since the true model $m_{\text{true}}$ is a delta function.
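As mentioned in approach 2 above, the stacked system can be handed directly to an ordinary least-squares routine. A minimal sketch (my own function name; G, d, and second_difference reuse the earlier sketches, and the choices of D and α are only illustrative):

```python
import numpy as np

def tikhonov_stacked(G, d, D, alpha):
    """Higher-order Tikhonov via the augmented system [G; alpha*D] m ~ [d; 0]."""
    H = np.vstack([G, alpha * D])
    d_aug = np.concatenate([d, np.zeros(D.shape[0])])
    m_hat, *_ = np.linalg.lstsq(H, d_aug, rcond=None)
    return m_hat

m_smooth = tikhonov_stacked(G, d, second_difference(G.shape[1]), alpha=1e-3)
```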

4.5 Fitting within tolerance

In real life, the data that we have acquired has some level of uncertainty. This means there is some random error $\epsilon$ which we do not know, but whose statistical distribution we think we know (e.g., normally distributed with zero mean and variance $\sigma^2$). In this respect, we should not try to fit the data exactly, but rather fit it to within the error bars. This method is sometimes called the discrepancy principle, but I prefer to use the term fitting within tolerance. In our inverse problem we want to minimize a functional with two norms,
$$\min \|Dm\|, \qquad \min \|d - Gm\|,$$
and to do that we were looking at the L-curve, using the damped least squares or the SVD approach, that is, choosing an $\alpha$ or a number $p$ of non-zero singular values to keep. In fact, for data with uncertainties we should actually be looking at a system of the form
$$\min \|Dm\| \quad \text{subject to} \quad \|d - Gm\| \le T$$

where we arrive at the value of the tolerance $T$ by a subjective decision about what we regard as acceptable odds of being wrong. We will almost always use the 2-norm on the data space, and thus the chi-squared statistic will be our guide. In contrast to our previous case, we no longer have an equality. Under certain assumptions, provided the model with $Dm = 0$ (which for the simple model norm, $D = I$, is the zero model) does not already satisfy $\|d - Gm\| \le T$, we can instead solve the equality-constrained problem through the functional
$$\Phi_m = \left[T^2 - \|d - Gm\|_2^2\right] + \alpha^2 \|m\|_2^2$$
From a simple point of view, for a fixed value of $T$, minimization of the two terms can be regarded as seeking a compromise between two undesirable properties of the solution: one term represents model complexity, which we wish to keep small; the other measures the data misfit, also a quantity to be suppressed as far as possible. By making $\alpha > 0$ but small we pay attention to the penalty function at the expense of the data misfit, while making $\alpha$ large works in the other direction, and allows large penalty values to secure a good match to the observations. From a more quantitative perspective, when the residual norm $\|d - Gm\|_2$ is just above the tolerance $T$, we are not fitting the data to the level needed, but we also do not want to over-fit the data. As can be seen from the figure, if we know the threshold value, the problem is simpler, because we just need to figure out the value of the Lagrange multiplier $\alpha$ such that the residual-norm tolerance is satisfied. Choosing a value of $\alpha$ to the left of this threshold will fit the data better, but will result in a model with a larger norm (or a rougher model) than what is required by the data. Choosing a value of $\alpha$ to the right instead will give a poor fit to the data, even within the uncertainties.

4.5.1 Example 2

First, we need to figure out the value of $T$. In our example, we said that the errors were
$$\epsilon \sim N(0, \sigma^2), \qquad \sigma = 10^{-6}.$$
Since we have $M = 20$ points, we need to find a solution whose residual norm is
$$T = \|\epsilon\|_2 = \sqrt{\sum_{i=1}^{20} \sigma_i^2} = \sqrt{20 \times 10^{-12}} \approx 4.47 \times 10^{-6}.$$


Now that we have our value of the tolerance $T$, we can go back to our initial problem
$$\Phi_m = \left[T^2 - \|d - Gm\|_2^2\right] + \alpha^2 \|m\|_2^2$$
and find the value of $\alpha$, or the ideal value of $p$, that satisfies our new functional. In this example I will use the same graphical interface as in the previous example. Now, in addition to the L-curves obtained for the SVD and the damped least squares (DLS) methods, we have our threshold value $T$ represented by a vertical dashed line. We pick the value on the L-curve that is closest to $T$. In the SVD approach, since we have discrete singular values, we choose the $p$ that comes closest, while for the DLS we can in fact get really close. In both cases I just show approximate values, using the discretization of $\alpha$ I used for plotting the figure.
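For the DLS case, one simple way to automate this choice is a root search on the residual norm as a function of α, since ‖d − Gm_α‖₂ grows monotonically with α. A minimal sketch using bisection in log α (my own implementation, not the graphical procedure described above; G and d reuse the earlier sketches, and the bracket [alpha_lo, alpha_hi] is assumed to straddle the tolerance):

```python
import numpy as np

def discrepancy_alpha(G, d, T, alpha_lo=1e-12, alpha_hi=1e2, tol=1e-3, maxit=200):
    """Bisection in log10(alpha) so that ||d - G m_alpha||_2 is approximately T."""
    def residual_norm(alpha):
        m = np.linalg.solve(G.T @ G + alpha**2 * np.eye(G.shape[1]), G.T @ d)
        return np.linalg.norm(d - G @ m)

    lo, hi = np.log10(alpha_lo), np.log10(alpha_hi)
    for _ in range(maxit):
        mid = 0.5 * (lo + hi)
        if residual_norm(10.0**mid) > T:
            hi = mid        # residual too large: use less regularization
        else:
            lo = mid        # residual below tolerance: use more regularization
        if hi - lo < tol:
            break
    return 10.0 ** (0.5 * (lo + hi))

sigma, M = 1e-6, 20
T = np.sqrt(M) * sigma               # ~ 4.47e-6, as computed above
alpha_star = discrepancy_alpha(G, d, T)
```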


Figure 4.3: Fitting within tolerance with the DLS and SVD approaches. Our preferred model is the blue one. Top-Left: The L-curve for the residual norm and model norm; the SVD curve has been shifted upwards for clarity, and the value of T is shown as a vertical dashed line. Various choices of α around T are chosen. Top-Right: True model (circles) and estimated models for the choices on the left. Bottom-Left: The synthetic data (circles) and predicted data. Bottom-Right: For the SVD, the singular values and Picard criteria are shown.


Figure 4.4: Resolution and covariance matrices for the DLS (top two rows of panels) and SVD (bottom two rows of panels) approaches, while fitting within tolerance. Note that since the SVD approach is discrete in nature, we might not get an ideal selection, hence the repeated value of p. Using the filter-factors approach might lead to better results. Our preferred value is the middle column.
