Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms
Supplementary Material

Sathish Ramani*, Student Member, Thierry Blu, Senior Member, and Michael Unser, Fellow

This material supplements some sections of the paper entitled "Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms". Here, we elaborate on the solution to the differentiability issue associated with the Monte-Carlo divergence estimation proposed in Theorem 2 of the paper. Firstly, we verify the validity of the Taylor-expansion-based argument of Theorem 2 for algorithms like total-variation denoising (TVD). Following that, we give a proof of the second part of Theorem 2, which deals with a weaker hypothesis of the problem (using tempered distributions).

I. TOTAL-VARIATION DENOISING: VERIFICATION OF TAYLOR EXPANSION-BASED HYPOTHESIS
In the paper, we considered the discrete-domain formulation of total-variation denoising (TVD), where we minimize the cost

J_TV(u) = ||y − u||^2 + λ TV{u},   (1)

where TV{u} = Σ_k √((D_h u)[k]^2 + (D_v u)[k]^2) is the discrete 2D total-variation norm, and D_h and D_v are matrices corresponding to the first-order finite differences in the horizontal and vertical directions, respectively. We will concentrate only on the bounded-optimization (half-quadratic) algorithm developed in [1]. However, since a typical implementation of TVD will always involve finite differences in place of continuous-domain derivatives, the analysis can be easily extended to other algorithms, including the variant in [1] and those based on the Euler-Lagrange equations. We show that the bounded-optimization algorithm for TVD admits first- and second-order derivatives with respect to the data y and therefore satisfies the stronger (Taylor-expansion-based) hypothesis of Theorem 2.

The TVD algorithm is described by the following recursive equation [1]: the signal estimate at iteration k + 1, denoted by the N × 1 vector f_λ^{k+1}, is obtained by solving the linear system

M^k f_λ^{k+1} = y,   (2)
where M^k is the N × N system matrix at iteration k, given by

M^k = I + D_h^T Λ^k D_h + D_v^T Λ^k D_v,   (3)

and Λ^k = diag{w_i^k ; i = 1, 2, …, N}, with

w_i^k = λ / ( 2 √((D_h f_λ^k)_i^2 + (D_v f_λ^k)_i^2 + κ) ),   (4)
where (D_* f_λ^k)_i is the i-th element of the vector D_* f_λ^k, and κ > 0 is a small constant that prevents the denominator of w_i^k from going to zero (otherwise the algorithm f_λ itself may become numerically unstable).

Email: [email protected], [email protected], [email protected].
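As an illustration, the recursion (2)-(4) can be sketched in a few lines of NumPy. This is a minimal sketch: the function names, the periodic boundary handling, and the iteration count are our own choices and are not fixed by the text.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def first_diffs(n):
    """Periodic first-order difference matrices D_h, D_v for an n-by-n image
    (the boundary treatment is an implementation choice, not fixed by the text)."""
    d = sp.diags([-1.0, 1.0], [0, 1], shape=(n, n)).tolil()
    d[n - 1, 0] = 1.0                      # wrap-around difference
    d = d.tocsr()
    I = sp.eye(n, format="csr")
    return sp.kron(I, d, format="csr"), sp.kron(d, I, format="csr")

def tvd_half_quadratic(y, lam, n_iter=30, kappa=1e-8):
    """Bounded-optimization (half-quadratic) TVD iteration, eqs. (2)-(4)."""
    n = int(round(np.sqrt(y.size)))
    Dh, Dv = first_diffs(n)
    N = y.size
    f = y.copy()                           # initial estimate f_lambda^0 = y
    for _ in range(n_iter):
        gh, gv = Dh @ f, Dv @ f
        w = lam / (2.0 * np.sqrt(gh**2 + gv**2 + kappa))   # eq. (4)
        L = sp.diags(w)                                    # Lambda^k
        M = sp.eye(N) + Dh.T @ L @ Dh + Dv.T @ L @ Dv      # eq. (3)
        f = spla.spsolve(M.tocsc(), y)                     # eq. (2)
    return f
```

Each iteration solves the sparse symmetric positive-definite system (2); in practice one would typically use a conjugate-gradient solver warm-started at f_λ^k rather than a direct solve.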
Differentiating (2) with respect to y, we obtain

(∂M^k/∂y) f_λ^{k+1} + M^k (∂f_λ^{k+1}/∂y) = I,   (5)

where J_{f_λ}^{k+1} = ∂f_λ^{k+1}/∂y is the Jacobian matrix of f_λ^{k+1} at iteration k + 1. If M_{mn}^k represents the mn-th element of M^k, and y_m and f_m^k represent the m-th elements of y and f_λ^k, respectively, then using Einstein's summation notation (repeated indices will be summed over unless they appear on both sides of an equation), (5) can be written as

(∂M_{mn}^k/∂y_p) f_n^{k+1} + M_{mn}^k (∂f_n^{k+1}/∂y_p) = δ_{mp},   (6)

where, for example, the index n is summed over in both terms on the L.H.S. of the above equation. Differentiating (6) a second time, we obtain

(∂^2 M_{mn}^k/∂y_l ∂y_p) f_n^{k+1} + (∂M_{mn}^k/∂y_p)(∂f_n^{k+1}/∂y_l) + (∂M_{mn}^k/∂y_l)(∂f_n^{k+1}/∂y_p) + M_{mn}^k (∂^2 f_n^{k+1}/∂y_l ∂y_p) = 0.   (7)

It is clear that the N × N × N tensor r = {∂^2 f_n^{k+1}/∂y_l ∂y_p}_{n,l,p=1}^N is the desired second derivative in the Taylor expansion in Theorem 2.
We will show that, for a given (l, p), all the terms in (7) are well-defined, so that the N × 1 vector r_{lp} = ∂^2 f_λ^{k+1}/∂y_l ∂y_p can be obtained by solving a linear system. Firstly, we analyze ∂M_{mn}^k/∂y_p, which is given by

∂M_{mn}^k/∂y_p = D_{hqm} (∂w_q^k/∂y_p) D_{hqn} + D_{vqm} (∂w_q^k/∂y_p) D_{vqn},   (8)

with

∂w_q^k/∂y_p = −(2/λ)^2 (w_q^k)^3 [ (D_h f_λ^k)_q D_{hqi} + (D_v f_λ^k)_q D_{vqi} ] ∂f_i^k/∂y_p,   (9)

where the index q is not summed over on the R.H.S. of the above equation. Similarly,

∂^2 M_{mn}^k/∂y_l ∂y_p = D_{hqm} (∂^2 w_q^k/∂y_l ∂y_p) D_{hqn} + D_{vqm} (∂^2 w_q^k/∂y_l ∂y_p) D_{vqn},   (10)

where

∂^2 w_q^k/∂y_l ∂y_p = 3 (w_q^k)^{−1} (∂w_q^k/∂y_l)(∂w_q^k/∂y_p)
   − (2/λ)^2 (w_q^k)^3 [ D_{hqi} D_{hqj} + D_{vqi} D_{vqj} ] (∂f_i^k/∂y_p)(∂f_j^k/∂y_l)
   − (2/λ)^2 (w_q^k)^3 [ (D_h f_λ^k)_q D_{hqi} + (D_v f_λ^k)_q D_{vqi} ] ∂^2 f_i^k/∂y_l ∂y_p.   (11)
The analysis is then simply as follows:

1) In principle, if we start with a well-defined initial estimate f_λ^0, the algorithm described by equations (2)-(4) is designed so that M^k and f_λ^{k+1} are well-defined for all k ≥ 1. Moreover, M^k is a full-rank matrix and therefore has a stable inverse (M^k)^{−1}.

2) It should be noted that all the elements of M^k are differentiable because (4) is a true function of y. Thus, all the derivatives involved in this analysis are in the true sense of differentiation and not in the weak sense of distributions.

3) Then, (8) and (9) ensure that ∂M_{mn}^k/∂y_p is well-defined ∀ m, n, p = 1, 2, …, N, provided Λ^k is well-conditioned, which is the case as long as w_q^k < +∞, ∀ q = 1, 2, …, N, and k ≥ 1. This can be ensured numerically. Therefore, for a fixed p, a well-defined N × 1 vector ∂f_λ^{k+1}/∂y_p is obtained from (6) as

∂f_λ^{k+1}/∂y_p = (M^k)^{−1} (e_p − S_p^k f_λ^{k+1}),   (12)

where S_p^k is an N × N matrix such that (S_p^k)_{mn} = ∂M_{mn}^k/∂y_p, and e_p is an N × 1 column vector whose elements are all zeros except the p-th one, which is unity.

4) While (10) and (11) ensure that ∂^2 M_{mn}^k/∂y_l ∂y_p is well-defined ∀ m, n, l, p = 1, 2, …, N, equations (8) and (12) ensure that the second and third terms in the L.H.S. of (7) are well-defined. Thus, we see that for a given (l, p), a well-defined N × 1 vector ∂^2 f_λ^{k+1}/∂y_l ∂y_p is obtained from (7) as

∂^2 f_λ^{k+1}/∂y_l ∂y_p = −(M^k)^{−1} ( P_{lp}^k f_λ^{k+1} + S_p^k ∂f_λ^{k+1}/∂y_l + S_l^k ∂f_λ^{k+1}/∂y_p ),   (13)

where P_{lp}^k is an N × N matrix such that (P_{lp}^k)_{mn} = ∂^2 M_{mn}^k/∂y_l ∂y_p.
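As a numerical sanity check of (12), the Jacobian recursion can be implemented alongside the iteration itself and compared against finite differences. The sketch below uses a 1D analogue (a single difference operator D, so the tensors stay small); the helper name and all parameter values are our own choices, not taken from the paper.

```python
import numpy as np

def tv1d_with_jacobian(y, lam, kappa=1e-2, n_iter=25):
    """1D half-quadratic TV iteration plus the exact Jacobian recursion
    obtained by differentiating M^k f^{k+1} = y; cf. eqs. (2), (9), (12)."""
    N = y.size
    D = np.diff(np.eye(N), axis=0)              # (N-1) x N first differences
    f = y.copy()
    J = np.eye(N)                               # Jacobian of f^0 = y
    for _ in range(n_iter):
        g = D @ f                               # (D f^k)
        w = lam / (2.0 * np.sqrt(g**2 + kappa))            # eq. (4), 1D
        M = np.eye(N) + D.T @ (w[:, None] * D)             # eq. (3), 1D
        f_new = np.linalg.solve(M, y)                      # eq. (2)
        # eq. (9), 1D: dw[q, p] = -(2/lam)^2 w_q^3 (D f^k)_q (D J^k)_{qp}
        dw = -(2.0 / lam) ** 2 * (w**3 * g)[:, None] * (D @ J)
        # column p of Sf is S_p^k f^{k+1}, where (S_p^k)_{mn} = dM_{mn}/dy_p
        Sf = D.T @ (dw * (D @ f_new)[:, None])
        J = np.linalg.solve(M, np.eye(N) - Sf)             # eq. (12)
        f = f_new
    return f, J
```

A central finite difference in a coordinate direction then reproduces the corresponding column of J to high accuracy, confirming that the iterates are differentiable in the ordinary (strong) sense.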
II. MONTE-CARLO DIVERGENCE ESTIMATION UNDER A WEAKER HYPOTHESIS
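The object studied in this section is the Monte-Carlo divergence estimate of the main paper: for a probe vector b′ and small ε, the quantity b′^T (f_λ(y + εb′) − f_λ(y))/ε estimates div_y{f_λ(y)} in expectation. A minimal sketch of the estimator (the function name and the choice of binary ±1 probes are illustrative, not prescribed here):

```python
import numpy as np

def mc_divergence(f, y, eps=1e-4, n_probe=100, seed=None):
    """Monte-Carlo divergence estimate:
    div_y f(y) ~ E_{b'}[ b'^T (f(y + eps*b') - f(y)) / eps ],
    using i.i.d. +/-1 probes (zero mean, unit variance)."""
    rng = np.random.default_rng(seed)
    fy = f(y)
    acc = 0.0
    for _ in range(n_probe):
        b = rng.choice([-1.0, 1.0], size=y.shape)
        acc += b @ (f(y + eps * b) - fy) / eps
    return acc / n_probe
```

For a linear map f(y) = A y, each probe returns b′^T A b′, whose expectation is trace(A) = div f; for a diagonal A, every probe gives the trace exactly.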
Here, we restate the second part of Theorem 2, which deals with the Monte-Carlo divergence estimation under the weaker hypothesis of tempered distributions, and then give a formal proof of this result.

Theorem 2: Let b′ be a zero-mean, unit-variance i.i.d. random vector. Assume that ∃ n_0 > 1 and C_0 > 0 such that

||f_λ(y)|| ≤ C_0 (1 + ||y||^{n_0}),   (14)

that is, f_λ is tempered. Then

div_y{f_λ(y)} = lim_{ε→0} E_{b′}{ b′^T (f_λ(y + εb′) − f_λ(y))/ε }   (15)

in the weak sense of tempered distributions.

Proof: Let ψ ∈ S be a rapidly decaying (test) function that is infinitely differentiable. We have to show that

⟨div{f_λ(y)}, ψ(y)⟩ = lim_{ε→0} ⟨ E_{b′}{ b′^T (f_λ(y + εb′) − f_λ(y))/ε }, ψ(y) ⟩.   (16)

We note that the L.H.S. of (16) can be expressed as (from the theory of distributions)

⟨div{f_λ(y)}, ψ(y)⟩ = −⟨f_λ(y), ∇ψ(y)⟩.   (17)

The R.H.S. of (16) involves the double integration

I_1(ε) = ∫_y ψ(y) ∫_{b′} b′^T (f_λ(y + εb′) − f_λ(y))/ε · q(b′) db′ dy,   (18)
where q(b′) represents the probability density function of b′. The order of the integration can be swapped as soon as (Fubini's theorem)

I_2(ε) = ∫_y ∫_{b′} |ψ(y)| | b′^T (f_λ(y + εb′) − f_λ(y))/ε | q(b′) db′ dy < +∞.   (19)

Using the triangle inequality, we bound I_2(ε) as

I_2(ε) ≤ ( J(ε) + J(0) ) / ε,   (20)
where

J(ε) = ∫_y ∫_{b′} |ψ(y)| | b′^T f_λ(y + εb′) | q(b′) db′ dy.   (21)

Using the Cauchy-Schwarz inequality and the fact that f_λ is tempered (cf. (14)), we get

J(ε) ≤ ∫_y ∫_{b′} |ψ(y)| ||b′|| C_0 (1 + ||y + εb′||^{n_0}) q(b′) db′ dy.   (22)
Using the convexity property of the function (·)^{n_0} for n_0 > 1, we get

J(ε) ≤ ∫_y ∫_{b′} |ψ(y)| ||b′|| C_0 (1 + 2^{n_0−1} ||y||^{n_0} + 2^{n_0−1} ||εb′||^{n_0}) q(b′) db′ dy
     = C_0 ∫_y |ψ(y)| dy ∫_{b′} ||b′|| q(b′) db′
     + C_0 2^{n_0−1} ∫_y |ψ(y)| ||y||^{n_0} dy ∫_{b′} ||b′|| q(b′) db′
     + C_0 2^{n_0−1} ε^{n_0} ∫_y |ψ(y)| dy ∫_{b′} ||b′||^{n_0+1} q(b′) db′
     < +∞,   (23)

under the hypothesis that E_{b′}{||b′||^{n_0}} < +∞, ∀ n_0 ≥ 1. The integrals involving ψ are also finite because ψ is rapidly decaying (it decays faster than any polynomial). Thus, J(ε) < ∞, ∀ ε ≥ 0. Hence, we interchange the integrals (with an appropriate change of variables) to get

I_1(ε) = ∫_{b′} q(b′) db′ ∫_y ψ(y) b′^T (f_λ(y + εb′) − f_λ(y))/ε dy
       = ∫_{b′} q(b′) db′ ∫_y b′^T f_λ(y) (ψ(y − εb′) − ψ(y))/ε dy.   (24)

Since ψ is infinitely differentiable, we apply Taylor's theorem [2] to ψ(y − εb′) and obtain

(ψ(y − εb′) − ψ(y))/ε = −∫_0^1 b′^T ∇ψ(y − t εb′) dt.   (25)
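Identity (25) is the fundamental theorem of calculus applied along the segment from y to y − εb′. It can be checked numerically for a concrete Schwartz function (we pick a Gaussian, our choice) using trapezoidal quadrature in t:

```python
import numpy as np

def psi(y):
    """Gaussian test function, a convenient element of S (our choice)."""
    return np.exp(-0.5 * np.dot(y, y))

def grad_psi(y):
    return -y * psi(y)

rng = np.random.default_rng(1)
y, b = rng.normal(size=3), rng.normal(size=3)
eps = 0.3

# L.H.S. of (25): (psi(y - eps*b) - psi(y)) / eps
lhs = (psi(y - eps * b) - psi(y)) / eps

# R.H.S. of (25): -int_0^1 b^T grad_psi(y - t*eps*b) dt, via trapezoid rule
t = np.linspace(0.0, 1.0, 2001)
vals = np.array([b @ grad_psi(y - ti * eps * b) for ti in t])
dt = t[1] - t[0]
rhs = -(dt * (0.5 * vals[0] + vals[1:-1].sum() + 0.5 * vals[-1]))
```

The two sides agree up to the quadrature error of the trapezoid rule.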
Therefore,

I_1(ε) = −∫_{b′} q(b′) db′ ∫_y b′^T f_λ(y) ∫_0^1 b′^T ∇ψ(y − t εb′) dt dy.   (26)
We want to let ε tend to 0 in the above expression. This is accomplished by the application of Lebesgue's dominated convergence theorem. But firstly, we must bound the integrand

z(y, b′, t, ε) = −q(b′) b′^T f_λ(y) b′^T ∇ψ(y − t εb′) β^0(t)   (27)

by an integrable function g(y, b′, t), where

β^0(t) = 1 if t ∈ (0, 1), and 0 otherwise.   (28)

To do that, we start with |z(y, b′, t, ε)| and apply the Cauchy-Schwarz inequality to obtain

0 ≤ |z(y, b′, t, ε)| ≤ q(b′) ||b′||^2 ||∇ψ(y − t εb′)|| β^0(t) · ||f_λ(y)||,   (29)

where the first four factors define g_0(y, b′, t, ε) := q(b′) ||b′||^2 ||∇ψ(y − t εb′)|| β^0(t).
Now, by using the convexity of (·)^{n_1} for n_1 ≥ 1, we have

1 + ||y||^{n_1} = 1 + ||y − εtb′ + εtb′||^{n_1} ≤ 1 + 2^{n_1−1} ||y − εtb′||^{n_1} + 2^{n_1−1} ||εtb′||^{n_1}.   (30)
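Inequality (30) combines the triangle inequality with the scalar convexity bound (a + b)^{n_1} ≤ 2^{n_1−1}(a^{n_1} + b^{n_1}) for a, b ≥ 0 and n_1 ≥ 1. A quick randomized check of the vector form (all names and parameters here are our own choices):

```python
import numpy as np

def convexity_bound_holds(u, v, n1):
    """Check ||u + v||^{n1} <= 2^{n1-1} (||u||^{n1} + ||v||^{n1}), n1 >= 1."""
    lhs = np.linalg.norm(u + v) ** n1
    rhs = 2.0 ** (n1 - 1.0) * (np.linalg.norm(u) ** n1
                               + np.linalg.norm(v) ** n1)
    return lhs <= rhs + 1e-12   # tiny slack for floating-point rounding

rng = np.random.default_rng(2)
checks = [convexity_bound_holds(rng.normal(size=4), rng.normal(size=4), n1)
          for n1 in (1.0, 2.0, 3.5, 6.0) for _ in range(200)]
```

In (30), u plays the role of y − εtb′ and v the role of εtb′.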
Then, for ε ≤ 1,

(1 + ||y||^{n_1}) g_0(y, b′, t, ε) ≤ (1 + 2^{n_1−1} ||y − εtb′||^{n_1}) ||∇ψ(y − t εb′)|| q(b′) ||b′||^2 β^0(t)
   + 2^{n_1−1} ||∇ψ(y − t εb′)|| q(b′) ||b′||^{n_1+2} t^{n_1} β^0(t)
   ≤ C_{ψ,1} q(b′) ||b′||^2 β^0(t) + C_{ψ,2} q(b′) ||b′||^{n_1+2} t^{n_1} β^0(t) =: g_1(b′, t),   (31)

where C_{ψ,1} = sup_y {(1 + 2^{n_1−1} ||y||^{n_1}) ||∇ψ(y)||} and C_{ψ,2} = 2^{n_1−1} sup_y {||∇ψ(y)||}. Since E_{b′}{||b′||^{n_1+2}} < +∞, it is clear that

∫_{b′} ∫_t g_1(b′, t) db′ dt < +∞.

Therefore, choosing n_1 > n_0 + N, where N is the dimension of y, and ∀ ε ≤ 1, we see that

|z(y, b′, t, ε)| ≤ g_0(y, b′, t, ε) ||f_λ(y)|| ≤ g_1(b′, t) ||f_λ(y)|| / (1 + ||y||^{n_1}) ≤ C_0 g_1(b′, t) (1 + ||y||^{n_0}) / (1 + ||y||^{n_1}) =: g(y, b′, t).   (32)
Then, we notice that

∫_y (1 + ||y||^{n_0}) / (1 + ||y||^{n_1}) dy
   = Σ_{k∈Z^N} ∫_{[0,1)^N} (1 + ||y + k||^{n_0}) / (1 + ||y + k||^{n_1}) dy
   = ∫_{[0,1)^N} ( Σ_{k∈Z^N} (1 + ||y + k||^{n_0}) / (1 + ||y + k||^{n_1}) ) dy   (Fubini)
   ≤ Σ_{k∈Z^N} (1 + ||1 + k||^{n_0}) / (1 + ||k||^{n_1})   (1 is a column vector of 1s)
   ≤ Σ_{k∈Z^N} (1 + 2^{n_0−1} N^{n_0/2} + 2^{n_0−1} ||k||^{n_0}) / (1 + ||k||^{n_1})   (using convexity of (·)^{n_0})
   ≤ (1 + 2^{n_0−1} N^{n_0/2} + 2^{n_0−1}) ( 1 + Σ_{k∈Z^N\{0}} 1/||k||^{n_1} + Σ_{k∈Z^N\{0}} 1/||k||^{n_1−n_0} )