Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms
Supplementary Material

Sathish Ramani*, Student Member, Thierry Blu, Senior Member, and Michael Unser, Fellow

This material supplements some sections of the paper entitled "Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms". Here, we elaborate on the solution to the differentiability issue associated with the Monte-Carlo divergence estimation proposed in Theorem 2 of the paper. Firstly, we verify the validity of the Taylor-expansion-based argument of Theorem 2 for algorithms like total-variation denoising (TVD). Following that, we give a proof of the second part of Theorem 2, which deals with a weaker hypothesis of the problem (using tempered distributions).

I. TOTAL-VARIATION DENOISING: VERIFICATION OF TAYLOR EXPANSION-BASED HYPOTHESIS
In the paper, we considered the discrete-domain formulation of total-variation denoising (TVD), where we minimize the cost

J_TV(u) = ||y − u||^2 + λ TV{u},   (1)

where TV{u} = Σ_k √((D_h u)[k]^2 + (D_v u)[k]^2) is the discrete 2D total-variation norm, and D_h and D_v are matrices corresponding to the first-order finite differences in the horizontal and vertical directions, respectively. We will concentrate only on the bounded-optimization (half-quadratic) algorithm developed in [1]. However, since a typical implementation of TVD will always involve finite differences in place of continuous-domain derivatives, the analysis can be easily extended to other algorithms, including the variant in [1] and those based on the Euler-Lagrange equations. We show that the bounded-optimization algorithm for TVD admits first- and second-order derivatives with respect to the data y and therefore satisfies the stronger (Taylor-expansion-based) hypothesis of Theorem 2.

The TVD algorithm is described by the following recursive equation [1]: the signal estimate at iteration k + 1, denoted by the N × 1 vector f_λ^{k+1}, is obtained by solving the linear system

M^k f_λ^{k+1} = y,   (2)
where M^k is the N × N system matrix at iteration k, given by

M^k = I + D_h^T Λ^k D_h + D_v^T Λ^k D_v,   (3)

and Λ^k = diag{w_i^k ; i = 1, 2, …, N}, with

w_i^k = λ / ( 2 √((D_h f_λ^k)_i^2 + (D_v f_λ^k)_i^2 + κ) ),   (4)
where (D_* f_λ^k)_i is the i-th element of the vector D_* f_λ^k, and κ > 0 is a small constant that prevents the denominator of w_i^k from going to zero (otherwise the algorithm f_λ itself may become numerically unstable).

Email: [email protected], [email protected], [email protected].
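As an illustration, the recursion (2)-(4) can be sketched in a few lines of NumPy. This is a minimal sketch: the function names, the periodic boundary handling, and the iteration count are our own choices and are not fixed by the text.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def first_diffs(n):
    """Periodic first-order difference matrices D_h, D_v for an n-by-n image
    (the boundary treatment is an implementation choice, not fixed by the text)."""
    d = sp.diags([-1.0, 1.0], [0, 1], shape=(n, n)).tolil()
    d[n - 1, 0] = 1.0                      # wrap-around difference
    d = d.tocsr()
    I = sp.eye(n, format="csr")
    return sp.kron(I, d, format="csr"), sp.kron(d, I, format="csr")

def tvd_half_quadratic(y, lam, n_iter=30, kappa=1e-8):
    """Bounded-optimization (half-quadratic) TVD iteration, eqs. (2)-(4)."""
    n = int(round(np.sqrt(y.size)))
    Dh, Dv = first_diffs(n)
    N = y.size
    f = y.copy()                           # initial estimate f_lambda^0 = y
    for _ in range(n_iter):
        gh, gv = Dh @ f, Dv @ f
        w = lam / (2.0 * np.sqrt(gh**2 + gv**2 + kappa))   # eq. (4)
        L = sp.diags(w)                                    # Lambda^k
        M = sp.eye(N) + Dh.T @ L @ Dh + Dv.T @ L @ Dv      # eq. (3)
        f = spla.spsolve(M.tocsc(), y)                     # eq. (2)
    return f
```

Each iteration solves the sparse symmetric positive-definite system (2); in practice one would typically use a conjugate-gradient solver warm-started at f_λ^k rather than a direct solve.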
Differentiating (2) with respect to y, we obtain

(∂M^k/∂y) f_λ^{k+1} + M^k (∂f_λ^{k+1}/∂y) = I,   (5)

where J_{f_λ}^{k+1} = ∂f_λ^{k+1}/∂y is the Jacobian matrix of f_λ^{k+1} at iteration k + 1. If M_{mn}^k represents the mn-th element of M^k, and y_m and f_m^k represent the m-th elements of y and f_λ^k, respectively, then using Einstein's summation notation (repeated indices will be summed over unless they appear on both sides of an equation), (5) can be written as

(∂M_{mn}^k/∂y_p) f_n^{k+1} + M_{mn}^k (∂f_n^{k+1}/∂y_p) = δ_{mp},   (6)

where, for example, the index n is summed over in both terms on the L.H.S. of the above equation. Differentiating (6) a second time, we obtain

(∂^2 M_{mn}^k/∂y_l ∂y_p) f_n^{k+1} + (∂M_{mn}^k/∂y_p)(∂f_n^{k+1}/∂y_l) + (∂M_{mn}^k/∂y_l)(∂f_n^{k+1}/∂y_p) + M_{mn}^k (∂^2 f_n^{k+1}/∂y_l ∂y_p) = 0.   (7)

It is clear that the N × N × N tensor r = {∂^2 f_n^{k+1}/∂y_l ∂y_p}_{n,l,p=1}^N is the desired second derivative in the Taylor expansion in Theorem 2.
We will show that, for a given (l, p), all the terms in (7) are well-defined, so that the N × 1 vector r_{lp} = ∂^2 f_λ^{k+1}/∂y_l ∂y_p can be obtained by solving a linear system. Firstly, we analyze ∂M_{mn}^k/∂y_p, which is given by

∂M_{mn}^k/∂y_p = D_{hqm} (∂w_q^k/∂y_p) D_{hqn} + D_{vqm} (∂w_q^k/∂y_p) D_{vqn},   (8)

with

∂w_q^k/∂y_p = −(2/λ)^2 (w_q^k)^3 [ (D_h f_λ^k)_q D_{hqi} + (D_v f_λ^k)_q D_{vqi} ] ∂f_i^k/∂y_p,   (9)

where the index q is not summed over on the R.H.S. of the above equation. Similarly,

∂^2 M_{mn}^k/∂y_l ∂y_p = D_{hqm} (∂^2 w_q^k/∂y_l ∂y_p) D_{hqn} + D_{vqm} (∂^2 w_q^k/∂y_l ∂y_p) D_{vqn},   (10)

where

∂^2 w_q^k/∂y_l ∂y_p = 3 (w_q^k)^{−1} (∂w_q^k/∂y_l)(∂w_q^k/∂y_p)
   − (2/λ)^2 (w_q^k)^3 [ D_{hqi} D_{hqj} + D_{vqi} D_{vqj} ] (∂f_i^k/∂y_p)(∂f_j^k/∂y_l)
   − (2/λ)^2 (w_q^k)^3 [ (D_h f_λ^k)_q D_{hqi} + (D_v f_λ^k)_q D_{vqi} ] ∂^2 f_i^k/∂y_l ∂y_p.   (11)
The analysis is then simply as follows:

1) In principle, if we start with a well-defined initial estimate f_λ^0, the algorithm described by equations (2)-(4) is designed so that M^k and f_λ^{k+1} are well-defined for all k ≥ 1. Moreover, M^k is a full-rank matrix and therefore has a stable inverse (M^k)^{−1}.

2) It should be noted that all the elements of M^k are differentiable because (4) is a true function of y. Thus, all the derivatives involved in this analysis are in the true sense of differentiation and not in the weak sense of distributions.

3) Then, (8) and (9) ensure that ∂M_{mn}^k/∂y_p is well-defined ∀ m, n, p = 1, 2, …, N, provided Λ^k is well-conditioned, which is the case as long as w_q^k < +∞, ∀ q = 1, 2, …, N, and k ≥ 1. This can be ensured numerically. Therefore, for a fixed p, a well-defined N × 1 vector ∂f_λ^{k+1}/∂y_p is obtained from (6) as

∂f_λ^{k+1}/∂y_p = (M^k)^{−1} (e_p − S_p^k f_λ^{k+1}),   (12)

where S_p^k is an N × N matrix such that (S_p^k)_{mn} = ∂M_{mn}^k/∂y_p, and e_p is an N × 1 column vector whose elements are all zeros except the p-th one, which is unity.

4) While (10) and (11) ensure that ∂^2 M_{mn}^k/∂y_l ∂y_p is well-defined ∀ m, n, l, p = 1, 2, …, N, equations (8) and (12) ensure that the second and third terms in the L.H.S. of (7) are well-defined. Thus, we see that for a given (l, p), a well-defined N × 1 vector ∂^2 f_λ^{k+1}/∂y_l ∂y_p is obtained from (7) as

∂^2 f_λ^{k+1}/∂y_l ∂y_p = −(M^k)^{−1} ( P_{lp}^k f_λ^{k+1} + S_p^k ∂f_λ^{k+1}/∂y_l + S_l^k ∂f_λ^{k+1}/∂y_p ),   (13)

where P_{lp}^k is an N × N matrix such that (P_{lp}^k)_{mn} = ∂^2 M_{mn}^k/∂y_l ∂y_p.
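As a numerical sanity check of (12), the Jacobian recursion can be implemented alongside the iteration itself and compared against finite differences. The sketch below uses a 1D analogue (a single difference operator D, so the tensors stay small); the helper name and all parameter values are our own choices, not taken from the paper.

```python
import numpy as np

def tv1d_with_jacobian(y, lam, kappa=1e-2, n_iter=25):
    """1D half-quadratic TV iteration plus the exact Jacobian recursion
    obtained by differentiating M^k f^{k+1} = y; cf. eqs. (2), (9), (12)."""
    N = y.size
    D = np.diff(np.eye(N), axis=0)              # (N-1) x N first differences
    f = y.copy()
    J = np.eye(N)                               # Jacobian of f^0 = y
    for _ in range(n_iter):
        g = D @ f                               # (D f^k)
        w = lam / (2.0 * np.sqrt(g**2 + kappa))            # eq. (4), 1D
        M = np.eye(N) + D.T @ (w[:, None] * D)             # eq. (3), 1D
        f_new = np.linalg.solve(M, y)                      # eq. (2)
        # eq. (9), 1D: dw[q, p] = -(2/lam)^2 w_q^3 (D f^k)_q (D J^k)_{qp}
        dw = -(2.0 / lam) ** 2 * (w**3 * g)[:, None] * (D @ J)
        # column p of Sf is S_p^k f^{k+1}, where (S_p^k)_{mn} = dM_{mn}/dy_p
        Sf = D.T @ (dw * (D @ f_new)[:, None])
        J = np.linalg.solve(M, np.eye(N) - Sf)             # eq. (12)
        f = f_new
    return f, J
```

A central finite difference in a coordinate direction then reproduces the corresponding column of J to high accuracy, confirming that the iterates are differentiable in the ordinary (strong) sense.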
II. MONTE-CARLO DIVERGENCE ESTIMATION UNDER A WEAKER HYPOTHESIS
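The object studied in this section is the Monte-Carlo divergence estimate of the main paper: for a probe vector b′ and small ε, the quantity b′^T (f_λ(y + εb′) − f_λ(y))/ε estimates div_y{f_λ(y)} in expectation. A minimal sketch of the estimator (the function name and the choice of binary ±1 probes are illustrative, not prescribed here):

```python
import numpy as np

def mc_divergence(f, y, eps=1e-4, n_probe=100, seed=None):
    """Monte-Carlo divergence estimate:
    div_y f(y) ~ E_{b'}[ b'^T (f(y + eps*b') - f(y)) / eps ],
    using i.i.d. +/-1 probes (zero mean, unit variance)."""
    rng = np.random.default_rng(seed)
    fy = f(y)
    acc = 0.0
    for _ in range(n_probe):
        b = rng.choice([-1.0, 1.0], size=y.shape)
        acc += b @ (f(y + eps * b) - fy) / eps
    return acc / n_probe
```

For a linear map f(y) = A y, each probe returns b′^T A b′, whose expectation is trace(A) = div f; for a diagonal A, every probe gives the trace exactly.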
Here, we restate the second part of Theorem 2, which deals with the Monte-Carlo divergence estimation under the weaker hypothesis of tempered distributions, and then give a formal proof of this result.

Theorem 2: Let b′ be a zero-mean, unit-variance i.i.d. random vector. Assume that ∃ n_0 > 1 and C_0 > 0 such that

||f_λ(y)|| ≤ C_0 (1 + ||y||^{n_0}),   (14)

that is, f_λ is tempered. Then

div_y{f_λ(y)} = lim_{ε→0} E_{b′}{ b′^T (f_λ(y + εb′) − f_λ(y))/ε }   (15)

in the weak sense of tempered distributions.

Proof: Let ψ ∈ S be a rapidly decaying (test) function that is infinitely differentiable. We have to show that

⟨div{f_λ(y)}, ψ(y)⟩ = lim_{ε→0} ⟨ E_{b′}{ b′^T (f_λ(y + εb′) − f_λ(y))/ε }, ψ(y) ⟩.   (16)

We note that the L.H.S. of (16) can be expressed as (from the theory of distributions)

⟨div{f_λ(y)}, ψ(y)⟩ = −⟨f_λ(y), ∇ψ(y)⟩.   (17)

The R.H.S. of (16) involves the double integration

I_1(ε) = ∫_y ψ(y) ∫_{b′} b′^T (f_λ(y + εb′) − f_λ(y))/ε · q(b′) db′ dy,   (18)
where q(b′) represents the probability density function of b′. The order of the integration can be swapped as soon as (Fubini's theorem)

I_2(ε) = ∫_y ∫_{b′} |ψ(y)| | b′^T (f_λ(y + εb′) − f_λ(y))/ε | q(b′) db′ dy < +∞.   (19)

Using the triangle inequality, we bound I_2(ε) as

I_2(ε) ≤ ( J(ε) + J(0) ) / ε,   (20)
where

J(ε) = ∫_y ∫_{b′} |ψ(y)| | b′^T f_λ(y + εb′) | q(b′) db′ dy.   (21)

Using the Cauchy-Schwarz inequality and the fact that f_λ is tempered (cf. (14)), we get

J(ε) ≤ ∫_y ∫_{b′} |ψ(y)| ||b′|| C_0 (1 + ||y + εb′||^{n_0}) q(b′) db′ dy.   (22)
Using the convexity property of the function (·)^{n_0} for n_0 > 1, we get

J(ε) ≤ ∫_y ∫_{b′} |ψ(y)| ||b′|| C_0 (1 + 2^{n_0−1} ||y||^{n_0} + 2^{n_0−1} ||εb′||^{n_0}) q(b′) db′ dy
     = C_0 ∫_y |ψ(y)| dy ∫_{b′} ||b′|| q(b′) db′
     + C_0 2^{n_0−1} ∫_y |ψ(y)| ||y||^{n_0} dy ∫_{b′} ||b′|| q(b′) db′
     + C_0 2^{n_0−1} ε^{n_0} ∫_y |ψ(y)| dy ∫_{b′} ||b′||^{n_0+1} q(b′) db′
     < +∞,   (23)

under the hypothesis that E_{b′}{||b′||^{n_0}} < +∞, ∀ n_0 ≥ 1. The integrals involving ψ are also finite because ψ is rapidly decaying (it decays faster than any polynomial). Thus, J(ε) < ∞, ∀ ε ≥ 0. Hence, we interchange the integrals (with an appropriate change of variables) to get

I_1(ε) = ∫_{b′} q(b′) db′ ∫_y ψ(y) b′^T (f_λ(y + εb′) − f_λ(y))/ε dy
       = ∫_{b′} q(b′) db′ ∫_y b′^T f_λ(y) (ψ(y − εb′) − ψ(y))/ε dy.   (24)

Since ψ is infinitely differentiable, we apply Taylor's theorem [2] to ψ(y − εb′) and obtain

(ψ(y − εb′) − ψ(y))/ε = −∫_0^1 b′^T ∇ψ(y − t εb′) dt.   (25)
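Identity (25) is the fundamental theorem of calculus applied along the segment from y to y − εb′. It can be checked numerically for a concrete Schwartz function (we pick a Gaussian, our choice) using trapezoidal quadrature in t:

```python
import numpy as np

def psi(y):
    """Gaussian test function, a convenient element of S (our choice)."""
    return np.exp(-0.5 * np.dot(y, y))

def grad_psi(y):
    return -y * psi(y)

rng = np.random.default_rng(1)
y, b = rng.normal(size=3), rng.normal(size=3)
eps = 0.3

# L.H.S. of (25): (psi(y - eps*b) - psi(y)) / eps
lhs = (psi(y - eps * b) - psi(y)) / eps

# R.H.S. of (25): -int_0^1 b^T grad_psi(y - t*eps*b) dt, via trapezoid rule
t = np.linspace(0.0, 1.0, 2001)
vals = np.array([b @ grad_psi(y - ti * eps * b) for ti in t])
dt = t[1] - t[0]
rhs = -(dt * (0.5 * vals[0] + vals[1:-1].sum() + 0.5 * vals[-1]))
```

The two sides agree up to the quadrature error of the trapezoid rule.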
Therefore,

I_1(ε) = −∫_{b′} q(b′) db′ ∫_y b′^T f_λ(y) ∫_0^1 b′^T ∇ψ(y − t εb′) dt dy.   (26)
We want to let ε tend to 0 in the above expression. This is accomplished by the application of Lebesgue's dominated convergence theorem. But firstly, we must bound the integrand

z(y, b′, t, ε) = −q(b′) b′^T f_λ(y) b′^T ∇ψ(y − t εb′) β^0(t)   (27)

by an integrable function g(y, b′, t), where

β^0(t) = 1 if t ∈ (0, 1), and 0 otherwise.   (28)

To do that, we start with |z(y, b′, t, ε)| and apply the Cauchy-Schwarz inequality to obtain

0 ≤ |z(y, b′, t, ε)| ≤ q(b′) ||b′||^2 ||∇ψ(y − t εb′)|| β^0(t) · ||f_λ(y)||,   (29)

where the first four factors define g_0(y, b′, t, ε) := q(b′) ||b′||^2 ||∇ψ(y − t εb′)|| β^0(t).
Now, by using the convexity of (·)^{n_1} for n_1 ≥ 1, we have

1 + ||y||^{n_1} = 1 + ||y − εtb′ + εtb′||^{n_1} ≤ 1 + 2^{n_1−1} ||y − εtb′||^{n_1} + 2^{n_1−1} ||εtb′||^{n_1}.   (30)
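Inequality (30) combines the triangle inequality with the scalar convexity bound (a + b)^{n_1} ≤ 2^{n_1−1}(a^{n_1} + b^{n_1}) for a, b ≥ 0 and n_1 ≥ 1. A quick randomized check of the vector form (all names and parameters here are our own choices):

```python
import numpy as np

def convexity_bound_holds(u, v, n1):
    """Check ||u + v||^{n1} <= 2^{n1-1} (||u||^{n1} + ||v||^{n1}), n1 >= 1."""
    lhs = np.linalg.norm(u + v) ** n1
    rhs = 2.0 ** (n1 - 1.0) * (np.linalg.norm(u) ** n1
                               + np.linalg.norm(v) ** n1)
    return lhs <= rhs + 1e-12   # tiny slack for floating-point rounding

rng = np.random.default_rng(2)
checks = [convexity_bound_holds(rng.normal(size=4), rng.normal(size=4), n1)
          for n1 in (1.0, 2.0, 3.5, 6.0) for _ in range(200)]
```

In (30), u plays the role of y − εtb′ and v the role of εtb′.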
Then, for ε ≤ 1,

(1 + ||y||^{n_1}) g_0(y, b′, t, ε) ≤ (1 + 2^{n_1−1} ||y − εtb′||^{n_1}) ||∇ψ(y − t εb′)|| q(b′) ||b′||^2 β^0(t)
   + 2^{n_1−1} ||∇ψ(y − t εb′)|| q(b′) ||b′||^{n_1+2} t^{n_1} β^0(t)
   ≤ C_{ψ,1} q(b′) ||b′||^2 β^0(t) + C_{ψ,2} q(b′) ||b′||^{n_1+2} t^{n_1} β^0(t) =: g_1(b′, t),   (31)

where C_{ψ,1} = sup_y {(1 + 2^{n_1−1} ||y||^{n_1}) ||∇ψ(y)||} and C_{ψ,2} = 2^{n_1−1} sup_y {||∇ψ(y)||}. Since E_{b′}{||b′||^{n_1+2}} < +∞, it is clear that

∫_{b′} ∫_t g_1(b′, t) db′ dt < +∞.

Therefore, choosing n_1 > n_0 + N, where N is the dimension of y, and ∀ ε ≤ 1, we see that

|z(y, b′, t, ε)| ≤ g_0(y, b′, t, ε) ||f_λ(y)|| ≤ g_1(b′, t) ||f_λ(y)|| / (1 + ||y||^{n_1}) ≤ C_0 g_1(b′, t) (1 + ||y||^{n_0}) / (1 + ||y||^{n_1}) =: g(y, b′, t).   (32)
Then, we notice that

∫_y (1 + ||y||^{n_0}) / (1 + ||y||^{n_1}) dy
   = Σ_{k∈Z^N} ∫_{[0,1)^N} (1 + ||y + k||^{n_0}) / (1 + ||y + k||^{n_1}) dy
   = ∫_{[0,1)^N} ( Σ_{k∈Z^N} (1 + ||y + k||^{n_0}) / (1 + ||y + k||^{n_1}) ) dy   (Fubini)
   ≤ Σ_{k∈Z^N} (1 + ||1 + k||^{n_0}) / (1 + ||k||^{n_1})   (1 is a column vector of 1s)
   ≤ Σ_{k∈Z^N} (1 + 2^{n_0−1} N^{n_0/2} + 2^{n_0−1} ||k||^{n_0}) / (1 + ||k||^{n_1})   (using convexity of (·)^{n_0})
   ≤ (1 + 2^{n_0−1} N^{n_0/2} + 2^{n_0−1}) ( 1 + Σ_{k∈Z^N\{0}} 1/||k||^{n_1} + Σ_{k∈Z^N\{0}} 1/||k||^{n_1−n_0} )