A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients Arnulf Jentzen1 , Diyora Salimova2 , and Timo Welti3 1 Seminar
for Applied Mathematics, Department of Mathematics,
ETH Zurich, Switzerland, e-mail:
[email protected] 2 Seminar
for Applied Mathematics, Department of Mathematics,
ETH Zurich, Switzerland, e-mail:
[email protected] 3 Seminar
for Applied Mathematics, Department of Mathematics,
ETH Zurich, Switzerland, e-mail:
[email protected]
September 19, 2018 Abstract In recent years deep artificial neural networks (DNNs) have very successfully been employed in numerical simulations for a multitude of computational problems including, for example, object and face recognition, natural language processing, fraud detection, computational advertisement, and numerical approximations of partial differential equations (PDEs). Such numerical simulations indicate that DNNs seem to admit the fundamental flexibility to overcome the curse of dimensionality in the sense that the number of real parameters used to describe the DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy ε > 0 and the dimension d ∈ N of the function which the DNN aims to approximate in such computational problems. There is also a large number of rigorous mathematical approximation results for artificial neural networks in the scientific literature but there are only a few special situations where results in the literature can rigorously explain the success of DNNs when approximating high-dimensional functions. The key contribution of this article is to reveal that DNNs do overcome the curse of dimensionality in the numerical approximation of Kolmogorov PDEs with constant diffusion and nonlinear drift coefficients. We prove that the number of parameters used to describe the employed DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy ε > 0 and the PDE dimension d ∈ N. A crucial ingredient
1
in our proof is the fact that the artificial neural network used to approximate the solution of the PDE is indeed a deep artificial neural network with a large number of hidden layers.
Contents 1 Introduction
2
2 Existence of a realization on a suitable artificial probability space 2.1 Markov-type estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Existence of a realization with the desired approximation properties .
6 6 7
3 The Feynman-Kac formula revisited
8
4 Stochastic differential equations (SDEs) 4.1 A priori bounds for SDEs . . . . . . . . 4.2 A priori bounds for Brownian motions . 4.3 Strong perturbations of SDEs . . . . . . 4.4 Weak perturbations of SDEs . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
9 . 9 . 10 . 10 . 13
5 Deep artificial neural network (DNN) calculus 20 5.1 Sums of DNNs with the same architecture . . . . . . . . . . . . . . . 22 5.2 Compositions of DNNs involving artificial identities . . . . . . . . . . 24 5.3 Representations of the d-dimensional identities . . . . . . . . . . . . . 29 6 DNN approximations for partial differential equations (PDEs) 30 6.1 DNN approximations with general activation functions . . . . . . . . 31 6.2 Rectified DNN approximations . . . . . . . . . . . . . . . . . . . . . . 39 6.3 Rectified DNN approximations on the d-dimensional unit cube . . . . 41
1
Introduction
In recent years deep artificial neural networks (DNNs) have very successfully been employed in numerical simulations for a multitude of computational problems including, for example, object and face recognition (cf., e.g., [37, 41, 60, 62, 64] and the references mentioned therein), natural language processing (cf., e.g., [15, 25, 31, 36, 39, 65] and the references mentioned therein), fraud detection (cf., e.g., [12, 56] and the references mentioned therein), computational advertisement (cf., e.g., [63, 68] and the references mentioned therein), and numerical approximations of partial differential equations (PDEs) (cf., e.g., [4, 5, 6, 17, 19, 20, 23, 26, 28, 30, 40, 46, 48, 55, 61]). Such numerical simulations indicate that DNNs seem to admit the fundamental flexibility to overcome the curse of dimensionality in the sense that the number of real parameters used to describe the DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy ε > 0 and the dimension d ∈ N of the function which the DNN aims to approximate in such computational problems. There is also a large number of rigorous mathematical approximation 2
results for artificial neural networks in the scientific literature (see, for instance, [1, 2, 3, 7, 8, 9, 10, 11, 13, 14, 16, 18, 20, 21, 22, 24, 26, 29, 32, 33, 34, 35, 42, 43, 44, 45, 47, 49, 50, 51, 52, 53, 54, 57, 58, 59, 61, 66, 67] and the references mentioned therein) but there are only a few special situations where results in the literature can rigorously explain the success of DNNs when approximating high-dimensional functions. The key contribution of this article is to reveal that DNNs do overcome the curse of dimensionality in the numerical approximation of Kolmogorov PDEs with constant diffusion and nonlinear drift coefficients. More specifically, the main result of this article, Theorem 6.3 in Subsection 6.2 below, proves that the number of parameters used to describe the employed DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy ε > 0 and the PDE dimension d ∈ N and, thereby, we establish that DNN approximations do indeed overcome the curse of dimensionality in the numerical approximation of such PDEs. To illustrate the statement of Theorem 6.3 below in more details, we now present the following special case of Theorem 6.3. Theorem 1.1. Let Ad = (ad,i,j )(i,j)∈{1,...,d}2 ∈ Rd×d , d ∈ N, be symmetric positive semidefinite matrices, for every d ∈ N let k·kRd : Rd → [0, ∞) be the d-dimensional Euclidean norm, let f0,d : Rd → R, d ∈ N, and f1,d : Rd → Rd , d ∈ N, be functions, let Ad : Rd → Rd , d ∈ N, be the functions which satisfy for all d ∈ N, x = (x1 , . . . , xd ) ∈ Rd that Ad (x) = (max{x1 , 0}, . . . , max{xd , 0}), let N = ∪L∈{2,3,4,... } ∪(l0 ,l1 ,...,lL )∈NL+1 (×Ln=1 (Rln ×ln−1 × Rln )),
(1)
let P : N → N and R : N → ∪k,l∈N C(Rk , Rl ) be the functions which satisfy for all L ∈ {2, 3, 4, . . . }, l0 , l1 , . . . , lL ∈ N, Φ = ((W1 , B1 ), . . . , (WL , BL )) ∈ (×Ln=1 (Rln ×ln−1 × l0 Rln )), x0 ∈ RP , . . ., xL−1 ∈ RlL−1 with ∀ n ∈ N ∩ [1, L) : xn = Aln (Wn xn−1 + Bn ) that P(Φ) = Ln=1 ln (ln−1 + 1), R(Φ) ∈ C(Rl0 , RlL ), and (RΦ)(x0 ) = WL xL−1 + BL ,
(2)
let T, κ ∈ (0, ∞), (φm,d ε )(m,d,ε)∈{0,1}×N×(0,1] ⊆ N , assume for all d ∈ N, ε ∈ (0, 1], Pd d 0,d d d x, y ∈ R that R(φε ) ∈ C(Rd , R), R(φ1,d ε ) ∈ C(R , R ), |f0,d (x)| + i,j=1 |ad,i,j | ≤ κ κ 1,d κd (1 + kxkRd ), kf1,d (x) − f1,d (y)kRd ≤ κkx − ykRd , k(Rφε )(x)kRd ≤ κ(dκ + kxkRd ), P1 κ κ κ m,d κ −κ 0,d , |(Rφ0,d ε )(x) − (Rφε )(y)| ≤ κd (1 + kxkRd + kykRd )kx − m=0 P(φε ) ≤ κd ε ykRd , and 1,d κ κ |f0,d (x) − (Rφ0,d ε )(x)| + kf1,d (x) − (Rφε )(x)kRd ≤ εκd (1 + kxkRd ),
(3)
and for every d ∈ N let ud : [0, T ] × Rd → R be an at most polynomially growing viscosity solution of ∂ ∂ ud )(t, x) = ( ∂x ud )(t, x) f1,d (x) + ( ∂t
d P i,j=1
2
ad,i,j ( ∂x∂i ∂xj ud )(t, x)
(4)
with ud (0, x) = f0,d (x) for (t, x) ∈ (0, T ) × Rd . Then for every p ∈ (0, ∞) there exist (ψd,ε )(d,ε)∈N×(0,1] ⊆ N , c ∈ R such that for all d ∈ N, ε ∈ (0, 1] it holds that P(ψd,ε ) ≤ c dc ε−c , R(ψd,ε ) ∈ C(Rd , R), and Z 1/p p |ud (T, x) − (Rψd,ε )(x)| dx ≤ ε. (5) [0,1]d
3
Theorem 1.1 is an in immediate consequence from Corollary 6.4 in Subsection 6.3 below. Corollary 6.4, in turn, is a special case of Theorem 6.3. Next we add some comments regarding the mathematical objects appearing in Theorem 1.1. Theorem 1.1 is an approximation result for rectified DNNs and for every d ∈ N the function Ad : Rd → Rd in Theorem 1.1 above describes the d-dimensional rectifier function. The set N in (1) in Theorem 1.1 above is a set of tuples of real numbers which, in turn, represents the set of all artificial neural networks. For every artificial neural network Φ ∈ N in Theorem 1.1 above we have that R(Φ) ∈ ∪k,l∈N C(Rk , Rl ) represents the function associated to the artificial neural network Φ (cf. (2) in Theorem 1.1). The function R : N → ∪k,l∈N C(Rk , Rl ) from the set N of all artificial neural networks to the union ∪k,l∈N C(Rk , Rl ) of continuous functions thus describes the realizations associated to the artificial neural networks. Moreover, for every artificial neural network Φ ∈ N in Theorem 1.1 above we have that P(Φ) ∈ N represents the number of real parameters which are used to describe the artificial neural network Φ. In particular, for every artificial neural network Φ ∈ N in Theorem 1.1 we can think of P(Φ) ∈ N as a quantity related to the amount of memory storage which is needed to store the artificial neural network. The real number κ > 0 in Theorem 1.1 is an arbitrary constant used to formulate the hypotheses in Theorem 1.1 (cf. (3) in Theorem 1.1 above) and the real number T > 0 in Theorem 1.1 describes the time horizon under consideration. Our key hypothesis in Theorem 1.1 is the assumption that both the possibly nonlinear initial value functions f0,d : Rd → R, d ∈ N, and the possibly nonlinear drift coefficient functions f1,d : Rd → Rd , d ∈ N, of the PDEs in (4) can be approximated without the curse of dimensionality by means of DNNs (see (3) above for details). Simple examples for the functions f0,d : Rd → R, d ∈ N, and f1,d : Rd → Rd , d ∈ N, which fulfill the hypotheses of Theorem 1.1 above are, for instance, provided by the choice that for all d ∈ N, x = (x1 , x2 , . . . , xd ) ∈ Rd it holds that f0,d (x) = max{x1 , x2 , . . . , xd } and f1,d (x) = x [1 + kxk2Rd ]−1 . A natural example for the matrices Ad = (ad,i,j )(i,j)∈{1,...,d}2 ∈ Rd×d , d ∈ N, fulfilling the hypotheses in Theorem 1.1 above is, for instance, provided by the choice that for all d ∈ N it holds that Ad ∈ Rd×d is the d-dimensional identity matrix in which case the second order term in (4) reduces to the d-dimensional Laplace operator. Roughly speaking, Theorem 1.1 above proves that if both the initial value functions and the drift coefficient functions in the PDEs in (4) can be approximated without the curse of dimensionality by means of DNNs, then the solutions of the PDEs can also be approximated without the curse of dimensionality by means of DNNs (see (5) above for details). In numerical simulations involving DNNs for computational problems from data science (e.g., object and face recognition, natural language processesing, fraud detection, computational advertisement, etc.) it is often not entirely clear how to precisely describe what the involved DNN approximations should achieve and it is thereby often not entirely clear how to precisely specify the approximation error of the employed DNN. 
The recent articles [17, 28] (cf., e.g., also [4, 5, 6, 19, 20, 23, 26, 30, 40, 46, 48, 55, 61]) suggest to use machine learning algorithms which employ DNNs to approximate solutions and derivatives of solutions, respectively, of PDEs and in the framework of these references it is perfectly clear what the involved DNN approximations should achieve as well as how to specify the approximation error: the DNN should approximate the unique deterministic 4
function which is the solution of the given deterministic PDE (cf., e.g., Han et al. [28, Neural Network Architecture on page 5] and Beck et al. [4, Proposition 2.7 and (103)]. The above named references thereby open up the possibility for a complete and rigorous mathematical error analysis for the involved deep learning algorithms and Theorems 1.1 and 6.3, in particular, provide some first contributions to this new research topic. The statements of Theorems 1.1 and 6.3 and their strategies of proof, respectively, are inspired by the article Grohs et al. [26] (cf., e.g., Theorem 1.1 in [26]) in which similar results as Theorems 1.1 and 6.3, respectively, but for Kolmogorov PDEs with affine linear drift and diffusion coefficient functions have been proved. The main difference of the arguments in [26] to this paper is the deepness of the involved artificial neural networks. Roughly speaking, the affine linear structure of the coefficients of the Kolmogorov PDEs in [26] allowed the authors in [26] to essentially employ a flat artificial neural network for approximating the solution flow mapping of such PDEs. In this work the drift coefficient is nonlinear and, in view of this property, we employ in our proofs of Theorem 1.1 and Theorem 6.3, respectively, iterative Euler-type discretizations for the underlying stochastic dynamics associated to the PDEs in (4). The iterative Euler-type discretizations result in multiple compositions which, in turn, result in deep artificial neural networks with a large number of hidden layers. In particular, in our proof of Theorem 1.1 and Theorem 6.3, respectively, the artificial neural networks ψd,ε ∈ N , d ∈ N, ε ∈ (0, 1], approximating the solutions of the PDEs in (4) (see (5) above) are also deep artificial neural networks with a large number of hidden layers even if the artificial neural networks approximating or representing f0,d : Rd → R, d ∈ N, and f1,d : Rd → Rd , d ∈ N, are flat with one hidden layer only. Moreover, our proofs of Theorem 1.1 and Theorem 6.3, respectively, reveal that the number of hidden layers increases to infinity as the prescribed approximation accuracy ε > 0 decreases to zero and the PDE dimension increases to infinity, respectively (cf. (149) and (168) below). Theorem 1.1 above and Theorem 6.3, respectively, are purely deterministic approximation results for DNNs and solutions of a class of deterministic PDEs. Our proofs of Theorem 1.1 and Theorem 6.3, respectively, are, however, heavily relying on probabilistic arguments on a suitable artificial probability space. Roughly speaking, in our proof of Theorem 6.3 we (I) design a suitable random DNN on this artificial probability space, (II) show that this suitable random DNN is in a suitable sense close to the solution of the considered deterministic PDE, and (III) employ items (I)–(II) above to establish the existence of a realization with the desired approximation properties on the artificial probability space. The specific realization of this random DNN is then a deterministic DNN approximation of the solution of the considered deterministic PDE with the desired approximation properties. The main work of the paper is the construction and the analysis of this random DNN. For the construction of the random DNN we need suitable general flexibility results for rectified DNNs which, roughly speaking, demonstrate how rectified DNNs can be composed with a moderate growth of the number of involved parameters (see Subsection 5.2 below for details). The construction of the 5
random DNN (cf. (I) above) is essentially performed in Section 5 and Section 6 and the analysis of the random DNN (cf. (II) above) is essentially the subject of Section 3, Section 4, and Subsection 6.1. The argument for the existence of the realization with the desired approximation properties on the artificial probability space (cf. (III) above) is provided in Section 2 and Subsection 6.1.
2
On the existence of a realization with the desired approximation properties on a suitable artificial probability space
In this section we establish in Corollary 2.4 in Subsection 2.2 below on a very abstract level, roughly speaking, the argument that good approximation properties of the random DNN (cf. items (I)–(II) in Section 1 above) imply the existence of a realization with the desired approximation properties on the artificial probability space (cf. item (III) in Section 1 above). The function u : Rd → R in Corollary 2.4 will essentially take the role of the solution of the considered deterministic PDE and the random field X : Rd × Ω → R will essentially take the role of the random DNN. Our proof of Corollary 2.4 is based on an application of Proposition 2.3 in Subsection 2.2 below. Proposition 2.3 is, very loosely speaking, an abstract generalized version of Corollary 2.4. Our proof of Proposition 2.3 is based on an application of the elementary Markov-type estimate in Lemma 2.2 in Subsection 2.1 below. Lemma 2.2, in turn, follows from the Markov inequality in Lemma 2.1 in Subsection 2.1 below. For completeness we also provide the short proof of the Markov inequality in Lemma 2.1. Results related to Lemma 2.2 and Proposition 2.3 can, e.g., be found in Grohs et al. [26, Subsection 3.1]. In particular, Lemma 2.2 is somehow an elementary extension of [26, Proposition 3.3 in Subsection 3.1].
2.1
Markov-type estimates
Lemma 2.1 (Markov inequality). Let (Ω, F, µ) be a measure space, let ε ∈ (0, ∞), and let X : Ω → [0, ∞] be an F/B([0, ∞])-measurable function. Then R X dµ . (6) µ X≥ε ≤ Ω ε Proof of Lemma 2.1. Note that the fact that X ≥ 0 proves that
1{X≥ε} =
ε · 1{X≥ε} X · 1{X≥ε} X ≤ ≤ . ε ε ε
(7)
Integration with respect to µ hence establishes (6). The proof of Lemma 2.1 is thus completed. Lemma 2.2. Let (Ω, F, P) be a probability space, let X : Ω → [−∞, ∞] be a random variable, and let ε, q ∈ (0, ∞). Then 1/q 1/q E |X|q P |X| ≥ ε ≤ . (8) ε 6
Proof of Lemma 2.2. Observe that Lemma 2.1 ensures that " #1/q q 1/q q E |X| E |X| 1/q 1 = . [P(|X| ≥ ε)] = [P(|X|q ≥ εq )] /q ≤ q ε ε
(9)
The proof of Lemma 2.2 is thus completed.
2.2
Existence of a realization with the desired approximation properties
Proposition 2.3. Let ε ∈ (0, ∞), let (Ω, F, P) be a probability space, and let X : Ω → [−∞, ∞] be a random variable which satisfies that 1/q inf q∈(0,∞) E |X|q < ε. (10) Then there exists ω ∈ Ω such that |X(ω)| < ε. Proof of Proposition 2.3. First, observe that Lemma 2.2 assures that for all (0, ∞) it holds that 1/q 1/q E |X|q P |X| ≥ ε ≤ . ε 1/q Next note that the hypothesis that inf q∈(0,∞) E |X|q < ε demonstrates there exists q ∈ (0, ∞) such that 1/q E |X|q < ε. Combining this with (11) proves that 1/q P |X| ≥ ε < 1.
q ∈ (11) that (12)
(13)
Hence, we obtain that P |X| ≥ ε < 1.
(14)
P |X| < ε = 1 − P |X| ≥ ε > 0.
(15)
This shows that Therefore, we obtain that {|X| < ε} = ω ∈ Ω : |X(ω)| < ε 6= ∅.
(16)
The proof of Proposition 2.3 is thus completed. Corollary 2.4 (Existence of approximating realizations of a random field). Let d ∈ N, p, ε ∈ (0, ∞), let u : Rd → R be B(Rd )/B(R)-measurable, let (Ω, F, P) be a probability space, let ν : B(Rd ) → [0, 1] be a probability measure on Rd , let X : Rd × Ω → R be (B(Rd ) ⊗ F)/B(R)-measurable, and assume that Z 1/p p E |u(x) − X(x)| ν(dx) < ε. (17) Rd
Then there exists ω ∈ Ω such that Z 1/p p |u(x) − X(x, ω)| ν(dx) < ε. Rd
7
(18)
Proof of Corollary 2.4. Throughout this proof let Y : Ω → [−∞, ∞] be the random variable given by Z 1/p p Y = |u(x) − X(x)| ν(dx) . (19) Rd
Observe that Fubini’s theorem and (17) ensure that Z 1/p p 1/p p = E E |Y | |u(x) − X(x)| ν(dx) Rd
Z =
1/p p E |u(x) − X(x)| ν(dx) < ε.
(20)
Rd
Hence, we obtain that inf q∈(0,∞)
1/p 1/q < ε. ≤ E |Y |p E |Y |q
(21)
This allows us to apply Proposition 2.3 to obtain that there exists ω ∈ Ω such that |Y (ω)| < ε.
(22)
Combining this with (19) establishes (18). The proof of Corollary 2.4 is thus completed.
3
The Feynman-Kac formula revisited
Theorem 6.3 in Subsection 6.2 below (the main result of this article) and Theorem 1.1 in the introduction, respectively, are, as mentioned above, purely deterministic approximation results for DNNs and a class of deterministic PDEs. In contrast, our proofs of Theorem 6.3 and Theorem 1.1, respectively, are based on a probabilistic argument on a suitable artificial probability space on which we, roughly speaking, design random DNNs. Our construction of the random DNNs is based on suitable Monte Carlo approximations of the solutions of the considered deterministic PDEs. These suitable Monte Carlo approximations, in turn, are based on the link between deterministic Kolmogorov PDEs and solutions of SDEs which is provided by the famous Feynman-Kac formula. In this section we recall in Theorem 3.1 below a special case of this famous formula (cf., e.g., Hairer et al. [27, Subsection 4.4]). Theorem 3.1 below will be used in our proof of Theorem 6.3 below (cf. (145) and (150) in the proof of Proposition 6.1, Proposition 6.1, Corollary 6.2, and Theorem 6.3). Theorem 3.1. Let (Ω, F, P) be a probability space, let T ∈ (0, ∞), d ∈ N, B ∈ Rd×m , let W : [0, T ]×Ω → Rm be a standard Brownian motion, let k·k : Rd → [0, ∞) be the d-dimensional Euclidean norm, let h·, ·i : Rd × Rd → R be the d-dimensional Euclidean scalar product, let ϕ : Rd → R be a continuous function, let µ : Rd → Rd be a locally Lipschitz continuous function, and assume that |ϕ(x)| kµ(x)k + < ∞. (23) inf sup p∈(0,∞) x∈Rd (1 + kxkp ) (1 + kxk) Then 8
(i) there exist unique stochastic processes X x : [0, T ] × Ω → Rd , x ∈ Rd , with continuous sample paths which satisfy for all x ∈ Rd , t ∈ [0, T ] that Z t x (24) µ(Xsx ) ds + BWt , Xt = x + 0
(ii) there exists a unique function u : [0, T ] × Rd → R such that for all x ∈ Rd it |u(t,x)| holds that u(0, x) = ϕ(x), such that inf p∈(0,∞) sup(t,x)∈[0,T ]×Rd 1+kxk p < ∞, and such that u is a viscosity solution of
∂ u)(t, x) = (∇x u)(t, x), µ(x) + 21 Trace BB ∗ (Hessx u)(t, x) (25) ( ∂t for (t, x) ∈ (0, T ) × Rd , and (iii) it holds for all t ∈ [0, T ], x ∈ Rd that E |ϕ(Xtx )| < ∞ and u(t, x) = E ϕ(Xtx ) .
4
(26)
Stochastic differential equations (SDEs)
In our proofs of Theorem 1.1 above and Theorem 6.3 below (the main result of this article), respectively, we design and analyse (cf. items (I)–(II) in Section 1 above) a suitable random DNN. The construction of this suitable random DNN is based on Euler-Maruyama discretizations of solutions of the SDEs associated to the Kolmogorov PDEs in (4) and for our error analysis of this suitable random DNN we employ appropriate weak error estimates for Euler-Maruyama discretizations of solutions of SDEs. These weak error estimates are established in Lemma 4.5 and Proposition 4.6 in Subsection 4.4 below. Our proofs of Lemma 4.5 and Proposition 4.6, respectively, use suitable strong error estimates for Euler-Maruyama discretizations. These strong error estimates are the subject of Proposition 4.4 in Subsection 4.3 below. Proposition 4.4 follows from an application of the deterministic perturbation-type inequality in Lemma 4.3 in Subsection 4.3 below. Perturbation estimates which are related to Lemma 4.3 and Proposition 4.4 can, e.g., be found in Hutzenthaler et al. [38, Proposition 2.9 and Corollary 2.12]. In particular, our proof of Lemma 4.3 is inspired by the proof of Proposition 2.9 in Hutzenthaler et al. [38]. Furthermore, our proof of Proposition 4.6 employs the elementary a priori estimate in Lemma 4.1 in Subsection 4.1 below. Lemma 4.1, in turn, is a straightforward consequence from Gronwall’s integral inequality (see, e.g., Grohs et al. [26, Lemma 2.11]) and its proof is therefore omitted. In our proof of Theorem 6.3 we will also employ the elementary a priori estimate for standard Brownian motions in Lemma 4.2 in Subsection 4.2 below. Lemma 4.2 is a straightforward consequence from Itˆo’s formula and its proof is therefore also omitted.
4.1
A priori bounds for SDEs
Lemma 4.1. Let d, m ∈ N, ξ ∈ Rd , p ∈ [1, ∞), c, C, T ∈ [0, ∞), B ∈ Rd×m , let k·k : Rd → [0, ∞) be the d-dimensional Euclidean norm, let (Ω, F, P) be a probability 9
space, let W : [0, T ] × Ω → Rm be a standard Brownian motion, let µ : Rd → Rd be a B(Rd )/B(Rd )-measurable function which satisfies for all x ∈ Rd that kµ(x)k ≤ C + ckxk, let χ : [0, T ] → [0, T ] be a B([0, T ])/B([0, T ])-measurable function which satisfies for all t ∈ [0, T ] that χ(t) ≤ t, and let X : [0, T ] × Ω → Rd be a stochastic process with continuous sample paths which satisfies for all t ∈ [0, T ] that Z t µ Xχ(s) ds + BWt = 1. (27) P Xt = ξ + 0
Then it holds that sup E[kXt kp ]
1/p
1/p cT e . ≤ kξk + CT + E kBWT kp
t∈[0,T ]
4.2
(28)
A priori bounds for Brownian motions
Lemma 4.2. Let d, m ∈ N, T ∈ [0, ∞), p ∈ (0, ∞), B ∈ Rd×m , let k·k : Rd → [0, ∞) be the d-dimensional Euclidean norm, let (Ω, F, P) be a probability space, and let W : [0, T ] × Ω → Rm be a standard Brownian motion. Then it holds for all t ∈ [0, T ] that 1/p p E kBWt kp ≤ max{1, p − 1} Trace(B ∗ B) t.
4.3
(29)
Strong perturbations of SDEs
Lemma 4.3. Let d ∈ N, L, T ∈ [0, ∞), δ ∈ (0, ∞), p ∈ [2, ∞), let k·k : Rd → [0, ∞) be the d-dimensional Euclidean norm, let µ : Rd → Rd be a function which satisfies for all v, w ∈ Rd that kµ(v) − µ(w)k ≤ Lkv − wk, (30) let X, Y : [0, T ] → Rd be continuous functions, let a : [0, TR] → Rd be a B([0, T ])/B(Rd )t measurable function, and assume for all t ∈ [0, T ] that 0 kas k ds < ∞ and Z Xt − Yt = X0 − Y0 +
t
µ(Xs ) − as ds.
(31)
0
Then it holds for all t ∈ [0, T ] that kXt − Yt kp h ≤ exp L +
(1−1/p) δ
i
Z t p p (p−1) kas − µ(Ys )k ds . p t kX0 − Y0 k + δ
(32)
0
Proof of Lemma 4.3. Throughout this proof let h·, ·i : Rd × Rd → R be the d-dimensional Euclidean scalar product and let α ∈ (0, ∞) be the real number given by h i α = L + (1−1/p) p. (33) δ
10
Note that (31) ensures that the function ([0, T ] 3 t 7→ (Xt − Yt ) ∈ Rd ) is absolutely continuous. The fundamental theorem of calculus and the chain rule hence prove that for all t ∈ [0, T ] it holds that Z t p kXs − Ys kp−2 hXs − Ys , µ(Xs ) − as i kXt − Yt kp p = kX0 − Y0 k + ds exp(αt) exp(αs) 0 Z t αkXs − Ys kp − ds exp(αs) 0 (34) Z t p kXs − Ys kp−2 hXs − Ys , µ(Xs ) − µ(Ys )i p = kX0 − Y0 k + ds exp(αs) 0 Z t p kXs − Ys kp−2 hXs − Ys , µ(Ys ) − as i − αkXs − Ys kp + ds. exp(αs) 0 Next observe that (30) and the Cauchy-Schwartz inequality ensure that for all s ∈ [0, T ] it holds that hXs − Ys , µ(Xs ) − µ(Ys )i ≤ kXs − Ys kkµ(Xs ) − µ(Ys )k ≤ LkXs − Ys k2 .
(35)
This and (34) demonstrate that for all t ∈ [0, T ] it holds that Z t pLkXs − Ys kp kXt − Yt kp p ≤ kX0 − Y0 k + ds exp(αt) exp(αs) 0 Z t p kXs − Ys kp−2 hXs − Ys , µ(Ys ) − as i − αkXs − Ys kp + ds (36) exp(αs) 0 Z t p kXs − Ys kp−2 hXs − Ys , µ(Ys ) − as i − (α − pL)kXs − Ys kp p ds = kX0 − Y0 k + exp(αs) 0 Z t p kXs − Ys kp−2 hXs − Ys , µ(Ys ) − as i − (p−1) kXs − Ys kp p δ = kX0 − Y0 k + ds. exp(αs) 0 Next observe that the Cauchy-Schwartz inequality and Young’s inequality prove that for all s ∈ [0, T ] it holds that kXs − Ys kp−2 hXs − Ys , µ(Ys ) − as i ≤ kXs − Ys kp−1 kµ(Ys ) − as k =δ ≤ =
(1−p)/p
kXs − Ys kp−1 δ
(p−1) (1−p)/p δ kXs p (p−1) kXs − Ys kp δp
(p−1)/p
kµ(Ys ) − as k (p−1)/p p p/(p−1) + p1 δ − Ys kp−1 kµ(Ys ) − as k
+
δ (p−1) kµ(Ys ) p
(37)
− as k p .
Combining this with (36) assures that for all t ∈ [0, T ] it holds that kXt − Yt kp exp(αt) p
Z
≤ kX0 − Y0 k +
t (p−1) kXs δ
0 p
t
Z
= kX0 − Y0 k + 0 p
Z
≤ kX0 − Y0 k +
− Ys kp + δ (p−1) kµ(Ys ) − as kp − exp(αs) δ (p−1) kµ(Ys ) − as kp ds exp(αs)
t
δ (p−1) kµ(Ys ) − as kp ds.
0
11
(p−1) kXs δ
− Ys k p
ds (38)
This implies (32). The proof of Lemma 4.3 is thus completed. Proposition 4.4 (Perturbation). Let d, m ∈ N, x, y ∈ Rd , L, T ∈ [0, ∞), δ ∈ (0, ∞), p ∈ [2, ∞), B ∈ Rd×m , let k·k : Rd → [0, ∞) be the d-dimensional Euclidean norm, let (Ω, F, P) be a probability space, let W : [0, T ] × Ω → Rm be a standard Brownian motion, let µ : Rd → Rd be a function which satisfies for all v, w ∈ Rd that kµ(v) − µ(w)k ≤ Lkv − wk, (39) let X, Y : [0, T ] × Ω → Rd be stochastic processes with continuous sample paths, let a : [0, T ] × Ω → RRd be a (B([0, T ]) ⊗ F)/B(RRd )-measurable function, and assume for t t all t ∈ [0, T ] that 0 kas k ds < ∞, Yt = y + 0 as ds + BWt , and Z
t
µ(Xs ) ds + BWt .
Xt = x +
(40)
0
Then it holds for all t ∈ [0, T ] that 1/p E kXt − Yt kp Z t 1/p ! h i ≤ exp L + (1−1/p) t kx − yk + δ (1−1/p) E kas − µ(Ys )kp ds . δ
(41)
0
Proof of Proposition 4.4. First, note that for all t ∈ [0, T ] it holds that Z t Xt − Yt = x − y + µ(Xs ) − as ds.
(42)
0
Lemma 4.3 hence ensures that for all t ∈ [0, T ] that kXt − Yt kp h ≤ exp L +
(1−1/p) δ
i
Z t p (p−1) p p t kx − yk + δ kas − µ(Ys )k ds .
(43)
0
This implies that for all t ∈ [0, T ] it holds that E kXt − Yt kp Z t i h p (1−1/p) (p−1) p p t kx − yk + δ E kas − µ(Ys )k ds . ≤ exp L + δ
(44)
0
The fact that ∀ b, c ∈ R : |b + c|1/p ≤ |b|1/p + |c|1/p hence demonstrates that for all t ∈ [0, T ] it holds that 1/p E kXt − Yt kp Z t 1/p ! h i (1−1/p) (1−1/p) p ≤ exp L + δ t kx − yk + δ . E kas − µ(Ys )k ds 0
The proof of Proposition 4.4 is thus completed. 12
(45)
4.4
Weak perturbations of SDEs
Lemma 4.5. Let d, m ∈ N, ξ ∈ Rd , h, T, ε0 , ε1 , ς0 , ς1 , L0 , L1 , ` ∈ [0, ∞), δ ∈ (0, ∞), B ∈ Rd×m , p ∈ [2, ∞), q ∈ (1, 2] satisfy 1/p + 1/q = 1, let k·k : Rd → [0, ∞) be the ddimensional Euclidean norm, let (Ω, F, P) be a probability space, let W : [0, T ]×Ω → Rm be a standard Brownian motion, let φ0 : Rd → R, f1 : Rd → Rd , φ2 : Rd → Rd , and χ : [0, T ] → [0, T ] be functions, let f0 : Rd → R be a B(Rd )/B(R)-measurable function, let φ1 : Rd → Rd be a B(Rd )/B(Rd )-measurable function, assume for all t ∈ [0, T ], x, y ∈ Rd that kφ1 (x) − f1 (x)k ≤ ε1 (1 + kxkς1 ), 1 ` rkxk + (1 − r)kyk dr kx − yk ,
|φ0 (x) − f0 (x)| ≤ ε0 (1 + kxkς0 ), Z |φ0 (x) − φ0 (y)| ≤ L0 1 +
(46) (47)
0
kf1 (x) − f1 (y)k ≤ L1 kx − yk,
χ(t) = max({0, h, 2h, . . . } ∩ [0, t]) , (48)
and
d
and let X, Y : [0, T ] × Ω → R be stochastic processes with continuous sample paths Rt which satisfy for all t ∈ [0, T ] that Yt = φ2 (ξ) + 0 φ1 Yχ(s) ds + BWt and Z t Xt = ξ + f1 (Xs ) ds + BWt . (49) 0
Then it holds that E f0 (XT ) − E φ0 (YT ) ≤ ε0 (1 + E[kXT kς0 ]) h + L0 2max{`−1,0} exp L1 + "
(50)
i h 1/q 1/q i T 1 + E kXT k`q + E kYT k`q " # 1 /p 1 1 · kξ − φ2 (ξ)k + ε1 δ (1− /p) T /p 1 + sup E kYt kpς1 (1−1/p) δ
t∈[0,T ]
" + h δ (1− /p) T 1
1/p
L1
# # 1/p 1 /p 1 1 sup E kφ1 (Yt )kp + δ (1− /p) T /p L1 E kBWh kp . t∈[0,T ]
Proof of Lemma 4.5. First, note that the triangle inequality ensures that E f0 (XT ) − E φ0 (YT ) ≤ E f0 (XT ) − E φ0 (XT ) + E φ0 (XT ) − E φ0 (YT ) ≤ E |f0 (XT ) − φ0 (XT )| + E |φ0 (XT ) − φ0 (YT )| ≤ ε0 E[1 + kXT kς0 ] + E |φ0 (XT ) − φ0 (YT )| .
(51)
This implies that E f0 (XT ) − E φ0 (YT ) ≤ ε0 E[1 + kXT kς0 ] Z 1 ` + L0 E 1 + rkXT k + (1 − r)kYT k dr kXT − YT k 0 ς0
≤ ε0 E[1 + kXT k ] Z max{`−1,0} + L0 E 1 + 2
1
krXT k + k(1 − r)YT k dr kXT − YT k . `
0
13
`
(52)
Therefore, we obtain that E f0 (XT ) − E φ0 (YT ) ≤ ε0 E[1 + kXT kς0 ] Z max{`−1,0} + L0 E 1 + 2
1
`
r dr
`
kXT k + kYT k
`
kXT − YT k
0
(53)
ς0
= ε0 E[1 + kXT k ] max{`−1,0} 2 ` ` kXT − YT k . + L0 E 1 + kXT k + kYT k (` + 1) H¨older’s inequality hence demonstrates that E f0 (XT ) − E φ0 (YT ) ≤ ε0 (1 + E[kXT kς0 ]) (54) max{`−1,0} h i 1/q 1/q 2 1/p + L0 1 + + E kYT k`q (E[kXT − YT kp ]) . E kXT k`q (` + 1) Next observe that Proposition 4.4 (with d = d, m = m, x = ξ, y = φ2 (ξ), L = L1 , T = T , δ = δ, p = p, B = B, (Ω, F, P) = (Ω, F, P), W = W , µ = f1 , X = X, Y = Y , a = ([0, T ] × Ω 3 (t, ω) 7→ φ1 (Yχ(t) (ω)) ∈ Rd ) in the notation of Proposition 4.4) ensures that 1/p
(E[kXT − YT kp ]) i h T ≤ exp L1 + (1−1/p) δ ·
kξ − φ2 (ξ)k + δ (1− /p) 1
Z
T
E kφ1 (Yχ(s) ) − f1 (Ys )kp ds
1/p !
0
≤ exp ·
h
i L1 + (1−1/p) T δ
kξ − φ2 (ξ)k + δ
(1−1/p)
Z
T
E kφ1 (Yχ(s) ) − f1 (Yχ(s) )kp ds
1/p ! (55)
0
+ exp ≤ exp
h
h
+ exp
L1 +
L1 +
h
(1−1/p) δ
(1−1/p) δ
L1 +
T
Z i (1−1/p) δ T
1/p ! p E kf1 (Yχ(s) ) − f1 (Ys )k ds
0
Z i (1−1/p) T kξ − φ2 (ξ)k + ε1 δ
(1−1/p) δ
T
E (1 + kYχ(s) kς1 )p ds
0
i 1 T L1 δ (1− /p)
T
Z 0
14
1/p p E kYχ(s) − Ys k ds .
1/p !
This shows that 1/p
(E[kXT − YT kp ]) i h T kξ − φ2 (ξ)k ≤ exp L1 + (1−1/p) δ " Z h i 1/p (1−1/p) (1−1/p) + exp L1 + δ T ε1 δ T +
T
E kYχ(s) kpς1 ds
1/p # (56)
0
h
i 1 (1−1/p) L1 + δ T L1 δ (1− /p)
+ exp 1/p Z T h
p i
s
ds . E ∫χ(s) φ1 (Yχ(u) ) du + B(Ws − Wχ(s) ) · 0
Moreover, observe that the triangle inequality assures that 1/p h
p i s E ∫χ(s) φ1 (Yχ(u) ) du + B(Ws − Wχ(s) ) ds
T
Z 0
Z
T
1/p Z h
p i s E ∫χ(s) φ1 (Yχ(u) ) du ds +
T
h i 1/p Z p p + E |s − χ(s)| kφ1 (Yχ(s) )k ds
≤ 0
Z
1/p h
p i E B(Ws − Wχ(s) ) ds
T
0
=
T
1/p h
p i E B(Ws−χ(s) ) ds
(57)
0
0
Z
T
≤h
h i E kφ1 (Yχ(s) )kp ds
1/p
Z +
T
1/p h
p i
E B(Ws−χ(s) ) . ds
0
0
This and (56) show that 1/p
(E[kXT − YT kp ]) h i (1−1/p) ≤ exp L1 + δ T kξ − φ2 (ξ)k " " ## h i 1/p 1 1 1 (1−1/p) + exp L1 + δ T ε1 δ (1− /p) T /p + T /p sup E kYt kpς1 t∈[0,T ]
+ exp
h
"
i 1/p 1 1 (1−1/p) T h L1 δ (1− /p) T /p sup E kφ1 (Yχ(t) )kp L1 + δ t∈[0,T ]
+ exp
h
L1 +
(1−1/p) δ
Z i (1−1/p) T L1 δ 0
15
T
1/p h
p i
E B(Ws−χ(s) ) ds .
(58) #
Therefore, we obtain that 1/p
(E[kXT − YT kp ]) i h T kξ − φ2 (ξ)k ≤ exp L1 + (1−1/p) δ h 1 1 + ε1 δ (1− /p) T /p exp L1 +
# " i 1 /p (1−1/p) T 1 + sup E kYt kpς1 δ t∈[0,T ]
+ h δ (1− /p) T 1
1/p
h L1 exp L1 +
h (1−1/p) 1/p +δ T L1 exp L1 +
(1−1/p) δ
i T
"
(59)
# 1/p sup E kφ1 (Yt )kp t∈[0,T ]
i (1−1/p) p 1/p . T E kBW k h δ
Combining this with (54) demonstrates that E f0 (XT ) − E φ0 (YT ) ≤ ε0 (1 + E[kXT kς0 ]) max{`−1,0} h i 2 `q 1/q `q 1/q + L0 1 + + E kYT k E kXT k (60) (` + 1) " " # i h (1−1/p) (1−1/p) 1/p pς1 1/p T kξ − φ2 (ξ)k + ε1 δ T · exp L1 + δ 1 + sup E kYt k t∈[0,T ]
" + h δ (1− /p) T 1
1/p
L1
#
# 1/p 1 /p 1 1 sup E kφ1 (Yt )kp + δ (1− /p) T /p L1 E kBWh kp . t∈[0,T ]
The proof of Lemma 4.5 is thus completed. Proposition 4.6. Let d, m ∈ N, ξ ∈ Rd , T ∈ (0, ∞), c, C, ε0 , ε1 , ε2 , ς0 , ς1 , ς2 , L0 , L1 , ` ∈ [0, ∞), h ∈ [0, T ], B ∈ Rd×m , p ∈ [2, ∞), q ∈ (1, 2] satisfy 1/p + 1/q = 1, let k·k : Rd → [0, ∞) be the d-dimensional Euclidean norm, let (Ω, F, P) be a probability space, let W : [0, T ] × Ω → Rm be a standard Brownian motion, let φ0 : Rd → R, f1 : Rd → Rd , φ2 : Rd → Rd , and χ : [0, T ] → [0, T ] be functions, let f0 : Rd → R be a B(Rd )/B(R)-measurable function, let φ1 : Rd → Rd be a B(Rd )/B(Rd )-measurable function, assume that kξ −φ2 (ξ)k ≤ ε2 (1+kξkς2 ), assume for all t ∈ [0, T ], x, y ∈ Rd that |φ0 (x) − f0 (x)| ≤ ε0 (1 + kxkς0 ), Z |φ0 (x) − φ0 (y)| ≤ L0 1 +
kφ1 (x) − f1 (x)k ≤ ε1 (1 + kxkς1 ), 1 ` rkxk + (1 − r)kyk dr kx − yk ,
(61)
χ(t) = max({0, h, 2h, . . . } ∩ [0, t]) ,
(63)
(62)
0
kf1 (x) − f1 (y)k ≤ L1 kx − yk,
and kφ1 (x)k ≤ C + ckxk, let $r ∈ R, r ∈ (0, ∞), satisfy for all r ∈ (0, ∞) that $r = 1/r E kBWT kr , and let X, Y : [0, T ] × Ω → Rd be stochastic processes with continu Rt ous sample paths which satisfy for all t ∈ [0, T ] that Yt = φ2 (ξ)+ 0 φ1 Yχ(s) ds+BWt and Z t Xt = ξ + f1 (Xs ) ds + BWt . (64) 0
16
Then it holds that E f0 (XT ) − E φ0 (YT ) ≤ ε0 + ε1 + ε2 + (h/T )1/2 · e(`+3+2L1 +[` max{L1 ,c}+c max{ς1 ,1}+L1 max{ς0 ,1}+2]T ) kξk + max{1, ε2 }(1 + kξkς2 ) (65) max{1,ς0 ,ς1 }+` + max{1, C, kf1 (0)k} max{1, T } + $max{ς0 ,ς1 p,p,`q} max{1, L0 }. Proof of Proposition 4.6. First, observe that Lemma 4.5 shows that E f0 (XT ) − E φ0 (YT ) ≤ ε0 (1 + E[kXT kς0 ]) h 1/q i 1/q 1 + E kYT k`q + L0 2max{`−1,0} e[L1 +(1− /p)]T 1 + E kXT k`q # " " 1 /p 1 · kξ − φ2 (ξ)k + ε1 T /p 1 + sup E kYt kpς1
(66)
t∈[0,T ]
" + hT
1/p
# # 1/p 1 /p 1 sup E kφ1 (Yt )kp + T /p L1 E kBWh kp .
L1
t∈[0,T ]
Hence, we obtain that E f0 (XT ) − E φ0 (YT ) ≤ ε0 (1 + E[kXT kς0 ]) h i `q 1/q `q 1/q + E kYT k + L0 2 e 1 + E kXT k " " # 1/p 1 · kξ − φ2 (ξ)k + ε1 T /p 1 + sup E kYt kpς1 max{`−1,0} [L1 +(1−1/p)]T
t∈[0,T ]
" + hT
1/p
(67)
#
1/p L1 ε1 sup E (1 + kYt kς1 )p t∈[0,T ]
" + hT
1/p
L1
# # 1/2 1/p−1/2 p 1/p sup E kf1 (Yt )k + L1 $p h T . t∈[0,T ]
In addition, note that for all x ∈ Rd it holds that kf1 (x)k ≤ kf1 (x) − f1 (0)k + kf1 (0)k ≤ kf1 (0)k + L1 kxk.
17
(68)
This and (67) ensure that E f0 (XT ) − E φ0 (YT ) ≤ ε0 1 + E kXT kς0 h 1/q i 1/q 1 + E kYT k`q + L0 2max{`−1,0} e[L1 +1− /p]T 1 + E kXT k`q # " " 1/p 1 · kξ − φ2 (ξ)k + ε1 T /p 1 + sup E kYt kpς1 t∈[0,T ]
#
" 1/p 1 + hT /p L1 ε1 1 + sup E kYt kpς1 t∈[0,T ]
" + hT
1/p
1/p sup E kYt kp
L1 kf1 (0)k + L1
#
##
"
+ L1 $p h /2 T 1
(69)
1/p−1/2
t∈[0,T ]
= ε0
1 + E kXT kς0
h 1/q 1/q i 1 + L0 2max{`−1,0} e[L1 +1− /p]T 1 + E kXT k`q + E kYT k`q " " # 1 /p 1 · kξ − φ2 (ξ)k + ε1 T /p [1 + hL1 ] 1 + sup E kYt kpς1 t∈[0,T ]
" + hT
1/p
"
L1 kf1 (0)k + L1
## 1/p sup E kYt kp
# + L1 $p h /2 T 1
1/p−1/2
.
t∈[0,T ]
Next observe that Lemma 4.1 and (68) demonstrate that for all r ∈ [1, ∞), t ∈ [0, T ] it holds that cT r 1/r r 1/r sup E kYt k ≤ kφ2 (ξ)k + CT + E kBWT k e t∈[0,T ] (70) cT = (kφ2 (ξ)k + CT + $r ) e and 1/r 1/r L1 T sup E kXt kr ≤ kξk + kf1 (0)kT + E kBWT kr e t∈[0,T ] L1 T
= (kξk + kf1 (0)kT + $r ) e
(71)
.
Combining this with (69) shows that E f0 (XT ) − E φ0 (YT ) ς0 ς0 L1 T 1 ≤ ε0 1 + kξk + kf1 (0)kT + $max{ς0 ,1} e + L0 2max{`−1,0} e[L1 +1− /p]T ` ` · 1 + kξk + kf1 (0)kT + $max{`q,1} e`L1 T + kφ2 (ξ)k + CT + $max{`q,1} e`cT " ς 1 · kξ − φ2 (ξ)k + ε1 T /p [1 + hL1 ] 1 + kφ2 (ξ)k + CT + $max{ς1 p,1} 1 eς1 cT + hT
1/p
# cT 1/2 1/p−1/2 L1 kf1 (0)k + L1 kφ2 (ξ)k + CT + $p e + L1 $p h T .
18
(72)
Hence, we obtain that E f0 (XT ) − E φ0 (YT ) ς ≤ ε0 1 + kξk + kf1 (0)kT + $max{ς0 ,1} 0 eς0 L1 T + L0 2max{`−1,0} e[max{`L1 ,`c}+c max{ς1 ,1}+L1 +1− /p]T max{1, T /p } ` ` · 1 + kξk + kf1 (0)kT + $max{`q,1} + kφ2 (ξ)k + CT + $max{`q,1} " ς1 · kξ − φ2 (ξ)k + ε1 [1 + hL1 ] 1 + kφ2 (ξ)k + CT + $max{ς1 p,1} 1
1
(73)
# 1/2 + hL1 kf1 (0)k + L1 kφ2 (ξ)k + CT + $p + (h/T ) L1 $p .
This implies that E f0 (XT ) − E φ0 (YT ) ς0 ς0 L1 T ≤ ε0 1 + kξk + kf1 (0)kT + $max{ς0 ,1} e + L0 2max{`−1,0} e[max{`L1 ,`c}+c max{ς1 ,1}+L1 +1]T (74) ` ` · 1 + kξk + kf1 (0)kT + $max{`q,1} + kξk + ε2 (1 + kξkς2 ) + CT + $max{`q,1} " ς · ε2 (1 + kξkς2 ) + ε1 [1 + T L1 ] 1 + kξk + ε2 (1 + kξkς2 ) + CT + $max{ς1 p,1} 1
+ (h/T ) /2 T L1 kf1 (0)k + L1 kφ2 (ξ)k + CT + $p 1
# + (h/T ) /2 L1 $p . 1
Therefore, we obtain that E f0 (XT ) − E φ0 (YT ) ς ≤ ε0 1 + kξk + kf1 (0)kT + $max{ς0 ,1} 0 eς0 L1 T
(75)
+ L0 2max{`−1,0} e[max{`L1 ,`c}+c max{ς1 ,1}+L1 +1]T ` ` · 1 + kξk + kf1 (0)kT + $max{`q,1} + kξk + ε2 (1 + kξkς2 ) + CT + $max{`q,1} " · [ε1 + ε2 ] max{1, T }[1 + L1 ]
ς2
· 1 + kξk + max{1, ε2 }(1 + kξk ) + CT + $max{ς1 p,1} 1/2
+ (h/T )
max{1,ς1 }
# max{1, T }[1 + L1 ] kf1 (0)k + L1 kξk + ε2 (1 + kξkς2 ) + CT + $p .
19
This and the fact that ∀ x ∈ [0, ∞) : max{x, 1} ≤ x + 1 ≤ ex demonstrate that E f0 (XT ) − E φ0 (YT ) ς ≤ ε0 1 + kξk + kf1 (0)kT + $max{ς0 ,1} 0 eς0 L1 T 1 (76) + L0 2max{`−1,0} e(L1 +[max{`L1 ,`c}+c max{ς1 ,1}+L1 +2]T ) ε1 + ε2 + (h/T ) /2 ` ` ς2 · 1 + kξk + kf1 (0)kT + $max{`q,1} + kξk + ε2 (1 + kξk ) + CT + $max{`q,1} h · max{1, kf1 (0)k} max{1,ς1 } i + max{1, L1 } kξk + max{1, ε2 }(1 + kξkς2 ) + CT + $max{p,ς1 p} . Hence, we obtain that E f0 (XT ) − E φ0 (YT ) 1 ≤ 2max{`−1,0} e(L1 +[max{`L1 ,`c}+c max{ς1 ,1}+max{ς0 ,1}L1 +2]T ) ε0 + ε1 + ε2 + (h/T ) /2 ` ` · 1 + kξk + kf1 (0)kT + $max{`q,1} + kξk + ε2 (1 + kξkς2 ) + CT + $max{`q,1} h · max{1, L0 } max{1, kf1 (0)k} + max{1, L1 } kξk + max{1, ε2 }(1 + kξkς2 ) max{1,ς0 ,ς1 } i + max{C, kf1 (0)k}T + $max{p,ς1 p,ς0 } . (77) This and the fact that ∀ x ∈ [0, ∞) : max{x, 1} ≤ x + 1 ≤ ex show that E f0 (XT ) − E φ0 (YT ) 1 ≤ 2max{`,1} e(2L1 +[` max{L1 ,c}+c max{ς1 ,1}+L1 max{ς0 ,1}+2]T ) ε0 + ε1 + ε2 + (h/T ) /2 ` ` ς2 · 1 + kξk + kf1 (0)kT + $max{`q,1} + kξk + ε2 (1 + kξk ) + CT + $max{`q,1} · max{1, L0 } kξk + max{1, ε2 }(1 + kξkς2 ) max{1,ς0 ,ς1 } + max{1, C, kf1 (0)k} max{1, T } + $max{ς0 ,ς1 p,p} . (78) Therefore, we obtain that E f0 (XT ) − E φ0 (YT ) ≤ e(`+3+2L1 +[` max{L1 ,c}+c max{ς1 ,1}+L1 max{ς0 ,1}+2]T ) 1 · ε0 + ε1 + ε2 + (h/T ) /2 max{1, L0 } kξk + max{1, ε2 }(1 + kξkς2 ) max{1,ς0 ,ς1 }+` . + max{1, C, kf1 (0)k} max{1, T } + $max{ς0 ,ς1 p,p,`q}
(79)
The proof of Proposition 4.6 is thus completed.
5
Deep artificial neural network (DNN) calculus
In Section 6 below we establish the existence of a DNN approximating the solution of the PDE without the curse of dimensionality. To demonstrate the existence of such a DNN, we need a few properties about representation flexibilities of DNNs, 20
which we establish in this section. In particular, we state in the elementary and essentially well-known result in Lemma 5.1 in Subsection 5.1 below that every linear combination of realizations of DNNs with the same architecture is again a realization of a suitable DNN. Results similar to Lemma 5.1 can, e.g., be found in Yarotsky [66]. Moreover, in Proposition 5.2 in Subsection 5.2 below we demonstrate under suitable hypotheses that the composition of the realizations of two DNNs is again a realization of a suitable DNN and the number of parameters of this suitable DNN grows at most additively in the number of parameters of the composed DNNs. For the construction of this suitable DNN in Proposition 5.2 we plug an artificial identity in between the two DNNs and for this we employ in Proposition 5.2 the hypothesis that the identity can within the class of considered fully-connected neural networks (see (93)–(94) in Proposition 5.2 below) be described by a suitable flat artificial neural network. In Proposition 5.2 the tuples φ1 and φ2 represent the DNNs which we intend to compose (where the realization of φ1 is a function from Rd2 to Rd3 and where the realization of φ2 is a function from Rd1 to Rd2 ), the tuple I represents the artificial neural network which describes the identity on Rd2 , and the tuple ψ represents the DNN whose realization coincides with the composition of the realizations of φ1 and φ2 (the realization of ψ is thus a function from Rd1 to Rd3 ). The hypothesis of the existence of the artificial neural network I can, roughly speaking, be viewed as a hypothesis on the activation function a : R → R used in Proposition 5.2. Proposition 5.2, loosely speaking, then asserts that the number of parameters of ψ can up to a constant be bounded by the sum of the number of parameters of φ1 and of the number of parameters of φ2 . A straightforward DNN construction of the composition of φ1 and φ2 (without artificially plugging the identity on Rd2 in between φ1 and φ2 ) would possibly result in a DNN whose number of parameters is essentially equal to the product of the number of parameters of φ1 and of the number of parameters of φ2 . Such a construction, in turn, would in our proof of the main result of this article (Theorem 6.3 below) not allow us to conclude that DNNs do indeed overcome the curse of dimensionality in the numerical approximation of the considered PDEs (see (177) in the proof of Proposition 6.1 for details). Moreover, in Proposition 5.3 in Subsection 5.2 below we establish under similar hypotheses as in Proposition 5.2 a result similar to Proposition 5.2 which is tailor-made to the DNNs which we design in the proof of our main result in Theorem 6.3 below. In particular, (109) in Proposition 5.3 is tailor-made to construct a DNN which is based on an Euler discretization of a (stochastic) differential equation. We refer to (147) and (175) in the proof of Proposition 6.1 below for further details. To apply Proposition 5.2 and Proposition 5.3, respectively, we need to verify that the class of considered ANNs does indeed enjoy the property to be able to represent the identity on Rd2 . Fortunately, ANNs with the rectifier function as the activation function do indeed admit this property. This fact is verified in the elementary result in Lemma 5.4 in Subsection 5.3 below. 
In particular, Lemma 5.4 shows for every d ∈ N that the d-dimensional identity can be explicitly represented by a suitable flat rectified ANN (with one hidden layer with 2d neurons and the rectifier function as the activation function in front of the 2d-dimensional hidden layer).
21
5.1
Sums of DNNs with the same architecture
Lemma 5.1. Let An : Rn → Rn , n ∈ N, and a : R → R be continuous functions which satisfy for all n ∈ N, x = (x1 , . . . , xn ) ∈ Rn that An (x) = (a(x1 ), . . . , a(xn )), let N = ∪L∈{2,3,4,... } ∪(l0 ,l1 ,...,lL )∈NL+1 (×Ln=1 (Rln ×ln−1 × Rln )), (80) let P : N → N and R : N → ∪k,l∈N C(Rk , Rl ) be the functions which satisfy for all L ∈ {2, 3, 4, . . . }, l0 , l1 , . . . , lL ∈ N, Φ = ((W1 , B1 ), . . . , (WL , BL )) ∈ (×Ln=1 (Rln ×ln−1 × l0 , . . ., xL−1 ∈ RlL−1 with ∀ n ∈ N ∩ [1, L) : xn = Aln (Wn xn−1 + Bn ) Rln )), x0 ∈ RP that P(Φ) = Ln=1 ln (ln−1 + 1), R(Φ) ∈ C(Rl0 , RlL ), and (RΦ)(x0 ) = WL xL−1 + BL ,
(81)
let L ∈ {2, 3, 4, . . . }, M , L0 , L1 , . . . , LL ∈ N, h1 , h2 , . . . , hM ∈ R, and let (φm )m∈{1,2,...,M } ⊆ (×Ln=1 (RLn ×Ln−1 × RLn )). Then there exists ψ ∈ N such that for all x ∈ RL0 it holds that R(ψ) ∈ C(RL0 , RLL ), P(ψ) ≤ M 2 P(φ1 ), and (Rψ)(x) =
M X
hm (Rφm )(x).
(82)
m=1
Proof of Lemma 5.1. Throughout this proof let ((Wm,1 , Bm,1 ), . . . , (Wm,L , Bm,L )) ∈ (×Ln=1 (RLn ×Ln−1 × RLn )), m ∈ {1, 2, . . . , M }, satisfy for all i ∈ {1, 2, . . . M } that φi = ((Wi,1 , Bi,1 ), . . . , (Wi,L , Bi,L )), let (l0 , l1 , . . . , lL ) ∈ NL+1 satisfy for all i ∈ {1, 2, . . . , L − 1} that l0 = L0 , li = M Li , and lL = LL , let ((W1 , B1 ), . . . , (WL , BL )) ∈ (×Ln=1 (Rln ×ln−1 × Rln )) satisfy that W1,1 B1,1 W2,1 B2,1 (M L1 )×L0 l1 ×l0 W1 = .. ∈ R =R , B1 = .. ∈ R(M L1 ) = Rl1 , (83) . . WM,1 BM,1 WL = h1 W1,L h2 W2,L · · · hM WM,L ∈ RLL ×(M LL−1 ) = RlL ×lL−1 , (84) and
BL =
M X
hm Bm,L ∈ RLL = RlL ,
(85)
m=1
assume for all i ∈ {2, 3, 4, . . .} ∩ [0, L − 1] that W1,i 0 ··· 0 .. . 0 W2,i · · · Wi = . ∈ R(M Li )×(M Li−1 ) = Rli ×li−1 . . .. .. .. 0 0 ··· 0 WM,i B1,i B2,i and Bi = .. ∈ R(M Li ) = Rli , . BM,i 22
(86)
(87)
and let ψ = ((W1 , B1 ), . . . , (WL , BL )) ∈ N . Note that for all x ∈ Rl0 it holds that W1,1 x + B1,1 W2,1 x + B2,1 W1 x + B1 = (88) . .. . WM,1 x + BM,1 Moreover, observe that for all i ∈ N ∩ [0, L − 2], x1 , x2 , . . . , xM ∈ Rli it holds that W1,i+1 0 ··· 0 x1 B1,i+1 x1 .. x B x2 W2,i+1 · · · . 2 2,i+1 0 Wi+1 .. + Bi+1 = . .. + .. .. .. . .. . . 0 . . xM xM BM,i+1 0 ··· 0 WM,i+1 (89) W1,i+1 x1 + B1,i+1 W2,i+1 x2 + B2,i+1 = . .. . WM,i+1 xM + BM,i+1 Next note that for all x1 , x2 , . . . , xM ∈ RlL−1 it holds that x1 x1 M x2 x2 X hm Bm,L WL .. + BL = h1 W1,L h2 W2,L · · · hM WM,L .. + . m=1 . xM xM "M # "M # M X X X hm Wm,L xm + Bm,L . = hm Wm,L xm + hm Bm,L = m=1
m=1
m=1
(90) This, (88), and (89) ensure that for all x ∈ RL0 it holds that R(ψ) ∈ C(RL0 , RLL ) and M X (Rψ)(x) = hm (Rφm )(x). (91) m=1
Moreover, observe that the assumption that for all i ∈ {1, 2, . . . , L − 1} it holds that l0 = L0 , li = M Li , and lL = LL assures that P(ψ) =
L X
ln (ln−1 + 1) = l1 (l0 + 1) + lL (lL−1 + 1) +
n=1
L−1 X
ln (ln−1 + 1)
n=2
= M L1 (L0 + 1) + LL (M LL−1 + 1) +
L−1 X
M Ln (M Ln−1 + 1)
(92)
n=2
" ≤ M2
L X
# Ln (Ln−1 + 1) = M 2 P(φ1 ).
n=1
Combining this with (91) establishes (82). The proof of Lemma 5.1 is thus completed. 23
5.2
Compositions of DNNs involving artificial identities
Proposition 5.2 (Composition of neural networks). Let d1 , d2 , d3 ∈ N, let An : Rn → Rn , n ∈ N, and a : R → R be continuous functions which satisfy for all n ∈ N, x = (x1 , . . . , xn ) ∈ Rn that An (x) = (a(x1 ), . . . , a(xn )), let N = ∪L∈{2,3,4,... } ∪(l0 ,l1 ,...,lL )∈NL+1 (×Ln=1 (Rln ×ln−1 × Rln )),
(93)
let P : N → N, L : N → ∪L∈{2,3,4,... } NL+1 , and R : N → ∪k,l∈N C(Rk , Rl ) be the functions which satisfy for all L ∈ {2, 3, 4, . . . }, l0 , l1 , . . . , lL ∈ N, Φ = ((W1 , B1 ), . . . , l0 , . . ., xL−1 ∈ RlL−1 with ∀ n ∈ N ∩ (WL , BL )) ∈ (×Ln=1 (Rln ×ln−1 × Rln )), x0 ∈ RP L [1, L) : xn = Aln (Wn xn−1 + Bn ) that P(Φ) = n=1 ln (ln−1 + 1), R(Φ) ∈ C(Rl0 , RlL ), L(Φ) = (l0 , l1 , . . . , lL ), and (RΦ)(x0 ) = WL xL−1 + BL ,
(94)
and let φ1 , φ2 , I ∈ N , L1 , L2 ∈ {2, 3, 4 . . .}, i, l1,0 , l1,1 , . . . , l1,L1 , l2,0 , l2,1 , . . . , l2,L2 ∈ N satisfy for all x ∈ Rd2 , i ∈ {1, 2} that R(φ1 ) ∈ C(Rd2 , Rd3 ), R(φ2 ) ∈ C(Rd1 , Rd2 ), R(I) ∈ C(Rd2 , Rd2 ), L(φi ) = (li,0 , li,1 , . . . , li,Li ) ∈ NLi +1 , L(I) = (d2 , i, d2 ) ∈ N3 , and (R I)(x) = x. Then there exists ψ ∈ N such that for all x ∈ Rd1 it holds that R(ψ) ∈ C(Rd1 , Rd3 ), L(ψ) = (l2,0 , l2,1 , . . . , l2,L2 −1 , i, l1,1 , l1,2 , . . . , l1,L1 ) ∈ NL1 +L2 +1 , P(ψ) ≤ max{1, 2−1 (d2 )−2 P(I)}(P(φ1 ) + P(φ2 )), and (Rψ)(x) = (Rφ1 ) (Rφ2 )(x) = (Rφ1 ) ◦ (Rφ2 ) (x). (95) Proof of Proposition 5.2. Throughout this proof let (W3,1 , B3,1 ) ∈ Ri×d2 × Ri , (W3,2 , Lj B3,2 ) ∈ Rd2 ×i ×Rd2 , and ((Wj,1 , Bj,1 ), . . . , (Wj,Lj , Bj,Lj )) ∈ (×n=1 (Rlj,n ×lj,n−1 ×Rlj,n )), j ∈ {1, 2}, satisfy for all j ∈ {1, 2} that I = ((W3,1 , B3,1 ), (W3,2 , B3,2 )) and φj = ((Wj,1 , Bj,1 ), . . . , (Wj,Lj , Bj,Lj )), let L4 = L1 + L2 , let l4,0 , l4,1 , . . . , l4,L4 ∈ N satisfy for all i ∈ {0, 1, . . . , L2 − 1}, j ∈ {1, 2, . . . , L1 } that l4,i = l2,i ,
l4,L2 = i,
and
l4,L2 +j = l1,j ,
(96)
4 let ((W4,1 , B4,1 ), . . . , (W4,L4 , B4,L4 )) ∈ (×Ln=1 (Rl4,n ×l4,n−1 × Rl4,n )) satisfy for all i ∈ {1, 2, . . . , L2 − 1}, j ∈ {2, 3, . . . , L1 } that
(W4,i , B4,i ) = (W2,i , B2,i ),
(97)
(W4,L2 , B4,L2 ) = (W3,1 W2,L2 , W3,1 B2,L2 + B3,1 ),
(98)
(W4,L2 +1 , B4,L2 +1 ) = (W1,1 W3,2 , W1,1 B3,2 + B1,1 ),
(99)
and (W4,L2 +j , B4,L2 +j ) = (W1,j , B1,j ), and let ψ = ((W4,1 , B4,1 ), . . . , (W4,L4 , B4,L4 )) ∈ N . Observe that for all x ∈ Rl4,L2 −1 = Rl2,L2 −1 , y ∈ Rl4,L2 = Ri it holds that W4,L2 x+B4,L2 = W3,1 W2,L2 x+W3,1 B2,L2 +B3,1 = W3,1 (W2,L2 x+B2,L2 )+B3,1 (100) and W4,L2 +1 y +B4,L2 +1 = W1,1 W3,2 y +W1,1 B3,2 +B1,1 = W1,1 (W3,2 y +B3,2 )+B1,1 . (101) 24
This ensures that for all x ∈ Rd1 it holds that R(ψ) ∈ C(Rd1 , Rd3 ) and (Rψ)(x) = (Rφ1 ) (R I) (Rφ2 )(x) = (Rφ1 ) (Rφ2 )(x) .
(102)
Moreover, note that P(ψ) =
L4 X
l4,n (l4,n−1 + 1) =
LX 1 +L2 n=1
n=1
=
"L −1 2 X
l4,n (l4,n−1 + 1)
# l4,n (l4,n−1 + 1) +
n=1
" L +L 1 2 X
# l4,n (l4,n−1 + 1)
n=L2 +2
(103)
+ l4,L2 (l4,L2 −1 + 1) + l4,L2 +1 (l4,L2 + 1) "L −1 # "L # 2 1 X X = l2,n (l2,n−1 + 1) + l4,L2 +n (l4,L2 +n−1 + 1) n=1
n=2
+ i (l2,L2 −1 + 1) + l1,1 (i + 1). Hence, we obtain that # "L # "L −1 1 2 X X l2,n (l2,n−1 + 1) + l1,n (l1,n−1 + 1) P(ψ) = n=1
n=2
+ i (l2,L2 −1 + 1) + l1,1 (i + 1) # "L # "L −1 1 2 X X l2,n (l2,n−1 + 1) + l1,n (l1,n−1 + 1) = n=1
n=2 i i · l2,L2 (l2,L2 −1 + 1) + l1,1 l1,0 · +1 + d2 d2 "L # "L #! 2 1 X X ≤ max{1, i/d2 } l2,n (l2,n−1 + 1) + l1,n (l1,n−1 + 1) n=1
(104)
n=1
= max{1, i/d2 } P(φ1 ) + P(φ2 ) . Next observe that P(I) = i (d2 + 1) + d2 (i + 1) = 2 i d2 + i + d2 > 2 i d2 .
(105)
This and (104) ensure that P(ψ) ≤ max{1, 2−1 (d2 )−2 P(I)} P(φ1 ) + P(φ2 ) .
(106)
Combining this with (102) establishes (95). The proof of Proposition 5.2 is thus completed. Proposition 5.3. Let d ∈ N, let An : Rn → Rn , n ∈ N, and a : R → R be continuous functions which satisfy for all n ∈ N, x = (x1 , . . . , xn ) ∈ Rn that An (x) = (a(x1 ), . . . , a(xn )), let N = ∪L∈{2,3,4,... } ∪(l0 ,l1 ,...,lL )∈NL+1 (×Ln=1 (Rln ×ln−1 × Rln )), 25
(107)
let P : N → N, L : N → ∪L∈{2,3,4,... } NL+1 , and R : N → ∪k,l∈N C(Rk , Rl ) be the functions which satisfy for all L ∈ {2, 3, 4, . . . }, l0 , l1 , . . . , lL ∈ N, Φ = ((W1 , B1 ), . . . , l0 , . . ., xL−1 ∈ RlL−1 with ∀ n ∈ N ∩ (WL , BL )) ∈ (×Ln=1 (Rln ×ln−1 × Rln )), x0 ∈ RP L [1, L) : xn = Aln (Wn xn−1 + Bn ) that P(Φ) = n=1 ln (ln−1 + 1), R(Φ) ∈ C(Rl0 , RlL ), L(Φ) = (l0 , l1 , . . . , lL ), and (RΦ)(x0 ) = WL xL−1 + BL ,
(108)
and let φ1 , φ2 , I ∈ N , L1 , L2 ∈ {2, 3, 4 . . .}, i, l1,0 , l1,1 , . . . , l1,L1 , l2,0 , l2,1 , . . . , l2,L2 ∈ N satisfy for all x ∈ Rd , i ∈ {1, 2} that R(φi ) ∈ C(Rd , Rd ), L(φi ) = (li,0 , li,1 , . . . , li,Li ) ∈ NLi +1 , L(I) = (d, i, d) ∈ N3 , (R I)(x) = x, and l1,L1 −1 ≤ l2,L2 −1 + i. Then there exists ψ ∈ N such that for all x ∈ Rd it holds that R(ψ) ∈ C(Rd , Rd ), L(ψ) = (l1,0 , l1,1 , . . . , l1,L1 −1 , l2,1 + i, l2,2 + i, . . . , l2,L2 −1 + i, l2,L2 ) ∈ NL1 +L2 , P(ψ) ≤ P(φ1 ) + (P(φ2 ) + P(I))3 , and (Rψ)(x) = (Rφ1 )(x) + (Rφ2 ) (Rφ1 )(x) . (109) Proof of Proposition 5.3. Throughout this proof let (W3,1 , B3,1 ) ∈ (Ri×d ×Ri ), (W3,2 , Lj B3,2 ) ∈ (Rd×i ×Rd ), and ((Wj,1 , Bj,1 ), . . . , (Wj,Lj , Bj,Lj )) ∈ (×n=1 (Rlj,n ×lj,n−1 ×Rlj,n )), j ∈ {1, 2}, satisfy for all j ∈ {1, 2} that I = ((W3,1 , B3,1 ), (W3,2 , B3,2 )) and φj = ((Wj,1 , Bj,1 ), . . . , (Wj,Lj , Bj,Lj )), let L4 = L1 + L2 − 1, let l4,0 , l4,1 , . . . , l4,L4 ∈ N satisfy for all i ∈ {0, 1, . . . , L1 − 1}, j ∈ {0, 1, . . . , L2 − 2} that l4,i = l1,i ,
l4,L1 +j = l2,j+1 + i,
and
l4,L4 = l2,L2 = d,
(110)
4 let ((W4,1 , B4,1 ), . . . , (W4,L4 , B4,L4 )) ∈ (×Ln=1 (Rl4,n ×l4,n−1 × Rl4,n )), assume for all i ∈ {1, 2, . . . , L1 − 1} that
(W4,i , B4,i ) = (W1,i , B1,i ) ∈ (Rl1,i ×l1,i−1 × Rl1,i ) = (Rl4,i ×l4,i−1 × Rl4,i ), W2,1 W1,L1 W4,L1 = ∈ R(l2,1 +i)×l1,L1 −1 = Rl4,L1 ×l4,L1 −1 , W3,1 W1,L1 W2,1 B1,L1 + B2,1 B4,L1 = ∈ R(l2,1 +i) = Rl4,L1 , W3,1 B1,L1 + B3,1 W4,L4 = W2,L2 W3,2 ∈ Rl2,L2 ×(l2,L2 −1 +i) = Rl4,L4 ×l4,L4 −1 , and
B4,L4 = B2,L2 + B3,2 ∈ Rl2,L2 = Rl4,L4 ,
assume for all j ∈ N ∩ [0, L2 − 2] that W2,j+1 0 W4,L1 +j = ∈ R(l2,j+1 +i)×(l2,j +i) = Rl4,L1 +j ×l4,L1 +j−1 0 W3,1 W3,2 B2,j+1 and B4,L1 +j = ∈ R(l2,j+1 +i) = Rl4,L1 +j , W3,1 B3,2 + B3,1
26
(111) (112) (113) (114) (115)
(116)
(117)
and let ψ = ((W4,1 , B4,1 ), . . . , (W4,L4 , B4,L4 )) ∈ N . Observe that for all x ∈ Rl4,L1 −1 it holds that W2,1 B1,L1 + B2,1 W2,1 W1,L1 x+ W4,L1 x + B4,L1 = W3,1 W1,L1 W3,1 B1,L1 + B3,1 W2,1 W1,L1 x + W2,1 B1,L1 + B2,1 = (118) W3,1 W1,L1 x + W3,1 B1,L1 + B3,1 W2,1 (W1,L1 x + B1,L1 ) + B2,1 = . W3,1 (W1,L1 x + B1,L1 ) + B3,1 Moreover, note that for all i ∈ N ∩ [0, L2 − 2], x ∈ Rl2,i , y ∈ Ri it holds that x W2,i+1 0 x B2,i+1 W4,L1 +i + B4,L1 +i = + y 0 W3,1 W3,2 y W3,1 B3,2 + B3,1 W2,i+1 x + B2,i+1 = (119) W3,1 W3,2 y + W3,1 B3,2 + B3,1 W2,i+1 x + B2,i+1 = . W3,1 (W3,2 y + B3,2 ) + B3,1 Next observe that for all x ∈ Rl2,L2 −1 , y ∈ Ri it holds that x x W4,L4 + B4,L4 = W2,L2 W3,2 + B2,L2 + B3,2 y y
(120)
= (W2,L2 x + B2,L2 ) + (W3,2 y + B3,2 ). Moreover, note that the hypothesis that ∀ y ∈ Rd : (R I)(y) = y ensures that for all x ∈ Rd it holds that W3,2 Ai (W3,1 x + B3,1 ) + B3,2 = x. (121) Combining this, (111), (118), (119), and (120) proves that for all x ∈ Rd it holds that (Rψ)(x) = (Rφ1 )(x) + (Rφ2 ) (Rφ1 )(x) . (122) Next observe that P(ψ) =
=
L4 X
l4,n (l4,n−1 + 1) =
n=1 "L −1 1 X
L1 +L X2 −1
#
l4,n (l4,n−1 + 1)
n=1 "L +L −2 1X 2
l4,n (l4,n−1 + 1) +
n=1
# l4,n (l4,n−1 + 1)
n=L1 +1
+ l4,L1 (l4,L1 −1 + 1) + l4,L1 +L2 −1 (l4,L1 +L2 −2 + 1) "L −1 # "L −2 # 1 2 X X = l1,n (l1,n−1 + 1) + l4,L1 +n (l4,L1 +n−1 + 1) n=1
n=1
+ (l2,1 + i)(l1,L1 −1 + 1) + l2,L2 (l2,L2 −1 + i + 1).
27
(123)
The hypothesis that l1,L1 −1 ≤ l2,L2 −1 + i therefore assures that "L −1 # "L −2 # 1 2 X X P(ψ) = l1,n (l1,n−1 + 1) + (l2,n+1 + i)(l2,n + i + 1) n=1
n=1
+ (l2,1 + i)(l1,L1 −1 + 1) + l2,L2 (l2,L2 −1 + i + 1) # "L −2 2 X 2 ≤ P(φ1 ) + l2,n+1 (l2,n + 1) + l2,n+1 i + i (l2,n + 1) + i n=1
(124)
+ (l2,1 + i)(l2,L2 −1 + i + 1) + l2,L2 (l2,L2 −1 + i + 1) "L −1 # "L −2 # 2 2 X X ≤ P(φ1 ) + l2,n (l2,n−1 + 1) + i (l2,n+1 + l2,n ) n=2
n=1 2
+ (L2 − 2)(i + i ) + (l2,1 + i)(l2,L2 −1 + i + 1) + l2,L2 (l2,L2 −1 + 1) + l2,L2 i. Hence, we obtain that # "L # "L −2 2 2 X X (l2,n+1 + l2,n ) P(ψ) ≤ P(φ1 ) + l2,n (l2,n−1 + 1) + i n=1
n=2 2
+ (L2 − 2)(i + i ) + i (l2,1 + l2,L2 −1 + l2,L2 ) (125) + l2,1 (l2,L2 −1 + 1) + i (i + 1) "L # 2 X ≤ P(φ1 ) + P(φ2 ) + 2 i l2,n + (L2 − 1)(i + i2 ) + l2,1 (l2,L2 −1 + 1). n=1
Next observe that P(φ2 ) =
L2 X
" l2,n (l2,n−1 + 1) ≥ 2
n=1
L2 X
# l2,n ≥ 2 L2
(126)
n=1
and P(I) = i (d + 1) + d (i + 1) ≥ 2 i + 2.
(127)
This and (125) demonstrate that P(ψ) ≤ P(φ1 ) + P(φ2 ) + i P(φ2 ) + (i + i2 )P(φ2 ) + P(φ2 )(P(φ2 ) + 1) = P(φ1 ) + P(φ2 )(2 + 2 i + i2 ) + (P(φ2 ))2 ≤ P(φ1 ) + P(φ2 )(P(I) + (P(I))2 ) + (P(φ2 ))2 ≤ P(φ1 ) + 2 P(φ2 )(P(I))2 + (P(φ2 ))2 ≤ P(φ1 ) + (P(φ2 ) + P(I))3 .
(128)
Combining this with (122) establishes (109). The proof of Proposition 5.3 is thus completed.
28
5.3
Representations of the d-dimensional identities
Lemma 5.4 (Artificial neural networks with rectifier functions). Let $d \in \mathbb{N}$, let $\mathcal{A}_n\colon \mathbb{R}^n \to \mathbb{R}^n$, $n \in \mathbb{N}$, be the functions which satisfy for all $n \in \mathbb{N}$, $x = (x_1,\dots,x_n) \in \mathbb{R}^n$ that $\mathcal{A}_n(x) = (\max\{x_1,0\},\dots,\max\{x_n,0\})$, let
\[
\mathbf{N} = \cup_{L \in \{2,3,4,\dots\}} \cup_{(l_0,l_1,\dots,l_L) \in \mathbb{N}^{L+1}} \bigl(\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n})\bigr),
\tag{129}
\]
and let $\mathcal{P}\colon \mathbf{N} \to \mathbb{N}$, $\mathcal{L}\colon \mathbf{N} \to \cup_{L \in \{2,3,4,\dots\}} \mathbb{N}^{L+1}$, and $\mathcal{R}\colon \mathbf{N} \to \cup_{k,l \in \mathbb{N}} C(\mathbb{R}^k,\mathbb{R}^l)$ be the functions which satisfy for all $L \in \{2,3,4,\dots\}$, $l_0,l_1,\dots,l_L \in \mathbb{N}$, $\Phi = ((W_1,B_1),\dots,(W_L,B_L)) \in (\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n}))$, $x_0 \in \mathbb{R}^{l_0}$, \dots, $x_{L-1} \in \mathbb{R}^{l_{L-1}}$ with $\forall\, n \in \mathbb{N} \cap [1,L)\colon x_n = \mathcal{A}_{l_n}(W_n x_{n-1} + B_n)$ that $\mathcal{P}(\Phi) = \sum_{n=1}^{L} l_n(l_{n-1}+1)$, $\mathcal{R}(\Phi) \in C(\mathbb{R}^{l_0},\mathbb{R}^{l_L})$, $\mathcal{L}(\Phi) = (l_0,l_1,\dots,l_L)$, and
\[
(\mathcal{R}\Phi)(x_0) = W_L x_{L-1} + B_L.
\tag{130}
\]
Then there exists $\psi \in \mathbf{N}$ such that for all $x \in \mathbb{R}^d$ it holds that $\mathcal{R}(\psi) \in C(\mathbb{R}^d,\mathbb{R}^d)$, $\mathcal{L}(\psi) = (d,2d,d) \in \mathbb{N}^3$, and
\[
(\mathcal{R}\psi)(x) = x.
\tag{131}
\]

Proof of Lemma 5.4. Throughout this proof let $w_1 \in \mathbb{R}^{2\times 1}$, $w_2 \in \mathbb{R}^{1\times 2}$, $(W_1,B_1) \in (\mathbb{R}^{(2d)\times d} \times \mathbb{R}^{2d})$, and $(W_2,B_2) \in (\mathbb{R}^{d\times(2d)} \times \mathbb{R}^{d})$ satisfy that
\[
w_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \in \mathbb{R}^{2\times 1},
\qquad
W_1 = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_1 \end{pmatrix} \in \mathbb{R}^{(2d)\times d},
\qquad
B_1 = 0 \in \mathbb{R}^{2d},
\tag{132}
\]
\[
w_2 = \begin{pmatrix} 1 & -1 \end{pmatrix} \in \mathbb{R}^{1\times 2},
\qquad
W_2 = \begin{pmatrix} w_2 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_2 \end{pmatrix} \in \mathbb{R}^{d\times(2d)},
\qquad
B_2 = 0 \in \mathbb{R}^{d}
\tag{133}
\]
and let $\psi = ((W_1,B_1),(W_2,B_2)) \in \mathbf{N}$. Observe that for all $x = (x_1,x_2,\dots,x_d) \in \mathbb{R}^d$ it holds that
\[
W_1 x + B_1 = W_1 x = \begin{pmatrix} w_1 x_1 \\ w_1 x_2 \\ \vdots \\ w_1 x_d \end{pmatrix} \in \mathbb{R}^{2d}.
\tag{134}
\]
This ensures that for all $x = (x_1,x_2,\dots,x_d) \in \mathbb{R}^d$ it holds that
\[
\mathcal{A}_{2d}(W_1 x + B_1)
= \bigl(\max\{x_1,0\},\,\max\{-x_1,0\},\,\max\{x_2,0\},\,\max\{-x_2,0\},\,\dots,\,\max\{x_d,0\},\,\max\{-x_d,0\}\bigr) \in \mathbb{R}^{2d}.
\tag{135}
\]
Hence, we obtain that for all $x = (x_1,x_2,\dots,x_d) \in \mathbb{R}^d$ it holds that
\[
W_2\,\mathcal{A}_{2d}(W_1 x + B_1) + B_2
= \bigl(\max\{x_1,0\}-\max\{-x_1,0\},\,\dots,\,\max\{x_d,0\}-\max\{-x_d,0\}\bigr)
= x \in \mathbb{R}^d.
\tag{136}
\]
This demonstrates that for all $x \in \mathbb{R}^d$ it holds that
\[
(\mathcal{R}\psi)(x) = x.
\tag{137}
\]
Combining this with the fact that $\mathcal{L}(\psi) = (d,2d,d) \in \mathbb{N}^3$ establishes (131). The proof of Lemma 5.4 is thus completed.
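The construction in the proof of Lemma 5.4 is completely explicit and easy to check numerically. The following sketch is our illustration only (NumPy and the helper names are our own, not part of the paper); it assembles $W_1$ and $W_2$ from (132)–(133) as Kronecker products and verifies (137) on a random input.

```python
import numpy as np

def identity_relu_net(d):
    """Weights of the network psi from Lemma 5.4 with architecture (d, 2d, d)."""
    w1 = np.array([[1.0], [-1.0]])        # cf. (132)
    w2 = np.array([[1.0, -1.0]])          # cf. (133)
    W1 = np.kron(np.eye(d), w1)           # block-diagonal, shape (2d, d)
    W2 = np.kron(np.eye(d), w2)           # block-diagonal, shape (d, 2d)
    return (W1, np.zeros(2 * d)), (W2, np.zeros(d))

def realization(params, x):
    """(R psi)(x): ReLU after the first affine map, none after the last, cf. (130)."""
    (W1, b1), (W2, b2) = params
    hidden = np.maximum(W1 @ x + b1, 0.0)   # A_{2d}(W_1 x + B_1), cf. (135)
    return W2 @ hidden + b2                 # cf. (136)

d = 7
x = np.random.default_rng(0).normal(size=d)
assert np.allclose(realization(identity_relu_net(d), x), x)
```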
6 DNN approximations for partial differential equations (PDEs)
In this section we establish in our main result in Theorem 6.3 in Subsection 6.2 below that rectified DNNs have the capacity to approximate solutions of second-order Kolmogorov PDEs with nonlinear drift and constant diffusion coefficients without suffering from the curse of dimensionality. Our proof of Theorem 6.3 is based on an application of Corollary 6.2 in Subsection 6.1 below. Corollary 6.2, in turn, follows immediately from Proposition 6.1 in Subsection 6.1 below. Proposition 6.1 is, roughly speaking, a generalized version of Theorem 6.3 which allows for a general activation function instead of only the rectifier function. Proposition 6.1 shows for every $p \in [2,\infty)$ that the $L^p(\nu_d;\mathbb{R})$-distance between the solution of the PDE at the time of maturity and the DNN is less than or equal to the prescribed approximation accuracy $\varepsilon > 0$, where $\nu_d\colon \mathcal{B}(\mathbb{R}^d) \to [0,1]$, $d \in \mathbb{N}$, is a suitable sequence of probability measures. Corollary 6.2 slightly generalizes this result, in particular, by assuming that $p \in (0,\infty)$ is an arbitrary strictly positive real number instead of assuming that $p \in [2,\infty)$ is greater than or equal to $2$ (cf. Proposition 6.1). Finally, in Corollary 6.4 in Subsection 6.3 below we specialize Theorem 6.3 in Subsection 6.2 to the case where for every $d \in \mathbb{N}$ the probability measure $\nu_d$ is the uniform distribution on the $d$-dimensional unit cube $[0,1]^d$. Theorem 1.1 in the introduction in Section 1 above follows directly from Corollary 6.4 in Subsection 6.3.
6.1 DNN approximations with general activation functions
Proposition 6.1. Let $T,\kappa \in (0,\infty)$, $\eta \in [1,\infty)$, $p \in [2,\infty)$, let $A_d = (a_{d,i,j})_{(i,j)\in\{1,\dots,d\}^2} \in \mathbb{R}^{d\times d}$, $d \in \mathbb{N}$, be symmetric positive semidefinite matrices, for every $d \in \mathbb{N}$ let $\|\cdot\|_{\mathbb{R}^d}\colon \mathbb{R}^d \to [0,\infty)$ be the $d$-dimensional Euclidean norm and let $\nu_d\colon \mathcal{B}(\mathbb{R}^d) \to [0,1]$ be a probability measure, let $f_{0,d}\colon \mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, and $f_{1,d}\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, be functions, let $\mathcal{A}_d\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, and $a\colon \mathbb{R} \to \mathbb{R}$ be continuous functions which satisfy for all $d \in \mathbb{N}$, $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ that $\mathcal{A}_d(x) = (a(x_1),\dots,a(x_d))$, let
\[
\mathbf{N} = \cup_{L \in \{2,3,4,\dots\}} \cup_{(l_0,l_1,\dots,l_L) \in \mathbb{N}^{L+1}} \bigl(\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n})\bigr),
\tag{138}
\]
let $\mathcal{P}\colon \mathbf{N} \to \mathbb{N}$, $\mathcal{L}\colon \mathbf{N} \to \cup_{L \in \{2,3,4,\dots\}} \mathbb{N}^{L+1}$, and $\mathcal{R}\colon \mathbf{N} \to \cup_{k,l \in \mathbb{N}} C(\mathbb{R}^k,\mathbb{R}^l)$ be the functions which satisfy for all $L \in \{2,3,4,\dots\}$, $l_0,l_1,\dots,l_L \in \mathbb{N}$, $\Phi = ((W_1,B_1),\dots,(W_L,B_L)) \in (\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n}))$, $x_0 \in \mathbb{R}^{l_0}$, \dots, $x_{L-1} \in \mathbb{R}^{l_{L-1}}$ with $\forall\, n \in \mathbb{N} \cap [1,L)\colon x_n = \mathcal{A}_{l_n}(W_n x_{n-1} + B_n)$ that $\mathcal{P}(\Phi) = \sum_{n=1}^{L} l_n(l_{n-1}+1)$, $\mathcal{R}(\Phi) \in C(\mathbb{R}^{l_0},\mathbb{R}^{l_L})$, $\mathcal{L}(\Phi) = (l_0,l_1,\dots,l_L)$, and
\[
(\mathcal{R}\Phi)(x_0) = W_L x_{L-1} + B_L,
\tag{139}
\]
let $(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, $(\phi^{2,d})_{d\in\mathbb{N}} \subseteq \mathbf{N}$, and assume for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, $x,y \in \mathbb{R}^d$ that $\mathcal{R}(\phi^{0,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, $\mathcal{R}(\phi^{1,d}_{\varepsilon}), \mathcal{R}(\phi^{2,d}) \in C(\mathbb{R}^d,\mathbb{R}^d)$, $|f_{0,d}(x)| + \sum_{i,j=1}^{d} |a_{d,i,j}| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d})$, $\|f_{1,d}(x)-f_{1,d}(y)\|_{\mathbb{R}^d} \le \kappa\|x-y\|_{\mathbb{R}^d}$, $\|(\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \kappa(d^{\kappa}+\|x\|_{\mathbb{R}^d})$, $\mathcal{P}(\phi^{2,d}) + \sum_{m=0}^{1}\mathcal{P}(\phi^{m,d}_{\varepsilon}) \le \kappa d^{\kappa}\varepsilon^{-\kappa}$, $|(\mathcal{R}\phi^{0,d}_{\varepsilon})(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(y)| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d})\|x-y\|_{\mathbb{R}^d}$, $\mathcal{L}(\phi^{2,d}) \in \mathbb{N}^3$, $(\mathcal{R}\phi^{2,d})(x) = x$, $\int_{\mathbb{R}^d}\|z\|^{p(2\kappa+1)}_{\mathbb{R}^d}\,\nu_d(dz) \le \eta d^{\eta}$, and
\[
|f_{0,d}(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(x)| + \|f_{1,d}(x) - (\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \varepsilon\kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}).
\tag{140}
\]
Then
(i) there exist unique at most polynomially growing functions $u_d\colon [0,T]\times\mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, such that for all $d \in \mathbb{N}$, $x \in \mathbb{R}^d$ it holds that $u_d(0,x) = f_{0,d}(x)$ and such that for all $d \in \mathbb{N}$ it holds that $u_d$ is a viscosity solution of
\[
(\tfrac{\partial}{\partial t}u_d)(t,x) = (\tfrac{\partial}{\partial x}u_d)(t,x)\,f_{1,d}(x) + \sum_{i,j=1}^{d} a_{d,i,j}\,(\tfrac{\partial^2}{\partial x_i \partial x_j}u_d)(t,x)
\tag{141}
\]
for $(t,x) \in (0,T)\times\mathbb{R}^d$ and
(ii) there exist $(\psi_{d,\varepsilon})_{(d,\varepsilon)\in\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, $c \in \mathbb{R}$ such that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that $\mathcal{P}(\psi_{d,\varepsilon}) \le c\,d^{c}\varepsilon^{-c}$, $\mathcal{R}(\psi_{d,\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, and
\[
\biggl(\int_{\mathbb{R}^d} |u_d(T,x) - (\mathcal{R}\psi_{d,\varepsilon})(x)|^{p}\,\nu_d(dx)\biggr)^{1/p} \le \varepsilon.
\tag{142}
\]
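Before entering the proof it may help to keep the underlying mechanism in mind (this remark is our orientation, not part of the statement): by the Feynman–Kac-type result of Theorem 3.1, which the proof invokes in (150) below, the solution at maturity admits the stochastic representation
\[
u_d(T,x) = \mathbb{E}\bigl[f_{0,d}\bigl(X^{d,x}_T\bigr)\bigr],
\qquad
X^{d,x}_t = x + \int_0^t f_{1,d}\bigl(X^{d,x}_s\bigr)\,ds + \sqrt{2A_d}\,W^{d,1}_t,
\]
and the approximating network is obtained by replacing $f_{0,d}$ and $f_{1,d}$ with their DNN surrogates, discretizing the SDE in time, and averaging over Monte Carlo samples (cf. (145), (147), and (179) below).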
Proof of Proposition 6.1. Throughout this proof let $\iota \in \mathbb{R}$ be the real number given by $\iota = \max\{\kappa,1\}$, let $\mathbf{A}_d \in \mathbb{R}^{d\times d}$, $d \in \mathbb{N}$, satisfy for all $d \in \mathbb{N}$ that $\mathbf{A}_d = \sqrt{2A_d}$, let $\Phi^{0,d}_{\delta}\colon \mathbb{R}^d \to \mathbb{R}$, $\delta \in (0,1]$, $d \in \mathbb{N}$, and $\Phi^{1,d}_{\delta}\colon \mathbb{R}^d \to \mathbb{R}^d$, $\delta \in (0,1]$, $d \in \mathbb{N}$, be the functions which satisfy for all $m \in \{0,1\}$, $d \in \mathbb{N}$, $\delta \in (0,1]$, $x \in \mathbb{R}^d$ that
\[
\Phi^{m,d}_{\delta}(x) = (\mathcal{R}\phi^{m,d}_{\delta})(x),
\tag{143}
\]
let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $W^{d,m}\colon [0,T]\times\Omega \to \mathbb{R}^d$, $d,m \in \mathbb{N}$, be independent standard Brownian motions, let $\varpi_{d,q} \in \mathbb{R}$, $d \in \mathbb{N}$, $q \in (0,\infty)$, satisfy for all $q \in (0,\infty)$, $d \in \mathbb{N}$ that
\[
\varpi_{d,q} = \bigl(\mathbb{E}\bigl[\|\mathbf{A}_d W^{d,1}_T\|^q_{\mathbb{R}^d}\bigr]\bigr)^{1/q},
\tag{144}
\]
let $X^{d,x}\colon [0,T]\times\Omega \to \mathbb{R}^d$, $d \in \mathbb{N}$, $x \in \mathbb{R}^d$, be stochastic processes with continuous sample paths which satisfy for all $x \in \mathbb{R}^d$, $d \in \mathbb{N}$, $t \in [0,T]$ that
\[
X^{d,x}_t = x + \int_0^t f_{1,d}\bigl(X^{d,x}_s\bigr)\,ds + \mathbf{A}_d W^{d,1}_t
\tag{145}
\]
(cf. Theorem 3.1), let $\chi_{\delta}\colon [0,T] \to [0,T]$, $\delta \in (0,1]$, be the functions which satisfy for all $\delta \in (0,1]$, $t \in [0,T]$ that
\[
\chi_{\delta}(t) = \max\bigl(\{0,\delta^2,2\delta^2,3\delta^2,\dots\} \cap [0,t]\bigr),
\tag{146}
\]
let $Y^{\delta,d,m,x}\colon [0,T]\times\Omega \to \mathbb{R}^d$, $\delta \in (0,1]$, $d,m \in \mathbb{N}$, $x \in \mathbb{R}^d$, be stochastic processes with continuous sample paths which satisfy for all $x \in \mathbb{R}^d$, $d,m \in \mathbb{N}$, $\delta \in (0,1]$, $t \in [0,T]$ that
\[
Y^{\delta,d,m,x}_t = x + \int_0^t \Phi^{1,d}_{\delta}\bigl(Y^{\delta,d,m,x}_{\chi_{\delta}(s)}\bigr)\,ds + \mathbf{A}_d W^{d,m}_t,
\tag{147}
\]
let $M_{d,\varepsilon} \in \mathbb{N}$, $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, be the natural numbers which satisfy for all $\varepsilon \in (0,1]$, $d \in \mathbb{N}$ that
\[
M_{d,\varepsilon} = \min\Biggl(\mathbb{N} \cap \Biggl[\biggl[\frac{2^{(\kappa+4)}\,p\,\kappa\,d^{\kappa}\,e^{\kappa^2 T}}{\varepsilon}\biggr]^2 \Bigl[\bigl(1+\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\bigr)^{p\kappa} + \eta d^{\eta}\Bigr]^{2/p},\,\infty\Biggr)\Biggr),
\tag{148}
\]
and let $D_{d,\varepsilon} \in (0,1]$, $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, be the real numbers which satisfy for all $\varepsilon \in (0,1]$, $d \in \mathbb{N}$ that
\[
\begin{aligned}
D_{d,\varepsilon} = {} & \varepsilon\,\bigl[\max\{2\kappa d^{\kappa},1\} + T^{-1/2}\bigr]^{-1} e^{-(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\,\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^{-1}\, 2^{-(2\iota+1)} \\
& \cdot \Bigl[\bigl(2+\max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\}+\sqrt{2(2\iota-1)\kappa d^{\kappa}T}\bigr)^{p\iota+p\kappa} + \eta d^{\eta}\Bigr]^{-1/p}.
\end{aligned}
\tag{149}
\]
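Algorithmically, the processes $Y^{\delta,d,m,x}$ and the averages $\frac{1}{M}\sum_{m}\Phi^{0,d}_{\delta}(Y^{\delta,d,m,x}_T)$ used below are nothing but an Euler scheme whose drift is frozen on the grid $\{0,\delta^2,2\delta^2,\dots\}$, followed by a Monte Carlo average. A minimal NumPy sketch of this estimator is given next; it is our illustration only (the surrogates are passed as plain callables rather than as networks, and all names are our own).

```python
import numpy as np

def mc_euler_estimator(x, phi0, phi1, A_sqrt2, T, delta, M, rng):
    """(1/M) * sum_m phi0(Y_T^{delta,d,m,x}), with Y as in (147).

    phi0, phi1 : callables standing in for R(phi^{0,d}) and R(phi^{1,d})
    A_sqrt2    : the matrix sqrt(2 A_d) driving the noise
    delta      : the drift is frozen on the grid {0, delta^2, 2 delta^2, ...}
    """
    d, h = len(x), delta ** 2
    grid = np.arange(0.0, T, h)                  # left endpoints of the Euler steps
    total = 0.0
    for _ in range(M):
        y = np.array(x, dtype=float)
        for t in grid:
            dt = min(h, T - t)                   # last step may be shorter
            dW = rng.normal(scale=np.sqrt(dt), size=d)
            y = y + phi1(y) * dt + A_sqrt2 @ dW  # drift evaluated at Y_{chi_delta(s)}
        total += phi0(y)
    return total / M
```

In the proof, $\delta = D_{d,\varepsilon}$ and $M = M_{d,\varepsilon}$ are chosen so that the discretization bias and the Monte Carlo fluctuation each contribute only a fixed fraction of $\varepsilon^p$, cf. (167) below.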
Observe that the assumption that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that $\mathcal{R}(\phi^{0,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$ and (140) ensure that $f_{0,d} \in C(\mathbb{R}^d,\mathbb{R})$. This, the fact that for all $d \in \mathbb{N}$ it holds that the function $f_{1,d}\colon \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz continuous, and Theorem 3.1 establish item (i). It thus remains to prove item (ii). For this note that the fact that $\forall\, y,z \in \mathbb{R},\ \alpha \in [1,\infty)\colon |y+z|^{\alpha} \le 2^{\alpha-1}(|y|^{\alpha}+|z|^{\alpha})$ and Theorem 3.1 ensure that for all $M,d \in \mathbb{N}$, $\delta \in (0,1]$ it holds that
\[
\begin{aligned}
& \int_{\mathbb{R}^d} \mathbb{E}\Biggl[\biggl|u_d(T,x) - \frac{1}{M}\sum_{m=1}^{M}\Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T\bigr)\biggr|^p\Biggr]\nu_d(dx) \\
&\le 2^{p-1}\int_{\mathbb{R}^d} \mathbb{E}\Bigl[\bigl|u_d(T,x) - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\Bigr]\nu_d(dx)
+ 2^{p-1}\int_{\mathbb{R}^d} \mathbb{E}\Biggl[\biggl|\mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr] - \frac{1}{M}\sum_{m=1}^{M}\Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T\bigr)\biggr|^p\Biggr]\nu_d(dx) \\
&= 2^{p-1}\int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx)
+ 2^{p-1}\int_{\mathbb{R}^d} \mathbb{E}\Biggl[\biggl|\mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr] - \frac{1}{M}\sum_{m=1}^{M}\Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T\bigr)\biggr|^p\Biggr]\nu_d(dx).
\end{aligned}
\tag{150}
\]
The fact that $2\sqrt{p-1} \le p$ and, e.g., Grohs et al. [26, Corollary 2.5] hence prove that for all $M,d \in \mathbb{N}$, $\delta \in (0,1]$ it holds that
\[
\begin{aligned}
& \int_{\mathbb{R}^d} \mathbb{E}\Biggl[\biggl|u_d(T,x) - \frac{1}{M}\sum_{m=1}^{M}\Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T\bigr)\biggr|^p\Biggr]\nu_d(dx) \\
&\le 2^{p-1}\int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx)
+ \frac{2^{p-1}(2\sqrt{p-1})^p}{M^{p/2}}\int_{\mathbb{R}^d} \mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T) - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\Bigr]\nu_d(dx) \\
&\le 2^{p-1}\int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx)
+ \frac{2^{p-1}p^p}{M^{p/2}}\int_{\mathbb{R}^d} \mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T) - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\Bigr]\nu_d(dx).
\end{aligned}
\tag{151}
\]
The fact that $\forall\, y,z \in \mathbb{R},\ \alpha \in [1,\infty)\colon |y+z|^{\alpha} \le 2^{\alpha-1}(|y|^{\alpha}+|z|^{\alpha})$ and Jensen's inequality therefore assure that for all $M,d \in \mathbb{N}$, $\delta \in (0,1]$ it holds that
\[
\begin{aligned}
& \int_{\mathbb{R}^d} \mathbb{E}\Biggl[\biggl|u_d(T,x) - \frac{1}{M}\sum_{m=1}^{M}\Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T\bigr)\biggr|^p\Biggr]\nu_d(dx) \\
&\le 2^{p-1}\int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx)
+ \frac{2^{2(p-1)}p^p}{M^{p/2}}\int_{\mathbb{R}^d} \Bigl(\mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr|^p\Bigr] + \bigl|\mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\Bigr)\nu_d(dx) \\
&\le 2^{p-1}\int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx)
+ \frac{2^{2p-1}p^p}{M^{p/2}}\int_{\mathbb{R}^d} \mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr|^p\Bigr]\nu_d(dx).
\end{aligned}
\tag{152}
\]
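The second summand in (152) is the Monte Carlo fluctuation; by [26, Corollary 2.5] it decays like $M^{-1/2}$ in $L^p$, which is what forces $M_{d,\varepsilon}$ in (148) to scale like $\varepsilon^{-2}$. A quick empirical illustration of this rate with the sketch above (a toy drift and terminal condition of our own choosing, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, delta = 3, 1.0, 0.5
A_sqrt2 = np.sqrt(2.0) * np.eye(d)           # corresponds to A_d = I_d
phi1 = lambda y: -y                           # toy Lipschitz drift surrogate
phi0 = lambda y: float(np.sum(y ** 2))        # toy terminal-condition surrogate
x = np.ones(d)

reference = mc_euler_estimator(x, phi0, phi1, A_sqrt2, T, delta, 100_000, rng)
for M in (100, 400, 1600, 6400):
    errs = [abs(mc_euler_estimator(x, phi0, phi1, A_sqrt2, T, delta, M, rng) - reference)
            for _ in range(20)]
    print(M, np.mean(errs))                   # mean error roughly halves as M quadruples
```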
Next observe that for all $d \in \mathbb{N}$, $x,y \in \mathbb{R}^d$ it holds that
\[
\int_0^1 \bigl(r\|x\|_{\mathbb{R}^d}+(1-r)\|y\|_{\mathbb{R}^d}\bigr)^{\kappa}\,dr
\ge \frac{1}{2}\int_0^1 \bigl(r^{\kappa}\|x\|^{\kappa}_{\mathbb{R}^d}+(1-r)^{\kappa}\|y\|^{\kappa}_{\mathbb{R}^d}\bigr)\,dr
= \frac{\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d}}{2}\int_0^1 r^{\kappa}\,dr
= \frac{\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d}}{2(\kappa+1)}.
\tag{153}
\]
This and the hypothesis that $\forall\, d \in \mathbb{N},\ \delta \in (0,1],\ x,y \in \mathbb{R}^d\colon |(\mathcal{R}\phi^{0,d}_{\delta})(x)-(\mathcal{R}\phi^{0,d}_{\delta})(y)| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d})\|x-y\|_{\mathbb{R}^d}$ prove that for all $d \in \mathbb{N}$, $\delta \in (0,1]$, $x,y \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
\bigl|\Phi^{0,d}_{\delta}(x)-\Phi^{0,d}_{\delta}(y)\bigr|
&= \bigl|(\mathcal{R}\phi^{0,d}_{\delta})(x)-(\mathcal{R}\phi^{0,d}_{\delta})(y)\bigr|
\le \kappa d^{\kappa}\bigl(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d}\bigr)\|x-y\|_{\mathbb{R}^d} \\
&\le \kappa d^{\kappa}\biggl(1+2(\kappa+1)\int_0^1 \bigl(r\|x\|_{\mathbb{R}^d}+(1-r)\|y\|_{\mathbb{R}^d}\bigr)^{\kappa}\,dr\biggr)\|x-y\|_{\mathbb{R}^d} \\
&\le 2\kappa(\kappa+1)d^{\kappa}\biggl(1+\int_0^1 \bigl(r\|x\|_{\mathbb{R}^d}+(1-r)\|y\|_{\mathbb{R}^d}\bigr)^{\kappa}\,dr\biggr)\|x-y\|_{\mathbb{R}^d}.
\end{aligned}
\tag{154}
\]
Proposition 4.6 (applied with $d=d$, $m=d$, $\xi=x$, $T=T$, $c=\kappa$, $C=\kappa d^{\kappa}$, $\varepsilon_0=\delta\kappa d^{\kappa}$, $\varepsilon_1=\delta\kappa d^{\kappa}$, $\varepsilon_2=0$, $\varsigma_0=\kappa$, $\varsigma_1=\kappa$, $\varsigma_2=0$, $L_0=2\kappa(\kappa+1)d^{\kappa}$, $L_1=\kappa$, $\ell=\kappa$, $h=\min\{\delta^2,T\}$, $B=\mathbf{A}_d$, $p=2$, $q=2$, $\|\cdot\|=\|\cdot\|_{\mathbb{R}^d}$, $(\Omega,\mathcal{F},\mathbb{P})=(\Omega,\mathcal{F},\mathbb{P})$, $W=W^{d,1}$, $\phi_0=\Phi^{0,d}_{\delta}$, $f_1=f_{1,d}$, $\phi_2=\operatorname{id}_{\mathbb{R}^d}$, $\chi=\chi_{\min\{\delta,\sqrt{T}\}}$, $f_0=f_{0,d}$, $\phi_1=\Phi^{1,d}_{\delta}$, $(\varpi_r)_{r\in(0,\infty)}=(\varpi_{d,r})_{r\in(0,\infty)}$, $X=X^{d,x}$, $Y=Y^{\delta,d,1,x}$ for $d \in \mathbb{N}$, $x \in \mathbb{R}^d$, $\delta \in (0,1]$ in the notation of Proposition 4.6) hence ensures that for all $d \in \mathbb{N}$, $\delta \in (0,1]$, $x \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
& \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p \\
&\le \bigl[2\delta\kappa d^{\kappa} + (\min\{\delta^2,T\}/T)^{1/2}\bigr]^p\,
e^{p(\kappa+3+2\kappa+[\kappa\max\{\kappa,\kappa\}+\kappa\max\{\kappa,1\}+\kappa\max\{\kappa,1\}+2]T)}\,
\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^p \\
&\quad\cdot \Bigl[\|x\|_{\mathbb{R}^d} + \max\{1,0\}(1+1) + \max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\} + \varpi_{d,\max\{\kappa,2\kappa,2,2\kappa\}}\Bigr]^{p\max\{1,\kappa,\kappa\}+p\kappa} \\
&\le \bigl[2\delta\kappa d^{\kappa} + (\delta^2/T)^{1/2}\bigr]^p\,
e^{p(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\,
\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^p
\Bigl[\|x\|_{\mathbb{R}^d} + 2 + \max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\} + \varpi_{d,\max\{2\kappa,2\}}\Bigr]^{p\iota+p\kappa}.
\end{aligned}
\tag{155}
\]
Moreover, note that Lemma 4.2 assures that for all $q \in [2,\infty)$, $d \in \mathbb{N}$ it holds that
\[
\varpi_{d,q} \le \sqrt{(q-1)\operatorname{Trace}(\mathbf{A}_d^{*}\mathbf{A}_d)T}
= \sqrt{2(q-1)\operatorname{Trace}(A_d)T}
\le \sqrt{2(q-1)\kappa d^{\kappa}T}.
\tag{156}
\]
This, (155), and the fact that $\forall\, y,z \in \mathbb{R},\ \alpha \in [1,\infty)\colon |y+z|^{\alpha} \le 2^{\alpha-1}(|y|^{\alpha}+|z|^{\alpha})$ demonstrate that for all $d \in \mathbb{N}$, $\delta \in (0,1]$ it holds that
\[
\begin{aligned}
& \int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx) \\
&\le \delta^p\bigl[2\kappa d^{\kappa} + T^{-1/2}\bigr]^p e^{p(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^p \\
&\quad\cdot \int_{\mathbb{R}^d}\Bigl[\|x\|_{\mathbb{R}^d} + 2 + \max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\} + \sqrt{2(2\iota-1)\kappa d^{\kappa}T}\Bigr]^{p\iota+p\kappa}\nu_d(dx) \\
&\le \delta^p\bigl[2\kappa d^{\kappa} + T^{-1/2}\bigr]^p e^{p(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^p\, 2^{p\iota+p\kappa-1} \\
&\quad\cdot \Biggl(\Bigl[2 + \max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\} + \sqrt{2(2\iota-1)\kappa d^{\kappa}T}\Bigr]^{p\iota+p\kappa} + \int_{\mathbb{R}^d}\|x\|^{p\iota+p\kappa}_{\mathbb{R}^d}\nu_d(dx)\Biggr).
\end{aligned}
\tag{157}
\]
Next note that the fact that $\iota \le \kappa+1$ and Hölder's inequality prove that for all $d \in \mathbb{N}$ it holds that
\[
\int_{\mathbb{R}^d}\|x\|^{p\iota+p\kappa}_{\mathbb{R}^d}\,\nu_d(dx)
\le \biggl(\int_{\mathbb{R}^d}\|x\|^{p(2\kappa+1)}_{\mathbb{R}^d}\,\nu_d(dx)\biggr)^{(\iota+\kappa)/(2\kappa+1)}
\le \bigl[\eta d^{\eta}\bigr]^{(\iota+\kappa)/(2\kappa+1)}
\le \eta d^{\eta}.
\tag{158}
\]
Combining this and (157) ensures that for all $d \in \mathbb{N}$, $\delta \in (0,1]$ it holds that
\[
\begin{aligned}
& 2^{p-1}\int_{\mathbb{R}^d} \bigl|\mathbb{E}\bigl[f_{0,d}(X^{d,x}_T)\bigr] - \mathbb{E}\bigl[\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr]\bigr|^p\,\nu_d(dx) \\
&\le \delta^p\bigl[2\kappa d^{\kappa} + T^{-1/2}\bigr]^p e^{p(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^p\, 2^{p(2\iota+1)-2} \\
&\quad\cdot \Biggl(\Bigl[2 + \max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\} + \sqrt{2(2\iota-1)\kappa d^{\kappa}T}\Bigr]^{p\iota+p\kappa} + \eta d^{\eta}\Biggr).
\end{aligned}
\tag{159}
\]
Next observe that for all $d \in \mathbb{N}$, $\delta \in (0,1]$, $x \in \mathbb{R}^d$ it holds that
\[
\bigl|\Phi^{0,d}_{\delta}(x)\bigr| \le \bigl|\Phi^{0,d}_{\delta}(x)-f_{0,d}(x)\bigr| + |f_{0,d}(x)|
\le \delta\kappa d^{\kappa}\bigl(1+\|x\|^{\kappa}_{\mathbb{R}^d}\bigr) + \kappa d^{\kappa}\bigl(1+\|x\|^{\kappa}_{\mathbb{R}^d}\bigr)
\le 2\kappa d^{\kappa}\bigl(1+\|x\|^{\kappa}_{\mathbb{R}^d}\bigr).
\tag{160}
\]
Moreover, note that Lemma 4.1 shows that for all $q \in [1,\infty)$, $\delta \in (0,1]$, $d,m \in \mathbb{N}$, $x \in \mathbb{R}^d$ it holds that
\[
\bigl(\mathbb{E}\bigl[\|Y^{\delta,d,m,x}_T\|^q_{\mathbb{R}^d}\bigr]\bigr)^{1/q}
\le \Bigl[\|x\|_{\mathbb{R}^d} + \kappa d^{\kappa}T + \bigl(\mathbb{E}\bigl[\|\mathbf{A}_d W^{d,m}_T\|^q_{\mathbb{R}^d}\bigr]\bigr)^{1/q}\Bigr] e^{\kappa T}
= \bigl[\|x\|_{\mathbb{R}^d} + \kappa d^{\kappa}T + \varpi_{d,q}\bigr] e^{\kappa T}.
\tag{161}
\]
This and (156) demonstrate that for all $q \in [2,\infty)$, $\delta \in (0,1]$, $d,m \in \mathbb{N}$, $x \in \mathbb{R}^d$ it holds that
\[
\bigl(\mathbb{E}\bigl[\|Y^{\delta,d,m,x}_T\|^q_{\mathbb{R}^d}\bigr]\bigr)^{1/q}
\le \Bigl[\|x\|_{\mathbb{R}^d} + \kappa d^{\kappa}T + \sqrt{2(q-1)\kappa d^{\kappa}T}\Bigr] e^{\kappa T}.
\tag{162}
\]
Combining this with (160), the fact that $\forall\, y,z \in \mathbb{R},\ \alpha \in [1,\infty)\colon |y+z|^{\alpha} \le 2^{\alpha-1}(|y|^{\alpha}+|z|^{\alpha})$, and Hölder's inequality ensures that for all $\delta \in (0,1]$, $d \in \mathbb{N}$ it holds that
\[
\begin{aligned}
\int_{\mathbb{R}^d}\mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr|^p\Bigr]\nu_d(dx)
&\le \bigl[2\kappa d^{\kappa}\bigr]^p \int_{\mathbb{R}^d}\mathbb{E}\Bigl[\bigl(1+\|Y^{\delta,d,1,x}_T\|^{\kappa}_{\mathbb{R}^d}\bigr)^p\Bigr]\nu_d(dx)
\le \bigl[2\kappa d^{\kappa}\bigr]^p\, 2^{p-1}\int_{\mathbb{R}^d}\mathbb{E}\Bigl[1+\|Y^{\delta,d,1,x}_T\|^{p\kappa}_{\mathbb{R}^d}\Bigr]\nu_d(dx) \\
&\le \bigl[4\kappa d^{\kappa}\bigr]^p \biggl(1+\int_{\mathbb{R}^d}\mathbb{E}\Bigl[\|Y^{\delta,d,1,x}_T\|^{p\kappa}_{\mathbb{R}^d}\Bigr]\nu_d(dx)\biggr)
\le \bigl[4\kappa d^{\kappa}\bigr]^p \biggl(1+\int_{\mathbb{R}^d}\Bigl(\mathbb{E}\Bigl[\|Y^{\delta,d,1,x}_T\|^{p\iota}_{\mathbb{R}^d}\Bigr]\Bigr)^{\kappa/\iota}\nu_d(dx)\biggr) \\
&\le \bigl[4\kappa d^{\kappa}\bigr]^p \biggl(1+\int_{\mathbb{R}^d}\Bigl[\bigl(\|x\|_{\mathbb{R}^d}+\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\bigr)e^{\kappa T}\Bigr]^{p\kappa}\nu_d(dx)\biggr).
\end{aligned}
\tag{163}
\]
The fact that $\forall\, y,z \in \mathbb{R},\ \alpha \in [0,\infty)\colon |y+z|^{\alpha} \le 2^{\alpha}(|y|^{\alpha}+|z|^{\alpha})$ hence proves that for all $\delta \in (0,1]$, $d \in \mathbb{N}$ it holds that
\[
\begin{aligned}
\int_{\mathbb{R}^d}\mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr|^p\Bigr]\nu_d(dx)
&\le \bigl[4\kappa d^{\kappa}\bigr]^p \Biggl(1+2^{p\kappa}e^{p\kappa^2 T}\biggl[\Bigl(\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\Bigr)^{p\kappa}+\int_{\mathbb{R}^d}\|x\|^{p\kappa}_{\mathbb{R}^d}\nu_d(dx)\biggr]\Biggr) \\
&\le \bigl[4\kappa d^{\kappa}\bigr]^p\, 2^{p\kappa}e^{p\kappa^2 T}\Biggl(1+\Bigl(\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\Bigr)^{p\kappa}+\int_{\mathbb{R}^d}\|x\|^{p\kappa}_{\mathbb{R}^d}\nu_d(dx)\Biggr).
\end{aligned}
\tag{164}
\]
Next note that Hölder's inequality shows that for all $d \in \mathbb{N}$ it holds that
\[
\int_{\mathbb{R}^d}\|x\|^{p\kappa}_{\mathbb{R}^d}\,\nu_d(dx)
\le \biggl(\int_{\mathbb{R}^d}\|x\|^{p(2\kappa+1)}_{\mathbb{R}^d}\,\nu_d(dx)\biggr)^{\kappa/(2\kappa+1)}
\le \bigl[\eta d^{\eta}\bigr]^{\kappa/(2\kappa+1)}
\le \eta d^{\eta}.
\tag{165}
\]
Combining this and (164) ensures that for all $\delta \in (0,1]$, $d \in \mathbb{N}$ it holds that
\[
2^{2p-1}p^p\int_{\mathbb{R}^d}\mathbb{E}\Bigl[\bigl|\Phi^{0,d}_{\delta}(Y^{\delta,d,1,x}_T)\bigr|^p\Bigr]\nu_d(dx)
\le \frac{\bigl[2^{(\kappa+4)}\,p\,\kappa\,d^{\kappa}\,e^{\kappa^2 T}\bigr]^p}{2}
\Bigl[\bigl(1+\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\bigr)^{p\kappa}+\eta d^{\eta}\Bigr].
\tag{166}
\]
This, (152), and (159) prove that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\int_{\mathbb{R}^d}\mathbb{E}\Biggl[\biggl|u_d(T,x)-\frac{1}{M_{d,\varepsilon}}\sum_{m=1}^{M_{d,\varepsilon}}\Phi^{0,d}_{D_{d,\varepsilon}}\bigl(Y^{D_{d,\varepsilon},d,m,x}_T\bigr)\biggr|^p\Biggr]\nu_d(dx)
\le \frac{\varepsilon^p}{4}+\frac{\varepsilon^p}{2} < \varepsilon^p.
\tag{167}
\]
Corollary 2.4 therefore assures that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ there exists $w_{d,\varepsilon} \in \Omega$ such that
\[
\Biggl[\int_{\mathbb{R}^d}\biggl|u_d(T,x)-\frac{1}{M_{d,\varepsilon}}\sum_{m=1}^{M_{d,\varepsilon}}\bigl(\mathcal{R}\phi^{0,d}_{D_{d,\varepsilon}}\bigr)\bigl(Y^{D_{d,\varepsilon},d,m,x}_T(w_{d,\varepsilon})\bigr)\biggr|^p\nu_d(dx)\Biggr]^{1/p} < \varepsilon.
\tag{168}
\]
Moreover, note that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\begin{aligned}
M_{d,\varepsilon}
&\le \biggl[\frac{2^{(\kappa+4)}\,p\,\kappa\,d^{\kappa}\,e^{\kappa^2 T}}{\varepsilon}\biggr]^2 \Bigl[\bigl(1+\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\bigr)^{p\kappa} + \eta d^{\eta}\Bigr]^{2/p} + 1 \\
&= 2^{2(\kappa+4)}p^2\kappa^2 d^{2\kappa}e^{2\kappa^2 T}\Bigl[\bigl(1+\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\bigr)^{p\kappa} + \eta d^{\eta}\Bigr]^{2/p}\varepsilon^{-2} + 1 \\
&\le 2^{2(\kappa+4)}p^2\kappa^2 d^{2\kappa}e^{2\kappa^2 T}\Bigl[\bigl(1+\kappa d^{\kappa}T+\sqrt{2(p\iota-1)\kappa d^{\kappa}T}\bigr)^{p\kappa} + \eta d^{\eta}\Bigr]\varepsilon^{-2} + 1 \\
&\le 2^{2(\kappa+4)}p^2\kappa^2 d^{2\kappa}e^{2\kappa^2 T}\Bigl[\bigl(1+\iota d^{\iota}T+p\iota d^{\iota}\sqrt{T}\bigr)^{p\kappa} + \eta d^{\eta}\Bigr]\varepsilon^{-2} + 1.
\end{aligned}
\tag{169}
\]
Hence, we obtain that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\begin{aligned}
M_{d,\varepsilon}
&\le 2^{2(\kappa+4)}p^2\kappa^2 d^{2\kappa}e^{2\kappa^2 T}\Bigl[\bigl(1+2p\iota d^{\iota}\max\{1,T\}\bigr)^{p\kappa} + \eta d^{\eta}\Bigr]\varepsilon^{-2} + 1 \\
&\le 2^{2(\kappa+4)}p^2\kappa^2 d^{2\kappa}e^{2\kappa^2 T}\,d^{\max\{p\kappa\iota,\eta\}}\Bigl[\bigl(1+2p\iota\max\{1,T\}\bigr)^{p\kappa} + \eta\Bigr]\varepsilon^{-2} + 1 \\
&\le 2^{2(\kappa+4)}p^2\kappa^2 e^{2\kappa^2 T}\Bigl[1+\bigl(2p\iota\max\{1,T\}\bigr)^{p\kappa} + \eta\Bigr]d^{p\kappa\iota+\eta+2\kappa}\varepsilon^{-2} + 1 \\
&\le 2^{2(\kappa+4)}p^2\iota^2 e^{2\kappa^2 T}\Bigl[1+\bigl(2p\iota\max\{1,T\}\bigr)^{p\kappa} + \eta\Bigr]d^{p\kappa\iota+\eta+2\kappa}\varepsilon^{-2} + 1 \\
&\le 2^{2(\kappa+4)}p^2\iota^2 e^{2\kappa^2 T}\Bigl[2+\bigl(2p\iota\max\{1,T\}\bigr)^{p\kappa} + \eta\Bigr]d^{p\kappa\iota+\eta+2\kappa}\varepsilon^{-2}.
\end{aligned}
\tag{170}
\]
Next observe that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\|f_{1,d}(0)\|_{\mathbb{R}^d}
\le \|f_{1,d}(0)-(\mathcal{R}\phi^{1,d}_{\varepsilon})(0)\|_{\mathbb{R}^d} + \|(\mathcal{R}\phi^{1,d}_{\varepsilon})(0)\|_{\mathbb{R}^d}
\le \varepsilon\kappa d^{\kappa} + \kappa d^{\kappa}
\le 2\kappa d^{\kappa}
\le 2\iota d^{\kappa}.
\tag{171}
\]
Therefore, we obtain that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\begin{aligned}
D_{d,\varepsilon}
&= \varepsilon\,\bigl[\max\{2\kappa d^{\kappa},1\}+T^{-1/2}\bigr]^{-1}e^{-(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^{-1}2^{-(2\iota+1)} \\
&\quad\cdot \Bigl[\bigl(2+\max\{1,\kappa d^{\kappa},\|f_{1,d}(0)\|_{\mathbb{R}^d}\}\max\{1,T\}+\sqrt{2(2\iota-1)\kappa d^{\kappa}T}\bigr)^{p\iota+p\kappa}+\eta d^{\eta}\Bigr]^{-1/p} \\
&\ge \varepsilon\,\bigl[\max\{2\kappa d^{\kappa},1\}+T^{-1/2}\bigr]^{-1}e^{-(3+3\kappa+[\kappa^2+2\kappa\iota+2]T)}\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^{-1}2^{-(2\iota+1)} \\
&\quad\cdot \Bigl[\bigl(2+\max\{1,\kappa d^{\kappa},2\iota d^{\kappa}\}\max\{1,T\}+\sqrt{2(2\iota-1)\kappa d^{\kappa}T}\bigr)^{p\iota+p\kappa}+\eta d^{\eta}\Bigr]^{-1/p} \\
&\ge \varepsilon\,\bigl[2\max\{2\kappa d^{\kappa},1,T^{-1/2}\}\bigr]^{-1}e^{-(3+3\iota+[3\iota^2+2]T)}\bigl[\max\{1,2\kappa(\kappa+1)d^{\kappa}\}\bigr]^{-1}2^{-(2\iota+1)} \\
&\quad\cdot \Bigl[\bigl(2+2\iota d^{\kappa}\max\{1,T\}+2\iota d^{\kappa}\sqrt{T}\bigr)^{p\iota+p\kappa}+\eta d^{\eta}\Bigr]^{-1/p}.
\end{aligned}
\tag{172}
\]
This proves that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\begin{aligned}
D_{d,\varepsilon}
&\ge \varepsilon\,\bigl[4\iota d^{\kappa}\max\{1,T^{-1/2}\}\bigr]^{-1}e^{-(3\iota^2+3)(T+1)}\bigl[4\iota^2 d^{\kappa}\bigr]^{-1}2^{-(2\iota+1)}
\Bigl[\bigl(6\iota d^{\kappa}\max\{1,T\}\bigr)^{p\iota+p\kappa}+\eta d^{\eta}\Bigr]^{-1/p} \\
&\ge \varepsilon\,\bigl[4\iota d^{\kappa}\max\{1,T^{-1/2}\}\bigr]^{-1}e^{-(3\iota^2+3)(T+1)}\bigl[4\iota^2 d^{\kappa}\bigr]^{-1}2^{-(2\iota+1)}
\Bigl[\bigl(6\iota\max\{1,T\}\bigr)^{p\iota+p\kappa}+\eta\Bigr]^{-1/p} d^{[-\kappa(p\iota+p\kappa)-\eta]/p} \\
&\ge \bigl[4\iota\max\{1,T^{-1/2}\}\bigr]^{-1}e^{-(3\iota^2+3)(T+1)}\bigl[4\iota^2\bigr]^{-1}2^{-(2\iota+1)}
\Bigl[\bigl(6\iota\max\{1,T\}\bigr)^{p\iota+p\kappa}+\eta\Bigr]^{-1/p} d^{-(2\kappa+\kappa(\kappa+\iota)+\eta)}\,\varepsilon.
\end{aligned}
\tag{173}
\]
Hence, we obtain that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
D_{d,\varepsilon}
\ge \min\{1,T^{1/2}\}\,e^{-(3\iota^2+3)(T+1)}\,\iota^{-3}\,2^{-(2\iota+5)}
\Bigl[\bigl(6\iota\max\{1,T\}\bigr)^{p\iota+p\kappa}+\eta\Bigr]^{-1/p} d^{-(\kappa(2+\kappa+\iota)+\eta)}\,\varepsilon.
\tag{174}
\]
Moreover, note that Proposition 5.3 ensures that for all $\delta \in (0,1]$, $d \in \mathbb{N}$, $t \in [0,T]$, $\omega \in \Omega$ there exist $(\psi_{\delta,d,m,t,\omega})_{m\in\mathbb{N}} \subseteq \mathbf{N}$ such that for all $x \in \mathbb{R}^d$, $m \in \mathbb{N}$ it holds that $\mathcal{R}(\psi_{\delta,d,m,t,\omega}) \in C(\mathbb{R}^d,\mathbb{R}^d)$,
\[
\mathcal{P}(\psi_{\delta,d,m,t,\omega}) \le \mathcal{P}(\phi^{2,d}) + \bigl(\mathcal{P}(\phi^{2,d})+\mathcal{P}(\phi^{1,d}_{\delta})\bigr)^3\bigl[\chi_{\delta}(t)/\delta^2+1\bigr],
\tag{175}
\]
$\mathcal{L}(\psi_{\delta,d,m,t,\omega}) = \mathcal{L}(\psi_{\delta,d,1,t,\omega})$, and $(\mathcal{R}\psi_{\delta,d,m,t,\omega})(x) = Y^{\delta,d,m,x}_t(\omega)$. This demonstrates that for all $\delta \in (0,1]$, $d,m \in \mathbb{N}$, $\omega \in \Omega$ it holds that
\[
\mathcal{P}(\psi_{\delta,d,m,T,\omega})
\le \bigl(\mathcal{P}(\phi^{2,d})+\mathcal{P}(\phi^{1,d}_{\delta})\bigr)^3\bigl[\chi_{\delta}(T)/\delta^2+2\bigr]
\le \bigl[\kappa d^{\kappa}\delta^{-\kappa}\bigr]^3\bigl[T/\delta^2+2\bigr]
\le \kappa^3(T+2)d^{3\kappa}\delta^{-3\kappa-2}.
\tag{176}
\]
Proposition 5.2 hence proves that for all $\delta \in (0,1]$, $d \in \mathbb{N}$, $\omega \in \Omega$ there exist $(\varphi_{\delta,d,m,\omega})_{m\in\mathbb{N}} \subseteq \mathbf{N}$ such that for all $x \in \mathbb{R}^d$, $m \in \mathbb{N}$ it holds that $\mathcal{R}(\varphi_{\delta,d,m,\omega}) \in C(\mathbb{R}^d,\mathbb{R})$, $\mathcal{P}(\varphi_{\delta,d,m,\omega}) \le \mathcal{P}(\phi^{2,d})\bigl(\mathcal{P}(\phi^{0,d}_{\delta})+\mathcal{P}(\psi_{\delta,d,m,T,\omega})\bigr)$, $\mathcal{L}(\varphi_{\delta,d,m,\omega}) = \mathcal{L}(\varphi_{\delta,d,1,\omega})$, and
\[
(\mathcal{R}\varphi_{\delta,d,m,\omega})(x) = \Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T(\omega)\bigr).
\tag{177}
\]
This and (176) ensure that for all $\delta \in (0,1]$, $d,m \in \mathbb{N}$, $\omega \in \Omega$ it holds that
\[
\mathcal{P}(\varphi_{\delta,d,m,\omega})
\le \kappa d^{\kappa}\bigl[\kappa d^{\kappa}\delta^{-\kappa} + \kappa^3(T+2)d^{3\kappa}\delta^{-3\kappa-2}\bigr]
\le \kappa^2 d^{4\kappa}\delta^{-3\kappa-2}(T+2)\bigl[1+\kappa^2\bigr]
\le 2\iota^2\kappa^2(T+2)d^{4\kappa}\delta^{-3\kappa-2}.
\tag{178}
\]
Lemma 5.1 and (177) therefore show that for all $M,d \in \mathbb{N}$, $\delta \in (0,1]$, $\omega \in \Omega$ there exists $\Psi_{M,d,\delta,\omega} \in \mathbf{N}$ such that for all $x \in \mathbb{R}^d$ it holds that $\mathcal{R}(\Psi_{M,d,\delta,\omega}) \in C(\mathbb{R}^d,\mathbb{R})$, $\mathcal{P}(\Psi_{M,d,\delta,\omega}) \le 2M^2\iota^2\kappa^2(T+2)d^{4\kappa}\delta^{-3\kappa-2}$, and
\[
(\mathcal{R}\Psi_{M,d,\delta,\omega})(x) = \frac{1}{M}\sum_{m=1}^{M}\Phi^{0,d}_{\delta}\bigl(Y^{\delta,d,m,x}_T(\omega)\bigr).
\tag{179}
\]
This, (170), and (174) assure that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, $\omega \in \Omega$ it holds that
\[
\begin{aligned}
\mathcal{P}(\Psi_{M_{d,\varepsilon},d,D_{d,\varepsilon},\omega})
&\le 2\,|M_{d,\varepsilon}|^2\,\iota^2\kappa^2(T+2)\,d^{4\kappa}\,|D_{d,\varepsilon}|^{-3\kappa-2} \\
&\le 2\Bigl[2^{2(\kappa+4)}p^2\iota^2 e^{2\kappa^2 T}\bigl(2+(2p\iota\max\{1,T\})^{p\kappa}+\eta\bigr)d^{p\kappa\iota+\eta+2\kappa}\varepsilon^{-2}\Bigr]^2\,\iota^2\kappa^2(T+2)\,d^{4\kappa} \\
&\quad\cdot\Bigl[\min\{1,T^{1/2}\}e^{-(3\iota^2+3)(T+1)}\iota^{-3}2^{-(2\iota+5)}\bigl((6\iota\max\{1,T\})^{p\iota+p\kappa}+\eta\bigr)^{-1/p} d^{-(\kappa(2+\kappa+\iota)+\eta)}\varepsilon\Bigr]^{-3\kappa-2} \\
&= 2\Bigl[2^{2(\kappa+4)}p^2\iota^2 e^{2\kappa^2 T}\bigl(2+(2p\iota\max\{1,T\})^{p\kappa}+\eta\bigr)\Bigr]^2\,\iota^2\kappa^2(T+2) \\
&\quad\cdot\Bigl[\min\{1,T^{1/2}\}e^{-(3\iota^2+3)(T+1)}\iota^{-3}2^{-(2\iota+5)}\bigl((6\iota\max\{1,T\})^{p\iota+p\kappa}+\eta\bigr)^{-1/p}\Bigr]^{-3\kappa-2}
\cdot d^{2(p\kappa\iota+\eta+4\kappa)+(\kappa(2+\kappa+\iota)+\eta)(3\kappa+2)}\,\varepsilon^{-3\kappa-6}.
\end{aligned}
\tag{180}
\]
Combining this and (168) finishes the proof of item (ii). The proof of Proposition 6.1 is thus completed.
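The final display (180) is pure bookkeeping: the network count grows like $|M_{d,\varepsilon}|^2 d^{4\kappa}|D_{d,\varepsilon}|^{-3\kappa-2}$, and inserting the polynomial bounds (170) and (174) produces exponents in $d$ and $\varepsilon^{-1}$ that depend only on $\kappa$, $\iota$, $p$, and $\eta$. The small helper below is our own, purely illustrative bookkeeping of those exponents as they appear in (180).

```python
def dnn_size_exponents(kappa, p, eta):
    """Exponents c_d, c_eps with P(Psi) = O(d**c_d * eps**(-c_eps)), cf. (180)."""
    iota = max(kappa, 1.0)   # iota = max{kappa, 1} as in the proof
    c_d = 2 * (p * kappa * iota + eta + 4 * kappa) \
        + (kappa * (2 + kappa + iota) + eta) * (3 * kappa + 2)
    c_eps = 3 * kappa + 6
    return c_d, c_eps

print(dnn_size_exponents(kappa=1.0, p=2.0, eta=1.0))   # (39.0, 9.0): polynomial in d and 1/eps
```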
Corollary 6.2. Let $T,\kappa,\eta,p \in (0,\infty)$, let $A_d = (a_{d,i,j})_{(i,j)\in\{1,\dots,d\}^2} \in \mathbb{R}^{d\times d}$, $d \in \mathbb{N}$, be symmetric positive semidefinite matrices, for every $d \in \mathbb{N}$ let $\|\cdot\|_{\mathbb{R}^d}\colon \mathbb{R}^d \to [0,\infty)$ be the $d$-dimensional Euclidean norm and let $\nu_d\colon \mathcal{B}(\mathbb{R}^d) \to [0,1]$ be a probability measure on $\mathbb{R}^d$, let $f_{0,d}\colon \mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, and $f_{1,d}\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, be functions, let $\mathcal{A}_d\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, and $a\colon \mathbb{R} \to \mathbb{R}$ be continuous functions which satisfy for all $d \in \mathbb{N}$, $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ that $\mathcal{A}_d(x) = (a(x_1),\dots,a(x_d))$, let
\[
\mathbf{N} = \cup_{L \in \{2,3,4,\dots\}} \cup_{(l_0,l_1,\dots,l_L) \in \mathbb{N}^{L+1}} \bigl(\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n})\bigr),
\tag{181}
\]
let $\mathcal{P}\colon \mathbf{N} \to \mathbb{N}$, $\mathcal{L}\colon \mathbf{N} \to \cup_{L \in \{2,3,4,\dots\}} \mathbb{N}^{L+1}$, and $\mathcal{R}\colon \mathbf{N} \to \cup_{k,l \in \mathbb{N}} C(\mathbb{R}^k,\mathbb{R}^l)$ be the functions which satisfy for all $L \in \{2,3,4,\dots\}$, $l_0,l_1,\dots,l_L \in \mathbb{N}$, $\Phi = ((W_1,B_1),\dots,(W_L,B_L)) \in (\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n}))$, $x_0 \in \mathbb{R}^{l_0}$, \dots, $x_{L-1} \in \mathbb{R}^{l_{L-1}}$ with $\forall\, n \in \mathbb{N} \cap [1,L)\colon x_n = \mathcal{A}_{l_n}(W_n x_{n-1} + B_n)$ that $\mathcal{P}(\Phi) = \sum_{n=1}^{L} l_n(l_{n-1}+1)$, $\mathcal{R}(\Phi) \in C(\mathbb{R}^{l_0},\mathbb{R}^{l_L})$, $\mathcal{L}(\Phi) = (l_0,l_1,\dots,l_L)$, and
\[
(\mathcal{R}\Phi)(x_0) = W_L x_{L-1} + B_L,
\tag{182}
\]
let $(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, $(\phi^{2,d})_{d\in\mathbb{N}} \subseteq \mathbf{N}$, and assume for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, $x,y \in \mathbb{R}^d$ that $\mathcal{R}(\phi^{0,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, $\mathcal{R}(\phi^{1,d}_{\varepsilon}), \mathcal{R}(\phi^{2,d}) \in C(\mathbb{R}^d,\mathbb{R}^d)$, $|f_{0,d}(x)| + \sum_{i,j=1}^{d} |a_{d,i,j}| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d})$, $\|f_{1,d}(x)-f_{1,d}(y)\|_{\mathbb{R}^d} \le \kappa\|x-y\|_{\mathbb{R}^d}$, $\|(\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \kappa(d^{\kappa}+\|x\|_{\mathbb{R}^d})$, $\mathcal{P}(\phi^{2,d}) + \sum_{m=0}^{1}\mathcal{P}(\phi^{m,d}_{\varepsilon}) \le \kappa d^{\kappa}\varepsilon^{-\kappa}$, $|(\mathcal{R}\phi^{0,d}_{\varepsilon})(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(y)| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d})\|x-y\|_{\mathbb{R}^d}$, $\mathcal{L}(\phi^{2,d}) \in \mathbb{N}^3$, $(\mathcal{R}\phi^{2,d})(x) = x$, $\int_{\mathbb{R}^d}\|z\|^{p(2\kappa+1)}_{\mathbb{R}^d}\,\nu_d(dz) \le \eta d^{\eta}$, and
\[
|f_{0,d}(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(x)| + \|f_{1,d}(x) - (\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \varepsilon\kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}).
\tag{183}
\]
Then
(i) there exist unique at most polynomially growing functions $u_d\colon [0,T]\times\mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, such that for all $d \in \mathbb{N}$, $x \in \mathbb{R}^d$ it holds that $u_d(0,x) = f_{0,d}(x)$ and such that for all $d \in \mathbb{N}$ it holds that $u_d$ is a viscosity solution of
\[
(\tfrac{\partial}{\partial t}u_d)(t,x) = (\tfrac{\partial}{\partial x}u_d)(t,x)\,f_{1,d}(x) + \sum_{i,j=1}^{d} a_{d,i,j}\,(\tfrac{\partial^2}{\partial x_i \partial x_j}u_d)(t,x)
\tag{184}
\]
for $(t,x) \in (0,T)\times\mathbb{R}^d$ and
(ii) there exist $(\psi_{d,\varepsilon})_{(d,\varepsilon)\in\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, $c \in \mathbb{R}$ such that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that $\mathcal{P}(\psi_{d,\varepsilon}) \le c\,d^{c}\varepsilon^{-c}$, $\mathcal{R}(\psi_{d,\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, and
\[
\biggl(\int_{\mathbb{R}^d} |u_d(T,x) - (\mathcal{R}\psi_{d,\varepsilon})(x)|^{p}\,\nu_d(dx)\biggr)^{1/p} \le \varepsilon.
\tag{185}
\]
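Corollary 6.2 relaxes $p \in [2,\infty)$ to $p \in (0,\infty)$; its proof is not reproduced here. The elementary fact that typically underlies such a reduction (stated here only for orientation; the authors' argument may differ in detail, in particular regarding the moment hypothesis on $\nu_d$) is the monotonicity of $L^p(\nu_d)$-norms on a probability space: by Jensen's inequality applied to the convex map $t \mapsto t^{q/p}$, for every measurable $g\colon \mathbb{R}^d \to \mathbb{R}$ and all $0 < p \le q$ it holds that
\[
\biggl(\int_{\mathbb{R}^d} |g(x)|^{p}\,\nu_d(dx)\biggr)^{1/p}
\le
\biggl(\int_{\mathbb{R}^d} |g(x)|^{q}\,\nu_d(dx)\biggr)^{1/q},
\]
so that an $L^{q}(\nu_d;\mathbb{R})$-approximation with $q \ge \max\{p,2\}$, as provided by Proposition 6.1, is in particular an $L^{p}(\nu_d;\mathbb{R})$-approximation.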
6.2 Rectified DNN approximations
Theorem 6.3. Let $T,\kappa,\eta,p \in (0,\infty)$, let $A_d = (a_{d,i,j})_{(i,j)\in\{1,\dots,d\}^2} \in \mathbb{R}^{d\times d}$, $d \in \mathbb{N}$, be symmetric positive semidefinite matrices, for every $d \in \mathbb{N}$ let $\|\cdot\|_{\mathbb{R}^d}\colon \mathbb{R}^d \to [0,\infty)$ be the $d$-dimensional Euclidean norm and let $\nu_d\colon \mathcal{B}(\mathbb{R}^d) \to [0,1]$ be a probability measure on $\mathbb{R}^d$, let $f_{0,d}\colon \mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, and $f_{1,d}\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, be functions, let $\mathcal{A}_d\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, be the functions which satisfy for all $d \in \mathbb{N}$, $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ that $\mathcal{A}_d(x) = (\max\{x_1,0\},\dots,\max\{x_d,0\})$, let
\[
\mathbf{N} = \cup_{L \in \{2,3,4,\dots\}} \cup_{(l_0,l_1,\dots,l_L) \in \mathbb{N}^{L+1}} \bigl(\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n})\bigr),
\tag{186}
\]
let $\mathcal{P}\colon \mathbf{N} \to \mathbb{N}$, $\mathcal{L}\colon \mathbf{N} \to \cup_{L \in \{2,3,4,\dots\}} \mathbb{N}^{L+1}$, and $\mathcal{R}\colon \mathbf{N} \to \cup_{k,l \in \mathbb{N}} C(\mathbb{R}^k,\mathbb{R}^l)$ be the functions which satisfy for all $L \in \{2,3,4,\dots\}$, $l_0,l_1,\dots,l_L \in \mathbb{N}$, $\Phi = ((W_1,B_1),\dots,(W_L,B_L)) \in (\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n}))$, $x_0 \in \mathbb{R}^{l_0}$, \dots, $x_{L-1} \in \mathbb{R}^{l_{L-1}}$ with $\forall\, n \in \mathbb{N} \cap [1,L)\colon x_n = \mathcal{A}_{l_n}(W_n x_{n-1} + B_n)$ that $\mathcal{P}(\Phi) = \sum_{n=1}^{L} l_n(l_{n-1}+1)$, $\mathcal{R}(\Phi) \in C(\mathbb{R}^{l_0},\mathbb{R}^{l_L})$, $\mathcal{L}(\Phi) = (l_0,l_1,\dots,l_L)$, and
\[
(\mathcal{R}\Phi)(x_0) = W_L x_{L-1} + B_L,
\tag{187}
\]
let $(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, and assume for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, $x,y \in \mathbb{R}^d$ that $\mathcal{R}(\phi^{0,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, $\mathcal{R}(\phi^{1,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R}^d)$, $|f_{0,d}(x)| + \sum_{i,j=1}^{d} |a_{d,i,j}| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d})$, $\|f_{1,d}(x)-f_{1,d}(y)\|_{\mathbb{R}^d} \le \kappa\|x-y\|_{\mathbb{R}^d}$, $\|(\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \kappa(d^{\kappa}+\|x\|_{\mathbb{R}^d})$, $\sum_{m=0}^{1}\mathcal{P}(\phi^{m,d}_{\varepsilon}) \le \kappa d^{\kappa}\varepsilon^{-\kappa}$, $|(\mathcal{R}\phi^{0,d}_{\varepsilon})(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(y)| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d})\|x-y\|_{\mathbb{R}^d}$, $\int_{\mathbb{R}^d}\|z\|^{p(4\kappa+15)}_{\mathbb{R}^d}\,\nu_d(dz) \le \eta d^{\eta}$, and
\[
|f_{0,d}(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(x)| + \|f_{1,d}(x) - (\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \varepsilon\kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}).
\tag{188}
\]
Then
(i) there exist unique at most polynomially growing functions $u_d\colon [0,T]\times\mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, such that for all $d \in \mathbb{N}$, $x \in \mathbb{R}^d$ it holds that $u_d(0,x) = f_{0,d}(x)$ and such that for all $d \in \mathbb{N}$ it holds that $u_d$ is a viscosity solution of
\[
(\tfrac{\partial}{\partial t}u_d)(t,x) = (\tfrac{\partial}{\partial x}u_d)(t,x)\,f_{1,d}(x) + \sum_{i,j=1}^{d} a_{d,i,j}\,(\tfrac{\partial^2}{\partial x_i \partial x_j}u_d)(t,x)
\tag{189}
\]
for $(t,x) \in (0,T)\times\mathbb{R}^d$ and
(ii) there exist $(\psi_{d,\varepsilon})_{(d,\varepsilon)\in\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, $c \in \mathbb{R}$ such that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that $\mathcal{P}(\psi_{d,\varepsilon}) \le c\,d^{c}\varepsilon^{-c}$, $\mathcal{R}(\psi_{d,\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, and
\[
\biggl(\int_{\mathbb{R}^d} |u_d(T,x) - (\mathcal{R}\psi_{d,\varepsilon})(x)|^{p}\,\nu_d(dx)\biggr)^{1/p} \le \varepsilon.
\tag{190}
\]
Proof of Theorem 6.3. Throughout this proof let $a\colon \mathbb{R} \to \mathbb{R}$ be the function which satisfies for all $x \in \mathbb{R}$ that
\[
a(x) = \max\{x,0\}
\tag{191}
\]
and let $(\phi^{2,d})_{d\in\mathbb{N}} \subseteq \mathbf{N}$ satisfy for all $d \in \mathbb{N}$, $x \in \mathbb{R}^d$ that $\mathcal{R}(\phi^{2,d}) \in C(\mathbb{R}^d,\mathbb{R}^d)$, $\mathcal{L}(\phi^{2,d}) = (d,2d,d)$, and $(\mathcal{R}\phi^{2,d})(x) = x$ (cf. Lemma 5.4). Observe that for all $d \in \mathbb{N}$, $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ it holds that
\[
\mathcal{A}_d(x) = (a(x_1),\dots,a(x_d)).
\tag{192}
\]
Next note that for all $d \in \mathbb{N}$ it holds that
\[
\mathcal{P}(\phi^{2,d}) = 2d(d+1) + d(2d+1) = 2d^2+2d+2d^2+d = 4d^2+3d \le 7d^2.
\tag{193}
\]
This proves that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that
\[
\mathcal{P}(\phi^{2,d}) + \sum_{m=0}^{1}\mathcal{P}(\phi^{m,d}_{\varepsilon})
\le 7d^2 + \kappa d^{\kappa}\varepsilon^{-\kappa}
\le (\kappa+7)d^{\kappa+2}\varepsilon^{-\kappa}
\le (2\kappa+7)d^{2\kappa+7}\varepsilon^{-(2\kappa+7)}.
\tag{194}
\]
Moreover, observe that Young's inequality assures that for all $\alpha \in [0,\infty)$ it holds that
\[
\alpha^{\kappa} \le \frac{\kappa}{2\kappa+7}\,\alpha^{2\kappa+7} + \frac{\kappa+7}{2\kappa+7}
\le \alpha^{2\kappa+7} + \frac{\kappa+7}{2\kappa+7}.
\tag{195}
\]
This ensures that for all $d \in \mathbb{N}$, $x,y \in \mathbb{R}^d$ it holds that
\[
\kappa\bigl(1+\|x\|^{\kappa}_{\mathbb{R}^d}\bigr) \le \kappa\bigl(2+\|x\|^{2\kappa+7}_{\mathbb{R}^d}\bigr) \le (2\kappa+7)\bigl(1+\|x\|^{2\kappa+7}_{\mathbb{R}^d}\bigr)
\tag{196}
\]
and
\[
\begin{aligned}
\kappa\bigl(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d}\bigr)
&\le \kappa\biggl(1+\frac{2(\kappa+7)}{2\kappa+7}+\|x\|^{2\kappa+7}_{\mathbb{R}^d}+\|y\|^{2\kappa+7}_{\mathbb{R}^d}\biggr)
= \frac{\kappa(4\kappa+21)}{2\kappa+7} + \kappa\bigl(\|x\|^{2\kappa+7}_{\mathbb{R}^d}+\|y\|^{2\kappa+7}_{\mathbb{R}^d}\bigr) \\
&\le 2\kappa+7 + (2\kappa+7)\bigl(\|x\|^{2\kappa+7}_{\mathbb{R}^d}+\|y\|^{2\kappa+7}_{\mathbb{R}^d}\bigr)
= (2\kappa+7)\bigl(1+\|x\|^{2\kappa+7}_{\mathbb{R}^d}+\|y\|^{2\kappa+7}_{\mathbb{R}^d}\bigr).
\end{aligned}
\tag{197}
\]
Combining this with (192), (194), the fact that $\mathcal{A}_d\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, and $a\colon \mathbb{R} \to \mathbb{R}$ are continuous functions, and Corollary 6.2 (applied with $T=T$, $\kappa=2\kappa+7$, $\eta=\eta$, $p=p$, $(A_d)_{d\in\mathbb{N}}=(A_d)_{d\in\mathbb{N}}$, $(\nu_d)_{d\in\mathbb{N}}=(\nu_d)_{d\in\mathbb{N}}$, $(f_{0,d})_{d\in\mathbb{N}}=(f_{0,d})_{d\in\mathbb{N}}$, $(f_{1,d})_{d\in\mathbb{N}}=(f_{1,d})_{d\in\mathbb{N}}$, $(\mathcal{A}_d)_{d\in\mathbb{N}}=(\mathcal{A}_d)_{d\in\mathbb{N}}$, $a=a$, $\mathbf{N}=\mathbf{N}$, $\mathcal{P}=\mathcal{P}$, $\mathcal{L}=\mathcal{L}$, $\mathcal{R}=\mathcal{R}$, $(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]}=(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]}$, $(\phi^{2,d})_{d\in\mathbb{N}}=(\phi^{2,d})_{d\in\mathbb{N}}$ in the notation of Corollary 6.2) establishes items (i)–(ii). The proof of Theorem 6.3 is thus completed.
6.3 Rectified DNN approximations on the d-dimensional unit cube
Corollary 6.4. Let $T,\kappa,p \in (0,\infty)$, let $A_d = (a_{d,i,j})_{(i,j)\in\{1,\dots,d\}^2} \in \mathbb{R}^{d\times d}$, $d \in \mathbb{N}$, be symmetric positive semidefinite matrices, for every $d \in \mathbb{N}$ let $\|\cdot\|_{\mathbb{R}^d}\colon \mathbb{R}^d \to [0,\infty)$ be the $d$-dimensional Euclidean norm, let $f_{0,d}\colon \mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, and $f_{1,d}\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, be functions, let $\mathcal{A}_d\colon \mathbb{R}^d \to \mathbb{R}^d$, $d \in \mathbb{N}$, be the functions which satisfy for all $d \in \mathbb{N}$, $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ that $\mathcal{A}_d(x) = (\max\{x_1,0\},\dots,\max\{x_d,0\})$, let
\[
\mathbf{N} = \cup_{L \in \{2,3,4,\dots\}} \cup_{(l_0,l_1,\dots,l_L) \in \mathbb{N}^{L+1}} \bigl(\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n})\bigr),
\tag{198}
\]
let $\mathcal{P}\colon \mathbf{N} \to \mathbb{N}$ and $\mathcal{R}\colon \mathbf{N} \to \cup_{k,l \in \mathbb{N}} C(\mathbb{R}^k,\mathbb{R}^l)$ be the functions which satisfy for all $L \in \{2,3,4,\dots\}$, $l_0,l_1,\dots,l_L \in \mathbb{N}$, $\Phi = ((W_1,B_1),\dots,(W_L,B_L)) \in (\times_{n=1}^{L}(\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n}))$, $x_0 \in \mathbb{R}^{l_0}$, \dots, $x_{L-1} \in \mathbb{R}^{l_{L-1}}$ with $\forall\, n \in \mathbb{N} \cap [1,L)\colon x_n = \mathcal{A}_{l_n}(W_n x_{n-1} + B_n)$ that $\mathcal{P}(\Phi) = \sum_{n=1}^{L} l_n(l_{n-1}+1)$, $\mathcal{R}(\Phi) \in C(\mathbb{R}^{l_0},\mathbb{R}^{l_L})$, and
\[
(\mathcal{R}\Phi)(x_0) = W_L x_{L-1} + B_L,
\tag{199}
\]
let $(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, and assume for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$, $x,y \in \mathbb{R}^d$ that $\mathcal{R}(\phi^{0,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, $\mathcal{R}(\phi^{1,d}_{\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R}^d)$, $|f_{0,d}(x)| + \sum_{i,j=1}^{d} |a_{d,i,j}| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d})$, $\|f_{1,d}(x)-f_{1,d}(y)\|_{\mathbb{R}^d} \le \kappa\|x-y\|_{\mathbb{R}^d}$, $\|(\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \kappa(d^{\kappa}+\|x\|_{\mathbb{R}^d})$, $\sum_{m=0}^{1}\mathcal{P}(\phi^{m,d}_{\varepsilon}) \le \kappa d^{\kappa}\varepsilon^{-\kappa}$, $|(\mathcal{R}\phi^{0,d}_{\varepsilon})(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(y)| \le \kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}+\|y\|^{\kappa}_{\mathbb{R}^d})\|x-y\|_{\mathbb{R}^d}$, and
\[
|f_{0,d}(x) - (\mathcal{R}\phi^{0,d}_{\varepsilon})(x)| + \|f_{1,d}(x) - (\mathcal{R}\phi^{1,d}_{\varepsilon})(x)\|_{\mathbb{R}^d} \le \varepsilon\kappa d^{\kappa}(1+\|x\|^{\kappa}_{\mathbb{R}^d}).
\tag{200}
\]
Then
(i) there exist unique at most polynomially growing functions $u_d\colon [0,T]\times\mathbb{R}^d \to \mathbb{R}$, $d \in \mathbb{N}$, such that for all $d \in \mathbb{N}$, $x \in \mathbb{R}^d$ it holds that $u_d(0,x) = f_{0,d}(x)$ and such that for all $d \in \mathbb{N}$ it holds that $u_d$ is a viscosity solution of
\[
(\tfrac{\partial}{\partial t}u_d)(t,x) = (\tfrac{\partial}{\partial x}u_d)(t,x)\,f_{1,d}(x) + \sum_{i,j=1}^{d} a_{d,i,j}\,(\tfrac{\partial^2}{\partial x_i \partial x_j}u_d)(t,x)
\tag{201}
\]
for $(t,x) \in (0,T)\times\mathbb{R}^d$ and
(ii) there exist $(\psi_{d,\varepsilon})_{(d,\varepsilon)\in\mathbb{N}\times(0,1]} \subseteq \mathbf{N}$, $c \in \mathbb{R}$ such that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$ it holds that $\mathcal{P}(\psi_{d,\varepsilon}) \le c\,d^{c}\varepsilon^{-c}$, $\mathcal{R}(\psi_{d,\varepsilon}) \in C(\mathbb{R}^d,\mathbb{R})$, and
\[
\biggl(\int_{[0,1]^d} |u_d(T,x) - (\mathcal{R}\psi_{d,\varepsilon})(x)|^{p}\,dx\biggr)^{1/p} \le \varepsilon.
\tag{202}
\]
Proof of Corollary 6.4. Throughout this proof for every $d \in \mathbb{N}$ let $\lambda_d\colon \mathcal{B}(\mathbb{R}^d) \to [0,\infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^d$ and let $\nu_d\colon \mathcal{B}(\mathbb{R}^d) \to [0,1]$ be the function which satisfies for all $B \in \mathcal{B}(\mathbb{R}^d)$ that
\[
\nu_d(B) = \lambda_d\bigl(B \cap [0,1]^d\bigr).
\tag{203}
\]
Observe that (203) implies that for all $d \in \mathbb{N}$ it holds that $\nu_d$ is a probability measure on $\mathbb{R}^d$. This and (203) ensure that for all $d \in \mathbb{N}$, $g \in C(\mathbb{R}^d,\mathbb{R})$ it holds that
\[
\int_{\mathbb{R}^d} |g(x)|\,\nu_d(dx) = \int_{[0,1]^d} |g(x)|\,dx.
\tag{204}
\]
Combining this with, e.g., Grohs et al. [26, Lemma 3.12] demonstrates that for all $d \in \mathbb{N}$ it holds that
\[
\int_{\mathbb{R}^d} \|x\|^{p(4\kappa+15)}_{\mathbb{R}^d}\,\nu_d(dx)
= \int_{[0,1]^d} \|x\|^{p(4\kappa+15)}_{\mathbb{R}^d}\,dx
\le d^{p(4\kappa+15)/2}
\le d^{p(2\kappa+8)}
\le p(2\kappa+8)\,d^{p(2\kappa+8)}.
\tag{205}
\]
Theorem 6.3 (applied with $T=T$, $\kappa=\kappa$, $\eta=p(2\kappa+8)$, $p=p$, $(A_d)_{d\in\mathbb{N}}=(A_d)_{d\in\mathbb{N}}$, $(\nu_d)_{d\in\mathbb{N}}=(\nu_d)_{d\in\mathbb{N}}$, $(f_{0,d})_{d\in\mathbb{N}}=(f_{0,d})_{d\in\mathbb{N}}$, $(f_{1,d})_{d\in\mathbb{N}}=(f_{1,d})_{d\in\mathbb{N}}$, $(\mathcal{A}_d)_{d\in\mathbb{N}}=(\mathcal{A}_d)_{d\in\mathbb{N}}$, $\mathbf{N}=\mathbf{N}$, $\mathcal{P}=\mathcal{P}$, $\mathcal{R}=\mathcal{R}$, $(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]}=(\phi^{m,d}_{\varepsilon})_{(m,d,\varepsilon)\in\{0,1\}\times\mathbb{N}\times(0,1]}$ in the notation of Theorem 6.3) and (204) hence establish items (i)–(ii). The proof of Corollary 6.4 is thus completed.
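The only new ingredient relative to Theorem 6.3 is the moment bound (205), which ultimately rests on $\|x\|_{\mathbb{R}^d} \le \sqrt{d}$ for $x \in [0,1]^d$. A quick Monte Carlo check of the inequality $\int_{[0,1]^d}\|x\|^q_{\mathbb{R}^d}\,dx \le d^{q/2}$ (our own illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 5, 20):
    for q in (2.0, 7.5, 23.0):
        x = rng.random((100_000, d))                    # uniform samples on [0, 1]^d
        lhs = np.mean(np.linalg.norm(x, axis=1) ** q)   # approximates the integral in (205)
        assert lhs <= d ** (q / 2) + 1e-9               # holds pointwise since ||x|| <= sqrt(d)
print("moment bound of (205) confirmed on Monte Carlo samples")
```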
Acknowledgements

David Kofler is gratefully acknowledged for his useful comments regarding the a priori estimates in Subsections 4.1–4.2. This work has been partially supported through the research grant with the title "Higher order numerical approximation methods for stochastic partial differential equations" (Number 175699) from the Swiss National Science Foundation (SNSF). Furthermore, this work has been partially supported through the ETH Research Grant ETH-47 15-2 "Mild stochastic calculus and numerical approximations for nonlinear stochastic evolution equations with Lévy noise".
References

[1] Bach, F. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research 18, 19 (2017), 1–53.
[2] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 3 (1993), 930–945.
[3] Barron, A. R. Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14, 1 (1994), 115–133.
[4] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solving stochastic differential equations and Kolmogorov equations by means of deep learning. arXiv:1806.00421 (2018), 56 pages.
[5] Beck, C., Jentzen, A., and E, W. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. arXiv:1709.05963 (2017), 56 pages.
[6] Becker, S., Cheridito, P., and Jentzen, A. Deep optimal stopping. arXiv:1804.05394 (2018), 18 pages.
[7] Blum, E. K., and Li, L. K. Approximation theory and feedforward networks. Neural Networks 4, 4 (1991), 511–515.
[8] Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. Optimal approximation with sparsely connected deep neural networks. arXiv:1705.01714 (2017), 36 pages.
[9] Burger, M., and Neubauer, A. Error bounds for approximation with neural networks. Journal of Approximation Theory 112, 2 (2001), 235–250.
[10] Candès, E. J. Ridgelets: Theory and Applications, 1998. Ph.D. thesis, Stanford University.
[11] Chen, T., and Chen, H. Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Transactions on Neural Networks 6, 4 (1995), 904–910.
[12] Chouiekh, A., and Haj, E. H. I. E. ConvNets for fraud detection analysis. Procedia Computer Science 127 (2018), 133–138.
[13] Chui, C. K., Li, X., and Mhaskar, H. N. Neural networks for localized approximation. Math. Comp. 63, 208 (1994), 607–623.
[14] Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal 2, 4 (1989), 303–314.
[15] Dahl, G. E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 20, 1 (2012), 30–42.
[16] DeVore, R., Oskolkov, K., and Petrushev, P. Approximation by feedforward neural networks. Ann. Numer. Math. 4 (1996), 261–287.
[17] E, W., Han, J., and Jentzen, A. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat. 5, 4 (2017), 349–380.
[18] E, W., and Wang, Q. Exponential convergence of the deep neural network approximation for analytic functions. Science China Mathematics (2018). Published online: September 6, 2018, https://doi.org/10.1007/s11425-018-9387-x.
[19] E, W., and Yu, B. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat. 6, 1 (2018), 1–12.
[20] Elbrächter, D., Grohs, P., Jentzen, A., and Schwab, C. DNN expression rate analysis of high-dimensional PDEs: Application to option pricing. Preprint (2018).
[21] Eldan, R., and Shamir, O. The power of depth for feedforward neural networks. Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA (2016), 907–940.
[22] Ellacott, S. Aspects of the numerical analysis of neural networks. Acta Numer. 3 (1994), 145–202.
[23] Fujii, M., Takahashi, A., and Takahashi, M. Asymptotic expansion as prior knowledge in deep learning method for high dimensional BSDEs. arXiv:1710.07030 (2017), 16 pages.
[24] Funahashi, K.-I. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 3 (1989), 183–192.
[25] Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, ICASSP (2013), pp. 6645–6649.
[26] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.02362 (2018), 124 pages.
[27] Hairer, M., Hutzenthaler, M., and Jentzen, A. Loss of regularity for Kolmogorov equations. Ann. Probab. 43, 2 (2015), 468–527.
[28] Han, J., Jentzen, A., and E, W. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 34 (2018), 8505–8510.
[29] Hartman, E. J., Keeler, J. D., and Kowalski, J. M. Layered neural networks with Gaussian hidden units as universal approximations. Neural Computation 2, 2 (1990), 210–215.
[30] Henry-Labordère, P. Deep primal-dual algorithm for BSDEs: Applications of machine learning to CVA and IM. (November 15, 2017), 16 pages. Available at SSRN: https://ssrn.com/abstract=3071506.
[31] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[32] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 2 (1991), 251–257.
[33] Hornik, K. Some new results on neural network approximation. Neural Networks 6, 8 (1993), 1069–1072.
[34] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
[35] Hornik, K., Stinchcombe, M., and White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 5 (1990), 551–560.
[36] Hu, B., Lu, Z., Li, H., and Chen, Q. Convolutional neural network architectures for matching natural language sentences. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (2014), pp. 2042–2050.
[37] Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 2261–2269.
[38] Hutzenthaler, M., and Jentzen, A. On a perturbation theory and on strong convergence rates for stochastic ordinary and partial differential equations with non-globally monotone coefficients. arXiv:1401.0295 (2014), 41 pages.
[39] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (2014), pp. 655–665.
[40] Khoo, Y., Lu, J., and Ying, L. Solving parametric PDE problems with artificial neural networks. arXiv:1707.03351 (2017), 17 pages.
[41] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (2012), pp. 1097–1105.
[42] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 6 (1993), 861–867.
[43] Mhaskar, H., and Micchelli, C. Degree of approximation by neural and translation networks with a single hidden layer. Adv. Appl. Math. 16, 2 (1995), 151–183.
[44] Mhaskar, H. N. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. 8, 1 (1996), 164–177.
[45] Mhaskar, H. N., and Poggio, T. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications 14, 06 (2016), 829–848.
[46] Mishra, S. A machine learning framework for data driven acceleration of computations of differential equations. arXiv:1807.09519 (2018), 23 pages.
[47] Montanelli, H., and Du, Q. New error bounds for deep networks using sparse grids. arXiv:1712.08688v3 (2017), 15 pages.
[48] Nabian, M. A., and Meidani, H. A deep neural network surrogate for high-dimensional random partial differential equations. arXiv:1806.02957 (2018), 23 pages.
[49] Nguyen-Thien, T., and Tran-Cong, T. Approximation of functions and their derivatives: A neural network implementation with applications. Appl. Math. Model. 23, 9 (1999), 687–704.
[50] Park, J., and Sandberg, I. W. Universal approximation using radial-basis-function networks. Neural Computation 3, 2 (1991), 246–257.
[51] Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. The universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528 (2018), 16 pages.
[52] Petersen, P., Raslan, M., and Voigtlaender, F. Topological properties of the set of functions generated by neural networks of fixed size. arXiv:1806.08459 (2018), 45 pages.
[53] Petersen, P., and Voigtlaender, F. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. arXiv:1709.05289 (2017), 54 pages.
[54] Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numer. 8 (1999), 143–195.
[55] Raissi, M. Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential equations. arXiv:1804.07010 (2018), 17 pages.
[56] Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., and Beling, P. Deep learning detecting fraud in credit card transactions. In 2018 Systems and Information Engineering Design Symposium (SIEDS) (2018), pp. 129–134.
[57] Schmitt, M. Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks. In Proceedings of the 12th International Conference on Neural Information Processing Systems (Cambridge, MA, USA, 1999), NIPS'99, MIT Press, pp. 328–334.
[58] Schwab, C., and Zech, J. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications (2018). Published online: August 30, 2018, https://doi.org/10.1142/S0219530518500203.
[59] Shaham, U., Cloninger, A., and Coifman, R. R. Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44, 3 (2018), 537–557.
[60] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014), 14 pages.
[61] Sirignano, J., and Spiliopoulos, K. DGM: A deep learning algorithm for solving partial differential equations. arXiv:1708.07469 (2017), 31 pages.
[62] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1701–1708.
[63] Wang, R., Fu, B., Fu, G., and Wang, M. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17 (2017).
[64] Wang, W., Yang, J., Xiao, J., Li, S., and Zhou, D. Face recognition based on deep learning. In Human Centered Computing (2015), pp. 812–820.
[65] Wu, C., Karanasou, P., Gales, M. J., and Sim, K. C. Stimulated deep neural network for speech recognition. In Interspeech 2016 (2016), pp. 400–404.
[66] Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks 94 (2017), 103–114.
[67] Yarotsky, D. Universal approximations of invariant maps by neural networks. arXiv:1804.10306 (2018), 64 pages.
[68] Zhai, S., Chang, K.-h., Zhang, R., and Zhang, Z. M. DeepIntent: Learning attentions for online advertising with recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), pp. 1295–1304.