Moment Tensor Potentials: a class of systematically improvable interatomic potentials Alexander V. Shapeev∗
arXiv:1512.06054v1 [physics.comp-ph] 18 Dec 2015
December 21, 2015
Abstract Density functional theory offers a very accurate way of computing materials properties from first principles. However, it is too expensive for modelling large-scale molecular systems whose properties are, in contrast, computed using interatomic potentials. The present paper considers, from a mathematical point of view, the problem of constructing interatomic potentials that approximate a given quantum-mechanical interaction model. In particular, a new class of systematically improvable potentials is proposed, analyzed, and tested on an existing quantum-mechanical database.
1 Introduction
Molecular modelling is an increasingly popular tool in biology, chemistry, physics, and materials science. The success of molecular modelling largely depends on the accuracy and efficiency of calculating the interatomic forces. The two major approaches to computing interatomic forces are (1) quantum mechanics (QM) calculations and (2) empirical interatomic potentials. In the first approach, the electronic structure is computed, which yields the forces on the atoms (more precisely, on their nuclei). The QM calculations are therefore rather computationally demanding, but they can yield high quantitative accuracy. On the other hand, the accuracy and transferability of empirical interatomic potentials are limited, but they are orders of magnitude less computationally demanding. With interatomic potentials, the forces on atoms typically come from the energy of interaction of each atom with its atomic neighborhood (typically consisting of tens to hundreds of atoms). This makes it very attractive to design a combined approach whose efficiency would be comparable to that of interatomic potentials, yet whose accuracy would be the same as that of ab initio simulations. Moreover, the computational complexity of interatomic potentials scales as O(N), where N is the number of atoms, whereas, for instance, standard Kohn–Sham density functional theory calculations scale like O(N³), which implies that the need for such combined approaches is even higher for large numbers of atoms. One way of achieving this is designing accurate interatomic potentials. With this goal in mind, we categorize all potentials into two groups, following the regression analysis terminology of [1]. The first group is the parametric potentials, with a fixed number of numeric or functional parameters. All empirical potentials known to the author are parametric, which is their
∗Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, Building 3, Moscow 143026 Russia. Email: [email protected].
disadvantage: their accuracy cannot be systematically improved. As a result, if the class of problems is sufficiently large and the required accuracy is sufficiently high (say, close to that of ab initio calculations), then the standard parametric potentials are not sufficient and one needs to employ nonparametric potentials. In theory, the accuracy of these potentials can be systematically improved. In practice, however, increasing the accuracy comes at the cost of higher computational complexity (which is nevertheless orders of magnitude less than that of the QM calculations), and the accuracy is still limited by that of the QM model the potential is fitted to. Also, it should be noted that, typically, such fitted potentials do not perform well in situations they were not fitted to (i.e., they exhibit little or no transferability). Each nonparametric potential has two major components: (1) a representation (also referred to as "descriptors") of atomic environments, and (2) a regression model that is a function of the representation. It should be emphasized that finding a good representation of atomic environments is not a trivial task, as the representation is required to satisfy certain restrictions, namely invariance with respect to Euclidean transformations (translations, rotations, and reflections), invariance with respect to permutations of chemically equivalent atoms, smoothness with respect to distant atoms entering and leaving the boundary of the atomic environment, as well as completeness (for more detail see the discussion in [1, 4]). The latter means that the representation should contain all the information needed to unambiguously reconstruct the atomic environment. In this context, one common approach to constructing nonparametric potentials, the neural network potentials (NNP) [5, 6], uses artificial neural networks as the regression model and a family of descriptors first proposed by Behler and Parrinello [6].
These descriptors take the form of radial functions applied to, and summed over, the distances of atoms to the central atom, augmented with similar functions with angular dependence summed over all pairs of atoms. Another approach, the Gaussian Approximation Potentials (GAP) [2, 15], uses Gaussian process regression and a cleverly designed set of descriptors that are evaluated by expanding a smoothened atomic density field in a spherical harmonics basis [1]. Other recent approaches include [16], which employs the bispectrum components of the atomic density, as proposed for an earlier version of GAP [2], and uses a linear regression model to fit the QM data. Also, [12] uses Gaussian process regression with a force-based model rather than a model based on the potential energy. Additionally, in a recent manuscript [11] the authors put forward an approach of regressing the interatomic interaction energy together with the electron density.

In the present paper we propose a new approach to constructing nonparametric potentials based on linear regression and invariant polynomials. The main feature of the proposed approach is that the proposed form of the potential can provably approximate any regular function satisfying all the needed symmetries (see Theorems 3.1 and 3.2), while being computationally efficient. The building block of the proposed potentials is what we call the moment tensors, similar to inertia tensors calculated for atomic environments. Hence we call this approach the Moment Tensor Potentials (MTP).

The manuscript is structured as follows. Section 2 gives an overview of interatomic potentials. Section 3 introduces MTP and then formulates and proves the two main theorems. The practical implementation of MTP is discussed in Section 4. Section 5 reports the accuracy and computational efficiency (i.e., the CPU time) of MTP and compares them with those of GAP.
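For concreteness, the simplest Behler–Parrinello radial descriptor just mentioned can be sketched as follows; the cutoff form follows [4], while the parameter values and the function names are merely illustrative:

```python
import numpy as np

def cutoff(r, rc):
    """The smooth cutoff f_c(r) = 0.5*(cos(pi*r/rc) + 1) for r < rc, else 0."""
    return np.where(r < rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def radial_symmetry_function(dists, eta, rs, rc):
    """G = sum_j exp(-eta*(r_j - rs)^2) * f_c(r_j), summed over the distances
    r_j from the central atom to its neighbors: a descriptor invariant under
    rotations and under permutations of the neighbors."""
    d = np.asarray(dists, dtype=float)
    return float(np.sum(np.exp(-eta * (d - rs) ** 2) * cutoff(d, rc)))

# Toy neighbor distances (in Angstrom) and illustrative parameter values.
g = radial_symmetry_function([1.0, 2.0, 2.5], eta=0.5, rs=0.0, rc=6.0)
```

Permutation invariance holds because only the multiset of distances enters the sum; a full NNP uses many such functions for several (eta, rs) pairs, plus angular ones.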
2 Interatomic Potentials
Consider a system of N_tot atoms with positions x ∈ (R^d)^{N_tot} (more precisely, x are the positions of their nuclei), where d is the number of physical dimensions, taken as d = 3 in most applications. The atoms have a certain energy of interaction, E^q(x), typically given by some QM model, which we aim to approximate. For simplicity, we assume that all atoms are chemically equivalent.
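Before formalizing this, the basic objects (neighborhoods within a cutoff radius, and a total energy assembled per atom, as described in the introduction) can be sketched as follows; the cutoff value and the toy site-energy function are placeholders, not part of the model studied below:

```python
import numpy as np

def environment(x, k, rcut):
    """The tuple of relative positions u = x_i - x_k of all atoms i that lie
    within the cutoff radius rcut of atom k (atom k itself excluded)."""
    rel = x - x[k]
    dist = np.linalg.norm(rel, axis=1)
    return rel[(dist > 0.0) & (dist < rcut)]

def total_energy(x, site_energy, rcut):
    """Approximate the total energy as a sum of site energies, one per atom,
    each depending only on that atom's neighborhood."""
    return sum(site_energy(environment(x, k, rcut)) for k in range(len(x)))

def toy_site_energy(u):
    """Placeholder site energy: a smooth function of neighbor distances only,
    hence automatically invariant under rotations and permutations."""
    d = np.linalg.norm(u, axis=1)
    return float(np.sum((1.0 - d / 5.0) ** 2))

x = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
E = total_energy(x, toy_site_energy, rcut=5.0)
```

Since the site energy depends only on relative positions, the total energy is translation invariant by construction.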
We make the assumption that E^q(x) can be well approximated by a sum of energies of atomic environments of individual atoms. We denote the atomic environment of atom k by Dx_k and let it equal the tuple

    Dx_k := (x_i − x_k)_{1≤i≤N_tot, 0<|x_i−x_k|<R_cut}.

k_B > 0 is the Boltzmann constant, T > 0 is the electronic temperature, and we take the chemical potential to be zero. Next, we let R > 0 and δ0 > 0, allow u_i to vary in V_{δ0}, where

    V_δ := {ζ ∈ C^d : Re(ζ) ∈ [−R − δ, R + δ]^d, Im(ζ) ∈ [−δ, δ]^d},

and assume that ϕ is analytically extended to V_{δ0}. Further, define M_δ := sup_{ζ∈V_δ} |ϕ(ζ)|. We note that in many models ϕ(ζ) blows up at ζ = 0; therefore, by assuming analytic extensibility of ϕ onto V_{δ0} we implicitly approximate such irregular ϕ with a function that is regular around ζ = 0.

Theorem 3.2. For any m ∈ N there exists p_m ∈ P_perm ∩ P_rot of degree m such that

    sup_{u: max_i |u_i| ≤ R} |V^tb(u) − p_m(u)| < C ρ^{−m},    (6)
where C > 0 and ρ > 1 are some constants that depend only on n, δ0, M_{δ0}, and k_B T.

The proof of this theorem is based on the following result by Hackbusch and Khoromskij [10]:

Proposition 3.3. Let E_ρ := {ζ ∈ C : |ζ − 1| + |ζ + 1| ≤ ρ + ρ^{−1}} for some ρ > 1, let E_ρ^{(j)} := [−1,1]^{j−1} × E_ρ × [−1,1]^{n−j}, let f = f(z_1, ..., z_n) be defined on the union of all E_ρ^{(j)}, and let

    M_ρ(f) := max_{1≤j≤n} sup_{z∈E_ρ^{(j)}} |f(z)|.

Let p = p(z_1, ..., z_n) be the polynomial interpolation of f of degree m on the Chebyshev–Gauss–Lobatto nodes in [−1,1]^n. Then

    sup_{max_i |z_i| ≤ 1} |f(z) − p(z)| ≤ 2 (1 + 2π^{−1} log(n))^m M_ρ(f) ρ^{−m}/(ρ − 1).
Proof of Theorem 3.2. The plan of the proof is, after an auxiliary step (Step 1), to successively obtain bounds on ϕ (Step 2), on H(u) (Step 3), and on an interpolating polynomial (Step 4), and then to symmetrize this interpolating polynomial (Step 5).

Step 1. First, we note that without loss of generality we can assume that

    sup_{z∈V_{δ0}} |ϕ′(z)| ≤ M_{δ0}/δ0 =: M′_{δ0}.    (7)

Indeed, using the Cauchy integral formula, one can bound |ϕ′(z)| ≤ M_{δ0}/(δ0/2) whenever z ∈ V_{δ0/2}. By reassigning δ0 ← δ0/2 we attain (7).
Step 2. Next we let δ ∈ (0, δ0], to be fixed later, and note that if z ∈ V_δ then |Im(ϕ(z))| ≤ |Im(ϕ(Re(z)))| + M′_{δ0} |Im(z)| ≤ M′_{δ0} δ, thanks to the mean value theorem.
Step 3. Next, following Proposition 3.3, define U_δ := ∪_{i=1}^n (V_0^{i−1} × V_δ × V_0^{n−i}). Note that for u ∈ U_δ, H(u) is symmetric (possibly non-Hermitian) with at most 2n elements being non-real, which makes it easy to estimate (where we make use of the Frobenius norm)

    ‖Im(H(u))‖ ≤ √(2n) M′_{δ0} δ.

By a spectrum perturbation argument (namely, the Bauer–Fike theorem [3]) we have the corresponding bound on the spectrum,

    |Im(Sp(H(u)))| ≤ √(2n) M′_{δ0} δ.

The real part of the spectrum can be estimated similarly, |Re(Sp(H(u)))| ≤ ‖H(u)‖ ≤ n M_{δ0}.

Step 4. We next obtain bounds on V^tb(u) for u ∈ U_δ. If needed, we decrease δ such that √(2n) M′_{δ0} δ ≤ (π/3) k_B T, and use the formula [8, (2.26)],

    V^tb(u) = −(1/(2πi)) ∮_γ f^tb(z) tr (H(u) − zI)^{−1} dz,

where we take γ := ∂Ω with Ω := {z ∈ C : |Re(z)| ≤ n M_{δ0} + (π/3) k_B T, |Im(z)| ≤ (2π/3) k_B T}. For z ∈ γ we can hence estimate ‖(H(u) − zI)^{−1}‖ ≤ ((π/3) k_B T)^{−1} and thus

    sup_{u∈U_δ} |V^tb(u)| ≤ (2π)^{−1} n |γ| ((π/3) k_B T)^{−1} sup_{z∈Ω} |f^tb(z)| =: C.

Hence there exists ρ > 1 that depends only on n, δ0, M_{δ0}, and k_B T such that {z ∈ C : |z/R − 1| + |z/R + 1| ≤ ρ + ρ^{−1}} ⊂ V_δ, and Proposition 3.3 can therefore be applied to the function f(z) = V^tb(Rz) to yield an interpolating polynomial p̃_m(u) such that

    sup_{u: max_i |u_i| ≤ R} |V^tb(u) − p̃_m(u)| ≤ C ρ^{−m}.

Clearly, by construction, p̃_m ∈ P_perm.

Step 5. We define

    p_m(u) := (∫_{O(d)} p̃_m(Qu) dQ) / (∫_{O(d)} dQ) ∈ P_perm ∩ P_rot,

and thanks to the rotation invariance of V^tb we recover (6).
3.3 Proof of Theorem 3.1
We start with stating the First Fundamental Theorem for the orthogonal group O(d) [17]:

Theorem 3.4. p ∈ P is rotation invariant if and only if it can be represented as a polynomial of the n(n + 1)/2 scalar variables r_ij(u) := u_i · u_j, where 1 ≤ i ≤ j ≤ n.

We hence can identify a polynomial p = p(u) ∈ P_rot with a respective polynomial q = q(r) ∈ Q, where Q is the set of polynomials of the n(n + 1) scalar variables r = (r_ij)_{1≤i,j≤n} (we lift the requirement i ≤ j for the ease of notation in what follows). In order to proceed, we introduce the notation for composition of sequences, a_b := (a_{b_1}, ..., a_{b_m}) if b = (b_1, ..., b_m) ∈ {1, ..., n}^m and a = (a_1, ..., a_n). Correspondingly to P_perm, define

    Q_perm := {q ∈ Q : q(r) ≡ q(r_{σσ}) ∀σ ∈ S_n},

where we likewise let r_{σσ} := (r_{σ_i σ_j})_{1≤i,j≤n}. We next formulate a rather intuitive result, essentially stating that Q_perm can be spanned by symmetrizing all monomials formed from r.

Lemma 3.5.

    Q_perm = Span{ Σ_{σ∈S_n} Π_{i=1}^m Π_{j=i}^m (r_{σ_i σ_j})^{α_ij} : m ∈ N, m ≤ n, α ∈ Z_+^{m×m} },

where Z_+ := {0} ∪ N.

Proof. If q ∈ Q_perm then

    (1/n!) Σ_{σ∈S_n} q(r_{σσ}) = (1/n!) Σ_{σ∈S_n} q(r) = q(r).

It remains to apply this identity to all monomials q(r) = Π_{i=1}^m Π_{j=i}^m r_ij^{α_ij} to obtain the stated result.

The next step towards proving the main result is the following lemma, where we denote N := {1, ..., n}.

Lemma 3.6. For m ∈ N, m ≤ n, and α ∈ Z_+^{m×m},

    B_α(u) = Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (u_{γ_i} · u_{γ_j})^{α_ij}.
Proof. Before commencing with the proof, we note that we will use the distributive law of addition and multiplication in the following form:

    Π_{j=1}^m Σ_{i∈I} f(i, j) = Σ_{i∈I^m} Π_{j=1}^m f(i_j, j),

or, more generally,

    Π_{j∈J} Σ_{i∈I_j} f(i, j) = Σ_{i∈(I_j)_{j∈J}} Π_{j∈J} f(i_j, j).

We denote M := {(i, j) ∈ {1, ..., m}² : i < j}, B := ({1, ..., d}^{α_ij})_{(i,j)∈M}, and T^(i) := M_{α_ii, α′_i}. Then

    T^(i)_{β^(1,i) ... β^(i−1,i) β^(i,i+1) ... β^(i,m)}(u)
        = Σ_{γ∈N} |u_γ|^{2α_ii} (Π_{j=1}^{i−1} u_γ^{⊗α_ji})_{β^(j,i)} (Π_{j=i+1}^m u_γ^{⊗α_ij})_{β^(i,j)}
        = Σ_{γ∈N} |u_γ|^{2α_ii} (Π_{j=1}^{i−1} Π_{ℓ=1}^{α_ji} u_{γ,β_ℓ^(j,i)}) (Π_{j=i+1}^m Π_{ℓ=1}^{α_ij} u_{γ,β_ℓ^(i,j)}).

Hence we write (now explicitly summing over multiindices)

    B_α(u) = Π^α_{i=1}^m M_{α_ii, α′_i}(u)
        = Σ_{β∈B} Π_{i=1}^m T^(i)_{β^(1,i) ... β^(i−1,i) β^(i,i+1) ... β^(i,m)}(u)
        = Σ_{β∈B} Π_{i=1}^m Σ_{γ∈N} |u_γ|^{2α_ii} (Π_{j=1}^{i−1} Π_{ℓ=1}^{α_ji} u_{γ,β_ℓ^(j,i)}) (Π_{j=i+1}^m Π_{ℓ=1}^{α_ij} u_{γ,β_ℓ^(i,j)})
        = Σ_{β∈B} Σ_{γ∈N^m} Π_{i=1}^m |u_{γ_i}|^{2α_ii} (Π_{j=1}^{i−1} Π_{ℓ=1}^{α_ji} u_{γ_i,β_ℓ^(j,i)}) (Π_{j=i+1}^m Π_{ℓ=1}^{α_ij} u_{γ_i,β_ℓ^(i,j)})
        = Σ_{γ∈N^m} (Π_{i=1}^m |u_{γ_i}|^{2α_ii}) Σ_{β∈B} Π_{i=1}^m Π_{j=i+1}^m Π_{ℓ=1}^{α_ij} u_{γ_i,β_ℓ^(i,j)} u_{γ_j,β_ℓ^(i,j)}
        = Σ_{γ∈N^m} (Π_{i=1}^m |u_{γ_i}|^{2α_ii}) Π_{i=1}^m Π_{j=i+1}^m Π_{ℓ=1}^{α_ij} Σ_{β=1}^d u_{γ_i,β} u_{γ_j,β}
        = Σ_{γ∈N^m} Π_{i=1}^m |u_{γ_i}|^{2α_ii} Π_{j=i+1}^m (u_{γ_i} · u_{γ_j})^{α_ij}
        = Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (u_{γ_i} · u_{γ_j})^{α_ij},

where in the second and third steps we used the distributive law, in the fourth step we grouped, for each pair (i, j) ∈ M, the two factors sharing the multiindex β^(i,j), and in the last step we used |u_{γ_i}|^{2α_ii} = (u_{γ_i} · u_{γ_i})^{α_ii}.
Proof of Theorem 3.1. In view of the previous lemma, we can denote

    Q̃_perm := Span{ Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{γ_i γ_j})^{α_ij} : m ∈ N, m ≤ n, α ∈ Z_+^{m×m} },

and, applying the earlier lemmas, can formulate the statement of Theorem 3.1 as Q̃_perm = Q_perm, which in its turn is an immediate corollary of the following more specialized result.
Lemma 3.7. For m ∈ Z_+, m ≤ n, denote

    Q_perm^(m) := Span{ Σ_{σ∈S_n} Π_{i=1}^m Π_{j=i}^m (r_{σ_i σ_j})^{α_ij} : α ∈ Z_+^{m×m} }

and

    Q̃_perm^(m) := Span{ Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{γ_i γ_j})^{α_ij} : α ∈ Z_+^{m×m} }.

Then Q̃_perm^(m) = Q_perm^(m).
Before we prove this lemma, we give two auxiliary results.

Lemma 3.8. We equip N^m with the lexicographical order and hence denote Γ := {γ ∈ N^m : γ = min{σγ : σ ∈ S_n}} and C_γ := #{σγ : σ ∈ S_n}, where # denotes the number of elements in a set. Then

    Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{γ_i γ_j})^{α_ij} = Σ_{γ∈Γ} (C_γ/n!) Σ_{σ∈S_n} Π_{i=1}^m Π_{j=i}^m (r_{σ_{γ_i} σ_{γ_j}})^{α_ij}.

Proof. Any σ ∈ S_n induces a one-to-one mapping γ ↦ σγ on N^m. Hence

    Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{γ_i γ_j})^{α_ij} = Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{σ_{γ_i} σ_{γ_j}})^{α_ij}   ∀σ ∈ S_n,

and therefore

    Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{γ_i γ_j})^{α_ij} = (1/n!) Σ_{σ∈S_n} Σ_{γ∈N^m} Π_{i=1}^m Π_{j=i}^m (r_{σ_{γ_i} σ_{γ_j}})^{α_ij}.
It remains to group the terms for which σγ is the same.

The next auxiliary result is proved by a trivial combinatorial argument, essentially expressing that all elements of Γ other than (1, ..., m) ∈ Γ have repeated values.

Proposition 3.9. Let m ≥ 1. Then Γ = {(1, ..., m)} ∪ Γ_0, where Γ_0 := {γ ∈ Γ : max_i γ_i ≤ m − 1}.

Proof of Lemma 3.7. We argue by induction over m. For m = 0 the statement is obvious since Q_perm^(0) = Q̃_perm^(0) = Span{1}, therefore we only need to prove the induction step. Now, for m ∈ N, m ≤ n, we choose an arbitrary α and let

    q(r) := Σ_{σ∈S_n} Π_{j,k=1}^m (r_{σ_j σ_k})^{α_jk} ∈ Q_perm,   and
    q̃(r) := (n!/C_{(1,...,m)}) Σ_{γ∈N^m} Π_{j,k=1}^m (r_{γ_j γ_k})^{α_jk} ∈ Q̃_perm.

These q(r) and q̃(r) span Q_perm^(m) and Q̃_perm^(m), respectively. We aim to show that q(r) − q̃(r) ∈ Q_perm^(m−1) = Q̃_perm^(m−1). This will prove the result considering that, by definition, Q_perm^(m−1) ⊆ Q_perm^(m) and Q̃_perm^(m−1) ⊆ Q̃_perm^(m).
Indeed, in view of the two previous results, recall the definition Γ_0 := {γ ∈ Γ : max_i γ_i ≤ m − 1} and write

    q̃(r) − q(r) = (n!/C_{(1,...,m)}) Σ_{γ∈Γ} (C_γ/n!) Σ_{σ∈S_n} Π_{i=1}^m Π_{j=i}^m (r_{σ_{γ_i} σ_{γ_j}})^{α_ij} − Σ_{σ∈S_n} Π_{i=1}^m Π_{j=i}^m (r_{σ_i σ_j})^{α_ij}
                = Σ_{γ∈Γ_0} (C_γ/C_{(1,...,m)}) Σ_{σ∈S_n} Π_{i=1}^m Π_{j=i}^m (r_{σ_{γ_i} σ_{γ_j}})^{α_ij}.

Next, we denote

    α_{kℓ}^(γ) := Σ_{1≤i≤j≤m, γ_i=k, γ_j=ℓ} α_ij

and hence express

    q̃(r) − q(r) = Σ_{γ∈Γ_0} (C_γ/C_{(1,...,m)}) Σ_{σ∈S_n} Π_{k=1}^{m−1} Π_{ℓ=k}^{m−1} (r_{σ_k σ_ℓ})^{α_{kℓ}^(γ)}.

Note that the upper limit of both products is m − 1 thanks to the definition of Γ_0 (and, of course, Proposition 3.9). Also note that we used the fact that, by the definition of α^(γ), α_{kℓ}^(γ) = 0 whenever k > ℓ. Since Σ_{σ∈S_n} Π_{k=1}^{m−1} Π_{ℓ=k}^{m−1} (r_{σ_k σ_ℓ})^{α_{kℓ}^(γ)} belongs to Q_perm^(m−1), by its definition, for each γ, we finally derive that

    q̃(r) − q(r) ∈ Q_perm^(m−1) = Q̃_perm^(m−1),

which concludes the proof of the induction step.
4 Practical Implementation
Here we suggest some generalizations of the representation outlined in Section 3.1 which are more suitable for a practical implementation, in particular such that the requirement (R3) is satisfied. We then present an algorithm for computing B_α(u) and its derivatives with respect to u_i for a given u. Computing the derivatives is needed since, in addition to the interaction energy, we need the forces on the atoms.
4.1 Generalizing the Radial Functions
First, notice that for a fixed ν, a linear combination of moment tensors M_{µ,ν}(u) is a polynomial of |u_i|² multiplied by u_i^{⊗ν}. The space of polynomials of |u_i|² can be substituted with any other space of functions (which, for generality, can be made dependent on ν), provided that they can represent any regular function of |u|:

    M̃_{µ,ν}(u) := Σ_{i∈N} f_{µ,ν}(|u_i|) u_i^{⊗ν},    (8)

where, e.g., f_{µ,ν}(r) = r^{−µ−ν} f_cut(r) or f_{µ,ν}(r) = e^{−k_µ r} f_cut(r), with k_µ > 0 some sequence of real numbers and f_cut(r) some function such that f_cut(r) = 0 for r ≥ R_cut. Here f_{µ,ν}(r) plays essentially the same role as the "radial symmetry functions" in the NNP [5] or the radial basis functions in [1].

The mechanical interpretation of M̃_{µ,ν}(u) is as follows. Suppose that we set f_{µ,ν}(r) := 1 if r < r_0 and f_{µ,ν}(r) := 0 otherwise. Then M̃_{µ,0} simply gives the number of atoms at a distance less than r_0 (or the "mass" of these atoms), M̃_{µ,1} is the center of mass of these atoms (scaled by the mass), M̃_{µ,2} is the tensor of second moments of inertia, etc. For a general set of radial functions f_{µ,ν}(r), the functions M̃_{µ,ν} can be interpreted as weighted moments of inertia.
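This mechanical interpretation is easy to check numerically; the sketch below uses the step-function radial choice from the text, with an arbitrary radius r_0 = 3:

```python
import numpy as np

def moment_tensor(u, f, nu):
    """M~_{mu,nu}(u) = sum_i f(|u_i|) u_i^(tensor power nu) for an environment u
    given as an n-by-3 array of relative positions."""
    M = 0.0
    for ui in u:
        T = np.array(1.0)
        for _ in range(nu):               # build the nu-fold tensor power of u_i
            T = np.multiply.outer(T, ui)
        M = M + f(np.linalg.norm(ui)) * T
    return M

step = lambda r: 1.0 if r < 3.0 else 0.0  # f(r) = 1 for r < r_0, else 0

u = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [4.0, 0.0, 0.0]])
n_atoms = moment_tensor(u, step, 0)       # "mass": number of atoms closer than r_0
com     = moment_tensor(u, step, 1)       # mass-scaled center of mass
inertia = moment_tensor(u, step, 2)       # tensor of second moments
```

The third atom, at distance 4 from the center, is excluded by the step function, so only the first two contribute to each moment.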
4.2 Generalizing Bα to Tensors of Arbitrary Dimension
For implementation purposes it will be convenient to generalize B_α to tensors of arbitrary dimension. We first extend the product Π^α by allowing the result to be a tensor of nonzero dimension:

    (Π^α_{i=1}^k T^(i))_{β^(1,1) ... β^(k,k)} := Π_{i=1}^k T^(i)_{β^(i,1) ... β^(i,k)},

where each T^(i) is a tensor of dimension Σ_{j=1}^k α_ij (here we let α_ij = α_ji), and β^(i,j) = β^(j,i) are multiindices such that β^(i,j) ∈ {1, ..., d}^{α_ij} (summation over the repeated off-diagonal multiindices β^(i,j), i ≠ j, is implied). Hence we define

    B_{ᾱ,α}(u) := Π^α_{i=1}^k M̃_{ᾱ_i, Σ_{j=1}^k α_ij}(u),

which is a tensor of dimension Σ_{i=1}^k α_ii.

4.3 The Interatomic Potential
The interatomic potential thus has the form

    V(u) = Σ_{α∈A} c_α B_α(u),

where A is a set of matrix-valued indices fixed a priori and c_α is the set of coefficients found in the training stage.
4.4 The Algorithm of Computing the Energy and Forces
Next, we propose an algorithm for computing the interatomic potential and its derivatives. The algorithm consists of two parts, the precomputation step and the evaluation step. The precomputation step accepts the set A of values of α as an input and generates the data for the next step, which is the efficient evaluation of B_α(u) for a given u. Before we proceed, we make two observations.

1. The elements of the tensors M̃_{µ,ν} are the "moments" of the form

       m_{µ,β}(u) := Σ_{i∈N} f_{µ,ν}(|u_i|) Π_{j=1}^d u_{i,j}^{β_j},

where β is a multiindex such that |β| = ν. The higher the dimension is, the more repeated moments each M̃_{µ,ν} contains (e.g., each matrix M̃_{µ,2} has 9 elements, at most 6 of which may be different due to its symmetry).

2. The scalar functions B_α consist of products of M̃_{µ,ν} (which means that the elements of B_α are linear combinations of products of m_{µ,β}). Differentiating products of two terms is easier than products of three or more terms. Hence consider B_{ᾱ,α}, ᾱ ∈ Z_+^k, α ∈ Z_+^{k×k}, fix 1 ≤ I ≤ k, and denote β̄ := (ᾱ_1, ..., ᾱ_I), γ̄ := (ᾱ_{I+1}, ..., ᾱ_k),

       β := ( α_{11} + Σ_{i=I+1}^k α_{i1}    α_{12}                         ···  α_{1I}
              α_{12}                         α_{22} + Σ_{i=I+1}^k α_{i2}    ···  α_{2I}
              ⋮                              ⋮                              ⋱    ⋮
              α_{1I}                         α_{2I}                         ···  α_{II} + Σ_{i=I+1}^k α_{iI} ),

and

       γ := ( α_{I+1,I+1} + Σ_{i=1}^I α_{i,I+1}    α_{I+1,I+2}                          ···  α_{I+1,k}
              α_{I+1,I+2}                          α_{I+2,I+2} + Σ_{i=1}^I α_{i,I+2}    ···  α_{I+2,k}
              ⋮                                    ⋮                                    ⋱    ⋮
              α_{I+1,k}                            α_{I+2,k}                            ···  α_{kk} + Σ_{i=1}^I α_{i,k} ).

One can then express the elements of B_{ᾱ,α} through products of elements of B_{β̄,β} and B_{γ̄,γ}. Note that by reordering the rows and columns of α (and, correspondingly, the elements of ᾱ), one can generate many different ways of representing B_{ᾱ,α}(u). When doing the computations, one should exercise this freedom in such a way that the resulting tensors are of minimal dimension, so that the total computation cost is reduced. Finally, note that even if B_{ᾱ,α} is scalar-valued, B_{β̄,β} and B_{γ̄,γ} do not have to be scalar-valued, which motivates the need to generalize B_{ᾱ,α} to be tensor-valued.

Precomputation. The precomputation step is hence as follows (we keep the argument u of, e.g., B_α(u) to match the above notation, but particular values of u never enter the precomputation step):

• Starting with the set of B_α(u) (indexed by α ∈ A), establish their correspondence to B_{ᾱ,α}(u) and recursively represent each B_{ᾱ,α}(u) through some B_{β̄,β}(u) and B_{γ̄,γ}(u) as described above. In each case, out of all such products, choose the one with the minimal sum of the numbers of dimensions of these two tensors.

• Enumerate all elements in all tensors B_{ᾱ,α}(u) as b_i(u) (i is the ordinal number of the corresponding element).

• Represent each b_i(u) as either b_i(u) = m_{µ_i,β_i}(u) or b_i(u) = Σ_{j=1}^{J_i} c_j b_{ℓ_j}(u) b_{k_j}(u).

• Output the resulting (1) correspondence of B_α(u) to b_i(u), (2) the µ_i and β_i, and (3) the tuples (i, c, ℓ, k) corresponding to b_i(u) = Σ_{j=1}^{J_i} c_j b_{ℓ_j}(u) b_{k_j}(u).

Evaluation. The evaluation step, as written below, evaluates B_α(u) and their derivatives for a given input u, using the precomputed data described above.

1. For a given u, calculate all m_{µ_i,β_i}(u) and their derivatives with respect to all components of u.

2. Then calculate all other b_i(u) and their derivatives using the tuples (i, c, ℓ, k).

3. Finally, pick those b_i(u) that correspond to scalar B_α(u) (as opposed to non-scalar B_{ᾱ,α}).
It remains to form the linear combination of Bα (u) with the coefficients obtained from a linear regression (training), and then sum up the energies and the forces for all atomic environments to obtain the energy and forces for a given atomistic system.
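The precomputation/evaluation split can be sketched as follows; the moment list, the product table, and the coefficients below are toy choices of ours, meant only to show how basis values are assembled as products of raw moments and combined linearly (derivatives are omitted):

```python
import numpy as np

def moments(u, specs):
    """Step 1: raw moments m_{mu,beta}(u) = sum_i f(|u_i|) * prod_j u_{i,j}^beta_j."""
    vals = []
    for f, beta in specs:
        s = 0.0
        for ui in u:
            s += f(np.linalg.norm(ui)) * np.prod(ui ** np.array(beta))
        vals.append(s)
    return vals

def evaluate(u, specs, table, coeffs):
    """Step 2: build further basis values as products of earlier ones
    (a tuple (l, k) means b_new = b_l * b_k), then form V = sum_i c_i * b_i."""
    b = moments(u, specs)
    for l, k in table:
        b.append(b[l] * b[k])
    return sum(c * b[i] for i, c in coeffs)

fcut = lambda r: max(0.0, 5.0 - r) ** 2            # toy radial function with cutoff 5
specs = [(fcut, (0, 0, 0)),                        # beta = (0,0,0): scalar moment
         (fcut, (2, 0, 0)), (fcut, (0, 2, 0)), (fcut, (0, 0, 2))]
table = [(1, 1)]                                   # one product entry: b_4 = b_1 * b_1
coeffs = [(0, 0.5), (4, 0.01)]                     # V = 0.5*b_0 + 0.01*b_4

u = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
V = evaluate(u, specs, table, coeffs)
```

A real implementation would precompute `specs`, `table`, and the element-to-basis correspondence once per potential, and would propagate derivatives through the same product table to obtain forces.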
4.5 Training
Once the set A of the values of α is fixed, we need to determine the coefficients c_α. This is done with a regularized (to avoid overfitting) linear regression in the following way. For a database X = {x^(k) : k = 1, ..., K} of atomic configurations x^(k), each of size N^(k), with their reference energies and forces, E^q(x^(k)) = E^(k) and ∇E^q(x^(k)) = f^(k), we form an overdetermined system of linear equations on c_α,

    Σ_{i=1}^{N^(k)} Σ_{α∈A} c_α B_α(Dx^(k)_i) = E^(k),

    (∂/∂x_j^(k)) Σ_{i=1}^{N^(k)} Σ_{α∈A} c_α B_α(Dx^(k)_i) = f_j^(k)

(cf. (1) and (2)), which we write in the matrix form Xc = g. These equations may be ill-conditioned, hence a regularization must be used. We tried three versions of regularization, namely ℓ2, ℓ1, and ℓ0 regularization, all described below.

4.5.1 ℓ2 regularization
For the solution of the overdetermined linear equations Xc = g we take

    c = (XᵀX + γ diag(XᵀX))⁻¹ Xᵀg,

where diag(A) denotes the diagonal matrix whose diagonal elements are the same as those of A. The regularization parameter γ was determined from the 16-fold (unless stated otherwise) cross-validation scheme described in the next paragraph.

To perform the 16-fold cross-validation, we split the database X evenly into 16 non-overlapping databases X̃_1, ..., X̃_16. We then train 16 different models, each on the database X \ X̃_i, and find the RMS error when tested on X̃_i. The parameter γ is then chosen such that the cross-validation RMS error averaged over these 16 models is minimal.

4.5.2 ℓ1 regularization
The ℓ1 regularization minimizes ‖Xc − g‖² + γ‖c‖₁², where ‖•‖₁ is the ℓ1 norm of the vector. The advantage of the ℓ1 regularization is that it produces sparse solutions c. However, we found that the ℓ0 regularization described below produces better results (i.e., sparser solutions for a given target value of the regression error ‖Xc − g‖), hence we will focus on the latter.

4.5.3 ℓ0 regularization
We solve a sequence of problems, parametrized by an integer parameter N_nz (the number of non-zeros), as follows: find min ‖Xc − g‖ subject to ‖c‖₀ = N_nz, and choose the minimal N_nz such that ‖Xc − g‖ reaches the accuracy goal. Here ‖v‖₀ is the number of nonzero components of the vector v ∈ R^n (also called the ℓ0 norm).
To describe the algorithm, it is convenient to rewrite the problem as follows. Let 𝒜 be the set of all indices α (earlier denoted as A):

    find A ⊂ 𝒜 such that min_c ‖Xc − g‖ is minimal, subject to |A| = N_nz and c_{𝒜∖A} = 0.

This is essentially a compressed sensing problem [18], for which we take a standard greedy algorithm (similar to basis pursuit from the compressed sensing literature) and turn it into a genetic algorithm by adding local search and crossover steps. The main variable in this algorithm is the family (population) of sets A, which is denoted by 𝔸. The algorithm is as follows.

1. Let 𝔸 = {∅} as the solution for N_nz = 0.

2. For each A ∈ 𝔸 find i ∉ A such that the c corresponding to A ∪ {i} is the best (i.e., minimizes ‖Xc − g‖). Then replace A with A ∪ {i}.

3. "Crossover": if |𝔸| > 1, then do the following. For each pair of sets A, A′ ∈ 𝔸, randomly divide these sets into two nonintersecting subsets, A = A_1 ∪ A_2 and A′ = A′_1 ∪ A′_2, and generate the new sets A_1 ∪ A′_2 and A′_1 ∪ A_2. Then replace all sets A with the newly generated sets. (Note that if |𝔸| = 4 then up to 12 sets will be generated: 2 sets from each of the 6 pairs.)

4. "Local search":

   4.1. For each A ∈ 𝔸 find j ∈ A such that the c corresponding to A ∖ {j} is the best.

   4.2. Then find i ∉ A ∖ {j} such that the c corresponding to (A ∖ {j}) ∪ {i} is the best.

   4.3. If i ≠ j then: 4.3a. include (A ∖ {j}) ∪ {i} into 𝔸; 4.3b. if |𝔸| > 4 then exclude A from 𝔸; and 4.3c. go to step 4.1.

5. Remove all but the 4 best sets in 𝔸.

6. Repeat steps 2–5 until the accuracy goal on ‖Xc − g‖ is reached. Then take the best set A ∈ 𝔸 and compute the corresponding c.

We note that whenever A is fixed, finding the c corresponding to A is easy: c = ((XᵀX)_{AA})^{−1} (Xᵀg)_A, where •_A and •_{AA} are the operations of extracting the subset of rows and columns corresponding to A ⊂ 𝒜.
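The ℓ2 solve of Section 4.5.1 and the greedy growth step (step 2 above) can be sketched on synthetic data as follows; the crossover and local-search steps of the full genetic algorithm are omitted:

```python
import numpy as np

def ridge_fit(X, g, gamma):
    """c = (X^T X + gamma * diag(X^T X))^(-1) X^T g, the l2-regularized solution."""
    A = X.T @ X
    return np.linalg.solve(A + gamma * np.diag(np.diag(A)), X.T @ g)

def fit_on(X, g, S):
    """Least-squares coefficients supported on the index set S, zeros elsewhere."""
    idx = sorted(S)
    c = np.zeros(X.shape[1])
    c[idx] = np.linalg.lstsq(X[:, idx], g, rcond=None)[0]
    return c

def greedy_l0(X, g, tol):
    """Greedy step 2: repeatedly add the single index whose inclusion most
    reduces ||Xc - g||, until the accuracy goal tol is reached."""
    S = set()
    while True:
        c = fit_on(X, g, S) if S else np.zeros(X.shape[1])
        if np.linalg.norm(X @ c - g) <= tol or len(S) == X.shape[1]:
            return S, c
        best = min((i for i in range(X.shape[1]) if i not in S),
                   key=lambda i: np.linalg.norm(X @ fit_on(X, g, S | {i}) - g))
        S.add(best)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
c_true = np.zeros(8)
c_true[[1, 5]] = [2.0, -1.0]          # a truly sparse coefficient vector
g = X @ c_true

c_l2 = ridge_fit(X, g, 0.0)           # gamma = 0 on noiseless data
S, c_l0 = greedy_l0(X, g, tol=1e-8)
```

On noise-free data with full-rank X, the unregularized ℓ2 solve recovers the coefficients exactly, and the greedy loop stops once the residual reaches the accuracy goal; in practice the selection is applied to the much larger MTP basis.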
5 Numerical Experiments
Our next goal is to understand how MTP performs compared to other interatomic potentials. Unfortunately, there are no existing works performing a quantitative comparison between different machine-learning potentials in terms of accuracy and computational efficiency. In the present work we compare the performance of MTP with GAP for tungsten [15, 14], thanks to the QM database and the GAP code published at www.libatoms.org. We test MTP on fitting the database of 9 693 configurations of tungsten, with nearly 150 000 individual atomic environments, and compare it to the analogously fitted GAP, namely the one tagged as GAP6 in [15] or as the iterative-SOAP-GAP (I-S-GAP) in [14].
We chose R_cut = 4.9 Å and also set the minimal distance parameter to R_min := 1.9 Å. For the radial functions we chose

    f̂_{µ,ν}(r) := r^{−ν−2} r^µ (R_cut − r)²  for r < R_cut,   and   f̂_{µ,ν}(r) := 0  for r ≥ R_cut,

and then for each ν we orthonormalized them on the interval [R_min, R_cut] with the weight r^{ν+2}(r − R_min)(R_cut − r) to get the functions f_{µ,ν} used with (8). Here r^{−ν} compensates for the factor u^{⊗ν}, r^{−2} prioritizes closer atoms over more distant ones, and (R_cut − r)² ensures a smooth cut-off.
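The orthonormalization just described can be sketched with a discrete Gram–Schmidt procedure; the quadrature grid and the number of functions retained are illustrative:

```python
import numpy as np

Rmin, Rcut, nu = 1.9, 4.9, 0          # values from the text; nu fixed for this sketch
r = np.linspace(Rmin, Rcut, 2001)     # quadrature grid on [Rmin, Rcut]
dr = r[1] - r[0]
w = r**(nu + 2) * (r - Rmin) * (Rcut - r)   # orthonormalization weight from the text

def fhat(mu):
    """Raw radial functions fhat_{mu,nu}(r) = r^(-nu-2) * r^mu * (Rcut - r)^2."""
    return r**(-nu - 2) * r**mu * (Rcut - r)**2

def dot(a, b):
    """Weighted L2 inner product, approximated by a simple Riemann sum."""
    return float(np.sum(w * a * b) * dr)

basis = []                            # Gram-Schmidt on fhat_0, ..., fhat_3
for mu in range(4):
    f = fhat(mu)
    for b in basis:
        f = f - dot(f, b) * b
    basis.append(f / np.sqrt(dot(f, f)))
```

The resulting tabulated functions are orthonormal in the discrete weighted inner product; in practice this is repeated for each ν.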
5.1 Convergence with the Number of Basis Functions
Even though it is proved theoretically that MTP is systematically improvable, the actual accuracy depends on the size of the regression problem to be solved. We therefore first study how fast the MTP fit converges with the number of basis functions used. Theorem 3.2 suggests that the fitting error decreases exponentially with the polynomial degree. At the same time, a crude upper bound on the dimension of the space of polynomials of degree m is (n + 1)^m, where n is the number of atoms in the atomic neighborhood. This suggests an algebraic decay of the fitting error with the number of basis functions. We hence study the convergence of the fitting error for potentials based on two choices of the sets of α. The first choice, A_N, was to limit the degree of B_α for α ∈ A_N by N: A_N = {α : deg(B_α) ≤ N}. Since we modified the radial functions f_{µ,ν}, we adopt the convention that the degree of f_{µ,ν} is µ + 1 (while the degree of the constant function f(u) := 1 is zero). Interestingly, it was found that a slightly better choice is A′_N = {α : deg(B_α) ≤ N + 8(#α)}, where #α is the length of α. In fact, A′_N corresponds to A_N but with the convention deg(f_{µ,ν}) = µ + 5. Figure 1 displays the RMS fitting error of the force as a function of the size of the set A_N. One can observe algebraic convergence, consistent with Theorem 3.2. A preasymptotic regime with a faster algebraic convergence rate can be seen for small #A.
[Figure 1: log–log plot of the RMS fitting error of the force versus #A, showing err(A_N) and err(A′_N) together with the fitted trend 0.3 (#A)^{−0.227}.]
Figure 1: RMS fitting error of the force as a function of the size of AN and A0N . An algebraic convergence can be observed.
5.2 Performance Tests
We next test the performance of MTP and compare it to GAP in terms of fitting error and computation (CPU) time. We chose three versions of MTP differing from each other by the set A. The first MTP, denoted by MTP1, was generated by limiting deg(B_α) + 8(#α) ≤ 62 and additionally limiting (#α) ≤ 4, µ ≤ 5, and ν ≤ 4. MTP2 was generated by limiting deg(B_α) + 8(#α) ≤ 52, additionally limiting (#α) ≤ 5, µ ≤ 3, and ν ≤ 5, and applying the ℓ0 regularization algorithm (as described in Section 4.5.3) to extract 760 basis functions. Finally, MTP3 was similarly generated by extracting 272 basis functions.

    Potential:                       GAP        MTP1       MTP2      MTP3
    CPU time/atom [ms]:              134.2±2.6  37.1±0.8   4.4±0.2   0.85±0.06
    basis functions:                 10 000     11 133     760       272
    Fit errors:
      force RMS error [eV/Å]:        0.0633     0.0427     0.0633    0.0995
      [%]:                           4.2%       2.8%       4.2%      6.6%
    Cross-validation errors:
      regularization parameter γ:    —          3·10⁻⁹     0         0
      force RMS error [eV/Å]:        —          0.0511     0.0642    0.101
      [%]:                           —          3.4%       4.3%      6.7%
Table 1: Efficiency and accuracy of MTP as compared to GAP. The mean CPU time is given for a single-core computation on a laptop Intel i7-2675QM CPU, together with the standard deviation computed over 12 independent runs. The last section of the table reports the force RMS errors for the 16-fold cross-validation.

The data from the conducted efficiency and accuracy tests are summarized in Table 1. The RMS force (more precisely, the RMS of all force components over all non-single-atom configurations) was 1.505 eV/Å, which was used to compute the relative RMS error. The errors relative to this RMS force are also presented in the table. The GAP error was calculated based on the data from [14]. The CPU times do not include the initialization (precomputation) or the construction of the atomistic neighborhoods. One can see that MTP1 has about the same number of fitting parameters as GAP, while its fitting accuracy is about 1.5 times better and its computation time is 3.5 times smaller. MTP2 was constructed such that its fitting accuracy is comparable to GAP, but it uses many fewer parameters and its computation is 30 times faster. Finally, MTP3 is a potential with a 1.5 times larger fitting error, while its computation time is about 5 times smaller than that of MTP2.

Also included in the table is the 16-fold cross-validation error. It shows that MTP2 and MTP3 are not overfitted on the given database, whereas MTP1 needs regularization to avoid overfitting. We anticipate, however, that by significantly increasing the database size, the cross-validation error of MTP1 would go down and reach the current fitting error of MTP1, since the fitting error closely follows the algebraic decay (see Figure 1) and is not expected to deteriorate with increasing database size.
6 Conclusion
The present paper considers the problem of approximating a QM interaction model with interatomic potentials from a mathematical point of view. In particular, (1) a new class of nonparametric (i.e., systematically improvable) potentials satisfying all the required symmetries has been proposed, and its advantages in terms of accuracy and performance over the existing schemes have been discussed; (2) an algebraic convergence of fitting with these potentials in a simple setting has been proved and then confirmed with numerical experiments.
Acknowledgement The author is grateful to Gábor Csányi for igniting the interest in this topic and for valuable discussions as part of our ongoing collaboration. The author also thanks Albert Bartók-Pártay for his help with the implementation of MTP in the QUIP software¹, which was used to compare the performance of MTP with GAP.
References

[1] A. P. Bartók, R. Kondor, and G. Csányi. On representing chemical environments. Physical Review B, 87(18):184115, 2013.
[2] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Physical Review Letters, 104(13):136403, 2010.
[3] F. L. Bauer and C. T. Fike. Norms and exclusion theorems. Numerische Mathematik, 2(1):137–141, 1960.
[4] J. Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics, 134(7):074106, 2011.
[5] J. Behler. Representing potential energy surfaces by high-dimensional neural network potentials. Journal of Physics: Condensed Matter, 26(18):183001, 2014.
[6] J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical Review Letters, 98(14):146401, 2007.
[7] B. J. Braams and J. M. Bowman. Permutationally invariant potential energy surfaces in high dimensionality. International Reviews in Physical Chemistry, 28(4):577–606, 2009.
[8] H. Chen and C. Ortner. QM/MM methods for crystalline defects. Part 1: Locality of the tight binding model. arXiv preprint arXiv:1505.05541, 2015.
[9] H. Chen and C. Ortner. QM/MM methods for crystalline defects. Part 2: Consistent energy and force-mixing. arXiv preprint arXiv:1509.06627, 2015.
[10] W. Hackbusch and B. N. Khoromskij. Towards H-matrix approximation of the linear complexity. In Problems and Methods in Mathematical Physics, Operator Theory Adv. Appl., volume 121, pages 194–220. Birkhäuser Verlag, Basel, 2001.
[11] M. Hirn, N. Poilvert, and S. Mallat. Quantum energy regression using scattering transforms. arXiv preprint arXiv:1502.02077, 2015.

¹ Available at http://www.libatoms.org/Home/LibAtomsQUIP
[12] Z. Li, J. R. Kermode, and A. De Vita. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Physical Review Letters, 114(9):096405, 2015.
[13] F. Q. Nazar and C. Ortner. Locality of the Thomas-Fermi-von Weizsäcker equations. arXiv preprint arXiv:1509.06753, 2015.
[14] W. J. Szlachta. First principles interatomic potential for tungsten based on Gaussian process regression. arXiv preprint arXiv:1403.3291, 2014.
[15] W. J. Szlachta, A. P. Bartók, and G. Csányi. Accuracy and transferability of Gaussian approximation potential models for tungsten. Physical Review B, 90(10):104108, 2014.
[16] A. P. Thompson, L. P. Swiler, C. R. Trott, S. M. Foiles, and G. J. Tucker. SNAP: Automated generation of quantum-accurate interatomic potentials. Journal of Computational Physics, 285:316–330, 2015.
[17] H. Weyl. The Classical Groups: Their Invariants and Representations. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 1997.
[18] Y. Zhang. Theory of compressive sensing via ℓ1-minimization: a non-RIP analysis and extensions. Journal of the Operations Research Society of China, 1(1):79–105, 2013.