Resolution of Singularities and the Generalization Error with Bayesian Estimation for Layered Neural Network

Miki Aoyagi∗ and Sumio Watanabe†



Abstract

Hierarchical learning models such as layered neural networks have singular Fisher metrics, since their parameters are not identifiable. These are called non-regular learning models. The stochastic complexities of non-regular learning models in Bayesian estimation are obtained asymptotically by using the poles of their zeta functions, which are the integrals of their Kullback distances and their a priori probability density functions [1, 2, 3]. However, for several examples only upper bounds of the main terms in the asymptotic forms of the stochastic complexities were obtained, not the exact values, because of the computational complexity involved. In this paper, we show a computational method for obtaining the exact value for the layered neural network, and we give the asymptotic form of its stochastic complexity explicitly.

∗ Department of Mathematics, Faculty of Science and Technology, Sophia University, 7-1 Kioi-cho, Chiyoda-ku, Tokyo 102-8554, Japan, [email protected]
† Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku, Yokohama 226-8503, Japan, [email protected]


Key Words: Stochastic complexity, layered neural networks, non-regular learning models, Bayesian estimation, resolution of singularities

1 Introduction

Learning models such as layered neural networks, normal mixtures and Boltzmann machines have singular Fisher information matrices $I(w)$: their parameters $w$ are not identifiable. For example, $I(w_0)$ of a three layered neural network is singular ($\det I(w_0) = 0$) when $w_0$ represents a smaller model with fewer hidden units than the network itself. The subset of parameters representing the smaller model is an analytic variety in the parameter space. Such a learning model is called a non-regular model. Theories for regular statistical models, for example the model selection criteria AIC [4], TIC [5], HQ [6], NIC [7], BIC [8] and MDL [9], cannot be applied to the analysis of such non-regular models. Since non-regular models are already applied in practice throughout information technology, a mathematical theory for them is urgently needed.

Recently, a close connection between the Bayesian stochastic complexities of non-regular learning models and resolution of singularities has been revealed in [1, 2, 3], as follows. Let $n$ be the number of training samples of a non-regular learning model. Its average stochastic complexity (its free energy) $F(n)$ is asymptotically equal to

$$F(n) = \lambda \log n - (\ell - 1)\log\log n + O(1),$$

where $\lambda$ is a positive rational number, $\ell$ is a natural number and $O(1)$ is a bounded function of $n$. Its Bayesian generalization error $G(n)$ is the average Kullback distance of the inference of the non-regular learning model from the true distribution. Since it is known that $G(n) = F(n+1) - F(n)$ if $G(n)$ has an asymptotic expansion, we have

$$G(n) \cong \frac{\lambda}{n} - \frac{\ell - 1}{n\log n}.$$

The values $\lambda$ and $\ell$ are obtained from the poles of the learning model's zeta function, based on a resolution of its singularities. The zeta function is defined as the integral of the Kullback distance weighted by the a priori probability density function of the learning model. For regular models, $\lambda = d/2$ and $\ell = 1$, where $d$ is the dimension of the parameter space. Non-regular models have a smaller value of $\lambda$ than $d/2$, so in Bayesian estimation they are more effective learning models than regular ones.

In spite of these mathematical foundations, the values $\lambda$ had not been obtained, since they are difficult to calculate. In the past, only upper bounds were obtained in the cases of the three layered neural network [11] and reduced rank regression [12]. To overcome this difficulty, the paper [10] proposed a probabilistic calculation method for $\lambda$, but that method could not obtain the exact values either. In fact, poles of zeta functions have been investigated thoroughly only in special cases, for example in prehomogeneous vector spaces, but Kullback distances do not occur in that setting. Investigating the poles of Kullback distances is therefore a new problem even in mathematics. Moreover, by Hironaka's theorem [13], it is known that a desingularization of an arbitrary polynomial can be obtained by a blowing-up process. However, desingularization of a general polynomial, although known to terminate in finitely many steps, is very difficult in practice. Furthermore, Kullback distances are not simple polynomials: they contain parameters, for example $p$, the number of hidden units of the three layered neural network, in Eq.(3).

In this paper, we propose a new method for obtaining the exact asymptotic form of the stochastic complexity and its main term $\lambda$ for hierarchical learning models. Our method uses a recursive blowing-up, which yields a complete desingularization. By applying it to the three layered neural network, we show the effectiveness of the method. The method clarifies, for the first time, the asymptotic behavior of the stochastic complexity in Bayesian estimation for the three layered neural network, so that the asymptotic behaviors of regular and non-regular models can be compared. One application of our result to learning theory is the following. Marginal likelihoods of non-regular learning models are estimated by the MCMC method for hyperparameter estimation and model selection, but their theoretical values were not known. The theoretical values of the marginal likelihoods are given in this paper, which provides a mathematical foundation for analyzing and improving the precision of the MCMC method.

We explain Bayesian learning theory in section 2 and resolution of singularities in section 3. The main term $\lambda$ and its order $\ell$ for the three layered neural network are obtained in section 4.
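As a rough numerical illustration of the asymptotic forms above (this sketch and its numbers are illustrative assumptions, not part of the original analysis), one can compare the leading term of the Bayesian generalization error $G(n) \cong \lambda/n - (\ell-1)/(n\log n)$ for a regular model, where $\lambda = d/2$ and $\ell = 1$, with that of a non-regular model having a smaller $\lambda$:

```python
import math

def g_asymptotic(n, lam, ell):
    # Leading terms of the Bayesian generalization error,
    # G(n) ~ lambda / n - (ell - 1) / (n * log n).
    return lam / n - (ell - 1) / (n * math.log(n))

d = 20                 # assumed parameter dimension (illustrative only)
lam_regular = d / 2    # regular statistical model: lambda = d/2, ell = 1
lam_singular = 3.0     # assumed smaller lambda of a non-regular model
for n in (100, 1000, 10000):
    print(n, g_asymptotic(n, lam_regular, 1), g_asymptotic(n, lam_singular, 1))
```

The smaller $\lambda$ of the non-regular model gives a smaller generalization error at every sample size, which is the sense in which such models are described above as more effective in Bayesian estimation.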


2 Bayesian learning theory

In this section, we give a framework of Bayesian learning obtained in [1, 2, 3]. Let $R^N$ be an input-output space and $W \subset R^d$ a parameter space. Take $x \in R^N$ and $w \in W$. Let $p(x|w)$ be a learning model and $\psi(w)$ an a priori probability density function. Assume that the true probability distribution $p(x|w_0)$ is included in the learning model. Let $X^n = (X_1, X_2, \ldots, X_n)$ be $n$ training samples; the $X_i$ are randomly selected from the true probability distribution $p(x|w_0)$. Then the a posteriori probability density function $p(w|X^n)$ is written as

$$p(w|X^n) = \frac{1}{Z_n}\,\psi(w)\prod_{i=1}^{n} p(X_i|w),$$

where

$$Z_n = \int_W \psi(w)\prod_{i=1}^{n} p(X_i|w)\,dw.$$

The average inference $p(x|X^n)$ of the Bayesian distribution is given by

$$p(x|X^n) = \int p(x|w)\,p(w|X^n)\,dw.$$

Then its generalization error $G(n)$ is written as

$$G(n) = E_n\Bigl\{\int p(x|w_0)\log\frac{p(x|w_0)}{p(x|X^n)}\,dx\Bigr\}, \qquad (1)$$

where $E_n\{\cdot\}$ denotes the expectation value over the training samples. Let

$$K_n(w) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(X_i|w_0)}{p(X_i|w)}.$$

Its average stochastic complexity (the free energy)

$$F(n) = -E_n\Bigl\{\log\int \exp(-nK_n(w))\,\psi(w)\,dw\Bigr\} \qquad (2)$$

satisfies $G(n) = F(n+1) - F(n)$, if $G(n)$ has an asymptotic expansion. Define the zeta function $J(z)$ of the learning model by

$$J(z) = \int K(w)^z\,\psi(w)\,dw,$$

where $K(w)$ is the Kullback distance of the learning model:

$$K(w) = \int p(x|w_0)\log\frac{p(x|w_0)}{p(x|w)}\,dx.$$

Then, for the maximum pole $-\lambda$ of $J(z)$ and its order $\ell$, we have

$$F(n) = \lambda\log n - (\ell-1)\log\log n + O(1),$$

and

$$G(n) \cong \frac{\lambda}{n} - \frac{\ell-1}{n\log n},$$

where $O(1)$ is a bounded function of $n$. The values $\lambda$ and $\ell$ can be calculated by using a blowing-up process.
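As a small hedged illustration (not from the paper) of the statement that $\lambda = d/2$ for regular models, consider the one-dimensional toy case $K(w) = w^2$ with a uniform prior on $[-1,1]$: the quantity $-\log\int \exp(-nK(w))\,\psi(w)\,dw$ grows like $\frac{1}{2}\log n$, so a fitted slope against $\log n$ recovers $\lambda = 1/2 = d/2$. The choice of $K$ and prior here is an illustrative assumption, not the network model of section 4.

```python
import numpy as np

def toy_free_energy(n, num_points=200001):
    # -log \int exp(-n K(w)) psi(w) dw for the toy case K(w) = w^2
    # with psi the uniform density on [-1, 1] (illustrative assumption).
    w = np.linspace(-1.0, 1.0, num_points)
    psi = 0.5  # uniform prior density on [-1, 1]
    integrand = np.exp(-n * w**2) * psi
    return -np.log(np.sum(integrand) * (w[1] - w[0]))

ns = np.array([1e2, 1e3, 1e4, 1e5])
F = np.array([toy_free_energy(n) for n in ns])
slope = np.polyfit(np.log(ns), F, 1)[0]
print("estimated lambda:", slope)   # close to 0.5 = d/2 for d = 1
```

For the singular network model of section 4 the analogous coefficient is the strictly smaller value $\lambda$ given by the Main Theorem, which is exactly what the resolution of singularities computes.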

3 Resolution of singularities

In this section, we introduce resolution of singularities. A blowing-up process is the main tool in resolution of singularities of algebraic varieties. The following theorem is the analytic version of Hironaka's theorem [13] used by Atiyah [14].

Theorem 1  Let $f(x)$ be a real analytic function in a neighborhood of $0 \in R^n$. There exist an open set $V \ni 0$, a real analytic manifold $U$ and a proper analytic map $\mu$ from $U$ to $V$ (proper means that inverse images of compact sets under $\mu$ are compact) such that

(1) $\mu : U \setminus E \to V \setminus f^{-1}(0)$ is an isomorphism, where $E = \mu^{-1}(f^{-1}(0))$,

(2) for each $u \in U$ there is a local analytic coordinate $(u_1,\cdots,u_n)$ such that $f(\mu(u)) = \pm u_1^{s_1}u_2^{s_2}\cdots u_n^{s_n}$, where $s_1,\cdots,s_n$ are non-negative integers.

Applying Hironaka's theorem to the Kullback distance $K(w)$, for each $w \in K^{-1}(0)\cap W$ we have a proper analytic map $\mu_w$ from an analytic manifold $U_w$ to a neighborhood $V_w$ of $w$ satisfying Theorem 1 (1) and (2). Then the local integral on $V_w$ of the zeta function $J(z)$ of the learning model is

$$J_w(z) = \int_{V_w} K(w)^z\,\psi(w)\,dw = \int_{U_w} |u_1^{s_1}u_2^{s_2}\cdots u_n^{s_n}|^{z}\,\psi(\mu_w(u))\,|\mu_w'(u)|\,du.$$

Therefore, we can obtain the value of $J_w(z)$. For each $w \in W \setminus K^{-1}(0)$, there exists a neighborhood $V_w$ such that $K(w') \neq 0$ for all $w' \in V_w$, so $J_w(z) = \int_{V_w} K(w)^z\,\psi(w)\,dw$ has no poles. Since the parameter set $W$ is compact, the poles of $J(z)$ and their orders are computable.

Next we explain the construction of the blowing-up along a manifold used in this paper. Define the manifold $M$ by gluing $k$ open sets $U_i \cong R^n$, $i = 1, 2, \cdots, k$ ($n \ge k$), as follows. Denote the coordinate of $U_i$ by $(\xi_{1i},\cdots,\xi_{ni})$. Set the equivalence relation

$$(\xi_{1i},\xi_{2i},\cdots,\xi_{ni}) \sim (\xi_{1j},\xi_{2j},\cdots,\xi_{nj})$$

whenever $\xi_{ji}\neq 0$ and $\xi_{ij}\neq 0$, by

$$\xi_{ij} = 1/\xi_{ji},\quad \xi_{jj} = \xi_{ii}\xi_{ji},\quad \xi_{hj} = \xi_{hi}/\xi_{ji}\ (1\le h\le k,\ h\neq i,j),\quad \xi_{lj} = \xi_{li}\ (k+1\le l\le n).$$

Put $M = \coprod_{i=1}^{k} U_i/\sim$. Then the blowing map $\pi : M \to R^n$ is defined, for each $(\xi_{1i},\cdots,\xi_{ni}) \in U_i$, by

$$(\xi_{1i},\cdots,\xi_{ni}) \mapsto (\xi_{ii}\xi_{1i},\cdots,\xi_{ii}\xi_{i-1\,i},\ \xi_{ii},\ \xi_{ii}\xi_{i+1\,i},\cdots,\xi_{ii}\xi_{ki},\ \xi_{k+1\,i},\cdots,\xi_{ni}).$$

This map is well-defined and is called the blowing-up along

$$N = \{(x_1,\cdots,x_k,x_{k+1},\cdots,x_n)\in R^n \mid x_1 = \cdots = x_k = 0\}.$$

The blowing map satisfies

(1) $\pi : M \to R^n$ is proper,

(2) $\pi : M \setminus \pi^{-1}(N) \to R^n\setminus N$ is an isomorphism.
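To make the chart description concrete, here is a brief hedged sketch (my own illustration, not part of the paper) of the blowing map $\pi$ restricted to a single chart $U_i$, exactly as defined above; the function name and argument conventions are assumptions made for the example.

```python
def blowup_chart(xi, i, k):
    """Blowing map pi on the chart U_i of the blow-up of R^n along
    N = {x_1 = ... = x_k = 0}.

    xi : chart coordinates (xi_{1 i}, ..., xi_{n i}), given as a list
    i  : chart index, 1 <= i <= k
    k  : number of coordinates that vanish on the center N
    """
    n = len(xi)
    t = xi[i - 1]          # the coordinate xi_{i i}
    image = []
    for h in range(1, n + 1):
        if h <= k:
            # x_h = xi_{i i} * xi_{h i}, except x_i = xi_{i i} itself
            image.append(t if h == i else t * xi[h - 1])
        else:
            # the remaining coordinates are unchanged: x_h = xi_{h i}
            image.append(xi[h - 1])
    return image

# Example: the chart U_1 of the blow-up of R^3 along {x_1 = x_2 = 0} (k = 2).
print(blowup_chart([0.5, 2.0, 3.0], i=1, k=2))   # [0.5, 1.0, 3.0]
```

In this chart the exceptional set $\pi^{-1}(N)$ is $\{\xi_{ii} = 0\}$, and away from it the map is invertible, which is the isomorphism property (2) above.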

4 The learning curve of a three layered neural network

In this section, we obtain the maximum pole of the zeta function of a three layered neural network by using a blowing-up process. Consider the three layered neural network with one input unit, $p$ hidden units and one output unit. Denote an input value by $x$ and an output value by $y$. Let

$$f(x,w) = \sum_{m=1}^{p} a_m \tanh(b_m x),$$

where $w = \{a_m, b_m \mid m = 1,\cdots,p\}$ is the parameter vector.

Consider the statistical model

$$p(y|x,w) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}\bigl(y-f(x,w)\bigr)^2\Bigr).$$

Assume that the probability density function $q(x)$ of the input $x$ is the uniform distribution on the interval $[-1,1]$, and that the a priori probability density function $\psi(w)$ of $w$ is a $C^\infty$ function with compact support $W$ satisfying $\psi(0) > 0$. Let the true distribution be $p(y|x,w_0) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}y^2)$; that is, the true parameter set which gives the true distribution contains the case $a_1 = \cdots = a_p = b_1 = \cdots = b_p = 0$. Then the Kullback distance is

$$K(w) = \frac{1}{2}\int_{-1}^{1} f(x,w)^2\,dx.$$

By using the Taylor expansion of $K(w)$, the maximum pole $-\lambda$ of $\int_W K(w)^z\,\psi\,dw$ and its order $\ell$ are equal to those of

$$\int_W \Bigl\{\sum_{n=1}^{p}\Bigl(\sum_{m=1}^{p} a_m b_m^{2n-1}\Bigr)^2\Bigr\}^z \prod_{m=1}^{p} da_m\,db_m.$$

In the paper [15], it is shown that

$$\sum_{m=1}^{p}\frac{1}{4m-2} \le \lambda \le \frac{\sqrt{p}}{2}.$$

Let $\Psi$ be the differential form

$$\Psi = \Bigl\{\sum_{n=1}^{p}\Bigl(\sum_{m=1}^{p} a_m b_m^{2n-1}\Bigr)^2\Bigr\}^z \prod_{m=1}^{p} da_m\,db_m.$$

Main Theorem  We have

$$\lambda = \frac{p+i^2+i}{4i+2},\qquad \ell = \begin{cases} 2, & i^2 = p,\\ 1, & i^2 < p,\end{cases} \qquad (3)$$

where $i$ is the maximum integer satisfying $i^2 \le p$.
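The closed form in the Main Theorem is straightforward to evaluate. The following hedged sketch (my own illustration, not part of the paper) computes $\lambda$ and $\ell$ for a given number of hidden units $p$, and compares $\lambda$ with the upper bound $\sqrt{p}/2$ of [15] quoted above and with the regular-model value $d/2 = p$ (the network has $d = 2p$ parameters):

```python
import math

def main_theorem(p):
    """lambda and ell of the Main Theorem for p hidden units."""
    i = math.isqrt(p)                    # maximum integer with i^2 <= p
    lam = (p + i * i + i) / (4 * i + 2)
    ell = 2 if i * i == p else 1
    return lam, ell

for p in (1, 2, 3, 4, 9, 10, 16, 25):
    lam, ell = main_theorem(p)
    print(f"p={p:2d}  lambda={lam:.4f}  ell={ell}  "
          f"sqrt(p)/2={math.sqrt(p)/2:.4f}  d/2={p}")
```

In every case $\lambda < p = d/2$, illustrating the claim in the introduction that the non-regular model has a strictly smaller stochastic-complexity coefficient than a regular model of the same dimension, and $\lambda$ attains the upper bound $\sqrt{p}/2$ exactly when $p = i^2$.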

We prove the Main Theorem by using a blowing-up process of $\Psi$. Before the proof, let us fix some notation. Since we change variables repeatedly during the blowing-up process, it is more convenient to reuse the same symbols $b_m$ rather than introducing $b'_m, b''_m, \cdots$; for instance, we write "Let $b_1 = v_1$, $b_m = v_1 b_m$" instead of "Let $b_1 = v_1$, $b_m = v_1 b'_m$". We divide the proof into two parts: one part obtains the desingularization, and the other compares the absolute values of the poles.

Proof of Main Theorem: Part 1

Construct the blowing-up along the submanifold $\{b_1 = 0,\cdots,b_p = 0\}$. Let $M$ be the manifold $M = \coprod_{i=1}^{p} U_i/\sim$. Set the coordinate on $U_1$ to be $(a_1,\cdots,a_p,v_1,b_2,\cdots,b_p)$. After the blowing-up by the transform $\{b_1 = v_1,\ b_m = v_1 b_m,\ m = 2,\cdots,p\}$ on $U_1$, we have

$$\Psi = \Bigl\{\sum_{n=1}^{p} v_1^{4n-2}\Bigl(a_1 + \sum_{m=2}^{p} a_m b_m^{2n-1}\Bigr)^2\Bigr\}^z v_1^{p-1}\,da_1\,dv_1 \prod_{m=2}^{p} da_m\,db_m.$$

By the symmetry of $b_1,\ldots,b_p$, this setting covers the general case $\{b_i = v_i,\ b_m = v_i b_m,\ m = 1,2,\cdots,p,\ m \neq i\}$.
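As a quick symbolic sanity check of this first blow-up (my own verification sketch, not part of the paper), one can substitute $b_1 = v_1$, $b_m = v_1 b_m$ into the original sum of squares and confirm the stated factorization; here this is done for the small case $p = 3$ with sympy.

```python
import sympy as sp

p = 3
a = sp.symbols(f'a1:{p + 1}')
b = sp.symbols(f'b1:{p + 1}')
v1 = sp.Symbol('v1')

# Original polynomial: sum_n ( sum_m a_m * b_m^(2n-1) )^2.
original = sum(sum(a[m] * b[m]**(2 * n - 1) for m in range(p))**2
               for n in range(1, p + 1))

# Blow-up on the chart U_1: b_1 = v_1 and b_m = v_1 * b_m for m >= 2.
subs = {b[0]: v1, **{b[m]: v1 * b[m] for m in range(1, p)}}
blown_up = original.subs(subs, simultaneous=True)

# Claimed factored form after the blow-up.
claimed = sum(v1**(4 * n - 2)
              * (a[0] + sum(a[m] * b[m]**(2 * n - 1) for m in range(1, p)))**2
              for n in range(1, p + 1))

print(sp.expand(blown_up - claimed) == 0)   # True
```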


Let $d_1 = a_1 + \sum_{m=2}^{p} a_m b_m$. Then we have

$$\Psi = \Bigl\{v_1^2\Bigl(d_1^2 + \sum_{n=2}^{p} v_1^{4n-4}\Bigl(d_1 - \sum_{m=2}^{p} a_m b_m + \sum_{m=2}^{p} a_m b_m^{2n-1}\Bigr)^2\Bigr)\Bigr\}^z v_1^{p-1}\,dd_1\,dv_1 \prod_{m=2}^{p} da_m\,db_m.$$

Put the auxiliary function $f_{n,l}$ by

$$f_{n,l}(x) = \sum_{j_2+\cdots+j_l=0}^{n-l} b_2^{2j_2}\cdots b_{l-1}^{2j_{l-1}}\,x^{2j_l} = 1 + \cdots \ge 1.$$

This function satisfies $f_{n,l}(b_m) - f_{n,l}(b_l) = (b_m^2 - b_l^2)f_{n,l+1}(b_m)$. Set

$$c_2 = \sum_{m=2}^{p} a_m b_m (b_m^2 - 1),$$

and

$$c_i = \sum_{m=i}^{p} a_m b_m (b_m^2-1)(b_m^2-b_2^2)\cdots(b_m^2-b_{i-1}^2)$$

for $i \ge 3$. By using $f_{n,l}$, we have

$$\Psi = \Bigl\{v_1^2\Bigl(d_1^2 + \sum_{n=2}^{p} v_1^{4n-4}\bigl(d_1 + f_{n,2}(b_2)c_2 + \cdots + f_{n,i}(b_i)c_i + \cdots + f_{n,n}(b_n)c_n\bigr)^2\Bigr)\Bigr\}^z v_1^{p-1}\,dd_1\,dv_1 \prod_{m=2}^{p} da_m\,db_m. \qquad (4)$$

We have $d_1^2 + \sum_{n=2}^{p} v_1^{4n-4}\bigl(d_1 + f_{n,2}(b_2)c_2 + \cdots + f_{n,i}(b_i)c_i + \cdots + f_{n,n}(b_n)c_n\bigr)^2 = 0$ if and only if $d_1 = c_2 = \cdots = c_p = 0$. For the sake of simplicity, we use the abbreviation $f_{n,i}$ instead of $f_{n,i}(b_i)$.
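The rewriting behind Eq.(4) uses $b_m^{2n-2} - 1 = (b_m^2-1)f_{n,2}(b_m)$ together with $f_{n,l}(b_m) - f_{n,l}(b_l) = (b_m^2-b_l^2)f_{n,l+1}(b_m)$. The hedged sketch below (my own check, not from the paper) verifies the resulting decomposition symbolically for $p = 3$, where $f_{2,2}(x) = 1$, $f_{3,2}(x) = 1 + x^2$ and $f_{3,3}(x) = 1$.

```python
import sympy as sp

a2, a3, b2, b3, d1 = sp.symbols('a2 a3 b2 b3 d1')

# c_2 and c_3 as defined in the text, written out for p = 3.
c2 = a2 * b2 * (b2**2 - 1) + a3 * b3 * (b3**2 - 1)
c3 = a3 * b3 * (b3**2 - 1) * (b3**2 - b2**2)

# Inner factors before the rewriting: d1 - sum a_m b_m + sum a_m b_m^(2n-1).
before_n2 = d1 + a2 * b2 * (b2**2 - 1) + a3 * b3 * (b3**2 - 1)
before_n3 = d1 + a2 * b2 * (b2**4 - 1) + a3 * b3 * (b3**4 - 1)

# Rewritten factors of Eq.(4): d1 + f_{n,2}(b2) c2 + ... + f_{n,n}(b_n) c_n.
after_n2 = d1 + 1 * c2
after_n3 = d1 + (1 + b2**2) * c2 + 1 * c3

print(sp.expand(before_n2 - after_n2) == 0)   # True
print(sp.expand(before_n3 - after_n3) == 0)   # True
```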

[Figure 1 appears here in the original: a flow chart of the proof of the Main Theorem, tracking how Steps 2 and 3 move between Eq.(5), Eq.(7) and Eq.(9) under transformations (i)-(vii) in Cases 1 and 2, together with the poles produced at each branch.]

Figure 1: The flow chart of the Main Theorem's proof.

Take $J^{(\alpha)}$ with $J \in R^\alpha$. We denote $J^{(\alpha)} = (J^{(\alpha')},*)$ with $\alpha \ge \alpha'$ by $J^{(\alpha)} > J^{(\alpha')}$. Also denote $J^{(\alpha)} = (0,\cdots,0)$ by $J^{(\alpha)} = 0^{(\alpha)}$ or $J^{(\alpha)} = 0$. We need the following inductive statements in $k$, $K$, $\alpha$ for calculating the poles by the blowing-up process. Figure 1 is the flow chart of the Main Theorem's proof.

Inductive statements

Set $s(J) = \#\{m ;\ k \le m \le p,\ J_m^{(\alpha)} = J\}$ and $s(i,J) = \#\{m ;\ k \le m \le i-1,\ J_m^{(\alpha)} = J\}$ for any $J \in R^\alpha$, where $\#$ denotes the number of elements.

(a) $K \ge k$.


(b)

$$\Psi = \Bigl\{v_1^{t_1}v_2^{t_2}v_3^{t_3}\cdots v_{k-1}^{t_{k-1}}\Bigl(d_1^2 + (d_1v_1^2 + d_2)^2 + \cdots + (d_1v_1^{2K-4} + d_2v_1^{2K-6}f_{K-1,2} + \cdots + d_{K-1}f_{K-1,K-1})^2 + \sum_{n=K}^{p} v_1^{4n-4K+4}\bigl(d_1v_1^{2K-4} + d_2v_1^{2K-6}f_{n,2} + \cdots + d_{K-1}f_{n,K-1} + \sum_{i=K}^{p} f_{n,i}c_i\bigr)^2\Bigr)\Bigr\}^z \prod_{m=1}^{k-1} v_m^{q_m}\, \prod_{m=1}^{K-1} dd_m \prod_{m=K}^{p} da_m \prod_{m=1}^{k-1} dv_m \prod_{m=k}^{p} db_m. \qquad (5)$$

Here, tl , qm ∈ Z+ . Also, there exist RJ (α) ⊂ Rα , t(i, J, l) ∈ Z+ , and functions g(i, m) 6= 0, (K ≤ i ≤ p, 2 ≤ l ≤ k − 1, i ≤ m ≤ p) such that t(i,0,2) t(i,0,3) v3

c i = v2 X

g(i, m)am bm

X

t(i,J,2) t(i,J,3) v3

v2

J∈RJ (α)

X

g(i, m)am bm

X

(α)

(α)

Y

t(i,J,2) t(i,J,3) v3

g(i, m)am

(bm − bi0 )

Y

t(i,J,k−1)

· · · vk−1

(bm − bi0 ).

k≤i0 J (µ) J (µ) 6∈RJ (µ) ,J (µ) 6=0

(ξ)

(e) There exist gl ≥ 0, ηk0 ,l ≥ 0 for 2 ≤ k 0 ≤ K −1, 1 ≤ ξ ≤ gl , 2 ≤ l ≤ k−1, such that g

l tl X (ξ) (ξ) = (1 + η2,l + · · · + ηK−1,l ), 2 ξ=1

(ξ)

(ξ)

(ξ)

0 ≤ η2,l ≤ 2, 0 ≤ η2,l + η3,l ≤ 4, .. . (ξ)

(ξ)

(ξ)

0 ≤ η2,l + η3,l + · · · + ηK−1,l ≤ 2(K − 2). (ξ)

(f) Set ϕl

(ξ)

(ξ)

= p + 2η2,l + · · · + (K − 1)ηK−1,l . There exist φl ∈ Z+ for P 2 ≤ l ≤ k − 1, such that gl ≤ Jm(α) >J (µ) DJ (µ) ,l and ql + 1 =

gl X

(ξ)

ϕl + φl

ξ=1

+

p X

(−gl +

m=k

X

DJ (µ) ,l ).

(α) Jm >J (µ)

The end of inductive statements Statements (d), (e) and (f) are needed to compare poles. The definitions of all variables will be given later on in the proof. (α)

If Jm = 0 for all m and α, we have α = k − 1 and the followings: 14

(a’) k = K. (b’) t(i,0,2) t(i,0,3) v3

t(i,0,k−1)

c i = v2 X

· · · vk−1

Y

am bm

(b2m − b2i0 ).

k≤i0 J (µ)

l∈A(α)

gl + (

l∈A(α)

+

m=k (α) Jm 6∈JC (α)

X

+

p X

DJ (µ) ,l ) + 1)

φl + K − 1.

l∈A(α) (ξ 0 )

(ξ 0 )

(I) Assume that there exist l0 ∈ A(α) and ξ 0 such that 0 ≤ η2,l0 + η3,l0 + · · · + (ξ 0 )

ηK−1,l0 < 2(K − 2). Let gk 0 →

X

gl ,

l∈A(α)

φk0 →

X l∈A(α)

20

φl ,

X

DJ (α) ,k0 →

DJ (α) ,l + 1 if J (α) ∈ JC (α) ,

l∈A(α)

X

DJ (µ) ,k0 →

DJ (µ) ,l if J (µ) 6∈ JC (α) .

l∈A(α)

Let (g )

(1)

ϕk 0 , · · · , ϕ k 0 k 0 (ξ)

be ϕl , l ∈ A(α) , ξ = 1 · · · , gl and (g )

(1)

η`,k0 , · · · , η`,kk00 (ξ)

be η`,l , l ∈ A(α) , ξ = 1 · · · , gl by numbering in the same order for any `. Then we have tk0 /2 =

gk 0 X

(ξ)

ξ=1 gk 0

qk 0 + 1 =

(ξ)

(1 + η2,k0 + · · · + ηK−1,k0 ) + 1,

X

(ξ)

ϕk0 + K − 1

ξ=1 p

X

+

X

(−gk0 +

m=k

DJ (µ) ,k0 ) + φk0 .

(α) Jm >J (µ)

(ξ 0 )

(ξ 0 )

By the assumption (I), there exists ξ 0 such that 0 ≤ η2,k0 + η3,k0 + · · · + (ξ 0 )

ηK−1,k0 < 2(K − 2). Therefore, by putting (ξ 0 )

(ξ 0 )

(ξ 0 )

ϕk0 → p + 2η2,k0 + · · · + (K − 1)(ηK−1,k0 + 1), (ξ 0 )

(ξ 0 )

ηK−1,k0 → ηK−1,k0 + 1, we have Statements (d), (f). (ξ)

(ξ)

The end of (I) (ξ)

(II) Assume 0 ≤ η2,l + η3,l + · · · + ηK−1,l = 2(K − 2) for all l ∈ A(α) and ξ.

21

By the assumption (d), for any J ∈ Rα , we have gl X tl (ξ) (ξ) = (1 + η2,l + · · · + ηK−1,l ) 2 ξ=1

= gl (1 + 2(K − 2)) X ≤ D0(µ) ,l (2s(i, 0(µ) ) + 1) J>0(µ)

X

+

DJ (µ) ,l (s(i, J (µ) ) + 1)

J>J (µ) J (µ) ∈RJ (µ)

X

+

DJ (µ) ,l s(i, J (µ) ).

J>J (µ) J (µ) 6∈RJ (µ) ,J (µ) 6=0

In particular, if some J (µ) are not equal to 0 then s(K, J (µ) ) ≤ K − 2. Therefore we have X

gl
J (µ)

Let X

DJ (α) ,k0 →

DJ (α) ,l + 1 if J (α) ∈ JC (α) ,

l∈A(α)

X

DJ (µ) ,k0 → gk0 →

X

DJ (µ) ,l if J (µ) 6∈ JC (α) ,

l∈A(α)

gl + 1, φk0 →

l∈A(α)

X l∈A(α)

(g )

and ϕk0 k0 = p. In addition, let (g −1)

(1)

ϕk 0 , · · · , ϕ k 0 k 0 (ξ)

be ϕl , l ∈ A(α) , ξ = 1 · · · , gl , and (g −1)

(1)

η`,k0 , · · · , η`,kk00 22

φl + K − k,

(ξ)

be η`,l , l ∈ A(α) , ξ = 1 · · · , gl by numbering in the same order for any `. Put (g )

η`,kk00 = 0 for ` ≥ 1. Then we have gl X X

qk 0 + 1 =

l∈A(α)

X

+p+

ξ=1

p X

(−

m=k

X

X

(α) Jm >J (µ)

l∈A(α)

−1 + +

(ξ) ϕl

X

gl

l∈A(α)

DJ (µ) ,l )

φl + p − k + 1 − p + K − 1 + #C (α)

l∈A(α)

=

gk 0 X

(ξ) ϕk0

ξ=1

+

p X m=k

By Eq.(15) we have gk0 ≤

X

(−gk0 +

DJ (µ) ,k0 ) + φk0 .

(α) Jm >J (µ)

P (α)

Jm >J (µ)

DJ (µ),k0 . The end of (II)

(α)

In particular, assume that all Jm = 0. Let tk0 → tk0 + 2, t(i, 0, k 0 ) → t(i, 0, k 0 ) − 1, qk0 → qk0 + k − 1, ci → ci /vk0 . (1)

(1)

Then tk0 /2 = 1 + η2,k0 + · · · + ηk−1,k0 + 1. By putting (1)

(1)

ηk−1,k0 → ηk−1,k0 + 1, we have (d0 )

t˜(i, 0(k−1) , k 0 ) = D0(k0 −1) ,k0 (2(i − k 0 ) + 1),

(e0 )

t(k, 0(k−1) , k 0 ) + ηk−1,k0 = 2,

(f 0 )

qk0 + 1 = ϕk0

(1)

(1)

(1)

(1)

= p + 2η2,k0 + · · · + (k − 1)ηk−1,k0 . If JC (α) is not empty, we need the following Step(∗). 23

Step(∗) (α+1)

Let jm

(α)

be any real number for each m with Jm ∈ JC (α) . For m

(α)

(α+1)

with Jm 6∈ JC (α) , let jm

(α)

(α)

(α)

= jm where Jm = (J 0 , jm ), J 0 ∈ Rα−1 and

(α)

(α+1)

jm ∈ R. Consider a sufficiently small neighborhood of bm = jm it. Put

( (α+1) Jm

(α)

(α+1) 2

(Jm , jm ) (α) (α+1) (Jm , jm )

=

and fix

(α)

if Jm = 0, (α) if Jm 6= 0.

Let t(i, J, (l, τ )) = t(i, J 0 , (l, τ )) for J = (J 0 , ∗) ∈ Rα+1 , where J 0 ∈ Rα . Change g(i, m) 6= 0 and bm properly, taking into account that the neighbor(α+1)

hood of bm = jm

. Let RJ (α+1) be the set of J satisfying X

g(i, m)am bm

Y

(bm − bi0 ),

k+1≤i0 J (µ) DJ (µ) ,l ) + φl for all l, since k is replaced k

by k + 1 later. (α)

If A(α) = φ, then we have all Jm ∈ JC (α) . Therefore, #C = p − (k − 1), qk + 1 = p + K − k, tk = 2. Put DJm(α) ,k = 1, DJm(α) ,l = 0(l 6= k), (1)

(1)

η`,k = 0, ϕk = p, φk = K − k. The end of (III) Do the procedure Step(∗) and let k → k + 1. Those new parameters satisfy Statements (a)∼(f) with Eq.(9) instead of (α)

Eq.(5). If we have all Jm = 0, the above case does not appear. (B) (α)

Next consider the case K ≤ k 0 < p. We can assume k 0 = K. If JK 6∈ (α)

(α)

(α)

RJ (α) , JK 6= 0 then there exists i0 < K such that Ji0 = JK because of the Case 2 assumption. So the case results in (A). (α) (α) Consider the case J˜ := JK ∈ RJ or J˜ := JK = 0. By the trans-

formation (vii), we have cK = vk (aK g(K, K) + · · · ). Now, there is no aK in the formulas ci , i ≥ K + 1. So, change the variable from aK to dK by dK = aK g(K, K) + · · · : cK = vk dK . Since aK , bK have disappeared in all formulas ci , we change the variables from bk , · · · , bK−1 to bk+1 , · · · , bK . Also (α)

we change Ji

and RJ (α) , properly.

Let tk → T, t(i, J, k) → Ti,J , qk → Q, ci → ci /vk , 25

for K + 1 ≤ i ≤ p, by using Eq.(10), (11), (12). Proceed Step (III) and Step(∗). Let k → k + 1, K → K + 1, t1 → t1 + 4, q1 → q1 + 2(K − 1), and we have Statements (a)∼(f). (α)

Assume that all Jm are 0. Then we have A(k) = φ. Let tk = 2, t(i, 0(k−1) , k) = 2(i − k), qk = p − 1, ci → ci /vk , D0(k−1) ,k = 1, D0(k−1) ,l = 0(l 6= k), (1)

(1)

η`,k = 0, ϕk = p. (1)

By t(k, 0(k−1) , l) = 0(2 ≤ l ≤ k − 1) and (e’), we have ηk−1,l = 2. Also we obtain t(k + 1, 0(k−1) , l) = t˜(k + 1, 0(k−1) , l) − tl /2 = 2, by using (d’) and (e’). Therefore before Step(∗), we have (d0 )

t˜(i, 0(k−1) , k) = D0(k−1) ,k (2(i − k) + 1) = 2(i − k) + 1,

(e0 )

(1)

t(k + 1, 0(k−1) , l) + ηk,k = 2 + 0, (2 ≤ l ≤ k),

(f 0 )

(1)

qk + 1 = ϕ k = p (1)

(1)

= p + 2η2,k + · · · + (k − 1)ηk−1,k . After Step(∗), we have t˜(i, 0(k−1) , k) = t˜(i, 0(k) , k) and t(i, 0(k−1) , k) = t(i, 0(k) , k). Let k → k + 1, K → K + 1, t1 → t1 + 4, q1 → q1 + 2(K − 1). 26

Those new parameters satisfy Statements (a’)∼(f’). By repeating Step 3, the conditions Case 2 (vi) and (vii)(A) disappear, whose transformations (vi) and (vii)(A) in Case 2 satisfy Eq.(5) instead of Eq.(9). So, K is increased with these finite steps. K = p + 1 completes the blowing-up process. Proof of Main Theorem: Part 2 To obtain the maximum pole and its order, we prepare the following four lemmas. Lemma 1 (α)

If all Jm = 0, then for each 1 ≤ k ≤ p, we have the poles, − p2 , p+k − 4 , − p+2k , 6 p+2k+2(k+1) p+2k+k+1 − ,− , 8 10 .. . , − p+(i−1)(2k−2+i)+k+i−1 4i p+(i−1)(2k−2+i)+2(k+i−1) − , 4i+2 .. . − (p−k−1)(k−2+p)+p−1 , 4(p−k) p+(p−k−1)(k−2+p)+2(p−2) . − 4(p−k)+2 which are related to vk . Proof

The proof is obtained by Statements (a’)∼(f’) in Part 1. Q.E.D.

Lemma 2 If am , bm > 0, m = 1, · · · , M , then we have

PM am Pm=1 M m=1 bm

m ≥ min{ abm |m=

1, . . . , M }. Lemma 3 Let K ∈ N. Assume that ηk0 ∈ Z+ , 2 ≤ k 0 ≤ K − 1 satisfy 0 ≤ η2 ≤ 2, · · · , 0 ≤ η2 + η3 + · · · + ηK−1 ≤ 2(K − 2). Let t := 1 + η2 + · · · + ηK−1 , 27

ϕ := p + 2η2 + · · · + (K − 1)ηK−1 , and t = 2i + m,

i ∈ N,

m = 0 or 1.

Then we have ϕ p + i2 + im > = 2t 4i + 2m p + 1 + 1 + 2 + 2 + · · · + (i − 1) + (i − 1) + i + im 2t Proof

It is necessary only to compare those numerators.

Q.E.D.

Remark The poles −

p + i2 + im 4i + 2m

in Lemma 3 are equal to those for v1 (k = 1) in Lemma 1. Lemma 4

(α)

If some Jm are not equal to 0, then the poles of (6), (8), (13)

and (14) are smaller than one of those in Lemma 1. Proof

We use the same notations in Part 1 of the Main Theorem’s

proof. By using Statements (e) and (f), we have Pgl (ξ) ql + 1 ξ=1 ϕl ≥ Pg , (ξ) (ξ) l tl 2 ξ=1 (1 + η2,l + · · · + ηK−1,l ) where

ql +1 tl

is in Eq.(6), Eq.(8) or Eq.(14) for 2 ≤ l ≤ k − 1. By Lemma 2,

we have (ξ)

ϕl ql + 1 ≥ min { }. (ξ) (ξ) 1≤ξ≤gl 2(1 + η tl 2,l + · · · + ηK−1,l ) Therefore, by Lemma 3, one of the poles for v1 in Lemma 1 is greater than − qlt+1 . l 28

Next consider the poles in Eq.(13). (ξ 0 )

(ξ 0 )

(I’) Assume that there exist l0 ∈ A(α) and ξ 0 such that 0 ≤ η2,l0 + η3,l0 + (ξ 0 )

· · · + ηK−1,l0 < 2(K − 2). Then P

ql + K − 1 + #A(α) + #C (α) P l∈A(α) tl + 2 P (α) (ql + 1) + K − 1 ≥ l∈AP l∈A(α) tl + 2

l∈A(α)

P

l∈A(α) ,l6=l0 (ql

+ 1) + (ql0 + 1) + K − 1 l∈A(α) ,l6=l0 tl + tl0 + 2

P



ql + 1 ql 0 + 1 + K − 1 , | l ∈ A(α) , l 6= l0 } tl tl0 + 2 P (ξ) P (ξ) ξ ϕl 0 + K − 1 ξ ϕl , ≥ min{ | l ∈ A(α) , l 6= l0 } tl tl0 + 2 ≥ min{

(ξ)

≥ min{

min

ϕl

l∈A(α) ,l6=l0 ,ξ

(ξ)

(ξ)

2(1 + η2,l + · · · + ηK−1,l )

,

(ξ)

min0 ξ6=ξ

ϕl0 (ξ)

(ξ)

2(1 + η2,l0 + · · · + ηK−1,l0 )

,

(ξ 0 )

ϕl 0 + K − 1 (ξ 0 )

(ξ 0 )

2(1 + η2,l0 + · · · + (ηK−1,l0 + 1))

}.

Therefore, by Lemma 3, we have the poles which are related to v1 in Lemma 1 and are greater than the pole in Eq.(13). (ξ)

(ξ)

(II’) Assume that for all l ∈ A(α) and ξ, we have 0 ≤ η2,l + η3,l + · · · + (ξ)

ηK−1,l = 2(K − 2) and some J (µ) are not equal to 0. P Then gl < J>J (µ) DJ (µ),l and P

ql + K − 1 + #A(α) + #C (α) P l∈A(α) tl + 2 P (α) (ql + 1) + K − 1 ≥ l∈AP l∈A(α) tl + 2

l∈A(α)

29

P ≥ min { l∈A(α)

(ξ)

ϕl , tl

ξ

P

(ξ)

ϕl + p − k + 1 + K − 1 } tl + 2 P (ξ) p ξ ϕl , }. ≥ min { (α) tl 2 l∈A ξ

So by Lemma 3, we have a greater pole with respect to v1 in Lemma 1 P

than −

l∈A(α)

ql +K−1+#A(α) +#C (α) P t +2 l∈A(α) l

in Eq.(13). Q.E.D.

Therefore the maximum pole is chosen from the followings: p p+1 p+2 − ,− ,− , 2 4 6 −





p+4 p+6 ,− , 8 10 .. .

p + i2 p + i2 + i ,− , 4i 4i + 2 .. .

p + (p − 1)2 p + (p − 1)2 + (p − 1) ,− . 4(p − 1) 4(p − 1) + 2

Some computations show that the maximum pole is $-\frac{p+i^2+i}{4i+2}$, where $i = \max\{j \in N \mid j^2 \le p\}$. If $i^2 = p$ then $-\frac{p+i^2}{4i} = -\frac{p+i^2+i}{4i+2}$, so the order of the maximum pole is 2 if $i^2 = p$ and 1 if $i^2 \neq p$. The end of the proof of the Main Theorem.

30

5 Conclusion

In this paper, we introduced an inductive method for obtaining the poles of the zeta function of the three layered neural network. The blowing-up process of the method enables us to obtain the asymptotic form of the stochastic complexity of the three layered neural network. This shows that the algebraic method can be used effectively to solve our problem in learning theory. Obtaining the maximum poles of zeta functions arising in learning theory may also be considered a new problem in mathematics, since the singularities of most Kullback functions have not yet been investigated. The method in this paper should be useful for calculating asymptotic forms not only for the layered neural network model but also in other cases; our aim is to develop the mathematical methods in that direction. Finally, we remark that λ in the Main Theorem becomes the same value as the upper bound



p 2

in the paper [15], if p = i2 .

References [1] Watanabe, S., “Algebraic analysis for singular statistical estimation,” Lecture Notes on Computer Science, 1720, pp. 39-50, 1999. [2] Watanabe, S.,“Algebraic analysis for nonidentifiable learning machines,”Neural Computation, 13 (4), pp. 899-933, 2001. [3] Watanabe, S.,“Algebraic geometrical methods for hierarchical learning machines,”Neural Networks, 14 (8), pp. 1049-1060, 2001.


[4] Akaike, H.,“A new look at the statistical model identification,” IEEE Trans. on Automatic Control, vol.19, pp.716-723, 1974. [5] Takeuchi, K.,“Distribution of an information statistic and the criterion for the optimal model,” Mathematical Science, No.153, pp.12-18, 1976 (In Japanese). [6] Hannan, E.J. and Quinn, B.G.,“ The determination of the order of an autoregression,” Journal of Royal Statistical Society, Series B. vol.41, pp.190-195, 1979. [7] Murata, N., Yoshizawa, S. and Amari, S., “ Network information criterion - determining the number of hidden units for an artificial neural network model,” IEEE Trans. on Neural Networks, vol.5, no.6, pp.865872, 1994. [8] Schwarz, G.,“ Estimating the dimension of a model,” Annals of Statistics, vol.6, no.2, pp.461-464, 1978. [9] Rissanen, J.,“ Universal coding, information, prediction, and estimation,” IEEE Trans. on Information Theory, vol.30, no.4, pp.629-636, 1984. [10] Yamazaki, K. and Watanabe, S., “ A probabilistic algorithm to calculate the learning curves of hierarchical learning machines with singularities,” IEICE Trans., J85-DII, no.3, pp.363-372, 2002.


[11] Watanabe, S., “Learning efficiency of redundant neural networks in Bayesian estimation,” IEEE Transactions on Neural Networks, vol.12, no.6, pp.1475-1486, 2001. [12] Watanabe, K. and Watanabe, S., “ Upper bounds of Bayesian generalization errors in reduced rank regression,” IEICE Trans., J86-A, no.3, pp.278-287, 2003. [13] Hironaka, H., “Resolution of Singularities of an algebraic variety over a field of characteristic zero,” Annals of Math., vol. 79, pp.109-326, 1964. [14] Atiyah, M. F., “ Resolution of singularities and division of distributions,” Comm. Pure and Appl. Math., vol.13, pp.145-150, 1970. [15] Watanabe, S.,“On the generalization error by a layered statistical model with Bayesian estimation,” IEICE Trans., J81-A, pp. 1442-1452, 1998, (English version : Elect. and Comm. in Japan., John Wiley and Sons, 83 (6), pp.95-106, 2000. Miki Aoyagi received the Doctor’s degree of Mathematics from Graduate School of Kyushu University, Fukuoka, Japan, in 1995 and 1997, respectively. From 1997 until 2001, she was a PD Research Fellow of the Japan Society for the Promotion of Science for Young Scientists. Since 2001, she has been an Assistant at Department of Mathematics, Sophia University, Tokyo, Japan. Her research interests are focused on algebraic geometry, complex analysis and learning theory. (Doctor of Mathematics) She is a member of Japan


Mathematical Society.

Sumio Watanabe received the Master’s degree of Mathematics from Graduate School of Kyoto University, Kyoto, Japan, in 1982. He is now a Professor of Tokyo Institute of Technology. His research interests are focused on learning theory. (Doctor of Engineering) He is a member of IEEE and JNNS.

