Advances in Pure Mathematics, 2018, 8, 485-493 http://www.scirp.org/journal/apm ISSN Online: 2160-0384 ISSN Print: 2160-0368
Least Squares Method from the View Point of Deep Learning

Kazuyuki Fujii 1,2

1 International College of Arts and Sciences, Yokohama City University, Yokohama, Japan
2 Department of Mathematical Sciences, Shibaura Institute of Technology, Saitama, Japan
How to cite this paper: Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning. Advances in Pure Mathematics, 8, 485-493. https://doi.org/10.4236/apm.2018.85027
Received: April 10, 2018; Accepted: May 26, 2018; Published: May 29, 2018
Copyright © 2018 by author and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access
Abstract
The least squares method is one of the most fundamental methods in Statistics for estimating correlations among various data. On the other hand, Deep Learning is the heart of Artificial Intelligence, and it is a learning method based on the least squares method. In this paper we reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the computation of the gradient descent sequence thoroughly in a very simple setting. Depending on the value of the learning rate, an essential parameter of Deep Learning, the least squares methods of Statistics and of Deep Learning reveal an interesting difference.
Keywords Least Squares Method, Statistics, Deep Learning, Learning Rate, Linear Algebra
1. Introduction
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data, we usually employ this method. See for example [1]. On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields in Data Science in the near future. As to Deep Learning, see for example [2] [3] [4] [5] [6]. Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, it is very natural to reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the calculation of the successive approximation, called the gradient descent sequence, thoroughly.
When the learning rate changes, the difference in method between Statistics and Deep Learning gives different results. Theorems I and II in the text are our main results, and a related problem (exercise) is presented for readers. Our results may give a new insight into both Statistics and Data Science.
First of all, let us explain the least squares method for readers in a very simple setting. For $n$ pieces of two-dimensional real data
$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
we assume that their scatter plot looks like Figure 1. Then a linear model function is chosen:
$$f(x) = ax + b. \qquad (1)$$
For this function the error (or loss) function is defined by
$$E(a, b) = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a x_i + b)\}^2. \qquad (2)$$
The aim of the least squares method is to minimize the error function (2) with respect to $\{a, b\}$. A little calculation gives
$$E(a, b) = \frac{1}{2}\left\{\sum_{i=1}^{n} x_i^2\, a^2 + 2\sum_{i=1}^{n} x_i\, ab + n b^2 - 2\sum_{i=1}^{n} x_i y_i\, a - 2\sum_{i=1}^{n} y_i\, b + \sum_{i=1}^{n} y_i^2\right\}.$$
Then the stationarity conditions
$$\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0 \qquad (3)$$
give a linear equation for a and b
$$\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} \qquad (4)$$
Figure 1. Scatter plot 1.
and its solution is given by
$$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}^{-1}\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (5)$$
Explicitly, we have
$$a = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad b = \frac{-\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right) + \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}. \qquad (6)$$
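For readers who wish to check (4)-(6) numerically, the following short NumPy sketch (ours, not part of the original paper) solves the normal Equation (4) directly and compares the result with the explicit formulas (6); the data arrays are hypothetical sample values chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data (x_i, y_i), chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Normal equation (4): coefficient matrix and right-hand side.
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n       ]])
f = np.array([np.sum(x * y), np.sum(y)])
a_solve, b_solve = np.linalg.solve(A, f)

# Explicit formulas (6).
den = n * np.sum(x**2) - np.sum(x)**2
a_expl = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / den
b_expl = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / den

print(a_solve, b_solve)   # solution of the normal equation (4)
print(a_expl, b_expl)     # the explicit formulas (6) give the same values
```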
Checking that these $a$ and $b$ indeed give the minimum of (2) is a good exercise.
Note. We have the inequality
$$n\left(x_1^2 + x_2^2 + \cdots + x_n^2\right) \ge \left(x_1 + x_2 + \cdots + x_n\right)^2 \qquad (7)$$
and equality holds if and only if $x_1 = x_2 = \cdots = x_n$. Since $\{x_1, x_2, \ldots, x_n\}$ are data, we may assume that $x_i \ne x_j$ for some $i \ne j$. Therefore
$$n\left(x_1^2 + x_2^2 + \cdots + x_n^2\right) - \left(x_1 + x_2 + \cdots + x_n\right)^2 > 0,$$
which gives
$$\begin{vmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{vmatrix} > 0. \qquad (8)$$
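The positivity (8) is easy to confirm numerically; the following sketch (ours, with the same kind of hypothetical data) evaluates the determinant on the left-hand side of (8).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data, not all equal
n = len(x)
det = n * np.sum(x**2) - np.sum(x)**2     # determinant on the left-hand side of (8)
print(det, det > 0)                       # positive whenever the x_i are not all equal
```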
2. Least Squares Method from Deep Learning
In this section we reconsider the least squares method in Section 1 from the viewpoint of Deep Learning. First we arrange the data in Section 1 as
Input data: $\{(x_1, 1), (x_2, 1), \ldots, (x_n, 1)\}$
Teacher signal: $\{y_1, y_2, \ldots, y_n\}$
and consider a simple neuron model in [7] (see Figure 2).
Figure 2. Simple neuron model 1.
Here we use the linear function (1) instead of the sigmoid function $z = \sigma(x)$. In this case the square error function becomes
$$L(a, b) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a x_i + b)\}^2.$$
We usually use $L(a, b)$ instead of $E(a, b)$ in (2). Our aim is again to determine the parameters $\{a, b\}$ so as to minimize $L(a, b)$. However, the procedure is different from the least squares method in Section 1. This is an important and interesting point.
For later use let us perform a little calculation:
$$\frac{\partial L}{\partial a} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right)\frac{\partial f}{\partial a} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right)x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right)\frac{\partial f}{\partial b} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right). \qquad (9)$$
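The gradients (9) can be checked against central finite differences. The sketch below (ours; the data and the evaluation point are hypothetical) performs this check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def L(a, b):
    # square error function L(a, b)
    return 0.5 * np.sum((y - (a * x + b))**2)

def grad_L(a, b):
    # analytic gradients from (9)
    r = y - (a * x + b)
    return -np.sum(r * x), -np.sum(r)

a0, b0, h = 0.3, -0.2, 1e-6               # hypothetical point and step size
ga, gb = grad_L(a0, b0)
ga_num = (L(a0 + h, b0) - L(a0 - h, b0)) / (2 * h)
gb_num = (L(a0, b0 + h) - L(a0, b0 - h)) / (2 * h)
print(ga, ga_num)   # the analytic and numerical values should agree closely
print(gb, gb_num)
```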
We determine the parameters $\{a, b\}$ successively by the gradient descent method; see for example [8]. For $t = 0, 1, 2, \ldots$,
$$\begin{pmatrix} a(0) \\ b(0) \end{pmatrix} \to \cdots \to \begin{pmatrix} a(t) \\ b(t) \end{pmatrix} \to \begin{pmatrix} a(t+1) \\ b(t+1) \end{pmatrix} \to \cdots$$
and
$$a(t+1) = a(t) - \epsilon\frac{\partial L}{\partial a(t)}, \qquad b(t+1) = b(t) - \epsilon\frac{\partial L}{\partial b(t)}, \qquad (10)$$
where
$$L = L(a(t), b(t)) = \frac{1}{2}\sum_{i=1}^{n}\left\{y_i - \left(a(t)x_i + b(t)\right)\right\}^2$$
and $\epsilon$ ($0 < \epsilon < 1$) is small enough. The initial value $(a(0), b(0))^{T}$ is given appropriately; as will be shown shortly in Theorem I, its explicit value is not important.
Comment. The parameter $\epsilon$ is called the learning rate and it is very hard to choose properly, as emphasized in [7]. In this paper we provide an estimate (see Theorem II).
Let us write down (10) explicitly:
$$a(t+1) = a(t) + \epsilon\sum_{i=1}^{n}\left\{y_i - \left(a(t)x_i + b(t)\right)\right\}x_i = \left(1 - \epsilon\sum_{i=1}^{n} x_i^2\right)a(t) - \epsilon\sum_{i=1}^{n} x_i\, b(t) + \epsilon\sum_{i=1}^{n} x_i y_i$$
and
$$b(t+1) = b(t) + \epsilon\sum_{i=1}^{n}\left\{y_i - \left(a(t)x_i + b(t)\right)\right\} = -\epsilon\sum_{i=1}^{n} x_i\, a(t) + (1 - \epsilon n)\, b(t) + \epsilon\sum_{i=1}^{n} y_i.$$
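A minimal sketch of the resulting iteration, assuming hypothetical data, the learning rate $\epsilon = 0.01$ and the initial value $(a(0), b(0)) = (0, 0)$, is given below; it illustrates (10) and is not the author's code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
eps = 0.01                                 # learning rate, assumed small enough
a, b = 0.0, 0.0                            # initial value (a(0), b(0))

for t in range(10000):
    r = y - (a * x + b)                    # residuals y_i - (a(t) x_i + b(t))
    # update (10) with the gradients (9)
    a, b = a + eps * np.sum(r * x), b + eps * np.sum(r)

print(a, b)   # approaches the least squares solution (5) as t grows
```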
These are cast in a vector-matrix form
$$\begin{pmatrix} a(t+1) \\ b(t+1) \end{pmatrix} = \begin{pmatrix} 1 - \epsilon\sum_{i=1}^{n} x_i^2 & -\epsilon\sum_{i=1}^{n} x_i \\ -\epsilon\sum_{i=1}^{n} x_i & 1 - \epsilon n \end{pmatrix}\begin{pmatrix} a(t) \\ b(t) \end{pmatrix} + \epsilon\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}$$
$$= \left\{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \epsilon\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}\right\}\begin{pmatrix} a(t) \\ b(t) \end{pmatrix} + \epsilon\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (11)$$
For simplicity, by setting
$$x(t) = \begin{pmatrix} a(t) \\ b(t) \end{pmatrix}, \qquad A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}, \qquad f = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix},$$
we have a simple equation
$$x(t+1) = (E - \epsilon A)\, x(t) + \epsilon f \qquad (12)$$
where $E$ is the unit matrix. Due to (8) the matrix $A$ is invertible ($\exists A^{-1}$); that is, we exclude the trivial and uninteresting case $x_1 = x_2 = \cdots = x_n$. The solution is easy and given by
$$x(t) = (E - \epsilon A)^{t}\, x(0) + \left\{E - (E - \epsilon A)^{t}\right\}A^{-1} f. \qquad (13)$$
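The closed form (13) can be verified against a direct iteration of (12). The sketch below (ours; the data, $\epsilon$ and $x(0)$ are hypothetical choices) computes both and compares them.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, eps, t = len(x), 0.01, 200              # hypothetical learning rate and step count

A = np.array([[np.sum(x**2), np.sum(x)], [np.sum(x), n]])
f = np.array([np.sum(x * y), np.sum(y)])
E = np.eye(2)
x0 = np.array([0.5, -0.5])                 # arbitrary initial value x(0)

# iterate (12): x(t+1) = (E - eps*A) x(t) + eps*f
xt = x0.copy()
for _ in range(t):
    xt = (E - eps * A) @ xt + eps * f

# closed form (13)
M = np.linalg.matrix_power(E - eps * A, t)
xt_closed = M @ x0 + (E - M) @ np.linalg.solve(A, f)

print(xt)           # iteration of (12)
print(xt_closed)    # closed form (13); the two vectors coincide
```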
Note. Let us consider a simple difference equation
$$x(n+1) = a\, x(n) + b \qquad (a \ne 1)$$
for $n = 0, 1, 2, \ldots$. Then the solution is given by
$$x(n) = a^{n} x(0) + \frac{a^{n} - 1}{a - 1}\, b = a^{n} x(0) + \frac{1 - a^{n}}{1 - a}\, b.$$
Check this.
Comment. The solution (13) gives
$$\lim_{t \to \infty} x(t) = A^{-1} f \qquad (14)$$
if
$$\lim_{t \to \infty} (E - \epsilon A)^{t} = O \qquad (15)$$
where $O$ is the zero matrix. Note that (14) is just the Equation (5).
Let us evaluate (13) further. For this purpose we make some preparations from Linear Algebra [9]. For simplicity we set
$$A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} \equiv \begin{pmatrix} \alpha & \beta \\ \beta & \delta \end{pmatrix}, \qquad f = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} \equiv \begin{pmatrix} e \\ f \end{pmatrix}$$
and want to diagonalize $A$. The characteristic polynomial of $A$ is
$$f(\lambda) = |\lambda E - A| = \begin{vmatrix} \lambda - \alpha & -\beta \\ -\beta & \lambda - \delta \end{vmatrix} = (\lambda - \alpha)(\lambda - \delta) - \beta^2 = \lambda^2 - (\alpha + \delta)\lambda + \alpha\delta - \beta^2$$
and the solutions are given by
$$\lambda_{\pm} = \frac{\alpha + \delta \pm \sqrt{(\alpha + \delta)^2 - 4(\alpha\delta - \beta^2)}}{2} = \frac{\alpha + \delta \pm \sqrt{(\alpha - \delta)^2 + 4\beta^2}}{2}. \qquad (16)$$
It is easy to see
$$\lambda_{+} > \lambda_{-} > 0, \qquad \lambda_{+} > 1 \qquad (\delta = n \text{ and } n \ge 2). \qquad (17)$$
We arrange the two eigenvectors of the matrix $A$, corresponding to $\lambda_{+}$ and $\lambda_{-}$, in the matrix form
$$\begin{pmatrix} \lambda_{+} - \delta & \beta \\ \beta & \lambda_{-} - \alpha \end{pmatrix}.$$
It is easy to see
$$(\lambda_{+} - \delta)^2 + \beta^2 = (\lambda_{-} - \alpha)^2 + \beta^2 \equiv \Lambda^2$$
from (16), and we also set
$$Q = \frac{1}{\Lambda}\begin{pmatrix} \lambda_{+} - \delta & \beta \\ \beta & \lambda_{-} - \alpha \end{pmatrix}. \qquad (18)$$
Then it is easy to see
Q T Q =QQ T =E ⇒ Q T =Q −1. Namely, Q is an orthogonal matrix. Then the diagonalization of A becomes λ A = Q + 0
0 T Q . λ−
(19)
By substituting (19) into (13) and using
$$E - \epsilon A = Q\begin{pmatrix} 1 - \epsilon\lambda_{+} & 0 \\ 0 & 1 - \epsilon\lambda_{-} \end{pmatrix}Q^{T},$$
we finally obtain
Theorem I. A general solution to (12) is
$$\begin{pmatrix} a(t) \\ b(t) \end{pmatrix} = Q\begin{pmatrix} (1 - \epsilon\lambda_{+})^{t} & 0 \\ 0 & (1 - \epsilon\lambda_{-})^{t} \end{pmatrix}Q^{T}\begin{pmatrix} a(0) \\ b(0) \end{pmatrix} + Q\begin{pmatrix} \dfrac{1 - (1 - \epsilon\lambda_{+})^{t}}{\lambda_{+}} & 0 \\ 0 & \dfrac{1 - (1 - \epsilon\lambda_{-})^{t}}{\lambda_{-}} \end{pmatrix}Q^{T}\begin{pmatrix} e \\ f \end{pmatrix}. \qquad (20)$$
This is our main result.
Lastly, let us show how to choose the learning rate $\epsilon$ ($0 < \epsilon < 1$), which is a very important problem in Deep Learning. Let us recall
$$\lambda_{+} > \lambda_{-} > 0 \quad \text{and} \quad \lambda_{+} > 1$$
from (17). From (15) the equations
$$\lim_{t \to \infty}(E - \epsilon A)^{t} = O \;\Longleftrightarrow\; \lim_{t \to \infty}(1 - \epsilon\lambda_{+})^{t} = \lim_{t \to \infty}(1 - \epsilon\lambda_{-})^{t} = 0$$
determine the range of $\epsilon$. Noting
$$|1 - x| < 1 \;(\Longleftrightarrow\; 0 < x < 2) \;\Longrightarrow\; \lim_{n \to \infty}(1 - x)^{n} = 0,$$
we obtain
Theorem II. The learning rate $\epsilon$ must satisfy the inequality
$$0 < \epsilon\lambda_{+} < 2 \;\Longleftrightarrow\; 0 < \epsilon < \frac{2}{\lambda_{+}}.$$
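As a numerical illustration of Theorem II (ours, with hypothetical data), the sketch below computes the bound $2/\lambda_{+}$ and runs the gradient descent iteration (10) with learning rates just inside and just outside the bound; inside the bound the iterates converge to the least squares solution (5), outside they diverge.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

alpha, beta, delta = np.sum(x**2), np.sum(x), float(n)
lam_p = (alpha + delta + np.sqrt((alpha - delta)**2 + 4 * beta**2)) / 2
eps_max = 2 / lam_p                        # Theorem II: convergence requires 0 < eps < 2/lambda_+
print(eps_max)

def run(eps, steps):
    # gradient descent iteration (10)
    a = b = 0.0
    for _ in range(steps):
        r = y - (a * x + b)
        a, b = a + eps * np.sum(r * x), b + eps * np.sum(r)
    return a, b

print(run(0.9 * eps_max, 2000))   # inside the bound: converges toward the solution (5)
print(run(1.1 * eps_max, 50))     # outside the bound: the iterates already blow up
```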