Advances in Pure Mathematics, 2018, 8, 485-493 http://www.scirp.org/journal/apm ISSN Online: 2160-0384 ISSN Print: 2160-0368
Least Squares Method from the View Point of Deep Learning

Kazuyuki Fujii 1,2

1 International College of Arts and Sciences, Yokohama City University, Yokohama, Japan
2 Department of Mathematical Sciences, Shibaura Institute of Technology, Saitama, Japan
How to cite this paper: Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning. Advances in Pure Mathematics, 8, 485-493. https://doi.org/10.4236/apm.2018.85027
Received: April 10, 2018; Accepted: May 26, 2018; Published: May 29, 2018
Copyright © 2018 by author and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access
Abstract
The least squares method is one of the most fundamental methods in Statistics for estimating correlations among various data. On the other hand, Deep Learning is the heart of Artificial Intelligence, and it is a learning method based on the least squares method. In this paper we reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the computation of the gradient descent sequence thoroughly in a very simple setting. Depending on the value of the learning rate, an essential parameter of Deep Learning, the least squares methods of Statistics and of Deep Learning reveal an interesting difference.
Keywords Least Squares Method, Statistics, Deep Learning, Learning Rate, Linear Algebra
1. Introduction
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data, we usually employ this method. See for example [1]. On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields in Data Science in the near future. As to Deep Learning, see for example [2] [3] [4] [5] [6]. Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, it is very natural to reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the calculation of the successive approximation, called the gradient descent sequence, thoroughly.
When the learning rate changes, the difference in method between Statistics and Deep Learning gives different results. Theorems I and II in the text are our main results, and a related problem (exercise) is presented for readers. Our results may give a new insight into both Statistics and Data Science.
First of all, let us explain the least squares method for readers in a very simple setting. For $n$ pieces of two-dimensional real data
$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
we assume that their scatter plot looks like Figure 1. Then a linear model function is chosen:
$$f(x) = ax + b. \qquad (1)$$
For this function the error (or loss) function is defined by
$$E(a, b) = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a x_i + b)\}^2. \qquad (2)$$
The aim of the least squares method is to minimize the error function (2) with respect to $\{a, b\}$. A little calculation gives
$$E(a, b) = \frac{1}{2}\left\{\sum_{i=1}^{n} x_i^2\, a^2 + 2\sum_{i=1}^{n} x_i\, ab + n b^2 - 2\sum_{i=1}^{n} x_i y_i\, a - 2\sum_{i=1}^{n} y_i\, b + \sum_{i=1}^{n} y_i^2\right\}.$$
Then the stationarity conditions
$$\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0 \qquad (3)$$
give a linear equation for a and b
$$\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} \qquad (4)$$
Figure 1. Scatter plot 1.
and its solution is given by
$$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}^{-1}\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (5)$$
Explicitly, we have
$$a = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad b = \frac{-\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right) + \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}. \qquad (6)$$
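For readers who wish to check (4)-(6) numerically, the following short NumPy sketch (ours, not part of the original paper) solves the normal Equation (4) directly and compares the result with the explicit formulas (6); the data arrays are hypothetical sample values chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data (x_i, y_i), chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Normal equation (4): coefficient matrix and right-hand side.
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n       ]])
f = np.array([np.sum(x * y), np.sum(y)])
a_solve, b_solve = np.linalg.solve(A, f)

# Explicit formulas (6).
den = n * np.sum(x**2) - np.sum(x)**2
a_expl = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / den
b_expl = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / den

print(a_solve, b_solve)   # solution of the normal equation (4)
print(a_expl, b_expl)     # the explicit formulas (6) give the same values
```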
Checking that these $a$ and $b$ indeed give the minimum of (2) is a good exercise.
Note. We have the inequality
$$n\left(x_1^2 + x_2^2 + \cdots + x_n^2\right) \ge \left(x_1 + x_2 + \cdots + x_n\right)^2 \qquad (7)$$
and equality holds if and only if $x_1 = x_2 = \cdots = x_n$. Since $\{x_1, x_2, \ldots, x_n\}$ are data, we may assume that $x_i \ne x_j$ for some $i \ne j$. Therefore
$$n\left(x_1^2 + x_2^2 + \cdots + x_n^2\right) - \left(x_1 + x_2 + \cdots + x_n\right)^2 > 0,$$
which gives
$$\begin{vmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{vmatrix} > 0. \qquad (8)$$
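The positivity (8) is easy to confirm numerically; the following sketch (ours, with the same kind of hypothetical data) evaluates the determinant on the left-hand side of (8).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data, not all equal
n = len(x)
det = n * np.sum(x**2) - np.sum(x)**2     # determinant on the left-hand side of (8)
print(det, det > 0)                       # positive whenever the x_i are not all equal
```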
2. Least Squares Method from Deep Learning
In this section we reconsider the least squares method in Section 1 from the viewpoint of Deep Learning. First we arrange the data in Section 1 as
Input data: $\{(x_1, 1), (x_2, 1), \ldots, (x_n, 1)\}$
Teacher signal: $\{y_1, y_2, \ldots, y_n\}$
and consider a simple neuron model in [7] (see Figure 2).
Figure 2. Simple neuron model 1.
Here we use the linear function (1) instead of the sigmoid function $z = \sigma(x)$. In this case the square error function becomes
$$L(a, b) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a x_i + b)\}^2.$$
We usually use $L(a, b)$ instead of $E(a, b)$ in (2). Our aim is again to determine the parameters $\{a, b\}$ so as to minimize $L(a, b)$. However, the procedure is different from the least squares method in Section 1. This is an important and interesting point.
For later use let us perform a little calculation:
$$\frac{\partial L}{\partial a} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right)\frac{\partial f}{\partial a} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right)x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right)\frac{\partial f}{\partial b} = -\sum_{i=1}^{n}\left(y_i - f(x_i)\right). \qquad (9)$$
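The gradients (9) can be checked against central finite differences. The sketch below (ours; the data and the evaluation point are hypothetical) performs this check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def L(a, b):
    # square error function L(a, b)
    return 0.5 * np.sum((y - (a * x + b))**2)

def grad_L(a, b):
    # analytic gradients from (9)
    r = y - (a * x + b)
    return -np.sum(r * x), -np.sum(r)

a0, b0, h = 0.3, -0.2, 1e-6               # hypothetical point and step size
ga, gb = grad_L(a0, b0)
ga_num = (L(a0 + h, b0) - L(a0 - h, b0)) / (2 * h)
gb_num = (L(a0, b0 + h) - L(a0, b0 - h)) / (2 * h)
print(ga, ga_num)   # the analytic and numerical values should agree closely
print(gb, gb_num)
```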
We determine the parameters $\{a, b\}$ successively by the gradient descent method; see for example [8]. For $t = 0, 1, 2, \ldots$,
$$\begin{pmatrix} a(0) \\ b(0) \end{pmatrix} \to \cdots \to \begin{pmatrix} a(t) \\ b(t) \end{pmatrix} \to \begin{pmatrix} a(t+1) \\ b(t+1) \end{pmatrix} \to \cdots$$
and
$$a(t+1) = a(t) - \epsilon\frac{\partial L}{\partial a(t)}, \qquad b(t+1) = b(t) - \epsilon\frac{\partial L}{\partial b(t)}, \qquad (10)$$
where
$$L = L(a(t), b(t)) = \frac{1}{2}\sum_{i=1}^{n}\left\{y_i - \left(a(t)x_i + b(t)\right)\right\}^2$$
and $\epsilon$ ($0 < \epsilon < 1$) is small enough. The initial value $(a(0), b(0))^{T}$ is given appropriately; as will be shown shortly in Theorem I, its explicit value is not important.
Comment. The parameter $\epsilon$ is called the learning rate and it is very hard to choose properly, as emphasized in [7]. In this paper we provide an estimate (see Theorem II).
Let us write down (10) explicitly:
$$a(t+1) = a(t) + \epsilon\sum_{i=1}^{n}\left\{y_i - \left(a(t)x_i + b(t)\right)\right\}x_i = \left(1 - \epsilon\sum_{i=1}^{n} x_i^2\right)a(t) - \epsilon\sum_{i=1}^{n} x_i\, b(t) + \epsilon\sum_{i=1}^{n} x_i y_i$$
and
$$b(t+1) = b(t) + \epsilon\sum_{i=1}^{n}\left\{y_i - \left(a(t)x_i + b(t)\right)\right\} = -\epsilon\sum_{i=1}^{n} x_i\, a(t) + (1 - \epsilon n)\, b(t) + \epsilon\sum_{i=1}^{n} y_i.$$
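A minimal sketch of the resulting iteration, assuming hypothetical data, the learning rate $\epsilon = 0.01$ and the initial value $(a(0), b(0)) = (0, 0)$, is given below; it illustrates (10) and is not the author's code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
eps = 0.01                                 # learning rate, assumed small enough
a, b = 0.0, 0.0                            # initial value (a(0), b(0))

for t in range(10000):
    r = y - (a * x + b)                    # residuals y_i - (a(t) x_i + b(t))
    # update (10) with the gradients (9)
    a, b = a + eps * np.sum(r * x), b + eps * np.sum(r)

print(a, b)   # approaches the least squares solution (5) as t grows
```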
These are cast in a vector-matrix form
$$\begin{pmatrix} a(t+1) \\ b(t+1) \end{pmatrix} = \begin{pmatrix} 1 - \epsilon\sum_{i=1}^{n} x_i^2 & -\epsilon\sum_{i=1}^{n} x_i \\ -\epsilon\sum_{i=1}^{n} x_i & 1 - \epsilon n \end{pmatrix}\begin{pmatrix} a(t) \\ b(t) \end{pmatrix} + \epsilon\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}$$
$$= \left\{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \epsilon\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}\right\}\begin{pmatrix} a(t) \\ b(t) \end{pmatrix} + \epsilon\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (11)$$
For simplicity, by setting
$$x(t) = \begin{pmatrix} a(t) \\ b(t) \end{pmatrix}, \qquad A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}, \qquad f = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix},$$
we have a simple equation
$$x(t+1) = (E - \epsilon A)\, x(t) + \epsilon f \qquad (12)$$
where $E$ is the unit matrix. Due to (8) the matrix $A$ is invertible ($\exists A^{-1}$); that is, we exclude the trivial and uninteresting case $x_1 = x_2 = \cdots = x_n$. The solution is easy and given by
$$x(t) = (E - \epsilon A)^{t}\, x(0) + \left\{E - (E - \epsilon A)^{t}\right\}A^{-1} f. \qquad (13)$$
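The closed form (13) can be verified against a direct iteration of (12). The sketch below (ours; the data, $\epsilon$ and $x(0)$ are hypothetical choices) computes both and compares them.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, eps, t = len(x), 0.01, 200              # hypothetical learning rate and step count

A = np.array([[np.sum(x**2), np.sum(x)], [np.sum(x), n]])
f = np.array([np.sum(x * y), np.sum(y)])
E = np.eye(2)
x0 = np.array([0.5, -0.5])                 # arbitrary initial value x(0)

# iterate (12): x(t+1) = (E - eps*A) x(t) + eps*f
xt = x0.copy()
for _ in range(t):
    xt = (E - eps * A) @ xt + eps * f

# closed form (13)
M = np.linalg.matrix_power(E - eps * A, t)
xt_closed = M @ x0 + (E - M) @ np.linalg.solve(A, f)

print(xt)           # iteration of (12)
print(xt_closed)    # closed form (13); the two vectors coincide
```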
Note. Let us consider a simple difference equation
$$x(n+1) = a\, x(n) + b \qquad (a \ne 1)$$
for $n = 0, 1, 2, \ldots$. Then the solution is given by
$$x(n) = a^{n} x(0) + \frac{a^{n} - 1}{a - 1}\, b = a^{n} x(0) + \frac{1 - a^{n}}{1 - a}\, b.$$
Check this.
Comment. The solution (13) gives
$$\lim_{t \to \infty} x(t) = A^{-1} f \qquad (14)$$
if
$$\lim_{t \to \infty} (E - \epsilon A)^{t} = O \qquad (15)$$
where $O$ is the zero matrix. Note that (14) is just the Equation (5).
Let us evaluate (13) further. For this purpose we make some preparations from Linear Algebra [9]. For simplicity we set
$$A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} \equiv \begin{pmatrix} \alpha & \beta \\ \beta & \delta \end{pmatrix}, \qquad f = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} \equiv \begin{pmatrix} e \\ f \end{pmatrix}$$
and want to diagonalize $A$. The characteristic polynomial of $A$ is
$$f(\lambda) = |\lambda E - A| = \begin{vmatrix} \lambda - \alpha & -\beta \\ -\beta & \lambda - \delta \end{vmatrix} = (\lambda - \alpha)(\lambda - \delta) - \beta^2 = \lambda^2 - (\alpha + \delta)\lambda + \alpha\delta - \beta^2$$
and the solutions are given by
$$\lambda_{\pm} = \frac{\alpha + \delta \pm \sqrt{(\alpha + \delta)^2 - 4(\alpha\delta - \beta^2)}}{2} = \frac{\alpha + \delta \pm \sqrt{(\alpha - \delta)^2 + 4\beta^2}}{2}. \qquad (16)$$
It is easy to see
$$\lambda_{+} > \lambda_{-} > 0, \qquad \lambda_{+} > 1 \qquad (\delta = n \text{ and } n \ge 2). \qquad (17)$$
We arrange the two eigenvectors of the matrix $A$, corresponding to $\lambda_{+}$ and $\lambda_{-}$, in the matrix form
$$\begin{pmatrix} \lambda_{+} - \delta & \beta \\ \beta & \lambda_{-} - \alpha \end{pmatrix}.$$
It is easy to see
$$(\lambda_{+} - \delta)^2 + \beta^2 = (\lambda_{-} - \alpha)^2 + \beta^2 \equiv \Lambda^2$$
from (16), and we also set
$$Q = \frac{1}{\Lambda}\begin{pmatrix} \lambda_{+} - \delta & \beta \\ \beta & \lambda_{-} - \alpha \end{pmatrix}. \qquad (18)$$
Then it is easy to see
Q T Q =QQ T =E ⇒ Q T =Q −1. Namely, Q is an orthogonal matrix. Then the diagonalization of A becomes λ A = Q + 0
0 T Q . λ−
(19)
By substituting (19) into (13) and using
$$E - \epsilon A = Q\begin{pmatrix} 1 - \epsilon\lambda_{+} & 0 \\ 0 & 1 - \epsilon\lambda_{-} \end{pmatrix}Q^{T},$$
we finally obtain
Theorem I. A general solution to (12) is
$$\begin{pmatrix} a(t) \\ b(t) \end{pmatrix} = Q\begin{pmatrix} (1 - \epsilon\lambda_{+})^{t} & 0 \\ 0 & (1 - \epsilon\lambda_{-})^{t} \end{pmatrix}Q^{T}\begin{pmatrix} a(0) \\ b(0) \end{pmatrix} + Q\begin{pmatrix} \dfrac{1 - (1 - \epsilon\lambda_{+})^{t}}{\lambda_{+}} & 0 \\ 0 & \dfrac{1 - (1 - \epsilon\lambda_{-})^{t}}{\lambda_{-}} \end{pmatrix}Q^{T}\begin{pmatrix} e \\ f \end{pmatrix}. \qquad (20)$$
This is our main result.
Lastly, let us show how to choose the learning rate $\epsilon$ ($0 < \epsilon < 1$), which is a very important problem in Deep Learning. Let us recall
$$\lambda_{+} > \lambda_{-} > 0 \quad \text{and} \quad \lambda_{+} > 1$$
from (17). From (15) the equations
$$\lim_{t \to \infty}(E - \epsilon A)^{t} = O \;\Longleftrightarrow\; \lim_{t \to \infty}(1 - \epsilon\lambda_{+})^{t} = \lim_{t \to \infty}(1 - \epsilon\lambda_{-})^{t} = 0$$
determine the range of $\epsilon$. Noting
$$|1 - x| < 1 \;(\Longleftrightarrow\; 0 < x < 2) \;\Longrightarrow\; \lim_{n \to \infty}(1 - x)^{n} = 0,$$
we obtain
Theorem II. The learning rate $\epsilon$ must satisfy the inequality
$$0 < \epsilon\lambda_{+} < 2 \;\Longleftrightarrow\; 0 < \epsilon < \frac{2}{\lambda_{+}}.$$
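As a numerical illustration of Theorem II (ours, with hypothetical data), the sketch below computes the bound $2/\lambda_{+}$ and runs the gradient descent iteration (10) with learning rates just inside and just outside the bound; inside the bound the iterates converge to the least squares solution (5), outside they diverge.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

alpha, beta, delta = np.sum(x**2), np.sum(x), float(n)
lam_p = (alpha + delta + np.sqrt((alpha - delta)**2 + 4 * beta**2)) / 2
eps_max = 2 / lam_p                        # Theorem II: convergence requires 0 < eps < 2/lambda_+
print(eps_max)

def run(eps, steps):
    # gradient descent iteration (10)
    a = b = 0.0
    for _ in range(steps):
        r = y - (a * x + b)
        a, b = a + eps * np.sum(r * x), b + eps * np.sum(r)
    return a, b

print(run(0.9 * eps_max, 2000))   # inside the bound: converges toward the solution (5)
print(run(1.1 * eps_max, 50))     # outside the bound: the iterates already blow up
```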