A Matrix Based Derivation and Implementation of Convolutional Neural Network in Java
Weidong Zhang
[email protected]
Abstract

A fully connected Neural Network (NN) and a Convolutional Neural Network (CNN) model are implemented in Java from scratch. The small Java package contains 8 classes and ~4,000 lines of code, with no dependency on any 3rd-party library. When used as a library, its JAR file is 80KB, making it a good candidate for embedded AI solutions (e.g. robots). NN and CNN forward and backpropagation algorithms are derived in a matrix convention, which results in simple and clear Java code. The MNIST data set is used to validate the Java implementation. The NN and CNN models' hyperparameters affect the end results with the same behaviors reported by other research papers. With certain chosen hyperparameters, a 2-layer NN architecture reaches a lowest test error rate of 1.81%, and a 3-layer CNN architecture reaches a lowest test error rate of 1.49%. The test results validate both the matrix-based equations and the Java implementation. This simple Java package can be used for future research in Machine Learning algorithms.

1. Introduction

A fully connected Neural Network (NN) model has one or more hidden layers between its input and output layers. A layer's nodes connect to its neighboring layers' nodes through weights. During forward propagation, node values in each layer are updated based on the model's input values and current weights. During backpropagation, each layer's weights are updated by a stochastic gradient descent (SGD) algorithm, driven by the errors between the output and target layers [1]. A Convolutional Neural Network (CNN) model, on the other hand, uses a small kernel that joins with a zone of the same size on the input matrix. The kernel slides across the input matrix horizontally and vertically, until it iteratively joins with the last zone at the bottom-right corner of the input. For simplicity, the term matrix is used hereafter as a generic term for 1D vectors, 2D matrices and 3D tensors.
Figure 1: A conceptual CNN layer to show how an input block joins with a weight kernel to generate an output value.
The benefits of CNN over a fully connected Neural Network (NN) are threefold:
1. Each kernel has a much smaller size and therefore takes up less memory during each propagation cycle.
2. Within a CNN layer, multiple kernels can be applied to detect different local features, such as a sharp edge or a round corner.
3. Local image features detected at a lower layer (such as a circle or an oval) can be consolidated into a more complex image feature (such as an eye) for a higher CNN layer to capture.

In this paper, NN and CNN's forward and backpropagation algorithms are first derived into equations with a matrix convention, and then implemented in Java based on those equations. NN was implemented first to serve as a foundation for CNN. The NN and CNN models are tested against the MNIST data set to validate the derived equations and the Java implementation. Multiple NN and CNN models were tested with a few hyperparameters, including 1) learning rate, 2) weight initialization, 3) drop-out rate, 4) L2 regularization, and 5) momentum. After each training epoch of 60,000 training digits, the trained model at that moment was tested against 10,000 testing digits. The error rates from both training and test runs are collected for analysis.

2. NN forward and backward propagation
Figure 2: Conceptual Neural Network model.
Forward propagation
As illustrated in the diagram above, a NN model can have multiple hidden layers, but always has one target layer at the end. Each layer has an input matrix (e.g. 𝑂𝑖), a weight matrix (e.g. 𝑊𝑖𝑗) and an output matrix (e.g. 𝑂𝑗). A bias matrix (e.g. 𝐵𝑗) is also typically used in a NN model. NN's forward propagation in each layer involves three common matrix operations: multiplication (denoted as A × B), element-wise operation (denoted as A ⨀ B for element-wise multiplication), and transposition (denoted as A^T). The relevant matrices are described below in more detail:
𝑿𝒊: A hidden layer's input matrix. Each of 𝑋𝑖's columns is called an input vector, which holds a number of independent features. 𝑥(𝑖,𝑚) denotes the value at the i-th row and m-th column of matrix 𝑋𝑖.
f(): The activation function. Common activation functions used in NN hidden layers are Sigmoid, Tanh and ReLU, as shown in the diagram below.
Figure 3: Common activation functions used in NN and CNN
𝑶𝒊: It has the same size as 𝑋𝑖. In matrix notation, 𝑂𝑖 = f(𝑋𝑖), which indicates an element-wise application of the activation function over 𝑋𝑖: 𝑜(𝑖,𝑚) = f(𝑥(𝑖,𝑚)).
𝑿𝒋: The layer's output matrix.
𝑾𝒊𝒋: The weight matrix between 𝑂𝑖 and 𝑋𝑗.
𝑩𝒋: The bias vector between 𝑂𝑖 and 𝑋𝑗.
In forward propagation, 𝑋𝑗 can be calculated with the following matrix notation:
\[ X_j = W_{ij}^{T} \times O_i + B_j \tag{1} \]
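To make equation (1) concrete, the following is a minimal sketch of a forward propagation step using plain 2D double arrays; the method and variable names are illustrative and are not the actual Matrix class of this package.

static double[][] forward(double[][] Wij, double[][] Oi, double[] Bj) {
    int nIn = Wij.length;          // nodes in layer i (rows of W_ij)
    int nOut = Wij[0].length;      // nodes in layer j (columns of W_ij)
    int batch = Oi[0].length;      // mini-batch size (columns of O_i)
    double[][] Xj = new double[nOut][batch];
    for (int j = 0; j < nOut; j++)
        for (int m = 0; m < batch; m++) {
            double s = Bj[j];                       // bias term B_j
            for (int i = 0; i < nIn; i++)
                s += Wij[i][j] * Oi[i][m];          // (W_ij^T x O_i) entry
            Xj[j][m] = s;                           // equation (1)
        }
    return Xj;
}

The layer output 𝑂𝑗 is then obtained by applying the activation function element-wise to 𝑋𝑗.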
S(): For classification NN models, the target layer's activation function is typically Softmax, which is a vector-based function [4].
𝑶𝒌: The target layer's output matrix, which is also the NN model's final output matrix:
\[ O_k = S(X_k) \tag{2} \]
𝑻𝒌: The target value matrix, with the same size as 𝑂𝑘.

Target Layer Backpropagation
To start backpropagation, a cost function must be chosen to properly evaluate the error matrix E between the output matrix 𝑂𝑘 and the target matrix 𝑇𝑘. The cross-entropy error function is typically chosen for classification NN models. If 𝑂𝑘 holds a mini-batch of m output vectors, the error matrix E has one row and m columns, and each of its values can be described as [4]:
\[ E_{(m)} = -\sum_{k} \left[ T_{(k,m)} \cdot \log O_{(k,m)} + \left(1 - T_{(k,m)}\right) \cdot \log\left(1 - O_{(k,m)}\right) \right] \tag{3} \]
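For reference, the following is a minimal sketch of a Softmax activation over one column of 𝑋𝑘 and the corresponding cross-entropy value of equation (3); it is illustrative only and is not the package's TargetLayer code.

static double[] softmax(double[] xk) {
    double max = Double.NEGATIVE_INFINITY;
    for (double v : xk) max = Math.max(max, v);      // shift by max for numerical stability
    double[] ok = new double[xk.length];
    double sum = 0.0;
    for (int i = 0; i < xk.length; i++) { ok[i] = Math.exp(xk[i] - max); sum += ok[i]; }
    for (int i = 0; i < xk.length; i++) ok[i] /= sum;
    return ok;
}

static double crossEntropy(double[] ok, double[] tk) {
    double e = 0.0;                                  // one column E_(m) of the error matrix
    for (int i = 0; i < ok.length; i++)
        e -= tk[i] * Math.log(ok[i]) + (1 - tk[i]) * Math.log(1 - ok[i]);
    return e;
}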
After the error matrix E is calculated, its derivative with respect to the target layer's weight matrix 𝑊𝑗𝑘 can be described as:
\[ \frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial X_k} \cdot \frac{\partial X_k}{\partial W_{jk}} \tag{4} \]
In the above equation, \( \frac{\partial E}{\partial X_k} \) is conceptually the rate of change of the error with respect to the target layer's input matrix 𝑋𝑘, and is therefore named 𝛿𝑘. Mathematically, 𝛿𝑘 is the derivative of the cross-entropy cost function, composed with the Softmax activation, with respect to 𝑋𝑘, and it reduces to a simple matrix form [4]:
\[ \delta_k = \frac{\partial E}{\partial X_k} = O_k - T_k \tag{5} \]
Because of the forward propagation relationship in equation (1), \( \frac{\partial X_k}{\partial W_{jk}} \) can be simplified to:
\[ \frac{\partial X_k}{\partial W_{jk}} = O_j \tag{6} \]
By substituting equations (5) and (6) into (4), it simplifies to:
\[ \frac{\partial E}{\partial W_{jk}} = \delta_k \cdot O_j = (O_k - T_k) \cdot O_j \tag{7} \]
Equation (7) can be rewritten in a matrix form suitable for Java programming:
\[ \frac{\partial E}{\partial W_{jk}} = O_j \times (O_k - T_k)^{T} \tag{8} \]
Based on SGD, the target layer's weight matrix 𝑊𝑗𝑘 is updated as:
\[ W_{jk} = W_{jk} - \Delta W_{jk} = W_{jk} - \eta \cdot \frac{\partial E}{\partial W_{jk}} = W_{jk} + \eta \cdot \left( O_j \times (T_k - O_k)^{T} \right) \tag{9} \]
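The following is a minimal sketch of the target-layer update of equations (5), (8) and (9), again with plain arrays; it is illustrative only and is not the actual TargetLayer code.

static void updateTargetWeights(double[][] Wjk, double[][] Oj,
                                double[][] Ok, double[][] Tk, double eta) {
    int nJ = Wjk.length, nK = Wjk[0].length, batch = Ok[0].length;
    for (int j = 0; j < nJ; j++)
        for (int k = 0; k < nK; k++) {
            double grad = 0.0;                       // (O_j x (O_k - T_k)^T)_{jk}, equation (8)
            for (int m = 0; m < batch; m++)
                grad += Oj[j][m] * (Ok[k][m] - Tk[k][m]);
            Wjk[j][k] -= eta * grad;                 // SGD step, equation (9)
        }
}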
In equation (9), 𝜂 is called the learning rate, which is typically a small number in the range of 0.0001 to 0.01.

Hidden Layer Backpropagation
A NN hidden layer's weight gradient can be derived with a similar approach:
\[ \frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial X_k} \cdot \frac{\partial X_k}{\partial O_j} \cdot \frac{\partial O_j}{\partial X_j} \cdot \frac{\partial X_j}{\partial W_{ij}} = \delta_k \cdot W_{jk} \cdot \frac{\partial O_j}{\partial X_j} \cdot O_i \tag{10} \]
Since \( O_j = \tanh(X_j) \), each element of \( \frac{\partial O_j}{\partial X_j} \) is the derivative of tanh evaluated at the element in the same position of 𝑋𝑗:
\[ \frac{\partial O_j}{\partial X_j} = \tanh'(X_j) = I - O_j \odot O_j \tag{11} \]
where I denotes a matrix of ones with the same dimensions as 𝑂𝑗.
Equation (10) can be further simplified by introducing 𝛿𝑗. Similar to 𝛿𝑘, 𝛿𝑗 represents the rate of change of the error with respect to the current hidden layer's input matrix 𝑋𝑗, and can be expressed as:
\[ \delta_j = \delta_k \cdot W_{jk} \cdot \frac{\partial O_j}{\partial X_j} \tag{12} \]
As indicated in equation (12), 𝛿𝑗 relies on two matrices from the next layer, 𝛿𝑘 and 𝑊𝑗𝑘. In matrix form, 𝛿𝑗 can be described as:
\[ \delta_j = \left( W_{jk} \times \delta_k \right) \odot \left( I - O_j \odot O_j \right) \tag{13} \]
As a result, equation (10) simplifies to equation (14), which has a similar convention to equation (7):
\[ \frac{\partial E}{\partial W_{ij}} = O_i \times \delta_j^{T} \tag{14} \]
Finally, a hidden layer's weight matrix can be updated through backpropagation:
\[ W_{ij} = W_{ij} - \eta \cdot O_i \times \delta_j^{T} \tag{15} \]
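The following is a minimal sketch of the hidden-layer update of equations (13)-(15) for a Tanh activation, using plain arrays and illustrative names rather than the package's HiddenLayer class.

static void updateHiddenWeights(double[][] Wij, double[][] Oi, double[][] Oj,
                                double[][] Wjk, double[][] deltaK, double eta) {
    int nI = Oi.length, nJ = Oj.length, nK = deltaK.length, batch = Oj[0].length;
    double[][] deltaJ = new double[nJ][batch];
    for (int j = 0; j < nJ; j++)
        for (int m = 0; m < batch; m++) {
            double s = 0.0;
            for (int k = 0; k < nK; k++)
                s += Wjk[j][k] * deltaK[k][m];              // (W_jk x delta_k)_{jm}
            deltaJ[j][m] = s * (1.0 - Oj[j][m] * Oj[j][m]); // times tanh'(X_j), equation (13)
        }
    for (int i = 0; i < nI; i++)
        for (int j = 0; j < nJ; j++) {
            double grad = 0.0;                              // (O_i x delta_j^T)_{ij}, equation (14)
            for (int m = 0; m < batch; m++)
                grad += Oi[i][m] * deltaJ[j][m];
            Wij[i][j] -= eta * grad;                        // equation (15)
        }
}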
With 𝛿𝑗 being derivable from the next layer's 𝑊𝑗𝑘 and 𝛿𝑘, equations (14) and (15) can be applied to multiple hidden layers iteratively, until the NN model's input layer's weight matrix is updated to complete a full cycle of backpropagation.

3. CNN forward and backward propagation

Forward propagation
On top of NN's forward propagation, a CNN layer's forward propagation involves a few extra steps, as illustrated in the diagram below. During a forward propagation step, with striding steps of 𝑆𝑥 and 𝑆𝑦 in the x and y directions, a kernel weight matrix 𝑊𝑖𝑗 performs an element-wise multiplication with a same-sized block 𝑂𝑖_𝑚𝑖𝑛𝑖 of 𝑂𝑖. The summed-up value is placed into the corresponding element of the output matrix 𝑋𝑗. This is called a convolution step.
Figure 4: CNN's forward and back propagation procedure
Multiple layers of 𝑊𝑖𝑗 (marked by L in the diagram above) can be initialized to take part in the same convolution process. As a result, the output matrix 𝑋𝑗 ends up with a size of 𝑊𝑗, 𝐻𝑗 and L for its width, height and depth respectively. Next, each element in 𝑋𝑗 is activated through the activation function Tanh(), resulting in a matrix 𝑂𝑗; 𝑋𝑗 and 𝑂𝑗 have the same dimensions. In the Java implementation, each convolution step strides in the X direction (horizontal) first, based on the given stride step 𝑆𝑥. Once the convolution hits the right border of 𝑂𝑖, it returns to the left border and strides in the Y direction by a step size of 𝑆𝑦. The black arrows in the diagram above illustrate the conceptual CNN steps, which are similar to those in NN. The actual operational steps are illustrated by the green arrows: during each convolution step, the elements in 𝑂𝑖_𝑚𝑖𝑛𝑖 are lined up to form a corresponding row in 𝑂𝑖_𝑐𝑜𝑛𝑣. The elements in each row of 𝑂𝑖_𝑐𝑜𝑛𝑣 are arranged in the order of x, y and z in 𝑂𝑖_𝑚𝑖𝑛𝑖. As a result, 𝑂𝑖_𝑐𝑜𝑛𝑣 has 𝑊𝑗 · 𝐻𝑗 rows and W·H·D columns. The relationships among these dimensions can be expressed as:
\[ W_j = \frac{W_i - W}{S_x} + 1 \tag{16} \]
\[ H_j = \frac{H_i - H}{S_y} + 1 \tag{17} \]
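A small sketch of equations (16) and (17); it assumes, as in this implementation, that no padding is used and that the strides divide the remaining extent evenly.

// Output width and height of a convolution with a W x H kernel and strides Sx, Sy (no padding).
static int outWidth(int Wi, int W, int Sx)  { return (Wi - W) / Sx + 1; }
static int outHeight(int Hi, int H, int Sy) { return (Hi - H) / Sy + 1; }
// Example: a 28x28 MNIST plane with a 5x5 kernel and stride 1 yields a 24x24 output plane.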
Similarly, the L layers of 𝑊𝑖𝑗 are flattened to form a 2D matrix 𝑊𝑖𝑗_𝑐𝑜𝑛𝑣, which has W·H·D rows and L columns. The relationships among 𝑂𝑖_𝑐𝑜𝑛𝑣, 𝑊𝑖𝑗_𝑐𝑜𝑛𝑣, 𝑋𝑗_𝑓𝑙𝑎𝑡 and 𝑂𝑗_𝑓𝑙𝑎𝑡 are identical to the corresponding ones in a NN model's forward propagation, as expressed by equations (1) and (2).
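To make the rearrangement concrete, the following is an illustrative im2col-style sketch of how a 3D input 𝑂𝑖 (stored here as depth x height x width) could be laid out into 𝑂𝑖_𝑐𝑜𝑛𝑣 with 𝑊𝑗·𝐻𝑗 rows and W·H·D columns; the actual CNNLayer code may differ in its storage and ordering details.

static double[][] toConv(double[][][] Oi, int W, int H, int Sx, int Sy) {
    int D = Oi.length, Hi = Oi[0].length, Wi = Oi[0][0].length;
    int Wj = (Wi - W) / Sx + 1, Hj = (Hi - H) / Sy + 1;    // equations (16) and (17)
    double[][] conv = new double[Wj * Hj][W * H * D];
    int row = 0;
    for (int y = 0; y + H <= Hi; y += Sy)                  // stride in Y after X is exhausted
        for (int x = 0; x + W <= Wi; x += Sx) {            // stride in X first
            int col = 0;
            for (int d = 0; d < D; d++)                    // per row: x varies fastest, then y, then z
                for (int dy = 0; dy < H; dy++)
                    for (int dx = 0; dx < W; dx++)
                        conv[row][col++] = Oi[d][y + dy][x + dx];
            row++;
        }
    return conv;   // X_j_flat then follows from conv x W_ij_conv plus bias, as in equation (1)
}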
At the end of forward propagation, 𝑂𝑗_𝑓𝑙𝑎𝑡 needs to be converted back to its 3D form 𝑂𝑗, so that it can be used as an input matrix to the next CNN/NN layer. The conversion from 𝑂𝑗_𝑓𝑙𝑎𝑡 to 𝑂𝑗 is much simpler: each column in 𝑂𝑗_𝑓𝑙𝑎𝑡 (with 𝑊𝑗 · 𝐻𝑗 elements) is remapped into an X-Y plane of 𝑂𝑗 (with 𝐻𝑗 rows and 𝑊𝑗 columns), going in the X direction first. Essentially, CNN's forward propagation has two extra steps: at the beginning, its 3D input matrix 𝑂𝑖 is convoluted into a 2D matrix 𝑂𝑖_𝑐𝑜𝑛𝑣; at the end, its 2D output matrix 𝑂𝑗_𝑓𝑙𝑎𝑡 is transformed into a 3D matrix 𝑂𝑗. Both steps are matrix element rearrangements and do not involve any mathematical operations. As a result, they do not affect CNN's backpropagation derivations, which are explained next. As shown in Figure 4, if a CNN layer is connected to a subsequent CNN layer, its output matrix 𝑂𝑗 is convoluted to 𝑂𝑗_𝑐𝑜𝑛𝑣 to serve as the input matrix of that layer, based on that layer's dimensional parameters. If, however, a CNN layer is connected to a NN layer, its output 𝑂𝑗 is flattened to form a 1D matrix 𝑂𝑗_1𝐷 to serve as the input vector to the NN layer.

Backpropagation
CNN's backpropagation is similar to that of NN in equation (10), with slight modifications to the matrix naming convention:
\[ \frac{\partial E}{\partial W_{ij\_conv}} = \frac{\partial E}{\partial X_{k\_flat}} \cdot \frac{\partial X_{k\_flat}}{\partial O_{j\_conv}} \cdot \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \cdot \frac{\partial X_{j\_flat}}{\partial W_{ij\_conv}} = \delta_{k\_flat} \cdot W_{jk\_conv} \cdot \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \cdot O_{i\_conv} \tag{18} \]
In the above equation, the multiplication of 𝛿𝑘_𝑓𝑙𝑎𝑡 and 𝑊𝑗𝑘_𝑐𝑜𝑛𝑣 can be assigned to an intermediate variable named 𝛿𝑗𝑘_𝑐𝑜𝑛𝑣:
\[ \delta_{jk\_conv} = \delta_{k\_flat} \times W_{jk\_conv}^{T} \tag{19} \]
By introducing equation (19) into (18), it can be simplified to:
\[ \frac{\partial E}{\partial W_{ij\_conv}} = \delta_{jk\_conv} \cdot \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \cdot O_{i\_conv} \tag{20} \]
To enable matrix operations in equation (20), 𝛿𝑗𝑘_𝑐𝑜𝑛𝑣 must go through two transformation steps, as illustrated by the two red arrows in Figure 4. First, the convolution rearrangement is reversed, transforming it into a 3D matrix by a function denoted deConv(). Second, this 3D matrix is flattened by a function denoted to2D() to form a 2D matrix with the same dimensions as 𝑂𝑗_𝑓𝑙𝑎𝑡:
\[ \delta_{jk\_flat} = to2D\big( deConv(\delta_{jk\_conv}) \big) \tag{21} \]
Therefore, equation (20) can be rewritten as:
\[ \frac{\partial E}{\partial W_{ij\_conv}} = \delta_{jk\_flat} \cdot \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \cdot O_{i\_conv} \tag{22} \]
\( \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \) contains the element-wise partial derivatives of 𝑂𝑗_𝑓𝑙𝑎𝑡 with respect to 𝑋𝑗_𝑓𝑙𝑎𝑡, and has the same dimensions as 𝛿𝑗𝑘_𝑓𝑙𝑎𝑡. 𝛿𝑗 is introduced to hold the element-wise multiplication of \( \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \) and 𝛿𝑗𝑘_𝑓𝑙𝑎𝑡, and can be described as:
\[ \delta_j = \delta_{jk\_flat} \odot \frac{\partial O_{j\_flat}}{\partial X_{j\_flat}} \tag{23} \]
As a result, equation (22) can be further simplified into a form similar to equation (14):
\[ \frac{\partial E}{\partial W_{ij\_conv}} = O_{i\_conv}^{T} \times \delta_j \tag{24} \]
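The following is a minimal sketch of the gradient and SGD step implied by equation (24), with plain arrays; the deConv()/to2D() bookkeeping that produces 𝛿𝑗 is omitted and the names are illustrative.

static void updateConvWeights(double[][] WijConv, double[][] OiConv,
                              double[][] deltaJ, double eta) {
    int rows = WijConv.length;      // W*H*D rows
    int cols = WijConv[0].length;   // L kernels
    int positions = OiConv.length;  // Wj*Hj convolution positions
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            double grad = 0.0;                         // (O_i_conv^T x delta_j)_{rc}, equation (24)
            for (int p = 0; p < positions; p++)
                grad += OiConv[p][r] * deltaJ[p][c];
            WijConv[r][c] -= eta * grad;               // SGD step
        }
}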
Up to this point, a CNN layer's weight matrix can be updated through backpropagation.

4. Configuration of hyperparameters

The default NN and CNN models tend to overfit over the course of training: the test error rate typically gets stuck at a much higher level than the training error rate, even though the latter can drop as low as 0%. Hyperparameters are introduced to delay this overfitting. They are also used to validate the current Java implementation, which should show the same trends for each hyperparameter as reported by other research papers.

Activation Function
The Tanh function is chosen as the activation function in all NN/CNN models, due to its symmetric shape about the origin, based on recommendations in LeCun's paper [2].

Weight Initialization
Weights are initialized from a Gaussian distribution with a standard deviation of \( \frac{1}{\sqrt{N}} \), where N is the number of input nodes of each layer [2].
Learning Rate Decay
The learning rate 𝜂 in all models is linearly decayed over the whole training course. At the end of the final training epoch, 𝜂 has dropped to 80% of its initial value.

L2 Regularization of Weights
L2 regularization adds a penalty term to the cost function. In a single-variable form, the regularized cost function can be described as:
\[ e = e_0 + \frac{\lambda}{2n} \sum_{i=1}^{n} w_i^2 \tag{25} \]
The partial derivative of the error e with respect to a given weight 𝑤𝑖 is:
\[ \frac{\partial e}{\partial w_i} = \frac{\partial e_0}{\partial w_i} + \frac{\lambda}{n} w_i \tag{26} \]
Since both 𝜆 and 𝑛 are arbitrary constants, the above equation can be further simplified by redefining 𝜆:
\[ \frac{\partial e}{\partial w_i} = \frac{\partial e_0}{\partial w_i} + \lambda \cdot w_i \tag{27} \]
With the extra regularization term, the backpropagation update can be written as:
\[ w_i = w_i - \eta \frac{\partial e}{\partial w_i} = w_i - \eta \left( \frac{\partial e_0}{\partial w_i} + \lambda \cdot w_i \right) = (1 - \eta\lambda) w_i - \eta \frac{\partial e_0}{\partial w_i} \tag{28} \]
In matrix form, the above equation can be rewritten as:
\[ W_{ij} = (1 - \eta\lambda) W_{ij} - \eta \frac{\partial E}{\partial W_{ij}} \tag{29} \]
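A minimal sketch of the decayed weight update of equation (29), assuming the gradient dEdW has already been computed as in equations (8) or (14).

// W_ij = (1 - eta * lambda) * W_ij - eta * dE/dW_ij
static void updateWithL2(double[][] Wij, double[][] dEdW, double eta, double lambda) {
    for (int i = 0; i < Wij.length; i++)
        for (int j = 0; j < Wij[0].length; j++)
            Wij[i][j] = (1.0 - eta * lambda) * Wij[i][j] - eta * dEdW[i][j];
}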
It is evident that each element of the weight matrix is linearly "decayed" by a certain percentage controlled by 𝜆 after each iteration. Intuitively speaking, L2 regularization smooths out the influence of large weights and evens out noisy data over the training process.

Weight Momentum
From a physics perspective, finding the minimum of a loss function is like letting go of a ball at a random location on the loss curve: the ball rolls down the curve, picks up momentum, and settles at the lowest point (minimum loss) of the function. If the ball rolls without any friction, it never stops at any point, no matter how small the learning rate is, making it much harder to find the minimum loss. Therefore, introducing a velocity, along with a friction parameter (traditionally called momentum in the NN field), can help the ball settle at the lowest point of the curve. If the rolling ball has a velocity v at a certain point, v is updated at the end of each learning step as:
\[ v = \mu v - \eta \frac{\partial e}{\partial w_i} \tag{30} \]
where 𝜇 is the momentum parameter, with a value ranging from 0 to 1: 1 means no friction, 0.9 means losing 10% of the previous velocity at each step, and so on. Once the velocity is updated, the weight value is updated as:
\[ w_i = w_i + v \tag{31} \]
The above two single-variable equations can be expanded into matrix form as:
\[ V = \mu V - \eta \frac{\partial E}{\partial W_{ij}} \tag{32} \]
\[ W_{ij} = W_{ij} + V \tag{33} \]
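A minimal sketch of the momentum update of equations (32) and (33); the velocity matrix V is assumed to persist across iterations and to start at zero.

// V = mu * V - eta * dE/dW_ij;  W_ij = W_ij + V
static void updateWithMomentum(double[][] Wij, double[][] V,
                               double[][] dEdW, double eta, double mu) {
    for (int i = 0; i < Wij.length; i++)
        for (int j = 0; j < Wij[0].length; j++) {
            V[i][j] = mu * V[i][j] - eta * dEdW[i][j];   // equation (32)
            Wij[i][j] += V[i][j];                        // equation (33)
        }
}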
Weight Drop-out
Weight drop-out is an effective technique to delay overfitting. During each forward propagation iteration, a certain ratio of weights is randomly set to zero, preventing those weights from influencing the outcome.

5. Java implementation of NN and CNN
Figure 5: Class diagram of NN and CNN Java implementation
The above class diagram shows the Java implementation of the NN and CNN models. Besides the classes shown in the diagram, this implementation does not rely on any other 3rd-party Java library. There are about 4,000 lines of code in total. The CNN model does not use any pooling or max-filter layer, for simplicity's sake. A brief explanation of each Java class is given below:

MNISTDataUtil: A utility class to read MNIST training/testing data out of external CSV files, store them in memory objects, and initiate the training and testing processes. It also initializes NN and CNN models based on constructor parameters.

NN: Constructor of a NN model, which takes the number of nodes in each layer and the hyperparameters. The following sample code defines a NN model with 2 layers: the first layer has 784 input nodes and 100 hidden nodes; the second layer has 100 input nodes and 10 output nodes. It also sets an L2 regularization parameter of 0.00001 and a weight drop-out rate of 0.02:

NN nn = NN.getInstance(new int[]{784, 100, 10});
nn.setL2(0.00001);
nn.dropout(0.02);
nn.setActivation(NN.TANH);
CNN: Constructor of a CNN model. A CNN model typically has multiple CNN layers; the final CNN layer's output is fed into a NN layer as input. The following sample code shows how to construct a CNN model with 3 CNN layers and one NN target layer:

int[][] config = new int[][]{{5, 5, 20, 1, 1}, {5, 5, 20, 1, 1}, {5, 5, 10, 1, 1}};
CNN cnn = CNN.getInstance(config);
NN nn = NN.getInstance(new int[]{2560, 10});
cnn.setNNLayers(nn); // to chain up CNN and NN layers
HiddenLayer / TargetLayer: A NN model is composed of zero or more hidden layers, and one target layer at the end. These two classes encapsulate the code for forward and backpropagation as covered by the equations earlier. For example, a hidden layer’s forward propagation is expressed in two lines of code: Xj = Matrix.sum(Oi, Matrix.randomColZeros(Wij, dropOutRate), Bj); Oj = Matrix.tanh(Xj);
CNNLayer: Constructor of a CNN layer. It encapsulates the code for forward and backpropagation, along with the convolution operations.

Matrix: Contains all matrix-based functions required by the NN and CNN models, such as addition, multiplication, element-wise operations, etc. It was based on Princeton's Matrix.java class (https://introcs.cs.princeton.edu/java/95linear/Matrix.java), but expanded with additional methods.

StdRandom: Contains core math functions used in NN and CNN, such as Gaussian distribution, standard deviation, etc. It was based on Princeton's StdOut.java class (https://introcs.cs.princeton.edu/java/95linear/StdOut.java.java).

6. NN implementation test results

Multiple NN models were tested to validate the Java implementation. Each model is trained with MNIST's training dataset of 60K samples. At the end of each epoch, the trained NN model is tested against MNIST's 10K test samples. Both training and test error rates are collected for comparison.
To compare the effects of hyperparameters, a two-layer NN model of 784-100-10 is first set up to collect benchmark training and test error rates. The initial learning rate is 0.001, the L2 regularization hyperparameter is set to 0.00001, and the weight drop-out rate is 0.01. No momentum is applied. As shown in the diagram below, during the training process the training error quickly converges and drops close to 0% in the later phase of the 100 training epochs, while the test error rate lingers around 2.55% and drops to its lowest rate of 2.52% at epoch 97. The quick disappearance of the training error confirms the correct implementation of forward and backpropagation based on the earlier equations. The flattened test error rate curve around 2.5% indicates a typical overfitted training process.
Figure 6: NN model (784-100-10)'s benchmark training and test error rate
Next, two hyperparameters are introduced to study how each affects the test error rate: 1) the drop-out rate and 2) the momentum rate 𝜇. The training error rate curves are not presented here because they show similar overfit patterns. The test error rates, however, are improved by each hyperparameter at certain values. As shown in the following diagram, a momentum 𝜇 value of 0.667 and a drop-out rate of 0.02 improved the test results. By applying both in the NN model, the minimum test error rate dropped from 2.52% to 2.36%.
Figure 7: Hyperparameters' effect on NN's test error rate
To study how the number of nodes affects the test error rate, the "best" hyperparameters (momentum = 0.667, drop-out = 0.02, L2 = 0.00001) were applied to other NN models with more hidden nodes. The test results show a general trend: more hidden nodes produce a lower test error rate. A 784-800-10 NN model generates the lowest error rate of 1.86%, which is close to the best error rate of 1.6% for a NN model of the same size, as listed in LeCun's MNIST performance metrics [9].
Figure 8: Effect of the number of hidden nodes on NN's test error rate
For a NN model, more nodes equate to more weights between them. The diagram below shows the relationship between the total number of weights and the test error rate. Even though more weights correlate with lower test error rates, their nonlinear relationship shows a flattening pattern, which implies that adding more nodes can only drop the test error rate to a certain point before it levels off.
Figure 9: Relationship of total number of weights to error rate in NN models
7. CNN implementation test results
Figure 10: A 3-layer CNN model connected with a NN target layer
A 3-layer CNN model was constructed and connected to a fully connected NN target layer for Softmax classification at the end. The CNN model has the same kernel configuration of {5, 5, 10, 1, 1} for all three layers (the numbers indicate the width, height, depth, stride X and stride Y of the kernel respectively). This CNN model has ~30,000 weight parameters, whereas the benchmark NN model has ~80,000 weight parameters. Through preliminary testing, it was found that a CNN model is more sensitive to the learning rate and requires a much smaller value than the NN model. As a result, this CNN model uses the following hyperparameters: learning rate = 0.0002; drop-out rate = 0.01; L2 = 0.00001; momentum = 0.667. Comparing the CNN test results with those of the NN model, it was noted that: 1) the CNN model reaches a much lower test error rate of 1.49%, even though it has far fewer weights (30K vs. 80K); 2) the CNN model does not overfit as quickly as the NN model, as indicated in the diagram below.
Figure 11: CNN models' training and test error rates during training epochs
8. Summary

The Java implementation of NN and CNN's forward and backpropagation algorithms, as described by the derived equations, is validated through the MNIST data set. The test results of the NN and CNN models match the trends reported by other papers. It was found that CNN is more sensitive to the learning rate, and produces a higher training error rate but a lower test error rate. As a result, CNN is by nature less prone to overfitting than a NN model. This can be explained by the localized weight kernels in CNN, which can only "overfit" a small region of the given input image. It was also found that adding more layers or nodes to a CNN model alone does not drop the test error rate as fast as it does for the NN models, as adding more kernels may cause them to "cancel out" each other's effects. The small, self-contained Java library of 80KB can be used for future studies of deep learning algorithms.
References
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. "Deep Learning", MIT Press, 2016.
[2] LeCun, Yann A., Leon Bottou, Genevieve B. Orr, and Klaus-Robert Muller. "Efficient BackProp." In Neural Networks: Tricks of the Trade, pp. 9-48. Springer Berlin Heidelberg, 2012.
[3] LeCun, Yann A., and Yoshua Bengio. "Convolutional Networks for Images, Speech, and Time-Series." The Handbook of Brain Theory and Neural Networks, 1995.
[4] Michael Nielsen. "Neural Networks and Deep Learning", http://neuralnetworksanddeeplearning.com, 2017.
[5] Nitish Srivastava, Geoffrey Hinton, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 2014.
[6] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-Based Learning Applied to Document Recognition", Proceedings of the IEEE, November 1998.
[7] Xavier Glorot and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks", International Conference on Artificial Intelligence and Statistics, 2010.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks", NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012.
[9] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. "The MNIST Database of Handwritten Digits", http://yann.lecun.com/exdb/mnist/.