An Efficient and Effective Convolutional Neural Network for Visual Pattern Recognition
Liew Shan Sung
Supervisor: Prof. Dr. Mohamed Khalil Mohd. Hani
Co-supervisor (Former): Dr. Rabia Bakhteri
June 15, 2016
Outline
1. Introduction: Introduction, Problem Statement, Research Objectives, Scope of Work, Research Methodology
2. Proposed Models and Learning Algorithms: Proposed Models, Proposed Learning Algorithms
3. Results and Analysis: Results of Proposed Convolutional Neural Network Models, Results of Proposed Learning Algorithm and Its Distributed Computing Implementation
4. Conclusion: Conclusion and Contributions, Future Work, Publications
Introduction: Artificial Neural Networks
An artificial neural network (ANN) is a bio-inspired mathematical model that can approximate functions.

Figure: A multilayer perceptron (MLP) with a single hidden layer, showing the input, hidden, and output layers with their fan-in and fan-out connections.
Typically trained in a supervised manner [3] using the gradient descent (GD) method [7].
Shortcomings:
- Poor generalization ability [8].
- Unsuitable for interpreting multi-dimensional data [10].
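As a minimal illustration of a GD weight update (a sketch only; the function and variable names are assumptions, not taken from this work):

```cpp
#include <vector>

// One gradient descent step: move each weight against its loss gradient.
void gd_step(std::vector<double>& w, const std::vector<double>& grad, double eta) {
    for (size_t i = 0; i < w.size(); ++i) {
        w[i] -= eta * grad[i];  // w := w - eta * dE/dw
    }
}
```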
Introduction: Convolutional Neural Networks
A convolutional neural network (CNN) is a variant of ANNs that is optimized for visual recognition tasks [9]. It incorporates feature extraction, dimensionality reduction, and classification into a single trainable model.

Figure: The LeNet-5 CNN architecture [9]: Input 1@32×32 → C1 6@28×28 (5×5 convolutions) → S1 6@14×14 (2×2 pooling) → C2 16@10×10 (5×5 convolutions) → S2 16@5×5 (2×2 pooling) → C3 128@1×1 (5×5 convolutions) → F4 84@1×1 (full connection) → F5 10@1×1 (full connection).
Advantages over conventional ANNs [9]:
- Weight sharing minimizes the number of trainable parameters.
- Minimal preprocessing of the raw inputs is required.
The feature map sizes in the figure follow directly from the kernel and pooling sizes, as sketched below.
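A small sketch of how those LeNet-5 feature map sizes arise (valid convolution at stride 1 followed by non-overlapping pooling; the helper names are illustrative only):

```cpp
#include <cstdio>

// Output width of a "valid" convolution at stride 1: in - kernel + 1.
int conv_out(int in, int kernel) { return in - kernel + 1; }

// Output width of non-overlapping pooling: in / pool.
int pool_out(int in, int pool) { return in / pool; }

int main() {
    int s = 32;                                      // Input 1@32x32
    s = conv_out(s, 5); std::printf("C1: %d\n", s);  // 28
    s = pool_out(s, 2); std::printf("S1: %d\n", s);  // 14
    s = conv_out(s, 5); std::printf("C2: %d\n", s);  // 10
    s = pool_out(s, 2); std::printf("S2: %d\n", s);  // 5
    s = conv_out(s, 5); std::printf("C3: %d\n", s);  // 1
    return 0;
}
```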
Introduction: Deep Learning and Distributed Machine Learning
Deep learning (DL): A subset of machine learning (ML) techniques that learn hierarchical representations of data in deep-layered model architectures [14]. The most common DL model is the deep neural network (DNN) [120]. CNNs are well-established and show great potential.
Issue: Training deeper models on large datasets is extremely computationally expensive [22]. This leads to the concept of distributed ML.
Distributed ML: Distributes the training process across multiple machines on a parallel computing platform to achieve parallelism speedup [19]. Distributed versions of learning algorithms have been developed, mostly first order methods [19, 25, 28].
Problem Statement
Outstanding issues with current models and learning algorithms:
CNN model (architecture):
1. Slow computations in the convolutional layer due to kernel weight flipping.
2. Gradient diffusion problem and training instability due to inappropriate activation functions.
Learning algorithm:
3. Slow convergence of conventional (i.e. first order) algorithms.
4. Difficulty of tuning many hyperparameters.
Distributed machine learning:
5. Training inefficiency of conventional distributed learning algorithms.
6. Little discussion on mapping a learning algorithm onto distributed computing models.
Problem Statement 1: Convolutional Layer
Research issue: Propagations through a convolutional layer require flipping of the kernel weights [32], which slows down the computation of a CNN model.
Figure: (a) The original 3×3 kernel (values 1-9), and (b) convolution with the flipped kernel (the same values read in reverse order).
Previous related works: Some works that reported performing convolutions actually use cross-correlations instead [34-38]. The effect of weight flipping on the CNN learning performance remains an open question.
Problem Statement 2: Activation Function
Research issue:
- Common activation functions (bounded): gradient diffusion problem [45].
- Other activation functions (unbounded): training instability [46].
Figure: Activation and gradient curves of three activation functions: (a) logistic (logsig), (b) rectified linear unit (ReLU) [47], and (c) bi-firing (bifire) [46].
Problem Statement 2: Activation Function (Cont’d)
Previous related works to reduce the training (numerical) instability:
- Weight regularization [46].
- Input and output normalization [60, 88].
- Modification of the loss function [91, 92].
- Gradient clipping [96, 97].
- Proper selection of hyperparameter values [99].
Issues with these approaches:
- More computations are required in the training process.
- Few in-depth studies on the impact of an activation function on the training stability.
Problem Statement 3: Learning Algorithm
Research issue: Common first order learning algorithms suffer from slow convergence.
First order methods (e.g. stochastic gradient descent (SGD)):
- Simple and effective.
- Have a higher chance of reaching poor local minima, especially for ill-conditioned problems [8, 49].
Research issue: Most second order methods are computationally expensive.
Second order methods (e.g. the Levenberg-Marquardt algorithm (LMA)):
- Converge faster than first order methods.
- Require inversion of the Hessian matrix, which is compute-intensive [49].
Problem Statement 4: Learning Algorithm (Cont’d)
Research issue: Some learning algorithms have many hyperparameters to be tuned manually [52], making it more difficult to find a good solution.
Previous related works:
Table: List of known supervised learning algorithms for training NN models.
Learning algorithm | Total hyperparameters
AdaGrad [107] | >1
AdaDelta [55] | 2
BGD with learning rate schedule [48] | >2
SGD with learning rate schedule [48] | >2
Bold driver [106] | 3
Rprop [108] | 5
iRprop [7] | 5
GD with adaptive step size controller [109] | >5
AdaDec [48] | 6
BFGS [51] | >1
L-BFGS [111] | >1
Layer-specific SDLM [112] | 2
SDLM [9] | 3
B-SDLM (this work) | 1
Problem Statement 5: Distributed Learning Algorithm
Most distributed learning algorithms are based on first order methods [45, 49]. Issue: they converge slowly.
Some other algorithms are effective in training deep models. Issue: they are computationally expensive [19, 55].
Previous related works:
Table: List of known recent previous works on distributed supervised learning algorithms.
Work | Year | Learning algorithm | Characteristic
Dettmers et al. [169] | 2015 | Mini-batch SGD | First order method
Shokri et al. [172] | 2015 | Distributed selective SGD | First order method
Heigold et al. [28] | 2014 | A-SGD | First order method
Kumar et al. [30] | 2014 | Distributed SGD | First order method
Li et al. [173] | 2014 | EMSO-GD | First order method
Zhang et al. [174] | 2014 | Elastic averaging SGD | First order method
Dean et al. [19] | 2012 | Downpour SGD | First order method
Zeiler [55] | 2012 | AdaDelta | First order method
Zinkevich et al. [171] | 2010 | Parallel SGD | First order method
Agarwal et al. [53] | 2014 | SGD + L-BFGS | Second order method
Dean et al. [19] | 2012 | SandBlaster L-BFGS | Second order method
Suri et al. [54] | 2002 | Parallel LMA | Second order method
This work | 2016 | Distributed B-SDLM | Efficient and effective
Problem Statement 6: Mapping of Learning Algorithm
Research issue: The literature provides inadequate discussion of the process of mapping a learning algorithm for parallel computation [30, 56, 57]. This leaves open questions on the mapping considerations:
- Types of parallelism.
- Thread models.
- Communications between physical machines.
- Task scheduling and synchronization mechanisms.
Research Objectives
1. To propose an efficient convolutional neural network (CNN) model
- Consists of convolutional layers with correlation filtering and bounded activation functions.
- Faster computation, improved generalization performance, and better training stability.
2. To develop an effective stochastic second order learning algorithm, i.e. the bounded stochastic diagonal Levenberg-Marquardt (B-SDLM)
- Converges faster and is computationally efficient.
- Alleviates the hyperparameter overfitting problem.
3. To propose a distributed second order learning algorithm
- Converges faster and better than common distributed first order learning algorithms.
- Includes a systematic methodology of mapping the proposed learning algorithm for parallel computation.
Scope of Work
C/C++ based software implementation:
- Pthreads library.
- MPICH library.
Platforms:
- Quad-core CPU computing platform.
- CPU cluster consisting of four of the aforementioned platforms.
Case studies (visual pattern recognition):
- MNIST: basic handwritten digit classification.
- mnist-rot-bg-img: complex handwritten digit classification.
- AR Purdue: face recognition.

Figure: Examples of the handwritten digit images in the mnist-rot-bg-img database.
Research Methodology
1. Develop baseline DNN and CNN models, as well as the training procedure with SGD as the baseline learning algorithm.
2. Propose an improved CNN model that has convolutional layers with correlation filtering and bounded activation functions.
3. Derive an efficient B-SDLM learning algorithm to train the NN models effectively.
4. Implement the proposed distributed B-SDLM learning algorithm to achieve fast parallelism speedup.

Figure: General approach taken in the research work.
Proposed Model: Convolutional Layer with Correlation Filtering
Problem 1: Weight flipping slows down the CNN computation.
Proposed solution 1: Convolutional layer with correlation filtering. Replace convolutions with cross-correlations to eliminate the weight flipping operation.
Original (convolution with the flipped kernel):
Y_j^{(l)}(x, y) = f\left( \sum_{i \in N^{(l-1)}} \sum_{u \in K_x^{(l)}} \sum_{v \in K_y^{(l)}} Y_i^{(l-1)}\!\left(S_x^{(l)} x + u,\; S_y^{(l)} y + v\right) \widetilde{W}_{ji}^{(l)}(u, v) + B_j^{(l)} \right)   (1)

where \widetilde{W}_{ji}^{(l)} denotes the flipped kernel weights.

Proposed (cross-correlation, no flipping):

Y_j^{(l)}(x, y) = f\left( \sum_{i \in N^{(l-1)}} \sum_{u \in K_x^{(l)}} \sum_{v \in K_y^{(l)}} Y_i^{(l-1)}\!\left(S_x^{(l)} x + u,\; S_y^{(l)} y + v\right) W_{ji}^{(l)}(u, v) + B_j^{(l)} \right)   (2)
Contribution 1: The proposed convolutional layer achieves faster execution speed and better learning performance.
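A minimal sketch of the difference in the forward pass of Eqs. (1) and (2) (single input and output map, stride 1, no activation; the array layout and names are assumptions, not the thesis code): cross-correlation indexes the kernel directly, whereas convolution reads it with flipped indices.

```cpp
#include <vector>

// Forward pass of one K x K kernel over a single feature map ("valid" region).
// flip = true  -> convolution: kernel weights are read back-to-front (Eq. 1)
// flip = false -> cross-correlation: no flipping, as proposed (Eq. 2)
std::vector<double> filter2d(const std::vector<double>& in, int inW, int inH,
                             const std::vector<double>& w, int K,
                             double bias, bool flip) {
    int outW = inW - K + 1, outH = inH - K + 1;
    std::vector<double> out(outW * outH, 0.0);
    for (int y = 0; y < outH; ++y) {
        for (int x = 0; x < outW; ++x) {
            double acc = bias;
            for (int v = 0; v < K; ++v) {
                for (int u = 0; u < K; ++u) {
                    double wt = flip ? w[(K - 1 - v) * K + (K - 1 - u)]
                                     : w[v * K + u];
                    acc += in[(y + v) * inW + (x + u)] * wt;
                }
            }
            out[y * outW + x] = acc;
        }
    }
    return out;
}
```

Both variants visit the same input pixels; only the kernel indexing differs, which is why removing the flip saves index arithmetic without changing what the layer can represent.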
Proposed Model: Bounded Activation Functions
Problem 2: Training instability due to unbounded outputs.
Proposed solution 2: Derive new activation functions with a bounded output range, based on the universal approximation theorem (UAT) [71].

ReLU (relu) [47]:
f(x) = \max(0, x)   (3)

Bounded ReLU (brelu):
f(x) = \min(\max(0, x), A) = \begin{cases} 0 & x \le 0 \\ x & 0 < x \le A \\ A & x > A \end{cases}   (4)

Leaky ReLU (lrelu) [83]:
f(x) = \begin{cases} x & x > 0 \\ 0.01x & \text{otherwise} \end{cases}   (5)

Bounded leaky ReLU (blrelu):
f(x) = \begin{cases} 0.01x & x \le 0 \\ x & 0 < x \le A \\ A & x > A \end{cases}   (6)

Bi-firing (bifire) [46]:
f(x) = \begin{cases} -x - \frac{A}{2} & x < -A \\ \frac{x^2}{2A} & -A \le x \le A \\ x - \frac{A}{2} & x > A \end{cases}   (7)

Bounded bi-firing (bbifire):
f(x) = \begin{cases} B & x < -B - \frac{A}{2} \\ -x - \frac{A}{2} & -B - \frac{A}{2} \le x < -A \\ \frac{x^2}{2A} & -A \le x \le A \\ x - \frac{A}{2} & A < x \le B + \frac{A}{2} \\ B & x > B + \frac{A}{2} \end{cases}   (8)
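A small sketch of the proposed bounded functions (brelu and bbifire follow Eqs. (4) and (8); the blrelu upper branch is written here as saturation at A, matching Eq. (6) as reconstructed above; parameter values are placeholders):

```cpp
#include <algorithm>

// Bounded ReLU (Eq. 4): clip the ReLU output to the range [0, A].
double brelu(double x, double A) { return std::min(std::max(0.0, x), A); }

// Bounded leaky ReLU (Eq. 6): slope 0.01 on the negative side,
// identity up to A, then saturation at A (saturation assumed here).
double blrelu(double x, double A) {
    if (x <= 0.0) return 0.01 * x;
    return std::min(x, A);
}

// Bounded bi-firing (Eq. 8): the bi-firing curve clipped so that f(x) <= B.
double bbifire(double x, double A, double B) {
    if (x < -B - A / 2) return B;
    if (x < -A)         return -x - A / 2;
    if (x <= A)         return x * x / (2 * A);
    if (x <= B + A / 2) return x - A / 2;
    return B;
}
```

Bounding the output range keeps the forward activations from growing without limit, which is the property behind the training-stability results reported later.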
Proposed Model: Bounded Activation Functions (Cont’d)

Figure: Activation and gradient curves of the proposed activation functions: (a) bounded ReLU (brelu), (b) bounded leaky ReLU (blrelu), and (c) bounded bi-firing (bbifire).
Can alleviate the training instability without additional techniques that impose more computations.
Contribution 2: Bounded activation functions improve the generalization performance and training stability of an NN model.
Proposed Learning Algorithm: Bounded SDLM (B-SDLM)
Problem 3: Second order methods are generally computationally expensive due to the Hessian calculation.
Stochastic diagonal Levenberg-Marquardt (SDLM) [9]:
- More efficient than most second order methods.
- Diagonal Hessian estimation with the running average method [9]:
\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle_{(t+1)} = \gamma \left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle_{(t)} + (1 - \gamma) \left. \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right|_{(t)}   (9)
Issue: Still imposes a certain computational overhead.
Proposed solution 3: Perform simple averaging of the estimated Hessians over a small subset of the training set instead of the running average:
\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle = \frac{1}{M_H} \sum_{m=1}^{M_H} \left. \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right|_{(m)}   (10)
Simpler computation and fewer hyperparameters.
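A sketch contrasting the two Hessian estimates per weight (names and loop structure are illustrative; computing the per-sample diagonal second derivatives is assumed to happen during the second order backward pass):

```cpp
#include <vector>

// SDLM-style running average of the diagonal Hessian (Eq. 9).
void running_average(std::vector<double>& hAvg,
                     const std::vector<double>& hSample, double gamma) {
    for (size_t i = 0; i < hAvg.size(); ++i)
        hAvg[i] = gamma * hAvg[i] + (1.0 - gamma) * hSample[i];
}

// B-SDLM: simple average over a small subset of M_H samples (Eq. 10).
// hSamples[m][i] is the diagonal Hessian of sample m for weight i.
std::vector<double> simple_average(const std::vector<std::vector<double>>& hSamples) {
    std::vector<double> hAvg(hSamples.front().size(), 0.0);
    for (const auto& h : hSamples)
        for (size_t i = 0; i < hAvg.size(); ++i)
            hAvg[i] += h[i];
    for (double& v : hAvg) v /= hSamples.size();
    return hAvg;
}
```

The simple average needs no decay constant γ, which is one of the hyperparameters the running average would otherwise introduce.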
Proposed Learning Algorithm: B-SDLM (Cont’d)
Problem 4: Hyperparameter overfitting problem due to many hyperparameters to be tuned manually. The SDLM per-weight learning rate is:
\eta_{W_{ji}^{(l)}} = \frac{\eta}{\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle + \mu}   (11)
Layer-specific SDLM (L-SDLM) [112]: removes the hyperparameter µ, but adds more computation.
Proposed solution 4: Replace µ with a boundary condition that serves the same purpose of ensuring learning stability:
\eta_{W_{ji}^{(l)}} = \begin{cases} \dfrac{\eta}{\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle} & \text{if } \left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle > 1 \\ \eta & \text{otherwise} \end{cases}   (12)
Consists of only a single hyperparameter (the global learning rate η).
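A minimal sketch of the per-weight rate in Eq. (12); the boundary condition keeps the effective rate from exceeding the global η when the curvature estimate is small (the function name is illustrative):

```cpp
// Per-weight learning rate of B-SDLM (Eq. 12): divide by the averaged diagonal
// Hessian only when it exceeds 1; otherwise fall back to the global rate eta.
double bsdlm_rate(double eta, double hAvg) {
    return (hAvg > 1.0) ? eta / hAvg : eta;
}
```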
Proposed Learning Algorithm: B-SDLM (Cont’d)

// Initialization stage
initialize all weights W^(l) and biases B^(l)
t = 0
repeat
    shuffle the M training samples
    // Hessian estimation stage
    if t mod t_updt = 0 then
        for m = 1 to M_H do
            forward propagation
            second order backward propagation
            accumulate Hessians
        end for
        calculate average Hessians using Eq. (10)    <= Proposed solution 3
        calculate learning rates using Eq. (12)      <= Proposed solution 4
    end if
    // Training stage
    for m = 1 to M do
        forward propagation
        first order backward propagation
        if m mod M_B = 0 then
            weight update using Eq. (13)
        end if
    end for
    calculate the average loss value for the training samples
    // Testing stage
    for m = 1 to M_T do
        forward propagation
    end for
    calculate the average loss value for the testing samples
    t = t + 1
until E < ε or t > t_max
Proposed Learning Algorithm: B-SDLM (Cont’d)
Supports mini-batch learning mode:

W_{ji}^{(l)} = W_{ji}^{(l)} - \eta_{W_{ji}^{(l)}} \left( \frac{\sum_{m=1}^{M_B} \left. \frac{\partial E}{\partial W_{ji}^{(l)}} \right|_{(m)}}{M_B} \right)   (13)
Contribution 3: B-SDLM achieves faster and better convergence while having minimal computational overhead compared to SGD. Computational complexity overhead over SGD: O(hM \log_{t_{updt}} t_{max}).
Contribution 4: B-SDLM alleviates the hyperparameter overfitting problem by having only a single hyperparameter.
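A sketch of the mini-batch update in Eq. (13), applied with the per-weight rates from Eq. (12) (a simplified single-layer view; accumulating the per-sample gradients during backpropagation is assumed to happen elsewhere):

```cpp
#include <vector>

// Apply Eq. (13): W -= rate * (accumulated gradient / batch size),
// where rate[i] is the per-weight B-SDLM learning rate from Eq. (12).
void minibatch_update(std::vector<double>& w,
                      std::vector<double>& gradAccum,   // summed over the batch
                      const std::vector<double>& rate,
                      int batchSize) {
    for (size_t i = 0; i < w.size(); ++i) {
        w[i] -= rate[i] * gradAccum[i] / batchSize;
        gradAccum[i] = 0.0;  // reset the accumulator for the next mini-batch
    }
}
```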
Proposed Learning Algorithm: Distributed B-SDLM
Problem 5: Most distributed learning algorithms are based on first order methods that converge slowly.
Proposed solution 5: Propose a distributed learning algorithm based on B-SDLM (a stochastic second order method).
Problem 6: Inadequate discussion in the literature on mapping a learning algorithm onto distributed computing models.
Proposed solution 6: Formulate a systematic methodology based on the existing general approach of mapping an algorithm for parallel computation.
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
General methodology of mapping algorithms for parallel computation:

Figure: Phases (layers) of implementing an algorithm in software or hardware for parallel computations [179]: Layer 5 - application (processing tasks, I/O data); Layer 4 - algorithm design (task dependence graph, directed graph (DG)); Layer 3 - parallelization and scheduling (processor or thread assignment and scheduling); Layer 2 - coding (VLSI tools and HDL code for hardware, concurrency platforms and C/Fortran code for software); Layer 1 - implementation (custom hardware implementation or multithreaded software implementation).
Layer 5: Application phase. Aim: to accelerate the NN training through parallelism.
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 4: Algorithmic development phase

Table: List of tasks in the training procedure using B-SDLM.
Stage | Node | Task
Initialization | 0 | Calculate fan-in and fan-out
Initialization | 1 | Initialize weights
Initialization | 2 | Shuffle dataset
Hessian estimation | 3 | Fetch data for Hessian estimation
Hessian estimation | 4 | Forward propagation
Hessian estimation | 5 | Second order backward propagation
Hessian estimation | 6 | Accumulate Hessian
Hessian estimation | 7 | Calculate average Hessian
Hessian estimation | 8 | Calculate learning rates
Training | 9 | Fetch data for gradient computation
Training | 10 | Forward propagation
Training | 11 | Calculate error
Training | 12 | Accumulate misclassification error
Training | 13 | First order backward propagation
Training | 14 | Update weights
Training | 15 | Calculate accuracy
Testing | 16 | Fetch data for testing
Testing | 17 | Forward propagation
Testing | 18 | Calculate error
Testing | 19 | Accumulate misclassification error
Testing | 20 | Calculate accuracy
Figure: DCG of the B-SDLM algorithm, showing the initialization, Hessian estimation, training, and testing stages, with the model structure parameters, training/testing data, and hyperparameters as inputs and the training/testing accuracy as outputs.
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 3: Parallelization and scheduling phase

Figure: DCG of the proposed distributed B-SDLM algorithm. The Hessian estimation, training, and testing stages are replicated across workers, which exchange the learned parameters and gradients through shared parameter nodes.
Hessian estimation and testing stages: parallel processing of data batches.
Training stage: asynchronous weight updates with a parameter server [19].
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Scheduling and synchronization

Figure: Sequence diagram of the proposed distributed B-SDLM algorithm. Each model replica repeatedly fetches the model parameters and learning rates from the parameter server (shared memory), performs forward and backward propagation on its data samples, and pushes the computed gradients back for the server to apply the weight updates.
A UML sequence diagram [180] depicts the interactions between the parameter server and a worker (model replica).
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 2: Coding phase
Concurrency platforms:
1. Shared-memory multiprocessor system - Pthreads implementation.
2. Distributed-memory multiprocessor system - MPICH implementation.
Parameter server thread model (a simplified sketch follows below):

Figure: The parameter server thread model for the (a) Pthreads and (b) MPICH implementations. Worker threads (model replicas) process separate portions of the data and exchange parameters and gradients with a central parameter server.
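A highly simplified shared-memory sketch of the asynchronous worker loop implied by this thread model (using standard C++ threads instead of raw Pthreads; the gradient computation is stubbed out, and all names are illustrative rather than the thesis code):

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct ParameterServer {
    std::vector<double> w;   // shared model parameters
    std::mutex mtx;          // protects w during fetch and update

    std::vector<double> fetch() {
        std::lock_guard<std::mutex> lock(mtx);
        return w;            // copy of the current parameters
    }
    void apply(const std::vector<double>& grad, double rate) {
        std::lock_guard<std::mutex> lock(mtx);
        for (size_t i = 0; i < w.size(); ++i)
            w[i] -= rate * grad[i];   // asynchronous update, no global barrier
    }
};

// Each worker repeatedly fetches parameters, computes gradients on its own
// mini-batch (stubbed here), and pushes them back to the server.
void worker(ParameterServer& ps, int steps, double rate) {
    for (int s = 0; s < steps; ++s) {
        std::vector<double> local = ps.fetch();
        std::vector<double> grad(local.size(), 0.0);  // backpropagation would fill this
        ps.apply(grad, rate);
    }
}

int main() {
    ParameterServer ps;
    ps.w.assign(1000, 0.0);
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(worker, std::ref(ps), 100, 0.01);
    for (auto& t : pool) t.join();
    return 0;
}
```

In the MPICH variant the roles are the same, but the fetch and push become message exchanges between processes rather than guarded reads and writes of shared memory.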
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 1: Implementation phase
Implementation platforms:
1. Pthreads implementation: a quad-core CPU computing platform.
2. MPICH implementation: a CPU cluster of four such platforms.
Contribution 5: Training with the distributed B-SDLM learning algorithm achieves fast and scalable parallelism speedup.
Contribution 6: A systematic methodology of mapping a learning algorithm for deployment on parallel computing platforms is presented.
Results: Convolutional Layer with Correlation Filtering
- Faster execution speed (up to 1.4× faster).
- Better learning performance (up to 14.5% improvement in terms of testing misclassification error rate (MCR)).
Table: MCRs and average execution time for CNN models composed of convolutional layers with different weight flipping modes. Mode 0 denotes the proposed convolutional layer.
Dataset | Flipping mode | Training MCR (%) | Testing MCR (%) | Average time per epoch (s) | Computation time overhead (%)
MNIST [9] | 2 | 0.01 | 1.17 | 35.82 | 26.44
MNIST [9] | 1 | 0.01 | 1.19 | 29.36 | 3.64
MNIST [9] | 0 | 0.01 | 1.02 | 28.33 | -
mnist-rot-bg-img [194] | 2 | 11.69 | 26.20 | 26.99 | 35.56
mnist-rot-bg-img [194] | 1 | 13.03 | 26.87 | 21.98 | 10.40
mnist-rot-bg-img [194] | 0 | 5.72 | 22.97 | 19.91 | -
AR Purdue [195] | 2 | 12.85 | 21.00 | 9.43 | 11.33
AR Purdue [195] | 1 | 7.30 | 21.83 | 9.17 | 8.26
AR Purdue [195] | 0 | 11.85 | 19.67 | 8.47 | -
Weight flipping modes: 0 - no flipping (proposed); 1 - flipping of the kernel's indices; 2 - flipping of the kernel's values.
Results: Bounded Activation Functions - Benchmarking
The proposed functions perform comparably well or even better than the other activation functions (with bbifire achieving a slightly higher testing MCR than bifire).
Table: Testing MCRs of MLPs with different activation functions on the mnist-rot-bg-img dataset. The proposed functions are brelu, blrelu, and bbifire.
Activation function | Testing MCR (%)
logsig [46] | 62.64
relu [46] | 51.98
brelu | 50.09
lrelu | 50.52
blrelu | 49.88
bifire [46] | 48.75
bbifire | 48.97
Results: Bounded Activation Functions - Comparative Analysis
The bounded functions outperform their original forms in terms of testing MCRs.
Table: Testing MCRs (%) of the CNN models using different activation functions. The proposed functions are brelu, blrelu, and bbifire.
Activation function | MNIST (MSE) | MNIST (Softmax + CE) | mnist-rot-bg-img (MSE) | mnist-rot-bg-img (Softmax + CE) | AR Purdue (MSE) | AR Purdue (Softmax + CE)
relu | 0.97 | 1.02 | 25.22 | 22.97 | 20.50 | 19.67
brelu | 0.94 | 0.92 | 23.60 | 22.59 | 17.33 | 16.50
lrelu | 0.95 | 1.06 | 25.70 | 25.79 | 16.67 | 7.67
blrelu | 0.96 | 0.96 | 24.44 | 23.42 | 4.17 | 7.00
bifire | 0.89 | 1.04 | 25.02 | 25.61 | N/A | 7.33
bbifire | 0.90 | 0.86 | 23.09 | 24.05 | 7.17 | 6.67
Results: Bounded Activation Functions - Training Stability
The bounded functions improve the training stability significantly compared with their unbounded counterparts.
Table: Probability of numerical instability (%) for the CNN models using different activation functions on three datasets. The proposed functions are brelu, blrelu, and bbifire.
Activation function | MNIST (MSE) | MNIST (Softmax + CE) | mnist-rot-bg-img (MSE) | mnist-rot-bg-img (Softmax + CE) | AR Purdue (MSE) | AR Purdue (Softmax + CE)
relu | 0.00 | 40.00 | 0.00 | 0.00 | 0.00 | 0.00
brelu | 0.00 | 8.57 | 0.00 | 0.00 | 0.00 | 0.00
lrelu | 60.00 | 40.00 | 60.00 | 60.00 | 80.00 | 0.00
blrelu | 51.43 | 8.57 | 25.71 | 0.00 | 31.43 | 0.00
bifire | 95.00 | 93.33 | 92.50 | 0.00 | 100.00 | 45.00
bbifire | 26.67 | 53.33 | 0.00 | 0.00 | 0.00 | 0.00
Probability of numerical instability: the proportion of experimental settings for an activation function that result in numerical instability.
Results: B-SDLM - Benchmarking
- MNIST: lower testing MCR than previous works.
- AR Purdue: classifies all testing samples correctly.
Table: Benchmarking of learning algorithms against previous works on the MNIST dataset.
Work | Year | Learning algorithm | Testing MCR (%)
Schaul et al. [113] | 2013 | vSGD-g | 3.65
Schaul et al. [113] | 2013 | vSGD-l | 2.16
Schaul et al. [113] | 2013 | SGD | 2.15
Schaul et al. [113] | 2013 | vSGD-b | 2.05
Zeiler [55] | 2012 | AdaDelta | 2.00
This work | 2016 | B-SDLM | 1.98

Table: Benchmarking of face recognition on the AR Purdue face dataset.
Work | Year | Approach | Accuracy (%)
Roli et al. [198] | 2006 | Semi-supervised PCA | 85.33
Rose [197] | 2006 | Gabor and log-Gabor filters | 89.00
Song et al. [199] | 2007 | Parameterized direct LDA | 90.00
Patel et al. [200] | 2012 | Dictionary-based recognition | 93.70
Jiang et al. [201] | 2011 | K-SVD | 97.80
Syafeeza et al. (SDLM) [65] | 2014 | CNN | 99.50
This work (B-SDLM) | 2016 | CNN | 100.00
Results: B-SDLM - Comparisons among Learning Algorithms
- Consistently outperforms the other learning algorithms on all three datasets.
- Faster than the other SDLM variants.
Table: MCRs and average execution time for various learning algorithms on the three datasets.
Dataset | Learning algorithm | Training MCR (%) | Testing MCR (%) | Average time per epoch (s)
MNIST | SGD | 0.01 | 1.12 | 25.69
MNIST | SDLM | 0.01 | 1.08 | 34.13
MNIST | L-SDLM | 0.01 | 1.04 | 33.11
MNIST | B-SDLM (this work) | 0.01 | 0.90 | 28.40
mnist-rot-bg-img | SGD | 13.49 | 25.61 | 20.19
mnist-rot-bg-img | SDLM | 1.21 | 23.52 | 21.42
mnist-rot-bg-img | L-SDLM | 5.72 | 22.97 | 19.91
mnist-rot-bg-img | B-SDLM (this work) | 0.38 | 21.14 | 19.12
AR Purdue | SGD | 16.95 | 24.67 | 8.34
AR Purdue | SDLM | 15.80 | 26.00 | 8.75
AR Purdue | L-SDLM | 11.85 | 19.67 | 8.47
AR Purdue | B-SDLM (this work) | 4.65 | 15.83 | 8.40
Results: Distributed B-SDLM - Learning Convergence
Outperforms the other distributed learning algorithms on both datasets.

Figure: Best testing MCRs versus the total number of workers for the distributed SGD, SDLM, L-SDLM, and B-SDLM algorithms based on the parameter server thread model, on (a) MNIST and (b) mnist-rot-bg-img, in the Pthreads implementation.
Results: Distributed B-SDLM - Parallelism Speedup
The execution time decreases significantly as more workers are assigned to the gradient computations.

Figure: Average execution time of a single training epoch versus the total number of workers for the distributed SGD, SDLM, L-SDLM, and B-SDLM algorithms based on the parameter server thread model, on (a) MNIST and (b) mnist-rot-bg-img, in the Pthreads implementation.
Results: Distributed B-SDLM - Convergence Rate
- Reaches the fixed accuracies faster than the distributed SGD algorithm.
- Compared to sequential NN training:
  - MNIST: 5.5× faster to reach 98% testing accuracy.
  - mnist-rot-bg-img: 4.3× faster to reach 75% testing accuracy.

Figure: Time taken to reach a fixed classification accuracy on the testing set for the distributed learning algorithms based on the parameter server thread model: (a) 98% on the MNIST dataset, and (b) 75% on the mnist-rot-bg-img dataset, in the Pthreads implementation.
Results: Distributed B-SDLM - Towards a Larger Computing Platform
Compared to sequential NN training:
- 6× and 12.3× faster to reach training and testing loss values of 0.01 and 0.08, respectively.
- 5.7× faster to reach 99% training accuracy.
- 8.7× faster to reach 98% testing accuracy.

Figure: Time taken to reach a certain (a) loss value and (b) classification accuracy on the MNIST dataset when training with batch size 16 in the MPICH implementation.
Conclusion and Contributions
In response to Objective 1:
1. The convolutional layer with correlation filtering achieves faster execution speed (up to 1.4× faster) and better learning performance (up to 14.5% improvement).
2. The bounded activation functions improve the generalization performance and training stability of an NN model significantly (with training instability being eliminated in some cases).
In response to Objective 2:
3. B-SDLM achieves faster and better convergence (up to 19.6% improvement) with minimal computational overhead compared to SGD.
4. B-SDLM alleviates the hyperparameter overfitting problem by having only a single hyperparameter.
In response to Objective 3:
5. Training with the distributed B-SDLM learning algorithm achieves fast and scalable parallelism speedup (up to 12.3× faster to reach a certain loss value).
6. A systematic methodology of mapping a learning algorithm for deployment on parallel computing platforms is presented.
Future Work
- More features could be extracted by replacing the convolutions with a trainable function, allowing deeper models to be learned.
- The hyperparameters of the bounded activation functions can be made trainable.
- Hyperparameter optimization is a viable approach to deal with the hyperparameter overfitting issue.
- Combining model and data parallelism is a promising approach to achieve further training speedup.
- The learning algorithm can be mapped onto larger scale computing platforms to expand its DL capability for big data applications.
Publications
Journals
1. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. An Optimized Second Order Stochastic Learning Algorithm for Neural Network Training. Neurocomputing. 2016. vol. 186, 74-89. (ISI, IF 2.083 (Q2)).
2. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. Bounded Activation Functions for Enhanced Training Stability of Deep Neural Networks on Visual Pattern Recognition Problems. Neurocomputing. 2016. (ISI, IF 2.083 (Q2)). Under revision.
3. Liew, S. S., Khalil-Hani, M., Syafeeza, A. and Bakhteri, R. Gender Classification: A Convolutional Neural Network Approach. Turk. J. Elec. Engin. 2016. vol. 24. 1248-1264. (ISI, IF 0.407 (Q4)).
Conferences
4. Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Distributed Learning on Multi-Core Platform for Neural Network in Visual Pattern Recognition. 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2016). 2016. (Scopus). In review process.
5. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. Distributed B-SDLM: Accelerating the Training Convergence of Deep Neural Networks through Parallelism. 14th Pacific Rim International Conference on Artificial Intelligence (PRICAI 2016). 2016. (Scopus).
6. Khalil-Hani, M., Liew, S. S. and Bakhteri, R. An Optimized Second Order Stochastic Learning Algorithm for Neural Network Training. In: Arik, S., Huang, T., Lai, W. K. and Liu, Q., eds. Neural Information Processing. Springer International Publishing. 2015, Lecture Notes in Computer Science, vol. 9489. ISBN 978-3-319-26531-5. 38-45. (Scopus).
7. Khalil-Hani, M. and Liew, S. S. A-SDLM: An Asynchronous Stochastic Learning Algorithm for Fast Distributed Learning. In: Javadi, B. and Garg, S., eds. 13th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2015). Sydney, Australia: ACS. 2015, CRPIT, vol. 163. 75-84. (Scopus).
8. Khalil-Hani, M. and Liew, S. S. A Convolutional Neural Network Approach for Face Verification. High Performance Computing & Simulation (HPCS), 2014 International Conference on. 2014. 707-714. (Scopus).
Publications (Cont’d)
Others
9. Syafeeza, A. R., Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Convolutional Neural Networks with Fused Layers Applied to Face Recognition. International Journal of Computational Intelligence and Applications, 2015. 14(03): 1550014. (Scopus).
10. Syafeeza, A., Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Convolutional Neural Network for Face Recognition with Pose and Illumination Variation. International Journal of Engineering and Technology, 2014. 6(1): 44-57. ISSN 0975-4024. (Scopus).