An Efficient and Effective Convolutional Neural Network for Visual Pattern Recognition
Liew Shan Sung
Supervisor: Prof. Dr. Mohamed Khalil Mohd. Hani
Co-supervisor (Former): Dr. Rabia Bakhteri
June 15, 2016
Outline
1. Introduction: Introduction, Problem Statement, Research Objectives, Scope of Work, Research Methodology
2. Proposed Models and Learning Algorithms: Proposed Models, Proposed Learning Algorithms
3. Results and Analysis: Results of Proposed Convolutional Neural Network Models, Results of Proposed Learning Algorithm and Its Distributed Computing Implementation
4. Conclusion: Conclusion and Contributions, Future Work, Publications
Introduction: Artificial Neural Networks
An artificial neural network (ANN) is a bio-inspired mathematical model that can approximate functions.

Figure: A multilayer perceptron (MLP) with a single hidden layer, showing the input, hidden, and output layers with their fan-in and fan-out connections.
Typically trained in a supervised manner [3] using the gradient descent (GD) method [7].
Shortcomings:
- Poor generalization ability [8].
- Unsuitable for interpreting multi-dimensional data [10].
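As a minimal illustration of a GD weight update (a sketch only; the function and variable names are assumptions, not taken from this work):

```cpp
#include <vector>

// One gradient descent step: move each weight against its loss gradient.
void gd_step(std::vector<double>& w, const std::vector<double>& grad, double eta) {
    for (size_t i = 0; i < w.size(); ++i) {
        w[i] -= eta * grad[i];  // w := w - eta * dE/dw
    }
}
```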
Introduction: Convolutional Neural Networks
A convolutional neural network (CNN) is a variant of ANNs that is optimized for visual recognition tasks [9]. It incorporates feature extraction, dimensionality reduction, and classification into a single trainable model.

Figure: The LeNet-5 CNN architecture [9]: Input 1@32×32 → C1 6@28×28 (5×5 convolutions) → S1 6@14×14 (2×2 pooling) → C2 16@10×10 (5×5 convolutions) → S2 16@5×5 (2×2 pooling) → C3 128@1×1 (5×5 convolutions) → F4 84@1×1 (full connection) → F5 10@1×1 (full connection).
Advantages over conventional ANNs [9]:
- Weight sharing minimizes the number of trainable parameters.
- Minimal preprocessing of the raw inputs is required.
The feature map sizes in the figure follow directly from the kernel and pooling sizes, as sketched below.
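A small sketch of how those LeNet-5 feature map sizes arise (valid convolution at stride 1 followed by non-overlapping pooling; the helper names are illustrative only):

```cpp
#include <cstdio>

// Output width of a "valid" convolution at stride 1: in - kernel + 1.
int conv_out(int in, int kernel) { return in - kernel + 1; }

// Output width of non-overlapping pooling: in / pool.
int pool_out(int in, int pool) { return in / pool; }

int main() {
    int s = 32;                                      // Input 1@32x32
    s = conv_out(s, 5); std::printf("C1: %d\n", s);  // 28
    s = pool_out(s, 2); std::printf("S1: %d\n", s);  // 14
    s = conv_out(s, 5); std::printf("C2: %d\n", s);  // 10
    s = pool_out(s, 2); std::printf("S2: %d\n", s);  // 5
    s = conv_out(s, 5); std::printf("C3: %d\n", s);  // 1
    return 0;
}
```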
Introduction: Deep Learning and Distributed Machine Learning
Deep learning (DL): A subset of machine learning (ML) techniques that learn hierarchical representations of data in deep-layered model architectures [14]. The most common DL model is the deep neural network (DNN) [120]. CNNs are well-established and show great potential.
Issue: Training deeper models on large datasets is extremely computationally expensive [22]. This leads to the concept of distributed ML.
Distributed ML: Distributes the training process across multiple machines on a parallel computing platform to achieve parallelism speedup [19]. Distributed versions of learning algorithms have been developed, mostly first order methods [19, 25, 28].
Problem Statement
Outstanding issues with current models and learning algorithms:
CNN model (architecture):
1. Slow computations in the convolutional layer due to kernel weight flipping.
2. Gradient diffusion problem and training instability due to inappropriate activation functions.
Learning algorithm:
3. Slow convergence of conventional (i.e. first order) algorithms.
4. Difficulty of tuning many hyperparameters.
Distributed machine learning:
5. Training inefficiency of conventional distributed learning algorithms.
6. Little discussion on mapping a learning algorithm onto distributed computing models.
Problem Statement 1: Convolutional Layer
Research issue: Propagations through a convolutional layer require flipping of the kernel weights [32], which slows down the computation of a CNN model.
Figure: (a) The original 3×3 kernel (values 1-9), and (b) convolution with the flipped kernel (the same values read in reverse order).
Previous related works: Some works that reported performing convolutions actually use cross-correlations instead [34-38]. The effect of weight flipping on the CNN learning performance remains an open question.
Problem Statement 2: Activation Function
Research issue:
- Common activation functions (bounded): gradient diffusion problem [45].
- Other activation functions (unbounded): training instability [46].
Figure: Activation and gradient curves of three activation functions: (a) logistic (logsig), (b) rectified linear unit (ReLU) [47], and (c) bi-firing (bifire) [46].
Problem Statement 2: Activation Function (Cont’d)
Previous related works to reduce the training (numerical) instability:
- Weight regularization [46].
- Input and output normalization [60, 88].
- Modification of the loss function [91, 92].
- Gradient clipping [96, 97].
- Proper selection of hyperparameter values [99].
Issues with these approaches:
- More computations are required in the training process.
- Few in-depth studies on the impact of an activation function on the training stability.
Problem Statement 3: Learning Algorithm
Research issue: Common first order learning algorithms suffer from slow convergence.
First order methods (e.g. stochastic gradient descent (SGD)):
- Simple and effective.
- Have a higher chance of reaching poor local minima, especially for ill-conditioned problems [8, 49].
Research issue: Most second order methods are computationally expensive.
Second order methods (e.g. the Levenberg-Marquardt algorithm (LMA)):
- Converge faster than first order methods.
- Require inversion of the Hessian matrix, which is compute-intensive [49].
Problem Statement 4: Learning Algorithm (Cont’d)
Research issue: Some learning algorithms have many hyperparameters to be tuned manually [52], making it more difficult to find a good solution.
Previous related works:
Table: List of known supervised learning algorithms for training NN models.
Learning algorithm | Total hyperparameters
AdaGrad [107] | >1
AdaDelta [55] | 2
BGD with learning rate schedule [48] | >2
SGD with learning rate schedule [48] | >2
Bold driver [106] | 3
Rprop [108] | 5
iRprop [7] | 5
GD with adaptive step size controller [109] | >5
AdaDec [48] | 6
BFGS [51] | >1
L-BFGS [111] | >1
Layer-specific SDLM [112] | 2
SDLM [9] | 3
B-SDLM (this work) | 1
Problem Statement 5: Distributed Learning Algorithm
Most distributed learning algorithms are based on first order methods [45, 49]. Issue: they converge slowly.
Some other algorithms are effective in training deep models. Issue: they are computationally expensive [19, 55].
Previous related works:
Table: List of known recent previous works on distributed supervised learning algorithms.
Work | Year | Learning algorithm | Characteristic
Dettmers et al. [169] | 2015 | Mini-batch SGD | First order method
Shokri et al. [172] | 2015 | Distributed selective SGD | First order method
Heigold et al. [28] | 2014 | A-SGD | First order method
Kumar et al. [30] | 2014 | Distributed SGD | First order method
Li et al. [173] | 2014 | EMSO-GD | First order method
Zhang et al. [174] | 2014 | Elastic averaging SGD | First order method
Dean et al. [19] | 2012 | Downpour SGD | First order method
Zeiler [55] | 2012 | AdaDelta | First order method
Zinkevich et al. [171] | 2010 | Parallel SGD | First order method
Agarwal et al. [53] | 2014 | SGD + L-BFGS | Second order method
Dean et al. [19] | 2012 | SandBlaster L-BFGS | Second order method
Suri et al. [54] | 2002 | Parallel LMA | Second order method
This work | 2016 | Distributed B-SDLM | Efficient and effective
Problem Statement 6: Mapping of Learning Algorithm
Research issue: The literature provides inadequate discussion of the process of mapping a learning algorithm for parallel computation [30, 56, 57]. This leaves open questions on the mapping considerations:
- Types of parallelism.
- Thread models.
- Communications between physical machines.
- Task scheduling and synchronization mechanisms.
Research Objectives
1. To propose an efficient convolutional neural network (CNN) model
- Consists of convolutional layers with correlation filtering and bounded activation functions.
- Faster computation, improved generalization performance, and better training stability.
2. To develop an effective stochastic second order learning algorithm, i.e. the bounded stochastic diagonal Levenberg-Marquardt (B-SDLM)
- Converges faster and is computationally efficient.
- Alleviates the hyperparameter overfitting problem.
3. To propose a distributed second order learning algorithm
- Converges faster and better than common distributed first order learning algorithms.
- Includes a systematic methodology of mapping the proposed learning algorithm for parallel computation.
Scope of Work
C/C++ based software implementation:
- Pthreads library.
- MPICH library.
Platforms:
- Quad-core CPU computing platform.
- CPU cluster consisting of four of the aforementioned platforms.
Case studies (visual pattern recognition):
- MNIST: basic handwritten digit classification.
- mnist-rot-bg-img: complex handwritten digit classification.
- AR Purdue: face recognition.

Figure: Examples of the handwritten digit images in the mnist-rot-bg-img database.
Research Methodology
1. Develop baseline DNN and CNN models, as well as the training procedure with SGD as the baseline learning algorithm.
2. Propose an improved CNN model that has convolutional layers with correlation filtering and bounded activation functions.
3. Derive an efficient B-SDLM learning algorithm to train the NN models effectively.
4. Implement the proposed distributed B-SDLM learning algorithm to achieve fast parallelism speedup.

Figure: General approach taken in the research work.
Proposed Model: Convolutional Layer with Correlation Filtering
Problem 1: Weight flipping slows down the CNN computation.
Proposed solution 1: Convolutional layer with correlation filtering. Replace convolutions with cross-correlations to eliminate the weight flipping operation.
Original (convolution with the flipped kernel):
Y_j^{(l)}(x, y) = f\left( \sum_{i \in N^{(l-1)}} \sum_{u \in K_x^{(l)}} \sum_{v \in K_y^{(l)}} Y_i^{(l-1)}\!\left(S_x^{(l)} x + u,\; S_y^{(l)} y + v\right) \widetilde{W}_{ji}^{(l)}(u, v) + B_j^{(l)} \right)   (1)

where \widetilde{W}_{ji}^{(l)} denotes the flipped kernel weights.

Proposed (cross-correlation, no flipping):

Y_j^{(l)}(x, y) = f\left( \sum_{i \in N^{(l-1)}} \sum_{u \in K_x^{(l)}} \sum_{v \in K_y^{(l)}} Y_i^{(l-1)}\!\left(S_x^{(l)} x + u,\; S_y^{(l)} y + v\right) W_{ji}^{(l)}(u, v) + B_j^{(l)} \right)   (2)
Contribution 1: The proposed convolutional layer achieves faster execution speed and better learning performance.
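A minimal sketch of the difference in the forward pass of Eqs. (1) and (2) (single input and output map, stride 1, no activation; the array layout and names are assumptions, not the thesis code): cross-correlation indexes the kernel directly, whereas convolution reads it with flipped indices.

```cpp
#include <vector>

// Forward pass of one K x K kernel over a single feature map ("valid" region).
// flip = true  -> convolution: kernel weights are read back-to-front (Eq. 1)
// flip = false -> cross-correlation: no flipping, as proposed (Eq. 2)
std::vector<double> filter2d(const std::vector<double>& in, int inW, int inH,
                             const std::vector<double>& w, int K,
                             double bias, bool flip) {
    int outW = inW - K + 1, outH = inH - K + 1;
    std::vector<double> out(outW * outH, 0.0);
    for (int y = 0; y < outH; ++y) {
        for (int x = 0; x < outW; ++x) {
            double acc = bias;
            for (int v = 0; v < K; ++v) {
                for (int u = 0; u < K; ++u) {
                    double wt = flip ? w[(K - 1 - v) * K + (K - 1 - u)]
                                     : w[v * K + u];
                    acc += in[(y + v) * inW + (x + u)] * wt;
                }
            }
            out[y * outW + x] = acc;
        }
    }
    return out;
}
```

Both variants visit the same input pixels; only the kernel indexing differs, which is why removing the flip saves index arithmetic without changing what the layer can represent.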
Proposed Model: Bounded Activation Functions
Problem 2: Training instability due to unbounded outputs.
Proposed solution 2: Derive new activation functions with a bounded output range, based on the universal approximation theorem (UAT) [71].

ReLU (relu) [47]:
f(x) = \max(0, x)   (3)

Bounded ReLU (brelu):
f(x) = \min(\max(0, x), A) = \begin{cases} 0 & x \le 0 \\ x & 0 < x \le A \\ A & x > A \end{cases}   (4)

Leaky ReLU (lrelu) [83]:
f(x) = \begin{cases} x & x > 0 \\ 0.01x & \text{otherwise} \end{cases}   (5)

Bounded leaky ReLU (blrelu):
f(x) = \begin{cases} 0.01x & x \le 0 \\ x & 0 < x \le A \\ A & x > A \end{cases}   (6)

Bi-firing (bifire) [46]:
f(x) = \begin{cases} -x - \frac{A}{2} & x < -A \\ \frac{x^2}{2A} & -A \le x \le A \\ x - \frac{A}{2} & x > A \end{cases}   (7)

Bounded bi-firing (bbifire):
f(x) = \begin{cases} B & x < -B - \frac{A}{2} \\ -x - \frac{A}{2} & -B - \frac{A}{2} \le x < -A \\ \frac{x^2}{2A} & -A \le x \le A \\ x - \frac{A}{2} & A < x \le B + \frac{A}{2} \\ B & x > B + \frac{A}{2} \end{cases}   (8)
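A small sketch of the proposed bounded functions (brelu and bbifire follow Eqs. (4) and (8); the blrelu upper branch is written here as saturation at A, matching Eq. (6) as reconstructed above; parameter values are placeholders):

```cpp
#include <algorithm>

// Bounded ReLU (Eq. 4): clip the ReLU output to the range [0, A].
double brelu(double x, double A) { return std::min(std::max(0.0, x), A); }

// Bounded leaky ReLU (Eq. 6): slope 0.01 on the negative side,
// identity up to A, then saturation at A (saturation assumed here).
double blrelu(double x, double A) {
    if (x <= 0.0) return 0.01 * x;
    return std::min(x, A);
}

// Bounded bi-firing (Eq. 8): the bi-firing curve clipped so that f(x) <= B.
double bbifire(double x, double A, double B) {
    if (x < -B - A / 2) return B;
    if (x < -A)         return -x - A / 2;
    if (x <= A)         return x * x / (2 * A);
    if (x <= B + A / 2) return x - A / 2;
    return B;
}
```

Bounding the output range keeps the forward activations from growing without limit, which is the property behind the training-stability results reported later.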
Proposed Model: Bounded Activation Functions (Cont’d)

Figure: Activation and gradient curves of the proposed activation functions: (a) bounded ReLU (brelu), (b) bounded leaky ReLU (blrelu), and (c) bounded bi-firing (bbifire).
Can alleviate the training instability without additional techniques that impose more computations.
Contribution 2: Bounded activation functions improve the generalization performance and training stability of an NN model.
Proposed Learning Algorithm: Bounded SDLM (B-SDLM)
Problem 3: Second order methods are generally computationally expensive due to the Hessian calculation.
Stochastic diagonal Levenberg-Marquardt (SDLM) [9]:
- More efficient than most second order methods.
- Diagonal Hessian estimation with the running average method [9]:
\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle_{(t+1)} = \gamma \left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle_{(t)} + (1 - \gamma) \left. \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right|_{(t)}   (9)
Issue: Still imposes a certain computational overhead.
Proposed solution 3: Perform simple averaging of the estimated Hessians over a small subset of the training set instead of the running average:
\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle = \frac{1}{M_H} \sum_{m=1}^{M_H} \left. \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right|_{(m)}   (10)
Simpler computation and fewer hyperparameters.
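A sketch contrasting the two Hessian estimates per weight (names and loop structure are illustrative; computing the per-sample diagonal second derivatives is assumed to happen during the second order backward pass):

```cpp
#include <vector>

// SDLM-style running average of the diagonal Hessian (Eq. 9).
void running_average(std::vector<double>& hAvg,
                     const std::vector<double>& hSample, double gamma) {
    for (size_t i = 0; i < hAvg.size(); ++i)
        hAvg[i] = gamma * hAvg[i] + (1.0 - gamma) * hSample[i];
}

// B-SDLM: simple average over a small subset of M_H samples (Eq. 10).
// hSamples[m][i] is the diagonal Hessian of sample m for weight i.
std::vector<double> simple_average(const std::vector<std::vector<double>>& hSamples) {
    std::vector<double> hAvg(hSamples.front().size(), 0.0);
    for (const auto& h : hSamples)
        for (size_t i = 0; i < hAvg.size(); ++i)
            hAvg[i] += h[i];
    for (double& v : hAvg) v /= hSamples.size();
    return hAvg;
}
```

The simple average needs no decay constant γ, which is one of the hyperparameters the running average would otherwise introduce.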
Proposed Learning Algorithm: B-SDLM (Cont’d)
Problem 4: Hyperparameter overfitting problem due to many hyperparameters to be tuned manually. The SDLM per-weight learning rate is:
\eta_{W_{ji}^{(l)}} = \frac{\eta}{\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle + \mu}   (11)
Layer-specific SDLM (L-SDLM) [112]: removes the hyperparameter µ, but adds more computation.
Proposed solution 4: Replace µ with a boundary condition that serves the same purpose of ensuring learning stability:
\eta_{W_{ji}^{(l)}} = \begin{cases} \dfrac{\eta}{\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle} & \text{if } \left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)2}} \right\rangle > 1 \\ \eta & \text{otherwise} \end{cases}   (12)
Consists of only a single hyperparameter (the global learning rate η).
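A minimal sketch of the per-weight rate in Eq. (12); the boundary condition keeps the effective rate from exceeding the global η when the curvature estimate is small (the function name is illustrative):

```cpp
// Per-weight learning rate of B-SDLM (Eq. 12): divide by the averaged diagonal
// Hessian only when it exceeds 1; otherwise fall back to the global rate eta.
double bsdlm_rate(double eta, double hAvg) {
    return (hAvg > 1.0) ? eta / hAvg : eta;
}
```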
Proposed Learning Algorithm: B-SDLM (Cont’d)

// Initialization stage
initialize all weights W^(l) and biases B^(l)
t = 0
repeat
    shuffle the M training samples
    // Hessian estimation stage
    if t mod t_updt = 0 then
        for m = 1 to M_H do
            forward propagation
            second order backward propagation
            accumulate Hessians
        end for
        calculate average Hessians using Eq. (10)    <= Proposed solution 3
        calculate learning rates using Eq. (12)      <= Proposed solution 4
    end if
    // Training stage
    for m = 1 to M do
        forward propagation
        first order backward propagation
        if m mod M_B = 0 then
            weight update using Eq. (13)
        end if
    end for
    calculate the average loss value for the training samples
    // Testing stage
    for m = 1 to M_T do
        forward propagation
    end for
    calculate the average loss value for the testing samples
    t = t + 1
until E < ε or t > t_max
Proposed Learning Algorithm: B-SDLM (Cont’d)
Supports mini-batch learning mode:

W_{ji}^{(l)} = W_{ji}^{(l)} - \eta_{W_{ji}^{(l)}} \left( \frac{\sum_{m=1}^{M_B} \left. \frac{\partial E}{\partial W_{ji}^{(l)}} \right|_{(m)}}{M_B} \right)   (13)
Contribution 3: B-SDLM achieves faster and better convergence while having minimal computational overhead compared to SGD. Computational complexity overhead over SGD: O(hM \log_{t_{updt}} t_{max}).
Contribution 4: B-SDLM alleviates the hyperparameter overfitting problem by having only a single hyperparameter.
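A sketch of the mini-batch update in Eq. (13), applied with the per-weight rates from Eq. (12) (a simplified single-layer view; accumulating the per-sample gradients during backpropagation is assumed to happen elsewhere):

```cpp
#include <vector>

// Apply Eq. (13): W -= rate * (accumulated gradient / batch size),
// where rate[i] is the per-weight B-SDLM learning rate from Eq. (12).
void minibatch_update(std::vector<double>& w,
                      std::vector<double>& gradAccum,   // summed over the batch
                      const std::vector<double>& rate,
                      int batchSize) {
    for (size_t i = 0; i < w.size(); ++i) {
        w[i] -= rate[i] * gradAccum[i] / batchSize;
        gradAccum[i] = 0.0;  // reset the accumulator for the next mini-batch
    }
}
```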
Proposed Learning Algorithm: Distributed B-SDLM
Problem 5: Most distributed learning algorithms are based on first order methods that converge slowly.
Proposed solution 5: Propose a distributed learning algorithm based on B-SDLM (a stochastic second order method).
Problem 6: Inadequate discussion in the literature on mapping a learning algorithm onto distributed computing models.
Proposed solution 6: Formulate a systematic methodology based on the existing general approach of mapping an algorithm for parallel computation.
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
General methodology of mapping algorithms for parallel computation:

Figure: Phases (layers) of implementing an algorithm in software or hardware for parallel computations [179]: Layer 5 - application (processing tasks, I/O data); Layer 4 - algorithm design (task dependence graph, directed graph (DG)); Layer 3 - parallelization and scheduling (processor or thread assignment and scheduling); Layer 2 - coding (VLSI tools and HDL code for hardware, concurrency platforms and C/Fortran code for software); Layer 1 - implementation (custom hardware implementation or multithreaded software implementation).
Layer 5: Application phase. Aim: to accelerate the NN training through parallelism.
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 4: Algorithmic development phase

Table: List of tasks in the training procedure using B-SDLM.
Stage | Node | Task
Initialization | 0 | Calculate fan-in and fan-out
Initialization | 1 | Initialize weights
Initialization | 2 | Shuffle dataset
Hessian estimation | 3 | Fetch data for Hessian estimation
Hessian estimation | 4 | Forward propagation
Hessian estimation | 5 | Second order backward propagation
Hessian estimation | 6 | Accumulate Hessian
Hessian estimation | 7 | Calculate average Hessian
Hessian estimation | 8 | Calculate learning rates
Training | 9 | Fetch data for gradient computation
Training | 10 | Forward propagation
Training | 11 | Calculate error
Training | 12 | Accumulate misclassification error
Training | 13 | First order backward propagation
Training | 14 | Update weights
Training | 15 | Calculate accuracy
Testing | 16 | Fetch data for testing
Testing | 17 | Forward propagation
Testing | 18 | Calculate error
Testing | 19 | Accumulate misclassification error
Testing | 20 | Calculate accuracy
Figure: DCG of the B-SDLM algorithm, showing the initialization, Hessian estimation, training, and testing stages, with the model structure parameters, training/testing data, and hyperparameters as inputs and the training/testing accuracy as outputs.
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 3: Parallelization and scheduling phase

Figure: DCG of the proposed distributed B-SDLM algorithm. The Hessian estimation, training, and testing stages are replicated across workers, which exchange the learned parameters and gradients through shared parameter nodes.
Hessian estimation and testing stages: parallel processing of data batches.
Training stage: asynchronous weight updates with a parameter server [19].
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Scheduling and synchronization

Figure: Sequence diagram of the proposed distributed B-SDLM algorithm. Each model replica repeatedly fetches the model parameters and learning rates from the parameter server (shared memory), performs forward and backward propagation on its data samples, and pushes the computed gradients back for the server to apply the weight updates.
A UML sequence diagram [180] depicts the interactions between the parameter server and a worker (model replica).
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 2: Coding phase
Concurrency platforms:
1. Shared-memory multiprocessor system - Pthreads implementation.
2. Distributed-memory multiprocessor system - MPICH implementation.
Parameter server thread model (a simplified sketch follows below):

Figure: The parameter server thread model for the (a) Pthreads and (b) MPICH implementations. Worker threads (model replicas) process separate portions of the data and exchange parameters and gradients with a central parameter server.
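A highly simplified shared-memory sketch of the asynchronous worker loop implied by this thread model (using standard C++ threads instead of raw Pthreads; the gradient computation is stubbed out, and all names are illustrative rather than the thesis code):

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct ParameterServer {
    std::vector<double> w;   // shared model parameters
    std::mutex mtx;          // protects w during fetch and update

    std::vector<double> fetch() {
        std::lock_guard<std::mutex> lock(mtx);
        return w;            // copy of the current parameters
    }
    void apply(const std::vector<double>& grad, double rate) {
        std::lock_guard<std::mutex> lock(mtx);
        for (size_t i = 0; i < w.size(); ++i)
            w[i] -= rate * grad[i];   // asynchronous update, no global barrier
    }
};

// Each worker repeatedly fetches parameters, computes gradients on its own
// mini-batch (stubbed here), and pushes them back to the server.
void worker(ParameterServer& ps, int steps, double rate) {
    for (int s = 0; s < steps; ++s) {
        std::vector<double> local = ps.fetch();
        std::vector<double> grad(local.size(), 0.0);  // backpropagation would fill this
        ps.apply(grad, rate);
    }
}

int main() {
    ParameterServer ps;
    ps.w.assign(1000, 0.0);
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(worker, std::ref(ps), 100, 0.01);
    for (auto& t : pool) t.join();
    return 0;
}
```

In the MPICH variant the roles are the same, but the fetch and push become message exchanges between processes rather than guarded reads and writes of shared memory.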
Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)
Layer 1: Implementation phase
Implementation platforms:
1. Pthreads implementation: a quad-core CPU computing platform.
2. MPICH implementation: a CPU cluster of four such platforms.
Contribution 5: Training with the distributed B-SDLM learning algorithm achieves fast and scalable parallelism speedup.
Contribution 6: A systematic methodology of mapping a learning algorithm for deployment on parallel computing platforms is presented.
Results: Convolutional Layer with Correlation Filtering
- Faster execution speed (up to 1.4× faster).
- Better learning performance (up to 14.5% improvement in terms of testing misclassification error rate (MCR)).
Table: MCRs and average execution time for CNN models composed of convolutional layers with different weight flipping modes. Mode 0 denotes the proposed convolutional layer.
Dataset | Flipping mode | Training MCR (%) | Testing MCR (%) | Average time per epoch (s) | Computation time overhead (%)
MNIST [9] | 2 | 0.01 | 1.17 | 35.82 | 26.44
MNIST [9] | 1 | 0.01 | 1.19 | 29.36 | 3.64
MNIST [9] | 0 | 0.01 | 1.02 | 28.33 | -
mnist-rot-bg-img [194] | 2 | 11.69 | 26.20 | 26.99 | 35.56
mnist-rot-bg-img [194] | 1 | 13.03 | 26.87 | 21.98 | 10.40
mnist-rot-bg-img [194] | 0 | 5.72 | 22.97 | 19.91 | -
AR Purdue [195] | 2 | 12.85 | 21.00 | 9.43 | 11.33
AR Purdue [195] | 1 | 7.30 | 21.83 | 9.17 | 8.26
AR Purdue [195] | 0 | 11.85 | 19.67 | 8.47 | -
Weight flipping modes: 0 - no flipping (proposed); 1 - flipping of the kernel's indices; 2 - flipping of the kernel's values.
Results: Bounded Activation Functions - Benchmarking
The proposed functions perform comparably well or even better than the other activation functions (with bbifire achieving a slightly higher testing MCR than bifire).
Table: Testing MCRs of MLPs with different activation functions on the mnist-rot-bg-img dataset. The proposed functions are brelu, blrelu, and bbifire.
Activation function | Testing MCR (%)
logsig [46] | 62.64
relu [46] | 51.98
brelu | 50.09
lrelu | 50.52
blrelu | 49.88
bifire [46] | 48.75
bbifire | 48.97
Results: Bounded Activation Functions - Comparative Analysis
The bounded functions outperform their original forms in terms of testing MCRs.
Table: Testing MCRs (%) of the CNN models using different activation functions. The proposed functions are brelu, blrelu, and bbifire.
Activation function | MNIST (MSE) | MNIST (Softmax + CE) | mnist-rot-bg-img (MSE) | mnist-rot-bg-img (Softmax + CE) | AR Purdue (MSE) | AR Purdue (Softmax + CE)
relu | 0.97 | 1.02 | 25.22 | 22.97 | 20.50 | 19.67
brelu | 0.94 | 0.92 | 23.60 | 22.59 | 17.33 | 16.50
lrelu | 0.95 | 1.06 | 25.70 | 25.79 | 16.67 | 7.67
blrelu | 0.96 | 0.96 | 24.44 | 23.42 | 4.17 | 7.00
bifire | 0.89 | 1.04 | 25.02 | 25.61 | N/A | 7.33
bbifire | 0.90 | 0.86 | 23.09 | 24.05 | 7.17 | 6.67
Results: Bounded Activation Functions - Training Stability
The bounded functions improve the training stability significantly compared with their unbounded counterparts.
Table: Probability of numerical instability (%) for the CNN models using different activation functions on three datasets. The proposed functions are brelu, blrelu, and bbifire.
Activation function | MNIST (MSE) | MNIST (Softmax + CE) | mnist-rot-bg-img (MSE) | mnist-rot-bg-img (Softmax + CE) | AR Purdue (MSE) | AR Purdue (Softmax + CE)
relu | 0.00 | 40.00 | 0.00 | 0.00 | 0.00 | 0.00
brelu | 0.00 | 8.57 | 0.00 | 0.00 | 0.00 | 0.00
lrelu | 60.00 | 40.00 | 60.00 | 60.00 | 80.00 | 0.00
blrelu | 51.43 | 8.57 | 25.71 | 0.00 | 31.43 | 0.00
bifire | 95.00 | 93.33 | 92.50 | 0.00 | 100.00 | 45.00
bbifire | 26.67 | 53.33 | 0.00 | 0.00 | 0.00 | 0.00
Probability of numerical instability: the proportion of experimental settings for an activation function that result in numerical instability.
Results: B-SDLM - Benchmarking
- MNIST: lower testing MCR than previous works.
- AR Purdue: classifies all testing samples correctly.
Table: Benchmarking of learning algorithms against previous works on the MNIST dataset.
Work | Year | Learning algorithm | Testing MCR (%)
Schaul et al. [113] | 2013 | vSGD-g | 3.65
Schaul et al. [113] | 2013 | vSGD-l | 2.16
Schaul et al. [113] | 2013 | SGD | 2.15
Schaul et al. [113] | 2013 | vSGD-b | 2.05
Zeiler [55] | 2012 | AdaDelta | 2.00
This work | 2016 | B-SDLM | 1.98

Table: Benchmarking of face recognition on the AR Purdue face dataset.
Work | Year | Approach | Accuracy (%)
Roli et al. [198] | 2006 | Semi-supervised PCA | 85.33
Rose [197] | 2006 | Gabor and log-Gabor filters | 89.00
Song et al. [199] | 2007 | Parameterized direct LDA | 90.00
Patel et al. [200] | 2012 | Dictionary-based recognition | 93.70
Jiang et al. [201] | 2011 | K-SVD | 97.80
Syafeeza et al. (SDLM) [65] | 2014 | CNN | 99.50
This work (B-SDLM) | 2016 | CNN | 100.00
Results: B-SDLM - Comparisons among Learning Algorithms
- Consistently outperforms the other learning algorithms on all three datasets.
- Faster than the other SDLM variants.
Table: MCRs and average execution time for various learning algorithms on the three datasets.
Dataset | Learning algorithm | Training MCR (%) | Testing MCR (%) | Average time per epoch (s)
MNIST | SGD | 0.01 | 1.12 | 25.69
MNIST | SDLM | 0.01 | 1.08 | 34.13
MNIST | L-SDLM | 0.01 | 1.04 | 33.11
MNIST | B-SDLM (this work) | 0.01 | 0.90 | 28.40
mnist-rot-bg-img | SGD | 13.49 | 25.61 | 20.19
mnist-rot-bg-img | SDLM | 1.21 | 23.52 | 21.42
mnist-rot-bg-img | L-SDLM | 5.72 | 22.97 | 19.91
mnist-rot-bg-img | B-SDLM (this work) | 0.38 | 21.14 | 19.12
AR Purdue | SGD | 16.95 | 24.67 | 8.34
AR Purdue | SDLM | 15.80 | 26.00 | 8.75
AR Purdue | L-SDLM | 11.85 | 19.67 | 8.47
AR Purdue | B-SDLM (this work) | 4.65 | 15.83 | 8.40
Results: Distributed B-SDLM - Learning Convergence
Outperforms the other distributed learning algorithms on both datasets.

Figure: Best testing MCRs versus the total number of workers for the distributed SGD, SDLM, L-SDLM, and B-SDLM algorithms based on the parameter server thread model, on (a) MNIST and (b) mnist-rot-bg-img, in the Pthreads implementation.
Results: Distributed B-SDLM - Parallelism Speedup
The execution time decreases significantly as more workers are assigned to the gradient computations.

Figure: Average execution time of a single training epoch versus the total number of workers for the distributed SGD, SDLM, L-SDLM, and B-SDLM algorithms based on the parameter server thread model, on (a) MNIST and (b) mnist-rot-bg-img, in the Pthreads implementation.
Results: Distributed B-SDLM - Convergence Rate
- Reaches the fixed accuracies faster than the distributed SGD algorithm.
- Compared to sequential NN training:
  - MNIST: 5.5× faster to reach 98% testing accuracy.
  - mnist-rot-bg-img: 4.3× faster to reach 75% testing accuracy.

Figure: Time taken to reach a fixed classification accuracy on the testing set for the distributed learning algorithms based on the parameter server thread model: (a) 98% on the MNIST dataset, and (b) 75% on the mnist-rot-bg-img dataset, in the Pthreads implementation.
Results: Distributed B-SDLM - Towards a Larger Computing Platform
Compared to sequential NN training:
- 6× and 12.3× faster to reach training and testing loss values of 0.01 and 0.08, respectively.
- 5.7× faster to reach 99% training accuracy.
- 8.7× faster to reach 98% testing accuracy.

Figure: Time taken to reach a certain (a) loss value and (b) classification accuracy on the MNIST dataset when training with batch size 16 in the MPICH implementation.
Conclusion and Contributions
In response to Objective 1:
1. The convolutional layer with correlation filtering achieves faster execution speed (up to 1.4× faster) and better learning performance (up to 14.5% improvement).
2. The bounded activation functions improve the generalization performance and training stability of an NN model significantly (with training instability being eliminated in some cases).
In response to Objective 2:
3. B-SDLM achieves faster and better convergence (up to 19.6% improvement) with minimal computational overhead compared to SGD.
4. B-SDLM alleviates the hyperparameter overfitting problem by having only a single hyperparameter.
In response to Objective 3:
5. Training with the distributed B-SDLM learning algorithm achieves fast and scalable parallelism speedup (up to 12.3× faster to reach a certain loss value).
6. A systematic methodology of mapping a learning algorithm for deployment on parallel computing platforms is presented.
Future Work
- More features could be extracted by replacing the convolutions with a trainable function, allowing deeper models to be learned.
- The hyperparameters of the bounded activation functions can be made trainable.
- Hyperparameter optimization is a viable approach to deal with the hyperparameter overfitting issue.
- Combining model and data parallelism is a promising approach to achieve further training speedup.
- The learning algorithm can be mapped onto larger scale computing platforms to expand its DL capability for big data applications.
Publications
Journals
1. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. An Optimized Second Order Stochastic Learning Algorithm for Neural Network Training. Neurocomputing. 2016. vol. 186, 74-89. (ISI, IF 2.083 (Q2)).
2. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. Bounded Activation Functions for Enhanced Training Stability of Deep Neural Networks on Visual Pattern Recognition Problems. Neurocomputing. 2016. (ISI, IF 2.083 (Q2)). Under revision.
3. Liew, S. S., Khalil-Hani, M., Syafeeza, A. and Bakhteri, R. Gender Classification: A Convolutional Neural Network Approach. Turk. J. Elec. Engin. 2016. vol. 24. 1248-1264. (ISI, IF 0.407 (Q4)).
Conferences
4. Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Distributed Learning on Multi-Core Platform for Neural Network in Visual Pattern Recognition. 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2016). 2016. (Scopus). In review process.
5. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. Distributed B-SDLM: Accelerating the Training Convergence of Deep Neural Networks through Parallelism. 14th Pacific Rim International Conference on Artificial Intelligence (PRICAI 2016). 2016. (Scopus).
6. Khalil-Hani, M., Liew, S. S. and Bakhteri, R. An Optimized Second Order Stochastic Learning Algorithm for Neural Network Training. In: Arik, S., Huang, T., Lai, W. K. and Liu, Q., eds. Neural Information Processing. Springer International Publishing. 2015, Lecture Notes in Computer Science, vol. 9489. ISBN 978-3-319-26531-5. 38-45. (Scopus).
7. Khalil-Hani, M. and Liew, S. S. A-SDLM: An Asynchronous Stochastic Learning Algorithm for Fast Distributed Learning. In: Javadi, B. and Garg, S., eds. 13th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2015). Sydney, Australia: ACS. 2015, CRPIT, vol. 163. 75-84. (Scopus).
8. Khalil-Hani, M. and Liew, S. S. A Convolutional Neural Network Approach for Face Verification. High Performance Computing & Simulation (HPCS), 2014 International Conference on. 2014. 707-714. (Scopus).
Publications (Cont’d)
Others
9. Syafeeza, A. R., Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Convolutional Neural Networks with Fused Layers Applied to Face Recognition. International Journal of Computational Intelligence and Applications, 2015. 14(03): 1550014. (Scopus).
10. Syafeeza, A., Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Convolutional Neural Network for Face Recognition with Pose and Illumination Variation. International Journal of Engineering and Technology, 2014. 6(1): 44-57. ISSN 0975-4024. (Scopus).