THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS

TECHNICAL REPORT OF IEICE

Combination of LSTM and CNN for recognizing mathematical symbols

HAI Nguyen Dai†, ANH Le Duc‡ and Masaki NAKAGAWA*

Department of Computer and Information Science, Tokyo University of Agriculture and Technology
2-24-16 Naka-cho, Koganei-shi, Tokyo 184-8588

E-mail: †[email protected], ‡[email protected], *[email protected]

Abstract  Combining classifiers is an approach that has been shown to be useful on numerous occasions when striving for further improvement over the performance of individual classifiers. In this paper we present a combination of CNN and LSTM, which are both sophisticated neural networks and good classifiers for offline and online handwritten mathematical symbol recognition, respectively. In addition, we employ the dropout technique and gradient-based local features to improve the accuracy of the CNN and the LSTM, respectively. The best combination ensemble has a recognition rate higher than the rate achieved by the best individual classifier on the MathBrush database.
Keywords  LSTM, directional features, CNN, dropout, linear combination

1. Introduction

Online mathematical symbol recognition is an essential component of any pen-based mathematical formula recognition system. With the increasing availability of touch-based or pen-based devices such as smart phones, tablet PCs and smart boards, interest in this area has been growing. However, existing systems are still far from perfect because of challenges that arise from the two-dimensional nature of mathematical input and the large symbol set with many similar-looking symbols. In this work we address only the problem of mathematical symbol recognition.

Long short-term memory (LSTM) and conventional recurrent neural networks (RNN) have been successfully applied to sequence prediction and sequence labeling tasks. For online handwriting recognition, bidirectional LSTM networks with a connectionist temporal classification (CTC) output layer, using forward and backward passes of the algorithm, have been shown to outperform state-of-the-art HMM-based systems [3]. Recently, Álvaro et al. proposed a set of hybrid features that combine both on-line and off-line information, using either HMM or LSTM for classification [5]. The symbol recognition rate achieved by LSTM using raw images as local off-line features significantly outperformed HMM. However, the hybrid features in combination with LSTM did not produce as great an improvement as the same features did with HMM. In section 3 we describe how to use directional (or gradient) features as local off-line features following the method of Kawamura et al. [4] and show that the recognition rate is improved by adding these gradient features when combined with LSTM.

For off-line handwriting recognition, the convolutional neural network (CNN) has been proven a state-of-the-art method. CNNs combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial or temporal sub-sampling [2]. The network is usually trained like a standard neural network by back-propagation. Against over-fitting, we use two approaches: (1) enlarging the training dataset by linear transformations such as translation, rotation or shearing, and (2) the dropout technique introduced in [1]. Rectified Linear Units (ReLU) have been highly successful for computer vision tasks and have proved faster to train than other standard units such as sigmoid or tanh. ReLUs are thus a good choice to combine with dropout for the MathBrush database.

The combination of multiple classifiers has been shown to be suitable for improving recognition performance in difficult classification problems [6-7]. Classifier combination has also been applied to handwriting recognition. In this work, we simply deploy a linear combination of CNN and LSTM. Our experiments show that the best combination ensemble achieves a recognition rate roughly 1% higher than the rate achieved by the best individual classifier on the MathBrush database.

The rest of this paper is organized as follows. Section 2 gives an overview of the recognition system. Section 3 describes the online features and the local gradient features used with the LSTM classifier for on-line handwritten mathematical symbol recognition. Section 4 details the architecture of the CNN and the dropout technique. We detail the combination of LSTM and CNN in section 5. Section 6 reports our experimental results and section 7 provides the conclusion.

2. System overview

The online mathematical symbol recognition system is depicted in Fig. 1. It consists of two pre-processing steps (trajectory smoothing and pattern normalization), feature extraction for LSTM, two neural networks (CNN and LSTM), and a final linear combiner. Pre-processing regulates the pattern shape to reduce within-class shape variation. The linear combiner combines the classification results of both CNN and LSTM to improve the accuracy rate.

Figure 1. Diagram of the online mathematical symbol recognition system: input pattern, trajectory smoothing, pattern normalization, context feature extraction, CNN and LSTM, linear combination, output.

The input pattern, composed of the coordinates of sampled pen-down points, is smoothed by replacing the coordinates of each point with a weighted average of the current point and its two neighboring points.

In normalization, the coordinates are transformed such that the pattern size is standardized and the shape is regulated. We adopt a global curve-fitting-based normalization method, named bi-moment normalization [8].

For feature extraction, the local histogram of stroke direction (the direction feature) has been proven very effective in character recognition. We use the directional decomposition technique of Kawamura et al. [10] to extract directional features from normalized patterns as the off-line part of the hybrid feature for LSTM.

We build two neural networks, a CNN and an LSTM, as individual classifiers for recognizing an off-line pattern and an on-line pattern, respectively. One effective approach to improving the performance of handwriting recognition is to combine multiple classifiers. In our case, we take advantage of both methods: the LSTM for on-line recognition and the CNN for off-line recognition. We tried a linear combination of both neural networks for simplicity instead of using more sophisticated combination schemes.
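As a concrete illustration of the trajectory-smoothing step described in this section, the sketch below replaces each interior point with a weighted average of the point itself and its two neighbours. The weights (0.25, 0.5, 0.25) are an assumption for illustration; the paper does not state the exact values.

```python
import numpy as np

def smooth_trajectory(points, w_prev=0.25, w_curr=0.5, w_next=0.25):
    """Replace each interior point by a weighted average of the current
    point and its two neighbours; the first and last points are kept.
    `points` is an (N, 2) array of sampled pen-down coordinates."""
    pts = np.asarray(points, dtype=float)
    smoothed = pts.copy()
    for i in range(1, len(pts) - 1):
        smoothed[i] = w_prev * pts[i - 1] + w_curr * pts[i] + w_next * pts[i + 1]
    return smoothed

# Example: a short zig-zag stroke becomes slightly straighter.
stroke = np.array([[0, 0], [1, 2], [2, 0], [3, 2], [4, 0]], dtype=float)
print(smooth_trajectory(stroke))
```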

3. LSTM and gradient features

3.1. Overview of LSTM

This section outlines the principle of the LSTM RNN that is used for the online symbol classification task and combined with the CNN in section 5, as well as our proposed local gradient features for improving the accuracy of the LSTM.

An LSTM layer is composed of recurrently connected memory blocks, each of which contains one or more memory cells, along with three multiplicative "gate" units: the input, output, and forget gates. The gates perform functions analogous to reading, writing, and resetting operations. More specifically, the cell input is multiplied by the activation of the input gate, the cell output by that of the output gate, and the previous cell values by the forget gate (see Fig. 2). The overall effect is to allow the network to store and retrieve information over long periods of time. For example, as long as the input gate remains closed, the activation of the cell will not be overwritten by new inputs and can therefore be made available to the net much later in the sequence by opening the output gate.

Figure 2. LSTM memory block consisting of one memory cell: the input, output, and forget gates collect activations from inside and outside the block and control the cell through multiplicative units (depicted as small circles).
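To make the gating behaviour described above concrete, the following minimal NumPy sketch performs one step of a single LSTM memory block using the standard LSTM equations. It is an illustration only, not the authors' implementation, and all sizes are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a single LSTM memory block with hidden size H.
    W (4H x D), U (4H x H) and b (4H,) stack the parameters of the
    input gate, forget gate, output gate and cell input."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * H:1 * H])   # input gate: gates the cell input
    f = sigmoid(z[1 * H:2 * H])   # forget gate: gates the previous cell value
    o = sigmoid(z[2 * H:3 * H])   # output gate: gates the cell output
    g = np.tanh(z[3 * H:4 * H])   # candidate cell input
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # block output
    return h, c

# Tiny usage example with random parameters (input size 6, hidden size 4).
rng = np.random.default_rng(0)
D, H = 6, 4
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```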

Another problem with standard RNNs is that they have access to past but not to future context. This can be overcome by using bidirectional RNNs [10], where two separate recurrent hidden layers scan the input sequence in opposite directions. The two hidden layers are connected to the same output layer, which therefore has access to context information in both directions. The amount of context information that the network actually uses is learned during training and does not have to be specified beforehand. Fig. 3 shows the structure of a simple bidirectional network.

Figure 3. Structure of a bidirectional network with input i, output o, and two hidden layers for forward and backward processing.

3.2. Feature extraction for LSTM

We extract online features and low-level context information around each point to train the BLSTM. This approach was presented by Álvaro et al. [5]. They used raw images centered at each point to represent the context information. However, the recognition rate was not improved when these were used with LSTM together with online features, since the classifier may not process raw-image features efficiently. In this work, we employ the gradient feature, which has performed efficiently in character recognition. First, we linearly normalize each online character pattern to a standard size (64x64). Then, for each point p = (x, y), we use 6 time-based features:
- end-point flag: 1 if the point is an end point, 0 otherwise
- normalized coordinates: (x, y)
- normalized angle: (sin θ, cos θ)
- distance between points i and i+1

In order to combine these online features with context information around each point, we employ the gradient direction feature as context features.

Regarding gradient features: for each point p = (x, y), we take a context window centered at p. From this context window, gradient direction features are decomposed into components in 8 chain-code directions (depicted in Figure 4). We partition the context window into 9 sub-windows of equal size 5x5, calculate the value for each block using a Gaussian blurring mask of size 10x10, and use PCA to reduce the dimensionality to 10 dimensions.

Figure 4. Extracting gradient features: 9 features for each of the 8 directions (Direction 1, Direction 2, ..., Direction 8), giving 72 features in total.

3.3. Details of learning

In this work, we use the BLSTM architecture. The size of the input layer is 6 for online features and 16 for the combination of online and gradient features. We use 2 hidden BLSTM layers with 32 and 128 memory blocks per layer. The output layer is a softmax layer whose size is 93, the number of mathematical symbol classes. We initialized the weights of the network from a zero-mean Gaussian distribution with standard deviation 0.1. We trained our network using online stochastic gradient descent with a learning rate of 0.0001 and a momentum of 0.9. Training stops when the error rate has not improved for 20 epochs.
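As a hedged illustration of sections 3.2 and 3.3, the sketch below assembles the 6 per-point on-line features and shows where the PCA-reduced gradient context features (10 dimensions) would be appended to form the 16-dimensional BLSTM input. The gradient-feature part is only stubbed, since reproducing the 8-direction decomposition would require the full procedure of Kawamura et al.; the function names and details here are ours, not the paper's.

```python
import numpy as np

def online_features(points, end_point_flags):
    """Per-point 6-dim on-line features: end-point flag, normalized (x, y),
    (sin, cos) of the direction to the next point, and the distance to it.
    `points` is an (N, 2) array already normalized to the 64x64 box."""
    pts = np.asarray(points, dtype=float)
    feats = np.zeros((len(pts), 6))
    for i in range(len(pts)):
        x, y = pts[i]
        nxt = pts[min(i + 1, len(pts) - 1)]
        dx, dy = nxt[0] - x, nxt[1] - y
        dist = float(np.sqrt(dx * dx + dy * dy))
        sin_a = dy / dist if dist > 0 else 0.0
        cos_a = dx / dist if dist > 0 else 0.0
        feats[i] = [end_point_flags[i], x, y, sin_a, cos_a, dist]
    return feats

def context_gradient_features(points):
    """Stub for the 8-direction gradient decomposition around each point
    (9 sub-windows of 5x5 -> 72 values, PCA-reduced to 10 dimensions)."""
    return np.zeros((len(points), 10))  # placeholder only

# BLSTM input: 6 dims (on-line only) or 16 dims (on-line + gradient context).
pts = np.array([[10, 10], [20, 30], [40, 30]], dtype=float)
flags = [0, 0, 1]
x_seq = np.concatenate([online_features(pts, flags),
                        context_gradient_features(pts)], axis=1)
print(x_seq.shape)  # (3, 16)
```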

4. Architecture of CNN

The architecture of our CNN is summarized in Figure 5. It contains 9 learned layers: 6 convolution plus sub-sampling layers, 2 fully-connected layers and, finally, 1 softmax layer.

Figure 5. Architecture of the CNN.

4.1. ReLU Nonlinearity

Following Nair and Hinton [9], we refer to neurons with Rectified Linear Units (ReLU), defined as f(x) = max(x, 0). The advantages of using ReLU in neural networks are: (1) used as the activation function, the hard max induces sparsity in the hidden units; (2) ReLU does not suffer from the vanishing gradient problem as the sigmoid and tanh functions do; (3) it has been shown that deep networks can be trained efficiently using ReLU even without pre-training.

4.2. Dropout

The recently introduced technique called "dropout" [1] consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are dropped in this way do not contribute to the forward pass and do not participate in back-propagation. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. At test time, we use all neurons but halve their outgoing weights, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks. We use dropout in the first two fully-connected layers, as shown in Figure 6.

Figure 6. Dropout applied to the first two fully-connected layers with a drop rate of 0.5; dropped nodes do not contribute to the forward and backward passes.
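The dropout behaviour of section 4.2 can be sketched as follows: during training each unit's output is zeroed with probability 0.5, and at test time all units are kept while the activations are scaled by the keep probability (equivalent to halving the outgoing weights). This is the standard formulation, not code from the paper.

```python
import numpy as np

def dropout_layer(activations, drop_rate=0.5, training=True, rng=None):
    """Training: zero each unit's output with probability `drop_rate`.
    Test: keep all units and scale by (1 - drop_rate), equivalent to
    halving the outgoing weights when drop_rate is 0.5."""
    rng = rng if rng is not None else np.random.default_rng()
    if training:
        keep_mask = rng.random(activations.shape) >= drop_rate
        return activations * keep_mask
    return activations * (1.0 - drop_rate)

# Example on a fake fully-connected layer output of 8 units.
a = np.ones(8)
print(dropout_layer(a, training=True, rng=np.random.default_rng(0)))
print(dropout_layer(a, training=False))   # 0.5 everywhere at test time
```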

4.3. Details of learning

We trained our models using stochastic gradient descent with a batch size of 64 samples and a momentum of 0.95. The update rule for weight w was:

m_{i+1} = 0.95 · m_i − 0.0005 · ε · w_i − ε · Grad_i
w_{i+1} = w_i + m_{i+1}

where i is the iteration index, m is the momentum variable, ε is the learning rate, and Grad_i is the average over the i-th batch of the derivative of the objective with respect to w, evaluated at w_i.

We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01 and the biases in each layer with the constant 0. We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic we followed was to multiply the learning rate by 0.5 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01. We trained the network for up to 100 epochs and stopped when the error rate did not improve for 20 epochs, which took 2 to 3 hours on an NVIDIA Quadro K600 4GB GPU.
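The update rule at the start of this subsection translates directly into code. The sketch below mirrors the two formulas (momentum 0.95, weight decay 0.0005, learning rate ε) on plain NumPy arrays; it is illustrative only.

```python
import numpy as np

def sgd_momentum_step(w, m, grad, lr=0.01, momentum=0.95, weight_decay=0.0005):
    """One update of the rule above:
       m_{i+1} = momentum * m_i - weight_decay * lr * w_i - lr * grad_i
       w_{i+1} = w_i + m_{i+1}
    `grad` is the batch-averaged gradient of the objective w.r.t. w."""
    m_next = momentum * m - weight_decay * lr * w - lr * grad
    w_next = w + m_next
    return w_next, m_next

# Toy usage on a 3-parameter weight vector.
w, m = np.array([0.5, -0.2, 0.1]), np.zeros(3)
grad = np.array([0.1, 0.0, -0.3])
w, m = sgd_momentum_step(w, m, grad)
print(w, m)
```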

5. Combination of CNN and LSTM

In a mathematical symbol database there are many symbols with similar shapes, such as 'p' and 'ρ', or 'B' and 'β', which confuse off-line classifiers. Fortunately, these similar symbols can be recognized more easily by on-line classifiers because of their different stroke orders and directions. On the other hand, some symbols, such as '0' and '6' or 'l' and 'e', are difficult for on-line classifiers due to their identical stroke directions. In this case, off-line classification is more appropriate.

Combining multiple individual classifiers has proven to be a suitable way to improve the recognition rate for difficult classification problems. There are many sophisticated combination methods, but in this work we deploy a linear combination of CNN and LSTM for simplicity, with the following formulation:

Score_Combination = α · Score_CNN + (1 − α) · Score_LSTM

where α is the weighting parameter of the combination. This parameter is estimated by an experiment on the validation set.
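The linear combination above is straightforward once each classifier outputs a per-class score vector (e.g. softmax probabilities). The sketch below combines the two score vectors and estimates α by a simple grid search on a validation set; the grid step and function names are our own illustrative choices, not specified in the paper.

```python
import numpy as np

def combine_scores(score_cnn, score_lstm, alpha):
    """Score_Combination = alpha * Score_CNN + (1 - alpha) * Score_LSTM."""
    return alpha * np.asarray(score_cnn) + (1.0 - alpha) * np.asarray(score_lstm)

def choose_alpha(val_cnn, val_lstm, val_labels, step=0.05):
    """Pick the alpha that maximizes validation accuracy.
    val_cnn / val_lstm: (N, num_classes) score matrices, val_labels: (N,)."""
    best_alpha, best_acc = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0 + 1e-9, step):
        pred = np.argmax(combine_scores(val_cnn, val_lstm, alpha), axis=1)
        acc = float(np.mean(pred == val_labels))
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha

# Example with random scores for 5 validation samples and 93 classes.
rng = np.random.default_rng(0)
cnn, lstm = rng.random((5, 93)), rng.random((5, 93))
labels = rng.integers(0, 93, size=5)
print(choose_alpha(cnn, lstm, labels))
```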

6. Experiment and results

6.1. Dataset

We use MathBrush as the database for our experiments. The database contains 4,654 annotated mathematical expressions written by 20 different writers. The number of mathematical symbols is 26K and they are distributed over 100 different classes. As previous approaches have also used this database, we followed the same experimental setup described in [5] in order to obtain comparable results, so we discarded 6 symbol classes (≤, ≠,

Table 1. Experimental results of CNN and LSTM on 5 trials (error rate, %).
Trial | CNN | CNN + dropout | LSTM + online features | LSTM + online + local gradient features
MB1 | 13.19 | 11.06 | 11.3 | 11.0
MB2 | 13.31 | 11.18 | 11.4 | 10.5
MB3 | 12.32 | 10.07 | 10.9 | 9.95
MB4 | 12.48 | 10.74 | 11.1 | 9.83
MB5 | 12.05 | 11.41 | 10.9 | 10.5
Avg | 12.67 | 11.01 | 11.12 | 10.3

Regarding the dropout technique, it provided a great improvement in the CNN, reducing the average symbol recognition error rate from 12.67% to 11.01% (approximately 1.5 points). This shows that dropout improves the performance of the CNN on the handwritten mathematical symbol recognition task.

Regarding the local gradient features, they provided a significant improvement in the LSTM compared to using only on-line features (error rate down from 11.12% to 10.3%). We therefore found that our proposed local gradient features complement the on-line features better than the contextual features proposed in [5].