THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE
Combination of LSTM and CNN for recognizing mathematical symbols

HAI Nguyen Dai†, ANH Le Duc‡, and Masaki NAKAGAWA*

Department of Computer and Information Science, Tokyo University of Agriculture and Technology
2-24-16 Naka-cho, Koganei-shi, Tokyo 184-8588
E-mail: [email protected], [email protected], [email protected]
Abstract  Combining classifiers is an approach that has been shown to be useful on numerous occasions when striving for improvement over the performance of individual classifiers. In this paper, we present a combination of CNN and LSTM, two sophisticated neural networks that serve as good classifiers for offline and online handwritten mathematical symbol recognition, respectively. In addition, we employ the dropout technique to improve the accuracy of the CNN and gradient-based local features to improve that of the LSTM. The best combination ensemble achieves a recognition rate higher than that of the best individual classifier on the MathBrush database.
Keywords  LSTM, directional features, CNN, dropout, linear combination
1. Introduction

Online mathematical symbol recognition is an essential component of any pen-based mathematical formula recognition system. With the increasing availability of touch-based and pen-based devices such as smartphones, tablet PCs, and smart boards, interest in this area has been growing. However, existing systems are still far from perfect because of challenges that arise from the two-dimensional nature of mathematical input and the large symbol set with many similar-looking symbols. In this work, we address only the problem of mathematical symbol recognition.

Long short-term memory (LSTM) and conventional recurrent neural networks (RNNs) have been successfully applied to sequence prediction and sequence labeling tasks. For online handwriting recognition, bidirectional LSTM networks with a connectionist temporal classification (CTC) output layer, trained with forward-backward style algorithms, have been shown to outperform state-of-the-art HMM-based systems [3]. Recently, Álvaro et al. proposed a set of hybrid features that combine both online and offline information, using either an HMM or an LSTM for classification [5]. The symbol recognition rate achieved using raw images as local offline features with LSTM significantly outperformed the HMM. However, the hybrid features in combination with LSTM did not produce as great an improvement as the same features did with the HMM. In section 3, we describe the use of directional (or gradient) features as local offline features, following the method of Kawamura et al. [4], and show that the recognition rate of the LSTM is improved by adding these gradient features.

For offline handwriting recognition, convolutional neural networks (CNNs) have been proven a state-of-the-art method. CNNs combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial or temporal sub-sampling [2]. The network is usually trained like a standard neural network by back-propagation. Against over-fitting, we use two measures: (1) enlarging the training dataset with linear transformations such as translation, rotation, and shear, and (2) the dropout technique introduced in [1]. Rectified linear units (ReLUs) have been highly successful for computer vision tasks and have proved faster to train than standard units such as sigmoid or tanh. ReLUs are thus a good choice to combine with dropout on the MathBrush database.
The combination of multiple classifiers has been shown to be suitable for improving recognition performance in difficult classification problems [6-7]. Classifier combination has also been applied in handwriting recognition. In this work, we deploy a simple linear combination of CNN and LSTM. Our experiments show that the best combination ensemble achieves a recognition rate about 1% higher than that of the best individual classifier on the MathBrush database.

The rest of this paper is organized as follows. Section 2 gives an overview of the recognition system. Section 3 describes the hybrid feature, combining online features and local gradient features, used with the LSTM classifier for online handwritten mathematical symbol recognition. Section 4 details the architecture of the CNN and the dropout technique. We detail the combination of LSTM and CNN in section 5. Section 6 reports our experimental results, and section 7 provides the conclusion.

2. System overview

The online mathematical symbol recognition system is depicted in Fig. 1. It consists of 2 pre-processing steps (trajectory smoothing and pattern normalization), feature extraction for the LSTM, 2 neural networks (CNN and LSTM), and a final linear combiner. Pre-processing regulates the pattern shape to reduce the within-class shape variation. The linear combiner combines the classification results of both the CNN and the LSTM to improve the accuracy.
In trajectory smoothing, the input pattern, composed of the coordinates of sampled pen-down points, is smoothed by replacing the coordinates of each point with a weighted average of the current point and its two neighboring points.
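As an illustration, the following Python sketch performs this smoothing; the exact 1/4-1/2-1/4 weighting is an assumption, since the weights are not specified above.

    import numpy as np

    def smooth_trajectory(points, w=(0.25, 0.5, 0.25)):
        """Replace each pen-down point by a weighted average of itself
        and its two neighbors; the two end points are kept unchanged.
        points: (N, 2) array of (x, y) coordinates of one stroke."""
        pts = np.asarray(points, dtype=float)
        out = pts.copy()
        for i in range(1, len(pts) - 1):
            out[i] = w[0] * pts[i - 1] + w[1] * pts[i] + w[2] * pts[i + 1]
        return out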
In pattern normalization, the coordinates are transformed such that the pattern size is standardized and the shape is regulated. We adopt a global, curve-fitting-based normalization method named bi-moment normalization [8].
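A simplified Python sketch of bi-moment-style size normalization follows; it reflects our reading of [8] (one-sided second moments around the centroid estimate the pattern bounds), and the factor of 2 on the one-sided deviations is an assumed conventional choice, not a detail stated above.

    import numpy as np

    def bimoment_normalize(points, size=64.0):
        """Simplified bi-moment size normalization: pattern bounds are
        estimated from one-sided second moments around the centroid and
        mapped linearly onto a standard square of the given size."""
        pts = np.asarray(points, dtype=float)
        out = np.empty_like(pts)
        for d in range(2):                          # x axis, then y axis
            v, c = pts[:, d], pts[:, d].mean()      # coordinates, centroid
            lo, hi = v[v < c], v[v >= c]
            d_minus = np.sqrt(((lo - c) ** 2).mean()) if len(lo) else 1.0
            d_plus = np.sqrt(((hi - c) ** 2).mean()) if len(hi) else 1.0
            a, b = c - 2.0 * d_minus, c + 2.0 * d_plus   # estimated bounds
            out[:, d] = (v - a) / (b - a) * size
        return np.clip(out, 0.0, size)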
For feature extraction, the local histogram of stroke direction (the direction feature) has proven very effective in character recognition. We use the directional decomposition technique of Kawamura et al. [10] to extract directional features from normalized patterns as offline features for the LSTM.

We build 2 neural networks, a CNN and an LSTM, as individual classifiers for recognizing an offline pattern and an online pattern, respectively. One effective approach to improving the performance of handwriting recognition is to combine multiple classifiers. In our case, we take advantage of both methods: one is the LSTM for online recognition, and the other is the CNN for offline recognition. We tried a linear combination of both neural networks for simplicity instead of using any sophisticated combination scheme.

Figure 1. Diagram of the online mathematical symbol recognition system.
3. LSTM and gradient features

3.1. Overview of LSTM

This section outlines the principle of the LSTM RNN, which is used for the online symbol classification task and combined with the CNN in section 5, as well as our proposed local gradient features for improving the accuracy of the LSTM.

An LSTM layer is composed of recurrently connected memory blocks, each of which contains one or more memory cells along with three multiplicative "gate" units: the input, output, and forget gates. The gates perform functions analogous to reading, writing, and resetting operations. More specifically, the cell input is multiplied by the activation of the input gate, the cell output by that of the output gate, and the previous cell values by the forget gate (see Fig. 2). The overall effect is to allow the network to store and retrieve information over long periods of time. For example, as long as the input gate remains closed, the activation of the cell will not be overwritten by new inputs and can therefore be made available to the network much later in the sequence by opening the output gate.

Figure 2. LSTM memory block consisting of one memory cell: the input, output, and forget gates collect activations from inside and outside the block and control the cell through multiplicative units (depicted as small circles).
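For concreteness, the gating just described corresponds to the standard LSTM cell equations (peephole connections, present in some variants, are omitted):

    i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i)                       (input gate)
    f_t = σ(W_xf·x_t + W_hf·h_{t-1} + b_f)                       (forget gate)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc·x_t + W_hc·h_{t-1} + b_c)
    o_t = σ(W_xo·x_t + W_ho·h_{t-1} + b_o)                       (output gate)
    h_t = o_t ⊙ tanh(c_t)

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication.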
Another problem with standard RNNs is that they have access to past but not to future context. This can be overcome by using bidirectional RNNs [10], in which two separate recurrent hidden layers scan the input sequence in opposite directions. The two hidden layers are connected to the same output layer, which therefore has access to context information in both directions. The amount of context information that the network actually uses is learned during training and does not have to be specified beforehand. Fig. 3 shows the structure of a simple bidirectional network.

Figure 3. Structure of a bidirectional network with input i, output o, and two hidden layers for forward and backward processing.

3.2. Feature extraction for LSTM

We extract online features and low-level context information around each point to train the BLSTM. This approach was presented by Álvaro et al. [5], who used raw images centered at each point to represent the context information. However, the recognition rate was not improved when those raw images were used together with LSTM and online features, since the classifier may not process raw image features efficiently. In this work, we instead employ the gradient feature, which has performed efficiently in character recognition. First, we linearly normalize each online character pattern to a standard size (64x64). Then, for each point p = (x, y), we use the following 6 time-based features, illustrated in the sketch below:
- end-point flag: 1 if the point is an end point, 0 otherwise
- normalized coordinates: (x, y)
- normalized writing direction: (sin θ, cos θ)
- distance between points i and i+1
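A Python sketch of these 6 time-based features follows; conventions not fixed above, such as treating the first and last points of a stroke as end points, normalizing by the pattern size, and taking the angle between consecutive points, are our assumptions.

    import numpy as np

    def online_features(points):
        """Per-point online features for one stroke already scaled to
        64x64: end-point flag, normalized (x, y), (sin, cos) of the
        writing direction, and distance to the next point -- 6 values."""
        pts = np.asarray(points, dtype=float)
        n = len(pts)
        feats = np.zeros((n, 6))
        for i in range(n):
            dx, dy = pts[min(i + 1, n - 1)] - pts[i]
            dist = np.hypot(dx, dy)
            feats[i, 0] = 1.0 if i in (0, n - 1) else 0.0  # end-point flag
            feats[i, 1:3] = pts[i] / 64.0                  # normalized coords
            feats[i, 3:5] = (dy / dist, dx / dist) if dist > 0 else (0.0, 0.0)
            feats[i, 5] = dist / 64.0                      # distance to next
        return feats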
In order to combine these online features with context information around each point, we employ the gradient direction feature as the context feature. For each point p = (x, y), we take a context window centered at p. Within the context window, gradient direction features are decomposed into components in 8 chain-code directions (depicted in Figure 4). We partition the context window into 9 sub-windows of equal size 5x5 and calculate the value of each block using a Gaussian blurring mask of size 10x10, yielding 8 directions x 9 sub-windows = 72 features. We then use PCA to reduce these 72 dimensions to 10 dimensions.

Figure 4. Extracting gradient features: 9 features for each of directions 1 to 8, giving 72 features in total.
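The following simplified Python sketch illustrates the extraction; for brevity it bins each gradient vector into its single nearest chain-code direction and uses plain sub-window averaging instead of the 10x10 Gaussian blurring mask, so it approximates rather than faithfully reimplements the procedure above.

    import numpy as np

    def gradient_context_features(image, cx, cy, win=15, sub=5):
        """Simplified 8-direction gradient features for the win x win
        (here 15x15) context window centered at (cx, cy) of a 64x64
        pattern image, pooled over 3x3 sub-windows of 5x5 pixels each,
        giving 8 x 9 = 72 values."""
        gy, gx = np.gradient(image.astype(float))
        half = win // 2
        planes = np.zeros((8, win, win))
        for r in range(win):
            for c in range(win):
                y, x = cy - half + r, cx - half + c
                if 0 <= y < image.shape[0] and 0 <= x < image.shape[1]:
                    mag = np.hypot(gx[y, x], gy[y, x])
                    if mag > 0:  # bin into nearest of 8 chain-code directions
                        d = int(round(np.arctan2(gy[y, x], gx[y, x])
                                      / (np.pi / 4))) % 8
                        planes[d, r, c] = mag
        return np.array([planes[d, i*sub:(i+1)*sub, j*sub:(j+1)*sub].mean()
                         for d in range(8) for i in range(3) for j in range(3)])

The resulting 72-dimensional vectors would then be projected down to 10 dimensions with a PCA fitted on the training set (e.g., sklearn.decomposition.PCA(n_components=10)).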
3.3. Details of learning

In this work, we use the BLSTM architecture. The size of the input layer is 6 for online features alone and 16 for the combination of online and gradient features. We use 2 hidden BLSTM layers with 32 and 128 memory blocks per layer. The output layer is a softmax layer whose size is 93, the number of mathematical symbol classes. We initialized the weights of the network from a zero-mean Gaussian distribution with standard deviation 0.1. We trained the network using online stochastic gradient descent with a learning rate of 0.0001 and a momentum of 0.9. Training is stopped when the error rate has not improved for 20 epochs.
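A minimal PyTorch sketch of this configuration is given below; the toolkit, the interpretation of the 32 and 128 memory blocks as per-direction layer sizes, and the use of the final time step for classification are our assumptions.

    import torch
    import torch.nn as nn

    class SymbolBLSTM(nn.Module):
        """2 stacked bidirectional LSTM layers (32 and 128 units per
        direction) followed by a 93-way softmax over symbol classes."""
        def __init__(self, n_in=16, n_classes=93):
            super().__init__()
            self.lstm1 = nn.LSTM(n_in, 32, bidirectional=True, batch_first=True)
            self.lstm2 = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(256, n_classes)

        def forward(self, x):            # x: (batch, time, n_in)
            h, _ = self.lstm1(x)
            h, _ = self.lstm2(h)
            return self.fc(h[:, -1])     # logits; softmax applied in the loss

Training then uses torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9) with the early stopping described above.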
4. Architecture of CNN

The architecture of our CNN is summarized in Figure 5. It contains 9 learned layers: 6 convolution+sub-sampling layers, 2 fully-connected layers, and a final softmax layer.

Figure 5. Architecture of CNN.
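Since filter sizes and channel counts are given only in Figure 5, the following PyTorch sketch merely illustrates the overall shape (6 convolution+sub-sampling layers, 2 fully-connected layers with dropout, and a final 93-way softmax); all layer widths are assumptions.

    import torch.nn as nn

    def conv_block(c_in, c_out, pool):
        layers = [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU()]
        if pool:
            layers.append(nn.MaxPool2d(2))   # sub-sampling
        return layers

    cnn = nn.Sequential(
        # 6 convolution + sub-sampling layers (channel counts assumed)
        *conv_block(1, 32, False), *conv_block(32, 32, True),
        *conv_block(32, 64, False), *conv_block(64, 64, True),
        *conv_block(64, 128, False), *conv_block(128, 128, True),
        nn.Flatten(),
        # 2 fully-connected layers with dropout (see section 4.2)
        nn.Linear(128 * 8 * 8, 512), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(512, 93),                  # softmax applied in the loss
    )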
4.1. ReLU nonlinearity

Following Nair and Hinton [9], we use neurons with rectified linear units (ReLUs), defined as f(x) = max(x, 0). The advantages of using ReLUs in neural networks are: (1) used as the activation function, the hard max induces sparsity in the hidden units; (2) ReLUs do not suffer from the gradient vanishing problem as sigmoid and tanh functions do; and (3) it has been shown that deep networks can be trained efficiently using ReLUs even without pre-training.

4.2. Dropout

The recently introduced "dropout" technique [1] consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are dropped in this way do not contribute to the forward pass and do not participate in back-propagation. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. At test time, we use all neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks. We use dropout in the first two fully-connected layers, as in Figure 6.

Figure 6. Dropout in the first two fully-connected layers with a drop rate of 0.5; dropped nodes do not contribute to the forward and backward passes.
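A minimal NumPy sketch of the two phases for one fully-connected layer follows; note that modern libraries usually apply the equivalent "inverted dropout", which rescales at training time instead.

    import numpy as np

    def fc_dropout_train(x, W, b, p=0.5, rng=None):
        """Training pass: each ReLU unit is zeroed with probability p and
        then contributes to neither the forward nor the backward pass."""
        rng = rng or np.random.default_rng()
        h = np.maximum(W @ x + b, 0.0)
        return h * (rng.random(h.shape) >= p)   # keep with probability 1-p

    def fc_dropout_test(x, W, b, p=0.5):
        """Test pass: all units are kept, but their outputs are scaled by
        (1 - p), approximating the geometric mean over the exponentially
        many dropout sub-networks."""
        return np.maximum(W @ x + b, 0.0) * (1.0 - p)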
4.3. Details of learning

We trained our model using stochastic gradient descent with a batch size of 64 samples and a momentum of 0.95. The update rule for a weight w was:

    m_{i+1} = 0.95 · m_i − 0.0005 · ε · w_i − ε · Grad_i
    w_{i+1} = w_i + m_{i+1}

where i is the iteration index, m is the momentum variable, ε is the learning rate, and Grad_i is the average over the i-th batch of the derivative of the objective with respect to w, evaluated at w_i.

We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01 and the biases in each layer with the constant 0. We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic we followed was to multiply the learning rate by 0.5 when the validation error rate stopped improving at the current learning rate. The learning rate was initialized at 0.01. We trained the network for up to 100 epochs, stopping when the error rate did not improve for 20 epochs, which took 2 to 3 hours on an NVIDIA Quadro K600 4GB GPU.
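Written out as code, one update step is (a minimal sketch; the function name is ours, and the 0.0005·ε·w term is the weight-decay contribution):

    def sgd_momentum_step(w, m, grad, lr, momentum=0.95, decay=0.0005):
        """One update from section 4.3: m is the momentum variable and
        grad the batch-averaged gradient of the objective at the current w."""
        m = momentum * m - decay * lr * w - lr * grad
        return w + m, m

Here lr starts at 0.01 and is halved whenever the validation error stops improving.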
5. Combination of CNN and LSTM

In a mathematical symbol database, there are many symbols with similar shapes, such as 'p' and 'ρ', or 'B' and 'β', which confuse offline classifiers. Fortunately, these similar symbols can be more easily recognized by online classifiers because of their different stroke orders and directions. On the other hand, some symbols, such as '0' and '6' or 'l' and 'e', are difficult for online classifiers due to their identical stroke directions. In such cases, offline classification is more appropriate.

Combining multiple individual classifiers has proven to be a suitable way to improve the recognition rate in difficult classification problems. There are many sophisticated combination methods, but in this work we deploy a linear combination of CNN and LSTM for simplicity, with the following formulation:

    Score_combination = α · Score_CNN + (1 − α) · Score_LSTM

where α is the weighting parameter of the combination. This parameter is estimated by an experiment on a validation set.
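A sketch of the combination and of estimating α on a validation set by grid search follows; the grid resolution is an assumption.

    import numpy as np

    def combine(p_cnn, p_lstm, alpha):
        """Linear combination of the class score vectors of both nets."""
        return alpha * p_cnn + (1.0 - alpha) * p_lstm

    def estimate_alpha(p_cnn, p_lstm, labels, grid=np.linspace(0, 1, 101)):
        """Pick the alpha with the highest accuracy on a validation set.
        p_cnn, p_lstm: (N, n_classes) score arrays; labels: (N,) ints."""
        accs = [(combine(p_cnn, p_lstm, a).argmax(axis=1) == labels).mean()
                for a in grid]
        return grid[int(np.argmax(accs))]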
6. Experiment and results

6.1. Dataset

We use MathBrush as the database for our experiments. The database contains 4,654 annotated mathematical expressions written by 20 different writers. The number of mathematical symbols is 26K, distributed over 100 different classes. As previous approaches have also used this database, we followed the same experimental setup described in [5] in order to obtain comparable results, so we discarded 6 symbol classes (≤, ≠,

Table 1. Experimental results of CNN and LSTM on 5 trials (error rate, %).

    Trial   CNN     CNN+dropout   LSTM+online   LSTM+online+local gradient
    MB1     13.19   11.06         11.3          11.0
    MB2     13.31   11.18         11.4          10.5
    MB3     12.32   10.07         10.9           9.95
    MB4     12.48   10.74         11.1           9.83
    MB5     12.05   11.41         10.9          10.5
    Avg     12.67   11.01         11.12         10.3

Regarding the dropout technique, it provided a great improvement in the CNN, reducing the symbol recognition error rate from 12.67% to 11.01% (approximately 1.5 points) on average. This shows that dropout improves the performance of the CNN on the handwritten mathematical symbol recognition task.

Regarding the local gradient features, they provided a significant improvement in the LSTM compared to using only online features (error rate down from 11.12% to 10.3%). We thus found that our proposed local gradient features complement the online features better than the contextual features proposed in [5].