
Evaluation of SVM, MLP and GMM Classifiers for Layout Analysis of Historical Documents

Hao Wei, Micheal Baechler, Fouad Slimane and Rolf Ingold
DIVA Group, Department of Informatics, University of Fribourg
Bd. de Perolles 90, 1700 Fribourg, Switzerland
Email: {hao.wei, micheal.baechler, fouad.slimane, rolf.ingold}@unifr.ch

Abstract—This paper presents a comparison between three classifiers based on Support Vector Machines, Multi-Layer Perceptrons and Gaussian Mixture Models, respectively, used to detect the physical structure of historical documents. Each classifier segments a scaled image of a historical document into four classes, i.e., areas of periphery, background, text and decoration. We evaluate them on three data sets of historical documents. Depending on the data set, the best classification rates obtained vary from 90.35% to 97.47%.

I. INTRODUCTION

An important initial step for word recognition and information retrieval in historical documents is layout segmentation. Historical documents usually have complex layouts, which makes layout analysis difficult. Our goal is to develop a layout analysis system based on a pyramidal approach using several analysis levels, where the segmentation output of an upper level is used as input for the lower level, cf. [1], [2]. In other words, each level refines the segmentation of the levels above it. As a starting point, our first level segments a document image into four classes, i.e., areas of periphery, background, text and decoration.

We apply this pyramidal approach to the physical structure and the layout elements of historical manuscripts. We start by segmenting a small scaled image of the document in order to extract a rough global view of the layout, classifying the image pixels into periphery, background, text and decoration. This is the first classification level of our pyramidal approach. We then increase the image size in order to refine the boundaries of the elements contained in the physical structure and to correct misclassification errors from the previous step, i.e., we segment the image pixels identified as text blocks into text lines, background and decoration. This is the second classification level of our pyramidal approach. Starting with a small image rather than a full-size one yields a substantial gain in processing time. This paper evaluates several classification techniques for obtaining the rough layout view from a small image, i.e., for the first classification level. It compares three different implementations of this level based on three machine learning techniques: Support Vector Machines (SVM), Multi-Layer Perceptrons (MLP) and Gaussian Mixture Models (GMM). We evaluated them on three manuscripts written in German, English and Latin.

We consider the segmentation as a pixel classification problem, where each pixel is represented as a vector of color-based features extracted from the image. SVM, MLP and GMM are used as classifiers. The experiments show that the best classification rates for the three manuscripts are 97.47%, 90.74% and 90.35%, respectively. SVM and MLP generally outperform GMM. The rate of MLP is higher than that of SVM for two manuscripts, but lower for the remaining one. We also evaluated an algorithm combining the three classifiers by a fusion based on majority vote.

The remainder of this paper is organized as follows. Section II gives an overview of existing work on layout analysis for historical documents. Section III describes the corpus used for our experiments. Section IV details the three segmentation systems. Our experimental results are presented in Section V. Section VI draws some conclusions.

II. RELATED WORKS

There are many works in the field of layout analysis for historical documents. In [1], the authors described a semi-automatic tool for annotating medieval manuscripts. The tool is composed of two parts. The first part performs a segmentation based on connected components on the textual content in order to segment manuscripts into text regions and text lines. The second part provides an interactive interface allowing the user to customize the automatic analysis. In [2], the authors presented a pyramidal architecture using three analysis levels, evaluated on medieval manuscripts. Using a dynamic multi-layer perceptron as classifier, the output of the upper level is used as a feature in the lower level. Likforman-Sulem et al. [3] presented a survey of existing methods for the text line segmentation task in historical documents. After describing the characteristics of text line structures in historical documents and the different ways of defining a text line, the paper surveyed the different approaches to segmenting a clean image into text lines, e.g., projection-based methods, smearing methods, grouping methods, etc. In [4], the authors proposed a method for characterizing pictures of old documents based on a texture approach. They used a multi-resolution study of the textures, extracting five features linked to the frequencies and orientations in different parts of a page.

DEBORA [5] is a multidisciplinary project aiming at digitizing rare 16th-century books and making them more accessible. The authors designed a multistage segmentation scheme that simultaneously separates text from graphics and adapts the threshold methods to the image content. Their segmentation task includes the separation of text from non-text, the segmentation of the main text body from the margins, and physical layout segmentation. AGORA [6], [7] is a software tool for extracting indexing metadata from historical document images. AGORA applies a hybrid segmentation algorithm that builds two maps: a shape map that focuses on connected components, and a background map that provides information about the white areas corresponding to block separations in the page. It then uses the information provided by the two maps simultaneously to segment the image. The proposed "user-driven approach" achieved above 93% accuracy over the different data sets tested. In [8], the authors presented a system for automatic segmentation, annotation and content-based image retrieval, focused on illuminated manuscripts and in particular the Borso d'Este Holy Bible. The processed documents can be divided mainly into three parts: background, text and decorations. Due to the complex noise and layout of the manuscripts, new methods were proposed to ensure robust segmentation and precise extraction of the objects of interest. In the Historical Document Layout Analysis Competition (ICDAR 2011) [9], four submitted methods were evaluated. The results indicate a convergence towards a certain methodology, with some variations in the approach. However, there is still a considerable need for robust methods that deal with the idiosyncrasies of historical documents.

III. CORPUS

Due to the lack of public research data for layout analysis of historical documents, we have created our own ground-truth data sets using our layout model [1], [2]. To benchmark our system, we decided to use data sets employed by other researchers [10], [11]. Our corpus consists of three data sets. First, the Saint Gall data set consists of 60 pages from a medieval manuscript written in Latin. It contains the hagiography Vita sancti Galli by Walafrid Strabo. The Abbey Library holds a copy of this manuscript as Cod. Sang. 562; the e-codices project makes it available online (http://www.e-codices.unifr.ch). Second, the Parzival data set [10] consists of 47 pages written by three writers. These pages were taken from a 13th-century medieval German manuscript that contains the epic poem Parzival by Wolfram von Eschenbach. The Abbey Library of Saint Gall holds a copy as Cod. 857. An electronic manuscript edition was published on CD-ROM by the German Language Institute of the University of Bern, Switzerland (http://www.parzival.unibe.ch). Finally, the George Washington data set consists of 20 pages written in English. These pages were taken from letters written by George Washington and his associates in the 18th century. The Library of Congress makes them available online (George Washington Papers at the Library of Congress, 1741-1799, Series 2: http://memory.loc.gov/ammem/gwhtml/gwseries2.html). The first two data sets consist of color images of manuscripts written with ink on parchment. The George Washington data set consists of gray-level images of a manuscript written with ink on paper. Figure 1 shows some example pages from the data sets.

Fig. 1. Example pages: (a) page from the George Washington data set (letterbook 1, page 278), (b) page from the Saint Gall data set (Cod. Sang. 562, page 5), (c) and (d) pages from the Parzival data set (Cod. 857, pages 144 and 271).

IV. SYSTEM DESCRIPTION

In this part, Section IV-A describes the feature extraction method applied to the historical document images, and Sections IV-B, IV-C and IV-D present the SVM, MLP and GMM classifiers, respectively.

A. Feature extraction

Feature extraction attributes a numerical vector to each data item. Each pixel $p_{x,y}$ in the scaled image is defined as a vector $(r, g, b)^T$ in the RGB color space. Its feature vector is composed of the following:
1) Its coordinates $x, y$ in the scaled image.
2) Its color values, consisting of the primary colors $r, g, b$.


3) Defining the neighborhood of $p_{x,y}$ as an $n \times n$ window with $n = 9$, the components formed by the sum over the neighborhood in each color plane, i.e., $\sum_{i=-d}^{d} \sum_{j=-d}^{d} p_{x+i,y+j}$, where $d = (n-1)/2$.
4) Defining the neighborhood as a horizontal vector $(p_{x-d,y}, \ldots, p_{x-1,y}, p_{x+1,y}, \ldots, p_{x+d,y})$ with $d = (n-1)/2$, for each primary color plane we compute:
   a) the maximum and minimum of the neighbor pixels;
   b) the components sorted by pairs, i.e., $\max(p_{x+d,y}, p_{x-d,y}), \max(p_{x+d-1,y}, p_{x-d+1,y}), \ldots, \min(p_{x+d-1,y}, p_{x-d+1,y}), \min(p_{x+d,y}, p_{x-d,y})$;
   c) the components multiplied by pairs, formed analogously to b).
   We use only the red and blue color planes for b) and c).
5) Similarly, defining the neighborhood as a vertical vector, we compute the maximum, minimum, sorted components and components multiplied by pairs.
6) Defining the neighborhood as the pixel column of the scaled image of the whole document, the components formed by the sum of the neighbor pixels in each plane.
Finally, we obtain an 87-dimensional feature vector for each pixel.
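The exact bookkeeping that yields 87 dimensions is not fully specified above, so the following NumPy sketch (our own code with hypothetical names, border handling omitted) illustrates the construction rather than reproducing the count:

```python
import numpy as np

def pixel_features(img, x, y, n=9):
    """Sketch of the per-pixel feature vector described above.
    img: H x W x 3 float array (scaled RGB page). Assumes the pixel is
    far enough from the border; padding is omitted for brevity."""
    d = (n - 1) // 2
    feats = [x, y]                          # 1) coordinates
    feats += list(img[y, x])                # 2) r, g, b values

    # 3) sum over the n x n neighborhood, per color plane
    window = img[y - d:y + d + 1, x - d:x + d + 1]
    feats += list(window.sum(axis=(0, 1)))

    # 4) horizontal and 5) vertical neighbor vectors
    horiz = np.concatenate([img[y, x - d:x], img[y, x + 1:x + d + 1]])
    vert = np.concatenate([img[y - d:y, x], img[y + 1:y + d + 1, x]])
    for neigh in (horiz, vert):
        # a) max and min per plane
        feats += list(neigh.max(axis=0)) + list(neigh.min(axis=0))
        # b) pairwise-sorted and c) pairwise-multiplied components,
        #    red and blue planes only
        for plane in (0, 2):
            inner = neigh[:d, plane]            # p_{-d} ... p_{-1}
            outer = neigh[d:, plane][::-1]      # p_{+d} ... p_{+1}
            feats += list(np.maximum(outer, inner))
            feats += list(np.minimum(outer, inner)[::-1])
            feats += list(outer * inner)

    # 6) sum over the whole pixel column, per plane
    feats += list(img[:, x].sum(axis=0))
    return np.array(feats, dtype=float)
```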

B. SVM classifier

Support Vector Machines are supervised learning models used for classification and regression analysis. Suppose we have a set of training data points $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. If the data are linearly separable, we look for a linear separating hyperplane classifier $H$ and two hyperplanes $H_1$ and $H_2$ parallel to $H$, with the condition that no data points lie between $H_1$ and $H_2$. When the data points are not linearly separable, a kernel function can be used to map them into a higher-dimensional space in which they become linearly separable. Another technique is to allow some data points, i.e., noise, to lie between $H_1$ and $H_2$, using a penalty factor $C$ to penalize such violations [12]. In our experiments, we used a radial basis function kernel and the default penalty factor $C = 1$. Introducing Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_N \ge 0$, we finally have to solve
$\max_{\alpha} L = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$,
subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$.
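For illustration, here is a minimal pixel-classification setup using scikit-learn's SVC, which wraps LIBSVM. The paper used the LIBSVM Matlab interface, so this Python sketch with placeholder data is an analogue, not the authors' code:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholders for the real data: 87-dimensional pixel features and
# labels in {0: periphery, 1: background, 2: text, 3: decoration}.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 87)), rng.integers(0, 4, 1000)
X_test = rng.random((200, 87))

# RBF kernel and C = 1, matching the defaults reported above; LIBSVM
# handles the multi-class case with its default one-vs-one scheme.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)   # one layout class label per pixel
```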

C. MLP classifier

The Multi-Layer Perceptron is a widely used type of neural network. It consists of one input layer, one or several hidden layers and one output layer. The input layer distributes the input values to each of the neurons in the hidden layer. For each neuron in the hidden layer(s) and the output layer, the output of each neuron in the previous layer is multiplied by a weight, and the resulting weighted values and the bias are added together. The sum is then passed through a transfer function. A frequently used transfer function is the log-sigmoid function, which maps inputs ranging from negative to positive infinity to outputs between 0 and 1. In our experiments, we used the log-sigmoid function as the transfer function for the neurons in the hidden and output layers. Given the training data $\{(\mathbf{x}^{(1)}, \mathbf{d}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{d}^{(2)}), \ldots, (\mathbf{x}^{(p)}, \mathbf{d}^{(p)})\}$, where $(\mathbf{x}^{(i)}, \mathbf{d}^{(i)})$ is the $i$th pair of input vector and desired output vector, the goal of training is to minimize the error term $E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{p} \|\mathbf{y}^{(i)} - \mathbf{d}^{(i)}\|^{2}$, where $\mathbf{w}$ is the weight matrix and $\mathbf{y}^{(i)}$ is the actual output of the network. The back-propagation algorithm is used to solve this optimization problem in our experiments [13].
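A comparable setup can be sketched with scikit-learn's MLPClassifier. Note that scikit-learn applies the logistic (log-sigmoid) activation in the hidden layer but a softmax output for multi-class problems, so this is only a close analogue of the Matlab configuration described above, with placeholder data and parameter choices of our own:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the 87-dimensional pixel features
# and the four layout classes.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 87)), rng.integers(0, 4, 1000)

# Three-layer perceptron (one hidden layer) with log-sigmoid hidden
# activation, trained by back-propagation via stochastic gradient descent.
mlp = MLPClassifier(hidden_layer_sizes=(40,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1, max_iter=500,
                    random_state=0)
mlp.fit(X, y)
labels = mlp.predict(X[:5])   # predicted layout class per pixel
```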

D. GMM classifier

Fig. 2. Physical layout classification system using Gaussian mixture models.

As presented in Figure 2, the proposed physical structure extraction system includes two main parts (training and classification), similarly to the system presented in [14] for Arabic font recognition. In this system, Gaussian Mixture Models as defined in [15] are used to estimate the likelihoods of the four layout categories. During training, the Expectation-Maximization (EM) algorithm is used to iteratively refine the component weights, means and variances so as to monotonically increase the likelihood of the training feature vectors [16]. In our experiments, we used the EM algorithm to build the models, applying a simple binary splitting procedure every 10 iterations to increase the number of Gaussian mixture components during training, up to 1024 mixtures. At recognition time, every pixel line in the image is represented by feature vectors; an adapted Viterbi algorithm then uses them, together with grammars accounting for all possible sequences of models in each pixel line, to find the best sequence of model hypotheses. Performance is evaluated in terms of classification rates on an unseen set of historical document images. For more details on the use of GMMs, we refer to [14].
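The actual system is built on HTK with one-state HMMs and Viterbi decoding over pixel lines, which a generic library does not reproduce. As a rough, hypothetical approximation only, one GMM per class can be fitted and each pixel assigned by maximum likelihood; scikit-learn re-initializes rather than splits components, so the doubling loop below merely mimics the binary-splitting schedule:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmm(X, max_components=1024):
    """Grow the number of mixture components by doubling and re-run EM
    after each growth step, loosely imitating the HTK-style schedule
    described above. X: (num_pixels, 87) features of one class."""
    n = 1
    gmm = GaussianMixture(n_components=n, covariance_type="diag").fit(X)
    while n < max_components and 2 * n <= len(X):
        n *= 2
        gmm = GaussianMixture(n_components=n, covariance_type="diag",
                              max_iter=10).fit(X)
    return gmm

def classify(x, gmms):
    """Per-pixel maximum-likelihood decision over per-class GMMs
    (a stand-in for the Viterbi decoding used in the real system)."""
    scores = {c: g.score_samples(x.reshape(1, -1))[0] for c, g in gmms.items()}
    return max(scores, key=scores.get)
```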

V. EXPERIMENTAL RESULTS

Due to the high quality of the manuscript images, we decided to extract the elements contained in the physical structure by processing the document images in a pyramidal approach. As mentioned before, we started by segmenting an image of small size to compute a rough view of the layout, i.e., we classified the pixels of the scaled image into 4 classes: the areas of periphery, background, text and decoration. This is our first classification level. Later we will increase the size in order to get more details of the elements contained in the physical structure in the second classification level. This paper evaluates several classifiers implementing the first level of our pyramidal approach.

For the Saint Gall data set, we scaled each image down to 52 × 78 pixels and chose 20 pages for training and 30 pages for testing. For the George Washington data set, we scaled each image down to 69 × 106 pixels and chose 10 pages for training and 5 pages for testing. For the Parzival data set, we scaled each image down to 63 × 94 pixels and chose 24 pages for training and 14 pages for testing. Each pixel in the scaled image was considered as one sample, represented as an 87-dimensional feature vector. Figure 3(a) shows the manually generated ground truth for the page in Figure 1(c). In the images of Figure 3, the areas of periphery, background, text and decoration are shown in black, red, green and blue, respectively.
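To make this setup concrete, here is a hypothetical helper (not from the paper) that downscales a page and builds the per-pixel sample matrix. It assumes the pixel_features sketch from Section IV-A is in scope; the file name and ground-truth map are placeholders:

```python
import numpy as np
from PIL import Image

def page_to_samples(path, size, labels=None, d=4):
    """Downscale one page to size = (width, height) and turn each interior
    pixel into one 87-dimensional sample. Border pixels are skipped here
    for simplicity; a real implementation would pad the image instead."""
    img = np.asarray(Image.open(path).convert("RGB").resize(size),
                     dtype=float) / 255.0
    h, w, _ = img.shape
    X = np.array([pixel_features(img, x, y)
                  for y in range(d, h - d) for x in range(d, w - d)])
    y = None if labels is None else labels[d:h - d, d:w - d].reshape(-1)
    return X, y

# e.g., Saint Gall pages are scaled to 52 x 78 pixels:
# X_train, y_train = page_to_samples("csg562_005.png", (52, 78), gt_map)
```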

A. Segmentation result using SVM

For the SVM classifier, we used the LIBSVM Matlab interface with default parameters [17]. Table I shows the classification rate of each class on the 3 data sets. There are no pixels belonging to the decoration class in the images of the G. Washington data set. See Table V for the global classification rates and Figure 3(b) for the segmentation result.

TABLE I. CLASSIFICATION RATES OF EACH CLASS ON 3 DATA SETS USING SVM

               periphery  background  text    decoration
Saint Gall     0.9824     0.9602      0.9485  0.6541
Parzival       0.8918     0.8976      0.9405  0.3049
G. Washington  0.9706     0.6852      0.9384  –

B. Segmentation result using MLP

For the MLP classifier, we used the Matlab Neural Network Toolbox (http://www.mathworks.com/products/neural-network/index.html). We adopted a three-layer perceptron topology, i.e., a topology with only one hidden layer. The weights of the MLP were initialized randomly. We tried 5 networks with 10, 20, 30, 40 and 50 hidden neurons, respectively. Because of the randomly initialized weights, we ran each network 10 times. To avoid confusion, here "one time" means that we train the network for hundreds of iterations until the stopping criterion is satisfied, and then test it. Table II shows the classification rate of each class on the 3 data sets, using the topologies with 40, 40 and 50 hidden neurons, respectively, which achieved the best classification rate for the corresponding data set. See Table V for the global classification rates and Figure 3(c) for the segmentation result.

TABLE II. CLASSIFICATION RATES OF EACH CLASS ON 3 DATA SETS USING MLP

               periphery  background  text    decoration
Saint Gall     0.9856     0.9781      0.9638  0.6692
Parzival       0.8724     0.8963      0.9409  0
G. Washington  0.9674     0.7846      0.9302  –

C. Segmentation result using GMM

From a practical point of view, GMMs can be seen as one-state Hidden Markov Models. We therefore used the HTK toolkit (http://htk.eng.cam.ac.uk/) to implement our modeling scheme. The global classification rates for the three manuscripts are between 83.32% and 89.01%, as presented in Table V. See Figure 3(d) for the segmentation result.

TABLE III. CLASSIFICATION RATES OF EACH CLASS ON 3 DATA SETS USING GMM

               periphery  background  text    decoration
Saint Gall     0.9838     0.7961      0.9221  0.5602
Parzival       0.8703     0.8076      0.8621  0.4251
G. Washington  0.9514     0.5649      0.9214  –

Fig. 3. Segmentation of the page in Figure 1(c): (a) ground truth, (b) segmentation using SVM, (c) segmentation using MLP, (d) segmentation using GMM.

D. Analysis of the experimental results

Figure 3 shows that the areas of periphery, background and text dominate the image. Their classification rates are high for all 3 classifiers. For the decoration areas, the classification rates are low, as shown in the three tables above. However, because decoration occupies only a small part of the image, it is reasonable that the global classification rate of the images remains high.

Finally, we combined the 3 classifiers by implementing a kind of majority vote algorithm in order to further improve the classification rate. More precisely, the final label is the label predicted by the majority of classifiers. If the classifiers predict three different labels, the final label is determined by the best classifier for the corresponding data set. As presented in Table V, the combination algorithm performs worse than the best single classifier for two data sets, i.e., the Saint Gall and G. Washington data sets, indicating that for some pixels two classifiers predict the same wrong label while the remaining one predicts correctly. Normally a combination algorithm improves performance when combining weak classifiers; since our classifiers are already strong, we could not predict the performance of the combination in advance. As the images in the G. Washington data set contain no decoration areas, we present in Table IV the confusion matrix of the combination algorithm for the Saint Gall and Parzival data sets combined.

TABLE IV. CONFUSION MATRIX OF THE COMBINATION ALGORITHM ON 2 DATA SETS

             periphery  background  text    decoration
periphery    0.9612     0.0388      0       0
background   0.0169     0.9404      0.0420  0.0008
text         0          0.0442      0.9554  0.0005
decoration   0.0016     0.4275      0.2356  0.3353
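The vote with its tie-break rule is straightforward to express; the following sketch (function and variable names are our own) implements the scheme described above:

```python
import numpy as np

def combine_predictions(pred_svm, pred_mlp, pred_gmm, pred_best):
    """Majority vote over the three classifiers' per-pixel labels; when
    all three disagree, fall back to the best single classifier for the
    data set. All arguments are 1-D integer label arrays of equal length."""
    preds = np.stack([pred_svm, pred_mlp, pred_gmm])
    out = pred_best.copy()
    for i in range(preds.shape[1]):
        labels, counts = np.unique(preds[:, i], return_counts=True)
        if counts.max() >= 2:          # at least two classifiers agree
            out[i] = labels[counts.argmax()]
    return out

# Example: three classifiers vote on five pixels.
svm = np.array([0, 1, 2, 3, 0])
mlp = np.array([0, 1, 1, 2, 3])
gmm = np.array([1, 1, 2, 0, 2])
print(combine_predictions(svm, mlp, gmm, pred_best=mlp))  # [0 1 2 2 3]
```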

Table V shows the global classification rates for the 4 algorithms. Both SVM and MLP outperform GMM. For the Saint Gall and the G. Washington data sets, MLP outperforms SVM. For the Parzival data set, SVM outperforms MLP.

TABLE V. GLOBAL CLASSIFICATION RATES ON THE 3 DATA SETS

               SVM     MLP     GMM     Combination
Saint Gall     0.9618  0.9747  0.8901  0.9710
Parzival       0.9074  0.9011  0.8332  0.9111
G. Washington  0.8865  0.9035  0.8457  0.8908
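The quantities reported in Tables IV and V are standard metrics; as a sketch with placeholder arrays, they could be computed as follows:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholders for per-pixel ground truth and predictions on one data set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, 5000)
y_pred = rng.integers(0, 4, 5000)

rate = accuracy_score(y_true, y_pred)   # global classification rate (Table V)
cm = confusion_matrix(y_true, y_pred, normalize="true")  # row-normalized (Table IV)
```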

VI. CONCLUSIONS

This paper has presented a comparison between three classifiers, i.e., support vector machines, multi-layer perceptrons and Gaussian mixture models, applied to the segmentation of historical documents. We classified the pixels of a scaled document image into 4 classes (areas): periphery, background, text and decoration. Our experiments show that the best classification rates are 97.47%, 90.74% and 90.35% for the Saint Gall, Parzival and G. Washington data sets, respectively. The experiments show that the periphery, background and text areas, which occupy the dominant part of the page, were segmented well. We also combined the classifiers by letting them vote for the label in order to further improve the result. For all data sets, both SVM and MLP perform better than GMM. For the Saint Gall and George Washington data sets, MLP outperforms SVM. For the Parzival data set, SVM outperforms MLP. At the moment, the segmentation problem is treated as a pixel classification problem; we did not compute the contours of the regions explicitly, since the results are used as input for the next classification level. In future work, we plan to investigate further feature extraction methods, e.g., Gabor features, and to evaluate several other types of classifier for our pyramidal approach.

VII. ACKNOWLEDGEMENT

The authors want to thank Van Cuong Kieu for providing them with the set of distorted images of the Saint Gall database. This work has been supported by the Swiss NSF project CRSI22 125220 and fellowship project PBBEP2 141453.

REFERENCES

[1] M. Baechler, J.L. Bloechle, and R. Ingold, "Semi-automatic annotation tool for medieval manuscripts", International Conference on Frontiers in Handwriting Recognition, pp. 182-187, 2010.
[2] M. Baechler and R. Ingold, "Multi Resolution Layout Analysis of Medieval Manuscripts Using Dynamic MLP", International Conference on Document Analysis and Recognition, pp. 1185-1189, 2011.
[3] L. Likforman-Sulem, A. Zahour, and B. Taconet, "Text line segmentation of historical documents: a survey", International Journal on Document Analysis and Recognition, vol. 9, no. 2, pp. 123-138, 2007.
[4] N. Journet, J.Y. Ramel, R. Mullot, and V. Eglin, "A proposition of retrieval tools for historical document images libraries", International Conference on Document Analysis and Recognition, vol. 2, pp. 1053-1057, 2007.
[5] F. Le Bourgeois and H. Emptoz, "DEBORA: Digital access to books of the renaissance", International Journal on Document Analysis and Recognition, vol. 9, no. 2, pp. 193-221, 2007.
[6] J.Y. Ramel, S. Busson, and M.L. Demonet, "AGORA: the interactive document image analysis tool of the BVH project", Document Image Analysis for Libraries, pp. 145-155, 2006.
[7] J.Y. Ramel, S. Leriche, M.L. Demonet, and S. Busson, "User-driven page layout analysis of historical printed books", International Journal on Document Analysis and Recognition, vol. 9, no. 2, pp. 243-261, 2007.
[8] C. Grana, D. Borghesani, S. Calderara, and R. Cucchiara, ""Inside the Bible": segmentation, annotation and retrieval for a new browsing experience", ACM International Conference on Multimedia Information Retrieval, pp. 379-386, 2008.
[9] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher, "Historical Document Layout Analysis Competition", International Conference on Document Analysis and Recognition, pp. 1516-1520, 2011.
[10] A. Fischer, A. Keller, V. Frinken, and H. Bunke, "Lexicon-Free Handwritten Word Spotting Using Character HMMs", Pattern Recognition Letters, vol. 33, no. 7, pp. 934-942, 2012.
[11] A. Garz, A. Fischer, R. Sablatnig, and H. Bunke, "Binarization-Free Text Line Segmentation for Historical Documents Based on Interest Point Clustering", International Workshop on Document Analysis Systems, pp. 95-99, 2012.
[12] S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005.
[13] C.G. Looney, Pattern Recognition using Neural Networks: Theory and Algorithms for Engineers and Scientists, Oxford University Press, 1997.
[14] F. Slimane, S. Kanoun, J. Hennebert, A.M. Alimi, and R. Ingold, "A study on font-family and font-size recognition applied to Arabic word images at ultra-low resolution", Pattern Recognition Letters, vol. 34, no. 2, pp. 209-218, 2013.
[15] D. Reynolds, "Gaussian Mixture Models", Encyclopedia of Biometric Recognition, Springer, Feb. 2008.
[16] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[17] C.C. Chang and C.J. Lin, "LIBSVM: a library for support vector machines", ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm