

A Design of a Preprocessing Framework for Large Database of Historical Documents
Ines Ben Messaoud, Haikal El Abed, Hamid Amiri, and Volker Märgner
17.09.2011

Outline
1. Introduction
2. Preprocessing Framework for Historical Documents
   • Overview
   • Selection Phase
   • Evaluation of the Selection
3. Evaluation of Binarization
   • Ground-Truth Generation
   • Evaluation Metrics
4. Tests and Results
   • Document Databases
   • Experiments and Results
5. Conclusions

17.09.2011 | Haikal El Abed | A preprocessing Framework for Large Database of Historical Documents | 2/16

Motivation

[Figure: block diagram of a recognition system. Block labels: Acquisition (Scanner, Camera); Pre-Processing (Noise Reduction, Thinning, Normalise); Features extraction (Contour, PAWs, Words, Phrase); Classification / Recognition (HMMs, Neural Network, Support Vector Machine); Database (Training and Test Samples, Lexicon); Results / Post-Processing; Presentation; Method Structure; Parameter and Configuration Files.]


Preprocessing Framework for Historical Documents: Overview
• Published databases of historical documents are mostly designed as selections of books.
• Pages belonging to the same book have common characteristics.
• The proposed work is a framework for preprocessing historical documents. Its objective is to select a set of preprocessing methods (noise removal, binarization and post-processing) for each book of the database used.
• The framework is composed of two phases:
  - Selection phase: a set of preprocessing methods is applied to a subset of images from each book, and one or more of them are selected as the best methods.
  - Evaluation of the selection: the methods selected during the selection phase are evaluated on another subset of images (the rest of the book).
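
The two phases described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `rank_methods`, `select_and_evaluate`, `score`, `n_selection` and `top_k` are hypothetical, the first `n_selection` pages stand in for the selection subset, and the toy data replaces real page images and a real binarization score.

```python
from statistics import mean


def rank_methods(pages, methods, score):
    """Return method names ordered from best to worst mean score on `pages`."""
    mean_scores = {
        name: mean([score(method(page), page) for page in pages])
        for name, method in methods.items()
    }
    return sorted(mean_scores, key=mean_scores.get, reverse=True)


def select_and_evaluate(pages, methods, score, n_selection, top_k=3):
    """Run both phases on one book and return (selected, validated_best)."""
    selection_pages = pages[:n_selection]    # subset used for the selection phase
    evaluation_pages = pages[n_selection:]   # the rest of the book
    if not selection_pages or not evaluation_pages:
        raise ValueError("both subsets must contain at least one page")

    # Selection phase: keep the top_k methods on the selection subset.
    selected = rank_methods(selection_pages, methods, score)[:top_k]

    # Evaluation of the selection: re-rank only the selected methods on the rest.
    candidates = {name: methods[name] for name in selected}
    validated_best = rank_methods(evaluation_pages, candidates, score)[0]
    return selected, validated_best


# Toy data (hypothetical): "pages" are numbers, and a "method" scores higher
# the closer its output stays to the page value.
toy_pages = [10, 20, 30, 40, 50, 60]
toy_methods = {
    "a": lambda page: page,
    "b": lambda page: page + 1,
    "c": lambda page: page + 5,
    "d": lambda page: page + 2,
}


def toy_score(output, page):
    return -abs(output - page)


selected, best = select_and_evaluate(toy_pages, toy_methods, toy_score, n_selection=2)
print(selected, best)
```

In a real run, `score` would compare each binarized page against its generated ground truth using one of the evaluation metrics, and the whole procedure would be repeated per book.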


Preprocessing Framework for Historical Documents: Selection Phase

[Figure: selection phase. Block labels: (1) Images, Noise Removal Methods, Binarization Methods, Post-processing, Ground-truth Generation, Evaluation of Binarization; (2) Selected Binarization Method.]


Preprocessing Framework for Historical Documents: Evaluation of the Selection

[Figure: evaluation of the selection. Block labels: Images, Selected Binarization Methods, Ground-truth Generation, Binarization Evaluation, Evaluation Metrics.]


Evaluation of Binarization: Ground-truth Generation
• The developed method for ground-truth generation is an improved version of [Ntirogiannis2008].

[Figure: ground-truth generation pipeline. Block labels: Input Image, Stroke Width Estimation, Binarization mb, Adaptive Binarization, Contour, Skeletonization, Dilation, F-measure.]

• … and … are in the Recall and Precision equations respectively, and … is equal to … in both equations.


Evaluation of Binarization: Evaluation Metrics
• The following metrics are used to evaluate the binarization methods; they have been used in [Pratikakis2010, Paredes2010, Barney Smith2010]: F-measure, p-FM, PSNR, NRM, MPM, GA and normalized cross correlation.
  - Pseudo F-measure (p-FM): calculated using the F-measure equation, where … are considered … in both the Recall and Precision equations.
  - Peak signal-to-noise ratio (PSNR)
  - Negative rate metric (NRM)
  - Misclassification penalty metric (MPM)
  - Geometric mean accuracy (GA)
  - Normalized cross correlation
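
Four of the listed metrics depend only on pixel-wise counts and can be sketched directly. The code below uses the definitions common in the binarization-contest literature: F-measure from Recall and Precision, PSNR as 10·log10(C²/MSE) with C = 1 for binary images, and NRM as the mean of the false-negative and false-positive rates. GA is taken here as the geometric mean of the true-positive and true-negative rates, which is an assumption about the exact definition used in the talk. MPM and normalized cross correlation are left out. All function names are hypothetical, and images are assumed to be flat sequences of 0/1 with 1 marking text pixels.

```python
import math


def confusion_counts(ground_truth, result):
    """Count TP, FP, FN, TN for two equal-length sequences of 0/1 (1 = text pixel)."""
    if len(ground_truth) != len(result) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = fn = tn = 0
    for gt_pixel, out_pixel in zip(ground_truth, result):
        if gt_pixel and out_pixel:
            tp += 1
        elif out_pixel:
            fp += 1
        elif gt_pixel:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty class."""
    return numerator / denominator if denominator else 0.0


def binarization_metrics(ground_truth, result):
    """Return F-measure, PSNR, NRM and GA for one binarized image."""
    tp, fp, fn, tn = confusion_counts(ground_truth, result)
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    f_measure = _ratio(2 * recall * precision, recall + precision)

    # For 0/1 images every wrong pixel contributes a squared error of 1.
    mse = (fp + fn) / len(ground_truth)
    psnr = math.inf if mse == 0 else 10 * math.log10(1 / mse)

    # Negative rate metric: mean of false-negative and false-positive rates.
    nrm = (_ratio(fn, fn + tp) + _ratio(fp, fp + tn)) / 2

    # Geometric mean accuracy (assumed definition): sqrt(TPR * TNR).
    ga = math.sqrt(recall * _ratio(tn, tn + fp))

    return {"f_measure": f_measure, "psnr": psnr, "nrm": nrm, "ga": ga}


# Toy 8-pixel image: one text pixel missed, one background pixel marked as text.
metrics = binarization_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
print(metrics)
```

Higher is better for F-measure, PSNR and GA, while lower is better for NRM, so any ranking of methods has to treat NRM with the opposite sign.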


Tests and Results: Document Databases
• The ground-truth generation method was tested on the DIBCO 2009 binarization benchmark dataset, composed of 10 images: 5 handwritten and 5 printed text images.
• The selection and evaluation phases are applied to:
  - 23 books from the Google-Books database (Version 1.0, 2007), where …
  - 7 books from the Bayerische Staatsbibliothek (BSB) database, where …
• The selection phase is applied on …, and the evaluation of the selection is applied on ….
• A set of binarization methods was tested during our work: Otsu [Otsu1979], Bernsen [Bernsen1986], Niblack [Niblack1986], Sauvola [Sauvola2000], Gatos [Gatos2006] and Ben Messaoud [BenMessaoud2011], denoted m1 to m6 respectively.
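
As an illustration of the simplest method in this list, here is a minimal sketch of Otsu's global threshold, which picks the gray level that maximizes the between-class variance of the histogram. It is a textbook-style version, not the code used in the experiments. The function names are hypothetical, pixels are assumed to be integers in 0..255 given as a flat list, and dark pixels are assumed to be text.

```python
def otsu_threshold(pixels):
    """Return the gray level t maximizing between-class variance (class 0: value <= t)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += histogram[t]
        sum_dark += t * histogram[t]
        if weight_dark == 0:
            continue                      # no pixel at or below t yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break                         # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def binarize(pixels, threshold):
    """Map dark pixels (value <= threshold) to 1 (text) and the rest to 0."""
    return [1 if value <= threshold else 0 for value in pixels]


# Toy image: half dark ink (10), half bright paper (200).
toy_pixels = [10] * 50 + [200] * 50
threshold = otsu_threshold(toy_pixels)
print(threshold, sum(binarize(toy_pixels, threshold)))
```

A single global threshold like this tends to fail on unevenly degraded pages, which is one reason local methods such as Niblack, Sauvola and Gatos are tested alongside it.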


Tests and Results: Experiments and Results
• Three binarization methods mb are tested (Sauvola, Gatos and Lu [Lu2010]), together with three values of the dilation parameter.
• According to Table 1, the best parameters of the ground-truth generation are the dilation parameter value 1 and Lu's method as mb, which give the highest scores for both printed and handwritten images. Those parameters satisfy …

Table 1 (rows: binarization method mb; columns: dilation parameter value)

            Printed                   Handwritten
mb          0.5     0.75    1         0.5     0.75    1
Sauvola     75.31   84.37   94.69     83.56   86.48   90.53
Gatos       76.80   84.86   94.89     81.91   85.65   93.81
Lu          77.18   88.03   96.28     82.76   88.13   95.83
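
The choice of the best setting can be checked mechanically from the table values. The short sketch below assumes that the three columns per document type are the three tested dilation parameter values and that a higher score is better. The names `TABLE_1` and `best_setting` are hypothetical.

```python
# Scores copied from Table 1: document type -> method mb -> dilation value -> score.
TABLE_1 = {
    "Printed": {
        "Sauvola": {0.5: 75.31, 0.75: 84.37, 1.0: 94.69},
        "Gatos": {0.5: 76.80, 0.75: 84.86, 1.0: 94.89},
        "Lu": {0.5: 77.18, 0.75: 88.03, 1.0: 96.28},
    },
    "Handwritten": {
        "Sauvola": {0.5: 83.56, 0.75: 86.48, 1.0: 90.53},
        "Gatos": {0.5: 81.91, 0.75: 85.65, 1.0: 93.81},
        "Lu": {0.5: 82.76, 0.75: 88.13, 1.0: 95.83},
    },
}


def best_setting(scores_by_method):
    """Return (method, dilation, score) with the highest score."""
    candidates = (
        (method, dilation, score)
        for method, row in scores_by_method.items()
        for dilation, score in row.items()
    )
    return max(candidates, key=lambda item: item[2])


for document_type, scores in TABLE_1.items():
    print(document_type, best_setting(scores))
```

Both document types point to the same setting, Lu's method with a dilation value of 1, although Sauvola scores slightly higher than Lu in the handwritten column for 0.5.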


Tests and Results: Experiments and Results
• During the selection phase, the three best binarization methods are selected for each book of the Google-Books and BSB databases; during the evaluation, one method is selected.
• m1: Otsu, m2: Bernsen, m3: Niblack, m4: Sauvola, m5: Gatos, m6: Ben Messaoud


Tests and Results: Experiments and Results
• For each method we calculate …
• m1: Otsu, m2: Bernsen, m3: Niblack, m4: Sauvola, m5: Gatos, m6: Ben Messaoud

Google-Books
            Selection                 Evaluation
Method      1st     2nd     3rd       1st
…           0.39    0.48    0.13      0.35
…           0.30    0.39    0.30      0.30
…           0.26    0.13    0.35      0.22

BSB
            Selection                 Evaluation
Method      1st     2nd     3rd       1st
…           0.43    0.14    0.29      0.29
…           0.29    0.29    0.14      0.29
…           0.29    0.29    0.14      0.43
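
The table values are consistent with fractions of books (multiples of 1/23 for Google-Books and of 1/7 for BSB). Under that reading, which is an assumption, the per-method figures could be computed as in the sketch below: for each method, the share of books in which it is ranked 1st, 2nd or 3rd. The function name `rank_rates` and the seven example rankings are invented for illustration and are not the results of the experiments.

```python
from collections import Counter


def rank_rates(rankings, top_k=3):
    """Return {method: [share ranked 1st, share ranked 2nd, ...]} over all books."""
    n_books = len(rankings)
    counts = [Counter() for _ in range(top_k)]
    for ranking in rankings:
        for position, method in enumerate(ranking[:top_k]):
            counts[position][method] += 1
    methods = sorted({method for ranking in rankings for method in ranking[:top_k]})
    return {
        method: [round(counts[position][method] / n_books, 2) for position in range(top_k)]
        for method in methods
    }


# Hypothetical rankings for 7 books (best method first); not data from the talk.
example_rankings = (
    [["m4", "m5", "m6"]] * 3
    + [["m5", "m6", "m4"]] * 2
    + [["m6", "m4", "m5"]] * 2
)

rates = rank_rates(example_rankings)
print(rates)
```

With 7 books, the only possible non-zero rates are 0.14, 0.29, 0.43 and so on, which matches the granularity seen in the BSB table.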


Conclusions
• We have proposed a framework for the preprocessing of historical documents, applied to a large database (about 900 documents).
• The framework is based on selecting the best binarization method for each book of the dataset used.
• The selection is validated during the evaluation phase.
• According to our experiments, the selection is validated for most of the book datasets used.
• The framework will be extended to larger databases, and the selection phase will be improved by an intelligent selection based on the document characteristics.


Thank you

Dipl.-Ing. Haikal El Abed


References
[Ntirogiannis2008] K. Ntirogiannis, B. Gatos, and I. Pratikakis, "An objective evaluation methodology for document image binarization techniques," in IAPR Inter. Workshop on Document Analysis Systems (DAS), September 2008, pp. 217-224.
[Stathis2008] P. Stathis, E. Kavallieratou, and N. Papamarkos, "An evaluation technique for binarization algorithms," Journal of Universal Computer Science, vol. 14, no. 18, pp. 3011-3030, October 2008.
[Paredes2010] R. Paredes and E. Kavallieratou, "ICFHR 2010 contest: Quantitative evaluation of binarization algorithms," in Inter. Conf. on Frontiers in Handwriting Recognition (ICFHR), November 2010, pp. 733-736.
[Barney Smith2010] E. Barney Smith, "An analysis of binarization ground truth," in IAPR Inter. Workshop on Document Analysis Systems (DAS), June 2010, pp. 27-34.
[Otsu1979] N. Otsu, "A threshold selection method from gray level histograms," IEEE Trans. Syst., Man, Cybern., vol. 9, pp. 62-66, 1979.
[Bernsen1986] J. Bernsen, "Dynamic thresholding of grey-level images," in Inter. Conf. on Pattern Recognition (ICPR), 1986, pp. 1251-1255.
[Niblack1986] W. Niblack, An Introduction to Digital Image Processing. Englewood Cliffs: Prentice Hall, 1986, pp. 115-116.
[Sauvola2000] J. Sauvola and M. Pietikäinen, "Adaptive document image binarization," Pattern Recognition, vol. 33, no. 2, pp. 225-236, February 2000.
[Gatos2006] B. Gatos, I. Pratikakis, and S. Perantonis, "Adaptive degraded document image binarization," Pattern Recognition, vol. 39, pp. 317-327, September 2006.
[BenMessaoud2011] I. Ben Messaoud, H. El Abed, H. Amiri, and V. Märgner, "New binarization approach based on text block extraction," in Inter. Conf. on Document Analysis and Recognition (ICDAR), September 2011.
[Lu2010] S. Lu, B. Su, and C. L. Tan, "Document image binarization using background estimation and stroke edges," Inter. Journal on Document Analysis and Recognition, vol. 13, no. 4, pp. 303-314, December 2010.
[Pratikakis2010] I. Pratikakis, B. Gatos, and K. Ntirogiannis, "H-DIBCO 2010 - Handwritten document image binarization competition," in Inter. Conf. on Frontiers in Handwriting Recognition (ICFHR), 2010, pp. 727-732.

[Figure: binarization evaluation]

[Figure: samples from the BSB database]

[Figure: samples from the Google-Books database]

Ben Messaoud Binarization

[Figure: flow diagram of the method. Block labels, in order: Input Image, Transformation to Grayscale, Noise Removal, Region Localization, Binarization, Object of Interest Oi, Background, Output Image.]
