A Design of a Preprocessing Framework for Large Database of Historical Documents
Ines Ben Messaoud, Haikal El Abed, Hamid Amiri, and Volker Märgner, 17.09.2011
Outline
1. Introduction
2. Preprocessing Framework for Historical Documents
   Overview
   Selection Phase
   Evaluation of the Selection
3. Evaluation of Binarization
   Ground-Truth Generation
   Evaluation Metrics
4. Tests and Results
   Document Databases
   Experiments and Results
5. Conclusions
17.09.2011 | Haikal El Abed | A preprocessing Framework for Large Database of Historical Documents | 2/16
Motivation
[Figure: block diagram of a document recognition system. Stages: Acquisition (Scanner, Camera); Pre-Processing (Noise Reduction, Thinning, Normalise); Features extraction (Contour, Structure); Classification / Recognition (HMMs, Neural Network, Support Vector Machine); Results / Post-Processing (PAWs, Words, Phrase). Further labels: Database, Training and Test Samples, Lexicon, Method, Parameter, Configuration Files.]
Preprocessing Framework for Historical Documents: Overview
Published databases of historical documents are mostly designed as a selection of books. Pages belonging to the same book share common characteristics.
The proposed work is a framework for preprocessing historical documents. Its objective is to select, for each book of the database, a set of preprocessing methods covering noise removal, binarization and post-processing.
The framework is composed of two phases:
Selection phase: a set of preprocessing methods is applied to a subset of images from each book, and one or more methods are selected as the best.
Evaluation of the selection: the methods selected during the selection phase are evaluated on another subset of images (the rest of the book).
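The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy threshold "methods", the pixel-accuracy score and the sample data are all assumptions chosen only to make the control flow concrete.

```python
from statistics import mean


def rank_methods(methods, images, ground_truths, score):
    """Return method names sorted from best to worst mean score."""
    averages = {
        name: mean(score(binarize(img), gt)
                   for img, gt in zip(images, ground_truths))
        for name, binarize in methods.items()
    }
    return sorted(averages, key=averages.get, reverse=True)


def selection_phase(methods, images, ground_truths, score, top_k=3):
    """Selection phase: keep the top_k methods on the selection subset."""
    return rank_methods(methods, images, ground_truths, score)[:top_k]


def evaluation_phase(methods, selected, images, ground_truths, score):
    """Evaluation of the selection: is the best method on the rest of
    the book among the methods chosen during the selection phase?"""
    best = rank_methods(methods, images, ground_truths, score)[0]
    return best in selected


# --- Hypothetical toy data (assumption, not from the slides) ---
def pixel_accuracy(result, ground_truth):
    """Fraction of pixels on which result and ground truth agree."""
    return sum(r == g for r, g in zip(result, ground_truth)) / len(ground_truth)


def make_threshold(t):
    """Toy 'binarization method': pixels darker than t become text (1)."""
    return lambda img: [1 if p < t else 0 for p in img]


toy_methods = {"t50": make_threshold(50),
               "t100": make_threshold(100),
               "t200": make_threshold(200)}
selection_images = [[20, 80, 150, 220], [30, 90, 160, 240]]
selection_gt = [[1, 1, 0, 0], [1, 1, 0, 0]]
rest_images = [[10, 70, 130, 210]]
rest_gt = [[1, 1, 0, 0]]

selected = selection_phase(toy_methods, selection_images, selection_gt,
                           pixel_accuracy, top_k=1)
validated = evaluation_phase(toy_methods, selected, rest_images, rest_gt,
                             pixel_accuracy)
```

In this sketch the selection counts as validated for a book when the method that scores best on the remaining pages was already among the selected ones; the slides do not spell out the exact validation criterion.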
Preprocessing Framework for Historical Documents: Selection Phase
[Figure: flow diagram of the selection phase. Step 1 blocks: Images, Noise Removal Methods, Binarization Methods, Post-processing, Ground-truth Generation, Evaluation of Binarization. Step 2: Selected Binarization Method.]
Preprocessing Framework for Historical Documents: Evaluation of the Selection
[Figure: flow diagram of the evaluation of the selection. Blocks: Images, Selected Binarization Methods, Ground-truth Generation, Binarization Evaluation, Evaluation Metrics.]
Evaluation of Binarization: Ground-truth Generation
The developed method for ground-truth generation is an improved version of [Ntirogiannis2008].
[Figure: ground-truth generation pipeline. Blocks: Input Image, Stroke Width Estimation, Binarization mb, Adaptive Binarization, Contour, Skeletonization, dilation, F-measure.]
[Equations: F-measure with its Recall and Precision terms; the symbol definitions are not recoverable.]
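For orientation, the standard pixel-wise F-measure can be sketched as follows. This uses the textbook definitions of Recall and Precision on flat lists of binary pixels (1 = text); the variant on the slide substitutes different terms in the two equations, which are not reproduced here. The function name and input format are assumptions.

```python
def f_measure(result, ground_truth):
    """Standard F-measure between two binary images given as flat
    lists of 0/1 pixels, where 1 marks a text pixel."""
    tp = sum(1 for r, g in zip(result, ground_truth) if r == 1 and g == 1)
    fp = sum(1 for r, g in zip(result, ground_truth) if r == 1 and g == 0)
    fn = sum(1 for r, g in zip(result, ground_truth) if r == 0 and g == 1)
    if tp == 0:
        # No correctly detected text pixel: the score is defined as 0 here.
        return 0.0
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)
```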
Evaluation of Binarization: Evaluation Metrics
The following metrics are used to evaluate the binarization methods; they have been used in [Pratikakis2010, Paredes2010, Barney Smith2010]:
F-measure
p-FM (P-Fmeasure), calculated using the F-measure equation with different terms in the Recall and Precision equations
PSNR, peak signal-to-noise ratio
NRM, negative rate metric
MPM, misclassification penalty metric
GA, geometric mean accuracy
Normalized cross correlation
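Two of these metrics are simple enough to sketch directly. The code below assumes the definitions commonly used in the DIBCO contests: PSNR = 10 log10(C² / MSE) with C = 1 for binary images, and NRM as the mean of the false-negative and false-positive rates. The function names and the flat 0/1 list format are assumptions; the slides' own equations are not available.

```python
import math


def psnr(result, ground_truth):
    """Peak signal-to-noise ratio for binary images (pixel values 0/1),
    so the peak difference C between foreground and background is 1."""
    mse = sum((r - g) ** 2 for r, g in zip(result, ground_truth)) / len(ground_truth)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(1.0 / mse)


def nrm(result, ground_truth):
    """Negative rate metric: mean of the false-negative rate and the
    false-positive rate (lower is better)."""
    pairs = list(zip(result, ground_truth))
    tp = sum(1 for r, g in pairs if r == 1 and g == 1)
    tn = sum(1 for r, g in pairs if r == 0 and g == 0)
    fp = sum(1 for r, g in pairs if r == 1 and g == 0)
    fn = sum(1 for r, g in pairs if r == 0 and g == 1)
    nr_fn = fn / (fn + tp) if (fn + tp) else 0.0
    nr_fp = fp / (fp + tn) if (fp + tn) else 0.0
    return (nr_fn + nr_fp) / 2
```

Higher PSNR and lower NRM both indicate a result closer to the ground truth.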
Tests and Results: Document Databases
The ground-truth generation method was tested on the binarization benchmarking dataset DIBCO 2009, composed of 10 images: 5 handwritten and 5 printed text images.
The selection and evaluation phases are applied to:
23 books from the Google-Books database (Version 1.0, 2007)
7 books from the Bayerische Staatsbibliothek (BSB) database
The selection phase is applied to one subset of images from each book, and the evaluation of the selection to another subset.
Six binarization methods were tested: Otsu [Otsu1979], Bernsen [Bernsen1986], Niblack [Niblack1986], Sauvola [Sauvola2000], Gatos [Gatos2006] and Ben Messaoud [BenMessaoud2011], denoted m1 to m6.
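As an illustration of the simplest of the tested methods, Otsu's global threshold [Otsu1979] can be sketched in pure Python: it picks the gray level that maximizes the between-class variance of the histogram. The function names, the flat pixel list and the convention that dark pixels are text are assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance
    when pixels <= t form one class and pixels > t the other."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_otsu(pixels):
    """Mark pixels at or below the Otsu threshold as text (1)."""
    t = otsu_threshold(pixels)
    return [1 if p <= t else 0 for p in pixels]
```

A single global threshold is often inadequate for degraded historical pages with uneven backgrounds, which is one reason local methods such as Niblack and Sauvola appear in the comparison.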
Tests and Results: Experiments and Results
Three binarization methods mb are tested (Sauvola, Gatos and Lu [Lu2010]), together with three values of the dilation parameter (0.5, 0.75 and 1). According to Table 1, the best parameters for ground-truth generation are a dilation parameter of 1 with Lu's method as mb; this combination gives the highest values for both printed and handwritten images.

Table 1
mb         Printed (dilation)         Handwritten (dilation)
           0.5     0.75    1          0.5     0.75    1
Sauvola    75.31   84.37   94.69      83.56   86.48   90.53
Gatos      76.80   84.86   94.89      81.91   85.65   93.81
Lu         77.18   88.03   96.28      82.76   88.13   95.83
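The choice of parameters can be reproduced from the table values with a few lines of Python. The ranking rule used here (maximize the lower of the printed and handwritten scores) is an assumption for illustration; the slides' own selection criterion is not recoverable.

```python
# Table 1 values: (method, dilation) -> (printed score, handwritten score)
table1 = {
    ("Sauvola", 0.5): (75.31, 83.56),
    ("Sauvola", 0.75): (84.37, 86.48),
    ("Sauvola", 1.0): (94.69, 90.53),
    ("Gatos", 0.5): (76.80, 81.91),
    ("Gatos", 0.75): (84.86, 85.65),
    ("Gatos", 1.0): (94.89, 93.81),
    ("Lu", 0.5): (77.18, 82.76),
    ("Lu", 0.75): (88.03, 88.13),
    ("Lu", 1.0): (96.28, 95.83),
}


def best_configuration(table):
    """Pick the (method, dilation) pair whose worse score over the two
    image types is highest (assumed ranking rule)."""
    return max(table, key=lambda key: min(table[key]))


best_mb, best_dilation = best_configuration(table1)
```

Ranking by the mean of the two scores instead would lead to the same pair for these values.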
Tests and Results: Experiments and Results
Three binarization methods are selected during the selection phase as the best methods for each book of the Google-Books and BSB databases, and one during the evaluation of binarization.
Methods: m1 Otsu, m2 Bernsen, m3 Niblack, m4 Sauvola, m5 Gatos, m6 Ben Messaoud
Tests and Results: Experiments and Results
For each method we calculate [equation not recoverable]
Methods: m1 Otsu, m2 Bernsen, m3 Niblack, m4 Sauvola, m5 Gatos, m6 Ben Messaoud

Google-Books
Selection                  Evaluation
1st     2nd     3rd        1st
0.39    0.48    0.13       0.35
0.30    0.39    0.30       0.30
0.26    0.13    0.35       0.22

BSB
Selection                  Evaluation
1st     2nd     3rd        1st
0.43    0.14    0.29       0.29
0.29    0.29    0.14       0.29
0.29    0.29    0.14       0.43
Conclusions
We have proposed a framework for the preprocessing of historical documents and applied it to a large database (about 900 documents).
The framework is based on selecting the best binarization method for each book of the dataset.
The selection is validated during the evaluation phase. According to our experiments, the selection is validated for most of the books used.
The framework will be extended to larger databases, and the selection phase will be improved by an intelligent selection based on the document characteristics.
Thank you
Dipl.-Ing. Haikal El Abed
References
[Ntirogiannis2008] K. Ntirogiannis, B. Gatos, and I. Pratikakis, "An objective evaluation methodology for document image binarization techniques," in IAPR Inter. Workshop on Document Analysis Systems (DAS), September 2008, pp. 217-224.
[Stathis2008] P. Stathis, E. Kavallieratou, and N. Papamarkos, "An evaluation technique for binarization algorithms," Journal of Universal Computer Science, vol. 14, no. 18, pp. 3011-3030, October 2008.
[Paredes2010] R. Paredes and E. Kavallieratou, "ICFHR 2010 contest: Quantitative evaluation of binarization algorithms," in Inter. Conf. on Frontiers in Handwriting Recognition (ICFHR), November 2010, pp. 733-736.
[Barney Smith2010] E. Barney Smith, "An analysis of binarization ground truth," in IAPR Inter. Workshop on Document Analysis Systems (DAS), June 2010, pp. 27-34.
[Otsu1979] N. Otsu, "A threshold selection method from gray level histograms," IEEE Trans. Syst., Man, Cybern., vol. 9, pp. 62-66, 1979.
[Bernsen1986] J. Bernsen, "Dynamic thresholding of grey-level images," in Inter. Conf. on Pattern Recognition (ICPR), 1986, pp. 1251-1255.
[Niblack1986] W. Niblack, An Introduction to Digital Image Processing. Englewood Cliffs: Prentice Hall, 1986, pp. 115-116.
[Sauvola2000] J. Sauvola and M. Pietikäinen, "Adaptive document image binarization," Pattern Recognition, vol. 33, no. 2, pp. 225-236, February 2000.
[Gatos2006] B. Gatos, I. Pratikakis, and S. Perantonis, "Adaptive degraded document image binarization," Pattern Recognition, vol. 39, pp. 317-327, September 2006.
[BenMessaoud2011] I. Ben Messaoud, H. El Abed, H. Amiri, and V. Märgner, "New binarization approach based on text block extraction," in Inter. Conf. on Document Analysis and Recognition (ICDAR), September 2011.
[Lu2010] S. Lu, B. Su, and C. L. Tan, "Document image binarization using background estimation and stroke edges," Inter. Journal on Document Analysis and Recognition, vol. 13, no. 4, pp. 303-314, December 2010.
[Pratikakis2010] I. Pratikakis, B. Gatos, and K. Ntirogiannis, "H-DIBCO 2010-handwritten document image binarization competition," in Inter. Conf. on Frontiers in Handwriting Recognition (ICFHR), 2010, pp. 727-732.
Binarization Evaluation
Samples from the BSB Database
Samples from the Google-Books Database
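Ben Messaoud Binarization
[Figure: flow diagram of the Ben Messaoud binarization method. Blocks: Input Image, Transformation to Grayscale, Noise Removal, Region Localization, Binarization of the Object of Interest Oi, Background, Output Image; the diagram also contains a "-" and a "+" operator.]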