APPLIED PROBLEMS
Algorithm for Segmentation of Documents Based on Texture Features

A. M. Vil'kin, I. V. Safonov, and M. A. Egorova

National Research Nuclear University MEPhI, Kashirskoe sh. 31, Moscow, 115409 Russia
e-mail: [email protected], [email protected], [email protected]

Abstract—The ascending (bottom-up) approach to segmentation of scanned documents into background, text, and photograph regions is considered. In the first stage, the image is divided into blocks; for each block, a series of texture features is calculated, and the type of the block is determined from these features. Various positions and sizes of blocks, 26 texture features, and 4 block classification algorithms were considered. In the second stage, the type of each block is corrected on the basis of an analysis of neighboring regions. For estimating the results, the relative error matrix and the ICDAR 2007 criterion are used.

Keywords: segmentation, supervised learning, feature extraction.

DOI: 10.1134/S1054661813010136

Received March 10, 2011
INTRODUCTION

Segmentation is the process of dividing a digital image into several segments. In the case of scanned documents, such segments are zones containing text, pictures, and background. Segmentation is employed in optical character recognition (OCR) systems [1], mixed raster content (MRC) compression systems [2], image retrieval, and detection of text in photo/video streams.

Research groups all over the world are engaged in segmentation of documents, and many algorithms have been developed. Unfortunately, most of them provide good results only for a limited set of input images, because the analysis is based on specific features or assumptions, e.g., that the background is white and the figures and text blocks have a rigorously rectangular form. At present, much attention is paid to the development of universal approaches that provide good results for various types of documents: newspapers, magazines, documents, and articles in which the arrangement and orientation of regions are arbitrary, the background is nonhomogeneous, etc.

Two main approaches can be distinguished: ascending (bottom-up) [3–5] and descending (top-down) [7]. In ascending algorithms, the analysis begins from low-level objects, such as pixels, zones, and neighboring regions; the obtained objects are then connected and classified as regions of the document. Such algorithms process regions of complex form well but disregard high-level objects. On the contrary, descending algorithms start from the entire image and try to divide it into regions
of a particular type. Such algorithms are not always able to process regions of complex form and arrangement, e.g., nonrectangular text blocks or headings spanning several text columns. There are also hybrid algorithms, which combine descending and ascending approaches.

We propose an ascending algorithm for segmentation of a document into regions of text, pictures, and background. The algorithm was developed for halftone 8-bpp documents with a scan resolution of 300 dpi. In the first stage, the algorithm classifies blocks using texture features; it then corrects the obtained block classes on the basis of an analysis of neighboring regions. Figure 1 shows the scheme of the proposed algorithm.

ESTIMATION OF THE RESULTS

The result of classification by different algorithms is usually compared with a manually marked image, full coincidence with which is considered perfect segmentation. Two criteria are suggested for comparison: the fraction of correctly classified pixels and the criterion of the ICDAR Page Segmentation Competition 2007. In the framework of the ICDAR conference, the capacities of the existing algorithms are traditionally compared on a realistic data set in order to compare them and stimulate the development of a universal method of segmentation. The criterion and the set of test images are under continuous development.

Figure 2 shows a perfect markup and a result of segmentation. Two blocks have been merged, and the order of words and sentences in the OCR output has been lost; nevertheless, the fraction of correctly classified pixels is 0.99, whereas the ICDAR 2007 criterion gives a more adequate result for the OCR problem: 0.86.
Fig. 1. Scheme of the algorithm: 1st stage (partition into blocks, feature extraction); 2nd stage (classification, block type adjustment).
Fig. 2. Perfect markup and result of segmentation.
In the first case, the classification results are represented in the form of the relative error matrix [12] averaged over all images. Each row of the matrix corresponds to a class determined by the algorithm, and each column to a true class. The percentage of correctly classified pixels (PCCM) is the sum of the elements on the main diagonal of the matrix. The true values are taken from manually marked-up masks. This criterion has a number of shortcomings; in particular, errors of small area but of great importance are poorly taken into account.
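As a minimal illustration (not the authors' code), the PCCM can be computed in Python/NumPy as the trace of the relative error matrix; the values below are the variant 1 entries of Table 4.

```python
import numpy as np

# Relative error matrix: rows are the classes assigned by the algorithm
# (background, text, picture), columns are the true classes; each entry
# is a fraction of all pixels. Values taken from variant 1 of Table 4.
M = np.array([[0.38, 0.05, 0.01],
              [0.04, 0.34, 0.01],
              [0.00, 0.03, 0.14]])

pccm = np.trace(M)            # sum of the main diagonal
print(f"PCCM = {pccm:.2f}")   # 0.86
```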
An alternative is the criterion that was used in the ICDAR 2007 contest [13]. It is sensitive to missing and merging of zones, even of small area. In this criterion, a table of correspondence is created for each region; its values are pixelwise intersections of the sets resulting from segmentation with the true mask [14]. Let I be the set of all ON pixels of the image, G_j the set of all pixels inside the jth region of the true mask, R_i the set of all pixels inside the ith region of the result of segmentation, g_j the type of the jth region of the true mask, r_i the type of the ith region of the result of segmentation, and T(s) the function returning the number of elements in the set s. The cell MS(i, j) of the correspondence table is the result of comparing the ith region of the result of segmentation and the jth region of the true mask. As in [15], using the pixelwise approach, we can define the value of the cell of the table as

MS(i, j) = a \frac{T(G_j \cap R_i \cap I)}{T((G_j \cup R_i) \cap I)},   (1)

where a = 1 if g_j = r_i, and a = 0 otherwise.
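A sketch of formula (1) in NumPy, assuming regions are given as boolean pixel masks, might look as follows.

```python
import numpy as np

def ms_cell(G_j, R_i, I, g_j, r_i):
    """Correspondence-table cell MS(i, j) from Eq. (1); G_j, R_i, I are
    boolean pixel masks of equal shape, g_j and r_i are region types."""
    a = 1.0 if g_j == r_i else 0.0
    inter = (G_j & R_i & I).sum()     # T(G_j intersect R_i intersect I)
    union = ((G_j | R_i) & I).sum()   # T((G_j union R_i) intersect I)
    return a * inter / union if union else 0.0
```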
If N_i is the number of regions of the true mask belonging to type i, M_i is the number of regions of the result of segmentation belonging to type i, and w_1, w_2, w_3, and w_4 are predefined weights, then the detection rate DR and the recognition accuracy RA for type i can be found as

DR_i = w_1 \frac{one2one_i}{N_i} + w_2 \frac{g\_one2many_i}{N_i},   (2)

RA_i = w_3 \frac{one2one_i}{M_i} + w_4 \frac{d\_one2many_i}{M_i},   (3)

where the variables one2one_i, g_one2many_i, and d_one2many_i are calculated from correspondence table (1) in accordance with the steps of [15] for each type i.
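A sketch of formulas (2) and (3), assuming the one2one, g_one2many, and d_one2many counts have already been derived from the correspondence table by the steps of [15]; the default weight values are placeholders, since the paper only says they are predefined.

```python
def detection_rate(one2one, g_one2many, N_i, w1=1.0, w2=0.25):
    """Eq. (2): detection rate for one region type.
    w1, w2 are placeholder values, not taken from the paper."""
    return w1 * one2one / N_i + w2 * g_one2many / N_i

def recognition_accuracy(one2one, d_one2many, M_i, w3=1.0, w4=0.25):
    """Eq. (3): recognition accuracy for one region type."""
    return w3 * one2one / M_i + w4 * d_one2many / M_i
```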
The efficacy of the segmentation for each type, the entity detection metric (EDM), can be obtained by combining the values of the detection rate and the recognition accuracy:

EDM_i = \frac{2 DR_i RA_i}{DR_i + RA_i}.   (4)

Combining all the values of the detection rate and the recognition accuracy, we can obtain a global estimate of the efficiency. We define the segmentation metric SM as the weighted mean of all EDM_i:

SM = \frac{\sum_i N_i EDM_i}{\sum_i N_i}.   (5)

QUANTITATIVE FEATURES OF TEXTURES

The image I is originally divided into N × N pixel blocks. For each block, a number of quantitative characteristics are calculated. In the experiments, the following quantitative characteristics of textures (and their variations) were calculated:

the mean brightness,

\overline{B}_i = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} B_i(r, c),   (6)

where B_i(r, c) is the pixel intensity, and r and c are the indices running through the current block;

the mean-square deviation,

\sigma_i = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} (\overline{B}_i - B_i(r, c))^2,   (7)

where B_i(r, c) is the pixel intensity, r and c are the indices running through the current block, and \overline{B}_i is the mean brightness;

the mean difference of the mean brightnesses of blocks B_k in a 4-connected vicinity of B_i,

dB_i = \frac{1}{4} \sum_{k=1}^{4} |\overline{B}_i - \overline{B}_k|,   (8)

where \overline{B}_i is the mean brightness of the current block and \overline{B}_k is the mean brightness of a neighboring block;

the mean vertical, dB_y, and horizontal, dB_x, derivatives of a block,

dB_{x,y}^{i} = \frac{\sum_{r=1}^{N-1} \sum_{c=1}^{N} dB_x(r, c) + \sum_{r=1}^{N} \sum_{c=1}^{N-1} dB_y(r, c)}{2N(N - 1)},   (9)

where r and c are the indices running through the current block,

dB_x(r, c) = (|B_i(r + 1, c) - B_i(r, c)| + |B_i(r - 1, c) - B_i(r, c)|)/2,   (10)

dB_y(r, c) = (|B_i(r, c + 1) - B_i(r, c)| + |B_i(r, c - 1) - B_i(r, c)|)/2,   (11)

and B_i(r, c) is the pixel intensity.

The mean maximum difference between the values of points taken in the vertical (horizontal) direction for each vertical (horizontal) line of the block is

dB_{x,y}^{\max} = \frac{\sum_{r=1}^{N} (B_{\max}(r) - B_{\min}(r)) + \sum_{c=1}^{N} (B_{\max}(c) - B_{\min}(c))}{2N},   (12)

where

B_{\max}(r) = \max(B_i(r, c), c = 1 \dots N),   (13)

B_{\max}(c) = \max(B_i(r, c), r = 1 \dots N),   (14)

B_{\min}(r) = \min(B_i(r, c), c = 1 \dots N),   (15)

B_{\min}(c) = \min(B_i(r, c), r = 1 \dots N).   (16)

The mean distance to the most differing pixel in a block, calculated for each pixel of the block, is

DistPix_{B_i} = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} DistPix(r, c),   (17)

where r and c are the indices running through the current block and

DistPix(r, c) = \sqrt{(c_m - c)^2 + (r_m - r)^2},   (18)

where c_m and r_m are such that |B_i(r_m, c_m) - B_i(r, c)| = \max(|B_i(u, v) - B_i(r, c)|, u = 1 \dots N, v = 1 \dots N), and B_i(r, c) is the value of the pixel intensity.

The mean distance to the most differing pair of pixels in a block, calculated for each pixel of the block, is

DistPair_{B_i} = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} DistPair(r, c),   (19)

where r and c are the indices running through the current block and

DistPair(r, c) = \sqrt{(c_m - c)^2 + (r_m - r)^2},   (20)

where c_m and r_m are such that |B_i(r_m, c_m) - B_i(r_m, c_m - 1)| = \max(|B_i(u, v) - B_i(u, v - 1)|, u = 1 \dots N, v = 2 \dots N), and B_i(r, c) is the value of the pixel intensity.

The characteristics calculated from the co-occurrence matrix with the bias d = [0, -1] for a block are

Energy = \sum_i \sum_j N_d^2[i, j],   (21)

Homogeneity = \sum_i \sum_j \frac{N_d[i, j]}{1 + |i - j|},   (22)

Contrast = \sum_i \sum_j (i - j)^2 N_d[i, j],   (23)

where N_d is the normalized co-occurrence matrix.

The fraction of variations in the pixel values of a binarized block after the morphological opening operation is

P_m = \frac{1}{N^2} \sum_{\forall (r, c) \in B_i} \{1 \mid B_i^{b}(r, c) \neq B_i^{o}(r, c)\},   (24)

where B_i^{b}(r, c) are the pixel values of the block after binarization and B_i^{o}(r, c) are the pixel values of the block after the morphological opening operation. In this work, we used a 3 × 3 rectangular structuring element.

The mean difference of a pixel and its 4-connected neighbors, calculated for each pixel of the block, is

AvgDPix_{B_i} = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} AvgDPix(r, c),   (25)

where

AvgDPix(r, c) = (|B_i(r + 1, c) - B_i(r, c)| + |B_i(r - 1, c) - B_i(r, c)| + |B_i(r, c + 1) - B_i(r, c)| + |B_i(r, c - 1) - B_i(r, c)|)/4,   (26)

and B_i(r, c) is the pixel intensity.

The mean maximum difference of a pixel and its 4-connected neighbors, calculated for each pixel of a block, is

MaxDPix_{B_i} = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} MaxDPix(r, c),   (27)

where

MaxDPix(r, c) = \max(|B_i(r + 1, c) - B_i(r, c)|, |B_i(r - 1, c) - B_i(r, c)|, |B_i(r, c + 1) - B_i(r, c)|, |B_i(r, c - 1) - B_i(r, c)|),   (28)

and B_i(r, c) is the pixel intensity.

The mean modulus of the gradient is

G = \frac{1}{N^2} \sum_{\forall (r, c) \in B_i} |\nabla_{xy} B_i(r, c)|,   (29)

where

|\nabla_{xy} B_i(r, c)| = \sqrt{dB_x(r, c)^2 + dB_y(r, c)^2},   (30)

and dB_x and dB_y are calculated by formulas (10) and (11).

The percentage of pixels with the modulus of the gradient above a threshold T is

P_g = \frac{1}{N^2} \sum_{\forall (r, c) \in B_i} \{1 \mid |\nabla_{xy} B_i(r, c)| > T\},   (31)

where |\nabla_{xy} B_i(r, c)| is calculated by formula (30).
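To make the notation concrete, here is a minimal NumPy sketch of a few of these features, Eqs. (6), (7), (9)-(11), and (29)-(31); it evaluates derivatives on interior pixels only, and the gradient threshold T is a hypothetical value, since the paper does not state one.

```python
import numpy as np

def block_features(B):
    """A few texture features for one N x N block B of pixel intensities."""
    N = B.shape[0]
    mean_brightness = B.sum() / N**2                    # Eq. (6)
    sigma = ((mean_brightness - B) ** 2).sum() / N**2   # Eq. (7)

    # Eqs. (10), (11): mean absolute differences with the two vertical
    # (horizontal) neighbors; interior pixels only, for simplicity.
    dBx = (np.abs(B[2:, 1:-1] - B[1:-1, 1:-1]) +
           np.abs(B[:-2, 1:-1] - B[1:-1, 1:-1])) / 2.0
    dBy = (np.abs(B[1:-1, 2:] - B[1:-1, 1:-1]) +
           np.abs(B[1:-1, :-2] - B[1:-1, 1:-1])) / 2.0
    mean_derivative = (dBx.sum() + dBy.sum()) / (2.0 * N * (N - 1))  # Eq. (9)

    grad = np.sqrt(dBx**2 + dBy**2)       # Eq. (30)
    G = grad.sum() / N**2                 # Eq. (29)
    T = 32.0                              # hypothetical threshold value
    Pg = (grad > T).sum() / N**2          # Eq. (31)
    return mean_brightness, sigma, mean_derivative, G, Pg
```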
The most informative features were selected using the AdaBoost machine learning algorithm [8]. This algorithm is adaptive in the sense that each subsequent classifier is constructed from objects wrongly classified by previous classifiers. AdaBoost calls a weak classifier in a loop; after each call, the distribution of weights corresponding to the importance of each object of the training sample for classification is updated. In each iteration, the weights of wrongly classified objects increase, so the new classifier "focuses its attention" on them. We employed the AdaBoost toolbox, which comprises various modifications of AdaBoost; for selecting the most informative features, Real AdaBoost was used. It turned out that the most informative features were the following: the mean brightness of the block B_i; the mean difference of the mean brightnesses of blocks B_k in the 4-connected vicinity of the block B_i;
the mean of the vertical, dB_y^i, and horizontal, dB_x^i, derivatives of the block; the homogeneity of the block B_i; the percentage of pixels with the gradient above the threshold; and the fraction of variations in pixel values after the morphological opening operation.

CHOOSING THE CLASSIFICATION ALGORITHM

For classification of blocks, we propose to use supervised learning methods. The following algorithms were used for classification: AdaBoost, the support vector machine (SVM) [9], the k-nearest neighbor method [10], and artificial neural networks [11].

The basic idea of the support vector machine is to transfer the original vectors to a space of higher dimension and to seek a separating hyperplane with the maximum gap in this space. Two parallel hyperplanes are constructed on both sides of the hyperplane separating the classes; the separating hyperplane is the one maximizing the distance to the two parallel hyperplanes. The algorithm works under the assumption that the greater the distance between these parallel hyperplanes, the smaller the mean error of the classifier. For working with the SVM, the library http://www.csie.ntu.edu.tw/~cjlin/libsvm was used.

The k-nearest neighbor method is a metric algorithm based on estimating the similarity of objects: a classified object belongs to the class to which the nearest objects of the training sample belong. For working with the k-nearest neighbor method, the library http://www.cs.umd.edu/~mount/ANN was used.

A neural network is a mathematical model, together with its software and hardware implementations, constructed in some similarity to the neural networks of living organisms. For working with neural networks, the library http://leenissen.dk/fann was used.

For each of the algorithms, the best parameters were sought, i.e., the parameters with which the result of classification on a training sample is best. For the k-nearest neighbor method, the optimal number of neighbors over which the classification is performed was chosen. For AdaBoost, various modifications of the algorithm offered by the toolbox were tried. For neural networks, the form of the activation functions, the number of layers, and the number of neurons in hidden layers were chosen. For the SVM, the kernel functions and parameters were chosen.

After choosing the parameters and training, the algorithms were tested on the problem of determining text blocks. For representing the results, the accuracy reflecting the percentage of correct classification is used [6]. The results are summarized in Table 1. All the classification algorithms successfully solved the classification problem and gave approximately the same results, which can be slightly altered by varying the parameters.
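As an illustration of such a comparison, a sketch using scikit-learn stand-ins follows; this is an assumption, since the authors used a MATLAB AdaBoost toolbox, LIBSVM, ANN, and FANN. Here X is the matrix of block features and y the block labels, loaded from hypothetical files.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# X: (n_blocks, n_features) texture features; y: block labels.
X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf", C=1.0),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {100 * acc:.1f}%")
```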
Table 1. Comparison of the algorithms in the problem of determining the class of a block

    Algorithm        Accuracy, %
    KNN              93.0
    Neural networks  90.1
    AdaBoost         93.1
    SVM              93.2

Table 2. Comparison of the arrangements of blocks

    Arrangement of blocks  PCCM   SM
    Tiles                  0.85   0.20
    Overlapping            0.86   0.20
    Small blocks           0.86   0.16

Table 3. Comparison of the efficiency of various arrangements of blocks

    Arrangement of blocks  Time, s
    Tiles                  6
    Overlapping            25
    Small blocks           24

Table 4. Relative error matrix for the proposed algorithm, variant 1/variant 2

    Classification  True background  True text   True picture
    Background      0.38/0.38        0.05/0.06   0.01/0.00
    Text            0.04/0.03        0.34/0.33   0.01/0.01
    Picture         0.00/0.01        0.03/0.04   0.14/0.14
Therefore, it is the informative features that deserve the maximum attention in solving classification problems. AdaBoost was chosen for further application due to its relative simplicity and high operating speed.

THE ARRANGEMENT OF BLOCKS

We used three possible variants of arrangement of blocks: tiles, a small block inside a large one, and intersecting (overlapping) blocks (Fig. 3). The comparison of the different arrangements of blocks is presented in Table 2. The use of arrangements other than tiles makes a small gain in accuracy possible, but it requires additional computations. The timing results in Table 3 were obtained for a document with a resolution of 3507 × 2480 pixels on a Benq S42 notebook (Core 2 Duo P8400).
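A minimal sketch of the tiled and overlapping arrangements follows; classify_block is a hypothetical callable, and the block size of 32 is an assumption, since the paper does not fix N.

```python
def classify_blocks(img, classify_block, n=32, step=32):
    """Classify all blocks of a 2D image: step == n gives a tiling,
    step < n gives overlapping blocks. classify_block is a hypothetical
    callable mapping an n x n block to a class label."""
    h, w = img.shape
    return {(r, c): classify_block(img[r:r + n, c:c + n])
            for r in range(0, h - n + 1, step)
            for c in range(0, w - n + 1, step)}

# One reading of "a small block inside a large one": assign the label of
# the inner n x n block from features of a surrounding 2n x 2n window,
# e.g., classify_block(img[r - n//2 : r + 3*n//2, c - n//2 : c + 3*n//2]).
```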
Fig. 3. Arrangement of blocks: (a) tiles; (b) a small block inside a large one; (c) overlapping blocks.

Fig. 4. Decision tree uniting the classifiers C_t and C_i for separation into three classes.

CLASSIFICATION OF BLOCKS

The type of a block may be assigned to one of three classes: picture, background, and text. To that end, two AdaBoost classifiers have been constructed: one for detecting photographs, C_i, and the other for detecting texts, C_t. Figure 4 shows the decision tree uniting both classifiers for separation into the three classes. Each classifier was trained on a sample of 57 documents of different types and from different sources. If both classifiers C_t and C_i give a positive result, we choose the classifier whose response is farther, in absolute value, from the corresponding threshold T_t or T_i. If both results are negative, the block is classified as background.
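A sketch of this decision rule, assuming hypothetical scores score_t and score_i returned by the two classifiers, could look as follows.

```python
def block_class(score_t, score_i, T_t, T_i):
    """Combine the text classifier C_t and the picture classifier C_i
    (Fig. 4): a result is positive when its score exceeds its threshold."""
    text, picture = score_t > T_t, score_i > T_i
    if text and picture:
        # Both positive: trust the classifier farther from its threshold.
        return "text" if abs(score_t - T_t) > abs(score_i - T_i) else "picture"
    if text:
        return "text"
    if picture:
        return "picture"
    return "background"
```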
CORRECTION OF CLASSES OF BLOCKS

As a result of classification, errors sometimes occur: separate blocks are classified wrongly. For obtaining homogeneous regions and suppressing noise when the class of a block differs from the classes of neighboring blocks, we propose two approaches. For a slight correction, the following is proposed: the number of neighbors of each class is counted, and if the number of neighbors of some class exceeds 6, the block is reclassified. This iteration is performed repeatedly. The smaller the window, the more inhomogeneous and noisy the result. If more homogeneous regions are required, e.g., for MRC compression, the following approach is proposed: each classified block is represented as a zone, where a zone is understood as a rectangle with a certain type. Each zone is then extended to the left, right, upward, and downward until the number of blocks of another type inside the zone is less than a definite percentage of its area. Then the blocks are reclassified: a block takes the type of the zone of maximum area to which it belongs.
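A sketch of the slight-correction rule follows, under the assumptions that block labels form a 2D grid of small nonnegative integers and that the number of iterations (here 5) is a free parameter the paper does not specify.

```python
import numpy as np

def correct_block_classes(labels, n_iter=5):
    """If more than 6 of the 8 neighbors in a 3 x 3 window share one class,
    the central block is reclassified to it; repeated n_iter times."""
    labels = labels.copy()
    for _ in range(n_iter):
        padded = np.pad(labels, 1, mode="edge")
        out = labels.copy()
        for r in range(labels.shape[0]):
            for c in range(labels.shape[1]):
                window = padded[r:r + 3, c:c + 3].ravel()
                neighbors = np.delete(window, 4)   # drop the central block
                counts = np.bincount(neighbors)
                if counts.max() > 6:               # "exceeds 6" neighbors
                    out[r, c] = counts.argmax()
        labels = out
    return labels
```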
RESULTS

The operation of the algorithm was tested on a sample of 174 scanned images of newspapers, documents, and magazines of various types. Examples of segmentation are shown in Fig. 5. Table 4 presents the error matrix averaged over all images of the collection. The PCCM for variant 1 is 86%, and that for variant 2 is 85%. The SM averaged over all images is 0.20 for variant 1 and 0.18 for variant 2. For FineReader 9.0, the PCCM is 82% and the SM is 0.32.
Fig. 5. Examples of segmentation of two documents: (black) text, (white) background, and (gray) picture.
CONCLUSIONS

The proposed algorithm correctly classifies a rather high percentage of the pixels of a document but sometimes misses or merges regions of different types. To overcome this shortcoming, we propose to revise the stage of correcting the types of blocks and to take the neighborhood into account during the classification of blocks. The search for new informative features will also make it possible to improve the results. In order to remove the dependence of the algorithm on the scan resolution, it is necessary to remove the dependence of the features on the resolution. It should be noted that the final result is also influenced by the similarity between the segmented documents and the training sample.
REFERENCES

1. Zh. Lu, I. Bazzi, A. Kornai, J. Makhoul, P. Natarajan, and R. Schwartz, "A Robust, Language-Independent OCR System," Proc. SPIE Electron. Imaging 3584, 96–104 (1999).
2. R. L. de Queiroz, R. Buckley, and M. Xu, "Mixed Raster Content (MRC) Model for Compound Image Compression," in Proc. Conf. on Visual Communications and Image Processing, Proc. SPIE Electron. Imaging 3653, 1106–1117 (1999).
3. J. J. Sauvola and M. Pietikäinen, "Page Segmentation and Classification Using Fast Feature Extraction and Connectivity Analysis," in Proc. Int. Conf. on Document Analysis and Recognition (Montreal, 1995), pp. 1127–1131.
4. F. Wahl, K. Wong, and R. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents," Comput. Graph. Image Processing 20, 375–390 (1982).
5. H. S. Baird, M. A. Moll, Chang An, and M. R. Casey, "Document Image Content Inventories," in Proc. SPIE/IS&T Document Recognition and Retrieval Conf. (San Jose, 2007).
6. L. G. Shapiro and G. C. Stockman, Computer Vision (BINOM Knowledge Laboratory, Moscow, 2006) [in Russian].
7. F. Cesarini, S. Marinai, G. Soda, and M. Gori, "Structured Document Segmentation and Representation by the Modified X–Y Tree," in Proc. Int. Conf. on Document Analysis and Recognition (Bangalore, 1999), p. 563.
8. R. E. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," in Proc. 11th Annu. Conf. on Computational Learning Theory (Madison, 1998).
9. P. H. Chen, C. J. Lin, and B. Schölkopf, "A Tutorial on ν-Support Vector Machines," Appl. Stoch. Models Bus. Ind. 21, 111–136 (2005).
10. T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Inf. Theory 13 (1), 21–27 (1967).
11. H. T. Siegelmann and E. D. Sontag, "Turing Computability with Neural Nets," Appl. Math. Lett. 4 (6), 77–80 (1991).
12. http://en.wikipedia.org/wiki/Confusion_matrix
13. A. Antonacopoulos, B. Gatos, and D. Bridson, "ICDAR2007 Page Segmentation Competition," in Proc. ICDAR 2007 (Curitiba, 2007), pp. 1279–1283.
14. B. A. Yanikoglu and L. Vincent, "Pink Panther: A Complete Environment for Ground-Truthing and Benchmarking Document Page Segmentation," Pattern Recogn. 31 (9), 1191–1204 (1998).
15. I. Phillips and A. Chhabra, "Empirical Performance Evaluation of Graphics Recognition Systems," IEEE Trans. Pattern Anal. Mach. Intell. 21 (9), 849–870 (1999).
Aleksei M. Vil'kin. Fifth-year student at the National Research Nuclear University MEPhI. Author of 3 publications on document analysis. Scientific interests: image processing, pattern recognition, and document analysis.

Il'ya V. Safonov. In 1994, received the MS degree in automatic and electronic engineering from the National Research Nuclear University MEPhI and, in 1997, the PhD degree in computer science from the same university. Since 1998, Associate Professor in the Faculty of Cybernetics, MEPhI, engaged in research on image segmentation, feature extraction, and pattern recognition. In 2004, joined the Samsung Research Center, Moscow, Russia, where he is engaged in photo, video, and document image enhancement projects. Author of more than 120 publications on image processing and pattern recognition.

Marta A. Egorova received her MS degree in cybernetics from the Moscow Engineering Physics Institute/State University (MEPhI), Russia, in 2008. At present, she is a postgraduate student at MEPhI. Her current research interests include image quality estimation, image enhancement, pattern recognition, and machine learning. She has 16 publications on image processing and pattern recognition.