Combining Fuzzy Clustering and Morphological Methods for Old Documents Recovery

João R. Caldas Pinto1, Lourenço Bandeira1, João M. C. Sousa1, Pedro Pina2

1 IDMEC, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
{jcpinto,lpcbandeira,j.sousa}@dem.ist.utl.pt
2 CVRM / Geo-Systems Centre, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
[email protected]

Abstract. In this paper we tackle the specific problem of old documents recovery. Spots, print through, underlines and other ageing features are undesirable not only because they harm the visual appearance of the document, but also because they affect subsequent Optical Character Recognition (OCR). This paper proposes a new method integrating fuzzy clustering of the color properties of the original images and mathematical morphology. We show that this technique leads to higher quality recovered images and, at the same time, delivers cleaned binary text for OCR applications. The proposed method was applied to books of the XIX century, which were cleaned very effectively.

1 Introduction

The universal availability and on-demand supply of digital duplicates of large amounts of written text are two inevitable paths of the future. To achieve this goal, books first have to be digitized. Because most of them are rare, this operation should be carried out only once, so high resolution color images are acquired. These large images will be the future source data. The first natural option for libraries is to make reduced versions of these images available on the Internet. However, size is not the only problem: the visual appearance of the books can also be poor due to the ageing process. Spots due to humidity, marks from ink that has gone through the paper, rubber stamps, pen strokes and underlines are features that, in general, are not desirable. On the other hand, an ultimate goal of libraries is to obtain ASCII versions of the books, which means performing optical character recognition (OCR). While this is a well-established process for images of text written in modern fonts on clean paper, it is still a challenging problem for degraded old documents. Thus, every contribution that improves the quality of the images is important to the success of this process.

In this paper we address the problem of document image enhancement. In general, an old book contains a small number of colors, typically only two: one for the ink and another for the background, the paper color. However, with the ageing process colors change, the ink becomes lighter and the background yellowed, and other colors emerge, some due to natural causes, like humidity or print through, and others due to human actions, such as handwritten pencil notes, rubber stamps or underlines. This is illustrated in Fig. 1. This suggests that images can be segmented by color using clustering algorithms, like fuzzy clustering [1-3]. On the other hand, from the geometric point of view, characters are quite distinct from the other elements present in a page. This fact can be exploited through a mathematical morphology approach [4, 5].

This paper proposes a combination of fuzzy clustering of the original color images followed by a mathematical morphology step for removing residual geometrical artifacts. Applying our method we achieve two goals: an improved document appearance and a high quality binary image for further application of OCR, using for example the FineReader Engine [6].

This paper is organized as follows. Section 2 describes the fuzzy clustering techniques used in this paper, and their application to document segmentation. Section 3 describes the application of mathematical morphology to the same problem. The proposed recovering algorithm is presented in Section 4, where the obtained results are described and discussed. Finally, Section 5 presents the conclusions and possible future work.

Fig. 1. Example of a printed page with common problems: (a) natural ageing processes, (b) human manipulation.

2 Fuzzy Clustering

Fuzzy clustering (FC) in the Cartesian product space is applied to partition the data into subsets. Cluster analysis classifies objects according to similarities amongst them. In image analysis, clustering finds relationships between the system properties (colors in this paper).

Let $\{x_1, \ldots, x_N\}$ be a set of $N$ data objects where $x_k \in \mathbb{R}^n$. The set of data objects can then be represented as an $N \times n$ data matrix $X$. The fuzzy clustering algorithms determine a fuzzy partition of $X$ into $C$ clusters. Let $U = [\mu_{ik}] \in [0,1]^{N \times C}$ denote a fuzzy partition matrix of $X$. Often, the cluster prototypes are points in the cluster space, i.e. $v_i \in \mathbb{R}^n$. The elements $\mu_{ik} \in [0,1]$ of $U$ represent the membership of data object $x_k$ in cluster $i$. Let $V$ be a vector of cluster prototypes (centers) to be determined, defined by $V = [v_1, v_2, \ldots, v_C]$.

Many clustering algorithms are available for solving for $U$ and $V$ iteratively. The fuzzy c-means algorithm is well known and has been shown to give good results [1]. This algorithm does not directly determine the optimal number of clusters. This paper uses simple heuristics to determine the correct number of clusters, in order to reduce the number of colors classifying the different samples of text images. The fuzzy c-means algorithm searches for an optimal fuzzy partition $U$ and for the prototype matrix of cluster means $V$. In other words,

$$(X, C) \xrightarrow{\text{clustering}} (U, V) \qquad (1)$$

The optimization minimizes the following objective function,

$$J(X; U, V) = \sum_{i=1}^{C} \sum_{k=1}^{N} (\mu_{ik})^m \, d_{ik}^2 \qquad (2)$$

where $m$ is a weighting parameter. The function $d_{ik}$ is the distance of a data point $x_k$ to the cluster prototype $v_i$: $d_{ik}^2 = (x_k - v_i)^T (x_k - v_i)$.

The fuzzy c-means algorithm can be described as follows. Given the data $X$, choose the number of clusters $1 < K < N$ (denoted $C$ above), the fuzziness parameter $m > 1$ and the termination criterion $\epsilon > 0$. Initialize $U^{(0)}$ (e.g. random).

Repeat for $l = 1, 2, \ldots$

Step 1: Compute cluster means:

$$v_i^{(l)} = \frac{\sum_{k=1}^{N} \left(\mu_{ik}^{(l-1)}\right)^m x_k}{\sum_{k=1}^{N} \left(\mu_{ik}^{(l-1)}\right)^m}, \quad 1 \le i \le K \qquad (3)$$

Step 2: Compute distances for $1 \le i \le K$, $1 \le k \le N$:

$$d_{ik}^2 = \left(x_k - v_i^{(l)}\right)^T \left(x_k - v_i^{(l)}\right) \qquad (4)$$

Step 3: Update partition matrix for $1 \le i \le K$, $1 \le k \le N$:

$$\mu_{ik}^{(l)} = \frac{1}{\sum_{j=1}^{K} \left(d_{ik} / d_{jk}\right)^{2/(m-1)}} \qquad (5)$$

until $\| U^{(l)} - U^{(l-1)} \| < \epsilon$.
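The iteration in Eqs. (3)-(5) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used for the experiments: the function name `fuzzy_c_means`, the default values of `m`, `eps` and `max_iter`, the use of the maximum absolute change as the matrix norm, the handling of a point that coincides with a prototype, and the sample pixel values are all assumptions. The partition matrix is stored as `U[i][k]` (cluster `i`, data point `k`), following the index order of $\mu_{ik}$.

```python
import random

def fuzzy_c_means(X, K, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Cluster the points in X (list of equal-length float lists) into K fuzzy clusters.

    Returns (U, V): U[i][k] is the membership of point k in cluster i,
    V[i] is the prototype (centre) of cluster i.
    """
    rng = random.Random(seed)
    N, n = len(X), len(X[0])
    # Random initial partition U^(0); the memberships of each point sum to 1.
    U = [[rng.random() + 1e-9 for _ in range(N)] for _ in range(K)]
    for k in range(N):
        total = sum(U[i][k] for i in range(K))
        for i in range(K):
            U[i][k] /= total
    V = []
    for _ in range(max_iter):
        # Step 1, Eq. (3): prototypes are means weighted by mu^m.
        V = []
        for i in range(K):
            w = [U[i][k] ** m for k in range(N)]
            w_sum = sum(w)
            V.append([sum(w[k] * X[k][j] for k in range(N)) / w_sum for j in range(n)])
        # Step 2, Eq. (4): squared Euclidean distances d_ik^2.
        D2 = [[sum((X[k][j] - V[i][j]) ** 2 for j in range(n)) for k in range(N)]
              for i in range(K)]
        # Step 3, Eq. (5): update the partition matrix.
        U_new = [[0.0] * N for _ in range(K)]
        for k in range(N):
            hits = [i for i in range(K) if D2[i][k] == 0.0]
            if hits:
                # Assumption: a point lying exactly on a prototype belongs fully to it.
                for i in hits:
                    U_new[i][k] = 1.0 / len(hits)
                continue
            for i in range(K):
                # (d_ik / d_jk)^(2/(m-1)) written with squared distances.
                U_new[i][k] = 1.0 / sum((D2[i][k] / D2[j][k]) ** (1.0 / (m - 1.0))
                                        for j in range(K))
        # Termination test ||U^(l) - U^(l-1)|| < eps, here with the max-abs norm.
        change = max(abs(U_new[i][k] - U[i][k]) for i in range(K) for k in range(N))
        U = U_new
        if change < eps:
            break
    return U, V

# Hypothetical RGB samples: three dark "ink" pixels and three light "paper" pixels.
pixels = [[20.0, 20.0, 25.0], [30.0, 25.0, 20.0], [25.0, 30.0, 30.0],
          [230.0, 220.0, 190.0], [235.0, 225.0, 200.0], [225.0, 215.0, 195.0]]
U, V = fuzzy_c_means(pixels, K=2)
# Hard label of each pixel: the cluster with the highest membership.
labels = [max(range(2), key=lambda i: U[i][k]) for k in range(len(pixels))]
```

For a page image, `X` would hold one color vector per pixel; a vectorized implementation would be preferable at that scale, but the update order is the same.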

3 Mathematical Morphology

In the present context, Mathematical Morphology (MM) is applied to remove handwritten underlines before the OCR phase and is based on a previous study [7]. This phase consists of two main steps: the first reinforces the text set, whose segmentation sometimes produces irregular and broken characters, while the second suppresses the underlines.

Step 1: Text character reinforcement. This objective is achieved by first applying a closing ($\varphi$) with the structuring element $B_1$ to the initial binary image $X$, resulting from the fuzzy clustering phase, in order to reinforce the characters:

$$Y_1 = \varphi^{B_1}(X) \qquad (6)$$

Unwanted structures smaller than text characters (as defined by the dimension of the structuring element $B_2$) must also be filtered out. This is obtained by an erosion ($\varepsilon$) followed by a reconstruction ($R$). The result is the set $Y_2$:

$$Y_2 = R_{Y_1}\left[\varepsilon^{B_2}(Y_1)\right] \qquad (7)$$

Step 2: Handwritten underline removal. The handwritten underlines are marked by applying directional openings $\gamma^{l}(Y_2)$ with a segment $l$ as structuring element in the horizontal direction (0 degrees). Only horizontal or sub-horizontal structures can resist this transform, totally or partially [8]. The partial directional reconstruction ($DR$) of the remaining regions recovers the underlines: the geodesic dilation uses a directional structuring element and is applied until idempotence is reached. The set difference with the image $Y_2$ filters out the handwritten underlines. This sequence is summed up in the following equation:

$$Y_3 = Y_2 \setminus DR_{Y_2}\left[\gamma^{l(0^\circ)}(Y_2)\right] \qquad (8)$$

In order to recover the regions of the characters suppressed by the elimination of the handwritten underlines (regions where letters and underlines are superimposed), a dilation $\delta^{l}$ in the vertical direction is applied. It partially recovers these common regions without reconstructing the handwritten structures again:

$$Y_4 = \delta^{l(90^\circ)}(Y_3) \qquad (9)$$

The resulting set is the binary image to be fed into the OCR system.
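The sequence of Eqs. (6)-(9) can be illustrated with a small pure-Python sketch in which a binary image is a set of `(row, column)` foreground pixels and a structuring element is a list of offsets. Everything specific here is an assumption made for illustration: the helper names, the shapes and sizes of the structuring elements, the 8-connected reconstruction in Eq. (7), the 3-pixel horizontal segment used for the directional reconstruction, and the toy image. All structuring elements contain the origin, which the `erode` helper relies on.

```python
def square(size):
    """Offsets of a size x size square centred on the origin (size odd)."""
    h = size // 2
    return [(dr, dc) for dr in range(-h, h + 1) for dc in range(-h, h + 1)]

def segment(length, vertical=False):
    """Offsets of a centred line segment (length odd), horizontal by default."""
    h = length // 2
    return [(d, 0) if vertical else (0, d) for d in range(-h, h + 1)]

def dilate(img, se):
    """Binary dilation: union of the image translated by every offset in se."""
    return {(r + dr, c + dc) for (r, c) in img for (dr, dc) in se}

def erode(img, se):
    """Binary erosion (se must contain the origin): keep pixels where se fits."""
    return {(r, c) for (r, c) in img
            if all((r + dr, c + dc) in img for (dr, dc) in se)}

def reconstruct(marker, mask, se):
    """Geodesic dilation of marker inside mask, repeated until idempotence."""
    current = marker & mask
    while True:
        grown = dilate(current, se) & mask
        if grown == current:
            return current
        current = grown

def remove_underlines(X, b1, b2, line, heal):
    y1 = erode(dilate(X, b1), b1)                     # Eq. (6): closing with B1
    y2 = reconstruct(erode(y1, b2), y1, square(3))    # Eq. (7): erosion + reconstruction
    opened = dilate(erode(y2, line), line)            # horizontal directional opening
    underlines = reconstruct(opened, y2, segment(3))  # directional reconstruction
    y3 = y2 - underlines                              # Eq. (8): set difference
    return dilate(y3, heal)                           # Eq. (9): vertical dilation

# Toy image: a vertical stroke with a one-pixel hole, crossed by a two-pixel-thick
# underline, plus an isolated speck.
X = {(r, c) for r in range(2, 13) for c in range(2, 5)} - {(5, 3)}
X |= {(r, c) for r in (10, 11) for c in range(20)}
X.add((2, 15))
b2 = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y4 = remove_underlines(X, square(3), b2, segment(9), segment(3, vertical=True))
```

In this toy case the speck and the underline are removed, while the part of the stroke that overlapped the underline is restored by the vertical dilation. The price, visible in the result, is that the stroke also grows by one pixel at its top and bottom ends.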

4 Results

Integrating both methods allows us to take advantage of the most positive aspects of each approach. The integration method uses the binarization of the output image of the FC step as the input image of the MM step, as can be seen in Fig. 2. In this way, the input image of the MM step is the result of a very efficient segmentation process. The result of the MM step is a high quality cleaned binary image that can also be used for OCR applications [9].

In order to test the proposed technique, fuzzy clustering (FC) combined with mathematical morphology (MM), several experiments were conducted. The performance of each step was tuned and evaluated by visually inspecting the preprocessed word images. With the FC approach, three clusters have been used, because three distinct regions can be easily identified with the human eye: the background, the characters and the image defects, such as underlines, humidity spots, see-through letters, etc. The segmentation results obtained with the FC are presented in Fig. 3. Each cluster is classified through the analysis of the average and variance of every pixel whose membership to that cluster is higher than 85%. The characters (represented by the darker cluster) and the background (represented by the lighter cluster) can be easily identified by the cluster's average. With the background's variance, an artificial background was reproduced using a Gaussian distribution with the known parameters. To achieve this purpose we suggest a solution based on the fact that histograms corresponding to the RGB components of a homogeneous region approximately follow a Gaussian distribution with standard deviation smaller than 10 [10]. Results confirm the correctness of this approach (see Fig. 2 and Fig. 4). Note that if we wish to keep some other image features, like handwritten notes, we only need to keep a higher number of resulting clusters.

An overall inspection of the images obtained by the FC step (Fig. 4(a)) and the MM step (Fig. 4(b)) leads to the conclusion that both methods are very satisfactory in removing features, since all of the undesired color/structured image defects are removed. The FC step normally produces clean background images with a good visual aspect. However, in some situations there is over-filtering, which suppresses some pixels of some words. The MM step not only helps to remove undesired residual structural elements but also corrects some of these "damages" introduced in the characters by the FC. Comparing all four images in Fig. 4, it can be seen that the integration algorithm takes advantage of the most positive aspects of both methods.
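The cluster classification and the artificial background described in this section can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: the function names, the use of the mean over the RGB channels as the brightness of a cluster, the clamping of generated values to the 0-255 range, and the sample pixels and memberships are assumptions. Only the 85% membership threshold and the Gaussian model of the background come from the text.

```python
import random
import statistics

def cluster_stats(pixels, memberships, threshold=0.85):
    """Per-channel mean and standard deviation of the pixels whose membership
    to the cluster exceeds the threshold."""
    core = [p for p, mu in zip(pixels, memberships) if mu > threshold]
    if not core:
        raise ValueError("no pixel exceeds the membership threshold")
    channels = list(zip(*core))
    means = [statistics.mean(ch) for ch in channels]
    stds = [statistics.pstdev(ch) for ch in channels]
    return means, stds

def label_clusters(pixels, U):
    """Return (text_index, background_index): the darkest and the lightest
    cluster according to the average of their strongly assigned pixels."""
    brightness = []
    for memberships in U:
        means, _ = cluster_stats(pixels, memberships)
        brightness.append(sum(means) / len(means))
    return brightness.index(min(brightness)), brightness.index(max(brightness))

def artificial_background(n_pixels, means, stds, seed=0):
    """Draw n_pixels RGB values from independent Gaussians, one per channel."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(rng.gauss(mu, sd)))) for mu, sd in zip(means, stds)]
            for _ in range(n_pixels)]

# Hypothetical data: two ink pixels, two paper pixels, two reddish defect pixels,
# with a three-cluster membership matrix U[i][k] (cluster i, pixel k).
pixels = [[20, 20, 25], [30, 25, 20], [230, 220, 190],
          [235, 225, 200], [150, 60, 60], [155, 65, 55]]
U = [[0.95, 0.90, 0.02, 0.01, 0.05, 0.04],
     [0.02, 0.03, 0.96, 0.97, 0.05, 0.06],
     [0.03, 0.07, 0.02, 0.02, 0.90, 0.90]]
text_idx, bg_idx = label_clusters(pixels, U)
bg_means, bg_stds = cluster_stats(pixels, U[bg_idx])
bg = artificial_background(100, bg_means, bg_stds)
```

The remaining cluster (neither darkest nor lightest) is the one holding the color defects; keeping it instead of discarding it corresponds to preserving features such as handwritten notes.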

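Fig. 2. Flowchart of the proposed recovering algorithm. The color image is processed by fuzzy clustering, which separates the removed color artifacts, the text with residual undesired artifacts, and the original background. The text is binarized into a binary text image and passed to mathematical morphology, which removes the structured artifacts and yields a clean text image (for possible OCR applications). An artificial background is created from the original background, and image addition of the clean text image and the artificial background gives the recovered image.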

Fig. 3. Segmentation results: (a) text cluster, (b) background cluster, (c) cluster with undesired color artifacts and (d) gray scale, from 0 to 1, representing membership to cluster i.

Fig. 4. Images from different steps of the algorithm: (a) text image after FC step, (b) text image after MM step, (c) original color image, (d) recovered color image.

5 Conclusions

Recovering the visual appearance of degraded old documents is an important issue because libraries have a real interest in making them available by digital means, particularly through Internet access. At the same time, clean documents are an important contribution to higher OCR performance on old documents, a problem now being tackled by several methods but still facing great challenges. We proposed a novel solution based on the combination of fuzzy clustering of the original images and a mathematical morphology step for removing residual geometrical artifacts. From the obtained results, we can conclude that this approach achieves very good outcomes. In addition, a high quality binary image is produced. These images can afterwards be used as inputs to OCR algorithms, contributing to better performance. We emphasize that this method leads on several occasions to better quality binary images than FineReader, and its only drawback is that it is slightly more time consuming.

As future work, some improvements can still be made in order to make this software available to librarians, which is the ultimate goal of this project. In particular, some manual parameterization should be automated. Finally, the proposed algorithm will be extended to multicolored pages and characters.

Acknowledgements

This work was partly supported by: the "Programa de Financiamento Plurianual de Unidades de I&D (POCTI), do Quadro Comunitário de Apoio III"; the FCT project POSI/SRI/41201/2001; "Programa do FSE-UE, PRODEP III, Quadro Comunitário de Apoio III"; and program FEDER. We also wish to express our thanks to the Portuguese Biblioteca Nacional for their continuous support, which made this work possible.

References

1. Bezdek J. C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
2. Buse R., Liu Z. Q., Bezdek J.: Word recognition using fuzzy logic. IEEE Transactions on Fuzzy Systems 10(1), February (2001) 65–76
3. Driankov D., Hellendoorn H., Reinfrank M.: An Introduction to Fuzzy Control. Springer, Berlin (1993)
4. Serra J.: Image Analysis and Mathematical Morphology. Academic Press, London (1982)
5. Soille P.: Morphological Image Analysis. 2nd edition. Springer, Berlin (2003)
6. ABBYY FineReader Homepage, http://www.abbyy.com, ABBYY Software House
7. Caldas Pinto J. R., Pina P., Bandeira L., Pimentel L., Ramalho M.: Underline Removal on Old Documents. Lecture Notes in Computer Science, LNCS 3211, Springer (2004) 226–234
8. Soille P., Talbot H.: Directional Morphological Filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11) (2001) 1313–1329
9. Ribeiro C. S., Gil J. M., Caldas Pinto J. R., Sousa J. M.: Ancient document recognition using fuzzy methods. In: Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems, Porto, Portugal (2004) 98–107
10. Caldas Pinto J. R., Marcolino A., Ramalho M.: Clustering Algorithm for Colour Segmentation. SIARP'00 - V Ibero-American Symposium on Pattern Recognition (2000) 611–617
