Enhancement of Old Manuscript Images

Avekash Gupta α, Sunil Kumar β, Rajat Gupta δ, Santanu Chaudhury γ, Shiv Dutt Joshi θ

Department of Electrical Engineering, Indian Institute of Technology, Delhi α,γ,θ; IBM India Research Lab β; Cypress Semiconductor δ

[email protected] α, [email protected] β, rajat [email protected] δ, [email protected] γ, [email protected] θ

Abstract

In this paper we address the issue of enhancing the quality of scanned images of old manuscripts. Small portions of the text in these manuscripts have degraded with time and are not readable. We propose a segmentation-based histogram matching scheme for enhancing these degraded text regions. To automatically identify the degraded text we use a matched-wavelet-based text extraction algorithm followed by MRF (Markov Random Field) postprocessing. Additionally, we perform background clearing to improve the quality of the results. The method does not require any a priori information about the font, font size, background texture or geometric transformation. We have tested our method on a variety of manuscript images. The results show the proposed method to be a robust, versatile and effective tool for the enhancement of manuscript images.

1

Introduction

Scanning and digital photographic imaging, besides many programs for preserving manuscripts in their physical form, have been used to preserve their content and current appearance for future studies. However, despite the availability of advanced photography and scanning equipment, natural aging and deterioration have rendered many manuscript images unreadable. Digital image processing techniques are therefore necessary to improve the legibility of the manuscripts. The deficiency in the quality of the original manuscripts arises from aging effects leading to deterioration of the writing media, seepage of ink and smearing along cracks, damage caused by the holes used for binding, and dirt and other discoloration. This results in very poor contrast between the foreground text and the background. The manuscripts we deal with contain small regions where the contrast between text and background is very low. We call these regions either 'degraded text regions' or 'low contrast text (LCT) regions'. Figure 1 shows an example image in which these degraded regions are marked with ovals.

Automatic extraction and enhancement of low contrast regions in aged manuscripts is not well explored in the literature. A wavelet-based image enhancement technique has been used in [6] to tackle bleed-through effects in archived documents. Linear function approximations of the document background [3], and a foreground-background separation method using local adaptive analysis [5], are among the other methods used in the digital restoration of historical document images. We present an algorithm based on matched wavelets and an MRF model to automatically identify and extract LCT regions from scanned manuscript images and enhance them using a histogram matching technique.

Section 2 describes our method. Subsection 2.1 describes histogram-matching-based enhancement of a degraded text region. In Subsection 2.2 we show how to segment out these degraded text regions using matched-wavelet-based segmentation followed by MRF postprocessing. Section 3 describes the results obtained using our algorithm, and we conclude the paper with a brief summary in Section 4.

2

Enhancement of Low Contrast Text (LCT) Regions

Figure 2 shows the block diagram of our system. We begin with the situation in which the LCT regions have already been extracted from the original image and we want to enhance these regions up to the level of the rest of the image. We use the histogram analysis techniques explained below for enhancing these regions.

2.1

Histogram Analysis

Figure 1. An old manuscript image. Low Contrast Text (LCT) regions are marked with ovals.

Figure 2. System block diagram.

Figure 3. Histograms of two sample manuscript images (Figure 1 & Figure 6(a)). Note that the histograms are bimodal with two distinct peaks.

Figure 4. (a) Degraded portion of an image and (b) the corresponding histogram of the segmented region. This histogram is also bimodal, but the two peaks are very near to each other. (c) Result after matching the histogram of this region with that of the rest of the image. (d) Final histogram of the enhanced image.

The LCT regions have lower contrast compared to the rest of the image. This is clearly visible in Figure 4(b), where we draw the histogram of only the LCT regions. Figure 4(b) indicates that the histogram of the degraded text region is also bimodal, just like the histogram of the complete image (Figure 3(a) & (b)), but the two peaks are very close to each other, signifying lower contrast. Thus, at the histogram level we can make a clear distinction between the two classes, namely text and background, even in the regions of degradation. In our approach to enhancement of the text regions we make use of the following two specific properties of the manuscripts that we have:

1. The manuscripts have very thick text regions, which lead to a bimodal histogram with a clear distinction between the two peaks. Figure 3 illustrates the histograms of some of our manuscript images. The first peak corresponds to the text regions and the later peak to the background.

2. The LCT regions do not significantly affect the overall histogram of the image, as they are very small in number, but they make a significant contribution if we draw the histogram of the degraded region alone.

Enhancement can be achieved by increasing the contrast of the LCT region, that is, by increasing the distance between the two peaks. This can be done by mapping the histogram of the LCT region to the histogram of the complete image. The algorithm used for histogram matching minimizes the error E to obtain the mapping T(k), where the error is defined as

E = |C_1(T(k)) - C_0(k)|    (1)

where C_0 and C_1 are the cumulative histograms of the LCT region and the entire image, respectively. Minimization of E is subject to the constraints that T must be monotonic and that C_1(T(a)) cannot overshoot C_0(a) by more than half the distance between the histogram counts at a. Figure 4(c) shows the improvement obtained by using this technique, and Figure 4(d) shows the final histogram of this image after enhancement; the final histogram is stretched as a result of the above processing.
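To make the matching step concrete, the following is a minimal sketch assuming 8-bit greyscale NumPy arrays. It builds the cumulative histograms C0 and C1 and derives a monotonic mapping T(k) by a nearest-cumulative-value lookup; the overshoot constraint described above is not enforced here, and the function and variable names are ours, not the authors'.

```python
import numpy as np

def match_histogram(lct_region, full_image, n_bins=256):
    """Map grey levels of an LCT region so that its cumulative histogram
    approximately follows that of the whole image (cf. eq. (1)).
    Illustrative sketch; assumes 8-bit greyscale inputs."""
    # Normalised cumulative histograms: C0 for the LCT region, C1 for the image.
    h0, _ = np.histogram(lct_region, bins=n_bins, range=(0, 256))
    h1, _ = np.histogram(full_image, bins=n_bins, range=(0, 256))
    C0 = np.cumsum(h0) / max(h0.sum(), 1)
    C1 = np.cumsum(h1) / max(h1.sum(), 1)
    # T(k): smallest grey level whose C1 value reaches C0(k); this gives a
    # monotonic mapping with a small |C1(T(k)) - C0(k)|.
    T = np.searchsorted(C1, C0).clip(0, n_bins - 1)
    # Apply the mapping to the region.
    return T[lct_region.astype(np.uint8)].astype(np.uint8)
```

Called as match_histogram(degraded_patch, whole_page), this returns a patch whose grey levels are spread towards the two well-separated peaks of the page histogram.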

2.2

Segmentation of LCT regions

In this section we explain how to automatically extract LCT regions from a manuscript image for the histogram analysis described above. Our approach is based on text segmentation using matched wavelets followed by MRF-based postprocessing.

2.2.1

Text Extraction using Matched Wavelets: Background

Matched wavelet filters for any given signal have been designed in [4]. Further, Khanna et al. [1] have used the matched wavelets defined in [4] to create Globally Matched Wavelet filters (GMWs) for text images and have used them for segmentation of text regions from natural images. These filters allow maximum energy to pass through the approximation subspace (equivalently, minimum energy to pass into the detail subspace), thus achieving better compression than any other class of wavelet filters. In this paper, we have used the text extraction algorithm designed in [1],[2], a brief overview of which is given below.

1. For a 2D image signal I(n1, n2), we form a 1D signal a_0x by placing all rows of the image adjacent to each other, and another 1D signal a_0y by placing all columns of the image below each other. We then estimate the matched analysis wavelet filters h_1x, h_1y using eqs. (2) and (3):

\sum_k h_{1x}(k) \Big[ \sum_m a_{0x}(2m+k)\, a_{0x}(2m+r) \Big] = 0,  for r = 0, 1, ..., j-1, j+1, ..., N-1    (2)

\sum_k h_{1y}(k) \Big[ \sum_m a_{0y}(2m+k)\, a_{0y}(2m+r) \Big] = 0,  for r = 0, 1, ..., j-1, j+1, ..., N-1    (3)

Here the j-th filter weight is kept constant at the value 1. The solution gives the corresponding weights of the dual wavelet filters h_1x and h_1y, i.e. the analysis high-pass filters. From these, the other filters (analysis low-pass, synthesis low-pass and synthesis high-pass) are obtained using FIR (Finite Impulse Response) perfect-reconstruction bi-orthogonal filter bank design.

2. The matched wavelets estimated by training on a set of pure text images and non-text images are then clustered to form a set of 6 Globally Matched Wavelets (GMWs), which are the characteristic wavelets of these classes. To locate text regions in an image I(n1, n2), the image is passed through the wavelet analysis filter bank with each of the 6 GMW filter sets to obtain 6 transformed images.

3. The 6D feature vector at each pixel is defined as f(x, y) = [f_i(x, y)]_{i=1,...,6}, where f_i(x, y) is the value of pixel (x, y) of transformed image i. The segregation and class assignment of test image pixels is then done by training a two-class (text/background) Fisher classifier, which also generates a confidence in the classification.

This segmentation algorithm classifies LCT regions as background. This can be seen as misclassification occurring in the form of isolated clusters of one class inside another class. Removing this misclassification is equivalent to making the classification smoother. This is achieved by utilizing the classification confidence output of the Fisher classifier in MRF-based postprocessing, which exploits the contextual information around each pixel. A similar approach has been used recently in [8] to refine the results of segmenting handwritten text, printed text and noise in document images.
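The following is a rough, hypothetical sketch of steps 2 and 3 above, not the authors' implementation: a single separable filtering pass with each of six assumed GMW coefficient pairs stands in for the full analysis filter bank, and scikit-learn's LinearDiscriminantAnalysis plays the role of the two-class Fisher classifier, with predict_proba supplying the per-pixel confidence consumed by the MRF stage described next.

```python
import numpy as np
from scipy.ndimage import convolve1d
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def gmw_feature_images(img, gmw_filters):
    """One transformed image per (row, column) filter pair; 'gmw_filters'
    holds six hypothetical 1D coefficient pairs (hx, hy). Returns (H, W, 6)."""
    feats = []
    for hx, hy in gmw_filters:
        rows = convolve1d(img.astype(float), hx, axis=1, mode='reflect')
        feats.append(convolve1d(rows, hy, axis=0, mode='reflect'))
    return np.stack(feats, axis=-1)

def classify_pixels(features, train_feats, train_labels):
    """Two-class Fisher discriminant (0 = background, 1 = text); returns a
    label map and the per-pixel confidence of the 'text' class."""
    clf = LinearDiscriminantAnalysis()
    clf.fit(train_feats, train_labels)               # labelled training pixels
    X = features.reshape(-1, features.shape[-1])
    conf_text = clf.predict_proba(X)[:, 1].reshape(features.shape[:2])
    return (conf_text > 0.5).astype(int), conf_text
```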

2.2.2

MRF based postprocessing

The problem of correcting the misclassification can be formulated in terms of energy minimization [9],[2]. Every pixel p must be assigned a label from the set L = {text, background}. F refers to a particular labeling of the pixels and f_p refers to the value of the label of a particular pixel p. In the first-order MRF model, the energy function simplifies to the following form:

E(f) = \sum_{\{p,q\} \in N} V_{p,q}(f_p, f_q) + \sum_{p \in P} D_p(f_p)    (4)

where N is the set of interacting pairs of pixels (8-neighborhood). The first and second terms in the above equation are referred to as E_smooth (interaction energy) and E_data (energy corresponding to the data term) in the literature [9]. D_p (the data term) is a measure of how well a label f_p fits the pixel p given the observed data. Thus, in our case the classification confidence found using the Fisher classifier is the intuitive choice for D_p [8]. E_smooth makes F smooth everywhere. The discontinuity-preserving energy function used here is the Potts interaction penalty [9]. Mathematically, the expression for the Potts interaction penalty is

V_{p,q}(f_p, f_q) = \lambda \cdot T(f_p \neq f_q)    (5)

where \lambda is a constant and T is 1 when f_p \neq f_q and 0 otherwise. The value of \lambda controls the amount of smoothing done by the energy function, and we use a high value of \lambda for the MRF-based postprocessing, which intentionally over-smooths the results.

Fig. 5(a) and 5(b) show the result of text extraction and MRF postprocessing for a sample degraded image. The results show that the LCT regions, classified as background by the plain text extraction, are corrected after MRF postprocessing.
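For illustration only, the sketch below approximately minimizes the energy of eq. (4) with the Potts penalty of eq. (5) by simple iterated label updates over the 8-neighborhood, instead of the graph-cut optimizer of [9]. Here conf_text is assumed to be the Fisher confidence map from the previous step and label 1 denotes text; the names are ours.

```python
import numpy as np

def mrf_smooth(conf_text, lam=2.0, n_iter=10):
    """Approximate minimisation of E(f) = sum V_pq + sum D_p with a Potts
    smoothness term; a simple stand-in for graph-cut optimisation."""
    # Data term D_p(f_p): cost of calling a pixel background (0) or text (1),
    # derived from the classifier confidence.
    cost_bg_data, cost_text_data = conf_text, 1.0 - conf_text
    labels = (conf_text > 0.5).astype(int)          # initial labelling
    H, W = labels.shape
    for _ in range(n_iter):
        padded = np.pad(labels, 1, mode='edge')
        # Number of 8-neighbours currently labelled 'text' at every pixel.
        text_nbrs = sum(padded[1 + dy:H + 1 + dy, 1 + dx:W + 1 + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                        if (dy, dx) != (0, 0))
        # Potts term: lam times the number of disagreeing neighbours per label.
        cost_bg = cost_bg_data + lam * text_nbrs
        cost_text = cost_text_data + lam * (8 - text_nbrs)
        labels = (cost_text < cost_bg).astype(int)  # pick the cheaper label
    return labels
```

A large lam drives isolated 'background' clusters inside text areas towards the text label, which is the over-smoothing behaviour exploited here to recover the LCT regions.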

Table 1. Detection and recovery of degraded characters on five scanned English text images.

Total degraded characters | Degraded characters detected as LCT | Degraded characters recovered | % detection | % recovery
202                       | 164                                 | 97                            | 81%         | 48%

Subtracting the image obtained after MRF postprocessing from the one obtained by plain text segmentation (no postprocessing) gives us the LCT regions. Fig. 5(c) shows the segmented LCT regions after subtraction. The LCT image from this step is enhanced using the histogram matching technique explained in Section 2.1. We implement an optional background clearing step, described in [3], to make the manuscript image more readable. Figures 6(f) & 7(f) show the final enhanced manuscript images.
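A purely illustrative sketch of this subtraction-and-enhancement step, assuming the hypothetical label maps and the match_histogram helper from the earlier sketches (1 = text, 0 = background):

```python
import numpy as np
from scipy.ndimage import label, find_objects

def enhance_lct(image, labels_plain, labels_mrf, match_histogram):
    """Pixels that become text only after MRF postprocessing form the LCT mask;
    each connected LCT component is then histogram-matched against the page."""
    lct_mask = (labels_mrf == 1) & (labels_plain == 0)
    enhanced = image.copy()
    components, _ = label(lct_mask)               # connected LCT regions
    for sl in find_objects(components):
        enhanced[sl] = match_histogram(image[sl], image)
    return enhanced, lct_mask
```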

3

Results

Several different types of manuscript images were analyzed using our method; we discuss some of the results here. Regions painted green indicate the text area. Figure 5 gives complete results for a manuscript image with fairly distributed LCT regions. Figure 6 shows LCT region extraction from a non-aligned manuscript; thus our method works equally well for rotated or skewed images. Figure 7 is an example of a manuscript with a different font and very low contrast regions. In order to quantitatively measure the readability improvement achieved by our method, we tested our algorithm on hand-formed degraded English text images. The results on 5 different scanned English text images are summarized in Table 1. The first column shows the total number of degraded characters in the images. The second column shows the number of degraded characters detected as LCT by our algorithm. The third column shows the number of characters recovered (i.e., detected by an OCR (Optical Character Recognition) engine) after histogram matching. The fourth and fifth columns show the percentage accuracies of detection and recovery. The results show about a 50% improvement in the readability of the text after applying our algorithm.

4

Conclusion

In this paper, we have presented a novel technique for locating and enhancing low contrast regions in old scanned manuscript images. It is based on a matched-wavelet text extraction algorithm followed by MRF postprocessing and histogram matching. We have applied our algorithm to several types of images with different fonts and complex backgrounds and obtained encouraging results.

Figure 5. (a) Text segmentation result for Figure 1; green regions are text and red is background. (b) Result after MRF postprocessing; note that the LCT regions missed by the plain text segmentation are now covered. (c) Image with the LCT regions extracted (green regions). (d) Enhanced image after histogram matching.

Acknowledgment

We are grateful to C-DAC, Noida for providing the manuscript images used in our experiments. The project was funded by Media Lab Asia, Government of India.

Figure 6. (a) Manuscript image with non-aligned text. (b) After text segmentation (green: text, red: background). (c) After MRF postprocessing. (d) LCT regions (marked green). (e) After histogram analysis. (f) Final image after background clearing.

Figure 7. (a) Image with a bigger and highly degraded font. (b) After text segmentation (green: text, red: background). (c) After MRF postprocessing. (d) LCT regions (marked green). (e) After histogram analysis. (f) Final image after background clearing.

References

[1] S. Kumar, N. Khanna, S. Chaudhury and S. D. Joshi. Locating Text in Images using Matched Wavelets. IEEE International Conference on Document Analysis and Recognition, Vol. 2, pp. 595-599, Sept. 2005.

[2] R. Gupta, S. Kumar, N. Khanna, S. Chaudhury and S. D. Joshi. Text Extraction and Document Image Segmentation using Matched Wavelets and MRF Model. Accepted in IEEE Trans. on Image Processing, July 2007.

[3] Z. Shi and V. Govindaraju. Historical Document Image Enhancement using Background Light Intensity Normalization. Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Vol. 1, pp. 473-476, 2004.

[4] A. Gupta, S. D. Joshi and S. Prasad. A New Approach for Estimation of Wavelets with Separable and Non-separable Kernel from a Given Image. Accepted for publication in IEEE Trans. on Signal Processing.

[5] C. Yan and G. Leedham. Decompose-Threshold Approach to Handwriting Extraction in Degraded Historical Document Images. Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition (IWFHR'04), pp. 239-244.

[6] Q. Wang, T. Xia, L. Li and C. L. Tan. Image Enhancement of Historical Documents using Directional Wavelet. International Journal of Wavelets, Multiresolution and Information Processing, Vol. 1, No. 3, pp. 291-305, 2003.

[7] A. Gupta, S. D. Joshi and S. Prasad. A New Approach for Estimation of Statistically Matched Wavelet. IEEE Trans. on Signal Processing, Vol. 53, No. 5, pp. 1778-1793, May 2005.

[8] Y. Zheng, H. Li and D. Doermann. Machine Printed Text and Handwriting Identification in Noisy Document Images. IEEE Trans. on PAMI, Vol. 26, No. 3, pp. 337-353, March 2004.

[9] Y. Boykov, O. Veksler and R. Zabih. Fast Approximate Energy Minimization via Graph Cuts. IEEE Trans. on PAMI, Vol. 23, No. 11, pp. 1222-1239, November 2001.