International Conference on Communication and Signal Processing, April 6-8, 2017, India

Detection of Variable Regions in Complex Document Images

Sreelakshmi U.K, Akash V.G and N. Shobha Rani

Abstract—Pre-processing is a significant step toward precise recognition of contents from aged document images, which are heavily beset with noisy artifacts accumulated over their long lifetimes. In this paper, a technique is proposed for detecting variable regions in aged documents by employing object detection; histogram of gradients features are then fed to a nearest neighbor classifier. The performance of the algorithm is analyzed on the standard Tobacco800 Complex Document Image Database of the University of Maryland Institute for Advanced Computer Studies.

Index Terms—Aged documents, ancient document images, HoG features, pre-processing, variable region detection.

I. INTRODUCTION

Detection of variable regions in document images is a prime research problem in document image processing and computer vision. The complexity of processing a document image is influenced by the noise persisting in it. In particular, aged document images, and images captured or scanned under improper illumination or with incorrectly used imaging equipment, suffer from noisy content added to the original document. Other causes of noise include the resolution of the document and the type of ink used for printing or writing, which result in ink bleed-through and the addition of marginal or salt-and-pepper noise. Pre-processing these documents is of prime importance in order to simplify machine-based interpretation of their contents. Detecting the variety of objects present in a document image supports subsequent processing, especially for image classification and retrieval problems. In the proposed work, the detection of variable regions covers regions with handwritten content such as signatures and annotations, as well as other impressions such as ink bleed-through. Fig. 1 depicts some of the document image

Sreelakshmi U.K is with the Amrita School of Arts and Science, Amrita Vishwa Vidyapeetham, Amrita University, Mysuru campus, Karnataka, India (e-mail: [email protected]). Akash V.G is with the Amrita School of Arts and Science, Amrita Vishwa Vidyapeetham, Amrita University, Mysuru campus, Karnataka, India (e-mail: [email protected]). N. Shobha Rani is with the Department of Computer Science, Amrita Vishwa Vidyapeetham, Amrita University, Mysuru campus, Karnataka, India (e-mail: [email protected]).

samples with the mentioned characteristics from the Tobacco800 Complex Document Image Database of the University of Maryland Institute for Advanced Computer Studies [1, 2, 3]. There exists quite a number of research attempts in this direction; a summary of a few important works follows. Likforman et al. [4] proposed an approach for segmentation of lines in a pre-processed document image, which is assumed to be noise free and without non-textual elements. The two pre-processing techniques used are high-pass filtering and morphological filtering. Rashid et al. [5] contributed an approach for script identification using features obtained from document images at the character, word, text-line or block level. Multi-script identification is performed using connected component analysis and a convolutional neural network, with discriminative learning models used for script identification. The dataset employed for testing includes ancient Greek-Latin mixed-script document images, and the approach proved easily adaptable to the identification of other scripts. Nafchi et al. [6] proposed a phase-based binarization model for ancient document image enhancement, together with a post-processing method to improve the binarization. The method comprises three stages: pre-processing, main binarization and post-processing. Phase features are derived in the pre-processing and binarization stages, while Gaussian and median filters are applied in post-processing. The binarization step shows high performance and makes the system robust on degraded images. Likforman et al. [7] proposed a technique for pre-processing printed documents using two image restoration approaches: the Non-local Means filter and total variation minimization. Testing is done on printed documents, and effectiveness is measured through character recognition, i.e. OCR.
The Non-local Means approach provides high accuracy for characters with low-level degradation, while total variation minimization is effective for characters with high degradation. As an overall effect, this preprocessing step increases the accuracy of the character recognition system. Yahya et al. [8] devised a technique for enhancing aged document images with damaged backgrounds. The three image enhancement methods reviewed are (a) image enhancement

978-1-5090-3800-8/17/$31.00 ©2017 IEEE


Fig. 1. Input document image instances from the UMIACS Tobacco800 database

technique using binarization methods, (b) image enhancement technique using hybrid methods, and (c) image enhancement technique using non-thresholding methods. The second category receives the most attention, as it is the most popular and has great potential for future improvement. The technique reduces quality degradation and makes the script readable. Reza et al. [9] proposed a word spotting technique for very old historical documents. The technique requires neither line nor word segmentation, and the system is language independent. Spotting is applied using a Euclidean distance measure enhanced by rotation and dynamic time warping transforms, and is obtained during an automatic pre-processing stage. Content-level classifiers are used to extract accurate stroke pixels. The preprocessed images are of higher quality and more readable to the end user. Reza et al. [10] contributed a variational approach to correcting the bleed-through effect in degraded images. A flow field model with global control is introduced, where the solution of each resulting model is obtained by wavelet shrinkage or a time-stepping scheme, depending on the complexity and non-linearity of the model. For double-sided document enhancement, the model uses a reverse diffusion technique. The system is robust on noisy and complex backgrounds. Zemouri et al. [11] devised a combined binarization technique using global and local thresholding, tested on the benchmark dataset of the Handwritten Document Image Binarization contest (H-DIBCO 2012); the evaluation on a word spotting system performed efficiently in their approach. Maroua et al. [12] contributed a texture-based approach over numerous degradation models, in which image enhancement algorithms, non-local means filtering and superpixel techniques are applied to document images.
This shows the robustness of texture feature extraction for segmentation with noisy backgrounds. Rowley et al. [13] devised a preprocessing technique for removal of local intensity variations, with pixel-region classification based on segmentation; restoration is achieved using exemplar-based image inpainting. The proposed methods were tested on 25

manuscript images. The improved histogram segmentation produced significantly better results in image segmentation. Thumilvannan et al. [14] contributed a method that binarizes the image using gradient information including the phase feature. The three main steps are pre-processing, binarization and post-processing. The binarized image is compressed using the SPIHT algorithm, which requires less storage space, makes access to the image faster, and increases storage efficiency. Pratiksha et al. [15] devised a binarization technique that removes noise using Otsu's thresholding. The technique is compared with Niblack and Sauvola thresholding to obtain better results; the image still contains some salt-and-pepper noise at its margins. El Harraj et al. [16] presented a novel nonparametric and unsupervised method for ancient document images. An optimized grayscale conversion algorithm is used for the transformation, and the brightness of the original image is equalized to obtain better quality; even after preprocessing, the image still contains some marginal noise. It is observed that most works in the literature have adopted thresholding-based approaches, pixel-based region classification methods, and other hybrid methods involving supervised and a few unsupervised learning approaches. In the proposed approach, the detection of variable regions such as handwritten regions and annotations is devised as a supervised learning method through non-thresholding or edge-detection-based techniques. The proposed methodology is described in Section II, Section III discusses the results, and Section IV concludes the paper.

II. PROPOSED METHODOLOGY

The proposed method for variable region detection in degraded or noisy document images comprises four stages. Initially the document image is taken as input and forwarded for preprocessing in stage one. The preprocessed image is then subjected to object detection in stage two. Histogram of gradient (HoG) features are computed from the objects detected in stage three, and finally these features are classified using a nearest neighbor classifier in stage four.
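The four stages can be summarized as a simple processing pipeline. The sketch below is illustrative only; the function names and placeholder bodies are not from the paper, and each stage is detailed in the subsections that follow:

```python
import numpy as np

# Illustrative four-stage pipeline; the bodies are placeholders
# standing in for the operations detailed in Sections II-A to II-C.

def preprocess(img):
    # Stage 1: morphological dilation to merge split strokes.
    return img

def detect_objects(img):
    # Stage 2: connected component analysis -> object bounding boxes.
    return [(0, 0, img.shape[0], img.shape[1])]

def hog_features(img, box):
    # Stage 3: histogram-of-gradients descriptor for one object.
    return np.zeros(9)

def classify(feature):
    # Stage 4: nearest neighbour label: 'printed' or 'variable'.
    return "variable"

def detect_variable_regions(img):
    img = preprocess(img)
    return [box for box in detect_objects(img)
            if classify(hog_features(img, box)) == "variable"]

page = np.zeros((32, 32), dtype=bool)
print(detect_variable_regions(page))   # -> [(0, 0, 32, 32)]
```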


Fig. 2. Block diagram of proposed system

Fig. 2 provides an overview of the proposed methodology in block diagram form.

A. Preprocessing

Preprocessing enhances the quality of an image by applying algorithms that transform it into a form suitable for subsequent processing. In the proposed work, the input images are assumed to be binary and are subjected to morphological dilation [17] to improve the quality of the printed text content. The dilation operation enhances the gradient details of the image by appending additional pixels near the contours of foreground pixels. It is observed that most textual objects are detected as split objects; hence morphological dilation is applied to merge the splits within the same textual object in the handwritten regions. The dilation of an image I with structuring element S is given by equation (1).

I ⊕ S = ∪_{s ∈ S} I_s                    (1)

where I_s denotes the translation of I by s.

B. Object Detection

The dilated image is processed for object detection using connected component analysis [18]. A sequence of contiguous pixels of similar intensity forms a connected object, and such objects are detected using connected component labeling based on region properties. Fig. 3 presents the objects detected from the dilated versions of the images.

C. Feature Computation and Object Classification

The choice of feature extraction technique impacts the overall precision of the devised algorithm. In the proposed method, Histogram of Gradient (HoG) features [19] are employed to discriminate printed text regions from variable regions in an image. HoG descriptors are efficient at representing features, especially for edge-based images; the dispersion of gradient details is greater in variable regions than in printed regions. A HoG descriptor records the frequency of gradient orientations within local image regions and returns normalized histogram vectors, usually called block histograms. In the proposed method, a cell size of 4x4 is employed for generating the block histograms. The outcome of the HoG descriptors with the detected regions is depicted in Fig. 4. The normalized feature vectors generated from the detected objects are fed to the classifier. Classification in the proposed method is carried out using the nearest neighbor classifier [20].
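Stages one and two, the dilation of equation (1) and the connected component labeling, can be sketched with plain NumPy. This is a minimal illustration rather than the authors' implementation; a 3x3 structuring element and 4-connectivity are assumed:

```python
import numpy as np

def dilate(img, se):
    """Binary dilation per equation (1): the union of translates of img
    by the offsets where the structuring element se is set."""
    out = np.zeros_like(img)
    h, w = img.shape
    cy, cx = se.shape[0] // 2, se.shape[1] // 2
    for dy, dx in zip(*np.nonzero(se)):
        oy, ox = dy - cy, dx - cx           # offset relative to centre
        src = img[max(0, -oy):h - max(0, oy), max(0, -ox):w - max(0, ox)]
        out[max(0, oy):h - max(0, -oy), max(0, ox):w - max(0, -ox)] |= src
    return out

def label_components(img):
    """4-connected component labeling by flood fill; returns the
    bounding box (r0, c0, r1, c1) of each detected object."""
    labels = np.zeros(img.shape, dtype=int)
    boxes = []
    for y, x in zip(*np.nonzero(img)):
        if labels[y, x]:
            continue
        lab = len(boxes) + 1
        labels[y, x] = lab
        stack = [(y, x)]
        r0 = r1 = y
        c0 = c1 = x
        while stack:
            r, c = stack.pop()
            r0, r1 = min(r0, r), max(r1, r)
            c0, c1 = min(c0, c), max(c1, c)
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < img.shape[0] and 0 <= nc < img.shape[1]
                        and img[nr, nc] and not labels[nr, nc]):
                    labels[nr, nc] = lab
                    stack.append((nr, nc))
        boxes.append((r0, c0, r1, c1))
    return boxes

page = np.zeros((8, 8), dtype=bool)
page[1, 1] = page[1, 3] = True           # two fragments of one stroke
merged = dilate(page, np.ones((3, 3), dtype=bool))
print(len(label_components(page)))       # 2: fragments are separate
print(len(label_components(merged)))     # 1: dilation joins them
```

The demo shows why dilation precedes labeling: the two fragments of a split stroke become one object only after dilation.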

Fig. 3. Objects detected in the dilated images


Fig. 4. Images after removal of variable regions
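The block-histogram computation and nearest-neighbour decision of Section II-C can likewise be sketched in NumPy. This is a simplified illustration: it builds one 9-bin orientation histogram per 4x4 cell and omits the overlapping block normalization of the full HoG descriptor, and the two training samples are synthetic rather than taken from the Tobacco800 data:

```python
import numpy as np

def hog_descriptor(img, cell=4, bins=9):
    """Minimal HoG sketch: per 4x4 cell, an orientation histogram
    weighted by gradient magnitude, concatenated and L2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned, in [0, pi)
    h, w = img.shape
    feats = []
    for r in range(0, h - cell + 1, cell):
        for c in range(0, w - cell + 1, cell):
            m = mag[r:r + cell, c:c + cell].ravel()
            a = ang[r:r + cell, c:c + cell].ravel()
            hist = np.zeros(bins)
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            np.add.at(hist, idx, m)              # magnitude-weighted bins
            feats.append(hist)
    v = np.concatenate(feats)
    n = np.linalg.norm(v)
    return v / n if n else v

def nearest_neighbour(train_x, train_y, x):
    """1-NN: label of the closest training vector (Euclidean)."""
    d = np.linalg.norm(train_x - x, axis=1)
    return train_y[int(np.argmin(d))]

# Two synthetic 8x8 patches with orthogonal edge orientations.
horiz_edge = np.zeros((8, 8))
horiz_edge[:4, :] = 1.0                          # horizontal edge
vert_edge = horiz_edge.T.copy()                  # vertical edge
X = np.stack([hog_descriptor(horiz_edge), hog_descriptor(vert_edge)])
y = np.array(["printed", "variable"])
print(nearest_neighbour(X, y, hog_descriptor(vert_edge)))  # -> variable
```

The orthogonal edge orientations land in different histogram bins, which is the property the classifier exploits to separate printed from variable regions.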

III. EXPERIMENTAL ANALYSIS

The algorithm's performance is analyzed by training the classifier with around 560 images and testing with about 225 images. The accuracy of the algorithm is assessed subjectively in the proposed method: accuracy is defined as the ratio of images whose variable regions are detected successfully to the total number of images employed for testing. There are cases where the variable regions are detected only partially. The results obtained after removal of variable regions are presented in Fig. 4. The algorithm is also able to detect regions such as logos and artistic text; however, there are cases where it results in partial removal of variable regions. Fig. 5 depicts a few such cases.

IV. CONCLUSION

High accuracy in OCR systems can be achieved only through an adept selection of the objects used for feature extraction and classification. Removing obstructing objects such as variable regions in complex document images improves the scope for accurate outcomes. In the proposed technique, variable regions are detected by employing object detection using connected component labeling, with feature vectors extracted using HoG descriptors. After classification, the detected variable regions are removed to support subsequent processing steps.

Fig. 5. Partial removal of variable regions – exception cases


ACKNOWLEDGMENT

We would like to express our heartfelt gratitude to our guide and project coordinator Ms. Shobha Rani N for her valuable suggestions and guidance rendered throughout.

REFERENCES

[1] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in Proc. Annual Int. ACM SIGIR Conference, 2006, pp. 665–666.
[2] G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, The Complex Document Image Processing (CDIP) test collection, Illinois Institute of Technology, 2006. [Online]. Available: http://ir.iit.edu/projects/CDIP.html
[3] The Legacy Tobacco Document Library (LTDL), University of California, San Francisco, 2007. [Online]. Available: http://legacy.library.ucsf.edu/
[4] Likforman-Sulem, L., Zahour, A., & Taconet, B. (2007). Text line segmentation of historical documents: a survey. International Journal of Document Analysis and Recognition (IJDAR), 9(2-4), 123-138.
[5] Rashid, S. F., Shafait, F., & Breuel, T. M. (2010, June). Connected component level multiscript identification from ancient document images. In Proceedings of the 9th IAPR Workshop on Document Analysis Systems (pp. 1-4).
[6] Nafchi, H. Z., Moghaddam, R. F., & Cheriet, M. (2014). Phase-based binarization of ancient document images: Model and applications. IEEE Transactions on Image Processing, 23(7), 2916-2930.
[7] Likforman-Sulem, L., Darbon, J., & Smith, E. H. B. (2009, July). Preprocessing of degraded printed documents by non-local means and total variation. In 2009 10th International Conference on Document Analysis and Recognition (pp. 758-762). IEEE.
[8] Yahya, S. R., Abdullah, S. S., Omar, K., Zakaria, M. S., & Liong, C. Y. (2009, August). Review on image enhancement methods of old manuscript with the damaged background. In 2009 International Conference on Electrical Engineering and Informatics (Vol. 1, pp. 62-67). IEEE.
[9] Moghaddam, R. F., & Cheriet, M. (2009, July). Application of multilevel classifiers and clustering for automatic word spotting in historical document images. In 2009 10th International Conference on Document Analysis and Recognition (pp. 511-515). IEEE.
[10] Moghaddam, R. F., & Cheriet, M. (2010). A variational approach to degraded document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8), 1347-1361.
[11] Zemouri, E., Chibani, Y., & Brik, Y. (2014). Enhancement of historical document images by combining global and local binarization technique. International Journal of Information and Electronics Engineering, 4(1), 1.
[12] Mehri, M., Kieu, V. C., Mhiri, M., Héroux, P., Gomez-Krämer, P., Mahjoub, M. A., & Mullot, R. (2014, April). Robustness assessment of texture features for the segmentation of ancient documents. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on (pp. 293-297). IEEE.
[13] Rowley-Brooke, R., Pitié, F., & Kokaram, A. (2013). A non-parametric framework for document bleed-through removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2954-2960).
[14] Thumilvannan, P. S., Bhavani, S., & Sridevi, J. (2015). Image Binarization and Lossless Compression of Ancient Documents using SPIHT Algorithm. International Journal of Computer Applications, 115(17).
[15] Jadhav, P. D., Jadhav, D. R., Gite, S. S., & Mulik, V. (2016). Enhancement of Old Degraded Documents Using Phase Base Binarization by Dip Technique. International Journal of Engineering Science, 4225.
[16] K. Sakthidasan@Sankaran, G. Ammu and V. Nagarajan, "Non Local Image Restoration Using Iterative Method", IEEE International Conference on Communication and Signal Processing (ICCSP14), April 2014, pp. 1740-1744.
[17] Parthasarathi, V., Surya, M., Akshay, B., Siva, K. M., & Vasudevan, S. K. (2015). Smart control of traffic signal system using image processing. Indian Journal of Science and Technology, 8(16), 1.
[18] Ramanathan, R., Ponmathavan, S., Thaneshwaran, L., Nair, A. S., Valliappan, N., & Soman, K. P. (2009, December). Tamil font recognition using gabor filters and support vector machines. In Advances in Computing, Control, & Telecommunication Technologies, 2009. ACT'09. International Conference on (pp. 613-615). IEEE.
[19] Pushpa, B. R., Ashwin, M., & Vivek, K. R. Robust Text Extraction for Automated Processing of Multi-Lingual Personal Identity Documents.
[20] Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, (4), 580-585.

