Information Retrieval from Sanskrit Document ... - Sanskrit Library

Keyword Spotting Techniques for Sanskrit Documents

Anurag Bhardwaj, Srirangaraj Setlur and Venu Govindaraju
Center for Unified Biometrics and Sensors
Department of Computer Science and Engineering
University at Buffalo, Amherst NY - 14228
{ab94, setlur, govind}@cubs.buffalo.edu

Abstract. With advances in the digitization of printed documents and several mass digitization projects underway, information retrieval and document search have emerged as key research areas. However, most of the current work in these areas is limited to English and a few oriental languages. The lack of efficient solutions for Indic scripts and languages such as Sanskrit has hampered information extraction from a large body of documents of cultural and historical importance. This chapter presents two relevant topics in this area. First, we describe a script-specific Keyword Spotting approach for Sanskrit documents that makes use of domain knowledge of the script. Second, we address the need of a digital library to provide access to a collection of documents in multiple scripts. This requires intelligent solutions that scale across different scripts, and we present a script-independent Keyword Spotting approach for this purpose. Experimental results illustrate the efficacy of our methods.

Keywords: Document Analysis, Keyword Spotting, Optical Character Recognition, Document Retrieval, Indic Scripts.

1 Introduction

For decades, Optical Character Recognition (OCR) has been considered the primary enabling technology for automatic interpretation of handwritten or machine-printed document images. Given the relatively easy access to a large number of document images in English and certain oriental scripts, OCR solutions have primarily focused on these scripts. However, the information boom of the last decade has led to a remarkable growth in the digital collection of documents in non-European and non-oriental scripts such as Indic scripts. Unfortunately, progress in recognition technologies for Indic scripts such as Devanagari (used by Sanskrit, Hindi and other languages) has not kept pace with the growth in digital document collections. This can be attributed to a number of challenges: poor quality documents, the complex nature of the script and relatively fewer years of research on Indic OCR.

Fortunately, recent advances in information retrieval have led to the emergence of Keyword Spotting as a viable alternative to full OCR. Keyword Spotting essentially finds all occurrences of a typed input word in a set of handwritten or printed documents. Using this method, it is possible to obtain information from documents without relying on robust recognition strategies. Fig. 1 illustrates the concept, where the boxed word image represents the spotted keyword.

Fig 1 - A sample Sanskrit document image with keyword boxed in red

We present two different approaches to Sanskrit Keyword Spotting. In the first approach, we describe a Block Adjacency Graph (BAG) based scheme for word recognition. It includes a BAG based document clean-up technique that uses a graph to maintain the overall character structure while removing noise that does not conform to a character shape. We use multiple hypotheses generated in the recognition phase to determine the similarity between a query word and a document word image. This ensures that even if the top choice result of the OCR is not correct, the multiple word hypotheses returned by the OCR are still considered when computing similarity with the query.

The second approach extends the idea of Keyword Spotting to multilingual documents. We use a moment based word matching technique which maintains a script invariant representation of all word images. Word matching is performed using cosine similarity. We also employ a relevance feedback technique to refine the word spotting results.

The rest of the chapter is organized as follows: Section 2 describes background work in this area. Section 3 explains the proposed methodologies. Section 4 discusses the scope of the proposed methods. Conclusions are drawn in Section 5.

2 Related Work

[Diagram: Keyword Spotting for Document Images divides into (a) Recognition Based, which uses OCR results (lexicon based, word rankings, word recognition based), and (b) Recognition-free, which uses image based features (profile features, Gabor, GSC, DTW matching).]

Fig 2 – Classification of existing Keyword Spotting techniques

Existing Keyword Spotting approaches can be broadly classified into two categories (Fig. 2): (i) OCR based approaches and (ii) image feature based approaches. The OCR based techniques are suitable only for applications where the documents are of good quality and involve relatively small lexicons. Thus, instead of indexing the OCR’ed text of images for keyword retrieval, approaches based on indexing intermediate OCR likelihoods or matching distances have been proposed for retrieving English documents [1, 2, 3]. However, such OCR systems are still not available for Sanskrit. This is because of several challenges in Sanskrit word recognition, including: (i) a large number of character classes resulting from all possible permutations of alphabet classes, (ii) character shapes made of complex primitives that cannot be easily segmented using conventional approaches, (iii) multiple font styles and (iv) poor quality documents. To the best of our knowledge, our method is the first that uses a recognition-based approach for Keyword Spotting in Sanskrit documents.

In recognition-free Keyword Spotting approaches (Fig. 2), after preprocessing of document images and word segmentation, feature vectors are extracted from word images and stored in a database. When a user provides a query word, the similarity between the query and each word image in the database is computed, and word images are returned in decreasing order of similarity [4, 5, 6]. The comparison functions accommodate inexact matches by using Dynamic Time Warping (DTW) or string edit distance measures. Some methods use probabilistic similarity metrics or clustering to compute the similarity between the query word image and the document word image [6, 7]. Global shape features have been found to be less discriminative than an OCR specific to the script, and are prone to failure for scripts like Devanagari that have a large and complex set of character classes. The extracted global word shape features are also not very reliable in poor quality Sanskrit documents.

Manmatha et al. [8] study various types of features for indexing handwritten English documents, including profile based features, gradient features and Gabor based features [7]. Dynamic Time Warping (DTW) is used to find the similarity between the query word image and other word images. Though these features work fairly well with English documents, they have proved inadequate for Sanskrit, where the complexity of the class labels calls for more specific modeling of word images. Moreover, the presence of the top horizontal line (head line) or "Shirorekha" renders most of the profile-based features ineffective. Also, DTW based approaches are slow, since the query must be aligned separately against every candidate word image.

The Keyword Spotting method for Sanskrit proposed by Harish et al. [9] uses a Gradient, Structural and Concavity (GSC) feature set to measure the image characteristics at local, intermediate and large scales. The method uses a sliding window approach in which local image features are computed from every window. However, in the presence of noisy word images, feature extraction is fragile. Also, the lack of character segmentation points renders the connectivity information of the alphabets ineffective. Template based approaches such as these perform image matching in the feature space and hence require an image based model of the query word (template). Therefore, the method does not scale to a large number of query words.
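As an illustration of the DTW matching discussed above, the following is a minimal sketch of the textbook DTW recurrence applied to two one-dimensional feature sequences, such as per-column profile values of two word images. It is not the implementation used in the cited works; the function name dtw_distance, the absolute-difference local cost and the sample sequences are assumptions made for this example only.

```python
def dtw_distance(a, b):
    """Textbook DTW distance between two 1-D sequences of numbers.

    Assumption: each sequence holds one feature value per image column
    (for example a projection-profile value); the local cost is the
    absolute difference between the two values.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again (stretch b)
                cost[i][j - 1],      # b[j-1] is matched again (stretch a)
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]


# Hypothetical profiles: the second is a horizontally stretched copy of the first.
query_profile = [1, 2, 3]
stretched_profile = [1, 2, 2, 3]
print(dtw_distance(query_profile, stretched_profile))
```

The nested loops fill an n by m table for every pair of sequences, so matching one query against a collection of N word images costs on the order of N table fills. This is one way to see why DTW based retrieval is slow on large collections.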

3 Proposed Methodologies

This section describes two different approaches to Sanskrit Keyword Spotting. The first is based on a unifying framework for both recognition and retrieval of Sanskrit documents. A font independent OCR generates results at the (sub-)character primitive level (components) for a given word image. These results are used by a document retrieval system which defines a ‘string edit’ based cost metric to align OCR results with an input query word. The second approach uses a script independent representation of word images to match the input query word with every indexed word image and retrieves the corresponding documents. The latter approach is recognition free but needs a query template consisting of word images of each query word to be searched. The following subsections describe these methodologies in detail.
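The idea of aligning component-level OCR output with a query word through a 'string edit' cost can be sketched as follows. This is only an illustration under stated assumptions: the cost metric actually used by the retrieval system is not specified in this section, so the sketch uses unit insertion, deletion and substitution costs, and treats a query component as matched at no cost if it appears among the OCR hypotheses for that position. The function name edit_cost and the component labels are hypothetical.

```python
def edit_cost(query, ocr_hypotheses):
    """Edit distance between a query and multi-hypothesis OCR output.

    query          : list of component labels for the typed query word
    ocr_hypotheses : list of sets; each set holds the candidate labels
                     the OCR returned for one component position
    Assumption: unit costs; substitution is free when the query label
    is among the OCR candidates for that position.
    """
    n, m = len(query), len(ocr_hypotheses)
    prev = list(range(m + 1))  # cost of aligning an empty query prefix
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if query[i - 1] in ocr_hypotheses[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,        # query component missing from OCR output
                curr[j - 1] + 1,    # extra component in OCR output
                prev[j - 1] + sub,  # match or substitution
            )
        prev = curr
    return prev[m]


# Hypothetical component labels; the top OCR choice for position 2 is "sa",
# but the correct label "ma" is still present as a second hypothesis.
query = ["ka", "ma", "la"]
word_image_result = [{"ka", "pha"}, {"sa", "ma"}, {"la"}]
print(edit_cost(query, word_image_result))
```

Keeping several hypotheses per position lets a word image match the query even when the top OCR choice is wrong, which is the motivation given for using multiple hypotheses in the recognition phase.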

3.1 Recognition-based Keyword Spotting

Fig 3 – System architecture for OCR based Keyword Spotting [10]

Our system (Fig. 3) consists of three phases: (a) Preprocessing phase, (b) Recognition phase and (c) Matching phase. In the preprocessing phase, the horizontal profile of the document image is used to segment the document into line images. The vertical profile of each line image is then used to extract individual word images. Since the document images are often of poor print quality, the extracted word images are noisy. Therefore, the word images are cleaned by performing a Block Adjacency Graph (BAG) based smoothing procedure before initiating the recognition phase [10]. A BAG is created by classifying runs as merging, splitting or continuing runs. BAGs can be constructed using horizontal runs (Hruns) as well as vertical runs. Hruns are classified by the number of Hruns present above (abv) and below (bel) a given Hrun. In splitting Hruns, abv 1 and in merging Hruns, abv > 1 and bel