Zone-based Keyword Spotting in Bangla and Devanagari Documents
a Ayan Kumar Bhunia, b Partha Pratim Roy*, c Umapada Pal
a Dept. of ECE, Institute of Engineering & Management, Kolkata, India
b Dept. of CSE, Indian Institute of Technology Roorkee, India
c CVPR Unit, Indian Statistical Institute, Kolkata, India
b email: [email protected], TEL: +91-1332-284816
Abstract
In this paper we present a word spotting system for text lines in offline Indic scripts such as Bangla (Bengali) and Devanagari. Recently, it was shown that a zone-wise recognition method improves word recognition performance over conventional full-word recognition in Indic scripts [29]. Inspired by this idea, we adopt the zone segmentation approach and use middle-zone information to improve traditional word spotting performance. To avoid the problems of heuristic zone segmentation, we propose an HMM-based approach to segment the upper- and lower-zone components from text line images. Candidate keywords are searched in a line without segmenting characters or words. We also propose a novel feature that combines foreground and background information of text line images for keyword spotting with character filler models. Using both foreground and background information gives a significant improvement in performance over using either alone. The Pyramid Histogram of Oriented Gradient (PHOG) feature is used in our word spotting framework. Experiments show that the proposed zone-segmentation-based system outperforms traditional word spotting approaches.
Keywords- Word Spotting, Handwritten text recognition, Knowledge extraction, Hidden Markov Model.
1. Introduction
Handwritten text recognition is one of the challenging problems in pattern recognition. Because of the free-flowing nature of handwriting and its many writing variations, recognition performance is not satisfactory even with sophisticated pre-processing and OCR techniques. When processing such handwritten documents, word spotting [3] techniques are useful for locating possible instances of specific query words. Searching by word spotting does not require OCR of the entire document. Writing distortion does not cause much difficulty in retrieving similar target words, because these approaches do not recognize either the characters of the query word or the query word itself. Instead, features are extracted from the whole word, and the methods look for similar features in the target images. One drawback of these methods is that they require exact word segmentation prior to matching: if word segmentation is improper, retrieval by feature matching in the target image is not feasible. To tackle this problem, segmentation-free methods [5, 10, 11] have recently been proposed.
Word spotting has been studied extensively [2, 4, 5] for detecting a word in a handwritten document page (or line) given the user's query keyword [5] or a template image [10, 13]. This way of searching or browsing often sidesteps the problems of conventional recognition. Word spotting under the Query by Example (QBE) principle takes instances of the query word image for searching, whereas Query by Text (QBT) [15], which uses a learning-based approach for retrieval, has recently proved more effective. Owing to its success with techniques such as Hidden Markov Models and Neural Networks, word spotting has become popular for extracting information from historical documents, handwritten forms, etc.
Classical approaches proposed for Latin [5, 15] and Arabic [2] scripts decompose a text line into a sequence of vertical frames; features are extracted from each frame and fed to a decoding system that retrieves the character sequence. Nevertheless, one of the bottlenecks of such systems is feature extraction. In Indian scripts in particular (Bengali, Devanagari, etc.), combinations of vowels, modifiers, and characters lead to a huge number of character classes, for which recognition and spotting are still challenging [27]. The main difficulty with handwritten Bangla or Devanagari script is the free-flowing nature of handwriting and the immense variation in writing styles. In addition, these scripts have many character classes and special upper- and lower-zone characters (i.e. modifiers), which make them even more difficult to recognize than scripts such as English. In both Bangla and Devanagari scripts, there are approximately 50 basic characters, including vowels and consonants [29]. Consonants may join with other consonants or vowels to form clusters, so the number of characters in these two scripts is about 300. When upper- and lower-zone modifiers combine with consonants, they form a large set of combinations, each of which has to be treated as a separate character unit, because a plain sliding-window feature must capture the information in all zones to identify the modifier properly. Recently, it was shown that a zone-wise segmentation method improves word recognition performance over conventional full-word recognition in Indic scripts [29]. Inspired by this idea, we adopt the zone segmentation approach and use middle-zone information to improve traditional word spotting performance. The different zones in Bangla and Devanagari words are shown in Fig.1. Fig.2 illustrates the reduction in character classes achieved by zone segmentation for both Devanagari and Bangla characters.
Fig.1: Example showing presence of upper-middle-lower zone in (a) Bangla and (b) Devanagari scripts, respectively.
Fig.2: Example showing the reduction in character units using zone segmentation for the Devanagari character ‘क’ and the Bangla character ‘ক’ combined with different modifiers.
This paper presents a Query-by-Text word spotting system for segmented text lines using Hidden Markov Models. We propose a novel approach that combines foreground and background information of text images for keyword spotting with character filler models. Candidate keywords are searched in a line without segmenting characters or words. Using both foreground and background information gives a significant improvement in performance over using either alone. We apply HMM-based zone-wise segmentation to text lines to boost word spotting performance. The framework is applied to Indic scripts such as Bengali and Devanagari, along with Latin script, for evaluation. The proposed framework is shown graphically in Fig.3.
Fig.3: Proposed framework of word spotting in Indic script
1.1 Related Work
Handwritten word spotting is traditionally viewed as an image matching task between one or more query word images and a set of candidate word images in a database [3, 4]. Many works address word spotting in applications such as postal documents [16, 17, 18], bank checks [19], digital libraries [20], historical documents [21], etc. A survey of recent word spotting methodologies for handwritten documents can be found in [32]. Researchers have explored different features: corner features were used in [22], the gradient angular feature (GAF) in [23], etc. Query by example (QBE), or image template matching [10], was adopted in the early days of word spotting. Word spotting has also been considered for retrieving important information from poorly written old documents [3]. Several local features have been used to achieve better performance, some of which outperformed others in conjunction with Dynamic Time Warping (DTW). In [23], zones of interest (ZOI) of the document are extracted so that only the informative part is considered when retrieving the query keyword. In [10], Scale-Invariant Feature Transform (SIFT) [25] features are extracted from patches of similar size, and the feature space of the document is then transformed to a topic space using Latent Semantic Indexing. Patches with enough evidence under a cosine distance measure are then reported as spotted results.
The second approach to word spotting, query by text (QBT) or the learning-based approach [12, 13], has outperformed the older one and is used extensively in recent systems. Here, word-level segmentation is avoided in [2, 26] by applying supervised learning models such as HMM and BLSTM directly to the segmented text lines. HMMs are trained for each character, and the occurrence of a specific character sequence is determined from the probability scores returned for every test text line. A Recurrent Neural Network (RNN) with Bidirectional Long Short-Term Memory (BLSTM) hidden nodes and a Connectionist Temporal Classification (CTC) output layer has been explored for keyword spotting in [26]. Some works consider character-template-based spotting [13], whereas others perform spotting at the word level. Several works target applications of word spotting such as keyword finding in historical documents [4, 13], searching and browsing through digitized documents, etc. A script-independent word spotting method has also been proposed recently [2]. At each location of the window, a feature vector is extracted [15], and the resulting sequence of feature vectors is modelled with a semi-continuous HMM. This approach has been tested only on segmented words. There exist several page-level segmentation-free techniques that use scale-invariant features (i.e. SIFT) [10]. A training approach based on a pyramidal histogram of characters was proposed in [33] for improved word retrieval performance. Very few works address keyword spotting in Indic scripts.
A shape-code-based word-matching technique was used for Indic printed text lines [27]. Here, vertical-shape-based features, zonal information of extreme points, loop shape and position, crossing counts, and some background information are used to search for a query word. A segmentation-free method for spotting query word images in Bangla handwritten documents is presented in [28], using the Heat Kernel Signature (HKS) to represent the local characteristics of detected key points.
The contributions of this paper are threefold. First, we present a novel feature extraction method for word spotting that combines foreground and background information. We note that background information improves word spotting performance significantly. Next, a zone-wise segmentation approach is used to reduce the number of character classes in Indic scripts. Training the HMMs on middle-zone components improves performance. To boost performance further, a combination of zone information is used to reduce false positives. Finally, we propose an HMM-based zone alignment to improve on traditional projection-based zone segmentation. The word spotting framework has been tested on the Indic scripts Bangla and Devanagari. An in-depth experimental evaluation on a large dataset demonstrates the robustness of the proposed system. A preliminary experiment with the foreground and background combination was presented in [31]. This paper extends it and presents the details of word spotting in Indic scripts.
The rest of the paper is organized as follows. The word spotting framework is explained in detail in Section 2. In Section 3, we discuss the zone segmentation approach using HMM and then explain word spotting using middle-zone information. A combination of zone information is also presented in that section. We demonstrate the performance of the proposed framework in Section 4. Finally, conclusions and future work are presented in Section 5.
2. HMM-based Word Spotting Framework
The main goal of word spotting is to detect a specific keyword in a pool of document images. Our system is able to search for arbitrary words in text lines. For this purpose, the document image is first binarized with a global binarization method. Next, the binary document is segmented into individual text lines using a line segmentation algorithm [6]. For skew correction, we consider all the points at the extreme bottom of the text strokes and apply linear regression to these points to find the best-fitting line. The slope δ of this line represents the skew of the text, and a rotation by δ is then applied to correct it [29]. The slant angle is determined locally by dividing the text line image into segments; it is then estimated and corrected using the vertical projection histogram and the Wigner-Ville distribution [35]. Thus, the slope and slant of the text line are normalized to cope with different handwriting styles. Fig.4 provides a graphical description of the word spotting framework, in which the concatenated features are fed to the HMM. Word spotting is performed by text line scoring based on the filler and character models of the HMM. The word spotting system uses a novel feature extraction technique that concatenates foreground and background features. The details of each step are described below.
Fig.4: HMM-based word spotting framework.
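The skew-estimation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the "extreme bottom" points are the lowest ink pixel of each column, that the image is a list of rows whose nonzero entries are ink, and that row indices grow downward. The function name `estimate_skew_angle` is hypothetical.

```python
import math


def estimate_skew_angle(binary):
    """Estimate the skew angle (radians) of a binary text-line image.

    `binary` is a list of rows; nonzero entries are ink pixels.
    Assumption: the bottom points are the lowest ink pixel per column.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    xs, ys = [], []
    for x in range(width):
        # Scan the column from the bottom up and keep the first ink pixel.
        for y in range(height - 1, -1, -1):
            if binary[y][x]:
                xs.append(x)
                ys.append(y)
                break
    n = len(xs)
    if n < 2:
        return 0.0  # not enough points to fit a line
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope of the bottom points.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.atan(slope)
```

The returned angle could then be passed to any image-rotation routine to deskew the line; the sign convention depends on the image coordinate system in use.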
2.1. Feature extraction
A feature is a representation of an image that is more descriptive than the raw image. The PHOG feature has been found to give better results in Bangla handwritten script recognition [7]. PHOG [12] is a spatial shape descriptor that characterizes an image by its spatial layout and local shape, using the gradient orientations at each pyramid resolution level. To extract the feature from each sliding window, we divide the window into cells at several pyramid levels. The grid has $4^N$ individual cells at resolution level N (i.e. N = 0, 1, 2, ...). A histogram of the gradient orientation of each pixel is computed within each cell and quantized into L bins. Each bin corresponds to a particular octant of the angular space. Concatenating the feature vectors of all pyramid resolution levels gives a 168-dimensional feature vector ($8 \times \sum_{i=0}^{N} 4^i = 8 \times (1 + 4 + 16) = 168$), using 8 bins and limiting the level to N = 2 in our implementation.
To compute the background information, we take into account the morphology of the character sets of Bangla and Devanagari scripts. In these scripts, most characters have a horizontal line (Shirorekha) at the upper part. When two or more characters sit side by side to form a word, their horizontal lines touch and generate a long line called the head-line. Because of this touching, the characters in a word create large white regions (spaces) in Bangla or Devanagari scripts. These empty spaces are found using the water reservoir principle [11]. Each pair of joining characters produces a unique reservoir formation, and these reservoirs carry information about the combination of characters forming the word. Fig.5 shows the formation of bottom reservoirs for a Devanagari and a Bangla text line, respectively.
Fig.5: Water Reservoir formation in (a) Devanagari and (b) Bangla text line image and position of sliding window is marked in red color.
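To make the idea of bottom reservoirs concrete, the sketch below marks the background pixels that would hold water if the image were turned upside down and water were poured in. It is a simplified, profile-based approximation written for illustration only; the method cited as [11] is not reproduced here. It assumes that everything above the lowest ink pixel of a column is solid, so holes and overhangs are ignored. The function name `bottom_reservoir_mask` is hypothetical.

```python
def bottom_reservoir_mask(binary):
    """Approximate bottom reservoirs of a binary text image.

    `binary` is a list of rows; nonzero entries are ink pixels.
    Returns a mask of the same size with 1 at reservoir pixels.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0

    # Terrain height per column: number of rows from the top of the
    # image down to (and including) the lowest ink pixel, 0 if no ink.
    terrain = []
    for x in range(width):
        h = 0
        for y in range(height - 1, -1, -1):
            if binary[y][x]:
                h = y + 1
                break
        terrain.append(h)

    # Water level per column: the lower of the tallest wall on the left
    # and the tallest wall on the right (both including the column).
    left = [0] * width
    right = [0] * width
    running = 0
    for x in range(width):
        running = max(running, terrain[x])
        left[x] = running
    running = 0
    for x in range(width - 1, -1, -1):
        running = max(running, terrain[x])
        right[x] = running

    mask = [[0] * width for _ in range(height)]
    for x in range(width):
        level = min(left[x], right[x])
        # Background rows between the ink and the water level hold water.
        for y in range(terrain[x], level):
            mask[y][x] = 1
    return mask
```

For two vertical strokes hanging from a shared head-line, the mask covers the white region between the strokes, which is the kind of inter-character space the text describes.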
We compute the PHOG feature from the foreground as well as from the background regions formed by the reservoirs. These features are then concatenated to form the final feature of the text line image. An illustration of the feature extraction technique is given in Fig. 6.
Fig.6: Feature extraction method shown graphically. The features are extracted from the sliding window marked in red color.
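The PHOG computation and the foreground/background concatenation can be sketched as below. This is a minimal illustration rather than the authors' code. The source fixes only the 8 bins, the pyramid levels 0 to 2, and the resulting 168 dimensions; the central-difference gradients, the magnitude-weighted votes, and the absence of normalization are assumptions. The names `phog` and `window_feature` are hypothetical, and the background window is assumed to be a binary mask of reservoir pixels.

```python
import math


def phog(window, levels=2, bins=8):
    """Pyramid histogram of gradient orientations for one window.

    `window` is a list of rows of numbers (e.g. a binary image patch).
    Returns bins * sum(4**i for i in 0..levels) values (168 by default).
    """
    height = len(window)
    width = len(window[0])
    bin_of = [[0] * width for _ in range(height)]
    mag_of = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Central differences, clamped at the window border.
            gx = window[y][min(x + 1, width - 1)] - window[y][max(x - 1, 0)]
            gy = window[min(y + 1, height - 1)][x] - window[max(y - 1, 0)][x]
            angle = math.atan2(gy, gx) % (2 * math.pi)
            # Quantize the full circle into `bins` sectors (octants for 8).
            bin_of[y][x] = min(int(angle / (2 * math.pi) * bins), bins - 1)
            mag_of[y][x] = math.hypot(gx, gy)

    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # cells per side, so 4**level cells in total
        for cy in range(cells):
            for cx in range(cells):
                hist = [0.0] * bins
                for y in range(cy * height // cells, (cy + 1) * height // cells):
                    for x in range(cx * width // cells, (cx + 1) * width // cells):
                        hist[bin_of[y][x]] += mag_of[y][x]
                feature.extend(hist)
    return feature


def window_feature(foreground, background):
    """Concatenate the PHOG vectors of a foreground and a background window."""
    return phog(foreground) + phog(background)
```

With the default settings each call to `phog` returns 168 values, matching the dimension stated in the text, so the concatenated vector in this sketch has 336 values.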
2.2. HMM-based Text line scoring
In handwritten text recognition, Hidden Markov Models have been used extensively because they recognize touching and distorted characters efficiently, even without proper pre-processing [14]. HMMs are capable of modelling cursive handwritten text that is difficult to segment for recognition. The simplest model is the character HMM, which consists of J hidden states (S1, S2, ..., SJ) in a linear topology. For an observation sequence O, the i-th observation (Oi) is an n-dimensional feature vector x, modelled using a Gaussian Mixture Model (GMM) with probability $P_{S_j}(\mathbf{x})$, 1