A General Framework for Multi-Character Segmentation and Its Application in Recognizing Multilingual Asian Documents¹

Di Wen, Xiaoqing Ding
Dept. of Electronic Engineering, Tsinghua University; State Key Laboratory of Intelligent Technology and Systems, Beijing, 100084, P. R. China
Email: {wendi,dxq}@ocrserv.ee.tsinghua.edu.cn

ABSTRACT

In this paper we propose a general framework for character segmentation in complex multilingual documents, an endeavor to combine the traditionally separate segmentation and recognition processes into a cooperative system. The framework contains three basic steps: Dissection, Local Optimization and Global Optimization, which hierarchically fuse various properties of the segmentation hypotheses into a composite evaluation that decides the final recognition results. Experimental results show that this framework is general enough to be applied to a variety of documents. A sample system based on this framework that recognizes Chinese, Japanese and Korean documents is reported finally, together with its experimental performance.

Keywords: Optical Character Recognition, Character Segmentation

1. INTRODUCTION

The ultimate task of an Optical Character Recognition (OCR) system is to convert paper document images into electronic documents automatically. In recent years, although single-character classifier techniques have made great progress, achieving robust recognition rates (over 99%) on isolated character images, the overall recognition rate for complex documents still lags behind. The reason is obvious: in a real document image there are no isolated single-character patterns until we segment the multi-character image. Thus, in practical OCR applications, the key technical problem shifts from character recognition (recognizing a single character) to text recognition (recognizing a sequence of multiple characters), and the bottleneck of overall recognition performance is the character segmentation process [1]. Without a proper segmentation strategy, even a powerful recognition kernel does not guarantee a low error rate. Statistics show that over 50% of the overall errors in a typical OCR system are attributable to defects in segmentation algorithm design [1]. There are also segmentation-free techniques in OCR research [8]. Such techniques use an HMM to recognize single characters and a level-building algorithm to match a multi-character image against the single-character patterns. However, they are more suitable for connected scripts such as English and Arabic; their feasibility for Asian scripts and complex document contexts is still unclear. When designing a system to recognize Asian documents with varied typesetting and multilingual text, the segmentation-based technique is still competitive. In this paper we discuss three aspects of our effort to solve the character segmentation problem. The first is to enhance robustness in a multilingual context. Researchers have reported significant success in segmenting characters in a uniform context, for example a pure Latin context [3][4].
But these methods meet serious challenges in modern documents with multilingual context. In this paper we propose a two-step optimization scheme specifically to address this problem. The second is to design a single framework that handles three Asian languages simultaneously. Little work has been reported on the common features of system design for recognizing different kinds of languages. Since the three major Asian languages (Chinese, Japanese and Korean) are closely related and share many common typesetting features in documents (see figure 1), we attempt to build a single segmentation routine for recognizing these three

¹ Supported by the 863 Hi-tech Plan (project 2001AA114081) and the National Natural Science Foundation of China (project 60241005).

languages. The unified framework includes three steps: Dissection, Local Optimization and Global Optimization. This effort also prompts us to reflect on a general philosophy for the character segmentation problem.

Fig. 1. Typical samples of Asian documents

Finally, current achievements in OCR research reveal a new tendency to combine the design of the character recognition and segmentation processes to build a cooperative document recognition system. In this paper we introduce a composite evaluation method derived from the Bayesian MAP criterion to make a comprehensive decision on the recognition results. In the rest of this paper, we first give the problem definition for character segmentation in section 2. The three-step framework design is then detailed in section 3. Experimental results and conclusions are reported in sections 4 and 5. Terms used in this paper are as follows: we call an undetermined region of the image a segment; the term primitive refers to an inseparable image patch extracted in the Dissection process, which makes up the basic sequence for further analysis. For clarity of exposition, our discussion is limited to horizontal text only.

2. PROBLEM DEFINITION

In [2], Gary et al. proposed a communication theory model for document image analysis. In this model the generation of the document image is simulated in the following steps. First, a linguistic source λ picks up a message S. Second, the imager source F transforms S into an ideal binary glyph Q. Finally, Q is transformed by the degraded channel into the final image I, which is what we actually observe. The task of character segmentation can then be formulated as a decoding stage, which estimates the MAP message S* from the observed image I.

Based on this paradigm, we define a simpler model by combining the imager and the channel into a noisy imager. In our model, the task of character segmentation is to discern the implicit characters within the explicit noisy image. Therefore, both the message S and its geometric positions π in the image should be recognized. According to Bayesian decision theory, this leads to the maximization of the following posterior probability:

\[
(\pi^*, S^*) = \arg\max_{\pi, S} P(\pi, S \mid I, \lambda, F)
\propto \arg\max_{\pi, S} P(I, \pi, S \mid \lambda, F)
= \arg\max_{\pi, S} P(S \mid \lambda)\, P(\pi \mid S, F)\, P(I \mid \pi, S, F)
\tag{1}
\]

where λ is the linguistic source which generates character symbols and F is the noisy imager source which renders such symbols into glyph images with certain size, font, style, weight and blanks. Equation (1) gives a hierarchical factorization of the probability of multi-character recognition: the linguistic probability P(S | λ), the geometric probability P(π | S, F) and the recognition probability P(I | π, S, F). To design a competent character segmentation algorithm, we suggest that all three probabilities be evaluated to achieve an optimal decision. However, since our initial observations of the text image are 2-D pixels, how to traverse all segmentation hypotheses in the 2-D plane remains a problem. In this paper, we use a typical pre-segmentation scheme to simplify it. We group the foreground pixels of the initial binary image into basic primitives, O = O₁O₂…O_T, which make up our first-level observation sequence. A primitive may be a connected component or a piece of a broken connected component, extracted by a special dissection process (see figure 2 for an example). We assume that each primitive belongs to exactly one implied character. The character segmentation problem can therefore be abstracted as finding the optimal partition of the sequence O.
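The size of this partition space is easy to see: with T primitives there are 2^(T−1) possible partitions, one for each subset of the T−1 internal intervals. The following minimal Python sketch (the function name and tuple representation are ours, purely illustrative) enumerates every partition as a tuple of interval indices:

```python
from itertools import combinations

def all_partitions(n_primitives):
    """Enumerate every partition of a primitive sequence O_1..O_T as a
    tuple of interval indices (t_0 = 0, ..., t_L = n_primitives).
    Each adjacent pair (t_k, t_{k+1}) delimits one hypothesized character."""
    inner = range(1, n_primitives)      # the T-1 internal intervals
    for k in range(n_primitives):
        for cuts in combinations(inner, k):
            yield (0, *cuts, n_primitives)

# With 4 primitives there are 2^(4-1) = 8 candidate partitions.
parts = list(all_partitions(4))
assert len(parts) == 8
assert (0, 4) in parts            # the whole sequence as one character
assert (0, 1, 2, 3, 4) in parts   # every primitive its own character
```

The exponential count is exactly why exhaustive enumeration is replaced by the graph search of the next paragraph.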



Traversing all probable partitions of a 1-D sequence is much easier. The state transition graph (STG) model offers us a systematic way to do so. Assume that image I can be dissected into N primitives; then there are N+1 intervals {tᵢ | 0 ≤ i ≤ N} in the primitive sequence, so that a directed acyclic graph characterizes all segmentation hypotheses. As shown in figure 2, each probable partition π of I into L characters is characterized by an L-step path Ω_L = {t₀, t_{i₁}, …, t_{i_{L−1}}, t_N}. If we can further factorize the joint probability of the whole string in (1) into step-wise probabilities, we can use the traditional Viterbi scheme to search for the optimal path.

Fig. 2. State transition graph for multi-character segmentation

One such factorization can be deduced by adopting a one-level Markov assumption as follows:

\[
\begin{aligned}
P(S \mid \lambda)\, P(\pi \mid S, F)\, P(I \mid \pi, S, F)
&= P(c_1 c_2 \cdots c_L \mid \lambda)\cdot P(z_1 z_2 \cdots z_L \mid c_1 c_2 \cdots c_L, F)\cdot P(I_1 I_2 \cdots I_L \mid z_1 z_2 \cdots z_L, c_1 c_2 \cdots c_L, F)\\
&\approx \Big\{ P(c_1) \prod_{l=1}^{L-1} P(c_{l+1} \mid c_l) \Big\} \cdot \Big\{ \prod_{l=1}^{L} P(z_l \mid c_l) \Big\} \cdot \Big\{ \prod_{l=1}^{L} P(I_l \mid z_l, c_l) \Big\}\\
&= \big\{ P(c_1)\, P(z_1 \mid c_1)\, P(I_1 \mid z_1, c_1) \big\} \cdot \prod_{l=2}^{L} \big\{ P(c_l \mid c_{l-1})\, P(z_l \mid c_l)\, P(I_l \mid z_l, c_l) \big\}\\
&= \prod_{l=1}^{L} \eta_l
\end{aligned}
\tag{2}
\]

where

\[
\eta_l = P(c_l \mid c_{l-1})\, P(z_l \mid c_l)\, P(I_l \mid z_l, c_l), \qquad P(c_1 \mid c_0) := P(c_1)
\]

is the one-level accumulative probability along the segmentation path, the product of three partial probabilities: P(c_l | c_{l−1}) is the one-level linguistic transition probability, P(z_l | c_l) the one-level character geometric probability, and P(I_l | z_l, c_l) the one-level character recognition probability. Taking the logarithm of the probabilities we have:

\[
(\Omega^*, S^*) = \arg\max_{L}\, \arg\max_{\Omega_L, S_L} \log P(\Omega_L, S_L \mid I, \lambda, F)
= \arg\max_{L}\, \arg\max_{\Omega_L, S_L} \sum_{l=1}^{L} \log \eta_l
\tag{3}
\]
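Assuming the three one-level probabilities are supplied by the language model, the geometric model and the recognition kernel respectively, the per-step term log η_l in (3) is simply a sum of logarithms. A small illustrative sketch (names and values are ours, not from the paper):

```python
import math

def step_log_score(p_trans, p_geom, p_recog):
    """One-level log-probability log(eta_l) for a segmentation step,
    combining the linguistic transition probability P(c_l | c_{l-1}),
    the geometric probability P(z_l | c_l) and the recognition
    probability P(I_l | z_l, c_l), as defined in equation (2)."""
    return math.log(p_trans) + math.log(p_geom) + math.log(p_recog)

# Hypothetical probabilities for one step of a segmentation path:
s = step_log_score(0.2, 0.5, 0.9)
assert abs(s - math.log(0.2 * 0.5 * 0.9)) < 1e-12
```

Working in the log domain turns the product over the path into the sum maximized in (3), which is what makes the Viterbi recursion below applicable.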

We can now set up a general solution for multi-character segmentation, containing a dissection step that decomposes the initial image into primitives and a Viterbi optimization step. Figure 3 illustrates this general algorithm.

Dissection:
    Break up the image into primitives.
Optimization:
    COST[0..N], array recording backward cost
    PATH[0..N], array recording backward path
    BEGIN
        Initialization: COST[0] = 0; COST[i] = ∞ (1 ≤ i ≤ N).
        Loop: FOR i = 1 TO N DO
            COST[i] = min_{0 ≤ k ≤ i−1} (COST[k] + η_{k,i});
            PATH[i] = argmin_{0 ≤ k ≤ i−1} (COST[k] + η_{k,i});
        END FOR
    END
After the loop, COST[N] is the cost of the optimal path, and the back pointers PATH[N] → PATH[PATH[N]] → … → 0 indicate the optimal path.

Fig. 3. A general solution for the multi-character segmentation problem
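The optimization loop of figure 3 is a standard shortest-path dynamic program over the interval graph. Below is a runnable Python sketch of it; the toy cost function stands in for the real step cost η_{k,i}, which in the paper is derived from equation (2):

```python
def segment(n, cost_fn):
    """Dynamic-programming search of the state transition graph (Fig. 3).
    n is the number of primitives; cost_fn(k, i) returns the cost of
    treating primitives k+1..i as a single character (e.g. a negative
    log-probability). Returns (optimal total cost, interval indices of
    the best path)."""
    INF = float("inf")
    cost = [0.0] + [INF] * n      # COST[0..N]
    back = [0] * (n + 1)          # PATH back pointers
    for i in range(1, n + 1):
        for k in range(i):
            c = cost[k] + cost_fn(k, i)
            if c < cost[i]:
                cost[i], back[i] = c, k
    # Recover the optimal path by following back pointers from N to 0.
    path, i = [n], n
    while i > 0:
        i = back[i]
        path.append(i)
    return cost[n], path[::-1]

# Toy cost: two-primitive segments are cheap, everything else expensive.
total, path = segment(6, lambda k, i: 1.0 if i - k == 2 else 5.0)
assert path == [0, 2, 4, 6] and total == 3.0
```

Note that the double loop evaluates every (k, i) pair, which is exactly the N(N+1)/2 complexity criticized in the next paragraph.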

However, this general solution has several problems. First, most hypotheses in the STG can be excluded before recognition because of their unreasonable geometric properties. Second, the computational complexity of the algorithm is O(N(N+1)/2), so the running cost becomes significant when N is large. Third, the algorithm has no exception handling mechanism. In the following section, we refine this paradigm into a practical algorithm.

3. THE THREE-STEP MULTI-CHARACTER SEGMENTATION FRAMEWORK

3.1. Overview

In section 2 we converted the character segmentation problem into a general dynamic programming problem. However, in multilingual documents one-pass optimization is not enough to cope with complex context. For example, many errors concerning similar characters in different languages, or separable characters, cannot be resolved by purely quantitative evaluation; they need to be verified at a higher semantic level with special rules. We therefore separate the Optimization step into Local Optimization and Global Optimization. The Local Optimization step involves a group of local primitives and performs quantitative dynamic programming optimization, while the Global Optimization step investigates the results of Local Optimization with global semantic rules. Figure 4 illustrates the three-step framework.

We make further refinements to augment the robustness and efficiency of the algorithm. First, we introduce the concept of synchronized points (SPs): intervals that are confirmed as character boundaries with high certainty. As more SPs are discovered, we iteratively break up the global optimization chain into local sections. Second, we propose a predict and prove (PP) strategy, especially designed for segmenting the typical oriental characters in Asian documents. With this strategy we can effectively prune many paths in the optimization process. Third, feedback is introduced to cope with possible errors left undetected in each step.

Fig. 4. The three-step multi-character segmentation framework

3.2. Hierarchical features

According to equation (2), to evaluate the three partial probabilities of a segmentation hypothesis, we collect features from three levels: the pixel level, the geometric level and the semantic level. They are listed in table 1.

These features are obtained in different ways: the pixel-wise features are extracted directly from image pixels; the geometric features are quantized values of the initial pixel-wise features under a nonlinear mapping function; and the semantic properties are output by the recognition kernel. There are also statistical properties of neighboring segments, which are updated dynamically in the algorithm and are important for detecting multilingual characters.

| Feature level   | Column features                           | Single-segment features                                                                          | Statistical features                                                                             |
|-----------------|-------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Pixel level     | PJ (projection), SP (stroke penetration)  | w (width), h (height), SW (stroke width), B (blank), UB (upper blank), DB (lower blank)          | H (line height), Wmin (minimal oriental glyph width), Wmax (maximal oriental glyph width)        |
| Geometric level |                                           | r (quantized ratio), Q (quantized h and UB product)                                              | Rmin (minimal glyph ratio), Rmax (maximal glyph ratio)                                           |
| Semantic level  |                                           | Code (recognition code), R (recognition cost)                                                    |                                                                                                  |

Tab. 1. Categorized properties at different levels

3.3. The dissection step

In the Dissection step, we extract all connected components as initial primitives and then examine every primitive to detect possible touching components. A primitive whose width and height match the following rule is considered abnormal and needs further dissection:

    r ≥ Rmax + 2  &  w ≥ c1·H  &  h ≥ c2·H

where Rmax is the statistical maximum ratio of previously recognized oriental glyphs, H is the line height, and c1 = 1.25 and c2 = 0.5 are empirically chosen thresholds. How to break touching characters in further dissection is an interesting problem. Various techniques have been proposed to find cut positions for touching characters, including strategic methods [4], classifier methods [5][6] and even profile methods [7]. We suggest that touching cases be separated into two categories: simple touching cases (see fig. 5a) and complex touching cases (see fig. 5b). Complex cases cannot be separated by traditional projection analysis.
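The abnormal-primitive rule can be stated directly in code. This sketch uses the thresholds quoted above; the function name and argument layout are illustrative, not from the paper:

```python
def is_abnormal(w, h, r, r_max, line_h, c1=1.25, c2=0.5):
    """Abnormal-primitive test from the Dissection step: a primitive is a
    touching-character candidate when its quantized ratio r exceeds the
    running statistical maximum r_max by 2 and both its width and height
    are large relative to the line height. c1 and c2 are the empirically
    chosen thresholds reported in the paper."""
    return r >= r_max + 2 and w >= c1 * line_h and h >= c2 * line_h

# A wide, tall primitive in a 40-pixel line is flagged for further dissection;
# a narrow one of the same height is not.
assert is_abnormal(w=80, h=36, r=6, r_max=3, line_h=40)
assert not is_abnormal(w=30, h=36, r=6, r_max=3, line_h=40)
```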

Fig. 5. Different cases of touching characters

In this paper we propose a progressive two-pass scheme to cut touching character components that copes with both the simple and the complex touching cases. In the first pass, we use projection analysis to find the simple touching points. In the second pass, we zoom into the doubtful components, increase the threshold and repeat the projection analysis. If still no new candidate is found, we forcibly choose a column as the cut position.

3.4. Local optimization

Since many oriental glyphs in Asian documents occupy only one primitive, they can be easily identified by their typical sizes and shapes. After the Dissection step we therefore try to detect such primitives and exclude them from the further segmentation process. These primitives also act as the initial synchronized points that break the whole line into local sections.
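The progressive two-pass cutting scheme of the Dissection step can be sketched roughly as follows; the concrete thresholds and the forced-cut fallback here are simplifications of ours, not the paper's exact procedure:

```python
def projection_cuts(columns, threshold):
    """One pass of projection analysis: columns is the vertical projection
    profile (foreground pixel count per column); any column at or below
    the threshold is a candidate cut position."""
    return [x for x, v in enumerate(columns) if v <= threshold]

def two_pass_cut(columns):
    """Simplified two-pass scheme: first look for empty columns (simple
    touching case); if none exist, relax the threshold and repeat
    (complex case); finally fall back to forcing a cut at the thinnest
    column."""
    cuts = projection_cuts(columns, threshold=0)
    if not cuts:
        cuts = projection_cuts(columns, threshold=1)   # relaxed second pass
    if not cuts:
        cuts = [min(range(len(columns)), key=columns.__getitem__)]  # forced cut
    return cuts

# Simple touching case: an empty column separates the two glyphs.
assert two_pass_cut([4, 5, 0, 6, 3]) == [2]
# Complex case: no near-empty column; the thinnest column is cut forcibly.
assert two_pass_cut([4, 5, 2, 6, 3]) == [2]
```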

In the Local Optimization step, we propose two schemes to evaluate a segment comprehensively: a Predict and Prove (PP) scheme and a Composite Fusion scheme. The Predict and Prove scheme is a heuristic search scheme specifically designed for recognizing Asian documents. As shown in figure 6, it proceeds in four steps. First, the current context is estimated. Second, we try to find a reasonable interval for the next character based on the current context: if the preceding recognized characters are glyph characters, we tend to find the next segmentation place according to glyph geometric properties; otherwise, if the context is western characters, we tend to find a western character. Third, the predictive segment is recognized and evaluated with the composite function. If the segment proves creditable, we move forward directly and ignore all intervals before the predictive place. If the segment is not creditable enough, we recognize two alternative places and finally select the optimal one. If no acceptable predictive place can be found, we consider the previous predictive place erroneous and execute a one-step retrospect to trace an alternative path. Figure 6 illustrates the details of the Local Optimization step. In figure 6, the subroutine Predict(i, CONTEXT[i]) returns the predictive place from the current position i given CONTEXT[i], and CharType(S_{i,p}) returns the type of character S_{i,p} belongs to given its recognition result. The composite evaluation η_{i,p} = Evaluate(S_{i,p}) is calculated as follows:

\[
\eta_{i,p} = w_1\, g(S_{i,p}) + w_2\, h(S_{i,p})
\tag{4}
\]

where g(S_{i,p}) is the geometric cost of S_{i,p}, h(S_{i,p}) is the recognition cost, and w1, w2 are their coefficient weights. h(S_{i,p}) measures the distance between the unified segment image and the classified pattern; g(S_{i,p}) measures the geometric concordance of S_{i,p} with its recognition result. The Composite Fusion scheme thus gives a composite evaluation of a segmentation hypothesis concerning both its geometric and recognition scores.

Local Optimization:
    COST[0..N], array recording backward cost
    PATH[0..N], array recording backward path
    CONTEXT[0..N], array recording context
    BEGIN
        Initialization: COST[0] = 0; COST[i] = ∞ (1 ≤ i ≤ N).

        Loop: i = 0; bFOUND = FALSE; WHILE i
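The composite evaluation of equation (4) is a weighted sum of the two costs. A trivial sketch (the weight values here are placeholders, since the paper does not specify them):

```python
def evaluate(geom_cost, recog_cost, w1=0.5, w2=0.5):
    """Composite evaluation of a segment S_{i,p} as in equation (4):
    a weighted sum of its geometric cost g(S) and recognition cost h(S).
    The weights w1 and w2 are illustrative placeholders."""
    return w1 * geom_cost + w2 * recog_cost

# A segment with low geometric and recognition costs scores lower (better).
assert evaluate(0.2, 0.4) < evaluate(0.8, 0.9)
assert evaluate(0.2, 0.4) == 0.5 * 0.2 + 0.5 * 0.4
```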
